|
sql
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
XML parsing error: An invalid character was found in text contentWe have a simple web application, classic ASP (VBScript) based, which uses XML-explicit stored procedures for most of its data transfer functions. The application has been running a-OK for 2 years+ on a Windows 2000- based web server and SQL Server 2000 database. The app has recently been moved over to a Windows 2003-based web server, still using a (different) SQL Server 2000 box as its DB server. Now, when we try to enter some text-based content into the application, it fails if that content contains special characters... £ (British Pound), < and > (angle brackets) are some examples. This never happened on the old server, so it's clearly something to do with the new server settings, but we don't know what. If I manually escape out the offending characters using, for example, " ", these will go into the database just fine, but this is not a viable solution of course. The error messge is: [Microsoft][ODBC SQL Server Driver][SQL Server]XML parsing error: An invalid character was found in text content. The problem is that, although we're aware of the type of problem here, we're not sure exactly what it is that we need to change, given that this worked just fine on the previous server setup. Can anyone point us in the right direction? Help much appreciated!! AT Hi,
I am having almost the same problem. We have a system of accepting XML into a stored proc which uses sp_xml_preparedocument to insert data into an SQLServer 2000 database. this has been working fine until some dodgy data came in on one extract (an unvalidated postcode from another system where the user obviously held down the shift key for the numbers and became 'TQ!£' instead if 'TQ13'). The XML file has UTF-8 encoding specified and the parameter is a text parameter and SQLServer complains with 'An invalid character was found in text content' error. We have tried various combinations to get this data in, but the current best solution is to use UTF-16 encoding and an ntext parameter which works but uses twice as much resources (as expected) so I am not keen to use this. I have yet to try ASCII encoding but I will try that next. Has anyone else experienced this? I can't believe that UTF-8 encoding with what are 'normal' characters are not legal for SQLServer XML. Has anyone else any other suggestions? Many thanks, Doug Show quote "Andrew Thelwell" wrote: > Hi, > > We have a simple web application, classic ASP (VBScript) based, which > uses XML-explicit stored procedures for most of its data transfer > functions. > > The application has been running a-OK for 2 years+ on a Windows 2000- > based web server and SQL Server 2000 database. > > The app has recently been moved over to a Windows 2003-based web > server, still using a (different) SQL Server 2000 box as its DB > server. > > Now, when we try to enter some text-based content into the > application, it fails if that content contains special characters... > £ (British Pound), < and > (angle brackets) are some examples. This > never happened on the old server, so it's clearly something to do with > the new server settings, but we don't know what. > > If I manually escape out the offending characters using, for example, > " ", these will go into the database just fine, but this is not a > viable solution of course. > > The error messge is: > > [Microsoft][ODBC SQL Server Driver][SQL Server]XML parsing error: An > invalid character was found in text content. > > The problem is that, although we're aware of the type of problem here, > we're not sure exactly what it is that we need to change, given that > this worked just fine on the previous server setup. > > Can anyone point us in the right direction? > > Help much appreciated!! > > AT > > As per my previous post in this thread, the issue you're finding with the £ symbol is always caused by the encoding in
the XML declaration not matching the actual encoding of the text. How is the XML text being passed to SQL Server? Do you have a sample that demonstrates the problem so that the issue can be easily reproduced? OK,
I've been readng a lot about encoding recently and now I am completely mystified! To summarise, as I understand it, UTF-8 should be able to encode ANY unicode character (from Wikipedia - UTF-8). My XML is encoded in UTF-8 and the file is stored as UTF-8. Why then when I try and run this through the code below does it complain about the pound sign and other 'special' characters? The £ appears according to the specification to be part of the Latin 1 set of Unicode characters. The collation of the database is Latin1_General_CI_AS and the language is 'British'. This is now driving me crazy as I cannot seem to find any discussion or documents on the Internet which explain why SQLServer 2000 only deals with a few of the UTF-8 characters. Here is some sample code which exhibits the problem: Database: The database has one table with a text column and a datetime column. There is a stored procedure which accepts the XML input and puts it in the table. CREATE PROCEDURE ImportXML ( @Xml as text ) AS DECLARE @idoc int --Create an internal representation of the XML document. EXEC sp_xml_preparedocument @idoc OUTPUT, @Xml --Get attributes from Xml INSERT INTO Test ( [DateTime], [Text] ) SELECT GetDate(), [Text] FROM OPENXML (@idoc, '/Root/Data',1) WITH ( [Text] varchar(255) ) --clear xml doc from memory EXEC sp_xml_removedocument @idoc GO ..Net code: This is essentially a form with a textbox called 'Xml' which the XML can put into and a button which does the call tot he database. if (Xml.Text != string.Empty) { // SqlConnection that will be used to execute the sql commands SqlConnection connection = null; try { try { connection = new SqlConnection(ConfigurationSettings.AppSettings["ConnectionInfo"]); } catch { MessageBox.Show("The connection with the database can´t be established", "Application Error", MessageBoxButtons.OK, MessageBoxIcon.Error); return; } // Prepare the parameters SqlParameter param = new SqlParameter("@Xml", SqlDbType.Text); param.Value = Xml.Text; // Call the stored procedure to insert XML data SqlHelper.ExecuteNonQuery(connection, CommandType.StoredProcedure, "ImportXML", param ); } catch(Exception ex) { string errMessage = ""; for( Exception tempException = ex; tempException != null ; tempException = tempException.InnerException ) { errMessage += tempException.Message + Environment.NewLine + Environment.NewLine; } MessageBox.Show( string.Format( "There are some problems while trying to use the Data Access Application block, please check the following error messages: {0}", errMessage ), "Application error", MessageBoxButtons.OK, MessageBoxIcon.Error ); } finally { if(connection != null) connection.Dispose(); } } XML Data: The data that I am passing will vary, but here is the offending test data. It has every character on the keyboard. <?xml version="1.0" encoding="utf-8" ?> <Root> <Data Text="`1234567890-=¬!"£$%^&*()_+¦€qwertyuiop[]QWERTYUIOP{}asdfghjkl;'#ASDFGHJKL:@~\zxcvbnm,./|ZXCVBNM<>?/*-7894561230." /> </Root> The result is that there are 4 characters which are getting rejected: '¬', '¦', '£', '€'. It also does not allow european characters with umlauts, etc. Why does this happen? If anyone has any suggestions I would appreciate hearing them as surely there should be a way to accept any UTF-8 character? I don't want to go to UTF-16 because it would mean doubling the size of the XML data files (already big) and increasing the resource usage for processing and passing the data around the system. Many thanks, Doug Show quote "Stefan Delmarco" wrote: > As per my previous post in this thread, the issue you're finding with the £ symbol is always caused by the encoding in > the XML declaration not matching the actual encoding of the text. How is the XML text being passed to SQL Server? Do you > have a sample that demonstrates the problem so that the issue can be easily reproduced? > > -- > Cheers, > Stefan Delmarco > > http://www.fotia.co.uk > > > > I'd suggest that you create a reproducer using a SQL Server 2000 script
alone. Take .NET out of the picture and see if it works. If that works, then the problem has something to do with .NET. John Hi Doug, thanks for the repro. I was able to reproduce your problem. It basically comes down to the fact that SQL Server
does not have a string data type that uses the UTF-8 encoding. You XML claims to be encoded in UTF-8 but the Xml is being passed using a TEXT data type. The SQL Server TEXT data type is encoded as Latin1 (my database's selected collation, yours may be different). This is where the mismatch is occurring. The XML encoding says UTF-8 but the actual encoding is Latin1. As a test try the following Xml instead (note that your sample XML need the & > and < characters to be escaped for your sample to be well-formed XML): <?xml version='1.0' encoding='iso-8859-1'?> <Root> <Data Text='`1234567890-=¬!""£$%^&*()_+¦?qwertyuiop[]QWERTYUIOP{}asdfghjkl;'#ASDFGHJKL:@~\zxcvbnm,./|ZXCVBNM<>?/*-7894561230.' /> </Root> Note the encoding specified in the XML declaration (iso-8859-1 or Latin1). This agrees with the encoding that TEXT / VARCHAR / CHAR use. If you re-run your tests (with the XML above) you should no longer get the invalid character errors. As a test, change the encoding in the XML declaration back to UTF-8 and it should fail as before. In terms of your options at this point, if you are dealing with your data using the .NET string data type then you have already converted the data to Unicode (strings in .NET are UTF-16) so you may as well change the data type of the ImportXML stored procedure to NTEXT and specify the encoding in the XML as UTF-16. Alternatively you can specify the encoding as ISO-8859-1 and use the TEXT data type. Unfortunately there is no way to specify the encoding as UTF-8 as SQL Server doesn't have a data type that can hold the string data in that encoding. Let me know if this solves your problem. Cheers, Stefan Hi Stefan,
Thanks for your explanation. Your suggestions are consistent with what I have found over the past week. ISO-8859-1 does work except for the Euro symbol (€) which becomes a question mark (?). It also copes with many other western european characters. I was hoping to find a way of using UTF-8 and TEXT datatype. The production of the XML is from other systems but we can specify the encoding to use. With UTF-16 yes everything works but the files become twice as big and the resources used to process the file are twice as much. It sounds like the solution for now is to use ISO-8859-1 which I was the decision I was coming to anyway; I just could not understand why UTF-8 would not work into SQLServer which you have now explained. I may have to trap and escape the Euro symbol so that it works properly, just in case it is used. My other option seems to be to get the XML produced using UTF-8 and change it to UTF16 before pumping to the database using NText datatype. A bit of a hack bodge me thinks. Many thanks for your help. Doug Show quote "Stefan Delmarco" wrote: > Hi Doug, thanks for the repro. I was able to reproduce your problem. It basically comes down to the fact that SQL Server > does not have a string data type that uses the UTF-8 encoding. You XML claims to be encoded in UTF-8 but the Xml is > being passed using a TEXT data type. The SQL Server TEXT data type is encoded as Latin1 (my database's selected > collation, yours may be different). This is where the mismatch is occurring. The XML encoding says UTF-8 but the actual > encoding is Latin1. As a test try the following Xml instead (note that your sample XML need the & > and < characters to > be escaped for your sample to be well-formed XML): > > <?xml version='1.0' encoding='iso-8859-1'?> > <Root> > <Data > Text='`1234567890-=¬!""£$%^&*()_+¦?qwertyuiop[]QWERTYUIOP{}asdfghjkl;'#ASDFGHJKL:@~\zxcvbnm,./|ZXCVBNM<>?/*-7894561230.' > /> > </Root> > > Note the encoding specified in the XML declaration (iso-8859-1 or Latin1). This agrees with the encoding that TEXT / > VARCHAR / CHAR use. If you re-run your tests (with the XML above) you should no longer get the invalid character errors. > As a test, change the encoding in the XML declaration back to UTF-8 and it should fail as before. > > In terms of your options at this point, if you are dealing with your data using the .NET string data type then you have > already converted the data to Unicode (strings in .NET are UTF-16) so you may as well change the data type of the > ImportXML stored procedure to NTEXT and specify the encoding in the XML as UTF-16. Alternatively you can specify the > encoding as ISO-8859-1 and use the TEXT data type. Unfortunately there is no way to specify the encoding as UTF-8 as SQL > Server doesn't have a data type that can hold the string data in that encoding. > > Let me know if this solves your problem. > > Cheers, > Stefan > -- > Cheers, > Stefan Delmarco > > http://www.fotia.co.uk > > > >
Show quote
"Doug Chandler" <DougChand***@discussions.microsoft.com> wrote in message What data type is Xml.Text ?news:9F2519B4-E24B-4E7B-B15B-6352F06F298E@microsoft.com... > Hi Stefan, > > Thanks for your explanation. Your suggestions are consistent with what I > have found over the past week. ISO-8859-1 does work except for the Euro > symbol (?) which becomes a question mark (?). It also copes with many other > western european characters. I was hoping to find a way of using UTF-8 and > TEXT datatype. The production of the XML is from other systems but we can > specify the encoding to use. With UTF-16 yes everything works but the files > become twice as big and the resources used to process the file are twice as > much. > > It sounds like the solution for now is to use ISO-8859-1 which I was the > decision I was coming to anyway; I just could not understand why UTF-8 would > not work into SQLServer which you have now explained. I may have to trap and > escape the Euro symbol so that it works properly, just in case it is used. > > My other option seems to be to get the XML produced using UTF-8 and change > it to UTF16 before pumping to the database using NText datatype. A bit of a > hack bodge me thinks. > > Many thanks for your help. > It will be a string. What encoding does .NET use to store strings? Unicode It doesn't matter what encoding the XML declare specifies in the document as long as it is consistent with the actual encoding of the document contents the unicode string result of Xml.Text will be correct. As has been pointed out your SP should use NText for the @XML parameter (regardless of the data types of the destination fields). Your destination fields should use a colation (a poor word MS chose to use there) that can accept all the expected characters. All should work fine assuming the XML generation is conforming to the rules. My point is you are barking up the wrong tree if you think that changing the encoding of the incoming XML from UTF-8 to UTF-16 is going to solve anything. You should also note that strictly speaking ? does not exist in the ISO-8859-1 charset although in many cases it may appear to work since tools such IE will render char 128 from a source claiming to be ISO-8859-1 as ?. 128 is the euro symbol in the Windows-1252 charset which is compatible with ISO-8859-1 except for characters 128-149 where ISO has a set of almost never used control codes that windows has replaced with a set of more useful characters. Stefan explained it well why you get the error... note that in SQL Server
2005, you can pass UTF-8 to the XML datatype via either DBTYPE XML or varbinary. Best regards Michael Show quote "Anthony Jones" <A**@yadayadayada.com> wrote in message news:%23chcWKWYHHA.2556@TK2MSFTNGP02.phx.gbl... > > "Doug Chandler" <DougChand***@discussions.microsoft.com> wrote in message > news:9F2519B4-E24B-4E7B-B15B-6352F06F298E@microsoft.com... >> Hi Stefan, >> >> Thanks for your explanation. Your suggestions are consistent with what I >> have found over the past week. ISO-8859-1 does work except for the Euro >> symbol (?) which becomes a question mark (?). It also copes with many > other >> western european characters. I was hoping to find a way of using UTF-8 >> and >> TEXT datatype. The production of the XML is from other systems but we can >> specify the encoding to use. With UTF-16 yes everything works but the > files >> become twice as big and the resources used to process the file are twice > as >> much. >> >> It sounds like the solution for now is to use ISO-8859-1 which I was the >> decision I was coming to anyway; I just could not understand why UTF-8 > would >> not work into SQLServer which you have now explained. I may have to trap > and >> escape the Euro symbol so that it works properly, just in case it is >> used. >> >> My other option seems to be to get the XML produced using UTF-8 and >> change >> it to UTF16 before pumping to the database using NText datatype. A bit of > a >> hack bodge me thinks. >> >> Many thanks for your help. >> > > What data type is Xml.Text ? > > It will be a string. > > What encoding does .NET use to store strings? > > Unicode > > It doesn't matter what encoding the XML declare specifies in the document > as > long as it is consistent with the actual encoding of the document contents > the unicode string result of Xml.Text will be correct. > > As has been pointed out your SP should use NText for the @XML parameter > (regardless of the data types of the destination fields). Your > destination > fields should use a colation (a poor word MS chose to use there) that can > accept all the expected characters. > > All should work fine assuming the XML generation is conforming to the > rules. > > My point is you are barking up the wrong tree if you think that changing > the > encoding of the incoming XML from UTF-8 to UTF-16 is going to solve > anything. > > > You should also note that strictly speaking ? does not exist in the > ISO-8859-1 charset although in many cases it may appear to work since > tools > such IE will render char 128 from a source claiming to be ISO-8859-1 as ?. > 128 is the euro symbol in the Windows-1252 charset which is compatible > with > ISO-8859-1 except for characters 128-149 where ISO has a set of almost > never > used control codes that windows has replaced with a set of more useful > characters. > > > > |
|||||||||||||||||||||||