Wednesday 9 March 2011

XML Special Character Gotchas

You know you've got past the beginner tutorials and are doing the real thing with XML when you start to hit encoding and special character issues. Here's what I mean. You have some code which takes some XML, perhaps produced by a web page, and passes it to an XML parser. Now, like JSON and HTML, XML requires that certain 'special characters' be 'escaped' (replaced with sequences like '&amp;', which is the escape sequence for an ampersand character). So if your XML contains any of these characters unescaped, they have to be replaced or the XML is not well-formed.

This introduces a catch-22 gotcha for XML parsers (at least it does for the ubiquitous one I use, and probably does for others too): you might need to parse the XML as a first step in replacing the special characters (otherwise how can you tell whether a character is in element or attribute content rather than in a comment, a namespace string, or an element or attribute name, say?), yet trying to parse XML containing these characters can cause the parser to throw an error. That, apparently, is because the XML specs give the impression that this is the correct behaviour for an XML parser (though I'm told on authority that this isn't strictly what the specs intended).
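To make the escaping requirement concrete, here is a minimal sketch. It assumes the .NET method System.Security.SecurityElement.Escape as one way to escape the predefined characters, and IsWellFormed is a hypothetical helper of mine that only reports whether XmlDocument.LoadXml accepts a string.

```csharp
using System;
using System.Security;
using System.Xml;

public static class EscapeDemo
{
    // Hypothetical helper: true if the .NET parser accepts the string.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string value = "Fish & Chips";

        // The raw ampersand makes the document non-well-formed.
        string raw = "<item>" + value + "</item>";

        // Escaping the value before wrapping it in tags avoids the problem.
        string escaped = "<item>" + SecurityElement.Escape(value) + "</item>";

        Console.WriteLine(IsWellFormed(raw));      // False
        Console.WriteLine(IsWellFormed(escaped));  // True
        Console.WriteLine(escaped);                // <item>Fish &amp; Chips</item>
    }
}
```

Note that the escaping here is applied to the value before it is wrapped in markup, which is the easy case. The rest of this post is about the harder case where the markup and the unescaped data have already been mixed together.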

The way I had to solve this with my C# code and the .NET XML parser was to try parsing the XML, catch any XmlException errors, and then do some things with the exception and the input which aren't exactly advisable (since you don't know what state the parser will be in after the exception) but seem to be unavoidable for an ordinary developer like myself. I have to locate the offending bit of data using the line number and line position reported by the exception. Then I do some intelligent replacing of the special character, while avoiding replacing special characters which are part of the escape sequence of an already replaced special character! Phew! I just wish the parser would do all this for me, but there are so few XML parsers available to me in my development environment (two, I think) and neither handles this scenario the way I'd like, it seems.
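As a small illustration of the error information involved, the sketch below catches the XmlException and prints its LineNumber and LinePosition properties. TryLoad is a hypothetical helper name, and the sample XML is made up. I have not stated the exact position the parser reports, because that depends on the parser and is worth checking for yourself.

```csharp
using System;
using System.Xml;

public static class ErrorPositionDemo
{
    // Hypothetical helper: returns the parse error, or null if the XML loads cleanly.
    public static XmlException TryLoad(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            return ex;
        }
    }

    public static void Main()
    {
        // The unescaped ampersand is on the second line of the input.
        string xml = "<rows>\n  <row>Fish & Chips</row>\n</rows>";

        XmlException ex = TryLoad(xml);
        if (ex != null)
        {
            // LineNumber and LinePosition are both 1-based.
            Console.WriteLine("Line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```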

Here's a rendition of a general function to do all this in C# with .NET 2 or 3. I'm not necessarily proficient enough to say anyone else should use this code: it probably needs better error handling and optimization, and it makes some assumptions which might not always be safe. Still, it shows the kind of issues a developer may face when parsing XML. Actually, I couldn't find much code anywhere on the Internet to handle this issue. In fact I found very little written about it at all, apart from an aged link which helped as a starting point: http://support.microsoft.com/kb/316063


// Requires: using System; using System.IO; using System.Xml;
public static string EscapeXmlSpecialCharacters(string xmlString)
{
    XmlDocument doc = new XmlDocument();
    try
    {
        // If the XML loads cleanly there is nothing to fix.
        doc.LoadXml(xmlString);
        return xmlString;
    }
    catch (XmlException ex)
    {
        StringReader reader = new StringReader(xmlString);
        StringWriter writer = new StringWriter();

        // Copy the lines before the offending line unchanged.
        for (int i = 0; i < ex.LineNumber - 1; i++)
        {
            writer.WriteLine(reader.ReadLine());
        }

        string line = reader.ReadLine();
        // Assumption: the reported (1-based) position is one past the offending character.
        int index = ex.LinePosition - 2;
        if (line == null || index < 0 || index >= line.Length)
        {
            throw; // cannot locate the offending character, so give up
        }

        string rest = line.Substring(index + 1);
        string replacement = null;
        switch (line[index])
        {
            case '<':
                replacement = "&lt;";
                break;
            case '&':
                // ensure we are not replacing the ampersand in an already escaped
                // special character (&lt;, &gt;, &apos;, &quot; or &amp;)
                bool alreadyEscaped =
                    rest.StartsWith("lt;", StringComparison.Ordinal) ||
                    rest.StartsWith("gt;", StringComparison.Ordinal) ||
                    rest.StartsWith("amp;", StringComparison.Ordinal) ||
                    rest.StartsWith("apos;", StringComparison.Ordinal) ||
                    rest.StartsWith("quot;", StringComparison.Ordinal);
                if (!alreadyEscaped)
                {
                    replacement = "&amp;";
                }
                break;
        }

        if (replacement == null)
        {
            throw; // not an error this function can repair; rethrow rather than recurse forever
        }

        writer.WriteLine(line.Substring(0, index) + replacement + rest);
        writer.Write(reader.ReadToEnd());

        // Fix one character per pass, then re-parse to find the next one.
        return EscapeXmlSpecialCharacters(writer.ToString());
    }
}

Then you can put this in as a step between receiving some XML from, say, a web control like a grid, and sending that XML as a string to an XML parser like XmlReader so that you can read the string into a .NET dataset, say. No idea what the equivalent issues and solution are like in Java, sorry. If you write your own XML parser, though, this is one of the issues you will have to bear in mind and handle, along with similar XML-related issues like handling illegal characters and characters which are not encoded with the encoding declared in the XML declaration (usually UTF-8). Targeting such a parser at supporting just MicroXML, say (see earlier blogs on MicroXML and MicroXSD), might help to keep these issues manageable for people writing their own parsers; we'll see, perhaps.
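For the last part of that step, reading the string into a dataset, a minimal sketch might look like the following. LoadDataSet is a hypothetical helper name, and the hard-coded string stands in for XML that has already been repaired (or was well-formed to begin with); the element names are made up for illustration.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

public static class DataSetDemo
{
    // Hypothetical helper: reads a well-formed XML string into a DataSet via XmlReader.
    public static DataSet LoadDataSet(string wellFormedXml)
    {
        DataSet ds = new DataSet();
        using (XmlReader reader = XmlReader.Create(new StringReader(wellFormedXml)))
        {
            // With no schema supplied, DataSet infers tables and columns from the XML.
            ds.ReadXml(reader);
        }
        return ds;
    }

    public static void Main()
    {
        // Stand-in for the output of the repair step.
        string repaired = "<rows><row><name>Fish &amp; Chips</name></row></rows>";

        DataSet ds = LoadDataSet(repaired);

        // The parser turns the escape sequence back into a plain ampersand.
        Console.WriteLine(ds.Tables[0].Rows[0]["name"]);
    }
}
```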

3 comments:

  1. As discussed on xml-dev, this appears to be trying to solve the wrong problem, too late.

    Rather than generate non-XML and then try to fix it up in a catch block after an XML parser rejects it, it is almost always better to generate well-formed XML in the first place.

    The code shown here doesn't show how the "xmlstring" gets generated, or how it came to be non-well-formed; the fix should be in the code that produced it, not in the code that's consuming it.

  2. Also, the posted code doesn't quote >, so input of ]]> would produce non-well-formed XML; and if the input uses numeric character references, or a DTD and entity references other than the predefined XML ones, it will break the input by quoting the &.

    If you have enough control over the input to know that it won't be using numeric character references, then the input shouldn't be non-well-formed, and the attempted corrections won't be needed.

  3. Thank you for your comment, David. I think there are some fundamental differences between the way an ordinary (not necessarily novice) web developer uses XML internally in web apps such as this one and the way XML experts think about XML and discuss it on lists like XML-Dev. To the web developers I work with, XML is just a way to manage/wrap text/data. It is a bit like using CSV and including a heading row to label the columns. Personally I see it the same way, even though I've worked in both circles. I want the XML to fit neatly around some rows/columns of data coming from a grid in an 'ASP.NET' web form and do no more than that. I don't care whether it is valid, well-formed (except to get processors to handle it), conformant to anything, etc. It is my servant to wrap my users' data, no more. If it cannot do that, it is not fit for purpose and I'll do without it. As it happened, in this case it did well enough to keep it in the app, but novice developers might find it so frustrating to use that we end up taking the XML out of the app in future. That might be sad for an XML enthusiast but it is no problem for our customers. I wish it would work better and that parsers weren't so difficult for developers to use (overly fussy about things I don't much care about), because I want it to work and stay in the app, and because I like XML and find it facilitating. Others wouldn't much care about that and would easily give up using it.
