Wednesday, 9 March 2011

XML Special Character Gotchas

You know you've got past the beginner tutorial and you're doing the real thing with XML when you start to get encoding and special character issues. Here's what I mean. You have some code which takes some XML, perhaps produced by a web page, and passes it to an XML parser. Now, just like other formats such as JSON and HTML, XML requires that certain 'special characters' be 'escaped' (replaced with sequences like '&amp;', which is the escape sequence for an ampersand character). So if your XML contains some of these characters then they have to be replaced or the XML is not well-formed. This introduces a catch-22 for XML parsers (at least it does for the ubiquitous one I use and probably does for others too): you might need to parse the XML as a first step in replacing the special characters (otherwise how can you tell whether the character is in element or attribute content rather than in a comment, namespace string or element or attribute name, say?), yet trying to parse XML containing these characters might cause the parser to throw an error. That, apparently, is because the XML specifications seem to give the impression that this is the correct behavior for an XML parser (though I'm told on authority that this isn't strictly what the specs intended).
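To make the failure concrete, here is a minimal sketch in Python (not the C# and .NET parser this post is about) using the standard library's xml.etree.ElementTree. The helper name try_parse and the sample strings are made up for illustration. It only shows that a bare ampersand makes the parser give up and report a line and column.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Return (True, None) if xml_text parses, else (False, (line, column))."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # err.position is the (line, column) pair reported by the parser.
        return False, err.position


escaped = "<note>fish &amp; chips</note>"
unescaped = "<note>fish & chips</note>"

print(try_parse(escaped))
print(try_parse(unescaped))
```

The exact column reported depends on the parser, which is one reason any repair based on the reported position has to be treated as a heuristic.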

The way I had to solve this with my C# code and the .NET XML parser was to try parsing the XML, catch any XmlException errors, and then do some things with the exception message and parsed data which aren't exactly advisable (since you don't know what state the parser data will be in after the exception) but seem to be unavoidable for an ordinary developer like myself. I have to get the offending bit of data using the line number and line position of the error from the exception. Then I do some intelligent replacing of a special character while avoiding replacing special characters which are part of the escape sequence of an already replaced special character! Phew! I just wish the parser would do all this for me, but there are so few XML parsers available to me in my development environment (two, I think) and neither handles this scenario the way I'd like, it seems.

Here's a rendition of a general function to do all this in C# with .NET 2 or 3. I'm not necessarily proficient enough to say anyone else should use this code - it probably needs some better error handling and optimization, besides the fact it makes some assumptions which might not always be safe. It does show the kind of issues a developer may face when parsing XML. Actually, I couldn't find much code anywhere on the Internet to handle this issue. In fact I found very little written about this issue at all, apart from an aged link which helped as a starting point: http://support.microsoft.com/kb/316063


using System;
using System.IO;
using System.Xml;

public static string EscapeXmlSpecialCharacters(string XmlString)
{
    // Try to load the XML. If it parses, there is nothing to repair.
    XmlDocument doc = new XmlDocument();
    try
    {
        doc.LoadXml(XmlString);
        return XmlString;
    }
    catch (XmlException ex)
    {
        string output;
        using (StringReader reader = new StringReader(XmlString))
        using (StringWriter writer = new StringWriter())
        {
            // Copy the lines before the one the parser complained about.
            for (int i = 1; i < ex.LineNumber; i++)
            {
                writer.WriteLine(reader.ReadLine());
            }

            // Assumption: the reported position is 1-based and points just
            // past the offending character, hence the "- 2".
            string line = reader.ReadLine();
            int index = ex.LinePosition - 2;
            if (line == null || index < 0 || index >= line.Length)
            {
                throw; // not an error this function knows how to repair
            }

            string before = line.Substring(0, index);
            string after = line.Substring(index + 1);
            bool repaired = false;
            switch (line[index])
            {
                case '<':
                    line = before + "&lt;" + after;
                    repaired = true;
                    break;
                case '&':
                    // Ensure we are not replacing the ampersand in an already escaped
                    // special character (&lt;, &gt;, &apos;, &quot; or &amp;).
                    bool alreadyEscaped = false;
                    foreach (string name in new string[] { "lt;", "gt;", "amp;", "apos;", "quot;" })
                    {
                        if (after.StartsWith(name, StringComparison.Ordinal))
                        {
                            alreadyEscaped = true;
                        }
                    }
                    if (!alreadyEscaped)
                    {
                        line = before + "&amp;" + after;
                        repaired = true;
                    }
                    break;
            }
            if (!repaired)
            {
                throw; // rethrow rather than recurse forever on the same error
            }

            // Write the repaired line followed by the untouched remainder.
            writer.WriteLine(line);
            writer.Write(reader.ReadToEnd());
            output = writer.ToString();
        }
        // One character is repaired per pass; repeat until the XML parses.
        return EscapeXmlSpecialCharacters(output);
    }
}

Then you can put in a step between receiving some XML from, say, a web control like a grid and sending that XML as a string to an XML parser like XmlReader so that you can read the string into a .NET dataset, say. No idea what the equivalent issues and solution are like in Java, sorry. If you write your own XML parser though this is one of the issues you will have to bear in mind and handle, along with similar XML-related issues like handling the illegal characters and characters which are not encoded with the encoding declared in the XML declaration (usually UTF-8). Targeting such a parser at supporting just MicroXML, say (see earlier blogs on MicroXML and MicroXSD) might help to keep these issues to something manageable for people writing their own parsers; we'll see perhaps.
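For the ampersand half of the problem, a different technique from the parse-and-repair loop above is a regular expression with a negative lookahead. The sketch below is in Python rather than C#, the names BARE_AMPERSAND and escape_bare_ampersands are invented for illustration, and it deliberately does nothing about stray less-than signs. It also cannot tell content from comments or CDATA sections, so it is only safe on input known not to contain those.

```python
import re

# A bare ampersand is one that does not start a predefined entity
# (&lt; &gt; &amp; &apos; &quot;) or a numeric character reference.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:lt|gt|amp|apos|quot);|#[0-9]+;|#x[0-9A-Fa-f]+;)"
)


def escape_bare_ampersands(text):
    """Replace every bare '&' with '&amp;', leaving existing escapes alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


once = escape_bare_ampersands("<a>fish & chips &amp; peas &#39;n&#x27; &more</a>")
twice = escape_bare_ampersands(once)
print(once)
```

Because already escaped sequences are skipped, running the function a second time leaves the string unchanged, which avoids the double-escaping trap.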

Tuesday, 8 March 2011

MicroXML

Using MicroXML


Some tend to think XML is rather complex and mysterious yet all it is, fundamentally, is a set of building blocks for text: element tags, element content, attribute names, attribute values and the structure (the way the building blocks are arranged). Let’s look at these in turn, while limiting our focus to the simplification of XML which is called MicroXML. In the process we also need to address some special concerns, such as the consideration that has to be given to certain characters and how to denote a comment as being distinct from the XML content. [MicroXML is a subset profile of XML developed by James Clark and published in a blog post by James at http://blog.jclark.com/2010/12/more-on-microxml.html .]

Element Tags


The simplest and most obvious part of XML is the tag. Tags surround text, sometimes surrounding other tags and sometimes including within the tag other structures called attributes. Tags are sometimes in pairs and sometimes alone. When they are in pairs then there is a leading tag or ‘start tag’ and a trailing tag or ‘end tag’ and usually some text between them. If there is a start tag and end tag without any text between them then it can also be written in an alternative, equivalent form called an empty tag which is the one time when the tag has no pair.

In computer programming, the simplest kinds of examples are often called ‘Hello World’ examples because a very famous book once introduced readers to the early programming language called ‘C’ with an example that output a simple string of characters saying just ‘Hello World’. Using MicroXML we might have an equivalent simple example like this:

 <Hello>World</Hello>

This is called an element. This element has a start tag <Hello>, an end tag </Hello> and between the tags some text reading ‘World’. Simple! Start tags have an opening angle bracket or ‘less-than’ sign < followed by the name given to the element followed by the closing bracket or ‘greater-than’ sign >. There is no white space unless the start tag contains one or more attributes, as we’ll see later. White space is that set of characters in text which do not have any visible shape, like spaces, tabs, new-lines, etc. The name of the element (in this case ‘Hello’) is not allowed to contain any white space. Neither is an element name allowed to start with a digit, but it can contain any number of digits after the first character. A formal definition of MicroXML states that the first character of a name (element or attribute name) cannot be a digit but can be an alpha character or underscore (or one of some more obscure characters which are listed by their character codes; see the formal definition of MicroXML at http://blog.jclark.com/2010/12/more-on-microxml.html ). Name characters after the first can include the underscore, digits, alpha characters and the two punctuation characters point (full stop) ‘.’ and hyphen (dash) ‘-’.
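The naming rule can be sketched as a small check in Python. This is a simplification that assumes ASCII-only names: it ignores the more obscure characters the formal definition also allows, and the names SIMPLE_NAME and is_simple_name are made up for this example.

```python
import re

# First character: a letter or underscore.
# Later characters: letters, digits, underscore, point or hyphen.
SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_simple_name(name):
    """Return True if name fits the ASCII-only approximation of a name."""
    return SIMPLE_NAME.fullmatch(name) is not None


for candidate in ["Hello", "_a.b-1", "2Hello", "Hel lo", ""]:
    print(repr(candidate), is_simple_name(candidate))
```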

End tags start with an opening angle bracket or less-than sign ‘<’ followed by a forward slash ‘/’ then the same element name as the start tag used and finally the closing angle bracket or greater than sign ‘>’.  So our example element:

 <Hello>World</Hello>

is composed of:
Start tag: <Hello>
Content:  World
End tag: </Hello>

Then there is that special case of element tags which stand alone without being in a pair. These tags are an alternative way to show a special kind of element; one that does not have any content. The ‘Hello’ in our example above, if the text ‘World’ were removed, could be written like this:

 <Hello></Hello>

but it could also be written like this with just one special tag:

 <Hello/>
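That the two forms are equivalent can be checked with any XML parser. Here is a small sketch using Python's standard xml.etree.ElementTree (the variable names are arbitrary): both spellings produce an element with the same name, no text and no children.

```python
import xml.etree.ElementTree as ET

paired = ET.fromstring("<Hello></Hello>")
standalone = ET.fromstring("<Hello/>")

# Collect what the parser saw for each form: name, text, number of children.
paired_view = (paired.tag, paired.text, len(paired))
standalone_view = (standalone.tag, standalone.text, len(standalone))

print(paired_view == standalone_view)
```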

Elements can contain another kind of structure called an attribute. Attributes when added to an empty element can look like this:

 <Hello from="Mars" to="World"/>

The special tag for the empty element starts with the opening angle bracket, then comes the element name and then there can sometimes be white space, such as when the element has an attribute, but whether or not there is any white space it always closes with a forward slash immediately followed by a closing angle bracket or greater-than sign. (There are some times when white space follows the element name even when there are no attributes but the reason is not important at this stage except to remember it is possible.)

Note that when there are one or more attributes added to an element the attributes and their values (they always have values in MicroXML) always sit between the element name and the closing angle bracket of the start tag or, when the tag is the standalone empty element tag, between the element name and the forward slash, i.e. like the example above, or like this:

 <Hello from="Mars">World</Hello>

Also, the quotes around the attribute value can be a pair of double quotes or a pair of single quotes.

Element Content and Mixed Content

  
We’ve seen that an element can contain text but it can also contain other elements, or both. When it contains text alone then the text is simply everything between the start and end tag. In the example above that means just ‘World’. When it contains other elements then the other elements sit between the start and end tags but there may or may not be white space between the end of the first element’s start tag and the next element’s start tag. When the element contains both elements and text then it is called mixed content and the text and other elements along with any white space still sit between the first element’s start and end tags. So you can have content like this:

 <Hello><From>Mars</From><To>World</To></Hello>

or with white space like this:

 <Hello> <From>Mars</From> <To>
World</To>
</Hello>

or with mixed content like this:

 <Hello>From Mars <To>World</To></Hello>

Both white space and mixed content together makes the XML look less recognizable as XML, but it is still allowed, like this:

 <Hello>From
Mars
<To>
World</To>

</Hello>
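A parser sees the same pieces however the white space is arranged. The following sketch uses Python's standard xml.etree.ElementTree (variable names are arbitrary) to pull the text and child elements out of the mixed content examples.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<Hello>From Mars <To>World</To></Hello>")
leading_text = compact.text                  # text before the first child
child_names = [child.tag for child in compact]
all_text = "".join(compact.itertext())       # every piece of text, in order

spread_out = ET.fromstring("<Hello>From\nMars\n<To>\nWorld</To>\n\n</Hello>")
# Collapse runs of white space so the two layouts can be compared.
normalised = " ".join("".join(spread_out.itertext()).split())

print(leading_text, child_names, all_text, normalised)
```

Whether the extra white space matters is up to the application reading the XML; the parser simply hands it over as text.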

Attribute Names and Values


The attribute consists of a name and a textual value. The name and value together are placed in an element start tag or empty tag (not end tag) between the element name and the closing angle bracket. The name goes first followed, always, by an equals sign (with or without any white space between) followed (again, with or without any white space) by the attribute’s textual value surrounded by either single or double quotes like this:

 <Hello from='Mars'>World</Hello>

or like this

 <Hello from = "Mars" to="World"  />

or like this

 <Hello from=  "Mars" via="My website">World</Hello>

Note the various combinations of characters allowed such as white spaces and single or double quotes. Note too that if an attribute value were ever to contain an element it would not normally be recognized as an element but would normally be treated as just text. (Of course, it would be possible to instruct some computational process to extract the value of the attribute and treat it as XML in its own right but normally the value of the attribute is treated as just textual content.)
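As a quick check of these variations, here is a sketch with Python's standard xml.etree.ElementTree. The last sample, with an attribute named note, is invented for illustration: it shows that markup placed (escaped) inside an attribute value comes back as plain text, not as a child element.

```python
import xml.etree.ElementTree as ET

samples = [
    "<Hello from='Mars'>World</Hello>",
    '<Hello from = "Mars" to="World"  />',
    '<Hello from=  "Mars" via="My website">World</Hello>',
]
# Each element's attributes are exposed as a dictionary of name to value.
parsed = [ET.fromstring(sample).attrib for sample in samples]

with_markup = ET.fromstring('<Hello note="&lt;b&gt;hi&lt;/b&gt;"/>')
note_value = with_markup.attrib["note"]
child_count = len(with_markup)

print(parsed, note_value, child_count)
```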

The attribute names are limited in their allowed characters just as element names are: no leading digits, only certain punctuation characters allowed, etc. Note also that with an attribute name in MicroXML there is a special case of names beginning ‘xml:’ which are reserved names for special purposes specified in the XML Standards. Normally the colon is not allowed in an attribute or element name in MicroXML (because it has a special function given it in parts of the XML specifications not included in MicroXML but which would affect the way MicroXML was handled by tools used to handle XML in general). The ‘xml:’ prefix is allowed in attribute names in MicroXML, however, mainly to allow a special feature to be used which is an ‘xml:id’ special attribute (out of scope here).

Special and Illegal Characters


We have already seen that some characters in element and attribute names are illegal in MicroXML (and some of these are so in XML in general). In addition, to help processors interpret the XML properly and reliably there are some characters which are not allowed in the content. That might seem alarming and so it should: What, you might think, if there are such characters in text we wish to include between tags or in attribute values? What do we do with these characters? The answer is that they have to be replaced with what can be called ‘escape’ strings. This is one big drawback with using MicroXML or XML in general. The reason is clear when you think of what would happen if the text content of an element had a less-than sign in it: It could look very much like an end tag and in some cases might be indistinguishable from one, for example:

<MathStatement>one < two</MathStatement>

is potentially confusing but even more so is:

<two>one<two</two>

Attributes are not exempt and the following is not allowed:

<two one="<two"></two>

The special characters this applies to are

Ampersand &
Less-than <
Greater-than >
Double-quote "
Single-quote (apostrophe) '

The reason the ampersand ‘&’ is a special character too becomes apparent when we consider what we have to do with these characters to replace them. It is called ‘escaping’ and it consists of replacing these characters with a special sequence of characters which are as follows:

Ampersand     &amp;
Less-than      &lt;
Greater-than      &gt;
Double-quote      &quot;
Single-quote (apostrophe)       &#39;     or     &apos;

The escape strings themselves all start with an ampersand so this too is a special character.
Now all this poses problems which may have to be solved by people producing the XML, either through text editing or computer programming. You have to not just replace the special characters with their escape sequences but you have to ensure that when doing so you distinguish what is an ampersand that is part of the text and what is an ampersand at the start of an escape string (else you might end up producing something like &amp;amp; and eventually &amp;amp;amp;amp; or worse). Then you have to think about how to read the XML in any computer code, when to escape those characters, when to replace them back for humans to read, etc. It can all get complex but the same happens with web browsers which have to do this kind of thing for the language of the Web, HTML, too.
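Many programming environments ship helpers for this. As one example, the sketch below uses escape and unescape from Python's standard xml.sax.saxutils module; the wrapper names escape_text and unescape_text and the QUOTES table are assumptions made for this illustration. It also shows the double-escaping trap: escaping text that is already escaped turns each ampersand into yet another escape sequence.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > itself; quotes are passed as extra entities.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}


def escape_text(text):
    """Escape all five special characters."""
    return escape(text, QUOTES)


def unescape_text(text):
    """Reverse escape_text."""
    return unescape(text, UNQUOTES)


original = "one < two & 'three'"
escaped_once = escape_text(original)
escaped_twice = escape_text(escaped_once)   # the mistake to avoid

print(escaped_once)
print(escaped_twice)
```

The safest habit is to escape exactly once, at the moment text is written into the XML, and to unescape exactly once when it is read back out.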

Other characters are said to be ‘illegal’ in MicroXML because all the content, tags, attribute names and values have to be written using the encoding known as UTF-8, so any data which is not valid in this encoding (along with certain control characters which XML does not allow at all) has to be replaced at some point too. This might best be achieved by controlling which characters are actually added to the MicroXML rather than by escaping them (because escaping such characters introduces some extra complications out of scope here).
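One way to control what goes in is to check that incoming bytes really are UTF-8 before treating them as MicroXML. A minimal sketch in Python follows; the function name is_valid_utf8 and the sample data are made up for illustration.

```python
def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "<Hello>caf\u00e9</Hello>"          # contains one non-ASCII letter
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")      # same text, different encoding

print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

This only detects bytes which cannot be UTF-8 at all; it says nothing about control characters that decode correctly but are still not allowed in XML.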

It has not been mentioned until now but the start of a document written in MicroXML should be the UTF-8 character encoding declaration along with the general XML (version 1.0) declaration:

<?xml version="1.0" encoding="UTF-8"?>
… (rest of the MicroXML follows on).

This, for the Hello World example, would look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<Hello from='Mars'>World</Hello>

 

Comments

 

To allow ignorable comment text to be interspersed with the MicroXML, comments are separated from the rest of the XML by a special sequence of characters: a left angle-bracket (less-than sign) followed without white space by an exclamation mark and, again with no white space, two consecutive hyphens (dashes), then the comment, then two consecutive hyphens and (without intervening white space) a right angle-bracket (greater-than sign).

<!--   then textual comment here then  -->

This looks like the following when the comment is embedded in some XML:

<Hello>From
<!-- this is a comment -->
Mars
<To>
World</To></Hello>

This says that the comment is not to be regarded as actual content and it can be ignored. The comment’s text does not need to be escaped. Comments cannot be nested in MicroXML. One comment has to be ended before the other begins.

This is allowed:

<Hello>From
<!-- this is a comment -->
<!-- this is another comment -->
Mars
<To>
World</To></Hello>

but not this:

<Hello>From
<!-- this is a comment
<!-- this is an illegally nested comment --> -->
Mars
<To>
World</To></Hello>
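Both points can be seen with a parser. In this sketch Python's standard xml.etree.ElementTree is used (variable names are arbitrary): with its default settings it drops the comment and keeps the surrounding text, and it rejects the nested comment outright.

```python
import xml.etree.ElementTree as ET

with_comment = "<Hello>From <!-- this is a comment -->Mars</Hello>"
root = ET.fromstring(with_comment)
text = "".join(root.itertext())   # the comment is not part of the text

nested = "<Hello><!-- outer <!-- inner --> --></Hello>"
try:
    ET.fromstring(nested)
    nested_ok = True
except ET.ParseError:
    nested_ok = False

print(text, nested_ok)
```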

Structure


There are very many ways the above building blocks of MicroXML (and XML in general) can all be combined to form simple or complex structures but special note needs to be given to the hierarchical or tree-like nature of the structure due to there always being one and only one top level element. This is because there is a distinction made between a self-contained piece of XML and other partial pieces of XML. The first can be called an ‘instance’ and the latter can be called ‘fragments’. An ‘instance’ has special status. This is an instance:

<Hello><From>Mars</From><To>World</To></Hello>

whereas the following is not an instance but is a fragment:

<From>Mars</From><To>World</To>

That is because the first example is wrapped in a single element and the latter has two elements without any single element wrapper.

Then again the following is not even a fragment because as it stands it would not be valid MicroXML because it has a missing tag at the start and an incomplete one at the end:

Mars</From><To>World</To

To be perfect the instance example would be more complete as an instance (or XML document) if it had its declaration:

<?xml version="1.0" encoding="UTF-8"?>
<Hello><From>Mars</From><To>World</To></Hello>

Then a common practice is to add some indentation to make it easier to see that it has a structure because that structure sometimes expresses some of the meaning of the XML.

<?xml version="1.0" encoding="UTF-8"?>
<Hello>
 <From>Mars</From>
 <To>World</To>
</Hello>

The indentation is added using white space but this is not essential. It does illustrate the fact that the instance having a single wrapper around the outside, each element possibly wrapping other elements but attributes wrapping nothing gives the XML a tree-like or hierarchical logical structure.

The tree-like structure is described sometimes using language which alludes to a family tree (as we do later in the discussion of MicroXSD) where one element containing another element is said to be the parent of the second element (and XML elements can only have the one ‘parent’) and the second is said to be one of the child elements of the first. Two child elements sharing the same parent are often metaphorically called ‘siblings’.
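The difference between an instance and a fragment, and the parent and child relationship, can be sketched with Python's standard xml.etree.ElementTree. The helper name is_instance is invented here; it simply reports whether the parser accepts the text as one complete document.

```python
import xml.etree.ElementTree as ET


def is_instance(xml_text):
    """Return True if xml_text parses as a document with one top level element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


instance = "<Hello><From>Mars</From><To>World</To></Hello>"
fragment = "<From>Mars</From><To>World</To>"
broken = "Mars</From><To>World</To"

root = ET.fromstring(instance)
children = [child.tag for child in root]   # the siblings From and To

print(is_instance(instance), is_instance(fragment), is_instance(broken))
print(root.tag, children)
```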

The Use of a Namespace


There are times when there may be several instances of XML in one file or, say with mixed content, when various instances are somehow interspersed. To distinguish one instance from another we can assign a special name to each instance via a special attribute called the namespace attribute. The special name is called a namespace. We will see more about this in the section on MicroXSD, but the way that a namespace name is composed is a matter of discretion and might involve adhering to some sort of namespace naming scheme (such as schemes involving URIs or domain names). The namespaces in MicroXML are limited compared to full, standard XML 1.0 and are assigned using the attribute ‘xmlns’ like this:

<?xml version="1.0" encoding="UTF-8"?>
<Hello xmlns="somenamespace">
 <From>Mars</From>
 <To>World</To>
</Hello>

Any string is allowed as the value of the attribute. It is worth bearing in mind that this is a key area of simplification in the MicroXML profile of XML; full, standard XML is greatly complicated by the allowance of any number of namespaces in an instance with each namespace being assigned to respective elements and attributes using special prefixes in the element and attribute names. For MicroXML this is eliminated for greater simplicity. This does mean that we have to keep the namespaces separated more so that we are limited in how we combine elements from different namespaces: Essentially all elements from one namespace have to be completed (all end tags closed) before another begins. For example we might have a file whose top element has no namespace but within this element are two parent elements for two parts of the rest of the instance each with a different namespace:

<?xml version="1.0" encoding="UTF-8"?>
<Hello>
 <From xmlns="somenamespace1">Mars</From>
 <To xmlns="somenamespace2">World</To>
</Hello>

This is allowed because the two namespaces do not overlap. Every element within an element having a particular namespace has that same namespace in MicroXML.
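A namespace-aware parser makes the inheritance visible. The sketch below uses Python's standard xml.etree.ElementTree, which reports each element name with its namespace in curly braces in front; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

single = ET.fromstring(
    '<Hello xmlns="somenamespace"><From>Mars</From><To>World</To></Hello>'
)
single_tags = [single.tag] + [child.tag for child in single]

split = ET.fromstring(
    '<Hello><From xmlns="somenamespace1">Mars</From>'
    '<To xmlns="somenamespace2">World</To></Hello>'
)
split_tags = [split.tag] + [child.tag for child in split]

print(single_tags)
print(split_tags)
```

In the first document the child elements pick up the namespace of their parent; in the second the top element has none and each child carries its own.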


The Use of a Schema


It might clearly help the handling of the XML or of various parts of it if we pay attention to the design of its structure and seek to document it in a way which humans or even software can read.  The structure with its element and attribute names can be defined outside of the XML using some techniques designed especially for XML which can be used with MicroXML too. This also provides a designer of the XML structure with the opportunity to add some further information to describe and possibly constrain the XML, such as by associating a type with the content of a particular element or attribute. One such technique is to associate a schema with the XML. Such a schema can be written using MicroXSD, as seen in the previous blog [ http://stephengreenxml.blogspot.com/2011/02/microxsd.html ] which is a subset profile of W3C XML Schema. MicroXSD was developed by this blog's author especially for use with MicroXML.

Thursday, 3 March 2011

MicroXSD

Introduction

MicroXML


Toward the end of 2010 XML was getting a bit of a rethink in the XML community. First people talked about what for some had been unthinkable in its first decade, an ‘XML 2.0’. That may be on hold; perhaps just as well. For some it has been a major selling point that no 2.0 was planned when XML was created, so archived XML documents might still be readable (parseable) with software decades from now and printed XML still comprehensible decades after that.
What seems more likely is that those who find XML too onerous to use for their purposes will get as much as they need in the form of a predefined subset of XML. Efforts to define possible 80:20 subsets (80 percent of what you need with 20 percent of the complexity) have begun afresh of late, with the first major contender being called 'MicroXML'. The main discussion of MicroXML happened on the XML-Dev public mailing list in December 2010. [MicroXML is a subset profile of XML developed by James Clark and published in a blog post by James at http://blog.jclark.com/2010/12/more-on-microxml.html . I blog on its use in the next blog.]

MicroXSD

Along with MicroXML there was discussion of the possibility of a simplified subset of languages used for defining structural constraints on XML instances. The standard full version of the main constraint language is called W3C XML Schema but often an alias is used for this in the acronym XSD (XML Schema Definition). MicroXSD is such a subset.
MicroXSD eliminates all but the essentials and only allows 'local' element definitions with unnamed complex and simple types. This makes the 'schema' look similar to the XML it defines. Some will find that an improvement because many acknowledge that the average schema written with XSD-proper is somewhat incomprehensible, even to the well-trained eye. MicroXML, as defined in late 2010 does not allow multiple namespaces (namespaces add some of the most atrocious complexities to standard XML) so MicroXSD need only support zero or one namespace, meaning no imports. Having an easier way to define (and understand) the constraints to be applied to a vocabulary written in MicroXML might allow more people to find it within their means to produce and use XML and write software supporting this subset of it.

Using MicroXSD

Writing the simplest schema

A Schema written in MicroXSD inherits the semantics of a schema written in standard W3C XML Schema 1.0 (often known by the acronym XSD) except that not every element and attribute in W3C XML Schema is allowed in MicroXSD: It is a subset. (Note again that MicroXSD is particularly but not solely targeted for defining XML instances written in the subset of XML called 'MicroXML' as mentioned in the introduction.)
A schema is a file or string (or the like) of XML markup which defines and constrains the structural aspects of another instance or fragment of XML. It cannot express every aspect of the constraints which might be required for some uses of XML but it is often the first choice to use for the definition since it supports some of the most common requirements and is quite ubiquitous, well known and well supported in XML-related software. Once a schema is defined there are other ways to fill the gaps, such as with prose specification statements, assertions or test assertions. The MicroXSD subset cannot support all typical use cases (such as cases where there is much reuse of XML syntax, as with XHTML) but where it can be used it offers ease of interpretation of the schema by a human reader plus reduced complexity when writing software to execute validation of the schema against a target instance. Often the latter is necessary as a prerequisite to further processing of XML in software.

The 'schema' Element: <schema>

Every conforming MicroXSD schema starts with the 'schema' element as its outermost (top level) element.
An example:
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="123" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="somenamespace">


Let's pull this apart and see how it works.
<schema> is the top level element in a schema. It wraps all the other schema elements in this way
<schema ...> ...
</schema>


Attributes of the 'schema' element are as follows.
xmlns="http://www.w3.org/2001/XMLSchema" : This attribute with this name and value is mandatory in a schema top level element. It provides the schema with a namespace. All elements between <schema> and </schema> share this namespace. This namespace is the namespace of the language used to write the schema, NOT the namespace of the XML constrained by the schema. For the latter, in MicroXSD, the 'targetNamespace' attribute is used (see below).
version="123": This attribute gives the schema a version number (sometimes characters other than numbers are used too such as '123.1.0' or even 'foo-draft-1').
attributeFormDefault="unqualified" elementFormDefault="qualified" : Included for compatibility with existing software processing W3C XML Schema and with the XML Schema standard, these attributes and the values shown are necessary to clarify how to interpret the namespaces of attributes and elements in the instance of XML constrained by the schema. They are fixed with these values in MicroXSD.

targetNamespace="somenamespace" : This attribute defines which namespace is to be assigned to an XML instance constrained by the schema. It is optional and if missing the namespace is assumed to be empty. Typically namespaces follow some scheme (scheme, not schema) which helps, for example, to associate the namespace with some provenance. One such scheme is to use the author's URL or domain name, perhaps together with an identifier of some kind. Any string is allowed though.
In a MicroXSD schema no other attributes should be present in this top level 'schema' element. The only element allowed immediately below (within) this schema element in a MicroXSD schema is the 'element' element (the element called 'element'!). (Other possibilities acceptable in W3C XML Schema are not supported with the MicroXSD subset profile.)

The Topmost 'element' Element: <element>



The element called 'element' (beware confusion!) is found just below the 'schema' element but can also be nested further down within itself: 'element' elements can contain other 'element' elements (albeit indirectly). An example of its use when it is directly below the top level 'schema' element is
<element name="Hello"> ... </element>
name="Hello": This 'name' attribute is the only attribute allowed in 'element' in MicroXSD for the topmost 'element' (which is directly below the top level 'schema' element). It defines the name of the top level element in the constrained XML instance. It can have as its value any valid XML name as specified by the XML 1.0 Standard and in MicroXML: Starting with an alpha (ASCII) or underscore ('_') character, not including any spaces, but with numbers and some punctuation characters (such as the point '.' or underscore '_') allowed after the first character.
When the 'element' element occurs lower down in the schema structure it can also take the attributes 'minOccurs' and 'maxOccurs' and these will be looked at later.

The only child element allowed in MicroXSD for 'element' element is called 'complexType'.

Example of a topmost element with its complex type:
<element name="Hello">
 <complexType mixed="true">
  ...
Wherever the 'complexType' element is used in MicroXSD its syntax is the same.

Simplest use of the 'complexType' Element


Here is a so-called 'Hello World' example:
For the simple XML
<Hello>World</Hello>
we can use MicroXSD to write the schema
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Hello">
  <complexType>
   <simpleContent>
    <extension base="string"/>
   </simpleContent>
  </complexType>
 </element>
</schema>
Here the complexType element has one child which is 'simpleContent' (other child elements are allowed in MicroXSD and these will be looked at later on). That child 'simpleContent' itself has one required child element (the only allowed child element of 'simpleContent' in MicroXSD) which is called 'extension' which has a single, required attribute called 'base'. The 'base' attribute has a value which defines the datatype of the content (in the above example, the datatype of the content of the 'Hello' element). Allowed values for 'base' are string, decimal, integer, date, dateTime, boolean and base64Binary. These are some of the more commonly used standard datatypes for W3C XML Schema.
The above schema allows the XML
<Hello>Reader</Hello>
but does not allow
<Hello><Reader/></Hello>
because the latter contains another element named 'Reader' rather than just some text data of type 'string'.
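To make the constraint concrete, here is a hand-written check in Python for this one schema. It is not a MicroXSD or W3C XML Schema validator and does not read the schema at all; the function name fits_hello_schema is invented for illustration. It only encodes what this particular schema says: a top level element named 'Hello' holding text, with no child elements and no attributes.

```python
import xml.etree.ElementTree as ET


def fits_hello_schema(xml_text):
    """Approximate the Hello schema: a 'Hello' element with text content only."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # No child elements and no attributes are declared, so none are accepted.
    return root.tag == "Hello" and len(root) == 0 and not root.attrib


print(fits_hello_schema("<Hello>Reader</Hello>"))
print(fits_hello_schema("<Hello><Reader/></Hello>"))
```

Real validation would be done by software that reads the schema itself; the point here is only to show what kind of question such software answers.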
The markup language, XML, allows both elements and attributes so in the next section we will look at how to add some attributes to this simple element.

Adding attributes

A very simple 'Hello World' example XML and a corresponding MicroXSD schema:
<Hello>World</Hello>
which may be constrained by MicroXSD schema
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Hello">
  <complexType>
   <simpleContent>
    <extension base="string"/>
   </simpleContent>
  </complexType>
 </element>
</schema>
XML can have attributes as well as elements so a Hello World example could be written:
<Greeting to="World">Hello</Greeting>

The 'attribute' Element: <attribute>

The 'complexType' allows us to specify whether an element has child elements and whether it has attributes. When there are no child elements to be added to the instance element we use 'complexType' with a 'simpleContent' child and inside that we place a child of 'simpleContent' called 'extension'. In MicroXSD the element 'simpleContent' can only have this 'extension' element as a child element.
So far we have defined the following
<Greeting>Hello</Greeting>
and we need to add the attribute named 'to' to give us
<Greeting to="World">Hello</Greeting>
We do this by adding 'attribute' elements as children of the 'extension' element. The 'extension' element in MicroXSD is limited to having the attribute 'base' (with values limited in MicroXSD to a range of commonly used datatypes: string, decimal, integer, date, dateTime, boolean and base64Binary) and the child element 'attribute'.
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <attribute name="to"/>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>
To show that the attribute itself has content of a particular datatype we can use the element 'simpleType'. With MicroXSD this is the one way to do it. With full W3C XML Schema there is an alternative way which is to use a 'type' attribute of the 'attribute' element but that is not included in MicroXSD. Likewise in full W3C XML Schema there are several ways to assign a datatype to a simple content element but only one applies in MicroXSD, as shown above.
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <attribute name="to">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>
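Since a MicroXSD schema is itself just XML, the attribute declarations can be read back out with an ordinary parser. The following Python sketch (the function name declared_attributes is hypothetical) walks the schema above and reports each declared attribute with the datatype given by its 'simpleType', or None when no 'simpleType' is present.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"  # namespace of every schema element

SCHEMA = """<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <attribute name="to">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>"""


def declared_attributes(schema_text: str) -> dict:
    """Map each declared attribute name to its datatype (None if not given)."""
    result = {}
    for attr in ET.fromstring(schema_text).iter(XSD + "attribute"):
        restriction = attr.find(XSD + "simpleType/" + XSD + "restriction")
        result[attr.get("name")] = (
            restriction.get("base") if restriction is not None else None
        )
    return result


print(declared_attributes(SCHEMA))  # {'to': 'string'}
```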

The 'simpleType' Element: <simpleType>

<simpleType>
 <restriction base="string"/>
</simpleType>
In MicroXSD the element called 'simpleType' has only one child element called 'restriction' and no attributes. The 'restriction' element has only one attribute in MicroXSD named 'base' which has the same function and list of values as the 'base' attribute of the 'extension' element described above.
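To give a feel for what the seven 'base' values mean in practice, here is a rough Python sketch of a lexical check for each of them. It is a simplification made for illustration, not the W3C definition: time zones, fractional seconds, negative years and whitespace handling are all ignored, and the function name matches_base is an assumption of this example.

```python
import base64
import re
from datetime import date, datetime


def matches_base(value: str, base: str) -> bool:
    """Rough check of `value` against one of the MicroXSD 'base' datatype names."""
    if base == "string":
        return True
    if base == "boolean":
        return value in ("true", "false", "1", "0")
    if base == "integer":
        return re.fullmatch(r"[+-]?[0-9]+", value) is not None
    if base == "decimal":
        return re.fullmatch(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)", value) is not None
    if base == "date":
        # Shape first (YYYY-MM-DD), then let the library reject impossible dates.
        if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value) is None:
            return False
        try:
            date.fromisoformat(value)
            return True
        except ValueError:
            return False
    if base == "dateTime":
        shape = r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
        if re.fullmatch(shape, value) is None:
            return False
        try:
            datetime.fromisoformat(value)
            return True
        except ValueError:
            return False
    if base == "base64Binary":
        try:
            base64.b64decode(value, validate=True)
            return True
        except ValueError:  # binascii.Error is a subclass of ValueError
            return False
    raise ValueError(f"not a MicroXSD base type: {base}")


print(matches_base("true", "boolean"))     # True
print(matches_base("4.2", "integer"))      # False
print(matches_base("2012-02-30", "date"))  # False (no such day)
```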
Of course there can be more than one attribute for an element, though only one with any particular name (multiple attributes with the same name on the same element are not allowed). We could add another attribute to the example schema with the name 'from' like this:
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <attribute name="to">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
     <attribute name="from">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>

This allows the further attribute 'from' to be added to the XML example:
<Greeting to="World" from="Mars">Hello</Greeting>
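As a small illustration of the two points just made, the Python sketch below compares the attributes on an instance with the names 'to' and 'from' declared in the schema, and shows that a repeated attribute name is already rejected by the XML parser before any schema is involved. The set DECLARED and the function name undeclared_attributes are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

DECLARED = {"to", "from"}  # attribute names declared in the schema above


def undeclared_attributes(xml_text: str) -> list:
    """Return the attribute names on the root element that are not declared."""
    root = ET.fromstring(xml_text)
    return sorted(set(root.attrib) - DECLARED)


print(undeclared_attributes('<Greeting to="World" from="Mars">Hello</Greeting>'))  # []
print(undeclared_attributes('<Greeting to="World" cc="Venus">Hello</Greeting>'))   # ['cc']

# A repeated attribute name is not well-formed XML, so parsing itself fails.
try:
    ET.fromstring('<Greeting to="World" to="Mars">Hello</Greeting>')
except ET.ParseError as err:
    print("not well-formed:", err)
```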


The 'attribute' element (confusing language isn't it!) in MicroXSD takes the attribute 'name'. (The metaschema given below also permits an optional 'use' attribute with the value 'optional' or 'required'.)
For example:
<attribute name="to">...</attribute>
The 'name' assigns a name to the attribute in the XML instance. Like the name of an element, the name of an attribute has to conform to the naming standard in XML 1.0: the first character cannot be a number, but after the first character both alphanumeric characters and some other characters such as point '.' and underscore '_' are allowed in any order. Because MicroXSD supports in particular a subset of XML called MicroXML (see introduction), the names of attributes (and elements) do not include the 'prefix' allowed in standard XML. This eliminates multiple namespaces and significantly decreases the complexity required in MicroXSD. Note: It has been proposed that there should be support in MicroXML for prefixes for attribute names but as yet this has not been included in MicroXSD (version 2012). The term given to standard element and attribute names in XML without the prefix is 'NCName', as distinct from 'QName', the term for a name qualified with a prefix and associated namespace. QNames are not supported in this version of MicroXSD.

Simply supporting complexity



Here is a very simple example XML and corresponding MicroXSD schema:
<Greeting>Hello</Greeting>
which may be constrained by MicroXSD schema


<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string"></extension>
   </simpleContent>
  </complexType>
 </element>
</schema>
XML can have child elements within any given element so a Hello World example could be written:
<Greeting><Hello/></Greeting>
This requires that we add an element, albeit an empty one, to the top level element. We do this by adding 'sequence' or 'choice' elements as children of the 'complexType' element (instead of the 'simpleContent' element). The 'sequence' element in MicroXSD is limited to having no attributes but a range of possible child elements. Likewise the 'choice' element. Each of these elements, 'sequence' and 'choice' allow us to specify that an element has child elements. The difference between them is clear from their names: 'sequence' is used when the sequence of the child elements is fixed whereas 'choice' is used to show that there is a choice between certain child elements or sets of child elements. When there is just one child element it is best to use 'sequence'. In our case we just want one child element so we will use 'sequence'.


<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <sequence>
    <element name="Hello">
     <complexType></complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>

Here we have the unusual situation of having an empty complexType to show that the child element 'Hello' is empty. This complexType in MicroXSD, though, can be any complexType allowed by MicroXSD.
Suppose we now wish to add an attribute to this child element, such as for an XML instance as follows:
<Greeting><Hello exclamation="true"/></Greeting>
We add an attribute using the 'attribute' element but we place this element after any 'sequence' or 'choice' elements. We cannot use an 'attribute' element like this when using the 'simpleContent' we used before but if there is neither a 'simpleContent' nor a 'sequence' or 'choice' we can still add 'attribute' elements. To show that the attribute itself has content of a particular datatype we can use the element 'simpleType'. In the example we can make the datatype 'boolean' to match the use of values 'true' and 'false'.

<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <sequence>
    <element name="Hello">
     <complexType>
      <attribute name="exclamation">
       <simpleType>
        <restriction base="boolean"/>
       </simpleType>
      </attribute>
     </complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>

With full W3C XML Schema there are other options available besides 'sequence' and 'choice', but to keep things simple the only children of the 'complexType' in MicroXSD are 'simpleContent', 'sequence', 'choice' and, when 'simpleContent' is not used, the 'attribute' element. However, many structures with varying degrees of complexity are possible even with MicroXSD because the 'sequence' and 'choice' elements themselves can have nested 'sequence' and 'choice' elements. So the possible children of both 'sequence' and 'choice' in MicroXSD are 'element', 'sequence' (nested) and 'choice' (nested). Multiple 'sequence' or 'choice' elements are not allowed side-by-side directly under a 'complexType', though; they can only be nested.
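The parent and child rules described so far can be collected into a simple lookup table. The Python sketch below is an illustration rather than a replacement for the metaschema: it checks child element names only, not their order, how often they occur, or their attributes (so it will not notice an 'attribute' placed beside a 'simpleContent'). The names ALLOWED_CHILDREN and structure_problems are made up for this example.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Allowed child element names per parent, as described in the text.
ALLOWED_CHILDREN = {
    "schema": {"element"},
    "element": {"complexType"},
    "complexType": {"simpleContent", "sequence", "choice", "attribute"},
    "simpleContent": {"extension"},
    "extension": {"attribute"},
    "attribute": {"simpleType"},
    "simpleType": {"restriction"},
    "restriction": set(),
    "sequence": {"element", "sequence", "choice"},
    "choice": {"element", "sequence", "choice"},
}


def structure_problems(schema_text: str) -> list:
    """List child elements that the table above does not allow under their parent."""
    problems = []
    for parent in ET.fromstring(schema_text).iter():
        parent_name = parent.tag.replace(XSD, "")
        if parent_name not in ALLOWED_CHILDREN:
            continue  # already reported once, by its own parent
        for child in parent:
            child_name = child.tag.replace(XSD, "")
            if child_name not in ALLOWED_CHILDREN[parent_name]:
                problems.append(
                    f"'{child_name}' is not allowed inside '{parent_name}'"
                )
    return problems


# 'all' belongs to full W3C XML Schema but not to MicroXSD.
NOT_MICRO = """<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <all>
    <element name="Hello">
     <complexType></complexType>
    </element>
   </all>
  </complexType>
 </element>
</schema>"""

print(structure_problems(NOT_MICRO))  # ["'all' is not allowed inside 'complexType'"]
```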
Here is a schema using MicroXSD with a little nesting
<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <sequence>
    <element name="Hello">
     <complexType>
      <choice>
       <element name="World">
        <complexType></complexType>
       </element>
       <element name="Mars">
        <complexType></complexType>
       </element>
      </choice>
      <attribute name="exclamation">
       <simpleType>
        <restriction base="boolean"/>
       </simpleType>
      </attribute>
     </complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>
and this would allow slightly more complex XML instances like this
<Greeting>
 <Hello exclamation="true">
  <World/>
 </Hello>
</Greeting>
or this
<Greeting>
 <Hello exclamation="false">
  <Mars/>
 </Hello>
</Greeting>
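To show what 'sequence' and 'choice' mean for an instance, here is a hand-written Python sketch of the rules in the nested schema above. It is not generated from the schema and is not a general validator: text content is ignored and the function name check_greeting is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def check_greeting(xml_text: str) -> bool:
    """Check an instance against the nested 'Greeting' schema, rule by rule."""
    root = ET.fromstring(xml_text)
    if root.tag != "Greeting" or root.attrib:
        return False
    # sequence: exactly one 'Hello' child
    children = list(root)
    if [child.tag for child in children] != ["Hello"]:
        return False
    hello = children[0]
    # attribute: only 'exclamation' is declared, and it must be a boolean
    if not set(hello.attrib) <= {"exclamation"}:
        return False
    if hello.get("exclamation", "true") not in ("true", "false", "1", "0"):
        return False
    # choice: exactly one of 'World' or 'Mars', which must itself be empty
    inner = list(hello)
    if [child.tag for child in inner] not in (["World"], ["Mars"]):
        return False
    return len(inner[0]) == 0 and not inner[0].attrib


print(check_greeting('<Greeting><Hello exclamation="true"><World/></Hello></Greeting>'))  # True
print(check_greeting('<Greeting><Hello exclamation="false"><Mars/></Hello></Greeting>'))  # True
print(check_greeting('<Greeting><Hello><World/><Mars/></Hello></Greeting>'))              # False
```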

This concludes the description of the more complex uses of MicroXSD. The degree of complexity is unlimited but other aspects are limited by both the fact we are using a subset of W3C XML Schema and the fact that W3C XML Schema is itself somewhat limited in its feature set and supplemented by other related technologies such as ISO Schematron, RelaxNG, Test Assertions, formatting tables, plain text specifications and programming code. However, the fact that MicroXSD is a subset of W3C XML Schema means that adhering to the MicroXSD metaschema in your schema design should ensure that tools conforming to W3C XML Schema can readily handle a MicroXSD schema.

The MicroXSD 'Metaschema' (the schema defining a MicroXSD schema)

<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified" version="2012.02" xmlns="http://www.w3.org/2001/XMLSchema">
 <!-- MicroXSD 2012.02 -->
 <!-- -->
 <element name="schema">
  <complexType>
   <sequence>
    <element name="element" minOccurs="0">
     <complexType>
      <sequence>
       <element name="complexType" type="complexType_type"/>
      </sequence>
      <attribute name="name" type="NCName" use="required"/>
     </complexType>
    </element>
   </sequence>
   <attribute name="version" type="string" use="optional"/>
   <attribute name="attributeFormDefault" use="required" fixed="unqualified"/>
   <attribute name="elementFormDefault" use="required" fixed="qualified"/>
   <attribute name="targetNamespace" type="string" use="optional"/>
  </complexType>
 </element>
 <complexType name="element_type">
  <sequence>
   <element name="complexType" type="complexType_type"/>
  </sequence>
  <attribute name="name" type="NCName" use="required"/>
  <attribute name="minOccurs" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="0"/>
     <enumeration value="1"/>
    </restriction>
   </simpleType>
  </attribute>
  <attribute name="maxOccurs" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="1"/>
     <enumeration value="unbounded"/>
    </restriction>
   </simpleType>
  </attribute>
 </complexType>
 <simpleType name="base_type">
  <restriction base="string">
   <enumeration value="string"/>
   <enumeration value="decimal"/>
   <enumeration value="integer"/>
   <enumeration value="date"/>
   <enumeration value="dateTime"/>
   <enumeration value="boolean"/>
   <enumeration value="base64Binary"/>
  </restriction>
 </simpleType>
 <group name="element_sequence_choice">
  <choice>
   <element name="element" type="element_type"/>
   <group ref="sequence_choice"/>
  </choice>
 </group>
 <complexType name="restriction_type">
  <attribute name="base" type="base_type" use="required"/>
 </complexType>
 <complexType name="attribute_type">
  <sequence>
   <element name="simpleType" type="simpleType_type"/>
  </sequence>
  <attribute name="use" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="optional"/>
     <enumeration value="required"/>
    </restriction>
   </simpleType>
  </attribute>
  <attribute name="name" type="NCName" use="required"/>
 </complexType>
 <complexType name="extension_type">
  <sequence>
   <element name="attribute" type="attribute_type" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
  <attribute name="base" type="base_type" use="required"/>
 </complexType>
 <complexType name="complexType_type">
  <choice>
   <element name="simpleContent">
    <complexType>
     <sequence>
      <element name="extension" type="extension_type"/>
     </sequence>
    </complexType>
   </element>
   <sequence>
    <group ref="sequence_choice" minOccurs="0"/>
    <element name="attribute" type="attribute_type" minOccurs="0" maxOccurs="unbounded"/>
   </sequence>
  </choice>
  <attribute name="mixed" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="true"/>
     <enumeration value="false"/>
    </restriction>
   </simpleType>
  </attribute>
 </complexType>
 <complexType name="simpleType_type">
  <sequence>
   <element name="restriction" type="restriction_type"/>
  </sequence>
 </complexType>
 <group name="sequence_choice">
  <choice>
   <element name="sequence">
    <complexType>
     <group ref="element_sequence_choice" maxOccurs="unbounded"/>
    </complexType>
   </element>
   <element name="choice">
    <complexType>
     <group ref="element_sequence_choice" maxOccurs="unbounded"/>
    </complexType>
   </element>
  </choice>
 </group>
</schema>