Tuesday 8 March 2011

MicroXML

Using MicroXML


Some tend to think XML is rather complex and mysterious, yet all it is, fundamentally, is a set of building blocks for text: element tags, element content, attribute names, attribute values and the structure (the way the building blocks are arranged). Let’s look at these in turn, while limiting our focus to the simplification of XML which is called MicroXML. In the process we also need to address some special concerns, such as the considerations that have to be given to certain characters and how to mark a comment as distinct from the XML content. [MicroXML is a subset profile of XML developed by James Clark and published in a blog post by James at http://blog.jclark.com/2010/12/more-on-microxml.html .]

Element Tags


The simplest and most obvious part of XML is the tag. Tags surround text, sometimes surrounding other tags and sometimes including within the tag other structures called attributes. Tags are sometimes in pairs and sometimes alone. When they are in pairs then there is a leading tag or ‘start tag’ and a trailing tag or ‘end tag’ and usually some text between them. If there is a start tag and end tag without any text between them then it can also be written in an alternative, equivalent form called an empty tag which is the one time when the tag has no pair.

In computer programming, the simplest kinds of examples are often called ‘Hello World’ examples because a very famous book once introduced readers to the early programming language called ‘C’ with an example that output a simple string of characters saying just ‘Hello World’. Using MicroXML we might have an equivalent simple example like this:

 <Hello>World</Hello>

This is called an element. This element has a start tag <Hello>, an end tag </Hello> and between the tags some text reading ‘World’. Simple! Start tags have an opening angle bracket or ‘less-than’ sign < followed by the name given to the element followed by the closing bracket or ‘greater-than’ sign >. There is normally no white space unless the start tag contains one or more attributes, as we’ll see later. White space is that set of characters in text which do not have any visible shape, like spaces, tabs, new-lines, etc. The name of the element (in this case ‘Hello’) is not allowed to contain any white space. Neither is an element name allowed to start with a digit, but it can contain any number of digits after the first character. A formal definition of MicroXML states that the first character of a name (element or attribute name) cannot be a digit but can be an alpha character or underscore (or one of some more obscure characters which are listed by their character codes; see the formal definition of MicroXML here http://blog.jclark.com/2010/12/more-on-microxml.html ). Name characters after the first can be any of these and can also be digits or the two punctuation characters point (full stop) ‘.’ and hyphen (dash) ‘-’.
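The naming rules can be sketched as a small check in Python. This is only an illustration: it covers just the ASCII part of the rules (the more obscure non-ASCII characters in the formal definition are left out), it assumes, as in XML 1.0, that the underscore is allowed after the first character as well as in first position, and the function name is_ascii_name is invented here.

```python
import re

# ASCII-only approximation of the name rules described above:
#   first character : a letter or an underscore
#   later characters: letters, digits, underscore, '.' or '-'
# The non-ASCII characters allowed by the formal definition are omitted.
_ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ascii_name(name: str) -> bool:
    """Return True if name fits the ASCII subset of the naming rules."""
    return _ASCII_NAME.fullmatch(name) is not None


print(is_ascii_name("Hello"))       # True
print(is_ascii_name("_to.Mars-2"))  # True
print(is_ascii_name("2Hello"))      # False: starts with a digit
print(is_ascii_name("He llo"))      # False: contains white space
```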

End tags start with an opening angle bracket or less-than sign ‘<’ followed by a forward slash ‘/’ then the same element name as the start tag used and finally the closing angle bracket or greater than sign ‘>’.  So our example element:

 <Hello>World</Hello>

is composed of:
Start tag: <Hello>
Content:  World
End tag: </Hello>

Then there is that special case of element tags which stand alone without being in a pair. These tags are an alternative way to show a special kind of element; one that does not have any content. The ‘Hello’ in our example above, if the text ‘World’ were removed, could be written like this:

 <Hello></Hello>

but it could also be written like this with just one special tag:

 <Hello/>
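That the two spellings mean the same thing can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is a general XML parser rather than a MicroXML one, so it only illustrates the point.

```python
import xml.etree.ElementTree as ET

# Parse both spellings of the empty element.
paired = ET.fromstring("<Hello></Hello>")
single = ET.fromstring("<Hello/>")

# Both give an element named 'Hello' with no text and no child elements.
print(paired.tag == single.tag)   # True
print(paired.text, single.text)   # None None
print(len(paired), len(single))   # 0 0
```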

Elements can contain another kind of structure called an attribute. Attributes when added to an empty element can look like this:

 <Hello from="Mars" to="World"/>

The special tag for the empty element starts with the opening angle bracket, then comes the element name and then there can sometimes be white space, such as when the element has an attribute, but whether or not there is any white space it always closes with a forward slash immediately followed by a closing angle bracket or greater-than sign. (There are some times when white space follows the element name even when there are no attributes but the reason is not important at this stage except to remember it is possible.)

Note that when there are one or more attributes added to an element the attributes and their values (they always have values in MicroXML) always sit between the element name and the closing angle bracket of the start tag, or when the tag is the standalone empty element tag, between the element name and the forward slash, i.e. like the example above, or like this:

 <Hello from="Mars">World</Hello>

Also, the quotes around the attribute value can be a pair of double quotes or a pair of single quotes.

Element Content and Mixed Content

  
We’ve seen that an element can contain text but it can also contain other elements, or both. When it contains text alone then the text is simply everything between the start and end tag. In the example above that means just ‘World’. When it contains other elements then the other elements sit between the start and end tags but there may or may not be white space between the end of the first element’s start tag and the next element’s start tag. When the element contains both elements and text then it is called mixed content and the text and other elements along with any white space still sit between the first element’s start and end tags. So you can have content like this:

 <Hello><From>Mars</From><To>World</To></Hello>

or with white space like this:

 <Hello> <From>Mars</From> <To>
World</To>
</Hello>

or with mixed content like this:

 <Hello>From Mars <To>World</To></Hello>

White space and mixed content together make the XML look less recognizable as XML, but it is still allowed, like this:

<Hello>From
Mars
<To>
World</To>

</Hello>
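A parser shows how mixed content is held once it has been read. The following sketch uses Python's standard xml.etree.ElementTree module (a general XML parser, used here only as an illustration): the text before the first child element is kept on the parent, and each child keeps its own text.

```python
import xml.etree.ElementTree as ET

mixed = ET.fromstring("<Hello>From Mars <To>World</To></Hello>")

# Text that comes before the first child element belongs to the parent.
print(repr(mixed.text))            # 'From Mars '

# The child element keeps its own text.
to = mixed[0]
print(to.tag, to.text)             # To World

# All the text in document order, with the tags stripped out.
print("".join(mixed.itertext()))   # From Mars World
```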

Attribute Names and Values


The attribute consists of a name and a textual value. The name and value together are placed in an element start tag or empty tag (not end tag) between the element name and the closing angle bracket. The name goes first followed, always, by an equals sign (with or without any white space between) followed (again, with or without any white space) by the attribute’s textual value surrounded by either single or double quotes like this:

 <Hello from='Mars'>World</Hello>

or like this

 <Hello from = "Mars" to="World"  />

or like this

 <Hello from=  "Mars" via="My website">World</Hello>

Note the various combinations of characters allowed such as white spaces and single or double quotes. Note too that if an attribute value were ever to contain an element it would not normally be recognized as an element but would normally be treated as just text. (Of course, it would be possible to instruct some computational process to extract the value of the attribute and treat it as XML in its own right but normally the value of the attribute is treated as just textual content.)
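These variations can be tried out with a parser. The sketch below uses Python's standard xml.etree.ElementTree module (a general XML parser, so only an illustration) and plain straight quotes. The last example, with markup escaped inside an attribute value, is an invented one to show that the value comes back as text and not as an element.

```python
import xml.etree.ElementTree as ET

# Single quotes, double quotes and white space around '=' are all accepted.
a = ET.fromstring("<Hello from='Mars'>World</Hello>")
b = ET.fromstring('<Hello from = "Mars" to="World"  />')
print(a.attrib)   # {'from': 'Mars'}
print(b.attrib)   # {'from': 'Mars', 'to': 'World'}

# Markup inside an attribute value must be escaped, and it is handed back
# as plain text: no child element is created.
c = ET.fromstring('<Hello via="&lt;b&gt;My website&lt;/b&gt;"/>')
print(c.get("via"))   # <b>My website</b>
print(len(c))         # 0
```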

The attribute names are limited in their allowed characters just as element names are: no leading digits, only certain punctuation characters allowed, etc. Note also that with an attribute name in MicroXML there is a special case of names beginning ‘xml:’ which are reserved names for special purposes specified in the XML Standards. Normally the colon is not allowed in an attribute or element name in MicroXML (because it has a special function given it in parts of the XML specifications not included in MicroXML but which would affect the way MicroXML was handled by tools used to handle XML in general). The ‘xml:’ prefix is allowed in attribute names in MicroXML, however, mainly to allow a special feature to be used which is an ‘xml:id’ special attribute (out of scope here).

Special and Illegal Characters


We have already seen that some characters in element and attribute names are illegal in MicroXML (and some of these are so in XML in general). In addition, to help processors interpret the XML properly and reliably there are some characters which are not allowed in the content. That might seem alarming and so it should: What, you might think, if there are such characters in text we wish to include between tags or in attribute values? What do we do with these characters? The answer is that they have to be replaced with what can be called ‘escape’ strings. This is one big drawback with using MicroXML or XML in general. The reason is clear when you think of what would happen if the text content of an element had a less-than sign in it: it could look very much like the start of a tag and in some cases might be indistinguishable from one, for example:

<MathStatement>one < two</MathStatement>

is potentially confusing but even more so is:

<two>one<two</two>

Attributes are not exempt and the following is not allowed:

<two one="<two"></two>

The special characters this applies to are

Ampersand &
Less-than <
Greater-than >
Double-quote "
Single-quote (apostrophe) '

The reason the ampersand ‘&’ is a special character too becomes apparent when we consider what we have to do with these characters to replace them. It is called ‘escaping’ and it consists of replacing these characters with a special sequence of characters which are as follows:

Ampersand                    &amp;
Less-than                    &lt;
Greater-than                 &gt;
Double-quote                 &quot;
Single-quote (apostrophe)    &#39;     or     &apos;

The escape strings themselves all start with an ampersand so this too is a special character.
Now all this poses problems which may have to be solved for people producing the XML, either through text editing or computer programming. You have to not just replace the special characters with their escape sequences but you have to ensure that when doing so you distinguish an ampersand that is part of the text from an ampersand at the start of an escape string (else you might end up producing something like &amp;amp; and eventually &amp;amp;amp;amp; or worse). Then you have to think about how to read the XML in any computer code, when to escape those characters, when to replace them back for humans to read, etc. It can all get complex but the same happens with web browsers, which have to do this kind of thing for the language of the Web, HTML, too.
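Escaping, and the pitfall of escaping the same text twice, can be sketched in Python using the escape and unescape helpers from the standard xml.sax.saxutils module. By default those helpers handle only the ampersand, less-than and greater-than characters, so the two quote characters are passed in as extra entries. The sample text is invented for illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the quotes are added as extras.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

text = "one < two & 'three'"

once = escape(text, QUOTES)
print(once)    # one &lt; two &amp; &apos;three&apos;

# Escaping text that is already escaped turns every '&' into '&amp;' again.
twice = escape(once, QUOTES)
print(twice)   # one &amp;lt; two &amp;amp; &amp;apos;three&amp;apos;

# Unescaping once gets the original text back.
print(unescape(once, UNQUOTES) == text)   # True
```

The usual way to avoid the second result is to escape in exactly one place, at the point where text is written out as XML, and to unescape in exactly one place when reading it back.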

Other characters are said to be ‘illegal’ in MicroXML because all the content and the tags and attribute names and values have to be written in the character set of the encoding known as UTF-8, so any characters not found in this encoding system’s character set have to be replaced at some point too. This might best be achieved by controlling which characters are actually added to the MicroXML rather than by escaping them (because escaping such characters introduces some extra complications out of scope here).

It has not been mentioned until now but the start of a document written in MicroXML should be the UTF-8 character encoding declaration along with the general XML (version 1.0) declaration:

<?xml version="1.0" encoding="UTF-8"?>
… (rest of the MicroXML follows on).

This, for the Hello World example, would look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<Hello from='Mars'>World</Hello>

 

Comments

 

To allow ignorable comment text to be interspersed with the MicroXML, comments are separated from the rest of the XML by a special sequence of characters: a left angle-bracket (less-than sign) followed without white space by an exclamation mark and, again with no white space, two consecutive hyphens (dashes), then the comment, then two consecutive hyphens and (without intervening white space) a right angle-bracket (greater-than sign).

<!--   then textual comment here then  -->

This looks like the following when the comment is embedded in some XML:

<Hello>From
<!-- this is a comment -->
Mars
<To>
World</To></Hello>

This says that the comment is not to be regarded as actual content and it can be ignored. The comment’s text does not need to be escaped. Comments cannot be nested in MicroXML. One comment has to be ended before the other begins.

This is allowed:

<Hello>From
<!-- this is a comment -->
<!-- this is another comment -->
Mars
<To>
World</To></Hello>

but not this:

<Hello>From
<!-- this is a comment
<!-- this is an illegally nested comment --> -->
Mars
<To>
World</To></Hello>
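Both points, that a comment is ignored and that a nested comment is rejected, can be seen with a parser. The sketch below uses Python's standard xml.etree.ElementTree module, a general XML parser whose default settings discard comments. It is an illustration rather than a MicroXML processor, and the helper name parses is invented here.

```python
import xml.etree.ElementTree as ET

allowed = """<Hello>From
<!-- this is a comment -->
Mars
<To>
World</To></Hello>"""

nested = """<Hello>From
<!-- this is a comment
<!-- this is an illegally nested comment --> -->
Mars
<To>
World</To></Hello>"""


def parses(text: str) -> bool:
    """Return True if the parser accepts text as a complete document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(allowed)
# The comment is discarded, so it shows up in neither the text nor the children.
print("comment" in "".join(root.itertext()))   # False
print([child.tag for child in root])           # ['To']

print(parses(allowed))   # True
print(parses(nested))    # False
```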

Structure


There are very many ways the above building blocks of MicroXML (and XML in general) can all be combined to form simple or complex structures but special note needs to be given to the hierarchical or tree-like nature of the structure due to there always being one and only one top level element. This is because there is a distinction made between a self-contained piece of XML and other partial pieces of XML. The first can be called an ‘instance’ and the latter can be called ‘fragments’. An ‘instance’ has special status. This is an instance:

<Hello><From>Mars</From><To>World</To></Hello>

whereas the following is not an instance but is a fragment:

<From>Mars</From><To>World</To>

That is because the first example is wrapped in a single element and the latter has two elements without any single element wrapper.

Then again the following is not even a fragment: as it stands it would not be valid MicroXML because it has a missing tag at the start and an incomplete one at the end:

Mars</From><To>World</To
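The three cases can be told apart with a parser. The sketch below uses Python's standard xml.etree.ElementTree module (a general XML parser, so only an illustration). The function names and the temporary ‘wrapper’ element are invented here: a fragment is treated as anything that becomes an instance once it is given a single wrapping element.

```python
import xml.etree.ElementTree as ET


def is_instance(text: str) -> bool:
    """True if text parses as a document with exactly one top-level element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_fragment(text: str) -> bool:
    """True if text parses once it is wrapped in a single (invented) element."""
    return is_instance("<wrapper>" + text + "</wrapper>")


instance = "<Hello><From>Mars</From><To>World</To></Hello>"
fragment = "<From>Mars</From><To>World</To>"
broken = "Mars</From><To>World</To"

print(is_instance(instance))                        # True
print(is_instance(fragment), is_fragment(fragment)) # False True
print(is_instance(broken), is_fragment(broken))     # False False
```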

To be perfect, the instance example would be more complete as an instance (or XML document) if it had its declaration:

<?xml version="1.0" encoding="UTF-8"?>
<Hello><From>Mars</From><To>World</To></Hello>

Then a common practice is to add some indentation to make it easier to see that it has a structure because that structure sometimes expresses some of the meaning of the XML.

<?xml version="1.0" encoding="UTF-8"?>
<Hello>
 <From>Mars</From>
 <To>World</To>
</Hello>

The indentation is added using white space but this is not essential. It does illustrate how the XML gets its tree-like or hierarchical logical structure: the instance has a single wrapper around the outside, each element may wrap other elements, and attributes wrap nothing.

The tree-like structure is described sometimes using language which alludes to a family tree (as we do later in the discussion of MicroXSD) where one element containing another element is said to be the parent of the second element (and XML elements can only have the one ‘parent’) and the second is said to be one of the child elements of the first. Two child elements sharing the same parent are often metaphorically called ‘siblings’.
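The family-tree vocabulary can be made concrete with a parser. The sketch below uses Python's standard xml.etree.ElementTree module (a general XML parser, so only an illustration). Elements in that module do not record their parent, so a child-to-parent lookup table, called parent_of here, is built by hand.

```python
import xml.etree.ElementTree as ET

tree_root = ET.fromstring("<Hello><From>Mars</From><To>World</To></Hello>")

# Build a lookup from each child element to its one and only parent.
parent_of = {child: parent for parent in tree_root.iter() for child in parent}

frm, to = tree_root[0], tree_root[1]

print(parent_of[frm].tag)                # Hello
print(parent_of[frm] is parent_of[to])   # True: From and To are siblings
print(tree_root in parent_of)            # False: the top element has no parent
```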

The Use of a Namespace


There are times when there may be several instances of XML in one file or, say with mixed content, when various instances are somehow interspersed. To distinguish one instance from another we can assign a special name to each instance via a special attribute called the namespace attribute. The special name is called a namespace. We will see more about this in the section on MicroXSD, but the way that a namespace name is composed is a matter of discretion and might involve adhering to some sort of namespace naming scheme (such as a scheme involving URIs or domain names). The namespaces in MicroXML are limited compared to full, standard XML 1.0 and are assigned using the attribute ‘xmlns’ like this:

<?xml version="1.0" encoding="UTF-8"?>
<Hello xmlns="somenamespace">
 <From>Mars</From>
 <To>World</To>
</Hello>

Any string is allowed as the value of the attribute. It is worth bearing in mind that this is a key area of simplification in the MicroXML profile of XML; full, standard XML is greatly complicated by the allowance of any number of namespaces in an instance with each namespace being assigned to respective elements and attributes using special prefixes in the element and attribute names. For MicroXML this is eliminated for greater simplicity. This does mean that we have to keep the namespaces separated more so that we are limited in how we combine elements from different namespaces: Essentially all elements from one namespace have to be completed (all end tags closed) before another begins. For example we might have a file whose top element has no namespace but within this element are two parent elements for two parts of the rest of the instance each with a different namespace:

<?xml version="1.0" encoding="UTF-8"?>
<Hello>
 <From xmlns="somenamespace1">Mars</From>
 <To xmlns="somenamespace2">World</To>
</Hello>

This is allowed because the two namespaces do not overlap. Every element within an element having a particular namespace has that same namespace in MicroXML.
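How the namespace carries down to the elements inside can be seen with a namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree module, which implements full XML namespaces rather than the MicroXML subset, so it is only an illustration. That module reports each element name with its namespace in curly braces in front.

```python
import xml.etree.ElementTree as ET

# One namespace on the top element: every element inside shares it.
one = ET.fromstring('<Hello xmlns="somenamespace"><From>Mars</From></Hello>')
print(one.tag)      # {somenamespace}Hello
print(one[0].tag)   # {somenamespace}From

# No namespace on the top element, a different one on each child.
two = ET.fromstring(
    "<Hello>"
    '<From xmlns="somenamespace1">Mars</From>'
    '<To xmlns="somenamespace2">World</To>'
    "</Hello>"
)
print(two.tag)                        # Hello
print([child.tag for child in two])   # ['{somenamespace1}From', '{somenamespace2}To']
```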


The Use of a Schema


It might clearly help the handling of the XML or of various parts of it if we pay attention to the design of its structure and seek to document it in a way which humans or even software can read. The structure with its element and attribute names can be defined outside of the XML using some techniques designed especially for XML which can be used with MicroXML too. This also provides a designer of the XML structure with the opportunity to add some further information to describe and possibly constrain the XML, such as by associating a type with the content of a particular element or attribute. One such technique is to associate a schema with the XML. Such a schema can be written using MicroXSD, a subset profile of W3C XML Schema, as seen in the previous blog post [ http://stephengreenxml.blogspot.com/2011/02/microxsd.html ]. MicroXSD was developed by this blog's author especially for use with MicroXML.

1 comment:

  1. Update: MicroXML has now become the output of a W3C Community Group http://www.w3.org/community/microxml/ and a draft specification has been published (the latest at this point is found here: http://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html ). Those interested in using MicroXML might wish to watch this emerging spec (some particulars which might be prone to change include the specified requirements for attributes and illegal characters).
