Thursday 3 March 2011

MicroXSD

Introduction

MicroXML


Toward the end of 2010 XML was getting a bit of a rethink in the XML community. First people talked about what for some had been unthinkable in its first decade, an ‘XML 2.0’. That may be on hold; perhaps just as well. For some it has been a major selling point that no 2.0 was planned when XML was created, so archived XML documents might still be readable (parseable) with software decades from now and printed XML still comprehensible decades after that.
What seems more likely is that those who find XML too onerous to use for their purposes will get as much as they need in the form of a predefined subset of XML. Efforts to define possible 80:20 subsets (80 percent of what you need with 20 percent of the complexity) have begun afresh of late, with the first major contender being called 'MicroXML'. The main discussion of MicroXML happened on the XML-Dev public mailing list in December 2010. [MicroXML is a subset profile of XML developed by James Clark and published in a blog post by James at http://blog.jclark.com/2010/12/more-on-microxml.html . I blog on its use in the next post.]

MicroXSD

Along with MicroXML there was discussion of the possibility of a simplified subset of the languages used for defining structural constraints on XML instances. The standard, full version of the main constraint language is called W3C XML Schema, though it is often referred to by the acronym XSD (XML Schema Definition). MicroXSD is such a subset.
MicroXSD eliminates all but the essentials and only allows 'local' element definitions with unnamed complex and simple types. This makes the 'schema' look similar to the XML it defines. Some will find that an improvement because many acknowledge that the average schema written with XSD-proper is somewhat incomprehensible, even to the well-trained eye. MicroXML, as defined in late 2010, does not allow multiple namespaces (namespaces add some of the most atrocious complexities to standard XML), so MicroXSD need only support zero or one namespace, meaning no imports. Having an easier way to define (and understand) the constraints to be applied to a vocabulary written in MicroXML might allow more people to find it within their means to produce and use XML and to write software supporting this subset of it.

Using MicroXSD

Writing the simplest schema

A schema written in MicroXSD inherits the semantics of a schema written in standard W3C XML Schema 1.0 (often known by the acronym XSD), except that not every element and attribute in W3C XML Schema is allowed in MicroXSD: it is a subset. (Note again that MicroXSD is particularly, but not solely, targeted at defining XML instances written in the subset of XML called 'MicroXML', as mentioned in the introduction.)
A schema is a file or string (or the like) of XML markup which defines and constrains the structural aspects of another instance or fragment of XML. It cannot express every constraint which might be required for some uses of XML, but it is often the first choice for the definition since it supports some of the most common requirements and is ubiquitous, well known and well supported in XML-related software. Once a schema is defined there are other ways to fill the gaps, such as prose specification statements, assertions or test assertions. The MicroXSD subset cannot support all typical use cases (such as cases where there is much reuse of XML syntax, as with XHTML), but where it can be used it offers ease of interpretation of the schema by a human reader plus reduced complexity when writing software to validate a target instance against the schema. Often such validation is a prerequisite to further processing of XML in software.

The 'schema' Element: <schema>

Every conforming MicroXSD schema starts with the 'schema' element as its outermost (top level) element.
An example:
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="123" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="somenamespace">


Let's pull this apart and see how it works.
<schema> is the top level element in a schema. It wraps all the other schema elements in this way:
<schema ...> ...
</schema>


Attributes of the 'schema' element are as follows.
xmlns="http://www.w3.org/2001/XMLSchema" : This attribute with this name and value is mandatory in a schema top level element. It provides the schema with a namespace. All elements between <schema> and </schema> share this namespace. This namespace is the namespace of the language used to write the schema, NOT the namespace of the XML constrained by the schema. For the latter, in MicroXSD, the 'targetNamespace' attribute is used (see below).
version="123": This attribute gives the schema a version number (sometimes characters other than numbers are used too such as '123.1.0' or even 'foo-draft-1').
attributeFormDefault="unqualified" elementFormDefault="qualified" : Included for compatibility with existing software processing W3C XML Schema and with the XML Schema standard, these attributes and the values shown are necessary to clarify how to interpret the namespaces of attributes and elements in the instance of XML constrained by the schema. They are fixed with these values in MicroXSD.

targetNamespace="somenamespace" : This attribute defines which namespace is to be assigned to an XML instance constrained by the schema. It is optional and if missing the namespace is assumed to be empty. Typically namespaces follow some scheme (scheme, not schema) which helps, for example, to associate the namespace with some provenance. One such scheme is to use the author's URL or domain name, perhaps together with an identifier of some kind. Any string is allowed though.
In a MicroXSD schema no other attributes should be present in this top level 'schema' element. The only element allowed immediately below (within) this 'schema' element in a MicroXSD schema is the 'element' element (the element called 'element'!). (Other possibilities acceptable in W3C XML Schema are not supported with the MicroXSD subset profile.)
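As a minimal sketch of how these attributes fit together, here is a complete schema that uses 'targetNamespace'. The namespace string 'http://example.org/greeting' and the element name 'Hello' are placeholders chosen for illustration; the 'element' and 'complexType' parts are explained in the sections that follow.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://example.org/greeting">
 <!-- the single top level 'element' allowed below 'schema' -->
 <element name="Hello">
  <!-- an empty complexType: 'Hello' has no content and no attributes -->
  <complexType></complexType>
 </element>
</schema>
```

An instance constrained by this sketch would need to place its top level element in the same namespace:

```xml
<Hello xmlns="http://example.org/greeting"/>
```

If 'targetNamespace' were left out of the schema, the matching instance would simply be <Hello/> with no namespace declaration.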

The Topmost 'element' Element: <element>



The element called 'element' (beware confusion!) is found just below the 'schema' element but can also be nested further down within itself: 'element' elements can contain other 'element' elements (albeit indirectly). An example of its use when it is directly below the top level 'schema' element is:
<element name="Hello"> ... </element>
name="Hello": This 'name' attribute is the only attribute allowed in MicroXSD for the topmost 'element' (the one directly below the top level 'schema' element). It defines the name of the top level element in the constrained XML instance. It can have as its value any valid XML name as specified by the XML 1.0 Standard and in MicroXML: starting with an alpha (ASCII) or underscore ('_') character, not including any spaces, but with numbers and some punctuation characters (such as the point '.' or underscore '_') allowed after the first character.
When the 'element' element occurs lower down in the schema structure it can also take the attributes 'minOccurs' and 'maxOccurs' and these will be looked at later.

The only child element allowed in MicroXSD for the 'element' element is called 'complexType'.

Example of a topmost element with its complex type:
<element name="Hello">
 <complexType mixed="true">
  ...
Wherever the 'complexType' element is used in MicroXSD its syntax is the same.

Simplest use of the 'complexType' Element


Here is a so-called 'Hello World' example:
For the simple XML
<Hello>World</Hello>
we can use MicroXSD to write the schema
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Hello">
  <complexType>
   <simpleContent>
    <extension base="string"/>
   </simpleContent>
  </complexType>
 </element>
</schema>
Here the complexType element has one child which is 'simpleContent' (other child elements are allowed in MicroXSD and these will be looked at later on). That child 'simpleContent' itself has one required child element (the only allowed child element of 'simpleContent' in MicroXSD) which is called 'extension' which has a single, required attribute called 'base'. The 'base' attribute has a value which defines the datatype of the content (in the above example, the datatype of the content of the 'Hello' element). Allowed values for 'base' are string, decimal, integer, date, dateTime, boolean and base64Binary. These are some of the more commonly used standard datatypes for W3C XML Schema.
The above schema allows the XML
<Hello>Reader</Hello>
but does not allow
<Hello><Reader/></Hello>
because the latter contains another element named 'Reader' rather than just some text data of type 'string'.
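To see the effect of a different 'base' value, here is a sketch of the same pattern using 'integer' instead of 'string'. The element name 'Count' is made up for this illustration.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Count">
  <complexType>
   <simpleContent>
    <!-- content of 'Count' must be a whole number -->
    <extension base="integer"/>
   </simpleContent>
  </complexType>
 </element>
</schema>
```

A validator should accept <Count>42</Count> under this schema and reject <Count>forty-two</Count>, since the text is not a valid integer.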
The markup language, XML, allows both elements and attributes so in the next section we will look at how to add some attributes to this simple element.

Adding attributes

A very simple 'Hello World' example XML and a corresponding MicroXSD schema:
<Hello>World</Hello>
which may be constrained by MicroXSD schema
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Hello">
  <complexType>
   <simpleContent>
    <extension base="string"/>
   </simpleContent>
  </complexType>
 </element>
</schema>
XML can have attributes as well as elements so a Hello World example could be written:
<Greeting to="World">Hello</Greeting>

The 'attribute' Element: <attribute>

The 'complexType' allows us to specify whether an element has child elements and whether it has attributes. When there are no child elements to be added to the instance element we use 'complexType' with a 'simpleContent' child and inside that we place a child of 'simpleContent' called 'extension'. In MicroXSD the element 'simpleContent' can only have this 'extension' element as a child element.
So far we have defined the following
<Greeting>Hello</Greeting>
and we need to add the attribute named 'to' to give us
<Greeting to="World">Hello</Greeting>
We do this by adding 'attribute' elements as children of the 'extension' element. The 'extension' element in MicroXSD is limited to having the attribute 'base' (with values limited in MicroXSD to a range of commonly used datatypes: string, decimal, integer, date, dateTime, boolean and base64Binary) and the child element 'attribute'.
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <attribute name="to"/>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>
To show that the attribute itself has content of a particular datatype we can use the element 'simpleType'. With MicroXSD this is the one way to do it. With full W3C XML Schema there is an alternative way which is to use a 'type' attribute of the 'attribute' element but that is not included in MicroXSD. Likewise in full W3C XML Schema there are several ways to assign a datatype to a simple content element but only one applies in MicroXSD, as shown above.
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <attribute name="to">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>

The 'simpleType' Element: <simpleType>

<simpleType>
 <restriction base="string"/>
</simpleType>
In MicroXSD the element called 'simpleType' has only one child element called 'restriction' and no attributes. The 'restriction' element has only one attribute in MicroXSD named 'base' which has the same function and list of values as the 'base' attribute of the 'extension' element described above.
Of course there can be more than one attribute for an element, though only one with any particular name (multiple attributes with the same name on the same element are not allowed). We could add another attribute to the example schema with the name 'from' like this:
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <attribute name="to">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
     <attribute name="from">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>

This allows the further attribute 'from' to be added to the XML example:
<Greeting to="World" from="Mars">Hello</Greeting>


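The 'attribute' element (confusing language isn't it!) in MicroXSD must have the attribute 'name'. (The metaschema at the end of this post also permits an optional 'use' attribute with the value 'optional' or 'required'.)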
For example:
<attribute name="to">...</attribute>
The 'name' assigns a name to the attribute in the XML instance. Like the name of an element, the name of an attribute has to conform to the naming standard in XML 1.0, which means no numbers as the first character; after the first character, alphanumeric characters and some other characters such as point '.' and underscore '_' are allowed in any order. Because MicroXSD supports in particular a subset of XML called MicroXML (see introduction), the names of attributes (and elements) do not include the 'prefix' allowed in standard XML. This eliminates multiple namespaces and significantly decreases the complexity required in MicroXSD. Note: It has been proposed that there should be support in MicroXML for prefixes for attribute names, but as yet this has not been included in MicroXSD (version 2012). The term given to standard element and attribute names in XML without the prefix is 'NCName', as distinct from 'QName', the term for a name qualified with a prefix and associated namespace. QNames are not supported in this version of MicroXSD.
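The metaschema at the end of this post lists a 'use' attribute on 'attribute', with the values 'optional' and 'required'. It is not described in the text, so the following is only a sketch, assuming 'use' keeps the meaning it has in full W3C XML Schema (where an attribute is optional unless stated otherwise):

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="0.1" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string">
     <!-- 'to' must be present on every 'Greeting' -->
     <attribute name="to" use="required">
      <simpleType>
       <restriction base="string"/>
      </simpleType>
     </attribute>
    </extension>
   </simpleContent>
  </complexType>
 </element>
</schema>
```

Under that assumption <Greeting to="World">Hello</Greeting> would be accepted, while <Greeting>Hello</Greeting> would be rejected because the required attribute is missing.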

Simply supporting complexity



Here is a very simple example XML and corresponding MicroXSD schema:
<Greeting>Hello</Greeting>
which may be constrained by MicroXSD schema


<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <simpleContent>
    <extension base="string"></extension>
   </simpleContent>
  </complexType>
 </element>
</schema>
XML can have child elements within any given element so a Hello World example could be written:
<Greeting><Hello/></Greeting>
This requires that we add an element, albeit an empty one, to the top level element. We do this by adding 'sequence' or 'choice' elements as children of the 'complexType' element (instead of the 'simpleContent' element). The 'sequence' element in MicroXSD is limited to having no attributes but a range of possible child elements. Likewise the 'choice' element. Each of these elements, 'sequence' and 'choice' allow us to specify that an element has child elements. The difference between them is clear from their names: 'sequence' is used when the sequence of the child elements is fixed whereas 'choice' is used to show that there is a choice between certain child elements or sets of child elements. When there is just one child element it is best to use 'sequence'. In our case we just want one child element so we will use 'sequence'.


<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <sequence>
    <element name="Hello">
     <complexType></complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>

Here we have the unusual situation of having an empty complexType to show that the child element 'Hello' is empty. This complexType in MicroXSD, though, can be any complexType allowed by MicroXSD.
Suppose we now wish to add an attribute to this child element, such as for an XML instance as follows:
<Greeting><Hello exclamation="true"/></Greeting>
We add an attribute using the 'attribute' element but we place this element after any 'sequence' or 'choice' elements. We cannot use an 'attribute' element like this when using the 'simpleContent' we used before but if there is neither a 'simpleContent' nor a 'sequence' or 'choice' we can still add 'attribute' elements. To show that the attribute itself has content of a particular datatype we can use the element 'simpleType'. In the example we can make the datatype 'boolean' to match the use of values 'true' and 'false'.

<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <sequence>
    <element name="Hello">
     <complexType>
      <attribute name="exclamation">
       <simpleType>
        <restriction base="boolean"/>
       </simpleType>
      </attribute>
     </complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>
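The 'mixed' attribute appeared on 'complexType' in an earlier example and is listed in the metaschema at the end of this post, but it is not otherwise described. As a sketch, and assuming it keeps the meaning it has in full W3C XML Schema, setting mixed="true" lets text appear alongside the child elements:

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <!-- mixed="true": text is allowed before, between and after child elements -->
  <complexType mixed="true">
   <sequence>
    <element name="Hello">
     <complexType></complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>
```

With that assumption, an instance such as the following would be allowed:

```xml
<Greeting>Well, <Hello/> to you</Greeting>
```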

With full W3C XML Schema there are other options available besides 'sequence' and 'choice', but to keep things simple the only children of 'complexType' in MicroXSD are 'simpleContent', 'sequence', 'choice' and, when 'simpleContent' is not used, the 'attribute' element. However, many structures with varying degrees of complexity are possible even with MicroXSD because the 'sequence' and 'choice' elements themselves can have nested 'sequence' and 'choice' elements. So the possible children of both 'sequence' and 'choice' in MicroXSD are 'element', 'sequence' (nested) and 'choice' (nested). Multiple 'sequence' or 'choice' elements are not allowed side-by-side directly below 'complexType', though; they can only be nested.
Here is a schema using MicroXSD with a little nesting
<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <sequence>
    <element name="Hello">
     <complexType>
      <choice>
       <element name="World">
        <complexType></complexType>
       </element>
       <element name="Mars">
        <complexType></complexType>
       </element>
      </choice>
      <attribute name="exclamation">
       <simpleType>
        <restriction base="boolean"/>
       </simpleType>
      </attribute>
     </complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>
and this would allow XML instances that are a little more complex, like this
<Greeting>
 <Hello exclamation="true">
  <World/>
 </Hello>
</Greeting>
or this
<Greeting>
 <Hello exclamation="false">
  <Mars/>
 </Hello>
</Greeting>
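The 'minOccurs' and 'maxOccurs' attributes mentioned earlier apply to 'element' elements below the topmost one. According to the metaschema at the end of this post, 'minOccurs' may be '0' or '1' and 'maxOccurs' may be '1' or 'unbounded'. The following is a sketch of their use, assuming they behave as in full W3C XML Schema (where both default to 1 when omitted):

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
 <element name="Greeting">
  <complexType>
   <sequence>
    <!-- 'Hello' may be absent or repeated any number of times -->
    <element name="Hello" minOccurs="0" maxOccurs="unbounded">
     <complexType></complexType>
    </element>
   </sequence>
  </complexType>
 </element>
</schema>
```

Under that assumption both of these instances would be allowed:

```xml
<Greeting/>
```

```xml
<Greeting>
 <Hello/>
 <Hello/>
</Greeting>
```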

This concludes the description of the more complex uses of MicroXSD. The degree of complexity is unlimited but other aspects are limited by both the fact we are using a subset of W3C XML Schema and the fact that W3C XML Schema is itself somewhat limited in its feature set and supplemented by other related technologies such as ISO Schematron, RelaxNG, Test Assertions, formatting tables, plain text specifications and programming code. However, the fact that MicroXSD is a subset of W3C XML Schema means that adhering to the MicroXSD metaschema in your schema design should ensure that tools conforming to W3C XML Schema can readily handle a MicroXSD schema.

The MicroXSD 'Metaschema' (the schema defining a MicroXSD schema)

<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified" version="2012.02" xmlns="http://www.w3.org/2001/XMLSchema">
 <!-- MicroXSD 2012.02 -->
 <!-- -->
 <element name="schema">
  <complexType>
   <sequence>
    <element name="element" minOccurs="0">
     <complexType>
      <sequence>
       <element name="complexType" type="complexType_type"/>
      </sequence>
      <attribute name="name" type="NCName" use="required"/>
     </complexType>
    </element>
   </sequence>
   <attribute name="version" type="string" use="optional"/>
   <attribute name="attributeFormDefault" use="required" fixed="unqualified"/>
   <attribute name="elementFormDefault" use="required" fixed="qualified"/>
   <attribute name="targetNamespace" type="string" use="optional"/>
  </complexType>
 </element>
 <complexType name="element_type">
  <sequence>
   <element name="complexType" type="complexType_type"/>
  </sequence>
  <attribute name="name" type="NCName" use="required"/>
  <attribute name="minOccurs" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="0"/>
     <enumeration value="1"/>
    </restriction>
   </simpleType>
  </attribute>
  <attribute name="maxOccurs" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="1"/>
     <enumeration value="unbounded"/>
    </restriction>
   </simpleType>
  </attribute>
 </complexType>
 <simpleType name="base_type">
  <restriction base="string">
   <enumeration value="string"/>
   <enumeration value="decimal"/>
   <enumeration value="integer"/>
   <enumeration value="date"/>
   <enumeration value="dateTime"/>
   <enumeration value="boolean"/>
   <enumeration value="base64Binary"/>
  </restriction>
 </simpleType>
 <group name="element_sequence_choice">
  <choice>
   <element name="element" type="element_type"/>
   <group ref="sequence_choice"/>
  </choice>
 </group>
 <complexType name="restriction_type">
  <attribute name="base" type="base_type" use="required"/>
 </complexType>
 <complexType name="attribute_type">
  <sequence>
   <element name="simpleType" type="simpleType_type"/>
  </sequence>
  <attribute name="use" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="optional"/>
     <enumeration value="required"/>
    </restriction>
   </simpleType>
  </attribute>
  <attribute name="name" type="NCName" use="required"/>
 </complexType>
 <complexType name="extension_type">
  <sequence>
   <element name="attribute" type="attribute_type" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
  <attribute name="base" type="base_type" use="required"/>
 </complexType>
 <complexType name="complexType_type">
  <choice>
   <element name="simpleContent">
    <complexType>
     <sequence>
      <element name="extension" type="extension_type"/>
     </sequence>
    </complexType>
   </element>
   <sequence>
    <group ref="sequence_choice" minOccurs="0"/>
    <element name="attribute" type="attribute_type" minOccurs="0" maxOccurs="unbounded"/>
   </sequence>
  </choice>
  <attribute name="mixed" use="optional">
   <simpleType>
    <restriction base="string">
     <enumeration value="true"/>
     <enumeration value="false"/>
    </restriction>
   </simpleType>
  </attribute>
 </complexType>
 <complexType name="simpleType_type">
  <sequence>
   <element name="restriction" type="restriction_type"/>
  </sequence>
 </complexType>
 <group name="sequence_choice">
  <choice>
   <element name="sequence">
    <complexType>
     <group ref="element_sequence_choice" maxOccurs="unbounded"/>
    </complexType>
   </element>
   <element name="choice">
    <complexType>
     <group ref="element_sequence_choice" maxOccurs="unbounded"/>
    </complexType>
   </element>
  </choice>
 </group>
</schema>

1 comment:

  1. Very interesting. But I think named complexType would still be useful for lightweight usage. micro.xsd validating itself could be an indication of sufficient expressiveness.
