Friday, 29 May 2026

Canonical XML with Overlay Schemas: A Different Approach to Long-Lived XML Architecture



Written by ChatGPT, as prompted by Stephen D Green, May 2026 


Most XML architectures built around XML Schema Definition (XSD) assume that a document intrinsically declares its own type and validation semantics. The schema normally defines not only the structural vocabulary of the document, but also its datatypes, lexical constraints, validation rules, and often much of its business meaning. In practice this tends to produce tightly coupled document models in which structure and interpretation evolve together.


That approach works reasonably well for relatively static systems, but it becomes problematic for long-lived standards and enterprise ecosystems. Business requirements evolve continually. Identifier formats change. Regulatory overlays appear and disappear. New workflow contexts emerge. Legacy integration requirements accumulate. Different industries and trading partners introduce specialized constraints. Yet the underlying business structures often remain surprisingly stable for decades.


A purchase order, invoice, shipping notice, payment instruction, or inventory report may continue to contain essentially the same conceptual elements year after year: identifiers, dates, parties, items, totals, quantities, addresses, and references. The structure changes slowly. The interpretation layers change continuously.


This suggests a different architectural approach: separate the stable canonical structure from the evolving validation overlays.


Instead of designing a monolithic schema that attempts to define every possible constraint and interpretation upfront, the architecture can be divided into two layers:

  1. A stable canonical schema defining only the structural vocabulary and document shape.
  2. Secondary overlay schemas defining contextual datatype constraints, validation profiles, and specialized interpretations.


The result is an XML architecture based on canonical structure plus externally applied validation overlays rather than intrinsic monolithic typing.


Consider a simple canonical business document:

<BusinessDocument>
    <DocumentID>...</DocumentID>
    <DocumentDate>...</DocumentDate>
    <Sender>
        <PartyID>...</PartyID>
    </Sender>
    <Receiver>
        <PartyID>...</PartyID>
    </Receiver>
    <Items>
        <Item>
            <SKU>...</SKU>
            <Quantity>...</Quantity>
            <UnitPrice>...</UnitPrice>
        </Item>
    </Items>
    <Total>...</Total>
</BusinessDocument>

The canonical schema intentionally imposes only minimal constraints. It establishes the vocabulary and overall tree structure while avoiding premature commitment to highly specific datatypes or business rules.


For example, the base schema may define:

<xs:element name="DocumentID" type="xs:string"/>

rather than immediately constraining the identifier format.

Similarly, portions of the structure may even use highly permissive definitions:

<xs:element name="Items" type="xs:anyType"/>

The purpose of this schema is not to define the final interpretation of the document. Instead, it defines the stable interchange substrate shared across many future contexts.
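
As a minimal sketch, a canonical schema for the document above might look like the following. It assumes no target namespace and types every leaf as xs:string; both are illustrative choices rather than anything the approach requires, and the type names PartyType and ItemsType are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative canonical schema: fixes vocabulary and shape only. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="BusinessDocument">
    <xs:complexType>
      <xs:sequence>
        <!-- Leaves stay xs:string; overlays tighten them later. -->
        <xs:element name="DocumentID" type="xs:string"/>
        <xs:element name="DocumentDate" type="xs:string"/>
        <xs:element name="Sender" type="PartyType"/>
        <xs:element name="Receiver" type="PartyType"/>
        <xs:element name="Items" type="ItemsType"/>
        <xs:element name="Total" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Hypothetical helper type shared by Sender and Receiver. -->
  <xs:complexType name="PartyType">
    <xs:sequence>
      <xs:element name="PartyID" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Hypothetical helper type; xs:anyType would be the more permissive option. -->
  <xs:complexType name="ItemsType">
    <xs:sequence>
      <xs:element name="Item" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="SKU" type="xs:string"/>
            <xs:element name="Quantity" type="xs:string"/>
            <xs:element name="UnitPrice" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

Even Quantity and UnitPrice are left as strings here, on the assumption that numeric precision and currency rules belong to an accounting or regional overlay rather than to the core.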

Secondary schemas then provide progressively more specialized overlays.


One overlay schema may define an automatically generated invoice profile in which DocumentID must be a UUID:

<xs:simpleType name="UUIDDocumentID">
    <xs:restriction base="xs:string">
        <xs:pattern value="[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"/>
    </xs:restriction>
</xs:simpleType>

Another overlay may define a manually generated invoice profile allowing more flexible identifiers:

<xs:simpleType name="ManualInvoiceID">
    <xs:restriction base="xs:string">
        <xs:pattern value="[\p{L}\p{N}#/\- ]+"/>
    </xs:restriction>
</xs:simpleType>

Another overlay may define a legacy order-processing profile:

<xs:simpleType name="LegacyOrderID">
    <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z]{3}-[0-9]+"/>
    </xs:restriction>
</xs:simpleType>

The same canonical XML structure can therefore participate in multiple distinct validation environments without requiring redesign of the foundational document vocabulary.
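
A simple type on its own does not yet say which element it applies to. One way to package it as a standalone overlay, sketched here on the assumption that the vocabulary has no namespace and that DocumentID is always the first child of BusinessDocument, is a second schema that constrains only the element it cares about and skips everything else. The file name in the comment is hypothetical, and the pattern assumes the usual 8-4-4-4-12 hexadecimal UUID layout.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- uuid-invoice-overlay.xsd (hypothetical name): checks DocumentID only. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="UUIDDocumentID">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="BusinessDocument">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="DocumentID" type="UUIDDocumentID"/>
        <!-- Everything after DocumentID is accepted unvalidated; the
             canonical schema is assumed to have checked the structure. -->
        <xs:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A document would then be validated twice: once against the canonical schema for shape, and once against whichever overlay the processing context selects.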


The important point is that the canonical structure remains stable while secondary overlays proliferate over time. As technology evolves, additional overlays can be introduced without destabilizing the underlying interchange model.

This resembles what might be called a Minimal Viable Product Line (MVPL) architecture. The canonical schema functions as the minimal viable core platform. Overlay schemas function as product-line extensions layered onto that stable substrate.

The core vocabulary remains durable and interoperable. Specialized interpretations become modular extensions rather than modifications to the base model.


This differs significantly from conventional XSD-centric design philosophy. Traditional XML Schema design often tightly couples structure and semantics into a single integrated type hierarchy. Documents become heavily self-describing and intrinsically typed. Such systems frequently become brittle as requirements evolve because every new business case pressures the core schema itself.


The overlay approach instead embraces the idea that structure and interpretation evolve at different rates.


The structural concepts of documents are relatively stable. Documents have:

  • identifiers
  • parties
  • dates
  • items
  • totals
  • references

What changes continuously are:

  • lexical constraints
  • processing rules
  • validation requirements
  • regulatory overlays
  • business workflows
  • integration profiles
  • identifier technologies

Separating these concerns produces a more extensible architecture.


This also changes the role of validation itself. Validation ceases to be a single monolithic operation and instead becomes compositional and layered.


A processing pipeline might first validate a document against the canonical schema simply to ensure structural integrity and vocabulary correctness. It could then apply one or more secondary overlays depending on context:

  • invoice profile
  • regional compliance profile
  • accounting profile
  • industry profile
  • workflow-stage profile
  • trading-partner profile

Different overlays may be combined sequentially or compositionally over the same canonical Infoset.


This begins to resemble compiler architectures more than traditional document validation systems. The canonical XML document functions almost like an abstract syntax tree or intermediate representation. Overlay schemas behave like successive typing environments or semantic passes operating over the same underlying structure.


The XML Infoset itself may remain unchanged throughout this process. What changes instead are the typing annotations, constraints, and augmented interpretations applied to it. Different overlays therefore produce different typed projections over the same canonical document tree.


This architecture also reduces pressure to constantly revise standards. In many standards efforts, schema evolution becomes contentious because every new requirement appears to demand modification of the core specification. Under an overlay model, many new requirements can instead be accommodated through additional secondary profiles without destabilizing the foundational structure.


The model scales especially well for long-lived enterprise ecosystems in which historical compatibility matters. Legacy overlays can continue existing alongside modern overlays indefinitely. New identifier technologies can be introduced incrementally. Specialized industry constraints can coexist with broader interoperability vocabularies.


The approach also aligns naturally with technologies such as RELAX NG, Schematron, Genericode, and XProc.

RELAX NG is particularly attractive because it is structurally oriented and less tightly coupled to intrinsic type systems than XSD. Schematron complements this by expressing contextual business rules orthogonally to grammatical structure. XProc provides a natural orchestration framework for selecting and applying overlays dynamically during processing. Genericode similarly externalizes code lists, the sets of permitted values for coded fields, so that they can be maintained and versioned apart from both the schema and the instance document.


Taken together, these technologies support an architecture based on canonical structure plus externally selected interpretation layers.
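
To make the structural role concrete, here is a rough sketch of the canonical business document from earlier expressed as a RELAX NG grammar in XML syntax. It assumes no namespace and treats every leaf as plain text, leaving all lexical constraints to overlays.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative structure-only grammar: no datatypes, no business rules. -->
<element name="BusinessDocument" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="DocumentID"><text/></element>
  <element name="DocumentDate"><text/></element>
  <element name="Sender">
    <element name="PartyID"><text/></element>
  </element>
  <element name="Receiver">
    <element name="PartyID"><text/></element>
  </element>
  <element name="Items">
    <oneOrMore>
      <element name="Item">
        <element name="SKU"><text/></element>
        <element name="Quantity"><text/></element>
        <element name="UnitPrice"><text/></element>
      </element>
    </oneOrMore>
  </element>
  <element name="Total"><text/></element>
</element>
```

Nothing in this grammar says what kind of document it describes or what a valid identifier looks like, which is the point: those decisions are deferred to whichever layer is applied next.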


This differs fundamentally from object-oriented notions of intrinsic identity and polymorphism. The document does not inherently “know” what it is. Instead, different processing contexts project different interpretations onto a stable serialized representation.


In that sense, the architecture resembles older overlay-oriented systems such as COBOL REDEFINES, though operating at the level of document typing and validation rather than raw memory layouts. The same underlying representation can participate in multiple alternate typing environments depending on externally applied profiles.


The result is an XML architecture optimized not for static completeness but for long-term adaptability. The canonical structure becomes durable infrastructure. The overlays become the evolving ecosystem surrounding it.


Rather than treating schemas as rigid definitions of absolute document identity, the system treats schemas as reusable, composable interpretation layers operating over a stable canonical representation.


That shift in perspective can produce XML systems that are significantly more extensible, evolvable, and resilient over time than conventional monolithic schema architectures.


May 2026 

Thursday, 28 May 2026

Methods for late-binding specific constraints to a more generic XML business document structure

 The COBOL REDEFINES facility and the Natural REDEFINE facility are often described superficially as forms of polymorphism, but this is not really accurate in the modern software engineering sense. In object-oriented programming, polymorphism refers primarily to the ability for one interface or operation to exhibit different behavior depending on the type of the object involved. COBOL REDEFINES is instead fundamentally about alternate interpretations of the same underlying storage. A region of bytes can be viewed through multiple different structural definitions, each imposing different datatype constraints, field boundaries, and decoding rules. The underlying storage does not change; only the interpretation changes.


This distinction becomes particularly interesting when considering XML and schema languages. XML is normally understood as a self-describing, intrinsically typed document format. An XML document is generally expected to carry enough structural information that validators and processors know what interpretation applies. XML Schema Definition (XSD), especially, strongly encourages this model through namespaces, global element declarations, type derivation, and xsi:type. Yet there is another possible architectural direction, one that resembles COBOL REDEFINES not at the level of memory bytes and offsets, but at the level of canonical serialized document structures.


Consider a generic business document structure:


<BusinessDocument>
    <DocumentID>...</DocumentID>
    <DocumentDate>...</DocumentDate>
    <Sender>...</Sender>
    <Receiver>...</Receiver>
    <DocumentItem>...</DocumentItem>
</BusinessDocument>


One profile might interpret this as an order, constraining DocumentID to be an alphanumeric token. Another profile might interpret it as an automatically generated invoice, constraining DocumentID to be a UUID. A third might interpret it as a manually generated invoice, allowing special characters and whitespace in DocumentID. Importantly, the semantics are not the most interesting aspect here. The crucial idea is that the same serialized byte stream is subjected to different typing and validation overlays depending on context.


At first glance, this might appear similar to XML polymorphism through xsi:type. One could imagine something like:


<DocumentID xsi:type="UUIDInvoiceID">550e8400-e29b-41d4-a716-446655440000</DocumentID>


or even assigning UUID-like type identifiers analogous to COM class IDs or GUIDs. A schema could define many such types and validators could dispatch accordingly. Technically this works, because xsi:type enables runtime substitution of derived types. However, this is not especially satisfying architecturally because the type binding is embedded directly into the instance document itself. The document effectively self-declares its interpretation. This is early-bound, intrinsic typing rather than externally applied interpretation.
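
For reference, the xsi:type route needs two things in place: the type named in the instance must be derived from the element's declared type, and the instance must declare the XML Schema instance namespace. A minimal sketch of the schema side, assuming no target namespace and the usual 8-4-4-4-12 UUID layout:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- The element is declared with the general type... -->
  <xs:element name="DocumentID" type="xs:string"/>

  <!-- ...and the profile-specific type restricts that same base. -->
  <xs:simpleType name="UUIDInvoiceID">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

and of the instance side:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The instance itself names the type it wants to be validated as. -->
<DocumentID xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:type="UUIDInvoiceID">550e8400-e29b-41d4-a716-446655440000</DocumentID>
```

The value sits on a single line because the type restricts xs:string, which preserves whitespace; leading or trailing line breaks would make the pattern fail.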


A more interesting possibility is externalized or late-bound schema application. In this model, the XML instance remains structurally neutral. The interpretation is selected later by the processing environment. One simple mechanism for this would be XML processing instructions:


<?document-profile uuid-invoice?>


<BusinessDocument>
    <DocumentID>550e8400-e29b-41d4-a716-446655440000</DocumentID>
</BusinessDocument>


A processing pipeline could examine the processing instruction, select the appropriate schema set, and validate the document accordingly. Another processing instruction could select a different schema profile entirely:


<?document-profile manual-invoice?>


In this architecture, the document itself does not intrinsically carry its type identity. Instead, type interpretation is externalized into the processing pipeline. This begins to resemble COBOL REDEFINES much more closely, though at the level of serialized document interpretation rather than memory overlays.


Importantly, the XML Infoset itself does not necessarily change in this process. The XML Information Set is simply the parsed abstract representation of the document: elements, attributes, namespaces, character data, processing instructions, and so on. The underlying Infoset may remain identical while different validation profiles impose different typing overlays on top of it. What changes instead is the Post-Schema-Validation Infoset, or PSVI. The same canonical Infoset may yield multiple alternative PSVIs depending on which schema set is applied. One schema may annotate DocumentID as a UUID type. Another may annotate it as a legacy invoice identifier. Yet another may classify it as a free-form manually entered identifier. In this sense, the same parsed document tree acquires different type projections depending on externally selected constraints.


This architectural direction begins to resemble the approach taken by OASIS Genericode. Genericode is a standard XML format for representing code lists, and it lets the permitted values of coded fields be defined and maintained outside both the schema and the instance document. The XML itself remains relatively generic while the interpretation of its values is supplied externally. This is philosophically very different from mainstream XSD-centric XML architectures, which generally assume that documents are intrinsically typed and self-identifying through namespaces and schema bindings.


The resemblance to older SGML architectural ideas is also striking. SGML often treated structure, validation, semantics, rendering, and processing context as distinct layers rather than collapsing them together into a single intrinsic type system. Modern XML tooling often conflated these concerns through namespaces and schema declarations. What emerges here instead is a layered architecture in which the XML instance is merely a canonical syntax tree and interpretation is externally projected later through validation overlays.


This becomes especially compelling when considering RELAX NG. RELAX NG was intentionally designed as a simpler, more orthogonal schema language than XSD. It is less tied to intrinsic type systems and more oriented toward structural grammar validation. Through the RELAX NG DTD Compatibility specification, validators can also support default attribute insertion and Infoset augmentation. Modern validators such as Jing can therefore not only validate a document but potentially augment the Infoset by inserting defaulted attributes and annotations.


This changes the nature of schema processing considerably. Schemas cease to be merely passive accept-or-reject grammars and instead begin acting as active overlay transformations. One RELAX NG profile might inject attributes identifying a UUID invoice profile, while another profile might inject attributes corresponding to manually generated invoices. The same canonical XML document can therefore produce different augmented Infosets depending on which profile is externally applied.
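
As a rough illustration, a UUID invoice profile of this kind could be written as follows. It uses the a:defaultValue annotation from the RELAX NG DTD Compatibility specification; the attribute name profile and its value are hypothetical, and whether a given validator actually exposes the defaulted attribute to downstream processing is implementation-specific.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical UUID-invoice profile: validates DocumentID and declares
     a defaulted "profile" attribute on the root element. -->
<element name="BusinessDocument"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <optional>
    <!-- A processor supporting DTD Compatibility may add
         profile="uuid-invoice" when the attribute is absent. -->
    <attribute name="profile" a:defaultValue="uuid-invoice">
      <value>uuid-invoice</value>
    </attribute>
  </optional>
  <element name="DocumentID">
    <data type="string">
      <param name="pattern">[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}</param>
    </data>
  </element>
</element>
```

A manual-invoice profile would have the same shape with a different default value and a looser pattern, so the same instance would come out of each profile carrying a different annotation.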


At this point the architecture begins to resemble compiler pipelines more than ordinary XML validation. One can imagine the following processing flow:


Canonical XML document → processing instruction or external context → profile selection → RELAX NG validation and augmentation → augmented Infoset → downstream typed interpretation.


This is essentially late-bound schema projection. The XML document itself remains minimally typed and structurally stable while multiple external overlays provide different validation, augmentation, and interpretation layers. XSD does not naturally support this style of architecture because it fundamentally assumes that type identity is largely intrinsic to the document itself. RELAX NG, Schematron, Genericode, and XProc together form a much more flexible ecosystem for this kind of externally projected typing model.


XProc is particularly well suited to this approach because it was designed specifically as a pipeline orchestration language for XML processing. An XProc pipeline can inspect processing instructions, select schema profiles dynamically, invoke RELAX NG validators, apply Schematron rules, augment Infosets, and route documents through multiple validation overlays. This makes it possible to construct sophisticated multi-stage interpretation systems in which schema binding occurs operationally at runtime rather than being statically embedded into the document.
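
A minimal XProc 1.0 sketch of that idea is shown below. It branches on the document-profile processing instruction and validates against a matching RELAX NG schema. The schema file names, the ex namespace, and the error code are all hypothetical, and the pipeline has not been tuned for any particular XProc processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:ex="http://example.org/errors"
                version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <p:choose>
    <!-- Profile selected by the processing instruction before the root. -->
    <p:when test="/processing-instruction('document-profile') = 'uuid-invoice'">
      <p:validate-with-relax-ng>
        <p:input port="schema">
          <p:document href="uuid-invoice.rng"/>
        </p:input>
      </p:validate-with-relax-ng>
    </p:when>
    <p:when test="/processing-instruction('document-profile') = 'manual-invoice'">
      <p:validate-with-relax-ng>
        <p:input port="schema">
          <p:document href="manual-invoice.rng"/>
        </p:input>
      </p:validate-with-relax-ng>
    </p:when>
    <!-- No recognised profile: fail rather than pass the document through. -->
    <p:otherwise>
      <p:error code="ex:unknown-profile">
        <p:input port="source">
          <p:inline>
            <message>No recognised document-profile instruction.</message>
          </p:inline>
        </p:input>
      </p:error>
    </p:otherwise>
  </p:choose>
</p:declare-step>
```

The same branching could just as well be driven by a pipeline option or by the channel a document arrived on, which keeps the instance free of any profile marker at all.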


Schematron complements this especially well because it provides rule-based contextual validation rather than merely grammatical validation. A Schematron layer can express assertions such as “if profile equals UUID invoice then DocumentID must match UUID syntax” or “if profile equals manual invoice then DocumentID may contain special characters.” This creates an architecture where canonical XML structure is separated from contextual validation overlays, much like alternate record definitions in older mainframe systems.
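
Those two assertions might be written roughly as follows, assuming the profile is signalled by a document-profile processing instruction and that the Schematron implementation supports the xslt2 query binding (needed here for matches()). The pattern ids and messages are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt2">

  <!-- Fires only when the document is marked as a UUID invoice. -->
  <sch:pattern id="uuid-invoice-profile">
    <sch:rule context="DocumentID[/processing-instruction('document-profile') = 'uuid-invoice']">
      <sch:assert test="matches(., '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$')">
        DocumentID must be a UUID in the uuid-invoice profile.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <!-- Fires only when the document is marked as a manual invoice. -->
  <sch:pattern id="manual-invoice-profile">
    <sch:rule context="DocumentID[/processing-instruction('document-profile') = 'manual-invoice']">
      <sch:assert test="matches(., '^[\p{L}\p{N}#/\- ]+$')">
        DocumentID may contain only letters, digits, '#', '/', '-' and spaces in the manual-invoice profile.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Because each rule is guarded by the profile test, a single Schematron file can carry several overlays at once, and a document with no recognised profile simply triggers none of them.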


The result is an XML architecture that behaves surprisingly similarly in spirit to COBOL REDEFINES. The mechanisms are entirely different. COBOL operates at the level of contiguous storage and byte reinterpretation. XML operates at the level of abstract syntax trees and externally applied typing overlays. Yet the underlying conceptual pattern is remarkably similar: a stable representation subjected to multiple alternate typed interpretations selected according to context.


In this model, schemas become overlays rather than absolute type definitions. Validation becomes projection rather than intrinsic identity checking. XML documents cease to be fully self-describing objects and instead become canonical serialized forms onto which multiple interpretation layers can later be applied. That architectural direction is arguably much closer to sophisticated data processing systems, compiler pipelines, and overlay-based interpretation frameworks than to conventional object-oriented XML document models.


Wording by ChatGPT, prompted by Stephen D Green, May 2026