Saturday, 30 May 2026

Canonical XML with Overlay Schemas: A Different Approach to Long-Lived XML Architecture - Part 4 - Public vs Private



Written by ChatGPT, as prompted by Stephen D Green, May 2026 


Suppose an organization has a canonical business document instance:


<BusinessDocument>

    <DocumentID>550e8400-e29b-41d4-a716-446655440000</DocumentID>

    <DocumentDate>2026-05-28</DocumentDate>

    ...

</BusinessDocument>


Within the organization, different departments may apply different validation overlays to that same instance.

The accounting department may validate it against an accounting profile.

The logistics department may validate it against a shipping profile.

The analytics team may apply a data-quality profile.

The archive department may apply records-management constraints.

None of these interpretations is necessarily more "correct" than the others. They are contextual views serving different purposes. The document is being projected into different local validation environments according to the needs of each department.

In this private, internal setting, "rightness" is often determined pragmatically. A schema is useful if it supports the work that needs to be done. Multiple schemas can coexist without difficulty because they are all operating within a single organizational boundary and under a shared authority structure.

The situation changes once the document crosses organizational boundaries.

Suppose the document is transmitted from one legal entity to another. At that point, the document is no longer merely an internal information object. It becomes a shared artifact whose interpretation must be sufficiently stable for independent parties to rely upon it.

This is where the distinction between public and private overlays becomes significant.

A public overlay is not merely a validation profile. It is effectively part of the interoperability contract between independent actors. If a sender claims that a document conforms to a particular public profile, the receiver must be able to validate that claim independently and obtain substantially the same result.

Consequently, public overlays typically require stronger governance, clearer specifications, published conformance criteria, versioning policies, and often formal change-control processes. They become part of the shared infrastructure of communication.

Private overlays, by contrast, can remain opportunistic and context-specific. They may evolve rapidly. They may be undocumented outside the organization. They may encode assumptions that would be inappropriate for external parties. They may even conflict with overlays used elsewhere, without causing problems because they are never exposed beyond their local context.

This suggests that there may actually be three layers rather than two.

The first layer is the canonical core. This defines the stable structural vocabulary shared across the ecosystem.

The second layer consists of public overlays. These represent profiles, standards, industry agreements, regulatory specifications, or contractual constraints that are intended to be shared across organizational boundaries.

The third layer consists of private overlays. These represent local interpretations, departmental requirements, operational constraints, internal workflows, reporting needs, and other context-specific projections that exist solely within a particular organizational environment.

A document instance may therefore participate simultaneously in all three layers.

It conforms to the canonical structure.

It conforms to one or more public interoperability profiles.

It conforms to a number of private local profiles.
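The three layers can be illustrated with a small Python sketch. Everything named in it is hypothetical and chosen only for illustration: the `Overlay` record, the `public` and `private` visibility labels, and the two sample rules are assumptions, not part of any published profile. Real overlays would more likely be XSD, RELAX NG, or Schematron artifacts.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Overlay:
    name: str
    visibility: str  # "public" or "private" (hypothetical labels)
    check: Callable[[ET.Element], bool]


# 8-4-4-4-12 hexadecimal UUID shape.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def canonical_ok(doc: ET.Element) -> bool:
    """Layer 1: only the shared vocabulary and overall shape are checked."""
    return (
        doc.tag == "BusinessDocument"
        and doc.find("DocumentID") is not None
        and doc.find("DocumentDate") is not None
    )


# Layers 2 and 3: illustrative rules only, not taken from any real profile.
OVERLAYS = [
    Overlay(
        "uuid-invoice-profile",
        "public",
        lambda d: UUID_RE.fullmatch(d.findtext("DocumentID", "")) is not None,
    ),
    Overlay(
        "archive-retention",
        "private",
        # ISO 8601 dates compare correctly as plain strings.
        lambda d: d.findtext("DocumentDate", "") >= "2020-01-01",
    ),
]


def report(doc: ET.Element) -> dict:
    """Validate one instance against every layer and label each result."""
    result = {"canonical": canonical_ok(doc)}
    for overlay in OVERLAYS:
        result[f"{overlay.visibility}:{overlay.name}"] = overlay.check(doc)
    return result


SAMPLE = """<BusinessDocument>
    <DocumentID>550e8400-e29b-41d4-a716-446655440000</DocumentID>
    <DocumentDate>2026-05-28</DocumentDate>
</BusinessDocument>"""

doc = ET.fromstring(SAMPLE)
print(report(doc))
```

Labelling each result with its layer makes it visible which failures would matter to an external party and which are purely local concerns.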

The further one moves from the canonical core toward the private overlays, the less universal the notion of "correctness" becomes. Internal departments may legitimately disagree about which validation rules matter most because they are optimizing for different objectives. Accounting, logistics, legal compliance, customer service, and analytics may each apply different criteria to the same document.

By contrast, the public profiles occupy an intermediate position. Their purpose is precisely to establish a common notion of correctness between parties that do not share the same internal priorities. They provide the basis upon which documents can be exchanged, audited, trusted, and processed across organizational boundaries.

This perspective also helps explain why standards bodies often focus primarily on the canonical structure and a limited number of public profiles rather than attempting to standardize every possible use case. The closer a schema is to a private local concern, the weaker the case for broad standardization becomes. Local actors can often manage those concerns themselves.

Viewed in this way, the architecture is not merely a hierarchy of technical schemas. It is also a hierarchy of social visibility and institutional responsibility. The canonical core belongs to the ecosystem as a whole. Public profiles belong to communities of interoperability. Private overlays belong to individual organizations and departments.

The same XML instance may therefore be interpreted through many different schemas, but the significance of those schemas depends on who is expected to recognize and trust them. Internal departmental schemas derive their authority from local usefulness. Public interoperability schemas derive their authority from shared agreement. The canonical core derives its authority from providing a stable common language within which all of these other interpretations can coexist.

Canonical XML with Overlay Schemas: A Different Approach to Long-Lived XML Architecture - Part 3 - Funding Model



I tend to think that the distinction between one standard, stable core schema and multiple transient extension schemas might also be a difference of funding model. Designing the core, maintaining its stability and the overall architecture, and governing the compliance of extensions with it is long-term work and requires a long-term funding model. Extension work, on the other hand, is by nature possibly experimental, ephemeral, and private to its context, so its funding could simply be part of the private use scenario, perhaps part of the funding of the project in which the extensions are designed and maintained.


ChatGPT responded to this point of view as follows.


“That is a very interesting observation, because it moves the discussion beyond technical architecture into the economics and governance of standards ecosystems. In fact, many successful long-lived standards seem to exhibit exactly this separation, whether it is made explicit or not.

Under this view, the distinction between the canonical core and the overlays is not merely a technical distinction but also an institutional one.

The core schema represents shared infrastructure. It provides value to the entire ecosystem rather than to any single participant. Its purpose is to provide continuity, interoperability, stability, and a common vocabulary. Because many independent parties depend upon it, changes must be managed conservatively and governance must be carefully maintained. The work involved includes not only technical design but also version management, compatibility analysis, dispute resolution, documentation, conformance testing, architectural stewardship, and long-term maintenance.

These activities resemble the maintenance of public infrastructure. The benefits are diffuse and collective. Consequently, the funding model tends to require a long-term perspective. It may be supported by standards organizations, industry consortia, membership subscriptions, government bodies, foundations, or other forms of sustained institutional backing. The key characteristic is that the funding exists to preserve the stability and integrity of the shared platform over many years or even decades.

The extension schemas occupy a very different position. Their purpose is not to provide universal interoperability but to address particular contexts, opportunities, projects, regulatory environments, industries, organizations, or experiments. An extension may emerge because a specific company needs to support a new workflow. Another may exist only for the duration of a regulatory regime. Another may be developed for a pilot project, an internal integration effort, or a particular business partnership.

Such extensions are often valuable even if they never achieve widespread adoption. Their utility is local rather than universal. Consequently, their funding model can also be local. The costs of design, implementation, maintenance, and eventual retirement may simply be borne by the project or organization that benefits from them.

In economic terms, the core behaves somewhat like a public good within the ecosystem, while the extensions behave more like private investments. The return on investment for maintaining the core is distributed across many actors. The return on investment for an extension is typically concentrated within a smaller group of stakeholders.

This separation may actually help explain why the architectural distinction is so powerful. If every new requirement had to be incorporated directly into the core schema, the governance process would become overwhelmed. Every project would seek to embed its own local needs into the shared standard. The core would grow increasingly complex, unstable, and difficult to evolve. Funding pressures would also become confused, because organizations would expect the shared infrastructure to absorb the costs of their specialized requirements.

By contrast, an overlay-oriented architecture creates a natural division of responsibility. The core governance body focuses on maintaining conceptual coherence, interoperability, and long-term stability. Extension authors focus on solving their own immediate problems. The two activities can proceed at different speeds and under different funding arrangements.

The model also accommodates different lifecycles. The core may be expected to survive for decades. Some extensions may survive only months. Others may evolve rapidly through multiple generations. Still others may begin as private experiments and later become sufficiently important that they are standardized and incorporated into the broader ecosystem.

This suggests an evolutionary pathway within the architecture. Extensions can serve as a kind of innovation layer. New ideas can be explored without risking disruption of the shared substrate. Successful patterns can gradually prove themselves through practical use. If an extension eventually demonstrates broad applicability and long-term value, aspects of it may be promoted into the canonical architecture through the normal governance process. Many extensions, however, will never need to be promoted; they can remain local and specialized indefinitely.

Viewed this way, the architecture provides not only technical separation of concerns but also institutional separation of concerns. It allows the stability-oriented activities of standardization and governance to coexist with the experimentation-oriented activities of innovation and adaptation.

The possibility of AI-generated extensions makes this distinction even more compelling. If future systems are capable of generating large numbers of specialized validation overlays, contextual schemas, transformation rules, or domain-specific profiles, it becomes even more important to maintain a carefully governed and stable core. The more rapidly the extension layer evolves, the more valuable the stable substrate becomes.

One could therefore imagine a future ecosystem in which the canonical core remains under long-term human stewardship, supported by durable funding and governance mechanisms, while a vast and continually changing population of extensions is created and maintained by organizations, projects, communities, and perhaps increasingly by AI systems acting on behalf of those stakeholders. The core provides continuity. The extensions provide adaptability. The funding models naturally diverge because the functions they serve within the ecosystem are fundamentally different.

In that sense, the architectural distinction between a stable core and transient overlays may be as much an economic and organizational principle as a technical one. The architecture succeeds not merely because it separates structure from interpretation, but because it aligns technical responsibilities, governance responsibilities, lifecycle expectations, and funding mechanisms into coherent and mutually reinforcing layers.”


May 2026


One XML instance with many possible schemas

I find it understandable why we might want one XML instance to be valid according to one primary structural schema but also many possible secondary schemas, typically one secondary schema for each possible context.


For example, ChatGPT can ‘understand’ it as follows:


“The core schema defines the common platform. The secondary schemas define variants, profiles, configurations, or specializations of that platform. Just as a software product family may contain many product-line variants built upon a shared core architecture, a document ecosystem may contain many contextual validation overlays built upon a shared canonical structure.

Seen this way, the existence of many secondary schemas is not a sign of fragmentation or poor design. It is an expected consequence of separating structural stability from contextual variability. The core schema remains relatively stable because the fundamental concepts of the domain change slowly. The secondary schemas proliferate because contexts, technologies, regulations, workflows, organizations, and business requirements change much more rapidly.

Indeed, in a mature ecosystem one might expect the number of secondary schemas to greatly exceed the number of primary schemas. There may be a single canonical business-document structure but hundreds of contextual overlays accumulated over time. New overlays can be added as new requirements emerge without forcing redesign of the foundational vocabulary.

This perspective also helps explain why the notion of "the schema" can become misleading. In conventional XML discussions, one often speaks as though a document has a single schema that completely defines it. In an overlay-oriented architecture, however, validation becomes a layered process. A document may simultaneously participate in multiple validation regimes, each contributing a different aspect of its interpretation. The primary schema provides structural legitimacy. The secondary schemas provide contextual legitimacy.

The result is a model in which a document is not characterized by one absolute type, but by a combination of structural conformance and contextual conformance. The canonical structure supplies continuity and interoperability, while the overlay schemas supply specialization and adaptability. That division of responsibility is precisely what makes the architecture capable of evolving over long periods without requiring continual redefinition of the core standard itself.”

Friday, 29 May 2026

Canonical XML with Overlay Schemas: A Different Approach to Long-Lived XML Architecture - Part 2



Written by ChatGPT, as prompted by Stephen D Green, May 2026 


One illustrative scenario is to view the architecture through the lens of a Minimal Viable Product Line (MVPL) model. In such a system, there exists a stable, standardized core schema defining only the canonical structure and vocabulary of the domain. This core schema establishes the durable element and attribute names, hierarchical relationships, and broad structural organization of documents, while deliberately avoiding excessive commitment to specialized datatypes or context-specific validation semantics. If XML Schema is used, many elements may intentionally be declared with highly permissive definitions such as xs:anyType, broad string types, or minimally constrained complex structures. The purpose of the core is therefore not to exhaustively define all future interpretations of the data, but rather to provide a stable canonical substrate over which multiple evolving overlays may later operate.


This core schema functions as the “Product” in MVPL terminology. It is governed, maintained, standardized, versioned, and evolved carefully by humans according to a curated understanding of the problem domain. Human stewardship remains critically important at this level because the canonical structure represents long-term semantic and interoperability infrastructure. Decisions at this layer affect the durability and stability of the entire ecosystem. Governance therefore emphasizes restraint, continuity, compatibility, clarity of vocabulary, and preservation of stable conceptual abstractions that can endure through technological and business change.


The overlays or secondary schemas then function as “product lines” extending this core product into many specialized contexts. These secondary schemas may constrain datatypes, impose business rules, introduce workflow-specific validation overlays, or adapt the canonical structure for industry-specific, regional, organizational, or technological requirements. One overlay might constrain a document identifier to UUID syntax. Another may require compatibility with a legacy alphanumeric coding system. Another may add jurisdiction-specific accounting rules. Another may define specialized extensions for automated supply-chain processing or machine-generated transactional records.


Over time, the number of such overlays may grow continuously. The important point is that this growth does not necessarily destabilize the core schema itself. The core remains intentionally conservative and durable while the surrounding ecosystem of overlays evolves dynamically.

This architecture becomes particularly interesting when considering the future role of AI systems in schema generation and maintenance. The canonical core may remain largely under human governance because it embodies long-lived conceptual structure, institutional understanding, interoperability guarantees, and semantic stewardship. However, many of the overlay schemas may eventually become partially or substantially machine-generated.


An AI system could, for example, analyze emerging business processes, integration patterns, regulatory changes, industry conventions, or observed document populations and automatically synthesize secondary validation overlays appropriate to particular contexts. It could generate datatype constraints, contextual validation rules, compatibility layers, transformation mappings, or profile-specific augmentations without requiring redesign of the canonical substrate itself.

In such a system, the core schema effectively becomes a stable semantic platform maintained under careful human governance, while the overlay ecosystem becomes adaptive, proliferating, and increasingly automated. AI systems may create, refine, merge, specialize, deprecate, or dynamically select overlays according to operational context, workflow stage, trading-partner requirements, or regulatory environment.


This creates an architectural separation not merely between structure and semantics, but also potentially between human-governed stability and machine-generated adaptability. Humans curate the enduring conceptual infrastructure. AI systems increasingly manage the combinatorial proliferation of contextual overlays surrounding it.


The result resembles a layered ecosystem rather than a monolithic schema hierarchy. The canonical XML structure behaves almost like a durable intermediate representation or abstract syntax tree shared across an evolving landscape of interpretation systems. Overlay schemas become modular semantic projections capable of evolving independently of the foundational vocabulary.


Such an architecture may prove especially valuable in domains characterized by long time horizons, regulatory complexity, heterogeneous integration environments, and rapid contextual evolution. The stable core preserves continuity and interoperability. The overlays provide adaptability and specialization. AI systems may eventually operate most effectively not by redefining the core conceptual substrate itself, but by continuously generating and refining the surrounding layers of contextual interpretation.

In this sense, the architecture begins to resemble not merely traditional XML validation systems, but broader platform ecosystems in which a carefully governed canonical substrate supports an open-ended and dynamically evolving constellation of specialized extensions.


May 2026 

Canonical XML with Overlay Schemas: A Different Approach to Long-Lived XML Architecture



Written by ChatGPT, as prompted by Stephen D Green, May 2026 


Most XML architectures built around XML Schema Definition (XSD) assume that a document intrinsically declares its own type and validation semantics. The schema normally defines not only the structural vocabulary of the document, but also its datatypes, lexical constraints, validation rules, and often much of its business meaning. In practice this tends to produce tightly coupled document models in which structure and interpretation evolve together.


That approach works reasonably well for relatively static systems, but it becomes problematic for long-lived standards and enterprise ecosystems. Business requirements evolve continually. Identifier formats change. Regulatory overlays appear and disappear. New workflow contexts emerge. Legacy integration requirements accumulate. Different industries and trading partners introduce specialized constraints. Yet the underlying business structures often remain surprisingly stable for decades.


A purchase order, invoice, shipping notice, payment instruction, or inventory report may continue to contain essentially the same conceptual elements year after year: identifiers, dates, parties, items, totals, quantities, addresses, and references. The structure changes slowly. The interpretation layers change continuously.


This suggests a different architectural approach: separate the stable canonical structure from the evolving validation overlays.


Instead of designing a monolithic schema that attempts to define every possible constraint and interpretation upfront, the architecture can be divided into two layers:

  1. A stable canonical schema defining only the structural vocabulary and document shape.
  2. Secondary overlay schemas defining contextual datatype constraints, validation profiles, and specialized interpretations.


The result is an XML architecture based on canonical structure plus externally applied validation overlays rather than intrinsic monolithic typing.


Consider a simple canonical business document:

<BusinessDocument>

    <DocumentID>...</DocumentID>

    <DocumentDate>...</DocumentDate>


    <Sender>

        <PartyID>...</PartyID>

    </Sender>


    <Receiver>

        <PartyID>...</PartyID>

    </Receiver>


    <Items>

        <Item>

            <SKU>...</SKU>

            <Quantity>...</Quantity>

            <UnitPrice>...</UnitPrice>

        </Item>

    </Items>


    <Total>...</Total>

</BusinessDocument>

The canonical schema intentionally imposes only minimal constraints. It establishes the vocabulary and overall tree structure while avoiding premature commitment to highly specific datatypes or business rules.


For example, the base schema may define:

<xs:element name="DocumentID" type="xs:string"/>

rather than immediately constraining the identifier format.

Similarly, portions of the structure may even use highly permissive definitions:

<xs:element name="Items" type="xs:anyType"/>

The purpose of this schema is not to define the final interpretation of the document. Instead, it defines the stable interchange substrate shared across many future contexts.

Secondary schemas then provide progressively more specialized overlays.


One overlay schema may define an automatically generated invoice profile in which DocumentID must be a UUID:

<xs:simpleType name="UUIDDocumentID">

    <xs:restriction base="xs:string">

        <xs:pattern value="[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"/>

    </xs:restriction>

</xs:simpleType>

Another overlay may define a manually generated invoice profile allowing more flexible identifiers:

<xs:simpleType name="ManualInvoiceID">

    <xs:restriction base="xs:string">

        <xs:pattern value="[\p{L}\p{N}#/\- ]+"/>

    </xs:restriction>

</xs:simpleType>

Another overlay may define a legacy order-processing profile:

<xs:simpleType name="LegacyOrderID">

    <xs:restriction base="xs:string">

        <xs:pattern value="[A-Z]{3}-[0-9]+"/>

    </xs:restriction>

</xs:simpleType>

The same canonical XML structure can therefore participate in multiple distinct validation environments without requiring redesign of the foundational document vocabulary.
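To make the idea concrete, here is a minimal Python sketch that tests the same DocumentID value against the three overlay profiles. It is an illustration under stated assumptions rather than a real validator: the function names are hypothetical, the UUID check uses an 8-4-4-4-12 hexadecimal shape, and the manual-invoice check only approximates the XSD pattern because Python's `re` module has no `\p{L}` or `\p{N}` classes.

```python
import re

# XSD patterns are implicitly anchored, so re.fullmatch is used throughout.
UUID_DOCUMENT_ID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
LEGACY_ORDER_ID = re.compile(r"[A-Z]{3}-[0-9]+")


def is_uuid_document_id(value: str) -> bool:
    return UUID_DOCUMENT_ID.fullmatch(value) is not None


def is_manual_invoice_id(value: str) -> bool:
    # Approximation of [\p{L}\p{N}#/\- ]+ : str.isalnum() stands in for
    # "Unicode letter or number" (an assumption of this sketch).
    return bool(value) and all(ch.isalnum() or ch in "#/- " for ch in value)


def is_legacy_order_id(value: str) -> bool:
    return LEGACY_ORDER_ID.fullmatch(value) is not None


# Overlay name -> check applied to the text of <DocumentID>.
OVERLAYS = {
    "UUIDDocumentID": is_uuid_document_id,
    "ManualInvoiceID": is_manual_invoice_id,
    "LegacyOrderID": is_legacy_order_id,
}


def accepting_overlays(document_id: str) -> list:
    """Return the names of all overlays under which the identifier is valid."""
    return [name for name, check in OVERLAYS.items() if check(document_id)]


for candidate in (
    "550e8400-e29b-41d4-a716-446655440000",
    "ABC-1042",
    "INV #7/2026",
):
    print(candidate, "->", accepting_overlays(candidate))
```

Note that one value can satisfy more than one overlay at once: the permissive manual profile also accepts identifiers that happen to be UUIDs or legacy order numbers. The overlays are independent views of the same text, not mutually exclusive types.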


The important point is that the canonical structure remains stable while secondary overlays proliferate over time. As technology evolves, additional overlays can be introduced without destabilizing the underlying interchange model.

This resembles what might be called a Minimal Viable Product Line (MVPL) architecture. The canonical schema functions as the minimal viable core platform. Overlay schemas function as product-line extensions layered onto that stable substrate.

The core vocabulary remains durable and interoperable. Specialized interpretations become modular extensions rather than modifications to the base model.


This differs significantly from conventional XSD-centric design philosophy. Traditional XML Schema design often tightly couples structure and semantics into a single integrated type hierarchy. Documents become heavily self-describing and intrinsically typed. Such systems frequently become brittle as requirements evolve because every new business case pressures the core schema itself.


The overlay approach instead embraces the idea that structure and interpretation evolve at different rates.


The structural concepts of documents are relatively stable:

  • identifiers
  • parties
  • dates
  • items
  • totals
  • references

What changes continuously are:

  • lexical constraints
  • processing rules
  • validation requirements
  • regulatory overlays
  • business workflows
  • integration profiles
  • identifier technologies

Separating these concerns produces a more extensible architecture.


This also changes the role of validation itself. Validation ceases to be a single monolithic operation and instead becomes compositional and layered.


A processing pipeline might first validate a document against the canonical schema simply to ensure structural integrity and vocabulary correctness. It could then apply one or more secondary overlays depending on context:

  • invoice profile
  • regional compliance profile
  • accounting profile
  • industry profile
  • workflow-stage profile
  • trading-partner profile

Different overlays may be combined sequentially or compositionally over the same canonical Infoset.
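A layered pipeline of this kind might be sketched in Python as follows. The set of required elements, the two overlay rules, the `ACME-` prefix, and the context names are all hypothetical; they stand in for whatever a real canonical schema and real profiles would define.

```python
import xml.etree.ElementTree as ET

# Assumed canonical shape: these top-level children must be present.
REQUIRED_CHILDREN = ("DocumentID", "DocumentDate", "Sender", "Receiver", "Items", "Total")


def validate_canonical(doc: ET.Element) -> list:
    """Structural pass: vocabulary and shape only, no datatypes or business rules."""
    errors = []
    if doc.tag != "BusinessDocument":
        errors.append(f"unexpected root element: {doc.tag}")
    for name in REQUIRED_CHILDREN:
        if doc.find(name) is None:
            errors.append(f"missing element: {name}")
    return errors


def invoice_profile(doc: ET.Element) -> list:
    # Hypothetical rule: an invoice carries at least one Item.
    if doc.findall("Items/Item"):
        return []
    return ["invoice profile: at least one Item is required"]


def trading_partner_profile(doc: ET.Element) -> list:
    # Hypothetical rule: this partner requires ACME-prefixed sender identifiers.
    if doc.findtext("Sender/PartyID", "").startswith("ACME-"):
        return []
    return ["trading-partner profile: Sender/PartyID must start with ACME-"]


# Overlays are chosen by processing context and applied in order.
PROFILES_BY_CONTEXT = {
    "invoicing": [invoice_profile],
    "partner-exchange": [invoice_profile, trading_partner_profile],
}


def run_pipeline(xml_text: str, context: str) -> list:
    doc = ET.fromstring(xml_text)
    errors = validate_canonical(doc)
    if errors:
        return errors  # overlays assume a structurally valid document
    for overlay in PROFILES_BY_CONTEXT[context]:
        errors.extend(overlay(doc))
    return errors


SAMPLE = """<BusinessDocument>
  <DocumentID>ABC-1042</DocumentID>
  <DocumentDate>2026-05-28</DocumentDate>
  <Sender><PartyID>ACME-001</PartyID></Sender>
  <Receiver><PartyID>GLOBEX-002</PartyID></Receiver>
  <Items><Item><SKU>W-1</SKU><Quantity>2</Quantity><UnitPrice>9.50</UnitPrice></Item></Items>
  <Total>19.00</Total>
</BusinessDocument>"""

print(run_pipeline(SAMPLE, "partner-exchange"))
```

Stopping after a failed structural pass is a design choice in this sketch: it lets each overlay assume the canonical shape instead of re-checking it.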


This begins to resemble compiler architectures more than traditional document validation systems. The canonical XML document functions almost like an abstract syntax tree or intermediate representation. Overlay schemas behave like successive typing environments or semantic passes operating over the same underlying structure.


The XML Infoset itself may remain unchanged throughout this process. What changes instead are the typing annotations, constraints, and augmented interpretations applied to it. Different overlays therefore produce different typed projections over the same canonical document tree.
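The notion of typed projections over an unchanged tree can be sketched in a few lines of Python. The two projection tables and the `project` helper are hypothetical; the sketch mimics the idea only loosely and is not an implementation of schema-aware typing as an XSD processor would perform it.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

# Each projection maps an element path to the type it assigns (illustrative).
ACCOUNTING_PROJECTION = {"DocumentDate": date.fromisoformat, "Total": Decimal}
ARCHIVE_PROJECTION = {"DocumentID": str, "DocumentDate": str}


def project(doc: ET.Element, projection: dict) -> dict:
    """Return typed values for the selected paths; the tree is only read, never modified."""
    typed = {}
    for path, convert in projection.items():
        text = doc.findtext(path)
        if text is None:
            raise ValueError(f"missing element: {path}")
        typed[path] = convert(text.strip())
    return typed


SAMPLE = """<BusinessDocument>
  <DocumentID>ABC-1042</DocumentID>
  <DocumentDate>2026-05-28</DocumentDate>
  <Total>19.00</Total>
</BusinessDocument>"""

doc = ET.fromstring(SAMPLE)
accounting_view = project(doc, ACCOUNTING_PROJECTION)
archive_view = project(doc, ARCHIVE_PROJECTION)
print(accounting_view)
print(archive_view)
```

The accounting view sees DocumentDate as a calendar date and Total as a decimal amount, while the archive view treats DocumentDate as an opaque string. Both are derived from the same parsed document.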


This architecture also reduces pressure to constantly revise standards. In many standards efforts, schema evolution becomes contentious because every new requirement appears to demand modification of the core specification. Under an overlay model, many new requirements can instead be accommodated through additional secondary profiles without destabilizing the foundational structure.


The model scales especially well for long-lived enterprise ecosystems in which historical compatibility matters. Legacy overlays can continue existing alongside modern overlays indefinitely. New identifier technologies can be introduced incrementally. Specialized industry constraints can coexist with broader interoperability vocabularies.


The approach also aligns naturally with technologies such as RELAX NG, Schematron, Genericode, and XProc.

RELAX NG is particularly attractive because it is structurally oriented and treats datatypes as a pluggable library rather than as part of the grammar, which couples it less tightly to an intrinsic type system than XSD. Schematron complements this by expressing contextual business rules as assertions that are orthogonal to grammatical structure. XProc provides a natural orchestration framework for selecting and applying overlays dynamically during processing. Genericode similarly externalizes code lists and their associated metadata, keeping enumerated values out of the schema and the instance document itself.
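As a rough illustration of a business rule that is orthogonal to grammatical structure, the following Python sketch checks a cross-field constraint of the kind Schematron is typically used for. The rule itself (Total must equal the sum of Quantity times UnitPrice) and the function name are assumptions made for this example; in practice such a rule would more likely be written as a Schematron assertion.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def total_matches_items(doc: ET.Element) -> bool:
    """Hypothetical rule: Total equals the sum of Quantity * UnitPrice over all Items."""
    line_sum = sum(
        (
            Decimal(item.findtext("Quantity")) * Decimal(item.findtext("UnitPrice"))
            for item in doc.findall("Items/Item")
        ),
        Decimal("0"),
    )
    return line_sum == Decimal(doc.findtext("Total"))


SAMPLE = """<BusinessDocument>
  <Items>
    <Item><SKU>W-1</SKU><Quantity>2</Quantity><UnitPrice>9.50</UnitPrice></Item>
    <Item><SKU>W-2</SKU><Quantity>1</Quantity><UnitPrice>5.25</UnitPrice></Item>
  </Items>
  <Total>24.25</Total>
</BusinessDocument>"""

doc = ET.fromstring(SAMPLE)
print(total_matches_items(doc))
```

The rule says nothing about element order or nesting beyond the paths it reads, so it can be added, changed, or retired without touching the structural schema.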


Taken together, these technologies support an architecture based on canonical structure plus externally selected interpretation layers.


This differs fundamentally from object-oriented notions of intrinsic identity and polymorphism. The document does not inherently “know” what it is. Instead, different processing contexts project different interpretations onto a stable serialized representation.


In that sense, the architecture resembles older overlay-oriented systems such as COBOL REDEFINES, though operating at the level of document typing and validation rather than raw memory layouts. The same underlying representation can participate in multiple alternate typing environments depending on externally applied profiles.


The result is an XML architecture optimized not for static completeness but for long-term adaptability. The canonical structure becomes durable infrastructure. The overlays become the evolving ecosystem surrounding it.


Rather than treating schemas as rigid definitions of absolute document identity, the system treats schemas as reusable, composable interpretation layers operating over a stable canonical representation.


That shift in perspective can produce XML systems that are significantly more extensible, evolvable, and resilient over time than conventional monolithic schema architectures.


May 2026