Thursday, 28 May 2026

Methods for late-binding specific constraints to a more generic XML business document structure

 The COBOL REDEFINES facility and the Natural REDEFINE facility are often described superficially as forms of polymorphism, but this is not really accurate in the modern software engineering sense. In object-oriented programming, polymorphism refers primarily to the ability for one interface or operation to exhibit different behavior depending on the type of the object involved. COBOL REDEFINES is instead fundamentally about alternate interpretations of the same underlying storage. A region of bytes can be viewed through multiple different structural definitions, each imposing different datatype constraints, field boundaries, and decoding rules. The underlying storage does not change; only the interpretation changes.
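To make the storage-overlay idea concrete, here is a minimal Python sketch that reads one fixed byte string through two different layouts. The record content and the two layouts are invented for illustration; they only approximate what a COBOL REDEFINES clause does and are not taken from any real copybook.

```python
import struct

# Hypothetical 10-byte record standing in for a COBOL storage area.
RECORD = b"2026052800"


def view_as_text(record: bytes) -> str:
    """First layout: one 10-character alphanumeric field, like PIC X(10)."""
    (field,) = struct.unpack("=10s", record)
    return field.decode("ascii")


def view_as_date(record: bytes) -> tuple:
    """Second layout over the same bytes: year, month, day, 2-character suffix."""
    year, month, day, suffix = struct.unpack("=4s2s2s2s", record)
    return (
        int(year.decode("ascii")),
        int(month.decode("ascii")),
        int(day.decode("ascii")),
        suffix.decode("ascii"),
    )
```

Neither function modifies RECORD; each one only imposes its own field boundaries and decoding rules on the same bytes.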


This distinction becomes particularly interesting when considering XML and schema languages. XML is normally understood as a self-describing, intrinsically typed document format. An XML document is generally expected to carry enough structural information that validators and processors know what interpretation applies. XML Schema Definition (XSD), especially, strongly encourages this model through namespaces, global element declarations, type derivation, and xsi:type. Yet there is another possible architectural direction, one that resembles COBOL REDEFINES not at the level of memory bytes and offsets, but at the level of canonical serialized document structures.


Consider a generic business document structure:

<BusinessDocument>
    <DocumentID>...</DocumentID>
    <DocumentDate>...</DocumentDate>
    <Sender>...</Sender>
    <Receiver>...</Receiver>
    <DocumentItem>...</DocumentItem>
</BusinessDocument>


One profile might interpret this as an order, constraining DocumentID to be an alphanumeric token. Another profile might interpret it as an automatically generated invoice, constraining DocumentID to be a UUID. A third might interpret it as a manually generated invoice, allowing special characters and whitespace in DocumentID. Importantly, the semantics are not the most interesting aspect here. The crucial idea is that the same serialized byte stream is subjected to different typing and validation overlays depending on context.
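The three interpretations of DocumentID could be sketched as a table of per-profile patterns applied to the same string. The profile names and the exact regular expressions below are assumptions chosen for illustration; a real deployment would define its own.

```python
import re

# Hypothetical profile names mapped to the DocumentID constraint each one imposes.
PROFILES = {
    # Order: alphanumeric token only.
    "order": re.compile(r"[A-Za-z0-9]+"),
    # Automatically generated invoice: textual UUID form (8-4-4-4-12 hex digits).
    "uuid-invoice": re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
    # Manually generated invoice: any non-empty single-line text.
    "manual-invoice": re.compile(r".+"),
}


def is_valid_document_id(value: str, profile: str) -> bool:
    """Check the same DocumentID text against the overlay chosen by `profile`."""
    return PROFILES[profile].fullmatch(value) is not None
```

The value being checked never changes; only the overlay selected by the caller does, which is the point of the comparison with REDEFINES.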


At first glance, this might appear similar to XML polymorphism through xsi:type. One could imagine something like:

<DocumentID xsi:type="UUIDInvoiceID">550e8400-e29b-41d4-a716-446655440000</DocumentID>


or even assigning UUID-like type identifiers analogous to COM class IDs or GUIDs. A schema could define many such types and validators could dispatch accordingly. Technically this works, because xsi:type enables runtime substitution of derived types. However, this is not especially satisfying architecturally because the type binding is embedded directly into the instance document itself. The document effectively self-declares its interpretation. This is early-bound, intrinsic typing rather than externally applied interpretation.


A more interesting possibility is externalized or late-bound schema application. In this model, the XML instance remains structurally neutral. The interpretation is selected later by the processing environment. One simple mechanism for this would be XML processing instructions:

<?document-profile uuid-invoice?>
<BusinessDocument>
    <DocumentID>550e8400-e29b-41d4-a716-446655440000</DocumentID>
</BusinessDocument>


A processing pipeline could examine the processing instruction, select the appropriate schema set, and validate the document accordingly. Another processing instruction could select a different schema profile entirely:


<?document-profile manual-invoice?>


In this architecture, the document itself does not intrinsically carry its type identity. Instead, type interpretation is externalized into the processing pipeline. This begins to resemble COBOL REDEFINES much more closely, though at the level of serialized document interpretation rather than memory overlays.
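As a rough sketch of how a pipeline stage might read the document-profile processing instruction, the following uses Python's standard xml.dom.minidom, which keeps processing instructions that appear before the root element as children of the document node. The function name select_profile and its default argument are hypothetical.

```python
from xml.dom import minidom, Node


def select_profile(xml_text: str, default=None):
    """Return the data of the first <?document-profile ...?> instruction, if any."""
    doc = minidom.parseString(xml_text)
    for node in doc.childNodes:
        if (
            node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
            and node.target == "document-profile"
        ):
            return node.data.strip()
    # No instruction present: the caller decides how to bind a profile.
    return default
```

Returning a default rather than raising leaves room for the other option the text mentions, where the profile comes from external context instead of from a processing instruction.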


Importantly, the XML Infoset itself does not necessarily change in this process. The XML Information Set is simply the parsed abstract representation of the document: elements, attributes, namespaces, character data, processing instructions, and so on. The underlying Infoset may remain identical while different validation profiles impose different typing overlays on top of it. What changes instead is the Post-Schema-Validation Infoset, or PSVI. The same canonical Infoset may yield multiple alternative PSVIs depending on which schema set is applied. One schema may annotate DocumentID as a UUID type. Another may annotate it as a legacy invoice identifier. Yet another may classify it as a free-form manually entered identifier. In this sense, the same parsed document tree acquires different type projections depending on externally selected constraints.


This architectural direction begins to resemble the way OASIS Genericode is typically deployed. Genericode is a standard XML format for representing code lists outside the schemas that use them; paired with a mechanism such as the OASIS Context/Value Association (CVA) methodology, it lets value constraints be bound to particular document contexts externally rather than being compiled into the schema. The XML instance itself remains relatively generic while interpretation rules are supplied externally through profiles, code lists, and constraint layers. This is philosophically very different from mainstream XSD-centric XML architectures, which generally assume that documents are intrinsically typed and self-identifying through namespaces and schema bindings.


The resemblance to older SGML architectural ideas is also striking. SGML often treated structure, validation, semantics, rendering, and processing context as distinct layers rather than collapsing them together into a single intrinsic type system. Modern XML tooling often conflated these concerns through namespaces and schema declarations. What emerges here instead is a layered architecture in which the XML instance is merely a canonical syntax tree and interpretation is externally projected later through validation overlays.


This becomes especially compelling when considering RELAX NG. RELAX NG was intentionally designed as a simpler, more orthogonal schema language than XSD. It is less tied to intrinsic type systems and more oriented toward structural grammar validation. Through the RELAX NG DTD Compatibility specification, validators can also support default attribute insertion and Infoset augmentation. Modern validators such as Jing can therefore not only validate a document but potentially augment the Infoset by inserting defaulted attributes and annotations.


This changes the nature of schema processing considerably. Schemas cease to be merely passive accept-or-reject grammars and instead begin acting as active overlay transformations. One RELAX NG profile might inject attributes identifying a UUID invoice profile, while another profile might inject attributes corresponding to manually generated invoices. The same canonical XML document can therefore produce different augmented Infosets depending on which profile is externally applied.
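The overlay idea can be imitated in a few lines of Python. This sketch does not use RELAX NG or Jing; it only mimics the effect of default attribute insertion with a hand-written table. The attribute name idScheme and its values are invented for the example.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical overlays: per profile, default attributes to add to named elements.
OVERLAYS = {
    "uuid-invoice": {"DocumentID": {"idScheme": "uuid"}},
    "manual-invoice": {"DocumentID": {"idScheme": "free-form"}},
}


def augment(root: ET.Element, profile: str) -> ET.Element:
    """Return an augmented copy of the tree; the canonical input is left untouched."""
    augmented = copy.deepcopy(root)
    for tag, defaults in OVERLAYS[profile].items():
        for element in augmented.iter(tag):
            for name, value in defaults.items():
                # Defaults never override an attribute the instance already carries.
                if name not in element.attrib:
                    element.set(name, value)
    return augmented
```

Applying two different profiles to the same parsed tree yields two different augmented trees, while the original tree stays as parsed.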


At this point the architecture begins to resemble compiler pipelines more than ordinary XML validation. One can imagine the following processing flow:


Canonical XML document → processing instruction or external context → profile selection → RELAX NG validation and augmentation → augmented Infoset → downstream typed interpretation.


This is essentially late-bound schema projection. The XML document itself remains minimally typed and structurally stable while multiple external overlays provide different validation, augmentation, and interpretation layers. XSD does not naturally support this style of architecture because it fundamentally assumes that type identity is largely intrinsic to the document itself. RELAX NG, Schematron, Genericode, and XProc together form a much more flexible ecosystem for this kind of externally projected typing model.


XProc is particularly well suited to this approach because it was designed specifically as a pipeline orchestration language for XML processing. An XProc pipeline can inspect processing instructions, select schema profiles dynamically, invoke RELAX NG validators, apply Schematron rules, augment Infosets, and route documents through multiple validation overlays. This makes it possible to construct sophisticated multi-stage interpretation systems in which schema binding occurs operationally at runtime rather than being statically embedded into the document.


Schematron complements this especially well because it provides rule-based contextual validation rather than merely grammatical validation. A Schematron layer can express assertions such as “if profile equals UUID invoice then DocumentID must match UUID syntax” or “if profile equals manual invoice then DocumentID may contain special characters.” This creates an architecture where canonical XML structure is separated from contextual validation overlays, much like alternate record definitions in older mainframe systems.


The result is an XML architecture that behaves surprisingly similarly in spirit to COBOL REDEFINES. The mechanisms are entirely different. COBOL operates at the level of contiguous storage and byte reinterpretation. XML operates at the level of abstract syntax trees and externally applied typing overlays. Yet the underlying conceptual pattern is remarkably similar: a stable representation subjected to multiple alternate typed interpretations selected according to context.


In this model, schemas become overlays rather than absolute type definitions. Validation becomes projection rather than intrinsic identity checking. XML documents cease to be fully self-describing objects and instead become canonical serialized forms onto which multiple interpretation layers can later be applied. That architectural direction is arguably much closer to sophisticated data processing systems, compiler pipelines, and overlay-based interpretation frameworks than to conventional object-oriented XML document models.


Wording by ChatGPT, prompted by Stephen D Green, May 2026

Tuesday, 14 April 2026

Case Study: Semantic Drift Through Structural Simplification in Electronic Business Standards

1. Introduction

Electronic business document standards aim to enable reliable, interoperable exchange of commercial information across diverse systems. Standards such as Universal Business Language provide formal schemas that define the structure and permissible content of documents like invoices, credit notes, and orders. These schemas are designed not only to ensure syntactic correctness but also to support consistent interpretation across accounting, procurement, and regulatory systems.

However, the formal structure of a schema does not always fully capture the semantic intent embedded in the design of a document model. As standards evolve, changes motivated by structural reasoning may unintentionally weaken or remove implicit constraints that were originally introduced to preserve interoperability or reflect common business practices. The following case illustrates how such a situation can arise.


2. Background: Currency Representation in UBL Documents

In UBL financial documents, monetary values are typically represented using Amount elements. Each amount carries a currencyID attribute specifying the currency in which the value is expressed. This design ensures that any individual monetary value is self-describing.

At the same time, many UBL financial document types include a document-level element named DocumentCurrencyCode. This element declares the primary currency of the document as a whole and, in several document types, is defined as mandatory.

From a purely structural perspective, the presence of currency information on every amount element appears to make the document-level currency redundant. Since each monetary value already specifies its currency explicitly, the overall document currency might appear unnecessary.


3. The Implementation Perspective

An implementer examining the schema for a document type such as SelfBilledCreditNote observes two facts:

  1. DocumentCurrencyCode is mandatory at the document level.
  2. All currency-related Amount elements include their own currency attribute.

From this perspective, the following reasoning seems logical:

  • Every monetary value already carries its own currency.
  • Therefore, the document-level currency provides no additional information.
  • Making DocumentCurrencyCode optional would simplify the schema without affecting correctness.

Based on this reasoning, the implementer proposes that the element be made optional in a future revision of the standard.

To newer contributors or maintainers of the specification, this proposal may appear reasonable. The change does not introduce structural inconsistency, and it appears to reduce redundancy in the model. Consequently, such a proposal could plausibly be accepted during the evolution of the standard.


4. Original Design Intent

The mandatory nature of DocumentCurrencyCode in early UBL document models reflected practical assumptions about the systems expected to process these documents.

In many accounting and enterprise resource planning systems, financial documents are handled as single-currency artefacts. While the schema technically allows amounts in different currencies—because each amount carries its own currency attribute—most accounting systems historically expected all monetary values within a document to share a common currency.

The document-level currency therefore served several important purposes:

  1. Default Currency Declaration
    It provided a clear statement of the primary currency for the document.
  2. Operational Constraint Signal
    Its mandatory presence implicitly reinforced the expectation that the document should be treated as a single-currency document.
  3. Implementation Simplification
    Systems could rely on a single declared currency when validating totals, performing calculations, or posting transactions.
  4. Interoperability Assurance
    Trading partners could assume that all monetary values were intended to be expressed in the declared document currency unless explicitly specified otherwise.

In effect, DocumentCurrencyCode encoded a business invariant: that the document represented a transaction expressed primarily in one currency.


5. Consequences of Structural Simplification

If DocumentCurrencyCode were made optional, several subtle changes in system behaviour could occur.

Loss of Explicit Default

Without a mandatory document-level currency, systems must derive the effective currency by examining individual monetary values. If all values share the same currency, this may be straightforward, but the document itself no longer asserts that assumption.

Implicit Permission of Multi-Currency Documents

The absence of a mandatory document currency removes a clear signal that the document should be interpreted as single-currency. Documents containing mixed currencies could become structurally valid without any indication that this was unintended.

Increased Burden on Implementations

Accounting systems expecting a single document currency would need to introduce additional validation logic to ensure consistency across amounts. Implementers would have to enforce for themselves an expectation that the mandatory element previously signalled, even though the schema itself never checked that each amount's currency matched the declared one.
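The additional validation logic mentioned above might look something like the following sketch. It assumes amounts carry a currencyID attribute, as UBL Amount elements do, and it matches elements by local name so that it works with or without namespaces. The function names and the wording of the messages are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an ElementTree '{namespace}' prefix, if present."""
    return tag.rsplit("}", 1)[-1]


def check_single_currency(root: ET.Element) -> list:
    """Return a list of problems; an empty list means the document is single-currency."""
    declared = next(
        (
            (el.text or "").strip()
            for el in root.iter()
            if local_name(el.tag) == "DocumentCurrencyCode"
        ),
        None,
    )
    # Every distinct currency that appears on any amount in the document.
    used = sorted(
        {el.get("currencyID") for el in root.iter() if el.get("currencyID") is not None}
    )
    problems = []
    if declared is None:
        problems.append("no DocumentCurrencyCode declared")
    if len(used) > 1:
        problems.append("mixed currencies: " + ", ".join(used))
    elif declared is not None and used and used[0] != declared:
        problems.append(f"amounts use {used[0]} but document declares {declared}")
    return problems
```

A check of this kind lives outside the schema either way; what changes when the element becomes optional is that the first problem in the list stops being a schema error and becomes something each receiver has to decide how to handle.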

Silent Semantic Drift

Most importantly, no immediate failure occurs. Documents continue to validate, and systems continue to exchange data. The erosion occurs at the semantic level rather than the structural level, gradually weakening the shared assumptions that once ensured consistent interpretation.


6. Analysis

This example demonstrates a common phenomenon in evolving technical standards: semantic intent may not be fully represented in formal schema constraints.

When contributors assess changes based primarily on structural logic, elements that appear redundant may in fact encode important design assumptions. Once the original design rationale is forgotten or insufficiently documented, later contributors may reinterpret the model according to contemporary expectations.

The result is not necessarily an incorrect standard, but one that gradually loses the implicit constraints that previously ensured interoperability across heterogeneous systems.


7. Implications for Standards Governance

The case highlights several challenges in maintaining long-lived electronic standards:

  • Incomplete Formalisation of Business Rules
    Not all domain assumptions can be captured through schema constraints alone.
  • Loss of Historical Design Context
    As contributors change over time, the reasoning behind earlier modelling decisions may become obscure.
  • Structural vs. Semantic Evaluation
    Changes that appear harmless from a syntactic perspective may have significant semantic consequences.

Addressing these challenges requires governance mechanisms that preserve design rationale and evaluate proposed changes in terms of both structural correctness and business semantics.


8. Conclusion

The proposed simplification of DocumentCurrencyCode illustrates how seemingly minor schema changes can undermine implicit business constraints embedded in electronic document standards. While the change appears structurally reasonable, it weakens a semantic signal that supported interoperability with accounting systems expecting single-currency documents.

This case demonstrates a broader issue in the evolution of digital standards: structural correctness alone does not guarantee semantic stability. Preserving the integrity of electronic business documents requires sustained attention not only to schema design but also to the historical intent and operational assumptions that shape how those schemas are used in practice.


ChatGPT, prompted by Stephen D Green, April 2026 

Monday, 13 April 2026

Semantic Drift and Digitisation

The transition from paper-based business documents to electronic standards such as the Universal Business Language represents not merely a technological shift, but a profound change in how meaning, trust, and continuity are maintained in commercial practice. For centuries, documents like invoices evolved slowly within a dense web of legal, accounting, and social expectations. Their structure, terminology, and presentation were not arbitrary; they were shaped by the need to serve as durable evidence, to withstand audit scrutiny, and to remain intelligible across long spans of time. This gradual evolution created a form of semantic stability that was rarely formalised, yet widely understood and consistently applied.

Paper documents derived much of their strength from this embeddedness in human practice. Their meaning was reinforced by shared conventions, professional training, and legal precedent. An invoice issued decades ago can still be interpreted today with a high degree of confidence because the underlying concepts—supplier, buyer, total amount, obligation—have remained stable, and their representation has changed only incrementally. The inertia of paper-based systems, often seen as a limitation, functioned in reality as a safeguard. It constrained the pace of change and ensured that any modification was both visible and socially negotiated.

By contrast, electronic standards such as those developed under OASIS Open operate in an environment where structural change is comparatively easy and inexpensive. Schema definitions can be extended, constraints relaxed, and new elements introduced with far less friction than would be possible in paper-based systems. This flexibility is one of the principal advantages of digitisation, enabling automation, scalability, and integration across diverse systems. Yet it also removes the natural constraints that historically preserved semantic coherence. Where paper relied on shared human understanding, electronic standards rely on formal structures that, while precise in syntax, are often incomplete in their expression of meaning.

This shift introduces the risk that semantic stability, once maintained implicitly, must now be actively managed. Elements within a standard may retain their structural validity while their interpretation subtly changes over time. Cardinality constraints, for example, may be relaxed to accommodate new use cases, transforming what was once a singular, well-defined concept into something more ambiguous. Such changes rarely produce immediate failures; documents continue to validate, systems continue to exchange data, and yet the underlying assumptions that once ensured consistent interpretation begin to erode. The result is not overt incompatibility, but a quieter form of divergence in meaning.

The problem is compounded by the natural turnover of designers and contributors within standards bodies. As original authors move on, the rationale behind earlier decisions—often only partially documented—can fade from view. New contributors, acting in good faith, reinterpret the model according to current needs and their own understanding of the domain. In doing so, they may unintentionally override constraints that were originally introduced to preserve clarity or enforce business invariants. This process is not a failure of governance so much as an inherent feature of human design: each generation reshapes the systems it inherits. However, in a standard where meaning is only partially formalised, such reinterpretation can gradually displace the original conceptual coherence.

In this context, the rapid digitisation of business documents can be seen as having placed certain long-standing practices under strain. The qualities that made paper documents reliable—semantic stability, long-term interpretability, and auditability—are not automatically preserved in electronic form. Instead, they must be reconstructed through explicit rules, constrained profiles, and governance mechanisms. Where once a document could be understood largely in isolation, an electronic equivalent may require knowledge of schema versions, implementation conventions, and external validation rules to be interpreted correctly. The burden of maintaining meaning shifts from the document itself to the surrounding ecosystem.

At the same time, it would be misleading to conclude that digitisation has simply jeopardised these practices without offering compensating benefits. Electronic standards enable levels of efficiency and interoperability that paper systems could never achieve. They allow for precise data exchange, automated processing, and the integration of complex supply chains. In many cases, regulatory frameworks and industry initiatives are actively working to reintroduce the discipline that paper once provided, by enforcing strict subsets of standards and clearly defined semantics. These efforts suggest that the problem is recognised, even if it is not fully resolved.

What emerges, then, is not a simple narrative of loss, but a transition from one form of stability to another. Paper-based systems achieved stability through inertia, shared understanding, and legal conservatism. Electronic systems must achieve it through explicit design, careful governance, and sustained attention to semantic integrity. The risk lies in underestimating this requirement—in assuming that structural correctness is sufficient to preserve meaning. Where that assumption takes hold, the integrity of business documents can indeed be weakened, not through sudden failure, but through the gradual accumulation of small, individually reasonable changes that collectively alter what those documents signify.

In the end, the digitisation of business documents has not eliminated the need for the principles that guided their paper predecessors. It has merely changed the way those principles must be upheld. The challenge is to ensure that, in the pursuit of flexibility and innovation, the deeper requirements of clarity, consistency, and long-term interpretability are not allowed to drift out of focus.

ChatGPT, as prompted by Stephen D Green, April 2026 

Thursday, 26 February 2026

Universal Business Language (UBL) provides a powerful real-world example of what can be described as a Minimal Viable Product Line (MVPL)

 The OASIS Open Universal Business Language (UBL) provides a powerful real-world example of what can be described as a Minimal Viable Product Line (MVPL) architecture. Although UBL was not originally framed in those terms, its design reflects the same underlying principles: a stable, carefully governed core combined with well-defined, controlled mechanisms for variation and extension. Understanding UBL through the lens of MVPL reveals not only how it achieves global interoperability, but also how it defines a structured “geometry” of variability that allows adaptation without fragmentation.

At the heart of UBL lies an intentionally stable semantic and structural core. This core includes the standardized document types such as Invoice, Order, and DespatchAdvice; the Common Basic Components (CBC) and Common Aggregate Components (CAC); the XML schema structure; the namespace conventions; and the Naming and Design Rules (NDR). Together, these elements form a globally shared grammar for electronic business documents. The core is governed conservatively, with strong backward compatibility expectations and disciplined version management. Its purpose is not rapid innovation, but dependable interoperability across jurisdictions, industries, and decades.

This stable core plays the same role in UBL that a platform kernel plays in an extensible software system. It defines invariants—structural and semantic guarantees that all conformant implementations can rely upon. These invariants ensure that a UBL Invoice from one country can be parsed, validated, and understood by systems in another. If the core were frequently altered to satisfy local requirements, interoperability would quickly erode. The integrity of the shared semantic space depends on the immutability of the foundation.

Variation in UBL is therefore not achieved by modifying the core, but by moving along carefully defined dimensions of variability. These dimensions form a kind of semantic coordinate system. An implementation of UBL can be understood as occupying a point within this multidimensional space, determined by the choices it makes along each axis.

One major axis of variability is document type. Selecting Invoice rather than Order or CreditNote changes the transactional intent and business semantics, yet it does not alter the underlying structural rules. This axis defines the fundamental business interaction being represented.

Another axis is the profile dimension. Profiles constrain UBL for specific business contexts, industries, or regulatory environments. A government procurement profile, for example, may restrict optional elements, require certain fields, or tighten validation rules. Importantly, profiles do not change the schema itself; they project a constrained subset of it. This allows different communities to specialize UBL while remaining within the shared structural grammar.

UBL also provides an explicit structural extension mechanism through the ext:UBLExtensions element. This extension point allows additional elements to be included in a namespace-separated manner. Local jurisdictions or industries can introduce custom data structures without modifying or polluting the standardized core. Because extensions are clearly isolated, systems that do not understand them can still process the base document safely. This is a textbook example of MVPL extensibility: adaptation occurs through defined ports rather than through mutation of the foundation.
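A receiver that does not understand any extensions could, as one possible approach, drop the extension container before further processing. The sketch below assumes the UBL 2.x extension namespace and that UBLExtensions appears as a direct child of the document root; the function name is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Namespace of the UBL 2.x extension components (ext: prefix).
EXT_NS = "urn:oasis:names:specification:ubl:schema:xsd:CommonExtensionComponents-2"


def without_extensions(root: ET.Element) -> ET.Element:
    """Return a copy of the document with any top-level ext:UBLExtensions removed."""
    base = copy.deepcopy(root)
    for child in list(base):
        if child.tag == f"{{{EXT_NS}}}UBLExtensions":
            base.remove(child)
    return base
```

Because the extension content sits in its own namespace-qualified container, removing it leaves the standardized part of the document intact for ordinary processing.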

A further dimension of variability lies in codelists. Many business concepts—such as tax categories, payment means, or item classifications—are represented by coded values. UBL deliberately supports extensible codelists, allowing jurisdictions or industries to introduce additional codes where necessary. This axis permits variation in value space without changing structural definitions. In effect, it separates structural variability from semantic enumeration variability, enabling flexible adaptation to local policy environments.

UBL also supports contextual and rule-based constraints, often implemented through Schematron or other validation technologies. These constraints allow additional rules to be applied based on jurisdiction, process stage, or business role. This dimension alters what is considered valid in a particular context without changing the underlying XML schema. It represents variability in behavioral semantics rather than in structure.

Finally, there is the temporal dimension of versioning. UBL’s governance distinguishes between major and minor versions, with strict rules about backward compatibility. Movement along the version axis is deliberate and controlled. The version dimension ensures that evolution can occur over time without destabilizing existing ecosystems.

When these axes are considered together—document type, profile constraints, structural extensions, codelist variation, contextual rules, and versioning—they form a multidimensional geometry of variability. An individual UBL implementation can be described as a coordinate within this space: for example, an Invoice document using a particular national procurement profile, with specific extension elements, constrained by a jurisdiction’s validation rules, and conforming to a particular UBL version. Each coordinate represents a specialized dialect of global trade, yet all coordinates share the same underlying semantic grammar.
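One way to picture such a coordinate is as an immutable record with one field per axis. This is only an illustrative data structure; the class, its field names, and the sample values are hypothetical and are not part of UBL.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class UblCoordinate:
    """One point in the space of variability, with one field per axis."""
    document_type: str
    profile: str
    extensions: frozenset
    codelists: frozenset
    rule_set: str
    version: str


def differing_axes(a: UblCoordinate, b: UblCoordinate) -> list:
    """Name the axes along which two implementations differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative values only.
national = UblCoordinate(
    document_type="Invoice",
    profile="national-procurement",
    extensions=frozenset({"local-tax-extension"}),
    codelists=frozenset({"tax-category-codes"}),
    rule_set="jurisdiction-rules",
    version="2.1",
)
industry = replace(national, profile="industry-profile", version="2.3")
```

Comparing two coordinates axis by axis makes it explicit which dimensions a pair of trading partners would need to reconcile and which they already share.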

This geometric perspective clarifies why UBL has achieved durable interoperability. Rather than allowing ad hoc modifications to the schema for each new requirement, UBL channels change along predefined axes. Extensions move within the space; core changes reshape the space itself. Because reshaping the space would affect all participants, it is subject to strict governance through the OASIS technical committee process. This governance ensures that the foundation remains stable even as the ecosystem expands.

Seen through the MVPL lens, UBL exemplifies how large-scale systems can balance global uniformity with local diversity. The stable core provides shared meaning and tooling compatibility. The defined dimensions of variability enable specialization for industries, countries, and business processes. The geometry constrains change to safe directions, preserving interoperability while allowing innovation.

In this way, UBL is not merely an XML vocabulary. It is a carefully constructed semantic platform for international commerce. Its architecture demonstrates that extensibility need not imply instability, and that variation can be structured rather than chaotic. By separating the immutable foundation from the permitted dimensions of change, UBL embodies the core principle of MVPL: design the space of possible variation explicitly, and protect the integrity of the system that defines that space.


By ChatGPT, as prompted by Stephen D Green, February 2026