From Schemas to Overlays: XML, documented schemas, and an LLM interpreter of the XML document
Written by ChatGPT, as prompted by Stephen D Green, June 2026
When an LLM is used to interpret a document, the role of an overlay might shift from being primarily a validation mechanism to being an interpretation aid.
Historically, schemas have largely been consumed by deterministic software. Given a datatype declaration, an element value such as:
<DocumentDate>2461043</DocumentDate>
is either valid or invalid according to some rule.
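As a minimal sketch of that deterministic view, one might check the element like this. The rule used here (a positive integer) is an assumption standing in for whatever datatype a real schema would declare, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_valid_document_date(xml_text: str) -> bool:
    """Return True if the fragment is a DocumentDate holding a positive integer.

    The positive-integer rule is an assumed stand-in for a schema datatype.
    """
    element = ET.fromstring(xml_text)
    if element.tag != "DocumentDate":
        return False
    text = (element.text or "").strip()
    # Accept ASCII digits only, then require a value greater than zero.
    return text.isascii() and text.isdigit() and int(text) > 0


print(is_valid_document_date("<DocumentDate>2461043</DocumentDate>"))
print(is_valid_document_date("<DocumentDate>January</DocumentDate>"))
```

The check says nothing about what the integer means. It only answers the yes-or-no question of validity.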
An LLM changes the situation because it is fundamentally a probabilistic interpreter. Its reading of a document depends heavily on the context it is given: explanatory notes, stated assumptions, domain knowledge, caveats, and historical background.
In such a system, the overlay may not merely constrain interpretation; it may actively guide interpretation.
For example, consider a canonical document:
<BusinessDocument>
  <DocumentID>INV-2025-001</DocumentID>
  <DocumentDate>2461043</DocumentDate>
  <CurrencyCode>ABC</CurrencyCode>
</BusinessDocument>
The canonical representation contains only the data.
A contextual overlay supplied alongside the document might contain annotations such as:
BusinessDocument:
    In this context, BusinessDocument represents
    a supplier invoice.
DocumentDate:
    This integer represents a Julian Day Number,
    not a calendar year.
    Convert using Julian Day conversion rules.
CurrencyCode:
    The code ABC referred to Currency A until
    31 December 2010.
    From 1 January 2015 the code was reused for
    Currency B.
    Use the document date when determining which
    currency is intended.
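To make the two annotations concrete, here is a small sketch of what following them involves, whether done by an LLM or by ordinary code. The function names are hypothetical. The conversion assumes the Gregorian calendar and relies on the fixed offset between a Julian Day Number and Python's date ordinal. The overlay does not say what ABC meant between the two stated dates, so the sketch returns None for that gap rather than guessing.

```python
from datetime import date
from typing import Optional

# Difference between a Julian Day Number and Python's proleptic Gregorian
# ordinal (date.toordinal()), in which 1 January 0001 has ordinal 1.
JDN_TO_ORDINAL_OFFSET = 1_721_425


def julian_day_to_date(jdn: int) -> date:
    """Convert a Julian Day Number to the Gregorian calendar date it falls on."""
    return date.fromordinal(jdn - JDN_TO_ORDINAL_OFFSET)


def resolve_currency(code: str, document_date: date) -> Optional[str]:
    """Apply the overlay's rule for the reused code ABC."""
    if code != "ABC":
        return code
    if document_date <= date(2010, 12, 31):
        return "Currency A"
    if document_date >= date(2015, 1, 1):
        return "Currency B"
    # The overlay is silent about the period in between.
    return None


document_date = julian_day_to_date(2461043)
print(document_date)                           # 2026-01-02
print(resolve_currency("ABC", document_date))  # Currency B
```

Read as a year, 2461043 would be meaningless. Read as a Julian Day Number, it falls in January 2026, which in turn selects Currency B.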
A conventional validator would largely ignore such prose.
An LLM would not.
The annotations become part of the interpretive environment.
This is where the architecture begins to diverge from traditional schema thinking.
Traditionally:
Document
+
Schema
↓
Validation
In an LLM-oriented architecture:
Document
+
Interpretation Overlay
↓
Contextual Understanding
The overlay functions almost as a semantic lens.
One could even imagine different overlays being supplied to different agents.
For example:
BusinessDocument
+
Accounting Overlay
↓
Accounting Interpretation

BusinessDocument
+
Logistics Overlay
↓
Logistics Interpretation

BusinessDocument
+
Legal Overlay
↓
Legal Interpretation
The underlying document remains identical.
The contextual guidance differs.
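A rough sketch of that pairing follows. The overlay texts are invented purely for illustration, and the InterpretationRequest type and requests_for_agents function are hypothetical names, not part of any existing system.

```python
from dataclasses import dataclass
from typing import Dict, List

CANONICAL_DOCUMENT = """<BusinessDocument>
  <DocumentID>INV-2025-001</DocumentID>
  <DocumentDate>2461043</DocumentDate>
  <CurrencyCode>ABC</CurrencyCode>
</BusinessDocument>"""

# Hypothetical overlay texts, made up for this example only.
OVERLAYS: Dict[str, str] = {
    "accounting": "Treat DocumentDate as the posting date for ledger entries.",
    "logistics": "Treat DocumentDate as the reference date for delivery scheduling.",
    "legal": "Treat DocumentDate as the date from which contractual terms run.",
}


@dataclass(frozen=True)
class InterpretationRequest:
    agent: str
    document: str  # the same canonical text for every agent
    overlay: str   # the agent-specific guidance


def requests_for_agents(document: str, overlays: Dict[str, str]) -> List[InterpretationRequest]:
    """Pair one unchanged document with each agent's overlay."""
    return [
        InterpretationRequest(agent, document, text)
        for agent, text in overlays.items()
    ]


for request in requests_for_agents(CANONICAL_DOCUMENT, OVERLAYS):
    print(request.agent, "->", request.overlay)
```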
Now the consumer is not a deterministic validator. It is an intelligent interpreter.
In fact, one might argue that schema documentation suddenly becomes much more valuable than it has historically been.
For decades, schema documentation was often treated as a convenience for humans reading specifications.
In an AI-assisted architecture, documentation may become executable context.
The distinction between:
<xs:documentation>
This integer is a Julian Day Number.
</xs:documentation>
and
<xs:documentation>
This integer is a year.
</xs:documentation>
may profoundly affect the resulting interpretation.
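If the documentation is to serve as context, it first has to be lifted out of the schema. The following is a minimal sketch using Python's standard xml.etree.ElementTree. The sample schema and the function name collect_documentation are assumptions made for illustration, and the sketch handles only named xs:element declarations that carry an xs:annotation directly.

```python
import xml.etree.ElementTree as ET
from typing import Dict

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment wrapping the first documentation text above.
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="DocumentDate" type="xs:integer">
    <xs:annotation>
      <xs:documentation>
        This integer is a Julian Day Number.
      </xs:documentation>
    </xs:annotation>
  </xs:element>
</xs:schema>
""".strip()


def collect_documentation(xsd_text: str) -> Dict[str, str]:
    """Map each named element declaration to its documentation text."""
    root = ET.fromstring(xsd_text)
    notes: Dict[str, str] = {}
    for element in root.iter(f"{{{XS}}}element"):
        name = element.get("name")
        doc = element.find(f"{{{XS}}}annotation/{{{XS}}}documentation")
        if name and doc is not None and doc.text:
            # Collapse the whitespace left over from XML indentation.
            notes[name] = " ".join(doc.text.split())
    return notes


print(collect_documentation(SAMPLE_XSD))
```

The resulting dictionary can then be handed to an interpreter as plain text alongside the document.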
The same applies to business concepts.
Suppose the canonical model intentionally uses a neutral name:
<BusinessDocument>
One overlay might explain:
In this context, BusinessDocument represents a purchase order.
Another might explain:
In this context, BusinessDocument represents a supplier invoice.
Another might explain:
In this context, BusinessDocument represents a credit note.
A deterministic processor might already know which profile is in use and therefore not need this explanation.
An LLM can benefit greatly from receiving the explanation explicitly.
In this sense, the overlay begins to resemble a prompt.
Indeed, one could describe the architecture as:
Canonical Representation
+
Contextual Overlay
↓
Interpretive Prompt
↓
LLM Understanding
The overlay is no longer merely a validation artifact. It becomes part of the prompt engineering architecture.
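One way to picture the middle step of that flow is a function that assembles the prompt text. This is only a sketch: the prompt wording, the function name build_interpretive_prompt, and the task string are all assumptions, and the call to an actual model is deliberately left out.

```python
from typing import Dict


def build_interpretive_prompt(canonical_xml: str, overlay_notes: Dict[str, str], task: str) -> str:
    """Combine the canonical document and its overlay into one prompt string."""
    lines = ["You are interpreting an XML business document.", "", "Contextual overlay:"]
    for element_name, note in overlay_notes.items():
        lines.append(f"- {element_name}: {note}")
    lines += ["", "Canonical document:", canonical_xml.strip(), "", f"Task: {task}"]
    return "\n".join(lines)


document = """<BusinessDocument>
  <DocumentID>INV-2025-001</DocumentID>
  <DocumentDate>2461043</DocumentDate>
  <CurrencyCode>ABC</CurrencyCode>
</BusinessDocument>"""

overlay = {
    "BusinessDocument": "In this context, BusinessDocument represents a supplier invoice.",
    "DocumentDate": "This integer represents a Julian Day Number, not a calendar year.",
}

prompt = build_interpretive_prompt(document, overlay, "State the invoice date as a calendar date.")
print(prompt)
```

The canonical XML passes through untouched. Everything the model needs beyond the raw data arrives through the overlay section of the prompt.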
That is a significant conceptual shift.
Historically, schemas were primarily intended to communicate constraints to software.
In an AI-rich environment, overlays may increasingly communicate meaning to interpreters.
The same core document could therefore support a large ecosystem of contextual overlays, some aimed at deterministic software, some aimed at human readers, and some aimed at LLMs.
The architecture then becomes even more general:
Core Representation
+
Contextual Overlay
↓
Contextual Interpretation
Whether the interpreter is:
- a validator,
- an application,
- a human,
- a workflow engine,
- an LLM,
becomes almost secondary.
The core remains stable.
The overlays provide context.
The interpreter combines the two to produce an understanding appropriate to its task.
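The generality can be sketched by giving every consumer the same signature: core plus overlay in, some interpretation out. The two consumers below are hypothetical and deliberately simple. One is a deterministic check and the other produces text for a human reader; an LLM-backed consumer would fit the same shape.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

Overlay = Dict[str, str]
Interpreter = Callable[[str, Overlay], object]


def required_elements_present(core_xml: str, overlay: Overlay) -> bool:
    """Deterministic consumer: every element the overlay describes must occur."""
    present = {element.tag for element in ET.fromstring(core_xml).iter()}
    return all(name in present for name in overlay)


def describe_for_reader(core_xml: str, overlay: Overlay) -> str:
    """Human-facing consumer: show each annotated value next to its note."""
    lines = []
    for element in ET.fromstring(core_xml).iter():
        note = overlay.get(element.tag)
        if note:
            value = (element.text or "").strip()
            lines.append(f"{element.tag} = {value!r}: {note}")
    return "\n".join(lines)


def interpret(core_xml: str, overlay: Overlay, interpreter: Interpreter) -> object:
    """The core and the overlay stay the same; only the interpreter varies."""
    return interpreter(core_xml, overlay)


CORE = "<BusinessDocument><DocumentDate>2461043</DocumentDate></BusinessDocument>"
OVERLAY = {"DocumentDate": "This integer represents a Julian Day Number."}

print(interpret(CORE, OVERLAY, required_elements_present))
print(interpret(CORE, OVERLAY, describe_for_reader))
```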
Viewed this way, the rise of LLMs may actually strengthen the case for the Core-and-Extensions architecture. A stable canonical representation becomes increasingly valuable because many different interpretive overlays can be attached to it over time, allowing different intelligent agents to derive different contextual understandings without requiring the underlying representation itself to be redesigned.