rfc: Implement generators for XML text and attribute #113

tienvx · 2025-02-10T15:42:10Z

No description provided.

YOU54F · 2025-03-07T11:44:52Z

Associated draft pull requests

YOU54F · 2025-03-07T11:48:29Z

Tagging @pact-foundation/maintainers for review of this RFC to introduce generators for XML text and attributes. Thank you 🙌🏾

JP-Ellis · 2025-03-11T00:30:03Z

First and foremost, I think this is a great idea; and as an initial starting point, I think it is a great RFC 🚀

I do have some questions and comments, and these mostly revolve around the complexity of XML as a format. For reference, I'm referring to the XML 1.1 specification.

Explicit support for UTF-8

XML explicitly allows (nearly) all of Unicode. As an example of valid XML from Wikipedia:

<?xml version="1.0" encoding="UTF-8"?>
<俄语 լեզու="ռուսերեն">данные</俄语>

The RFC should be clear that we do support UTF-8, in tags, attributes, and bodies.
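As a quick check of what "supporting UTF-8 everywhere" could look like, here is a minimal sketch that round-trips the example above. It assumes Python's standard `xml.etree.ElementTree` as a stand-in for whatever XML library an implementation actually uses, and the replacement text is a hypothetical generated value.

```python
import xml.etree.ElementTree as ET

# The example above, with non-ASCII tag and attribute names.
source = '<俄语 լեզու="ռուսերեն">данные</俄语>'
root = ET.fromstring(source)

# Hypothetical generated value replacing the element's text.
root.text = "сгенерировано"

# encoding="unicode" returns a str, so non-ASCII characters are kept
# as-is instead of being turned into numeric character references.
output = ET.tostring(root, encoding="unicode")
print(output)
```

The tag name, attribute name, attribute value, and body all survive the parse and serialize steps unchanged.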

Support for XML Declaration

At the start of XML documents, an XML declaration is required under XML 1.1 (XML 1.0 only recommends it), though in practice it is often ignored. This declaration specifies the version of XML being used, as well as the encoding:

<?xml version="1.0" encoding="ASCII"?>
...

Since this is a required element of a valid XML document (even though in practice it is often ignored), we should make sure that we can generate it when required.
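A minimal sketch of emitting the declaration on demand, assuming Python's standard `xml.etree.ElementTree` (the `xml_declaration` argument of `tostring` needs Python 3.8 or later). The document content is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<data><date>2024-03-12</date></data>")

# No declaration: the default when serializing to a str.
bare = ET.tostring(root, encoding="unicode")

# Declaration requested explicitly; the result is bytes in the named encoding.
declared = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

print(bare)
print(declared.decode("utf-8"))
```

Whether the declaration is emitted could then be a choice made by the caller rather than something the generator decides.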

Empty tags

There are two ways an empty tag can be represented in XML:

Using an empty tag:
```
<data />
```
Using a start-end tag:
```
<data></data>
```

The specification is clear that both forms are equivalent:

[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag.

The specification also makes it clear that the preference is to use the empty-tag representation as opposed to a start-end tag.

The RFC should make a mention that empty tags will be generated as the <data /> form and not the <data></data> form.
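For illustration, a minimal sketch of the two forms using Python's standard `xml.etree.ElementTree`, which is only an assumed stand-in for the real implementation. Its `short_empty_elements` flag switches between them.

```python
import xml.etree.ElementTree as ET

empty = ET.Element("data")

# Empty-element tag (the library default).
short_form = ET.tostring(empty, encoding="unicode")
# Start-tag immediately followed by end-tag.
long_form = ET.tostring(empty, encoding="unicode", short_empty_elements=False)

print(short_form)  # <data />
print(long_form)   # <data></data>

# Both forms parse back to an element with no children and no text.
for form in (short_form, long_form):
    parsed = ET.fromstring(form)
    assert len(parsed) == 0 and not parsed.text
```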

Support for white-space preserving

XML parsers typically don't care much about whitespace and allow for indentation. The exception to the rule is if the special xml:space attribute is set to preserve (spec).

This is somewhat niche, but the RFC should be clear as to whether this is supported or not. In particular, nothing stops us from defining a generator which adds xml:space="preserve" to an element, but the generated XML may not actually be valid if the generator does not respect whitespace.

Support for escaping data

Since the generators are specified in JSON, the RFC should make it completely clear how/when data is escaped.

To avoid any confusion, I think the generator should automatically escape data, and the inputs should be read using standard JSON parsing rules. This means that the following is correct:

{
   "name": "example",
   "children": {
      ...,
      "content": "<foo />"
   }
}

and it would be incorrect to have:

{
   "name": "example",
   "children": {
      ...,
      "content": "&lt;foo /&gt;"
   }
}

unless one wanted to have the <foo /> body be a double-escaped XML string.

There's also a question as to the way to escape the data, since XML has two options:

Using &...; escape for individual characters:
```
<example>&lt;foo /&gt;</example>
```
Using the CDATA construction:
```
<example><![CDATA[<foo />]]></example>
```

I feel like the CDATA version is nicer as it makes it easier for a human to read the data within.
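A minimal sketch comparing the two escaping options, assuming Python's standard library. The helper names `as_entities` and `as_cdata` are hypothetical and not part of the RFC. One caveat is that a CDATA section cannot contain the sequence `]]>`, so that sequence has to be split across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_entities(name, value):
    """Wrap value in <name>, escaping &, < and > as entity references."""
    return f"<{name}>{escape(value)}</{name}>"


def as_cdata(name, value):
    """Wrap value in <name> using CDATA, splitting any embedded ']]>'."""
    safe = value.replace("]]>", "]]]]><![CDATA[>")
    return f"<{name}><![CDATA[{safe}]]></{name}>"


raw = "<foo />"  # the value as read by a standard JSON parser
entity_form = as_entities("example", raw)
cdata_form = as_cdata("example", raw)

# Both serializations parse back to the same text.
assert ET.fromstring(entity_form).text == raw
assert ET.fromstring(cdata_form).text == raw
```

Either form carries the same data to the consumer, so the choice is mostly about readability of the generated document.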

Support for Comments (?)

Unlike JSON, XML supports comments, and therefore a natural question is: should we support them?

<data>
  <!-- some explanation -->
  <date>...</date>
</data>

Comments are explicitly not part of the actual data being transferred, so my gut instinct to the question is 'no' and this should be made clear in the RFC.

Having said that, has anyone ever run into a scenario which (for whatever reason) required a comment to be present to pass validation?

Support for Arrays

XML does not have an array data type, and instead uses repeated tags:

<data>
  <tag>1</tag>
  <tag>2</tag>
  ...
</data>

I don't see anything in the way your RFC is structured which would conflict with this, but I do think the RFC should include an explicit example of generating arrays.
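As an illustration of the repeated-tag pattern, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper `build_array` is hypothetical and says nothing about how the RFC's JSON format would express this.

```python
import xml.etree.ElementTree as ET


def build_array(parent_name, item_name, values):
    """Represent a list as repeated <item_name> children of <parent_name>."""
    parent = ET.Element(parent_name)
    for value in values:
        ET.SubElement(parent, item_name).text = str(value)
    return parent


data = build_array("data", "tag", [1, 2, 3])
serialized = ET.tostring(data, encoding="unicode")
print(serialized)  # <data><tag>1</tag><tag>2</tag><tag>3</tag></data>
```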

Support for Type Declarations

XML can be a self-describing format, with the definitions being transmitted alongside the data:

<!DOCTYPE data [
  <!ELEMENT data (tag*)>
  <!ELEMENT tag (#PCDATA)>
]>
<data>
  <tag>1</tag>
  <tag>2</tag>
  ...
</data>

I suspect adding support for this may be out of the scope of the RFC, but it is worth mentioning within the RFC that this is not supported.

Support for Attribute-list Declarations

XML also allows for the definition of attributes:

<!ATTLIST data
  id ID #REQUIRED
  name CDATA #IMPLIED
>
<data id="1" name="example">
  ...
</data>

Again, I suspect this is out of scope, but it is worth mentioning that this is not supported.

Support for Entities

XML allows for the definition of entities (whether internal or external):

<!DOCTYPE github [
  <!ENTITY domain "github.com">
  <!ENTITY ips SYSTEM "https://dns.google/resolve?name=github.com&type=A">
]>
<github>
  <domain>&domain;</domain>
  <ips>&ips;</ips>
</github>

It is straightforward to have a generator use entities (in fact, &lt; and &gt; are entities), but I suspect that the ability to generate entity definitions is out of scope and should be mentioned in the RFC.

rholshausen · 2025-03-11T03:21:09Z

Just a note: the reason that generators are not supported for XML is that it is more complex than JSON. The current implementation is based on finding the element in the body that matches the path, and replacing it with a generated value.

XML can have multiple matching nodes, and that would require finding all of them and replacing each one. This is where I gave up on it.

I.e. with

<container>
  <tag/><tag/><tag/>
</container>

and the path $.container.tag would require 3 updates.

with

<container>
  <tag>
    <child>
        <subchild>
        </subchild>
        <subchild>
        </subchild>
    </child>
  </tag>
  <tag>
  </tag>
  <other>
     <child>
        <name>
        </name>
        <name>
        </name>
    </child>
  </other>
</container>

and the path $..container.*.child.* would match 4 nodes (2 subchild and 2 name tags). Consider how this can expand on non-trivial documents.
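To make the "find all matches and replace each one" requirement concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It rests on two assumptions: the ElementTree path `./*/child/*` is only an approximation of `$..container.*.child.*`, not the Pact path syntax, and the counter stands in for a real generator. The document is a compact, well-formed version of the second example.

```python
import itertools
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<container>"
    "<tag><child><subchild/><subchild/></child></tag>"
    "<tag/>"
    "<other><child><name/><name/></child></other>"
    "</container>"
)

# Stand-in generator: yields 1, 2, 3, ... as strings.
counter = itertools.count(1)

# findall returns every matching node in document order.
matches = doc.findall("./*/child/*")
for node in matches:
    node.text = str(next(counter))

print(len(matches))
print(ET.tostring(doc, encoding="unicode"))
```

Each matched node gets its own generated value here. Whether all matches should share one value or each receive a fresh one is a design question the sketch does not settle.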

rholshausen · 2025-03-11T03:34:27Z

I do have some questions and comments, and these mostly revolve around the complexity of XML as a format. For reference, I'm referring to the XML 1.1 specification.

Support for XML Declaration

Empty tags

Support for white-space preserving

Support for escaping data

Support for Comments (?)

Support for Type Declarations

Support for Attribute-list Declarations

Support for Entities

These are not relevant for using generators with XML.

JP-Ellis · 2025-03-11T06:00:14Z

These are not relevant for using generators with XML.

Care to go into any level of detail as to why?

mefellows · 2025-03-12T11:21:32Z

Thanks for doing this Tien! A couple of additional clarifications:

1. Mixed Content Handling
It might not be a possible state (given how the DSL would likely be defined), but if it were, how would we plan to support the following edge case:

<summary>Release date: <date>2024-03-12</date>. Copyright ACME inc.</summary>

If a generator is applied to <summary>, does it replace the entire content or just the text node?

Presumably it would leave the nested element intact i.e.

Element has mixed children (text + elements)

Generator: RandomInt(0, 999)
Path: $.a['#text']
Before: <?xml version='1.0'?><a>OldText<b/>MoreText</a>
After: <?xml version='1.0'?><a>123<b/>456</a>

Note:

The generator is applied to each separate text node within a.
Since text nodes do not have an index in the current JSON format, the same generator applies to all text nodes.
If a only had one text node, it would be fully replaced.
If a had multiple text nodes (before and after <b/>), each would be replaced independently.

If we introduce indexing for text nodes in the future, we could support defining different generators for each text node:

Generators:

Path: $.a['#text'][0] → RandomInt(0, 999)
Path: $.a['#text'][1] → RandomString(3)

Would produce <?xml version='1.0'?><a>123<b/>abc</a>
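A minimal sketch of the per-text-node replacement described above, assuming Python's standard `xml.etree.ElementTree`, where text before the first child lives in `element.text` and text after a child lives in that child's `tail`. The helper `replace_text_nodes` and the fixed values (used instead of a random generator so the result is reproducible) are hypothetical.

```python
import xml.etree.ElementTree as ET


def replace_text_nodes(element, generate):
    """Replace each non-blank text node directly under element.

    generate(index) receives the zero-based position of the text node.
    Returns the number of text nodes replaced.
    """
    index = 0
    if element.text and element.text.strip():
        element.text = generate(index)
        index += 1
    for child in element:
        # Text that follows a child element is stored as the child's tail.
        if child.tail and child.tail.strip():
            child.tail = generate(index)
            index += 1
    return index


a = ET.fromstring("<a>OldText<b/>MoreText</a>")
values = ["123", "456"]
count = replace_text_nodes(a, lambda i: values[i])
result = ET.tostring(a, encoding="unicode")
print(result)
```

The nested `<b/>` element is left in place, and the index passed to `generate` is what a future `['#text'][n]` path could select on.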

--

2. Namespaces and Prefixes

The RFC covers namespace handling for attributes and elements, but it's focussed on locating elements. It isn't clear whether prefixes or values can be dynamically generated or replaced.

My assumption is that we would not support this.

--

3. Comments

The W3C Recommendation for XML, specifically XML 1.0 (Fifth Edition), mentions the following with regard to comments:

Comments are ignored by XML processors except for their presence (i.e., they do not affect the data model).
They are not part of the document’s logical structure and are purely for human readability or annotation.
They must not appear within markup (e.g., inside a tag name or attribute).

As such, I think we can ignore them entirely.

Having said that, has anyone ever run into a scenario which (for whatever reason) required a comment to be present to pass validation?

No, I haven't.

--

4. Processing Instructions e.g. <?xml-stylesheet?>

The RFC does not mention how processing instructions should be treated.
Should they be ignored, or could generators modify them?
Example:

<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<root>...</root>

Processing instructions are application-specific. Changing them doesn't technically impact semantics, but they may be important application-level instructions (e.g. used in a data pipeline down the line).
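If the answer is "ignore them", one option is to pass processing instructions through untouched. A minimal sketch, assuming Python's standard `xml.dom.minidom` (which keeps processing instructions in the parsed document) and a hypothetical generated value:

```python
from xml.dom import minidom

source = (
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    "<root>old</root>"
)
doc = minidom.parseString(source)

# Replace only the text node inside <root>; the processing
# instruction is a sibling of the root element and is not touched.
doc.documentElement.firstChild.data = "new"

out = doc.toxml()
print(out)
```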

rfc: Implement generators for XML text and attribute

2fcc470

tienvx force-pushed the implement-generators-for-xml branch from bdce8c5 to 2fcc470 Compare February 11, 2025 01:00

YOU54F requested review from mefellows, JP-Ellis, YOU54F and rholshausen March 5, 2025 10:27
