rfc: Implement generators for XML text and attribute #113

Open
wants to merge 1 commit into
base: master
from


tienvx
Contributor

@tienvx tienvx commented Feb 10, 2025

No description provided.

@tienvx tienvx force-pushed the implement-generators-for-xml branch from bdce8c5 to 2fcc470 Compare February 11, 2025 01:00
@YOU54F
Member

YOU54F commented Mar 7, 2025

Tagging @pact-foundation/maintainers for review of this RFC to introduce generators for XML text and attributes. Thank you 🙌🏾

@JP-Ellis
Contributor

First and foremost, I think this is a great idea; and as an initial starting point, I think it is a great RFC 🚀

I do have some questions and comments, and these mostly revolve around the complexity of XML as a format. For reference, I'm referring to the XML 1.1 specification.

Explicit support for UTF-8

XML explicitly allows (nearly) all of Unicode. As an example of valid XML from Wikipedia:

<?xml version="1.0" encoding="UTF-8"?>
<俄语 լեզու="ռուսերեն">данные</俄语>

The RFC should be clear that we do support UTF-8, in tags, attributes, and bodies.
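As a rough illustration of what "UTF-8 in tags, attributes, and bodies" would mean for an implementation, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the Pact implementation; the replacement value stands in for whatever a generator would produce.

```python
import xml.etree.ElementTree as ET

# The Wikipedia example quoted above, encoded as UTF-8 bytes.
SOURCE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<俄语 լեզու="ռուսերեն">данные</俄语>'
).encode("utf-8")

root = ET.fromstring(SOURCE)

# Tag name, attribute value, and body all survive parsing unchanged.
tag, attr_value, body = root.tag, root.get("լեզու"), root.text

# Hypothetical generator step: replace the body with another non-ASCII value.
root.text = "новые данные"
serialized = ET.tostring(root, encoding="utf-8").decode("utf-8")
```

The same round trip would need to hold in whichever XML library the Pact core uses.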

Support for XML Declaration

At the start of XML documents, an XML declaration is required under XML 1.1 (XML 1.0 only recommends it, and in practice it is often omitted). This declaration specifies the version of XML being used, as well as the encoding:

<?xml version="1.0" encoding="ASCII"?>
...

Since this is a required element of a valid XML document (even though in practice it is often ignored), we should make sure that we can generate it when required.
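To make the "generate it when required" point concrete, this sketch shows a serializer emitting the declaration on demand. It uses Python's standard library purely as an example (the `xml_declaration` argument of `ET.tostring` needs Python 3.8 or later); how Pact would expose such a switch is an open question.

```python
import xml.etree.ElementTree as ET

root = ET.Element("data")
root.text = "value"

# xml_declaration=True forces the declaration to be emitted.
with_decl = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# xml_declaration=False suppresses it.
without_decl = ET.tostring(root, encoding="utf-8", xml_declaration=False)
```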

Empty tags

There are two ways an empty tag can be represented in XML:

  1. Using an empty-element tag:

    <data />

  2. Using a start-end tag:

    <data></data>

The specification is clear that both forms are equivalent:

[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag.

The specification also makes it clear that the preference is to use the empty-tag representation as opposed to a start-end tag.

The RFC should make a mention that empty tags will be generated as the <data /> form and not the <data></data> form.
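For reference, a small sketch showing both serializations and that they parse to the same empty element. It uses Python's `xml.etree.ElementTree`; `is_empty` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

empty = ET.Element("data")

# Default serialization uses the empty-element tag.
short_form = ET.tostring(empty, encoding="unicode")

# short_empty_elements=False produces the start-tag/end-tag pair instead.
long_form = ET.tostring(empty, encoding="unicode", short_empty_elements=False)


def is_empty(xml_text):
    """Return True if the parsed element has no text and no children."""
    element = ET.fromstring(xml_text)
    return element.text is None and len(element) == 0
```

Both forms satisfy `is_empty`, so the choice only affects the bytes on the wire, not the parsed data.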

Support for white-space preserving

XML parsers typically don't care much about whitespace and allow for indentation. The exception to the rule is when the special xml:space attribute is set to preserve (spec).

This is somewhat niche, but the RFC should be clear as to whether this is supported or not. In particular, nothing stops us from defining a generator which adds xml:space="preserve" to an element, but the generated XML may not actually be valid if the generator does not respect whitespace.

Support for escaping data

Since the generators are specified in JSON, the RFC should make it completely clear how/when data is escaped.

To avoid any confusion, I think the generator should automatically escape data, and the inputs should be read using standard JSON parsing rules. This means that the following is correct:

{
   "name": "example",
   "children": {
      ...,
      "content": "<foo />"
   }
}

and it would be incorrect to have:

{
   "name": "example",
   "children": {
      ...,
      "content": "&lt;foo /&gt;"
   }
}

unless one wanted to have the <foo /> body be a double-escaped XML string.

There's also a question as to the way to escape the data, since XML has two options:

  • Using &...; escape for individual characters:
    <example>&lt;foo /&gt;</example>
  • Using the CDATA construction:
    <example><![CDATA[<foo />]]></example>

I feel like the CDATA version is nicer as it makes it easier for a human to read the data within.
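A short sketch of the escaping behaviour described above, using Python's standard library as a stand-in. The JSON document here is a simplified, hypothetical fragment rather than the RFC's actual generator format.

```python
import json
import xml.etree.ElementTree as ET

# Standard JSON parsing yields the raw, unescaped value.
raw = json.loads('{"content": "<foo />"}')["content"]

# The serializer escapes the value automatically when writing XML.
example = ET.Element("example")
example.text = raw
escaped = ET.tostring(example, encoding="unicode")

# Both escaping styles decode to the same text when parsed.
from_entities = ET.fromstring("<example>&lt;foo /&gt;</example>").text
from_cdata = ET.fromstring("<example><![CDATA[<foo />]]></example>").text
```

Since both styles parse to the same text, the choice between them only matters for human readability of the generated document.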

Support for Comments (?)

Unlike JSON, XML supports comments, and therefore a natural question is: should we support them?

<data>
  <!-- some explanation -->
  <date>...</date>
</data>

Comments are explicitly not part of the actual data being transferred, so my gut instinct to the question is 'no' and this should be made clear in the RFC.
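As one data point, some parsers already discard comments by default. This sketch shows the behaviour of Python's `xml.etree.ElementTree`; other libraries may differ.

```python
import xml.etree.ElementTree as ET

SOURCE = """<data>
  <!-- some explanation -->
  <date>...</date>
</data>"""

root = ET.fromstring(SOURCE)

# Only the element child is kept; the comment is dropped during parsing.
children = [child.tag for child in root]
round_tripped = ET.tostring(root, encoding="unicode")
```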

Having said that, has anyone ever run into a scenario which (for whatever reason) required a comment to be present to pass validation?

Support for Arrays

XML does not have an array data type, and instead uses repeated tags:

<data>
  <tag>1</tag>
  <tag>2</tag>
  ...
</data>

I don't see anything in the way your RFC is structured which would conflict with this, but I do think the RFC should include an explicit example of generating arrays.
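A minimal sketch of what "generating an array" could look like in terms of repeated tags. `build_array` is a hypothetical helper, not something taken from the RFC.

```python
import xml.etree.ElementTree as ET


def build_array(parent_name, item_name, values):
    """Represent a list as repeated child elements under one parent."""
    parent = ET.Element(parent_name)
    for value in values:
        ET.SubElement(parent, item_name).text = str(value)
    return parent


data = build_array("data", "tag", [1, 2, 3])
xml_text = ET.tostring(data, encoding="unicode")

# Reading the "array" back is a matter of collecting the repeated tags.
items = [int(element.text) for element in data.findall("tag")]
```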

Support for Type Declarations

XML can be a self-describing format, with the definitions being transmitted alongside the data:

<!DOCTYPE data [
  <!ELEMENT data (tag*)>
  <!ELEMENT tag (#PCDATA)>
]>
<data>
  <tag>1</tag>
  <tag>2</tag>
  ...
</data>

I suspect adding support for this may be out of the scope of the RFC, but it is worth mentioning within the RFC that this is not supported.

Support for Attribute-list Declarations

XML also allows for the definition of attributes:

<!ATTLIST data
  id ID #REQUIRED
  name CDATA #IMPLIED
>
<data id="1" name="example">
  ...
</data>

Again, I suspect this is out of scope, but it is worth mentioning that this is not supported.

Support for Entities

XML allows for the definition of entities (whether internal or external):

<!DOCTYPE github [
  <!ENTITY domain "github.com">
  <!ENTITY ips SYSTEM "https://dns.google/resolve?name=github.com&type=A">
]>
<github>
  <domain>&domain;</domain>
  <ips>&ips;</ips>
</github>

It is straightforward to have a generator use entities (in fact, &lt; and &gt; are entities), but I suspect that the ability to generate entity definitions is out of scope and should be mentioned in the RFC.
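To illustrate the distinction between using entities and generating their definitions, this sketch parses a document with an internal entity only (the external `ips` entity is deliberately left out). It shows the behaviour of Python's standard parser and is not a statement about the Pact implementation.

```python
import xml.etree.ElementTree as ET

SOURCE = """<!DOCTYPE github [
  <!ENTITY domain "github.com">
]>
<github>
  <domain>&domain;</domain>
</github>"""

root = ET.fromstring(SOURCE)

# The parser expands the internal entity reference into its value.
domain = root.find("domain").text

# The element tree does not retain the DOCTYPE, so re-serializing drops
# the entity definition.
serialized = ET.tostring(root, encoding="unicode")
```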

@rholshausen

rholshausen commented Mar 11, 2025

Just a note, the reason that generators are not supported for XML is because it is more complex than with JSON. The current implementation is based on finding the element in the body that matches the path, and replacing it with a generated value.

XML can have multiple matching nodes, and that would require being able to find all of them and replacing all the items. This is where I gave up on it.

For example, with

<container>
  <tag/><tag/><tag/>
</container>

the path $.container.tag would require 3 updates.
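The "find all of them and replace all the items" step can be sketched as follows. `apply_generator` is a hypothetical helper, and mapping the Pact path `$.container.tag` to the ElementTree path `tag` (relative to the root) is an assumption made for this example.

```python
import random
import xml.etree.ElementTree as ET


def apply_generator(root, path, generate):
    """Set the text of every element matching `path`; return the match count."""
    matches = root.findall(path)
    for element in matches:
        element.text = generate()
    return len(matches)


container = ET.fromstring("<container><tag/><tag/><tag/></container>")
rng = random.Random(0)  # seeded only to keep the example repeatable

updated = apply_generator(container, "tag", lambda: str(rng.randint(0, 999)))
```

Each matching node receives its own generated value, which is one possible answer to the question of how multiple matches should be handled.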

With

<container>
  <tag>
    <child>
        <subchild>
        </subchild>
        <subchild>
        </subchild>
    </child>
  </tag>
  <tag>
  </tag>
  <other>
     <child>
        <name>
        </name>
        <name>
        </name>
    </child>
  </other>
</container>

the path $..container.*.child.* would match 4 nodes (2 subchild and 2 name tags), and this can expand quickly on non-trivial documents.
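The wildcard match count can be checked with a quick sketch. It assumes the intended, well-formed version of the document (with proper closing tags) and uses the ElementTree path `*/child/*` as a rough equivalent of the Pact path.

```python
import xml.etree.ElementTree as ET

SOURCE = """<container>
  <tag>
    <child>
      <subchild></subchild>
      <subchild></subchild>
    </child>
  </tag>
  <tag></tag>
  <other>
    <child>
      <name></name>
      <name></name>
    </child>
  </other>
</container>"""

root = ET.fromstring(SOURCE)

# Any child of the root, then a <child> element, then any of its children.
matched = [element.tag for element in root.findall("*/child/*")]
```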

@rholshausen

I do have some questions and comments, and these mostly revolve around the complexity of XML as a format. For reference, I'm referring to the XML 1.1 specification.

Support for XML Declaration

Empty tags

Support for white-space preserving

Support for escaping data

Support for Comments (?)

Support for Type Declarations

Support for Attribute-list Declarations

Support for Entities

These are not relevant for using generators with XML.

@JP-Ellis
Contributor

These are not relevant for using generators with XML.

Care to go into any level of detail as to why?

@mefellows
Member

Thanks for doing this Tien! A couple of additional clarifications:

1. Mixed Content Handling
It might not be a possible state (given how the DSL would likely be defined), but if it were, how would we plan to support the following edge case:

<summary>Release date: <date>2024-03-12</date>. Copyright ACME inc.</summary>

If a generator is applied to <summary>, does it replace the entire content or just the text node?

Presumably it would leave the nested element intact, i.e.

Element has mixed children (text + elements)

  • Generator: RandomInt(0, 999)
  • Path: $.a['#text']
  • Before: <?xml version='1.0'?><a>OldText<b/>MoreText</a>
  • After: <?xml version='1.0'?><a>123<b/>456</a>

Note:

  • The generator is applied to each separate text node within a.
  • Since text nodes do not have an index in the current JSON format, the same generator applies to all text nodes.
  • If a only had one text node, it would be fully replaced.
  • If a had multiple text nodes (before and after <b/>), each would be replaced independently.

If we introduce indexing for text nodes in the future, we could support defining different generators for each text node:

Generators:

  • Path: $.a['#text'][0] → RandomInt(0, 999)
  • Path: $.a['#text'][1] → RandomString(3)

Would produce <?xml version='1.0'?><a>123<b/>abc</a>
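One way to picture the mixed-content case is a sketch where each text node of an element is replaced independently. In Python's ElementTree the text before the first child is stored in `.text` and the text after each child in that child's `.tail`. `replace_text_nodes` is a hypothetical helper, and passing the node index to the generator is an assumption that mirrors the indexing idea above.

```python
import xml.etree.ElementTree as ET


def replace_text_nodes(element, generate):
    """Replace each direct, non-empty text node of `element`; return the count."""
    count = 0
    if element.text:
        element.text = generate(count)
        count += 1
    for child in element:
        # Text following a child element is stored as that child's tail.
        if child.tail:
            child.tail = generate(count)
            count += 1
    return count


a = ET.fromstring("<a>OldText<b/>MoreText</a>")
values = ["123", "456"]  # fixed stand-ins for RandomInt(0, 999) output

replaced = replace_text_nodes(a, lambda index: values[index])
result = ET.tostring(a, encoding="unicode")
```

The nested `<b/>` element is left untouched; only the two surrounding text nodes change.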

--

2. Namespaces and Prefixes

The RFC covers namespace handling for attributes and elements, but it's focussed on locating elements. It isn't clear whether prefixes or values can be dynamically generated or replaced.

My assumption is that we would not support this.
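For context, locating an element by namespace while leaving the namespace itself alone might look like the sketch below. The namespace URI and element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<ns1:order xmlns:ns1="urn:example:orders">'
    "<ns1:id>1</ns1:id>"
    "</ns1:order>"
)

# The lookup prefix ("o") need not match the document prefix ("ns1");
# elements are matched on the namespace URI.
NAMESPACES = {"o": "urn:example:orders"}

root = ET.fromstring(SOURCE)
id_element = root.find("o:id", NAMESPACES)

# Hypothetical generator step: only the text value changes.
id_element.text = "42"
```

Note that ElementTree may rewrite prefixes when serializing, so an implementation that must preserve the original prefixes would need to handle that explicitly.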

--

3. Comments

The W3C Recommendation for XML, specifically XML 1.0 (Fifth Edition), mentions the following with regards to comments:

  • Comments are ignored by XML processors except for their presence (i.e., they do not affect the data model).
  • They are not part of the document’s logical structure and are purely for human readability or annotation.
  • They must not appear within markup (e.g., inside a tag name or attribute).

As such, I think we can ignore them entirely.

Having said that, has anyone ever run into a scenario which (for whatever reason) required a comment to be present to pass validation?

No, I haven't.

--

4. Processing Instructions e.g. <?xml-stylesheet?>

  • The RFC does not mention how processing instructions should be treated.
  • Should they be ignored, or could generators modify them?
  • Example:
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<root>...</root>

Processing instructions are application-specific; changing them doesn't technically impact semantics, but they may be important application-level instructions (e.g. used in a data pipeline down the line).
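One practical risk is that a parse-and-reserialize step silently drops processing instructions. This sketch shows that behaviour with Python's default ElementTree parser; whether the library used by Pact behaves the same way is not something this example establishes.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>\n'
    "<root>text</root>"
)

root = ET.fromstring(SOURCE)

# The default parser ignores processing instructions, so a round trip
# through the element tree loses the stylesheet instruction.
round_tripped = ET.tostring(root, encoding="unicode")
```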

5 participants