Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI Parser design change proposal for performance and stability #3193

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

habbes
Copy link
Contributor

@habbes habbes commented Mar 4, 2025

Issues

This pull request fixes #xxx.

Description

Briefly describe the changes of this pull request.

Checklist (Uncheck if it is not completed)

  • Test cases added
  • Build and test with one-click build and test script passed

Additional work necessary

If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.

@habbes habbes changed the title URI Parser rewrite for performance and stability URI Parser design change proposal for performance and stability Mar 4, 2025
public ExpressionNodeKind Kind => _parser[_index].Kind;
public int GetIntValue()
{
return int.TryParse(_parser[_index].Range.GetValue(_parser.Source));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this is meant to be _parser._nodes[_index]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the parser could have an indexer as well that forwards to _parser._nodes.

```csharp
interface IQueryNodeHandler<T>
{
public T HandleIntNode(QueryNode intLiteralNode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a huge perf overhead for these methods to take in more specific types? Like, it would be great if we could make this IntQueryNode instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we have to be very careful about this because we will need to constantly add support for async, unsafe, methods that require parameterized context, allows ref struct, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your suggestions, I will investigate the cost of adding more specific types on top QueryNode for the different methods, e.g. Handle(IntQueryNode)

Where IntQueryNode could be defined as

struct IntQueryNode
{
     private QueryNode _node;
     public int GetIntValue() => _node.GetIntValue();
}

Maybe it could be optimized to a ref struct itself or it could reference the expression parser directly if such optimizations prove meaningful and practical.

The key changes proposed:

- Avoid premature string allocations. Reference segments in the source string using `ReadOnlySpan<char>` until the user explicitly asks for a `string`.
- Store parsed nodes in array of structs instead of graph of objects.
Copy link
Contributor

@corranrogue9 corranrogue9 Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be clear about this aspect and the terminology that we are using. By not having a tree, this is more akin to the readers that we have. This is not really "parsing" anymore. There's nothing wrong with this, I'm more having an issue with the terminology, and the terminology is important to me because (as you note later in the document), we don't differentiate between the different kinds of notes (in the same way that the current reader doesn't): the caller is expected to understand what kinds of nodes can come next and needs to be aware of this based on the context of the previous nodes.

```csharp
internal struct ExpressionNode
{
public ExpressionNodeKind Kind { get; internal set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use classes to model the enumeration instead of an enum, we could have a visitor on those classes which would help implement the code that generates the ref struct nodes that could be used in the handler that I suggested below

Copy link
Contributor Author

@habbes habbes Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could compare both and see what the perf overhead is for each. But I assume the enum would have less overhead, it's smaller value, which could make it easier to pack the struct more densely and we don't have to cast.

Another I opted for an enum is that we could pack additional data in the enum. For example we could have a bit flag that tells us whether the item is a binary operator, another to tell us whether the string needs escaping or whether it could be returned as is (in which case we could return a span of the string to the user without allocating), etc. If we have an int enum we could use the lower 16 bits for the value and higher 16 bits for flags (we could also go for a smaller enum type depending on how many node kinds we'll have and whether alignment would make it worthwhile). Such flags could help optimize the common case. But I haven't yet done benchmarks to prove their worth.

@@ -0,0 +1,272 @@
# OData URI Parser design change for higher performance and reliability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was brought up in the meeting that we have the whole URI string in memory, so we should leverage that for performance gains instead of trying to overly generalize. Have you considered what it looks like for someone to compose an OData URI? For example, in the OData Client, we compose the URI as the caller tells us what data they are requesting. In that situation, we actually don't have the full URI anywhere in memory. Are we thinking we will have an entirely separate AST model for the client code to leverage?

Copy link
Contributor Author

@habbes habbes Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client can either consume or produce OData URI. When it consumes OData URI, the URI is available in full, in memory. When it produces the URI, for example when translating LINQ expressions to an OData request, it generates the URI on the fly using a StringBuilder, the output of the translation will be string that's in memory, but even if it wasn't, this URI generation is not the role of the parser. The parser as described in this doc (and as implemented by the existing ODataUriParser) takes the URI as input, it does not generate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants