
Handle token stream in Parser #616

Closed · 4 commits
Conversation

@ehaas (Collaborator) commented Jan 18, 2024

This is very far from being ready for review, but @Vexu, I wanted to get your feedback on the general approach before going any further, since it will be a fairly large change.

The basic idea is that a significant source of memory usage currently is that we completely preprocess the file before parsing it. With large files, that means we have a lot of tokens lying around even though most don't need to be saved. For example, zig2.c results in a pp.tokens.capacity of 23,031,552 after preprocessing (552,757,248 bytes in ReleaseFast, not counting expansion locations).

But in general, the only tokens we need to save are those used in decl and decl_ref tree nodes. Things like keywords, semicolons and other punctuation, literals, etc. aren't needed after they're parsed. So the general idea is to have the parser pull in tokens from the preprocessor as needed, instead of doing it all up front. This is handled by having a stack of tokenizers in the preprocessor (#include pushes a new tokenizer onto the stack; finishing the file pops it). Macros are fully expanded into an array list, so we don't need to store additional state in the preprocessor. Parser backtracking is handled by "checkpoints" that prevent the preprocessor's token array from being cleared while any checkpoints are active.
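Roughly, the shape I have in mind is sketched below. This is a minimal illustration with hypothetical names, assuming a managed std.ArrayList as in older Zig std; macro expansion and real token ids are left out, and none of these are aro's actual types:

```zig
const std = @import("std");

// Hypothetical sketch of the scheme described above; not aro's real types.
const Token = struct { id: u32, loc: u32 };

// One tokenizer per open file; here each "token" is just a byte.
const Tokenizer = struct {
    source: []const u8,
    index: usize = 0,

    fn next(self: *Tokenizer) ?Token {
        if (self.index >= self.source.len) return null; // end of this file
        defer self.index += 1;
        return .{ .id = self.source[self.index], .loc = @intCast(self.index) };
    }
};

const Preprocessor = struct {
    tokenizers: std.ArrayList(Tokenizer), // stack; top = file being read
    tokens: std.ArrayList(Token), // grows only while checkpoints are live
    checkpoints: usize = 0,

    // The parser calls this on demand instead of indexing a
    // fully-preprocessed token list.
    fn next(pp: *Preprocessor) !?Token {
        // With no active checkpoints, already-consumed tokens can go away.
        if (pp.checkpoints == 0) pp.tokens.clearRetainingCapacity();
        while (pp.tokenizers.items.len > 0) {
            const top = &pp.tokenizers.items[pp.tokenizers.items.len - 1];
            if (top.next()) |tok| {
                try pp.tokens.append(tok);
                return tok;
            }
            _ = pp.tokenizers.pop(); // file finished; resume the includer
        }
        return null;
    }

    // Backtracking support: while a checkpoint is held, `tokens` is
    // pinned, so the parser can rewind to the returned index.
    fn checkpoint(pp: *Preprocessor) usize {
        pp.checkpoints += 1;
        return pp.tokens.items.len;
    }

    fn release(pp: *Preprocessor) void {
        pp.checkpoints -= 1;
    }
};
```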

@Vexu (Owner) commented Jan 18, 2024

> The basic idea is that a significant source of memory usage currently is that we completely preprocess the file before parsing it. With large files, that means we have a lot of tokens lying around even though most don't need to be saved. For example, zig2.c results in a pp.tokens.capacity of 23,031,552 after preprocessing (552,757,248 bytes in ReleaseFast, not counting expansion locations).

I'm guessing you did some analysis on the memory usage; do you have any more detailed stats, like what percentage of the memory usage is just tokens?

Either way, this change looks like it makes the parser code a bit more readable, with helper functions and values used instead of fields and indexes. But that holds only if this results in an equivalent or faster runtime. Since the main target of Aro is modern systems, using more memory for improved performance is acceptable as long as we can stay under a reasonable amount, like 1 GB, for even the largest inputs, so any performance regression should be well justified by a significant memory usage improvement.

> But in general, the only tokens we need to save are those used in decl and decl_ref tree nodes.

Note that I plan to add (at least optionally) source locations to all nodes for use in translate-c and debug info generation in codegen.

@ehaas (Collaborator, Author) commented Jan 18, 2024

I agree it's only worth doing if performance is equal to or faster than the status quo - just wanted to make sure you didn't see any obvious red flags. The total RSS of the zig2.c parse for me with ReleaseFast is 1.4 GB, so roughly a third of it is tokens. It has 7,477,094 identifier tokens, so if those are indeed the only ones that need to be saved, then I think we'd save about 370 MB, plus expansion locations, which take an additional 40 MB. So if all goes according to plan, that would get us down near the 1 GB mark.
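For reference, the arithmetic behind those estimates, derived entirely from the capacity figures in the first comment (which imply 24 bytes per token):

```
bytes per token:  552,757,248 / 23,031,552 = 24
tokens dropped:   23,031,552 - 7,477,094  = 15,554,458
bytes saved:      15,554,458 * 24         = 373,306,992 ~= 370 MB
plus ~40 MB of expansion locations        ~= 410 MB total
1.4 GB - ~0.41 GB                         ~= 1 GB
```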

I think preserving node source locations will be compatible with this change - would that just be another field in Node?

@Vexu (Owner) commented Jan 18, 2024

It looks like it would improve readability, since it removes the data-oriented "abstraction", so no objections on that front.

This is compatible with the node source location plan; I was just noting that it will increase the number of tokens that need to be saved.
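For illustration, the shape being discussed might look something like this. This is a hypothetical layout only, not aro's actual Tree/Node representation:

```zig
// Hypothetical sketch only; aro's real Node representation differs.
const SourceLocation = struct {
    source_id: u32, // which file (or macro expansion) the node came from
    byte_offset: u32, // position within that source
};

const Node = struct {
    tag: Tag,
    data: u64, // node payload, as before

    // The field under discussion: a source location on every node, which
    // means the corresponding token/location must outlive parsing.
    loc: SourceLocation,

    const Tag = enum {
        decl,
        decl_ref,
        // ...
    };
};
```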

@ehaas (Collaborator, Author) commented Feb 6, 2024

I'm going to cancel this; it ended up being really invasive and given the uncertain outcome I'm not sure I want to pursue it. I'm working on a new idea for storing fewer expansion location pointers that's much more straightforward. I have a proof of concept working for that, just need to clean it up and take some measurements.

@ehaas closed this on Feb 6, 2024