
Handle token stream in Parser #616

Closed · 4 commits
Conversation

@ehaas (Collaborator) commented Jan 18, 2024

This is very far from being ready for review, but @Vexu, I wanted to get your feedback on the general approach before going any further, since it will be a fairly large change.

The basic idea is that a significant source of memory usage currently is that we completely preprocess the file before parsing it. With large files, that means we have a lot of tokens lying around even though most don't need to be saved. For example, zig2.c results in a pp.tokens.capacity of 23,031,552 after preprocessing (552,757,248 bytes in ReleaseFast, not counting expansion locations).

But in general, the only tokens we need to save are those used in decl and decl_ref tree nodes. Things like keywords, semicolons and other punctuation, literals, etc. aren't needed after they're parsed. So the general idea is to have the parser pull in tokens from the preprocessor as needed, instead of doing it all up front. This is handled by having a stack of tokenizers in the preprocessor (#include pushes a new tokenizer onto the stack; finishing the file pops it). Macros are fully expanded into an array list, so we don't need to store additional state in the preprocessor. Parser backtracking is handled by "checkpoints" that prevent the preprocessor's token array from being cleared while any checkpoints are active.
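Roughly, the shape I have in mind is sketched below. This is a minimal illustration with hypothetical names, assuming a managed std.ArrayList as in older Zig std; macro expansion and real token ids are left out, and none of these are aro's actual types:

```zig
const std = @import("std");

// Hypothetical sketch of the scheme described above; not aro's real types.
const Token = struct { id: u32, loc: u32 };

// One tokenizer per open file; here each "token" is just a byte.
const Tokenizer = struct {
    source: []const u8,
    index: usize = 0,

    fn next(self: *Tokenizer) ?Token {
        if (self.index >= self.source.len) return null; // end of this file
        defer self.index += 1;
        return .{ .id = self.source[self.index], .loc = @intCast(self.index) };
    }
};

const Preprocessor = struct {
    tokenizers: std.ArrayList(Tokenizer), // stack; top = file being read
    tokens: std.ArrayList(Token), // grows only while checkpoints are live
    checkpoints: usize = 0,

    // The parser calls this on demand instead of indexing a
    // fully-preprocessed token list.
    fn next(pp: *Preprocessor) !?Token {
        // With no active checkpoints, already-consumed tokens can go away.
        if (pp.checkpoints == 0) pp.tokens.clearRetainingCapacity();
        while (pp.tokenizers.items.len > 0) {
            const top = &pp.tokenizers.items[pp.tokenizers.items.len - 1];
            if (top.next()) |tok| {
                try pp.tokens.append(tok);
                return tok;
            }
            _ = pp.tokenizers.pop(); // file finished; resume the includer
        }
        return null;
    }

    // Backtracking support: while a checkpoint is held, `tokens` is
    // pinned, so the parser can rewind to the returned index.
    fn checkpoint(pp: *Preprocessor) usize {
        pp.checkpoints += 1;
        return pp.tokens.items.len;
    }

    fn release(pp: *Preprocessor) void {
        pp.checkpoints -= 1;
    }
};
```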

@Vexu (Owner) commented Jan 18, 2024

> The basic idea is that a significant source of memory usage currently is that we completely preprocess the file before parsing it. With large files, that means we have a lot of tokens lying around even though most don't need to be saved. For example, zig2.c results in a pp.tokens.capacity of 23,031,552 after preprocessing (552,757,248 bytes in ReleaseFast, not counting expansion locations).

I'm guessing you did some analysis on the memory usage; do you have any more detailed stats, like what percentage of the memory usage is just tokens?

Either way, this change looks like it makes the parser code a bit more readable, with helper functions and values used instead of fields and indexes. But that holds only if this results in an equivalent or faster runtime. Since the main target of Aro is modern systems, using more memory for improved performance is acceptable as long as we can stay under a reasonable amount, like 1 GB, for even the largest inputs, so any performance regression should be well justified by a significant memory usage improvement.

> But in general, the only tokens we need to save are those used in decl and decl_ref tree nodes.

Note that I plan to add (at least optionally) source locations to all nodes for use in translate-c and debug info generation in codegen.

@ehaas (Collaborator, Author) commented Jan 18, 2024

I agree it's only worth doing if performance is equal to or faster than the status quo - just wanted to make sure you didn't see any obvious red flags. The total RSS of the zig2.c parse for me with ReleaseFast is 1.4 GB, so roughly a third of it is tokens. It has 7,477,094 identifier tokens, so if those are indeed the only ones that need to be saved, then I think we'd save about 370 MB, plus expansion locations, which take an additional 40 MB. So if all goes according to plan, that would get us down near the 1 GB mark.
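For reference, the arithmetic behind those estimates, derived entirely from the capacity figures in the first comment (which imply 24 bytes per token):

```
bytes per token:  552,757,248 / 23,031,552 = 24
tokens dropped:   23,031,552 - 7,477,094  = 15,554,458
bytes saved:      15,554,458 * 24         = 373,306,992 ~= 370 MB
plus ~40 MB of expansion locations        ~= 410 MB total
1.4 GB - ~0.41 GB                         ~= 1 GB
```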

I think preserving node source locations will be compatible with this change - would that just be another field in Node?

@Vexu (Owner) commented Jan 18, 2024

It looks like it would improve readability, since it removes the data-oriented "abstraction", so no objections on that front.

This is compatible with the node source location plan; I was just noting that it will increase the number of tokens that need to be saved.
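For illustration, the shape being discussed might look something like this. This is a hypothetical layout only, not aro's actual Tree/Node representation:

```zig
// Hypothetical sketch only; aro's real Node representation differs.
const SourceLocation = struct {
    source_id: u32, // which file (or macro expansion) the node came from
    byte_offset: u32, // position within that source
};

const Node = struct {
    tag: Tag,
    data: u64, // node payload, as before

    // The field under discussion: a source location on every node, which
    // means the corresponding token/location must outlive parsing.
    loc: SourceLocation,

    const Tag = enum {
        decl,
        decl_ref,
        // ...
    };
};
```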

@ehaas (Collaborator, Author) commented Feb 6, 2024

I'm going to cancel this; it ended up being really invasive and given the uncertain outcome I'm not sure I want to pursue it. I'm working on a new idea for storing fewer expansion location pointers that's much more straightforward. I have a proof of concept working for that, just need to clean it up and take some measurements.

@ehaas closed this on Feb 6, 2024