This is very far from being ready for review, but @Vexu I wanted to get your feedback on the general approach before going any further, since it will be a fairly large change.
The basic idea is that a significant source of memory usage right now is that we completely preprocess the file before parsing it. With large files, that means we keep a lot of tokens around even though most of them never need to be saved. For example, `zig2.c` results in a `pp.tokens.capacity` of 23,031,552 after preprocessing (552,757,248 bytes in ReleaseFast, not counting expansion locations).

But in general, the only tokens we need to save are those used in `decl` and `decl_ref` tree nodes. Things like keywords, semicolons and other punctuation, literals, etc. aren't needed after they're parsed. So the general idea is to have the parser pull in tokens from the preprocessor as needed, instead of doing it all up front. This is handled by having a stack of tokenizers in the preprocessor (`#include` pushes a new tokenizer onto the stack; finishing the file pops it); a rough sketch of what this could look like follows below. Macros are fully expanded into an array list, so we don't need to store additional state in the preprocessor. Parser backtracking is handled by having "checkpoints" that prevent the preprocessor's token array from being cleared while any checkpoint is active.
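
To make the approach concrete, here's a minimal sketch of the streaming interface and the tokenizer stack. All names and shapes here (`Preprocessor.next`, `readInclude`, the simplified `Token`/`Tokenizer` types) are hypothetical illustrations rather than the actual arocc API, and it's written against the managed `std.ArrayList` interface:

```zig
const std = @import("std");

// Hypothetical, simplified types for illustration only.
const Token = struct {
    id: Id,
    start: u32,
    end: u32,

    const Id = enum { identifier, hash_include, eof };
};

const Tokenizer = struct {
    source: []const u8,
    index: u32 = 0,

    // Real lexing elided; returns .eof once the buffer is exhausted.
    fn next(self: *Tokenizer) Token {
        if (self.index >= self.source.len) {
            return .{ .id = .eof, .start = self.index, .end = self.index };
        }
        const start = self.index;
        self.index += 1;
        return .{ .id = .identifier, .start = start, .end = self.index };
    }
};

const Preprocessor = struct {
    gpa: std.mem.Allocator,
    /// One tokenizer per open file; `#include` pushes, end-of-file pops.
    tokenizers: std.ArrayList(Tokenizer),
    /// Tokens handed to the parser so far; cleared periodically unless a
    /// backtracking checkpoint is active (see the second sketch below).
    tokens: std.ArrayList(Token),
    checkpoint_count: u32 = 0,

    /// Called by the parser each time it needs another token, instead of
    /// indexing into a fully preprocessed token array.
    fn next(pp: *Preprocessor) !Token {
        while (true) {
            const tokenizer = &pp.tokenizers.items[pp.tokenizers.items.len - 1];
            const tok = tokenizer.next();
            switch (tok.id) {
                .eof => {
                    // Finished an included file; resume tokenizing the includer.
                    if (pp.tokenizers.items.len > 1) {
                        _ = pp.tokenizers.pop();
                        continue;
                    }
                    return tok;
                },
                .hash_include => {
                    // Resolve the header and start tokenizing it.
                    const source = try pp.readInclude(tokenizer);
                    try pp.tokenizers.append(.{ .source = source });
                },
                else => {
                    try pp.tokens.append(tok);
                    return tok;
                },
            }
        }
    }

    fn readInclude(pp: *Preprocessor, tokenizer: *Tokenizer) ![]const u8 {
        // Filename lexing and file loading elided from the sketch.
        _ = pp;
        _ = tokenizer;
        return "";
    }
};
```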
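And a similarly hedged sketch of the checkpoint bookkeeping, continuing the `Preprocessor` above: the parser grabs a checkpoint before any speculative parse, and the buffered tokens are only discarded when no checkpoint is live (again, `Checkpoint`, `checkpoint`, and `releaseTokens` are illustrative names, not the real API):

```zig
/// While any checkpoint is live, `pp.tokens` keeps everything, so the
/// parser can rewind to a saved index after a failed speculative parse.
const Checkpoint = struct {
    pp: *Preprocessor,
    token_index: usize,

    /// Dropping the checkpoint allows the token buffer to be trimmed again.
    fn deinit(c: Checkpoint) void {
        c.pp.checkpoint_count -= 1;
    }
};

fn checkpoint(pp: *Preprocessor) Checkpoint {
    pp.checkpoint_count += 1;
    return .{ .pp = pp, .token_index = pp.tokens.items.len };
}

/// Called at safe points (e.g. after finishing a declaration); buffered
/// tokens are only discarded once backtracking is no longer possible.
fn releaseTokens(pp: *Preprocessor) void {
    if (pp.checkpoint_count == 0) pp.tokens.clearRetainingCapacity();
}
```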