-
Notifications
You must be signed in to change notification settings - Fork 502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preallocate storage for log attributes on stack #1965
Preallocate storage for log attributes on stack #1965
Conversation
Have also included the Boxing of AnyValue enum as part of this PR, as that overall affects the size of the preallocated buffer, and is a small change. There is an improvement with the bench, but stress throughput has dropped drastically. Need to look into that, along with memory stats. full-log-with-4-attributes/no-context: stress test: |
Thanks @cijothomas for review. Have resolved the tests to not rely on the internal implementation details of the attribute storage. Let me know if you still see any issues. |
@lalitb and I observed that in some machines the perf is getting de-graded (benchmarks and stress tests) with the change. We are trying to understand what might be causing that, even though we greatly reduce heap allocation. Holding off merge until we understand the source. |
Adding the relevant part of flamegraph before (main branch) and after (PR branch): The cost of execution of opentelemetry-rust/opentelemetry-sdk/src/logs/log_emitter.rs Lines 276 to 279 in d583695
i.e,
The cost of moving the data (i.e., the array) outweighs the benefit of initial creation and population. This leads to increased latency. The option could be using references or smart pointers (Rc/Arc) over attribute array to mitigate this issue, but they come with their own trade-offs - increased complexity, lifetime management challenge, potential overhead of indirections. Also attaching the full flamegraph svg files if someone is interested in doing more analysis :) Main branch: PR branch: |
Thanks @lalitb for getting to the root cause of the observed behavior! This is similar to what we faced in OTel .NET where using stack allocation to avoid heap (and gc pressure) actually increased hot path cost, due to increase in amount of stack to stack copying! https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Metrics/MetricsBenchmarks.cs#L42-L52 |
Great analysis @lalitb! |
I have now modified the Before: #[derive(Clone, Debug)]
pub struct LogData {
/// Log record
pub record: LogRecord,
/// Instrumentation details for the emitter who produced this `LogEvent`.
pub instrumentation: InstrumentationLibrary,
} After: #[derive(Clone, Debug)]
pub struct LogData<'a> {
/// Log record, which can be borrowed or owned.
pub record: Cow<'a, LogRecord>,
/// Instrumentation details for the emitter who produced this `LogEvent`.
pub instrumentation: Cow<'a, InstrumentationLibrary>,
} This allows us to avoid copying of LogRecord structure when used with a synchronous processor.
I can see ~20% improvement with this:
|
https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-sdk/benches/log_processor.rs#L9-L10 Can you check the effect on this benchmark with/without latest commit? (Trying to see if our benchmarks are actually helping us make more informed decisions!) |
Before: After: We don't do cloning anymore in Simple/Concurrent processor, so benchmark 2 won't be used. bench 3 is the batch scenario, and this will now do the stack memory copy, and so more cost. |
I am also seeing same perf results!(42 M/sec in stress test) Thanks for continuously striving to get us better and better! I just compared with OTel .NET's similar setup in same machine, and OTel .NET gets around 46 M/sec, so we are getting pretty close to match .NET (which is heavily optimized already and GA for years). With the next set of planned improvements, we should easily exceed that as well. I requested to extract out the changes to LogData/lifetime into own PR, so this one just has only the change to stack-allocate attributes and nothing more. |
56074a2
to
f3b63fb
Compare
The PR now only contains the growable array related changes. Should be ready to merge if there are no further comments. The feature-flag for heap allocated attributes would be added subsequently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left a nit about benchmark results not in sync with the PR desc.
A combined changelog highlighting all perf improvements could be added in a follow up?
Have fixed the benchmark result as per the desc, along with the comment. Yes planning to add the combined changelog as follow up. With this, we've hit the 100th comment on this PR :) |
merged the other PR, so this need to handle the conflict. Sorry I wasn't paying attention to the merge order, Hopefully this is an easy conflict to resolve and we are good merge. This is comment 102 🤣 ! (We still haven't broken the records of some PRs in OTel specs) |
Yes, fixed the merge conflict now :) |
The idea of using stack for attributes is not super common, and may be undesired in some scenarios. We do have precedence in Go, .NET and even in Rust for similar changes. But this change is done as a purely internal implementation detail, so it is possible to change this in the future, without any breaking change. Thanks Lalit for continuously pursuing better performance and thanks everyone who helped review. |
towards #1920
Changes
The benchmark and stress results as below. A negligible improvement in perf for stress, while benchmark has further improved from 5%-10%.
CHANGELOG.md
files updated for non-trivial, user-facing changes