-
Notifications
You must be signed in to change notification settings - Fork 502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Allow overlapping context scopes #2378
fix: Allow overlapping context scopes #2378
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2378 +/- ##
======================================
Coverage 79.5% 79.6%
======================================
Files 123 123
Lines 22923 23041 +118
======================================
+ Hits 18242 18356 +114
- Misses 4681 4685 +4 ☔ View full report in Codecov by Sentry. |
b64d904
to
dd46520
Compare
ab2f259
to
4502611
Compare
828ba76
to
ea537b8
Compare
@bantonsson Can you add some details on how this fixes the issue of contexts being dropped out-of-order? Perhaps some code comments will help future readers. |
@cijothomas I'll clean this up a bit more and add comments describing the |
ecb0279
to
94ddd7e
Compare
@cijothomas Do you know if the integration tests are flaky? The previous run failed but now it succeeded without any functional changes to the code. |
You can ignore it for now. Integration tests are not marked required for merging PRs. It does has some instability due to spinning up OTel Collector in a docker and multiple tests using the same instance etc. That will be addressed separately. |
Interesting. I too am curious to know a bit more about What is the expected behavior of the context stack if new contexts are pushed during out-of-order pops? |
@shaun-cox So this is not for a normal use case, but to ensure that the The same type of mechanism exists in the This might be obvious but just to be clear, the pushing and popping of a |
94ddd7e
to
c202910
Compare
@bantonsson Thank you for the clarification... it is helpful. I was originally interpretting the associated bug and this PR as an indication that ContextGuards could be dropped in the wrong order by correctly functioning application code. My understanding now is that this change is kind of "defense-in-depth" change to allow for buggy application code to keep executing? If that is the case, it kind of goes against the spirit of idiomatic Rust (IMO) which says that one should panic at the first sign of a bug, so that it can be fixed. Is it not sufficient to avoid this complexity and associated hiding of application bugs and just add a |
OTel Spec explicitly prohibit crash/panic/exception except when at the initialization stage. However, this is probably a place where Rust idiomatic ways supersede specs, and panic, as there is no "correct" way for us to resolve this user issue. |
@shaun-cox I agree that dropping the Maybe we should add this lenient mode behind a feature flag? |
I don't think logging/tracing code should panic except during explicit calls in the setup phase.
If possible, you should not panic in If you decide that this is unnecessary complexity added just for broken programs, I think the current behavior (i.e. just broken spans but no crashes) with additional |
I think that if we want to have any chance of real interoperability with |
I might be missing the larger picture here too... what is meant by "real interoperability with |
Thanks for pointing this out @shaun-cox. I see that my comment is lacking some context (pun intended). Long story short, the existing |
7c0988e
to
ca2501e
Compare
Since the github action can't post any benchmark results on fork PRs and there is a metric benchmark that fails I'm just going to post the results for
|
5fe5181
to
7725b37
Compare
I've implemented the discussed behavior now. |
7725b37
to
c97dc97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for me. There are generally a lack of tests for Context area, but we should address that separately from this.
Context is not announced as RC/Stable yet, needs to finish tests coverage before that.
Also - I wonder if we could do the perf improvement in its own PR, separate from supporting out-of-order drops? Or is the perf gain strictly tied to supporting out-of-order drops? (Not a blocker)
@open-telemetry/rust-approvers PTAL. I could use 1-2 extra reviews before merging. |
@bantonsson Could you update PR desc. with the perf results from the most recent run. Also add a changelog entry for this. |
@mladedav Could you review this please? |
0799947
to
20ecab7
Compare
next_id as u16 | ||
} else { | ||
// This is an overflow, log it and ignore it. | ||
otel_warn!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to attempt purging stale entries when the limit is reached before issuing a warning? Not for this PR, but possibly as a future enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current limit is already quite high, so the simple protection is good enough in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the limit is pretty high, but it would be better to still proactively clear out stale entries when it’s reached, especially for long-running applications. We could revisit this later as a future improvement after we gather more usage data or feedback. For now, I’m good with moving forward as-is. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no stale entries in the stack. When an entry is removed out of order, it is replaced by a None
, so there can be holes in the stack. Since the id
in the ContextGuard
is the position in the stack, we can't resize it until the topmost element is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just small documentation nits but otherwise this looks good to me.
opentelemetry/src/context.rs
Outdated
/// [`ContextGuard`] instances that are constructed using it can't be shared with | ||
/// other threads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, I think "can't be shared with other threads" written like this usually means !Sync
while here the guards are just !Send
.
I think changing it to "can't be moved to other threads" would be clearer.
opentelemetry/src/context.rs
Outdated
/// The stack relies on the fact that it is thread local and that the | ||
/// [`ContextGuard`] instances that are constructed using it can't be shared with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This talks about guards constructed using the stack, but they're just stored there and not constructed using it? Or what does the "it" here refer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is not clear. This refers to the id
in the guards
coming from the use of the stack
. I'll rephrase it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. It would also be helpful to have a design doc for the current Context implementation here: https://github.com/open-telemetry/opentelemetry-rust/tree/main/docs/design. Something to consider for a separate PR.
20ecab7
to
08ec6e1
Compare
Fixes #1887
Changes
This PR changes the single current
Context
to a more robust stack that is resilient to out of order and overlappingContext
scopes. This robustness is needed to be able to support better interop with tokio-rs/tracing.There is also an optimization based on the fact that the
Map
in theContext
is immutable, and these are the performance benchmarks for all the context related benchmarks.context/has_active_span/in-cx/alt
8.6±1.15ns
8.4±0.15ns
-2.0
context/has_active_span/in-cx/spec
16.5±0.06ns
16.5±0.05ns
0.0
context/has_active_span/no-cx/alt
8.4±0.01ns
8.4±0.02ns
0.0
context/has_active_span/no-cx/spec
15.6±0.05ns
15.2±0.07ns
-2.0
context/has_active_span/no-sdk/alt
8.4±0.02ns
8.4±0.04ns
0.0
context/has_active_span/no-sdk/spec
15.6±0.06ns
15.2±0.05ns
-2.0
context/is_recording/in-cx/alt
4.7±0.05ns
4.7±0.07ns
0.0
context/is_recording/in-cx/spec
18.1±0.12ns
17.8±0.13ns
-2.0
context/is_recording/no-cx/alt
4.7±0.05ns
4.7±0.06ns
0.0
context/is_recording/no-cx/spec
15.9±0.06ns
15.9±0.12ns
0.0
context/is_recording/no-sdk/alt
4.7±0.05ns
4.7±0.04ns
0.0
context/is_recording/no-sdk/spec
15.9±0.06ns
15.9±0.08ns
0.0
context/is_sampled/in-cx/alt
8.7±0.18ns
8.7±0.02ns
0.0
context/is_sampled/in-cx/spec
16.8±0.03ns
16.5±0.04ns
-2.0
context/is_sampled/no-cx/alt
8.7±0.03ns
8.7±0.06ns
0.0
context/is_sampled/no-cx/spec
15.6±0.11ns
15.3±0.14ns
-2.0
context/is_sampled/no-sdk/alt
8.7±0.03ns
8.7±0.04ns
0.0
context/is_sampled/no-sdk/spec
15.6±0.11ns
15.2±0.07ns
-2.0
context_attach/nested_cx/empty_cx
48.6±0.31ns
29.4±0.26ns
-39
context_attach/nested_cx/single_value_cx
96.9±1.66ns
30.8±0.11ns
-68
context_attach/nested_cx/span_cx
55.7±0.21ns
30.9±0.28ns
-44
context_attach/out_of_order_cx_drop/empty_cx
51.6±0.60ns
38.0±0.27ns
-26
context_attach/out_of_order_cx_drop/single_value_cx
161.5±673.14ns
39.5±0.79ns
-76
context_attach/out_of_order_cx_drop/span_cx
61.0±0.37ns
39.6±0.29ns
-35
context_attach/single_cx/empty_cx
29.3±0.23ns
16.1±0.09ns
-45
context_attach/single_cx/single_value_cx
43.5±1.09ns
17.6±0.09ns
-60
context_attach/single_cx/span_cx
29.5±0.14ns
17.2±0.12ns
-42
Merge requirement checklist
CHANGELOG.md
files updated for non-trivial, user-facing changes