
fix: process parallel events and potential deadlock in on_record #95

Merged
merged 5 commits into tokio-rs:v0.1.x from dm/parallel on Feb 26, 2024

Conversation

mladedav
Contributor

@mladedav mladedav commented Feb 17, 2024

Motivation

#94 - Multi-threaded tracing drops most of the events

Solution

I have reorganized the code so that the lock on extensions is held while the span data is updated, but not while `record` is called, which should preserve the fix for the deadlock originally solved in #59.
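A minimal sketch of that reorganization, using a plain `std::sync::Mutex` and hypothetical names (the real code works with the span's extensions and `SpanBuilder` from the tracing/opentelemetry crates): the visitor output is collected first, with no lock held, and the lock is only taken afterwards to merge the result in.

```rust
use std::sync::Mutex;

// Stand-in for the data guarded behind the span's extensions.
struct Extensions {
    attributes: Vec<(String, String)>,
}

// Run user-supplied recording code WITHOUT holding the lock, then take
// the lock only briefly to merge the collected updates.
fn on_record(extensions: &Mutex<Extensions>, record: impl Fn(&mut Vec<(String, String)>)) {
    // 1. Collect new data lock-free; `record` may itself try to lock
    //    `extensions` (e.g. via a user `Debug` implementation).
    let mut updates = Vec::new();
    record(&mut updates);

    // 2. Now take the lock, briefly, to merge the updates in.
    let mut guard = extensions.lock().unwrap();
    guard.attributes.extend(updates);
}

fn main() {
    let extensions = Mutex::new(Extensions { attributes: Vec::new() });

    // This visitor re-enters the lock; it would deadlock if `record`
    // were invoked with the guard held, but is safe here.
    on_record(&extensions, |out| {
        let held = extensions.lock().unwrap();
        out.push(("count".into(), held.attributes.len().to_string()));
    });

    assert_eq!(extensions.lock().unwrap().attributes.len(), 1);
}
```

This is only an illustration of the locking discipline, not the crate's actual API.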

Benchmark

I've used the benchmark from #93. Please ignore the "filtered" vs "non_filtered" distinction; for the purposes of this PR, they should be identical cases.

See results in later comment as the code got changed.

@mladedav mladedav requested a review from jtescher as a code owner February 17, 2024 00:14
This deadlock can happen when extensions of a span are locked while we
visit either an event or attributes of a span. A `Debug` implementation
can also try to access the extensions which results in a deadlock.

See tokio-rs#59 and tokio-rs#95 for additional information.
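The deadlock described in the commit message can be illustrated with a minimal sketch using a plain `std::sync::Mutex` (a stand-in, not the actual extensions type): once a guard is held, any re-entrant lock attempt from user code cannot succeed.

```rust
use std::sync::Mutex;

fn main() {
    // Stand-in for a span's extensions.
    let extensions = Mutex::new(vec![("key", "value")]);

    // While the guard is held, a re-entrant lock attempt (e.g. from a
    // user `Debug` impl invoked by the visitor) cannot succeed.
    let guard = extensions.lock().unwrap();
    assert!(extensions.try_lock().is_err()); // a blocking `lock()` here would deadlock
    drop(guard);

    // Once the guard is dropped, the lock is available again.
    assert!(extensions.try_lock().is_ok());
}
```

Hence the fix: user code must only run after the guard has been dropped.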
@mladedav changed the title from "fix: process parallel events" to "fix: process parallel events and potential deadlock in on_record" on Feb 17, 2024
@mladedav
Contributor Author

I've dug into it a bit more and found that #59 fixed the deadlock only in `on_event`; it was actually present in `on_record` the whole time. I've added a test for this, but since the fixes for both overlap a lot, I've squashed them into a single commit where a structure separate from `SpanBuilder` is used in the visitor and the changes recorded there are then merged into the builder itself.
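A rough sketch of that separate structure (the names here are illustrative, not necessarily the crate's API): the visitor records into a small updates struct while no lock is held, and the updates are merged into the builder in one short step.

```rust
// Hypothetical stand-ins for the real SpanBuilder and the updates
// struct the visitor writes into.
#[derive(Default, Debug)]
struct SpanBuilder {
    name: String,
    attributes: Vec<(String, String)>,
}

#[derive(Default, Debug)]
struct SpanBuilderUpdates {
    name: Option<String>,
    attributes: Vec<(String, String)>,
}

impl SpanBuilderUpdates {
    // Merge the recorded changes into the builder in one step, so the
    // builder (and any lock guarding it) is touched only briefly.
    fn update(self, builder: &mut SpanBuilder) {
        if let Some(name) = self.name {
            builder.name = name;
        }
        builder.attributes.extend(self.attributes);
    }
}

fn main() {
    // The "visitor" fills the updates struct while no lock is held...
    let mut updates = SpanBuilderUpdates::default();
    updates.name = Some("renamed".into());
    updates.attributes.push(("key".into(), "value".into()));

    // ...and the merge into the real builder happens afterwards.
    let mut builder = SpanBuilder { name: "original".into(), ..Default::default() };
    updates.update(&mut builder);

    assert_eq!(builder.name, "renamed");
    assert_eq!(builder.attributes.len(), 1);
}
```

The same pattern serves both call sites, which is why the two fixes were squashed into one commit.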

Benchmark against current HEAD

The results here seem hopeful that this may actually even be an improvement. But again, this is my first time really using Criterion, so it would be great if you could double-check it.

     Running benches/trace.rs (target/release/deps/trace-811e6591eded5464)
Gnuplot not found, using plotters backend
otel_many_children/spec_baseline
                        time:   [22.606 µs 22.610 µs 22.614 µs]
                        change: [-0.8576% -0.8330% -0.8084%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
otel_many_children/no_data_baseline
                        time:   [50.365 µs 50.379 µs 50.394 µs]
                        change: [+1.1564% +1.2176% +1.2804%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
otel_many_children/data_only_baseline
                        time:   [58.989 µs 59.005 µs 59.021 µs]
                        change: [-3.9068% -3.8545% -3.8043%] (p = 0.00 < 0.05)
                        Performance has improved.
otel_many_children/full time:   [114.33 µs 114.34 µs 114.35 µs]
                        change: [-1.3542% -1.3192% -1.2730%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

otel_many_events/spec_baseline
                        time:   [42.680 µs 42.713 µs 42.749 µs]
                        change: [+0.2011% +0.3659% +0.5146%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
otel_many_events/no_data_baseline
                        time:   [104.22 µs 104.25 µs 104.27 µs]
                        change: [+1.4316% +1.4605% +1.4944%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
otel_many_events/data_only_baseline
                        time:   [145.88 µs 145.92 µs 145.96 µs]
                        change: [-1.4142% -1.3765% -1.3420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
otel_many_events/full_filtered
                        time:   [457.29 µs 457.37 µs 457.46 µs]
                        change: [-15.471% -15.440% -15.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
otel_many_events/full_not_filtered
                        time:   [465.70 µs 465.76 µs 465.82 µs]
                        change: [-16.094% -16.059% -16.028%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

Sorry for the late change to this.

Collaborator

@djc djc left a comment


I think this looks good. We may have been affected by the loss of parallel events, too -- would be nice to get this into the next release.

@mladedav
Contributor Author

@jtescher, do you think you'll have some time to take a look at this? I'd like to try to tackle the span limits next, but I think that work should be built on top of this, so I'd like to have at least a soft go-ahead first.

Collaborator

@jtescher jtescher left a comment


Yeah this looks like a reasonable solution to the concurrent otel data access. Thanks @mladedav

@jtescher jtescher merged commit 4f747a0 into tokio-rs:v0.1.x Feb 26, 2024
13 checks passed
@mladedav mladedav deleted the dm/parallel branch February 26, 2024 21:46
3 participants