Ignore client messages after stopping the IO task #1126

Merged
mkeeter merged 2 commits into oxidecomputer:main from ignore-messages-after-io-stop on Jan 31, 2024

Conversation

@mkeeter (Contributor) commented on Jan 30, 2024

This protects us from a subtle race condition:

  • The main task decides to stop the client task (e.g. because we've hit IO_OUTSTANDING_MAX)
    • It sends a stop request down the tokio::sync::oneshot channel
    • Then, it marks all jobs as skipped, which may result in discarding some of them
  • The IO task receives a message from the downstairs
    • It passes this message to the main task
    • This message is a reply to one of the jobs that we skipped
    • The main task gets mad

I think this may be the failure seen here:

{"msg":"cea05089-f330-4222-81d2-9c03095dfb2b WARNING finish job 125586 when downstairs state: Offline","v":0,"name":"propolis-server","level":40,"time":"2024-01-30T01:48:01.555693836Z","hostname":"oxz_propolis-server_edb3b1a5-7a60-441f-bfb0-a8cafae99226","pid":22206,"client":"2","":"downstairs","session_id":"9be1d7cc-2bf0-48cd-a496-d34f9c8245ec","component":"crucible-cea05089-f330-4222-81d2-9c03095dfb2b"}
thread 'tokio-runtime-worker' panicked at /home/build/.cargo/git/checkouts/crucible-f3b5bdecdc6486d6/2d4bc11/upstairs/src/client.rs:1237:13:
[2] Job 125586 completed while not InProgress: New
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

(from /staff/core/crucible-unsorted/oxz_propolis-server_edb3b1a5.log)
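For context, a minimal sketch of the kind of state check that fires in `process_io_completion`. The `IOState` variants mirror the names in the panic message, but this is illustrative code, not Crucible's actual implementation:

```rust
/// Illustrative job states, named after the panic message above;
/// not Crucible's actual definitions.
#[derive(Debug, PartialEq)]
enum IOState {
    New,
    InProgress,
    Done,
}

/// Sketch of the invariant the real code asserts: a completion from the
/// downstairs is only legal for a job that is currently InProgress.
fn on_job_completion(client_id: u8, job_id: u64, state: &mut IOState) {
    if *state != IOState::InProgress {
        // This is the panic in the log above: the main task already reset
        // (or skipped) the job before the late reply arrived.
        panic!("[{client_id}] Job {job_id} completed while not InProgress: {state:?}");
    }
    *state = IOState::Done;
}

fn main() {
    // A reply arriving after the job was reset to New reproduces the panic.
    let mut state = IOState::New;
    on_job_completion(2, 125586, &mut state);
}
```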

@mkeeter requested a review from leftwo on January 30, 2024 19:49
@leftwo (Contributor) left a comment

Could this still race with the downstairs where we have just taken a message off the socket, but that message has not made it into the upstairs?

I ask because there are some panics I'm updating to not panic if we receive a message for a job we already finished and/or that no longer exists. (See process_io_completion().)

@mkeeter (Contributor, Author) commented on Jan 30, 2024

> Could this still race with the downstairs where we have just taken a message off the socket, but that message has not made it into the upstairs?

I don't think so, but I'm also not sure I understand the question.

The IO task is responsible for taking messages off the socket and sending them to the main task via an mpsc queue. The main task reads and processes those messages, and is also responsible for stopping the IO task under certain circumstances. When the main task asks the IO task to stop, it also handles any pending jobs, e.g. by marking them as skipped.

This PR means that once we've done those two things (stopping the IO task and marking pending jobs as skipped), we no longer see any messages from the IO task – even if they have arrived over the socket or were just about to be pushed into the mpsc queue. The IO task is dead to us; the only ClientAction that we care about is the actual notification that the task has stopped.

I think that if we do this correctly, we could (and should!) keep those panics in place, because they're checking a real thing (and this PR fixes a real issue).
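To make that concrete, here is a minimal sketch of the guard, assuming a simplified `ClientAction` enum and a `stop_requested` flag; the real types in `client.rs` are richer, so treat the names as illustrative:

```rust
/// Simplified actions the main task can receive for one client.
/// These names follow the discussion above but are illustrative only.
enum ClientAction {
    /// A reply from the downstairs, forwarded over the mpsc queue.
    Response(u64),
    /// Notification that the client IO task has actually exited.
    TaskStopped,
}

struct ClientState {
    /// Set when the main task sends the oneshot stop request.
    stop_requested: bool,
}

impl ClientState {
    /// Sketch of the guard this PR adds: once a stop has been requested,
    /// drop everything from the IO task except the stop notification.
    fn on_client_action(&mut self, action: ClientAction) {
        if self.stop_requested {
            match action {
                ClientAction::TaskStopped => {
                    // The IO task is gone; safe to clean up and reconnect.
                    self.stop_requested = false;
                }
                ClientAction::Response(_job_id) => {
                    // A late reply that raced with the stop request; the jobs
                    // it refers to were already marked skipped, so ignore it
                    // instead of panicking.
                }
            }
        } else {
            // Normal message processing (elided).
        }
    }
}

fn main() {
    let mut client = ClientState { stop_requested: true };
    // A reply arriving after the stop request is silently dropped.
    client.on_client_action(ClientAction::Response(125586));
}
```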

@leftwo (Contributor) commented on Jan 30, 2024

> Could this still race with the downstairs where we have just taken a message off the socket, but that message has not made it into the upstairs?

> This PR means that once we've done those two things (stopping the IO task and marking pending jobs as skipped), we no longer see any messages from the IO task – even if they have arrived over the socket or were just about to be pushed into the mpsc queue. The IO task is dead to us; the only ClientAction that we care about is the actual notification that the task has stopped.

> I think that if we do this correctly, we could (and should!) keep those panics in place, because they're checking a real thing (and this PR fixes a real issue).

Yes, perfect!
And, for LiveRepair operations, those also come through the same client IO task, correct?

@mkeeter (Contributor, Author) commented on Jan 30, 2024

The client IO task is the owner of the actual socket, so we shouldn't see any LiveRepair replies after stopping it. This PR doesn't change sending LiveRepair requests (which is managed by the main task).

@leftwo self-requested a review on January 30, 2024 23:00
@leftwo (Contributor) left a comment

Thanks for answering my questions, let's get this in!

@mkeeter merged commit cc8c5a0 into oxidecomputer:main on Jan 31, 2024
18 checks passed
@mkeeter deleted the ignore-messages-after-io-stop branch on January 31, 2024 00:05
@mkeeter mentioned this pull request on Feb 2, 2024
mkeeter added a commit that referenced this pull request on Feb 2, 2024:
This fixes another occurrence of the panic thought to be addressed in #1126:
```
Feb 02 00:44:11.050 WARN 63a91608-1f7f-41a4-bd21-76371c72a2d0 WARNING finish job 1083 when downstairs state: Offline, client: 1, : downstairs, session_id: 610eadbf-df34-44b5-8520-b4e88207bf54
thread 'test::integration_test_problematic_downstairs' panicked at upstairs/src/client.rs:1249:13:
[1] Job 1083 completed while not InProgress: New
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: crucible::client::DownstairsClient::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/client.rs:1249:13
   3: crucible::downstairs::Downstairs::process_io_completion_inner
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3177:12
   4: crucible::downstairs::Downstairs::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3063:9
   5: crucible::upstairs::Upstairs::on_client_message::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:1616:25
   6: crucible::upstairs::Upstairs::apply::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:540:43
   7: crucible::upstairs::Upstairs::run::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:429:32
   8: crucible::up_main::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/lib.rs:2962:58
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
  10: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/task/core.rs:328:17
  11: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/loom/std/unsafe_cell.rs:16:9
```

That fix was insufficient because messages from an about-to-be-disconnected
client can be deferred (for decryption), and therefore still arrive in the main
task after the client connection is closed.

The fix is to tag every message with its connection index, which is just
`DownstairsStats::connected` for that particular client.  Then, when processing
deferred messages, skip them if the connection index has changed.

This change also revealed a bug in our live-repair tests!  After we send an
error, both the error-sending and the under-repair Downstairs will be moved to
`DsState::Faulted`; therefore, we shouldn't be sending a reply from the latter.
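A minimal sketch of the connection-index check described in that commit message; the `DeferredMessage` struct and field names are assumptions for illustration, standing in for the per-client `connected` counter:

```rust
/// Illustrative stand-ins for the real types; the actual code tags deferred
/// messages with the client's `connected` counter.
struct DeferredMessage {
    /// Connection index captured when the message was handed off for
    /// decryption.
    connection_index: u64,
}

struct ClientState {
    /// Bumped every time this client (re)connects.
    connection_index: u64,
}

impl ClientState {
    /// Deferred messages from a previous connection are stale and skipped.
    fn should_process(&self, msg: &DeferredMessage) -> bool {
        msg.connection_index == self.connection_index
    }
}

fn main() {
    let client = ClientState { connection_index: 3 };
    let stale = DeferredMessage { connection_index: 2 };
    // The message was deferred under connection 2, but the client has since
    // reconnected, so it is ignored rather than completing a stale job.
    assert!(!client.should_process(&stale));
}
```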