Ignore client messages after stopping the IO task #1126

Merged
mkeeter merged 2 commits into oxidecomputer:main from ignore-messages-after-io-stop on Jan 31, 2024

Conversation

@mkeeter (Contributor) commented on Jan 30, 2024

This protects us from a subtle race condition:

  • The main task decides to stop the client task (e.g. because we've hit IO_OUTSTANDING_MAX)
    • It sends a stop request down the tokio::sync::oneshot channel
    • Then, it marks all jobs as skipped, which may result in discarding some of them
  • The IO task receives a message from the downstairs
    • It passes this message to the main task
    • This message is a reply to one of the jobs that we skipped
    • The main task gets mad

I think this may be the failure seen here:

{"msg":"cea05089-f330-4222-81d2-9c03095dfb2b WARNING finish job 125586 when downstairs state: Offline","v":0,"name":"propolis-server","level":40,"time":"2024-01-30T01:48:01.555693836Z","hostname":"oxz_propolis-server_edb3b1a5-7a60-441f-bfb0-a8cafae99226","pid":22206,"client":"2","":"downstairs","session_id":"9be1d7cc-2bf0-48cd-a496-d34f9c8245ec","component":"crucible-cea05089-f330-4222-81d2-9c03095dfb2b"}
thread 'tokio-runtime-worker' panicked at /home/build/.cargo/git/checkouts/crucible-f3b5bdecdc6486d6/2d4bc11/upstairs/src/client.rs:1237:13:
[2] Job 125586 completed while not InProgress: New
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

(from /staff/core/crucible-unsorted/oxz_propolis-server_edb3b1a5.log)
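For context, a minimal sketch of the kind of state check that fires in `process_io_completion`. The `IOState` variants mirror the names in the panic message, but this is illustrative code, not Crucible's actual implementation:

```rust
/// Illustrative job states, named after the panic message above;
/// not Crucible's actual definitions.
#[derive(Debug, PartialEq)]
enum IOState {
    New,
    InProgress,
    Done,
}

/// Sketch of the invariant the real code asserts: a completion from the
/// downstairs is only legal for a job that is currently InProgress.
fn on_job_completion(client_id: u8, job_id: u64, state: &mut IOState) {
    if *state != IOState::InProgress {
        // This is the panic in the log above: the main task already reset
        // (or skipped) the job before the late reply arrived.
        panic!("[{client_id}] Job {job_id} completed while not InProgress: {state:?}");
    }
    *state = IOState::Done;
}

fn main() {
    // A reply arriving after the job was reset to New reproduces the panic.
    let mut state = IOState::New;
    on_job_completion(2, 125586, &mut state);
}
```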

@mkeeter requested a review from leftwo on January 30, 2024 19:49
@leftwo (Contributor) left a comment

Could this still race with the downstairs where we have just taken a message off the socket, but that message has not made it into the upstairs?

I ask because there are some panics I'm updating to not panic if we receive a message for a job we already finished and/or that no longer exists. (See process_io_completion().)

@mkeeter (Contributor, Author) commented on Jan 30, 2024

> Could this still race with the downstairs where we have just taken a message off the socket, but that message has not made it into the upstairs?

I don't think so, but I'm also not sure I understand the question.

The IO task is responsible for taking messages off the socket and sending them to the main task via an mpsc queue. The main task reads and processes those messages, and is also responsible for stopping the IO task under certain circumstances. When the main task asks the IO task to stop, it also handles any pending jobs, e.g. by marking them as skipped.

This PR means that once we've done those two things (stopping the IO task and marking pending jobs as skipped), we no longer see any messages from the IO task – even if they have arrived over the socket or were just about to be pushed into the mpsc queue. The IO task is dead to us; the only ClientAction that we care about is the actual notification that the task has stopped.

I think that if we do this correctly, we could (and should!) keep those panics in place, because they're checking a real thing (and this PR fixes a real issue).
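To make that concrete, here is a minimal sketch of the guard, assuming a simplified `ClientAction` enum and a `stop_requested` flag; the real types in `client.rs` are richer, so treat the names as illustrative:

```rust
/// Simplified actions the main task can receive for one client.
/// These names follow the discussion above but are illustrative only.
enum ClientAction {
    /// A reply from the downstairs, forwarded over the mpsc queue.
    Response(u64),
    /// Notification that the client IO task has actually exited.
    TaskStopped,
}

struct ClientState {
    /// Set when the main task sends the oneshot stop request.
    stop_requested: bool,
}

impl ClientState {
    /// Sketch of the guard this PR adds: once a stop has been requested,
    /// drop everything from the IO task except the stop notification.
    fn on_client_action(&mut self, action: ClientAction) {
        if self.stop_requested {
            match action {
                ClientAction::TaskStopped => {
                    // The IO task is gone; safe to clean up and reconnect.
                    self.stop_requested = false;
                }
                ClientAction::Response(_job_id) => {
                    // A late reply that raced with the stop request; the jobs
                    // it refers to were already marked skipped, so ignore it
                    // instead of panicking.
                }
            }
        } else {
            // Normal message processing (elided).
        }
    }
}

fn main() {
    let mut client = ClientState { stop_requested: true };
    // A reply arriving after the stop request is silently dropped.
    client.on_client_action(ClientAction::Response(125586));
}
```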

@leftwo (Contributor) commented on Jan 30, 2024

> Could this still race with the downstairs where we have just taken a message off the socket, but that message has not made it into the upstairs?

> This PR means that once we've done those two things (stopping the IO task and marking pending jobs as skipped), we no longer see any messages from the IO task – even if they have arrived over the socket or were just about to be pushed into the mpsc queue. The IO task is dead to us; the only ClientAction that we care about is the actual notification that the task has stopped.

> I think that if we do this correctly, we could (and should!) keep those panics in place, because they're checking a real thing (and this PR fixes a real issue).

Yes, perfect!
And, for LiveRepair operations, those also come through the same client IO task, correct?

@mkeeter (Contributor, Author) commented on Jan 30, 2024

The client IO task is the owner of the actual socket, so we shouldn't see any LiveRepair replies after stopping it. This PR doesn't change sending LiveRepair requests (which is managed by the main task).

@leftwo self-requested a review on January 30, 2024 23:00
@leftwo (Contributor) left a comment

Thanks for answering my questions, let's get this in!

@mkeeter merged commit cc8c5a0 into oxidecomputer:main on Jan 31, 2024
18 checks passed
@mkeeter deleted the ignore-messages-after-io-stop branch on January 31, 2024 00:05
@mkeeter mentioned this pull request on Feb 2, 2024
mkeeter added a commit that referenced this pull request on Feb 2, 2024:
This fixes another occurrence of the panic thought to be addressed in #1126:
```
Feb 02 00:44:11.050 WARN 63a91608-1f7f-41a4-bd21-76371c72a2d0 WARNING finish job 1083 when downstairs state: Offline, client: 1, : downstairs, session_id: 610eadbf-df34-44b5-8520-b4e88207bf54
thread 'test::integration_test_problematic_downstairs' panicked at upstairs/src/client.rs:1249:13:
[1] Job 1083 completed while not InProgress: New
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: crucible::client::DownstairsClient::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/client.rs:1249:13
   3: crucible::downstairs::Downstairs::process_io_completion_inner
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3177:12
   4: crucible::downstairs::Downstairs::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3063:9
   5: crucible::upstairs::Upstairs::on_client_message::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:1616:25
   6: crucible::upstairs::Upstairs::apply::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:540:43
   7: crucible::upstairs::Upstairs::run::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:429:32
   8: crucible::up_main::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/lib.rs:2962:58
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
  10: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/task/core.rs:328:17
  11: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/loom/std/unsafe_cell.rs:16:9
```

That fix was insufficient because messages from an about-to-be-disconnected
client can be deferred (for decryption), and therefore still arrive in the main
task after the client connection is closed.

The fix is to tag every message with its connection index, which is just
`DownstairsStats::connected` for that particular client.  Then, when processing
deferred messages, skip them if the connection index has changed.

This change also revealed a bug in our live-repair tests!  After we send an
error, both the error-sending and the under-repair Downstairs will be moved to
`DsState::Faulted`; therefore, we shouldn't be sending a reply from the latter.
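A minimal sketch of the connection-index check described in that commit message; the `DeferredMessage` struct and field names are assumptions for illustration, standing in for the per-client `connected` counter:

```rust
/// Illustrative stand-ins for the real types; the actual code tags deferred
/// messages with the client's `connected` counter.
struct DeferredMessage {
    /// Connection index captured when the message was handed off for
    /// decryption.
    connection_index: u64,
}

struct ClientState {
    /// Bumped every time this client (re)connects.
    connection_index: u64,
}

impl ClientState {
    /// Deferred messages from a previous connection are stale and skipped.
    fn should_process(&self, msg: &DeferredMessage) -> bool {
        msg.connection_index == self.connection_index
    }
}

fn main() {
    let client = ClientState { connection_index: 3 };
    let stale = DeferredMessage { connection_index: 2 };
    // The message was deferred under connection 2, but the client has since
    // reconnected, so it is ignored rather than completing a stale job.
    assert!(!client.should_process(&stale));
}
```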