Ignore client messages after stopping the IO task #1126
Conversation
Could this still race with the downstairs, where we have just taken a message off the socket, but that message has not yet made it into the upstairs?

I ask because there are some panics I'm updating to not panic if we receive a message for a job that we already finished and/or that no longer exists. (See `process_io_completion()`.)
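For context, here is a minimal sketch of the kind of relaxation described above, using hypothetical simplified types rather than Crucible's real definitions: a completion for a job that is no longer tracked is logged and ignored instead of panicking.

```rust
// Sketch only: `Client`, `IoCompletion`, and the job map are illustrative
// stand-ins, not Crucible's actual API.
use std::collections::HashMap;

type JobId = u64;

enum IoCompletion {
    Done(JobId),
}

struct Client {
    active_jobs: HashMap<JobId, &'static str>,
}

impl Client {
    fn process_io_completion(&mut self, c: IoCompletion) {
        let IoCompletion::Done(job_id) = c;
        match self.active_jobs.remove(&job_id) {
            Some(_job) => {
                // Normal path: the job was still in progress.
            }
            None => {
                // Previously: panic!("job {job_id} completed while not InProgress")
                // Relaxed: the job may already be finished or retired, e.g. if
                // the reply raced with tearing down the connection.
                eprintln!("ignoring completion for unknown job {job_id}");
            }
        }
    }
}

fn main() {
    let mut client = Client { active_jobs: HashMap::new() };
    client.active_jobs.insert(1083, "write");
    client.process_io_completion(IoCompletion::Done(1083)); // normal completion
    client.process_io_completion(IoCompletion::Done(1083)); // stale, ignored
}
```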
I don't think so, but I'm also not sure I understand the question. The IO task is responsible for taking messages off the socket and sending them to the main task over a channel. This PR means that once we've done those two things (stopping the IO task and marking pending jobs as skipped), we no longer see any messages from the IO task – even if they have already arrived over the socket, or were just about to be pushed into the channel.

I think that if we do this correctly, we could (and should!) keep those panics in place, because they're checking a real thing (and this PR fixes a real issue).
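To illustrate the argument, here is a minimal sketch using a tokio `mpsc` channel; the message type and task wiring are illustrative assumptions, not Crucible's actual plumbing. The IO task is the only reader of the socket, so once the main task stops it and closes the channel, no further downstairs messages can reach the main task, even ones already queued.

```rust
// Sketch only: `DownstairsMessage` and the loop body are stand-ins for the
// real framed socket reads.
use tokio::sync::mpsc;

#[derive(Debug)]
struct DownstairsMessage(u64);

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<DownstairsMessage>(16);

    // Stand-in for the client IO task: the sole reader of the socket,
    // forwarding every message to the main task over the channel.
    let io_task = tokio::spawn(async move {
        let mut next = 0;
        loop {
            next += 1;
            if tx.send(DownstairsMessage(next)).await.is_err() {
                break; // the main task has stopped listening
            }
        }
    });

    // Main task: process a few messages, then stop the IO task.
    for _ in 0..3 {
        let msg = rx.recv().await.unwrap();
        println!("processed {msg:?}");
    }
    io_task.abort(); // stop the IO task...
    rx.close(); // ...refuse any new sends...
    while rx.recv().await.is_some() {} // ...and drain what was already queued

    // From here on, no DownstairsMessage can be observed by the main task.
}
```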
Yes, perfect!
The client IO task is the owner of the actual socket, so we shouldn't see any LiveRepair replies after stopping it. This PR doesn't change how LiveRepair requests are sent (which is managed by the main task).
Thanks for answering my questions, let's get this in!
This fixes another occurrence of the panic thought to be addressed in #1126:

```
Feb 02 00:44:11.050 WARN 63a91608-1f7f-41a4-bd21-76371c72a2d0 WARNING finish job 1083 when downstairs state: Offline, client: 1, : downstairs, session_id: 610eadbf-df34-44b5-8520-b4e88207bf54

thread 'test::integration_test_problematic_downstairs' panicked at upstairs/src/client.rs:1249:13:
[1] Job 1083 completed while not InProgress: New
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: crucible::client::DownstairsClient::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/client.rs:1249:13
   3: crucible::downstairs::Downstairs::process_io_completion_inner
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3177:12
   4: crucible::downstairs::Downstairs::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3063:9
   5: crucible::upstairs::Upstairs::on_client_message::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:1616:25
   6: crucible::upstairs::Upstairs::apply::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:540:43
   7: crucible::upstairs::Upstairs::run::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:429:32
   8: crucible::up_main::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/lib.rs:2962:58
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
  10: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/task/core.rs:328:17
  11: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/loom/std/unsafe_cell.rs:16:9
```

That fix was insufficient because messages from an about-to-be-disconnected client can be deferred (for decryption), and therefore still arrive in the main task after the client connection is closed.

The fix is to tag every message with its connection index, which is just `DownstairsStats::connected` for that particular client. Then, when processing deferred messages, skip them if the connection index has changed.

This change also revealed a bug in our live-repair tests! After we send an error, both the error-sending and the under-repair Downstairs will be moved to `DsState::Faulted`; therefore, we shouldn't be sending a reply from the latter.
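A sketch of the connection-index check described above; the field and type names mirror the description but are illustrative assumptions, not Crucible's actual definitions. A deferred message (e.g. one parked for decryption) is dropped if the client has reconnected since it was taken off the socket.

```rust
// Sketch only: simplified stand-ins for the real upstairs types.
struct DownstairsStats {
    /// How many times this client has connected; bumped on each reconnect.
    connected: u64,
}

struct DeferredMessage {
    /// Connection index recorded when the message came off the socket.
    connection_index: u64,
    payload: Vec<u8>,
}

struct DownstairsClient {
    stats: DownstairsStats,
}

impl DownstairsClient {
    fn on_deferred_message(&mut self, msg: DeferredMessage) {
        if msg.connection_index != self.stats.connected {
            // Stale: the message belongs to a previous connection, and any
            // jobs it refers to were already skipped during teardown.
            return;
        }
        // ... otherwise, process the message as usual ...
        let _ = msg.payload;
    }
}

fn main() {
    let mut client = DownstairsClient { stats: DownstairsStats { connected: 2 } };
    // Deferred while connection 1 was live, delivered after a reconnect: dropped.
    client.on_deferred_message(DeferredMessage { connection_index: 1, payload: vec![] });
    // Matches the current connection: processed.
    client.on_deferred_message(DeferredMessage { connection_index: 2, payload: vec![] });
}
```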
This protects us from a subtle race condition involving `IO_OUTSTANDING_MAX` and the `tokio::sync::oneshot` channel. I think this may be the failure seen here (from `/staff/core/crucible-unsorted/oxz_propolis-server_edb3b1a5.log`).