Only kick a Downstairs on timeout #1252
mkeeter added a commit that referenced this issue on Apr 10, 2024:
This is another standalone prep PR for #1252. We are planning to use backpressure alone to limit queue lengths for an `Active` downstairs, removing the `IO_MAX_ACTIVE_JOBS/BYTES` fault condition; using bounded queues doesn't make sense anymore (and gives a false sense of security). Using unbounded queues is more honest about our strategy, and also removes `async` from large chunks of the codebase. The latter isn't done in this PR (to avoid churn), but once `DownstairsClient::send` is made sync, it will ripple through a bunch of places!
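As a rough illustration of why an unbounded queue lets `send` become sync, here is a minimal sketch assuming tokio channels; the struct and field names are hypothetical stand-ins, not Crucible's real `DownstairsClient`:

```rust
use tokio::sync::mpsc;

// Hypothetical stand-in for a client with an unbounded job queue.
struct ClientSketch {
    tx: mpsc::UnboundedSender<u64>,
}

impl ClientSketch {
    // With an unbounded channel, enqueueing a job never waits for
    // capacity, so this can be a plain (non-async) method; a bounded
    // channel's `send` must `.await`, which forces every caller to be
    // async as well.
    fn send(&self, job_id: u64) {
        // Fails only if the receiver half has been dropped.
        let _ = self.tx.send(job_id);
    }
}
```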
mkeeter added a commit that referenced this issue on Apr 10, 2024:
This is more prep for addressing #1252. Previously, our live-repair tests were checking two different things in a single test:

- Sending 57K jobs caused a Downstairs to be marked as faulted
- Once faulted, various live-repair operations occurred

There's also a separate test, `test_byte_fault_condition` (#1192), to confirm that sending > 1 GiB caused a Downstairs to be marked as faulted.

This PR splits those two checks into separate unit tests, adding new (test-only) APIs to (1) check Downstairs state and (2) manually fault a Downstairs. We want these checks to be separate because, once we make the changes for #1252, sending 57K jobs will _no longer_ cause an active downstairs to be marked as faulted.

In addition, common `TestHarness` code is consolidated into helper functions (see the sketch after this list):

- Clone the `Guest` and `tokio::spawn` → `TestHarness::spawn`
- Expect a particular message type and reply to it → `DownstairsHandle::ack_read`, `ack_write`, `ack_flush`
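A hedged sketch of the helper shape described above; the real `TestHarness` and `Guest` types live in Crucible's test code and differ in detail, so treat these signatures as illustrative only:

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::task::JoinHandle;

// Hypothetical stand-in for the real Guest type.
struct Guest;

struct TestHarness {
    guest: Arc<Guest>,
}

impl TestHarness {
    // Consolidates the repeated "clone the Guest and tokio::spawn"
    // pattern into a single helper.
    fn spawn<F, Fut>(&self, f: F) -> JoinHandle<()>
    where
        F: FnOnce(Arc<Guest>) -> Fut + Send + 'static,
        Fut: Future<Output = ()> + Send + 'static,
    {
        let guest = self.guest.clone();
        tokio::spawn(async move { f(guest).await })
    }
}
```

A test would then call `harness.spawn(|guest| async move { /* issue IO */ })` instead of repeating the clone-and-spawn boilerplate in every case.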
mkeeter added a commit that referenced this issue on Apr 12, 2024:
This is (hopefully) the final prep PR before addressing #1252. There was an untested path of `Active` -(timeout)-> `Offline` -(too many jobs)-> `Faulted`; this PR adds tests for both the job- and byte-based fault conditions. In addition, it confirms that live-repair works after all fault-inducing tests, moving that logic to a new `async fn run_live_repair(mut harness: TestHarness)`.

Making these tests pass required fixing a (probably innocuous) race condition:

- Downstairs is marked as offline and begins to reconnect (with a 10s pause)
- Too many jobs accumulate, and we try to stop the downstairs (to restart it again)
- `ClientIoTask::run_inner` ignores the `stop` message during the initial 10s pause
- After the pause completes, the client IO task connects, then immediately disconnects, because it finally looks at the stop message

I think this was only an issue in unit tests, where we manage reconnections by hand; in the real system, the second reconnection should work fine. Anyhow, I fixed it by adding the `stop` condition to the initial client sleep and connection calls, so that it can interrupt the client at any point (see the sketch below).
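A minimal sketch of that fix, with hypothetical names (the real logic lives in `ClientIoTask::run_inner`): racing the reconnect delay against the stop signal means a stop request that arrives during the 10s pause is honored immediately, instead of after a pointless connect/disconnect cycle.

```rust
use std::time::Duration;
use tokio::sync::oneshot;

// Hypothetical shape of the now-interruptible reconnect pause.
// Returns true if the client should go on to connect, false if it
// was asked to stop during the pause.
async fn reconnect_pause(stop: oneshot::Receiver<()>) -> bool {
    tokio::select! {
        // Previously the sleep ran to completion before `stop` was
        // checked; selecting over both closes that window.
        _ = tokio::time::sleep(Duration::from_secs(10)) => true,
        _ = stop => false,
    }
}
```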
mkeeter added a commit that referenced this issue on Apr 18, 2024:
This PR removes the `Active` → `Faulted` transition when too many IO operations are in flight. See #1252 for details on why this is desirable!

There are a bunch of changes that come along for the ride:

- The backpressure equation is changed from quadratic to the form `1 / (1 - f)`, i.e. it now goes to infinity at our desired maximum value. These maximum values are set at 2x the fault threshold for the `Offline` → `Faulted` transition. (See the sketch after this list.)
- Computing backpressure is now a standalone function on `BackpressureConfig`, so it can be unit tested
- Unit tests are added to confirm backpressure properties:
    - The `Offline` → `Faulted` transition happens at a reasonable timescale (> 1 sec, < 2 minutes)
    - Backpressure goes to ∞ as we approach these limits
- Backpressure settings are tweaked based on these new requirements
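A sketch of the backpressure shape described in the first bullet, with made-up constants and a `f / (1 - f)` normalization so the delay is zero for an empty queue; the exact form and tuning in the PR's `BackpressureConfig` may differ:

```rust
use std::time::Duration;

// Delay grows without bound as the in-flight fraction f approaches 1,
// so the queue can never reach the hard limit during normal operation.
fn backpressure_delay(in_flight: u64, max_in_flight: u64, scale: Duration) -> Duration {
    let f = in_flight as f64 / max_in_flight as f64;
    if f >= 1.0 {
        Duration::MAX
    } else {
        scale.mul_f64(f / (1.0 - f))
    }
}
```

With this shape, the delay equals `scale` at half the limit, 9 × `scale` at 90% of the limit, and diverges as the limit is approached.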
Done in #1260
Right now, there are several ways a Downstairs can be kicked out of the `Active` state:

- `Faulted`, which means it goes through live-repair when reconnecting
- `Offline`, which means it goes through replay when reconnecting
- `Offline` to `Faulted`, if it hits either the 57K jobs or 1 GiB bytes-in-flight criteria

(There are other paths, e.g. returning an IO error or having a connection close, but those aren't relevant to the issue.)
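For orientation, a simplified (hypothetical) view of the states named above; the real client state enum in Crucible has more variants and carries data:

```rust
// Simplified sketch, not the actual Crucible enum.
enum DownstairsState {
    Active,
    Offline, // reconnection goes through replay
    Faulted, // reconnection goes through live-repair
}
```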
The jobs / bytes-in-flight criteria are uncomfortably coupled to backpressure: if the maximum backpressure is less than the true job completion time, then jobs will inevitably accumulate and will eventually cause the Downstairs to be faulted. In other words, we have to guess a value for maximum backpressure such that the system will never exceed it during normal operation.
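A toy illustration of that coupling, with made-up numbers: if the capped backpressure delay per job is smaller than the true per-job completion time, the in-flight count grows without bound and the fault threshold is eventually hit even though the Downstairs is healthy, just slow.

```rust
fn main() {
    let max_backpressure_ms = 10.0; // capped delay between submissions
    let completion_ms = 25.0; // true per-job service time of the slow Downstairs
    let mut in_flight: f64 = 0.0;
    for minute in 1..=5 {
        // Jobs submitted per second at maximum backpressure, minus
        // jobs the Downstairs can actually complete per second.
        let growth_per_sec = 1000.0 / max_backpressure_ms - 1000.0 / completion_ms;
        in_flight += growth_per_sec * 60.0;
        println!("after {minute} min: ~{in_flight:.0} jobs in flight");
    }
}
```

With these particular numbers the queue grows by ~3,600 jobs per minute, so the 57K-job limit would be crossed in roughly a quarter of an hour of sustained writes.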
We've gone through several rounds of tuning maximum byte-based backpressure (#1240, #1243), and are now seeing (https://github.com/oxidecomputer/customer-support/issues/121) that job-based backpressure may also be insufficient.
After talking about this in the 2024-04-01 Crucible Huddle, our new plan is to remove the jobs / bytes-in-flight criteria for the `Active` → `Faulted` transition. In other words, as long as a Downstairs is replying (however slowly), we won't mark it as faulted. The `Offline` → `Faulted` transition will remain in place, so that we don't accumulate jobs forever.

This theoretically means that the Upstairs can buffer arbitrary amounts of data on behalf of a pathologically-slow Downstairs; however, with quadratic backpressure, we can do the math and satisfy ourselves that it won't ever hit worrying values.
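A back-of-the-envelope version of that math, with made-up constants (Crucible's actual quadratic tuning differs): if the per-job delay is `k * n²` for `n` jobs in flight, submissions slow to match a per-job completion time `t` once `k * n² = t`, so the queue settles near `sqrt(t / k)` instead of growing forever.

```rust
// Equilibrium in-flight count under quadratic backpressure:
// delay(n) = k * n^2 balances the true completion time t at
// n = sqrt(t / k).
fn equilibrium_jobs(completion_ms: f64, k_ms_per_job_sq: f64) -> f64 {
    (completion_ms / k_ms_per_job_sq).sqrt()
}

fn main() {
    // Example: a pathologically slow 100 ms/job Downstairs with
    // k = 1e-5 ms/job^2 settles around ~3,160 jobs in flight.
    println!("{:.0}", equilibrium_jobs(100.0, 1e-5));
}
```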