Only kick a Downstairs on timeout #1252

Closed
mkeeter opened this issue Apr 9, 2024 · 1 comment

mkeeter commented Apr 9, 2024

Right now, there are several ways a Downstairs can be kicked out of the Active state

  • If there are more than 57K jobs or 1 GiB in flight, it's moved to Faulted, which means it goes through live-repair when reconnecting
  • If it doesn't reply for 45 seconds, it's moved to Offline, which means it goes through replay when reconnecting
    • Note that it can later be moved from Offline to Faulted if it hits either the 57K-job or the 1 GiB bytes-in-flight criterion

(There are other paths, e.g. returning an IO error or having a connection close, but those aren't relevant to this issue; the two in-flight thresholds are sketched below.)
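
As a rough illustration of those two thresholds, here is a minimal sketch; the constant and function names are hypothetical, not the actual Crucible code.

```rust
/// Hypothetical constants standing in for the limits described above.
const MAX_ACTIVE_JOBS: u64 = 57_000;
const MAX_ACTIVE_BYTES: u64 = 1 << 30; // 1 GiB

/// Returns true if an Active Downstairs would be moved to Faulted under
/// the current jobs / bytes-in-flight criteria.
fn should_fault(jobs_in_flight: u64, bytes_in_flight: u64) -> bool {
    jobs_in_flight > MAX_ACTIVE_JOBS || bytes_in_flight > MAX_ACTIVE_BYTES
}
```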

The jobs / bytes-in-flight criteria are uncomfortably coupled to backpressure: if the maximum backpressure delay is less than the true job completion time, then jobs will inevitably accumulate and eventually cause the Downstairs to be faulted. In other words, we have to guess a value for maximum backpressure such that the system will never exceed it during normal operation.

We've gone through several rounds of tuning maximum byte-based backpressure (#1240, #1243), and are now seeing (https://github.com/oxidecomputer/customer-support/issues/121) that job-based backpressure may also be insufficient.

After talking about this in the 2024-04-01 Crucible Huddle, our new plan is to remove the jobs / bytes-in-flight criteria for the Active → Faulted transition. In other words, as long as a Downstairs is replying (however slowly), we won't mark it as faulted. The Offline → Faulted transition will remain in place, so that we don't accumulate jobs forever.

This theoretically means that the Upstairs can buffer arbitrary amounts of data on behalf of a pathologically-slow Downstairs; however, with quadratic backpressure, we can do the math and satisfy ourselves that it won't ever hit worrying values.
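
To make that argument concrete, here is a sketch of a quadratic backpressure curve (names and constants are made up, not the Upstairs implementation): the artificial per-IO delay grows as the square of the queue fraction, so a slow Downstairs only pushes the queue to the depth where the throttled submission rate matches its completion rate.

```rust
use std::time::Duration;

/// Sketch only: quadratic backpressure delay. The delay reaches
/// `delay_max` when `queued_bytes` hits `max_bytes`, and grows as the
/// square of the queue fraction below that.
fn quadratic_delay(queued_bytes: u64, max_bytes: u64, delay_max: Duration) -> Duration {
    let f = (queued_bytes as f64 / max_bytes as f64).min(1.0);
    delay_max.mul_f64(f * f)
}
```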

mkeeter added a commit that referenced this issue Apr 10, 2024
This is another standalone prep PR for #1252

We are planning to use backpressure alone to limit queue lengths for an
`Active` downstairs, removing the `IO_MAX_ACTIVE_JOBS/BYTES` fault
condition; using bounded queues doesn't make sense any more (and gives a
false sense of security).

Using unbounded queues is more honest about our strategy, and also
removes `async` from large chunks of the codebase. The latter isn't done
in this PR (to avoid churn), but once `DownstairsClient::send` is made
sync, it will ripple through a bunch of places!
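
For illustration, a minimal sketch of why the unbounded queue lets `send` become sync (the struct and message type here are hypothetical, not the real `DownstairsClient` API): tokio's bounded `mpsc::Sender::send` is `async` because it may wait for capacity, while `mpsc::UnboundedSender::send` never waits.

```rust
use tokio::sync::mpsc;

/// Hypothetical stand-in for a client message type.
struct Message;

struct ClientSender {
    tx: mpsc::UnboundedSender<Message>,
}

impl ClientSender {
    /// Synchronous: it can only fail if the receiving IO task has gone
    /// away; it never blocks waiting for queue capacity.
    fn send(&self, msg: Message) {
        let _ = self.tx.send(msg);
    }
}
```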
mkeeter added a commit that referenced this issue Apr 10, 2024
This is more prep for addressing #1252 

Previously, our live-repair tests were checking two different things in
a single test:

- Sending 57K jobs caused a Downstairs to be marked as faulted
- Once faulted, various live-repair operations occurred

There's also a separate test, `test_byte_fault_condition` (#1192), to
confirm that sending > 1 GiB caused a Downstairs to be marked as
faulted.

This PR splits those two checks into separate unit tests, adding new
(test-only) APIs to (1) check Downstairs state and (2) manually fault a
Downstairs. We want these checks to be separate because – once we make
the changes for #1252 – sending 57K jobs will _no longer_ cause an
active downstairs to be marked as faulted.

In addition, common `TestHarness` code is consolidated into helper
functions:

- Clone the `Guest` and `tokio::spawn` → `TestHarness::spawn`
- Expect a particular message type and reply to it →
`DownstairsHandle::ack_read`, `ack_write`, `ack_flush`
morlandi7 added this to the 8 milestone Apr 11, 2024
mkeeter added a commit that referenced this issue Apr 12, 2024
This is (hopefully) the final prep PR before addressing #1252.

There was an untested path of `Active` -(timeout)-> `Offline` -(too many
jobs)-> `Faulted`; this PR adds tests for both the job and byte-based
fault conditions. In addition, it confirms that live-repair works after
all fault-inducing tests, moving that logic to a new `async fn
run_live_repair(mut harness: TestHarness)`.

Making these tests pass required fixing a (probably innocuous) race
condition:

- Downstairs is marked as offline, begins to reconnect (with a 10s
pause)
- Too many jobs accumulate, and we try to stop the downstairs (to
restart it again)
- `ClientIoTask::run_inner` ignores the `stop` message during the
initial 10s pause
- After the pause completes, the client IO task connects then
immediately disconnects, because it finally looks at the stop message

I think this was only an issue in unit tests, where we manage
reconnections by hand; in the real system, the second reconnection
should work fine. Anyhow, I fixed it by adding the `stop` condition to
the initial client sleep and connection calls, so that it can interrupt
the client at any point.
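
A minimal sketch of the shape of that fix, assuming a tokio oneshot-style stop signal (the real client task is structured differently): race the reconnect pause against the stop request, so that a stop can take effect at any point.

```rust
use tokio::sync::oneshot;
use tokio::time::{sleep, Duration};

/// Sketch only: wait out the reconnect pause, but return early if a stop
/// request arrives first. Returns true if the client should go on to
/// connect, false if it was asked to stop during the pause.
async fn pause_or_stop(stop: &mut oneshot::Receiver<()>) -> bool {
    tokio::select! {
        _ = sleep(Duration::from_secs(10)) => true,
        _ = &mut stop => false,
    }
}
```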
mkeeter added a commit that referenced this issue Apr 18, 2024
This PR removes the Active → Faulted transition when too many IO
operations are in flight.

See #1252 for details on why this is desirable!

There are a bunch of changes that come along for the ride:

- The backpressure equation is changed from quadratic to the form
`1 / (1 - f)`, i.e. it now goes to infinity at our desired maximum
value. These maximum values are set at 2x the fault thresholds for the
Offline → Faulted transition (a sketch of this curve follows the list).
- Computing backpressure is now a standalone function on
`BackpressureConfig`, so it can be unit tested
- Unit tests are added to confirm backpressure properties
- The `Offline` → `Faulted` transition happens at a reasonable timescale
(> 1 sec, < 2 minutes)
  - Backpressure goes to ∞ as we approach these limits
- Backpressure settings are tweaked based on these new requirements
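
A sketch of the `1 / (1 - f)`-shaped curve described in the first bullet (names and constants are illustrative, not the merged implementation): the delay stays modest for most of the range and grows without bound as the queue fraction approaches the configured maximum.

```rust
use std::time::Duration;

/// Sketch only: backpressure delay of the form 1 / (1 - f). It is zero
/// at f = 0 and goes to infinity as f approaches 1, so the guest is
/// stalled before the queue can reach the hard limit.
fn reciprocal_delay(queued: u64, max: u64, scale: Duration) -> Duration {
    let f = (queued as f64 / max as f64).min(1.0);
    if f >= 1.0 {
        Duration::MAX
    } else {
        scale.mul_f64(1.0 / (1.0 - f) - 1.0)
    }
}
```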
mkeeter commented Apr 23, 2024

Done in #1260

mkeeter closed this as completed Apr 23, 2024