Only kick a Downstairs on timeout #1252

Closed
mkeeter opened this issue Apr 9, 2024 · 1 comment

mkeeter commented Apr 9, 2024

Right now, there are several ways a Downstairs can be kicked out of the Active state

  • If there are more than 57K jobs or 1 GiB in flight, it's moved to Faulted, which means it goes through live-repair when reconnecting
  • If it doesn't reply for 45 seconds, it's moved to Offline, which means it goes through replay when reconnecting
    • Note that it can later be moved from Offline to Faulted if it hits either the 57K-job or the 1 GiB bytes-in-flight criterion

(There are other paths, e.g. returning an IO error or having a connection close, but those aren't relevant to this issue; the two in-flight thresholds are sketched below.)
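
As a rough illustration of those two thresholds, here is a minimal sketch; the constant and function names are hypothetical, not the actual Crucible code.

```rust
/// Hypothetical constants standing in for the limits described above.
const MAX_ACTIVE_JOBS: u64 = 57_000;
const MAX_ACTIVE_BYTES: u64 = 1 << 30; // 1 GiB

/// Returns true if an Active Downstairs would be moved to Faulted under
/// the current jobs / bytes-in-flight criteria.
fn should_fault(jobs_in_flight: u64, bytes_in_flight: u64) -> bool {
    jobs_in_flight > MAX_ACTIVE_JOBS || bytes_in_flight > MAX_ACTIVE_BYTES
}
```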

The jobs / bytes-in-flight criteria are uncomfortably coupled to backpressure: if the maximum backpressure delay is less than the true job completion time, then jobs will inevitably accumulate and eventually cause the Downstairs to be faulted. In other words, we have to guess a value for maximum backpressure such that the system will never exceed it during normal operation.

We've gone through several rounds of tuning maximum byte-based backpressure (#1240, #1243), and are now seeing (https://github.com/oxidecomputer/customer-support/issues/121) that job-based backpressure may also be insufficient.

After talking about this in the 2024-04-01 Crucible Huddle, our new plan is to remove the jobs / bytes-in-flight criteria for the Active → Faulted transition. In other words, as long as a Downstairs is replying (however slowly), we won't mark it as faulted. The Offline → Faulted transition will remain in place, so that we don't accumulate jobs forever.

This theoretically means that the Upstairs can buffer arbitrary amounts of data on behalf of a pathologically-slow Downstairs; however, with quadratic backpressure, we can do the math and satisfy ourselves that it won't ever hit worrying values.
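
To make that argument concrete, here is a sketch of a quadratic backpressure curve (names and constants are made up, not the Upstairs implementation): the artificial per-IO delay grows as the square of the queue fraction, so a slow Downstairs only pushes the queue to the depth where the throttled submission rate matches its completion rate.

```rust
use std::time::Duration;

/// Sketch only: quadratic backpressure delay. The delay reaches
/// `delay_max` when `queued_bytes` hits `max_bytes`, and grows as the
/// square of the queue fraction below that.
fn quadratic_delay(queued_bytes: u64, max_bytes: u64, delay_max: Duration) -> Duration {
    let f = (queued_bytes as f64 / max_bytes as f64).min(1.0);
    delay_max.mul_f64(f * f)
}
```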

mkeeter added a commit that referenced this issue Apr 10, 2024
This is another standalone prep PR for #1252

We are planning to use backpressure alone to limit queue lengths for an
`Active` downstairs, removing the `IO_MAX_ACTIVE_JOBS/BYTES` fault
condition; using bounded queues doesn't make sense any more (and gives a
false sense of security).

Using unbounded queues is more honest about our strategy, and also
removes `async` from large chunks of the codebase. The latter isn't done
in this PR (to avoid churn), but once `DownstairsClient::send` is made
sync, it will ripple through a bunch of places!
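
For illustration, a minimal sketch of why the unbounded queue lets `send` become sync (the struct and message type here are hypothetical, not the real `DownstairsClient` API): tokio's bounded `mpsc::Sender::send` is `async` because it may wait for capacity, while `mpsc::UnboundedSender::send` never waits.

```rust
use tokio::sync::mpsc;

/// Hypothetical stand-in for a client message type.
struct Message;

struct ClientSender {
    tx: mpsc::UnboundedSender<Message>,
}

impl ClientSender {
    /// Synchronous: it can only fail if the receiving IO task has gone
    /// away; it never blocks waiting for queue capacity.
    fn send(&self, msg: Message) {
        let _ = self.tx.send(msg);
    }
}
```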
mkeeter added a commit that referenced this issue Apr 10, 2024
This is more prep for addressing #1252 

Previously, our live-repair tests were checking two different things in
a single test:

- Sending 57K jobs caused a Downstairs to be marked as faulted
- Once faulted, various live-repair operations occurred

There's also a separate test, `test_byte_fault_condition` (#1192), to
confirm that sending > 1 GiB caused a Downstairs to be marked as
faulted.

This PR splits those two checks into separate unit tests, adding new
(test-only) APIs to (1) check Downstairs state and (2) manually fault a
Downstairs. We want these checks to be separate because – once we make
the changes for #1252 – sending 57K jobs will _no longer_ cause an
active downstairs to be marked as faulted.

In addition, common `TestHarness` code is consolidated into helper
functions:

- Clone the `Guest` and `tokio::spawn` → `TestHarness::spawn`
- Expect a particular message type and reply to it →
`DownstairsHandle::ack_read`, `ack_write`, `ack_flush`
morlandi7 added this to the 8 milestone Apr 11, 2024
mkeeter added a commit that referenced this issue Apr 12, 2024
This is (hopefully) the final prep PR before addressing #1252.

There was an untested path of `Active` -(timeout)-> `Offline` -(too many
jobs)-> `Faulted`; this PR adds tests for both the job and byte-based
fault conditions. In addition, it confirms that live-repair works after
all fault-inducing tests, moving that logic to a new `async fn
run_live_repair(mut harness: TestHarness)`.

Making these tests pass required fixing a (probably innocuous) race
condition:

- Downstairs is marked as offline, begins to reconnect (with a 10s
pause)
- Too many jobs accumulate, and we try to stop the downstairs (to
restart it again)
- `ClientIoTask::run_inner` ignores the `stop` message during the
initial 10s pause
- After the pause completes, the client IO task connects then
immediately disconnects, because it finally looks at the stop message

I think this was only an issue in unit tests, where we manage
reconnections by hand; in the real system, the second reconnection
should work fine. Anyhow, I fixed it by adding the `stop` condition to
the initial client sleep and connection calls, so that it can interrupt
the client at any point.
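
A minimal sketch of the shape of that fix, assuming a tokio oneshot-style stop signal (the real client task is structured differently): race the reconnect pause against the stop request, so that a stop can take effect at any point.

```rust
use tokio::sync::oneshot;
use tokio::time::{sleep, Duration};

/// Sketch only: wait out the reconnect pause, but return early if a stop
/// request arrives first. Returns true if the client should go on to
/// connect, false if it was asked to stop during the pause.
async fn pause_or_stop(stop: &mut oneshot::Receiver<()>) -> bool {
    tokio::select! {
        _ = sleep(Duration::from_secs(10)) => true,
        _ = &mut stop => false,
    }
}
```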
mkeeter added a commit that referenced this issue Apr 18, 2024
This PR removes the Active → Faulted transition when too many IO
operations are in flight.

See #1252 for details on why this is desirable!

There are a bunch of changes that come along for the ride:

- The backpressure equation is changed from quadratic to the form
`1 / (1 - f)`, i.e. it now goes to infinity at our desired maximum
value. These maximum values are set at 2x the fault thresholds for the
Offline → Faulted transition (a sketch of this curve follows the list).
- Computing backpressure is now a standalone function on
`BackpressureConfig`, so it can be unit tested
- Unit tests are added to confirm backpressure properties
- The `Offline` → `Faulted` transition happens at a reasonable timescale
(> 1 sec, < 2 minutes)
  - Backpressure goes to ∞ as we approach these limits
- Backpressure settings are tweaked based on these new requirements
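
A sketch of the `1 / (1 - f)`-shaped curve described in the first bullet (names and constants are illustrative, not the merged implementation): the delay stays modest for most of the range and grows without bound as the queue fraction approaches the configured maximum.

```rust
use std::time::Duration;

/// Sketch only: backpressure delay of the form 1 / (1 - f). It is zero
/// at f = 0 and goes to infinity as f approaches 1, so the guest is
/// stalled before the queue can reach the hard limit.
fn reciprocal_delay(queued: u64, max: u64, scale: Duration) -> Duration {
    let f = (queued as f64 / max as f64).min(1.0);
    if f >= 1.0 {
        Duration::MAX
    } else {
        scale.mul_f64(1.0 / (1.0 - f) - 1.0)
    }
}
```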
mkeeter commented Apr 23, 2024

Done in #1260

mkeeter closed this as completed Apr 23, 2024