This is (hopefully) the final prep PR before addressing #1252.
There was an untested path of `Active` -(timeout)-> `Offline` -(too many
jobs)-> `Faulted`; this PR adds tests for both the job-based and
byte-based fault conditions. In addition, it confirms that live-repair
works after all fault-inducing tests, moving that logic to a new
`async fn run_live_repair(mut harness: TestHarness)`.
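
For readers unfamiliar with the path being tested, here is a rough,
simplified sketch of the transitions. The `State` enum, the `on_timeout`
and `on_backlog` helpers, and both limits are hypothetical names and values
made up for illustration; the real state machine has more states and its
own thresholds.

```rust
/// Hypothetical, simplified client states (the real set is larger).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Active,
    Offline,
    Faulted,
}

/// Illustrative limits only; these are not the real thresholds.
const MAX_JOBS: usize = 100;
const MAX_BYTES: usize = 1024;

/// A timeout moves an `Active` client to `Offline`; other states are unchanged.
fn on_timeout(state: State) -> State {
    match state {
        State::Active => State::Offline,
        other => other,
    }
}

/// While `Offline`, exceeding either the job or the byte limit faults the client.
fn on_backlog(state: State, jobs: usize, bytes: usize) -> State {
    match state {
        State::Offline if jobs > MAX_JOBS || bytes > MAX_BYTES => State::Faulted,
        other => other,
    }
}
```

The new tests cover both arms of that `||`: one drives the job count past
its limit, the other drives the byte count past its limit.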
Making these tests pass required fixing a (probably innocuous) race
condition:
- Downstairs is marked as offline, begins to reconnect (with a 10s
pause)
- Too many jobs accumulate, and we try to stop the downstairs (to
restart it again)
- `ClientIoTask::run_inner` ignores the `stop` message during the
initial 10s pause
- After the pause completes, the client IO task connects then
immediately disconnects, because it finally looks at the stop message
I think this was only an issue in unit tests, where we manage
reconnections by hand; in the real system, the second reconnection
should work fine. Anyhow, I fixed it by adding the `stop` condition to
the initial client sleep and connection calls, so that it can interrupt
the client at any point.
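
The shape of the fix can be sketched with a std-only, synchronous analogue:
instead of sleeping unconditionally, wait on the stop channel with a
timeout, so a stop request ends the pause early. This is not the actual
change to `ClientIoTask::run_inner` (which is async); `DelayOutcome`,
`interruptible_delay`, and `stop_during_pause` are hypothetical names used
only for this illustration.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome of the pre-connection pause.
#[derive(Debug, PartialEq)]
enum DelayOutcome {
    /// The full delay elapsed; the caller may go on to connect.
    Elapsed,
    /// A stop request arrived (or the sender was dropped) during the delay.
    Stopped,
}

/// Waits up to `delay`, returning early if a stop message arrives.
fn interruptible_delay(stop: &Receiver<()>, delay: Duration) -> DelayOutcome {
    match stop.recv_timeout(delay) {
        Err(RecvTimeoutError::Timeout) => DelayOutcome::Elapsed,
        Ok(()) | Err(RecvTimeoutError::Disconnected) => DelayOutcome::Stopped,
    }
}

/// Example: a stop sent before the pause begins is seen right away,
/// rather than after the full 10s.
fn stop_during_pause() -> DelayOutcome {
    let (tx, rx) = mpsc::channel();
    tx.send(()).expect("receiver is still alive");
    interruptible_delay(&rx, Duration::from_secs(10))
}
```

In the async version the same idea applies to the connection attempt as
well as the sleep, so the stop request is observed at whichever step the
task is currently waiting on.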