Skip to content

Commit 59bd303

Browse files
authored
Add Active -> Offline -> Faulted tests (#1257)
This is (hopefully) the final prep PR before addressing #1252. There was an untested path of `Active` -(timeout)-> `Offline` -(too many jobs)-> `Faulted`; this PR adds tests for both the job and byte-based fault conditions. In addition, it confirms that live-repair works after all fault-inducing tests, moving that logic to a new `async fn run_live_repair(mut harness: TestHarness)`. Making these tests pass required fixing a (probably innocuous) race condition: - Downstairs is marked as offline, begins to reconnect (with a 10s pause) - Too many jobs accumulate, and we try to stop the downstairs (to restart it again) - `ClientIoTask::run_inner` ignores the `stop` message during the initial 10s pause - After the pause completes, the client IO task connects then immediately disconnects, because it finally looks at the stop message I think this was only an issue in unit tests, where we manage reconnections by hand; in the real system, the second reconnection should work fine. Anyhow, I fixed it by adding the `stop` condition to the initial client sleep and connection calls, so that it can interrupt the client at any point.
1 parent df5391a commit 59bd303

File tree

2 files changed

+326
-66
lines changed

2 files changed

+326
-66
lines changed

upstairs/src/client.rs

+42-2
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,10 @@ const TIMEOUT_SECS: f32 = 15.0;
3737
const TIMEOUT_LIMIT: usize = 3;
3838
const PING_INTERVAL_SECS: f32 = 5.0;
3939

40+
/// Total time before a client is timed out
41+
#[cfg(test)]
42+
pub const CLIENT_TIMEOUT_SECS: f32 = TIMEOUT_SECS * TIMEOUT_LIMIT as f32;
43+
4044
/// Handle to a running I/O task
4145
///
4246
/// The I/O task is "thin"; it simply forwards messages around. The task
@@ -2503,7 +2507,25 @@ impl ClientIoTask {
25032507
// spinning (e.g. if something is fundamentally wrong with the
25042508
// Downstairs)
25052509
if self.delay {
2506-
tokio::time::sleep(std::time::Duration::from_secs(10)).await;
2510+
tokio::select! {
2511+
s = &mut self.stop => {
2512+
warn!(self.log, "client IO task stopped during sleep");
2513+
return match s {
2514+
Ok(s) =>
2515+
ClientRunResult::RequestedStop(s),
2516+
Err(e) => {
2517+
warn!(
2518+
self.log,
2519+
"client_stop_rx closed unexpectedly: {e:?}"
2520+
);
2521+
ClientRunResult::QueueClosed
2522+
}
2523+
}
2524+
}
2525+
_ = tokio::time::sleep(std::time::Duration::from_secs(10)) => {
2526+
// this is fine
2527+
},
2528+
}
25072529
}
25082530

25092531
// Wait for the start oneshot to fire. This may happen immediately, but
@@ -2551,7 +2573,11 @@ impl ClientIoTask {
25512573
tcp = sock.connect(self.target) => {
25522574
match tcp {
25532575
Ok(tcp) => {
2554-
info!(self.log, "ds_connection connected");
2576+
info!(
2577+
self.log,
2578+
"ds_connection connected from {:?}",
2579+
tcp.local_addr()
2580+
);
25552581
tcp
25562582
}
25572583
Err(e) => {
@@ -2564,6 +2590,20 @@ impl ClientIoTask {
25642590
}
25652591
}
25662592
}
2593+
s = &mut self.stop => {
2594+
warn!(self.log, "client IO task stopped during connection");
2595+
return match s {
2596+
Ok(s) =>
2597+
ClientRunResult::RequestedStop(s),
2598+
Err(e) => {
2599+
warn!(
2600+
self.log,
2601+
"client_stop_rx closed unexpectedly: {e:?}"
2602+
);
2603+
ClientRunResult::QueueClosed
2604+
}
2605+
}
2606+
}
25672607
};
25682608

25692609
// We're connected; before we wrap it, set TCP_NODELAY to assure

0 commit comments

Comments
 (0)