[nexus] Explicitly terminate pools for qorb (#6881)
Fixes #6505, integrates
usage of the new qorb APIs to terminate pools cleanly:
oxidecomputer/qorb#45
# Background
- [qorb](https://github.com/oxidecomputer/qorb) is a connection pooling
crate that spins up tokio tasks which attempt to connect to backends.
- When `qorb` was integrated into Omicron in
#5876, I used it to connect
to our database backend (CockroachDB). This included usage in tests,
even with a "single backend host" (a single, test-only CRDB server) -- I
wanted to ensure that we used the same pool and connector logic in tests
and in prod.
# What Went Wrong
As identified in #6505, we saw some tests failing during
**termination**. The specific cause of the failure was a panic from
[async-bb8-diesel](https://github.com/oxidecomputer/async-bb8-diesel),
where we attempted to spawn tasks on a terminating tokio executor.
This issue stems from async-bb8-diesel's usage of
`tokio::task::spawn_blocking`, where the returned `JoinHandle` is
immediately awaited and **unwrapped**, with an expectation that "we
should always be able to spawn a blocking task".
There was a separate discussion about whether async-bb8-diesel should be
unwrapping in this case (see oxidecomputer/async-bb8-diesel#77), but
instead I chose to focus on the question:
## Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting?
The answer to this question lies in qorb's internals -- qorb itself
spawns many tokio tasks to handle ongoing work, including monitoring DNS
resolution, checking on backend health, and making connections before
they are requested. One of these qorb tasks, while calling `ping_async`
(which checks connection health), used the `async-bb8-diesel` interface
that ultimately panicked.
Within qorb, most of these tokio tasks have a `Drop` implementation of the
form:
```rust
struct MyQorbStructure {
    handle: tokio::task::JoinHandle<()>,
}

impl Drop for MyQorbStructure {
    fn drop(&mut self) {
        self.handle.abort();
    }
}
```
Tragically, without async drop support in Rust, this is the best we can
do implicitly -- signal that the background tasks should stop ASAP --
but that may not be soon enough! Calling `.abort()` on the `JoinHandle`
does not terminate the task immediately; it simply signals that the task
should shut down at its next yield point.
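As a tiny, self-contained illustration of that behavior (not code from this
PR), aborting a task and then awaiting its `JoinHandle` surfaces exactly this
kind of cancelled `JoinError`:
```rust
// Standalone sketch: abort() is only a request. The spawned task keeps
// running until it reaches an .await, and awaiting the JoinHandle afterwards
// reports a cancelled JoinError.
#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        loop {
            // A cancellation requested via abort() is not observed here...
            std::hint::black_box(0u64);
            // ...it only takes effect at a yield point like this await.
            tokio::task::yield_now().await;
        }
    });

    handle.abort();
    let err = handle.await.unwrap_err();
    assert!(err.is_cancelled());
}
```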
As a result, we can still see the flake observed in #6505:
- A qorb pool is initialized with background tasks
- One of the qorb worker tasks is about to make a call to check on the
health of a connection to a backend
- The test finishes and returns; the tokio runtime begins terminating
- We call `drop` on the `qorb` pool, which signals that the background
tasks should abort
- The aforementioned qorb worker task makes its call to `ping_async`,
which calls `spawn_blocking`. This fails because the tokio runtime is
terminating, and returns a
[JoinError::Cancelled](https://buildomat.eng.oxide.computer/wg/0/details/01J9YQVN7X5EQNXFSEY6XJBH8B/zfviqPz9RoJp3bY4TafbyqXTwbhqdr7w4oupqBtVARR00CXF/01J9YQWAXY36WM0R2VG27QMFRK#S6049).
- `async-bb8-diesel` unwraps this `JoinError`, and the test panics.
# How do we mitigate this?
That's where this PR comes in. Using the new qorb APIs, we no longer rely
on the synchronous `drop` methods -- we explicitly call `.terminate().await`
functions, which do the following:
- Use `tokio::sync::oneshot`s as signals to tokio tasks that they
should exit
- `.await` the `JoinHandle` for those tasks before returning
Doing this work explicitly as part of cleanup ensures that no background
tasks are attempting to do new work while the tokio runtime is
terminating.
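Concretely, the shape of that pattern looks roughly like the sketch below
(hypothetical names reusing `MyQorbStructure` from above; this is not qorb's
actual implementation):
```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

struct MyQorbStructure {
    shutdown_tx: Option<oneshot::Sender<()>>,
    handle: Option<JoinHandle<()>>,
}

impl MyQorbStructure {
    fn new() -> Self {
        let (shutdown_tx, mut shutdown_rx) = oneshot::channel();
        let handle = tokio::spawn(async move {
            loop {
                tokio::select! {
                    // Exit promptly once terminate() fires the oneshot.
                    _ = &mut shutdown_rx => break,
                    // Stand-in for the task's real work (health checks, DNS, etc.).
                    _ = tokio::time::sleep(Duration::from_millis(100)) => {}
                }
            }
        });
        Self {
            shutdown_tx: Some(shutdown_tx),
            handle: Some(handle),
        }
    }

    // Explicit async cleanup: signal the worker, then await its JoinHandle so
    // no background work can outlive this call.
    async fn terminate(&mut self) {
        if let Some(tx) = self.shutdown_tx.take() {
            let _ = tx.send(());
        }
        if let Some(handle) = self.handle.take() {
            let _ = handle.await;
        }
    }
}
```
Callers are then expected to invoke `terminate().await` before the runtime
shuts down, rather than relying on `Drop` alone.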