Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add explicit API to terminate pool cleanly #45

Merged
merged 3 commits into from
Oct 16, 2024
Merged

Add explicit API to terminate pool cleanly #45

merged 3 commits into from
Oct 16, 2024

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Oct 15, 2024

Part of oxidecomputer/omicron#6505

Adds an API to cleanly terminate a qorb Pool, and guarantee that no background tasks are still executing beyond the termination point.

smklein added a commit to oxidecomputer/omicron that referenced this pull request Oct 15, 2024
I'd like to integrate oxidecomputer/qorb#45 into
qorb, and then omicron, but I want to be able to control that revision
explicitly.

Before I do that, pin the revision of qorb we're using so that "merging
something into qorb" doesn't break omicron.
@smklein smklein merged commit fb7e50d into master Oct 16, 2024
4 checks passed
@smklein smklein deleted the terminate branch October 16, 2024 00:01
hawkw added a commit that referenced this pull request Oct 16, 2024
Currently, the `SetWorker::run` function will terminate all per-slot
tasks in the set when termination is explicitly signaled. However, it
does *not* currently do this when the `SetWorker` shuts down because
there are no more clients, so we don't clean up the per-slot tasks in
that case. I noticed this [while reviewing][1] PR #45, but left the
review only after it merged.

This commit fixes that by moving the termination code out of the loop,
and making both the "explicit termination channel fired" and the "no
more clients" cases `break` out of the loop into the termination code,
instead of `return`ing.

[1]: #45 (comment)
smklein added a commit to oxidecomputer/omicron that referenced this pull request Oct 18, 2024
Fixes #6505 , integrates
usage of the new qorb APIs to terminate pools cleanly:
oxidecomputer/qorb#45

# Background

-
[https://github.com/oxidecomputer/qorb](https://github.com/oxidecomputer/qorb)
is a connection pooling crate, which spins up tokio task that attempt to
connect to backends.
- When `qorb` was integrated into Omicron in
#5876, I used it to connect
to our database backend (CockroachDB). This included usage in tests,
even with a "single backend host" (one, test-only CRDB server) -- I
wanted to ensure that we used the same pool and connector logic in tests
and prod.

# What Went Wrong

As identified in #6505 , we saw some tests failing during
**termination**. The specific cause of the failure was a panic from
[async-bb8-diesel](https://github.com/oxidecomputer/async-bb8-diesel),
where we attempted to spawn tasks with a terminating tokio executor.
This issue stems from async-bb8-diesel's usage of
`tokio::task::spawn_blocking`, where the returned `JoinHandle` is
immediately awaited and **unwrapped**, with an expectation that "we
should always be able to spawn a blocking task".

There was a separate discussion about "whether or not async-bb8-diesel
should be unwrapping in this case" (see:
oxidecomputer/async-bb8-diesel#77), but instead,
I chose to focus on the question:

## Why are we trying to send requests to async-bb8-diesel while the
tokio runtime is exiting?

The answer to this question lies in qorb's internals -- qorb itself
spawns many tokio tasks to handle ongoing work, including monitoring DNS
resolution, checking on backend health, and making connections before
they are requested. One of these qorb tasks calling `ping_async`, which
checks connection health, used the `async-bb8-diesel` interface that
ultimately panicked.

Within qorb most of these tokio tasks have a drop implementation of the
form:

```rust
struct MyQorbStructure {
  handle: tokio::task::JoinHandle<()>,
}

impl Drop for MyQorbStructure {
  fn drop(&mut self) {
    self.handle.abort();
  }
}
```

Tragically, without async drop support in Rust, this is the best we can
do implicitly -- signal that the background tasks should stop ASAP --
but that may not be soon enough! Calling `.abort()` on the `JoinHandle`
does not terminate the task immediately, it simply signals that it
should shut down at the next yield point.

As a result, we can still see the flake observed in #6505:

- A qorb pool is initialized with background tasks
- One of the qorb worker tasks is about to make a call to check on the
health of a connection to a backend
- The test finishes, and returns. The tokio runtime begins terminating
- We call `drop` on the `qorb` pool, which signals the background task
should abort
- The aforementioned qorb worker task makes the call to `ping_async`,
which calls `spawn_blocking`. This fails, because the tokio runtime is
terminating, and returns a
[JoinError::Cancelled](https://buildomat.eng.oxide.computer/wg/0/details/01J9YQVN7X5EQNXFSEY6XJBH8B/zfviqPz9RoJp3bY4TafbyqXTwbhqdr7w4oupqBtVARR00CXF/01J9YQWAXY36WM0R2VG27QMFRK#S6049).
- `async-bb8-diesel` unwraps this `JoinError`, and the test panics.

# How do we mitigate this?

That's where this PR comes in. Using the new qorb APIs, we don't rely on
the synchronous drop methods -- we explicitly call `.terminate().await`
functions which do the following:

- Use `tokio::sync::oneshot`s as signals to `tokio::tasks` that they
should exit
- `.await` the `JoinHandle` for those tasks before returning

Doing this work explicitly as a part of cleanup ensures that there are
not any background tasks attempting to do new work while the tokio
runtime is terminating.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants