-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PHD: convert to async #633
Conversation
This got a positive enough reaction in team chat that I've taken off the Draft tag (though I still need to experiment with the downstream changes I have in mind here). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall! I'm a bit on the fence about the async-trait
thing, but...I'm willing to be convinced that it's the best approach.
bd828df
to
230a53a
Compare
Just have lifecycle test functions move/clone their contexts instead. It's still mildly annoying to have to do this, but this is still less boilerplate than the trait object version.
This PR converts the PHD framework to run on an async runtime (instead of running mostly synchronously and using
block_on
to handle cases where async calls are needed, e.g. working with Progenitor clients).Motivation
PHD tests that use file-backed disks make hard copies of the artifacts that are supposed to underlie those disks. This takes up a lot of space and is a significant performance bottleneck. This isn't very obvious with the tiny Alpine image that PHD uses by default, but creating a new copy of a 30 GiB Windows image for every test gets expensive quickly!
One promising way to solve this is to use ZFS snapshots and clones to create copy-on-write forks of these artifacts. PHD used to do this, but it exacerbated a second problem that PHD already has: if you interrupt a PHD run via SIGINT, the resources that it was using--kernel VMMs, copies of images, and Crucible downstairs data directories--don't get cleaned up. Adding another set of leak-able resources into the mix makes this problem worse.
Solving the SIGINT problem in a synchronous runner is tricky. It's not too hard to install a handler, catch the signal, and cleanly cancel the test run after the current test completes, but if the current test is stuck, e.g. waiting for serial console output that will never arrive, this can take a long time. A more advanced approach is to ensure (via signal masks, see generally
SIGNAL(3HEAD)
) that incoming SIGINT signals are directed to the thread that's running the test. In PHD's current architecture, the signal handler can panic to abort the test then and there. But it's not clear (to this author) how this would work in a hypothetical future where there are multiple threads executing tests simultaneously.On the other hand, if test cases are futures, things get a lot simpler: the runner's test execution tasks can simply
select!
between a test execution future and an indicator that the run was canceled by a signal; in the latter case, each task simply needs to drop the test execution future to drop all of the VM and disk objects it was using and clean up their resources.Implementation
Most of the delta in this change just comes from sticking
async
andawait
on everything. The parts of the change that don't do that are as follows:testcase
andtestcase_macro
crates changed a bit to reflect the fact that test cases are now futures that yield aTestOutcome
instead of functions that return one.Drop
impl forTestVm
changed significantly. Before, it could useblock_on
to call Progenitor client functions to try to stop a Propolis VM when itsTestVm
dropped. Since there's no async drop, this impl now builds a task to handle this cleanup and sends it back to the PHD framework, which allows the runner to await it in its test cleanup fixture.Framework::lifecycle_test
can no longer take a callback that operates on a test VM; instead, it takes a callback that produces a future that runs checks on that VM. This is painless in some cases, but getting the lifetimes right for check functions that capture from their environments turned out to be a major hassle. I'm too inexperienced as yet to have worked out how to express the relevant lifetimes torustc
to get this to compile without an ugly workaround. Guidance here would be much appreciated; see theREVIEW(gjc)
comments in the code for the places where this came up.Tested via local runs with Alpine and Debian 11 guests.