Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHD: convert to async #633

Merged
merged 21 commits into from
Feb 7, 2024
Merged

PHD: convert to async #633

merged 21 commits into from
Feb 7, 2024

Conversation

gjcolombo
Copy link
Contributor

@gjcolombo gjcolombo commented Jan 30, 2024

This PR converts the PHD framework to run on an async runtime (instead of running mostly synchronously and using block_on to handle cases where async calls are needed, e.g. working with Progenitor clients).

Motivation

PHD tests that use file-backed disks make hard copies of the artifacts that are supposed to underlie those disks. This takes up a lot of space and is a significant performance bottleneck. This isn't very obvious with the tiny Alpine image that PHD uses by default, but creating a new copy of a 30 GiB Windows image for every test gets expensive quickly!

One promising way to solve this is to use ZFS snapshots and clones to create copy-on-write forks of these artifacts. PHD used to do this, but it exacerbated a second problem that PHD already has: if you interrupt a PHD run via SIGINT, the resources that it was using--kernel VMMs, copies of images, and Crucible downstairs data directories--don't get cleaned up. Adding another set of leak-able resources into the mix makes this problem worse.

Solving the SIGINT problem in a synchronous runner is tricky. It's not too hard to install a handler, catch the signal, and cleanly cancel the test run after the current test completes, but if the current test is stuck, e.g. waiting for serial console output that will never arrive, this can take a long time. A more advanced approach is to ensure (via signal masks, see generally SIGNAL(3HEAD)) that incoming SIGINT signals are directed to the thread that's running the test. In PHD's current architecture, the signal handler can panic to abort the test then and there. But it's not clear (to this author) how this would work in a hypothetical future where there are multiple threads executing tests simultaneously.

On the other hand, if test cases are futures, things get a lot simpler: the runner's test execution tasks can simply select! between a test execution future and an indicator that the run was canceled by a signal; in the latter case, each task simply needs to drop the test execution future to drop all of the VM and disk objects it was using and clean up their resources.

Implementation

Most of the delta in this change just comes from sticking async and await on everything. The parts of the change that don't do that are as follows:

  • The testcase and testcase_macro crates changed a bit to reflect the fact that test cases are now futures that yield a TestOutcome instead of functions that return one.
  • The Drop impl for TestVm changed significantly. Before, it could use block_on to call Progenitor client functions to try to stop a Propolis VM when its TestVm dropped. Since there's no async drop, this impl now builds a task to handle this cleanup and sends it back to the PHD framework, which allows the runner to await it in its test cleanup fixture.
  • Framework::lifecycle_test can no longer take a callback that operates on a test VM; instead, it takes a callback that produces a future that runs checks on that VM. This is painless in some cases, but getting the lifetimes right for check functions that capture from their environments turned out to be a major hassle. I'm too inexperienced as yet to have worked out how to express the relevant lifetimes to rustc to get this to compile without an ugly workaround. Guidance here would be much appreciated; see the REVIEW(gjc) comments in the code for the places where this came up.

Tested via local runs with Alpine and Debian 11 guests.

@hawkw hawkw self-requested a review January 30, 2024 22:41
@gjcolombo gjcolombo changed the title [RFC] PHD: convert to async PHD: convert to async Jan 31, 2024
@gjcolombo gjcolombo marked this pull request as ready for review January 31, 2024 00:31
@gjcolombo
Copy link
Contributor Author

This got a positive enough reaction in team chat that I've taken off the Draft tag (though I still need to experiment with the downstream changes I have in mind here).

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall! I'm a bit on the fence about the async-trait thing, but...I'm willing to be convinced that it's the best approach.

@gjcolombo gjcolombo force-pushed the gjcolombo/phd/async branch from bd828df to 230a53a Compare February 7, 2024 17:24
Just have lifecycle test functions move/clone their contexts instead. It's still
mildly annoying to have to do this, but this is still less boilerplate than the
trait object version.
@gjcolombo
Copy link
Contributor Author

Thanks for spending so much time with this one, @hawkw! Will get this merged and rebase #634 onto master once the latest runs are green.

@gjcolombo gjcolombo merged commit c8f6937 into master Feb 7, 2024
@gjcolombo gjcolombo deleted the gjcolombo/phd/async branch February 7, 2024 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants