PHD: convert to async #633

gjcolombo · 2024-01-30T22:36:48Z

This PR converts the PHD framework to run on an async runtime (instead of running mostly synchronously and using block_on to handle cases where async calls are needed, e.g. working with Progenitor clients).

Motivation

PHD tests that use file-backed disks make hard copies of the artifacts that are supposed to underlie those disks. This takes up a lot of space and is a significant performance bottleneck. This isn't very obvious with the tiny Alpine image that PHD uses by default, but creating a new copy of a 30 GiB Windows image for every test gets expensive quickly!

One promising way to solve this is to use ZFS snapshots and clones to create copy-on-write forks of these artifacts. PHD used to do this, but it exacerbated a second problem that PHD already has: if you interrupt a PHD run via SIGINT, the resources that it was using--kernel VMMs, copies of images, and Crucible downstairs data directories--don't get cleaned up. Adding another set of leak-able resources into the mix makes this problem worse.

Solving the SIGINT problem in a synchronous runner is tricky. It's not too hard to install a handler, catch the signal, and cleanly cancel the test run after the current test completes, but if the current test is stuck, e.g. waiting for serial console output that will never arrive, this can take a long time. A more advanced approach is to ensure (via signal masks, see generally SIGNAL(3HEAD)) that incoming SIGINT signals are directed to the thread that's running the test. In PHD's current architecture, the signal handler can panic to abort the test then and there. But it's not clear (to this author) how this would work in a hypothetical future where there are multiple threads executing tests simultaneously.

On the other hand, if test cases are futures, things get a lot simpler: the runner's test execution tasks can simply select! between a test execution future and an indicator that the run was canceled by a signal; in the latter case, each task simply needs to drop the test execution future to drop all of the VM and disk objects it was using and clean up their resources.

Implementation

Most of the delta in this change just comes from sticking async and await on everything. The parts of the change that don't do that are as follows:

The testcase and testcase_macro crates changed a bit to reflect the fact that test cases are now futures that yield a TestOutcome instead of functions that return one.
The Drop impl for TestVm changed significantly. Before, it could use block_on to call Progenitor client functions to try to stop a Propolis VM when its TestVm dropped. Since there's no async drop, this impl now builds a task to handle this cleanup and sends it back to the PHD framework, which allows the runner to await it in its test cleanup fixture.
Framework::lifecycle_test can no longer take a callback that operates on a test VM; instead, it takes a callback that produces a future that runs checks on that VM. This is painless in some cases, but getting the lifetimes right for check functions that capture from their environments turned out to be a major hassle. I'm too inexperienced as yet to have worked out how to express the relevant lifetimes to rustc to get this to compile without an ugly workaround. Guidance here would be much appreciated; see the REVIEW(gjc) comments in the code for the places where this came up.

Tested via local runs with Alpine and Debian 11 guests.

phd-tests/framework/src/artifacts/buildomat.rs

phd-tests/testcase_macro/src/lib.rs

phd-tests/framework/src/test_vm/config.rs

phd-tests/framework/src/test_vm/mod.rs

gjcolombo · 2024-01-31T00:32:58Z

This got a positive enough reaction in team chat that I've taken off the Draft tag (though I still need to experiment with the downstream changes I have in mind here).

phd-tests/framework/src/lifecycle.rs

phd-tests/tests/src/hw.rs

phd-tests/tests/src/migrate.rs

phd-tests/framework/src/test_vm/mod.rs

phd-tests/testcase_macro/src/lib.rs

hawkw

Looks good to me overall! I'm a bit on the fence about the async-trait thing, but...I'm willing to be convinced that it's the best approach.

phd-tests/framework/src/test_vm/mod.rs

phd-tests/tests/src/hw.rs

Just have lifecycle test functions move/clone their contexts instead. It's still mildly annoying to have to do this, but this is still less boilerplate than the trait object version.

gjcolombo · 2024-02-07T17:54:50Z

Thanks for spending so much time with this one, @hawkw! Will get this merged and rebase #634 onto master once the latest runs are green.

hawkw self-requested a review January 30, 2024 22:41

hawkw reviewed Jan 30, 2024

View reviewed changes

phd-tests/framework/src/artifacts/buildomat.rs Show resolved Hide resolved

hawkw reviewed Jan 30, 2024

View reviewed changes

gjcolombo changed the title ~~[RFC] PHD: convert to async~~ PHD: convert to async Jan 31, 2024

gjcolombo marked this pull request as ready for review January 31, 2024 00:31

gjcolombo mentioned this pull request Jan 31, 2024

PHD: improve ctrl-c handling #634

Merged

hawkw reviewed Jan 31, 2024

View reviewed changes

gjcolombo commented Feb 1, 2024

View reviewed changes

phd-tests/testcase_macro/src/lib.rs Outdated Show resolved Hide resolved

gjcolombo mentioned this pull request Feb 1, 2024

phd: share single reqwest client for all artifact store operations #635

Open

gjcolombo requested a review from hawkw February 2, 2024 05:16

hawkw approved these changes Feb 5, 2024

View reviewed changes

phd-tests/framework/src/test_vm/mod.rs Show resolved Hide resolved

phd-tests/tests/src/hw.rs Outdated Show resolved Hide resolved

gjcolombo added 18 commits February 6, 2024 01:22

WIP: make test VM functions async

f70950a

async-ify everything

63b238a

more async plumbing

0357891

wait for VM cleanup in test cleanup fixture

5139cbf

Clippy and Toad Are Friends

5f09f0f

instrument test case routines

0382e12

make VM cleanup tracing less noisy

2cc0f90

tokio::task::spawn -> tokio::spawn

df90349

polish up cleanup tasks

acde8b9

fix whitespace

08586f0

style cleanup

802c6a0

remove now-vestigial panic hook

cf66c14

use backoff::Error::transient

2573123

Indentation Reduction Act

c5daee2

add module path to top-level test span

eb4cc67

hand-roll closures to avoid cloning

204db85

instrument instead of manually entering spans

79ee8db

asyncify after rebasing

230a53a

gjcolombo force-pushed the gjcolombo/phd/async branch from bd828df to 230a53a Compare February 7, 2024 17:24

gjcolombo added 3 commits February 7, 2024 17:24

Revert "hand-roll closures to avoid cloning"

1bb7bcb

Just have lifecycle test functions move/clone their contexts instead. It's still mildly annoying to have to do this, but this is still less boilerplate than the trait object version.

improve tracing output a bit more

04b48c3

remove unnecessary moves

03ff957

gjcolombo merged commit c8f6937 into master Feb 7, 2024

gjcolombo deleted the gjcolombo/phd/async branch February 7, 2024 19:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PHD: convert to async #633

PHD: convert to async #633

gjcolombo commented Jan 30, 2024 •

edited

Loading

gjcolombo commented Jan 31, 2024

hawkw left a comment

gjcolombo commented Feb 7, 2024

PHD: convert to async #633

PHD: convert to async #633

Conversation

gjcolombo commented Jan 30, 2024 • edited Loading

Motivation

Implementation

gjcolombo commented Jan 31, 2024

hawkw left a comment

Choose a reason for hiding this comment

gjcolombo commented Feb 7, 2024

gjcolombo commented Jan 30, 2024 •

edited

Loading