run PHD tests in parallel by default #882

Draft · iximeow wants to merge 5 commits into master from ixi/parallel-phd
Conversation

iximeow (Member) commented Mar 8, 2025

this adds a --parallelism flag to phd-runner (and by extension, cargo xtask phd run), which, when set, gives phd-runner an upper bound on how many tests to run concurrently.

if --parallelism is not passed, phd-runner will try to guess a level of parallelism that won't cause tests to fail. this involves a few pretty arbitrary choices that i think are generally reasonable, and definitely fine for the environments i know most about (my workstation and lab CI), but they'll be kind of annoying if they're wrong.

MAX_VMS_GUESS assumes that no more than three VMs are running in a test at the same time. i think the actual number here is two. the heuristic also assumes it's safe to use up to a quarter of physical memory if the reservoir is not configured. if these are wrong, the worst case is that we'll try to run too many test VMs concurrently and either get slower tests (reclaiming ARC memory, paging stuff out, ...) or outright fail to allocate memory for a VM and fail the test for that reason.
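
to make the shape of that guess concrete, here's a rough sketch of the arithmetic described above. this is not the actual phd-runner code: the function name, the per-guest memory size, and the example inputs are made up for illustration.

```rust
// assumption from the description above: a test runs at most this many VMs at once.
const MAX_VMS_GUESS: u64 = 3;

/// hypothetical sketch of the parallelism guess; not phd-runner's real code.
fn guess_parallelism(phys_mem_bytes: u64, reservoir_bytes: u64, vm_mem_bytes: u64) -> u64 {
    // budget against the reservoir if one is configured; otherwise assume
    // it's safe to use up to a quarter of physical memory.
    let budget = if reservoir_bytes > 0 {
        reservoir_bytes
    } else {
        phys_mem_bytes / 4
    };
    // each concurrent test may need memory for up to MAX_VMS_GUESS guests.
    let per_test = MAX_VMS_GUESS * vm_mem_bytes;
    // always run at least one test, even if the budget looks too small.
    (budget / per_test).max(1)
}

fn main() {
    // e.g. 64 GiB of RAM, no reservoir, 2 GiB guests -> a guess of 2.
    println!("{}", guess_parallelism(64 << 30, 0, 2 << 30));
}
```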

and of course, if PHD is run with --omicron-build and there's no reservoir, that'll fail regardless of parallelism. so really, please set the reservoir to some non-zero size! having phd-runner set it itself would mean having to pfexec cargo xtask phd run, and that seems in poor form. so i implore the i'm-sure-studious reader of test logs to set a reservoir size for tests.

on my workstation the PHD tests run with the Alpine image in about 320 seconds. out of the box this infers a max parallelism of 5, with which the tests take 71 seconds; with a manual --parallelism 8, 42 seconds. seems good! i've run with --parallelism 8 in a loop for a bit and haven't seen anything untoward.

iximeow marked this pull request as ready for review on March 8, 2025 01:10
iximeow force-pushed the ixi/parallel-phd branch from 6615594 to f95fa59 on March 8, 2025 04:49
iximeow force-pushed the ixi/parallel-phd branch from f95fa59 to e0d4da0 on March 8, 2025 04:51
iximeow (Member, Author) commented Mar 10, 2025

after being mostly perplexed on Friday afternoon: there are a few wrinkles with this form of concurrency that show up in CI but that i wasn't seeing locally. cpuid-utils uses a VM named cpuid-gen-<pid> to query the default bhyve cpuid config, and of course if you try to create one of those on a different thread while one already exists, you get File exists. simple enough.
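
for illustration only, a minimal sketch of one way to dodge that collision, assuming the fix is just to make the scratch VM name unique per creation rather than per process (the helper name and counter are made up, not cpuid-utils' actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// process-wide counter so two threads in the same process never pick the
// same scratch VM name; `cpuid-gen-<pid>` alone collides as soon as two
// threads create one concurrently.
static CPUID_VM_SEQ: AtomicU64 = AtomicU64::new(0);

fn scratch_vm_name() -> String {
    let seq = CPUID_VM_SEQ.fetch_add(1, Ordering::Relaxed);
    format!("cpuid-gen-{}-{}", std::process::id(), seq)
}
```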

the more confusing one is crucible-downstairs. i think what happened here is that multiple runners try to fetch the right crucible-nightly.tar.gz at the same time; whoever writes it first then proceeds to extract crucible-downstairs from a .tar.gz that someone else has started scribbling over.

now, it's not great to download artifacts N times, or extract them potentially N times, but it's weird that the extra rounds of extracting crucible-downstairs end up with missing or broken files. except that tar's unpack() specifically does OpenOptions::new().write(true).create_new(true).open(dst) to head off the risk of clobbering an in-use file as an attack vector. so while overwriting here would be fine-but-uncomfortable, we really just have to not do that.
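
the approach this PR actually takes is per-runner directories (below), but for a sense of what "just not doing that" could look like in general, here's a hedged sketch: stage the download and extraction at a path unique to the runner, then rename the finished artifact into place. `stage_artifact` and its `fetch` closure are hypothetical; this avoids the clobbering, though it still does the redundant work N times.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

// per-process counter so concurrent runners (threads) get distinct staging paths.
static STAGE_SEQ: AtomicU64 = AtomicU64::new(0);

fn stage_artifact(
    final_path: &Path,
    fetch: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<()> {
    if final_path.exists() {
        // another runner already finished; nothing to do.
        return Ok(());
    }
    let name = final_path.file_name().unwrap().to_string_lossy();
    let staging = final_path.with_file_name(format!(
        "{}.staging.{}.{}",
        name,
        std::process::id(),
        STAGE_SEQ.fetch_add(1, Ordering::Relaxed),
    ));
    // download + extract into the staging path, which no other runner touches.
    fetch(&staging)?;
    // rename() is atomic within a filesystem: other runners see either no file
    // or the complete artifact, never one that's still being scribbled over.
    match fs::rename(&staging, final_path) {
        Ok(()) => Ok(()),
        Err(_) if final_path.exists() => {
            // lost the race to another runner; discard our copy.
            let _ = fs::remove_file(&staging);
            Ok(())
        }
        Err(e) => Err(e),
    }
}
```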

if phd-runner decides to run with parallelism > 1, the tmpdir and
artifact directories, which phd-runner will write to from each
framework, are rewritten to be unique per-framework.

this means the directory layout changes based on available parallelism,
and makes the `phd-run-with-args.sh` script somewhat more gross, and
generally is pretty unfortunate. at this point it's just to demonstrate
that something is possible, but this really deserves the requisite
surgery to share one log dir and one artifact dir across the Frameworks.
iximeow (Member, Author) commented Mar 10, 2025

589a5ea leaves this in an "it.... works, i guess" place: instead of doing the requisite surgery for ArtifactsStore to be shared, i'm making runner-%d directories under the tmpdir and artifacts-dir directories and carrying on from there.

especially since i'm doing this only if the inferred parallelism is >1, it's pretty gross. since at this point i only intend to show that we could do something if we were gonna sweat the time, i'm moving this back to draft. in the future i'd love to do the right kind of surgery to just share state between runners here!
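
roughly, the layout that ends up meaning (a sketch, not the code in 589a5ea; the helper name and signature are made up):

```rust
use std::path::{Path, PathBuf};

// with parallelism > 1, each runner gets its own runner-<N> subdirectory under
// the tmpdir and the artifacts dir; with parallelism of 1, the existing layout
// is unchanged.
fn runner_dirs(
    tmpdir: &Path,
    artifacts_dir: &Path,
    runner: usize,
    parallelism: usize,
) -> (PathBuf, PathBuf) {
    if parallelism > 1 {
        (
            tmpdir.join(format!("runner-{runner}")),
            artifacts_dir.join(format!("runner-{runner}")),
        )
    } else {
        (tmpdir.to_path_buf(), artifacts_dir.to_path_buf())
    }
}
```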

iximeow marked this pull request as draft on March 10, 2025 19:52
iximeow (Member, Author) commented Mar 10, 2025

the run for 589a5ea was fine until it wasn't. more Frameworks got into a good state, but with more crucible-downstairses we used more disk, and ended up with a test hitting Error: IO Error: Os { code: 28, kind: StorageFull, message: "No space left on device" }. that'll be an issue to be careful about even if the framework is otherwise handling all the state nicely.
