run PHD tests in parallel by default #882

Draft · iximeow wants to merge 5 commits into master from ixi/parallel-phd
Conversation

iximeow (Member) commented Mar 8, 2025

this adds a --parallelism flag to phd-runner (and by extension, cargo xtask phd run), which, when set, gives phd-runner an upper bound on how many tests to run concurrently.

if --parallelism is not passed, phd-runner will try to guess a level of parallelism that won't cause tests to fail. this involves a few pretty arbitrary choices that i think are generally reasonable, and definitely fine for the environments i know most about (my workstation and lab CI), but they'll be kind of annoying if they're wrong.

MAX_VMS_GUESS assumes that no more than three VMs are running in a test at the same time. i think the actual number here is two. the heuristic also assumes it's safe to use up to a quarter of physical memory if the reservoir is not configured. if these are wrong, the worst case is that we'll try to run too many test VMs concurrently and either get slower tests (reclaiming ARC memory, paging stuff out, ...) or outright fail to allocate memory for a VM and fail the test for that reason.
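
to make the shape of that guess concrete, here's a rough sketch of the arithmetic described above. this is not the actual phd-runner code: the function name, the per-guest memory size, and the example inputs are made up for illustration.

```rust
// assumption from the description above: a test runs at most this many VMs at once.
const MAX_VMS_GUESS: u64 = 3;

/// hypothetical sketch of the parallelism guess; not phd-runner's real code.
fn guess_parallelism(phys_mem_bytes: u64, reservoir_bytes: u64, vm_mem_bytes: u64) -> u64 {
    // budget against the reservoir if one is configured; otherwise assume
    // it's safe to use up to a quarter of physical memory.
    let budget = if reservoir_bytes > 0 {
        reservoir_bytes
    } else {
        phys_mem_bytes / 4
    };
    // each concurrent test may need memory for up to MAX_VMS_GUESS guests.
    let per_test = MAX_VMS_GUESS * vm_mem_bytes;
    // always run at least one test, even if the budget looks too small.
    (budget / per_test).max(1)
}

fn main() {
    // e.g. 64 GiB of RAM, no reservoir, 2 GiB guests -> a guess of 2.
    println!("{}", guess_parallelism(64 << 30, 0, 2 << 30));
}
```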

and of course, if PHD is run with --omicron-build and there's no reservoir, that'll fail regardless of parallelism. so really, please set the reservoir to some non-zero size! having phd-runner set it itself would mean having to pfexec cargo xtask phd run, and that seems in poor form. so i implore the i'm-sure-studious reader of test logs to set a reservoir size for tests.

on my workstation the PHD tests run with the Alpine image in about 320 seconds. out of the box this infers a max parallelism of 5, with which the tests take 71 seconds; with a manual --parallelism 8, 42 seconds. seems good! i've run with --parallelism 8 in a loop for a bit and haven't seen anything untoward.

iximeow marked this pull request as ready for review on March 8, 2025 01:10
iximeow force-pushed the ixi/parallel-phd branch from 6615594 to f95fa59 on March 8, 2025 04:49
iximeow force-pushed the ixi/parallel-phd branch from f95fa59 to e0d4da0 on March 8, 2025 04:51
iximeow (Member, Author) commented Mar 10, 2025

after being mostly perplexed on Friday afternoon: there are a few wrinkles with this form of concurrency that show up in CI but that i wasn't seeing locally. cpuid-utils uses a VM named cpuid-gen-<pid> to query the default bhyve cpuid config, and of course if you try to create one of those on a different thread while one already exists, you get File exists. simple enough.
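
for illustration only, a minimal sketch of one way to dodge that collision, assuming the fix is just to make the scratch VM name unique per creation rather than per process (the helper name and counter are made up, not cpuid-utils' actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// process-wide counter so two threads in the same process never pick the
// same scratch VM name; `cpuid-gen-<pid>` alone collides as soon as two
// threads create one concurrently.
static CPUID_VM_SEQ: AtomicU64 = AtomicU64::new(0);

fn scratch_vm_name() -> String {
    let seq = CPUID_VM_SEQ.fetch_add(1, Ordering::Relaxed);
    format!("cpuid-gen-{}-{}", std::process::id(), seq)
}
```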

the more confusing one is crucible-downstairs. i think what happened here is that multiple runners try to fetch the right crucible-nightly.tar.gz at the same time; whoever writes it first then proceeds to extract crucible-downstairs from a .tar.gz that someone else has started scribbling over.

now, it's not great to download artifacts N times, or extract them potentially N times, but it's weird that the extra rounds of extracting crucible-downstairs end up with missing or broken files. except that tar's unpack() specifically does OpenOptions::new().write(true).create_new(true).open(dst) to head off the risk of clobbering an in-use file as an attack vector. so while overwriting here would be fine-but-uncomfortable, we really just have to not do that.
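
the approach this PR actually takes is per-runner directories (below), but for a sense of what "just not doing that" could look like in general, here's a hedged sketch: stage the download and extraction at a path unique to the runner, then rename the finished artifact into place. `stage_artifact` and its `fetch` closure are hypothetical; this avoids the clobbering, though it still does the redundant work N times.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

// per-process counter so concurrent runners (threads) get distinct staging paths.
static STAGE_SEQ: AtomicU64 = AtomicU64::new(0);

fn stage_artifact(
    final_path: &Path,
    fetch: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<()> {
    if final_path.exists() {
        // another runner already finished; nothing to do.
        return Ok(());
    }
    let name = final_path.file_name().unwrap().to_string_lossy();
    let staging = final_path.with_file_name(format!(
        "{}.staging.{}.{}",
        name,
        std::process::id(),
        STAGE_SEQ.fetch_add(1, Ordering::Relaxed),
    ));
    // download + extract into the staging path, which no other runner touches.
    fetch(&staging)?;
    // rename() is atomic within a filesystem: other runners see either no file
    // or the complete artifact, never one that's still being scribbled over.
    match fs::rename(&staging, final_path) {
        Ok(()) => Ok(()),
        Err(_) if final_path.exists() => {
            // lost the race to another runner; discard our copy.
            let _ = fs::remove_file(&staging);
            Ok(())
        }
        Err(e) => Err(e),
    }
}
```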

if phd-runner decides to run with parallelism > 1, the tmpdir and
artifact directories, which phd-runner will write to from each
framework, are rewritten to be unique per-framework.

this means the directory layout changes based on available parallelism,
and makes the `phd-run-with-args.sh` script somewhat more gross, and
generally is pretty unfortunate. at this point it's just to demonstrate
that something is possible, but this really deserves the requisite
surgery to share one log dir and one artifact dir across the Frameworks.
iximeow (Member, Author) commented Mar 10, 2025

589a5ea leaves this in an "it.... works, i guess" place: instead of doing the requisite surgery for ArtifactsStore to be shared, i'm making runner-%d directories under the tmpdir and artifacts-dir directories and carrying on from there.

especially since i'm doing this only if the inferred parallelism is >1, it's pretty gross. since at this point i only intend to show that we could do something if we were gonna sweat the time, i'm moving this back to draft. in the future i'd love to do the right kind of surgery to just share state between runners here!
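
roughly, the layout that ends up meaning (a sketch, not the code in 589a5ea; the helper name and signature are made up):

```rust
use std::path::{Path, PathBuf};

// with parallelism > 1, each runner gets its own runner-<N> subdirectory under
// the tmpdir and the artifacts dir; with parallelism of 1, the existing layout
// is unchanged.
fn runner_dirs(
    tmpdir: &Path,
    artifacts_dir: &Path,
    runner: usize,
    parallelism: usize,
) -> (PathBuf, PathBuf) {
    if parallelism > 1 {
        (
            tmpdir.join(format!("runner-{runner}")),
            artifacts_dir.join(format!("runner-{runner}")),
        )
    } else {
        (tmpdir.to_path_buf(), artifacts_dir.to_path_buf())
    }
}
```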

iximeow marked this pull request as draft on March 10, 2025 19:52
iximeow (Member, Author) commented Mar 10, 2025

the run for 589a5ea was fine until it wasn't. more Frameworks got into a good state, but with more crucible-downstairses we used more disk, and ended up with a test hitting Error: IO Error: Os { code: 28, kind: StorageFull, message: "No space left on device" }. that'll be an issue to be careful about even if the framework is otherwise handling all the state nicely.
