
PHD: Use ZFS clones for file-backed disks #640

Merged · 12 commits · Feb 14, 2024
Conversation

gjcolombo
Contributor

(N.B. Stacked on #634.)

When creating a file-backed disk from an artifact, create a ZFS snapshot and of the dataset containing the artifact, then create and mount a clone of the snapshot and back the disk from the clone. This is much, much less expensive than copying the underlying file and markedly improves performance for test runs with disks of nontrivial size. (On my local test machine, a full test run with a 2 GiB Debian nocloud image took 688 seconds before this change and 610 seconds after, about a 10% speedup.)

This is unfortunately not likely to move the needle much on CI times because the Alpine test image CI uses is very small. However, this should make it much easier to switch to non-Alpine CI guest(s) without needlessly bloating CI times.

Tested with full runs against Debian 11 and Alpine guests.

@gjcolombo
Contributor Author

Oops. The phd-run job stores artifacts on a tmpfs mount, which makes the runner upset when it tries to figure out which ZFS dataset they're on. Two threads to pull on here: (1) make PHD fall back to file copies if the artifact can't be cloned, (2) see if there's a better directory to use for artifacts so that CI runs can take advantage of this feature.
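The fall-back-to-copy control flow might look something like this Rust sketch (the function names are hypothetical, and the clone stub here always fails to simulate the tmpfs case; the real framework shells out to zfs):

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical sketch: try to produce a ZFS clone of the artifact's
// dataset; here the stub always fails, simulating an artifact that lives
// on tmpfs and therefore has no backing ZFS dataset.
fn try_clone_artifact(_artifact: &Path) -> io::Result<PathBuf> {
    Err(io::Error::new(
        io::ErrorKind::Unsupported,
        "artifact is not stored on a ZFS dataset",
    ))
}

// If cloning fails for any reason, fall back to a plain file copy so the
// test run can still proceed.
fn create_disk_backing(artifact: &Path, scratch_dir: &Path) -> io::Result<PathBuf> {
    match try_clone_artifact(artifact) {
        Ok(cloned) => Ok(cloned),
        Err(e) => {
            eprintln!("warning: ZFS clone unavailable ({e}); copying file instead");
            let dest = scratch_dir.join(artifact.file_name().unwrap());
            fs::copy(artifact, &dest)?;
            Ok(dest)
        }
    }
}
```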

@gjcolombo
Contributor Author

OK, this seems to be working now: the logs from the most recent CI run don't have the warnings/errors about ZFS clones failing to work, and the run before that shows that the fallback behavior works as intended.

@gjcolombo gjcolombo marked this pull request as ready for review February 8, 2024 18:41
@jclulow
Contributor

jclulow commented Feb 8, 2024

In the lab* target environments, if you want a ZFS pool (reasonable!) you'll need to create one on one of the available SSDs.

@jclulow
Contributor

jclulow commented Feb 8, 2024

@gjcolombo
Contributor Author

Oh nice, beat me to it. Thanks for the pointer! (I see now from a perusal of the lab image post-boot script that I'm definitely getting away with a crime here, abetted by the small size of the Alpine image, which just so happens to fit into the /work ramdisk...)

I'll move this back to draft status while I work on this a bit more. I think I'll also want to change the framework not to presume anything about the name of the zpool under which it should try to place the clones.

@gjcolombo gjcolombo marked this pull request as draft February 8, 2024 19:12
@jclulow
Contributor

jclulow commented Feb 8, 2024

Yeah I think two things to note:

  • if you've got clones in pool B you need the original in B as well, as you can't clone a snapshot across two pools
  • it's always good to be able to just take a ZFS dataset name (which might be a child of the pool, rather than at the top level) and work exclusively under that; that's what the various Helios image building things do, so that they can coexist on multiuser systems, or work in different CI shapes, etc
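Both points can be illustrated at the zfs CLI (pool and dataset names here are hypothetical):

```shell
# A clone must live in the same pool as its origin snapshot:
zfs snapshot poolA/artifacts@base
zfs clone poolA/artifacts@base poolB/clone1   # fails: cross-pool clone
zfs clone poolA/artifacts@base poolA/clone1   # OK: same pool

# Taking a full dataset name as configuration (rather than assuming the
# pool's top level) lets the tool work entirely under a delegated subtree:
PHD_DATASET=poolA/ci/phd                      # hypothetical config knob
zfs create -p "${PHD_DATASET}"
zfs clone poolA/artifacts@base "${PHD_DATASET}/clone1"
```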

@gjcolombo
Contributor Author

gjcolombo commented Feb 8, 2024

I have (hopefully) tightened this up a bit in the newest changes. The phd-run script now creates a pool spanning the runner's SSDs (as in the omicron-deploy job) and uses this pool's default dataset's root directory to store downloaded artifacts. I also tweaked the framework so that, when cloning artifact A, the clone is made a descendant of A's original dataset (the framework was already looking this up to figure out what to snapshot) instead of sticking it under whatever happens to be named rpool and hoping for the best.

I ran this locally (where I have a bespoke artifact TOML that specifies some handmade artisan locally-blended OS images, not all of which live in the same directory as the artifact dir I pass to PHD) in addition to the CI run on 2b025bd and things looked OK (tests pass, nothing obviously leaked).
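Roughly, the new phd-run setup corresponds to something like the following (pool name and device names are hypothetical):

```shell
# Create a pool spanning the runner's available SSDs, as the
# omicron-deploy job does (device names are placeholders):
zpool create phd-pool c1t0d0 c1t1d0

# Downloaded artifacts go in the pool's default dataset's root directory,
# and each clone is made a descendant of the artifact's own dataset:
zfs snapshot phd-pool@artifact-snap
zfs clone phd-pool@artifact-snap phd-pool/phd-clone-1
```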

@gjcolombo gjcolombo marked this pull request as ready for review February 8, 2024 21:21
@hawkw hawkw self-requested a review February 8, 2024 21:43
Member

@hawkw hawkw left a comment


Overall this is very straightforward and pretty easy to understand! I commented on some small nits.

@gjcolombo gjcolombo force-pushed the gjcolombo/phd/ctrl-c branch from 13a27d1 to c655dfb Compare February 14, 2024 19:08
Base automatically changed from gjcolombo/phd/ctrl-c to master February 14, 2024 20:38
@gjcolombo gjcolombo merged commit 02493c1 into master Feb 14, 2024
10 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/phd/zfs branch February 14, 2024 23:13