Test oximeter metrics end to end #855
Conversation
Other tests in Propolis validate that the Oximeter translation of kstats for a VM are coherent, but do so from stubbed kstat readings. Those tests are fine (and quick!), but don't cover the risk of a translation error in Propolis reading from a real system, nor the possibility of reading kstats incorrectly in the first place.
... good news: the test doesn't catch anything interesting! The bug I was looking at is somewhere else.
.... i also spent an embarrassingly long time trying to figure out how i'd stand up an Oximeter producer in the test before realizing i really needed to just plumb the flag for propolis-server to start its Oximeter producer.
/// Integrate a list of samples into this collection of virtual machine
/// metrics.
fn add_producer_result(&mut self, samples: &[Sample]) {
this felt kind of awful to write! i hope it's less awful to read! there's a lot of enum type checking and crawling through the Oximeter types to extract the relevant details for this test.
i wonder if it's worth trying to write some kind of test oximeter collector crate for doing stuff like this (not blocking this PR on it, obviously). there might be other codebases (e.g. networking daemons?) that want to make assertions about the samples they emit to oximeter...
probably, but the bit that can be generalized is probably PropolisOximeterSampler and FakeNexusContext, here, more than anything. i assume dendrite or crucible or whomever would have their own versions of VirtualMachineMetrics, and walking through Oximeter samples to rebuild their expected reporting feels like it'd be an annoying day of plumbing or so for other tests like this.
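For illustration, the shape of that sample-walking is roughly the following. This is a minimal sketch using a hypothetical flattened sample type rather than oximeter's real Sample/Measurement enums, which is where most of the crawling actually happens; the names here are illustrative only:

use std::collections::HashMap;

// Hypothetical, flattened stand-in for the details a test pulls out of an
// oximeter Sample; the real code crawls oximeter's Sample/Measurement enums
// to extract these.
struct FlatSample {
    timeseries_name: String,
    vcpu_id: u32,
    nanoseconds: u64,
}

#[derive(Default)]
struct VcpuTimeTotals {
    // Latest cumulative nanoseconds seen, keyed by (timeseries name, vcpu id).
    totals: HashMap<(String, u32), u64>,
}

impl VcpuTimeTotals {
    fn add_samples(&mut self, samples: &[FlatSample]) {
        for s in samples {
            let key = (s.timeseries_name.clone(), s.vcpu_id);
            // These are cumulative counters, so keep the largest value seen
            // across however many producer results we've integrated so far.
            let entry = self.totals.entry(key).or_insert(0);
            *entry = (*entry).max(s.nanoseconds);
        }
    }
}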
Depending on exactly what you're looking for, there is a test oximeter collector crate. The main program can be run in "standalone" mode. In that case, it pretends to be Nexus, very similarly to what's done in this PR. It can also either actually insert data into ClickHouse, or simply print the results it collects if we don't want to stand up ClickHouse as well. One thing it does not currently do is push the samples to a controllable sink -- that is, you can't directly get access to the collected samples today, you have to fetch them from the database. If that functionality is desirable, we can add it to oximeter-collector without too much fuss.
In any case, that standalone mode uses the existing oximeter-collector crate's public API. I don't know what the constraints are here, but you can certainly use that in PHD if you wish.
just since we're here looking at times. didn't expect this to be interesting, and it wasn't.
phd-tests/tests/src/stats.rs (Outdated)
// been spent emulating instructions on the guest's behalf. Anecdotally,
// this is on the order of 8ms between the two samples. This should be very
// low; the workload is almost entirely guest user mode execution.
assert!(full_emul_delta < Duration::from_millis(100).as_nanos());
this is basically the whole motivator for the test. we'd seen a chart where emul seemed to track closely but not perfectly with run, and now we can be a bit more confident that the system is producing coherent numbers at least, and what we saw there was an issue somewhere further up.
Forgive the naive question, but why would one expect emul to perfectly track run?
ah, i wouldn't expect emul to perfectly track run either. if they were exactly the same, i would have just first looked for something to be mislabeled (anywhere between illumos and the console).
close-but-not-perfect tells me that it's not just mislabeled data, which could be illumos calculating stats in a strange way, or something weird about the sampling, or something later at query time. so, the test helps rule out the first two possibilities.
my actual expectation is that emul might count up a little more often when run is counting up, but as a side effect of having to exit for instruction emulation. they ought to be orders of magnitude apart and relatively independent.
Thanks, that's helpful. I misunderstood your first comment as meaning that you expected run / emul to be very highly correlated. It sounds like you expect a weak positive correlation, which makes sense to me. Is the issue that you see the values being much closer in absolute terms than you expect? How close are we talking about here?
I'm just curious about all this, I have very little intuition about what one should see.
Not ixi, but my $0.02: one way to think of the emul bucket is to treat it as time spent in the host doing work on behalf of the guest. This might include things like:
- handling port I/O through the IN and OUT instructions (needs to be handled by an emulated device)
- handling accesses to memory-mapped I/O regions (same)
- emulating instructions that the guest isn't permitted to execute natively (e.g. when the guest executes CPUID the host needs to decide what values to return)
- handling page faults where the guest's memory isn't present in host physical memory (not an issue for us because we require guest memory not to be paged out, but could happen in theory)
This work is going to correlate at least loosely to the time you spend running work in the guest, because a totally idle vCPU generally won't need any of these services.
The relative magnitudes of a vCPU's running time vs. its emulated time depend on what exactly the guest is doing. A CPU-bound workload that's just crunching on data that's resident in guest memory won't generally need to exit very much; an I/O-heavy workload that needs to access the disk a lot will spend more time handling disk commands in the host; a guest thread that executes RDMSR in a tight loop will spend most of its time in the host's RDMSR handler and won't have much guest time at all.
Since the guest generally wants to exit as little as possible, and the host wants each exit to be as short as possible, for non-pathological workloads I'd generally expect the time spent running vCPUs to be much greater than the time spent in emulation. If we think we see a CPU that's spending as much time in emulation as in the guest itself, then either the guest is doing something very unusual, or our accounting is weird.
One useful piece of information that I don't think we get from the vCPU kstats today is the number of exits seen on each vCPU. That can help shed light on some of the unusual cases--if we see lots of emulation time but very few exits, then we want to understand why the average exit time is so long; if we see a credible amount of time per exit and are just seeing a lot of exits, we need to understand what the guest is up to.
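To make the arithmetic concrete, the check being described boils down to comparing deltas of two cumulative counters. A sketch follows; the bindings are placeholders, not the test's actual variables:

use std::time::Duration;

// Cumulative nanosecond counters read at two points in time; the bindings
// are placeholders, not the test's actual variables.
fn check_busy_vcpu_accounting(
    run_before: u128,
    run_after: u128,
    emul_before: u128,
    emul_after: u128,
) {
    let run_delta = run_after - run_before;
    let emul_delta = emul_after - emul_before;

    // A CPU-bound guest should accrue very little emulation time in
    // absolute terms between two samples...
    assert!(emul_delta < Duration::from_millis(100).as_nanos());
    // ...and running time should dominate emulation time by a wide margin.
    assert!(run_delta > emul_delta * 10);
}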
/// `VirtualMachineMetrics` collects these all back together into a single view
/// to test against. See [`VirtualMachineMetrics::add_producer_result`] as the
/// means to accumulate samples into this struct.
struct VirtualMachineMetrics {
i'm right on the fence for this being buried away in stats.rs or more formally part of PHD. probably depends on how much more we expect propolis-server to provide?
I'd vote for trying to make it part of the framework, although I'd have to think a bit more about precisely what interface the framework should expose to the tests. (At $PASTJOB it was often useful to be able to query a specific VM component's internal metrics to make sure that the component and the guest agreed on how the system had behaved.)
I could see a world where we extend this to, say, boot a guest with a Crucible disk and then save various latency metrics as test output so that we can examine them build-over-build. Obviously that's not for this PR, but I think we'll find that kind of thing easier if the "act as an Oximeter consumer" logic is part of the framework.
i've given a think to it and split this a bit - the parts i've moved to the framework here are more infrastructural (the fake Oximeter server, API declaration, spawning the server, handing out leases and the client that queries Oximeter producers). the part that's still in stats.rs is what i suspect will vary more from test to test depending on which stats you're interested in - here, a pile of vCPU metrics, but in the future with the bhyve metrics moved to kstats i could imagine someone else wanting to count VM exit reasons without caring about vCPUs, for example.
on the plus side, i can easily imagine MetricsLocation being useful with other variants, especially when coordinating a fake Oximeter with Crucibles. going that direction in the limit seems a bit silly though, and i've left some hopefully-useful thoughts about the limits of this approach in the additional doc comments on VirtualMachineMetrics.
the worst of this all, IMO, is wait_for_propolis_stats as it is now. makes me scrunch my nose but does the job. my inclination is that this all is in a much better spot for other folks to use when they have Oximeter-related testing needs - not perfect, but much better to go with for the time being.
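For reference, the general shape of a wait_for_propolis_stats-style helper is just poll-until-predicate-or-deadline. A hedged sketch, assuming a tokio runtime, with fetch and is_ready standing in for the real query and condition (this is not the actual helper):

use std::time::{Duration, Instant};

// Hedged sketch of the polling shape only; `fetch` stands in for whatever
// queries the fake Oximeter/producer and `is_ready` for the predicate a
// test cares about. Not the real wait_for_propolis_stats.
async fn wait_for_stats<T, F, Fut>(
    mut fetch: F,
    is_ready: impl Fn(&T) -> bool,
    timeout: Duration,
) -> Option<T>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    let deadline = Instant::now() + timeout;
    loop {
        let stats = fetch().await;
        if is_ready(&stats) {
            return Some(stats);
        }
        if Instant::now() >= deadline {
            return None;
        }
        // Producers are only scraped every few seconds, so poll slowly.
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}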
With the caveat that I haven't looked at as many Oximeter interfaces or internals as I probably should have, I think this split looks pretty reasonable. Someday we may want to see about streamlining this a bit (so that, if I do all the work to define a new metric series in Propolis, it just "lights up" in PHD without my having to do anything), but I think this is a really good place to start. Thanks for reorganizing these bits!
if there's a significant delay between the initial idle kstats observation becoming available and idle_start being measured, it's possible that the guest's actual idle vCPU time is greater than the "measured" idle time from PHD's view. instead, measure max idle time as "the time from before initial measurement to after last measurement", and min idle time as "the time just after initial measurement to after last measurement". still bias min idle time down a bit since the guest may have some active CPU time.
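A sketch of that bracketing, using host-side timestamps taken around the first and last observations (the names here are illustrative, not the test's):

use std::time::{Duration, Instant};

// Illustrative names only. Timestamps are taken on the host side around the
// first and last metric observations, so the guest's true idle time is
// bracketed rather than assumed to line up with either observation exactly.
struct IdleBounds {
    before_first_obs: Instant,
    after_first_obs: Instant,
    after_last_obs: Instant,
}

impl IdleBounds {
    // The guest cannot have idled for longer than the outer window...
    fn max_idle(&self) -> Duration {
        self.after_last_obs - self.before_first_obs
    }

    // ...and should have idled for at least the inner window, minus some
    // slack for whatever active CPU time the "idle" guest still burns.
    fn min_idle(&self, slack: Duration) -> Duration {
        (self.after_last_obs - self.after_first_obs).saturating_sub(slack)
    }
}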
phd-tests/tests/src/stats.rs (Outdated)
let metrics_url = {
    let sampler = self.sampler.lock().unwrap();
    let stats = sampler.as_ref().expect("stats url info exists");
    format!("http://{}/{}", stats.addr, stats.uuid)
};
let res = reqwest::Client::new()
    .get(metrics_url)
    .send()
    .await
    .expect("can send oximeter stats request");
hey @bnaecker how much of a crime is this, in your Oximeter eyes? to me it feels kinda bad to embed how to query the Oximeter producer here in PHD, but at least this ought to break loudly enough if it stopped returning the expected stats.
Eliza was also talking about having something like an Oximeter test crate here, and having to make the metrics URL here is really what sells me on the idea, but maybe we expect this to be more stable than i'm giving it credit for.
This URL is very stable, but I can see why one would neither know that nor want to rely on it. The producer server one stands up to emit metrics fixes the URL for you, there's zero client control over it today. But as I mentioned on the other thread, the only reason we don't already have the ability for you to extract the samples that oximeter collects is because we haven't needed it yet. Printing or inserting into the DB has been sufficient.
I think adding functionality to give oximeter your own sink or channel or whatever, where it will forward any samples it collects, seems like a fine and valuable tool to add.
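Nothing like this exists in oximeter-collector today, so purely as a hypothetical shape: a channel the collector forwards each collected batch into, which a test can then drain and assert on.

use tokio::sync::mpsc;

// Purely hypothetical; no such hook exists in oximeter-collector today. The
// idea is just "hand the collector a sender and receive every batch of
// samples it collects", with `CollectedBatch` standing in for whatever the
// real sample types would be.
struct CollectedBatch {
    producer_id: String,
    samples: Vec<String>, // placeholder for real oximeter sample types
}

fn sample_sink() -> (mpsc::Sender<CollectedBatch>, mpsc::Receiver<CollectedBatch>) {
    // The collector would hold the Sender and forward each collection
    // result; a test would drain the Receiver and assert on what it sees.
    mpsc::channel(64)
}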
Great stuff here! I'm excited about the possibility of being able to use metrics in other PHD tests in the future.
I'd love to see if we can get rid of some of the long sleeps in the CPU time accounting test. I have a couple of (perhaps off-the-wall) suggestions for how we might cajole a guest into putting a specific CPU into a specific state for a shorter interval so that we don't need to sleep as much to avoid noise or to average that noise out over longer periods.
// variable-time CI. We'll validate idle time measurements separately,
// below.

// Idle time boundaries are a little different than running time boundaries
This is perhaps a "no bad ideas in brainstorming"-level idea, but a thought about measuring idle times. Suppose we had a test that worked like this:
1. Boot a guest with two vCPUs
2. Offline CPU 1 via sysfs (echo 0 > /sys/devices/system/cpu/cpu1/online)
3. Collect samples
4. Wait for a short interval
5. Collect more samples
(Steps 3-5 could be re-expressed as "wait for samples, then wait for another batch of samples" if we needed to do that.)
I would expect CPU1 to spend ~100% of its time idle in this case--the guest definitely won't be scheduling any regular threads to it, and I think it may not even try directing any more interrupts to it (from devices or otherwise). If this works, it should let us use a shorter sleep window and a tighter tolerance here. The main downside is that CPU offlining is a pure Linux-ism, so this will be yet another test that we can't run with a Windows guest (although that's probably fine, since the goal here is to validate how we report vCPU stats, not to correlate those with any particular statistics in the guest).
WDYT?
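For concreteness, the guest-side part of step 2 is a single sysfs write. A hedged sketch with a placeholder helper for running commands in the guest:

// Sketch of the proposed flow; `run_in_guest` is a stand-in for whatever
// PHD exposes for running a shell command in the guest (the real helper's
// name and signature may differ).
async fn offline_vcpu1<F, Fut>(mut run_in_guest: F)
where
    F: FnMut(String) -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    // Step 2 above: take vCPU 1 out of the guest scheduler entirely. From
    // here on, the host should account nearly all of that vCPU's time as
    // idle, so the test can use a shorter window and a tighter tolerance.
    run_in_guest("echo 0 > /sys/devices/system/cpu/cpu1/online".to_string()).await;
}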
(this is mostly what we'd talked about on chat, noted here for reference):
while i totally agree on the mechanics here - this should work and measure time more precisely - i'd really prefer this test to work across guest OSes, if only because if we trust our accounting it starts to double as a useful smoke test of guests themselves. and even if we could make the intentional sleeps 1s or 0s, we're still beholden to the ~5s periods for propolis-server to scrape new vCPU kstats. the explicit sleeps are pretty rough, but each metrics sample is also an implicit up-to-5s delay. we could configure that, but at that point i'd rather just make this conditional and run only in CI :D
if the test itself ends up being long enough to be really annoying locally, splitting the test in two for idle/running time could help some, too.
as for being a useful test of guests, in restructuring the metrics glue i'd noticed that Windows is always checking the current time and so this test becomes very sensitive to Windows using the PM timer vs TSC. so once your TSC enlightenment work is all in, we can and should strike this condition - guests starting to do a bunch of emulated vCPU time while idle seems CI fail worthy, generally speaking!
so once your TSC enlightenment work is all in, we can and should strike this condition - guests starting to do a bunch of emulated vCPU time while idle seems CI fail worthy, generally speaking!
We can do this now! You'll need to copy the VM configuration logic from the Hyper-V tests for now, though. (I'm strongly considering making the Hyper-V config the default once the feature is available/on by default in Omicron, but need to do a smoke test or two first, and need to think about whether we lose any coverage by setting this default for all tests--I don't think we do but need to consider it a bit.)
unfortunately the TSC was not the sole source of Windows having more emulation time. i've included the enlightenment for Linux to be consistently well-behaved, and refined the comment with my better understanding of the state of affairs now.
FWIW my inclination is that we probably want the defaults for PHD to hew towards defaults in Omicron? we'll definitely lose test coverage in a whole-system sense, like if the PM timer became panicful and we didn't know because everyone trusted the TSC enlightenment. but that seems more niche. feels like checking that guests start without enlightenments and are basically functional covers that?
I like how this is coming together! I have a couple of nitpicky typo fixes, a thought about the "metrics location" enum, and a note about logging, but I'm really liking the direction here.
fn test_logger() -> Logger {
    let dec = slog_term::PlainSyncDecorator::new(slog_term::TestStdoutWriter);
    let drain = slog_term::FullFormat::new(dec).build().fuse();
    Logger::root(drain, slog::o!("component" => "fake-cleanup-task"))
}
In CI runs this causes non-Bunyan-formatted output to be interleaved with the regular PHD logs (see e.g. lines 892-894 of this log).
It'd be nice to have this obey the logging rules from the PHD runner. It might be a bit of a headache to plumb the emit_bunyan setting all the way down here, but I think it's probably better than the alternatives (e.g. putting framework Oximeter logs into a separate log file). The way I did this for the propolis-server logging params is to pass them from the test runner to the Framework, which packs them into a ServerProcessParameters that gets passed to start_local_vm.
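A sketch of what honoring that preference could look like for the fake Oximeter's logger. The slog-term half mirrors the existing test_logger(); the slog-bunyan wiring and the component name are assumptions, not copied from PHD:

use slog::{Drain, Logger};

// Sketch only: pick the drain from the runner's --emit-bunyan preference
// instead of hard-coding the terminal format. The slog-bunyan builder call
// and the "fake-oximeter" component name are assumptions for illustration.
fn fake_oximeter_logger(emit_bunyan: bool) -> Logger {
    if emit_bunyan {
        let drain =
            std::sync::Mutex::new(slog_bunyan::default(std::io::stdout())).fuse();
        Logger::root(drain, slog::o!("component" => "fake-oximeter"))
    } else {
        let dec =
            slog_term::PlainSyncDecorator::new(slog_term::TestStdoutWriter);
        let drain = slog_term::FullFormat::new(dec).build().fuse();
        Logger::root(drain, slog::o!("component" => "fake-oximeter"))
    }
}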
i hadn't realized that we take emit_bunyan or plumb it anywhere in the first place! e88d07c does the plumbing here, and it's.. not all that bad really!
the crux of it is that it feels like the format for log output probably should flow from what the caller of phd-runner wanted, where we take --emit-bunyan. as-is, propolis-server and Crucible guess the right output formats based on atty::is(), but the plumbing for this preference through PHD itself isn't too bad.
after looking at the logs from phd-run output i think it'd be interesting and useful to have separate log files for the fake Oximeter server, but that should also come with reorganizing the logs into per-test directories so we could have phd/migration_smoke/propolis.{stderr,stdout}.log and phd/migration_smoke/oximeter.log and ... etc. bonus being that Crucible disks are just a top-level mash of crucible-{uuid}-{port} and have fun cross-referencing that with tests... and hey, we could plumb that per-test log root in LogConfig!
// When the time comes to support remote VMs, it will presumably be useful
// to have local and (perhaps multiple) remote VMs report metrics to the
// same server. But we don't support remote VMs yet.
}
Multi-host PHD is obviously not very well fleshed-out yet, but I think the way I would develop it might make this enum obsolete. My 30,000-foot sketch of this is
- of the several machines conducting a multi-host PHD run, one is primary and the others are secondary
- the primary machine runs phd-runner and executes the actual test cases
- the secondary machines run a phd-agent that exposes a Dropshot API that handles the tasks that need to be performed on remote machines (e.g. starting/stopping Propolis servers)
- when invoking the runner on the primary, one passes the addresses of the agent servers
IIUC, even with an architecture like this, the fake Oximeter server still needs to be "local," even if the metrics are coming from a secondary host: submitted samples need to show up in the runner's Framework so that they can be queried by the tests running in that process.
This is obviated somewhat by switching to Clickhouse, I think, since then (I presume) there would be a single Clickhouse server instance that all PHD processes (both runner and agent) could insert data into, and the framework on the runner would just query that server. Even so, it still seems possible in that case for remote Propolis processes just to submit their samples to the collector in the primary runner.
WDYT?
i was figuring in a case like that we'd still use Local to send localhost:{oximeter_port} instead of something fancier, but overall that would boil it down to "wants oximeter: yes/no". i do think the plumbing to express that here today is a bit more awkward.. we don't really have a clean "set up dependent services" -> "start propolis-server" split in PHD and that does get to be more reorganizing than i want to do here 😅
this comes with a wider replumbing of log configuration, since it seems prudent to use the same log configuration for any supporting tasks (be they propolis-server itself, Crucible downstairs we spawn, or a fake Oximeter that happens to be in-process). one upside here is that conceivably an in-process task like fake Oximeter servers could be configured to write to a separate stdout/stderr file pair, if we could come up with useful unique filenames for them..
unfortunately even with the TSC enlightenment passed along, guests can still cause nontrivial instruction emulation time even while "idle" by relying on the APIC for tasks of some sort or another. I don't really know *what* Windows does here, but on Linux guests we can still make strong assertions about expected emulation time.