-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Publish instance vCPU stats to oximeter #659
Conversation
This is the companion to oxidecomputer/omicron#4900, in which Propolis reads |
- Add more metadata to instance and HTTP API, for tracking silo / project IDs for an instance. - Add the `VirtualMachine` kstat target for tracking individual vCPU usage by microstate. - Use `VirtualMachine` in place of the bare instance UUID for tracking other statistics too. - Use `KstatSampler` to pull vCPU kernel statistics to track vCPU usage - A few minor tweaks to existing metrics - Update OpenAPI documents
3a9c65d
to
defad6e
Compare
CI is failing because the |
625d7d2
to
761e835
Compare
761e835
to
de0896a
Compare
Ok, this last test failure is real. One thing the PHD tests verify is that we can migrate between the PR commit and its merge base. That is failing, because the PR branch changes the instance-ensure API, by adding the project / silo ID for tracking instance vCPU metrics. I'm not sure how this is supposed to work when one changes the API. @gjcolombo or perhaps @hawkw, are there any existing plans for handling breaking API changes like this? I can imagine this case is handled pretty easily, by making this new data optional. If we're migrating an instance from before the change, then those fields can't exist and we will simply forgo producing vCPU data in any case. Are there additional test-suite changes one would like to see in such cases? |
All this is a unforeseen but mostly positive consequence of adding the migrate-from-base tests: we've begun to test Propolis server API versioning! The specific problem here is:
This would be a Real Problem in production if there were ever heterogeneous Propolis client and server versions, e.g. if we were doing a live upgrade and had an upgraded Nexus trying to query state directly from a not-yet-upgraded Propolis. However, I don't think we currently have that situation, for two reasons:
I think there are a couple of ways to make this sort of change in general:
For this specific PR, I would be OK with the first approach, but I'd also be OK with declaring that because we're pre-live-update, breaking API changes like this are technically permissible even though they break these updateability tests. We basically did this in #639--breaking change, would have been Very Illegal if we were actually migrating anything, but because we understood what was happening we were willing to live with a reddish CI run on the PR. |
I talked with @gjcolombo offline after reading through his comments. I agree that it's not possible to hit this actual issue as long as we continue to stop instances during a Propolis upgrade. I'm going to confirm that's still the plan for the next release. As he pointed out, intra-version migration is still working fine, so there is no "real" failure here. Assuming @pfmooney is also OK with it, and that the plan is still to stop instances during an upgrade, we feel it's most expedient to ignore that specific CI failure for the purposes of this PR. |
I've confirmed that we're still planning to take instances offline during the next release cycle. This version incompatibility should not show up in that case. |
Yeah, that's fine by me. |
- Merge various use statements - Docstring improvements - Ref issue tracking PVPANIC statistic organization - Add oximeter state to `ServiceProviders`, storing registration task handle. Abort the registration task in `stop()` method, like the others.
Thanks for the review @pfmooney. Let me know if you have new comments. |
- Move lock around oximeter state up one layer - Add mechanism for the `ServiceProviders::stop()` method to notify the metric registration task to bail if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a couple of little questions/comments.
- Added test to catch changes in kstate->oximeter vCPU microstate mapping - Make oximeter state names consts - Define the number of expected states explicitly as the len of an array of the state names.
Thanks @pfmooney, @gjcolombo, and @hawkw for all the input on this one. Let me know if y'all have any more thoughts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, thanks for addressing my suggestions!
VirtualMachine
kstat target for tracking individual vCPU usage by microstate.VirtualMachine
in place of the bare instance UUID for tracking other statistics too.KstatSampler
to pull vCPU kernel statistics to track vCPU usage