Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish instance vCPU stats to oximeter #659

Merged
merged 7 commits into from
Mar 14, 2024
Merged

Publish instance vCPU stats to oximeter #659

merged 7 commits into from
Mar 14, 2024

Conversation

bnaecker
Copy link
Contributor

@bnaecker bnaecker commented Mar 7, 2024

  • Add more metadata to instance and HTTP API, for tracking silo / project IDs for an instance.
  • Add the VirtualMachine kstat target for tracking individual vCPU usage by microstate.
  • Use VirtualMachine in place of the bare instance UUID for tracking other statistics too.
  • Use KstatSampler to pull vCPU kernel statistics to track vCPU usage
  • A few minor tweaks to existing metrics
  • Update OpenAPI documents

@bnaecker bnaecker requested a review from pfmooney March 7, 2024 23:57
@bnaecker
Copy link
Contributor Author

bnaecker commented Mar 8, 2024

This is the companion to oxidecomputer/omicron#4900, in which Propolis reads kstats for the vCPU usage data of its managed instance. These are published to oximeter along with the other statistics. This also modifies the pre-existing instance-reset counter statistic, subsuming it under the new VirtualMachine target for identifying an instance by silo, project, and instance ID.

- Add more metadata to instance and HTTP API, for tracking silo /
  project IDs for an instance.
- Add the `VirtualMachine` kstat target for tracking individual vCPU
  usage by microstate.
- Use `VirtualMachine` in place of the bare instance UUID for tracking
  other statistics too.
- Use `KstatSampler` to pull vCPU kernel statistics to track vCPU usage
- A few minor tweaks to existing metrics
- Update OpenAPI documents
@bnaecker
Copy link
Contributor Author

bnaecker commented Mar 8, 2024

CI is failing because the oximeter_instruments::kstat module is only available on illumos, while we're running a number of tests on Linux. I'm going to experiment with some cfg directives to work around it, hopefully it's not too nasty.

@bnaecker
Copy link
Contributor Author

bnaecker commented Mar 8, 2024

Ok, this last test failure is real. One thing the PHD tests verify is that we can migrate between the PR commit and its merge base. That is failing, because the PR branch changes the instance-ensure API, by adding the project / silo ID for tracking instance vCPU metrics. I'm not sure how this is supposed to work when one changes the API. @gjcolombo or perhaps @hawkw, are there any existing plans for handling breaking API changes like this?

I can imagine this case is handled pretty easily, by making this new data optional. If we're migrating an instance from before the change, then those fields can't exist and we will simply forgo producing vCPU data in any case. Are there additional test-suite changes one would like to see in such cases?

@gjcolombo
Copy link
Contributor

I'm not sure how this is supposed to work when one changes the API.

All this is a unforeseen but mostly positive consequence of adding the migrate-from-base tests: we've begun to test Propolis server API versioning! The specific problem here is:

  • InstanceProperties now contains an InstanceMetadata
  • The Propolis instance GET endpoint returns an Instance that contains an InstanceProperties
  • PHD always uses the Propolis client API defined by the OpenAPI spec that was in the commit used to build the framework
  • So, when the framework (with the new Instance definition) asks to get properties from a downlevel Propolis that doesn't have the new definition, deserialization fails on the client (because the downlevel server didn't supply all the fields it needs)

This would be a Real Problem in production if there were ever heterogeneous Propolis client and server versions, e.g. if we were doing a live upgrade and had an upgraded Nexus trying to query state directly from a not-yet-upgraded Propolis. However, I don't think we currently have that situation, for two reasons:

  • Because all our upgrades are still offline (for now), there's no way to get this kind of heterogeneity: all Propolises get upgraded all at once
  • Even if that weren't true, I believe it's true that the only user of this particular API is a Propolis's overseeing sled agent, and sled agent and Propolis are supposed to be shipped in a single logical package (i.e. it shouldn't currently be possible to upgrade a sled's sled agent without installing software that will also immediately and atomically upgrade its Propolis image)

I think there are a couple of ways to make this sort of change in general:

  • The straightforward way, which you've suggested, is to make the new field (in this case the metadata field in InstanceProperties an Option and tag it with the #[serde(default)] field attribute. This will get everything to deserialize properly, at the cost of making the server have to deal with cases where this field hasn't been specified, which may never occur in practice (see above).
    • There's probably a separate discussion to be had here about whether all these API types should also be labeled deny_unknown_fields so that a client gets an error if it specifies a field a server doesn't understand. (I tend to prefer this behavior to "the server silently drops the fields"--I specified them because they were important, dang it!)
  • The buttoned-down way is to create a new InstancePropertiesV2 struct and then create v2/foo endpoints in the server that take the new type. Then PHD can try to use v2 endpoints first and fall back to v1 endpoints if v2 isn't available. Ideally this would require a minimum of boilerplate; RFD 421 talks about how to achieve that ideal but I don't think we've yet built any of the mechanisms it talks about.

For this specific PR, I would be OK with the first approach, but I'd also be OK with declaring that because we're pre-live-update, breaking API changes like this are technically permissible even though they break these updateability tests. We basically did this in #639--breaking change, would have been Very Illegal if we were actually migrating anything, but because we understood what was happening we were willing to live with a reddish CI run on the PR.

@bnaecker
Copy link
Contributor Author

bnaecker commented Mar 8, 2024

I talked with @gjcolombo offline after reading through his comments. I agree that it's not possible to hit this actual issue as long as we continue to stop instances during a Propolis upgrade. I'm going to confirm that's still the plan for the next release. As he pointed out, intra-version migration is still working fine, so there is no "real" failure here.

Assuming @pfmooney is also OK with it, and that the plan is still to stop instances during an upgrade, we feel it's most expedient to ignore that specific CI failure for the purposes of this PR.

@bnaecker
Copy link
Contributor Author

bnaecker commented Mar 8, 2024

I've confirmed that we're still planning to take instances offline during the next release cycle. This version incompatibility should not show up in that case.

@pfmooney
Copy link
Collaborator

pfmooney commented Mar 8, 2024

Assuming @pfmooney is also OK with it, and that the plan is still to stop instances during an upgrade, we feel it's most expedient to ignore that specific CI failure for the purposes of this PR.

Yeah, that's fine by me.

- Merge various use statements
- Docstring improvements
- Ref issue tracking PVPANIC statistic organization
- Add oximeter state to `ServiceProviders`, storing registration task
  handle. Abort the registration task in `stop()` method, like the
  others.
@bnaecker
Copy link
Contributor Author

bnaecker commented Mar 9, 2024

Thanks for the review @pfmooney. Let me know if you have new comments.

@bnaecker bnaecker requested a review from pfmooney March 9, 2024 17:45
- Move lock around oximeter state up one layer
- Add mechanism for the `ServiceProviders::stop()` method to notify the
  metric registration task to bail if needed.
@bnaecker bnaecker requested a review from pfmooney March 11, 2024 20:30
Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple of little questions/comments.

- Added test to catch changes in kstate->oximeter vCPU microstate
  mapping
- Make oximeter state names consts
- Define the number of expected states explicitly as the len of an array
  of the state names.
@bnaecker
Copy link
Contributor Author

Thanks @pfmooney, @gjcolombo, and @hawkw for all the input on this one. Let me know if y'all have any more thoughts.

@bnaecker bnaecker requested a review from hawkw March 14, 2024 19:10
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, thanks for addressing my suggestions!

@bnaecker bnaecker merged commit 3e296a6 into master Mar 14, 2024
9 of 10 checks passed
@bnaecker bnaecker deleted the vcpu-usage-stats branch March 14, 2024 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants