Rework VMM reservoir sizing to scale better with memory configurations #7837
@@ -20,7 +20,19 @@ skip_timesync = true
 # Percentage of usable physical DRAM to use for the VMM reservoir, which
 # guest memory is pulled from.
-vmm_reservoir_percentage = 80
+vmm_reservoir_percentage = 86.3
+# The amount of memory held back for services which exist between zero and one
+# on this Gimlet. This currently includes some additional terms reflecting
+# OS memory use under load.
+#
+# As of writing, this is the sum of the following items from RFD 413:
+# * Network buffer slush: 18 GiB
+# * Other kernel heap: 20 GiB
+# * ZFS ARC minimum: 5 GiB
+# * Sled agent: 0.5 GiB
+# * Maghemite: 0.25 GiB
+# * NTP: 0.25 GiB
+control_plane_memory_earmark_mb = 45056
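For reference, the earmark value is the sum of the items listed in the comment above; interpreting the `_mb` suffix as mebibytes is an inference from the arithmetic:

$$18 + 20 + 5 + 0.5 + 0.25 + 0.25 = 44\ \mathrm{GiB}, \qquad 44 \times 1024 = 45056.$$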
Comment on lines +23 to +35

one obvious way this is not quite right: ClickHouse, Cockroach, DNS, Oximeter are all missing here, so this misses the premise of "budget enough memory that if we have to move a control plane service here, we don't have to evict a VM to do it". so are

from talking with @faithanalog earlier, it looks like Crucible's KiB-per-extent as I see in https://github.com/oxidecomputer/crucible/runs/39057809960 (~91 KiB/extent) is a lower bound, whereas she sees as much as 225 KiB/extent. that's around 58 GiB of variance all told. so, trying to avoid swapping with everything running on a sled here would have us wanting as much as 139 GiB set aside for the control plane (95 GiB of Crucibles, 20 GiB of other kernel heap, 18 GiB for expected NIC buffers, the ARC minimum size, and then one-per-sled services), with another up-to-40 GiB of services that are only sometimes present, like databases, DNS, etc. that in turn would have us sizing the VMM reservoir at around 95% of what's left to keep the actual reservoir size the same, which should be fine as long as no one is making hundreds of 512 MiB instances...

my inclination at this point is that we could really dial things in as they are today, but we'd end up more brittle if anything changes in the future. we'd be better off connecting the "expected fixed use" term to what the control plane knows a sled should be running.
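As a back-of-envelope check of the worst-case figure above, taking the 5 GiB ARC minimum and the roughly 1 GiB of one-per-sled services from the earmark breakdown in the diff:

$$95 + 20 + 18 + 5 + (0.5 + 0.25 + 0.25) = 139\ \mathrm{GiB}, \qquad 139 + 40 \approx 180\ \mathrm{GiB}.$$

Assuming a 1 TiB Gimlet, that worst case leaves roughly 885 GiB, so preserving the old 80%-of-total reservoir (~819 GiB) would indeed push the percentage into the low-to-mid 90s before any per-page metadata is accounted for.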
 # Swap device size for the system. The device is a sparsely allocated zvol on
 # the internal zpool of the M.2 that we booted from.
would love ideas here if anyone's got 'em..
Totally misinterpreted this value at first glance - is this supposed to be:
RAM eligible for VMM reservoir = Total RAM - OS usage - Other control plane usage?
Is this value of "usable physical pages * PAGE_T_SIZE" supposed to be "all other OS usage"?
If no - I'm misinterpreting this calculation!
If yes - that seems like it's likely an underestimate, right? Like, agreed that the host OS has metadata which scales with the number of physical pages, but it also has a lot of other metadata too, presumably?
Your first read is the one that is implemented here, yeah:
RAM eligible for VMM reservoir = Total RAM - OS usage - Other control plane usage
The difference here is that the "control plane" earmark is set high enough to include other OS allocations (I elaborated more on this in this part of the 413 refresh). Even then this is an underestimate, because the 44 GiB-for-control-plane figure doesn't account for some sleds having dendrite, wicket, clickhouse, etc.

edit: a worst-case figure here - all services have an instance on a scrimlet, Crucibles for all the storage, everything under load - would be closer to 180 GiB of non-VMM stuff. I'm just not inclined to set the fixed term that high because it implies setting the VMM reservoir percentage a lot higher. In one sense that's fine: we can use these terms to get to the same actual reservoir-vs-non-reservoir numbers. But honestly I'm just kind of uncomfortable setting the reservoir percentage so high while we don't have super great resource monitoring for various services.
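To make the sizing rule above concrete, here is a minimal sketch of the arithmetic as described in this thread. It is illustrative only: the function and parameter names are made up for this example (not the actual sled-agent code), and the page_t overhead is taken as an input rather than computed.

```rust
/// Illustrative sketch of the sizing rule discussed above:
///   eligible  = total RAM - per-page OS metadata - control plane earmark
///   reservoir = eligible * (vmm_reservoir_percentage / 100)
fn vmm_reservoir_size_mib(
    total_ram_mib: u64,
    page_t_overhead_mib: u64,       // "usable physical pages * PAGE_T_SIZE"
    control_plane_earmark_mib: u64, // e.g. 45056 (44 GiB) from the config above
    reservoir_percentage: f64,      // e.g. 86.3 from the config above
) -> u64 {
    let eligible_mib = total_ram_mib
        .saturating_sub(page_t_overhead_mib)
        .saturating_sub(control_plane_earmark_mib);
    (eligible_mib as f64 * (reservoir_percentage / 100.0)) as u64
}
```

Ignoring the page_t term, a hypothetical 1 TiB sled gives 86.3% of (1048576 - 45056) MiB, about 846 GiB, versus about 819 GiB under the old flat 80%. Once per-page metadata (tens of GiB at 1 TiB of DRAM, assuming on the order of 100+ bytes of page_t per 4 KiB page) is subtracted, the two land in roughly the same place, which appears to be the intent behind the 86.3% figure.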