Rework VMM reservoir sizing to scale better with memory configurations #7837
Conversation
The core observation of this change is that some uses of memory are relatively fixed regardless of a sled's hardware configuration. By subtracting these more constrained uses of memory before calculating a VMM reservoir size, the remaining memory will be used mostly for services that scale either with the amount of physical memory or the amount of storage installed.

The new `control_plane_memory_earmark_mb` setting for sled-agent describes the sum of this fixed allocation, and existing sled-agent config.toml files are updated so that actual VMM reservoir sizes for Gimlets with 1 TiB of installed memory are about the same:

Before: `1012 * 0.8 => 809.6 GiB` of VMM reservoir
After: `(1012 - 30 - 44) * 0.863 => 809.494 GiB` of VMM reservoir

A Gimlet with 2 TiB of DRAM sees a larger VMM reservoir:

Before: `2048 * 0.8 => 1638.4 GiB` of VMM reservoir
After: `(2048 - 60 - 44) * 0.863 => 1677.672 GiB` of VMM reservoir

A Gimlet with less than 1 TiB of DRAM would see a smaller VMM reservoir, but this is in some sense correct: we would otherwise "overprovision" the VMM reservoir and eat into what is currently effectively a slush fund of memory for Oxide services supporting the rack's operation, risking overall system stability, judging from observation and testing on systems with 1 TiB Gimlets.

A useful additional step in the direction of "config that is workable across SKUs" would be to measure Crucible overhead in the context of number of disks or total installed storage. Then we could calculate the VMM reservoir after subtracting the maximum memory expected to be used by Crucible if all storage were allocated, and have a presumably-higher VMM reservoir percentage for the yet-smaller slice of system memory that is not otherwise accounted for.

Fixes #7448.
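To make the sizing arithmetic easy to check, here is a small standalone sketch (my own illustration, not sled-agent code; the `reservoir_gib` helper and its parameters are hypothetical) that reproduces the before/after numbers above:

```rust
/// Hypothetical helper for illustration:
/// (installed - page_t overhead - earmark) * percentage.
fn reservoir_gib(installed_gib: f64, page_overhead_gib: f64, earmark_gib: f64, pct: f64) -> f64 {
    (installed_gib - page_overhead_gib - earmark_gib) * (pct / 100.0)
}

fn main() {
    // Before: a flat 80% of the 1012 GiB figure used in the description.
    assert!((1012.0 * 0.8 - 809.6_f64).abs() < 0.01);
    // After: subtract ~30 GiB of page_t overhead and the 44 GiB earmark, then take 86.3%.
    assert!((reservoir_gib(1012.0, 30.0, 44.0, 86.3) - 809.494).abs() < 0.01);
    // A 2 TiB Gimlet has ~60 GiB of page_t overhead but the same fixed earmark.
    assert!((reservoir_gib(2048.0, 60.0, 44.0, 86.3) - 1677.672).abs() < 0.01);
}
```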
```toml
vmm_reservoir_percentage = 86.3

# The amount of memory held back for services which exist between zero and one
# on this Gimlet. This currently includes some additional terms reflecting
# OS memory use under load.
#
# As of writing, this is the sum of the following items from RFD 413:
# * Network buffer slush: 18 GiB
# * Other kernel heap: 20 GiB
# * ZFS ARC minimum: 5 GiB
# * Sled agent: 0.5 GiB
# * Maghemite: 0.25 GiB
# * NTP: 0.25 GiB
control_plane_memory_earmark_mb = 45056
```
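As a quick consistency check (my own arithmetic, not project code), the line items above sum to exactly the configured earmark:

```rust
fn main() {
    // RFD 413 items from the comment above, in GiB.
    let items_gib = [18.0, 20.0, 5.0, 0.5, 0.25, 0.25];
    let total_gib: f64 = items_gib.iter().sum();
    // 44 GiB expressed in MiB matches control_plane_memory_earmark_mb.
    assert_eq!((total_gib * 1024.0) as u64, 45056);
}
```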
one obvious way this is not quite right: ClickHouse, Cockroach, DNS, Oximeter are all missing here, so this misses the premise of "budget enough memory that if we have to move a control plane service here, we don't have to evict a VM to do it". so are `dendrite` and `wicket`. i think the "earmark" amount should be closer to 76 GiB given earlier measurements, and the VMM reservoir percentage updated to around 89%.
from talking with @faithanalog earlier, it looks like Crucible's kb-per-extent as i see in https://github.com/oxidecomputer/crucible/runs/39057809960 (~91KiB/extent) is a lower bound, whereas she sees as much as 225KiB/extent. that's around 58 GiB of variance all-told.
so, trying to avoid swapping with everything running on a sled here would have us wanting as much as 139 GiB set aside for control plane (95 GiB of Crucibles, 20 GiB of other kernel heap, 18 GiB for expected NIC buffers, the ARC minimum size and then one-per-sled services), with another up-to-40 GiB of services that are only sometimes present like databases, DNS, etc. that in turn would have us sizing the VMM reservoir at around 95% of what's left to keep the actual reservoir size the same, which should be fine as long as no one is making hundreds of 512 MiB instances...
my inclination at this point is that we could really dial things in as they are today, but we'd end up more brittle if anything changes in the future. we'd be better off connecting the "expected fixed use" term to what the control plane knows a sled should be running.
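Back-of-the-envelope for the figures in this comment (my own restatement, combining the numbers above with the line items from the config excerpt):

```rust
fn main() {
    // 95 GiB Crucible worst case + 20 GiB other kernel heap + 18 GiB NIC buffers
    // + 5 GiB ARC minimum + ~1 GiB of one-per-sled services (sled-agent, Maghemite, NTP).
    let worst_case_gib: u64 = 95 + 20 + 18 + 5 + 1;
    assert_eq!(worst_case_gib, 139);
    // On a 1 TiB sled, ~95% of what remains after this worst case and the ~30 GiB of
    // page_t overhead stays in the neighborhood of today's ~809 GiB reservoir.
    let reservoir_gib = (1012.0 - 30.0 - worst_case_gib as f64) * 0.95;
    println!("{reservoir_gib:.0} GiB"); // ≈ 801 GiB
}
```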
the number of physical pages won't change at runtime really, nor will the size of pages, but it seems a bit nicer this way..
force-pushed from 077ea86 to 8039a08
```rust
// Don't like hardcoding a struct size from the host OS here like
// this, maybe we shuffle some bits around before merging.. On the
// other hand, the last time page_t changed was illumos-gate commit
// a5652762e5 from 2006.
const PAGE_T_SIZE: u64 = 120;
let max_page_t_space =
    self.hardware_manager.usable_physical_pages() * PAGE_T_SIZE;
```
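For context, here is a minimal sketch of how this term plugs into the sizing described in this PR (a reconstruction, not the actual sled-agent code; only `usable_physical_pages()` and `PAGE_T_SIZE` come from the diff, the function and parameter names are made up):

```rust
/// Reconstruction for illustration only; the real logic lives in sled-agent.
const PAGE_T_SIZE: u64 = 120; // bytes per illumos page_t, per the comment above

fn vmm_reservoir_size_bytes(
    usable_ram_bytes: u64,      // total physical memory usable by the OS
    usable_physical_pages: u64, // as returned by hardware_manager.usable_physical_pages()
    earmark_mb: u64,            // control_plane_memory_earmark_mb from config
    reservoir_percentage: f64,  // vmm_reservoir_percentage from config, e.g. 86.3
) -> u64 {
    // Memory the OS spends tracking physical pages (one page_t per page).
    let max_page_t_space = usable_physical_pages * PAGE_T_SIZE;
    // Fixed control-plane/OS earmark from the config file.
    let earmark_bytes = earmark_mb * 1024 * 1024;
    // RAM eligible for the reservoir = total - page_t overhead - earmark.
    let eligible = usable_ram_bytes
        .saturating_sub(max_page_t_space)
        .saturating_sub(earmark_bytes);
    (eligible as f64 * (reservoir_percentage / 100.0)) as u64
}
```

Assuming 4 KiB base pages, a 1 TiB sled has 268,435,456 pages, so the page_t term works out to roughly 30 GiB, which matches the `- 30` in the description's arithmetic.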
would love ideas here if anyone's got 'em..
Totally misinterpreted this value at first glance - is this supposed to be:
RAM eligible for VMM reservoir = Total RAM - OS usage - Other control plane usage?
Is this value of "usable physical pages * PAGE_T_SIZE" supposed to be "all other OS usage?"
If no - I'm misinterpreting this calculation!
If yes - that seems like it's likely an underestimate, right? Like, agreed that the host OS has metadata which scales with the number of physical pages, but it also has a lot of other metadata too, presumably?
Your first read is the one that is implemented here, yeah: RAM eligible for VMM reservoir = Total RAM - OS usage - Other control plane usage
The difference here is that the "control plane" earmark is set high enough to include other OS allocations (I elaborated more on this in this part of the 413 refresh). Even then this is an underestimate, because the 44 GiB-for-control-plane figure doesn't account for some sleds having `dendrite`, `wicket`, `clickhouse`, etc.
edit: a worst-case figure here - all services have an instance on a scrimlet, Crucibles for all the storage, everything under load - would be closer to 180 GiB of non-VMM stuff. I'm just not inclined to set the fixed term that high because it implies setting the VMM reservoir percentage a lot higher. In one sense that's fine, we can use these terms to get to the same actual reservoir-vs-non-reservoir numbers. But honestly I'm just kind of uncomfortable setting the reservoir percentage so high while we don't have super great resource monitoring for various services
pairs well with this refresh of RFD 413 where i worked through the math for this approach.