Rework VMM reservoir sizing to scale better with memory configurations #7837
Conversation
The core observation of this change is that some uses of memory are relatively fixed regardless of a sled's hardware configuration. By subtracting these more constrained uses of memory before calculating a VMM reservoir size, the remaining memory will be used mostly for services that scale either with the amount of physical memory or the amount of storage installed.

The new `control_plane_memory_earmark_mb` setting for sled-agent describes the sum of this fixed allocation, and existing sled-agent config.toml files are updated so that actual VMM reservoir sizes for Gimlets with 1 TiB of installed memory are about the same:

Before: `1012 * 0.8 => 809.6 GiB` of VMM reservoir
After: `(1012 - 30 - 44) * 0.863 => 809.494 GiB` of VMM reservoir

A Gimlet with 2 TiB of DRAM sees a larger VMM reservoir:

Before: `2048 * 0.8 => 1638.4 GiB` of VMM reservoir
After: `(2048 - 60 - 44) * 0.863 => 1677.672 GiB` of VMM reservoir

A Gimlet with less than 1 TiB of DRAM would see a smaller VMM reservoir, but this is in some sense correct: we would otherwise "overprovision" the VMM reservoir and eat into what is currently effectively a slush fund of memory for Oxide services supporting the rack's operation, risking overall system stability, judging from observation and testing on systems with 1 TiB Gimlets.

A useful additional step in the direction of "config that is workable across SKUs" would be to measure Crucible overhead in the context of number of disks or total installed storage. Then we could calculate the VMM reservoir after subtracting the maximum memory expected to be used by Crucible if all storage were allocated, and have a presumably-higher VMM reservoir percentage for the yet-smaller slice of system memory that is not otherwise accounted for.

Fixes #7448.
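To make the sizing arithmetic easy to check, here is a small standalone sketch (my own illustration, not sled-agent code; the `reservoir_gib` helper and its parameters are hypothetical) that reproduces the before/after numbers above:

```rust
/// Hypothetical helper for illustration:
/// (installed - page_t overhead - earmark) * percentage.
fn reservoir_gib(installed_gib: f64, page_overhead_gib: f64, earmark_gib: f64, pct: f64) -> f64 {
    (installed_gib - page_overhead_gib - earmark_gib) * (pct / 100.0)
}

fn main() {
    // Before: a flat 80% of the 1012 GiB figure used in the description.
    assert!((1012.0 * 0.8 - 809.6_f64).abs() < 0.01);
    // After: subtract ~30 GiB of page_t overhead and the 44 GiB earmark, then take 86.3%.
    assert!((reservoir_gib(1012.0, 30.0, 44.0, 86.3) - 809.494).abs() < 0.01);
    // A 2 TiB Gimlet has ~60 GiB of page_t overhead but the same fixed earmark.
    assert!((reservoir_gib(2048.0, 60.0, 44.0, 86.3) - 1677.672).abs() < 0.01);
}
```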
```toml
vmm_reservoir_percentage = 86.3

# The amount of memory held back for services which exist between zero and one
# on this Gimlet. This currently includes some additional terms reflecting
# OS memory use under load.
#
# As of writing, this is the sum of the following items from RFD 413:
# * Network buffer slush: 18 GiB
# * Other kernel heap: 20 GiB
# * ZFS ARC minimum: 5 GiB
# * Sled agent: 0.5 GiB
# * Maghemite: 0.25 GiB
# * NTP: 0.25 GiB
control_plane_memory_earmark_mb = 45056
```
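As a quick consistency check (my own arithmetic, not project code), the line items above sum to exactly the configured earmark:

```rust
fn main() {
    // RFD 413 items from the comment above, in GiB.
    let items_gib = [18.0, 20.0, 5.0, 0.5, 0.25, 0.25];
    let total_gib: f64 = items_gib.iter().sum();
    // 44 GiB expressed in MiB matches control_plane_memory_earmark_mb.
    assert_eq!((total_gib * 1024.0) as u64, 45056);
}
```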
one obvious way this is not quite right: ClickHouse, Cockroach, DNS, Oximeter are all missing here, so this misses the premise of "budget enough memory that if we have to move a control plane service here, we don't have to evict a VM to do it". so are `dendrite` and `wicket`. i think the "earmark" amount should be closer to 76 GiB given earlier measurements, and the VMM reservoir percentage updated to around 89%.
from talking with @faithanalog earlier, it looks like Crucible's kb-per-extent as i see in https://github.com/oxidecomputer/crucible/runs/39057809960 (~91KiB/extent) is a lower bound, whereas she sees as much as 225KiB/extent. that's around 58 GiB of variance all-told.
so, trying to avoid swapping with everything running on a sled here would have us wanting as much as 139 GiB set aside for control plane (95 GiB of Crucibles, 20 GiB of other kernel heap, 18 GiB for expected NIC buffers, the ARC minimum size and then one-per-sled services), with another up-to-40 GiB of services that are only sometimes present like databases, DNS, etc. that in turn would have us sizing the VMM reservoir at around 95% of what's left to keep the actual reservoir size the same, which should be fine as long as no one is making hundreds of 512 MiB instances...
my inclination at this point is that we could really dial things in as they are today, but we'd end up more brittle if anything changes in the future. we'd be better off connecting the "expected fixed use" term to what the control plane knows a sled should be running.
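Back-of-the-envelope for the figures in this comment (my own restatement, combining the numbers above with the line items from the config excerpt):

```rust
fn main() {
    // 95 GiB Crucible worst case + 20 GiB other kernel heap + 18 GiB NIC buffers
    // + 5 GiB ARC minimum + ~1 GiB of one-per-sled services (sled-agent, Maghemite, NTP).
    let worst_case_gib: u64 = 95 + 20 + 18 + 5 + 1;
    assert_eq!(worst_case_gib, 139);
    // On a 1 TiB sled, ~95% of what remains after this worst case and the ~30 GiB of
    // page_t overhead stays in the neighborhood of today's ~809 GiB reservoir.
    let reservoir_gib = (1012.0 - 30.0 - worst_case_gib as f64) * 0.95;
    println!("{reservoir_gib:.0} GiB"); // ≈ 801 GiB
}
```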
the number of physical pages won't change at runtime really, nor will the size of pages, but it seems a bit nicer this way..
force-pushed from 077ea86 to 8039a08
```rust
// Don't like hardcoding a struct size from the host OS here like
// this, maybe we shuffle some bits around before merging.. On the
// other hand, the last time page_t changed was illumos-gate commit
// a5652762e5 from 2006.
const PAGE_T_SIZE: u64 = 120;
let max_page_t_space =
    self.hardware_manager.usable_physical_pages() * PAGE_T_SIZE;
```
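For context, here is a minimal sketch of how this term plugs into the sizing described in this PR (a reconstruction, not the actual sled-agent code; only `usable_physical_pages()` and `PAGE_T_SIZE` come from the diff, the function and parameter names are made up):

```rust
/// Reconstruction for illustration only; the real logic lives in sled-agent.
const PAGE_T_SIZE: u64 = 120; // bytes per illumos page_t, per the comment above

fn vmm_reservoir_size_bytes(
    usable_ram_bytes: u64,      // total physical memory usable by the OS
    usable_physical_pages: u64, // as returned by hardware_manager.usable_physical_pages()
    earmark_mb: u64,            // control_plane_memory_earmark_mb from config
    reservoir_percentage: f64,  // vmm_reservoir_percentage from config, e.g. 86.3
) -> u64 {
    // Memory the OS spends tracking physical pages (one page_t per page).
    let max_page_t_space = usable_physical_pages * PAGE_T_SIZE;
    // Fixed control-plane/OS earmark from the config file.
    let earmark_bytes = earmark_mb * 1024 * 1024;
    // RAM eligible for the reservoir = total - page_t overhead - earmark.
    let eligible = usable_ram_bytes
        .saturating_sub(max_page_t_space)
        .saturating_sub(earmark_bytes);
    (eligible as f64 * (reservoir_percentage / 100.0)) as u64
}
```

Assuming 4 KiB base pages, a 1 TiB sled has 268,435,456 pages, so the page_t term works out to roughly 30 GiB, which matches the `- 30` in the description's arithmetic.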
would love ideas here if anyone's got 'em..
Totally misinterpreted this value at first glance - is this supposed to be:
RAM eligible for VMM reservoir = Total RAM - OS usage - Other control plane usage?
Is this value of "usable physical pages * PAGE_T_SIZE" supposed to be "all other OS usage?"
If no - I'm misinterpreting this calculation!
If yes - that seems like it's likely an underestimate, right? Like, agreed that the host OS has metadata which scales with the number of physical pages, but it also has a lot of other metadata too, presumably?
Your first read is the one that is implemented here, yeah: RAM eligible for VMM reservoir = Total RAM - OS usage - Other control plane usage
The difference here is that the "control plane" earmark is set high enough to include other OS allocations (I elaborated more on this in this part of the 413 refresh). Even then this is an underestimate, because the 44 GiB-for-control-plane figure doesn't account for some sleds having `dendrite`, `wicket`, `clickhouse`, etc.
edit: a worst-case figure here - all services have an instance on a scrimlet, Crucibles for all the storage, everything under load - would be closer to 180 GiB of non-VMM stuff. I'm just not inclined to set the fixed term that high because it implies setting the VMM reservoir percentage a lot higher. In one sense that's fine, we can use these terms to get to the same actual reservoir-vs-non-reservoir numbers. But honestly I'm just kind of uncomfortable setting the reservoir percentage so high while we don't have super great resource monitoring for various services
pairs well with this refresh of RFD 413 where i worked through the math for this approach.