Revise reservoir calculations for sleds with larger DRAM #7448

Open
askfongjojo opened this issue Jan 30, 2025 · 4 comments · May be fixed by #7837

askfongjojo commented Jan 30, 2025

Currently the reservoir size for a gimlet with 2 TiB of DRAM is twice that of a 1-TiB gimlet, i.e. 1748894744576 bytes (~1.63 TiB, as seen in our dublin lab environment). This may be too conservative and may be considered a waste of resources by customers. A new calculation method (e.g., fixed-plus-variable) may be required to address the 2-TiB DRAM configuration and future new hardware.

Besides the large-DRAM gimlet use case, it is also possible that we'll revise the calculations after further profiling of control plane usage. It would be desirable for the reservoir settings to be dynamic enough to accommodate such changes.
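
For illustration only, here is a minimal sketch (not sled-agent code; names are hypothetical) of how the current flat-percentage sizing scales linearly with installed DRAM, using the 80% figure referenced later in this thread:

```rust
// Sketch of the current behavior: the VMM reservoir is a flat percentage of
// physical DRAM, so a 2 TiB sled gets exactly twice the reservoir of a 1 TiB sled.
const GIB: u64 = 1 << 30;

fn flat_reservoir_bytes(dram_gib: u64, reservoir_fraction: f64) -> u64 {
    ((dram_gib * GIB) as f64 * reservoir_fraction) as u64
}

fn main() {
    // 1 TiB sled at 80%: ~819 GiB; 2 TiB sled at 80%: ~1638 GiB.
    println!("{}", flat_reservoir_bytes(1024, 0.80));
    println!("{}", flat_reservoir_bytes(2048, 0.80));
}
```

A fixed-plus-variable scheme would instead subtract the roughly constant control plane costs first and apply the percentage only to what remains.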

@askfongjojo (Author)

Initial response from @gjcolombo:

It's probably needlessly conservative. If I've done my algebra correctly, we could go to about an 88.5% reservoir on this SKU while maintaining the same amount of physical memory that's not in the reservoir or specifically earmarked for OS usage (i.e. the page_t database). (I came up with this percentage by solving (0.2 * 1024) - 30 = ((1 - r) * 2048) - 60 for r.)
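
(Working through that algebra, with both sides measuring the GiB of memory left outside the reservoir and the page_t database:

$$
(0.2 \times 1024) - 30 = \bigl((1 - r) \times 2048\bigr) - 60
\;\Rightarrow\;
1 - r = \frac{174.8 + 60}{2048} \approx 0.1147
\;\Rightarrow\;
r \approx 0.885
$$

i.e. roughly an 88.5% reservoir.)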

I expect most control plane services' charged memory usage not to depend much on the amount of physical memory in the machine, so unless we're changing other bits of configuration, most of the figures in https://rfd.shared.oxide.computer/rfd/0413#possible_budgets will apply to a 2 TiB Gimlet. The exceptions are:

  • the page_t database will be 60 GiB instead of 30 (accounted for above)
  • I'm unsure whether the APCB issue that prevented all physical memory from being addressed on the 1 TiB Gimlet has been resolved (third row in the table)
  • the Crucible numbers in the table are badly outdated and need to be recaptured (and ideally more rigorously)

There's also the general caveat that we haven't implemented any of RFD 413 yet, so we can't be certain the other figures in the table are especially accurate. But if they are accurate I wouldn't expect changing the physical memory size to change them.

@askfongjojo askfongjojo added this to the 13 milestone Jan 30, 2025

askfongjojo commented Jan 31, 2025

I can think of another argument for having a background task to set the reservoir based on current availability. In the event of DIMM failures, the system should not attempt to provision more VMMs, which could lead to over-subscription. Crucially, the reservoir size change will not affect running instances (they may already be in a degraded state regardless); the goal of dynamically changing the reservoir to reflect the current capacity is to avoid making things worse.

@askfongjojo askfongjojo modified the milestones: 13, 14 Jan 31, 2025
@rmustacc

> I can think of another argument for having a background task to set the reservoir based on current availability. In the event of DIMM failures, the system should not attempt to provision more VMMs, which could lead to over-subscription. Crucially, the reservoir size change will not affect running instances (they may already be in a degraded state regardless); the goal of dynamically changing the reservoir to reflect the current capacity is to avoid making things worse.

I think an initial reservoir size is important though. While background adjustment is useful, the longer the system is up, the more fragmented the pages will be and the harder it'll be to get contiguous memory to VMs. So it is important that we actually do get a large initial chunk into the reservoir.


jclulow commented Jan 31, 2025

I'm not sure that we make use of available large page style contiguity today, FWIW. It's more about the cost of clawing it back.

@iximeow iximeow self-assigned this Mar 18, 2025
iximeow added a commit that referenced this issue Mar 19, 2025
The core observation of this change is that some uses of memory are
relatively fixed regardless of a sled's hardware configuration. By
subtracting these more constrained uses of memory before calculating a
VMM reservoir size, the remaining memory will be used mostly for
services that scale either with the amount of physical memory or the
amount of storage installed.

The new `control_plane_memory_earmark_mb` setting for sled-agent
describes the sum of this fixed allocation, and existing sled-agent
config.toml files are updated so that actual VMM reservoir sizes for
Gimlets with 1 TiB of installed memory are about the same:

Before: `1012 * 0.8 => 809.6 GiB` of VMM reservoir
After:  `(1012 - 30 - 44) * 0.863 => 809.494 GiB` of VMM reservoir

A Gimlet with 2 TiB of DRAM sees a larger VMM reservoir:

Before: `2048 * 0.8 => 1638.4 GiB` of VMM reservoir
After:  `(2048 - 60 - 44) * 0.863 => 1677.672 GiB` of VMM reservoir

A Gimlet with less than 1 TiB of DRAM would see a smaller VMM reservoir,
but this is in some sense correct: we would otherwise "overprovision"
the VMM reservoir and eat into what is currently effectively a slush
fund of memory for Oxide services supporting the rack's operation,
risking overall system stability, judging from observation and testing
on systems with 1 TiB Gimlets.

A useful additional step in the direction of "config that is workable
across SKUs" would be to measure Crucible overhead in the context of
number of disks or total installed storage. Then we could calculate the
VMM reservoir after subtracting the maximum memory expected to be used
by Crucible if all storage was allocated, and have a presumably-higher
VMM reservoir percentage for the yet-smaller slice of system memory that
is not otherwise accounted for.

Fixes #7448.
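
A minimal sketch of the fixed-plus-variable calculation the commit message describes, not the actual sled-agent implementation; everything here other than `control_plane_memory_earmark_mb` and the numbers quoted above is hypothetical, including the assumption that the earmark is specified in MiB:

```rust
// Hypothetical sketch of earmark-based reservoir sizing; not sled-agent's real API.
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

struct MemoryConfig {
    /// Fixed control plane allocation, assumed to be in MiB
    /// (the new `control_plane_memory_earmark_mb` setting).
    control_plane_memory_earmark_mb: u64,
    /// Percentage of the remainder handed to the VMM reservoir (e.g. 86.3).
    vmm_reservoir_percentage: f64,
}

/// Subtract the page_t database and the fixed earmark from usable DRAM,
/// then take a percentage of what remains as the VMM reservoir.
fn vmm_reservoir_bytes(usable_dram_gib: u64, page_t_gib: u64, cfg: &MemoryConfig) -> u64 {
    let remainder = (usable_dram_gib - page_t_gib) * GIB
        - cfg.control_plane_memory_earmark_mb * MIB;
    (remainder as f64 * cfg.vmm_reservoir_percentage / 100.0) as u64
}

fn main() {
    let cfg = MemoryConfig {
        control_plane_memory_earmark_mb: 44 * 1024, // the "- 44" GiB in the arithmetic above
        vmm_reservoir_percentage: 86.3,
    };
    // 1 TiB sled: (1012 - 30 - 44) * 0.863 ≈ 809.5 GiB
    println!("{}", vmm_reservoir_bytes(1012, 30, &cfg));
    // 2 TiB sled: (2048 - 60 - 44) * 0.863 ≈ 1677.7 GiB
    println!("{}", vmm_reservoir_bytes(2048, 60, &cfg));
}
```

The point of this structure is that only the final percentage scales with installed DRAM; the earmark stays constant across SKUs, which is what lets the 2 TiB sled recover the extra ~39 GiB relative to the flat 80% rule.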