sled-agent

improved TUF artifact replication robustness (#7519 )

Mar 20, 2025

c31dc2f · Mar 20, 2025

Name	Name	Last commit message	Last commit date
parent directory ..
api	api	improved TUF artifact replication robustness (#7519 )	Mar 20, 2025
bootstrap-agent-api	bootstrap-agent-api	remove the `SemverVersion` wrapper struct (#7570 )	Feb 22, 2025
repo-depot-api	repo-depot-api	[repo depot 2/n] sled agent APIs to manage update artifact storage (#…	Oct 31, 2024
src	src	improved TUF artifact replication robustness (#7519 )	Mar 20, 2025
tests	tests	[meta] update Rust style edition to 2024 (#7591 )	Mar 3, 2025
types	types	[sled-agent] Merge disk/dataset/zone `PUT` endpoints into one (#7762 )	Mar 10, 2025
.gitignore	.gitignore	Remove 'omicron-' prefix from paths - it is implied (#296 )	Oct 8, 2021
Cargo.toml	Cargo.toml	Extract `IdMap` to its own crate (#7797 )	Mar 14, 2025
README.adoc	README.adoc	Update Sled Agent Readme (#827 )	Mar 29, 2022

README.adoc

Sled Agent

Table of Contents

Real vs Simulated
Code Tour
Life of an Instance
TODOs

This directory contains the per-sled "agent", responsible for managing local hardware and instances.

Real vs Simulated

This subdirectory contains implementations of both a "simulated" and "real" Sled Agent. The simulated agent allows testing state machine management in a low-overhead environment. Where reasonable, decision-making logic is shared between the two implementations.

Code Tour

src/bin: Contains binaries for the sled agent (and simulated sled agent).
src/bootstrap: Contains bootstrap-related services, operating on a distinct HTTP endpoint from typical sled operation.
src/common: Shared state machine code between the simulated and real sled agent.
src/sim: Library code responsible for operating a simulated sled agent.
src/illumos: Illumos-specific helpers for accessing OS utilities to manage a sled.

Additionally, there are some noteworthy top-level files used by the sled agent:

Launching Instances

src/instance_manager.rs: Manages multiple instances on a sled.
src/instance.rs: Manages a single instance.

Managing Sled-local Resources

src/storage_manager.rs: Manages storage within a sled.
src/services.rs: Manages services that are running within the sled.
src/updates.rs: Manages packages installed on the sled.

HTTP Services

src/http_entrypoints.rs: The HTTP API exposed by the Sled Agent.
src/params.rs: Parameters to the HTTP endpoints.

As well as some utilities:

src/illumos/running_zone.rs: RAII wrapper around a running Zone owned by the Sled Agent.
src/illumos/vnic.rs: RAII wrapper around VNICs owned by the Sled Agent.

Life of an Instance

Note	This process is subject to change. What follows attempts to be an accurate description of the current implementation.

As a prerequisite for starting new instances, the Sled Agent takes the following steps on initialization to manage OS-local resources.

When booting, the Sled Agent…

… creates a ZFS filesystem for zones, called rpool/zone, mounted at /zone.
… identifies all Oxide-controlled zones (with the prefix oxz_) and all Oxide-controlled VNICs (with the prefix ox), which are removed from the machine.

To allocate an instance on the Sled, the following steps occur:

A request arrives via the HTTP server (typically from Nexus), requesting an operation like instance_put.
This request is routed to the InstanceManager through the ensure method. If the instance already exists, it is identified by UUID and updated. Otherwise, a new instance is created.
A dedicated "control" VNIC is created by the Sled Agent, for use by the instance’s zone.
- This requires identifying a physical data-link device, and creating a new VNIC on top of it.
An omicron-branded zone is created for the instance. Within this zone…
- … Propolis' executable files exist.
- … The previously created "control" VNIC is attached as a network interface.
- … Necessary devices for virtualization (/dev/{vmm,vmmctl,viona}) are added.
The zone is then booted. Once the SMF network milestone has been reached…
- … An IP address is allocated on the aforementioned VNIC.
- … Propolis' HTTP server is initialized, using this IP address.
- … And the original request to "create an instance" is passed to this Propolis server, where the actual VM initialization occurs.
At this point, the instance is up and running, and the Sled Agent monitors it such that Nexus may be notified if the VM changes state.

TODOs

We track issues on Github using the following labels:

However, the following are general areas of improvement for the Sled Agent:

(Correctness) Plumbing of Nexus-customizable InstanceProperties, such as image, bootrom, memory, vcpus.
(Correctness) Integration of OPTE to manage the network and apply policy changes as requested by Nexus.
(Resilience) On initialization of the InstanceManager, the Sled Agent currently removes all Zones and VNICs on the system with known magic prefixes. Instead, these should be inspected and re-organized, especially to deal with a case where the Sled Agent reboots, but customer VMs continue to execute on a rack.
(Performance) Minimize polling-based behavior, where possible. The Sled Agent is responsible for launching zones, SMF services, and HTTP servers - for all of these, the agent uses polling with timeouts to monitor progress between "starting" and "actually running and responding to requests". If possible, it would be preferable to replace these timeouts with event-triggered behavior, which would avoid unnecessary stalls.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

sled-agent

sled-agent

README.adoc

Sled Agent

Real vs Simulated

Code Tour

Launching Instances

Managing Sled-local Resources

HTTP Services

Life of an Instance

TODOs

Files

sled-agent

Directory actions

More options

Directory actions

More options

Latest commit

History

sled-agent

Folders and files

parent directory

README.adoc

Sled Agent

Real vs Simulated

Code Tour

Launching Instances

Managing Sled-local Resources

HTTP Services

Life of an Instance

TODOs