Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

propolis-server should not crash when failing to start a VM #866

Merged
merged 4 commits into from
Feb 21, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
92 changes: 63 additions & 29 deletions bin/propolis-server/src/lib/vm/ensure.rs
Original file line number Diff line number Diff line change
Expand Up @@ -236,39 +236,67 @@ impl<'a> VmEnsureNotStarted<'a> {
},
));

// Create the runtime that will host tasks created by VMM components
// (e.g. block device runtime tasks).
let vmm_rt = tokio::runtime::Builder::new_multi_thread()
.thread_name("tokio-rt-vmm")
.worker_threads(usize::max(
VMM_MIN_RT_THREADS,
VMM_BASE_RT_THREADS + spec.board.cpus as usize,
))
.enable_all()
.build()?;

let log_for_init = self.log.clone();
let properties = self.ensure_request.properties.clone();
let options = self.ensure_options.clone();
let queue_for_init = input_queue.clone();
let init_result = vmm_rt
.spawn(async move {
initialize_vm_objects(
log_for_init,
spec,
properties,
options,
queue_for_init,
)
.await
})
.await
.map_err(|e| {
anyhow::anyhow!("failed to join VM object creation task: {e}")
})?;

match init_result {
Ok(objects) => {

// Either the block following this succeeds with both a Tokio runtime
// and VM objects, or entirely fails with no partial state for us to
// clean up.
type InitResult =
anyhow::Result<(tokio::runtime::Runtime, InputVmObjects)>;

// We need to create a new runtime to host the tasks for this VMM's
// objects, but that initialization is fallible and results in dropping
// the fledgling VMM runtime itself. Dropping a Tokio runtime on a
// worker thread in a Tokio runtime will panic, so do all init in a
// `spawn_blocking` where this won't be an issue.
//
// When the runtime is returned to this thread, it must not be dropped.
// That means that the path between this result and returning an
// `Ok(VmEnsureObjectsCreated)` must be infallible.
let result: InitResult = tokio::task::spawn_blocking(move || {
// Create the runtime that will host tasks created by
// VMM components (e.g. block device runtime tasks).
let vmm_rt = tokio::runtime::Builder::new_multi_thread()
.thread_name("tokio-rt-vmm")
.worker_threads(usize::max(
VMM_MIN_RT_THREADS,
VMM_BASE_RT_THREADS + spec.board.cpus as usize,
))
.enable_all()
.build()?;

let init_result = vmm_rt
.block_on(async move {
initialize_vm_objects(
log_for_init,
spec,
properties,
options,
queue_for_init,
)
.await
})
.map_err(|e| {
anyhow::anyhow!(
"failed to join VM object creation task: {e}"
)
})?;
Ok((vmm_rt, init_result))
})
.await
.map_err(|e| {
// This is extremely unexpected: if the join failed, the init
// task panicked or was cancelled. If the init itself failed,
// which is somewhat more reasonable, we would expect the join
// to succeed and have an error below.
anyhow::anyhow!("failed to join VMM runtime init task: {e}")
})?;

match result {
Ok((vmm_rt, objects)) => {
// N.B. Once these `VmObjects` exist, it is no longer safe to
// call `vm_init_failed`.
let objects = Arc::new(VmObjects::new(
Expand Down Expand Up @@ -310,6 +338,12 @@ impl<'a> VmEnsureNotStarted<'a> {
/// Represents an instance ensure request that has proceeded far enough to
/// create a set of VM objects, but that has not yet installed those objects as
/// an `ActiveVm` or notified the requestor that its request is complete.
///
/// WARNING: dropping `VmEnsureObjectsCreated` is a panic risk since dropping
/// the contained `tokio::runtime::Runtime` on in a worker thread will panic. It
/// is probably a bug to drop `VmEnsureObjectsCreated`, as it is expected users
/// will quickly call [`VmEnsureObjectsCreated::ensure_active`], but if you
/// must, take care in handling the contained `vmm_rt`.
pub(crate) struct VmEnsureObjectsCreated<'a> {
log: &'a slog::Logger,
vm: &'a Arc<super::Vm>,
Expand Down
Loading