propolis-server should not crash when failing to start a VM #866
Conversation
bit of a bummer we have to do this, but yeah, seems good to me!
```diff
@@ -226,7 +226,26 @@ impl<'a> VmEnsureNotStarted<'a> {
                kernel_vm_paused: false,
            })
        }
-       Err(e) => Err(self.fail(e).await),
+       Err(e) => {
```
One minor caveat: this works in part because in today's code, no one ever actually drops a `VmEnsureObjectsCreated` (both code paths that instantiate one more or less immediately call `ensure_active` on it, which infallibly moves the runtime to `Vm::make_active`; after that point it's guaranteed that someone will call `shutdown_background` on the runtime before it gets dropped).

I think fixing this might be mildly involved (though maybe there's an elegant solution I haven't seen yet), so I don't think I'd hold up the whole PR for it, but it might be good to note this property in a comment somewhere around here.
so, from this nudging and our chat, i think being more intentional about creating and initializing tasks on `vmm_rt` is pretty workable! that's how i've got the PR now. at least this leaves us with a clearly-safe way to handle fallibility on the path to a `VmEnsureObjectsCreated`.

if we decided to do something fallible between a `VmEnsureObjectsCreated` and an `ensure_active`, well, that would be pretty unfortunate.

this approach fixes the immediate issue, but also makes sure that we have a clear way to do fallible init on the way to `VmEnsureObjectsCreated`. another approach here could be to move the call tree leading to VMM runtime creation and init into a separate thread, be it `spawn_blocking` or a scoped thread. this would start from `create_and_activate_vm()` to capture both the migration-based and local-request-based init. i'm just hesitant about moving more than strictly needed into "strange" scopes that might make future-us have to think twice about the execution environment.
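to make the separate-thread alternative concrete, here's a minimal sketch of running fallible init on a scoped OS thread, so that on the error path everything built so far is dropped on that plain thread rather than inside an async executor context. `create_vm_objects`, its parameter, and the error string are all made up for illustration; this is not the propolis call tree:

```rust
// Sketch (illustrative names, not propolis code): run fallible init on a
// scoped OS thread. If init errors, whatever it built is dropped on that
// thread, outside any async executor context, so a blocking teardown
// (e.g. dropping a runtime) can't panic an async task.
fn create_vm_objects(should_fail: bool) -> Result<String, String> {
    std::thread::scope(|s| {
        s.spawn(move || {
            if should_fail {
                // error path: locals created on this thread are torn down here
                Err("failed to add low memory region: Not enough space".to_string())
            } else {
                Ok("vm objects".to_string())
            }
        })
        .join()
        .expect("init thread panicked")
    })
}
```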
Thanks for going the additional mile here!
```rust
// `VmEnsureObjectsCreated` (and later state transitions) take care to
// `shutdown_background` the runtime.
```
nit: this isn't quite right, I think; once you have a `VmEnsureActive` you are assured that the runtime will be shut down in the background when it's dropped. You get that by calling `VmEnsureObjectsCreated::ensure_active`.

We might want to put a `WARNING:` comment on `VmEnsureObjectsCreated` indicating that it can't currently be dropped (or maybe just add a `Drop` impl for it that calls `shutdown_background`, though I would have to convince myself that this won't leak any tasks).
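for reference, the usual shape of such a `Drop` impl: tokio's `Runtime::shutdown_background(self)` consumes the runtime by value, so a guard has to hold it in an `Option` and `take()` it, doing nothing if ownership was already handed off. A minimal sketch with invented names (`RtGuard` is not a propolis type; the `shutdown` function pointer stands in for `Runtime::shutdown_background`):

```rust
// Sketch: a guard for a resource that must be consumed by value on teardown.
struct RtGuard<R> {
    rt: Option<R>,   // Option so Drop can move the resource out
    shutdown: fn(R), // stand-in for tokio's Runtime::shutdown_background
}

impl<R> RtGuard<R> {
    // Hand ownership to the caller; Drop then finds None and does nothing.
    fn into_inner(mut self) -> R {
        self.rt.take().unwrap()
    }
}

impl<R> Drop for RtGuard<R> {
    fn drop(&mut self) {
        // Backstop: only runs if the guard was dropped without into_inner().
        if let Some(rt) = self.rt.take() {
            (self.shutdown)(rt);
        }
    }
}
```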
oh, i'd conflated `ActiveVm` and `VmEnsureActive`, yeah you're right.

i'll go with a stern warning. a `Drop` impl that calls `shutdown_background` would immediately leak any object's tasks, but they're all also cancelled so they should stop "soon", so i think it's fine. but being in a situation where we're dropping `VmEnsureObjectsCreated` itself feels like a bug. and if you innocently destructure `VmEnsureObjectsCreated` outside `ensure_active`, you still have to be careful about the runtime again.
(7053acf)
tripped over this while trying to figure out how it was that `propolis-server` is now failing to create VMs with a toml that was working fine just an hour ago. what was a crash with stdout having an `INFO ... migration: InstanceMigrateStatusResponse { migration_in: None, migration_out: None }` (silently truncated early due to panic) now very helpfully does not kill `propolis-server` and instead has an `ERRO ... failed to add low memory region: Not enough space`.

Fixes #838