Commit 8c13222

[sled-agent] handle disappearing Propolis zones (#7794)
At present, sled-agents don't really have any way to detect the unexpected disappearance of a Propolis zone. If a Propolis zone is deleted, such as by someone `ssh`ing into the sled and running `zoneadm` commands, the sled-agent won't detect this and instead believes the instance to still exist. Fortunately, this is a fairly rare turn of events: if a VMM panics or other bad things happen, the zone is not usually torn down. Instead, the `propolis-server` process is restarted, in a state where it is no longer aware that it's supposed to have been running an instance. VMs in such a state report to the sled-agent that they're screwed up, and it knows to treat them as having failed. This is all discussed in detail in [RFD 486 What Shall We Do With The Failèd Instance?][486].

Unfortunately, under the rules laid down in that RFD, sled-agents will _only_ treat a Propolis zone as having failed when the `propolis-server` returns one of the errors that *affirmatively indicate* that it has crashed and been restarted. All other errors that occur while checking an instance's state are retried, whether they are HTTP errors returned by the `propolis-server` process or (critically, in this case) failures to establish a TCP connection because the `propolis-server` process no longer exists. This is pretty bad: the sled-agent is left believing, indefinitely, that the instance is still in its last observed state, and that instance (which is now Way Gone) cannot be stopped, restarted, or deleted through normal means. That _sucks_, man!

This commit changes sled-agent to behave more intelligently in this situation. Now, when attempts to check `propolis-server`'s instance-state-monitor API endpoint fail with communication errors, the sled-agent will run `zoneadm list` to find out whether the Propolis zone is still there. If it isn't, we now move the instance to `Failed`, because it's... you know, totally gone.

Doing this properly requires some changes to the code for halting and removing Propolis zones. Halting a zone transitions it through the `DOWN` and/or `SHUTTING_DOWN` states, in which it cannot be uninstalled. Therefore, the `InstanceManager` code may need to retry uninstalling the zone a few times in the case where it was manually halted. This, in turn, required some modifications to the `illumos-utils` `zones::AdmError` type in order to indicate this error condition in a programmatically visible way.

Fixes #7563

[486]: https://rfd.shared.oxide.computer/rfd/0486
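
To make the new behavior concrete, here is a condensed sketch of the decision the monitor now makes after a communication error. The real logic lives inline in `InstanceMonitorRunner` (see the `sled-agent/src/instance.rs` diff below); the `ZoneCheck` enum and `classify` helper here are illustrative names, not part of that code.

    // Illustrative sketch only: `ZoneCheck` and `classify` are hypothetical
    // names; the real decision is made inline in InstanceMonitorRunner (see
    // the sled-agent/src/instance.rs diff below).
    use zone::State;

    /// What the monitor should do after failing to reach propolis-server.
    enum ZoneCheck {
        /// The zone is gone (or present but not running): report the VMM
        /// as Failed.
        ReportFailed,
        /// The zone still appears to be running, so the error may be
        /// transient: retry the state-monitor request.
        Retry,
    }

    fn classify(zone_state: Option<State>) -> ZoneCheck {
        match zone_state {
            // `zoneadm list` no longer shows the zone at all.
            None => ZoneCheck::ReportFailed,
            // The zone exists and is still running.
            Some(State::Running) => ZoneCheck::Retry,
            // The zone exists but isn't running, so propolis-server cannot
            // be answering from inside it.
            Some(_) => ZoneCheck::ReportFailed,
        }
    }

When the `zoneadm` check itself fails, the real code keeps retrying the state monitor rather than guessing, as the diff shows.
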
1 parent 9b7c249 commit 8c13222

File tree

2 files changed: +125 −10 lines

illumos-utils/src/zone.rs

+36 −8

@@ -70,7 +70,25 @@ pub struct AdmError {
     op: Operation,
     zone: String,
     #[source]
-    err: zone::ZoneError,
+    err: AdmErrorKind,
+}
+
+#[derive(thiserror::Error, Debug)]
+pub enum AdmErrorKind {
+    /// The zone is currently in a state in which it cannot be uninstalled.
+    /// These states are generally transient, so this error is likely to be
+    /// retryable.
+    #[error("this operation cannot be performed in the '{:?}' state", .0)]
+    InvalidState(zone::State),
+    /// Another zoneadm error occurred.
+    #[error(transparent)]
+    Zoneadm(#[from] zone::ZoneError),
+}
+
+impl AdmError {
+    pub fn is_invalid_state(&self) -> bool {
+        matches!(self.err, AdmErrorKind::InvalidState(_))
+    }
 }
 
 /// Errors which may be encountered when deleting addresses.
@@ -236,6 +254,16 @@ impl Zones {
            // For zones where we never performed installation, simply
            // delete the zone - uninstallation is invalid.
            zone::State::Configured => (false, false),
+           // Attempting to uninstall a zone in the "down" state will
+           // fail. Instead, the caller must wait until the zone
+           // transitions to "installed".
+           zone::State::Down | zone::State::ShuttingDown => {
+               return Err(AdmError {
+                   op: Operation::Uninstall,
+                   zone: name.to_string(),
+                   err: AdmErrorKind::InvalidState(state),
+               });
+           }
            // For most zone states, perform uninstallation.
            _ => (false, true),
        };
@@ -245,7 +273,7 @@ impl Zones {
                AdmError {
                    op: Operation::Halt,
                    zone: name.to_string(),
-                   err,
+                   err: err.into(),
                }
            })?;
        }
@@ -256,7 +284,7 @@ impl Zones {
                .map_err(|err| AdmError {
                    op: Operation::Uninstall,
                    zone: name.to_string(),
-                   err,
+                   err: err.into(),
                })?;
        }
        zone::Config::new(name)
@@ -266,7 +294,7 @@ impl Zones {
            .map_err(|err| AdmError {
                op: Operation::Delete,
                zone: name.to_string(),
-               err,
+               err: err.into(),
            })?;
        Ok(Some(state))
    }
@@ -360,7 +388,7 @@ impl Zones {
        cfg.run().await.map_err(|err| AdmError {
            op: Operation::Configure,
            zone: zone_name.to_string(),
-           err,
+           err: err.into(),
        })?;
 
        info!(log, "Installing Omicron zone: {}", zone_name);
@@ -374,7 +402,7 @@ impl Zones {
            .map_err(|err| AdmError {
                op: Operation::Install,
                zone: zone_name.to_string(),
-               err,
+               err: err.into(),
            })?;
        Ok(())
    }
@@ -384,7 +412,7 @@ impl Zones {
        zone::Adm::new(name).boot().await.map_err(|err| AdmError {
            op: Operation::Boot,
            zone: name.to_string(),
-           err,
+           err: err.into(),
        })?;
        Ok(())
    }
@@ -398,7 +426,7 @@ impl Zones {
            .map_err(|err| AdmError {
                op: Operation::List,
                zone: "<all>".to_string(),
-               err,
+               err: err.into(),
            })?
            .into_iter()
            .filter(|z| z.name().starts_with(ZONE_PREFIX))
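
The new `AdmErrorKind::InvalidState` variant and `AdmError::is_invalid_state()` exist so that callers can tell a transient "the zone is still shutting down" failure apart from a permanent one. A minimal sketch of that call pattern follows; the function name and one-second sleep are illustrative simplifications, while the real caller in the `sled-agent/src/instance.rs` diff below wraps the same check in `omicron_common::backoff::retry` under a five-minute timeout.

    // Simplified sketch: the function name and signature are illustrative,
    // imports of Zones / AdmError / slog are elided, and the fixed sleep
    // stands in for the real backoff policy used in sled-agent.
    async fn halt_zone_retrying_invalid_state(
        log: &slog::Logger,
        zone_name: &str,
    ) -> Result<(), AdmError> {
        loop {
            match Zones::halt_and_remove_logged(log, zone_name).await {
                Ok(_) => return Ok(()),
                // "down" / "shutting_down" are transient: wait and retry.
                Err(e) if e.is_invalid_state() => {
                    tokio::time::sleep(std::time::Duration::from_secs(1)).await
                }
                // Anything else is a real failure.
                Err(e) => return Err(e),
            }
        }
    }

Treating only `InvalidState` as retryable keeps genuinely broken zones from being retried forever.
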

sled-agent/src/instance.rs

+89 −2

@@ -315,6 +315,7 @@ struct TerminateRequest {
 // This task communicates with the "InstanceRunner" task to report status.
 struct InstanceMonitorRunner {
     client: Arc<PropolisClient>,
+    zone_name: String,
     tx_monitor: mpsc::Sender<InstanceMonitorRequest>,
     log: slog::Logger,
 }
@@ -382,6 +383,53 @@ impl InstanceMonitorRunner {
            Err(e) if self.tx_monitor.is_closed() => {
                Err(BackoffError::permanent(e))
            }
+           // If we couldn't communicate with propolis-server, let's make sure
+           // the zone is still there...
+           Err(e @ PropolisClientError::CommunicationError(_)) => {
+               match Zones::find(&self.zone_name).await {
+                   Ok(None) => {
+                       // Oh it's GONE!
+                       info!(
+                           self.log,
+                           "Propolis zone is Way Gone!";
+                           "zone" => %self.zone_name,
+                       );
+                       Ok(InstanceMonitorUpdate::ZoneGone)
+                   }
+                   Ok(Some(zone)) if zone.state() == zone::State::Running => {
+                       warn!(
+                           self.log,
+                           "communication error checking up on Propolis, but \
+                            the zone is still running...";
+                           "error" => %e,
+                           "zone" => %self.zone_name,
+                       );
+                       Err(BackoffError::transient(e))
+                   }
+                   Ok(Some(zone)) => {
+                       info!(
+                           self.log,
+                           "Propolis zone is no longer running!";
+                           "error" => %e,
+                           "zone" => %self.zone_name,
+                           "zone_state" => ?zone.state(),
+                       );
+                       Ok(InstanceMonitorUpdate::ZoneGone)
+                   }
+                   Err(zoneadm_error) => {
+                       // If we couldn't figure out whether the zone is still
+                       // running, just keep retrying
+                       error!(
+                           self.log,
+                           "error checking if Propolis zone still exists after \
+                            communication error";
+                           "error" => %zoneadm_error,
+                           "zone" => %self.zone_name,
+                       );
+                       Err(BackoffError::transient(e))
+                   }
+               }
+           }
            // Otherwise, was there a known error code from Propolis?
            Err(e) => propolis_error_code(&self.log, &e)
                // If we were able to parse a known error code, send it along to
@@ -396,6 +444,7 @@ impl InstanceMonitorRunner {
 
 enum InstanceMonitorUpdate {
     State(propolis_client::types::InstanceStateMonitorResponse),
+    ZoneGone,
     Error(PropolisErrorCode),
 }
 
@@ -520,7 +569,20 @@ impl InstanceRunner {
                    warn!(self.log, "InstanceRunner failed to send to InstanceMonitorRunner");
                }
            },
-           Some(InstanceMonitorRequest { update: Error(code), tx }) => {
+           // The Propolis zone has abruptly vanished. It's not supposed
+           // to do that! Move it to failed.
+           Some(InstanceMonitorRequest { update: ZoneGone, tx }) => {
+               warn!(
+                   self.log,
+                   "Propolis zone has gone away entirely! Moving \
+                    to Failed"
+               );
+               self.terminate(true).await;
+               if let Err(_) = tx.send(Reaction::Terminate) {
+                   warn!(self.log, "InstanceRunner failed to send to InstanceMonitorRunner");
+               }
+           }
+           Some(InstanceMonitorRequest { update: Error(code), tx }) => {
                let reaction = if code == PropolisErrorCode::NoInstance {
                    // If we see a `NoInstance` error code from
                    // Propolis after the instance has been ensured,
@@ -1307,6 +1369,7 @@ impl InstanceRunner {
        // it exited or because the Propolis server was terminated by other
        // means).
        let runner = InstanceMonitorRunner {
+           zone_name: running_zone.name().to_string(),
            client: client.clone(),
            tx_monitor: self.tx_monitor.clone(),
            log: self.log.clone(),
@@ -1379,7 +1442,31 @@ impl InstanceRunner {
        // `RunningZone::stop` in case we're called between creating the
        // zone and assigning `running_state`.
        warn!(self.log, "Halting and removing zone: {}", zname);
-       Zones::halt_and_remove_logged(&self.log, &zname).await.unwrap();
+       let result = tokio::time::timeout(
+           Duration::from_secs(60 * 5),
+           omicron_common::backoff::retry(
+               omicron_common::backoff::retry_policy_local(),
+               || async {
+                   Zones::halt_and_remove_logged(&self.log, &zname)
+                       .await
+                       .map_err(|e| {
+                           if e.is_invalid_state() {
+                               BackoffError::transient(e)
+                           } else {
+                               BackoffError::permanent(e)
+                           }
+                       })
+               },
+           ),
+       )
+       .await;
+       match result {
+           Ok(Ok(_)) => {}
+           Ok(Err(e)) => panic!("{e}"),
+           Err(_) => {
+               panic!("Zone {zname:?} could not be halted within 5 minutes")
+           }
+       }
 
        // Remove ourselves from the instance manager's map of instances.
        self.instance_ticket.deregister();
