Commit 8c13222

[sled-agent] handle disappearing Propolis zones (#7794)
At present, sled-agents don't really have any way to detect the unexpected disappearance of a Propolis zone. If a Propolis zone is deleted, such as by someone `ssh`ing into the sled and running `zoneadm` commands, the sled-agent won't detect this and instead believes the instance to still exist. Fortunately, this is a fairly rare turn of events: if a VMM panics or other bad things happen, the zone is not usually torn down. Instead, the `propolis-server` process is restarted, in a state where it is no longer aware that it's supposed to have been running an instance. VMs in such a state report to the sled-agent that they're screwed up, and it knows to treat them as having failed. This is all discussed in detail in [RFD 486 What Shall We Do With The Failèd Instance?][486].

Unfortunately, under the rules laid down in that RFD, sled-agents will _only_ treat a Propolis zone as having failed when the `propolis-server` returns one of the errors that *affirmatively indicate* that it has crashed and been restarted. All other errors that occur while checking an instance's state are retried, whether they are HTTP errors returned by the `propolis-server` process or (critically, in this case) failures to establish a TCP connection because the `propolis-server` process no longer exists. This is pretty bad: the sled-agent is left believing, indefinitely, that the instance is still in its last observed state, and that instance (which is now Way Gone) cannot be stopped, restarted, or deleted through normal means. That _sucks_, man!

This commit changes sled-agent to behave more intelligently in this situation. Now, when attempts to check `propolis-server`'s instance-state-monitor API endpoint fail with communication errors, the sled-agent will run `zoneadm list` to find out whether the Propolis zone is still there. If it isn't, we now move the instance to `Failed`, because it's... you know, totally gone.

Doing this properly requires some changes to the code for halting and removing Propolis zones. Halting a zone transitions it through the `DOWN` and/or `SHUTTING_DOWN` states, in which it cannot be uninstalled. Therefore, the `InstanceManager` code may need to retry uninstalling the zone a few times in the case where it was manually halted. This, in turn, required some modifications to the `illumos-utils` `zones::AdmError` type in order to indicate this error condition in a programmatically visible way.

Fixes #7563

[486]: https://rfd.shared.oxide.computer/rfd/0486
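
To make the new behavior concrete, here is a condensed sketch of the decision the monitor now makes after a communication error. The real logic lives inline in `InstanceMonitorRunner` (see the `sled-agent/src/instance.rs` diff below); the `ZoneCheck` enum and `classify` helper here are illustrative names, not part of that code.

    // Illustrative sketch only: `ZoneCheck` and `classify` are hypothetical
    // names; the real decision is made inline in InstanceMonitorRunner (see
    // the sled-agent/src/instance.rs diff below).
    use zone::State;

    /// What the monitor should do after failing to reach propolis-server.
    enum ZoneCheck {
        /// The zone is gone (or present but not running): report the VMM
        /// as Failed.
        ReportFailed,
        /// The zone still appears to be running, so the error may be
        /// transient: retry the state-monitor request.
        Retry,
    }

    fn classify(zone_state: Option<State>) -> ZoneCheck {
        match zone_state {
            // `zoneadm list` no longer shows the zone at all.
            None => ZoneCheck::ReportFailed,
            // The zone exists and is still running.
            Some(State::Running) => ZoneCheck::Retry,
            // The zone exists but isn't running, so propolis-server cannot
            // be answering from inside it.
            Some(_) => ZoneCheck::ReportFailed,
        }
    }

When the `zoneadm` check itself fails, the real code keeps retrying the state monitor rather than guessing, as the diff shows.
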
1 parent 9b7c249 commit 8c13222

File tree

2 files changed: +125 −10 lines

illumos-utils/src/zone.rs

+36 −8

@@ -70,7 +70,25 @@ pub struct AdmError {
     op: Operation,
     zone: String,
     #[source]
-    err: zone::ZoneError,
+    err: AdmErrorKind,
+}
+
+#[derive(thiserror::Error, Debug)]
+pub enum AdmErrorKind {
+    /// The zone is currently in a state in which it cannot be uninstalled.
+    /// These states are generally transient, so this error is likely to be
+    /// retryable.
+    #[error("this operation cannot be performed in the '{:?}' state", .0)]
+    InvalidState(zone::State),
+    /// Another zoneadm error occurred.
+    #[error(transparent)]
+    Zoneadm(#[from] zone::ZoneError),
+}
+
+impl AdmError {
+    pub fn is_invalid_state(&self) -> bool {
+        matches!(self.err, AdmErrorKind::InvalidState(_))
+    }
 }
 
 /// Errors which may be encountered when deleting addresses.
@@ -236,6 +254,16 @@ impl Zones {
            // For zones where we never performed installation, simply
            // delete the zone - uninstallation is invalid.
            zone::State::Configured => (false, false),
+           // Attempting to uninstall a zone in the "down" state will
+           // fail. Instead, the caller must wait until the zone
+           // transitions to "installed".
+           zone::State::Down | zone::State::ShuttingDown => {
+               return Err(AdmError {
+                   op: Operation::Uninstall,
+                   zone: name.to_string(),
+                   err: AdmErrorKind::InvalidState(state),
+               });
+           }
            // For most zone states, perform uninstallation.
            _ => (false, true),
        };
@@ -245,7 +273,7 @@ impl Zones {
                AdmError {
                    op: Operation::Halt,
                    zone: name.to_string(),
-                   err,
+                   err: err.into(),
                }
            })?;
        }
@@ -256,7 +284,7 @@ impl Zones {
                .map_err(|err| AdmError {
                    op: Operation::Uninstall,
                    zone: name.to_string(),
-                   err,
+                   err: err.into(),
                })?;
        }
        zone::Config::new(name)
@@ -266,7 +294,7 @@ impl Zones {
            .map_err(|err| AdmError {
                op: Operation::Delete,
                zone: name.to_string(),
-               err,
+               err: err.into(),
            })?;
        Ok(Some(state))
    }
@@ -360,7 +388,7 @@ impl Zones {
        cfg.run().await.map_err(|err| AdmError {
            op: Operation::Configure,
            zone: zone_name.to_string(),
-           err,
+           err: err.into(),
        })?;
 
        info!(log, "Installing Omicron zone: {}", zone_name);
@@ -374,7 +402,7 @@ impl Zones {
            .map_err(|err| AdmError {
                op: Operation::Install,
                zone: zone_name.to_string(),
-               err,
+               err: err.into(),
            })?;
        Ok(())
    }
@@ -384,7 +412,7 @@ impl Zones {
        zone::Adm::new(name).boot().await.map_err(|err| AdmError {
            op: Operation::Boot,
            zone: name.to_string(),
-           err,
+           err: err.into(),
        })?;
        Ok(())
    }
@@ -398,7 +426,7 @@ impl Zones {
            .map_err(|err| AdmError {
                op: Operation::List,
                zone: "<all>".to_string(),
-               err,
+               err: err.into(),
            })?
            .into_iter()
            .filter(|z| z.name().starts_with(ZONE_PREFIX))
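
The new `AdmErrorKind::InvalidState` variant and `AdmError::is_invalid_state()` exist so that callers can tell a transient "the zone is still shutting down" failure apart from a permanent one. A minimal sketch of that call pattern follows; the function name and one-second sleep are illustrative simplifications, while the real caller in the `sled-agent/src/instance.rs` diff below wraps the same check in `omicron_common::backoff::retry` under a five-minute timeout.

    // Simplified sketch: the function name and signature are illustrative,
    // imports of Zones / AdmError / slog are elided, and the fixed sleep
    // stands in for the real backoff policy used in sled-agent.
    async fn halt_zone_retrying_invalid_state(
        log: &slog::Logger,
        zone_name: &str,
    ) -> Result<(), AdmError> {
        loop {
            match Zones::halt_and_remove_logged(log, zone_name).await {
                Ok(_) => return Ok(()),
                // "down" / "shutting_down" are transient: wait and retry.
                Err(e) if e.is_invalid_state() => {
                    tokio::time::sleep(std::time::Duration::from_secs(1)).await
                }
                // Anything else is a real failure.
                Err(e) => return Err(e),
            }
        }
    }

Treating only `InvalidState` as retryable keeps genuinely broken zones from being retried forever.
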

sled-agent/src/instance.rs

+89 −2

@@ -315,6 +315,7 @@ struct TerminateRequest {
 // This task communicates with the "InstanceRunner" task to report status.
 struct InstanceMonitorRunner {
     client: Arc<PropolisClient>,
+    zone_name: String,
     tx_monitor: mpsc::Sender<InstanceMonitorRequest>,
     log: slog::Logger,
 }
@@ -382,6 +383,53 @@ impl InstanceMonitorRunner {
            Err(e) if self.tx_monitor.is_closed() => {
                Err(BackoffError::permanent(e))
            }
+           // If we couldn't communicate with propolis-server, let's make sure
+           // the zone is still there...
+           Err(e @ PropolisClientError::CommunicationError(_)) => {
+               match Zones::find(&self.zone_name).await {
+                   Ok(None) => {
+                       // Oh it's GONE!
+                       info!(
+                           self.log,
+                           "Propolis zone is Way Gone!";
+                           "zone" => %self.zone_name,
+                       );
+                       Ok(InstanceMonitorUpdate::ZoneGone)
+                   }
+                   Ok(Some(zone)) if zone.state() == zone::State::Running => {
+                       warn!(
+                           self.log,
+                           "communication error checking up on Propolis, but \
+                            the zone is still running...";
+                           "error" => %e,
+                           "zone" => %self.zone_name,
+                       );
+                       Err(BackoffError::transient(e))
+                   }
+                   Ok(Some(zone)) => {
+                       info!(
+                           self.log,
+                           "Propolis zone is no longer running!";
+                           "error" => %e,
+                           "zone" => %self.zone_name,
+                           "zone_state" => ?zone.state(),
+                       );
+                       Ok(InstanceMonitorUpdate::ZoneGone)
+                   }
+                   Err(zoneadm_error) => {
+                       // If we couldn't figure out whether the zone is still
+                       // running, just keep retrying
+                       error!(
+                           self.log,
+                           "error checking if Propolis zone still exists after \
+                            communication error";
+                           "error" => %zoneadm_error,
+                           "zone" => %self.zone_name,
+                       );
+                       Err(BackoffError::transient(e))
+                   }
+               }
+           }
            // Otherwise, was there a known error code from Propolis?
            Err(e) => propolis_error_code(&self.log, &e)
                // If we were able to parse a known error code, send it along to
@@ -396,6 +444,7 @@ impl InstanceMonitorRunner {
 
 enum InstanceMonitorUpdate {
     State(propolis_client::types::InstanceStateMonitorResponse),
+    ZoneGone,
     Error(PropolisErrorCode),
 }
 
@@ -520,7 +569,20 @@ impl InstanceRunner {
                    warn!(self.log, "InstanceRunner failed to send to InstanceMonitorRunner");
                }
            },
-           Some(InstanceMonitorRequest { update: Error(code), tx }) => {
+           // The Propolis zone has abruptly vanished. It's not supposed
+           // to do that! Move it to failed.
+           Some(InstanceMonitorRequest { update: ZoneGone, tx }) => {
+               warn!(
+                   self.log,
+                   "Propolis zone has gone away entirely! Moving \
+                    to Failed"
+               );
+               self.terminate(true).await;
+               if let Err(_) = tx.send(Reaction::Terminate) {
+                   warn!(self.log, "InstanceRunner failed to send to InstanceMonitorRunner");
+               }
+           }
+           Some(InstanceMonitorRequest { update: Error(code), tx }) => {
                let reaction = if code == PropolisErrorCode::NoInstance {
                    // If we see a `NoInstance` error code from
                    // Propolis after the instance has been ensured,
@@ -1307,6 +1369,7 @@ impl InstanceRunner {
        // it exited or because the Propolis server was terminated by other
        // means).
        let runner = InstanceMonitorRunner {
+           zone_name: running_zone.name().to_string(),
            client: client.clone(),
            tx_monitor: self.tx_monitor.clone(),
            log: self.log.clone(),
@@ -1379,7 +1442,31 @@ impl InstanceRunner {
        // `RunningZone::stop` in case we're called between creating the
        // zone and assigning `running_state`.
        warn!(self.log, "Halting and removing zone: {}", zname);
-       Zones::halt_and_remove_logged(&self.log, &zname).await.unwrap();
+       let result = tokio::time::timeout(
+           Duration::from_secs(60 * 5),
+           omicron_common::backoff::retry(
+               omicron_common::backoff::retry_policy_local(),
+               || async {
+                   Zones::halt_and_remove_logged(&self.log, &zname)
+                       .await
+                       .map_err(|e| {
+                           if e.is_invalid_state() {
+                               BackoffError::transient(e)
+                           } else {
+                               BackoffError::permanent(e)
+                           }
+                       })
+               },
+           ),
+       )
+       .await;
+       match result {
+           Ok(Ok(_)) => {}
+           Ok(Err(e)) => panic!("{e}"),
+           Err(_) => {
+               panic!("Zone {zname:?} could not be halted within 5 minutes")
+           }
+       }
 
        // Remove ourselves from the instance manager's map of instances.
        self.instance_ticket.deregister();
