[sled-agent] handle disappearing Propolis zones (#7794)
At present, sled-agents don't really have any way to detect the
unexpected disappearance of a Propolis zone. If a Propolis zone is
deleted, such as by someone `ssh`ing into the sled and running `zoneadm`
commands, the sled-agent won't detect this and instead believes the
instance to still exist.
Fortunately, this is a fairly rare turn of events: if a VMM panics or
other bad things happen, the zone is not usually torn down. Instead, the
`propolis-server` process is restarted, in a state where it is no longer
aware that it's supposed to have been running an instance. VMs in such a
state report to the sled-agent that they're screwed up, and it knows to
treat them as having failed. This is all discussed in detail in [RFD 486
What Shall We Do With The Failèd Instance?][486].
Unfortunately, under the rules laid down in that RFD, sled-agents will
_only_ treat a Propolis zone as having failed when the `propolis-server`
returns one of the errors that *affirmatively indicate* that it has
crashed and been restarted. All other errors that occur while checking
an instance's state are retried, whether they are HTTP errors returned by
the `propolis-server` process, or (critically, in this case) failures to
establish a TCP connection because the `propolis-server` process no
longer exists. This is pretty bad, as the sled-agent is now left
believing that the instance was in its last observed state indefinitely,
and that instance (which is now Way Gone) cannot be stopped, restarted,
or deleted through normal means. That _sucks_, man!
This commit changes sled-agent to behave more intelligently in this
situation. Now, when attempts to check `propolis-server`'s
instance-state-monitor API endpoint fail with communication errors, the
sled-agent will run `zoneadm list` to find out whether the Propolis zone
is still there. If it isn't, we now move the instance to `Failed`,
because it's..., you know, totally gone.
Doing this properly requires some changes to the code for halting and
removing Propolis zones. Halting a zone transitions it through the
`SHUTTING_DOWN` and/or `DOWN` states, in which it cannot be
uninstalled. Therefore, the `InstanceManager` code may need to retry
uninstalling the zone a few times in the case where it's manually
halted. This, in turn, required some modifications to the
`illumos-utils` `zones::AdmError` type in order to indicate this
error condition in a programmatically visible way.
Fixes #7563
[486]: https://rfd.shared.oxide.computer/rfd/0486