A new nexus zone created after a scrimlet expungement gets stuck trying to find switch zone addresses #7739
Comments
Nexus log at: staff/core/omicron-7739
I think what you saw in #6076 was getting stuck at the same place too.
See also #5092.
Currently, callers of `map_switch_zone_addrs()` first get the IP for `ServiceName::Dendrite` from DNS, then loop (forever) trying to translate that IP into a `SwitchLocation`. Under normal conditions, this is fine. However, if a sled has been expunged, or a new sled is being added, it's possible that what is returned by:

```
let switch_zone_addresses = match resolver
    .lookup_all_ipv6(ServiceName::Dendrite)
    .await
```

will change. If that change happens after we start looping in `map_switch_zone_addrs()`, then the loop will go on forever looking for something that is no longer correct.

To fix this, we put the `lookup_all_ipv6` call into the loop by using the function `switch_zone_address_mappings()` instead. The loop in `switch_zone_address_mappings()` looks up the addresses in DNS and then calls `map_switch_zone_addrs()`. This allows us to include the DNS lookup inside the loop.

Most places where we called `map_switch_zone_addrs()` were also using the same `lookup_all_ipv6()` call, so transitioning them to call `switch_zone_address_mappings()` will just drop right in.

A fix for #7739

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
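The shape of that fix can be sketched in a few lines of synchronous Rust. This is only an illustration under stated assumptions, not the code from the PR: `FakeDns`, `try_map`, and `remap_with_fresh_lookup` are hypothetical names, the `live` table stands in for whatever the real code asks each switch zone, and the loop is bounded so the example terminates (the real loop retries forever).

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

/// Hypothetical stand-in for the DNS resolver. Each lookup returns the next
/// snapshot, modelling records that change after an expungement.
struct FakeDns {
    snapshots: Vec<Vec<Ipv6Addr>>,
    calls: usize,
}

impl FakeDns {
    fn lookup_all_ipv6(&mut self) -> Vec<Ipv6Addr> {
        let idx = self.calls;
        self.calls += 1;
        // Once the snapshots run out, keep returning the last one.
        self.snapshots
            .get(idx)
            .or(self.snapshots.last())
            .cloned()
            .unwrap_or_default()
    }
}

/// One mapping attempt (hypothetical): succeeds only if every address
/// belongs to a switch zone that is still reachable.
fn try_map(
    addrs: &[Ipv6Addr],
    live: &HashMap<Ipv6Addr, SwitchLocation>,
) -> Option<HashMap<SwitchLocation, Ipv6Addr>> {
    if addrs.is_empty() {
        return None;
    }
    let mut mapped = HashMap::new();
    for addr in addrs {
        let loc = live.get(addr)?; // a stale address fails the whole attempt
        mapped.insert(*loc, *addr);
    }
    Some(mapped)
}

/// The DNS lookup sits inside the retry loop, so a stale answer from one
/// iteration is replaced by a fresh one on the next.
fn remap_with_fresh_lookup(
    dns: &mut FakeDns,
    live: &HashMap<Ipv6Addr, SwitchLocation>,
    max_attempts: usize,
) -> Option<HashMap<SwitchLocation, Ipv6Addr>> {
    for _ in 0..max_attempts {
        let addrs = dns.lookup_all_ipv6();
        if let Some(mapped) = try_map(&addrs, live) {
            return Some(mapped);
        }
    }
    None
}
```

If the first snapshot still contains the address of an expunged sled and the second does not, the first attempt fails and the second succeeds, because the address list is rebuilt on every pass.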
Fixed in #7779
On London racklette I expunged sled 14, which had a nexus zone on it.
The global zone of sled 14 had `fd00:1122:3344:101::1/64` for underlay0/sled6 address.
As part of the handling of that expungement, sled 15 was assigned the honor of creating a replacement nexus zone.
During nexus startup, we get stuck:
From the new nexus log on 15, just after the OSO blob:
From `omicron/nexus/src/app/mod.rs` we have `map_switch_zone_addrs()`, which has this loop:
We enter the loop with a switch zone address of the now missing sled 14.
Once we enter that loop, we are never going to come out as sled 14 is gone.
Being stuck in this loop prevents Nexus from starting up.
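To make the failure mode concrete, here is a minimal sketch of a retry loop over a fixed address list. It is not the loop from `omicron/nexus/src/app/mod.rs`: `map_fixed_addrs_sketch` and the `probe` callback are hypothetical, and the attempt budget exists only so the example terminates, where the real loop has no exit.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

/// Retry until every address in a fixed list answers `probe` (hypothetical).
/// The list is never refreshed, so one permanently dead address means no
/// attempt can ever succeed. Returns Err(max_attempts) when the budget runs out.
fn map_fixed_addrs_sketch<F>(
    addrs: &[Ipv6Addr],
    probe: F,
    max_attempts: usize,
) -> Result<HashMap<SwitchLocation, Ipv6Addr>, usize>
where
    F: Fn(&Ipv6Addr) -> Option<SwitchLocation>,
{
    for _attempt in 0..max_attempts {
        let mut mapped = HashMap::new();
        let mut all_answered = true;
        for addr in addrs {
            match probe(addr) {
                Some(loc) => {
                    mapped.insert(loc, *addr);
                }
                None => all_answered = false,
            }
        }
        if all_answered {
            return Ok(mapped);
        }
        // A real loop would sleep here before retrying the same addresses.
    }
    Err(max_attempts)
}
```

With an address list that still contains the expunged sled's switch zone, raising `max_attempts` does not help: every pass probes the same dead address and fails the same way.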