
A new nexus zone created after a scrimlet expungement gets stuck trying to find switch zone addresses #7739

Closed
leftwo opened this issue Mar 5, 2025 · 6 comments
Assignees
Labels
expunge expunge sled or disk issues
Milestone

Comments

@leftwo
Contributor

leftwo commented Mar 5, 2025

On London racklette I expunged sled 14, which had a nexus zone on it.
The global zone of sled 14 had fd00:1122:3344:101::1/64 as its underlay0/sled6 address.

As part of the handling of that expungement, sled 15 was assigned the honor of creating a replacement nexus zone.
During nexus startup, we get stuck:

From the new nexus log on 15, just after the OSO blob:

20:58:12.013Z INFO fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): Setting up resolver using DNS servers for subnet: Ipv6Subnet { net: Ipv6Net { addr: fd00:1122:3344::, width: 48 } }
    file = nexus/src/context.rs:219
20:58:12.017Z INFO fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): new DNS resolver
    addresses = [[fd00:1122:3344:1::1]:53, [fd00:1122:3344:2::1]:53, [fd00:1122:3344:3::1]:53]
    file = internal-dns/resolver/src/resolver.rs:111
20:58:12.022Z INFO fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): Setting up qorb database pool from DNS
    dns_addrs = [[fd00:1122:3344:1::1]:53, [fd00:1122:3344:2::1]:53, [fd00:1122:3344:3::1]:53]
    file = nexus/src/context.rs:275
20:58:12.028Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): registered USDT probes
20:58:12.054Z INFO fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): Database schema version is up to date
    desired_version = 124.0.0
    file = nexus/db-queries/src/db/datastore/db_metadata.rs:145
    found_version = 124.0.0
20:58:12.058Z INFO fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): SEC running
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:813
    sec_id = fd9231e4-0c8c-42be-b06c-9593eaa3b009
20:58:12.059Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): lookup_all_ipv6 srv
    dns_name = _dendrite._tcp.control-plane.oxide.internal
    response = SrvLookup(Lookup { query: Query { name: Name("_dendrite._tcp.control-plane.oxide.internal"), query_type: SRV, query_class: IN }, records: [Record { name_labels: Name("_dendrite._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 12224, target: Name("dendrite-0dd453b6-7a98-4011-aecf-fb48821a6fc7.host.control-plane.oxide.internal.") })) }, Record { name_labels: Name("_dendrite._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 12224, target: Name("dendrite-bb1c5b3c-d7dd-4973-a7f0-93c6a69b5385.host.control-plane.oxide.internal.") })) }, Record { name_labels: Name("dendrite-bb1c5b3c-d7dd-4973-a7f0-93c6a69b5385.host.control-plane.oxide.internal."), rr_type: AAAA, dns_class: IN, ttl: 0, rdata: Some(AAAA(AAAA(fd00:1122:3344:102::2))) }], valid_until: Instant { tv_sec: 2645, tv_nsec: 277768426 } })
20:58:12.059Z INFO fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): Determining switch slots managed by switch zones
    file = nexus/src/app/mod.rs:1154
20:58:12.118Z INFO fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): determining switch slot managed by dendrite zone
    file = nexus/src/app/mod.rs:1162
    zone_address = fd00:1122:3344:101::2
20:58:12.119Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): client request
    body = None
    method = GET
    uri = http://[fd00:1122:3344:101::2]:12225/local/switch-id
20:58:27.129Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): client response
    result = Err(reqwest::Error { kind: Request, url: "http://[fd00:1122:3344:101::2]:12225/local/switch-id", source: TimedOut })
20:58:27.129Z WARN fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): failed to identify switch slot for dendrite, will retry in 2 seconds
    file = nexus/src/app/mod.rs:1176
    reason = Communication Error: error sending request for url (http://[fd00:1122:3344:101::2]:12225/local/switch-id): operation timed out
    zone_address = fd00:1122:3344:101::2
20:58:29.130Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): client request
    body = None
    method = GET
    uri = http://[fd00:1122:3344:101::2]:12225/local/switch-id
20:58:44.132Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): client response
    result = Err(reqwest::Error { kind: Request, url: "http://[fd00:1122:3344:101::2]:12225/local/switch-id", source: TimedOut })
20:58:44.132Z WARN fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): failed to identify switch slot for dendrite, will retry in 2 seconds
    file = nexus/src/app/mod.rs:1176
    reason = Communication Error: error sending request for url (http://[fd00:1122:3344:101::2]:12225/local/switch-id): operation timed out
    zone_address = fd00:1122:3344:101::2
20:58:46.133Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): client request
    body = None
    method = GET
    uri = http://[fd00:1122:3344:101::2]:12225/local/switch-id
20:59:01.135Z DEBG fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): client response
    result = Err(reqwest::Error { kind: Request, url: "http://[fd00:1122:3344:101::2]:12225/local/switch-id", source: TimedOut })
20:59:01.135Z WARN fd9231e4-0c8c-42be-b06c-9593eaa3b009 (ServerContext): failed to identify switch slot for dendrite, will retry in 2 seconds

From omicron/nexus/src/app/mod.rs we have map_switch_zone_addrs() which has this loop:

```rust
async fn map_switch_zone_addrs(
    log: &Logger,
    switch_zone_addresses: Vec<Ipv6Addr>,
) -> HashMap<SwitchLocation, Ipv6Addr> {
    use gateway_client::Client as MgsClient;
    info!(log, "Determining switch slots managed by switch zones");
    let mut switch_zone_addrs = HashMap::new();
    for addr in switch_zone_addresses {
        let mgs_client = MgsClient::new(
            &format!("http://[{}]:{}", addr, MGS_PORT),
            log.new(o!("component" => "MgsClient")),
        );

        info!(log, "determining switch slot managed by dendrite zone"; "zone_address" => #?addr);
        // TODO: #3599 Use retry function instead of looping on a fixed timer
        let switch_slot = loop {
            match mgs_client.sp_local_switch_id().await {
                Ok(switch) => {
                    info!(
                        log,
                        "identified switch slot for dendrite zone";
                        "slot" => #?switch,
                        "zone_address" => #?addr
                    );
                    break switch.slot;
                }
                Err(e) => {
                    warn!(
                        log,
                        "failed to identify switch slot for dendrite, will retry in 2 seconds";
                        "zone_address" => #?addr,
                        "reason" => #?e
                    );
                }
            }
            tokio::time::sleep(std::time::Duration::from_secs(2)).await;
        };

        // (remainder elided in the original snippet: switch_slot is
        // translated into a SwitchLocation and inserted into
        // switch_zone_addrs)
    }
    switch_zone_addrs
}
```

We enter the loop with the switch zone address of the now-missing sled 14.
Once we enter that loop, we will never come out, as sled 14 is gone.
Being stuck in this loop prevents Nexus from starting up.
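As a hedged illustration of the shape of the bug (not the real omicron API: `resolve`, `probe`, and `find_switch_slot` are hypothetical stand-ins for `lookup_all_ipv6`, the MGS `sp_local_switch_id` call, and the loop above), a retry loop that re-resolves addresses on every attempt would let the expunged sled's stale address age out:

```rust
use std::time::Duration;

/// Sketch only: retry until a switch slot is found, but re-run the DNS
/// lookup (`resolve`) on every attempt instead of capturing the address
/// list once before entering the loop, and bound the number of attempts.
fn find_switch_slot<R, P>(
    mut resolve: R,
    mut probe: P,
    max_attempts: u32,
    backoff: Duration,
) -> Option<u16>
where
    R: FnMut() -> Vec<String>,       // stand-in for lookup_all_ipv6
    P: FnMut(&str) -> Option<u16>,   // stand-in for sp_local_switch_id
{
    for _ in 0..max_attempts {
        // Key difference from the stuck loop: addresses are refreshed
        // each iteration, so a post-expungement DNS update is noticed.
        for addr in resolve() {
            if let Some(slot) = probe(&addr) {
                return Some(slot);
            }
        }
        std::thread::sleep(backoff); // 2 seconds in the real code
    }
    None // give up rather than spin forever
}
```

With a resolver stub that returns the stale sled-14 address for the first two calls and the replacement address afterwards, the loop completes on the third attempt instead of hanging.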

@leftwo
Contributor Author

leftwo commented Mar 5, 2025

Nexus log at: staff/core/omicron-7739

@davepacheco
Collaborator

I saw similar behavior in #6076 and concluded this might be a result of #5201 but I'm not sure.

@leftwo
Contributor Author

leftwo commented Mar 6, 2025

> I saw similar behavior in #6076 and concluded this might be a result of #5201 but I'm not sure.

I think what you saw in #6076 was getting stuck at the same place as well.
But in the situation here we have, I believe, two problems, and they are slightly different:

  1. The retry loop never asks DNS whether the IPs have changed; it starts with an IP or a list of IPs and never updates them.

  2. If there are two IPs in DNS but one scrimlet is down or not responding, we will never give up, take the single working address, and move forward. I'm not sure what the desired behavior is in that case, though. Maybe we want to keep trying until either DNS is updated or the scrimlet providing the service shows up.
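For the second problem, one possible policy (a sketch under my own assumptions, not omicron code: `responsive_switch_addrs` and its probe are hypothetical) is to proceed with whichever subset of the DNS-returned addresses actually responds, and only keep retrying when none do:

```rust
/// Sketch only: partition the DNS-returned switch zone addresses by
/// reachability and proceed once at least one answers, instead of
/// blocking startup until every address responds.
fn responsive_switch_addrs(
    addrs: &[String],
    reachable: impl Fn(&str) -> bool, // e.g. the /local/switch-id probe succeeded
) -> Result<Vec<String>, &'static str> {
    let up: Vec<String> = addrs
        .iter()
        .filter(|a| reachable(a.as_str()))
        .cloned()
        .collect();
    if up.is_empty() {
        // No switch zone answered: the caller should retry after a fresh
        // DNS lookup rather than proceed with nothing.
        Err("no switch zones reachable; retry after refreshing DNS")
    } else {
        Ok(up)
    }
}
```

Whether proceeding with a partial set is acceptable is exactly the open question in the comment above; this only shows the mechanics of the "take the working subset" option.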

@leftwo leftwo added this to the 14 milestone Mar 11, 2025
@leftwo leftwo self-assigned this Mar 12, 2025
@davepacheco
Collaborator

See also #5092.

@davepacheco
Collaborator

I think #7850 will fix this immediate problem, allowing Nexus to start up. There likely remains a problem in the other code paths that do the same thing and #7779 should improve those.

leftwo added a commit that referenced this issue Mar 21, 2025

Currently, callers of `map_switch_zone_addrs()` first get the IP for
`ServiceName::Dendrite` from DNS, then loop (forever) trying to
translate that IP into a `SwitchLocation`. Under normal conditions, this
is fine. However, if a sled has been expunged, or a new sled is being
added, it's possible that what is returned in:
```
let switch_zone_addresses = match resolver
            .lookup_all_ipv6(ServiceName::Dendrite)
            .await
```
will change. If that change happens after we start looping in
`map_switch_zone_addrs()`, then the loop will go on forever looking for
something that is no longer correct.

To fix this, we put the `lookup_all_ipv6` call into the loop by using the
function `switch_zone_address_mappings()` instead.
`switch_zone_address_mappings()`'s loop looks up the addresses in DNS and
then calls `map_switch_zone_addrs()`, so the DNS lookup happens on every
iteration.

Most places where we called `map_switch_zone_addrs()` were also making the
same `lookup_all_ipv6()` call, so transitioning them to
`switch_zone_address_mappings()` is a drop-in change.

A fix for #7739

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
@leftwo
Contributor Author

leftwo commented Mar 28, 2025

Fixed in #7779

@leftwo leftwo closed this as completed Mar 28, 2025