Commit d2d8f8a

Select a random Nexus from DNS (#1622)
When a sled drops off the network, an expunged Nexus stays in the list of DNS records until a blueprint update removes it. Crucible retries high priority messages a few times but eventually gives up. If the expunged Nexus is the first in the list of DNS records returned, retrying does not help: the DNS server does not randomize the returned records, so Crucible's notification keeps being sent to the same expunged Nexus and never arrives.

If the "started" message for a repair or reconciliation is not received by a Nexus, it will drop progress and finish messages with the same upstairs ID. This can cause stuck repairs if Nexus can't poll for a repair's progress for some reason.
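The difference between always taking the first DNS record and picking one at random can be sketched as below. This is a minimal illustration, not the code from this commit: `pick_first`, `pick_random`, and `random_index` are hypothetical names, and `random_index` uses the standard library's `RandomState` as a stand-in for the `rand` crate so the sketch builds with `std` alone.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::net::SocketAddrV6;

/// Hypothetical helper: returns a pseudo-random index in `0..len`.
/// Stand-in for the `rand` crate; callers must ensure `len > 0`.
fn random_index(len: usize) -> usize {
    // Each `RandomState` carries different hash keys, so the hash of
    // an empty input varies from call to call.
    let seed = RandomState::new().build_hasher().finish();
    (seed % len as u64) as usize
}

/// Previous behavior (sketch): always use the first record, so an
/// expunged Nexus at the head of the list is chosen every time.
fn pick_first(addrs: &[SocketAddrV6]) -> Option<SocketAddrV6> {
    addrs.first().copied()
}

/// New behavior (sketch): choose any of the returned records, so an
/// expunged Nexus is only chosen some of the time.
fn pick_random(addrs: &[SocketAddrV6]) -> Option<SocketAddrV6> {
    if addrs.is_empty() {
        return None;
    }
    Some(addrs[random_index(addrs.len())])
}
```

With three records and one expunged Nexus, `pick_first` either always works or never works depending on record order, while `pick_random` reaches a live Nexus on most calls.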
1 parent 03f940b commit d2d8f8a

1 file changed: +16 -2 lines changed

upstairs/src/notify.rs

@@ -5,6 +5,7 @@
 //! Nexus-flavored types internally.
 
 use chrono::{DateTime, Utc};
+use rand::prelude::SliceRandom;
 use slog::{debug, error, info, o, warn, Logger};
 use std::net::{Ipv6Addr, SocketAddr};
 use tokio::sync::mpsc;
@@ -425,8 +426,21 @@ pub(crate) async fn get_nexus_client(
     };
 
     let nexus_address =
-        match resolver.lookup_socket_v6(ServiceName::Nexus).await {
-            Ok(addr) => addr,
+        match resolver.lookup_all_socket_v6(ServiceName::Nexus).await {
+            Ok(addrs) => {
+                if addrs.is_empty() {
+                    error!(log, "no Nexus addresses returned!");
+                    return None;
+                }
+
+                let Some(addr) = addrs.choose(&mut rand::thread_rng()) else {
+                    error!(log, "somehow, choose failed!");
+                    return None;
+                };
+
+                *addr
+            }
+
             Err(e) => {
                 error!(log, "lookup Nexus address failed: {e}");
                 return None;
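
To see why the selection strategy matters for a bounded number of retries, here is a rough sketch. It assumes an address is chosen again on every attempt, which the diff above does not confirm. `notify_with_retries`, `pick`, and `send` are hypothetical names for illustration and are not part of Crucible.

```rust
use std::net::SocketAddrV6;

/// Hypothetical retry loop: `pick` chooses an address for each attempt
/// and `send` reports whether the notification was delivered.
/// Returns the address that accepted the message, if any.
fn notify_with_retries(
    addrs: &[SocketAddrV6],
    max_attempts: usize,
    mut pick: impl FnMut(&[SocketAddrV6]) -> Option<SocketAddrV6>,
    mut send: impl FnMut(SocketAddrV6) -> bool,
) -> Option<SocketAddrV6> {
    for _ in 0..max_attempts {
        // No records at all: give up, as the diff above does.
        let addr = pick(addrs)?;
        if send(addr) {
            return Some(addr);
        }
    }
    // Every attempt failed; the caller drops the message.
    None
}
```

If `pick` always returns the first record and that record is an expunged Nexus, every attempt fails and the message is lost. If `pick` varies its choice between attempts, a later attempt can reach a live Nexus.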
