-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nexus notifications have different importance #1621
Conversation
Even if it's best effort, notifications about progress should be low priority, and not starve out notifications related to processes starting and stopping. Otherwise, we see: 00:16:18.781Z INFO propolis-server (vm_state_driver): live-repair completed successfully = downstairs session_id = 67a91355-4dd1-4e8d-9631-15f5fed073d9 00:16:18.781Z WARN propolis-server (vm_state_driver): could not send notify "Full(..)"; queue is full job = notify_queue session_id = 67a91355-4dd1-4e8d-9631-15f5fed073d9
Importantly, remove `retry_until_known_result`: if Nexus disappears, then the task will be stuck trying to notify it indefinitely, and _of course_ queues will fill up!
upstairs/src/notify.rs
Outdated
log: log.new(o!("job" => "notify_queue")), | ||
} | ||
} | ||
|
||
struct Notification { | ||
maybe_message: Option<(DateTime<Utc>, NotifyRequest)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it was confusing to me that a Notification
with an empty message was the "queue is closed" case, rather than using an Option<Notification>
(with a non-optional message).
Here's a diff:
diff --git a/upstairs/src/notify.rs b/upstairs/src/notify.rs
index ade380a04c..c8600360ef 100644
--- a/upstairs/src/notify.rs
+++ b/upstairs/src/notify.rs
@@ -125,7 +125,7 @@
}
struct Notification {
- maybe_message: Option<(DateTime<Utc>, NotifyRequest)>,
+ message: (DateTime<Utc>, NotifyRequest),
qos: NotifyQos,
retries: usize,
}
@@ -148,33 +148,34 @@
.unwrap();
loop {
- let Notification {
- maybe_message,
- qos,
- retries,
- } = {
- if let Some(notification) = stored_notification.take() {
- notification
+ let r = {
+ if stored_notification.is_some() {
+ stored_notification.take()
} else {
tokio::select! {
biased;
- i = rx_high.recv() => Notification {
- maybe_message: i,
+ i = rx_high.recv() => i.map(|message| Notification {
+ message,
qos: NotifyQos::High,
retries: 0,
- },
+ }),
- i = rx_low.recv() => Notification {
- maybe_message: i,
+ i = rx_low.recv() => i.map(|message| Notification {
+ message,
qos: NotifyQos::Low,
retries: 0,
- },
+ }),
}
}
};
- let Some((time, m)) = maybe_message else {
+ let Some(Notification {
+ message: (time, m),
+ qos,
+ retries,
+ }) = r
+ else {
error!(log, "one of the notify channels was closed!");
break;
};
@@ -395,7 +396,7 @@
warn!(log, "retries > 3, dropping {m:?}");
} else {
stored_notification = Some(Notification {
- maybe_message: Some((time, m)),
+ message: (time, m),
qos,
retries: retries + 1,
});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this too, done in 912e376
upstairs/src/notify.rs
Outdated
qos, | ||
retries, | ||
} = { | ||
if let Some(notification) = stored_notification.take() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can combine this into the biased select with
tokio::select! {
biased;
Some(n) = async { stored_notification.take() } => Some(n),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we want to add a delay here to improve the odds of reconnecting? I don't feel strongly about it, since the delay would also make it more likely for future messages to be dropped, so pick your poison...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, looks like it :) 446bd1e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we want to add a delay here to improve the odds of reconnecting? I don't feel strongly about it, since the delay would also make it more likely for future messages to be dropped, so pick your poison...
I'm thinking no here - the retry should pick another Nexus, which shouldn't need to be delayed.
Even if it's best effort, notifications about progress should be low priority, and not starve out notifications related to processes starting and stopping. Otherwise, we see:
Store high priority messages and retry them 3 times
Importantly, remove
retry_until_known_result
: if Nexus disappears, then the task will be stuck trying to notify it indefinitely, and of course queues will fill up!