Nexus notifications have different importance #1621

Merged 6 commits on Feb 5, 2025

Conversation

@jmpesp (Contributor) commented Feb 5, 2025:

Even if it's best effort, notifications about progress should be low priority, and not starve out notifications related to processes starting and stopping. Otherwise, we see:

    00:16:18.781Z INFO propolis-server (vm_state_driver): live-repair completed successfully
         = downstairs
        session_id = 67a91355-4dd1-4e8d-9631-15f5fed073d9
    00:16:18.781Z WARN propolis-server (vm_state_driver): could not send notify "Full(..)"; queue is full
        job = notify_queue
        session_id = 67a91355-4dd1-4e8d-9631-15f5fed073d9

Store high-priority messages and retry them up to 3 times.

Importantly, remove `retry_until_known_result`: if Nexus disappears, then the task will be stuck trying to notify it indefinitely, and _of course_ queues will fill up!
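The scheme described above (drain high-priority before low-priority, retry failed high-priority sends up to 3 times, drop low-priority sends on failure) can be sketched with a simplified synchronous stand-in. The real implementation in `upstairs/src/notify.rs` uses two tokio channels and a biased `select!`; `Msg` below is a hypothetical placeholder for `NotifyRequest`:

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for the real NotifyRequest messages.
#[derive(Debug, Clone, PartialEq)]
enum Msg {
    Progress(u64), // low priority: best effort, droppable
    Stopped,       // high priority: must not be starved
}

const MAX_RETRIES: usize = 3;

struct NotifyQueue {
    high: VecDeque<(Msg, usize)>, // (message, retry count)
    low: VecDeque<Msg>,
}

impl NotifyQueue {
    fn new() -> Self {
        Self {
            high: VecDeque::new(),
            low: VecDeque::new(),
        }
    }

    // Always drain the high-priority queue first, so progress
    // notifications can never starve start/stop notifications.
    fn next(&mut self) -> Option<(Msg, usize)> {
        self.high
            .pop_front()
            .or_else(|| self.low.pop_front().map(|m| (m, 0)))
    }

    // On a failed send, requeue a high-priority message up to
    // MAX_RETRIES times; low-priority messages are simply dropped.
    fn on_send_failure(&mut self, msg: Msg, retries: usize, high_prio: bool) {
        if high_prio && retries < MAX_RETRIES {
            self.high.push_front((msg, retries + 1));
        }
    }
}
```

In the actual PR the prioritization comes from `tokio::select!` with `biased;` polling the high-priority receiver first, rather than from an explicit two-queue structure; this sketch only illustrates the ordering and retry policy.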

@jmpesp requested a review from mkeeter on February 5, 2025 16:56.
Review thread on this hunk in `upstairs/src/notify.rs`:

            log: log.new(o!("job" => "notify_queue")),
        }
    }

    struct Notification {
        maybe_message: Option<(DateTime<Utc>, NotifyRequest)>,

Reviewer (Contributor) commented:


Nit: it was confusing to me that a `Notification` with an empty message was the "queue is closed" case, rather than using an `Option<Notification>` (with a non-optional message).

Here's a diff:

diff --git a/upstairs/src/notify.rs b/upstairs/src/notify.rs
index ade380a04c..c8600360ef 100644
--- a/upstairs/src/notify.rs
+++ b/upstairs/src/notify.rs
@@ -125,7 +125,7 @@
 }
 
 struct Notification {
-    maybe_message: Option<(DateTime<Utc>, NotifyRequest)>,
+    message: (DateTime<Utc>, NotifyRequest),
     qos: NotifyQos,
     retries: usize,
 }
@@ -148,33 +148,34 @@
         .unwrap();
 
     loop {
-        let Notification {
-            maybe_message,
-            qos,
-            retries,
-        } = {
-            if let Some(notification) = stored_notification.take() {
-                notification
+        let r = {
+            if stored_notification.is_some() {
+                stored_notification.take()
             } else {
                 tokio::select! {
                     biased;
 
-                    i = rx_high.recv() => Notification {
-                        maybe_message: i,
+                    i = rx_high.recv() => i.map(|message| Notification {
+                        message,
                         qos: NotifyQos::High,
                         retries: 0,
-                    },
+                    }),
 
-                    i = rx_low.recv() => Notification {
-                        maybe_message: i,
+                    i = rx_low.recv() => i.map(|message| Notification {
+                        message,
                         qos: NotifyQos::Low,
                         retries: 0,
-                    },
+                    }),
                 }
             }
         };
 
-        let Some((time, m)) = maybe_message else {
+        let Some(Notification {
+            message: (time, m),
+            qos,
+            retries,
+        }) = r
+        else {
             error!(log, "one of the notify channels was closed!");
             break;
         };
@@ -395,7 +396,7 @@
                         warn!(log, "retries > 3, dropping {m:?}");
                     } else {
                         stored_notification = Some(Notification {
-                            maybe_message: Some((time, m)),
+                            message: (time, m),
                             qos,
                             retries: retries + 1,
                         });

@jmpesp (Author) replied:


I like this too, done in 912e376

Review thread on this hunk in `upstairs/src/notify.rs`:

            qos,
            retries,
        } = {
            if let Some(notification) = stored_notification.take() {
Reviewer (Contributor) commented:


I think you can combine this into the biased `select!` with:

            tokio::select! {
                biased;

                Some(n) = async { stored_notification.take() } => Some(n),

Reviewer (Contributor) commented:


Also, do we want to add a delay here to improve the odds of reconnecting? I don't feel strongly about it, since the delay would also make it more likely for future messages to be dropped, so pick your poison...

@jmpesp (Author) replied:


yeah, looks like it :) 446bd1e

@jmpesp (Author) replied:


> Also, do we want to add a delay here to improve the odds of reconnecting? I don't feel strongly about it, since the delay would also make it more likely for future messages to be dropped, so pick your poison...

I'm thinking no here: the retry should pick another Nexus, so it shouldn't need a delay.

@jmpesp jmpesp merged commit 03f940b into oxidecomputer:main Feb 5, 2025
17 checks passed
@jmpesp jmpesp deleted the qos_notifications branch February 5, 2025 20:17