Nexus notifications have different importance #1621

Merged 6 commits on Feb 5, 2025

Conversation

@jmpesp (Contributor) commented Feb 5, 2025:

Even if it's best effort, notifications about progress should be low priority, and not starve out notifications related to processes starting and stopping. Otherwise, we see:

    00:16:18.781Z INFO propolis-server (vm_state_driver): live-repair completed successfully
         = downstairs
        session_id = 67a91355-4dd1-4e8d-9631-15f5fed073d9
    00:16:18.781Z WARN propolis-server (vm_state_driver): could not send notify "Full(..)"; queue is full
        job = notify_queue
        session_id = 67a91355-4dd1-4e8d-9631-15f5fed073d9

Store high-priority messages and retry them up to 3 times.

Importantly, remove `retry_until_known_result`: if Nexus disappears, then the task will be stuck trying to notify it indefinitely, and _of course_ queues will fill up!
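The scheme described above (drain high-priority before low-priority, retry failed high-priority sends up to 3 times, drop low-priority sends on failure) can be sketched with a simplified synchronous stand-in. The real implementation in `upstairs/src/notify.rs` uses two tokio channels and a biased `select!`; `Msg` below is a hypothetical placeholder for `NotifyRequest`:

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for the real NotifyRequest messages.
#[derive(Debug, Clone, PartialEq)]
enum Msg {
    Progress(u64), // low priority: best effort, droppable
    Stopped,       // high priority: must not be starved
}

const MAX_RETRIES: usize = 3;

struct NotifyQueue {
    high: VecDeque<(Msg, usize)>, // (message, retry count)
    low: VecDeque<Msg>,
}

impl NotifyQueue {
    fn new() -> Self {
        Self {
            high: VecDeque::new(),
            low: VecDeque::new(),
        }
    }

    // Always drain the high-priority queue first, so progress
    // notifications can never starve start/stop notifications.
    fn next(&mut self) -> Option<(Msg, usize)> {
        self.high
            .pop_front()
            .or_else(|| self.low.pop_front().map(|m| (m, 0)))
    }

    // On a failed send, requeue a high-priority message up to
    // MAX_RETRIES times; low-priority messages are simply dropped.
    fn on_send_failure(&mut self, msg: Msg, retries: usize, high_prio: bool) {
        if high_prio && retries < MAX_RETRIES {
            self.high.push_front((msg, retries + 1));
        }
    }
}
```

In the actual PR the prioritization comes from `tokio::select!` with `biased;` polling the high-priority receiver first, rather than from an explicit two-queue structure; this sketch only illustrates the ordering and retry policy.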

@jmpesp requested a review from mkeeter on February 5, 2025 16:56.
Review thread on this hunk in `upstairs/src/notify.rs`:

            log: log.new(o!("job" => "notify_queue")),
        }
    }

    struct Notification {
        maybe_message: Option<(DateTime<Utc>, NotifyRequest)>,

Reviewer (Contributor) commented:


Nit: it was confusing to me that a `Notification` with an empty message was the "queue is closed" case, rather than using an `Option<Notification>` (with a non-optional message).

Here's a diff:

diff --git a/upstairs/src/notify.rs b/upstairs/src/notify.rs
index ade380a04c..c8600360ef 100644
--- a/upstairs/src/notify.rs
+++ b/upstairs/src/notify.rs
@@ -125,7 +125,7 @@
 }
 
 struct Notification {
-    maybe_message: Option<(DateTime<Utc>, NotifyRequest)>,
+    message: (DateTime<Utc>, NotifyRequest),
     qos: NotifyQos,
     retries: usize,
 }
@@ -148,33 +148,34 @@
         .unwrap();
 
     loop {
-        let Notification {
-            maybe_message,
-            qos,
-            retries,
-        } = {
-            if let Some(notification) = stored_notification.take() {
-                notification
+        let r = {
+            if stored_notification.is_some() {
+                stored_notification.take()
             } else {
                 tokio::select! {
                     biased;
 
-                    i = rx_high.recv() => Notification {
-                        maybe_message: i,
+                    i = rx_high.recv() => i.map(|message| Notification {
+                        message,
                         qos: NotifyQos::High,
                         retries: 0,
-                    },
+                    }),
 
-                    i = rx_low.recv() => Notification {
-                        maybe_message: i,
+                    i = rx_low.recv() => i.map(|message| Notification {
+                        message,
                         qos: NotifyQos::Low,
                         retries: 0,
-                    },
+                    }),
                 }
             }
         };
 
-        let Some((time, m)) = maybe_message else {
+        let Some(Notification {
+            message: (time, m),
+            qos,
+            retries,
+        }) = r
+        else {
             error!(log, "one of the notify channels was closed!");
             break;
         };
@@ -395,7 +396,7 @@
                         warn!(log, "retries > 3, dropping {m:?}");
                     } else {
                         stored_notification = Some(Notification {
-                            maybe_message: Some((time, m)),
+                            message: (time, m),
                             qos,
                             retries: retries + 1,
                         });

@jmpesp (Author) replied:


I like this too, done in 912e376

Review thread on this hunk in `upstairs/src/notify.rs`:

            qos,
            retries,
        } = {
            if let Some(notification) = stored_notification.take() {
Reviewer (Contributor) commented:


I think you can combine this into the biased `select!` with:

            tokio::select! {
                biased;

                Some(n) = async { stored_notification.take() } => Some(n),

Reviewer (Contributor) commented:


Also, do we want to add a delay here to improve the odds of reconnecting? I don't feel strongly about it, since the delay would also make it more likely for future messages to be dropped, so pick your poison...

@jmpesp (Author) replied:


yeah, looks like it :) 446bd1e

@jmpesp (Author) replied:


> Also, do we want to add a delay here to improve the odds of reconnecting? I don't feel strongly about it, since the delay would also make it more likely for future messages to be dropped, so pick your poison...

I'm thinking no here: the retry should pick another Nexus, so it shouldn't need a delay.

@jmpesp jmpesp merged commit 03f940b into oxidecomputer:main Feb 5, 2025
17 checks passed
@jmpesp jmpesp deleted the qos_notifications branch February 5, 2025 20:17