-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add worker id to on_thread_park()
callback (for stuck worker watchdog)
#6354
Add worker id to on_thread_park()
callback (for stuck worker watchdog)
#6354
Conversation
It's possible to find out from WorkerMetrics which workers are not actively polling for tasks. However it's not possible to tell if non-polling workers are merely idle (parked) or if they are stuck. By adding the worker id (same usize as used in WorkerMetrics calls) to the on_thread_park() and on_thread_unpark() callbacks it is possible to track which specific workers are parked. Then any worker that is not polling tasks and is not parked is a worker that is stuck. With this information it's possible to create a watchdog that alerts or kills the process if a worker is stuck for too long.
@@ -684,7 +688,7 @@ impl Context { | |||
/// after all the IOs get dispatched | |||
fn park(&self, mut core: Box<Core>) -> Box<Core> { | |||
if let Some(f) = &self.worker.handle.shared.config.before_park { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems better to me if this check were inside the if core.transition_to_parked()
check below, to avoid generating spurious callbacks. However it looks like it might be this way for tracing? Since I don't understand the details I just left it as is.
pub fn on_thread_park<F>(&mut self, f: F) -> &mut Self | ||
where | ||
F: Fn() + Send + Sync + 'static, | ||
F: Fn(usize) + Send + Sync + 'static, | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change. We can't do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Darksonn It's a change to an API that is available only when --cfg tokio_unstable
is used. I assumed those APIs were subject to modification. Am I incorrect?
If that's not the case I can certainly add a on_thread_park2()
method and deprecate the old one (since the new method would have all of the functionality of the old one). Or how would you suggest evolving the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that we can change --cfg tokio_unstable
methods, but the on_thread_park
method is stable and accessible without --cfg tokio_unstable
.
It's possible to find out from WorkerMetrics which workers are not actively polling for tasks. However it's not possible to tell if non-polling workers are merely idle (parked) or if they are stuck. By adding the worker id (same usize as used in WorkerMetrics calls) to the on_thread_park() and on_thread_unpark() callbacks it is possible to track which specific workers are parked. Then any worker that is not polling tasks and is not parked is a worker that is stuck.
With this information it's possible to create a watchdog that alerts or kills the process if a worker is stuck for too long.
Fixes #6353
Motivation
If a task gets stuck and doesn't yield back to the tokio runtime, other tasks can be starved (even if other workers are idle). I would like a watchdog that can detect when a task is stuck, and either alert or kill the process.
Using
WorkerMetrics
andworker_poll_count()
it's possible to determine if a worker is not polling for new tasks. However it's not possible to tell if that worker is merely parked, or stuck.Solution
Include a worker id in the
on_thread_park()
andon_thread_unpark()
callbacks so that the idle / busy status of each worker can be inferred. Then it's possible for a watchdog thread to monitorworker_poll_count()
and the parked/unparked status of each worker, and do something about it.