-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Cluster Manager task throttler blocks network threads #13741
Comments
Looking at the stack-trace of thread dump, the workers are waiting for lock on the throttling-key ( One possible fix to avoid lock contention, is to move out the following two code blocks outside of
In that way the critical section is limited to incrementing the request count and not any other computation. This would avoid transport-workers getting blocked, if all of them happen to enqueue the tasks that belong to same throttling-key. @Bukhtawar @shwetathareja - Thoughts ? |
Do you think we should move the task submission and throttling logic off network threads to avoid getting into retry loops and stalling transport? |
I see, the task submission involves the following two operations
Let me know if i am missing something. Both of these operations, looks to be light-weight based on the current implementation and should not incur time. On the retry of request once throttled, i think this should happen from the caller, and not add retry handlers in cluster-manager. We can have cluster-manager send additional signals about PendingTaskQueue once the request is throttled, which can be used by caller to decide on retries. |
@Bukhtawar As long as we are ensuring the work done on the transport thread is minimal, we dont need to move it to separate threadpool. @rajiv-kv lets profile submit-task method post fixing the logging overhead. |
We are probably also retrying in the network thread which I don't think is right. Let's verify the retry logic of it's on the network thread |
Memory allocations during Task Throttling.
Thanks @backslasht for sharing the dumps. |
With shallow checks, some race condition might occur when we decrement the counter and cluster manager can take the task, but shallow check outside of the lock reject the request. Considering the tradeoff against the transport thread being blocked, I guess this race condition should be okay. Anyway data node will have retry and retry can go through. |
Yes @dhwanilpatel , the throttling enforcement will be based on the size of queue observed when the request was submitted. It will not consider the in-flight tasks that are getting drained and might free-up the space. |
Proposed changes were merged in 2.17 as part of this PR - #15508. |
Describe the bug
Related component
Cluster Manager
To Reproduce
Expected behavior
Additional Details
Plugins
Please list all plugins currently enabled.
Screenshots
If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: