Add per-client backpressure #1181
Conversation
My only real question is whether there's a way to know, from the outside, what the delay settings are currently set to?
I could see us doing even more with that info in the future, but I don't think we need to hold up this PR to figure out the best way to archive it. Still, it would be cool to know what the settings are at some sampling frequency or point in time.
```rust
self.clients[i].set_delay_us(delay_time_us);
```
Is there any way we can see this from the outside? It would be interesting to be able to track this and watch how it changes, but I'm not sure how to bubble it out to where we could grab it with the rest of the collect_stats() data.
I added this to the existing up-status DTrace probe in 597b950.
(The string is now 768 bytes long with all zeros, so still well below the strsize=1k limit.)
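For a sense of shape only, here's a hedged sketch of how the delays could ride along in that probe's payload; the struct and field names here are assumptions for illustration, not necessarily the PR's exact code:

```rust
use serde::Serialize;

// Assumed shape: if the up-status probe argument is a serde-serialized
// struct, the per-client delays can ride along as one more field.
#[derive(Serialize)]
struct UpStatus {
    // ... existing status fields ...
    /// Current artificial delay per client, in microseconds.
    ds_delay_us: [u64; 3],
}

fn probe_payload(status: &UpStatus) -> String {
    // With all-zero delays the serialized string comes out around 768
    // bytes, still well under DTrace's default strsize=1k.
    serde_json::to_string(status).unwrap_or_default()
}
```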
(Force-pushed from 9abd890 to 4b48762.)
Some questions:
- Should we change IO_OUTSTANDING_MAX?
- What if jobs arrive out of order? Technically the downstairs is free to do non-dependent jobs in any order, and reply_time isn't the same as some notion of how long a job took to do by a particular downstairs.
Should we change IO_OUTSTANDING_MAX?
I'm not sure, but I think it's orthogonal to this change – from the perspective of backpressure, it doesn't matter if jobs are waiting in the upstairs queues or in the downstairs.
What if jobs arrive out of order?
Good question, I hadn't thought about this. To clarify one point, we don't actually care about "how long a job took to do by a particular downstairs"; we care about the end-to-end time, which also includes queues and sockets on both sides (adding delays on top of raw execution time).
Still, it's a valid concern. Let's say we have three independent read operations; if Downstairs 1 does […]
I'm not sure how to fix this – if end-to-end time varies based on things other than true execution time + queue delays, then it becomes tricky to control (e.g. what if every downstairs randomly delayed jobs by 1-10 ms?). My inclination is to punt on it until we actually start reordering jobs in the Downstairs, and then do so with caution.
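To make the "end-to-end" framing concrete, here's a minimal illustrative sketch (the names are stand-ins, not this PR's code) of where the clock starts and stops:

```rust
use std::time::{Duration, Instant};

/// Stand-in for the timing a job carries with it.
struct JobTimer {
    sent_at: Instant,
}

impl JobTimer {
    fn start() -> Self {
        Self { sent_at: Instant::now() }
    }

    /// End-to-end time for one client's reply: this clock includes time
    /// spent in queues and sockets on both sides, not just raw execution
    /// time in the Downstairs. If a Downstairs runs independent jobs out
    /// of order, this number can misrepresent how fast that Downstairs
    /// really is, which is the caveat discussed above.
    fn on_reply(&self) -> Duration {
        self.sent_at.elapsed()
    }
}
```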
```rust
self.clients[i].set_delay_us(delay_time_us);
```
If we're not setting the per-client time, should we clear it? What if one Downstairs is Faulted, or in replay – the previously set delays will still apply.
If we're not setting the per-client time, should we clear it?
I don't think so. If we did that, then the first two replies to a job would clear the delay, then the third reply would set it (before it's cleared again by the first reply to the next job).
What if one Downstairs is Faulted, or in replay - the previously set delays will still apply.
I added logic to clear the per-client delay in Client::reinitialize (1190fa3).
Maybe we should clear it if any of the clients restart? Otherwise, I could imagine a situation where
- DS1 is temporarily slower than DS2, so we delay DS2
- Then DS3 returns an error, so we restart it
- We're still delaying DS2, and that won't change until DS3 comes back online
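A minimal sketch of that clear-on-any-restart idea (the hook name is hypothetical; set_delay_us is the setter from this PR):

```rust
// Stand-in for this PR's client type; only the setter matters here.
struct Client {
    delay_us: u64,
}
impl Client {
    fn set_delay_us(&mut self, us: u64) {
        self.delay_us = us;
    }
}

/// Hypothetical hook, called whenever any one client restarts: clear
/// *every* per-client delay, not just the restarted client's, so a
/// stale delay can't keep slowing a healthy Downstairs.
fn on_client_restart(clients: &mut [Client]) {
    for client in clients.iter_mut() {
        client.set_delay_us(0);
    }
}
```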
I don't think so. If we did that, then the first two replies to a job would clear the delay, then the third reply would set it (before it's cleared again by the first reply to the next job).
👍
Maybe we should clear it if any of the clients restart?
That sounds like a good idea, yeah.
Okay, done in 80b6f73
I'm curious if doing per-client backpressure based on the queue lengths makes sense? It doesn't take into account the actual time difference like reply_time does.
For what it's worth, starting a Downstairs with […]
I considered this, but initially decided against it because queue length is a pretty crude measurement of expected delay: a queue with 500x 4K jobs is very different from a queue with 500x 1 MiB jobs. If we use the existing backpressure system as an analogy, we could instead delay individual clients based on the maximum of relative jobs-in-flight and bytes-in-flight (versus the slowest Downstairs). Do you want me to throw that together?
Kicking out a very slow Downstairs should take the same amount of time as before; it will just delay the other two Downstairs before it gets kicked out. (In the current code, the max delay is 10 ms/job.)
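Something like this, roughly (a sketch under assumed names and shapes, not the eventual implementation):

```rust
/// Rough sketch of the queue-based alternative: delay client `i` by how
/// far it has pulled ahead of the slowest client, using whichever of the
/// jobs- or bytes-in-flight ratios is worse, the same shape as the guest
/// backpressure implementation.
fn client_delay_us(jobs_in_flight: [u64; 3], bytes_in_flight: [u64; 3], i: usize) -> u64 {
    // 10 ms/job cap, matching the max delay in the current PR.
    const MAX_DELAY_US: f64 = 10_000.0;

    let max_jobs = *jobs_in_flight.iter().max().unwrap() as f64;
    let max_bytes = *bytes_in_flight.iter().max().unwrap() as f64;

    // 0.0 for the slowest client (longest queue), approaching 1.0 for a
    // client whose queue is much shorter than the slowest one's.
    let job_lead = 1.0 - jobs_in_flight[i] as f64 / max_jobs.max(1.0);
    let byte_lead = 1.0 - bytes_in_flight[i] as f64 / max_bytes.max(1.0);

    // Whichever measure is worse drives the delay.
    (job_lead.max(byte_lead) * MAX_DELAY_US) as u64
}
```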
Summarizing a long DM chain, I do think this is worth testing! It may solve a bunch of questions I/we had.
This PR adds per-client delays to prevent queues from becoming arbitrarily long during read-heavy workflows. These delays are based on either per-client queue lengths or bytes in flight (whichever is worse compared to the slowest client), which is analogous to the guest backpressure implementation. Compared to main, this keeps the queues relatively bounded, and is at least similar in performance (hard to tell, because there is significant run-to-run variability). Fixes #1167, and is an alternative to #1181.
Done in #1186 instead
Fixes #1167
This PR adds per-client delays to prevent queues from becoming arbitrarily long during read-heavy workflows.
In our current code, because we only need one reply to ack a read, the fastest Downstairs will consistently beat the others. This means that the queues for the two slower Downstairs will keep getting longer and longer (so the fastest Downstairs will keep winning), and those queues could eventually hit the IO_OUTSTANDING_MAX limit (which means the Upstairs would mark the slower Downstairs as faulted).
This PR adds monitoring of the difference between Downstairs reply times with a new reply_times: ClientData<Instant> in struct DownstairsIO. If the difference between fastest and slowest is significant (> 10 ms), then we add an artificial delay to the fastest downstairs.
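In rough outline, the mechanism looks something like this. It's a simplified sketch with stand-in types, not the PR's exact code; among other simplifications, it waits for all three replies before comparing, where the real code tracks times as replies arrive:

```rust
use std::time::{Duration, Instant};

/// Only gaps larger than this trigger an artificial delay.
const DELAY_THRESHOLD: Duration = Duration::from_millis(10);

// Stand-in for the per-client state; only the setter matters here.
struct Client {
    delay_us: u64,
}
impl Client {
    fn set_delay_us(&mut self, us: u64) {
        self.delay_us = us;
    }
}

struct DownstairsIo {
    /// When each of the three clients replied (None until it has).
    reply_times: [Option<Instant>; 3],
}

/// Once every client has replied to a job, compare the fastest and
/// slowest reply times; if the gap is significant, slow down the
/// fastest client so the others can catch up.
fn update_delays(io: &DownstairsIo, clients: &mut [Client; 3]) {
    let times = match io.reply_times.iter().copied().collect::<Option<Vec<Instant>>>() {
        Some(t) => t,
        None => return, // not all clients have replied yet
    };
    let fastest = *times.iter().min().unwrap();
    let slowest = *times.iter().max().unwrap();
    let gap = slowest - fastest;
    if gap > DELAY_THRESHOLD {
        // Delay the fastest client by the measured gap (the real code
        // caps this at 10 ms per job).
        let i = times.iter().position(|&t| t == fastest).unwrap();
        clients[i].set_delay_us(gap.as_micros() as u64);
    }
}
```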
Compared to main, this slows down reads, but keeps the queues relatively bounded. Here's performance before:
[graph: pending jobs creeping up in the 1M and 4M read sections]
...and after:
[graph: same workload with per-client delays enabled]
I don't think we can avoid slowing down reads in the steady-state: the whole point is to keep the queues bounded, which necessarily means matching the speed of the slowest Downstairs. However, burst performance should be unaffected, because the delays only kick in at > 10 ms of discrepancy (which takes some amount of time to wind up).
Note that we don't delay writes! This took a little convincing, but I believe it's correct: during write-heavy workloads, we're already limited by global backpressure (both jobs and bytes-in-flight), so it's already impossible for a queue to grow without bounds.