Add per-client backpressure #1181

Closed
mkeeter wants to merge 4 commits from the per-client-backpressure branch

Conversation

@mkeeter (Contributor) commented on Feb 26, 2024

Fixes #1167

This PR adds per-client delays to prevent queues from becoming arbitrarily long during read-heavy workflows.

In our current code, because we only need one reply to ack a read, the fastest Downstairs will consistently beat the others. This means that the queues for the slower two Downstairs will keep getting longer and longer (so the fastest Downstairs will keep winning), and those queues could eventually hit the IO_OUTSTANDING_MAX limit (which means the Upstairs would mark the slower Downstairs as faulted).

This PR monitors the difference between Downstairs reply times using a new field, reply_times: ClientData<Instant>, in struct DownstairsIO. If the difference between the fastest and slowest reply is significant (> 10 ms), we add an artificial delay to the fastest Downstairs.
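
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea in Rust. Only reply_times, ClientData, set_delay_us, and the 10 ms figures come from this PR; the function name, the ClientData stand-in, and the exact scaling are illustrative assumptions, not the actual diff.

    use std::time::{Duration, Instant};

    // Stand-in for Crucible's ClientData<T>: one slot per Downstairs.
    type ClientData<T> = [T; 3];

    // Delay a client only when it is more than this far ahead of the slowest
    // reply, and never add more than this much delay per job (10 ms, per the
    // discussion in this PR).
    const DELAY_THRESHOLD: Duration = Duration::from_millis(10);
    const MAX_DELAY_PER_JOB: Duration = Duration::from_millis(10);

    // Given the reply times recorded for one job (None until that client has
    // replied), compute the artificial delay (in microseconds) to apply to
    // each client. The scaling is illustrative, not the PR's exact formula.
    fn per_client_delays(reply_times: &ClientData<Option<Instant>>) -> ClientData<u64> {
        let mut delays = [0u64; 3];

        // Only update once every Downstairs has replied to this job.
        if reply_times.iter().any(|t| t.is_none()) {
            return delays;
        }
        let slowest = reply_times.iter().map(|t| t.unwrap()).max().unwrap();

        for (d, t) in delays.iter_mut().zip(reply_times.iter()) {
            let lead = slowest - t.unwrap(); // how far ahead of the slowest reply
            if lead > DELAY_THRESHOLD {
                *d = lead.min(MAX_DELAY_PER_JOB).as_micros() as u64;
            }
        }
        delays
    }

In the PR itself, the computed value is handed to each client via self.clients[i].set_delay_us(...), as seen in the review snippets below.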

Compared to main, this slows down reads, but keeps the queues relatively bounded. Here's performance before:

1M WRITE: bw=692MiB/s (726MB/s), 692MiB/s-692MiB/s (726MB/s-726MB/s), io=40.7GiB (43.7GB), run=60146-60146msec
4K WRITE: bw=28.4MiB/s (29.8MB/s), 28.4MiB/s-28.4MiB/s (29.8MB/s-29.8MB/s), io=1707MiB (1789MB), run=60005-60005msec
1M WRITE: bw=713MiB/s (748MB/s), 713MiB/s-713MiB/s (748MB/s-748MB/s), io=41.8GiB (44.9GB), run=60051-60051msec
4M WRITE: bw=713MiB/s (748MB/s), 713MiB/s-713MiB/s (748MB/s-748MB/s), io=41.9GiB (45.0GB), run=60181-60181msec
4K READ: bw=29.0MiB/s (30.4MB/s), 29.0MiB/s-29.0MiB/s (30.4MB/s-30.4MB/s), io=1741MiB (1826MB), run=60004-60004msec
1M READ: bw=603MiB/s (632MB/s), 603MiB/s-603MiB/s (632MB/s-632MB/s), io=35.3GiB (37.9GB), run=60038-60038msec
4M READ: bw=738MiB/s (774MB/s), 738MiB/s-738MiB/s (774MB/s-774MB/s), io=43.3GiB (46.5GB), run=60135-60135msec

[Graph: upstairs_info_prev]

(notice pending jobs creeping up in the 1M and 4M read sections of the graph)

...and after:

1M WRITE: bw=687MiB/s (721MB/s), 687MiB/s-687MiB/s (721MB/s-721MB/s), io=40.4GiB (43.4GB), run=60198-60198msec
4K WRITE: bw=29.4MiB/s (30.8MB/s), 29.4MiB/s-29.4MiB/s (30.8MB/s-30.8MB/s), io=1765MiB (1850MB), run=60008-60008msec
1M WRITE: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=41.9GiB (44.9GB), run=60029-60029msec
4M WRITE: bw=697MiB/s (731MB/s), 697MiB/s-697MiB/s (731MB/s-731MB/s), io=40.9GiB (44.0GB), run=60134-60134msec
4K READ: bw=28.8MiB/s (30.2MB/s), 28.8MiB/s-28.8MiB/s (30.2MB/s-30.2MB/s), io=1731MiB (1815MB), run=60003-60003msec
1M READ: bw=562MiB/s (589MB/s), 562MiB/s-562MiB/s (589MB/s-589MB/s), io=32.9GiB (35.4GB), run=60038-60038msec
4M READ: bw=561MiB/s (588MB/s), 561MiB/s-561MiB/s (588MB/s-588MB/s), io=32.9GiB (35.4GB), run=60162-60162msec

[Graph: upstairs_info_pr]

I don't think we can avoid slowing down reads in the steady-state: the whole point is to keep the queues bounded, which necessarily means matching the speed of the slowest Downstairs. However, burst performance should be unaffected, because the delays only kick in at > 10 ms of discrepancy (which takes some amount of time to wind up).

Note that we don't delay writes! This took a little convincing, but I believe it's correct: during write-heavy workloads, we're already limited by global backpressure (both jobs and bytes-in-flight), so it's already impossible for a queue to grow without bounds.
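
For context, "global backpressure" refers to the existing guest-facing mechanism that slows incoming IO as the total jobs and bytes in flight grow. A much-simplified, purely illustrative sketch of that idea follows; the thresholds, constants, and linear scaling here are made up for illustration, not Crucible's actual values:

    use std::time::Duration;

    // Illustrative only: once total jobs or bytes in flight cross a threshold,
    // every new guest IO is delayed, so write queues cannot grow without bound.
    fn global_backpressure(jobs_in_flight: u64, bytes_in_flight: u64) -> Duration {
        const JOB_START: u64 = 1_000;       // assumed threshold, in jobs
        const BYTE_START: u64 = 50 << 20;   // assumed threshold, in bytes
        const DELAY_PER_UNIT_US: u64 = 10;  // assumed scaling factor

        let job_excess = jobs_in_flight.saturating_sub(JOB_START);
        let byte_excess = bytes_in_flight.saturating_sub(BYTE_START) / (1 << 20);

        // Whichever dimension is further over its threshold drives the delay.
        Duration::from_micros(job_excess.max(byte_excess) * DELAY_PER_UNIT_US)
    }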

@mkeeter added this to the 7 milestone on Feb 26, 2024
@mkeeter self-assigned this on Feb 26, 2024
@leftwo (Contributor) left a comment:

My only real question is whether there is a way to know, from the outside, what the
delay settings are currently set to.

I could see us doing even more with that info in the future, but I don't think
we need to hold up this PR while we figure out the best way to archive it. It
would be cool to know what the settings are at some sample frequency or point
in time.

};

self.clients[i].set_delay_us(delay_time_us);
}
Contributor:

Is there any way we can see this from the outside? It would be interesting to be able
to track this and watch how it changes, but I'm not sure how to bubble it out to
where we could grab it with the rest of the collect_stats().

Contributor Author:

I added this to the existing up-status DTrace probe in 597b950

(The string is now 768 bytes long with all zeros, so still well below the strsize=1k limit)

@mkeeter force-pushed the per-client-backpressure branch from 9abd890 to 4b48762 on February 27, 2024 20:18
@jmpesp (Contributor) left a comment:

Some questions:

  • should we change IO_OUTSTANDING_MAX?
  • what if jobs arrive out of order? technically the downstairs is free to do non-dependent jobs in any order, and reply_time isn't the same as some notion of how long a job took to do by a particular downstairs.

@mkeeter (Contributor, Author) commented on Feb 28, 2024

  • should we change IO_OUTSTANDING_MAX?

I'm not sure, but I think it's orthogonal to this change – from the perspective of backpressure, it doesn't matter whether jobs are waiting in the Upstairs queues or in the Downstairs.

  • what if jobs arrive out of order? technically the downstairs is free to do non-dependent jobs in any order, and reply_time isn't the same as some notion of how long a job took to do by a particular downstairs.

Good question, I hadn't thought about this. To clarify one point, we don't actually care about "how long a job took to do by a particular downstairs"; we care about the end-to-end time, which also includes queues and sockets on both sides (adding delays on top of raw execution time).

Still, it's a valid concern. Let's say we have three independent read operations, A, B, and C, which all take the same amount of time.

If Downstairs 1 does ABC, Downstairs 2 does BCA, and Downstairs 3 does CAB, then we wouldn't update backpressure until the third reply; at that point, job C thinks that DS1 is slowest, job A thinks DS2 is slowest, and job B thinks DS3 is slowest. Depending on which reply comes in last, we could end up delaying any of the Downstairs, even though they're performing the same.

I'm not sure how to fix this – if end-to-end time varies based on things other than true execution time + queue delays, then it becomes tricky to control (e.g. what if every downstairs randomly delayed jobs by 1-10 ms?).

My inclination is to punt on it until we actually start reordering jobs in the Downstairs, and then do so with caution.


self.clients[i].set_delay_us(delay_time_us);
}
}
Contributor:

If we're not setting the per-client time, should we clear it? What if one Downstairs is Faulted, or in replay? The previously set delays will still apply.

Contributor Author:

If we're not setting the per-client time, should we clear it?

I don't think so. If we did that, then the first two replies to a job would clear the delay, then the third reply would set it (before it's cleared again by the first reply to the next job).

What if one Downstairs is Faulted, or in replay - the previously set delays will still apply.

I added logic to clear the per-client delay in Client::reinitialize (1190fa3).

Maybe we should clear it if any of the clients restart? Otherwise, I could imagine a situation where

  • DS1 is temporarily slower than DS2, so we delay DS2
  • Then DS3 returns an error, so we restart it
  • We're still delaying DS2, and that won't change until DS3 comes back online
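
A rough sketch of that "clear on any restart" idea (the types and function name here are stand-ins, not the code in 1190fa3 or 80b6f73; only set_delay_us appears in the actual diff):

    // Stand-in for the Upstairs client state; only the delay field matters here.
    struct Client {
        delay_us: u64,
    }

    impl Client {
        fn set_delay_us(&mut self, delay_us: u64) {
            self.delay_us = delay_us;
        }
    }

    // Illustrative: when any one client reinitializes (fault, replay, etc.),
    // clear the artificial delay on every client, since the reply-time
    // comparisons that justified the old delays no longer hold.
    fn on_any_client_restart(clients: &mut [Client; 3]) {
        for c in clients.iter_mut() {
            c.set_delay_us(0);
        }
    }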

Contributor:

I don't think so. If we did that, then the first two replies to a job would clear the delay, then the third reply would set it (before it's cleared again by the first reply to the next job).

👍

Maybe we should clear it if any of the clients restart?

That sounds like a good idea, yeah.

Contributor Author:

Okay, done in 80b6f73

@jmpesp (Contributor) commented on Feb 28, 2024

I'm not sure how to fix this – if end-to-end time varies based on things other than true execution time + queue delays, then it becomes tricky to control (e.g. what if every downstairs randomly delayed jobs by 1-10 ms?).

My inclination is to punt on it until we actually start reordering jobs in the Downstairs, and then do so with caution.

I'm curious if doing per-client backpressure based on the queue lengths makes sense? It doesn't take into account the actual time difference like reply_time does though, and I don't know how that would affect how we decide to kick out a very slow Downstairs.

For what it's worth, starting a Downstairs with --lossy will randomly insert pauses and skip jobs, and lead to this sort of out of order execution.

@mkeeter (Contributor, Author) commented on Feb 28, 2024

I'm curious if doing per-client backpressure based on the queue lengths makes sense? It doesn't take into account the actual time difference like reply_time does though, and I don't know how that would affect how we decide to kick out a very slow Downstairs.

I considered this, but initially decided against it because queue length is a pretty crude measurement of expected delay: a queue with 500x 4K jobs is very different from a queue with 500x 1 MiB jobs.

If we use the existing backpressure system as an analogy, we could instead delay individual clients based on the maximum of relative jobs-in-flight and bytes-in-flight (versus the slowest Downstairs). Do you want me to throw that together?

Kicking out a very slow Downstairs should take the same amount of time as before; it will just delay the other two Downstairs before it gets kicked out. (In the current code, the max delay is 10 ms/job)
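
For concreteness, a rough sketch of that alternative (illustrative only; the version eventually implemented in #1186 differs in details, and the cap and scaling below are assumptions):

    // Delay each client in proportion to how far ahead of the slowest client it
    // is, using whichever ratio (jobs in flight or bytes in flight) is worse.
    fn per_client_delay_us(jobs_in_flight: [u64; 3], bytes_in_flight: [u64; 3]) -> [u64; 3] {
        // Cap mirrors the 10 ms/job figure mentioned above; the value is a guess.
        const MAX_DELAY_US: u64 = 10_000;

        // The slowest client is the one with the most work still outstanding.
        let max_jobs = *jobs_in_flight.iter().max().unwrap();
        let max_bytes = *bytes_in_flight.iter().max().unwrap();

        let mut delays = [0u64; 3];
        for i in 0..3 {
            // Fraction by which this client's queue is shorter than the longest
            // queue, measured in jobs and in bytes; 0.0 for the slowest client.
            let job_lead = if max_jobs > 0 {
                1.0 - jobs_in_flight[i] as f64 / max_jobs as f64
            } else {
                0.0
            };
            let byte_lead = if max_bytes > 0 {
                1.0 - bytes_in_flight[i] as f64 / max_bytes as f64
            } else {
                0.0
            };
            // Take the worse of the two ratios and scale the delay by it.
            delays[i] = (job_lead.max(byte_lead) * MAX_DELAY_US as f64) as u64;
        }
        delays
    }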

@jmpesp
Copy link
Contributor

jmpesp commented Feb 28, 2024

Do you want me to throw that together?

Summarizing a long DM chain, I do think this is worth testing! It may solve a bunch of questions I/we had.

mkeeter added a commit that referenced this pull request Mar 2, 2024
This PR adds per-client delays to prevent queues from becoming arbitrarily long
during read-heavy workflows. These delays are based on either per-client queue
lengths or bytes in flight (whichever is worse compared to the slowest client),
which is analogous to the guest backpressure implementation.

Compared to main, this keeps the queues relatively bounded, and is at least
similar in performance (hard to tell, because there is significant run-to-run
variability).

Fixes #1167, and is an alternative to #1181
@mkeeter (Contributor, Author) commented on Mar 2, 2024

Done in #1186 instead

@mkeeter closed this on Mar 2, 2024
Successfully merging this pull request may close these issues: Revenge of the backpressure (#1167).

3 participants