Add per-client backpressure #1181

Closed
mkeeter wants to merge 4 commits from the per-client-backpressure branch

Conversation

@mkeeter (Contributor) commented on Feb 26, 2024

Fixes #1167

This PR adds per-client delays to prevent queues from becoming arbitrarily long during read-heavy workflows.

In our current code, because we only need one reply to ack a read, the fastest Downstairs will consistently beat the others. This means that the queues for the slower two Downstairs will keep getting longer and longer (so the fastest Downstairs will keep winning), and those queues could eventually hit the IO_OUTSTANDING_MAX limit (which means the Upstairs would mark the slower Downstairs as faulted).

This PR monitors the difference between Downstairs reply times using a new field, reply_times: ClientData<Instant>, in struct DownstairsIO. If the difference between the fastest and slowest reply is significant (> 10 ms), we add an artificial delay to the fastest Downstairs.
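
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea in Rust. Only reply_times, ClientData, set_delay_us, and the 10 ms figures come from this PR; the function name, the ClientData stand-in, and the exact scaling are illustrative assumptions, not the actual diff.

    use std::time::{Duration, Instant};

    // Stand-in for Crucible's ClientData<T>: one slot per Downstairs.
    type ClientData<T> = [T; 3];

    // Delay a client only when it is more than this far ahead of the slowest
    // reply, and never add more than this much delay per job (10 ms, per the
    // discussion in this PR).
    const DELAY_THRESHOLD: Duration = Duration::from_millis(10);
    const MAX_DELAY_PER_JOB: Duration = Duration::from_millis(10);

    // Given the reply times recorded for one job (None until that client has
    // replied), compute the artificial delay (in microseconds) to apply to
    // each client. The scaling is illustrative, not the PR's exact formula.
    fn per_client_delays(reply_times: &ClientData<Option<Instant>>) -> ClientData<u64> {
        let mut delays = [0u64; 3];

        // Only update once every Downstairs has replied to this job.
        if reply_times.iter().any(|t| t.is_none()) {
            return delays;
        }
        let slowest = reply_times.iter().map(|t| t.unwrap()).max().unwrap();

        for (d, t) in delays.iter_mut().zip(reply_times.iter()) {
            let lead = slowest - t.unwrap(); // how far ahead of the slowest reply
            if lead > DELAY_THRESHOLD {
                *d = lead.min(MAX_DELAY_PER_JOB).as_micros() as u64;
            }
        }
        delays
    }

In the PR itself, the computed value is handed to each client via self.clients[i].set_delay_us(...), as seen in the review snippets below.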

Compared to main, this slows down reads, but keeps the queues relatively bounded. Here's performance before:

1M WRITE: bw=692MiB/s (726MB/s), 692MiB/s-692MiB/s (726MB/s-726MB/s), io=40.7GiB (43.7GB), run=60146-60146msec
4K WRITE: bw=28.4MiB/s (29.8MB/s), 28.4MiB/s-28.4MiB/s (29.8MB/s-29.8MB/s), io=1707MiB (1789MB), run=60005-60005msec
1M WRITE: bw=713MiB/s (748MB/s), 713MiB/s-713MiB/s (748MB/s-748MB/s), io=41.8GiB (44.9GB), run=60051-60051msec
4M WRITE: bw=713MiB/s (748MB/s), 713MiB/s-713MiB/s (748MB/s-748MB/s), io=41.9GiB (45.0GB), run=60181-60181msec
4K READ: bw=29.0MiB/s (30.4MB/s), 29.0MiB/s-29.0MiB/s (30.4MB/s-30.4MB/s), io=1741MiB (1826MB), run=60004-60004msec
1M READ: bw=603MiB/s (632MB/s), 603MiB/s-603MiB/s (632MB/s-632MB/s), io=35.3GiB (37.9GB), run=60038-60038msec
4M READ: bw=738MiB/s (774MB/s), 738MiB/s-738MiB/s (774MB/s-774MB/s), io=43.3GiB (46.5GB), run=60135-60135msec

[Graph: upstairs_info_prev]

(notice pending jobs creeping up in the 1M and 4M read sections of the graph)

...and after:

1M WRITE: bw=687MiB/s (721MB/s), 687MiB/s-687MiB/s (721MB/s-721MB/s), io=40.4GiB (43.4GB), run=60198-60198msec
4K WRITE: bw=29.4MiB/s (30.8MB/s), 29.4MiB/s-29.4MiB/s (30.8MB/s-30.8MB/s), io=1765MiB (1850MB), run=60008-60008msec
1M WRITE: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=41.9GiB (44.9GB), run=60029-60029msec
4M WRITE: bw=697MiB/s (731MB/s), 697MiB/s-697MiB/s (731MB/s-731MB/s), io=40.9GiB (44.0GB), run=60134-60134msec
4K READ: bw=28.8MiB/s (30.2MB/s), 28.8MiB/s-28.8MiB/s (30.2MB/s-30.2MB/s), io=1731MiB (1815MB), run=60003-60003msec
1M READ: bw=562MiB/s (589MB/s), 562MiB/s-562MiB/s (589MB/s-589MB/s), io=32.9GiB (35.4GB), run=60038-60038msec
4M READ: bw=561MiB/s (588MB/s), 561MiB/s-561MiB/s (588MB/s-588MB/s), io=32.9GiB (35.4GB), run=60162-60162msec

[Graph: upstairs_info_pr]

I don't think we can avoid slowing down reads in the steady-state: the whole point is to keep the queues bounded, which necessarily means matching the speed of the slowest Downstairs. However, burst performance should be unaffected, because the delays only kick in at > 10 ms of discrepancy (which takes some amount of time to wind up).

Note that we don't delay writes! This took a little convincing, but I believe it's correct: during write-heavy workloads, we're already limited by global backpressure (both jobs and bytes-in-flight), so it's already impossible for a queue to grow without bounds.
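
For context, "global backpressure" refers to the existing guest-facing mechanism that slows incoming IO as the total jobs and bytes in flight grow. A much-simplified, purely illustrative sketch of that idea follows; the thresholds, constants, and linear scaling here are made up for illustration, not Crucible's actual values:

    use std::time::Duration;

    // Illustrative only: once total jobs or bytes in flight cross a threshold,
    // every new guest IO is delayed, so write queues cannot grow without bound.
    fn global_backpressure(jobs_in_flight: u64, bytes_in_flight: u64) -> Duration {
        const JOB_START: u64 = 1_000;       // assumed threshold, in jobs
        const BYTE_START: u64 = 50 << 20;   // assumed threshold, in bytes
        const DELAY_PER_UNIT_US: u64 = 10;  // assumed scaling factor

        let job_excess = jobs_in_flight.saturating_sub(JOB_START);
        let byte_excess = bytes_in_flight.saturating_sub(BYTE_START) / (1 << 20);

        // Whichever dimension is further over its threshold drives the delay.
        Duration::from_micros(job_excess.max(byte_excess) * DELAY_PER_UNIT_US)
    }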

@mkeeter added this to the 7 milestone on Feb 26, 2024
@mkeeter self-assigned this on Feb 26, 2024
@leftwo (Contributor) left a comment:

My only real question is whether there is a way to know, from the outside, what the
delay settings are currently set to.

I could see us doing even more with that info in the future, but I don't think
we need to hold up this PR while we figure out the best way to archive it. It
would be cool to know what the settings are at some sample frequency or point
in time.

};

self.clients[i].set_delay_us(delay_time_us);
}
Contributor:

Is there any way we can see this from the outside? It would be interesting to be able
to track this and watch how it changes, but I'm not sure how to bubble it out to
where we could grab it with the rest of the collect_stats().

Contributor Author:

I added this to the existing up-status DTrace probe in 597b950

(The string is now 768 bytes long with all zeros, so still well below the strsize=1k limit)

@mkeeter force-pushed the per-client-backpressure branch from 9abd890 to 4b48762 on February 27, 2024 20:18
@jmpesp (Contributor) left a comment:

Some questions:

  • should we change IO_OUTSTANDING_MAX?
  • what if jobs arrive out of order? technically the downstairs is free to do non-dependent jobs in any order, and reply_time isn't the same as some notion of how long a job took to do by a particular downstairs.

@mkeeter (Contributor, Author) commented on Feb 28, 2024

  • should we change IO_OUTSTANDING_MAX?

I'm not sure, but I think it's orthogonal to this change – from the perspective of backpressure, it doesn't matter whether jobs are waiting in the Upstairs queues or in the Downstairs.

  • what if jobs arrive out of order? technically the downstairs is free to do non-dependent jobs in any order, and reply_time isn't the same as some notion of how long a job took to do by a particular downstairs.

Good question, I hadn't thought about this. To clarify one point, we don't actually care about "how long a job took to do by a particular downstairs"; we care about the end-to-end time, which also includes queues and sockets on both sides (adding delays on top of raw execution time).

Still, it's a valid concern. Let's say we have three independent read operations, A, B, and C, which all take the same amount of time.

If Downstairs 1 does ABC, Downstairs 2 does BCA, and Downstairs 3 does CAB, then we wouldn't update backpressure until the third reply; at that point, job C thinks that DS1 is slowest, job A thinks DS2 is slowest, and job B thinks DS3 is slowest. Depending on which reply comes in last, we could end up delaying any of the Downstairs, even though they're performing the same.

I'm not sure how to fix this – if end-to-end time varies based on things other than true execution time + queue delays, then it becomes tricky to control (e.g. what if every downstairs randomly delayed jobs by 1-10 ms?).

My inclination is to punt on it until we actually start reordering jobs in the Downstairs, and then do so with caution.


self.clients[i].set_delay_us(delay_time_us);
}
}
Contributor:

If we're not setting the per-client time, should we clear it? What if one Downstairs is Faulted, or in replay? The previously set delays will still apply.

Contributor Author:

If we're not setting the per-client time, should we clear it?

I don't think so. If we did that, then the first two replies to a job would clear the delay, then the third reply would set it (before it's cleared again by the first reply to the next job).

What if one Downstairs is Faulted, or in replay - the previously set delays will still apply.

I added logic to clear the per-client delay in Client::reinitialize (1190fa3).

Maybe we should clear it if any of the clients restart? Otherwise, I could imagine a situation where

  • DS1 is temporarily slower than DS2, so we delay DS2
  • Then DS3 returns an error, so we restart it
  • We're still delaying DS2, and that won't change until DS3 comes back online
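
A rough sketch of that "clear on any restart" idea (the types and function name here are stand-ins, not the code in 1190fa3 or 80b6f73; only set_delay_us appears in the actual diff):

    // Stand-in for the Upstairs client state; only the delay field matters here.
    struct Client {
        delay_us: u64,
    }

    impl Client {
        fn set_delay_us(&mut self, delay_us: u64) {
            self.delay_us = delay_us;
        }
    }

    // Illustrative: when any one client reinitializes (fault, replay, etc.),
    // clear the artificial delay on every client, since the reply-time
    // comparisons that justified the old delays no longer hold.
    fn on_any_client_restart(clients: &mut [Client; 3]) {
        for c in clients.iter_mut() {
            c.set_delay_us(0);
        }
    }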

Contributor:

I don't think so. If we did that, then the first two replies to a job would clear the delay, then the third reply would set it (before it's cleared again by the first reply to the next job).

👍

Maybe we should clear it if any of the clients restart?

That sounds like a good idea, yeah.

Contributor Author:

Okay, done in 80b6f73

@jmpesp (Contributor) commented on Feb 28, 2024

I'm not sure how to fix this – if end-to-end time varies based on things other than true execution time + queue delays, then it becomes tricky to control (e.g. what if every downstairs randomly delayed jobs by 1-10 ms?).

My inclination is to punt on it until we actually start reordering jobs in the Downstairs, and then do so with caution.

I'm curious if doing per-client backpressure based on the queue lengths makes sense? It doesn't take into account the actual time difference like reply_time does though, and I don't know how that would affect how we decide to kick out a very slow Downstairs.

For what it's worth, starting a Downstairs with --lossy will randomly insert pauses and skip jobs, and lead to this sort of out of order execution.

@mkeeter (Contributor, Author) commented on Feb 28, 2024

I'm curious if doing per-client backpressure based on the queue lengths makes sense? It doesn't take into account the actual time difference like reply_time does though, and I don't know how that would affect how we decide to kick out a very slow Downstairs.

I considered this, but initially decided against it because queue length is a pretty crude measurement of expected delay: a queue with 500x 4K jobs is very different from a queue with 500x 1 MiB jobs.

If we use the existing backpressure system as an analogy, we could instead delay individual clients based on the maximum of relative jobs-in-flight and bytes-in-flight (versus the slowest Downstairs). Do you want me to throw that together?

Kicking out a very slow Downstairs should take the same amount of time as before; it will just delay the other two Downstairs before it gets kicked out. (In the current code, the max delay is 10 ms/job)
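
For concreteness, a rough sketch of that alternative (illustrative only; the version eventually implemented in #1186 differs in details, and the cap and scaling below are assumptions):

    // Delay each client in proportion to how far ahead of the slowest client it
    // is, using whichever ratio (jobs in flight or bytes in flight) is worse.
    fn per_client_delay_us(jobs_in_flight: [u64; 3], bytes_in_flight: [u64; 3]) -> [u64; 3] {
        // Cap mirrors the 10 ms/job figure mentioned above; the value is a guess.
        const MAX_DELAY_US: u64 = 10_000;

        // The slowest client is the one with the most work still outstanding.
        let max_jobs = *jobs_in_flight.iter().max().unwrap();
        let max_bytes = *bytes_in_flight.iter().max().unwrap();

        let mut delays = [0u64; 3];
        for i in 0..3 {
            // Fraction by which this client's queue is shorter than the longest
            // queue, measured in jobs and in bytes; 0.0 for the slowest client.
            let job_lead = if max_jobs > 0 {
                1.0 - jobs_in_flight[i] as f64 / max_jobs as f64
            } else {
                0.0
            };
            let byte_lead = if max_bytes > 0 {
                1.0 - bytes_in_flight[i] as f64 / max_bytes as f64
            } else {
                0.0
            };
            // Take the worse of the two ratios and scale the delay by it.
            delays[i] = (job_lead.max(byte_lead) * MAX_DELAY_US as f64) as u64;
        }
        delays
    }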

@jmpesp
Copy link
Contributor

jmpesp commented Feb 28, 2024

Do you want me to throw that together?

Summarizing a long DM chain, I do think this is worth testing! It may solve a bunch of questions I/we had.

mkeeter added a commit that referenced this pull request Mar 2, 2024
This PR adds per-client delays to prevent queues from becoming arbitrarily long
during read-heavy workflows. These delays are based on either per-client queue
lengths or bytes in flight (whichever is worse compared to the slowest client),
which is analogous to the guest backpressure implementation.

Compared to main, this keeps the queues relatively bounded, and is at least
similar in performance (hard to tell, because there is significant run-to-run
variability).

Fixes #1167, and is an alternative to #1181
@mkeeter (Contributor, Author) commented on Mar 2, 2024

Done in #1186 instead

@mkeeter closed this on Mar 2, 2024
Successfully merging this pull request may close these issues: Revenge of the backpressure (#1167).

3 participants