Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the Active → Faulted transition #1260

Merged
merged 8 commits into from
Apr 18, 2024
Merged

Conversation

mkeeter
Copy link
Contributor

@mkeeter mkeeter commented Apr 12, 2024

This PR removes the Active → Faulted transition when too many IO operations are in flight.

See #1252 for details on why this is desirable!

There are a bunch of changes that come along for the ride:

  • The backpressure equation is changed from quadratic to the form 1 / (1 - f), i.e. it now goes to infinity at our desired maximum value. These maximum values are set at 2x the fault for the Offline → Faulted transition.
  • Computing backpressure is now a standalone function on BackpressureConfig, so it can be unit tested
  • Unit tests are added to confirm backpressure properties
    • The OfflineFaulted transition happens at a reasonable timescale (> 1 sec, < 2 minutes)
    • Backpressure goes to ∞ as we approach these limits
  • Backpressure settings are tweaked based on these new requirements

@mkeeter mkeeter requested review from jmpesp and leftwo April 12, 2024 18:59
@@ -812,7 +807,7 @@ async fn run_live_repair(mut harness: TestHarness) {
let mut ds2_buffered_messages = vec![];
let mut ds3_buffered_messages = vec![];

for eid in 0..10 {
for eid in 0..25 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably should have been some test global to begin with, so you did not
have to find it and change it to match the default_config extent size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nudge, done in 9760bad

@mkeeter mkeeter force-pushed the remove-io-limit-fail-redux branch from 77faa50 to 1e2b998 Compare April 17, 2024 18:53
@mkeeter
Copy link
Contributor Author

mkeeter commented Apr 17, 2024

I did a performance sweep on the Madrid cluster, using the new crutest rand-read/write subcommands:

#!/bin/sh
set -e

if [[ -z ${CRUTEST_BIN} ]]; then
    print "Error: must run with CRUTEST_BIN set"
    exit 1
fi

FLAGS="\
    -t [fd00:1122:3344:104::1]:28830 \
    -t [fd00:1122:3344:103::1]:28830 \
    -t [fd00:1122:3344:101::1]:28830 \
    --io-depth=8 -q --time=60 --sample-time=5 --subsample-count=2\
    --key tCw7zw0hAsPuxMOTWwnPEFYjBK9qJRtYyGdEXKEnrg0= \
"

${CRUTEST_BIN} rand-write $FLAGS --gen $(date "+%s") --io-size=1024
${CRUTEST_BIN} rand-write $FLAGS --gen $(date "+%s") --io-size=1
${CRUTEST_BIN} rand-write $FLAGS --gen $(date "+%s") --io-size=256
${CRUTEST_BIN} rand-write $FLAGS --gen $(date "+%s") --io-size=1024
${CRUTEST_BIN} rand-read $FLAGS --gen $(date "+%s") --io-size=1
${CRUTEST_BIN} rand-read $FLAGS --gen $(date "+%s") --io-size=256
${CRUTEST_BIN} rand-read $FLAGS --gen $(date "+%s") --io-size=1024

main

4M write: 806.8 MiB/sec ± 35.6 MiB/sec (201 IOPS)
4K write: 28.7 MiB/sec ± 2.3 MiB/sec (7334 IOPS)
1M write: 813.5 MiB/sec ± 29.5 MiB/sec (813 IOPS)
4M write: 779.3 MiB/sec ± 40.3 MiB/sec (194 IOPS)
4K read:  28.4 MiB/sec ± 273.8 KiB/sec (7275 IOPS)
1M read:  825.3 MiB/sec ± 43.4 MiB/sec (825 IOPS)
4M read:  733.2 MiB/sec ± 27.5 MiB/sec (183 IOPS)

This branch

4M write: 783.8 MiB/sec ± 23.2 MiB/sec (195 IOPS)
4K write: 32.1 MiB/sec ± 2 MiB/sec (8227 IOPS)
1M write: 795.8 MiB/sec ± 29.6 MiB/sec (795 IOPS)
4M write: 796.5 MiB/sec ± 34.6 MiB/sec (199 IOPS)
4K read:  27.7 MiB/sec ± 290.7 KiB/sec (7096 IOPS)
1M read:  744.3 MiB/sec ± 15.5 MiB/sec (744 IOPS)
4M read:  818.9 MiB/sec ± 28.3 MiB/sec (204 IOPS)

Writes consistently look about the same; reads are a little more varied but I don't see any major red flags.

@mkeeter mkeeter force-pushed the remove-io-limit-fail-redux branch from afcce41 to 1e2b998 Compare April 18, 2024 14:40
Copy link
Contributor

@faithanalog faithanalog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All makes sense to me.

@mkeeter mkeeter merged commit 11a4cf8 into main Apr 18, 2024
23 checks passed
@mkeeter mkeeter deleted the remove-io-limit-fail-redux branch April 18, 2024 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants