Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor live-repair start and send #1673

Merged
merged 2 commits into from
Mar 18, 2025
Merged

Conversation

mkeeter
Copy link
Contributor

@mkeeter mkeeter commented Mar 17, 2025

Downstairs::begin_repair_for had confusing semantics: it was used to both start live-repair (setting self.repair to Some(..)) and to send subsequent jobs. Most of its arguments were only used in one or the other case.

This PR splits it into three functions:

  • start_live_repair initializes self.repair and sends the first live-repair IO
  • send_next_live_repair sends the next live-repair IO
  • Under the hood, both of these functions call send_live_repair_jobs to reserve repair IDs and actually send IO

In passing, I also removed the extent_count argument from notify_live_repair_progress, since it can be computed locally (basically just pushing the change from #1672 further down the call stack).

source_downstairs,
aborting_repair: false,
active_extent: ExtentId(0),
min_id: JobId(0), // fixed below
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we send this first repair with min_id not set, is it possible this could be
a problem with dependency removal?

The path for the first repair goes like this:
send_live_repair_jobs()
-> create_and_enqueue_..()
-> self.enqueue()
-> self.send()
In send, we prune_deps(), where we check min_id.

I think it's going to be okay, but .. I want to be sure (as this was a problem with the initial LR and min_id was added to prevent a hang).

  • A non LiveRepair downstairs should not prune deps, as it must honor them.
  • A downstairs under Repair, this is the first actual repair it is getting, so any previous IO should have been skipped already, and should not be on the dependency list?

At least, thats what I think will happen. Does that make sense? Even if true, I still feel like we are relying on some behavior that may not be obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it should work as written, because there should be no IO going to a downstairs that's LiveRepairReady. However, I realized it's an easy fix to initialize min_id to self.peek_next_id(), so let's just do that instead to remove all doubt!

Fixed in 50e16fd

@leftwo leftwo self-requested a review March 18, 2025 15:03
@mkeeter mkeeter merged commit 61c6688 into main Mar 18, 2025
17 checks passed
@mkeeter mkeeter deleted the mkeeter/clean-up-live-repair-start branch March 18, 2025 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants