Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify extent under repair is valid after copying files #1159

Merged
merged 2 commits into from
Feb 21, 2024

Conversation

leftwo
Copy link
Contributor

@leftwo leftwo commented Feb 12, 2024

Added a call to the downstairs repair API that will verify that an extent is closed (or the region is read only),
meaning the extent is okay to copy.

This prevents a rare but possible condition where an extent is closed when a remote side contacts it for repair, but is re-opened before the extent file copy starts. This leaves the resulting copy as invalid.

During the pstop/prun tests, I saw this once where the pstop to a downstairs happened just after we had
pulled over the list of files to verify, but before the actual files were copied over. The details of that
are described here: #1146

Added a call to the downstairs repair API that will verify that
an extent is closed (or the region is read only), meaning the extent
is okay to copy.

This prevents a rare but possible condition where an extent is closed
when a remote side contacts it for repair, but is re-opened before the
extent file copy starts and the resulting copy is invalid.
@leftwo leftwo requested review from mkeeter and jmpesp February 12, 2024 18:15
bail!("got {:?} from repair main", e);
}
let (repair_request_tx, repair_request_rx) = mpsc::channel(10);
let repair_listener = match repair::repair_main(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of spawning a separate task, could we handle this new functionality in repair::repair_main? We're already passing in an Arc<Mutex<Downstairs>> (as the &dss argument), so I'm not sure what the extra task buys us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do pass in the downstairs mutex, but we don't keep it. We only use it to pull out a few fields.
I was actually moddeling this after what you did in upstairs/src/control.rs for the single task
refactoring (#1058).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's 6 of one, half-dozen of the other.

Having any task holding on to an Arc<Mutex<Downstairs>> is something that will have to change if we switch to a single-task implementation (with a dedicated task owning a Downstairs). It doesn't much matter whether the Arc<Mutex<Downstairs>> is held by repair_main or the new repair_info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hope here was to make an eventual transition to a similar one-task implementation easier, and make the
downstairs similar to the upstairs in that regard. Is this at least a step in that direction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, there were only two categories of tasks using the Arc<Mutex<Downstairs>>:

  • The main listener task
  • Any tasks spawned by it via proc_stream, i.e. pf_task and dw_task

After this PR, the important change is that we'll have a third category of task using the Arc<Mutex<Downstairs>>.

This will either be repair_info (in the current implementation) or repair_main (in my suggestion). Which one isn't important, the important part (architecturally) is that it's a third thing sharing that mutex; and that's what we'll have to deal with in a future single-task refactoring.

From that perspective, either implementation is making a single-task refactoring harder.

That being said, it's not going to make such a refactoring significantly harder; we can just make requests to the Downstairs task through a queue, instead of calling into its APIs directly.

Since both implementations are architecturally equivalent (w.r.t. a single-task refactoring), I'd vote for the one that doesn't proliferate tasks further (i.e. adding this API to repair_main). Does that make sense? I don't feel too strongly about it, so I'd defer to your judgement if you still think a separate task is the Right Choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it all in repair_main(), no more repair task.

@mkeeter
Copy link
Contributor

mkeeter commented Feb 12, 2024

The complexity here of "checking before, sending files, checking after" makes me nervous – could we instead hold the Downstairs lock for the entire time we're sending files to the requester, ensuring that no one else could change them?

@leftwo
Copy link
Contributor Author

leftwo commented Feb 12, 2024

The complexity here of "checking before, sending files, checking after" makes me nervous – could we instead hold the Downstairs lock for the entire time we're sending files to the requester, ensuring that no one else could change them?

I thought about that idea originally. The repair API task does not have access to the downstairs lock currently, but it
could be updated to have it. However, I still want to be able to recover from the case where a downstairs under
repair starts streaming, then goes out to lunch. I would be concerned the source downstairs side is holding a lock,
then never letting it go because the remote side is stuck/crashed/unavailable. This would eventually result in the
upstairs giving up and the LiveRepair needing to abort.

I was also not sure how to tell in dropshot when know when the stream has ended.

@mkeeter
Copy link
Contributor

mkeeter commented Feb 12, 2024

I thought about that idea originally. The repair API task does not have access to the downstairs lock currently, but it
could be updated to have it. However, I still want to be able to recover from the case where a downstairs under
repair starts streaming, then goes out to lunch. I would be concerned the source downstairs side is holding a lock,
then never letting it go because the remote side is stuck/crashed/unavailable. This would eventually result in the
upstairs giving up and the LiveRepair needing to abort.

I was also not sure how to tell in dropshot when know when the stream has ended.

Okay, that seems reasonable as a justification for the three-step process.

Somewhat orthogonally, I still think that moving this new API into repair_main is sensible (per this comment); what do you think?

@leftwo leftwo requested a review from mkeeter February 14, 2024 00:28
@leftwo leftwo merged commit ae8630f into main Feb 21, 2024
18 checks passed
@leftwo leftwo deleted the alan/repair-but-verify branch February 21, 2024 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants