-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify extent under repair is valid after copying files #1159
Conversation
Added a call to the downstairs repair API that will verify that an extent is closed (or the region is read only), meaning the extent is okay to copy. This prevents a rare but possible condition where an extent is closed when a remote side contacts it for repair, but is re-opened before the extent file copy starts and the resulting copy is invalid.
downstairs/src/lib.rs
Outdated
bail!("got {:?} from repair main", e); | ||
} | ||
let (repair_request_tx, repair_request_rx) = mpsc::channel(10); | ||
let repair_listener = match repair::repair_main( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of spawning a separate task, could we handle this new functionality in repair::repair_main
? We're already passing in an Arc<Mutex<Downstairs>>
(as the &dss
argument), so I'm not sure what the extra task buys us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do pass in the downstairs mutex, but we don't keep it. We only use it to pull out a few fields.
I was actually moddeling this after what you did in upstairs/src/control.rs
for the single task
refactoring (#1058).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's 6 of one, half-dozen of the other.
Having any task holding on to an Arc<Mutex<Downstairs>>
is something that will have to change if we switch to a single-task implementation (with a dedicated task owning a Downstairs
). It doesn't much matter whether the Arc<Mutex<Downstairs>>
is held by repair_main
or the new repair_info
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My hope here was to make an eventual transition to a similar one-task implementation easier, and make the
downstairs similar to the upstairs in that regard. Is this at least a step in that direction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, there were only two categories of tasks using the Arc<Mutex<Downstairs>>
:
- The main listener task
- Any tasks spawned by it via
proc_stream
, i.e.pf_task
anddw_task
After this PR, the important change is that we'll have a third category of task using the Arc<Mutex<Downstairs>>
.
This will either be repair_info
(in the current implementation) or repair_main
(in my suggestion). Which one isn't important, the important part (architecturally) is that it's a third thing sharing that mutex; and that's what we'll have to deal with in a future single-task refactoring.
From that perspective, either implementation is making a single-task refactoring harder.
That being said, it's not going to make such a refactoring significantly harder; we can just make requests to the Downstairs
task through a queue, instead of calling into its APIs directly.
Since both implementations are architecturally equivalent (w.r.t. a single-task refactoring), I'd vote for the one that doesn't proliferate tasks further (i.e. adding this API to repair_main
). Does that make sense? I don't feel too strongly about it, so I'd defer to your judgement if you still think a separate task is the Right Choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it all in repair_main()
, no more repair task.
The complexity here of "checking before, sending files, checking after" makes me nervous – could we instead hold the |
I thought about that idea originally. The repair API task does not have access to the downstairs lock currently, but it I was also not sure how to tell in dropshot when know when the stream has ended. |
Okay, that seems reasonable as a justification for the three-step process. Somewhat orthogonally, I still think that moving this new API into |
Added a call to the downstairs repair API that will verify that an extent is closed (or the region is read only),
meaning the extent is okay to copy.
This prevents a rare but possible condition where an extent is closed when a remote side contacts it for repair, but is re-opened before the extent file copy starts. This leaves the resulting copy as invalid.
During the pstop/prun tests, I saw this once where the pstop to a downstairs happened just after we had
pulled over the list of files to verify, but before the actual files were copied over. The details of that
are described here: #1146