Verify extent under repair is valid after copying files #1159

leftwo · 2024-02-12T18:15:12Z

Added a call to the downstairs repair API that will verify that an extent is closed (or the region is read only),
meaning the extent is okay to copy.

This prevents a rare but possible condition where an extent is closed when a remote side contacts it for repair, but is re-opened before the extent file copy starts. This leaves the resulting copy as invalid.

During the pstop/prun tests, I saw this once where the pstop to a downstairs happened just after we had
pulled over the list of files to verify, but before the actual files were copied over. The details of that
are described here: #1146

Added a call to the downstairs repair API that will verify that an extent is closed (or the region is read only), meaning the extent is okay to copy. This prevents a rare but possible condition where an extent is closed when a remote side contacts it for repair, but is re-opened before the extent file copy starts and the resulting copy is invalid.

mkeeter · 2024-02-12T18:28:49Z

downstairs/src/lib.rs

-                bail!("got {:?} from repair main", e);
-            }
+    let (repair_request_tx, repair_request_rx) = mpsc::channel(10);
+    let repair_listener = match repair::repair_main(


Instead of spawning a separate task, could we handle this new functionality in repair::repair_main? We're already passing in an Arc<Mutex<Downstairs>> (as the &dss argument), so I'm not sure what the extra task buys us.

We do pass in the downstairs mutex, but we don't keep it. We only use it to pull out a few fields.
I was actually moddeling this after what you did in upstairs/src/control.rs for the single task
refactoring (#1058).

I think it's 6 of one, half-dozen of the other.

Having any task holding on to an Arc<Mutex<Downstairs>> is something that will have to change if we switch to a single-task implementation (with a dedicated task owning a Downstairs). It doesn't much matter whether the Arc<Mutex<Downstairs>> is held by repair_main or the new repair_info.

My hope here was to make an eventual transition to a similar one-task implementation easier, and make the
downstairs similar to the upstairs in that regard. Is this at least a step in that direction?

Previously, there were only two categories of tasks using the Arc<Mutex<Downstairs>>:

The main listener task

Any tasks spawned by it via proc_stream, i.e. pf_task and dw_task

After this PR, the important change is that we'll have a third category of task using the Arc<Mutex<Downstairs>>.

This will either be repair_info (in the current implementation) or repair_main (in my suggestion). Which one isn't important, the important part (architecturally) is that it's a third thing sharing that mutex; and that's what we'll have to deal with in a future single-task refactoring.

From that perspective, either implementation is making a single-task refactoring harder.

That being said, it's not going to make such a refactoring significantly harder; we can just make requests to the Downstairs task through a queue, instead of calling into its APIs directly.

Since both implementations are architecturally equivalent (w.r.t. a single-task refactoring), I'd vote for the one that doesn't proliferate tasks further (i.e. adding this API to repair_main). Does that make sense? I don't feel too strongly about it, so I'd defer to your judgement if you still think a separate task is the Right Choice.

I put it all in repair_main(), no more repair task.

mkeeter · 2024-02-12T18:30:41Z

The complexity here of "checking before, sending files, checking after" makes me nervous – could we instead hold the Downstairs lock for the entire time we're sending files to the requester, ensuring that no one else could change them?

leftwo · 2024-02-12T18:56:40Z

The complexity here of "checking before, sending files, checking after" makes me nervous – could we instead hold the Downstairs lock for the entire time we're sending files to the requester, ensuring that no one else could change them?

I thought about that idea originally. The repair API task does not have access to the downstairs lock currently, but it
could be updated to have it. However, I still want to be able to recover from the case where a downstairs under
repair starts streaming, then goes out to lunch. I would be concerned the source downstairs side is holding a lock,
then never letting it go because the remote side is stuck/crashed/unavailable. This would eventually result in the
upstairs giving up and the LiveRepair needing to abort.

I was also not sure how to tell in dropshot when know when the stream has ended.

mkeeter · 2024-02-12T19:43:56Z

I thought about that idea originally. The repair API task does not have access to the downstairs lock currently, but it
could be updated to have it. However, I still want to be able to recover from the case where a downstairs under
repair starts streaming, then goes out to lunch. I would be concerned the source downstairs side is holding a lock,
then never letting it go because the remote side is stuck/crashed/unavailable. This would eventually result in the
upstairs giving up and the LiveRepair needing to abort.

I was also not sure how to tell in dropshot when know when the stream has ended.

Okay, that seems reasonable as a justification for the three-step process.

Somewhat orthogonally, I still think that moving this new API into repair_main is sensible (per this comment); what do you think?

leftwo requested review from mkeeter and jmpesp February 12, 2024 18:15

mkeeter reviewed Feb 12, 2024

View reviewed changes

Keep a mutex for the repair task

2d5d050

leftwo requested a review from mkeeter February 14, 2024 00:28

mkeeter approved these changes Feb 14, 2024

View reviewed changes

jmpesp approved these changes Feb 20, 2024

View reviewed changes

leftwo merged commit ae8630f into main Feb 21, 2024
18 checks passed

leftwo deleted the alan/repair-but-verify branch February 21, 2024 00:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Verify extent under repair is valid after copying files #1159

Verify extent under repair is valid after copying files #1159

leftwo commented Feb 12, 2024

mkeeter Feb 12, 2024

leftwo Feb 12, 2024

mkeeter Feb 12, 2024

leftwo Feb 13, 2024

mkeeter Feb 13, 2024

leftwo Feb 13, 2024

mkeeter commented Feb 12, 2024 •

edited

Loading

leftwo commented Feb 12, 2024

mkeeter commented Feb 12, 2024

Verify extent under repair is valid after copying files #1159

Verify extent under repair is valid after copying files #1159

Conversation

leftwo commented Feb 12, 2024

mkeeter Feb 12, 2024

Choose a reason for hiding this comment

leftwo Feb 12, 2024

Choose a reason for hiding this comment

mkeeter Feb 12, 2024

Choose a reason for hiding this comment

leftwo Feb 13, 2024

Choose a reason for hiding this comment

mkeeter Feb 13, 2024

Choose a reason for hiding this comment

leftwo Feb 13, 2024

Choose a reason for hiding this comment

mkeeter commented Feb 12, 2024 • edited Loading

leftwo commented Feb 12, 2024

mkeeter commented Feb 12, 2024

mkeeter commented Feb 12, 2024 •

edited

Loading