Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return when VCRs match in target_replace #1661

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Mar 4, 2025

Notifications to Nexus from an Upstairs are best-effort only: they can be missed or dropped, or suffer from all sorts of bad weather. For this reason, Nexus will poll an Upstairs (eg a Pantry attachment or Propolis disk backed by Crucible) to query the state of a repair (especially during region replacement).

For a given VCR replacement request, Propolis will:

  • call target_replace(original, replacement)
  • update its stored VCR after the replacement has taken place.

The next time that Nexus sends the same VCR replacement request, Propolis will end up calling what amounts to
target_replace(replacement, replacement), and before this commit that would fail with "The VCRs have no difference".

Nexus will poll using with VCR replacement requests, so this commit changes target_replace to return VcrMatches instead of an error. Note that Nexus will not consider this a signal that a replacement has completed.

Notifications to Nexus from an Upstairs are best-effort only: they can
be missed or dropped, or suffer from all sorts of bad weather. For this
reason, Nexus will poll an Upstairs (eg a Pantry attachment or Propolis
disk backed by Crucible) to query the state of a repair (especially
during region replacement).

For a given VCR replacement request, Propolis will:

- call `target_replace(original, replacement)`
- update its stored VCR after the replacement has taken place.

The next time that Nexus sends the same VCR replacement request,
Propolis will end up calling what amounts to
`target_replace(replacement, replacement)`, and before this commit that
would fail with "The VCRs have no difference".

Nexus will poll using with VCR replacement requests, so this commit
changes `target_replace` to return `VcrMatches` instead of an error.
Note that Nexus will _not_ consider this a signal that a replacement has
completed.
@jmpesp jmpesp requested a review from leftwo March 4, 2025 15:13
@jmpesp
Copy link
Contributor Author

jmpesp commented Mar 4, 2025

A possible wrinkle:

If Propolis is stuck in starting and the relevant Upstairs (the one where the VCR is being modified) has not activated yet, then Nexus will be able to tell when the repair has finished (after sending the VCR replacement request) with a check to see if the Upstairs activated, but that only works for reconciliation.

If Propolis is running, and the relevant Upstairs is active, then Nexus will not be able to tell when the repair has finished by checking if the Upstairs activated ok (because it was already active to begin with), and after this commit will not be able to track the progress of the live repair using the other variants that would be returned if the arguments didn't change (if the call remained target_replace(original, replacement), the caller would see variants like ReplaceResult::StartedAlready).

This could lead to Nexus attempting to perform multiple target replacements on the same volume because it could believe a replacement is done (read: the live repair has completed) before it is, or it would have to rely on a finished notification, which we've already said is best effort.

@leftwo
Copy link
Contributor

leftwo commented Mar 4, 2025

A possible wrinkle:

If Propolis is stuck in starting and the relevant Upstairs (the one where the VCR is being modified) has not activated yet, then Nexus will be able to tell when the repair has finished (after sending the VCR replacement request) with a check to see if the Upstairs activated, but that only works for reconciliation.

If Propolis is running, and the relevant Upstairs is active, then Nexus will not be able to tell when the repair has finished by checking if the Upstairs activated ok (because it was already active to begin with), and after this commit will not be able to track the progress of the live repair using the other variants that would be returned if the arguments didn't change (if the call remained target_replace(original, replacement), the caller would see variants like ReplaceResult::StartedAlready).

This could lead to Nexus attempting to perform multiple target replacements on the same volume because it could believe a replacement is done (read: the live repair has completed) before it is, or it would have to rely on a finished notification, which we've already said is best effort.

In this situation, could crucible tell Nexus "the VCRs match what I have, and there is no repair in progress" Enum type?
I think we have to be very vigilant about replacing downstairs, and being sure we don't replace too many, as that is a path to problems we don't want. This logic here must be rock solid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants