
Crucible must be able to activate with 2/3 downstairs redundancy #7826

Open
smklein opened this issue Mar 19, 2025 · 18 comments
Labels: storage (Related to storage), Update System (Replacing old bits with newer, cooler bits)

Comments

smklein (Collaborator) commented Mar 19, 2025

Currently, when a Crucible upstairs is activating its downstairs, it blocks until all three downstairs have responded.

This requires perfect availability of all downstairs disks for instances to be started, which is especially problematic in the live update case. When we pick a sled to update, we migrate all instances off that sled and reboot it. For the duration of that update:

  • If any instance across the fleet with a downstairs on one of the disks on the sled-under-update tries to start...
  • ...it won't see an ACK from the downstairs on the sled-under-update -- so it will hang for the entire duration of the update

This is rough - it's a user-visible lack of availability, and it will keep happening to different instances as we proceed with the update across the sleds in the rack.

@smklein smklein added storage Related to storage. Update System Replacing old bits with newer, cooler bits labels Mar 19, 2025
@smklein smklein changed the title Crucible must be activate volume with two-of-three redunancy Crucible must be able to activate volumes with two-of-three downstairs redundancy Mar 19, 2025
@smklein smklein changed the title Crucible must be able to activate volumes with two-of-three downstairs redundancy Crucible must be able to activate with 2/3 downstairs redundancy Mar 19, 2025
leftwo (Contributor) commented Mar 19, 2025

To support this, the Crucible upstairs has to either support a timeout during activation when only 2/3 downstairs are present, or be given some signal that it should not wait for all three downstairs.

My concern with 2/3 activation here is preventing additional downstairs replacements from being attempted while things are in a "degraded" state.
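
A minimal sketch of the timeout variant described here, assuming a tokio-style async upstairs; `downstairs_to_activate_with`, `wait_for_third_downstairs`, and the grace period are hypothetical stand-ins, not Crucible APIs:

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Hypothetical sketch of a bounded wait before activation: give the missing
/// downstairs a short grace period, then activate with whatever quorum we have.
/// `wait_for_third_downstairs` stands in for whatever notification the
/// upstairs would actually use; it is not a real Crucible API.
async fn downstairs_to_activate_with<F>(wait_for_third_downstairs: F, grace: Duration) -> usize
where
    F: std::future::Future<Output = ()>,
{
    match timeout(grace, wait_for_third_downstairs).await {
        Ok(()) => 3, // all three answered in time: normal full reconciliation
        Err(_) => 2, // grace period expired: proceed with 2/3
    }
}
```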

smklein (Collaborator, Author) commented Mar 19, 2025

Can I ask, why is a timeout necessary? Couldn't we immediately start activating once we have 2/3, since that's the minimum bar to start processing writes post-activation too?

  • The downside of a timeout is that it causes a latency bubble in a relatively common case - I know it's not exactly the same, but it'll feel like we added a sleep(whatever_our_timeout_is) to all instance start requests during update
  • Further, the "signal to not wait" seems hairy to rely on - many cases of availability loss will result in a failure, then delay, then much later explicitly identifying that a server is "fully gone". Expecting that we'll be told by someone upstack in every case where one server is inaccessible seems unreliable.

Putting this all another way - what's so important about "activation" vs "normal access" that requires perfect redundancy? Would it be possible to treat this case as follows (sketched in code below)?

  • If 2/3 downstairs are available, this is enough to activate the volume
  • If the third downstairs shows up, we can try to catch it up to speed
  • If it doesn't show up, presumably it'll be replaced later anyway
  • If we lose another downstairs and go to 1/3, writes stop
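
A rough sketch of that policy as a state transition, using hypothetical names rather than Crucible's actual types:

```rust
/// Hypothetical sketch of the policy in the bullets above; these names are
/// illustrative only and are not Crucible's actual types.
enum VolumeState {
    WaitingForQuorum,
    Active { ready_downstairs: usize },
    WritesStopped,
}

/// Called whenever the number of reachable downstairs (0..=3) changes.
fn on_downstairs_count_change(state: VolumeState, ready: usize) -> VolumeState {
    match (state, ready) {
        // 2/3 is enough to activate; a late third joiner is caught up afterwards.
        (VolumeState::WaitingForQuorum, n) if n >= 2 => {
            VolumeState::Active { ready_downstairs: n }
        }
        // Dropping to 1/3 after activation stops writes.
        (VolumeState::Active { .. }, n) if n < 2 => VolumeState::WritesStopped,
        (VolumeState::Active { .. }, n) => VolumeState::Active { ready_downstairs: n },
        // Otherwise keep waiting (or stay stopped).
        (s, _) => s,
    }
}
```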

mkeeter (Contributor) commented Mar 19, 2025

See also the partially-written RFD 542, which talks about this problem.

smklein (Collaborator, Author) commented Mar 19, 2025

> See also the partially-written RFD 542, which talks about this problem.

Totally agreed with the perspective in RFD 542 - this is a much more concrete view of my handwavey bullet points. Specifically:

  • We already have the metadata (via generations on extents) to do reconciliation with two downstairs
  • It seems possible to have the third downstairs come online later via replay or live-repair
  • The only way this risks losing writes is when writes haven't been flushed -- and this is okay, that's the contract with flush
  • It seems nice to have this be the "singular pathway" of start-up -- immediately start once 2/3 downstairs are available, and always perform reconciliation with the last joiner.

This means that, if we have 2/3 downstairs online and the third shows up {one nanosecond, one second, one minute} later, it should look the same from a user's perspective: they'll have write access immediately, and things will become three-way redundant later, once everyone catches up.

askfongjojo commented

I think the case we want to better handle is when an instance and one of its downstairs both live on a sled going through online update. In this particular scenario, its new vmm has to reconnect to all three downstairs.

In other cases - i.e., the user happens to start or stop an instance when one of its downstairs is on a sled being updated (but the vmm itself is on a running sled) - Nexus can potentially check the status of all the sleds involved in its attached disks and refuse to start/stop if one of them is known to be in the middle of an update.

smklein (Collaborator, Author) commented Mar 19, 2025

> I think the case we want to better handle is when an instance and one of its downstairs both live on a sled going through online update. In this particular scenario, its new vmm has to reconnect to all three downstairs.

In this case, I believe we would be live migrating an instance from the sled-under-update to a different sled within the rack.

My impression is that a "VMM being transferred to a destination sled" performs activation, much like a "VMM being started on an arbitrary sled, without migration".

Wouldn't these cases be the same? Either way, it's a VMM starting up with only 2-of-3 downstairs being contactable.

smklein (Collaborator, Author) commented Mar 19, 2025

> In other cases - i.e., the user happens to start or stop an instance when one of its downstairs is on a sled being updated (but the vmm itself is on a running sled) - Nexus can potentially check the status of all the sleds involved in its attached disks and refuse to start/stop if one of them is known to be in the middle of an update.

For what it's worth, I consider this a loss of availability -- IMO users should be able to start instances, even when one of their downstairs is undergoing an update.

In other words, I don't think Nexus should be monitoring these cases and treating them differently from the normal start-up case.

leftwo (Contributor) commented Mar 19, 2025

I also agree that instances should be able to start with 2/3 downstairs.

> This means that, if we have 2/3 downstairs online and the third shows up {one nanosecond, one second, one minute} later, it should look the same from a user's perspective: they'll have write access immediately, and things will become three-way redundant later, once everyone catches up.

Yes, this is true, however:

> and things will become three-way redundant later, once everyone catches up.

There is a performance impact to doing LiveRepair. If we jump into activation with 2/3 downstairs as soon as two are available, it means we will be taking a hit to bring that third downstairs back. Even if there are no differences between the three downstairs, a LiveRepair means we don't trust whatever is on the third downstairs and still have to walk every extent and compare. I think we can save ourselves a bunch of work if we can wait just a little bit before moving forward with 2/3 activation.

As a side note, we have the 3/3 requirement for activation because we initially did not have LiveRepair.

leftwo (Contributor) commented Mar 19, 2025

> In other cases - i.e., the user happens to start or stop an instance when one of its downstairs is on a sled being updated (but the vmm itself is on a running sled) - Nexus can potentially check the status of all the sleds involved in its attached disks and refuse to start/stop if one of them is known to be in the middle of an update.

> For what it's worth, I consider this a loss of availability -- IMO users should be able to start instances, even when one of their downstairs is undergoing an update.

> In other words, I don't think Nexus should be monitoring these cases and treating them differently from the normal start-up case.

I think for stopping an instance, we don't need to do anything different. When the instance is next started we will repair whatever is out of sync.

smklein (Collaborator, Author) commented Mar 19, 2025

> There is a performance impact to doing LiveRepair. If we jump into activation with 2/3 downstairs as soon as two are available, it means we will be taking a hit to bring that third downstairs back. Even if there are no differences between the three downstairs, a LiveRepair means we don't trust whatever is on the third downstairs and still have to walk every extent and compare. I think we can save ourselves a bunch of work if we can wait just a little bit before moving forward with 2/3 activation.

I think this case is discussed a bit in https://rfd.shared.oxide.computer/rfd/0542#_doing_reconciliation_with_only_two_downstairs and https://rfd.shared.oxide.computer/rfd/0542#_should_we_do_three_downstairs_reconciliation_if_possible , and I agree with @mkeeter's takes here:

> [ what happens if we do reconciliation with only two Downstairs, and then the third Downstairs comes up ]

> If the third Downstairs’s extent metadata matches our post-reconciliation flush and generation numbers, and all IO since startup is still buffered in memory, we can replay that IO and bring it into sync
> Otherwise, it has to come online through live-repair from one of the running Downstairs

> [ should we wait for the third downstairs? ]

> We can imagine a system which reaches WaitQuorum on 2/3 downstairs, then waits some amount of time for a third downstairs. If the third downstairs shows up, then it performs full-quorum negotiation; otherwise, it performs min-quorum negotiation. This would give us the benefit of full reconciliation most of the time, while still working if a downstairs is persistently unavailable.
> I don’t think this is a good idea. It doubles our surface area for testing; now, we have to make sure that both the min and full-quorum reconciliation paths work. In addition, we will usually pick the full-quorum reconciliation path, so the min-quorum path will be less well tested – and more likely to fail when we need it most.
> Always doing min-quorum reconciliation means that it will be thoroughly tested, instead of being an exceptional path.

Basically:

  • If we can dodge live-repair because the third downstairs comes up quickly, that seems good!
  • ... but this doesn't mean we have to wait for it, right? We can start without it, and if it happens to join fast enough, it can go through a fast-path of using buffered I/O. This lets the "common fast case" stay fast
  • ... and if it doesn't join quickly enough, the fallback to live repair is a more expensive, but very reasonable, approach to ensure we reach eventual redundancy without inducing unnecessary latency bubbles (see the sketch below)
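
As an illustration of that join-time decision (all types and names below are hypothetical, not from the Crucible codebase; the matching rule follows the RFD text quoted above):

```rust
/// Illustrative only: these types and names are not Crucible's.
struct ExtentMeta {
    generation: u64,
    flush_number: u64,
}

enum CatchUp {
    /// Resend the writes still held in memory, in order.
    Replay,
    /// Walk every extent and repair from a live downstairs.
    LiveRepair,
}

fn catch_up_plan(
    reconciled: &[ExtentMeta],
    late_joiner: &[ExtentMeta],
    all_io_since_activation_buffered: bool,
) -> CatchUp {
    // Does the late joiner's per-extent metadata match what we reconciled to?
    let metadata_matches = reconciled.len() == late_joiner.len()
        && reconciled
            .iter()
            .zip(late_joiner)
            .all(|(a, b)| a.generation == b.generation && a.flush_number == b.flush_number);

    if metadata_matches && all_io_since_activation_buffered {
        CatchUp::Replay
    } else {
        CatchUp::LiveRepair
    }
}
```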

leftwo (Contributor) commented Mar 19, 2025

> Can I ask, why is a timeout necessary? Couldn't we immediately start activating once we have 2/3, since that's the minimum bar to start processing writes post-activation too?

I added support for read-only regions to start without all downstairs present, and every time one starts, it does so without all three downstairs, even if all downstairs are ready to go. There is enough of a delay in the initial communication that one downstairs always arrives and makes it far enough that we activate without the others. This is not a big deal for read-only downstairs, as there is no actual live repair that needs to happen, so they just hop right in.

> • The downside of a timeout is that it causes a latency bubble in a relatively common case - I know it's not exactly the same, but it'll feel like we added a sleep(whatever_our_timeout_is) to all instance start requests during update

A small timeout here, I believe, is going to have much less of an impact than what LiveRepair will cause to IO. Something on the order of seconds to wait for all three; and given that we expect all three downstairs in most cases, it's going to be much less work.

> • Further, the "signal to not wait" seems hairy to rely on - many cases of availability loss will result in a failure, then delay, then much later explicitly identifying that a server is "fully gone". Expecting that we'll be told by someone upstack in every case where one server is inaccessible seems unreliable.

Yeah, maybe this is not a good idea.

leftwo (Contributor) commented Mar 19, 2025

I'll go put some comments over in the 2/3 RFD but I'll put some here as well:

> ... but this doesn't mean we have to wait for it, right? We can start without it, and if it happens to join fast enough, it can go through a fast-path of using buffered I/O. This lets the "common fast case" stay fast

This could work, but we don't have that code written yet. And, it would be more code to test/verify.

> ... and if it doesn't join quickly enough, the fallback to live repair is a more expensive, but very reasonable, approach to ensure we reach eventual redundancy without inducing unnecessary latency bubbles

We would have a latency bubble, though: the act of going through LiveRepair is going to cause IOs to slow down.

smklein (Collaborator, Author) commented Mar 19, 2025

> • The downside of a timeout is that it causes a latency bubble in a relatively common case - I know it's not exactly the same, but it'll feel like we added a sleep(whatever_our_timeout_is) to all instance start requests during update

> A small timeout here, I believe, is going to have much less of an impact than what LiveRepair will cause to IO. Something on the order of seconds to wait for all three; and given that we expect all three downstairs in most cases, it's going to be much less work.

I'd like to compare these options explicitly, for the question of:

"We are starting an upstairs, we have two-of-three downstairs, what do we do re: the third?"

| Scenario | Sleep for a few seconds, see if it comes back | Proceed optimistically, let the third catch up if it comes online later |
| --- | --- | --- |
| User-visible latency when all three downstairs are available quickly | ✅ - no issues | ✅ - no issues |
| User-visible latency when one downstairs is slow/unreliable, but exists | ❌ - this is bad, we'll delay access to the disk for a while. Hopefully we don't sleep for too long, but we are going to potentially prevent user access to the upstairs disk for a while | ✅ - no issues, we should be able to proceed with 2/3 |
| User-visible latency when one downstairs is fully dead | ❌ - this is quite bad, we'll delay access to the upstairs disk for a while | ✅ - no issues, we should be able to proceed with 2/3 |
| Cost to repair the third downstairs when three downstairs are available quickly | ✅ - there's nothing to repair, we reconciled with three downstairs | ✅ - as long as all I/O is buffered in memory, this should be as cheap as normal writes - replay I/O to bring the last-to-join downstairs in sync |
| Cost to repair the third downstairs when one downstairs is slow/unreliable, but exists | ✅ - assuming the third downstairs appears eventually, we paid the price by waiting, but reconciliation is normal. | ❓ - this depends on how much write traffic we have incurred. If our writes fit in-memory, this is cheap. If we've done a ton of writes, we'll fall back to live repair. |
| Cost to repair the third downstairs when one downstairs is fully dead | ❌ - we have to do live repair | ❌ - we have to do live repair |

My concern here is that the approach of "sleeping to wait for the third" causes user-visible latency to suffer, in any case where we have less-than-perfect redundancy.

I hear you about not wanting to unnecessarily perform live repair, but the buffering-in-memory approach mentioned in RFD 542 seems like it mitigates that issue in the common case, and keeps latency low in the vast majority of "2 of 3 downstairs are available" cases.

leftwo (Contributor) commented Mar 19, 2025

A few more thoughts here.
We need a "start with 2/3" option for Crucible; we don't have one now, and that impacts us for repair, so I'm not disagreeing that we need some solution for that issue. And in either situation you propose above, we still need to write the code to do activation with 2/3 downstairs.

Now, to the table: I'm not sure I agree with your conclusions, so let me see if I can comment on each row.

I've copied your table, then (after edit) made a row with my comments for the cell above it.

| Scenario | Sleep for a few seconds, see if it comes back | Proceed optimistically, let the third catch up if it comes online later |
| --- | --- | --- |
| User-visible latency when all three downstairs are available quickly | ✅ - no issues | ✅ - no issues |
| Alan: How quickly is quickly here? In my experience with the read-only code, it always starts with just one or two downstairs; the third always arrives after activation. This tells me the "Proceed optimistically" path, starting with 2/3, will always be running, on every instance start. We also have not written this code yet, so we don't really know what the impact is. It's hard to argue with code that has not been written, but it has got to have some impact? | | |
| User-visible latency when one downstairs is slow/unreliable, but exists | ❌ - this is bad, we'll delay access to the disk for a while. Hopefully we don't sleep for too long, but we are going to potentially prevent user access to the upstairs disk for a while | ✅ - no issues, we should be able to proceed with 2/3 |
| Alan: This is a non-typical situation, but possible, sure. What exactly is the situation, though? Is it always slow, or just slow to respond during the instance start-up, and then it starts working okay? The details here make a difference. | Alan: You say this is bad, but is it? If we sleep for 2 seconds, then we proceed with 2/3 reconcile and start up, is that bad? If the third downstairs shows up, then we go through LiveRepair. This is only adding 2 seconds if there is a failure. We control how long we wait here. | Alan: To say "no issues" here, while true, I don't feel is reflective of the situation. Yes, the activation will happen faster, but then what? You still have a downstairs that is slow/unreliable, so IO is going to start backing up and either we will kick this downstairs out, or constantly be playing catch-up. You may mask a problem during activation, but in either this case or the "sleep a few seconds" case you still have a downstairs that is slow/unreliable. |
| User-visible latency when one downstairs is fully dead | ❌ - this is quite bad, we'll delay access to the upstairs disk for a while | ✅ - no issues, we should be able to proceed with 2/3 |
| | Alan: You say "this is quite bad", but how are you defining bad? We sleep for a few seconds then proceed with activation. You say "we'll delay access to the upstairs disk for a while", but we get to pick that while, and we can choose a number that makes sense. | |
| Cost to repair the third downstairs when three downstairs are available quickly | ✅ - there's nothing to repair, we reconciled with three downstairs | ✅ - as long as all I/O is buffered in memory, this should be as cheap as normal writes - replay I/O to bring the last-to-join downstairs in sync |
| | | Alan: We have not written the code yet. We would aspire to make it transparent, but until we do, we don't know the actual cost. |
| Cost to repair the third downstairs when one downstairs is slow/unreliable, but exists | ✅ - assuming the third downstairs appears eventually, we paid the price by waiting, but reconciliation is normal. | ❓ - this depends on how much write traffic we have incurred. If our writes fit in-memory, this is cheap. If we've done a ton of writes, we'll fall back to live repair. |
| Alan: I have the same problem with this case here as above; the details matter here. | | |
| Cost to repair the third downstairs when one downstairs is fully dead | ❌ - we have to do live repair | ❌ - we have to do live repair |

I hope those points come across. Let's keep talking to figure out our best path forward here.

leftwo (Contributor) commented Mar 20, 2025

So specifically about these items:

> My concern here is that the approach of "sleeping to wait for the third" causes user-visible latency to suffer, in any case where we have less-than-perfect redundancy.

I don't feel that sleeping for at most a few seconds before activation is going to be noticed. My concern is that we are setting ourselves up to spend time writing code and increasing code complexity for a small gain during a compromised window.

> I hear you about not wanting to unnecessarily perform live repair, but the buffering-in-memory approach mentioned in RFD 542 seems like it mitigates that issue in the common case, and keeps latency low in the vast majority of "2 of 3 downstairs are available" cases.

I'm not convinced that a sleep of 2 seconds to wait for all three downstairs is a worse solution than writing a bunch of new code to handle replaying IOs received if we start with 2/3

We will always have to handle the "start with 2/3, then do LiveRepair" case, as we can only buffer IO for so long, so that solution needs to exist no matter what path we take. This extra code is to prevent (at most) 2 seconds of sleeping in the case where a downstairs is unavailable, but then comes back before we have reached the point where we had to kick it out and do Live Repair anyway. I would rather we spend time on things like supporting >1 TiB volumes, or growing a volume, or disk export. I think there are other features we should be working on before we complicate startup to cover this small window.

smklein (Collaborator, Author) commented Mar 20, 2025

> I don't feel that sleeping for at most a few seconds before activation is going to be noticed. My concern is that we are setting ourselves up to spend time writing code and increasing code complexity for a small gain during a compromised window.


I'm basically trying to avoid Clulow's Lament. Starting instances is a really critical use-case for Oxide, and one where I feel our time-to-start path should be optimized as much as possible. During update, a lot of instances will be in this two-of-three state, and will hit the full duration of whatever we use as the timeout.

I do think that, in a model where we are capable of starting with 2-of-3 downstairs instead of waiting, we effectively have a pipeline problem -- the quantity of data we can buffer in memory and defer writing out to the third downstairs (whenever it shows up) is the bound I'm proposing we use, rather than an arbitrary amount of time.

Re: your comments in the table, this is also what I mean by "quickly" -- for the "run optimistically with 2-of-3" case, it's "fast enough to not fill the in-memory dirty buffers". If write traffic is low, this could be a long time, and we still wouldn't need to incur live repair. Conversely, if the traffic is high-bandwidth and write-heavy, we might be forced to live repair the 3rd downstairs relatively soon.

> I'm not convinced that a sleep of 2 seconds to wait for all three downstairs is a worse solution than writing a bunch of new code to handle replaying IOs received if we start with 2/3

> We will always have to handle the "start with 2/3, then do LiveRepair" case, as we can only buffer IO for so long, so that solution needs to exist no matter what path we take. This extra code is to prevent (at most) 2 seconds of sleeping in the case where a downstairs is unavailable, but then comes back before we have reached the point where we had to kick it out and do Live Repair anyway. I would rather we spend time on things like supporting >1 TiB volumes, or growing a volume, or disk export. I think there are other features we should be working on before we complicate startup to cover this small window.

Ultimately I defer to your judgement on the code itself. I had hoped that this code would basically be:

  1. We have a queue of block writes, since the 2-of-3 downstairs activated. These are ordered.
  2. If the third downstairs becomes present, we write this queue of writes in the same order they were written to the 2-of-3 downstairs.
  3. If the oldest buffered write is evicted from memory before the third downstairs appears, we must go to live repair.

I had hoped that we could re-use existing buffering mechanisms to accomplish this, and re-use existing write mechanisms to transfer the blocks to the third downstairs once it appears (see the sketch below).
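
A minimal sketch of steps 1-3 above, assuming a simple bounded in-memory queue; these types are illustrative and not Crucible's actual buffering mechanism:

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of the buffering scheme in the three steps above;
/// not the actual Crucible data structures.
struct BufferedWrite {
    block: u64,
    data: Vec<u8>,
}

struct CatchUpBuffer {
    /// Ordered writes issued since the 2/3 activation (step 1).
    queue: VecDeque<BufferedWrite>,
    /// Bound on how much we are willing to hold in memory.
    max_bytes: usize,
    buffered_bytes: usize,
    /// Set once anything has been evicted; replay is no longer possible.
    overflowed: bool,
}

impl CatchUpBuffer {
    fn record(&mut self, w: BufferedWrite) {
        self.buffered_bytes += w.data.len();
        self.queue.push_back(w);
        while self.buffered_bytes > self.max_bytes {
            if let Some(old) = self.queue.pop_front() {
                self.buffered_bytes -= old.data.len();
                // Step 3: the oldest buffered write was evicted before the
                // third downstairs appeared, so replay is off the table.
                self.overflowed = true;
            }
        }
    }

    /// Step 2: if the third downstairs shows up in time, drain the queue to it
    /// in the original order; otherwise the caller falls back to live repair.
    fn writes_to_replay(&mut self) -> Option<impl Iterator<Item = BufferedWrite> + '_> {
        if self.overflowed {
            None // must go to live repair
        } else {
            Some(self.queue.drain(..))
        }
    }
}
```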

leftwo (Contributor) commented Mar 20, 2025

> I'm basically trying to avoid Clulow's Lament.

I agree as well, and in the limit the solution that is hinted at in RFD 542 could be the best solution. But, the details matter here and we don't have them all yet.

> Starting instances is a really critical use-case for Oxide, and one where I feel our time-to-start path should be optimized as much as possible. During update, a lot of instances will be in this two-of-three state, and will hit the full duration of whatever we use as the timeout.

I agree that instance start time is important as well. However, this 2/3 situation only arises when starting instances while an upgrade is happening on a sled. Instances that are running will remain running, and Crucible will resync when the update is completed. So the actual impact is a bit smaller than you suggest. And "the full duration" is only a few seconds longer to get to activation.

The current code will just hang on start until the update completes, and we all agree this is not desirable.

> I do think that, in a model where we are capable of starting with 2-of-3 downstairs instead of waiting, we effectively have a pipeline problem

The problem is that it's much more complicated than that.
It is in part a pipeline problem, but it's also a reconciliation problem. Remember, when Crucible first starts up, it does not know what state the downstairs are in. The reconciliation process is not too complicated: there is a set of rules, and we decide which of the three downstairs to pick for each extent; if things are the same, great. But we are not taking IO during that process -- everything is static.

In the 2/3 case, we have taken the two downstairs and decided between those two, then started accepting IO. But now a third downstairs comes along and wants to join. We still have to reconcile that downstairs with the existing downstairs, as we don't know what condition the third downstairs is in. In addition, the first two downstairs have taken IO since starting, so what they currently have is also changing while we are trying to compare it with what the third downstairs has. We are basically doing a LiveRepair.
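
As a rough illustration of the per-extent selection mentioned above (names are illustrative; the actual Crucible reconciliation rules also consider per-extent dirty state and other details not shown here):

```rust
/// Rough illustration of per-extent source selection during reconciliation.
/// The real Crucible rules are more involved; this only shows the general shape.
#[derive(Clone, Copy)]
struct ExtentVersion {
    generation: u64,
    flush_number: u64,
}

/// Pick which downstairs to use as the source of truth for one extent:
/// prefer the highest generation, then the highest flush number.
fn pick_source(versions: &[ExtentVersion]) -> usize {
    versions
        .iter()
        .enumerate()
        .max_by_key(|(_, v)| (v.generation, v.flush_number))
        .map(|(i, _)| i)
        .expect("at least one downstairs reported this extent")
}
```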

> Re: your comments in the table, this is also what I mean by "quickly" -- for the "run optimistically with 2-of-3" case, it's "fast enough to not fill the in-memory dirty buffers". If write traffic is low, this could be a long time, and we still wouldn't need to incur live repair. Conversely, if the traffic is high-bandwidth and write-heavy, we might be forced to live repair the 3rd downstairs relatively soon.

I wonder here what the actual amount of IO is that we could hold in memory before we have to just do LiveRepair. Also remember, we still need to reconcile the new incoming downstairs, so it's more than just buffering. Again here, the exact plan and details make a difference. If someone has just started an instance, I would think that means they plan to do something with it, so I would expect some level of IO to happen soon after booting it. How big of a window are we proposing to support with this solution, and how much code do we need to write to do it?

> I had hoped that this code would basically be:

I like the idea you suggest here, and it sounds simple when you pose it that way :)
But that's not the problem we need to solve. The problem is making an unknown downstairs consistent (while taking IO).

The "pause for a few seconds, then kick out any missing downstairs and require LiveRepair for them to rejoin" approach is a much simpler problem to solve, with the solution not being optimal when starting an instance that has a downstairs on a sled being updated -- where "not optimal" means adding a few seconds to boot. The optimal solution here is, I think, a fair amount of work.

But the details of the solution proposed in RFD 542 matter, so we should consider those once we have more details on what they are and what the cost of implementing and testing them would be. We have pretty good confidence in the reconciliation and LiveRepair code paths we have now, so I feel like there should be pretty good reasons to change them.

mkeeter (Contributor) commented Mar 20, 2025

I've now moved 0542 into discussion, and added "no replay, only live-repair" as an alternative.

In particular, I think we may get "two-Downstairs reconciliation with replay and live-repair" for free if we kick the third Downstairs into Offline as we do reconciliation with the other two.
