Crucible must be able to activate with 2/3 downstairs redundancy #7826
To support this, the Crucible upstairs has to support either a timeout during activation when only 2/3 downstairs are present, or be given some signal that it should not wait for all three downstairs. My concern with activating at 2/3 here is preventing additional downstairs replacements from being attempted while things are in a "degraded" state.
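For concreteness, a minimal sketch of the two options being discussed; the names and shape here are hypothetical, not Crucible's actual API.

```rust
use std::time::Duration;

/// Hypothetical policy handed to the upstairs at activation time
/// (illustrative only, not the real Crucible interface).
enum ActivationPolicy {
    /// Wait for all three downstairs indefinitely (current behavior).
    WaitForAll,
    /// Wait up to `timeout` for a missing downstairs, then activate at 2/3.
    TimeoutThenDegraded { timeout: Duration },
    /// The caller already knows a downstairs is unavailable; activate at 2/3 now.
    ProceedWithTwo,
}
```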
Can I ask, why is a timeout necessary? Couldn't we immediately start activating once we have 2/3, since that's the minimum bar to start processing writes post-activation too?
Putting this all another way - what's so important about "activation" vs "normal access" that requires perfect redundancy? Would it be possible to treat this case as:
See also the partially-written RFD 542, which talks about this problem.
Totally agreed with the perspective in RFD 542 - this is a much more concrete view of my handwavey bullet points. Specifically:
This means that, if we have 2/3 downstairs online, and the third shows up {one nanosecond, one second, one minute} later, it should look the same from a user's perspective: they'll have write access immediately, and things will become three-way redundant later, once everyone catches up.
I think the case we want to better handle is when an instance and one of its downstairs both live on a sled going through online update. In this particular scenario, its new VMM has to reconnect to all three downstairs. In other cases - i.e. the user happens to start or stop an instance when one of its downstairs is on a sled being updated (but the VMM itself is on a running sled) - Nexus can potentially check the status of all the sleds involved in its attached disks and refuse to start/stop if one of them is known to be in the middle of an update.
In this case, I believe we would be live migrating an instance from the sled-under-update to a different sled within the rack. My impression is that a "VMM being transferred to a destination sled" performs activation, much like a "VMM being started on an arbitrary sled, without migration". Wouldn't these cases be the same? Either way, it's a VMM starting up with only 2-of-3 downstairs being contactable.
For what it's worth, I consider this a loss of availability -- IMO users should be able to start instances, even when one of their downstairs is undergoing an update. In other words, I don't think Nexus should be monitoring these cases and treating them differently from the normal start-up case.
I also agree that instances should be able to start with 2/3 downstairs.
Yes, this is true, however
There is a performance impact to doing LiveRepair. If we jump into activation with 2/3 downstairs as soon as two are available, it means we will be taking a hit to bring that third downstairs back. Even if there are no differences between the three downstairs, a LiveRepair means we don't trust whatever is on the third downstairs and still have to walk every extent and compare. I think we can save ourselves a bunch of work if we can wait just a little bit before moving forward with 2/3 activation. As a side note, we have the 3/3 requirement for activation because we initially did not have LiveRepair.
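As a rough illustration of why that walk is expensive - this is a simplified sketch with hypothetical types, not Crucible's actual LiveRepair code - every extent gets visited even when the rejoining downstairs turns out to be identical:

```rust
/// Hypothetical per-extent metadata; the real system also compares actual
/// data and copies extents over the network when they differ.
#[derive(Clone, PartialEq)]
struct ExtentMeta {
    gen: u64,
    flush: u64,
    dirty: bool,
}

/// Walk every extent on the rejoining downstairs and repair any that do not
/// match the trusted copy. Assumes both slices have the same length.
fn live_repair(trusted: &[ExtentMeta], rejoining: &mut [ExtentMeta]) -> usize {
    let mut repaired = 0;
    for (good, candidate) in trusted.iter().zip(rejoining.iter_mut()) {
        // Even when nothing differs, this comparison still happens per extent.
        if candidate != good {
            *candidate = good.clone(); // stand-in for copying the extent's data
            repaired += 1;
        }
    }
    repaired
}
```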
I think for stopping an instance, we don't need to do anything different. When the instance is next started, we will repair whatever is out of sync.
I think this case is discussed a bit in https://rfd.shared.oxide.computer/rfd/0542#_doing_reconciliation_with_only_two_downstairs and https://rfd.shared.oxide.computer/rfd/0542#_should_we_do_three_downstairs_reconciliation_if_possible, and I agree with @mkeeter's takes here: [ what happens if we do reconciliation with only two Downstairs, and then the third Downstairs comes up ]
[ should we wait for the third downstairs? ]
Basically:
I added support for read-only regions to start without all downstairs present, and it starts without all three downstairs every time, even if all downstairs are ready to go. There is enough of a delay in the initial communication that one downstairs always arrives and makes it far enough that we activate without the others. This is not a big deal for read-only downstairs, as there is no actual live repair that needs to happen, so they just hop right in.
A small timeout here, I believe, is going to be much less of an impact than what LiveRepair will cause to IO. Something on the order of seconds to wait for all three, and given we expect all three downstairs in most cases, it's going to be much less work.
Yeah, maybe this is not a good idea.
I'll go put some comments over in the 2/3 RFD but I'll put some here as well:
This could work, but we don't have that code written yet. And, it would be more code to test/verify.
We would have a latency bubble though, the act of going through
I'd like to compare these options explicitly, for the question of: "We are starting an upstairs, we have two-of-three downstairs, what do we do re: the third?"
My concern here is that the approach of "sleeping to wait for the third" causes user-visible latency to suffer in any case where we have less-than-perfect redundancy. I hear you about not wanting to unnecessarily perform live repair, but the buffering-in-memory approach mentioned in RFD 542 seems like it mitigates that issue in the common case, and keeps latency low in the vast majority of "2 of 3 downstairs are available" cases.
A few more thoughts here. Now, to the table: I'm not sure if I agree with your conclusions, so let me see if I can comment on each row. I've copied your table, then (after edit) made a row with my comments for the cell above it.
I hope those points come across. Let's keep talking to figure out our best path forward here.
So specifically about these items:
I don't feel that sleeping for at most a few seconds before activation is going to be noticed. My concern is that we are setting ourselves up to spend time writing code and increasing code complexity for a small gain during a compromised window.
I'm not convinced that a sleep of 2 seconds to wait for all three downstairs is a worse solution than writing a bunch of new code to handle replaying IOs received if we start with 2/3. We will always have to handle the "start with 2/3, then do LiveRepair" case, as we can only buffer IO for so long, so that solution needs to exist no matter what path we take. This extra code is to prevent (at most) 2 seconds of sleeping in the case where a downstairs is unavailable, but then comes back before we have reached the point where we had to kick it out and do LiveRepair anyway. I would rather we spend time on things like supporting >1 TiB volumes, or growing a volume, or disk export. I think there are other features we should be working on before we complicate startup to cover this small window.
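For concreteness, a minimal sketch of what the "sleep briefly, then kick out whoever is missing" approach could look like, assuming a tokio-based upstairs; all names are hypothetical, and for simplicity it assumes downstairs 0 and 1 are the two already reachable:

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Hypothetical: resolves once the given downstairs finishes negotiation.
async fn downstairs_ready(_id: usize) { /* ... */ }

/// Block until a quorum of two is present, then give the third downstairs a
/// short grace period. If it misses the window, it is left out and must
/// rejoin via LiveRepair after activation.
async fn wait_for_activation(grace: Duration) -> Vec<usize> {
    downstairs_ready(0).await;
    downstairs_ready(1).await;
    let mut online = vec![0, 1];

    if timeout(grace, downstairs_ready(2)).await.is_ok() {
        online.push(2);
    }
    online
}
```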
I'm basically trying to avoid Clulow's Lament. Starting instances is a really critical use-case for Oxide, and one where I feel our time-to-start path should be optimized as much as possible. During update, a lot of instances will be in this two-of-three state, and will hit the full duration of whatever we use as the timeout. I do think that, in a model where we are capable of starting with 2-of-3 downstairs instead of waiting, we effectively have a pipeline problem -- the quantity of data we can buffer in-memory and defer writing out to the third downstairs (whenever it shows up) is the bound I'm proposing using, rather than an arbitrary amount of time. Re: your comments in the table, this is also what I mean by "quickly" -- for the "run optimistically with 2-of-3" case, it's "fast enough to not fill the in-memory dirty buffers". If write traffic is low, this could be a long time, and we still wouldn't need to incur live repair. Conversely, if the traffic is high-bandwidth and write-heavy, we might be forced to live repair the 3rd downstairs relatively soon.
Ultimately I defer to your judgement on the code itself. I had hoped that this code would basically be:
I had hoped that we could re-use existing buffering mechanisms to accomplish this, and re-use existing write mechanisms to transfer the blocks to the third downstairs once it appears.
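A rough sketch of the kind of bounded in-memory replay buffer being discussed (an assumed shape only - RFD 542 doesn't prescribe this structure, and every name here is hypothetical): writes acknowledged by the two online downstairs are also queued for the absent one, and if the queue overflows we give up and fall back to LiveRepair.

```rust
use std::collections::VecDeque;

/// Hypothetical write record kept for the downstairs that missed activation.
struct DeferredWrite {
    block: u64,
    data: Vec<u8>,
}

enum BufferOutcome {
    Queued,
    /// The buffer overflowed; the missing downstairs now needs full LiveRepair.
    FellBehind,
}

/// Bounded queue of writes to replay once the third downstairs reconnects.
struct ReplayBuffer {
    queued: VecDeque<DeferredWrite>,
    bytes: usize,
    max_bytes: usize,
    fell_behind: bool,
}

impl ReplayBuffer {
    fn new(max_bytes: usize) -> Self {
        Self { queued: VecDeque::new(), bytes: 0, max_bytes, fell_behind: false }
    }

    /// Record a write that the two online downstairs have already accepted.
    fn record(&mut self, block: u64, data: Vec<u8>) -> BufferOutcome {
        if self.fell_behind || self.bytes + data.len() > self.max_bytes {
            // Past this point replay is no longer possible; drop the queue
            // and let LiveRepair handle the absent downstairs later.
            self.fell_behind = true;
            self.queued.clear();
            self.bytes = 0;
            return BufferOutcome::FellBehind;
        }
        self.bytes += data.len();
        self.queued.push_back(DeferredWrite { block, data });
        BufferOutcome::Queued
    }

    /// Drain the buffered writes to the newly-arrived downstairs.
    fn replay(&mut self, mut send: impl FnMut(u64, &[u8])) {
        for w in self.queued.drain(..) {
            send(w.block, &w.data);
        }
        self.bytes = 0;
    }
}
```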
I agree as well, and in the limit the solution that is hinted at in RFD 542 could be the best solution. But, the details matter here and we don't have them all yet.
I agree that instance start time is important as well. However, this 2/3 situation only arises when starting instances while an upgrade is happening on a sled. Instances that are running will remain running, and Crucible will resync when the update is completed. So the actual impact is a bit smaller than you suggest. And "the full duration" is taking a few seconds longer to get to activation. The current code will just hang on start until the update completes, and we all agree this is not desirable.
The problem is that it's much more complicated than that. In the 2/3 case, we have taken the two downstairs, decided between those two, and then started accepting IO. But now a third downstairs comes along and wants to join. We still have to reconcile that downstairs with the existing downstairs, as we don't know what the condition of the third downstairs is. In addition, the first two downstairs have taken IO since starting, so what they currently have is also changing while we are trying to compare it with what the third downstairs has. We are basically doing a LiveRepair.
I wonder here what the actual amount of IO is that we could hold in memory before we have to just do LiveRepair. Also remember, we still need to reconcile the new incoming downstairs, so it's more than just buffering. Again here, the exact plan and details make a difference. If someone has just started an instance, I would think that means they plan to do something with it. So I would expect that some level of IO is going to happen probably soon after booting it. How big of a window are we proposing to support with this solution, and how much code do we need to write to do it?
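To put some purely illustrative numbers on that question (these figures are assumptions for the sake of arithmetic, not measurements of anything): if the upstairs could afford a 1 GiB replay buffer and the guest sustains 100 MiB/s of writes, the buffer covers only about 10 seconds before LiveRepair becomes unavoidable; at 10 MiB/s it covers roughly 100 seconds. The window is bounded by buffer size divided by write bandwidth, so it shrinks in exactly the write-heavy cases where repair is most expensive.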
I like the idea you suggest here, and it sounds simple when you pose it that way :) Pausing for a few seconds, then kicking out any missing downstairs and requiring LiveRepair for them to rejoin, is a much simpler problem to solve, with the solution not being optimal when starting an instance that has a downstairs on a sled being updated - where "not optimal" means adding a few seconds to boot. The optimal solution here, I think, is a fair amount of work. But the details of the solution proposed in RFD 542 matter, so we should consider those once we have more details on what they are, and the cost of implementing and testing them. We have pretty good confidence in the reconciliation and LiveRepair code paths we have now, so I feel like there should be pretty good reasons to change them.
I've now moved 0542 into discussion, and added "no replay, only live-repair" as an alternative. In particular, I think we may get "two-Downstairs reconciliation with replay and live-repair" for free if we kick the third Downstairs into
Currently, when a Crucible upstairs is activating downstairs, it blocks until all three downstairs have responded.
This requires perfect availability of all downstairs disks for instances to be started, which is especially problematic in the live update case. When we pick any single sled to be updated, we migrate all instances off that sled and reboot it; for the duration of that sled being updated:
This is rough - it'll be a user-visible lack-of-availability, and will continue happening for different instances as we proceed with update across sleds in the rack.