post sled expungement, not all volumes were repaired #7790
What's happening here is that the volume delete saga is hard-deleting the snapshot volume as a result of deleting a snapshot; see `omicron/nexus/src/app/sagas/volume_delete.rs` lines 496 to 510 at commit 0ea05e3.
That check looks for lingering regions and prevents the deletion of a volume if there still are some, but it does not check for lingering region snapshots. Yesterday and today I've been looking at the volume deletion logic. I'm not 100% convinced that we need to keep this volume around, but it definitely makes the region snapshot replacement code unhappy.
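To make the gap concrete, here is a minimal sketch of the guard's shape as just described. The types and function names are hypothetical stand-ins, not the actual omicron datastore API or the code in `volume_delete.rs`:

```rust
// Hypothetical stand-ins for the omicron datastore API; this is the shape
// of the check as described, not the code in volume_delete.rs.
struct Datastore;

impl Datastore {
    fn regions_for_volume(&self, _volume_id: u128) -> Vec<u64> {
        Vec::new() // stub: no lingering regions for this volume
    }

    fn region_snapshots_for_volume(&self, _volume_id: u128) -> Vec<u64> {
        vec![42] // stub: one region snapshot still references the volume
    }
}

fn can_hard_delete_volume(ds: &Datastore, volume_id: u128) -> bool {
    // Lingering regions block the hard-delete...
    if !ds.regions_for_volume(volume_id).is_empty() {
        return false;
    }
    // ...but there is no equivalent guard for lingering region snapshots,
    // such as:
    //
    //     if !ds.region_snapshots_for_volume(volume_id).is_empty() {
    //         return false;
    //     }
    //
    // so the snapshot volume is hard-deleted out from under them.
    true
}

fn main() {
    let ds = Datastore;
    println!(
        "lingering region snapshots: {:?}",
        ds.region_snapshots_for_volume(1)
    );
    // The volume is deleted anyway -- this is the gap being described.
    assert!(can_hard_delete_volume(&ds, 1));
}
```

(One possible fix in this direction would be adding the commented-out region snapshot check, though as the discussion below explains, the change that actually landed takes a different approach.)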
Note that the impact of this is that if any snapshot is deleted, region snapshot replacement cannot take place for the associated region snapshots or read-only regions. Once this is fixed, they should proceed normally.
There were many tests for when a snapshot is deleted _after_ a region snapshot replacement was requested and the process had started, but not for when one was deleted _before_ any of the associated sagas ran. Deleting the snapshot caused the snapshot volume to be hard-deleted, which caused the region snapshot replacement machinery to believe it couldn't do its job.

My first thought was that Nexus needed to _not_ hard-delete the snapshot volume, and that this would let the region snapshot replacement proceed. Nexus would also need to reconstruct any of the snapshot volumes that were already hard-deleted, through some arduous process of finding all the copies of the snapshot volume in the read-only parent fields of other volumes.

But then I started diving into the code relevant to the early stages of region snapshot replacement. When inserting a region snapshot replacement request, the snapshot volume ID was required so that a volume repair record could "lock" the volume. The comments there state that the repair record isn't needed (as snapshot volumes are never directly constructed, and the lock is meant to serialize access to a volume so that an associated Upstairs won't receive multiple replacement requests), and that the lock was taken out of an abundance of caution. Looking at the "region snapshot replacement start" saga, it doesn't need the contents of the volume either: the snapshot volume ID is used to find other resources, but the volume contents themselves are never read.

Digging further, it became apparent that not only was this lock not required (narrator: spoiler alert, it actually was), the snapshot volume wasn't either! The first step in realizing this was to stop creating the volume repair record with the snapshot's (or read-only region's) volume ID when inserting region snapshot replacement requests into the DB. This required changing a bunch of unit tests, but those unit tests were creating 100% blank volumes to pass along for this purpose, further proving that the contents of the volume were not actually required:

```patch
@@ -1675,21 +1484,6 @@ mod test {
         let region_id = Uuid::new_v4();
         let snapshot_id = Uuid::new_v4();
 
-        let volume_id = VolumeUuid::new_v4();
-
-        datastore
-            .volume_create(
-                volume_id,
-                VolumeConstructionRequest::Volume {
-                    id: Uuid::new_v4(), // not required to match!
-                    block_size: 512,
-                    sub_volumes: vec![], // nothing needed here
-                    read_only_parent: None,
-                },
-            )
-            .await
-            .unwrap();
-
         let request = RegionSnapshotReplacement::new_from_region_snapshot(
             dataset_id,
             region_id,
```

The next step in realizing that the snapshot volume was not required: no other region snapshot replacement related code had to change! _Most_ of the test suite passed after this change; however, I added an integration test for the issue seen in #7790, and it plus some others were intermittently failing. Was this the wrong move? Nope, it just revealed a case where serialization _was_ required:

- Without a volume repair record on the snapshot volume, multiple region snapshot replacement requests could now run concurrently for the same snapshot volume.
- This is ok! The swapping of read-only targets will be serialized by the transaction in `volume_replace_snapshot`.
- But: caller A and caller B both attempting to
  1. get the number of currently allocated regions for a snapshot volume, then
  2. allocate a single read-only replacement region for the snapshot volume using a redundancy level increased by 1

  will, with a certain interleaving, _both_ think that they've increased the number of allocated regions for the snapshot volume by 1, yet both see the same new region. Classic TOCTOU! (A concrete sketch of this race follows below.)
- Two region snapshot replacement requests for different region snapshots that both have the _same_ new replacement region will result in (after the two `volume_replace_snapshot` calls) a snapshot volume that has duplicate targets (this was the motivation for the check in #7846).

Concurrently getting the number of currently allocated regions and then calling `arbitrary_region_allocate` in this way is not safe, so Nexus is required to serialize these callers. The general way to do this is... using the volume repair record to "lock" the volume.
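To make the interleaving concrete, here is a self-contained sketch of that race using a toy in-memory datastore. The method names mirror the prose above, but the implementation is a hypothetical stand-in, not Nexus code:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct Datastore {
    // Region IDs currently allocated for the snapshot volume.
    regions: HashSet<u64>,
    next_region_id: u64,
}

impl Datastore {
    // Step 1: read the current number of allocated regions.
    fn allocated_region_count(&self) -> usize {
        self.regions.len()
    }

    // Step 2: allocate until `redundancy` regions exist, keeping existing
    // ones. If the volume already has `redundancy` regions this is a no-op
    // that returns the regions already present -- which is exactly how two
    // racing callers can both "allocate" the same new region.
    fn arbitrary_region_allocate(&mut self, redundancy: usize) -> Vec<u64> {
        while self.regions.len() < redundancy {
            self.regions.insert(self.next_region_id);
            self.next_region_id += 1;
        }
        self.regions.iter().copied().collect()
    }
}

fn main() {
    let ds = Arc::new(Mutex::new(Datastore::default()));

    // The bad interleaving: both callers read the count *before* either
    // allocates, so both compute the same target redundancy of 0 + 1.
    let count_a = ds.lock().unwrap().allocated_region_count(); // 0
    let count_b = ds.lock().unwrap().allocated_region_count(); // 0

    let regions_a = ds.lock().unwrap().arbitrary_region_allocate(count_a + 1);
    let regions_b = ds.lock().unwrap().arbitrary_region_allocate(count_b + 1);

    // Both callers think they added a region, but both see the same one.
    assert_eq!(regions_a, regions_b);
    println!("caller A sees {regions_a:?}, caller B sees {regions_b:?}");
}
```

Holding one lock across both steps (which is what the volume repair record provides) removes the race: the second caller cannot read the region count until the first caller's allocation is visible.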
Before this commit, creating a volume repair record for a hard-deleted volume was not possible: the creation routine would return an error, saying that it didn't make sense to lock a non-existent volume. Nexus is faced with the following scenario:

- users can delete snapshots at any time, which will hard-delete the snapshot volume
- getting the current number of allocated regions for the hard-deleted snapshot volume and calling `arbitrary_region_allocate` needs to be exclusive

So this commit relaxes that restriction: volume repair records can now be created for volumes that are either not yet created or hard-deleted. This means that region snapshot replacements will still be serialized even for hard-deleted snapshot volumes.

The alternative would be either some other form of exclusivity (another lock), or spending cycles changing the "get the number of currently allocated regions, then allocate an additional one" pattern to be safe for multiple callers. In the name of urgency, the existing volume repair record is used, without the aforementioned restriction.

Fixes #7790
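A rough sketch of the relaxed locking rule, again with hypothetical types rather than the real datastore schema: the repair record is keyed only by volume ID, so taking the lock no longer requires the volume row to exist:

```rust
use std::collections::HashSet;

type VolumeUuid = u128;

#[derive(Default)]
struct Datastore {
    // One active repair record per volume ID; note there is no table of
    // volumes here at all -- the lock does not care whether the volume
    // row still exists.
    repair_records: HashSet<VolumeUuid>,
}

#[derive(Debug)]
enum LockError {
    AlreadyLocked,
}

impl Datastore {
    // Before the fix, the real routine errored out when the volume had been
    // hard-deleted; here only a duplicate record is an error, so racing
    // replacement requests stay serialized even for deleted snapshot volumes.
    fn volume_repair_lock(&mut self, volume_id: VolumeUuid) -> Result<(), LockError> {
        if self.repair_records.insert(volume_id) {
            Ok(())
        } else {
            Err(LockError::AlreadyLocked)
        }
    }

    fn volume_repair_unlock(&mut self, volume_id: VolumeUuid) {
        self.repair_records.remove(&volume_id);
    }
}

fn main() {
    let mut ds = Datastore::default();
    let snapshot_volume: VolumeUuid = 7; // hard-deleted long ago

    // Taking the lock succeeds even though no volume row exists:
    assert!(ds.volume_repair_lock(snapshot_volume).is_ok());
    // A concurrent replacement request for the same volume must wait:
    assert!(ds.volume_repair_lock(snapshot_volume).is_err());

    ds.volume_repair_unlock(snapshot_volume);
}
```

Since the record itself provides the exclusivity, whether the volume row still exists is irrelevant to serializing the "count then allocate" sequence.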
After an expungement of sled 16 on dublin, the automatic region and snapshot repair process kicked off.
Many repairs were done:
However, even after several hours, there are still things needing repair that don't seem to be making forward progress:
Trying to force a repair, we get a complaint about a missing volume:
Checking the database, indeed, that volume does not exist:
Trolling through the VCRs of all the volumes, we can see there are still references to the IP address of the expunged sled:
Part of what was created on this rack was a snapshot with 25 levels, so it's not surprising that there could be that many references to a single downstairs target.
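For context on why one address can show up so many times: each snapshot level nests another `read_only_parent` in the volume construction request. A hedged sketch with a simplified stand-in for the VCR type (not Crucible's actual definition) illustrates the multiplication:

```rust
use std::net::SocketAddr;

// `Vcr` is a simplified stand-in: the Volume variant's fields appear in the
// patch quoted in the fix discussion above, but the Region variant and its
// target list are assumptions, not Crucible's real definition.
enum Vcr {
    Volume {
        sub_volumes: Vec<Vcr>,
        read_only_parent: Option<Box<Vcr>>,
    },
    Region {
        targets: Vec<SocketAddr>,
    },
}

// Walk the tree, counting references to a single downstairs address.
fn count_references(vcr: &Vcr, expunged: SocketAddr) -> usize {
    match vcr {
        Vcr::Volume { sub_volumes, read_only_parent } => {
            sub_volumes
                .iter()
                .map(|sv| count_references(sv, expunged))
                .sum::<usize>()
                + read_only_parent
                    .as_deref()
                    .map_or(0, |rop| count_references(rop, expunged))
        }
        Vcr::Region { targets } => {
            targets.iter().filter(|t| **t == expunged).count()
        }
    }
}

fn main() {
    // A made-up downstairs address standing in for the expunged sled's IP.
    let expunged: SocketAddr = "[fd00:1122:3344:101::5]:19000".parse().unwrap();

    // Stack up layers the way snapshots do: each level wraps the previous
    // one as its read_only_parent. Three levels here stand in for the
    // 25-level volume seen on dublin.
    let mut vcr = Vcr::Region { targets: vec![expunged] };
    for _ in 0..3 {
        vcr = Vcr::Volume {
            sub_volumes: vec![Vcr::Region { targets: vec![expunged] }],
            read_only_parent: Some(Box::new(vcr)),
        };
    }

    // One reference per layer plus the innermost region: a deep stack of
    // snapshot levels multiplies references to a single expunged target.
    assert_eq!(count_references(&vcr, expunged), 4);
}
```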