post sled expungement, not all volumes were repaired #7790
What's happening here is that the volume delete saga is hard-deleting the snapshot volume as a result of deleting a snapshot; see `omicron/nexus/src/app/sagas/volume_delete.rs` lines 496 to 510 at commit 0ea05e3.
That check looks for lingering regions and prevents the deletion of a volume if there still are some, but it does not check for lingering region snapshots. Yesterday and today I've been looking at the volume deletion logic. I'm not 100% convinced that we need to keep this volume around, but it definitely makes the region snapshot replacement code unhappy.
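To make the gap concrete, here is a minimal sketch of the guard's shape as just described. The types and function names are hypothetical stand-ins, not the actual omicron datastore API or the code in `volume_delete.rs`:

```rust
// Hypothetical stand-ins for the omicron datastore API; this is the shape
// of the check as described, not the code in volume_delete.rs.
struct Datastore;

impl Datastore {
    fn regions_for_volume(&self, _volume_id: u128) -> Vec<u64> {
        Vec::new() // stub: no lingering regions for this volume
    }

    fn region_snapshots_for_volume(&self, _volume_id: u128) -> Vec<u64> {
        vec![42] // stub: one region snapshot still references the volume
    }
}

fn can_hard_delete_volume(ds: &Datastore, volume_id: u128) -> bool {
    // Lingering regions block the hard-delete...
    if !ds.regions_for_volume(volume_id).is_empty() {
        return false;
    }
    // ...but there is no equivalent guard for lingering region snapshots,
    // such as:
    //
    //     if !ds.region_snapshots_for_volume(volume_id).is_empty() {
    //         return false;
    //     }
    //
    // so the snapshot volume is hard-deleted out from under them.
    true
}

fn main() {
    let ds = Datastore;
    println!(
        "lingering region snapshots: {:?}",
        ds.region_snapshots_for_volume(1)
    );
    // The volume is deleted anyway -- this is the gap being described.
    assert!(can_hard_delete_volume(&ds, 1));
}
```

(One possible fix in this direction would be adding the commented-out region snapshot check, though as the discussion below explains, the change that actually landed takes a different approach.)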
Note that the impact of this is that if any snapshot is deleted, region snapshot replacement cannot take place for the associated region snapshots or read-only regions. Once this is fixed, they should proceed normally.
There were many tests for when a snapshot is deleted _after_ a region snapshot replacement was requested and the process had started, but not for when one was deleted _before_ any of the associated sagas ran. Deleting the snapshot caused the snapshot volume to be hard-deleted, which caused the region snapshot replacement machinery to believe it couldn't do its job.

My first thought was that Nexus needed to _not_ hard-delete the snapshot volume, and that this would let the region snapshot replacement proceed. Nexus would also need to reconstruct any of the snapshot volumes that were already hard-deleted, through some arduous process of finding all the copies of the snapshot volume in the read-only parent fields of other volumes.

But then I started diving into the code relevant to the early stages of region snapshot replacement. When inserting a region snapshot replacement request, the snapshot volume ID was required so that a volume repair record could "lock" the volume. The comments there state that the repair record isn't needed (as snapshot volumes are never directly constructed, and the lock is meant to serialize access to a volume so that an associated Upstairs won't receive multiple replacement requests), and that the lock was taken out of an abundance of caution. Looking at the "region snapshot replacement start" saga, it doesn't need the contents of the volume either: the snapshot volume ID is used to find other resources, but the volume contents themselves are never read.

Digging further, it became apparent that not only was this lock not required (narrator: spoiler alert, it actually was), the snapshot volume wasn't either! The first step in realizing this was to stop creating the volume repair record with the snapshot's (or read-only region's) volume ID when inserting region snapshot replacement requests into the DB. This required changing a bunch of unit tests, but those unit tests were creating 100% blank volumes to pass along for this purpose, further proving that the contents of the volume were not actually required:

```patch
@@ -1675,21 +1484,6 @@ mod test {
         let region_id = Uuid::new_v4();
         let snapshot_id = Uuid::new_v4();
 
-        let volume_id = VolumeUuid::new_v4();
-
-        datastore
-            .volume_create(
-                volume_id,
-                VolumeConstructionRequest::Volume {
-                    id: Uuid::new_v4(), // not required to match!
-                    block_size: 512,
-                    sub_volumes: vec![], // nothing needed here
-                    read_only_parent: None,
-                },
-            )
-            .await
-            .unwrap();
-
         let request = RegionSnapshotReplacement::new_from_region_snapshot(
             dataset_id,
             region_id,
```

The next step in realizing that the snapshot volume was not required: no other region snapshot replacement related code had to change! _Most_ of the test suite passed after this change; however, I added an integration test for the issue seen in #7790, and it plus some others were intermittently failing. Was this the wrong move? Nope, it just revealed a case where serialization _was_ required:

- Without a volume repair record on the snapshot volume, multiple region snapshot replacement requests could now run concurrently for the same snapshot volume.
- This is ok! The swapping of read-only targets will be serialized by the transaction in `volume_replace_snapshot`.
- But: caller A and caller B both attempting to
  1. get the number of currently allocated regions for a snapshot volume, then
  2. allocate a single read-only replacement region for the snapshot volume using a redundancy level increased by 1

  will, with a certain interleaving, _both_ think that they've increased the number of allocated regions for the snapshot volume by 1, yet both see the same new region. Classic TOCTOU! (A concrete sketch of this race follows below.)
- Two region snapshot replacement requests for different region snapshots that both have the _same_ new replacement region will result in (after the two `volume_replace_snapshot` calls) a snapshot volume that has duplicate targets (this was the motivation for the check in #7846).

Concurrently getting the number of currently allocated regions and then calling `arbitrary_region_allocate` in this way is not safe, so Nexus is required to serialize these callers. The general way to do this is... using the volume repair record to "lock" the volume.
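To make the interleaving concrete, here is a self-contained sketch of that race using a toy in-memory datastore. The method names mirror the prose above, but the implementation is a hypothetical stand-in, not Nexus code:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct Datastore {
    // Region IDs currently allocated for the snapshot volume.
    regions: HashSet<u64>,
    next_region_id: u64,
}

impl Datastore {
    // Step 1: read the current number of allocated regions.
    fn allocated_region_count(&self) -> usize {
        self.regions.len()
    }

    // Step 2: allocate until `redundancy` regions exist, keeping existing
    // ones. If the volume already has `redundancy` regions this is a no-op
    // that returns the regions already present -- which is exactly how two
    // racing callers can both "allocate" the same new region.
    fn arbitrary_region_allocate(&mut self, redundancy: usize) -> Vec<u64> {
        while self.regions.len() < redundancy {
            self.regions.insert(self.next_region_id);
            self.next_region_id += 1;
        }
        self.regions.iter().copied().collect()
    }
}

fn main() {
    let ds = Arc::new(Mutex::new(Datastore::default()));

    // The bad interleaving: both callers read the count *before* either
    // allocates, so both compute the same target redundancy of 0 + 1.
    let count_a = ds.lock().unwrap().allocated_region_count(); // 0
    let count_b = ds.lock().unwrap().allocated_region_count(); // 0

    let regions_a = ds.lock().unwrap().arbitrary_region_allocate(count_a + 1);
    let regions_b = ds.lock().unwrap().arbitrary_region_allocate(count_b + 1);

    // Both callers think they added a region, but both see the same one.
    assert_eq!(regions_a, regions_b);
    println!("caller A sees {regions_a:?}, caller B sees {regions_b:?}");
}
```

Holding one lock across both steps (which is what the volume repair record provides) removes the race: the second caller cannot read the region count until the first caller's allocation is visible.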
Before this commit, creating a volume repair record for a hard-deleted volume was not possible: the creation routine would return an error, saying that it didn't make sense to lock a non-existent volume. Nexus is faced with the following scenario:

- users can delete snapshots at any time, which will hard-delete the snapshot volume
- getting the current number of allocated regions for the hard-deleted snapshot volume and calling `arbitrary_region_allocate` needs to be exclusive

So this commit relaxes that restriction: volume repair records can now be created for volumes that are either not yet created or hard-deleted. This means that region snapshot replacements will still be serialized even for hard-deleted snapshot volumes.

The alternative would be either some other form of exclusivity (another lock), or spending cycles changing the "get the number of currently allocated regions, then allocate an additional one" pattern to be safe for multiple callers. In the name of urgency, the existing volume repair record is used, without the aforementioned restriction.

Fixes #7790
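A rough sketch of the relaxed locking rule, again with hypothetical types rather than the real datastore schema: the repair record is keyed only by volume ID, so taking the lock no longer requires the volume row to exist:

```rust
use std::collections::HashSet;

type VolumeUuid = u128;

#[derive(Default)]
struct Datastore {
    // One active repair record per volume ID; note there is no table of
    // volumes here at all -- the lock does not care whether the volume
    // row still exists.
    repair_records: HashSet<VolumeUuid>,
}

#[derive(Debug)]
enum LockError {
    AlreadyLocked,
}

impl Datastore {
    // Before the fix, the real routine errored out when the volume had been
    // hard-deleted; here only a duplicate record is an error, so racing
    // replacement requests stay serialized even for deleted snapshot volumes.
    fn volume_repair_lock(&mut self, volume_id: VolumeUuid) -> Result<(), LockError> {
        if self.repair_records.insert(volume_id) {
            Ok(())
        } else {
            Err(LockError::AlreadyLocked)
        }
    }

    fn volume_repair_unlock(&mut self, volume_id: VolumeUuid) {
        self.repair_records.remove(&volume_id);
    }
}

fn main() {
    let mut ds = Datastore::default();
    let snapshot_volume: VolumeUuid = 7; // hard-deleted long ago

    // Taking the lock succeeds even though no volume row exists:
    assert!(ds.volume_repair_lock(snapshot_volume).is_ok());
    // A concurrent replacement request for the same volume must wait:
    assert!(ds.volume_repair_lock(snapshot_volume).is_err());

    ds.volume_repair_unlock(snapshot_volume);
}
```

Since the record itself provides the exclusivity, whether the volume row still exists is irrelevant to serializing the "count then allocate" sequence.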
After an expungement of sled 16 on dublin, the automatic region and snapshot repair process kicked off.
Many repairs were done:
However, even after several hours, there are still things needing repair that don't seem to be making forward progress:
Trying to force a repair, we get a complaint about a missing volume:
Checking the database, indeed, that volume does not exist:
Trolling through the VCRs of all the volumes, we can see there are still references to the IP address of the expunged sled:
Part of what was created on this rack was a snapshot with 25 levels, so it's not surprising that there could be that many references to a single downstairs target.
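For context on why one address can show up so many times: each snapshot level nests another `read_only_parent` in the volume construction request. A hedged sketch with a simplified stand-in for the VCR type (not Crucible's actual definition) illustrates the multiplication:

```rust
use std::net::SocketAddr;

// `Vcr` is a simplified stand-in: the Volume variant's fields appear in the
// patch quoted in the fix discussion above, but the Region variant and its
// target list are assumptions, not Crucible's real definition.
enum Vcr {
    Volume {
        sub_volumes: Vec<Vcr>,
        read_only_parent: Option<Box<Vcr>>,
    },
    Region {
        targets: Vec<SocketAddr>,
    },
}

// Walk the tree, counting references to a single downstairs address.
fn count_references(vcr: &Vcr, expunged: SocketAddr) -> usize {
    match vcr {
        Vcr::Volume { sub_volumes, read_only_parent } => {
            sub_volumes
                .iter()
                .map(|sv| count_references(sv, expunged))
                .sum::<usize>()
                + read_only_parent
                    .as_deref()
                    .map_or(0, |rop| count_references(rop, expunged))
        }
        Vcr::Region { targets } => {
            targets.iter().filter(|t| **t == expunged).count()
        }
    }
}

fn main() {
    // A made-up downstairs address standing in for the expunged sled's IP.
    let expunged: SocketAddr = "[fd00:1122:3344:101::5]:19000".parse().unwrap();

    // Stack up layers the way snapshots do: each level wraps the previous
    // one as its read_only_parent. Three levels here stand in for the
    // 25-level volume seen on dublin.
    let mut vcr = Vcr::Region { targets: vec![expunged] };
    for _ in 0..3 {
        vcr = Vcr::Volume {
            sub_volumes: vec![Vcr::Region { targets: vec![expunged] }],
            read_only_parent: Some(Box::new(vcr)),
        };
    }

    // One reference per layer plus the innermost region: a deep stack of
    // snapshot levels multiplies references to a single expunged target.
    assert_eq!(count_references(&vcr, expunged), 4);
}
```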