Crucible hit assertion failed: self.completed.contains(&ds_id) during pstop/prun of a downstairs #1121

Closed
leftwo opened this issue Jan 27, 2024 · 3 comments

Comments


leftwo commented Jan 27, 2024

Saw the following panic while running pstop/prun on a downstairs with IO coming from the upstairs. The test is intended to trigger a client timeout and then start live repair, as well as to trigger a timeout during live repair.

{"msg":"[1] 60597 Done  fw.send for m","v":0,"name":"crucible","level":30,"time":"2024-01-27T03:18:08.421599761Z","hostname":"EVT22200005","pid":3948,"":"io task","client":"1","":"downstairs","session_id":"8d6f6bc4-49fa-488d-9e3d-589486e37cca"}
{"msg":"[0] 60597 Done  fw.send for m","v":0,"name":"crucible","level":30,"time":"2024-01-27T03:18:08.42160955Z","hostname":"EVT22200005","pid":3948,"":"io task","client":"0","":"downstairs","session_id":"8d6f6bc4-49fa-488d-9e3d-589486e37cca"}
{"msg":"[1] 60598 Start fw.send for m","v":0,"name":"crucible","level":30,"time":"2024-01-27T03:18:08.421622064Z","hostname":"EVT22200005","pid":3948,"":"io task","client":"1","":"downstairs","session_id":"8d6f6bc4-49fa-488d-9e3d-589486e37cca"}
{"msg":"[1] 60598 Done  fw.send for m","v":0,"name":"crucible","level":30,"time":"2024-01-27T03:18:08.421631572Z","hostname":"EVT22200005","pid":3948,"":"io task","client":"1","":"downstairs","session_id":"8d6f6bc4-49fa-488d-9e3d-589486e37cca"}
{"msg":"[0] 60598 Start fw.send for m","v":0,"name":"crucible","level":30,"time":"2024-01-27T03:18:08.421650217Z","hostname":"EVT22200005","pid":3948,"":"io task","client":"0","":"downstairs","session_id":"8d6f6bc4-49fa-488d-9e3d-589486e37cca"}
{"msg":"[0] 60598 Done  fw.send for m","v":0,"name":"crucible","level":30,"time":"2024-01-27T03:18:08.421660276Z","hostname":"EVT22200005","pid":3948,"":"io task","client":"0","":"downstairs","session_id":"8d6f6bc4-49fa-488d-9e3d-589486e37cca"}
{"msg":"IO completion error: missing 3316 ","v":0,"name":"crucible","level":50,"time":"2024-01-27T03:18:08.423613237Z","hostname":"EVT22200005","pid":3948,"client":"2","":"downstairs","session_id":"8d6f6bc4-49fa-488d-9e3d-589486e37cca"}
thread 'tokio-runtime-worker' panicked at upstairs/src/downstairs.rs:3224:13:
assertion failed: self.completed.contains(&ds_id)

The assertion is in process_io_completion_inner().

The code (and the comment around it) suggests this may be a false panic:

    fn process_io_completion_inner(
        &mut self,
        ds_id: JobId,
        client_id: ClientId,
        responses: Result<Vec<ReadResponse>, CrucibleError>,
        read_response_hashes: Vec<Option<u64>>,
        up_state: &UpstairsState,
        extent_info: Option<ExtentInfo>,
    ) {
        /*
         * Assume we don't have enough completed jobs, and only change
         * it if we have the exact amount required
         */
        let deactivate = matches!(up_state, UpstairsState::Deactivating { .. });

        let Some(job) = self.ds_active.get_mut(&ds_id) else {
            error!(
                self.clients[client_id].log,
                "IO completion error: missing {ds_id} "
            );
            /*
             * This assertion is only true for a limited time after
             * the downstairs has failed.  An old in-flight IO
             * could, in theory, ack back to us at some time
             * in the future after we cleared the completed.
             * I also think this path could be possible if we
             * are in failure mode for LiveRepair, as we could
             * get an ack back from a job after we failed the DS
             * (from the upstairs side) and flushed the job away.
             */
            assert!(self.completed.contains(&ds_id));
            return;
        };

The test in question could indeed trigger this situation, but I'm filing the issue now so I can restart the test for other reasons.
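
To make the race described in that comment concrete, here is a minimal, self-contained sketch. This is not Crucible's actual code: the types, the `completed_limit` trimming, and the job IDs are all invented for illustration. It only shows how a late ack can miss both `ds_active` and a trimmed `completed` history, which is exactly the condition that trips the assert.

```rust
use std::collections::{BTreeMap, VecDeque};

// Simplified stand-in for the upstairs bookkeeping involved in this
// panic: `ds_active` holds in-flight jobs, `completed` is a bounded
// history of retired job IDs. Field names mirror the snippet above,
// but the types and the trimming policy are hypothetical.
struct Downstairs {
    ds_active: BTreeMap<u64, &'static str>,
    completed: VecDeque<u64>,
    completed_limit: usize,
}

impl Downstairs {
    fn retire(&mut self, ds_id: u64) {
        self.ds_active.remove(&ds_id);
        self.completed.push_back(ds_id);
        // Old entries are trimmed, which is what lets a late ack from
        // a faulted downstairs miss both collections.
        while self.completed.len() > self.completed_limit {
            self.completed.pop_front();
        }
    }

    fn process_io_completion(&mut self, ds_id: u64) {
        if self.ds_active.contains_key(&ds_id) {
            // Normal path: the job is still in flight.
            return;
        }
        // The panicking path: the job is gone from `ds_active`, and if
        // enough jobs have retired since, it is gone from `completed`
        // too, so this assert fires.
        assert!(self.completed.contains(&ds_id));
    }
}

fn main() {
    let mut ds = Downstairs {
        ds_active: BTreeMap::new(),
        completed: VecDeque::new(),
        completed_limit: 2,
    };
    for id in 0..5 {
        ds.ds_active.insert(id, "write");
    }
    for id in 0..5 {
        ds.retire(id);
    }
    // A stale ack for job 0 arrives after `completed` was trimmed:
    ds.process_io_completion(0); // panics: assertion failed
}
```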


leftwo commented Jan 27, 2024

Logs on catacomb at /staff/core/crucible-1121

The test was on loop 16:

[015][2] 5:15  ds_err:0 abrt:0 ave:5:12 total:78:05 last_run:315                  
using ds_pid 13896
New loop starts now Sat Jan 27 03:10:13 UTC 2024 faulting: 1                      
Wait for our downstairs to fault
Wait for our downstairs to begin live_repair                                      
After 5 seconds, Stop 1 again
Wait for downstairs 1 go go back to faulted                                       
Start 1 for a second time
Now wait for all downstairs to be active
All downstairs active, now stop IO test and wait for it to finish
[016][1] 5:13  ds_err:0 abrt:0 ave:5:12 total:83:18 last_run:313                  
using ds_pid 13897
New loop starts now Sat Jan 27 03:15:25 UTC 2024 faulting: 2                      
Wait for our downstairs to fault
Wait for our downstairs to begin live_repair 

So, we had pstopped the downstairs, waited for it to fault, then issued the prun, and were waiting for the live repair to start. From the dtrace script upstairs_info.d:

       DS STATE 0        DS STATE 1        DS STATE 2    UPW   DSW BAKPR   NEW0  NEW1  NEW2    IP0   IP1   IP2     D0    D1    D2     S0    S1    S2  E0 E1 E2
           active            active           offline      0 56404  4890      0     0 56404      1     1     0  56403 56403     0      0     0     0   0  0  0
           active            active           offline      0 56645  4934      0     0 56645      0     0     0  56645 56645     0      0     0     0   0  0  0
           active            active           offline      1 56886  4978      0     0 56886      1     1     0  56885 56885     0      0     0     0   0  0  0
           active            active           faulted      1    13     0      0     0     0      1     2     0     12    11     0      0     0    13   0  0  0
           active            active           faulted      1    21     0      0     0     0      1     1     0     20    20     0      0     0    21   0  0  0
           active            active           faulted      1    17     0      0     0     0      1     1     0     16    16     0      0     0    17   0  0  0

This could be the situation described in the code comment above.


mkeeter commented Feb 20, 2024

I believe this should be fixed by the combination of #1126, #1131, and #1157.


leftwo commented Feb 21, 2024

Agreed, some combo of those should fix it.

leftwo closed this as completed Feb 21, 2024