Shutdown should cancel Tablet internal major compactions #5393

dlmarion · 2025-03-10T18:16:04Z

Is your feature request related to a problem? Please describe.
When a user shuts down the cluster it sets the Manager goal state
to CLEAN_STOP, which will set the Manager to SAFE_MODE, which
will set user table tablets goal state to UNASSIGNED. The Manager
will then start to unload tablets for the user tables by calling the
UnloadTabletHandler on the TabletServer, which in turn calls
Tablet.close, which calls CompactableImpl.close. The following
code block will wait for all internal major compactions to
complete.

accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java

Lines 1602 to 1624 in 9037d61

    
           // Wait while internal jobs are running or external compactions are committing. When 
        
           // chopStatus is MARKING or selectStatus is SELECTING, there may be metadata table writes so 
        
           // wait on those. Do not wait on external compactions that are running. 
        
           Predicate<CompactionJob> jobsToWaitFor = 
        
               job -> !((CompactionExecutorIdImpl) job.getExecutor()).isExternalId(); 
        
           while (runningJobs.stream().anyMatch(jobsToWaitFor) 
        
               || !externalCompactionsCommitting.isEmpty() 
        
               || fileMgr.chopStatus == ChopSelectionStatus.MARKING 
        
               || fileMgr.selectStatus == FileSelectionStatus.SELECTING) { 
        
             log.debug( 
        
                 "Closing {} is waiting on {} running compactions, {} committing external compactions, chop marking {}, file selection {}", 
        
                 getExtent(), runningJobs.stream().filter(jobsToWaitFor).count(), 
        
                 externalCompactionsCommitting.size(), fileMgr.chopStatus == ChopSelectionStatus.MARKING, 
        
                 fileMgr.selectStatus == FileSelectionStatus.SELECTING); 
        
             try { 
        
               wait(50); 
        
             } catch (InterruptedException e) { 
        
               Thread.currentThread().interrupt(); 
        
               throw new RuntimeException(e); 
        
             } 
        
           }

In the case of a tablet unload due to migration, waiting for the
major compactions to complete might make sense as you may
not want to lose the time spent on the compaction. When shutting
down the system, you might not care about that, and just need
to shut the system down now.

Describe the solution you'd like
When shutting down, don't wait for in-progress major compactions
to complete. Cancel them instead.

Describe alternatives you've considered
The only option when in this state and needing to shutdown the
cluster immediately is to kill the processes and incur the cost
of recovery.

dlmarion · 2025-03-10T18:32:19Z

More information: CompactableImp.close sets closed to true before the code block referenced in the description. closed is used in the CompactionCheck object and is used to determine if any in-progress major compactions should abort processing. This should have occurred, but it was noticed that compactions were progressing and the queue was increasing for a tablet that was supposed to be unloading.

keith-turner · 2025-03-10T21:58:35Z

Looked to see if we had any test for this and did not find any, so opened #5395. The basic functionality does seem to work.

Looking through the code, I only saw the check to cancel a compaction done after the iterator used to read input files returned a key/value. If that iterator is filtering and never returns anything, then that is one possible thing that could hold it up. Maybe the compaction thread also needs to be interrupted, we made some changes like that for the scan threads where they set an atomic boolean to cancel and try to interrupt the thread.

keith-turner · 2025-03-10T22:08:05Z

I modified the test in #5395 to sleep for 300 seconds instead of 3 seconds in the slow iterator to simulate an iterator filtering and not returning data. With that change the test times out and fails.

dlmarion · 2025-03-11T11:40:32Z

Looking through the code, I only saw the check to cancel a compaction done after the iterator used to read input files returned a key/value. If that iterator is filtering and never returns anything, then that is one possible thing that could hold it up. Maybe the compaction thread also needs to be interrupted, we made some changes like that for the scan threads where they set an atomic boolean to cancel and try to interrupt the thread.

Looking at the code below, the CompactionEnv is passed to the FileCompactor and checked as you said above. But there is also a background thread that will interrupt the FileCompactor if the compaction is no longer enabled.

accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableUtils.java

Lines 557 to 585 in 9037d61

    
             /** 
        
              * Create the FileCompactor and finally call compact. Returns the Major CompactionStats. 
        
              */ 
        
             static CompactionStats compact(Tablet tablet, CompactionJob job, 
        
                 CompactableImpl.CompactionInfo cInfo, CompactionEnv cenv, 
        
                 Map<StoredTabletFile,DataFileValue> compactFiles, TabletFile tmpFileName) 
        
                 throws IOException, CompactionCanceledException { 
        
               TableConfiguration tableConf = tablet.getTableConfiguration(); 
        
               AccumuloConfiguration compactionConfig = getCompactionConfig(tableConf, 
        
                   getOverrides(job.getKind(), tablet, cInfo.localHelper, job.getFiles())); 
        
               final FileCompactor compactor = new FileCompactor(tablet.getContext(), tablet.getExtent(), 
        
                   compactFiles, tmpFileName, cInfo.propagateDeletes, cenv, cInfo.iters, compactionConfig, 
        
                   tableConf.getCryptoService()); 
        
               final Runnable compactionCancellerTask = () -> { 
        
                 if (!cenv.isCompactionEnabled()) { 
        
                   compactor.interrupt(); 
        
                 } 
        
               }; 
        
               final ScheduledFuture<?> future = tablet.getContext().getScheduledExecutor() 
        
                   .scheduleWithFixedDelay(compactionCancellerTask, 10, 10, TimeUnit.SECONDS); 
        
               try { 
        
                 return compactor.call(); 
        
               } finally { 
        
                 future.cancel(true); 
        
               } 
        
             }

I modified the test in #5395 to sleep for 300 seconds instead of 3 seconds in the slow iterator to simulate an iterator filtering and not returning data. With that change the test times out and fails.

I'm wondering if you should set the Manager goal state to SAFE_MODE in the IT to simulate a shutdown and unloading of user tablets. Also, I'm seeing in the SlowIterator that it's swallowing the InterruptedException, but setting the Thread's interrupted state, then calling next or seek. We are assuming that some other code is checking the Thread state, but it doesn't look like it's happening in this case. FileCompactor.call handles an IterationInterruptedException, maybe we should be throwing it in more places. From what I can tell looking at TabletIteratorEnv.getTopLevelIterator, the MultiIterator is the top-level iterator and it's not checking the Thread interrupt state.

keith-turner · 2025-03-11T19:28:27Z

I tried modifying the test in #5395 to make the SlowIterator not eat the interrupted exception and to sleep for 300s. The test would still hang. Looking at the code not seeing anything that interrupts the compaction thread. The following code does the compaction on 1344 and then updates the metadata table and tablet in memory data on 1346.

accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java

Lines 1344 to 1346 in 9037d61

    
           stats = CompactableUtils.compact(tablet, job, cInfo, compactEnv, compactFiles, tmpFileName); 
        
           newFile = CompactableUtils.bringOnline(tablet.getDatafileManager(), cInfo, stats,

Seems like the ideal thing would be to only interrupt the thread when its in line 1344. Have been looking into ways to do that. Found that interrupting the thread when the code is in line 1346 may have some undesirable effects based on seeing #3750 while looking into this.

keith-turner · 2025-03-11T22:47:13Z

I added code in #5395 to interrupt compactions threads and updated the new IT there to test this case.

If we wanted to get aggressive w/ troublesome compaction threads we could introduce the concept of zombie compaction threads where close will eventually abandon a compaction thread. However it will only do so if it can atomically confirm the thread is not making a metadata update and ensure the thread can never make a metadata update. Could probably make all of these changes in CompactableImpl. In general would probably want metrics for zombie compaction threads like the zombie scan threads metrics, so those changes would end up being outside of CompactableImpl.

dlmarion added the enhancement This issue describes a new feature, improvement, or optimization. label Mar 10, 2025

dlmarion added this to the 2.1.4 milestone Mar 10, 2025

dlmarion added bug This issue has been verified to be a bug. and removed enhancement This issue describes a new feature, improvement, or optimization. labels Mar 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shutdown should cancel Tablet internal major compactions #5393

Shutdown should cancel Tablet internal major compactions #5393

dlmarion commented Mar 10, 2025

dlmarion commented Mar 10, 2025

keith-turner commented Mar 10, 2025

keith-turner commented Mar 10, 2025

dlmarion commented Mar 11, 2025

keith-turner commented Mar 11, 2025

keith-turner commented Mar 11, 2025

Shutdown should cancel Tablet internal major compactions #5393

Shutdown should cancel Tablet internal major compactions #5393

Comments

dlmarion commented Mar 10, 2025

dlmarion commented Mar 10, 2025

keith-turner commented Mar 10, 2025

keith-turner commented Mar 10, 2025

dlmarion commented Mar 11, 2025

keith-turner commented Mar 11, 2025

keith-turner commented Mar 11, 2025