Shutdown should cancel Tablet internal major compactions #5393
Comments
I looked to see if we had any test for this and did not find one, so I opened #5395. The basic functionality does seem to work. Looking through the code, I only saw the check to cancel a compaction happen after the iterator used to read the input files returned a key/value. If that iterator is filtering and never returns anything, that is one thing that could hold up cancellation. The compaction thread may also need to be interrupted; we made changes like that for the scan threads, where they set an atomic boolean to cancel and try to interrupt the thread.
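For illustration, here is a minimal, self-contained sketch of the two cancellation styles being discussed; all names are hypothetical, not Accumulo's actual classes. It shows why a flag check alone can hang when the read side never returns, and how pairing the flag with Thread.interrupt(), as was done for the scan threads, unblocks it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelSketch {
  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean cancelled = new AtomicBoolean(false);

    Thread compactor = new Thread(() -> {
      try {
        while (!cancelled.get()) { // cancel check runs only between reads
          Thread.sleep(300_000);   // stands in for an iterator that filters
        }                          // everything and never returns a key/value
      } catch (InterruptedException e) {
        // the interrupt breaks the blocking read so the thread can exit
        Thread.currentThread().interrupt();
      }
    });
    compactor.start();

    cancelled.set(true);   // the flag alone never wakes a blocked thread ...
    compactor.interrupt(); // ... the interrupt does
    compactor.join();
  }
}
```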
I modified the test in #5395 so the slow iterator sleeps for 300 seconds instead of 3 seconds, to simulate an iterator that filters and never returns data. With that change the test times out and fails.
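For reference, the change looks roughly like the following sketch. The SlowIterator helper method and the iterator scope are written from memory of the Accumulo test code, so treat the exact names as assumptions rather than the literal diff in #5395.

```java
import java.util.EnumSet;

import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
import org.apache.accumulo.test.functional.SlowIterator;

class SlowMajcSetup {
  static void attachSlowIterator(AccumuloClient client, String tableName) throws Exception {
    IteratorSetting is = new IteratorSetting(30, "slow", SlowIterator.class);
    SlowIterator.setSleepTime(is, 300_000); // 300s instead of 3s: the iterator now
                                            // behaves like one that filters and
                                            // never returns data
    client.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.majc));
  }
}
```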
Looking at the code below:
accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableUtils.java, lines 557 to 585 in 9037d61
I'm wondering if you should set the Manager goal state to SAFE_MODE in the IT to simulate a shutdown and the unloading of user tablets.
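In an IT built on the mini cluster, that could look something like the line below. This is a sketch: the SetGoalState class is taken from the 2.1 codebase, and the exec(...) harness method is an assumption about the test's base class.

```java
// Move the Manager to SAFE_MODE so it begins unloading user tablets
// (SetGoalState class per the 2.1 codebase; exec(...) harness method assumed).
getCluster().exec(org.apache.accumulo.manager.state.SetGoalState.class, "SAFE_MODE");
```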
I tried modifying the test in #5395 to make the SlowIterator sleep for 300s and not swallow the InterruptedException. The test would still hang. Looking at the code, I am not seeing anything that interrupts the compaction thread. The following code does the compaction on line 1344 and then updates the metadata table and the tablet's in-memory data on line 1346.
accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java, lines 1344 to 1346 in 9037d61
It seems like the ideal thing would be to interrupt the thread only when it is in line 1344, and I have been looking into ways to do that. Interrupting the thread while the code is in line 1346 may have undesirable effects, based on seeing #3750 while looking into this.
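One way to scope the interrupt, sketched below with hypothetical names rather than CompactableImpl's actual fields: close() may interrupt only while the rewrite phase has the window open, and the window is closed (with any late interrupt cleared) under a lock before the metadata update runs.

```java
class InterruptWindow {
  private Thread compactor; // guarded by this

  void runCompaction() {
    synchronized (this) {
      compactor = Thread.currentThread(); // open the window: cancel() may interrupt
    }
    try {
      rewriteFiles();                     // line 1344: safe to interrupt here
    } finally {
      synchronized (this) {
        compactor = null;                 // close the window atomically ...
        Thread.interrupted();             // ... and clear any interrupt that landed
      }                                   // just before it closed
    }
    updateMetadata();                     // line 1346: can never see an interrupt
  }

  void cancel() {                         // called from close() during shutdown
    synchronized (this) {
      if (compactor != null) {
        compactor.interrupt();
      }
    }
  }

  private void rewriteFiles() {}          // placeholders for the real work
  private void updateMetadata() {}
}
```

Because cancel() and the finally block synchronize on the same monitor, an interrupt can never be delivered after the window closes, which is what keeps the line 1346 work safe.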
I added code in #5395 to interrupt compaction threads and updated the new IT there to test this case. If we wanted to get aggressive with troublesome compaction threads, we could introduce the concept of zombie compaction threads, where close will eventually abandon a compaction thread. However, it would only do so if it can atomically confirm the thread is not making a metadata update and ensure the thread can never make one. We could probably make all of these changes in CompactableImpl. In general we would probably want metrics for zombie compaction threads, like the zombie scan thread metrics, so those changes would end up being outside of CompactableImpl.
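The atomic confirmation could be a small state machine, sketched below with hypothetical names: a compare-and-set guarantees that exactly one of "thread commits" or "close abandons the thread" can win.

```java
import java.util.concurrent.atomic.AtomicReference;

class ZombieGuard {
  enum State { RUNNING, COMMITTING, ABANDONED }

  private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

  /** Called by the compaction thread just before its metadata update. */
  boolean tryBeginCommit() {
    // succeeds only if close() has not already abandoned this thread
    return state.compareAndSet(State.RUNNING, State.COMMITTING);
  }

  /** Called by close() after waiting long enough; true means the thread is now a zombie. */
  boolean tryAbandon() {
    // succeeds only if the thread is not in, and can never enter, the commit
    boolean abandoned = state.compareAndSet(State.RUNNING, State.ABANDONED);
    // if (abandoned) { a zombie-compaction-threads metric would be incremented here }
    return abandoned;
  }
}
```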
Is your feature request related to a problem? Please describe.
When a user shuts down the cluster, the Manager goal state is set to CLEAN_STOP, which moves the Manager to SAFE_MODE, which sets the goal state of user table tablets to UNASSIGNED. The Manager then starts to unload tablets for the user tables by calling the UnloadTabletHandler on the TabletServer, which in turn calls Tablet.close, which calls CompactableImpl.close. The following code block waits for all internal major compactions to complete.
accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java, lines 1602 to 1624 in 9037d61
In the case of a tablet unload due to migration, waiting for the major compactions to complete might make sense, as you may not want to lose the time spent on the compaction. When shutting down the system, you might not care about that and just need to shut the system down now.
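In rough outline, the wait described above has this shape (a sketch with assumed field names, not the literal code at lines 1602 to 1624):

```java
import java.util.Set;

class CloseWait {
  private final Set<Object> runningJobs; // in-progress internal major compactions
  private boolean closed;

  CloseWait(Set<Object> runningJobs) {
    this.runningJobs = runningJobs;
  }

  synchronized void close() throws InterruptedException {
    closed = true;                  // assumed: stop new compactions from starting
    while (!runningJobs.isEmpty()) {
      wait(50);                     // tablet unload, and therefore shutdown,
    }                               // blocks here until the majors finish
  }
}
```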
Describe the solution you'd like
When shutting down, don't wait for in-progress major compactions to complete. Cancel them instead.
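Concretely, the requested change could have the following shape; everything here is hypothetical (the shutdown flag, the Cancellable handle), sketching cancel-then-drain instead of wait-to-completion.

```java
import java.util.Set;

class CloseWithCancel {
  interface Cancellable { void cancel(); }

  private final Set<Cancellable> runningJobs;
  private boolean closed;

  CloseWithCancel(Set<Cancellable> runningJobs) {
    this.runningJobs = runningJobs;
  }

  synchronized void close(boolean shuttingDown) throws InterruptedException {
    closed = true;
    if (shuttingDown) {
      runningJobs.forEach(Cancellable::cancel); // ask each compaction to stop now
    }
    while (!runningJobs.isEmpty()) {
      wait(50); // cancelled jobs should exit quickly rather than run to completion
    }
  }
}
```

A cancelled job would still need the interrupt handling discussed in the comments above to break out of a blocked iterator read.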
Describe alternatives you've considered
The only option when in this state and needing to shut down the cluster immediately is to kill the processes and incur the cost of recovery.