Make garbage collecting wal and recovery files faster #5399

dlmarion · 2025-03-11T21:08:06Z

Uses the same property and technique used for deleting RFiles to delete wal and recovery files.

Closes #5397


dlmarion · 2025-03-11T21:24:52Z

Kicked off IT build

dlmarion · 2025-03-12T12:08:45Z

Full IT build successful

server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java

ddanielr

Changes look good to me.

The only suggestion I have is using a separate GC_WAL_DELETE_THREADS property.

In the current code, user table activity delays WAL cleanup, because the WALs are processed at the end of a GC run cycle, before the metadata and root tables are flushed.

If the goal is to speed up GC of the WALs and recovery files, then I don't see a reason why the WALs have to be handled at the end of the GC run cycle rather than always running in a separate task thread.

GC_DELETE_THREADS is most likely going to be set higher than needed for the WALs, since it is sized for file activity in the system, while the upper limit on WALs should be based on the number of tservers * tserver.wal.max.referenced and on tserver churn (dead/recovered/etc.).

If we move to having the WALs GC'd in a separate thread, or just pulled into a different GC process, then using a separate property would make it easier to avoid exceeding the maximum number of threads available on the server.

keith-turner · 2025-03-12T18:43:41Z

server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java

+    } catch (InterruptedException e1) {
+      log.error("{}", e1.getMessage(), e1);
+    }
+    return counter.get();


The code was incrementing this deleted counter; should it still do that?

Suggested change

-    return counter.get();
+    status.currentLog.deleted += counter.get();
+    return counter.get();

dlmarion

Yep, I forgot to increment the counter. I created it though, so I was halfway there.

dlmarion

On second look, counter is being passed to removeFile and incremented there.

keith-turner

Yeah, counter is being incremented in removeFile, but the old code used to increment status.currentLog.deleted, which may be used for display on the monitor and for logging. It seems that is no longer incremented.

dlmarion

Updated in 7c54495

cshannon · 2025-03-12T19:26:52Z

server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java

+          fs.deleteRecursively(path);
+        }
+        counter.incrementAndGet();
+      } catch (FileNotFoundException ex) {


I was wondering if we need to handle any other exception types here, as an uncaught exception can bubble up and kill the thread pool. It's pretty common to catch all exceptions as a catch-all.

dlmarion

The previous code just bubbled the RuntimeExceptions up the call stack. I could call submit instead, test the futures, and return any errors up the stack.

cshannon

If you used futures, that might be better: you could wait for them to finish instead of relying on the thread pool shutting down to signal completion. Either way could work. I just figured I'd mention it, as I'm not too familiar with this code, but I noticed we were only catching those specific exceptions, so I was curious whether a random runtime exception would cause issues with the new thread pool.

dlmarion

Updated to use futures in 7c54495
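For readers following this thread, here is a minimal sketch of the submit-and-check-futures idea discussed above. It is not the code from 7c54495: the class FutureDeleteSketch, the method deleteAll, and the Consumer<String> standing in for the file system delete are all hypothetical. With ExecutorService.submit, an exception thrown by a task is captured in its Future and resurfaces as an ExecutionException from get(), so the caller can report it up the stack.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names come from Accumulo.
final class FutureDeleteSketch {

  /**
   * Submits one delete task per path, waits for every task, and returns the
   * number of successful deletes. The first task failure is rethrown with the
   * offending path in the message.
   */
  static long deleteAll(ExecutorService pool, List<String> paths, Consumer<String> deleter)
      throws InterruptedException {
    AtomicLong counter = new AtomicLong();
    Map<String, Future<?>> futures = new LinkedHashMap<>();
    for (String path : paths) {
      futures.put(path, pool.submit(() -> {
        deleter.accept(path); // stand-in for the real file delete
        counter.incrementAndGet(); // only reached when the delete did not throw
      }));
    }
    for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
      try {
        entry.getValue().get(); // blocks until this task has finished
      } catch (ExecutionException e) {
        // The task's own exception is available as the cause.
        throw new RuntimeException("Error deleting " + entry.getKey(), e.getCause());
      }
    }
    return counter.get();
  }
}
```

In this sketch the caller owns the pool, so creating it and shutting it down are left outside the method.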

dlmarion · 2025-03-12T20:51:40Z

The only suggestion I have is using a separate GC_WAL_DELETE_THREADS property.

Added in 7c54495

keith-turner · 2025-03-12T21:35:42Z

server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java

    }

+    while (!futures.isEmpty()) {


The get() method on a future waits until the task is done or has failed, so it should be possible to loop through all the futures once, calling get().

futures.forEach((path, future) -> {
  try {
    future.get();
  } catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException("Uncaught exception deleting recovery log file " + path, e);
  }
});

dlmarion

I was thinking that the faster I process the ones that are complete and remove them from the map, the faster I'm giving memory back to the VM. I probably need to put a wait at the bottom of the loop, though. If I called get() and the first future the iterator returned was the last one submitted, I would be waiting until they were all done.
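The polling variant described in the reply above could look roughly like the following sketch. The class DrainFuturesSketch, the method drain, and the pauseMillis parameter are hypothetical and are not taken from the PR. Each pass removes the futures that are already done, and a short sleep at the bottom of the loop avoids spinning while tasks are still running.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names come from Accumulo.
final class DrainFuturesSketch {

  /**
   * Repeatedly scans the map, removing futures as soon as they are done, and
   * returns how many completed normally. The map must support removal through
   * its iterator (for example a HashMap).
   */
  static int drain(Map<String, Future<?>> futures, long pauseMillis)
      throws InterruptedException {
    int completed = 0;
    while (!futures.isEmpty()) {
      Iterator<Map.Entry<String, Future<?>>> iter = futures.entrySet().iterator();
      while (iter.hasNext()) {
        Map.Entry<String, Future<?>> entry = iter.next();
        Future<?> future = entry.getValue();
        if (future.isDone()) {
          iter.remove(); // release the reference as early as possible
          try {
            future.get(); // returns immediately because the task is done
            completed++;
          } catch (ExecutionException e) {
            throw new RuntimeException(
                "Uncaught exception deleting recovery log file " + entry.getKey(), e.getCause());
          }
        }
      }
      if (!futures.isEmpty()) {
        Thread.sleep(pauseMillis); // the wait at the bottom of the loop
      }
    }
    return completed;
  }
}
```

The single pass with a blocking get() is shorter and finishes at the same time overall; the polling loop only changes how early individual entries are dropped from the map.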

keith-turner · 2025-03-12T21:38:41Z

server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java

+        }
+      }
+    }
+    deleteThreadPool.shutdown();


If an exception is thrown, the pool would not be shut down.

keith-turner · 2025-03-12T21:39:08Z

server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java

+    }
+    deleteThreadPool.shutdown();
+    try {
+      while (!deleteThreadPool.awaitTermination(1000, TimeUnit.MILLISECONDS)) { // empty


Could remove this code now that futures are being waited on.
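Taken together, the two pool lifecycle comments above suggest a shape like the following sketch. PoolLifecycleSketch and runAndShutdown are hypothetical names, and this is an assumption about the intent rather than the merged code. Putting shutdown() in a finally block means the pool is shut down even when a task failure is rethrown, and because every future has been waited on, no awaitTermination loop is needed to know the work is finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names come from Accumulo.
final class PoolLifecycleSketch {

  /**
   * Runs all tasks on the given pool, waits for each one, and always shuts the
   * pool down afterwards. This method takes ownership of the pool.
   */
  static void runAndShutdown(ExecutorService pool, List<Runnable> tasks)
      throws InterruptedException {
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (Runnable task : tasks) {
        futures.add(pool.submit(task));
      }
      for (Future<?> future : futures) {
        try {
          future.get(); // all work is complete once every get() has returned
        } catch (ExecutionException e) {
          throw new RuntimeException("Delete task failed", e.getCause());
        }
      }
    } finally {
      pool.shutdown(); // runs on both the normal and the exceptional path
    }
  }
}
```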

Make garbage collecting wal and recovery files faster

4ce3fe1

Uses the same property and technique used for deleting RFiles to delete wal and recovery files. Closes apache#5397

dlmarion added this to the 2.1.4 milestone Mar 11, 2025

dlmarion self-assigned this Mar 11, 2025

dlmarion marked this pull request as ready for review March 12, 2025 12:08

dlmarion linked an issue Mar 12, 2025 that may be closed by this pull request

Garbage Collecting Write Ahead Logs can be slow for a large number of logs #5397

Open

ddanielr reviewed Mar 12, 2025


ddanielr approved these changes Mar 12, 2025


keith-turner reviewed Mar 12, 2025


cshannon reviewed Mar 12, 2025


Use futures, new property, and update gcstatus variable

7c54495

keith-turner reviewed Mar 12, 2025
