Prefetch blocks and place into data BlockCache for major compactions #5302
base: main
Conversation
Looking at the new vectored read API in Hadoop has been on my to-do list. Another good resource for understanding it is here. I attempted to use this, but was unable to figure out a good way to apply it, as we don't directly deal with HDFS blocks. Instead, we deal with RFile blocks, and we cache them at a much different layer than where the HDFS block is retrieved.

Instead, I attempted to create something similar in this PR: prefetching RFile blocks and preemptively caching them. I think this might make sense for operations that perform sequential reads, like compactions. So I wired this up in the FileCompactor for major compactions, and I targeted the main branch because major compactions only run in Compactors. In earlier releases this change would cause churn in the data block cache and might decrease scan performance due to eviction of other blocks.

There are still some changes to be made, like adding the BlockCache to the Compactor, making the number of blocks to prefetch a property, and moving the ThreadPoolExecutor out of the Reader to somewhere else. But I wanted to get early feedback on the concept before putting more work into it.
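To illustrate the idea, here is a minimal sketch of read-ahead caching for a sequential block reader. All names here (`BlockPrefetchSketch`, `prefetch`, `PREFETCH_COUNT`, `loadBlock`) are hypothetical stand-ins for illustration only, not Accumulo or Hadoop APIs: the cache is a plain map standing in for the data BlockCache, and a thread pool plays the role of the ThreadPoolExecutor mentioned above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the prefetch concept: as a sequential reader
// (e.g. a major compaction) consumes block i, the next few blocks are
// loaded in the background and placed into the cache ahead of need.
public class BlockPrefetchSketch {
    // Number of blocks to read ahead; the PR notes this should become
    // a configurable property.
    static final int PREFETCH_COUNT = 3;

    // Stand-in for the data BlockCache: block index -> block bytes.
    static final Map<Integer, byte[]> cache = new ConcurrentHashMap<>();

    // Stand-in for fetching an RFile block from underlying storage.
    static byte[] loadBlock(int index) {
        return ("block-" + index).getBytes();
    }

    // Schedule the next PREFETCH_COUNT blocks after `current` to be
    // loaded and cached in the background.
    static void prefetch(ExecutorService pool, int current, int totalBlocks) {
        for (int i = current + 1; i <= current + PREFETCH_COUNT && i < totalBlocks; i++) {
            final int idx = i;
            pool.submit(() -> cache.computeIfAbsent(idx, BlockPrefetchSketch::loadBlock));
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        int totalBlocks = 10;
        // Sequential read, as a compaction would perform.
        for (int i = 0; i < totalBlocks; i++) {
            byte[] block = cache.computeIfAbsent(i, BlockPrefetchSketch::loadBlock);
            prefetch(pool, i, totalBlocks);
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("cached=" + cache.size());
    }
}
```

The design choice this mirrors is that prefetching only pays off for sequential access patterns; for random-access scans the same read-ahead would evict useful blocks, which is why the PR confines it to major compactions on the main branch.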
Full IT build completed successfully
Related to #2770