Bulk import times scale with the number of tablets in a table. #5201
Labels: bug (This issue has been verified to be a bug.)
Comments
This applies to the bulk v2 code; not sure if it applies to the bulk v1 code.
dlmarion added a commit to dlmarion/accumulo that referenced this issue on Feb 19, 2025:
In the Bulk Import v2 LoadFiles step, a single TabletsMetadata object was used to map a table's tablets to a set of bulk import files. In the case where a small percentage of tablets were involved in the bulk import, a majority of the table's tablets would still be evaluated. When bulk imports were not importing into contiguous tablets, the code would simply iterate over the table's tablets until it found the next starting point. This change recreates the TabletsMetadata object when a set of files is not going to start at the next tablet in the table. A likely better way to achieve the same thing would be to reset the range on the underlying Scanner and create a new iterator, but the TabletsMetadata object does not expose the Scanner. This change also closes the TabletsMetadata objects, which was not being done previously. Related to apache#5201
ddanielr pushed a commit to ddanielr/accumulo that referenced this issue on Feb 21, 2025 (same commit message as above).
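To make the idea concrete, here is a rough, self-contained simulation of the skip-ahead approach described in the commit message above. It is not the actual patch and does not use the internal TabletsMetadata API; the sorted set of end rows, the `fetchTabletsFrom` helper, and the class name are all stand-ins.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Simulated sketch of the skip-ahead idea from the commit above: when the next
// load target is not the next tablet, restart the metadata "scan" at the target
// row instead of stepping through every intermediate tablet.
public class SkipAheadSketch {

  static final NavigableSet<String> TABLET_END_ROWS = new TreeSet<>();

  // Stand-in for opening a fresh metadata scan starting at startRow.
  static Iterator<String> fetchTabletsFrom(String startRow) {
    return TABLET_END_ROWS.tailSet(startRow, true).iterator();
  }

  public static void main(String[] args) {
    for (int i = 0; i < 100_000; i++) {
      TABLET_END_ROWS.add(String.format("row%06d", i));
    }
    // Bulk import targets a few widely separated tablets.
    List<String> importRows = List.of("row000010", "row050000", "row099990");

    int tabletsVisited = 0;
    Iterator<String> tablets = fetchTabletsFrom(importRows.get(0));
    String current = tablets.next();
    tabletsVisited++;
    for (String targetRow : importRows) {
      if (current.compareTo(targetRow) < 0) {
        // Non-contiguous jump: recreate the scan at the target row rather than
        // iterating over all the tablets in between.
        tablets = fetchTabletsFrom(targetRow);
        current = tablets.next();
        tabletsVisited++;
      }
      System.out.println("load files into tablet ending at " + current);
    }
    System.out.println("tablets visited: " + tabletsVisited); // 3, instead of ~100,000
  }
}
```

In the real change the whole TabletsMetadata object has to be recreated to get this effect, since the underlying Scanner is not exposed.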
Describe the bug
When bulk importing into N tablets, the bulk import v2 code will scan all tablets in the metadata table between the minimum and maximum tablet being imported into. For example, if importing into 10 tablets of a table with 100K tablets, it is possible that the bulk import scans all 100K tablets; it depends on where the minimum and maximum of those 10 tablets fall within the 100K.
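A small self-contained simulation of that scaling behavior (a sorted set of end rows standing in for the metadata table; the class name and row values are made up): the number of tablets walked is driven by the span between the smallest and largest target tablet, not by the number of targets.

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Simulated illustration only: the metadata scan is bounded by the minimum and
// maximum tablet being imported into, so everything in between gets walked.
public class BulkScanSpan {
  public static void main(String[] args) {
    NavigableSet<String> tabletEndRows = new TreeSet<>();
    for (int i = 0; i < 100_000; i++) {
      tabletEndRows.add(String.format("row%06d", i));
    }
    // 10 target tablets spread across the table.
    NavigableSet<String> targets = new TreeSet<>(List.of(
        "row000005", "row011000", "row022000", "row033000", "row044000",
        "row055000", "row066000", "row077000", "row088000", "row099000"));

    int walked = tabletEndRows.subSet(targets.first(), true, targets.last(), true).size();
    System.out.println("target tablets: " + targets.size()); // 10
    System.out.println("tablets walked: " + walked);         // 98996, nearly the whole table
  }
}
```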
Expected behavior
Ideally, the amount of scanning done would be directly related to the number of tablets being bulk imported into and not to the number of tablets in the table. That would be a large change to the way the code works. A good first step would be to add some logging to the current code that captures how much time this behavior is wasting; further decisions about improving the code could then be made based on that data.
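As a rough idea of what that logging could capture, here is a hypothetical sketch (not existing Accumulo code; the class and method names are made up, and it only assumes SLF4J, which Accumulo already uses for logging): counts of tablets inspected versus tablets actually loaded into, plus the time spent iterating tablet metadata.

```java
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical instrumentation sketch: report how many tablets the LoadFiles
// metadata iteration touched versus how many actually received files, and how
// long the iteration took.
class LoadFilesScanMetrics {
  private static final Logger log = LoggerFactory.getLogger(LoadFilesScanMetrics.class);

  void report(long tabletsInspected, long tabletsLoaded, long scanNanos) {
    log.debug("bulk load metadata scan: inspected {} tablets, loaded files into {} tablets, "
        + "spent {} ms iterating tablet metadata",
        tabletsInspected, tabletsLoaded, TimeUnit.NANOSECONDS.toMillis(scanNanos));
  }
}
```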