
Bulk import times scale with the number of tablets in a table. #5201

Open
keith-turner opened this issue Dec 19, 2024 · 1 comment · May be fixed by #5341
Labels
bug This issue has been verified to be a bug.

Comments

@keith-turner
Contributor

Describe the bug

When bulk importing into N tablets, the bulk import v2 code scans all tablets in the metadata table between the minimum and maximum tablets being imported into. For example, when importing into 10 tablets of a table with 100K tablets, it is possible that the bulk import scans all 100K tablets; it depends on where the minimum and maximum of those 10 tablets fall within the 100K.
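The cost described above can be illustrated with a minimal, hypothetical sketch in plain Java (this is not Accumulo's actual API; tablet end rows are modeled as integers): a single contiguous metadata scan visits every tablet between the minimum and maximum targets, regardless of how few tablets actually receive files.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical model of the scan cost: one contiguous scan over the
// metadata table visits every tablet whose end row falls in
// [minTarget, maxTarget], not just the tablets receiving files.
public class BulkScanCost {

    // Count tablets visited by a single scan spanning [minTarget, maxTarget]
    // over a sorted list of tablet end rows.
    static int tabletsScanned(List<Integer> endRows, int minTarget, int maxTarget) {
        int count = 0;
        for (int endRow : endRows) {
            if (endRow >= minTarget && endRow <= maxTarget) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 100,000 tablets with end rows 0..99,999.
        List<Integer> endRows = IntStream.range(0, 100_000)
                .boxed().collect(Collectors.toList());
        // Importing into only 10 tablets, but the min and max of those 10
        // span the whole table, so the scan visits every tablet.
        int visited = tabletsScanned(endRows, 0, 99_999);
        System.out.println("tablets visited by one contiguous scan: " + visited);
        System.out.println("tablets actually receiving files: 10");
    }
}
```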

Expected behavior

Ideally the amount of scanning done would be directly related to the number of tablets being bulk imported, not the number of tablets in the table. That would be a large change to the way the code works. A good first step would be to add some logging to the current code that captures how much time this behavior is wasting. Further decisions about improving the code could then be made based on that data.
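The kind of accounting suggested above could be sketched as follows (hypothetical names, not actual Accumulo code): track how many tablets the metadata scan visited versus how many actually received files, and log the wasted fraction.

```java
// Hypothetical sketch of per-load-step accounting: compare tablets
// visited by the metadata scan against tablets that received files.
public class LoadFilesStats {
    long tabletsVisited;
    long tabletsLoaded;

    void recordVisit() { tabletsVisited++; }
    void recordLoad()  { tabletsLoaded++; }

    // Summary line suitable for logging after the load step completes.
    String summary() {
        long skipped = tabletsVisited - tabletsLoaded;
        double wastedPct =
            tabletsVisited == 0 ? 0.0 : 100.0 * skipped / tabletsVisited;
        return String.format(
            "bulk load visited %d tablets, loaded %d, skipped %d (%.1f%% wasted)",
            tabletsVisited, tabletsLoaded, skipped, wastedPct);
    }

    public static void main(String[] args) {
        LoadFilesStats stats = new LoadFilesStats();
        for (int i = 0; i < 100; i++) stats.recordVisit();
        for (int i = 0; i < 10; i++) stats.recordLoad();
        System.out.println(stats.summary());
    }
}
```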

@keith-turner keith-turner added the bug This issue has been verified to be a bug. label Dec 19, 2024
@keith-turner
Contributor Author

This applies to the bulk v2 code; not sure if it applies to the bulk v1 code.

@DomGarguilo DomGarguilo self-assigned this Feb 3, 2025
dlmarion added a commit to dlmarion/accumulo that referenced this issue Feb 19, 2025
In the Bulk Import v2 LoadFiles step, a single TabletsMetadata
object was used to map a table's tablets to a set of bulk import
files. In the case where a small percentage of tablets were
involved in the bulk import, a majority of the table's tablets
would still be evaluated. In the case where bulk imports were
not importing into contiguous tablets, the code would just
iterate over the table's tablets until it found the next starting
point.

This change recreates the TabletsMetadata object when a set of
files is not going to start at the next tablet in the table. A
likely better way to achieve the same thing would be to reset
the range on the underlying Scanner and create a new iterator,
but the TabletsMetadata object does not expose the Scanner. This
change also closes the TabletsMetadata objects, which was not
being done previously.

Related to apache#5201
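The two strategies in the commit message can be contrasted with a hypothetical sketch (plain Java, using a TreeMap to stand in for the metadata table; not Accumulo code): re-creating the scan at the next needed start point touches only the target tablets, while a single contiguous pass touches everything in between.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical comparison: "seek" to each target tablet versus iterating
// over every tablet between the minimum and maximum target.
public class SeekVsIterate {

    // Strategy in the fix: recreate the view starting at each needed
    // tablet, so only the target tablets are touched.
    static int seekCost(TreeMap<Integer, String> tablets, int[] targets) {
        int visited = 0;
        for (int t : targets) {
            // tailMap(t) models re-creating the scan at the next start point.
            NavigableMap<Integer, String> view = tablets.tailMap(t, true);
            if (!view.isEmpty()) {
                visited++; // only the target tablet is inspected
            }
        }
        return visited;
    }

    // Old behavior: one pass over every tablet between min and max target.
    static int iterateCost(TreeMap<Integer, String> tablets, int min, int max) {
        return tablets.subMap(min, true, max, true).size();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> tablets = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            tablets.put(i, "tablet-" + i);
        }
        int[] targets = {0, 999}; // two targets at opposite ends of the table
        System.out.println("seek cost:    " + seekCost(tablets, targets));
        System.out.println("iterate cost: " + iterateCost(tablets, 0, 999));
    }
}
```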
ddanielr pushed a commit to ddanielr/accumulo that referenced this issue Feb 21, 2025