
Bulk import times scale with the number of tablets in a table. #5201

Open
keith-turner opened this issue Dec 19, 2024 · 1 comment · May be fixed by #5341
Labels
bug This issue has been verified to be a bug.

Comments

@keith-turner
Contributor

Describe the bug

When bulk importing into N tablets, the bulk import v2 code scans all tablets in the metadata table between the minimum and maximum tablets being imported into. For example, when importing into 10 tablets of a table with 100K tablets, it is possible that the bulk import scans all 100K tablets; it depends on where the minimum and maximum of those 10 tablets fall within the 100K.
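The cost described above can be illustrated with a minimal, hypothetical sketch in plain Java (this is not Accumulo's actual API; tablet end rows are modeled as integers): a single contiguous metadata scan visits every tablet between the minimum and maximum targets, regardless of how few tablets actually receive files.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical model of the scan cost: one contiguous scan over the
// metadata table visits every tablet whose end row falls in
// [minTarget, maxTarget], not just the tablets receiving files.
public class BulkScanCost {

    // Count tablets visited by a single scan spanning [minTarget, maxTarget]
    // over a sorted list of tablet end rows.
    static int tabletsScanned(List<Integer> endRows, int minTarget, int maxTarget) {
        int count = 0;
        for (int endRow : endRows) {
            if (endRow >= minTarget && endRow <= maxTarget) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 100,000 tablets with end rows 0..99,999.
        List<Integer> endRows = IntStream.range(0, 100_000)
                .boxed().collect(Collectors.toList());
        // Importing into only 10 tablets, but the min and max of those 10
        // span the whole table, so the scan visits every tablet.
        int visited = tabletsScanned(endRows, 0, 99_999);
        System.out.println("tablets visited by one contiguous scan: " + visited);
        System.out.println("tablets actually receiving files: 10");
    }
}
```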

Expected behavior

Ideally the amount of scanning done would be directly related to the number of tablets being bulk imported, not the number of tablets in the table. That would be a large change to the way the code works. A good first step would be to add some logging to the current code that captures how much time this behavior is wasting. Further decisions about improving the code could then be made based on that data.
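The kind of accounting suggested above could be sketched as follows (hypothetical names, not actual Accumulo code): track how many tablets the metadata scan visited versus how many actually received files, and log the wasted fraction.

```java
// Hypothetical sketch of per-load-step accounting: compare tablets
// visited by the metadata scan against tablets that received files.
public class LoadFilesStats {
    long tabletsVisited;
    long tabletsLoaded;

    void recordVisit() { tabletsVisited++; }
    void recordLoad()  { tabletsLoaded++; }

    // Summary line suitable for logging after the load step completes.
    String summary() {
        long skipped = tabletsVisited - tabletsLoaded;
        double wastedPct =
            tabletsVisited == 0 ? 0.0 : 100.0 * skipped / tabletsVisited;
        return String.format(
            "bulk load visited %d tablets, loaded %d, skipped %d (%.1f%% wasted)",
            tabletsVisited, tabletsLoaded, skipped, wastedPct);
    }

    public static void main(String[] args) {
        LoadFilesStats stats = new LoadFilesStats();
        for (int i = 0; i < 100; i++) stats.recordVisit();
        for (int i = 0; i < 10; i++) stats.recordLoad();
        System.out.println(stats.summary());
    }
}
```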

@keith-turner keith-turner added the bug This issue has been verified to be a bug. label Dec 19, 2024
@keith-turner
Contributor Author

This applies to the bulk v2 code; not sure if it applies to the bulk v1 code.

@DomGarguilo DomGarguilo self-assigned this Feb 3, 2025
dlmarion added a commit to dlmarion/accumulo that referenced this issue Feb 19, 2025
In the Bulk Import v2 LoadFiles step, a single TabletsMetadata
object was used to map a table's tablets to a set of bulk import
files. In the case where a small percentage of tablets were
involved in the bulk import, a majority of the table's tablets
would still be evaluated. In the case where bulk imports were
not importing into contiguous tablets, the code would just
iterate over the table's tablets until it found the next starting
point.

This change recreates the TabletsMetadata object when a set of
files is not going to start at the next tablet in the table. A
likely better way to achieve the same thing would be to reset
the range on the underlying Scanner and create a new iterator,
but the TabletsMetadata object does not expose the Scanner. This
change also closes the TabletsMetadata objects, which was not
being done previously.

Related to apache#5201
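The two strategies in the commit message can be contrasted with a hypothetical sketch (plain Java, using a TreeMap to stand in for the metadata table; not Accumulo code): re-creating the scan at the next needed start point touches only the target tablets, while a single contiguous pass touches everything in between.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical comparison: "seek" to each target tablet versus iterating
// over every tablet between the minimum and maximum target.
public class SeekVsIterate {

    // Strategy in the fix: recreate the view starting at each needed
    // tablet, so only the target tablets are touched.
    static int seekCost(TreeMap<Integer, String> tablets, int[] targets) {
        int visited = 0;
        for (int t : targets) {
            // tailMap(t) models re-creating the scan at the next start point.
            NavigableMap<Integer, String> view = tablets.tailMap(t, true);
            if (!view.isEmpty()) {
                visited++; // only the target tablet is inspected
            }
        }
        return visited;
    }

    // Old behavior: one pass over every tablet between min and max target.
    static int iterateCost(TreeMap<Integer, String> tablets, int min, int max) {
        return tablets.subMap(min, true, max, true).size();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> tablets = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            tablets.put(i, "tablet-" + i);
        }
        int[] targets = {0, 999}; // two targets at opposite ends of the table
        System.out.println("seek cost:    " + seekCost(tablets, targets));
        System.out.println("iterate cost: " + iterateCost(tablets, 0, 999));
    }
}
```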
ddanielr pushed a commit to ddanielr/accumulo that referenced this issue Feb 21, 2025