[GLUTEN-8846][CH] [Part 1] Support Positional Deletes #8937

Merged
merged 3 commits into apache:main from feature/iceberg-mor on Mar 12, 2025

Conversation

baibaichen
Contributor

@baibaichen baibaichen commented Mar 8, 2025

What changes were proposed in this pull request?

(Fixes: #8846)

Enhance the Parquet and Iceberg readers to handle delete columns that are not in the select list, and support positional delete operations in Iceberg tables.

Positional Delete Support

  • Bitmap
    Introduced DeltaDVRoaringBitmapArray for efficient position tracking:

    void rb_add(Int64 x);  // Add deleted positions
    bool rb_contains(Int64 x) const;  // Check position existence
  • Reader Integration

    • PositionalDeleteFileReader processes position delete files

    • Iceberg metadata column handling:

      static std::shared_ptr<IcebergMetadataColumn> icebergDeletePosColumn();
  • Filter
    Modified the Iceberg reader to apply bitmap filters during chunk generation (see the sketch after this list):

    delete_bitmap_array->rb_contains(pos[i]);
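
A minimal sketch of how a positional-delete bitmap becomes a per-chunk filter mask. rb_add/rb_contains match the API above; the stand-in bitmap type and the buildKeepMask helper are hypothetical simplifications (the real reader filters ClickHouse chunks and columns, not plain vectors):

    #include <cstdint>
    #include <set>
    #include <vector>

    // Hypothetical stand-in for DeltaDVRoaringBitmapArray: any set-like
    // structure over 64-bit row positions works for illustration.
    struct PositionBitmap
    {
        std::set<int64_t> positions; // a real implementation uses roaring bitmaps
        void rb_add(int64_t x) { positions.insert(x); }
        bool rb_contains(int64_t x) const { return positions.count(x) > 0; }
    };

    // Row i of a chunk sits at absolute file position first_row_pos + i;
    // it is dropped when that position appears in the delete bitmap.
    std::vector<uint8_t> buildKeepMask(
        const PositionBitmap & deletes, int64_t first_row_pos, size_t num_rows)
    {
        std::vector<uint8_t> keep(num_rows);
        for (size_t i = 0; i < num_rows; ++i)
            keep[i] = deletes.rb_contains(first_row_pos + static_cast<int64_t>(i)) ? 0 : 1;
        return keep; // the reader would pass this to IColumn::filter
    }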

Delete columns not in select list

Fix an issue when executing SQL like select c0 from t while the Equality Delete File contains c1

  • Added collectFileSchema in ParquetMetaBuilder to extract file schema
  • Append delete columns to the read header if they are not in the select list
  • Modified IcebergReader to remove temporary columns after applying deletes (a sketch follows this list):

    columns.erase(columns.begin() + start_remove_index, columns.end());
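
A rough sketch of this append-then-erase pattern, using hypothetical names (buildReadHeader, selectColumns, deleteColumns); the actual logic lives in ParquetMetaBuilder and IcebergReader:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Append delete columns missing from the select list, remembering where
    // the temporary columns start so they can be erased after filtering.
    std::vector<std::string> buildReadHeader(
        const std::vector<std::string> & selectColumns,
        const std::vector<std::string> & deleteColumns,
        size_t & start_remove_index)
    {
        std::vector<std::string> readColumns = selectColumns;
        start_remove_index = readColumns.size();
        for (const auto & col : deleteColumns)
            if (std::find(readColumns.begin(), readColumns.end(), col) == readColumns.end())
                readColumns.push_back(col);
        return readColumns;
    }

    // After the equality delete is applied, the temporary columns are dropped:
    //     columns.erase(columns.begin() + start_remove_index, columns.end());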

How was this patch tested?

Unit tests using gtest.

github-actions bot commented Mar 8, 2025

#8846

@baibaichen baibaichen requested a review from Copilot March 8, 2025 10:23
github-actions bot commented Mar 8, 2025

Run Gluten Clickhouse CI on x86

Copilot reviewed 20 out of 20 changed files in this pull request and generated no comments.

@baibaichen
Contributor Author

@CodiumAI-Agent /review

@CodiumAI-Agent

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

8846 - Partially compliant

Compliant requirements:

  • Integration of new delete file readers (EqualityDeleteFileReader and PositionalDeleteFileReader).
  • Enhanced iceberg reader that supports both delete-expression and bitmap deletion filtering.
  • Extensive tests added/updated covering varied delete scenarios.

Non-compliant requirements:

  • Support for deleting columns not in the select set is still pending.

Requires further human verification:

  • Benchmark performance across different datasets.
  • Verify correctness of row filtering logic for edge-case row group boundaries.
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Complex Filtering Logic

The merging of equality and positional delete logic introduces non-trivial filtering; ensure all edge cases (empty deletes, complete deletes, and mixed scenarios) are thoroughly validated.

IcebergReader::~IcebergReader() = default;


Chunk IcebergReader::doPull()
{
    if (!delete_expr && !delete_bitmap_array)
        return NormalFileReader::doPull();

    while (true)
    {
        Chunk chunk = NormalFileReader::doPull();
        if (chunk.getNumRows() == 0)
            return chunk;

        Block deleted_block;
        if (delete_expr)
            deleted_block = applyEqualityDelete(chunk);

        if (delete_bitmap_array)
        {
Test Coverage

New tests add extensive delete scenarios with random and continuous values. Review that all potential edge cases are covered and that test expectations align with the new deletion logic.

/// This test creates one single data file and one delete file. The parameter
/// passed to assertSingleBaseFileSingleDeleteFile is the delete positions.
TEST_F(IcebergTest, singleBaseFileSinglePositionalDeleteFile)
{
    assertSingleBaseFileSingleDeleteFile({{0, 1, 2, 3}});
    // Delete the first and last row in each batch (10000 rows per batch)
    assertSingleBaseFileSingleDeleteFile({{0, 9999, 10000, 19999}});
    // Delete several rows in the second batch (10000 rows per batch)
    assertSingleBaseFileSingleDeleteFile({{10000, 10002, 19999}});
    // Delete random rows
    assertSingleBaseFileSingleDeleteFile({makeRandomIncreasingValues(0, 20000)});
    // Delete 0 rows
    assertSingleBaseFileSingleDeleteFile({});
    // Delete all rows
    assertSingleBaseFileSingleDeleteFile(
        {makeContinuousIncreasingValues(0, 20000)});
    // Delete rows that don't exist
    assertSingleBaseFileSingleDeleteFile({{20000, 29999}});
}

/// This test creates 3 base data files; only the middle one has corresponding
/// delete positions. The parameter passed to
/// assertMultipleBaseFileSingleDeleteFile is the delete positions for the
/// middle base file.
TEST_F(IcebergTest, MultipleBaseFilesSinglePositionalDeleteFile) {

    assertMultipleBaseFileSingleDeleteFile({0, 1, 2, 3});
    assertMultipleBaseFileSingleDeleteFile({0, 9999, 10000, 19999});
    assertMultipleBaseFileSingleDeleteFile({10000, 10002, 19999});
    assertMultipleBaseFileSingleDeleteFile({10000, 10002, 19999});
    assertMultipleBaseFileSingleDeleteFile(
        makeRandomIncreasingValues(0, rowCount));
    assertMultipleBaseFileSingleDeleteFile({});
    assertMultipleBaseFileSingleDeleteFile(
        makeContinuousIncreasingValues(0, rowCount));
}

/// This test creates one base data file/split with multiple delete files. The
/// parameter passed to assertSingleBaseFileMultipleDeleteFiles is the vector of
/// delete files. Each leaf vector represents the delete positions in that
/// delete file.
TEST_F(IcebergTest, singleBaseFileMultiplePositionalDeleteFiles) {

    // Delete rows 1, 2, 3, 4 from the first batch out of two, one per delete file.
    assertSingleBaseFileMultipleDeleteFiles({{1}, {2}, {3}, {4}});
    // Delete the first and last row in each batch (10000 rows per batch).
    assertSingleBaseFileMultipleDeleteFiles({{0}, {9999}, {10000}, {19999}});

    assertSingleBaseFileMultipleDeleteFiles({{500, 21000}});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeRandomIncreasingValues(0, 10000),
         makeRandomIncreasingValues(10000, 20000),
         makeRandomIncreasingValues(5000, 15000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeContinuousIncreasingValues(0, 10000),
         makeContinuousIncreasingValues(10000, 20000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeContinuousIncreasingValues(0, 10000),
         makeContinuousIncreasingValues(10000, 20000),
         makeRandomIncreasingValues(5000, 15000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeContinuousIncreasingValues(0, 20000),
         makeContinuousIncreasingValues(0, 20000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeRandomIncreasingValues(0, 20000),
         {},
         makeRandomIncreasingValues(5000, 15000)});

    assertSingleBaseFileMultipleDeleteFiles({{}, {}});
}

/// This test creates 2 base data files, and 1 or 2 delete files, with unaligned
/// RowGroup boundaries
TEST_F(IcebergTest, multipleBaseFileMultiplePositionalDeleteFiles) {

  std::map<std::string, std::vector<int64_t>> rowGroupSizesForFiles;
  std::unordered_map<
      std::string,
      std::multimap<std::string, std::vector<int64_t>>>
      deleteFilesForBaseDatafiles;

  // Create two data files, each with two RowGroups
  rowGroupSizesForFiles["data_file_1"] = {100, 85};
  rowGroupSizesForFiles["data_file_2"] = {99, 1};

  // Delete 3 rows from the first RowGroup in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {{"data_file_1", {0, 1, 99}}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete 3 rows from the second RowGroup in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", {100, 101, 184}}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete random rows from both RowGroups in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeRandomIncreasingValues(0, 185)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete all rows in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeContinuousIncreasingValues(0, 185)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);
  // Delete non-existent rows from data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeRandomIncreasingValues(186, 300)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete several rows from both RowGroups in both data files

@baibaichen
Contributor Author

@CodiumAI-Agent /improve

@CodiumAI-Agent

CodiumAI-Agent commented Mar 9, 2025

PR Code Suggestions ✨

Latest suggestions up to 49839b9

Category | Suggestion | Impact
Possible issue
Validate key existence

Verify that the key exists in baseFilePaths before using it to prevent potential
crashes.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [80-92]

 for (const auto& deleteFile : deleteFilesForBaseDatafiles)
 {
     auto deleteFileName = deleteFile.first;
     auto deleteFileContent = deleteFile.second;
     auto deleteFilePath = TempFilePath::tmp("parquet");
     for (auto& deleteFileRowGroup : deleteFileContent)
     {
         auto baseFileName = deleteFileRowGroup.first;
+        if (baseFilePaths.find(baseFileName) == baseFilePaths.end())
+        {
+            throw std::runtime_error("Base file path for " + baseFileName + " not found.");
+        }
         auto baseFilePath = "file://" + baseFilePaths[baseFileName]->string();
         ...
Suggestion importance[1-10]: 8


Why: Adding a check to verify the key exists in baseFilePaths prevents possible runtime crashes, addressing a potential bug and increasing robustness.

Medium
Use value for context

Store the context member by value instead of a reference to prevent potential
dangling references.

cpp-ch/local-engine/Storages/SubstraitSource/iceberg/EqualityDeleteFileReader.h [60]

-const DB::ContextPtr & context_;
+DB::ContextPtr context_;
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a potential lifetime issue by recommending storing DB::ContextPtr by value rather than a reference, thus avoiding dangling references; however, the change may have minor performance implications depending on context usage.

Medium
General
Use const reference in loop

Use constant references when iterating over vectors to avoid unnecessary copies.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [211-217]

 std::vector<int64_t> flattenAndDedup(
     const std::vector<std::vector<int64_t>>& deletePositionVectors,
     int64_t baseFileSize) {
     std::vector<int64_t> deletePositionVector;
-    for (auto vec : deletePositionVectors) {
+    for (const auto & vec : deletePositionVectors) {
         for (auto pos : vec) {
             if (pos >= 0 && pos < baseFileSize) {
                 deletePositionVector.push_back(pos);
             }
         }
     }
     ...
Suggestion importance[1-10]: 5


Why: Replacing the copy in the loop with a const reference improves performance and code clarity. However, it's a minor improvement and does not affect functionality.

Low

Previous suggestions

Suggestions up to commit c1945f7
Suggestions up to commit a00ad70
Category | Suggestion | Impact
Possible issue
Implement equality operator

Provide a proper implementation for the equality operator to correctly compare
object state.

cpp-ch/local-engine/Storages/SubstraitSource/Delta/Bitmap/DeltaDVRoaringBitmapArray.h [46]

-bool operator==(const DeltaDVRoaringBitmapArray & other) const;
+bool operator==(const DeltaDVRoaringBitmapArray & other) const {
+    return context == other.context && roaring_bitmap_array == other.roaring_bitmap_array;
+}
Suggestion importance[1-10]: 8


Why: Implementing the equality operator ensures that object comparisons correctly reflect the internal state, which is crucial for correctness and potential debugging.

Medium
Validate key existence

Verify that the key exists in baseFilePaths before accessing the pointer to avoid
potential null dereferences.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [93]

-auto baseFilePath = "file://" + baseFilePaths[baseFileName]->string();
+if (baseFilePaths.find(baseFileName) != baseFilePaths.end())
+  auto baseFilePath = "file://" + baseFilePaths.at(baseFileName)->string();
+else
+  // Handle missing base file case appropriately
Suggestion importance[1-10]: 7


Why: The suggestion addresses a potential null dereference by ensuring that the key exists in the baseFilePaths map, enhancing runtime safety in a critical code path.

Medium
Store smart pointer by value

Store the context smart pointer by value to prevent potential lifetime issues.

cpp-ch/local-engine/Storages/SubstraitSource/iceberg/EqualityDeleteFileReader.h [60]

-const DB::ContextPtr & context_;
+DB::ContextPtr context_;
Suggestion importance[1-10]: 7


Why: Changing from a const reference to storing the smart pointer by value can help prevent potential lifetime issues. Although this change might affect design assumptions, it is a reasonable improvement if ensuring robustness is a priority.

Medium
Confirm consistent block schema

Ensure that initializing the writer with the first block's schema is intentional for
all blocks; if blocks can differ, consider reinitializing the writer based on a
consistent schema.

cpp-ch/local-engine/tests/utils/ReaderTestBase.cpp [97-99]

+// Validate all blocks share the same schema or reinitialize the writer per block as needed
 const auto writer = NormalFileWriter::create(context_, fileUri.toString(), blocks[0], file.getExtension());
 for (const auto & block : blocks)
     writer->write(block);
Suggestion importance[1-10]: 2


Why: This suggestion is more of a review comment to check schema consistency rather than a concrete change, offering only marginal improvement.

Low
Suggestions up to commit 5ea18c0
Category | Suggestion | Impact
Possible issue
Validate base file key

Check that baseFilePaths contains the key before using it, and handle the missing
case to avoid a potential crash.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [92]

-auto baseFilePath = "file://" + baseFilePaths[baseFileName]->string();
+auto it = baseFilePaths.find(baseFileName);
+if (it != baseFilePaths.end())
+{
+    auto baseFilePath = "file://" + it->second->string();
+    // proceed with baseFilePath
+}
+else
+{
+    // handle missing key appropriately (log error or throw exception)
+}
Suggestion importance[1-10]: 8


Why: The suggestion addresses a potential crash by checking for the key's existence in baseFilePaths, which improves robustness and error handling in the file path construction.

Medium
General
Use dynamic RNG seeding

Consider seeding the random number generator dynamically (e.g. with
std::random_device) if varied random outputs are intended for testing.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [705]

-std::mt19937 gen{0};
+std::mt19937 gen(std::random_device{}());
Suggestion importance[1-10]: 3


Why: While the change enhances randomness by avoiding a fixed seed, it is a minor stylistic improvement that might affect reproducibility in tests.

Low

@baibaichen baibaichen force-pushed the feature/iceberg-mor branch from 5ea18c0 to a00ad70 on March 10, 2025 14:57
Run Gluten Clickhouse CI on x86

@baibaichen baibaichen force-pushed the feature/iceberg-mor branch 2 times, most recently from 398d57d to c1945f7 on March 11, 2025 07:40
Run Gluten Clickhouse CI on x86

1 similar comment
Run Gluten Clickhouse CI on x86

@baibaichen
Contributor Author

@CodiumAI-Agent /review

@baibaichen baibaichen requested a review from Copilot March 11, 2025 07:55

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

@baibaichen baibaichen changed the title from "[GLUTEN-8846][CH] [Part 1] .." to "[GLUTEN-8846][CH] [Part 1] Support Positional Deletes" on Mar 11, 2025
@baibaichen
Contributor Author

@CodiumAI-Agent /improve

@baibaichen baibaichen marked this pull request as ready for review March 11, 2025 09:46
@baibaichen baibaichen force-pushed the feature/iceberg-mor branch from c1945f7 to 49839b9 on March 12, 2025 02:45
Run Gluten Clickhouse CI on x86

@zzcclp zzcclp (Contributor) left a comment

LGTM

@zzcclp zzcclp merged commit 21abbbc into apache:main Mar 12, 2025
8 checks passed
@baibaichen baibaichen deleted the feature/iceberg-mor branch March 12, 2025 05:22
yikf pushed a commit to yikf/incubator-gluten that referenced this pull request Mar 13, 2025
Successfully merging this pull request may close these issues.

[CH][UMBRELLA] Support Iceberg MOR