[GLUTEN-8846][CH] [Part 1] Support Positional Deletes #8937

Merged
merged 3 commits into apache:main from feature/iceberg-mor on Mar 12, 2025

Conversation

baibaichen
Contributor

@baibaichen baibaichen commented Mar 8, 2025

What changes were proposed in this pull request?

(Fixes: #8846)

Enhance the Parquet and Iceberg readers to handle delete columns that are not in the select list, and support positional delete operations in Iceberg tables.

Positional Delete Support

  • Bitmap
    Introduced DeltaDVRoaringBitmapArray for efficient position tracking:

    void rb_add(Int64 x);  // Add deleted positions
    bool rb_contains(Int64 x) const;  // Check position existence
  • Reader Integration

    • PositionalDeleteFileReader processes position delete files

    • Iceberg metadata column handling:

      static std::shared_ptr<IcebergMetadataColumn> icebergDeletePosColumn();
  • Filter
    Modified the Iceberg reader to apply bitmap filters during chunk generation (see the sketch after this list):

    delete_bitmap_array->rb_contains(pos[i]);
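
A minimal sketch of how a positional-delete bitmap becomes a per-chunk filter mask. rb_add/rb_contains match the API above; the stand-in bitmap type and the buildKeepMask helper are hypothetical simplifications (the real reader filters ClickHouse chunks and columns, not plain vectors):

    #include <cstdint>
    #include <set>
    #include <vector>

    // Hypothetical stand-in for DeltaDVRoaringBitmapArray: any set-like
    // structure over 64-bit row positions works for illustration.
    struct PositionBitmap
    {
        std::set<int64_t> positions; // a real implementation uses roaring bitmaps
        void rb_add(int64_t x) { positions.insert(x); }
        bool rb_contains(int64_t x) const { return positions.count(x) > 0; }
    };

    // Row i of a chunk sits at absolute file position first_row_pos + i;
    // it is dropped when that position appears in the delete bitmap.
    std::vector<uint8_t> buildKeepMask(
        const PositionBitmap & deletes, int64_t first_row_pos, size_t num_rows)
    {
        std::vector<uint8_t> keep(num_rows);
        for (size_t i = 0; i < num_rows; ++i)
            keep[i] = deletes.rb_contains(first_row_pos + static_cast<int64_t>(i)) ? 0 : 1;
        return keep; // the reader would pass this to IColumn::filter
    }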

Delete columns not in select list

Fix an issue when executing SQL like select c0 from t while the Equality Delete File contains c1

  • Added collectFileSchema in ParquetMetaBuilder to extract file schema
  • Append delete columns to the read header if they are not in the select list
  • Modified IcebergReader to remove temporary columns after applying deletes (a sketch follows this list):

    columns.erase(columns.begin() + start_remove_index, columns.end());
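
A rough sketch of this append-then-erase pattern, using hypothetical names (buildReadHeader, selectColumns, deleteColumns); the actual logic lives in ParquetMetaBuilder and IcebergReader:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Append delete columns missing from the select list, remembering where
    // the temporary columns start so they can be erased after filtering.
    std::vector<std::string> buildReadHeader(
        const std::vector<std::string> & selectColumns,
        const std::vector<std::string> & deleteColumns,
        size_t & start_remove_index)
    {
        std::vector<std::string> readColumns = selectColumns;
        start_remove_index = readColumns.size();
        for (const auto & col : deleteColumns)
            if (std::find(readColumns.begin(), readColumns.end(), col) == readColumns.end())
                readColumns.push_back(col);
        return readColumns;
    }

    // After the equality delete is applied, the temporary columns are dropped:
    //     columns.erase(columns.begin() + start_remove_index, columns.end());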

How was this patch tested?

Unit tests using gtest.

github-actions bot commented Mar 8, 2025

#8846

@baibaichen baibaichen requested a review from Copilot March 8, 2025 10:23
github-actions bot commented Mar 8, 2025

Run Gluten Clickhouse CI on x86

Copilot reviewed 20 out of 20 changed files in this pull request and generated no comments.

@baibaichen
Contributor Author

@CodiumAI-Agent /review

@CodiumAI-Agent

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

8846 - Partially compliant

Compliant requirements:

  • Integration of new delete file readers (EqualityDeleteFileReader and PositionalDeleteFileReader).
  • Enhanced iceberg reader that supports both delete-expression and bitmap deletion filtering.
  • Extensive tests added/updated covering varied delete scenarios.

Non-compliant requirements:

  • Support for deleting columns not in the select set is still pending.

Requires further human verification:

  • Benchmark performance across different datasets.
  • Verify correctness of row filtering logic for edge-case row group boundaries.
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Complex Filtering Logic

The merging of equality and positional delete logic introduces non-trivial filtering; ensure all edge cases (empty deletes, complete deletes, and mixed scenarios) are thoroughly validated.

IcebergReader::~IcebergReader() = default;


Chunk IcebergReader::doPull()
{
    if (!delete_expr && !delete_bitmap_array)
        return NormalFileReader::doPull();

    while (true)
    {
        Chunk chunk = NormalFileReader::doPull();
        if (chunk.getNumRows() == 0)
            return chunk;

        Block deleted_block;
        if (delete_expr)
            deleted_block = applyEqualityDelete(chunk);

        if (delete_bitmap_array)
        {
Test Coverage

New tests add extensive delete scenarios with random and continuous values. Review that all potential edge cases are covered and that test expectations align with the new deletion logic.

/// This test creates one single data file and one delete file. The parameter
/// passed to assertSingleBaseFileSingleDeleteFile is the delete positions.
TEST_F(IcebergTest, singleBaseFileSinglePositionalDeleteFile)
{
    assertSingleBaseFileSingleDeleteFile({{0, 1, 2, 3}});
    // Delete the first and last row in each batch (10000 rows per batch)
    assertSingleBaseFileSingleDeleteFile({{0, 9999, 10000, 19999}});
    // Delete several rows in the second batch (10000 rows per batch)
    assertSingleBaseFileSingleDeleteFile({{10000, 10002, 19999}});
    // Delete random rows
    assertSingleBaseFileSingleDeleteFile({makeRandomIncreasingValues(0, 20000)});
    // Delete 0 rows
    assertSingleBaseFileSingleDeleteFile({});
    // Delete all rows
    assertSingleBaseFileSingleDeleteFile(
        {makeContinuousIncreasingValues(0, 20000)});
    // Delete rows that don't exist
    assertSingleBaseFileSingleDeleteFile({{20000, 29999}});
}

/// This test creates 3 base data files; only the middle one has corresponding
/// delete positions. The parameter passed to
/// assertMultipleBaseFileSingleDeleteFile is the delete positions for the
/// middle base file.
TEST_F(IcebergTest, MultipleBaseFilesSinglePositionalDeleteFile) {

    assertMultipleBaseFileSingleDeleteFile({0, 1, 2, 3});
    assertMultipleBaseFileSingleDeleteFile({0, 9999, 10000, 19999});
    assertMultipleBaseFileSingleDeleteFile({10000, 10002, 19999});
    assertMultipleBaseFileSingleDeleteFile({10000, 10002, 19999});
    assertMultipleBaseFileSingleDeleteFile(
        makeRandomIncreasingValues(0, rowCount));
    assertMultipleBaseFileSingleDeleteFile({});
    assertMultipleBaseFileSingleDeleteFile(
        makeContinuousIncreasingValues(0, rowCount));
}

/// This test creates one base data file/split with multiple delete files. The
/// parameter passed to assertSingleBaseFileMultipleDeleteFiles is the vector of
/// delete files. Each leaf vector represents the delete positions in that
/// delete file.
TEST_F(IcebergTest, singleBaseFileMultiplePositionalDeleteFiles) {

    // Delete rows 1, 2, 3, 4 from the first batch out of two, one per delete file.
    assertSingleBaseFileMultipleDeleteFiles({{1}, {2}, {3}, {4}});
    // Delete the first and last row in each batch (10000 rows per batch).
    assertSingleBaseFileMultipleDeleteFiles({{0}, {9999}, {10000}, {19999}});

    assertSingleBaseFileMultipleDeleteFiles({{500, 21000}});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeRandomIncreasingValues(0, 10000),
         makeRandomIncreasingValues(10000, 20000),
         makeRandomIncreasingValues(5000, 15000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeContinuousIncreasingValues(0, 10000),
         makeContinuousIncreasingValues(10000, 20000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeContinuousIncreasingValues(0, 10000),
         makeContinuousIncreasingValues(10000, 20000),
         makeRandomIncreasingValues(5000, 15000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeContinuousIncreasingValues(0, 20000),
         makeContinuousIncreasingValues(0, 20000)});

    assertSingleBaseFileMultipleDeleteFiles(
        {makeRandomIncreasingValues(0, 20000),
         {},
         makeRandomIncreasingValues(5000, 15000)});

    assertSingleBaseFileMultipleDeleteFiles({{}, {}});
}

/// This test creates 2 base data files, and 1 or 2 delete files, with unaligned
/// RowGroup boundaries
TEST_F(IcebergTest, multipleBaseFileMultiplePositionalDeleteFiles) {

  std::map<std::string, std::vector<int64_t>> rowGroupSizesForFiles;
  std::unordered_map<
      std::string,
      std::multimap<std::string, std::vector<int64_t>>>
      deleteFilesForBaseDatafiles;

  // Create two data files, each with two RowGroups
  rowGroupSizesForFiles["data_file_1"] = {100, 85};
  rowGroupSizesForFiles["data_file_2"] = {99, 1};

  // Delete 3 rows from the first RowGroup in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {{"data_file_1", {0, 1, 99}}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete 3 rows from the second RowGroup in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", {100, 101, 184}}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete random rows from both RowGroups in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeRandomIncreasingValues(0, 185)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete all rows in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeContinuousIncreasingValues(0, 185)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);
  // Delete non-existent rows from data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeRandomIncreasingValues(186, 300)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete several rows from both RowGroups in both data files

@baibaichen
Contributor Author

@CodiumAI-Agent /improve

@CodiumAI-Agent

CodiumAI-Agent commented Mar 9, 2025

PR Code Suggestions ✨

Latest suggestions up to 49839b9

Category | Suggestion | Impact
Possible issue
Validate key existence

Verify that the key exists in baseFilePaths before using it to prevent potential
crashes.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [80-92]

 for (const auto& deleteFile : deleteFilesForBaseDatafiles)
 {
     auto deleteFileName = deleteFile.first;
     auto deleteFileContent = deleteFile.second;
     auto deleteFilePath = TempFilePath::tmp("parquet");
     for (auto& deleteFileRowGroup : deleteFileContent)
     {
         auto baseFileName = deleteFileRowGroup.first;
+        if (baseFilePaths.find(baseFileName) == baseFilePaths.end())
+        {
+            throw std::runtime_error("Base file path for " + baseFileName + " not found.");
+        }
         auto baseFilePath = "file://" + baseFilePaths[baseFileName]->string();
         ...
Suggestion importance[1-10]: 8


Why: Adding a check to verify the key exists in baseFilePaths prevents possible runtime crashes, addressing a potential bug and increasing robustness.

Medium
Use value for context

Store the context member by value instead of a reference to prevent potential
dangling references.

cpp-ch/local-engine/Storages/SubstraitSource/iceberg/EqualityDeleteFileReader.h [60]

-const DB::ContextPtr & context_;
+DB::ContextPtr context_;
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a potential lifetime issue by recommending storing DB::ContextPtr by value rather than a reference, thus avoiding dangling references; however, the change may have minor performance implications depending on context usage.

Medium
General
Use const reference in loop

Use constant references when iterating over vectors to avoid unnecessary copies.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [211-217]

 std::vector<int64_t> flattenAndDedup(
     const std::vector<std::vector<int64_t>>& deletePositionVectors,
     int64_t baseFileSize) {
     std::vector<int64_t> deletePositionVector;
-    for (auto vec : deletePositionVectors) {
+    for (const auto & vec : deletePositionVectors) {
         for (auto pos : vec) {
             if (pos >= 0 && pos < baseFileSize) {
                 deletePositionVector.push_back(pos);
             }
         }
     }
     ...
Suggestion importance[1-10]: 5


Why: Replacing the copy in the loop with a const reference improves performance and code clarity. However, it's a minor improvement and does not affect functionality.

Low

Previous suggestions

Suggestions up to commit c1945f7
Suggestions up to commit a00ad70
Category | Suggestion | Impact
Possible issue
Implement equality operator

Provide a proper implementation for the equality operator to correctly compare
object state.

cpp-ch/local-engine/Storages/SubstraitSource/Delta/Bitmap/DeltaDVRoaringBitmapArray.h [46]

-bool operator==(const DeltaDVRoaringBitmapArray & other) const;
+bool operator==(const DeltaDVRoaringBitmapArray & other) const {
+    return context == other.context && roaring_bitmap_array == other.roaring_bitmap_array;
+}
Suggestion importance[1-10]: 8


Why: Implementing the equality operator ensures that object comparisons correctly reflect the internal state, which is crucial for correctness and potential debugging.

Medium
Validate key existence

Verify that the key exists in baseFilePaths before accessing the pointer to avoid
potential null dereferences.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [93]

-auto baseFilePath = "file://" + baseFilePaths[baseFileName]->string();
+if (baseFilePaths.find(baseFileName) != baseFilePaths.end())
+  auto baseFilePath = "file://" + baseFilePaths.at(baseFileName)->string();
+else
+  // Handle missing base file case appropriately
Suggestion importance[1-10]: 7


Why: The suggestion addresses a potential null dereference by ensuring that the key exists in the baseFilePaths map, enhancing runtime safety in a critical code path.

Medium
Store smart pointer by value

Store the context smart pointer by value to prevent potential lifetime issues.

cpp-ch/local-engine/Storages/SubstraitSource/iceberg/EqualityDeleteFileReader.h [60]

-const DB::ContextPtr & context_;
+DB::ContextPtr context_;
Suggestion importance[1-10]: 7


Why: Changing from a const reference to storing the smart pointer by value can help prevent potential lifetime issues. Although this change might affect design assumptions, it is a reasonable improvement if ensuring robustness is a priority.

Medium
Confirm consistent block schema

Ensure that initializing the writer with the first block's schema is intentional for
all blocks; if blocks can differ, consider reinitializing the writer based on a
consistent schema.

cpp-ch/local-engine/tests/utils/ReaderTestBase.cpp [97-99]

+// Validate all blocks share the same schema or reinitialize the writer per block as needed
 const auto writer = NormalFileWriter::create(context_, fileUri.toString(), blocks[0], file.getExtension());
 for (const auto & block : blocks)
     writer->write(block);
Suggestion importance[1-10]: 2


Why: This suggestion is more of a review comment to check schema consistency rather than a concrete change, offering only marginal improvement.

Low
Suggestions up to commit 5ea18c0
Category | Suggestion | Impact
Possible issue
Validate base file key

Check that baseFilePaths contains the key before using it, and handle the missing
case to avoid a potential crash.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [92]

-auto baseFilePath = "file://" + baseFilePaths[baseFileName]->string();
+auto it = baseFilePaths.find(baseFileName);
+if (it != baseFilePaths.end())
+{
+    auto baseFilePath = "file://" + it->second->string();
+    // proceed with baseFilePath
+}
+else
+{
+    // handle missing key appropriately (log error or throw exception)
+}
Suggestion importance[1-10]: 8


Why: The suggestion addresses a potential crash by checking for the key's existence in baseFilePaths, which improves robustness and error handling in the file path construction.

Medium
General
Use dynamic RNG seeding

Consider seeding the random number generator dynamically (e.g. with
std::random_device) if varied random outputs are intended for testing.

cpp-ch/local-engine/tests/gtest_iceberge_test.cpp [705]

-std::mt19937 gen{0};
+std::mt19937 gen(std::random_device{}());
Suggestion importance[1-10]: 3


Why: While the change enhances randomness by avoiding a fixed seed, it is a minor stylistic improvement that might affect reproducibility in tests.

Low

@baibaichen baibaichen force-pushed the feature/iceberg-mor branch from 5ea18c0 to a00ad70 on March 10, 2025 14:57
Run Gluten Clickhouse CI on x86

@baibaichen baibaichen force-pushed the feature/iceberg-mor branch 2 times, most recently from 398d57d to c1945f7 on March 11, 2025 07:40
Run Gluten Clickhouse CI on x86

1 similar comment
Run Gluten Clickhouse CI on x86

@baibaichen
Contributor Author

@CodiumAI-Agent /review

@baibaichen baibaichen requested a review from Copilot March 11, 2025 07:55

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

@baibaichen baibaichen changed the title from "[GLUTEN-8846][CH] [Part 1] .." to "[GLUTEN-8846][CH] [Part 1] Support Positional Deletes" on Mar 11, 2025
@baibaichen
Contributor Author

@CodiumAI-Agent /improve

@baibaichen baibaichen marked this pull request as ready for review March 11, 2025 09:46
@baibaichen baibaichen force-pushed the feature/iceberg-mor branch from c1945f7 to 49839b9 on March 12, 2025 02:45
Run Gluten Clickhouse CI on x86

@zzcclp zzcclp (Contributor) left a comment

LGTM

@zzcclp zzcclp merged commit 21abbbc into apache:main Mar 12, 2025
8 checks passed
@baibaichen baibaichen deleted the feature/iceberg-mor branch March 12, 2025 05:22
yikf pushed a commit to yikf/incubator-gluten that referenced this pull request Mar 13, 2025
Successfully merging this pull request may close these issues.

[CH][UMBRELLA] Support Iceberg MOR