Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8872][CH][Part-2] Support Delta Deletion Vectors read for CH backend #8947

Merged
merged 2 commits into from
Mar 12, 2025

Conversation

zzcclp
Copy link
Contributor

@zzcclp zzcclp commented Mar 10, 2025

What changes were proposed in this pull request?

Part-2:
Support reading deletion vectors bitmap when queryring

(Fixes: #8872)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Copy link

#8872

Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

@zzcclp zzcclp force-pushed the gluten-8872-part-2 branch from 50f2c0f to 57076c9 Compare March 10, 2025 11:43
Copy link

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@zzcclp zzcclp requested review from loneylee and baibaichen March 10, 2025 12:36
@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 11, 2025

@CodiumAI-Agent /review

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 11, 2025

@CodiumAI-Agent /improve

@CodiumAI-Agent
Copy link

CodiumAI-Agent commented Mar 11, 2025

PR Code Suggestions ✨

Latest suggestions up to e86e2c5

CategorySuggestion                                                                                                                                    Impact
Possible issue
Checksum over correct data

Capture the checksum data region before consuming any bytes to ensure the checksum
is computed over the intended data range.

cpp-ch/local-engine/Storages/SubstraitSource/Delta/Bitmap/DeltaDVRoaringBitmapArray.cpp [72-77]

+char* checksum_start = in->position();
 readBinaryBigEndian(size, *in);
 if (data_size != size)
     throw DB::Exception(DB::ErrorCodes::BAD_ARGUMENTS, "The size of the deletion vector is mismatch.");
-int checksum_value = static_cast<Int32>(crc32_z(0L, reinterpret_cast<unsigned char*>(in->position()), size));
+int checksum_value = static_cast<Int32>(crc32_z(0L, reinterpret_cast<unsigned char*>(checksum_start), size));
Suggestion importance[1-10]: 8

__

Why: By capturing the data pointer before consuming bytes for reading, the suggestion addresses a potential bug in checksum calculation, ensuring that the checksum is computed over the correct data region.

Medium
Use dynamic index lookup

Replace the hard-coded column index with a dynamic lookup based on the column name.

cpp-ch/local-engine/Storages/SubstraitSource/Delta/DeltaReader.cpp [93-96]

+size_t rowIndexPos = outputHeader.getPositionByName(DeltaParquetVirtualMeta::ROW_INDEX_COLUMN);
 for (int i = 0; i < num_rows; i++)
 {
-    vec[i] = bitmap_array->rb_contains(chunk.getColumns()[1]->get64(i));
+    vec[i] = bitmap_array->rb_contains(chunk.getColumns()[rowIndexPos]->get64(i));
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion improves code maintainability by replacing the hard-coded index with a dynamic lookup, though it is a minor enhancement and relies on the existence and correctness of the constant used for lookup.

Low

Previous suggestions

Suggestions up to commit ee811df
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add JSON error handling

Add proper error handling for JSON deserialization to gracefully handle invalid or
unexpected input.

backends-clickhouse/src-delta-32/main/scala/org/apache/gluten/sql/shims/delta32/Delta32Shims.scala [50-65]

 if (k.equalsIgnoreCase(DeltaParquetFileFormat.FILE_ROW_INDEX_FILTER_ID_ENCODED)) {
-  val decoded = JsonUtils.fromJson[DeletionVectorDescriptor](v.toString)
-  ...
+  try {
+    val decoded = JsonUtils.fromJson[DeletionVectorDescriptor](v.toString)
+    ...
+  } catch {
+    case e: Exception => 
+      // Log error and handle fallback scenario
+      newOtherConstantMetadataColumnValues.put(k, v)
+  }
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion improves robustness by catching JSON deserialization errors, which can prevent runtime crashes when encountering malformed input. The change is actionable and directly targets the code block handling delta parsing.

Low
Correct checksum ordering

Verify the ordering and buffer offsets when computing the checksum to ensure it
covers the intended data block.

cpp-ch/local-engine/Storages/SubstraitSource/Delta/Bitmap/DeltaDVRoaringBitmapArray.cpp [69-80]

 in->seek(offset, SEEK_SET);
 int size;
 readBinaryBigEndian(size, *in);
 if (data_size != size)
   throw DB::Exception(DB::ErrorCodes::BAD_ARGUMENTS, "The size of the deletion vector is mismatch.");
-int checksum_value = static_cast<Int32>(crc32_z(0L, reinterpret_cast<unsigned char*>(in->position()), size));
+// Capture the current buffer position for checksum computation before further reads.
+auto checksumDataPtr = in->position();
 int magic_num;
 readBinaryLittleEndian(magic_num, *in);
+// Compute checksum over the intended data block.
+int checksum_value = static_cast<Int32>(crc32_z(0L, reinterpret_cast<unsigned char*>(checksumDataPtr), size));
 ...
Suggestion importance[1-10]: 3

__

Why: The proposal to capture the buffer pointer before further reads is a minor stylistic improvement that clarifies the intended checksum computation, though the current implementation already computes the checksum correctly.

Low
Suggestions
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check for buffer

Add a null pointer check after the dynamic_cast to ensure the read buffer is valid
before use.

cpp-ch/local-engine/Storages/SubstraitSource/Delta/Bitmap/DeltaDVRoaringBitmapArray.cpp [58-60]

 auto * in = dynamic_cast<DB::SeekableReadBuffer *>(read_buffer_builder->build(file_info).release());
+if (in == nullptr) {
+    throw DB::Exception(DB::ErrorCodes::BAD_ARGUMENTS, "Failed to create a valid SeekableReadBuffer.");
+}
 in->seek(offset, SEEK_SET);
Suggestion importance[1-10]: 8

__

Why: The suggestion is relevant and helps prevent potential null pointer dereferences after using dynamic_cast. It adds important defensive programming without contradicting the PR changes.

Medium
Add missing semicolon

Append a semicolon at the end of the repeated field declaration to ensure valid
protobuf syntax.

gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto [240]

 message otherConstantMetadataColumnValues {
   string key = 1;
   google.protobuf.Any value = 2;
 }
-repeated otherConstantMetadataColumnValues other_const_metadata_columns = 21
+repeated otherConstantMetadataColumnValues other_const_metadata_columns = 21;
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a potential syntax issue by adding the missing semicolon at the end of the repeated field declaration. The proposed improved code accurately reflects the modification needed for valid protobuf syntax.

Medium
General
Eliminate duplicate include

Remove the duplicate include directive for the Parser/SubstraitParserUtils.h file.

cpp-ch/local-engine/Storages/SubstraitSource/FileReader.cpp [30-32]

 #include <Parser/SubstraitParserUtils.h>
 #include <Storages/SubstraitSource/Delta/DeltaReader.h>
-#include <Parser/SubstraitParserUtils.h>
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly removes a duplicate include, cleaning up the code. However, it is a minor improvement that does not impact functionality significantly.

Low

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 11, 2025

@CodiumAI-Agent /review

@CodiumAI-Agent
Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Null Check

In the rb_read function, the new dynamic_cast result (ReadBufferBuilderPtr) is used without verifying if the pointer is non-null. Please add a proper null check to prevent potential null-pointer dereferences.

file_info.set_length(data_size);
const Poco::URI file_uri(file_path);
ReadBufferBuilderPtr read_buffer_builder = ReadBufferBuilderFactory::instance().createBuilder(file_uri.getScheme(), context);
auto * in = dynamic_cast<DB::SeekableReadBuffer *>(read_buffer_builder->build(file_info).release());

in->seek(offset, SEEK_SET);

int size;
readBinaryBigEndian(size, *in);
JSON Parsing

The new convertRowIndexFilterIdEncoded function parses JSON without explicit error handling. Consider validating the parsed output and handling possible parse errors or unexpected formats.

override def convertRowIndexFilterIdEncoded(
    file: PartitionedFile,
    otherConstantMetadataColumnValues: JMap[String, Object]): JMap[String, Object] = {
  val newOtherConstantMetadataColumnValues: JMap[String, Object] =
    new JHashMap[String, Object]
  for ((k, v) <- otherConstantMetadataColumnValues.asScala) {
    if (k.equalsIgnoreCase(DeltaParquetFileFormat.FILE_ROW_INDEX_FILTER_ID_ENCODED)) {
      val decoded = JsonUtils.fromJson[DeletionVectorDescriptor](v.toString)
      val decodedPath = decoded.absolutePath(new Path(file.filePath.toString()).getParent)
      val newDeletionVectorDescriptor = decoded.copy(
        decoded.storageType,
        decodedPath.toUri.toASCIIString,
        decoded.offset,
        decoded.sizeInBytes,
        decoded.cardinality,
        decoded.maxRowIndex
      )
      newOtherConstantMetadataColumnValues.put(k, JsonUtils.toJson(newDeletionVectorDescriptor))
    } else {
      newOtherConstantMetadataColumnValues.put(k, v)
    }
  }
  newOtherConstantMetadataColumnValues
}

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 11, 2025

@CodiumAI-Agent /improve

@CodiumAI-Agent
Copy link

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check for buffer

Add a null pointer check after the dynamic_cast to ensure the read buffer is valid
before use.

cpp-ch/local-engine/Storages/SubstraitSource/Delta/Bitmap/DeltaDVRoaringBitmapArray.cpp [58-60]

 auto * in = dynamic_cast<DB::SeekableReadBuffer *>(read_buffer_builder->build(file_info).release());
+if (in == nullptr) {
+    throw DB::Exception(DB::ErrorCodes::BAD_ARGUMENTS, "Failed to create a valid SeekableReadBuffer.");
+}
 in->seek(offset, SEEK_SET);
Suggestion importance[1-10]: 8

__

Why: The suggestion is relevant and helps prevent potential null pointer dereferences after using dynamic_cast. It adds important defensive programming without contradicting the PR changes.

Medium
Add missing semicolon

Append a semicolon at the end of the repeated field declaration to ensure valid
protobuf syntax.

gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto [240]

 message otherConstantMetadataColumnValues {
   string key = 1;
   google.protobuf.Any value = 2;
 }
-repeated otherConstantMetadataColumnValues other_const_metadata_columns = 21
+repeated otherConstantMetadataColumnValues other_const_metadata_columns = 21;
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a potential syntax issue by adding the missing semicolon at the end of the repeated field declaration. The proposed improved code accurately reflects the modification needed for valid protobuf syntax.

Medium
General
Eliminate duplicate include

Remove the duplicate include directive for the Parser/SubstraitParserUtils.h file.

cpp-ch/local-engine/Storages/SubstraitSource/FileReader.cpp [30-32]

 #include <Parser/SubstraitParserUtils.h>
 #include <Storages/SubstraitSource/Delta/DeltaReader.h>
-#include <Parser/SubstraitParserUtils.h>
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly removes a duplicate include, cleaning up the code. However, it is a minor improvement that does not impact functionality significantly.

Low

@zzcclp zzcclp force-pushed the gluten-8872-part-2 branch from fe55669 to ee811df Compare March 11, 2025 04:57
Copy link

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@baibaichen baibaichen requested a review from Copilot March 11, 2025 09:41

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This pull request adds support for reading deletion vectors bitmap in the CH backend by extending the Substrait conversion logic and updating file metadata handling.

  • Introduces a new helper method in SubstraitUtil.java for converting Java objects to Protobuf Any messages.
  • Adds a new proto message and repeated field for handling additional metadata columns in algebra.proto.
  • Updates LocalFilesNode.java, IcebergLocalFilesNode.java, and LocalFilesBuilder.java to support and propagate a new parameter for extra metadata columns.

Reviewed Changes

File Description
gluten-substrait/src/main/java/org/apache/gluten/substrait/utils/SubstraitUtil.java New utility method for object-to-Protobuf conversion.
gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto Added proto message and repeated field for extra metadata columns.
gluten-substrait/src/main/java/org/apache/gluten/substrait/rel/LocalFilesNode.java Updated to handle additional metadata columns and incorporate conversion using SubstraitUtil.
gluten-iceberg/src-iceberg/main/java/org/apache/gluten/substrait/rel/IcebergLocalFilesNode.java Adjusted constructor call to include the new metadata parameter.
gluten-substrait/src/main/java/org/apache/gluten/substrait/rel/LocalFilesBuilder.java Updated helper method to propagate the new extra metadata parameter.

Copilot reviewed 25 out of 25 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

gluten-substrait/src/main/java/org/apache/gluten/substrait/rel/LocalFilesNode.java:202

  • [nitpick] The local variable name 'metadataColumns' in this block may be confused with the instance variable of a similar name. Consider renaming it (e.g., to 'otherMetadata') to improve clarity.
Map<String, Object> metadataColumns = otherMetadataColumns.get(i);
@zzcclp zzcclp force-pushed the gluten-8872-part-2 branch from bf04df8 to 6b06f2f Compare March 11, 2025 09:52
Copy link

Run Gluten Clickhouse CI on x86

@zzcclp zzcclp force-pushed the gluten-8872-part-2 branch from 6b06f2f to 8d8ad90 Compare March 12, 2025 03:59
Copy link

Run Gluten Clickhouse CI on x86

zzcclp added 2 commits March 12, 2025 13:08
…backend

Part-2:
Support reading deletion vectors bitmap when queryring
@zzcclp zzcclp force-pushed the gluten-8872-part-2 branch from 8d8ad90 to b1b3490 Compare March 12, 2025 06:16
Copy link

Run Gluten Clickhouse CI on x86

@zzcclp zzcclp merged commit 50f18dc into apache:main Mar 12, 2025
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH] Support Delta Deletion Vectors read for CH backend
3 participants