Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support scan filter for ORC decimal reader #11067

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

rui-mo
Copy link
Collaborator

@rui-mo rui-mo commented Sep 23, 2024

'leafCallToSubfieldFilter' converts decimal filter as Subfield filter, but
'SelectiveDecimalColumnReader::read' rejects scan spec filter. This PR supports
scan filter for ORC decimal reader.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 23, 2024
Copy link

netlify bot commented Sep 23, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit f61dc1f
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67cecd921714720008850ab2

@Yuhta Yuhta self-requested a review September 23, 2024 14:41
Copy link
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to add some tests to E2EFilterTest for decimal types

// Fill decimals before applying filter.
fillDecimals();

const auto rawNulls = nullsInReadRange_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract this logic to a static function or trait class so that it can be later reused in parquet as well

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet decimal uses SelectiveIntegerColumnReader::processFilter so it might don't need to reuse this logic. How do you think? Thanks.

if constexpr (std::is_same_v<DataT, int64_t>) {
processFilter(filter, rows, rawNulls);
} else {
VELOX_UNSUPPORTED("Unsupported filter: {}.", (int)filterKind);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be the case for schema evolution. Just throw VELOX_FAIL here or VELOX_NYI for these and add the requested type and file type information to the error message.

@rui-mo
Copy link
Collaborator Author

rui-mo commented Sep 25, 2024

You also need to add some tests to E2EFilterTest for decimal types

@Yuhta I find the E2EFilterTest relies on DWRF writer, but for now bigint-decimal will be written as int64_t, and hugeint-decimal is not supported. Do we need to support them first?

case TypeKind::BIGINT:
return std::make_unique<IntegerColumnWriter<int64_t>>(
context, type, sequence, onRecordPosition);

DWIO_RAISE("not supported yet ", mapTypeKindToName(type.type()->kind()));

@Yuhta
Copy link
Contributor

Yuhta commented Sep 25, 2024

@rui-mo Yes it would be ideal to support writing decimals first. Otherwise the tests are very limited and we have no confidence the new code would work correctly.

@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 2937bac to e1b7ef8 Compare October 10, 2024 06:39
@rui-mo rui-mo force-pushed the wip_decimal_filter branch from e1b7ef8 to 3117603 Compare November 20, 2024 03:12
@rui-mo rui-mo changed the title Support scan filter for ORC decimal reader feat: Support scan filter for ORC decimal reader Nov 20, 2024
@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 3117603 to edfb582 Compare December 5, 2024 03:31
@rui-mo rui-mo force-pushed the wip_decimal_filter branch from edfb582 to 86a8f95 Compare January 24, 2025 12:35
@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 86a8f95 to 1a86c34 Compare February 11, 2025 15:57
@rui-mo rui-mo marked this pull request as ready for review February 11, 2025 15:58
@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 1a86c34 to 6750596 Compare February 11, 2025 16:43
@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 6750596 to 6e1ad03 Compare February 11, 2025 16:55
@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 6e1ad03 to ed04a8e Compare February 12, 2025 14:10
@rui-mo
Copy link
Collaborator Author

rui-mo commented Feb 12, 2025

@rui-mo Yes it would be ideal to support writing decimals first. Otherwise the tests are very limited and we have no confidence the new code would work correctly.

The E2E filter tests depend on decimal write support in #11431.

facebook-github-bot pushed a commit that referenced this pull request Feb 20, 2025
Summary:
'SelectiveDecimalColumnReader' is typically used for reading decimal column in
ORC file. This PR supports corresponding writer for decimal column in ORC file.
Adds 'format' in 'WriterOptions' to distinguish between DWRF and ORC when
writing.

#11067 (comment)

Pull Request resolved: #11431

Reviewed By: Yuhta

Differential Revision: D69859556

Pulled By: xiaoxmeng

fbshipit-source-id: a09b06a0d47fb6095a9ebe26b5b2ade91b683cbb
@rui-mo rui-mo force-pushed the wip_decimal_filter branch 2 times, most recently from a38ca17 to e063f7f Compare February 20, 2025 20:47
@majetideepak
Copy link
Collaborator

@rui-mo do we still need to check in the orc files if we now have the writer?

@rui-mo rui-mo force-pushed the wip_decimal_filter branch from e063f7f to fb700fe Compare February 21, 2025 11:16
@rui-mo rui-mo force-pushed the wip_decimal_filter branch 2 times, most recently from cee517e to 5552315 Compare February 21, 2025 11:22
@rui-mo
Copy link
Collaborator Author

rui-mo commented Feb 21, 2025

Hi @majetideepak, I removed the orc file tests and added E2E filter tests. I find current footer writing is incompatible with orc file format, seeing details in #12418. To enable the orc decimal E2E tests, some status adjustment only for testing are added in ReaderBase::convertType and ProtoUtils::writeType as a workaround for the footer difference. Do you have any suggestion? Thanks! cc: @Yuhta

@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 5552315 to 10c280a Compare February 21, 2025 14:12
@majetideepak
Copy link
Collaborator

@rui-mo I feel if we use the proto::orc::Footer for writes, we can then do the right thing and generate valid ORC files?

@rui-mo
Copy link
Collaborator Author

rui-mo commented Feb 24, 2025

@rui-mo I feel if we use the proto::orc::Footer for writes, we can then do the right thing and generate valid ORC files?

@majetideepak I suppose so. My workaround in the test is all about the incompatible footer.

@majetideepak majetideepak requested a review from Yuhta February 24, 2025 14:17
@majetideepak
Copy link
Collaborator

@Yuhta can you please help review this change? Thanks.

@rui-mo
Copy link
Collaborator Author

rui-mo commented Mar 6, 2025

@Yuhta @majetideepak I integrated this PR with the ORC writer support in #12487 and the E2E filter tests work, seeing rui-mo@f04fbfc. Would you rather we merge this PR first, or should I wait for the ORC writer to be landed? This PR contains a few workarounds on the footer for testing purposes only, as the ORC writer is not supported. Thanks.

@rui-mo rui-mo force-pushed the wip_decimal_filter branch from 4baad3d to f61dc1f Compare March 10, 2025 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants