Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow INT96 timestamps to be stores as FixedSizeBinary(12) #7250

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mbutrovich
Copy link

@mbutrovich mbutrovich commented Mar 7, 2025

Which issue does this PR close?

Closes #7220.

Rationale for this change

Please see #7220 for the motivation.

What changes are included in this PR?

  • IntoBuffer for Vec<Int96> now creates a byte buffer which consume_batch in turn uses to create a FSB(12) Array. If the requested type is Timestamp(Timeunit::Nanoseconds,_) only then does the conversion occur.

Are there any user-facing changes?

ArrowType::FixedSizeBinary(12) is now a valid hint/provided schema data type for Timestamp(Timeunit::Nanoseconds,_).

…tion of INT96 timestamps in consume_batch(). Allow the hint that TS(Nano) can be requested as FSB(12).
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 7, 2025
@mbutrovich
Copy link
Author

mbutrovich commented Mar 7, 2025

This is my first contribution to this repo and I'm still getting comfortable with Rust, and I have a few thoughts after working on this:

  • // TODO - Can Int96 and bool be implemented in these terms?
    AsSliceBytes would be nice for the IntoBuffer change I made. Where could I read more on if type alignment concerns could be addressed to support AsSliceBytes for Int96?
  • What does this mean for pushdown filters? I need to look at that code more to fully understand how this would affect filters. It's my understanding that if page index is enabled and filters are pushed down, we might skip rows based on undesired timestamp conversion logic.

@tustvold
Copy link
Contributor

tustvold commented Mar 8, 2025

Thank you for this, I think as written this PR would non-trivially regress performance for Int96 as it is effectively reading the data twice. I also am not sure this is the right approach to take here, left a comment on the ticket #7220 (comment)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich -- I think allowing a reader to get an INT96 column as FixedSizeBinary is a reasonable approach (I also commented #7220 (comment))

I agree with @tustvold 's performance concerns and left some suggestions on how to address them

In general I also think we should have an "end to end" test -- namely show reading an actual parquet file that has a INT96 column that is read ad FixedSizeBinary

Maybe there is some file in https://github.com/apache/parquet-testing/blob/master/data/README.md we could use

If not, maybe we can construct our own:

fn test_read_binary_as_utf8() {

@@ -194,7 +190,7 @@ where
},
PhysicalType::FLOAT => Arc::new(Float32Array::from(array_data)),
PhysicalType::DOUBLE => Arc::new(Float64Array::from(array_data)),
PhysicalType::INT96 => Arc::new(TimestampNanosecondArray::from(array_data)),
PhysicalType::INT96 => Arc::new(FixedSizeBinaryArray::from(array_data)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as @tustvold mentions, for performance reasons I think this would have to match on the array_data.data_type to directly create FixedSizeBinaryArray or TimestampNanosecondArray (as is done for the PhysicalType::INT32 array above for example)

ArrowType::Timestamp(TimeUnit::Nanosecond, _) => target_type.clone(),
_ => unreachable!("INT96 must be timestamp nanosecond"),
},
PhysicalType::INT96 => ArrowType::FixedSizeBinary(12),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest matching here on the target type to allow either fixed size binary or timestamp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet reader: option to pass INT96 as bytes instead of as Timestamp
3 participants