-
Notifications
You must be signed in to change notification settings - Fork 888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow INT96 timestamps to be stores as FixedSizeBinary(12) #7250
base: main
Are you sure you want to change the base?
Conversation
…tion of INT96 timestamps in consume_batch(). Allow the hint that TS(Nano) can be requested as FSB(12).
This is my first contribution to this repo and I'm still getting comfortable with Rust, and I have a few thoughts after working on this:
|
Thank you for this, I think as written this PR would non-trivially regress performance for Int96 as it is effectively reading the data twice. I also am not sure this is the right approach to take here, left a comment on the ticket #7220 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mbutrovich -- I think allowing a reader to get an INT96 column as FixedSizeBinary is a reasonable approach (I also commented #7220 (comment))
I agree with @tustvold 's performance concerns and left some suggestions on how to address them
In general I also think we should have an "end to end" test -- namely show reading an actual parquet file that has a INT96 column that is read ad FixedSizeBinary
Maybe there is some file in https://github.com/apache/parquet-testing/blob/master/data/README.md we could use
If not, maybe we can construct our own:
arrow-rs/parquet/src/arrow/arrow_reader/mod.rs
Line 3259 in 34b2184
fn test_read_binary_as_utf8() { |
@@ -194,7 +190,7 @@ where | |||
}, | |||
PhysicalType::FLOAT => Arc::new(Float32Array::from(array_data)), | |||
PhysicalType::DOUBLE => Arc::new(Float64Array::from(array_data)), | |||
PhysicalType::INT96 => Arc::new(TimestampNanosecondArray::from(array_data)), | |||
PhysicalType::INT96 => Arc::new(FixedSizeBinaryArray::from(array_data)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as @tustvold mentions, for performance reasons I think this would have to match
on the array_data.data_type
to directly create FixedSizeBinaryArray
or TimestampNanosecondArray
(as is done for the PhysicalType::INT32 array above for example)
ArrowType::Timestamp(TimeUnit::Nanosecond, _) => target_type.clone(), | ||
_ => unreachable!("INT96 must be timestamp nanosecond"), | ||
}, | ||
PhysicalType::INT96 => ArrowType::FixedSizeBinary(12), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest matching here on the target type to allow either fixed size binary or timestamp
Which issue does this PR close?
Closes #7220.
Rationale for this change
Please see #7220 for the motivation.
What changes are included in this PR?
IntoBuffer
forVec<Int96>
now creates a byte buffer whichconsume_batch
in turn uses to create aFSB(12)
Array
. If the requested type isTimestamp(Timeunit::Nanoseconds,_)
only then does the conversion occur.Are there any user-facing changes?
ArrowType::FixedSizeBinary(12)
is now a valid hint/provided schema data type forTimestamp(Timeunit::Nanoseconds,_)
.