Allow INT96 timestamps to be stores as FixedSizeBinary(12) #7250

mbutrovich · 2025-03-07T18:50:42Z

Which issue does this PR close?

Closes #7220.

Rationale for this change

Please see #7220 for the motivation.

What changes are included in this PR?

IntoBuffer for Vec<Int96> now creates a byte buffer which consume_batch in turn uses to create a FSB(12) Array. If the requested type is Timestamp(Timeunit::Nanoseconds,_) only then does the conversion occur.

Are there any user-facing changes?

ArrowType::FixedSizeBinary(12) is now a valid hint/provided schema data type for Timestamp(Timeunit::Nanoseconds,_).

…tion of INT96 timestamps in consume_batch(). Allow the hint that TS(Nano) can be requested as FSB(12).

mbutrovich · 2025-03-07T18:57:29Z

This is my first contribution to this repo and I'm still getting comfortable with Rust, and I have a few thoughts after working on this:

arrow-rs/parquet/src/data_type.rs

Line 548 in 5b04ca7

// TODO - Can Int96 and bool be implemented in these terms?

AsSliceBytes would be nice for the IntoBuffer change I made. Where could I read more on if type alignment concerns could be addressed to support AsSliceBytes for Int96?
What does this mean for pushdown filters? I need to look at that code more to fully understand how this would affect filters. It's my understanding that if page index is enabled and filters are pushed down, we might skip rows based on undesired timestamp conversion logic.

tustvold · 2025-03-08T09:21:27Z

Thank you for this, I think as written this PR would non-trivially regress performance for Int96 as it is effectively reading the data twice. I also am not sure this is the right approach to take here, left a comment on the ticket #7220 (comment)

alamb

Thanks @mbutrovich -- I think allowing a reader to get an INT96 column as FixedSizeBinary is a reasonable approach (I also commented #7220 (comment))

I agree with @tustvold 's performance concerns and left some suggestions on how to address them

In general I also think we should have an "end to end" test -- namely show reading an actual parquet file that has a INT96 column that is read ad FixedSizeBinary

Maybe there is some file in https://github.com/apache/parquet-testing/blob/master/data/README.md we could use

If not, maybe we can construct our own:

arrow-rs/parquet/src/arrow/arrow_reader/mod.rs

Line 3259 in 34b2184

fn test_read_binary_as_utf8() {

alamb · 2025-03-08T10:48:51Z

parquet/src/arrow/array_reader/primitive_array.rs

@@ -194,7 +190,7 @@ where
            },
            PhysicalType::FLOAT => Arc::new(Float32Array::from(array_data)),
            PhysicalType::DOUBLE => Arc::new(Float64Array::from(array_data)),
-            PhysicalType::INT96 => Arc::new(TimestampNanosecondArray::from(array_data)),
+            PhysicalType::INT96 => Arc::new(FixedSizeBinaryArray::from(array_data)),


as @tustvold mentions, for performance reasons I think this would have to match on the array_data.data_type to directly create FixedSizeBinaryArray or TimestampNanosecondArray (as is done for the PhysicalType::INT32 array above for example)

alamb · 2025-03-08T10:49:32Z

parquet/src/arrow/array_reader/primitive_array.rs

-                ArrowType::Timestamp(TimeUnit::Nanosecond, _) => target_type.clone(),
-                _ => unreachable!("INT96 must be timestamp nanosecond"),
-            },
+            PhysicalType::INT96 => ArrowType::FixedSizeBinary(12),


I suggest matching here on the target type to allow either fixed size binary or timestamp

Default to ArrowType::FixedSizeBinary(12) for intermediate representa…

b62ab3b

…tion of INT96 timestamps in consume_batch(). Allow the hint that TS(Nano) can be requested as FSB(12).

github-actions bot added the parquet Changes to the parquet crate label Mar 7, 2025

alamb reviewed Mar 8, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow INT96 timestamps to be stores as FixedSizeBinary(12) #7250

Allow INT96 timestamps to be stores as FixedSizeBinary(12) #7250

mbutrovich commented Mar 7, 2025 •

edited

Loading

mbutrovich commented Mar 7, 2025 •

edited

Loading

tustvold commented Mar 8, 2025

alamb left a comment

alamb Mar 8, 2025

alamb Mar 8, 2025

Allow INT96 timestamps to be stores as FixedSizeBinary(12) #7250

Are you sure you want to change the base?

Allow INT96 timestamps to be stores as FixedSizeBinary(12) #7250

Conversation

mbutrovich commented Mar 7, 2025 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

mbutrovich commented Mar 7, 2025 • edited Loading

tustvold commented Mar 8, 2025

alamb left a comment

Choose a reason for hiding this comment

alamb Mar 8, 2025

Choose a reason for hiding this comment

alamb Mar 8, 2025

Choose a reason for hiding this comment

mbutrovich commented Mar 7, 2025 •

edited

Loading

mbutrovich commented Mar 7, 2025 •

edited

Loading