Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsafe improvements: arrow/{array_reader,buffer}. #6025

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions parquet/src/arrow/array_reader/fixed_len_byte_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,12 @@ impl ArrayReader for FixedLenByteArrayReader {
.add_buffer(Buffer::from_vec(record_data.buffer))
.null_bit_buffer(self.record_reader.consume_bitmap_buffer());

// SAFETY: Assuming that a RecordReader produces a valid FixedLenByteArrayBuffer (as
// otherwise that code itself would be unsound), this call is safe if self.byte_length
// matches the byte length of the FixedLenByteArrayBuffer produced by the RecordReader
// matches the byte length passed as a type parameter to ArrayDataBuilder.
// FIXME: how do we know that the byte lengths match? What if record_data.byte_length
// is None?
let binary = FixedSizeBinaryArray::from(unsafe { array_data.build_unchecked() });

// TODO: An improvement might be to do this conversion on read
Expand Down
1 change: 1 addition & 0 deletions parquet/src/arrow/array_reader/fixed_size_list_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -202,6 +202,7 @@ impl ArrayReader for FixedSizeListArrayReader {
list_builder = list_builder.null_bit_buffer(Some(builder.into()));
}

// SAFETY: FIXME: document why this call is safe.
let list_data = unsafe { list_builder.build_unchecked() };

let result_array = FixedSizeListArray::from(list_data);
Expand Down
1 change: 1 addition & 0 deletions parquet/src/arrow/array_reader/list_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -225,6 +225,7 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {
data_builder = data_builder.null_bit_buffer(Some(builder.into()))
}

// SAFETY: FIXME: document why this call is safe.
let list_data = unsafe { data_builder.build_unchecked() };

let result_array = GenericListArray::<OffsetSize>::from(list_data);
Expand Down
2 changes: 1 addition & 1 deletion parquet/src/arrow/array_reader/map_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ impl ArrayReader for MapArrayReader {
let data = array.to_data();
let builder = data.into_builder().data_type(self.data_type.clone());

// SAFETY - we can assume that ListArrayReader produces valid ListArray
// SAFETY: we can assume that ListArrayReader produces valid ListArray
// of the expected type, and as such its output can be reinterpreted as
// a MapArray without validation
Ok(Arc::new(MapArray::from(unsafe {
Expand Down
5 changes: 5 additions & 0 deletions parquet/src/arrow/array_reader/primitive_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -179,6 +179,11 @@ where
.add_buffer(record_data)
.null_bit_buffer(self.record_reader.consume_bitmap_buffer());

// SAFETY: Assuming that the RecordReader's null buffer is sufficiently sized for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the safety argument goes something like the RecordReader writes only valid values for type T into record_data and that the record_reader's null buffer was correctly created.

// corresponding Vec<T> buffer (or None), this is safe, as we have a single buffer
// obtained from a vector of `T`s, which must therefore have a valid memory representation
// for T. The type mapping above ensures that a valid T results in a valid Arrow array of
// type `arrow_data_type`.
let array_data = unsafe { array_data.build_unchecked() };
let array: ArrayRef = match T::get_physical_type() {
PhysicalType::BOOLEAN => Arc::new(BooleanArray::from(array_data)),
Expand Down
1 change: 1 addition & 0 deletions parquet/src/arrow/array_reader/struct_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,7 @@ impl ArrayReader for StructArrayReader {
array_data_builder.null_bit_buffer(Some(bitmap_builder.into()));
}

// SAFETY: FIXME: document why this call is safe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument basically goes something like the code in this module correctly constructed the struct array values and validated the input during the construction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's what I assumed, but the method was a bit too long and I am not familiar enough with the library to really figure out the details here...

let array_data = unsafe { array_data_builder.build_unchecked() };
Ok(Arc::new(StructArray::from(array_data)))
}
Expand Down
4 changes: 4 additions & 0 deletions parquet/src/arrow/buffer/dictionary_buffer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,10 @@ impl<K: ArrowNativeType + Ord, V: OffsetSizeTrait> DictionaryBuffer<K, V> {

let data = match cfg!(debug_assertions) {
true => builder.build().unwrap(),
// SAFETY: FIXME: this is unsound. data_type is passed by the caller without
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will review this more carefully -- I think this code is still sound as the only callers of data_type pass in the correct type.

However, maybe we could double check the data_type with an assert

Copy link
Contributor Author

@veluca93 veluca93 Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking data_type SGTM.

Technically speaking, relying on callers doing "the right thing" is unsound (which is not the same as "exhibits undefined behaviour"): unsafe code is sound if no safe code could cause UB, regardless of current usage.

Of course, if this is not exposed to the user, it is less important to fix, but would still be nice IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW this PR is still on my list to help clean, but I am struggling to find time.

Basically i plan to do some more code investigation to improve the comments here.

If there are any specific API changes / proposal to improve safety I think we should file a separate ticket to discuss them. I will do so if I find one during analysis

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, take your time! I find with these things speediness should not trump thoroughness :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @veluca93 -- I wonder if you have had any more time to do some more code investigation? If you think this is actually unsound I will try and find more time to review this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a lot of time - I've been busy and I am at a programming competition this week - but I can confirm that the code as written is unsound (even if the unsoundness cannot trigger undefined behaviour): validating the data type, or marking the function as unsafe and writing down in the caller why the data type is correct would fix the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb - any updates on when you might be able to dig a bit more into the safety invariants of this code? :-) unfortunately I don't think I am familiar enough with the codebase to effectively look more into it myself...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am likewise not familar with this part of the code -- and I haven't been able to find time to spend studying it in detail.

One of the nice things about open source code is that it is all available for everyone to look -- I welcome additional thoughts / reviews / assistance on this item

FRom my perspective, this ticket is current a reasearch item as it is a seemingly theoretical problem. Thus it will likely will remain a relatively low priority for me personally unless someone can cook up a rust example showing actual unsafe/unsound behavior resulting

thank you again for the help. It would be great if you could find time to analyze the code or make an example showing a problem

// any validation, which might result in creating invalid arrays, and this is a
// safe function which cannot have preconditions. Similar considerations apply
// to null_buffer.
false => unsafe { builder.build_unchecked() },
};

Expand Down
4 changes: 4 additions & 0 deletions parquet/src/arrow/buffer/offset_buffer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,10 @@ impl<I: OffsetSizeTrait> OffsetBuffer<I> {

let data = match cfg!(debug_assertions) {
true => array_data_builder.build().unwrap(),
// SAFETY: FIXME: this is unsound. data_type is passed by the caller without
// any validation, which might result in creating invalid arrays, and this is a
// safe function which cannot have preconditions. Similar considerations apply
// to null_buffer.
false => unsafe { array_data_builder.build_unchecked() },
};

Expand Down
Loading