Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align buffers of array data imported through the FFI if they aren't aligned #7137

Closed
wants to merge 2 commits into from

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Feb 15, 2025

Closes #7136.

This will also fix this issue in arrow-adbc once the arrow-array crate version is bumped: apache/arrow-adbc#2526

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 15, 2025
@tustvold
Copy link
Contributor

tustvold commented Feb 15, 2025

I wonder if we could make this behaviour opt-in, whilst I accept systems may produce unaligned data, and people may want to accommodate this (although historically the consensus has been this is a bug), IMO I would not expect an FFI interface to be silently reallocating memory buffers without being explicitly asked to do so.

Returning an error instead of panicking would be a reasonable parallel improvement

@felipecrv
Copy link
Contributor Author

I wonder if we could make this behaviour opt-in

I considered that, but couldn't think of any scenario where a Rust program importing data from another system would not want to make sure the buffers are aligned. And note that buffers are checked already, that's how the panic is triggered after all.

I would not expect an FFI interface to be silently reallocating memory buffers without being explicitly asked to do so

It's important to consider here that reallocation won't happen if the producer is a good citizen and produces aligned buffers already. Often by reallocation on their side as well. That's what I think is being currently implemented in the C++ implementation of Flight. The buffers coming from gRPC can't possibly be aligned according to Arrow types so no matter who performs the copy for alignment, it will have to be done at least once.

Returning an error instead of panicking would be a reasonable parallel improvement

I can't see how that error could be handled at a higher level layer

@tustvold
Copy link
Contributor

tustvold commented Feb 15, 2025

I think it boils down to a subjective judgement over which of the following is better.

  • arrow-adbc produces unaligned buffers
  • error reported consuming data
  • bug filed on arrow-adbc
  • workaround of realigning enabled linking back to upstream bug

Or

  • arrow-adbc produces unaligned buffers
  • implementation silently performs copy
  • bug goes unnoticed

I personally think the former is better as it drives the ecosystem forwards, but appreciate some people are more in the camp of "I just want it to work".

I can't see how that error could be handled at a higher level layer

Neither, but some people have an aversion to panics

The buffers coming from gRPC can't possibly be aligned according to Arrow types so no matter who performs the copy for alignment, it will have to be done at least once.

Whilst this technically isn't true, I agree practically speaking it is unavoidable in this instance, and tbh given network IO is involved is likely irrelevant.

My concern would more be for systems where this copy is unnecessary, e.g. simply not using aligned_alloc.

@felipecrv
Copy link
Contributor Author

I personally think the former is better as it drives the ecosystem forwards, but appreciate some people are more in the camp of "I just want it to work".

Note that arrow-rs is effectively asking other Arrow implementations to provide 16-byte aligned buffers for some types [1][2]. That's a tough call -- malloc gives 8 by default (64 bits). Possible but hard in C, almost impossible on language runtimes that give less memory allocation controls.

[1] https://github.com/apache/arrow-rs/blob/main/arrow-data/src/data.rs#L1591
[2] apache/arrow-adbc#2526 (comment)

@tustvold
Copy link
Contributor

Note that arrow-rs is effectively asking other Arrow implementations to provide 16-byte aligned buffers for some types

Most arrow implementations actually provide 64 byte alignment

Possible but hard in C, almost impossible on language runtimes that give less memory allocation controls.

You can always obtain an aligned allocation, by over allocating by the alignment, and then slicing. This is in fact what many arrow implementations do, Go included

https://github.com/apache/arrow-go/blob/main/arrow/memory/go_allocator.go#L23

@alamb
Copy link
Contributor

alamb commented Mar 7, 2025

So it seems like the request for next steps on this PR is to add a opt-in flag. Something like

let stream_reader = ArrowArrayStreamReader::try_new(ffi_stream)
  .with_align_buffers()

Which would set a flag that would do the opt in automatic data alignment

@felipecrv
Copy link
Contributor Author

felipecrv commented Mar 8, 2025

So it seems like the request for next steps on this PR is to add a opt-in flag. Something like

If this is the conclusion after the discussion, I should probably close this PR. It will require a lot of work to pipe this flag all the way from ADBC to the code that instantiates the stream readers. More importantly, I don't see a scenario where it would make sense to not set this flag -- reallocation only happens if necessary and the check is cheap.

It's easier to maintain a set of locally applied patches than to argue about the need for robustness in software.

@felipecrv felipecrv closed this Mar 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Array data imported through the FFI might contain unaligned buffers
3 participants