Add owning types to hold Arrow data #18084

vyasr · 2025-02-25T01:42:11Z

Description

This PR introduces two new types to cudf, arrow_column and arrow_table. These types are analogous to cudf::column and cudf::table, but instead of using cudf's unique ownership semantics these follow a shared ownership model that is more amenable for use with Arrow interop. These types are intended to be used in place of direct calls to Arrow interop functions like from_arrow_device, which place the onus on the caller to track which APIs handle Arrow C Data Interface memory management semantics for you and which ones do not (see discussion in this thread for lots of examples). With the new types, the semantics are fairly straightforward and map to what one would expect when using the C Data Interface: the cudf objects either copy (in the case of host data) or move (in the case of device data) the input array structures and leave them in a released state afterwards.

To keep the scope of this PR limited, I have implemented the core logic by simply calling the existing interop functions in suitable ways and only adding new logic as needed to handle proper ownership management. If we are happy with the new model, over time we can move to deprecate those code paths and move more of the logic directly into these classes.

Closes #16104

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

…sting private data

vyasr · 2025-02-25T01:42:26Z

CC @davidwendt @zeroshade

zeroshade

Did a first pass and have a bunch of questions 😄

cpp/include/cudf/interop.hpp

zeroshade · 2025-02-26T19:31:57Z

cpp/src/interop/arrow_data_structures.cpp

+  ArrowDeviceArray owner{};  //< ArrowDeviceArray that owns the data
+  ArrowSchema schema{};      //< ArrowSchema that describes the data


any reason to not just use nanoarrow::UniqueSchema and friends?

I originally wasn't sure if I would wind up with such a clear separation between the user-facing types and the underlying container, so I was worried about exposing nanoarrow types in the public interface. Now that everything's using PImpl for the container that is moot, so I can probably switch over to nanoarrow types here.

zeroshade · 2025-02-26T19:34:47Z

cpp/src/interop/arrow_data_structures.cpp

+void copy_array(ArrowArray* output,
+                ArrowArray const* input,
+                std::shared_ptr<arrow_array_container> container)
+{
+  auto private_data  = new ArrowArrayPrivateData{container};
+  output->length     = input->length;
+  output->null_count = input->null_count;
+  output->offset     = input->offset;
+  output->n_buffers  = input->n_buffers;
+  output->n_children = input->n_children;
+  output->buffers    = input->buffers;
+
+  if (input->n_children > 0) {
+    private_data->children_raw.resize(input->n_children);
+    for (auto i = 0; i < input->n_children; ++i) {
+      private_data->children.push_back(std::make_unique<ArrowArray>());
+      private_data->children_raw[i] = private_data->children.back().get();
+      copy_array(private_data->children_raw[i], input->children[i], container);
+    }
+  }
+  output->children     = private_data->children_raw.data();
+  output->dictionary   = input->dictionary;
+  output->release      = ArrayReleaseCallback;
+  output->private_data = private_data;
+}


doesn't this mean that if either the input or output call their release callback, both arrays will break as it will clean up all the buffers that both are pointing at since it's only a shallow copy?

All the arrays wind up with private_data of type ArrowArrayPrivateData, which in turn contains a std::shared_ptr<arrow_array_container> parent. The memory is ultimately always owned by the arrow_array_container, which is only deleted when the shared_ptr's ref count goes to zero. The release callback just deletes the copied array's ArrowArrayPrivateData, so it just decrements that ref count. The buffers won't be released until nothing is pointing to it.

With multiple layers of exporting mixing cudf with other libraries presumably you could wind up with a private_data that is another type (e.g. defined by the Arrow C++ library) that internally contains a pointer of some sort to an instance of ArrowArrayPrivateData or something, but that's fine as long as the levels of indirection preserve the shared pointer using normal semantics.

I guess the reason why I'm wary here is that all of this is explicitly predicated on the idea that the input to this function is already using the arrow_array_container under the hood. If it isn't, then you are open to potential problems with all of the layers. If we're guaranteed that this will only ever be called to copy a source that is already using ArrowArrayPrivateData holding an std::shared_ptr<arrow_array_container>, then i think this is fine.

I guess the reason why I'm wary here is that all of this is explicitly predicated on the idea that the input to this function is already using the arrow_array_container under the hood

This isn't quite true, but it's on the right track. There's some subtlety here because the container's owner member is itself an ArrowDeviceArray. The assumption is that the input argument to this function is always the owner member of a the container argument to this function OR one of the children of the owner. That second clause is part of what makes this confusing. Ideally this function would simply take a container and an output and always copy from the container to the output, but that doesn't work for the recursive copying of children. If you have a suggestion for how to clarify this relationship, please let me know. For now I'll update the docstring to make this relationship explicit there.

So the key is that the owner is always what owns the data, but its release callback is managed by the container. The container is in turn only ever owned via shared_ptr, so we are guaranteed a single point of deletion. Let's walk through some cases to be sure:

Case 1: A device array is created by some third-party library, then used to construct a cudf::arrow_table
The user will call the ArrowDeviceArray constructor of arrow_table, which will internally construct a std::shared_ptr<arrow_array_container>. Since it is device data, we move the input array into the container's owner members (and deep copy the schema). Now, the user calls arrow_table::to_arrow, which calls copy_array. The output array's private_data is an ArrowArrayPrivateData that contains a std::shared_ptr<arrow_array_container>. The release callback simply deletes the ArrowArrayPrivateData. The lifetime of the underlying data is therefore the longer of the arrow_table or the new ArrowDeviceArray that we have exported to.

Case 2: A cudf table is used to construct a cudf::arrow_table
This case is almost the same as case 1 except that we first call to_arrow_device(cudf::table&&) to convert the cudf table to an ArrowDeviceArray. After the conversion all ownership is handed to the ArrowDeviceArray, so the same subsequent ownership semantics apply.

Case 3: A host array is created by some third-party library, then used to construct a cudf::arrow_table
In this case we call from_arrow_host, which copies the data from host to device, then we call the arrow_table(cudf::table&&) constructor from case 2 and copy over the shared_ptr to the container from the resulting object. The remaining logic therefore devolves to that of case 2.

zeroshade · 2025-02-26T19:35:37Z

cpp/src/interop/arrow_data_structures.cpp

+  auto private_data = reinterpret_cast<ArrowArrayPrivateData*>(array->private_data);
+  for (auto& child : private_data->children) {
+    child->release(child.get());
+  }
+  delete private_data;
+  array->release = nullptr;


something should also be checking if it needs to release the dictionary too

If the object was created from preexisting (host or device) arrow data, then presumably the dictionary must be owned by the private data of the original object and should be released by the original array's release pointer, right? My assumption at the moment is that whether the original source is another library or a libcudf type, ultimately calling ArrowArrayRelease in the arrow_array_container's destructor is sufficient to handle memory cleanup.

You're not preserving the original release callback though, since it's a shallow copy. Which means that if the input array (including its dictionary) is not already being managed by the arrow_array_container that was passed in, then there's the potential for memory leaks.

actually, I'm realizing that there's a problem here: This is only safe if you know that all of the children are also using the arrow_array_container internally which is not guaranteed. You probably don't want to call the release callback for the children until the arrow_array_container's destructor does the cleanup

hmm, actually scratch that. It looks like it's handled safely due to copy_array being recursive. though i guess you could probably reduce the number of copies and allocations with a bit of extra complexity but this is likely fine for now. and certainly won't be a bottleneck

You're not preserving the original release callback though, since it's a shallow copy.

The original release callback should be preserved upon construction of the arrow_array_container (note that we are using ArrowArrayMove rather than from_arrow_device since the latter creates a cudf view type that does not have any ownership management). So regarding

Which means that if the input array (including its dictionary) is not already being managed by the arrow_array_container that was passed in, then there's the potential for memory leaks.

The arrow_array_container's owner should therefore always preserve the input array's release callback (for device data), regardless of whether the input array was already managed by arrow_array_container.

It looks like it's handled safely due to copy_array being recursive.

Yes, I think I got the recursion handling this correctly. Since ownership is ultimately always managed by the container, the top-level array and the children don't actually need any special handling since they can all simply incref the container pointer and keep it alive until everything is released.

though i guess you could probably reduce the number of copies and allocations

Are there other attributes that you think I could be avoiding copying? I made the decisions I did at least in part because this seemed like the most faithful representation of the spec.

This is definitely the most faithful representation of the spec. I was mostly referring to the idea that, in theory, you only really need to have the top-level array hold the arrow_array_container and then everything else can just be dependent on the parent doing the clean-up. but that gets a lot more complex to manage correctly and safely since people can easily grab individual children and do stuff with them. So, unless you start seeing bottlenecks in this copy and in the management of the private_data / shared_pointers, I think this is fine the way it is for now.

Yeah I was thinking of doing something like that initially to avoid having everyone hold the container, but then I wasn't sure we could safely handle all the cases where someone moves a child column since then you have to release the parent, so you'd have to jump through more hoops to make the children keep the parent alive in order to keep the data alive, so ultimately you're still effectively keeping the private data alive. The current approach seemed simpler, more robust, and ultimately no more expensive in terms of what data is preserved.

cpp/src/interop/arrow_data_structures.cpp

vyasr · 2025-02-27T20:23:07Z

Did a first pass and have a bunch of questions 😄

Questions are great! I think I answered all of them sufficiently, but please let me know if there are any holes in my reasoning. This PR took me a while to put together, there's a lot of subtlety to the memory semantics that I had to work through and I may have missed things.

zeroshade · 2025-02-27T20:31:12Z

cpp/src/interop/arrow_data_structures.cpp

+      if (input->array.release != nullptr) { ArrowArrayRelease(&input->array); }
+      break;
+    }
+    default: container = std::make_shared<arrow_array_container>(schema, input, stream, mr);


you should do an ArrowArrayMove from input into what you pass to the arrow_array_container. Same for the schema, otherwise the original caller can bypass the arrow_array_container and release stuff out from under you.

Good point. Yes, we can create a temporary here and move into that before forwarding into the container.

The reason I didn't originally do this (and the reason for some of the other minor discrepancies between the two different constructors for the arrow_array_container) is because all the from_arrow* functions that you previously added accept an ArrowSchema const*. In the construction of an arrow_array_container from a cudf::table or cudf::column I just move the schema into the container, but when constructing from an externally provided ArrowDeviceArray I currently perform an ArrowSchemaDeepCopy because I matched the from_arrow_* signatures and had a const schema. I can change that and then moving things out here will work just fine.

Now that I've remove the const bits I think simply forwarding along the arguments is fine since they're all pointers and I'm moving all of them in the arrow_array_container constructors, but let me know if I'm missing anything there.

cpp/src/interop/arrow_data_structures.cpp

…column

vyasr added 30 commits February 25, 2025 01:28

First attempt

015ddbd

Get basic conversion working starting with a cudf column

7661570

Enable construction from an ArrowDeviceArray

f01a07b

CMake testing changes

851d4dc

Use nanoarray for copying schema

a09e87d

Fix reference bug

e9865bb

More CMake testing changes

a07b2ae

Enable getting views and try asserting equivalence

bd65da6

Fully passing first round of tests.

a55e508

Add explicit test of lifetime management

943eb2b

Basic arrow table

461c6c8

Add arrow conversions

362529d

Support construction from device array

fda4ca4

Some cleanup

7c07058

Add documentation

11ec3d9

Fix some bugs

2c7e78a

More cleanup

f75b17b

More CMake testing

951914d

Get tests passing with complex nanoarrow host tables

31f134e

More CMake testing

a76c3b9

Support single columns from arrow host data

6a59f41

Also test nanoarrow device data

9600c3c

Implement conversion to host array

68fb9d8

Support direct ingestion of ArrowArray data

36bbd36

Support construction from streams

0bc9389

Make ownership semantics consistent across types

2b2e77f

Make sure stream and mr are forwarded everywhere

89b34c3

Centralize as much logic as possible

eb93262

Dictionary behavior is correct since we are just pointing back to exi…

374e7ac

…sting private data

Update comments

7014f30

vyasr added 2 commits February 25, 2025 01:28

Revert debugging changes

4410262

Rename file

dab19cf

vyasr added feature request New feature or request non-breaking Non-breaking change labels Feb 25, 2025

vyasr self-assigned this Feb 25, 2025

vyasr requested review from a team as code owners February 25, 2025 01:42

vyasr requested review from shrshi and vuule February 25, 2025 01:42

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Feb 25, 2025

Rename test

57b7cac

vyasr mentioned this pull request Feb 26, 2025

[BUG] make_empty_column produces a dictionary column in an invalid state #18071

Open

zeroshade reviewed Feb 26, 2025

View reviewed changes

zeroshade reviewed Feb 27, 2025

View reviewed changes

cpp/src/interop/arrow_data_structures.cpp Outdated Show resolved Hide resolved

zeroshade reviewed Feb 27, 2025

View reviewed changes

cpp/src/interop/arrow_data_structures.cpp Show resolved Hide resolved

vyasr added 5 commits February 27, 2025 21:01

Merge remote-tracking branch 'upstream/branch-25.04' into feat/arrow_…

bcfdc09

…column

Use protocol for dlpack instead of deprecated function

e3fce98

Merge remote-tracking branch 'upstream/branch-25.04' into feat/arrow_…

0a059ec

…column

Address most PR comments

fd5b3cc

Stop accepting ArrowSchemas as const and move from them

03e049b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add owning types to hold Arrow data #18084

Add owning types to hold Arrow data #18084

vyasr commented Feb 25, 2025

vyasr commented Feb 25, 2025

zeroshade left a comment

zeroshade Feb 26, 2025

vyasr Feb 27, 2025

zeroshade Feb 26, 2025

vyasr Feb 27, 2025 •

edited

Loading

zeroshade Feb 28, 2025

vyasr Mar 4, 2025

zeroshade Feb 26, 2025

vyasr Feb 27, 2025

zeroshade Feb 27, 2025

zeroshade Feb 27, 2025

zeroshade Feb 27, 2025

vyasr Feb 27, 2025

zeroshade Feb 28, 2025

vyasr Mar 4, 2025

vyasr commented Feb 27, 2025

zeroshade Feb 27, 2025

vyasr Feb 27, 2025

vyasr Mar 4, 2025

vyasr Mar 4, 2025

		ArrowDeviceArray owner{}; //< ArrowDeviceArray that owns the data
		ArrowSchema schema{}; //< ArrowSchema that describes the data

Add owning types to hold Arrow data #18084

Are you sure you want to change the base?

Add owning types to hold Arrow data #18084

Conversation

vyasr commented Feb 25, 2025

Description

Checklist

vyasr commented Feb 25, 2025

zeroshade left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vyasr Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vyasr commented Feb 27, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vyasr Feb 27, 2025 •

edited

Loading