-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add owning types to hold Arrow data #18084
base: branch-25.04
Are you sure you want to change the base?
Conversation
…sting private data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first pass and have a bunch of questions 😄
ArrowDeviceArray owner{}; //< ArrowDeviceArray that owns the data | ||
ArrowSchema schema{}; //< ArrowSchema that describes the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to not just use nanoarrow::UniqueSchema
and friends?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally wasn't sure if I would wind up with such a clear separation between the user-facing types and the underlying container, so I was worried about exposing nanoarrow types in the public interface. Now that everything's using PImpl for the container that is moot, so I can probably switch over to nanoarrow types here.
void copy_array(ArrowArray* output, | ||
ArrowArray const* input, | ||
std::shared_ptr<arrow_array_container> container) | ||
{ | ||
auto private_data = new ArrowArrayPrivateData{container}; | ||
output->length = input->length; | ||
output->null_count = input->null_count; | ||
output->offset = input->offset; | ||
output->n_buffers = input->n_buffers; | ||
output->n_children = input->n_children; | ||
output->buffers = input->buffers; | ||
|
||
if (input->n_children > 0) { | ||
private_data->children_raw.resize(input->n_children); | ||
for (auto i = 0; i < input->n_children; ++i) { | ||
private_data->children.push_back(std::make_unique<ArrowArray>()); | ||
private_data->children_raw[i] = private_data->children.back().get(); | ||
copy_array(private_data->children_raw[i], input->children[i], container); | ||
} | ||
} | ||
output->children = private_data->children_raw.data(); | ||
output->dictionary = input->dictionary; | ||
output->release = ArrayReleaseCallback; | ||
output->private_data = private_data; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this mean that if either the input or output call their release
callback, both arrays will break as it will clean up all the buffers that both are pointing at since it's only a shallow copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the arrays wind up with private_data of type ArrowArrayPrivateData, which in turn contains a std::shared_ptr<arrow_array_container> parent
. The memory is ultimately always owned by the arrow_array_container
, which is only deleted when the shared_ptr's ref count goes to zero. The release callback just deletes the copied array's ArrowArrayPrivateData, so it just decrements that ref count. The buffers won't be released until nothing is pointing to it.
With multiple layers of exporting mixing cudf with other libraries presumably you could wind up with a private_data that is another type (e.g. defined by the Arrow C++ library) that internally contains a pointer of some sort to an instance of ArrowArrayPrivateData or something, but that's fine as long as the levels of indirection preserve the shared pointer using normal semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the reason why I'm wary here is that all of this is explicitly predicated on the idea that the input to this function is already using the arrow_array_container
under the hood. If it isn't, then you are open to potential problems with all of the layers. If we're guaranteed that this will only ever be called to copy a source that is already using ArrowArrayPrivateData
holding an std::shared_ptr<arrow_array_container>
, then i think this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the reason why I'm wary here is that all of this is explicitly predicated on the idea that the input to this function is already using the arrow_array_container under the hood
This isn't quite true, but it's on the right track. There's some subtlety here because the container's owner member is itself an ArrowDeviceArray. The assumption is that the input
argument to this function is always the owner
member of a the container
argument to this function OR one of the children of the owner. That second clause is part of what makes this confusing. Ideally this function would simply take a container and an output and always copy from the container to the output, but that doesn't work for the recursive copying of children. If you have a suggestion for how to clarify this relationship, please let me know. For now I'll update the docstring to make this relationship explicit there.
So the key is that the owner is always what owns the data, but its release callback is managed by the container. The container is in turn only ever owned via shared_ptr
, so we are guaranteed a single point of deletion. Let's walk through some cases to be sure:
Case 1: A device array is created by some third-party library, then used to construct a cudf::arrow_table
The user will call the ArrowDeviceArray
constructor of arrow_table
, which will internally construct a std::shared_ptr<arrow_array_container>
. Since it is device data, we move the input array into the container's owner
members (and deep copy the schema). Now, the user calls arrow_table::to_arrow
, which calls copy_array
. The output array's private_data is an ArrowArrayPrivateData that contains a std::shared_ptr<arrow_array_container>
. The release callback simply deletes the ArrowArrayPrivateData. The lifetime of the underlying data is therefore the longer of the arrow_table
or the new ArrowDeviceArray that we have exported to.
Case 2: A cudf table is used to construct a cudf::arrow_table
This case is almost the same as case 1 except that we first call to_arrow_device(cudf::table&&)
to convert the cudf table to an ArrowDeviceArray. After the conversion all ownership is handed to the ArrowDeviceArray, so the same subsequent ownership semantics apply.
Case 3: A host array is created by some third-party library, then used to construct a cudf::arrow_table
In this case we call from_arrow_host
, which copies the data from host to device, then we call the arrow_table(cudf::table&&)
constructor from case 2 and copy over the shared_ptr to the container from the resulting object. The remaining logic therefore devolves to that of case 2.
auto private_data = reinterpret_cast<ArrowArrayPrivateData*>(array->private_data); | ||
for (auto& child : private_data->children) { | ||
child->release(child.get()); | ||
} | ||
delete private_data; | ||
array->release = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something should also be checking if it needs to release the dictionary too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the object was created from preexisting (host or device) arrow data, then presumably the dictionary must be owned by the private data of the original object and should be released by the original array's release pointer, right? My assumption at the moment is that whether the original source is another library or a libcudf type, ultimately calling ArrowArrayRelease in the arrow_array_container
's destructor is sufficient to handle memory cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not preserving the original release callback though, since it's a shallow copy. Which means that if the input array (including its dictionary) is not already being managed by the arrow_array_container
that was passed in, then there's the potential for memory leaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, I'm realizing that there's a problem here: This is only safe if you know that all of the children are also using the arrow_array_container
internally which is not guaranteed. You probably don't want to call the release callback for the children until the arrow_array_container
's destructor does the cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, actually scratch that. It looks like it's handled safely due to copy_array being recursive. though i guess you could probably reduce the number of copies and allocations with a bit of extra complexity but this is likely fine for now. and certainly won't be a bottleneck
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not preserving the original release callback though, since it's a shallow copy.
The original release callback should be preserved upon construction of the arrow_array_container
(note that we are using ArrowArrayMove
rather than from_arrow_device
since the latter creates a cudf view type that does not have any ownership management). So regarding
Which means that if the input array (including its dictionary) is not already being managed by the arrow_array_container that was passed in, then there's the potential for memory leaks.
The arrow_array_container
's owner
should therefore always preserve the input array's release callback (for device data), regardless of whether the input array was already managed by arrow_array_container
.
It looks like it's handled safely due to copy_array being recursive.
Yes, I think I got the recursion handling this correctly. Since ownership is ultimately always managed by the container, the top-level array and the children don't actually need any special handling since they can all simply incref the container pointer and keep it alive until everything is released.
though i guess you could probably reduce the number of copies and allocations
Are there other attributes that you think I could be avoiding copying? I made the decisions I did at least in part because this seemed like the most faithful representation of the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely the most faithful representation of the spec. I was mostly referring to the idea that, in theory, you only really need to have the top-level array hold the arrow_array_container
and then everything else can just be dependent on the parent doing the clean-up. but that gets a lot more complex to manage correctly and safely since people can easily grab individual children and do stuff with them. So, unless you start seeing bottlenecks in this copy and in the management of the private_data / shared_pointers, I think this is fine the way it is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking of doing something like that initially to avoid having everyone hold the container, but then I wasn't sure we could safely handle all the cases where someone moves a child column since then you have to release the parent, so you'd have to jump through more hoops to make the children keep the parent alive in order to keep the data alive, so ultimately you're still effectively keeping the private data alive. The current approach seemed simpler, more robust, and ultimately no more expensive in terms of what data is preserved.
Questions are great! I think I answered all of them sufficiently, but please let me know if there are any holes in my reasoning. This PR took me a while to put together, there's a lot of subtlety to the memory semantics that I had to work through and I may have missed things. |
if (input->array.release != nullptr) { ArrowArrayRelease(&input->array); } | ||
break; | ||
} | ||
default: container = std::make_shared<arrow_array_container>(schema, input, stream, mr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should do an ArrowArrayMove
from input into what you pass to the arrow_array_container
. Same for the schema, otherwise the original caller can bypass the arrow_array_container
and release stuff out from under you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Yes, we can create a temporary here and move into that before forwarding into the container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn't originally do this (and the reason for some of the other minor discrepancies between the two different constructors for the arrow_array_container
) is because all the from_arrow*
functions that you previously added accept an ArrowSchema const*
. In the construction of an arrow_array_container
from a cudf::table
or cudf::column
I just move the schema into the container, but when constructing from an externally provided ArrowDeviceArray
I currently perform an ArrowSchemaDeepCopy
because I matched the from_arrow_*
signatures and had a const schema. I can change that and then moving things out here will work just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I've remove the const bits I think simply forwarding along the arguments is fine since they're all pointers and I'm moving all of them in the arrow_array_container constructors, but let me know if I'm missing anything there.
Description
This PR introduces two new types to cudf,
arrow_column
andarrow_table
. These types are analogous tocudf::column
andcudf::table
, but instead of using cudf's unique ownership semantics these follow a shared ownership model that is more amenable for use with Arrow interop. These types are intended to be used in place of direct calls to Arrow interop functions likefrom_arrow_device
, which place the onus on the caller to track which APIs handle Arrow C Data Interface memory management semantics for you and which ones do not (see discussion in this thread for lots of examples). With the new types, the semantics are fairly straightforward and map to what one would expect when using the C Data Interface: the cudf objects either copy (in the case of host data) or move (in the case of device data) the input array structures and leave them in a released state afterwards.To keep the scope of this PR limited, I have implemented the core logic by simply calling the existing interop functions in suitable ways and only adding new logic as needed to handle proper ownership management. If we are happy with the new model, over time we can move to deprecate those code paths and move more of the logic directly into these classes.
Closes #16104
Checklist