Is your feature request related to a problem? Please describe.
Currently, once a SpillableHostBuffer has been spilled from memory to disk, every subsequent invocation of SpillableHostBuffer#getHostBuffer reads and deserializes from disk. This is very costly and is not acceptable in cases where getHostBuffer is called multiple times.
One example is the Kudo shuffle read concat case. Let's assume the KudoTables that were read are placed into a spillable state (by wrapping the HostMemoryBuffer in a SpillableHostBuffer). When doing the kudo concat, we then have to call SpillableHostBuffer#getHostBuffer frequently and in random order, because kudo concat uses a random read visitor to read all the input KudoTables. It's a performance nightmare if we have to read from disk every time.
I propose changing the way the merge works instead. I don't think unspill is the right way to solve the problem.
When you call materialize on a SpillableHostBuffer, you get a HostMemoryBuffer that you must close after the operation is done. You are guaranteed the host memory. When we call mergeToTable and pass in KudoTables, those should all be materialized on the host at once. In other words, expose a method in KudoTable that allows you to materialize it, and hold a reference to the HostMemoryBuffer in KudoTable while you concat. When you are done, simply close the references.
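To make the suggestion concrete, here is a minimal sketch of "materialize everything once, read randomly, then close". It does not use the real classes: `FakeSpillableBuffer`, `FakeHostBuffer`, and `MaterializeOnceSketch` are hypothetical stand-ins invented for illustration, and the "disk read" is only a counter. It assumes each materialize call on a spilled buffer costs one disk read, as described in the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a spilled SpillableHostBuffer (not the real class).
final class FakeSpillableBuffer {
    private final byte[] onDisk; // pretend this data lives on disk after a spill
    private int diskReads = 0;

    FakeSpillableBuffer(byte[] data) { this.onDisk = data.clone(); }

    // Assumption: every call pays for one read + deserialize from disk.
    FakeHostBuffer materialize() {
        diskReads++;
        return new FakeHostBuffer(onDisk.clone());
    }

    int diskReads() { return diskReads; }
}

// Hypothetical stand-in for HostMemoryBuffer; the caller must close it.
final class FakeHostBuffer implements AutoCloseable {
    private byte[] data;

    FakeHostBuffer(byte[] data) { this.data = data; }

    byte get(int index) {
        if (data == null) throw new IllegalStateException("buffer is closed");
        return data[index];
    }

    boolean isClosed() { return data == null; }

    @Override
    public void close() { data = null; }
}

final class MaterializeOnceSketch {
    // Naive pattern: go back to the spillable buffer for every random read.
    // Each element of reads is {tableIndex, byteOffset}.
    static long sumNaive(List<FakeSpillableBuffer> tables, int[][] reads) {
        long sum = 0;
        for (int[] r : reads) {
            try (FakeHostBuffer b = tables.get(r[0]).materialize()) {
                sum += b.get(r[1]);
            }
        }
        return sum;
    }

    // Proposed pattern: materialize every input once, hold the references
    // during the random reads, and close them all when the merge is done.
    static long sumMaterializedOnce(List<FakeSpillableBuffer> tables, int[][] reads) {
        List<FakeHostBuffer> held = new ArrayList<>();
        try {
            for (FakeSpillableBuffer t : tables) {
                held.add(t.materialize());
            }
            long sum = 0;
            for (int[] r : reads) {
                sum += held.get(r[0]).get(r[1]);
            }
            return sum;
        } finally {
            for (FakeHostBuffer b : held) {
                b.close();
            }
        }
    }
}
```

With the naive pattern the number of disk reads grows with the number of random accesses; with the materialize-once pattern it is bounded by the number of input tables, at the price of holding all inputs in host memory for the duration of the merge.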