[FEA] Support unspill for SpillableHostBuffer #12184

Open
binmahone opened this issue Feb 20, 2025 · 1 comment · May be fixed by #12186
Labels
feature request New feature or request

Comments

@binmahone
Collaborator

binmahone commented Feb 20, 2025

Is your feature request related to a problem? Please describe.

Currently, once a SpillableHostBuffer has been spilled from memory to disk, every subsequent call to SpillableHostBuffer#getHostBuffer reads and deserializes it from disk again. That is very costly and is not acceptable in cases where getHostBuffer is called multiple times on the same buffer.

One example is the Kudo shuffle read concat case. Let's assume the KudoTables that were read are placed into a spillable state (by wrapping each HostMemoryBuffer in a SpillableHostBuffer). When doing the kudo concat, we then have to call SpillableHostBuffer#getHostBuffer frequently and in random order, because kudo concat uses a random-read visitor to read all the input KudoTables. It's a performance nightmare if we have to read from disk every time.
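
To make the cost concrete, here is a minimal sketch of the access pattern described above. `SpilledBufferSketch` and `RandomReadDemo` are hypothetical stand-ins written for illustration, not the actual spark-rapids classes, and the "disk" is just a byte array; the point is only that each `getHostBuffer` call on a spilled buffer pays for a full read.

```java
/** Hypothetical stand-in for a buffer that has already been spilled. */
final class SpilledBufferSketch {
    private final byte[] onDisk;   // assumption: pretend this array lives on disk
    private int diskReads = 0;

    SpilledBufferSketch(byte[] payload) {
        this.onDisk = payload.clone();
    }

    /** Without unspill, every call re-reads (and re-deserializes) the data. */
    byte[] getHostBuffer() {
        diskReads++;
        return onDisk.clone();
    }

    int diskReads() {
        return diskReads;
    }
}

final class RandomReadDemo {
    /** Mimics a random-read visitor that touches the input tables out of order. */
    static int countDiskReads(int numTables, int[] accessOrder) {
        SpilledBufferSketch[] tables = new SpilledBufferSketch[numTables];
        for (int i = 0; i < numTables; i++) {
            tables[i] = new SpilledBufferSketch(new byte[] {(byte) i});
        }
        for (int idx : accessOrder) {
            tables[idx].getHostBuffer();   // one simulated disk read per access
        }
        int total = 0;
        for (SpilledBufferSketch t : tables) {
            total += t.diskReads();
        }
        return total;
    }
}
```

In this sketch the number of simulated disk reads grows with the number of accesses, not with the number of tables, which is the behavior this request is about.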

@abellina
Collaborator

abellina commented Feb 25, 2025

> One example is the Kudo shuffle read concat case. Let's assume the KudoTables that were read are placed into a spillable state (by wrapping each HostMemoryBuffer in a SpillableHostBuffer). When doing the kudo concat, we then have to call SpillableHostBuffer#getHostBuffer frequently and in random order, because kudo concat uses a random-read visitor to read all the input KudoTables. It's a performance nightmare if we have to read from disk every time.

I propose a change to the way the merge works then. I don't think unspill is the right way to solve the problem.

When you call materialize on a SpillableHostBuffer you get a HostMemoryBuffer that you must close after the operation is done. You are guaranteed the host memory. When we call mergeToTable and pass in KudoTables, those should all just be materialized on the host at once. In other words, expose a method in KudoTable that allows you to materialize it, and hold a reference to the HostMemoryBuffer in KudoTable while you concat. When you are done, simply close the references.
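
A rough sketch of that shape, assuming hypothetical stand-in classes (`HostBufferSketch`, `SpillableSketch`, `TableSketch`, `MergeSketch`) rather than the real `HostMemoryBuffer`, `SpillableHostBuffer`, and `KudoTable` types, whose signatures are not shown in this thread. Each table is materialized once up front, the random reads go against the held reference, and everything is closed when the merge finishes:

```java
import java.util.List;

/** Hypothetical stand-in for HostMemoryBuffer; only tracks open/closed state. */
final class HostBufferSketch implements AutoCloseable {
    final byte[] data;
    private boolean closed = false;

    HostBufferSketch(byte[] data) { this.data = data; }

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }
}

/** Hypothetical stand-in for SpillableHostBuffer. */
final class SpillableSketch {
    private final byte[] stored;
    int materializeCalls = 0;   // counts simulated (possibly disk-backed) loads

    SpillableSketch(byte[] stored) { this.stored = stored.clone(); }

    /** Returns a host buffer that the caller owns and must close. */
    HostBufferSketch materialize() {
        materializeCalls++;
        return new HostBufferSketch(stored.clone());
    }
}

/** Hypothetical KudoTable-like wrapper that can hold its materialized buffer. */
final class TableSketch implements AutoCloseable {
    private final SpillableSketch spillable;
    private HostBufferSketch held;

    TableSketch(SpillableSketch spillable) { this.spillable = spillable; }

    /** Materialize once; repeated calls reuse the held reference. */
    void materialize() {
        if (held == null) {
            held = spillable.materialize();
        }
    }

    byte read(int offset) {
        if (held == null) {
            throw new IllegalStateException("call materialize() first");
        }
        return held.data[offset];
    }

    @Override
    public void close() {
        if (held != null) {
            held.close();
            held = null;
        }
    }
}

final class MergeSketch {
    /** accesses[i] = {tableIndex, offset}; returns the sum of the bytes read. */
    static int mergeSum(List<TableSketch> tables, int[][] accesses) {
        try {
            for (TableSketch t : tables) {
                t.materialize();            // materialize all inputs at once
            }
            int sum = 0;
            for (int[] a : accesses) {
                sum += tables.get(a[0]).read(a[1]);   // random reads hit host memory
            }
            return sum;
        } finally {
            for (TableSketch t : tables) {
                t.close();                  // release the held references
            }
        }
    }
}
```

In this sketch each input pays for at most one materialization no matter how many random reads the merge performs, without adding unspill to the spillable buffer itself.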

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Feb 25, 2025