Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-7548][VL][test] Optimize BHJ in velox backend #8931

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

JkSelf
Copy link
Contributor

@JkSelf JkSelf commented Mar 7, 2025

What changes were proposed in this pull request?

This PR implements the optimization from the CK backend in the BHJ to the Velox backend, ensuring that the hash table is built only once per executor. The detailed design document can be found here.

How was this patch tested?

Existing tests

@github-actions github-actions bot added CORE works for Gluten Core BUILD VELOX labels Mar 7, 2025
Copy link

github-actions bot commented Mar 7, 2025

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

github-actions bot commented Mar 7, 2025

Run Gluten Clickhouse CI on x86

@JkSelf JkSelf changed the title [VL][test] test bhj optimization [GLUTEN-7548][VL][test] test bhj optimization Mar 7, 2025
Copy link

github-actions bot commented Mar 7, 2025

#7548

@JkSelf JkSelf force-pushed the bhj-optimization-1 branch from a6e8905 to 2b90118 Compare March 7, 2025 09:58
Copy link

github-actions bot commented Mar 7, 2025

Run Gluten Clickhouse CI on x86

@JkSelf JkSelf changed the title [GLUTEN-7548][VL][test] test bhj optimization [GLUTEN-7548][VL][test] Optimize BHJ in velox backend Mar 7, 2025
@FelixYBW
Copy link
Contributor

FelixYBW commented Mar 8, 2025

In long term, we need to implement the Spark way. Broadcast hashtable instead of raw table data.

@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 10, 2025

In long term, we need to implement the Spark way. Broadcast hashtable instead of raw table data.

@FelixYBW Yes, we will support broadcasting the hash table approach after adding serialization/deserialization support to Velox's HashTable.

@JkSelf JkSelf force-pushed the bhj-optimization-1 branch from 2b90118 to 8e677fa Compare March 10, 2025 02:11
Copy link

Run Gluten Clickhouse CI on x86

@JkSelf JkSelf force-pushed the bhj-optimization-1 branch from 8e677fa to 8fed157 Compare March 10, 2025 02:30
Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

@FelixYBW
Copy link
Contributor

@zhztheplayer Is there memory management issue in this solution? Is the memory allocated in storage memory?

@JkSelf will this solution helpful to the final solution?

@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 11, 2025

@JkSelf will this solution helpful to the final solution?

@FelixYBW Yes, the primary difference between Design 1 and Design 2 is the need for serialization and deserialization of Velox's hash table. Most of the remaining code can be shared between the two designs. We will evaluate the TPCH performance after addressing the result mismatch issue. If the performance does not meet expectations, we will proceed with implementing Design 2.

@JkSelf JkSelf marked this pull request as draft March 11, 2025 01:41
@JkSelf JkSelf force-pushed the bhj-optimization-1 branch from e7100cf to 7606ed3 Compare March 12, 2025 10:56
Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUILD CORE works for Gluten Core VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants