[GPU][POC] clDNN gemv optimization for LLM second token #28976
base: master
Conversation
Force-pushed from 7e63d3e to 17d1e29
Force-pushed from 654da9a to 80357ed
src/plugins/intel_gpu/include/intel_gpu/graph/kernel_impl_params.hpp
        if (f.is_type<swiglu>())
            return false;
    }
}
Random spot:
But why don't you add a new FC kernel under the bf_tiled kernel
(as is done for the b1-b8 kernels in the bf_tiled kernel)?
Just for a clearer implementation; I don't want it to be mixed with the bf_tiled kernel. The gemv kernel implementation focuses on the memory-bound scenario, while the bf_tiled kernel is better for compute-bound cases.
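For context on the memory-bound point: with a single input row (second-token decode, batch = 1) the fully connected layer degenerates to a matrix-vector product, so every weight element is read exactly once and bandwidth dominates. A minimal reference sketch, illustrative only and not the kernel in this PR:

```cpp
#include <cstddef>
#include <vector>

// Reference gemv: y = W * x, with W stored row-major (out_f x in_f).
// For batch = 1 each weight element is loaded exactly once, so the kernel
// is bandwidth-bound; a tiled FC kernel tuned for larger batches adds
// selection/tuning overhead without raising arithmetic intensity here.
void gemv_ref(const std::vector<float>& W, const std::vector<float>& x,
              std::vector<float>& y, std::size_t out_f, std::size_t in_f) {
    for (std::size_t o = 0; o < out_f; ++o) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < in_f; ++i)
            acc += W[o * in_f + i] * x[i];
        y[o] = acc;
    }
}
```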
Please see fc_bf_tile_common.cl; you can still keep the kernel file separate.
Got it, it is a good idea!
@yeonbok I have tried putting this gemv kernel into the bf_tiled kernel file, but found it brings some negative impacts:
- More complex logic would be needed in fully_connected_kernel_bf_tiled.cpp to decide when to choose the gemv kernel versus the bf_tiled kernel. They have different prerequisites and processing, so mixing them brings a potential risk of race conditions and performance issues.
- System memory consumption would grow after merging the gemv kernel_selector into bf_tiled. Compared to the bf_tiled kernel selector, the gemv kernel_selector is much simpler, needs little memory, and does not need to run parameter auto-tuning (after merging them, the gemv case would have to keep part of the auto-tuning, which consumes more memory and CPU time).

So I prefer to create a new kernel impl instance for the gemv kernel; what do you think about it? A rough sketch of the simpler selection logic is given below.
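To make the "much simpler selection" point concrete, a hypothetical applicability check for a standalone gemv impl could look like the sketch below. All names are illustrative assumptions, not the actual code in this PR:

```cpp
#include <cstddef>

// Hypothetical shape/fusion summary for one fully connected node
// (illustrative fields only).
struct fc_params_sketch {
    std::size_t batch;       // number of input rows in this execution
    bool has_swiglu_fusion;  // gate_up + SwiGLU fused into the FC
};

// Applicability check for a standalone gemv impl: no auto-tuning, just a
// cheap predicate. The gemv path targets the memory-bound second-token
// case only: a single input row and no fused SwiGLU epilogue.
bool gemv_is_applicable(const fc_params_sketch& p) {
    return p.batch == 1 && !p.has_swiglu_fusion;
}
```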
Another curiosity: AFAIK SwiGLU should be fused for the Qwen2 model. How could you measure the above data for this kernel even though the new kernel does not support swiglu?
Swiglu fusion is used for the gate_up FC fusion in the Qwen2 model, and the current gemv kernel doesn't support gate_up fusion in this PR because non-interleaved gate_up weights would hurt memory read bandwidth. So in this PR the gate_up fused FC still uses the bf_tiled kernel. As a next step, if possible, I plan to optimize the current gate_up_swiglu bf_tiled kernel by reordering the gate_up weights into an interleaved layout, and then implement a gemv kernel that supports the gate_up SwiGLU fusion. Note: the gate_up fused kernel takes more latency than the sum of o_proj/q_proj/down, so a gate_up gemv implementation should bring a bigger performance improvement than the current PR.
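To illustrate the interleaving idea (an assumed layout, not code from this PR): reordering the gate and up projection weights so that matching rows sit next to each other lets a fused gemv stream both projections in one contiguous pass while reusing the same input vector.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical offline reorder: pack gate and up weights (each out_f x in_f,
// row-major) into an interleaved layout where gate row o is immediately
// followed by up row o. A fused gemv can then read both rows back-to-back
// for the same input vector x instead of doing two strided passes.
std::vector<float> interleave_gate_up(const std::vector<float>& w_gate,
                                      const std::vector<float>& w_up,
                                      std::size_t out_f, std::size_t in_f) {
    std::vector<float> packed(2 * out_f * in_f);
    for (std::size_t o = 0; o < out_f; ++o) {
        for (std::size_t i = 0; i < in_f; ++i) {
            packed[(2 * o)     * in_f + i] = w_gate[o * in_f + i];
            packed[(2 * o + 1) * in_f + i] = w_up[o * in_f + i];
        }
    }
    return packed;
}
```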
src/plugins/intel_gpu/src/kernel_selector/kernel_selector_params.h
...lugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_gemv.cpp
...lugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_gemv.cpp
1. reuse some util functions in fc_tile
2. remove unused async compilation logic
3. remove unused method
src/plugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_params.h
if (inst.get_node().is_type<fully_connected>() && need_single_batch_optimization(impl)) {
    // Switch to single batch optimization.
    continue;
}
Do we still need this logic? The optimized static impl should already be selected at step 1), before this.
It seems we need it to switch to the gemv kernel for the second token; let's double-check the details.
Confirmed: it works well when running an LLM model without this logic, but in the dynamic shape case it chooses the fc_bf_tiled kernel rather than the gemv kernel for single-batch input. @sshlyapn Is there a better solution to this problem?
Please try setting the priority value in GetKernelsPriority() lower than for the bf_tiled kernel, something like FORCE_PRIORITY_3.
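A minimal sketch of that suggestion; the kernel_selector types and constant are mocked here so the snippet stands alone, and the real override in the plugin will differ:

```cpp
// Stand-ins for the plugin's own priority type and constant; in the plugin,
// GetKernelsPriority() is the override on the gemv kernel class and the
// FORCE_PRIORITY_* constants come from the existing kernel_selector code.
using KernelsPriority = int;
constexpr KernelsPriority FORCE_PRIORITY_3 = 3;

struct FullyConnectedGemvKernelMock {
    KernelsPriority GetKernelsPriority() const {
        // Rank the gemv impl relative to bf_tiled so the selector orders
        // the two implementations as intended.
        return FORCE_PRIORITY_3;
    }
};
```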
It seems that doesn't work: the gemv impl only handles single-batch input, and in the dynamic shape case the input batch is not known before the FC impl is chosen, so fc_bf_tiled is selected first. Once the input shape is set, there is no chance to re-choose the FC impl, so we have to add the logic above to allow re-choosing it.
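For clarity, a hypothetical version of the batch check behind need_single_batch_optimization (the real helper in this PR may differ): the impl picked while the shape was still dynamic is revisited once the actual shape is known, and the gemv path is taken only when the runtime batch resolves to 1.

```cpp
#include <cstddef>

// Hypothetical runtime predicate: decide whether to switch a fully connected
// node to the single-batch (gemv) impl after the actual input shape is known.
// The currently selected impl was chosen while the shape was dynamic, when
// the batch dimension was not yet available.
bool need_single_batch_optimization_sketch(std::size_t runtime_batch,
                                           bool current_impl_is_gemv) {
    // Re-selection is only needed when the resolved batch is 1 and the
    // dynamic-shape selection already committed to a non-gemv impl.
    return runtime_batch == 1 && !current_impl_is_gemv;
}
```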
Thanks @sshlyapn for the great help in solving the dynamic shape issue!
Details:
Test results show that second token performance has improved by about 7.5% for 26 INT4 LLM models:

WWB test result:
Tickets: