[GPU][POC] clDNN gemv optimization for LLM second token #28976
base: master
Conversation
Force-pushed from 7e63d3e to 17d1e29
Force-pushed from 654da9a to 80357ed
src/plugins/intel_gpu/include/intel_gpu/graph/kernel_impl_params.hpp
        if (f.is_type<swiglu>())
            return false;
    }
}
Random spot:
But why don't you add a new FC kernel under the bf_tiled kernel
(as is done for the b1-b8 kernels in the bf_tiled kernel)?
Just for a clearer implementation; I don't want it to be mixed with the bf_tiled kernel. The gemv kernel implementation focuses on the memory-bound scenario, while the bf_tiled kernel is better for compute-bound cases.
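For context on the memory-bound point: with a single input row (second-token decode, batch = 1) the fully connected layer degenerates to a matrix-vector product, so every weight element is read exactly once and bandwidth dominates. A minimal reference sketch, illustrative only and not the kernel in this PR:

```cpp
#include <cstddef>
#include <vector>

// Reference gemv: y = W * x, with W stored row-major (out_f x in_f).
// For batch = 1 each weight element is loaded exactly once, so the kernel
// is bandwidth-bound; a tiled FC kernel tuned for larger batches adds
// selection/tuning overhead without raising arithmetic intensity here.
void gemv_ref(const std::vector<float>& W, const std::vector<float>& x,
              std::vector<float>& y, std::size_t out_f, std::size_t in_f) {
    for (std::size_t o = 0; o < out_f; ++o) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < in_f; ++i)
            acc += W[o * in_f + i] * x[i];
        y[o] = acc;
    }
}
```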
Please see fc_bf_tile_common.cl; you can still keep the kernel file separate.
Got it, it is a good idea!
@yeonbok I have tried putting this gemv kernel into the bf_tiled kernel file, but found it brings some negative impacts:
- More complex logic would be needed in fully_connected_kernel_bf_tiled.cpp to decide when to choose the gemv kernel versus the bf_tiled kernel. They have different prerequisites and processing, so mixing them brings a potential risk of race conditions and performance issues.
- System memory consumption would grow after merging the gemv kernel_selector into bf_tiled. Compared to the bf_tiled kernel selector, the gemv kernel_selector is much simpler, needs little memory, and does not need to run parameter auto-tuning (after merging them, the gemv case would have to keep part of the auto-tuning, which consumes more memory and CPU time).

So I prefer to create a new kernel impl instance for the gemv kernel; what do you think about it? A rough sketch of the simpler selection logic is given below.
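To make the "much simpler selection" point concrete, a hypothetical applicability check for a standalone gemv impl could look like the sketch below. All names are illustrative assumptions, not the actual code in this PR:

```cpp
#include <cstddef>

// Hypothetical shape/fusion summary for one fully connected node
// (illustrative fields only).
struct fc_params_sketch {
    std::size_t batch;       // number of input rows in this execution
    bool has_swiglu_fusion;  // gate_up + SwiGLU fused into the FC
};

// Applicability check for a standalone gemv impl: no auto-tuning, just a
// cheap predicate. The gemv path targets the memory-bound second-token
// case only: a single input row and no fused SwiGLU epilogue.
bool gemv_is_applicable(const fc_params_sketch& p) {
    return p.batch == 1 && !p.has_swiglu_fusion;
}
```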
Another curiosity: AFAIK SwiGLU should be fused for the Qwen2 model. How could you measure the above data for this kernel even though the new kernel does not support swiglu?
Swiglu fusion is used for the gate_up FC fusion in the Qwen2 model, and the current gemv kernel doesn't support gate_up fusion in this PR because non-interleaved gate_up weights would hurt memory read bandwidth. So in this PR the gate_up fused FC still uses the bf_tiled kernel. As a next step, if possible, I plan to optimize the current gate_up_swiglu bf_tiled kernel by reordering the gate_up weights into an interleaved layout, and then implement a gemv kernel that supports the gate_up SwiGLU fusion. Note: the gate_up fused kernel takes more latency than the sum of o_proj/q_proj/down, so a gate_up gemv implementation should bring a bigger performance improvement than the current PR.
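To illustrate the interleaving idea (an assumed layout, not code from this PR): reordering the gate and up projection weights so that matching rows sit next to each other lets a fused gemv stream both projections in one contiguous pass while reusing the same input vector.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical offline reorder: pack gate and up weights (each out_f x in_f,
// row-major) into an interleaved layout where gate row o is immediately
// followed by up row o. A fused gemv can then read both rows back-to-back
// for the same input vector x instead of doing two strided passes.
std::vector<float> interleave_gate_up(const std::vector<float>& w_gate,
                                      const std::vector<float>& w_up,
                                      std::size_t out_f, std::size_t in_f) {
    std::vector<float> packed(2 * out_f * in_f);
    for (std::size_t o = 0; o < out_f; ++o) {
        for (std::size_t i = 0; i < in_f; ++i) {
            packed[(2 * o)     * in_f + i] = w_gate[o * in_f + i];
            packed[(2 * o + 1) * in_f + i] = w_up[o * in_f + i];
        }
    }
    return packed;
}
```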
src/plugins/intel_gpu/src/kernel_selector/kernel_selector_params.h
...lugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_gemv.cpp
...lugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_gemv.cpp
1. reuse some util functions in fc_tile
2. remove unused async compilation logic
3. remove unused method
src/plugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_params.h
if (inst.get_node().is_type<fully_connected>() && need_single_batch_optimization(impl)) {
    // Switch to single batch optimization.
    continue;
}
Do we still need this logic? The optimized static impl should already be selected at step 1), before this.
It seems we need it to switch to the gemv kernel for the second token; let's double-check the details.
Confirmed: it works well when running an LLM model without this logic, but in the dynamic shape case it chooses the fc_bf_tiled kernel rather than the gemv kernel for single-batch input. @sshlyapn Is there a better solution to this problem?
Please try setting the priority value in GetKernelsPriority() lower than for the bf_tiled kernel, something like FORCE_PRIORITY_3.
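A minimal sketch of that suggestion; the kernel_selector types and constant are mocked here so the snippet stands alone, and the real override in the plugin will differ:

```cpp
// Stand-ins for the plugin's own priority type and constant; in the plugin,
// GetKernelsPriority() is the override on the gemv kernel class and the
// FORCE_PRIORITY_* constants come from the existing kernel_selector code.
using KernelsPriority = int;
constexpr KernelsPriority FORCE_PRIORITY_3 = 3;

struct FullyConnectedGemvKernelMock {
    KernelsPriority GetKernelsPriority() const {
        // Rank the gemv impl relative to bf_tiled so the selector orders
        // the two implementations as intended.
        return FORCE_PRIORITY_3;
    }
};
```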
It seems that doesn't work: the gemv impl only handles single-batch input, and in the dynamic shape case the input batch is not known before the FC impl is chosen, so fc_bf_tiled is selected first. Once the input shape is set, there is no chance to re-choose the FC impl, so we have to add the logic above to allow re-choosing it.
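For clarity, a hypothetical version of the batch check behind need_single_batch_optimization (the real helper in this PR may differ): the impl picked while the shape was still dynamic is revisited once the actual shape is known, and the gemv path is taken only when the runtime batch resolves to 1.

```cpp
#include <cstddef>

// Hypothetical runtime predicate: decide whether to switch a fully connected
// node to the single-batch (gemv) impl after the actual input shape is known.
// The currently selected impl was chosen while the shape was dynamic, when
// the batch dimension was not yet available.
bool need_single_batch_optimization_sketch(std::size_t runtime_batch,
                                           bool current_impl_is_gemv) {
    // Re-selection is only needed when the resolved batch is 1 and the
    // dynamic-shape selection already committed to a non-gemv impl.
    return runtime_batch == 1 && !current_impl_is_gemv;
}
```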
Thanks @sshlyapn for the great help in solving the dynamic shape issue!
Details:
Test results show that second token performance has improved by about 7.5% for 26 INT4 LLM models:

WWB test result:
Tickets: