
[GPU][POC] clDNN gemv optimization for LLM second token #28976

Open
riverlijunjie wants to merge 24 commits into master from river/cldnn_gemv_opt

Conversation

@riverlijunjie (Contributor) commented Feb 13, 2025

Details:

  • Add a gemv kernel to clDNN to optimize the single-batch-input FC case.
  • Support weight compression data types i4 and u4.
  • Support weight data layouts os_iyx_osv16, os_is_yx_osv32_isv2 and os_is_yx_osv64_isv2 (an offset sketch for the blocked layout follows the table below).
  • Fused swiglu ops are not supported yet.
  • Qwen2_7B INT4 test result:

                       instances   PR (ms)   Master (ms)   improvement
    16 input tokens                55.22     59.35         7.5%
    512 input tokens               56.91     61.25         7.6%

    gate_up            28          0.895     0.907         n/a
    q_proj             28          0.436     0.514         15%
    last fc            1           6.193     6.110         n/a
    o_proj             28          0.126     0.195         35%
    down               28          0.105     0.148         29%
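On the blocked weight layouts listed above: a minimal sketch of how an os_iyx_osv16 element offset is commonly computed, assuming the usual clDNN naming where osv16 means output channels are blocked by 16 in the innermost position. The helper name and the exact layout interpretation are assumptions for illustration, not code from this PR.

```cpp
#include <cstddef>

// Hypothetical helper, not code from this PR: flat element offset for weights
// stored as os_iyx_osv16, read here as [o/16][i][y][x][o%16] from outermost to
// innermost (assumed interpretation of the layout name).
size_t offset_os_iyx_osv16(size_t o, size_t i, size_t y, size_t x,
                           size_t I, size_t Y, size_t X) {
    const size_t osv = 16;         // output-channel block size
    const size_t block = o / osv;  // which 16-wide output block
    const size_t lane = o % osv;   // position inside the block
    return (((block * I + i) * Y + y) * X + x) * osv + lane;
}
```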


Test results show that second-token latency improves by about 7.5% across 26 INT4 LLM models:
[image: second-token improvement across 26 INT4 LLM models]

WWB test result:

[image: WWB test result]

To do:

  • swiglu fusion support
  • weight int8 data type support

Tickets:

@wenjiew changed the title from "[TESE] cldnn gemv optimization" to "[TEST] cldnn gemv optimization" on Feb 20, 2025
@peterchen-intel changed the title from "[TEST] cldnn gemv optimization" to "clDNN gemv optimization" on Feb 21, 2025
@ceciliapeng2011 self-requested a review on February 28, 2025 02:47
@riverlijunjie changed the title from "clDNN gemv optimization" to "[GPU][POC] clDNN gemv optimization for LLM second token" on Feb 28, 2025
@riverlijunjie force-pushed the river/cldnn_gemv_opt branch from 654da9a to 80357ed on March 7, 2025 05:52
@riverlijunjie requested a review from yeonbok on March 7, 2025 07:30
@riverlijunjie requested a review from sshlyapn on March 7, 2025 07:31
// The gemv impl does not support fused swiglu, so reject the impl in that case.
if (f.is_type<swiglu>())
    return false;
}
}
Contributor:

Random spot:
But why don't you add a new fc kernel under the bf_tiled kernel?
(as it is done for the b1-b8 kernels in the bf_tiled kernel.)

Contributor (Author):

Just for a cleaner implementation; I didn't want it mixed with the bf_tiled kernel. The gemv kernel implementation targets the memory-bound scenario, while the bf_tiled kernel is better suited to compute-bound cases.
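For intuition on the memory-bound point, here is a minimal scalar sketch of the batch-1 computation, assuming a simple u4 packing with a fixed zero point and per-group scales (all simplifications for illustration; this is not the PR's OpenCL kernel). Every packed weight byte is loaded exactly once and feeds a single accumulator, so throughput is bounded by weight-read bandwidth rather than arithmetic.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative reference: y[o] = sum_i dequant(W[o][i]) * x[i], weights packed
// two u4 values per byte, per-group scales, fixed zero point of 8 (assumed).
// Assumes I is even and group_size divides I.
void gemv_u4_ref(const std::vector<uint8_t>& packed_w,  // [O x I/2], row-major
                 const std::vector<float>& scales,      // [O x I/group_size]
                 const std::vector<float>& x,           // [I]
                 std::vector<float>& y,                 // [O]
                 size_t O, size_t I, size_t group_size) {
    for (size_t o = 0; o < O; ++o) {
        float acc = 0.0f;
        for (size_t i = 0; i < I; i += 2) {
            const uint8_t b = packed_w[o * (I / 2) + i / 2];
            const float s = scales[o * (I / group_size) + i / group_size];
            const float w0 = static_cast<float>(b & 0x0F) - 8.0f;  // low nibble
            const float w1 = static_cast<float>(b >> 4) - 8.0f;    // high nibble
            acc += s * (w0 * x[i] + w1 * x[i + 1]);
        }
        y[o] = acc;
    }
}
```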

Contributor:

Please see fc_bf_tile_common.cl; you can still separate the kernel file.

Contributor (Author):

Got it, it is a good idea!

Contributor (Author):

@yeonbok I have tried putting this gemv kernel into the bf_tiled kernel file, but found it brings some negative impacts:

  1. fully_connected_kernel_bf_tiled.cpp would need more complex logic to decide when to choose the gemv kernel versus the bf_tiled kernel, which have different prerequisites and code paths; this adds potential risk of race conditions and performance issues.
  2. System memory consumption would grow after merging the gemv kernel_selector into bf_tiled. Compared with the bf_tiled kernel selector, the gemv kernel_selector is much simpler: it needs little memory and no parameter auto-tuning, whereas after a merge the gemv case would have to keep part of the auto-tune path, which costs extra memory and CPU time.
    So I prefer to create a new kernel impl instance for the gemv kernel (a rough sketch of the idea is below). What do you think?
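To make the "separate impl instance" idea concrete, here is a self-contained toy model (all names and the priority convention are hypothetical, not the clDNN kernel_selector API): a standalone gemv impl with its own cheap validity check sits next to the general tiled impl, and the selector picks among the impls that are valid for the given parameters.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Toy model only -- illustrates the proposed design, not OpenVINO code.
struct FcParams { int batch; bool int4_compressed; bool swiglu_fused; };

struct FcImpl {
    std::string name;
    int priority;                        // lower value wins among valid impls (toy convention)
    bool (*validate)(const FcParams&);   // impl-specific prerequisites
};

bool tiled_ok(const FcParams&) { return true; }                   // general-purpose fallback
bool gemv_ok(const FcParams& p) {                                 // narrow gemv prerequisites:
    return p.batch == 1 && p.int4_compressed && !p.swiglu_fused;  // single batch, int4, no swiglu
}

std::string select_impl(const std::vector<FcImpl>& impls, const FcParams& p) {
    const FcImpl* best = nullptr;
    for (const auto& impl : impls)
        if (impl.validate(p) && (best == nullptr || impl.priority < best->priority))
            best = &impl;
    return best ? best->name : "none";
}

int main() {
    const std::vector<FcImpl> impls = {{"fc_bf_tiled", 3, tiled_ok}, {"fc_gemv", 1, gemv_ok}};
    std::cout << select_impl(impls, {1, true, false}) << "\n";    // second token     -> fc_gemv
    std::cout << select_impl(impls, {512, true, false}) << "\n";  // prefill-like case -> fc_bf_tiled
}
```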

@yeonbok (Contributor) commented Mar 14, 2025

Another curiosity: AFAIK swiglu should be fused for the Qwen2 model. How could you measure the above data for this kernel even though the new kernel does not support swiglu?

@riverlijunjie (Contributor, Author):

Another curiosity: AFAIK swiglu should be fused for the Qwen2 model. How could you measure the above data for this kernel even though the new kernel does not support swiglu?

Swiglu fusion is used for the gate_up FC fusion in the Qwen2 model, but the gemv kernel in this PR does not support gate_up fusion, because non-interleaved gate_up weights would hurt memory-read bandwidth. So in this PR, the gate_up fused FC still uses the bf_tiled kernel.

As a next step, if possible, I plan to optimize the current gate_up_swiglu bf_tiled kernel by reordering the gate_up weights into an interleaved layout, and then extend the gemv kernel to support gate_up swiglu fusion (see the sketch below).

Note: the gate_up fused kernel accounts for more latency than o_proj/q_proj/down combined, so a gemv implementation of gate_up should yield a larger performance gain than the current PR.
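A minimal sketch of the planned weight reordering, assuming row-major gate and up weight matrices stacked along the output dimension (the function name and layout are hypothetical, not code from this PR): interleaving the rows lets a fused gate_up gemv read both halves of each swiglu pair from adjacent memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical re-layout: input w holds N gate rows followed by N up rows
// (each row has K elements). Output holds them interleaved: g0, u0, g1, u1, ...
std::vector<float> interleave_gate_up(const std::vector<float>& w, size_t N, size_t K) {
    std::vector<float> out(w.size());
    for (size_t n = 0; n < N; ++n) {
        std::copy(w.begin() + n * K, w.begin() + (n + 1) * K,
                  out.begin() + (2 * n) * K);                      // gate row n
        std::copy(w.begin() + (N + n) * K, w.begin() + (N + n + 1) * K,
                  out.begin() + (2 * n + 1) * K);                  // up row n
    }
    return out;
}
```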

@peterchen-intel requested a review from yeonbok on March 19, 2025 12:22

New commits:
  1. reuse some util functions in fc_tile
  2. remove unused async compilation logic
  3. remove unused method

@riverlijunjie requested a review from sshlyapn on March 24, 2025 03:26
Comment on lines 2884 to 2887
if (inst.get_node().is_type<fully_connected>() && need_single_batch_optimization(impl)) {
    // Switch to single batch optimization.
    continue;
}
Contributor:

Do we still need this logic? The optimized static impl should already be selected at step 1), before this point.

Contributor (Author):

It seems we need it to switch to the gemv kernel for the second token; let me double-check the details.

@riverlijunjie (Contributor, Author) commented Mar 26, 2025

Confirmed it works well when running the LLM model without this logic, but in the dynamic-shape case it chooses the fc_bf_tiled kernel rather than the gemv kernel for single-batch input. @sshlyapn is there a better solution to this problem?

Contributor:

Please try setting the priority value in GetKernelsPriority() lower than the bf_tiled kernel's, something like FORCE_PRIORITY_3.
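For reference, a rough sketch of what that suggestion might look like. The class name is hypothetical and the exact override signature may differ between branches; GetKernelsPriority() and FORCE_PRIORITY_3 are taken from the comment above.

```cpp
// Hypothetical override for the new gemv impl -- not code from this PR.
// The returned value only matters relative to what the bf_tiled impl returns:
// the selector uses these priorities to rank impls that are both valid.
KernelsPriority FullyConnected_GEMV::GetKernelsPriority(const Params& /*params*/) const {
    return FORCE_PRIORITY_3;
}
```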

Contributor (Author):

It doesn't seem to work: the gemv impl is only for single-batch input, and in the dynamic-shape case the input batch is not known before the FC impl is chosen, so fc_bf_tiled is selected first. Once the input shape is set, there is no chance to re-select a new FC impl, so we have to add the logic above to allow the impl to be re-chosen.

Contributor (Author):

Thanks @sshlyapn for the great help solving the dynamic-shape issue!

Labels
category: GPU OpenVINO GPU plugin
5 participants