[Quantization] Int8 Dynamic Quantization for LLM #3312
Unanswered
yang-ahuan asked this question in Q&A
-
According to the documentation, when applying dynamic quantization to LLMs, DYNAMIC_QUANTIZATION_GROUP_SIZE must be nonzero [1]. However, with DYNAMIC_QUANTIZATION_GROUP_SIZE set to a nonzero value (or left at the default of 32), I observe that weight-only quantization has lower latency than dynamic quantization. Is this behavior expected? Any insights would be appreciated!
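For reference, the comparison I am running looks roughly like the sketch below (the model ID and prompt are placeholders). Setting DYNAMIC_QUANTIZATION_GROUP_SIZE to 0 disables dynamic quantization, so the first run exercises only the INT8 weight-only path:

```python
import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Once upon a time", return_tensors="pt")

# "0" disables dynamic quantization (weight-only path only);
# "32" is the documented default group size.
for group_size in ("0", "32"):
    model = OVModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        load_in_8bit=True,  # INT8 weight-only compression of the exported model
        ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": group_size},
    )
    model.generate(**inputs, max_new_tokens=8)  # warm-up
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(f"DYNAMIC_QUANTIZATION_GROUP_SIZE={group_size}: "
          f"{time.perf_counter() - start:.2f} s")
```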
BTW, I know that the group size for INT8 weight compression must be -1 [2], but I'm not sure whether that constraint explains the results above.
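For context, the weight compression I am referring to is along these lines (the IR paths are placeholders); as far as I understand, INT8 modes only support per-channel quantization, which is what group_size=-1 expresses:

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model_fp16.xml")  # placeholder path to an FP16/FP32 IR

# For INT8 modes NNCF only supports per-channel quantization, i.e.
# group_size must stay at -1 (the default); other values are rejected.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
    group_size=-1,
)
ov.save_model(compressed, "model_int8.xml")
```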
[1] https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-optimum-intel.html#enabling-openvino-runtime-optimizations
[2] https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/algorithm.py#L131
-
Replies: 1 comment

Hi @yang-ahuan, thanks for reporting this!