[Quantization] Int8 Dynamic Quantization for LLM #3312
Unanswered
yang-ahuan asked this question in Q&A
-
According to the documentation, when applying dynamic quantization to LLMs, DYNAMIC_QUANTIZATION_GROUP_SIZE must be nonzero [1]. However, with DYNAMIC_QUANTIZATION_GROUP_SIZE set to a nonzero value (or left at the default of 32), I observe that weight-only quantization has lower latency than dynamic quantization. Is this behavior expected? Any insights would be appreciated!
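For reference, the comparison I am running looks roughly like the sketch below (the model ID and prompt are placeholders). Setting DYNAMIC_QUANTIZATION_GROUP_SIZE to 0 disables dynamic quantization, so the first run exercises only the INT8 weight-only path:

```python
import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Once upon a time", return_tensors="pt")

# "0" disables dynamic quantization (weight-only path only);
# "32" is the documented default group size.
for group_size in ("0", "32"):
    model = OVModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        load_in_8bit=True,  # INT8 weight-only compression of the exported model
        ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": group_size},
    )
    model.generate(**inputs, max_new_tokens=8)  # warm-up
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(f"DYNAMIC_QUANTIZATION_GROUP_SIZE={group_size}: "
          f"{time.perf_counter() - start:.2f} s")
```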
BTW, I know that the group size for INT8 weight compression must be -1 [2], but I'm not sure whether that constraint explains the results above.
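For context, the weight compression I am referring to is along these lines (the IR paths are placeholders); as far as I understand, INT8 modes only support per-channel quantization, which is what group_size=-1 expresses:

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model_fp16.xml")  # placeholder path to an FP16/FP32 IR

# For INT8 modes NNCF only supports per-channel quantization, i.e.
# group_size must stay at -1 (the default); other values are rejected.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
    group_size=-1,
)
ov.save_model(compressed, "model_int8.xml")
```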
[1] https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-optimum-intel.html#enabling-openvino-runtime-optimizations
[2] https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/algorithm.py#L131
-
Replies: 1 comment

Hi @yang-ahuan, thanks for reporting this!