Remove KV-cache compression disabling flag for compressed models #1141
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: Alexander Kozlov <alexander.kozlov@intel.com>
[{"int8": 14, "int4": 16}, {"int8": 9}, {"int8": 1}], | ||
[{"int8": 14, "int4": 16}, {"int8": 1}, {"int8": 9}], |
The order of some reference values changed because of the addition of this line: `submodels = list(model.submodels.values())`. This alters the order of submodels compared to before.
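A hypothetical illustration of why the reference order shifted, assuming `model.submodels` is a dict mapping submodel names to models; the keys and values below are placeholders, not the repository's actual submodel names:

```python
# Iterating a dict preserves insertion order (Python 3.7+), so building the
# submodel list this way follows the order in which submodels were registered,
# which can differ from a previously hand-ordered list.
submodels = {
    "language_model": "lm",            # placeholder values for illustration
    "text_embeddings": "text_emb",
    "vision_embeddings": "vision_emb",
}

# New approach: order follows dict insertion order.
ordered = list(submodels.values())
print(ordered)  # ['lm', 'text_emb', 'vision_emb']
```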
LGTM!
What does this PR do?
"KV_CACHE_PRECISION": "f16"
rt info flag for all weight-compressed models. This way applying weight compression also enables KV cache compression.OVDynamicQuantizationConfig
.Tests:
KV_CACHE_PRECISION
rt info flag is not present for compressed models.image-text-to-text
models aslist(model.submodels.values())
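A minimal sketch of such a check, assuming optimum-intel's `OVModelForCausalLM` export API and OpenVINO's `has_rt_info`; the model id and the `["runtime_options", "KV_CACHE_PRECISION"]` rt info path are assumptions for illustration, not taken from this PR:

```python
# Sketch: verify a weight-compressed model no longer carries the
# KV_CACHE_PRECISION rt info flag, so OpenVINO's default (compressed)
# KV cache precision applies.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-gpt2",  # hypothetical tiny test model
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)

# With the disabling flag removed, the rt info entry should be absent.
assert not model.model.has_rt_info(["runtime_options", "KV_CACHE_PRECISION"])
```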
Before submitting