This example demonstrates how to find appropriate `awq`, `ratio` and `group_size` parameters for compressing the weights of the TinyLlama model from Hugging Face Transformers. The OpenVINO backend supports inference of mixed-precision models whose weights are compressed to a 4-bit data type as the primary precision. The fastest mixed-precision mode is `INT4_SYM`, but it may cause significant accuracy degradation, especially for models of moderate size. In this example, the maximum allowed deviation from the original model is `0.2` points of the similarity metric. If the similarity of the compressed model is not satisfactory, there are three hyper-parameters to tune: `awq`, `group_size` and `ratio`. A smaller `group_size` and a smaller `ratio` of 4-bit layers usually improve accuracy at the cost of model size and inference latency. In general, the accuracy of 4-bit compressed models can also be improved by applying the AWQ algorithm on top of the data-based mixed-precision algorithm.
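The search described above can be sketched as a loop that tries the fastest settings first and stops at the first configuration whose similarity stays within the allowed drop. This is a minimal illustration, not necessarily the strategy the example itself uses: `find_parameters`, `score_fn`, the candidate grids and the `baseline` value of `1.0` are all hypothetical names and values chosen for the sketch. In a real run, `score_fn` would be expected to compress the model (for instance with `nncf.compress_weights` in `INT4_SYM` mode, passing `awq`, `ratio`, `group_size` and a calibration dataset) and return the measured similarity.

```python
from itertools import product
from typing import Callable, Dict, Optional, Sequence


def find_parameters(
    score_fn: Callable[[bool, float, int], float],
    max_drop: float = 0.2,
    ratios: Sequence[float] = (1.0, 0.8, 0.6),
    group_sizes: Sequence[int] = (128, 64, 32),
    baseline: float = 1.0,
) -> Optional[Dict[str, object]]:
    """Return the first (fastest) setting whose similarity drop is acceptable.

    score_fn(awq, ratio, group_size) is a hypothetical callback that
    compresses the model with the given settings and returns its similarity
    to the original model. baseline is the assumed similarity of the
    original model with itself.
    """
    # Candidates go from the most aggressive setting (all layers in 4 bits,
    # large groups, no AWQ) towards more conservative ones. AWQ varies
    # fastest, then group_size, then ratio.
    for ratio, group_size, awq in product(ratios, group_sizes, (False, True)):
        similarity = score_fn(awq, ratio, group_size)
        if baseline - similarity <= max_drop:
            return {"awq": awq, "ratio": ratio, "group_size": group_size}
    # No candidate met the accuracy requirement.
    return None
```

Ordering the candidates from most to least aggressive means the first accepted configuration is also the one expected to give the smallest model and lowest latency among those tried. Each call to `score_fn` implies a full compression and evaluation, so the grids are kept short.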