
Commit 73eb8b3

Visualisation of weight compression results (#3009)
### Changes

The tables and images have been added to illustrate the trade-off between accuracy and footprint for the INT4_ASYM mode. A script has been created to automate the process of generating this visualization from a CSV file containing all the necessary raw data. The script calculates the compression rate and the average relative error for a given model size and metrics. It should reduce the likelihood of errors and simplify the maintenance of results.

### Reason for changes

INT4_ASYM is a more accurate and preferable mode for weight compression. The previous results were obtained using the INT4_SYM mode.

### Related tickets

n/a

### Tests

tests/tools/test_compression_visualization.py
1 parent 17f799e commit 73eb8b3

File tree

9 files changed: +318 -222 lines changed


docs/usage/post_training_compression/weights_compression/Usage.md

+36 -222
@@ -569,14 +569,13 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precisio

 #### Accuracy/Footprint trade-off

-Below are the tables showing the accuracy/footprint trade-off for `Qwen/Qwen2-7B` and
+Below are the tables showing the accuracy/footprint trade-off for `meta-llama/Llama-2-7b-chat-hf` and
 `microsoft/Phi-3-mini-4k-instruct` compressed with different options.

 Compression ratio is defined as the ratio between the size of fp32 model and size of the compressed one.
-Accuracy metrics are measured on 4 tasks [lambada openai](https://huggingface.co/datasets/EleutherAI/lambada_openai), [wikitext](https://arxiv.org/pdf/1609.07843.pdf),
-[winogrande](https://arxiv.org/abs/1907.10641), [WWB](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark/whowhatbench).
+Accuracy metrics are measured on 3 tasks [lambada openai](https://huggingface.co/datasets/EleutherAI/lambada_openai), [wikitext](https://arxiv.org/pdf/1609.07843.pdf), [WWB](https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/who_what_benchmark).
 The `average relative error` in the tables below is the mean of relative errors for each of four tasks with respect to
-the metric value for fp32 model. All int4 models are compressed group-wise with `group_size=128` and `mode=CompressionMode.INT4_SYM` and
+the metric value for fp32 model. All int4 models are compressed group-wise with `group_size=64` and `mode=CompressionMode.INT4_ASYM` and
 with calibration dataset based on 128 samples from `wikitext-2-v1`. Int8 model is compressed with `mode=CompressionMode.INT8_ASYM`.
 The following advanced parameters were used for AWQ, Scale Estimation and Lora Correction algorithms:

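For reference, the two quantities reported in the tables of this section can be written as follows, assuming each compressed metric $m_i$ is compared with its fp32 counterpart and $N$ is the number of task metrics (the exact sign convention for higher-is-better metrics such as accuracy and similarity may differ in the reported numbers):

$$
\text{compression rate} = \frac{\text{size}_{\mathrm{fp32}}}{\text{size}_{\mathrm{compressed}}},
\qquad
\text{average relative error} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lvert m_i - m_i^{\mathrm{fp32}} \rvert}{\lvert m_i^{\mathrm{fp32}} \rvert}
$$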
@@ -590,229 +589,44 @@ AdvancedCompressionParameters(

 The tables clearly shows the followings:

-- More layers in 8 bit does improve accuracy, but it increases the footprint a lot.
-- Scale Estimation, AWQ, GPTQ do improve accuracy of the baseline int4 model without footprint increase.
-- Lora correction algorithm improves the accuracy of int4 models further with a footprint much less compared to mixed-precision models with the same or worse accuracy.
+- More layers in 8 bit does improve accuracy, but it also increases the footprint significantly.
+- Scale Estimation, AWQ, GPTQ improve the accuracy of the baseline int4 model without increasing the footprint.
+- The Lora Correction algorithm further improves the accuracy of int4 models with a much smaller footprint compared to mixed-precision models that have the same or worse accuracy.

-Accuracy/footprint trade-off for `Qwen/Qwen2-7B`:
+Accuracy/footprint trade-off for `meta-llama/Llama-2-7b-chat-hf`:

-<div class="tg-wrap"><table><thead>
-<tr><th>Mode </th><th>%int4</th><th>%int8</th><th>lora<br>rank</th><th>average<br>relative<br>error</th><th>compression<br>rate</th></tr></thead>
-<tbody>
-<tr><td>fp32</td><td>0%</td><td>0%</td><td></td><td>0.0%</td><td>1.0x</td></tr>
-<tr><td>int8</td><td>0%</td><td>100%</td><td></td><td>7.9%</td><td>3.9x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>256</td><td>16.5%</td><td>5.8x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>40%</td><td>60%</td><td></td><td>17.1%</td><td>4.7x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>60%</td><td>40%</td><td></td><td>17.1%</td><td>5.2x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>32</td><td>17.4%</td><td>6.5x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>8</td><td>17.5%</td><td>6.6x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>80%</td><td>20%</td><td></td><td>17.5%</td><td>5.8x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>16</td><td>18.0%</td><td>6.6x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>100%</td><td>0%</td><td></td><td>18.4%</td><td>6.7x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + gptq</td><td>100%</td><td>0%</td><td></td><td>20.2%</td><td>6.7x</td></tr>
-<tr><td>int4</td><td>100%</td><td>0%</td><td></td><td>21.4%</td><td>6.7x</td></tr>
-</tbody></table></div>
+| mode                                             | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                              | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int4 + awq + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 2.5%                           | 6.1x                  |
+| int4 + awq + scale estimation                     | 40%     | 60%     |                | 2.5%                           | 4.8x                  |
+| int4 + awq + scale estimation                     | 60%     | 40%     |                | 2.7%                           | 5.4x                  |
+| int4 + awq + scale estimation                     | 80%     | 20%     |                | 3.5%                           | 6.2x                  |
+| int4 + awq + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 3.6%                           | 6.6x                  |
+| int4 + awq + scale estimation + lora correction   | 100%    | 0%      | 32.0           | 3.9%                           | 7.0x                  |
+| int4 + awq + scale estimation + gptq              | 100%    | 0%      |                | 4.1%                           | 7.2x                  |
+| int4 + awq + scale estimation                     | 100%    | 0%      |                | 5.3%                           | 7.2x                  |
+| int4                                              | 100%    | 0%      |                | 8.5%                           | 7.2x                  |
+
+![alt text](llama2_asym.png)

 Accuracy/footprint trade-off for `microsoft/Phi-3-mini-4k-instruct`:

-<div class="tg-wrap"><table><thead>
-<tr><th>Mode </th><th>%int4</th><th>%int8</th><th>lora<br>rank</th><th>average<br>relative<br>error</th><th>compression<br>rate</th></tr></thead>
-<tbody>
-<tr><td>fp32</td><td>0%</td><td>0%</td><td></td><td>0.0%</td><td>1.0x</td></tr>
-<tr><td>int8</td><td>0%</td><td>100%</td><td></td><td>7.3%</td><td>4.0x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>40%</td><td>60%</td><td></td><td>16.9%</td><td>4.9x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>60%</td><td>40%</td><td></td><td>18.4%</td><td>5.5x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>256</td><td>18.7%</td><td>6.2x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>16</td><td>20.5%</td><td>7.3x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>32</td><td>20.6%</td><td>7.2x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>80%</td><td>20%</td><td></td><td>21.3%</td><td>6.3x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + gptq</td><td>100%</td><td>0%</td><td></td><td>21.7%</td><td>7.4x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>8</td><td>22.1%</td><td>7.3x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>100%</td><td>0%</td><td></td><td>24.5%</td><td>7.4x</td></tr>
-<tr><td>int4</td><td>100%</td><td>0%</td><td></td><td>25.3%</td><td>7.4x</td></tr>
-</tbody></table></div>
+| mode                                       | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                        | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int8                                        | 0%      | 100%    |                | 1.0%                           | 4.0x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 3.9%                           | 6.0x                  |
+| int4 + scale estimation                     | 40%     | 60%     |                | 4.1%                           | 4.8x                  |
+| int4 + scale estimation                     | 60%     | 40%     |                | 4.3%                           | 5.4x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 4.6%                           | 6.5x                  |
+| int4 + scale estimation                     | 80%     | 20%     |                | 5.7%                           | 6.1x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 8.0            | 5.8%                           | 7.1x                  |
+| int4 + scale estimation + gptq              | 100%    | 0%      |                | 6.1%                           | 7.1x                  |
+| int4 + scale estimation                     | 100%    | 0%      |                | 7.5%                           | 7.1x                  |
+| int4                                        | 100%    | 0%      |                | 11.9%                          | 7.1x                  |
+
+![alt text](phi3_asym.png)

 ### Limitations


tests/tools/data/phi3_asym.csv

+12
@@ -0,0 +1,12 @@
+"model, int4_asym, gs64",mode,%int4,%int8,lora rank,plot name,"model size, Gb",compression rate,"wikitext, word perplexity","lambada-openai, acc","lambada-openai, perplexity","WWB, similarity",average relative error,"compression time, min"
+Phi-3-mini-4k-instruct,fp32,0.0,0.0,,,14.235,1.0,9.48394027691655,0.654764215020377,5.09378699019839,1.0,0.0,0.0
+Phi-3-mini-4k-instruct,int8,0.0,1.0,,,3.562,3.9963503649635,9.499335040825285,0.6549582767320008,5.052950754896191,0.9527044384567825,0.01045364376495017,0.73
+Phi-3-mini-4k-instruct,int4 + scale estimation,0.4,0.6,,40% int4,2.953,4.820521503555706,9.71854910518393,0.650300795653018,5.30571206241349,0.9110616776678298,0.04083810560342551,7.66
+Phi-3-mini-4k-instruct,int4 + scale estimation,0.6,0.4,,60% int4,2.646,5.379818594104308,9.85636814588443,0.644673006015913,5.291942925143432,0.9213788685975252,0.04336565988152735,11.19
+Phi-3-mini-4k-instruct,int4 + scale estimation + lora correction,1.0,0.0,256.0,rank 256,2.382,5.976070528967254,9.988029919971328,0.6555404618668736,5.233390277183369,0.9246453907754686,0.03899610491099074,60.22
+Phi-3-mini-4k-instruct,int4 + scale estimation,0.8,0.2,,80% int4,2.324,6.125215146299483,10.02766689136818,0.6456433145740346,5.503355706799129,0.9220877753363715,0.05771907718987122,15.02
+Phi-3-mini-4k-instruct,int4 + scale estimation + lora correction,1.0,0.0,128.0,rank 128,2.194,6.488149498632635,10.06818844548755,0.6545701533087522,5.251984488969776,0.9090713858604431,0.04628722290049148,59.02
+Phi-3-mini-4k-instruct,int4 + scale estimation + gptq,1.0,0.0,,gptq,2.004,7.103293413173652,10.16727993333731,0.6444789443042888,5.444539130651724,0.9119987841005679,0.06147882553656911,137.77
+Phi-3-mini-4k-instruct,int4 + scale estimation + lora correction,1.0,0.0,8.0,rank 8,2.018,7.054013875123886,10.20127160713859,0.6497186105181447,5.441740470297863,0.9188693364461263,0.05851965015464443,43.83
+Phi-3-mini-4k-instruct,int4 + scale estimation,1.0,0.0,,100% int4,2.004,7.103293413173652,10.36438870990786,0.6413739569183,5.573979676833424,0.9068410683561254,0.07550933307474214,17.3
+Phi-3-mini-4k-instruct,int4,1.0,0.0,,data-free,2.004,7.103293413173652,10.6753930252974,0.622355909179119,6.088275680704702,0.8961785568131341,0.1188976835750515,2.71
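The `compression rate` column in this file follows directly from the `model size, Gb` column. As a quick check against the fp32 size of 14.235 Gb, the int8 and data-free int4 rows above give

$$
\frac{14.235}{3.562} \approx 3.996 \approx 4.0\times,
\qquad
\frac{14.235}{2.004} \approx 7.103 \approx 7.1\times,
$$

matching the 4.0x and 7.1x entries in the generated table below.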

tests/tools/data/phi3_asym.md

+13
@@ -0,0 +1,13 @@
+| mode                                       | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                        | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int8                                        | 0%      | 100%    |                | 1.0%                           | 4.0x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 3.9%                           | 6.0x                  |
+| int4 + scale estimation                     | 40%     | 60%     |                | 4.1%                           | 4.8x                  |
+| int4 + scale estimation                     | 60%     | 40%     |                | 4.3%                           | 5.4x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 4.6%                           | 6.5x                  |
+| int4 + scale estimation                     | 80%     | 20%     |                | 5.7%                           | 6.1x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 8.0            | 5.8%                           | 7.1x                  |
+| int4 + scale estimation + gptq              | 100%    | 0%      |                | 6.1%                           | 7.1x                  |
+| int4 + scale estimation                     | 100%    | 0%      |                | 7.5%                           | 7.1x                  |
+| int4                                        | 100%    | 0%      |                | 11.9%                          | 7.1x                  |

tests/tools/requirements.txt

+1
@@ -1,4 +1,5 @@
 matplotlib
 psutil
 pytest
+pandas
 tabulate>=0.9.0
tests/tools/test_compression_visualization.py

+25
@@ -0,0 +1,25 @@
+# Copyright (c) 2024 Intel Corporation
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from tests.cross_fw.shared.paths import TEST_ROOT
+from tools.visualize_compression_results import visualize
+
+
+def test_visualization_of_compression_results(tmp_path):
+    in_file = TEST_ROOT / "tools" / "data" / "phi3_asym.csv"
+    ref_md_file = TEST_ROOT / "tools" / "data" / "phi3_asym.md"
+
+    visualize(in_file, tmp_path)
+
+    md_file = tmp_path / (in_file.stem + ".md")
+    assert md_file.exists()
+    assert md_file.with_suffix(".png").exists()
+    assert ref_md_file.read_text()[:-1] == md_file.read_text()  # ref file ends with a newline character by code style
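The test above exercises the script's Python entry point directly, which can also be useful outside of pytest. A minimal sketch of the same call follows; the paths are illustrative, and only the `visualize(input_file, output_dir)` signature used by the test is assumed:

```python
from pathlib import Path

from tools.visualize_compression_results import visualize

# Any CSV following the documented format works here; this one is added by the commit.
csv_file = Path("tests/tools/data/phi3_asym.csv")
output_dir = Path("visualization_output")
output_dir.mkdir(exist_ok=True)

# Writes <stem>.md (the markdown table) and <stem>.png (the trade-off plot) into output_dir,
# mirroring the files the test checks for.
visualize(csv_file, output_dir)
```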

tools/README.md

+45
@@ -108,3 +108,48 @@ def allocate_memory():

 max_memory_usage: float = mmc.memory_data[MemoryType.SYSTEM]
 ```
+
+## Visualization of Weight Compression results
+
+The [visualize_compression_results.py](visualize_compression_results.py) script is a useful tool for visualizing the results of weight compression.
+The result of the script is a .md file with a table:
+
+| mode                                       | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                        | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int8                                        | 0%      | 100%    |                | 1.0%                           | 4.0x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 3.9%                           | 6.0x                  |
+| int4 + scale estimation                     | 40%     | 60%     |                | 4.1%                           | 4.8x                  |
+| int4 + scale estimation                     | 60%     | 40%     |                | 4.3%                           | 5.4x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 4.6%                           | 6.5x                  |
+| int4 + scale estimation                     | 80%     | 20%     |                | 5.7%                           | 6.1x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 8.0            | 5.8%                           | 7.1x                  |
+| int4 + scale estimation + gptq              | 100%    | 0%      |                | 6.1%                           | 7.1x                  |
+| int4 + scale estimation                     | 100%    | 0%      |                | 7.5%                           | 7.1x                  |
+| int4                                        | 100%    | 0%      |                | 11.9%                          | 7.1x                  |
+
+Also it plots a trade-off between accuracy and footprint by processing a CSV file in a specific format.
+The resulting images are employed for [the relevant section](/docs/usage/post_training_compression/weights_compression/Usage.md#accuracyfootprint-trade-off) in the Weight Compression documentation:
+
+![alt text](/docs/usage/post_training_compression/weights_compression/phi3_asym.png)
+
+### CSV-file format
+
+The input file should contain the following columns:
+
+- `mode` - The string indicating the compression method used for the model. The 'fp32' mode corresponds to the uncompressed version. To calculate the accuracy-footprint trade-off, the following words must be present in at least one row: "gptq", "int4", "fp32", "int8".
+- `%int4` - The ratio of int4 layers.
+- `%int8` - The ratio of int8 layers.
+- `lora rank` - The rank of the adapters used in Lora Correction algorithm.
+- `plot name` - Short names for annotation in the plot.
+- `model size, Gb` - The size of the corresponding model in Gb.
+- `wikitext, word perplexity` - Word perplexity on the Wikitext dataset, measured using rolling loglikelihoods in the [lm_eval tool](https://github.com/EleutherAI/lm-evaluation-harness).
+- `lambada-openai, acc` - Accuracy on the Lambada-OpenAI dataset, measured using [lm_eval tool](https://github.com/EleutherAI/lm-evaluation-harness).
+- `lambada-openai, perplexity` - Perplexity on the Lambada-OpenAI dataset, measured using the [lm_eval tool](https://github.com/EleutherAI/lm-evaluation-harness).
+- `WWB, similarity` - Similarity, measured using the [WWB tool](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/).
+
+### Example of script usage
+
+```shell
+python visualize_compression_results.py --input-file data/llama2_asym.csv --output-dir output_dir
+```
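The derived columns can be reproduced approximately from the raw columns listed above. The sketch below is an illustration rather than the actual implementation in `visualize_compression_results.py`; in particular, the per-metric sign handling (perplexity is lower-is-better, accuracy and similarity are higher-is-better) is an assumption, so the numbers may deviate slightly from the committed tables:

```python
import pandas as pd

# Raw accuracy metrics compared against the fp32 reference row,
# following the column names in the documented CSV format.
METRIC_COLUMNS = [
    "wikitext, word perplexity",
    "lambada-openai, acc",
    "lambada-openai, perplexity",
    "WWB, similarity",
]

df = pd.read_csv("data/phi3_asym.csv")  # path is illustrative
fp32 = df[df["mode"] == "fp32"].iloc[0]  # uncompressed reference row

# Compression rate: fp32 model size divided by the compressed model size.
df["compression rate"] = fp32["model size, Gb"] / df["model size, Gb"]

# Average relative error: mean relative deviation of each metric from its fp32 value.
relative_errors = (df[METRIC_COLUMNS] - fp32[METRIC_COLUMNS]).abs() / fp32[METRIC_COLUMNS].abs()
df["average relative error"] = relative_errors.mean(axis=1)

# to_markdown() relies on tabulate, which is already listed in tests/tools/requirements.txt.
print(df[["mode", "average relative error", "compression rate"]].to_markdown(index=False))
```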
