
Commit 73eb8b3

Visualisation of weight compression results (#3009)
### Changes

The tables and images have been added to illustrate the trade-off between accuracy and footprint for the INT4_ASYM mode. A script has been created to automate the process of generating this visualization from a CSV file containing all the necessary raw data. The script calculates the compression rate and the average relative error for a given model size and metrics. It should reduce the likelihood of errors and simplify the maintenance of results.

### Reason for changes

INT4_ASYM is a more accurate and preferable mode for weight compression. The previous results were obtained using the INT4_SYM mode.

### Related tickets

n/a

### Tests

tests/tools/test_compression_visualization.py
1 parent 17f799e commit 73eb8b3

File tree

9 files changed: +318 -222 lines changed


docs/usage/post_training_compression/weights_compression/Usage.md

+36 -222
@@ -569,14 +569,13 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precisio

 #### Accuracy/Footprint trade-off

-Below are the tables showing the accuracy/footprint trade-off for `Qwen/Qwen2-7B` and
+Below are the tables showing the accuracy/footprint trade-off for `meta-llama/Llama-2-7b-chat-hf` and
 `microsoft/Phi-3-mini-4k-instruct` compressed with different options.

 Compression ratio is defined as the ratio between the size of fp32 model and size of the compressed one.
-Accuracy metrics are measured on 4 tasks [lambada openai](https://huggingface.co/datasets/EleutherAI/lambada_openai), [wikitext](https://arxiv.org/pdf/1609.07843.pdf),
-[winogrande](https://arxiv.org/abs/1907.10641), [WWB](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/who_what_benchmark/whowhatbench).
+Accuracy metrics are measured on 3 tasks [lambada openai](https://huggingface.co/datasets/EleutherAI/lambada_openai), [wikitext](https://arxiv.org/pdf/1609.07843.pdf), [WWB](https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/who_what_benchmark).
 The `average relative error` in the tables below is the mean of relative errors for each of four tasks with respect to
-the metric value for fp32 model. All int4 models are compressed group-wise with `group_size=128` and `mode=CompressionMode.INT4_SYM` and
+the metric value for fp32 model. All int4 models are compressed group-wise with `group_size=64` and `mode=CompressionMode.INT4_ASYM` and
 with calibration dataset based on 128 samples from `wikitext-2-v1`. Int8 model is compressed with `mode=CompressionMode.INT8_ASYM`.
 The following advanced parameters were used for AWQ, Scale Estimation and Lora Correction algorithms:

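For reference, the two quantities reported in the tables of this section can be written as follows, assuming each compressed metric $m_i$ is compared with its fp32 counterpart and $N$ is the number of task metrics (the exact sign convention for higher-is-better metrics such as accuracy and similarity may differ in the reported numbers):

$$
\text{compression rate} = \frac{\text{size}_{\mathrm{fp32}}}{\text{size}_{\mathrm{compressed}}},
\qquad
\text{average relative error} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lvert m_i - m_i^{\mathrm{fp32}} \rvert}{\lvert m_i^{\mathrm{fp32}} \rvert}
$$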
@@ -590,229 +589,44 @@ AdvancedCompressionParameters(

 The tables clearly shows the followings:

-- More layers in 8 bit does improve accuracy, but it increases the footprint a lot.
-- Scale Estimation, AWQ, GPTQ do improve accuracy of the baseline int4 model without footprint increase.
-- Lora correction algorithm improves the accuracy of int4 models further with a footprint much less compared to mixed-precision models with the same or worse accuracy.
+- More layers in 8 bit does improve accuracy, but it also increases the footprint significantly.
+- Scale Estimation, AWQ, GPTQ improve the accuracy of the baseline int4 model without increasing the footprint.
+- The Lora Correction algorithm further improves the accuracy of int4 models with a much smaller footprint compared to mixed-precision models that have the same or worse accuracy.

-Accuracy/footprint trade-off for `Qwen/Qwen2-7B`:
+Accuracy/footprint trade-off for `meta-llama/Llama-2-7b-chat-hf`:

-<div class="tg-wrap"><table><thead>
-<tr><th>Mode </th><th>%int4</th><th>%int8</th><th>lora<br>rank</th><th>average<br>relative<br>error</th><th>compression<br>rate</th></tr></thead>
-<tbody>
-<tr><td>fp32</td><td>0%</td><td>0%</td><td></td><td>0.0%</td><td>1.0x</td></tr>
-<tr><td>int8</td><td>0%</td><td>100%</td><td></td><td>7.9%</td><td>3.9x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>256</td><td>16.5%</td><td>5.8x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>40%</td><td>60%</td><td></td><td>17.1%</td><td>4.7x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>60%</td><td>40%</td><td></td><td>17.1%</td><td>5.2x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>32</td><td>17.4%</td><td>6.5x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>8</td><td>17.5%</td><td>6.6x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>80%</td><td>20%</td><td></td><td>17.5%</td><td>5.8x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>16</td><td>18.0%</td><td>6.6x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation</td><td>100%</td><td>0%</td><td></td><td>18.4%</td><td>6.7x</td></tr>
-<tr><td>int4 + awq + scale&nbsp;estimation + gptq</td><td>100%</td><td>0%</td><td></td><td>20.2%</td><td>6.7x</td></tr>
-<tr><td>int4</td><td>100%</td><td>0%</td><td></td><td>21.4%</td><td>6.7x</td></tr>
-</tbody></table></div>
+| mode                                             | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                              | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int4 + awq + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 2.5%                           | 6.1x                  |
+| int4 + awq + scale estimation                     | 40%     | 60%     |                | 2.5%                           | 4.8x                  |
+| int4 + awq + scale estimation                     | 60%     | 40%     |                | 2.7%                           | 5.4x                  |
+| int4 + awq + scale estimation                     | 80%     | 20%     |                | 3.5%                           | 6.2x                  |
+| int4 + awq + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 3.6%                           | 6.6x                  |
+| int4 + awq + scale estimation + lora correction   | 100%    | 0%      | 32.0           | 3.9%                           | 7.0x                  |
+| int4 + awq + scale estimation + gptq              | 100%    | 0%      |                | 4.1%                           | 7.2x                  |
+| int4 + awq + scale estimation                     | 100%    | 0%      |                | 5.3%                           | 7.2x                  |
+| int4                                              | 100%    | 0%      |                | 8.5%                           | 7.2x                  |
+
+![alt text](llama2_asym.png)

 Accuracy/footprint trade-off for `microsoft/Phi-3-mini-4k-instruct`:

-<div class="tg-wrap"><table><thead>
-<tr><th>Mode </th><th>%int4</th><th>%int8</th><th>lora<br>rank</th><th>average<br>relative<br>error</th><th>compression<br>rate</th></tr></thead>
-<tbody>
-<tr><td>fp32</td><td>0%</td><td>0%</td><td></td><td>0.0%</td><td>1.0x</td></tr>
-<tr><td>int8</td><td>0%</td><td>100%</td><td></td><td>7.3%</td><td>4.0x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>40%</td><td>60%</td><td></td><td>16.9%</td><td>4.9x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>60%</td><td>40%</td><td></td><td>18.4%</td><td>5.5x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>256</td><td>18.7%</td><td>6.2x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>16</td><td>20.5%</td><td>7.3x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>32</td><td>20.6%</td><td>7.2x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>80%</td><td>20%</td><td></td><td>21.3%</td><td>6.3x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + gptq</td><td>100%</td><td>0%</td><td></td><td>21.7%</td><td>7.4x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation + lora&nbsp;correction</td><td>100%</td><td>0%</td><td>8</td><td>22.1%</td><td>7.3x</td></tr>
-<tr><td>int4 + scale&nbsp;estimation</td><td>100%</td><td>0%</td><td></td><td>24.5%</td><td>7.4x</td></tr>
-<tr><td>int4</td><td>100%</td><td>0%</td><td></td><td>25.3%</td><td>7.4x</td></tr>
-</tbody></table></div>
+| mode                                       | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                        | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int8                                        | 0%      | 100%    |                | 1.0%                           | 4.0x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 3.9%                           | 6.0x                  |
+| int4 + scale estimation                     | 40%     | 60%     |                | 4.1%                           | 4.8x                  |
+| int4 + scale estimation                     | 60%     | 40%     |                | 4.3%                           | 5.4x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 4.6%                           | 6.5x                  |
+| int4 + scale estimation                     | 80%     | 20%     |                | 5.7%                           | 6.1x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 8.0            | 5.8%                           | 7.1x                  |
+| int4 + scale estimation + gptq              | 100%    | 0%      |                | 6.1%                           | 7.1x                  |
+| int4 + scale estimation                     | 100%    | 0%      |                | 7.5%                           | 7.1x                  |
+| int4                                        | 100%    | 0%      |                | 11.9%                          | 7.1x                  |
+
+![alt text](phi3_asym.png)

 ### Limitations


tests/tools/data/phi3_asym.csv

+12
@@ -0,0 +1,12 @@
+"model, int4_asym, gs64",mode,%int4,%int8,lora rank,plot name,"model size, Gb",compression rate,"wikitext, word perplexity","lambada-openai, acc","lambada-openai, perplexity","WWB, similarity",average relative error,"compression time, min"
+Phi-3-mini-4k-instruct,fp32,0.0,0.0,,,14.235,1.0,9.48394027691655,0.654764215020377,5.09378699019839,1.0,0.0,0.0
+Phi-3-mini-4k-instruct,int8,0.0,1.0,,,3.562,3.9963503649635,9.499335040825285,0.6549582767320008,5.052950754896191,0.9527044384567825,0.01045364376495017,0.73
+Phi-3-mini-4k-instruct,int4 + scale estimation,0.4,0.6,,40% int4,2.953,4.820521503555706,9.71854910518393,0.650300795653018,5.30571206241349,0.9110616776678298,0.04083810560342551,7.66
+Phi-3-mini-4k-instruct,int4 + scale estimation,0.6,0.4,,60% int4,2.646,5.379818594104308,9.85636814588443,0.644673006015913,5.291942925143432,0.9213788685975252,0.04336565988152735,11.19
+Phi-3-mini-4k-instruct,int4 + scale estimation + lora correction,1.0,0.0,256.0,rank 256,2.382,5.976070528967254,9.988029919971328,0.6555404618668736,5.233390277183369,0.9246453907754686,0.03899610491099074,60.22
+Phi-3-mini-4k-instruct,int4 + scale estimation,0.8,0.2,,80% int4,2.324,6.125215146299483,10.02766689136818,0.6456433145740346,5.503355706799129,0.9220877753363715,0.05771907718987122,15.02
+Phi-3-mini-4k-instruct,int4 + scale estimation + lora correction,1.0,0.0,128.0,rank 128,2.194,6.488149498632635,10.06818844548755,0.6545701533087522,5.251984488969776,0.9090713858604431,0.04628722290049148,59.02
+Phi-3-mini-4k-instruct,int4 + scale estimation + gptq,1.0,0.0,,gptq,2.004,7.103293413173652,10.16727993333731,0.6444789443042888,5.444539130651724,0.9119987841005679,0.06147882553656911,137.77
+Phi-3-mini-4k-instruct,int4 + scale estimation + lora correction,1.0,0.0,8.0,rank 8,2.018,7.054013875123886,10.20127160713859,0.6497186105181447,5.441740470297863,0.9188693364461263,0.05851965015464443,43.83
+Phi-3-mini-4k-instruct,int4 + scale estimation,1.0,0.0,,100% int4,2.004,7.103293413173652,10.36438870990786,0.6413739569183,5.573979676833424,0.9068410683561254,0.07550933307474214,17.3
+Phi-3-mini-4k-instruct,int4,1.0,0.0,,data-free,2.004,7.103293413173652,10.6753930252974,0.622355909179119,6.088275680704702,0.8961785568131341,0.1188976835750515,2.71
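The `compression rate` column in this file follows directly from the `model size, Gb` column. As a quick check against the fp32 size of 14.235 Gb, the int8 and data-free int4 rows above give

$$
\frac{14.235}{3.562} \approx 3.996 \approx 4.0\times,
\qquad
\frac{14.235}{2.004} \approx 7.103 \approx 7.1\times,
$$

matching the 4.0x and 7.1x entries in the generated table below.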

tests/tools/data/phi3_asym.md

+13
@@ -0,0 +1,13 @@
+| mode                                       | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                        | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int8                                        | 0%      | 100%    |                | 1.0%                           | 4.0x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 3.9%                           | 6.0x                  |
+| int4 + scale estimation                     | 40%     | 60%     |                | 4.1%                           | 4.8x                  |
+| int4 + scale estimation                     | 60%     | 40%     |                | 4.3%                           | 5.4x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 4.6%                           | 6.5x                  |
+| int4 + scale estimation                     | 80%     | 20%     |                | 5.7%                           | 6.1x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 8.0            | 5.8%                           | 7.1x                  |
+| int4 + scale estimation + gptq              | 100%    | 0%      |                | 6.1%                           | 7.1x                  |
+| int4 + scale estimation                     | 100%    | 0%      |                | 7.5%                           | 7.1x                  |
+| int4                                        | 100%    | 0%      |                | 11.9%                          | 7.1x                  |

tests/tools/requirements.txt

+1
@@ -1,4 +1,5 @@
 matplotlib
 psutil
 pytest
+pandas
 tabulate>=0.9.0
tests/tools/test_compression_visualization.py

+25
@@ -0,0 +1,25 @@
+# Copyright (c) 2024 Intel Corporation
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from tests.cross_fw.shared.paths import TEST_ROOT
+from tools.visualize_compression_results import visualize
+
+
+def test_visualization_of_compression_results(tmp_path):
+    in_file = TEST_ROOT / "tools" / "data" / "phi3_asym.csv"
+    ref_md_file = TEST_ROOT / "tools" / "data" / "phi3_asym.md"
+
+    visualize(in_file, tmp_path)
+
+    md_file = tmp_path / (in_file.stem + ".md")
+    assert md_file.exists()
+    assert md_file.with_suffix(".png").exists()
+    assert ref_md_file.read_text()[:-1] == md_file.read_text()  # ref file ends with a newline character by code style
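The test above exercises the script's Python entry point directly, which can also be useful outside of pytest. A minimal sketch of the same call follows; the paths are illustrative, and only the `visualize(input_file, output_dir)` signature used by the test is assumed:

```python
from pathlib import Path

from tools.visualize_compression_results import visualize

# Any CSV following the documented format works here; this one is added by the commit.
csv_file = Path("tests/tools/data/phi3_asym.csv")
output_dir = Path("visualization_output")
output_dir.mkdir(exist_ok=True)

# Writes <stem>.md (the markdown table) and <stem>.png (the trade-off plot) into output_dir,
# mirroring the files the test checks for.
visualize(csv_file, output_dir)
```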

tools/README.md

+45
@@ -108,3 +108,48 @@ def allocate_memory():

 max_memory_usage: float = mmc.memory_data[MemoryType.SYSTEM]
 ```
+
+## Visualization of Weight Compression results
+
+The [visualize_compression_results.py](visualize_compression_results.py) script is a useful tool for visualizing the results of weight compression.
+The result of the script is a .md file with a table:
+
+| mode                                       | %int4   | %int8   | lora<br>rank   | average<br>relative<br>error   | compression<br>rate   |
+|:-------------------------------------------|:--------|:--------|:---------------|:-------------------------------|:----------------------|
+| fp32                                        | 0%      | 0%      |                | 0.0%                           | 1.0x                  |
+| int8                                        | 0%      | 100%    |                | 1.0%                           | 4.0x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 256.0          | 3.9%                           | 6.0x                  |
+| int4 + scale estimation                     | 40%     | 60%     |                | 4.1%                           | 4.8x                  |
+| int4 + scale estimation                     | 60%     | 40%     |                | 4.3%                           | 5.4x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 128.0          | 4.6%                           | 6.5x                  |
+| int4 + scale estimation                     | 80%     | 20%     |                | 5.7%                           | 6.1x                  |
+| int4 + scale estimation + lora correction   | 100%    | 0%      | 8.0            | 5.8%                           | 7.1x                  |
+| int4 + scale estimation + gptq              | 100%    | 0%      |                | 6.1%                           | 7.1x                  |
+| int4 + scale estimation                     | 100%    | 0%      |                | 7.5%                           | 7.1x                  |
+| int4                                        | 100%    | 0%      |                | 11.9%                          | 7.1x                  |
+
+Also it plots a trade-off between accuracy and footprint by processing a CSV file in a specific format.
+The resulting images are employed for [the relevant section](/docs/usage/post_training_compression/weights_compression/Usage.md#accuracyfootprint-trade-off) in the Weight Compression documentation:
+
+![alt text](/docs/usage/post_training_compression/weights_compression/phi3_asym.png)
+
+### CSV-file format
+
+The input file should contain the following columns:
+
+- `mode` - The string indicating the compression method used for the model. The 'fp32' mode corresponds to the uncompressed version. To calculate the accuracy-footprint trade-off, the following words must be present in at least one row: "gptq", "int4", "fp32", "int8".
+- `%int4` - The ratio of int4 layers.
+- `%int8` - The ratio of int8 layers.
+- `lora rank` - The rank of the adapters used in Lora Correction algorithm.
+- `plot name` - Short names for annotation in the plot.
+- `model size, Gb` - The size of the corresponding model in Gb.
+- `wikitext, word perplexity` - Word perplexity on the Wikitext dataset, measured using rolling loglikelihoods in the [lm_eval tool](https://github.com/EleutherAI/lm-evaluation-harness).
+- `lambada-openai, acc` - Accuracy on the Lambada-OpenAI dataset, measured using [lm_eval tool](https://github.com/EleutherAI/lm-evaluation-harness).
+- `lambada-openai, perplexity` - Perplexity on the Lambada-OpenAI dataset, measured using the [lm_eval tool](https://github.com/EleutherAI/lm-evaluation-harness).
+- `WWB, similarity` - Similarity, measured using the [WWB tool](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python/).
+
+### Example of script usage
+
+```shell
+python visualize_compression_results.py --input-file data/llama2_asym.csv --output-dir output_dir
+```
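The derived columns can be reproduced approximately from the raw columns listed above. The sketch below is an illustration rather than the actual implementation in `visualize_compression_results.py`; in particular, the per-metric sign handling (perplexity is lower-is-better, accuracy and similarity are higher-is-better) is an assumption, so the numbers may deviate slightly from the committed tables:

```python
import pandas as pd

# Raw accuracy metrics compared against the fp32 reference row,
# following the column names in the documented CSV format.
METRIC_COLUMNS = [
    "wikitext, word perplexity",
    "lambada-openai, acc",
    "lambada-openai, perplexity",
    "WWB, similarity",
]

df = pd.read_csv("data/phi3_asym.csv")  # path is illustrative
fp32 = df[df["mode"] == "fp32"].iloc[0]  # uncompressed reference row

# Compression rate: fp32 model size divided by the compressed model size.
df["compression rate"] = fp32["model size, Gb"] / df["model size, Gb"]

# Average relative error: mean relative deviation of each metric from its fp32 value.
relative_errors = (df[METRIC_COLUMNS] - fp32[METRIC_COLUMNS]).abs() / fp32[METRIC_COLUMNS].abs()
df["average relative error"] = relative_errors.mean(axis=1)

# to_markdown() relies on tabulate, which is already listed in tests/tools/requirements.txt.
print(df[["mode", "average relative error", "compression rate"]].to_markdown(index=False))
```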
