
Commit e9c5b01

Merge branch 'master' into xin3he-patch-2
2 parents: 0978e15 + 3eb5529

14 files changed: +346, -379 lines

README.md

+6 -8
@@ -5,7 +5,7 @@ Intel® Neural Compressor
 <h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, and ONNX Runtime)</h3>

 [![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-3.1.1-green)](https://github.com/intel/neural-compressor/releases)
+[![version](https://img.shields.io/badge/release-3.3-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)
@@ -54,9 +54,9 @@ pip install neural-compressor[tf]
 After successfully installing these packages, try your first quantization program. **The following example code demonstrates FP8 quantization**, which is supported by the Intel Gaudi2 AI Accelerator.
 To try it on Intel Gaudi2, a Docker image with the Gaudi Software Stack is recommended; refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).

-Run a container with an interactive shell,
+Run a container with an interactive shell, [more info](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Docker_Installation.html#docker-installation)
 ```
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest
 ```
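The Python example introduced by "Run the example," below lies outside this hunk and is not shown. For orientation only, here is a minimal, hedged sketch of the FP8 post-training quantization flow with the `neural_compressor.torch` API; the toy model and dummy calibration data are illustrative placeholders rather than the README's actual code, and running it assumes a Gaudi/HPU software stack.

```python
# Hedged sketch (not the README's exact example): FP8 PTQ via neural_compressor.torch.
# Assumes the Intel Gaudi software stack (habana_frameworks) is installed.
import torch
from neural_compressor.torch.quantization import FP8Config, prepare, convert

# Toy stand-in for a real model; on Gaudi it would be moved to the "hpu" device.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())

qconfig = FP8Config(fp8_config="E4M3")  # E4M3 is the usual FP8 format choice
model = prepare(model, qconfig)         # insert measurement/observer hooks
with torch.no_grad():
    model(torch.randn(4, 16))           # dummy calibration pass
model = convert(model)                  # replace measured modules with FP8 ones
```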
 Run the example,
 ```python
@@ -173,12 +173,10 @@ model = load(

 ## Selected Publications/Events

+* arXiv: [Faster Inference of LLMs using FP8 on the Intel Gaudi](https://arxiv.org/abs/2503.09975) (Mar 2025)
+* PyTorch landscape: [PyTorch general optimizations](https://landscape.pytorch.org/) (Mar 2025)
+* Blog on SqueezeBits: [[Intel Gaudi] #4. FP8 Quantization](https://blog.squeezebits.com/intel-gaudi-4-fp8-quantization--40269) (Jan 2025)
 * EMNLP'2024: [Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://arxiv.org/abs/2309.05516) (Sep 2024)
-* Blog on Medium: [Quantization on Intel Gaudi Series AI Accelerators](https://medium.com/intel-analytics-software/intel-neural-compressor-v3-0-a-quantization-tool-across-intel-hardware-9856adee6f11) (Aug 2024)
-* Blog by Intel: [Neural Compressor: Boosting AI Model Efficiency](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Neural-Compressor-Boosting-AI-Model-Efficiency/post/1604740) (June 2024)
-* Blog by Intel: [Optimization of Intel AI Solutions for Alibaba Cloud’s Qwen2 Large Language Models](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-ai-solutions-accelerate-alibaba-qwen2-llms.html) (June 2024)
-* Blog by Intel: [Accelerate Meta* Llama 3 with Intel AI Solutions](https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-meta-llama3-with-intel-ai-solutions.html) (Apr 2024)
-* EMNLP'2023 (Under Review): [TEQ: Trainable Equivalent Transformation for Quantization of LLMs](https://openreview.net/forum?id=iaI8xEINAf&referrer=%5BAuthor%20Console%5D) (Sep 2023)
 * arXiv: [Efficient Post-training Quantization with FP8 Formats](https://arxiv.org/abs/2309.14592) (Sep 2023)
 * arXiv: [Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://arxiv.org/abs/2309.05516) (Sep 2023)

docs/source/3x/PT_FP8Quant.md

+50 -26
@@ -129,47 +129,71 @@ mistralai/Mistral-Nemo-Instruct-2407

 ### Running with FP8
 Refer to [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8).
-Change "--model_name_or_path" to be your model like
-"meta-llama/Llama-3.1-8B-Instruct",
-"Qwen/Qwen2.5-7B-Instruct", or
-"mistralai/Mixtral-8x7B-Instruct-v0.1" and so on.
-"--use_kv_cache" is to enable FP8 KV cache.
+Change "--model_name_or_path" to one of the models in the list above. "--use_kv_cache" controls whether the FP8 KV cache is enabled.

 ### Profiling
-Add "--profiling_warmup_steps 5 --profiling_steps 2 --profiling_record_shapes" as args in the end of commandline of run_generation.py.
+Append "--profiling_warmup_steps 5 --profiling_steps 2 --profiling_record_shapes" to the `run_generation.py` command line.
 Refer to [torch.profiler.ProfilerActivity.HPU](https://github.com/huggingface/optimum-habana/blob/c9e1c23620618e2f260c92c46dfeb163545ec5ba/optimum/habana/utils.py#L305).

 ### FP8 Accuracy
 "lm_eval.tasks", "lm_eval.evaluator", and "lm_eval" are installed from requirements_lm_eval.txt above. The evaluation tasks can be configured; the default set is ["hellaswag", "lambada_openai", "piqa", "winogrande"] ([more info](https://github.com/EleutherAI/lm-evaluation-harness/)).
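As a hedged illustration of how that default task set could be driven through the harness's Python API (assuming lm_eval >= 0.4; the checkpoint name below is only an illustrative choice, and the Gaudi example scripts normally handle this for you):

```python
# Sketch only: scoring the default task set with lm-evaluation-harness.
# Assumes lm_eval >= 0.4 is installed (e.g. via requirements_lm_eval.txt).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face model backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["hellaswag", "lambada_openai", "piqa", "winogrande"],
)
print(results["results"])  # per-task metrics, as summarized in the tables below
```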

-| `Llama-2-7b-hf`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `Llama-3.1-8B-Instruct`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.5691097390957977 | 0.5704043019318861 |
-| lambada_openai| 0.7360760721909567 | 0.7372404424607025 |
-| piqa | 0.7850924918389554 | 0.7818280739934712 |
-| winogrande | 0.6929755327545383 | 0.6929755327545383 |
+| lambada_openai| 0.7299 | 0.7359 |
+| hellaswag | 0.5892 | 0.5911 |
+| piqa | 0.7965 | 0.7998 |
+| winogrande | 0.7474 | 0.7372 |
+| mmlu | 0.6599 | 0.6829 |

-| `Qwen2.5-7B-Instruct`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `Phi-3-mini-4k-instruct`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.2539334793865764 | 0.2539334793865764 |
-| lambada_openai| 0.0 | 0.0 |
-| piqa | 0.5391730141458106 | 0.5391730141458106 |
-| winogrande | 0.4956590370955012 | 0.4956590370955012 |
+| lambada_openai| 0.6420 | 0.6552 |
+| hellaswag | 0.5866 | 0.5902 |
+| piqa | 0.8041 | 0.8014 |
+| winogrande | 0.7324 | 0.7348 |
+| mmlu | 0.7035 | 0.7055 |

-| `Llama-3.1-8B-Instruct`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `Mistral-7B-Instruct-v0.2`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.5934076877116112 | 0.5975901214897431 |
-| lambada_openai| 0.7230739375121289 | 0.7255967397632447 |
-| piqa | 0.7932535364526659 | 0.8030467899891186 |
-| winogrande | 0.7434885556432518 | 0.7371744277821626 |
+| lambada_openai| 0.7126 | 0.7165 |
+| hellaswag | 0.6556 | 0.6609 |
+| piqa | 0.8014 | 0.8025 |
+| winogrande | 0.7253 | 0.7388 |
+| mmlu | 0.5833 | 0.5919 |

+| `Mistral-Nemo-Instruct-2407`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
+|---------------|---------|--------|
+| lambada_openai| 0.7568 | 0.7596 |
+| hellaswag | 0.6273 | 0.6325 |
+| piqa | 0.8150 | 0.8085 |
+| winogrande | 0.7419 | 0.7482 |
+| mmlu | 0.6684 | 0.6840 |
+
+| `bigscience/bloom-7b1`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
+|---------------|---------|--------|
+| lambada_openai| 0.5599 | 0.5731 |
+| hellaswag | 0.4632 | 0.4639 |
+| piqa | 0.7301 | 0.7242 |
+| winogrande | 0.6314 | 0.6393 |
+| mmlu | 0.2563 | 0.2572 |
+
+| `Mixtral-8x7B-Instruct-v0.1`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
+|---------------|---------|--------|
+| lambada_openai| 0.7805 | 0.7778 |
+| hellaswag | 0.6733 | 0.6764 |
+| piqa | 0.8324 | 0.8351 |
+| winogrande | 0.7680 | 0.7672 |
+| mmlu | 0.7031 | 0.7026 |

-| `Mixtral-8x7B-Instruct-v0.1`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `EleutherAI/gpt-j-6b`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.25323640709022105 | 0.25323640709022105 |
-| lambada_openai| 0.0 | 0.0 |
-| piqa | 0.528835690968444 | 0.528835690968444 |
-| winogrande | 0.4956590370955012 | 0.4956590370955012 |
+| lambada_openai| 0.6769 | 0.6781 |
+| hellaswag | 0.4928 | 0.4958 |
+| piqa | 0.7557 | 0.7541 |
+| winogrande | 0.6409 | 0.6425 |
+| mmlu | 0.2524 | 0.2606 |
+> Notes: For the gpt-j model, if `--use_kv_cache` is set to enable KV cache quantization, `--reuse_cache` should also be set.

 ## VLLM example
 ### Overview

docs/source/faq.md

+16
@@ -32,3 +32,19 @@ torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed
 [AutoGPTQ/AutoGPTQ#196](https://github.com/AutoGPTQ/AutoGPTQ/issues/196).
 Try increasing `percdamp` (percent of the average Hessian diagonal to use for dampening),
 or increasing `nsamples` (the number of calibration samples).
+#### Issue 7:
+If you run GPTQ quantization with the transformers-like API on an XPU device, you may encounter the following error:
+```shell
+[ERROR][modeling_auto.py:128] index 133 is out of bounds for dimension 0 with size 128
+[ERROR][modeling_auto.py:129] Saved low bit model loading failed, please check your model.
+HINT:
+XPU device does not support `g_idx` for GPTQ quantization now. Please stay tuned.
+You can set desc_act=False.
+```
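A hedged sketch of that workaround with the transformers-like API; it assumes the API exposes a `GPTQConfig` accepting `desc_act` (mirroring the Hugging Face option), and the model name is only an illustrative placeholder:

```python
# Sketch of the hinted workaround: disable desc_act (g_idx) so the GPTQ model
# can be loaded on XPU. The GPTQConfig fields mirror the HF-style options and
# should be checked against your installed neural_compressor version.
from neural_compressor.transformers import AutoModelForCausalLM, GPTQConfig

quant_config = GPTQConfig(bits=4, group_size=128, desc_act=False)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustrative model
    quantization_config=quant_config,
)
```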
+#### Issue 8:
+UnicodeEncodeError: 'charmap' codec can't encode character '\u2191' in position 195: character maps to <undefined>
+**Solution:**
+```
+set PYTHONIOENCODING=UTF-8 # for windows
+export PYTHONIOENCODING=UTF-8 # for linux
+```

docs/source/publication_list.md

+4 -2
@@ -1,6 +1,8 @@
-Full Publications/Events (87)
+Full Publications/Events (89)
 ==========
-## 2025 (1)
+## 2025 (3)
+* arXiv: [Faster Inference of LLMs using FP8 on the Intel Gaudi](https://arxiv.org/abs/2503.09975) (Mar 2025)
+* PyTorch landscape: [PyTorch general optimizations](https://landscape.pytorch.org/) (Mar 2025)
 * Blog on SqueezeBits: [[Intel Gaudi] #4. FP8 Quantization](https://blog.squeezebits.com/intel-gaudi-4-fp8-quantization--40269) (Jan 2025)

 ## 2024 (7)

examples/.config/model_params_pytorch_3x.json

+8
@@ -143,6 +143,14 @@
     "main_script": "run_clm_no_trainer.py",
     "batch_size": 1
   },
+  "phi3_vlm_128k_autoround_int4":{
+    "model_src_dir": "multimodal-modeling/quantization/auto_round",
+    "dataset_location": "",
+    "input_model": "",
+    "main_script": "mllm.py",
+    "batch_size": 8,
+    "iters": 50
+  },
   "gpt_j_ipex":{
     "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/static_quant/ipex",
     "dataset_location": "",
