
Commit 0bc5d8c

Doc: Update readme.md (#2083)
Signed-off-by: fengding <feng1.ding@intel.com>
1 parent 9bddd52 commit 0bc5d8c

4 files changed (+205 −69 lines)

README.md

+19 −46
@@ -32,55 +32,33 @@ support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testi
* [2024/07] Performance optimizations and usability improvements on [client-side](./docs/source/3x/client_quant.md).

## Installation
+Choose the necessary framework dependencies to install based on your deployment environment.
### Install Framework
-#### Install torch for CPU
-```Shell
-pip install torch --index-url https://download.pytorch.org/whl/cpu
+* [Install intel_extension_for_pytorch for CPU](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/)
+* [Install intel_extension_for_pytorch for XPU](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/)
+* [Use Docker Image with torch installed for HPU](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#bare-metal-fresh-os-single-click)
+**Note**: There is a version mapping between Intel Neural Compressor and the Gaudi Software Stack; please refer to this [table](./docs/source/3x/gaudi_version_map.md) and make sure to use a matched combination.
+* [Install torch for other platforms](https://pytorch.org/get-started/locally)
+* [Install TensorFlow](https://www.tensorflow.org/install)
+
+### Install Neural Compressor from pypi
```
-#### Use Docker Image with torch installed for HPU
-https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#bare-metal-fresh-os-single-click
-
-> **Note**:
-> There is a version mapping between Intel Neural Compressor and Gaudi Software Stack, please refer to this [table](./docs/source/3x/gaudi_version_map.md) and make sure to use a matched combination.
-
-#### Install torch/intel_extension_for_pytorch for Intel GPU
-https://intel.github.io/intel-extension-for-pytorch/index.html#installation
-
-#### Install torch for other platform
-https://pytorch.org/get-started/locally
-
-#### Install tensorflow
-```Shell
-pip install tensorflow
-```
-
-### Install from pypi
-```Shell
# Install 2.X API + Framework extension API + PyTorch dependency
pip install neural-compressor[pt]
# Install 2.X API + Framework extension API + TensorFlow dependency
pip install neural-compressor[tf]
-```
-> **Note**:
-> Further installation methods can be found under [Installation Guide](./docs/source/installation_guide.md). check out our [FAQ](./docs/source/faq.md) for more details.
+```
+**Note**: Further installation methods can be found under the [Installation Guide](./docs/source/installation_guide.md). Check out our [FAQ](./docs/source/faq.md) for more details.

## Getting Started
+After successfully installing these packages, try your first quantization program. **The following example code demonstrates FP8 quantization**, which is supported by the Intel Gaudi2 AI Accelerator.
+To try it on Intel Gaudi2, a Docker image with the Gaudi Software Stack is recommended; please refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).

-Setting up the environment:
-```bash
-pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
+Run a container with an interactive shell,
```
-After successfully installing these packages, try your first quantization program.
-
-### [FP8 Quantization](./docs/source/3x/PT_FP8Quant.md)
-Following example code demonstrates FP8 Quantization, it is supported by Intel Gaudi2 AI Accelerator.
-
-To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
-```bash
-# Run a container with an interactive shell
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
```
-Run the example:
+Run the example,
```python
from neural_compressor.torch.quantization import (
    FP8Config,
@@ -102,12 +80,10 @@ model = convert(model)

output = model(torch.randn(1, 3, 224, 224).to("hpu")).to("cpu")
print(output.shape)
-```
-
-### Weight-Only Large Language Model Loading (LLMs)
-
-Following example code demonstrates weight-only large language model loading on Intel Gaudi2 AI Accelerator.
+```
+See the [FP8 quantization doc](./docs/source/3x/PT_FP8Quant.md) for more details.

+**The following example code demonstrates weight-only large language model loading** on the Intel Gaudi2 AI Accelerator.
```python
from neural_compressor.torch.quantization import load

@@ -119,10 +95,7 @@ model = load(
    torch_dtype=torch.bfloat16,
)
```
-
-**Note:**
-
-Intel Neural Compressor will convert the model format from auto-gptq to hpu format on the first load and save hpu_model.safetensors to the local cache directory for the next load. So it may take a while to load for the first time.
+**Note:** Intel Neural Compressor will convert the model format from auto-gptq to hpu format on the first load and save hpu_model.safetensors to the local cache directory for the next load, so the first load may take a while.

## Documentation

docs/source/3x/PT_FP8Quant.md

+186 −23
@@ -4,7 +4,8 @@ FP8 Quantization
1. [Introduction](#introduction)
2. [Supported Parameters](#supported-parameters)
3. [Get Start with FP8 Quantization](#get-start-with-fp8-quantization)
-4. [Examples](#examples)
+4. [Optimum-habana LLM example](#optimum-habana-llm-example)
+5. [VLLM example](#vllm-example)

## Introduction

@@ -75,30 +76,192 @@ Intel Neural Compressor provides general quantization APIs to leverage HPU FP8 c
</tbody></table>

## Get Start with FP8 Quantization
+[Demo Usage](https://github.com/intel/neural-compressor?tab=readme-ov-file#getting-started)
+[Computer vision example](../../../examples/3.x_api/pytorch/cv/fp8_quant)

-### Demo Usage
-
-```python
-from neural_compressor.torch.quantization import (
-    FP8Config,
-    prepare,
-    convert,
-)
-import torchvision.models as models
-
-model = models.resnet18()
-qconfig = FP8Config(fp8_config="E4M3")
-model = prepare(model, qconfig)
-# customer defined calibration
-calib_func(model)
-model = convert(model)
+## Optimum-habana LLM example
+### Overview
+[Optimum](https://huggingface.co/docs/optimum) is an extension of Transformers that provides a set of performance optimization tools to train and run models on targeted hardware with maximum efficiency.
+[Optimum-habana](https://github.com/huggingface/optimum-habana) is the interface between the Transformers and Diffusers libraries and Intel Gaudi AI Accelerators (HPU). It provides higher performance based on modified modeling files and utilizes Intel Neural Compressor for FP8 quantization internally; see [running-with-fp8](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8).
+![](./imgs/optimum-habana.png)
+### Installation
+Refer to [optimum-habana, install-the-library-and-get-example-scripts](https://github.com/huggingface/optimum-habana?tab=readme-ov-file#install-the-library-and-get-example-scripts)
+Optionally, install from source:
+```
+$ git clone https://github.com/huggingface/optimum-habana
+$ cd optimum-habana && git checkout v1.14.0  # change the version as needed
+$ pip install -e .
+$ pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0
+$ cd examples/text-generation
+$ pip install -r requirements.txt
+$ pip install -r requirements_lm_eval.txt  # optional
+```
+### Check neural_compressor code
+> optimum-habana/examples/text-generation/utils.py
+>> initialize_model() -> setup_model() -> setup_quantization() -> FP8Config/prepare()/convert()
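
For orientation, the call chain above bottoms out in the same prepare/convert API shown in the Getting Started demo. A minimal sketch of that flow, assuming a Hugging Face causal LM on HPU and a hypothetical list of calibration prompts (the model name and prompts are placeholders, not taken from optimum-habana):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor.torch.quantization import FP8Config, prepare, convert

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("hpu")

# Same config/prepare step that setup_quantization() performs internally.
qconfig = FP8Config(fp8_config="E4M3")
model = prepare(model, qconfig)

# User-defined calibration: run a few representative prompts through the prepared model.
for prompt in ["Hello, my name is", "The capital of France is"]:
    inputs = tokenizer(prompt, return_tensors="pt").to("hpu")
    with torch.no_grad():
        model(**inputs)

model = convert(model)  # patch modules to FP8 using the collected statistics
```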
+
+### FP8 KV cache
+Introduction: [kv-cache-quantization in huggingface transformers](https://huggingface.co/blog/kv-cache-quantization)
+
+BF16 KVCache Code -> [Modeling_all_models.py -> KVCache()](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/modeling_all_models.py)
+
+FP8 KVCache code trace with Neural Compressor support, taking Llama models as an example:
+> optimum-habana/optimum/habana/transformers/models/llama/modeling_llama.py
+>> GaudiLlamaForCausalLM() -> self.model()
+>>> GaudiLlamaModel() -> forward() -> decoder_layer() -> GaudiLlamaDecoderLayer() forward() -> pre_attn() -> pre_attn_forward() -> self.k_cache.update
+
+> neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py
+>> PatchedKVCache() -> update()
+>> PatchedModuleFusedSDPA()
+
+Models that support FP8 KV cache:
+```
+microsoft/Phi-3-mini-4k-instruct
+bigcode/starcoder2-3b
+Qwen/Qwen2.5-7B-Instruct
+meta-llama/Llama-3.2-3B-Instruct
+tiiuae/falcon-7b-instruct
+mistralai/Mixtral-8x7B-Instruct-v0.1
+EleutherAI/gpt-j-6b
+mistralai/Mistral-Nemo-Instruct-2407
+...
+```
+
+### Running with FP8
+Refer to [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8).
+Change "--model_name_or_path" to your model, e.g.
+"meta-llama/Llama-3.1-8B-Instruct",
+"Qwen/Qwen2.5-7B-Instruct", or
+"mistralai/Mixtral-8x7B-Instruct-v0.1", and so on.
+"--use_kv_cache" enables the FP8 KV cache.
+
+### Profiling
+Append "--profiling_warmup_steps 5 --profiling_steps 2 --profiling_record_shapes" to the end of the run_generation.py command line.
+Refer to [torch.profiler.ProfilerActivity.HPU](https://github.com/huggingface/optimum-habana/blob/c9e1c23620618e2f260c92c46dfeb163545ec5ba/optimum/habana/utils.py#L305).
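
The linked utility drives the standard torch.profiler with the Gaudi activity. As a rough, self-contained sketch of that mechanism (a toy layer stands in for the generation step being profiled; ProfilerActivity.HPU exists only on Gaudi PyTorch builds, so it is added conditionally here):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# ProfilerActivity.HPU is only present on Gaudi PyTorch builds; fall back to CPU elsewhere.
activities = [ProfilerActivity.CPU]
if hasattr(ProfilerActivity, "HPU"):
    activities.append(ProfilerActivity.HPU)

layer = torch.nn.Linear(512, 512)  # toy stand-in for the model step under test
x = torch.randn(8, 512)

with profile(activities=activities, record_shapes=True) as prof:
    layer(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```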
+
+### FP8 Accuracy
+"lm_eval.tasks", "lm_eval.evaluator", and "lm_eval" are installed from requirements_lm_eval.txt above. The task list is configurable; the default is ["hellaswag", "lambada_openai", "piqa", "winogrande"] ([more info](https://github.com/EleutherAI/lm-evaluation-harness/)).
+
+| `Llama-2-7b-hf`| FP8 w/ FP8 KVCache| BF16 w/ BF16 KVCache|
+|---------------|---------|--------|
+| hellaswag | 0.5691097390957977 | 0.5704043019318861 |
+| lambada_openai| 0.7360760721909567 | 0.7372404424607025 |
+| piqa | 0.7850924918389554 | 0.7818280739934712 |
+| winogrande | 0.6929755327545383 | 0.6929755327545383 |
+
+| `Qwen2.5-7B-Instruct`| FP8 w/ FP8 KVCache| BF16 w/ BF16 KVCache|
+|---------------|---------|--------|
+| hellaswag | 0.2539334793865764 | 0.2539334793865764 |
+| lambada_openai| 0.0 | 0.0 |
+| piqa | 0.5391730141458106 | 0.5391730141458106 |
+| winogrande | 0.4956590370955012 | 0.4956590370955012 |
+
+| `Llama-3.1-8B-Instruct`| FP8 w/ FP8 KVCache| BF16 w/ BF16 KVCache|
+|---------------|---------|--------|
+| hellaswag | 0.5934076877116112 | 0.5975901214897431 |
+| lambada_openai| 0.7230739375121289 | 0.7255967397632447 |
+| piqa | 0.7932535364526659 | 0.8030467899891186 |
+| winogrande | 0.7434885556432518 | 0.7371744277821626 |
+
+
+| `Mixtral-8x7B-Instruct-v0.1`| FP8 w/ FP8 KVCache| BF16 w/ BF16 KVCache|
+|---------------|---------|--------|
+| hellaswag | 0.25323640709022105 | 0.25323640709022105 |
+| lambada_openai| 0.0 | 0.0 |
+| piqa | 0.528835690968444 | 0.528835690968444 |
+| winogrande | 0.4956590370955012 | 0.4956590370955012 |
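
The numbers above were collected through the optimum-habana run_lm_eval.py flow. Purely as an illustration of the task set, a generic invocation with the upstream EleutherAI lm-eval 0.4 Python API (not the Gaudi-specific script) might look like the following; the model name is a placeholder:

```python
import lm_eval

# Evaluate the default task set on a Hugging Face model via the upstream harness (0.4.x API).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16",
    tasks=["hellaswag", "lambada_openai", "piqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```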
+
+## VLLM example
+### Overview
+![](./imgs/vllm_gaudi.png)
+
+### Installation
+Refer to [Habana vllm-fork](https://github.com/HabanaAI/vllm-fork) to install.
+Optionally, install `vllm-hpu-extension`, `neural_compressor`, and `vllm` from source:
+```
+$ git clone https://github.com/HabanaAI/vllm-fork.git
+$ cd vllm-fork
+$ pip install -r requirements-hpu.txt
+$ python setup.py develop --user
+
+# Check
+$ pip list | grep vllm
+vllm 0.6.3.dev1122+g2f43ebf5.d20241121.gaudi118 /home/fengding/vllm-fork
+vllm-hpu-extension 0.1
+
+# Validation
+$ VLLM_SKIP_WARMUP=true python3 examples/offline_inference.py
+......
+Prompt: 'Hello, my name is', Generated text: ' Kelly and I have a job to do.\nI need someone to come over'
+Prompt: 'The president of the United States is', Generated text: ' facing a sharp criticism of his handling of the coronavirus pandemic, including'
+Prompt: 'The capital of France is', Generated text: ' the capital of the Socialist Party of France (SPF), with its state-'
+Prompt: 'The future of AI is', Generated text: " in what's coming, not what's coming.\nI don't know what"
+```
+
+### Run FP8 calibration
+Refer to [vllm-hpu-extension->calibration](https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration).
+```
+$ git clone https://github.com/HabanaAI/vllm-hpu-extension
+$ cd vllm-hpu-extension/calibration
+
+# For Llama-3.1-8B-Instruct
+$ ./calibrate_model.sh -m meta-llama/Llama-3.1-8B-Instruct -d /home/fengding/processed-data.pkl -o ./output_llama3.1.8b.Instruct -b 128 -t 1 -l 128
+# Generates scale factors in ./output_llama3.1.8b.Instruct
+```
+
+### Start vllm server
+```
+$ cd vllm-fork/
+
+$ PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
+PT_HPU_WEIGHT_SHARING=0 \
+VLLM_CONTIGUOUS_PA=true \
+VLLM_SKIP_WARMUP=true \
+QUANT_CONFIG=output_llama3.1.8b.Instruct/maxabs_quant_g2.json \
+python3 -m vllm.entrypoints.openai.api_server \
+--model meta-llama/Llama-3.1-8B-Instruct \
+--port 8080 \
+--gpu-memory-utilization 0.9 \
+--tensor-parallel-size 1 \
+--disable-log-requests \
+--block-size 128 \
+--quantization inc \
+--kv-cache-dtype fp8_inc \
+--device hpu \
+--weights-load-device cpu \
+--dtype bfloat16 \
+--num_scheduler_steps 16 2>&1 > vllm_serving.log &
+```
+Refer to [vllm-fork->README_GAUDI.md](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md) for more details.
+
+### Start client to test
+```
+$ curl --noproxy "*" http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "San Francisco is a", "max_tokens": 100}'
+```
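
Since the server exposes an OpenAI-compatible API, the same request can also be sent from Python. A small sketch assuming the `openai` 1.x client package is installed, reusing the port and model name from the server command above:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=100,
)
print(completion.choices[0].text)
```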
+
+### Run benchmark
+```
+python benchmarks/benchmark_serving.py \
+--backend vllm \
+--model meta-llama/Llama-3.1-8B-Instruct \
+--dataset-name sonnet \
+--dataset-path benchmarks/sonnet.txt \
+--request-rate 128 \
+--num-prompts 128 \
+--port 8080 \
+--sonnet-input-len 128 \
+--sonnet-output-len 128 \
+--sonnet-prefix-len 100
```

-## Examples
+### FP8 KV cache
+Code trace
+> vllm-fork/vllm/attention/backends/hpu_attn.py
+>> from vllm_hpu_extension.utils import Matmul, Softmax, VLLMKVCache
+>> HPUAttentionImpl() -> self.k_cache() / self.v_cache()

-| Task | Example |
-|----------------------|---------|
-| Computer Vision (CV) | [Link](../../../examples/3.x_api/pytorch/cv/fp8_quant/) |
-| Large Language Model (LLM) | [Link](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8) |
+> neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py
+>> PatchedVLLMKVCache()

-> Note: For LLM, Optimum-habana provides higher performance based on modified modeling files, so here the Link of LLM goes to Optimum-habana, which utilize Intel Neural Compressor for FP8 quantization internally.
+> neural_compressor/torch/algorithms/fp8_quant/common.py
+>> "VLLMKVCache": ModuleInfo("kv_cache", PatchedVLLMKVCache)
docs/source/3x/imgs/optimum-habana.png

11.7 KB (binary image)

docs/source/3x/imgs/vllm_gaudi.png

219 KB (binary image)
