* [Install intel_extension_for_pytorch for CPU](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/)
* [Install intel_extension_for_pytorch for XPU](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/)
* [Use Docker Image with torch installed for HPU](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#bare-metal-fresh-os-single-click)

  **Note**: There is a version mapping between Intel Neural Compressor and the Gaudi Software Stack; please refer to this [table](./docs/source/3x/gaudi_version_map.md) and make sure to use a matched combination.

* [Install torch for other platform](https://pytorch.org/get-started/locally)
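As a concrete illustration of the CPU path above, a minimal sketch (the wheel index URL and the unpinned versions are assumptions; follow the linked guides for a matched combination):

```bash
# CPU-only PyTorch plus Intel Extension for PyTorch; versions left unpinned on purpose
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install intel_extension_for_pytorch
```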
```shell
# Install 2.X API + Framework extension API + PyTorch dependency
pip install neural-compressor[pt]
# Install 2.X API + Framework extension API + TensorFlow dependency
pip install neural-compressor[tf]
```

**Note**: Further installation methods can be found under the [Installation Guide](./docs/source/installation_guide.md). Check out our [FAQ](./docs/source/faq.md) for more details.
## Getting Started
After successfully installing these packages, try your first quantization program. **The following example code demonstrates FP8 Quantization**, which is supported by the Intel Gaudi2 AI Accelerator.

To try it on Intel Gaudi2, a docker image with the Gaudi Software Stack is recommended; please refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
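A minimal environment-setup sketch; the image tag below is an assumption, so pick the release that matches the version-mapping table above:

```bash
# Launch a Gaudi PyTorch container (image tag is an assumption; match it to your Gaudi Software Stack)
docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    --cap-add=sys_nice --net=host --ipc=host \
    vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
```

Once inside the container (with `neural-compressor[pt]` installed), the FP8 quantization example below can be run.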
```python
from neural_compressor.torch.quantization import (
    FP8Config,
    prepare,
    convert,
)
import torchvision.models as models

model = models.resnet18()
qconfig = FP8Config(fp8_config="E4M3")
model = prepare(model, qconfig)
# customer-defined calibration
calib_func(model)  # user-provided calibration function
model = convert(model)
```

See the [FP8 quantization doc](./docs/source/3x/PT_FP8Quant.md) for more details.

### Weight-Only Large Language Model Loading (LLMs)

**The following example code demonstrates weight-only large language model loading** on the Intel Gaudi2 AI Accelerator.

```python
import torch

from neural_compressor.torch.quantization import load

model = load(
    # model_name_or_path and the remaining load() arguments are omitted in this excerpt
    torch_dtype=torch.bfloat16,
)
```

**Note:** Intel Neural Compressor will convert the model format from auto-gptq to the HPU format on the first load and save hpu_model.safetensors to the local cache directory for the next load, so the first load may take a while.
## Optimum-habana LLM example
### Overview
[Optimum](https://huggingface.co/docs/optimum) is an extension of Transformers that provides a set of performance optimization tools to train and run models on targeted hardware with maximum efficiency.
[Optimum-habana](https://github.com/huggingface/optimum-habana) is the interface between the Transformers and Diffusers libraries and Intel Gaudi AI Accelerators (HPU). It provides higher performance based on modified modeling files and uses Intel Neural Compressor for FP8 quantization internally; see [running-with-fp8](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8).
### Installation
Refer to [optimum-habana, install-the-library-and-get-example-scripts](https://github.com/huggingface/optimum-habana?tab=readme-ov-file#install-the-library-and-get-example-scripts).
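A typical installation is a one-line pip command (a sketch; see the linked instructions for the exact version to pair with your Gaudi release):

```bash
# Install optimum-habana from PyPI; pin the version recommended for your Gaudi Software Stack
pip install "optimum[habana]"
```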
To run with FP8, refer to [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8).
Change "--model_name_or_path" to your model, e.g. "meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen2.5-7B-Instruct", or "mistralai/Mixtral-8x7B-Instruct-v0.1".
"--use_kv_cache" enables the FP8 KV cache.
### Profiling
Add "--profiling_warmup_steps 5 --profiling_steps 2 --profiling_record_shapes" at the end of the run_generation.py command line.
Refer to [torch.profiler.ProfilerActivity.HPU](https://github.com/huggingface/optimum-habana/blob/c9e1c23620618e2f260c92c46dfeb163545ec5ba/optimum/habana/utils.py#L305).
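For example (a hypothetical command; append the profiling flags to whatever run_generation.py invocation you already use):

```bash
python run_generation.py \
    --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
    --use_kv_cache \
    --profiling_warmup_steps 5 \
    --profiling_steps 2 \
    --profiling_record_shapes
```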
### FP8 Accuracy
"lm_eval.tasks", "lm_eval.evaluator", and "lm_eval" are installed from the above requirements_lm_eval.txt. The tasks can be configured; the default is ["hellaswag", "lambada_openai", "piqa", "winogrande"] ([more info](https://github.com/EleutherAI/lm-evaluation-harness/)).