
Commit 59a1f81

merge from main branch
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>

2 parents: de190fd + 8f7d016

53 files changed: +3276 −941 lines changed. Only part of this large commit's diff is shown below; the remaining files are hidden by default.

.github/workflows/test_openvino.yml

+2 −8

@@ -17,7 +17,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.8, 3.9]
+        python-version: [3.8, 3.11]
         os: [ubuntu-latest]
 
     runs-on: ${{ matrix.os }}
@@ -32,13 +32,7 @@ jobs:
           python -m pip install --upgrade pip
           # install PyTorch CPU version to avoid installing CUDA packages on GitHub runner without GPU
           pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-          pip install .[openvino,nncf,tests,diffusers]
+          pip install .[openvino,openvino-tokenizers,nncf,tests,diffusers]
       - name: Test with Pytest
         run: |
           pytest tests/openvino/ --ignore test_modeling_basic
-      - name: Test openvino-nightly import
-        run: |
-          pip uninstall -y openvino
-          pip install openvino-nightly
-          python -c "from optimum.intel import OVModelForCausalLM; OVModelForCausalLM.from_pretrained('hf-internal-testing/tiny-random-gpt2', export=True, compile=False)"

README.md

+6 −3

@@ -6,6 +6,8 @@
 
 🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.
 
+[Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) is an open-source library which provides optimizations for both eager mode and graph mode, however, compared to eager mode, graph mode in PyTorch* normally yields better performance from optimization techniques, such as operation fusion.
+
 Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.
 
 [OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
@@ -19,6 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
 |:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
 | [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
 | [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |
 
 The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
 
@@ -37,7 +40,7 @@ or to install from source including dependencies:
 python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
 ```
 
-where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
+where `extras` can be one or more of `ipex`, `neural-compressor`, `openvino`, `nncf`.
 
 # Quick tour
 
@@ -75,10 +78,10 @@ It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2
 optimum-cli export openvino --model gpt2 ov_model
 ```
 
-If you add `--int8`, the model linear and embedding weights will be quantized to INT8, the activations will be kept in floating point precision.
+You can also apply 8-bit weight-only quantization when exporting your model : the model linear and embedding weights will be quantized to INT8, the activations will be kept in floating point precision.
 
 ```plain
-optimum-cli export openvino --model gpt2 --int8 ov_model
+optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```
 
 To apply quantization on both weights and activations, you can find more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov).

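As a quick check of the new `--weight-format int8` export flow above, the `ov_model` directory produced by the CLI can be loaded back and used for generation. This is an editorial sketch, not part of the commit; it assumes the source model was `gpt2`, as in the README example.

```python
# Editorial sketch (not from this commit): load the INT8-exported IR written to
# `ov_model` by `optimum-cli export openvino --model gpt2 --weight-format int8 ov_model`
# and run a short generation with the original gpt2 tokenizer.
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained("ov_model")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("OpenVINO weight-only quantization", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```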
docs/source/inference.mdx

+27 −13

@@ -50,19 +50,19 @@ optimum-cli export openvino --model local_path --task text-generation-with-past
 Once the model is exported, you can load the OpenVINO model using :
 
 ```python
-from optimum.intel import AutoModelForCausalLM
+from optimum.intel import OVModelForCausalLM
 
-model_id = "helenai/gpt2-ov"
-model = AutoModelForCausalLM.from_pretrained(model_id)
+model_id = "ov_model"
+model = OVModelForCausalLM.from_pretrained(model_id)
 ```
 
 You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model.
 
 ```python
-from optimum.intel import AutoModelForCausalLM
+from optimum.intel import OVModelForCausalLM
 
 model_id = "gpt2"
-model = AutoModelForCausalLM.from_pretrained(model_id, export=True)
+model = OVModelForCausalLM.from_pretrained(model_id, export=True)
 model.save_pretrained("ov_model")
 ```
 
@@ -94,15 +94,15 @@ model.save_pretrained(save_directory)
 tokenizer.save_pretrained(save_directory)
 ```
 
-### Weight only quantization
+### Weight-only quantization
 
-You can also apply INT8 quantization on your models weights when exporting your model with the CLI:
+You can also apply 8-bit or 4-bit weight quantization when exporting your model with the CLI:
 
 ```bash
-optimum-cli export openvino --model gpt2 --int8 ov_model
+optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```
 
-This will results in the exported model linear and embedding layers to be quantized to INT8, the activations will be kept in floating point precision.
+This will result in the exported model linear and embedding layers to be quantized to INT8 or INT4, the activations will be kept in floating point precision. This type of optimization allows reducing the footprint and latency of LLMs.
 
 This can also be done when loading your model by setting the `load_in_8bit` argument when calling the `from_pretrained()` method.
 
@@ -112,6 +112,21 @@ from optimum.intel import OVModelForCausalLM
 model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
 ```
 
+> **NOTE:** `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+
+There are also alternative compression options for a different performance-accuracy trade-off:
+
+| Option | Description |
+|---------------------------------------------------------------------|-------------------|
+| `fp16` | Float16 weights |
+| `int8` | INT8 weights |
+| `int4_sym_g128`, `int4_asym_g128`, `int4_sym_g64`, `int4_asym_g64`* | INT4 weights |
+
+*`sym` and `asym` stand for symmetric and asymmetric quantization, `g128` and `g64` means the group size `128` and `64` respectively.
+
+`--ratio` CLI parameter controls the ratio between 4-bit and 8-bit quantized layers and can also change performance-accuracy trade-off for the optimized model. It is valid only for INT4 quantization options.
+
+
 To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization).
 
 ### Static shape
@@ -186,11 +201,10 @@ It is possible to pass an `ov_config` parameter to `from_pretrained()` with cust
 model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"INFERENCE_PRECISION_HINT":"f32"})
 ```
 
-Optimum Intel leverages OpenVINO's model caching to speed up model compiling. By default a `model_cache` directory is created in the model's directory in the [Hugging Face Hub cache](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache). To override this, use the ov_config parameter and set `CACHE_DIR` to a different value. To disable model caching, set `CACHE_DIR` to an empty string.
-
+Optimum Intel leverages OpenVINO's model caching to speed up model compiling on GPU. By default a `model_cache` directory is created in the model's directory in the [Hugging Face Hub cache](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache). To override this, use the ov_config parameter and set `CACHE_DIR` to a different value. To disable model caching on GPU, set `CACHE_DIR` to an empty string.
 
 ```python
-model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"CACHE_DIR":""})
+model = OVModelForSequenceClassification.from_pretrained(model_id, device="GPU", ov_config={"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR":""})
 ```
 
 ### Sequence-to-sequence models
@@ -258,7 +272,7 @@ prompt = "sailing ship in storm by Rembrandt"
 images = pipeline(prompt).images
 ```
 
-To load your PyTorch model and convert it to OpenVINO on-the-fly, you can set `export=True`.
+To load your PyTorch model and convert it to OpenVINO on the fly, you can set `export=True`.
 
 ```python
 model_id = "runwayml/stable-diffusion-v1-5"

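To make the `sym`/`asym` and group-size terminology in the new compression-options table concrete, here is a small numeric illustration. It is an editorial sketch only, not code from the commit, and it simplifies what NNCF and OpenVINO actually do (real kernels keep packed INT4 values plus per-group scales rather than dequantizing up front).

```python
# Editorial sketch (not from this commit): what int4_sym_g128 / int4_asym_g128
# mean numerically. Weights are split into groups of 128 values and each group
# gets its own scale (plus a zero point in the asymmetric variant).
import numpy as np


def quantize_group_sym(w: np.ndarray) -> np.ndarray:
    """Symmetric INT4: one scale per group, integer levels in [-8, 7], no zero point."""
    scale = np.abs(w).max() / 7  # map the largest magnitude onto the INT4 range
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale  # dequantized values, for comparison with the originals


def quantize_group_asym(w: np.ndarray) -> np.ndarray:
    """Asymmetric INT4: per-group scale and zero point, integer levels in [0, 15]."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15
    zero_point = round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, 15)
    return (q - zero_point) * scale


rng = np.random.default_rng(0)
weights = rng.normal(size=512).astype(np.float32)
group_size = 128  # the g128 suffix; the *_g64 options use groups of 64

groups = weights.reshape(-1, group_size)
sym = np.concatenate([quantize_group_sym(g) for g in groups])
asym = np.concatenate([quantize_group_asym(g) for g in groups])
print("mean abs error, symmetric: ", np.abs(weights - sym).mean())
print("mean abs error, asymmetric:", np.abs(weights - asym).mean())
```

The `--ratio` option described above then only decides how many layers receive this INT4 treatment, with the remainder kept in INT8.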
docs/source/optimization_ov.mdx

+30 −0

@@ -62,6 +62,36 @@ tokenizer.save_pretrained(save_dir)
 
 The `quantize()` method applies post-training static quantization and export the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
 
+## Weight-only quantization
+
+You can optimize the performance of text-generation LLMs by quantizing weights to various precisions that provide different performance-accuracy trade-offs.
+
+```python
+from optimum.intel import OVModelForCausalLM
+
+model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
+```
+
+> **NOTE:** `load_in_8bit` is enabled by default for models larger than 1 billion parameters.
+
+For the 4-bit weight quantization we recommend using the NNCF API like below:
+```python
+from optimum.intel import OVModelForCausalLM
+import nncf
+
+model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=False)
+model.model = nncf.compress_weights(
+    model.model,
+    mode=nncf.CompressWeightsMode.INT4_SYM,
+    ratio=0.8,
+    group_size=128,
+)
+model.save_pretrained("compressed_model")
+```
+
+For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md).
+
+
 ## Training-time optimization
 
 Apart from optimizing a model after training like post-training quantization above, `optimum.openvino` also provides optimization methods during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).

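As a follow-up to the new weight-only quantization section above, the directory written by `save_pretrained("compressed_model")` can be reloaded like any other exported model. This is an editorial sketch, not part of the commit; it assumes a tokenizer was saved alongside the model (otherwise load it from the original `model_id`).

```python
# Editorial sketch (not from this commit): reload the 4-bit-compressed model
# produced by the nncf.compress_weights() snippet above and run it through a
# transformers text-generation pipeline.
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForCausalLM

save_dir = "compressed_model"                        # directory written by save_pretrained()
model = OVModelForCausalLM.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)  # assumes the tokenizer was saved here too

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Weight-only quantization lets large models", max_new_tokens=20))
```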
docs/source/reference_inc.mdx

+2 −2

@@ -43,8 +43,8 @@ specific language governing permissions and limitations under the License.
 
 ## INCModelForCausalLM
 
-[[autodoc]] neural_compressor.modeling_decoder.INCModelForCausalLM
+[[autodoc]] neural_compressor.modeling_base.INCModelForCausalLM
 
 ## INCModelForSeq2SeqLM
 
-[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM
+[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM

examples/openvino/stable-diffusion/requirements.txt

+1 −1

@@ -2,4 +2,4 @@ accelerate
 diffusers
 torch~=1.13
 nncf @ git+https://github.com/openvinotoolkit/nncf.git
-tomesd @ git+https://github.com/AlexKoff88/tomesd/tree/openvino
+tomesd @ git+https://github.com/AlexKoff88/tomesd.git@openvino

examples/openvino/stable-diffusion/train_text_to_image_qat.py

+9 −58

@@ -19,7 +19,6 @@
 import math
 import os
 import random
-import tempfile
 from copy import deepcopy
 from functools import partial
 from io import BytesIO
@@ -34,7 +33,7 @@
 import torch.utils.checkpoint
 from accelerate import Accelerator
 from accelerate.logging import get_logger
-from accelerate.utils import set_seed
+from accelerate.utils import ProjectConfiguration, set_seed
 from datasets import load_dataset
 from diffusers import DDIMScheduler, DDPMScheduler, DiffusionPipeline, LMSDiscreteScheduler, StableDiffusionPipeline
 from diffusers.optimization import get_scheduler
@@ -44,20 +43,12 @@
 from nncf.torch import create_compressed_model, register_default_init_args
 from nncf.torch.initialization import PTInitializingDataLoader
 from nncf.torch.layer_utils import CompressionParameter
-from openvino._offline_transformations import apply_moc_transformations, compress_quantize_weights_transformation
 from PIL import Image
 from requests.packages.urllib3.exceptions import InsecureRequestWarning
 from torchvision import transforms
 from tqdm import tqdm
 
-from optimum.exporters.onnx import export_models, get_stable_diffusion_models_for_export
-from optimum.intel import OVStableDiffusionPipeline
-from optimum.utils import (
-    DIFFUSION_MODEL_TEXT_ENCODER_SUBFOLDER,
-    DIFFUSION_MODEL_UNET_SUBFOLDER,
-    DIFFUSION_MODEL_VAE_DECODER_SUBFOLDER,
-    DIFFUSION_MODEL_VAE_ENCODER_SUBFOLDER,
-)
+from optimum.exporters.openvino import export_from_model
 
 
 requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
@@ -583,47 +574,6 @@ def get_noise_scheduler(args):
     return noise_scheduler
 
 
-def export_to_onnx(pipeline, save_dir):
-    unet = pipeline.unet
-    vae = pipeline.vae
-    text_encoder = pipeline.text_encoder
-
-    unet.eval().cpu()
-    vae.eval().cpu()
-    text_encoder.eval().cpu()
-
-    ONNX_WEIGHTS_NAME = "model.onnx"
-
-    output_names = [
-        os.path.join(DIFFUSION_MODEL_TEXT_ENCODER_SUBFOLDER, ONNX_WEIGHTS_NAME),
-        os.path.join(DIFFUSION_MODEL_UNET_SUBFOLDER, ONNX_WEIGHTS_NAME),
-        os.path.join(DIFFUSION_MODEL_VAE_ENCODER_SUBFOLDER, ONNX_WEIGHTS_NAME),
-        os.path.join(DIFFUSION_MODEL_VAE_DECODER_SUBFOLDER, ONNX_WEIGHTS_NAME),
-    ]
-
-    with torch.no_grad():
-        models_and_onnx_configs = get_stable_diffusion_models_for_export(pipeline)
-        pipeline.save_config(save_dir)
-        export_models(
-            models_and_onnx_configs=models_and_onnx_configs, output_dir=Path(save_dir), output_names=output_names
-        )
-
-
-def export_to_openvino(pipeline, onnx_dir, save_dir):
-    ov_pipe = OVStableDiffusionPipeline.from_pretrained(
-        model_id=onnx_dir,
-        from_onnx=True,
-        model_save_dir=save_dir,
-        tokenizer=pipeline.tokenizer,
-        scheduler=pipeline.scheduler,
-        feature_extractor=pipeline.feature_extractor,
-        compile=False,
-    )
-    apply_moc_transformations(ov_pipe.unet.model, cf=False)
-    compress_quantize_weights_transformation(ov_pipe.unet.model)
-    ov_pipe.save_pretrained(save_dir)
-
-
 class UnetInitDataset(torch.utils.data.Dataset):
     def __init__(self, data):
         super().__init__()
@@ -700,7 +650,7 @@ def get_nncf_config(pipeline, dataloader, args):
         "ignored_scopes": [
             "{re}.*__add___[0-2]",
             "{re}.*layer_norm_0",
-            "{re}.*Attention.*/bmm_0",
+            # "{re}.*Attention.*/bmm_0",
             "{re}.*__truediv__*",
             "{re}.*group_norm_0",
             "{re}.*mul___[0-2]",
@@ -771,11 +721,13 @@ def main():
 
     logging_dir = os.path.join(args.output_dir, args.logging_dir)
 
+    accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
+
     accelerator = Accelerator(
         gradient_accumulation_steps=args.gradient_accumulation_steps,
         mixed_precision=args.mixed_precision,
         log_with=args.report_to,
-        logging_dir=logging_dir,
+        project_config=accelerator_project_config,
     )
 
     logging.basicConfig(
@@ -922,7 +874,7 @@ def tokenize_captions(examples, is_train=True):
 
     with accelerator.main_process_first():
         if args.max_train_samples is not None:
-            dataset["train"] = dataset["train"].shuffle(seed=42, buffer_size=args.max_train_samples)
+            dataset["train"] = dataset["train"].shuffle(seed=42).select(range(args.max_train_samples))
         # Set the training transforms
         train_dataset = dataset["train"]
 
@@ -1132,9 +1084,8 @@ def collate_fn(examples):
         feature_extractor=pipeline.feature_extractor,
     )
 
-    with tempfile.TemporaryDirectory() as tmpdirname:
-        export_to_onnx(export_pipeline, tmpdirname)
-        export_to_openvino(export_pipeline, tmpdirname, Path(args.output_dir) / "openvino")
+    save_directory = Path(args.output_dir) / "openvino"
+    export_from_model(export_pipeline, output=save_directory, task="stable-diffusion")
 
 
 if __name__ == "__main__":

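The training example above now exports the fine-tuned pipeline directly to OpenVINO IR via `export_from_model` instead of taking an ONNX detour. Below is a minimal standalone sketch of that call pattern; it is not part of the commit, and the model id and output path are placeholders.

```python
# Editorial sketch (not from this commit): export a diffusers pipeline straight
# to OpenVINO IR with the export_from_model helper used in the training script.
from pathlib import Path

from diffusers import StableDiffusionPipeline
from optimum.exporters.openvino import export_from_model

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
save_directory = Path("openvino_sd")  # placeholder output directory
export_from_model(pipeline, output=save_directory, task="stable-diffusion")
```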
optimum/commands/export/openvino.py

+19 −1

@@ -92,6 +92,22 @@ def parse_args_openvino(parser: "ArgumentParser"):
             "precision (by default 20%% in INT8). This helps to achieve better accuracy after weight compression."
         ),
     )
+    optional_group.add_argument(
+        "--disable-stateful",
+        action="store_true",
+        help=(
+            "Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. "
+            "In stateful models all kv-cache inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. "
+            "If --disable-stateful option is used, it may result in sub-optimal inference performance. "
+            "Use it when you intentionally want to use a stateless model, for example, to be compatible with existing "
+            "OpenVINO native inference code that expects kv-cache inputs and outputs in the model."
+        ),
+    )
+    optional_group.add_argument(
+        "--convert-tokenizer",
+        action="store_true",
+        help="Add converted tokenizer and detokenizer with OpenVINO Tokenizers",
+    )
 
 
 class OVExportCommand(BaseOptimumCLICommand):
@@ -138,6 +154,8 @@ def run(self):
             trust_remote_code=self.args.trust_remote_code,
             pad_token_id=self.args.pad_token_id,
             compression_option=self.args.weight_format,
-            compression_ratio=self.args.ratio
+            compression_ratio=self.args.ratio,
+            stateful=not self.args.disable_stateful,
+            convert_tokenizer=self.args.convert_tokenizer,
             # **input_shapes,
         )
+2 −1

@@ -1,5 +1,6 @@
 from .__main__ import main_export
-from .convert import export, export_models, export_pytorch_via_onnx
+from .convert import export, export_from_model, export_models, export_pytorch_via_onnx
+from .stateful import ensure_stateful_is_available, patch_stateful
 
 
 __all__ = ["main_export", "export", "export_models"]

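The new `--disable-stateful` flag added above controls whether the exported decoder keeps its KV cache inside the model. One quick way to see the difference is to inspect the exported IR's inputs. This is an editorial sketch, not part of the commit; `openvino_model.xml` is the file name Optimum Intel writes on export, and the exact input names depend on the model.

```python
# Editorial sketch (not from this commit): list the inputs of an exported model.
# A stateful export hides the kv-cache, so no past_key_values.* inputs appear;
# an export made with --disable-stateful exposes them explicitly.
from openvino.runtime import Core

core = Core()
model = core.read_model("ov_model/openvino_model.xml")  # path from the CLI examples above
for model_input in model.inputs:
    print(model_input.get_any_name(), model_input.get_partial_shape())
```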