
Commit 345f9e5

l-bat authored and PenghuiCheng committed
Add hybrid quantization for StableDiffusion pipelines (huggingface#584)
* Add hybrid quantization for StableDiffusion pipelines
* apply black
* fix tests
* fix ruff
* fix lcm bug
* apply review comments
* rework dataset processing
* Add doc
* remove SDXL test
* Apply comments
* reformat
1 parent de243f0 commit 345f9e5

File tree

8 files changed, +283 -28 lines


docs/source/optimization_ov.mdx

+17
@@ -69,6 +69,23 @@ from optimum.intel import OVModelForCausalLM
 model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
 ```
 
+## Hybrid quantization
+
+Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of the activations is comparable to that of the weights.
+The UNet model takes up most of the overall execution time of the pipeline. Thus, optimizing just this one model brings substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could lead to a substantial degradation of accuracy.
+Therefore, the proposal is to apply quantization in *hybrid mode* to the UNet model and weight-only quantization to the rest of the pipeline components. Hybrid mode quantizes the weights of MatMul and Embedding layers and the activations of other layers, which preserves accuracy after optimization while reducing the model size.
+The `quantization_config` is used to define the optimization parameters for the Stable Diffusion pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. Otherwise, weight-only quantization to a specified data type (8 or 4 bits) is applied to the UNet model.
+
+```python
+from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig
+
+model = OVStableDiffusionPipeline.from_pretrained(
+    model_id,
+    export=True,
+    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
+)
+```
+
 <Tip warning={true}>
 
 `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
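For reference (not part of the diff above), the fallback path the new docs mention — a `quantization_config` without a dataset — would look roughly like this minimal sketch, reusing the same `model_id` placeholder as the docs:

```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

# No `dataset` in the config, so the UNet and the other components receive
# plain 8-bit weight-only compression instead of hybrid quantization.
model = OVStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
```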

optimum/intel/openvino/configuration.py

+23 -14
@@ -167,7 +167,7 @@ class OVWeightQuantizationConfig(QuantizationConfigMixin):
 
         bits (`int`, defaults to 8):
             The number of bits to quantize to.
-        sym (`bool`, *optional*, defaults to `False`):
+        sym (`bool`, defaults to `False`):
             Whether to use symmetric quantization.
         tokenizer (`str` or `PreTrainedTokenizerBase`, *optional*):
             The tokenizer used to process the dataset. You can pass either:
@@ -177,23 +177,24 @@ class OVWeightQuantizationConfig(QuantizationConfigMixin):
             user or organization name, like `dbmdz/bert-base-german-cased`.
             - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
             using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`.
-        dataset (`Union[List[str]]`, *optional*):
-            The dataset used for data-aware compression. You can provide your own dataset in a list of string or just use the
-            the one from the list ['wikitext2','c4','c4-new','ptb','ptb-new']
-        group_size (`int`, *optional*, defaults to 128):
-            The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
-        ratio (`float`, *optional*, defaults to 1.0):
+        dataset (`str or List[str]`, *optional*):
+            The dataset used for data-aware compression or quantization with NNCF. You can provide your own dataset
+            in a list of strings or just use the one from the list ['wikitext2','c4','c4-new','ptb','ptb-new'] for LLMs
+            or ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models.
+        ratio (`float`, defaults to 1.0):
             The ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to INT4_ASYM
             and the rest to INT8_ASYM).
+        group_size (`int`, *optional*):
+            The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
         all_layers (`bool`, *optional*):
             Defines how many layers are compressed to 4-bits while the rest are kept in 8-bit precision.
-        sensitivity_metric (`nncf.SensitivityMetric`, *optional*):
+        sensitivity_metric (`str`, *optional*):
             The sensitivity metric for assigning quantization precision to layers. In order to
             preserve the accuracy of the model, the more sensitive layers receive a higher precision.
-        awq (`bool`, *optional*):
-            Enables AWQ method to unify weight ranges and improve overall model accuracy.
-        ignored_scope (`nncf.IgnoredScope`, *optional*):
+        ignored_scope (`dict`, *optional*):
             An ignored scope that defines the list of model control flow graph nodes to be ignored during quantization.
+        num_samples (`int`, *optional*):
+            The maximum number of samples composing the calibration dataset.
 
     """
 
@@ -202,12 +203,13 @@ def __init__(
         bits: int = 8,
         sym: bool = False,
         tokenizer: Optional[Any] = None,
-        dataset: Optional[str] = None,
+        dataset: Optional[Union[str, List[str]]] = None,
         ratio: float = 1.0,
         group_size: Optional[int] = None,
         all_layers: Optional[bool] = None,
         sensitivity_metric: Optional[str] = None,
         ignored_scope: Optional[dict] = None,
+        num_samples: Optional[int] = None,
         **kwargs,
     ):
         self.bits = bits
@@ -219,6 +221,7 @@ def __init__(
         self.all_layers = all_layers
         self.sensitivity_metric = sensitivity_metric
         self.ignored_scope = ignored_scope
+        self.num_samples = num_samples
         self.quant_method = "default"  # TODO : enable AWQ after nncf v2.9.0 release
         self.post_init()
 
@@ -231,10 +234,16 @@ def post_init(self):
         if self.group_size is not None and self.group_size != -1 and self.group_size <= 0:
             raise ValueError("`group_size` must be greater than 0 or equal to -1")
         if self.dataset is not None and isinstance(self.dataset, str):
-            if self.dataset not in ["wikitext2", "c4", "c4-new", "ptb", "ptb-new"]:
+            llm_datasets = ["wikitext2", "c4", "c4-new", "ptb", "ptb-new"]
+            stable_diffusion_datasets = [
+                "conceptual_captions",
+                "laion/220k-GPT4Vision-captions-from-LIVIS",
+                "laion/filtered-wit",
+            ]
+            if self.dataset not in llm_datasets + stable_diffusion_datasets:
                 raise ValueError(
                     f"""You have entered a string value for dataset. You can only choose between
-                    ['wikitext2','c4','c4-new','ptb','ptb-new'], but we found {self.dataset}"""
+                    {llm_datasets} for LLMs or {stable_diffusion_datasets} for diffusion models, but we found {self.dataset}"""
                 )
 
         if self.bits not in [4, 8]:
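To illustrate the updated configuration surface (a sketch, not part of this commit's diff): the new `num_samples` field and the string `dataset` values accepted by `post_init` combine as below; the concrete values are only examples.

```python
from optimum.intel import OVWeightQuantizationConfig

# Config suitable for hybrid quantization of a diffusion pipeline: a predefined
# image-caption dataset plus a cap on the calibration subset size.
sd_config = OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions", num_samples=200)

# Data-aware 4-bit weight compression for an LLM, using a predefined text dataset.
llm_config = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=0.8, dataset="wikitext2", num_samples=128)
```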

optimum/intel/openvino/modeling_decoder.py

+2 -1
@@ -635,7 +635,8 @@ def _from_pretrained(
                 # from optimum.gptq.utils import get_seqlen
 
                 # seqlen = get_seqlen(causal_model)
-                dataset = get_dataset(quantization_config.dataset, tokenizer, seqlen=32)
+                nsamples = quantization_config.num_samples if quantization_config.num_samples else 128
+                dataset = get_dataset(quantization_config.dataset, tokenizer, seqlen=32, nsamples=nsamples)
                 dataset = prepare_dataset(dataset)
                 quantization_config = copy.deepcopy(quantization_config)
                 quantization_config.dataset = nncf.Dataset(dataset, lambda x: causal_model.prepare_inputs(**x))
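For causal LMs, the practical effect of this change is that `num_samples` now caps the GPTQ-style calibration set built by `get_dataset` (128 remains the default when unset). A hedged sketch, with an illustrative model ID:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# num_samples is forwarded to get_dataset(..., nsamples=...) above.
model = OVModelForCausalLM.from_pretrained(
    "gpt2",  # illustrative model id
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4, dataset="wikitext2", num_samples=64),
)
```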

optimum/intel/openvino/modeling_diffusion.py

+96 -5
@@ -16,6 +16,7 @@
 import logging
 import os
 import shutil
+from copy import deepcopy
 from pathlib import Path
 from tempfile import TemporaryDirectory, gettempdir
 from typing import Any, Dict, List, Optional, Union
@@ -57,7 +58,13 @@
 from .configuration import OVConfig, OVWeightQuantizationConfig
 from .loaders import OVTextualInversionLoaderMixin
 from .modeling_base import OVBaseModel
-from .utils import ONNX_WEIGHTS_NAME, OV_TO_NP_TYPE, OV_XML_FILE_NAME, _print_compiled_model_properties
+from .utils import (
+    ONNX_WEIGHTS_NAME,
+    OV_TO_NP_TYPE,
+    OV_XML_FILE_NAME,
+    PREDEFINED_SD_DATASETS,
+    _print_compiled_model_properties,
+)
 
 
 core = Core()
@@ -274,9 +281,19 @@ def _from_pretrained(
         kwargs[name] = load_method(new_model_save_dir)
 
         quantization_config = cls._prepare_weight_quantization_config(quantization_config, load_in_8bit)
-        unet = cls.load_model(
-            new_model_save_dir / DIFFUSION_MODEL_UNET_SUBFOLDER / unet_file_name, quantization_config
-        )
+
+        unet_path = new_model_save_dir / DIFFUSION_MODEL_UNET_SUBFOLDER / unet_file_name
+        if quantization_config is not None and quantization_config.dataset is not None:
+            # load the UNet model uncompressed to apply hybrid quantization further
+            unet = cls.load_model(unet_path)
+            # Apply weights compression to other `components` without dataset
+            weight_quantization_params = {
+                param: value for param, value in quantization_config.__dict__.items() if param != "dataset"
+            }
+            weight_quantization_config = OVWeightQuantizationConfig.from_dict(weight_quantization_params)
+        else:
+            weight_quantization_config = quantization_config
+            unet = cls.load_model(unet_path, weight_quantization_config)
 
         components = {
             "vae_encoder": new_model_save_dir / DIFFUSION_MODEL_VAE_ENCODER_SUBFOLDER / vae_encoder_file_name,
@@ -286,11 +303,29 @@ def _from_pretrained(
         }
 
         for key, value in components.items():
-            components[key] = cls.load_model(value, quantization_config) if value.is_file() else None
+            components[key] = cls.load_model(value, weight_quantization_config) if value.is_file() else None
 
         if model_save_dir is None:
             model_save_dir = new_model_save_dir
 
+        if quantization_config is not None and quantization_config.dataset is not None:
+            sd_model = cls(unet=unet, config=config, model_save_dir=model_save_dir, **components, **kwargs)
+
+            supported_pipelines = (
+                OVStableDiffusionPipeline,
+                OVStableDiffusionXLPipeline,
+                OVLatentConsistencyModelPipeline,
+            )
+            if not isinstance(sd_model, supported_pipelines):
+                raise NotImplementedError(f"Quantization in hybrid mode is not supported for {cls.__name__}")
+
+            nsamples = quantization_config.num_samples if quantization_config.num_samples else 200
+            unet_inputs = sd_model._prepare_unet_inputs(quantization_config.dataset, nsamples)
+
+            from .quantization import _hybrid_quantization
+
+            unet = _hybrid_quantization(sd_model.unet.model, weight_quantization_config, dataset=unet_inputs)
+
         return cls(
             unet=unet,
             config=config,
@@ -300,6 +335,62 @@ def _from_pretrained(
             **kwargs,
         )
 
+    def _prepare_unet_inputs(
+        self,
+        dataset: Union[str, List[Any]],
+        num_samples: int,
+        height: Optional[int] = None,
+        width: Optional[int] = None,
+        seed: Optional[int] = 42,
+        **kwargs,
+    ) -> Dict[str, Any]:
+        self.compile()
+
+        size = self.unet.config.get("sample_size", 64) * self.vae_scale_factor
+        height = height or min(size, 512)
+        width = width or min(size, 512)
+
+        if isinstance(dataset, str):
+            dataset = deepcopy(dataset)
+            available_datasets = PREDEFINED_SD_DATASETS.keys()
+            if dataset not in available_datasets:
+                raise ValueError(
+                    f"""You have entered a string value for dataset. You can only choose between
+                    {list(available_datasets)}, but the {dataset} was found"""
+                )
+
+            from datasets import load_dataset
+
+            dataset_metadata = PREDEFINED_SD_DATASETS[dataset]
+            dataset = load_dataset(dataset, split=dataset_metadata["split"], streaming=True).shuffle(seed=seed)
+            input_names = dataset_metadata["inputs"]
+            dataset = dataset.select_columns(list(input_names.values()))
+
+            def transform_fn(data_item):
+                return {inp_name: data_item[column] for inp_name, column in input_names.items()}
+
+        else:
+
+            def transform_fn(data_item):
+                return data_item if isinstance(data_item, (list, dict)) else [data_item]
+
+        from .quantization import InferRequestWrapper
+
+        calibration_data = []
+        self.unet.request = InferRequestWrapper(self.unet.request, calibration_data)
+
+        for inputs in dataset:
+            inputs = transform_fn(inputs)
+            if isinstance(inputs, dict):
+                self.__call__(**inputs, height=height, width=width)
+            else:
+                self.__call__(*inputs, height=height, width=width)
+            if len(calibration_data) > num_samples:
+                break
+
+        self.unet.request = self.unet.request.request
+        return calibration_data[:num_samples]
+
     @classmethod
     def _from_transformers(
         cls,
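Per the `supported_pipelines` check above, hybrid quantization can also be triggered from the SDXL and latent-consistency pipelines, not only `OVStableDiffusionPipeline`. A minimal sketch, where the model ID and dataset choice are illustrative:

```python
from optimum.intel import OVLatentConsistencyModelPipeline, OVWeightQuantizationConfig

# A config that carries a dataset routes the UNet through _hybrid_quantization,
# while the remaining components get weight-only compression.
pipe = OVLatentConsistencyModelPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",  # illustrative model id
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="laion/filtered-wit", num_samples=200),
)
```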

optimum/intel/openvino/quantization.py

+93 -2
@@ -16,6 +16,7 @@
 import inspect
 import logging
 import os
+from collections import deque
 from pathlib import Path
 from typing import Any, Callable, Dict, Optional, Tuple, Union
 
@@ -24,6 +25,7 @@
 import torch
 import transformers
 from nncf import CompressWeightsMode, IgnoredScope, NNCFConfig, SensitivityMetric
+from nncf.quantization.advanced_parameters import AdvancedSmoothQuantParameters
 from nncf.torch import create_compressed_model, register_default_init_args, register_module
 from nncf.torch.dynamic_graph.io_handling import wrap_nncf_model_inputs_with_objwalk
 from nncf.torch.initialization import PTInitializingDataLoader
@@ -550,7 +552,7 @@ def _remove_unused_columns(self, dataset: "Dataset"):
 
 def _weight_only_quantization(
     model: openvino.runtime.Model, quantization_config: Union[OVWeightQuantizationConfig, Dict]
-):
+) -> openvino.runtime.Model:
     config = quantization_config
     if isinstance(config, dict):
         config = OVWeightQuantizationConfig.from_dict(quantization_config)
@@ -564,7 +566,8 @@ def _weight_only_quantization(
 
         from optimum.gptq.data import get_dataset, prepare_dataset
 
-        dataset = get_dataset(config.dataset, tokenizer, seqlen=32)
+        nsamples = config.num_samples if config.num_samples else 128
+        dataset = get_dataset(config.dataset, tokenizer, seqlen=32, nsamples=nsamples)
        dataset = prepare_dataset(dataset)
 
    sensitivity_metric = None
@@ -590,4 +593,92 @@ def _weight_only_quantization(
         # awq=config.quant_method == "awq", # TODO : remove and add it back once nncf v2.9.0
         ignored_scope=ignored_scope,
         dataset=dataset,
+        # subset_size=config.num_samples if config.num_samples else 128, # TODO : enable from nncf v2.9.0
     )
+
+
+def _get_operation_const_op(operation, const_port_id: int):
+    node = operation.input_value(const_port_id).get_node()
+    queue = deque([node])
+    constant_node = None
+    allowed_propagation_types_list = ["Convert", "FakeQuantize", "Reshape"]
+
+    while len(queue) != 0:
+        curr_node = queue.popleft()
+        if curr_node.get_type_name() == "Constant":
+            constant_node = curr_node
+            break
+        if len(curr_node.inputs()) == 0:
+            break
+        if curr_node.get_type_name() in allowed_propagation_types_list:
+            queue.append(curr_node.input_value(0).get_node())
+
+    return constant_node
+
+
+def _is_embedding(node) -> bool:
+    allowed_types_list = ["f16", "f32", "f64"]
+    const_port_id = 0
+    input_tensor = node.input_value(const_port_id)
+    if input_tensor.get_element_type().get_type_name() in allowed_types_list:
+        const_node = _get_operation_const_op(node, const_port_id)
+        if const_node is not None:
+            return True
+
+    return False
+
+
+def _collect_ops_with_weights(model):
+    ops_with_weights = []
+    for op in model.get_ops():
+        if op.get_type_name() == "MatMul":
+            constant_node_0 = _get_operation_const_op(op, const_port_id=0)
+            constant_node_1 = _get_operation_const_op(op, const_port_id=1)
+            if constant_node_0 or constant_node_1:
+                ops_with_weights.append(op.get_friendly_name())
+        if op.get_type_name() == "Gather" and _is_embedding(op):
+            ops_with_weights.append(op.get_friendly_name())
+
+    return ops_with_weights
+
+
+def _hybrid_quantization(
+    model: openvino.runtime.Model, quantization_config: OVWeightQuantizationConfig, dataset: Dict[str, Any]
+) -> openvino.runtime.Model:
+    """
+    Quantize a model in hybrid mode with NNCF, which means that we quantize:
+    weights of MatMul and Embedding layers and activations of other layers.
+    The optimization specifications are defined in `quantization_config`.
+
+    Args:
+        model (`openvino.runtime.Model`):
+            The OpenVINO Runtime model for applying hybrid quantization.
+        quantization_config (`OVWeightQuantizationConfig`):
+            The configuration containing the parameters related to quantization.
+        dataset (`Dict[str, Any]`):
+            The dataset used for hybrid quantization.
+    Returns:
+        The OpenVINO Runtime model with applied hybrid quantization.
+    """
+    ops_to_compress = _collect_ops_with_weights(model)
+
+    ignored_scope = quantization_config.ignored_scope if isinstance(quantization_config.ignored_scope, dict) else {}
+    ptq_ignored_scope = nncf.IgnoredScope(**ignored_scope)
+    ptq_ignored_scope.names += ops_to_compress
+
+    wc_quantization_config = copy.deepcopy(quantization_config)
+    wc_quantization_config.ignored_scope = ignored_scope
+    wc_quantization_config.ignored_scope["types"] = ignored_scope.get("types", []) + ["Convolution"]
+    compressed_model = _weight_only_quantization(model, wc_quantization_config)
+
+    subset_size = quantization_config.num_samples if quantization_config.num_samples else 200
+    quantized_model = nncf.quantize(
+        model=compressed_model,
+        calibration_dataset=nncf.Dataset(dataset),
+        model_type=nncf.ModelType.TRANSFORMER,
+        ignored_scope=ptq_ignored_scope,
+        # The SQ algo should be disabled for MatMul nodes because their weights are already compressed
+        advanced_parameters=nncf.AdvancedQuantizationParameters(AdvancedSmoothQuantParameters(matmul=-1)),
+        subset_size=subset_size,
+    )
+    return quantized_model
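Putting the new helpers together, the flow `_from_pretrained` performs for a supported pipeline is roughly the sketch below. These are internal APIs shown only to clarify how the pieces connect; the model ID is illustrative, and the pipeline would still need to be recompiled before further inference.

```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig
from optimum.intel.openvino.quantization import _hybrid_quantization

pipe = OVStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", export=True)

# Collect UNet calibration inputs by running the pipeline over a predefined dataset.
calibration_data = pipe._prepare_unet_inputs("conceptual_captions", num_samples=200)

# The weight-only settings passed to _hybrid_quantization carry no dataset;
# the calibration samples are supplied separately.
wc_config = OVWeightQuantizationConfig(bits=8)
pipe.unet.model = _hybrid_quantization(pipe.unet.model, wc_config, dataset=calibration_data)
```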
