Add hybrid quantization for StableDiffusion pipelines #584
@@ -69,6 +69,23 @@ from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
```

## Hybrid quantization

Traditional optimization methods such as post-training 8-bit quantization do not work well for Stable Diffusion models, because accuracy drops significantly. On the other hand, weight compression alone does not improve performance when applied to Stable Diffusion models, as the size of the activations is comparable to that of the weights.
The UNet takes up most of the overall execution time of the pipeline, so optimizing just this one model brings substantial inference-speed benefits while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could lead to substantial accuracy degradation.
Therefore, the proposal is to apply quantization in hybrid mode to the UNet and weight-only quantization to the other pipeline components. Hybrid mode quantizes the weights of MatMul and Embedding layers and the activations of the other layers, which preserves accuracy after optimization while reducing the model size.
To optimize the Stable Diffusion pipeline, use `quantization_config` to define the optimization parameters. To enable hybrid quantization, specify a quantization dataset in `quantization_config`; otherwise, weight-only quantization in the specified precision is applied to the UNet.
```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

model = OVStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
)
```
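For comparison, a minimal sketch of the weight-only path: omitting `dataset` from the config skips hybrid quantization and applies 8-bit weight-only quantization instead (`model_id` is assumed to point at a Stable Diffusion checkpoint):

```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

# Without a `dataset`, weight-only quantization is applied to the UNet
# and the other pipeline components instead of hybrid quantization.
model = OVStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
```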

<Tip warning={true}>

`load_in_8bit` is enabled by default for models larger than 1 billion parameters.
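A sketch of how this default could be overridden explicitly, assuming the same `from_pretrained` entry point shown at the top of this section:

```python
from optimum.intel import OVModelForCausalLM

# Explicitly opt out of the default 8-bit weight compression for large models.
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=False)
```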
@@ -167,7 +167,7 @@ class OVWeightQuantizationConfig(QuantizationConfigMixin):

bits (`int`, defaults to 8):
    The number of bits to quantize to.
sym (`bool`, *optional*, defaults to `False`):
sym (`bool`, defaults to `False`):
    Whether to use symmetric quantization.
tokenizer (`str` or `PreTrainedTokenizerBase`, *optional*):
    The tokenizer used to process the dataset. You can pass either:

@@ -177,23 +177,24 @@ class OVWeightQuantizationConfig(QuantizationConfigMixin):
        user or organization name, like `dbmdz/bert-base-german-cased`.
    - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
        using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`.
dataset (`Union[List[str]]`, *optional*):
    The dataset used for data-aware compression. You can provide your own dataset in a list of strings, or use
    one from the list ['wikitext2','c4','c4-new','ptb','ptb-new'].
group_size (`int`, *optional*, defaults to 128):
    The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
ratio (`float`, *optional*, defaults to 1.0):
dataset (`str or List[str]`, *optional*):
    The dataset used for data-aware compression or quantization with NNCF. You can provide your own dataset
    in a list of strings, or use one from the list ['wikitext2','c4','c4-new','ptb','ptb-new'] for LLMs
    or ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for SD models.
ratio (`float`, defaults to 1.0):
    The ratio between baseline and backup precisions (e.g. 0.9 means 90% of layers quantized to INT4_ASYM
    and the rest to INT8_ASYM).
group_size (`int`, *optional*):
    The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
all_layers (`bool`, *optional*):
    Defines how many layers are compressed to 4-bit while the rest are kept in 8-bit precision.
sensitivity_metric (`nncf.SensitivityMetric`, *optional*):
sensitivity_metric (`str`, *optional*):
    The sensitivity metric for assigning quantization precision to layers. In order to
    preserve the accuracy of the model, the more sensitive layers receive a higher precision.
awq (`bool`, *optional*):
    Enables the AWQ method to unify weight ranges and improve overall model accuracy.
ignored_scope (`nncf.IgnoredScope`, *optional*):
ignored_scope (`dict`, *optional*):
    An ignored scope that defines the list of model control flow graph nodes to be ignored during quantization.
num_samples (`int`, *optional*):
    The maximum number of samples composing the calibration dataset.

"""

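To illustrate how these options combine, a short sketch of a 4-bit configuration built from the parameters documented above (the dataset, ratio, and group size values are illustrative only, not recommendations):

```python
from optimum.intel import OVWeightQuantizationConfig

# 4-bit weight compression: 80% of layers quantized to 4-bit, the rest kept in 8-bit,
# with group-wise quantization and a calibration dataset for data-aware compression.
config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    dataset="wikitext2",
    ratio=0.8,
    group_size=128,
    num_samples=128,
)
```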
@@ -202,12 +203,13 @@ def __init__(
    bits: int = 8,
    sym: bool = False,
    tokenizer: Optional[Any] = None,
    dataset: Optional[str] = None,
    dataset: Optional[Union[str, List[str]]] = None,
    ratio: float = 1.0,
    group_size: Optional[int] = None,
    all_layers: Optional[bool] = None,
    sensitivity_metric: Optional[str] = None,
    ignored_scope: Optional[dict] = None,
    num_samples: Optional[int] = None,
    **kwargs,
):
    self.bits = bits

@@ -219,6 +221,7 @@ def __init__(
    self.all_layers = all_layers
    self.sensitivity_metric = sensitivity_metric
    self.ignored_scope = ignored_scope
    self.num_samples = num_samples
    self.quant_method = "default"  # TODO : enable AWQ after nncf v2.9.0 release
    self.post_init()

@@ -231,10 +234,16 @@ def post_init(self):
    if self.group_size is not None and self.group_size != -1 and self.group_size <= 0:
        raise ValueError("`group_size` must be greater than 0 or equal to -1")
    if self.dataset is not None and isinstance(self.dataset, str):
        if self.dataset not in ["wikitext2", "c4", "c4-new", "ptb", "ptb-new"]:
        llm_datasets = ["wikitext2", "c4", "c4-new", "ptb", "ptb-new"]
        stable_diffusion_datasets = [
            "conceptual_captions",
            "laion/220k-GPT4Vision-captions-from-LIVIS",
            "laion/filtered-wit",
        ]
        if self.dataset not in llm_datasets + stable_diffusion_datasets:
            raise ValueError(
                f"""You have entered a string value for dataset. You can only choose between
                ['wikitext2','c4','c4-new','ptb','ptb-new'], but we found {self.dataset}"""
                {llm_datasets} for LLMs or {stable_diffusion_datasets} for SD models, but we found {self.dataset}"""
            )

    if self.bits not in [4, 8]:
@@ -16,6 +16,7 @@
import logging
import os
import shutil
from copy import deepcopy
from pathlib import Path
from tempfile import TemporaryDirectory, gettempdir
from typing import Any, Dict, List, Optional, Union

@@ -57,7 +58,13 @@
from .configuration import OVConfig, OVWeightQuantizationConfig
from .loaders import OVTextualInversionLoaderMixin
from .modeling_base import OVBaseModel
from .utils import ONNX_WEIGHTS_NAME, OV_TO_NP_TYPE, OV_XML_FILE_NAME, _print_compiled_model_properties
from .utils import (
    ONNX_WEIGHTS_NAME,
    OV_TO_NP_TYPE,
    OV_XML_FILE_NAME,
    PREDEFINED_SD_DATASETS,
    _print_compiled_model_properties,
)


core = Core()

@@ -274,9 +281,17 @@ def _from_pretrained(
        kwargs[name] = load_method(new_model_save_dir)

    quantization_config = cls._prepare_weight_quantization_config(quantization_config, load_in_8bit)
    unet = cls.load_model(
        new_model_save_dir / DIFFUSION_MODEL_UNET_SUBFOLDER / unet_file_name, quantization_config
    )

    dataset = None
    unet_path = new_model_save_dir / DIFFUSION_MODEL_UNET_SUBFOLDER / unet_file_name
    if quantization_config is not None and quantization_config.dataset is not None:
        dataset = quantization_config.dataset
        # load the UNet model uncompressed to apply hybrid quantization further
        unet = cls.load_model(unet_path)
        # apply weight-only compression to the other `components`, without a dataset
        quantization_config.dataset = None
Review comment: This is an error-prone approach, since it changes the values of the input argument, IMO. Please think about how we can make it safer.
    else:
        unet = cls.load_model(unet_path, quantization_config)
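In response to the review comment above, one possible sketch of how to avoid mutating the caller's `quantization_config` is to compress the other components with a copy of the config (illustrative only, not part of this diff; it assumes the config object can be shallow-copied safely):

```python
from copy import copy

if quantization_config is not None and quantization_config.dataset is not None:
    dataset = quantization_config.dataset
    # Load the UNet uncompressed so hybrid quantization can be applied to it later.
    unet = cls.load_model(unet_path)
    # Use a copy without the dataset for weight-only compression of the other
    # components, leaving the caller's quantization_config untouched.
    weight_only_config = copy(quantization_config)
    weight_only_config.dataset = None
    # `weight_only_config` would then be passed wherever the other components are loaded.
else:
    unet = cls.load_model(unet_path, quantization_config)
```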

    components = {
        "vae_encoder": new_model_save_dir / DIFFUSION_MODEL_VAE_ENCODER_SUBFOLDER / vae_encoder_file_name,

@@ -291,6 +306,25 @@ def _from_pretrained(
    if model_save_dir is None:
        model_save_dir = new_model_save_dir

    if dataset is not None:
        sd_model = cls(unet=unet, config=config, model_save_dir=model_save_dir, **components, **kwargs)

        supported_pipelines = (
            OVStableDiffusionPipeline,
            OVStableDiffusionXLPipeline,
            OVLatentConsistencyModelPipeline,
        )
        if not isinstance(sd_model, supported_pipelines):
            raise NotImplementedError(f"Quantization in hybrid mode is not supported for {cls.__name__}")

        nsamples = quantization_config.num_samples if quantization_config.num_samples else 200
        unet_inputs = sd_model._prepare_unet_inputs(dataset, nsamples)

        from .quantization import _hybrid_quantization

        unet = _hybrid_quantization(sd_model.unet.model, quantization_config, dataset=unet_inputs)
        quantization_config.dataset = dataset

    return cls(
        unet=unet,
        config=config,

@@ -300,6 +334,58 @@ def _from_pretrained(
        **kwargs,
    )

||
def _prepare_unet_inputs( | ||
self, | ||
dataset: Union[str, List[Any]], | ||
num_samples: int, | ||
height: Optional[int] = 512, | ||
width: Optional[int] = 512, | ||
seed: Optional[int] = 42, | ||
**kwargs, | ||
) -> Dict[str, Any]: | ||
self.compile() | ||
|
||
if isinstance(dataset, str): | ||
dataset = deepcopy(dataset) | ||
available_datasets = PREDEFINED_SD_DATASETS.keys() | ||
if dataset not in available_datasets: | ||
raise ValueError( | ||
f"""You have entered a string value for dataset. You can only choose between | ||
{list(available_datasets)}, but the {dataset} was found""" | ||
) | ||
|
||
from datasets import load_dataset | ||
|
||
dataset_metadata = PREDEFINED_SD_DATASETS[dataset] | ||
dataset = load_dataset(dataset, split=dataset_metadata["split"], streaming=True).shuffle(seed=seed) | ||
input_names = dataset_metadata["inputs"] | ||
dataset = dataset.select_columns(list(input_names.values())) | ||
|
||
def transform_fn(data_item): | ||
return {inp_name: data_item[column] for inp_name, column in input_names.items()} | ||
|
||
else: | ||
|
||
def transform_fn(data_item): | ||
return data_item if isinstance(data_item, (list, dict)) else [data_item] | ||
|
||
from .quantization import InferRequestWrapper | ||
|
||
calibration_data = [] | ||
self.unet.request = InferRequestWrapper(self.unet.request, calibration_data) | ||
|
||
for inputs in dataset: | ||
inputs = transform_fn(inputs) | ||
if isinstance(inputs, dict): | ||
self.__call__(**inputs, height=height, width=width) | ||
else: | ||
self.__call__(*inputs, height=height, width=width) | ||
if len(calibration_data) > num_samples: | ||
break | ||
|
||
self.unet.request = self.unet.request.request | ||
return calibration_data[:num_samples] | ||
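For reference, `PREDEFINED_SD_DATASETS` (imported from `.utils` above) is expected to map each supported dataset name to its split and a column mapping consumed by `transform_fn`; a hypothetical entry might look like the following (its actual contents are not part of this diff):

```python
# Hypothetical shape of the metadata used by _prepare_unet_inputs: the split to
# stream and a mapping from pipeline input names to dataset column names.
PREDEFINED_SD_DATASETS = {
    "conceptual_captions": {"split": "train", "inputs": {"prompt": "caption"}},
}
```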

@classmethod
def _from_transformers(
    cls,