
Commit 08fc8ed

PenghuiCheng, echarlaix, AlexKoff88, ljaljushkin, and helena-intel authored
Support weight-only quantization with quantized operators in intel-extension-for-transformers. (#455)
* Support weight-only quantization with quantized operators in intel-extension-for-transformers
* Update code style
* Update readme for weight-only quantization example
* Update code
* Adapt intel-extension-for-transformers 1.3 API change (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Support weight-only quantization with quantized operators in intel-extension-for-transformers
* Update code
* Rebase code on main branch (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Update example (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Update optimum/intel/neural_compressor/quantization.py (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* [OV]: Fixed inference after 4 bit weight compression (#569): fixed inference issue, updated optimum/intel/openvino/modeling_decoder.py, applied review comments, fixed issue when request is None (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Updated docs with load_in_4bit (#558): updated documentation, fixed typo (Co-authored-by: Ella Charlaix <ella@huggingface.co>)
* Update Transformers dependency requirements (#571)
* Fix compatibility for latest transformers release (#570): updated setup, fixed test input size, fixed prepare generation for llama models
* Deprecate compression options (#565): fixed configuration, updated CLI argument, documentation and docs/source/inference.mdx, deprecated torch nn modules for the OV quantizer, fixed OV config for fp32 models, added check for configuration, fixed ratio default value for SD models, added quantization_config argument for OVModel, added default config for causal LM, fixed warning message (Co-authored-by: Alexander Kozlov <alexander.kozlov@intel.com>)
* Add default quantization int4 config for Mixtral-8x7B (#576)
* Update stable diffusion example requirements (#579)
* Fix collecting duplicate tensors in quantization calibration dataset (#577): added deepcopying of inputs collected by InferRequestWrapper plus a test covering the fixed issue, added copying to other data cache appends, removed the need for real test data, processed __call__ calls properly, added soundfile and librosa to test requirements
* Save an openvino config summarizing all information related to quantization when saving model (#578): removed default compression value, set default compression config when not provided, saved openvino config to include quantization configuration, removed quantization_config key from ov_config, added tests, updated setup
* Fix warning (#582): fixed warning message
* Add reference to the temporary directory for windows fix (#581)
* Fix documentation (#583)
* Add llama test model to cover MQA (#585): changed llama test model to cover MQA, kept llama and llama2 in tests
* Include nncf in openvino extra (#586)
* Fix title documentation (#588)
* Update OpenVINO documentation links in README.md (#587): links are now aligned with the OpenVINO 2024.0 documentation and use permalinks instead of direct links when possible; also updated inference.mdx, index.mdx and installation.mdx
* Fix default int8 quantization for CLI (#592)
* Change model output parameter to last_hidden_states for IPEXModel (#589): added output name to IPEX model, updated IPEX model testing
* Add IPEX model patcher (#567): added llama model patcher, supported assisted decoding and added reorder cache function, added comments for _prepare_past_key_values, ipex_rope, ipex_scale_dot_product and enable_tpp, fixed model_dtype, used torch.no_grad in jit trace to avoid the auto_kernel_selection issue, added tests for IPEX model generation with multiple inputs, removed __get__(self) since _reorder_cache is a static method of the class, checked if _reorder_cache is a static method, fixed raising of the import error, updated API name and testing, disabled until ipex version 2.5.0 (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Updates weight quantization section in the docs (#593)
* Remove accelerate and onnxruntime from required dependencies (#590): added accelerate to the import backend mapping, added eval method to OVModels, added onnxruntime install for the OV tests, fixed expected int8 test
* Fix OpenVINO image classification examples (#598)
* Fix weights compression for OpenVINO models (#596): hot fix for weights compression, rewrote mock tests
* Fix default ov config (#600)
* Add warning for transformers>=4.38 and OpenVINO 2024.0 (#599): used is_openvino_version to compare versions, showed the version warning only for llama and gpt-bigcode, included the affected model types and the OpenVINO version in the warning message
* Add hybrid quantization for StableDiffusion pipelines (#584): fixed LCM bug, reworked dataset processing, added documentation, removed SDXL test
* Show device name in _print_compiled_model_properties (#541): enable CACHE_DIR also for devices like "GPU:0", updated optimum/intel/openvino/modeling_seq2seq.py, changed the check for GPU devices (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Update code with comments (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Fixed pylint error (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Update optimum/intel/neural_compressor/configuration.py (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Fixed example and UT for weight-only quantization (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Fixed pre-CI test, UT and examples errors (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Update tests/openvino/test_modeling_basic.py (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Update examples/neural_compressor/language-modeling/README.md and run_clm.py (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Load weight-only quantized model with INCModelForCausalLM (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Changed parameter names for GPTQ in example (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Changed parameter order in INCQuantizer.quantize (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Fixed UT error (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Update examples/neural_compressor/text-generation/run_generation.py (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Update optimum/intel/neural_compressor/quantization.py (Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>)
* Update import message (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Limit intel-extension-for-transformers version (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Limit torch version for weight-only quantization (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)
* Fixed doc building error (Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>)

---------

Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>
Co-authored-by: Alexander Kozlov <alexander.kozlov@intel.com>
Co-authored-by: Ella Charlaix <ella@huggingface.co>
Co-authored-by: Lyalyushkin Nikolay <nikolay.lyalyushkin@intel.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Nikita Savelyev <nikita.savelyev@intel.com>
Co-authored-by: jiqing-feng <107918818+jiqing-feng@users.noreply.github.com>
Co-authored-by: Ekaterina Aidova <ekaterina.aidova@intel.com>
Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com>
Co-authored-by: Liubov Talamanova <liubov.talamanova@intel.com>
1 parent a3bf172 commit 08fc8ed
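For orientation, here is a minimal sketch of the workflow this commit enables, assembled from the APIs touched in the diffs below (`WeightOnlyQuantConfig`, `INCQuantizer`, `INCModelForCausalLM`). The model name, output directory, and the exact `INCQuantizer.from_pretrained` call are illustrative assumptions rather than part of the commit:

```python
from transformers import AutoModelForCausalLM
from intel_extension_for_transformers.transformers.utils.config import WeightOnlyQuantConfig
from optimum.intel.neural_compressor import INCQuantizer

# Hypothetical base model; any causal LM from the Hub would do.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

# 4-bit round-to-nearest (RTN) weight-only quantization, mirroring the defaults
# added to run_clm.py below (weight_dtype="int4_clip", algorithm="RTN").
quantization_config = WeightOnlyQuantConfig(weight_dtype="int4_clip", group_size=-1, algorithm="RTN")

quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(
    quantization_config=quantization_config,
    save_directory="gpt-neo-125m-woq-int4",
    # Depending on the algorithm, a calibration_dataset may also be needed;
    # run_clm.py below passes the tokenized training set for the weight_only path.
)
```

The saved directory can then be reloaded through `INCModelForCausalLM.from_pretrained`, which is what the `modeling_base.py` change at the end of this diff handles.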

File tree

12 files changed: +331 −191 lines changed


.github/workflows/test_inc.yml

+6-1
@@ -30,8 +30,13 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
+          pip install cmake
+          pip install py-cpuinfo
+          pip install torch==2.1.0 torchaudio==2.1.0 torchvision==0.16 --extra-index-url https://download.pytorch.org/whl/cpu
           pip install .[neural-compressor,diffusers,tests]
-          pip install intel-extension-for-pytorch
+          pip install intel-extension-for-pytorch==2.1.100
+          pip install intel-extension-for-transformers==1.3.2
+          pip install peft
       - name: Test with Pytest
         run: |
           pytest tests/neural_compressor/

examples/neural_compressor/language-modeling/README.md

+1-1
@@ -97,4 +97,4 @@ respectively `dynamic`, `static`, `weight_only` or `aware_training`.

 The flag `--verify_loading` can be passed along to verify that the resulting quantized model can be loaded correctly.

-> **_Note:_** `weight_only` quantization_approach requires neural-compressor >= 2.3
+> **_Note:_** `weight_only` quantization_approach requires `neural-compressor` >= 2.3 and `intel-extension-for-transformers` >= 1.3.
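To surface that requirement early, a small, hypothetical pre-flight check could look as follows; the version bounds come from the note above, while the helper itself is not part of this commit:

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

# Minimum versions documented above for the `weight_only` quantization approach.
MINIMUM_VERSIONS = {"neural-compressor": "2.3", "intel-extension-for-transformers": "1.3"}

for package, minimum in MINIMUM_VERSIONS.items():
    try:
        installed = version(package)
    except PackageNotFoundError as error:
        raise ImportError(f"`{package}` is required for `weight_only` quantization.") from error
    if Version(installed) < Version(minimum):
        raise ImportError(f"`{package}` >= {minimum} is required, found {installed}.")
```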

examples/neural_compressor/language-modeling/requirements.txt

+2
@@ -3,3 +3,5 @@ torch >= 1.9
 datasets >= 1.8.0
 sentencepiece != 0.1.92
 protobuf
+intel-extension-for-transformers >= 1.3
+peft

examples/neural_compressor/language-modeling/run_clm.py

+66-29
@@ -57,6 +57,14 @@
 from transformers.utils.versions import require_version

 from optimum.intel.neural_compressor import INCModelForCausalLM, INCQuantizer, INCTrainer
+from optimum.intel.utils.import_utils import (
+    INTEL_EXTENSION_FOR_TRANSFORMERS_IMPORT_ERROR,
+    is_intel_extension_for_transformers_available,
+)
+
+
+if is_intel_extension_for_transformers_available():
+    from intel_extension_for_transformers.transformers.utils.config import WeightOnlyQuantConfig


 os.environ["CUDA_VISIBLE_DEVICES"] = ""
@@ -143,7 +151,9 @@ class OptimizationArguments:
     )
     quantization_approach: str = field(
         default="dynamic",
-        metadata={"help": "Quantization approach. Supported approach are static, dynamic and aware_training."},
+        metadata={
+            "help": "Quantization approach. Supported approach are static, dynamic aware_training and weight_only."
+        },
     )
     smooth_quant: bool = field(
         default=False,
@@ -196,9 +206,13 @@ class OptimizationArguments:
         default=False,
         metadata={"help": "Whether or not to verify the loading of the quantized model."},
     )
-    bits: int = field(
-        default=8,
-        metadata={"help": "Bits for weight only quantization, 1-8 bits."},
+    bits: str = field(
+        default="4",
+        metadata={"help": "Bits number of weight for weight only quantization. 1~8 bits."},
+    )
+    weight_dtype: str = field(
+        default="int4_clip",
+        metadata={"help": "weight dtype for weight only quantization."},
     )
     group_size: int = field(
         default=-1,
@@ -214,10 +228,29 @@ class OptimizationArguments:
     )
     quantization_methodology: str = field(
         default="RTN",
+        metadata={"help": "Quantization methodology for weight only quantization. Choose from 'RTN' and 'GPTQ'."},
+    )
+    damp_percent: float = field(
+        default=0.01,
         metadata={
-            "help": "Quantization methodology for weight only quantization. Choose from 'RTN', 'AWQ' and 'GPTQ'."
+            "help": "Percentage of Hessian's diagonal values average, which will be added to Hessian's diagonal to increase numerical stability, used for GPTQ quantization"
         },
     )
+    gptq_block_size: int = field(
+        default=128,
+        metadata={"help": "Block size. sub weight matrix size to run GPTQ."},
+    )
+    num_calibration_samples: int = field(
+        default=128, metadata={"help": "Number of examples to use for the GPTQ calibration step."}
+    )
+    use_max_length: bool = field(
+        default=False,
+        metadata={"help": "Set all sequence length to be same length of args.gptq_pad_max_length"},
+    )
+    pad_max_length: int = field(
+        default=2048,
+        metadata={"help": "Calibration dataset sequence max length, this should align with your model config"},
+    )


 @dataclass
@@ -625,26 +658,30 @@ def compute_metrics(eval_preds):
     else:
         recipes = {}
     if optim_args.quantization_approach == "weight_only":
-        op_type_dict = {
-            ".*": {
-                "weight": {
-                    "bits": optim_args.bits,
-                    "group_size": optim_args.group_size,
-                    "scheme": optim_args.weight_only_scheme,
-                    "algorithm": optim_args.quantization_methodology,
-                },
-            },
-        }
+        if not is_intel_extension_for_transformers_available():
+            raise ImportError(INTEL_EXTENSION_FOR_TRANSFORMERS_IMPORT_ERROR.format("WeightOnly quantization"))
+        if optim_args.apply_pruning or optim_args.apply_distillation:
+            raise ValueError("Weight only quantization and pruning or distillation cannot be combined.")
         if optim_args.quantization_methodology == "GPTQ":
-            gptq_args = {
-                "pad_max_length": block_size,
+            algorithm_args = {
+                "act_order": False,
+                "percdamp": optim_args.damp_percent,
+                "block_size": optim_args.gptq_block_size,
+                "nsamples": optim_args.num_calibration_samples,
+                "use_max_length": optim_args.use_max_length,
+                "pad_max_length": optim_args.pad_max_length,
             }
-            recipes.update({"gptq_args": gptq_args})
+        quantization_config = WeightOnlyQuantConfig(
+            weight_dtype=optim_args.weight_dtype,
+            group_size=optim_args.group_size,
+            scheme=optim_args.weight_only_scheme,
+            algorithm=optim_args.quantization_methodology,
+            algorithm_args=algorithm_args if optim_args.quantization_methodology == "GPTQ" else None,
+        )
     else:
-        op_type_dict = {}
-    quantization_config = PostTrainingQuantConfig(
-        approach=optim_args.quantization_approach, op_type_dict=op_type_dict, recipes=recipes
-    )
+        quantization_config = PostTrainingQuantConfig(
+            approach=optim_args.quantization_approach, recipes=recipes
+        )

     if optim_args.apply_pruning:
         if optim_args.end_step is None:
@@ -732,15 +769,15 @@ def compute_metrics(eval_preds):
         quantizer.quantize(
             quantization_config=quantization_config,
             save_directory=training_args.output_dir,
-            calibration_dataset=train_dataset
-            if optim_args.quantization_approach in ["static", "weight_only"]
-            else None,
-            batch_size=1  # batch_size > 1 for GPTQ is WIP
-            if optim_args.quantization_approach == "weight_only" and optim_args.quantization_methodology == "GPTQ"
-            else training_args.per_device_train_batch_size,
-            weight_only=True if optim_args.quantization_approach == "weight_only" else False,
+            calibration_dataset=(
+                train_dataset if optim_args.quantization_approach in ["static", "weight_only"] else None
+            ),
+            batch_size=(
+                1 if optim_args.quantization_approach == "weight_only" else training_args.per_device_train_batch_size
+            ),
         )
         trainer.model = quantizer._quantized_model
+
     if optim_args.apply_quantization and optim_args.verify_loading:
         loaded_model = INCModelForCausalLM.from_pretrained(training_args.output_dir)
         tokens = tokenizer("This is a sample input", return_tensors="pt")
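For reference, the GPTQ branch added above boils down to the following configuration. The values are the new defaults from `OptimizationArguments`; the `scheme` value and the exact meaning of each `algorithm_args` key follow intel-extension-for-transformers 1.3 and are stated here as assumptions, not verified behavior:

```python
from intel_extension_for_transformers.transformers.utils.config import WeightOnlyQuantConfig

# Defaults mirrored from the OptimizationArguments fields added above.
algorithm_args = {
    "act_order": False,       # keep the original column order during GPTQ
    "percdamp": 0.01,         # damp_percent: Hessian dampening for numerical stability
    "block_size": 128,        # gptq_block_size: sub weight matrix size to run GPTQ
    "nsamples": 128,          # num_calibration_samples
    "use_max_length": False,  # pad every calibration sample to pad_max_length
    "pad_max_length": 2048,   # calibration dataset sequence max length
}

quantization_config = WeightOnlyQuantConfig(
    weight_dtype="int4_clip",
    group_size=-1,
    scheme="sym",             # assumed default of --weight_only_scheme
    algorithm="GPTQ",
    algorithm_args=algorithm_args,
)
```

`INCQuantizer.quantize` is then called with this config, the tokenized training set as `calibration_dataset`, and `batch_size=1`, exactly as in the last hunk above.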

examples/neural_compressor/text-generation/run_generation.py

+1-3
@@ -368,9 +368,7 @@ def calibration_fn(p_model):

     args.length = adjust_length_to_model(
         args.length,
-        max_sequence_length=model.config.max_position_embeddings
-        if hasattr(model.config, "max_position_embeddings")
-        else 0,
+        max_sequence_length=getattr(model.config, "max_position_embeddings", 0),
     )
     logger.info(args)

optimum/intel/neural_compressor/__init__.py

+1-1
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from ..utils.import_utils import is_diffusers_available
+from ..utils.import_utils import is_diffusers_available, is_intel_extension_for_transformers_available
 from .configuration import INCConfig
 from .modeling_base import (
     INCModel,

optimum/intel/neural_compressor/modeling_base.py

+30-1
@@ -43,7 +43,12 @@
 from optimum.intel.generation import BaseModelForCausalLM

 from ...modeling_base import OptimizedModel
-from ..utils.import_utils import _torch_version, is_torch_version
+from ..utils.import_utils import (
+    _torch_version,
+    is_intel_extension_for_transformers_available,
+    is_torch_version,
+    requires_backends,
+)
 from .configuration import INCConfig
 from .utils import WEIGHTS_NAME

@@ -63,6 +68,11 @@
 """


+if is_intel_extension_for_transformers_available():
+    from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM as ITREX_WOQ_MODEL
+    from intel_extension_for_transformers.transformers.utils import WeightOnlyQuantConfig
+
+
 class INCModel(OptimizedModel):
     auto_model_class = AutoModel
     export_feature = "feature-extraction"
@@ -131,6 +141,25 @@ def _from_pretrained(
         model_save_dir = Path(model_cache_path).parent
         inc_config = None
         msg = None
+        try:
+            requires_backends(cls, ["intel_extension_for_transformers"])
+            quantization_config = WeightOnlyQuantConfig.from_pretrained(model_id)
+            if getattr(
+                quantization_config, "algorithm", None
+            ) is not None and quantization_config.algorithm.lower() in ["rtn", "gptq", "awq", "autoaround"]:
+                return ITREX_WOQ_MODEL.from_pretrained(
+                    pretrained_model_name_or_path=model_id,
+                    use_auth_token=use_auth_token,
+                    revision=revision,
+                    force_download=force_download,
+                    cache_dir=cache_dir,
+                    local_files_only=local_files_only,
+                    subfolder=subfolder,
+                    trust_remote_code=trust_remote_code,
+                    **kwargs,
+                )
+        except EnvironmentError:
+            msg = "The model is not quantized with weight-only quantization."
         try:
             inc_config = INCConfig.from_pretrained(model_id)
             if not is_torch_version("==", inc_config.torch_version):
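With this fallback in place, a checkpoint produced by weight-only quantization is loaded through the same entry point as any other INC model: when a `WeightOnlyQuantConfig` whose `algorithm` matches one of the values checked above is found next to the weights, loading is delegated to the intel-extension-for-transformers `AutoModelForCausalLM`. A minimal sketch, where the directory name is illustrative and the tokenizer is assumed to have been saved alongside the model:

```python
from transformers import AutoTokenizer
from optimum.intel.neural_compressor import INCModelForCausalLM

# Hypothetical output of INCQuantizer.quantize(..., save_directory=...).
checkpoint = "gpt-neo-125m-woq-int4"

model = INCModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokens = tokenizer("This is a sample input", return_tensors="pt")
outputs = model.generate(**tokens, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```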
