Commit b6a4c78

Authored by Xin He (xin3he, xinhe3), Rafal Litka (RafLit), and kiazada
Cherry pick Habana software v1.20.0 (#2123)
* [SW-210525] Release HPU memory when loading neural_magic fp8 models (#48)
* [SW-211178] Save generation_config, if it exists, when saving a model (#57)
* [SW-210543] Update .gitignore to simplify the git message (#50)
* [SW-205334][SW-187731] llama70b vLLM: fix graph breaks with torch.compile; remove orig_mod from helper_modules; fix test_register_apis (#67)
* [SW-213890] Disable test_two_step_layer_wise temporarily (#84)
* [SW-205437] Support LM-HEAD patching (#79)
* Enhance and rename the fix_measurements tool to postprocessing_vllm_measurements (#82)
* [SW-214088] Fix graph break caused by PatchedMixtralMoE (#74)
* [SW-208528] Support FP8 per-channel Q/DQ (#13)
  - add per-channel QDQ support and unit tests
  - improve the get_scale_dtype function and the QuantInput/DequantOutput init
  - add scale_method and improve PCQ; fix PCQ scale_inv expanding
  - merge qdq_per_channel and qdq_per_tensor into a single qdq; move the scale_inv change into the QuantInput init
  - remove the scale_dtype list check; fix a missing axis parameter
* [SW-204341] Explicit scale format for ops (#73)
  - add a wrapper around the fp8 functions; the wrapper decides which flavor of the function (and which cast flavor) to call according to the scale format, and the helper modules call the wrapper
  - adjust the softmax API and remove commented-out code
  - enhance SDPA for measure and quant, remove the SDPA quantized ops, and add a missing arg in FSDPA
  - reland the per-op class with more enhancements and rename the class to wrapper
  - call with self in the patched lm_head; modify the fp8 matmul test to exercise the quantized matmul function
  - restore the backward-compatibility import protection
* [SW-213890] Revert "Disable test_two_step_layer_wise temporarily (#84)" (#86); reverts commit 27162ae
* Revert "[SW-205334][SW-187731] llama70b vLLM fix graph breaks with torch.com…" (#87); reverts commit 01a5734
* [ALGO-809] PatchedLmHeadLinearAllreduce: replace the sharding code with the one from deepspeed-fork (#85)
* Fix bug: FusedMoE object has no attribute w13_weight (#94)
* [SW-208588] Add HPU fp8 dynamic MoE (#88)
* Minor config fixes (#96): cosmetic fixes in quant_config; remove hooks
* [SW-196641] Fix type mismatch in linear quantization unit tests (#99); fix the atol value; add hp_dtype to the fp8 config dict before parsing
* [SW-214785] Apply PatchedModuleBase to all existing PatchedModules (#92)
* [SW-215319] Loosen the too-tight memory-usage threshold in test_block_wise.py (#100)
* [SW-215543] Revert "Minor config fixes (#96)" (#104); reverts commit fa40142
* Fix RowParallelLinear func names from string to tuple (#106)
* [SW-215615] Fix memory not being released while loading neural_magic models on multiple cards (#105)
* [SW-212423] Fix RuntimeError when loading a GPTQ model from HF (#70); skip tie_word_embeddings=False
* [SW-214785] Fix issue when self._mod_extra_config is None (#108)
* [SW-211826] [example] Demonstrate layer-wise, block-wise and lm_eval usage to quantize LLMs with limited host and device memory (#66)
* [SW-215295] Force a single object from the quantized func wrapper classes (#103); clear the factory object after module patching; move the cleanup to the Quantizer object
* [SW-216292] Minor update for lm-eval (#113): enable lm-eval 0.4.2 and expose `add_bos_token`
* [SW-209207] Add vLLM fp8 dynamic MoE (#116)
* [SW-216239] Align the Softmax fp8 scale calculation with the configuration (#112)
* [SW-217321] Skip auto-round tests due to CI breakage; remove an unneeded print (#119) (#125)
* [SW-207451] Implement block-wise calibration for LLM (#24). For LLMs, measurement in bf16 requires high HPU memory usage; this change helps measure bf16 llama-405b on 8 Gaudi2 cards, or llama-70b on 1 Gaudi card. Limitation: the lm_head layer cannot be measured; this may be enhanced later.
* [SW-197077] Fix bug in output arbitrary scales (#45)
* [SW-210500] [Optimum-Habana] [Regression] [fp8] [INC] No generated text for llava models [llava-1.5-7b-hf] [llava-1.5-13b-hf] (#54) (#77)
* [SW-213236] Resolve CPU memory issue in CI (#76) (#83); cherry-picked from 1.19
* [SW-213368] requirements_pt.txt: allow newer pydantic versions, >= 1.10.13 (#80); newer DeepSpeed uses pydantic v2, which has slightly different APIs
* [SW-212057] Enable scalar scale to support QDQ (#98)
* [SW-215845] Run some unit tests from the top-level API (#109)
* [SW-212629] Support saving weight-only quantization INT4 models in Hugging Face format (#101)
* [SW-205970] Update state_dict to save scalar scales (#6); update the state_dict method in the save/load functions
* Revert "[SW-205970] update state_dict to save scalar scales (#6)" (#114); reverts commit ffcb97e
* [SW-212092] Save vLLM-compatible format (#102)
  - add an assertion and report max_file_size in a human-readable form
  - default to the same saving behavior as Hugging Face
  - separate save functions for single-device and multi-device cases
  - remove the weight and scale conversion on G2
  - improve the convert-weight-to-vLLM-compatible function
  - replace print with logger; move unit_mapping to the common utils
* [SW-205970] Update state_dict to save scalar scales (#115); reland of #6 with Mixtral support
* [SW-215009] Support loading per-channel scales (#95); fix UT
* Refactoring scales (#22) (#122): [SW-197077] refactor the maxabs scales and add arbitrary scales
* [SW-199696] Support dynamic quantization (#128)
  - calculate dynamic scales using nn.Modules
  - code cleanup, review changes, and a model-print issue (circular dependency fix)
  - remove debug code from patching_common.py
* [SW-217334] Enable fp8 QDQ mode using PatchedModuleBase (#129)
* [SW-218871] Fix fp8 multi-card models not being loaded correctly (#138)
* [SW-218197] Fix bug in Mixtral unit_scale (#141)
* Update the version to 3.3 for release
* [SW-20808] Make sure the save & load format is an Enum object (#58)
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add xfail for torchvision
* Fix ILITV-3859; work around ILITV-3858; fix sdxl_smooth_quant; fix ILITV-3854

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Signed-off-by: changwang <changwang@habana.ai>
Signed-off-by: changwangss <changwang@habana.ai>
Signed-off-by: Uri Livne <ulivne@habana.ai>
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Xin <xin3.he@intel.com>
Signed-off-by: xin3he <xin3.he@intel.com>
Co-authored-by: Xin He <xinhe3@habana.ai>
Co-authored-by: RafLit <rafal.litka@intel.com>
Co-authored-by: Rafal Litka <rlitka@habana.ai>
Co-authored-by: Dany Kiazada <141814181+kiazada@users.noreply.github.com>
Co-authored-by: Nir David <124874956+nirda7@users.noreply.github.com>
Co-authored-by: Yuwen Zhou <yuwen.zhou@intel.com>
Co-authored-by: Wang, Chang <changwang@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Oz Abramovich <oabramovich@habana.ai>
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com>
Co-authored-by: Danny Semiat <dsemiat@habana.ai>
Co-authored-by: smarkovichgolan <smarkovich@habana.ai>
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Nadav Elyahu <88962733+nelyahu@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: chen, suyue <suyue.chen@intel.com>
Co-authored-by: Sun, Xuehao <xuehao.sun@intel.com>
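To make the [SW-204341] entry above concrete: the change adds a wrapper around the fp8 functions, and the wrapper, not the patched helper modules, decides which flavor of the op to call based on the scale format. The sketch below illustrates the dispatch idea only; the class and function names (ScaleFormat, QuantizedFuncWrapper, the matmul flavors) are hypothetical stand-ins, not the actual INC/Habana API.

```python
import torch
from enum import Enum, auto


class ScaleFormat(Enum):
    SCALAR = auto()  # plain Python-float scales
    CONST = auto()   # const tensor scales


# Illustrative low-level "flavors"; the real kernels are Habana fp8 ops.
def matmul_fp8_scalar_scale(a, b, scale_a, scale_b):
    return torch.matmul(a * scale_a, b * scale_b)


def matmul_fp8_tensor_scale(a, b, scale_a, scale_b):
    return torch.matmul(a * scale_a, b * scale_b)


class QuantizedFuncWrapper:
    """Chooses the op flavor once, according to the configured scale format."""

    _FLAVORS = {
        ScaleFormat.SCALAR: matmul_fp8_scalar_scale,
        ScaleFormat.CONST: matmul_fp8_tensor_scale,
    }

    def __init__(self, scale_format):
        self._fn = self._FLAVORS[scale_format]

    def __call__(self, *args, **kwargs):
        return self._fn(*args, **kwargs)


# Patched helper modules call the wrapper instead of picking a flavor themselves.
quant_matmul = QuantizedFuncWrapper(ScaleFormat.SCALAR)
out = quant_matmul(torch.randn(2, 4), torch.randn(4, 3), 0.5, 0.25)
```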
1 parent ce2f845 commit b6a4c78

File tree: 68 files changed, +3719 -2031 lines. (Only a subset of the changed files is shown below.)
.gitignore (+2)

@@ -22,3 +22,5 @@ lpot_workspace/
 .torch/
 node_modules
 build_tmp
+hqt_output*/
+inc_output*/

examples/3.x_api/pytorch/diffusion_model/diffusers/stable_diffusion/smooth_quant/main.py (+1)

@@ -400,6 +400,7 @@ def __call__(
             torch_dtype=dtype,
             use_safetensors=True,
         )
+        pipe = pipe.to(dtype)  # Ensure all modules are set as dtype
         pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

         if args.refiner:

examples/3.x_api/pytorch/diffusion_model/diffusers/stable_diffusion/smooth_quant/sdxl_smooth_quant.py (+1)

@@ -361,6 +361,7 @@ def main():
         torch_dtype=dtype,
         use_safetensors=True,
     )
+    pipeline = pipeline.to(dtype)  # Ensure all modules are set as dtype
     pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)

     # This is a list of prompts

@@ -0,0 +1,49 @@
# Step-by-step

Here we demonstrate FP8 quantization with some advanced techniques:

- Block-wise calibration: reduces the device memory required during calibration.
- Layer-wise quantization (based on memory mapping): reduces the host memory required during quantization.
- lm_eval evaluation for HPU: balances performance and memory usage; `--use_hpu_graph` is required.

Typically, quantization requires calibration with a high-precision model (such as bf16), which occupies a lot of device memory. Block-wise calibration splits the LLM into blocks and calibrates them one by one. Use `--enable_block_wise_calibration` to enable this feature.

By default, this example loads the model from disk into shared memory and copies it into physical host memory layer by layer during quantization; the occupied physical host memory is released promptly.

In this example, you can measure and quantize `llama3.1/Meta-Llama-3.1-405B-Instruct` in torch.bfloat16 with 8 Gaudi2 cards or even fewer, and the host memory requirement is also low.
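
The three pieces above map onto a small amount of code. The following condensed sketch shows the prepare → block-wise calibration → convert → save flow using the `neural_compressor.torch` helpers that ship with this example (`quantize.py` below); the model name and save path are illustrative.

```python
import torch
from neural_compressor.torch.quantization import FP8Config, prepare, convert, save
from neural_compressor.torch.utils.block_wise import block_wise_calibration
from neural_compressor.torch.utils.llm_utility import (
    initialize_model_and_tokenizer,
    get_default_llm_dataloader,
)

torch.set_grad_enabled(False)  # quantization does not need gradients

model, tokenizer = initialize_model_and_tokenizer("meta-llama/Llama-2-70b-hf", use_load=False, device="hpu")

# lm_head is excluded because block-wise calibration cannot measure it;
# measure_on_hpu=False avoids mapping the whole model onto the device.
qconfig = FP8Config(
    fp8_config="E4M3",
    scale_method="maxabs_hw",
    blocklist={"names": ["lm_head"]},
    measure_on_hpu=False,
    dump_stats_path="./hqt_output/measure",
)
model = prepare(model, qconfig)

# calibrate block by block to bound device memory usage
dataloader = get_default_llm_dataloader(
    tokenizer, dataset_name="NeelNanda/pile-10k", bs=1, nsamples=128, seq_len=128, seed=42
)
block_wise_calibration(model, dataloader)

# convert to fp8 and save in Hugging Face format
model = convert(model)
save(model, "llama2_70b_fp8/", format="huggingface")
tokenizer.save_pretrained("llama2_70b_fp8/")
```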
## Install DeepSpeed

Due to a known issue, [microsoft/DeepSpeed/issues/3207](https://github.com/microsoft/DeepSpeed/issues/3207), we recommend installing DeepSpeed as follows:

```shell
git clone https://github.com/HabanaAI/DeepSpeed.git
cd DeepSpeed
git checkout 1.19.0
pip install -e .
cd ..
```

# Run

## meta-llama/Llama-2-70b-hf

```bash
# Measure, quantize and save
deepspeed --num_gpus 2 quantize.py --model_name_or_path meta-llama/Llama-2-70b-hf --quantize --save --save_path llama2_70b_fp8/
# With block-wise calibration, we can quantize the 70B model with one Gaudi2 card
python quantize.py --model_name_or_path meta-llama/Llama-2-70b-hf --quantize --enable_block_wise_calibration --save --save_path llama2_70b_fp8/

# Load the fp8 model and verify accuracy
python quantize.py --model_name_or_path llama2_70b_fp8/ --load --use_hpu_graph --accuracy
```

> Note: To get the best performance from the fp8 model, please go to [optimum-habana](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8) to quantize the model. These advanced techniques will be upstreamed to optimum-habana soon.

## meta-llama/Llama-3.1-405B-Instruct

```bash
# Measure, quantize and save
deepspeed --num_gpus 8 quantize.py --model_name_or_path meta-llama/Llama-3.1-405B-Instruct --quantize --enable_block_wise_calibration --save --save_path llama3.1_405b_fp8/

# Load the fp8 model and verify accuracy
deepspeed --num_gpus 8 quantize.py --model_name_or_path llama3.1_405b_fp8/ --load --use_hpu_graph --accuracy
```
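
For reference, the `--load --use_hpu_graph --accuracy` commands above reduce to roughly the following calls, condensed from the `quantize.py` added in this example (the checkpoint path is illustrative):

```python
import torch
import habana_frameworks.torch.core as htcore
from habana_frameworks.torch.hpu import wrap_in_hpu_graph
from neural_compressor.torch.utils.llm_utility import initialize_model_and_tokenizer
from neural_compressor.evaluation.lm_eval import evaluate, LMEvalParser

# reload the fp8 checkpoint produced by the --save step
model, tokenizer = initialize_model_and_tokenizer("llama2_70b_fp8/", use_load=True, device="hpu")
model = model.eval().to("hpu")
model = wrap_in_hpu_graph(model)  # --use_hpu_graph
htcore.hpu_inference_initialize(model, mark_only_scales_as_const=True)

# run the lm_eval harness through neural_compressor's wrapper
eval_args = LMEvalParser(
    model="hf",
    user_model=model,
    tokenizer=tokenizer,
    batch_size=1,
    tasks="lambada_openai",
    device="hpu",
    pad_to_buckets=True,
)
results = evaluate(eval_args)
```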

@@ -0,0 +1,169 @@
import os
import argparse
import tqdm

# ensure that unnecessary memory is released during quantization.
os.environ.setdefault("PT_HPU_WEIGHT_SHARING", "0")
if int(os.getenv("WORLD_SIZE", "0")) > 0:
    os.environ.setdefault("PT_HPU_LAZY_ACC_PAR_MODE", "0")
    os.environ.setdefault("PT_HPU_ENABLE_LAZY_COLLECTIVES", "true")


import torch
import habana_frameworks.torch.core as htcore

from neural_compressor.torch.quantization import (
    FP8Config,
    prepare,
    convert,
    finalize_calibration,
    save,
    load,
)
from neural_compressor.torch.utils import get_used_hpu_mem_MB, get_used_cpu_mem_MB, logger, forward_wrapper
from neural_compressor.torch.utils.block_wise import block_wise_calibration
from neural_compressor.torch.utils.llm_utility import (
    initialize_model_and_tokenizer,
    get_default_llm_dataloader,
    llm_benchmark,
)

# use no_grad mode for quantization
torch.set_grad_enabled(False)
htcore.hpu_set_env()
hpu_mem_0 = get_used_hpu_mem_MB()
cpu_mem_0 = get_used_cpu_mem_MB()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Habana FP8 quantization.", formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("--model_name_or_path", type=str, default="meta-llama/Meta-Llama-3.1-405B", help="model name or path")
    parser.add_argument("--quantize", action="store_true", help="whether to quantize model")
    parser.add_argument("--scale_method", type=str, default="maxabs_hw", help="Choose scale method", choices=[
        # per-tensor
        "unit_scale", "hw_aligned_single_scale", "maxabs_hw", "maxabs_pow2",
        "maxabs_arbitrary", "maxabs_hw_opt_weight", "maxabs_pow2_opt_weight",
        # per-channel
        "act_maxabs_hw_weights_pcs_maxabs_pow2", "act_maxabs_hw_weights_pcs_opt_pow2",
        "act_maxabs_pow2_weights_pcs_maxabs_pow2", "act_maxabs_pow2_weights_pcs_opt_pow2",
    ])
    parser.add_argument("--use_hpu_graph", action="store_true", help="whether to use hpu graph mode to accelerate performance")
    parser.add_argument("--enable_block_wise_calibration", action="store_true", help="whether to use block-wise calibration")
    parser.add_argument("--disable_optimum_habana", action="store_true", help="whether to use adapt_transformers_to_gaudi")
    parser.add_argument("--save", action="store_true", help="whether to save the quantized model")
    parser.add_argument("--load", action="store_true", help="whether to load the quantized model")
    parser.add_argument("--save_path", type=str, default="saved_results", help="path to save the quantized model")
    parser.add_argument("--accuracy", action="store_true", help="accuracy measurement")
    parser.add_argument("--performance", action="store_true", help="performance measurement")
    parser.add_argument("--local_rank", type=int, default=0, metavar="N", help="Local process rank.")
    parser.add_argument("--batch_size", default=1, type=int, help="batch size for accuracy measurement.")
    parser.add_argument("--num_fewshot", default=0, type=int, help="num_fewshot of lm_eval.")
    parser.add_argument("--dump_stats_path", type=str, default="./hqt_output/measure", help="path and prefix to calibration info file.")
    parser.add_argument("--tasks", default="lambada_openai", type=str,
                        help="tasks for accuracy validation, text-generation and code-generation tasks are different.")
    parser.add_argument("--dataset_name", type=str, default="NeelNanda/pile-10k", help="dataset name for calibration dataloader")
    parser.add_argument("--nsamples", type=int, default=128, help="number of samples for calibration dataloader")
    parser.add_argument("--seq_len", type=int, default=128, help="sequence length for calibration dataloader and benchmarking")
    args = parser.parse_args()

    if not args.disable_optimum_habana:
        # Tweak generation so that it runs faster on Gaudi
        import transformers
        from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

        if args.quantize:
            orig_check_support_param_buffer_assignment = transformers.modeling_utils.check_support_param_buffer_assignment
            adapt_transformers_to_gaudi()
            # to protect memory mapping usage for quantization
            transformers.modeling_utils.check_support_param_buffer_assignment = orig_check_support_param_buffer_assignment
        else:
            adapt_transformers_to_gaudi()

    model, tokenizer = initialize_model_and_tokenizer(args.model_name_or_path, use_load=args.load, device="hpu")
    # show used memory
    logger.info(f"After loading model, used HPU memory: {round((get_used_hpu_mem_MB() - hpu_mem_0)/1024, 3)} GiB")
    logger.info(f"After loading model, used CPU memory: {round((get_used_cpu_mem_MB() - cpu_mem_0)/1024, 3)} GiB")

    if args.quantize:
        if args.enable_block_wise_calibration:
            logger.warning("Block-wise calibration is enabled, lm_head will be excluded from calibration.")

        # prepare
        qconfig = FP8Config(
            fp8_config="E4M3",
            scale_method=args.scale_method,
            blocklist={"names": ["lm_head"]} if args.enable_block_wise_calibration else {},  # block-wise cannot calibrate lm_head
            measure_on_hpu=False if args.enable_block_wise_calibration else True,  # to avoid device mapping of model
            dump_stats_path=args.dump_stats_path,
        )
        if args.scale_method in ["unit_scale", "hw_aligned_single_scale"]:
            model = convert(model, qconfig)
        else:
            model = prepare(model, qconfig)

            # calibration
            dataloader = get_default_llm_dataloader(
                tokenizer,
                dataset_name=args.dataset_name,
                bs=args.batch_size,
                nsamples=args.nsamples,
                seq_len=args.seq_len,
                seed=42,
            )
            if args.enable_block_wise_calibration:
                block_wise_calibration(model, dataloader)
            else:
                if args.use_hpu_graph:
                    from habana_frameworks.torch.hpu import wrap_in_hpu_graph
                    model = wrap_in_hpu_graph(model)
                logger.info("Calibration started")
                for data in tqdm.tqdm(dataloader):
                    forward_wrapper(model, data)
                logger.info("Calibration end")

            # convert
            model = convert(model)

        # show used memory
        logger.info(f"Used HPU memory: {round((get_used_hpu_mem_MB() - hpu_mem_0)/1024, 3)} GiB")
        logger.info(f"Used CPU memory: {round((get_used_cpu_mem_MB() - cpu_mem_0)/1024, 3)} GiB")
        if args.save:
            logger.info(f"Saving quantized model to {args.save_path}")
            save(model, args.save_path, format="huggingface")
            tokenizer.save_pretrained(args.save_path)
            logger.info(f"Saved quantized model to {args.save_path}")
        exit(0)  # model is wrapped during calibration, need to exit before accuracy and performance measurement

    # preprocess model for accuracy and performance measurement
    if not args.load:
        # compare fp8 with bf16, not fp32.
        model = model.to(torch.bfloat16)
    model = model.eval().to("hpu")
    if args.use_hpu_graph:
        from habana_frameworks.torch.hpu import wrap_in_hpu_graph
        model = wrap_in_hpu_graph(model)
    htcore.hpu_inference_initialize(model, mark_only_scales_as_const=True)

    if args.accuracy:
        from neural_compressor.evaluation.lm_eval import evaluate, LMEvalParser
        eval_args = LMEvalParser(
            model="hf",
            user_model=model,
            tokenizer=tokenizer,
            batch_size=args.batch_size,
            tasks=args.tasks,
            device="hpu",
            pad_to_buckets=True,
            num_fewshot=args.num_fewshot,
        )
        results = evaluate(eval_args)
        # show used memory
        logger.info(f"Used HPU memory: {round((get_used_hpu_mem_MB() - hpu_mem_0)/1024, 3)} GiB")
        logger.info(f"Used CPU memory: {round((get_used_cpu_mem_MB() - cpu_mem_0)/1024, 3)} GiB")

    if args.performance:
        llm_benchmark(model, args.batch_size, args.seq_len)
        # show used memory
        logger.info(f"Used HPU memory: {round((get_used_hpu_mem_MB() - hpu_mem_0)/1024, 3)} GiB")
        logger.info(f"Used CPU memory: {round((get_used_cpu_mem_MB() - cpu_mem_0)/1024, 3)} GiB")

@@ -0,0 +1,3 @@
lm-eval>=0.4.3
transformers >= 4.45.2, < 4.46.0  # refer to optimum-habana
datasets

@@ -0,0 +1,115 @@
#!/bin/bash
set -x

function main {

    init_params "$@"
    run_benchmark

}

# init params
function init_params {
    batch_size=1
    tuned_checkpoint=saved_results
    task=lambada_openai

    for var in "$@"
    do
        case $var in
            --topology=*)
                topology=$(echo $var |cut -f2 -d=)
            ;;
            --dataset_location=*)
                dataset_location=$(echo $var |cut -f2 -d=)
            ;;
            --input_model=*)
                input_model=$(echo $var |cut -f2 -d=)
            ;;
            --mode=*)
                mode=$(echo $var |cut -f2 -d=)
            ;;
            --batch_size=*)
                batch_size=$(echo $var |cut -f2 -d=)
            ;;
            --iters=*)
                iters=$(echo ${var} |cut -f2 -d=)
            ;;
            --int8=*)
                int8=$(echo ${var} |cut -f2 -d=)
            ;;
            --config=*)
                tuned_checkpoint=$(echo $var |cut -f2 -d=)
            ;;
            *)
                echo "Error: No such parameter: ${var}"
                exit 1
            ;;
        esac
    done

}


# run_benchmark
function run_benchmark {
    python_cmd="python"

    if [[ ${mode} == "accuracy" ]]; then
        mode_cmd=" --accuracy "
    elif [[ ${mode} == "performance" ]]; then
        mode_cmd=" --performance"
    else
        echo "Error: No such mode: ${mode}"
        exit 1
    fi

    if [ "${topology}" = "opt_125m_fp8" ]; then
        model_name_or_path="facebook/opt-125m"
        tuned_checkpoint="opt_125m_fp8"
    elif [ "${topology}" = "opt_125m_fp8_pcs" ]; then
        model_name_or_path="facebook/opt-125m"
        tuned_checkpoint="opt_125m_fp8_pcs"
    elif [ "${topology}" = "opt_125m_fp8_block_wise" ]; then
        model_name_or_path="facebook/opt-125m"
        tuned_checkpoint="opt_125m_fp8_block_wise"
    elif [ "${topology}" = "llama3_1_8b_fp8" ]; then
        model_name_or_path="/git_lfs/data/pytorch/llama3.1/Meta-Llama-3.1-8B-Instruct/"
        tuned_checkpoint="/software/llama_fp8/llama3_1_8b_fp8"
    elif [ "${topology}" = "llama3_1_8b_fp8_block_wise" ]; then
        model_name_or_path="/git_lfs/data/pytorch/llama3.1/Meta-Llama-3.1-8B-Instruct/"
        tuned_checkpoint="/software/llama_fp8/llama3_1_8b_fp8_block_wise"
    elif [ "${topology}" = "llama3_1_8b_fp8_block_wise_pcs" ]; then
        model_name_or_path="/git_lfs/data/pytorch/llama3.1/Meta-Llama-3.1-8B-Instruct/"
        tuned_checkpoint="/software/llama_fp8/llama3_1_8b_fp8_block_wise_pcs"
    elif [ "${topology}" = "llama2_70b_fp8_block_wise" ]; then
        model_name_or_path="/git_lfs/data/pytorch/llama2/Llama-2-70b-hf/"
        tuned_checkpoint="/software/llama_fp8/llama2_70b_fp8_block_wise"
    elif [ "${topology}" = "mixtral_8x7b_fp8_block_wise" ]; then
        model_name_or_path="mistralai/Mixtral-8x7B-v0.1"
        tuned_checkpoint="/software/mixtral_fp8/mixtral_8x7b_fp8_block_wise"
    elif [ "${topology}" = "llama3_1_405b_fp8_block_wise" ]; then
        model_name_or_path="/git_lfs/data/pytorch/llama3.1/Meta-Llama-3.1-405B-Instruct/"
        tuned_checkpoint="/software/llama_fp8/llama3_1_405b_fp8_block_wise"
        python_cmd="deepspeed --num_gpus 8"
    fi

    if [[ ${int8} == "true" ]]; then
        ${python_cmd} quantize.py \
            --model ${tuned_checkpoint} \
            --load \
            --task ${task} \
            --batch_size ${batch_size} \
            --use_hpu_graph \
            ${mode_cmd}
    else
        ${python_cmd} quantize.py \
            --model ${model_name_or_path} \
            --task ${task} \
            --batch_size ${batch_size} \
            --use_hpu_graph \
            ${mode_cmd}
    fi
}

main "$@"
