# Benchmarking Script for Large Language Models

This script provides a unified approach to estimate performance for Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models using nearly identical code and pre-collected models.

### 1. Prepare Python Virtual Environment for LLM Benchmarking

``` bash
python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
pip install -r requirements.txt
```

> Note:
> For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:
> `pip install -U --upgrade-strategy eager -r requirements.txt`

#### (Optional) Hugging Face Login

Log in to Hugging Face if you want to use non-public models:

```bash
huggingface-cli login
```

### 2. Convert Model to OpenVINO IR Format

The `optimum-cli` tool simplifies converting Hugging Face models to the OpenVINO IR format.
- Detailed documentation can be found in the [Optimum-Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export).
- To learn more about weight compression, see the [NNCF Weight Compression Guide](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
- For additional guidance on running inference with OpenVINO for LLMs, see the [OpenVINO LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).

**Usage:**

```bash
optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>

optimum-cli export openvino -h # For detailed information
```

* `--model <MODEL_ID>`: model ID for downloading from the [huggingface_hub](https://huggingface.co/models), or the path to a local directory containing the PyTorch model.
* `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32, fp16, int8, int4, mxfp4`.
* `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.

**NOTE:**
- Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.

**Example:**
```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
```
**Resulting file structure:**

```console
models
└── llama-2-7b-chat
    ├── config.json
    ├── generation_config.json
    ├── openvino_detokenizer.bin
    ├── openvino_detokenizer.xml
    ├── openvino_model.bin
    ├── openvino_model.xml
    ├── openvino_tokenizer.bin
    ├── openvino_tokenizer.xml
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── tokenizer.model
```
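
As a variation on the example above, 4-bit weight compression can be requested with the same command; the output directory name below is only an illustration:

```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format int4 models/llama-2-7b-chat-int4
```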

### 3. Benchmark LLM Model

To benchmark the performance of the LLM, use the following command:

``` bash
python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/ -n 2
python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
```

**Parameters:**
- `-m`: Path to the model.
- `-d`: Inference device (default: CPU).
- `-r`: Path to the CSV report.
- `-f`: Framework (default: ov).
- `-p`: Interactive prompt text.
- `-pf`: Path to a JSONL file containing prompts (see the example after this list).
- `-n`: Number of iterations (default: 0); the first iteration is excluded from the results.
- `-ic`: Limit the output token size (default: 512) for text generation and code generation models.
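
For reference, the file passed with `-pf` is a JSON Lines file with one prompt object per line. The sketch below assumes a single `prompt` field per line, as in the sample files under `prompts/`; the texts are placeholders:

```json
{"prompt": "What is OpenVINO?"}
{"prompt": "Explain the difference between latency and throughput in one paragraph."}
```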

**Additional options:**
``` bash
python ./benchmark.py -h # for more information
```
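
Several of the parameters above can be combined in a single run; the device, report path, and token limit below are illustrative values:

```bash
python benchmark.py -m models/llama-2-7b-chat/ -d GPU -r report.csv -n 2 -ic 256
```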

#### Benchmarking the Original PyTorch Model

To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch selected as the framework via the `-f pt` parameter:

```bash
# Download PyTorch Model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
# Benchmark with PyTorch Framework
python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
```

> **Note:** If needed, you can install a specific OpenVINO version using pip:
> ``` bash
> # e.g.
> pip install openvino==2024.4.0
> # Optional: install the OpenVINO nightly package if needed.
> # OpenVINO nightly is pre-release software and has not undergone full release validation or qualification.
> pip uninstall openvino
> pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
> ```

### 4. Benchmark LLM with `torch.compile()`

The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels with a specified backend.

Before benchmarking, you need to download the original PyTorch model. Use the following command to download the model locally:

```bash
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
```

To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` and `openvino` (default). Example:

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
```

> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:
>
> ```bash
> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
> ```

### 5. Running on 2-Socket Platforms

The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify this behavior.

| OpenVINO Version | Behavior |
|:-----------------|:---------|
| Before 2024.0.0  | `streams.num(1)`<br>Executes on both sockets. |
| 2024.0.0         | `streams.num(1)`<br>Executes on the same socket the application is running on. |
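
For instance, on Linux you can pin the benchmark to a single socket with `numactl`; the NUMA node IDs below are illustrative and depend on your machine's topology:

```bash
# Bind both CPU and memory allocation to NUMA node 0 (the first socket)
numactl --cpunodebind=0 --membind=0 python benchmark.py -m models/llama-2-7b-chat/ -n 2
```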

For example, passing `--load_config config.json` with the following content results in `streams.num(1)` and execution on both sockets:
```json
{
  "INFERENCE_NUM_THREADS": <NUMBER>
}
```
`<NUMBER>` is the total number of physical cores across both sockets.

### 6. Execution on CPU Device

OpenVINO is built with the [oneTBB](https://github.com/oneapi-src/oneTBB/) threading library by default, while Torch uses [OpenMP](https://www.openmp.org/). Both threading libraries use ['busy-wait spin'](https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html) by default. When running an LLM pipeline on the CPU device, this causes threading overhead when switching between inference with OpenVINO (oneTBB) and postprocessing (for example, greedy search or beam search) with Torch (OpenMP).

**Alternative solutions:**
1. Use the `--genai` option, which runs the pipeline through the OpenVINO GenAI API instead of the Optimum-Intel API. In this case, postprocessing is also executed with the OpenVINO GenAI API.
2. Without the `--genai` option (i.e., using the Optimum-Intel API), set the environment variable [OMP_WAIT_POLICY](https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html) to `PASSIVE`, which disables the OpenMP 'busy-wait'; `benchmark.py` will also limit the number of Torch threads to avoid using CPU cores kept in 'busy-wait' by OpenVINO inference. See the sketch after this list.
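
A minimal command-line sketch of both options, assuming the model path from the earlier examples and that `--genai` is a plain switch:

```bash
# Option 1: run generation and postprocessing through the OpenVINO GenAI API
python benchmark.py -m models/llama-2-7b-chat/ -n 2 --genai

# Option 2: stay on the Optimum-Intel API, but disable the OpenMP busy-wait spin
OMP_WAIT_POLICY=PASSIVE python benchmark.py -m models/llama-2-7b-chat/ -n 2
```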

### 7. Additional Resources

- **Error Troubleshooting:** Check [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.

> [!IMPORTANT]
> The LLM bench code has been moved to the [tools](../../tools/llm_bench/) directory. Please use the tool from its new location.