# Benchmarking Script for Large Language Models

This script provides a unified approach to estimating the performance of Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models using nearly identical code and pre-collected models.
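
To illustrate the "nearly identical code" point, below is a minimal sketch of the kind of Optimum-Intel pipeline the script builds on. It is not the benchmark script itself; the model path, prompt, and generation settings are illustrative.

```python
# Minimal sketch of an Optimum-Intel text-generation pipeline (illustrative only;
# benchmark.py adds timing, prompt handling, and CSV reporting on top of this).
from transformers import AutoTokenizer

model_dir = "models/llama-2-7b-chat"  # directory with an exported OpenVINO model
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# OpenVINO path: the OV* model classes mirror the transformers Auto* classes.
from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_dir)

# PyTorch path would differ only in the model class:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```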

### 1. Prepare Python Virtual Environment for LLM Benchmarking

``` bash
python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
pip install -r requirements.txt
```

> Note:
> For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:
> `pip install -U --upgrade-strategy eager -r requirements.txt`

#### (Optional) Hugging Face Login

Log in to Hugging Face if you want to use non-public models:

```bash
huggingface-cli login
```

### 2. Convert Model to OpenVINO IR Format

The `optimum-cli` tool simplifies converting Hugging Face models to the OpenVINO IR format.
- Detailed usage is covered in the [Optimum-Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export).
- To learn more about weight compression, see the [NNCF Weight Compression Guide](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
- For additional guidance on running inference with OpenVINO for LLMs, see the [OpenVINO LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).

**Usage:**

```bash
optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>

optimum-cli export openvino -h # For detailed information
```

* `--model <MODEL_ID>`: model ID for downloading from the [Hugging Face Hub](https://huggingface.co/models), or a path to a directory containing a local PyTorch model.
* `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32`, `fp16`, `int8`, `int4`, `mxfp4`.
* `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.

**NOTE:**
- Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.

**Example:**

```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
```

**Resulting file structure:**

```console
    models
    └── llama-2-7b-chat
        ├── config.json
        ├── generation_config.json
        ├── openvino_detokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_model.bin
        ├── openvino_model.xml
        ├── openvino_tokenizer.bin
        ├── openvino_tokenizer.xml
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── tokenizer.model
```
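
For heavier compression, a 4-bit export can be requested the same way. The command below is a sketch: the compression options shown (`--group-size`, `--ratio`) follow the Optimum-Intel export documentation, so check `optimum-cli export openvino -h` for the exact set supported by your installed version.

```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format int4 --group-size 128 --ratio 0.8 models/llama-2-7b-chat-int4
```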

### 3. Benchmark LLM Model

To benchmark the performance of the LLM, use the following command:

``` bash
python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/ -n 2
python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
```

**Parameters:**
- `-m`: Path to the model.
- `-d`: Inference device (default: CPU).
- `-r`: Path to the output CSV report.
- `-f`: Framework (default: ov).
- `-p`: Interactive prompt text.
- `-pf`: Path to a JSONL file containing prompts (see the example below).
- `-n`: Number of benchmarking iterations (default: 0); if greater than 0, the first iteration is excluded.
- `-ic`: Limit on the output token size (default: 512) for text generation and code generation models.
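
A prompt file is a JSON Lines file with one JSON object per line. The snippet below is only an illustration that assumes a `prompt` field; check the files shipped under `prompts/` (for example `prompts/llama-2-7b-chat_l.jsonl`) for the exact schema the script expects.

```json
{"prompt": "What is OpenVINO?"}
{"prompt": "Summarize the benefits of 8-bit weight compression in two sentences."}
```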

**Additional options:**

``` bash
python ./benchmark.py -h # for more information
```
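
If you collect a CSV report with `-r`, it can be inspected with standard tooling. This is a minimal sketch only: the exact column set depends on the benchmark version, so the code prints whatever columns are present rather than assuming specific names.

```python
import csv

# Illustrative path; use whatever file you passed to -r.
with open("report.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print("columns:", list(rows[0].keys()) if rows else "empty report")
for row in rows:
    print(row)
```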

#### Benchmarking the Original PyTorch Model

To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch as the framework by passing `-f pt`:

```bash
# Download the PyTorch model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
# Benchmark with the PyTorch framework
python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
```
100 | 111 |
|
101 |
| -Prerequisites: install benchmarking dependencies using requirements.txt |
| 112 | +> **Note:** If needed, You can install a specific OpenVINO version using pip: |
| 113 | +> ``` bash |
| 114 | +> # e.g. |
| 115 | +> pip install openvino==2024.4.0 |
| 116 | +> # Optional, install the openvino nightly package if needed. |
| 117 | +> # OpenVINO nightly is pre-release software and has not undergone full release validation or qualification. |
| 118 | +> pip uninstall openvino |
| 119 | +> pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly |
| 120 | +> ``` |

### 4. Benchmark LLM with `torch.compile()`

The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels using a specified backend.
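
Under the hood, the backend selection corresponds to ordinary `torch.compile()` usage along these lines. This is a minimal sketch, not the benchmark script itself; the `openvino` backend is registered by importing `openvino.torch`, and the model path is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM

import openvino.torch  # noqa: F401 -- registers the "openvino" backend for torch.compile

model = AutoModelForCausalLM.from_pretrained("models/llama-2-7b-chat/pytorch")
# backend="openvino" lowers supported subgraphs to OpenVINO kernels;
# backend="inductor" (PyTorch's default) would keep compilation purely in PyTorch.
compiled_model = torch.compile(model, backend="openvino")
```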

Before benchmarking, you need to download the original PyTorch model. Use the following command to download it locally:

```bash
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
```

To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` and `openvino` (default). Example:

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
```

> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:
>
> ```bash
> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
> ```

### 5. Running on 2-Socket Platforms

The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify this behavior.

| OpenVINO Version    | Behavior                                                                 |
|:--------------------|:-------------------------------------------------------------------------|
| Before 2024.0.0     | `streams.num(1)`<br>executes on 2 sockets.                                |
| 2024.0.0            | `streams.num(1)`<br>executes on the same socket the application is running on. |

For example, passing `--load_config config.json` with the following content results in `streams.num(1)` and execution on 2 sockets:

```json
{
    "INFERENCE_NUM_THREADS": <NUMBER>
}
```

`<NUMBER>` is the total number of physical cores across both sockets.
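
As an alternative to `--load_config`, `numactl` can pin the process explicitly. The invocation below is a sketch that assumes a two-socket machine exposing NUMA nodes 0 and 1; check `numactl --hardware` for your actual topology.

```bash
# Run the benchmark across both sockets and their local memory
numactl --cpunodebind=0,1 --membind=0,1 python ./benchmark.py -m models/llama-2-7b-chat/ -d CPU -n 2
```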

### 6. Additional Resources

- **Error Troubleshooting:** Check [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.