Commit 24242ff

[Doc] Add vllm benchmark docs. (#448)

1 parent 38658b1 commit 24242ff

1 file changed: +80 -0

serving/vllm-xft.md
@@ -68,4 +68,84 @@ OMP_NUM_THREADS=48 mpirun \
--dtype bf16 \
--model ${MODEL_PATH} \
--kv-cache-dtype fp16
```

## Benchmarking vLLM-xFT

### Downloading vLLM
```bash
git clone https://github.com/Duyi-Wang/vllm.git && cd vllm/benchmarks
```

### Downloading the ShareGPT dataset
You can download the dataset by running:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
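
Optionally, you can sanity-check the download before benchmarking. The snippet below is a minimal sketch that assumes the standard ShareGPT layout (a JSON list of records, each with a `conversations` field); adjust the filename if you saved the file elsewhere.

```bash
# Quick, optional sanity check of the downloaded dataset
ls -lh ShareGPT_V3_unfiltered_cleaned_split.json

# Count records and show the keys of the first one; assumes a JSON list of
# objects with a "conversations" field (standard ShareGPT layout)
python -c "
import json
with open('ShareGPT_V3_unfiltered_cleaned_split.json') as f:
    data = json.load(f)
print(len(data), 'records')
print(sorted(data[0].keys()))
"
```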

### Benchmarking offline inference throughput
This script benchmarks the offline inference throughput of a specified model. It sets up the environment, defines the paths for the tokenizer, model, and dataset, and uses numactl to bind the process to appropriate CPU resources to optimize performance.
```bash
#!/bin/bash

# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

# Define the paths for the tokenizer, the model, and the dataset
TOKEN_PATH=/data/models/Qwen2-7B-Instruct
MODEL_PATH=/data/models/Qwen2-7B-Instruct-xft
DATASET_PATH=ShareGPT_V3_unfiltered_cleaned_split.json

# Use numactl to bind to appropriate CPU resources:
#   --tokenizer   path to the tokenizer
#   --model       path to the model
#   --dataset     path to the dataset
numactl -C 0-47 -l python benchmark_throughput.py \
    --tokenizer ${TOKEN_PATH} \
    --model ${MODEL_PATH} \
    --dataset ${DATASET_PATH}
```
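
The run length can be shortened by limiting how many prompts are sampled from the dataset. The sketch below assumes this fork keeps upstream vLLM's `--num-prompts` flag for `benchmark_throughput.py`; confirm the available options with `python benchmark_throughput.py --help` first.

```bash
# Shorter run: cap the number of prompts sampled from the dataset
# (reuses TOKEN_PATH/MODEL_PATH/DATASET_PATH defined above).
# --num-prompts follows upstream vLLM's benchmark_throughput.py and may differ in this fork.
numactl -C 0-47 -l python benchmark_throughput.py \
    --tokenizer ${TOKEN_PATH} \
    --model ${MODEL_PATH} \
    --dataset ${DATASET_PATH} \
    --num-prompts 200
```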

### Benchmarking online serving throughput
This guide explains how to benchmark the online serving throughput of a model. It includes instructions for starting the server and running the client benchmark script.
1. On the server side, you can refer to the following script to start the test API server:
```bash
#!/bin/bash

# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

# Define the paths for the tokenizer and the model
TOKEN_PATH=/data/models/Qwen2-7B-Instruct
MODEL_PATH=/data/models/Qwen2-7B-Instruct-xft

# Start the API server, using numactl to bind to appropriate CPU resources:
#   --dtype bf16              data type for the model (bfloat16)
#   --kv-cache-dtype fp16     data type for the key-value cache (float16)
#   --served-model-name xft   name for the served model
#   --port 8000               port number for the API server
#   --trust-remote-code       trust remote code execution
numactl -C 0-47 -l python -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --tokenizer ${TOKEN_PATH} \
    --dtype bf16 \
    --kv-cache-dtype fp16 \
    --served-model-name xft \
    --port 8000 \
    --trust-remote-code
```

2. On the client side, you can use `python benchmark_serving.py --help` to see the available configuration parameters. Here is a reference example:

```bash
$ python benchmark_serving.py --model xft --tokenizer /data/models/Qwen2-7B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
============ Serving Benchmark Result ============
Successful requests:              xxxx
Benchmark duration (s):           xxxx
Total input tokens:               xxxx
Total generated tokens:           xxxx
Request throughput (req/s):       xxxx
Input token throughput (tok/s):   xxxx
Output token throughput (tok/s):  xxxx
---------------Time to First Token----------------
Mean TTFT (ms):                   xxxx
Median TTFT (ms):                 xxxx
P99 TTFT (ms):                    xxxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   xxx
Median TPOT (ms):                 xxx
P99 TPOT (ms):                    xxx
==================================================
```
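
For latency-oriented comparisons, it can help to fix the request arrival rate instead of sending all requests at once. The sketch below assumes the server from step 1 is listening on `localhost:8000` and that this fork keeps upstream vLLM's `--request-rate` and `--num-prompts` flags for `benchmark_serving.py`; confirm the available options with `python benchmark_serving.py --help`.

```bash
# Optional: confirm the server is up first; /v1/models is the standard
# OpenAI-compatible route exposed by vLLM's api_server.
curl http://localhost:8000/v1/models

# Sweep a few request rates (requests per second); flag names follow upstream
# vLLM's benchmark_serving.py and may differ in this fork.
for rate in 1 2 4 8; do
    python benchmark_serving.py \
        --model xft \
        --tokenizer /data/models/Qwen2-7B-Instruct \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --num-prompts 200 \
        --request-rate ${rate}
done
```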
