The distilled models can be used in the same way as the base model. For instance:

- `DeepSeek-R1-Distill-Qwen-xxB`, based on Qwen2.5, uses `Qwen2Convert` (see the sketch below).
- `DeepSeek-R1-Distill-Llama-xxB` uses `LlamaConvert`.
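For example, a Qwen2.5-based distill could be converted with `Qwen2Convert`, assuming it follows the same `convert(input_dir, output_dir)` pattern as the `DeepSeekR1Convert` call shown below; the paths are placeholders:

```bash
# Placeholder paths; replace with the downloaded HF checkpoint and the desired output directory.
python -c 'import xfastertransformer as xft; xft.Qwen2Convert().convert("${HF_DATASET_DIR}","${OUTPUT_DIR}")'
```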
Notice: DeepSeek-R1 671B only supports dtype `fp8_e4m3` and KV cache dtype `bf16`.
- xfastertransformer >= 2.0.0
- oneCCL
- vllm-xft >= 0.5.5.3
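A minimal install sketch, assuming the `xfastertransformer` and `vllm-xft` packages are available on PyPI for your platform; oneCCL is set up separately (for example via the Intel oneAPI toolkit):

```bash
# Assumed install commands; availability depends on your platform.
pip install "xfastertransformer>=2.0.0" "vllm-xft>=0.5.5.3"
# oneCCL (needed for multi-rank runs) is installed separately, e.g. with the oneAPI toolkit.
```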
Execute the benchmark scripts to evaluate the model's performance using fake weights, without downloading the 600GB+ real weights.
```bash
git clone https://github.com/intel/xFasterTransformer.git
cd xFasterTransformer/benchmark
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 1 -in 32 -out 32 -s 1
```
- `-bs`: batch size.
- `-in`: input token length, one of `[32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]`.
- `-out`: output token length.
- `-s`: number of CPU nodes to use. If you only have 1 node with SNC-3 enabled, this value is used as the number of sub-NUMA clusters.
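For example, a run with a larger batch and longer inputs might look like this (the flag values are illustrative, not recommended settings):

```bash
# Illustrative values: batch size 4, 1024 input tokens, 128 output tokens, 1 node
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 4 -in 1024 -out 128 -s 1
```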
- Download the original DeepSeek-R1 671B model from Hugging Face.
- Convert the model into xFT format:
  ```bash
  python -c 'import xfastertransformer as xft; xft.DeepSeekR1Convert().convert("${HF_DATASET_DIR}","${OUTPUT_DIR}")'
  ```
  After conversion, the `*.safetensors` files in `${HF_DATASET_DIR}` are no longer needed; you can delete them if you want to save storage space.
- Single instance

  If you want to run DeepSeek within one CPU NUMA node, like:

  ```
  NUMA node0 CPU(s): 0-47,96-143
  ```
  ```bash
  # Preload libiomp5.so by the following cmd, or set LD_PRELOAD=libiomp5.so manually
  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

  numactl -C 0-47 -l python -m vllm.entrypoints.openai.api_server \
      --model ${MODEL_PATH} \
      --tokenizer ${TOKEN_PATH} \
      --dtype fp8_e4m3 \
      --kv-cache-dtype bf16 \
      --served-model-name xft \
      --port 8000 \
      --trust-remote-code
  ```
  - `MODEL_PATH`: the xFT-format model weights.
  - `TOKEN_PATH`: the tokenizer-related files, like `${HF_DATASET_DIR}`.
  - `served-model-name`: the model name used in the API.
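  Once the server is up, you can sanity-check it before sending real requests; this assumes the standard vLLM OpenAI-compatible endpoints:

  ```bash
  # Should list the model name passed via --served-model-name ("xft")
  curl http://localhost:8000/v1/models
  ```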
- Distributed (multi-rank)

  If you want to run DeepSeek across NUMA nodes, like:

  ```
  NUMA node0 CPU(s): 0-47,96-143
  NUMA node1 CPU(s): 48-95,144-191
  ```
  ```bash
  # Preload libiomp5.so by the following cmd, or set LD_PRELOAD=libiomp5.so manually
  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

  OMP_NUM_THREADS=48 mpirun \
      -n 1 numactl --all -C 0-47 \
          python -m vllm.entrypoints.openai.api_server \
              --model ${MODEL_PATH} \
              --tokenizer ${TOKEN_PATH} \
              --dtype fp8_e4m3 \
              --kv-cache-dtype bf16 \
              --served-model-name xft \
              --port 8000 \
              --trust-remote-code \
      : -n 1 numactl --all -C 48-95 \
          python -m vllm.entrypoints.slave \
              --dtype fp8_e4m3 \
              --model ${MODEL_PATH} \
              --kv-cache-dtype bf16
  ```
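  The `:` separates the two `mpirun` ranks: rank 0, pinned to NUMA node 0, runs the OpenAI-compatible API server, while rank 1, pinned to NUMA node 1, runs the slave worker that serves its share of the model.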
- Query example

  ```bash
  curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "xft",
          "messages": [{"role": "user", "content": "Hello! Who are you?"}],
          "max_tokens": 256,
          "temperature": 0.6,
          "top_p": 0.95
      }'
  ```
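  For token-by-token output you can also request streaming; this sketch assumes the standard OpenAI-compatible `stream` parameter supported by vLLM:

  ```bash
  # Same request as above, but streamed as server-sent events
  curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "xft",
          "messages": [{"role": "user", "content": "Hello! Who are you?"}],
          "max_tokens": 256,
          "temperature": 0.6,
          "top_p": 0.95,
          "stream": true
      }'
  ```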