[NPU][LNL] Run LLM inference on LNL NPU is very very slow #1563

johnysh · 2025-01-16T08:29:20Z

[OS]: Win11
[Platform]: Intel(R) Core(TM) Ultra 7 258V 2.20 GHz
[RAM]: 32GB
[NPU driver]: 32.0.100.3104
ENV:

https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html

pip install nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
pip install openvino==2024.6 openvino-tokenizers==2024.6 openvino-genai==2024.6

PIP LIST:
openvino 2024.6.0
openvino-genai 2024.6.0.0
openvino-telemetry 2024.5.0
openvino-tokenizers 2024.6.0.0
optimum 1.23.3
optimum-intel 1.19.0

Code:
https://github.com/openvinotoolkit/openvino.genai

CMD:
optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ

python\benchmark_genai>python ./benchmark_genai.py -m Llama-2-7B-Chat-GPTQ -d NPU

Result:

Wan-Intel · 2025-01-20T05:00:47Z

Could you please run the following command and share the result with us?
python\benchmark_genai>python ./benchmark_genai.py -m Llama-2-7B-Chat-GPTQ -d CPU

johnysh · 2025-01-20T05:18:34Z

CPU Result:

Load time: 1260.00 ms
Generate time: 1786.82 ± 66.43 ms
Tokenization time: 0.48 ± 0.04 ms
Detokenization time: 0.43 ± 0.04 ms
TTFT: 288.04 ± 43.60 ms
TPOT: 78.86 ± 19.94 ms
Throughput : 12.68 ± 3.21 tokens/s

iGPU Result:

Load time: 77268.00 ms
Generate time: 917.43 ± 2.66 ms
Tokenization time: 0.54 ± 0.04 ms
Detokenization time: 0.56 ± 0.00 ms
TTFT: 52.42 ± 0.39 ms
TPOT: 45.47 ± 2.41 ms
Throughput : 21.99 ± 1.17 tokens/s
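For reference, the CPU and iGPU figures above can be compared with a few lines of arithmetic. This is only a sketch over the numbers already posted: it assumes throughput is roughly the reciprocal of the mean TPOT, which matches the reported values here but is not taken from the benchmark's source. The names `results` and `tokens_per_second` are illustrative.

```python
# Mean values copied from the CPU and iGPU results posted in this thread.
results = {
    "CPU":  {"load_ms": 1260.00,  "ttft_ms": 288.04, "tpot_ms": 78.86},
    "iGPU": {"load_ms": 77268.00, "ttft_ms": 52.42,  "tpot_ms": 45.47},
}


def tokens_per_second(tpot_ms):
    """Approximate decode throughput from time per output token (ms)."""
    return 1000.0 / tpot_ms


for device, r in results.items():
    print(f"{device}: {tokens_per_second(r['tpot_ms']):.2f} tokens/s")

# How much faster the iGPU decodes, and how much longer it takes to load.
decode_speedup = results["CPU"]["tpot_ms"] / results["iGPU"]["tpot_ms"]
load_ratio = results["iGPU"]["load_ms"] / results["CPU"]["load_ms"]
print(f"iGPU decode speedup over CPU: {decode_speedup:.2f}x")
print(f"iGPU load time relative to CPU: {load_ratio:.1f}x")
```

The reciprocal reproduces the reported 12.68 and 21.99 tokens/s. The iGPU decodes about 1.7 times faster than the CPU but takes about 61 times longer to load. The same comparison can be repeated once NPU numbers are available.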

Wan-Intel · 2025-01-23T03:22:32Z

Thanks for sharing the information. I'll escalate the case to the relevant team, and we'll provide an update as soon as possible.

avitial · 2025-02-04T17:54:57Z

Ref. 161855

avitial · 2025-02-28T22:32:26Z

@johnysh could you follow the updated OpenVINO guide and re-measure the performance on the NPU device? We also suggest using the BEST_PERF option, which ensures the best possible performance at the cost of slower compilation; see the performance modes section of the guide. Kindly share your results.
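A minimal sketch of how the BEST_PERF suggestion might be applied from Python. It assumes the option is passed as a `GENERATE_HINT` property when constructing `openvino_genai.LLMPipeline`, which is how I understand the NPU guide; the thread itself does not name the key, so please verify it against the guide for your release. The helper names `npu_pipeline_config` and `run_on_npu` are hypothetical.

```python
def npu_pipeline_config(best_perf=True):
    """Build the properties dict for the NPU pipeline.

    Assumption: the performance mode is selected with the "GENERATE_HINT"
    key, using "BEST_PERF" or "FAST_COMPILE" as the value.
    """
    hint = "BEST_PERF" if best_perf else "FAST_COMPILE"
    return {"GENERATE_HINT": hint}


def run_on_npu(model_dir, prompt, max_new_tokens=32):
    """Load the exported model on the NPU and generate a short reply."""
    # Imported lazily so the config helper can be used without the package.
    import openvino_genai as ov_genai

    pipe = ov_genai.LLMPipeline(str(model_dir), "NPU", npu_pipeline_config())
    return pipe.generate(prompt, max_new_tokens=max_new_tokens)


# Example call (needs openvino-genai, the exported model folder, and an NPU):
# print(run_on_npu("Llama-2-7B-Chat-GPTQ", "What is OpenVINO?"))
```

Compilation is expected to take longer in this mode, so load time and generation time should be compared separately when re-measuring.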

TolyaTalamanov · 2025-03-07T12:01:41Z

@johnysh Could you re-validate with the new driver, please?
https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html

YuChern-Intel assigned Munesh-Intel, zulkifli-halim and Wan-Intel and unassigned zulkifli-halim Jan 19, 2025

Wan-Intel added the PSE label Jan 23, 2025

avitial self-assigned this Jan 31, 2025

avitial added the bug (Something isn't working) and category: NPU labels Feb 4, 2025

ilya-lavrenov assigned TolyaTalamanov and dmatveev Feb 9, 2025

ilya-lavrenov added the category: LLM (LLM pipeline (stateful, static)) label Feb 9, 2025
