
[NPU][Llama] NPU is slower than CPU&GPU when running LLM #1882

Open
yang-ahuan opened this issue Mar 11, 2025 · 2 comments
Labels: category: LLM (LLM pipeline: stateful, static), category: NPU, PSE


yang-ahuan commented Mar 11, 2025

Description

I followed the official guidelines (NPU with OpenVINO GenAI) to run an LLM. When testing the model on different hardware, I observed the following:

  • GPU performance is indeed faster than CPU, as expected.
  • However, NPU runs slower than CPU, which seems counterintuitive.

Is this behavior expected, or could there be an issue with my setup? I would appreciate any insights into this.
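
One quick sanity check is to confirm the NPU is actually visible to OpenVINO before benchmarking; a minimal sketch using the standard device query API (the device list in the comment is illustrative):

from openvino import Core

core = Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']
if "NPU" in core.available_devices:
    # Confirms the NPU driver is installed and the device is usable.
    print(core.get_property("NPU", "FULL_DEVICE_NAME"))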

Experiment

Environment

  • Processor: Intel(R) Core(TM) Ultra 9 288V
  • OS: Win11 Pro 24H2
  • Package version:
    • openvino: 2025.0.0
    • openvino-tokenizers: 2025.0.0.0
    • openvino-genai: 2025.0.0.0
    • optimum-intel: 1.22.0

Testing model

  • Llama-3.2-3B-Instruct
  • Llama-2-7B-Chat-Hf

Testing code

optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 model/llama2_7b_chat_hf_ov_quant_wi4_sym_gs-1

from time import time
from openvino_genai import LLMPipeline

device = "CPU"  # or "NPU" / "GPU"
model_path = "model/llama2_7b_chat_hf_ov_quant_wi4_sym_gs-1"
pipe = LLMPipeline(
    model_path,
    device,
    # CACHE_DIR=".npucache",  # optional: cache compiled models between runs
)

# Measure wall-clock latency over 10 generations of 100 new tokens each;
# the first iteration also pays any remaining warm-up cost.
latency_rec = []
for _ in range(10):
    start_time = time()
    pipe.generate("How are you?", max_new_tokens=100)
    latency_rec.append(time() - start_time)
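
Since end-to-end wall-clock time mixes prefill, decode, and warm-up, openvino-genai's built-in performance metrics can break the latency down further. A minimal sketch, reusing pipe and latency_rec from above (the getters return mean/std pairs; TTFT and TPOT are reported in ms):

# Average wall-clock latency, as summarized in the table below.
print(f"avg latency: {sum(latency_rec) / len(latency_rec):.4f} s")

# Per-request breakdown from openvino-genai's perf metrics.
res = pipe.generate("How are you?", max_new_tokens=100)
pm = res.perf_metrics
print(f"TTFT:       {pm.get_ttft().mean:.1f} ms")          # time to first token (prefill)
print(f"TPOT:       {pm.get_tpot().mean:.1f} ms/token")    # decode latency per token
print(f"throughput: {pm.get_throughput().mean:.1f} tok/s") # generation throughput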

Result

Device | Average (s) | Individual runs (s, rounded to 0.01)
CPU    | 6.1186      | 12.96, 5.26, 1.10, 6.88, 6.78, 2.43, 4.93, 6.97, 7.03, 6.84
NPU    | 9.4189      | 10.49, 9.88, 6.55, 9.95, 9.89, 9.80, 9.91, 7.87, 9.92, 9.93

Try to Solve

I attempted to optimize the execution of the LLM on the NPU by configuring different group sizes, model types, and cache settings, but the results were still slower than on the CPU.
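
A sketch of the kind of NPU-side configuration that can be tried, built from the pipeline options documented in the OpenVINO GenAI NPU guide (MAX_PROMPT_LEN, MIN_RESPONSE_LEN, GENERATE_HINT); the concrete values here are illustrative, not tuned:

from openvino_genai import LLMPipeline

# NPU-specific pipeline options per the OpenVINO GenAI NPU guide;
# the values below are illustrative assumptions, not tuned settings.
npu_config = {
    "MAX_PROMPT_LEN": 1024,        # static prefill shape: max prompt tokens
    "MIN_RESPONSE_LEN": 128,       # minimum reserved response length
    "GENERATE_HINT": "BEST_PERF",  # or "FAST_COMPILE" for faster compilation
    "CACHE_DIR": ".npucache",      # reuse compiled blobs across runs
}
pipe = LLMPipeline("model/llama2_7b_chat_hf_ov_quant_wi4_sym_gs-1", "NPU", **npu_config)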

yang-ahuan changed the title from "[NPU] NPU is slower than CPU&GPU when running LLM" to "[NPU][Llama] NPU is slower than CPU&GPU when running LLM" on Mar 11, 2025
ilya-lavrenov added the category: LLM (LLM pipeline: stateful, static) and category: NPU labels on Mar 11, 2025
Wan-Intel commented

I've replicated the issue on an Intel® Core™ Ultra 7 Processor 258V; the NPU results are slower than the CPU results:

CPU:
[7.669469594955444, 4.948709964752197, 5.008191347122192, 5.01607608795166, 5.037824392318726, 5.108126640319824, 2.7261970043182373, 5.05750036239624, 5.072535514831543, 4.77348256111145]

NPU:
[12.301685571670532, 11.517853736877441, 11.37859845161438, 11.471659660339355, 11.345218658447266, 10.561753749847412, 11.35003662109375, 11.410158395767212, 11.563832759857178, 11.311937808990479]

I'll escalate the case to the relevant team, and we will provide an update as soon as possible.

Wan-Intel added the PSE label on Mar 16, 2025

deveworld commented Mar 20, 2025

I was also able to reproduce the NPU being significantly slower than the CPU and iGPU on a Galaxy Book5 Pro with an Intel® Core™ Ultra 5 Processor 228V.
Here are the test results:

iGPU (3.7 s) < CPU (5.6 s) < NPU (7.5 s)

Device | Avg (s) | 10 samples (s, rounded to 0.01)
iGPU   | 3.7     | 3.80, 3.68, 3.65, 3.67, 3.68, 3.72, 3.70, 3.19, 3.69, 3.72
CPU    | 5.6     | 5.63, 5.63, 5.37, 5.44, 5.82, 5.83, 5.71, 5.53, 5.49, 5.51
NPU    | 7.5     | 8.38, 7.97, 5.38, 7.97, 8.01, 8.05, 8.01, 5.28, 8.02, 8.00

When can we expect a fix for this?
