ORT whisper on TensorrtExecutionProvider is slower than PyTorch #2212

Open
huggingfacename opened this issue Mar 11, 2025 · 0 comments
Labels
bug Something isn't working

System Info

optimum 1.24.0

Who can help?

I've seen #869, but this appears to be a separate issue. Maybe @fxmarty or @JingyaHuang can help me?

I can get a significant speedup with Whisper on CUDAExecutionProvider and with wav2vec2 on TensorrtExecutionProvider, but Whisper on TensorrtExecutionProvider performs very poorly (a sketch of the wav2vec2 setup is included after the reproduction script below).

I'm using the TensorRT engine cache (trt_engine_cache_enable) and warming up the model, but Whisper with TensorrtExecutionProvider is consistently over 2x slower than the vanilla transformers pipeline; the warm-up pass is sketched after the benchmark numbers below.

For example, running the script below yields:

Benchmarking baseline pipeline...
Run 1: 2.48 seconds
Run 2: 1.87 seconds
Run 3: 1.85 seconds

Benchmarking TensorRT pipeline...
Run 1: 5.61 seconds
Run 2: 4.99 seconds
Run 3: 5.00 seconds
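
For reference, the warm-up mentioned above is just an untimed pass through the TensorRT pipeline before the timed runs, roughly like this (it is not included in the benchmark helper in the reproduction script below):

# Untimed warm-up pass: builds/loads the TensorRT engines and populates
# trt_engine_cache so the timed runs don't pay the engine-build cost.
_ = trt_pipe(audio_long)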

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

import time
import torch
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoFeatureExtractor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
import numpy as np

if not torch.cuda.is_available():
    raise SystemExit("CUDA not available!")
device = 0

model_name = "openai/whisper-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

# Create the baseline pipeline using the standard model
baseline_pipe = pipeline(
    task="automatic-speech-recognition",
    model=model_name,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    return_timestamps=True,
    device=device
)

# Create the TensorRT-optimized pipeline
provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "trt_cache"
}
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    export=True,
    provider="TensorrtExecutionProvider",
    provider_options=provider_options
)
trt_pipe = pipeline(
    task="automatic-speech-recognition",
    model=ort_model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    return_timestamps=True,
    device=device
)


# Load one LibriSpeech test utterance and tile it to get a longer clip,
# so the long-form chunked decoding path is exercised.
audio = next(iter(load_dataset("librispeech_asr", "clean", split="test", streaming=True)))["audio"]["array"]
audio_long = np.tile(audio, 60)

def benchmark(pipe, audio, runs=3):
    times = []
    for i in range(runs):
        start = time.time()
        _ = pipe(audio)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Run {i+1}: {elapsed:.2f} seconds")
    return times

print("\nBenchmarking baseline pipeline...")
baseline_times = benchmark(baseline_pipe, audio_long)

print("\nBenchmarking TensorRT pipeline...")
trt_times = benchmark(trt_pipe, audio_long)
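
For comparison, the wav2vec2 setup that does show a TensorRT speedup is along these lines (sketch only; the checkpoint name and cache path here are illustrative, not my exact script):

from optimum.onnxruntime import ORTModelForCTC

# Same provider options as above, applied to a CTC model; with a setup like
# this, TensorrtExecutionProvider is clearly faster than the baseline for me.
wav2vec2_model = ORTModelForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",  # illustrative checkpoint
    export=True,
    provider="TensorrtExecutionProvider",
    provider_options={
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "trt_cache_wav2vec2",
    },
)
wav2vec2_pipe = pipeline(
    task="automatic-speech-recognition",
    model=wav2vec2_model,
    feature_extractor=AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h"),
    tokenizer=AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h"),
    device=device,
)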

Expected behavior

TensorrtExecutionProvider yields a significant speedup over the baseline PyTorch pipeline (as it already does for wav2vec2), rather than a more than 2x slowdown.
