ORT whisper on TensorrtExecutionProvider is slower than PyTorch #2212

Open
huggingfacename opened this issue Mar 11, 2025 · 0 comments
Labels
bug Something isn't working

System Info

optimum 1.24.0

Who can help?

I've seen #869, but this appears to be a separate issue. Maybe @fxmarty or @JingyaHuang can help me?

I can get a significant speedup with Whisper on CUDAExecutionProvider and with wav2vec2 on TensorrtExecutionProvider, but Whisper on TensorrtExecutionProvider performs very poorly (a sketch of the wav2vec2 setup is included after the reproduction script below).

I'm using the TensorRT engine cache (trt_engine_cache_enable) and warming up the model, but Whisper with TensorrtExecutionProvider is consistently over 2x slower than the vanilla transformers pipeline; the warm-up pass is sketched after the benchmark numbers below.

For example, running the script below yields:

Benchmarking baseline pipeline...
Run 1: 2.48 seconds
Run 2: 1.87 seconds
Run 3: 1.85 seconds

Benchmarking TensorRT pipeline...
Run 1: 5.61 seconds
Run 2: 4.99 seconds
Run 3: 5.00 seconds
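
For reference, the warm-up mentioned above is just an untimed pass through the TensorRT pipeline before the timed runs, roughly like this (it is not included in the benchmark helper in the reproduction script below):

# Untimed warm-up pass: builds/loads the TensorRT engines and populates
# trt_engine_cache so the timed runs don't pay the engine-build cost.
_ = trt_pipe(audio_long)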

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

import time
import torch
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoFeatureExtractor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
import numpy as np

if not torch.cuda.is_available():
    raise SystemExit("CUDA not available!")
device = 0

model_name = "openai/whisper-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

# Create the baseline pipeline using the standard model
baseline_pipe = pipeline(
    task="automatic-speech-recognition",
    model=model_name,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    return_timestamps=True,
    device=device
)

# Create the TensorRT-optimized pipeline
provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "trt_cache"
}
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    export=True,
    provider="TensorrtExecutionProvider",
    provider_options=provider_options
)
trt_pipe = pipeline(
    task="automatic-speech-recognition",
    model=ort_model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    return_timestamps=True,
    device=device
)


# Load one LibriSpeech test utterance and tile it to get a longer clip,
# so the long-form chunked decoding path is exercised.
audio = next(iter(load_dataset("librispeech_asr", "clean", split="test", streaming=True)))["audio"]["array"]
audio_long = np.tile(audio, 60)

def benchmark(pipe, audio, runs=3):
    times = []
    for i in range(runs):
        start = time.time()
        _ = pipe(audio)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Run {i+1}: {elapsed:.2f} seconds")
    return times

print("\nBenchmarking baseline pipeline...")
baseline_times = benchmark(baseline_pipe, audio_long)

print("\nBenchmarking TensorRT pipeline...")
trt_times = benchmark(trt_pipe, audio_long)
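
For comparison, the wav2vec2 setup that does show a TensorRT speedup is along these lines (sketch only; the checkpoint name and cache path here are illustrative, not my exact script):

from optimum.onnxruntime import ORTModelForCTC

# Same provider options as above, applied to a CTC model; with a setup like
# this, TensorrtExecutionProvider is clearly faster than the baseline for me.
wav2vec2_model = ORTModelForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",  # illustrative checkpoint
    export=True,
    provider="TensorrtExecutionProvider",
    provider_options={
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "trt_cache_wav2vec2",
    },
)
wav2vec2_pipe = pipeline(
    task="automatic-speech-recognition",
    model=wav2vec2_model,
    feature_extractor=AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h"),
    tokenizer=AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h"),
    device=device,
)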

Expected behavior

TensorrtExecutionProvider yields a significant speedup over the baseline PyTorch pipeline (as it already does for wav2vec2), rather than a more than 2x slowdown.
