
Bug exporting Whisper? #2200

Open

AlArgente opened this issue Feb 25, 2025 · 1 comment
Labels
bug Something isn't working


AlArgente commented Feb 25, 2025

System Info

Hi! I'm exporting some fine-tuned Whisper models, small and base, fine-tuned in English or Spanish. In some cases I've noticed that the exported tokenizer.json is 2,423 KB, and in other cases 3,839 KB, even when the export is for the same language. I have some English models where the tokenizer.json weighs 2,423 KB and others where it weighs 3,839 KB, and the same for the Spanish ones.

When the tokenizer.json is 2,423 KB I get problems generating the output, as generation runs until it reaches the model's max_length, but when the file is 3,839 KB the output is as it should be.

The tokenizer.json from the original models weighs 2,423 KB and those work well, but after fine-tuning the size changes. I don't know if this is expected behavior.

Who can help?

@michaelbenayoun @JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I have used the following URL to train my models: https://huggingface.co/blog/fine-tune-whisper

The Spanish datasets I have used are:

from datasets import load_dataset

voxpopuli_spanish = load_dataset(
    "facebook/voxpopuli", "es", split="train", streaming=True, trust_remote_code=True
)  # I take 133 random instances
common_voice_spanish = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "es",
    split="train",
    streaming=True,
    trust_remote_code=True,
)  # I take 66 random instances
librispeech_spanish = load_dataset(
    "facebook/multilingual_librispeech", "spanish", split="train", streaming=True
)  # I take 66 random instances

I have used the same datasets for English: for common_voice and voxpopuli I just change "es" to "en". For LibriSpeech:

librispeech_asr = load_dataset(
    "openslr/librispeech_asr", split="train.other.500", streaming=True, trust_remote_code=True
)

I also use other private datasets that I can't share right now, but they contain around 200 instances.
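
For reference, this is roughly how the random instances can be taken from a streaming dataset (a sketch using the datasets library's shuffle/take on an IterableDataset; the seed and buffer_size values are illustrative, not the exact ones I used):

from datasets import load_dataset

# Streaming shuffle uses a fixed-size buffer, so this is an approximate
# shuffle; seed and buffer_size here are illustrative placeholders.
voxpopuli_spanish = load_dataset(
    "facebook/voxpopuli", "es", split="train", streaming=True, trust_remote_code=True
)
sampled = voxpopuli_spanish.shuffle(seed=42, buffer_size=1000).take(133)
examples = list(sampled)  # materialize the 133 sampled rows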

For exporting the model I use the following line:

optimum-cli export onnx --model whisper-small-es-trained whisper-small-es-onnx --task automatic-speech-recognition --opset 18

I have tested with multiple opsets, but I get the same result.
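
A quick way to sanity-check the exported model is to load it back with optimum.onnxruntime and run it through a pipeline (a sketch, not my exact script; it assumes onnxruntime is installed, and sample.wav is a placeholder audio file):

from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# Load the ONNX export produced by the optimum-cli command above.
model = ORTModelForSpeechSeq2Seq.from_pretrained("whisper-small-es-onnx")
processor = AutoProcessor.from_pretrained("whisper-small-es-onnx")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
# With the 2,423 KB tokenizer.json the transcription keeps generating
# until max_length; with the 3,839 KB one it stops where it should.
print(asr("sample.wav"))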

Expected behavior

I don't know whether this behavior is correct, or whether the exported tokenizer.json should always be the same.

AlArgente added the bug label on Feb 25, 2025
xenova (Contributor) commented Mar 5, 2025

I believe this was due to a change in how BPE models were saved in huggingface/tokenizers#909, with more recent models having more whitespace due to another level of indentation.

Practically, this shouldn't be an issue for you here, and you can ignore the difference in size.
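
If you want to confirm the two files only differ in formatting, you can parse them and compare the content (a quick sketch; the file paths are placeholders):

import json

# json.load discards whitespace and indentation, so this compares the
# actual tokenizer content rather than the raw bytes on disk.
with open("tokenizer_2423kb.json", encoding="utf-8") as f:
    a = json.load(f)
with open("tokenizer_3839kb.json", encoding="utf-8") as f:
    b = json.load(f)

print(a == b)  # True -> the files differ only in serialization, not content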
