
Bug exporting Whisper? #2200

Open

AlArgente opened this issue Feb 25, 2025 · 1 comment
Labels
bug Something isn't working


AlArgente commented Feb 25, 2025

System Info

Hi! I'm exporting some fine-tuned Whisper models, small and base, fine-tuned in English or Spanish. In some cases I've noticed that the exported tokenizer.json is 2,423 KB, and in other cases 3,839 KB, even when the export is for the same language. I have some English models where the tokenizer.json weighs 2,423 KB and others where it weighs 3,839 KB, and the same for the Spanish ones.

When the tokenizer.json is 2,423 KB I get problems generating the output, as generation runs until it reaches the model's max_length, but when the file is 3,839 KB the output is as it should be.

The tokenizer.json from the original models weighs 2,423 KB and those work well, but after fine-tuning the size changes. I don't know if this is expected behavior.

Who can help?

@michaelbenayoun @JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I have used the following URL to train my models: https://huggingface.co/blog/fine-tune-whisper

The Spanish datasets I have used are:

from datasets import load_dataset

voxpopuli_spanish = load_dataset(
    "facebook/voxpopuli", "es", split="train", streaming=True, trust_remote_code=True
)  # I take 133 random instances
common_voice_spanish = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "es",
    split="train",
    streaming=True,
    trust_remote_code=True,
)  # I take 66 random instances
librispeech_spanish = load_dataset(
    "facebook/multilingual_librispeech", "spanish", split="train", streaming=True
)  # I take 66 random instances

I have used the same datasets for English: for common_voice and voxpopuli I just change "es" to "en". For LibriSpeech:

librispeech_asr = load_dataset(
    "openslr/librispeech_asr", split="train.other.500", streaming=True, trust_remote_code=True
)

I also use other private datasets that I can't share right now, but they contain around 200 instances.
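
For reference, this is roughly how the random instances can be taken from a streaming dataset (a sketch using the datasets library's shuffle/take on an IterableDataset; the seed and buffer_size values are illustrative, not the exact ones I used):

from datasets import load_dataset

# Streaming shuffle uses a fixed-size buffer, so this is an approximate
# shuffle; seed and buffer_size here are illustrative placeholders.
voxpopuli_spanish = load_dataset(
    "facebook/voxpopuli", "es", split="train", streaming=True, trust_remote_code=True
)
sampled = voxpopuli_spanish.shuffle(seed=42, buffer_size=1000).take(133)
examples = list(sampled)  # materialize the 133 sampled rows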

For exporting the model I use the following line:

optimum-cli export onnx --model whisper-small-es-trained whisper-small-es-onnx --task automatic-speech-recognition --opset 18

I have tested with multiple opsets, but I get the same result.
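
A quick way to sanity-check the exported model is to load it back with optimum.onnxruntime and run it through a pipeline (a sketch, not my exact script; it assumes onnxruntime is installed, and sample.wav is a placeholder audio file):

from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# Load the ONNX export produced by the optimum-cli command above.
model = ORTModelForSpeechSeq2Seq.from_pretrained("whisper-small-es-onnx")
processor = AutoProcessor.from_pretrained("whisper-small-es-onnx")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
# With the 2,423 KB tokenizer.json the transcription keeps generating
# until max_length; with the 3,839 KB one it stops where it should.
print(asr("sample.wav"))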

Expected behavior

I don't know whether this behavior is correct, or whether the exported tokenizer.json should always be the same.

AlArgente added the bug label on Feb 25, 2025
xenova (Contributor) commented Mar 5, 2025

I believe this was due to a change in how BPE models were saved in huggingface/tokenizers#909, with more recent models having more whitespace due to another level of indentation.

Practically, this shouldn't be an issue for you here, and you can ignore the difference in size.
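
If you want to confirm the two files only differ in formatting, you can parse them and compare the content (a quick sketch; the file paths are placeholders):

import json

# json.load discards whitespace and indentation, so this compares the
# actual tokenizer content rather than the raw bytes on disk.
with open("tokenizer_2423kb.json", encoding="utf-8") as f:
    a = json.load(f)
with open("tokenizer_3839kb.json", encoding="utf-8") as f:
    b = json.load(f)

print(a == b)  # True -> the files differ only in serialization, not content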
