Hi! I'm exporting some fine-tuned Whisper models (small and base), fine-tuned in English or Spanish. In some cases I've noticed that the exported tokenizer.json is 2.423 KB and in other cases 3.839 KB, even though it was exported for the same language: I have English models where the tokenizer weighs 2.423 KB and others where it weighs 3.839 KB, and the same happens for the Spanish ones.
When the tokenizer is 2.423 KB I get problems generating the output, as it reaches the model's max_length, but when the tokenizer file is 3.839 KB the output comes out as it should.
The tokenizer from the original models weighs 2.423 KB and those work well, but after fine-tuning the size changes. I don't know if this is expected behavior.
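To illustrate what I mean by "problems generating the output", here is a minimal sketch of how generation with an exported model can be checked. It assumes an Optimum ONNX export; the directory name and the dummy audio are placeholders, not my actual data:

import numpy as np
from transformers import AutoProcessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_dir = "whisper-small-finetuned-onnx"  # placeholder path to the exported model

model = ORTModelForSpeechSeq2Seq.from_pretrained(model_dir)
processor = AutoProcessor.from_pretrained(model_dir)

# dummy 1-second silent clip at 16 kHz, just to exercise generate()
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# reported behaviour: with one export generation runs until max_length,
# with the other it stops where it should
generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])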
from datasets import load_dataset

voxpopuli_spanish = load_dataset(
    "facebook/voxpopuli", "es", split="train", streaming=True, trust_remote_code=True
)  # I take 133 random instances

common_voice_spanish = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "es",
    split="train",
    streaming=True,
    trust_remote_code=True,
)  # I take 66 random instances

librispeech_spanish = load_dataset(
    "facebook/multilingual_librispeech", "spanish", split="train", streaming=True
)  # I take 66 random instances
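As a sketch of the sampling, this is one way a fixed number of random instances can be taken from these streaming datasets (the seed and buffer_size are arbitrary placeholders; this only illustrates the sampling, not necessarily the exact code I used):

# continuing from the snippet above: shuffle the streams and take N examples
voxpopuli_sample = list(voxpopuli_spanish.shuffle(seed=42, buffer_size=1000).take(133))
common_voice_sample = list(common_voice_spanish.shuffle(seed=42, buffer_size=1000).take(66))
librispeech_sample = list(librispeech_spanish.shuffle(seed=42, buffer_size=1000).take(66))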
I have used the same datasets for English:
In the case of common_voice and voxpopuli, I just change "es" for "en". For the librispeech:
I believe this was due to a change in how BPE models were saved in huggingface/tokenizers#909, with more recent models having more whitespace due to another level of indentation.
Practically, this shouldn't be an issue for you here, and you can ignore the difference in size.
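A minimal sketch to confirm that the two tokenizer.json files differ only in whitespace and not in content (the file paths below are placeholders):

import json

# placeholder paths: the smaller and the larger exported tokenizer files
with open("export_small/tokenizer.json", encoding="utf-8") as f:
    tok_a = json.load(f)
with open("export_large/tokenizer.json", encoding="utf-8") as f:
    tok_b = json.load(f)

# if only the indentation changed, the parsed content is identical
print("same content:", tok_a == tok_b)

# re-serialising both with identical formatting should also give identical sizes
dump_a = json.dumps(tok_a, ensure_ascii=False, sort_keys=True)
dump_b = json.dumps(tok_b, ensure_ascii=False, sort_keys=True)
print("same size after normalising:", len(dump_a) == len(dump_b))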
System Info
Who can help?
@michaelbenayoun @JingyaHuang @echarlaix
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
I have followed this guide to train my models: https://huggingface.co/blog/fine-tune-whisper
The datasets I have used in Spanish and English are the ones shown above.
I also use other private datasets that I can't share right now; they contain around 200 instances.
For exporting the model I use the following line:
I have tested using multiple opsets, but I get the same output.
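As a reference sketch of this kind of export with the Optimum Python API (the paths below are placeholders, and this is not necessarily the exact line I used):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

finetuned_dir = "whisper-small-finetuned"  # placeholder: the fine-tuned checkpoint
onnx_dir = "whisper-small-finetuned-onnx"  # placeholder: where the export is written

# export the fine-tuned model to ONNX and save it together with the
# processor/tokenizer (this is what writes tokenizer.json into onnx_dir)
model = ORTModelForSpeechSeq2Seq.from_pretrained(finetuned_dir, export=True)
model.save_pretrained(onnx_dir)

processor = AutoProcessor.from_pretrained(finetuned_dir)
processor.save_pretrained(onnx_dir)

# the CLI route also lets you pick the opset, e.g.:
#   optimum-cli export onnx --model whisper-small-finetuned --opset 14 whisper-small-finetuned-onnx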
Expected behavior
I don't know whether this behavior is correct, or whether the exported tokenizer.json should always be the same.