[NLP] Enable import of models with missing vocabulary files #721

davidkyle · 2024-09-04T10:13:58Z

Eland needs access to a model's vocabulary file so that it can be uploaded to Elasticsearch along with the model definition. In some cases the vocab file is not included in the model repo on HuggingFace; one example is Jina Reranker. The eland_import_hub_model script fails with this error when the file is missing:

Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

This happens because AutoTokenizer.from_pretrained(...) is called with use_fast=False.
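A minimal sketch of that failure mode and one possible workaround, not Eland's actual code: try the slow tokenizer first and fall back to the fast one if the vocabulary file cannot be loaded. The helper name `load_tokenizer` and its `loader` parameter are hypothetical, and whether the fast tokenizer exposes everything Eland needs for the upload is an open assumption.

```python
from typing import Any, Callable, Optional


def load_tokenizer(model_id: str, loader: Optional[Callable[..., Any]] = None) -> Any:
    """Try the slow tokenizer, then fall back to the fast one (hypothetical helper)."""
    if loader is None:
        # Imported lazily so the sketch can be exercised without transformers installed.
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained
    try:
        # The slow tokenizer needs the vocab file to be present in the repo.
        return loader(model_id, use_fast=False)
    except OSError:
        # Assumption: the repo ships files the fast tokenizer can load instead.
        return loader(model_id, use_fast=True)
```

With `transformers` installed, `load_tokenizer("jinaai/jina-reranker-v2-base-multilingual")` would exercise the fallback path, assuming the repo contains a fast-tokenizer file.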

It should be possible to download the vocab from the base model. Investigate other ways to obtain the vocab file when it is not present in the model repo.
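One way the base-model idea could look, as a sketch under stated assumptions: walk a list of candidate repos and return the first one that provides the vocab file. The helper `find_vocab_file` and its `download` parameter are hypothetical. The file name `sentencepiece.bpe.model` is the one the slow XLM-RoBERTa tokenizer looks for, and using `xlm-roberta-base` as the base model is an assumption drawn from the tokenizer class in the traceback below.

```python
from typing import Callable, Iterable, Optional

# Vocab file name used by the slow XLM-RoBERTa tokenizer.
VOCAB_FILENAME = "sentencepiece.bpe.model"


def find_vocab_file(
    repo_ids: Iterable[str],
    download: Optional[Callable[[str, str], str]] = None,
    filename: str = VOCAB_FILENAME,
) -> Optional[str]:
    """Return the local path of the first vocab file found, or None (hypothetical helper)."""
    if download is None:
        # Imported lazily so the sketch can be exercised without huggingface_hub installed.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    for repo_id in repo_ids:
        try:
            return download(repo_id, filename)
        except Exception:
            # Broad on purpose for the sketch; real code should catch only
            # the "file not found in repo" error raised by the downloader.
            continue
    return None
```

For example, `find_vocab_file(["jinaai/jina-reranker-v2-base-multilingual", "xlm-roberta-base"])` would try the model repo first and then the assumed base model.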

eland_import_hub_model --cloud-id labs:xxxxxx== --hub-model-id jinaai/jina-reranker-v2-base-multilingual --task-type text_similarity --es-api-key xxxx== --start --clear-previous
And I'm getting this error:
2024-09-03 01:59:53,443 INFO : Establishing connection to Elasticsearch
2024-09-03 01:59:53,940 INFO : Connected to cluster named 'XXX' (version: 8.15.0)
2024-09-03 01:59:53,942 INFO : Loading HuggingFace transformer tokenizer and model 'jinaai/jina-reranker-v2-base-multilingual'
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 154, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 298, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 655, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2258, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.


serenachou · 2024-09-11T19:57:38Z

note to @serenachou a blocker for JinaAI models in the ML node

davidkyle added the bug and topic:NLP labels Sep 4, 2024

davidkyle mentioned this issue Sep 4, 2024: Support for importing model #687 (Closed)

davidkyle mentioned this issue Feb 14, 2025: Support the Jina AI jina-embeddings-v3 model #760 (Open)
