Eland needs access to a model's vocabulary file so that it can be uploaded to Elasticsearch along with the model definition. In some cases the vocab file is not included in the model repo on HuggingFace; one example is Jina Reranker. The eland_import_hub_model script fails with this error when the file is missing:
Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
This happens because AutoTokenizer.from_pretrained(...) is called with use_fast=False.
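For reference, the failing call reduces to roughly the following (a minimal sketch; the exact arguments eland passes may differ):

```python
# Minimal repro sketch. Eland loads the tokenizer with use_fast=False,
# which selects the slow XLMRobertaTokenizer (see the traceback below).
# The slow tokenizer needs a local sentencepiece vocab file, which the
# jinaai/jina-reranker-v2-base-multilingual repo does not ship, so
# vocab_file resolves to None and loading raises OSError.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "jinaai/jina-reranker-v2-base-multilingual",
    use_fast=False,  # forces the slow tokenizer, which requires the vocab file
)
# Raises: OSError: Unable to load vocabulary from file. ...
```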
It should be possible to download the vocab from the base model; other ways to obtain the vocab file when it is not present in the model repo should also be investigated.
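One possible shape of the base-model approach is sketched below. It assumes the reranker's tokenizer is XLM-RoBERTa-based (the traceback below points at tokenization_xlm_roberta.py) and that the sentencepiece.bpe.model from FacebookAI/xlm-roberta-base is compatible with it; that compatibility would need to be verified per model.

```python
# Workaround sketch, not a verified fix: fetch the missing sentencepiece
# vocab from the base model's repo and load the tokenizer from a local copy.
# Assumption: FacebookAI/xlm-roberta-base's vocab is compatible with the
# reranker's XLM-RoBERTa-based tokenizer.
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoTokenizer

# Materialize the reranker repo in a local directory.
local_dir = snapshot_download(
    "jinaai/jina-reranker-v2-base-multilingual",
    local_dir="./jina-reranker-v2",
)

# Pull the vocab file the repo is missing from the base model's repo.
hf_hub_download(
    repo_id="FacebookAI/xlm-roberta-base",
    filename="sentencepiece.bpe.model",
    local_dir=local_dir,
)

# The slow tokenizer can now find the vocab file.
tokenizer = AutoTokenizer.from_pretrained(local_dir, use_fast=False)
```

If that loads cleanly, eland_import_hub_model could be pointed at the patched local directory instead of the hub ID (assuming the script accepts local paths), or eland could perform the same lookup itself.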
I'm running:

eland_import_hub_model --cloud-id labs:xxxxxx== --hub-model-id jinaai/jina-reranker-v2-base-multilingual --task-type text_similarity --es-api-key xxxx== --start --clear-previous

and getting this error:
2024-09-03 01:59:53,443 INFO : Establishing connection to Elasticsearch
2024-09-03 01:59:53,940 INFO : Connected to cluster named 'XXX' (version: 8.15.0)
2024-09-03 01:59:53,942 INFO : Loading HuggingFace transformer tokenizer and model 'jinaai/jina-reranker-v2-base-multilingual'
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 154, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 298, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 655, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2258, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.