[NLP] Enable import of models with missing vocabulary files #721

Open
davidkyle opened this issue Sep 4, 2024 · 1 comment
Labels
bug Something isn't working topic:NLP Issue or PR about NLP model support and eland_import_hub_model

Comments

@davidkyle
Member

davidkyle commented Sep 4, 2024

Eland needs access to a model's vocabulary file so that it can be uploaded to Elasticsearch along with the model definition. In some cases the vocab file is not included in the model repo on HuggingFace; one example is Jina Reranker. The eland_import_hub_model script fails with this error when the file is missing:

Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

This happens because AutoTokenizer.from_pretrained(...) is called with use_fast=False.
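One possible mitigation, sketched here under assumptions rather than taken from Eland's code, is to try the slow tokenizer first and retry with the fast one when the vocabulary file cannot be loaded. The helper name `load_tokenizer` and its `loader` parameter are hypothetical; only `AutoTokenizer.from_pretrained` and its `use_fast` argument come from the report. Whether Eland can extract everything it needs from a fast tokenizer is not established here.

```python
from typing import Any, Callable, Optional


def load_tokenizer(model_id: str, loader: Optional[Callable[..., Any]] = None) -> Any:
    """Try the slow tokenizer first, then fall back to the fast one.

    `loader` is a hypothetical injection point so the logic can be exercised
    without network access; by default it is AutoTokenizer.from_pretrained.
    """
    if loader is None:
        # Imported lazily so the sketch does not require transformers to be
        # installed unless the default loader is actually used.
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained
    try:
        return loader(model_id, use_fast=False)
    except OSError:
        # The traceback in this issue ends in an OSError when the slow
        # tokenizer's vocabulary file is absent; retry with the fast tokenizer.
        return loader(model_id, use_fast=True)
```

With the default loader, `load_tokenizer("jinaai/jina-reranker-v2-base-multilingual")` would first attempt the call that fails in the traceback and then retry with `use_fast=True`.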

It should be possible to download the vocab from the base model. We should also investigate other ways to obtain the vocab file when it is not present in the model repo.
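As a rough illustration of the base-model idea, the sketch below picks the first repository whose file listing contains a vocabulary file that slow tokenizers commonly use. Everything here is an assumption for illustration: the helper names, the list of file names (not exhaustive), and the repo ids in the usage note are hypothetical. In practice the listings could come from `huggingface_hub.list_repo_files(repo_id)` and the chosen file could be fetched with `huggingface_hub.hf_hub_download(repo_id, filename)`. Whether a base model's vocabulary matches the fine-tuned model's would still need to be verified.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# File names commonly used by slow tokenizers (illustrative, not exhaustive).
SLOW_VOCAB_FILES = (
    "sentencepiece.bpe.model",  # e.g. XLM-RoBERTa style tokenizers
    "spiece.model",
    "tokenizer.model",
    "vocab.txt",
    "vocab.json",
)


def find_vocab_file(repo_files: Iterable[str]) -> Optional[str]:
    """Return the first known vocabulary file name present in a repo listing."""
    files = set(repo_files)
    for name in SLOW_VOCAB_FILES:
        if name in files:
            return name
    return None


def pick_vocab_repo(candidates: Dict[str, List[str]]) -> Optional[Tuple[str, str]]:
    """Pick the first repo (in insertion order) that contains a vocabulary file.

    `candidates` maps repo id -> file listing, with the model repo first and
    any fallback repos (such as the base model) after it.
    """
    for repo_id, files in candidates.items():
        name = find_vocab_file(files)
        if name is not None:
            return repo_id, name
    return None
```

For example, `pick_vocab_repo({"example/fine-tuned": ["config.json", "tokenizer.json"], "example/base": ["config.json", "sentencepiece.bpe.model"]})` selects the file from `example/base`, because the first repo ships only `tokenizer.json`.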

eland_import_hub_model --cloud-id labs:xxxxxx== --hub-model-id jinaai/jina-reranker-v2-base-multilingual --task-type text_similarity --es-api-key xxxx== --start --clear-previous
And I'm getting this error:
2024-09-03 01:59:53,443 INFO : Establishing connection to Elasticsearch
2024-09-03 01:59:53,940 INFO : Connected to cluster named 'XXX' (version: 8.15.0)
2024-09-03 01:59:53,942 INFO : Loading HuggingFace transformer tokenizer and model 'jinaai/jina-reranker-v2-base-multilingual'
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 154, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 298, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 655, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2258, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
@serenachou

Note to @serenachou: this is a blocker for JinaAI models in the ML node.
