
Feature request: allow user to provide tokenizer when loading transformer model #320

Open
jessecambon opened this issue Jul 27, 2022 · 2 comments

Comments

@jessecambon
Contributor

Feature request

When I try to load a locally saved transformers model with ORTModelForSequenceClassification.from_pretrained(<path>, from_transformers=True), an error is raised ("Unable to generate dummy inputs for the model") unless the tokenizer has also been saved to the checkpoint. A reproducible example is below.

A way to pass a tokenizer object directly to from_pretrained() would be helpful to avoid this problem.

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, AutoModelForSequenceClassification

orig_model = "prajjwal1/bert-tiny"
saved_model_path = "saved_model"

# Load a model from the hub and save it locally
model = AutoModelForSequenceClassification.from_pretrained(orig_model)
model.save_pretrained(saved_model_path)

# The tokenizer is loaded here but deliberately not saved to saved_model_path
tokenizer = AutoTokenizer.from_pretrained(orig_model)

# Attempt to load the locally saved model and convert it to ONNX
loaded_model = ORTModelForSequenceClassification.from_pretrained(
    saved_model_path,
    from_transformers=True,
)

This produces the following error:

Traceback (most recent call last):
  File "optimum_loading_reprex.py", line 21, in <module>
    loaded_model=ORTModelForSequenceClassification.from_pretrained(
  File "/home/cambonator/anaconda3/envs/onnx/lib/python3.8/site-packages/optimum/modeling_base.py", line 201, in from_pretrained
    return cls._from_transformers(
  File "/home/cambonator/anaconda3/envs/onnx/lib/python3.8/site-packages/optimum/onnxruntime/modeling_ort.py", line 275, in _from_transformers
    export(
  File "/home/cambonator/anaconda3/envs/onnx/lib/python3.8/site-packages/transformers/onnx/convert.py", line 335, in export
    return export_pytorch(preprocessor, model, config, opset, output, tokenizer=tokenizer, device=device)
  File "/home/cambonator/anaconda3/envs/onnx/lib/python3.8/site-packages/transformers/onnx/convert.py", line 142, in export_pytorch
    model_inputs = config.generate_dummy_inputs(preprocessor, framework=TensorType.PYTORCH)
  File "/home/cambonator/anaconda3/envs/onnx/lib/python3.8/site-packages/transformers/onnx/config.py", line 334, in generate_dummy_inputs
    raise ValueError(
ValueError: Unable to generate dummy inputs for the model. Please provide a tokenizer or a preprocessor.
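
As noted above, the export succeeds when the tokenizer files are also written into the checkpoint directory before loading. A minimal sketch of that workaround, continuing the example above:

# Saving the tokenizer next to the model weights gives the exporter a
# preprocessor it can use to generate dummy inputs
tokenizer.save_pretrained(saved_model_path)

loaded_model = ORTModelForSequenceClassification.from_pretrained(
    saved_model_path,
    from_transformers=True,
)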

Package versions

  • transformers: 4.20.1
  • optimum: 1.3.0
  • onnxruntime: 1.11.1
  • torch: 1.11.0

Motivation

Saving the tokenizer to the model checkpoint is an extra step that could be eliminated if there were a way to provide a tokenizer directly to ORTModelForSequenceClassification.from_pretrained().
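
Purely as an illustrative sketch, the call could look something like the following (the tokenizer keyword argument is hypothetical and does not exist in the current optimum API):

tokenizer = AutoTokenizer.from_pretrained(orig_model)

# Hypothetical keyword argument: pass the in-memory tokenizer so it does not
# have to be saved into the checkpoint directory first
loaded_model = ORTModelForSequenceClassification.from_pretrained(
    saved_model_path,
    from_transformers=True,
    tokenizer=tokenizer,
)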

Your contribution

I'm not currently sure where to start on implementing this feature, but I would be happy to help with some guidance.

@jmwoloso

+1
Related to #210, which is about supporting alternative workflows when run-time tokenization isn't possible or feasible.

@michaelbenayoun
Member

Hi @jessecambon,
This seems related to the ONNX export. We are currently working on adding support for this in optimum, and once that lands, providing a tokenizer will no longer be needed to perform the export.
