<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Optimum pipelines for inference

The [`pipeline`] makes it simple to use models from the [Model Hub](https://huggingface.co/models) for accelerated inference on a variety of tasks such as text classification.
Even if you don't have experience with a specific modality or understand the code powering the models, you can still use them with the [`pipeline`]! This tutorial will teach you how to use the [`pipeline`] for accelerated inference with Optimum.

<Tip>

You can also use the `pipeline()` function from Transformers and provide your Optimum model, as shown in the Transformers pipeline usage section at the end of this tutorial.

</Tip>

Currently supported tasks are:

**ONNX Runtime**

* `feature-extraction`
* `text-classification`
* `token-classification`
* `question-answering`
* `zero-shot-classification`
* `text-generation`

## Optimum pipeline usage

While each task has an associated pipeline class, it is simpler to use the general [`pipeline`] abstraction, which contains all the task-specific pipelines.
The [`pipeline`] automatically loads a default model and tokenizer capable of inference for your task.

1. Start by creating a [`pipeline`] and specify an inference task:

```python
>>> from optimum.pipelines import pipeline

>>> classifier = pipeline(task="text-classification", accelerator="ort")
```

2. Pass your input text to the [`pipeline`]:

```python
>>> classifier("I like you. I love you.")
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]
```

_Note: The default models used in the [`pipeline`] are not optimized or quantized, so there won't be a performance improvement compared to their PyTorch counterparts._
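
The same two steps work for the other supported tasks listed above. As a minimal sketch, here is a zero-shot classification pipeline; the example sentence and candidate labels are arbitrary, and the default model for the task is downloaded automatically:

```python
>>> from optimum.pipelines import pipeline

>>> zero_shot = pipeline(task="zero-shot-classification", accelerator="ort")
>>> # candidate_labels follows the usual Transformers zero-shot classification API
>>> zero_shot("I want to book a flight to Paris.", candidate_labels=["travel", "cooking", "finance"])
```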

### Using vanilla Transformers models and converting them to ONNX

The [`pipeline`] accepts any supported model from the [Model Hub](https://huggingface.co/models).
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `from_pretrained("{model_id}", from_transformers=True)` method of the corresponding `ORTModelFor*` class,
and load the tokenizer with the [`AutoTokenizer`] class. For example, here's how you can load the [`ORTModelForQuestionAnswering`] class for question answering:

```python
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> from optimum.pipelines import pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
>>> # load the PyTorch checkpoint and convert it to the ORT format by passing from_transformers=True
>>> model = ORTModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2", from_transformers=True)

>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```

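Since `from_transformers=True` converts the checkpoint on every call to `from_pretrained()`, you may want to save the exported model once and reload it later without reconverting. A minimal sketch, reusing `model` and `tokenizer` from the example above; the directory name is arbitrary:

```python
>>> save_directory = "onnx_roberta_base_squad2"  # arbitrary local directory

>>> model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)

>>> # later: reload the already converted model without from_transformers=True
>>> model = ORTModelForQuestionAnswering.from_pretrained(save_directory)
```
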
### Using Optimum models

The [`pipeline`] is tightly integrated with the [Model Hub](https://huggingface.co/models) and can load optimized models directly, e.g. those created with ONNX Runtime.
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `from_pretrained()` method of the corresponding `ORTModelFor*` class,
and load the tokenizer with the [`AutoTokenizer`] class. For example, here's how you can load an optimized model for question answering:

```python
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> from optimum.pipelines import pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> # load an already converted and optimized ORT checkpoint for inference
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```
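
If you are unsure which optimized checkpoints are available, you can also query the Hub programmatically. Below is a minimal sketch using `huggingface_hub`; the `optimum` organization and the `onnx` filter tag are assumptions, adjust them to what you are looking for:

```python
>>> from itertools import islice
>>> from huggingface_hub import HfApi

>>> api = HfApi()
>>> # list a handful of models from the "optimum" organization carrying the (assumed) "onnx" tag
>>> for model_info in islice(api.list_models(author="optimum", filter="onnx"), 5):
...     print(model_info)
```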

### Optimizing and quantizing in pipelines

The [`pipeline`] can run inference not only on vanilla ONNX Runtime checkpoints; you can also use checkpoints that were quantized with the [`ORTQuantizer`] or optimized with the [`ORTOptimizer`].
Below you can find two examples of how to use the [`ORTQuantizer`] and the [`ORTOptimizer`] to quantize or optimize your model and use it for inference afterwards.

### Quantizing with [`ORTQuantizer`]

```python
>>> from pathlib import Path
>>> from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> from optimum.pipelines import pipeline
>>> from transformers import AutoTokenizer

# define model_id and load tokenizer
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> save_path = Path("optimum_model")
>>> save_path.mkdir(exist_ok=True)

# use ORTQuantizer to export the model and define quantization configuration
>>> quantizer = ORTQuantizer.from_pretrained(model_id, feature="sequence-classification")
>>> qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)

# apply the quantization configuration to the model
>>> quantizer.export(
...     onnx_model_path=save_path / "model.onnx",
...     onnx_quantized_model_output_path=save_path / "model-quantized.onnx",
...     quantization_config=qconfig,
... )
>>> quantizer.model.config.save_pretrained(save_path)  # saves config.json

# load quantized model from local path or repository
>>> model = ORTModelForSequenceClassification.from_pretrained(save_path, file_name="model-quantized.onnx")

# create transformers pipeline
>>> onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> text = "I like the new ORT pipeline"
>>> pred = onnx_clx(text)
>>> print(pred)

# save model & push model to the hub
>>> tokenizer.save_pretrained("new_path_for_directory")
>>> model.save_pretrained("new_path_for_directory")
>>> model.push_to_hub("new_path_for_directory",
...     repository_id="my-onnx-repo",
...     use_auth_token=True,
... )
```
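
To get a feel for what quantization buys you, you can compare the quantized pipeline against a vanilla ONNX Runtime pipeline for the same checkpoint. This is a rough, minimal sketch rather than a proper benchmark; it reuses `tokenizer`, `text` and `onnx_clx` from the example above:

```python
>>> import time
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from optimum.pipelines import pipeline

>>> # vanilla (non-quantized) ONNX Runtime model for comparison
>>> vanilla_model = ORTModelForSequenceClassification.from_pretrained(
...     "distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True
... )
>>> vanilla_clx = pipeline("text-classification", model=vanilla_model, tokenizer=tokenizer)

>>> def average_latency(pipe, text, runs=100):
...     start = time.perf_counter()
...     for _ in range(runs):
...         pipe(text)
...     return (time.perf_counter() - start) / runs

>>> print(f"vanilla:   {average_latency(vanilla_clx, text):.4f} s/inference")
>>> print(f"quantized: {average_latency(onnx_clx, text):.4f} s/inference")
```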

### Optimizing with [`ORTOptimizer`]

```python
>>> from pathlib import Path
>>> from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
>>> from optimum.onnxruntime.configuration import OptimizationConfig
>>> from optimum.pipelines import pipeline
>>> from transformers import AutoTokenizer

# define model_id and load tokenizer
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> save_path = Path("optimum_model")
>>> save_path.mkdir(exist_ok=True)

# use ORTOptimizer to export the model and define optimization configuration
>>> optimizer = ORTOptimizer.from_pretrained(model_id, feature="sequence-classification")
>>> optimization_config = OptimizationConfig(optimization_level=2)

# apply the optimization configuration to the model
>>> optimizer.export(
...     onnx_model_path=save_path / "model.onnx",
...     onnx_optimized_model_output_path=save_path / "model-optimized.onnx",
...     optimization_config=optimization_config,
... )
>>> optimizer.model.config.save_pretrained(save_path)  # saves config.json

# load optimized model from local path or repository
>>> model = ORTModelForSequenceClassification.from_pretrained(save_path, file_name="model-optimized.onnx")

# create transformers pipeline
>>> onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> text = "I like the new ORT pipeline"
>>> pred = onnx_clx(text)
>>> print(pred)

# save model & push model to the hub
>>> tokenizer.save_pretrained("new_path_for_directory")
>>> model.save_pretrained("new_path_for_directory")
>>> model.push_to_hub("new_path_for_directory",
...     repository_id="my-onnx-repo",
...     use_auth_token=True,
... )
```
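
As a quick sanity check, you can verify that both the exported and the optimized ONNX files were written to `optimum_model` and compare their sizes on disk. A minimal sketch using only the standard library (note that graph optimization does not necessarily shrink the file):

```python
>>> from pathlib import Path

>>> for onnx_file in sorted(Path("optimum_model").glob("*.onnx")):
...     print(f"{onnx_file.name}: {onnx_file.stat().st_size / 1e6:.1f} MB")
```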

## Transformers pipeline usage

The [`pipeline`] is just a light wrapper around the `transformers.pipeline` function to enable checks for supported tasks and additional features, like quantization and optimization.
That being said, you can also use the `transformers.pipeline` function directly and simply replace your `AutoModelFor*` class with the corresponding Optimum `ORTModelFor*` class.

```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
+model = ORTModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
```