
Commit 471f803

Merge remote-tracking branch 'upstream/main' into feature/directml
2 parents: 66ed0ff + 6335599

18 files changed, +251 −328 lines


.github/workflows/test_onnxruntime.yml

+1-1
@@ -64,4 +64,4 @@ jobs:
        run: |
          pytest tests/onnxruntime -m "not run_in_series" --durations=0 -vvvv -n auto
        env:
-         HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
+         HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}

docs/source/bettertransformer/overview.mdx

+3-3
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.

 ## Quickstart

-Since its 1.13 version, [PyTorch released](https://pytorch.org/blog/PyTorch-1.13-release/) the stable version of a fast path for its standard Transformer APIs that provides out of the box performance improvements for transformer-based models. You can benefit from interesting speedup on most consumer-type devices, including CPUs, older and newer versions of NIVIDIA GPUs.
+Since its 1.13 version, [PyTorch released](https://pytorch.org/blog/PyTorch-1.13-release/) the stable version of a fast path for its standard Transformer APIs that provides out of the box performance improvements for transformer-based models. You can benefit from interesting speedup on most consumer-type devices, including CPUs, older and newer versions of NVIDIA GPUs.

 You can now use this feature in 🤗 Optimum together with Transformers and use it for major models in the Hugging Face ecosystem.

 In the 2.0 version, PyTorch includes a native scaled dot-product attention operator (SDPA) as part of `torch.nn.functional`. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the [official documentation](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) for more information, and [this blog post](https://pytorch.org/blog/out-of-the-box-acceleration/) for benchmarks.

@@ -54,13 +54,13 @@ The list of supported model below:
 - [DeiT](https://arxiv.org/abs/2012.12877)
 - [Electra](https://arxiv.org/abs/2003.10555)
 - [Ernie](https://arxiv.org/abs/1904.09223)
-- [Falcon](https://arxiv.org/abs/2306.01116) (No need to use BetterTransformer, it is [directy supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
+- [Falcon](https://arxiv.org/abs/2306.01116) (No need to use BetterTransformer, it is [directly supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
 - [FSMT](https://arxiv.org/abs/1907.06616)
 - [GPT2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
 - [GPT-j](https://huggingface.co/EleutherAI/gpt-j-6B)
 - [GPT-neo](https://github.com/EleutherAI/gpt-neo)
 - [GPT-neo-x](https://arxiv.org/abs/2204.06745)
-- [GPT BigCode](https://arxiv.org/abs/2301.03988) (SantaCoder, StarCoder - no need to use BetterTransformer, it is [directy supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
+- [GPT BigCode](https://arxiv.org/abs/2301.03988) (SantaCoder, StarCoder - no need to use BetterTransformer, it is [directly supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
 - [HuBERT](https://arxiv.org/pdf/2106.07447.pdf)
 - [LayoutLM](https://arxiv.org/abs/1912.13318)
 - [Llama & Llama2](https://arxiv.org/abs/2302.13971) (No need to use BetterTransformer, it is [directy supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
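
(Reviewer note: the overview text above leans on PyTorch's SDPA operator. A minimal sketch of calling it directly, with arbitrary toy shapes, just to ground what the doc refers to.)

```python
import torch
import torch.nn.functional as F

# toy tensors shaped (batch, heads, seq_len, head_dim)
query = torch.randn(2, 8, 16, 64)
key = torch.randn(2, 8, 16, 64)
value = torch.randn(2, 8, 16, 64)

# PyTorch dispatches to a fused implementation (flash, memory-efficient,
# or the math fallback) depending on the inputs and hardware
output = F.scaled_dot_product_attention(query, key, value)
print(output.shape)  # torch.Size([2, 8, 16, 64])
```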

docs/source/bettertransformer/tutorials/contribute.mdx

+2-2
@@ -112,7 +112,7 @@ Now, make sure to fill all the necessary attributes, the list of attributes are:

 Note that these attributes correspond to all the components that are necessary to run a Transformer Encoder module, check the figure 1 on the ["Attention Is All You Need"](https://arxiv.org/pdf/1706.03762.pdf) paper.

-Once you filled all these attributes (sometimes the `query`, `key` and `value` layers needs to be "contigufied", check the [`modeling_encoder.py`](https://github.com/huggingface/optimum/blob/main/optimum/bettertransformer/models/encoder_models.py) file to understand more.)
+Once you filled all these attributes (sometimes the `query`, `key` and `value` layers needs to be "contiguified", check the [`modeling_encoder.py`](https://github.com/huggingface/optimum/blob/main/optimum/bettertransformer/models/encoder_models.py) file to understand more.)

 Make sure also to add the lines:
 ```python

@@ -125,7 +125,7 @@ self.validate_bettertransformer()

 First of all, start with the line `super().forward_checker()`, this is needed so that the parent class can run all the safety checkers before.

-After the first forward pass, the hidden states needs to be *nested* using the attention mask. Once they are nested, the attention mask is not needed anymore, therefore can be set to `None`. This is how the forward pass is built for `Bert`, these lines should remain pretty much similar accross models, but sometimes the shapes of the attention masks are different across models.
+After the first forward pass, the hidden states needs to be *nested* using the attention mask. Once they are nested, the attention mask is not needed anymore, therefore can be set to `None`. This is how the forward pass is built for `Bert`, these lines should remain pretty much similar across models, but sometimes the shapes of the attention masks are different across models.
 ```python
 super().forward_checker()
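
(Aside on the "nested" hidden states mentioned in this hunk: this is not the exact helper Optimum uses, just a minimal sketch, assuming PyTorch's public `torch.nested` API, of packing padded hidden states and an attention mask into a nested tensor.)

```python
import torch

# a padded batch: 2 sequences, max length 4, hidden size 3 (toy sizes)
hidden_states = torch.randn(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]], dtype=torch.bool)

# keep only the real tokens of each sequence and pack them into a
# nested tensor; once nested, the mask carries no extra information
nested_hidden_states = torch.nested.nested_tensor(
    [seq[mask] for seq, mask in zip(hidden_states, attention_mask)]
)
attention_mask = None
print(nested_hidden_states.is_nested)  # True
```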

docs/source/bettertransformer/tutorials/convert.mdx

+3-3
@@ -45,7 +45,7 @@ Sometimes you can directly load your model on your GPU devices using `accelerate

 ## Step 2: Set your model on your preferred device

-If you did not used `device_map="auto"` to load your model (or if your model does not support `device_map="auto"`), you can manually set your model to a GPU:
+If you did not use `device_map="auto"` to load your model (or if your model does not support `device_map="auto"`), you can manually set your model to a GPU:
 ```python
 >>> model = model.to(0) # or model.to("cuda:0")
 ```

@@ -92,7 +92,7 @@ You can also use `transformers.pipeline` as usual and pass the converted model d
 >>> ...
 ```

-Please refer to the [official documentation of `pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) for further usage. If you face into any issue, do not hesitate to open an isse on GitHub!
+Please refer to the [official documentation of `pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) for further usage. If you run into any issue, do not hesitate to open an issue on GitHub!

 ## Training compatibility

@@ -113,4 +113,4 @@ model = BetterTransformer.transform(model)
 model = BetterTransformer.reverse(model)
 model.save_pretrained("fine_tuned_model")
 model.push_to_hub("fine_tuned_model")
-```
+```
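
(The training-compatibility hunk above shows only the tail of the doc's snippet. As a rough end-to-end sketch of the `transform`/`reverse` round trip it refers to, with an arbitrary example checkpoint:)

```python
from transformers import AutoModelForSequenceClassification
from optimum.bettertransformer import BetterTransformer

# any encoder checkpoint supported by BetterTransformer works here;
# "bert-base-uncased" is only an example
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# swap supported layers for the BetterTransformer fast path
model = BetterTransformer.transform(model)

# ... fine-tune or run inference ...

# go back to the canonical Transformers modeling code before saving
model = BetterTransformer.reverse(model)
model.save_pretrained("fine_tuned_model")
```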

docs/source/onnxruntime/usage_guides/models.mdx

+1-1
@@ -16,7 +16,7 @@ Once your model was [exported to the ONNX format](https://huggingface.co/docs/op
 - from transformers import AutoModelForCausalLM
 + from optimum.onnxruntime import ORTModelForCausalLM

-- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B) # PyTorch checkpoint
+- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B") # PyTorch checkpoint
 + model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
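
(The hunk above only fixes a missing quote in the doc's before/after snippet. A minimal, self-contained sketch of what the converted snippet does when run, with an illustrative prompt and generation settings:)

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# ONNX checkpoint referenced in the doc snippet above
model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = tokenizer("ONNX Runtime makes inference", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```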

docs/source/onnxruntime/usage_guides/optimization.mdx

+1-1
@@ -132,7 +132,7 @@ Below you will find an easy end-to-end example on how to optimize [distilbert-ba
 ```


-Below you will find an easy end-to-end example on how to optimize a Seq2Seq model [sshleifer/distilbart-cnn-12-6"](https://huggingface.co/sshleifer/distilbart-cnn-12-6).
+Below you will find an easy end-to-end example on how to optimize a Seq2Seq model [sshleifer/distilbart-cnn-12-6](https://huggingface.co/sshleifer/distilbart-cnn-12-6).

 ```python
 >>> from transformers import AutoTokenizer
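
(The doc's full example continues past this hunk and is truncated here. As a rough sketch of how such a Seq2Seq optimization typically looks with `ORTOptimizer`, not necessarily the exact code on the page:)

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# export the Seq2Seq checkpoint to ONNX, then optimize the resulting graphs
model = ORTModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6", export=True)
optimizer = ORTOptimizer.from_pretrained(model)

optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="distilbart_optimized", optimization_config=optimization_config)
```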

docs/source/quicktour.mdx

+1-1
@@ -185,7 +185,7 @@ Check out the [documentation](https://huggingface.co/docs/optimum/exporters/onnx

 ## PyTorch's BetterTransformer support

-[BetterTransformer](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) is a free-lunch PyTorch-native optimization to gain x1.25 - x4 speedup on the inference of Transformer-based models. It has been marked as stable in [PyTorch 1.13](https://pytorch.org/blog/PyTorch-1.13-release/). We integrated BetterTransformer with the most-used models from the 🤗 Transformers libary, and using the integration is as simple as:
+[BetterTransformer](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) is a free-lunch PyTorch-native optimization to gain x1.25 - x4 speedup on the inference of Transformer-based models. It has been marked as stable in [PyTorch 1.13](https://pytorch.org/blog/PyTorch-1.13-release/). We integrated BetterTransformer with the most-used models from the 🤗 Transformers library, and using the integration is as simple as:

 ```python
 >>> from optimum.bettertransformer import BetterTransformer

optimum/exporters/tasks.py

+8-3
@@ -2067,7 +2067,11 @@ def infer_library_from_model(
         return library_name

     @classmethod
-    def standardize_model_attributes(cls, model: Union["PreTrainedModel", "TFPreTrainedModel", "DiffusionPipeline"]):
+    def standardize_model_attributes(
+        cls,
+        model: Union["PreTrainedModel", "TFPreTrainedModel", "DiffusionPipeline"],
+        library_name: Optional[str] = None,
+    ):
         """
         Updates the model for export. This function is suitable to make required changes to the models from different
         libraries to follow transformers style.

@@ -2078,7 +2082,8 @@ def standardize_model_attributes(cls, model: Union["PreTrainedModel", "TFPreTrai
         """

-        library_name = TasksManager.infer_library_from_model(model)
+        if library_name is None:
+            library_name = TasksManager.infer_library_from_model(model)

         if library_name == "diffusers":
             inferred_model_type = None

@@ -2299,7 +2304,7 @@ def get_model_from_task(
             kwargs["from_pt"] = True
         model = model_class.from_pretrained(model_name_or_path, **kwargs)

-        TasksManager.standardize_model_attributes(model)
+        TasksManager.standardize_model_attributes(model, library_name=library_name)

         return model
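
(The point of this change is to let a caller that has already inferred the library pass it through instead of paying for a second inference. A stripped-down, hypothetical sketch of the optional-argument fallback pattern, with stand-in logic:)

```python
from typing import Optional


class TasksManagerSketch:
    """Hypothetical stand-in illustrating the fallback pattern above."""

    @classmethod
    def infer_library_from_model(cls, model) -> str:
        # placeholder for the real (potentially costly) inference logic
        return "transformers"

    @classmethod
    def standardize_model_attributes(cls, model, library_name: Optional[str] = None) -> str:
        # only infer when the caller did not already know the library
        if library_name is None:
            library_name = cls.infer_library_from_model(model)
        return library_name


# a caller that already knows the library skips re-inference entirely
print(TasksManagerSketch.standardize_model_attributes(object(), library_name="diffusers"))  # diffusers
print(TasksManagerSketch.standardize_model_attributes(object()))  # transformers
```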

optimum/onnxruntime/constants.py

+1
@@ -16,3 +16,4 @@
 DECODER_ONNX_FILE_PATTERN = r"(.*)?decoder((?!(with_past|merged)).)*?\.onnx"
 DECODER_WITH_PAST_ONNX_FILE_PATTERN = r"(.*)?decoder(.*)?with_past(.*)?\.onnx"
 DECODER_MERGED_ONNX_FILE_PATTERN = r"(.*)?decoder(.*)?merged(.*)?\.onnx"
+ONNX_FILE_PATTERN = r".*\.onnx$"
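
(A quick sketch of what the new pattern matches; the candidate file names are illustrative.)

```python
import re

# pattern added in this commit: any file name ending in ".onnx"
ONNX_FILE_PATTERN = r".*\.onnx$"

candidates = ["model.onnx", "decoder_with_past_model.onnx", "config.json", "model.onnx_data"]
matches = [name for name in candidates if re.match(ONNX_FILE_PATTERN, name)]
print(matches)  # ['model.onnx', 'decoder_with_past_model.onnx']
```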

optimum/onnxruntime/model.py

-94
This file was deleted.
