
Commit 471f803

Merge remote-tracking branch 'upstream/main' into feature/directml
2 parents: 66ed0ff + 6335599

18 files changed, +251 −328 lines


.github/workflows/test_onnxruntime.yml

+1-1
@@ -64,4 +64,4 @@ jobs:
        run: |
          pytest tests/onnxruntime -m "not run_in_series" --durations=0 -vvvv -n auto
        env:
-         HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
+         HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}

docs/source/bettertransformer/overview.mdx

+3-3
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.

 ## Quickstart

-Since its 1.13 version, [PyTorch released](https://pytorch.org/blog/PyTorch-1.13-release/) the stable version of a fast path for its standard Transformer APIs that provides out of the box performance improvements for transformer-based models. You can benefit from interesting speedup on most consumer-type devices, including CPUs, older and newer versions of NIVIDIA GPUs.
+Since its 1.13 version, [PyTorch released](https://pytorch.org/blog/PyTorch-1.13-release/) the stable version of a fast path for its standard Transformer APIs that provides out of the box performance improvements for transformer-based models. You can benefit from interesting speedup on most consumer-type devices, including CPUs, older and newer versions of NVIDIA GPUs.

 You can now use this feature in 🤗 Optimum together with Transformers and use it for major models in the Hugging Face ecosystem.

 In the 2.0 version, PyTorch includes a native scaled dot-product attention operator (SDPA) as part of `torch.nn.functional`. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the [official documentation](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) for more information, and [this blog post](https://pytorch.org/blog/out-of-the-box-acceleration/) for benchmarks.

@@ -54,13 +54,13 @@ The list of supported model below:
 - [DeiT](https://arxiv.org/abs/2012.12877)
 - [Electra](https://arxiv.org/abs/2003.10555)
 - [Ernie](https://arxiv.org/abs/1904.09223)
-- [Falcon](https://arxiv.org/abs/2306.01116) (No need to use BetterTransformer, it is [directy supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
+- [Falcon](https://arxiv.org/abs/2306.01116) (No need to use BetterTransformer, it is [directly supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
 - [FSMT](https://arxiv.org/abs/1907.06616)
 - [GPT2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
 - [GPT-j](https://huggingface.co/EleutherAI/gpt-j-6B)
 - [GPT-neo](https://github.com/EleutherAI/gpt-neo)
 - [GPT-neo-x](https://arxiv.org/abs/2204.06745)
-- [GPT BigCode](https://arxiv.org/abs/2301.03988) (SantaCoder, StarCoder - no need to use BetterTransformer, it is [directy supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
+- [GPT BigCode](https://arxiv.org/abs/2301.03988) (SantaCoder, StarCoder - no need to use BetterTransformer, it is [directly supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
 - [HuBERT](https://arxiv.org/pdf/2106.07447.pdf)
 - [LayoutLM](https://arxiv.org/abs/1912.13318)
 - [Llama & Llama2](https://arxiv.org/abs/2302.13971) (No need to use BetterTransformer, it is [directy supported by Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention))
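
(Reviewer note: the overview text above leans on PyTorch's SDPA operator. A minimal sketch of calling it directly, with arbitrary toy shapes, just to ground what the doc refers to.)

```python
import torch
import torch.nn.functional as F

# toy tensors shaped (batch, heads, seq_len, head_dim)
query = torch.randn(2, 8, 16, 64)
key = torch.randn(2, 8, 16, 64)
value = torch.randn(2, 8, 16, 64)

# PyTorch dispatches to a fused implementation (flash, memory-efficient,
# or the math fallback) depending on the inputs and hardware
output = F.scaled_dot_product_attention(query, key, value)
print(output.shape)  # torch.Size([2, 8, 16, 64])
```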

docs/source/bettertransformer/tutorials/contribute.mdx

+2-2
@@ -112,7 +112,7 @@ Now, make sure to fill all the necessary attributes, the list of attributes are:

 Note that these attributes correspond to all the components that are necessary to run a Transformer Encoder module, check the figure 1 on the ["Attention Is All You Need"](https://arxiv.org/pdf/1706.03762.pdf) paper.

-Once you filled all these attributes (sometimes the `query`, `key` and `value` layers needs to be "contigufied", check the [`modeling_encoder.py`](https://github.com/huggingface/optimum/blob/main/optimum/bettertransformer/models/encoder_models.py) file to understand more.)
+Once you filled all these attributes (sometimes the `query`, `key` and `value` layers needs to be "contiguified", check the [`modeling_encoder.py`](https://github.com/huggingface/optimum/blob/main/optimum/bettertransformer/models/encoder_models.py) file to understand more.)

 Make sure also to add the lines:
 ```python

@@ -125,7 +125,7 @@ self.validate_bettertransformer()

 First of all, start with the line `super().forward_checker()`, this is needed so that the parent class can run all the safety checkers before.

-After the first forward pass, the hidden states needs to be *nested* using the attention mask. Once they are nested, the attention mask is not needed anymore, therefore can be set to `None`. This is how the forward pass is built for `Bert`, these lines should remain pretty much similar accross models, but sometimes the shapes of the attention masks are different across models.
+After the first forward pass, the hidden states needs to be *nested* using the attention mask. Once they are nested, the attention mask is not needed anymore, therefore can be set to `None`. This is how the forward pass is built for `Bert`, these lines should remain pretty much similar across models, but sometimes the shapes of the attention masks are different across models.
 ```python
 super().forward_checker()
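
(Aside on the "nested" hidden states mentioned in this hunk: this is not the exact helper Optimum uses, just a minimal sketch, assuming PyTorch's public `torch.nested` API, of packing padded hidden states and an attention mask into a nested tensor.)

```python
import torch

# a padded batch: 2 sequences, max length 4, hidden size 3 (toy sizes)
hidden_states = torch.randn(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]], dtype=torch.bool)

# keep only the real tokens of each sequence and pack them into a
# nested tensor; once nested, the mask carries no extra information
nested_hidden_states = torch.nested.nested_tensor(
    [seq[mask] for seq, mask in zip(hidden_states, attention_mask)]
)
attention_mask = None
print(nested_hidden_states.is_nested)  # True
```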

docs/source/bettertransformer/tutorials/convert.mdx

+3-3
@@ -45,7 +45,7 @@ Sometimes you can directly load your model on your GPU devices using `accelerate

 ## Step 2: Set your model on your preferred device

-If you did not used `device_map="auto"` to load your model (or if your model does not support `device_map="auto"`), you can manually set your model to a GPU:
+If you did not use `device_map="auto"` to load your model (or if your model does not support `device_map="auto"`), you can manually set your model to a GPU:
 ```python
 >>> model = model.to(0) # or model.to("cuda:0")
 ```

@@ -92,7 +92,7 @@ You can also use `transformers.pipeline` as usual and pass the converted model d
 >>> ...
 ```

-Please refer to the [official documentation of `pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) for further usage. If you face into any issue, do not hesitate to open an isse on GitHub!
+Please refer to the [official documentation of `pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) for further usage. If you run into any issue, do not hesitate to open an issue on GitHub!

 ## Training compatibility

@@ -113,4 +113,4 @@ model = BetterTransformer.transform(model)
 model = BetterTransformer.reverse(model)
 model.save_pretrained("fine_tuned_model")
 model.push_to_hub("fine_tuned_model")
-```
+```
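
(The training-compatibility hunk above shows only the tail of the doc's snippet. As a rough end-to-end sketch of the `transform`/`reverse` round trip it refers to, with an arbitrary example checkpoint:)

```python
from transformers import AutoModelForSequenceClassification
from optimum.bettertransformer import BetterTransformer

# any encoder checkpoint supported by BetterTransformer works here;
# "bert-base-uncased" is only an example
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# swap supported layers for the BetterTransformer fast path
model = BetterTransformer.transform(model)

# ... fine-tune or run inference ...

# go back to the canonical Transformers modeling code before saving
model = BetterTransformer.reverse(model)
model.save_pretrained("fine_tuned_model")
```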

docs/source/onnxruntime/usage_guides/models.mdx

+1-1
@@ -16,7 +16,7 @@ Once your model was [exported to the ONNX format](https://huggingface.co/docs/op
 - from transformers import AutoModelForCausalLM
 + from optimum.onnxruntime import ORTModelForCausalLM

-- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B) # PyTorch checkpoint
+- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B") # PyTorch checkpoint
 + model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
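
(The hunk above only fixes a missing quote in the doc's before/after snippet. A minimal, self-contained sketch of what the converted snippet does when run, with an illustrative prompt and generation settings:)

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# ONNX checkpoint referenced in the doc snippet above
model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = tokenizer("ONNX Runtime makes inference", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```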

docs/source/onnxruntime/usage_guides/optimization.mdx

+1-1
@@ -132,7 +132,7 @@ Below you will find an easy end-to-end example on how to optimize [distilbert-ba
 ```


-Below you will find an easy end-to-end example on how to optimize a Seq2Seq model [sshleifer/distilbart-cnn-12-6"](https://huggingface.co/sshleifer/distilbart-cnn-12-6).
+Below you will find an easy end-to-end example on how to optimize a Seq2Seq model [sshleifer/distilbart-cnn-12-6](https://huggingface.co/sshleifer/distilbart-cnn-12-6).

 ```python
 >>> from transformers import AutoTokenizer
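
(The doc's full example continues past this hunk and is truncated here. As a rough sketch of how such a Seq2Seq optimization typically looks with `ORTOptimizer`, not necessarily the exact code on the page:)

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# export the Seq2Seq checkpoint to ONNX, then optimize the resulting graphs
model = ORTModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6", export=True)
optimizer = ORTOptimizer.from_pretrained(model)

optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="distilbart_optimized", optimization_config=optimization_config)
```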

docs/source/quicktour.mdx

+1-1
@@ -185,7 +185,7 @@ Check out the [documentation](https://huggingface.co/docs/optimum/exporters/onnx

 ## PyTorch's BetterTransformer support

-[BetterTransformer](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) is a free-lunch PyTorch-native optimization to gain x1.25 - x4 speedup on the inference of Transformer-based models. It has been marked as stable in [PyTorch 1.13](https://pytorch.org/blog/PyTorch-1.13-release/). We integrated BetterTransformer with the most-used models from the 🤗 Transformers libary, and using the integration is as simple as:
+[BetterTransformer](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) is a free-lunch PyTorch-native optimization to gain x1.25 - x4 speedup on the inference of Transformer-based models. It has been marked as stable in [PyTorch 1.13](https://pytorch.org/blog/PyTorch-1.13-release/). We integrated BetterTransformer with the most-used models from the 🤗 Transformers library, and using the integration is as simple as:

 ```python
 >>> from optimum.bettertransformer import BetterTransformer

optimum/exporters/tasks.py

+8-3
@@ -2067,7 +2067,11 @@ def infer_library_from_model(
         return library_name

     @classmethod
-    def standardize_model_attributes(cls, model: Union["PreTrainedModel", "TFPreTrainedModel", "DiffusionPipeline"]):
+    def standardize_model_attributes(
+        cls,
+        model: Union["PreTrainedModel", "TFPreTrainedModel", "DiffusionPipeline"],
+        library_name: Optional[str] = None,
+    ):
         """
         Updates the model for export. This function is suitable to make required changes to the models from different
         libraries to follow transformers style.

@@ -2078,7 +2082,8 @@ def standardize_model_attributes(cls, model: Union["PreTrainedModel", "TFPreTrai
         """

-        library_name = TasksManager.infer_library_from_model(model)
+        if library_name is None:
+            library_name = TasksManager.infer_library_from_model(model)

         if library_name == "diffusers":
             inferred_model_type = None

@@ -2299,7 +2304,7 @@ def get_model_from_task(
             kwargs["from_pt"] = True
         model = model_class.from_pretrained(model_name_or_path, **kwargs)

-        TasksManager.standardize_model_attributes(model)
+        TasksManager.standardize_model_attributes(model, library_name=library_name)

         return model
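
(The point of this change is to let a caller that has already inferred the library pass it through instead of paying for a second inference. A stripped-down, hypothetical sketch of the optional-argument fallback pattern, with stand-in logic:)

```python
from typing import Optional


class TasksManagerSketch:
    """Hypothetical stand-in illustrating the fallback pattern above."""

    @classmethod
    def infer_library_from_model(cls, model) -> str:
        # placeholder for the real (potentially costly) inference logic
        return "transformers"

    @classmethod
    def standardize_model_attributes(cls, model, library_name: Optional[str] = None) -> str:
        # only infer when the caller did not already know the library
        if library_name is None:
            library_name = cls.infer_library_from_model(model)
        return library_name


# a caller that already knows the library skips re-inference entirely
print(TasksManagerSketch.standardize_model_attributes(object(), library_name="diffusers"))  # diffusers
print(TasksManagerSketch.standardize_model_attributes(object()))  # transformers
```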

optimum/onnxruntime/constants.py

+1
@@ -16,3 +16,4 @@
 DECODER_ONNX_FILE_PATTERN = r"(.*)?decoder((?!(with_past|merged)).)*?\.onnx"
 DECODER_WITH_PAST_ONNX_FILE_PATTERN = r"(.*)?decoder(.*)?with_past(.*)?\.onnx"
 DECODER_MERGED_ONNX_FILE_PATTERN = r"(.*)?decoder(.*)?merged(.*)?\.onnx"
+ONNX_FILE_PATTERN = r".*\.onnx$"
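
(A quick sketch of what the new pattern matches; the candidate file names are illustrative.)

```python
import re

# pattern added in this commit: any file name ending in ".onnx"
ONNX_FILE_PATTERN = r".*\.onnx$"

candidates = ["model.onnx", "decoder_with_past_model.onnx", "config.json", "model.onnx_data"]
matches = [name for name in candidates if re.match(ONNX_FILE_PATTERN, name)]
print(matches)  # ['model.onnx', 'decoder_with_past_model.onnx']
```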

optimum/onnxruntime/model.py

-94
This file was deleted.
