-
I put the guidelines in this discussion item.
-
Notice that we already support HF tokenizers natively in https://github.com/pytorch-labs/tokenizers.
-
Thanks for putting this together. From my experience working with product teams, I think this will be a massive shift in usability for common GenAI/LLM use cases. On the documentation front, I'd like to put the HF integration front and center. I'm thinking we can create/update our top-level LLM doc, put a callout in the getting-started section, and otherwise try to funnel people to HF optimum-executorch as much as possible. I'll reach out later this week and we can put together a doc update PR. Thanks!
-
Hey, so I've been working on some similar stuff, mainly the issues with multi-method export for encoder/decoder-type models, where you run into graph breaks (see #8030). I got distracted for most of February, but so far I've been able to export encoder/decoder seq2seq models from Hugging Face with a pared-down generation pipeline. The high-level goal is that for a given model class (seq2seq encoder/decoder), users should just be able to provide the model/tokenizer link and everything will just export. Also, one should minimize the amount of C++ code that duplicates Python model-processing code, i.e. don't write logit processing in C++; the only thing in C++ should be the bare minimum needed to handle the graph breaks from data/shape-dependent control flow. I still need to get some changes merged in on the Hugging Face side:
I have fixes for both of these, but I need to sit down, get feedback from Hugging Face, and figure out how to integrate these changes. Additionally, I honestly need to add more unit tests and validate against more models, since my main development test case has been a BART-variant OPUS translation model. I would appreciate feedback on how best to integrate this between the two repos; I've mostly been working on the ExecuTorch side so far (I wanted to validate that things worked), especially since you seem to be highly familiar with both repos. Ideally
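For context, here is a minimal sketch of the multi-method export pattern described above: the encoder and decoder are bundled as separate methods in one `.pte` while the data-dependent generation loop stays in Python. The tiny model stubs, shapes, and method names are illustrative, not the actual code from #8030:

```python
import torch
from torch.export import export
from executorch.exir import to_edge

# Illustrative stand-ins for the HF encoder and decoder halves of a
# seq2seq model (hypothetical shapes; real code would wrap the
# transformers modules and a KV cache).
class Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(1000, 512)

    def forward(self, input_ids):
        return self.embed(input_ids)

class Decoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(1000, 512)
        self.proj = torch.nn.Linear(512, 1000)

    def forward(self, decoder_input_ids, encoder_hidden_states):
        h = self.embed(decoder_input_ids) + encoder_hidden_states.mean(1, keepdim=True)
        return self.proj(h)  # logits; sampling/stopping stays outside the graph

# Export each half as its own method so the shape-dependent generation
# loop never has to live inside the exported graph.
encoder_ep = export(Encoder(), (torch.zeros(1, 32, dtype=torch.long),))
decoder_ep = export(
    Decoder(),
    (torch.zeros(1, 1, dtype=torch.long), torch.zeros(1, 32, 512)),
)

# Bundle both methods into a single ExecuTorch program.
et_program = to_edge({"encoder": encoder_ep, "decoder": decoder_ep}).to_executorch()
with open("seq2seq.pte", "wb") as f:
    f.write(et_program.buffer)
```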
-
Goals:
UX to Enable:
With native HF models and within HF ecosystem:
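For illustration, the target UX here is roughly a two-liner. This is a sketch assuming optimum-executorch's `ExecuTorchModelForCausalLM` with its `recipe` and `text_generation` API, which may shift as the package evolves:

```python
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "HuggingFaceTB/SmolLM2-135M"  # illustrative HF checkpoint
# One call exports the model and lowers it to a backend (here XNNPACK).
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quick on-host sanity check of the exported program.
print(model.text_generation(tokenizer=tokenizer,
                            prompt="Hello, my name is",
                            max_seq_len=64))
```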
Hugging Face/ExecuTorch-Community
This continues the partnership from 2024. In 2025, we aim to strengthen it with a focus on ExecuTorch's general availability and improving UX in the open-source community.
Hugging Face Transformers
Detailed tasks will be filed and tracked in huggingface/transformers: huggingface/transformers#32253
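For a sense of the shape of that work, the export-friendly entry point in transformers looks roughly like the sketch below. It assumes the `transformers.integrations.executorch` module and a static-cache generation config; exact argument names may vary by version:

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig
from transformers.integrations.executorch import convert_and_export_with_cache

model_id = "HuggingFaceTB/SmolLM2-135M"  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # A static KV cache keeps shapes fixed, which makes the model
    # friendly to torch.export.
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        cache_config={"batch_size": 1, "max_cache_len": 128},
    ),
)

# Returns a torch.export ExportedProgram that ExecuTorch can lower.
exported_program = convert_and_export_with_cache(model)
```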
Hugging Face Optimum-ExecuTorch
Detailed tasks will be filed and tracked in huggingface/optimum-executorch
ExecuTorch Recipe API
ExecuTorch will host a limited number of canonical recipes covering all major backends as the default, out-of-the-box solution for most use cases.
Each recipe will adhere to the following contract:
Details about this section will be posted separately by @tarun292
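Until those details are posted, here is a purely hypothetical sketch of what such a recipe contract could look like; every name below is invented for illustration and is not the actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ExportRecipe:
    """Hypothetical contract for a canonical, out-of-the-box recipe."""
    name: str                     # e.g. "xnnpack-int8" (invented)
    backend: str                  # target backend/partitioner to lower to
    quantize: bool = False        # whether to apply quantization
    passes: list[Callable] = field(default_factory=list)  # extra graph passes

# A canonical default recipe would then be a named, versioned instance:
XNNPACK_FP32 = ExportRecipe(name="xnnpack-fp32", backend="xnnpack")
```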
Documentation/Navigation
In ExecuTorch repo
In HF repo
This section will expand on the work from @GregoryComer in #8178 and #8676 to provide seamless navigation between Hugging Face and ExecuTorch.
Cross-repo Integration
Runtime
Benchmark & Deployment to Mobile