Hello!
I have a fine-tuned LLM from Hugging Face saved in PEFT format, and it is about 2.1 GB on disk. When I convert it to ONNX, its size nearly doubles to about 4.1 GB. What causes this significant increase in model size when converting from PEFT to ONNX? Is there a bug in this conversion? (Here is the code that performs the conversion. Note that loading the model in any of the commented-out configurations kills the accuracy.) Thanks
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Export the fine-tuned checkpoint to ONNX, loading it on the OpenVINO GPU provider in FP16.
model = ORTModelForCausalLM.from_pretrained(
    peft_path,
    provider='OpenVINOExecutionProvider',
    provider_options={'device_type': 'GPU_FP16'},
    export=True,
    # use_cache=False,
    # use_io_binding=False,
    # load_in_4bit=True,
    # load_in_8bit=True,
    # torch_dtype=torch.bfloat16,
    # device_map=device,
    # from_transformers=True,
)
tokenizer = AutoTokenizer.from_pretrained(peft_path)

# Save the exported ONNX model and the tokenizer to the output directory.
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
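For reference, whether the 2.1 GB checkpoint on disk is already float16 can be confirmed by inspecting the dtypes of the saved weights. A minimal sketch, assuming the weights are stored as model.safetensors under peft_path (the filename differs for adapter-only checkpoints):

from safetensors import safe_open

# Assumed filename; adapter-only checkpoints use adapter_model.safetensors instead.
weights_file = f"{peft_path}/model.safetensors"
with safe_open(weights_file, framework="pt", device="cpu") as f:
    dtypes = {str(f.get_tensor(key).dtype) for key in f.keys()}
print(dtypes)  # e.g. {'torch.float16'} means the checkpoint is already half precision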
Which model are you trying to convert? 👀 You should compare the model size with the size of the base model being converted, just in case the PEFT model doesn't cover all parameters. Similarly, if the model is doubling in size, it probably suggests the ONNX model is being saved in float32 precision (vs. a float16 model).
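If it helps, the precision the exporter actually serialized can be checked directly on the ONNX file by looking at the graph initializers. A minimal sketch, assuming the onnx package is installed and the exported file is named model.onnx (the filename can vary between Optimum versions, e.g. decoder_model.onnx):

import onnx
from onnx import TensorProto

# load_external_data=False avoids pulling in the (possibly external) multi-GB weight payload.
m = onnx.load("onnx_dir/model.onnx", load_external_data=False)
init_dtypes = {TensorProto.DataType.Name(init.data_type) for init in m.graph.initializer}
print(init_dtypes)  # {'FLOAT'} means the weights were serialized as float32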
I'm converting a TinyLlama model that was fine-tuned with a LoRA configuration. The PEFT model is approximately 2.1 GB on disk. When I load it for inference with the following code, it works as expected:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(base_path)
tokenizer = AutoTokenizer.from_pretrained(base_path, padding_side='right')
However, after converting this model to ONNX, the size increases to about 4.1 GB. You mentioned that the PEFT model might be saved in float16 precision, whereas the ONNX conversion could default to float32, which would explain the size increase. When I try to save the ONNX model in float16 precision to keep the original size, accuracy drops.
Could you provide insight into why converting the PEFT model to ONNX with float16 precision degrades accuracy? Is there a recommended approach that preserves both the model's size and its accuracy during conversion?
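For what it's worth, the sizes reported are consistent with a ~1.1B-parameter model stored in float16 (~2.2 GB) versus float32 (~4.4 GB). As for the accuracy drop, a blanket float16 cast often fails because a few numerically sensitive ops (softmax, reductions, normalization) overflow or underflow in half precision. One generic mitigation, independent of Optimum, is to convert the exported float32 graph to float16 afterwards while keeping those ops in float32. A rough sketch, assuming the onnxconverter-common package is installed; paths and the op block list are illustrative, not exhaustive:

import onnx
from onnxconverter_common import float16

# Assumed paths/filenames; adjust to the actual export layout.
model_fp32 = onnx.load("onnx_dir/model.onnx")

# Cast weights and activations to float16, but keep numerically sensitive ops in float32.
model_fp16 = float16.convert_float_to_float16(
    model_fp32,
    keep_io_types=True,                              # keep graph inputs/outputs as float32
    op_block_list=["Softmax", "Pow", "ReduceMean"],  # illustrative, not exhaustive
)

# Weights above the 2 GB protobuf limit must be stored as external data.
onnx.save(model_fp16, "onnx_dir/model_fp16.onnx", save_as_external_data=True)

Whether this preserves accuracy depends on which ops end up in the block list; it usually takes some per-model experimentation.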
Expected behavior
I need the exported ONNX model to be no larger than the PEFT checkpoint while not losing accuracy.