Using ONNX to Make a Model Faster!

So recently I tried out ONNX (Open Neural Network Exchange) and honestly, it was way easier than I thought. ONNX is an open format for models: you export your normal PyTorch or TensorFlow model to it and then run it with ONNX Runtime, which is lighter and often faster for inference. The main reason I even looked into this was simple: I wanted faster inference.

PyTorch is great, no doubt, but when you keep hitting the model again and again (like for embeddings or feature extraction), it can get slow. Running the exported model with ONNX Runtime made those repeated calls quicker for me.

Why I went for ONNX

Speed. That’s the biggest reason.
It makes deployment easier, since the exported model is just a folder with everything inside.
It’s lightweight, so perfect for apps where you want instant responses.

I was working on embedding generation with the all-MiniLM-L6-v2 model, and it was taking around 20 seconds to generate an embedding. I wanted that to be lower, so I searched for ways to optimize it and came across ONNX. After switching, the time dropped by 8 seconds, to about 12 seconds, which was pretty good.
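If you want to check a before/after number like this on your own machine, a small timing helper is enough. This is only a sketch: time_call and fake_embed are names I made up for illustration, and the stand-in workload should be swapped for your real embedding function (once with the PyTorch model, once with the ONNX one).

```python
import statistics
import time


def time_call(fn, *args, repeats=5, warmup=1):
    """Return the median wall-clock seconds of fn(*args) over several runs."""
    # Warm-up runs absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn(*args)
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


if __name__ == "__main__":
    # Hypothetical stand-in workload; replace with your own embedding function.
    def fake_embed(texts):
        return [len(t) for t in texts]

    seconds = time_call(fake_embed, ["AI is changing the world."])
    print(f"median latency: {seconds:.6f} s")
```

Using the median over a few runs keeps one slow outlier (for example the very first call) from skewing the comparison.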

How I exported my model

I used Hugging Face’s optimum.onnxruntime. Trust me, this library makes the whole export process painless. Here’s literally what I did:

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
onnx_path = "onnx_model"

# Export to ONNX
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
model.save_pretrained(onnx_path)

# Save tokenizer too
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(onnx_path)

print("✅ ONNX export complete. Files saved in:", onnx_path)

That’s it. I took the all-MiniLM-L6-v2 model (a sentence transformer I use for embeddings), exported it into ONNX, and saved it in a folder with the tokenizer. Done.
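To see what actually landed in the export folder, you can list its files. This is a minimal sketch using only the standard library; list_export_files is a helper name I made up. I would expect to see an ONNX model file, a config, and the tokenizer files in there, but the exact file names can vary with the optimum version, so treat that as an assumption.

```python
from pathlib import Path


def list_export_files(folder):
    """Return (relative path, size in bytes) for every file under folder."""
    root = Path(folder)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    export_dir = Path("onnx_model")  # same folder name as onnx_path above
    if export_dir.is_dir():
        for name, size in list_export_files(export_dir):
            print(f"{size:>12,d}  {name}")
    else:
        print("Run the export first; no folder named", export_dir)
```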

How I used the ONNX model

After export, I just loaded it back and wrote a small function to get embeddings:

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

onnx_path = r"C:\Users\your_path\onnx_model"

model = ORTModelForFeatureExtraction.from_pretrained(onnx_path)
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

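def get_embeddings(texts):
    # Tokenize the batch into padded PyTorch tensors
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Mean-pool the token embeddings into one vector per text
    embeddings = model(**inputs).last_hidden_state.mean(dim=1)
    return embeddings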

Now, if I run this:

texts = ["AI is changing the world.", "ONNX speeds up inference."]
embeddings = get_embeddings(texts)
print(embeddings)

I get my embeddings out, but much faster than before.
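One thing to be aware of: the plain mean(dim=1) in get_embeddings also averages over padding tokens when the texts in a batch have different lengths. A common alternative is to weight the average by the attention mask. The sketch below shows that idea with NumPy on a toy batch; masked_mean_pool is a name I made up, and this is not what the code above does, just a variant you might want to try.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]
    summed = (np.asarray(token_embeddings, dtype=np.float32) * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


if __name__ == "__main__":
    # Toy batch: two sequences, three token slots, hidden size 2.
    tokens = np.array(
        [
            [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],  # last slot is padding
            [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
        ]
    )
    mask = np.array([[1, 1, 0], [1, 1, 1]])
    print(masked_mean_pool(tokens, mask))
```

To use it with the model, you would pass the last_hidden_state and inputs["attention_mask"] converted to NumPy arrays (on CPU, something like .detach().numpy() on each tensor should do it).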

Conclusion

Honestly, this was super easy. I didn’t have to go through any painful setup or configs. Just a few lines of code and my model was running in ONNX. The speed improvement is the main win here, and I can definitely see myself using this in projects where quick responses matter (like search, chatbots, or semantic similarity stuff).

✅ In short, what I did:

Took a Hugging Face model (all-MiniLM-L6-v2).

Exported it to ONNX with optimum.onnxruntime.

Saved the model + tokenizer.

Loaded it back and used it for embeddings.

That’s it. Nothing complicated, but super useful.

Thank you!!
