How to Deploy a Computer Vision Model to Production
Most computer vision tutorials end at model training. You get 94% validation accuracy, save your model weights, and the tutorial says "congratulations."
What it doesn't tell you is that the trained model is maybe 20% of the work. The other 80% — the API layer, the preprocessing pipeline, the monitoring, the deployment infrastructure — is what separates a working notebook from a production system that handles real traffic.
I've deployed computer vision systems for 80+ clients over six years, across healthcare, sports tech, manufacturing, and hospitality. This guide covers everything I now do before any CV model goes live — from the first API call to production monitoring.
The Gap Between Notebook and Production
Here's the production reality that tutorials skip:
- Your data pipeline: In a notebook, you load a clean dataset. In production, images arrive from phone cameras, CCTV streams, industrial sensors — all different resolutions, orientations, and quality levels.
- Model versioning: in a notebook, it's model_final_v3.pkl. In production, you need to know which version is live, what it was trained on, and how to roll back in under 5 minutes.
- The API layer: a notebook cell is not a web service. You need async request handling, proper error responses, and latency under your SLA.
- Monitoring: in a notebook, you run eval once. In production, your model can silently degrade over weeks as the input distribution shifts.
Let's solve all of these, one layer at a time.
Step 1: Export Your Model to ONNX
The first thing I do after training is export to ONNX. ONNX (Open Neural Network Exchange) is a framework-agnostic model format that runs on ONNX Runtime, which typically gives 2–3x faster CPU inference than native PyTorch serving. For most production CV workloads, that means no GPU is required.
Why ONNX?
- Framework independence: train in PyTorch, serve anywhere
- 2–3x faster inference on CPU vs native PyTorch serving
- Smaller deployment footprint — no PyTorch dependency in production
- Supports quantization to INT8 for further speed/cost gains
Export from PyTorch in three lines:
import torch
dummy_input = torch.randn(1, 3, 640, 640)  # match your model's input shape
torch.onnx.export(model, dummy_input, 'model.onnx', opset_version=11)
Verify the export ran cleanly before moving on: a corrupted ONNX file will produce confusing inference errors later that are hard to trace back to the export step.
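One way to run that verification is sketched here. It assumes the `onnx` and `onnxruntime` packages are installed, that the model has a single output, and that you kept the PyTorch output for the same dummy input. The helper names `outputs_match` and `verify_export` and the tolerances are illustrative choices, not part of either library.

```python
import numpy as np


def outputs_match(reference, candidate, rtol=1e-3, atol=1e-5):
    """Return True when two output arrays agree in shape and, approximately, in value."""
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)
    return reference.shape == candidate.shape and bool(
        np.allclose(reference, candidate, rtol=rtol, atol=atol)
    )


def verify_export(onnx_path, dummy_input, torch_output):
    """Check the ONNX graph, then compare ONNX Runtime's output with PyTorch's.

    dummy_input: float32 NumPy array with the shape used at export time.
    torch_output: the PyTorch model's output for that same input, as a NumPy array.
    """
    # Imported here so outputs_match stays usable without the ONNX packages.
    import onnx
    import onnxruntime as ort

    onnx.checker.check_model(onnx.load(onnx_path))  # raises if the graph is malformed
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    onnx_output = session.run(None, {input_name: dummy_input})[0]
    return outputs_match(torch_output, onnx_output)
```

With the export snippet above, a call might look like `verify_export('model.onnx', dummy_input.numpy(), model(dummy_input).detach().numpy())`. A `False` result means the exported graph loads but does not reproduce the PyTorch output within tolerance.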
Step 2: Build the Preprocessing Pipeline
Bad preprocessing causes more production failures than bad models. In six years of deploying CV systems, this is the single most common root cause of 'the model works in testing but fails in production.'
The issue: your test set was clean, consistent, and controlled. Production images come from phone cameras with different colour profiles, CCTV streams with compression artifacts, and user uploads in any format and orientation. Your preprocessing pipeline needs to handle all of these gracefully.
The preprocessing checklist I run on every project:
- Resize to the model's exact input shape — never let the model figure it out
- Normalize to match training — if you trained with [0,1] normalization, use it in production exactly
- Fix BGR → RGB if using OpenCV — OpenCV loads BGR by default, most models expect RGB
- Handle grayscale vs colour consistently — explicitly convert if your model expects 3 channels
- Test on 50 real production images before shipping — not your test set
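The first four checklist items can be sketched as a single function. This is a dependency-light illustration using NumPy only: it assumes a 640x640 input, [0, 1] normalization, and an image already decoded the way OpenCV would return it (uint8, BGR or grayscale). The nearest-neighbour resize is a stand-in for whatever interpolation you used in training; the name `preprocess` and the constant `INPUT_SIZE` are hypothetical.

```python
import numpy as np

INPUT_SIZE = (640, 640)  # (height, width); assumed to match the exported model


def preprocess(image, size=INPUT_SIZE):
    """Turn a decoded uint8 image (BGR or grayscale) into a (1, 3, H, W) float32 tensor in [0, 1]."""
    if image.dtype != np.uint8:
        raise ValueError(f"expected uint8 pixels, got {image.dtype}")
    if image.ndim == 2:
        # Grayscale: replicate the single channel so the model sees 3 channels.
        rgb = np.stack([image] * 3, axis=-1)
    elif image.ndim == 3 and image.shape[2] == 3:
        rgb = image[:, :, ::-1]  # OpenCV-style BGR -> RGB
    else:
        raise ValueError(f"unsupported image shape {image.shape}")

    # Nearest-neighbour resize to the exact model input shape.
    height, width = rgb.shape[:2]
    rows = np.arange(size[0]) * height // size[0]
    cols = np.arange(size[1]) * width // size[1]
    resized = rgb[rows[:, None], cols]

    # HWC uint8 -> CHW float32 in [0, 1], plus a leading batch dimension.
    chw = resized.astype(np.float32).transpose(2, 0, 1) / 255.0
    return np.ascontiguousarray(chw[None])
```

Rejecting unexpected dtypes and shapes up front turns a silent accuracy problem into an explicit preprocessing error, which is easier to count and alert on.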
Step 3: Build the FastAPI Serving Layer
Your model needs to serve predictions to real users, in real time, reliably. FastAPI is my default choice for CV APIs because it's async by default, handles concurrent requests without blocking, and has automatic request validation.
Rule 1: Load the model once at startup — never per request.
Loading a model from disk takes 1–5 seconds depending on size. If you load it on every request, your API becomes unusable under any real load. Load it once at server startup and cache it in application state.
from fastapi import FastAPI
from onnxruntime import InferenceSession

app = FastAPI()

# Correct pattern: load once at startup
# (newer FastAPI versions prefer a lifespan handler over on_event)
@app.on_event('startup')
async def load_model():
    app.state.session = InferenceSession('model.onnx')

# Wrong pattern: kills latency under load
@app.post('/predict')
def predict():
    session = InferenceSession('model.onnx')  # loaded on every call

Rule 2: Return structured JSON with metadata, not raw arrays.
Clients don't want numpy arrays. They want actionable data. Always include model_version and latency_ms in your response — your future self will thank you when debugging at 2AM.
{
"label": "defect_detected",
"confidence": 0.94,
"latency_ms": 47,
"model_version": "v2.1",
"timestamp": "2025-01-15T09:23:11Z"
}
Rule 3: Handle errors explicitly and never let raw exceptions reach the client.
Unhandled exceptions leak implementation details, confuse clients, and make debugging harder. Wrap your inference call in explicit error handling with meaningful error codes.
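Rules 2 and 3 can be combined in one framework-independent wrapper, sketched below. It assumes an `infer` callable that returns a `(label, confidence)` pair. The `PreprocessingError` class, the error codes, and the function name `predict_safely` are hypothetical names introduced for illustration.

```python
import time
from datetime import datetime, timezone

MODEL_VERSION = "v2.1"  # assumed to come from your deployment config


class PreprocessingError(Exception):
    """Hypothetical exception raised when an input image cannot be decoded or resized."""


def predict_safely(infer, image_bytes):
    """Run inference and return (http_status, json_body) without leaking internals."""
    started = time.perf_counter()
    try:
        label, confidence = infer(image_bytes)
    except PreprocessingError as exc:
        # The client sent something unusable; the message is safe to show.
        status, body = 422, {"error_code": "INVALID_IMAGE", "detail": str(exc)}
    except Exception:
        # Anything else is our fault; log it server-side, return a generic message.
        status, body = 500, {"error_code": "INFERENCE_FAILED", "detail": "Inference failed."}
    else:
        status, body = 200, {"label": label, "confidence": round(float(confidence), 4)}

    # Metadata goes on every response, success or failure.
    body["latency_ms"] = round((time.perf_counter() - started) * 1000)
    body["model_version"] = MODEL_VERSION
    body["timestamp"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return status, body
```

In a FastAPI route, the returned pair could be passed to `JSONResponse(status_code=status, content=body)`. Keeping the wrapper free of framework imports makes it easy to unit test.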
Step 4: Containerise the Right Way
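Dockerizing a CV model is not the same as Dockerizing a regular web app. A standard python:3.11-slim Dockerfile often will not work out of the box for AI inference: OpenCV, for example, needs system libraries that the slim image leaves out, and GPU inference needs a CUDA-enabled base.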
What's different for AI containers:
- Base image: use a base that supports your inference runtime (ONNX Runtime, CUDA if GPU)
- Model weights: bake them into the image at build time — never download at container startup
- Memory limits: CV models typically use 800MB–2GB; without explicit limits, a container can exhaust host memory and get co-located services killed without warning
- Health checks: test actual inference with a dummy input — HTTP 200 doesn't mean the model is working
- Layer ordering: dependencies first, code last — saves 10+ minutes per rebuild on large images
The model weights point is the one most engineers get wrong first. If you download weights on container startup and you're running 10 instances, that's 10 simultaneous downloads, 10 cold start delays, and 10 race conditions. Build the weights into the image. Ship once.
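The health-check item above can be sketched as a function that pushes a dummy tensor through the model and inspects the result. It assumes a `run_model` callable wrapping your inference session and a 1x3x640x640 input; the name `check_inference` is hypothetical.

```python
import numpy as np


def check_inference(run_model, input_shape=(1, 3, 640, 640)):
    """Return (healthy, reason) after running one dummy input through the model."""
    dummy = np.zeros(input_shape, dtype=np.float32)
    try:
        outputs = run_model(dummy)
    except Exception as exc:
        return False, f"inference raised {type(exc).__name__}"

    first = np.asarray(outputs[0])
    if first.size == 0 or not np.all(np.isfinite(first)):
        return False, "model returned empty or non-finite output"
    return True, "ok"
```

With ONNX Runtime, `run_model` could be `lambda x: session.run(None, {input_name: x})`. A `/health` route would then return 503 whenever the first element of the result is `False`, so the orchestrator restarts containers whose model is loaded but not actually producing output.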
Step 5: Model Versioning with MLflow
If you can't roll back your model in under 5 minutes, you don't have model versioning — you have optimism.
I use MLflow on every production AI project. Every model version is logged with:
- Model weights in ONNX format
- Training data hash, to record exactly which dataset each version was trained on and to flag unintended changes between versions
- Hyperparameters and training config
- Evaluation metrics on a held-out test set
- Deployment environment and infrastructure config
Promotion workflow: train → auto-log to MLflow → tag as 'staging' → run canary test (5% production traffic) → promote to 'production' or roll back in 2 clicks. This whole flow takes under an hour once set up.
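A sketch of the logging step, assuming the training data lives in one directory and MLflow tracking is already configured. The helper names `dataset_fingerprint` and `log_model_version` and the tag name `training_data_sha256` are illustrative; the MLflow calls used (`start_run`, `log_params`, `log_metrics`, `set_tag`, `log_artifact`) are part of its tracking API.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(data_dir):
    """SHA-256 over every file's relative path and contents, in sorted order."""
    root = Path(data_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large images do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def log_model_version(onnx_path, data_dir, params, metrics):
    """Log one candidate model version to the configured MLflow tracking server."""
    import mlflow  # imported lazily so the fingerprint helper works without MLflow

    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        mlflow.set_tag("training_data_sha256", dataset_fingerprint(data_dir))
        mlflow.log_artifact(onnx_path)
```

Two runs with the same fingerprint were trained on byte-identical data, so a metric change between them has to come from code or hyperparameters.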
Step 6: Production Monitoring
Your model can fail in production without a single error appearing in your logs. This is called model drift — the input distribution shifts, the model's accuracy degrades, and you find out from a client, not your monitoring system.
What I monitor on every deployed CV model:
- Confidence score distribution — tracked weekly, alerted if mean drops >5% from baseline
- Latency at P95, not average — average hides the worst user experiences
- Input shape distribution — catches upstream pipeline changes before they cause inference errors
- Error rate by error type — preprocessing failures vs inference failures vs network issues
My monitoring stack for CV APIs: Prometheus for metrics collection, Grafana for dashboards and alerting, Sentry for error tracking with full request context.
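The two headline checks, P95 latency and confidence drift, reduce to a few lines of arithmetic. The sketch below is plain Python rather than Prometheus configuration, and it interprets "drops >5% from baseline" as a relative drop in the mean; both function names are hypothetical.

```python
from statistics import mean


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def confidence_drifted(baseline_mean, recent_confidences, max_relative_drop=0.05):
    """True when the recent mean confidence fell more than the allowed fraction below baseline."""
    drop = (baseline_mean - mean(recent_confidences)) / baseline_mean
    return drop > max_relative_drop


# Ninety fast requests and ten slow ones: the average looks tolerable, the tail does not.
latencies = [40] * 90 + [900] * 10
average_ms = mean(latencies)  # 126
tail_ms = p95(latencies)      # 900
```

In the stack described above, these values would usually be computed by Prometheus from exported histograms; the point of the sketch is only to make the thresholds concrete.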
Step 7: Optimising for Cost
1. Model quantization (INT8)
PyTorch → ONNX → ONNX Runtime quantization converts FP32 weights to INT8. Result: 4x smaller model, 2–4x faster inference, ~1% accuracy loss on most CV tasks. On one recent project, this alone reduced inference time from 340ms to 91ms on CPU.
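To see where the 4x size reduction and the small accuracy loss come from, here is a NumPy sketch of symmetric per-tensor INT8 quantization. It illustrates the idea only and is not ONNX Runtime's implementation; in practice the `onnxruntime.quantization` module (`quantize_dynamic`, `quantize_static`) does this on the exported model. The function names and the random weight matrix are made up for the example.

```python
import numpy as np


def quantize_int8(weights):
    """Map float32 weights to int8 with one shared scale (symmetric, per tensor)."""
    largest = float(np.max(np.abs(weights)))
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float32 weights from int8 values."""
    return quantized.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in for one layer

q, scale = quantize_int8(weights)
size_ratio = weights.nbytes / q.nbytes  # 4 bytes per weight -> 1 byte per weight
max_error = float(np.max(np.abs(weights - dequantize(q, scale))))  # bounded by scale / 2
```

Each weight moves by at most half a quantization step, which is why accuracy usually drops only slightly. Measure it on your own validation set before shipping a quantized model.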
2. Request batching
Processing images one by one pays the fixed per-call overhead of every inference run and leaves the hardware's parallelism underused. Group 8–10 requests into a single batch inference call (the model must be exported with a dynamic batch dimension): same results at a fraction of the compute cost.
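A minimal sketch of the batching step, assuming each request has already been preprocessed into a `(1, C, H, W)` tensor and that the model accepts a variable batch size (for a PyTorch export, that means passing `dynamic_axes` to `torch.onnx.export`). The name `run_batched` and the `run_model` callable are hypothetical.

```python
import numpy as np


def run_batched(run_model, tensors, max_batch=8):
    """Run inference in chunks of up to max_batch and return one result per input."""
    results = []
    for start in range(0, len(tensors), max_batch):
        chunk = tensors[start:start + max_batch]
        batch = np.concatenate(chunk, axis=0)  # (len(chunk), C, H, W)
        outputs = run_model(batch)             # one model call for the whole chunk
        results.extend(outputs[i] for i in range(len(chunk)))
    return results
```

In a live API the harder part is collecting requests: a common approach is a short queue that flushes when it reaches `max_batch` items or after a few milliseconds, whichever comes first, so that batching never adds unbounded latency.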
3. Redis caching for duplicate inputs
A significant percentage of requests in most production CV systems are near-duplicate inputs. Add a cache at the API gateway layer. Duplicate inputs return cached results instantly — no inference, no cost.
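The caching idea can be sketched with a content hash as the key. The class below uses an in-memory dict as a stand-in for Redis so that it runs anywhere; with a real Redis client the dict lookups would become GET and SET calls with an expiry. Note that a SHA-256 key only matches byte-identical images; catching near-duplicates would need a perceptual hash, which is not shown. The class and key format are hypothetical.

```python
import hashlib


class PredictionCache:
    """In-memory stand-in for Redis, keyed by model version plus image hash."""

    def __init__(self, model_version):
        self.model_version = model_version
        self._store = {}

    def _key(self, image_bytes):
        digest = hashlib.sha256(image_bytes).hexdigest()
        # The version is part of the key so a new model never serves old predictions.
        return f"pred:{self.model_version}:{digest}"

    def get_or_compute(self, image_bytes, infer):
        key = self._key(image_bytes)
        if key not in self._store:
            self._store[key] = infer(image_bytes)  # cache miss: run the model once
        return self._store[key]
```

Including the model version in the key means a rollback or promotion invalidates stale entries automatically, without a separate cache flush.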
4. Spot instances for non-critical workloads
On-demand EC2 instances are the most expensive way to run inference. Spot Instances are often 70–90% cheaper for identical hardware, but AWS can reclaim them at short notice, so keep them for workloads that tolerate interruption. On one project, switching to Spot Instances cut the monthly inference bill by 60%.
The Full Production Checklist
- Model exported to ONNX and verified
- Preprocessing tested on 50 real production images from the actual source
- FastAPI API with startup model loading, structured JSON responses, explicit error handling
- Docker container with baked weights, memory limits, and inference health check
- MLflow versioning with rollback tested
- Prometheus + Grafana monitoring with confidence drift alerts
- Sentry error tracking with full request context
- Load test at 2x expected peak traffic before go-live
Final Thoughts
The model is 20% of a production CV system. The infrastructure around it is 80%.
Every item in this guide exists because I've seen what happens when it's missing — a client call at 2AM about an API that's 'broken' (it loaded the model on every request), a model that worked perfectly in testing and failed in production (wrong normalization for the production camera), a system that ran out of memory and took down other services (no memory limits on the Docker container).
Ship the infrastructure first. The model will follow.
Frequently Asked Questions
What is the best format for deploying a computer vision model to production?
ONNX (Open Neural Network Exchange) with ONNX Runtime. It delivers 2–3x faster CPU inference than native PyTorch, removes the PyTorch dependency from your production environment, and supports INT8 quantization for further speed and cost gains.
How do you prevent preprocessing errors in a production CV system?
Test your preprocessing pipeline on 50 real production images — not your clean test dataset. Key checks: resize to the exact model input shape, normalize using the same values as training, fix BGR→RGB if using OpenCV, and handle grayscale vs colour explicitly.
What should I monitor in a production computer vision API?
Track inference latency at P95 (not just the average), error rate per endpoint, model confidence score distribution, and preprocessing failure rate. Alert when the mean confidence drops more than 5% from your deployment baseline; this is often the first signal of model degradation.