Tags: Computer Vision, FastAPI, ONNX, Docker, MLOps, Production AI

How to Deploy a Computer Vision Model to Production

Qalab Hassnain Agha · January 15, 2025 · 15 min read

Most computer vision tutorials end at model training. You get 94% validation accuracy, save your model weights, and the tutorial says "congratulations."

What it doesn't tell you is that the trained model is maybe 20% of the work. The other 80% — the API layer, the preprocessing pipeline, the monitoring, the deployment infrastructure — is what separates a working notebook from a production system that handles real traffic.

I've deployed computer vision systems for 80+ clients over six years, across healthcare, sports tech, manufacturing, and hospitality. This guide covers everything I now do before any CV model goes live — from the first API call to production monitoring.

The Gap Between Notebook and Production

Here's the production reality that tutorials skip:

Your data pipeline: in a notebook, you load a clean dataset. In production, images arrive from phone cameras, CCTV streams, and industrial sensors, all with different resolutions, orientations, and quality levels.
Model versioning: in a notebook, the model is model_final_v3.pkl. In production, you need to know which version is live and what it was trained on, and you need to be able to roll back in under 5 minutes.
The API layer: a notebook cell is not a web service. You need async request handling, proper error responses, and latency within your SLA.
Monitoring: in a notebook, you run eval once. In production, your model can silently degrade over weeks as the input distribution shifts.

Let's solve all of these, one layer at a time.

Step 1: Export Your Model to ONNX

The first thing I do after training is export to ONNX format. ONNX (Open Neural Network Exchange) is a framework-agnostic model format that runs on ONNX Runtime, which typically gives 2–3x faster CPU inference than native PyTorch serving. For most production CV workloads, that means no GPU is required.

Why ONNX?

Framework independence: train in PyTorch, serve anywhere
2–3x faster inference on CPU vs native PyTorch serving
Smaller deployment footprint — no PyTorch dependency in production
Supports quantization to INT8 for further speed/cost gains

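Export from PyTorch in a few lines:

import torch

model.eval()  # `model` is your trained torch.nn.Module; eval mode fixes dropout and batch-norm behaviour
dummy_input = torch.randn(1, 3, 640, 640)  # match your model's input shape
torch.onnx.export(model, dummy_input, 'model.onnx', opset_version=11)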

Verify the export ran cleanly before moving on — a corrupted ONNX file will produce confusing inference errors later that are hard to trace back to the export step.
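One way to run that check, sketched under the assumption that onnx, onnxruntime, and NumPy are installed next to PyTorch and that the model returns a single tensor: validate the graph structure, then compare ONNX Runtime's output with PyTorch's on the same input. The helper names and the tolerance are illustrative, not part of any library.

```python
import numpy as np


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    return float(np.max(np.abs(reference - candidate)))


def verify_export(model, dummy_input, onnx_path="model.onnx", atol=1e-4):
    """Raise if the ONNX file is malformed or disagrees with PyTorch."""
    # Imported lazily so the comparison helper stays usable without the ML stack.
    import onnx
    import onnxruntime as ort
    import torch

    # Structural validation of the exported graph.
    onnx.checker.check_model(onnx.load(onnx_path))

    # Reference output from the original PyTorch model.
    model.eval()
    with torch.no_grad():
        expected = model(dummy_input).numpy()

    # Output from ONNX Runtime on the same input.
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    actual = session.run(None, {input_name: dummy_input.numpy()})[0]

    diff = max_abs_diff(expected, actual)
    if diff > atol:
        raise AssertionError(f"ONNX output differs from PyTorch by {diff:.2e}")
    return diff
```

Calling verify_export(model, dummy_input) right after the export keeps any mismatch attached to the export step instead of surfacing later as an inference bug.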

Step 2: Build the Preprocessing Pipeline

Bad preprocessing causes more production failures than bad models. In six years of deploying CV systems, this is the single most common root cause of 'the model works in testing but fails in production.'

The issue: your test set was clean, consistent, and controlled. Production images come from phone cameras with different colour profiles, CCTV streams with compression artifacts, and user uploads in any format and orientation. Your preprocessing pipeline needs to handle all of these gracefully.

The preprocessing checklist I run on every project:

Resize to the model's exact input shape — never let the model figure it out
Normalize to match training — if you trained with [0,1] normalization, use it in production exactly
Fix BGR → RGB if using OpenCV — OpenCV loads BGR by default, most models expect RGB
Handle grayscale vs colour consistently — explicitly convert if your model expects 3 channels
Test on 50 real production images before shipping — not your test set
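The first four checklist items can be sketched as one function. This is a minimal version assuming NumPy only, a 640x640 input, and [0, 1] normalization; the nearest-neighbour resize is a dependency-free stand-in for cv2.resize, and the function names are hypothetical.

```python
import numpy as np

INPUT_SIZE = (640, 640)  # (height, width); assumed, use your model's real input shape


def resize_nearest(image, size):
    """Nearest-neighbour resize with plain NumPy (stand-in for cv2.resize)."""
    out_h, out_w = size
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols]


def preprocess(image, bgr=True, size=INPUT_SIZE):
    """Turn an HxW or HxWx3 uint8 image into a 1x3xHxW float32 tensor in [0, 1]."""
    image = np.asarray(image)
    if image.ndim == 2:
        # Grayscale: replicate the single channel three times.
        image = np.stack([image] * 3, axis=-1)
    elif image.ndim != 3 or image.shape[2] != 3:
        raise ValueError(f"unsupported image shape: {image.shape}")
    elif bgr:
        # OpenCV loads BGR; reverse the channel axis to get RGB.
        image = image[:, :, ::-1]

    image = resize_nearest(image, size)
    tensor = image.astype(np.float32) / 255.0   # same [0, 1] scaling as training
    tensor = np.transpose(tensor, (2, 0, 1))    # HWC -> CHW
    return np.ascontiguousarray(tensor[None, ...])  # add the batch dimension
```

Rejecting unexpected shapes with an explicit error, rather than guessing, makes upstream camera or format changes visible immediately.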

Step 3: Build the FastAPI Serving Layer

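Your model needs to serve predictions to real users, in real time, reliably. FastAPI is my default choice for CV APIs because it's built for async request handling, serves concurrent requests well, and validates requests automatically. One caveat: inference is CPU-bound, so run it in a plain def endpoint (which FastAPI executes in a thread pool) or in a worker thread, not directly inside an async def handler where it would block the event loop.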

Rule 1: Load the model once at startup — never per request.

Loading a model from disk takes 1–5 seconds depending on size. If you load it on every request, your API becomes unusable under any real load. Load it once at server startup and cache it in application state.

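from fastapi import FastAPI
from onnxruntime import InferenceSession

app = FastAPI()

# Correct pattern: load once at startup and keep the session in application state.
# (Newer FastAPI versions prefer a lifespan handler; on_event still works but is deprecated.)
@app.on_event('startup')
async def load_model():
    app.state.session = InferenceSession('model.onnx')

# Wrong pattern: kills latency under load
@app.post('/predict')
def predict():
    session = InferenceSession('model.onnx')  # reloaded from disk on every call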

Rule 2: Return structured JSON with metadata — not raw arrays.

Clients don't want NumPy arrays. They want actionable data. Always include model_version and latency_ms in your response. Your future self will thank you when debugging at 2 AM.

{
  "label": "defect_detected",
  "confidence": 0.94,
  "latency_ms": 47,
  "model_version": "v2.1",
  "timestamp": "2025-01-15T09:23:11Z"
}
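A response like the one above can be assembled with a small wrapper around the inference call. The sketch below uses only the standard library; the `infer` callable, the `Prediction` class, and the hard-coded version string are assumptions for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

MODEL_VERSION = "v2.1"  # assumed to come from config or the model registry


@dataclass
class Prediction:
    label: str
    confidence: float
    latency_ms: int
    model_version: str
    timestamp: str


def timed_prediction(infer, image):
    """Run `infer(image) -> (label, confidence)` and attach response metadata."""
    start = time.perf_counter()
    label, confidence = infer(image)
    latency_ms = round((time.perf_counter() - start) * 1000)
    return Prediction(
        label=label,
        confidence=round(float(confidence), 4),
        latency_ms=latency_ms,
        model_version=MODEL_VERSION,
        timestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    )


def to_json(prediction):
    """Serialize a Prediction to the JSON body returned to the client."""
    return json.dumps(asdict(prediction))
```

Measuring latency around the inference call itself, not the whole request, keeps the number comparable across deployments.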

Rule 3: Handle errors explicitly — never let raw exceptions reach the client.

Unhandled exceptions leak implementation details, confuse clients, and make debugging harder. Wrap your inference call in explicit error handling with meaningful error codes.
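As a framework-independent sketch of that rule: the function below maps each failure class to a status code and a stable error code, and keeps the traceback in the server log only. The error codes, the `InvalidImageError` class, and the `preprocess` and `infer` callables are hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("cv_api")


class InvalidImageError(ValueError):
    """Raised by preprocessing when the upload cannot be decoded or has a bad shape."""


def safe_predict(raw_bytes, preprocess, infer):
    """Return (http_status, body); no raw exception ever reaches the caller."""
    if not raw_bytes:
        return 400, {"error": "EMPTY_BODY", "detail": "No image data received."}
    try:
        tensor = preprocess(raw_bytes)
        result = infer(tensor)
    except InvalidImageError as exc:
        # Client error: safe to describe, because we wrote the message ourselves.
        return 422, {"error": "INVALID_IMAGE", "detail": str(exc)}
    except Exception:
        # Server error: log the full traceback, return a generic message.
        logger.exception("prediction failed")
        return 500, {"error": "INTERNAL_ERROR", "detail": "Prediction failed."}
    return 200, result
```

In a FastAPI endpoint, the returned pair would feed a JSONResponse with the matching status code.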

Step 4: Containerise the Right Way

Dockerizing a CV model is not the same as Dockerizing a regular web app. A standard python:3.11-slim Dockerfile often will not work out of the box for AI inference: OpenCV, for example, needs system libraries that slim images leave out.

What's different for AI containers:

Base image: use a base that supports your inference runtime (ONNX Runtime, CUDA if GPU)
Model weights: bake them into the image at build time — never download at container startup
Memory limits: CV models commonly use 800 MB–2 GB of RAM; without explicit limits, one container can exhaust host memory and take co-located services down with it
Health checks: test actual inference with a dummy input — HTTP 200 doesn't mean the model is working
Layer ordering: dependencies first, code last — saves 10+ minutes per rebuild on large images

The model weights point is the one most engineers get wrong first. If you download weights on container startup and you're running 10 instances, that's 10 simultaneous downloads, 10 cold start delays, and 10 race conditions. Build the weights into the image. Ship once.
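The health-check item from the list above can be sketched as a function that runs one real inference on a dummy input. It assumes `infer` returns a flat iterable of float scores; the name `check_inference` and the latency threshold are illustrative.

```python
import math
import time


def check_inference(infer, dummy_input, max_latency_s=1.0):
    """Return (healthy, reason) after one real inference on a dummy input."""
    start = time.perf_counter()
    try:
        scores = list(infer(dummy_input))
    except Exception as exc:
        return False, f"inference raised {type(exc).__name__}"
    elapsed = time.perf_counter() - start

    if not scores:
        return False, "empty output"
    if not all(math.isfinite(score) for score in scores):
        return False, "non-finite scores"
    if elapsed > max_latency_s:
        return False, f"too slow: {elapsed:.2f}s"
    return True, "ok"
```

A /health endpoint can return 503 whenever this reports unhealthy, so the container's health check fails when the model does, not only when the web server does.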

Step 5: Model Versioning with MLflow

If you can't roll back your model in under 5 minutes, you don't have model versioning — you have optimism.

I use MLflow on every production AI project. Every model version is logged with:

Model weights in ONNX format
Training data hash, to tie each version to the exact data it was trained on and to flag unintended dataset changes between versions
Hyperparameters and training config
Evaluation metrics on a held-out test set
Deployment environment and infrastructure config
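For the training data hash, one possible approach is a single SHA-256 fingerprint over every file in the dataset directory. This is a sketch, not a built-in MLflow feature; the function name and the directory layout are assumptions.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root):
    """SHA-256 over the relative path and contents of every file under `root`."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sorted order makes the fingerprint independent of filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large images never sit fully in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string can then be logged with the run, for example through mlflow.log_param, so two versions with different fingerprints are known to have seen different data.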

Promotion workflow: train → auto-log to MLflow → tag as 'staging' → run canary test (5% of production traffic) → promote to 'production' or roll back in two clicks. The whole flow takes under an hour once it is set up.
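The canary split happens in the serving layer, not in MLflow itself. A minimal sketch, assuming each request carries a string ID and that the two model sessions are looked up by the returned stage name (both assumptions for the example):

```python
import hashlib


def route_model(request_id, canary_percent=5):
    """Deterministically send about `canary_percent`% of request IDs to staging."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "staging" if bucket < canary_percent else "production"
```

Hashing the ID instead of drawing a random number means the same request or client always hits the same model, which makes canary results easier to compare and debug. Setting canary_percent to 0 is an immediate rollback.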

Step 6: Production Monitoring

Your model can fail in production without a single error appearing in your logs. This is called model drift — the input distribution shifts, the model's accuracy degrades, and you find out from a client, not your monitoring system.

What I monitor on every deployed CV model:

Confidence score distribution — tracked weekly, alerted if mean drops >5% from baseline
Latency at P95, not average — average hides the worst user experiences
Input shape distribution — catches upstream pipeline changes before they cause inference errors
Error rate by error type — preprocessing failures vs inference failures vs network issues
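The first two checks can be sketched with the standard library. Two assumptions are baked in: the 5% threshold is read as a relative drop in mean confidence, and P95 uses the nearest-rank method. In a real setup these values would come from the metrics backend, not in-memory lists.

```python
import statistics


def p95(values):
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def confidence_drift(baseline_mean, recent_scores, max_drop=0.05):
    """Return (alert, relative_drop) for recent mean confidence vs the baseline."""
    recent_mean = statistics.fmean(recent_scores)
    drop = (baseline_mean - recent_mean) / baseline_mean
    return drop > max_drop, drop
```

For example, with 90 requests at 40 ms and 10 at 400 ms the average is 76 ms while p95 returns 400, which is the gap the P95 rule is meant to expose.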

My monitoring stack for CV APIs: Prometheus for metrics collection, Grafana for dashboards and alerting, Sentry for error tracking with full request context.

Step 7: Optimising for Cost

1. Model quantization (INT8)

The PyTorch → ONNX → ONNX Runtime quantization path converts FP32 weights to INT8. Typical result: a roughly 4x smaller model (1 byte per weight instead of 4), 2–4x faster inference, and around 1% accuracy loss on most CV tasks. On one recent project, this alone reduced inference time from 340ms to 91ms on CPU, a 3.7x speedup.
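To make the size figure concrete, here is a toy illustration of symmetric INT8 quantization in pure Python. It shows the idea only: ONNX Runtime's own tooling (the onnxruntime.quantization module) handles zero points, per-channel scales, and calibration, none of which appear here.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = array(
        "b",  # signed 8-bit integers, 1 byte each
        (max(-127, min(127, round(w / scale))) for w in weights),
    )
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate float weights."""
    return [value * scale for value in quantized]
```

Storing the weights as array("f", ...) takes 4 bytes per value against 1 byte for the quantized array, and the round-trip error per weight is bounded by half the scale.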

2. Request batching

Processing images one at a time pays the fixed per-call overhead of an inference run for every single image. Group 8–10 requests into a single batch inference call instead. You get the same results for a fraction of the compute cost, at the price of a short wait while the batch fills.
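A minimal sketch of the grouping step, assuming a hypothetical `infer_batch` callable that accepts a list of inputs and returns one result per input. It also assumes the model was exported with a dynamic batch dimension (for instance through the dynamic_axes argument of torch.onnx.export); a model exported with a fixed batch size of 1 will reject batched input.

```python
def batched(items, batch_size=8):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(images, infer_batch, batch_size=8):
    """Run `infer_batch` once per batch and return results in input order."""
    results = []
    for batch in batched(images, batch_size):
        results.extend(infer_batch(batch))
    return results
```

For a live API the same idea is usually applied as micro-batching: requests are queued for a few milliseconds and flushed when the batch is full or the timer expires.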

3. Redis caching for duplicate inputs

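A significant percentage of requests in most production CV systems are duplicate or near-duplicate inputs. Add a cache at the API gateway layer, keyed on a hash of the input. Byte-identical inputs then return cached results instantly, with no inference and no cost; catching near-duplicates as well requires a similarity-based key such as a perceptual hash.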
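The exact-duplicate case can be sketched as follows. A plain dict stands in for Redis here, and the key format and function name are assumptions; with a Redis client the lookup and store would become get and set calls with an expiry.

```python
import hashlib


def cached_predict(image_bytes, infer, cache, model_version="v2.1"):
    """Return (result, cache_hit), running inference only for unseen inputs."""
    # The model version is part of the key so a new model never serves stale results.
    key = f"pred:{model_version}:{hashlib.sha256(image_bytes).hexdigest()}"
    if key in cache:
        return cache[key], True
    result = infer(image_bytes)
    cache[key] = result
    return result, False
```

Because the key is a hash of the raw bytes, a re-encoded or resized copy of the same picture counts as a new input and goes through inference.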

4. Spot instances for non-critical workloads

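On-demand EC2 instances are among the most expensive ways to run inference. Spot Instances can be 70–90% cheaper for identical hardware, but AWS can reclaim them with two minutes' notice, so keep them for workloads that tolerate interruption. On one project, switching to Spot Instances cut the monthly inference bill by 60%.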
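The total saving depends on how much of the workload can actually move to Spot. A back-of-the-envelope sketch, with purely hypothetical numbers that are not taken from the project mentioned above:

```python
def blended_savings(spot_share, spot_discount):
    """Fraction of the total bill saved when `spot_share` of spend moves to Spot."""
    if not (0.0 <= spot_share <= 1.0 and 0.0 <= spot_discount <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return spot_share * spot_discount


# Hypothetical: 80% of spend moved to Spot at a 75% discount.
print(f"{blended_savings(0.8, 0.75):.0%}")
```

Under those assumed inputs the overall bill drops by about 60%, which shows how a per-instance discount in the 70–90% range can translate into a smaller total saving when latency-critical traffic stays on on-demand capacity.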

The Full Production Checklist

Model exported to ONNX and verified
Preprocessing tested on 50 real production images from the actual source
FastAPI service with startup model loading, structured JSON responses, and explicit error handling
Docker container with baked weights, memory limits, and inference health check
MLflow versioning with rollback tested
Prometheus + Grafana monitoring with confidence drift alerts
Sentry error tracking with full request context
Load test at 2x expected peak traffic before go-live

Final Thoughts

The model is 20% of a production CV system. The infrastructure around it is 80%.

Every item in this guide exists because I've seen what happens when it's missing: a client call at 2 AM about an API that's 'broken' (it loaded the model on every request), a model that worked perfectly in testing and failed in production (wrong normalization for the production camera), and a system that ran out of memory and took down other services (no memory limits on the Docker container).

Ship the infrastructure first. The model will follow.

Frequently Asked Questions

What is the best format for deploying a computer vision model to production?

ONNX (Open Neural Network Exchange) with ONNX Runtime. It delivers 2–3x faster CPU inference than native PyTorch, removes the PyTorch dependency from your production environment, and supports INT8 quantization for further speed and cost gains.

How do you prevent preprocessing errors in a production CV system?

Test your preprocessing pipeline on 50 real production images — not your clean test dataset. Key checks: resize to the exact model input shape, normalize using the same values as training, fix BGR→RGB if using OpenCV, and handle grayscale vs colour explicitly.

What should I monitor in a production computer vision API?

Track inference latency (P95 as well as average), error rate by error type, model confidence score distribution, and preprocessing failure rate. Alert when mean confidence drops more than 5% below your deployment baseline; this is often the first signal of model degradation.