Production AI · DevOps · MLOps · Monitoring · IoT

My Production Deployment Checklist for AI Systems: What I Check Before Every Launch

Qalab Hassnain Agha·May 10, 2025·14 min read

Every item on this checklist exists because I once shipped without it.

Not as a hypothetical. Not as a precaution based on something I read. As a direct result of a production failure, a 3 AM phone call, a client asking why their data disappeared, or a system silently degrading for two weeks before anyone noticed.

I've shipped AI systems for 80+ clients over six years — wearable health monitors, computer vision APIs, real-time communication platforms, clinical rehabilitation tools. The checklist I'm sharing here is what I now run before any of them go live. It's not theoretical. Every section has a story behind it.

Layer 1: Crash Reporting

Tools: Firebase Crashlytics (mobile), Sentry (backend and web)

If your system crashes in production and you don't have crash reporting configured, you are completely blind. You'll hear about it from a user, not your monitoring system.

Crashlytics gives me the exact stack trace, the exact device model, the exact OS version, and the exact app version for every mobile crash. Sentry does the same for backend errors — with the added benefit of full request context attached to every exception.

The rule I follow: no crash reporting, no shipping. This is non-negotiable on every project, regardless of timeline pressure.

What to configure before launch:

Crash reporting initialized and tested with a deliberate crash in staging
Source maps uploaded (for web) or dSYMs uploaded (for iOS) so stack traces are readable
Alert routing configured — crashes go to the right person immediately, not to a shared inbox nobody checks
Crash-free rate baseline established in staging before going to production
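
The crash-free baseline in the last item can be made concrete with a small calculation. This is a minimal sketch, not how Crashlytics computes its own figure: `crash_free_rate` and `regressed` are hypothetical helpers, the session counts are made up, and the 0.5 percentage-point tolerance is an assumed value to tune per project.

```python
def crash_free_rate(total_sessions: int, crashed_sessions: int) -> float:
    """Percentage of sessions that ended without a crash."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    if not 0 <= crashed_sessions <= total_sessions:
        raise ValueError("crashed_sessions must be between 0 and total_sessions")
    return 100.0 * (total_sessions - crashed_sessions) / total_sessions


def regressed(baseline_pct: float, current_pct: float, tolerance_pct: float = 0.5) -> bool:
    """True when the current rate sits below the baseline by more than the tolerance."""
    return (baseline_pct - current_pct) > tolerance_pct


# Baseline measured in staging, then compared with the first production window.
# Both sets of numbers are illustrative.
staging = crash_free_rate(total_sessions=2000, crashed_sessions=4)
production = crash_free_rate(total_sessions=5000, crashed_sessions=60)
print(staging, production, regressed(staging, production))
```

Recording the staging number before launch gives the production figure something to be compared against, so a drop shows up as a regression rather than as an unexplained number.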

Layer 2: User Interaction Tracking

Tools: Firebase Analytics

This is engineering-level tracking, not business analytics. The question I'm answering is: what did the user do in the 60 seconds before something broke?

Firebase Analytics gives me the full user journey, screen by screen and event by event. When a crash occurs, I can reconstruct the exact sequence of interactions that led to it. This turns a 4-hour debugging session into a 20-minute one.

What to track before launch:

Screen views and navigation events
Key user interactions (button taps, form submissions, feature usage)
Custom events for AI-specific actions (inference triggered, result displayed, feedback given)
Error events with context (what the user was doing when the error occurred)
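
The "60 seconds before something broke" question can be sketched as a rolling trail of events that is attached to an error report. This is an illustration of the idea only, not the Firebase Analytics API: `Event`, `Breadcrumbs`, the event names, and the `pose_v2` model name are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class Event:
    name: str
    timestamp: float                       # seconds since epoch
    params: Dict[str, str] = field(default_factory=dict)


class Breadcrumbs:
    """Keeps recent events so an error report can include what led up to it."""

    def __init__(self, window_seconds: float = 60.0, max_events: int = 200) -> None:
        self.window_seconds = window_seconds
        # A bounded deque drops the oldest events once max_events is reached.
        self._events: Deque[Event] = deque(maxlen=max_events)

    def record(self, name: str, timestamp: float, **params: str) -> None:
        self._events.append(Event(name, timestamp, dict(params)))

    def before(self, error_timestamp: float) -> List[Event]:
        """Events inside the window that ends at the error, oldest first."""
        start = error_timestamp - self.window_seconds
        return [e for e in self._events if start <= e.timestamp <= error_timestamp]


trail = Breadcrumbs()
trail.record("screen_view", 1000.0, screen="home")
trail.record("inference_triggered", 1050.0, model="pose_v2")
trail.record("result_displayed", 1052.5, model="pose_v2")

# Only the two AI-specific events fall inside the 60 seconds before the error.
print([e.name for e in trail.before(error_timestamp=1100.0)])
```

Logging AI-specific actions as their own events is what makes the trail useful here: the report shows whether inference ran, and with which model, before the failure.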

Layer 3: UX Feedback Loop

Tools: Instabug

The first version of any product is wrong. That's not pessimism — it's how software development works. The question is how quickly you find out and how actionable the feedback is.

Instabug sits inside the app. Users shake their device to report a bug or send feedback. They get a screenshot they can annotate, and I get that screenshot, their device info, their OS version, their session replay, and the network logs from the last 60 seconds — all attached to a single ticket automatically.

A bug report without context is noise. An Instabug report is a complete picture.

What to configure before launch:

Instabug initialized and shake-to-report enabled
Bug reports routed to ClickUp automatically
In-app feature request flow for non-bug feedback
Response SLA defined — users who report bugs should get a response within 24 hours
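
The 24-hour response SLA is easy to state and easy to forget to measure. A minimal sketch of the check, assuming each report carries a creation time and an optional first-response time (`sla_breached` and the timestamps are hypothetical, and nothing here calls Instabug or ClickUp):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RESPONSE_SLA = timedelta(hours=24)


def sla_breached(reported_at: datetime,
                 first_response_at: Optional[datetime],
                 now: datetime) -> bool:
    """True when a report did not get a response within the SLA window."""
    deadline = reported_at + RESPONSE_SLA
    if first_response_at is not None:
        return first_response_at > deadline
    # Still unanswered: breached only once the deadline has passed.
    return now > deadline


reported = datetime(2025, 5, 1, 9, 0, tzinfo=timezone.utc)
later = reported + timedelta(hours=30)

print(sla_breached(reported, None, now=later))                           # unanswered after 30 hours
print(sla_breached(reported, reported + timedelta(hours=3), now=later))  # answered after 3 hours
```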

Layer 4: Bug Tracking and Triage

Tools: ClickUp

Every crash from Firebase Crashlytics, every bug report from Instabug, every Sentry alert feeds into ClickUp. One place. Full history. Priority tracking. Nothing gets lost in a Slack thread or an email chain.

The integration matters as much as the tool. Automated ticket creation from your monitoring tools means zero bugs fall through the cracks because someone forgot to log it manually.

What to configure before launch:

Automated ticket creation from Sentry, Crashlytics, and Instabug
Severity labels defined (P0 = production down, P1 = major feature broken, P2 = minor issue, P3 = cosmetic)
P0 and P1 alert routing to on-call engineer
Bug triage cadence defined — who reviews the backlog and when
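
The severity labels and on-call routing above can be expressed as a small data model. This is a sketch under stated assumptions: `Severity`, `Ticket`, and `triage_order` are hypothetical names, the ticket titles are invented, and no ClickUp, Sentry, Crashlytics, or Instabug API is called.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Severity(IntEnum):
    P0 = 0  # production down
    P1 = 1  # major feature broken
    P2 = 2  # minor issue
    P3 = 3  # cosmetic


@dataclass(frozen=True)
class Ticket:
    source: str        # "sentry", "crashlytics", or "instabug"
    title: str
    severity: Severity

    @property
    def pages_on_call(self) -> bool:
        # P0 and P1 go to the on-call engineer; the rest wait for triage.
        return self.severity <= Severity.P1


def triage_order(tickets: Iterable[Ticket]) -> List[Ticket]:
    """Most severe first; ties keep their arrival order because sorted() is stable."""
    return sorted(tickets, key=lambda t: t.severity)


inbox = [
    Ticket("instabug", "Typo on settings screen", Severity.P3),
    Ticket("sentry", "500 on /inference", Severity.P0),
    Ticket("crashlytics", "Crash on BLE reconnect", Severity.P1),
]
for ticket in triage_order(inbox):
    print(ticket.severity.name, ticket.source, ticket.pages_on_call)
```

Keeping one severity scale across all three sources means the paging rule is written once, regardless of which tool raised the ticket.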

Layer 5: Infrastructure Monitoring

Tools: Prometheus + Grafana + Sentry

Sentry catches errors. Prometheus collects metrics. Grafana makes it visible and actionable.

The three tools cover different failure modes. Sentry tells you when something has already broken. Prometheus and Grafana tell you when something is about to break: CPU trending up, memory climbing, response latency degrading, error rate increasing. Catching these leading indicators before they become incidents is the difference between proactive and reactive operations.

What to monitor before launch:

API response time (average and P95)
Error rate by endpoint and error type
CPU and memory usage with alerts at 70% and 90%
Database connection pool utilization
Queue depth (for async processing pipelines)
For AI systems: inference latency, model confidence score distribution, preprocessing failure rate
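
In a real deployment these checks would live in Prometheus queries and Grafana alert rules. The standard-library sketch below only illustrates the logic behind three of the items: P95 latency, the 70% and 90% usage thresholds, and one way to summarize a confidence score distribution. The function names and the 0.5 confidence cut-off are assumptions, not taken from any of those tools.

```python
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]


def usage_alert(percent_used: float) -> str:
    """Map CPU or memory usage to an alert level (70% warning, 90% critical)."""
    if percent_used >= 90:
        return "critical"
    if percent_used >= 70:
        return "warning"
    return "ok"


def low_confidence_share(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of predictions whose confidence falls below the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s < threshold) / len(scores)


latencies_ms = list(range(1, 101))           # illustrative: 1 ms to 100 ms
print(p95(latencies_ms))
print(usage_alert(72.0), usage_alert(93.5))
print(low_confidence_share([0.9, 0.4, 0.8, 0.3]))
```

The average hides slow outliers that P95 exposes, and a rising share of low-confidence predictions can signal input drift even while latency and error rate look healthy.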

Layer 6: Device and Hardware Fingerprinting (IoT)

Tools: Custom built

This layer is specific to IoT and wearable projects, and it's the one most teams skip entirely — which is why IoT debugging is so painful for most teams and relatively straightforward for mine.

Every device in the field is registered with its device ID, firmware version, hardware revision, connected phone model, phone OS version and build, app version, last connection timestamp, and connection quality score. All of this is logged against every session and every data transmission event.

When a user reports 'my data is wrong' or 'the connection keeps dropping,' I query their device record, look at their session history, and compare their metrics to the cohort. Is this a firmware version issue? A specific Android build incompatibility? A hardware revision problem?
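
The cohort comparison described above can be sketched as a group-by over session records. This is a minimal illustration, not the custom-built system itself: `SessionRecord` carries only a subset of the registered fields, and its field names, `drop_rate_by`, and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class SessionRecord:
    device_id: str
    firmware_version: str
    hardware_revision: str
    phone_model: str
    phone_os_build: str
    app_version: str
    connection_dropped: bool


def drop_rate_by(sessions: Iterable[SessionRecord], attribute: str) -> Dict[str, float]:
    """Share of sessions with a dropped connection, grouped by one fingerprint field."""
    totals: Dict[str, int] = defaultdict(int)
    drops: Dict[str, int] = defaultdict(int)
    for s in sessions:
        key = getattr(s, attribute)
        totals[key] += 1
        drops[key] += int(s.connection_dropped)
    return {key: drops[key] / totals[key] for key in totals}


sessions = [
    SessionRecord("dev-1", "1.4.0", "rev-B", "Pixel 7", "build-A", "2.3.1", False),
    SessionRecord("dev-2", "1.4.0", "rev-B", "Galaxy S22", "build-B", "2.3.1", False),
    SessionRecord("dev-3", "1.5.0", "rev-B", "Pixel 7", "build-A", "2.3.1", True),
    SessionRecord("dev-4", "1.5.0", "rev-C", "Galaxy S22", "build-B", "2.3.1", True),
]

# Drops line up with firmware 1.5.0 and are spread evenly across phone models.
print(drop_rate_by(sessions, "firmware_version"))
print(drop_rate_by(sessions, "phone_model"))
```

Running the same grouping over each fingerprint field in turn is what separates a firmware problem from a phone or hardware-revision problem: the field whose groups differ sharply is the likely cause.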

Layer 7: CDN and Backup Infrastructure

Tools: Azure Blob Storage (geo-redundant), Cloudflare CDN

This layer prevents data loss and keeps your service available during infrastructure incidents. I used to treat it as something to add 'later' — until a brief Azure regional disruption took down a production platform and I spent a very unpleasant afternoon explaining to a client why their data had disappeared.

What to configure before launch:

All static assets served through Cloudflare CDN — faster globally, survives origin incidents, automatic DDoS protection
Geo-redundant storage — user data replicated to a secondary region that serves automatically on primary failure
Automated daily backups with weekly restore test to staging — backups that have never been tested are not backups, they are hope
Tiered storage — data older than 90 days moves to cool storage automatically
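
A restore test only counts if something verifies the restored data. One way to do that, sketched here with local directories standing in for the backup source and the staging restore, is to compare checksums file by file. `checksums` and `restore_mismatches` are hypothetical helpers, the file names are made up, and reading whole files into memory is an assumption that suits small files only.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def checksums(root: Path) -> Dict[str, str]:
    """SHA-256 of every file under root, keyed by path relative to root."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_mismatches(source: Path, restored: Path) -> List[str]:
    """Relative paths that are missing, extra, or different after a restore."""
    src, dst = checksums(source), checksums(restored)
    return sorted(k for k in src.keys() | dst.keys() if src.get(k) != dst.get(k))


with tempfile.TemporaryDirectory() as tmp:
    source, restored = Path(tmp, "source"), Path(tmp, "restored")
    source.mkdir()
    restored.mkdir()
    (source / "a.bin").write_bytes(b"sensor data")
    (restored / "a.bin").write_bytes(b"sensor data")
    (source / "b.bin").write_bytes(b"more data")   # never made it into the restore
    mismatches = restore_mismatches(source, restored)

print(mismatches)
```

An empty list is the pass condition for the weekly restore test; anything else names the files to investigate.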

The Full Pre-Launch Checklist

Crash reporting:

Firebase Crashlytics configured and tested (mobile)
Sentry configured with source maps (backend/web)
Alert routing to on-call engineer

User tracking:

Firebase Analytics events defined and tested
Key user journey instrumented end-to-end

UX feedback:

Instabug initialized and tested
Feedback routing to ClickUp automated

Infrastructure monitoring:

Prometheus metrics collection running
Grafana dashboards for key metrics
Sentry error tracking with environment separation
Alerts at meaningful thresholds (not just "server down")

IoT/hardware (if applicable):

Device registry implemented with firmware version tracking per device
Session quality logging per device

CDN and backup:

Static assets on Cloudflare CDN
Geo-redundant blob storage configured
Automated daily backups running with restore test completed
Load test at 2x expected peak traffic — P95 latency within SLA under load

Final Thoughts

The checklist looks long. It gets faster with practice — on a new project now, this full setup takes two to three days. The first time I built it from scratch it took a week.

The question isn't whether you can afford to do this. It's whether you can afford not to. Every item takes hours to set up. Every item has saved me from incidents that would have taken days to resolve and would have cost me client relationships.

Ship observable systems. Everything else is easier.

Frequently Asked Questions

What monitoring tools should I use for a production AI system?

Use Firebase Crashlytics for mobile crash reporting, Sentry for backend errors with full request context, Prometheus for metrics collection, and Grafana for dashboards and alerting. Route all automated tickets to a single ClickUp workspace with defined severity levels (P0–P3 scale).

What should I check before launching an AI system to production?

Run through 7 layers: crash reporting (Crashlytics + Sentry), user interaction tracking (Firebase Analytics), UX feedback loop (Instabug with ClickUp routing), bug tracking (ClickUp with severity labels), infrastructure monitoring (Prometheus + Grafana, plus Sentry with environment separation), device fingerprinting for IoT projects, and CDN + geo-redundant backup storage with a completed restore test.

How do you set up IoT device monitoring for production hardware?

Register every device with its hardware ID, firmware version, hardware revision, connected phone model, OS version and build number, app version, and last connection timestamp. Log all fields against every session and data transmission event to enable rapid querying of a device's full history when a user reports an issue.
