Tags: Real-Time Audio, LLMs, Speech-to-Text, WebSockets, Production AI

LLM-Powered Real-Time Audio Pipelines: How We Built AI Transcription at Scale

Qalab Hassnain Agha · April 22, 2025 · 15 min read


Most developers who build voice-enabled applications think the hard part is the speech-to-text model.

It isn't.

The hard part is everything around it — the audio ingestion pipeline that handles real-world microphone input, the LLM layer that turns raw transcriptions into structured intelligence, the WebSocket architecture that delivers results in real time, and the operational infrastructure that keeps all of it running reliably under production load.

I built QuickComm — a real-time AI-powered communication platform for hospitality operations — from the ground up. Every conversation on the platform is captured, transcribed, classified by intent, and delivered as actionable intelligence to operations managers in under two seconds. This is how we built it and what we learned.

The Architecture Overview

Before getting into the details, here's the full pipeline at a high level:

Audio capture → PCM streaming over WebSocket → STT engine (Deepgram Nova-3) → Transcript accumulation → LLM intent classification (Gemini 2.5 Flash) → Structured output delivery → Database storage + dashboard update

Each arrow in that chain is a failure point. Each one has production lessons attached to it.

Stage 1: Audio Capture and Streaming

The first challenge in any real-time audio pipeline is getting clean audio from the source to your processing backend with minimal latency and without quality degradation.

PCM over WebSockets

We stream raw PCM audio (16-bit, 16kHz, mono) from the client device to the backend over a persistent WebSocket connection. The reasons for this choice:

PCM is uncompressed — no encoding/decoding overhead at the client, which matters for real-time latency
WebSockets give us a persistent connection with no HTTP request overhead per chunk
16kHz mono is the standard input format for most production STT engines — no server-side resampling needed

Chunk size matters

We send audio in 100ms chunks, which at 16-bit, 16kHz mono is 3,200 bytes per message. Smaller chunks (20–50ms) reduce latency but multiply the number of WebSocket messages and the per-chunk work for the STT engine. Larger chunks (200–500ms) reduce overhead but add perceptible lag. For conversational audio, 100ms is the sweet spot: the user finishes a sentence, and the transcription appears within 300–400ms of their last word.
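The chunk-size trade-off is easy to check with arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the only inputs taken from the text are the audio format (16-bit, 16kHz, mono) and the chunk durations.

```python
SAMPLE_RATE_HZ = 16_000   # 16kHz, as stated in the text
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def chunk_size_bytes(chunk_ms: int) -> int:
    """Bytes of raw PCM contained in one chunk of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def messages_per_second(chunk_ms: int) -> float:
    """How many WebSocket messages one stream produces per second."""
    return 1000 / chunk_ms


def split_into_chunks(pcm: bytes, chunk_ms: int = 100) -> list[bytes]:
    """Split a PCM buffer into fixed-duration chunks (the last one may be short)."""
    size = chunk_size_bytes(chunk_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

At 20ms a stream sends 50 small messages per second; at 100ms it sends 10 messages of 3,200 bytes each, which is where the overhead argument comes from.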

Handling network interruptions

Audio pipelines break differently from regular APIs. When a user's network drops mid-sentence, you don't want to lose the audio captured locally before the connection dropped. We buffer audio locally during disconnections and replay it on reconnection, with sequence numbers to ensure correct ordering on the server. The user experience is seamless — the system catches up silently.
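A minimal sketch of the buffer-and-replay idea, not the QuickComm implementation. The class names, the server acknowledgement step, and the buffer cap are all assumptions made for illustration; the text only states that audio is buffered locally, replayed on reconnect, and ordered by sequence number.

```python
from collections import deque


class ReplayBuffer:
    """Client side: number each chunk and keep it until the server acknowledges it."""

    def __init__(self, max_chunks: int = 600):  # assumed cap: ~60s of 100ms chunks
        self._next_seq = 0
        self._pending = deque(maxlen=max_chunks)

    def add(self, chunk: bytes) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._pending.append((seq, chunk))
        return seq

    def ack(self, seq: int) -> None:
        """Drop every chunk up to and including seq."""
        while self._pending and self._pending[0][0] <= seq:
            self._pending.popleft()

    def unacked(self) -> list:
        """Chunks to replay after a reconnect, oldest first."""
        return list(self._pending)


class ServerReorderer:
    """Server side: release chunks strictly in sequence order, ignoring duplicates."""

    def __init__(self):
        self._expected = 0
        self._held = {}

    def receive(self, seq: int, chunk: bytes) -> list:
        if seq < self._expected:
            return []  # already processed, e.g. a duplicate from a replay
        self._held[seq] = chunk
        ready = []
        while self._expected in self._held:
            ready.append(self._held.pop(self._expected))
            self._expected += 1
        return ready
```

On reconnect the client would resend everything from `unacked()`; the server side tolerates both out-of-order arrival and chunks it has already seen.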

Stage 2: Speech-to-Text with Deepgram Nova-3

We evaluated four STT engines before choosing Deepgram Nova-3: OpenAI Whisper, Google Speech-to-Text, AWS Transcribe, and Deepgram. Evaluation criteria:

Accuracy on hospitality domain vocabulary — "reservations," "housekeeping," "concierge" need to transcribe correctly
Streaming latency — time from spoken word to transcription output
Cost at scale — audio pipelines generate a lot of API calls
Reliability — uptime and error rate under sustained load

Deepgram Nova-3 won on accuracy (94% on our domain-specific test set), latency (average 180ms from audio chunk to transcript), and cost. Whisper was more accurate on general speech but significantly slower for streaming use cases.

Streaming vs batch transcription

We use streaming transcription for real-time dashboard display and batch transcription for final accuracy on the stored transcript. The real-time display prioritizes speed, the archive prioritizes accuracy. This dual-pass approach costs slightly more but produces a significantly better product experience.

Transcript accumulation

A common mistake is appending every partial result directly to the transcript, which leaves you with duplicates and incorrect word sequences. The correct approach relies on the timing information attached to each partial result (start_time and end_time here; some engines report a start and a duration instead). Use these to reconstruct the transcript by time window, replacing earlier partials with more accurate later ones as they arrive. We maintain a rolling transcript buffer keyed by time windows. Final segments lock in and trigger the LLM classification layer.
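The accumulation rule can be sketched as a small buffer keyed by segment start time. This is an illustration under assumptions: the `Segment` shape and the `TranscriptBuffer` name are hypothetical, and a real STT response would need to be mapped onto these fields first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float    # seconds from the start of the stream
    end: float
    text: str
    is_final: bool


class TranscriptBuffer:
    """Keeps one segment per time window; later results replace earlier partials."""

    def __init__(self):
        self._by_start = {}

    def update(self, seg: Segment) -> bool:
        """Apply one STT result. Returns True when a final segment was locked in."""
        key = round(seg.start, 2)
        current = self._by_start.get(key)
        if current is not None and current.is_final and not seg.is_final:
            return False  # a late partial never overwrites a locked segment
        self._by_start[key] = seg
        return seg.is_final

    def transcript(self) -> str:
        ordered = sorted(self._by_start.values(), key=lambda s: s.start)
        return " ".join(s.text for s in ordered)
```

A `True` return from `update` is the natural point to hand the locked segment to the classification layer.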

Stage 3: LLM Intent Classification

Raw transcription text is useful for records. What operations managers need is structured intelligence: what was this conversation about, what actions are required, who is responsible.

Why not fine-tune a smaller model?

We evaluated fine-tuning a smaller classification model vs. using a general-purpose LLM with a structured prompt. Fine-tuning requires labeled training data, regular retraining as new categories emerge, and a model deployment pipeline. For a product in active development, the maintenance overhead was too high. Gemini 2.5 Flash gives us structured classification in under 400ms with zero training data requirements — when new intent categories emerge, we update the prompt, not a training pipeline.

The classification prompt structure

We ask Gemini to classify each completed utterance into a structured JSON output:

{
  "intent": "maintenance_request",
  "urgency": "high",
  "department": "housekeeping",
  "action_required": true,
  "summary": "Room 412 reports broken AC unit",
  "entities": {
    "room": "412",
    "issue": "AC unit",
    "status": "broken"
  }
}
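Because the model's output drives downstream actions, it is worth validating before use. The following is a minimal standard-library sketch that checks a response against the shape shown above. The allowed urgency values are an assumption (the example only shows "high"), and `parse_classification` is a hypothetical helper name.

```python
import json

ALLOWED_URGENCY = {"low", "medium", "high"}  # assumed label set
REQUIRED_FIELDS = {
    "intent": str,
    "urgency": str,
    "department": str,
    "action_required": bool,
    "summary": str,
    "entities": dict,
}


def parse_classification(raw: str) -> dict:
    """Parse and validate one LLM response; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    if data["urgency"] not in ALLOWED_URGENCY:
        raise ValueError(f"unknown urgency: {data['urgency']}")
    return data
```

A rejected response can then be retried or routed to a fallback instead of reaching the dashboard in a malformed state.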

Latency management

Our latency budget: audio buffering 100ms, STT streaming 180ms average, LLM classification 380ms average, WebSocket delivery 20ms — total ~680ms on the typical path. The P95 path required aggressive timeout handling, fallback to cached classifications for known utterance patterns, and circuit breakers that degrade gracefully rather than failing completely.
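One way to combine the three P95 techniques is sketched below: a timeout around the LLM call, a fallback to a cache of known utterances, and a simple circuit breaker. This is an assumption-laden illustration, not the production code. The `classify` coroutine, the cache keyed by normalized text, the thresholds, and the shape of the degraded result are all hypothetical.

```python
import asyncio
import time

# Typical-path budget from the text, in milliseconds.
LATENCY_BUDGET_MS = {
    "audio_buffering": 100,
    "stt_streaming": 180,
    "llm_classification": 380,
    "websocket_delivery": 20,
}
TYPICAL_TOTAL_MS = sum(LATENCY_BUDGET_MS.values())


class CircuitBreaker:
    """Opens after max_failures consecutive failures; allows a retry after cooldown_s."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self._failures, self._opened_at = 0, None
        else:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()


async def classify_with_fallback(utterance, classify, cache, breaker, timeout_s=1.0):
    """Call the LLM within a deadline; degrade to a cached or 'unknown' result."""
    fallback = cache.get(utterance.strip().lower(), {"intent": "unknown", "degraded": True})
    if not breaker.allow():
        return fallback  # breaker is open: skip the LLM call entirely
    try:
        result = await asyncio.wait_for(classify(utterance), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        breaker.record(ok=False)
        return fallback
    breaker.record(ok=True)
    return result
```

The point of the breaker is that a struggling LLM endpoint stops adding a full timeout to every utterance: once it opens, requests fall through to the degraded path immediately until the cooldown expires.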

Stage 4: WebSocket Delivery and State Management

Connection management at scale

Each hotel property has multiple staff members using the platform simultaneously. Each staff member's audio stream is processed independently, but results are delivered to a shared operations dashboard for that property. We maintain a WebSocket connection per client session and use a pub/sub model for dashboard delivery — when a new classification result is ready for Property X, it's published to Property X's channel and delivered to all connected dashboards in under 20ms.
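The per-property fan-out can be illustrated with an in-process pub/sub built on asyncio queues. This is a simplified sketch: the `PropertyChannels` name is hypothetical, and a multi-server deployment would need an external broker, which the text does not specify.

```python
import asyncio
from collections import defaultdict


class PropertyChannels:
    """In-process pub/sub: one channel per property, one queue per connected dashboard."""

    def __init__(self):
        self._subscribers = defaultdict(set)

    def subscribe(self, property_id: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._subscribers[property_id].add(queue)
        return queue

    def unsubscribe(self, property_id: str, queue: asyncio.Queue) -> None:
        self._subscribers[property_id].discard(queue)

    def publish(self, property_id: str, message: dict) -> int:
        """Fan a result out to every dashboard on the channel; returns the count."""
        targets = self._subscribers.get(property_id, set())
        for queue in targets:
            queue.put_nowait(message)
        return len(targets)
```

Each dashboard's WebSocket handler would await its own queue and forward messages to the browser, so a slow client only backs up its own queue.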

Handling reconnections on the dashboard

Operations dashboards run in browsers, which have aggressive connection management — tabs that go to background can lose WebSocket connections after a few minutes. We implement exponential backoff reconnection with state catch-up: on reconnect, the client requests the last N minutes of classifications to fill any gaps in the live feed.
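The reconnection schedule and the catch-up request can be sketched as two small functions. The base delay, cap, jitter factor, message shape, and 10-minute window are illustrative assumptions; the text only specifies exponential backoff and a request for the last N minutes.

```python
import random


def backoff_delays(base_s=0.5, cap_s=30.0, attempts=8, jitter=0.0, rng=random.random):
    """Yield the wait before each reconnect attempt: base * 2^n, capped, with optional jitter."""
    for attempt in range(attempts):
        delay = min(cap_s, base_s * 2 ** attempt)
        # jitter=0.2 spreads each delay by +/-20% so clients do not reconnect in lockstep
        yield delay * (1 + jitter * (2 * rng() - 1))


def catch_up_request(last_seen_ts: float, now_ts: float, max_window_s: float = 600.0) -> dict:
    """Build the message a dashboard sends after reconnecting (shape is hypothetical)."""
    since = max(last_seen_ts, now_ts - max_window_s)
    return {"type": "catch_up", "since": since}
```

Capping the catch-up window keeps a tab that was in the background for hours from requesting the whole day's feed at once.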

Stage 5: Storage and Post-Processing

Time-series storage for audio metadata

For audio metadata (session timestamps, duration, speaker ID, channel), we use PostgreSQL with appropriate indexing. Query patterns are predictable — last N sessions per property, sessions by date range, sessions by department — and a well-indexed relational table handles these cleanly.
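As an illustration of indexing for these query patterns, the sketch below creates a metadata table with composite indexes that match them. SQLite from the Python standard library stands in for PostgreSQL so the snippet runs anywhere; the table name, column names, and index names are hypothetical, and a PostgreSQL schema would use native timestamp types.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audio_sessions (
    id           INTEGER PRIMARY KEY,
    property_id  TEXT NOT NULL,
    department   TEXT NOT NULL,
    speaker_id   TEXT NOT NULL,
    started_at   TEXT NOT NULL,   -- ISO 8601 UTC timestamp, sorts lexicographically
    duration_s   REAL NOT NULL
);
-- "last N sessions per property" and "sessions by date range" for a property
CREATE INDEX idx_sessions_property_time ON audio_sessions (property_id, started_at DESC);
-- "sessions by department"
CREATE INDEX idx_sessions_department_time ON audio_sessions (department, started_at DESC);
"""


def last_sessions(conn: sqlite3.Connection, property_id: str, limit: int) -> list:
    """Most recent sessions for one property, newest first."""
    return conn.execute(
        "SELECT id, started_at FROM audio_sessions "
        "WHERE property_id = ? ORDER BY started_at DESC LIMIT ?",
        (property_id, limit),
    ).fetchall()
```

Putting the filter column first and the timestamp second lets the same index serve both the equality filter and the ordering.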

Blob storage for audio files

Raw audio files are stored in Azure Blob Storage with geo-redundant replication. Retention policy: 90 days at full quality, archived to cool storage after that. Audio files older than 90 days are rarely accessed but occasionally needed for dispute resolution — cool storage keeps them available at lower cost.

Post-processing for accuracy

After a conversation ends, we run a higher-accuracy batch transcription pass on the full audio file. The real-time transcript was optimized for speed — the stored transcript is optimized for accuracy. This second pass also catches any words the streaming engine missed during network instability.

Production Lessons

STT engines hallucinate too

LLM hallucination gets a lot of attention, but STT engines have their own version of the problem. Deepgram occasionally produces confident but incorrect transcriptions — particularly for proper nouns, domain-specific terms, and speech with background noise. Maintain a domain-specific vocabulary file and pass it to the STT engine. Flag low-confidence transcriptions (below 0.75 confidence score) for human review rather than feeding them directly to the LLM.
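The confidence gate itself is a few lines. In this sketch the 0.75 threshold comes from the text, while the function name and the `(text, confidence)` input shape are assumptions; how confidence is read from the STT response depends on the engine.

```python
CONFIDENCE_THRESHOLD = 0.75  # threshold stated in the text


def partition_segments(segments, threshold: float = CONFIDENCE_THRESHOLD):
    """Split (text, confidence) pairs into those safe to classify and those needing review."""
    to_classify, to_review = [], []
    for text, confidence in segments:
        (to_classify if confidence >= threshold else to_review).append(text)
    return to_classify, to_review
```

Only the first list would be forwarded to the LLM; the second goes to a human review queue.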

The silence problem

Audio pipelines need to handle silence correctly. If silence detection is too aggressive, you fragment utterances mid-sentence and the LLM has to classify each fragment without the context it needs. If it is too conservative, you hold the transcript open too long, adding latency. We use Deepgram's utterance end detection (combining silence duration with speech probability) rather than simple silence thresholding, because it handles natural speech pauses much better.
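To show why combining the two signals helps, here is a toy end-of-utterance check that looks at per-frame speech probability and the length of the trailing quiet run. It is not Deepgram's algorithm, only an illustration of the idea; the frame size, probability threshold, and 700ms silence requirement are arbitrary assumptions.

```python
def utterance_ended(speech_probs, frame_ms=100, min_silence_ms=700, speech_threshold=0.5):
    """True once speech has occurred and the trailing low-probability run is long enough."""
    if not any(p >= speech_threshold for p in speech_probs):
        return False  # nothing spoken yet, so there is no utterance to end
    trailing = 0
    for p in reversed(speech_probs):
        if p >= speech_threshold:
            break
        trailing += 1
    return trailing * frame_ms >= min_silence_ms
```

Lowering `min_silence_ms` makes the check more aggressive (more fragmentation), and raising it makes it more conservative (more latency), which is the trade-off described above.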

Cost management at scale

STT API costs: use voice activity detection to stop sending audio during silence, which reduces billable audio by 30–40% for conversational use cases
LLM API costs: truncate transcripts to the last 500 tokens before sending them to the LLM, since older context rarely changes the classification result for short operational utterances
Storage costs: tiered storage policy (hot → cool → archive) based on access patterns
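The first two savings can be sketched in a few lines. Both helpers are hypothetical, and the truncation uses whitespace-separated words as a rough proxy for model tokens; an exact count would need the model's own tokenizer.

```python
def truncate_to_last_tokens(transcript: str, max_tokens: int = 500) -> str:
    """Keep the last max_tokens whitespace-separated words (a rough proxy for tokens)."""
    if max_tokens <= 0:
        return ""
    words = transcript.split()
    return " ".join(words[-max_tokens:])


def billable_seconds(frame_is_speech, frame_ms: int = 100) -> float:
    """Seconds of audio actually sent when frames marked as silence are dropped."""
    return sum(1 for is_speech in frame_is_speech if is_speech) * frame_ms / 1000
```

For example, a 10-frame window with four silent frames sends 0.6 seconds of audio instead of 1.0, a 40% reduction for that window.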

The Full Stack at a Glance

Audio capture: 16kHz mono PCM, 100ms chunks, WebSocket streaming
STT: Deepgram Nova-3 streaming, domain vocabulary, confidence scoring
LLM: Gemini 2.5 Flash, structured JSON output, schema validation
Delivery: WebSocket pub/sub, per-property channels, reconnection with state catch-up
Storage: PostgreSQL (metadata), Azure Blob (audio), Redis (cache)
Monitoring: Prometheus + Grafana, Sentry, custom pipeline latency dashboard

Final Thoughts

Real-time audio pipelines are deceptively complex. The happy path — user speaks, transcription appears, classification is delivered — is straightforward to prototype. Making it work reliably in production, across variable network conditions, with real users who pause mid-sentence and use domain-specific vocabulary, is an entirely different engineering challenge.

The lessons that changed how I build these systems: treat silence as a first-class problem, validate every LLM output against a schema, use streaming STT for experience and batch STT for accuracy, and always calculate your latency budget before choosing your architecture.

The 2-second end-to-end target is achievable. Getting there requires understanding every millisecond of the pipeline.

Frequently Asked Questions

What is the achievable end-to-end latency for an LLM audio classification pipeline?

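Sub-2-second end-to-end latency is achievable: audio capture adds under 100ms, Deepgram Nova-3 streaming STT adds approximately 300ms, Gemini 2.5 Flash LLM classification adds approximately 380ms, and WebSocket delivery adds 50ms. Total: approximately 830ms under good network conditions. These are more conservative allowances than the typical-path budget in the Latency management section (180ms for STT and 20ms for delivery, about 680ms in total).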

How do you prevent STT hallucinations in production audio pipelines?

Implement voice activity detection to stop sending audio during silence periods. Deepgram Nova-3 combined with silence detection reduces hallucination rate from 12% to under 1% on silent audio segments. Also set confidence thresholds to discard low-confidence transcript segments.

Should I use streaming or batch STT for production audio pipelines?

Use streaming STT for real-time user experience — it delivers partial transcripts within 300ms. Use batch STT for accuracy-critical storage records where a 2–3 second delay is acceptable. The two serve different purposes and can be used simultaneously in the same pipeline.