LLM-Powered Real-Time Audio Pipelines: How We Built AI Transcription at Scale
Most developers who build voice-enabled applications think the hard part is the speech-to-text model.
It isn't.
The hard part is everything around it — the audio ingestion pipeline that handles real-world microphone input, the LLM layer that turns raw transcriptions into structured intelligence, the WebSocket architecture that delivers results in real time, and the operational infrastructure that keeps all of it running reliably under production load.
I built QuickComm — a real-time AI-powered communication platform for hospitality operations — from the ground up. Every conversation on the platform is captured, transcribed, classified by intent, and delivered as actionable intelligence to operations managers in under two seconds. This is how we built it and what we learned.
The Architecture Overview
Before getting into the details, here's the full pipeline at a high level:
Audio capture → PCM streaming over WebSocket → STT engine (Deepgram Nova-3) → Transcript accumulation → LLM intent classification (Gemini 2.5 Flash) → Structured output delivery → Database storage + dashboard update
Each arrow in that chain is a failure point. Each one has production lessons attached to it.
Stage 1: Audio Capture and Streaming
The first challenge in any real-time audio pipeline is getting clean audio from the source to your processing backend with minimal latency and without quality degradation.
PCM over WebSockets
We stream raw PCM audio (16-bit, 16kHz, mono) from the client device to the backend over a persistent WebSocket connection. The reasons for this choice:
- PCM is uncompressed — no encoding/decoding overhead at the client, which matters for real-time latency
- WebSockets give us a persistent connection with no HTTP request overhead per chunk
- 16kHz mono is the standard input format for most production STT engines — no server-side resampling needed
Chunk size matters
We send audio in 100ms chunks. Smaller chunks (20–50ms) reduce latency but increase WebSocket overhead and make the STT engine work harder. Larger chunks (200–500ms) reduce overhead but add perceptible lag. 100ms is the sweet spot for conversational audio — the user finishes a sentence, and the transcription appears within 300–400ms of their last word.
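To make the chunk-size arithmetic concrete: 16-bit mono audio at 16kHz is 32,000 bytes per second, so a 100ms chunk is 3,200 bytes. The sketch below is illustrative rather than our client code; `chunk_bytes` and `iter_chunks` are hypothetical helper names.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # 16kHz, per the format described above
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def chunk_bytes(chunk_ms: int) -> int:
    """Number of bytes in one chunk of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration chunks; the final chunk may be shorter."""
    size = chunk_bytes(chunk_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


one_second = bytes(chunk_bytes(1000))    # one second of digital silence
chunks = list(iter_chunks(one_second))   # ten chunks of 3,200 bytes each
```

At 20ms the same stream becomes fifty 640-byte messages per second, which is where the extra WebSocket overhead comes from.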
Handling network interruptions
Audio pipelines break differently from regular APIs. When a user's network drops mid-sentence, you don't want to lose the audio captured locally before the connection dropped. We buffer audio locally during disconnections and replay it on reconnection, with sequence numbers to ensure correct ordering on the server. The user experience is seamless — the system catches up silently.
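A minimal sketch of the buffer-and-replay idea, under some assumptions: the `send` callback stands in for the WebSocket send, and `ReplayBuffer` and `Reorderer` are hypothetical names. A production version would also cap the buffer size and acknowledge received sequence numbers.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class ReplayBuffer:
    """Client side: number every chunk and hold it while the socket is down."""

    def __init__(self, send: Callable[[int, bytes], None]) -> None:
        self._send = send                      # assumed transport callback
        self._pending: Deque[Tuple[int, bytes]] = deque()
        self._next_seq = 0
        self.connected = True

    def push(self, chunk: bytes) -> None:
        item = (self._next_seq, chunk)
        self._next_seq += 1
        if self.connected:
            self._send(*item)
        else:
            self._pending.append(item)         # keep audio captured offline

    def on_reconnect(self) -> None:
        self.connected = True
        while self._pending:                   # replay oldest first
            self._send(*self._pending.popleft())


class Reorderer:
    """Server side: release chunks strictly in sequence order."""

    def __init__(self) -> None:
        self._expected = 0
        self._held: Dict[int, bytes] = {}

    def accept(self, seq: int, chunk: bytes) -> List[bytes]:
        if seq < self._expected:
            return []                          # duplicate of a released chunk
        self._held[seq] = chunk
        ready: List[bytes] = []
        while self._expected in self._held:
            ready.append(self._held.pop(self._expected))
            self._expected += 1
        return ready
```

The server only forwards audio to the STT engine once every earlier sequence number has arrived, so replayed chunks cannot be interleaved out of order.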
Stage 2: Speech-to-Text with Deepgram Nova-3
We evaluated four STT engines before choosing Deepgram Nova-3: OpenAI Whisper, Google Speech-to-Text, AWS Transcribe, and Deepgram. Evaluation criteria:
- Accuracy on hospitality domain vocabulary — "reservations," "housekeeping," "concierge" need to transcribe correctly
- Streaming latency — time from spoken word to transcription output
- Cost at scale — audio pipelines generate a lot of API calls
- Reliability — uptime and error rate under sustained load
Deepgram Nova-3 won on accuracy (94% on our domain-specific test set), latency (average 180ms from audio chunk to transcript), and cost. Whisper was more accurate on general speech but significantly slower for streaming use cases.
Streaming vs batch transcription
We use streaming transcription for real-time dashboard display and batch transcription for final accuracy on the stored transcript. The real-time display prioritizes speed; the archive prioritizes accuracy. This dual-pass approach costs slightly more but produces a significantly better product experience.
Transcript accumulation
A common mistake is appending every partial result directly to the transcript — you end up with duplicates and incorrect word sequences. The correct approach: each partial result includes start_time and end_time. Use these to reconstruct the transcript by time window, replacing earlier partials with more accurate later ones as they arrive. We maintain a rolling transcript buffer keyed by time windows. Final segments lock in and trigger the LLM classification layer.
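The time-window approach can be sketched roughly as follows. This is not our production buffer: `Segment`, `TranscriptBuffer`, and the `is_final` flag are assumptions made for illustration, and real STT payloads name their timing fields differently from vendor to vendor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_time: float   # seconds from the start of the stream
    end_time: float
    text: str
    is_final: bool      # assumed flag marking a locked-in result


class TranscriptBuffer:
    def __init__(self) -> None:
        self._segments: List[Segment] = []

    def add(self, seg: Segment) -> bool:
        """Store seg; return True when a final segment was locked in."""
        overlapping = [
            s for s in self._segments
            if s.start_time < seg.end_time and seg.start_time < s.end_time
        ]
        if any(s.is_final for s in overlapping):
            return False                       # finals are never overwritten
        # Later results replace earlier partials covering the same window.
        self._segments = [s for s in self._segments if s not in overlapping]
        self._segments.append(seg)
        self._segments.sort(key=lambda s: s.start_time)
        return seg.is_final                    # caller can trigger the LLM

    def text(self) -> str:
        return " ".join(s.text for s in self._segments)
```

Replacing by overlap instead of appending is what avoids the duplicated and reordered words described above.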
Stage 3: LLM Intent Classification
Raw transcription text is useful for records. What operations managers need is structured intelligence: what was this conversation about, what actions are required, who is responsible.
Why not fine-tune a smaller model?
We evaluated fine-tuning a smaller classification model vs. using a general-purpose LLM with a structured prompt. Fine-tuning requires labeled training data, regular retraining as new categories emerge, and a model deployment pipeline. For a product in active development, the maintenance overhead was too high. Gemini 2.5 Flash gives us structured classification in under 400ms with zero training data requirements — when new intent categories emerge, we update the prompt, not a training pipeline.
The classification prompt structure
We ask Gemini to classify each completed utterance into a structured JSON output:
{
  "intent": "maintenance_request",
  "urgency": "high",
  "department": "housekeeping",
  "action_required": true,
  "summary": "Room 412 reports broken AC unit",
  "entities": {
    "room": "412",
    "issue": "AC unit",
    "status": "broken"
  }
}
Latency management
Our latency budget: audio buffering 100ms, STT streaming 180ms average, LLM classification 380ms average, WebSocket delivery 20ms — total ~680ms on the typical path. The P95 path required aggressive timeout handling, fallback to cached classifications for known utterance patterns, and circuit breakers that degrade gracefully rather than failing completely.
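Here is a hedged sketch of the timeout-plus-fallback pattern, combined with a shape check on the LLM's JSON. The `llm_call` coroutine, the 0.8-second timeout, the cache keyed by normalized utterance text, and the `UNCLASSIFIED` default are all assumptions for illustration; the circuit breaker is left out to keep the example short.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, Optional

# Field names follow the JSON example above; the types are an assumption.
REQUIRED_FIELDS = {
    "intent": str, "urgency": str, "department": str,
    "action_required": bool, "summary": str, "entities": dict,
}

UNCLASSIFIED = {
    "intent": "unclassified", "urgency": "unknown", "department": "unknown",
    "action_required": False, "summary": "", "entities": {},
}


def parse_classification(raw: str) -> Optional[dict]:
    """Return the parsed object only if it has every required field and type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            return None
    return data


async def classify(
    text: str,
    llm_call: Callable[[str], Awaitable[str]],   # hypothetical LLM client
    cache: Dict[str, dict],
    timeout_s: float = 0.8,
) -> dict:
    try:
        raw = await asyncio.wait_for(llm_call(text), timeout=timeout_s)
        result = parse_classification(raw)
    except asyncio.TimeoutError:
        result = None
    if result is not None:
        return result
    # Slow or malformed response: fall back to a cached classification.
    return cache.get(text.strip().lower(), UNCLASSIFIED)
```

The point of the design is that a slow or malformed LLM response degrades to a less specific result instead of stalling the dashboard.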
Stage 4: WebSocket Delivery and State Management
Connection management at scale
Each hotel property has multiple staff members using the platform simultaneously. Each staff member's audio stream is processed independently, but results are delivered to a shared operations dashboard for that property. We maintain a WebSocket connection per client session and use a pub/sub model for dashboard delivery — when a new classification result is ready for Property X, it's published to Property X's channel and delivered to all connected dashboards in under 20ms.
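The per-property fan-out can be illustrated with a small in-process sketch. `PropertyChannels` is a hypothetical name, each subscriber queue stands in for one dashboard's WebSocket, and a multi-process deployment would need an external broker in place of this single-process registry.

```python
import asyncio
from collections import defaultdict
from typing import Any, DefaultDict, Set


class PropertyChannels:
    """One channel per property; every subscribed dashboard gets each message."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, Set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, property_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subs[property_id].add(queue)
        return queue

    def unsubscribe(self, property_id: str, queue: asyncio.Queue) -> None:
        self._subs[property_id].discard(queue)

    def publish(self, property_id: str, message: Any) -> int:
        """Deliver to every dashboard on the channel; return the count."""
        subscribers = self._subs.get(property_id, set())
        for queue in subscribers:
            queue.put_nowait(message)
        return len(subscribers)


async def demo() -> tuple:
    channels = PropertyChannels()
    a = channels.subscribe("property-x")
    b = channels.subscribe("property-x")
    other = channels.subscribe("property-y")
    delivered = channels.publish("property-x", {"intent": "maintenance_request"})
    return delivered, a.get_nowait(), b.get_nowait(), other.empty()
```

Running `asyncio.run(demo())` shows both dashboards on the first property receiving the result while the other property's dashboard receives nothing.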
Handling reconnections on the dashboard
Operations dashboards run in browsers, which have aggressive connection management — tabs that go to background can lose WebSocket connections after a few minutes. We implement exponential backoff reconnection with state catch-up: on reconnect, the client requests the last N minutes of classifications to fill any gaps in the live feed.
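The reconnection schedule and the catch-up window can be sketched as below. The sketch is in Python for consistency with the other examples here, although a browser dashboard would implement it in client-side code. The base delay, cap, jitter strategy, and the `catch_up_since` helper are illustrative assumptions, not our exact settings.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base_s: float = 0.5,
    cap_s: float = 30.0,
    max_attempts: int = 8,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one wait time per reconnect attempt, doubling up to a cap."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        # Full jitter: a random wait up to the ceiling, so many dashboards
        # dropped at the same moment do not all reconnect in lockstep.
        yield rng.uniform(0, ceiling)


def catch_up_since(last_seen: float, now: float, window_min: int = 5) -> float:
    """Timestamp from which to request missed classifications.

    Never reaches back further than the last `window_min` minutes.
    """
    return max(last_seen, now - window_min * 60)
```

On a successful reconnect the client would send the timestamp from `catch_up_since` and merge the returned classifications into the live feed.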
Stage 5: Storage and Post-Processing
Time-series storage for audio metadata
For audio metadata (session timestamps, duration, speaker ID, channel), we use PostgreSQL with appropriate indexing. Query patterns are predictable — last N sessions per property, sessions by date range, sessions by department — and a well-indexed relational table handles these cleanly.
Blob storage for audio files
Raw audio files are stored in Azure Blob Storage with geo-redundant replication. Retention policy: 90 days at full quality, archived to cool storage after that. Audio files older than 90 days are rarely accessed but occasionally needed for dispute resolution — cool storage keeps them available at lower cost.
Post-processing for accuracy
After a conversation ends, we run a higher-accuracy batch transcription pass on the full audio file. The real-time transcript was optimized for speed — the stored transcript is optimized for accuracy. This second pass also catches any words the streaming engine missed during network instability.
Production Lessons
STT engines hallucinate too
LLM hallucination gets a lot of attention, but STT engines have their own version of the problem. Deepgram occasionally produces confident but incorrect transcriptions — particularly for proper nouns, domain-specific terms, and speech with background noise. Maintain a domain-specific vocabulary file and pass it to the STT engine. Flag low-confidence transcriptions (below 0.75 confidence score) for human review rather than feeding them directly to the LLM.
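A minimal sketch of the confidence gate, assuming each transcript segment carries a single confidence score between 0 and 1. `ScoredSegment` and `route` are hypothetical names; only the 0.75 threshold comes from the text above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

CONFIDENCE_THRESHOLD = 0.75   # threshold quoted above


@dataclass
class ScoredSegment:
    text: str
    confidence: float


def route(
    segments: Iterable[ScoredSegment],
) -> Tuple[List[ScoredSegment], List[ScoredSegment]]:
    """Split segments into (send to LLM, hold for human review)."""
    to_llm: List[ScoredSegment] = []
    to_review: List[ScoredSegment] = []
    for seg in segments:
        if seg.confidence >= CONFIDENCE_THRESHOLD:
            to_llm.append(seg)
        else:
            to_review.append(seg)
    return to_llm, to_review
```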
The silence problem
Audio pipelines need to handle silence correctly. If silence detection is too aggressive, utterances get split mid-sentence and the LLM receives fragments without enough context to classify them. If it is too conservative, the transcript stays open too long and adds latency. We use Deepgram's utterance end detection (combining silence duration with speech probability) rather than simple silence thresholding; it handles natural speech pauses much better.
Cost management at scale
- STT API costs: use voice activity detection to stop sending audio during silence, which reduces billable audio by 30–40% for conversational use cases
- LLM API costs: truncate transcripts to the last 500 tokens before sending to the LLM — older context rarely changes the classification result for short operational utterances
- Storage costs: tiered storage policy (hot → cool → archive) based on access patterns
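The first two cost levers can be approximated in a few lines. This is a deliberately crude sketch: the RMS energy gate stands in for a real voice activity detector, the threshold of 500 is an arbitrary assumption that would need tuning per microphone, and whitespace-separated words only approximate model tokens.

```python
import array
import math
import sys


def rms(pcm_chunk: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM chunk."""
    samples = array.array("h")
    samples.frombytes(pcm_chunk)
    if sys.byteorder == "big":
        samples.byteswap()           # PCM on the wire is little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(pcm_chunk: bytes, threshold: float = 500.0) -> bool:
    """Energy gate: only chunks at or above the threshold are sent to STT."""
    return rms(pcm_chunk) >= threshold


def truncate_tokens(text: str, max_tokens: int = 500) -> str:
    """Keep only the most recent max_tokens whitespace-separated words."""
    tokens = text.split()
    return " ".join(tokens[-max_tokens:])
```

A production gate would typically keep a short hangover window after speech ends so that trailing syllables are not clipped.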
The Full Stack at a Glance
- Audio capture: 16kHz mono PCM, 100ms chunks, WebSocket streaming
- STT: Deepgram Nova-3 streaming, domain vocabulary, confidence scoring
- LLM: Gemini 2.5 Flash, structured JSON output, schema validation
- Delivery: WebSocket pub/sub, per-property channels, reconnection with state catch-up
- Storage: PostgreSQL (metadata), Azure Blob (audio), Redis (cache)
- Monitoring: Prometheus + Grafana, Sentry, custom pipeline latency dashboard
Final Thoughts
Real-time audio pipelines are deceptively complex. The happy path — user speaks, transcription appears, classification is delivered — is straightforward to prototype. Making it work reliably in production, across variable network conditions, with real users who pause mid-sentence and use domain-specific vocabulary, is an entirely different engineering challenge.
The lessons that changed how I build these systems: treat silence as a first-class problem, validate every LLM output against a schema, use streaming STT for experience and batch STT for accuracy, and always calculate your latency budget before choosing your architecture.
The 2-second end-to-end target is achievable. Getting there requires understanding every millisecond of the pipeline.
Frequently Asked Questions
What is the achievable end-to-end latency for an LLM audio classification pipeline?
Sub-2-second end-to-end latency is achievable. On our typical path, audio buffering adds about 100ms, Deepgram Nova-3 streaming STT about 180ms on average, Gemini 2.5 Flash classification about 380ms on average, and WebSocket delivery about 20ms, for a total of roughly 680ms under good network conditions.
How do you prevent STT hallucinations in production audio pipelines?
Implement voice activity detection to stop sending audio during silence periods. Deepgram Nova-3 combined with silence detection reduces hallucination rate from 12% to under 1% on silent audio segments. Also set a confidence threshold so that low-confidence transcript segments are held back from the LLM; we flag anything below a 0.75 confidence score for human review.
Should I use streaming or batch STT for production audio pipelines?
Use streaming STT for real-time user experience — it delivers partial transcripts within 300ms. Use batch STT for accuracy-critical storage records where a 2–3 second delay is acceptable. The two serve different purposes and can be used simultaneously in the same pipeline.