Voice AI · Whisper · Deepgram · Gemini · Real-Time · LLM · Hospitality

How We Replaced Hotel Walkie-Talkies With Real-Time Voice AI

Qalab Hassnain Agha · 11 min read

Walk any hotel back-of-house and you hear it: constant radio chatter. "Housekeeping, 412 needs towels." "Engineering, pool pump again." Every request exists for exactly as long as the audio hangs in the air — no record, no routing, no accountability, and the guest waits while the right person maybe hears it.

QuickComm replaces that with a pipeline that hears the radio, understands it, and routes it — while keeping the radios themselves, because retraining an entire hotel staff off radios is how projects die. The staff kept talking exactly as before; the system listens. Here is how it works and what it took.

The Pipeline: Audio → Text → Intent → Action

  • Ingest: PCM audio streams from the radio system into the cloud — continuous, real-time, per channel.
  • Transcribe: streaming speech-to-text via Whisper and Deepgram converts each utterance to text within seconds of it being spoken.
  • Understand: a Gemini LLM classifies each utterance by request type, department, room, and urgency at ~88% intent precision, emitting JSON that conforms to a strict schema.
  • Route: the structured event fires to the right team instantly over WebSockets — dashboard, app notification, and an audit trail that finally exists.
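To make the four stages concrete, here is a minimal sketch of how they might be wired together. Everything in it is an assumption for illustration: the `Utterance` and `Event` types, the `run_pipeline` function, and the stand-in stages are hypothetical names, and the real system calls Whisper, Deepgram, Gemini, and a WebSocket layer where this sketch takes plain callables.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Utterance:
    channel: str   # radio channel the audio arrived on
    pcm: bytes     # raw PCM audio for one push-to-talk burst


@dataclass
class Event:
    channel: str
    transcript: str
    request_type: str
    department: str
    room: Optional[str]
    urgency: str


def run_pipeline(
    utterances: Iterable[Utterance],
    transcribe: Callable[[bytes], str],
    classify: Callable[[str], dict],
    deliver: Callable[[str, str], None],
) -> Iterator[Event]:
    """Ingest -> transcribe -> understand -> route, one utterance at a time."""
    for utt in utterances:
        text = transcribe(utt.pcm)                      # speech-to-text stage
        fields = classify(text)                         # LLM classification stage
        event = Event(channel=utt.channel, transcript=text, **fields)
        deliver(event.department, json.dumps(asdict(event)))  # routing stage
        yield event


# Stand-in stages so the sketch runs without any external service.
def fake_transcribe(pcm: bytes) -> str:
    return pcm.decode("utf-8")  # pretend the bytes are already text


def fake_classify(text: str) -> dict:
    return {"request_type": "towels", "department": "housekeeping",
            "room": "412", "urgency": "normal"}


sent: list = []  # records (department, JSON payload) pairs instead of sending them
events = list(run_pipeline(
    [Utterance("ch1", b"Housekeeping, 412 needs towels")],
    fake_transcribe,
    fake_classify,
    lambda dept, payload: sent.append((dept, payload)),
))
```

Passing the stages in as callables keeps each one replaceable and testable on its own, which is also the seam along which a pipeline like this can later be split into separate services.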

Hard Lesson 1: Radio Audio Is Its Own Species

Most STT benchmarks are recorded on good microphones by people speaking in full sentences. Radio traffic is compressed, clipped at both ends by push-to-talk, spoken in fragments and hotel shorthand, over static. Our first-pass accuracy was sobering. Getting to 94%+ took audio preprocessing (normalisation, band filtering), domain vocabulary hints (room-number patterns, department names, local terms), and running dual STT engines: Whisper and Deepgram disagree in usefully different ways, and confidence-weighted selection between them recovers a meaningful slice of errors.
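One simple way to do confidence-weighted selection is sketched below. It is an illustration under stated assumptions, not the production logic: the `Hypothesis` type, the `ENGINE_WEIGHTS` values, and the assumption that both engines report a comparable confidence in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    engine: str        # e.g. "whisper" or "deepgram"
    text: str
    confidence: float  # assumed to be engine-reported and scaled to [0, 1]


# Hypothetical per-engine weights; in practice these would be tuned
# against labelled radio audio rather than picked by hand.
ENGINE_WEIGHTS = {"whisper": 1.0, "deepgram": 0.9}


def select_transcript(a: Hypothesis, b: Hypothesis) -> Hypothesis:
    """Return the hypothesis with the higher weighted confidence (ties go to a)."""
    score_a = a.confidence * ENGINE_WEIGHTS.get(a.engine, 1.0)
    score_b = b.confidence * ENGINE_WEIGHTS.get(b.engine, 1.0)
    return a if score_a >= score_b else b


best = select_transcript(
    Hypothesis("whisper", "housekeeping four twelve needs towels", 0.60),
    Hypothesis("deepgram", "housekeeping 412 needs towels", 0.80),
)
```

The per-engine weight exists because raw confidences from different engines are not necessarily calibrated the same way; a weight is the crudest possible correction for that.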

Hard Lesson 2: Constrain the LLM or It Will Improvise

The classification prompt evolved into a contract: a fixed intent taxonomy, mandatory JSON output validated at the boundary, explicit entity slots, and a confidence field the model must populate. Below the confidence threshold, the event routes to a human dispatcher rather than guessing — an unglamorous fallback that is the difference between a tool staff trust and one they turn off. LLM-as-classifier works; LLM-as-freestyle-interpreter does not.
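The boundary check and the fallback can be sketched in a few lines. The taxonomy, field names, and threshold here are hypothetical placeholders, since the article does not list the real ones; only the shape (validate strictly, then route low-confidence events to a human) comes from the text above.

```python
import json
from typing import Optional

# Hypothetical taxonomy and threshold, for illustration only.
INTENTS = {"housekeeping_request", "maintenance_issue", "guest_service", "security"}
DEPARTMENTS = {"housekeeping", "engineering", "front_desk", "security"}
URGENCIES = {"low", "normal", "high"}
CONFIDENCE_THRESHOLD = 0.7
REQUIRED = ("intent", "department", "room", "urgency", "confidence")


def parse_classification(raw: str) -> Optional[dict]:
    """Return the validated payload, or None if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED):
        return None
    if data["intent"] not in INTENTS or data["department"] not in DEPARTMENTS:
        return None
    if data["urgency"] not in URGENCIES:
        return None
    # The room slot may be empty (null) but must be a string when present.
    if data["room"] is not None and not isinstance(data["room"], str):
        return None
    conf = data["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    if not 0.0 <= conf <= 1.0:
        return None
    return data


def route(raw: str) -> str:
    """Send valid, confident events to their department; everything else to a human."""
    data = parse_classification(raw)
    if data is None or data["confidence"] < CONFIDENCE_THRESHOLD:
        return "human_dispatcher"
    return data["department"]
```

For example, `route('{"intent": "maintenance_issue", "department": "engineering", "room": null, "urgency": "high", "confidence": 0.92}')` returns `"engineering"`, while malformed JSON, an intent outside the taxonomy, or a confidence of 0.4 all return `"human_dispatcher"`.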

Hard Lesson 3: The Architecture Migration Paid for Itself

V1 was a monolith — correct choice for shipping fast, wrong choice for scaling to many properties. Audio ingestion, transcription, classification, and delivery have wildly different load profiles; scaling the monolith meant scaling all of them to the peak of the hungriest. Splitting into AWS microservices along those load boundaries delivered 3× throughput at near-zero deployment downtime — and dropped infrastructure cost to roughly $3/month per property. Automated anomaly detection on service health metrics then cut critical-incident response time by 70%: at multi-property scale, the system notices its own problems before staff do.
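The article does not say which anomaly-detection method is used, so the sketch below swaps in a deliberately simple one: a rolling z-score over a single health metric such as per-service latency. The class name, window size, warm-up count, and threshold are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag values that sit far outside the recent window of a metric."""

    def __init__(self, window: int = 30, threshold: float = 3.0) -> None:
        self.values = deque(maxlen=window)  # recent non-anomalous samples
        self.threshold = threshold          # z-score above which we alert

    def observe(self, value: float) -> bool:
        """Return True if value deviates strongly from the recent baseline."""
        is_anomaly = False
        if len(self.values) >= 5:  # wait for a few samples before judging
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            self.values.append(value)  # keep outliers out of the baseline
        return is_anomaly


detector = RollingAnomalyDetector()
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 400]
flags = [detector.observe(v) for v in latencies_ms]
```

Here only the final 400 ms sample is flagged. Running one detector per service and per metric is what the split along load boundaries makes possible: each service has its own baseline instead of one blended signal from the monolith.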

Results

  • 94%+ transcription accuracy on live, noisy radio audio
  • ~88% intent-classification precision via constrained Gemini prompts
  • ~45% faster staff response times — requests reach the right team instantly, with accountability
  • 3× throughput post-migration · ~$3/month per property · 70% faster incident response

Where This Pattern Applies

Hotels were the wedge, but the shape generalises: any operation coordinating over voice — warehouses, hospitals, security teams, restaurants, events — is running on unstructured audio that could be structured, routed, and measured. The full QuickComm case study lives on my AI consulting services page; if your operation runs on radio chatter and you wonder what it would take, that is a conversation I am always glad to have.

Frequently Asked Questions

How accurate is speech-to-text on walkie-talkie radio audio?

Raw radio audio is brutal: compressed, clipped, full of static, spoken in shorthand. Out of the box, general STT models degrade badly on it. With audio preprocessing, domain vocabulary hints, and a dual-engine strategy (Whisper and Deepgram), our production system sustains 94%+ transcription accuracy on live hotel radio traffic.

Why use an LLM for intent classification instead of rules or a classifier?

Staff phrase the same request a hundred ways, in multiple languages, with names and room numbers embedded. Rules engines rot immediately. A constrained LLM (Gemini) with a strict output schema classifies intent at ~88% precision, extracts the entities (room, urgency, department), and degrades gracefully — anything below the confidence bar routes to a human dispatcher.

What does a voice AI system like this cost to run?

Less than intuition suggests, if cost is designed in: streaming architectures avoid storing and reprocessing audio; per-second STT billing rewards short utterances (radio traffic is naturally brief); a small fast LLM handles classification. After migrating from monolith to microservices, infrastructure runs at roughly $3/month per property with 3× the original throughput.

Code, architecture patterns, and recommendations in this article come from real projects but are shared as-is, without warranty — validate them against your own requirements before production use. See the Terms of Use.
