Tags: RAG, LLMs, ChromaDB, BM25, Hybrid Retrieval, Production AI

RAG Architecture in Production: Building a Research Intelligence System with ChromaDB and BM25

Qalab Hassnain Agha · June 3, 2025 · 15 min read

Retrieval-Augmented Generation (RAG) is one of the most talked-about patterns in AI engineering right now — and one of the most poorly understood in production.

Most RAG tutorials show you the happy path: embed your documents, store them in a vector database, retrieve the top-K most similar chunks, pass them to an LLM, get an answer. It works beautifully in the demo. It fails in surprisingly specific ways in production.

I built PaperIntel — a research intelligence system that processes academic PDFs and answers domain-specific questions with citation-level accuracy. This is what production RAG actually looks like, including the parts the tutorials skip.

What RAG Solves (and What It Doesn't)

RAG is the right pattern when you need an LLM to answer questions about a specific corpus of documents that it wasn't trained on — internal company documents, recent research papers, proprietary data, client records.

It solves the knowledge cutoff problem: the LLM doesn't need to have seen the document during training, because you retrieve the relevant content and include it in the prompt at inference time.

What it doesn't solve: poor document quality, bad chunking decisions, retrieval failures, and hallucinations from low-confidence retrievals. These are production problems, not demo problems. All of them need explicit engineering solutions.

Stage 1: Document Ingestion and Chunking

The quality of your RAG system is determined largely by decisions made before a single query is processed. Chunking strategy is the most impactful — and most underspecified — of these decisions.

Why chunking strategy matters

Vector embeddings capture semantic meaning — but they work best on coherent, self-contained units of text. If you chunk too small (sentence-level), individual chunks lose context. If you chunk too large (full page), the embedding averages over too much content and becomes too general to retrieve precisely.

The chunking approach for academic papers

Academic papers have natural structure: abstract, introduction, methodology, results, discussion, conclusion. We exploit this structure rather than ignoring it.

First pass — section-aware chunking: detect section boundaries and chunk within sections, not across them
Second pass — semantic chunking within sections: chunk by coherence rather than fixed token count, targeting 300–500 tokens with 50-token overlap
Third pass — metadata enrichment: every chunk enriched with paper title, authors, year, section name, position, and a generated summary
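The first two passes can be sketched roughly as follows. This is a minimal illustration, not PaperIntel's code: the heading pattern, the Chunk fields, and the function names are assumptions, tokens are approximated by whitespace-separated words, and a fixed sliding window stands in for the coherence-based semantic pass.

```python
import re
from dataclasses import dataclass

# Hypothetical heading pattern: a line that contains only a known section name.
SECTION_RE = re.compile(
    r"^(abstract|introduction|methodology|methods|results|discussion|conclusion)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class Chunk:
    text: str
    section: str
    position: int  # index of the chunk within its section


def split_sections(paper_text):
    """Return (section_name, body) pairs; text before the first heading is 'front_matter'."""
    sections = []
    last_name, last_end = "front_matter", 0
    for match in SECTION_RE.finditer(paper_text):
        body = paper_text[last_end:match.start()].strip()
        if body:
            sections.append((last_name, body))
        last_name, last_end = match.group(1).lower(), match.end()
    tail = paper_text[last_end:].strip()
    if tail:
        sections.append((last_name, tail))
    return sections


def chunk_section(name, body, max_tokens=400, overlap=50):
    """Sliding window over whitespace tokens; a window never crosses a section boundary."""
    tokens = body.split()
    step = max_tokens - overlap  # assumes overlap < max_tokens
    chunks = []
    for position, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + max_tokens]
        chunks.append(Chunk(" ".join(window), name, position))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the section
    return chunks


def chunk_paper(paper_text, max_tokens=400, overlap=50):
    return [
        chunk
        for name, body in split_sections(paper_text)
        for chunk in chunk_section(name, body, max_tokens, overlap)
    ]
```

Each Chunk already carries its section name and position, so the third pass (title, authors, year, generated summary) would only need to attach further fields to the same record.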

Stage 2: Embedding and Vector Storage

Embedding model selection

We use BGE-M3 (BAAI General Embedding) for embedding. We chose it for four reasons: it performs better on scientific/technical text than OpenAI ada-002, it produces both dense and sparse representations from a single model, it runs locally with no per-token API cost, and it accepts inputs of up to 8192 tokens.

ChromaDB for vector storage

ChromaDB is our vector store. Self-hosted deployment keeps research data on-premise. Python-native with clean FastAPI integration. Metadata filtering is essential for scoping retrieval to specific papers, date ranges, or sections — critical for citation-accurate answers.

Stage 3: Hybrid Retrieval (The Part Most Tutorials Skip)

Pure vector search has a known failure mode: it excels at semantic similarity but struggles with exact term matching. BM25 (Best Match 25) is a classical information-retrieval ranking function that excels at exact term matching; it is the default scoring function in Elasticsearch and other Lucene-based search engines.

Hybrid retrieval combines both: vector search finds semantically relevant content, BM25 finds exact term matches, and a fusion algorithm combines the ranked lists.
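For the keyword side, a minimal Okapi BM25 sketch over a list of chunk texts might look like this. It is an illustration under assumptions rather than the implementation used in PaperIntel: the class name, the tokenizer, and the parameter values k1=1.5 and b=0.75 are conventional choices, and the IDF uses the always-positive variant popularized by Lucene.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric tokens; a real system may need a richer tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_freqs = [Counter(tokenize(doc)) for doc in docs]
        self.doc_lens = [sum(freqs.values()) for freqs in self.doc_freqs]
        # Guard against division by zero when the corpus is empty.
        self.avg_len = max(sum(self.doc_lens) / max(len(docs), 1), 1e-9)
        doc_count = Counter()
        for freqs in self.doc_freqs:
            doc_count.update(freqs.keys())  # number of documents containing each term
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_count.items()
        }

    def score(self, query, doc_id):
        freqs, length = self.doc_freqs[doc_id], self.doc_lens[doc_id]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query, top_k=20):
        """Return (doc_id, score) pairs for documents sharing at least one query term."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_freqs))]
        scored = [(i, s) for i, s in scored if s > 0]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A query for an exact identifier such as an optimizer name only returns chunks that literally contain it, which is the behavior dense retrieval cannot guarantee.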

The hybrid retrieval pipeline

Step 1 (parallel retrieval): run vector search (top 20) and BM25 search (top 20) simultaneously
Step 2 (Reciprocal Rank Fusion, RRF): merge the two ranked lists by rank position rather than raw score, because vector-similarity scores and BM25 scores are not on comparable scales
Step 3 (cross-encoder reranking): pass the fused candidates through BGE-Reranker-v2-m3 and keep the top 5–8 chunks, which improves accuracy significantly
Step 4 (context assembly): assemble the top chunks into the context window, including adjacent chunks when a top result is part of a longer argument
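The fusion step is small enough to show in full. In RRF, each chunk receives 1 / (k + rank) from every list it appears in, and the contributions are summed. The sketch below assumes each retriever returns an ordered list of chunk IDs; the function name is illustrative and k=60 is the constant commonly used with RRF, not a value taken from PaperIntel.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Only the rank matters, so raw scores from different retrievers never mix.
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
    return fused[:top_n] if top_n else fused


vector_hits = ["c3", "c1", "c7"]  # hypothetical output of vector search
bm25_hits = ["c1", "c9", "c3"]    # hypothetical output of BM25 search
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# prints ['c1', 'c3', 'c9', 'c7']
```

Chunks found by both retrievers rise to the top, while chunks found by only one are kept as candidates for the reranker rather than discarded.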

Stage 4: Generation with Citation Tracking

The prompt structure

We assign chunk IDs in the prompt and instruct the LLM to cite them for every claim. Chunk IDs are mapped back to full citations (paper, authors, year, section, page range) at display time. The LLM never needs to know the full citation — just the ID.

You are a research assistant. Answer the question using ONLY the provided context.
For every claim, cite the specific chunk ID from the context.
If the context does not contain enough information, say so explicitly.

Context:
[CHUNK_ID_001] {chunk text}
[CHUNK_ID_002] {chunk text}
...

Question: {user_query}
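One way to wire this up is sketched below, assuming each retrieved chunk is a dict with "text" and "citation" keys. That schema and the helper names build_prompt and resolve_citations are illustrative assumptions. The first helper builds the prompt and an ID-to-citation map; the second resolves the IDs the model actually cited and reports any ID that was never in the context.

```python
import re

PROMPT_TEMPLATE = """You are a research assistant. Answer the question using ONLY the provided context.
For every claim, cite the specific chunk ID from the context.
If the context does not contain enough information, say so explicitly.

Context:
{context}

Question: {user_query}"""


def build_prompt(chunks, user_query):
    """Assign sequential chunk IDs and return (prompt, id_to_citation)."""
    id_to_citation = {}
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        chunk_id = f"CHUNK_ID_{i:03d}"
        id_to_citation[chunk_id] = chunk["citation"]
        lines.append(f"[{chunk_id}] {chunk['text']}")
    prompt = PROMPT_TEMPLATE.format(context="\n".join(lines), user_query=user_query)
    return prompt, id_to_citation


def resolve_citations(answer, id_to_citation):
    """Map cited IDs back to full citations; collect IDs that were not in the prompt."""
    cited = list(dict.fromkeys(re.findall(r"\[(CHUNK_ID_\d{3})\]", answer)))
    resolved = {cid: id_to_citation[cid] for cid in cited if cid in id_to_citation}
    unknown = [cid for cid in cited if cid not in id_to_citation]
    return resolved, unknown
```

A non-empty unknown list means the model cited a chunk it was never given, which is a cheap signal worth logging before the answer is displayed.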

Handling uncertainty

Retrieval confidence: if the top-ranked chunk scores below 0.4 on our 0–1 relevance scale, flag the answer as low-confidence with a visible warning
Generation confidence: ask the LLM to rate its own confidence (0–1) and, if the rating is below 0.7, to state what information is missing
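The two checks combine into a simple gate. The sketch below uses the thresholds stated above; the ConfidenceReport type, the function name, and the warning wording are illustrative assumptions, and both scores are assumed to lie between 0 and 1.

```python
from dataclasses import dataclass, field

RETRIEVAL_THRESHOLD = 0.4   # relevance score of the top-ranked chunk
GENERATION_THRESHOLD = 0.7  # confidence the LLM reports for its own answer


@dataclass
class ConfidenceReport:
    low_confidence: bool
    warnings: list = field(default_factory=list)


def assess_confidence(top_chunk_score, self_rated_confidence, missing_info=""):
    """Collect warnings for the UI; any warning marks the answer as low-confidence."""
    warnings = []
    if top_chunk_score < RETRIEVAL_THRESHOLD:
        warnings.append("Retrieved context may not be relevant to this question.")
    if self_rated_confidence < GENERATION_THRESHOLD:
        detail = f" Missing: {missing_info}" if missing_info else ""
        warnings.append("The model reported low confidence in this answer." + detail)
    return ConfidenceReport(low_confidence=bool(warnings), warnings=warnings)
```

Keeping the two signals separate lets the interface tell the user whether the problem was weak retrieval or an incomplete answer.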

Stage 5: Evaluation and Continuous Improvement

RAG systems degrade in ways that are hard to detect without systematic evaluation. We maintain a golden evaluation set of 200 question-answer pairs with ground truth citations, and the full set is re-run automatically on every system update.

Evaluation framework

Retrieval recall@5: what percentage of ground truth chunks appear in the top 5 retrieved results
Answer faithfulness: does the generated answer contain only claims supported by retrieved chunks
Citation accuracy: do cited chunk IDs correspond to actual sources of each claim
Answer completeness: does the answer address all aspects of the question
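The two metrics that reduce to set arithmetic can be sketched directly. The functions below are a hedged illustration: the golden-set schema ("question", "relevant_ids") and the retrieve callable are assumptions, and citation_precision only approximates citation accuracy by comparing cited IDs against ground truth IDs per answer rather than per claim. Faithfulness and completeness need a human or LLM judge and are not covered here.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of ground truth chunks that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def citation_precision(cited_ids, gold_ids):
    """Fraction of cited chunk IDs that are also ground truth sources."""
    cited = set(cited_ids)
    if not cited:
        return 0.0
    return len(cited & set(gold_ids)) / len(cited)


def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over the golden set; `retrieve` maps a question to ranked chunk IDs."""
    scores = [
        recall_at_k(retrieve(example["question"]), example["relevant_ids"], k)
        for example in golden_set
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Running mean_recall_at_k on every system update and failing the build when it drops below the previous value is one way to catch silent retrieval regressions.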

Production Lessons

Chunking decisions are irreversible at scale — re-chunking requires re-embedding everything. Test on representative documents with your actual query distribution before committing.
BM25 is not optional for technical content — pure vector search misses exact term matches for equation numbers, algorithm names, specific parameter values.
The reranker changes everything — 100–200ms latency cost is worth it. Largest single accuracy gain from any architectural change we made.
Citation accuracy is a product requirement, not a nice-to-have — wrong citations undermine user trust catastrophically. Invest in citation infrastructure before answer quality.

The Full Stack

Embedding: BGE-M3 (local deployment)
Vector store: ChromaDB (self-hosted)
BM25: custom implementation over the chunk corpus
Reranker: BGE-Reranker-v2-m3
LLM: Gemini 2.5 Flash (generation), small fast LLM for chunk summarization
API: FastAPI
Monitoring: retrieval latency, answer confidence distribution, user feedback rate

Final Thoughts

Production RAG is an engineering discipline, not just a prompt engineering exercise. The difference between a RAG demo and a RAG product is chunking strategy, hybrid retrieval, reranking, citation tracking, confidence scoring, and systematic evaluation.

Get each layer right before moving to the next. The retrieval quality ceiling is set by your chunking decisions. The answer quality ceiling is set by your retrieval quality. The trust ceiling is set by your citation accuracy.

Build from the ground up. The LLM is the last piece, not the first.

Frequently Asked Questions

What is hybrid retrieval in RAG?

Hybrid retrieval combines vector search (for semantic similarity) and BM25 keyword search (for exact term matching), then merges ranked results using Reciprocal Rank Fusion. The combined candidate set is passed through a cross-encoder reranker to select the top 5–8 most relevant chunks for the LLM prompt.

What is the optimal chunk size for RAG with technical documents?

For academic and technical documents, target 300–500 tokens per chunk with 50-token overlap between adjacent chunks. Use section-aware chunking to keep chunks within their natural document sections rather than splitting across section boundaries.

How do you ensure citation accuracy in a production RAG system?

Assign chunk IDs in the LLM prompt and require the model to cite the specific chunk ID for every claim. Map IDs back to full citations at display time. Set a retrieval confidence threshold (0.4 on a 0–1 scale) below which answers are flagged as low-confidence to reduce hallucination from irrelevant context.