RAG Architecture in Production: Building a Research Intelligence System with ChromaDB and BM25
Retrieval-Augmented Generation (RAG) is one of the most talked-about patterns in AI engineering right now — and one of the most poorly understood in production.
Most RAG tutorials show you the happy path: embed your documents, store them in a vector database, retrieve the top-K most similar chunks, pass them to an LLM, get an answer. It works beautifully in the demo. It fails in surprisingly specific ways in production.
I built PaperIntel — a research intelligence system that processes academic PDFs and answers domain-specific questions with citation-level accuracy. This is what production RAG actually looks like, including the parts the tutorials skip.
What RAG Solves (and What It Doesn't)
RAG is the right pattern when you need an LLM to answer questions about a specific corpus of documents that it wasn't trained on — internal company documents, recent research papers, proprietary data, client records.
It solves the knowledge cutoff problem: the LLM doesn't need to have seen the document during training, because you retrieve the relevant content and include it in the prompt at inference time.
What it doesn't solve: poor document quality, bad chunking decisions, retrieval failures, and hallucinations from low-confidence retrievals. These are production problems, not demo problems. All of them need explicit engineering solutions.
Stage 1: Document Ingestion and Chunking
The quality of your RAG system is determined largely by decisions made before a single query is processed. Chunking strategy is the most impactful — and most underspecified — of these decisions.
Why chunking strategy matters
Vector embeddings capture semantic meaning — but they work best on coherent, self-contained units of text. If you chunk too small (sentence-level), individual chunks lose context. If you chunk too large (full page), the embedding averages over too much content and becomes too general to retrieve precisely.
The chunking approach for academic papers
Academic papers have natural structure: abstract, introduction, methodology, results, discussion, conclusion. We exploit this structure rather than ignoring it.
- First pass — section-aware chunking: detect section boundaries and chunk within sections, not across them
- Second pass — semantic chunking within sections: chunk by coherence rather than fixed token count, targeting 300–500 tokens with 50-token overlap
- Third pass — metadata enrichment: every chunk enriched with paper title, authors, year, section name, position, and a generated summary
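The first and third passes can be sketched in a few lines. This is a simplified illustration, not PaperIntel's implementation: it assumes section headings appear alone on a line, it approximates tokens with whitespace-separated words, and it stands in a fixed sliding window for the coherence-based second pass. The names `Chunk`, `split_sections`, `chunk_section`, and `chunk_paper` are hypothetical.

```python
from dataclasses import dataclass

# Assumed heading vocabulary; real papers need a more robust detector.
SECTION_HEADINGS = ("abstract", "introduction", "methodology",
                    "results", "discussion", "conclusion")


@dataclass
class Chunk:
    text: str
    section: str      # metadata: which section the chunk came from
    position: int     # metadata: index of the chunk within its section
    paper_title: str  # metadata: source paper


def split_sections(paper_text):
    """Return (section_name, body) pairs, splitting on heading-only lines."""
    sections, name, lines = [], "front_matter", []
    for line in paper_text.splitlines():
        stripped = line.strip()
        if stripped.lower() in SECTION_HEADINGS:
            if lines:
                sections.append((name, " ".join(lines)))
            name, lines = stripped.lower(), []
        elif stripped:
            lines.append(stripped)
    if lines:
        sections.append((name, " ".join(lines)))
    return sections


def chunk_section(body, max_tokens=400, overlap=50):
    """Sliding window over whitespace tokens with a fixed overlap."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = body.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the section
    return chunks


def chunk_paper(paper_text, paper_title, max_tokens=400, overlap=50):
    """Chunk within sections only, attaching metadata to every chunk."""
    out = []
    for section, body in split_sections(paper_text):
        for i, text in enumerate(chunk_section(body, max_tokens, overlap)):
            out.append(Chunk(text, section, i, paper_title))
    return out
```

Because `chunk_section` is called once per section, no chunk ever straddles a section boundary, which is the property the first pass is meant to guarantee.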
Stage 2: Embedding and Vector Storage
Embedding model selection
We use BGE-M3 (from the BAAI General Embedding family) for embedding. Selection criteria: better performance on scientific and technical text than OpenAI's text-embedding-ada-002, dense and sparse representations from a single model, local deployment with no per-token API cost, and an 8192-token context window.
ChromaDB for vector storage
ChromaDB is our vector store. Self-hosted deployment keeps research data on-premises, and its Python-native client integrates cleanly with FastAPI. Metadata filtering is essential for scoping retrieval to specific papers, date ranges, or sections, which is critical for citation-accurate answers.
Stage 3: Hybrid Retrieval (The Part Most Tutorials Skip)
Pure vector search has a known failure mode: it excels at semantic similarity but struggles with exact term matching. BM25 (Best Match 25) is a classical information-retrieval ranking function that excels at exact term matching; it is the default scoring function in Lucene-based engines such as Elasticsearch.
Hybrid retrieval combines both: vector search finds semantically relevant content, BM25 finds exact term matches, and a fusion algorithm combines the ranked lists.
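To make the keyword side concrete, here is a minimal sketch of Okapi BM25 over a small in-memory corpus. It is an illustration under assumptions, not the custom implementation used in PaperIntel: the `BM25Index` class, the regex tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(t) for t in tokens]   # term frequencies per doc
        self.doc_len = [len(t) for t in tokens]
        self.avg_len = sum(self.doc_len) / max(len(docs), 1)
        df = Counter()                                # document frequencies
        for tf in self.doc_tf:
            df.update(tf.keys())
        n = len(docs)
        # Lucene-style IDF, which stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, length = self.doc_tf[i], self.doc_len[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Length normalization: long documents are penalized via b.
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=20):
        """Return (doc_index, score) pairs for matching docs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tf))]
        scored = [(i, s) for i, s in scored if s > 0]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A query such as `BM25Index(chunks).search("Adam beta2")` only returns chunks that literally contain one of those terms, which is exactly the behavior vector search cannot guarantee.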
The hybrid retrieval pipeline
- Step 1 (parallel retrieval): run vector search (top 20) and BM25 search (top 20) simultaneously
- Step 2 (Reciprocal Rank Fusion, RRF): merge the two ranked lists by rank position rather than raw score, since vector similarity scores and BM25 scores are not on comparable scales
- Step 3 (cross-encoder reranking): pass the fused candidates through BGE-Reranker-v2-m3 and keep the top 5–8 chunks, which are significantly more accurate than the fused ranking alone
- Step 4 (context assembly): assemble the top chunks into the context window, including adjacent chunks when a top result is part of a longer argument
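Steps 2 and 4 of the pipeline above are small enough to sketch directly. Treat the following as an illustration under assumptions: the constant `k=60` is the value commonly used for RRF rather than one stated here, the function names are hypothetical, and `expand_with_neighbors` always adds neighbors instead of deciding whether a chunk belongs to a longer argument. The reranking step is omitted because it requires loading a model.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs using ranks only.

    Each list contributes 1 / (k + rank) per chunk, so raw scores from
    different retrievers never need to be compared directly.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def expand_with_neighbors(top_ids, doc_order, window=1):
    """Add chunks adjacent to each top result, preserving first-seen order.

    doc_order is the list of chunk IDs in their original document order.
    """
    position = {cid: i for i, cid in enumerate(doc_order)}
    seen, out = set(), []
    for cid in top_ids:
        i = position[cid]
        lo, hi = max(0, i - window), min(len(doc_order), i + window + 1)
        for neighbor in doc_order[lo:hi]:
            if neighbor not in seen:
                seen.add(neighbor)
                out.append(neighbor)
    return out


vector_hits = ["c3", "c1", "c7"]   # hypothetical vector search ranking
bm25_hits = ["c1", "c9", "c3"]     # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Chunks that appear in both lists (`c1` and `c3` here) rise above chunks found by only one retriever, which is the behavior you want before handing candidates to the reranker.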
Stage 4: Generation with Citation Tracking
The prompt structure
We assign chunk IDs in the prompt and instruct the LLM to cite them for every claim. Chunk IDs are mapped back to full citations (paper, authors, year, section, page range) at display time. The LLM never needs to know the full citation — just the ID.
You are a research assistant. Answer the question using ONLY the provided context.
For every claim, cite the specific chunk ID from the context.
If the context does not contain enough information, say so explicitly.
Context:
[CHUNK_ID_001] {chunk text}
[CHUNK_ID_002] {chunk text}
...
Question: {user_query}
Handling uncertainty
- Retrieval confidence: if the top-ranked chunk scores below 0.4 on our 0–1 relevance scale, flag the answer as low-confidence with a visible warning
- Generation confidence: ask the LLM to rate its own confidence (0–1) and, if the rating is below 0.7, to state what information is missing
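The prompt assembly, the ID-to-citation mapping, and the two confidence checks in this stage can be sketched together. This is a minimal illustration, not the production code: the function names, the flag strings, and the sample citation table are hypothetical, and the sketch assumes the model cites IDs in the same bracketed form used in the prompt.

```python
import re

RETRIEVAL_THRESHOLD = 0.4   # top-chunk relevance, from the text above
GENERATION_THRESHOLD = 0.7  # self-rated confidence, from the text above


def build_prompt(chunks, question):
    """chunks is a list of (chunk_id, text) pairs in display order."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return (
        "You are a research assistant. Answer the question using ONLY "
        "the provided context.\n"
        "For every claim, cite the specific chunk ID from the context.\n"
        "If the context does not contain enough information, "
        "say so explicitly.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


def resolve_citations(answer, citation_table):
    """Map cited chunk IDs to full citations; report IDs we never supplied."""
    cited = re.findall(r"\[(CHUNK_ID_\d+)\]", answer)
    resolved, unknown = [], []
    for cid in dict.fromkeys(cited):  # de-duplicate, keep first-seen order
        if cid in citation_table:
            resolved.append(citation_table[cid])
        else:
            unknown.append(cid)       # likely a hallucinated citation
    return resolved, unknown


def confidence_flags(top_retrieval_score, self_rated_confidence):
    """Return the warnings to show alongside the answer."""
    flags = []
    if top_retrieval_score < RETRIEVAL_THRESHOLD:
        flags.append("low_retrieval_confidence")
    if self_rated_confidence < GENERATION_THRESHOLD:
        flags.append("low_generation_confidence")
    return flags


# Hypothetical citation table built at ingestion time.
table = {"CHUNK_ID_001": "Paper A, Methods, pp. 3-4"}
answer = "X holds [CHUNK_ID_001]. Y also holds [CHUNK_ID_007] [CHUNK_ID_001]."
resolved, unknown = resolve_citations(answer, table)
```

Keeping the citation table outside the prompt means an ID the model invents, such as `CHUNK_ID_007` here, shows up in `unknown` and can be caught before display.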
Stage 5: Evaluation and Continuous Improvement
RAG systems degrade in ways that are hard to detect without systematic evaluation. We maintain a golden evaluation set of 200 question-answer pairs with ground-truth citations, which runs automatically on every system update.
Evaluation framework
- Retrieval recall@5: what percentage of ground truth chunks appear in the top 5 retrieved results
- Answer faithfulness: does the generated answer contain only claims supported by retrieved chunks
- Citation accuracy: do cited chunk IDs correspond to actual sources of each claim
- Answer completeness: does the answer address all aspects of the question
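Two of these metrics reduce to set arithmetic and can be sketched without a model in the loop. The functions below are illustrative assumptions: `citation_precision` is a coarse proxy that compares cited IDs against the ground-truth citations for the whole answer rather than claim by claim, and the golden-set record layout (`question`, `gold`) is hypothetical. Faithfulness and completeness usually need a human or LLM judge and are not covered.

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of ground-truth chunks that appear in the top-k results."""
    if not gold_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top) / len(gold_ids)


def citation_precision(cited_ids, gold_ids):
    """Fraction of cited chunk IDs that are ground-truth sources."""
    if not cited_ids:
        return 0.0
    gold = set(gold_ids)
    return sum(1 for c in cited_ids if c in gold) / len(cited_ids)


def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over the golden set.

    retrieve is any callable mapping a question to a ranked list of IDs.
    """
    if not golden_set:
        return 0.0
    total = sum(recall_at_k(retrieve(item["question"]), item["gold"], k)
                for item in golden_set)
    return total / len(golden_set)
```

Running `mean_recall_at_k` on every system update gives a single number to track over time, so a regression in chunking or retrieval shows up before users notice it.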
Production Lessons
- Chunking decisions are irreversible at scale — re-chunking requires re-embedding everything. Test on representative documents with your actual query distribution before committing.
- BM25 is not optional for technical content — pure vector search misses exact term matches for equation numbers, algorithm names, specific parameter values.
- The reranker changes everything — 100–200ms latency cost is worth it. Largest single accuracy gain from any architectural change we made.
- Citation accuracy is a product requirement, not a nice-to-have — wrong citations undermine user trust catastrophically. Invest in citation infrastructure before answer quality.
The Full Stack
- Embedding: BGE-M3 (local deployment)
- Vector store: ChromaDB (self-hosted)
- BM25: custom implementation over the chunk corpus
- Reranker: BGE-Reranker-v2-m3
- LLM: Gemini 2.5 Flash (generation), small fast LLM for chunk summarization
- API: FastAPI
- Monitoring: retrieval latency, answer confidence distribution, user feedback rate
Final Thoughts
Production RAG is an engineering discipline, not just a prompt engineering exercise. The difference between a RAG demo and a RAG product is chunking strategy, hybrid retrieval, reranking, citation tracking, confidence scoring, and systematic evaluation.
Get each layer right before moving to the next. The retrieval quality ceiling is set by your chunking decisions. The answer quality ceiling is set by your retrieval quality. The trust ceiling is set by your citation accuracy.
Build from the ground up. The LLM is the last piece, not the first.
Frequently Asked Questions
What is hybrid retrieval in RAG?
Hybrid retrieval combines vector search (for semantic similarity) and BM25 keyword search (for exact term matching), then merges ranked results using Reciprocal Rank Fusion. The combined candidate set is passed through a cross-encoder reranker to select the top 5–8 most relevant chunks for the LLM prompt.
What is the optimal chunk size for RAG with technical documents?
For academic and technical documents, target 300–500 tokens per chunk with 50-token overlap between adjacent chunks. Use section-aware chunking to keep chunks within their natural document sections rather than splitting across section boundaries.
How do you ensure citation accuracy in a production RAG system?
Assign chunk IDs in the LLM prompt and require the model to cite the specific chunk ID for every claim. Map IDs back to full citations at display time. Set a retrieval confidence threshold (0.4 on a 0–1 scale) below which answers are flagged as low-confidence to reduce hallucination from irrelevant context.