Building a Production RAG System


Retrieval-augmented generation (RAG) sounds simple on paper: embed your documents, store them in a vector database, retrieve the most relevant chunks at query time, and pass them to an LLM as context. In practice, getting it right across multiple live enterprise data sources — and shipping it as a production system used daily by a cross-functional team — taught me more than any tutorial did.

This post walks through the architecture decisions, mistakes, and lessons from building that system end-to-end.

The Problem

Internal knowledge in most engineering organizations is scattered. Product documentation lives in a wiki. Customer case history lives in a CRM. Bug and feature context lives in a ticketing system. Team discussions and decisions live in a chat platform. When someone needs to answer a technical support question or understand how a feature was designed, they're doing archaeology across four different UIs — assuming they even know where to look.

The goal was a single interface where you could ask a natural-language question and get an answer grounded in actual company documentation, not a hallucinated summary. The system needed to span all four source types and serve a team across engineering, support, and management roles.

Architecture Overview

The system has four main layers:

Ingestion pipeline — watches for new and updated content across all source systems via event-driven triggers, chunks and embeds documents, and writes them to the vector store.
Vector store + retrieval layer — handles similarity search over embedded chunks, with metadata filtering for source type, recency, and access level.
LLM layer — receives the query plus retrieved context, generates an answer, and handles citation tracking back to source documents.
API + frontend — a backend API serving a web frontend, with SSO authentication and role-based access control.

The Ingestion Pipeline

The hardest part wasn't retrieval — it was keeping the index current. A RAG system that answers questions based on stale data is worse than no RAG system, because it answers confidently with wrong information.

Each source system gets its own connector that listens for change events rather than polling on a schedule. Most enterprise platforms support some form of webhooks or change data capture. Using event-driven triggers means the index reflects the current state of your documents within seconds of a change, not hours.
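As a rough illustration of what a connector has to do with incoming change events, here is a minimal in-memory sketch. The event shape (`doc_id`, `ts`, `type`, `content`) and the class name `ChangeEventIndex` are hypothetical, not the format of any particular platform. The stale-event check is an added assumption: webhook deliveries can arrive duplicated or out of order, so the sketch ignores any event older than what it has already applied.

```python
from typing import Optional


class ChangeEventIndex:
    """Toy stand-in for the index a connector writes to (hypothetical)."""

    def __init__(self) -> None:
        # doc_id -> (timestamp, content); content is None once deleted
        self._state: dict[str, tuple[int, Optional[str]]] = {}

    def apply(self, event: dict) -> bool:
        """Apply one change event. Returns True if the index changed."""
        doc_id, ts = event["doc_id"], event["ts"]
        seen = self._state.get(doc_id)
        if seen is not None and seen[0] >= ts:
            return False  # duplicate or out-of-order delivery: ignore
        content = None if event["type"] == "deleted" else event["content"]
        self._state[doc_id] = (ts, content)
        return True

    def get(self, doc_id: str) -> Optional[str]:
        """Return the current content, or None if unknown or deleted."""
        entry = self._state.get(doc_id)
        return entry[1] if entry else None
```

In a real pipeline the `apply` step would trigger re-chunking and re-embedding instead of storing raw content, but the ordering logic stays the same.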

Each connector normalizes the raw payload into a common document schema: source type, document ID, URL, timestamp, content, and access metadata. This normalization step is critical for keeping the retrieval layer source-agnostic — the retrieval code doesn't need to know whether a chunk came from a wiki page or a support case.
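A minimal sketch of that common schema, using the six fields listed above. The field names, the `normalize_wiki_event` helper, and the shape of the incoming wiki payload (`page_id`, `modified_ts`, `body`, `viewer_groups`) are all illustrative assumptions. Each real connector would map its own platform's payload in the same way.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    """Common schema every connector emits (field names are illustrative)."""
    source_type: str   # e.g. "wiki", "crm", "ticket", "chat"
    doc_id: str
    url: str
    updated_at: datetime
    content: str
    allowed_roles: frozenset[str] = field(default_factory=frozenset)


def normalize_wiki_event(payload: dict) -> Document:
    """Map a hypothetical wiki webhook payload onto the common schema."""
    return Document(
        source_type="wiki",
        # Prefix the ID with the source so IDs stay unique across systems.
        doc_id=f"wiki:{payload['page_id']}",
        url=payload["url"],
        updated_at=datetime.fromtimestamp(payload["modified_ts"], tz=timezone.utc),
        content=payload["body"].strip(),
        allowed_roles=frozenset(payload.get("viewer_groups", [])),
    )
```

Everything downstream of the connector only ever sees `Document`, which is what keeps the retrieval code source-agnostic.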

Chunking Strategy

How you split documents matters more than most people expect. Chunk size can fail in two directions: chunks that are too large waste context window and dilute relevance scores, while chunks that are too small lose the surrounding context that makes them meaningful.

The right chunking strategy depends on the content type. Some principles that held up well:

Structured documents (wikis, documentation) chunk well at natural section boundaries — headings, subheadings — with a token-length soft limit and a small overlap between adjacent chunks to avoid cutting off context at the boundary.
Short, dense records (tickets, issues) are often best kept intact as single documents. They're short enough that splitting them loses the relationship between the problem description and the resolution.
Conversational content (support threads, chat messages) benefits from temporal or thread-based grouping rather than arbitrary token splits. A single message has almost no retrievable signal on its own; a full exchange does.

The underlying rule: chunk to preserve semantic coherence, not just to hit a token budget.
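Two of the principles above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production chunker: headings are assumed to be Markdown-style `#` lines, tokens are approximated by whitespace-separated words, and the message fields (`thread_id`, `ts`, `author`, `text`) are hypothetical.

```python
import re


def chunk_by_sections(text: str, max_tokens: int = 200, overlap: int = 20) -> list[str]:
    """Split at heading lines, then window oversized sections with overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    # Zero-width split before each line that starts with a Markdown heading.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks: list[str] = []
    for section in sections:
        words = section.split()
        if len(words) <= max_tokens:
            chunks.append(section)  # small enough: keep the section intact
            continue
        step = max_tokens - overlap
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break  # the last window already reached the end
    return chunks


def group_by_thread(messages: list[dict]) -> list[str]:
    """Merge chat messages into one retrievable document per thread."""
    threads: dict[str, list[str]] = {}
    for msg in sorted(messages, key=lambda m: m["ts"]):
        threads.setdefault(msg["thread_id"], []).append(f"{msg['author']}: {msg['text']}")
    return ["\n".join(lines) for lines in threads.values()]
```

Short records such as tickets need no function at all under this scheme: they pass through as a single chunk.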

The Retrieval Layer

Semantic search alone isn't enough in a multi-source environment. A query about a specific customer case shouldn't surface random chat messages that happen to share vocabulary. The retrieval layer combines embedding similarity with metadata filtering:

Queries can specify source type constraints when the user knows where the answer should come from.
Recency filtering applies automatically when the query contains temporal language ("recent", "last week", "current").
User roles determine which documents are eligible — access control at retrieval time, not just at the API boundary.
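The three filters above might look roughly like this when expressed as a predicate over chunk metadata. The names `ChunkMeta`, `recency_cutoff`, and `is_eligible` are hypothetical, and the 30-day window is an arbitrary assumption. Most vector stores let you push an equivalent filter into the query itself, which is preferable to filtering after the fact.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The temporal phrases quoted above; whole-word match so that
# "concurrent" does not trigger the "current" hint.
TEMPORAL_HINTS = re.compile(r"\b(?:recent|last week|current)\b", re.IGNORECASE)


@dataclass(frozen=True)
class ChunkMeta:
    source_type: str
    updated_at: datetime
    allowed_roles: frozenset


def recency_cutoff(query: str, now: datetime, window_days: int = 30) -> Optional[datetime]:
    """Return a cutoff timestamp if the query uses temporal language."""
    if TEMPORAL_HINTS.search(query):
        return now - timedelta(days=window_days)
    return None


def is_eligible(meta: ChunkMeta, user_roles, source_types=None, cutoff=None) -> bool:
    """Access control first, then optional source and recency constraints."""
    if not (meta.allowed_roles & user_roles):
        return False
    if source_types is not None and meta.source_type not in source_types:
        return False
    if cutoff is not None and meta.updated_at < cutoff:
        return False
    return True
```

Checking roles before anything else means a chunk the user cannot see is never scored, ranked, or passed to the model.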

One improvement worth the effort: a two-stage retrieval approach. Retrieve a larger candidate set (say, top 8–10 chunks by similarity), then apply a re-ranking pass to select the 4–5 most contextually relevant before passing them to the LLM. This catches cases where cosine similarity returns superficially related but contextually irrelevant chunks — and it meaningfully improves answer quality.
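A minimal sketch of the two-stage idea, in pure Python so it runs without dependencies. The function names are hypothetical, and `term_overlap` is only a stand-in scorer so the example is self-contained. A real re-ranker would typically be a cross-encoder or similar model that reads the query and the passage together.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def term_overlap(query: str, passage: str) -> float:
    """Placeholder re-rank score: fraction of query terms found in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def two_stage_retrieve(
    query_text: str,
    query_vec: Sequence[float],
    index: list[tuple[str, Sequence[float]]],
    rerank_score: Callable[[str, str], float] = term_overlap,
    candidates: int = 10,
    final: int = 5,
) -> list[str]:
    """Stage 1: top candidates by cosine. Stage 2: re-rank and keep the best."""
    stage1 = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    stage1 = stage1[:candidates]
    stage2 = sorted(stage1, key=lambda item: rerank_score(query_text, item[0]), reverse=True)
    return [text for text, _ in stage2[:final]]
```

The defaults mirror the numbers above: retrieve about ten candidates, keep about five.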

The LLM Layer

The orchestration layer handles building the retrieval chain, assembling the prompt, managing conversation history for multi-turn queries, and routing tool calls. The LLM handles generation.

The system prompt is source-aware. It tells the model what kinds of documents may appear in context, instructs it to prefer specificity over generality, and explicitly tells it to say "I don't have enough context" rather than guess. Getting the model to admit uncertainty was harder than getting it to answer: without explicit instruction, it produces plausible-sounding answers that the retrieved context does not support.

Citations are non-negotiable in a system used for customer support. Every answer includes source links with the specific chunk that grounded each claim. This lets users verify answers and builds trust in the system over time. A RAG system without citations is just a confident chatbot.
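One way to wire the source-aware prompt and the citations together is to number each retrieved passage in the prompt and ask the model to cite by number. The sketch below assumes that convention. The prompt wording, the `[n]` marker format, and the names `Passage`, `build_prompt`, and `resolve_citations` are illustrative, not the exact ones used in the system described here.

```python
import re
from dataclasses import dataclass

SYSTEM_PROMPT = (
    "Answer using only the numbered context passages below. "
    "Passages may come from wiki pages, CRM cases, tickets, or chat threads. "
    "Prefer specific details over general statements and cite passages as [n]. "
    "If the passages do not answer the question, reply: "
    "\"I don't have enough context.\""
)


@dataclass(frozen=True)
class Passage:
    text: str
    source_type: str
    url: str


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble the system prompt, numbered context, and the user question."""
    context = "\n".join(
        f"[{i}] ({p.source_type}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"


def resolve_citations(answer: str, passages: list[Passage]) -> list[str]:
    """Map [n] markers in the answer back to source URLs, in order, deduplicated."""
    urls: list[str] = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        idx = int(marker) - 1
        # Ignore markers that point outside the supplied context.
        if 0 <= idx < len(passages) and passages[idx].url not in urls:
            urls.append(passages[idx].url)
    return urls
```

A marker that resolves to no passage is worth logging, since it usually means the model cited something it was never given.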

Deployment Considerations

A few things worth thinking through before you deploy:

Container orchestration — for a team-internal tool, the operational overhead of Kubernetes often isn't justified. Simpler orchestration (Docker Swarm, Compose with a reverse proxy) is easier to operate and adequate for most internal workloads.
Conversation persistence — store conversation history separately from the vector index. They have different access patterns and different retention requirements.
API surface — expose the RAG system as a clean API from day one, even if only the frontend uses it initially. It makes it easy to connect other tools later, including AI coding assistants that can query your knowledge base directly.

What I'd Do Differently

A few things I'd change with hindsight:

Build evaluation infrastructure first. It's hard to know if retrieval is improving without a labeled test set of question-answer pairs grounded in your actual documents. I built this late and spent too long guessing about chunk quality. Even 50–100 manually labeled examples give you a baseline to measure against.
Instrument everything from day one. Query latency, retrieval hit rate, user feedback (thumbs up/down), and model refusals are all signals. The earlier you collect them, the earlier you can tune.
Don't underestimate access control complexity. In theory, role-based filtering is straightforward. In practice, different source systems have different permission models, and mapping them to a unified access control layer takes longer than the retrieval logic itself. Plan for it upfront.
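To make the evaluation point concrete, here is a minimal sketch of a retrieval baseline over a small labeled set. It assumes each label pairs a question with the set of document IDs that should be retrieved, and that `retrieve` is whatever function returns a ranked list of IDs. Both are hypothetical interfaces, and hit rate and mean reciprocal rank are just two common choices of metric.

```python
from typing import Callable

Labeled = list[tuple[str, set[str]]]        # (question, relevant doc IDs)
Retriever = Callable[[str], list[str]]      # question -> ranked doc IDs


def hit_rate_at_k(labeled: Labeled, retrieve: Retriever, k: int = 5) -> float:
    """Fraction of questions with at least one relevant doc in the top k."""
    if not labeled:
        raise ValueError("labeled set is empty")
    hits = sum(1 for question, relevant in labeled if relevant & set(retrieve(question)[:k]))
    return hits / len(labeled)


def mean_reciprocal_rank(labeled: Labeled, retrieve: Retriever) -> float:
    """Average of 1/rank of the first relevant doc (0 when none is retrieved)."""
    if not labeled:
        raise ValueError("labeled set is empty")
    total = 0.0
    for question, relevant in labeled:
        for rank, doc_id in enumerate(retrieve(question), start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(labeled)
```

Running both metrics before and after a chunking or re-ranking change turns "this feels better" into a number you can compare.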

Closing Thoughts

The hardest part of building a production RAG system isn't the ML. It's the data engineering — keeping your documents fresh, chunked appropriately, and correctly permissioned across multiple live systems. Get that right and the retrieval and generation layers are comparatively straightforward.

If you're building something similar, invest early in observability and evaluation. A RAG system without metrics is a black box, and black boxes erode user trust fast. The goal isn't an impressive demo; it's a system people rely on daily because it's consistently right.