
Building RAG That Actually Works


Most teams building RAG hit the same wall: the system works in demos but falls apart on real data. Queries return confident answers sourced from the wrong documents. Users lose trust. The project stalls.

We've built RAG systems across aviation operations, legal compliance, and financial services. Every domain has its quirks, but the failure modes are universal. This is what we've learned about building retrieval that actually holds up.

The retrieval problem nobody talks about

The default RAG tutorial goes like this: chunk your documents, embed them, do a cosine similarity search, stuff the top-k results into a prompt. Ship it.

This works for toy datasets. It breaks on real ones. Here's why:

Semantic similarity ≠ relevance. A query like "What's our policy on overtime pay?" will match documents about overtime — meeting notes mentioning it, emails discussing it, old drafts of the policy. The actual policy document might rank fifth because its language is formal and the query is casual.

Chunking destroys context. A 500-token chunk from the middle of a procedures manual has no idea what section it belongs to. The embedding captures the local semantics but loses the structural meaning entirely.

Top-k is a blunt instrument. Sometimes you need one document. Sometimes you need twelve. A fixed k means you're either missing relevant context or drowning the LLM in noise.

What actually works

After building several production RAG systems, here's the architecture that consistently delivers:

1. Hybrid retrieval

Don't choose between keyword search and vector search — use both. BM25 catches exact terminology that embeddings miss. Vectors catch semantic matches that keywords miss. Reciprocal Rank Fusion (RRF) combines the two result sets:

def reciprocal_rank_fusion(
    results: list[list[str]],
    k: int = 60
) -> list[tuple[str, float]]:
    """Combine multiple ranked lists using RRF."""
    scores: dict[str, float] = {}
    for result_list in results:
        for rank, doc_id in enumerate(result_list):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

This is simple, robust, and consistently outperforms either method alone.

2. Hierarchical chunking

Instead of flat chunks, preserve document structure:

  • Level 0: Full document metadata (title, source, date, category)
  • Level 1: Section-level chunks with headers intact
  • Level 2: Paragraph-level chunks with parent section context prepended

When a paragraph chunk matches, you can pull in its parent section for context. The LLM gets both the specific match and the surrounding structure.

interface Chunk {
  id: string;
  content: string;
  level: 0 | 1 | 2;
  parentId: string | null;
  metadata: {
    source: string;
    section: string;
    position: number;
  };
}

3. Query transformation

Real user queries are messy. Before hitting retrieval, transform them:

  • Decomposition: Break complex questions into sub-queries. "How does our overtime policy compare to the legal minimum?" becomes two retrievals: one for your policy, one for legal requirements.
  • Hypothetical Document Embedding (HyDE): Generate what an ideal answer would look like, then embed that for retrieval. This bridges the gap between question-style queries and document-style content.
  • Expansion: Add domain-specific synonyms. In aviation, "MEL" and "Minimum Equipment List" should retrieve the same documents.
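The expansion step can be sketched with a hand-curated synonym table (the table and expand_query are illustrative; a real system would maintain the mappings with domain experts):

```python
import re

# Domain synonym table: lowercase trigger term -> terms to append.
SYNONYMS: dict[str, list[str]] = {
    "mel": ["Minimum Equipment List"],
    "minimum equipment list": ["MEL"],
}

def expand_query(query: str) -> str:
    """Append known synonyms so abbreviation and full term both retrieve."""
    lowered = query.lower()
    extra = []
    for term, alts in SYNONYMS.items():
        # Word-boundary match avoids firing on substrings like "melting"
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            extra.extend(a for a in alts if a.lower() not in lowered)
    return f"{query} ({' '.join(extra)})" if extra else query
```

The expanded string feeds both the keyword and vector legs of retrieval, so a query phrased with the abbreviation still reaches documents that only use the full term.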

4. Reranking

After initial retrieval, rerank with a cross-encoder. The retrieval step casts a wide net (top 20-50); the reranker narrows it to the actually relevant results (top 3-5).

Cross-encoders are more accurate than bi-encoders for relevance scoring because they see the query and document together, but they're too slow for first-pass retrieval. Using them as a second stage gives you the best of both worlds.
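The two-stage shape looks like this, with the cross-encoder stubbed behind a score_pair callable (both function names are illustrative; swap in a real scorer such as Cohere Rerank or a sentence-transformers cross-encoder for score_pair):

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    first_pass: Callable[[str, int], list[str]],  # wide-net retriever, e.g. hybrid RRF
    score_pair: Callable[[str, str], float],      # cross-encoder relevance score
    wide_k: int = 30,
    final_k: int = 5,
) -> list[str]:
    """Cast a wide net, then keep only the cross-encoder's top results."""
    candidates = first_pass(query, wide_k)
    scored = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return scored[:final_k]
```

Keeping the reranker behind a callable makes it easy to swap scoring models without touching the rest of the pipeline.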

The context window trap

"Just stuff everything into a 128k context window" sounds appealing. Don't do it.

Large context windows have a well-documented "lost in the middle" problem: models pay more attention to the beginning and end of their context, degrading performance on information buried in the middle. More practically, cost scales linearly with context size, and latency grows with it.

Good retrieval means you send less to the model, not more. Our production systems rarely send more than 3,000 tokens of retrieved context. The precision of retrieval matters more than the volume.
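Budget-aware packing can be sketched as: take chunks in reranked order and stop at the first one that doesn't fit (pack_context is illustrative, and the whitespace token count is a crude proxy for the model's real tokenizer):

```python
def pack_context(chunks: list[str], budget_tokens: int = 3000) -> list[str]:
    """Keep the highest-ranked chunks that fit within the token budget."""
    packed, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())  # crude proxy; use a real tokenizer in production
        if used + cost > budget_tokens:
            break  # chunks arrive ranked, so stop at the first that doesn't fit
        packed.append(chunk)
        used += cost
    return packed
```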

Evaluation is everything

The biggest mistake teams make is not measuring retrieval quality separately from generation quality. Your RAG system has two failure modes:

  1. Retrieval failure: The right documents weren't retrieved
  2. Generation failure: The right documents were retrieved but the LLM generated a bad answer

If you only evaluate end-to-end, you can't tell which component failed. Build a retrieval evaluation set — 50-100 queries with ground-truth relevant documents — and measure recall@k before you ever look at generated answers.

def recall_at_k(
    retrieved: list[str],
    relevant: set[str],
    k: int
) -> float:
    """Proportion of relevant docs found in top-k results."""
    if not relevant:
        raise ValueError("relevant must be non-empty")
    retrieved_at_k = set(retrieved[:k])
    return len(retrieved_at_k & relevant) / len(relevant)
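Averaging that over the evaluation set gives a single number to track per deployment (mean_recall_at_k and the eval_set shape are illustrative):

```python
def mean_recall_at_k(
    eval_set: list[tuple[list[str], set[str]]],
    k: int,
) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    recalls = [
        len(set(retrieved[:k]) & relevant) / len(relevant)
        for retrieved, relevant in eval_set
    ]
    return sum(recalls) / len(recalls)
```

A drop in this number after a chunking or embedding change tells you the retrieval stage regressed, before any generated answer is inspected.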

What we'd build today

If we were starting a new RAG system tomorrow:

  • Embeddings: text-embedding-3-large with Matryoshka dimensionality reduction to 512 dims
  • Vector store: pgvector (keeps everything in Postgres, one less service)
  • Keyword search: pg_trgm + tsvector in the same Postgres instance
  • Reranker: Cohere Rerank or a fine-tuned cross-encoder
  • Chunking: Hierarchical with section-aware splitting
  • Evaluation: Automated retrieval recall checks on every deployment

The stack is deliberately boring. RAG systems fail from bad retrieval design, not from insufficient infrastructure complexity.

The bottom line

RAG is a retrieval engineering problem disguised as an AI problem. The language model is the easy part. The hard part is getting the right 3,000 tokens of context in front of it.

If your RAG system isn't working, don't reach for a bigger model or a longer context window. Fix your retrieval.

Want to build something like this?

We design and build production AI systems — RAG, agents, automation. Book a call and tell us what you're working on.
