Development & Tech · June 15, 2026 · 13 min read

RAG Explained: Building AI Systems That Actually Know Your Data

LLMs Without Your Data Are Confident Liars

Ask any LLM a question about your company's internal docs, your codebase, or last quarter's pricing decisions. It will answer. Confidently. Wrongly. That's the core problem RAG solves — not by making the model smarter, but by giving it actual facts to work with before it opens its mouth.

Retrieval-Augmented Generation was introduced by Lewis et al. in 2020, in work led by Facebook AI Research (now Meta AI), and the idea is straightforward: instead of relying on what the model memorized during training, you fetch relevant documents at query time and feed them into the prompt. The model generates its answer grounded in those documents.

The basic pipeline looks simple — chunk your documents, embed them, store in a vector database, retrieve relevant chunks, pass to the LLM. Most tutorials stop there. Then you deploy it and watch your system return irrelevant garbage half the time.

I've been through this loop building the AI writing assistant in SimpleAIFolio. The naive pipeline works for demos. Production requires fixing retrieval, chunking, and evaluation — the stuff most guides gloss over. That's what this post covers.

The Naive RAG Pipeline (And Why It Fails)

Here's what every tutorial teaches:

  1. Split documents into chunks
  2. Embed chunks with an embedding model
  3. Store embeddings in a vector database
  4. At query time, embed the query, find nearest neighbors, return top-k chunks
  5. Feed chunks + query to an LLM
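The five steps above can be sketched end to end in plain Python. This is a toy under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, a Python list stands in for the vector database, the sample document is invented, and the final prompt is built but not sent to any LLM.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Step 1: naive fixed-size split on whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Step 2: toy stand-in for an embedding model (bag of lowercase words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Step 4: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 5: assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Invented sample document, three 8-word-or-shorter chunks after splitting.
document = (
    "Refunds are issued within 14 days of purchase. "
    "Shipping takes three to five business days. "
    "Support is available by email on weekdays."
)
index = [(c, embed(c)) for c in chunk(document, size=8)]  # step 3: the "vector store"
top = retrieve("How long do refunds take?", index, k=1)
prompt = build_prompt("How long do refunds take?", top)
```

Note that the second chunk here ends mid-thought ("... business days. Support"), which is exactly the kind of fixed-size damage discussed next.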

This works for a 5-page demo. It falls apart at scale because:

  • Chunking is naive — fixed-size splits break sentences, separate related information, and lose context
  • Pure semantic search misses keyword matches — if someone searches for "error code E-4027", semantic similarity won't help
  • No reranking — the top-k results from vector similarity include irrelevant chunks that the LLM then uses to hallucinate
  • No evaluation — you have no idea if your pipeline is actually working until users complain

Anecdotal reports from production RAG deployments suggest retrieval errors are the dominant failure mode — even a perfect generation model produces wrong answers given wrong chunks. One analysis flagged roughly 40% of retrievals as problematic in naive setups (unverified — treat this as directional, not canonical). But the direction matches my experience: retrieval quality is the bottleneck, not the LLM.

Chunking: The Make-or-Break Step Everyone Gets Wrong

Chunking is where most RAG systems win or lose, and it's where most developers spend the least time. Here's what actually matters:

Chunk Size Is a Lever, Not a Setting

Too small (100-200 tokens) and you lose context — the chunk says "revenue grew 15%" without specifying which quarter or which product line. Too large (1000+ tokens) and you dilute the signal — the relevant fact is buried in a wall of text, and the embedding averages away the meaning.

For most document types, 300-500 tokens with 50-100 token overlap is a solid starting point. But you need to tune this based on your content:

  • API documentation: smaller chunks (200-300 tokens) — each endpoint or method should be its own chunk
  • Legal/contract text: larger chunks (500-800 tokens) — context around clauses matters
  • FAQ/knowledge base: chunk by question-answer pair, not by size
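To make the size and overlap levers concrete, here is a minimal sliding-window chunker. It is a sketch, assuming the input is already tokenized; the `t0`...`t9` strings are placeholder tokens, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]  # placeholder tokens
chunks = chunk_with_overlap(tokens, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each window starts on the last token of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.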

Semantic Chunking Beats Fixed-Size

Instead of splitting every N tokens, split where meaning changes. This means breaking at paragraph boundaries, section headers, or topic shifts:

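from langchain_text_splitters import RecursiveCharacterTextSplitter

# Measure chunk_size and chunk_overlap in tokens (requires tiktoken).
# With length_function=len they would be counted in characters instead,
# and 400 characters is only about 100 tokens of English text.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # encoding used by OpenAI's text-embedding-3 models
    chunk_size=400,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " ", ""],
)

# document_text is the raw text of one source document
chunks = splitter.split_text(document_text)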

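The separators list is the key: it tries paragraph breaks first, then line breaks, then sentences, then words. Strictly speaking this is structure-aware splitting rather than true semantic (embedding-based) chunking, but it keeps related content together instead of splitting mid-sentence.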
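If you want to see the idea without a library, the sketch below splits only at paragraph boundaries and greedily packs whole paragraphs into chunks up to a word budget. It is an illustration, not how LangChain implements its splitter: `pack_paragraphs` and `max_words` are hypothetical names, and words are used as a rough proxy for tokens.

```python
def pack_paragraphs(text: str, max_words: int) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole rather than cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "A b c.\n\nD e.\n\nF g h i j."
packed = pack_paragraphs(text, max_words=5)
```

The trade-off is uneven chunk sizes: boundaries follow the document's structure, so some chunks come out well under the budget.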

Metadata Is Not Optional

Every chunk needs metadata: source document, section, page number, date, category. This lets you filter during retrieval ("only search docs from Q1 2026" or "only search the API reference") and trace answers back to sources. Without it, you're flying blind.

chunks_with_metadata = []
for i, chunk in enumerate(chunks):
    chunks_with_metadata.append({
        "text": chunk,
        "metadata": {
            "source": "q1-2026-revenue-report.pdf",
            "section": "Financial Summary",
            "chunk_index": i,
            "doc_type": "financial_report",
            "date": "2026-03-31"
        }
    })
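As a sketch of what that metadata buys you, the filter below narrows the candidate set by document type and date range before any similarity scoring happens. The `filter_chunks` helper and the three sample records are invented for illustration; a vector store would normally apply the same filter server-side.

```python
from datetime import date


def filter_chunks(chunks, doc_type=None, start=None, end=None):
    """Keep chunks whose metadata matches doc_type and falls within [start, end]."""
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if doc_type is not None and meta["doc_type"] != doc_type:
            continue
        chunk_date = date.fromisoformat(meta["date"])
        if start is not None and chunk_date < start:
            continue
        if end is not None and chunk_date > end:
            continue
        selected.append(chunk)
    return selected


# Invented sample records using the same metadata keys as above.
sample = [
    {"text": "Q1 revenue summary", "metadata": {"doc_type": "financial_report", "date": "2026-03-31"}},
    {"text": "Q4 revenue summary", "metadata": {"doc_type": "financial_report", "date": "2025-12-31"}},
    {"text": "POST /v1/invoices", "metadata": {"doc_type": "api_reference", "date": "2026-02-10"}},
]
q1_reports = filter_chunks(
    sample,
    doc_type="financial_report",
    start=date(2026, 1, 1),
    end=date(2026, 3, 31),
)
```

Storing dates as ISO strings keeps the metadata JSON-friendly while still parsing cleanly with `date.fromisoformat`.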

Embeddings: Choosing What Goes Into Your Vector Store

Embeddings convert your chunks into high-dimensional vectors that capture semantic meaning. The model you choose matters, but not as much as people think — chunking and retrieval strategy matter more.

Current good options as of mid-2026:

  • OpenAI text-embedding-3-small: solid default, fast, reasonably priced. Good for getting started.
  • OpenAI text-embedding-3-large: higher quality, higher cost. Worth it if retrieval quality is critical.
  • Nomic Embed: open-source, competitive with OpenAI's small model, runs locally. My preference for self-hosted setups.
  • Qwen3 embeddings: strong open-source option, especially for multilingual content.

Don't overthink this. Start with OpenAI's small model or Nomic Embed, measure retrieval quality, and upgrade only if you have evidence the embeddings are the bottleneck.

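One thing people skip: normalization. Normalize your embeddings before storage and before search. Cosine similarity is scale-invariant by definition, but many vector stores score with a raw dot product for speed, and a dot product on unnormalized vectors is biased toward longer vectors and gives subtly wrong rankings. Most libraries handle this, but check it if you're rolling your own.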
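A small worked example shows what goes wrong. The two-dimensional vectors are made up for illustration: a raw dot product prefers the longer, off-axis vector, while cosine similarity (a dot product of unit-length vectors) prefers the one that actually points the same way as the query.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(normalize(a), normalize(b))


query = [1.0, 0.0]
aligned = [2.0, 0.0]   # same direction as the query, small magnitude
off_axis = [5.0, 5.0]  # 45 degrees away from the query, larger magnitude

raw_prefers_off_axis = dot(query, off_axis) > dot(query, aligned)
cosine_prefers_aligned = cosine(query, aligned) > cosine(query, off_axis)
```

Normalizing once at ingestion and once per query makes the cheap dot product and cosine similarity rank results identically.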

Vector Stores: An Opinionated Comparison

Here's where I take a stance. Most comparisons present every option as equally valid. They're not.

Self-Hosted First (My Recommendation)

ChromaDB — best for getting started. Runs in-process in Python, zero infrastructure, persistent storage on disk. Perfect for prototyping and small-to-medium deployments. Not ideal for distributed setups.

Qdrant — best for production self-hosted. Written in Rust (fast), supports filtering, hybrid search, and horizontal scaling. If you're serious about self-hosting, this is the one.

Milvus — best for large-scale deployments. Handles billions of vectors, supports multiple index types, but heavier infrastructure requirements. Overkill until you need it.

Managed (If You Must)

Pinecone — fine if you want zero infrastructure and don't mind vendor lock-in and per-query pricing that adds up fast. I'd skip it unless your team has zero DevOps capacity and budget isn't a concern.

pgvector — if you already run Postgres, this is a no-brainer. Not the fastest vector search, but the operational simplicity of one fewer database to manage is worth a lot.

My bias is clear: I'd take Qdrant or Chroma over Pinecone the same way I'd take self-hosted n8n over Zapier — more control, lower cost, no vendor lock-in, and you actually learn how the system works.

Quick Setup with Chroma

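import os

import chromadb
from chromadb.utils import embedding_functions

# Persist the index to disk so it survives restarts
client = chromadb.PersistentClient(path="./chroma_db")

# Read the key from the environment instead of hard-coding it
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)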

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=embedding_fn
)

# Add chunks with metadata
collection.add(
    documents=[chunk["text"] for chunk in chunks_with_metadata],
    metadatas=[chunk["metadata"] for chunk in chunks_with_metadata],
    ids=[f"chunk-{i}" for i in range(len(chunks_with_metadata))]
)

# Query
results = collection.query(
    query_texts=["What was Q1 revenue?"],
    n_results=5,
    where={"doc_type": "financial_report"}  # metadata filtering
)

Retrieval That Actually Works

Vector similarity search is your baseline. It's not enough for production. Here's what to add:

Hybrid Search: Semantic + Keyword

Pure semantic search fails on exact matches — error codes, product names, version numbers, proper nouns. Hybrid search combines vector similarity with keyword matching (BM25), and it's the single biggest retrieval upgrade you can make.
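To show why the keyword side catches what embeddings miss, here is a compact BM25 scorer in plain Python. Treat it as a sketch: tokenization is a bare whitespace split, `k1=1.5` and `b=0.75` are common defaults rather than tuned values, the IDF uses the non-negative Lucene-style variant, and the three documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "error code e-4027 means the token expired",
    "authentication uses short lived tokens",
    "billing errors are retried daily",
]
scores = bm25_scores("e-4027", docs)
```

Only the document that contains the literal string scores above zero, which is the exact-match behaviour dense retrieval cannot guarantee.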

# Qdrant hybrid search example
from qdrant_client import QdrantClient
from qdrant_client.models import Fusion, FusionQuery, Prefetch

client = QdrantClient(host="localhost", port=6333)

# Assumed to be computed upstream:
#   query_vector  - dense embedding of the query (list of floats)
#   sparse_vector - qdrant_client.models.SparseVector with BM25 term weights
# The "docs" collection must define named vectors "dense" and "sparse".
results = client.query_points(
    collection_name="docs",
    prefetch=[
        Prefetch(
            query=query_vector,  # dense embedding
            using="dense",
            limit=20,
        ),
        Prefetch(
            query=sparse_vector,  # BM25 sparse vector
            using="sparse",
            limit=20,
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # reciprocal rank fusion
    limit=5,
)

Reciprocal Rank Fusion (RRF) merges the two result sets by rank, not by raw score. This handles the different score scales between dense and sparse vectors elegantly.
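RRF itself is only a few lines. The sketch below assumes the usual formulation, where each document scores the sum of 1 / (k + rank) over every list it appears in, with k = 60 as a commonly used constant; the constant a particular vector store uses internally may differ. The document IDs are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]   # ranking from vector similarity
sparse = ["c", "a", "d"]  # ranking from BM25
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that appear near the top of both lists ("a" and "c" here) rise above documents that only one retriever found, and no score normalization between the two retrievers is needed.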

Reranking: Non-Optional for Production

Vector search (even hybrid) returns results by similarity to the query. But similarity ≠ relevance. A chunk might mention the same keywords as your query while answering a completely different question. Reranking fixes this.

A cross-encoder reranker takes the (query, chunk) pair and scores actual relevance, not just similarity. It's slower but far more accurate. You retrieve top-20 with vector search, then rerank to top-5.

Good rerankers:

  • Cohere Rerank API: easy to integrate, pay per query
  • Qwen3 reranker: open-source, self-hosted, strong quality
  • ColBERT: late-interaction model, fast at scale, more complex setup

My stance: if you're not reranking, you're not in production. The latency cost is minimal compared to the quality gain.
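The retrieve-then-rerank shape looks like this in code. It is a sketch with a pluggable scorer: `toy_score` is a deliberately crude word-overlap function standing in for a real cross-encoder (which would be a model call), and the candidate chunks are invented.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def toy_score(query: str, chunk: str) -> float:
    """Placeholder relevance score: fraction of query words found in the chunk.

    A real reranker would replace this with a cross-encoder prediction.
    """
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


# Stage 1 output (for example the top 20 from hybrid search); three shown here.
candidates = [
    "reset your password from the login page",
    "password policy requires twelve characters",
    "how to reset the router to factory settings",
]
best = rerank("reset password", candidates, toy_score, top_n=2)
```

Because the scorer is just a callable, swapping in a hosted reranking API or a self-hosted model changes one argument, not the pipeline.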

Evaluation: How Do You Know It Works?

This is where most RAG projects fail silently. Without evaluation, you're guessing. RAGAS (Retrieval Augmented Generation Assessment) is the standard framework — it measures:

  • Context Precision: are the retrieved chunks actually relevant?
  • Context Recall: did we retrieve all the chunks needed to answer?
  • Faithfulness: is the generated answer grounded in the retrieved context?
  • Answer Relevance: does the answer actually address the question?

You need a test set — at minimum 50-100 question-answer pairs with ground truth. Build this manually from your actual documents. Don't generate it with an LLM; that's circular.

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

test_data = {
    "question": ["What was Q1 revenue?", ...],
    "answer": [generated_answer_1, ...],
    "contexts": [retrieved_chunks_1, ...],
    "ground_truth": ["$4.2M", ...],
}

dataset = Dataset.from_dict(test_data)
results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy]
)
print(results)

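Run this on every pipeline change. As a rule of thumb, context precision below about 0.7 means retrieval needs attention; tune that threshold on your own data. If faithfulness drops while the retrieval metrics hold steady, the LLM is ignoring the context and hallucinating.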
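If your test set labels which chunk IDs are relevant to each question, you can also track plain precision and recall at k without any LLM judge. This is a simpler, ID-based cousin of the RAGAS context metrics, not a reimplementation of them, and the chunk IDs below are invented.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall of the first k retrieved chunk IDs against a labeled set."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented test cases: what the retriever returned vs. what a human marked relevant.
test_set = [
    {"retrieved": ["c1", "c7", "c3", "c9"], "relevant": {"c1", "c3"}},
    {"retrieved": ["c4", "c2", "c8", "c5"], "relevant": {"c2", "c6"}},
]
scores = [
    precision_recall_at_k(case["retrieved"], case["relevant"], k=4)
    for case in test_set
]
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
```

These numbers are cheap enough to compute on every commit, which makes them a useful early warning before running the slower LLM-judged metrics.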

Production Gotchas I Learned the Hard Way

API Quota Limits Will Kill Your Pipeline

I've hit OpenAI API quota limits mid-campaign while building AI-powered content workflows. It wasn't during a batch embedding job — it was during peak query traffic, and the entire pipeline went down because there was no fallback.

For RAG specifically, this means:

  • Embedding ingestion is bursty — you'll embed thousands of chunks at once, then barely touch the API. Monitor your rate limits during ingestion.
  • Query-time embedding is constant — every user query needs an embedding call. This is where quota limits bite.
  • Always have a fallback model — if OpenAI is rate-limited, fall back to a local model. The quality drop is better than a 500 error.
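
import time

from openai import APITimeoutError, OpenAI, RateLimitError

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
# local_model is assumed to be loaded elsewhere: any object whose .encode(text)
# returns a NumPy array, for example a sentence-transformers model.

def embed_with_fallback(text: str, max_retries: int = 3) -> list[float]:
    for attempt in range(max_retries):
        try:
            response = openai_client.embeddings.create(
                input=text,
                model="text-embedding-3-small"
            )
            return response.data[0].embedding
        except (RateLimitError, APITimeoutError):
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
    # All retries failed, so fall back to the local model. Vectors from different
    # models are not comparable: fallback vectors must be searched against an
    # index that was built with local_model, not the OpenAI-embedded one.
    return local_model.encode(text).tolist()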

Cost Controls from Day One

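Track your embedding and generation costs per query. Reranking down to the top 3 chunks before generation, instead of passing all 20 candidates, cuts the context you pay for by roughly 5-7x, usually with little or no quality loss because the dropped chunks were the least relevant ones.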
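The arithmetic behind that range is easy to check. The figures below are assumptions chosen for illustration (400-token chunks and 200 tokens of instructions plus question), and the calculation covers prompt tokens only, not output tokens or the reranker's own cost.

```python
def prompt_tokens(num_chunks: int, tokens_per_chunk: int, overhead_tokens: int) -> int:
    """Approximate prompt size: retrieved context plus fixed instructions and question."""
    return num_chunks * tokens_per_chunk + overhead_tokens


TOKENS_PER_CHUNK = 400  # assumed average chunk size
OVERHEAD_TOKENS = 200   # assumed system prompt + user question

before = prompt_tokens(20, TOKENS_PER_CHUNK, OVERHEAD_TOKENS)  # no reranking
after = prompt_tokens(3, TOKENS_PER_CHUNK, OVERHEAD_TOKENS)    # rerank to top 3
ratio = before / after
```

Under these assumptions the prompt shrinks from 8,200 to 1,400 tokens, a little under 6x; the fixed overhead is why the saving is less than the raw 20-to-3 chunk ratio.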

Caching Is Free Performance

Cache embedding results for repeated queries. Cache LLM responses for identical (query + context) combinations. This isn't optional at scale — it's the difference between $50/month and $500/month in API costs.
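A minimal in-memory version of the embedding cache might look like the sketch below. The `EmbeddingCache` class and `fake_embed` function are hypothetical names for illustration; in production the dictionary would typically be replaced by a shared store such as Redis, and `fake_embed` by your real embedding call.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings keyed by model name plus whitespace/case-normalized text."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model_name: str):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        normalized = " ".join(text.lower().split())
        raw = f"{self._model_name}:{normalized}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[key] = vector
        return vector


calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    """Stand-in for a paid embedding API call; records each invocation."""
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, model_name="demo-model")
first = cache.embed("What was Q1 revenue?")
second = cache.embed("  what was Q1   revenue? ")
```

Including the model name in the key prevents vectors from different models being mixed after an upgrade. Lowercasing before hashing is a design choice that trades a tiny amount of fidelity for a higher hit rate; drop it if your embedding model is case-sensitive in ways you care about.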

The Start-Here-Then-Upgrade Path

Don't try to build the full production pipeline on day one. Here's the progression I recommend:

  1. Week 1: ChromaDB + OpenAI embeddings + basic chunking. Get something working end-to-end.
  2. Week 2: Add metadata filtering and semantic chunking. Tune chunk size on your actual data.
  3. Week 3: Add hybrid search (BM25 + dense). This is your biggest quality jump.
  4. Week 4: Add reranking. Second biggest quality jump.
  5. Week 5: Build a RAGAS evaluation set. Now you can measure instead of guess.
  6. Week 6+: Migrate to Qdrant if you need scale. Add caching, fallback models, and monitoring.

Each step is measurable. Each step makes the system better. And you have something working from week one — not a half-finished infrastructure project.

Wrapping Up

RAG isn't complicated conceptually — fetch relevant data, give it to the LLM, generate a grounded answer. But the engineering details between "works in a notebook" and "works in production" are where most teams fail. Chunking strategy, hybrid search, reranking, and evaluation are the four things that separate toy demos from real systems.

The biggest mistake I see is treating retrieval as a solved problem once you've set up a vector database. It's not. Retrieval quality is the bottleneck, and it requires ongoing tuning, measurement, and iteration — just like any search system ever built.

If you want to see a real AI implementation in an open-source project, star SimpleAIFolio on GitHub — it's a portfolio and blog CMS with an AI writing assistant that deals with these exact challenges. Or follow along for more production-focused AI builds.

Frequently Asked Questions

  • Why does my RAG system return irrelevant chunks?

    Three most common causes: naive fixed-size chunking that breaks context, pure semantic search that misses exact keyword matches, and no reranking to filter out similar-but-irrelevant results. Fix chunking first (use semantic splitting with overlap), add hybrid search (BM25 + dense vectors), then add a cross-encoder reranker. These three changes typically cut irrelevant retrievals by more than half.

  • What chunk size works best for RAG?

    300-500 tokens with 50-100 token overlap is a solid default, but it depends on content type. API docs work better at 200-300 tokens (one endpoint per chunk). Legal text needs 500-800 tokens (context around clauses matters). FAQ content should chunk by question-answer pair, not by size. Always tune on your actual data using RAGAS evaluation.

  • How do I choose between Chroma, Qdrant, and Milvus?

    Chroma for getting started — runs in-process in Python, zero infrastructure. Qdrant for production self-hosted — Rust-based, fast, supports hybrid search and filtering. Milvus for large-scale (billions of vectors). Start with Chroma, migrate to Qdrant when you need production features. Skip Pinecone unless you have zero DevOps capacity.

  • Do I need reranking in my RAG pipeline?

    Yes. Reranking is non-optional for production. Vector search returns results by similarity, but similarity ≠ relevance. A cross-encoder reranker scores actual relevance for each (query, chunk) pair. Retrieve top-20 with vector search, rerank to top-5, then generate. The latency cost is minimal; the quality gain is significant. If you're not reranking, you're not in production.

  • How do I evaluate RAG retrieval quality?

    Use RAGAS. Build a test set of 50-100 question-answer pairs with ground truth from your actual documents. Measure context precision, context recall, faithfulness, and answer relevance. Run evaluation on every pipeline change. If context precision drops below 0.7, your retrieval needs fixing.

  • What's the difference between semantic search and hybrid search in RAG?

    Semantic search uses dense vector embeddings to find chunks with similar meaning. Hybrid search combines dense vectors with sparse vectors (BM25 keyword matching), then merges results using Reciprocal Rank Fusion. Hybrid search handles both 'how does authentication work' (semantic) and 'error code AUTH-4027' (keyword) in the same query.

  • How do I handle RAG at scale without blowing up costs?

    Three levers: rerank before generation (feeding 3 chunks instead of 20 cuts prompt tokens roughly 5-7x), cache embedding results and LLM responses for repeated queries, and use fallback models when your primary API hits rate limits. Self-hosted embedding models (Nomic, Qwen3) eliminate per-query API fees for embeddings, though you pay for the compute instead.
