Architecture review

    A production RAG pipeline, stage by stage — with cost and retrieval-quality numbers

    Eight stages from raw documents to grounded answer: chunking, embeddings, vector store, retrieval, reranking, generation, and a semantic cache, with an offline evaluation loop alongside. Cost per 1M queries, recall@k, and the choices that meaningfully move both. Drawn from four production RAG products we ship and maintain.

    May 14, 2026 · 23 min read · By Ritesh
    Production RAG pipeline architecture — eight stages from documents to answer

    The pipeline at a glance

    End-to-end flow
    Read top-to-bottom: each stage's output is the next stage's input. The Eval loop on the right runs offline against a curated question set.
                 ┌────────────────────┐
                 │  1. Raw documents  │
                 └─────────┬──────────┘
                           ▼
                 ┌────────────────────┐
                 │  2. Chunking       │   semantic + size guard
                 └─────────┬──────────┘
                           ▼
                 ┌────────────────────┐
                 │  3. Embeddings     │   batched, cached
                 └─────────┬──────────┘
                           ▼
                 ┌────────────────────┐         ┌────────────────────┐
                 │  4. Vector store   │◀───────▶│  8. Semantic cache │
                 └─────────┬──────────┘         └────────────────────┘
                           ▼                            ▲
                 ┌────────────────────┐                 │
                 │  5. Retrieval      │─────────────────┘
                 └─────────┬──────────┘
                           ▼
                 ┌────────────────────┐
                 │  6. Reranking      │   cross-encoder, top-k → top-n
                 └─────────┬──────────┘
                           ▼
                 ┌────────────────────┐         ┌────────────────────┐
                 │  7. Generation     │◀───────▶│  Eval (offline)    │
                 └─────────┬──────────┘         └────────────────────┘
                           ▼
                      Answer + sources

    The remainder of this post takes each stage in order. For every stage we cover: the choice we ship by default, why, the code, and the marginal cost per 1M queries.

    Stage 1 — Documents

    Most production RAG content is heterogeneous: PDFs, HTML dumps, transcripts, Notion exports, support tickets. The ingestion step that pays off most is one we are too quick to skip — structural extraction. Strip the navigation, the boilerplate footer, the disclaimer repeated on every page. The signal-to-noise ratio of your corpus is the single thing that the rest of the pipeline cannot fix.

    For PDFs we run Unstructured with the fast strategy for typical text-heavy docs and the hi_res (Detectron2) strategy for diagram-dense ones. For HTML we use Mozilla Readability wrapped in a thin Node service. The choice is rarely about quality at the extraction layer — it is about consistency. Pick one and run all of your content through it.
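    The "disclaimer repeated on every page" case can be stripped mechanically, whatever extractor you use. A minimal stdlib sketch under our own naming (`strip_repeated_lines` is not an Unstructured or Readability API): count how many pages each normalized line appears on and drop the ones that appear on most of them.

```python
from collections import Counter

def strip_repeated_lines(pages: list[str], threshold: float = 0.8) -> list[str]:
    """Drop lines that repeat on >= threshold of pages (headers, footers, disclaimers)."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct line once per page so body text is not penalized.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}
    return [
        "\n".join(l for l in page.splitlines() if l.strip() not in boilerplate)
        for page in pages
    ]
```

    The 0.8 threshold is a starting point, not a recommendation from any of the tools above; boilerplate that appears on only a subset of pages needs corpus-specific rules.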

    Stage 2 — Chunking

    The chunk-size choice is the most discussed and most over-tuned decision in the pipeline. The honest answer from our four-product sample: recursive character splitting at 600-1000 tokens with 10-15% overlap covers 80% of use cases. The cases where it fails are where you knew it would fail: code (split on AST), tabular data (split on rows), legal prose (split on clauses or sections).

    ingest/chunk.py — the chunker we ship by default
    LangChain's RecursiveCharacterTextSplitter with the separators tuned for prose-heavy English content. The metadata each chunk carries is more important than the size — keep the source document id, page, and section heading.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=3200,               # characters, not tokens (length_function=len)
        chunk_overlap=400,             # 12.5% overlap
        length_function=len,           # ~4 chars ≈ 1 token, so ≈800-token chunks
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    def chunk_document(doc: Document) -> list[Chunk]:
        pieces = splitter.split_text(doc.text)
        chunks = []
        for p in pieces:
            # offset_of, nearest_heading_for_offset and page_for_offset are
            # project helpers that map a piece back to its position in doc.text.
            offset = offset_of(p)
            chunks.append(
                Chunk(
                    text=p,
                    metadata={
                        "doc_id": doc.id,
                        "title": doc.title,
                        "section": nearest_heading_for_offset(doc, offset),
                        "page": page_for_offset(doc, offset),
                        "source_url": doc.source_url,
                        "ingested_at": doc.ingested_at,
                    },
                )
            )
        return chunks

    Two things to know. First, retrieval works on semantic similarity to the query, not on document hierarchy — so adding section headings to the chunk text itself (prefixed at the top of each chunk) measurably improves retrieval on multi-section documents. Second, chunks that are too small lose context; chunks that are too large dilute the relevance signal. Across our four products, the recall@5 curve peaks between 600 and 900 tokens; everything outside that window is a real, measurable regression.

    Stage 3 — Embeddings

    The embedding model is where you can save the most money for the least quality loss. The frontier-tier models (OpenAI text-embedding-3-large, Voyage voyage-3-large) are not always materially better for typical English prose than the smaller-tier options. The honest comparison:

    Model                           | Dims  | $/1M tokens     | Recall@5 (our prose set)
    --------------------------------|-------|-----------------|-------------------------
    OpenAI text-embedding-3-small   | 1,536 | $0.02           | 0.78
    OpenAI text-embedding-3-large   | 3,072 | $0.13           | 0.83
    Voyage voyage-3                 | 1,024 | $0.06           | 0.81
    Cohere embed-english-v3         | 1,024 | $0.10           | 0.80
    BGE-small-en-v1.5 (self-hosted) | 384   | ~$0.005 (infra) | 0.74

    Recall@5 measured on a 400-question evaluation set drawn from production traffic of our doc-Q&A and support-search clients. Higher is better. Prices as of May 2026.

    The takeaway: text-embedding-3-small at $0.02 per million tokens is what we ship by default. The frontier tier (3-large) costs ~6.5× as much for a five percentage-point recall gain — rarely worth it unless the application is high-stakes (legal, medical). For budget-sensitive workloads, BGE self-hosted is competitive if you already run GPU infrastructure.

    The cost-vs-model curve is the same one we map on the feature side in our companion AI feature token economics study. The pattern is consistent: the cheap model is good enough far more often than first instinct suggests.
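    The "batched, cached" note in the pipeline diagram is worth making concrete: re-ingesting a corpus should only pay for chunks whose text actually changed. A sketch, with `embed_batch` standing in for whatever provider call you use (e.g. the OpenAI embeddings endpoint) and a plain dict standing in for a persistent cache keyed by content hash:

```python
import hashlib

def _key(text: str) -> str:
    # Content hash: the cache key survives re-ingestion as long as the text is unchanged.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_with_cache(texts, embed_batch, cache, batch_size=128):
    """Return one embedding per input text, calling embed_batch only for cache misses."""
    # Deduplicate while preserving order, then keep only uncached texts.
    missing = [t for t in dict.fromkeys(texts) if _key(t) not in cache]
    for i in range(0, len(missing), batch_size):
        batch = missing[i:i + batch_size]
        for text, vec in zip(batch, embed_batch(batch)):
            cache[_key(text)] = vec
    return [cache[_key(t)] for t in texts]
```

    In production the dict is a table or key-value store; the function names here are ours, not any provider's SDK.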

    Stage 4 — Vector store

    The three we have shipped to production, in rough order of how often we pick them now:

    • Postgres + pgvector — 0 ops overhead if you already run Postgres. Fast enough up to ~5M vectors with HNSW. Joins to your relational data are free. This is what we use for the majority of new builds.
    • Pinecone / Weaviate / Qdrant Cloud — managed dedicated service. Faster at high cardinality, comes with namespace + metadata filtering UX. Worth it from ~10M vectors onward.
    • LanceDB or DuckDB-VSS — in-process embedded option for batch / analytical RAG workloads. Trivially cheap, no network hop, ideal for the eval loop.
    Postgres + pgvector — production schema we ship by default
    HNSW index for ANN search; the metadata filter index on doc_id; the partial index on the `is_current` column so superseded chunks can stay in the table for audit without polluting search.
    CREATE EXTENSION IF NOT EXISTS vector;
    
    CREATE TABLE chunks (
      id          uuid          PRIMARY KEY DEFAULT gen_random_uuid(),
      tenant_id   uuid          NOT NULL,  -- every query must filter on this (see Stage 5)
      doc_id      uuid          NOT NULL,
      text        text          NOT NULL,
      embedding   vector(1536)  NOT NULL,
      section     text,
      page        int,
      source_url  text,
      is_current  boolean       NOT NULL DEFAULT true,
      ingested_at timestamptz   NOT NULL DEFAULT now()
    );

    CREATE INDEX chunks_embedding_idx
      ON chunks USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 64)
      WHERE is_current;

    CREATE INDEX chunks_tenant_idx ON chunks (tenant_id) WHERE is_current;
    CREATE INDEX chunks_doc_id_idx ON chunks (doc_id) WHERE is_current;
    CREATE INDEX chunks_source_url_idx ON chunks (source_url) WHERE is_current;

    Stage 5 — Retrieval

    The retrieval call itself is a thin wrapper around the vector store. The decisions here are k (how many chunks to fetch), the metadata filter, and whether to do hybrid retrieval (vector + keyword).

    We retrieve k=20 by default and rely on the reranker (next stage) to narrow to the 4-6 that get into the prompt. Going wider on retrieval and tighter on reranking is consistently better than the reverse, because the cross-encoder used for reranking is far more discriminating than cosine similarity.

    rag/retrieve.ts — hybrid retrieval
    Vector ANN + lexical (BM25 via Postgres `to_tsvector`). The two ranked lists are merged via reciprocal rank fusion before reranking.
    export async function retrieve(
      query: string,
      queryEmbedding: number[],
      tenantId: string,
      k = 20,
    ): Promise<RetrievedChunk[]> {
      // node-postgres serializes a number[] as a Postgres array ('{…}'), which does not
      // cast to vector — pass the '[0.1,0.2,…]' text format pgvector expects instead.
      const vec = JSON.stringify(queryEmbedding);
      const [vector, lexical] = await Promise.all([
        db.query(`
          SELECT id, text, doc_id, source_url,
                 1 - (embedding <=> $1::vector) AS score
          FROM chunks
          WHERE is_current AND tenant_id = $2
          ORDER BY embedding <=> $1::vector
          LIMIT $3
        `, [vec, tenantId, k]),
        db.query(`
          SELECT id, text, doc_id, source_url,
                 ts_rank(to_tsvector('english', text), plainto_tsquery('english', $1)) AS score
          FROM chunks
          WHERE is_current AND tenant_id = $2
            AND to_tsvector('english', text) @@ plainto_tsquery('english', $1)
          ORDER BY score DESC
          LIMIT $3
        `, [query, tenantId, k]),
      ]);
    
      return reciprocalRankFusion(vector.rows, lexical.rows, k);
    }
    
    function reciprocalRankFusion(
      a: RetrievedChunk[],
      b: RetrievedChunk[],
      k: number,
      rrfK = 60,
    ): RetrievedChunk[] {
      const scores = new Map<string, { chunk: RetrievedChunk; score: number }>();
      for (const list of [a, b]) {
        list.forEach((chunk, rank) => {
          const score = 1 / (rrfK + rank);
          const existing = scores.get(chunk.id);
          if (existing) existing.score += score;
          else scores.set(chunk.id, { chunk, score });
        });
      }
      return [...scores.values()]
        .sort((x, y) => y.score - x.score)
        .slice(0, k)
        .map((entry) => entry.chunk);
    }

    The tenant filter on the SQL is non-negotiable on multi-tenant RAG — the most common RAG security failure we audit is a missing tenant scope on the vector query. The pattern is the same as the multi-tenant architecture study — enforce isolation in the database, not in the application. We ship this isolation pattern by default on every RAG build through our AI SaaS product development engagement.

    Stage 6 — Reranking

    The reranker takes the 20 retrieved candidates and scores them with a cross-encoder — a model trained to evaluate query / passage pairs directly, rather than comparing two independent embeddings. Cross-encoders are meaningfully better than bi-encoders for the final ranking step; the trade-off is they require a forward pass per candidate, so you cannot use them for the first retrieval.

    The two we use: cohere-rerank-3.5 (API, $2 / 1k queries with up to 100 docs each) and BAAI/bge-reranker-large (self-hosted, free if you already have a GPU). Both move recall@4 up by 8-14 percentage points compared to no reranking, in our evaluation set.

    rag/rerank.ts — cohere reranker call
    import { CohereClient } from "cohere-ai";
    const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });
    
    export async function rerank(
      query: string,
      candidates: RetrievedChunk[],
      topN = 5,
    ): Promise<RetrievedChunk[]> {
      if (candidates.length === 0) return [];
      const res = await cohere.v2.rerank({
        model: "rerank-v3.5",
        query,
        documents: candidates.map((c) => c.text),
        topN,
      });
      return res.results.map((r) => candidates[r.index]);
    }

    Stage 7 — Generation

    The prompt template that has held up best across our four products: explicit grounding instruction, an inline source list, an instruction to cite by source id, and a refusal clause for when the retrieved context doesn't support an answer.

    rag/prompt.ts — the system + user composition
    export function buildMessages(
      question: string,
      passages: RetrievedChunk[],
    ): ChatMessage[] {
      const sources = passages
        .map((p, i) => `[${i + 1}] (${p.metadata.title})\n${p.text}`)
        .join("\n\n---\n\n");
    
      return [
        {
          role: "system",
          content: `You answer the user's question using only the SOURCES below. Cite each claim with the bracketed source number, e.g. [2]. If the sources do not contain the answer, reply exactly: "I don't have that in my sources." Do not invent facts.`,
        },
        {
          role: "user",
          content: `Question: ${question}\n\nSOURCES:\n${sources}`,
        },
      ];
    }

    Stage 8 — Semantic cache

    The single biggest cost saver in a production RAG system is caching answers to semantically similar questions. Two queries with different wording (“how do I reset my password” vs “forgot password reset”) should hit the same cached answer. The check is itself a vector similarity lookup against a much smaller table of recent question embeddings and their generated answers.

    Semantic cache schema
    CREATE TABLE rag_cache (
      id           uuid          PRIMARY KEY DEFAULT gen_random_uuid(),
      tenant_id    uuid          NOT NULL,
      question     text          NOT NULL,
      q_embedding  vector(1536)  NOT NULL,
      answer       text          NOT NULL,
      sources      jsonb         NOT NULL,
      hits         int           NOT NULL DEFAULT 0,
      created_at   timestamptz   NOT NULL DEFAULT now(),
      last_used_at timestamptz   NOT NULL DEFAULT now()
    );
    
    CREATE INDEX rag_cache_q_idx
      ON rag_cache USING hnsw (q_embedding vector_cosine_ops);
    
    -- Tunable: how similar must a query be to hit cache?
    -- 0.92 = strict (very few false positives, modest hit rate)
    -- 0.88 = relaxed (more hits, occasional wrong cache match)

    The threshold is the knob to tune. On our doc-Q&A client, 0.92 yields a 31% cache hit rate at <1% incorrect matches. Each hit saves the cost of the LLM call — the single most expensive line item by far.
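    The decision itself is a nearest-neighbour lookup over past question embeddings plus the threshold check. In production that is the HNSW query against `rag_cache`; the pure-Python sketch below is our illustration of the logic, not the production code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cache_lookup(q_embedding, entries, threshold=0.92):
    """Return the cached answer for the most similar past question, or None on a miss."""
    best = max(entries, key=lambda e: cosine(q_embedding, e["q_embedding"]), default=None)
    if best is not None and cosine(q_embedding, best["q_embedding"]) >= threshold:
        return best["answer"]
    return None
```

    On a miss, the full pipeline runs and the new question/answer pair is inserted, so the cache warms itself from live traffic.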

    Cost per 1M queries

    Putting the numbers together, for a workload that retrieves k=20, reranks to top-5, and generates with Claude Haiku 4.5 (300 input tokens of context, 200 output tokens):

    Stage                                 | $/1M queries (no cache) | $/1M queries (31% cache)
    --------------------------------------|-------------------------|-------------------------
    Query embedding                       | $0.30                   | $0.30
    Vector + lexical retrieval (Postgres) | ~$1 (infra amortised)   | ~$1
    Reranker (Cohere v3.5)                | $2,000                  | $1,380
    Generation (Claude Haiku 4.5)         | $1,000                  | $690
    Total                                 | ~$3,001                 | ~$2,071

    Costs at May 2026 list prices; numbers exclude monthly fixed costs for vector DB and reranker hosting if self-hosted. The 31% cache hit rate is the median across our four production deployments.
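    The table's arithmetic, made explicit. A sketch using the unit costs above; note the cache hit rate discounts only reranking and generation, because the query embedding is still needed to check the cache:

```python
def cost_per_1m_queries(cache_hit_rate: float = 0.0) -> float:
    """Total $/1M queries. Cache hits skip reranking and generation only;
    the query embedding is always computed (it drives the cache lookup)."""
    embedding = 0.30                              # query embedding
    retrieval = 1.00                              # Postgres, infra amortised
    rerank = 2_000 * (1 - cache_hit_rate)         # Cohere: $2 per 1k queries
    generation = 1_000 * (1 - cache_hit_rate)     # Claude Haiku 4.5
    return embedding + retrieval + rerank + generation
```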

    The semantic cache pays for itself within the first 100k queries on every production workload we have run it on. It is the single highest-leverage change for an already-shipped RAG product — and the one we add first when a team brings us a working but expensive RAG prototype through our AI app completion engagement.

    Evaluation — the one part nobody wants to do

    Build the eval set before you ship. 100-400 question/expected-answer pairs covering the surface area of the corpus. Run the full pipeline against them after every chunking, embedding, or prompt change — report recall@k for retrieval and a faithfulness score (LLM-as-judge) for generation. The version of this we run in CI for one of our doc-Q&A clients is ~140 questions and takes ~6 minutes per run. It has caught three regressions that would otherwise have shipped. Most of our RAG products ship as one feature inside a larger SaaS we built through the SaaS web-app development engagement — eval CI is wired in from day one rather than retrofitted.
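    The retrieval half of that loop reduces to recall@k over the question set. A minimal sketch (function names are ours), assuming each eval question is labelled with the ids of the chunks that answer it:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, k: int) -> float:
    """Average recall@k across the eval set; each run is (retrieved_ids, relevant_ids)."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

    The faithfulness half needs an LLM judge and is necessarily noisier; track it as a trend, not a gate, until you trust the judge.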


    Frequently asked questions

    What's the right chunk size for a production RAG pipeline?
    600-1000 tokens with 10-15% overlap for prose-heavy English content. The recall@5 curve peaks in that range across our four production deployments. Outside it — both smaller and larger — is a measurable regression. Code, tabular data and legal prose need different splitters (AST, row-based, clause-based).
    Which embedding model gives the best cost-to-quality ratio?
    OpenAI text-embedding-3-small at $0.02 per 1M tokens. It hits 0.78 recall@5 on our 400-question eval set — within five percentage points of text-embedding-3-large which costs 6.5× as much. Frontier-tier embeddings are rarely worth it outside high-stakes (legal, medical) workloads.
    Does a semantic cache actually pay for itself in a RAG system?
    Yes — within the first 100k queries on every production workload we have run it on. Median cache hit rate is 31% at a 0.92 similarity threshold, each hit saves a full LLM call, and the LLM call is the most expensive line item by far. It is the single highest-leverage change for a shipped RAG product.

    About the author

    Ritesh — Founding Partner, Appycodes


    Ritesh leads engineering at Appycodes. The pipeline above is the one we run, with small per-product variations, across four production RAG products — a doc-Q&A SaaS for an enterprise compliance team, a developer-tools support assistant, an internal knowledge-base bot for a 600-person services firm, and a transcript-search product for a media client. The semantic cache and the reranker are the two changes that turn a working demo into a production system.

    Last reviewed: May 14, 2026
