The pipeline at a glance

┌────────────────────┐
│ 1. Raw documents   │
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ 2. Chunking        │  semantic + size guard
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ 3. Embeddings      │  batched, cached
└─────────┬──────────┘
          ▼
┌────────────────────┐        ┌────────────────────┐
│ 4. Vector store    │◀──────▶│ 8. Semantic cache  │
└─────────┬──────────┘        └────────────────────┘
          ▼                             ▲
┌────────────────────┐                  │
│ 5. Retrieval       │──────────────────┘
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ 6. Reranking       │  cross-encoder, top-k → top-n
└─────────┬──────────┘
          ▼
┌────────────────────┐        ┌────────────────────┐
│ 7. Generation      │◀──────▶│ Eval (offline)     │
└─────────┬──────────┘        └────────────────────┘
          ▼
   Answer + sources

The remainder of this post takes each stage in order. For every stage we cover: the choice we ship by default, why, the code, and the marginal cost per 1M queries.
Stage 1 — Documents
Most production RAG content is heterogeneous: PDFs, HTML dumps, transcripts, Notion exports, support tickets. The ingestion step that pays off most is one we are too quick to skip — structural extraction. Strip the navigation, the boilerplate footer, the disclaimer repeated on every page. The signal-to-noise ratio of your corpus is the single thing that the rest of the pipeline cannot fix.
For PDFs we run Unstructured with the fast strategy for typical text-heavy docs and the hi_res (Detectron2) strategy for diagram-dense ones. For HTML we use Mozilla Readability wrapped in a thin Node service. The choice is rarely about quality at the extraction layer — it is about consistency. Pick one and run all of your content through it.
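For the HTML path, here is a minimal sketch of the kind of Readability wrapper we mean (assuming jsdom and @mozilla/readability; the real Node service adds batching, timeouts and error handling):

import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

// Strip navigation, footers and repeated boilerplate from raw HTML,
// keeping only the article body and title for downstream chunking.
export function extractArticle(html: string, url: string) {
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) return null; // Readability could not find an article body
  return {
    title: article.title,
    text: article.textContent, // plain text, what the chunker sees
    html: article.content,     // cleaned HTML, handy for previews
  };
}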
Stage 2 — Chunking
The chunk-size choice is the most discussed and most over-tuned decision in the pipeline. The honest answer from our four-product sample: recursive character splitting at 600-1000 tokens with 10-15% overlap covers 80% of use cases. The cases where it fails are where you knew it would fail: code (split on AST), tabular data (split on rows), legal prose (split on clauses or sections).
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Measure length in tokens (tiktoken, cl100k_base) so chunk_size lines up with
# the 600-1000 token guidance above rather than with raw character counts.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=800,      # tokens
    chunk_overlap=100,   # tokens, ~12% overlap
    separators=["\n\n", "\n", ". ", " ", ""],
)
def chunk_document(doc: Document) -> list[Chunk]:
    pieces = splitter.split_text(doc.text)
    # Carry provenance with every chunk: it feeds the source list in the prompt
    # and the citations in the final answer.
    return [
        Chunk(
            text=p,
            metadata={
                "doc_id": doc.id,
                "title": doc.title,
                "section": nearest_heading_for_offset(doc, offset_of(p)),
                "page": page_for_offset(doc, offset_of(p)),
                "source_url": doc.source_url,
                "ingested_at": doc.ingested_at,
            },
        )
        for p in pieces
    ]

Two things to know. First, retrieval works on semantic similarity to the query, not on document hierarchy — so adding section headings to the chunk text itself (prefixed at the top of each chunk) measurably improves retrieval on multi-section documents. Second, chunks that are too small lose context; chunks that are too large dilute the relevance signal. Across our four products, the recall@5 curve peaks between 600 and 900 tokens; everything outside that window is a real, measurable regression.
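A hypothetical helper for that first point, shown here on the serving side where the text is prepared for embedding (the exact place you apply the prefix matters less than applying it consistently at index and query time):

// Hypothetical: prefix the title and nearest section heading onto the chunk
// text before it is embedded, so multi-section documents retrieve on their
// hierarchy as well as their prose.
function withHeadingPrefix(chunk: {
  text: string;
  metadata: { title: string; section?: string };
}): string {
  const heading = [chunk.metadata.title, chunk.metadata.section]
    .filter(Boolean)
    .join(" › ");
  return heading ? `${heading}\n\n${chunk.text}` : chunk.text;
}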
Stage 3 — Embeddings
The embedding model is where you can save the most money for the least quality loss. The frontier-tier models (OpenAI text-embedding-3-large, Voyage voyage-3-large) are not always materially better for typical English prose than the smaller-tier options. The honest comparison:
| Model | Dims | $/1M tokens | Recall@5 (our prose set) |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | $0.02 | 0.78 |
| OpenAI text-embedding-3-large | 3,072 | $0.13 | 0.83 |
| Voyage voyage-3 | 1,024 | $0.06 | 0.81 |
| Cohere embed-english-v3 | 1,024 | $0.10 | 0.80 |
| BGE-small-en-v1.5 (self-hosted) | 384 | ~$0.005 (infra) | 0.74 |
Recall@5 measured on a 400-question evaluation set drawn from production traffic of our doc-Q&A and support-search clients. Higher is better. Prices as of May 2026.
The takeaway: text-embedding-3-small at $0.02 per million tokens is what we ship by default. The frontier tier (3-large) costs ~6.5× as much for a five percentage-point recall gain — rarely worth it unless the application is high-stakes (legal, medical). For budget-sensitive workloads, BGE self-hosted is competitive if you already run GPU infrastructure.
The cost-vs-model curve is the same one we map on the feature side in our companion AI feature token economics study. The pattern is consistent: the cheap model is good enough far more often than first instinct suggests.
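The "batched, cached" note in the diagram is what keeps re-indexing cost flat. A sketch of what we mean, assuming the OpenAI Node SDK and an in-memory map standing in for a real cache (helper names are hypothetical):

import OpenAI from "openai";
import { createHash } from "node:crypto";

const openai = new OpenAI();
const sha = (s: string) => createHash("sha256").update(s).digest("hex");

// Hypothetical: embed texts in batches, skipping anything whose content hash
// has been embedded before. Swap the Map for Redis or Postgres in production.
const embeddingCache = new Map<string, number[]>();

export async function embedBatch(texts: string[]): Promise<number[][]> {
  const out: (number[] | null)[] = texts.map((t) => embeddingCache.get(sha(t)) ?? null);
  const missing = texts.map((t, i) => ({ t, i })).filter(({ i }) => out[i] === null);

  for (let start = 0; start < missing.length; start += 256) {
    const slice = missing.slice(start, start + 256); // 256 inputs per request
    const res = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: slice.map(({ t }) => t),
    });
    for (const d of res.data) {
      const { t, i } = slice[d.index];
      embeddingCache.set(sha(t), d.embedding);
      out[i] = d.embedding;
    }
  }
  return out as number[][];
}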
Stage 4 — Vector store
The three we have shipped to production, in rough order of how often we pick them now:
- Postgres + pgvector — 0 ops overhead if you already run Postgres. Fast enough up to ~5M vectors with HNSW. Joins to your relational data are free. This is what we use for the majority of new builds.
- Pinecone / Weaviate / Qdrant Cloud — managed dedicated service. Faster at high cardinality, comes with namespace + metadata filtering UX. Worth it from ~10M vectors onward.
- LanceDB or DuckDB-VSS — in-process embedded option for batch / analytical RAG workloads. Trivially cheap, no network hop, ideal for the eval loop.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  doc_id uuid NOT NULL,
  tenant_id uuid NOT NULL,   -- the retrieval query below filters on this
  text text NOT NULL,
  embedding vector(1536) NOT NULL,
  section text,
  page int,
  source_url text,
  is_current boolean NOT NULL DEFAULT true,
  ingested_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX chunks_embedding_idx
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE is_current;
CREATE INDEX chunks_doc_id_idx ON chunks (doc_id) WHERE is_current;
CREATE INDEX chunks_source_url_idx ON chunks (source_url) WHERE is_current;

Stage 5 — Retrieval
The retrieval call itself is a thin wrapper around the vector store. The decisions here are k (how many chunks to fetch), the metadata filter, and whether to do hybrid retrieval (vector + keyword).
We retrieve k=20 by default and rely on the reranker (next stage) to narrow to the 4-6 that get into the prompt. Going wider on retrieval and tighter on reranking is consistently better than the reverse, because the cross-encoder used for reranking is far more discriminating than cosine similarity.
export async function retrieve(
  query: string,
  queryEmbedding: number[],
  tenantId: string,
  k = 20,
): Promise<RetrievedChunk[]> {
  // pgvector expects the query vector in its text form ('[0.1,0.2,...]'),
  // hence the JSON.stringify plus the ::vector cast on the parameter.
  const embeddingParam = JSON.stringify(queryEmbedding);
  const [vector, lexical] = await Promise.all([
    db.query(`
      SELECT id, text, doc_id, source_url,
             1 - (embedding <=> $1::vector) AS score
      FROM chunks
      WHERE is_current AND tenant_id = $2
      ORDER BY embedding <=> $1::vector
      LIMIT $3
    `, [embeddingParam, tenantId, k]),
    db.query(`
      SELECT id, text, doc_id, source_url,
             ts_rank(to_tsvector('english', text), plainto_tsquery('english', $1)) AS score
      FROM chunks
      WHERE is_current AND tenant_id = $2
        AND to_tsvector('english', text) @@ plainto_tsquery('english', $1)
      ORDER BY score DESC
      LIMIT $3
    `, [query, tenantId, k]),
  ]);
  return reciprocalRankFusion(vector.rows, lexical.rows, k);
}
function reciprocalRankFusion(
a: RetrievedChunk[],
b: RetrievedChunk[],
k: number,
rrfK = 60,
): RetrievedChunk[] {
const scores = new Map<string, { chunk: RetrievedChunk; score: number }>();
for (const list of [a, b]) {
list.forEach((chunk, rank) => {
const score = 1 / (rrfK + rank);
const existing = scores.get(chunk.id);
if (existing) existing.score += score;
else scores.set(chunk.id, { chunk, score });
});
}
return [...scores.values()]
.sort((x, y) => y.score - x.score)
.slice(0, k)
.map((entry) => entry.chunk);
}

The tenant filter on the SQL is non-negotiable in multi-tenant RAG — the most common RAG security failure we audit is a missing tenant scope on the vector query. The pattern is the same one we describe in the multi-tenant architecture study — enforce isolation in the database, not in the application. We ship this isolation pattern by default on every RAG build through our AI SaaS product development engagement.
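One way to make that isolation a database guarantee rather than a convention is Postgres row-level security. A sketch, assuming node-postgres and a current_setting-based tenant context (the exact mechanism varies by build):

import { Pool, PoolClient } from "pg";

const pool = new Pool();

// One-off migration (sketch): with RLS enabled, a query that forgets the
// tenant filter returns nothing instead of leaking another tenant's chunks.
//   ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY chunks_tenant_isolation ON chunks
//     USING (tenant_id = current_setting('app.tenant_id', true)::uuid);

export async function withTenant<T>(
  tenantId: string,
  fn: (client: PoolClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // set_config(..., true) scopes the setting to this transaction only.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

Note that the table owner bypasses RLS unless you also set FORCE ROW LEVEL SECURITY, so run the application under a non-owner role.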
Stage 6 — Reranking
The reranker takes the 20 retrieved candidates and scores them with a cross-encoder — a model trained to evaluate query / passage pairs directly, rather than comparing two independent embeddings. Cross-encoders are meaningfully better than bi-encoders for the final ranking step; the trade-off is they require a forward pass per candidate, so you cannot use them for the first retrieval.
The two we use: cohere-rerank-3.5 (API, $2 / 1k queries with up to 100 docs each) and BAAI/bge-reranker-large (self-hosted, free if you already have a GPU). Both move recall@4 up by 8-14 percentage points compared to no reranking, in our evaluation set.
import { CohereClient } from "cohere-ai";
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });
export async function rerank(
query: string,
candidates: RetrievedChunk[],
topN = 5,
): Promise<RetrievedChunk[]> {
if (candidates.length === 0) return [];
const res = await cohere.v2.rerank({
model: "rerank-v3.5",
query,
documents: candidates.map((c) => c.text),
topN,
});
return res.results.map((r) => candidates[r.index]);
}

Stage 7 — Generation
The prompt template that has held up best across our four products: explicit grounding instruction, an inline source list, an instruction to cite by source id, and a refusal clause for when the retrieved context doesn't support an answer.
export function buildMessages(
question: string,
passages: RetrievedChunk[],
): ChatMessage[] {
const sources = passages
.map((p, i) => `[${i + 1}] (${p.metadata.title})\n${p.text}`)
.join("\n\n---\n\n");
return [
{
role: "system",
content: `You answer the user's question using only the SOURCES below. Cite each claim with the bracketed source number, e.g. [2]. If the sources do not contain the answer, reply exactly: "I don't have that in my sources." Do not invent facts.`,
},
{
role: "user",
content: `Question: ${question}\n\nSOURCES:\n${sources}`,
},
];
}
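Wiring those messages into the model named in the cost table looks roughly like this, assuming the Anthropic Node SDK; the model id and max_tokens here are placeholders to pin per product:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Sketch: the Anthropic API takes the system prompt as a separate field,
// so split it back out of the buildMessages() result.
export async function generateAnswer(
  question: string,
  passages: RetrievedChunk[],
): Promise<string> {
  const [system, user] = buildMessages(question, passages);
  const res = await anthropic.messages.create({
    model: "claude-haiku-4-5", // placeholder id; pin whatever you benchmarked
    max_tokens: 512,
    system: system.content,
    messages: [{ role: "user", content: user.content }],
  });
  const block = res.content[0];
  return block.type === "text" ? block.text : "";
}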
Stage 8 — Semantic cache

The single biggest cost saver in a production RAG system is caching answers to semantically similar questions. Two queries with different wording (“how do I reset my password” vs “forgot password reset”) should hit the same cached answer. The check is itself a vector similarity lookup, against a much smaller table of recent question embeddings and their generated answers.
CREATE TABLE rag_cache (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL,
question text NOT NULL,
q_embedding vector(1536) NOT NULL,
answer text NOT NULL,
sources jsonb NOT NULL,
hits int NOT NULL DEFAULT 0,
created_at timestamptz NOT NULL DEFAULT now(),
last_used_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX rag_cache_q_idx
ON rag_cache USING hnsw (q_embedding vector_cosine_ops);
-- Tunable: how similar must a query be to hit cache?
-- 0.92 = strict (very few false positives, modest hit rate)
-- 0.88 = relaxed (more hits, occasional wrong cache match)

The threshold is the knob to tune. On our doc-Q&A client, 0.92 yields a 31% cache hit rate at <1% incorrect matches. Each hit saves the cost of the LLM call — the single most expensive line item by far.
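The read path against that table is itself a single nearest-neighbour query. A sketch of the lookup, assuming the same db client as the retrieval code:

// Hypothetical cache check against rag_cache. Returns null on a miss so the
// caller falls through to retrieval + generation (and then writes the cache).
export async function cacheLookup(
  qEmbedding: number[],
  tenantId: string,
  threshold = 0.92,
): Promise<{ answer: string; sources: unknown } | null> {
  const { rows } = await db.query(`
    SELECT id, answer, sources, 1 - (q_embedding <=> $1::vector) AS similarity
    FROM rag_cache
    WHERE tenant_id = $2
    ORDER BY q_embedding <=> $1::vector
    LIMIT 1
  `, [JSON.stringify(qEmbedding), tenantId]);

  const hit = rows[0];
  if (!hit || hit.similarity < threshold) return null;

  // Book-keeping so cold entries can be expired and the hit rate reported.
  await db.query(
    `UPDATE rag_cache SET hits = hits + 1, last_used_at = now() WHERE id = $1`,
    [hit.id],
  );
  return { answer: hit.answer, sources: hit.sources };
}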
Cost per 1M queries
Putting the numbers together, for a workload that retrieves k=20, reranks to top-5, and generates with Claude Haiku 4.5 (300 input tokens of context, 200 output tokens):
| Stage | $/1M queries (no cache) | $/1M queries (31% cache) |
|---|---|---|
| Query embedding | $0.30 | $0.30 |
| Vector + lexical retrieval (Postgres) | ~$1 (infra amortised) | ~$1 |
| Reranker (Cohere v3.5) | $2,000 | $1,380 |
| Generation (Claude Haiku 4.5) | $1,000 | $690 |
| Total | ~$3,001 | ~$2,071 |
Costs at May 2026 list prices; numbers exclude monthly fixed costs for vector DB and reranker hosting if self-hosted. The 31% cache hit rate is the median across our four production deployments.
The semantic cache pays for itself within the first 100k queries on every production workload we have run it on. It is the single highest-leverage change for an already-shipped RAG product — and the one we add first when a team brings us a working but expensive RAG prototype through our AI app completion engagement.
Evaluation — the one part nobody wants to do
Build the eval set before you ship: 100-400 question / expected-answer pairs covering the surface area of the corpus. Run the full pipeline against them after every chunking, embedding, or prompt change — report recall@k for retrieval and a faithfulness score (LLM-as-judge) for generation. The version of this we run in CI for one of our doc-Q&A clients is ~140 questions and takes ~6 minutes per run. It has caught three regressions that would otherwise have shipped. Most of our RAG products ship as one feature inside a larger SaaS we built through the SaaS web-app development engagement — eval CI is wired in from day one rather than retrofitted.
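In code, the retrieval half of that CI job is short. A sketch that reuses retrieve() and rerank() from above plus the embedBatch() helper sketched earlier; EvalCase and EVAL_TENANT_ID are hypothetical names:

// Fraction of questions for which at least one expected document survives
// retrieval + reranking into the top k.
interface EvalCase {
  question: string;
  expectedDocIds: string[]; // doc ids a correct answer must draw on
}

export async function recallAtK(cases: EvalCase[], k = 5): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    const [qEmbedding] = await embedBatch([c.question]);
    const candidates = await retrieve(c.question, qEmbedding, EVAL_TENANT_ID, 20);
    const top = await rerank(c.question, candidates, k);
    // Assumes RetrievedChunk exposes doc_id; adjust to your chunk shape.
    const gotDocIds = new Set(top.map((chunk) => chunk.doc_id));
    if (c.expectedDocIds.some((id) => gotDocIds.has(id))) hits += 1;
  }
  return hits / cases.length;
}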
Related research
Companion reads on AI economics, multi-tenant data layout, and what AI prototypes typically look like before this hardening pass:
Related services
Where we ship pipelines like this
The end-to-end AI SaaS product build, the AI-app-completion engagement that hardens a prototype RAG to the architecture above, and the SaaS web-app build that includes a RAG layer as one component:
Frequently asked questions
- What's the right chunk size for a production RAG pipeline?
- 600-1000 tokens with 10-15% overlap for prose-heavy English content. The recall@5 curve peaks in that range across our four production deployments. Outside it — both smaller and larger — is a measurable regression. Code, tabular data and legal prose need different splitters (AST, row-based, clause-based).
- Which embedding model gives the best cost-to-quality ratio?
- OpenAI text-embedding-3-small at $0.02 per 1M tokens. It hits 0.78 recall@5 on our 400-question eval set — within five percentage points of text-embedding-3-large which costs 6.5× as much. Frontier-tier embeddings are rarely worth it outside high-stakes (legal, medical) workloads.
- Does a semantic cache actually pay for itself in a RAG system?
- Yes — within the first 100k queries on every production workload we have run it on. Median cache hit rate is 31% at a 0.92 similarity threshold, each hit saves a full LLM call, and the LLM call is the most expensive line item by far. It is the single highest-leverage change for a shipped RAG product.

About the author
Ritesh — Founding Partner, Appycodes
Ritesh leads engineering at Appycodes. The pipeline above is the one we run, with small per-product variations, across four production RAG products — a doc-Q&A SaaS for an enterprise compliance team, a developer-tools support assistant, an internal knowledge-base bot for a 600-person services firm, and a transcript-search product for a media client. The semantic cache and the reranker are the two changes that turn a working demo into a production system.
