Every RAG tutorial opens with Pinecone. Then LangChain. Then Python. You close the tab.
pgvector is one CREATE EXTENSION away. If you already run Postgres, you already have a vector store. This post builds the whole pipeline in TypeScript with Drizzle: schema, ingestion, cosine retrieval, hybrid search with Reciprocal Rank Fusion, generation, and a retrieval eval you can actually run. No new services.
One table. Keep it simple until you have a reason not to.
// src/db/schema.ts
import {
index,
jsonb,
pgTable,
serial,
text,
vector,
} from "drizzle-orm/pg-core";
export const documents = pgTable(
"documents",
{
id: serial("id").primaryKey(),
content: text("content").notNull(),
embedding: vector("embedding", { dimensions: 1536 }),
    metadata: jsonb("metadata"),
  },
  (t) => [
    index("documents_embedding_hnsw_idx").using(
      "hnsw",
      t.embedding.op("vector_cosine_ops"),
    ),
  ],
);
vector_cosine_ops tells pgvector to optimize this index for cosine distance queries (the <=> operator). If you're using inner product instead, swap it for vector_ip_ops. The choice should match how you query. I've written more on this in what the pgvector operators actually mean.
Before your first migration, enable the extension:
CREATE EXTENSION IF NOT EXISTS vector;
Drizzle Kit doesn't do this for you. Run it once on every database (local, staging, prod) or add it to your migration prologue. The setup walkthrough has more detail if this is your first time with pgvector in Drizzle.
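If you'd rather not remember the manual step, one option is a tiny bootstrap script that runs it before migrations. A minimal sketch, assuming the db instance from ./db:
import { sql } from "drizzle-orm";
import { db } from "./db";

// Idempotent, so it's safe to run on every deploy before migrating.
await db.execute(sql`CREATE EXTENSION IF NOT EXISTS vector`);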
Three steps: chunk the source text, embed each chunk, batch-insert into Postgres.
There's no universally correct chunk size. Start with 512 tokens and a 50-token overlap, then adjust based on your eval results later.
A simple token-aware splitter using tiktoken:
import { encoding_for_model } from "tiktoken";
interface Chunk {
content: string;
chunkIndex: number;
}
export function chunkText(
text: string,
chunkSize = 512,
overlap = 50,
): Chunk[] {
const enc = encoding_for_model("text-embedding-3-small");
const tokens = enc.encode(text);
const chunks: Chunk[] = [];
let i = 0;
while (i < tokens.length) {
const chunkTokens = tokens.slice(i, i + chunkSize);
chunks.push({
content: new TextDecoder().decode(enc.decode(chunkTokens)),
chunkIndex: chunks.length,
});
i += chunkSize - overlap;
}
enc.free();
return chunks;
}

Then embed and insert, batching the OpenAI calls:
import OpenAI from "openai";
import { db } from "./db";
import { documents } from "./db/schema";
import { chunkText } from "./chunker";
const openai = new OpenAI();
async function embedBatch(texts: string[]): Promise<number[][]> {
const res = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return res.data.map((d) => d.embedding);
}
export async function ingestDocument(
text: string,
source: string,
title?: string,
) {
const chunks = chunkText(text);
const BATCH_SIZE = 100;
for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
const batch = chunks.slice(i, i + BATCH_SIZE);
const embeddings = await embedBatch(batch.map((c) => c.content));
await db.insert(documents).values(
batch.map((chunk, j) => ({
content: chunk.content,
embedding: embeddings[j],
metadata: { source, title, chunkIndex: chunk.chunkIndex },
})),
);
}
}

Batch at 100 chunks per OpenAI call. The API accepts up to 2048 inputs per request, but smaller batches give you finer-grained retry targets if one fails. For retry logic on transient failures, wrap embedBatch in an exponential backoff helper or push chunks into a queue.
Embedding 10,000 chunks synchronously in a migration script is fine. Embedding during user-driven ingestion in a web server is not. Offload it to a background job.
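If you want the retry behavior mentioned above, here is a minimal sketch of an exponential-backoff wrapper around embedBatch (the retry count and delays are illustrative, not tuned values):
async function embedBatchWithRetry(
  texts: string[],
  maxRetries = 5,
): Promise<number[][]> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await embedBatch(texts);
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // 500ms, 1s, 2s, 4s, ... between attempts
      const delayMs = 500 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}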
Embed the user's question, then find the nearest chunks by cosine distance.
import { cosineDistance, sql } from "drizzle-orm";
import OpenAI from "openai";
import { db } from "./db";
import { documents } from "./db/schema";

const openai = new OpenAI();
export async function retrieve(query: string, k = 5, source?: string) {
const [{ embedding: queryEmbedding }] = (
await openai.embeddings.create({
model: "text-embedding-3-small",
input: [query],
})
).data;
const similarity = sql<number>`1 - (${cosineDistance(documents.embedding, queryEmbedding)})`;
return db
.select({
id: documents.id,
content: documents.content,
metadata: documents.metadata,
similarity,
})
.from(documents)
.where(
source ? sql`${documents.metadata}->>'source' = ${source}` : undefined,
)
.orderBy((t) => sql`${t.similarity} DESC`)
.limit(k);
}

cosineDistance returns a distance (0 = identical, 2 = opposite), so 1 - distance gives you a similarity score where 1 is best. The where clause on metadata->>'source' lets you scope retrieval to a single knowledge base without a separate table, useful when you're indexing multiple document sources into one.
One caveat: with an HNSW index, Postgres applies the metadata filter on top of the candidates the index scan returns, so a highly selective filter can thin out your results. If you have millions of rows and many distinct sources, consider a partial index per source instead. See pgvector indexing options for the trade-offs.
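A partial index is plain DDL you can ship in a migration. A sketch, where the index name and the 'docs' source value are placeholders:
import { sql } from "drizzle-orm";
import { db } from "./db";

// Hypothetical: an HNSW index scoped to one high-volume source.
await db.execute(sql`
  CREATE INDEX IF NOT EXISTS documents_docs_embedding_hnsw_idx
  ON documents USING hnsw (embedding vector_cosine_ops)
  WHERE metadata->>'source' = 'docs'
`);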
Pure vector search misses exact keyword matches. Pure full-text search misses semantic similarity. Hybrid search gets both.
The standard fusion technique is Reciprocal Rank Fusion (RRF). It works on rank positions, not raw scores, so you don't need to normalize incomparable signals:
rrf_score = 1 / (k + rank)
k = 60 is the constant used in the original RRF paper and works well in practice. Here's the full query in raw SQL first, then the Drizzle equivalent:
WITH vector_results AS (
SELECT id,
ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
FROM documents
WHERE metadata->>'source' = $2
ORDER BY embedding <=> $1
LIMIT 60
),
fts_results AS (
SELECT id,
ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', content), query) DESC) AS rank
FROM documents,
plainto_tsquery('english', $3) query
WHERE to_tsvector('english', content) @@ query
AND metadata->>'source' = $2
ORDER BY rank
LIMIT 60
),
rrf AS (
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + f.rank), 0) AS score
FROM vector_results v
FULL OUTER JOIN fts_results f ON v.id = f.id
)
SELECT d.id, d.content, d.metadata, rrf.score
FROM rrf
JOIN documents d ON d.id = rrf.id
ORDER BY rrf.score DESC
LIMIT $4;

The FULL OUTER JOIN is important: a document that appears in only one ranking still gets a score, just a lower one. COALESCE(1.0 / (60 + rank), 0) gives it zero contribution from the missing side.
In Drizzle, wrap this as a raw SQL query until Drizzle adds first-class CTE support for this pattern:
export async function hybridSearch(query: string, source: string, k = 10) {
const [{ embedding: queryEmbedding }] = (
await openai.embeddings.create({
model: "text-embedding-3-small",
input: [query],
})
).data;
const vectorParam = JSON.stringify(queryEmbedding);
return db.execute(sql`
WITH vector_results AS (
SELECT id,
ROW_NUMBER() OVER (ORDER BY embedding <=> ${vectorParam}::vector) AS rank
FROM documents
WHERE metadata->>'source' = ${source}
ORDER BY embedding <=> ${vectorParam}::vector
LIMIT 60
),
fts_results AS (
SELECT id,
ROW_NUMBER() OVER (
ORDER BY ts_rank(to_tsvector('english', content),
plainto_tsquery('english', ${query})) DESC
) AS rank
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', ${query})
AND metadata->>'source' = ${source}
      ORDER BY rank
      LIMIT 60
),
rrf AS (
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + f.rank), 0) AS score
FROM vector_results v
FULL OUTER JOIN fts_results f ON v.id = f.id
)
SELECT d.id, d.content, d.metadata, rrf.score
FROM rrf
JOIN documents d ON d.id = rrf.id
ORDER BY rrf.score DESC
LIMIT ${k}
`);
}

Use hybrid search when your corpus contains proper nouns, product names, or technical identifiers the embedding model might not distinguish reliably ("GPT-4o" vs "GPT-4" is a good example). For general prose retrieval, pure vector search is often good enough.
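A quick usage sketch. The shape of db.execute's result depends on your driver; this assumes node-postgres, where the result carries a rows array, and "product-docs" is a placeholder source value:
const result = await hybridSearch("GPT-4o rate limits", "product-docs", 10);
for (const row of result.rows) {
  console.log(row.score, String(row.content).slice(0, 80));
}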
Pass the retrieved chunks to the LLM as context. The prompt structure matters more than people think.
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
export async function answer(query: string, source?: string) {
const chunks = await retrieve(query, 5, source);
if (chunks.length === 0) {
return { answer: "No relevant information found.", citations: [] };
}
const context = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join("\n\n");
const message = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 1024,
messages: [
{
role: "user",
content: `Answer the question using only the context below. If the answer isn't in the context, say so. Cite sources as [1], [2], etc.
Context:
${context}
Question: ${query}`,
},
],
});
const text =
message.content[0].type === "text" ? message.content[0].text : "";
const citationPattern = /\[(\d+)\]/g;
const citedIndices = new Set<number>();
for (const match of text.matchAll(citationPattern)) {
citedIndices.add(parseInt(match[1]) - 1);
}
return {
answer: text,
citations: [...citedIndices]
.filter((i) => i < chunks.length)
.map((i) => ({
index: i + 1,
source: chunks[i].metadata?.source,
excerpt: chunks[i].content.slice(0, 200),
})),
};
}

Two things worth being explicit about in the prompt: tell the model to use only the provided context (reduces hallucination), and ask for citation markers (lets you show users where the answer came from).
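Wired together, a call looks like this (the source value is a placeholder):
const { answer: text, citations } = await answer(
  "How do I reset my password?",
  "help-center",
);
console.log(text);
for (const c of citations) {
  console.log(`[${c.index}] ${c.source}: ${c.excerpt}`);
}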
Shipping without an eval is fine for a prototype. It's not fine once you start tuning chunk size, switching embedding models, or adding hybrid search. You need to know whether the change actually helped.
The minimum viable eval:
interface EvalCase {
query: string;
relevantIds: number[]; // IDs of chunks that should appear in top-k
}
async function recallAtK(cases: EvalCase[], k = 5): Promise<number> {
let totalRecall = 0;
for (const { query, relevantIds } of cases) {
const results = await retrieve(query, k);
const retrievedIds = new Set(results.map((r) => r.id));
const hits = relevantIds.filter((id) => retrievedIds.has(id)).length;
totalRecall += hits / relevantIds.length;
}
return totalRecall / cases.length;
}
// Usage
const evalSet: EvalCase[] = [
{ query: "How do I reset my password?", relevantIds: [42, 43] },
{ query: "What payment methods do you accept?", relevantIds: [17] },
];
const score = await recallAtK(evalSet, 5);
console.log(`Recall@5: ${(score * 100).toFixed(1)}%`);

Build this eval set manually. Ten to twenty representative queries is enough to catch regressions. Run it every time you change chunk size, switch embedding models, or modify the retrieval query. A drop in Recall@5 tells you the change hurt retrieval even if the generated answers still look plausible.
Precision@k is the complement: of the k chunks retrieved, what fraction were actually relevant? High recall, low precision means you're retrieving noise that pollutes the LLM context. High precision, low recall means you're missing relevant chunks entirely. You want both. (Without this, you're tuning blind.)
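A precision counterpart is the same loop with a different denominator. A sketch reusing the EvalCase shape from above:
async function precisionAtK(cases: EvalCase[], k = 5): Promise<number> {
  let totalPrecision = 0;
  for (const { query, relevantIds } of cases) {
    const results = await retrieve(query, k);
    if (results.length === 0) continue; // counts as zero precision for this case
    const relevant = new Set(relevantIds);
    const hits = results.filter((r) => relevant.has(r.id)).length;
    // fraction of the retrieved chunks that were actually relevant
    totalPrecision += hits / results.length;
  }
  return totalPrecision / cases.length;
}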
If you're running a local LLM already (covered in self-hosting an LLM as a TypeScript developer), you can embed with it too. Ollama's OpenAI compatibility layer exposes embeddings at /v1/embeddings, so the same client works with a different baseURL:
const ollamaClient = new OpenAI({
baseURL: "http://localhost:11434/v1",
apiKey: "ollama", // required but unused
});
async function embedWithOllama(texts: string[]): Promise<number[][]> {
const res = await ollamaClient.embeddings.create({
model: "nomic-embed-text", // 768 dimensions
input: texts,
});
return res.data.map((d) => d.embedding);
}

One catch: nomic-embed-text produces 768-dimensional vectors, not 1536. Your schema's vector(1536) won't accept them. Define a separate table with the right dimensions and rebuild the HNSW index:
export const documentsLocal = pgTable(
"documents_local",
{
id: serial("id").primaryKey(),
content: text("content").notNull(),
embedding: vector("embedding", { dimensions: 768 }),
metadata: jsonb("metadata"),
},
(t) => [
index("documents_local_embedding_hnsw_idx").using(
"hnsw",
t.embedding.op("vector_cosine_ops"),
),
],
);

Quality-wise, nomic-embed-text is competitive with text-embedding-ada-002 on most benchmarks and runs at zero marginal cost. Reasonable default for keeping embeddings out of third-party infrastructure entirely.
If you switch embedding models on an existing corpus, you need to re-embed everything. There's no shortcut: the vector spaces are incomparable. Plan a reindex migration - insert into a new table, validate recall on your eval set, then cut over.
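A sketch of that migration, reusing embedWithOllama and the documentsLocal table from above (the batch size is arbitrary; run it as an offline script, not in a request handler):
import { db } from "./db";
import { documents, documentsLocal } from "./db/schema";

export async function reembedCorpus() {
  const rows = await db
    .select({ content: documents.content, metadata: documents.metadata })
    .from(documents);

  const BATCH_SIZE = 100;
  for (let i = 0; i < rows.length; i += BATCH_SIZE) {
    const batch = rows.slice(i, i + BATCH_SIZE);
    const embeddings = await embedWithOllama(batch.map((r) => r.content));
    await db.insert(documentsLocal).values(
      batch.map((row, j) => ({
        content: row.content,
        embedding: embeddings[j],
        metadata: row.metadata,
      })),
    );
  }
}
Point the eval's retrieve at documents_local, compare Recall@5 against the old numbers, and only then switch the application over.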
The pipeline above runs entirely inside Postgres. No data leaves your database, no new service to page you at 2am, no vector store billing dashboard to watch. The query is SQL. The index is an HNSW tree Postgres already knows how to manage. At the scale most apps actually reach, that's a surprisingly good deal.