PgVector similarity search distance functions

PgVector lets you store and search vector data efficiently in PostgreSQL. Today, I’m going to cover the similarity search functions PgVector provides and show how to use them to find nearest neighbors in vector space.

Let's start by looking at the four main distance functions supported by pgvector:

Euclidean distance (L2 distance)
Inner product
Cosine distance
Taxicab distance (L1 distance)

Each of these functions has its own use case, so let's break them down and see how to use them in practice.

Euclidean distance (L2 distance)

Euclidean distance is probably the most intuitive distance metric. It's the straight-line distance between two points in Euclidean space. In PgVector, you can use the <-> operator to calculate Euclidean distance.

Here's an example query:

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

This query will return the 5 nearest neighbors to the vector [3,1,2] based on Euclidean distance.
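
To make the ordering concrete, here's a minimal Python sketch of the same calculation done by hand. It's only an illustration, not how PgVector implements the operator; the `l2_distance` helper and the sample rows are made up for this example.

```python
import math

def l2_distance(a, b):
    """Straight-line (Euclidean) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [3, 1, 2]

# Hypothetical rows standing in for the items table: id -> embedding
items = {1: [1, 2, 3], 2: [4, 5, 6], 3: [3, 1, 2]}

# Ascending sort by distance: closest item first
nearest = sorted(items, key=lambda item_id: l2_distance(items[item_id], query))
print(nearest)
```

Sorting by this value in ascending order mirrors what ORDER BY embedding <-> '[3,1,2]' does in the query.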

Inner product

Inner product is a bit different. It measures the similarity between two vectors by multiplying their corresponding elements and summing the results, so a larger value means the vectors are more similar. In PgVector, you use the <#> operator for inner product. However, there's a catch: pgvector returns the negative inner product since Postgres only supports ASC order index scans on operators. Negating the value means the default ascending sort returns the most similar rows first.

Here's how you'd use it:

SELECT * FROM items ORDER BY embedding <#> '[3,1,2]' LIMIT 5;

To get the actual inner product, you'd need to multiply the result by -1:

SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
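
The sign flip is easier to see with numbers. Here's a small Python sketch, assuming made-up sample rows and hypothetical helper names; it imitates the arithmetic described above rather than PgVector's actual code.

```python
def inner_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def negative_inner_product(a, b):
    """The value <#> is described as returning: the inner product, negated."""
    return -inner_product(a, b)

query = [3, 1, 2]

# Hypothetical rows standing in for the items table: id -> embedding
items = {1: [1, 2, 3], 2: [4, 5, 6], 3: [-1, 0, 1]}

# Ascending sort on the negated value puts the largest inner product first
ranked = sorted(items, key=lambda i: negative_inner_product(items[i], query))

for item_id in ranked:
    negated = negative_inner_product(items[item_id], query)
    # Multiplying by -1 recovers the actual inner product
    print(item_id, negated, negated * -1)
```

The item with the largest inner product comes out first even though the sort is ascending, which is the whole point of the negation.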

Cosine distance

Cosine distance is great when you care about the direction of the vectors but not their magnitude. It's particularly useful for text embeddings. In PgVector, you use the <=> operator for cosine distance.

Here's an example:

SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;

If you want cosine similarity instead of distance, you can use:

SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
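
Here's a short Python sketch of the formula behind those two queries, to show why magnitude drops out. The `cosine_distance` helper is hypothetical and assumes non-zero vectors; it's an illustration, not PgVector's implementation.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Same direction, double the magnitude: distance is (approximately) 0
same_direction = cosine_distance([3, 1, 2], [6, 2, 4])

# Perpendicular vectors: distance is 1
perpendicular = cosine_distance([1, 0], [0, 1])

# Opposite directions: distance is 2
opposite = cosine_distance([1, 0], [-1, 0])

# Cosine similarity is 1 minus the distance, as in the query above
similarity = 1 - same_direction
print(same_direction, perpendicular, opposite, similarity)
```

So cosine distance runs from 0 (same direction) to 2 (opposite direction), and scaling a vector doesn't change it.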

Taxicab distance (L1 distance)

Taxicab distance, also known as Manhattan distance or L1 distance, is the sum of the absolute differences of the coordinates. It's called taxicab distance because it's the distance a taxi would drive in a city laid out in a grid (like Manhattan). In PgVector, you use the <+> operator for taxicab distance.

Here's how to use it:

SELECT * FROM items ORDER BY embedding <+> '[3,1,2]' LIMIT 5;
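
As a quick worked example, here's the same sum of absolute differences in plain Python. The helper name and sample vectors are made up for illustration.

```python
def l1_distance(a, b):
    """Sum of absolute coordinate differences (taxicab / Manhattan distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))

query = [3, 1, 2]

# |1-3| + |2-1| + |3-2| = 2 + 1 + 1
near = l1_distance([1, 2, 3], query)

# |4-3| + |5-1| + |6-2| = 1 + 4 + 4
far = l1_distance([4, 5, 6], query)

print(near, far)
```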

Tips and Gotchas

Indexing: To speed up your queries, you can create indexes for each distance function you want to use. For example:

CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON items USING hnsw (embedding vector_l1_ops);

Note that you need a separate index for each distance function, and a query only uses an index whose operator class matches the operator in its ORDER BY.

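Normalization: If your vectors are normalized to length 1 (like OpenAI embeddings), use inner product for best performance. On unit vectors it ranks results the same way cosine distance does, without having to compute the norms.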
Dimensions: The vector type can store up to 16,000 dimensions, but PgVector can only index vector columns with up to 2,000 dimensions. If you need to index more, you can use half-precision indexing for up to 4,000 dimensions or binary quantization for up to 64,000 dimensions.
Approximate vs Exact Search: By default, PgVector performs exact nearest neighbor search. You can add an index to use approximate nearest neighbor search, which trades some recall for speed.
Query Performance: Use EXPLAIN ANALYZE to debug performance issues. You can also increase max_parallel_workers_per_gather to speed up queries without an index.
Filtering: You can combine vector search with other conditions. For example:
```
SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```
Consider creating an index on the filtering column or using a partial index for best performance.
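
The normalization tip is easy to check by hand. Here's a minimal Python sketch, assuming made-up vectors and hypothetical helper names, that normalizes a few vectors to length 1 and compares the ranking by negative inner product with the ranking by cosine distance.

```python
import math

def normalize(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = normalize([3, 1, 2])

# Hypothetical rows, normalized before storage: id -> unit-length embedding
items = {
    1: normalize([1, 2, 3]),
    2: normalize([4, 5, 6]),
    3: normalize([-1, 0, 1]),
    4: normalize([3, 1, 2.5]),
}

# Rank by negative inner product (what <#> sorts on) and by cosine distance
by_inner_product = sorted(items, key=lambda i: -dot(items[i], query))
by_cosine = sorted(items, key=lambda i: cosine_distance(items[i], query))

print(by_inner_product, by_cosine)
```

Both orderings come out the same, because for unit vectors the cosine distance is just 1 minus the inner product.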

tl;dr

PgVector provides four main similarity search functions:

Euclidean distance (<->)
Inner product (<#>)
Cosine distance (<=>)
Taxicab distance (<+>)

Each has its use case and can be optimized with appropriate indexing. Remember to consider normalization, dimensions, and performance when working with vector embeddings in Postgres.
