PgVector allows you to store and search vector data efficiently in PostgreSQL. Today, I’m going to cover similarity search functions provided by PgVector and see how they can be used to find the nearest neighbors in vector space.
Let's start by looking at the four main distance functions supported by pgvector:
- Euclidean distance (<->)
- Inner product (<#>)
- Cosine distance (<=>)
- Taxicab distance (<+>)
Each of these functions has its own use case, so let's break them down and see how to use them in practice.
Euclidean distance is probably the most intuitive distance metric. It's the straight-line distance between two points in Euclidean space. In PgVector, you can use the <-> operator to calculate Euclidean distance.
Here's an example query:
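A minimal sketch, assuming a hypothetical table named items with a three-dimensional embedding column (the table name, column names, and sample rows are illustrative, not from the original post):

```sql
-- Assumed setup: illustrative names and data only
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]'), ('[3,1,1]');

-- Order rows by Euclidean (L2) distance to the query vector, closest first
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```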
This query will return the 5 nearest neighbors to the vector [3,1,2] based on Euclidean distance.
Inner product is a bit different. It measures the similarity between two vectors by multiplying their corresponding elements and summing the results. In PgVector, you use the <#> operator for inner product. However, there's a catch: pgvector returns the negative inner product since Postgres only supports ASC order index scans on operators.
Here's how you'd use it:
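A sketch of such a query, assuming the same illustrative items table with a vector(3) column named embedding:

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- <#> yields the NEGATIVE inner product, so the default ascending order
-- puts the rows with the largest inner product first
SELECT * FROM items ORDER BY embedding <#> '[3,1,2]' LIMIT 5;
```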
To get the actual inner product, you'd need to multiply the result by -1:
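One way to write that sign flip, again assuming the illustrative items table with id and embedding columns:

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- Multiply by -1 to turn the negative inner product back into the real one
SELECT id, (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
```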
Cosine distance is great when you care about the direction of the vectors but not their magnitude. It's particularly useful for text embeddings. In PgVector, you use the <=> operator for cosine distance.
Here's an example:
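A minimal sketch, assuming the illustrative items table with a vector(3) column named embedding:

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- Order rows by cosine distance to the query vector, closest direction first
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;
```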
If you want cosine similarity instead of distance, you can use:
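Since cosine similarity is one minus cosine distance, the conversion can be sketched like this (table and column names are illustrative assumptions):

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- cosine similarity = 1 - cosine distance
SELECT id, 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
```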
Taxicab distance, also known as Manhattan distance or L1 distance, is the sum of the absolute differences of the coordinates. It's called taxicab distance because it's the distance a taxi would drive in a city laid out in a grid (like Manhattan). In PgVector, you use the <+> operator for taxicab distance.
Here's how to use it:
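A sketch under the same assumptions as before (an illustrative items table with a vector(3) column named embedding). As far as I know, the <+> operator needs pgvector 0.7.0 or later, so check your installed version:

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- Order rows by taxicab (L1) distance to the query vector, closest first
SELECT * FROM items ORDER BY embedding <+> '[3,1,2]' LIMIT 5;
```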
Indexing: To speed up your queries, you can create indexes for each distance function you want to use. For example:
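A sketch using HNSW indexes on the illustrative items table (the table and column names are assumptions; pick only the operator classes you actually query with):

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- The operator class must match the operator used in ORDER BY
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);      -- for <->
CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);      -- for <#>
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);  -- for <=>
CREATE INDEX ON items USING hnsw (embedding vector_l1_ops);      -- for <+>
```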
Note that you need to create a separate index for each distance function.
Normalization: If your vectors are normalized to length 1 (like OpenAI embeddings), use inner product for best performance.
Dimensions: PgVector can index vectors with up to 2,000 dimensions by default (the vector type itself can store up to 16,000). If you need to index more, you can use half-precision indexing for up to 4,000 dimensions or binary quantization for up to 64,000 dimensions.
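Both options are expression indexes. A rough sketch, assuming the illustrative items table and using 3 as a stand-in for your real dimension count:

```sql
-- Assumes a hypothetical table items(id, embedding vector(3));
-- replace 3 with the real number of dimensions

-- Half-precision indexing: index the column cast to halfvec
CREATE INDEX ON items USING hnsw ((embedding::halfvec(3)) halfvec_l2_ops);

-- The query must use the same expression for the index to apply
SELECT * FROM items ORDER BY embedding::halfvec(3) <-> '[3,1,2]' LIMIT 5;

-- Binary quantization: index a bit representation of each vector
CREATE INDEX ON items USING hnsw ((binary_quantize(embedding)::bit(3)) bit_hamming_ops);
```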
Approximate vs Exact Search: By default, PgVector performs exact nearest neighbor search. You can add an index to use approximate nearest neighbor search, which trades some recall for speed.
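With an HNSW index, the recall/speed trade-off can be tuned per session. A sketch, assuming the illustrative items table (the value 100 is an arbitrary example, not a recommendation):

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- Without an index this query is an exact scan; with the index it becomes approximate
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- Larger ef_search (default 40) improves recall at the cost of speed
SET hnsw.ef_search = 100;
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```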
Query Performance: Use EXPLAIN ANALYZE to debug performance issues. You can also increase max_parallel_workers_per_gather to speed up queries without an index.
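Both steps might look like this, assuming the illustrative items table (the worker count of 4 is an example value to tune for your hardware):

```sql
-- Assumes a hypothetical table items(id, embedding vector(3))
-- Show the plan and actual timings, including whether an index was used
EXPLAIN ANALYZE SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

-- Allow more parallel workers for exact (non-indexed) scans in this session
SET max_parallel_workers_per_gather = 4;
```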
Filtering: You can combine vector search with other conditions. For example:
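A sketch of a filtered search, assuming the illustrative items table also has a category_id column (a hypothetical column used only for this example):

```sql
-- Assumes a hypothetical table items(id, category_id, embedding vector(3))
-- Restrict candidates with WHERE, then rank the remaining rows by distance
SELECT * FROM items
WHERE category_id = 123
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
```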
Consider creating an index on the filtering column or using a partial index for best performance.
PgVector provides four main similarity search functions:
- Euclidean distance (<->)
- Inner product (<#>)
- Cosine distance (<=>)
- Taxicab distance (<+>)
Each has its use case and can be optimized with appropriate indexing. Remember to consider normalization, dimensions, and performance when working with vector embeddings in Postgres.