Query Time Constrained Approximate Nearest Neighbor Search

Photo by
Christopher Burns on Unsplash

This blog post describes Vespa’s support for combining approximate nearest neighbor search,
or vector search, with query-time filters or constraints to tackle real-world search and recommendation use cases at scale.
The first section covers the data structures Vespa builds to support fast search in vector fields and other field types.
It then goes through the two primary strategies for combining vector search with filters; pre- or post-filtering.
Finally, an intro to two Vespa parameters for controlling vector search filtering behavior to
achieve optimal functionality with the lowest possible resource usage.


Many real-world applications require query-time constrained vector search. For example, real-time recommender systems
using vector search for candidate retrieval
need to filter recommendations by hard constraints like regional availability or age group suitability.
Likewise, search systems using vector search need to support filtering.
For example, typical e-commerce search interfaces allow users to navigate and filter search results using result facets.

Vespa’s document model supports representing multiple field and collection types in the same
document schema.
Supported Vespa schema field types
include string, long, int, float, double, geo position, bool, byte, and tensor fields.
Vespa’s first-order dense tensor fields represent vector fields.
Vespa’s tensor fields support different tensor cell precision types,
ranging from int8 for binary vectors to bfloat16, float, and double for real-valued vectors.
Allowing vectors and other field types in the same document schema enables searching the vector field(s) in
combination with filters expressed over other fields.

This blog post uses the following Vespa document schema
to exemplify combining vector search with filters:

schema track {

  document track {

    field title type string {
      indexing: index | summary
      index: enable-bm25
      match: text 

    field tags type array<string> {
      indexing: attribute | summary 
      attribute: fast-search
      rank: filter
      match: exact 

    field embedding type tensor<float>(x[384]) {
      indexing: attribute | index 
      attribute {
        distance-metric: euclidean
  rank-profile closeness {
    inputs {
       query(query_embedding) tensor<float>(x[384])      
    first-phase { 
      expression: closeness(field, embedding) 

The practical nearest neighbor search guide
uses a similar schema, indexing a subset of the last.fm track dataset.
The simplified track document type used in this post contains three fields: track title, track tags, and track embedding:

  • title is configured for regular text indexing and matching using the default matching mode for
    indexed string fields, match:text.
  • tags is an array of strings,
    configured for exact database-style matching using Vespa’s match:exact.
  • embedding is a first-order tensor (vector) using float tensor cell precision.
    x[384] denotes the named dimension (x) with dimensionality (384). A vector field searched using
    Vespa’s nearestNeighbor
    query operator must define a distance-metric.
    Also see the Vespa tensor user guide.

The embedding vector field can be produced by, for example, a dense embedding model like

Vespa builds data structures for efficient query evaluation for fields with
indexing: index or
attribute fields
defined with the attribute: fast-search property.
The data structures used for non-tensor fields are variants of the classic inverted index data structure.
The inverted index data structure enables fast query evaluation of boolean queries, expressed using
Vespa’s query language:

select * from track where tags contains "90s" and tags contains "pop" 

Vespa builds Hierarchical Navigable Small World (HNSW)
graphs for first-order dense tensor fields or vector fields to support a fast, approximate nearest neighbor search.
Read more in the introducing HNSW support in Vespa
blog post.

Vespa’s implementation of the HNSW graph data structure allows for fast approximate nearest neighbor searches like:

select * from track where {targetHits:3, approximate:true}nearestNeighbor(embedding, query_embedding) 

This example searches for the three (approximate) nearest neighbors of the
input query embedding vector using the configured

Figure 1 illustrates, on a high level, Vespa’s data structures for all types of fields,
including vector fields with
HNSW enabled.

Search query planning and estimation

Vespa builds a query plan that tries to optimize the query execution.
A significant part of the execution planning is estimating how many documents each branch of the query tree will match.
The estimation process uses information present in per-field inverted index data structures.

Query terms searching fields with inverted index structures enabled,
use the size of posting lists as the hit count estimate. Other terms in the query
might use the number of searchable documents as an estimate, as it’s not known how many hits they will produce.
Furthermore, subtrees of ANDed terms use the minimum estimate of their children,
and OR-terms use the saturated sum of their children.
Finally, the complete query hit estimate is scaled with
the number of searchable documents to get an estimated-hit-ratio [0, 1].

Using the high-level illustration in Figure 1, a query for tags contains “90s” or tags contains “pop”
is estimated to match the sum of the length of the posting lists of the two terms (4+7=11).
A query for tags contains “90s” and tags contains “pop” is estimated to match
at most four documents (min(4,7) = 4). The hit estimates determine the query execution plan.
An optimal query evaluation for 90s and pop would start with the shortest posting list (90s)
and intersect with the postings of the longest (pop).
The query execution with the intersection of these two posting lists will only match one document (D9),
which is less than the estimated hit count. Estimation is a best-effort process,
overestimating the number of documents the query will match.

The posting lists of the inverted index can be of different granularity, which can help optimize the query execution.
For example, using rank: filter for attribute fields
with fast-search, enables compact posting list representation for frequent terms.

Combining exact nearest neighbor search with filters

Given a query vector, Vespa’s nearestNeighbor
query operator finds the (targetHits) nearest neighbors using the configured

The retrieved hits are exposed to first-phase ranking,
where the retrieved neighbors can be re-ranked using more sophisticated ranking models
beyond the pure vector similarity.

The query examples in this blog post use the closeness rank feature
directly as the first-phase rank expression:

rank-profile closeness { 
  first-phase { 
    expression: closeness(field, embedding) 

Developers might switch between using approximate:true, which searches the HNSW
graph data structure, and using exact search, setting approximate:false.
The ability to switch between approximate and exact enables quantifying accuracy
loss when turning to the faster, but approximate search. Read more about HNSW parameters and accuracy versus performance
tradeoffs in the Billion-scale vector search with Vespa – part two blog post.

It’s trivial to combine the exact nearest neighbor search with query-time filters, and the computational
complexity of the search is easy to understand. For example, the following query searches for the exact
nearest neighbors of the query embedding using the euclidean distance-metric,
restricting the search for neighbors to only consider tracks tagged with rock:

select * from track where {targetHits:5, approximate:false}nearestNeighbor(embedding, query_embedding) and tags contains "rock" 

The query execution planner estimates that the exact nearestNeighbor query operator
will match all searchable documents,
while the tags term will match a restricted subset.
The most optimal way to evaluate this query is to first find the documents matching the filter,
and then perform an exact nearest neighbor search. During the exact search, Vespa reads the vector data
from the tensor attribute store and does not touch the HNSW graph.

Figure 2 illustrates the computational cost (driven mainly by distance calculations)
versus an increasing number of filter matches (% hit count rate) using exact search filters.
A more restrictive filter uses less resources as it involves fewer distance calculations.

Note that the figure uses computational cost and not latency.
It is possible to reduce search latency by using more
threads to parallelize
the exact search,
but the number of distance calculations involved in the query execution stays the same.

Combining approximate nearest neighbor search with filters

Using exact nearest neighbor search for web-scale search and recommendation problems
quickly becomes prohibitively expensive. As a result, many turn to the less resource-intensive
approximate nearest neighbor search, accepting an accuracy loss to reduce serving cost.
There are two high-level strategies for combining boolean query evaluation over
inverted indexes with approximate nearest neighbor search: post-filtering and pre-filtering.

Post-filtering strategy

This strategy evaluates the approximate nearest neighbors first and runs the constrained filter
over the retrieved targetHits hits. This strategy is characterized as post-filtering as the
filter constraints are considered only over the retrieved targetHits closest neighbors.
The disadvantage of this approach is that restrictive filters (for example, the tag 90s from Figure 1)
will cause the search to expose fewer hits to ranking than the wanted targetHits.
In the worst case, the post-filters eliminate all retrieved neighbors and expose zero documents to ranking.
The advantage of this strategy is that the serving performance impact of constrained search is negligible
compared to unconstrained approximate nearest neighbor search.
Another aspect is that the hits which survive the post-filter are within the original targetHits.
In other words, the neighbors exposed to ranking are not distant compared to the nearest.

Pre-filtering strategy

This strategy evaluates the filter part of the query over the inverted index structures first.
Then, it uses the resulting document IDs from the filter execution as input to the search for approximate nearest neighbors.
The search for neighbors traverses the HNSW graph and each candidate neighbor is looked up in the
document ID list from the filter execution.
Neighbors not on the document ID list are ignored and the greedy graph search continues until
targetHits hits have been found.

This filtering strategy is known as pre-filtering, as the filters go first before searching
for the approximate nearest neighbors. With pre-filtering, the probability of exposing targetHits
to the ranking phase is much higher than with the post-filtering approach.
The disadvantage is that the performance impact is higher than with the post-filter strategy.
In addition, the retrieved neighbors for a restrictive filter might be somewhat distant neighbors
than the nearest neighbor of the query embedding.
Distant neighbors could be combated by specifying a
distance threshold
as another constraint for the approximate nearestNeighbor query operator.

Figure 3 summarizes the strategies used for approximate nearest neighbor search with filtering.
In the case of
post-filtering, only one hit is exposed to the ranking phase after the post-filter is
evaluated. In the case of pre-filtering, two additional hits were exposed, but they
are more distant neighbors.

Filtering strategies and serving performance

From a performance and serving cost perspective, one can summarize on a high level:

  • Unconstrained approximate nearest neighbor search without filters is the fastest option, but real-world applications
    need constrained vector search. Pre-building several nearest neighbor indexes using pre-defined constraints offline also
    cost resources and index management.

  • Post-filtering is less resource-intensive than pre-filtering for the same number of targetHits. Increasing
    targetHits to combat the effect of post-filtering changes cost of the query as increasing targetHits increases the
    number of distance calculations.

  • Pre-filtering uses more resources than post-filtering for the same number of targetHits.
    Pre-filtering needs to execute the filter and then search the HNSW graph,
    constrained by the document ID list produced by the pre-filter execution.

Figure 4 Approximate search with pre-filtering versus exact search with pre-filtering

Figure 4 illustrates the performance characteristics of approximate search and exact search when using pre-filtering
with increasing hit count rate.

With the pre-filtering strategy, searching the HNSW graph with a restrictive pre-filter
result causes the greedy graph search to traverse many nodes before finding the targetHits
which satisfies the document ID list from the pre-filter execution.

Due to this, there is a sweet spot or threshold where an exact search with filters
has a lower computational cost than an approximate search using the document ID list from the pre-filter execution.
The actual threshold value depends on vector dimensionality, HNSW graph properties,
and the number of vectors indexed in the Vespa instance.

Vespa exposes two parameters that control the query-time filtering strategy.
These parameters give the