Approximate Nearest Neighbor Search in Vespa – Part 1

Searching for the nearest neighbors of a data point in a high dimensional vector space is an important problem for many real-time applications.
For example, in Computer Vision it can be used to find similar images in large image datasets.
In Information Retrieval, pre-trained text embeddings can be used to match documents based on the distance between query and document embeddings.
In many of these applications,
the document corpus is constantly evolving and the search is constrained by query filters applied on the metadata of the data points.
For example, in E-commerce,
a nearest neighbor search for products in a vector space would typically be constrained by product metadata like inventory status and price.

Vespa (the open source big data serving engine) already supports exact
nearest neighbor search
that is integrated with the Vespa query tree and its filter support.
This enables you to get the exact nearest neighbors meeting the filter criterias of the query.
This works well when the number of documents to calculate nearest neighbors for is small,
e.g when the query filter is strong, or the document corpus is small.
However, as the number of documents to consider increases, we want to trade exactness for performance and we need an approximate solution.

This blog post is part 1 in a series of blog posts where we share how the Vespa team implemented an approximate nearest neighbor (ANN) search algorithm.
In this first post, we’ll explain why we selected HNSW (Hierarchical Navigable Small World Graphs)
as the baseline algorithm and how we extended it to meet the requirements for integration in Vespa.
Requirements included supporting real-time updates, integration with the Vespa query tree, and being flexible enough to fit a wide range of use cases.

Requirements & Challenges

Vespa is an open source real-time big data serving engine.
In a typical application, the document corpus is constantly evolving.
New documents are added and removed, and metadata is being updated.
A typical use case is news article search and recommendation, keeping a week, month or year’s worth of articles,
continually adding or updating new articles while selectively removing the oldest ones.
In this case, nearest neighbor search over pre-trained text embeddings could be used in combination with query filters over metadata.

Based on the existing functionality and flexibility of Vespa,
we defined a set of requirements an ANN search algorithm had to support to be a good fit for integration:

  • Indexes used in the algorithm must be real-time updateable with low latency and high throughput.
    Data points can be added, updated and removed, and the entire corpus should not be needed when creating the index.
  • Concurrent querying with low latency and high throughput without huge performance impact due to ongoing update operations.
  • Seamless integration with the Vespa query tree and its filter support.
    This enables correct results, compared to just increasing top K for the nearest neighbor search algorithm to compensate when query filters are used.

Algorithms Background

There exists a lot of algorithms for ANN search.
Many of these are summarized and analyzed in
Approximate Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement (v1.0)
A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search.
In addition, benchmark results for different algorithms over different datasets are summarized in
ANN Benchmarks.

Existing algorithms are mainly focused on a stable document corpus where all data points in the vector space are known up front.
In this case the index structure is built once, and then used for nearest neighbor searches.
To support real-time updates, we looked for algorithms that were either incremental in nature or could be modified to meet this requirement.

There are three broad categories of algorithms: tree-based, graph-based and hash-based.
We choose one from each category for a more detailed exploration.
This selection was based on how they performed in benchmarks within papers and in
ANN Benchmarks,
and how easy the algorithm was to modify to fit our requirements.

We ended up with the following:

  • Annoy
    Annoy is tree-based and described in
    Nearest neighbors and vector models – part 2 – algorithms and data structures.
    It takes a straightforward engineering approach to the ANN problem, and is quite easy to understand and implement.
    An Annoy index consists of N binary trees, where each tree partitions the vector space using random hyperplanes at each node in the tree.
    The implementation is available on github.
  • HNSW – Hierarchical Navigable Small World Graphs
    This is graph-based and described in
    Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
    HNSW builds a hierarchical graph incrementally,
    and has great search performance with high recall, motivating us to prototype it for comparison.
    Each node in the graph represents a point in the vector space, and nodes are linked to other nodes that are close in space.
    The implementation is available on github.
  • RPLSH – Random Projection based Locality Sensitive Hashing
    This is hash-based and described in
    A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search.
    The implementation is available on github.
    Since the random hyperplanes used for projections can be selected up-front
    (only depending on the number of dimensions of the vector space) this approach is data-independent.
    For our purposes, we could use the hash value as a filter,
    where only documents having most bits in common with the hash value of the query data point would be considered for full distance computation.
    This would give us little extra memory and CPU usage and would be easy to fit into our exact (brute-force) nearest neighbor feature.
    It was the first algorithm that we implemented as a prototype.
    However, in our prototype we found that getting a significant speedup required a large sacrifice of quality.
    For more info, see Dropping RPLSH below.

We created prototype implementations of the three algorithms in C++.
This gave us a deeper understanding and the ability to make necessary modifications to support real-time updates and query filters.
The prototypes were also used to do low-level benchmarking and quality testing.

Prototype Benchmark

We benchmarked the three prototype implementations to look at indexing throughput and search throughput with different document corpus sizes.
We used the
1M SIFT (128 dim)
1M GIST (960 dim)
datasets, where one vector corresponds to a document.
In this section, we summarize the results of the 1M SIFT tests. Findings were similar with the 1M GIST dataset.


The tests were executed in a CentOS 7 Docker image using Docker Desktop on a MacBook Pro (15 inch, 2018)
with a 2.6 GHz 6-Core Intel Core i7 CPU and 32GB of memory. The Docker setup was 8 CPUs, 16GB of memory and 1GB of Swap.

The configuration of the algorithms were as follows:

  • Float size: 4 bytes.
  • Annoy: Number of trees = 50. Max points in leaf node = 128.
  • HNSW: Max links per node (M) = 16. Note that level 0 has 32 links per node (2*M). Number of neighbors to explore at insert (efConstruction) = 200.
  • RPLSH: 512 bits used for hash and linear scan of hash table.

Indexing Throughput

The 1M SIFT dataset was split into chunks of first 100k, 200k, and 500k vectors,
and indexing throughput was tested with different corpus sizes.
The index structures for all algorithms started empty, and we added one document at a time to simulate real-time feeding.
One thread was used. Necessary adjustments to the prototype implementations were done to accommodate this.
Note that the data points for 10M documents are estimated based on the 100k and 1M data points.


  • Indexing throughput depends on corpus size for Annoy and HNSW, where throughput is halved when corpus size is increased by 10x.
  • Indexing throughput for RPLSH is independent of corpus size.
  • Annoy is 4.5 to 5 times faster than HNSW.
  • RPLSH is 23 to 24 times faster than HNSW at 1M documents.

Search Throughput (QPS)

The search throughput was measured after the indexes were built, using a single benchmarking client and search thread.
We used the query vectors from the dataset, and searched for the nearest K=100 neighbors from the query vector.
To get comparable quality results between the algorithms we had to adjust some of the parameters used.
The default behavior for Annoy is to collect K=100 candidates per tree, and then calculate brute force distance for all those.
To get comparable quality between Annoy and HNSW we had to collect 8000 candidates instead of 5000.
For RPLSH, we used the Hamming distance between hashes to pre-filter candidates after collecting 200 candidates into a heap,
skipping full distance calculation for most candidates. For HNSW, we asked for 100 candidates.


  • HNSW outperforms Annoy and RPLSH. At corpus size 1M the QPS is 9 times as high as Annoy,
    and 16 times as high as RPLSH at comparable quality.
    Similar observations between hnswlib and Annoy are found in ANN Benchmarks,
    where the QPS of hnswlib is 5-10 times higher at the same quality on all tested datasets.
  • The HNSW search algorithm depends heavily on the number of links between nodes, which again depends on corpus size.
    The QPS is halved when the corpus size is increased by 10x.
    We see the same during indexing as that uses the search algorithm to find candidates to connect to.
  • The Annoy search algorithm is less dependent on corpus size, at least with small corpus sizes.
    The static cost is driven by brute force calculation of distances for the candidate set, 8000 in this case.
    With high corpus size the cost of traversing the binary trees will likely match and exceed the static cost.
    We don’t know where this limit is.
  • For RPLSH, doing exhaustive search (linear scan) of the hashes is more expensive than expected.
    One major consideration here is that the data reduction (from 128 dimensions*32 bit floats, to 512 bits hash) is just a factor 8,
    and indeed we got around 8 times speedup compared to brute-force linear scan.

Index Structures & Memory Usage

The prototype implementations used simple data structures to represent the index structures of the algorithms in memory.
Only one thread was used for indexing and searching and we didn’t have to worry about locking.
These challenges would have to be overcome in the actual implementation of the selected algorithm.
We created a rough index structure design for both Annoy and HNSW to see how much memory the index would use.

In Vespa, a data point in a high dimensional vector space is represented by a
of rank one with dimension size equal to the dimensionality of the vector space.
To use this, a document type with a tensor field of such type is defined, and documents with tensors are fed to Vespa.
A document type typically consists of multiple other fields as well, for instance, metadata fields and other tensor fields.
The tensors across all documents for a given tensor field are stored in-memory.
The index structures were designed such that they just reference the tensor data points stored in-memory.
In addition, they support one writer thread doing changes, while multiple reader threads are searching, without use of locking.
The actual index structure details for HNSW will be covered in an upcoming blog post.

Annoy Index

An Annoy index consists of multiple binary trees. Each tree consists of split nodes and leaf nodes.
A split node has an offset from origo, a hyperplane that splits the space in two, and reference to left and right child nodes.
A leaf node contains the document ids of the points that ended up in that portion of space.
The document ids are used to lookup the actual tensor data from memory.

HNSW Index

An HNSW index consists of navigable small world graphs in a hierarchy.
Each document in the index is represented by a single graph node.
Each node has an array of levels, from level 0 to n.
The number of levels is constant during the lifetime of the node and is drawn randomly when the node is created.
All nodes have

Using approximate nearest neighbor search in real world applications

From text search and recommendation to ads and online dating, ANN search rarely works in isolation

Anything can be represented by a list of numbers.

For instance, text can be represented by a list of numbers describing the
text’s meaning. Images can be represented by the objects it contains. Users of
a system can be represented by their interests and preferences. Even time-based
entities such as video, sound, or user interactions can be represented by a
single list of numbers.

These vector representations describe content or meaning: the original,
containing thousands of characters or pixels, is compressed to a much smaller
representation of a few hundred numbers.

Most often, we are interested in finding the most similar vectors. This is
called k-nearest neighbor (KNN) search or similarity search and has all kinds
of useful applications. Examples here are model-free classification, pattern
recognition, collaborative filtering for recommendation, and data compression,
to name but a few. We’ll see some more examples later in this post.

However, a nearest neighbor search is only a part of the process for many
applications. For applications doing search and recommendation, the potential
candidates from the KNN search are often combined with other facets of the
query or request, such as some form of filtering, to refine the results.

This can severely limit the quality of the end result, as post-filtering can
prevent otherwise relevant results from surfacing. The solution is to integrate
the nearest neighbor search with filtering, however most libraries for nearest
neighbor search work in isolation and do not support this. To my knowledge, the
only open-source platform that does is

In this post, we’ll take a closer look at approximate neighbor search, explore
some real cases combining this with filtering, and delve into how
solves this problem.

Finding the (approximate) nearest neighbors

The representations can be visualized as points in a high-dimension space, even
though it’s kind of difficult to envision a space with hundreds of dimensions.
This allows us to think of these points as vectors, sometimes called thought
vectors, and we can use various distance metrics to measure the likeness or
similarity between them. Examples are the dot (or inner) product, cosine angle,
or euclidean distance.

The 5 nearest neighbors

Finding the nearest neighbors of a point is reasonably straight-forward: just
compute the similarity using the distance metric between the point and all
other points. Unfortunately, this brute-force approach doesn’t scale well,
particularly in time-critical settings such as online serving, where you have a
large number of points to consider.

There are no known exact methods for finding nearest neighbors efficiently. As
both the number of points increases and the number of dimensions increase, we
fall victim to the curse of dimensionality. In high dimensions, all points are
almost equally distant from each other. A good enough solution for many
applications is to trade accuracy for efficiency. In approximately nearest
neighbors (ANN), we build index structures that narrow down the search space.
The implicit neighborhoods in such indexes also help reduce the problem of high

You can roughly divide the approaches used for ANNs into whether or not they
can be implemented using an inverse index. The inverse index originates from
information retrieval and is comparable to the index often found at many books’
back. This index points from a word (or term) to the documents containing it.
This can be used for ANNs as well. Using k-means clustering, one can cluster
all points and index them by which cluster they belong to. A related approach
is product quantization (and its relatives), which splits the vectors into
products of lower-dimensional spaces. Yet another is locality-sensitive
hashing, which uses hash functions to group similar vectors together. These
approaches index the centroids or buckets.

A method that is not compatible with inverted indexes is HNSW (hierarchical
navigable small world). HNSW is based on graph structures, is efficient,
and lets the graph be incrementally built at runtime. This is in contrast to
most other methods that require offline, batch-oriented index building.

As approximate nearest neighbor search has many applications, quite a few tools
and libraries exist. A few examples are:

A good overview of tradeoffs for these can be found at

Nearest neighbors in search and recommendation

In many applications, such as search and recommendation, the results of the
nearest neighbor search is combined with additional facets of the request. In
this section, we’ll provide some examples of when this becomes problematic.

Only 2 of the 5 nearest neighbors remain after filtering

Modern text search increasingly uses representation vectors, often called text
embeddings or embedding vectors. Word2vec was an early example. More recently,
sophisticated language understanding models such as BERT and other
Transformer-based models are increasingly used. These are capable of assigning
different representations for a word depending upon the context. For text
search, the current state-of-the-art uses different models to encode query
vectors and document vectors. These representations are trained so that the
inner product of these vectors is maximized for relevant results.

Using embedding vectors in text search is often called semantic search. For
many text search applications, we would like to combine this semantic search
with other filters. For instance, we can combine a query for “approximate
nearest neighbor” with a date filter such as “2020”. The naive approach here is
to use one of the ANN libraries mentioned above to perform a nearest neighbor
search and then filter out the results.

However, this is problematic. Imagine that 1000 documents are relevant to the
query “approximate nearest neighbor”, with 100 added each year over the past 10
years. Assume they all are approximately equally likely to be retrieved from
the ANN. So, retrieving the top 100 will result in about 10 documents from each
year. Applying the filter “2020” will result in only 10 documents. That means
the other 90 relevant documents from 2020 are missed.


Recommender systems, such as YouTube and TikTok, are built to provide
continually interesting content to all users. As such, it’s essential to learn
the interests or preferences of the user. Such user profiles are represented by
one or more vectors, as are the items that should be recommended.

These vectors are often generated by using some form of collaborative
filtering. One method is matrix factorization, where the maximum inner product
is used as a distance function. Deep learning approaches have recently shown
great promise, trained explicitly for the distance function between the user
and item vector.

Recommendation systems employ filters to a great degree. Examples are filters
for age-appropriate content, NSFW labels, availability of content in various
regions due to distribution rights, and user-specified filters blocking certain
content. These are examples of direct filters. More indirect filters come in
the form of business rules such as diversity and de-duplication, which filters
out content that has already been recommended.

The problem of filtering is more evident for recommendation systems than for
text search. These filters’ quantity and strength lead to a greater probability
that items retrieved from the ANN search are filtered away. So, only a few of
the relevant items are actually recommended.

Serving ads

Ad serving systems work much like recommender systems. Given a user
profile and a context such as a search query or page content, the system should
provide an advertisement relevant to the user. The advertisements are stored
with advertiser-specific rules, for instance, who the ad or campaign should
target. One such rule is to not exceed the budget of the campaign.

These rules function as filters. Like with text search and recommendation, if
these filters are applied after the user-profile based retrieval, there is a
probability that an appropriate advertisement is not retrieved. This is
particularly important regarding the budget. Income is lost if there are no
results retrieved with an available spending budget.

Online dating

In the world of online dating, people have a set of preferences. These can be
binary such as gender, age range, location, height, and so on. Interests might
be less absolute, such as hiking, loves pets, traveling, and exercise. These
interests and preferences can be represented by a vector, and at least parts
can be compressed to a representation vector as well.

Suppose retrieval is based on an ANN over interests, and the preferences are
applied as a filter afterward. In that case, it’s clear why online dating is
hard. As we retrieve the best matches from the ANN, there is a significant
probability that all or most of these are filtered out, for instance, by
location or by profiles already seen.

Local search and recommendation is based on geographical location. Given
longitude and latitude coordinates, we can find places or businesses within
certain distances from a point: finding restaurants near a user is a typical
case. Imagine that we have the dining preferences of a user represented as a
vector. Likewise, all restaurants are represented by vectors. Then, by
performing an ANN search followed by a location filter, we could retrieve the
restaurants preferred by the user in their local area.

However, this would not work. Of all the restaurants in the world, only a small
fraction are close to the user. The location filter is much stronger than the
ANN retrieval. So with a high probability, no results would be produced at all.


The naive approach to the problems above is simply to request more candidates
from the ANN search. This obviously hurts performance, as the workload of both
the ANN and post-filtering increases. Besides, this is not guaranteed to work.
If you have a strong filter independent of the ANN, there is a real chance of
not producing any results at all. The local restaurant case is an example of
this, where the location is a strong filter independent of the user

The real solution here is to integrate the filters into the ANN search. Such an
algorithm would be able to reject candidates early that don’t pass the filters.
This effectively increases the search area from the query point dynamically
until enough candidates are found. This guarantees that the requested number of
candidates are produced.

Unfortunately, for most ANN libraries, this is not an option as they work in

The 5 nearest neighbors with integrated filtering is to my knowledge the only implementation of ANN that supports
integrated filtering. The implementation is based on a modified HNSW graph
algorithm, and innovates in 3 main areas:

  • Dynamic modification of the graph. Most ANN algorithms require the index to
    be built offline, but HNSW supports incremental building of the index. Vespa
    takes advantage of this and supports both adding and removing items in
    real-time while serving.
  • Multi-threaded indexing using lock-free data structures and copy-on-write
    semantics drastically increase the performance of building the index.
  • Metadata filtering modifies the algorithm to skip non-eligible candidates.

To support filtering, first evaluates the filters to create a list of
eligible candidates. During the ANN search, a point close to the query point is
selected and the graph is explored by following each node’s edge to its
neighbors. Candidates not in the eligibility list are skipped, and the search
continues until we have produced enough candidates.

There is a small problem here however. If the eligibility list is small in
relation to the number of items in the graph, skipping occurs with a high
probability. This means that the algorithm needs to consider an exponentially
increasing number of candidates, slowing down the search significantly. To
solve this, switches over to a brute-force search when this occurs.
The result is a efficient ANN search when combined with filters.

About is an open-source platform for building applications that do real-time
data processing over large data sets. Designed to be highly performant and
web-scalable, it is used for such diverse tasks as search, personalization,
recommendation, ads, auto-complete, image and similarity search, comment
ranking, and more.

One of’s strengths is that it includes all the necessary features to
realize such applications. This means one does not need additional plugins or
external services. Thus, it offers a simplified path to deployment in
production without coping

Using approximate nearest neighbor search to find similar products

In this blog we give an introduction to how to use the
Vespa’s approximate nearest neighbor search query operator.

We demonstrate how nearest neighbor search in an image feature vector space can be used to find similar products.
Given an image of a product, we want to find
similar products, but also with the possibility to filter the returned products by inventory status, price or other real world constraints.

We’ll be using the
Amazon Products dataset
as our demo dataset. The subset we use has about 22K products.

  • The dataset contains product information like title, description and price and we show how to map the data into the Vespa document model
  • The dataset also contains binary feature vectors for images, features produced by a neural network model. We’ll use these image vectors and show you how to store and index vectors in Vespa.

We also demonstrate some of the real world challenges with using nearest neighbor search:

  • Combining the nearest neighbor search with filters, for example on inventory status
  • Real time indexing of vectors and update documents with vectors
  • True partial updates to update inventory status at scale

Since we use the Amazon Products dataset, we also recommend these resources:


In this blog we’ll be using pyvespa, which is a simple python api built on top of Vespa’s native HTTP apis.

The python api is not meant as a production ready api, but an api to explore features in Vespa.
It is also for training ML models, which can be deployed to Vespa for serving.

See also the Complete notebook which powered this blog post.

Exploring the Amazon Product dataset

Let us download the demo data from the Amazon Products dataset, we use the Amazon_Fashion subset.

!wget -nc
!wget -nc
!wget -nc

from amazonutils import *
from pprint import pprint

Let us have a look at a selected slice of the dataset:

for p in islice(iter_products('meta_Amazon_Fashion.json.gz'), 221,223):
  display(Image(p['imUrl'], width=128, height=128))

{'asin': 'B0001KHRKU',
'categories': [['Amazon Fashion']],
'imUrl': '',
'salesRank': {'Watches': 220635},
'title': 'Midas Remote Control Watch'}


{'asin': 'B0001LU08U',
'categories': [['Amazon Fashion']],
'imUrl': '',
'salesRank': {'Clothing': 1296873},
'title': 'Solid Genuine Leather Fanny Pack Waist Bag/purse with '


## Build basic search functionality

A Vespa instance is described by a [Vespa application package]( Let us create an application called product and define our [document schema]( We use pyvespa to define our application. This is not a requirement for creating a Vespa application, one can just use a text editor of choice.

from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name = "product")
from vespa.package import Field
    Field(name = "asin", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "description", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "price", type = "float", indexing = ["attribute", "summary"]),
    Field(name = "salesRank", type = "weightedset", indexing = ["summary","attribute"]),
    Field(name = "imUrl", type = "string", indexing = ["summary"])

We define a fieldset which is a way to combine matching over multiple string fields. 
We will only do text queries over the *title* and *description* field. 

from vespa.package import FieldSet
    FieldSet(name = "default", fields = ["title", "description"])

Then define a simple [ranking]( function which uses a linear combination of the [bm25]( text ranking feature over our two free text string fields.

from vespa.package import RankProfile
        name = "bm25", 
        first_phase = "0.9*bm25(title) + 0.2*bm25(description)")

So let us deploy this application. We use docker in this example. See also [Vespa quick start](

### Deploy the application and start Vespa

from vespa.package import VespaDocker
vespa_docker = VespaDocker(port=8080)

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder="/home/centos/product_search" # include the desired absolute path here

Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.

### Index our product data

Pyvespa does expose a feed api, but in this notebook we use the raw [Vespa http /document/v1 feed api](

The HTTP document api is synchronous and the operation is visible in search when acked with a response code 200. In this case, the feed throughput is limited by the client as we are posting one document at a time. For high throughput use cases use the asynchronous feed api, or use more client threads with the synchronous api.

import requests 
session = requests.Session()

def index_document(product):
    asin = product['asin']
    doc = {
        "fields": {
            "asin": asin,
            "title": product.get("title",None),
            "description": product.get('description',None),
            "price": product.get("price",None),
            "imUrl": product.get("imUrl",None),
            "salesRank": product.get("salesRank",None)     
    resource = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    request_response =,json=doc)

With our routine defined we can iterate over the data and index the products, one doc at a time:

from tqdm import tqdm
for product in tqdm(iter_products("meta_Amazon_Fashion.json.gz")):

24145it [01:46, 226.40it/s]

So we have our index ready, no need to perform any additional index maintenance operations like merging segments.
All the data is searchable. Let us define a simple routine to display search results. Parsing the [Vespa JSON search response]( format:

def display_hits(res, ranking):
    time = 1000*res['timing']['searchtime'] #convert to ms
    totalCount = res['root']['fields']['totalCount']
    print("Found {} hits in {:.2f} ms.".format(totalCount,time))
    print("Showing top {}, ranked by {}".format(len(res['root']['children']),ranking))
    for hit in res['root']['children']:
        fields = hit['fields']
        print("{}".format(fields.get('title', None)))
        display(Image(fields.get("imUrl"), width=128, height=128))  
        print("documentid: {}".format(fields.get('documentid')))
        if 'inventory' in fields:
            print("Inventory: {}".format(fields.get('inventory')))
        print("asin: {}".format(fields.get('asin')))
        if 'price' in fields:
            print("price: {}".format(fields.get('price',None)))
        if 'priceRank' in fields:
            print("priceRank: {}".format(fields.get('priceRank',None)))  
        print("relevance score: {:.2f}".format(hit.get('relevance')))

### Query our product data

We use the [Vespa HTTP Search API]( to search our product index.

In this example we assume there is a user query 'mens wrist watch' which we use as input to the
[YQL]( query language. Vespa allows combining the structured application logic expressed by YQL with a
user query language called [Vespa simple query language](

In this case we use *type=any* so matching any of our 3 terms is enough to retrieve the document.
In the YQL statement we select the fields we want to return. Only fields which are marked as *summary* in the schema can be returned with the hit result.

We don't mention which fields we want to search, so Vespa uses the fieldset defined earlier called default, which will search both the title and the description fields.

query = {
    'yql': 'select documentid, asin,title,imUrl,price from sources * where userQuery();',
    'query': 'mens wrist watch',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 2

display_hits(app.query(body=query).json, "bm25")

Found 3285 hits in 4.00 ms.
Showing top 2, ranked by bm25

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 10.27

Popular Quality Brand New Fashion Mens Boy Leatheroid Quartz Wrist Watch Watches


documentid: id:demo:product::B009EJATDQ
asin: B009EJATDQ
relevance score: 9.67

So there we have our basic search functionality.

## Related products using nearest neighbor search in image feature spaces
Now we have basic search functionality up and running, but the Amazon Product dataset also includes image features which we can also
index in Vespa and use approximate nearest neighbor search on.
Let us load the image feature data. We reduce the vector dimensionality to something more practical and use 256 dimensions.

vectors = []
reduced = iter_vectors_reduced("image_features_Amazon_Fashion.b", 256, 1000)
for asin,v in tqdm(reduced("image_features_Amazon_Fashion.b")):

22929it [00:04, 4739.67it/s]

We need to re-configure our application to add our image vector field.
We also define a *HNSW* index for it and using *angular* as our [distance metric](

We also need to define the input query vector in the application package.
Without defining our query input tensor we won't be able to perform our nearest neighbor search,
so make sure you remember to include that.

Most changes like adding or remove a field is a [live change]( in Vespa, no need to re-index the data.

from vespa.package import HNSW
    Field(name = "image_vector", 
          type = "tensor(x[256])", 
          indexing = ["attribute","index"],
from vespa.package import QueryTypeField
    QueryTypeField(name="ranking.features.query(query_image_vector)", type="tensor(x[256])")

We also need to define a ranking profile on how we want to score our documents. We use the *closeness* [ranking feature]( Note that it's also possible to retrieve results using approximate nearest neighbor search operator and use the first phase ranking function as a re-ranking stage (e.g by sales popularity etc). 
        name = "vector_similarity", 
        first_phase = "closeness(field,image_vector)")

Now, we need to re-deploy our application package to make the changes effective.

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder="/home/centos/product_search" # include the desired absolute path here

## Update the index with image vectors
Now we are ready to feed and index the image vectors.

We update the documents in the index by running partial update operations, adding the vectors using real time updates of the existing documents.
Partially updating a tensor field, with or without tensor, does not trigger re-indexing.

for asin,vector in tqdm(vectors):
    update_doc = {
        "fields":  {
            "image_vector": {
                "assign": {
                    "values": vector
    url = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    response = session.put(url, json=update_doc)

100%|██████████| 22929/22929 [01:40<00:00, 228.94it/s]

We now want to get similar products using the image feature data.
We do so by first fetching the
vector of the product we want to find similar products for, and use this vector as input to the nearest neighbor search operator of Vespa.
First we define a simple get vector utility to fetch the vector of a given product *asin*.

def get_vector(asin):
    resource = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    response = session.get(resource)
    document = response.json()
    cells = document['fields']['image_vector']['cells']
    vector = {}
    for i,cell in enumerate(cells):
        v = cell['value']
        adress = cell['address']['x']
        vector[int(adress)] = v
    values = []
    for i in range(0,256):
    return values

Let us repeat the query from above to find an image to find similar products for

query = {
    'yql': 'select documentid, asin,title,imUrl,price from sources * where userQuery();',
    'query': 'mens wrist watch',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 1
display_hits(app.query(body=query).json, "bm25")

Found 3285 hits in 4.00 ms.
Showing top 1, ranked by bm25

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 10.27

Let us search for similar images using **exact** nearest neighbor search. We ask for 3 most similar to the product image with asin id **B00GLP1GTW**

query = {
    'yql': 'select documentid, asin,title,imUrl,description,price from sources * where \
    'ranking': 'vector_similarity',
    'hits': 3, 
    'presentation.timing': True,
    'ranking.features.query(query_image_vector)': get_vector('B00GLP1GTW')
display_hits(app.query(body=query).json, "vector_similarity")

Found 46 hits in 10.00 ms.
Showing top 3, ranked by vector_similarity

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 1.00

Avalon EZC Unisex Low-Vision Silver-Tone Flex Bracelet One-Button Talking Watch, # 2609-1B


documentid: id:demo:product::B000M9GQ0M
asin: B000M9GQ0M
price: 44.95
relevance score: 0.63

White Rubber Strap Digital Watch


documentid: id:demo:product::B003ZYXF1Y
asin: B003ZYXF1Y
relevance score: 0.62

Let us repeat the same query but this time using the faster approximate version.
When there is a HNSW index on the tensor,
the default behavior is to use approximate:true, so we remove the approximation flag.

query = {
    'yql': 'select documentid, asin,title,imUrl,description,price from sources * where \
    'ranking': 'vector_similarity',
    'hits': 3, 
    'presentation.timing': True,
    'ranking.features.query(query_image_vector)': get_vector('B00GLP1GTW')
display_hits(app.query(body=query).json, "vector_similarity")

Found 3 hits in 6.00 ms.
Showing top 3, ranked by vector_similarity

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 1.00

Avalon EZC Unisex Low-Vision Silver-Tone Flex Bracelet One-Button Talking Watch, # 2609-1B


documentid: id:demo:product::B000M9GQ0M
asin: B000M9GQ0M
price: 44.95
relevance score: 0.63

White Rubber Strap Digital Watch


documentid: id:demo:product::B003ZYXF1Y
asin: B003ZYXF1Y
relevance score: 0.62

## Combining nearest neighbor search with filters

If we look at the results for the above exact and approximate nearest neighbor searches, we got the same results using the approximate version (perfect recall).
But naturally the first listed product was the same product that we used as input, and the closeness score was 1.0 simply because the angular distance is 0.
Since the user is already presented with the product, we want to remove it from the result. We can do that

Query Time Constrained Approximate Nearest Neighbor Search

Photo by
Christopher Burns on Unsplash

This blog post describes Vespa’s support for combining approximate nearest neighbor search,
or vector search, with query-time filters or constraints to tackle real-world search and recommendation use cases at scale.
The first section covers the data structures Vespa builds to support fast search in vector fields and other field types.
It then goes through the two primary strategies for combining vector search with filters; pre- or post-filtering.
Finally, an intro to two Vespa parameters for controlling vector search filtering behavior to
achieve optimal functionality with the lowest possible resource usage.


Many real-world applications require query-time constrained vector search. For example, real-time recommender systems
using vector search for candidate retrieval
need to filter recommendations by hard constraints like regional availability or age group suitability.
Likewise, search systems using vector search need to support filtering.
For example, typical e-commerce search interfaces allow users to navigate and filter search results using result facets.

Vespa’s document model supports representing multiple field and collection types in the same
document schema.
Supported Vespa schema field types
include string, long, int, float, double, geo position, bool, byte, and tensor fields.
Vespa’s first-order dense tensor fields represent vector fields.
Vespa’s tensor fields support different tensor cell precision types,
ranging from int8 for binary vectors to bfloat16, float, and double for real-valued vectors.
Allowing vectors and other field types in the same document schema enables searching the vector field(s) in
combination with filters expressed over other fields.

This blog post uses the following Vespa document schema
to exemplify combining vector search with filters:

schema track {

  document track {

    field title type string {
      indexing: index | summary
      index: enable-bm25
      match: text 

    field tags type array<string> {
      indexing: attribute | summary 
      attribute: fast-search
      rank: filter
      match: exact 

    field embedding type tensor<float>(x[384]) {
      indexing: attribute | index 
      attribute {
        distance-metric: euclidean
  rank-profile closeness {
    inputs {
       query(query_embedding) tensor<float>(x[384])      
    first-phase { 
      expression: closeness(field, embedding) 

The practical nearest neighbor search guide
uses a similar schema, indexing a subset of the track dataset.
The simplified track document type used in this post contains three fields: track title, track tags, and track embedding:

  • title is configured for regular text indexing and matching using the default matching mode for
    indexed string fields, match:text.
  • tags is an array of strings,
    configured for exact database-style matching using Vespa’s match:exact.
  • embedding is a first-order tensor (vector) using float tensor cell precision.
    x[384] denotes the named dimension (x) with dimensionality (384). A vector field searched using
    Vespa’s nearestNeighbor
    query operator must define a distance-metric.
    Also see the Vespa tensor user guide.

The embedding vector field can be produced by, for example, a dense embedding model like

Vespa builds data structures for efficient query evaluation for fields with
indexing: index or
attribute fields
defined with the attribute: fast-search property.
The data structures used for non-tensor fields are variants of the classic inverted index data structure.
The inverted index data structure enables fast query evaluation of boolean queries, expressed using
Vespa’s query language:

select * from track where tags contains "90s" and tags contains "pop" 

Vespa builds Hierarchical Navigable Small World (HNSW)
graphs for first-order dense tensor fields or vector fields to support a fast, approximate nearest neighbor search.
Read more in the introducing HNSW support in Vespa
blog post.

Vespa’s implementation of the HNSW graph data structure allows for fast approximate nearest neighbor searches like:

select * from track where {targetHits:3, approximate:true}nearestNeighbor(embedding, query_embedding) 

This example searches for the three (approximate) nearest neighbors of the
input query embedding vector using the configured

Figure 1 illustrates, on a high level, Vespa’s data structures for all types of fields,
including vector fields with
HNSW enabled.

Search query planning and estimation

Vespa builds a query plan that tries to optimize the query execution.
A significant part of the execution planning is estimating how many documents each branch of the query tree will match.
The estimation process uses information present in per-field inverted index data structures.

Query terms searching fields with inverted index structures enabled,
use the size of posting lists as the hit count estimate. Other terms in the query
might use the number of searchable documents as an estimate, as it’s not known how many hits they will produce.
Furthermore, subtrees of ANDed terms use the minimum estimate of their children,
and OR-terms use the saturated sum of their children.
Finally, the complete query hit estimate is scaled with
the number of searchable documents to get an estimated-hit-ratio [0, 1].

Using the high-level illustration in Figure 1, a query for tags contains “90s” or tags contains “pop”
is estimated to match the sum of the length of the posting lists of the two terms (4+7=11).
A query for tags contains “90s” and tags contains “pop” is estimated to match
at most four documents (min(4,7) = 4). The hit estimates determine the query execution plan.
An optimal query evaluation for 90s and pop would start with the shortest posting list (90s)
and intersect with the postings of the longest (pop).
The query execution with the intersection of these two posting lists will only match one document (D9),
which is less than the estimated hit count. Estimation is a best-effort process,
overestimating the number of documents the query will match.

The posting lists of the inverted index can be of different granularity, which can help optimize the query execution.
For example, using rank: filter for attribute fields
with fast-search, enables compact posting list representation for frequent terms.

Combining exact nearest neighbor search with filters

Given a query vector, Vespa’s nearestNeighbor
query operator finds the (targetHits) nearest neighbors using the configured

The retrieved hits are exposed to first-phase ranking,
where the retrieved neighbors can be re-ranked using more sophisticated ranking models
beyond the pure vector similarity.

The query examples in this blog post use the closeness rank feature
directly as the first-phase rank expression:

rank-profile closeness { 
  first-phase { 
    expression: closeness(field, embedding) 

Developers might switch between using approximate:true, which searches the HNSW
graph data structure, and using exact search, setting approximate:false.
The ability to switch between approximate and exact enables quantifying accuracy
loss when turning to the faster, but approximate search. Read more about HNSW parameters and accuracy versus performance
tradeoffs in the Billion-scale vector search with Vespa – part two blog post.

It’s trivial to combine the exact nearest neighbor search with query-time filters, and the computational
complexity of the search is easy to understand. For example, the following query searches for the exact
nearest neighbors of the query embedding using the euclidean distance-metric,
restricting the search for neighbors to only consider tracks tagged with rock:

select * from track where {targetHits:5, approximate:false}nearestNeighbor(embedding, query_embedding) and tags contains "rock" 

The query execution planner estimates that the exact nearestNeighbor query operator
will match all searchable documents,
while the tags term will match a restricted subset.
The most optimal way to evaluate this query is to first find the documents matching the filter,
and then perform an exact nearest neighbor search. During the exact search, Vespa reads the vector data
from the tensor attribute store and does not touch the HNSW graph.

Figure 2 illustrates the computational cost (driven mainly by distance calculations)
versus an increasing number of filter matches (% hit count rate) using exact search filters.
A more restrictive filter uses less resources as it involves fewer distance calculations.

Note that the figure uses computational cost and not latency.
It is possible to reduce search latency by using more
threads to parallelize
the exact search,
but the number of distance calculations involved in the query execution stays the same.

Combining approximate nearest neighbor search with filters

Using exact nearest neighbor search for web-scale search and recommendation problems
quickly becomes prohibitively expensive. As a result, many turn to the less resource-intensive
approximate nearest neighbor search, accepting an accuracy loss to reduce serving cost.
There are two high-level strategies for combining boolean query evaluation over
inverted indexes with approximate nearest neighbor search: post-filtering and pre-filtering.

Post-filtering strategy

This strategy evaluates the approximate nearest neighbors first and runs the constrained filter
over the retrieved targetHits hits. This strategy is characterized as post-filtering as the
filter constraints are considered only over the retrieved targetHits closest neighbors.
The disadvantage of this approach is that restrictive filters (for example, the tag 90s from Figure 1)
will cause the search to expose fewer hits to ranking than the wanted targetHits.
In the worst case, the post-filters eliminate all retrieved neighbors and expose zero documents to ranking.
The advantage of this strategy is that the serving performance impact of constrained search is negligible
compared to unconstrained approximate nearest neighbor search.
Another aspect is that the hits which survive the post-filter are within the original targetHits.
In other words, the neighbors exposed to ranking are not distant compared to the nearest.

Pre-filtering strategy

This strategy evaluates the filter part of the query over the inverted index structures first.
Then, it uses the resulting document IDs from the filter execution as input to the search for approximate nearest neighbors.
The search for neighbors traverses the HNSW graph and each candidate neighbor is looked up in the
document ID list from the filter execution.
Neighbors not on the document ID list are ignored and the greedy graph search continues until
targetHits hits have been found.

This filtering strategy is known as pre-filtering, as the filters go first before searching
for the approximate nearest neighbors. With pre-filtering, the probability of exposing targetHits
to the ranking phase is much higher than with the post-filtering approach.
The disadvantage is that the performance impact is higher than with the post-filter strategy.
In addition, the retrieved neighbors for a restrictive filter might be somewhat distant neighbors
than the nearest neighbor of the query embedding.
Distant neighbors could be combated by specifying a
distance threshold
as another constraint for the approximate nearestNeighbor query operator.

Figure 3 summarizes the strategies used for approximate nearest neighbor search with filtering.
In the case of
post-filtering, only one hit is exposed to the ranking phase after the post-filter is
evaluated. In the case of pre-filtering, two additional hits were exposed, but they
are more distant neighbors.

Filtering strategies and serving performance

From a performance and serving cost perspective, one can summarize on a high level:

  • Unconstrained approximate nearest neighbor search without filters is the fastest option, but real-world applications
    need constrained vector search. Pre-building several nearest neighbor indexes using pre-defined constraints offline also
    cost resources and index management.

  • Post-filtering is less resource-intensive than pre-filtering for the same number of targetHits. Increasing
    targetHits to combat the effect of post-filtering changes cost of the query as increasing targetHits increases the
    number of distance calculations.

  • Pre-filtering uses more resources than post-filtering for the same number of targetHits.
    Pre-filtering needs to execute the filter and then search the HNSW graph,
    constrained by the document ID list produced by the pre-filter execution.

Figure 4 Approximate search with pre-filtering versus exact search with pre-filtering

Figure 4 illustrates the performance characteristics of approximate search and exact search when using pre-filtering
with increasing hit count rate.

With the pre-filtering strategy, searching the HNSW graph with a restrictive pre-filter
result causes the greedy graph search to traverse many nodes before finding the targetHits
which satisfies the document ID list from the pre-filter execution.

Due to this, there is a sweet spot or threshold where an exact search with filters
has a lower computational cost than an approximate search using the document ID list from the pre-filter execution.
The actual threshold value depends on vector dimensionality, HNSW graph properties,
and the number of vectors indexed in the Vespa instance.

Vespa exposes two parameters that control the query-time filtering strategy.
These parameters give the