Billion-scale vector search with Vespa – part two

illustrative image

Photo by
Vincentiu Solomon on Unsplash

In the first post in this series, we introduced compact binary-coded vector representations that can
reduce the storage and computational complexity of both exact and approximate nearest neighbor search.
This second post covers an experiment using a 1-billion binary-coded representation derived from a
vector dataset used in the big-ann-benchmark challenge. The
purpose of this post is to highlight some of the trade-offs related to approximate nearest neighbor search,
and especially we focus on serving performance versus accuracy.

Vespa implements a version of the HNSW (Hierarchical Navigable Small Word)
algorithm for approximate
vector search. Before diving into this post, we recommend reading the HNSW in Vespa
blog post explaining why we chose the HNSW algorithm.

Choosing a Vector Dataset

When evaluating approximate nearest neighbor search algorithms it is important to use realistic vector data.
If you don’t have the vectors for your problem at hand, use a publicly available dataset.

For our experiments, we chose the Microsoft SPACEV-1B
from Bing as our base vector dataset.

This is a dataset released by Microsoft from SpaceV, Bing web vector search scenario, for large
scale vector search-related research usage. It consists of more than one billion document vectors
and 29K+ query vectors encoded by the Microsoft SpaceV Superior model.

The vector dataset was published last year as part of the
big-ann-benchmarks challenge.
It consists of one billion 100-dimensional vectors using int8 precision. In other words,
each of the hundred vector dimensions is a number in the [-128,127] range.
The dataset has 29,3K queries with pre-computed ground truth nearest neighbors
using the euclidean distance for each query vector.
Vespa supports four different
tensor vector precision types,
in order of increasing precision:

  • int8 (8 bits, 1 byte) per dimension
  • bfloat16 (16 bits, 2 bytes) per dimension
  • float (32 bits, 4 bytes) per dimension
  • double (64 bits, 8 bytes) per dimension

Quantization and dimension reduction as part of the representation learning could save both
memory and CPU cycles in the serving phase, and Microsoft researchers have undoubtedly had this in mind
when using 100 dimensions with int8 precision for the embedding.

In the first post in this series we discussed some of the benefits of working with binarized vector datasets
where the original vector representation (e.g, int8) is transformed to binary.
We don’t train a binary representation model, but instead use the unsupervised sign threshold function.
Using the threshold function, we convert the mentioned SPACEV-1B vector dataset to a new and binarized
dataset. Both queries and document vectors are binarized, and we use the hamming distance
metric for the nearest neighbor search (NNS)
with our new dataset.

import numpy as np
import binascii
#vector is np.array([-12,3,4....100],dtype=np.int8)
binary_code = np.packbits(np.where(vector > 0, 1,0)).astype(np.int8)

Binarization using NumPy

The binarization step packs the original vector into 104 bits, which is represented using a
13 dimensional dense single-order int8 tensor in Vespa.

However, this transformation from euclidean to hamming does not preserve the
original euclidean distance ordering, so we calculate a new set of ground truth nearest neighbors
for the query dataset using hamming distance. Effectively, we create a new binarized vector dataset
which we can use to experiment and demonstrate some trade-offs related to vector search:

  • Brute force nearest neighbor search using multiple search threads to parallelize the search for
    exact nearest neighbors
  • Approximate nearest neighbor search where we accept an accuracy loss compared
    to exact search.
  • Indexing throughput with and without HNSW enabled
  • Memory and resource utilization with and without HNSW enabled

As described in the first post in this series, we can also use the hamming distance as the coarse level
nearest neighbor search distance metric and re-rank close vectors in hamming space using the
original representation’s euclidean distance. In our experiments, we focus on the new binarized
dataset using hamming distance.

Experimental setup

We deploy the Vespa application to the Vespa cloud
performance zone, and use the following
node resources for the core Vespa service types, see services.xml
reference guide for details:

2x Stateless container with search API (<search>)

<nodes count="2">
  <resources memory="12Gb" vcpu="16" disk="100Gb"/>

2x Stateless container with feed API (<document-api>)

<nodes count="2">
  <resources memory="12Gb" vcpu="16" disk="100Gb"/>

1x Stateful content cluster with one node for storing and indexing the vector dataset

<nodes count="1">
  <resources memory="256Gb" vcpu="72" disk="1000Gb" disk-speed="fast"/>

This deployment specification isolates resources used for feed and search, except for search and indexing related
resource usage on the content node. Isolating feed and search allows for easier on-demand resource scaling as the
stateless containers can be auto-scaled faster with read and write
volume than stateful content resources. For self-hosted deployments of Vespa, you need to list the nodes you are using, and there is no auto-scaling.

The following is the base Vespa document schema we use throughout our experiments:

schema code {
  document code {
    field id type int {
      indexing: summary|attribute
    field binary_code type tensor<int8>(b[13]) {
      indexing: attribute
      attribute {

Vespa document schema without HNSW indexing enabled

Evaluation Metrics

To evaluate the accuracy degradation when using approximate nearest neighbor search (ANNS) versus the exact ground
truth (NNS), we use the Recall@k metric, also called Overlap@k. Recall@k measures the overlap
between the k ground truth nearest neighbors for a query with the k nearest returned by the
approximate search.

The evaluation routine handles distance ties; if a vector returned by
the approximate search at position k has the same distance as the ground truth vector at position k,
it is still considered a valid kth nearest neighbor. For our experiments, we use k equal to 10.
The overall Recall@10 associated with a given parameter configuration is the mean recall@10 of all 29,3K
queries in the dataset.

Note that the vector overlap metric used does not necessarily directly correlate with application-specific recall
metrics. For example, recall is also used in Information Retrieval (IR) relevancy evaluations to measure if the judged
relevant document(s) are in the retrieved top k result. Generally, degrading vector search Recall@k
impacts the application-specific recall, but much depends on the use case.

For example, consider the Efficient Passage Retrieval with Hashing for
Open-domain Question Answering paper
discussed in the previous post in this series. The authors present a recall@k
metric for k equal to 1, 20 and 100. This specific recall@k measures if the ground truth
golden answer to the question is in any top k retrieved passage. In this case, the error introduced
by using approximate search might not impact the use case recall@k metric since the
exact answer to the question might exist in several retrieved documents. In other words, not
recalling a record with the ground truth answer due to inaccuracy introduced by using approximate
vector search does not necessarily severely impact the end-to-end recall metric.

Similarly, when using ANNS for image search at a web scale, there might be many equally relevant
images for almost any query. Therefore, losing relevant “redundant” pictures due to nearest neighbor
search accuracy degradation might not impact end-to-end metrics like revenue and user satisfaction severely.

However, reducing vector recall quality will impact other applications using nearest
neighbor search algorithms. Consider, for example, a biometric fingerprint recognition application
that uses nearest neighbor search to find the closest neighbor for a given query fingerprint in
a database of many fingerprints. Accepting any accuracy loss as measured by
Recall@1 (the true closest neighbor) will have severe consequences for the overall usefulness of the

Vector Indexing performance

We want to quantify the impact of adding data structures for faster and approximate vector search on
vector indexing throughput. We use the Vespa HTTP feeding client
to feed vector data to the Vespa instance.

Writing becomes more expensive when enabling HNSW indexing for
approximate vector search. This is because insertion into the HNSW graph requires distance calculations and graph
modifications which reduces overall throughput. Vespa’s HNSW implementation uses multiple threads for
distance calculations during indexing, but only a single writer thread can mutate the HNSW graph.
The single writer thread limits concurrency and resource consumption. Generally, Vespa balances CPU resources used for indexing versus searching
using the concurrency setting.

Vespa exposes two core HNSW construction parameters
that impacts feeding performance (and quality as we will see in subsequent sections):

  • max-links-per-node Specifies how many links are created per vector inserted into the graph. The
    default value in Vespa is 16. The HNSW paper calls this parameter M.
    A higher value of max-links-per node increases memory usage and reduces indexing throughput, but also improves the quality of the graph.

  • neighbors-to-explore-at-insert Specifies how many neighbors to explore when inserting a vector in
    the HNSW graph. The default value in Vespa is 200. This parameter is called efConstruction in the HNSW paper.
    A higher value generally improves the quality but lowers indexing throughput as each insertion requires more
    distance computations. This parameter does not impact memory footprint.

We experiment with the following HNSW parameter combinations for evaluating feed indexing
throughput impact.

  • No HNSW indexing for exact search
  • HNSW with max-links-per-node = 8, neighbors-to-explore-at-insert 48
  • HNSW with max-links-per-node = 16, neighbors-to-explore-at-insert 96

The document schema, with HNSW indexing enabled for the binary_code field looks like this for the
last listed parameter combination:

schema code {
  document code {
    field id type int {
      indexing: summary|attribute
    field binary_code type tensor<int8>(b[13]) {
      indexing: attribute|index
      attribute {
      index {
        hnsw {
          max-links-per-node: 16
          neighbors-to-explore-at-insert: 96

Vespa document schema with HNSW enabled

The real-time indexing throughput results are summarized in the following chart:

Indexing performance

Real-time indexing performance without HNSW indexing and with two HNSW parameter combinations.

Without HNSW enabled, Vespa is able to sustain 80 000 vector puts/s. By increasing the number of nodes in the
Vespa content cluster using Vespa’s content distribution,
it is possible to increase throughput horizontally. For example, using four nodes instead of one, would support 4×80 000 = 320 000 puts/.

Indexing performance

Sustained real-time indexing without HNSW, screenshot from Vespa Cloud Metrics

When we introduce HNSW indexing, the write-throughput drops significantly as it involves mutations of the
HNSW graph and distance calculations. In addition to indexing throughput, we also measure peak memory usage for the content node:

Memory Usage(GB)

Peak Memory Usage without HNSW indexing and with two HNSW parameter combinations.

Now, you might ask, why are Vespa using 64G of memory for this dataset in the baseline case without HNSW?
The reason is that Vespa stores the global document id (a 96-bit hash of the document id string) in memory, and the document id consumes more memory than the vector
data alone. One billion document identifiers consume
about 33GB of memory. Finally, there is also 4GB of data for the integer id attribute.
This additional memory used for the in-memory gid, is used to support fast elastic content distribution,
fast partial updates and more.

As we introduce HNSW indexing, the memory usage increases significantly due to the additional HNSW graph data structure which is also
in memory for fast access during searches and insertions.

Brute-force exact nearest neighbor search performance

As we have seen in the indexing performance and memory utilization experiments, not using HNSW uses
considerably less memory, and is the clear indexing throughput winner – but what about the search
performance of brute force search? Without HNSW graph indexing, the complexity of the search for neighbors is linear with
the total document volume, so that is surely slow for 1B documents?

To overcome the latency issue, we can use one of the essential Vespa features: Executing a query using multiple
search threads.
By using more threads per query, Vespa can make better use of multi-CPU core architectures and
reduce query latency at the cost of increased CPU usage per query. See more on scaling latency
versus throughput using multithreaded search in the Vespa
sizing and performance guide.

To easily test multiple threading configurations, we deploy multiple
Vespa rank profiles. Choosing ranking profile is
done in the query, so it’s easy to run experiments without having to re-deploy the application.

rank-profile hamming {
  first-phase {

rank-profile hamming-t2 inherits hamming {

Ranking profiles defined in the document schema.

Exact NNS versus threads

Exact nearest neighbor search performance versus threads used per query.

As we can see from the figure above, one search thread uses on average 15 seconds to compute the exact nearest neighbors.
By increasing the number of search threads per query to 32, we can reach sub-second search latency.
The catch is that at as low as one query per second (QPS), the node would be running at close to 100% CPU utilization.
Still, trading latency over throughput might be a good decision for some use cases that do not require high query throughput or
where CPU utilization is low (over-provisioned resources). In our case,
using all available CPU cores for our exact ground truth calculations reduced the overall time duration significantly.

Approximate nearest neighbor search performance

Moving from exact to approximate nearest neighbor search, we evaluate the search performance
versus accuracy using the recall@10 metric for the same HNSW parameter combinations we used
to evaluate indexing performance:

  • max-links-per-node = 8, neighbors-to-explore-at-insert 48
  • max-links-per-node = 16, neighbors-to-explore-at-insert 96

We use the exact search to obtain the 100 ground truth exact nearest neighbors for all the queries in the dataset, and use those for the
approximate nearest neighbor search recall@10 evaluation. We use 100 in the ground truth set to be able to take into account distance ties.

With approximate nearest neighbor search in Vespa, the traversal of the HNSW graph uses one thread per
search irrespective of the number of configured rank profile search threads; the threads are put to
use if the ranking profile uses higher-level subsequent ranking phases.
For use cases and ranking profiles without higher level ranking phases, we recommend explicitly configuring
one thread to avoid idling searcher threads which are not used for the graph traversal. Recall@10 versus latency is provided
in the figure below:

Approximate nearest neighbor search performance versus recall@10

Approximate nearest neighbor search performance versus recall@10.

We produce the graph by running all 29,3K queries in the binarized dataset and computing
the recall@10 and measure the latency for different run time values of
The hnsw.exploreAdditionalHits run-time parameter allows us to tune quality versus cost without re-building the HNSW graph.

As we can see from the above graph, using more indexing resources with more links and more distance
calculations during graph insertion improves the search quality at the same cost/latency.
As a result, we reach 90% recall@10 instead of 64% recall@10 at the exact cost of 4 ms search time.
Of course, the level of acceptable recall@10 will be use case dependent, but the above figure illustrates the impact on search
quality when using different construction HNSW parameters.

Furthermore, comparing the 4ms at 90% recall@10 with the exact nearest neighbor search performance
of 15000 ms, we see that we achieve a speedup of 3,750x. Note that these are latency numbers for single-threaded searches.
For example, with 4 ms average latency per search using one thread, a node with one CPU core will be able to evaluate up to about 250
queries per second. 72 CPU cores would be 72x that, reaching 18,000 queries per second at 100% CPU
utilization. Scaling further for increased query throughput is achieved using multiple replicas using grouped content
distribution (or more CPU cores per node). See more on how to size Vespa search deployments in the
Vespa sizing guide.


In this blog post, we explored several trade-offs related to vector search.
We concluded that the quality, as measured by recall@k, must be weighed against the use case metrics
and the deployment cost. Furthermore, we demonstrated how multithreaded search could reduce the
latency of the exact search, but scaling query throughput for exact search would be prohibitively
expensive at this scale. However, using brute force search could be a valid and cost-effective alternative
for smaller data volumes with low query throughput, especially since the memory usage is considerably less,
and supported indexing throughput is higher.

In the next blog post in this series, we will experiment with the original
Microsoft SPACEV-1B vector dataset, using
the original dimension with int8 precision with euclidean distance. In the upcoming blog post we will explore a hybrid approach
for approximate nearest neighbor search. This approach uses a combination of inverted indexes and HNSW, which reduces
memory usage. The method we will explore using Vespa is highly inspired by the SPANN: Highly-efficient Billion-scale
Approximate Nearest Neighbor Search paper. Stay tuned!