Billion-scale vector search with Vespa – part one


Photo by Federico Beccari
on Unsplash

NASA estimates that the Milky Way galaxy
contains between
100 and 400 billion stars. A mind-blowingly large number of stars and solar systems,
and a stunning sight from planet Earth on a clear night.

Many organizations face challenges involving star-sized datasets, even orders of magnitude larger.
AI-powered representations of unstructured data like images, text, and video have enabled search applications
we could not foresee just a few years ago.
For example, in a previous blog post, we covered vision and video search applications using
AI-powered vector representations.

Searching star-sized vector datasets using (approximate)
nearest neighbor search is a fascinating and complex problem with many trade-offs:

  • Accuracy of the approximate versus the exact nearest neighbor search
  • Latency and scaling
  • Scaling search volume (searches per second)
  • Batch indexing versus real-time indexing and handling of updates and vector meta-data
  • Distributed search over large vector datasets that do not fit into a single content node
  • Cost ($): total cost of ownership, including development effort

This blog series looks at how to represent and search billion-scale vector datasets using Vespa, and we cover many of the mentioned
trade-offs.

In this first post we look at compact binary-coded vector representations which can reduce storage and computational complexity
of both exact and approximate nearest neighbor search. For those who are new to Vespa, we recommend reading the
Vespa overview and the
Vespa quick start guide before diving into this post.

Deep Neural Hashing is a catchy phrase
for using deep neural networks for
learning
compact hash-like binary-coded representations. The goal of representation
learning, deep or not, is to transform any data into a suitable representation
that retains the essential information needed to solve a particular task, for
example, search or retrieval. Representation learning for retrieval usually involves
supervised learning with labeled or pseudo-labeled data from user-item interactions.

Many successful text retrieval systems use supervised representation
learning to transform text queries and documents into continuous
high-dimensional real-valued vector representations. Query and document
similarity, or relevancy, is reduced to a vector distance metric in the
representational vector space. To efficiently retrieve from a large
collection of documents, one can turn to approximate nearest neighbor search
algorithms instead of exact nearest neighbor search.
See for example the
pre-trained transformer language models for search blog post for
an introduction to state-of-the-art text retrieval using dense vector representations.

Recently, exciting research has demonstrated that it is possible to learn a
compact hash-like binary code representation instead of a dense continuous
vector representation without much accuracy loss.
In
Efficient Passage Retrieval with Hashing for Open-domain Question Answering, the authors describe
using a hashing layer on top of the bi-encoder transformer architecture
to train a binary coded representation of documents and queries
instead of continuous real-valued representation.


Illustration from
Efficient Passage Retrieval with Hashing for Open-domain Question Answering

Note that the search is performed in two phases: first, a coarse-level search using
hamming distance over the binary codes; second, a re-ranking phase using the continuous query vector representation and
an unpacked vector representation derived from the binary code.

A huge advantage over continuous vector
representations is that the binary-coded document representation significantly reduces
the document storage requirements. For example, representing text documents in a
768-dimensional vector space where each dimension uses float precision, the
storage requirement per document becomes 3072 bytes. Using a 768-bit
binary-coded representation instead, the storage requirement per document
becomes 96 bytes, a 32x reduction.
In the mentioned paper, the authors demonstrate that the entire
English Wikipedia consisting of 22M passages can be reduced to 2GB of binary codes.

Searching in binary-coded representations can be done using the hamming distance metric.
The hamming distance between code q and code d is simply the number of bit
positions that differ or, in other words, the number of bits that need to flip
to convert q into d. Hamming distance is efficiently implemented on CPU
architectures using few instructions (xor combined with population count). In
addition, unlike an exact hash table lookup, hamming distance search makes it possible
to rank a set of binary codes for a binary-coded query.
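
To make the xor and population-count approach concrete, the following Python snippet (an
illustration only, not Vespa’s internal implementation) computes the hamming distance between
two packed 128-bit codes using NumPy:

import numpy as np

def hamming_distance(code_a, code_b):
    # xor sets a bit wherever the two codes differ; the population count
    # (number of set bits) of the result is the hamming distance
    return int(np.unpackbits(np.bitwise_xor(code_a, code_b)).sum())

# two random 128-bit codes, each packed into 16 bytes
code_a = np.packbits(np.random.randint(0, 2, 128, dtype=np.uint8))
code_b = np.packbits(np.random.randint(0, 2, 128, dtype=np.uint8))
print(hamming_distance(code_a, code_b))  # a value between 0 and 128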

Compact binary-coded representations paired with hamming
distance have also been used successfully for large-scale vision search.
See for example these papers:

Vespa has first-class citizen support for representing high dimensional dense
vectors using the
Vespa tensor field type. Dense vectors are represented as
dense first-order tensors in Vespa. Tensor cell values in Vespa support four
numeric precision types,
in order of increasing precision:

  • int8 (8 bits, 1 byte) per dimension
  • bfloat16 (16 bits, 2 bytes) per dimension
  • float (32 bits, 4 bytes) per dimension
  • double (64 bits, 8 bytes) per dimension

All of these are signed data types. In addition, for dense first-order tensors
(vectors), Vespa supports fast approximate nearest neighbor search using Vespa’s
HNSW implementation.
Furthermore, the nearest neighbor search operator in Vespa
supports several
distance metrics, including euclidean, angular, and bitwise
hamming distance.

To represent binary-coded vectors in Vespa, one should use first-order tensors
with the int8 tensor cell precision type. For example, to represent a 128-bit code using
Vespa tensors, one can use a 16-dimensional dense (indexed) tensor using int8 value type
(16*8 = 128 bits). The
Vespa document schema below defines a numeric id field
of type int, representing the vector document id in addition to the binary-coded
vector using a dense first-order tensor. The b denotes the tensor dimension
name, and finally, [16] denotes the dimensionality.

schema code {
  document code {
    field id type int {}
    field binary_code type tensor<int8>(b[16]) {
      indexing: attribute
      attribute { 
        distance-metric:hamming
      } 
    }
  }
}

Using big-endian ordering for the binary-coded representation, the
most significant bits of the binary code, at bit positions 0 to 7 inclusive, make up the first
vector element at index 0, bits at positions 8 to 15 inclusive make up the second
vector element, and so on. For example, the following snippet uses NumPy
with Python to
demonstrate a way to create a binary-coded representation from a 128-dimensional
real-valued vector representation by using an untrained thresholding function (the sign
function):

import numpy as np
import binascii

# sample a random 128-dimensional real-valued vector
vector = np.random.normal(3, 2.5, size=(128))
# binarize with the sign threshold function and pack the 128 bits into 16 int8 values
binary_code = np.packbits(np.where(vector > 0, 1, 0)).astype(np.int8)
# hexadecimal string representation (32 hex characters = 16 bytes = 128 bits)
str(binascii.hexlify(binary_code), 'utf-8')
'e7ffef5df3bcefffbfffff9fff77fdc7'

The above hexadecimal string representation
can be fed to Vespa using the
Vespa JSON feed format.

{
  "id": "id:bigann:code::834221",
  "fields": {
    "id": 834221,
    "binary_code": {
      "values": "e7ffef5df3bcefffbfffff9fff77fdc7"
    }
  } 
}
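
For illustration, such a feed operation could be assembled in Python from the hex string
produced by the earlier snippet (the namespace and document type follow the example above):

import json

def feed_operation(doc_id, binary_code_hex):
    # build a Vespa feed operation in the JSON document format shown above
    return json.dumps({
        "id": "id:bigann:code::%d" % doc_id,
        "fields": {
            "id": doc_id,
            "binary_code": {"values": binary_code_hex}
        }
    })

print(feed_operation(834221, "e7ffef5df3bcefffbfffff9fff77fdc7"))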

Indexing in Vespa is real-time, and documents become searchable within single-digit
millisecond latency at high throughput. The JSON feed files can be indexed with
high throughput using the
Vespa feed client.

Feeding output stream

Dense first-order tensors can either be in memory all the time or paged in from
secondary storage on-demand at query time, for example, during scoring and
ranking phases in a
multiphased retrieval and ranking pipeline. In any case,
the data is
persisted and stored for durability and data re-balancing.

Furthermore, supporting in-memory and
paged dense first-order tensors enables
storing the original vector representation in the document model at a relatively
low cost (storage hierarchy economics). For example, the following document schema keeps
the original float precision vector on disk using the paged tensor attribute option.

schema code {
  document code {
    field id type int {} 
    field binary_code type tensor<int8>(b[16]) {
      indexing: attribute
      attribute { 
        distance-metric: hamming 
      }
    }
    field vector type tensor<float>(r[128]) {
      indexing: attribute
      attribute: paged
    }
  }
}

Storing the original vector representation on disk using the
paged
attribute option enables phased retrieval and ranking close to the data. First,
one can use the compact in-memory binary code for the coarse level search to
efficiently find a limited number of candidates. Then, the pruned
candidates from the coarse search can be re-scored and re-ranked using a more
advanced scoring function using the original document and query representation. Once
a document is retrieved and exposed to the ranking phases, one can also use more
sophisticated scoring models, for example using Vespa’s support for evaluating
ONNX models.


A two-phased coarse to fine level search using hamming distance as the coarse-level search.
Illustration from
Deep Learning of Binary Hash Codes for Fast Image Retrieval (pdf)
.

The binary-coded representation and the original vector are co-located on the
same Vespa content node(s)
since they live in the same document object. Thus,
re-ranking or fine-level search using the real-valued vector representation does not require any
network round trips to fetch the original vector representation from some
external key-value storage system.

In addition, co-locating the coarse and fine representations avoids
data consistency and synchronization issues between different data stores.
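
Conceptually, the coarse-to-fine flow can be sketched in a few lines of Python. This is an
illustration of the idea only; in Vespa, both phases run inside the content nodes using the
ranking profiles shown below:

import numpy as np

def coarse_to_fine(query_code, query_vector, codes, vectors, coarse_k=200, k=10):
    # coarse phase: hamming distance over the packed binary codes (in memory)
    xor = np.bitwise_xor(codes, query_code).view(np.uint8)
    hamming = np.unpackbits(xor, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[:coarse_k]
    # fine phase: re-rank the pruned candidates using the original float vectors
    distances = np.linalg.norm(vectors[candidates] - query_vector, axis=1)
    return candidates[np.argsort(distances)[:k]]

Here, codes is a (N, 16) array of packed binary codes (int8 or uint8) and vectors the
corresponding (N, 128) float array; the coarse_k parameter plays the same role as the
rerank-count setting in the fine-ranking profile below.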

Using Vespa’s nearest neighbor search query operator, one can search for the
semantically similar documents using the hamming distance metric. The following
example uses the exact version and does not use the approximate version using
Vespa’s HNSW indexing support. The next blog post in this series will compare
exact with approximate search. The following Vespa document schema defines
two Vespa ranking profiles:

schema code {
  document code {
    field id type int {..} 
    field binary_code type tensor<int8>(b[16]) {..}
 }
 rank-profile coarse-ranking {
    num-threads-per-search:12
    first-phase { expression { closeness(field,binary_code) } } 
 }
 rank-profile fine-ranking inherits coarse-ranking {
    second-phase { 
      rerank-count:200
      expression { .. } 
    } 
 }
}

The coarse-ranking ranking
profile ranks documents by the
closeness rank feature which in our case is the inverse hamming distance.
By default, Vespa sorts documents by descending relevancy score,
hence the closeness(field,name) rank feature uses
1/(1 + distance()) as the relevance score.

The observant reader might have noticed the
num-threads-per-search ranking profile setting.
This setting allows parallelizing the search and ranking using
multiple CPU threads, reducing the overall serving latency at the cost of
increased CPU usage per query. This allows better use of multicore CPU architectures.

The second ranking profile fine-ranking inherits the first phase
ranking function from the coarse-ranking profile and re-ranks the top k results using a more sophisticated model,
for example using the original representation.

The nearest neighbor search is expressed using the Vespa YQL query
language in a query API HTTP(S) request.

A sample JSON POST query is given below, searching for the 10 nearest neighbors of a binary coded query vector query(q_binary_code):

{
  "yql": "select id from code where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code));",
  "ranking.profile": "coarse-ranking",
  "ranking.features.query(q_binary_code)": [-18,-14,28,...],
  "hits": 10
}

Similarly, using the fine-ranking profile, we can also pass the original query vector representation, which might be
used in the second-phase ranking expression.

{
  "yql": "select id from code where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code));",
  "ranking.profile": "fine-ranking",
  "ranking.features.query(q_binary_code)": [-18,-14,28,...],
  "ranking.features.query(q_vector_real)": [-0.513,-0.843,0.034,...],
  "hits": 10
}

Vespa allows combining the nearest neighbor search query operator with
other query operators and filters. Filtering reduces the complexity of the nearest neighbor search as fewer candidates
are evaluated. Evaluating fewer documents saves memory bandwidth and CPU instructions.

See also this blog post for more examples of
combining the nearest neighbor query operator with filters. An example of filtering on a bool field type
is given below.

{
  "yql": "select id from code where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code)) and is_visible=true;",
  "ranking.profile": "coarse-ranking",
  "ranking.features.query(q_binary_code)": [-18,-14,28,...],
  "hits": 10
}
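
A query like the ones above can be submitted to the Vespa query API with any HTTP client. For
example, with the Python requests library (the endpoint and port below are the defaults for a
local Vespa instance and will differ for your deployment; the tensor values are shortened for
readability):

import requests

query = {
    "yql": "select id from code where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code));",
    "ranking.profile": "coarse-ranking",
    "ranking.features.query(q_binary_code)": [-18, -14, 28],  # pass all 16 int8 values in practice
    "hits": 10
}
response = requests.post("http://localhost:8080/search/", json=query)
print(response.json())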

In the above query examples we use the
short dense (indexed) tensor input format.
Note that query input tensors do not support the compact hex string representation. The above examples also assume that an external
system performs the binarization.
Vespa also supports importing ONNX models so that the binarization
could be performed in the Vespa stateless cluster before searching the content cluster(s),
see stateless model evaluation for examples and discussion.

This post introduced our blog series on billion-scale vector search and took a deep dive into representing binary codes using
Vespa’s tensor field type with int8 tensor cell precision.
We also covered coarse-level to fine-level search and ranking using hamming
distance as the coarse-level nearest neighbor search distance metric.

In the next blog post in this series we will
experiment with a billion-scale vector dataset from big-ann-benchmarks.com.
We will index it on a single Vespa content node and experiment with both exact and approximate vector search using hamming distance.

The focus of the next post will be to demonstrate some of the mentioned trade-offs from the introduction:

  • Real-time indexing throughput with and without HNSW enabled

Billion-scale vector search with Vespa – part two


Photo by
Vincentiu Solomon on Unsplash

In the first post in this series, we introduced compact binary-coded vector representations that can
reduce the storage and computational complexity of both exact and approximate nearest neighbor search.
This second post covers an experiment using a 1-billion binary-coded representation derived from a
vector dataset used in the big-ann-benchmarks challenge. The
purpose of this post is to highlight some of the trade-offs related to approximate nearest neighbor search,
with a particular focus on serving performance versus accuracy.

Vespa implements a version of the HNSW (Hierarchical Navigable Small World)
algorithm for approximate
vector search. Before diving into this post, we recommend reading the HNSW in Vespa
blog post explaining why we chose the HNSW algorithm.

Choosing a Vector Dataset

When evaluating approximate nearest neighbor search algorithms it is important to use realistic vector data.
If you don’t have the vectors for your problem at hand, use a publicly available dataset.

For our experiments, we chose the Microsoft SPACEV-1B
from Bing as our base vector dataset.

This is a dataset released by Microsoft from SpaceV, Bing web vector search scenario, for large
scale vector search-related research usage. It consists of more than one billion document vectors
and 29K+ query vectors encoded by the Microsoft SpaceV Superior model.

The vector dataset was published last year as part of the
big-ann-benchmarks challenge.
It consists of one billion 100-dimensional vectors using int8 precision. In other words,
each of the hundred vector dimensions is a number in the [-128,127] range.
The dataset has 29.3K queries with pre-computed ground truth nearest neighbors
using the euclidean distance for each query vector.
Vespa supports four different
tensor vector precision types,
in order of increasing precision:

  • int8 (8 bits, 1 byte) per dimension
  • bfloat16 (16 bits, 2 bytes) per dimension
  • float (32 bits, 4 bytes) per dimension
  • double (64 bits, 8 bytes) per dimension

Quantization and dimension reduction as part of the representation learning could save both
memory and CPU cycles in the serving phase, and Microsoft researchers have undoubtedly had this in mind
when using 100 dimensions with int8 precision for the embedding.

In the first post in this series we discussed some of the benefits of working with binarized vector datasets
where the original vector representation (e.g., int8) is transformed to binary.
We don’t train a binary representation model, but instead use the unsupervised sign threshold function.
Using the threshold function, we convert the mentioned SPACEV-1B vector dataset to a new and binarized
dataset. Both queries and document vectors are binarized, and we use the hamming distance
metric for the nearest neighbor search (NNS)
with our new dataset.

import numpy as np

# vector is the original 100-dimensional SPACEV-1B vector,
# e.g. np.array([-12, 3, 4, ...], dtype=np.int8)
binary_code = np.packbits(np.where(vector > 0, 1, 0)).astype(np.int8)

Binarization using NumPy

The binarization step packs the original 100-dimensional vector into 104 bits (packbits pads to whole bytes), which is represented using a
13-dimensional dense first-order int8 tensor in Vespa.

However, this transformation from euclidean to hamming does not preserve the
original euclidean distance ordering, so we calculate a new set of ground truth nearest neighbors
for the query dataset using hamming distance. Effectively, we create a new binarized vector dataset
which we can use to experiment and demonstrate some trade-offs related to vector search:

  • Brute force nearest neighbor search using multiple search threads to parallelize the search for
    exact nearest neighbors
  • Approximate nearest neighbor search where we accept an accuracy loss compared
    to exact search.
  • Indexing throughput with and without HNSW enabled
  • Memory and resource utilization with and without HNSW enabled

As described in the first post in this series, we can also use the hamming distance as the coarse level
nearest neighbor search distance metric and re-rank close vectors in hamming space using the
original representation’s euclidean distance. In our experiments, we focus on the new binarized
dataset using hamming distance.
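
As an illustration, the new hamming-distance ground truth mentioned above can be computed with a
brute-force pass like the following sketch (variable names are illustrative; for one billion
vectors this is done in batches and parallelized):

import numpy as np

def hamming_ground_truth(query_codes, doc_codes, k=10):
    # query_codes: (num_queries, 13) and doc_codes: (num_docs, 13) packed binary codes
    ground_truth = []
    for query in query_codes:
        xor = np.bitwise_xor(doc_codes, query).view(np.uint8)
        distances = np.unpackbits(xor, axis=1).sum(axis=1)
        ground_truth.append(np.argsort(distances)[:k])  # indices of the k nearest codes
    return np.array(ground_truth)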

Experimental setup

We deploy the Vespa application to the Vespa Cloud
performance zone, and use the following
node resources for the core Vespa service types, see services.xml
reference guide for details:

2x Stateless container with search API (<search>)

<nodes count="2">
  <resources memory="12Gb" vcpu="16" disk="100Gb"/>
</nodes>

2x Stateless container with feed API (<document-api>)

<nodes count="2">
  <resources memory="12Gb" vcpu="16" disk="100Gb"/>
</nodes>

1x Stateful content cluster with one node for storing and indexing the vector dataset

<nodes count="1">
  <resources memory="256Gb" vcpu="72" disk="1000Gb" disk-speed="fast"/>
</nodes>

This deployment specification isolates resources used for feed and search, except for search and indexing related
resource usage on the content node. Isolating feed and search allows for easier on-demand resource scaling, as the
stateless containers can be auto-scaled with changing read and write
volume faster than the stateful content resources. For self-hosted deployments of Vespa, you need to list the nodes you are using, and there is no auto-scaling.

The following is the base Vespa document schema we use throughout our experiments:

schema code {
  document code {
    field id type int {
      indexing: summary|attribute
    }
    field binary_code type tensor<int8>(b[13]) {
      indexing: attribute
      attribute {
        distance-metric:hamming
      }
    }
  }
}

Vespa document schema without HNSW indexing enabled

Evaluation Metrics

To evaluate the accuracy degradation when using approximate nearest neighbor search (ANNS) versus the exact ground
truth (NNS), we use the Recall@k metric, also called Overlap@k. Recall@k measures the overlap
between the k ground truth nearest neighbors for a query with the k nearest returned by the
approximate search.

The evaluation routine handles distance ties; if a vector returned by
the approximate search at position k has the same distance as the ground truth vector at position k,
it is still considered a valid kth nearest neighbor. For our experiments, we use k equal to 10.
The overall Recall@10 associated with a given parameter configuration is the mean Recall@10 over all 29.3K
queries in the dataset.
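
A minimal sketch of this Recall@k computation with tie handling could look like the following
(one possible implementation, assuming we have the ids and distances of both the true and the
approximate neighbors per query):

def recall_at_k(truth_ids, truth_distances, result_ids, result_distances, k=10):
    # a returned neighbor counts if it is among the true top k, or if its distance
    # ties with the distance of the kth true nearest neighbor
    truth_set = set(truth_ids[:k])
    kth_distance = truth_distances[k - 1]
    overlap = sum(
        1 for vector_id, distance in zip(result_ids[:k], result_distances[:k])
        if vector_id in truth_set or distance == kth_distance
    )
    return overlap / k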

Note that the vector overlap metric used does not necessarily directly correlate with application-specific recall
metrics. For example, recall is also used in Information Retrieval (IR) relevancy evaluations to measure if the judged
relevant document(s) are in the retrieved top k result. Generally, degrading vector search Recall@k
impacts the application-specific recall, but much depends on the use case.

For example, consider the Efficient Passage Retrieval with Hashing for
Open-domain Question Answering paper
discussed in the previous post in this series. The authors present a recall@k
metric for k equal to 1, 20 and 100. This specific recall@k measures if the ground truth
golden answer to the question is in any top k retrieved passage. In this case, the error introduced
by using approximate search might not impact the use case recall@k metric since the
exact answer to the question might exist in several retrieved documents. In other words, not
recalling a record with the ground truth answer due to inaccuracy introduced by using approximate
vector search does not necessarily severely impact the end-to-end recall metric.

Similarly, when using ANNS for image search at a web scale, there might be many equally relevant
images for almost any query. Therefore, losing relevant “redundant” pictures due to nearest neighbor
search accuracy degradation might not impact end-to-end metrics like revenue and user satisfaction severely.

However, reducing vector recall quality will impact other applications using nearest
neighbor search algorithms. Consider, for example, a biometric fingerprint recognition application
that uses nearest neighbor search to find the closest neighbor for a given query fingerprint in
a database of many fingerprints. Accepting any accuracy loss as measured by
Recall@1 (the true closest neighbor) will have severe consequences for the overall usefulness of the
application.

Vector Indexing performance

We want to quantify the impact of adding data structures for faster and approximate vector search on
vector indexing throughput. We use the Vespa HTTP feeding client
to feed vector data to the Vespa instance.

Writing becomes more expensive when enabling HNSW indexing for
approximate vector search. This is because insertion into the HNSW graph requires distance calculations and graph
modifications which reduces overall throughput. Vespa’s HNSW implementation uses multiple threads for
distance calculations during indexing, but only a single writer thread can mutate the HNSW graph.
The single writer thread limits concurrency and resource consumption. Generally, Vespa balances CPU resources used for indexing versus searching
using the concurrency setting.

Vespa exposes two core HNSW construction parameters
that impact feeding performance (and quality, as we will see in subsequent sections):

  • max-links-per-node Specifies how many links are created per vector inserted into the graph. The
    default value in Vespa is 16. The HNSW paper calls this parameter M.
    A higher value of max-links-per-node increases memory usage and reduces indexing throughput, but also improves the quality of the graph.

  • neighbors-to-explore-at-insert Specifies how many neighbors to explore when inserting a vector in
    the HNSW graph. The default value in Vespa is 200. This parameter is called efConstruction in the HNSW paper.
    A higher value generally improves the quality but lowers indexing throughput as each insertion requires more
    distance computations. This parameter does not impact memory footprint.

We experiment with the following HNSW parameter combinations for evaluating feed indexing
throughput impact.

  • No HNSW indexing for exact search
  • HNSW with max-links-per-node = 8, neighbors-to-explore-at-insert 48
  • HNSW with max-links-per-node = 16, neighbors-to-explore-at-insert 96

The document schema, with HNSW indexing enabled for the binary_code field, looks like this for the
last listed parameter combination:

schema code {
  document code {
    field id type int {
      indexing: summary|attribute
    }
    field binary_code type tensor<int8>(b[13]) {
      indexing: attribute|index
      attribute {
        distance-metric:hamming
      }
      index {
        hnsw {
          max-links-per-node: 16
          neighbors-to-explore-at-insert: 96
        }
      }
    }
  }
}

Vespa document schema with HNSW enabled

The real-time indexing throughput results are summarized in the following chart:


Real-time indexing performance without HNSW indexing and with two HNSW parameter combinations.

Without HNSW enabled, Vespa is able to sustain 80 000 vector puts/s. By increasing the number of nodes in the
Vespa content cluster using Vespa’s content distribution,
it is possible to increase throughput horizontally. For example, using four nodes instead of one would support 4×80 000 = 320 000 puts/s.


Sustained real-time indexing without HNSW, screenshot from Vespa Cloud Metrics

When we introduce HNSW indexing, the write-throughput drops significantly as it involves mutations of the
HNSW graph and distance calculations. In addition to indexing throughput, we also measure peak memory usage for the content node:

Peak memory usage (GB) without HNSW indexing and with two HNSW parameter combinations.

Now, you might ask: why is Vespa using 64GB of memory for this dataset in the baseline case without HNSW?
The reason is that Vespa stores the global document id (GID, a 96-bit hash of the document id string) in memory, and the document ids consume more memory than the vector
data alone. One billion document identifiers consume
about 33GB of memory. Finally, there is also 4GB of data for the integer id attribute.
The additional memory used for the in-memory GIDs supports fast elastic content distribution,
fast partial updates, and more.
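
A rough back-of-the-envelope breakdown of the baseline memory usage, using the numbers above
(the remaining gigabytes go to other per-document data structures and overhead):

docs = 1_000_000_000
binary_code_gb = docs * 13 / 1e9   # 13 bytes per 104-bit binary code -> 13 GB
id_attribute_gb = docs * 4 / 1e9   # 4 bytes per int id attribute     ->  4 GB
gid_gb = 33                        # in-memory global document ids, as stated above
print(binary_code_gb + id_attribute_gb + gid_gb)  # roughly 50 of the observed 64 GB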

As we introduce HNSW indexing, the memory usage increases significantly due to the additional HNSW graph data structure which is also
in memory for fast access during searches and insertions.

Brute-force exact nearest neighbor search performance

As we have seen in the indexing performance and memory utilization experiments, not using HNSW uses
considerably less memory, and is the clear indexing throughput winner – but what about the search
performance of brute force search? Without HNSW graph indexing, the complexity of the search for neighbors is linear with
the total document volume, so that is surely slow for 1B documents?

To overcome the latency issue, we can use one of the essential Vespa features: Executing a query using multiple
search threads.
By using more threads per query, Vespa can make better use of multi-CPU core architectures and
reduce query latency at the cost of increased CPU usage per query. See more on scaling latency
versus throughput using multithreaded search in the Vespa
sizing and performance guide.

To easily test multiple threading configurations, we deploy multiple
Vespa rank profiles. The ranking profile is
chosen in the query, so it’s easy to run experiments without having to re-deploy the application.

rank-profile hamming {
  num-threads-per-search:1
  first-phase {
    expression:closeness(field,binary_code)
  }
}

Billion-scale vector search using hybrid HNSW-IF

Photo by
Graham Holtshausen on Unsplash

The first blog post on billion-scale
vector search covered methods for compressing real-valued vectors to binary representations
and using hamming distance for efficient coarse level search.
The second post described approximate nearest neighbor search tradeoffs
using Hierarchical Navigable Small World (HNSW),
including memory usage, vector indexing performance, and query performance versus accuracy.

This post in the series introduces a cost-efficient hybrid method
for approximate nearest neighbor (ANN) search, combining HNSW with a disk-backed inverted file.
We call this hybrid method for ANN search HNSW-IF.

Introduction

In-memory algorithms, like HNSW,
for approximate nearest neighbor search, offer fast, high accuracy vector search
but quickly become expensive for massive vector datasets due to memory requirements.
The HNSW algorithm requires storing the vector data in memory for low latency access during query and indexing.

For example, a billion scale vector dataset using 768 dimensions with float precision requires
close to 3TiB of memory. In addition, the HNSW graph data structure needs to be in-memory,
which adds 20-40% in addition to the vector data.
Given this, indexing a 1B vector dataset using HNSW will need about 4TiB of memory.

In 2022, many cloud providers offer cloud instance types
with large amounts of memory, but these instance types also come with many v-CPUs, which
drives production deployment costs. These high-memory and high-compute instance types support massive queries per second and
might be the optimal instance type for applications needing to support large query throughput with high recall.
However, many real-world applications using vector search do not need enormous query throughput but still
need to search large billion-scale vector datasets with relatively low latency with high accuracy.
Therefore, large cloud instance types with thousands of GiB of memory and hundreds
of v-CPUs are not cost-efficient for those low query volume use cases.

Due to this, there is an increasing interest in hybrid ANN search
solutions using solid-state disks (SSD)
to store most of the vector data combined with in-memory graph data structures.
SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search
introduces a simple and effective solution for hybrid ANN search.

Introducing SPANN

SPANN combines the graph-based in-memory method for ANN search with the inverted file using clustering.
SPANN partitions the vector dataset of M vectors into N clusters.
The paper explores setting N to a number between 4% and 20% of M.
A centroid vector represents each cluster.
The paper describes different algorithms for
clustering the vector dataset into N clusters and finds that hierarchical balanced clustering (HBC) works best.
See Figure 10 in the paper: Different types of centroid selection.

The cluster centroid vector points to a posting list
containing the vectors close to the cluster centroid. Disk-based data structures back the posting lists of non-centroids,
and centroids are indexed using an in-memory ANN search algorithm. Unlike
quantization techniques for ANN search, all vector distance calculations are performed
with the full-precision vector representation.

SPANN searches for the k closest centroid vectors of the query vector in the in-memory ANN search
data structure. Then, it reads the k associated posting lists for the retrieved
centroids and computes the distance between the query vector
and the vector data read from the posting list.


Figure 1 illustrates SPANN.

Figure 1 gives a conceptual overview of SPANN for a small vector dataset of ten vectors.
There are two centroid vectors, vectors 4 and 9, referencing a posting list
consisting of the vectors close to the cluster the centroid represents.
A vector might be close to multiple cluster centroids, for example, vector 5 and vector 8 in the figure above.
These are examples of boundary vectors that lie between multiple centroids.

The offline clustering part of SPANN tries to balance the clusters so
that the posting lists are equal in size to reduce the time it takes to read the posting list from disk.
For example, if the vector has 100 dimensions using int8 precision and int32 for the vector id,
each posting list entry uses 104 bytes. With a 4KiB disk read page size,
one can read about 39 posting entries in a single IO read operation.

Hybrid HNSW-IF with Vespa

Inspired by the SPANN paper, we at the Vespa team
implemented a simplified version of SPANN using Vespa primitives, released
as a Vespa sample application.
We call this hybrid ANN search method HNSW-IF.

Vespa features used to implement HNSW-IF:

  • Real-time HNSW vector indexing
  • Real-time inverted index data structures
  • Disk based vectors using Vespa dense tensor type using paged option
  • Phased ranking
  • Stateless search and document processors

The following sections outline the differences between the method described in the SPANN paper and the
Vespa HNSW-IF sample application
implementation using Vespa primitives.

Vector Indexing and Serving architecture

  • Instead of clustering and computing centroids offline, let vectors from the original dataset
    represent centroids and use the original vector id as the centroid id.
    This approach does not waste any distance calculations at query time as the
    centroids are valid eligible vectors. A subset of the vector dataset (20%) is
    selected randomly to represent centroids. Random centroid selection only
    requires one pass through the vector dataset, splitting the dataset into
    centroids and non-centroids.

  • The vectors representing centroids are indexed in memory using
    Vespa’s support for vector indexing using
    Hierarchical Navigable Small World (HNSW).
    Searching 200M centroid vectors indexed with HNSW typically takes 2-3 milliseconds, single-threaded (depending on recall
    target and HNSW settings). Both the graph data structure and the vector data are stored in memory.

  • During indexing of vectors that are not cluster centroids,
    search for the k closest centroids in the HNSW graph of centroids and index the
    closest centroid ids using Vespa’s support for inverted indexes.
    Later, when the index build is complete, a search for a centroid id efficiently retrieves
    the closest non-centroid vector id.
    The inverted index consists of a dictionary of centroid ids pointing to
    posting lists of non-centroid vector ids. For a given billion scale dataset with 20% centroids,
    the maximum centroid dictionary vocabulary size is 200M.

  • A non-centroid vector might be present in multiple centroid clusters.
    Instead of storing the vector data in the posting lists, the vector data
    is stored once in a separate Vespa data structure, avoiding the duplication
    caused by storing the same vector data in multiple posting lists.
    The Vespa posting list entry instead stores the closeness (inverted distance) of the vector to the centroid,
    scaled to an integer weight. Only the vector ids are duplicated across centroid posting lists,
    not the vector data itself. Vespa posting lists are compressed using standard techniques for
    lossless posting list compression.

Querying Vectors

For an input query vector, first search the vectors representing centroids, using HNSW, for the k closest centroids.
Next, using the retrieved k nearest centroids from HNSW search,
search the inverted index using logical disjunction (OR) of the centroid ids retrieved
by the HNSW graph search. The actual implementation uses the
Vespa dotProduct multivalued query operator.

Each node involved in the inverted index query ranks the retrieved non-centroid vectors by
calculating the distance between the vector and the query vector. Finally, the result of the two
searches is merged and returned.
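
The following Python sketch illustrates this query flow conceptually (illustration only; in
Vespa, step one runs against the in-memory HNSW index and step two is a dotProduct disjunction
over the inverted index, as described above):

import numpy as np

def hnsw_if_search(query, centroids, centroid_ids, posting_lists, vectors, k_centroids=3, k=10):
    # step 1: find the k closest centroid vectors (brute force here to keep the
    # sketch self-contained; Vespa uses the in-memory HNSW graph for this step)
    closest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:k_centroids]
    retrieved_centroids = [centroid_ids[i] for i in closest]
    # step 2: disjunction (OR) over the posting lists of the retrieved centroids
    candidates = set(retrieved_centroids)  # centroids are themselves eligible vectors
    for centroid_id in retrieved_centroids:
        candidates.update(posting_lists.get(centroid_id, []))
    # step 3: rank all candidates by full-precision distance to the query, return the top k
    return sorted(candidates, key=lambda vid: np.linalg.norm(vectors[vid] - query))[:k]

Here, centroids is a (C, d) array of centroid vectors, centroid_ids holds their vector ids,
posting_lists maps a centroid id to the ids of its non-centroid vectors, and vectors maps a
vector id to its full-precision vector (a stand-in for the paged tensor attribute).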

The query serving flow can be optimized by two heuristics:

  • Cluster centroid dynamic pruning. After retrieving the k closest centroids from searching the HNSW graph,
    distant centroids (compared to the nearest centroid) can be pruned without significantly impacting recall.
    This distant centroid pruning heuristic reduces the number of seeks and reads
    for the inverted index query evaluation.
    The centroid pruning heuristic is dynamic; a query vector that retrieves
    many equally close centroids allows little pruning, while a query vector that retrieves
    centroids with more considerable distance differences might allow pruning many.

  • Retrieve using dynamic pruning. This heuristic sorts the retrieved vector ids by the
    closeness(q, centroid) * closeness(centroid, v) transitive closeness score where q is the query vector and v is the document vector.
    This phase is implemented as a Vespa first-phase
    ranking phase. The closeness(centroid,v) weight is stored in the posting list, and the closeness(q, centroid)
    is passed as a query term weight
    with the dotProduct query operator. This heuristic enables limiting the number of vector page-ins by using Vespa’s support
    for controlling phased ranking.
    The local per-node second-phase ranking calculates the full-precision closeness(q,v), which involves
    paging the vector data into memory from disk. The maximum re-ranking depth is
    a query time hyperparameter enabling easy experimentation.

Real-world considerations

Real-world applications using vector search need both batch and real-time vector indexing:

  • Batch indexing: An embedder model
    (for example, Data2Vec)
    that maps data to vector representation is trained, and embedding vector representations are produced for all known data items.
  • Incremental Real-time indexing: A new data item arrives and is encoded with the current version of the embedder model and needs to be indexed.

In addition, data items (with vector representation) need to be updated and deleted. The hybrid method
described in this blog post supports all CRUD (Create, Read, Update, Delete) operations using the standard Vespa
APIs.

  • Batch indexing with a new embedder model is handled by adding a model version field to the schema. Serving queries
    must restrict the search for vectors to the given model id using
    standard inverted index query evaluation and constrained vector search.
    See Query Time Constrained Approximate Nearest Neighbor Search and
    Embedding model hot swap.
    Having multiple active model versions increases the storage-related deployment cost linearly with the number of models.

  • New vectors using an existing embedding model are added as a non-centroid vector.
    As long as the ratio of centroids is large, one can expect to grow the vector volume significantly without
    significantly degrading accuracy.

The only thing the application owner needs to consider is that deleting large amounts of centroid vectors
will negatively impact recall. For most large-scale vector use cases, this is not a real problem. If the use case requires
deleting many vector items, one can consider decoupling centroids from real vectors so that centroids
are real centroids and not vectors part of the dataset.

Vespa Experimental Setup

The following section describes our experiments with the Vespa HNSW-IF sample application using
Vespa Cloud’s performance environment.
The Vespa Cloud performance environment makes it easy to iteratively develop applications and choose the ideal instance types
for any size vector dataset.


Vespa Cloud Console – sample app deployment in Vespa Cloud perf environment in aws-us-east-1c region.


Vespa HNSW-IF serving architecture overview.

The Vespa HNSW-IF representation uses the same
Vespa document schema to represent centroid and non-centroid vectors.
They are differentiated using a single field of type bool.

Using two content clusters with the same document schema
enables using different instance types for the two vector types:

  • High memory instances with remote storage for the centroid vectors using in-memory HNSW.
  • Inexpensive low memory instances with fast local storage for the non-centroid vectors.

This optimizes the deployment and resource cost – the vectors indexed using HNSW
do not need fast local disks since queries never page vector data from disk.
Similarly, for vectors indexed using the inverted file, the instances don’t
need a large amount of memory, but more memory can improve query performance
due to page caching.

The Vespa inverted index posting lists do not contain the vector data.
Instead, vector data is stored using Vespa paged tensor attributes,
a type of disk-backed memory mapped forward-index. The downside of not storing the vector
data in the postings file is that paging in a vector from disk for distance calculation requires one
additional disk seek. However, in our experience, locally attached SSD disks are rarely limited by random seek
but by GiB/s throughput bandwidth.

Vector Dataset

For our experiments with HNSW-IF, we use the 1B Microsoft SPACEV-1B vector dataset:

Microsoft SPACEV-1B is a new web search-related dataset
released by Microsoft Bing for this competition.
It consists of document and query vectors encoded by the
Microsoft SpaceV Superior model to capture generic intent representation.

The SPACEV-1B vector dataset

Managed Vector Search using Vespa Cloud

Photo by israel palacio on Unsplash

There is a growing interest in AI-powered vector representations of unstructured multimodal data
and searching efficiently over these representations. This blog post describes how your organization can unlock the full potential of multimodal AI-powered vector representations using Vespa – the industry-leading open-source big data serving engine.

Introduction

Deep Learning has revolutionized information extraction from unstructured data like text, audio, image, and videos.
Furthermore, self-supervised learning algorithms like data2vec
accelerate learning representations of speech, vision, text, and multimodal representations
combining these modalities. Pre-training deep neural network models using self-supervised
learning without expensive curated labeled data helps scale machine learning as
adoption and fine-tuning for a specific task requires fewer labeled examples.

Representing unstructured multimodal data as vectors or tensors unlocks new and exciting use cases
that were hard to foresee just a few years ago. Even a well-established AI-powered use case like
search ranking, which has been using AI to improve the search results for decades,
is going through a neural paradigm shift
driven by language models like BERT.

These emerging multimodal data-to-vector models increase the insight and knowledge organizations can
extract from their unstructured data. As a result, organizations leveraging this
new data paradigm will have a significant competitive advantage over organizations
not participating in this paradigm shift.
Learning from structured and unstructured data has historically
primarily been performed offline.
However, advanced organizations with access to modern infrastructure
and competence have started moving the learning process to the online stage,
using real-time,
in-session contextual features to improve AI predictions.

One example of real-time online inference or prediction is within-cart
recommendation systems,
where grocery and e-commerce sites recommend or predict
related items to supplement the user’s current cart contents.
An AI-powered recommendation model for this use case could use item-to-item similarity
or past sparse user-to-item interactions.
Still, without a doubt, using the real-time context, in this case, the cart’s contents,
can improve the model’s accuracy. Furthermore,
creating add-to-cart suggestions for all possible combinations offline is impossible
due to the combinatoric explosion of likely cart items.
This use case also has the challenging property that the number of things to choose from is extensive,
hundreds of millions in the case of Amazon. In addition, business constraints like in-stock status limit the candidate selection.

Building technology and infrastructure to perform computationally complex distributed AI inference
over billions of data items with low user-time serving latency constraints
is one of the most challenging problems in computing.

Vespa – Serving Engine

Vespa, the open-source big data serving engine, specializes in making it easy for
organizations of any size to move AI inference computations online at scale without investing a significant amount of resources in building infrastructure and technology. Vespa is a distributed computation engine that can scale in any dimension:

  • Scale elastically with data volume – handling billion scale
    datasets efficiently without pre-provisioning resources up-front.
  • Scale update and ingestion rates to handle evolving real-time data.
  • Scale with query volume using state-of-the-art retrieval and index structures and fully use modern hardware stacks.

In Vespa, AI is a first-class citizen and not an after-thought. The following Vespa primitives are the
foundational building blocks for building an online AI serving engine:

  • CRUD operations at scale. Dataset sizes vary across organizations and use cases. Handling fast-paced evolving datasets is one of Vespa’s core strengths. Returning to our in-cart recommendation system for a moment, handling in-stock status updates, price changes, or real-time click feedback can dramatically improve the experience – imagine recommending an item out of stock? A lost revenue opportunity and a negative user experience.
  • Document Model. Vespa’s document model supports structured and unstructured field types, including tensor fields representing single-order dense vectors. Vespa’s tensor storage and compute engine
    is built from the ground up.
    The document model with tensor also enables feature-store functionality, accessing real-time features close to the data.
    Features stored as Vespa attributes support in place real-time updates
    at scale (50K updates/s per tensor field per compute node).
  • A feature-rich query language. Vespa’s SQL-like query language
    enables efficient online selection over potentially billions of rows, combining structured and unstructured data in the same query.
  • Machine Learning frameworks and accelerator integrations. Vespa integrates with the most popular machine learning frameworks like
    Tensorflow, PyTorch,
    XGboost, and LightGBM.
    In addition, Vespa integrates with ONNX-Runtime
    for accelerated inference
    with large deep neural network models that accelerate powerful data-to-vector models.
    Vespa handles model versioning,
    distribution, and auto-scaling of online inference computations.
    These framework integrations complement Vespa’s native
    support for tensor storage and calculations over tensors.
  • Efficient Vector Search. AI-powered vector representations are at the core of the unstructured data revolution. Vespa implements a real-time version of the HNSW algorithm for efficient Vector search, an implementation that is vetted and verified with multiple vector datasets on ann-benchmarks.com.
    Vespa supports combining vector search with structured query filters at scale.

Get Started Today with Vector Search using Vespa Cloud.

We have created a getting started with Vector Search sample application which,
in a few steps, shows you how to deploy your Vector search use case to Vespa Cloud.
Check it out at github.com/vespa-cloud/vector-search.

The sample application features:

  • Deployment to Vespa Cloud environments (dev, perf, and production) and how to perform safe deployments to production using CI/CD
  • Vespa Cloud’s security model
  • Vespa Cloud Auto-Scaling and pricing, optimizing the deployment cost by auto-scaling by resource usage
  • Interacting with Vespa Cloud – indexing your vector data and searching it at scale.

For only $3.36 per hour, your organization can store and search 5M 768-dimensional vectors,
deployed in Vespa Cloud production zones with high availability, supporting thousands
of inserts and queries per second.

Vespa Cloud Console. Snapshot while auto-scaling of stateless container cluster in progress.

Vespa Cloud Console. Concurrent real-time indexing of vectors while searching. Scale as needed to
meet any low latency serving use case.

With this vector search sample application, you have a great starting point for
implementing your vector search use case, without worrying about managing complex infrastructure.
See also other Vespa sample applications using vector search:

  • State-of-the-art text ranking:
    Vector search with AI-powered representations built on NLP Transformer models for candidate retrieval.
    The application has multi-vector representations for re-ranking, using Vespa’s phased retrieval and ranking
    pipelines. Furthermore, the application shows how embedding models, which map the text data to vector representation, can be
    deployed to Vespa for run-time inference during document and query processing.

  • State-of-the-art image search: AI-powered multi-modal vector representations
    to retrieve images for a text query.

  • State-of-the-art open-domain question answering: AI-powered vector representations
    to retrieve passages from Wikipedia, which are fed into an NLP reader model that extracts the answer, all represented end-to-end in Vespa.

These are examples of applications built using AI-powered vector representations.

Vespa is available as a cloud service; see Vespa Cloud – getting started,
or self-serve Vespa – getting started.

Will new vector databases dislodge traditional search engines?

Doug Turnbull asks an interesting question on LinkedIn: Will new vector databases dislodge traditional search engines?

Photo by
Joshua Sortino on Unsplash

The short answer is no, but it depends on what you classify as a traditional search engine.

Features like phrase search, exact search, BM25 ranking, dynamic summaries, and result facets are features we take for granted in a search engine implementation. Most vector databases lack these features. Apache Lucene and Vespa have 20 years of development, adding search critical features.
Accelerated dynamic pruning algorithms like wand and BM-wand over inverted indexes also come to mind.

Major web search engines use semantic vector search for candidate retrieval but still allow users to perform an exact phrase search.
For example, searches for an article number, a phone number, or an ISSN are examples of search use cases that dense vector similarity computations cannot solve.

Real-world search ranking implementations include real-time signals like item stock availability, popularity, or other ranking business constraints. Unfortunately, these signals are hard to compress into a simple vector similarity calculation.

The future of search is hybrid

A successful search implementation uses hybrid retrieval techniques, combining the best of both types of representations; sparse and dense vectors. The hybrid model is demonstrably better than the sum of its parts, especially when applied to new domains without lots of interaction data to train vector embedding models that map data to vectors.

The critical observation is that search implementations must support exact matches and richer ranking than vector similarity alone. Given this, I believe integrating excellent dense vector search capabilities into feature-rich search engine technologies is the right direction.

However, not all search engines are alike.

Not all search engine architectures can add efficient dense vector search capabilities without significantly impacting latency, storage, and serving costs.

For example, search engines built on the Apache Lucene
library face severe challenges when exposing the recently added approximate nearest neighbor search support using HNSW graphs.
Apache Lucene achieves near real-time indexing by creating multiple immutable segments. One new segment per refresh interval. How many are active in total depends on the number of shards, indexing rate, refresh interval settings, and segment merge policies.

A vector search in Elasticsearch, Apache Solr, or OpenSearch, using Lucene, needs to scan all these active segments per shard, causing unpredictable latency and recall. Furthermore, the query cost, driven by vector distance calculations, increases almost linearly with the number of active segments since there is one graph per Lucene index segment.

But, can immutable segments with HNSW graphs be efficiently merged into fewer and larger segments to solve this search scalability problem?

Unfortunately not: HNSW graph data structures are inherently different from the classic inverted index data structures. Apache Lucene based its immutable segment architecture on sorted inverted index structures in 1998. Due to the sorted property, merging sorted posting lists was simple and cost-efficient. HNSW graphs for high-recall vector search, on the other hand, are immensely expensive to merge, effectively costing the same amount of compute as building a single HNSW graph from scratch.

The solution?

So what is the alternative if you want all the features of a traditional search engine but also want to incorporate dense vector search in your search use cases?

Luckily, Vespa, the open-source big data serving engine, is an alternative to Apache Lucene-based engines. Vespa implements a mutable HNSW graph per node adjacent to other mutable data structures for efficient retrieval and ranking.

Users of Vespa can effortlessly combine vector search with traditional search engine query operators, implementing hybrid search at scale.

Vespa’s core data structures are mutable, avoiding expensive merging and duplication. Vespa indexing latency is in milliseconds, not seconds. Updating a single field does not require a full re-indexing as in engines built on immutable data structures. For example, updating the popularity does not require reindexing the vector data like in engines built on Apache Lucene.

Critical features such as phrase search, exact search, BM25, proximity features, and result grouping come for free. In addition to performance, scalability, and reliability, Vespa has proven ranking results on the world’s most extensive open relevancy dataset.
Not anecdotes in a sales presentation, but proven ranking results,
available to anyone to reproduce.

If you are interested in learning more about Vespa and how organizations like
Spotify are using Vespa
to unlock the full potential of hybrid search and neural ranking, check out the Vespa Blog or get started with one of many Vespa sample applications. All the sample applications can either be deployed on-premise on your infrastructure or using Vespa Cloud.

Building Billion-Scale Vector Search – part one


Photo by Arnaud Mariat on Unsplash

How fast is fast? Many consider the blink of an eye, around 100-250ms, to be plenty fast. Try blinking 10 times in a
second. If you manage that, your blink-of-an-eye latency is around 100ms. But did you know that algorithms and data
structures for approximate vector search can search across billions of vectors in high dimensional space in a few
milliseconds?

In this blog post, we look at how fast is fast enough in the context of vector search, and how the answer to this
question impacts how we design and build a billion-scale vector search solution.

Introduction

Advances in self-supervised deep learning have revolutionized information extraction from unstructured data like
text,
audio, image, and
videos. In addition to these modalities, multimodality
models are on the rise, combining multiple modalities, for example, text and image data. These advances in deep learning
leading to better vector representations of data have driven increased interest in searching these representations at
scale.

Vectorization

Any machine learning algorithm must transform the input and output data into a data format the machine understands. This
is where vectors and, generally, tensors come into the picture.

Everything can be represented as a vector in high-dimensional vector space

Using the ML models mentioned above, we can convert our data into vectors and index the vectors using an algorithm for nearest
neighbor search. Given a query, for example, a picture, a text
document, a search query, or dating
preferences, we can
convert it into a vector representation using the same model we used to index our collection. Then, we can use this
representation to search for similar data in the collection using vector search in the high dimensional vector space.
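
Conceptually, this search step is a k-nearest-neighbor lookup in the embedding space. The following minimal sketch, using synthetic numpy arrays as stand-ins for real embeddings, shows exact search by cosine similarity:

import numpy as np

# Synthetic stand-ins for real embeddings: 100k collection vectors and one query vector.
collection = np.random.rand(100_000, 768).astype(np.float32)
query = np.random.rand(768).astype(np.float32)

# Exact (brute-force) nearest neighbor search by cosine similarity.
collection_norm = collection / np.linalg.norm(collection, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
scores = collection_norm @ query_norm
top_10 = np.argsort(-scores)[:10]  # indices of the 10 most similar vectors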

Searching over a few million vector representations is trivial, as the index fits into a single instance. There is
plenty of tooling for searching small amounts of vector data, where all the data can be served from a single node. We can replicate
the index over more nodes if we need to scale query throughput. However, with more data, we need to distribute the data over
multiple nodes.

More data, more problems

Building out real-time serving infrastructure for searching over billion-scale or trillion-scale vector datasets is one
of the most challenging problems in computing and has been
reserved for FAANG-sized organizations. When the data no longer fits into a single instance, it must be distributed
over multiple serving instances. With more instances come more failure modes. The serving system needs to implement
resilience for failures, distributed search over an elastic number of partitions, and replication to avoid losing data
in case of failures.

More data and more problems are reflected in the high pricing of cloud-based vector search solutions. The
pricing tells a story of both the infrastructure complexity and the market demand for fully managed vector search
offerings.

For example, suppose your organization wants to index 1B 512-dimensional vectors using Google Vertex AI Matching
Engine. In that case, you’ll be adding $389,000 per month to
your GCP cloud bill. That quote example is for one batch job of vectors. Each new batch indexing job adds $6,000. The
quote does not cover storing the data that produced the vectors; that data must be served out of a different serving
store.

Over the past few years, we have made a few observations around scaling vector search to billions of data items:

  • Surprisingly, many organizations have a lot of raw unstructured data,
    petabytes of data that easily reach billions of rows.
  • AI models to generate vector representations from this data have become a commodity,
    thanks to Huggingface.
  • Few organizations have Google’s level of query traffic searching the data.
  • Few organizations need query serving latency much lower than the blink of an eye.
  • Query volume changes and pre-provisioning resources for peak query traffic wastes resources.

These observations impact our design for a cost-efficient billion-scale vector search solution.

The quickest and most accurate methods for approximate nearest neighbor search (ANNS) use in-memory data structures.
For example, the popular HNSW graph algorithm for ANNS requires storing the vectors in memory for low latency access
during query and indexing. In 2022, many cloud providers offer instance types with large amounts of memory,
but these types also come with many v-CPUs, which drives up cost. These high-memory and high-compute instance types
support massive queries per second. They might be the optimal instance type for applications needing to support large
query throughput with high accuracy. However, as we have observed, many real-world applications do not need enormous
query throughput but still need to search sizeable billion-scale vector datasets with relatively low latency with high
accuracy.

Due to these tradeoffs, there is an increasing interest in hybrid
ANNS algorithms using solid-state disks (SSD) to store
most of the vector data combined with in-memory graph data structures. Storing the data on disk lowers costs
significantly due to storage hierarchy economics. Furthermore, 2022 cloud instances come with higher network bandwidth
than we are used to. The higher bandwidth allows us to move more data from content nodes to stateless compute nodes. In
addition, independent scaling of content and compute enables on-demand, elastic auto-scaling of resources.

Vespa’s value proposition

Vespa, the open-source big data serving engine, makes it straightforward for an organization of any size to implement
large-scale search and recommendation use cases. The following Vespa primitives are the foundational building blocks for
building a vector search serving system.

Document Schema(s)

Vespa’s schema model supports structured and unstructured data types, including tensors and vectors. Representing
tensors, vectors, and unstructured data types in the same document schema avoids consistency and synchronization issues
between data stores.

Vespa schema example

A simplified Vespa document schema, expressed using Vespa’s schema language.

CRUD (Create, Read, Update, Delete) operations

Add new documents, and update and delete documents using real-time APIs.

Searching structured and unstructured data in the same query request

A feature-rich query language for performing efficient selections over
the data. Vespa’s SQL-like query language enables efficient online selection over billions of documents, combining
search over structured and unstructured data in the same query.

Vespa implements a real-time version of the HNSW algorithm for
efficient and high-recall ANNS. The implementation is verified with multiple vector datasets on
ann-benchmarks.com and used in production by
Spotify.
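
As a minimal sketch of what a hybrid query can look like from a client, the snippet below posts a query to Vespa’s HTTP query API, combining a structured filter with an approximate nearest neighbor search. The endpoint, field names, and rank profile are illustrative assumptions, not a specific application’s schema.

import requests

# Hypothetical schema with a 'category' string field, an 'embedding' tensor field,
# and a 'hybrid' rank profile; replace with the fields of your own application.
query = {
    "yql": 'select * from sources * where category contains "shoes" '
           "and ({targetHits:100}nearestNeighbor(embedding, q_vec))",
    "input.query(q_vec)": [0.1] * 768,  # placeholder 768-dimensional query vector
    "ranking.profile": "hybrid",
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=query)
print(response.json())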

Highly Extensible Architecture

architecture

Vespa allows developers to embed custom components for stateless processing, allowing separation of processing from
storage content clusters. In addition, support for multiple content clusters allows scaling stateful resources
independently.

Summary

In the next blog post in this series, we’ll cover the design and implementation of a cost-efficient billion-scale image
search application over multimodal AI-powered CLIP representations. The application uses a hybrid ANN solution where
most of the vector data is stored on disk and where the most computationally expensive vector similarity operations are
performed in the stateless layer to allow faster, elastic auto-scaling of resources.

Building Billion-Scale Vector Search – part two

Decorative image

Photo by Julien Tromeur
on Unsplash

This blog is the second post in a series on building a billion-scale vector search using Vespa.
The series covers the cost and serving performance tradeoffs related to approximate nearest neighbor search at a large scale.
In the first post, we introduced some challenges related to building
large-scale vector search solutions. We presented these observations made over the years, working on billion-scale vector search systems:

  • Surprisingly, many organizations have a lot of raw unstructured data,
    petabytes of data that easily reach billions of rows.
  • AI models to generate vector representations from this data have become a commodity,
    thanks to Huggingface.
  • Few organizations have Google’s level of query traffic searching the data.
  • Few organizations need query serving latency much lower than the blink of an eye.
  • Query volume fluctuates daily, and pre-provisioning compute resources for peak query traffic wastes resources.

These observations impact our design for a cost-efficient billion-scale vector search solution.

Dataset

This blog post delves into these observations and how we think about implementing large-scale vector search without breaking the bank.
This work is also published as an
open-source Vespa sample application
if you want to jump ahead. To demonstrate, we needed a large-scale vector dataset, and thankfully,
the LAION team released the LAION-5B dataset earlier this year.
The LAION-5B dataset is built from crawled data from the Internet, and the dataset has
been used to train popular text-to-image models like StableDiffusion.
The LAION-5B dataset consists of the following:

  • The caption text of the image.
  • The URL of the image.
  • A 64-bit signed hash calculated over the concatenation of the caption text and URL.
  • The height and width of the image.
  • NSFW (not safe for work) labels, assigned by a machine-learned NSFW classifier.
  • A 768-dimensional vector representation of the image, obtained by running the image data through a
    multimodal encoding model (CLIP).

In addition, there are derived datasets, for example, aesthetic scores
for images in the LAION-5B dataset. These derived datasets can be joined into the
index using partial updates without re-indexing all the data.

We have previously released open-source Vespa sample applications using CLIP models for text-to-image search;
see the blog post and
Vespa sample application for text-to-image search
and text-to-video search.
Those sample applications are great building blocks for powerful multimodal search applications,
but they do not demonstrate scale.
The LAION-5B dataset, on the other hand, offers scale and pre-computed image embeddings with metadata,
and we thought this work would make a great addition to the rich set of Vespa sample applications.

There is a growing interest in generative text-to-image models and also controversy
around the data they are trained on. Where do these models find inspiration from? What about copyrights?
These are questions that many ask. Therefore, building an open, searchable multi-modal index over the
LAION-5B dataset is particularly interesting, given the current controversy.

Cats
The top-left image is generated by StableDiffusion, the image is encoded using CLIP, and the vector representation
is used to search the Vespa LAION-5B index using approximate nearest neighbor search.

Search features

This work allows searching both the metadata and the image CLIP embeddings using hybrid search,
combining sparse and dense vector representations. We wanted to implement the following core features:

  • Search the caption text or URL for keywords or phrases using the full-fledged
    Vespa query language.
    For example, select caption, url from image where url contains “https://www.erinhanson.com/”.
  • Given a text prompt, encode the text into CLIP embedding space, and use this embedding to search over the CLIP embedding vectors in the LAION-5B dataset.
  • Given an image, encode the image into CLIP embedding space, and use this embedding to search over the CLIP embedding vectors in the LAION-5B dataset.
    For example, given an image generated by StableDiffusion, search for similar images in the training set (LAION).
  • Hybrid combinations of the above, including filtering on metadata like NSFW labels or aesthetic scores.

CLIP encoding
CLIP (Contrastive Language–Image Pre-training) is trained on pairs of captions and images, input together.
At inference (after training), we can input an image or a text prompt independently and get back a vector representation.

In addition to the mentioned features, we want the ability to update the index. For example, when a new derived dataset is released,
such as laion2B-en-aesthetic,
which adds new metadata to images, such as the probability of the image containing unsafe content
or the probability of the image containing a watermark, we can update the index with that information
without having to rebuild the entire index. Efficiently updating an index without a full rebuild
saves resources and cost, and it also enables important functionality. For example, an updated aesthetic score
can be used when ranking images for a text query, either as a hard filter or as a ranking signal.
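
As a sketch of what such a partial update can look like, the snippet below uses Vespa’s /document/v1 API to assign a new aesthetic score to an existing image document. The endpoint, namespace, document type, and field name are illustrative assumptions, not the sample application’s actual schema.

import requests

# Hypothetical namespace 'laion', document type 'image', and field 'aesthetic_score'.
doc_api = "http://localhost:8080/document/v1/laion/image/docid/laion2b-en-12345"
partial_update = {"fields": {"aesthetic_score": {"assign": 6.2}}}

# A partial update only touches the given field; the stored vector and other fields are untouched.
response = requests.put(doc_api, json=partial_update)
response.raise_for_status()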

Designing a cost-efficient large-scale vector search solution

In a previous blog post, we presented a hybrid approximate nearest search algorithm
combining HNSW with an Inverted file so that the majority of the vector data could be stored on disk,
which, due to storage hierarchy economics, is at least one order of magnitude cheaper than in-memory graph representations.
In another post, we covered coarse-level approximate nearest neighbor search
methods where the vectors are compressed, for example, using binary representations in
combination with bitwise hamming distance
to perform an efficient but coarse-level search, and where full precision vectors were paged on-demand from disk for re-scoring.
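
As a reminder of how that coarse binary search works, here is a small numpy sketch, with random codes standing in for learned binary representations, of hamming-distance candidate selection:

import numpy as np

# 1M binary codes of 768 bits (96 bytes each) and one query code, randomly generated.
codes = np.random.randint(0, 256, size=(1_000_000, 96), dtype=np.uint8)
query = np.random.randint(0, 256, size=96, dtype=np.uint8)

# Hamming distance = popcount(code XOR query); unpackbits expands each byte into its bits.
distances = np.unpackbits(codes ^ query, axis=1).sum(axis=1)
candidates = np.argsort(distances)[:200]  # coarse candidate set for full-precision re-scoring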

In this work, implementing search use cases over the LAION-5B dataset,
we wanted to combine multiple approximate vector search techniques discussed in previous blog posts.
Additionally, we wanted to move parts of vector similarity computations out of the Vespa content clusters to stateless
clusters, for faster auto-scaling with query volume changes.

Relaxing latency requirements from single-digit to double-digit milliseconds
enables moving vector similarity calculations out of Vespa stateful nodes to
stateless container nodes. This, in turn, enables faster auto-scaling of stateless resources
with changes in query volume. In the cloud, where resources can be provisioned on-demand within seconds,
paying for idle resources at low user traffic is wasteful and costly.

To scale with ease as query traffic changes, we need to move vector similarity computations
to the Vespa stateless container layer and move vector data on-demand, efficiently, across the network from content clusters to stateless container clusters.
This also means we need a way to perform a first-phase similarity search close to the data on the content nodes
so that the stateless containers could work on a small subset of the vector dataset.
Without an efficient first-phase candidate coarse selection, we would need to move too much vector data per query across the network.
This would hurt serving latency and quickly run into network bandwidth bottlenecks at high query throughput.

Tiered Compute

This tiered compute approach, where smaller and computationally efficient models are applied close to the data,
is used in many real-world search and recommendation use cases, known as multi-phase retrieval and ranking.
Effectively, a distributed search operation falls under the MapReduce paradigm,
where the map compute stage is pushed close to the data, avoiding moving data across the network.
The reducer stage operates on a much smaller amount of data, which is suitable for transferring across the network.

Hybrid HNSW-IF with PCA

The hybrid HNSW-Inverted-File method, described in detail in
Billion-scale vector search using hybrid HNSW-IF,
uses clustering or random centroid selection to identify centroids that span the high-dimensional vector space.
Centroids are indexed using Vespa’s support for HNSW indexing,
allowing hundreds of millions of centroids to be searched efficiently on a single instance in single-digit milliseconds, single-threaded.

With Vespa’s support for distributed search, the centroid content cluster storing and indexing the centroids
can be scaled to billions of centroids, unlocking indexing of trillion-sized vector datasets.
Separating the centroid graph into a dedicated content cluster enables using memory-optimized instance
types without needing locally attached disks, further optimizing the deployment cost.

Vectors from the real-world dataset are assigned a set of close centroids at
indexing time and indexed into the inverted file content cluster.
The inverted index then maps from a centroid id to the posting list of vectors close to the centroid.
Vespa’s inverted indexes are disk-based and memory mapped.

In our previous work using HNSW-IF with a static vector dataset containing vectors without any metadata,
we split the dataset into two, centroids and non-centroids,
using the same Vespa document schema with a field to differentiate the two classes.
This work diverges from that approach and instead represents centroids as a
separate Vespa document schema. This has several benefits for real-world datasets:

  • The centroid representation can use fewer dimensions than the original embedding representation.
    For example, using vector quantization or other dimension-reduction techniques.
    Centroids are indexed using HNSW, and dimension reduction reduces the memory footprint and increases the number of
    centroids that can be indexed per instance type (memory resource constraints).
    We can fit 6x more centroids per node for any instance type if we reduce the vector dimensionality from 768 to 128.
    Additionally, vector similarity computations enjoy a similar speedup when reducing the number of dimensions.
  • Centroids are separated from the original vector dataset, and they are never deleted.
    Since the number of centroids is very large compared to other inverted file methods, we expect to be able to
    incrementally index new image data without having to redo the centroid selection.

Dimension reduction using Principal Component Analysis (PCA)

Our implementation uses a separate Vespa document schema to represent centroids,
enabling the use of dimension reduction for the centroid vectors.
The intuition behind this idea is that the centroid search
is a coarse search. For example, in a 2-dimensional geographical longitude and latitude vector space,
we can quickly narrow in on the grid coordinates closest to our point of interest before doing the fine-level search.

A technique for performing dimension reduction is
principal component analysis (PCA).
We train a PCA matrix (128×768) using a small subset (10%) of the vector dataset. This step is performed offline.
This dimension reduction technique fits well with the Vespa architecture, and we perform the PCA matrix multiplication during
the stateless processing of queries and documents. This step is implemented as a simple
stateless component.
The matrix multiplication between the incoming vector and the trained PCA matrix is accelerated
using Vespa’s support for accelerated inference with ONNX-Runtime.
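
The sketch below shows one way this offline step could look: fit a 768-to-128 projection on a sample of vectors with numpy and wrap it as a single-MatMul ONNX model. The input file name and the training details are assumptions for illustration, not taken from the sample application.

import numpy as np
import onnx
from onnx import TensorProto, helper

# A sample of the 768-dimensional vectors, shape (N, 768); hypothetical file name.
sample = np.load("clip_sample_10pct.npy").astype(np.float32)
# Principal components via SVD; mean-centering is omitted here so the deployed
# graph stays a single MatMul, matching the compute graph shown below.
_, _, vt = np.linalg.svd(sample, full_matrices=False)
pca = vt[:128].T.astype(np.float32)  # (768, 128) projection matrix

# Wrap the projection as an ONNX model: reduced = vector (1x768) @ A (768x128).
a = helper.make_tensor("A", TensorProto.FLOAT, pca.shape, pca.flatten().tolist())
matmul = helper.make_node("MatMul", inputs=["vector", "A"], outputs=["reduced"])
graph = helper.make_graph(
    [matmul], "pca",
    [helper.make_tensor_value_info("vector", TensorProto.FLOAT, [1, 768])],
    [helper.make_tensor_value_info("reduced", TensorProto.FLOAT, [1, 128])],
    initializer=[a],
)
onnx.save(helper.make_model(graph), "pca.onnx")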

The PCA matrix weights are stored in an ONNX model imported into the Vespa application package.
The ONNX compute graph visualization is given below. The input is a 768-dimensional vector,
and the output is a reduced 128-dimensional vector.

ONNX Ranking Compute Graph

PCA matrix multiplication for dimension reduction. The A matrix represents the trained PCA matrix weights.

After using dimension reduction for centroids to lower the memory footprint of the HNSW graph,
we realized that we could also use dimension reduction for the image embedding vectors,
using the reduced vector space for a coarse-level first-phase ranking.

  • At query time, we reduce the query vector using the same
    dimension reduction technique and use the reduced query vector representation
    to search the HNSW graph for close centroids.
    After this first stage, we have a list of K centroid ids.
  • We dispatch a new query to the image content cluster,
    searching for the list of centroids obtained from the previous stage.
  • Each image content node ranks the images using innerproduct in the reduced vector space.
  • After obtaining the global top N images, ranked in the reduced vector space,
    we can request the full vector representation and perform re-ranking in the original vector space.

By implementing a coarse-level ranking of the images retrieved by the centroid search,
we can limit the number of full-precision vectors needed for innerproduct calculations in
the original 768-dimensional vector space. This tiered compute approach is
a classic example of tried and tested phased retrieval and ranking.
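
The following numpy sketch, with random vectors and a random projection standing in for the CLIP embeddings and the trained PCA matrix, illustrates the two phases:

import numpy as np

# Synthetic data: full-precision 768-d image vectors and their 128-d reduced versions.
full = np.random.rand(50_000, 768).astype(np.float32)
projection = np.random.rand(768, 128).astype(np.float32)  # stands in for the trained PCA matrix
reduced = full @ projection
query = np.random.rand(768).astype(np.float32)
query_reduced = query @ projection

# Phase 1: coarse ranking with the inner product in the reduced vector space.
coarse_scores = reduced @ query_reduced
candidates = np.argsort(-coarse_scores)[:1000]  # keep the top 1000 candidates

# Phase 2: re-rank only the candidates with the inner product in the original vector space.
exact_scores = full[candidates] @ query
top_10 = candidates[np.argsort(-exact_scores)[:10]]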

With phased ranking, we want the two ranking phases to correlate; a high innerproduct score in the PCA-reduced
vector space should also produce a high innerproduct score in the original vector space. If there is weak correlation,
the overall quality of the end-to-end retrieval will suffer.

Announcing vector streaming search: AI assistants at scale without breaking the bank

Decorative
image

Photo by Marc Sendra Martorell on Unsplash

If you are using a large language model to build a personal assistant
you often need to give it access to personal data such as email, documents or images.
This is usually done by indexing the vectors in a vector database and retrieving by approximate nearest neighbor (ANN) search.

In this post we’ll explain why this is not a good solution for personal data
and introduce an alternative which is an order of magnitude cheaper while actually solving the problem:
Vector streaming search.

Let’s just build an ANN index?

Let’s say you’re building a personal assistant who’s working with personal data averaging 10k documents per user,
and that you want to scale to a million users – that is 10B documents.
And let’s say you are using typical cost-effective embeddings of 384 bfloat16s – 768 bytes per document.
How efficient can we make this in a vector database?

Let’s try to handle it the normal way by maintaining a global (but sharded) approximate nearest neighbor vector index.
Queries will need to calculate distances for vectors in a random access pattern as they are found in the index,
which means they’ll need to be in memory to deliver interactive latency.
Here, we need 10B * 768 bytes = 7.68 Tb of memory for the vectors,
plus about 20% for the vector index, for a total of about 9.2 Tb of memory to store a single copy of the data.
In practice, though, you need two copies to be able to deliver a user’s data reliably,
some headroom for other in-memory data (say 10%), and about 35% headroom for working memory.
This gives a grand total of 9.2 * 2 * 1.1 / 0.65 = 31Tb.

If we use nodes with 128Gb of memory, that works out to 31Tb / 128Gb = 242 nodes.
On AWS, we can use i4i.4xlarge nodes at a cost of about $33 per node per day, so our total cost becomes 242 * 33 = $8000 per day.
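
The back-of-envelope math is easy to reproduce; the small script below uses the same assumptions as the text and arrives at roughly the same node count and daily cost:

# Rough capacity estimate for the in-memory ANN approach described above.
docs = 10_000_000_000                 # 10B documents
vector_bytes = 384 * 2                # 384 bfloat16 values = 768 bytes per document
vectors = docs * vector_bytes         # 7.68 Tb of raw vector data
with_index = vectors * 1.20           # ~20% extra for the vector index -> ~9.2 Tb
total = with_index * 2 * 1.10 / 0.65  # two copies, 10% other data, 35% working-memory headroom
nodes = total / 128e9                 # 128Gb nodes
print(f"{total / 1e12:.1f} Tb, {nodes:.0f} nodes, ${nodes * 33:.0f} per day")
# Prints roughly 31.2 Tb, 244 nodes, $8042 per day; the text rounds to 242 nodes and $8000.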

Hefty, but at least we get a great solution right? Well, not really.

The A in ANN stands for approximate – the results from an ANN index will be missing some documents,
including likely some of the very best ones. That is often fine when working with global data,
but is it really acceptable to miss the one crucial mail, photo or document the user needs to complete some task correctly?

In addition – ANN indexes shine when most of the vectors in the data are eligible for a given query,
that is when query filters are weak. But here we need to filter on the user’s own data,
so our filter is very strong indeed and our queries will be quite expensive despite all the effort of building the index.
In fact it would be cheaper to not make use of the index at all (which is what Vespa would automatically do when given these queries).

Lastly, there’s write speed. A realistic speed here is about 8k inserts per node per second.
Since we have 2 * 10B/242 = 82M documents per node, it will take about
82M/(8k * 3600) = 2.8 hours to feed the entire data set, even though we have this massive number of powerful nodes.

To recap, this solution has four problems as shown in this table:

Regular ANN for personal data
❌ Cost: All the vectors must be in memory, which becomes very expensive.
❌ Coverage: ANN doesn’t find all the best matches, problematic with personal data.
❌ Query performance: Queries are expensive to the point of making an ANN index moot.
❌ Write performance: Writing the data set is slow.

Can we do better?

Let’s consider some alternatives.

The first observation to make is that we are building a global index capable of searching all users' data at once,
but we are not actually using this capability since we always search in the context of a single user.
So, could we build a single ANN index per user instead?

This actually makes the ANN indexes useful since there is no user filter. However, the other three problems remain.

ANN (approximate nearest neighbor) for personal data
❌ Cost: All the vectors must be in memory, which becomes very expensive.
❌ Coverage: ANN doesn’t find all the best matches, problematic with personal data.
✅ Query performance: One index per user makes queries cheap.
❌ Write performance: Writing the data set is slow.

Can we drop the ANN index and do vector calculations brute force?
This is actually not such a bad option (and Vespa trivially supports it).
Since each user has a limited number of documents, there is no problem getting good latency by brute forcing over a user’s vectors.
However, we still store all the vectors in memory so the cost problem remains.

NN (exact nearest neighbor) for personal data
❌ Cost: All the vectors must be in memory, which becomes very expensive.
✅ Coverage: All the best matches are guaranteed to be found.
✅ Query performance: Cheap enough; one user’s data is a small subset of a node’s data.
✅ Write performance: Writing data is an order of magnitude faster than with ANN.

Can we avoid the memory cost? Vespa provides an option to mark vectors paged,
meaning portions of the data will be swapped out to disk.
However, since this vector store is not localizing the data of each user
we still need a good fraction of the data in memory to stay responsive, and even so both query and write speed will suffer.

NN (exact nearest neighbor) with paged vectors for personal data
🟡 Cost: A significant fraction of data must be in memory.
✅ Coverage: All the best matches are guaranteed to be found.
🟡 Query performance: Reading vectors from disk with random access is slow.
✅ Write performance: Writing data is an order of magnitude faster than with ANN.

Can we do even better, by localizing the vector data of each user
and so avoid the egregious memory cost altogether while keeping good performance?
Yes, with Vespa’s new vector streaming search you can!

Vespa’s streaming search solution lets you make the user id a part of the document id
so that Vespa can use it to co-locate the data of each user on a small set of nodes and on the same chunk of disk.
This allows you to do searches over a user’s data with low latency without keeping any user’s data in memory
or paying the cost of managing indexes at all.

This mode has been available for a long time for text and metadata search,
and we have now extended it to support vectors and tensors as well, both for search and ranking.

With this mode you can store billions of user vectors, along with other data, on each node without running out of memory,
write it at a very high throughput thanks to Vespa’s log data store, and run queries with:

  • High throughput: Data is co-located on disk, or in memory buffers for recently written data.
  • Low latency regardless of user data size: Vespa will,
    in addition to co-locating a user’s data, also automatically spread it over a sufficient number of nodes to bound query latency.

In addition you’ll see about an order of magnitude higher write throughput per node than with a vector indexing solution.

The resource driving cost instead moves to disk I/O capacity, which is what makes streaming so much cheaper.
To compare with our initial solution which required 242 128Gb nodes – streaming requires about 45 bytes to be stored in memory per document
so we’ll be able to cram about 128Gb / 45 * 0.65 = 1.84 B documents on each node.
We can then fit two copies of the 10B document corpus on 20B / 1.84B = 11 nodes.

Quite a reduction! In a very successful application you may want a little more to deliver sufficient query capacity
(see the performance case study below), but this is the kind of savings you’ll see for real production systems.

Vector streaming search for personal data
✅ Cost: No vector data (or other document data) must be in memory.
✅ Coverage: All the best matches are guaranteed to be found.
✅ Query performance: Localized disk reads are fast.
✅ Write performance: Writing data is faster even with less than 1/20 of the nodes.

You can also combine vector streaming search with regular text search
and metadata search with little additional cost, and with advanced machine-learned ranking on the content nodes.
These are features you’ll also need if you want to create an application that gives users high quality responses.

To use streaming search in your application, make these changes to it:

  • Set streaming search mode for the document type in services.xml:
        <documents>
            <document type="my-document-type" mode="streaming" />
        </documents>
  • Feed documents with ids that include the user id of each document by
    setting the group value on ids. Ids will then be of the form id:myNamespace:myType:g=myUserId:myLocalId, where the g=myUserId part is new.
  • Set the user id to search on each query by setting the parameter
    streaming.groupname to the user id.

See the streaming search documentation for more details,
and try out the vector streaming search sample application to get started.

Performance case study

To measure the performance of Vespa’s vector streaming search we deployed a modified version of the
nearest neighbor streaming performance test
to Vespa Cloud.
We changed the node resources
and count for container and content nodes to fit the large scale use case.

The dataset used is generated and consists of 48B documents, spread across 3.7M users.
The average number of documents per user is around 13000, and the document user distribution is as follows:

Documents per user | Percentage of users
100     | 35%
1000    | 28%
10000   | 22%
50000   | 10%
100000  | 5%

We used 20 content nodes with the following settings to store around 2.4B documents per content node (redundancy=1).
These nodes equate to the AWS i4i.4xlarge instance with one 3750Gb AWS Nitro local SSD disk.

<nodes deploy:environment="perf" count="20">
    <resources vcpu="16" memory="128Gb" disk="3750Gb" storage-type="local" architecture="x86_64"/>
</nodes>

Content nodes

Vespa Cloud console showing the 20 content nodes allocated to store the dataset.

We used the following settings for container nodes. The node count was adjusted based on the particular test to run.
These nodes equate to the AWS Graviton 2 c6g.2xlarge instance.

<nodes deploy:environment="perf" count="32">
    <resources
        vcpu="8"
        memory="16Gb"
        disk="20Gb"
        storage-type="remote"
        architecture="arm64"/>
</nodes>

Feeding performance

The schema
in the application has two fields:

  • field id type long
  • field embedding type tensor<bfloat16>(x[384])

The embeddings are randomly generated by a document processor
while feeding the documents. In total each document is around 800 bytes, including the document id.
Example document put for user with id 10000021:

{
    "put":"id:test:test:g=10000021:81",
    "fields":{
        "id":81,
        "embedding":[
            0.424140,0.663390,
            ..,
            0.261550,0.860670
        ]
    }
}

To feed the dataset we used three instances of Vespa CLI
running in parallel on a non-AWS machine in the same geographic region (us east).
This machine has 48 CPUs and 256Gb of memory, and used between 40 and 48 CPU cores during feeding.
The total feed throughput was around 300k documents per second, and the total feed time was around 45 hours.

Feed throughput

Vespa Cloud console showing feed throughput towards the end of feeding 48B documents.

Container nodes

Vespa Cloud console showing the 32 container nodes allocated when feeding the dataset.

Query performance

To analyze query performance we focused on users with 1k, 10k, 50k and 100k documents each.
For each of these four groups we drew between 160k and 640k random user ids to generate query files with 10k queries each.
Each query uses the nearestNeighbor
query operator to perform an exact nearest neighbor search over all documents for a given user.
Example query for user with id 10000021:

yql=select * from sources * where {targetHits:10}nearestNeighbor(embedding,qemb)
&input.query(qemb)=[...]
&streaming.groupname=10000021
&ranking.profile=default
&presentation.summary=minimal
&hits=10

This query returns the 10 closest documents according to the angular distance between the document embeddings and the query embedding.
See how the default ranking profile
is