Building Billion-Scale Vector Search – part one


Photo by Arnaud Mariat on Unsplash

How fast is fast? Many consider the blink of an eye, around 100-250ms, to be plenty fast. Try blinking 10 times in a
second; if you can manage that, your blink-of-an-eye latency is around 100ms. But did you know that algorithms and data
structures for approximate vector search can search across billions of vectors in high-dimensional space in a few
milliseconds?

In this blog post, we look at how fast is fast enough in the context of vector search, and how the answer to this
question impacts how we design and build a billion-scale vector search solution.

Introduction

Advances in self-supervised deep learning have revolutionized information extraction from unstructured data like
text, audio, images, and video. In addition to these modalities, multimodal models are on the rise, combining multiple
modalities, for example, text and image data. These advances in deep learning, which produce better vector
representations of data, have driven increased interest in searching these representations at scale.

Vectorization

Any machine learning algorithm must transform its input and output data into a data format the machine understands. This
is where vectors and, more generally, tensors come into the picture.

Everything can be represented as a vector in high-dimensional vector space

Using such ML models, we can convert our data into vectors and index the vectors using an algorithm for nearest
neighbor search. Given a query, for example, a picture, a text document, a search query, or dating preferences, we can
convert it into a vector representation using the same model we used to index our collection. Then, we can use this
representation to search for similar data in the collection using vector search in the high-dimensional vector space.
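To make this flow concrete, here is a minimal sketch that encodes a small text collection and a query with an off-the-shelf embedding model and finds the nearest neighbor with an exact scan; the model name is just an example, and real systems replace the exhaustive scan with the approximate indexes discussed below.

```python
# Minimal sketch: encode data and a query into vectors, then run exact nearest neighbor search.
# Assumes the sentence-transformers package; the model name is only an example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "a photo of a cat sleeping on a sofa",
    "a recipe for sourdough bread",
    "hiking trails in the Norwegian mountains",
]

doc_vectors = model.encode(documents, normalize_embeddings=True)          # shape (3, 384)
query_vector = model.encode("cute kittens", normalize_embeddings=True)    # shape (384,)

# With normalized vectors, the inner product equals cosine similarity.
scores = doc_vectors @ query_vector
print(documents[int(np.argmax(scores))])  # expected: the cat document
```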

Searching over a few million vector representations is trivial, as the index can fit into a single instance. There is
much tooling for searching small amounts of vector data where all the data can be served from a single node. We can replicate
the index over more nodes if we need to scale query throughput. However, with more data, we need to distribute the data
over multiple nodes.

More data, more problems

Building out real-time serving infrastructure for searching over billion-scale or trillion-scale vector datasets is one
of the most challenging problems in computing and has been
reserved for FAANG-sized organizations. When the data no longer fits into a single instance, it must be distributed
over multiple serving instances. With more instances come more failure modes. The serving system needs to implement
resilience for failures, distributed search over an elastic number of partitions, and replication to avoid losing data
in case of failures.

More data and more problems are reflected in the high pricing of cloud-based vector search solutions. The pricing also
tells a story of the infrastructure complexity and the market demand for fully managed cloud-based vector search
solutions.

For example, suppose your organization wants to index 1B 512-dimensional vectors using Google Vertex AI Matching
Engine. In that case, you’ll be adding $389,000 per month to
your GCP cloud bill. That quote example is for one batch job of vectors. Each new batch indexing job adds $6,000. The
quote does not cover storing the data that produced the vectors; that data must be served out of a different serving
store.

Over the past few years, we have made a few observations around scaling vector search to billions of data items:

  • Surprisingly, many organizations have a lot of raw unstructured data,
    petabytes of data that easily reach billions of rows.
  • AI models to generate vector representations from this data have become a commodity,
    thanks to Huggingface.
  • Few organizations have Google’s level of query traffic searching the data.
  • Few organizations need query serving latency much lower than the blink of an eye.
  • Query volume changes over time, and pre-provisioning resources for peak query traffic wastes resources.

These observations impact our design for a cost-efficient billion-scale vector search solution.

The quickest and most accurate methods for approximate nearest neighbor search (ANNS) use in-memory data structures.
For example, the popular HNSW graph algorithm for ANNS requires storing the vectors in memory for low-latency access
during query and indexing. In 2022, many cloud providers offer instance types with large amounts of memory,
but these types also come with many v-CPUs, which drives up costs. These high-memory, high-compute instance types
support massive query throughput and might be the optimal choice for applications needing high throughput with high
accuracy. However, as we have observed, many real-world applications do not need enormous query throughput but still
need to search sizeable billion-scale vector datasets with relatively low latency and high accuracy.
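To make the memory cost concrete, here is a back-of-the-envelope estimate for keeping one billion 768-dimensional float32 vectors in memory; the HNSW graph overhead per vector is an assumption, not a measured number.

```python
# Back-of-the-envelope memory estimate for an in-memory HNSW index over 1B vectors.
num_vectors = 1_000_000_000
dimensions = 768
bytes_per_float = 4

vector_bytes = num_vectors * dimensions * bytes_per_float
graph_overhead_bytes = num_vectors * 32 * 4   # assumed ~32 neighbor links of 4 bytes per vector

total_tib = (vector_bytes + graph_overhead_bytes) / 1024**4
print(f"~{total_tib:.1f} TiB of RAM")  # roughly 2.9 TiB, before any replication
```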

Due to these tradeoffs, there is an increasing interest in hybrid
ANNS algorithms using solid-state disks (SSDs) to store
most of the vector data, combined with in-memory graph data structures. Storing the data on disk lowers costs
significantly due to storage hierarchy economics. Furthermore, 2022 cloud instances come with higher network bandwidth
than we are used to. The higher bandwidth allows us to move more data from content nodes to stateless compute nodes. In
addition, independent scaling of content and compute enables on-demand, elastic auto-scaling of resources.

Vespa’s value proposition

Vespa, the open-source big data serving engine, makes it straightforward for an any-sized organization to implement
large-scale search and recommendation use cases. The following Vespa primitives are the foundational building blocks for
building a vector search serving system.

Document Schema(s)

Vespa’s schema model supports structured and unstructured data types, including tensors and vectors. Representing
tensors, vectors, and unstructured data types in the same document schema avoids consistency and synchronization issues
between data stores.

A simplified Vespa document schema, expressed using Vespa’s schema language.

CRUD (Create, Read, Update, Delete) operations

Add, update, and delete documents using real-time APIs.

Searching structured and unstructured data in the same query request

A feature-rich query language for performing efficient selections over
the data. Vespa’s SQL-like query language enables efficient online selection over billions of documents, combining
search over structured and unstructured data in the same query.
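As an illustration, a query combining keyword search with a structured filter might be issued against Vespa’s HTTP query API roughly as sketched below; the schema name and field names are hypothetical placeholders, not the schema used in this series.

```python
# Rough sketch of a Vespa query combining keyword search with a structured filter.
# The endpoint, schema name, and field names are assumptions for illustration.
import requests

query = {
    "yql": 'select caption, url from image where caption contains "cat" and height > 512',
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=query)
for hit in response.json()["root"].get("children", []):
    print(hit["fields"]["caption"])
```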

Vespa implements a real-time version of the HNSW algorithm for
efficient and high-recall ANNS. The implementation is verified with multiple vector datasets on
ann-benchmarks.com and used in production by
Spotify.

Highly Extensible Architecture


Vespa allows developers to embed custom components for stateless processing, allowing separation of processing from
storage content clusters. In addition, support for multiple content clusters allows scaling stateful resources
independently.

Summary

In the next blog post in this series, we’ll cover the design and implementation of a cost-efficient billion-scale image
search application over multimodal AI-powered CLIP representations. The application uses a hybrid ANN solution where
most of the vector data is stored on disk and where the most computationally expensive vector similarity operations are
performed in the stateless layer to allow faster, elastic auto-scaling of resources.

Building Billion-Scale Vector Search – part two


Photo by Julien Tromeur on Unsplash

This blog is the second post in a series on building a billion-scale vector search using Vespa.
The series covers the cost and serving performance tradeoffs related to approximate nearest neighbor search at a large scale.
In the first post, we introduced some challenges related to building
large-scale vector search solutions. We presented these observations made over the years, working on billion-scale vector search systems:

  • Surprisingly, many organizations have a lot of raw unstructured data,
    petabytes of data that easily reach billions of rows.
  • AI models to generate vector representations from this data have become a commodity,
    thanks to Huggingface.
  • Few organizations have Google’s level of query traffic searching the data.
  • Few organizations need query serving latency much lower than the blink of an eye.
  • Query volume fluctuates daily, and pre-provisioning compute resources for peak query traffic wastes resources.

These observations impact our design for a cost-efficient billion-scale vector search solution.

Dataset

This blog post delves into these observations and how we think about implementing large-scale vector search without breaking the bank.
This work is also published as an
open-source Vespa sample application
if you want to jump ahead. To demonstrate, we needed a large-scale vector dataset, and thankfully,
the LAION team released the LAION-5B dataset earlier this year.
The LAION-5B dataset is built from crawled data from the Internet, and the dataset has
been used to train popular text-to-image models like StableDiffusion.
The LAION-5B dataset consists of the following:

  • The caption text of the image.
  • The URL of the image.
  • A 64-bit signed hash calculated over the concatenation of the caption text and URL.
  • The height and width of the image.
  • NSFW (not safe for work) labels assigned by a machine-learned NSFW classifier.
  • A 768-dimensional vector representation of the image, obtained by running the image data through a
    multimodal encoding model (CLIP).
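Fed into Vespa through the /document/v1 API, a single LAION-5B row could look roughly like the sketch below; the namespace, document type, and field names are illustrative assumptions, and the 768-dimensional embedding is truncated.

```python
# Rough sketch of feeding one LAION-5B row via Vespa's /document/v1 HTTP API.
# Namespace, document type, and field names are illustrative assumptions.
import requests

doc = {
    "fields": {
        "caption": "a small cat sitting in a cardboard box",
        "url": "https://example.com/cat.jpg",
        "hash": -1234567890123456789,                   # 64-bit signed hash of caption + URL
        "height": 512,
        "width": 512,
        "nsfw": "unlikely",
        "vector": {"values": [0.013, -0.021, 0.044]},   # truncated; really 768 floats
    }
}
requests.post("http://localhost:8080/document/v1/laion/image/docid/laion-0001", json=doc)
```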

In addition, there are derived datasets, for example, aesthetic scores
for images in the LAION-5B dataset. These derived datasets can be joined into the
index using partial updates without re-indexing all the data.

We have previously released open-source Vespa sample applications using CLIP models for text-to-image search;
see the blog post and
Vespa sample application for text-to-image search
and text-to-video search.
Those sample applications are great building blocks for powerful multimodal search applications,
but they do not demonstrate scale.
The LAION-5B dataset, on the other hand, offers scale and pre-computed image embeddings with metadata,
and we thought this work would make a great addition to the rich set of Vespa sample applications.

There is a growing interest in generative text-to-image models, and also controversy around the data they are trained on.
Where do these models find their inspiration? What about copyright? These are questions many ask.
Therefore, building an open, searchable multi-modal index over the LAION-5B dataset is particularly interesting,
given the current controversy.

The top-left image is generated by StableDiffusion; the image is encoded using CLIP, and the vector representation
is used to search the Vespa LAION-5B index using approximate nearest neighbor search.

Search features

This work allows searching both the metadata and the image CLIP embeddings using hybrid search,
combining sparse and dense vector representations. We wanted to implement the following core features:

  • Search the caption text or URL for keywords or phrases using the full-fledged
    Vespa query language.
    For example, select caption, url from image where url contains “https://www.erinhanson.com/”.
  • Given a text prompt, encode the text into CLIP embedding space, and use this embedding to search over the CLIP embedding vectors in the LAION-5B dataset.
  • Given an image, encode the image into CLIP embedding space, and use this embedding to search over the CLIP embedding vectors in the LAION-5B dataset.
    For example, given an image generated by StableDiffusion, search for similar images in the training set (LAION).
  • Hybrid combinations of the above, including filtering on metadata like NSFW labels or aesthetic scores.
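For example, the hybrid combination in the last bullet, an approximate nearest neighbor search over the CLIP embedding restricted by an NSFW filter, might look roughly like this; the field names, rank profile, and targetHits value are illustrative assumptions.

```python
# Rough sketch of a hybrid Vespa query: approximate nearest neighbor search over the CLIP
# embedding combined with a metadata filter. All names are illustrative assumptions.
import requests

query = {
    "yql": 'select caption, url from image where '
           '{targetHits: 100}nearestNeighbor(vector, q_vector) and nsfw contains "unlikely"',
    "input.query(q_vector)": [0.013, -0.021, 0.044],   # truncated; really a 768-dim CLIP vector
    "ranking": "similarity",
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=query)
```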

CLIP (Contrastive Language–Image Pre-training) is trained on pairs of captions and images, inputting both at the same time.
At inference (after training), we can input an image or a text prompt independently and get back a vector representation.
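A rough sketch of producing such CLIP vectors with the Hugging Face transformers library is given below; the checkpoint name is an example of a CLIP ViT-L/14 model with 768-dimensional output, not necessarily the exact model used to build LAION-5B.

```python
# Rough sketch: encode a text prompt and an image into the shared CLIP vector space.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
text_vector = model.get_text_features(**text_inputs)      # shape (1, 768)

image_inputs = processor(images=Image.open("cat.jpg"), return_tensors="pt")
image_vector = model.get_image_features(**image_inputs)   # shape (1, 768)
```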

In addition to the mentioned features, we want the ability to update the index. For example, when there is a new derived dataset,
such as laion2B-en-aesthetic,
which adds new metadata to images, such as the probability of the image containing unsafe content
or the probability of the image containing a watermark, we can update the index with that information
without having to rebuild the entire index. The ability to efficiently update an index without re-building it
saves resources and cost, and also enables important functionality. For example, an updated aesthetic score can be used
when ranking images for a text query, either as a hard filter or as a ranking signal.
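With Vespa, such a metadata update is a partial update against the /document/v1 API; a minimal sketch follows, where the document id and field name are illustrative assumptions.

```python
# Rough sketch of a Vespa partial update: assign a new aesthetic score to one image document
# without re-feeding the embedding or any other field. Names are illustrative assumptions.
import requests

update = {"fields": {"aesthetic_score": {"assign": 6.8}}}
requests.put("http://localhost:8080/document/v1/laion/image/docid/laion-0001", json=update)
```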

Designing a cost-efficient large-scale vector search solution

In a previous blog post, we presented a hybrid approximate nearest neighbor search algorithm
combining HNSW with an inverted file so that the majority of the vector data could be stored on disk,
which, due to storage hierarchy economics, is at least one order of magnitude cheaper than in-memory graph representations.
In another post, we covered coarse-level approximate nearest neighbor search
methods where the vectors are compressed, for example, using binary representations in
combination with bitwise Hamming distance,
to perform an efficient but coarse-level search, and where full-precision vectors were paged on-demand from disk for re-scoring.

In this work, implementing search use cases over the LAION-5B dataset,
we wanted to combine multiple approximate vector search techniques discussed in previous blog posts.
Additionally, we wanted to move parts of the vector similarity computations out of the Vespa content clusters to stateless
clusters, for faster auto-scaling with query volume changes.

Relaxing latency requirements from single-digit milliseconds to double digits
enables moving vector similarity calculations out of Vespa stateful nodes and into
stateless container nodes. This, in turn, enables faster auto-scaling of stateless resources
with changes in query volume. In the cloud, where resources can be provisioned on-demand within seconds,
paying for idle resources at low user traffic is wasteful and costly.

To scale at ease with query traffic changes, we need to move vector similarity computations
to the Vespa stateless container layer and move vector data on-demand, efficiently, across the network from content clusters to stateless container clusters.
This also means we need a way to perform a first-phase similarity search close to the data on the content nodes
so that the stateless containers can work on a small subset of the vector dataset.
Without an efficient first-phase coarse candidate selection, we would need to move too much vector data per query across the network.
This would hurt serving latency and quickly run into network bandwidth bottlenecks at high query throughput.

Tiered Compute

This tiered compute approach, where smaller and computationally efficient models are applied close to the data,
is used in many real-world search and recommendation use cases, known as multi-phase retrieval and ranking.
Effectively, a distributed search operation falls under the MapReduce paradigm,
where the map compute stage is pushed close to the data, avoiding moving data across the network.
The reducer stage operates on a much smaller amount of data, which is suitable for transferring across the network.

Hybrid HNSW-IF with PCA

The hybrid HNSW-Inverted-File method, described in detail in
Billion-scale vector search using hybrid HNSW-IF,
uses clustering or random centroid selection to identify centroids that span the high-dimensional vector space.
Centroids are indexed using Vespa’s support for HNSW indexing,
allowing hundreds of millions of centroids to be searched on a single instance in single-digit milliseconds, single-threaded.

With Vespa’s support for distributed search, the centroid content cluster storing and indexing the centroids
can be scaled to billions of centroids, unlocking indexing trillion-sized vector datasets.
Separating the centroid graph into a dedicated content cluster enables using memory-optimized instance
types without needing locally attached disks, further optimizing the deployment cost.

Vectors from the real-world dataset are assigned a set of close centroids at
indexing time and indexed into the inverted file content cluster.
The inverted index then maps from a centroid id to the posting list of vectors close to the centroid.
Vespa’s inverted indexes are disk-based and memory mapped.
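Conceptually, the inverted file is simply a map from centroid id to the ids of the vectors assigned to it. The toy sketch below illustrates the assignment step with exact distance computations; the real system searches the centroids with HNSW instead.

```python
# Toy sketch of building the inverted file: assign each vector to its k closest centroids
# and record the resulting posting lists.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.standard_normal((1_000, 128)).astype(np.float32)   # reduced 128-dim space
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)
k = 3  # number of centroids assigned per vector

posting_lists = defaultdict(list)
for vector_id, vector in enumerate(vectors):
    distances = np.linalg.norm(centroids - vector, axis=1)
    for centroid_id in np.argsort(distances)[:k]:
        posting_lists[int(centroid_id)].append(vector_id)
```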

In our previous work using HNSW-IF with a static vector dataset containing vectors without any metadata,
we split the dataset into two, centroids and non-centroids,
using the same Vespa document schema with a field to differentiate the two classes.
This work diverges from that approach and instead represents centroids as a
separate Vespa document schema. This has several benefits for real-world datasets:

  • The centroid representation can use fewer dimensions than the original embedding representation.
    For example, using vector quantization or other dimension-reduction techniques.
    Centroids are indexed using HNSW, and dimension reduction reduces the memory footprint and increases the number of
    centroids that can be indexed per instance type (memory resource constraints).
    We can fit 6x more centroids per node for any instance type if we reduce the vector dimensionality from 768 to 128.
    Additionally, vector similarity computations enjoy a similar speedup when reducing the number of dimensions.
  • Centroids are separated from the original vector dataset, and centroids are never deleted.
    Since the number of centroids is very large, compared to other InvertedFile methods, we expect to be able to incrementally
    index new image data without having to redo the centroid selection.

Dimension reduction using Principal Component Analysis (PCA)

Our implementation uses a separate Vespa document schema to represent centroids,
enabling the use of dimension reduction for the centroid vectors.
The intuition behind this idea is that the centroid search
is a coarse search. For example, in a 2-dimensional geographical longitude and latitude vector space,
we can quickly focus on the grid coordinates to our point of interest before doing the fine-level search.

A technique for performing dimension reduction is
principal component analysis (PCA).
We train a PCA matrix (128×768) using a small subset (10%) of the vector dataset. This step is performed offline.
This dimension reduction technique fits well with the Vespa architecture, and we perform the PCA matrix multiplication during
the stateless processing of queries and documents. This step is implemented as a simple
stateless component.
The matrix multiplication between the incoming vector and the trained PCA matrix is accelerated
using Vespa’s support for accelerated inference with ONNX-Runtime.

The PCA matrix weights are stored in an ONNX model imported into the Vespa application package.
The ONNX compute graph visualization is given below. The input is a 768-dimensional vector,
and the output is a reduced 128-dimensional vector.

PCA matrix multiplication for dimension reduction. The A matrix represents the trained PCA matrix weights.
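A rough sketch of the offline step is shown below: fit PCA on a sample of the embeddings and export the projection as an ONNX model for ONNX-Runtime. The sketch folds sklearn’s mean-centering into a bias term, whereas the graph above shows a plain matrix multiplication; the file names and sampling are illustrative assumptions.

```python
# Rough sketch: fit PCA on a sample of 768-dim CLIP vectors and export the projection to ONNX.
import numpy as np
import torch
from sklearn.decomposition import PCA

sample = np.load("clip_sample.npy").astype(np.float32)   # assumed ~10% sample, shape (N, 768)

pca = PCA(n_components=128)
pca.fit(sample)

# y = (x - mean) @ W.T == x @ W.T + b with b = -W @ mean, so the whole transform
# becomes one affine layer that ONNX-Runtime can execute.
projection = torch.nn.Linear(768, 128)
projection.weight.data = torch.from_numpy(pca.components_.astype(np.float32))
projection.bias.data = torch.from_numpy((-pca.components_ @ pca.mean_).astype(np.float32))

torch.onnx.export(
    projection,
    torch.zeros(1, 768),              # example input: one full-precision vector
    "pca_transformer.onnx",
    input_names=["vector"],
    output_names=["reduced_vector"],
)
```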

After using dimension reduction for centroids to lower the memory footprint of the HNSW graph,
we realized that we could also use dimension reduction for the image embedding vectors,
using the reduced vector space for a coarse-level first-phase ranking.

  • At query time, we reduce the query vector using the same
    dimension reduction technique and use the reduced query vector representation
    to search the HNSW graph for close centroids.
    After this first stage, we have a list of K centroid ids.
  • We dispatch a new query to the image content cluster,
    searching for the list of centroids obtained from the previous stage.
  • Each image content node ranks the images using innerproduct in the reduced vector space.
  • After obtaining the global top N images, ranked in the reduced vector space,
    we can request the full vector representation and perform re-ranking in the original vector space.

By implementing a coarse-level ranking of the images retrieved by the centroid search,
we can limit the number of full-precision vectors needed for innerproduct calculations in
the original 768-dimensional vector space. This tiered compute approach is
a classic example of tried and tested phased retrieval and ranking.
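Putting the phases together, the toy numpy sketch below mimics the flow: coarse ranking in the reduced 128-dimensional space followed by re-ranking the top candidates with full-precision innerproduct. In the real application, the coarse phase runs on the content nodes and the re-ranking in the stateless layer; the sizes and random data here are only for illustration.

```python
# Toy sketch of the tiered ranking flow: coarse rank candidates in the reduced space,
# then re-rank only the best N in the original 768-dim space.
import numpy as np

rng = np.random.default_rng(1)
full_vectors = rng.standard_normal((5_000, 768)).astype(np.float32)   # candidates from the centroid search
pca_matrix = rng.standard_normal((128, 768)).astype(np.float32)       # stands in for the trained PCA matrix
reduced_vectors = full_vectors @ pca_matrix.T

query = rng.standard_normal(768).astype(np.float32)
reduced_query = pca_matrix @ query

# Phase 1: cheap innerproduct scoring in the reduced space, keep the top 200 candidates.
coarse_scores = reduced_vectors @ reduced_query
top_n = np.argsort(-coarse_scores)[:200]

# Phase 2: exact innerproduct re-ranking of the survivors in the original space.
final_scores = full_vectors[top_n] @ query
ranked_ids = top_n[np.argsort(-final_scores)]
```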

With phased ranking, we want the two ranking phases to correlate; a high innerproduct score in the PCA reduced
vector space should also produce a high innerproduct score in the original vector space. If there is weak correlation,
the overall quality