Billion-scale vector search using hybrid HNSW-IF


The first blog post on billion-scale
vector search covered methods for compressing real-valued vectors to binary representations
and using Hamming distance for efficient coarse-level search.
The second post described approximate nearest neighbor search tradeoffs
using Hierarchical Navigable Small World (HNSW),
including memory usage, vector indexing performance, and query performance versus accuracy.

This third post in the series on billion-scale vector search introduces a cost-efficient hybrid method
for approximate nearest neighbor (ANN) search that combines HNSW with a disk-backed inverted file.
We call this hybrid ANN search method HNSW-IF.

Introduction

In-memory algorithms for approximate nearest neighbor search, like HNSW,
offer fast, high-accuracy vector search,
but quickly become expensive for massive vector datasets due to memory requirements.
The HNSW algorithm requires storing the vector data in memory for low-latency access during querying and indexing.

For example, a billion-scale vector dataset using 768 dimensions with float precision requires
close to 3 TiB of memory. In addition, the HNSW graph data structure needs to be in memory,
which adds another 20-40% on top of the vector data.
Given this, indexing a 1B vector dataset using HNSW needs about 4 TiB of memory.
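As a back-of-the-envelope check, these numbers can be computed directly. The following sketch assumes a 40% graph overhead, the upper end of the range quoted above:

    # Back-of-the-envelope memory estimate for HNSW over 1B 768-dimensional float vectors.
    num_vectors = 1_000_000_000
    dims = 768
    bytes_per_float = 4

    vector_bytes = num_vectors * dims * bytes_per_float   # raw vector data
    graph_overhead = 0.40                                  # assumed upper end of the 20-40% range
    total_bytes = vector_bytes * (1 + graph_overhead)

    TiB = 1024 ** 4
    print(f"vector data: {vector_bytes / TiB:.2f} TiB")              # ~2.8 TiB
    print(f"with HNSW graph overhead: {total_bytes / TiB:.2f} TiB")  # ~3.9 TiB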

In 2022, many cloud providers offer cloud instance types
with large amounts of memory, but these instance types also come with many vCPUs, which
drives up production deployment costs. These high-memory, high-compute instance types support massive query throughput and
might be the optimal instance type for applications needing to sustain a large number of queries per second with high recall.
However, many real-world applications using vector search do not need enormous query throughput, but still
need to search large, billion-scale vector datasets with relatively low latency and high accuracy.
Therefore, large cloud instance types with thousands of GiB of memory and hundreds
of vCPUs are not cost-efficient for those low-query-volume use cases.

Due to this, there is increasing interest in hybrid ANN search
solutions that use solid-state disks (SSDs)
to store most of the vector data, combined with in-memory graph data structures.
SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search
introduces a simple and effective solution for hybrid ANN search.

Introducing SPANN

SPANN combines an in-memory graph-based method for ANN search with a disk-based inverted file, using clustering.
SPANN partitions the vector dataset of M vectors into N clusters.
The paper explores setting N to a number between 4% and 20% of M.
A centroid vector represents each cluster.
The paper describes different algorithms for
clustering the vector dataset into N clusters and finds that hierarchical balanced clustering (HBC) works best.
See Figure 10 in the paper: Different types of centroid selection.

Each cluster centroid vector points to a posting list
containing the vectors close to that centroid. The posting lists of non-centroid vectors are backed by disk-based data structures,
while the centroids are indexed using an in-memory ANN search algorithm. Unlike
quantization techniques for ANN search, all vector distance calculations are performed
with the full-precision vector representation.

SPANN searches the in-memory ANN data structure for the k centroid vectors
closest to the query vector. Then, it reads the posting lists associated with the retrieved
centroids and computes the distance between the query vector
and the vector data read from the posting lists.
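This search flow can be summarized in a short Python sketch. It is purely conceptual and not the paper's implementation; centroid_index, posting_lists, and vectors are hypothetical stand-ins for the in-memory ANN index and the disk-backed data structures:

    import numpy as np

    def spann_search(query, centroid_index, posting_lists, vectors, k_centroids=8, k=10):
        """Conceptual SPANN-style search (illustrative only).

        centroid_index : in-memory ANN index over centroid vectors, returning
                         the ids of the closest centroids for a query vector.
        posting_lists  : dict mapping centroid id -> list of vector ids (on disk in SPANN).
        vectors        : dict mapping vector id -> full-precision vector.
        """
        # 1) Find the closest centroids using the in-memory ANN index.
        closest_centroids = centroid_index.search(query, k_centroids)

        # 2) Read the posting lists of those centroids and collect candidate ids.
        candidates = set()
        for centroid_id in closest_centroids:
            candidates.update(posting_lists[centroid_id])

        # 3) Compute full-precision distances to the candidates and return the top k.
        scored = [(float(np.linalg.norm(vectors[v] - query)), v) for v in candidates]
        return sorted(scored)[:k]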


Figure 1 illustrates SPANN.

Figure 1 gives a conceptual overview of SPANN for a small vector dataset of ten vectors.
There are two centroid vectors, vectors 4 and 9, each referencing a posting list
consisting of the vectors close to the cluster the centroid represents.
A vector might be close to multiple cluster centroids, for example, vector 5 and vector 8 in the figure above.
These are examples of boundary vectors that lie between multiple centroids.

The offline clustering part of SPANN tries to balance the clusters so
that the posting lists are roughly equal in size, reducing the time it takes to read a posting list from disk.
For example, if the vector has 100 dimensions using int8 precision and int32 for the vector id,
each posting list entry uses 104 bytes. With a 4 KiB disk read page size,
a single IO read operation fetches roughly 39 posting entries.
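The per-entry and per-page arithmetic behind that estimate:

    entry_bytes = 100 + 4             # 100 int8 vector components + int32 vector id = 104 bytes
    page_bytes = 4 * 1024             # 4 KiB disk read page
    print(page_bytes // entry_bytes)  # 39 posting entries per single read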

Hybrid HNSW-IF with Vespa

Inspired by the SPANN paper, we at the Vespa team
implemented a simplified version of SPANN using Vespa primitives, released
as a Vespa sample application.
We call this hybrid ANN search method HNSW-IF.

Vespa features used to implement HNSW-IF:

  • Real-time HNSW vector indexing
  • Real-time inverted index data structures
  • Disk-based vectors using the Vespa dense tensor type with the paged option
  • Phased ranking
  • Stateless search and document processors

The following sections outline the differences between the method described in the SPANN paper and the
Vespa HNSW-IF sample application
implementation using Vespa primitives.

Vector Indexing and Serving architecture

  • Instead of clustering and computing centroids offline, let vectors from the original dataset
    represent centroids and use the original vector id as the centroid id.
    This approach does not waste any distance calculations at query time as the
    centroids are valid eligible vectors. A subset of the vector dataset (20%) is
    selected randomly to represent centroids. Random centroid selection only
    requires one pass through the vector dataset, splitting the dataset into
    centroids and non-centroids.

  • The vectors representing centroids are indexed in memory using
    Vespa’s support for vector indexing using
    Hierarchical Navigable Small World (HNSW).
    Searching 200M centroid vectors indexed with HNSW typically takes 2-3 milliseconds, single-threaded (depending on recall
    target and HNSW settings). Both the graph data structure and the vector data are stored in memory.

  • During indexing of vectors that are not cluster centroids,
    search for the k closest centroids in the HNSW graph of centroids and index the
    closest centroid ids using Vespa’s support for inverted indexes.
    Later, when the index build is complete, a search for a centroid id efficiently retrieves
    the ids of the closest non-centroid vectors.
    The inverted index consists of a dictionary of centroid ids pointing to
    posting lists of non-centroid vector ids. For a given billion scale dataset with 20% centroids,
    the maximum centroid dictionary vocabulary size is 200M.

  • A non-centroid vector might be present in multiple centroid clusters.
    To avoid the duplication caused by storing the same vector data in multiple posting lists,
    the vector data is stored in a separate Vespa data structure, and only the
    vector ids are duplicated across centroid posting lists.
    Each Vespa posting list entry stores the closeness (inverted distance) of the vector to the centroid,
    scaled to an integer weight. Vespa posting lists are compressed using standard techniques for
    lossless posting list compression. The overall indexing flow is sketched below.
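The following is a minimal sketch of the indexing flow described above. It is illustrative Python only, not the sample application code; BruteForceIndex and the plain dicts are hypothetical stand-ins for Vespa's in-memory HNSW index, inverted index, and paged tensor attribute store:

    import random
    import numpy as np

    class BruteForceIndex:
        """Stand-in for Vespa's in-memory HNSW index (exact search, illustration only)."""
        def __init__(self):
            self.items = {}                       # vector id -> vector

        def add(self, vid, vector):
            self.items[vid] = np.asarray(vector, dtype=np.float32)

        def search(self, query, k):
            """Return the k closest (id, distance) pairs."""
            query = np.asarray(query, dtype=np.float32)
            dists = [(float(np.linalg.norm(v - query)), vid) for vid, v in self.items.items()]
            return [(vid, d) for d, vid in sorted(dists)[:k]]

    def build_hnsw_if(dataset, centroid_ratio=0.2, k=12):
        """Illustrative HNSW-IF index build (not the Vespa sample application code).

        dataset: dict mapping vector id -> vector (array-like).
        """
        ids = list(dataset)
        centroid_ids = set(random.sample(ids, int(centroid_ratio * len(ids))))

        centroid_index = BruteForceIndex()   # Vespa uses a real HNSW index here
        inverted = {}                        # centroid id -> [(vector id, weight), ...]
        store = {}                           # vector id -> vector; disk-backed (paged) in Vespa

        # Index the randomly selected centroid vectors in the in-memory index.
        for cid in centroid_ids:
            centroid_index.add(cid, dataset[cid])

        # Assign every non-centroid vector to its k closest centroids.
        for vid, vec in dataset.items():
            store[vid] = np.asarray(vec, dtype=np.float32)
            if vid in centroid_ids:
                continue
            for cid, distance in centroid_index.search(vec, k):
                weight = int(1000 / (1 + distance))   # closeness, scaled to an integer weight
                inverted.setdefault(cid, []).append((vid, weight))

        return centroid_index, inverted, store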

Querying Vectors

For an input query vector, first search the vectors representing centroids, using HNSW, for the k closest centroids.
Next, search the inverted index using a logical disjunction (OR) of the centroid ids
retrieved by the HNSW graph search. The actual implementation uses the
Vespa dotProduct multivalued query operator.

Each node involved in the inverted index query ranks the retrieved non-centroid vectors by
calculating the distance between each vector and the query vector. Finally, the results of the two
searches are merged and returned.
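Conceptually, and reusing the hypothetical structures from the indexing sketch above (this is not the actual Vespa query execution):

    import numpy as np

    def hnsw_if_search(query, centroid_index, inverted, store, k_centroids=64, k=10):
        """Illustrative HNSW-IF query flow (not the actual Vespa query execution)."""
        query = np.asarray(query, dtype=np.float32)

        # 1) Search the in-memory (HNSW) index for the closest centroids.
        centroid_hits = centroid_index.search(query, k_centroids)   # [(centroid id, distance), ...]

        # 2) Logical OR over the posting lists of the retrieved centroids;
        #    in Vespa this is expressed with the dotProduct query operator.
        candidates = set()
        for cid, _distance in centroid_hits:
            candidates.update(vid for vid, _weight in inverted.get(cid, []))

        # 3) Rank candidates by distance to the query vector and merge with the
        #    centroid hits themselves, since centroids are also valid result vectors.
        scored = [(float(np.linalg.norm(store[vid] - query)), vid) for vid in candidates]
        scored += [(distance, cid) for cid, distance in centroid_hits]
        return sorted(scored)[:k]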

The query serving flow can be optimized by two heuristics:

  • Cluster centroid dynamic pruning. After retrieving the k closest centroids from searching the HNSW graph,
    distant centroids (compared to the nearest centroid) can be pruned without significantly impacting recall.
    This distant centroid pruning heuristic reduces the number of seeks and reads
    for the inverted index query evaluation.
    The centroid pruning heuristic is dynamic; a query vector that retrieves
    many equally close centroids allows little pruning, while a query vector that retrieves
    centroids with more considerable distance differences might allow pruning many of them.

  • Retrieve using dynamic pruning. This heuristic sorts the retrieved vector ids by the
    closeness(q, centroid) * closeness(centroid, v) transitive closeness score, where q is the query vector and v is the document vector.
    This phase is implemented as a Vespa first-phase
    ranking phase. The closeness(centroid, v) weight is stored in the posting list, and the closeness(q, centroid)
    is passed as a query term weight
    with the dotProduct query operator. This heuristic enables limiting the number of vector page-ins by using Vespa’s support
    for controlling phased ranking.
    The local, per-node second-phase ranking calculates the full-precision closeness(q, v), which involves
    paging the vector data into memory from disk. The maximum re-ranking depth is
    a query time hyperparameter enabling easy experimentation. The two ranking phases are sketched after this list.
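The two ranking phases can be sketched as follows. This is illustrative Python only; closeness is written out explicitly here and does not refer to Vespa's rank feature syntax:

    import numpy as np

    def closeness(a, b):
        """Inverted distance, the same quantity stored (scaled) in the posting lists."""
        return 1.0 / (1.0 + float(np.linalg.norm(np.asarray(a) - np.asarray(b))))

    def two_phase_rank(query, candidates, store, rerank_count=1000, k=10):
        """Illustrative two-phase ranking over HNSW-IF candidates.

        candidates: list of (vector id, first_phase_score), where first_phase_score is
                    closeness(q, centroid) * closeness(centroid, v), computed from the
                    query term weight and the posting list weight without touching
                    the vector data on disk.
        """
        # First phase: keep only the best rerank_count candidates using the cheap
        # transitive closeness score.
        candidates = sorted(candidates, key=lambda c: c[1], reverse=True)[:rerank_count]

        # Second phase: page in the vector data and compute the exact closeness(q, v).
        reranked = [(closeness(query, store[vid]), vid) for vid, _ in candidates]
        reranked.sort(reverse=True)
        return reranked[:k]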

Real-world considerations

Real-world applications using vector search need both batch and real-time vector indexing:

  • Batch indexing: An embedder model
    (for example, Data2Vec)
    that maps data to vector representation is trained, and embedding vector representations are produced for all known data items.
  • Incremental Real-time indexing: A new data item arrives and is encoded with the current version of the embedder model and needs to be indexed.

In addition, data items (with vector representation) need to be updated and deleted. The hybrid method
described in this blog post supports all CRUD (Create, Read, Update, Delete) operations using the standard Vespa
APIs.

  • Batch indexing with a new embedder model is handled by adding a model version field to the schema. Serving queries
    must restrict the search for vectors to the given model version using
    standard inverted index query evaluation and constrained vector search (see the query sketch after this list).
    See Query Time Constrained Approximate Nearest Neighbor Search and
    Embedding model hot swap.
    Having multiple active model versions increases the storage-related deployment cost linearly with the number of models.

  • New vectors using an existing embedder model are added as non-centroid vectors.
    As long as the ratio of centroids is large, one can expect to grow the vector volume significantly without
    noticeably degrading accuracy.
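As a rough illustration of such a constrained query, the request below combines an inverted index filter on the model version with a nearestNeighbor search. The document type, field names, rank profile, and tensor parameter are assumptions for illustration and do not come from the sample application:

    # Hypothetical Vespa query request restricting vector search to one model version.
    query_embedding = [0.1] * 768   # embedding produced by the matching embedder version
    query_request = {
        "yql": "select * from vector where model_version = 2 "
               "and ({targetHits: 100}nearestNeighbor(vector, q))",
        "input.query(q)": query_embedding,
        "ranking.profile": "ann",
        "hits": 10,
    }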

The only thing the application owner needs to consider is that deleting large numbers of centroid vectors
will negatively impact recall. For most large-scale vector use cases, this is not a real problem. If the use case requires
deleting many vector items, one can consider decoupling centroids from dataset vectors, so that centroids
are true centroids and not vectors that are part of the dataset.

Vespa Experimental Setup

The following section describes our experiments with the Vespa HNSW-IF sample application using
Vespa Cloud’s performance environment.
The Vespa Cloud performance environment makes it easy to iteratively develop applications and choose the ideal instance types
for any size vector dataset.


Vespa Cloud Console – sample app deployment in Vespa Cloud perf environment in aws-us-east-1c region.


Vespa HNSW-IF serving architecture overview.

The Vespa HNSW-IF representation uses the same
Vespa document schema to represent centroid and non-centroid vectors.
They are differentiated using a single field of type bool.

Using two content clusters with the same document schema
enables using different instance types for the two vector types:

  • High memory instances with remote storage for the centroid vectors using in-memory HNSW.
  • Inexpensive low memory instances with fast local storage for the non-centroid vectors.

This optimizes deployment and resource cost – the vectors indexed using HNSW
do not need fast local disks, since queries never page data in from disk.
Similarly, for vectors indexed using the inverted file, the instances don’t
need a large amount of memory, but more memory can improve query performance
due to page caching.

The Vespa inverted index posting lists do not contain the vector data.
Instead, vector data is stored using Vespa paged tensor attributes,
a type of disk-backed, memory-mapped forward index. The downside of not storing the vector
data in the postings file is that paging in a vector from disk for distance calculation requires one
additional disk seek. However, in our experience, locally attached SSDs are rarely limited by random seeks,
but rather by GiB/s throughput bandwidth.

Vector Dataset

For our experiments with HNSW-IF, we use the 1B Microsoft SPACEV-1B vector dataset:

Microsoft SPACEV-1B is a new web search-related dataset
released by Microsoft Bing for this competition.
It consists of document and query vectors encoded by the
Microsoft SpaceV Superior model to capture generic intent representation.

The SPACEV-1B vector dataset