Billion-scale vector search using hybrid HNSW-IF

Photo by
Graham Holtshausen on Unsplash

The first blog post on billion-scale
vector search covered methods for compressing real-valued vectors to binary representations
and using Hamming distance for efficient coarse-level search.
The second post described approximate nearest neighbor search tradeoffs
using Hierarchical Navigable Small World (HNSW),
including memory usage, vector indexing performance, and query performance versus accuracy.

This third post in the series on billion-scale search introduces a cost-efficient hybrid method
for approximate nearest neighbor (ANN) search that combines HNSW with a disk-backed inverted file.
We call this hybrid ANN search method HNSW-IF.

Introduction

In-memory algorithms for approximate nearest neighbor search, like HNSW,
offer fast, high-accuracy vector search
but quickly become expensive for massive vector datasets due to their memory requirements.
The HNSW algorithm requires storing the vector data in memory for low-latency access during querying and indexing.

For example, a billion-scale vector dataset using 768 dimensions with float precision requires
close to 3TiB of memory. In addition, the HNSW graph data structure needs to be in memory,
which adds another 20-40% on top of the vector data.
Given this, indexing a 1B vector dataset using HNSW needs about 4TiB of memory.
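
As a quick back-of-the-envelope check of these numbers, here is a minimal Python sketch; the 20-40% graph overhead range is the figure quoted above, not a measurement:

num_vectors = 1_000_000_000
dims = 768
bytes_per_float = 4
TiB = 1024 ** 4

vector_bytes = num_vectors * dims * bytes_per_float            # raw float vector data
graph_bytes = 0.3 * vector_bytes                               # assume ~30% HNSW graph overhead

print(f"vector data: {vector_bytes / TiB:.2f} TiB")            # ~2.8 TiB
print(f"with HNSW graph: {(vector_bytes + graph_bytes) / TiB:.2f} TiB")  # ~3.6 TiB, close to 4 TiB at 40%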

In 2022, many cloud providers offer cloud instance types
with large amounts of memory, but these instance types also come with many v-CPUs, which
drives up production deployment cost. These high-memory and high-compute instance types support massive query throughput and
might be the optimal choice for applications needing large query throughput with high recall.
However, many real-world applications using vector search do not need enormous query throughput but still
need to search large billion-scale vector datasets with relatively low latency and high accuracy.
Therefore, large cloud instance types with thousands of GiB of memory and hundreds
of v-CPUs are not cost-efficient for those low-query-volume use cases.

Due to this, there is an increasing interest in hybrid ANN search
solutions using solid-state disks (SSD)
to store most of the vector data combined with in-memory graph data structures.
SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search
introduces a simple and effective solution for hybrid ANN search.

Introducing SPANN

SPANN combines an in-memory graph-based ANN search method with a disk-backed inverted file built using clustering.
SPANN partitions the vector dataset of M vectors into N clusters.
The paper explores setting N to a number between 4% and 20% of M.
A centroid vector represents each cluster.
The paper describes different algorithms for
clustering the vector dataset into N clusters and finds that hierarchical balanced clustering (HBC) works best.
See Figure 10 in the paper: Different types of centroid selection.

The cluster centroid vector points to a posting list
containing the vectors close to the cluster centroid. Disk-based data structures back the posting lists of non-centroids,
and centroids are indexed using an in-memory ANN search algorithm. Unlike
quantization techniques for ANN search, all vector distance calculations are performed
with the full-precision vector representation.

SPANN searches for the k closest centroid vectors to the query vector in the in-memory ANN search
data structure. Then, it reads the k associated posting lists for the retrieved
centroids and computes the distance between the query vector
and the vectors read from the posting lists.
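
In pseudocode-like Python, this query flow looks roughly like the sketch below. The centroid_index, read_posting_list, and read_vector helpers are hypothetical stand-ins for the in-memory ANN index and the disk-backed data structures, not SPANN's actual implementation:

import numpy as np

def spann_search(query, centroid_index, read_posting_list, read_vector, k_centroids=64, k=10):
    # 1) In-memory ANN search over the centroid vectors only.
    centroid_ids = centroid_index.search(query, k_centroids)

    # 2) Read the posting lists of the retrieved centroids from disk.
    candidate_ids = set()
    for centroid_id in centroid_ids:
        candidate_ids.update(read_posting_list(centroid_id))

    # 3) Full-precision distance calculations against the candidate vectors.
    scored = [(float(np.linalg.norm(query - read_vector(doc_id))), doc_id) for doc_id in candidate_ids]
    return sorted(scored)[:k]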


Figure 1 illustrates SPANN.

Figure 1 gives a conceptual overview of SPANN for a small vector dataset of ten vectors.
There are two centroid vectors, vectors 4 and 9, referencing a posting list
consisting of the vectors close to the cluster the centroid represents.
A vector might be close to multiple cluster centroids, for example, vector 5 and vector 8 in the figure above.
These are examples of boundary vectors that lie between multiple centroids.

The offline clustering part of SPANN tries to balance the clusters so
that the posting lists are roughly equal in size, reducing the time it takes to read a posting list from disk.
For example, if the vector has 100 dimensions using int8 precision and int32 for the vector id,
each posting list entry uses 104 bytes. With a 4KiB disk read page size,
one can read about 39 posting entries in a single IO read operation.
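
As a quick check of the arithmetic:

entry_bytes = 100 * 1 + 4           # 100 int8 vector components + an int32 vector id
page_bytes = 4 * 1024               # 4KiB disk read page
print(page_bytes // entry_bytes)    # 39 posting entries per read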

Hybrid HNSW-IF with Vespa

Inspired by the SPANN paper, we at the Vespa team
implemented a simplified version of SPANN using Vespa primitives, released
as a Vespa sample application.
We call this hybrid ANN search method HNSW-IF.

Vespa features used to implement HNSW-IF:

  • Real-time HNSW vector indexing
  • Real-time inverted index data structures
  • Disk-backed vectors using the Vespa dense tensor type with the paged option
  • Phased ranking
  • Stateless search and document processors

The following sections outline the differences between the method described in the SPANN paper and the
Vespa HNSW-IF sample application
implementation using Vespa primitives.

Vector Indexing and Serving architecture

  • Instead of clustering and computing centroids offline, let vectors from the original dataset
    represent centroids and use the original vector id as the centroid id.
    This approach does not waste any distance calculations at query time, as the
    centroids are themselves eligible result vectors. A subset of the vector dataset (20%) is
    selected randomly to represent centroids. Random centroid selection only
    requires one pass through the vector dataset, splitting the dataset into
    centroids and non-centroids.

  • The vectors representing centroids are indexed in memory using
    Vespa’s support for vector indexing using
    Hierarchical Navigable Small World (HNSW).
    Searching 200M centroid vectors indexed with HNSW typically takes 2-3 milliseconds, single-threaded (depending on recall
    target and HNSW settings). Both the graph data structure and the vector data are stored in memory.

  • During indexing of vectors that are not cluster centroids,
    search for the k closest centroids in the HNSW graph of centroids and index the
    closest centroid ids using Vespa’s support for inverted indexes.
    Later, when the index build is complete, a search for a centroid id efficiently retrieves
    the closest non-centroid vector ids (see the sketch after this list).
    The inverted index consists of a dictionary of centroid ids pointing to
    posting lists of non-centroid vector ids. For a given billion-scale dataset with 20% centroids,
    the maximum centroid dictionary vocabulary size is 200M.

  • A non-centroid vector might be present in multiple centroid clusters.
    Instead of storing the vector data in the posting lists, the vector data
    is stored in a separate Vespa data structure, which avoids the duplication
    that storing the same vector data in multiple posting lists would cause.
    The Vespa posting list entry instead stores the closeness (inverted distance) of the vector to the centroid,
    scaled to an integer weight. Only the vector ids are duplicated across centroid posting lists,
    not the vector data itself. Vespa posting lists are compressed using standard techniques for
    lossless posting list compression.
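
A minimal sketch of the indexing flow for a non-centroid vector, as described in the list above (Python; centroid_hnsw and inverted_index are hypothetical stand-ins for the Vespa primitives, and the number of clusters per vector and the weight scaling are illustrative choices):

def index_non_centroid(doc_id, vector, centroid_hnsw, inverted_index, k=12):
    # 1) Find the k closest centroids in the in-memory HNSW graph of centroids;
    #    assume search returns [(centroid_id, closeness(vector, centroid)), ...].
    for centroid_id, closeness_to_centroid in centroid_hnsw.search(vector, k):
        # 2) Add the vector id to the centroid's posting list with an integer-scaled
        #    closeness weight; the vector data itself is stored separately (paged attribute).
        inverted_index.add(term=centroid_id, doc_id=doc_id, weight=int(closeness_to_centroid * 1000))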

Querying Vectors

For an input query vector, first search the vectors representing centroids, using HNSW, for the k closest centroids.
Next, using the retrieved k nearest centroids from HNSW search,
search the inverted index using logical disjunction (OR) of the centroid ids retrieved
by the HNSW graph search. The actual implementation uses the
Vespa dotProduct multivalued query operator.

Each node involved in the inverted index query ranks the retrieved non-centroid vectors by
calculating the distance between the vector and the query vector. Finally, the result of the two
searches is merged and returned.

The query serving flow can be optimized by two heuristics:

  • Cluster centroid dynamic pruning. After retrieving the k closest centroids from searching the HNSW graph,
    distant centroids (compared to the nearest centroid) can be pruned without significantly impacting recall.
    This distant centroid pruning heuristic reduces the number of seeks and reads
    for the inverted index query evaluation.
    The centroid pruning heuristic is dynamic; a query vector that retrieves
    many equally close centroids allows little pruning, while a query vector that retrieves
    centroids with larger distance differences might allow pruning many of them.

  • Retrieve using dynamic pruning. This heuristic sorts the retrieved vector ids by the
    closeness(q, centroid) * closeness(centroid, v) transitive closeness score, where q is the query vector and v is the document vector.
    This phase is implemented as a Vespa first-phase
    ranking phase. The closeness(centroid, v) weight is stored in the posting list, and the closeness(q, centroid)
    is passed as a query term weight
    with the dotProduct query operator. This heuristic enables limiting the number of vector page-ins by using Vespa’s support
    for controlling phased ranking.
    The local per-node second-phase ranking calculates the full-precision closeness(q, v), which involves
    paging the vector data into memory from disk. The maximum re-ranking depth is
    a query-time hyperparameter enabling easy experimentation (see the sketch after this list).
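
A condensed sketch of this query flow, covering both heuristics (Python; the helper objects, the pruning threshold, and the re-ranking depth are illustrative assumptions, not the sample application's actual code):

import numpy as np

def closeness(a, b):
    # Inverted distance, higher is closer.
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))

def query_hnsw_if(query, centroid_hnsw, read_posting_list, read_vector,
                  k_centroids=64, prune_factor=0.8, rerank_depth=1000, k=10):
    # Search the in-memory HNSW graph of centroids; assume search returns
    # [(centroid_id, closeness(query, centroid)), ...]. Prune centroids that are
    # distant relative to the closest one (heuristic 1).
    centroids = centroid_hnsw.search(query, k_centroids)
    best = max(c for _, c in centroids)
    kept = [(cid, c) for cid, c in centroids if c >= prune_factor * best]

    # First-phase: rank candidates by the transitive closeness score
    # closeness(q, centroid) * closeness(centroid, v) (heuristic 2).
    # read_posting_list yields (doc_id, stored closeness weight) pairs.
    first_phase = {}
    for centroid_id, closeness_q_c in kept:
        for doc_id, closeness_c_v in read_posting_list(centroid_id):
            score = closeness_q_c * closeness_c_v
            first_phase[doc_id] = max(first_phase.get(doc_id, 0.0), score)

    # Second-phase: page in vector data from disk only for the top candidates
    # and compute the full-precision closeness(q, v).
    candidates = sorted(first_phase, key=first_phase.get, reverse=True)[:rerank_depth]
    reranked = [(closeness(query, read_vector(doc_id)), doc_id) for doc_id in candidates]
    return sorted(reranked, reverse=True)[:k]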

Real-world considerations

Real-world applications using vector search need both batch and real-time vector indexing:

  • Batch indexing: An embedder model
    (for example, Data2Vec)
    that maps data to vector representation is trained, and embedding vector representations are produced for all known data items.
  • Incremental Real-time indexing: A new data item arrives and is encoded with the current version of the embedder model and needs to be indexed.

In addition, data items (with vector representation) need to be updated and deleted. The hybrid method
described in this blog post supports all CRUD (Create, Read, Update, Delete) operations using the standard Vespa
APIs.

  • Batch indexing with a new embedder model is handled by adding a model version field to the schema. Serving queries
    must restrict the search for vectors to the given model id using
    standard inverted index query evaluation and constrained vector search.
    See Query Time Constrained Approximate Nearest Neighbor Search and
    Embedding model hot swap.
    Having multiple active model versions increases the storage-related deployment cost linearly with the number of models.

  • New vectors using an existing embedding model are added as non-centroid vectors.
    As long as the ratio of centroids is large, one can expect to grow the vector volume significantly without
    noticeably degrading accuracy.

The only thing the application owner needs to consider is that deleting large numbers of centroid vectors
will negatively impact recall. For most large-scale vector use cases, this is not a real problem. If the use case requires
deleting many vector items, one can consider decoupling centroids from real vectors so that centroids
are dedicated cluster representatives and not vectors that are part of the dataset.

Vespa Experimental Setup

The following section describes our experiments with the Vespa HNSW-IF sample application using
Vespa Cloud’s performance environment.
The Vespa Cloud performance environment makes it easy to iteratively develop applications and choose the ideal instance types
for any size vector dataset.


Vespa Cloud Console – sample app deployment in Vespa Cloud perf environment in aws-us-east-1c region.


Vespa HNSW-IF serving architecture overview.

The Vespa HNSW-IF representation uses the same
Vespa document schema to represent centroid and non-centroid vectors.
They are differentiated using a single field of type bool.

Using two content clusters with the same document schema
enables using different instance types for the two vector types:

  • High memory instances with remote storage for the centroid vectors using in-memory HNSW.
  • Inexpensive low memory instances with fast local storage for the non-centroid vectors.

This optimizes the deployment and resource cost – the vectors indexed using HNSW
do not need fast local disks since queries never page vector data in from disk.
Similarly, for vectors indexed using the inverted file, the instances don’t
need a large amount of memory, but more memory can improve query performance
due to page caching.

The Vespa inverted index posting lists do not contain the vector data.
Instead, vector data is stored using Vespa paged tensor attributes,
a type of disk-backed memory-mapped forward index. The downside of not storing the vector
data in the postings file is that paging in a vector from disk for distance calculation requires one
additional disk seek. However, in our experience, locally attached SSD disks are rarely limited by random seeks,
but rather by GiB/s throughput bandwidth.

Vector Dataset

For our experiments with HNSW-IF, we use the 1B Microsoft SPACEV-1B vector dataset:

Microsoft SPACEV-1B is a new web search-related dataset
released by Microsoft Bing for this competition.
It consists of document and query vectors encoded by the
Microsoft SpaceV Superior model to capture generic intent representation.

The SPACEV-1B vector dataset

Improving Zero-Shot Ranking with Vespa Hybrid Search


Photo by Norbert
Braun on Unsplash

If you are planning to implement search functionality but have not
yet collected data from user
interactions to
train ranking models, where should you begin? In this series of
blog posts, we will examine the concept of zero-shot text ranking.
We implement a hybrid ranking method using Vespa and evaluate it on a
large set of text relevancy datasets in a zero-shot setting.

In the first post, we will discuss the distinction between in-domain
and out-of-domain (zero-shot) ranking and present the BEIR benchmark.
Furthermore, we highlight situations where in-domain embedding
ranking effectiveness does not carry over to a different domain in
a zero-shot setting.

Introduction

Pre-trained neural language models, such as BERT, fine-tuned for
text ranking, have demonstrated remarkable effectiveness compared
to baseline text ranking methods when evaluating the models in-domain.
For example, in the Pretrained Transformer Language Models for
Search
blog post series, we described three methods for using pre-trained
language models for text ranking, which all outperformed the
traditional lexical matching baseline
(BM25).

In the transformer ranking blog series,
we used in-domain data for training and production (test), and the
documents and the queries were drawn from the same in-domain data
distribution.

MS MARCO Results

In-domain trained and evaluated ranking methods. All models are
end-to-end represented using Vespa, open-sourced in the msmarco-ranking
sample
app.
The Mean Reciprocal Rank
(MRR@10) is
reported for the dev query split of the MS MARCO passage ranking
dataset.

This blog post looks at zero-shot text ranking, taking a ranking
model and applying it to new domains without adapting or fine-tuning
the model.

Zero-shot overview
In-domain training and inference (ranking) versus zero-shot inference
(ranking) with a model trained in a different domain.

Information Retrieval Evaluation

Information retrieval (IR) evaluation is the process of measuring
the effectiveness of an information retrieval system. Measuring
effectiveness is important because it allows us to compare different
ranking strategies and identify the most effective at retrieving
relevant information.

We need a corpus of documents and a set of queries with relevance
judgments to perform an IR evaluation. We can start experimenting
and evaluating different ranking methods using standard IR metrics
such as nDCG@10, Precision@10, and Recall@100. These IR metrics
allow us to reason about the strengths and weaknesses of proposed
ranking models, especially if we can evaluate the model in multiple
domains or relevance datasets. Evaluation like this contrasts with the
industry’s most commonly used IR evaluation metric, LGTM (Looks
Good To Me)@10, measured over a small number of queries.

Evaluating ranking models in a zero-shot setting

In BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of
Information Retrieval Models,
Thakur et al. introduce a benchmark for assessing text ranking
models in a zero-shot setting.

The benchmark includes 18 diverse datasets sampled from different
domains and task definitions. All BEIR datasets have relevance
judgments with varying relevance grading resolutions. For example,
TREC-COVID, a dataset in the BEIR benchmark, consists of 50 test
queries with graded relevance labels and, on average, 493.5 document
judgments per query. On the other hand, the BEIR Natural Questions
(NQ) dataset uses binary relevance labels, with 4352 test queries
and, on average, 1.2 judgments per query.

The datasets included in BEIR also have a varying number of documents,
queries, and document lengths, but all the datasets are monolingual
(English).

BEIR overview
Statistics of the BEIR datasets. Table from BEIR: A Heterogeneous
Benchmark for Zero-shot Evaluation of Information Retrieval
Models; see also
BEIR Benchmark
datasets.

The BEIR benchmark uses the Normalized Discounted Cumulative Gain
(nDCG@10) ranking metric. The nDCG@10 metric handles both datasets
with binary (relevant/not-relevant) and graded relevance judgments.
Since not all the datasets are available in the public domain (e.g.,
Robust04), it’s common to report nDCG@10 on the 13 datasets that
can be downloaded from the BEIR GitHub repository
or
using the wonderful ir_datasets
library. It’s also possible to aggregate the reported nDCG@10 metrics
per dataset to obtain an overall nDCG@10 score, for example, using
the average across the selected BEIR datasets. It’s important to
note which datasets are included in the average overall score, as
they differ significantly in retrieval difficulty.
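
For reference, nDCG@10 can be computed in a few lines of Python. This is a standard formulation with linear gain; evaluation toolkits can differ in details such as the gain function:

import math

def dcg_at_k(relevances, k=10):
    # relevances: graded relevance labels of the retrieved documents, in ranked order.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example with binary labels: two of the top three retrieved documents are relevant.
print(round(ndcg_at_k([1, 0, 1], [1, 1, 1, 0, 0], k=10), 3))  # 0.704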

Zero-Shot evaluation of models trained on Natural Questions

The most common BEIR experimental setup uses the MS MARCO labels
to train models and apply the models in a zero-shot setting on the
BEIR datasets. The simple reason for this setup is that MS MARCO
is the largest relevance dataset in the public domain, with more
than ½ million training queries. As with NQ, there are few relevant
passages per query, 1.1 on average. Another setup variant
we highlight in detail in this section is to use a ranking model
trained on Natural Questions (NQ) labels with about 100K training
queries and evaluate it in an out-of-domain setting on MS MARCO
labels.

MS MARCO and Natural Questions datasets have fixed document corpora,
and queries are split into train and test. We can train a ranking
model using the train queries and evaluate the ranking method on
the test set. Both datasets are monolingual (English) and have user
queries formulated as natural questions.

MS MARCO sample queries

how many years did william bradford serve as governor of plymouth colony?

define preventive

color overlay photoshop

Natural Questions (NQ) sample queries

what is non controlling interest on balance sheet

how many episodes are in chicago fire season 4

who sings love will keep us alive by the eagles

On the surface, these two datasets are similar. Still, NQ has longer
queries and documents compared to MS MARCO. There are also subtle
differences in how these datasets were created. For example, NQ
uses passages from Wikipedia only, while MS MARCO is sampled from
web search results.

Statistics/Dataset      MS MARCO    Natural Questions (NQ)
query length            5.9         9.2
document length         56.6        76.0
documents               8.84M       2.68M

The above table summarizes basic statistics of
the two datasets. Words are counted after simple space tokenization.
Query lengths are calculated using the dev/test splits. Both
datasets have train splits with many queries to train the ranking
model.

In open-domain question answering with
Vespa,
we described the Dense Passage Retriever (DPR) model, which uses
the Natural Questions dataset to train a dense 768-dimensional
vector representation of both queries and Wikipedia paragraphs. The
Wikipedia passages are encoded using the DPR model, representing
each passage as a dense vector. The Wikipedia passage vector
representations can be indexed and efficiently searched using an
approximate nearest neighbor
search.

At query time, the text query is encoded with the DPR model into a dense
vector representation used to search the vector index. This ranking
model is an example of dense retrieval over vector text
representations
from BERT. DPR was one of the first dense retrieval methods that
outperformed the BM25 baseline significantly on NQ. Since then,
much water has passed under the bridge, and dense vector models are
closing in on more computationally expensive cross-encoders in an
in-domain setting on MS MARCO.

In-domain effectiveness versus out-of-domain in a zero-shot setting

The DPR model trained on NQ labels outperforms the BM25 baseline
when evaluated on NQ. This is an example where the in-domain
application of the trained model improves the ranking accuracy over
baseline BM25.

in-domain

In-domain evaluation of the Dense Passage Retriever (DPR). DPR is
an example of Embedding Based Retrieval (EMB).

Suppose we use the DPR model trained on NQ (and other question-answering
datasets) and apply the model on MS MARCO. Then we can say something
about generalization in a zero-shot setting on MS MARCO.

out-of-domain

Out-of-domain evaluation of the Dense Passage Retriever (DPR). DPR
is an example of Embedding Based Retrieval (EMB). In this zero-shot
setting, the DPR model underperforms the BM25 baseline.

This case illustrates that in-domain effectiveness does not necessarily
transfer to an out-of-domain zero-shot application of the model.
Generally, as observed on the BEIR dense
leaderboard,
dense embedding models trained on NQ labels underperform the BM25
baseline across almost all BEIR datasets.

Summary

In this blog post, we introduced zero-shot and out-of-domain IR
evaluation. We also introduced the important BEIR benchmark.
Furthermore, we highlighted a case study of the DPR model and its
generalization when applied out-of-domain in a zero-shot setting.

We summarize this blog post with the following quote from the
BEIR paper:

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets.

Next blog post in this series

In the next post in this series on zero-shot ranking, we introduce a
hybrid ranking model that combines multi-vector representations with BM25.
This hybrid model overcomes the limitations of single-vector embedding models,
and we demonstrate its effectiveness in a zero-shot setting on the BEIR benchmark.

Improving Zero-Shot Ranking with Vespa Hybrid Search – part two


Photo by Tamarcus Brown
on Unsplash

Where should you begin if you plan to implement search functionality
but have not yet collected data from user
interactions to
train ranking models?

In the first
post in
the series, we introduced the difference between in-domain and
out-of-domain (zero-shot) ranking. We also presented the BEIR
benchmark and highlighted cases where in-domain effectiveness does
not transfer to another domain in a zero-shot setting.

In this second post in this series, we introduce and evaluate three
different Vespa ranking methods on the
BEIR benchmark in a zero-shot
setting. We establish a new and strong BM25 baseline for the BEIR
dataset, which outperforms previously reported BM25 results. We
then show how a unique hybrid approach, combining a neural ranking
method with BM25, outperforms other evaluated methods on 12 out of
13 datasets on the BEIR benchmark. We also compare the effectiveness
of the hybrid ranking method with emerging few-shot methods that
generate in-domain synthetic training data via prompting large
language models (LLMs).

Establishing a strong baseline

In the BEIR paper,
the authors find that BM25
is a strong generalizable baseline text ranking model. Many, if not
most, of the dense single vector embedding models trained on MS
MARCO labels are outperformed by BM25 in an out-of-domain setting.
Quote from BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation
of Information Retrieval
Models:

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets.

What is interesting about reporting BM25 baselines is that there
are multiple implementations, variants, and performance tweaks, as
demonstrated in Which BM25 Do You Mean? A Large-Scale Reproducibility
Study of Scoring
Variants.
Unfortunately, various papers have reported conflicting results for BM25 on
the same BEIR benchmark datasets. The BM25 effectiveness can vary due to different
hyperparameters and different linguistic processing methods used in different system implementations,
such as removing stop words, stemming, and tokenization. Furthermore, researchers want to contrast their proposed ranking
approach with a baseline ranking method. It could be tempting to
report a weak BM25 baseline, which makes the proposed ranking method
stand out better.

Several serving systems implement BM25 scoring, including Vespa.
Vespa’s lexical or sparse retrieval is also accelerated using the
weakAnd Vespa query
operator.
This is important because implementing a BM25 scoring function in
a system is trivial, but scoring all documents that contain at
least one of the query terms approaches linear complexity. Dynamic
pruning algorithms like weakAnd improve
retrieval efficiency significantly compared to naive
brute-force implementations that score all documents matching any of the query terms.

BM25 has two
hyperparameters, k1 and b, which impact ranking effectiveness.
Additionally, most (14 out of 18) of the BEIR datasets have both
title and text document fields, whose relative importance is one of
the first things a seasoned search practitioner would tune in a real
production environment. In our BM25 baseline,
we configure Vespa to independently calculate the
BM25 score of both
title and text, and we combine the two BM25 scores
linearly. The complete Vespa rank
profile is given below.

rank-profile bm25 inherits default {
   first-phase {
      expression: bm25(title) + bm25(text)
   }
   rank-properties {
      bm25(title).k1: 0.9
      bm25(title).b: 0.4
      bm25(text).k1: 0.9
      bm25(text).b: 0.4
   }
}

We modify the BM25 k1 and b parameters but use the same parameters for
both fields. The values align with Anserini
defaults
(k1=0.9, b=0.4).
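
For readers unfamiliar with these parameters, the sketch below shows one common BM25 formulation (implementations differ, as discussed above; this is an illustration, not Vespa's exact implementation): k1 controls term-frequency saturation, and b controls document-length normalization.

import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=0.9, b=0.4):
    # Contribution of one query term occurring tf times in a document of length doc_len,
    # where df is the number of documents containing the term.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

The document score is the sum of these term scores over the query terms; in the rank profile above, this sum is computed independently for the title and text fields and then added.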

The following table reports nDCG@10 scores on a subset (13) of the
BEIR benchmark datasets. We exclude the four datasets that are not
publicly available. We also exclude the BEIR CQADupStack dataset
because it consists of 12 sub-datasets where the overall nDCG@10
score is found by averaging each
sub-dataset’s
nDCG@10 score. Adding these sub-datasets would significantly increase
the evaluation effort.

BEIR Dataset                    BM25 from BEIR Paper    Vespa BM25
MS MARCO                        0.228                   0.228
TREC-COVID                      0.656                   0.690
NFCorpus                        0.325                   0.313
Natural Questions (NQ)          0.329                   0.327
HotpotQA                        0.603                   0.623
FiQA-2018                       0.236                   0.244
ArguAna                         0.315                   0.393
Touché-2020 (V2)                0.367                   0.413
Quora                           0.789                   0.761
DBPedia                         0.313                   0.327
SCIDOCS                         0.158                   0.160
FEVER                           0.753                   0.751
CLIMATE-FEVER                   0.213                   0.207
SciFact                         0.665                   0.673
Average (excluding MS MARCO)    0.440                   0.453

The table summarizes the BM25 nDCG@10 results. Vespa BM25 versus
BM25 from BEIR paper.

The table above shows that the Vespa BM25 implementation establishes a strong baseline,
outperforming the BM25 results reported in the BEIR paper on average (0.453 versus 0.440 nDCG@10).

Evaluating Vespa ranking models in a zero-shot setting

With the new strong BM25 baseline established in the above section, we
will now introduce two neural ranking models and compare their performance with the baseline.

Vespa ColBERT

We have previously described the Vespa ColBERT implementation in
this blog
post,
and we use the same model
weights in this
work. The Vespa ColBERT model is based on a distilled 6-layer MiniLM
model with 22M parameters, using quantized int8 weights (post-training
quantization). The model uses only 32 vector dimensions per query
and document wordpiece,
in contrast to the original ColBERT model, which
uses 128 dimensions. Furthermore, we use Vespa’s support for
bfloat16
to reduce the per-dimension storage usage from 4 bytes
with float to 2 bytes with bfloat16. We configure the maximum query
length to 32 wordpieces, and maximum document length to 180
wordpieces. Both maximum length parameters align with the training and experiments
on MS MARCO.

The ColBERT MaxSim scoring is implemented as a re-ranking model
using Vespa phased ranking,
re-ranking the top 2K hits ranked by BM25. We also compute and store
the title term embeddings for datasets with titles, meaning we have
two MaxSim scores for datasets with titles. We use a linear combination
to combine the title and text MaxSim scores.
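
Conceptually, MaxSim computes, for each query token vector, the maximum dot product against all document token vectors and sums these maxima over the query tokens. A minimal numpy illustration of the scoring function (not the Vespa tensor expression itself, which is shown below):

import numpy as np

def maxsim(query_embeddings, doc_embeddings):
    # query_embeddings: (num_query_tokens, 32), doc_embeddings: (num_doc_tokens, 32)
    similarities = query_embeddings @ doc_embeddings.T   # all query/document token dot products
    return float(similarities.max(axis=1).sum())         # best document token per query token, summed

query = np.random.rand(32, 32).astype(np.float32)        # up to 32 query wordpieces, 32 dims each
document = np.random.rand(180, 32).astype(np.float32)    # up to 180 document wordpieces
print(maxsim(query, document))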

The complete Vespa rank profile
is given below.

rank-profile colbert inherits bm25 {
   inputs {
      query(qt) tensor<float>(qt{}, x[32])
      query(title_weight): 0.5
   }
   second-phase {
      rerank-count: 2000
      expression {
         (1 - query(title_weight)) * sum(
            reduce(
               sum(
                  query(qt) * cell_cast(attribute(dt), float), x
               ),
               max, dt
            ),
            qt
         ) +
         query(title_weight) * sum(
            reduce(
               sum(
                  query(qt) * cell_cast(attribute(title_dt), float), x
               ),
               max, dt
            ),
            qt
         )
      }
   }
}

The per wordpiece ColBERT vectors are stored in Vespa using Vespa’s
support for storing and computing over
tensors.

Note: Users
can also trade efficiency against cost by storing the tensors on
disk rather than in memory, using
paging
options. Paging is highly efficient in a re-ranking pipeline, as
only a few thousand tensor values are potentially paged in on demand.

Vespa Hybrid ColBERT + BM25

There are several ways to combine the ColBERT MaxSim with BM25,
including reciprocal rank
fusion (RRF),
which does not consider the model scores, only the ordering (ranking) the
scores produce. Quote from Reciprocal Rank Fusion outperforms Condorcet and
individual Rank Learning Methods:

RRF is simpler and more effective than Condorcet Fuse, while sharing
the valuable property that it combines ranks without regard to the
arbitrary scores returned by particular ranking methods

Another approach is to combine the model scores into a new score
to produce a new ranking. We use a linear combination in this work
to compute the hybrid score. Like the ColBERT-only model, we use
BM25 as the first-phase ranking model and only calculate the hybrid
score for the global top-ranking K documents from the BM25 model.

Before combining the scores, we want to normalize both the unbounded
BM25 score and the bounded ColBERT score. Normalization is accomplished by
simple max-min
scaling of the scores. With max-min scaling, scores from any ranking
model are scaled from 0 to 1. This makes it easier to combine the
two using relative weighting.

Since scoring in a production serving system might be spread across
multiple nodes, each node involved in the query will not know the
global max scores. We solve this problem by letting Vespa content
nodes involved in the query return both scores using Vespa
match-features.

A custom searcher
is injected in the query dispatching stateless Vespa service. This
searcher calculates the max and min for both model scores using
match features for hits within the window of global top-k hits
ranked by BM25. As with the ColBERT rank profile, we use a re-ranking
window of 2000 hits, but we perform feature-score scaling and
re-ranking in a stateless custom searcher instead of on the content
nodes.
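
A small Python sketch of what this stateless searcher does conceptually; the feature names mirror the bm25 and colbert_maxsim match-features in the rank profile below, and the 50/50 weighting is an illustrative choice:

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def rescore(hits, colbert_weight=0.5):
    # hits: the global top-ranked (2000) documents by BM25, each with its raw
    # "bm25" and "colbert_maxsim" match-feature scores.
    bm25_scaled = min_max([h["bm25"] for h in hits])
    colbert_scaled = min_max([h["colbert_maxsim"] for h in hits])
    for hit, b, c in zip(hits, bm25_scaled, colbert_scaled):
        hit["hybrid_score"] = (1 - colbert_weight) * b + colbert_weight * c
    return sorted(hits, key=lambda h: h["hybrid_score"], reverse=True)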

The complete Vespa rank profile is given below. Notice the
match-features, which are returned with each hit to the stateless
searcher
(implementation),
which performs the normalization and re-scoring. The first-phase
scoring function is inherited from the previously described bm25
rank profile.

rank-profile hybrid-colbert inherits bm25 {
   function bm25() {
      expression: bm25(title) + bm25(text)
   }

   function colbert_maxsim() {
      expression {
         2 * sum(
            reduce(
               sum(
                  query(qt) * cell_cast(attribute(dt), float), x
               ),
               max, dt
            ),
            qt
         ) +
         sum(
            reduce(
               sum(
                  query(qt) * cell_cast(attribute(title_dt), float), x
               ),
               max, dt
            ),
            qt
         )
      }
   }
   match-features {
      bm25
      colbert_maxsim
   }
}

Results and analysis

As with the BM25 baseline model, we index one of the BEIR datasets
at a time on a Vespa instance and evaluate the models. The following
table summarizes the results. All numbers are nDCG@10. The
best-performing model score per dataset is marked with an asterisk (*).

BEIR Dataset                            Vespa BM25    Vespa ColBERT    Vespa Hybrid
MS MARCO (in-domain)                    0.228         0.401*           0.344
TREC-COVID                              0.690         0.658            0.750*
NFCorpus                                0.313         0.304            0.350*
Natural Questions (NQ)                  0.327         0.403            0.404*
HotpotQA                                0.623         0.298            0.632*
FiQA-2018                               0.244         0.252            0.292*
ArguAna                                 0.393         0.286            0.404*
Touché-2020 (V2)                        0.413         0.315            0.415*
Quora                                   0.761         0.817            0.826*
DBPedia                                 0.327         0.281            0.365*
SCIDOCS                                 0.160         0.107            0.161*
FEVER                                   0.751         0.534            0.779*
CLIMATE-FEVER                           0.207*        0.067            0.191
SciFact                                 0.673         0.403            0.679*
Average nDCG@10 (excluding MS MARCO)    0.453         0.363            0.481*

The table summarizes the nDCG@10 results per dataset. Note that MS MARCO is in-domain for ColBERT and Hybrid.
Average nDCG@10 is only computed for zero-shot and out-of-domain datasets.

As shown in the table above, in an in-domain setting on MS MARCO,
the Vespa ColBERT model outperforms the BM25
baseline significantly. The resulting nDCG@10 score aligns with reported MRR@10
results from previous work using
ColBERT
in-domain on MS MARCO. However, mixing in the BM25 baseline via the hybrid model on the MS MARCO evaluation
hurts the nDCG@10 score, as we combine two models where the unsupervised BM25
model is significantly weaker than the ColBERT model.

The Vespa ColBERT model underperforms BM25 on out-of-domain datasets,
especially CLIMATE-FEVER. The CLIMATE-FEVER dataset has very long
queries (avg 20.2 words). The long questions challenge the ColBERT
model, configured with a max query length of 32 wordpieces in the
experimental setup. Additionally, the Vespa ColBERT model underperforms
reported results for the full-sized ColBERT
V2 model using 110M parameters
and 128 dimensions. This result could indicate that the dimension
compression and the model distillation have a more
significant negative impact when applied in a zero-shot setting
compared to in-domain.

These exceptions aside, the data shows that the unique hybrid Vespa ColBERT and BM25 combination is highly
effective, performing the best on 12 of 13 datasets. Its average
nDCG@10 score improves from 0.453 to 0.481 compared to the strong
Vespa BM25 baseline.

To reproduce the results of this benchmark,
follow the open-sourced
instructions.

Comparing hybrid zero-shot with few-shot methods

To compare the hybrid Vespa ranking performance with other models,
we include the results reported in Promptagator: Few-shot Dense
Retrieval From 8 Examples from
Google Research.

Generating synthetic training data in-domain via prompting LLMs is an
emerging Information Retrieval (IR) trend, also described in
InPars: Data Augmentation for Information Retrieval using Large Language
Models.

The basic idea is to “prompt” a large language model (LLM) to generate synthetic queries
for use in training in-domain ranking models. A typical prompt includes a few
examples of queries and relevant documents; the LLM is then “asked”
to generate synthetic queries for many of the documents in the corpus.
The generated synthetic query-document pairs can be used to train neural ranking models.
We include a quote describing the approach from the Promptagator paper: