Photo by Norbert
Braun on Unsplash
If you are planning to implement search functionality but have not
yet collected data from user
train ranking models, where should you begin? In this series of
blog posts, we will examine the concept of zero-shot text ranking.
We implement a hybrid ranking method using Vespa and evaluate it on a
large set of text relevancy datasets in a zero-shot setting.
In the first post, we will discuss the distinction between in-domain
and out-of-domain (zero-shot) ranking and present the BEIR benchmark.
Furthermore, we highlight situations where in-domain embedding
ranking effectiveness does not carry over to a different domain in
a zero-shot setting.
Pre-trained neural language models, such as BERT, fine-tuned for
text ranking, have demonstrated remarkable effectiveness compared
to baseline text ranking methods when evaluating the models in-domain.
For example, in the Pretrained Transformer Language Models for
blog post series, we described three methods for using pre-trained
language models for text ranking, which all outperformed the
traditional lexical matching baseline
In the transformer ranking blog series,
we used in-domain data for training and production (test), and the
documents and the queries were drawn from the same in-domain data
In-domain trained and evaluated ranking methods. All models are
end-to-end represented using Vespa, open-sourced in the msmarco-ranking
The Mean Reciprocal Rank
reported for the dev query split of the MS MARCO passage ranking
This blog post looks at zero-shot text ranking, taking a ranking
model and applying it to new domains without adapting or fine-tuning
In-domain training and inference (ranking) versus zero-shot inference
(ranking) with a model trained in a different domain.
Information Retrieval Evaluation
Information retrieval (IR) evaluation is the process of measuring
the effectiveness of an information retrieval system. Measuring
effectiveness is important because it allows us to compare different
ranking strategies and identify the most effective at retrieving
We need a corpus of documents and a set of queries with relevance
judgments to perform an IR evaluation. We can start experimenting
and evaluating different ranking methods using standard IR metrics
such as nDCG@10, Precision@10, and Recall@100. These IR metrics
allow us to reason about the strengths and weaknesses of proposed
ranking models, especially if we can evaluate the model in multiple
domains or relevance datasets. Evaluation like this contrasts the
industry’s most commonly used IR evaluation metric, LGTM (Looks
Good To Me)@10, for a small number of queries.
Evaluating ranking models in a zero-shot setting
In BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of
Information Retrieval Models,
Thakur et al. introduce a benchmark for assessing text ranking
models in a zero-shot setting.
The benchmark includes 18 diverse datasets sampled from different
domains and task definitions. All BEIR datasets have relevance
judgments with varying relevance grading resolutions. For example,
TREC-COVID, a dataset of the BEIR benchmark, consists of 50 test
queries, with many graded relevance labels, on average 493,5 document
judgments per query. On the other hand, the BEIR Natural Questions
(NQ) dataset uses binary relevance labels, with 4352 test queries,
with, on average, 1.2 judgments per query.
The datasets included in BEIR also have a varying number of documents,
queries, and document lengths, but all the datasets are monolingual
Statistics of the BEIR datasets. Table from BEIR: A Heterogeneous
Benchmark for Zero-shot Evaluation of Information Retrieval
Models; see also
The BEIR benchmark uses the Normalised Cumulative Discount
(nDCG@10) ranking metric. The nDCG@10 metric handles both datasets
with binary (relevant/not-relevant) and graded relevance judgments.
Since not all the datasets are available in the public domain (E.g.,
Robust04), it’s common to report nDCG@10 on the 13 datasets that
can be downloaded from the BEIR GitHub repository
using the wonderful ir_datasets
library. It’s also possible to aggregate the reported nDCG@10 metrics
per dataset to obtain an overall nDCG@10 score, for example, using
the average across the selected BEIR datasets. It’s important to
note which datasets are included in the average overall score, as
they differ significantly in retrieval difficulty.
Zero-Shot evaluation of models trained on Natural Questions
The most common BEIR experimental setup uses the MS MARCO labels
to train models and apply the models in a zero-shot setting on the
BEIR datasets. The simple reason for this setup is that MS MARCO
is the largest relevance dataset in the public domain, with more
than ½ million training queries. As with NQ, there are few, an
average of 1.1, relevant passages per query. Another setup variant
we highlight in detail in this section is to use a ranking model
trained on Natural Questions(NQ) labels with about 100K training
queries and evaluate it in an out-of-domain setting on MS MARCO
MS MARCO and Natural Questions datasets have fixed document corpora,
and queries are split into train and test. We can train a ranking
model using the train queries and evaluate the ranking method on
the test set. Both datasets are monolingual (English) and have user
queries formulated as natural questions.
MS MARCO sample queries
how many years did william bradford serve as governor of plymouth colony?
color overlay photoshop
Natural Questions (NQ) sample queries
what is non controlling interest on balance sheet
how many episodes are in chicago fire season 4
who sings love will keep us alive by the eagles
On the surface, these two datasets are similar. Still, NQ has longer
queries and documents compared to MS MARCO. There are also subtle
differences in how these datasets were created. For example, NQ
uses passages from Wikipedia only, while MS MARCO is sampled from
web search results.
|Statistics/Dataset||MS MARCO||Natural Questions (NQ)|
The above table summarizes basic statistics of
the two datasets. Words are counted after simple space tokenization.
Query lengths are calculated using the dev/test splits. Both
datasets have train splits with many queries to train the ranking
In open-domain question answering with
we described the Dense Passage Retriever (DPR) model, which uses
the Natural Questions dataset to train a dense 768-dimensional
vector representation of both queries and Wikipedia paragraphs. The
Wikipedia passages are encoded using the DPR model, representing
each passage as a dense vector. The Wikipedia passage vector
representations can be indexed and efficiently searched using an
approximate nearest neighbor
At query time, the text query is encoded with the DPR model into a dense
vector representation used to search the vector index. This ranking
model is an example of dense retrieval over vector text
from BERT. DPR was one of the first dense retrieval methods that
outperformed the BM25 baseline significantly on NQ. Since then,
much water has flown down the river, and dense vector models are
closing in on more computationally expensive cross-encoders in an
in-domain setting on MS MARCO.
In-domain effectiveness versus out-of-domain in a zero-shot setting
The DPR model trained on NQ labels outperforms the BM25 baseline
when evaluated on NQ. This is an example where the in-domain
application of the trained model improves the ranking accuracy over
In-domain evaluation of the Dense Passage Retriever (DPR). DPR is
an example of Embedding Based Retrieval (EMB).
Suppose we use the DPR model trained on NQ (and other question-answering
datasets) and apply the model on MS MARCO. Then we can say something
about generalization in a zero-shot setting on MS MARCO.
Out-of-domain evaluation of the Dense Passage Retriever (DPR). DPR
is an example of Embedding Based Retrieval (EMB). In this zero-shot
setting, the DPR model underperforms the BM25 baseline.
This case illustrates that in-domain effectiveness does not necessarily
transfer to an out-of-domain zero-shot application of the model.
Generally, as observed on the BEIR dense
dense embeddings models trained on NQ labels underperform the BM25
baseline across almost all BEIR datasets.
In this blog post, we introduced zero-shot and out-of-domain IR
evaluation. We also introduced the important BEIR benchmark.
Furthermore, we highlighted a case study of the DPR model and its
generalization when applied out-of-domain in a zero-shot setting.
We summarize this blog post with the following quote from the
In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
Next blog post in this series
In the next post in this series on zero-shot ranking, we introduce a
hybrid ranking model, a model which combines multi-vector representations with BM25.
This hybrid model overcomes the limitations of single-vector embedding models,
and we prove its effectiveness in a zero-shot setting on the BEIR benchmark.