Improving Product Search with Learning to Rank – part one


Photo by Pawel Czerwinski on Unsplash

Over a few busy years, we have witnessed a neural ranking paradigm shift, where
pre-trained language models, such as BERT,
have revolutionized text ranking
in data-rich settings with lots of labeled data.
Feeding the data hungry neural networks lots of labeled relevance judgments
has proved more effective than statistical (tabular) text ranking feature-based methods.

However, historically, tree-based machine learning models
such as Gradient Boosting (GB)
with lambdarank loss have prevailed as state-of-the-art for product search ranking.
In this blog post series, we look at improving product search using learning to rank techniques,
with demonstrable results on the largest publicly available product ranking dataset.
We will train and measure their effectiveness on a large dataset, but
without getting into the details of complex loss functions.

In this first post, we introduce a product ranking dataset and establish multiple ranking baselines
that do not use the labeled relevance judgments in the dataset.
These ranking models are applied in a zero-shot setting, meaning they have not been trained using
domain-specific relevancy labels. The methods include traditional lexical
ranking like BM25 and other Vespa native text ranking features, semantic
off-the-shelf vector search models, and hybrid combinations of sparse and dense
methods. All of the ranking methods are represented end-to-end in Vespa.


In this series, we use a large shopping queries
dataset released by Amazon:

We introduce the “Shopping Queries Data Set”, a large dataset of complex search
queries released to foster research in semantic matching of queries and
products. For each query, the dataset provides a list of up to 40 potentially
relevant results, together with ESCI relevance judgments (Exact, Substitute,
Complement, Irrelevant) indicating the relevance of the product to the query.
Each query-product pair is accompanied by additional information. The dataset is
multilingual, containing queries in English, Japanese, and Spanish.

The dataset consists of a large number of questions, and where each query has,
on average, 20 graded query-item relevance judgments:

  • Exact (E): the item is relevant to the query and satisfies all the query
    specifications (e.g., water bottle matching all attributes of a query “plastic
    water bottle 24oz”, such as material and size)

  • Substitute (S): the item is somewhat relevant: it fails to fulfill some
    aspects of the query, but the item can be used as a functional substitute (e.g.,
    fleece for a “sweater” query)

  • Complement (C): the item does not fulfill the query but could be used in
    combination with an exact item (e.g., track pants for “running shoe” query)

  • Irrelevant (I): the item is irrelevant, or it fails to fulfill a central
    aspect of the query (e.g., socks for a “pant” query)

As with most ranking datasets, the dataset is split into a train and test part,
and models can be trained using the train split and evaluated on the test split.
In our work, we focus on the English queries in both splits.

English Train Split

The train split contains 20,888 queries with 419,653 query-item judgments. This
split can be used to train models, using learning to rank techniques. The train
set will be the focus of upcoming blog posts.

English Test Split

In this blog post we focus on the test split, and it contains 8,956 queries with
181,701 query-item judgments.

The judgment label distribution is as follows:

  • Exact (Relevant) 79,708 (44%)
  • Substitute (Somewhat Relevant) 8,099 (4%)
  • Complement 63,563 (35%)
  • Irrelevant 30,331 (17%)

Both splits have about 20 product judgments per query. As we can see from the label
distribution, the dataset has many Exact (relevant) judgments, especially
compared to Substitutes (somewhat relevant) and Irrelevant. On average, there
are about 9 relevant products for every query in the test split.

This product relevance dataset is the largest available product relevance
dataset and allows us to compare the effectiveness of multiple ranking models
using graded relevance judgments. A large dataset allows us to report
information retrieval ranking metrics and not anecdotical LGTM@10 (Looks Good To Me) metrics.

We can evaluate ranking models used in a
zero-shot setting without using the in-domain relevance labels and models
trained using learning to rank techniques, exploiting the relevance labels in the train split.

Product ranking example

The above image shows query #535 in the test split with a total of 16 product judgements.
The task is to optimize the ordering (ranking) so that products labeled as exact are ordered
before supplements and supplements before irrelevant.

Product ranking example
The above image shows perfect ranking of products for query #535.
Note that we have converted the textual labels (esci_label) to numeric labels using
E=4, S=3, C=2, I=1. If our ranking function is able to produce this ordering, we
would get a perfect score for this query.

Indexing the dataset with Vespa

We use the following Vespa document
schema, which allows us to index all the
products associated with English queries across both splits.
We use a utility script that converts the parquet product data file to a
Vespa JSON
formatted feed file. In total we index about 1.2M products in Vespa.

schema product {

    document product {

        field id type string {
            indexing: summary | index 

        field title type string {
            indexing: summary | index
            index: enable-bm25

        field description type string {
            indexing: summary | index
            index: enable-bm25
            weight: 200

        field bullets type string {
            indexing: summary | index
            index: enable-bm25
            weight: 200

        field brand type string {
            indexing: summary | index | attribute

        field color type string {
            indexing: summary | index | attribute


    field title_tokens type tensor<float>(d0[32]) {
        indexing: input title | embed tokenizer | attribute

    field description_tokens type tensor<float>(d0[128]) {
        indexing: input description | embed tokenizer | attribute

    field embedding type tensor<float>(d0[384]) {
        indexing {
            input title . input description | embed transformer | attribute | summary | index
        attribute {
            distance-metric: angular 
    fieldset default {
        fields: title, description, bullets, brand  

The product data contains title, description, brand, color and bullets. We use
Vespa’s support for encoding text into vector embeddings, in this case
we use the title and description as input.
See text embedding made simple
for details on how to embed text embedding models in Vespa.


The official dataset evaluation metric is
NDCG (Normalized
Discounted Cumulative Gain), a precision-oriented metric commonly used for
ranking datasets with graded relevance judgments. An important observation is
that the task only considers ranking of the products that have judgment
labels. In other words, we assume that a magic retriever has
retrieved all relevant documents (plus irrelevant documents), and our task is to
re-rank the products so that the relevant ones are pushed to the top of the
ranked list of products. This is a simpler task than implementing end-to-end
retrieval and ranking over the 1.2M products.

Ranking overview

Many deployed search systems need to hedge the deployment cost and deploy
multi-phased retrieval and ranking
pipelines to reduce computational complexity.
In a retrieval and ranking funnel, one or many diverse retrievers use a
cost-efficient retrieval ranking phase, and subsequent ranking phases re-rank
the documents, focusing on precision. In a later post in this series, we look at
how to train a retriever function using learning to retrieve and how to
represent retrieval and ranking phases in Vespa.

Baseline zero-shot ranking models

We propose and evaluate seven different baseline ranking models, all represented end-to-end in
Vespa using Vespa’s ranking framework.


Obviously not the greatest baseline, but allows us to compare other baselines with
random ranking. We use Vespa’s random

A pseudorandom number in the range [0,1] which is drawn once per document during rank evaluation.

rank-profile random inherits default {
    first-phase {
        expression: random 

BM25 – lexical ranking

BM25 is a tried and true
zero-shot text ranking model. BM25 only has two parameters and is the ranking
function many turn to when there are no explicit relevancy labels or implicit
click feedback to train ranking models. The BM25 scoring function can also be
accelerated for efficient retrieval using dynamic pruning algorithms like

rank-profile bm25 inherits default {
    first-phase {
        expression: bm25(title) + bm25(description) 

Note that we only use the title and description. All our lexical baselines
uses only these two fields.

Vespa nativeRank – lexical ranking

Vespa’s nativeRank is a
lexical ranking function and the default text ranking function in Vespa. It has
similar characteristics as BM25 but also considers the proximity between matched
terms in the document. Unlike BM25, which has an unbound score range, Vespa’s
nativeRank is normalized to a score in the range [0,1].

rank-profile nativeRank inherits default {
    first-phase {
        expression: nativeRank(title) + nativeRank(description) 

Vespa fieldMatch – lexical ranking

Vespa fieldMatch
is another lexical ranking function that is more sophisticated than Vespa
nativeRank but also computationally expensive and unsuitable as a retrieval
function. fieldMatch(field) or fieldMatch sub-features such as fieldMatch(field).proximity is
typically used as re-ranking features.

rank-profile fieldMatch inherits default {
    first-phase {
        expression: fieldMatch(title) + fieldMatch(description) 

Vespa semantic ranking using vector similarity – dense retrieval

Using an off-the-shelf dense vector model from
Huggingface, we
encode documents and queries into a 384-dimensional dense vector space. We use
cosine similarity between the query and the document in this shared embedding
vector space as our relevancy measure. The vector embedding model,
sentence-transformers/all-MiniLM-L6-v2, is trained
on a large number of sentence-length texts.

Semantic similarity search over dense representations can be accelerated using
Vespa‘s support for approximate nearest neighbor search, making it usable for
efficient retrieval. We use Vespa’s support for embedding dense embedding models in this work.
See text embeddings made simple.

rank-profile semantic inherits default {
    inputs {
        query(query_embedding) tensor<float>(d0[384])
    first-phase {
        expression: closeness(field, embedding)

Vespa semantic ranking using vector similarity – dense retrieval

Another model that can be used for vectorization,
bert-base-uncased, 768-dimensional.
This model has not been trained for semantic similarity, or semantic search. We
use the pre-trained language model directly. As with the previous semantic
model, we use cosine similarity between the query and the document as our relevancy measure.

rank-profile semantic inherits default {
    inputs {
        query(bert_query_embedding) tensor<float>(d0[768])
    first-phase {
        expression: closeness(field, bert_embedding)

Vespa hybrid – combining lexical and semantic ranking

In this baseline model, we combine the semantic similarity score of the
sentence-transformers/all-MiniLM-L6-v2 model with the lexical score computed by
Vespa’s nativeRank. We use a simple linear combination of the two ranking
models. Since the nativeRank lexical feature is normalized in the range [0,1],
combining it with semantic similarity is more straightforward as we don’t need
to normalize unbound lexical scoring functions, such as BM25.
We use a query(alpha) parameter to control how each method influences the overall score.

rank-profile hybrid inherits default {
    inputs {
        query(alpha): 0.5
        query(query_embedding) tensor<float>(d0[384])
    first-phase {
        expression: query(alpha)*closeness(field, embedding) + (1 - query(alpha))*(nativeRank(title) + nativeRank(description))

Zero-shot Baseline Results

Ranking result

We present the results of our evaluation in the above figure. We execute each
query in the English test split (8.9K queries) and ask Vespa to rank the
documents that we have labels for, using the Vespa search API’s recall

Sets a recall parameter to be combined with the query. This is identical to filter, except that recall terms are not exposed to the ranking framework and thus not ranked.

We use trec_eval utility to compute
NDCG, using the relevance judgments. The reported NDCG is the average NDCG
calculated across all 8,956 queries in the test split. These are unique queries
and we don’t know their frequency distribution on Amazon, but in a real-world
evaluation we would have to take into account the query frequency as well when
evaluating the model, for example using weighting so that popular (head) queries weights more
than rare (tail) queries. In other words, we don’t want to introduce a ranking model that on average improves
NDCG, but hurts many head queries.

Notice that the random ranking function produces a high NDCG score, this is because there are
many exact (relevant) judgements per query. On

Improving Product Search with Learning to Rank – part two


Photo by Carl Campbell
on Unsplash

In the first post,
we introduced a large product ranking dataset and established multiple zero-shot
ranking baselines that did not use the labeled relevance judgments in the
dataset. In this post, we start exploiting the labeled relevance judgments to
train deep neural ranking models. We evaluate these models on the test set to
see if we can increase the NDCG ranking metric.

Dataset Splits

The English train split contains 20,888 queries with 419,653 query-item
judgments. To avoid training and tuning models based on observed test set
accuracy, we split the official train set into two new splits; train and dev. In
practice, we pick, at random, 20% of the queries from the train as our dev split
and retain the remaining 80% as train queries.

Since we plan to train multiple models, we can use the performance on the dev
set when training ensemble models and then hide the dev set during the training.

Train Original: 20,888 queries with 419,653 judgments

  • Our Train 16,711 queries with 335,674 assessments
  • Our Dev 4,177 queries with 83,979 assessments

We do this dataset split once, so all our models are trained with the same train
and dev split.

ranking model

Our objective is to train a ranking model f(query,product) that takes a query
and product pair and outputs a relevance score. We aim to optimize the NDCG
ranking metric after sorting the products for a query by this score.

The ranking process is illustrated for a single query in the above figure. We
are asked to rank four products for a query with model f. The model scores each
pair and emits a relevance score. After sorting the products by this score, we
obtain the ranking of the products for the query. Once we have the ranked list,
we can calculate the NDCG metric using the labeled query and product data. The
overall effectiveness of the ranking model is given by computing the average
NDCG ranking score for all queries in the test dataset.

There are many ways to train a ranking model f(query, product), and in this
post, we introduce and evaluate two neural ranking methods based on pre-trained
language models.

Neural Cross-Encoder using pre-trained language models

One classic way to use Transformers for ranking is cross-encoding, where both
the query and the document are fed into the Transformer model simultaneously.

The simultaneous input of the query and document makes cross-encoders unsuitable
for cost-efficient retrieval. In this work, we are relatively lucky because we
already have a magic retriever, and our task is to re-rank a small set (average
20) of products per query.

cross-encoder model

One approach for training cross-encoders for text ranking is to convert the
ranking problem to a regression or classification problem. Then, one can train
the model by minimizing the error (loss) between the model’s predicted and true
label. A popular loss function for regression and classification problems is MSE
(Mean Squared Error) loss. MSE loss is a pointwise loss function, operating on a
single (query, document) point when calculating the error between the actual and
predicted label. In other words, the pointwise loss function does not consider the
ranking order of multiple (query, document) points for the same query. In the next
blog post in this series we will look at listwise loss functions.

With graded relevance labels, we can choose between converting the ranking
problem to a regression (graded relevance) or binary classification
(relevant/irrelevant) problem. With binary classification, we need to set a
threshold on the graded relevance labels. For example, let Exact and Complement
map to the value 1 (relevant) and the remaining labels as 0 (irrelevant).
Alternatively, we can use a map from the relevancy label to a numeric label

Exact => 1.0, Substitute => 0.1, Complement => 0.01, Irrelevant => 0

With this label mapping heuristic, we “tell” the model that Exact should be 10x as
important as Substitute and so forth. In other words, we want the model to output a score close
to 1 if the product is relevant and a score close to 0.1 if the product is labeled as a substitute.

training cross-encoder model

The above figure illustrates the training process; we initialize our model using
a pre-trained Transformer. Next, we prepare batches of labeled data of the form
<query, document, label>. This data must be pre-processed into
a format that the model understands, mapping free text to token ids. As part of
this process, we convert the textual graded label representation to a numeric
representation using the earlier label gain map. For each batch, the model makes
predictions, and the prediction errors are measured using the loss function. The
model weights are then adjusted to minimize the loss, one batch at a time.

Once we have trained the model, we can save the model weights and import the
model into Vespa for inference and ranking. More on how to represent
cross-encoder language models in Vespa will be in a later section in this blog

Choosing the Transformer model and model inputs

There are several pre-trained language models based on the Transformer model
architecture. Since we operate on English queries and products, we don’t need
multilingual capabilities. The following models use the English vocabulary and
wordpiece tokenizer.

  • bert-base-uncased – About 110M parameters
  • MiniLM Distilled model using bert-base-uncased as a teacher, about 22M parameters

Choosing a model is a tradeoff between deployment cost and accuracy. A larger
model (with more parameters) will take longer to train and have higher
deployment costs but with potentially higher accuracy. In our work, we base our
model on weights from a MiniLM
model trained on
relevancy labels from the MS Marco

We have worked with the MiniLM models before in our work on MS Marco passage
so we know these models provide a reasonable tradeoff between ranking
accuracy and inference efficiency, which impacts deployment cost. In addition,
MiniML is a CPU-friendly model, especially with quantized versions.
As a result, we are avoiding expensive GPU instances and GPU-related failure modes.

In addition to choosing the pre-trained Transformer model, we must decide which
product fields and the order we will input them into the model. Order matters,
and the Transformer model architecture has quadratic computational complexity
with the input length, so the size of the text inputs significantly impacts
inference-related costs.

Some alternative input permutations for our product dataset are:

  • query, product title,
  • query, product title, brand,
  • query, product title, brand, product description,
  • query, product title, product description, product bullets.

The product field order matters because the total input length is limited to a
maximum of 512 tokens, and shorter input reduces inference costs significantly.
Since all products in the dataset have a title, we used only the title as input,
avoiding complexity related to concatenating input fields and handling missing
fields. Furthermore, product titles are generally longer than, for example,
Wikipedia or web page titles.

We trained the model for two epochs, Machine Learning (ML) terminology for two
complete iteration over our training examples. Training the model on a free-tier
Google Colab GPU takes less than 30 minutes.

Representing cross-encoders in Vespa

Once we have trained our cross-model using the sentence-transformer
cross-encoder training library, we need to import it to Vespa for inference and
ranking. Vespa supports importing models saved in the ONNX
model format and where inference is accelerated using
We export the PyTorch Transformer model representation to ONNX format using an
ONNX export utility script. Then we can drop the ONNX file into the Vespa
application package and configure how to use it in the product schema:

Model declaration

document product {
    field title type string {..} 
field title_tokens type tensor(d0[128]) {
    indexing: input title | embed tokenizer | attribute | summary
onnx-model title_cross {
    file: models/title_ranker.onnx
    input input_ids: title_input_ids
    input attention_mask: title_attention_mask
    input token_type_ids: title_token_type_ids

The above declares the title_tokens field where we store the subword tokens
and the onnx-model with the file name in the application package and its inputs.
The title_tokens are declared outside of the document as these fields are populated
during indexing.

Using the model in rank-profile

rank-profile cross-title inherits default {
    inputs {
        query(query_tokens) tensor(d0[32])

    function title_input_ids() {
        expression: tokenInputIds(96, query(query_tokens), attribute(title_tokens))

    function title_token_type_ids() {
        expression: tokenTypeIds(96, query(query_tokens), attribute(title_tokens))

    function title_attention_mask() {
        expression: tokenAttentionMask(96, query(query_tokens), attribute(title_tokens)) 

    function cross() {
        expression: onnx(title_cross)({d0:0,d1:0} 
    first-phase {
        expression: cross() 

This defines the rank-profile with the function inputs and where to find the inputs. We
cap the sequence at 96 tokens, including special tokens such as CLS and SEP.
The onnx(title_cross)({d0:0,d1:0} invokes the model with the inputs and slices
the batch dimension (d0) and reads the predicted score.

Vespa provides convenience tensor functions to calculate the three inputs to the
Transformer models; tokenInputIds, tokenTypeIds, and tokenAttentionMask.
We just need to provide the query tokens and field from which we can read the document
tokens. To map text to token_ids using the fixed vocabulary associated with the model,
we use Vespa’s support for wordpiece embedding.
This avoids document side tokenization at inference time, and we can just read the product tokens
during ranking. See the full product schema.

Neural Bi-Encoder using pre-trained language models

Another popular approach for neural ranking is the two-tower or bi-encoder
approach. This approach encodes queries and documents independently.

For text ranking, queries and documents are encoded into a dense embedding
vector space using one or two Transformer models. Most methods use the same
Transformer instance, but it’s possible to decouple them. For example, suppose
both towers map documents and queries to the same vector dimensionality. In that
case, one could choose a smaller Transformer model for the online query encoder
versus the document tower to reduce latency and deployment costs.

The model learns the embedding vector representation from the labeled training
examples. After training the model, one can compute the embeddings for all
products, index them in Vespa, and use a nearest-neighbor search algorithm at
query time for efficient document retrieval or ranking. In this work, Amazon has
provided the “retrieved” products for all queries, so we don’t need HNSW
indexing for efficient
approximate nearest neighbor search.

cross-encoder model

The above illustrates the bi-encoder approach. This illustration uses the
Transformer CLS (classification token) output and uses that as the single dense
vector representation. Another common output pooling strategy is calculating the
mean over all the token vector outputs. Other representation methods use
multiple vectors to represent queries and documents. One example is the ColBERT
model described in this blog

training cross-encoder model

The above illustrates training a bi-encoder for semantic similarity. Training
bi-encoders for semantic similarity or semantic search/retrieval/ranking has
more options than training cross-encoders.

As with the cross-encoder, we must select a pre-trained model, minding size and

The model size impacts quality, inference speed, and vector representation
dimensionality. For example, bert-base-uncased uses 768 dimensions, while MiniLM
uses 384. Lower dimensionality lowers the computational complexity of the
similarity calculations and reduces the embedding storage footprint.

Furthermore, how we batch the input training data, the vector similarity
function, and, most importantly, the loss function determines the quality of the
model on the given task. While loss functions are out of the scope of this blog
post, we would like to mention that training a bi-encoder for semantic
retrieval over a large corpus, which “sees” many irrelevant documents, needs a
different loss function than a bi-encoder model used for re-ranking.

Like with the cross-encoder, we must decide what product fields we encode.
Instead of inputting multiple product fields, we train two bi-encoder models,
one that uses the product title and another that encodes the description.
Both models are based on

Representing bi-encoders in Vespa

A single Vespa document schema can contain multiple document embedding fields,

document product {
    field title type string {}
    field description type string {}
field title_embedding type tensor(d0[384]) {
    indexing: input title | embed title | attribute | summary
    attribute {
        distance-metric: angular 
field description_embedding type tensor(d0[384]) {
    indexing: input description | embed description | attribute 

Improving Product Search with Learning to Rank – part three


Photo by Niels Weiss
on Unsplash

We introduced a large product ranking dataset in the first
post in
this series and established multiple zero-shot ranking baselines.
In the second
we trained ranking models using pre-trained language models and
evaluated them using the NDCG ranking metric.

In this post, we dive deep into a popular method for learning to
rank; gradient-boosting decision trees
(GBDT). Vespa has
native support for evaluating GBDT models and imports models trained
with popular GBDT frameworks such as
XGBoost and


Our objective is to train a ranking model f(query, product) that
takes a query and product pair as input and which outputs a relevance
score. We aim to optimize the NDCG metric after sorting the products
by this score. There are many ways to train a ranking model f(query,
, and in this post, we introduce and evaluate tree-based
models. Tree-based GBDT models, implemented by popular frameworks
like XgBoost and
LightGBM, excel on tabular
(structured) features and handle feature mixes and feature value
ranges without any ceremony.

In our series, we only aim to optimize relevance (NDCG); still,
E-commerce ranking is a multi-objective optimization
where shops want to maximize relevance and revenue. GBDT models
are famous for multi-objective ranking optimization as they handle
a mix of features, combining normalized features (e.g., vector
similarity), unnormalized unbound features (e.g.,
BM25), and “business”
features such as product sales margin. GBDT is also an excellent
way to explore semantic search signals and integrate them into an
existing product ranking function without wasting years of ranking
model development.

Multi-objective ranking

The above figure illustrates a multi-objective ranking optimization
problem where we want to maximize relevance and revenue. One way
to approach this problem is to train the model with a modified
label, an aggregated weighted combination of the two conflicting
optimization objectives.

For organizations, prediction explainability is essential when
applying Machine Learning (ML) to ranking problems, and GBDT
predictions are more straightforward to interpret than neural
predictions. Finally, GBDT is a relatively simple algorithm, meaning
we can train on more data than deep neural networks (for the same
compute budget). For example, training on our product ranking
dataset takes a few seconds on a laptop with CPU-only, while our
neural methods from the second
took hours with GPU acceleration.

Model Features

Tree-based learning methods require tabular (structured) features.
Therefore, we convert the unstructured text ranking dataset to a
tabular dataset using simple feature engineering, where Vespa
calculates all tabular features.

Broadly we have four classes of features used for ranking problems.
Examples in this section do not necessarily map to the Amazon product
ranking dataset used in this blog post series.

Contextual query features

These are features that do not depend on the document. For example,
it could be the user’s age, the time of the day, or a list of
previously visited products — generally, query time contextual
features and the user’s query terms. Query classification techniques
using language models also fall under this feature category, for
example, using Vespa’s support for stateless model

Document features

These features are independent of the query. In E-commerce, for
example, this might be the product’s popularity, price, sale margin,
or user review rating score(s).

Online match features (query * document)

An example of this could be text matching features, such as
bm25(title), tensor computation
like a sparse dot product between the user profile interests and product category, or semantic vector
similarity, as demonstrated in the previous post in this series.
See also tensor computation examples.

Aggregate features

These features aggregate across documents or queries. The computing
of aggregated features is usually performed outside of Vespa, using
near real-time stream processing. The output of these steam processing
jobs are usually tensors
which are stored and updated at scale with Vespa. Aggregate features
can help cold-start problems when indexing a new document that lacks
document features such as sales rank or popularity. For example,
having a “category” popularity score could help rank new products
added to the catalog. For example, the Yahoo homepage recommendation
system uses topical click-through rate (CTR) using Vespa global

Representing features in Vespa

Vespa’s ranking framework has rich built-in
Still, developers can easily express domain-specific features by
combining Vespa’s ranking
language with its built-in features.

Many ranking architectures in the wild use external feature stores.
In these serving architectures, a middle-tier component retrieves
a small pool of documents from an indexing service, for example,
Elasticsearch. After retrieving documents, the middle-tier fetches
features from the feature store and inputs these to the model.

Vespa allows representing a richer set of features than other
traditional indexing services built on Apache Lucene. Furthermore,
since Vespa stores tensor features in memory and memory bandwidth
are much higher than network bandwidth, one can rank a much larger
candidate pool with Vespa than with external feature stores.

Vespa attributes and
tensors support
in-place partial updates,
and developers can update a document and aggregated features with
high throughput (up to 75K partial updates/s per float field per
node with low CPU resource consumption). Genuine partial update
support is a feature that differentiates Vespa compared to search
engines built on Apache Lucene, where partial updates trigger a
full re-indexing of the entire document (get-apply-write pattern).

Gathering features from Vespa

In previous posts, we introduced Vespa
rank-features and
We can define custom features and feature computations using the
Vespa rank expressions language with function support.

rank-profile features inherits default {
        inputs {
            query(query_tokens) tensor<float>(d0[32])
            query(q_title) tensor<float>(d0[384])
        function bi_encoder() {
            expression: closeness(field, title_embedding)

        function max_title_significance() {
            expression: foreach(terms, N, term(N).significance, ">0.5", max)

        function mean_title_significance() {
            expression: foreach(terms, N, term(N).significance, ">0.5", average)

        first-phase {
            expression: random

        match-features {

The above features rank-profile defines a set of custom features using function with rank expressions.
The match-features block defines the set of features that are returned with each ranked document.
See full schema on GitHub.

For explicitly labeled examples like in our dataset, we can ask
Vespa to compute and return the feature scores using a combination
of the Vespa recall
parameter and match-features or
The recall query
parameter allows developers to request Vespa to calculate features
for query document pairs without impacting other features like
queryTermCount or text matching features.


Query-product feature scraping illustration. We request that Vespa
retrieve only the labeled products for each query in the train
split. This feature scraping approach is used for explicit relevance
judgment labels.


The above figure illustrates feature logging in production, and
this process can be used to gather implicit relevance feedback using
clicks and skips.

Feature logging infrastructure is fundamental for scaling machine
learning (ML) applications beyond expensive labeled data.
With Vespa, developers can compute and log new features that the
current production model does not use. Vespa can log the query and
the list of ranked documents with the calculated features. Note:
Using click models to generate unbiased pseudo-relevance labels for
model training is out of the scope of this blog post.

Generally, one wants to train on the features as scraped or logged.
Computing features outside of Vespa for training might cause feature
drift between training and inference in Vespa, where the offline
feature computations differ from the Vespa implementation.

For our GBDT model training, we use built-in Vespa text-matching
features for our product ranking datasets, such as
nativeRank, and
We also include a few handcrafted function features and the neural
scores from the neural bi-encoder and cross-encoder introduced in
the previous
The scrape routine fetches the computed feature values from a running
Vespa instance and merges the features into a pandas DataFrame,
along with the labels and query ids.


The above figure displays a random sample of tabular feature data.
The DataFrame column names map 1:1 to Vespa feature names, native
built-in features, or our user-defined functions, for example,
bi_encoder and cross_encoder.

Gradient Boosting Decision Trees (GBDT)

This blog post won’t dive deep into how the gradient boosting
algorithm works or ranking loss calculations. Instead, for our
purposes, on a very high level, it suffices to say that each training
iteration builds a decision tree, and each new decision tree tries
to correct the errors from the previously generated decision trees.

The final model after k training iterations has k decision trees,
and the model inference is performed by traversing the decision
trees and summing scores in the leaf nodes of the decision trees.
GBDT supports different learning objectives and loss functions:

  • Classification with pointwise loss
  • Regression with pointwise loss
  • Ranking with listwise or pairwise ranking loss

There are two popular implementations of the GBDT algorithm; XGboost,
and LightGBM; Vespa supports importing models from both frameworks.
Vespa does not use the framework inference implementation but
converts the models into Vespa’s ranking framework for accelerated
GBDT inference. At Yahoo, Vespa has been used for ranking with
GBDT models at
long before the XGBoost and LightGBM implementations.
Since the different framework models are imported into the same unified
framework in Vespa, combining them into an ensemble of models from
both frameworks is easy.


The above figure shows a single decision tree from a forest of
decision trees. Each gradient-boosting learning iteration generates
a new decision tree, and the final model is a forest of decision

The decision tree in the illustration is two levels deep and has
four leaf nodes. This simplified decision tree can only output four
different scores (0.04, 0.3, -0.9, or 0.4). Each non-leaf node has
a decision (split condition) that compares a feature with a numeric
value. The tree traversal starts from the tree’s root, and the path
determines the leaf score. The final score is the sum of all the
trees in the forest.

Training GBDT models

Once we have scraped the Vespa computed features for our product
ranking train data, we can start to train models. Unfortunately,
it’s easy to overfit the training data with GBDT, and hyperparameters
impact generalization on unseen data (test data). The previous
explained how to split the training data into a new train and dev
split. Now we can use that dev split as the validation dataset used
during (GBDT) training to check generalization and tune
The training job terminates when observed dev accuracy (NDCG) stops
improving. Early termination avoids overfitting on the train set
and saves resources during training and inference.


The GBDT feature importance plot enables feature ablation studies
where we can remove low-importance features and re-train the model
to observe the NDCG impact on the dev set. Feature importance plots
allow us to reduce computational complexity and deployment costs
by removing features. We make these reductions to meet a specific
latency service level agreement (SLA).


In the feature importance illustration above, we have trained a
model using only five simple features that are computationally

GBDT models and Vespa serving performance

Training an ML model is only worthwhile if we can deploy it to
production. The total computational complexity of evaluating a GBDT
model for ranking in Vespa depends mainly on three factors.

The number of documents exposed to the GBDT model

The number of documents exposed to the model is the most significant
parameter. Computational cost is linear with the number of documents
ranked using the model. Vespa allows multiple ranking
phases to apply the
most computationally complex model to the top-scoring documents
from the previous ranking phase. Search ranking and model inference
with GBDT are highly parallelizable. Vespa’s ability to use multiple
threads per query reduces serving latency.

Features and feature complexity

The number of features and the computational complexity of the
features dramatically impact performance. For example, calculating
a matching feature such as bm25(title) is low complexity
compared to fieldMatch(title), which again is relatively low compared
to inference with the previous

The benefit Vespa delivers is that it allows developers
to understand relative feature complexity using rank
The trace below shows the relative feature complexity.



Improving Zero-Shot Ranking with Vespa Hybrid Search


Photo by Norbert
Braun on Unsplash

If you are planning to implement search functionality but have not
yet collected data from user
interactions to
train ranking models, where should you begin? In this series of
blog posts, we will examine the concept of zero-shot text ranking.
We implement a hybrid ranking method using Vespa and evaluate it on a
large set of text relevancy datasets in a zero-shot setting.

In the first post, we will discuss the distinction between in-domain
and out-of-domain (zero-shot) ranking and present the BEIR benchmark.
Furthermore, we highlight situations where in-domain embedding
ranking effectiveness does not carry over to a different domain in
a zero-shot setting.


Pre-trained neural language models, such as BERT, fine-tuned for
text ranking, have demonstrated remarkable effectiveness compared
to baseline text ranking methods when evaluating the models in-domain.
For example, in the Pretrained Transformer Language Models for
blog post series, we described three methods for using pre-trained
language models for text ranking, which all outperformed the
traditional lexical matching baseline

In the transformer ranking blog series,
we used in-domain data for training and production (test), and the
documents and the queries were drawn from the same in-domain data

MS MARCO Results

In-domain trained and evaluated ranking methods. All models are
end-to-end represented using Vespa, open-sourced in the msmarco-ranking
The Mean Reciprocal Rank
(MRR@10) is
reported for the dev query split of the MS MARCO passage ranking

This blog post looks at zero-shot text ranking, taking a ranking
model and applying it to new domains without adapting or fine-tuning
the model.

Zero-shot overview
In-domain training and inference (ranking) versus zero-shot inference
(ranking) with a model trained in a different domain.

Information Retrieval Evaluation

Information retrieval (IR) evaluation is the process of measuring
the effectiveness of an information retrieval system. Measuring
effectiveness is important because it allows us to compare different
ranking strategies and identify the most effective at retrieving
relevant information.

We need a corpus of documents and a set of queries with relevance
judgments to perform an IR evaluation. We can start experimenting
and evaluating different ranking methods using standard IR metrics
such as nDCG@10, Precision@10, and Recall@100. These IR metrics
allow us to reason about the strengths and weaknesses of proposed
ranking models, especially if we can evaluate the model in multiple
domains or relevance datasets. Evaluation like this contrasts the
industry’s most commonly used IR evaluation metric, LGTM (Looks
Good To Me
)@10, for a small number of queries.

Evaluating ranking models in a zero-shot setting

In BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of
Information Retrieval Models,
Thakur et al. introduce a benchmark for assessing text ranking
models in a zero-shot setting.

The benchmark includes 18 diverse datasets sampled from different
domains and task definitions. All BEIR datasets have relevance
judgments with varying relevance grading resolutions. For example,
TREC-COVID, a dataset of the BEIR benchmark, consists of 50 test
queries, with many graded relevance labels, on average 493,5 document
judgments per query. On the other hand, the BEIR Natural Questions
(NQ) dataset uses binary relevance labels, with 4352 test queries,
with, on average, 1.2 judgments per query.

The datasets included in BEIR also have a varying number of documents,
queries, and document lengths, but all the datasets are monolingual

BEIR overview
Statistics of the BEIR datasets. Table from BEIR: A Heterogeneous
Benchmark for Zero-shot Evaluation of Information Retrieval
Models; see also
BEIR Benchmark

The BEIR benchmark uses the Normalised Cumulative Discount
(nDCG@10) ranking metric. The nDCG@10 metric handles both datasets
with binary (relevant/not-relevant) and graded relevance judgments.
Since not all the datasets are available in the public domain (E.g.,
Robust04), it’s common to report nDCG@10 on the 13 datasets that
can be downloaded from the BEIR GitHub repository
using the wonderful ir_datasets
library. It’s also possible to aggregate the reported nDCG@10 metrics
per dataset to obtain an overall nDCG@10 score, for example, using
the average across the selected BEIR datasets. It’s important to
note which datasets are included in the average overall score, as
they differ significantly in retrieval difficulty.

Zero-Shot evaluation of models trained on Natural Questions

The most common BEIR experimental setup uses the MS MARCO labels
to train models and apply the models in a zero-shot setting on the
BEIR datasets. The simple reason for this setup is that MS MARCO
is the largest relevance dataset in the public domain, with more
than ½ million training queries. As with NQ, there are few, an
average of 1.1, relevant passages per query. Another setup variant
we highlight in detail in this section is to use a ranking model
trained on Natural Questions(NQ) labels with about 100K training
queries and evaluate it in an out-of-domain setting on MS MARCO

MS MARCO and Natural Questions datasets have fixed document corpora,
and queries are split into train and test. We can train a ranking
model using the train queries and evaluate the ranking method on
the test set. Both datasets are monolingual (English) and have user
queries formulated as natural questions.

MS MARCO sample queries

how many years did william bradford serve as governor of plymouth colony?

define preventive

color overlay photoshop

Natural Questions (NQ) sample queries

what is non controlling interest on balance sheet

how many episodes are in chicago fire season 4

who sings love will keep us alive by the eagles

On the surface, these two datasets are similar. Still, NQ has longer
queries and documents compared to MS MARCO. There are also subtle
differences in how these datasets were created. For example, NQ
uses passages from Wikipedia only, while MS MARCO is sampled from
web search results.

Statistics/DatasetMS MARCONatural Questions (NQ)
query length5.99.2
document length56.676.0

The above table summarizes basic statistics of
the two datasets. Words are counted after simple space tokenization.
Query lengths are calculated using the dev/test splits. Both
datasets have train splits with many queries to train the ranking

In open-domain question answering with
we described the Dense Passage Retriever (DPR) model, which uses
the Natural Questions dataset to train a dense 768-dimensional
vector representation of both queries and Wikipedia paragraphs. The
Wikipedia passages are encoded using the DPR model, representing
each passage as a dense vector. The Wikipedia passage vector
representations can be indexed and efficiently searched using an
approximate nearest neighbor

At query time, the text query is encoded with the DPR model into a dense
vector representation used to search the vector index. This ranking
model is an example of dense retrieval over vector text
from BERT. DPR was one of the first dense retrieval methods that
outperformed the BM25 baseline significantly on NQ. Since then,
much water has flown down the river, and dense vector models are
closing in on more computationally expensive cross-encoders in an
in-domain setting on MS MARCO.

In-domain effectiveness versus out-of-domain in a zero-shot setting

The DPR model trained on NQ labels outperforms the BM25 baseline
when evaluated on NQ. This is an example where the in-domain
application of the trained model improves the ranking accuracy over
baseline BM25.


In-domain evaluation of the Dense Passage Retriever (DPR). DPR is
an example of Embedding Based Retrieval (EMB).

Suppose we use the DPR model trained on NQ (and other question-answering
datasets) and apply the model on MS MARCO. Then we can say something
about generalization in a zero-shot setting on MS MARCO.


Out-of-domain evaluation of the Dense Passage Retriever (DPR). DPR
is an example of Embedding Based Retrieval (EMB). In this zero-shot
setting, the DPR model underperforms the BM25 baseline.

This case illustrates that in-domain effectiveness does not necessarily
transfer to an out-of-domain zero-shot application of the model.
Generally, as observed on the BEIR dense
dense embeddings models trained on NQ labels underperform the BM25
baseline across almost all BEIR datasets.


In this blog post, we introduced zero-shot and out-of-domain IR
evaluation. We also introduced the important BEIR benchmark.
Furthermore, we highlighted a case study of the DPR model and its
generalization when applied out-of-domain in a zero-shot setting.

We summarize this blog post with the following quote from the
BEIR paper:

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets

Next blog post in this series

In the next post in this series on zero-shot ranking, we introduce a
hybrid ranking model, a model which combines multi-vector representations with BM25.
This hybrid model overcomes the limitations of single-vector embedding models,
and we prove its effectiveness in a zero-shot setting on the BEIR benchmark.

Improving Zero-Shot Ranking with Vespa Hybrid Search – part two


Photo by Tamarcus Brown
on Unsplash

Where should you begin if you plan to implement search functionality
but have not yet collected data from user
interactions to
train ranking models?

In the first
post in
the series, we introduced the difference between in-domain and
out-of-domain (zero-shot) ranking. We also presented the BEIR
benchmark and highlighted cases where in-domain effectiveness does
not transfer to another domain in a zero-shot setting.

In this second post in this series, we introduce and evaluate three
different Vespa ranking methods on the
BEIR benchmark in a zero-shot
setting. We establish a new and strong BM25 baseline for the BEIR
dataset, which outperforms previously reported BM25 results. We
then show how a unique hybrid approach, combining a neural ranking
method with BM25, outperforms other evaluated methods on 12 out of
13 datasets on the BEIR benchmark. We also compare the effectiveness
of the hybrid ranking method with emerging few-shot methods that
generate in-domain synthetic training data via prompting large
language models (LLMs).

Establishing a strong baseline

In the BEIR paper,
the authors find that BM25
is a strong generalizable baseline text ranking model. Many, if not
most, of the dense single vector embedding models trained on MS
MARCO labels are outperformed by BM25 in an out-of-domain setting.
Quote from BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation
of Information Retrieval

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point, that retrieval methods must be evaluated on a broad range
of datasets

What is interesting about reporting BM25 baselines is that there
are multiple implementations, variants, and performance tweaks, as
demonstrated in Which BM25 Do You Mean? A Large-Scale Reproducibility
Study of Scoring
Unfortunately, various papers have reported conflicting results for BM25 on
the same BEIR benchmark datasets. The BM25 effectiveness can vary due to different
hyperparameters and different linguistic processing methods used in different system implementations,
such as removing stop words, stemming, and tokenization. Furthermore, researchers want to contrast their proposed ranking
approach with a baseline ranking method. It could be tempting to
report a weak BM25 baseline, which makes the proposed ranking method
stand out better.

Several serving systems implement BM25 scoring, including Vespa.
Vespa’s lexical or sparse retrieval is also accelerated using the
weakAnd Vespa query
This is important because implementing a BM25 scoring function in
a system is trivial, but scoring all documents that contains at
least one of the query terms approaches linear complexity. Dynamic
pruning algorithms like weakAnd improve the
retrieval efficiency significantly compared to naive
brute-force implementations that scores all documents matching any of the query terms.

BM25 has two
hyperparameters, k1 and b, which impact ranking effectiveness.
Additionally, most (14 out of 18) of the BEIR datasets have both
title and text document fields, which in a real-production environment
would be the first thing that a seasoned search practitioner would
tune the relative importance of. In our BM25 baseline,
we configure Vespa to independently calculate the
BM25 score of both
title and text, and we combine the two BM25 scores
linearly. The complete Vespa rank
profile is given below.

rank-profile bm25 inherits default {
   first-phase {
      expression: bm25(title) + bm25(text)
   rank-properties {
      bm25(title).k1: 0.9
      bm25(title).b: 0.4
      bm25(text).k1: 0.9
      bm25(text).b: 0.4

We modify the BM25 k1 and b parameters but use the same parameters for
both fields. The values align with Anserini
(k1=0.9, b=0.4).

The following table reports nDCG@10 scores on a subset (13) of the
BEIR benchmark datasets. We exclude the four datasets that are not
publicly available. We also exclude the BEIR CQADupStack dataset
because it consists of 12 sub-datasets where the overall nDCG@10
score is found by averaging each
nDCG@10 score. Adding these sub-datasets would significantly increase
the evaluation effort.

BEIR DatasetBM25 from
BEIR Paper
Vespa BM25
MS MARCO0.2280.228
Natural Questions (NQ)0.3290.327
Touché-2020 (V2)0.3670.413
Average (excluding MS MARCO)0.4400.453

The table summarizes the BM25 nDCG@10 results. Vespa BM25 versus
BM25 from BEIR paper.

The table above demonstrates that the Vespa implementation has set a new high
standard, outperforming reported BM25 baselines on the BEIR benchmark.

Evaluating Vespa ranking models in a zero-shot setting

With the new strong BM25 baseline established in the above section, we
will now introduce two neural ranking models and compare their performance with the baseline.

Vespa ColBERT

We have previously described the Vespa ColBERT implementation in
this blog
and we use the same model
weights in this
work. The Vespa ColBERT model is based on a distilled 6-layer MiniLM
model with 22M parameters, using quantized int8 weights (post-training
quantization). The model uses only 32 vector dimensions per query
and document wordpiece),
in contrast to the original ColBERT model, which
uses 128 dimensions. Furthermore, we use Vespa’s support for
to reduce the per-dimension storage usage from 4 bytes per dimension
with float to 2 bytes with bfloat16. We configure the maximum query
length to 32 wordpieces, and maximum document length to 180
wordpieces. Both maximum length parameters align with the training and experiments

The ColBERT MaxSim scoring is implemented as a re-ranking model
using Vespa phased ranking,
re-ranking the top 2K hits ranked by BM25. We also compute and store
the title term embeddings for datasets with titles, meaning we have
two MaxSim scores for datasets with titles. We use a linear combination
to combine the title and text MaxSim scores.

The complete Vespa rank profile
is given below.

rank-profile colbert inherits bm25 {
   inputs {
      query(qt) tensor<float>(qt{}, x[32])
      query(title_weight): 0.5
   second-phase {
      rerank-count: 2000
	   expression {
	   (1 - query(title_weight))* sum(
		      query(qt) * cell_cast(attribute(dt), float), x
	      max, dt
	   ) +
	   query(title_weight) * sum(
		      query(qt) * cell_cast(attribute(title_dt), float), x
	      max, dt

The per wordpiece ColBERT vectors are stored in Vespa using Vespa’s
support for storing and computing over

Note: Users
can also trade efficiency versus cost by storing the tensors on
disk, or in-memory using
options. Paging is highly efficient in a re-ranking pipeline, as
just a few K tensors values are potentially paged on-demand.

Vespa Hybrid ColBERT + BM25

There are several ways to combine the ColBERT MaxSim with BM25,
including reciprocal rank
which does not consider model scores, just the ordering (ranking) the
scores produce. Quote from Reciprocal Rank Fusion outperforms Condorcet and
individual Rank Learning Methods:

RRF is simpler and more effective than Condorcet Fuse, while sharing
the valuable property that it combines ranks without regard to the
arbitrary scores returned by particular ranking methods

Another approach is to combine the model scores into a new score
to produce a new ranking. We use a linear combination in this work
to compute the hybrid score. Like the ColBERT-only model, we use
BM25 as the first-phase ranking model and only calculate the hybrid
score for the global top-ranking K documents from the BM25 model.

Before combining the scores, we want to normalize both the unbound
BM25 and the bound ColBERT score. Normalization is accomplished by
simple max-min
scaling of the scores. With max-min scaling, scores from any ranking
model are scaled from 0 to 1. This makes it easier to combine the
two using relative weighting.

Since scoring in a production serving system might be spread across
multiple nodes, each node involved in the query will not know the
global max scores. We solve this problem by letting Vespa content
nodes involved in the query return both scores using Vespa

A custom searcher
is injected in the query dispatching stateless Vespa service. This
searcher calculates the max and min for both model scores using
match features for hits within the window of global top-k hits
ranked by BM25. As with the ColBERT rank profile, we use a re-ranking
window of 2000 hits, but we perform feature-score scaling and
re-ranking in a stateless custom searcher instead of on the content

The complete Vespa rank profile is given below. Notice the
match-features, which are returned with each hit to the stateless
which performs the normalization and re-scoring. The first-phase
scoring function is inherited from the previously described bm25
rank profile.

rank-profile hybrid-colbert inherits bm25 {
   function bm25() {
	   expression: bm25(title) + bm25(text)

   function colbert_maxsim() {
	   expression {
		            query(qt) * cell_cast(attribute(dt), float) , x
	         max, dt
	      ) +
		            query(qt) * cell_cast(attribute(title_dt), float), x
	         max, dt
   match-features {

​​Results and analysis

As with the BM25 baseline model, we index one of the BEIR datasets
at a time on a Vespa instance and evaluate the models. The following
table summarizes the results. All numbers are nDCG@10. The
best-performing model score per dataset is in bold.

BEIR DatasetVespa
Vespa ColBERTVespa Hybrid
MS MARCO (in-domain)0.2280.401
Natural Questions (NQ)0.3270.4030.404
Touché-2020 (V2)0.4130.3150.415
Average nDCG@10 (excluding MS MARCO)0.4530.3630.481

The table summarizes the nDCG@10 results per dataset. Note that MS MARCO is in-domain for ColBERT and Hybrid.
Average nDCG@10 is only computed for zero-shot and out-of-domain datasets.

As shown in the table above, in a in-domain setting on MS MARCO,
the Vespa ColBERT model outperforms the BM25
baseline significantly. The resulting nDCG@10 score aligns with reported MRR@10
results from previous work using
in-domain on MS MARCO. However, mixing the baseline BM25 using the hybrid model on MS MARCO evaluation
hurts the nDCG@10 score, as we combine two models where the unsupervised BM25
model is significantly weaker than the ColBERT model.

The Vespa ColBERT model underperforms BM25 on out-of-domain datasets,
especially CLIMATE-FEVER. The CLIMATE-FEVER dataset has very long
queries (avg 20.2 words). The long questions challenge the ColBERT
model, configured with a max query length of 32 wordpieces in the
experimental setup. Additionally, the Vespa ColBERT model underperforms
reported results for the full-sized ColBERT
V2 model using 110M parameters
and 128 dimensions. This result could indicate that the compressed
(in the number of dimensions) and model distillation have a more
significant negative impact when applied in a zero-shot setting
compared to in-domain.

These exceptions aside, the data shows that the unique hybrid Vespa ColBERT and BM25 combination is highly
effective, performing the best on 12 of 13 datasets
. Its average
nDCG@10 score improves from 0.453 to 0.481 compared to the strong
Vespa BM25 baseline.

To reproduce the results of this benchmark,
follow the open-sourced

Comparing hybrid zero-shot with few-shot methods

To compare the hybrid Vespa ranking performance with other models,
we include the results reported in Promptagator: Few-shot Dense
Retrieval From 8 Examples from
Google Research.

Generating synthetic training data in-domain via prompting LLMs is a recent
emerging Information Retrieval(IR) trend also described in
InPars: Data Augmentation for Information Retrieval using Large Language

The basic idea is to “prompt” a large language model (LLM) to generate synthetic queries
for use in training of in-domain ranking models. A typical prompt include a few
examples of queries and relevant documents, then the LLM is “asked”
to generate syntetic queries for many of the documents in the corpus.
The generated syntetic query, document pairs can be used to train neural ranking models.
We include a quote describing the approach from the Promptagator paper:

Improving Search Ranking with Few-Shot Prompting of LLMs


Photo by Maxime VALCARCE on Unsplash

This blog post explores using large language models (LLMs) to
generate labeled data for training ranking models. Distilling the
knowledge and power of generative models with billions of parameters
to ranking models with a few million parameters. The approach uses
a handful of human-annotated labeled examples (few-shot) and prompts
the LLM to generate synthetic queries for documents in the corpus.

The ability to create high-quality synthetic training data might
be a turning point with the potential to revolutionize information
retrieval. With a handful of human annotations, the LLMs can generate
infinite amounts of high-quality labeled data at a low cost. Training data which
is used to train much smaller and compute efficient ranking models.


Language models built on the
Transformer architecture have
revolutionized text ranking overnight,
advancing the state-of-the-art by more than 30% on the MS MARCO
relevance dataset. However, Transformer-based ranking models need
significant amounts of labeled data and supervision to realize their
potential. Obtaining high-quality annotated data for training deep
ranking models is labor-intensive and costly. Therefore, many
organizations try to overcome the labeling cost problem by using
pseudo-labels derived from click models. Click models use query-document
user interaction from previously seen queries to label documents.
Unfortunately, ranking models trained on click-data suffer from
multiple bias issues, such as presentation bias and survivorship
bias towards the existing ranking model.

In addition, what if you want to build a great search experience
in a new domain without any interaction data (cold-start) or resources
to obtain sufficient amounts of labeled data to train neural ranking
models? The answer might be generative large language models (LLMs),
which can generate training data to train a ranking model adapted
to the domain and use case.

Generative large language models (LLMs)

The public interest in generative language models (LLMs) has
skyrocketed since OpenAI released ChatGPT in November
2022. The
GPT-3 model is trained on a
massive amount of text data using unsupervised learning and can
generate human-like text given a text prompt input.

Google is another leader in the language model space, except they
have not exposed any of them in a public chat-like interface like
OpenAI. In Scaling Instruction-Finetuned Language
Models, researchers from Google
describe a few of their generative language models and instruction
fine-tuning. In contrast to OpenAI and many other language model
providers, Google has open-sourced their generative FLAN-T5 models
in various model
up to 11B parameters, using a permissive Apache 2.0 License.

Figure from Scaling Instruction-Finetuned Language

A critical difference between large generative language models
and a vanilla BERT ranking model is that they necessarily don’t
require task-specific fine-tuning of the model weights. Massive
self-supervised training on piles of text, coupled with later
fine-tuning on a broad, diverse set of tasks, is one of the reasons
they are called foundation models

The foundation model weights are frozen, but we can adapt the model
to our task by mixing natural language instructions with data in
the prompt input. The art of mixing instructions and data in an LLM
instruction prompt has created a new artistic engineering field;
prompt engineering.
We can improve the model’s generated output using prompt engineering
by changing the instructions written in natural language.

Generating labeled data via instruction-prompting Large Language Models

Instruction-prompting large language models (LLMs) have also entered
the information retrieval research (IR). A recent trend in IR
research is to use generative LLMs, such as GPT-3, to generate
synthetic data to train ranking models .

The general idea is to design an instruction prompt with a few
labeled relevance examples fed to the LLM (large language model)
to generate synthetic queries or documents. The instruction prompts
that generate artificial questions are the most promising direction
since all you need is a few labeled queries, document examples, and
samples from the document corpus. In addition, with synthetic query
generators, you avoid running inference with computationally expensive
LLMs at user time, which can be unrealistic for most
organizations. Instead, the
synthetic generation process is performed offline with LLM-powered
query generators. Offline inference with LLMs is considerably less
engineering-intensive than online usage, as inference latency is
not a concern.

LLMs will hallucinate and sometimes produce fake queries that are
too generic or irrelevant. To overcome this, researchers
use a ranking model (RM) to test the query quality. One can, for example, rank
documents from the corpus for each generated synthetic query using
the RM. The synthetic query is only retained for model training if
the source document ranks highly. This query consistency check
grounds the LLM output and improves the training data. Once the
synthetic queries and positive (relevant) document pairs are filtered,
one can sample negative (potentially irrelevant) examples and train
a ranking model adapted to the domain, the characteristics of the
document corpus, and the few-show query examples.

Illustration of distilling the knowledge and power of generative Large Language Models (LLMs) with billions of parameters to ranking models with a few million parameters.


In the previous posts on zero-shot ranking
we evaluated zero-shot ranking models on 13 BEIR
benchmark datasets. We reported
large gains in ranking effectiveness by a hybrid ranking model that
combines unsupervised BM25 ranking with a neural ranking model
trained on a large relevance dataset. This hybrid model is an example
of a zero-shot ranking model without any fine-tuning or adaption
to the new domain.

In this work, we want to see if we can improve the effectiveness
by using a large language model to create synthetic training data
for the new domain. We focus on one of the BEIR datasets, the
bio-medical trec-covid IR dataset.
We chose this dataset because we have previously built a simple
demo search UI and Vespa
app, indexing the
final CORD-19 dataset. With this demo, we can deploy the ranking
models to a production app in Vespa cloud. The following subsections
describe the experimental setup. We also publish three
demonstrating the 3-stage process.

Synthetic query generation

We chose the open-source 3 Billion
flan-t5-xl generative
model as our artificial query generator. The model is genuine
open-source, licensed with a permissive Apache 2.0 license. We
devise the following few-shot instruction prompt template.

These are examples of queries with sample relevant documents for
each query. The query must be specific and detailed.

Example 1:
document: $document_example_1
query: $query_example_1

Example 2:
document: #document_example_2
query: $query_example_2

Example 3:
document: $document_example_3
query: $query_example

Example 4:
document: $input_document

Our first prompt attempt did not include “the query must be specific
and detailed”
phrase. Without it, many eyeballed queries were too
generic. Changing the prompt made the model produce more specific
queries. The change in output quality is an example of the magic of prompt engineering.

We use the three first trec-covid test queries (originally from, which is not available anymore)
as our in-domain examples for few-shot instruction examples.
We pick the first document annotated
as highly relevant to form the complete query-document example.

Finally, we iterate over the document collection, replace the
$input_document variable with a concatenation of the title and
abstract, then run an inference with the flan-t5-xl model and store the generated
query. We use a single A100 40GB GPU, which costs about 1$/hour and can
generate about 3600 synthetic queries per hour (depending on prompt size).

At completion, we ended up with synthetic queries for 33,099 documents
out of 171K docs. Notice that the three query-document examples are
the same for all document-to-query creations and that the model’s
max input sequence length limits the number of examples we can fit
into the prompt.

Query consistency checking

We use a robust zero-shot hybrid ranking
for query consistency checking. The generated query is retained for
training only if the source document is ranked #1 by the zero-shot
model. If the question passes this test, we also sample two negative
query-document pairs from the top 100 documents ranked by the
zero-shot model. After the consistency filter, the number of positive
query, document pairs drops to 14,156 (43% retention). The high retention
percentage demonstrates that the flan-t5-xl and prompt combination is creating
specific and detailed queries.

Dataframe with the generated synthetic queries and the document
metadata. The document abstract is summarized by the query contextual
Vespa dynamic
feature. There are three rows in the data frame for each unique
generated question, one positive (relevant) document and two
irrelevant. This is the input to the next step, which is to train
a ranking model on this purely synthetic data.

Rank model training

After query generation and consistency checking, we use the synthetic
query and document pairs to train a ranking model. As in previous
we use a cross-encoder based on a 6-layer MiniLM model with just
22M trainable parameters. We train the model for two epochs on the
synthetic data and export the model to ONNX for inference in

Finally, we deploy the fine-tuned model as a re-ranking phase
on top of the top 30 results from the hybrid model.

Evaluation & Results

We contrast the model tuned by synthetic queries with ranking models
evaluated in the zero-shot ranking blog
on the trec-covid dataset.


We gain four nDCG@10 points over the strong hybrid zero-shot model
and 10 nDCG@10 points over unsupervised BM25. These are significant
gains, especially given the low training and inference costs. We
also note that the Vespa BM25 baseline is strong, beating other
BM25 implementations on the same dataset. The model trained on
synthetic data outperforms the
PROMPTAGTOR model, which uses
a proprietary 137B FLAN checkpoint to generate synthetic queries.
In the paper they report a nDCG@10 of 76.2 on trec-covid. Finally,
we contrast with OpenAI’s GPT embedding
paper, where OpenAI reporst a nDCG@10 score of 64.9 for their GPT
XL embeddings on trec-covid.

Deploying to production

There are two reasons for choosing a cross-encoder model over a
bi-encoder for our synthetic fueled ranking model.

  • Cross-encoders are generally more effective than bi-encoders.
  • Updating the cross-encoder model weight does not require re-processing
    the document corpus. With bi-encoders using single vector
    representations, developers would have to re-process the document-side
    embeddings every time a new prompt-trained ranking model is available.
    With a cross-encoder, model versioning and A/B testing are easier
    to operationalize.

Cross-encoder’s downside is the computational complexity at query
time, which is quadratic with model sequence input length. We deploy
a trick to reduce the model input sequence; we input the query,
title, and a query contextual Vespa dynamic
of the abstract. The dynamic abstract summarization reduces the
sequence length while retaining segments of the abstract that matches
the query. Furthermore, we limit the complexity by reducing the
re-ranking depth to 30. Finally, we deploy the trained model to our demo site,
where users can choose between the prompt-generated model or the
previously described zero-shot ranking


Synthetic query generation via instruction-prompted LLMs is a
promising approach for overcoming the label shortage problem. With
just three human labeled examples, the query generator, built on
an open-source flan-t5 model, could generate high-quality training

As part of this work, we open-source three notebooks:

In addition to these resources; we open-source the generated
and the consistency-checked training data
The training data includes the zero-shot scores, the consistenty checked query, the document title, and the
query contextual summarization of the abstract. The end-to-end Vespa
application is also

In future work, we’ll look at how generative models can be used for
re-ranking, summarization, and generative question answering with Vespa.

Regardless, improving retrieval quality is step one in improving the
overall effectiveness of any retrieval-augmented system.