Vespa Product Updates, August 2019: BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

In the recent Vespa product update, we mentioned Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.

This month, we’re excited to share the following feature updates with you:

BM25 Rank Feature

The BM25 rank feature implements the Okapi BM25 ranking function and is a great candidate to use in a first phase ranking function when you’re ranking text documents. Read more.

Searchable Reference Attribute

A reference attribute field can be searched using the document id of the parent document-type instance as query term, making it easy to find all children for a parent document. Learn more.

Tensor in Summary Features

A tensor can now be returned in summary features.
This makes rank tuning easier and can be used in custom Searchers when generating result sets.
Read more.

Metrics Export

To export metrics out of Vespa, you can now use the new node metric interface. Aliasing metric names is possible and metrics are assigned to a namespace. This simplifies integration with monitoring products like CloudWatch and Prometheus. Learn more about this update.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Learning to Rank with Vespa – Getting started with Text Search have just published two tutorials to help people to get started with text search applications by building scalable solutions with Vespa.
The tutorials were based on the full document ranking task
released by Microsoft’s MS MARCO dataset’s team.

The first tutorial helps you to create and deploy a basic text search application with Vespa as well as to download, parse and feed the dataset to a running Vespa instance. They also show how easy it is to experiment with ranking functions based on built-in ranking features available in Vespa.

The second tutorial shows how to create a training dataset containing Vespa ranking features that allow you to start training ML models to improve the app’s ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context.

Both tutorials are detailed and come with code available to reproduce the steps. Here are the highlights.

Basic text search app in a nutshell

The main task when creating a basic app with Vespa is to write a search definition file containing information about the data you want to feed to the application and how Vespa should match and order the results returned in response to a query.

Apart from some additional details described in the tutorial, the search definition for our text search engine looks like the code snippet below. We have a title and body field containing information about the documents available to be searched. The fieldset keyword indicates that our query will match documents by searching query words in both title and body fields. Finally, we have defined two rank-profile, which controls how the matched documents will be ranked. The default rank-profile uses nativeRank, which is one of many built-in rank features available in Vespa. The bm25 rank-profile uses the widely known BM25 rank feature.

search msmarco { 
    document msmarco {
        field title type string
        field body type string 
    fieldset default {
        fields: title, body
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body)
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body)

When we have more than one rank-profile defined, we can chose which one to use at query time, by including the ranking parameter in the query:

curl -s "<URL>/search/?query=what+is+dad+bod"
curl -s "<URL>/search/?query=what+is+dad+bod&ranking=bm25"

The first query above does not specify the ranking parameter and will therefore use the default rank-profile. The second query explicitly asks for the bm25 rank-profile to be used instead.

Having multiple rank-profiles allow us to experiment with different ranking functions. There is one relevant document for each query in the MSMARCO dataset. The figure below is the result of an evaluation script that sent more than 5.000 queries to our application and asked for results using both rank-profiles described above. We then tracked the position of the relevant document for each query and plotted the distribution for the first 10 positions.

It is clear that the bm25 rank-profile does a much better job in this case. It places the relevant document in the first positions much more often than the default rank-profile.

Data collection sanity check

After setting up a basic application, we likely want to collect rank feature data to help improve our ranking functions. Vespa allow us to return rank features along with query results, which enable us to create training datasets that combine relevance information with search engine rank information.

There are different ways to create a training dataset in this case. Because of this, we believe it is a good idea to have a sanity check established before we start to collect the dataset. The goal of such sanity check is to increase the likelihood that we catch bugs early and create datasets containing the right information associated with our task of improving ranking functions.

Our proposal is to use the dataset to train a model using the same features and functional form used by the baseline you want to improve upon. If the dataset is well built and contains useful information about the task you are interested you should be able to get results at least as good as the one obtained by your baseline on a separate test set.

Since our baseline in this case is the bm25 rank-profile, we should fit a linear model containing only the bm25 features:

a + b * bm25(title) + c * bm25(body)

Having this simple procedure in place helped us catch a few silly bugs in our data collection code and got us in the right track faster than would happen otherwise. Having bugs on your data is hard to catch when you begin experimenting with complex models as we never know if the bug comes from the data or the model. So this is a practice we highly recommend.

How to create a training dataset with Vespa

Asking Vespa to return ranking features in the result set is as simple as setting the ranking.listFeatures parameter to true in the request. Below is the body of a POST request that specify the query in YQL format and enable the rank features dumping.

body = {
    "yql": 'select * from sources * where (userInput(@userQuery));',
    "userQuery": "what is dad bod",
    "ranking": {"profile": "bm25", "listFeatures": "true"},

Vespa returns a bunch of ranking features by default, but we can explicitly define which features we want by creating a rank-profile and ask it to ignore-default-rank-features and list the features we want by using the rank-features keyword, as shown below. The random first phase will be used when sampling random documents to serve as a proxy to non-relevant documents.

rank-profile collect_rank_features inherits default {

    first-phase {
        expression: random


    rank-features {


We want a dataset that will help train models that will generalize well when running on a Vespa instance. This implies that we are only interested in collecting documents that are matched by the query because those are the documents that would be presented to the first-phase model in a production environment. Here is the data collection logic:

hits = get_relevant_hit(query, rank_profile, relevant_id)
if relevant_hit:
    hits.extend(get_random_hits(query, rank_profile, n_samples))
    data = annotate_data(hits, query_id, relevant_id)
    append_data(file, data)

For each query, we first send a request to Vespa to get the relevant document associated with the query. If the relevant document is matched by the query, Vespa will return it and we will expand the number of documents associated with the query by sending a second request to Vespa. The second request asks Vespa to return a number of random documents sampled from the set of documents that were matched by the query.

We then parse the hits returned by Vespa and organize the data into a tabular form containing the rank features and the binary variable indicating if the query-document pair is relevant or not. At the end we have a dataset with the following format. More details can be found in our second tutorial.

Beyond pointwise loss functions

The most straightforward way to train the linear model suggested in our data collection sanity check would be to use a vanilla logistic regression, since our target variable relevant is binary. The most commonly used loss function in this case (binary cross-entropy) is referred to as a pointwise loss function in the LTR literature, as it does not take the relative order of documents into account.

However, as we described in our first tutorial, the metric that we want to optimize in this case is the Mean Reciprocal Rank (MRR). The MRR is affected by the relative order of the relevance we assign to the list of documents generated by a query and not by their absolute magnitudes. This disconnect between the characteristics of the loss function and the metric of interest might lead to suboptimal results.

For ranking search results, it is preferable to use a listwise loss function when training our model, which takes the entire ranked list into consideration when updating the model parameters. To illustrate this, we trained linear models using the TF-Ranking framework. The framework is built on top of TensorFlow and allow us to specify pointwise, pairwise and listwise loss functions, among other things.

We made available the script that we used to train the two models that generated the results displayed in the figure below. The script uses simple linear models but can be useful as a starting point to build more complex ones.

Overall, on average, there is not much difference between those models (with respect to MRR), which was expected given the simplicity of the models described here. However, we can see that a model based on a listwise loss function allocate more documents in the first two positions of the ranked list when compared to the pointwise model. We expect the difference in MRR between pointwise and listwise loss functions to increase as we move on to more complex models.

The main goal here was simply to show the importance of choosing better loss functions when dealing with LTR tasks and to give a quick start for those who want to give it a shot in their own Vespa applications. Now, it is up to you, check out the tutorials, build something and let us know how it went. Feedbacks are welcome!

Vespa Product Updates, December 2019: Improved ONNX support, New rank feature attributeMatch().maxWeight, Free lists for attribute multivalue mapping, faster updates for out-of-sync documents, Zookeeper 3.5.6

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

In the November Vespa product update, we mentioned Nearest Neighbor and Tensor Ranking, Optimized JSON Tensor Feed Format, Matched Elements in Complex Multi-value Fields, Large Weighted Set Update Performance and Datadog Monitoring Support.

Today, we’re excited to share the following updates:

Improved ONNX Support

Vespa has added more operations to its ONNX model API, such as GEneral Matrix to Matrix Multiplication (GEMM) –
see list of supported opsets.
Vespa has also improved support for PyTorch through ONNX,
see the example.

New Rank Feature attributeMatch().maxWeight

attributeMatch(name).maxWeight was added in Vespa-7.135.5. The value is  the maximum weight of the attribute keys matched in a weighted set attribute.

Free Lists for Attribute Multivalue Mapping

Since Vespa-7.141.8, multivalue attributes uses a free list to improve performance. This reduces CPU (no compaction jobs) and approximately 10% memory. This primarily benefits applications with a high update rate to such attributes.

Faster Updates for Out-of-Sync Documents

Vespa handles replica consistency using bucket checksums. Updating documents can be cheaper than putting a new document, due to less updates to posting lists. For updates to documents in inconsistent buckets, a GET-UPDATE is now used instead of a GET-PUT whenever the document to update is consistent across replicas. This is the common case when only a subset of the documents in the bucket are out of sync. This is useful for applications with high update rates, updating multi-value fields with large sets. Explore details here.

ZooKeeper 3.5.6

Vespa now uses Apache ZooKeeper 3.5.6 and can encrypt communication between ZooKeeper servers.

About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Improving Product Search with Learning to Rank – part one


Photo by Pawel Czerwinski on Unsplash

Over a few busy years, we have witnessed a neural ranking paradigm shift, where
pre-trained language models, such as BERT,
have revolutionized text ranking
in data-rich settings with lots of labeled data.
Feeding the data hungry neural networks lots of labeled relevance judgments
has proved more effective than statistical (tabular) text ranking feature-based methods.

However, historically, tree-based machine learning models
such as Gradient Boosting (GB)
with lambdarank loss have prevailed as state-of-the-art for product search ranking.
In this blog post series, we look at improving product search using learning to rank techniques,
with demonstrable results on the largest publicly available product ranking dataset.
We will train and measure their effectiveness on a large dataset, but
without getting into the details of complex loss functions.

In this first post, we introduce a product ranking dataset and establish multiple ranking baselines
that do not use the labeled relevance judgments in the dataset.
These ranking models are applied in a zero-shot setting, meaning they have not been trained using
domain-specific relevancy labels. The methods include traditional lexical
ranking like BM25 and other Vespa native text ranking features, semantic
off-the-shelf vector search models, and hybrid combinations of sparse and dense
methods. All of the ranking methods are represented end-to-end in Vespa.


In this series, we use a large shopping queries
dataset released by Amazon:

We introduce the “Shopping Queries Data Set”, a large dataset of complex search
queries released to foster research in semantic matching of queries and
products. For each query, the dataset provides a list of up to 40 potentially
relevant results, together with ESCI relevance judgments (Exact, Substitute,
Complement, Irrelevant) indicating the relevance of the product to the query.
Each query-product pair is accompanied by additional information. The dataset is
multilingual, containing queries in English, Japanese, and Spanish.

The dataset consists of a large number of questions, and where each query has,
on average, 20 graded query-item relevance judgments:

  • Exact (E): the item is relevant to the query and satisfies all the query
    specifications (e.g., water bottle matching all attributes of a query “plastic
    water bottle 24oz”, such as material and size)

  • Substitute (S): the item is somewhat relevant: it fails to fulfill some
    aspects of the query, but the item can be used as a functional substitute (e.g.,
    fleece for a “sweater” query)

  • Complement (C): the item does not fulfill the query but could be used in
    combination with an exact item (e.g., track pants for “running shoe” query)

  • Irrelevant (I): the item is irrelevant, or it fails to fulfill a central
    aspect of the query (e.g., socks for a “pant” query)

As with most ranking datasets, the dataset is split into a train and test part,
and models can be trained using the train split and evaluated on the test split.
In our work, we focus on the English queries in both splits.

English Train Split

The train split contains 20,888 queries with 419,653 query-item judgments. This
split can be used to train models, using learning to rank techniques. The train
set will be the focus of upcoming blog posts.

English Test Split

In this blog post we focus on the test split, and it contains 8,956 queries with
181,701 query-item judgments.

The judgment label distribution is as follows:

  • Exact (Relevant) 79,708 (44%)
  • Substitute (Somewhat Relevant) 8,099 (4%)
  • Complement 63,563 (35%)
  • Irrelevant 30,331 (17%)

Both splits have about 20 product judgments per query. As we can see from the label
distribution, the dataset has many Exact (relevant) judgments, especially
compared to Substitutes (somewhat relevant) and Irrelevant. On average, there
are about 9 relevant products for every query in the test split.

This product relevance dataset is the largest available product relevance
dataset and allows us to compare the effectiveness of multiple ranking models
using graded relevance judgments. A large dataset allows us to report
information retrieval ranking metrics and not anecdotical LGTM@10 (Looks Good To Me) metrics.

We can evaluate ranking models used in a
zero-shot setting without using the in-domain relevance labels and models
trained using learning to rank techniques, exploiting the relevance labels in the train split.

Product ranking example

The above image shows query #535 in the test split with a total of 16 product judgements.
The task is to optimize the ordering (ranking) so that products labeled as exact are ordered
before supplements and supplements before irrelevant.

Product ranking example
The above image shows perfect ranking of products for query #535.
Note that we have converted the textual labels (esci_label) to numeric labels using
E=4, S=3, C=2, I=1. If our ranking function is able to produce this ordering, we
would get a perfect score for this query.

Indexing the dataset with Vespa

We use the following Vespa document
schema, which allows us to index all the
products associated with English queries across both splits.
We use a utility script that converts the parquet product data file to a
Vespa JSON
formatted feed file. In total we index about 1.2M products in Vespa.

schema product {

    document product {

        field id type string {
            indexing: summary | index 

        field title type string {
            indexing: summary | index
            index: enable-bm25

        field description type string {
            indexing: summary | index
            index: enable-bm25
            weight: 200

        field bullets type string {
            indexing: summary | index
            index: enable-bm25
            weight: 200

        field brand type string {
            indexing: summary | index | attribute

        field color type string {
            indexing: summary | index | attribute


    field title_tokens type tensor<float>(d0[32]) {
        indexing: input title | embed tokenizer | attribute

    field description_tokens type tensor<float>(d0[128]) {
        indexing: input description | embed tokenizer | attribute

    field embedding type tensor<float>(d0[384]) {
        indexing {
            input title . input description | embed transformer | attribute | summary | index
        attribute {
            distance-metric: angular 
    fieldset default {
        fields: title, description, bullets, brand  

The product data contains title, description, brand, color and bullets. We use
Vespa’s support for encoding text into vector embeddings, in this case
we use the title and description as input.
See text embedding made simple
for details on how to embed text embedding models in Vespa.


The official dataset evaluation metric is
NDCG (Normalized
Discounted Cumulative Gain), a precision-oriented metric commonly used for
ranking datasets with graded relevance judgments. An important observation is
that the task only considers ranking of the products that have judgment
labels. In other words, we assume that a magic retriever has
retrieved all relevant documents (plus irrelevant documents), and our task is to
re-rank the products so that the relevant ones are pushed to the top of the
ranked list of products. This is a simpler task than implementing end-to-end
retrieval and ranking over the 1.2M products.

Ranking overview

Many deployed search systems need to hedge the deployment cost and deploy
multi-phased retrieval and ranking
pipelines to reduce computational complexity.
In a retrieval and ranking funnel, one or many diverse retrievers use a
cost-efficient retrieval ranking phase, and subsequent ranking phases re-rank
the documents, focusing on precision. In a later post in this series, we look at
how to train a retriever function using learning to retrieve and how to
represent retrieval and ranking phases in Vespa.

Baseline zero-shot ranking models

We propose and evaluate seven different baseline ranking models, all represented end-to-end in
Vespa using Vespa’s ranking framework.


Obviously not the greatest baseline, but allows us to compare other baselines with
random ranking. We use Vespa’s random

A pseudorandom number in the range [0,1] which is drawn once per document during rank evaluation.

rank-profile random inherits default {
    first-phase {
        expression: random 

BM25 – lexical ranking

BM25 is a tried and true
zero-shot text ranking model. BM25 only has two parameters and is the ranking
function many turn to when there are no explicit relevancy labels or implicit
click feedback to train ranking models. The BM25 scoring function can also be
accelerated for efficient retrieval using dynamic pruning algorithms like

rank-profile bm25 inherits default {
    first-phase {
        expression: bm25(title) + bm25(description) 

Note that we only use the title and description. All our lexical baselines
uses only these two fields.

Vespa nativeRank – lexical ranking

Vespa’s nativeRank is a
lexical ranking function and the default text ranking function in Vespa. It has
similar characteristics as BM25 but also considers the proximity between matched
terms in the document. Unlike BM25, which has an unbound score range, Vespa’s
nativeRank is normalized to a score in the range [0,1].

rank-profile nativeRank inherits default {
    first-phase {
        expression: nativeRank(title) + nativeRank(description) 

Vespa fieldMatch – lexical ranking

Vespa fieldMatch
is another lexical ranking function that is more sophisticated than Vespa
nativeRank but also computationally expensive and unsuitable as a retrieval
function. fieldMatch(field) or fieldMatch sub-features such as fieldMatch(field).proximity is
typically used as re-ranking features.

rank-profile fieldMatch inherits default {
    first-phase {
        expression: fieldMatch(title) + fieldMatch(description) 

Vespa semantic ranking using vector similarity – dense retrieval

Using an off-the-shelf dense vector model from
Huggingface, we
encode documents and queries into a 384-dimensional dense vector space. We use
cosine similarity between the query and the document in this shared embedding
vector space as our relevancy measure. The vector embedding model,
sentence-transformers/all-MiniLM-L6-v2, is trained
on a large number of sentence-length texts.

Semantic similarity search over dense representations can be accelerated using
Vespa‘s support for approximate nearest neighbor search, making it usable for
efficient retrieval. We use Vespa’s support for embedding dense embedding models in this work.
See text embeddings made simple.

rank-profile semantic inherits default {
    inputs {
        query(query_embedding) tensor<float>(d0[384])
    first-phase {
        expression: closeness(field, embedding)

Vespa semantic ranking using vector similarity – dense retrieval

Another model that can be used for vectorization,
bert-base-uncased, 768-dimensional.
This model has not been trained for semantic similarity, or semantic search. We
use the pre-trained language model directly. As with the previous semantic
model, we use cosine similarity between the query and the document as our relevancy measure.

rank-profile semantic inherits default {
    inputs {
        query(bert_query_embedding) tensor<float>(d0[768])
    first-phase {
        expression: closeness(field, bert_embedding)

Vespa hybrid – combining lexical and semantic ranking

In this baseline model, we combine the semantic similarity score of the
sentence-transformers/all-MiniLM-L6-v2 model with the lexical score computed by
Vespa’s nativeRank. We use a simple linear combination of the two ranking
models. Since the nativeRank lexical feature is normalized in the range [0,1],
combining it with semantic similarity is more straightforward as we don’t need
to normalize unbound lexical scoring functions, such as BM25.
We use a query(alpha) parameter to control how each method influences the overall score.

rank-profile hybrid inherits default {
    inputs {
        query(alpha): 0.5
        query(query_embedding) tensor<float>(d0[384])
    first-phase {
        expression: query(alpha)*closeness(field, embedding) + (1 - query(alpha))*(nativeRank(title) + nativeRank(description))

Zero-shot Baseline Results

Ranking result

We present the results of our evaluation in the above figure. We execute each
query in the English test split (8.9K queries) and ask Vespa to rank the
documents that we have labels for, using the Vespa search API’s recall

Sets a recall parameter to be combined with the query. This is identical to filter, except that recall terms are not exposed to the ranking framework and thus not ranked.

We use trec_eval utility to compute
NDCG, using the relevance judgments. The reported NDCG is the average NDCG
calculated across all 8,956 queries in the test split. These are unique queries
and we don’t know their frequency distribution on Amazon, but in a real-world
evaluation we would have to take into account the query frequency as well when
evaluating the model, for example using weighting so that popular (head) queries weights more
than rare (tail) queries. In other words, we don’t want to introduce a ranking model that on average improves
NDCG, but hurts many head queries.

Notice that the random ranking function produces a high NDCG score, this is because there are
many exact (relevant) judgements per query. On

Improving Product Search with Learning to Rank – part two


Photo by Carl Campbell
on Unsplash

In the first post,
we introduced a large product ranking dataset and established multiple zero-shot
ranking baselines that did not use the labeled relevance judgments in the
dataset. In this post, we start exploiting the labeled relevance judgments to
train deep neural ranking models. We evaluate these models on the test set to
see if we can increase the NDCG ranking metric.

Dataset Splits

The English train split contains 20,888 queries with 419,653 query-item
judgments. To avoid training and tuning models based on observed test set
accuracy, we split the official train set into two new splits; train and dev. In
practice, we pick, at random, 20% of the queries from the train as our dev split
and retain the remaining 80% as train queries.

Since we plan to train multiple models, we can use the performance on the dev
set when training ensemble models and then hide the dev set during the training.

Train Original: 20,888 queries with 419,653 judgments

  • Our Train 16,711 queries with 335,674 assessments
  • Our Dev 4,177 queries with 83,979 assessments

We do this dataset split once, so all our models are trained with the same train
and dev split.

ranking model

Our objective is to train a ranking model f(query,product) that takes a query
and product pair and outputs a relevance score. We aim to optimize the NDCG
ranking metric after sorting the products for a query by this score.

The ranking process is illustrated for a single query in the above figure. We
are asked to rank four products for a query with model f. The model scores each
pair and emits a relevance score. After sorting the products by this score, we
obtain the ranking of the products for the query. Once we have the ranked list,
we can calculate the NDCG metric using the labeled query and product data. The
overall effectiveness of the ranking model is given by computing the average
NDCG ranking score for all queries in the test dataset.

There are many ways to train a ranking model f(query, product), and in this
post, we introduce and evaluate two neural ranking methods based on pre-trained
language models.

Neural Cross-Encoder using pre-trained language models

One classic way to use Transformers for ranking is cross-encoding, where both
the query and the document are fed into the Transformer model simultaneously.

The simultaneous input of the query and document makes cross-encoders unsuitable
for cost-efficient retrieval. In this work, we are relatively lucky because we
already have a magic retriever, and our task is to re-rank a small set (average
20) of products per query.

cross-encoder model

One approach for training cross-encoders for text ranking is to convert the
ranking problem to a regression or classification problem. Then, one can train
the model by minimizing the error (loss) between the model’s predicted and true
label. A popular loss function for regression and classification problems is MSE
(Mean Squared Error) loss. MSE loss is a pointwise loss function, operating on a
single (query, document) point when calculating the error between the actual and
predicted label. In other words, the pointwise loss function does not consider the
ranking order of multiple (query, document) points for the same query. In the next
blog post in this series we will look at listwise loss functions.

With graded relevance labels, we can choose between converting the ranking
problem to a regression (graded relevance) or binary classification
(relevant/irrelevant) problem. With binary classification, we need to set a
threshold on the graded relevance labels. For example, let Exact and Complement
map to the value 1 (relevant) and the remaining labels as 0 (irrelevant).
Alternatively, we can use a map from the relevancy label to a numeric label

Exact => 1.0, Substitute => 0.1, Complement => 0.01, Irrelevant => 0

With this label mapping heuristic, we “tell” the model that Exact should be 10x as
important as Substitute and so forth. In other words, we want the model to output a score close
to 1 if the product is relevant and a score close to 0.1 if the product is labeled as a substitute.

training cross-encoder model

The above figure illustrates the training process; we initialize our model using
a pre-trained Transformer. Next, we prepare batches of labeled data of the form
<query, document, label>. This data must be pre-processed into
a format that the model understands, mapping free text to token ids. As part of
this process, we convert the textual graded label representation to a numeric
representation using the earlier label gain map. For each batch, the model makes
predictions, and the prediction errors are measured using the loss function. The
model weights are then adjusted to minimize the loss, one batch at a time.

Once we have trained the model, we can save the model weights and import the
model into Vespa for inference and ranking. More on how to represent
cross-encoder language models in Vespa will be in a later section in this blog

Choosing the Transformer model and model inputs

There are several pre-trained language models based on the Transformer model
architecture. Since we operate on English queries and products, we don’t need
multilingual capabilities. The following models use the English vocabulary and
wordpiece tokenizer.

  • bert-base-uncased – About 110M parameters
  • MiniLM Distilled model using bert-base-uncased as a teacher, about 22M parameters

Choosing a model is a tradeoff between deployment cost and accuracy. A larger
model (with more parameters) will take longer to train and have higher
deployment costs but with potentially higher accuracy. In our work, we base our
model on weights from a MiniLM
model trained on
relevancy labels from the MS Marco

We have worked with the MiniLM models before in our work on MS Marco passage
so we know these models provide a reasonable tradeoff between ranking
accuracy and inference efficiency, which impacts deployment cost. In addition,
MiniML is a CPU-friendly model, especially with quantized versions.
As a result, we are avoiding expensive GPU instances and GPU-related failure modes.

In addition to choosing the pre-trained Transformer model, we must decide which
product fields and the order we will input them into the model. Order matters,
and the Transformer model architecture has quadratic computational complexity
with the input length, so the size of the text inputs significantly impacts
inference-related costs.

Some alternative input permutations for our product dataset are:

  • query, product title,
  • query, product title, brand,
  • query, product title, brand, product description,
  • query, product title, product description, product bullets.

The product field order matters because the total input length is limited to a
maximum of 512 tokens, and shorter input reduces inference costs significantly.
Since all products in the dataset have a title, we used only the title as input,
avoiding complexity related to concatenating input fields and handling missing
fields. Furthermore, product titles are generally longer than, for example,
Wikipedia or web page titles.

We trained the model for two epochs, Machine Learning (ML) terminology for two
complete iteration over our training examples. Training the model on a free-tier
Google Colab GPU takes less than 30 minutes.

Representing cross-encoders in Vespa

Once we have trained our cross-model using the sentence-transformer
cross-encoder training library, we need to import it to Vespa for inference and
ranking. Vespa supports importing models saved in the ONNX
model format and where inference is accelerated using
We export the PyTorch Transformer model representation to ONNX format using an
ONNX export utility script. Then we can drop the ONNX file into the Vespa
application package and configure how to use it in the product schema:

Model declaration

document product {
    field title type string {..} 
field title_tokens type tensor(d0[128]) {
    indexing: input title | embed tokenizer | attribute | summary
onnx-model title_cross {
    file: models/title_ranker.onnx
    input input_ids: title_input_ids
    input attention_mask: title_attention_mask
    input token_type_ids: title_token_type_ids

The above declares the title_tokens field where we store the subword tokens
and the onnx-model with the file name in the application package and its inputs.
The title_tokens are declared outside of the document as these fields are populated
during indexing.

Using the model in rank-profile

rank-profile cross-title inherits default {
    inputs {
        query(query_tokens) tensor(d0[32])

    function title_input_ids() {
        expression: tokenInputIds(96, query(query_tokens), attribute(title_tokens))

    function title_token_type_ids() {
        expression: tokenTypeIds(96, query(query_tokens), attribute(title_tokens))

    function title_attention_mask() {
        expression: tokenAttentionMask(96, query(query_tokens), attribute(title_tokens)) 

    function cross() {
        expression: onnx(title_cross)({d0:0,d1:0} 
    first-phase {
        expression: cross() 

This defines the rank-profile with the function inputs and where to find the inputs. We
cap the sequence at 96 tokens, including special tokens such as CLS and SEP.
The onnx(title_cross)({d0:0,d1:0} invokes the model with the inputs and slices
the batch dimension (d0) and reads the predicted score.

Vespa provides convenience tensor functions to calculate the three inputs to the
Transformer models; tokenInputIds, tokenTypeIds, and tokenAttentionMask.
We just need to provide the query tokens and field from which we can read the document
tokens. To map text to token_ids using the fixed vocabulary associated with the model,
we use Vespa’s support for wordpiece embedding.
This avoids document side tokenization at inference time, and we can just read the product tokens
during ranking. See the full product schema.

Neural Bi-Encoder using pre-trained language models

Another popular approach for neural ranking is the two-tower or bi-encoder
approach. This approach encodes queries and documents independently.

For text ranking, queries and documents are encoded into a dense embedding
vector space using one or two Transformer models. Most methods use the same
Transformer instance, but it’s possible to decouple them. For example, suppose
both towers map documents and queries to the same vector dimensionality. In that
case, one could choose a smaller Transformer model for the online query encoder
versus the document tower to reduce latency and deployment costs.

The model learns the embedding vector representation from the labeled training
examples. After training the model, one can compute the embeddings for all
products, index them in Vespa, and use a nearest-neighbor search algorithm at
query time for efficient document retrieval or ranking. In this work, Amazon has
provided the “retrieved” products for all queries, so we don’t need HNSW
indexing for efficient
approximate nearest neighbor search.

cross-encoder model

The above illustrates the bi-encoder approach. This illustration uses the
Transformer CLS (classification token) output and uses that as the single dense
vector representation. Another common output pooling strategy is calculating the
mean over all the token vector outputs. Other representation methods use
multiple vectors to represent queries and documents. One example is the ColBERT
model described in this blog

training cross-encoder model

The above illustrates training a bi-encoder for semantic similarity. Training
bi-encoders for semantic similarity or semantic search/retrieval/ranking has
more options than training cross-encoders.

As with the cross-encoder, we must select a pre-trained model, minding size and

The model size impacts quality, inference speed, and vector representation
dimensionality. For example, bert-base-uncased uses 768 dimensions, while MiniLM
uses 384. Lower dimensionality lowers the computational complexity of the
similarity calculations and reduces the embedding storage footprint.

Furthermore, how we batch the input training data, the vector similarity
function, and, most importantly, the loss function determines the quality of the
model on the given task. While loss functions are out of the scope of this blog
post, we would like to mention that training a bi-encoder for semantic
retrieval over a large corpus, which “sees” many irrelevant documents, needs a
different loss function than a bi-encoder model used for re-ranking.

Like with the cross-encoder, we must decide what product fields we encode.
Instead of inputting multiple product fields, we train two bi-encoder models,
one that uses the product title and another that encodes the description.
Both models are based on

Representing bi-encoders in Vespa

A single Vespa document schema can contain multiple document embedding fields,

document product {
    field title type string {}
    field description type string {}
field title_embedding type tensor(d0[384]) {
    indexing: input title | embed title | attribute | summary
    attribute {
        distance-metric: angular 
field description_embedding type tensor(d0[384]) {
    indexing: input description | embed description | attribute 

Improving Product Search with Learning to Rank – part three


Photo by Niels Weiss
on Unsplash

We introduced a large product ranking dataset in the first
post in
this series and established multiple zero-shot ranking baselines.
In the second
we trained ranking models using pre-trained language models and
evaluated them using the NDCG ranking metric.

In this post, we dive deep into a popular method for learning to
rank; gradient-boosting decision trees
(GBDT). Vespa has
native support for evaluating GBDT models and imports models trained
with popular GBDT frameworks such as
XGBoost and


Our objective is to train a ranking model f(query, product) that
takes a query and product pair as input and which outputs a relevance
score. We aim to optimize the NDCG metric after sorting the products
by this score. There are many ways to train a ranking model f(query,
, and in this post, we introduce and evaluate tree-based
models. Tree-based GBDT models, implemented by popular frameworks
like XgBoost and
LightGBM, excel on tabular
(structured) features and handle feature mixes and feature value
ranges without any ceremony.

In our series, we only aim to optimize relevance (NDCG); still,
E-commerce ranking is a multi-objective optimization
where shops want to maximize relevance and revenue. GBDT models
are famous for multi-objective ranking optimization as they handle
a mix of features, combining normalized features (e.g., vector
similarity), unnormalized unbound features (e.g.,
BM25), and “business”
features such as product sales margin. GBDT is also an excellent
way to explore semantic search signals and integrate them into an
existing product ranking function without wasting years of ranking
model development.

Multi-objective ranking

The above figure illustrates a multi-objective ranking optimization
problem where we want to maximize relevance and revenue. One way
to approach this problem is to train the model with a modified
label, an aggregated weighted combination of the two conflicting
optimization objectives.

For organizations, prediction explainability is essential when
applying Machine Learning (ML) to ranking problems, and GBDT
predictions are more straightforward to interpret than neural
predictions. Finally, GBDT is a relatively simple algorithm, meaning
we can train on more data than deep neural networks (for the same
compute budget). For example, training on our product ranking
dataset takes a few seconds on a laptop with CPU-only, while our
neural methods from the second
took hours with GPU acceleration.

Model Features

Tree-based learning methods require tabular (structured) features.
Therefore, we convert the unstructured text ranking dataset to a
tabular dataset using simple feature engineering, where Vespa
calculates all tabular features.

Broadly we have four classes of features used for ranking problems.
Examples in this section do not necessarily map to the Amazon product
ranking dataset used in this blog post series.

Contextual query features

These are features that do not depend on the document. For example,
it could be the user’s age, the time of the day, or a list of
previously visited products — generally, query time contextual
features and the user’s query terms. Query classification techniques
using language models also fall under this feature category, for
example, using Vespa’s support for stateless model

Document features

These features are independent of the query. In E-commerce, for
example, this might be the product’s popularity, price, sale margin,
or user review rating score(s).

Online match features (query * document)

An example of this could be text matching features, such as
bm25(title), tensor computation
like a sparse dot product between the user profile interests and product category, or semantic vector
similarity, as demonstrated in the previous post in this series.
See also tensor computation examples.

Aggregate features

These features aggregate across documents or queries. The computing
of aggregated features is usually performed outside of Vespa, using
near real-time stream processing. The output of these steam processing
jobs are usually tensors
which are stored and updated at scale with Vespa. Aggregate features
can help cold-start problems when indexing a new document that lacks
document features such as sales rank or popularity. For example,
having a “category” popularity score could help rank new products
added to the catalog. For example, the Yahoo homepage recommendation
system uses topical click-through rate (CTR) using Vespa global

Representing features in Vespa

Vespa’s ranking framework has rich built-in
Still, developers can easily express domain-specific features by
combining Vespa’s ranking
language with its built-in features.

Many ranking architectures in the wild use external feature stores.
In these serving architectures, a middle-tier component retrieves
a small pool of documents from an indexing service, for example,
Elasticsearch. After retrieving documents, the middle-tier fetches
features from the feature store and inputs these to the model.

Vespa allows representing a richer set of features than other
traditional indexing services built on Apache Lucene. Furthermore,
since Vespa stores tensor features in memory and memory bandwidth
are much higher than network bandwidth, one can rank a much larger
candidate pool with Vespa than with external feature stores.

Vespa attributes and
tensors support
in-place partial updates,
and developers can update a document and aggregated features with
high throughput (up to 75K partial updates/s per float field per
node with low CPU resource consumption). Genuine partial update
support is a feature that differentiates Vespa compared to search
engines built on Apache Lucene, where partial updates trigger a
full re-indexing of the entire document (get-apply-write pattern).

Gathering features from Vespa

In previous posts, we introduced Vespa
rank-features and
We can define custom features and feature computations using the
Vespa rank expressions language with function support.

rank-profile features inherits default {
        inputs {
            query(query_tokens) tensor<float>(d0[32])
            query(q_title) tensor<float>(d0[384])
        function bi_encoder() {
            expression: closeness(field, title_embedding)

        function max_title_significance() {
            expression: foreach(terms, N, term(N).significance, ">0.5", max)

        function mean_title_significance() {
            expression: foreach(terms, N, term(N).significance, ">0.5", average)

        first-phase {
            expression: random

        match-features {

The above features rank-profile defines a set of custom features using function with rank expressions.
The match-features block defines the set of features that are returned with each ranked document.
See full schema on GitHub.

For explicitly labeled examples like in our dataset, we can ask
Vespa to compute and return the feature scores using a combination
of the Vespa recall
parameter and match-features or
The recall query
parameter allows developers to request Vespa to calculate features
for query document pairs without impacting other features like
queryTermCount or text matching features.


Query-product feature scraping illustration. We request that Vespa
retrieve only the labeled products for each query in the train
split. This feature scraping approach is used for explicit relevance
judgment labels.


The above figure illustrates feature logging in production, and
this process can be used to gather implicit relevance feedback using
clicks and skips.

Feature logging infrastructure is fundamental for scaling machine
learning (ML) applications beyond expensive labeled data.
With Vespa, developers can compute and log new features that the
current production model does not use. Vespa can log the query and
the list of ranked documents with the calculated features. Note:
Using click models to generate unbiased pseudo-relevance labels for
model training is out of the scope of this blog post.

Generally, one wants to train on the features as scraped or logged.
Computing features outside of Vespa for training might cause feature
drift between training and inference in Vespa, where the offline
feature computations differ from the Vespa implementation.

For our GBDT model training, we use built-in Vespa text-matching
features for our product ranking datasets, such as
nativeRank, and
We also include a few handcrafted function features and the neural
scores from the neural bi-encoder and cross-encoder introduced in
the previous
The scrape routine fetches the computed feature values from a running
Vespa instance and merges the features into a pandas DataFrame,
along with the labels and query ids.


The above figure displays a random sample of tabular feature data.
The DataFrame column names map 1:1 to Vespa feature names, native
built-in features, or our user-defined functions, for example,
bi_encoder and cross_encoder.

Gradient Boosting Decision Trees (GBDT)

This blog post won’t dive deep into how the gradient boosting
algorithm works or ranking loss calculations. Instead, for our
purposes, on a very high level, it suffices to say that each training
iteration builds a decision tree, and each new decision tree tries
to correct the errors from the previously generated decision trees.

The final model after k training iterations has k decision trees,
and the model inference is performed by traversing the decision
trees and summing scores in the leaf nodes of the decision trees.
GBDT supports different learning objectives and loss functions:

  • Classification with pointwise loss
  • Regression with pointwise loss
  • Ranking with listwise or pairwise ranking loss

There are two popular implementations of the GBDT algorithm; XGboost,
and LightGBM; Vespa supports importing models from both frameworks.
Vespa does not use the framework inference implementation but
converts the models into Vespa’s ranking framework for accelerated
GBDT inference. At Yahoo, Vespa has been used for ranking with
GBDT models at
long before the XGBoost and LightGBM implementations.
Since the different framework models are imported into the same unified
framework in Vespa, combining them into an ensemble of models from
both frameworks is easy.


The above figure shows a single decision tree from a forest of
decision trees. Each gradient-boosting learning iteration generates
a new decision tree, and the final model is a forest of decision

The decision tree in the illustration is two levels deep and has
four leaf nodes. This simplified decision tree can only output four
different scores (0.04, 0.3, -0.9, or 0.4). Each non-leaf node has
a decision (split condition) that compares a feature with a numeric
value. The tree traversal starts from the tree’s root, and the path
determines the leaf score. The final score is the sum of all the
trees in the forest.

Training GBDT models

Once we have scraped the Vespa computed features for our product
ranking train data, we can start to train models. Unfortunately,
it’s easy to overfit the training data with GBDT, and hyperparameters
impact generalization on unseen data (test data). The previous
explained how to split the training data into a new train and dev
split. Now we can use that dev split as the validation dataset used
during (GBDT) training to check generalization and tune
The training job terminates when observed dev accuracy (NDCG) stops
improving. Early termination avoids overfitting on the train set
and saves resources during training and inference.


The GBDT feature importance plot enables feature ablation studies
where we can remove low-importance features and re-train the model
to observe the NDCG impact on the dev set. Feature importance plots
allow us to reduce computational complexity and deployment costs
by removing features. We make these reductions to meet a specific
latency service level agreement (SLA).


In the feature importance illustration above, we have trained a
model using only five simple features that are computationally

GBDT models and Vespa serving performance

Training an ML model is only worthwhile if we can deploy it to
production. The total computational complexity of evaluating a GBDT
model for ranking in Vespa depends mainly on three factors.

The number of documents exposed to the GBDT model

The number of documents exposed to the model is the most significant
parameter. Computational cost is linear with the number of documents
ranked using the model. Vespa allows multiple ranking
phases to apply the
most computationally complex model to the top-scoring documents
from the previous ranking phase. Search ranking and model inference
with GBDT are highly parallelizable. Vespa’s ability to use multiple
threads per query reduces serving latency.

Features and feature complexity

The number of features and the computational complexity of the
features dramatically impact performance. For example, calculating
a matching feature such as bm25(title) is low complexity
compared to fieldMatch(title), which again is relatively low compared
to inference with the previous

The benefit Vespa delivers is that it allows developers
to understand relative feature complexity using rank
The trace below shows the relative feature complexity.