Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th

Hi Vespa Community,

If you are in Seattle on November 29th, please join Jon Bratseth (Distinguished Architect, Oath) at a machine learning meetup hosted by Zillow. Jon will share a Vespa overview and answer any questions about Oath’s open source big data serving engine. Eric Ringger (Director of Machine Learning for Personalization, Zillow) will discuss some of the models used to help users find homes, including collaborative filtering, a content-based model, and deep learning.

Learn more and RSVP here.

Hope you can join!

The Vespa Team

Serving article comments using reinforcement learning of a neural net

Don’t look at the comments. When you allow users to make comments on your content pages you face the problem that not all of them are worth showing — a difficult problem to solve, hence the saying. In this article I’ll show how this problem has been attacked using reinforcement learning at serving time on Yahoo content sites, using the Vespa open source platform to create a scalable production solution.

Yahoo properties such as Yahoo Finance, News and Sports allow users to comment on the articles, similar to many other apps and websites. To support this the team needed a system that can add, find, count and serve comments at scale in real time. Not all comments are equally as interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting.

Here I’ll explain how this problem was solved for Yahoo properties by using Vespa — the open source big data serving engine. I’ll start with the basics and then show how comment selection using a neural net and reinforcement learning was implemented.

As mentioned, the team needed a system that can add, find, count, and serve comments at scale in real time. The team chose Vespa, the open big data serving engine for this, as it supports both such basic serving as well as incorporating machine learning at serving time (which we’ll get to below). By storing each comment as a separate document in Vespa, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself, the team could issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article:


In addition, this document structure allowed less-used operations such as showing all the articles of a given user and similar.

The Vespa instance used at Yahoo for this store about a billion comments at any time, serve about 12.000 queries per second, and about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: A single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes and executing the distributed part of queries in parallel. In total, 32 stateless and 96 stateful nodes are spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6–12 shards depending on the traffic patterns of that region.

Some articles on Yahoo pages have a very large number of comments — up to hundreds of thousands are not uncommon, and no user is going to read all of them. Therefore it is necessary to pick the best comments to show each time someone views an article. Vespa does this by finding all the comments for the article, computing a score for each, and picking the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution. This allows executing these queries with low latency and ensures that more comments can be handled by adding more content nodes, without causing an increase in latency.

The input to the ranking function is features which are typically stored in the document (here: a comment) or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, the system keeps track of the reputation of each comment author as a feature.

User actions are sent as update operations to Vespa as they are performed. The information about authors is also continuously changing, but since each author can write many comments it would be wasteful to have to update each comment every time there is new information about the author.
Instead, the author information is stored in a separate document type — one document per author,
and a document reference in Vespa is used to import that author feature into each comment.
This allows updating the author information once and have it automatically take effect for all comments by that author.

With these features, it’s possible in Vespa to configure a mathematical function as a ranking expression which computes the rank score or each comment to produce a ranked list of the top comments, like the following:


Using a neural net and reinforcement learning

The team used to rank comments with a handwritten ranking expression having hardcoded weighting of the features. This is a good way to get started but obviously not optimal. To improve it they needed to decide on a measurable target and use machine learning to optimize towards it.

The ultimate goal is for users to find the comments interesting. This can not be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time the users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc). The team knew they wanted user interest to go up on average, but there is no way to know what the correct value of the measure of interest might be for any single given list of comments. Therefore it’s hard to create a training set of interest signals for articles (supervised learning), so reinforcement learning was chosen instead: Let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal used as a proxy for user interest, and use this to converge on a model that increases it.

The model chosen here was a neural net with multiple hidden layers, roughly illustrated as follows:


The advantage of using a neural net compared to a simple function such as linear regression is that it can capture non-linear relationships in the feature data without anyone having to guess which relationship exists and hand-write functions to capture them (feature engineering).

To explore the space of possible rankings, the team implemented a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. They logged the ranking information and user interest signals such as dwell time to their Hadoop grid where they are joined. This generates a training set each hour which is used to retrain the model using TensorFlow-on-Spark, which produces a new model for the next iteration of the reinforcement learning cycle.

To implement this on Vespa, the team configured the neural net as the ranking function for comments. This was done as a manually written ranking function over tensors in a rank profile. Here is the production configuration used:

rank-profile neuralNet {
    function get_model_weights(field) {
        expression: if(query(field) == 0, constant(field), query(field))
    function layer_0() { # returns tensor(hidden[9])     
        expression: elu(xw_plus_b(nn_input, get_model_weights(W_0), get_model_weights(b_0), x))   
    function layer_1() { # returns tensor(out[9])
        expression: elu(xw_plus_b(layer_0 get_model_weights(W_1), get_model_weights(b_1), hidden))   
    # xw_plus_b returns tensor(out[1]), so sum converts to double   
    function layer_out() {
        expression: sum(xw_plus_b(layer_1, get_model_weights(W_out), get_model_weights(b_out), out))   
    first-phase {     
        expression: freshnessRank   
    second-phase {
        expression: layer_out
        rerank-count: 2000   

More recently Vespa added support for deploying TensorFlow SavedModels directly (as well as similar support for tools saving in the ONNX format), which would also be a good option here since the training happens in TensorFlow.

Neural nets have a pair of weight and bias tensors for each layer, which is what the team wanted the training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, with reinforcement learning it is necessary to be able to update these tensor parameters frequently. This could be achieved by redeploying the application package frequently, as Vespa allows that to be done without restarts or disruption to ongoing queries. However, it is still a somewhat heavy-weight process, so another approach was chosen: Store the neural net parameters as tensors in a separate document type in Vespa, and create a Searcher component which looks up this document on each incoming query, and adds the parameter tensors to it before it’s passed to the content nodes for evaluation.

Here is the full production code needed to accomplish this serving-time operation:

import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {
    private static final String VESPA_ID_FORMAT = "id:canvass_search:rankingmodel::%s";
    private static final String FEATURE_FORMAT = "query(%s)";

    /** To fetch model documents from Vespa index */
    private final SyncSession fetchDocumentSession;
    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession = DocumentAccess.createDefault().createSyncSession(new SyncParameters.Builder().build());

    public Result search(Query query, Execution execution) {
        // Fetch model document from Vespa
        String id = String.format(VESPA_ID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(id));
        // Add it to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query)

    private static void addTensorFromDocumentToQuery(String field, FieldValue value, Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(FEATURE_FORMAT, field), tensor);
The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net (where each field below is configured with “indexing: attributesummary”):
document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … } 
    field W_1 type tensor(hidden[9],out[9]) { … } 
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … } 
    field b_out type tensor(out[1]) { … } 

Since updating documents is a lightweight operation it is now possible to make frequent changes to the neural net to implement the reinforcement learning process.


Switching to the neural net model with reinforcement learning has already led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms since the neural net model is more expensive. The response time stays low because in Vespa the neural net is evaluated on all the content nodes (partitions) in parallel. This avoids the bottleneck of sending the data for each comment to be evaluated over the network and allows increasing parallelization indefinitely by adding more content nodes.

However, evaluating the neural net for all comments for outlier articles which have hundreds of thousands of comments would still be very costly. If you read the rank profile configuration shown above, you’ll have noticed the solution to this: Two-phase ranking was used where the comments are first selected by a cheap rank function (termed freshnessRank) and the highest scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.

Conclusion and future work

In this article I have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa this can be accomplished relatively easily with a standard open source technology.

The team working on this plan to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

Vespa Product Updates, May 2019: Deploy Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

In last month’s Vespa update, we mentioned Tensor updates, Query tracing and coverage. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to evolve.

For May, we’re excited to share the following feature updates with you:

Multithreaded disk index fusion

Content nodes are now able to sustain a higher feed rate by using multiple threads for disk index fusion. Read more.

Feeding improvements

Cluster-internal communications are now multithreaded out of the box, for  high throughput feeding operations. This fully utilizes a 10 Gbps network and improves utilization of high-CPU content nodes.

Ideal state optimizations

Whenever the content cluster state changes, the ideal state is calculated. This is now optimized (faster and runs less often) and state transitions like node up/down will have less impact on read and write operations. Learn more in the dynamic data distribution documentation.

Download ML models during deploy

One procedure for using/importing ML models to Vespa is to put them in the application package in the models directory. Applications where models are trained frequently in some external system can refer to the model by URL rather than including it in the application package. This use case is now documented in deploying remote models, and solves the challenge of deploying huge models.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Learning to Rank with Vespa – Getting started with Text Search have just published two tutorials to help people to get started with text search applications by building scalable solutions with Vespa.
The tutorials were based on the full document ranking task
released by Microsoft’s MS MARCO dataset’s team.

The first tutorial helps you to create and deploy a basic text search application with Vespa as well as to download, parse and feed the dataset to a running Vespa instance. They also show how easy it is to experiment with ranking functions based on built-in ranking features available in Vespa.

The second tutorial shows how to create a training dataset containing Vespa ranking features that allow you to start training ML models to improve the app’s ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context.

Both tutorials are detailed and come with code available to reproduce the steps. Here are the highlights.

Basic text search app in a nutshell

The main task when creating a basic app with Vespa is to write a search definition file containing information about the data you want to feed to the application and how Vespa should match and order the results returned in response to a query.

Apart from some additional details described in the tutorial, the search definition for our text search engine looks like the code snippet below. We have a title and body field containing information about the documents available to be searched. The fieldset keyword indicates that our query will match documents by searching query words in both title and body fields. Finally, we have defined two rank-profile, which controls how the matched documents will be ranked. The default rank-profile uses nativeRank, which is one of many built-in rank features available in Vespa. The bm25 rank-profile uses the widely known BM25 rank feature.

search msmarco { 
    document msmarco {
        field title type string
        field body type string 
    fieldset default {
        fields: title, body
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body)
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body)

When we have more than one rank-profile defined, we can chose which one to use at query time, by including the ranking parameter in the query:

curl -s "<URL>/search/?query=what+is+dad+bod"
curl -s "<URL>/search/?query=what+is+dad+bod&ranking=bm25"

The first query above does not specify the ranking parameter and will therefore use the default rank-profile. The second query explicitly asks for the bm25 rank-profile to be used instead.

Having multiple rank-profiles allow us to experiment with different ranking functions. There is one relevant document for each query in the MSMARCO dataset. The figure below is the result of an evaluation script that sent more than 5.000 queries to our application and asked for results using both rank-profiles described above. We then tracked the position of the relevant document for each query and plotted the distribution for the first 10 positions.

It is clear that the bm25 rank-profile does a much better job in this case. It places the relevant document in the first positions much more often than the default rank-profile.

Data collection sanity check

After setting up a basic application, we likely want to collect rank feature data to help improve our ranking functions. Vespa allow us to return rank features along with query results, which enable us to create training datasets that combine relevance information with search engine rank information.

There are different ways to create a training dataset in this case. Because of this, we believe it is a good idea to have a sanity check established before we start to collect the dataset. The goal of such sanity check is to increase the likelihood that we catch bugs early and create datasets containing the right information associated with our task of improving ranking functions.

Our proposal is to use the dataset to train a model using the same features and functional form used by the baseline you want to improve upon. If the dataset is well built and contains useful information about the task you are interested you should be able to get results at least as good as the one obtained by your baseline on a separate test set.

Since our baseline in this case is the bm25 rank-profile, we should fit a linear model containing only the bm25 features:

a + b * bm25(title) + c * bm25(body)

Having this simple procedure in place helped us catch a few silly bugs in our data collection code and got us in the right track faster than would happen otherwise. Having bugs on your data is hard to catch when you begin experimenting with complex models as we never know if the bug comes from the data or the model. So this is a practice we highly recommend.

How to create a training dataset with Vespa

Asking Vespa to return ranking features in the result set is as simple as setting the ranking.listFeatures parameter to true in the request. Below is the body of a POST request that specify the query in YQL format and enable the rank features dumping.

body = {
    "yql": 'select * from sources * where (userInput(@userQuery));',
    "userQuery": "what is dad bod",
    "ranking": {"profile": "bm25", "listFeatures": "true"},

Vespa returns a bunch of ranking features by default, but we can explicitly define which features we want by creating a rank-profile and ask it to ignore-default-rank-features and list the features we want by using the rank-features keyword, as shown below. The random first phase will be used when sampling random documents to serve as a proxy to non-relevant documents.

rank-profile collect_rank_features inherits default {

    first-phase {
        expression: random


    rank-features {


We want a dataset that will help train models that will generalize well when running on a Vespa instance. This implies that we are only interested in collecting documents that are matched by the query because those are the documents that would be presented to the first-phase model in a production environment. Here is the data collection logic:

hits = get_relevant_hit(query, rank_profile, relevant_id)
if relevant_hit:
    hits.extend(get_random_hits(query, rank_profile, n_samples))
    data = annotate_data(hits, query_id, relevant_id)
    append_data(file, data)

For each query, we first send a request to Vespa to get the relevant document associated with the query. If the relevant document is matched by the query, Vespa will return it and we will expand the number of documents associated with the query by sending a second request to Vespa. The second request asks Vespa to return a number of random documents sampled from the set of documents that were matched by the query.

We then parse the hits returned by Vespa and organize the data into a tabular form containing the rank features and the binary variable indicating if the query-document pair is relevant or not. At the end we have a dataset with the following format. More details can be found in our second tutorial.

Beyond pointwise loss functions

The most straightforward way to train the linear model suggested in our data collection sanity check would be to use a vanilla logistic regression, since our target variable relevant is binary. The most commonly used loss function in this case (binary cross-entropy) is referred to as a pointwise loss function in the LTR literature, as it does not take the relative order of documents into account.

However, as we described in our first tutorial, the metric that we want to optimize in this case is the Mean Reciprocal Rank (MRR). The MRR is affected by the relative order of the relevance we assign to the list of documents generated by a query and not by their absolute magnitudes. This disconnect between the characteristics of the loss function and the metric of interest might lead to suboptimal results.

For ranking search results, it is preferable to use a listwise loss function when training our model, which takes the entire ranked list into consideration when updating the model parameters. To illustrate this, we trained linear models using the TF-Ranking framework. The framework is built on top of TensorFlow and allow us to specify pointwise, pairwise and listwise loss functions, among other things.

We made available the script that we used to train the two models that generated the results displayed in the figure below. The script uses simple linear models but can be useful as a starting point to build more complex ones.

Overall, on average, there is not much difference between those models (with respect to MRR), which was expected given the simplicity of the models described here. However, we can see that a model based on a listwise loss function allocate more documents in the first two positions of the ranked list when compared to the pointwise model. We expect the difference in MRR between pointwise and listwise loss functions to increase as we move on to more complex models.

The main goal here was simply to show the importance of choosing better loss functions when dealing with LTR tasks and to give a quick start for those who want to give it a shot in their own Vespa applications. Now, it is up to you, check out the tutorials, build something and let us know how it went. Feedbacks are welcome!

From research to production: scaling a state-of-the-art machine learning system

How we implemented a production-ready question-answering application and
reduced response time by more than two orders of magnitude.

Imagine you’ve created a machine learning system that surpasses
state-of-the-art performance on some task. You’ve optimized for a set of
objectives like classification accuracy, F1 scores, or AUC. Now you want to
create a web service from it. Other objectives, such as the time or cost of
delivering a result to the user, become more important.

These two sets of objectives are typically in conflict. More accurate models
are often large and computationally expensive to evaluate. Various
optimizations like reducing the models’ precision and complexity are often
introduced to use such models in production. While beneficial for decreasing
cost and energy consumption, this, unfortunately, hurts accuracy.

Obviously, inference time can be drastically lowered if accuracy is not
important. Likewise, accurate responses can be produced at high cost.
Which solution to ultimately choose lies somewhere between these extremes. A
useful technique for selecting the best solution is to enumerate them in terms
of accuracy and cost. The set of solutions not dominated by others is called
the Pareto frontier and
identifies the best trade-offs between accuracy and cost.

In a previous blog
we introduced a serving system that reproduces state-of-the-art accuracy in
open-domain question-answering. We based this on Facebook’s Dense Passage
Retrieval (DPR), which is a
Python-based research system. We built the serving system using, the open-source big data serving engine, which is
uniquely suited to tasks like this due to its native support for fast
similarity search and machine learned models in search and ranking. The result
is a web service taking a single question and returning an exact answer.

While this system reproduced DPR’s result and thus had excellent accuracy
metrics, the response time was initially poor, as measured in end-to-end
latency. This post will describe the various optimizations we made to bring
performance to acceptable levels for a production system. is built for production and thus has quite a few options for serving
We will particularly use’s ability to retrieve and rank documents
using multiple worker threads per query to significant effect. However, this
application’s main cost is in evaluating two BERT models. One of the questions
we would like to answer is whether smaller models with full precision are
preferable to larger models with quantized parameters. We’ll develop the Pareto
frontier to evaluate the merits of the various optimizations.

We’ll start with an overview of the serving system and identify which parts of
the system initially drive the cost. For more details on the implementation, we
refer to the previous blog
in this series.

Question Answering

The system’s task is to produce a textual answer in response to a question
given in natural language. There are primarily three stages involved:

  • The encoder generates a representation vector for the question.
  • The retriever performs a nearest neighbor search among the 21 million indexed passages.
  • The reader finds the most relevant passage and extracts the final answer.

The following figure illustrates the process:

Encoder, Retriever and Reader for question-answering

The encoder first creates a tokenized representation from the question. This
vector of token IDs is sent as input to the encoder BERT model. This is
initially a standard BERT-base model with 12 layers and a hidden layer size of

  1. The final hidden layer state is used as a representation vector for the
    question. As seen in the figure above, this primarily happens in the stateless
    container layer in Vespa. The token and vector representation is passed down
    (“scattered”) to all content nodes to perform the query.

On the content nodes, the passages have been indexed with their own
representation vectors. These vectors have been constructed so that the
euclidean distance between question and passage vectors indicate similarity.
This is used in the HNSW algorithm to perform an approximate nearest neighbor
search. The 10 passages with the smallest euclidean distance are sent to the
next stage.

The second-phase ranking stage, also performed on each content node, evaluates
the reader BERT model. Like the encoder model, this is initially a BERT-base
model with 12 layers and hidden length 768. The token representations from the
query and each passage are combined to form the model input. The reader model
produces three probability scores: the relevance score and the start and end
indices of the answer in the passage’s token sequence. The passage with the
best relevance score is selected as the winner, and its token representation is
returned to the stateless layer. There, custom code extracts the best span
using the start and end indices, de-tokenizes it, and returns the resulting
textual answer.

Now, that’s a lot of work. Here is an example of a response to the question
“Who won Tour De France in 2015”, where the most relevant passage is retrieved
and the correct answer “Chris Froome” is extracted:

Response from

To measure performance, we deployed the system on a single machine with an
Intel Xeon Gold 6240 processor
with 200 GB RAM and SSD disk. We evaluate the system over 3610 questions and
record the average latency and exact-match score. Initially, the system
achieves an exact-match score of 40.64. Before any consideration has been made
to optimize performance, the time spent in the three stages mentioned above is:

  • Encoding model: 300 ms
  • Approximate nearest neighbor search: 13 ms
  • Top-10 reader ranking: 9085 ms

Obviously, a total end-to-end latency of 9.4 seconds is not anything close to
acceptable as a service. In the following, we’ll lower this to well below 100

Multi-threaded retrieval ranking

Initially, the most expensive step by far is the reader stage. By default,
Vespa does all ranking for a query on a single thread. This is a reasonable
default to maximize throughput when the computational cost is low. In this
case, this means that the reader model is evaluated – in sequence – for each of
the top 10 passages. This results in high query latency.

However, Vespa has an option of using multiple threads per
Setting this value brings the average end-to-end latency down to 2.04 seconds,
a more than 4x improvement without affecting the exact match score.

It is worth clarifying that the reader model is not evaluated batch-wise. This
is due to Vespa’s ranking framework, where ranking expressions score a single
passage and query pair. For BERT models, however, this is not significant as
evaluation time is linear with batch
One reason is tensor multiplications with tensors of 3 or more dimensions, as
these iterate over several hardware-optimized matrix-matrix multiplications

In general, Vespa has many options to tune performance, such as easily
distributing the workload on additional content nodes. While we don’t explore
that here, see the Vespa serving scaling
guide for
more information.

Token sequence length

One of the defining and most prominent features of BERT models is the
full-attention layer. While this was a significant breakthrough in language
understanding, it has an unfortunate O(n^2) effect on evaluation time.

So, the length of the token sequence input to the BERT model significantly
impacts inference time. Initially, the encoder BERT model had an input length
of 128. By reducing it to 30, we decrease inference time from 300 ms to 125 ms
without loss of accuracy.

Likewise, the reader model initially had an input length of 380. By reducing
this to 128, we reduce average latency from 1.9 seconds to 741 ms, a
significant reduction. However, we do get a decrease in accuracy, as some
question and passage combinations can result in token sequences longer than 128.
This reduced the exact match score to 39.61.

Both the encoder and reader model support dynamic length inputs, but Vespa
currently only supports fixed length inputs. This will be fixed in the near
future, however. In summary, shortening token input lengths of the encoder and
reader models result in a 3x speedup.

Model quantization

Neural network models are commonly trained using single-precision
floating-point numbers. However, for inference in production, it has been shown
that this level of precision is not always necessary. The parameters can be
converted to a much smaller integer representation without significant loss in
accuracy. Converting the parameters from a 32-bit floating-point to 8-bit
integers reduces the model size by 75%. More importantly, integer operations
execute much faster. Modern CPUs that support AVX512 Vector Neural Network
Instructions (VNNI) are designed to accelerate INT8 inference performance.
Additionally, evaluating such quantized models requires less power.

Quantizing the reader model brings its size down from 435Mb to 109Mb. The
latency for the system drops on average to 374 ms. This has a slightly
unfortunate effect on accuracy, dropping the exact match to 37.98. Likewise,
quantizing the encoder model results in a similar size reduction, and system
evaluation time drops to 284 ms. The exact-match score drops to 37.87.

In summary, model quantization of both reader and encoder models result in
another 3x speedup.

Miniature models

Until this point, both the encoder and reader models are based on pre-trained
BERT-base models, containing 12 layers with hidden dimension size of 768 and
thus around 110 million parameters. These are reasonably large models,
particularly when used in time-constrained environments. However, in the paper
Well-Read Students Learn Better: On the Importance of Pre-training Compact
Models, the authors show that smaller
models can indeed work well. The “miniature” models referenced in this
paper can be found in the Transformers model

We trained new reader models as described in the DPR
repository, basing
them on the following pre-trained BERT miniature models:

  • Medium (8 layers, 512 hidden size)
  • Small (4 layers, 512 hidden size)
  • Mini (4 layers, 256 hidden size)
  • Tiny (2 layers, 128 hidden size)

We also quantize each model. The full overview of all 20 models (5 reader
models, with and without quantization, with and without quantized encoder
model) with exact match scores and average latency is given in the table below:

Exact match scores vs latencies

Plotting these results:

Exact match vs latency

In the figure above, the red line represents the Pareto front. The points along
this front are also marked in bold in the table above. Recall that these points
represent the best trade-offs between exact match and latency, meaning that
there are no other points that are superior in both exact match and latency for
each point along this front.

One interesting result that can be seen here is that, in general, quantized
models dominate other models with higher precision. For instance, the medium
quantized model has better exact match and latency numbers than the small
models with higher precision. So, in this case, even though quantization
reduces accuracy, it is more beneficial to choose a large model that has been
quantized over a smaller model that has not.

The Pareto front visualizes the objectively best solutions, and our subjective
preferences would guide us in finding the optimal solution. The tests above
have been run on a single Intel Xeon Gold 6240
machine. More powerful processors would lower the overall latency numbers but
not change the overall shape. The exact solution to choose is then based on our
latency and hardware budgets. For instance, organizations with considerable
resources that can scale up sufficiently can justify moving to the right on
this front. The economy of scale can mitigate the cost of hardware investments
and energy consumption to make the service viable. Such a solution might be out
of reach for others.

Putting all this together

Please refer to the companion sample
for more details and instructions on how to run this application yourself.


In summary, we’ve taken a research application with poor performance to levels
suitable for production. From 9.4 seconds for the full model down to 70ms for
the tiny model, this represents a 130x speedup. Unfortunately, to get down to
these levels, we noted a significant drop in exact-match as well. The best
choice lies somewhere between these extremes. If we were to bring this
application into production, we could use more powerful hardware to bring the
latency below 100ms with acceptable exact match metrics.

Results summary

There are quite a few

Improving Product Search with Learning to Rank – part one


Photo by Pawel Czerwinski on Unsplash

Over a few busy years, we have witnessed a neural ranking paradigm shift, where
pre-trained language models, such as BERT,
have revolutionized text ranking
in data-rich settings with lots of labeled data.
Feeding the data hungry neural networks lots of labeled relevance judgments
has proved more effective than statistical (tabular) text ranking feature-based methods.

However, historically, tree-based machine learning models
such as Gradient Boosting (GB)
with lambdarank loss have prevailed as state-of-the-art for product search ranking.
In this blog post series, we look at improving product search using learning to rank techniques,
with demonstrable results on the largest publicly available product ranking dataset.
We will train and measure their effectiveness on a large dataset, but
without getting into the details of complex loss functions.

In this first post, we introduce a product ranking dataset and establish multiple ranking baselines
that do not use the labeled relevance judgments in the dataset.
These ranking models are applied in a zero-shot setting, meaning they have not been trained using
domain-specific relevancy labels. The methods include traditional lexical
ranking like BM25 and other Vespa native text ranking features, semantic
off-the-shelf vector search models, and hybrid combinations of sparse and dense
methods. All of the ranking methods are represented end-to-end in Vespa.


In this series, we use a large shopping queries
dataset released by Amazon:

We introduce the “Shopping Queries Data Set”, a large dataset of complex search
queries released to foster research in semantic matching of queries and
products. For each query, the dataset provides a list of up to 40 potentially
relevant results, together with ESCI relevance judgments (Exact, Substitute,
Complement, Irrelevant) indicating the relevance of the product to the query.
Each query-product pair is accompanied by additional information. The dataset is
multilingual, containing queries in English, Japanese, and Spanish.

The dataset consists of a large number of questions, and where each query has,
on average, 20 graded query-item relevance judgments:

  • Exact (E): the item is relevant to the query and satisfies all the query
    specifications (e.g., water bottle matching all attributes of a query “plastic
    water bottle 24oz”, such as material and size)

  • Substitute (S): the item is somewhat relevant: it fails to fulfill some
    aspects of the query, but the item can be used as a functional substitute (e.g.,
    fleece for a “sweater” query)

  • Complement (C): the item does not fulfill the query but could be used in
    combination with an exact item (e.g., track pants for “running shoe” query)

  • Irrelevant (I): the item is irrelevant, or it fails to fulfill a central
    aspect of the query (e.g., socks for a “pant” query)

As with most ranking datasets, the dataset is split into a train and test part,
and models can be trained using the train split and evaluated on the test split.
In our work, we focus on the English queries in both splits.

English Train Split

The train split contains 20,888 queries with 419,653 query-item judgments. This
split can be used to train models, using learning to rank techniques. The train
set will be the focus of upcoming blog posts.

English Test Split

In this blog post we focus on the test split, and it contains 8,956 queries with
181,701 query-item judgments.

The judgment label distribution is as follows:

  • Exact (Relevant) 79,708 (44%)
  • Substitute (Somewhat Relevant) 8,099 (4%)
  • Complement 63,563 (35%)
  • Irrelevant 30,331 (17%)

Both splits have about 20 product judgments per query. As we can see from the label
distribution, the dataset has many Exact (relevant) judgments, especially
compared to Substitutes (somewhat relevant) and Irrelevant. On average, there
are about 9 relevant products for every query in the test split.

This product relevance dataset is the largest available product relevance
dataset and allows us to compare the effectiveness of multiple ranking models
using graded relevance judgments. A large dataset allows us to report
information retrieval ranking metrics and not anecdotical LGTM@10 (Looks Good To Me) metrics.

We can evaluate ranking models used in a
zero-shot setting without using the in-domain relevance labels and models
trained using learning to rank techniques, exploiting the relevance labels in the train split.

Product ranking example

The above image shows query #535 in the test split with a total of 16 product judgements.
The task is to optimize the ordering (ranking) so that products labeled as exact are ordered
before supplements and supplements before irrelevant.

Product ranking example
The above image shows perfect ranking of products for query #535.
Note that we have converted the textual labels (esci_label) to numeric labels using
E=4, S=3, C=2, I=1. If our ranking function is able to produce this ordering, we
would get a perfect score for this query.

Indexing the dataset with Vespa

We use the following Vespa document
schema, which allows us to index all the
products associated with English queries across both splits.
We use a utility script that converts the parquet product data file to a
Vespa JSON
formatted feed file. In total we index about 1.2M products in Vespa.

schema product {

    document product {

        field id type string {
            indexing: summary | index 

        field title type string {
            indexing: summary | index
            index: enable-bm25

        field description type string {
            indexing: summary | index
            index: enable-bm25
            weight: 200

        field bullets type string {
            indexing: summary | index
            index: enable-bm25
            weight: 200

        field brand type string {
            indexing: summary | index | attribute

        field color type string {
            indexing: summary | index | attribute


    field title_tokens type tensor<float>(d0[32]) {
        indexing: input title | embed tokenizer | attribute

    field description_tokens type tensor<float>(d0[128]) {
        indexing: input description | embed tokenizer | attribute

    field embedding type tensor<float>(d0[384]) {
        indexing {
            input title . input description | embed transformer | attribute | summary | index
        attribute {
            distance-metric: angular 
    fieldset default {
        fields: title, description, bullets, brand  

The product data contains title, description, brand, color and bullets. We use
Vespa’s support for encoding text into vector embeddings, in this case
we use the title and description as input.
See text embedding made simple
for details on how to embed text embedding models in Vespa.


The official dataset evaluation metric is
NDCG (Normalized
Discounted Cumulative Gain), a precision-oriented metric commonly used for
ranking datasets with graded relevance judgments. An important observation is
that the task only considers ranking of the products that have judgment
labels. In other words, we assume that a magic retriever has
retrieved all relevant documents (plus irrelevant documents), and our task is to
re-rank the products so that the relevant ones are pushed to the top of the
ranked list of products. This is a simpler task than implementing end-to-end
retrieval and ranking over the 1.2M products.

Ranking overview

Many deployed search systems need to hedge the deployment cost and deploy
multi-phased retrieval and ranking
pipelines to reduce computational complexity.
In a retrieval and ranking funnel, one or many diverse retrievers use a
cost-efficient retrieval ranking phase, and subsequent ranking phases re-rank
the documents, focusing on precision. In a later post in this series, we look at
how to train a retriever function using learning to retrieve and how to
represent retrieval and ranking phases in Vespa.

Baseline zero-shot ranking models

We propose and evaluate seven different baseline ranking models, all represented end-to-end in
Vespa using Vespa’s ranking framework.


Obviously not the greatest baseline, but allows us to compare other baselines with
random ranking. We use Vespa’s random

A pseudorandom number in the range [0,1] which is drawn once per document during rank evaluation.

rank-profile random inherits default {
    first-phase {
        expression: random 

BM25 – lexical ranking

BM25 is a tried and true
zero-shot text ranking model. BM25 only has two parameters and is the ranking
function many turn to when there are no explicit relevancy labels or implicit
click feedback to train ranking models. The BM25 scoring function can also be
accelerated for efficient retrieval using dynamic pruning algorithms like

rank-profile bm25 inherits default {
    first-phase {
        expression: bm25(title) + bm25(description) 

Note that we only use the title and description. All our lexical baselines
uses only these two fields.

Vespa nativeRank – lexical ranking

Vespa’s nativeRank is a
lexical ranking function and the default text ranking function in Vespa. It has
similar characteristics as BM25 but also considers the proximity between matched
terms in the document. Unlike BM25, which has an unbound score range, Vespa’s
nativeRank is normalized to a score in the range [0,1].

rank-profile nativeRank inherits default {
    first-phase {
        expression: nativeRank(title) + nativeRank(description) 

Vespa fieldMatch – lexical ranking

Vespa fieldMatch
is another lexical ranking function that is more sophisticated than Vespa
nativeRank but also computationally expensive and unsuitable as a retrieval
function. fieldMatch(field) or fieldMatch sub-features such as fieldMatch(field).proximity is
typically used as re-ranking features.

rank-profile fieldMatch inherits default {
    first-phase {
        expression: fieldMatch(title) + fieldMatch(description) 

Vespa semantic ranking using vector similarity – dense retrieval

Using an off-the-shelf dense vector model from
Huggingface, we
encode documents and queries into a 384-dimensional dense vector space. We use
cosine similarity between the query and the document in this shared embedding
vector space as our relevancy measure. The vector embedding model,
sentence-transformers/all-MiniLM-L6-v2, is trained
on a large number of sentence-length texts.

Semantic similarity search over dense representations can be accelerated using
Vespa‘s support for approximate nearest neighbor search, making it usable for
efficient retrieval. We use Vespa’s support for embedding dense embedding models in this work.
See text embeddings made simple.

rank-profile semantic inherits default {
    inputs {
        query(query_embedding) tensor<float>(d0[384])
    first-phase {
        expression: closeness(field, embedding)

Vespa semantic ranking using vector similarity – dense retrieval

Another model that can be used for vectorization,
bert-base-uncased, 768-dimensional.
This model has not been trained for semantic similarity, or semantic search. We
use the pre-trained language model directly. As with the previous semantic
model, we use cosine similarity between the query and the document as our relevancy measure.

rank-profile semantic inherits default {
    inputs {
        query(bert_query_embedding) tensor<float>(d0[768])
    first-phase {
        expression: closeness(field, bert_embedding)

Vespa hybrid – combining lexical and semantic ranking

In this baseline model, we combine the semantic similarity score of the
sentence-transformers/all-MiniLM-L6-v2 model with the lexical score computed by
Vespa’s nativeRank. We use a simple linear combination of the two ranking
models. Since the nativeRank lexical feature is normalized in the range [0,1],
combining it with semantic similarity is more straightforward as we don’t need
to normalize unbound lexical scoring functions, such as BM25.
We use a query(alpha) parameter to control how each method influences the overall score.

rank-profile hybrid inherits default {
    inputs {
        query(alpha): 0.5
        query(query_embedding) tensor<float>(d0[384])
    first-phase {
        expression: query(alpha)*closeness(field, embedding) + (1 - query(alpha))*(nativeRank(title) + nativeRank(description))

Zero-shot Baseline Results

Ranking result

We present the results of our evaluation in the above figure. We execute each
query in the English test split (8.9K queries) and ask Vespa to rank the
documents that we have labels for, using the Vespa search API’s recall

Sets a recall parameter to be combined with the query. This is identical to filter, except that recall terms are not exposed to the ranking framework and thus not ranked.

We use trec_eval utility to compute
NDCG, using the relevance judgments. The reported NDCG is the average NDCG
calculated across all 8,956 queries in the test split. These are unique queries
and we don’t know their frequency distribution on Amazon, but in a real-world
evaluation we would have to take into account the query frequency as well when
evaluating the model, for example using weighting so that popular (head) queries weights more
than rare (tail) queries. In other words, we don’t want to introduce a ranking model that on average improves
NDCG, but hurts many head queries.

Notice that the random ranking function produces a high NDCG score, this is because there are
many exact (relevant) judgements per query. On

Improving Product Search with Learning to Rank – part two


Photo by Carl Campbell
on Unsplash

In the first post,
we introduced a large product ranking dataset and established multiple zero-shot
ranking baselines that did not use the labeled relevance judgments in the
dataset. In this post, we start exploiting the labeled relevance judgments to
train deep neural ranking models. We evaluate these models on the test set to
see if we can increase the NDCG ranking metric.

Dataset Splits

The English train split contains 20,888 queries with 419,653 query-item
judgments. To avoid training and tuning models based on observed test set
accuracy, we split the official train set into two new splits; train and dev. In
practice, we pick, at random, 20% of the queries from the train as our dev split
and retain the remaining 80% as train queries.

Since we plan to train multiple models, we can use the performance on the dev
set when training ensemble models and then hide the dev set during the training.

Train Original: 20,888 queries with 419,653 judgments

  • Our Train 16,711 queries with 335,674 assessments
  • Our Dev 4,177 queries with 83,979 assessments

We do this dataset split once, so all our models are trained with the same train
and dev split.

ranking model

Our objective is to train a ranking model f(query,product) that takes a query
and product pair and outputs a relevance score. We aim to optimize the NDCG
ranking metric after sorting the products for a query by this score.

The ranking process is illustrated for a single query in the above figure. We
are asked to rank four products for a query with model f. The model scores each
pair and emits a relevance score. After sorting the products by this score, we
obtain the ranking of the products for the query. Once we have the ranked list,
we can calculate the NDCG metric using the labeled query and product data. The
overall effectiveness of the ranking model is given by computing the average
NDCG ranking score for all queries in the test dataset.

There are many ways to train a ranking model f(query, product), and in this
post, we introduce and evaluate two neural ranking methods based on pre-trained
language models.

Neural Cross-Encoder using pre-trained language models

One classic way to use Transformers for ranking is cross-encoding, where both
the query and the document are fed into the Transformer model simultaneously.

The simultaneous input of the query and document makes cross-encoders unsuitable
for cost-efficient retrieval. In this work, we are relatively lucky because we
already have a magic retriever, and our task is to re-rank a small set (average
20) of products per query.

cross-encoder model

One approach for training cross-encoders for text ranking is to convert the
ranking problem to a regression or classification problem. Then, one can train
the model by minimizing the error (loss) between the model’s predicted and true
label. A popular loss function for regression and classification problems is MSE
(Mean Squared Error) loss. MSE loss is a pointwise loss function, operating on a
single (query, document) point when calculating the error between the actual and
predicted label. In other words, the pointwise loss function does not consider the
ranking order of multiple (query, document) points for the same query. In the next
blog post in this series we will look at listwise loss functions.

With graded relevance labels, we can choose between converting the ranking
problem to a regression (graded relevance) or binary classification
(relevant/irrelevant) problem. With binary classification, we need to set a
threshold on the graded relevance labels. For example, let Exact and Complement
map to the value 1 (relevant) and the remaining labels as 0 (irrelevant).
Alternatively, we can use a map from the relevancy label to a numeric label

Exact => 1.0, Substitute => 0.1, Complement => 0.01, Irrelevant => 0

With this label mapping heuristic, we “tell” the model that Exact should be 10x as
important as Substitute and so forth. In other words, we want the model to output a score close
to 1 if the product is relevant and a score close to 0.1 if the product is labeled as a substitute.

training cross-encoder model

The above figure illustrates the training process; we initialize our model using
a pre-trained Transformer. Next, we prepare batches of labeled data of the form
<query, document, label>. This data must be pre-processed into
a format that the model understands, mapping free text to token ids. As part of
this process, we convert the textual graded label representation to a numeric
representation using the earlier label gain map. For each batch, the model makes
predictions, and the prediction errors are measured using the loss function. The
model weights are then adjusted to minimize the loss, one batch at a time.

Once we have trained the model, we can save the model weights and import the
model into Vespa for inference and ranking. More on how to represent
cross-encoder language models in Vespa will be in a later section in this blog

Choosing the Transformer model and model inputs

There are several pre-trained language models based on the Transformer model
architecture. Since we operate on English queries and products, we don’t need
multilingual capabilities. The following models use the English vocabulary and
wordpiece tokenizer.

  • bert-base-uncased – About 110M parameters
  • MiniLM Distilled model using bert-base-uncased as a teacher, about 22M parameters

Choosing a model is a tradeoff between deployment cost and accuracy. A larger
model (with more parameters) will take longer to train and have higher
deployment costs but with potentially higher accuracy. In our work, we base our
model on weights from a MiniLM
model trained on
relevancy labels from the MS Marco

We have worked with the MiniLM models before in our work on MS Marco passage
so we know these models provide a reasonable tradeoff between ranking
accuracy and inference efficiency, which impacts deployment cost. In addition,
MiniML is a CPU-friendly model, especially with quantized versions.
As a result, we are avoiding expensive GPU instances and GPU-related failure modes.

In addition to choosing the pre-trained Transformer model, we must decide which
product fields and the order we will input them into the model. Order matters,
and the Transformer model architecture has quadratic computational complexity
with the input length, so the size of the text inputs significantly impacts
inference-related costs.

Some alternative input permutations for our product dataset are:

  • query, product title,
  • query, product title, brand,
  • query, product title, brand, product description,
  • query, product title, product description, product bullets.

The product field order matters because the total input length is limited to a
maximum of 512 tokens, and shorter input reduces inference costs significantly.
Since all products in the dataset have a title, we used only the title as input,
avoiding complexity related to concatenating input fields and handling missing
fields. Furthermore, product titles are generally longer than, for example,
Wikipedia or web page titles.

We trained the model for two epochs, Machine Learning (ML) terminology for two
complete iteration over our training examples. Training the model on a free-tier
Google Colab GPU takes less than 30 minutes.

Representing cross-encoders in Vespa

Once we have trained our cross-model using the sentence-transformer
cross-encoder training library, we need to import it to Vespa for inference and
ranking. Vespa supports importing models saved in the ONNX
model format and where inference is accelerated using
We export the PyTorch Transformer model representation to ONNX format using an
ONNX export utility script. Then we can drop the ONNX file into the Vespa
application package and configure how to use it in the product schema:

Model declaration

document product {
    field title type string {..} 
field title_tokens type tensor(d0[128]) {
    indexing: input title | embed tokenizer | attribute | summary
onnx-model title_cross {
    file: models/title_ranker.onnx
    input input_ids: title_input_ids
    input attention_mask: title_attention_mask
    input token_type_ids: title_token_type_ids

The above declares the title_tokens field where we store the subword tokens
and the onnx-model with the file name in the application package and its inputs.
The title_tokens are declared outside of the document as these fields are populated
during indexing.

Using the model in rank-profile

rank-profile cross-title inherits default {
    inputs {
        query(query_tokens) tensor(d0[32])

    function title_input_ids() {
        expression: tokenInputIds(96, query(query_tokens), attribute(title_tokens))

    function title_token_type_ids() {
        expression: tokenTypeIds(96, query(query_tokens), attribute(title_tokens))

    function title_attention_mask() {
        expression: tokenAttentionMask(96, query(query_tokens), attribute(title_tokens)) 

    function cross() {
        expression: onnx(title_cross)({d0:0,d1:0} 
    first-phase {
        expression: cross() 

This defines the rank-profile with the function inputs and where to find the inputs. We
cap the sequence at 96 tokens, including special tokens such as CLS and SEP.
The onnx(title_cross)({d0:0,d1:0} invokes the model with the inputs and slices
the batch dimension (d0) and reads the predicted score.

Vespa provides convenience tensor functions to calculate the three inputs to the
Transformer models; tokenInputIds, tokenTypeIds, and tokenAttentionMask.
We just need to provide the query tokens and field from which we can read the document
tokens. To map text to token_ids using the fixed vocabulary associated with the model,
we use Vespa’s support for wordpiece embedding.
This avoids document side tokenization at inference time, and we can just read the product tokens
during ranking. See the full product schema.

Neural Bi-Encoder using pre-trained language models

Another popular approach for neural ranking is the two-tower or bi-encoder
approach. This approach encodes queries and documents independently.

For text ranking, queries and documents are encoded into a dense embedding
vector space using one or two Transformer models. Most methods use the same
Transformer instance, but it’s possible to decouple them. For example, suppose
both towers map documents and queries to the same vector dimensionality. In that
case, one could choose a smaller Transformer model for the online query encoder
versus the document tower to reduce latency and deployment costs.

The model learns the embedding vector representation from the labeled training
examples. After training the model, one can compute the embeddings for all
products, index them in Vespa, and use a nearest-neighbor search algorithm at
query time for efficient document retrieval or ranking. In this work, Amazon has
provided the “retrieved” products for all queries, so we don’t need HNSW
indexing for efficient
approximate nearest neighbor search.

cross-encoder model

The above illustrates the bi-encoder approach. This illustration uses the
Transformer CLS (classification token) output and uses that as the single dense
vector representation. Another common output pooling strategy is calculating the
mean over all the token vector outputs. Other representation methods use
multiple vectors to represent queries and documents. One example is the ColBERT
model described in this blog

training cross-encoder model

The above illustrates training a bi-encoder for semantic similarity. Training
bi-encoders for semantic similarity or semantic search/retrieval/ranking has
more options than training cross-encoders.

As with the cross-encoder, we must select a pre-trained model, minding size and

The model size impacts quality, inference speed, and vector representation
dimensionality. For example, bert-base-uncased uses 768 dimensions, while MiniLM
uses 384. Lower dimensionality lowers the computational complexity of the
similarity calculations and reduces the embedding storage footprint.

Furthermore, how we batch the input training data, the vector similarity
function, and, most importantly, the loss function determines the quality of the
model on the given task. While loss functions are out of the scope of this blog
post, we would like to mention that training a bi-encoder for semantic
retrieval over a large corpus, which “sees” many irrelevant documents, needs a
different loss function than a bi-encoder model used for re-ranking.

Like with the cross-encoder, we must decide what product fields we encode.
Instead of inputting multiple product fields, we train two bi-encoder models,
one that uses the product title and another that encodes the description.
Both models are based on

Representing bi-encoders in Vespa

A single Vespa document schema can contain multiple document embedding fields,

document product {
    field title type string {}
    field description type string {}
field title_embedding type tensor(d0[384]) {
    indexing: input title | embed title | attribute | summary
    attribute {
        distance-metric: angular 
field description_embedding type tensor(d0[384]) {
    indexing: input description | embed description | attribute 

Improving Product Search with Learning to Rank – part three


Photo by Niels Weiss
on Unsplash

We introduced a large product ranking dataset in the first
post in
this series and established multiple zero-shot ranking baselines.
In the second
we trained ranking models using pre-trained language models and
evaluated them using the NDCG ranking metric.

In this post, we dive deep into a popular method for learning to
rank; gradient-boosting decision trees
(GBDT). Vespa has
native support for evaluating GBDT models and imports models trained
with popular GBDT frameworks such as
XGBoost and


Our objective is to train a ranking model f(query, product) that
takes a query and product pair as input and which outputs a relevance
score. We aim to optimize the NDCG metric after sorting the products
by this score. There are many ways to train a ranking model f(query,
, and in this post, we introduce and evaluate tree-based
models. Tree-based GBDT models, implemented by popular frameworks
like XgBoost and
LightGBM, excel on tabular
(structured) features and handle feature mixes and feature value
ranges without any ceremony.

In our series, we only aim to optimize relevance (NDCG); still,
E-commerce ranking is a multi-objective optimization
where shops want to maximize relevance and revenue. GBDT models
are famous for multi-objective ranking optimization as they handle
a mix of features, combining normalized features (e.g., vector
similarity), unnormalized unbound features (e.g.,
BM25), and “business”
features such as product sales margin. GBDT is also an excellent
way to explore semantic search signals and integrate them into an
existing product ranking function without wasting years of ranking
model development.

Multi-objective ranking

The above figure illustrates a multi-objective ranking optimization
problem where we want to maximize relevance and revenue. One way
to approach this problem is to train the model with a modified
label, an aggregated weighted combination of the two conflicting
optimization objectives.

For organizations, prediction explainability is essential when
applying Machine Learning (ML) to ranking problems, and GBDT
predictions are more straightforward to interpret than neural
predictions. Finally, GBDT is a relatively simple algorithm, meaning
we can train on more data than deep neural networks (for the same
compute budget). For example, training on our product ranking
dataset takes a few seconds on a laptop with CPU-only, while our
neural methods from the second
took hours with GPU acceleration.

Model Features

Tree-based learning methods require tabular (structured) features.
Therefore, we convert the unstructured text ranking dataset to a
tabular dataset using simple feature engineering, where Vespa
calculates all tabular features.

Broadly we have four classes of features used for ranking problems.
Examples in this section do not necessarily map to the Amazon product
ranking dataset used in this blog post series.

Contextual query features

These are features that do not depend on the document. For example,
it could be the user’s age, the time of the day, or a list of
previously visited products — generally, query time contextual
features and the user’s query terms. Query classification techniques
using language models also fall under this feature category, for
example, using Vespa’s support for stateless model

Document features

These features are independent of the query. In E-commerce, for
example, this might be the product’s popularity, price, sale margin,
or user review rating score(s).

Online match features (query * document)

An example of this could be text matching features, such as
bm25(title), tensor computation
like a sparse dot product between the user profile interests and product category, or semantic vector
similarity, as demonstrated in the previous post in this series.
See also tensor computation examples.

Aggregate features

These features aggregate across documents or queries. The computing
of aggregated features is usually performed outside of Vespa, using
near real-time stream processing. The output of these steam processing
jobs are usually tensors
which are stored and updated at scale with Vespa. Aggregate features
can help cold-start problems when indexing a new document that lacks
document features such as sales rank or popularity. For example,
having a “category” popularity score could help rank new products
added to the catalog. For example, the Yahoo homepage recommendation
system uses topical click-through rate (CTR) using Vespa global

Representing features in Vespa

Vespa’s ranking framework has rich built-in
Still, developers can easily express domain-specific features by
combining Vespa’s ranking
language with its built-in features.

Many ranking architectures in the wild use external feature stores.
In these serving architectures, a middle-tier component retrieves
a small pool of documents from an indexing service, for example,
Elasticsearch. After retrieving documents, the middle-tier fetches
features from the feature store and inputs these to the model.

Vespa allows representing a richer set of features than other
traditional indexing services built on Apache Lucene. Furthermore,
since Vespa stores tensor features in memory and memory bandwidth
are much higher than network bandwidth, one can rank a much larger
candidate pool with Vespa than with external feature stores.

Vespa attributes and
tensors support
in-place partial updates,
and developers can update a document and aggregated features with
high throughput (up to 75K partial updates/s per float field per
node with low CPU resource consumption). Genuine partial update
support is a feature that differentiates Vespa compared to search
engines built on Apache Lucene, where partial updates trigger a
full re-indexing of the entire document (get-apply-write pattern).

Gathering features from Vespa

In previous posts, we introduced Vespa
rank-features and
We can define custom features and feature computations using the
Vespa rank expressions language with function support.

rank-profile features inherits default {
        inputs {
            query(query_tokens) tensor<float>(d0[32])
            query(q_title) tensor<float>(d0[384])
        function bi_encoder() {
            expression: closeness(field, title_embedding)

        function max_title_significance() {
            expression: foreach(terms, N, term(N).significance, ">0.5", max)

        function mean_title_significance() {
            expression: foreach(terms, N, term(N).significance, ">0.5", average)

        first-phase {
            expression: random

        match-features {

The above features rank-profile defines a set of custom features using function with rank expressions.
The match-features block defines the set of features that are returned with each ranked document.
See full schema on GitHub.

For explicitly labeled examples like in our dataset, we can ask
Vespa to compute and return the feature scores using a combination
of the Vespa recall
parameter and match-features or
The recall query
parameter allows developers to request Vespa to calculate features
for query document pairs without impacting other features like
queryTermCount or text matching features.


Query-product feature scraping illustration. We request that Vespa
retrieve only the labeled products for each query in the train
split. This feature scraping approach is used for explicit relevance
judgment labels.


The above figure illustrates feature logging in production, and
this process can be used to gather implicit relevance feedback using
clicks and skips.

Feature logging infrastructure is fundamental for scaling machine
learning (ML) applications beyond expensive labeled data.
With Vespa, developers can compute and log new features that the
current production model does not use. Vespa can log the query and
the list of ranked documents with the calculated features. Note:
Using click models to generate unbiased pseudo-relevance labels for
model training is out of the scope of this blog post.

Generally, one wants to train on the features as scraped or logged.
Computing features outside of Vespa for training might cause feature
drift between training and inference in Vespa, where the offline
feature computations differ from the Vespa implementation.

For our GBDT model training, we use built-in Vespa text-matching
features for our product ranking datasets, such as
nativeRank, and
We also include a few handcrafted function features and the neural
scores from the neural bi-encoder and cross-encoder introduced in
the previous
The scrape routine fetches the computed feature values from a running
Vespa instance and merges the features into a pandas DataFrame,
along with the labels and query ids.


The above figure displays a random sample of tabular feature data.
The DataFrame column names map 1:1 to Vespa feature names, native
built-in features, or our user-defined functions, for example,
bi_encoder and cross_encoder.

Gradient Boosting Decision Trees (GBDT)

This blog post won’t dive deep into how the gradient boosting
algorithm works or ranking loss calculations. Instead, for our
purposes, on a very high level, it suffices to say that each training
iteration builds a decision tree, and each new decision tree tries
to correct the errors from the previously generated decision trees.

The final model after k training iterations has k decision trees,
and the model inference is performed by traversing the decision
trees and summing scores in the leaf nodes of the decision trees.
GBDT supports different learning objectives and loss functions:

  • Classification with pointwise loss
  • Regression with pointwise loss
  • Ranking with listwise or pairwise ranking loss

There are two popular implementations of the GBDT algorithm; XGboost,
and LightGBM; Vespa supports importing models from both frameworks.
Vespa does not use the framework inference implementation but
converts the models into Vespa’s ranking framework for accelerated
GBDT inference. At Yahoo, Vespa has been used for ranking with
GBDT models at
long before the XGBoost and LightGBM implementations.
Since the different framework models are imported into the same unified
framework in Vespa, combining them into an ensemble of models from
both frameworks is easy.


The above figure shows a single decision tree from a forest of
decision trees. Each gradient-boosting learning iteration generates
a new decision tree, and the final model is a forest of decision

The decision tree in the illustration is two levels deep and has
four leaf nodes. This simplified decision tree can only output four
different scores (0.04, 0.3, -0.9, or 0.4). Each non-leaf node has
a decision (split condition) that compares a feature with a numeric
value. The tree traversal starts from the tree’s root, and the path
determines the leaf score. The final score is the sum of all the
trees in the forest.

Training GBDT models

Once we have scraped the Vespa computed features for our product
ranking train data, we can start to train models. Unfortunately,
it’s easy to overfit the training data with GBDT, and hyperparameters
impact generalization on unseen data (test data). The previous
explained how to split the training data into a new train and dev
split. Now we can use that dev split as the validation dataset used
during (GBDT) training to check generalization and tune
The training job terminates when observed dev accuracy (NDCG) stops
improving. Early termination avoids overfitting on the train set
and saves resources during training and inference.


The GBDT feature importance plot enables feature ablation studies
where we can remove low-importance features and re-train the model
to observe the NDCG impact on the dev set. Feature importance plots
allow us to reduce computational complexity and deployment costs
by removing features. We make these reductions to meet a specific
latency service level agreement (SLA).


In the feature importance illustration above, we have trained a
model using only five simple features that are computationally

GBDT models and Vespa serving performance

Training an ML model is only worthwhile if we can deploy it to
production. The total computational complexity of evaluating a GBDT
model for ranking in Vespa depends mainly on three factors.

The number of documents exposed to the GBDT model

The number of documents exposed to the model is the most significant
parameter. Computational cost is linear with the number of documents
ranked using the model. Vespa allows multiple ranking
phases to apply the
most computationally complex model to the top-scoring documents
from the previous ranking phase. Search ranking and model inference
with GBDT are highly parallelizable. Vespa’s ability to use multiple
threads per query reduces serving latency.

Features and feature complexity

The number of features and the computational complexity of the
features dramatically impact performance. For example, calculating
a matching feature such as bm25(title) is low complexity
compared to fieldMatch(title), which again is relatively low compared
to inference with the previous

The benefit Vespa delivers is that it allows developers
to understand relative feature complexity using rank
The trace below shows the relative feature complexity.