Scaling TensorFlow model evaluation with Vespa

In this blog post we’ll explain how to use Vespa to evaluate TensorFlow models over arbitrarily many data points while keeping total latency constant. We provide benchmark data from our performance lab where we compare evaluation using TensorFlow serving with evaluating TensorFlow models in Vespa.

We recently introduced a new feature that enables direct import of TensorFlow models into Vespa for use at serving time. As mentioned in a previous blog post, our approach to supporting TensorFlow is to extract the computational graph and parameters of the TensorFlow model and convert them to Vespa’s tensor primitives. We chose this approach over attempting to integrate our backend with the TensorFlow runtime for a few reasons. One was that we would like to support frameworks other than TensorFlow; for instance, our next target is ONNX. Another was that we would like to avoid the inevitable overhead of such an integration, both in performance and in code maintenance. Of course, this means a lot of optimization work on our side to make this as efficient as possible, but we believe it is the better long-term solution.

Naturally, we thought it would be interesting to set up some sort of performance comparison between Vespa and TensorFlow for cases that use a machine learning ranking model.

Before we get to that, however, it is worth noting that Vespa and TensorFlow Serving have an important conceptual difference. With TensorFlow you are typically interested in evaluating a model for a single data point, be that an image for an image classifier, or a sentence for a semantic representation, etc. The use case for Vespa is when you need to evaluate the model over many data points. Examples are finding the best documents for a given text, images similar to a given image, or computing a stream of recommendations for a user.

So, let’s explore this by setting up a typical search application in Vespa. We’ve based the application in this post on the Vespa
blog recommendation tutorial part 3.
In this application we’ve trained a collaborative filtering model which computes an interest vector for each existing user (which we refer to as the user profile) and a content vector for each blog post. In collaborative filtering these vectors are commonly referred to as latent factors. The application takes a user id as the query, retrieves the corresponding user profile, and searches for the blog posts that best match the user profile. The match is computed by a simple dot-product between the latent factor vectors. This is used as the first phase ranking. We’ve chosen vectors of length 128.
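
As a sketch of that first phase, the score is just a dot product between the user profile and each blog post's content vector. The vectors below are random stand-ins for the learned latent factors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent factors: one user profile and one blog post vector,
# both of length 128 as in the application described above.
user_profile = rng.standard_normal(128)
blog_post = rng.standard_normal(128)

# First-phase relevance score: a plain dot product of the two vectors.
first_phase_score = float(np.dot(user_profile, blog_post))
```

In Vespa this expression runs on the content nodes for every candidate document; the sketch only shows the arithmetic involved.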

In addition, we’ve trained a neural network in TensorFlow to serve as the second-phase ranking. The user vector and blog post vector are concatenated and represents the input (of size 256) to the neural network. The network is fully connected with 2 hidden layers of size 512 and 128 respectively, and the network has a single output value representing the probability that the user would like the blog post.
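
To make the architecture concrete, here is a plain NumPy forward pass of such a network with random stand-in weights. The ReLU and sigmoid activations are assumptions; the post does not state which activations were used:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-in weights for the 256 -> 512 -> 128 -> 1 network described above.
W1, b1 = rng.standard_normal((256, 512)) * 0.05, np.zeros(512)
W2, b2 = rng.standard_normal((512, 128)) * 0.05, np.zeros(128)
W3, b3 = rng.standard_normal((128, 1)) * 0.05, np.zeros(1)

# Input: the user vector and blog post vector concatenated (128 + 128 = 256).
x = np.concatenate([rng.standard_normal(128), rng.standard_normal(128)])

h1 = relu(x @ W1 + b1)
h2 = relu(h1 @ W2 + b2)
probability = sigmoid(h2 @ W3 + b3)  # single value in (0, 1)
```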

In the following we set up two cases we would like to compare. The first is where the imported neural network is evaluated on the content node using Vespa’s native tensors. In the other we run TensorFlow directly on the stateless container node in the Vespa 2-tier architecture. In this case, the additional data required to evaluate the TensorFlow model must be passed back from the content node(s) to the container node. We use Vespa’s fbench utility to stress the system under fairly heavy load.

In this first test, we set up the system on a single host. This means the container and content nodes are running on the same host. We set up fbench so it uses 64 clients in parallel to query this system as fast as possible. 1000 documents per query are evaluated in the first phase and the top 200 documents are evaluated in the second phase. In the following, latency is measured in ms at the 95th percentile and QPS is the actual query rate in queries per second:

  • Baseline: 19.68 ms / 3251.80 QPS
  • Baseline with additional data: 24.20 ms / 2644.74 QPS
  • Vespa ranking: 42.8 ms / 1495.02 QPS
  • TensorFlow batch ranking: 42.67 ms / 1499.80 QPS
  • TensorFlow single ranking: 103.23 ms / 619.97 QPS

Some explanation is in order. The baseline here is the first phase ranking only without returning the additional data required for full ranking. The baseline with additional data is the same but returns the data required for ranking. Vespa ranking evaluates the model on the content backend. Both TensorFlow tests evaluate the model after content has been sent to the container. The difference is that batch ranking evaluates the model in one pass by batching the 200 documents together in a larger matrix, while single evaluates the model once per document, i.e. 200 evaluations. The reason why we test this is that Vespa evaluates the model once per document to be able to evaluate during matching, so in terms of efficiency this is a fairer comparison.
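
The batch/single distinction can be illustrated on a single layer: batch mode stacks the 200 candidate inputs into one matrix multiplication, while single mode performs 200 separate passes. The weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

n_docs, input_size, hidden = 200, 256, 512
W = rng.standard_normal((input_size, hidden)) * 0.05  # placeholder layer weights
inputs = rng.standard_normal((n_docs, input_size))    # one row per candidate document

# Batch mode: all 200 candidates in one matrix multiplication.
batch_out = inputs @ W

# Single mode: one evaluation per document, i.e. 200 separate passes.
single_out = np.stack([inputs[i] @ W for i in range(n_docs)])

# Both modes compute the same scores; they differ only in efficiency.
assert np.allclose(batch_out, single_out)
```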

We see in the numbers above that, for this application, Vespa ranking and TensorFlow batch ranking achieve similar performance. This means that the gains from ranking batch-wise are offset by the cost of transferring data to TensorFlow. This isn’t an entirely fair comparison, however, as the model evaluation architectures of Vespa and TensorFlow differ significantly. For instance, we measure that TensorFlow has a much lower cache-miss rate. One reason is that batch ranking necessitates a more contiguous data layout. In contrast, relevant document data can be spread out over the entire available memory on the Vespa content nodes.

Another significant reason is that Vespa currently uses double floating point precision in ranking and in tensors. In the above TensorFlow model we have used floats, resulting in half the required memory bandwidth. We are considering making the floating point precision in Vespa configurable to improve evaluation speed for cases where full precision is not necessary, such as in most machine learned models.
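
The bandwidth difference is easy to see from the element sizes alone:

```python
import numpy as np

# A 256-element input vector in double vs. single precision.
v64 = np.zeros(256, dtype=np.float64)  # double precision, as Vespa uses today
v32 = np.zeros(256, dtype=np.float32)  # single precision, as in the TensorFlow model

# Single precision needs half the bytes, and hence half the memory bandwidth.
assert v64.nbytes == 2 * v32.nbytes
```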

So we still have some work to do in optimizing our tensor evaluation pipeline, but we are pleased with our results so far. Now, the performance of the model evaluation itself is only a part of the system-wide performance. In order to rank with TensorFlow, we need to move data to the host running TensorFlow. This is not free, so let’s delve a bit deeper into this cost.

The locality of data in relation to where the ranking computation takes place is an important aspect and indeed a core design point of Vespa. If your data is too large to fit on a single machine, or you want to evaluate your model on more data points faster than is possible on a single machine, you need to split your data over multiple nodes. Let’s assume that documents are distributed randomly across all content nodes, which is a very reasonable thing to do. Now, when you need to find the globally top-N documents for a given query, you first need to find the set of candidate documents that match the query. In general, if ranking is done on some node other than where the content is, all the data required for the computation obviously needs to be transferred there. The candidate set can usually be large, so this incurs a significant cost in network activity, particularly as the number of content nodes increases. This approach can become infeasible quite quickly.

This is why a core design aspect of Vespa is to evaluate models where the content is stored.


This is illustrated in the figure above. The problem of transferring data for ranking is compounded as the number of content nodes increases, because to find the global top-N documents, the top-K documents of each content node need to be passed to the external ranker. This means that, if we have C content nodes, we need to transfer C*K documents over the network. This runs into hard network limits as the number of documents and the data size per document increase.
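
As a purely illustrative cost model, the number of documents crossing the network per query grows linearly with the number of content nodes:

```python
# Each of the C content nodes sends its local top-K candidates to the
# external ranker, so C * K documents cross the network per query.
def documents_transferred(content_nodes, top_k_per_node):
    return content_nodes * top_k_per_node

# With K = 200 as in the benchmark above:
assert documents_transferred(1, 200) == 200    # single node
assert documents_transferred(10, 200) == 2000  # ten nodes: 10x the traffic
```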

Let’s see the effect of this when we change the setup of the same application to run on three content nodes and a single stateless container which runs TensorFlow. In the following graph we plot the 95th percentile latency as we increase the number of parallel requests (clients) from 1 to 30:


Here we see that with low traffic, TensorFlow and Vespa are comparable in terms of latency. When we increase the load however, the cost of transmitting the data is the driver for the increase in latency for TensorFlow, as seen in the red line in the graph. The differences between batch and single mode TensorFlow evaluation become smaller as the system as a whole becomes largely network-bound. In contrast, the Vespa application scales much better.

Now, as we increase traffic even further, will the Vespa solution likewise become network-bound? In the following graph we plot the sustained requests per second as we increase clients to 200:


Vespa ranking is unable to sustain the same QPS as just transmitting the data (the blue line), which hints that the system has become CPU-bound on evaluating the model in Vespa. While Vespa can sustain around 3500 QPS, the TensorFlow solution maxes out at 350 QPS, which is reached quite early as we increase traffic. As the system is unable to transmit data fast enough, latency necessarily increases, which causes the linear growth seen in the latency graph above. At 200 clients the average latency of the TensorFlow solution is around 600 ms, while Vespa’s is around 60 ms.

So, the obvious key takeaway here is that, from a scalability point of view, it is beneficial to avoid sending data around for evaluation. That is a key design point of Vespa, and also why we implemented TensorFlow support in the first place. Running the models where the content is allows for better utilization of resources, but perhaps the more interesting aspect is the ability to run more complex or deeper models while still being able to scale the system.

Join us at the Big Data Technology Warsaw Summit on February 27th for Scalable Machine-Learned Model Serving

Online evaluation of machine-learned models (model serving) is difficult to scale to large datasets. Vespa is an open source big data serving solution used to solve this problem and is in use today on some of the largest such systems in the world. These systems evaluate models over millions of data points per request for hundreds of thousands of requests per second.

If you’re in Warsaw on February 27th, please join Jon Bratseth (Distinguished Architect, Verizon Media) at the Big Data Technology Warsaw Summit, where he’ll share “Scalable machine-learned model serving” and answer any questions. Big Data Technology Warsaw Summit is a one-day conference with technical content focused on big data analysis, scalability, storage, and search. There will be 27 presentations and more than 500 attendees are expected.

Jon’s talk will explore the problem and architectural solution, show how Vespa can be used to achieve scalable serving of TensorFlow and ONNX models, and present benchmarks comparing performance and scalability to TensorFlow Serving.

Hope to see you there!

Fine-tuning a BERT model with transformers

Set up a custom Dataset, fine-tune BERT with the Transformers Trainer, and export the model via ONNX.

This post describes a simple way to get started with fine-tuning transformer models. It will cover the basics and introduce you to the amazing Trainer class from the transformers library. I will leave important topics such as hyperparameter tuning, cross-validation and more detailed model validation to follow-up posts.


Photo by Samule Sun on Unsplash

We use a dataset built from the COVID-19 Open Research Dataset Challenge. This work is one small piece of a larger project to build the cord19 search app.

You can run the code from Google Colab but do not forget to enable GPU support.

Install required libraries

pip install pandas scikit-learn torch transformers

Load the dataset

In order to fine-tune the BERT models for the cord19 application we need to generate a set of query-document features as well as labels that indicate which documents are relevant for the specific queries. For this exercise we will use the query string to represent the query and the title string to represent the documents.

from pandas import read_csv

training_data = read_csv("")

Table 1

There are 50 unique queries.


For each query we have a list of documents, divided between relevant (label=1) and irrelevant (label=0).

training_data[["title", "label"]].groupby("label").count()

Table 2

Data split

We are going to use a simple split of the data into train and validation sets for illustration purposes. Even though we have more than 50 thousand data points when we consider unique query-document pairs, I believe this specific case would benefit from cross-validation, since it has only 50 queries with relevance judgements.

from sklearn.model_selection import train_test_split

# The split arguments below are a reasonable reconstruction; the original
# snippet was truncated here.
train_queries, val_queries, train_docs, val_docs, train_labels, val_labels = train_test_split(
    training_data["query"].tolist(),
    training_data["title"].tolist(),
    training_data["label"].tolist(),
    test_size=0.2
)

Create BERT encodings

Create train and validation encodings.
In order to do that we need to choose which BERT model to use.
We will use padding and truncation
because the training routine expects all tensors within a batch to have the same dimensions.

from transformers import BertTokenizerFast

model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

train_encodings = tokenizer(train_queries, train_docs, truncation=True, padding='max_length', max_length=128)
val_encodings = tokenizer(val_queries, val_docs, truncation=True, padding='max_length', max_length=128)

Create a custom dataset

Now that we have the encodings and the labels we can create a Dataset object as described in the transformers webpage about custom datasets.

import torch

class Cord19Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = Cord19Dataset(train_encodings, train_labels)
val_dataset = Cord19Dataset(val_encodings, val_labels)

Fine-tune the BERT model

We are going to use BertForSequenceClassification, since we are trying to classify query and document pairs into two distinct classes (non-relevant, relevant).

from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained(model_name)

We can set requires_grad to False for all the base model parameters in order to fine-tune only the task-specific parameters.

for param in model.base_model.parameters():
    param.requires_grad = False

We can then fine-tune the model with Trainer. Below is a basic routine with an out-of-the-box set of parameters. Care should be taken when choosing the parameters, but that is beyond the scope of this piece.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # output directory
    evaluation_strategy="epoch",     # Evaluation is done at the end of each epoch.
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    save_total_limit=1,              # limit the total amount of checkpoints. Deletes the older checkpoints.
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()


Export the model to ONNX

Once training is complete we can export the model using the ONNX format to be deployed elsewhere. I assume below that you have access to a GPU, which you can get from Google Colab for example.

from torch.onnx import export

device = torch.device("cuda")
model.to(device)

model_onnx_path = "model.onnx"
dummy_input = (
    # one example from the training set as the dummy input (reconstructed;
    # the original snippet was truncated here)
    train_dataset[0]["input_ids"].unsqueeze(0).to(device),
    train_dataset[0]["token_type_ids"].unsqueeze(0).to(device),
    train_dataset[0]["attention_mask"].unsqueeze(0).to(device),
)
input_names = ["input_ids", "token_type_ids", "attention_mask"]
output_names = ["logits"]
export(
    model, dummy_input, model_onnx_path, input_names=input_names,
    output_names=output_names, verbose=False, opset_version=11
)

Fine-tuning a BERT model for search applications

Thiago Martins


Vespa Data Scientist

How to ensure training and serving encoding compatibility

There are cases where the inputs to your Transformer model are pairs of sentences, but you want to process each sentence of the pair at different times due to your application’s nature.


Photo by Alice Dietrich on Unsplash

The search use case

Search applications are one example. They involve a large collection of documents that can be pre-processed and stored before a search action is required. On the other hand, a query triggers a search action, and we can only process it in real-time. Search apps’ goal is to return the most relevant documents to the query as quickly as possible. By applying the tokenizer to the documents as soon as we feed them to the application, we only need to tokenize the query when a search action is required, saving time.

In addition to applying the tokenizer at different times, you also want to retain adequate control over how your pair of sentences is encoded. For search, you might want a joint input vector of length 128 where the query, which is usually smaller than the document, contributes 32 tokens while the document can take up to 96 tokens.
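
A quick sanity check of that 32/96 split, accounting for the [CLS] and [SEP] special tokens that must fit inside each budget:

```python
# Token budget for a joint input of length 128, using the 32/96 split
# described above. Special tokens take part of each budget.
query_input_size = 32   # [CLS] + up to 30 query tokens + [SEP]
doc_input_size = 96     # up to 95 document tokens + the final [SEP]

max_query_tokens = query_input_size - 2  # minus [CLS] and the first [SEP]
max_doc_tokens = doc_input_size - 1      # minus the final [SEP]

total = 1 + max_query_tokens + 1 + max_doc_tokens + 1  # [CLS] q [SEP] d [SEP]
assert total == query_input_size + doc_input_size == 128
```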

Training and serving compatibility

When training a Transformer model for search, you want to ensure that the training data will follow the same pattern used by the search engine serving the final model. I have written a blog post on how to get started with BERT model fine-tuning using the transformer library. This piece will adapt the training routine with a custom encoding based on two separate tokenizers to reproduce how a Vespa application would serve the model once deployed.

Create independent BERT encodings

The only change required is simple but essential. In my previous post, we discussed the vanilla case where we simply applied the tokenizer directly to the pairs of queries and documents.

from transformers import BertTokenizerFast

model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

train_encodings = tokenizer(train_queries, train_docs, truncation=True, padding='max_length', max_length=128)
val_encodings = tokenizer(val_queries, val_docs, truncation=True, padding='max_length', max_length=128)

In the search case, we create the create_bert_encodings function that will apply two different tokenizers, one for the query and the other for the document. In addition to allowing for different query and document max_length, we also need to set add_special_tokens=False and not use padding, as those need to be included by our custom code when joining the tokens generated by the tokenizer.

def create_bert_encodings(queries, docs, tokenizer, query_input_size, doc_input_size):
    queries_encodings = tokenizer(
        queries, truncation=True, max_length=query_input_size-2, add_special_tokens=False
    )
    docs_encodings = tokenizer(
        docs, truncation=True, max_length=doc_input_size-1, add_special_tokens=False
    )

    TOKEN_NONE = 0   # [PAD] id in the BERT uncased vocabulary
    TOKEN_CLS = 101  # [CLS]
    TOKEN_SEP = 102  # [SEP]

    input_ids = []
    token_type_ids = []
    attention_mask = []
    for query_input_ids, doc_input_ids in zip(queries_encodings["input_ids"], docs_encodings["input_ids"]):
        # create input id
        input_id = [TOKEN_CLS] + query_input_ids + [TOKEN_SEP] + doc_input_ids + [TOKEN_SEP]
        number_tokens = len(input_id)
        padding_length = max(query_input_size + doc_input_size - number_tokens, 0)
        input_id = input_id + [TOKEN_NONE] * padding_length
        input_ids.append(input_id)
        # create token type id
        token_type_id = [0] * len([TOKEN_CLS] + query_input_ids + [TOKEN_SEP]) + [1] * len(doc_input_ids + [TOKEN_SEP]) + [TOKEN_NONE] * padding_length
        token_type_ids.append(token_type_id)
        # create attention_mask
        attention_mask.append([1] * number_tokens + [TOKEN_NONE] * padding_length)

    encodings = {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask
    }
    return encodings

We then create the train_encodings and val_encodings required by the training routine. Everything else on the training routine works just the same.

from transformers import BertTokenizerFast

model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

train_encodings = create_bert_encodings(
    queries=train_queries,
    docs=train_docs,
    tokenizer=tokenizer,
    query_input_size=32,
    doc_input_size=96
)

val_encodings = create_bert_encodings(
    queries=val_queries,
    docs=val_docs,
    tokenizer=tokenizer,
    query_input_size=32,
    doc_input_size=96
)

Conclusion and future work

Training a model to deploy in a search application requires us to ensure that the training encodings are compatible with the encodings used at serving time. We generate document encodings offline when feeding the documents to the search engine, while query encodings are created at run-time when a query arrives. It is often relevant to use different maximum lengths for queries and documents, among other possible configurations.


Photo by Steve Johnson on Unsplash

We showed how to customize BERT model encodings to ensure this training and serving compatibility. However, a better approach is to build tools that bridge the gap between training and serving by allowing users to request training data that, by default, respects the encodings used when serving the model. pyvespa will include such an integration to make it easier for Vespa users to train BERT models without having to adjust the encoding generation manually as we did above.

Stateful model serving: how we accelerate inference using ONNX Runtime

By Lester Solbakken from Verizon Media and Pranav Sharma from Microsoft.

There’s a difference between stateless and stateful model serving.

Stateless model serving is what one usually thinks about when using a
machine-learned model in production. For instance, a web application handling
live traffic can call out to a model server from somewhere in the serving
stack. The output of this model service depends purely on the input. This is
fine for many tasks, such as classification, text generation, object detection,
and translation, where the model is evaluated once per query.

There are, however, some applications where the input is combined with stored
or persisted data to generate a result. We call this stateful model evaluation.
Applications such as search and recommendation need to evaluate models with a
potentially large number of items for each query. A model server can quickly
become a scalability
bottleneck in these
cases, regardless of how efficient the model inference is.

In other words, stateless model serving requires sending all necessary input
data to the model. In stateful model serving, the model should be computed
where the data is stored.

At Vespa.ai, we are concerned with efficient stateful
model evaluation. Vespa is an open-source platform for building
applications that do real-time data processing over large data sets. Designed
to be highly performant and web-scalable, it is used for such diverse tasks as
search, personalization, recommendation, ads, auto-complete, image and
similarity search, comment ranking, and even for finding

It has become increasingly important for us to be able to evaluate complex
machine-learned models efficiently. Delivering low latency, fast inference and
low serving cost is challenging while at the same time providing support for
the various model training frameworks.

We eventually chose to leverage ONNX
Runtime (ORT) for this task. ONNX
Runtime is an accelerator for model inference. It has vastly increased
Vespa’s capacity for evaluating large models, both in performance and in the model
types we support. ONNX Runtime’s capabilities within hardware acceleration and
model optimizations, such as quantization, have enabled efficient evaluation of
large NLP models like BERT and other Transformer models in Vespa.

In this post, we’ll share our journey on why and how we eventually chose ONNX
Runtime and share some of our experiences with it.

About Vespa

Vespa has a rich history. Its lineage comes from a search engine born in 1997.
Initially powering the web search at alltheweb.com, it was flexible
enough to be used in various more specialized products, or verticals, such as
document search, mobile search, yellow pages, and banking. This flexibility in
being a vertical search platform eventually gave rise to its name, Vespa.

The technology was acquired by Yahoo in 2003. There, Vespa cemented itself as a
core piece of technology that powers hundreds of applications, including many
of Yahoo’s most essential services. We open-sourced Vespa in 2017 and today it
serves hundreds of thousands of queries per second worldwide at any given time,
with billions of content items for hundreds of millions of users.

Although Yahoo was eventually acquired by Verizon, it is interesting to note
that our team has stayed remarkably stable over the years. Indeed, a few of the
engineers that started working on that initial engine over 20 years ago are
still here. Our team counts about 30 developers, and we are situated in
Trondheim in Norway.

Building upon experience gained over many years, Vespa has evolved
substantially to become what it is today. It now stands as a battle-proven
general engine for real-time computation over large data sets. It has many
features that make it
suitable for web-scale applications. It stores and indexes data with instant
writes so that queries, selection, and processing over the data can be
performed efficiently at serving time. It’s elastic and fault-tolerant, so
nodes can be added, removed, or replaced live without losing data. It’s easy to
configure, operate, and add custom logic. Importantly, it contains built-in
capabilities for advanced computation, including machine-learned models.

Vespa applications

Vespa architecture

A Vespa application is a distributed system consisting of stateless nodes and a set
of stateful content nodes containing the data. A Vespa application is fully
defined in an application package. This is a single unit containing everything
needed to set up an application, including all configuration, custom
components, schemas, and machine-learned models. When the application package
is deployed, the admin layer takes care of configuring all the services across
all the system’s nodes. This includes distributing all models to all content
nodes.

Application packages contain one or more document schemas. The schema primarily
consists of:

  • The data fields for each document and how they should be stored and indexed.
  • The ranking profiles which define how each document should be scored during query handling.

The ranking profiles contain ranking expressions, which are mathematical
expressions combining ranking features.
Some features retrieve data from sources such as the query, stored data, or
constants. Others compute or aggregate data in various ways. Ranking profiles
support multi-phased evaluation, so a cheap model can be evaluated in the first
phase and a more expensive model for the second. Both sparse and dense
tensors are
supported for more advanced computation.
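
A toy sketch of this multi-phased evaluation, with stand-in scoring functions (not Vespa's actual ranking code): every match is scored by a cheap first-phase function, and only the best candidates pay for the expensive second phase:

```python
def cheap_score(doc):
    # First-phase stand-in: a trivially cheap feature lookup.
    return doc["popularity"]

def expensive_score(doc):
    # Second-phase stand-in for a costly model evaluation.
    return doc["popularity"] * 0.3 + doc["quality"] * 0.7

def rank(documents, rerank_count):
    # First phase: evaluated for every matching document.
    first = sorted(documents, key=cheap_score, reverse=True)
    # Second phase: only the top candidates pay the expensive cost.
    candidates = first[:rerank_count]
    return sorted(candidates, key=expensive_score, reverse=True)

docs = [
    {"id": 1, "popularity": 0.9, "quality": 0.2},
    {"id": 2, "popularity": 0.8, "quality": 0.9},
    {"id": 3, "popularity": 0.1, "quality": 1.0},
]
top = rank(docs, rerank_count=2)  # doc 3 never reaches the second phase
```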

After the application is deployed, it is ready to handle data writes and
queries. Data feeds are first processed on the stateless layer before content
is distributed (with redundancy) to the content nodes. Similarly, queries go
through the stateless layer before being fanned out to the content nodes where
data-dependent computation is handled. They return their results back to the
stateless layer, where the globally best results are determined, and a response
is ultimately returned.

A guiding principle in Vespa is to move computation to the data rather than
the other way around. Machine-learned models are automatically deployed to all
content nodes and evaluated there for each query. This alleviates the cost of
query-time data transportation. Also, as Vespa takes care of distributing
data to the content nodes and redistributing it elastically, one can scale up
computationally by adding more content nodes, thus distributing computation as
well.

In summary, Vespa offers ease of deployment, flexibility in combining many
types of models and computations out of the box without any plugins or
extensions, efficient evaluation without moving data around, and a less complex
system to maintain. This makes Vespa an attractive platform.


ONNX in Vespa

In the last few years, it has become increasingly important for Vespa to
support various types of machine learned models from different frameworks. This
led to us introducing initial support for ONNX models in 2018.

The Open Neural Network Exchange (ONNX) is an open standard
for distributing machine learned models between different systems. The goal of
ONNX is interoperability between model training frameworks and inference
engines, avoiding any vendor lock-in. For instance, HuggingFace’s Transformer
library includes export to ONNX, PyTorch has native ONNX export, and TensorFlow
models can be converted to ONNX. From our perspective, supporting ONNX is
obviously interesting as it would maximize the range of models we could support.

To support ONNX in Vespa, we introduced a special onnx ranking feature. When
used in a ranking expression this would instruct the framework to evaluate the
ONNX model. This is one of the unique features of Vespa, as one has the
flexibility to combine results from various features and string models
together. For instance, one could use a small, fast model in an early phase,
and a more complex and computationally expensive model that only runs on the
most promising candidates. For instance:

document my_document {
  field text_embedding type tensor(x[769]) {
    indexing: attribute | index
    attribute {
      distance-metric: euclidean
    }
  }
  field text_tokens type tensor(d0[256]) {
    indexing: summary | attribute
  }
}

onnx-model my_model {
  file: files/my_model.onnx
  input input_0: ...
  input input_1: ...
  output output_0: ...
}

rank-profile my_profile {
  first-phase {
    expression: closeness(field, text_embedding)
  }
  second-phase {
    rerank-count: 10
    expression: onnx(my_model)
  }
}

This is an example of configuring Vespa to calculate the euclidean distance
between a query vector and the stored text_embedding vector in the first stage.
This is usually used together with an approximate nearest neighbor search. The
top 10 candidates are sent to the ONNX model in the second stage. Note that
this happens per content node, so with 10 content nodes, the model is
effectively run on 100 candidates.

The model is set up in the onnx-model section. The file refers to an ONNX model
somewhere in the application package. Inputs to the model, while not actually
shown here for brevity, can come from various sources such as constants, the
query, a document, or some combination expressed through a user-defined
function. While the outputs of models are tensors, the resulting value of a
first or second phase expression needs to be a single scalar, as documents are
sorted according to this score before being returned.

Our initial implementation of the onnx ranking feature was to import the ONNX
model and convert the entire graph into native Vespa expressions. This was
feasible because of the flexibility of the various tensor
operations Vespa supports. For instance, a single neural network layer could be
converted like this:

Model rank expression

Here, weights and bias would be stored as constant tensors, whereas the input
tensor could be retrieved either from the query, a document field, or some
combination of both.

Initially, this worked fine. We implemented the various ONNX operators using
the available tensor operations. However, we only supported a subset of the
150+ ONNX operators at first, as we considered only certain types of models
viable for use in Vespa due to its low-latency requirements. For instance, the
ranking expression language does not support iterations, making it more
challenging to implement operators used in convolutional or recurrent neural
networks. Instead, we opted to continuously add operator support as new model
types were put to use in Vespa.

The advantage of this was that the various optimizations we introduced to our
tensor evaluation engine to efficiently evaluate the models benefitted all
other applications using tensors as well.

ONNX Runtime

Unfortunately, we ran into problems as we started developing support for
Transformer models. Our first attempt at supporting a 12-layer BERT-base model,
converted from TensorFlow to ONNX, failed: the evaluation result was incorrect,
and performance was relatively poor.

We spent significant efforts on this. Quite a few operators had to be rewritten
due to, sometimes subtle, edge cases. We introduced a dozen or so
performance optimizations, to avoid doing silly stuff such as calculating the
same expressions multiple times and allocating memory unnecessarily.
Ultimately, we were able to increase performance by more than two orders of
magnitude.

During this development we turned to ONNX
Runtime for reference. ONNX Runtime
is easy to use:

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
output = session.run(output_names=[...], input_feed={...})

This was invaluable, providing us with a reference for correctness and a
performance target.

At one point, we started toying with the idea of actually using ONNX Runtime
directly for the model inference instead of converting it to ranking
expressions. The Vespa content node is written in C++, so this entailed
integrating with the C++ interface of ONNX Runtime. It should be mentioned that
adding dependencies to Vespa is not something we often do, as we prefer to
avoid dependencies and thus own the entire stack.

Within a couple of weeks, we had a proof of concept up and running which showed
a lot of promise. So we decided to go ahead and start using ONNX Runtime to
evaluate all ONNX models in Vespa.

This proved to be a game-changer for us. It vastly increased the capabilities
of evaluating large deep-learning models in Vespa, both in terms of the model
types we support and in evaluation performance. We can leverage ONNX Runtime’s
use of a compute library containing processor-optimized kernels. ONNX Runtime
also contains model-specific optimizations for BERT models (such as multi-head
attention node fusion) and makes it easy to evaluate precision-reduced models
by quantization for even more efficient inference.

ONNX Runtime in Vespa

Consider the following:

onnx-model my_model {
  file: files/my_model.onnx
  input input_0: query(my_query_input)
  input input_1: attribute(my_doc_field)
}

rank-profile my_profile {
  ...
}

Accelerating stateless model evaluation on Vespa

A central architectural feature of Vespa is the division of work between the
stateless container cluster and the content cluster.

Most computation, such as evaluating machine-learned models, happens in
the content cluster. However, it has become increasingly important to
efficiently evaluate models in the container cluster as well, to
process or transform documents or queries before storage or execution.
One prominent example is to generate a vector representation of natural
language text for queries and documents for nearest neighbor retrieval.

We have recently implemented accelerated model evaluation using ONNX Runtime in
the stateless cluster, which opens up new usage areas for Vespa.


In Vespa, we differentiate between stateful and stateless machine-learned
model evaluation. Stateless model evaluation is what one usually thinks about
when serving machine-learned models in production. For instance, one might have
a stand-alone model server that is called from somewhere in a serving stack.
The result of evaluating a model there only depends upon its input.

In contrast, stateful model serving combines input with stored or persisted
data. This poses some additional challenges. One is that models typically need
to be evaluated many times per query, once per data point. This has been a
focus area of Vespa for quite some time, and we have previously written about
how we accelerate stateful model evaluation in Vespa using ONNX Runtime.

However, stateless model evaluation has its place in Vespa as well. For
instance, transforming query input or document content using Transformer
models. Or finding a vector representation for an image for image similarity
search. Or translating text to another language. The list goes on. Vespa has
actually had stateless model evaluation for some time, but we’ve recently added
acceleration of ONNX models using ONNX Runtime. This makes the feature much
more powerful and opens up some new use cases for Vespa. In this post, we’ll
take a look at some capabilities this enables:

  • The automatically generated REST API for model serving.
  • Creating lightweight request handlers for serving models with some custom
    code without the need for content nodes.
  • Adding model evaluation to searchers for query processing and enrichment.
  • Adding model evaluation to document processors for transforming content
    before ingestion.
  • Batch-processing results from the ranking back-end for an additional
    ranking phase.

We’ll start with a quick overview of the difference between where we evaluate
machine-learned models in Vespa applications.

Container and content nodes

Vespa is a distributed application consisting of various types of services on
multiple nodes. A Vespa application is fully defined in an application package.
This single unit contains everything needed to set up an application, including
all configuration, custom components, schemas, and machine-learned models. When
the application package is deployed, the admin cluster takes care of
configuring all the services across all the system’s nodes, including
distributing all models to the nodes that need them.

Vespa architecture

The container nodes process queries or documents before passing them on to the
content nodes. So, when a document is fed to Vespa, content can be transformed
or added before being stored. Likewise, queries can be transformed or enriched
in various ways before being sent for further processing.

The content nodes are responsible for persisting data. They also do most of the
required computation when responding to queries. As that is where the data is,
this avoids the cost of transferring data across the network. Query data is
combined with document data to perform this computation in various ways.

We thus differentiate between stateless and stateful machine-learned model
evaluation. Stateless model evaluation happens on the container nodes and is
characterized by a single model evaluation per query or document. Stateful
model evaluation
happens on the content nodes, and the model is typically
evaluated a number of times using data from both the query and the document.

The exact configuration of the services on the nodes is specified in
services.xml. Here, the number of container and content nodes, and their
capabilities, are fully configured. Indeed, a Vespa application does not need
to be set up with any content nodes at all, purely running stateless container
code, including serving machine-learned models.

This makes it easy to deploy applications. It offers a lot of flexibility
in combining many types of models and computations out of the box without any
plugins or extensions. In the next section, we’ll see how to set up stateless
model evaluation.

Stateless model evaluation

So, by stateless model evaluation we mean machine-learned models that are
evaluated on Vespa container nodes. This is enabled by simply adding the
model-evaluation tag in services.xml:

<container version="1.0">
  <model-evaluation/>
</container>
When this is specified, Vespa scans through the models directory in the
application package to find any importable machine-learned models. Currently
supported models are TensorFlow, ONNX, XGBoost, LightGBM, and Vespa’s own
stateless models.

There are two effects of this. The first is that a REST API for model discovery
and evaluation is automatically enabled. The other is that custom
components can have
a special ModelsEvaluator object dependency injected into their constructors.

Stateless model evaluation

In the following, we’ll take a look at some of the usages of these, using the
model-evaluation sample application for demonstration.


The automatically added REST API provides an API for model discovery and
evaluation. This is great for using Vespa as a standalone model server, or
making models available for other parts of the application stack.

To get a list of imported models, call http://host:port/model-evaluation/v1.
For instance:

$ curl -s 'http://localhost:8080/model-evaluation/v1/'
{
    "pairwise_ranker": "http://localhost:8080/model-evaluation/v1/pairwise_ranker",
    "transformer": "http://localhost:8080/model-evaluation/v1/transformer"
}

This application has two models: the transformer model and the
pairwise_ranker model. We can inspect a model to see its expected inputs and
outputs:
$ curl -s 'http://localhost:8080/model-evaluation/v1/transformer/output'
{
    "arguments": [
        {
            "name": "input",
            "type": "tensor(d0[],d1[])"
        },
        {
            "name": "onnxModel(transformer).output",
            "type": "tensor<float>(d0[],d1[],d2[16])"
        }
    ],
    "eval": "http://localhost:8080/model-evaluation/v1/transformer/output/eval",
    "function": "output",
    "info": "http://localhost:8080/model-evaluation/v1/transformer/output",
    "model": "transformer"
}

All model inputs and outputs are Vespa tensors. See the tensor user
guide for more information.

This model has one input, with tensor type tensor(d0[],d1[]). This tensor has
two dimensions: d0 is typically a batch dimension, and d1 represents, for this
model, a sequence of tokens. The output, of type tensor<float>(d0[],d1[],d2[16]),
adds a dimension d2 which represents the embedding dimension. So the output is
an embedding representation for each token of the input.
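To make these tensor types concrete, here is a small pure-Python sketch (an illustration, not Vespa or the real transformer) of the shapes involved: an input with unbound dimensions d0 (batch) and d1 (tokens) maps to an output with a fixed d2[16] embedding dimension. The toy model and the token values below are entirely made up:

```python
EMBEDDING_SIZE = 16  # the fixed d2 dimension from the output type

def toy_transformer(batch):
    # batch: a d0 x d1 nested list of token ids (both dimensions unbound).
    # Returns one 16-dimensional "embedding" per token: shape d0 x d1 x 16.
    return [[[float(token)] * EMBEDDING_SIZE for token in sequence]
            for sequence in batch]

output = toy_transformer([[101, 2023, 102]])  # one sequence of three tokens
print(len(output), len(output[0]), len(output[0][0]))  # 1 3 16
```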

By calling /model-evaluation/v1/transformer/eval and passing a URL-encoded
input parameter, Vespa evaluates the model and returns the result as a
JSON-encoded tensor.

Please refer to the sample application for a runnable example.
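As a sketch, such an eval call could be constructed from Python like this. The endpoint and the tensor literal syntax follow the text above; the parameter name input (matching the model’s input name) and the token values are assumptions for illustration:

```python
from urllib.parse import urlencode

# Vespa tensor literal for a tensor(d0[],d1[]) input:
# one batch entry (d0:0) with three tokens along d1. Token values are made up.
input_tensor = "{{d0:0,d1:0}:101.0, {d0:0,d1:1}:2023.0, {d0:0,d1:2}:102.0}"

base = "http://localhost:8080/model-evaluation/v1/transformer/eval"
url = base + "?" + urlencode({"input": input_tensor})
print(url)
# The request itself could then be issued with, e.g., urllib.request.urlopen(url).
```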

Request handlers

The REST API takes exactly the same input as the models it serves. In some
cases one might want to pre-process the input before providing it to the model.
A common example is to tokenize natural language text before passing the token
sequence to a language model such as BERT.

Vespa provides request handlers, which let applications implement arbitrary
HTTP APIs. With custom request handlers, arbitrary code can be run both before
and after model evaluation.

When the model-evaluation tag has been supplied, Vespa makes a special
ModelsEvaluator object available which can be injected into a component
(such as a request handler):

public class MyHandler extends ThreadedHttpRequestHandler {

    private final ModelsEvaluator modelsEvaluator;

    public MyHandler(ModelsEvaluator modelsEvaluator, Context context) {
        super(context);
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public HttpResponse handle(HttpRequest request) {

        // Get the input
        String inputString = request.getProperty("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate the model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor result = evaluator.bind("input", input).evaluate();

        // Perform any post-processing to the tensor
        // ...

        // Create and return a response from the result (not shown)
        // ...
    }
}

A full example can be seen in the MyHandler class in the sample application,
and in its unit tests.

As mentioned, arbitrary code can be run here. Pragmatically, it is often more
convenient to put the processing pipeline in the model itself. While not always
possible, this helps protect against divergence between the data processing
pipeline in training and in production.

Document processors

The REST API and request handler can work with a purely stateless application,
such as a model server. However, it is much more common for applications to
have content. As such, it is fairly common to process incoming documents before
storing them. Vespa provides a chain of document
for this.

Applications can implement custom document processors, and add them to the
processing chain. In the context of model evaluation, a typical task is to use a
machine-learned model to create a vector representation for a natural language
text. The text is first tokenized, then run though a language model such as
BERT to generate a vector representation which is then stored. Such a vector
representation can be for instance used in nearest neighbor
search. Other examples
are sentiment analysis, creating representations of images, object detection,
translating text, and so on.

The ModelsEvaluator can be injected into your component as already seen:

public class MyDocumentProcessor extends DocumentProcessor {

    private final ModelsEvaluator modelsEvaluator;

    public MyDocumentProcessor(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (op instanceof DocumentPut) {
                DocumentPut put = (DocumentPut) op;
                Document document = put.getDocument();

                // Get tokens
                Tensor tokens = (Tensor) document.getFieldValue("tokens").getWrappedValue();

                // Perform any pre-processing to the tensor
                // ...

                // Evaluate the model
                FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
                Tensor result = evaluator.bind("input", tokens).evaluate();

                // Reshape and extract the embedding vector (not shown)
                Tensor embedding = ...

                // Set embedding in document
                document.setFieldValue("embedding", new TensorFieldValue(embedding));
            }
        }
        return Progress.DONE;
    }
}

Notice that the code looks a lot like the previous example for the request
handler. The document processor receives a pre-constructed ModelsEvaluator from
Vespa which contains the transformer model. This code takes the tensor
contained in the tokens field, runs it through the transformer model, and puts
the resulting embedding into a new field, which is then stored along with the
document.
Again, a full example can be seen in the MyDocumentProcessor class in the
sample application, and in its unit tests.

Searchers: query processing

Similar to document processing, queries are processed along a chain of
searchers. Vespa provides a default chain of searchers for various tasks, and
applications can provide additional custom searchers as well. In the context of
model evaluation, the use cases are similar to document processing: a typical
task for text search is to generate vector representations for nearest neighbor
search.

Again, the ModelsEvaluator can be injected into your component:

public class MySearcher extends Searcher {

    private final ModelsEvaluator modelsEvaluator;

    public MySearcher(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Result search(Query query, Execution execution) {

        // Get the query input
        String inputString = query.properties().getString("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor output = evaluator.bind("input", input).evaluate();

        // Reshape and extract the embedding vector (not shown)
        Tensor embedding = ...

        // Add this tensor to query
        query.getRanking().getFeatures().put("query(embedding)", embedding);

        // Continue processing
        return execution.search(query);
    }
}

As before, a full example can be seen in the MySearcher class in the sample
application, and in its unit tests.

Searchers: result post-processing

Searchers don’t just process queries before being sent to the back-end: they
are just as useful in post-processing the results from the back-end. A typical
example is to de-duplicate similar results in a search application. Another is
to apply business rules to reorder the results, especially if they come from
various back-ends. In the context of machine learning, one example is to
de-tokenize tokens back to a natural language text.

Post-processing is similar to the example above, but the search is executed
first, and tensor fields from the documents are extracted and used as input to
the models. In the sample application, we have a model that compares all
results with each other to perform another phase of ranking. See the
MyPostProcessing searcher for details.


In Vespa, most of the computation required for executing queries has
traditionally been run in the content cluster. This makes sense, as it avoids
transmitting data across the network to external services.

Machine-learned model serving at scale


Imagine you have a machine-learned model that you would like to use in some
application, for instance, a transformer model to generate vector
representations from text. You measure the time it takes for a single model
evaluation. Then, satisfied that the model can be evaluated quickly enough, you
deploy this model to production in some model server. Traffic increases, and
suddenly the model is executing much slower and can’t sustain the expected
traffic at all, severely missing SLAs. What could have happened?

You see, most libraries and platforms for evaluating machine-learned models are
by default tuned to use all available resources on the machine for model
inference. This means parallel execution utilizing a number of threads equal to
the number of available cores, or CPUs, on the machine. This is great for a
single model evaluation.

Unfortunately, this breaks down for concurrent evaluations. This is an
under-communicated and important point.

Let’s take a look at what happens. In the following, we serve a transformer
model using Vespa, a highly performant and web-scalable open-source platform
for applications that perform real-time data processing over large data sets.
Vespa uses ONNX Runtime under the hood for model acceleration. We’ll
use the original BERT-base model, a
12-layer, 109 million parameter transformer neural network. We test the
performance of this model on a 32-core Xeon Gold 2.6GHz machine. Initially,
this model can be evaluated on this particular machine in around 24
milliseconds.

Concurrency vs latency and throughput - 32 threads

Here, the blue line is the 95th percentile latency, meaning that 95% of all
requests have latency lower than this. The red line is the throughput: the
requests per second the machine can handle. The horizontal axis is the number
of concurrent connections (clients).

As the number of simultaneous connections increases, the latency increases
drastically. The maximum throughput is reached at around 10 concurrent
requests. At that point, the 95th percentile latency is around 150 ms, pretty
far off from the expected 24 ms. The result is highly variable latency and a
poor user experience.

The type of application dictates the optimal balance between low latency and
high throughput. For instance, if the model is used for an end-user query,
(predictably) low latency is important for a given level of expected traffic.
On the other hand, if the model generates embeddings before ingestion in some
data store, high throughput might be more important. The driving force for both
is cost: how much hardware is needed to support the required throughput. As an
extreme example, if your application serves 10 000 queries per second with a
95% latency requirement of 50ms, you would need around 200 machines with the
setup above.
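The back-of-the-envelope estimate above can be written out explicitly. The roughly 50 requests per second per machine within the 50 ms / 95th-percentile budget is an assumption read off the benchmark figures:

```python
import math

def machines_needed(total_qps, per_machine_qps):
    # Capacity planning: how many machines are required to sustain
    # the total load at the given per-machine throughput?
    return math.ceil(total_qps / per_machine_qps)

# 10 000 queries per second, each machine sustaining ~50 qps
# within the 50 ms 95th-percentile latency budget:
print(machines_needed(10_000, 50))  # 200
```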

Of course, if you expect only a minimal amount of traffic, this might be
totally fine. However, if you are scaling up to thousands of requests per
second, this is a real problem. So, we’ll see what we can do to scale this up
in the following.

Parallel execution of models

We need to explain the threading model used during model inference to see what
is happening here. In general, there are three types of threads: inference
(application) threads, inter-operation threads, and intra-operation threads.
This threading model is common among frameworks such as TensorFlow, PyTorch,
and ONNX Runtime.
The inference threads are the threads of the main application. Each request
gets handled in its own inference thread, which is ultimately responsible for
delivering the result of the model evaluation given the request.

The intra-operation threads evaluate single operations with multi-threaded
implementations. This is useful for many operations, such as element-wise
operations on large tensors, general matrix multiplications, embedding lookups,
and so on. Also, many frameworks chunk together several operations into a
higher-level one that can be executed in parallel for performance.

The inter-operation threads are used to evaluate independent parts of the
model in parallel. For instance, a model containing two distinct paths joined
in the end might benefit from this form of parallel execution. Examples are
Wide and Deep models or two-tower encoder architectures.
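The inter-operation idea can be mimicked in plain Python: two independent branches of a model (say, a “wide” and a “deep” path) are evaluated concurrently and joined at the end. This is an illustrative sketch of the concept, not how ONNX Runtime actually schedules work:

```python
from concurrent.futures import ThreadPoolExecutor

def wide_branch(x):
    # stand-in for a shallow, wide part of the model
    return sum(x)

def deep_branch(x):
    # stand-in for a deeper path: a chain of simple transforms
    y = x
    for _ in range(3):
        y = [v * 2 for v in y]
    return sum(y)

def model(x):
    # evaluate the two independent paths on "inter-op" threads, then join
    with ThreadPoolExecutor(max_workers=2) as pool:
        wide = pool.submit(wide_branch, x)
        deep = pool.submit(deep_branch, x)
        return wide.result() + deep.result()

print(model([1, 2, 3]))  # 6 + 48 = 54
```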

Various thread pools in inference

In the example above, which uses ONNX Runtime, the default configuration
disables the inter-operation threads. However, the number of intra-operation
threads is equal to the number of CPUs on the machine: in this case, 32. So, each
concurrent request is handled in its own inference thread. Some operations,
however, are executed in parallel by employing threads from the intra-operation
thread pool. Since this pool is shared between requests, concurrent requests
need to wait for available threads to progress in the execution. This is why
the latency increases.

The model contains operations that are run both sequentially and in parallel.
That is why throughput increases to a certain level even as latency increases.
After that, however, throughput starts decreasing as we have a situation where
more threads are performing CPU-bound work than physical cores in the machine.
This is obviously detrimental to performance due to excessive thread swapping.
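A rough way to see the oversubscription: in the worst case, each in-flight request can occupy up to intra-op-threads CPU-bound threads at once, so the total demand quickly exceeds the physical core count. The numbers below mirror the 32-core setup from the benchmark; the model is a deliberate simplification:

```python
def cpu_bound_threads(concurrent_requests, intra_op_threads):
    # Worst case: every in-flight request is inside a parallel
    # operation simultaneously, each using the full intra-op pool size.
    return concurrent_requests * intra_op_threads

CORES = 32
for clients in (1, 4, 10):
    demand = cpu_bound_threads(clients, intra_op_threads=32)
    status = "oversubscribed" if demand > CORES else "ok"
    print(clients, demand, status)
```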

Scaling up

To avoid this thread over-subscription, we can ensure that each model runs
sequentially in its own inference thread. This avoids the competition between
concurrent evaluations for the intra-op threads. Unfortunately, it also avoids
the benefits of speeding up a single model evaluation using parallel execution.

Let’s see what happens when we set the number of intra-op threads to 1.

Concurrency vs latency and throughput - 1 thread

As seen, the latency is relatively stable up to a concurrency equalling the
number of cores on the machine (around 32). After that, latency increases due
to the greater number of threads than actual cores to execute them. The
throughput also increases to this point, reaching a maximum of around 120
requests per second, which is a 40% improvement. However, the 95th percentile
latency is now around 250ms, far from expectations.

So, the model that initially seemed promising might not be suitable for
efficient serving after all.

The first generation of transformer models, like BERT-base used above, are
relatively large and slow to evaluate. As a result, more efficient models that
can be used as drop-in replacements using the same tokenizer and vocabulary
have been developed. One example is the
XtremeDistilTransformers family. These are
distilled from BERT and have similar accuracy as BERT-base on many different
tasks with significantly lower computational complexity.

In the following, we will use one of these distilled models, which only has
around 13M parameters compared to BERT-base’s 109M.
Despite having only 12% of the parameter count, the accuracy of this model is
very comparable to the full BERT-base model:

Distilled models accuracy

Using the default number of threads (the same as the number available on the
system), this model can be evaluated on the CPU in around 4 ms. However, it
still suffers from
the same scaling issue as above with multiple concurrent requests. So, let’s
see how this scales with concurrent requests with single-threaded execution:

Concurrency vs latency and throughput with 1 intra-op thread on distilled model

As expected, the latency is much more stable until we reach concurrency levels
equalling the number of cores on the machine. This gives a much better and more
predictable experience. The throughput now tops out at around 1600 requests per
second, vastly superior to the other model, which topped out at roughly 120
requests per second. The result is that much less hardware is needed to achieve
the desired level of performance.

Experiment details

To measure the effects of scaling, we’ve used Vespa, an open-source platform
for building applications that do real-time data processing over large data
sets. Designed to be highly performant and web-scalable, it is used for diverse
tasks such as search, personalization, recommendation, ads, auto-complete,
image and similarity search, comment ranking, and even finding love. Vespa has
many integrated features and supports many use cases right out of the box.
Thus, it offers a simplified path to deployment in production without the
complexity of maintaining many different subsystems. We’ve used Vespa as an
easy-to-use model server in this post.

In Vespa, it is straightforward to tune the threading model to use for each
model:

      <model name="reranker_margin_loss_v4">
        <intraop-threads> number </intraop-threads>
        <interop-threads> number </interop-threads>
        <execution-mode> parallel | sequential </execution-mode>
      </model>

Also, it is easy to scale out horizontally to use additional nodes for model
evaluation. We have not explored that in this post.

The data in this post has been collected using Vespa’s
fbench tool,
which drives load to a system for benchmarking. Fbench provides detailed and
accurate information on how well the system manages the workload.


In this post, we’ve seen that the default thread settings are not suitable for
model serving in production, particularly for applications with a high degree
of concurrent requests. The competition for available threads between parallel
model evaluations leads to thread oversubscription, and performance suffers.
Latency also becomes highly variable.

The problem is the shared intra-operation thread pool. Perhaps a different
threading model should be considered, which allows for utilizing multiple
threads in low traffic situations, but degrades to sequential evaluation when
high concurrency is required.

Currently, however, the solution is to ensure that models run in their own
inference threads. To manage the increased latency, we turned to model
distillation, which effectively lowers the computational complexity without
sacrificing accuracy. There are additional optimizations available which we did
not touch upon here, such as model quantization. Another optimization that is
important for transformer models is limiting input length, as evaluation time
is quadratic in the input length.

We have not considered GPU evaluation here, which can significantly accelerate
execution. However, cost at scale is a genuine concern here as well.

The under-communicated point here is that platforms that promise very low
latencies for inference are only telling part of the story. As an example,
consider a platform promising 1 ms latency for a given model. Naively, this can
support 1000 queries per second. However, consider what happens if 1000
requests arrive at almost the same time: the last request would have to wait
almost a full second before returning. This is far off from the expected 1 ms.
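This closing example can be checked with simple arithmetic: if requests are served strictly one at a time at 1 ms each and 1000 arrive simultaneously, the i-th request completes after i milliseconds:

```python
service_time_ms = 1.0   # the advertised per-request latency
arrivals = 1000         # requests arriving at (almost) the same instant

# Completion time of the i-th request (1-indexed) under strictly
# sequential service: it waits for the i-1 requests ahead of it.
latency_ms = [i * service_time_ms for i in range(1, arrivals + 1)]

print(latency_ms[0])   # 1.0    -> the first request sees the advertised 1 ms
print(latency_ms[-1])  # 1000.0 -> the last waits almost a full second
print(latency_ms[int(0.95 * arrivals) - 1])  # 950.0 -> 95th percentile latency
```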