Open Sourcing Vespa, Yahoo’s Big Data Processing and Serving Engine

By Jon Bratseth, Distinguished Architect, Vespa

Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo’s big data processing and serving engine, available as open source on GitHub.

Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.

By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, at real time and at internet scale – capabilities that up until now, have been within reach of only a few large companies.

Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user’s query or interests, it won’t do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy to use engine.

With over 1 billion users, we currently use Vespa across many different Oath brands – including, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr, and others – to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalized content and advertisements, to name just a few use cases. In fact, Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds. On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company’s revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).

With Vespa, our teams build applications that:

  • Select content items using SQL-like queries and text search
  • Organize all matches to generate data-driven pages
  • Rank matches by handwritten or machine-learned relevance models
  • Serve results with response times in the low milliseconds
  • Write data in real-time, thousands of times per second per node
  • Grow, shrink, and re-configure clusters while serving and writing data

To achieve both speed and scale, Vespa distributes data and computation over many machines without any single master as a bottleneck. Where conventional applications work by pulling data into a stateless tier for processing, Vespa instead pushes computations to the data. This involves managing clusters of nodes with background redistribution of data in case of machine failures or the addition of new capacity, implementing distributed low latency query and processing algorithms, handling distributed data consistency, and a lot more. It’s a ton of hard work!

As the team behind Vespa, we have been working on developing search and serving capabilities ever since building, which was later acquired by Yahoo. Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience onto a modern technology stack. Vespa is larger in scope and lines of code than any open source project we’ve ever released. Now that this has been battle-proven on Yahoo’s largest and most critical systems, we are pleased to release it to the world.

Vespa gives application developers the ability to feed data and models of any size to the serving system and make the final computations at request time. This often produces a better user experience at lower cost (for buying and running hardware) and complexity compared to pre-computing answers to requests. Furthermore it allows developers to work in a more interactive way where they navigate and interact with complex calculations in real time, rather than having to start offline jobs and check the results later.

Vespa can be run on premises or in the cloud. We provide both Docker images and rpm packages for Vespa, as well as guides for running them both on your own laptop or as an AWS cluster.

We’ll follow up this initial announcement with a series of posts on our blog showing how to build a real-world application with Vespa, but you can get started right now by following the getting started guide in our comprehensive documentation.

Managing distributed systems is not easy. We have worked hard to make it easy to develop and operate applications on Vespa so that you can focus on creating features that make use of the ability to compute over large datasets in real time, rather than the details of managing clusters and data. You should be able to get an application up and running in less than ten minutes by following the documentation.

We can’t wait to see what you’ll build with it!

Join us at the Big Data Technology Warsaw Summit on February 27th for Scalable Machine-Learned Model Serving

Online evaluation of machine-learned models (model serving) is difficult to scale to large datasets. is an open source big data serving solution used to solve this problem and in use today on some of the largest such systems in the world. These systems evaluate models over millions of data points per request for hundreds of thousands of requests per second.

If you’re in Warsaw on February 27th, please join Jon Bratseth (Distinguished Architect, Verizon Media) at the Big Data Technology Warsaw Summit, where he’ll share “Scalable machine-learned model serving” and answer any questions. Big Data Technology Warsaw Summit is a one-day conference with technical content focused on big data analysis, scalability, storage, and search. There will be 27 presentations and more than 500 attendees are expected.

Jon’s talk will explore the problem and architectural solution, show how Vespa can be used to achieve scalable serving of TensorFlow and ONNX models, and present benchmarks comparing performance and scalability to TensorFlow Serving.

Hope to see you there!

Serving article comments using reinforcement learning of a neural net

Don’t look at the comments. When you allow users to make comments on your content pages you face the problem that not all of them are worth showing — a difficult problem to solve, hence the saying. In this article I’ll show how this problem has been attacked using reinforcement learning at serving time on Yahoo content sites, using the Vespa open source platform to create a scalable production solution.

Yahoo properties such as Yahoo Finance, News and Sports allow users to comment on the articles, similar to many other apps and websites. To support this the team needed a system that can add, find, count and serve comments at scale in real time. Not all comments are equally as interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting.

Here I’ll explain how this problem was solved for Yahoo properties by using Vespa — the open source big data serving engine. I’ll start with the basics and then show how comment selection using a neural net and reinforcement learning was implemented.

As mentioned, the team needed a system that can add, find, count, and serve comments at scale in real time. The team chose Vespa, the open big data serving engine for this, as it supports both such basic serving as well as incorporating machine learning at serving time (which we’ll get to below). By storing each comment as a separate document in Vespa, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself, the team could issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article:


In addition, this document structure allowed less-used operations such as showing all the articles of a given user and similar.

The Vespa instance used at Yahoo for this store about a billion comments at any time, serve about 12.000 queries per second, and about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: A single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes and executing the distributed part of queries in parallel. In total, 32 stateless and 96 stateful nodes are spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6–12 shards depending on the traffic patterns of that region.

Some articles on Yahoo pages have a very large number of comments — up to hundreds of thousands are not uncommon, and no user is going to read all of them. Therefore it is necessary to pick the best comments to show each time someone views an article. Vespa does this by finding all the comments for the article, computing a score for each, and picking the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution. This allows executing these queries with low latency and ensures that more comments can be handled by adding more content nodes, without causing an increase in latency.

The input to the ranking function is features which are typically stored in the document (here: a comment) or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, the system keeps track of the reputation of each comment author as a feature.

User actions are sent as update operations to Vespa as they are performed. The information about authors is also continuously changing, but since each author can write many comments it would be wasteful to have to update each comment every time there is new information about the author.
Instead, the author information is stored in a separate document type — one document per author,
and a document reference in Vespa is used to import that author feature into each comment.
This allows updating the author information once and have it automatically take effect for all comments by that author.

With these features, it’s possible in Vespa to configure a mathematical function as a ranking expression which computes the rank score or each comment to produce a ranked list of the top comments, like the following:


Using a neural net and reinforcement learning

The team used to rank comments with a handwritten ranking expression having hardcoded weighting of the features. This is a good way to get started but obviously not optimal. To improve it they needed to decide on a measurable target and use machine learning to optimize towards it.

The ultimate goal is for users to find the comments interesting. This can not be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time the users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc). The team knew they wanted user interest to go up on average, but there is no way to know what the correct value of the measure of interest might be for any single given list of comments. Therefore it’s hard to create a training set of interest signals for articles (supervised learning), so reinforcement learning was chosen instead: Let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal used as a proxy for user interest, and use this to converge on a model that increases it.

The model chosen here was a neural net with multiple hidden layers, roughly illustrated as follows:


The advantage of using a neural net compared to a simple function such as linear regression is that it can capture non-linear relationships in the feature data without anyone having to guess which relationship exists and hand-write functions to capture them (feature engineering).

To explore the space of possible rankings, the team implemented a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. They logged the ranking information and user interest signals such as dwell time to their Hadoop grid where they are joined. This generates a training set each hour which is used to retrain the model using TensorFlow-on-Spark, which produces a new model for the next iteration of the reinforcement learning cycle.

To implement this on Vespa, the team configured the neural net as the ranking function for comments. This was done as a manually written ranking function over tensors in a rank profile. Here is the production configuration used:

rank-profile neuralNet {
    function get_model_weights(field) {
        expression: if(query(field) == 0, constant(field), query(field))
    function layer_0() { # returns tensor(hidden[9])     
        expression: elu(xw_plus_b(nn_input, get_model_weights(W_0), get_model_weights(b_0), x))   
    function layer_1() { # returns tensor(out[9])
        expression: elu(xw_plus_b(layer_0 get_model_weights(W_1), get_model_weights(b_1), hidden))   
    # xw_plus_b returns tensor(out[1]), so sum converts to double   
    function layer_out() {
        expression: sum(xw_plus_b(layer_1, get_model_weights(W_out), get_model_weights(b_out), out))   
    first-phase {     
        expression: freshnessRank   
    second-phase {
        expression: layer_out
        rerank-count: 2000   

More recently Vespa added support for deploying TensorFlow SavedModels directly (as well as similar support for tools saving in the ONNX format), which would also be a good option here since the training happens in TensorFlow.

Neural nets have a pair of weight and bias tensors for each layer, which is what the team wanted the training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, with reinforcement learning it is necessary to be able to update these tensor parameters frequently. This could be achieved by redeploying the application package frequently, as Vespa allows that to be done without restarts or disruption to ongoing queries. However, it is still a somewhat heavy-weight process, so another approach was chosen: Store the neural net parameters as tensors in a separate document type in Vespa, and create a Searcher component which looks up this document on each incoming query, and adds the parameter tensors to it before it’s passed to the content nodes for evaluation.

Here is the full production code needed to accomplish this serving-time operation:

import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {
    private static final String VESPA_ID_FORMAT = "id:canvass_search:rankingmodel::%s";
    private static final String FEATURE_FORMAT = "query(%s)";

    /** To fetch model documents from Vespa index */
    private final SyncSession fetchDocumentSession;
    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession = DocumentAccess.createDefault().createSyncSession(new SyncParameters.Builder().build());

    public Result search(Query query, Execution execution) {
        // Fetch model document from Vespa
        String id = String.format(VESPA_ID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(id));
        // Add it to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query)

    private static void addTensorFromDocumentToQuery(String field, FieldValue value, Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(FEATURE_FORMAT, field), tensor);
The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net (where each field below is configured with “indexing: attributesummary”):
document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … } 
    field W_1 type tensor(hidden[9],out[9]) { … } 
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … } 
    field b_out type tensor(out[1]) { … } 

Since updating documents is a lightweight operation it is now possible to make frequent changes to the neural net to implement the reinforcement learning process.


Switching to the neural net model with reinforcement learning has already led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms since the neural net model is more expensive. The response time stays low because in Vespa the neural net is evaluated on all the content nodes (partitions) in parallel. This avoids the bottleneck of sending the data for each comment to be evaluated over the network and allows increasing parallelization indefinitely by adding more content nodes.

However, evaluating the neural net for all comments for outlier articles which have hundreds of thousands of comments would still be very costly. If you read the rank profile configuration shown above, you’ll have noticed the solution to this: Two-phase ranking was used where the comments are first selected by a cheap rank function (termed freshnessRank) and the highest scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.

Conclusion and future work

In this article I have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa this can be accomplished relatively easily with a standard open source technology.

The team working on this plan to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

Stateful model serving: how we accelerate inference using ONNX Runtime

By Lester Solbakken from Verizon Media and Pranav Sharma from Microsoft.

There’s a difference between stateless and stateful model serving.

Stateless model serving is what one usually thinks about when using a
machine-learned model in production. For instance, a web application handling
live traffic can call out to a model server from somewhere in the serving
stack. The output of this model service depends purely on the input. This is
fine for many tasks, such as classification, text generation, object detection,
and translation, where the model is evaluated once per query.

There are, however, some applications where the input is combined with stored
or persisted data to generate a result. We call this stateful model evaluation.
Applications such as search and recommendation need to evaluate models with a
potentially large number of items for each query. A model server can quickly
become a scalability
bottleneck in these
cases, regardless of how efficient the model inference is.

In other words, stateless model serving requires sending all necessary input
data to the model. In stateful model serving, the model should be computed
where the data is stored.

At, we are concerned with efficient stateful
model evaluation. is an open-source platform for building
applications that do real-time data processing over large data sets. Designed
to be highly performant and web-scalable, it is used for such diverse tasks as
search, personalization, recommendation, ads, auto-complete, image and
similarity search, comment ranking, and even for finding

It has become increasingly important for us to be able to evaluate complex
machine-learned models efficiently. Delivering low latency, fast inference and
low serving cost is challenging while at the same time providing support for
the various model training frameworks.

We eventually chose to leverage ONNX
Runtime (ORT) for this task. ONNX
Runtime is an accelerator for model inference. It has vastly increased’s capacity for evaluating large models, both in performance and model
types we support. ONNX Runtime’s capabilities within hardware acceleration and
model optimizations, such as quantization, has enabled efficient evaluation of
large NLP models like BERT and other Transformer models in

In this post, we’ll share our journey on why and how we eventually chose ONNX
Runtime and share some of our experiences with it.

About has a rich history. Its lineage comes from a search engine born in 1997.
Initially powering the web search at, it was flexible
enough to be used in various more specialized products, or verticals, such as
document search, mobile search, yellow pages, and banking. This flexibility in
being a vertical search platform eventually gave rise to its name, Vespa.

The technology was acquired by Yahoo in 2003. There, Vespa cemented itself as a
core piece of technology that powers hundreds of applications, including many
of Yahoo’s most essential services. We open-sourced Vespa in 2017 and today it
serves hundreds of thousands of queries per second worldwide at any given time,
with billions of content items for hundreds of millions of users.

Although Yahoo was eventually acquired by Verizon, it is interesting to note
that our team has stayed remarkably stable over the years. Indeed, a few of the
engineers that started working on that initial engine over 20 years ago are
still here. Our team counts about 30 developers, and we are situated in
Trondheim in Norway.

Building upon experience gained over many years, has evolved
substantially to become what it is today. It now stands as a battle-proven
general engine for real-time computation over large data sets. It has many
features that make it
suitable for web-scale applications. It stores and indexes data with instant
writes so that queries, selection, and processing over the data can be
performed efficiently at serving time. It’s elastic and fault-tolerant, so
nodes can be added, removed, or replaced live without losing data. It’s easy to
configure, operate, and add custom logic. Importantly, it contains built-in
capabilities for advanced computation, including machine learned models. applications

Vespa architecture is a distributed application consisting of stateless nodes and a set
of stateful content nodes containing the data. A application is fully
defined in an application package. This is a single unit containing everything
needed to set up an application, including all configuration, custom
components, schemas, and machine-learned models. When the application package
is deployed, the admin layer takes care of configuring all the services across
all the system’s nodes. This includes distributing all models to all content

Application packages contain one or more document schemas. The schema primarily
consists of:

  • The data fields for each document and how they should be stored and indexed.
  • The ranking profiles which define how each document should be scored during query handling.

The ranking profiles contain ranking expressions, which are mathematical
expressions combining ranking
Some features retrieve data from sources such as the query, stored data, or
constants. Others compute or aggregate data in various ways. Ranking profiles
support multi-phased evaluation, so a cheap model can be evaluated in the first
phase and a more expensive model for the second. Both sparse and dense
tensors are
supported for more advanced computation.

After the application is deployed, it is ready to handle data writes and
queries. Data feeds are first processed on the stateless layer before content
is distributed (with redundancy) to the content nodes. Similarly, queries go
through the stateless layer before being fanned out to the content nodes where
data-dependent computation is handled. They return their results back to the
stateless layer, where the globally best results are determined, and a response
is ultimately returned.

A guiding principle in is to move computation to the data rather than
the other way around. Machine-learned models are automatically deployed to all
content nodes and evaluated there for each query. This alleviates the cost of
query-time data transportation. Also, as takes care of distributing
data to the content nodes and redistributing elastically, one can scale up
computationally by adding more content nodes, thus distributing computation as

In summary, offers ease of deployment, flexibility in combining many
types of models and computations out of the box without any plugins or
extensions, efficient evaluation without moving data around and a less complex
system to maintain. This makes an attractive platform.


In the last few years, it has become increasingly important for to
support various types of machine learned models from different frameworks. This
led to us introducing initial support for ONNX models in 2018.

The Open Neural Network Exchange (ONNX) is an open standard
for distributing machine learned models between different systems. The goal of
ONNX is interoperability between model training frameworks and inference
engines, avoiding any vendor lock-in. For instance, HuggingFace’s Transformer
library includes export to ONNX, PyTorch has native ONNX export, and TensorFlow
models can be converted to ONNX. From our perspective, supporting ONNX is
obviously interesting as it would maximize the range of models we could

To support ONNX in, we introduced a special onnx ranking feature. When
used in a ranking expression this would instruct the framework to evaluate the
ONNX model. This is one of the unique features of, as one has the
flexibility to combine results from various features and string models
together. For instance, one could use a small, fast model in an early phase,
and a more complex and computationally expensive model that only runs on the
most promising candidates. For instance:

document my_document {
  field text_embedding type tensor(x[769]) {
    indexing: attribute | index
    attribute {
      distance-metric: euclidean
  field text_tokens type tensor(d0[256]) {
    indexing: summary | attribute

onnx-model my_model {
  file: files/my_model.onnx
  input input_0: ...
  input input_1: ...
  output output_0: ...

rank-profile my_profile {
  first-phase {
    expression: closeness(field, text_embedding)
  second-phase {
    rerank-count: 10
    expression: onnx(my_model)

This is an example of configuring to calculate the euclidean distance
between a query vector and the stored text_embedding vector in the first stage.
This is usually used together with an approximate nearest neighbor search. The
top 10 candidates are sent to the ONNX model in the second stage. Note that
this is per content node, so with 10 content nodes, the model is running
effectively on 100 candidates.

The model is set up in the onnx-model section. The file refers to an ONNX model
somewhere in the application package. Inputs to the model, while not actually
shown here for brevity, can come from various sources such as constants, the
query, a document, or some combination expressed through a user-defined
function. While the output of models are tensors, the resulting value of a
first or second phase expression needs to be a single scalar, as documents are
sorted according to this score before being returned.

Our initial implementation of the onnx ranking feature was to import the ONNX
model and convert the entire graph into native expressions. This was
feasible because of the flexibility of the various tensor
operations supports. For instance, a single neural network layer could be
converted like this:

Model rank expression

Here, weights and bias would be stored as constant tensors, whereas the input
tensor could be retrieved either from the query, a document field, or some
combination of both.

Initially, this worked fine. We implemented the various ONNX
operators using
the available tensor operations. However, we only supported a subset of the
150+ ONNX operators at first, as we considered that only certain types of
models were viable for use in due to its low-latency requirement. For
instance, the ranking expression language does not support iterations, making
it more challenging to implement operators used in convolutional or recurrent
neural networks. Instead, we opted to continuously add operator support as new
model types were used in

The advantage of this was that the various optimizations we introduced to our
tensor evaluation engine to efficiently evaluate the models benefitted all
other applications using tensors as well.

ONNX Runtime

Unfortunately, we ran into problems as we started developing support for
Transformer models. Our first attempt at supporting a 12-layer BERT-base model
failed. This was a model converted from TensorFlow to ONNX. The evaluation
result was incorrect, with relatively poor performance.

We spent significant efforts on this. Quite a few operators had to be rewritten
due to, sometimes subtle, edge cases. We introduced a dozen or so
performance optimizations, to avoid doing silly stuff such as calculating the
same expressions multiple times and allocating memory unnecessarily.
Ultimately, we were able to increase performance by more than two orders of

During this development we turned to ONNX
Runtime for reference. ONNX Runtime
is easy to use:

import onnxruntime as ort
session = ort.InferenceSession(“model.onnx”) output_names=[...], input_feed={...} )

This was invaluable, providing us with a reference for correctness and a
performance target.

At one point, we started toying with the idea of actually using ONNX Runtime
directly for the model inference instead of converting it to
expressions. The content node is written in C++, so this entailed
integrating with the C++ interface of ONNX Runtime. It should be mentioned that
adding dependencies to is not something we often do, as we prefer to
avoid dependencies and thus own the entire stack.

Within a couple of weeks, we had a proof of concept up and running which showed
a lot of promise. So we decided to go ahead and start using ONNX Runtime to
evaluate all ONNX models in

This proved to be a game-changer for us. It vastly increases the capabilities
of evaluating large deep-learning models in in terms of model types we
support and evaluation performance. We can leverage ONNX Runtime’s use of
a compute library containing processor-optimized kernels. ONNX Runtime also
contains model-specific optimizations for BERT models (such as multi-head
attention node fusion) and makes it easy to evaluate precision-reduced models
by quantization for even more efficient inference.

ONNX Runtime in

Consider the following:

onnx-model my_model {
  file: files/my_model.onnx
  input input_0: query(my_query_input)
  input input_1: attribute(my_doc_field)

rank-profile my_profile {

Machine-learned model serving at scale

Photo by Jukan Tateisi
on Unsplash

Imagine you have a machine-learned model that you would like to use in some
application, for instance, a transformer model to generate vector
representations from text. You measure the time it takes for a single model
evaluation. Then, satisfied that the model can be evaluated quickly enough, you
deploy this model to production in some model server. Traffic increases, and
suddenly the model is executing much slower and can’t sustain the expected
traffic at all, severely missing SLAs. What could have happened?

You see, most libraries and platforms for evaluating machine-learned models are
by default tuned to use all available resources on the machine for model
inference. This means parallel execution utilizing a number of threads equal to
the number of available cores, or CPUs, on the machine. This is great for a
single model evaluation.

Unfortunately, this breaks down for concurrent evaluations. This is an
under-communicated and important point.

Let’s take a look at what happens. In the following, we serve a transformer
model using Vespa is a highly performant and
web-scalable open-source platform for applications that perform real-time data
processing over large data sets. uses ONNX
Runtime under the hood for model acceleration. We’ll
use the original BERT-base model, a
12-layer, 109 million parameter transformer neural network. We test the
performance of this model on a 32-core Xeon Gold 2.6GHz machine. Initially,
this model can be evaluated on this particular machine in around 24

Concurrency vs latency and throughput - 32 threads

Here, the blue line is the 95th percentile latency, meaning that 95% of all
requests have latency lower than this. The red line is the throughput: the
requests per second the machine can handle. The horizontal axis is the number
of concurrent connections (clients).

As the number of simultaneous connections increases, the latency increases
drastically. The maximum throughput is reached at around 10 concurrent
requests. At that point, the 95th percentile latency is around 150ms, pretty
far off from the expected 24ms. The result is a highly variable and poor

The type of application dictates the optimal balance between low latency and
high throughput. For instance, if the model is used for an end-user query,
(predictably) low latency is important for a given level of expected traffic.
On the other hand, if the model generates embeddings before ingestion in some
data store, high throughput might be more important. The driving force for both
is cost: how much hardware is needed to support the required throughput. As an
extreme example, if your application serves 10 000 queries per second with a
95% latency requirement of 50ms, you would need around 200 machines with the
setup above.

Of course, if you expect only a minimal amount of traffic, this might be
totally fine. However, if you are scaling up to thousands of requests per
second, this is a real problem. So, we’ll see what we can do to scale this up
in the following.

Parallel execution of models

We need to explain the threading model used during model inference to see what
is happening here. In general, there are 3 types of threads: inference
(application), inter-operation, and intra-operation threads. This is a common
feature among multiple frameworks, such as TensorFlow, PyTorch, and ONNX

The inference threads are the threads of the main application. Each request
gets handled in its own inference thread, which is ultimately responsible for
delivering the result of the model evaluation given the request.

The intra-operation threads evaluate single operations with multi-threaded
implementations. This is useful for many operations, such as element-wise
operations on large tensors, general matrix multiplications, embedding lookups,
and so on. Also, many frameworks chunk together several operations into a
higher-level one that can be executed in parallel for performance.

The inter-operation threads are used to evaluate independent parts of the
model in parallel. For instance, a model containing two distinct paths joined
in the end might benefit from this form of parallel execution. Examples are
Wide and Deep models or two-tower encoder architectures.

Various thread pools in inference

In the example above, which uses ONNX Runtime, the default disables the
inter-operation threads. However, the number of intra-operation threads is
equal to the number of CPUs on the machine. In this case, 32. So, each
concurrent request is handled in its own inference thread. Some operations,
however, are executed in parallel by employing threads from the intra-operation
thread pool. Since this pool is shared between requests, concurrent requests
need to wait for available threads to progress in the execution. This is why
the latency increases.

The model contains operations that are run both sequentially and in parallel.
That is why throughput increases to a certain level even as latency increases.
After that, however, throughput starts decreasing as we have a situation where
more threads are performing CPU-bound work than physical cores in the machine.
This is obviously detrimental to performance due to excessive thread swapping.

Scaling up

To avoid this thread over-subscription, we can ensure that each model runs
sequentially in its own inference thread. This avoids the competition between
concurrent evaluations for the intra-op threads. Unfortunately, it also avoids
the benefits of speeding up a single model evaluation using parallel execution.

Let’s see what happens when we set the number of intra-op threads to 1.

Concurrency vs latency and throughput - 1 thread

As seen, the latency is relatively stable up to a concurrency equalling the
number of cores on the machine (around 32). After that, latency increases due
to the greater number of threads than actual cores to execute them. The
throughput also increases to this point, reaching a maximum of around 120
requests per second, which is a 40% improvement. However, the 95th percentile
latency is now around 250ms, far from expectations.

So, the model that initially seemed promising might not be suitable for
efficient serving after all.

The first generation of transformer models, like BERT-base used above, are
relatively large and slow to evaluate. As a result, more efficient models that
can be used as drop-in replacements using the same tokenizer and vocabulary
have been developed. One example is the
XtremeDistilTransformers family. These are
distilled from BERT and have similar accuracy as BERT-base on many different
tasks with significantly lower computational complexity.

In the following, we will use the
model, which only has around 13M parameters compared to BERT-base’s 109M.
Despite having only 12% of the parameter count, the accuracy of this model is
very comparable to the full BERT-base model:

Distilled models accuracy

Using the default number of threads (same as available on the system), this
model can be evaluated on the CPU is around 4ms. However, it still suffers from
the same scaling issue as above with multiple concurrent requests. So, let’s
see how this scales with concurrent requests with single-threaded execution:

Concurrency vs latency and throughput with 1 intra-op thread on distilled model

As expected, the latency is much more stable until we reach concurrency levels equalling the number of cores on the machine. This gives a much better and predictable experience. The throughput now tops out at around 1600 requests per second, vastly superior to the other model, which topped out at roughly 120 requests per second. This results in much less hardware needed to achieve wanted levels of performance.

Experiment details

To measure the effects of scaling, we’ve used, an open-source platform
for building applications that do real-time data processing over large data
sets. Designed to be highly performant and web-scalable, it is used for diverse
tasks such as search, personalization, recommendation, ads, auto-complete,
image and similarity search, comment ranking, and even finding
love. has many integrated features
and supports many use cases right out of the box. Thus, it offers a simplified
path to deployment in production without the complexity of maintaining many
different subsystems. We’ve used as an easy-to-use model
server in this post.
In, it is straightforward to tune the threading model to
use for each model:

      <model name="reranker_margin_loss_v4">
        <intraop-threads> number </intraop-threads>
        <interop-threads> number </interop-threads>
        <execution-mode> parallel | sequential </execution-mode>

Also, it is easy to scale out horizontally to use additional nodes for model
evaluation. We have not explored that in this post.

The data in this post has been collected using Vespa’s
fbench tool,
which drives load to a system for benchmarking. Fbench provides detailed and
accurate information on how well the system manages the workload.


In this post, we’ve seen that the default thread settings are not suitable for
model serving in production, particularly for applications with a high degree
of concurrent requests. The competition for available threads between parallel
model evaluations leads to thread oversubscription and performance suffers. The
latency also becomes highly variable.

The problem is the shared intra-operation thread pool. Perhaps a different
threading model should be considered, which allows for utilizing multiple
threads in low traffic situations, but degrades to sequential evaluation when
high concurrency is required.

Currently however, the solution is to ensure that models are running in their
own threads. To manage the increased latency, we turned to model distillation,
which effectively lowers the computational complexity without sacrificing
accuracy. There are additional optimizations available which we did not touch
upon here, such as model
Another one that is important for transformer models is limiting input length
as evaluation time is quadratic to the input length.

We have not considered GPU evaluation here, which can significantly accelerate
execution. However, cost at scale is a genuine concern here as well.

The under-communicated point here is that platforms that promise very low
latencies for inference are only telling part of the story. As an example,
consider a platform promising 1ms latency for a given model. Naively, this can
support 1000 queries per second. However, consider what happens if 1000
requests arrive at almost the same time: the last request would have had to
wait almost 1 second before returning. This is far off from the expected 1ms.