Stateful model serving: how we accelerate inference using ONNX Runtime

By Lester Solbakken from Verizon Media and Pranav Sharma from Microsoft.

There’s a difference between stateless and stateful model serving.

Stateless model serving is what one usually thinks about when using a
machine-learned model in production. For instance, a web application handling
live traffic can call out to a model server from somewhere in the serving
stack. The output of this model service depends purely on the input. This is
fine for many tasks, such as classification, text generation, object detection,
and translation, where the model is evaluated once per query.

There are, however, some applications where the input is combined with stored
or persisted data to generate a result. We call this stateful model evaluation.
Applications such as search and recommendation need to evaluate models with a
potentially large number of items for each query. A model server can quickly
become a scalability
bottleneck in these
cases, regardless of how efficient the model inference is.

In other words, stateless model serving requires sending all necessary input
data to the model. In stateful model serving, the model should be computed
where the data is stored.

At, we are concerned with efficient stateful
model evaluation. is an open-source platform for building
applications that do real-time data processing over large data sets. Designed
to be highly performant and web-scalable, it is used for such diverse tasks as
search, personalization, recommendation, ads, auto-complete, image and
similarity search, comment ranking, and even for finding

It has become increasingly important for us to be able to evaluate complex
machine-learned models efficiently. Delivering low latency, fast inference and
low serving cost is challenging while at the same time providing support for
the various model training frameworks.

We eventually chose to leverage ONNX
Runtime (ORT) for this task. ONNX
Runtime is an accelerator for model inference. It has vastly increased’s capacity for evaluating large models, both in performance and model
types we support. ONNX Runtime’s capabilities within hardware acceleration and
model optimizations, such as quantization, has enabled efficient evaluation of
large NLP models like BERT and other Transformer models in

In this post, we’ll share our journey on why and how we eventually chose ONNX
Runtime and share some of our experiences with it.

About has a rich history. Its lineage comes from a search engine born in 1997.
Initially powering the web search at, it was flexible
enough to be used in various more specialized products, or verticals, such as
document search, mobile search, yellow pages, and banking. This flexibility in
being a vertical search platform eventually gave rise to its name, Vespa.

The technology was acquired by Yahoo in 2003. There, Vespa cemented itself as a
core piece of technology that powers hundreds of applications, including many
of Yahoo’s most essential services. We open-sourced Vespa in 2017 and today it
serves hundreds of thousands of queries per second worldwide at any given time,
with billions of content items for hundreds of millions of users.

Although Yahoo was eventually acquired by Verizon, it is interesting to note
that our team has stayed remarkably stable over the years. Indeed, a few of the
engineers that started working on that initial engine over 20 years ago are
still here. Our team counts about 30 developers, and we are situated in
Trondheim in Norway.

Building upon experience gained over many years, has evolved
substantially to become what it is today. It now stands as a battle-proven
general engine for real-time computation over large data sets. It has many
features that make it
suitable for web-scale applications. It stores and indexes data with instant
writes so that queries, selection, and processing over the data can be
performed efficiently at serving time. It’s elastic and fault-tolerant, so
nodes can be added, removed, or replaced live without losing data. It’s easy to
configure, operate, and add custom logic. Importantly, it contains built-in
capabilities for advanced computation, including machine learned models. applications

Vespa architecture is a distributed application consisting of stateless nodes and a set
of stateful content nodes containing the data. A application is fully
defined in an application package. This is a single unit containing everything
needed to set up an application, including all configuration, custom
components, schemas, and machine-learned models. When the application package
is deployed, the admin layer takes care of configuring all the services across
all the system’s nodes. This includes distributing all models to all content

Application packages contain one or more document schemas. The schema primarily
consists of:

  • The data fields for each document and how they should be stored and indexed.
  • The ranking profiles which define how each document should be scored during query handling.

The ranking profiles contain ranking expressions, which are mathematical
expressions combining ranking
Some features retrieve data from sources such as the query, stored data, or
constants. Others compute or aggregate data in various ways. Ranking profiles
support multi-phased evaluation, so a cheap model can be evaluated in the first
phase and a more expensive model for the second. Both sparse and dense
tensors are
supported for more advanced computation.

After the application is deployed, it is ready to handle data writes and
queries. Data feeds are first processed on the stateless layer before content
is distributed (with redundancy) to the content nodes. Similarly, queries go
through the stateless layer before being fanned out to the content nodes where
data-dependent computation is handled. They return their results back to the
stateless layer, where the globally best results are determined, and a response
is ultimately returned.

A guiding principle in is to move computation to the data rather than
the other way around. Machine-learned models are automatically deployed to all
content nodes and evaluated there for each query. This alleviates the cost of
query-time data transportation. Also, as takes care of distributing
data to the content nodes and redistributing elastically, one can scale up
computationally by adding more content nodes, thus distributing computation as

In summary, offers ease of deployment, flexibility in combining many
types of models and computations out of the box without any plugins or
extensions, efficient evaluation without moving data around and a less complex
system to maintain. This makes an attractive platform.


In the last few years, it has become increasingly important for to
support various types of machine learned models from different frameworks. This
led to us introducing initial support for ONNX models in 2018.

The Open Neural Network Exchange (ONNX) is an open standard
for distributing machine learned models between different systems. The goal of
ONNX is interoperability between model training frameworks and inference
engines, avoiding any vendor lock-in. For instance, HuggingFace’s Transformer
library includes export to ONNX, PyTorch has native ONNX export, and TensorFlow
models can be converted to ONNX. From our perspective, supporting ONNX is
obviously interesting as it would maximize the range of models we could

To support ONNX in, we introduced a special onnx ranking feature. When
used in a ranking expression this would instruct the framework to evaluate the
ONNX model. This is one of the unique features of, as one has the
flexibility to combine results from various features and string models
together. For instance, one could use a small, fast model in an early phase,
and a more complex and computationally expensive model that only runs on the
most promising candidates. For instance:

document my_document {
  field text_embedding type tensor(x[769]) {
    indexing: attribute | index
    attribute {
      distance-metric: euclidean
  field text_tokens type tensor(d0[256]) {
    indexing: summary | attribute

onnx-model my_model {
  file: files/my_model.onnx
  input input_0: ...
  input input_1: ...
  output output_0: ...

rank-profile my_profile {
  first-phase {
    expression: closeness(field, text_embedding)
  second-phase {
    rerank-count: 10
    expression: onnx(my_model)

This is an example of configuring to calculate the euclidean distance
between a query vector and the stored text_embedding vector in the first stage.
This is usually used together with an approximate nearest neighbor search. The
top 10 candidates are sent to the ONNX model in the second stage. Note that
this is per content node, so with 10 content nodes, the model is running
effectively on 100 candidates.

The model is set up in the onnx-model section. The file refers to an ONNX model
somewhere in the application package. Inputs to the model, while not actually
shown here for brevity, can come from various sources such as constants, the
query, a document, or some combination expressed through a user-defined
function. While the output of models are tensors, the resulting value of a
first or second phase expression needs to be a single scalar, as documents are
sorted according to this score before being returned.

Our initial implementation of the onnx ranking feature was to import the ONNX
model and convert the entire graph into native expressions. This was
feasible because of the flexibility of the various tensor
operations supports. For instance, a single neural network layer could be
converted like this:

Model rank expression

Here, weights and bias would be stored as constant tensors, whereas the input
tensor could be retrieved either from the query, a document field, or some
combination of both.

Initially, this worked fine. We implemented the various ONNX
operators using
the available tensor operations. However, we only supported a subset of the
150+ ONNX operators at first, as we considered that only certain types of
models were viable for use in due to its low-latency requirement. For
instance, the ranking expression language does not support iterations, making
it more challenging to implement operators used in convolutional or recurrent
neural networks. Instead, we opted to continuously add operator support as new
model types were used in

The advantage of this was that the various optimizations we introduced to our
tensor evaluation engine to efficiently evaluate the models benefitted all
other applications using tensors as well.

ONNX Runtime

Unfortunately, we ran into problems as we started developing support for
Transformer models. Our first attempt at supporting a 12-layer BERT-base model
failed. This was a model converted from TensorFlow to ONNX. The evaluation
result was incorrect, with relatively poor performance.

We spent significant efforts on this. Quite a few operators had to be rewritten
due to, sometimes subtle, edge cases. We introduced a dozen or so
performance optimizations, to avoid doing silly stuff such as calculating the
same expressions multiple times and allocating memory unnecessarily.
Ultimately, we were able to increase performance by more than two orders of

During this development we turned to ONNX
Runtime for reference. ONNX Runtime
is easy to use:

import onnxruntime as ort
session = ort.InferenceSession(“model.onnx”) output_names=[...], input_feed={...} )

This was invaluable, providing us with a reference for correctness and a
performance target.

At one point, we started toying with the idea of actually using ONNX Runtime
directly for the model inference instead of converting it to
expressions. The content node is written in C++, so this entailed
integrating with the C++ interface of ONNX Runtime. It should be mentioned that
adding dependencies to is not something we often do, as we prefer to
avoid dependencies and thus own the entire stack.

Within a couple of weeks, we had a proof of concept up and running which showed
a lot of promise. So we decided to go ahead and start using ONNX Runtime to
evaluate all ONNX models in

This proved to be a game-changer for us. It vastly increases the capabilities
of evaluating large deep-learning models in in terms of model types we
support and evaluation performance. We can leverage ONNX Runtime’s use of
a compute library containing processor-optimized kernels. ONNX Runtime also
contains model-specific optimizations for BERT models (such as multi-head
attention node fusion) and makes it easy to evaluate precision-reduced models
by quantization for even more efficient inference.

ONNX Runtime in

Consider the following:

onnx-model my_model {
  file: files/my_model.onnx
  input input_0: query(my_query_input)
  input input_1: attribute(my_doc_field)

rank-profile my_profile {