Machine-learned model serving at scale

Photo by Jukan Tateisi
on Unsplash

Imagine you have a machine-learned model that you would like to use in some
application, for instance, a transformer model to generate vector
representations from text. You measure the time it takes for a single model
evaluation. Then, satisfied that the model can be evaluated quickly enough, you
deploy this model to production in some model server. Traffic increases, and
suddenly the model is executing much slower and can’t sustain the expected
traffic at all, severely missing SLAs. What could have happened?

You see, most libraries and platforms for evaluating machine-learned models are
by default tuned to use all available resources on the machine for model
inference. This means parallel execution utilizing a number of threads equal to
the number of available cores, or CPUs, on the machine. This is great for a
single model evaluation.

Unfortunately, this breaks down for concurrent evaluations. This is an
under-communicated and important point.

Let’s take a look at what happens. In the following, we serve a transformer
model using Vespa is a highly performant and
web-scalable open-source platform for applications that perform real-time data
processing over large data sets. uses ONNX
Runtime under the hood for model acceleration. We’ll
use the original BERT-base model, a
12-layer, 109 million parameter transformer neural network. We test the
performance of this model on a 32-core Xeon Gold 2.6GHz machine. Initially,
this model can be evaluated on this particular machine in around 24

Concurrency vs latency and throughput - 32 threads

Here, the blue line is the 95th percentile latency, meaning that 95% of all
requests have latency lower than this. The red line is the throughput: the
requests per second the machine can handle. The horizontal axis is the number
of concurrent connections (clients).

As the number of simultaneous connections increases, the latency increases
drastically. The maximum throughput is reached at around 10 concurrent
requests. At that point, the 95th percentile latency is around 150ms, pretty
far off from the expected 24ms. The result is a highly variable and poor

The type of application dictates the optimal balance between low latency and
high throughput. For instance, if the model is used for an end-user query,
(predictably) low latency is important for a given level of expected traffic.
On the other hand, if the model generates embeddings before ingestion in some
data store, high throughput might be more important. The driving force for both
is cost: how much hardware is needed to support the required throughput. As an
extreme example, if your application serves 10 000 queries per second with a
95% latency requirement of 50ms, you would need around 200 machines with the
setup above.

Of course, if you expect only a minimal amount of traffic, this might be
totally fine. However, if you are scaling up to thousands of requests per
second, this is a real problem. So, we’ll see what we can do to scale this up
in the following.

Parallel execution of models

We need to explain the threading model used during model inference to see what
is happening here. In general, there are 3 types of threads: inference
(application), inter-operation, and intra-operation threads. This is a common
feature among multiple frameworks, such as TensorFlow, PyTorch, and ONNX

The inference threads are the threads of the main application. Each request
gets handled in its own inference thread, which is ultimately responsible for
delivering the result of the model evaluation given the request.

The intra-operation threads evaluate single operations with multi-threaded
implementations. This is useful for many operations, such as element-wise
operations on large tensors, general matrix multiplications, embedding lookups,
and so on. Also, many frameworks chunk together several operations into a
higher-level one that can be executed in parallel for performance.

The inter-operation threads are used to evaluate independent parts of the
model in parallel. For instance, a model containing two distinct paths joined
in the end might benefit from this form of parallel execution. Examples are
Wide and Deep models or two-tower encoder architectures.

Various thread pools in inference

In the example above, which uses ONNX Runtime, the default disables the
inter-operation threads. However, the number of intra-operation threads is
equal to the number of CPUs on the machine. In this case, 32. So, each
concurrent request is handled in its own inference thread. Some operations,
however, are executed in parallel by employing threads from the intra-operation
thread pool. Since this pool is shared between requests, concurrent requests
need to wait for available threads to progress in the execution. This is why
the latency increases.

The model contains operations that are run both sequentially and in parallel.
That is why throughput increases to a certain level even as latency increases.
After that, however, throughput starts decreasing as we have a situation where
more threads are performing CPU-bound work than physical cores in the machine.
This is obviously detrimental to performance due to excessive thread swapping.

Scaling up

To avoid this thread over-subscription, we can ensure that each model runs
sequentially in its own inference thread. This avoids the competition between
concurrent evaluations for the intra-op threads. Unfortunately, it also avoids
the benefits of speeding up a single model evaluation using parallel execution.

Let’s see what happens when we set the number of intra-op threads to 1.

Concurrency vs latency and throughput - 1 thread

As seen, the latency is relatively stable up to a concurrency equalling the
number of cores on the machine (around 32). After that, latency increases due
to the greater number of threads than actual cores to execute them. The
throughput also increases to this point, reaching a maximum of around 120
requests per second, which is a 40% improvement. However, the 95th percentile
latency is now around 250ms, far from expectations.

So, the model that initially seemed promising might not be suitable for
efficient serving after all.

The first generation of transformer models, like BERT-base used above, are
relatively large and slow to evaluate. As a result, more efficient models that
can be used as drop-in replacements using the same tokenizer and vocabulary
have been developed. One example is the
XtremeDistilTransformers family. These are
distilled from BERT and have similar accuracy as BERT-base on many different
tasks with significantly lower computational complexity.

In the following, we will use the
model, which only has around 13M parameters compared to BERT-base’s 109M.
Despite having only 12% of the parameter count, the accuracy of this model is
very comparable to the full BERT-base model:

Distilled models accuracy

Using the default number of threads (same as available on the system), this
model can be evaluated on the CPU is around 4ms. However, it still suffers from
the same scaling issue as above with multiple concurrent requests. So, let’s
see how this scales with concurrent requests with single-threaded execution:

Concurrency vs latency and throughput with 1 intra-op thread on distilled model

As expected, the latency is much more stable until we reach concurrency levels equalling the number of cores on the machine. This gives a much better and predictable experience. The throughput now tops out at around 1600 requests per second, vastly superior to the other model, which topped out at roughly 120 requests per second. This results in much less hardware needed to achieve wanted levels of performance.

Experiment details

To measure the effects of scaling, we’ve used, an open-source platform
for building applications that do real-time data processing over large data
sets. Designed to be highly performant and web-scalable, it is used for diverse
tasks such as search, personalization, recommendation, ads, auto-complete,
image and similarity search, comment ranking, and even finding
love. has many integrated features
and supports many use cases right out of the box. Thus, it offers a simplified
path to deployment in production without the complexity of maintaining many
different subsystems. We’ve used as an easy-to-use model
server in this post.
In, it is straightforward to tune the threading model to
use for each model:

      <model name="reranker_margin_loss_v4">
        <intraop-threads> number </intraop-threads>
        <interop-threads> number </interop-threads>
        <execution-mode> parallel | sequential </execution-mode>

Also, it is easy to scale out horizontally to use additional nodes for model
evaluation. We have not explored that in this post.

The data in this post has been collected using Vespa’s
fbench tool,
which drives load to a system for benchmarking. Fbench provides detailed and
accurate information on how well the system manages the workload.


In this post, we’ve seen that the default thread settings are not suitable for
model serving in production, particularly for applications with a high degree
of concurrent requests. The competition for available threads between parallel
model evaluations leads to thread oversubscription and performance suffers. The
latency also becomes highly variable.

The problem is the shared intra-operation thread pool. Perhaps a different
threading model should be considered, which allows for utilizing multiple
threads in low traffic situations, but degrades to sequential evaluation when
high concurrency is required.

Currently however, the solution is to ensure that models are running in their
own threads. To manage the increased latency, we turned to model distillation,
which effectively lowers the computational complexity without sacrificing
accuracy. There are additional optimizations available which we did not touch
upon here, such as model
Another one that is important for transformer models is limiting input length
as evaluation time is quadratic to the input length.

We have not considered GPU evaluation here, which can significantly accelerate
execution. However, cost at scale is a genuine concern here as well.

The under-communicated point here is that platforms that promise very low
latencies for inference are only telling part of the story. As an example,
consider a platform promising 1ms latency for a given model. Naively, this can
support 1000 queries per second. However, consider what happens if 1000
requests arrive at almost the same time: the last request would have had to
wait almost 1 second before returning. This is far off from the expected 1ms.