Accelerating stateless model evaluation on Vespa

A central architectural feature of Vespa is the division
of work between the stateless container cluster and the content cluster.

Most computation, such as evaluating machine-learned models, happens in
the content cluster. However, it has become increasingly important to
efficiently evaluate models in the container cluster as well, to
process or transform documents or queries before storage or execution.
One prominent example is to generate a vector representation of natural
language text for queries and documents for nearest neighbor retrieval.

We have recently implemented accelerated model evaluation using ONNX Runtime in
the stateless cluster, which opens up new usage areas for Vespa.


In Vespa, we differentiate between stateful and stateless machine-learned
model evaluation. Stateless model evaluation is what one usually thinks about
when serving machine-learned models in production. For instance, one might have
a stand-alone model server that is called from somewhere in a serving stack.
The result of evaluating a model there only depends upon its input.

In contrast, stateful model serving combines input with stored or persisted
data. This poses some additional challenges. One is that models typically need
to be evaluated many times per query, once per data point. This has been a
focus area of Vespa for quite some time, and we have previously written about
how we accelerate stateful model evaluation
in Vespa using ONNX Runtime.

However, stateless model evaluation does have its place in Vespa as well:
for instance, transforming query input or document content using Transformer
models, finding a vector representation of an image for image similarity
search, or translating text to another language. The list goes on.

Vespa has actually had stateless model evaluation for some time, but we’ve
recently added acceleration of ONNX models using ONNX Runtime. This makes the
feature much more powerful and opens up some new use cases for Vespa. In this
post, we’ll take a look at some capabilities this enables:

  • The automatically generated REST API for model serving.
  • Creating lightweight request handlers for serving models with some custom
    code without the need for content nodes.
  • Adding model evaluation to searchers for query processing and enrichment.
  • Adding model evaluation to document processors for transforming content
    before ingestion.
  • Batch-processing results from the ranking back-end for additional ranking

We’ll start with a quick overview of where we evaluate machine-learned models
in Vespa.

Vespa applications: container and content nodes

Vespa is a distributed application consisting of various types of services on
multiple nodes. A Vespa application is fully defined in an application package.
This single unit
contains everything needed to set up an application, including all
configuration, custom components, schemas, and machine-learned models. When the
application package is deployed, the admin cluster takes care of configuring all
the services across all the system’s nodes, including distributing all
models to the nodes that need them.

Vespa architecture

The container nodes process queries or documents before passing them on to the
content nodes. So, when a document is fed to Vespa, content can be transformed
or added before being stored. Likewise, queries can be transformed or enriched
in various ways before being sent for further processing.

The content nodes are responsible for persisting data. They also do most of the
required computation when responding to queries. As that is where the data is,
this avoids the cost of transferring data across the network. Query data is
combined with document data to perform this computation in various ways.

We thus differentiate between stateless and stateful machine-learned model
evaluation. Stateless model evaluation happens on the container nodes and is
characterized by a single model evaluation per query or document. Stateful
model evaluation
happens on the content nodes, and the model is typically
evaluated a number of times using data from both the query and the document.

The exact configuration of the services on the nodes is specified in
services.xml. Here the
number of container and content nodes, and their capabilities, are fully
configured. Indeed, a Vespa application does not need to be set up with any
content nodes, purely running stateless container code, including serving
machine-learned models.

This makes it easy to deploy applications. It offers a lot of flexibility
in combining many types of models and computations out of the box without any
plugins or extensions. In the next section, we’ll see how to set up stateless
model evaluation.

Stateless model evaluation

So, by stateless model evaluation we mean machine-learned models that are
evaluated on Vespa container nodes. This is enabled by simply adding the
<model-evaluation/> tag under the container element in services.xml.


When this is specified, Vespa scans through the models directory in the
application package to find any importable machine-learned models. Currently,
supported models are TensorFlow, ONNX, XGBoost, LightGBM or Vespa’s own

There are two effects of this. The first is that a REST API for model discovery
and evaluation is automatically enabled. The other is that custom
components can have
a special ModelsEvaluator object dependency injected into their constructors.

Stateless model evaluation

In the following we’ll take a look at some of the usages of these, and use the
model-evaluation sample application for demonstration.


The automatically added REST API provides an API for model discovery and
evaluation. This is great for using Vespa as a standalone model server, or
making models available for other parts of the application stack.

To get a list of imported models, call http://host:port/model-evaluation/v1.
For instance:

$ curl -s 'http://localhost:8080/model-evaluation/v1/'
{
    "pairwise_ranker": "http://localhost:8080/model-evaluation/v1/pairwise_ranker",
    "transformer": "http://localhost:8080/model-evaluation/v1/transformer"
}

This application has two models, the transformer model and the
pairwise_ranker model. We can inspect a model to see expected inputs and
outputs:

$ curl -s 'http://localhost:8080/model-evaluation/v1/transformer/output'
{
    "arguments": [
        {
            "name": "input",
            "type": "tensor(d0[],d1[])"
        },
        {
            "name": "onnxModel(transformer).output",
            "type": "tensor<float>(d0[],d1[],d2[16])"
        }
    ],
    "eval": "http://localhost:8080/model-evaluation/v1/transformer/output/eval",
    "function": "output",
    "info": "http://localhost:8080/model-evaluation/v1/transformer/output",
    "model": "transformer"
}

All model inputs and outputs are Vespa tensors. See the tensor user
guide for more information.

This model has one input, with tensor type tensor(d0[],d1[]). This tensor has
two dimensions: d0 is typically a batch dimension, and d1 represents, for
this model, a sequence of tokens. The output, of type
tensor<float>(d0[],d1[],d2[16]), adds a dimension d2 which represents the
embedding dimension. So the output is an embedding representation for each
token of the input.

By calling /model-evaluation/v1/transformer/eval and passing a URL-encoded
input parameter, we can have Vespa evaluate the model and return the result as
a JSON-encoded tensor.

Please refer to the sample application
for a runnable example.
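
As a rough illustration of the URL encoding step, the sketch below builds an eval URL in plain Java. The host, port, and tensor values are hypothetical, and the class and method names are ours rather than part of Vespa; only the /model-evaluation/v1/transformer/eval path and the input parameter name come from the text above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EvalUriBuilder {

    /** Builds an eval URI, URL-encoding the tensor passed as the "input" parameter. */
    static URI evalUri(String baseUrl, String model, String tensorLiteral) {
        String encoded = URLEncoder.encode(tensorLiteral, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/model-evaluation/v1/" + model + "/eval?input=" + encoded);
    }

    public static void main(String[] args) {
        // Hypothetical input: one batch entry (d0) holding three token ids (d1)
        String tensor = "{{d0:0,d1:0}:1,{d0:0,d1:1}:2,{d0:0,d1:2}:3}";
        System.out.println(evalUri("http://localhost:8080", "transformer", tensor));
    }
}
```

The resulting URI can be passed to curl or any HTTP client. Encoding matters here because the braces, colons, and commas of a tensor literal are not safe to send unescaped in a query string.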

Request handlers

The REST API takes exactly the same input as the models it serves. In some
cases one might want to pre-process the input before providing it to the model.
A common example is to tokenize natural language text before passing the token
sequence to a language model such as BERT.

Vespa provides request handlers,
which let applications implement arbitrary HTTP APIs. With custom request
handlers, arbitrary code can be run both before and after model evaluation.

When the model-evaluation tag has been supplied, Vespa makes a special
ModelsEvaluator object available which can be injected into a component
(such as a request handler):

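import ai.vespa.models.evaluation.FunctionEvaluator;
import ai.vespa.models.evaluation.ModelsEvaluator;
import com.yahoo.container.jdisc.HttpRequest;
import com.yahoo.container.jdisc.HttpResponse;
import com.yahoo.container.jdisc.ThreadedHttpRequestHandler;
import com.yahoo.tensor.Tensor;
import com.yahoo.tensor.TensorType;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class MyHandler extends ThreadedHttpRequestHandler {

    private final ModelsEvaluator modelsEvaluator;

    public MyHandler(ModelsEvaluator modelsEvaluator, Context context) {
        super(context);
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public HttpResponse handle(HttpRequest request) {
        // Get the input
        String inputString = request.getProperty("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate the model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor result = evaluator.bind("input", input).evaluate();

        // Perform any post-processing to the tensor
        // ...

        // One simple way to respond: write the tensor's string form as the body
        return new HttpResponse(200) {
            @Override
            public void render(OutputStream out) throws IOException {
                out.write(result.toString().getBytes(StandardCharsets.UTF_8));
            }
        };
    }
}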

A full example can be seen in the MyHandler class in the sample application
and its unit test.

As mentioned, arbitrary code can be run here. Pragmatically, it is often more
convenient to put the processing pipeline in the model itself. While not always
possible, this helps protect against divergence between the data processing
pipeline in training and in production.

Document processors

The REST API and request handler can work with a purely stateless application,
such as a model server. However, it is much more common for applications to
have content. As such, it is fairly common to process incoming documents before
storing them. Vespa provides a chain of document processors
for this.

Applications can implement custom document processors, and add them to the
processing chain. In the context of model evaluation, a typical task is to use a
machine-learned model to create a vector representation for a natural language
text. The text is first tokenized, then run through a language model such as
BERT to generate a vector representation which is then stored. Such a vector
representation can, for instance, be used in nearest neighbor
search. Other examples
are sentiment analysis, creating representations of images, object detection,
translating text, and so on.

The ModelsEvaluator can be injected into your component as already seen:

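import ai.vespa.models.evaluation.FunctionEvaluator;
import ai.vespa.models.evaluation.ModelsEvaluator;
import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.tensor.Tensor;

public class MyDocumentProcessor extends DocumentProcessor {

    private final ModelsEvaluator modelsEvaluator;

    public MyDocumentProcessor(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (op instanceof DocumentPut) {
                DocumentPut put = (DocumentPut) op;
                Document document = put.getDocument();

                // Get tokens
                Tensor tokens = (Tensor) document.getFieldValue("tokens").getWrappedValue();

                // Perform any pre-processing to the tensor
                // ...

                // Evaluate the model on the token tensor
                FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
                Tensor result = evaluator.bind("input", tokens).evaluate();

                // Reshape and extract the embedding vector (not shown)
                Tensor embedding = ...

                // Set embedding in document
                document.setFieldValue("embedding", new TensorFieldValue(embedding));
            }
        }
        return Progress.DONE;
    }
}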

Notice the code looks a lot like the previous example for the request handler.
The document processor receives a pre-constructed ModelsEvaluator from Vespa
which contains the transformer model. This code receives a tensor contained
in the tokens field, runs that through the transformer model, and puts the
resulting embedding into a new field. This is then stored along with the
document.

Again, a full example can be seen in the MyDocumentProcessor class in the sample
application and its unit test.

Searchers: query processing

Similar to document processing, queries are processed along a chain of
searchers.
Vespa provides a default chain of searchers for various tasks, and applications
can provide additional custom searchers as well. In the context of model
evaluation, the use cases are similar to document processing: a typical task
for text search is to generate vector representations for nearest neighbor search.

Again, the ModelsEvaluator can be injected into your component:

import ai.vespa.models.evaluation.FunctionEvaluator;
import ai.vespa.models.evaluation.ModelsEvaluator;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;
import com.yahoo.tensor.TensorType;

public class MySearcher extends Searcher {

    private final ModelsEvaluator modelsEvaluator;

    public MySearcher(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Result search(Query query, Execution execution) {

        // Get the query input
        String inputString = query.properties().getString("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor output = evaluator.bind("input", input).evaluate();

        // Reshape and extract the embedding vector (not shown)
        Tensor embedding = ...

        // Add this tensor to query
        query.getRanking().getFeatures().put("query(embedding)", embedding);

        // Continue processing
        return execution.search(query);
    }
}

As before, a full example can be seen in the MySearcher class in the sample
application and its unit test.

Searchers: result post-processing

Searchers don’t just process queries before they are sent to the back-end: they
are just as useful in post-processing the results from the back-end. A typical
example is to de-duplicate similar results in a search application. Another is
to apply business rules to reorder the results, especially if coming from
various back-ends. In the context of machine learning, one example is to
de-tokenize tokens back to a natural language text.

Post-processing is similar to the example above, but the search is executed
first, and tensor fields from the documents are extracted and used as input to
the models. In the sample application we have a model that compares all results
with each other to perform another phase of ranking. See the MyPostProcessing
for details.


In Vespa, most of the computation required for executing queries has
traditionally been run in the content cluster. This makes sense as it avoids
transmitting data across the network to external

Accelerating Transformer-based Embedding Retrieval with Vespa


Photo by Appic on Unsplash

In this post, we’ll see how to accelerate embedding inference and retrieval with little impact on quality.
We’ll take a holistic approach and deep-dive into both aspects of an embedding retrieval system: Embedding inference and retrieval with nearest neighbor search.
All experiments are performed on a laptop with the open-source Vespa container image.


The fundamental concept behind text embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought closer together, and dissimilar ones are pushed farther
apart. Mapping multilingual texts into a unified vector embedding
space makes it possible to represent and compare queries and documents
from various languages within this shared space. By using contrastive
representation learning with retrieval data examples, we can make
embedding representations useful for retrieval with nearest neighbor
search.


A search system using embedding retrieval consists of two primary
components:

  • Embedding inference, using an embedding model to map text to a
    point in a vector space of D dimensions.
  • Retrieval in the D dimensional vector space using nearest neighbor search.

This blog post covers both aspects of an embedding retrieval system
and how to accelerate them, while also paying attention to the task
accuracy because what’s the point of having blazing fast but highly
inaccurate results?

Transformer Model Inferencing

The most popular text embedding models are typically based on
encoder-only Transformer models (such as BERT). We need a
high-level understanding of the complexity of encoder-only transformer
language models (without going deep into neural network architectures).

Inference complexity from the transformer architecture attention
mechanism scales quadratically with input sequence length.

BERT embedder

Illustration of obtaining a single vector representation of the
text ‘a new day’ through BERT.

The BERT model has a typical input
length limitation of 512 tokens, so the tokenization process truncates
the input to avoid exceeding the architecture’s maximum length.
Embedding models might also truncate the text at a lower limit than
the theoretical limit of the neural network to improve quality and
reduce training costs, as computational complexity is quadratic
with input sequence length for both training and inference. The
last pooling operation compresses the token vectors into a single
vector representation. A common pooling technique is averaging the
token vectors.
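
To make the pooling step concrete, here is a minimal sketch of mean pooling in plain Java. The class and method names are ours, and it assumes every row is a real token; real embedders typically also exclude padding tokens from the average.

```java
import java.util.Arrays;

public class MeanPooling {

    /** Averages token vectors (one row per token) into a single embedding vector. */
    static float[] meanPool(float[][] tokenVectors) {
        if (tokenVectors.length == 0) {
            throw new IllegalArgumentException("at least one token vector is required");
        }
        int dims = tokenVectors[0].length;
        float[] pooled = new float[dims];
        for (float[] token : tokenVectors) {
            for (int d = 0; d < dims; d++) {
                pooled[d] += token[d];   // sum each dimension over all tokens
            }
        }
        for (int d = 0; d < dims; d++) {
            pooled[d] /= tokenVectors.length;   // divide by the token count
        }
        return pooled;
    }

    public static void main(String[] args) {
        // Three hypothetical token vectors with two dimensions each
        float[][] tokens = { {1f, 2f}, {3f, 4f}, {5f, 6f} };
        System.out.println(Arrays.toString(meanPool(tokens)));   // prints [3.0, 4.0]
    }
}
```

Whatever the number of input tokens, the output has the same dimensionality as a single token vector, which is what lets texts of different lengths be compared in one vector space.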

It’s worth noting that some models may not perform pooling and
instead represent the text with multiple vectors,
but that aspect is beyond the scope of this blog post.

Inference cost versus sequence length

Illustration of BERT inference cost versus sequence input length (sequence^2).

We use ‘Inference cost’ to refer to the computational resources
required for a single inference pass with a given input sequence
length. The graph depicts the relationship between the sequence
length and the squared compute complexity, demonstrating its quadratic
nature. Latency and throughput can be adjusted using different
techniques for parallelizing computations. See model serving at scale
for a discussion on these techniques in Vespa.

Why does all of this matter? For retrieval systems, text queries
are usually much shorter than text documents, so invoking embedding
models for documents costs more than encoding shorter questions.
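
A back-of-the-envelope sketch of this asymmetry, assuming cost grows purely with the square of the sequence length (this ignores the parts of the model that scale linearly). The sequence lengths below are hypothetical, and the class name is ours.

```java
public class AttentionCost {

    /** Relative cost of one sequence versus a baseline, assuming cost is proportional to length^2. */
    static double relativeCost(int length, int baselineLength) {
        double ratio = (double) length / baselineLength;
        return ratio * ratio;
    }

    public static void main(String[] args) {
        int documentTokens = 512;   // hypothetical document length
        int queryTokens = 32;       // hypothetical query length
        // A 16x longer input costs 16^2 = 256x more under the quadratic assumption
        System.out.println(relativeCost(documentTokens, queryTokens));   // prints 256.0
    }
}
```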

Sequence lengths and quadratic scaling are some of the reasons why
using frozen document-side embeddings
is practical at scale, as it avoids re-embedding documents when
the model weights are updated due to re-training the model. Similarly,
query embeddings can be cached for previously seen queries as long
as the model weights are unchanged. The asymmetric length properties
can also help us design a retrieval system architecture for scale.

  • Asymmetric model size: Use different-sized models for encoding
    queries and documents (with the same output embedding dimensionality).
    See this paper for an example.
  • Asymmetric batch size: Use batch on-demand computing for embedding
    documents, using auto-scaling features, for example, with Vespa
  • Asymmetric compute architecture: Use GPU acceleration for document inference but CPU
    for query inference.

The final point is that reporting embedding inference latency or
throughput without mentioning input sequence length provides little
insight.

Choose your Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.
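
The storage side of this is simple arithmetic, sketched below. The document count is hypothetical, and we assume float32 values (4 bytes each) with no index or metadata overhead.

```java
public class EmbeddingStorage {

    /** Raw bytes needed to store one embedding per document. */
    static long storageBytes(long documents, int dimensions, int bytesPerValue) {
        return documents * dimensions * bytesPerValue;
    }

    public static void main(String[] args) {
        long documents = 1_000_000L;   // hypothetical corpus size
        long small = storageBytes(documents, 384, 4);
        long large = storageBytes(documents, 768, 4);
        System.out.println(small + " bytes vs " + large + " bytes");
        System.out.println("ratio: " + (double) large / small);   // prints ratio: 2.0
    }
}
```

A brute-force distance calculation touches every value of every stored vector once, so the same factor of two applies to the per-query compute of an exact nearest neighbor search.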

Accuracy, however, is not nearly linear, as demonstrated on the
MTEB leaderboard.

Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets)

A comparison of the E5 multilingual models, with accuracy numbers from the MTEB
leaderboard.

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the Transformer
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small model-dir

The above exports the model without any optimizations. The optimum
client also allows specifying an optimization level,
here using the highest optimization level usable for serving on the

The above commands export the model to ONNX format that can be
imported and used with the Vespa Huggingface embedder.
Using the Optimum generated ONNX and tokenizer configuration files,
we configure Vespa with the following in the Vespa application
package:
<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
</component>

These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.
We can also quantize the optimized ONNX model, for example, using
the optimum library
or onnxruntime quantization.
Quantization (post-training) converts the float32 model weights (4
bytes per weight) to byte (int8), enabling faster inference on the CPU.
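
To give a feel for what converting float32 weights to int8 means, here is a deliberately simplified sketch of symmetric per-tensor quantization. This is an illustration under our own assumptions, not the scheme optimum or onnxruntime implements; the class, method names, and weights are hypothetical.

```java
public class Int8Quantization {

    /** Scale that maps the largest absolute weight to 127. */
    static float scale(float[] weights) {
        float maxAbs = 0f;
        for (float w : weights) {
            maxAbs = Math.max(maxAbs, Math.abs(w));
        }
        return maxAbs == 0f ? 1f : maxAbs / 127f;
    }

    /** Rounds each weight to the nearest multiple of the scale, stored in one byte. */
    static byte[] quantize(float[] weights, float scale) {
        byte[] quantized = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            int v = Math.round(weights[i] / scale);
            quantized[i] = (byte) Math.max(-127, Math.min(127, v));
        }
        return quantized;
    }

    /** Recovers approximate float weights from the int8 values. */
    static float[] dequantize(byte[] quantized, float scale) {
        float[] weights = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            weights[i] = quantized[i] * scale;
        }
        return weights;
    }

    public static void main(String[] args) {
        float[] weights = {-1f, 0f, 0.5f, 1f};   // hypothetical weights, 4 bytes each
        float s = scale(weights);
        byte[] q = quantize(weights, s);          // 1 byte each
        System.out.println(java.util.Arrays.toString(q));
        System.out.println(java.util.Arrays.toString(dequantize(q, s)));
    }
}
```

The round trip is lossy: each weight is recovered only to within about half a scale step, which is why the impact on task accuracy needs to be measured rather than assumed.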

Performance Experiments

To demonstrate the many tradeoffs, we assess the mentioned small
E5 multilingual model on the Swahili (SW) split from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages) dataset.

Dataset | Language | Documents | Avg document tokens | Queries | Avg query tokens | Relevance judgments
MIRACL sw | Swahili | 131,924 | 63 | 482 | 13 | 5092

Dataset characteristics; tokens are the number of language model
token identifiers. Since Swahili is a low-resource language, the
LM tokenization uses more tokens to represent similar byte-length
texts than for more popular languages such as English.

We experiment with post-training quantization of the model (not the
output vectors) to document the impact quantization has on retrieval
effectiveness (NDCG@10). We use this script
to quantize the model (we don’t use optimum for this due to this
issue, fixed in v1.11).

We then study the serving efficiency gains (latency/throughput) on
the same laptop-sized hardware using a quantized model versus a
full precision model.

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

  • We use the multilingual-search Vespa sample application
    as the starting point for these experiments. This sample app was
    introduced in Simplify search with multilingual embedding models.
  • We use the NDCG@10 metric
    to evaluate ranking effectiveness. When performing model optimizations,
    it’s important to pay attention to the impact on the task. This is
    stating the obvious, but still, many talk about accelerations and
    optimizations without mentioning task accuracy degradations.
  • We measure the throughput of indexing text documents in Vespa. This
    includes embedding inference in Vespa using the Vespa Huggingface
    embedder, storing the embedding vector in Vespa, and regular inverted
    indexing of the title and text field. We use the vespa-cli feed option
    as the feeding client.
  • We use the Vespa fbench tool
    to drive HTTP query load using HTTP POST against the Vespa query API.
  • Batch size in Vespa embedders is one for document and query inference.
  • There is no caching of query embedding inference, so repeating the same query
    text while benchmarking will trigger a new embedding inference.
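
For readers unfamiliar with the NDCG@10 metric mentioned in the list above, the following is a minimal sketch of how it can be computed for one query. It assumes linear gain and a log2 rank discount, which is one common formulation; the evaluation tooling used for the reported numbers may differ in details. The class name and the relevance labels are hypothetical.

```java
import java.util.Arrays;

public class Ndcg {

    /** DCG@k with linear gain and a log2(rank + 1) discount; ranks start at 1. */
    static double dcg(int[] relevances, int k) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, relevances.length); i++) {
            sum += relevances[i] / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    /**
     * NDCG@k for one query: DCG of the returned ranking divided by the DCG of
     * the ideal ordering of all judged relevance labels for that query.
     */
    static double ndcg(int[] rankedRelevances, int[] judgedRelevances, int k) {
        int[] ideal = judgedRelevances.clone();
        Arrays.sort(ideal);   // ascending, reversed below to get best-first order
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0.0 ? 0.0 : dcg(rankedRelevances, k) / idealDcg;
    }

    public static void main(String[] args) {
        int[] ranked = {0, 1, 0, 1};   // relevance of the returned hits, in rank order
        int[] judged = {1, 1};         // all relevance judgments for the query
        System.out.println(ndcg(ranked, judged, 10));
    }
}
```

The score for a test collection is the mean over all queries, so a drop of a few points can mean noticeably worse rankings for some queries.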

Sample Vespa JSON formatted feed document (prettified) from the
MIRACL dataset.

{
    "put": "id:miracl-sw:doc::2-0",
    "fields": {
        "title": "Akiolojia",
        "text": "Akiolojia (kutoka Kiyunani \u03b1\u03c1\u03c7\u03b1\u03af\u03bf\u03c2 = \"zamani\" na \u03bb\u03cc\u03b3\u03bf\u03c2 = \"neno, usemi\") ni somo linalohusu mabaki ya tamaduni za watu wa nyakati zilizopita. Wanaakiolojia wanatafuta vitu vilivyobaki, kwa mfano kwa kuchimba ardhi na kutafuta mabaki ya majengo, makaburi, silaha, vifaa, vyombo na mifupa ya watu.",
        "doc_id": "2#0",
        "language": "sw"
    }
}

Model | Model size (MB) | NDCG@10 | Docs/second | Queries/second
Int8 (Quantized) | 112 | 0.661 | 269 | 640

Comparison of embedding inference in Vespa using a full precision
model with float32 weights against a quantized model using int8
weights. This is primarily benchmarking embedding inference. See
the next section for a deep dive into the experimental setup.

There is a small drop in retrieval accuracy from an NDCG@10 score
of 0.675 to 0.661 (2%), but a huge gain in embedding inference
efficiency. Indexing throughput increases by 2x, and query throughput
increases close to 2x. The throughput measurements are end-to-end,
either using vespa-cli feed or vespa-fbench. The difference in query
versus document sequence length largely explains the query and
document throughput difference (the quadratic scaling properties).

Query embed latency and throughput

Throughput is one way to look at it, but what about query serving
latency? We analyze query latency of the quantized model by gradually
increasing the load until the CPU is close to 100% utilization, using the
vespa-fbench input format for POST requests.

{"yql": "select doc_id from doc where rank(doc_id contains \"71#13\",{targetHits:1}nearestNeighbor(embedding,q))", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits": 0}

The above query template tests Vespa end-to-end but does NOT perform
a global nearest neighbor search, as the query uses the rank operator
to retrieve by doc_id, and the second operand computes the
nearestNeighbor. This means that the nearest neighbor “search” is
limited to a single document in the index. This experimental setup
allows us to test everything end to end except the cost of exhaustive
search through all documents.

This part of the experiment focuses on the embedding model inference
and not nearest neighbor search performance. We use all the queries
in the dev set (482 unique queries). Using vespa-fbench, we simulate
load by increasing the number of concurrent clients executing queries
with sleep time 0 (-c 0) while observing the end-to-end latency and
throughput.

$ vespa-fbench -P -q queries.txt -s 20 -n $clients -c 0 localhost 8080
Clients | Average latency | 95p latency | Queries/s

Vespa query embedder performance.

As concurrency increases, the latency increases slightly, but not
much, until saturation, where latency will climb rapidly with a
hockey-stick shape due to queuing for exhausted resources.

In this case, latency is the complete end-to-end HTTP latency,
including HTTP overhead, embedding inference, and dispatching the
embedding vector to the Vespa content node process. Again, it does
not include nearest neighbor search, as the query limits the retrieval
to a single document.

In the previous section, we focused on the embedding inference
throughput and latency. In this section, we change the Vespa query
specification to perform an exact nearest neighbor search over all
documents. This setup measures the end-to-end deployment, including
HTTP overhead, embedding inference, and embedding retrieval using
Vespa exact nearest neighbor search.
With exact search, no retrieval error is introduced by using
approximate search algorithms.
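
Conceptually, exact nearest neighbor search is a brute-force scan: score every document vector against the query vector and keep the best ones. The sketch below shows that idea in plain Java. It is not how Vespa implements it, and the use of cosine similarity, the class name, and the vectors are our assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ExactNearestNeighbor {

    /** Cosine similarity between two vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the indexes of the targetHits most similar documents, best first. */
    static List<Integer> search(float[] query, float[][] docs, int targetHits) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            ids.add(i);
        }
        // Every document is scored, so no candidate can be missed
        ids.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, docs[i])));
        return new ArrayList<>(ids.subList(0, Math.min(targetHits, ids.size())));
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        float[][] docs = { {0f, 1f}, {1f, 0f}, {1f, 1f} };   // hypothetical embeddings
        System.out.println(search(query, docs, 2));          // prints [1, 2]
    }
}
```

The cost of this scan grows linearly with both the number of documents and the embedding dimensionality, which is the cost that approximate algorithms trade accuracy to avoid.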

{"yql": "select doc_id from doc where {targetHits:10}nearestNeighbor(embedding,q)", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits":