Accelerating Transformer-based Embedding Retrieval with Vespa

Decorative image

Photo by Appic on Unsplash

In this post, we’ll see how to accelerate embedding inference and retrieval with little impact on quality.
We’ll take a holistic approach and deep-dive into both aspects of an embedding retrieval system: Embedding inference and retrieval with nearest neighbor search.
All experiments are performed on a laptop with the open-source Vespa container image.

Introduction

The fundamental concept behind text embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought closer together, and dissimilar ones are pushed farther
apart. Mapping multilingual texts into a unified vector embedding
space makes it possible to represent and compare queries and documents
from various languages within this shared space. By using contrastive
representation learning with retrieval data examples, we can make
embedding representations useful for retrieval with nearest neighbor
search.

Overview

A search system using embedding retrieval consists of two primary
processes:

  • Embedding inference, using an embedding model to map text to a
    point in a vector space of D dimensions.
  • Retrieval in the D dimensional vector space using nearest neighbor search.

This blog post covers both aspects of an embedding retrieval system
and how to accelerate them, while also paying attention to the task
accuracy because what’s the point of having blazing fast but highly
inaccurate results?

Transformer Model Inferencing

The most popular text embedding models are typically based on
encoder-only Transformer models (such as BERT). We need a
high-level understanding of the complexity of encoder-only transformer
language models (without going deep into neural network architectures).

Inference complexity from the transformer architecture attention
mechanism scales quadratically with input sequence length.

BERT embedder

Illustration of obtaining a single vector representation of the
text ‘a new day’ through BERT.

The BERT model has a typical input
length limitation of 512 tokens, so the tokenization process truncates
the input to avoid exceeding the architecture’s maximum length.
Embedding models might also truncate the text at a lower limit than
the theoretical limit of the neural network to improve quality and
reduce training costs, as computational complexity is quadratic
with input sequence length for both training and inference. The
last pooling operation compresses the token vectors into a single
vector representation. A common pooling technique is averaging the
token vectors.

It’s worth noting that some models may not perform pooling and
instead represent the text with multiple
vectors,
but that aspect is beyond the scope of this blog post.

Inference cost versus sequence length

Illustration of BERT inferenec cost versus sequence input length (sequence^2).

We use ‘Inference cost’ to refer to the computational resources
required for a single inference pass with a given input sequence
length. The graph depicts the relationship between the sequence
length and the squared compute complexity, demonstrating its quadratic
nature. Latency and throughput can be adjusted using different
techniques for parallelizing computations. See model serving at
scale for a
discussion on these techniques in Vespa.

Why does all of this matter? For retrieval systems, text queries
are usually much shorter than text documents, so invoking embedding
models for documents costs more than encoding shorter questions.

Sequence lengths and quadratic scaling are some of the reasons why
using frozen document-size
embeddings
are practical at scale, as it avoids re-embedding documents when
the model weights are updated due to re-training the model. Similarly,
query embeddings can be cached for previously seen queries as long
as the model weights are unchanged. The asymmetric length properties
can also help us design a retrieval system architecture for scale.

  • Asymmetric model size: Use different-sized models for encoding
    queries and documents (with the same output embedding dimensionality).
    See this paper for an example.
  • Asymmetric batch size: Use batch on-demand computing for embedding
    documents, using auto-scaling features, for example, with Vespa
    Cloud.
  • Asymmetric compute architecture: Use GPU acceleration for document inference but CPU
    for query inference.

The final point is that reporting embedding inference latency or
throughput without mentioning input sequence length provides little
insight.

Choose your Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.

Accuracy, however, is not nearly linear, as demonstrated on the
MTEB leaderboard.

ModelDimensionalityModel params (M)Accuracy
Average (56 datasets)
Accuracy Retrieval
(15 datasets)
Small38411857.8746.64
Base76827859.4548.88
Large102456061.551.43

A comparison of the E5 multilingual models — accuracy numbers from the MTEB
leaderboard.

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure
complexity.

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the Transformer
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small model-dir

The above exports the model without any optimizations. The optimum
client also allows specifying optimization
levels,
here using the highest optimization level usable for serving on the
CPU.

The above commands export the model to ONNX format that can be
imported and used with the Vespa Huggingface
embedder.
Using the Optimum generated ONNX and tokenizer configuration files,
we configure Vespa with the following in the Vespa application
package
services.xml
file:

<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
</component>

These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.
We can also quantize the optimized ONNX model, for example, using
the optimum
library
or onnxruntime quantization like
this.
Quantization (post-training) converts the float32 model weights (4
bytes per weight) to byte (int8), enabling faster inference on the
CPU.

Performance Experiments

To demonstrate the many tradeoffs, we assess the mentioned small
E5 multilanguage model on the Swahili(SW) split from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages
) dataset.

DatasetLanguageDocumentsAvg
document tokens
QueriesAvg query tokensRelevance
Judgments
MIRACL swSwahili131,92463482135092

Dataset characteristics; tokens are the number of language model
token identifiers. Since Swahili is a low-resource language, the
LM tokenization uses more tokens to represent similar byte-length
texts than for more popular languages such as English.

We experiment with post-training quantization of the model (not the
output vectors) to document the impact quantization has on retrieval
effectiveness (NDCG@10). We use this
routine
to quantize the model (We don’t use optimum for this due to this
issue – fixed
in v 1.11).

We then study the serving efficiency gains (latency/throughput) on
the same laptop-sized hardware using a quantized model versus a
full precision model
.

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

  • We use the multilingual-search Vespa sample
    application
    as the starting point for these experiments. This sample app was
    introduced in Simplify search with multilingual embedding
    models.
  • We use the
    NDCG@10 metric
    to evaluate ranking effectiveness. When performing model optimizations,
    it’s important to pay attention to the impact on the task. This is
    stating the obvious, but still, many talk about accelerations and
    optimizations without mentioning task accuracy degradations
    .
  • We measure the throughput of indexing text documents in Vespa. This
    includes embedding inference in Vespa using the Vespa Huggingface
    embedder,
    storing the embedding vector in Vespa, and regular inverted indexing
    of the title and text field. We use the
    vespa-cli feed option
    as the feeding client.
  • We use the Vespa fbench
    tool
    to drive HTTP query load using HTTP POST against the Vespa query
    api.
  • Batch size in Vespa embedders is one for document and query inference.
  • There is no caching of query embedding inference, so repeating the same query
    text while benchmarking will trigger a new embedding inference.

Sample Vespa JSON formatted feed document (prettified) from the
MIRACL dataset.

{
    "put": "id:miracl-sw:doc::2-0",
    "fields": {
        "title": "Akiolojia",
        "text": "Akiolojia (kutoka Kiyunani \u03b1\u03c1\u03c7\u03b1\u03af\u03bf\u03c2 = \"zamani\" na \u03bb\u03cc\u03b3\u03bf\u03c2 = \"neno, usemi\") ni somo linalohusu mabaki ya tamaduni za watu wa nyakati zilizopita. Wanaakiolojia wanatafuta vitu vilivyobaki, kwa mfano kwa kuchimba ardhi na kutafuta mabaki ya majengo, makaburi, silaha, vifaa, vyombo na mifupa ya watu.",
        "doc_id": "2#0",
        "language": "sw"
    }
}
ModelModel size (MB)NDCG@10Docs/secondQueries/second
(*)
float324480.675137340
Int8 (Quantized)1120.661269640

Comparison of embedding inference in Vespa using a full precision
model with float32 weights against a quantized model using int8
weights. This is primarily benchmarking embedding inference. See
the next section for a deep dive into the experimental setup.

There is a small drop in retrieval accuracy from an NDCG@10 score
of 0.675 to 0.661 (2%), but a huge gain in embedding inference
efficiency. Indexing throughput increases by 2x, and query throughput
increases close to 2x. The throughput measurements are end-to-end,
either using vespa-cli feed or vespa-fbench. The difference in query
versus document sequence length largely explains the query and
document throughput difference (the quadratic scaling properties).

Query embed latency and throughput

Throughput is one way to look at it, but what about query serving
latency? We analyze query latency of the quantized model by gradually
increasing the load until the CPU is close to 100% utilization using
vespa-fbench
input format for POST requests.

/search/
{"yql": "select doc_id from doc where rank(doc_id contains \"71#13\",{targetHits:1}nearestNeighbor(embedding,q))", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits": 0}

The above query template tests Vespa end-to-end but does NOT perform
a global nearest neighbor search as the query uses the rank
operator
to retrieve by doc_id, and the second operand computes the
nearestNeighbor. This means that the nearest neighbor “search” is
limited to a single document in the index. This experimental setup
allows us to test everything end to end except the cost of exhaustive
search through all documents.

This part of the experiment focuses on the embedding model inference
and not nearest neighbor search performance. We use all the queries
in the dev set (482 unique queries). Using vespa-fbench, we simulate
load by increasing the number of concurrent clients executing queries
with sleep time 0 (-c 0) while observing the end-to-end latency and
throughput.

$ vespa-fbench -P -q queries.txt -s 20 -n $clients -c 0 localhost 8080
Clients Average
latency
95p latencyQueries/s
1810125
2911222
41013400
81219640

Vespa query embedder performance.

As concurrency increases, the latency increases slightly, but not
much, until saturation, where latency will climb rapidly with a
hockey-stick shape due to queuing for exhausted resources.

In this case, latency is the complete end-to-end HTTP latency,
including HTTP overhead, embedding inference, and dispatching the
embedding vector to the Vespa content node process. Again, it does
not include nearest neighbor search, as the query limits the retrieval
to a single document.

In the previous section, we focused on the embedding inference
throughput and latency. In this section, we change the Vespa query
specification to perform an exact nearest neighbor search over all
documents. This setup measures the end-to-end deployment, including
HTTP overhead, embedding inference, and embedding retrieval using
Vespa exact nearest neighbor
search.
With exact search, no retrieval error is introduced by using
approximate search algorithms.

/search/
{"yql": "select doc_id from doc where {targetHits:10}nearestNeighbor(embedding,q)", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits":