Text embedding made simple | Vespa Blog

UPDATE 2023-06-06: use the new syntax to configure the Bert embedder.


“searching data using vector embeddings, unreal engine, high quality render, 4k, glossy, vivid colors, intricate detail” by Stable Diffusion

Embeddings are the basis for modern semantic search and neural ranking,
so the first step in developing such features is to convert your document
and query text to embeddings.

Once you have the embeddings, Vespa.ai makes it easy to use them efficiently
to find neighbors
or evaluate machine-learned models,
but you’ve had to create
them either on the client side or by writing your own Java component.
Now, we're providing this building block as part of the platform as well.

On Vespa 8.54.61 or higher, simply add this to your services.xml file under <container>:

<component id="bert" type="bert-embedder">
    <transformer-model path="model/bert-embedder.onnx"/>
    <tokenizer-vocab path="model/vocab.txt"/>
</component>

The model files here can be any BERT-style model and vocabulary; we recommend this one:
huggingface.co/sentence-transformers/msmarco-MiniLM-L-6-v3.

With this deployed, you can automatically convert query text to an embedding by writing embed(bert, "my text") where you would otherwise supply an embedding tensor. For example:

input.query(myEmbedding)=embed(bert, "Hello world")

And to
create an embedding from a document field
you can add

field myEmbedding type tensor(x[384]) {
    indexing: input myTextField | embed bert
}

to your schema outside the document block.
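
For completeness, here is a minimal sketch of a query request that uses the embedder and the field above, sent over HTTP with Python. The rank profile name ("semantic"), the query tensor name q, and the YQL are assumptions for illustration; adapt them to your application.

import requests

# Minimal sketch (assumed names): retrieve nearest neighbors of the query
# embedding produced server-side by the 'bert' embedder configured above.
query_text = "Hello world"
body = {
    "yql": "select * from sources * where {targetHits:10}nearestNeighbor(myEmbedding, q)",
    "input.query(q)": 'embed(bert, "' + query_text + '")',
    "ranking": "semantic",  # assumed rank profile that ranks by closeness to myEmbedding
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=body)
print(response.json())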

Semantic search sample application

To get you started we have created a complete and minimal sample application using this:
simple-semantic-search.

Further reading

This should make it easy to get started with embeddings. If you want to dig deeper into the topic,
be sure to check out this blog post series on
using pretrained transformer models for search,
and this one on efficiency in
combining vector search with filters.

Enhancing Vespa’s Embedding Management Capabilities


Photo by
vnwayne fan
on Unsplash

We are thrilled to announce significant updates to Vespa's support for inference with text embedding models
that map text into vector representations: general support for Huggingface models, including multilingual embeddings; embedding inference on GPUs; and new recommended models available on the Vespa Cloud model hub.

Vespa’s best-in-class vector and multi-vector
search support and inferences with embedding models
allow developers to build feature-rich semantic search applications
without managing separate systems for embedding inference and vector search over embedding representations.

Vespa query request using the embed functionality to produce the vector embedding inside Vespa.

About text embedding models

Text embedding models have revolutionized natural language processing (NLP) and information retrieval tasks
by capturing the semantic meaning of unstructured text data.
Unlike traditional representations that treat words as discrete symbols,
embedding models map text into continuous vector spaces.


Embedding models trained on multilingual datasets can represent concepts across different languages, enabling information retrieval across
diverse linguistic contexts.

Embedder Models from Huggingface

Vespa now comes with generic support for embedding models hosted on Huggingface.

With the new Huggingface Embedder functionality,
developers can export embedding models from Huggingface
and import them in ONNX format in Vespa for accelerated inference close to where the data is created:

<container id="default" version="1.0">
    <component id="my-embedder-id" type="hugging-face-embedder">
        <transformer-model model-id="cloud-model-id"
                           path="my-models/model.onnx"/>
        <tokenizer-model   model-id="cloud-model-id"
                           path="my-models/tokenizer.json"/>
    </component>
    ...
</container>
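
As a sketch of the export step (not the only way to do it), the Optimum library can convert a Huggingface checkpoint to ONNX and write the tokenizer file expected above. The model id and output directory below are examples.

# Sketch: export a Huggingface embedding model to ONNX with Optimum.
# Assumes the optimum[onnxruntime] and transformers packages are installed.
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "intfloat/multilingual-e5-small"  # example model
out_dir = "my-models"

model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
model.save_pretrained(out_dir)  # writes my-models/model.onnx
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)  # writes tokenizer.json, among other files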

The Huggingface Embedder also supports multilingual embedding models that handle 100s of languages.
Multilingual embedding representations open new possibilities for cross-lingual applications
using Vespa linguistic processing
and multilingual vector representations to implement
hybrid search.
The new Huggingface Embedder also supports
multi-vector representations,
simplifying deploying semantic search applications at scale
without maintaining complex fan-out relationships due to model input context length constraints.
Read more about the Huggingface embedder in the
documentation.

GPU Acceleration of Embedding Models

Vespa now supports GPU acceleration of embedding model inferences.
By harnessing the power of GPUs, Vespa embedders can efficiently process large amounts of text data,
resulting in faster response times, improved scalability, and lower cost.
GPU support in Vespa also unlocks using larger and more powerful embedding models
while maintaining low serving latency and cost-effectiveness.

GPU acceleration is automatically enabled in Vespa Cloud for instances where GPUs are available.
Configure your stateless Vespa container cluster with a GPU resource in services.xml.
For open-source Vespa, specify the GPU device using the
embedder ONNX configuration.

Vespa Model Hub Updates

To make it easier to create embedding applications,
we have added new state-of-the-art text embedding models on the Vespa Model Hub for
Vespa Cloud users. The Vespa Model Hub is a centralized repository of selected models,
making it easier for developers to discover and use powerful open-source embedding models.

This expansion of the model hub provides developers with a broader range of embedding options.
It empowers them to make tradeoffs related to embedding quality, inference latency,
and embedding dimensionality-related resource footprint.

We expand the hub with the following open-source text embedding models:

Embedding Model | Dimensionality | Metric | Language | Vespa Hub Model Id
e5-small-v2 | 384 | angular | English | e5-small-v2
e5-base-v2 | 768 | angular | English | e5-base-v2
e5-large-v2 | 1024 | angular | English | e5-large-v2
multilingual-e5-base | 768 | angular | Multilingual | multilingual-e5-base

These embedding models perform strongly on various tasks,
as demonstrated on the MTEB: Massive Text Embedding Benchmark leaderboard.
The MTEB includes 56 datasets across 8 tasks, such as semantic search, clustering, classification, and re-ranking.

MTEB leaderboard; notice the strong performance of the E5-v2 models.

Developers using Vespa Cloud can add these embedding models to their application by referencing the Vespa Cloud Model hub identifier:

<component id="e5" type="hugging-face-embedder">
    <transformer-model model-id="e5-small-v2"/>
</component>

With three lines of configuration added to the Vespa application, Vespa Cloud developers can use the embed functionality for
embedding queries and embedding document fields.

Producing the embeddings closer to the Vespa storage and indexes avoids network transfer-related latency and egress costs,
which can be substantial for high-dimensional vector representations.
In addition, with Vespa Cloud’s auto-scaling feature,
developers do not need to worry about scaling with changes in inference traffic volume.

Vespa Cloud also allows bringing your own models using the HuggingFace Embedder
with model files submitted in the application package. In Vespa Cloud, inference with embedding models is
automatically accelerated with GPU if the application uses Vespa Cloud GPU instances.
Read more on the Vespa Cloud model hub.

Summary

The improved Vespa embedding management options offer a significant leap in capabilities for anybody working with embeddings in online applications,
enabling developers to leverage state-of-the-art models, accelerate inference with GPUs,
and access a broader range of embedding options through the Vespa model hub.
All this functionality is available in Vespa version 8.179.37 and later.

Got questions? Join the Vespa community in Vespa Slack.

Simplify Search with Multilingual Embedding Models


Photo by Bruno Martins on Unsplash

This blog post presents a robust multilingual embedding model from the E5 family and shows how to represent it in Vespa. We also
demonstrate how to evaluate the model's effectiveness on multilingual
information retrieval (IR) datasets.

Introduction

The fundamental concept behind embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought close together and dissimilar ones are pushed
farther apart. Mapping multilingual texts into a unified vector
embedding space makes it possible to represent and compare queries
and documents from various languages within this shared space.


Meet the E5 family.

Researchers from Microsoft introduced the E5 family of text embedding
models in the paper Text Embeddings by Weakly-Supervised Contrastive
Pre-training. E5 is short for
EmbEddings from bidirEctional Encoder rEpresentations. Using a
permissive MIT license, the same researchers have also published
the model weights on the Huggingface model hub. There are three
multilingual E5 embedding model variants with different model sizes
and embedding dimensionality. All three models are initialized from
pre-trained transformer models with trained text vocabularies that
handle up to 100 languages.

This model is initialized from
xlm-roberta-base and
continually trained on a mixture of multilingual datasets. It
supports 100 languages from xlm-roberta, but low-resource languages
may see performance degradation.

Similarly, the E5 embedding model family includes three variants
trained only on English datasets.

Choose your E5 Fighter

The embedding model variants allow developers to trade effectiveness
versus serving-related costs. Embedding model size and embedding dimensionality
impact task accuracy, model inference, nearest
neighbor search, and storage cost.

These serving-related costs are all roughly linear with model size
and embedding dimensionality. In other words, using an embedding
model with 768 dimensions instead of 384 increases embedding storage
by 2x and nearest neighbor search compute by 2x. Accuracy, however,
is not nearly linear, as demonstrated on the MTEB
leaderboard.
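
A back-of-the-envelope sketch of this linear scaling (the corpus size below is illustrative, not from a benchmark):

num_docs = 1_000_000
bytes_per_float = 4  # float32 per dimension

for dims in (384, 768, 1024):
    gib = num_docs * dims * bytes_per_float / (1024 ** 3)
    print(f"{dims} dims: {gib:.2f} GiB of embedding data for {num_docs:,} documents")

# 768 dimensions costs 2x the storage (and roughly 2x the distance compute) of 384.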

The nearest neighbor search for embedding-based retrieval could be
accelerated by introducing approximate algorithms like
HNSW. HNSW
significantly reduces distance calculations at query time but also
introduces degraded retrieval accuracy because the search is
approximate. Still, the same linear relationship between embedding
dimensionality and distance compute complexity holds.

Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets)
Small | 384 | 118 | 57.87 | 46.64
Base | 768 | 278 | 59.45 | 48.88
Large | 1024 | 560 | 61.5 | 51.43

Comparison of the E5 multilingual models. Accuracy numbers from the MTEB leaderboard.

Do note that the datasets included in MTEB are biased towards English
datasets, which means that the reported retrieval performance might
not match up with observed accuracy on private datasets, especially
for low-resource languages.

Representing E5 embedding models in Vespa

Vespa’s vector search and embedding inference support allows
developers to build multilingual semantic search applications without
managing separate systems for embedding inference and vector search
over the multilingual embedding representations.

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure
complexity.

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small multilingual-e5-small-onnx

The above optimum-cli command exports the HF model to ONNX format that can be imported
and used with the Vespa Huggingface
embedder.
Using the Optimum generated ONNX file and tokenizer configuration
file, we configure Vespa with the following in the Vespa application
package
services.xml
file.

<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/multilingual-e5-small.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
</component>

That’s it! These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.

Using E5 with queries and documents in Vespa

The E5 family uses text instructions mixed with the input data to
separate queries and documents. Instead of having two different
models for queries and documents, the E5 family separates queries
and documents by prepending the input with “query:” or “passage:”.

schema doc {
  document doc  {
    field title type string { .. }
    field text type string { .. }
  }
  field embedding type tensor<float>(x[384]) {
    indexing {
      "passage: " . input title . " " . input text | embed | attribute
    }
  }
}

The above Vespa schema uses the embed indexing
language
functionality to invoke the configured E5 embedding model, using a
concatenation of the "passage: " instruction, the title, and
the text. Notice that the embedding tensor
field defines the embedding dimensionality (384).

The above schema uses a single vector
representation per document. With Vespa multi-vector
indexing,
it’s also possible to represent and index multiple vector representations
for the same tensor field.

Similarly, on the query side, we can embed the input query text with the
E5 model, now prepending the user query with "query: ":

{
  "yql": "select ..",
  "input.query(q)": "embed(query: the query to encode)", 
}
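
For completeness, a minimal sketch of sending this query over HTTP with Python; the YQL, the rank profile name ("semantic"), and the endpoint are assumptions for illustration:

import requests

query = "zero-shot retrieval with multilingual embeddings"  # example user query
body = {
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding, q)",
    "input.query(q)": "embed(query: " + query + ")",  # prepend the E5 "query: " instruction
    "ranking": "semantic",
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=body)
print(response.json())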

Evaluation

To demonstrate how to evaluate multilingual embedding models, we
evaluate the small E5 multilingual variant on three information
retrieval (IR) datasets. We use the classic trec-covid dataset, part
of the BEIR benchmark,
which we have written about in blog posts
before. We also include two languages from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages) datasets.

All three datasets use
NDCG@10 to
evaluate ranking effectiveness. NDCG is a ranking metric that is
precision-oriented and handles graded relevance judgments.
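
As an illustration of the metric (one common formulation, not the exact evaluation code used here), NDCG@10 divides the discounted cumulative gain of the top 10 ranked documents by the gain of an ideal ordering of the relevance judgments:

import math

def dcg(relevances, k=10):
    # graded relevance discounted by log2 of the rank position
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_10(ranked_relevances, judged_relevances):
    ideal = sorted(judged_relevances, reverse=True)
    return dcg(ranked_relevances) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# toy example: graded judgments of the returned ranking vs. all judgments for the query
print(ndcg_at_10([2, 0, 1, 2], [2, 2, 2, 1, 1, 0]))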

Dataset | Included in E5 fine-tuning | Language | Documents | Queries | Relevance Judgments
BEIR:trec-covid | No | English | 171,332 | 50 | 66,336
MIRACL:sw | Yes (the train split was used) | Swahili | 131,924 | 482 | 5092
MIRACL:yo | No | Yoruba | 49,043 | 119 | 1188

IR dataset characteristics

We consider both BEIR:trec-covid and MIRACL:yo as out-of-domain datasets
as E5 has not been trained or fine-tuned on them since they don't
contain any training split. Applying E5 on out-of-domain datasets
is called zero-shot, as no training examples (shots) are available.

The Swahili dataset could be categorized as an in-domain dataset
as E5 has been trained on the train split of the dataset. All three
datasets have documents with titles and text
fields. We use the concatenation strategy described in previous sections, inputting both title
and text to the embedding model.

We evaluate the E5 model using exact nearest neighbor
search
without HNSW indexing,
and all experiments are run on an M1 Pro (arm64) laptop using the
open-source Vespa container
image. We contrast
the E5 model results with Vespa BM25.

Dataset | BM25 | Multilingual E5 (small)
MIRACL:sw | 0.4243 | 0.6755
MIRACL:yo | 0.6831 | 0.4187
BEIR:trec-covid | 0.6823 | 0.7139

Retrieval effectiveness for BM25 and E5 small (NDCG@10)

For BEIR:trec-covid, we also evaluated a hybrid combination of E5
and BM25, using a linear combination of the two scores, which lifted
NDCG@10 to 0.7670. This aligns with previous findings, where hybrid
combinations
outperform
each model used independently.
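
A minimal sketch of such a linear combination (illustrative only; the normalization and weight are assumptions, not the exact setup used in the experiment):

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid(bm25_scores, dense_scores, alpha=0.5):
    # normalize each score list, then blend with weight alpha
    b, d = min_max(bm25_scores), min_max(dense_scores)
    return [alpha * bi + (1 - alpha) * di for bi, di in zip(b, d)]

# example: BM25 and E5 cosine scores for the same three candidate documents
print(hybrid([12.3, 9.1, 4.2], [0.82, 0.79, 0.88]))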

Summary

As demonstrated in the evaluation, multilingual embedding models
can enhance and simplify building multilingual search applications
and provide a solid baseline. Still, as we can see from the evaluation
results, the simple and cheap Vespa BM25 ranking model outperformed
the dense embedding model on the MIRACL Yoruba queries.

This result can largely be explained by the fact that the model had not
been pre-trained on the language (low resource) or tuned for retrieval
with Yoruba queries or documents. This is another reminder of what
we wrote about in a blog post about improving zero-shot
ranking,
where we summarize with a quote from the BEIR paper, which evaluates
multiple models in a zero-shot setting:

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets.

In the next blog post, we will look at ways to make embedding
inference cheaper without sacrificing much retrieval effectiveness
by optimizing the embedding model. Furthermore, we will show how
to save 50% of embedding storage using Vespa’s support for bfloat16
precision instead of float, with close to zero impact on retrieval
effectiveness.

If you want to reproduce the retrieval results, or get started
with multilingual embedding search, check out
the new multilingual search sample application.

Accelerating Transformer-based Embedding Retrieval with Vespa


Photo by Appic on Unsplash

In this post, we’ll see how to accelerate embedding inference and retrieval with little impact on quality.
We’ll take a holistic approach and deep-dive into both aspects of an embedding retrieval system: Embedding inference and retrieval with nearest neighbor search.
All experiments are performed on a laptop with the open-source Vespa container image.

Introduction

The fundamental concept behind text embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought closer together, and dissimilar ones are pushed farther
apart. Mapping multilingual texts into a unified vector embedding
space makes it possible to represent and compare queries and documents
from various languages within this shared space. By using contrastive
representation learning with retrieval data examples, we can make
embedding representations useful for retrieval with nearest neighbor
search.

Overview

A search system using embedding retrieval consists of two primary
processes:

  • Embedding inference, using an embedding model to map text to a
    point in a vector space of D dimensions.
  • Retrieval in the D dimensional vector space using nearest neighbor search.

This blog post covers both aspects of an embedding retrieval system
and how to accelerate them, while also paying attention to the task
accuracy because what’s the point of having blazing fast but highly
inaccurate results?

Transformer Model Inferencing

The most popular text embedding models are typically based on
encoder-only Transformer models (such as BERT). We need a
high-level understanding of the complexity of encoder-only transformer
language models (without going deep into neural network architectures).

Inference complexity from the transformer architecture attention
mechanism scales quadratically with input sequence length.

Illustration of obtaining a single vector representation of the text 'a new day' through BERT.

The BERT model has a typical input
length limitation of 512 tokens, so the tokenization process truncates
the input to avoid exceeding the architecture’s maximum length.
Embedding models might also truncate the text at a lower limit than
the theoretical limit of the neural network to improve quality and
reduce training costs, as computational complexity is quadratic
with input sequence length for both training and inference. The
last pooling operation compresses the token vectors into a single
vector representation. A common pooling technique is averaging the
token vectors.
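
A small sketch of mean pooling (illustrative; real embedders apply it to the model's token vectors and usually mask out padding positions):

import numpy as np

def mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # token_vectors: (sequence_length, dims); attention_mask: (sequence_length,)
    mask = attention_mask[:, None].astype(token_vectors.dtype)
    return (token_vectors * mask).sum(axis=0) / mask.sum()

token_vectors = np.random.rand(5, 384).astype(np.float32)  # e.g. tokens of "a new day" plus special tokens
attention_mask = np.array([1, 1, 1, 1, 0])                 # last position is padding
print(mean_pool(token_vectors, attention_mask).shape)      # (384,)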

It’s worth noting that some models may not perform pooling and
instead represent the text with multiple
vectors,
but that aspect is beyond the scope of this blog post.

Illustration of BERT inference cost versus sequence input length (sequence^2).

We use ‘Inference cost’ to refer to the computational resources
required for a single inference pass with a given input sequence
length. The graph depicts the relationship between the sequence
length and the squared compute complexity, demonstrating its quadratic
nature. Latency and throughput can be adjusted using different
techniques for parallelizing computations. See model serving at
scale for a
discussion on these techniques in Vespa.

Why does all of this matter? For retrieval systems, text queries
are usually much shorter than text documents, so invoking embedding
models for documents costs more than encoding shorter questions.

Sequence lengths and quadratic scaling are some of the reasons why
using frozen document-side
embeddings
is practical at scale, as it avoids re-embedding documents when
the model weights are updated due to re-training the model. Similarly,
query embeddings can be cached for previously seen queries as long
as the model weights are unchanged. The asymmetric length properties
can also help us design a retrieval system architecture for scale.

  • Asymmetric model size: Use different-sized models for encoding
    queries and documents (with the same output embedding dimensionality).
    See this paper for an example.
  • Asymmetric batch size: Use batch on-demand computing for embedding
    documents, using auto-scaling features, for example, with Vespa
    Cloud.
  • Asymmetric compute architecture: Use GPU acceleration for document inference but CPU
    for query inference.

The final point is that reporting embedding inference latency or
throughput without mentioning input sequence length provides little
insight.

Choose your Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.

Accuracy, however, is not nearly linear, as demonstrated on the
MTEB leaderboard.

Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets)
Small | 384 | 118 | 57.87 | 46.64
Base | 768 | 278 | 59.45 | 48.88
Large | 1024 | 560 | 61.5 | 51.43

A comparison of the E5 multilingual models; accuracy numbers from the MTEB leaderboard.

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure
complexity.

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the Transformers
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small model-dir

The above exports the model without any optimizations. The optimum
client also allows specifying optimization
levels; for example, adding --optimize O3 applies the highest optimization level usable for serving on the
CPU.

The above commands export the model to ONNX format that can be
imported and used with the Vespa Huggingface
embedder.
Using the Optimum generated ONNX and tokenizer configuration files,
we configure Vespa with the following in the Vespa application
package
services.xml
file:

<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
</component>

These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.
We can also quantize the optimized ONNX model, for example, using
the optimum
library
or onnxruntime quantization like
this.
Quantization (post-training) converts the float32 model weights (4
bytes per weight) to byte (int8), enabling faster inference on the
CPU.
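
A sketch of the dynamic (post-training) quantization step with onnxruntime; the paths are examples, and the exact options may differ from the routine linked above:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model/model.onnx",        # the Optimum-exported float32 model
    model_output="model/model-int8.onnx",  # quantized int8 weights
    weight_type=QuantType.QInt8,
)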

Performance Experiments

To demonstrate the many tradeoffs, we assess the mentioned small
E5 multilingual model on the Swahili (SW) split from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages) dataset.

Dataset | Language | Documents | Avg document tokens | Queries | Avg query tokens | Relevance Judgments
MIRACL sw | Swahili | 131,924 | 63 | 482 | 13 | 5092

Dataset characteristics; tokens are the number of language model
token identifiers. Since Swahili is a low-resource language, the
LM tokenization uses more tokens to represent similar byte-length
texts than for more popular languages such as English.
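
A quick way to inspect token counts with the model's tokenizer (a sketch; the sentence is taken from the sample document shown later in this post):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
text = "Wanaakiolojia wanatafuta vitu vilivyobaki, kwa mfano kwa kuchimba ardhi na kutafuta mabaki ya majengo, makaburi, silaha, vifaa, vyombo na mifupa ya watu."
print(len(tokenizer(text)["input_ids"]))  # number of language model token identifiers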

We experiment with post-training quantization of the model (not the
output vectors) to document the impact quantization has on retrieval
effectiveness (NDCG@10). We use this
routine
to quantize the model (We don’t use optimum for this due to this
issue – fixed
in v 1.11).

We then study the serving efficiency gains (latency/throughput) on
the same laptop-sized hardware using a quantized model versus a
full precision model.

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

  • We use the multilingual-search Vespa sample
    application
    as the starting point for these experiments. This sample app was
    introduced in Simplify search with multilingual embedding
    models.
  • We use the
    NDCG@10 metric
    to evaluate ranking effectiveness. When performing model optimizations,
    it’s important to pay attention to the impact on the task. This is
    stating the obvious, but still, many talk about accelerations and
    optimizations without mentioning task accuracy degradations.
  • We measure the throughput of indexing text documents in Vespa. This
    includes embedding inference in Vespa using the Vespa Huggingface
    embedder,
    storing the embedding vector in Vespa, and regular inverted indexing
    of the title and text field. We use the
    vespa-cli feed option
    as the feeding client.
  • We use the Vespa fbench
    tool
    to drive HTTP query load using HTTP POST against the Vespa query
    api.
  • Batch size in Vespa embedders is one for document and query inference.
  • There is no caching of query embedding inference, so repeating the same query
    text while benchmarking will trigger a new embedding inference.

Sample Vespa JSON formatted feed document (prettified) from the
MIRACL dataset.

{
    "put": "id:miracl-sw:doc::2-0",
    "fields": {
        "title": "Akiolojia",
        "text": "Akiolojia (kutoka Kiyunani \u03b1\u03c1\u03c7\u03b1\u03af\u03bf\u03c2 = \"zamani\" na \u03bb\u03cc\u03b3\u03bf\u03c2 = \"neno, usemi\") ni somo linalohusu mabaki ya tamaduni za watu wa nyakati zilizopita. Wanaakiolojia wanatafuta vitu vilivyobaki, kwa mfano kwa kuchimba ardhi na kutafuta mabaki ya majengo, makaburi, silaha, vifaa, vyombo na mifupa ya watu.",
        "doc_id": "2#0",
        "language": "sw"
    }
}
Model | Model size (MB) | NDCG@10 | Docs/second | Queries/second (*)
float32 | 448 | 0.675 | 137 | 340
Int8 (Quantized) | 112 | 0.661 | 269 | 640

Comparison of embedding inference in Vespa using a full precision
model with float32 weights against a quantized model using int8
weights. This is primarily benchmarking embedding inference. See
the next section for a deep dive into the experimental setup.

There is a small drop in retrieval accuracy from an NDCG@10 score
of 0.675 to 0.661 (2%), but a huge gain in embedding inference
efficiency. Indexing throughput increases by 2x, and query throughput
increases close to 2x. The throughput measurements are end-to-end,
either using vespa-cli feed or vespa-fbench. The difference in query
versus document sequence length largely explains the query and
document throughput difference (the quadratic scaling properties).

Query embed latency and throughput

Throughput is one way to look at it, but what about query serving
latency? We analyze query latency of the quantized model by gradually
increasing the load until the CPU is close to 100% utilization using
vespa-fbench
input format for POST requests.

/search/
{"yql": "select doc_id from doc where rank(doc_id contains \"71#13\",{targetHits:1}nearestNeighbor(embedding,q))", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits": 0}

The above query template tests Vespa end-to-end but does NOT perform
a global nearest neighbor search as the query uses the rank
operator
to retrieve by doc_id, and the second operand computes the
nearestNeighbor. This means that the nearest neighbor “search” is
limited to a single document in the index. This experimental setup
allows us to test everything end to end except the cost of exhaustive
search through all documents.

This part of the experiment focuses on the embedding model inference
and not nearest neighbor search performance. We use all the queries
in the dev set (482 unique queries). Using vespa-fbench, we simulate
load by increasing the number of concurrent clients executing queries
with sleep time 0 (-c 0) while observing the end-to-end latency and
throughput.

$ vespa-fbench -P -q queries.txt -s 20 -n $clients -c 0 localhost 8080
Clients | Average latency | 95p latency | Queries/s
1 | 8 | 10 | 125
2 | 9 | 11 | 222
4 | 10 | 13 | 400
8 | 12 | 19 | 640

Vespa query embedder performance.

As concurrency increases, the latency increases slightly, but not
much, until saturation, where latency will climb rapidly with a
hockey-stick shape due to queuing for exhausted resources.

In this case, latency is the complete end-to-end HTTP latency,
including HTTP overhead, embedding inference, and dispatching the
embedding vector to the Vespa content node process. Again, it does
not include nearest neighbor search, as the query limits the retrieval
to a single document.

In the previous section, we focused on the embedding inference
throughput and latency. In this section, we change the Vespa query
specification to perform an exact nearest neighbor search over all
documents. This setup measures the end-to-end deployment, including
HTTP overhead, embedding inference, and embedding retrieval using
Vespa exact nearest neighbor
search.
With exact search, no retrieval error is introduced by using
approximate search algorithms.

/search/
{"yql": "select doc_id from doc where {targetHits:10}nearestNeighbor(embedding,q)", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits": 

Representing BGE embedding models in Vespa using bfloat16


Photo by Rafael Drück on Unsplash

This post demonstrates how to use recently announced BGE (BAAI General Embedding)
models in Vespa. The open-sourced (MIT licensed) BGE models
from the Beijing Academy of Artificial Intelligence (BAAI) perform
strongly on the Massive Text Embedding Benchmark (MTEB
leaderboard). We
evaluate the effectiveness of two BGE variants on the
BEIR trec-covid dataset.
Finally, we demonstrate how Vespa’s support for storing and indexing
vectors using bfloat16 precision saves 50% of memory and storage
footprint with close to zero loss in retrieval quality.

Choose your BGE Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.

Quality, however, is not nearly linear, as demonstrated on the MTEB
leaderboard.

Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets)
bge-small-en | 384 | 33 | 62.11 | 51.82
bge-base-en | 768 | 110 | 63.36 | 53
bge-large-en | 1024 | 335 | 63.98 | 53.9

A comparison of the English BGE embedding models; accuracy numbers from the MTEB leaderboard. All three BGE models outperform OpenAI ada embeddings with 1536 dimensions and unknown model parameters on MTEB.

In the following sections, we experiment with the small and base
BGE variant, which gives us reasonable accuracy for a much lower
cost than the large variant. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure
complexity.

Exporting BGE to ONNX format for accelerated model inference

To use the embedding model from the Huggingface model hub in Vespa
we need to export it to ONNX format. We can use
the Transformers Optimum
library for this:

$ optimum-cli export onnx --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en

This exports the small model with the highest optimization
level
usable for serving on CPU. We also quantize the optimized ONNX model
using onnxruntime quantization like
this.
Quantization (post-training) converts the float model weights (4
bytes per weight) to byte (int8), enabling faster inference on the
CPU. As demonstrated in this blog
post,
quantization accelerates embedding model inference by 2x on CPU with negligible
impact on retrieval quality.

Using BGE in Vespa

Using the Optimum generated ONNX model and
tokenizer files, we configure the Vespa Huggingface
embedder
with the following in the Vespa application
package
services.xml
file.

<component id="bge" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
  <pooling-strategy>cls</pooling-strategy>
  <normalize>true</normalize>
</component>

BGE uses the CLS special token as the text representation vector
(instead of average pooling). We also specify normalization so that
we can use the prenormalized-angular distance
metric
for nearest neighbor search. See configuration
reference
for details.

With this, we are ready to use the BGE model to embed queries and
documents with Vespa.

Using BGE in Vespa schema

The BGE model family does not use instructions for documents like
the E5
family,
so we don’t need to prepend the input to the document model with
"passage: " like with the E5 models. Since we configure the Vespa
Huggingface embedder to normalize the vectors, we use the optimized
prenormalized-angular distance-metric for the nearest neighbor search.

field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Note that the above does not enable HNSW indexing; see this blog
post on the tradeoffs related to introducing approximate nearest
neighbor search. The small model embedding is configured with 384
dimensions, while the base model uses 768 dimensions.

field embedding type tensor<float>(x[768]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Using BGE in queries

The BGE model uses query instructions like the E5
family
that are prepended to the input query text. We prepend the instruction
text to the user query as demonstrated in the snippet below:

import requests

session = requests.Session()
query = 'is remdesivir an effective treatment for COVID-19'
body = {
    'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
    'input.query(q)': 'embed(Represent this sentence for searching relevant passages: ' + query + ')',
    'ranking': 'semantic',
    'hits': '10'
}
response = session.post('http://localhost:8080/search/', json=body)

The BGE query instruction is "Represent this sentence for searching
relevant passages: ".
We are unsure why they chose such a long query instruction, as it
hurts efficiency since compute complexity is quadratic
with sequence length.

Experiments

We evaluate the small and base models on the trec-covid test split
from the BEIR benchmark. We
concatenate the title and the abstract as input to the BGE embedding
models, as demonstrated in the Vespa schema snippets in the previous
section.

Dataset | Documents | Avg document tokens | Queries | Avg query tokens | Relevance Judgments
BEIR trec_covid | 171,332 | 245 | 50 | 18 | 66,336

Dataset characteristics; tokens are the number of language model
token identifiers (wordpieces)

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

Sample Vespa JSON
formatted
feed document (prettified) from the
BEIR trec-covid dataset:

{
  "put": "id:miracl-trec:doc::wnnsmx60",
  "fields": {
    "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
    "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
    "doc_id": "wnnsmx60",
    "language": "en"
  }
}

Evaluation results

Model | Model size (MB) | NDCG@10 BGE | NDCG@10 BM25
bge-small-en | 33 | 0.7395 | 0.6823
bge-base-en | 104 | 0.7662 | 0.6823

Evaluation results for quantized BGE models.

We contrast both BGE models with the unsupervised
BM25 baseline from
this blog
post.
Both models perform better than the BM25 baseline
on this dataset. We also note that the NDCG@10 numbers we obtain
with Vespa are slightly better than those reported on the MTEB
leaderboard for the same dataset. We can also observe that the base
model performs better on this dataset, but it is also 2x more costly
due to the size of the embedding model and the embedding dimensionality.
The bge-base model inference could benefit from GPU acceleration
(without quantization).

Using bfloat16 precision

We evaluate using
bfloat16
instead of float for the tensor representation in Vespa. Using
bfloat16 instead of float reduces memory and storage requirements
by 2x since bfloat16 uses 2 bytes per embedding dimension instead
of 4 bytes for float. See Vespa tensor value types.
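
A back-of-the-envelope sketch of the saving for this dataset (numbers are illustrative of embedding storage only, not total memory usage):

num_docs = 171_332   # BEIR trec-covid documents
dims = 384           # bge-small-en embedding dimensionality

for name, bytes_per_dim in [("float", 4), ("bfloat16", 2)]:
    mib = num_docs * dims * bytes_per_dim / (1024 ** 2)
    print(f"{name:9s}: {mib:.0f} MiB of embedding attribute data")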

We do not change the type of the query tensor. Vespa will take care
of casting the bfloat16 field representation to float at search
time, allowing CPU acceleration of floating point operations. The
cast operation does come with a small cost (20-30%) compared with
using float, but the saving in memory and storage resource footprint
is well worth it for most use cases.

field embedding type tensor<bfloat16>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Using bfloat16 instead of float for the embedding tensor.

Model | NDCG@10 bfloat16 | NDCG@10 float
bge-small-en | 0.7346 | 0.7395
bge-base-en | 0.7656 | 0.7662

Evaluation results for BGE models – float versus bfloat16 document representation.

By using bfloat16 instead of float to store the vectors, we save
50% of memory cost and we can store 2x more embeddings per instance
type with almost zero impact on retrieval quality.

Summary

Using the open-source Vespa container image, we’ve explored the
recently announced strong BGE text embedding models with embedding
inference and retrieval on our laptops. The local experimentation
eliminates prolonged feedback loops.

Moreover, the same Vespa configuration files suffice for many
deployment scenarios, whether in on-premise setups, on Vespa Cloud,
or locally on a laptop. The beauty is that dedicated infrastructure
for managing embedding inference and nearest neighbor search as
separate systems becomes obsolete with Vespa's native embedding
support.

If you are interested in learning more about Vespa, see Vespa Cloud – getting started,
or self-serve Vespa – getting started.
Got questions? Join the Vespa community in Vespa Slack.