Representing BGE embedding models in Vespa using bfloat16

Photo by Rafael Drück on Unsplash

This post demonstrates how to use recently announced BGE (BAAI General Embedding)
models in Vespa. The open-sourced (MIT licensed) BGE models
from the Beijing Academy of Artificial Intelligence (BAAI) perform
strongly on the Massive Text Embedding Benchmark (MTEB
leaderboard). We
evaluate the effectiveness of two BGE variants on the
BEIR trec-covid dataset.
Finally, we demonstrate how Vespa’s support for storing and indexing
vectors in bfloat16 precision cuts the memory and storage footprint
by 50% with close to zero loss in retrieval quality.

Choose your BGE Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.
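To make the scaling concrete, here is a back-of-the-envelope sketch (the 10 million document corpus size is a hypothetical example):

# Back-of-the-envelope: in-memory size of float embeddings (4 bytes per dimension)
# for a hypothetical corpus of 10 million documents at 384 versus 768 dimensions.
num_docs = 10_000_000   # hypothetical corpus size
bytes_per_float = 4

for dims in (384, 768):
    gib = num_docs * dims * bytes_per_float / 1024**3
    print(f"{dims} dims: {gib:.1f} GiB")  # 384 -> ~14.3 GiB, 768 -> ~28.6 GiB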

Quality, however, does not scale linearly with these costs, as the MTEB
leaderboard demonstrates.

| Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets) |
|---|---|---|---|---|
| bge-small-en | 384 | 33 | 62.11 | 51.82 |
| bge-base-en | 768 | 110 | 63.36 | 53 |
| bge-large-en | 1024 | 335 | 63.98 | 53.9 |

A comparison of the English BGE embedding models, with accuracy numbers from the MTEB
leaderboard. All three BGE models outperform OpenAI ada embeddings
(1536 dimensions, unknown model parameters) on MTEB.

In the following sections, we experiment with the small and base
BGE variants, which give us reasonable accuracy at a much lower
cost than the large variant. The small model’s low inference complexity
also makes it servable on CPU, allowing local iteration and
development without managing GPU-related infrastructure complexity.

Exporting BGE to ONNX format for accelerated model inference

To use the embedding model from the Huggingface model hub in Vespa,
we need to export it to ONNX format. We can use
the Transformers Optimum
library for this:

$ optimum-cli export onnx --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en

This exports the small model with the highest optimization
level
usable for serving on CPU. We also quantize the optimized ONNX model
using onnxruntime quantization like
this.
Post-training quantization converts the float model weights (4
bytes per weight) to int8 (1 byte per weight), enabling faster inference on the
CPU. As demonstrated in this blog
post,
quantization accelerates embedding model inference by 2x on CPU with negligible
impact on retrieval quality.
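The quantization script is linked above; a minimal sketch of post-training dynamic quantization with onnxruntime could look like this (the file paths are assumptions matching the export command above):

# Minimal sketch: post-training dynamic quantization of the exported ONNX model.
# The paths are assumptions; adjust them to where optimum-cli wrote the model.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bge-small-en/model.onnx",
    model_output="bge-small-en/model_quantized.onnx",
    weight_type=QuantType.QInt8,  # float weights (4 bytes) -> int8 (1 byte)
)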

Using BGE in Vespa

Using the Optimum-generated ONNX model and
tokenizer files, we configure the Vespa Huggingface
embedder
with the following in the Vespa application
package
services.xml
file.

<component id="bge" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
  <pooling-strategy>cls</pooling-strategy>
  <normalize>true</normalize>
</component>

BGE uses the CLS special token as the text representation vector
(instead of average pooling). We also specify normalization so that
we can use the prenormalized-angular distance
metric
for nearest neighbor search. See configuration
reference
for details.
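Conceptually, the embedder produces a CLS-pooled, L2-normalized vector. A rough Python sketch of the same computation, assuming the exported model exposes the usual BERT-style inputs and the last hidden state as its first output, could look like:

# Rough sketch of what the hugging-face-embedder computes for a single text:
# run the ONNX model, take the CLS token vector (pooling-strategy: cls),
# and L2-normalize it (normalize: true).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")
session = ort.InferenceSession("bge-small-en/model.onnx")  # path is an assumption

def embed(text: str) -> np.ndarray:
    enc = tokenizer(text, return_tensors="np", truncation=True, max_length=512)
    inputs = {i.name: enc[i.name] for i in session.get_inputs()}
    last_hidden_state = session.run(None, inputs)[0]  # assumed first output
    cls = last_hidden_state[0, 0, :]                  # CLS token at position 0
    return cls / np.linalg.norm(cls)                  # unit-length vector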

With this, we are ready to use the BGE model to embed queries and
documents with Vespa.

Using BGE in Vespa schema

The BGE model family does not use document-side instructions like the E5
family, so we don’t need to prepend “passage: “ to the document text
as with the E5 models. Since we configure the Vespa Huggingface
embedder to normalize the vectors, we use the optimized
prenormalized-angular distance-metric for the nearest neighbor search.

field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Note that the above does not enable HNSW indexing; see this blog
post on the tradeoffs of introducing approximate nearest
neighbor search. The small model embedding is configured with 384
dimensions, while the base model uses 768 dimensions:

field embedding type tensor<float>(x[768]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Using BGE in queries

Like the E5 family, the BGE model uses a query instruction that is
prepended to the input query text. We prepend the instruction
text to the user query as demonstrated in the snippet below:

import requests

session = requests.Session()

query = 'is remdesivir an effective treatment for COVID-19'
body = {
    'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
    'input.query(q)': 'embed(Represent this sentence for searching relevant passages: ' + query + ')',
    'ranking': 'semantic',
    'hits': '10'
}
response = session.post('http://localhost:8080/search/', json=body)

The BGE query instruction is “Represent this sentence for searching
relevant passages: “. We are unsure why such a long query instruction
was chosen, as it hurts efficiency: transformer compute complexity is
quadratic in the sequence length.

Experiments

We evaluate the small and base models on the trec-covid test split
from the BEIR benchmark. We
concatenate the title and the abstract as input to the BGE embedding
models, as demonstrated in the Vespa schema snippets in the previous
section.

| Dataset | Documents | Avg document tokens | Queries | Avg query tokens | Relevance Judgments |
|---|---|---|---|---|---|
| BEIR trec_covid | 171,332 | 245 | 50 | 18 | 66,336 |

Dataset characteristics; tokens are the number of language model
token identifiers (wordpieces)

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. There is no GPU acceleration, and no need to manage CUDA driver
compatibility, huge container images due to CUDA dependencies, or
forwarding of host GPU devices to the container.
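For reference, the container image can be started locally roughly like this (the standard Vespa quick-start invocation; the container name is arbitrary):

$ docker run --detach --name vespa --hostname vespa-container \
    --publish 8080:8080 --publish 19071:19071 vespaengine/vespa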

Sample Vespa JSON
formatted
feed document (prettified) from the
BEIR trec-covid dataset:

{
  "put": "id:miracl-trec:doc::wnnsmx60",
  "fields": {
    "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
    "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
    "doc_id": "wnnsmx60",
    "language": "en"
  }
}
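As a hedged sketch, a document like this could be fed to the local instance over the Vespa Document v1 HTTP API, mirroring the namespace and document type from the sample id above (field values abbreviated):

# Sketch: feed one document to a local Vespa instance via the Document v1 API.
# The namespace ("miracl-trec") and document type ("doc") mirror the sample id.
import requests

doc_id = "wnnsmx60"
fields = {
    "title": "Managing emerging infectious diseases: ...",
    "text": "In the 1980's and 1990's HIV/AIDS was ...",
    "doc_id": doc_id,
    "language": "en",
}
url = f"http://localhost:8080/document/v1/miracl-trec/doc/docid/{doc_id}"
response = requests.post(url, json={"fields": fields})
response.raise_for_status()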

Evaluation results

| Model | Model size (MB) | NDCG@10 BGE | NDCG@10 BM25 |
|---|---|---|---|
| bge-small-en | 33 | 0.7395 | 0.6823 |
| bge-base-en | 104 | 0.7662 | 0.6823 |

Evaluation results for quantized BGE models.

We contrast both BGE models with the unsupervised
BM25 baseline from
this blog
post.
Both models perform better than the BM25 baseline
on this dataset. We also note that the NDCG@10 numbers we obtain with
Vespa are slightly better than those reported on the MTEB leaderboard
for the same dataset. We can also observe that the base model
performs better on this dataset, but it is also about 2x more costly
due to the larger embedding model and the higher embedding
dimensionality. The bge-base model inference could benefit from GPU
acceleration (without quantization).
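One way to compute NDCG@10 from BEIR qrels and a Vespa result run is with the ir_measures package; a minimal sketch with toy data (not the actual run):

# Sketch: NDCG@10 with ir_measures, given BEIR qrels and a Vespa result run.
# qrels: {query_id: {doc_id: relevance_label}}, run: {query_id: {doc_id: score}}
import ir_measures
from ir_measures import nDCG

qrels = {"1": {"wnnsmx60": 2, "another-doc": 0}}   # toy example data
run   = {"1": {"wnnsmx60": 0.93, "another-doc": 0.51}}

print(ir_measures.calc_aggregate([nDCG @ 10], qrels, run))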

Using bfloat16 precision

We evaluate using
bfloat16
instead of float for the tensor representation in Vespa. Using
bfloat16 instead of float reduces memory and storage requirements
by 2x since bfloat16 uses 2 bytes per embedding dimension instead
of 4 bytes for float. See Vespa tensor values
types.

We do not change the type of the query tensor. Vespa will take care
of casting the bfloat16 field representation to float at search
time, allowing CPU acceleration of floating point operations. The
cast operation does come with a small cost (20-30%) compared with
using float, but the saving in memory and storage resource footprint
is well worth it for most use cases.

field embedding type tensor<bfloat16>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Using bfloat16 instead of float for the embedding tensor.

| Model | NDCG@10 bfloat16 | NDCG@10 float |
|---|---|---|
| bge-small-en | 0.7346 | 0.7395 |
| bge-base-en | 0.7656 | 0.7662 |

Evaluation results for BGE models – float versus bfloat16 document representation.

By using bfloat16 instead of float to store the vectors, we save
50% of the memory cost and can store 2x more embeddings per instance
type with almost zero impact on retrieval quality.
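For the trec-covid corpus above, the saving is easy to estimate with a back-of-the-envelope sketch:

# Back-of-the-envelope: embedding memory for the 171,332 trec-covid documents
# at 384 dimensions, float (4 bytes) versus bfloat16 (2 bytes) per dimension.
num_docs, dims = 171_332, 384

for name, bytes_per_dim in (("float", 4), ("bfloat16", 2)):
    mib = num_docs * dims * bytes_per_dim / 1024**2
    print(f"{name}: {mib:.0f} MiB")  # float -> ~251 MiB, bfloat16 -> ~125 MiB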

Summary

Using the open-source Vespa container image, we’ve explored the
recently announced, strong BGE text embedding models, running embedding
inference and retrieval on our laptop. Local experimentation
eliminates prolonged feedback loops.

Moreover, the same Vespa configuration files suffice for many
deployment scenarios, whether in on-premise setups, on Vespa Cloud,
or locally on a laptop. The beauty is that dedicated
infrastructure for managing embedding inference and nearest neighbor
search as separate systems becomes obsolete with Vespa’s
native embedding support.

If you are interested in learning more about Vespa, see Vespa Cloud – getting started
or self-serve Vespa – getting started.
Got questions? Join the Vespa community in Vespa Slack.