Representing BGE embedding models in Vespa using bfloat16
Photo by Rafael Drück on Unsplash
This post demonstrates how to use recently announced BGE (BAAI General Embedding)
models in Vespa. The open-sourced (MIT licensed) BGE models
from the Beijing Academy of Artificial Intelligence (BAAI) perform
strongly on the Massive Text Embedding Benchmark (MTEB
leaderboard). We
evaluate the effectiveness of two BGE variants on the
BEIR trec-covid dataset.
Finally, we demonstrate how Vespa’s support for storing and indexing
vectors using bfloat16 precision cuts the memory and storage footprint
by 50% with close to zero loss in retrieval quality.
Choose your BGE Fighter
When deciding on an embedding model, developers must strike a balance
between quality and serving costs.
These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.
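To make the storage side concrete, here is a back-of-the-envelope calculation (the one-million-document corpus size is an illustrative assumption):

# Rough embedding storage estimate per dimensionality (float, 4 bytes per value).
num_docs = 1_000_000          # illustrative corpus size
bytes_per_value = 4           # float
for dims in (384, 768):
    gib = num_docs * dims * bytes_per_value / 1024**3
    print(f"{dims} dims: {gib:.2f} GiB")
# 384 dims: 1.43 GiB, 768 dims: 2.86 GiB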
Quality, however, does not scale linearly with model size, as demonstrated on the MTEB
leaderboard.
Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets)
--- | --- | --- | --- | ---
bge-small-en | 384 | 33 | 62.11 | 51.82
bge-base-en | 768 | 110 | 63.36 | 53
bge-large-en | 1024 | 335 | 63.98 | 53.9
A comparison of the English BGE embedding models; accuracy numbers from the MTEB
leaderboard. All three BGE models outperform the OpenAI ada embeddings, which use 1536
dimensions and an unknown number of model parameters, on MTEB.
In the following sections, we experiment with the small and base
BGE variants, which give us reasonable accuracy at a much lower
cost than the large variant. The small model's inference complexity
also makes it servable on CPU, allowing iteration and
development locally without managing GPU-related infrastructure
complexity.
Exporting BGE to ONNX format for accelerated model inference
To use the embedding model from the Huggingface model hub in Vespa,
we need to export it to ONNX format. We can use
the Transformers Optimum
library for this:
$ optimum-cli export onnx --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en
This exports the small model with the highest optimization
level
usable for serving on CPU. We also quantize the optimized ONNX model
using onnxruntime quantization like
this.
Quantization (post-training) converts the float model weights (4
bytes per weight) to int8 (1 byte per weight), enabling faster inference on the
CPU. As demonstrated in this blog
post,
quantization accelerates embedding model inference by 2x on CPU with negligible
impact on retrieval quality.
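As a reference, post-training dynamic quantization of the exported model with onnxruntime can be sketched as follows (the file paths are illustrative assumptions, not the exact paths produced by the export command above):

# Sketch: post-training dynamic int8 quantization with onnxruntime.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bge-small-en/model.onnx",        # optimized float32 ONNX model
    model_output="bge-small-en/model-int8.onnx",  # quantized int8 ONNX model
    weight_type=QuantType.QInt8,                  # store weights as int8
)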
Using BGE in Vespa
Using the Optimum generated ONNX model and
tokenizer files, we configure the Vespa Huggingface
embedder
with the following in the Vespa application
package
services.xml
file.
<component id="bge" type="hugging-face-embedder">
<transformer-model path="model/model.onnx"/>
<tokenizer-model path="model/tokenizer.json"/>
<pooling-strategy>cls</pooling-strategy>
<normalize>true</normalize>
</component>
BGE uses the CLS special token as the text representation vector
(instead of average pooling). We also specify normalization so that
we can use the prenormalized-angular distance metric
for nearest neighbor search. See the configuration
reference
for details.
With this, we are ready to use the BGE model to embed queries and
documents with Vespa.
Using BGE in Vespa schema
The BGE model family does not use instructions for documents like
the E5
family,
so we don’t need to prepend the document input with
“passage: “ as with the E5 models. Since we configure the Vespa
Huggingface
embedder to
normalize the vectors, we use the optimized prenormalized-angular
distance metric for nearest neighbor search.
field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
        distance-metric: prenormalized-angular
    }
}
Note that the above does not enable HNSW
indexing; see
this
blog
post on the tradeoffs of introducing approximate nearest
neighbor search. The small model embedding is configured with 384
dimensions, while the base model uses 768 dimensions. A sketch with
HNSW enabled follows the base-model schema below.
field embedding type tensor<float>(x[768]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
        distance-metric: prenormalized-angular
    }
}
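For reference, enabling HNSW indexing for approximate nearest neighbor search only requires adding index to the indexing chain and an hnsw block; a minimal sketch for the small model (the hnsw parameter values shown are the Vespa defaults, included purely for illustration):

field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute | index
    attribute {
        distance-metric: prenormalized-angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}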
Using BGE in queries
The BGE model uses query instructions, like the E5
family,
that are prepended to the input query text. We prepend the instruction
text to the user query as demonstrated in the snippet below:
import requests

session = requests.Session()

query = 'is remdesivir an effective treatment for COVID-19'
body = {
    # Retrieve the 10 nearest neighbors of the query embedding
    'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
    # The embed() argument is prefixed with the BGE query instruction
    'input.query(q)': 'embed(Represent this sentence for searching relevant passages: ' + query + ')',
    'ranking': 'semantic',
    'hits': '10'
}
response = session.post('http://localhost:8080/search/', json=body)
The BGE query instruction is Represent this sentence for searching
relevant passages:. We are unsure why they chose such a long query instruction,
as it hurts efficiency: transformer compute complexity is
quadratic
with sequence length.
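The request above references a semantic rank profile, which is not shown in the schema snippets; a minimal sketch of such a profile, ranking by vector closeness and assuming the 384-dimensional small model, could look like:

rank-profile semantic {
    inputs {
        query(q) tensor<float>(x[384])
    }
    first-phase {
        expression: closeness(field, embedding)
    }
}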
Experiments
We evaluate the small and base models on the trec-covid test split
from the BEIR benchmark. We
concatenate the title and the abstract as input to the BGE embedding
models, as demonstrated in the Vespa schema snippets in the previous
section.
Dataset | Documents | Avg document tokens | Queries | Avg query tokens | Relevance Judgments
--- | --- | --- | --- | --- | ---
BEIR trec_covid | 171,332 | 245 | 50 | 18 | 66,336
Dataset characteristics; tokens are the number of language model
token identifiers (wordpieces)
All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. There is no GPU
acceleration, and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding of host GPU
devices to the container.
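The container image can be started locally with Docker (or Podman); a minimal sketch following the Vespa quick-start, publishing the standard query (8080) and config/deploy (19071) ports:

$ docker run --detach --name vespa --hostname vespa-container \
    --publish 8080:8080 --publish 19071:19071 \
    vespaengine/vespa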
Sample Vespa JSON
formatted
feed document (prettified) from the
BEIR trec-covid dataset:
{
  "put": "id:miracl-trec:doc::wnnsmx60",
  "fields": {
    "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
    "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
    "doc_id": "wnnsmx60",
    "language": "en"
  }
}
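For local experimentation, such a document can be fed to the running Vespa instance through the document/v1 HTTP API. A minimal sketch, assuming the namespace (miracl-trec), document type (doc), and fields from the sample above, and reusing the requests session from the query snippet earlier:

# Sketch: feed the sample document above through the Vespa document/v1 API.
# URL pattern: /document/v1/<namespace>/<document-type>/docid/<document-id>
# 'sample' is assumed to hold the JSON document shown above, parsed into a dict.
feed_response = session.post(
    'http://localhost:8080/document/v1/miracl-trec/doc/docid/wnnsmx60',
    json={'fields': sample['fields']}
)
feed_response.raise_for_status()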
Evaluation results
Model | Model size (MB) | NDCG@10 BGE | NDCG@10 BM25
--- | --- | --- | ---
bge-small-en | 33 | 0.7395 | 0.6823
bge-base-en | 104 | 0.7662 | 0.6823
Evaluation results for quantized BGE models.
We contrast both BGE models with the unsupervised
BM25 baseline from
this blog
post.
Both models perform better than the BM25 baseline
on this dataset. We also note that the NDCG@10 numbers we obtain
with Vespa are slightly better than those reported on the MTEB leaderboard
for the same dataset. The base model
performs better on this dataset, but is also 2x more costly due to
the larger embedding model and higher embedding dimensionality. The
bge-base model inference could benefit from GPU
acceleration
(without quantization).
Using bfloat16 precision
We evaluate using
bfloat16
instead of float for the tensor representation in Vespa. Using
bfloat16 instead of float reduces memory and storage requirements
by 2x, since bfloat16 uses 2 bytes per embedding dimension instead
of 4 bytes for float. See Vespa tensor value
types.
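As a back-of-the-envelope example for this dataset with the small model (ignoring per-document and index overhead):

171,332 documents x 384 dims x 4 bytes (float)    ≈ 263 MB
171,332 documents x 384 dims x 2 bytes (bfloat16) ≈ 132 MB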
We do not change the type of the query tensor. Vespa will take care
of casting the bfloat16 field representation to float at search
time, allowing CPU acceleration of floating-point operations. The
cast operation does come with a small cost (20-30%) compared with
using float, but the savings in memory and storage footprint
are well worth it for most use cases.
field embedding type tensor<bfloat16>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
        distance-metric: prenormalized-angular
    }
}
Using bfloat16 instead of float for the embedding tensor.
Model | NDCG@10 bfloat16 | NDCG@10 float
--- | --- | ---
bge-small-en | 0.7346 | 0.7395
bge-base-en | 0.7656 | 0.7662
Evaluation results for BGE models – float versus bfloat16 document representation.
By using bfloat16 instead of float to store the vectors, we save
50% of the memory cost and can store 2x more embeddings per instance
type, with almost zero impact on retrieval quality.
Summary
Using the open-source Vespa container image, we’ve explored the
recently announced, strongly performing BGE text embedding models, running
both embedding inference and retrieval on our laptops. The local experimentation
eliminates prolonged feedback loops.
Moreover, the same Vespa configuration files suffice for many
deployment scenarios, whether in on-premise setups, on Vespa Cloud,
or locally on a laptop. The beauty is that separate infrastructure
systems for managing embedding inference and nearest neighbor
search become obsolete with Vespa’s native embedding
support.
If you are interested in learning more about Vespa, see Vespa Cloud – getting started,
or self-serve Vespa – getting started.
Got questions? Join the Vespa community in Vespa Slack.