Simplify Search with Multilingual Embedding Models
This blog post presents a robust multilingual embedding model from
the E5 family and shows how to represent it in Vespa. We also
demonstrate how to evaluate the model’s effectiveness on multilingual
information retrieval (IR) datasets.
Introduction
The fundamental concept behind embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought close together and dissimilar ones are pushed
farther apart. Mapping multilingual texts into a unified vector
embedding space makes it possible to represent and compare queries
and documents from various languages within this shared space.
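To make the geometric intuition concrete, here is a minimal sketch with made-up three-dimensional vectors (real E5 embeddings have hundreds of dimensions); the cosine similarity between two vectors is high when they point in similar directions, regardless of which language produced the text:

import numpy as np

# Toy vectors, not real model output, purely to illustrate that similarity
# in the shared embedding space is measured geometrically.
query = np.array([0.9, 0.1, 0.2])          # "what is the capital of Norway?"
doc_norwegian = np.array([0.8, 0.2, 0.1])  # "Oslo er hovedstaden i Norge"
doc_unrelated = np.array([0.1, 0.9, 0.3])  # "Recipe for sourdough bread"

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(query, doc_norwegian))  # high score: related content, different language
print(cosine(query, doc_unrelated))  # low score: unrelated content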
Meet the E5 family
Researchers from Microsoft introduced the E5 family of text embedding
models in the paper Text Embeddings by Weakly-Supervised Contrastive
Pre-training. E5 is short for
EmbEddings from bidirEctional Encoder rEpresentations. The same
researchers have also published the model weights on the Huggingface
model hub under a permissive MIT license. There are three
multilingual E5 embedding model variants with different model sizes
and embedding dimensionality. All three models are initialized from
pre-trained transformer models with trained text vocabularies that
handle up to 100 languages.
For example, the model card for the base variant notes: “This model
is initialized from xlm-roberta-base and continually trained on a
mixture of multilingual datasets. It supports 100 languages from
xlm-roberta, but low-resource languages may see performance
degradation.”
Similarly, the E5 embedding model family includes three variants
trained only on English datasets.
Choose your E5 Fighter
The embedding model variants allow developers to trade effectiveness
against serving-related costs. Embedding model size and embedding dimensionality
impact task accuracy and the cost of model inference, nearest
neighbor search, and storage.
These serving-related costs all scale roughly linearly with model size
and embedding dimensionality. In other words, using an embedding
model with 768 dimensions instead of 384 doubles both embedding storage
and nearest neighbor search compute. Accuracy, however, does not
scale linearly, as demonstrated on the MTEB
leaderboard.
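As a back-of-the-envelope sketch of the storage side of this relationship (the corpus size of 10 million documents and float32 precision are illustrative assumptions):

# Storage grows linearly with embedding dimensionality: one float32 (4 bytes)
# per dimension per document.
num_docs = 10_000_000
bytes_per_float = 4

for dim in (384, 768, 1024):
    gib = num_docs * dim * bytes_per_float / 1024**3
    print(f"{dim} dims: {gib:.1f} GiB of embedding storage")
# Prints roughly 14.3, 28.6, and 38.1 GiB for 384, 768, and 1024 dimensions.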
Nearest neighbor search for embedding-based retrieval can be
accelerated by approximate algorithms like
HNSW. HNSW
significantly reduces the number of distance calculations at query time, but
trades away some retrieval accuracy because the search is
approximate. Still, the same linear relationship between embedding
dimensionality and distance computation cost holds.
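Vespa performs this search internally, but the plain numpy sketch below (with random vectors) illustrates why exact nearest neighbor search cost grows with both corpus size and embedding dimensionality, and hence why reducing the number of distance calculations matters:

import numpy as np

# Exact (brute-force) nearest neighbor search: the query is scored against every
# document vector, so compute is proportional to num_docs * dim.
rng = np.random.default_rng(0)
dim, num_docs = 384, 100_000
docs = rng.standard_normal((num_docs, dim)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # normalize for cosine/angular distance

query = rng.standard_normal(dim).astype(np.float32)
query /= np.linalg.norm(query)

scores = docs @ query                # num_docs * dim multiply-adds
top10 = np.argsort(-scores)[:10]     # exact top-10 by cosine similarity
print(top10, scores[top10])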
| Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets) |
|-------|----------------|------------------|--------------------------------|----------------------------------|
| Small | 384 | 118 | 57.87 | 46.64 |
| Base | 768 | 278 | 59.45 | 48.88 |
| Large | 1024 | 560 | 61.5 | 51.43 |
Comparison of the E5 multilingual models. Accuracy numbers are from the MTEB
leaderboard.
Do note that the datasets included in MTEB are biased towards English,
which means that the reported retrieval performance might
not match the accuracy observed on private datasets, especially
for low-resource languages.
Representing E5 embedding models in Vespa
Vespa’s vector search and embedding inference support allows
developers to build multilingual semantic search applications without
managing separate systems for embedding inference and vector search
over the multilingual embedding representations.
In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy at a much lower cost than its
larger E5 siblings. The small model’s low inference complexity
also makes it servable on CPUs, allowing local iteration and
development without managing GPU-related infrastructure
complexity.
Exporting E5 to ONNX format for accelerated model inference
To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the
Optimum library:
$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small multilingual-e5-small-onnx
The above optimum-cli
command exports the HF model to ONNX format, which can be imported
and used with the Vespa Huggingface
embedder.
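Before packaging the files with the application, it can be useful to sanity-check the exported model with onnxruntime; the path below assumes the output directory from the command above and the default model.onnx file name:

import onnxruntime as ort

# Load the exported model and list its inputs and outputs.
session = ort.InferenceSession("multilingual-e5-small-onnx/model.onnx")
print("inputs: ", [(i.name, i.shape) for i in session.get_inputs()])
print("outputs:", [(o.name, o.shape) for o in session.get_outputs()])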
Using the Optimum-generated ONNX file and tokenizer configuration
file, we configure Vespa with the following in the Vespa application
package
services.xml
file:
<component id="e5" type="hugging-face-embedder">
<transformer-model path="model/multilingual-e5-small.onnx"/>
<tokenizer-model path="model/tokenizer.json"/>
</component>
That’s it! These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.
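For reference, a possible application package layout matching the paths configured above (the schema file name doc.sd is an assumption, and the exported model.onnx is copied and renamed to match the configured path):

my-app/
├── services.xml
├── schemas/
│   └── doc.sd
└── model/
    ├── multilingual-e5-small.onnx
    └── tokenizer.json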
Using E5 with queries and documents in Vespa
The E5 family uses text instructions mixed with the input data to
separate queries and documents. Instead of using two different
models for queries and documents, the E5 family distinguishes them
by prepending the input with “query: ” or “passage: ”.
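Outside of Vespa, the same convention can be illustrated with the sentence-transformers library (this snippet is not part of the Vespa setup; it only shows the effect of the prefixes):

from sentence_transformers import SentenceTransformer

# Prefix queries with "query: " and documents with "passage: " before encoding.
model = SentenceTransformer("intfloat/multilingual-e5-small")

texts = [
    "query: hva er hovedstaden i Norge?",                   # Norwegian query
    "passage: Oslo is the capital of Norway.",              # English passage
    "passage: A sourdough starter needs flour and water.",  # unrelated passage
]
embeddings = model.encode(texts, normalize_embeddings=True)  # shape (3, 384)

# With normalized embeddings, the dot product equals cosine similarity.
print(embeddings[0] @ embeddings[1])  # query vs. relevant passage: expected higher
print(embeddings[0] @ embeddings[2])  # query vs. unrelated passage: expected lower

In Vespa, the same prefixes are applied in the schema’s indexing statement and in the query request, as shown below.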
schema doc {
    document doc {
        field title type string { .. }
        field text type string { .. }
    }
    field embedding type tensor<float>(x[384]) {
        indexing {
            "passage: " . input title . " " . input text | embed | attribute
        }
    }
}
The above Vespa schema language
uses the embed
indexing
language
functionality to invoke the configured E5 embedding model, using a
concatenation of the “passage: ” instruction, the title, and
the text. Notice that the embedding
tensor
field defines the embedding dimensionality (384).
The above schema uses a single vector
representation per document. With Vespa multi-vector
indexing,
it’s also possible to represent and index multiple vector representations
for the same tensor field.
Similarly, on the query side, we embed the input query text with the
E5 model, now prepending the user query with “query: ”:
{
    "yql": "select ..",
    "input.query(q)": "embed(query: the query to encode)"
}
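To run such a query against a local Vespa instance, the request body can be POSTed to the query API; the YQL and the rank profile name “semantic” below are illustrative assumptions that must match your schema:

import requests

body = {
    "yql": "select * from doc where {targetHits: 10}nearestNeighbor(embedding, q)",
    "input.query(q)": "embed(query: hva er hovedstaden i Norge?)",
    "ranking": "semantic",  # assumed rank profile, e.g. one ordering by closeness(field, embedding)
    "hits": 10,
}
response = requests.post("http://localhost:8080/search/", json=body)
print(response.json())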
Evaluation
To demonstrate how to evaluate multilingual embedding models, we
evaluate the small E5 multilingual variant on three information
retrieval (IR) datasets. We use the classic trec-covid dataset, part
of the BEIR benchmark,
which we have written about in blog
posts
before. We also include two languages from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages) datasets.
All three datasets use
NDCG@10 to
evaluate ranking effectiveness. NDCG is a ranking metric that is
precision-oriented and handles graded relevance judgments.
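As a minimal sketch of how NDCG@10 is computed (standard log2 position discount; the actual evaluations use trec-style tooling with the full set of graded judgments per query):

import math

def dcg(labels, k=10):
    # Graded relevance with a log2 position discount; rank 1 is discounted by log2(2).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, all_judged_labels, k=10):
    ideal = dcg(sorted(all_judged_labels, reverse=True), k)
    return dcg(ranked_labels, k) / ideal if ideal > 0 else 0.0

# Hypothetical example: labels of the system's top-10 results, in rank order,
# versus all graded judgments for the query (2 = highly relevant, 1 = relevant, 0 = not).
ranked = [2, 0, 1, 2, 0, 0, 1, 0, 0, 0]
judged = [2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(ranked, judged), 4))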
| Dataset | Included in E5 fine-tuning | Language | Documents | Queries | Relevance Judgments |
|---------|----------------------------|----------|-----------|---------|---------------------|
| BEIR:trec-covid | No | English | 171,332 | 50 | 66,336 |
| MIRACL:sw | Yes (the train split was used) | Swahili | 131,924 | 482 | 5,092 |
| MIRACL:yo | No | Yoruba | 49,043 | 119 | 1,188 |
IR dataset characteristics
We consider both BEIR:trec-covid and MIRACL:yo out-of-domain datasets,
as E5 has not been trained or fine-tuned on them (they don’t
contain any training split). Applying E5 to out-of-domain datasets
is called zero-shot, as no training examples (shots) are available.
The Swahili dataset can be categorized as in-domain,
as E5 has been trained on its train split. All three
datasets have documents with title and text
fields. We use the concatenation strategy described in previous sections, feeding both title
and text to the embedding model.
We evaluate the E5 model using exact nearest neighbor
search
without HNSW indexing,
and all experiments are run on an M1 Pro (arm64) laptop using the
open-source Vespa container
image. We contrast
the E5 model results with Vespa BM25.
| Dataset | BM25 | Multilingual E5 (small) |
|---------|------|-------------------------|
| MIRACL:sw | 0.4243 | 0.6755 |
| MIRACL:yo | 0.6831 | 0.4187 |
| BEIR:trec-covid | 0.6823 | 0.7139 |
Retrieval effectiveness for BM25 and E5 small (NDCG@10)
For BEIR:trec-covid, we also evaluated a hybrid combination of E5
and BM25, using a linear combination of the two scores, which lifted
NDCG@10 to 0.7670. This aligns with previous findings, where hybrid
combinations
outperform
each model used independently.
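One common way to implement such a combination is to normalize each ranker’s scores and blend them with a weight; the min-max normalization and the 0.5 weight below are illustrative assumptions, not the exact recipe behind the 0.7670 figure (in Vespa, a hybrid score can also be expressed directly in a rank profile):

def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def hybrid(bm25_scores, dense_scores, alpha=0.5):
    # Normalize each score distribution to [0, 1], then take a weighted sum.
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    return {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
            for d in docs}

# Toy per-document scores from the two rankers.
bm25 = {"doc1": 12.3, "doc2": 8.1, "doc3": 2.4}
dense = {"doc1": 0.81, "doc2": 0.88, "doc3": 0.35}
print(sorted(hybrid(bm25, dense).items(), key=lambda kv: -kv[1]))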
Summary
As demonstrated in the evaluation, multilingual embedding models
can enhance and simplify building multilingual search applications
and provide a solid baseline. Still, as we can see from the evaluation
results, the simple and cheap Vespa BM25 ranking model outperformed
the dense embedding model on the MIRACL Yoruba queries.
This result can largely be explained by the fact that the model had
neither been pre-trained on this low-resource language nor tuned for retrieval
with Yoruba queries or documents. This is another reminder of what
we wrote about in a blog post about improving zero-shot
ranking,
where we summarize with a quote from the BEIR paper, which evaluates
multiple models in a zero-shot setting:
In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets.
In the next blog post, we will look at ways to make embedding
inference cheaper without sacrificing much retrieval effectiveness
by optimizing the embedding model. Furthermore, we will show how
to save 50% of embedding storage using Vespa’s support for bfloat16
precision instead of float, with close to zero impact on retrieval
effectiveness.
If you want to reproduce the retrieval results, or get started
with multilingual embedding search, check out
the new multilingual search sample application.