Simplify Search with Multilingual Embedding Models

Decorative image

Photo by Bruno Martins on Unsplash

This blog post presents and shows how to represent a robust
multilingual embedding model of the E5 family in Vespa. We also
demonstrate how to evaluate the model’s effectiveness on multilingual
information retrieval (IR) datasets.


The fundamental concept behind embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought close together and dissimilar ones are pushed
farther apart. Mapping multilingual texts into a unified vector
embedding space makes it possible to represent and compare queries
and documents from various languages within this shared space.

multilingual embedding model

Meet the E5 family.

Researchers from Microsoft introduced the E5 family of text embedding
models in the paper Text Embeddings by Weakly-Supervised Contrastive
Pre-training. E5 is short for
EmbEddings from bidirEctional Encoder rEpresentations. Using a
permissive MIT license, the same researchers have also published
the model weights on the Huggingface model hub. There are three
multilingual E5 embedding model variants with different model sizes
and embedding dimensionality. All three models are initialized from
pre-trained transformer models with trained text vocabularies that
handle up to 100 languages.

This model is initialized from
xlm-roberta-base and
continually trained on a mixture of multilingual datasets. It
supports 100 languages from xlm-roberta, but low-resource languages
may see performance degradation._

Similarly, the E5 embedding model family includes three variants
trained only on English datasets.

Choose your E5 Fighter

The embedding model variants allow developers to trade effectiveness
versus serving related costs. Embedding model size and embedding dimensionality
impact task accuracy, model inference, nearest
neighbor search, and storage cost.

These serving-related costs are all roughly linear with model size
and embedding dimensionality. In other words, using an embedding
model with 768 dimensions instead of 384 increases embedding storage
by 2x and nearest neighbor search compute with 2x. Accuracy, however,
is not nearly linear, as demonstrated on the MTEB

The nearest neighbor search for embedding-based retrieval could be
accelerated by introducing approximate algorithms like
significantly reduces distance calculations at query time but also
introduces degraded retrieval accuracy because the search is
approximate. Still, the same linear relationship between embedding
dimensionality and distance compute complexity holds.

ModelDimensionalityModel params (M)Accuracy
Average (56 datasets)
Accuracy Retrieval
(15 datasets)

Comparision of the E5 multilingual models. Accuracy numbers from MTEB

Do note that the datasets included in MTEB are biased towards English
datasets, which means that the reported retrieval performance might
not match up with observed accuracy on private datasets, especially
for low-resource languages.

Representing E5 embedding models in Vespa

Vespa’s vector search and embedding inference support allows
developers to build multilingual semantic search applications without
managing separate systems for embedding inference and vector search
over the multilingual embedding representations.

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small multilingual-e5-small-onnx

The above optimum-cli command exports the HF model to ONNX format that can be imported
and used with the Vespa Huggingface
Using the Optimum generated ONNX file and tokenizer configuration
file, we configure Vespa with the following in the Vespa application

<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/multilingual-e5-small.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>

That’s it! These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.

Using E5 with queries and documents in Vespa

The E5 family uses text instructions mixed with the input data to
separate queries and documents. Instead of having two different
models for queries and documents, the E5 family separates queries
and documents by prepending the input with “query:” or “passage:”.

schema doc {
  document doc  {
    field title type string { .. }
    field text type string { .. }
  field embedding type tensor<float>(x[384]) {
    indexing {
      "passage: " . input title . " " . input text | embed | attribute

The above Vespa schema language
uses the embed indexing
functionality to invoke the configured E5 embedding model, using a
concatenation of the “passage: “ instruction, the title, and
the text. Notice that the embedding tensor
field defines the embedding dimensionality (384).

The above schema uses a single vector
representation per document. With Vespa multi-vector
it’s also possible to represent and index multiple vector representations
for the same tensor field.

Similarly, on the query, we can embed the input query text with the
E5 model, now prepending the input user query with “query: “

  "yql": "select ..",
  "input.query(q)": "embed(query: the query to encode)", 


To demonstrate how to evaluate multilingual embedding models, we
evaluate the small E5 multilingual variant on three information
retrieval (IR) datasets. We use the classic trec-covid dataset, a
part of the BEIR benchmark,
that we have written about in blog
before. We also include two languages from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages
) datasets.

All three datasets use
NDCG@10 to
evaluate ranking effectiveness. NDCG is a ranking metric that is
precision-oriented and handles graded relevance judgments.

DatasetIncluded in E5
LanguageDocumentsQueriesRelevance Judgments
MIRACL:swYes (The train split was used)Swahili131,9244825092

IR dataset characteristics

We consider both BEIR:trec-covid and MIRACL:yo as out-of-domain datasets
as E5 has not been trained or fine tuned on them since they don’t
contain any training split. Applying E5 on out-of-domain datasets
is called zero-shot, as no training examples (shots) are available.

The Swahili dataset could be categorized as an in-domain dataset
as E5 has been trained on the train split of the dataset. All three
datasets have documents with titles and text
fields. We use the concatenation strategy described in previous sections, inputting both title
and text to the embedding model.

We evaluate the E5 model using exact nearest neighbor
without HNSW indexing,
and all experiments are run on an M1 Pro (arm64) laptop using the
open-source Vespa container
image. We contrast
the E5 model results with Vespa BM25.

DatasetBM25Multilingual E5 (small)

Retrieval effectiveness for BM25 and E5 small (NDCG@10)

For BEIR:trec-covid, we also evaluated a hybrid combination of E5
and BM25, using a linear combination of the two scores, which lifted
NDCG@10 to 0.7670. This aligns with previous findings, where hybrid
each model used independently.


As demonstrated in the evaluation, multilingual embedding models
can enhance and simplify building multilingual search applications
and provide a solid baseline. Still, as we can see from the evaluation
results, the simple and cheap Vespa BM25 ranking model outperformed
the dense embedding model on the MIRACL Yoruba queries.

This result can largely be explained by the fact that the model had not
been pre-trained on the language (low resource) or tuned for retrieval
with Yoruba queries or documents. This is another reminder of what
we wrote about in a blog post about improving zero-shot
where we summarize with a quote from the BEIR paper, which evaluates
multiple models in a zero-shot setting:

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets.

In the next blog post, we will look at ways to make embedding
inference cheaper without sacrificing much retrieval effectiveness
by optimizing the embedding model. Furthermore, we will show how
to save 50% of embedding storage using Vespa’s support for bfloat16
precision instead of float, with close to zero impact on retrieval

If you want to reproduce the retrieval results, or get started
with multilingual embedding search, check out
the new multilingual search sample application.