Vespa Newsletter, July 2023 | Vespa Blog

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

In the previous update,
we mentioned multi-vector HNSW Indexing, global-phase re-ranking, LangChain support, improved bfloat16 throughput,
and new document feed/export features in the Vespa CLI.
Today, we’re excited to share Vector Streaming Search, multiple new embedding features,
MIPS support, and performance optimizations:

When searching personal data or other data sets which are divided into many subsets you never search across,
maintaining global indexes is unnecessarily expensive.
Vespa streaming search is built for these use cases, and now supports vectors in searching and ranking.

This enables vector search in personal search use cases such as personal assistants
at typically less than 5% of the usual cost,
while delivering complete rather than approximate results,
something which is often crucial with personal data.
Read more in our announcement blog post.

Use Embedder Models from Huggingface

Vespa now comes with generic support for embedding models hosted on Huggingface.
With the new Huggingface Embedder functionality,
developers can export embedding models from Huggingface
and import them in ONNX format in Vespa for accelerated inference close to where the data is created.
The Huggingface Embedder supports multilingual embedding models as well as multi-vector representations –
read more.

GPU Acceleration of Embedding Models

GPU acceleration of embedding model inferences is now supported,
unlocking larger and more powerful embedding models while maintaining low serving latency.
With this, Vespa embedders can efficiently process large amounts of text data,
resulting in faster response times, improved scalability, and lower cost.

Embedding GPU acceleration is available both on Vespa Cloud and for Open Source Vespa use –
read more.

More models for Vespa Cloud users

As more teams use embeddings to improve search and recommendation use cases,
easy access to models is key for productivity. From the paper:

E5 is a family of state-of-the-art text embeddings that transfer well to a wide range of tasks.
The model is trained in a contrastive manner with weak supervision signals
from our curated large-scale text pair dataset (called CCPairs).
E5 can be readily used as a general-purpose embedding model for any tasks
requiring a single-vector representation of texts such as retrieval, clustering, and classification,
achieving strong performance in both zero-shot and fine-tuned settings.

Vespa Cloud users can find a set of E5 models on the
model hub.

Dotproduct distance metric for ANN

The Maximum Inner Product Search (MIPS) problem arises naturally in recommender systems,
where item recommendations and user preferences are modeled with vectors,
and the scoring is just the dot product (inner product) between the item vector and the query vector.

Vespa supports a range of distance metrics
for approximate nearest neighbor search.
Since 8.172, Vespa supports a dotproduct distance metric,
used for distance calculations and an extension to HNSW index structures.
Read more about how using an extra dimension to map points on a 3D hemisphere
makes the vector have the same magnitude and hence solvable as a nearest neighbor problem in the
blog post.

Optimizations and features

  • Query using emojis!
    The Unicode Characters of Category “Other Symbol” contains emojis, math symbols, etc.
    From Vespa 8.172 these are indexed as letter characters to support searching for them.
    E.g., you can now try vespa query ‘select * from music where song contains “🍉“‘.
  • Sorting on multivalue fields like array
    or weightedset is now supported:
    Ascending sort order uses the lowest value while descending sort order uses the highest value.
    E.g., descending order sort on an array field with [“apple”, “banana”, “melon”] will use “melon” as the sort value –
    see the reference documentation.
  • Since Vespa 8.185, you can balance feed vs query resource usage using feeding
    niceness – use this configuration to de-prioritize feeding.
  • Since Vespa 8.178, users can use conditional puts with auto-create –
    read more.
  • With lidspace max-bloat-factor
    you can fine tune this compaction job in the content node – since Vespa 8.171.
  • Vespa supports multivalue attributes,
    like arrays and maps.
    In Vespa 8.181 the static memory usage of multivalue attributes is reduced by up to 40%.
    This is useful for applications with many such fields, with little data each –
    see #26640 for details.

Blog posts since last newsletter

Thanks for reading! Try out Vespa on Vespa Cloud
or grab the latest release at and run it yourself! 😀

Simplify Search with Multilingual Embedding Models

Decorative image

Photo by Bruno Martins on Unsplash

This blog post presents and shows how to represent a robust
multilingual embedding model of the E5 family in Vespa. We also
demonstrate how to evaluate the model’s effectiveness on multilingual
information retrieval (IR) datasets.


The fundamental concept behind embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought close together and dissimilar ones are pushed
farther apart. Mapping multilingual texts into a unified vector
embedding space makes it possible to represent and compare queries
and documents from various languages within this shared space.

multilingual embedding model

Meet the E5 family.

Researchers from Microsoft introduced the E5 family of text embedding
models in the paper Text Embeddings by Weakly-Supervised Contrastive
Pre-training. E5 is short for
EmbEddings from bidirEctional Encoder rEpresentations. Using a
permissive MIT license, the same researchers have also published
the model weights on the Huggingface model hub. There are three
multilingual E5 embedding model variants with different model sizes
and embedding dimensionality. All three models are initialized from
pre-trained transformer models with trained text vocabularies that
handle up to 100 languages.

This model is initialized from
xlm-roberta-base and
continually trained on a mixture of multilingual datasets. It
supports 100 languages from xlm-roberta, but low-resource languages
may see performance degradation._

Similarly, the E5 embedding model family includes three variants
trained only on English datasets.

Choose your E5 Fighter

The embedding model variants allow developers to trade effectiveness
versus serving related costs. Embedding model size and embedding dimensionality
impact task accuracy, model inference, nearest
neighbor search, and storage cost.

These serving-related costs are all roughly linear with model size
and embedding dimensionality. In other words, using an embedding
model with 768 dimensions instead of 384 increases embedding storage
by 2x and nearest neighbor search compute with 2x. Accuracy, however,
is not nearly linear, as demonstrated on the MTEB

The nearest neighbor search for embedding-based retrieval could be
accelerated by introducing approximate algorithms like
significantly reduces distance calculations at query time but also
introduces degraded retrieval accuracy because the search is
approximate. Still, the same linear relationship between embedding
dimensionality and distance compute complexity holds.

ModelDimensionalityModel params (M)Accuracy
Average (56 datasets)
Accuracy Retrieval
(15 datasets)

Comparision of the E5 multilingual models. Accuracy numbers from MTEB

Do note that the datasets included in MTEB are biased towards English
datasets, which means that the reported retrieval performance might
not match up with observed accuracy on private datasets, especially
for low-resource languages.

Representing E5 embedding models in Vespa

Vespa’s vector search and embedding inference support allows
developers to build multilingual semantic search applications without
managing separate systems for embedding inference and vector search
over the multilingual embedding representations.

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small multilingual-e5-small-onnx

The above optimum-cli command exports the HF model to ONNX format that can be imported
and used with the Vespa Huggingface
Using the Optimum generated ONNX file and tokenizer configuration
file, we configure Vespa with the following in the Vespa application

<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/multilingual-e5-small.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>

That’s it! These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.

Using E5 with queries and documents in Vespa

The E5 family uses text instructions mixed with the input data to
separate queries and documents. Instead of having two different
models for queries and documents, the E5 family separates queries
and documents by prepending the input with “query:” or “passage:”.

schema doc {
  document doc  {
    field title type string { .. }
    field text type string { .. }
  field embedding type tensor<float>(x[384]) {
    indexing {
      "passage: " . input title . " " . input text | embed | attribute

The above Vespa schema language
uses the embed indexing
functionality to invoke the configured E5 embedding model, using a
concatenation of the “passage: “ instruction, the title, and
the text. Notice that the embedding tensor
field defines the embedding dimensionality (384).

The above schema uses a single vector
representation per document. With Vespa multi-vector
it’s also possible to represent and index multiple vector representations
for the same tensor field.

Similarly, on the query, we can embed the input query text with the
E5 model, now prepending the input user query with “query: “

  "yql": "select ..",
  "input.query(q)": "embed(query: the query to encode)", 


To demonstrate how to evaluate multilingual embedding models, we
evaluate the small E5 multilingual variant on three information
retrieval (IR) datasets. We use the classic trec-covid dataset, a
part of the BEIR benchmark,
that we have written about in blog
before. We also include two languages from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages
) datasets.

All three datasets use
NDCG@10 to
evaluate ranking effectiveness. NDCG is a ranking metric that is
precision-oriented and handles graded relevance judgments.

DatasetIncluded in E5
LanguageDocumentsQueriesRelevance Judgments
MIRACL:swYes (The train split was used)Swahili131,9244825092

IR dataset characteristics

We consider both BEIR:trec-covid and MIRACL:yo as out-of-domain datasets
as E5 has not been trained or fine tuned on them since they don’t
contain any training split. Applying E5 on out-of-domain datasets
is called zero-shot, as no training examples (shots) are available.

The Swahili dataset could be categorized as an in-domain dataset
as E5 has been trained on the train split of the dataset. All three
datasets have documents with titles and text
fields. We use the concatenation strategy described in previous sections, inputting both title
and text to the embedding model.

We evaluate the E5 model using exact nearest neighbor
without HNSW indexing,
and all experiments are run on an M1 Pro (arm64) laptop using the
open-source Vespa container
image. We contrast
the E5 model results with Vespa BM25.

DatasetBM25Multilingual E5 (small)

Retrieval effectiveness for BM25 and E5 small (NDCG@10)

For BEIR:trec-covid, we also evaluated a hybrid combination of E5
and BM25, using a linear combination of the two scores, which lifted
NDCG@10 to 0.7670. This aligns with previous findings, where hybrid
each model used independently.


As demonstrated in the evaluation, multilingual embedding models
can enhance and simplify building multilingual search applications
and provide a solid baseline. Still, as we can see from the evaluation
results, the simple and cheap Vespa BM25 ranking model outperformed
the dense embedding model on the MIRACL Yoruba queries.

This result can largely be explained by the fact that the model had not
been pre-trained on the language (low resource) or tuned for retrieval
with Yoruba queries or documents. This is another reminder of what
we wrote about in a blog post about improving zero-shot
where we summarize with a quote from the BEIR paper, which evaluates
multiple models in a zero-shot setting:

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets.

In the next blog post, we will look at ways to make embedding
inference cheaper without sacrificing much retrieval effectiveness
by optimizing the embedding model. Furthermore, we will show how
to save 50% of embedding storage using Vespa’s support for bfloat16
precision instead of float, with close to zero impact on retrieval

If you want to reproduce the retrieval results, or get started
with multilingual embedding search, check out
the new multilingual search sample application.

Accelerating Transformer-based Embedding Retrieval with Vespa

Decorative image

Photo by Appic on Unsplash

In this post, we’ll see how to accelerate embedding inference and retrieval with little impact on quality.
We’ll take a holistic approach and deep-dive into both aspects of an embedding retrieval system: Embedding inference and retrieval with nearest neighbor search.
All experiments are performed on a laptop with the open-source Vespa container image.


The fundamental concept behind text embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought closer together, and dissimilar ones are pushed farther
apart. Mapping multilingual texts into a unified vector embedding
space makes it possible to represent and compare queries and documents
from various languages within this shared space. By using contrastive
representation learning with retrieval data examples, we can make
embedding representations useful for retrieval with nearest neighbor


A search system using embedding retrieval consists of two primary

  • Embedding inference, using an embedding model to map text to a
    point in a vector space of D dimensions.
  • Retrieval in the D dimensional vector space using nearest neighbor search.

This blog post covers both aspects of an embedding retrieval system
and how to accelerate them, while also paying attention to the task
accuracy because what’s the point of having blazing fast but highly
inaccurate results?

Transformer Model Inferencing

The most popular text embedding models are typically based on
encoder-only Transformer models (such as BERT). We need a
high-level understanding of the complexity of encoder-only transformer
language models (without going deep into neural network architectures).

Inference complexity from the transformer architecture attention
mechanism scales quadratically with input sequence length.

BERT embedder

Illustration of obtaining a single vector representation of the
text ‘a new day’ through BERT.

The BERT model has a typical input
length limitation of 512 tokens, so the tokenization process truncates
the input to avoid exceeding the architecture’s maximum length.
Embedding models might also truncate the text at a lower limit than
the theoretical limit of the neural network to improve quality and
reduce training costs, as computational complexity is quadratic
with input sequence length for both training and inference. The
last pooling operation compresses the token vectors into a single
vector representation. A common pooling technique is averaging the
token vectors.

It’s worth noting that some models may not perform pooling and
instead represent the text with multiple
but that aspect is beyond the scope of this blog post.

Inference cost versus sequence length

Illustration of BERT inferenec cost versus sequence input length (sequence^2).

We use ‘Inference cost’ to refer to the computational resources
required for a single inference pass with a given input sequence
length. The graph depicts the relationship between the sequence
length and the squared compute complexity, demonstrating its quadratic
nature. Latency and throughput can be adjusted using different
techniques for parallelizing computations. See model serving at
scale for a
discussion on these techniques in Vespa.

Why does all of this matter? For retrieval systems, text queries
are usually much shorter than text documents, so invoking embedding
models for documents costs more than encoding shorter questions.

Sequence lengths and quadratic scaling are some of the reasons why
using frozen document-size
are practical at scale, as it avoids re-embedding documents when
the model weights are updated due to re-training the model. Similarly,
query embeddings can be cached for previously seen queries as long
as the model weights are unchanged. The asymmetric length properties
can also help us design a retrieval system architecture for scale.

  • Asymmetric model size: Use different-sized models for encoding
    queries and documents (with the same output embedding dimensionality).
    See this paper for an example.
  • Asymmetric batch size: Use batch on-demand computing for embedding
    documents, using auto-scaling features, for example, with Vespa
  • Asymmetric compute architecture: Use GPU acceleration for document inference but CPU
    for query inference.

The final point is that reporting embedding inference latency or
throughput without mentioning input sequence length provides little

Choose your Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.

Accuracy, however, is not nearly linear, as demonstrated on the
MTEB leaderboard.

ModelDimensionalityModel params (M)Accuracy
Average (56 datasets)
Accuracy Retrieval
(15 datasets)

A comparison of the E5 multilingual models — accuracy numbers from the MTEB

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the Transformer
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small model-dir

The above exports the model without any optimizations. The optimum
client also allows specifying optimization
here using the highest optimization level usable for serving on the

The above commands export the model to ONNX format that can be
imported and used with the Vespa Huggingface
Using the Optimum generated ONNX and tokenizer configuration files,
we configure Vespa with the following in the Vespa application

<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>

These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.
We can also quantize the optimized ONNX model, for example, using
the optimum
or onnxruntime quantization like
Quantization (post-training) converts the float32 model weights (4
bytes per weight) to byte (int8), enabling faster inference on the

Performance Experiments

To demonstrate the many tradeoffs, we assess the mentioned small
E5 multilanguage model on the Swahili(SW) split from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages
) dataset.

document tokens
QueriesAvg query tokensRelevance
MIRACL swSwahili131,92463482135092

Dataset characteristics; tokens are the number of language model
token identifiers. Since Swahili is a low-resource language, the
LM tokenization uses more tokens to represent similar byte-length
texts than for more popular languages such as English.

We experiment with post-training quantization of the model (not the
output vectors) to document the impact quantization has on retrieval
effectiveness (NDCG@10). We use this
to quantize the model (We don’t use optimum for this due to this
issue – fixed
in v 1.11).

We then study the serving efficiency gains (latency/throughput) on
the same laptop-sized hardware using a quantized model versus a
full precision model

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

  • We use the multilingual-search Vespa sample
    as the starting point for these experiments. This sample app was
    introduced in Simplify search with multilingual embedding
  • We use the
    NDCG@10 metric
    to evaluate ranking effectiveness. When performing model optimizations,
    it’s important to pay attention to the impact on the task. This is
    stating the obvious, but still, many talk about accelerations and
    optimizations without mentioning task accuracy degradations
  • We measure the throughput of indexing text documents in Vespa. This
    includes embedding inference in Vespa using the Vespa Huggingface
    storing the embedding vector in Vespa, and regular inverted indexing
    of the title and text field. We use the
    vespa-cli feed option
    as the feeding client.
  • We use the Vespa fbench
    to drive HTTP query load using HTTP POST against the Vespa query
  • Batch size in Vespa embedders is one for document and query inference.
  • There is no caching of query embedding inference, so repeating the same query
    text while benchmarking will trigger a new embedding inference.

Sample Vespa JSON formatted feed document (prettified) from the
MIRACL dataset.

    "put": "id:miracl-sw:doc::2-0",
    "fields": {
        "title": "Akiolojia",
        "text": "Akiolojia (kutoka Kiyunani \u03b1\u03c1\u03c7\u03b1\u03af\u03bf\u03c2 = \"zamani\" na \u03bb\u03cc\u03b3\u03bf\u03c2 = \"neno, usemi\") ni somo linalohusu mabaki ya tamaduni za watu wa nyakati zilizopita. Wanaakiolojia wanatafuta vitu vilivyobaki, kwa mfano kwa kuchimba ardhi na kutafuta mabaki ya majengo, makaburi, silaha, vifaa, vyombo na mifupa ya watu.",
        "doc_id": "2#0",
        "language": "sw"
ModelModel size (MB)NDCG@10Docs/secondQueries/second
Int8 (Quantized)1120.661269640

Comparison of embedding inference in Vespa using a full precision
model with float32 weights against a quantized model using int8
weights. This is primarily benchmarking embedding inference. See
the next section for a deep dive into the experimental setup.

There is a small drop in retrieval accuracy from an NDCG@10 score
of 0.675 to 0.661 (2%), but a huge gain in embedding inference
efficiency. Indexing throughput increases by 2x, and query throughput
increases close to 2x. The throughput measurements are end-to-end,
either using vespa-cli feed or vespa-fbench. The difference in query
versus document sequence length largely explains the query and
document throughput difference (the quadratic scaling properties).

Query embed latency and throughput

Throughput is one way to look at it, but what about query serving
latency? We analyze query latency of the quantized model by gradually
increasing the load until the CPU is close to 100% utilization using
input format for POST requests.

{"yql": "select doc_id from doc where rank(doc_id contains \"71#13\",{targetHits:1}nearestNeighbor(embedding,q))", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits": 0}

The above query template tests Vespa end-to-end but does NOT perform
a global nearest neighbor search as the query uses the rank
to retrieve by doc_id, and the second operand computes the
nearestNeighbor. This means that the nearest neighbor “search” is
limited to a single document in the index. This experimental setup
allows us to test everything end to end except the cost of exhaustive
search through all documents.

This part of the experiment focuses on the embedding model inference
and not nearest neighbor search performance. We use all the queries
in the dev set (482 unique queries). Using vespa-fbench, we simulate
load by increasing the number of concurrent clients executing queries
with sleep time 0 (-c 0) while observing the end-to-end latency and

$ vespa-fbench -P -q queries.txt -s 20 -n $clients -c 0 localhost 8080
Clients Average
95p latencyQueries/s

Vespa query embedder performance.

As concurrency increases, the latency increases slightly, but not
much, until saturation, where latency will climb rapidly with a
hockey-stick shape due to queuing for exhausted resources.

In this case, latency is the complete end-to-end HTTP latency,
including HTTP overhead, embedding inference, and dispatching the
embedding vector to the Vespa content node process. Again, it does
not include nearest neighbor search, as the query limits the retrieval
to a single document.

In the previous section, we focused on the embedding inference
throughput and latency. In this section, we change the Vespa query
specification to perform an exact nearest neighbor search over all
documents. This setup measures the end-to-end deployment, including
HTTP overhead, embedding inference, and embedding retrieval using
Vespa exact nearest neighbor
With exact search, no retrieval error is introduced by using
approximate search algorithms.

{"yql": "select doc_id from doc where {targetHits:10}nearestNeighbor(embedding,q)", "input.query(q)": "embed(query:Bandari kubwa nchini Kenya iko wapi?)", "ranking": "semantic", "hits": 

Representing BGE embedding models in Vespa using bfloat16

Decorative image

Photo by Rafael Drück on Unsplash

This post demonstrates how to use recently announced BGE (BAAI General Embedding)
models in Vespa. The open-sourced (MIT licensed) BGE models
from the Beijing Academy of Artificial Intelligence (BAAI) perform
strongly on the Massive Text Embedding Benchmark (MTEB
leaderboard). We
evaluate the effectiveness of two BGE variants on the
BEIR trec-covid dataset.
Finally, we demonstrate how Vespa’s support for storing and indexing
vectors using bfloat16 precision saves 50% of memory and storage
fooprint with close to zero loss in retrieval quality.

Choose your BGE Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.

Quality, however, is not nearly linear, as demonstrated on the MTEB

ModelDimensionalityModel params (M)Accuracy
Average (56 datasets)
Accuracy Retrieval
(15 datasets)

A comparison of the English BGE embedding models — accuracy numbers MTEB
leaderboard. All
three BGE models outperforms OpenAI ada embeddings with 1536
dimensions and unknown model parameters on MTEB

In the following sections, we experiment with the small and base
BGE variant, which gives us reasonable accuracy for a much lower
cost than the large variant. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure

Exporting BGE to ONNX format for accelerated model inference

To use the embedding model from the Huggingface model hub in Vespa
we need to export it to ONNX format. We can use
the Transformers Optimum
library for this:

$ optimum-cli export onnx --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en

This exports the small model with the highest optimization
usable for serving on CPU. We also quantize the optimized ONNX model
using onnxruntime quantization like
Quantization (post-training) converts the float model weights (4
bytes per weight) to byte (int8), enabling faster inference on the
CPU. As demonstrated in this blog
quantization accelerates embedding model inference by 2x on CPU with negligible
impact on retrieval quality.

Using BGE in Vespa

Using the Optimum generated ONNX model and
tokenizer files, we configure the Vespa Huggingface
with the following in the Vespa application

<component id="bge" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>

BGE uses the CLS special token as the text representation vector
(instead of average pooling). We also specify normalization so that
we can use the prenormalized-angular distance
for nearest neighbor search. See configuration
for details.

With this, we are ready to use the BGE model to embed queries and
documents with Vespa.

Using BGE in Vespa schema

The BGE model family does not use instructions for documents like
the E5
so we don’t need to prepend the input to the document model with
“passage: “ like with the E5 models. Since we configure the Vespa
embedder to
normalize the vectors, we use the optimized prenormalized-angular
distance-metric for the nearest neighbor search

field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular

Note that the above does not enable HNSW
indexing, see
post on the tradeoffs related to introducing approximative nearest
neighbor search. The small model embedding is configured with 384
dimensions, while the base model uses 768 dimensions.

field embedding type tensor<float>(x[768]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular

Using BGE in queries

The BGE model uses query instructions like the E5
that are prepended to the input query text. We prepend the instruction
text to the user query as demonstrated in the snippet below:

query = 'is remdesivir an effective treatment for COVID-19'
body = {
        'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
        'input.query(q)': 'embed(Represent this sentence for searching relevant passages: ' + query +  ')', 
        'ranking': 'semantic',
        'hits' : '10' 
response ='http://localhost:8080/search/', json=body)

The BGE query instruction is Represent this sentence for searching
relevant passages:
. We are unsure why they choose a longer query instruction as
it does hurt efficiency as compute complexity is
with sequence length.


We evaluate the small and base model on the trec-covid test split
from the BEIR benchmark. We
concat the title and the abstract as input to the BEG embedding
models as demonstrated in the Vespa schema snippets in the previous

DatasetDocumentsAvg document tokensQueriesAvg query
Relevance Judgments
BEIR trec_covid171,332245501866,336

Dataset characteristics; tokens are the number of language model
token identifiers (wordpieces)

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

Sample Vespa JSON
feed document (prettified) from the
BEIR trec-covid dataset:

  "put": "id:miracl-trec:doc::wnnsmx60",
  "fields": {
    "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
    "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
    "doc_id": "wnnsmx60",
    "language": "en"

Evalution results

ModelModel size (MB)NDCG@10 BGENDCG@10

Evaluation results for quantized BGE models.

We contrast both BGE models with the unsupervised
BM25 baseline from
this blog
Both models perform better than the BM25 baseline
on this dataset. We also note that our NDCG@10 numbers represented
in Vespa is slightly better than reported on the MTEB leaderboard
for the same dataset. We can also observe that the base model
performs better on this dataset, but is also 2x more costly due to
size of embedding model and the embedding dimensionality. The
bge-base model inference could benefit from GPU
(without quantization).

Using bfloat16 precision

We evaluate using
instead of float for the tensor representation in Vespa. Using
bfloat16 instead of float reduces memory and storage requirements
by 2x since bfloat16 uses 2 bytes per embedding dimension instead
of 4 bytes for float. See Vespa tensor values

We do not change the type of the query tensor. Vespa will take care
of casting the bfloat16 field representation to float at search
time, allowing CPU acceleration of floating point operations. The
cast operation does come with a small cost (20-30%) compared with
using float, but the saving in memory and storage resource footprint
is well worth it for most use cases.

field embedding type tensor<bfloat16>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular

Using bfloat16 instead of float for the embedding tensor.

ModelNDCG@10 bfloat16NDCG@10 float

Evaluation results for BGE models – float versus bfloat16 document representation.

By using bfloat16 instead of float to store the vectors, we save
50% of memory cost and we can store 2x more embeddings per instance
type with almost zero impact on retrieval quality:


Using the open-source Vespa container image, we’ve explored the
recently announced strong BGE text embedding models with embedding
inference and retrieval on our laptops. The local experimentation
eliminates prolonged feedback loops.

Moreover, the same Vespa configuration files suffice for many
deployment scenarios, whether in on-premise setups, on Vespa Cloud,
or locally on a laptop. The beauty lies in that specific
infrastructure for managing embedding inference and nearest neighbor
search as separate infra systems become obsolete with Vespa’s
native embedding

If you are interested to learn more about Vespa; See Vespa Cloud – getting started,
or self-serve Vespa – getting started.
Got questions? Join the Vespa community in Vespa Slack.

Summer Internship at Vespa | Vespa Blog

This summer, two young men have revolutionized the field of information retrieval! Or at least they tried… Read on for the tale of this year’s summer interns, and see the fruits of our labor in the embedder auto-training sample app.

Automatic Embedder Training with an LLM

Our main project this summer has been developing a system for automatically improving relevance for semantic search. Semantic search utilizes machine-learned text embedders trained on large amounts of annotated data to improve search relevance.

Embedders can be fine-tuned on a specific dataset to improve relevance further for the dataset in question. This requires annotated training data, which traditionally has been created by humans. However, this process is laborious and time-consuming – can it be automated?

Enter large language models! LLMs like ChatGPT have been trained on an enormous amount of data from a multitude of sources, and appear to understand a great deal about the world. Our hypothesis was that it would be possible to use an LLM to generate training data for an embedder.

Query generation

Diagram depicting the query generation pipeline

Training data for text embedders used for information retrieval consists of two parts: queries and query relevance judgments (qrels). Qrels indicate which documents are relevant for which queries, and are used for training and to rate retrieval performance during evaluation. Our LLM of choice, ChatGPT (3.5-turbo-4k), works by providing it with a system prompt and a list of messages containing instructions and data. We used the system prompt to inform ChatGPT of its purpose and provide it with rules informing how queries should be generated.

Generating queries requires a system prompt, example document-query pairs, and a document to generate queries for. Our system generates the system prompt, and optionally generates additional qrels, resulting in the three-step process illustrated by the diagram above.

In the beginning, we handcrafted system prompts while trying to get ChatGPT to generate queries similar to existing training data. After some trial and error, we found that we got better results if we specified rules describing what queries should look like. Later, we devised a way for ChatGPT to generate these rules itself, in an effort to automate the process.

Using the system prompt alone did not appear to yield great results, though. ChatGPT would often ignore the prompt and summarize the input documents instead of creating queries for them. To solve this, we used a technique called few-shot prompting. It works by essentially faking a conversation between the user and ChatGPT, showing the LLM how it’s supposed to answer. Using the aforementioned message list, we simply passed the LLM a couple of examples before showing it the document to generate queries for. This increased the quality of the output drastically at the cost of using more tokens.

After generating queries, we optionally generate additional qrels. This can be necessary for training if the generated queries are relevant for multiple documents in the dataset, because the training script assumes that all matched documents not in the qrels aren’t relevant. Generating qrels works by first querying Vespa with a query generated by ChatGPT, then showing the returned documents and the generated query to ChatGPT and asking it to judge whether or not each document is relevant.

Training and evaluation

We utilized SentenceTransformers for training, and we initialized from the E5 model. We started off by using scripts provided by SimLM, which got us up and running quickly, but eventually wanted more control of our training loop.

The training script requires a list of positive (matching) documents and a list of negative (non-matching) documents for each query. The list of positive documents is given by the generated qrels. We assemble a list of negative documents for each query by querying Vespa and marking each returned document not in the qrels as a negative.

After training we evaluated the model with trec_eval and the nDCG@10 metric. The resulting score was compared to previous trainings, and to a baseline evaluation of the model.

We encapsulated the entire training and evaluation procedure into a single Bash script that let us provide the generated queries and qrels as input, and get the evaluation of the trained model as output.


The results we got were varied. We had the most successful training on the NFCorpus dataset, where we consistently got an evaluation higher than the baseline. Interestingly we initially got the highest evaluation when training on just 50 queries! We eventually figured out that this was caused by using the small version of the E5 model – using the base version of the model gave us the highest evaluation when training on 400 queries.

Training on other datasets was unfortunately unsuccessful. We tried training on both the FiQA and the NQ dataset, tweaking various parameters, but weren’t able to get an evaluation higher than their baselines.

Limitations and future work

The results we got for NFCorpus are a promising start, and previous research also shows this method to have promise. The next step is to figure out how to apply our system to datasets other than NFCorpus. There’s a wide variety of different options to try:

  • Tweaking various training parameters, e.g. number of epochs and learning rate
  • Different training methods, e.g. knowledge distillation
  • Determining query relevance with a fine-tuned cross-encoder instead of with ChatGPT-generated qrels
  • More data, both in terms of more documents and generating more queries
  • Using a different model than E5

We currently make some assumptions about the datasets we train on that don’t always hold. Firstly, we do few-shot prompting when generating queries by fetching examples from existing training data, but this system is perhaps most useful for datasets without that data. Secondly, we use the ir_datasets package to prepare and manage datasets, but ideally we’d want to fetch documents from e.g. Vespa itself.

Most of our training was done on the relatively small NFCorpus dataset because of the need to refeed all documents, after each training, to generate new embeddings. This becomes a big bottleneck on large datasets. Implementing frozen embeddings, which allows reusing document embeddings between trainings, would solve this problem.

Side quests

The easiest way to learn Vespa is to use it. Before starting on the main project, we spent some time trying out the various interactive tutorials. We also worked on various side projects which were related to the main project in some way.

Embedding service

We created a sample app to create embeddings from arbitrary text, using the various models in the Vespa model hub. This was a great way to learn about Vespa’s stateless Java components and how Vespa works in general.


Pyvespa is a Python API that enables fast prototyping of Vespa applications. Pyvespa is very useful when working in Python, like we did for our machine learning experiments, but it does not support all of Vespa’s features. In addition, there were some issues with how Pyvespa handled certificates that prevented us from using Pyvespa in combination with an app deployed from the Vespa CLI.

We were encouraged to implement fixes for these problems ourselves. Our main changes were to enable Pyvespa to use existing certificates generated with the Vespa CLI, as well as adding a function to deploy an application from disk to Vespa Cloud via Pyvespa, allowing us to use all the features of Vespa from Python (this feature already existed for deploying to Docker, but not for deploying to Vespa Cloud). This was very satisfying, as well as a great learning experience.

Our experience at Vespa

We’ve learned a lot during our summer at Vespa, especially about information retrieval and working with LLMs. We’ve also learned a lot about programming and gotten great insight into the workings of a professional software company.

Contributing to an open-source project, especially such a large one as Vespa, has been very exciting. Vespa is powerful, which is awesome, but as new users, there was quite a lot to take in. The project is well documented, however, and includes a great number of sample apps and example use cases, meaning we were usually able to find out how to solve problems on our own. Whenever we got really stuck, there was always someone to ask and talk to. A big shout out to all of our colleagues, and a special thanks to Kristian Aune and Lester Solbakken for their support and daily follow-up during our internship.

Working at Vespa has been a great experience, and we’ve really enjoyed our time here.

Vespa Newsletter, August 2023 | Vespa Blog

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

In the previous update,
we mentioned Vector Streaming Search, Embedder Models from Huggingface,
GPU Acceleration of Embedding Models, Model Hub and Dotproduct distance metric for ANN.
Today, we’re excited to share the following updates:

Multilingual sample app

In the previous newsletter, we announced Vespa E5 model support.
Now we’ve added a multilingual-search sample application.
Using Vespa’s powerful indexing language
and integrated embedding support, you can embed and index:

field embedding type tensor<float>(x[384]) {
    indexing {
        "passage: " . input title . " " . input text | embed | attribute

Likewise, for queries:

    "yql": "select ..",
    "input.query(q)": "embed(query: the query to encode)",

With this, you can easily use multilingual E5 for great relevance,
see the simplify search with multilingual embeddings
blog post for results.
Remember to try the sample app,
using trec_eval to compute NDCG@10.

ANN targetHits

Vespa uses targetHits
in approximate nearest neighbor queries.
When searching the HNSW index in a post-filtering case,
this is auto-adjusted in an effort to still expose targetHits hits to first-phase ranking after post-filtering
(by exploring more nodes).
This increases query latency as more candidates are evaluated.
Since Vespa 8.215, the following formula is used to ensure an upper bound of adjustedTargetHits:

adjustedTargetHits = min(targetHits / estimatedHitRatio,
                         targetHits * targetHitsMaxAdjustmentFactor)

You can use this to choose to return fewer hits over taking longer to search the index.
The target-hits-max-adjustment-factor
can be set in a rank profile and overridden
per query.
The value is in the range [1.0, inf], default 20.0.

Tensor short query format in inputs

In Vespa 8.217, a short format for mapped tensors can be used in input values.
Together with the short indexed tensor format, query tensors can be like:

"input": {
    "query(my_indexed_tensor)": [1, 2, 3, 4],
    "query(my_mapped_tensor)": {
        "Tablet Keyboard Cases": 0.8,


During the last month, we’ve released PyVespa
0.36 and

  • Requires minimum Python 3.8.
  • Support setting default stemming of Schema: #510.
  • Add support for first phase ranking:
  • Support using key/cert pair generated by Vespa CLI:
    and add deploy_from_disk for Vespa Cloud: #514 –
    this makes it easier to interoperate with Vespa Cloud and local experiments.
  • Specify match-features in RankProfile:
  • Add utility to create a vespa feed file for easier feeding using Vespa CLI:
  • Add support for synthetic fields: #547
    and support for Component config:
    With this, one can run the multivector sample application –
    try it using the multi-vector-indexing notebook.

Vespa CLI functions

The Vespa command-line client has been made smarter,
it will now check local deployments (e.g. on your laptop) and wait for the container cluster(s) to be up:

$ vespa deploy
Waiting up to 1m0s for deploy API to become ready...
Uploading application package... done

Success: Deployed . with session ID 2
Waiting up to 1m0s for deployment to converge...
Waiting up to 1m0s for cluster discovery...
Waiting up to 1m0s for container default...

The new function vespa destroy
is built for quick dev cycles on Vespa Cloud.
When developing, easily reset the state in your Vespa Cloud application by calling vespa destroy.
This is also great for automation, e.g., in a GitHub Action.
Local deployments should reset with fresh Docker/Podman containers.

Optimizations and features

  • Vespa indexing language now supports to_epoch_second
    for converting iso-8601 date strings to epoch time.
    Available since Vespa 8.215.
    Use this to easily convert from strings to a number when indexing –
    see example.
  • Since Vespa 8.218, Vespa uses onnxruntime 1.15.1.
  • Since Vespa 8.218, one can use create to create non-existing cells before a
    modify-update operation is applied to a tensor.
  • Vespa allows referring to models by URL in the application package.
    Such files can be large, and are downloaded per deploy-operation.
    Since 8.217, Vespa will use a previously downloaded model file if it exists on the requesting node.
    New versions of the model must use a different URL.
  • Some Vespa topologies use groups of nodes to optimize query performance –
    each group has a replica of a document.
    High-query Vespa applications might have tens or even hundreds of groups.
    Upgrading such clusters in Vespa Cloud takes time, having only one replica (= group) out at any time.
    With groups-allowed-down-ratio,
    one can set a percentage of groups instead,
    say 25%, for only 4 cycles to upgrade a full content cluster.

Blog posts since last newsletter

Thanks for reading! Try out Vespa on Vespa Cloud
or grab the latest release at and run it yourself! 😀

Announcing | Vespa Blog

Today, we announce the general availability of –
a new search experience for all (almost) Vespa-related content –
powered by Vespa, LangChain, and OpenAI’s chatGPT model.
This post overviews our motivation for building it, its features, limitations, and how we made it:

Decorative image

Over the last year, we have seen a dramatic increase in interest in Vespa
(From 2M pulls to 11M vespaengine/vespa pulls within just a few months),
resulting in many questions on our Slack channel,
like “Can Vespa use GPU?” or
“Can you expire documents from Vespa?”.

Our existing search interface could only present a ranked list of documents for questions like that,
showing a snippet of a matching article on the search result page (SERP).
The user then had to click through to the article and scan for the fragment snippets relevant to the question.
This experience is unwieldy if looking for the reference documentation of a specific Vespa configuration setting
like num-threads-per-search buried in
large reference documentation pages.

We wanted to improve the search experience by displaying a better-formatted response,
avoiding clicking through, and linking directly to the content fragment.
In addition, we wanted to learn more about using a generative large language model to answer questions,
using the top-k retrieved fragments in a so-called retrieval augmented generation (RAG) pipeline.

This post goes through how we built – highlights:

  • Creating a search for chunks of information –
    the bit of info the user is looking for.
    The chunks are called paragraphs or fragments in this article
  • Rendering fragments in the result page, using the original layout, including formatting and links.
  • Using multiple ranking strategies to match user queries to fragments:
    Exact matching, text matching, semantic matching,
    and multivector semantic query-to-query matching.
  • Search suggestions and hot links.

The Vespa application powering is running in Vespa Cloud.
All the functional components of are Open Source and are found in repositories like
and vespa-documentation-search –
it is a great starting point for other applications using features highlighted above!

Getting the Vespa content indexed

The Vespa-related content is spread across multiple git repositories using different markup languages like HTML,
Markdown, sample apps, and Jupyter Notebooks.
Jekyll generators make this easy;
see vespa_index_generator.rb for an example.

First, we needed to convert all sources into a standard format
so that the search result page could display a richer formatted experience
instead of a text blob of dynamic summary snippets with highlighted keywords.

Since we wanted to show full, feature-rich snippets, we first converted all the different source formats to Markdown.
Then, we use the markdown structure to split longer documents into smaller retrieval units or fragments
where each retrieval unit is directly linkable, using URL anchoring (#).
This process was the least exciting thing about the project, with many iterations,
for example, splitting larger reference tables into smaller retrievable units.
We also adapted reference documentation to make the fragments linkable – see hotlinks.
The retrievable units are indexed in a
paragraph schema:

schema paragraph {
    document paragraph {
        field path type string {}
        field doc_id type string {}
        field title type string {}
        field content type string {}
        field questions type array<string> {}        
        field content_tokens type int {}
        field namespace type string {}
    field embedding type tensor<float>(x[384]) {
        indexing: "passage: " . (input title || "") . " " . (input content || "") | embed ..
    field question_embedding type tensor<float>(q{}, x[384]) {
        indexing {
            input questions |
            for_each { "query: " . _ } | embed | ..

There are a handful of fields in the input (paragraph document type) and two synthetic fields that are produced by Vespa,
using Vespa’s embedding functionality.
We are mapping different input string fields to two different
Vespa tensor representations.
The content and title fields are concatenated and embedded
to obtain a vector representation of 384 dimensions (using e5-v2-small).
The question_embedding is a multi-vector tensor;
in this case, the embedder embeds each input question.
The output is a multi-vector representation (A mapped-dense tensor).
Since the document volume is low, an exact vector search is all we need,
and we do not enable HNSW indexing of these two embedding fields.

LLM-generated synthetic questions

The questions per fragment are generated by an LLM (chatGPT).
We do this by asking it to generate questions the fragment could answer.
The LLM-powered synthetic question generation is similar to the approach described in
However, we don’t select negatives (irrelevant content for the question) to train a
cross-encoder ranking model.
Instead, we expand the content with the synthetic question for matching and ranking:

    "put": "id:open-p:paragraph::open/en/access-logging.html-",
    "fields": {
        "title": "Access Logging",
        "path": "/en/access-logging.html#",
        "doc_id": "/en/access-logging.html",
        "namespace": "open-p",
        "content": "The Vespa access log format allows the logs to be processed by a number of available tools\n handling JSON based (log) files.\n With the ability to add custom key/value pairs to the log from any Searcher,\n you can easily track the decisions done by container components for given requests.",
        "content_tokens": 58,
        "base_uri": "",
        "questions": [
            "What is the Vespa access log format?",
            "How can custom key/value pairs be added?",
            "What can be tracked using custom key/value pairs?"

Example of the Vespa feed format of a fragment from this
reference documentation and three LLM-generated questions.
The embedding representations are produced inside Vespa and not feed with the input paragraphs.

Matching and Ranking

To retrieve relevant fragments for a query, we use a hybrid combination of exact matching, text matching,
and semantic matching (embedding retrieval).
We build the query tree in a custom Vespa Searcher plugin.
The plugin converts the user query text into an executable retrieval query.
The query request searches both in the keyword and embedding fields using logical disjunction.
The YQL equivalent:

where (weakAnd(...) or ({targetHits:10}nearestNeighbor(embedding,q) or ({targetHits:10}nearestNeighbor(question_embedding,q))) and namespace contains "open-p"

Example of using hybrid retrieval, also using
multiple nearestNeighbor operators
in the same Vespa query request.

The scoring logic is expressed in Vespa’s ranking framework.
The hybrid retrieval query generates multiple Vespa rank features that can be used to score and rank the fragments.

From the rank profile:

rank-profile hybrid inherits semantic {
    inputs {
        query(q) tensor<float>(x[384])
        query(sw) double: 0.6 #semantic weight
        query(ew) double: 0.2 #keyword weight

    function semantic() {
        expression: cos(distance(field, embedding))
    function semantic_question() {
        expression: max(cos(distance(field, question_embedding)), 0)
    function keywords() {
        expression: (  nativeRank(title) +
                       nativeRank(content) +
                       0.5*nativeRank(path) +
                       query(ew)*elementCompleteness(questions).completeness  ) / 4 +
    first-phase {
        expression: query(sw)*(semantic_question + semantic) + (1 - query(sw))*keywords

The keyword matching using weakAnd,
we match the user query against the following fields:

  • The title – including the parent document title and the fragment heading
  • The content – including markup
  • The path
  • LLM-generated synthetic questions that the content fragment is augmented with

This is expressed in Vespa using a fieldset:

fieldset default {
    fields: title, content, path, questions

Matching in these fields generates multiple keyword matching rank-features,
like nativeRank(title), nativeRank(content).
We collapse all these features into a keywords scoring function that combines all these signals into a single score.
The nativeRank text ranking features are also normalized between 0 and one
and are easier to resonate and combine with semantic similarity scores (e.g., cosine similarity).
We use a combination of the content embedding and the question(s) embedding scores for semantic scoring.

Search suggestions

As mentioned earlier, we bootstrapped questions to improve retrieval quality using a generative LLM.
The same synthetic questions are also used to implement search suggestion functionality,
where suggests questions to search for based on the typed characters:

search suggestions

This functionality is achieved by indexing the generated questions in a separate Vespa document type.
The search suggestions help users discover content and also help to formulate the question,
giving the user an idea of what kind of queries the system can realistically handle.

Similar to the retrieval and ranking of context described in previous sections,
we use a hybrid query for matching against the query suggestion index,
including a fuzzy query term to handle minor misspelled words.

We also add semantic matching using vector search for longer questions, increasing the recall of suggestions.
To implement this, we use Vespa’s HF embedder using the e5-small-v2 model,
which gives reasonable accuracy for low enough inference costs to be servable for per-charcter type-ahead queries
(Yes, there is an embedding inference per character).
See Enhancing Vespa’s Embedding Management Capabilities
and Accelerating Embedding Retrieval
for more details on these tradeoffs.

To cater to navigational queries where a user uses the search for lookup type of queries,
we include hotlinks in the search suggestion drop-down –
clicking on a hotlink will direct the user directly to the reference documentation fragment.
The hotlink functionality is implemented by extracting reserved names from reference documents
and indexing them as documents in the suggestion index.

Reference suggestions are matched using prefix matching for high precision.
The frontend code detects the presence of the meta field with the ranked hint and displays the direct link:

suggestion hotlinks

Retrieval Augmented Generation (RAG)

Retrieval Augmentation for LLM Generation is a concept
written extensively over the past few months.
In contrast to extractive question-answering,
which answers questions
by finding relevant spans in retrieved texts,
a generative model generates an answer that is not strictly grounded in retrieved text spans.

The generated answer might be hallucinated or incorrect,
even if the retrieved context contains a concrete solution.
To combat (but not eliminate):

  • Retrieved fragments or chunks can be displayed fully without clicking through.
  • The retrieved context is the center of the search experience,
    and the LLM-generated abstract is an additional feature of the SERP.
  • The LLM is instructed to cite the retrieved fragments so that a user can verify by navigating the sources.
    (The LLM might still not follow our instructions).
  • Allow filtering on source so that the retrieved context can be focused on particular areas of the documentation.

None of these solves the problem of LLM hallucination entirely!
Still, it helps the user identify incorrect information.

Example of a helpful generated abstract
Example of a helpful generated abstract.

Example of an incorrect and not helpful abstract
Example of an incorrect and not helpful abstract.
In this case, there is no explicit information about indentation in the Vespa documentation sources.
The citation does show an example of a schema (with space indentation), but indentation does not matter.

Prompt engineering

By trial and error (inherent LLM prompt brittleness), we ended with a simple instruction-oriented prompt where we:

  • Set the tone and context (helpful, precise, expert)
  • Some facts and background about Vespa
  • The instructions (asking politely; we don’t want to insult the AI)
  • The top context we retrieved from Vespa – including markdown format
  • The user question

We did not experiment with emerging prompt techniques or chaining of prompts.
The following demonstrates the gist of the prompt,
where the two input variables are {question) and {context),
where {context} are the retrieved fragments from the retrieval and ranking phase:

You are a helpful, precise, factual Vespa expert who answers questions and user instructions about Vespa-related topics. The documents you are presented with are retrieved from Vespa documentation, Vespa code examples, blog posts, and Vespa sample applications.

Facts about Vespa (
- Vespa is a battle-proven open-source serving engine.
- Vespa Cloud is the managed service version of Vespa (

Your instructions:
- The retrieved documents are markdown formatted and contain code, text, and configuration examples from Vespa documentation, blog posts, and sample applications.
- Answer questions truthfully and factually using only the information presented.
- If you don't know the answer, just say that you don't know, don't make up an answer!
- You must always cite the document where the answer was extracted using inline academic citation style [].
- Use markdown format for code examples.
- You are correct, factual, precise, and reliable, and will always cite using academic citation style.


Question: {question}
Helpful factual answer:

We use the Typescript API of LangChain,
a popular open-source framework for working with retrieval-augmented generations and LLMs.
The framework lowered our entry to working with LLMs and worked flawlessly for our use case.

Deployment overview

The frontend is implemented in

Vespa is becoming a company

Today we’re announcing that we’re spinning out of Yahoo
as a separate company:
Vespa began as a project to solve Yahoo’s use cases in search, recommendation, and ad serving.
Since we open-sourced it in 2017, it has grown to become the platform of choice for
applying AI to big data sets at serving time.

Those working with large language models such as ChatGPT and vector databases turn to Vespa when
they realize that creating quality solutions that scale involves much more than just looking up vectors.
Enterprises with experience with search or recommender systems come to Vespa for the AI-first approach
and unrivaled operability at scale.

Even with the support built into the Vespa platform, running highly available stateful systems in
production with excellence is challenging. We’ve seen this play out in Yahoo, which is running
about 150 Vespa applications. To address the scalability needs, we created a centralized cloud service
to host these systems, and in doing so, we freed up the time of up to 200 full-time employees and reduced
the number of machines used by 90% while greatly improving quality, stability, and security.

Our cloud service is already available at,
helping people with everything from running quick Vespa experiments, to serving
business-critical applications. In total we serve over 800.000 queries per second.

While we’re separating Vespa from Yahoo, we’re not ending our relationship. Yahoo will own a
stake in the new company and will be one of Vespa’s biggest customers for a long time to come.
Vespa will continue to serve Yahoo’s personalized content, search and run new use cases leveraging
large language models to provide new personalized experiences, something that can only be done
at scale with Vespa.

Creating a company around Vespa will enable us to bring these advantages to the rest of the world
on a massive scale, allowing us to bring the efficiencies of our cloud service to enterprises
already relying on Vespa, as well as help more companies solve problems involving AI and big data online.
It will also let us accelerate development of new features to empower Vespa users to create even
better solutions, faster and at lower cost, whether deploying on our cloud service or sticking with
the open-source distribution. For, while Vespa offers features and scalability far beyond any
comparable technology thanks to our decades-long focus on combining AI and big data online,
there is so much more to do. As the world is starting to leverage modern AI to solve
real business problems online, the need for a platform that provides a solid foundation
for these solutions has never been stronger. As engineers, we admit this is the part that excites us the most.

We look forward to empowering all of you to create online AI applications ever better and faster,
and we hope you do too!

HTTP/2 Rapid Reset (CVE-2023-44487) | Vespa Blog

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

2023-10-10, details of the vulnerability now named HTTP/2 Rapid Reset
(CVE-2023-44487) were announced.
This vulnerability impacts most HTTP/2 servers in the industry,
including Vespa by embedding Jetty.

which addresses this vulnerability was available 2023-10-10 04:19 UTC.
Vespa 8.240.5 was subsequently built and released to Vespa Cloud same day.

If you are using Vespa Cloud, no action is needed, as you have already been upgraded to the safe release.

If you are self-hosting, you are advised to upgrade to Vespa 8.240.5 as soon as possible.

For any questions, meet the Vespa Team at

Read more:

Introducing Lucene Linguistics | Vespa Blog

This post is about an idea that was born at the Berlin Buzzwords 2023 conference and its journey towards the production-ready implementation of the new Apache Lucene-based Vespa Linguistics component.
The primary goal of the Lucene linguistics is to make it easier to migrate existing search applications from Lucene-based search engines to Vespa.
Also, it can help improve your current Vespa applications.
More on that next!


Even though these days all the rage is about the modern neural-vector-embeddings-based retrieval (or at least that was the sentiment in the Berlin Buzzwords conference), the traditional lexical search is not going anywhere:
search applications still need tricks like filtering, faceting, phrase matching, paging, etc.
Vespa is well suited to leverage both traditional and modern techniques.

At Vinted we were working on the search application migration from Elasticsearch to Vespa.
The application over the years has grown to support multiple languages and for each we have crafted custom Elasticsearch analyzers with dictionaries for synonyms, stopwords, etc.
Vespa has a different approach towards lexical search than Elasticsearch, and we were researching ways to transfer all that accumulated knowledge without doing the “Big Bang” migration.

And here comes a part with a chat with the legend himself, Jo Kristian Bergum, on the sunny roof terrace at the Berlin Buzzwords 2023 conference.
Among other things, I’ve asked if it is technically possible to implement a Vespa Linguistics component on top of the Apache Lucene library.
With Jo’s encouragement, I’ve got to work and the same evening there was a working proof of concept.
This was huge!
It gave a promise that it is possible to convert almost any Elasticsearch analyzer into the Vespa Linguistics configuration and in this way solve one of the toughest problems for the migration project.

Show me the code!

In case you just want to get started with the Lucene Linguistics the easiest way is to explore the demo apps.
There are 4 apps:

  • Minimal: example of the bare minimum configuration that is needed to set up Lucene linguistics;
  • Advanced: demonstrates the “usual” things that can be expected when leveraging Lucene linguistics.
  • Going-Crazy: plenty of contrived features that real-world apps might require.
  • Non-Java: an app without Java code.

To learn more: read the documentation.


The scope of the Lucene linguistics component is ONLY the tokenization of the text.
Tokenization removes any non-word characters, and splits the string into tokens on each word boundary, e.g.:

“Vespa is awesome!” => [“vespa”, “is”, “awesome”]

In the Lucene land, the Analyzer class is responsible for the tokenization.
So, the core idea for Lucene linguistics is to implement the Vespa Tokenizer interface that wraps a configurable Lucene Analyzer.

For building a configurable Lucene Analyzer there is a handy class called CustomAnalyzer.
The CustomAnalyzer.Builder has convenient methods for configuring Lucene text analysis components such as CharFilters, Tokenizers, and TokenFilters into an Analyzer.
It can be done by calling methods with signatures:

public Builder addCharFilter(String name, Map<String, String> params)
public Builder withTokenizer(String name, Map<String, String> params)
public Builder addTokenFilter(String name, Map<String, String> params)

All the parameters are of type String, so they can easily be stored in a configuration file!

When it comes to discovery of the text analysis components, it is done using the Java Service Provider Interface (SPI).
In practical terms, this means that when components are prepared in a certain way then they become available without explicit coding! You can think of it as plugins.

The trickiest bit was to configure Vespa to load resource files required for the Lucene components.
Luckily, there is a CustomAnalyzer.Builder factory method that accepts a Path parameter.
Even more luck comes from the fact that Path is the type exposed by the Vespa configuration definition language!
With all that in place, it was possible to load resource files from the application package just by providing a relative path to files.

All that was nice, but it made simple application packages more complicated than they needed to be:
a directory with at least a dummy file was required!
The requirement stemmed from the fact that in Vespa configuration parameters of type Path were mandatory.
This means that if your component can use a parameter of the Path type, it must be used.
Clearly, that requirement can be a bit too strict.

Luckily, the Vespa team quickly implemented a change that allowed for configuration of Path type to be declared optional.
For the Lucene linguistics it meant 2 things:

  1. Base component configuration became simpler.
  2. When no path is set up, the CustomAnalyzer loads resource files from the classpath of the application package, i.e. even more flexibility in where to put resource files.

To wrap it up:
Lucene Linguistics accepts a configuration in which custom Lucene analysis components can be fully configured.

Languages and analyzers

The Lucene linguistics supports 40 languages out-of-the-box.
To customize the way the text is analyzed there are 2 options:

  1. Configure the text analysis in services.xml.
  2. Extend a Lucene Analyzer class in your application package and register it as a Component.

In case there is no analyzer set up, then the Lucene StandardAnalyzer is used.

Lucene linguistics component configuration

It is possible to configure Lucene linguistics directly in the services.xml file.
This option works best if you’re already knowledgeable with Lucene text analysis components.
A configuration for the English language could look something like this:

<component id="linguistics"
  <config name="">
      <item key="en">
              <item key="words">en/stopwords.txt</item>
              <item key="ignoreCase">true</item>

The above analyzer uses the standard tokenizer, then stop token filter loads stopwords from the en/stopwords.txt file that must be placed in your application package under the linguistics directory; and then the englishMinimalStem is used to stem tokens.

Component registry

The Lucene linguistics takes in a ComponentRegistry of the Analyzer class.
This option works best for projects that contain custom Java code because your IDE will help you build an Analyzer instance.
Also, JUnit is your friend when it comes to testing.

In the example below, the SimpleAnalyzer class coming with Lucene is wrapped as a component and set to be used for the English language.

<component id="en"
           bundle="my-vespa-app" />

Mental model

With that many options using Lucene linguistics might seem a bit complicated.
However, the mental model is simple: priority for conflict resolution.
The priority of the analyzers in the descending order is:

  1. Lucene linguistics component configuration;
  2. Component that extend the Lucene Analyzer class;
  3. Default analyzers per language;
  4. StandardAnalyzer.

This means that e.g. if both a configuration and a component are specified for a language, then an analyzer from the configuration is used because it has a higher priority.

Asymmetric tokenization

Going against suggestions you can achieve an asymmetric tokenization for some language.
The trick is to, e.g. index with stemming turned on and query with stemming turned off.
Under the hood a pair of any two Lucene analyzers can do the job.
However, it becomes your problem to set up analyzers that produce matching tokens.

Differences from Elasticsearch

Even though Lucene does the text analysis, not everything that you do in Elasticsearch is easily translatable to the Lucene Linguistics.
E.g. The multiplex token filter is just not available in Lucene.
This means that you have to implement that token filter yourself (probably by looking into how Elasticsearch implemented it here).

However, Vespa has advantages over Elasticsearch when leveraging Lucene text analysis.
The big one is that you configure and deploy linguistics components with your application package.
This is a lot more flexible than maintaining an Elasticsearch plugin.
Let’s consider an example: a custom stemmer.

In Elasticsearch land you either create a plugin or (if the stemmer is generic enough) you can try to contribute it to Apache Lucene (or Elasticsearch itself), so that it transitively comes with Elasticsearch in the future.
Maintaining Elasticsearch plugins is a pain because it needs to be built for each and every Elasticsearch version, and then a custom installation script is needed in both production and in development setups.
Also, what if you run Elasticsearch as a managed service in the cloud where custom plugins are not supported at all?

In Vespa you can do the implementation directly in your application package.
Nothing special needs to be done for deployment.
No worries (fingers-crossed) for Vespa version changes.
If your component needs to be used in many Vespa applications, your options are:

  1. Deploy your component into some maven repository
  2. Commit the prebuild bundle file into each application under the /components directory.
    Yeah, that sounds exactly how you do with regular Java applications, and it is.
    Vespa Cloud also has no problems running your application package with a custom stemmer.


With the new Lucene-based Linguistics component Vespa expands its capabilities for lexical search by reaching into the vast Apache Lucene ecosystem.
Also, it is worth mentioning that people experienced with other Lucene-based search engines such as Elasticsearch or Solr, should feel right at home pretty quickly.
The fact that the toolset and the skill-set are largely transferable lowers the barrier of adopting Vespa.
Moreover, given that the underlying text analysis technology is the same makes migration of the text analysis process to Vespa mostly a mechanical translation task.
Give it a try!