Representing BGE embedding models in Vespa using bfloat16

Decorative image

Photo by Rafael Drück on Unsplash

This post demonstrates how to use recently announced BGE (BAAI General Embedding)
models in Vespa. The open-sourced (MIT licensed) BGE models
from the Beijing Academy of Artificial Intelligence (BAAI) perform
strongly on the Massive Text Embedding Benchmark (MTEB
leaderboard). We
evaluate the effectiveness of two BGE variants on the
BEIR trec-covid dataset.
Finally, we demonstrate how Vespa’s support for storing and indexing
vectors using bfloat16 precision saves 50% of memory and storage
fooprint with close to zero loss in retrieval quality.

Choose your BGE Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.

Quality, however, is not nearly linear, as demonstrated on the MTEB

ModelDimensionalityModel params (M)Accuracy
Average (56 datasets)
Accuracy Retrieval
(15 datasets)

A comparison of the English BGE embedding models — accuracy numbers MTEB
leaderboard. All
three BGE models outperforms OpenAI ada embeddings with 1536
dimensions and unknown model parameters on MTEB

In the following sections, we experiment with the small and base
BGE variant, which gives us reasonable accuracy for a much lower
cost than the large variant. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure

Exporting BGE to ONNX format for accelerated model inference

To use the embedding model from the Huggingface model hub in Vespa
we need to export it to ONNX format. We can use
the Transformers Optimum
library for this:

$ optimum-cli export onnx --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en

This exports the small model with the highest optimization
usable for serving on CPU. We also quantize the optimized ONNX model
using onnxruntime quantization like
Quantization (post-training) converts the float model weights (4
bytes per weight) to byte (int8), enabling faster inference on the
CPU. As demonstrated in this blog
quantization accelerates embedding model inference by 2x on CPU with negligible
impact on retrieval quality.

Using BGE in Vespa

Using the Optimum generated ONNX model and
tokenizer files, we configure the Vespa Huggingface
with the following in the Vespa application

<component id="bge" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>

BGE uses the CLS special token as the text representation vector
(instead of average pooling). We also specify normalization so that
we can use the prenormalized-angular distance
for nearest neighbor search. See configuration
for details.

With this, we are ready to use the BGE model to embed queries and
documents with Vespa.

Using BGE in Vespa schema

The BGE model family does not use instructions for documents like
the E5
so we don’t need to prepend the input to the document model with
“passage: “ like with the E5 models. Since we configure the Vespa
embedder to
normalize the vectors, we use the optimized prenormalized-angular
distance-metric for the nearest neighbor search

field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular

Note that the above does not enable HNSW
indexing, see
post on the tradeoffs related to introducing approximative nearest
neighbor search. The small model embedding is configured with 384
dimensions, while the base model uses 768 dimensions.

field embedding type tensor<float>(x[768]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular

Using BGE in queries

The BGE model uses query instructions like the E5
that are prepended to the input query text. We prepend the instruction
text to the user query as demonstrated in the snippet below:

query = 'is remdesivir an effective treatment for COVID-19'
body = {
        'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
        'input.query(q)': 'embed(Represent this sentence for searching relevant passages: ' + query +  ')', 
        'ranking': 'semantic',
        'hits' : '10' 
response ='http://localhost:8080/search/', json=body)

The BGE query instruction is Represent this sentence for searching
relevant passages:
. We are unsure why they choose a longer query instruction as
it does hurt efficiency as compute complexity is
with sequence length.


We evaluate the small and base model on the trec-covid test split
from the BEIR benchmark. We
concat the title and the abstract as input to the BEG embedding
models as demonstrated in the Vespa schema snippets in the previous

DatasetDocumentsAvg document tokensQueriesAvg query
Relevance Judgments
BEIR trec_covid171,332245501866,336

Dataset characteristics; tokens are the number of language model
token identifiers (wordpieces)

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

Sample Vespa JSON
feed document (prettified) from the
BEIR trec-covid dataset:

  "put": "id:miracl-trec:doc::wnnsmx60",
  "fields": {
    "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
    "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
    "doc_id": "wnnsmx60",
    "language": "en"

Evalution results

ModelModel size (MB)NDCG@10 BGENDCG@10

Evaluation results for quantized BGE models.

We contrast both BGE models with the unsupervised
BM25 baseline from
this blog
Both models perform better than the BM25 baseline
on this dataset. We also note that our NDCG@10 numbers represented
in Vespa is slightly better than reported on the MTEB leaderboard
for the same dataset. We can also observe that the base model
performs better on this dataset, but is also 2x more costly due to
size of embedding model and the embedding dimensionality. The
bge-base model inference could benefit from GPU
(without quantization).

Using bfloat16 precision

We evaluate using
instead of float for the tensor representation in Vespa. Using
bfloat16 instead of float reduces memory and storage requirements
by 2x since bfloat16 uses 2 bytes per embedding dimension instead
of 4 bytes for float. See Vespa tensor values

We do not change the type of the query tensor. Vespa will take care
of casting the bfloat16 field representation to float at search
time, allowing CPU acceleration of floating point operations. The
cast operation does come with a small cost (20-30%) compared with
using float, but the saving in memory and storage resource footprint
is well worth it for most use cases.

field embedding type tensor<bfloat16>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular

Using bfloat16 instead of float for the embedding tensor.

ModelNDCG@10 bfloat16NDCG@10 float

Evaluation results for BGE models – float versus bfloat16 document representation.

By using bfloat16 instead of float to store the vectors, we save
50% of memory cost and we can store 2x more embeddings per instance
type with almost zero impact on retrieval quality:


Using the open-source Vespa container image, we’ve explored the
recently announced strong BGE text embedding models with embedding
inference and retrieval on our laptops. The local experimentation
eliminates prolonged feedback loops.

Moreover, the same Vespa configuration files suffice for many
deployment scenarios, whether in on-premise setups, on Vespa Cloud,
or locally on a laptop. The beauty lies in that specific
infrastructure for managing embedding inference and nearest neighbor
search as separate infra systems become obsolete with Vespa’s
native embedding

If you are interested to learn more about Vespa; See Vespa Cloud – getting started,
or self-serve Vespa – getting started.
Got questions? Join the Vespa community in Vespa Slack.

Summer Internship at Vespa | Vespa Blog

This summer, two young men have revolutionized the field of information retrieval! Or at least they tried… Read on for the tale of this year’s summer interns, and see the fruits of our labor in the embedder auto-training sample app.

Automatic Embedder Training with an LLM

Our main project this summer has been developing a system for automatically improving relevance for semantic search. Semantic search utilizes machine-learned text embedders trained on large amounts of annotated data to improve search relevance.

Embedders can be fine-tuned on a specific dataset to improve relevance further for the dataset in question. This requires annotated training data, which traditionally has been created by humans. However, this process is laborious and time-consuming – can it be automated?

Enter large language models! LLMs like ChatGPT have been trained on an enormous amount of data from a multitude of sources, and appear to understand a great deal about the world. Our hypothesis was that it would be possible to use an LLM to generate training data for an embedder.

Query generation

Diagram depicting the query generation pipeline

Training data for text embedders used for information retrieval consists of two parts: queries and query relevance judgments (qrels). Qrels indicate which documents are relevant for which queries, and are used for training and to rate retrieval performance during evaluation. Our LLM of choice, ChatGPT (3.5-turbo-4k), works by providing it with a system prompt and a list of messages containing instructions and data. We used the system prompt to inform ChatGPT of its purpose and provide it with rules informing how queries should be generated.

Generating queries requires a system prompt, example document-query pairs, and a document to generate queries for. Our system generates the system prompt, and optionally generates additional qrels, resulting in the three-step process illustrated by the diagram above.

In the beginning, we handcrafted system prompts while trying to get ChatGPT to generate queries similar to existing training data. After some trial and error, we found that we got better results if we specified rules describing what queries should look like. Later, we devised a way for ChatGPT to generate these rules itself, in an effort to automate the process.

Using the system prompt alone did not appear to yield great results, though. ChatGPT would often ignore the prompt and summarize the input documents instead of creating queries for them. To solve this, we used a technique called few-shot prompting. It works by essentially faking a conversation between the user and ChatGPT, showing the LLM how it’s supposed to answer. Using the aforementioned message list, we simply passed the LLM a couple of examples before showing it the document to generate queries for. This increased the quality of the output drastically at the cost of using more tokens.

After generating queries, we optionally generate additional qrels. This can be necessary for training if the generated queries are relevant for multiple documents in the dataset, because the training script assumes that all matched documents not in the qrels aren’t relevant. Generating qrels works by first querying Vespa with a query generated by ChatGPT, then showing the returned documents and the generated query to ChatGPT and asking it to judge whether or not each document is relevant.

Training and evaluation

We utilized SentenceTransformers for training, and we initialized from the E5 model. We started off by using scripts provided by SimLM, which got us up and running quickly, but eventually wanted more control of our training loop.

The training script requires a list of positive (matching) documents and a list of negative (non-matching) documents for each query. The list of positive documents is given by the generated qrels. We assemble a list of negative documents for each query by querying Vespa and marking each returned document not in the qrels as a negative.

After training we evaluated the model with trec_eval and the nDCG@10 metric. The resulting score was compared to previous trainings, and to a baseline evaluation of the model.

We encapsulated the entire training and evaluation procedure into a single Bash script that let us provide the generated queries and qrels as input, and get the evaluation of the trained model as output.


The results we got were varied. We had the most successful training on the NFCorpus dataset, where we consistently got an evaluation higher than the baseline. Interestingly we initially got the highest evaluation when training on just 50 queries! We eventually figured out that this was caused by using the small version of the E5 model – using the base version of the model gave us the highest evaluation when training on 400 queries.

Training on other datasets was unfortunately unsuccessful. We tried training on both the FiQA and the NQ dataset, tweaking various parameters, but weren’t able to get an evaluation higher than their baselines.

Limitations and future work

The results we got for NFCorpus are a promising start, and previous research also shows this method to have promise. The next step is to figure out how to apply our system to datasets other than NFCorpus. There’s a wide variety of different options to try:

  • Tweaking various training parameters, e.g. number of epochs and learning rate
  • Different training methods, e.g. knowledge distillation
  • Determining query relevance with a fine-tuned cross-encoder instead of with ChatGPT-generated qrels
  • More data, both in terms of more documents and generating more queries
  • Using a different model than E5

We currently make some assumptions about the datasets we train on that don’t always hold. Firstly, we do few-shot prompting when generating queries by fetching examples from existing training data, but this system is perhaps most useful for datasets without that data. Secondly, we use the ir_datasets package to prepare and manage datasets, but ideally we’d want to fetch documents from e.g. Vespa itself.

Most of our training was done on the relatively small NFCorpus dataset because of the need to refeed all documents, after each training, to generate new embeddings. This becomes a big bottleneck on large datasets. Implementing frozen embeddings, which allows reusing document embeddings between trainings, would solve this problem.

Side quests

The easiest way to learn Vespa is to use it. Before starting on the main project, we spent some time trying out the various interactive tutorials. We also worked on various side projects which were related to the main project in some way.

Embedding service

We created a sample app to create embeddings from arbitrary text, using the various models in the Vespa model hub. This was a great way to learn about Vespa’s stateless Java components and how Vespa works in general.


Pyvespa is a Python API that enables fast prototyping of Vespa applications. Pyvespa is very useful when working in Python, like we did for our machine learning experiments, but it does not support all of Vespa’s features. In addition, there were some issues with how Pyvespa handled certificates that prevented us from using Pyvespa in combination with an app deployed from the Vespa CLI.

We were encouraged to implement fixes for these problems ourselves. Our main changes were to enable Pyvespa to use existing certificates generated with the Vespa CLI, as well as adding a function to deploy an application from disk to Vespa Cloud via Pyvespa, allowing us to use all the features of Vespa from Python (this feature already existed for deploying to Docker, but not for deploying to Vespa Cloud). This was very satisfying, as well as a great learning experience.

Our experience at Vespa

We’ve learned a lot during our summer at Vespa, especially about information retrieval and working with LLMs. We’ve also learned a lot about programming and gotten great insight into the workings of a professional software company.

Contributing to an open-source project, especially such a large one as Vespa, has been very exciting. Vespa is powerful, which is awesome, but as new users, there was quite a lot to take in. The project is well documented, however, and includes a great number of sample apps and example use cases, meaning we were usually able to find out how to solve problems on our own. Whenever we got really stuck, there was always someone to ask and talk to. A big shout out to all of our colleagues, and a special thanks to Kristian Aune and Lester Solbakken for their support and daily follow-up during our internship.

Working at Vespa has been a great experience, and we’ve really enjoyed our time here.

Vespa Newsletter, August 2023 | Vespa Blog

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

In the previous update,
we mentioned Vector Streaming Search, Embedder Models from Huggingface,
GPU Acceleration of Embedding Models, Model Hub and Dotproduct distance metric for ANN.
Today, we’re excited to share the following updates:

Multilingual sample app

In the previous newsletter, we announced Vespa E5 model support.
Now we’ve added a multilingual-search sample application.
Using Vespa’s powerful indexing language
and integrated embedding support, you can embed and index:

field embedding type tensor<float>(x[384]) {
    indexing {
        "passage: " . input title . " " . input text | embed | attribute

Likewise, for queries:

    "yql": "select ..",
    "input.query(q)": "embed(query: the query to encode)",

With this, you can easily use multilingual E5 for great relevance,
see the simplify search with multilingual embeddings
blog post for results.
Remember to try the sample app,
using trec_eval to compute NDCG@10.

ANN targetHits

Vespa uses targetHits
in approximate nearest neighbor queries.
When searching the HNSW index in a post-filtering case,
this is auto-adjusted in an effort to still expose targetHits hits to first-phase ranking after post-filtering
(by exploring more nodes).
This increases query latency as more candidates are evaluated.
Since Vespa 8.215, the following formula is used to ensure an upper bound of adjustedTargetHits:

adjustedTargetHits = min(targetHits / estimatedHitRatio,
                         targetHits * targetHitsMaxAdjustmentFactor)

You can use this to choose to return fewer hits over taking longer to search the index.
The target-hits-max-adjustment-factor
can be set in a rank profile and overridden
per query.
The value is in the range [1.0, inf], default 20.0.

Tensor short query format in inputs

In Vespa 8.217, a short format for mapped tensors can be used in input values.
Together with the short indexed tensor format, query tensors can be like:

"input": {
    "query(my_indexed_tensor)": [1, 2, 3, 4],
    "query(my_mapped_tensor)": {
        "Tablet Keyboard Cases": 0.8,


During the last month, we’ve released PyVespa
0.36 and

  • Requires minimum Python 3.8.
  • Support setting default stemming of Schema: #510.
  • Add support for first phase ranking:
  • Support using key/cert pair generated by Vespa CLI:
    and add deploy_from_disk for Vespa Cloud: #514 –
    this makes it easier to interoperate with Vespa Cloud and local experiments.
  • Specify match-features in RankProfile:
  • Add utility to create a vespa feed file for easier feeding using Vespa CLI:
  • Add support for synthetic fields: #547
    and support for Component config:
    With this, one can run the multivector sample application –
    try it using the multi-vector-indexing notebook.

Vespa CLI functions

The Vespa command-line client has been made smarter,
it will now check local deployments (e.g. on your laptop) and wait for the container cluster(s) to be up:

$ vespa deploy
Waiting up to 1m0s for deploy API to become ready...
Uploading application package... done

Success: Deployed . with session ID 2
Waiting up to 1m0s for deployment to converge...
Waiting up to 1m0s for cluster discovery...
Waiting up to 1m0s for container default...

The new function vespa destroy
is built for quick dev cycles on Vespa Cloud.
When developing, easily reset the state in your Vespa Cloud application by calling vespa destroy.
This is also great for automation, e.g., in a GitHub Action.
Local deployments should reset with fresh Docker/Podman containers.

Optimizations and features

  • Vespa indexing language now supports to_epoch_second
    for converting iso-8601 date strings to epoch time.
    Available since Vespa 8.215.
    Use this to easily convert from strings to a number when indexing –
    see example.
  • Since Vespa 8.218, Vespa uses onnxruntime 1.15.1.
  • Since Vespa 8.218, one can use create to create non-existing cells before a
    modify-update operation is applied to a tensor.
  • Vespa allows referring to models by URL in the application package.
    Such files can be large, and are downloaded per deploy-operation.
    Since 8.217, Vespa will use a previously downloaded model file if it exists on the requesting node.
    New versions of the model must use a different URL.
  • Some Vespa topologies use groups of nodes to optimize query performance –
    each group has a replica of a document.
    High-query Vespa applications might have tens or even hundreds of groups.
    Upgrading such clusters in Vespa Cloud takes time, having only one replica (= group) out at any time.
    With groups-allowed-down-ratio,
    one can set a percentage of groups instead,
    say 25%, for only 4 cycles to upgrade a full content cluster.

Blog posts since last newsletter

Thanks for reading! Try out Vespa on Vespa Cloud
or grab the latest release at and run it yourself! 😀

Announcing | Vespa Blog

Today, we announce the general availability of –
a new search experience for all (almost) Vespa-related content –
powered by Vespa, LangChain, and OpenAI’s chatGPT model.
This post overviews our motivation for building it, its features, limitations, and how we made it:

Decorative image

Over the last year, we have seen a dramatic increase in interest in Vespa
(From 2M pulls to 11M vespaengine/vespa pulls within just a few months),
resulting in many questions on our Slack channel,
like “Can Vespa use GPU?” or
“Can you expire documents from Vespa?”.

Our existing search interface could only present a ranked list of documents for questions like that,
showing a snippet of a matching article on the search result page (SERP).
The user then had to click through to the article and scan for the fragment snippets relevant to the question.
This experience is unwieldy if looking for the reference documentation of a specific Vespa configuration setting
like num-threads-per-search buried in
large reference documentation pages.

We wanted to improve the search experience by displaying a better-formatted response,
avoiding clicking through, and linking directly to the content fragment.
In addition, we wanted to learn more about using a generative large language model to answer questions,
using the top-k retrieved fragments in a so-called retrieval augmented generation (RAG) pipeline.

This post goes through how we built – highlights:

  • Creating a search for chunks of information –
    the bit of info the user is looking for.
    The chunks are called paragraphs or fragments in this article
  • Rendering fragments in the result page, using the original layout, including formatting and links.
  • Using multiple ranking strategies to match user queries to fragments:
    Exact matching, text matching, semantic matching,
    and multivector semantic query-to-query matching.
  • Search suggestions and hot links.

The Vespa application powering is running in Vespa Cloud.
All the functional components of are Open Source and are found in repositories like
and vespa-documentation-search –
it is a great starting point for other applications using features highlighted above!

Getting the Vespa content indexed

The Vespa-related content is spread across multiple git repositories using different markup languages like HTML,
Markdown, sample apps, and Jupyter Notebooks.
Jekyll generators make this easy;
see vespa_index_generator.rb for an example.

First, we needed to convert all sources into a standard format
so that the search result page could display a richer formatted experience
instead of a text blob of dynamic summary snippets with highlighted keywords.

Since we wanted to show full, feature-rich snippets, we first converted all the different source formats to Markdown.
Then, we use the markdown structure to split longer documents into smaller retrieval units or fragments
where each retrieval unit is directly linkable, using URL anchoring (#).
This process was the least exciting thing about the project, with many iterations,
for example, splitting larger reference tables into smaller retrievable units.
We also adapted reference documentation to make the fragments linkable – see hotlinks.
The retrievable units are indexed in a
paragraph schema:

schema paragraph {
    document paragraph {
        field path type string {}
        field doc_id type string {}
        field title type string {}
        field content type string {}
        field questions type array<string> {}        
        field content_tokens type int {}
        field namespace type string {}
    field embedding type tensor<float>(x[384]) {
        indexing: "passage: " . (input title || "") . " " . (input content || "") | embed ..
    field question_embedding type tensor<float>(q{}, x[384]) {
        indexing {
            input questions |
            for_each { "query: " . _ } | embed | ..

There are a handful of fields in the input (paragraph document type) and two synthetic fields that are produced by Vespa,
using Vespa’s embedding functionality.
We are mapping different input string fields to two different
Vespa tensor representations.
The content and title fields are concatenated and embedded
to obtain a vector representation of 384 dimensions (using e5-v2-small).
The question_embedding is a multi-vector tensor;
in this case, the embedder embeds each input question.
The output is a multi-vector representation (A mapped-dense tensor).
Since the document volume is low, an exact vector search is all we need,
and we do not enable HNSW indexing of these two embedding fields.

LLM-generated synthetic questions

The questions per fragment are generated by an LLM (chatGPT).
We do this by asking it to generate questions the fragment could answer.
The LLM-powered synthetic question generation is similar to the approach described in
However, we don’t select negatives (irrelevant content for the question) to train a
cross-encoder ranking model.
Instead, we expand the content with the synthetic question for matching and ranking:

    "put": "id:open-p:paragraph::open/en/access-logging.html-",
    "fields": {
        "title": "Access Logging",
        "path": "/en/access-logging.html#",
        "doc_id": "/en/access-logging.html",
        "namespace": "open-p",
        "content": "The Vespa access log format allows the logs to be processed by a number of available tools\n handling JSON based (log) files.\n With the ability to add custom key/value pairs to the log from any Searcher,\n you can easily track the decisions done by container components for given requests.",
        "content_tokens": 58,
        "base_uri": "",
        "questions": [
            "What is the Vespa access log format?",
            "How can custom key/value pairs be added?",
            "What can be tracked using custom key/value pairs?"

Example of the Vespa feed format of a fragment from this
reference documentation and three LLM-generated questions.
The embedding representations are produced inside Vespa and not feed with the input paragraphs.

Matching and Ranking

To retrieve relevant fragments for a query, we use a hybrid combination of exact matching, text matching,
and semantic matching (embedding retrieval).
We build the query tree in a custom Vespa Searcher plugin.
The plugin converts the user query text into an executable retrieval query.
The query request searches both in the keyword and embedding fields using logical disjunction.
The YQL equivalent:

where (weakAnd(...) or ({targetHits:10}nearestNeighbor(embedding,q) or ({targetHits:10}nearestNeighbor(question_embedding,q))) and namespace contains "open-p"

Example of using hybrid retrieval, also using
multiple nearestNeighbor operators
in the same Vespa query request.

The scoring logic is expressed in Vespa’s ranking framework.
The hybrid retrieval query generates multiple Vespa rank features that can be used to score and rank the fragments.

From the rank profile:

rank-profile hybrid inherits semantic {
    inputs {
        query(q) tensor<float>(x[384])
        query(sw) double: 0.6 #semantic weight
        query(ew) double: 0.2 #keyword weight

    function semantic() {
        expression: cos(distance(field, embedding))
    function semantic_question() {
        expression: max(cos(distance(field, question_embedding)), 0)
    function keywords() {
        expression: (  nativeRank(title) +
                       nativeRank(content) +
                       0.5*nativeRank(path) +
                       query(ew)*elementCompleteness(questions).completeness  ) / 4 +
    first-phase {
        expression: query(sw)*(semantic_question + semantic) + (1 - query(sw))*keywords

The keyword matching using weakAnd,
we match the user query against the following fields:

  • The title – including the parent document title and the fragment heading
  • The content – including markup
  • The path
  • LLM-generated synthetic questions that the content fragment is augmented with

This is expressed in Vespa using a fieldset:

fieldset default {
    fields: title, content, path, questions

Matching in these fields generates multiple keyword matching rank-features,
like nativeRank(title), nativeRank(content).
We collapse all these features into a keywords scoring function that combines all these signals into a single score.
The nativeRank text ranking features are also normalized between 0 and one
and are easier to resonate and combine with semantic similarity scores (e.g., cosine similarity).
We use a combination of the content embedding and the question(s) embedding scores for semantic scoring.

Search suggestions

As mentioned earlier, we bootstrapped questions to improve retrieval quality using a generative LLM.
The same synthetic questions are also used to implement search suggestion functionality,
where suggests questions to search for based on the typed characters:

search suggestions

This functionality is achieved by indexing the generated questions in a separate Vespa document type.
The search suggestions help users discover content and also help to formulate the question,
giving the user an idea of what kind of queries the system can realistically handle.

Similar to the retrieval and ranking of context described in previous sections,
we use a hybrid query for matching against the query suggestion index,
including a fuzzy query term to handle minor misspelled words.

We also add semantic matching using vector search for longer questions, increasing the recall of suggestions.
To implement this, we use Vespa’s HF embedder using the e5-small-v2 model,
which gives reasonable accuracy for low enough inference costs to be servable for per-charcter type-ahead queries
(Yes, there is an embedding inference per character).
See Enhancing Vespa’s Embedding Management Capabilities
and Accelerating Embedding Retrieval
for more details on these tradeoffs.

To cater to navigational queries where a user uses the search for lookup type of queries,
we include hotlinks in the search suggestion drop-down –
clicking on a hotlink will direct the user directly to the reference documentation fragment.
The hotlink functionality is implemented by extracting reserved names from reference documents
and indexing them as documents in the suggestion index.

Reference suggestions are matched using prefix matching for high precision.
The frontend code detects the presence of the meta field with the ranked hint and displays the direct link:

suggestion hotlinks

Retrieval Augmented Generation (RAG)

Retrieval Augmentation for LLM Generation is a concept
written extensively over the past few months.
In contrast to extractive question-answering,
which answers questions
by finding relevant spans in retrieved texts,
a generative model generates an answer that is not strictly grounded in retrieved text spans.

The generated answer might be hallucinated or incorrect,
even if the retrieved context contains a concrete solution.
To combat (but not eliminate):

  • Retrieved fragments or chunks can be displayed fully without clicking through.
  • The retrieved context is the center of the search experience,
    and the LLM-generated abstract is an additional feature of the SERP.
  • The LLM is instructed to cite the retrieved fragments so that a user can verify by navigating the sources.
    (The LLM might still not follow our instructions).
  • Allow filtering on source so that the retrieved context can be focused on particular areas of the documentation.

None of these solves the problem of LLM hallucination entirely!
Still, it helps the user identify incorrect information.

Example of a helpful generated abstract
Example of a helpful generated abstract.

Example of an incorrect and not helpful abstract
Example of an incorrect and not helpful abstract.
In this case, there is no explicit information about indentation in the Vespa documentation sources.
The citation does show an example of a schema (with space indentation), but indentation does not matter.

Prompt engineering

By trial and error (inherent LLM prompt brittleness), we ended with a simple instruction-oriented prompt where we:

  • Set the tone and context (helpful, precise, expert)
  • Some facts and background about Vespa
  • The instructions (asking politely; we don’t want to insult the AI)
  • The top context we retrieved from Vespa – including markdown format
  • The user question

We did not experiment with emerging prompt techniques or chaining of prompts.
The following demonstrates the gist of the prompt,
where the two input variables are {question) and {context),
where {context} are the retrieved fragments from the retrieval and ranking phase:

You are a helpful, precise, factual Vespa expert who answers questions and user instructions about Vespa-related topics. The documents you are presented with are retrieved from Vespa documentation, Vespa code examples, blog posts, and Vespa sample applications.

Facts about Vespa (
- Vespa is a battle-proven open-source serving engine.
- Vespa Cloud is the managed service version of Vespa (

Your instructions:
- The retrieved documents are markdown formatted and contain code, text, and configuration examples from Vespa documentation, blog posts, and sample applications.
- Answer questions truthfully and factually using only the information presented.
- If you don't know the answer, just say that you don't know, don't make up an answer!
- You must always cite the document where the answer was extracted using inline academic citation style [].
- Use markdown format for code examples.
- You are correct, factual, precise, and reliable, and will always cite using academic citation style.


Question: {question}
Helpful factual answer:

We use the Typescript API of LangChain,
a popular open-source framework for working with retrieval-augmented generations and LLMs.
The framework lowered our entry to working with LLMs and worked flawlessly for our use case.

Deployment overview

The frontend is implemented in

Vespa is becoming a company

Today we’re announcing that we’re spinning out of Yahoo
as a separate company:
Vespa began as a project to solve Yahoo’s use cases in search, recommendation, and ad serving.
Since we open-sourced it in 2017, it has grown to become the platform of choice for
applying AI to big data sets at serving time.

Those working with large language models such as ChatGPT and vector databases turn to Vespa when
they realize that creating quality solutions that scale involves much more than just looking up vectors.
Enterprises with experience with search or recommender systems come to Vespa for the AI-first approach
and unrivaled operability at scale.

Even with the support built into the Vespa platform, running highly available stateful systems in
production with excellence is challenging. We’ve seen this play out in Yahoo, which is running
about 150 Vespa applications. To address the scalability needs, we created a centralized cloud service
to host these systems, and in doing so, we freed up the time of up to 200 full-time employees and reduced
the number of machines used by 90% while greatly improving quality, stability, and security.

Our cloud service is already available at,
helping people with everything from running quick Vespa experiments, to serving
business-critical applications. In total we serve over 800.000 queries per second.

While we’re separating Vespa from Yahoo, we’re not ending our relationship. Yahoo will own a
stake in the new company and will be one of Vespa’s biggest customers for a long time to come.
Vespa will continue to serve Yahoo’s personalized content, search and run new use cases leveraging
large language models to provide new personalized experiences, something that can only be done
at scale with Vespa.

Creating a company around Vespa will enable us to bring these advantages to the rest of the world
on a massive scale, allowing us to bring the efficiencies of our cloud service to enterprises
already relying on Vespa, as well as help more companies solve problems involving AI and big data online.
It will also let us accelerate development of new features to empower Vespa users to create even
better solutions, faster and at lower cost, whether deploying on our cloud service or sticking with
the open-source distribution. For, while Vespa offers features and scalability far beyond any
comparable technology thanks to our decades-long focus on combining AI and big data online,
there is so much more to do. As the world is starting to leverage modern AI to solve
real business problems online, the need for a platform that provides a solid foundation
for these solutions has never been stronger. As engineers, we admit this is the part that excites us the most.

We look forward to empowering all of you to create online AI applications ever better and faster,
and we hope you do too!

HTTP/2 Rapid Reset (CVE-2023-44487) | Vespa Blog

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

2023-10-10, details of the vulnerability now named HTTP/2 Rapid Reset
(CVE-2023-44487) were announced.
This vulnerability impacts most HTTP/2 servers in the industry,
including Vespa by embedding Jetty.

which addresses this vulnerability was available 2023-10-10 04:19 UTC.
Vespa 8.240.5 was subsequently built and released to Vespa Cloud same day.

If you are using Vespa Cloud, no action is needed, as you have already been upgraded to the safe release.

If you are self-hosting, you are advised to upgrade to Vespa 8.240.5 as soon as possible.

For any questions, meet the Vespa Team at

Read more:

Introducing Lucene Linguistics | Vespa Blog

This post is about an idea that was born at the Berlin Buzzwords 2023 conference and its journey towards the production-ready implementation of the new Apache Lucene-based Vespa Linguistics component.
The primary goal of the Lucene linguistics is to make it easier to migrate existing search applications from Lucene-based search engines to Vespa.
Also, it can help improve your current Vespa applications.
More on that next!


Even though these days all the rage is about the modern neural-vector-embeddings-based retrieval (or at least that was the sentiment in the Berlin Buzzwords conference), the traditional lexical search is not going anywhere:
search applications still need tricks like filtering, faceting, phrase matching, paging, etc.
Vespa is well suited to leverage both traditional and modern techniques.

At Vinted we were working on the search application migration from Elasticsearch to Vespa.
The application over the years has grown to support multiple languages and for each we have crafted custom Elasticsearch analyzers with dictionaries for synonyms, stopwords, etc.
Vespa has a different approach towards lexical search than Elasticsearch, and we were researching ways to transfer all that accumulated knowledge without doing the “Big Bang” migration.

And here comes a part with a chat with the legend himself, Jo Kristian Bergum, on the sunny roof terrace at the Berlin Buzzwords 2023 conference.
Among other things, I’ve asked if it is technically possible to implement a Vespa Linguistics component on top of the Apache Lucene library.
With Jo’s encouragement, I’ve got to work and the same evening there was a working proof of concept.
This was huge!
It gave a promise that it is possible to convert almost any Elasticsearch analyzer into the Vespa Linguistics configuration and in this way solve one of the toughest problems for the migration project.

Show me the code!

In case you just want to get started with the Lucene Linguistics the easiest way is to explore the demo apps.
There are 4 apps:

  • Minimal: example of the bare minimum configuration that is needed to set up Lucene linguistics;
  • Advanced: demonstrates the “usual” things that can be expected when leveraging Lucene linguistics.
  • Going-Crazy: plenty of contrived features that real-world apps might require.
  • Non-Java: an app without Java code.

To learn more: read the documentation.


The scope of the Lucene linguistics component is ONLY the tokenization of the text.
Tokenization removes any non-word characters, and splits the string into tokens on each word boundary, e.g.:

“Vespa is awesome!” => [“vespa”, “is”, “awesome”]

In the Lucene land, the Analyzer class is responsible for the tokenization.
So, the core idea for Lucene linguistics is to implement the Vespa Tokenizer interface that wraps a configurable Lucene Analyzer.

For building a configurable Lucene Analyzer there is a handy class called CustomAnalyzer.
The CustomAnalyzer.Builder has convenient methods for configuring Lucene text analysis components such as CharFilters, Tokenizers, and TokenFilters into an Analyzer.
It can be done by calling methods with signatures:

public Builder addCharFilter(String name, Map<String, String> params)
public Builder withTokenizer(String name, Map<String, String> params)
public Builder addTokenFilter(String name, Map<String, String> params)

All the parameters are of type String, so they can easily be stored in a configuration file!

When it comes to discovery of the text analysis components, it is done using the Java Service Provider Interface (SPI).
In practical terms, this means that when components are prepared in a certain way then they become available without explicit coding! You can think of it as plugins.

The trickiest bit was to configure Vespa to load resource files required for the Lucene components.
Luckily, there is a CustomAnalyzer.Builder factory method that accepts a Path parameter.
Even more luck comes from the fact that Path is the type exposed by the Vespa configuration definition language!
With all that in place, it was possible to load resource files from the application package just by providing a relative path to files.

All that was nice, but it made simple application packages more complicated than they needed to be:
a directory with at least a dummy file was required!
The requirement stemmed from the fact that in Vespa configuration parameters of type Path were mandatory.
This means that if your component can use a parameter of the Path type, it must be used.
Clearly, that requirement can be a bit too strict.

Luckily, the Vespa team quickly implemented a change that allowed for configuration of Path type to be declared optional.
For the Lucene linguistics it meant 2 things:

  1. Base component configuration became simpler.
  2. When no path is set up, the CustomAnalyzer loads resource files from the classpath of the application package, i.e. even more flexibility in where to put resource files.

To wrap it up:
Lucene Linguistics accepts a configuration in which custom Lucene analysis components can be fully configured.

Languages and analyzers

The Lucene linguistics supports 40 languages out-of-the-box.
To customize the way the text is analyzed there are 2 options:

  1. Configure the text analysis in services.xml.
  2. Extend a Lucene Analyzer class in your application package and register it as a Component.

In case there is no analyzer set up, then the Lucene StandardAnalyzer is used.

Lucene linguistics component configuration

It is possible to configure Lucene linguistics directly in the services.xml file.
This option works best if you’re already knowledgeable with Lucene text analysis components.
A configuration for the English language could look something like this:

<component id="linguistics"
  <config name="">
      <item key="en">
              <item key="words">en/stopwords.txt</item>
              <item key="ignoreCase">true</item>

The above analyzer uses the standard tokenizer, then stop token filter loads stopwords from the en/stopwords.txt file that must be placed in your application package under the linguistics directory; and then the englishMinimalStem is used to stem tokens.

Component registry

The Lucene linguistics takes in a ComponentRegistry of the Analyzer class.
This option works best for projects that contain custom Java code because your IDE will help you build an Analyzer instance.
Also, JUnit is your friend when it comes to testing.

In the example below, the SimpleAnalyzer class coming with Lucene is wrapped as a component and set to be used for the English language.

<component id="en"
           bundle="my-vespa-app" />

Mental model

With that many options using Lucene linguistics might seem a bit complicated.
However, the mental model is simple: priority for conflict resolution.
The priority of the analyzers in the descending order is:

  1. Lucene linguistics component configuration;
  2. Component that extend the Lucene Analyzer class;
  3. Default analyzers per language;
  4. StandardAnalyzer.

This means that e.g. if both a configuration and a component are specified for a language, then an analyzer from the configuration is used because it has a higher priority.

Asymmetric tokenization

Going against suggestions you can achieve an asymmetric tokenization for some language.
The trick is to, e.g. index with stemming turned on and query with stemming turned off.
Under the hood a pair of any two Lucene analyzers can do the job.
However, it becomes your problem to set up analyzers that produce matching tokens.

Differences from Elasticsearch

Even though Lucene does the text analysis, not everything that you do in Elasticsearch is easily translatable to the Lucene Linguistics.
E.g. The multiplex token filter is just not available in Lucene.
This means that you have to implement that token filter yourself (probably by looking into how Elasticsearch implemented it here).

However, Vespa has advantages over Elasticsearch when leveraging Lucene text analysis.
The big one is that you configure and deploy linguistics components with your application package.
This is a lot more flexible than maintaining an Elasticsearch plugin.
Let’s consider an example: a custom stemmer.

In Elasticsearch land you either create a plugin or (if the stemmer is generic enough) you can try to contribute it to Apache Lucene (or Elasticsearch itself), so that it transitively comes with Elasticsearch in the future.
Maintaining Elasticsearch plugins is a pain because it needs to be built for each and every Elasticsearch version, and then a custom installation script is needed in both production and in development setups.
Also, what if you run Elasticsearch as a managed service in the cloud where custom plugins are not supported at all?

In Vespa you can do the implementation directly in your application package.
Nothing special needs to be done for deployment.
No worries (fingers-crossed) for Vespa version changes.
If your component needs to be used in many Vespa applications, your options are:

  1. Deploy your component into some maven repository
  2. Commit the prebuild bundle file into each application under the /components directory.
    Yeah, that sounds exactly how you do with regular Java applications, and it is.
    Vespa Cloud also has no problems running your application package with a custom stemmer.


With the new Lucene-based Linguistics component Vespa expands its capabilities for lexical search by reaching into the vast Apache Lucene ecosystem.
Also, it is worth mentioning that people experienced with other Lucene-based search engines such as Elasticsearch or Solr, should feel right at home pretty quickly.
The fact that the toolset and the skill-set are largely transferable lowers the barrier of adopting Vespa.
Moreover, given that the underlying text analysis technology is the same makes migration of the text analysis process to Vespa mostly a mechanical translation task.
Give it a try!

Vespa Newsletter, October 2023 | Vespa Blog

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

First, we are happy to announce the improved search UI at!
AI-generated suggestions, paragraph indexing with hybrid ranking, results-based AI-generated abstract (RAG),
and original formatting in search results.
We hope this lets you find the right answer quicker and explore the rich Vespa feature space more easily –
please let us know and get started with queries,
like how to configure two-phased ranking.
And even better; the application itself is open source, so you can see for yourself how you could do something similar –
read more in the blog post.

In the previous update,
we mentioned multilingual models, more control over ANN queries, mapped tensors in queries,
and multiple new features in pyvespa and the Vespa CLI.
Today, we’re excited to share the following updates: is its own company!

We have spun out the team as a separate company.
This will let us provide the community with more features even faster,
and help more companies run their Vespa applications cost-effectively
and with high quality on Vespa Cloud –
read more in the announcement.
Join us at,
and please let us know what you want from us in the future.

Vespa Cloud Enclave – Bring your own cloud

Vespa Cloud Enclave lets Vespa Cloud applications in AWS and GCP run in your cloud account/project
while everything is still fully managed by Vespa Cloud’s automation with access to all Vespa Cloud features.
While this adds some administration overhead,
it lets you keep all data within resources controlled by your company at all times,
which is often a requirement for enterprises dealing with sensitive data.
Read more.

Lucene Linguistics integration

The Lucene Linguistics component, added in #27929,
lets you replace the default linguistics module in Vespa with Lucene’s, supporting 40 languages.
This can make it easier to migrate existing text search applications from Lucene-based engines to Vespa
by keeping the linguistics treatment unchanged.

Lucene Linguistics is a contribution to Vespa from Dainius Jocas in the Vinted team –
read the announcement in the blog post for more details.
Also, see their own blog post
for how they adopted Vespa for serving personalized second-hand fashion recommendations at Vinted.

Much faster fuzzy matching

Fuzzy matching lets you match attribute field values within a given edit distance from the value given in a query:

select * from music where myArtistAttribute contains
                    ({maxEditDistance: 1}fuzzy("the weekend"))

In Vespa 8.238 we made optimizations to our fuzzy search implementation when matching with
maxEditDistance of 1 or 2.
Fuzzy searching would previously run a linear scan of all dictionary terms.
We now use Deterministic Finite Automata (DFA) to generate the next possible successor term to any mismatching candidate term,
allowing us to skip all terms between the two immediately.
This enables sublinear dictionary matching.
To avoid having to build a DFA for each query term explicitly,
we use a custom lookup table-oriented implementation based on the paper Fast string correction with Levenshtein automata (2002)
by Klaus U. Schulz and Stoyan Mihov.

Internal performance testing on a dataset derived from English Wikipedia (approx 250K unique terms)
shows improvements for pure fuzzy searches between 10x-40x.
For fuzzy searches combined with filters, we have seen up to 180x speedup.

Cluster-specific model-serving settings

You can deploy machine-learned models for ranking and inference both in container and content clusters,
and container clusters optionally let you run models on GPUs.
In larger applications, you often want to set up multiple clusters to be able to size for different workloads separately.

Vespa clusters overview

From Vespa 8.220, you can configure GPU model inference settings per container cluster:

<container id="c1" version="1.0">
        <model name="mul">

Instrumenting indexing performance

We have made it easier to find bottlenecks in the write path with a new set of metrics:


If .saturation is close to 1.0 and higher than .utilization, it indicates that worker threads are a bottleneck.
You can then use the Vespa Cloud Console searchnode API
and the documentation
to spot the limiting factor in fully utilizing the CPU when feeding:

searchnode API

Automated BM25 reconfiguration

Vespa has had BM25 ranking for a long time:

field content type string {
    indexing: index | summary
    index: enable-bm25

However, setting enable-bm25 on a field with already indexed data required a manual procedure for the index setting to take effect.
Since Vespa 8.241.13, this will happen as automated reindexing in the background like with other schema changes;
see the example
for how to observe the reindexing progress after enabling the field.

Minor feature improvements

  • The deploy feature in the Vespa CLI is improved with better deployment status tracking,
    as well as other minor changes for ease-of-use.
  • Nested grouping in query results, when grouping over an array of struct or maps,
    is scoped to preserve structure/order in the lower level from Vespa 8.223.
  • Document summaries can now inherit multiple
    other summary classes – since Vespa 8.250.

Performance improvements

  • In Vespa 8.220 we have changed how small allocations (under 128 kB)
    are handled for paged attributes (attributes on disk).
    Instead of mmapping each allocation, they share mmapped areas of 1 MB.
    This greatly reduces the number of mmapped areas used by vespa-proton-bin.
  • Vespa uses ONNXRuntime for model inference.
    Since Vespa 8.250, this supports bfloat16 and float16 as datatypes for ONNX models.
  • Custom components deployed to the Vespa container can use URLs to point to resources to be loaded at configuration time.
    From Vespa 8.216, the content will be cached on the nodes that need it.
    The cache saves bandwidth on subsequent deployments –
    see adding-files-to-the-component-configuration.

Did you know: Production deployment with code diff details

Tracking changes to the application through deployment is easy using the Vespa Cloud Console.
The source link is linked to the repository if added in the deploy command:

Deploy with diff

Add the link of the code diff deploy-time using source-url:

vespa prod deploy --source-url

Find more details and how to automate in

Blog posts since last newsletter

Thanks for reading! Try out Vespa by
deploying an application for free to Vespa Cloud
or install and run it yourself.

Announcing our series A funding

A month ago we announced that Vespa is finally
its own company.
Today we’re announcing a $31 million
investment from Blossom Capital.

The spin-out from Yahoo gave us the ability to focus on serving and growing the entire Vespa ecosystem,
and this investment gives us the financial muscle to invest in building a complete platform for
all use cases involving big data and AI online, and serving large and small customers on our
cloud solution.

When we met the Blossom team, we quickly realized they were a great partner for us, with their deep
understanding of what it takes to build a world-class global tech company, and their exceedingly fast
and efficient decision-making (we’re kind of into speed).

Read more in this
TechCrunch article.

Yahoo Mail turns to Vespa to do RAG at scale

Yahoo Mail is one of the largest mail providers in the world. Now they’re also taking a shot at being the most
innovative with their recent release of AI-driven features which lets you
ask questions of your mailbox
and tell it to do things for you.

At the core of these features you find 1) a large language model which can understand and generate text,
and 2) a retrieval system that finds the relevant information in your inbox to feed into this model,
typically by a semantic search using vector embeddings. These two components together with the orchestration
which combines them nowadays goes under the moniker RAG – Retrieval Augmented Generation.

We’re in the middle – or at the feeble start? – of a massive boom of this technology, and so there’s no
lack of tools that allows you to build your own RAG demos. However, Yahoo’s aim is to make this work for all of
their users while being so cost-effective that it can still be offered for free, and for this they have
naturally turned to Vespa is the only vector database technology that:

  • lets you implement a cost-effective RAG system using personal data,
  • support vector embeddings, structured data and full text in the same queries and ranking functions, and
  • is proven to operate effectively, reliably storing and searching trillions of documents.

Making interaction with email an order of magnitude simpler and faster for this many people is a massively
meaningful endeavor, and we’re excited to be helping the team as they build the new intelligent Yahoo Mail,
and to see what features they’ll be adding next. To see for yourself, you can sign up at
Yahoo Mail levelup,
and if you want to build your own production scale RAG system, we recommend our fully open source
documentation search RAG application as a starting point.