The basics of Vespa applications

Distributed computation over large data sets in real-time — what we call big data serving — is a complex task. We have worked hard to hide this complexity to make it as easy as possible to create your own production quality Vespa application.
The quick-start guides
take you through the steps of getting Vespa up and running, deploying a basic application, writing data and issuing some queries to it, but without room for explanation.
Here, we’ll explain the basics of creating your own Vespa application.
The blog search and recommendation tutorial
covers these topics in full detail with hands-on instructions.
Update 2021-05-20: Blog tutorials are replaced by the
News search and recommendation tutorial.

Application packages

The configuration, components and models that make up an application to be run by Vespa are contained in an application package. The application package:

  • Defines which clusters and services should run and how they should be configured
  • Contains the document types the application will use
  • Contains the ranking models to execute
  • Configures how data will be processed during feeding and indexing
  • Configures how queries will be pre- and post-processed

The three mandatory parts of the application specification are the search definition, the services specification, and the hosts specification, each of which has its own file in the application package.
This is enough to set up a basic production-ready Vespa application, such as the
sample application.
Most applications, however, are much larger and may contain machine-learned ranking models and application-specific Java components that perform tasks such as query enrichment and post-search processing.

The schema definition

Data stored in Vespa is represented as a set of documents of a type defined in the application package. An application can have multiple document types. Each search definition describes one such document type: it lists the name and data type of each field found in the document, and configures the behaviour of these fields, for example whether field values are kept in memory or can be stored on disk, and whether they should be indexed. It can also contain ranking profiles, which are used to select the most relevant documents among the set of matches for a given query, and it specifies which fields to return.

The services definition

A Vespa application consists of a set of services, such as stateless query and document processing containers and stateful content clusters. Which services to run, where to run those services and the configuration of those services are all set up in services.xml. This includes the search endpoint(s), the document feeding API, the content cluster, and how documents are stored and searched.

The hosts definition

The deployment specification hosts.xml contains a list of all hosts that are part of the application, with an alias for each of them. The aliases are used in services.xml to define which services are to be started on which nodes.

Deploying applications

After the application package has been constructed, it is deployed using vespa-deploy. This uploads the package to the configuration cluster and pushes the configuration to all nodes. After this, the Vespa cluster is now configured and ready for use.

One of the nice features is that new configurations are loaded without service disruption. When a new application package is deployed, the configuration cluster pushes the new generation to all the nodes defined in the application, which consume and apply the new configuration without restarting the services. A few rare changes do require a restart; the vespa-deploy command will notify you when this is needed.

Writing data to Vespa

One of the required files when setting up a Vespa application is the search definition. This file (or files) contains a document definition which defines the fields and their data types for each document type. Data is written to Vespa using Vespa’s JSON document format. The data in this format must match the search definition for the document type.

The process of writing data to Vespa is called feeding, and there are multiple tools that can be used to feed data to Vespa for various use cases. For instance there is a REST API for smaller updates and a Java client that can be embedded into other applications.
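As a minimal sketch of what a single write through the REST API can look like, the Python below builds the URL and JSON body for one document without sending anything. The endpoint http://localhost:8080, the namespace mynamespace, the document id and the field values are assumptions made for illustration; the fields must match the search definition of your own document type.

```python
import json
from urllib.parse import quote


def build_put_request(endpoint, namespace, doc_type, doc_id, fields):
    """Return (url, body) for writing one document through /document/v1."""
    url = "{}/document/v1/{}/{}/docid/{}".format(
        endpoint.rstrip("/"),
        quote(namespace, safe=""),
        quote(doc_type, safe=""),
        quote(doc_id, safe=""),
    )
    # The document fields are wrapped in a top-level "fields" object.
    body = json.dumps({"fields": fields})
    return url, body


# Hypothetical endpoint, namespace, id and values, for illustration only.
url, body = build_put_request(
    "http://localhost:8080",
    "mynamespace",
    "music",
    "a-head-full-of-dreams",
    {"artist": "Coldplay", "title": "A Head Full of Dreams"},
)
print(url)
print(body)
```

Sending the body as an HTTP POST to the printed URL, with any HTTP client, writes the document.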

An important concept in writing data to Vespa is that of document processors. These processors can be chained together to form a processing pipeline to process each document before indexing. This is useful for many use cases, including enrichment by pulling in relevant data from other sources.

Querying Vespa

If you know the id of the document you want, you can fetch it directly using the document API. However, with Vespa you are usually more interested in searching for relevant documents given some query.

Basic querying in Vespa is done through YQL which is an SQL-like language. An example is:

select title,isbn from music where artist contains "kygo"

Here we select the fields “title” and “isbn” from document type “music” where the field called “artist” contains the string “kygo”. Wildcards (*) are supported in the result fields and the document types to return all available fields in all defined document types.
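To make the HTTP side concrete, here is a small sketch that URL-encodes the YQL statement above into a request to the search endpoint. The host and port (http://localhost:8080) are assumptions for a local setup; the helper name build_query_url is made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(endpoint, yql):
    """Return a GET URL that passes the YQL statement as the yql parameter."""
    return "{}/search/?{}".format(endpoint.rstrip("/"), urlencode({"yql": yql}))


yql = 'select title,isbn from music where artist contains "kygo"'
# Hypothetical local endpoint, for illustration only.
query_url = build_query_url("http://localhost:8080", yql)
print(query_url)

# Decoding the URL again gives back the original statement.
decoded = parse_qs(urlsplit(query_url).query)["yql"][0]
print(decoded)
```

Fetching the printed URL with any HTTP client returns the matching documents as JSON.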

A query like the one above is sent to Vespa over HTTP. Many applications choose to build the queries in Java components running inside Vespa instead. Such components are called searchers, and can be used to build or modify queries, run multiple queries for each incoming request, and filter and modify results. Similar to the document processor chains, you can set up chains of searchers. Vespa contains a set of default searchers which perform various common operations such as stemming and federation to multiple content clusters.

Ranking models

Ranking executes a ranking expression specified in the search definition on all the documents matching a query. When returning specific documents for a query, those with the highest rank score are returned.

A ranking expression is a mathematical function over features (named values).

Features are either sent with the query, attributes of the document, constants in the application package or features computed by Vespa from both the query and document – example:

rank-profile popularity inherits default {
    first-phase {
        expression: 0.7 * nativeRank(title, description) + 0.3 * attribute(popularity)
    }
}

Here, each document is ranked by the nativeRank function but boosted by a popularity score. This score can be updated at regular intervals, for instance from user feedback, using partial document updates from some external system such as a Hadoop cluster.

In real applications ranking expressions often get much more complicated than this.

For example, a recommendation application may use a deep neural net to compute a recommendation score, or a search application may use a machine-learned gradient boosted decision tree. To support such complex models, Vespa allows ranking expressions to compute over tensors in addition to scalars. This makes it possible to work effectively with large models and parameter spaces.

As complex ranking models can be expensive to compute over many documents, it is often a good idea to use a cheaper function to find good candidates and then rank only those using the full model. To do this you can configure both a first-phase and second-phase ranking expression, where the second-phase function is only computed on the best candidate documents.
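The idea of phased ranking can be sketched in a few lines of plain Python. This is an illustration of the concept only, not how Vespa evaluates ranking expressions: the field names native_rank and model_score, the stand-in "expensive" function and the rerank_count parameter are all assumptions made for the example.

```python
def first_phase(doc):
    # Cheap score over every match, mirroring the popularity profile above.
    return 0.7 * doc["native_rank"] + 0.3 * doc["popularity"]


def second_phase(doc):
    # Stand-in for an expensive model; here just the cheap score plus a model output.
    return first_phase(doc) + doc["model_score"]


def rank(docs, rerank_count):
    """Order all docs by the cheap score, then re-order only the best rerank_count."""
    by_first = sorted(docs, key=first_phase, reverse=True)
    head = sorted(by_first[:rerank_count], key=second_phase, reverse=True)
    return head + by_first[rerank_count:]


docs = [
    {"id": "a", "native_rank": 0.9, "popularity": 0.1, "model_score": 0.0},
    {"id": "b", "native_rank": 0.5, "popularity": 0.9, "model_score": 0.5},
    {"id": "c", "native_rank": 0.2, "popularity": 0.2, "model_score": 2.0},
]
ranked_ids = [doc["id"] for doc in rank(docs, rerank_count=2)]
print(ranked_ids)
```

Document c has the best model score but never reaches the second phase, which is the trade-off: the first-phase function has to be good enough to let the right candidates through.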

Grouping and aggregation

In addition to returning the set of results ordered by a relevance score, Vespa can group and aggregate data over all the documents selected by a query. Common use cases include:

  • Group documents by unique value of some field.
  • Group documents by time and date, for instance sort bug tickets by date of creation into the buckets Today, Past Week, Past Month, Past Year, and Everything else.
  • Calculate the minimum/maximum/average value for a given field.

Groups can be nested arbitrarily and multiple groupings and aggregations can be executed in the same query.
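What grouping and aggregation compute can be illustrated with a small plain-Python sketch. It only shows the semantics on an in-memory list; it is not Vespa’s grouping language, and the field names artist and year are assumptions for the example.

```python
from collections import defaultdict


def group_and_aggregate(hits, group_field, value_field):
    """Group hits by group_field and aggregate value_field within each group."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[group_field]].append(hit[value_field])
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


hits = [
    {"artist": "kygo", "year": 2016},
    {"artist": "kygo", "year": 2020},
    {"artist": "coldplay", "year": 2015},
]
summary = group_and_aggregate(hits, "artist", "year")
print(summary)
```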

More information

You should now have a basic understanding of the core concepts in building Vespa applications.
To try out these core features in practice, head on over to the
blog search and recommendation tutorial.
Update 2021-05-20: Blog tutorials are replaced by the
News search and recommendation tutorial.

We’ll post some more in-depth blog posts with concrete examples soon.

Fine-tuning a BERT model for search applications

Thiago Martins

Vespa Data Scientist

How to ensure training and serving encoding compatibility

There are cases where the inputs to your Transformer model are pairs of sentences, but you want to process each sentence of the pair at different times due to your application’s nature.

The search use case

Search applications are one example. They involve a large collection of documents that can be pre-processed and stored before a search action is required. On the other hand, a query triggers a search action, and we can only process it in real-time. Search apps’ goal is to return the most relevant documents to the query as quickly as possible. By applying the tokenizer to the documents as soon as we feed them to the application, we only need to tokenize the query when a search action is required, saving time.

In addition to applying the tokenizer at different times, you also want to retain control over how your pair of sentences is encoded. For search, you might want to have a joint input vector of length 128 where the query, which is usually smaller than the document, contributes 32 tokens while the document can take up to 96 tokens.

Training and serving compatibility

When training a Transformer model for search, you want to ensure that the training data will follow the same pattern used by the search engine serving the final model. I have written a blog post on how to get started with BERT model fine-tuning using the transformers library. This piece will adapt the training routine with a custom encoding based on two separate tokenizers to reproduce how a Vespa application would serve the model once deployed.

Create independent BERT encodings

The only change required is simple but essential. In my previous post, we discussed the vanilla case where we simply applied the tokenizer directly to the pairs of queries and documents.

from transformers import BertTokenizerFast

model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

train_encodings = tokenizer(train_queries, train_docs, truncation=True, padding='max_length', max_length=128)
val_encodings = tokenizer(val_queries, val_docs, truncation=True, padding='max_length', max_length=128)

In the search case, we create the create_bert_encodings function that will apply two different tokenizers, one for the query and the other for the document. In addition to allowing for different query and document max_length, we also need to set add_special_tokens=False and not use padding, as those need to be included by our custom code when joining the tokens generated by the tokenizer.

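def create_bert_encodings(queries, docs, tokenizer, query_input_size, doc_input_size):
    # Tokenize queries and documents separately, leaving room for the special
    # tokens added below: [CLS] and [SEP] for the query, one [SEP] for the document.
    queries_encodings = tokenizer(
        queries, truncation=True, max_length=query_input_size - 2, add_special_tokens=False
    )
    docs_encodings = tokenizer(
        docs, truncation=True, max_length=doc_input_size - 1, add_special_tokens=False
    )

    # Special token ids are read from the tokenizer. 0 marks padded positions
    # (assumed to equal the tokenizer's padding id, as it does for BERT).
    TOKEN_NONE = 0
    TOKEN_CLS = tokenizer.cls_token_id
    TOKEN_SEP = tokenizer.sep_token_id
    max_length = query_input_size + doc_input_size  # 128 for a 32/96 split

    input_ids = []
    token_type_ids = []
    attention_mask = []
    for query_input_ids, doc_input_ids in zip(queries_encodings["input_ids"], docs_encodings["input_ids"]):
        # create input id
        input_id = [TOKEN_CLS] + query_input_ids + [TOKEN_SEP] + doc_input_ids + [TOKEN_SEP]
        number_tokens = len(input_id)
        padding_length = max(max_length - number_tokens, 0)
        input_id = input_id + [TOKEN_NONE] * padding_length
        input_ids.append(input_id)
        # create token id: 0 for the query segment, 1 for the document segment
        token_type_id = [0] * len([TOKEN_CLS] + query_input_ids + [TOKEN_SEP]) + [1] * len(doc_input_ids + [TOKEN_SEP]) + [TOKEN_NONE] * padding_length
        token_type_ids.append(token_type_id)
        # create attention_mask: 1 for real tokens, 0 for padding
        attention_mask.append([1] * number_tokens + [TOKEN_NONE] * padding_length)

    encodings = {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask
    }
    return encodings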
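The max_length values of query_input_size-2 and doc_input_size-1 follow from the special tokens that the joined input needs. A quick sketch of the arithmetic, assuming the 32/96 split mentioned earlier (the helper name token_layout is made up for this check):

```python
def token_layout(query_input_size, doc_input_size):
    """Return (max query tokens, max doc tokens, total joined length)."""
    max_query_tokens = query_input_size - 2  # room for [CLS] and the first [SEP]
    max_doc_tokens = doc_input_size - 1      # room for the final [SEP]
    # [CLS] + query + [SEP] + document + [SEP]
    total = 1 + max_query_tokens + 1 + max_doc_tokens + 1
    return max_query_tokens, max_doc_tokens, total


layout = token_layout(32, 96)
print(layout)
```

With these sizes the query can use up to 30 tokens and the document up to 95, and the joined input never exceeds 128 positions.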

We then create the train_encodings and val_encodings required by the training routine. Everything else on the training routine works just the same.

from transformers import BertTokenizerFast

model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

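train_encodings = create_bert_encodings(
    queries=train_queries, docs=train_docs, tokenizer=tokenizer,
    query_input_size=32, doc_input_size=96
)

val_encodings = create_bert_encodings(
    queries=val_queries, docs=val_docs, tokenizer=tokenizer,
    query_input_size=32, doc_input_size=96
)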

Conclusion and future work

Training a model to deploy in a search application requires us to ensure that the training encodings are compatible with the encodings used at serving time. We generate document encodings offline when feeding the documents to the search engine, and create the query encoding at run-time when the query arrives. It is often relevant to use different maximum lengths for queries and documents, and other possible configurations.

We showed how to customize BERT model encodings to ensure this training and serving compatibility. However, a better approach is to build tools that bridge the gap between training and serving by allowing users to request training data that respects by default the encodings used when serving the model. pyvespa will include such integration to make it easier for Vespa users to train BERT models without having to adjust the encoding generation manually as we did above.

Using approximate nearest neighbor search in real world applications

From text search and recommendation to ads and online dating, ANN search rarely works in isolation

Anything can be represented by a list of numbers.

For instance, text can be represented by a list of numbers describing the
text’s meaning. Images can be represented by the objects they contain. Users of
a system can be represented by their interests and preferences. Even time-based
entities such as video, sound, or user interactions can be represented by a
single list of numbers.

These vector representations describe content or meaning: the original,
containing thousands of characters or pixels, is compressed to a much smaller
representation of a few hundred numbers.

Most often, we are interested in finding the most similar vectors. This is
called k-nearest neighbor (KNN) search or similarity search and has all kinds
of useful applications. Examples here are model-free classification, pattern
recognition, collaborative filtering for recommendation, and data compression,
to name but a few. We’ll see some more examples later in this post.

However, a nearest neighbor search is only a part of the process for many
applications. For applications doing search and recommendation, the potential
candidates from the KNN search are often combined with other facets of the
query or request, such as some form of filtering, to refine the results.

This can severely limit the quality of the end result, as post-filtering can
prevent otherwise relevant results from surfacing. The solution is to integrate
the nearest neighbor search with filtering. However, most libraries for nearest
neighbor search work in isolation and do not support this. To my knowledge, the
only open-source platform that does is Vespa.

In this post, we’ll take a closer look at approximate nearest neighbor search, explore
some real cases combining this with filtering, and delve into how Vespa
solves this problem.

Finding the (approximate) nearest neighbors

The representations can be visualized as points in a high-dimension space, even
though it’s kind of difficult to envision a space with hundreds of dimensions.
This allows us to think of these points as vectors, sometimes called thought
vectors, and we can use various distance metrics to measure the likeness or
similarity between them. Examples are the dot (or inner) product, cosine angle,
or euclidean distance.
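For reference, the three measures mentioned above can be written out in plain Python. This is a sketch for small lists of floats; a real application would typically use a numerical library.

```python
import math


def dot(a, b):
    """Dot (inner) product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means the same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0]
b = [0.0, 2.0]
c = [3.0, 4.0]
print(dot(a, c), cosine_similarity(a, b), euclidean_distance([0.0, 0.0], c))
```

Note that the dot product and cosine are similarities (higher is closer), while the euclidean measure is a distance (lower is closer).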

The 5 nearest neighbors

Finding the nearest neighbors of a point is reasonably straight-forward: just
compute the similarity using the distance metric between the point and all
other points. Unfortunately, this brute-force approach doesn’t scale well,
particularly in time-critical settings such as online serving, where you have a
large number of points to consider.
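The brute-force approach is short enough to write down in full. A minimal sketch using euclidean distance (math.dist requires Python 3.8 or later); the sample points are made up:

```python
import heapq
import math


def brute_force_knn(points, query, k):
    """Return the indexes of the k points closest to query, nearest first."""
    # Every point is compared with the query, so the cost grows linearly
    # with the number of points.
    return heapq.nsmallest(
        k, range(len(points)), key=lambda i: math.dist(points[i], query)
    )


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0), (9.0, 9.0)]
nearest = brute_force_knn(points, query=(0.0, 0.1), k=3)
print(nearest)
```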

There are no known exact methods for finding nearest neighbors efficiently. As
both the number of points and the number of dimensions increase, we
fall victim to the curse of dimensionality. In high dimensions, all points are
almost equally distant from each other. A good enough solution for many
applications is to trade accuracy for efficiency. In approximate nearest
neighbor (ANN) search, we build index structures that narrow down the search space.
The implicit neighborhoods in such indexes also help reduce the problem of high
dimensions.
You can roughly divide the approaches used for ANNs into whether or not they
can be implemented using an inverted index. The inverted index originates from
information retrieval and is comparable to the index often found at the back of
many books. This index points from a word (or term) to the documents containing it.
This can be used for ANNs as well. Using k-means clustering, one can cluster
all points and index them by which cluster they belong to. A related approach
is product quantization (and its relatives), which splits the vectors into
products of lower-dimensional spaces. Yet another is locality-sensitive
hashing, which uses hash functions to group similar vectors together. These
approaches index the centroids or buckets.
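The cluster-based variant can be sketched as follows. To keep the example short and deterministic, the centroids are given up front rather than learned with k-means, and only the single nearest cluster is searched; both are simplifying assumptions, and the data is made up.

```python
import math
from collections import defaultdict


def nearest_centroid(centroids, vector):
    """Return the index of the centroid closest to vector."""
    return min(range(len(centroids)), key=lambda c: math.dist(centroids[c], vector))


def build_index(centroids, points):
    """Inverted index: centroid id -> ids of the points assigned to it."""
    index = defaultdict(list)
    for point_id, vector in enumerate(points):
        index[nearest_centroid(centroids, vector)].append(point_id)
    return index


def ann_search(centroids, index, points, query, k):
    """Search only the cluster whose centroid is closest to the query."""
    candidates = index.get(nearest_centroid(centroids, query), [])
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]


centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (10.0, 11.0), (4.0, 4.0)]
index = build_index(centroids, points)

print(ann_search(centroids, index, points, (8.0, 8.0), k=2))
# The query below lands in the second cluster, so point 4, which is its true
# nearest neighbor but belongs to the first cluster, is missed.
print(ann_search(centroids, index, points, (5.2, 5.2), k=1))
```

The second query shows the accuracy that is traded away: only a fraction of the points are compared, and a neighbor just across a cluster boundary can be lost.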

A method that is not compatible with inverted indexes is HNSW (hierarchical
navigable small world). HNSW is based on graph structures, is efficient,
and lets the graph be incrementally built at runtime. This is in contrast to
most other methods that require offline, batch-oriented index building.

As approximate nearest neighbor search has many applications, quite a few tools
and libraries exist.

Nearest neighbors in search and recommendation

In many applications, such as search and recommendation, the results of the
nearest neighbor search are combined with additional facets of the request. In
this section, we’ll provide some examples of when this becomes problematic.

Only 2 of the 5 nearest neighbors remain after filtering

Modern text search increasingly uses representation vectors, often called text
embeddings or embedding vectors. Word2vec was an early example. More recently,
sophisticated language understanding models such as BERT and other
Transformer-based models are increasingly used. These are capable of assigning
different representations for a word depending upon the context. For text
search, the current state-of-the-art uses different models to encode query
vectors and document vectors. These representations are trained so that the
inner product of these vectors is maximized for relevant results.

Using embedding vectors in text search is often called semantic search. For
many text search applications, we would like to combine this semantic search
with other filters. For instance, we can combine a query for “approximate
nearest neighbor” with a date filter such as “2020”. The naive approach here is
to use one of the ANN libraries mentioned above to perform a nearest neighbor
search and then filter out the results.

However, this is problematic. Imagine that 1000 documents are relevant to the
query “approximate nearest neighbor”, with 100 added each year over the past 10
years. Assume they all are approximately equally likely to be retrieved from
the ANN. So, retrieving the top 100 will result in about 10 documents from each
year. Applying the filter “2020” will result in only 10 documents. That means
the other 90 relevant documents from 2020 are missed.
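The arithmetic of this example, written out under the same assumption that the ANN retrieves relevant documents evenly across years (the function name is made up for the sketch):

```python
def post_filter_outcome(relevant_per_year, years, top_k):
    """Return (retrieved, missed, recall) for a single-year filter applied after retrieval."""
    # Assumption from the text: the top_k results are spread evenly over the years.
    retrieved_from_year = top_k // years
    missed = relevant_per_year - retrieved_from_year
    recall = retrieved_from_year / relevant_per_year
    return retrieved_from_year, missed, recall


outcome = post_filter_outcome(relevant_per_year=100, years=10, top_k=100)
print(outcome)
```

Only a tenth of the relevant documents from the requested year survive the filter, even though the ANN search itself did nothing wrong.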


Recommender systems, such as YouTube and TikTok, are built to provide
continually interesting content to all users. As such, it’s essential to learn
the interests or preferences of the user. Such user profiles are represented by
one or more vectors, as are the items that should be recommended.

These vectors are often generated by using some form of collaborative
filtering. One method is matrix factorization, where the maximum inner product
is used as a distance function. Deep learning approaches have recently shown
great promise, trained explicitly for the distance function between the user
and item vector.

Recommendation systems employ filters to a great degree. Examples are filters
for age-appropriate content, NSFW labels, availability of content in various
regions due to distribution rights, and user-specified filters blocking certain
content. These are examples of direct filters. More indirect filters come in
the form of business rules such as diversity and de-duplication, which filters
out content that has already been recommended.

The problem of filtering is more evident for recommendation systems than for
text search. These filters’ quantity and strength lead to a greater probability
that items retrieved from the ANN search are filtered away. So, only a few of
the relevant items are actually recommended.

Serving ads

Ad serving systems work much like recommender systems. Given a user
profile and a context such as a search query or page content, the system should
provide an advertisement relevant to the user. The advertisements are stored
with advertiser-specific rules, for instance, who the ad or campaign should
target. One such rule is to not exceed the budget of the campaign.

These rules function as filters. Like with text search and recommendation, if
these filters are applied after the user-profile based retrieval, there is a
probability that an appropriate advertisement is not retrieved. This is
particularly important regarding the budget. Income is lost if there are no
results retrieved with an available spending budget.

Online dating

In the world of online dating, people have a set of preferences. These can be
binary such as gender, age range, location, height, and so on. Interests might
be less absolute, such as hiking, loves pets, traveling, and exercise. These
interests and preferences can be represented by a vector, and at least parts
can be compressed to a representation vector as well.

Suppose retrieval is based on an ANN over interests, and the preferences are
applied as a filter afterward. In that case, it’s clear why online dating is
hard. As we retrieve the best matches from the ANN, there is a significant
probability that all or most of these are filtered out, for instance, by
location or by profiles already seen.

Local search and recommendation is based on geographical location. Given
longitude and latitude coordinates, we can find places or businesses within
certain distances from a point: finding restaurants near a user is a typical
case. Imagine that we have the dining preferences of a user represented as a
vector. Likewise, all restaurants are represented by vectors. Then, by
performing an ANN search followed by a location filter, we could retrieve the
restaurants preferred by the user in their local area.

However, this would not work. Of all the restaurants in the world, only a small
fraction are close to the user. The location filter is much stronger than the
ANN retrieval. So with a high probability, no results would be produced at all.


The naive approach to the problems above is simply to request more candidates
from the ANN search. This obviously hurts performance, as the workload of both
the ANN and post-filtering increases. Besides, this is not guaranteed to work.
If you have a strong filter independent of the ANN, there is a real chance of
not producing any results at all. The local restaurant case is an example of
this, where the location is a strong filter independent of the user
profile.
The real solution here is to integrate the filters into the ANN search. Such an
algorithm would be able to reject candidates early that don’t pass the filters.
This effectively increases the search area from the query point dynamically
until enough candidates are found. This guarantees that the requested number of
candidates are produced.

Unfortunately, for most ANN libraries, this is not an option as they work in
isolation.

The 5 nearest neighbors with integrated filtering

Vespa is to my knowledge the only implementation of ANN that supports
integrated filtering. The implementation is based on a modified HNSW graph
algorithm, and innovates in 3 main areas:

  • Dynamic modification of the graph. Most ANN algorithms require the index to
    be built offline, but HNSW supports incremental building of the index. Vespa
    takes advantage of this and supports both adding and removing items in
    real-time while serving.
  • Multi-threaded indexing using lock-free data structures and copy-on-write
    semantics drastically increases the performance of building the index.
  • Metadata filtering modifies the algorithm to skip non-eligible candidates.

To support filtering, Vespa first evaluates the filters to create a list of
eligible candidates. During the ANN search, a point close to the query point is
selected and the graph is explored by following each node’s edge to its
neighbors. Candidates not in the eligibility list are skipped, and the search
continues until we have produced enough candidates.

There is a small problem here, however. If the eligibility list is small in
relation to the number of items in the graph, skipping occurs with a high
probability. This means that the algorithm needs to consider an exponentially
increasing number of candidates, slowing down the search significantly. To
solve this, Vespa switches over to a brute-force search when this occurs.
The result is an efficient ANN search when combined with filters.
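The two ideas described above, skipping non-eligible candidates during graph exploration and falling back to brute force when few items are eligible, can be sketched in plain Python. This is an illustration only and not Vespa’s implementation: it uses a flat neighbor graph rather than HNSW, a simple best-first traversal, and an arbitrary brute_force_ratio threshold of 0.2, all of which are assumptions made for the example.

```python
import heapq
import math


def filtered_search(vectors, graph, entry, query, eligible, k, brute_force_ratio=0.2):
    """Return up to k eligible ids, in the order the search reaches them."""

    def distance(i):
        return math.dist(vectors[i], query)

    # Few eligible items: scan them directly instead of skipping through the graph.
    if len(eligible) < brute_force_ratio * len(vectors):
        return sorted(eligible, key=distance)[:k]

    visited = {entry}
    frontier = [(distance(entry), entry)]  # min-heap ordered by distance to the query
    results = []
    while frontier and len(results) < k:
        _, node = heapq.heappop(frontier)
        if node in eligible:  # non-eligible nodes are traversed but never returned
            results.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(frontier, (distance(neighbor), neighbor))
    return results


# Ten points on a line, each linked to its immediate neighbors.
vectors = [(float(i),) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

via_graph = filtered_search(vectors, graph, 0, (0.2,), eligible={3, 7, 8}, k=2)
via_brute_force = filtered_search(vectors, graph, 0, (0.2,), eligible={9}, k=2)
print(via_graph, via_brute_force)
```

In the first call the search keeps expanding past the skipped nodes until two eligible items are found; in the second only one item is eligible, so the fallback scans the eligibility list directly.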

About Vespa

Vespa is an open-source platform for building applications that do real-time
data processing over large data sets. Designed to be highly performant and
web-scalable, it is used for such diverse tasks as search, personalization,
recommendation, ads, auto-complete, image and similarity search, comment
ranking, and more.

One of Vespa’s strengths is that it includes all the necessary features to
realize such applications. This means one does not need additional plugins or
external services. Thus, it offers a simplified path to deployment in
production without coping

Basic HTTP testing of Vespa applications

Jon M Venstad

Principal Vespa Engineer

Håkon Hallingstad

Principal Vespa Engineer

HTTP interfaces are the bread and butter for interacting with a Vespa application.
A typical system test of a Vespa application consists of a sequence of
HTTP requests, and corresponding assertions on the HTTP responses.
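The shape of such a test, a sequence of requests each paired with assertions on the response, can be sketched in Python. The step layout below is an assumption made for illustration and is not the file format read by the Vespa CLI; the fake transport stands in for a real HTTP client so the sketch runs on its own.

```python
def is_subset(expected, actual):
    """True if every key and value in expected is also present in actual."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


def run_steps(steps, send):
    """Run each step through send(method, uri, body) and assert on the response."""
    for step in steps:
        request, expected = step["request"], step["response"]
        code, body = send(request["method"], request["uri"], request.get("body"))
        assert code == expected["code"], step["name"]
        assert is_subset(expected.get("body", {}), body), step["name"]
    return len(steps)


def make_fake_send():
    """In-memory stand-in for an HTTP client talking to an application."""
    documents = {}

    def send(method, uri, body):
        if method == "POST":
            documents[uri] = body
            return 200, {"id": uri}
        if method == "GET" and uri in documents:
            return 200, documents[uri]
        return 404, {}

    return send


# Hypothetical document path and fields, for illustration only.
doc_uri = "/document/v1/mynamespace/music/docid/1"
steps = [
    {"name": "feed", "request": {"method": "POST", "uri": doc_uri,
                                 "body": {"fields": {"artist": "kygo"}}},
     "response": {"code": 200}},
    {"name": "get", "request": {"method": "GET", "uri": doc_uri},
     "response": {"code": 200, "body": {"fields": {"artist": "kygo"}}}},
]
steps_run = run_steps(steps, make_fake_send())
print(steps_run)
```

Matching the expected body as a subset of the actual one keeps tests robust against fields the test does not care about.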

The latest addition to the Vespa CLI
is the test command, which makes it easy to develop and run basic HTTP tests,
expressed in JSON format.
Like the document and query commands, endpoint discovery and authentication are
handled by the CLI, leaving developers free to focus on the tests themselves.

Basic HTTP tests are also supported by the CD framework of Vespa Cloud,
allowing applications to be safely, and easily, deployed to production.

Developing and running tests

To get started with Vespa’s basic HTTP tests:

  • Install and configure Vespa CLI
  • Clone the album-recommendation sample app
    vespa clone vespa-cloud/album-recommendation myapp
  • Configure and deploy the application, locally or to the cloud
    vespa deploy --wait 600
  • Run the system tests, or staging setup and tests
    vespa test src/test/application/system-test
  • To enter production in Vespa Cloud, modify the tests, and then
    vespa prod submit

For more information, see the reference documentation:
Basic HTTP Testing.