Changes in OS support for Vespa

Photo by Claudio Schwarz
on Unsplash

Currently, we support CentOS Stream 8 for open-source Vespa, and we announced that
in the Upcoming changes in OS support for Vespa blog
post in 2022. The choice to use CentOS Stream was made around the time that RedHat announced the EOL for CentOS 8 and the
new CentOS Stream initiative.
Other groups were scrambling to be the successor to CentOS, and the landscape was not settled. This is now about to change.


We are committed to providing Vespa releases that we have high confidence in. Internally, at,
we have migrated to AlmaLinux 8 and
Red Hat Enterprise Linux 8. CentOS Stream 8 is
also approaching its EOL, which is May 31st, 2024. Because of this, we want to change the supported OS for open-source Vespa.

Vespa is released up to 4 times a week depending on internal testing and battle-proven verification in our production
systems. Each high-confidence version is published as RPMs and a container image.
RPMs are built and published on Copr, and container images are
published on Docker Hub. In this blog post, we will look at options going
forward and announce the selected OS support for the upcoming Vespa releases.


There is a wide selection of Linux distributions out there with different purposes and target groups. For us, we need to choose
an OS that is as close to what we use internally as possible, and that is acceptable for our open-source users. These
criteria limit the options significantly to an enterprise Linux-based distribution.


Binary releases of Vespa is a set of RPMs. This set has to be built somewhere and must be uploaded to a
repository where it can be downloaded by package managers. These RPMs are then installed either on a host machine or in
container images. We will still build our RPMs on Copr, but we
have a choice there to compile for different environments that are either downstream or upstream of RHEL. In the time
between the EOL of CentOS 8 (end 2021) and now, Copr has added support to build for
EPEL 8 and 9. This means that we can build for EPEL 8 and install it on
RHEL 8 and its downstream clones.

Distribution of RPMs is currently done on Copr as the built RPMs are directly put in the repository there. The repositories
have limited capacity, and Copr only guarantees that the latest version is available. It would be nice to have an archive
of more than just the most recent version, but this will rely on vendors offering space and network traffic for the RPMs
to be hosted.

Container images

Given the choice of building RPMs on Copr for EPEL 8, this opens up a few options when selecting a base image for our
container image:

We should be able to select any of the above due to the RHEL ABI compatibility
and the distributions’ respective guarantees to be binary compatible with RHEL.

Red Hat has also announced a free version of RHEL
for developers and small workloads. However, it is not hassle-free, as it requires registration at Red Hat
to be able to use the version. We believe that this will not be well received by the consumers of Vespa


Container Image

Considering the options, we have chosen AlmaLinux 8 as the supported OS going forward. The main reasons for this decision

  • AlmaLinux is used in the Vespa Cloud production systems
  • The OS is available to anyone free of charge
  • We will still be able to leverage the Copr build system for open-source Vespa artifacts

We will use the image as the base for the Vespa container image on Docker Hub.

RPM distribution

Although we will continue to build RPMs on Copr, we are going to switch to a new RPM repository that can keep an archive
of a limited set of historic releases. We have been accepted as an open-source project at
Cloudsmith and will use the
vespa/open-source-rpms repository to distribute our RPMs.
Cloudsmith generously allows qualified open-source projects to store 50 GB and have 200 GB of network traffic. The
vespa-engine.repo repository definition
will be updated shortly, and information about how to install Vespa from RPMs can be found in the
documentation. Within our storage limits, we will be able to store
approximately 50 Vespa releases.

Compatibility for current Vespa installations

The consumers of Vespa container images should not notice any differences when Vespa changes the base of the container
image to AlmaLinux 8. Everything comes preinstalled in the image, and this is tested the same way as it was before. If
you use the Vespa container image as a base of custom images, the differences between the latest AlmaLinux 8 and CentOS
Stream 8 are minimal. We do not expect any changes to be required.

For consumers that install Vespa RPMs in their container images or install directly on host instances, we will continue
to build and publish RPMs for CentOS Stream 8 on Copr until Dec 31st, 2023. RPMs built on EPEL 8 will be forward compatible
with CentOS Stream 8 due to the RHEL ABI compatibility. This
means that you can make the switch by replacing the repository configuration with the one defined in
vespa-engine.repo the next time Vespa
is upgraded. If you do not, no new Vespa versions will be available for upgrade once we stop building for CentOS Stream 8.

Future OS support

Predicting the path of future OS support is not trivial in an environment where the landscape is changing. RedHat
announced closing off the RHEL sources and strengthening
CentOS Stream. Open Enterprise Linux Association has popped up as a response to this, and AlmaLinux
commits to binary compatibility. We expect the landscape to change, and
hopefully, we will have more clarity when deciding on the next OS to support.

Regardless of the landscape, we are periodically publishing a preview on AlmaLinux 9 that can be used at your own risk
here. Please use this for testing purposes only.


We have selected AlmaLinux 8 as the supported OS for Vespa going forward. The change is expected to have no impact on the
consumers of Vespa container images and RPMs. The primary RPM repository has moved to a Cloudsmith-hosted
repository where we can have an archive of releases allowing installation of not just the latest Vespa version.

Anonymized endpoints and token authentication in Vespa Cloud

Morten Tokle

Morten Tokle

Principal Software Systems Engineer

Martin Polden

Martin Polden

Principal Vespa Engineer

When you deploy a Vespa application on Vespa Cloud your application is assigned an endpoint for each container cluster declared in your application package. This is the endpoint you communicate with when you query or feed documents to your application.

Since the launch of Vespa Cloud these endpoints have included many dimensions identifying your exact cluster, on the following format {service}.{instance}.{application}.{tenant}.{zone} This format allows easy identification of a given endpoint.

However, while this format makes it easy to identify where an endpoint points, it also reveals details of the application that you might want to keep confidential. This is why we are introducing anonymized endpoints in Vespa Cloud. Anonymized endpoints are created on-demand when you deploy your application and have the format {generated-id}.{generated-id}.{scope} As with existing endpoints, details of anonymized endpoints and where they point are shown in the Vespa Cloud Console for your application.

Anonymized endpoints are the now the default for all new applications in Vespa Cloud. They have also been enabled for existing applications but with backward compatibility. This means that endpoints on the old format continue to work for now but are marked as deprecated in the Vespa Cloud console. We will continue to support the previous format for existing applications, but we encourage using the new endpoints.

In addition to making your endpoint details confidential, this new endpoint format allows Vespa Cloud to optimize certificate issuing. It allows for much faster deployments of new applications as they no longer have to wait for a new certificate to be published.

No action is needed to enable this feature. You can find the new anonymized endpoints in the Vespa Console for your application or by running the Vespa CLI command vespa status.

In addition to anonymized endpoints, we are introducing support for data plane authenticating using access tokens. Token authentication is intended for cases where mTLS authentication is unavailable or impractical. For example, edge runtimes like Vercel edge runtime are built on the V8 Javascript engine that does not support mTLS authentication. Access tokens are created and defined in the Vespa Cloud console and referenced in the application package. See instructions for creating and referencing tokens in the application package in the security guide.

Note it’s still required to define a data plane certificate for mTLS authentication; mTLS is still the preferred authentication method for data plane access, and applications configuring token-based authentication will have two distinct endpoints.


Application endpoints in Vespa Console – Deprecated legacy mTLS endpoint name and two anonymized endpoints, one with mTLS support and the other with token authentication. Using token-based authentication on the mTLS endpoint is not supported.

Using data plane authentication tokens

Using the token endpoint from the above screenshot,, we can authenticate against it by
adding a standard Authorization HTTP header to the data plane requests. For example as demonstrated below using curl:

curl -H "Authorization: Bearer vespa_cloud_...."


Using the latest release of pyvespa, you can interact with token endpoints by setting an environment variable named VESPA_CLOUD_SECRET_TOKEN. If this environment variable is present, pyvespa will read this and use it when interacting with the token endpoint.

import os
os.environ['VESPA_CLOUD_SECRET_TOKEN'] = "vespa_cloud_...."
from vespa.application import Vespa
vespa = Vespa(url="")

In this case, pyvespa will read the environment variable VESPA_CLOUD_SECRET_TOKEN and use that when interacting with the data plane endpoint of your application. There are no changes concerning control-plane authentication, which requires a valid developer/API key.

We do not plan to add token-based authentication to other Vespa data plane tools like vespa-cli or vespa-feed-client, as these are not designed for lightweight edge runtimes.

Edge Runtimes

This is a minimalistic example of using Cloudflare worker, where we have stored the secret Vespa Cloud token using Cloudflare worker functionality for storing secrets. Note that Cloudflare workers also support mTLS.

export default {
    async fetch(request, env, ctx) {
        const secret_token = env.vespa_cloud_secret_key
        return fetch('', 
                     {headers:{'Authorization': `Bearer ${secret_token}`}})

Consult your preferred edge runtime provider documentation on how to store and access secrets.

Security recommendations

It may be easier to use a token for authentication, but we still recommend using mTLS wherever possible. Before using a token for your application, consider the following recommendations.

Token expiration

While the cryptographic properties of tokens are comparable to certificates, it is recommended that tokens have a shorter expiration. Tokens are part of the request headers and not used to set up the connection. This means they are more likely to be included in e.g. log outputs. The default token expiration in Vespa Cloud is 30 days, but it is possible to create tokens with shorter expiration.

Token secret storage

The token value should be treated as a secret, and never be included in source code. Make sure to use a secure way of accessing the tokens and in such a way that they are not exposed in any log output.


Keeping your data safe is a number one priority for us. With these changes, we continue to improve the developer friendliness of Vespa Cloud while maintaining the highest level of security. With anonymized endpoints, we improve deployment time for new applications by several minutes, avoiding waiting for certificate issuing. Furthermore, anonymized endpoints eliminate disclosing tenant and application details in certificates and DNS entries.

Open Sourcing Vespa, Yahoo’s Big Data Processing and Serving Engine

By Jon Bratseth, Distinguished Architect, Vespa

Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo’s big data processing and serving engine, available as open source on GitHub.

Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.

By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, at real time and at internet scale – capabilities that up until now, have been within reach of only a few large companies.

Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user’s query or interests, it won’t do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy to use engine.

With over 1 billion users, we currently use Vespa across many different Oath brands – including, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr, and others – to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalized content and advertisements, to name just a few use cases. In fact, Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds. On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company’s revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).

With Vespa, our teams build applications that:

  • Select content items using SQL-like queries and text search
  • Organize all matches to generate data-driven pages
  • Rank matches by handwritten or machine-learned relevance models
  • Serve results with response times in the low milliseconds
  • Write data in real-time, thousands of times per second per node
  • Grow, shrink, and re-configure clusters while serving and writing data

To achieve both speed and scale, Vespa distributes data and computation over many machines without any single master as a bottleneck. Where conventional applications work by pulling data into a stateless tier for processing, Vespa instead pushes computations to the data. This involves managing clusters of nodes with background redistribution of data in case of machine failures or the addition of new capacity, implementing distributed low latency query and processing algorithms, handling distributed data consistency, and a lot more. It’s a ton of hard work!

As the team behind Vespa, we have been working on developing search and serving capabilities ever since building, which was later acquired by Yahoo. Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience onto a modern technology stack. Vespa is larger in scope and lines of code than any open source project we’ve ever released. Now that this has been battle-proven on Yahoo’s largest and most critical systems, we are pleased to release it to the world.

Vespa gives application developers the ability to feed data and models of any size to the serving system and make the final computations at request time. This often produces a better user experience at lower cost (for buying and running hardware) and complexity compared to pre-computing answers to requests. Furthermore it allows developers to work in a more interactive way where they navigate and interact with complex calculations in real time, rather than having to start offline jobs and check the results later.

Vespa can be run on premises or in the cloud. We provide both Docker images and rpm packages for Vespa, as well as guides for running them both on your own laptop or as an AWS cluster.

We’ll follow up this initial announcement with a series of posts on our blog showing how to build a real-world application with Vespa, but you can get started right now by following the getting started guide in our comprehensive documentation.

Managing distributed systems is not easy. We have worked hard to make it easy to develop and operate applications on Vespa so that you can focus on creating features that make use of the ability to compute over large datasets in real time, rather than the details of managing clusters and data. You should be able to get an application up and running in less than ten minutes by following the documentation.

We can’t wait to see what you’ll build with it!

The basics of Vespa applications

Distributed computation over large data sets in real-time — what we call big data serving — is a complex task. We have worked hard to hide this complexity to make it as easy as possible to create your own production quality Vespa application.
The quick-start guides
take you through the steps of getting Vespa up and running, deploying a basic application, writing data and issuing some queries to it, but without room for explanation.
Here, we’ll explain the basics of creating your own Vespa application.
The blog search and recommendation tutorial
covers these topics in full detail with hands-on instructions.
Update 2021-05-20: Blog tutorials are replaced by the
News search and recommendation tutorial:

Application packages

The configuration, components and models which makes out an application to be run by Vespa is contained in an application package. The application package:

  • Defines which clusters and services should run and how they should be configured
  • Contains the document types the application will use
  • Contains the ranking models to execute
  • Configures how data will be processed during feeding and indexing
  • Configures how queries will be pre- and post-processed

The three mandatory parts of the application specification are the search definition, the services specification, and the hosts specification — all of which have their own file in the application package.
This is enough to set up a basic production ready Vespa applications, like, e.g., the
sample application.
Most applications however, are much larger and may contain machine-learned ranking models and application specific Java components which perform various application specific tasks such as query enrichment and post-search processing.

The schema definition

Data stored in Vespa is represented as a set of documents of a type defined in the application package. An application can have multiple document types. Each search definition describes one such document type: it lists the name and data type of each field found in the document, and configures the behaviour of these. Examples are like whether field values are in-memory or can be stored on disk, and whether they should be indexed or not. It can also contain ranking profiles, which are used to select the most relevant documents among the set of matches for a given query – and it specifies which fields to return.

The services definition

A Vespa application consists of a set of services, such as stateless query and document processing containers and stateful content clusters. Which services to run, where to run those services and the configuration of those services are all set up in services.xml. This includes the search endpoint(s), the document feeding API, the content cluster, and how documents are stored and searched.

The hosts definition

The deployment specification hosts.xml contains a list of all hosts that is part of the application, with an alias for each of them. The aliases are used in services.xml to define which services is to be started on which nodes.

Deploying applications

After the application package has been constructed, it is deployed using vespa-deploy. This uploads the package to the configuration cluster and pushes the configuration to all nodes. After this, the Vespa cluster is now configured and ready for use.

One of the nice features is that new configurations are loaded without service disruption. When a new application package is deployed, the configuration pushes the new generation to all the defined nodes in the application, which consume and effectuate the new configuration without restarting the services. There are some rare cases that require a restart, the vespa-deploy command will notify when this is needed.

Writing data to Vespa

One of the required files when setting up a Vespa application is the search definition. This file (or files) contains a document definition which defines the fields and their data types for each document type. Data is written to Vespa using Vespa’s JSON document format. The data in this format must match the search definition for the document type.

The process of writing data to Vespa is called feeding, and there are multiple tools that can be used to feed data to Vespa for various use cases. For instance there is a REST API for smaller updates and a Java client that can be embedded into other applications.

An important concept in writing data to Vespa is that of document processors. These processors can be chained together to form a processing pipeline to process each document before indexing. This is useful for many use cases, including enrichment by pulling in relevant data from other sources.

Querying Vespa

If you know the id of the document you want, you can fetch it directly using the document API. However, with Vespa you are usually more interested in searching for relevant documents given some query.

Basic querying in Vespa is done through YQL which is an SQL-like language. An example is:

select title,isbn from music where artist contains "kygo"

Here we select the fields “title” and “isbn” from document type “music” where the field called “artist” contains the string “kygo”. Wildcards (*) are supported in the result fields and the document types to return all available fields in all defined document types.

The example above shows how to send a query to Vespa over HTTP. Many applications choose to build the queries in Java components running inside Vespa instead. Such components are called searchers, and can be used to build or modify queries, run multiple queries for each incoming request and filter and modify results. Similar to the document processor chains, you can set up chains of searchers. Vespa contains a set of default Searchers which does various common operations such as stemming and federation to multiple content clusters.

Ranking models

Ranking executes a ranking expression specified in the search definition on all the documents matching a query. When returning specific documents for a query, those with the highest rank score are returned.

A ranking expression is a mathematical function over features (named values).

Features are either sent with the query, attributes of the document, constants in the application package or features computed by Vespa from both the query and document – example:

rank-profile popularity inherits default {  
    first-phase {  
        expression: 0.7 * nativeRank(title, description) + 0.3 * attribute(popularity)  

Here, each document is ranked by the nativeRank function but boosted by a popularity score. This score can be updated at regular intervals, for instance from user feedback, using partial document updates from some external system such as a Hadoop cluster.

In real applications ranking expressions often get much more complicated than this.

For example, a recommendation application may use a deep neural net to compute a recommendation score, or a search application may use a machine-learned gradient boosted decision tree. To support such complex models, Vespa allows ranking expressions to compute over tensors in addition to scalars. This makes it possible to work effectively with large models and parameter spaces.

As complex ranking models can be expensive to compute over many documents, it is often a good idea to use a cheaper function to find good candidates and then rank only those using the full model. To do this you can configure both a first-phase and second-phase ranking expression, where the second-phase function is only computed on the best candidate documents.

Grouping and aggregation

In addition to returning the set of results ordered by a relevance score, Vespa can group and aggregate data over all the documents selected by a query. Common use cases include:

  • Group documents by unique value of some field.
  • Group documents by time and date, for instance sort bug tickets by date of creation into the buckets Today, Past Week, Past Month, Past Year, and Everything else.
  • Calculate the minimum/maximum/average value for a given field.

Groups can be nested arbitrarily and multiple groupings and aggregations can be executed in the same query.

More information

You should now have a basic understanding of the core concepts in building Vespa applications.
To try out these core features in practice, head on over to the
blog search and recommendation tutorial.
Update 2021-05-20: Blog tutorials are replaced by the
News search and recommendation tutorial.

We’ll post some more in-depth blog posts with concrete examples soon.

Blog search application in Vespa

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.


This is the first of a series of blog posts where data from (WP) is used to highlight how Vespa can be used to store, search and recommend blog posts. The data was made available during a Kaggle challenge to predict which blog posts someone would like based on their past behavior. It contains many ingredients that are necessary to showcase needs, challenges and possible solutions that are useful for those interested in building and deploying such applications in production.

The end goal is to build an application where:

  • Users will be able to search and manipulate the pool of blog posts available.
  • Users will get blog post recommendations from the content pool based on their interest.

This part addresses:

  • How to describe the dataset used as well as any information connected to the data.
  • How to set up a basic blog post search engine using Vespa.

The next parts show how to extend this basic search engine application with machine learned models to create a blog recommendation engine.


The dataset contains blog posts written by WP bloggers and actions, in this case ‘likes’, performed by WP readers in blog posts they have interacted with. The dataset is publicly available at Kaggle and was released during a challenge to develop algorithms to help predict which blog posts users would most likely ‘like’ if they were exposed to them. The data includes these fields per blog post:

  • _ post_id _ – unique numerical id identifying the blog post
  • _ date_gmt _ – string representing date of blog post creation in GMT format yyyy-mm-dd hh:mm:ss
  • _ author _ – unique numerical id identifying the author of the blog post
  • _ url _ – blog post URL
  • _ title _ – blog post title
  • _ blog _ – unique numerical id identifying the blog that the blog post belongs to
  • _ tags _ – array of strings representing the tags of the blog posts
  • _ content _ – body text of the blog post, in html format
  • _ categories _ – array of strings representing the categories the blog post was assigned to

For the user actions:

  • _ post_id _ – unique numerical id identifying the blog post
  • _ uid _ – unique numerical id identifying the user that liked post_id
  • _ dt _ – date of the interaction in GMT format yyyy-mm-dd hh:mm:ss

Downloading raw data

For the purposes of this post, it is sufficient to use the first release of training data that consists of 5 weeks of posts as well as all the ‘like’ actions that occurred during those 5 weeks.

This first release of training data is available here – once downloaded, unzip it. The 1,196,111 line trainPosts.json will be our practice document data. This file is around 5GB in size.


Indexing the full data set requires 23GB disk space. We have tested with a Docker container with 10GB RAM. We used similar settings as described in the vespa quick start guide. As in the guide we assume that the $VESPA_SAMPLE_APPS env variable points to the directory with your local clone of the vespa sample apps:

$ docker run -m 10G --detach --name vespa --hostname vespa --privileged --volume $VESPA_SAMPLE_APPS:/vespa-sample-apps --publish 8080:8080 vespaengine/vespa

Searching blog posts

Functional specification:

  • Blog post title, content, tags and categories must all be searchable
  • Allow blog posts to be sorted by both relevance and date
  • Allow grouping of search results by tag or category

In terms of data, Vespa operates with the notion of documents. A document represents a single, searchable item in your system, e.g., a blog post, a photo, or a news article. Each document type must be defined in the Vespa configuration through a search definition. Think of a search definition as being similar to a table definition in a relational database; it consists of a set of fields, each with a given name, a specific type, and some optional properties.

As an example, for this simple blog post search application, we could create the document type blog_post with the following fields:

  • _ url _ – of type uri
  • _ title _ – of type string
  • _ content _ – of type string (string fields can be of any length)
  • _ date_gmt _ – of type string (to store the creation date in GMT format)

The data fed into Vespa must match the structure of the search definition, and the hits returned when searching will be on this format as well.

Application Packages

A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done and how data will be processed during feeding and indexing. The search definition, e.g.,, is a required part of an application package — the other required files are services.xml and hosts.xml.

The sample application blog search creates a simple but functional blog post search engine. The application package is found in src/main/application.

Services Specification

services.xml defines the services that make up the Vespa application — which services to run and how many nodes per service:

<?xml version='1.0' encoding='UTF-8'?>
<services version='1.0'>

  <container id='default' version='1.0'>
      <node hostalias="node1"/>

  <content id='blog_post' version='1.0'>
      <document mode="index" type="blog_post"/>
      <node hostalias="node1"/>

  • <container> defines the container cluster for document, query and result processing
  • <search> sets up the search endpoint for Vespa queries. The default port is 8080.
  • <document-api> sets up the document endpoint for feeding.
  • <nodes> defines the nodes required per service. (See the reference for more on container cluster setup.)
  • <content> defines how documents are stored and searched
  • <redundancy> denotes how many copies to keep of each document.
  • <documents> assigns the document types in the search definition — the content cluster capacity can be increased by adding node elements — see elastic Vespa. (See also the reference for more on content cluster setup.)
  • <nodes> defines the hosts for the content cluster.

Deployment Specification

hosts.xml contains a list of all the hosts/nodes that is part of the application, with an alias for each of them. Here we use a single node:

<?xml version="1.0" encoding="utf-8" ?>
  <host name="localhost">

Search Definition

The blog_post document type mentioned in src/main/application/service.xml is defined in the search definition. src/main/application/searchdefinitions/ contains the search definition for a document of type blog_post:

search blog_post {

    document blog_post {

        field date_gmt type string {
            indexing: summary

        field language type string {
            indexing: summary

        field author type string {
            indexing: summary

        field url type string {
            indexing: summary

        field title type string {
            indexing: summary | index

        field blog type string {
            indexing: summary

        field post_id type string {
            indexing: summary

        field tags type array<string> {
            indexing: summary

        field blogname type string {
            indexing: summary

        field content type string {
            indexing: summary | index

        field categories type array<string> {
            indexing: summary

        field date type int {
            indexing: summary | attribute


    fieldset default {
        fields: title, content

    rank-profile post inherits default {

        first-phase {
            expression:nativeRank(title, content)



document is wrapped inside another element called search. The name following these elements, here blog_post, must be exactly the same for both.

The field property indexing configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing — see indexing language. Each part of the indexing pipeline is separated by the pipe character ‘’:

Deploy the Application Package

Once done with the application package, deploy the Vespa application — build and start Vespa as in the quick start. Deploy the application:

$ cd /vespa-sample-apps/blog-search
$ vespa-deploy prepare src/main/application && vespa-deploy activate

This prints that the application was activated successfully and also the checksum, timestamp and generation for this deployment (more on that later). Pointing a browser to http://localhost:8080/ApplicationStatus returns JSON-formatted information about the active application, including its checksum, timestamp and generation (and should be the same as the values when vespa-deploy activate was run). The generation will increase by 1 each time a new application is successfully deployed, and is the easiest way to verify that the correct version is active.

The Vespa node is now configured and ready for use.

Feeding Data

The data fed to Vespa must match the search definition for the document type. The data downloaded from Kaggle, contained in trainPosts.json, must be converted to a valid Vespa document format before it can be fed to Vespa. Find a parser in the utility repository. Since the full data set is unnecessarily large for the purposes of this first part of this post, we use only the first 10,000 lines of it, but feel free to load all 1,1M entries:

$ head -10000 trainPosts.json > trainPostsSmall.json
$ python trainPostsSmall.json > feed.json

Send this to Vespa using one of the tools Vespa provides for feeding. Here we will use the Java feeding API:

$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --verbose --file feed.json --host localhost --port 8080

Note that in the sample-apps/blog-search directory, there is a file with sample data. You may also feed this file using this method.

Track feeding progress

Use the Metrics API to track number of documents indexed:

$ curl -s 'http://localhost:19112/state/v1/metrics' | tr ',' '\n' | grep -A 2 proton.doctypes.blog_post.numdocs

You can also inspect the search node state by

$ vespa-proton-cmd --local getState  

Fetch documents

Fetch documents by document id using the Document API:

$ curl -s 'http://localhost:8080/document/v1/blog-search/blog_post/docid/1750271' | python -m json.tool

The first query

Searching with Vespa is done using a HTTP GET requests, like:


The only mandatory parameter is the query, using yql=<yql query>. More details can be found in the Search API.

Given the above search definition, where the fields title and content are part of the fieldset default, any document containing the word “music” in one or more of these two fields matches our query below:

$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22%3B' | python -m json.tool

Looking at the output, please note:

  • The field documentid in the output and how it matches the value we assigned to each put operation when feeding data to Vespa.
  • Each hit has a property named relevance, which indicates how well the given document matches our query, using a pre-defined default ranking function. You have full control over ranking — more about ranking and ordering later. The hits are sorted by this value.
  • When multiple hits have the same relevance score their internal ordering is undefined. However, their internal ordering will not change unless the documents are re-indexed.
  • Add &tracelevel=9 to dump query parsing details

Other examples


Once more a search for the single term “music”, but this time with the explicit field title. This means that we only want to match documents that contain the word “music” in the field title. As expected, you will see fewer hits for this query, than for the previous one.


This is a query for the two terms “music” and “festival”, combined with an AND operation; it finds documents that match both terms — but not just one of them.


This is a single-term query in the special field sddocname for the value “blog_post”. This is a common and useful Vespa trick to get the number of indexed documents for a certain document type (search definition): sddocname is a special and reserved field which is always set to the name of the document type for a given document. The documents are all of type blog_post, and will therefore automatically have the field sddocname set to that value.

This means that the query above really means “Return all documents of type blog_post”, and as such all documents in the index are returned.

Vespa Meetup in Sunnyvale | Vespa Blog

WHAT: Vespa meetup with various presentations from the Vespa team.

Several Vespa developers from Norway are in Sunnyvale, use this opportunity to learn more about the open big data serving engine Vespa and meet the team behind it.

WHEN: Monday, December 4th, 6:00pm – 8:00pm PDT

WHERE: Oath/Yahoo Sunnyvale Campus
Building E, Classroom 9 & 10
700 First Avenue, Sunnyvale, CA 94089



6.00 pm: Welcome & Intro

6.15 pm: Vespa tips and tricks

7.00 pm: Tensors in Vespa, intro and usecases

7.45 pm: Vespa future and roadmap

7.50 pm: Q&A

This meetup is a good arena for sharing experience, get good tips, get inside details in Vespa, discuss and impact the roadmap, and it is a great opportunity for the Vespa team to meet our users. Hope to see many of you!

Blog recommendation in Vespa | Vespa Blog

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.


This post builds upon the previous blog search application and extends the basic search engine to include machine learned models to help us recommend blog posts to users that arrive at our application. Assume that once a user arrives, we obtain his user identification number, denoted in here by user_id, and that we will send this information down to Vespa and expect to obtain a blog post recommendation list containing 100 blog posts tailored for that specific user.


Collaborative Filtering

We will start our recommendation system by implementing the collaborative filtering algorithm for implicit feedback described in (Hu et. al. 2008). The data is said to be implicit because the users did not explicitly rate each blog post they have read. Instead, the have “liked” blog posts they have likely enjoyed (positive feedback) but did not have the chance to “dislike” blog posts they did not enjoy (absence of negative feedback). Because of that, implicit feedback is said to be inherently noisy and the fact that a user did not “like” a blog post might have many different reasons not related with his negative feelings about that blog post.

In terms of modeling, a big difference between explicit and implicit feedback datasets is that the ratings for the explicit feedback are typically unknown for the majority of user-item pairs and are treated as missing values and ignored by the training algorithm. For an implicit dataset, we would assume a rating of zero in case the user has not liked a blog post. To encode the fact that a value of zero could come from different reasons we will use the concept of confidence as introduced by (Hu et. al. 2008), which causes the positive feedback to have a higher weight than a negative feedback.

Once we train the collaborative filtering model, we will have one vector representing a latent factor for each user and item contained in the training set. Those vectors will later be used in the Vespa ranking framework to make recommendations to a user based on the dot product between the user and documents latent factors. An obvious problem with this approach is that new users and new documents will not have those latent factors available to them. This is what is called a cold start problem and will be addressed with content-based techniques described in future posts.

Evaluation metrics

The evaluation metric used by Kaggle for this challenge was the Mean Average Precision at 100 (MAP@100). However, since we do not have information about which blog posts the users did not like (that is, we have only positive feedback) and our inability to obtain user behavior to the recommendations we make (this is an offline evaluation, different from the usual A/B testing performed by companies that use recommendation systems), we offer a similar remark as the one included in (Hu et. al. 2008) and prefer recall-oriented measures. Following (Hu et. al. 2008) we will use the expected percentile ranking.

Evaluation Framework

Generate training and test sets

In order to evaluate the gains obtained by the recommendation system when we start to improve it with more accurate algorithms, we will split the dataset we have available into training and test sets. The training set will contain document (blog post) and user action (likes) pairs as well as any information available about the documents contained in the training set. There is no additional information about the users besides the blog posts they have liked. The test set will be formed by a series of documents available to be recommended and a set of users to whom we need to make recommendations. This list of test set documents constitutes the Vespa content pool, which is the set of documents stored in Vespa that are available to be served to users. The user actions will be hidden from the test set and used later to evaluate the recommendations made by Vespa.

To create an application that more closely resembles the challenges faced by companies when building their recommendation systems, we decided to construct the training and test sets in such a way that:

  • There will be blog posts that had been liked in the training set by a set of users and that had also been liked in the test set by another set of users, even though this information will be hidden in the test set. Those cases are interesting to evaluate if the exploitation (as opposed to exploration) component of the system is working well. That is, if we are able to identify high quality blog posts based on the available information during training and exploit this knowledge by recommending those high quality blog posts to another set of users that might like them as well.
  • There will be blog posts in the test set that had never been seen in the training set. Those cases are interesting in order to evaluate how the system deals with the cold-start problem. Systems that are too biased towards exploitation will fail to recommend new and unexplored blog posts, leading to a feedback loop that will cause the system to focus into a small share of the available content.

A key challenge faced by recommender system designers is how to balance the exploitation/exploration components of their system, and our training/test set split outlined above will try to replicate this challenge in our application. Notice that this split is different from the approach taken by the Kaggle competition where the blog posts available in the test set had never been seen in the training set, which removes the exploitation component of the equation.

The Spark job uses trainPosts.json and creates the folders blog-job/training_set_ids and blog-job/test_set_ids containing files with post_id and user_idpairs:

$ cd blog-recommendation; export SPARK_LOCAL_IP=""
$ spark-submit --class "" \
  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \
  --task split_set --input_file ../trainPosts.json \
  --test_perc_stage1 0.05 --test_perc_stage2 0.20 --seed 123 \
  --output_path blog-job/training_and_test_indices
  • test_perc_stage1: The percentage of the blog posts that will be located only on the test set (exploration component).
  • test_perc_stage2: The percentage of the remaining (post_id, user_id) pairs that should be moved to the test set (exploitation component).
  • seed: seed value used in order to replicate results if required.

Compute user and item latent factors

Use the complete training set to compute user and item latent factors. We will leave the discussion about tuning and performance improvement of the model used to the section about model tuning and offline evaluation. Submit the Spark job to compute the user and item latent factors:

$ spark-submit --class "" \
  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \
  --task collaborative_filtering \
  --input_file blog-job/training_and_test_indices/training_set_ids \
  --rank 10 --numIterations 10 --lambda 0.01 \
  --output_path blog-job/user_item_cf

Verify the vectors for the latent factors for users and posts:

$ head -1 blog-job/user_item_cf/user_features/part-00000 | python -m json.tool
    "user_id": 270,
    "user_item_cf": {
        "user_item_cf:0": -1.750116e-05,
        "user_item_cf:1": 9.730623e-05,
        "user_item_cf:2": 8.515047e-05,
        "user_item_cf:3": 6.9297894e-05,
        "user_item_cf:4": 7.343942e-05,
        "user_item_cf:5": -0.00017635927,
        "user_item_cf:6": 5.7642872e-05,
        "user_item_cf:7": -6.6685796e-05,
        "user_item_cf:8": 8.5506894e-05,
        "user_item_cf:9": -1.7209566e-05
$ head -1 blog-job/user_item_cf/product_features/part-00000 | python -m json.tool
    "post_id": 20,
    "user_item_cf": {
        "user_item_cf:0": 0.0019320602,
        "user_item_cf:1": -0.004728486,
        "user_item_cf:2": 0.0032499845,
        "user_item_cf:3": -0.006453364,
        "user_item_cf:4": 0.0015929453,
        "user_item_cf:5": -0.00420313,
        "user_item_cf:6": 0.009350027,
        "user_item_cf:7": -0.0015649397,
        "user_item_cf:8": 0.009262732,
        "user_item_cf:9": -0.0030964287

At this point, the vectors with latent factors can be added to posts and users.

Add vectors to search definitions using tensors

Modern machine learning applications often make use of large, multidimensional feature spaces and perform complex operations on those features, such as in large logistic regression and deep learning models. It is therefore necessary to have an expressive framework to define and evaluate ranking expressions of such complexity at scale.

Vespa comes with a Tensor framework, which unify and generalizes scalar, vector and matrix operations, handles the sparseness inherent to most machine learning application (most cases evaluated by the model is lacking values for most of the features) and allow for models to be continuously updated. Additional information about the Tensor framework can be found in the tensor user guide.

We want to have those latent factors available in a Tensor representation to be used during ranking by the Tensor framework. A tensor field user_item_cf is added to to hold the blog post latent factor:

field user_item_cf type tensor(user_item_cf[10]) {
	indexing: summary | attribute
	attribute: tensor(user_item_cf[10])

field has_user_item_cf type byte {
	indexing: summary | attribute
	attribute: fast-search

A new search definition defines a document type named user to hold information for users:

search user {
    document user {
        field user_id type string {
            indexing: summary | attribute
            attribute: fast-search

        field has_read_items type array<string> {
            indexing: summary | attribute

        field user_item_cf type tensor(user_item_cf[10]) {
            indexing: summary | attribute
            attribute: tensor(user_item_cf[10])

        field has_user_item_cf type byte {
            indexing: summary | attribute
            attribute: fast-search


  • user_id: unique identifier for the user
  • user_item_cf: tensor that will hold the user latent factor
  • has_user_item_cf: flag to indicate the user has a latent factor

Join and feed data

Build and deploy the application:

Deploy the application (in the Docker container):

$ vespa-deploy prepare /vespa-sample-apps/blog-recommendation/target/application && \
  vespa-deploy activate

Wait for app to activate (200 OK):

$ curl -s --head http://localhost:8080/ApplicationStatus

The code to join the latent factors in blog-job/user_item_cf into blog_post and user documents is implemented in tutorial_feed_content_and_tensor_vespa.pig. After joining in the new fields, a Vespa feed is generated and fed to Vespa directly from Pig :

$ pig -Dvespa.feed.defaultport=8080 \
  -x local \
  -f ../blog-tutorial-shared/src/main/pig/tutorial_feed_content_and_tensor_vespa.pig \
  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \
  -param DATA_PATH=../trainPosts.json \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param BLOG_POST_FACTORS=blog-job/user_item_cf/product_features \
  -param USER_FACTORS=blog-job/user_item_cf/user_features \
  -param ENDPOINT=localhost

A successful data join and feed will output:

Successfully read 1196111 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/trainPosts.json"
Successfully read 341416 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/training_and_test_indices/testing_set_ids"
Successfully read 323727 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/product_features"
Successfully read 6290 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/user_features"

Successfully stored 286237 records in: "localhost"

Sample blog post and user:


Set up a rank function to return the best matching blog posts given some user latent factor. Rank the documents using a dot product between the user and blog post latent factors, i.e. the query tensor and blog post tensor dot product (sum of the product of the two tensors) – from

rank-profile tensor {
    first-phase {
        expression {
            sum(query(user_item_cf) * attribute(user_item_cf))

Configure the ranking framework to expect that query(user_item_cf) is a tensor, and that it is compatible with the attribute in a query profile type – see search/query-profiles/types/root.xml and search/query-profiles/default.xml:

<query-profile-type id="root" inherits="native">
    <field name="ranking.features.query(user_item_cf)" type="tensor(user_item_cf[10])" />

<query-profile id="default" type="root" />

This configures a ranking feature named query(user_item_cf) with type tensor(user_item_cf[10]), which defines it as an indexed tensor with 10 elements. This is the same as the attribute, hence the dot product can be computed.

Query Vespa with a tensor

Test recommendations by sending a tensor with latenct factors: localhost:8080/search/?yql=select%20*%20from%20sources%20blog_post%20where%20has_user_item_cf%20=%201;&ranking=tensor&ranking.features.query(user_item_cf)=%7B%7Buser_item_cf%3A0%7D%3A0.1%2C%7Buser_item_cf%3A1%7D%3A0.1%2C%7Buser_item_cf%3A2%7D%3A0.1%2C%7Buser_item_cf%3A3%7D%3A0.1%2C%7Buser_item_cf%3A4%7D%3A0.1%2C%7Buser_item_cf%3A5%7D%3A0.1%2C%7Buser_item_cf%3A6%7D%3A0.1%2C%7Buser_item_cf%3A7%7D%3A0.1%2C%7Buser_item_cf%3A8%7D%3A0.1%2C%7Buser_item_cf%3A9%7D%3A0.1%7D

The query string, decomposed:

  • yql=select * from sources blog_post where has_user_item_cf = 1 – this selects all documents of type blog_post which has a latent factor tensor
  • restrict=blog_post – search only in blog_post documents
  • ranking=tensor – use the rank-profile tensor in
  • ranking.features.query(user_item_cf) – send the tensor as user_item_cf. As this tensor is defined in the query-profile-type, the ranking framework knows its type (i.e. dimensions) and is able to do a dot product with the attribute of same type. The tensor before URL-encoding:


Query Vespa with user id

Next step is to query Vespa by user id, look up the user profile for the user, get the tensor from it and recommend documents based on this tensor (like the query in previous section). The user profiles is fed to Vespa in the user_item_cf field of the user document type.

In short, set up a searcher to retrieve the user profile by user id – then run the query. When the Vespa Container receives a request, it will create a Query representing it and execute

Blog recommendation with neural network models

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.


The main objective of this post is to show how to deploy neural network models in Vespa using our Tensor Framework. In fact, any model that can be represented by a series of Tensor operations can be deployed in Vespa. Neural networks is just a popular example. In addition, we will introduce the multi-phase ranking model available in Vespa that can be used to run more expensive models in a phase based on a reduced number of documents returned by previous phases. This feature allow us to run models that would be prohibitively expensive to use if we had to run them at query-time across all the documents indexed in Vespa.

Model Training

In this section, we will define a neural network model, show how we created a suitable dataset to train the model and train the model using TensorFlow.

The neural network model

In the previous blog post, we computed latent factors for each user and each document and then used a dot-product between user and document vectors to rank the documents available for recommendation to a specific user. In this tutorial we will train a 2-layer fully connected neural network model that will take the same user (u) and document (d) latent factors as input and will output the probability of that specific user liking the document.

More technically, our previous rank function r was given by


while in this tutorial it will be given by


where f represents the neural network model described below and θ is the neural network parameter values that we need to learn from training data.

The specific form of the neural network model used here is

p = sigmoid(h1×W2+b2)
h1 = ReLU(x×W1+b1)

where x=[u,d] is the concatenation of the user and document latent factor, ReLU is the rectifier activation function, sigmoid represents the sigmoid function, p is the output of the model and in this case can be interpreted as the probability of the user u liking a blog post d. The parameters of the model are represented by θ=(W1,W2,b1,b2).

Training data

For the training dataset, we will start with the (user_id, post_id) rows from the “training_set_ids” generated previously. Then, we remove every row for which there is no latent factors for the user_id or post_id contained in that row. This gives us a dataset with only positive feedback (label = 1), since each row represents one instance of a user_id liking a post_id.

In order to train our model, we need to generate negative feedback (label = 0). So, for each row (user_id, post_id) in the current dataset we will generate N negative feedback rows by randomly sampling post_id_fake from the pool of post_id’s available in the current set, so that for each (user_id, post_id) row with label = 1 we will increase the dataset with N (user_id, post_id_fake) rows with label = 0.

Find code to generate the dataset in the utility scripts.

Training with TensorFlow

With the training data in hand, we have split it into 80% training set and 20% validation set and used TensorFlow to train the model. The script used can be found in the utility scripts and executed by

$ python --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \
                       --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \
                       --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt

The progress of your training can be visualized using Tensorboard

$ tensorboard --logdir runs/*/summaries/

Model deployment in Vespa

Two Phase Ranking

When a query is sent to Vespa, it will scan all documents available and select the ones (possibly all) that match the query. When the set of documents matching a query is found, Vespa must decide the order of these documents. Unless explicit sorting is used, Vespa decides this order by calculating a number for each document, the rank score, and sorts the documents by this number.

The rank score can be any function that takes as arguments parameters sent by the query, document attributes defined in search definitions and global parameters not directly linked to query or document parameters. One example of rank score is the output of the neural network model defined in this tutorial. The model takes the latent factor u associated with a specific user_id (query parameter), the latent factor dd associated with document post_id (document attribute) and learned model parameters (global parameters not related to a specific query nor document) and returns the probability of user u to like document d.

However, even though Vespa is designed to carry out such calculations optimally, complex expressions becomes expensive when they must be calculated over every one of a large set of matching documents. To relieve this, Vespa can be configured to run two ranking expressions – a smaller and less accurate one on all hits during the matching phase, and a more expensive and accurate one only on the best hits during the reranking phase. In general this allows a more optimal usage of the cpu budget by dedicating more of the total cpu towards the best candidate hits.

The reranking phase, if specified, will by default be run on the 100 best hits on each search node, after matching and before information is returned upwards to the search container. The number of hits to rerank can be turned up or down as needed. Below is a toy example showing how to configure first and second phase ranking expressions in the rank profile section of search definitions where the second phase rank expression is run on the 200 best hits from first phase on each search node.

search myapp {


    rank-profile default inherits default {

        first-phase {
            expression: nativeRank + query(deservesFreshness) * freshness(timestamp)

        second-phase {
            expression {
                0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +
                0.3 * attributeMatch(keywords)
            rerank-count: 200

Constant Tensor files

Once the model has been trained in TensorFlow, export the model parameters (W1,W2,b1,b2) to the application folder as Tensors according to the Vespa Document JSON format.

The complete code to serialize the model parameters using Vespa Tensor format can be found in the utility scripts but the following code snipped shows how to serialize the hidden layer weights W1:

serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])

Note that Vespa currently requires dimension names for all the Tensor dimensions (in this case W1 is a matrix, therefore dimension is 2).

In the following section, we will use the following code in the blog_post search definition in order to be able to use the constant tensor W_hidden in our ranking expression.

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])

A constant tensor is data that is not specific to a given document type. In the case above we define W_hidden to be a tensor with two dimensions (matrix), where the first dimension is named input and has size 20 and second dimension is named hidden and has size 40. The data were serialized to a JSON file located at constants/W_hidden.json relative to the application package folder.

Vespa ranking expressions

In order to evaluate the neural network model trained with TensorFlow in the previous section, we need to translate the model structure to a Vespa ranking expression to be defined in the blog_post search definition. To honor a low-latency response, we will take advantage of the Two Phase Ranking available in Vespa and define the first phase ranking to be the same ranking function used in the previous blog post, which is a dot-product between the user and latent factors. After the documents have been sorted by the first phase ranking function, we will rerank the top 200 document from each search node using the second phase ranking given by the neural network model presented above.

Note that we define two ranking profiles in the search definition below. This allow us to decide which ranking profile to use at query time. We defined a ranking profile named tensor which only applies the dot-product between user and document latent factors for all matching documents and a ranking profile named nn_tensor, which rerank the top 200 documents using the neural network model discussed in the previous section.

We will walk through each part of the blog_post search definition, see

As always, we start the a search definition with the following line

We define the document type blog_post the same way we have done in the previous tutorial.

    document blog_post {

      # Field definitions
      # Examples:

      field date_gmt type string {
          indexing: summary
      field language type string {
          indexing: summary

      # Remaining fields as found in previous tutorial


We define a ranking profile named tensor which rank all the matching documents by the dot-product between the document latent factor and the user latent factor. This is the same ranking expression used in the previous tutorial, which include code to retrieve the user latent factor based on the user_id sent by the query to Vespa.

    # Simpler ranking profile without
    # second-phase ranking
    rank-profile tensor {
      first-phase {
          expression {
              sum(query(user_item_cf) * attribute(user_item_cf))

Since we want to evaluate the neural network model we have trained, we need to define where to find the model parameters (W1,W2,b1,b2). See the previous section for how to write the TensorFlow model parameters to Vespa Tensor format.

    # We need to specify the type and the location
    # of the files storing tensor values for each
    # Variable in our TensorFlow model. In this case,
    # W_hidden, b_hidden, W_final, b_final

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])

    constant b_hidden {
        file: constants/b_hidden.json
        type: tensor(hidden[40])

    constant W_final {
        file: constants/W_final.json
        type: tensor(hidden[40], final[1])

    constant b_final {
        file: constants/b_final.json
        type: tensor(final[1])

Now, we specify a second rank-profile called nn_tensor that will use the same first phase as the rank-profile tensor but will rerank the top 200 documents using the neural network model as second phase. We refer to the Tensor Reference document for more information regarding the Tensor operations used in the code below.

    # rank profile with neural network model as
    # second phase
    rank-profile nn_tensor {

        # The input to the neural network is the
        # concatenation of the document and query vectors.

        macro nn_input() {
            expression: concat(attribute(user_item_cf), query(user_item_cf), input)

        # Computes the hidden layer

        macro hidden_layer() {
            expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))

        # Computes the output layer

        macro final_layer() {
            expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))

        # First-phase ranking:
        # Dot-product between user and document latent factors

        first-phase {
            expression: sum(query(user_item_cf) * attribute(user_item_cf))

        # Second-phase ranking:
        # Neural network model based on the user and latent factors

        second-phase {
            rerank-count: 200
            expression: sum(final_layer)



Offline evaluation

We will now query Vespa and obtain 100 blog post recommendations for each user_id in our test set. Below, we query Vespa using the tensor ranking function which contain the simpler ranking expression involving the dot-product between user and document latent factors.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080
  -param RANKING_NAME=tensor
  -param OUTPUT=blog-job/cf-metric

We perform the same query routine below, but now using the ranking-profile nn_tensor which reranks the top 200 documents using the neural network model.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080
  -param RANKING_NAME=nn_tensor
  -param OUTPUT=blog-job/cf-metric

The tutorial_compute_metric.pig script can be found in our repo.

Comparing the recommendations obtained by those two ranking profiles and our test set, we see that by deploying a more complex and accurate model in the second phase ranking, we increased the number of relevant documents (documents read by the user) retrieved from 11948 to 12804 (more than 7% increase) and those documents retrieved appeared higher up in the list of recommendations, as shown by the expected percentile ranking metric introduced in the Vespa tutorial pt. 2 which decreased from 37.1% to

Optimizing realtime evaluation of neural net models on Vespa

In this blog post we describe how we recently made neural network evaluation over 20 times faster on Vespa’s tensor framework.

Vespa is the open source platform for building applications that carry out scalable real-time data processing, for instance search and recommendation systems. These require significant amounts of computation over large data sets. With advances in machine learning, it is desirable to run more advanced ranking models such as large linear or logistic regression models and artificial neural networks. Because of the tight computational budget at serving time, the evaluation of such models must be done in an efficient and scalable manner.

We introduced the tensor API to help solve such problems. The tensor API allows the concise expression of general computations on many-dimensional data, while simultaneously leaving room for deep optimizations on the platform side.  What we mean by this is that the tensor API is very expressive and supports a large range of model types. The general evaluation of tensors is not necessarily efficient in all cases, so in addition to continually working to increase the baseline performance, we also perform specific optimizations for important use cases. In this blog post we will describe one such important optimization we recently did, which improved neural network evaluation performance by over 20x.

To illustrate the types of optimization we can do, consider the following tensor expression representing a dot product between vectors v1 and v2:

reduce(join(v1, v2, f(x, y)(x * y)), sum)

The dot product is calculated by multiplying the vectors together by using the join operation,
then summing the elements in the vector together using the reduce operation.
The result is a single scalar. A naive implementation would first calculate the join and introduce a temporary tensor before the reduce sums up the cells to a single scalar. Particularly for large tensors with many dimensions, such a temporary tensor can be large and require significant memory allocations. This is obviously not the most efficient path to calculate the resulting tensor.  A general improvement would be to avoid the temporary tensor and reduce to the single scalar directly as the tensors are iterated through.

In Vespa, when ranking expressions are compiled, the abstract syntax tree (AST) is analyzed for such optimizations. When known cases are recognized, the most efficient implementation is selected. In the above example, assuming the vectors are dense and they share dimensions, Vespa has optimized hardware accelerated code for doing dot products on vectors. For sparse vectors, Vespa falls back to a implementation for weighted sets which build hash tables for efficient lookups.  This method allows recognition of both large and small optimizations, from simple dot products to specialized implementations for more advanced ranking models. Vespa currently has a few optimizations implemented, and we are adding more as important use cases arise.

We recently set out to improve the performance of evaluating simple neural networks, a case quite similar to the one presented in the previous blog post. The ranking expression to optimize was:

   macro hidden_layer() {
       expression: elu(xw_plus_b(nn_input, constant(W_fc1), constant(b_fc1), x))
   macro final_layer() {
       expression: xw_plus_b(hidden_layer, constant(W_fc2), constant(b_fc2), hidden)
   first-phase {
       expression: final_layer

This represents a simple two-layer neural network.

Whenever a new version of Vespa is built, a large suite of integration and performance tests are run. When we want to optimize a specific use case, we first create a performance test to set a baseline.  With the performance tests we get both historical graphs as well as detailed profiling information and performance statistics sampled from the system under load.  This allows us to identify and optimize any bottlenecks. Also, it adds a bit of gamification to the process.

The graph below shows the performance of a test where 10 000 random documents are ranked according to the evaluation of a simple two-layer neural network:


Here, the x-axis represent builds, and the y-axis is the end-to-end latency as measured from a machine firing off queries to a server running the test on Vespa. As can be seen, over the course of optimization the latency was reduced from 150-160 ms to 7 ms, an impressive 20x end-to-end latency improvement.

When a query is received by Vespa, it is first processed in the stateless container. This is usually where applications would process the query, possibly enriching it with additional information. Vespa does a bit of default work here as well, and also transforms the query a bit. For this test, no specific handling was done except this default handling. After initial processing, the query is dispatched to each node in the stateful content layer. For this test, only a single node is used in the content layer, but applications would typically have multiple. The query is processed in parallel on each node utilizing multiple cores and the ranking expression gets executed once for each document that matches the query. For this test with 10 000 documents, the ranking expression and thus the neural network gets evaluated in total 10 000 times before the top N documents are returned to the container layer.

The following steps were taken to optimize this expression, with each step visible as a step in the graph above:

  1. Recognize join with multiplication as part of an inner product.
  2. Optimize for bias addition.
  3. Optimize vector concatenation (which was part of the input to the neural network)
  4. Replace appropriate sub-expressions with the dense vector-matrix product.

It was particularly the final step which gave the biggest percent wise performance boost. The solution in total was to recognize the vector-matrix multiplication done in the neural network layer and replace that with specialized code that invokes the existing hardware accelerated dot product code. In the expression above, the operation xw_plus_b is replaced with a reduce of the multiplicative join and additive join. This is what is recognized and performed in one step instead of three.

This strategy of optimizing specific use cases allows for a more rapid application development for users of Vespa. Consider the case where some exotic model needs to be run on Vespa. Without the generic tensor API users would have to implement their own custom rank features or wait for the Vespa core developers to implement them. In contrast, with the tensor API, teams can continue their development without external dependencies to the Vespa team.  If necessary, the Vespa team can in parallel implement the optimizations needed to meet performance requirements, as we did in this case with neural networks.

Introducing TensorFlow support | Vespa Blog

In previous blog posts we have talked about Vespa’s tensor API which enables some advanced ranking capabilities. The primary use case is for machine learned ranking, where you train your models using some machine learning framework, convert the models to Vespa’s tensor format, and deploy them to Vespa. This works well, but converting trained models to Vespa form is cumbersome.

We are now happy to announce a new feature that makes this process a lot easier: TensorFlow import. With this feature you can directly deploy models you’ve trained in TensorFlow to Vespa, and use these models during ranking. This means that the models are executed in parallel over multiple threads and machines for a single query, which makes it possible to evaluate the model over any number of data items and still bound the total response time. In addition the data items to evaluate with the TensorFlow model can be selected dynamically with a query, and with a cheaper first-phase rank function if needed. Since the TensorFlow models are evaluated on the nodes storing the data, we avoid sending any data over the wire for evaluation.

In this post we’d like to introduce this new feature by discussing how it works, some assumptions behind working with TensorFlow and Vespa, and how to use the feature.

Vespa is optimized to evaluate models repeatedly over many data items (documents).  To do this efficiently, we do not evaluate the model using the TensorFlow inference engine. TensorFlow adds a non-trivial amount of overhead and instrumentation which it uses to manage potentially large scale computations. This is significant in our case, since we need to evaluate models on a micro-second scale. Hence our approach is to extract the parameters (weights) into Vespa tensors, and use the model specification in the TensorFlow graph to generate efficient Vespa tensor expressions.

Importing TensorFlow models is as simple as saving the TensorFlow model using the
SavedModel API,
adding those files to the Vespa application package, and referencing the model using the new TensorFlow ranking feature.
For instance, if your files are in models/my_model in the application package:

first-phase {
    expression: sum(tensorflow(“my_model/saved”))

The above expressions runs the model, and sums it to a single scalar value to use in ranking.  One thing you will have to provide is the input(s), or feed, to the graph. Vespa expects you to provide a macro with the same name as the input placeholder. In the macro you can specify where the input should come from, be it a parameter sent along with the query, a document field (possibly in a parent document) or a constant.

As mentioned, Vespa evaluates the imported models once per document. Depending on the requirements of the application, this can impose some natural limitations on the size and complexity of the models that can be evaluated. However, Vespa has a number of other search and rank features that can be used to reduce the search space before running the machine learned models. Typically, one would use the search and first ranking phases to select a relatively small number of candidate documents, which are then given their final rank score in the more computationally expensive second phase model evaluation.

Also note that TensorFlow import is new to Vespa, and we currently only support a subset of the TensorFlow operations. While the supported operations should suffice for many relevant use cases, there are some that are not supported yet due to potentially being too expensive to evaluate per document. For instance, convolutional networks and recurrent networks (LSTMs etc) are not supported. We are continually working to add functionality, if you find that we have some glaring omissions, please let us know.

Going forward we are focusing on further improving performance of our tensor framework for important use cases. We’ll follow up this post with one showing how the performance of evaluation in Vespa compares with TensorFlow serving. We will also add more supported frameworks and our next target is ONNX.

You can read more about this feature in the ranking with TensorFlow model in Vespa documentation. We are excited to announce the TensorFlow support, and we’re eager to hear what you are building with it.