Open Sourcing Vespa, Yahoo’s Big Data Processing and Serving Engine

By Jon Bratseth, Distinguished Architect, Vespa

Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo’s big data processing and serving engine, available as open source on GitHub.

Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.

By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, at real time and at internet scale – capabilities that up until now, have been within reach of only a few large companies.

Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user’s query or interests, it won’t do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy to use engine.

With over 1 billion users, we currently use Vespa across many different Oath brands – including, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr, and others – to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalized content and advertisements, to name just a few use cases. In fact, Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds. On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company’s revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).

With Vespa, our teams build applications that:

  • Select content items using SQL-like queries and text search
  • Organize all matches to generate data-driven pages
  • Rank matches by handwritten or machine-learned relevance models
  • Serve results with response times in the low milliseconds
  • Write data in real-time, thousands of times per second per node
  • Grow, shrink, and re-configure clusters while serving and writing data

To achieve both speed and scale, Vespa distributes data and computation over many machines without any single master as a bottleneck. Where conventional applications work by pulling data into a stateless tier for processing, Vespa instead pushes computations to the data. This involves managing clusters of nodes with background redistribution of data in case of machine failures or the addition of new capacity, implementing distributed low latency query and processing algorithms, handling distributed data consistency, and a lot more. It’s a ton of hard work!

As the team behind Vespa, we have been working on developing search and serving capabilities ever since building, which was later acquired by Yahoo. Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience onto a modern technology stack. Vespa is larger in scope and lines of code than any open source project we’ve ever released. Now that this has been battle-proven on Yahoo’s largest and most critical systems, we are pleased to release it to the world.

Vespa gives application developers the ability to feed data and models of any size to the serving system and make the final computations at request time. This often produces a better user experience at lower cost (for buying and running hardware) and complexity compared to pre-computing answers to requests. Furthermore it allows developers to work in a more interactive way where they navigate and interact with complex calculations in real time, rather than having to start offline jobs and check the results later.

Vespa can be run on premises or in the cloud. We provide both Docker images and rpm packages for Vespa, as well as guides for running them both on your own laptop or as an AWS cluster.

We’ll follow up this initial announcement with a series of posts on our blog showing how to build a real-world application with Vespa, but you can get started right now by following the getting started guide in our comprehensive documentation.

Managing distributed systems is not easy. We have worked hard to make it easy to develop and operate applications on Vespa so that you can focus on creating features that make use of the ability to compute over large datasets in real time, rather than the details of managing clusters and data. You should be able to get an application up and running in less than ten minutes by following the documentation.

We can’t wait to see what you’ll build with it!

Q&A from “The Great Search Engine Debate – Elasticsearch, Solr or Vespa?” Meetup

On January 28th, 2021, at 17:00 CET,
Charlie Hull from OpenSource Connections hosted

The Great Search Engine Debate – Elasticsearch, Solr or Vespa? –
a meetup on Haystack LIVE!,
with Anshum Gupta, VP of Apache Lucene, Josh Devins from Elastic and Jo Kristian Bergum from Vespa.

So many great questions were asked that there was no time to go through them all.
This blog post addresses the Vespa-related questions,
with quicklinks into the recording for easy access.
We have also extracted the unanswered questions from the chat log, linking to Vespa resources.
Please let us know if this is useful.
Feel free to follow up with the Vespa Team using the resources at,
Gitter live chat.
You will also find us in the #vespa channel of Relevance Slack.
You can also find Charlie’s summary post at
Solr vs Elasticsearch vs Vespa – what did we learn at The Great Search Engine Debate?.

All three speakers were asked to do a pitch and closing words.
Three things that make you recommend your technology –
see the Vespa pitch and Vespa top three –

  1. Vespa has a great toolbox for modern retrieval, state-of-the-art retrieval/ranking with Machine Learning
  2. Vespa’s indexing architecture allows true partial updates at scale, with high indexing volume – when combined with #1, one can have realtime updated models to make decisions in real time, on updated information
  3. Vespa’s scalability and true elastic content cluster. You don’t have to pre-determine the number of shards. Can go from 1 node to 100 nodes, just add nodes.

Resources: ranking,
reads and writes,
elastic Vespa

Use case differentiator, I am curious if the participants could walk through:
let’s say I have an index with text for search, but also a couple dozen features I intend to use in LTR.
I want to update two of the dozen features across several billion documents because I changed my feature extraction.
How does the engine deal with this?

[ quicklink ].
Common and widely used Vespa use case.
True partial updates of attribute fields which are in-memory, update and evaluate in place –
no need to read the entire document and apply the update and write it to a new index segment
like in Solr/Elasticsearch which builds on Lucene.
Vespa can do 50,000 numeric partial updates per second per node.
Ranking will immediately see the update and use value in computations (search, rank, sorting, faceting).

Resources: ranking,
reads and writes,
elastic Vespa

Much of the popularity around ES and Solr arises from the fact that they are very “approachable” technologies.
It’s simple for a beginner to get started indexing and searching documents at a basic level,
and most importantly, understanding and influencing which documents are returned.
My impression is that the technical entry level for Vespa is much more advanced.
Would you agree or disagree? How would you recommend starting out with Vespa?

[ quicklink ].
Learned a lot from Elasticsearch on developer friendliness,
maybe at 80% ease of use. With Vespa, it’s easy to go from laptop to full cloud deployment.
Use Docker to run Vespa on your laptop.
Use Vespa application package to go from laptop to full size – it is the same config.

Resources: application packages,
getting started,

I have a question regarding Vespa: How is the support for non-English languages regarding tokenizers, stemmers, etc.?
I’m especially interested in German, Russian, Polish, Czech and Hungarian.
How big would be the effort to adapt Solr / OpenNLP resources to use them with Vespa?

[ quicklink ].
Vespa integrates with Apache OpenNLP,
so any language supported by it, Vespa supports it.
It’s easy to integrate with new linguistic libraries and we’ve already received CJK contributions to Vespa.

Resources: linguistics

Which search engine is best for a write-heavy application?
Based on my experience, Elasticsearch read performance is impacted when there are heavy writes.

[ quicklink ].
Vespa moved away from indexing architecture similar to Elasticsearch and Solr,
where it used small immutable index segments that were later merged.
Vespa has a mutable in-memory index in front of immutable index segments.
All IO writes are sequential. No shards. Attributes fields are searchable, in-place updateable.
Efficient use of OS buffer cache for random reads from search.
Real-time indexing with Solr and Elasticsearch creates many immutable segments
which all need to be searched (single threaded execution as well),
so latency is definitively impacted more than with Vespa which has a memory index + larger immutable index.

Resources: reads and writes,

“Join” is always a problem with SOLR/Elasticsearch. How does Vespa handle it?

[ quicklink ].
Supported scalable join is implemented using parent/child relationship.
The parent is a global document – distributed across all nodes in the cluster.
Child documents access attribute in-memory fields imported from parent documents.
Can also use the stateless container, deploy a custom Java searcher, do joins on top of multiple searches.

Resources: parent-child,
Vespa overview,

Can people talk a bit more about kubernetes integrations?

[ quicklink ].
Yes, one can run Vespa on K8s.

Resources: vespa-quick-start-kubernetes

How does Vespa compare to FAISS?

[ quicklink ].
FAISS uses HSNW like Vespa.
FAISS can only nearest neighbor search returning the ID of the vector, very fast.
In Vespa, combine with query filters,
not like the Open Distro for Elasticsearch k-NN plugin
that does post-processing step after retrieving the nearest neighbors.
With a restrictive filter, like last day, might end up with zero documents.
Vespa combines ANN search and filters.

Vespa has hybrid evaluation;
Term-at-a-time (TAAT) which is much more cache friendly, and document-at-a-time (DAAT).
Can evaluate part of the query tree using TAAT,
then search in the HNSW graph using the documents eligible as an input filter.
Including a filter makes ANN a bit slower, but the value it adds makes it worth it.

FAISS is faster as it does not have an HTTP api and distribution layer with realtime updates –
FAISS is a library, batch oriented.

Resources: using-approximate-nearest-neighbor-search-in-real-world-applications,
approximate nearest neighbor, hnsw,
feature tuning

Since Vespa has a different approach, is there anything Vespa is learning from Elastic/Solr/Lucene?
Also the other way around, Elastic/Solr learning from Vespa?

[ quicklink].
Both are great engines! Vespa’s toolbox is bigger.
Learned how Elasticsearch became popular:
developer friendliness, nice APIs, great support for analytics, great for handling immutable data.
Lucene has had a large developer crowd for 20 years.

If I remember correctly, FAISS or similar libraries support indexing/searching with the power of GPU,
how does this compare to Vespa/Elastic/Solr?

[ quicklink ].
Vespa is CPU only, but looking at GPU as pretrained language models grow larger.
GPU easier to use in indexing than serving.
We are trying to find models that run efficiently on GPU. Vespa is written in C++,
making use of OpenBLAS and special instructions to get the most out of CPUs.


Given large language model dominance, in 5 years, how much do we need to support manual relevance tuning operations?
Should that be our focus? Or will search engines just be initial retrieval before sending docs to eg. BERT?

[ quicklink ].
BERT and pretrained language models helps machines understand text better than before,
dramatic progress on ranking, roughly 2x BM25 on multiple Information retrieval datasets.
However more than just text matching and ranking, like click models and site popularity.
In Vespa, ranking with BERT locally on the content nodes,
can combine scoring from language model into LTR framework, taking other signals into account.
There are ways to use BERT that could lead to close to random ranking,
e.g. using BERT as a representation model without fine-tuning for the retrieval task
where there are many many negative (irrelevant) documents.

However, good zero-shot transfer capabilities for interaction based models
has demonstrated strong ranking accuracy on other data sets.
See Pretrained Transformers for Text Ranking: BERT and Beyond.

Resources: from-research-to-production-scaling-a-state-of-the-art-machine-learning-system

Can you speak about the history of Vespa? All top contributors work at Verizon/Yahoo.
Are you aware of prominent Vespa users beside Verizon? Who’s behind Vespa Cloud?
Is there a (larger) ecommerce shop using Vespa in production already?

[ quicklink ]. is run by Verizon Media.
In Verizon Media, Vespa is used for search and recommendation (including APAC e-commerce) + Gemini ad serving stack.
Vespa’s background is from Fast Search and Transfer, founded in 1997 from NTNU in Trondheim, Norway.


What are your plans for growing your communities?
Where should we go to ask questions and have discussion?

[ quicklink ].
#vespa on Stack Overflow,
Gitter channel,
#vespa channel of Relevance Slack.
Asking the extended Vespa team to document use cases / blog posts.


What type of node? Helps me understand 50k/node number

Single value update assign of an int field on a c5d.2xlarge, 8 v-cpu, 16GB, 200G SSD. 49K updates/s.

How does vespa handle search query contain both dense vector + scalar fields?
I.e. internally, it first retrieves top-k doc and then to the filters?

See the How does Vespa compare to FAISS? question above –
filter first, maybe using TAAT for speed, then top-k.
This to ensure low latency and non-empty result sets.

Which engine supports the usage of KNN clustering together with vector similarity queries?

Vespa supports approximate nearest neighbor search using HNSW,
but can also be combined with pre-computed KNN clustering
where vectors have been assigned a cluster id at indexing time.
Using the Vespa ranking framework,
one can combine (approximate) nearest neighbor queries with any other computations.
Using tensors and operations on these, custom algorithms can be built.

Resources: tensor user guide,
approximate nearest neighbor HNSW,

Which engine would you use for real-time systems with emphasis on queries latency?

The Vespa Team has helped implementation of numerous applications
with millisecond latency requirements and update rates in thousands per second per node in Verizon Media.
When the feed operation is ack’ed, the operation is visible.
There is no index refresh delay or immutable batch indexing
as in engines like Solr or Elasticsearch using the batch oriented Lucene library.
Vespa also allows using multiple searcher threads per query to scale latency versus throughput,
functionality which is not exposed in Solr or Elasticsearch.

Does Vespa support IBM ICU libraries? (language processing question as well)

Yes, used in sorting.

For what kind of problem would you recommend Elastic or Solr for (over Vespa)?

See the question above for anything Vespa is learning from Elastic/Solr/Lucene?

Resources: vespa-elastic-solr

Can any of the search engine beat Redis when it comes to read performance? Do we have any benchmarking?

The Vespa Team has not compared Vespa with Redis, as they are built for different use cases.
Vespa is built for Big Data Serving with distributed computations over large, mutable data sets.
Use Redis for in-memory database, cache, and message broker.

All 3 search engines rely on OS file caching for read optimizations.
How does that work in kubernetes when multiple processes/pods/containers are racing against each other for that?

The Vespa Team has not tested specifically for K8s and we would love to learn from the community when tested!
We run multiple Docker multi-process containers on bare-metal and AWS hosts, memory is isolated, but the IO is shared.
We hence monitor IO, but not more than that.

I’d love to read more about the TAAT/DAAT mix and ANN, I don’t follow that yet.
Any chance you can drop a code link or doc link?

See feature-tuning.
We will see if we can publish a paper or article on this subject.

With regard to GPU vs CPU this is also asking “How do you execute code on multi-arch cluster?”.
If you’re on K8s, you may just be calling out across the nodes.
Something like the nboost proxy is an interesting example

Moving computation to where the data lives is the mantra for both Vespa and the Map Reduce paradigm (Hadoop).
This allows scaling latency and throughput without moving data across the wire.
Vespa integrates with many machine learning techniques and allows,
e.g. using the pre-trained language model relevancy score in combination with other core ranking features

Run search engine experiments in Vespa from python

Thiago Martins

Thiago Martins

Vespa Data Scientist

Three ways to get started with pyvespa.

pyvespa provides a python API to Vespa.
The library’s primary goal is to allow for faster prototyping and facilitate Machine Learning experiments for Vespa applications.

UPDATE 2023-02-13: Code examples are updated to work with the latest releases of

There are three ways you can get value out of pyvespa:

  1. You can connect to a running Vespa application.

  2. You can build and deploy a Vespa application using pyvespa API.

  3. You can deploy an application from Vespa config files stored on disk.

We will review each of those methods.

Decorative image

Photo by
Kristin Hillery on

Connect to a running Vespa application

In case you already have a Vespa application running somewhere, you can directly instantiate the Vespa class with the appropriate endpoint. The example below connects to the application:

from vespa.application import Vespa

app = Vespa(url = "")

We are then good to go and ready to interact with the application through pyvespa:

app.query(body = {
  'yql': 'select title from sources * where userQuery()',
  'hits': 1,
  'summary': 'short',
  'timeout': '1.0s',
  'query': 'coronavirus temperature sensitivity',
  'type': 'all',
  'ranking': 'default'
[{'id': 'index:content/1/ad8f0a6204288c0d497399a2',
  'relevance': 0.36920467353113595,
  'source': 'content',
  'fields': {'title': '<hi>Temperature</hi> <hi>Sensitivity</hi>: A Potential Method for the Generation of Vaccines against the Avian <hi>Coronavirus</hi> Infectious Bronchitis Virus'}}]

Build and deploy with pyvespa API

You can also build your Vespa application from scratch using the pyvespa API. Here is a simple example:

from vespa.package import ApplicationPackage, Field, RankProfile

app_package = ApplicationPackage(name = "sampleapp")
        indexing=["index", "summary"], 

We can then deploy app_package to a Docker container
(or directly to VespaCloud):

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

app holds an instance of the Vespa class just like our first example,
and we can use it to feed and query the application just deployed.
This can be useful when we want to fine-tune our application based on Vespa features not available through the pyvespa API.

There is also the possibility to explicitly export app_package to Vespa configuration files (without deploying them):

$ mkdir -p /tmp/sampleapp

Clean up:


Deploy from Vespa config files

pyvespa API provides a subset of the functionality available in Vespa. The reason is that pyvespa is meant to be used as an experimentation tool for Information Retrieval (IR) and not for building production-ready applications. So, the python API expands based on the needs we have to replicate common use cases that often require IR experimentation.

If your application requires functionality or fine-tuning not available in pyvespa, you simply build it directly through Vespa configuration files as shown in many examples on Vespa docs. But even in this case, you can still get value out of pyvespa by deploying it from python based on the Vespa configuration files stored on disk. To show that, we can clone and deploy the news search app covered in this Vespa tutorial:

$ git clone

The Vespa configuration files of the news search app are stored in the sample-apps/news/app-3-searching/ folder:

$ tree sample-apps/news/app-3-searching/
├── schemas/
│   └──
└── services.xml

1 directory, 2 files

We can then deploy to a Docker container from disk:

from vespa.deployment import VespaDocker

vespa_docker_news = VespaDocker()
app = vespa_docker_news.deploy_from_disk(
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

Again, app holds an instance of the Vespa class just like our first example,
and we can use it to feed and query the application just deployed.

Clean up:


Final thoughts

We covered three different ways to connect to a Vespa application from python using the pyvespa library. Those methods provide great workflow flexibility. They allow you to quickly get started with pyvespa experimentation while enabling you to modify Vespa config files to include features not available in the pyvespa API without losing the ability to experiment with the added features.