Blog search application in Vespa

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.


This is the first of a series of blog posts where data from (WP) is used to highlight how Vespa can be used to store, search and recommend blog posts. The data was made available during a Kaggle challenge to predict which blog posts someone would like based on their past behavior. It contains many ingredients that are necessary to showcase needs, challenges and possible solutions that are useful for those interested in building and deploying such applications in production.

The end goal is to build an application where:

  • Users will be able to search and manipulate the pool of blog posts available.
  • Users will get blog post recommendations from the content pool based on their interest.

This part addresses:

  • A description of the dataset used, as well as any information connected to the data.
  • How to set up a basic blog post search engine using Vespa.

The next parts show how to extend this basic search engine application with machine learned models to create a blog recommendation engine.


The dataset contains blog posts written by WP bloggers and actions, in this case ‘likes’, performed by WP readers in blog posts they have interacted with. The dataset is publicly available at Kaggle and was released during a challenge to develop algorithms to help predict which blog posts users would most likely ‘like’ if they were exposed to them. The data includes these fields per blog post:

  • post_id – unique numerical id identifying the blog post
  • date_gmt – string representing the date of blog post creation in GMT format yyyy-mm-dd hh:mm:ss
  • author – unique numerical id identifying the author of the blog post
  • url – blog post URL
  • title – blog post title
  • blog – unique numerical id identifying the blog that the blog post belongs to
  • tags – array of strings representing the tags of the blog post
  • content – body text of the blog post, in HTML format
  • categories – array of strings representing the categories the blog post was assigned to

For the user actions:

  • post_id – unique numerical id identifying the blog post
  • uid – unique numerical id identifying the user that liked post_id
  • dt – date of the interaction in GMT format yyyy-mm-dd hh:mm:ss
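Putting the two field lists together, one post record and one action record could look like the sketch below (all values are invented for illustration; the field names are from the lists above):

```json
{
  "post_id": "1750271",
  "date_gmt": "2012-02-01 13:07:00",
  "author": "348211",
  "url": "http://myblog.wordpress.com/2012/02/01/my-post/",
  "title": "A post about music festivals",
  "blog": "80102",
  "tags": ["music", "festival"],
  "content": "<p>Body text of the blog post ...</p>",
  "categories": ["Entertainment"]
}

{ "post_id": "1750271", "uid": "42", "dt": "2012-02-02 09:15:00" }
```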

Downloading raw data

For the purposes of this post, it is sufficient to use the first release of training data that consists of 5 weeks of posts as well as all the ‘like’ actions that occurred during those 5 weeks.

This first release of training data is available here – once downloaded, unzip it. The 1,196,111 line trainPosts.json will be our practice document data. This file is around 5GB in size.


Indexing the full data set requires 23GB disk space. We have tested with a Docker container with 10GB RAM. We used similar settings as described in the Vespa quick start guide. As in the guide, we assume that the $VESPA_SAMPLE_APPS env variable points to the directory with your local clone of the Vespa sample apps:

$ docker run -m 10G --detach --name vespa --hostname vespa --privileged --volume $VESPA_SAMPLE_APPS:/vespa-sample-apps --publish 8080:8080 vespaengine/vespa

Searching blog posts

Functional specification:

  • Blog post title, content, tags and categories must all be searchable
  • Allow blog posts to be sorted by both relevance and date
  • Allow grouping of search results by tag or category

In terms of data, Vespa operates with the notion of documents. A document represents a single, searchable item in your system, e.g., a blog post, a photo, or a news article. Each document type must be defined in the Vespa configuration through a search definition. Think of a search definition as being similar to a table definition in a relational database; it consists of a set of fields, each with a given name, a specific type, and some optional properties.

As an example, for this simple blog post search application, we could create the document type blog_post with the following fields:

  • url – of type uri
  • title – of type string
  • content – of type string (string fields can be of any length)
  • date_gmt – of type string (to store the creation date in GMT format)

The data fed into Vespa must match the structure of the search definition, and the hits returned when searching will be in this format as well.

Application Packages

A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done and how data will be processed during feeding and indexing. The search definition is a required part of an application package — the other required files are services.xml and hosts.xml.

The sample application blog search creates a simple but functional blog post search engine. The application package is found in src/main/application.

Services Specification

services.xml defines the services that make up the Vespa application — which services to run and how many nodes per service:

<?xml version='1.0' encoding='UTF-8'?>
<services version='1.0'>

  <container id='default' version='1.0'>
      <search/>
      <document-api/>
      <nodes>
          <node hostalias="node1"/>
      </nodes>
  </container>

  <content id='blog_post' version='1.0'>
      <redundancy>1</redundancy>
      <documents>
          <document mode="index" type="blog_post"/>
      </documents>
      <nodes>
          <node hostalias="node1"/>
      </nodes>
  </content>

</services>
  • <container> defines the container cluster for document, query and result processing
  • <search> sets up the search endpoint for Vespa queries. The default port is 8080.
  • <document-api> sets up the document endpoint for feeding.
  • <nodes> defines the nodes required per service. (See the reference for more on container cluster setup.)
  • <content> defines how documents are stored and searched
  • <redundancy> denotes how many copies to keep of each document.
  • <documents> assigns the document types in the search definition — the content cluster capacity can be increased by adding node elements — see elastic Vespa. (See also the reference for more on content cluster setup.)
  • <nodes> defines the hosts for the content cluster.

Deployment Specification

hosts.xml contains a list of all the hosts/nodes that are part of the application, with an alias for each of them. Here we use a single node:

<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <host name="localhost">
    <alias>node1</alias>
  </host>
</hosts>

Search Definition

The blog_post document type mentioned in src/main/application/services.xml is defined in the search definition. src/main/application/searchdefinitions/ contains the search definition for a document of type blog_post:

search blog_post {

    document blog_post {

        field date_gmt type string {
            indexing: summary
        }

        field language type string {
            indexing: summary
        }

        field author type string {
            indexing: summary
        }

        field url type string {
            indexing: summary
        }

        field title type string {
            indexing: summary | index
        }

        field blog type string {
            indexing: summary
        }

        field post_id type string {
            indexing: summary
        }

        field tags type array<string> {
            indexing: summary
        }

        field blogname type string {
            indexing: summary
        }

        field content type string {
            indexing: summary | index
        }

        field categories type array<string> {
            indexing: summary
        }

        field date type int {
            indexing: summary | attribute
        }

    }

    fieldset default {
        fields: title, content
    }

    rank-profile post inherits default {
        first-phase {
            expression: nativeRank(title, content)
        }
    }

}

document is wrapped inside another element called search. The name following these elements, here blog_post, must be exactly the same for both.

The field property indexing configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing — see indexing language. Each part of the indexing pipeline is separated by the pipe character ‘|’.

Deploy the Application Package

Once done with the application package, deploy the Vespa application — build and start Vespa as in the quick start. Deploy the application:

$ cd /vespa-sample-apps/blog-search
$ vespa-deploy prepare src/main/application && vespa-deploy activate

This prints that the application was activated successfully and also the checksum, timestamp and generation for this deployment (more on that later). Pointing a browser to http://localhost:8080/ApplicationStatus returns JSON-formatted information about the active application, including its checksum, timestamp and generation (and should be the same as the values when vespa-deploy activate was run). The generation will increase by 1 each time a new application is successfully deployed, and is the easiest way to verify that the correct version is active.

The Vespa node is now configured and ready for use.

Feeding Data

The data fed to Vespa must match the search definition for the document type. The data downloaded from Kaggle, contained in trainPosts.json, must be converted to a valid Vespa document format before it can be fed to Vespa. Find a parser in the utility repository. Since the full data set is unnecessarily large for the purposes of this first part of this post, we use only the first 10,000 lines of it, but feel free to load all 1.1M entries:

$ head -10000 trainPosts.json > trainPostsSmall.json
$ python trainPostsSmall.json > feed.json

Send this to Vespa using one of the tools Vespa provides for feeding. Here we will use the Java feeding API:

$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --verbose --file feed.json --host localhost --port 8080

Note that in the sample-apps/blog-search directory, there is a file with sample data. You may also feed this file using this method.

Track feeding progress

Use the Metrics API to track number of documents indexed:

$ curl -s 'http://localhost:19112/state/v1/metrics' | tr ',' '\n' | grep -A 2 proton.doctypes.blog_post.numdocs

You can also inspect the search node state with:

$ vespa-proton-cmd --local getState  

Fetch documents

Fetch documents by document id using the Document API:

$ curl -s 'http://localhost:8080/document/v1/blog-search/blog_post/docid/1750271' | python -m json.tool

The first query

Searching with Vespa is done using HTTP GET requests of the form:

http://localhost:8080/search/?yql=<yql query>

The only mandatory parameter is the query, given by yql=<yql query>. More details can be found in the Search API.

Given the above search definition, where the fields title and content are part of the fieldset default, any document containing the word “music” in one or more of these two fields matches our query below:

$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22%3B' | python -m json.tool

Looking at the output, please note:

  • The field documentid in the output and how it matches the value we assigned to each put operation when feeding data to Vespa.
  • Each hit has a property named relevance, which indicates how well the given document matches our query, using a pre-defined default ranking function. You have full control over ranking — more about ranking and ordering later. The hits are sorted by this value.
  • When multiple hits have the same relevance score their internal ordering is undefined. However, their internal ordering will not change unless the documents are re-indexed.
  • Add &tracelevel=9 to dump query parsing details

Other examples


$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+title+contains+%22music%22%3B' | python -m json.tool

Once more a search for the single term “music”, but this time with the explicit field title. This means that we only want to match documents that contain the word “music” in the field title. As expected, you will see fewer hits for this query than for the previous one.


$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22+AND+default+contains+%22festival%22%3B' | python -m json.tool

This is a query for the two terms “music” and “festival”, combined with an AND operation; it finds documents that match both terms — not just one of them.


$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+sddocname+contains+%22blog_post%22%3B' | python -m json.tool

This is a single-term query in the special field sddocname for the value “blog_post”. This is a common and useful Vespa trick to get the number of indexed documents for a certain document type (search definition): sddocname is a special and reserved field which is always set to the name of the document type for a given document. The documents are all of type blog_post, and will therefore automatically have the field sddocname set to that value.

This means that the query above really means “Return all documents of type blog_post”, and as such all documents in the index are returned.

Efficient personal search at large scale

Vespa includes a relatively unknown mode which provides personal search at massive scale for a fraction of the cost of alternatives: streaming search. In this article we explain streaming search and how to use it.

Imagine you are tasked with building the next Gmail, a massive personal data store centered around search. How do you do it? An obvious answer is to just use a regular search engine, write all documents to a big index and simply restrict queries to match documents belonging to a single user.

This works, but the problem is cost. Successful personal data stores have a tendency to become massive — the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to raw data means paying for extra disk space for all this data and paying for the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems like Gmail handle billions of writes per day so this quickly becomes the dominating cost of the entire system.

However, when you think about it there’s really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we just maintain a separate small index per user? This makes both index updates and queries cheaper, but leads to a new problem: Writes will arrive randomly over all users, which means we’ll need to read and write a user’s index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale means either using a few thousand spinning disks, or storing all data on SSDs. Both options are expensive.

Large scale data stores already solve this problem for appending writes, by using some variant of multilevel log storage. Could we leverage this to layer the index on top of a data store like that? That helps, but means we need to do our own development to put these systems together in a way that performs at scale every time for both queries and writes. And we still pay the cost of storing the indexes in addition to the raw user data.

Do we need indexes at all though? With some reflection, it turns out that we don’t. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, of course at the considerable cost of maintaining those indexes. In personal search however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together we can achieve search with low latency by simply reading the data at query time — what we call streaming search. In most cases, most subsets of data (i.e. most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.


How many documents can be searched per node per second with this solution? Assuming a node with 500 MB/sec read speed (either from an SSD or multiple spinning disks), and 1 KB average compressed document size, the disk can search at most 500 MB/sec ÷ 1 KB/doc = 500,000 docs/sec. If each user stores 1000 documents on average this gives a max throughput per node of 500 queries/second. This is not an exact computation since we disregard time used to seek and write, and inefficiency from reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective.

What about latency? From the calculation above we see that the latency from finding the matching documents will be 2 ms on average. However, we usually care more about the 99% latency (or similar). This will be driven by large users, which need to be split among multiple nodes streaming in parallel. The max data size per node is then a tradeoff between latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store max 50,000 documents per user per node such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism and hence latency for the very largest users. For example, with 20 nodes in total in a cluster we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.
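The back-of-envelope numbers above can be reproduced with a few lines of Python (the constants are the assumptions stated in the text):

```python
# Streaming-search capacity model, using the assumptions from the text.
read_speed_bytes = 500_000_000   # 500 MB/sec sustained read per node
doc_size_bytes = 1_000           # 1 KB average compressed document

# Max documents streamed per node per second
docs_per_sec = read_speed_bytes // doc_size_bytes        # 500,000 docs/sec

# Throughput for an average user with 1000 documents
docs_per_user = 1_000
max_qps_per_node = docs_per_sec // docs_per_user         # 500 queries/sec

# Average latency: time to stream one average user's documents
avg_latency_sec = docs_per_user / docs_per_sec           # ~2 ms

# Worst-case per-node latency if we cap each user at 50,000 docs per node
max_docs_per_user_per_node = 50_000
worst_latency_sec = max_docs_per_user_per_node / docs_per_sec   # 100 ms

# Largest supportable single user at that latency, with 20 nodes
nodes = 20
max_user_docs = nodes * max_docs_per_user_per_node       # 1,000,000 docs
```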

All right — with this we have our cost-effective solution to implement the next Gmail: Store just the raw data of users, in a log-level store. Locate the data of each user on a single node in the system for locality (or, really 2–3 nodes for redundancy), but split over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be cheap and efficient, but it sounds like a lot of work! It sure would be nice if somebody already did all of it, ran it at large scale for years and then released it as open source.

Well, as luck would have it we already did this in Vespa. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa. When this solution is compared to indexed search in Vespa or more complicated sharding solutions in Elastic Search for personal search applications, we typically see about an order of magnitude reduction in cost of achieving a system which can sustain the query and update rates needed by the application with stable latencies over long time periods. It has been used to implement various applications such as storing and searching massive amounts of mails, personal typeahead suggestions, personal image collections, and private forum group content.

Using streaming search on Vespa

The steps to using streaming search on Vespa are:

  • Set streaming mode for the document type(s) in question in services.xml.
  • Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the document id, e.g. id:mynamespace:mydocumenttype:g=user123:doc123
  • Pass the group id in queries by setting the query property streaming.groupname in queries.
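In services.xml, the change from an indexed setup is a single attribute on the document element; a minimal sketch, assuming a document type named blog_post:

```xml
<content id='blog_post' version='1.0'>
    <documents>
        <!-- mode="streaming" instead of mode="index" -->
        <document mode="streaming" type="blog_post"/>
    </documents>
    <nodes>
        <node hostalias="node1"/>
    </nodes>
</content>
```

At query time, adding streaming.groupname=user123 to the search request then restricts the search to that user’s documents.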

That’s it! With those steps you have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any alternative out there, with full support for structured and text search, advanced relevance including natural language and machine-learned models, and powerful grouping and aggregation for features like faceting. For more details see the documentation on streaming search. Have fun with it, and as usual let us know what you are building!

Vespa Product Updates, February 2019: Boolean Field Type, Environment Variables, and Advanced Search Core Tuning

Kristian Aune

Kristian Aune

Head of Customer Success, Vespa

In last month’s Vespa update, we mentioned Parent/Child, Large File Config Download, and a Simplified Feeding Interface. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to helpful feedback and contributions from the community, Vespa continues to grow.

This month, we’re excited to share the following updates:

Boolean field type

Vespa has released a boolean field type in #6644. This feature was requested by the open source community and is targeted for applications that have many boolean fields. This feature reduces memory footprint to 1/8 for the fields (compared to byte) and hence increases query throughput / cuts latency. Learn more about choosing the field type here.
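In a search definition, such a field is declared like any other; a sketch with a hypothetical field name, stored as an attribute so it can be used in filtering and ranking:

```
field is_in_stock type bool {
    indexing: summary | attribute
}
```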

Environment variables

The Vespa Container now supports setting environment variables in services.xml. This is useful if the application uses libraries that read environment variables.

Advanced search core tuning

You can now configure index warmup – this reduces high-latency requests at startup. Also, reduce spiky memory usage when attributes grow using resizing-amortize-count – the default is changed to provide smoother memory usage. This uses less transient memory in growing applications. More details surrounding search core configuration can be explored here.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to see.

E-commerce search and recommendation with Vespa

Holiday shopping season is upon us and it’s time for a blog post on e-commerce search and recommendation using Vespa. Vespa is used as the search and recommendation backend at multiple Yahoo e-commerce sites in Asia.

This blog post discusses some of the challenges in e-commerce search and recommendation, and shows how they can be solved using the features of Vespa.


Photo by Jonas Leupe on Unsplash

E-commerce search has text ranking requirements where traditional text ranking features like BM25 or TF-IDF might produce poor results. For an introduction to some of the issues with TF-IDF/BM25, see the influence of TF-IDF algorithms in e-commerce search. One example from that blog post is a search for ipad 2, which with traditional TF-IDF ranking will rank ‘black mini ipad cover, compatible with ipad 2’ higher than ‘Ipad 2’, as the former product description has several occurrences of the query terms Ipad and 2.

Vespa allows developers and relevancy engineers to fine-tune the text ranking features to meet domain-specific ranking challenges. For example, developers can control whether multiple occurrences of a query term in the matched text should impact the relevance score. See text ranking occurrence tables and Vespa text ranking types for in-depth details. The Vespa text ranking features also take text proximity into account in the relevancy calculation, i.e. how close the query terms appear in the matched text. BM25/TF-IDF, on the other hand, does not take query term proximity into account at all. Vespa also implements BM25, but it’s up to the relevancy engineer to choose which of the rich set of built-in text ranking features in Vespa to use.

Vespa uses OpenNLP for linguistic processing like tokenization and stemming with support for multiple languages (as supported by OpenNLP).

Your manager might tell you that these items of the product catalog should be prominent in the search results. How to tackle this with your existing search solution? Maybe by adding some synthetic query terms to the original user query, maybe by using separate indexes with federated search, or even with a key-value store which is rarely in sync with the product catalog search index?

With Vespa it’s easy to promote content as Vespa’s ranking framework is just math and allows the developer to formulate the relevancy scoring function explicitly without having to rewrite the query formulation. Vespa controls ranking through ranking expressions configured in rank profiles which enables full control through the expressive Vespa ranking expression language. The rank profile to use is chosen at query time so developers can design multiple ranking profiles to rank documents differently based on query intent classification. See later section on query classification for more details how query classification can be done with Vespa.
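Selecting a rank profile at query time is done with the ranking query parameter; e.g., assuming a profile named category_browse has been defined in the application package:

```
$ curl -s 'http://localhost:8080/search/?yql=<yql query>&ranking=category_browse'
```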

A sample ranking profile which implements a tiered relevance scoring function where sponsored or promoted items are always ranked above non-sponsored documents is shown below. The ranking profile is applied to all documents which match the query formulation, and the relevance score of the hit is assigned the value of the first-phase expression. Vespa also supports multi-phase ranking.

Sample hand crafted ranking profile defined in the Vespa application package.
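A sketch of such a tiered scoring function, assuming a hypothetical boolean attribute field named sponsored, could look like:

```
rank-profile commerce inherits default {
    first-phase {
        # A large constant boost puts sponsored items above all organic hits;
        # within each tier, hits are ordered by text relevance.
        expression: 1000000 * attribute(sponsored) + nativeRank(title, content)
    }
}
```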

The above example is hand crafted, but for optimal relevance we do recommend looking at learning-to-rank (LTR) methods. See Learning to Rank using TensorFlow Ranking and Learning to Rank using XGBoost. The trained MLR models can be used in combination with the specific business ranking logic. In the example above we could replace the default ranking function with the trained MLR model, hence combining business logic with MLR models.

Guiding the user through the product catalog by guided navigation or faceted search is a feature which users expect from an e-commerce search solution today, and with Vespa, facets and guided navigation are easily implemented with the powerful Vespa Grouping Language.

Sample screenshot from Vespa e-commerce sample application UI demonstrating search facets using Vespa Grouping Language.

The Vespa grouping language supports deep nested grouping and aggregation operations over the matched content. The language also allows pagination within the group(s). For example if grouping hits by category and displaying top 3 ranking hits per category the language allows paginating to render more hits from a specified category group.
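For example, grouping hits by a hypothetical category field and rendering the top 3 ranking hits per category could be expressed like this in YQL (a sketch, not tied to any particular schema):

```
select * from sources * where default contains "shoes" |
    all(group(category) each(max(3) each(output(summary()))));
```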

Studies (e.g. this study from FlipKart) find that there is a significant fraction of queries in e-commerce search which suffer from vocabulary mismatch between the user query formulation and the relevant product descriptions in the product catalog. For example, the query “ladies pregnancy dress” would not match a product with description “women maternity gown” due to vocabulary mismatch between the query and the product description. Traditional Information Retrieval (IR) methods like TF-IDF/BM25 would fail to retrieve the relevant product right off the bat.

Most techniques currently used to try to tackle the vocabulary mismatch problem are built around query expansion. With the recent advances in NLP using transfer learning with large pre-trained language models, we believe that future solutions will be built around multilingual semantic retrieval using text embeddings from pre-trained deep neural network language models. Vespa has recently announced a sample application on semantic retrieval which addresses the vocabulary mismatch problem as the retrieval is not based on query terms alone, but instead based on the dense text tensor embedding representation of the query and the document. The mentioned sample app reproduces the accuracy of the retrieval model described in the Google blog post about Semantic Retrieval.

Using our query and product title example from the section above, which suffers from vocabulary mismatch, and moving from the textual representation to the respective dense tensor embedding representations, we find that the semantic similarity between them is high (0.93). The high semantic similarity means that the relevant product would be retrieved when using semantic retrieval. The semantic similarity is in this case defined as the cosine similarity between the dense tensor embedding representations of the query and the product description. Vespa has strong support for expressing and storing tensor fields, over which one can perform tensor operations (e.g. cosine similarity) for ranking; this functionality is demonstrated in the mentioned sample application.
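Cosine similarity itself is straightforward; a small illustration with toy 3-dimensional vectors (real sentence embeddings from models like the Universal Sentence Encoder have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.8, 0.6]      # toy embedding of a query
product_vec = [0.2, 0.7, 0.7]    # toy embedding of a product description

similarity = cosine_similarity(query_vec, product_vec)
```

Identical directions give similarity 1.0, orthogonal vectors give 0.0; semantically close texts end up near the top of that range.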

Below is a simple matrix comparing the semantic similarity of three pairs of (query, product description). The tensor embeddings of the textual representations are obtained with the Universal Sentence Encoder from Google.

Semantic similarity matrix of different queries and product descriptions.

The Universal Sentence Encoder Model from Google is multilingual as it was trained on text from multiple languages. Using these text embeddings enables multilingual retrieval so searches written in Chinese can retrieve relevant products by descriptions written in multiple languages. This is another nice property of semantic retrieval models which is particularly useful in e-commerce search applications with global reach.

Vespa supports deploying stateless machine learned (ML) models, which comes in handy when doing query classification. Machine learned models which classify the query are commonly used in e-commerce search solutions, and the recent advances in natural language processing (NLP) using pre-trained deep neural language models have improved the accuracy of text classification models significantly. See e.g. text classification using BERT for an illustrated guide to text classification using BERT. Vespa supports deploying ML models built with TensorFlow, XGBoost and PyTorch through the Open Neural Network Exchange (ONNX) format. ML models trained with the mentioned tools can successfully be used for various query classification tasks with high accuracy.

In e-commerce search, classifying the intent of the query or query session can help rank the results by using an intent-specific ranking profile which is tailored to the specific query intent. The intent classification can also determine how the result page is displayed and organised.

Consider a category browse intent query like ‘shoes for men’. Such a query might benefit from a query rewrite which limits the result set to items matching the unambiguous category id, instead of just searching the product description or category fields for ‘shoes for men’. Ranking could also change based on the query classification, by using a ranking profile which gives higher weight to signals like popularity or price than to text ranking features.

Vespa also features a powerful query rewriting language which supports rule based query rewrites, synonym expansion and query phrasing.

Vespa is commonly used for recommendation use cases and e-commerce is no exception.

Vespa is able to evaluate complex Machine Learned (ML) models over many data points (documents, products) in user time, which allows the ML model to use real time signals derived from the current user’s online shopping session (e.g. products browsed, queries performed, time of day) as model features. An offline batch oriented inference architecture would not be able to use these important real time signals. By batch oriented inference architecture we mean pre-computing the inference offline for a set of users or products, where the model inference results are stored in a key-value store for online retrieval.

Update 2021-05-20: Blog recommendation tutorials are replaced by the
News search and recommendation tutorial:

In our blog recommendation tutorial
we demonstrate how to apply a collaborative filtering model for content recommendation and in
part 2 of the blog recommendation tutorial
we show how to use a neural network trained with TensorFlow to serve recommendations in user time. Similar recommendation approaches are used with success in e-commerce.

Keeping your e-commerce index up to date with real time updates

Vespa is designed for horizontal scaling with high sustainable write and read throughput and low, predictable latency. Updating the product catalog in real time is of critical importance for e-commerce applications, as the real time information is used in retrieval filters and also as ranking signals. The product description or product title rarely changes, but meta information like inventory status, price and popularity are real time signals which will improve relevance when used in ranking. Having the inventory status reflected in the search index also avoids retrieving content which is out of stock.

Vespa has true native support for partial updates, where there is no need to re-index the entire document, only the changed fields. Real-time partial updates can be done at scale against attribute fields, which are stored and updated in memory. Attribute fields in Vespa can be updated at rates up to about 40-50K updates/s per content node.
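As an illustration, a partial update only names the fields to change. Below is a minimal sketch of building such an update body for Vespa’s /document/v1 API, assuming a hypothetical product document type with price and inventory_status attribute fields:

```python
import json

# Sketch of a partial-update request body. The "product" document type and
# its "price" / "inventory_status" attribute fields are hypothetical.
def make_partial_update(price, in_stock):
    # Only the listed fields are changed; the rest of the document is untouched.
    return {
        "fields": {
            "price": {"assign": price},
            "inventory_status": {"assign": "in_stock" if in_stock else "out_of_stock"},
        }
    }

# This body would be sent with HTTP PUT to
# /document/v1/<namespace>/product/docid/<product-id>
payload = json.dumps(make_partial_update(49.99, True))
```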

Using Vespa’s support for predicate fields it’s easy to control when content is, and is not, surfaced in search results. The predicate field type allows the content (e.g. a document) to express whether it should match the query, instead of the other way around. For e-commerce search and recommendation we can use predicate expressions to control how product campaigns are surfaced in search results. Some examples of what predicate fields can be used for:

  • Only match and retrieve the document if time of day is in the range 8–16 or range 19–20 and the user is a member. This could be used for promoting content for certain users, controlled by the predicate expression stored in the document. The time of day and member status is passed with the query.
  • Represent recurring campaigns with multiple time ranges.

Above examples are by no means exhaustive as predicates can be used for multiple campaign related use cases where the filtering logic is expressed in the content.
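To make the idea concrete, here is a toy, in-memory illustration of document-side matching (this is not Vespa’s actual predicate syntax, just the concept): each document stores a boolean condition over query-time attributes, and retrieval keeps only documents whose condition evaluates to true for the current query.

```python
docs = [
    # promoted only during 8-16 or 19-20, and only for members
    {"id": 1, "match": lambda q: (8 <= q["hour"] <= 16 or 19 <= q["hour"] <= 20) and q["member"]},
    # always eligible
    {"id": 2, "match": lambda q: True},
]

def matching_ids(docs, query_attrs):
    # keep only documents whose stored condition accepts the query attributes
    return [d["id"] for d in docs if d["match"](query_attrs)]
```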

Are you worried that your current search installation will break under the traffic surge associated with the holiday shopping season? Are your cloud VMs running high on disk-busy metrics already? What about those long GC pauses in the JVM old generation causing your 95th-percentile latency to go through the roof? Needless to say, any downtime or latency spike caused by a slow search backend during peak season is costly.

Learning to Rank with Vespa – Getting started with Text Search

We have just published two tutorials to help people get started with text search applications by building scalable solutions with Vespa.
The tutorials are based on the full document ranking task
released by Microsoft’s MS MARCO dataset team.

The first tutorial helps you create and deploy a basic text search application with Vespa, as well as download, parse and feed the dataset to a running Vespa instance. It also shows how easy it is to experiment with ranking functions based on built-in rank features available in Vespa.

The second tutorial shows how to create a training dataset containing Vespa ranking features that allow you to start training ML models to improve the app’s ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context.

Both tutorials are detailed and come with code available to reproduce the steps. Here are the highlights.

Basic text search app in a nutshell

The main task when creating a basic app with Vespa is to write a search definition file containing information about the data you want to feed to the application and how Vespa should match and order the results returned in response to a query.

Apart from some additional details described in the tutorial, the search definition for our text search engine looks like the code snippet below. We have a title and a body field containing information about the documents available to be searched. The fieldset keyword indicates that our query will match documents by searching for query words in both the title and body fields. Finally, we have defined two rank-profiles, which control how the matched documents will be ranked. The default rank-profile uses nativeRank, which is one of many built-in rank features available in Vespa. The bm25 rank-profile uses the widely known BM25 rank feature.

search msmarco {
    document msmarco {
        field title type string {
            indexing: index | summary
        }
        field body type string {
            indexing: index | summary
        }
    }
    fieldset default {
        fields: title, body
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body)
        }
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body)
        }
    }
}

When we have more than one rank-profile defined, we can choose which one to use at query time by including the ranking parameter in the query:

curl -s "<URL>/search/?query=what+is+dad+bod"
curl -s "<URL>/search/?query=what+is+dad+bod&ranking=bm25"

The first query above does not specify the ranking parameter and will therefore use the default rank-profile. The second query explicitly asks for the bm25 rank-profile to be used instead.

Having multiple rank-profiles allows us to experiment with different ranking functions. There is one relevant document for each query in the MSMARCO dataset. The figure below is the result of an evaluation script that sent more than 5,000 queries to our application and asked for results using both rank-profiles described above. We then tracked the position of the relevant document for each query and plotted the distribution for the first 10 positions.

It is clear that the bm25 rank-profile does a much better job in this case. It places the relevant document in the first positions much more often than the default rank-profile.
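The evaluation logic described above can be sketched as follows, assuming an illustrative search(query, profile) helper that returns an ordered list of document ids (the names here are hypothetical, not the actual evaluation script):

```python
from collections import Counter

def position_distribution(queries, relevant_ids, search, profile, max_pos=10):
    # tally the 1-based position of the relevant document per query
    counts = Counter()
    for query, relevant_id in zip(queries, relevant_ids):
        hits = search(query, profile)[:max_pos]
        if relevant_id in hits:
            counts[hits.index(relevant_id) + 1] += 1
    return counts

# toy usage with a fake search function standing in for real queries
fake_search = lambda query, profile: ["d2", "d1", "d3"]
positions = position_distribution(["q1", "q2"], ["d1", "d2"], fake_search, "bm25")
```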

Data collection sanity check

After setting up a basic application, we likely want to collect rank feature data to help improve our ranking functions. Vespa allows us to return rank features along with query results, which enables us to create training datasets that combine relevance information with search engine rank information.

There are different ways to create a training dataset in this case. Because of this, we believe it is a good idea to have a sanity check established before we start to collect the dataset. The goal of such a sanity check is to increase the likelihood that we catch bugs early and create datasets containing the right information for our task of improving ranking functions.

Our proposal is to use the dataset to train a model using the same features and functional form used by the baseline you want to improve upon. If the dataset is well built and contains useful information about the task you are interested in, you should be able to get results at least as good as those obtained by your baseline on a separate test set.

Since our baseline in this case is the bm25 rank-profile, we should fit a linear model containing only the bm25 features:

a + b * bm25(title) + c * bm25(body)
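As a sketch of this sanity check, the linear model can be fit with plain logistic regression via gradient descent on binary cross-entropy. The feature rows below are made-up (bm25(title), bm25(body)) pairs with a binary relevance label, just to illustrate the procedure:

```python
import math

def fit_linear(rows, labels, lr=0.1, steps=2000):
    # a is the intercept, b and c the weights on bm25(title) and bm25(body)
    a = b = c = 0.0
    for _ in range(steps):
        for (t, bd), y in zip(rows, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * t + c * bd)))
            grad = p - y  # gradient of binary cross-entropy w.r.t. the logit
            a -= lr * grad
            b -= lr * grad * t
            c -= lr * grad * bd
    return a, b, c

rows = [(2.5, 4.1), (0.1, 0.2), (1.9, 3.5), (0.3, 0.4)]
labels = [1, 0, 1, 0]
a, b, c = fit_linear(rows, labels)
```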

Having this simple procedure in place helped us catch a few silly bugs in our data collection code and got us on the right track faster than would otherwise have happened. Bugs in your data are hard to catch once you begin experimenting with complex models, as you never know whether a problem comes from the data or the model. So this is a practice we highly recommend.

How to create a training dataset with Vespa

Asking Vespa to return ranking features in the result set is as simple as setting the ranking.listFeatures parameter to true in the request. Below is the body of a POST request that specifies the query in YQL format and enables rank feature dumping.

body = {
    "yql": 'select * from sources * where (userInput(@userQuery));',
    "userQuery": "what is dad bod",
    "ranking": {"profile": "bm25", "listFeatures": "true"},
}

Vespa returns a number of ranking features by default, but we can explicitly define which features we want by creating a rank-profile, asking it to ignore-default-rank-features, and listing the features we want with the rank-features keyword, as shown below. The random first-phase expression will be used when sampling random documents to serve as a proxy for non-relevant documents.

rank-profile collect_rank_features inherits default {

    first-phase {
        expression: random
    }

    ignore-default-rank-features

    rank-features {
        # the features to collect, for example:
        bm25(title)
        bm25(body)
    }

}
We want a dataset that will help train models that will generalize well when running on a Vespa instance. This implies that we are only interested in collecting documents that are matched by the query because those are the documents that would be presented to the first-phase model in a production environment. Here is the data collection logic:

hits = get_relevant_hit(query, rank_profile, relevant_id)
if hits:  # the relevant document was matched by the query
    hits.extend(get_random_hits(query, rank_profile, n_samples))
    data = annotate_data(hits, query_id, relevant_id)
    append_data(file, data)

For each query, we first send a request to Vespa to get the relevant document associated with the query. If the relevant document is matched by the query, Vespa will return it and we will expand the number of documents associated with the query by sending a second request to Vespa. The second request asks Vespa to return a number of random documents sampled from the set of documents that were matched by the query.

We then parse the hits returned by Vespa and organize the data into a tabular form containing the rank features and a binary variable indicating whether the query-document pair is relevant or not. At the end we have a tabular dataset with one row per query-document pair. More details can be found in our second tutorial.
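A hypothetical version of the annotate_data step could look like the sketch below. The hit layout follows Vespa’s result format with rank features under fields["rankfeatures"] when ranking.listFeatures is enabled; the "id" field name is illustrative.

```python
def annotate_data(hits, query_id, relevant_id):
    # flatten Vespa hits into tabular rows: ids, relevance label, rank features
    rows = []
    for hit in hits:
        row = {
            "query_id": query_id,
            "doc_id": hit["fields"]["id"],
            "relevant": int(hit["fields"]["id"] == relevant_id),
        }
        row.update(hit["fields"]["rankfeatures"])
        rows.append(row)
    return rows

fake_hits = [
    {"fields": {"id": "d1", "rankfeatures": {"bm25(title)": 2.5, "bm25(body)": 4.1}}},
    {"fields": {"id": "d7", "rankfeatures": {"bm25(title)": 0.1, "bm25(body)": 0.2}}},
]
rows = annotate_data(fake_hits, query_id="q1", relevant_id="d1")
```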

Beyond pointwise loss functions

The most straightforward way to train the linear model suggested in our data collection sanity check would be to use a vanilla logistic regression, since our target variable relevant is binary. The most commonly used loss function in this case (binary cross-entropy) is referred to as a pointwise loss function in the LTR literature, as it does not take the relative order of documents into account.

However, as we described in our first tutorial, the metric that we want to optimize in this case is the Mean Reciprocal Rank (MRR). The MRR is affected by the relative order of the relevance we assign to the list of documents generated by a query and not by their absolute magnitudes. This disconnect between the characteristics of the loss function and the metric of interest might lead to suboptimal results.
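For reference, MRR can be computed in a few lines; only the 1-based rank of the relevant document per query matters, not the scores (here each query has a single relevant document, as in MSMARCO):

```python
def mean_reciprocal_rank(ranked_lists, relevant_ids):
    # a query whose relevant document is missing contributes 0
    total = 0.0
    for hits, relevant in zip(ranked_lists, relevant_ids):
        if relevant in hits:
            total += 1.0 / (hits.index(relevant) + 1)
    return total / len(ranked_lists)
```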

For ranking search results, it is preferable to train our model with a listwise loss function, which takes the entire ranked list into consideration when updating the model parameters. To illustrate this, we trained linear models using the TF-Ranking framework. The framework is built on top of TensorFlow and allows us to specify pointwise, pairwise and listwise loss functions, among other things.

We made available the script that we used to train the two models that generated the results displayed in the figure below. The script uses simple linear models but can be useful as a starting point to build more complex ones.

Overall, on average, there is not much difference between those models (with respect to MRR), which was expected given the simplicity of the models described here. However, we can see that the model based on a listwise loss function places more documents in the first two positions of the ranked list when compared to the pointwise model. We expect the difference in MRR between pointwise and listwise loss functions to increase as we move on to more complex models.

The main goal here was simply to show the importance of choosing better loss functions when dealing with LTR tasks and to give a quick start to those who want to give it a shot in their own Vespa applications. Now it is up to you: check out the tutorials, build something and let us know how it went. Feedback is welcome!

Approximate Nearest Neighbor Search in Vespa – Part 1

Searching for the nearest neighbors of a data point in a high dimensional vector space is an important problem for many real-time applications.
For example, in Computer Vision it can be used to find similar images in large image datasets.
In Information Retrieval, pre-trained text embeddings can be used to match documents based on the distance between query and document embeddings.
In many of these applications,
the document corpus is constantly evolving and the search is constrained by query filters applied on the metadata of the data points.
For example, in E-commerce,
a nearest neighbor search for products in a vector space would typically be constrained by product metadata like inventory status and price.

Vespa (the open source big data serving engine) already supports exact
nearest neighbor search
that is integrated with the Vespa query tree and its filter support.
This enables you to get the exact nearest neighbors meeting the filter criteria of the query.
This works well when the number of documents to calculate nearest neighbors for is small,
e.g when the query filter is strong, or the document corpus is small.
However, as the number of documents to consider increases, we want to trade exactness for performance and we need an approximate solution.

This blog post is part 1 in a series of blog posts where we share how the Vespa team implemented an approximate nearest neighbor (ANN) search algorithm.
In this first post, we’ll explain why we selected HNSW (Hierarchical Navigable Small World Graphs)
as the baseline algorithm and how we extended it to meet the requirements for integration in Vespa.
Requirements included supporting real-time updates, integration with the Vespa query tree, and being flexible enough to fit a wide range of use cases.

Requirements & Challenges

Vespa is an open source real-time big data serving engine.
In a typical application, the document corpus is constantly evolving.
New documents are added and removed, and metadata is being updated.
A typical use case is news article search and recommendation, keeping a week, month or year’s worth of articles,
continually adding or updating new articles while selectively removing the oldest ones.
In this case, nearest neighbor search over pre-trained text embeddings could be used in combination with query filters over metadata.

Based on the existing functionality and flexibility of Vespa,
we defined a set of requirements an ANN search algorithm had to support to be a good fit for integration:

  • Indexes used in the algorithm must be real-time updateable with low latency and high throughput.
    Data points can be added, updated and removed, and the entire corpus should not be needed when creating the index.
  • Concurrent querying with low latency and high throughput without huge performance impact due to ongoing update operations.
  • Seamless integration with the Vespa query tree and its filter support.
    This enables correct results, compared to just increasing top K for the nearest neighbor search algorithm to compensate when query filters are used.

Algorithms Background

There exist many algorithms for ANN search.
Many of these are summarized and analyzed in
Approximate Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement (v1.0)
and A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search.
In addition, benchmark results for different algorithms over different datasets are summarized in
ANN Benchmarks.

Existing algorithms are mainly focused on a stable document corpus where all data points in the vector space are known up front.
In this case the index structure is built once, and then used for nearest neighbor searches.
To support real-time updates, we looked for algorithms that were either incremental in nature or could be modified to meet this requirement.

There are three broad categories of algorithms: tree-based, graph-based and hash-based.
We chose one from each category for a more detailed exploration.
This selection was based on how they performed in benchmarks within papers and in
ANN Benchmarks,
and how easy the algorithm was to modify to fit our requirements.

We ended up with the following:

  • Annoy
    Annoy is tree-based and described in
    Nearest neighbors and vector models – part 2 – algorithms and data structures.
    It takes a straightforward engineering approach to the ANN problem, and is quite easy to understand and implement.
    An Annoy index consists of N binary trees, where each tree partitions the vector space using random hyperplanes at each node in the tree.
    The implementation is available on github.
  • HNSW – Hierarchical Navigable Small World Graphs
    This is graph-based and described in
    Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
    HNSW builds a hierarchical graph incrementally,
    and has great search performance with high recall, motivating us to prototype it for comparison.
    Each node in the graph represents a point in the vector space, and nodes are linked to other nodes that are close in space.
    The implementation is available on github.
  • RPLSH – Random Projection based Locality Sensitive Hashing
    This is hash-based and described in
    A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search.
    The implementation is available on github.
    Since the random hyperplanes used for projections can be selected up-front
    (only depending on the number of dimensions of the vector space) this approach is data-independent.
    For our purposes, we could use the hash value as a filter,
    where only documents having most bits in common with the hash value of the query data point would be considered for full distance computation.
    This would give us little extra memory and CPU usage and would be easy to fit into our exact (brute-force) nearest neighbor feature.
    It was the first algorithm that we implemented as a prototype.
    However, in our prototype we found that getting a significant speedup required a large sacrifice of quality.
    For more info, see Dropping RPLSH below.
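The Hamming-distance pre-filtering idea used by RPLSH can be sketched as follows, with the bit hashes represented as plain Python ints; only the closest candidates would then get a full distance computation:

```python
def hamming(a, b):
    # number of differing bits between two hashes
    return bin(a ^ b).count("1")

def prefilter(query_hash, doc_hashes, keep):
    # rank documents by Hamming distance to the query hash and keep the closest
    ranked = sorted(range(len(doc_hashes)),
                    key=lambda i: hamming(query_hash, doc_hashes[i]))
    return ranked[:keep]
```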

We created prototype implementations of the three algorithms in C++.
This gave us a deeper understanding and the ability to make necessary modifications to support real-time updates and query filters.
The prototypes were also used to do low-level benchmarking and quality testing.

Prototype Benchmark

We benchmarked the three prototype implementations to look at indexing throughput and search throughput with different document corpus sizes.
We used the
1M SIFT (128 dim)
and 1M GIST (960 dim)
datasets, where one vector corresponds to a document.
In this section, we summarize the results of the 1M SIFT tests. Findings were similar with the 1M GIST dataset.


The tests were executed in a CentOS 7 Docker image using Docker Desktop on a MacBook Pro (15 inch, 2018)
with a 2.6 GHz 6-Core Intel Core i7 CPU and 32GB of memory. The Docker setup was 8 CPUs, 16GB of memory and 1GB of Swap.

The configuration of the algorithms was as follows:

  • Float size: 4 bytes.
  • Annoy: Number of trees = 50. Max points in leaf node = 128.
  • HNSW: Max links per node (M) = 16. Note that level 0 has 32 links per node (2*M). Number of neighbors to explore at insert (efConstruction) = 200.
  • RPLSH: 512 bits used for hash and linear scan of hash table.

Indexing Throughput

The 1M SIFT dataset was split into chunks of the first 100k, 200k, and 500k vectors,
and indexing throughput was tested with different corpus sizes.
The index structures for all algorithms started empty, and we added one document at a time to simulate real-time feeding.
One thread was used. Necessary adjustments to the prototype implementations were done to accommodate this.
Note that the data points for 10M documents are estimated based on the 100k and 1M data points.


  • Indexing throughput depends on corpus size for Annoy and HNSW, where throughput is halved when corpus size is increased by 10x.
  • Indexing throughput for RPLSH is independent of corpus size.
  • Annoy is 4.5 to 5 times faster than HNSW.
  • RPLSH is 23 to 24 times faster than HNSW at 1M documents.

Search Throughput (QPS)

The search throughput was measured after the indexes were built, using a single benchmarking client and search thread.
We used the query vectors from the dataset, and searched for the nearest K=100 neighbors from the query vector.
To get comparable quality results between the algorithms we had to adjust some of the parameters used.
The default behavior for Annoy is to collect K=100 candidates per tree (5000 in total with our 50 trees), and then calculate the brute-force distance for all of them.
To get comparable quality between Annoy and HNSW we had to collect 8000 candidates in total instead.
For RPLSH, we used the Hamming distance between hashes to pre-filter candidates after collecting 200 candidates into a heap,
skipping full distance calculation for most candidates. For HNSW, we asked for 100 candidates.


  • HNSW outperforms Annoy and RPLSH. At corpus size 1M the QPS is 9 times as high as Annoy,
    and 16 times as high as RPLSH at comparable quality.
    Similar observations between hnswlib and Annoy are found in ANN Benchmarks,
    where the QPS of hnswlib is 5-10 times higher at the same quality on all tested datasets.
  • The HNSW search algorithm depends heavily on the number of links between nodes, which again depends on corpus size.
    The QPS is halved when the corpus size is increased by 10x.
    We see the same during indexing as that uses the search algorithm to find candidates to connect to.
  • The Annoy search algorithm is less dependent on corpus size, at least with small corpus sizes.
    The static cost is driven by brute force calculation of distances for the candidate set, 8000 in this case.
    With high corpus size the cost of traversing the binary trees will likely match and exceed the static cost.
    We don’t know where this limit is.
  • For RPLSH, doing exhaustive search (linear scan) of the hashes is more expensive than expected.
    One major consideration here is that the data reduction (from 128 dimensions*32 bit floats, to 512 bits hash) is just a factor 8,
    and indeed we got around 8 times speedup compared to brute-force linear scan.

Index Structures & Memory Usage

The prototype implementations used simple data structures to represent the index structures of the algorithms in memory.
Only one thread was used for indexing and searching, so we did not have to worry about locking.
These simplifications would have to be addressed in the actual implementation of the selected algorithm.
We created a rough index structure design for both Annoy and HNSW to see how much memory the index would use.

In Vespa, a data point in a high dimensional vector space is represented by a tensor
of rank one with dimension size equal to the dimensionality of the vector space.
To use this, a document type with a tensor field of such type is defined, and documents with tensors are fed to Vespa.
A document type typically consists of multiple other fields as well, for instance, metadata fields and other tensor fields.
The tensors across all documents for a given tensor field are stored in-memory.
The index structures were designed such that they just reference the tensor data points stored in-memory.
In addition, they support one writer thread doing changes, while multiple reader threads are searching, without use of locking.
The actual index structure details for HNSW will be covered in an upcoming blog post.

Annoy Index

An Annoy index consists of multiple binary trees. Each tree consists of split nodes and leaf nodes.
A split node has an offset from the origin, a hyperplane that splits the space in two, and references to its left and right child nodes.
A leaf node contains the document ids of the points that ended up in that portion of space.
The document ids are used to lookup the actual tensor data from memory.
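A simplified sketch of routing a query point through one such tree is shown below; the node layout here is hypothetical, not Annoy’s actual memory format. Split nodes route by which side of the hyperplane the point falls on, and leaves hold document ids:

```python
class Split:
    def __init__(self, hyperplane, offset, left, right):
        self.hyperplane, self.offset = hyperplane, offset
        self.left, self.right = left, right

class Leaf:
    def __init__(self, doc_ids):
        self.doc_ids = doc_ids

def find_leaf(node, point):
    # descend split nodes until reaching a leaf
    while isinstance(node, Split):
        side = sum(h * p for h, p in zip(node.hyperplane, point)) - node.offset
        node = node.right if side > 0 else node.left
    return node.doc_ids

# toy tree: split on the first coordinate at 0.5
tree = Split([1.0, 0.0], 0.5, Leaf([1, 2]), Leaf([3, 4]))
```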

HNSW Index

An HNSW index consists of navigable small world graphs in a hierarchy.
Each document in the index is represented by a single graph node.
Each node has an array of levels, from level 0 to n.
The number of levels is constant during the lifetime of the node and is drawn randomly when the node is created.
All nodes are present in the graph at level 0.

Fine-tuning a BERT model for search applications

Thiago Martins

Vespa Data Scientist

How to ensure training and serving encoding compatibility

There are cases where the inputs to your Transformer model are pairs of sentences, but you want to process each sentence of the pair at different times due to your application’s nature.

Photo by Alice Dietrich on Unsplash

The search use case

Search applications are one example. They involve a large collection of documents that can be pre-processed and stored before a search action is required. On the other hand, a query triggers a search action, and we can only process it in real-time. Search apps’ goal is to return the most relevant documents to the query as quickly as possible. By applying the tokenizer to the documents as soon as we feed them to the application, we only need to tokenize the query when a search action is required, saving time.

In addition to applying the tokenizer at different times, you also want to retain adequate control over how your pair of sentences is encoded. For search, you might want a joint input vector of length 128 where the query, which is usually smaller than the document, contributes 32 tokens while the document can take up to 96 tokens.

Training and serving compatibility

When training a Transformer model for search, you want to ensure that the training data will follow the same pattern used by the search engine serving the final model. I have written a blog post on how to get started with BERT model fine-tuning using the transformer library. This piece will adapt the training routine with a custom encoding based on two separate tokenizers to reproduce how a Vespa application would serve the model once deployed.

Create independent BERT encodings

The only change required is simple but essential. In my previous post, we discussed the vanilla case where we simply applied the tokenizer directly to the pairs of queries and documents.

from transformers import BertTokenizerFast

model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

train_encodings = tokenizer(train_queries, train_docs, truncation=True, padding='max_length', max_length=128)
val_encodings = tokenizer(val_queries, val_docs, truncation=True, padding='max_length', max_length=128)

In the search case, we create the create_bert_encodings function that will apply two different tokenizers, one for the query and the other for the document. In addition to allowing for different query and document max_length, we also need to set add_special_tokens=False and not use padding, as those need to be included by our custom code when joining the tokens generated by the tokenizer.

TOKEN_NONE = 0   # [PAD] id in the BERT uncased vocabulary
TOKEN_CLS = 101  # [CLS]
TOKEN_SEP = 102  # [SEP]

def create_bert_encodings(queries, docs, tokenizer, query_input_size, doc_input_size):
    queries_encodings = tokenizer(
        queries, truncation=True, max_length=query_input_size-2, add_special_tokens=False
    )
    docs_encodings = tokenizer(
        docs, truncation=True, max_length=doc_input_size-1, add_special_tokens=False
    )

    input_ids = []
    token_type_ids = []
    attention_mask = []
    for query_input_ids, doc_input_ids in zip(queries_encodings["input_ids"], docs_encodings["input_ids"]):
        # create input ids: [CLS] query [SEP] doc [SEP], padded to length 128
        input_id = [TOKEN_CLS] + query_input_ids + [TOKEN_SEP] + doc_input_ids + [TOKEN_SEP]
        number_tokens = len(input_id)
        padding_length = max(128 - number_tokens, 0)
        input_id = input_id + [TOKEN_NONE] * padding_length
        input_ids.append(input_id)
        # create token type ids: 0 for the query segment, 1 for the document segment
        token_type_id = [0] * len([TOKEN_CLS] + query_input_ids + [TOKEN_SEP]) + [1] * len(doc_input_ids + [TOKEN_SEP]) + [TOKEN_NONE] * padding_length
        token_type_ids.append(token_type_id)
        # create attention mask: 1 for real tokens, 0 for padding
        attention_mask.append([1] * number_tokens + [TOKEN_NONE] * padding_length)

    encodings = {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask
    }
    return encodings

We then create the train_encodings and val_encodings required by the training routine. Everything else on the training routine works just the same.

from transformers import BertTokenizerFast

model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

train_encodings = create_bert_encodings(
    queries=train_queries, docs=train_docs, tokenizer=tokenizer,
    query_input_size=32, doc_input_size=96
)

val_encodings = create_bert_encodings(
    queries=val_queries, docs=val_docs, tokenizer=tokenizer,
    query_input_size=32, doc_input_size=96
)

Conclusion and future work

Training a model to deploy in a search application requires us to ensure that the training encodings are compatible with the encodings used at serving time. We generate document encodings offline when feeding the documents to the search engine, while creating query encodings at run-time upon arrival of the query. It is often relevant to use different maximum lengths for queries and documents, among other possible configurations.

Photo by Steve Johnson on Unsplash

We showed how to customize BERT model encodings to ensure this training and serving compatibility. However, a better approach is to build tools that bridge the gap between training and serving by allowing users to request training data that, by default, respects the encodings used when serving the model. pyvespa will include such an integration to make it easier for Vespa users to train BERT models without having to adjust the encoding generation manually as we did above.

Using approximate nearest neighbor search in real world applications

From text search and recommendation to ads and online dating, ANN search rarely works in isolation

Anything can be represented by a list of numbers.

For instance, text can be represented by a list of numbers describing the
text’s meaning. Images can be represented by the objects they contain. Users of
a system can be represented by their interests and preferences. Even time-based
entities such as video, sound, or user interactions can be represented by a
single list of numbers.

These vector representations describe content or meaning: the original,
containing thousands of characters or pixels, is compressed to a much smaller
representation of a few hundred numbers.

Most often, we are interested in finding the most similar vectors. This is
called k-nearest neighbor (KNN) search or similarity search and has all kinds
of useful applications. Examples here are model-free classification, pattern
recognition, collaborative filtering for recommendation, and data compression,
to name but a few. We’ll see some more examples later in this post.

However, a nearest neighbor search is only a part of the process for many
applications. For applications doing search and recommendation, the potential
candidates from the KNN search are often combined with other facets of the
query or request, such as some form of filtering, to refine the results.

This can severely limit the quality of the end result, as post-filtering can
prevent otherwise relevant results from surfacing. The solution is to integrate
the nearest neighbor search with filtering; however, most libraries for nearest
neighbor search work in isolation and do not support this. To my knowledge, the
only open-source platform that does is Vespa.

In this post, we’ll take a closer look at approximate nearest neighbor search, explore
some real cases combining this with filtering, and delve into how Vespa
solves this problem.

Finding the (approximate) nearest neighbors

The representations can be visualized as points in a high-dimensional space, even
though it’s kind of difficult to envision a space with hundreds of dimensions.
This allows us to think of these points as vectors, sometimes called thought
vectors, and we can use various distance metrics to measure the likeness or
similarity between them. Examples are the dot (or inner) product, cosine angle,
or euclidean distance.

The 5 nearest neighbors

Finding the nearest neighbors of a point is reasonably straightforward: just
compute the similarity using the distance metric between the point and all
other points. Unfortunately, this brute-force approach doesn’t scale well,
particularly in time-critical settings such as online serving, where you have a
large number of points to consider.
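The brute-force baseline that ANN methods approximate can be written in a few lines, here with Euclidean distance:

```python
import math

def knn(query, points, k):
    # compute the distance from the query to every point, keep the k closest
    def dist(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p)))
    return sorted(range(len(points)), key=lambda i: dist(points[i]))[:k]

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.5)]
```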

There are no known exact methods for finding nearest neighbors efficiently. As
both the number of points and the number of dimensions increase, we
fall victim to the curse of dimensionality. In high dimensions, all points are
almost equally distant from each other. A good enough solution for many
applications is to trade accuracy for efficiency. In approximate nearest
neighbor (ANN) search, we build index structures that narrow down the search space.
The implicit neighborhoods in such indexes also help reduce the problem of high
dimensionality.
You can roughly divide the approaches used for ANN search by whether or not
they can be implemented using an inverted index. The inverted index originates
from information retrieval and is comparable to the index found at the back of
many books, pointing from a word (or term) to the documents containing it.
This can be used for ANN search as well. Using k-means clustering, one can cluster
all points and index them by which cluster they belong to. A related approach
is product quantization (and its relatives), which splits the vectors into
products of lower-dimensional spaces. Yet another is locality-sensitive
hashing, which uses hash functions to group similar vectors together. These
approaches index the centroids or buckets.
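As a simplified sketch of the clustering approach, assuming the centroids have already been computed (e.g. by k-means), we can build an inverted index from centroid id to points, and at query time only scan the clusters whose centroids are closest to the query:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_cluster_index(points, centroids):
    # Inverted index: centroid id -> the points assigned to that cluster.
    index = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        index[nearest].append(p)
    return index

def ann_search(query, centroids, index, k, n_probe=1):
    # Only scan the n_probe clusters closest to the query, not all points.
    probed = sorted(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))[:n_probe]
    candidates = [p for i in probed for p in index[i]]
    return sorted(candidates, key=lambda p: euclidean(query, p))[:k]

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.2, 0.1), (9.5, 10.0), (0.4, 0.4), (10.5, 9.9)]
index = build_cluster_index(points, centroids)
print(ann_search((10.0, 10.0), centroids, index, k=2))  # [(9.5, 10.0), (10.5, 9.9)]
```

The approximation error comes from points near a cluster boundary, which is why real systems probe several clusters.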

A method that is not compatible with inverted indexes is HNSW (hierarchical
navigable small world). HNSW is based on graph structures, is efficient,
and lets the graph be incrementally built at runtime. This is in contrast to
most other methods that require offline, batch-oriented index building.
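The core move in HNSW's graph layers is a greedy walk toward the query point. The toy single-layer sketch below is our own simplification; real HNSW keeps a beam of candidates and multiple graph layers:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_graph_search(graph, vectors, start, query):
    # Greedy walk in a proximity graph: hop to whichever neighbor is
    # closer to the query, and stop when no neighbor improves.
    current = start
    while True:
        best = min(graph[current], key=lambda n: euclidean(vectors[n], query))
        if euclidean(vectors[best], query) >= euclidean(vectors[current], query):
            return current
        current = best

# A tiny line-shaped graph: node id -> vector, node id -> neighbor ids.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_graph_search(graph, vectors, start=0, query=(2.9, 0.0)))  # 3
```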

As approximate nearest neighbor search has many applications, quite a few tools
and libraries exist, FAISS being one well-known example.

A good overview of tradeoffs for these can be found at

Nearest neighbors in search and recommendation

In many applications, such as search and recommendation, the results of the
nearest neighbor search are combined with additional facets of the request. In
this section, we’ll provide some examples of when this becomes problematic.

Only 2 of the 5 nearest neighbors remain after filtering

Text search

Modern text search increasingly uses representation vectors, often called text
embeddings or embedding vectors. Word2vec was an early example. More recently,
sophisticated language understanding models such as BERT and other
Transformer-based models are increasingly used. These are capable of assigning
different representations to a word depending upon its context. For text
search, the current state-of-the-art uses different models to encode query
vectors and document vectors. These representations are trained so that the
inner product of these vectors is maximized for relevant results.

Using embedding vectors in text search is often called semantic search. For
many text search applications, we would like to combine this semantic search
with other filters. For instance, we can combine a query for “approximate
nearest neighbor” with a date filter such as “2020”. The naive approach here is
to use one of the ANN libraries mentioned above to perform a nearest neighbor
search and then filter out the results.

However, this is problematic. Imagine that 1000 documents are relevant to the
query “approximate nearest neighbor”, with 100 added each year over the past 10
years. Assume they all are approximately equally likely to be retrieved from
the ANN. So, retrieving the top 100 will result in about 10 documents from each
year. Applying the filter “2020” will result in only 10 documents. That means
the other 90 relevant documents from 2020 are missed.
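To make the arithmetic concrete, here is a small simulation of this scenario. The corpus and the uniform-sampling model of ANN retrieval are our own idealization:

```python
import random

random.seed(0)

# Idealized corpus: 1000 relevant documents, 100 per year over 10 years.
docs = [{"id": i, "year": 2011 + i % 10} for i in range(1000)]

# Assume the ANN search returns 100 of the relevant docs uniformly at random.
retrieved = random.sample(docs, 100)

# Post-filter: keep only documents from 2020.
hits_2020 = [d for d in retrieved if d["year"] == 2020]
print(len(hits_2020))  # around 10 in expectation, although 100 relevant 2020 docs exist
```

Asking the ANN for ten times as many candidates would compensate here, but only at ten times the cost, and only because the filter happens to pass 10% of the corpus.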


Recommendation

Recommender systems, such as YouTube and TikTok, are built to provide
continually interesting content to all users. As such, it’s essential to learn
the interests or preferences of the user. Such user profiles are represented by
one or more vectors, as are the items that should be recommended.

These vectors are often generated by using some form of collaborative
filtering. One method is matrix factorization, where the maximum inner product
is used as a distance function. Deep learning approaches have recently shown
great promise, trained explicitly for the distance function between the user
and item vector.

Recommendation systems employ filters to a great degree. Examples are filters
for age-appropriate content, NSFW labels, availability of content in various
regions due to distribution rights, and user-specified filters blocking certain
content. These are examples of direct filters. More indirect filters come in
the form of business rules such as diversity and de-duplication, which filter
out content that has already been recommended.

The problem of filtering is more evident for recommendation systems than for
text search. The quantity and strength of these filters lead to a greater
probability that items retrieved from the ANN search are filtered away, so only
a few of the relevant items are actually recommended.

Serving ads

Ad serving systems work much like recommender systems. Given a user
profile and a context such as a search query or page content, the system should
provide an advertisement relevant to the user. The advertisements are stored
with advertiser-specific rules, for instance, who the ad or campaign should
target. One such rule is to not exceed the budget of the campaign.

These rules function as filters. Like with text search and recommendation, if
these filters are applied after the user-profile based retrieval, there is a
probability that an appropriate advertisement is not retrieved. This is
particularly important regarding the budget. Income is lost if there are no
results retrieved with an available spending budget.

Online dating

In the world of online dating, people have a set of preferences. Some can be
hard requirements, such as gender, age range, location, and height. Interests
might be less absolute, such as hiking, loving pets, traveling, and exercise.
These interests and preferences can be represented by a vector, and at least
parts can be compressed to a representation vector as well.

Suppose retrieval is based on an ANN over interests, and the preferences are
applied as a filter afterward. In that case, it’s clear why online dating is
hard. As we retrieve the best matches from the ANN, there is a significant
probability that all or most of these are filtered out, for instance, by
location or by profiles already seen.

Local search

Local search and recommendation is based on geographical location. Given
longitude and latitude coordinates, we can find places or businesses within
a certain distance from a point; finding restaurants near a user is a typical
case. Imagine that we have the dining preferences of a user represented as a
vector. Likewise, all restaurants are represented by vectors. Then, by
performing an ANN search followed by a location filter, we could retrieve the
restaurants preferred by the user in their local area.

However, this would not work. Of all the restaurants in the world, only a small
fraction are close to the user. The location filter is much stronger than the
ANN retrieval. So with a high probability, no results would be produced at all.


The naive approach to the problems above is simply to request more candidates
from the ANN search. This obviously hurts performance, as the workload of both
the ANN search and post-filtering increases. Besides, this is not guaranteed to
work. If you have a strong filter independent of the ANN, there is a real
chance of not producing any results at all. The local restaurant case is an
example of this, where the location is a strong filter independent of the
user's preferences.
The real solution here is to integrate the filters into the ANN search. Such an
algorithm would be able to reject candidates early that don’t pass the filters.
This effectively increases the search area from the query point dynamically
until enough candidates are found. This guarantees that the requested number of
candidates are produced.
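In pseudocode terms, an integrated search rejects candidates as it encounters them instead of truncating first. The sketch below uses an exact sort rather than a real ANN index, but it illustrates the principle of expanding the search until enough eligible candidates are found:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_nearest_neighbors(query, items, k, predicate):
    # Integrated filtering: walk candidates from nearest to farthest,
    # reject non-eligible ones early, and stop once k have passed.
    ranked = sorted(items, key=lambda item: euclidean(query, item["vector"]))
    results = []
    for item in ranked:
        if predicate(item):
            results.append(item)
            if len(results) == k:
                break
    return results

items = [
    {"id": 1, "vector": (0.0, 0.0), "year": 2019},
    {"id": 2, "vector": (0.1, 0.0), "year": 2020},
    {"id": 3, "vector": (0.2, 0.0), "year": 2020},
    {"id": 4, "vector": (5.0, 5.0), "year": 2020},
]
top = filtered_nearest_neighbors((0.0, 0.0), items, k=2,
                                 predicate=lambda i: i["year"] == 2020)
print([i["id"] for i in top])  # [2, 3]: item 1 is rejected, item 4 is never needed
```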

Unfortunately, for most ANN libraries, this is not an option, as they work in
isolation.

The 5 nearest neighbors with integrated filtering

Vespa is, to my knowledge, the only implementation of ANN that supports
integrated filtering. The implementation is based on a modified HNSW graph
algorithm, and innovates in three main areas:

  • Dynamic modification of the graph. Most ANN algorithms require the index to
    be built offline, but HNSW supports incremental building of the index. Vespa
    takes advantage of this and supports both adding and removing items in
    real-time while serving.
  • Multi-threaded indexing using lock-free data structures and copy-on-write
    semantics drastically increases the performance of building the index.
  • Metadata filtering modifies the algorithm to skip non-eligible candidates.

To support filtering, Vespa first evaluates the filters to create a list of
eligible candidates. During the ANN search, a point close to the query point is
selected and the graph is explored by following each node’s edge to its
neighbors. Candidates not in the eligibility list are skipped, and the search
continues until we have produced enough candidates.

There is a small problem here, however. If the eligibility list is small in
relation to the number of items in the graph, skipping occurs with a high
probability. This means that the algorithm needs to consider an exponentially
increasing number of candidates, slowing down the search significantly. To
solve this, Vespa switches over to a brute-force search when this occurs.
The result is an efficient ANN search when combined with filters.
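Conceptually, the decision looks like the sketch below. Note that the 5% threshold and the decision rule are purely illustrative, not Vespa's actual internal heuristic:

```python
def choose_strategy(n_eligible, n_total, threshold=0.05):
    # If only a small fraction of the corpus passes the filter, graph
    # traversal would skip most candidates it visits, so fall back to a
    # brute-force scan over the eligible list instead. The threshold
    # value here is illustrative only.
    if n_total == 0 or n_eligible / n_total < threshold:
        return "brute-force over eligible candidates"
    return "filtered HNSW graph search"

print(choose_strategy(n_eligible=200, n_total=1_000_000))      # brute-force over eligible candidates
print(choose_strategy(n_eligible=400_000, n_total=1_000_000))  # filtered HNSW graph search
```

A brute-force scan over a small eligible set is cheap precisely because the filter has already shrunk the problem.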

About Vespa

Vespa is an open-source platform for building applications that do real-time
data processing over large data sets. Designed to be highly performant and
web-scalable, it is used for such diverse tasks as search, personalization,
recommendation, ads, auto-complete, image and similarity search, comment
ranking, and more.

One of Vespa's strengths is that it includes all the necessary features to
realize such applications. This means one does not need additional plugins or
external services. Thus, it offers a simplified path to deployment in
production without coping with the complexity of maintaining many different
systems.

Q&A from “The Great Search Engine Debate – Elasticsearch, Solr or Vespa?” Meetup

On January 28th, 2021, at 17:00 CET,
Charlie Hull from OpenSource Connections hosted

The Great Search Engine Debate – Elasticsearch, Solr or Vespa? –
a meetup on Haystack LIVE!,
with Anshum Gupta, VP of Apache Lucene, Josh Devins from Elastic and Jo Kristian Bergum from Vespa.

So many great questions were asked that there was no time to go through them all.
This blog post addresses the Vespa-related questions,
with quicklinks into the recording for easy access.
We have also extracted the unanswered questions from the chat log, linking to Vespa resources.
Please let us know if this is useful.
Feel free to follow up with the Vespa Team using the Gitter live chat.
You will also find us in the #vespa channel of Relevance Slack.
You can also find Charlie’s summary post at
Solr vs Elasticsearch vs Vespa – what did we learn at The Great Search Engine Debate?.

All three speakers were asked to do a pitch and closing words.
Three things that make you recommend your technology –
see the Vespa pitch and Vespa top three –

  1. Vespa has a great toolbox for modern retrieval, state-of-the-art retrieval/ranking with Machine Learning
  2. Vespa’s indexing architecture allows true partial updates at scale, with high indexing volume – when combined with #1, one can have realtime updated models to make decisions in real time, on updated information
  3. Vespa’s scalability and true elastic content cluster. You don’t have to pre-determine the number of shards. Can go from 1 node to 100 nodes, just add nodes.

Resources: ranking,
reads and writes,
elastic Vespa

Use case differentiator, I am curious if the participants could walk through:
let’s say I have an index with text for search, but also a couple dozen features I intend to use in LTR.
I want to update two of the dozen features across several billion documents because I changed my feature extraction.
How does the engine deal with this?

[ quicklink ].
Common and widely used Vespa use case.
True partial updates of attribute fields which are in-memory, update and evaluate in place –
no need to read the entire document and apply the update and write it to a new index segment
like in Solr/Elasticsearch which builds on Lucene.
Vespa can do 50,000 numeric partial updates per second per node.
Ranking will immediately see the update and use value in computations (search, rank, sorting, faceting).

Resources: ranking,
reads and writes,
elastic Vespa
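For reference, a partial update against the Vespa /document/v1 API carries only the changed field, as a minimal sketch shows. The namespace, document type, and field names are hypothetical:

```python
import json

def partial_update_payload(field, value):
    # Vespa /document/v1 partial update: only the named field is written,
    # the rest of the document is left untouched.
    return {"fields": {field: {"assign": value}}}

# PUT this payload to /document/v1/<namespace>/<doctype>/docid/<id>.
url = "http://localhost:8080/document/v1/mynamespace/mydoc/docid/doc-1"
payload = partial_update_payload("popularity", 42)
print(json.dumps(payload))  # {"fields": {"popularity": {"assign": 42}}}
```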

Much of the popularity around ES and Solr arises from the fact that they are very “approachable” technologies.
It’s simple for a beginner to get started indexing and searching documents at a basic level,
and most importantly, understanding and influencing which documents are returned.
My impression is that the technical entry level for Vespa is much more advanced.
Would you agree or disagree? How would you recommend starting out with Vespa?

[ quicklink ].
Learned a lot from Elasticsearch on developer friendliness,
maybe at 80% ease of use. With Vespa, it’s easy to go from laptop to full cloud deployment.
Use Docker to run Vespa on your laptop.
Use Vespa application package to go from laptop to full size – it is the same config.

Resources: application packages,
getting started,

I have a question regarding Vespa: How is the support for non-English languages regarding tokenizers, stemmers, etc.?
I’m especially interested in German, Russian, Polish, Czech and Hungarian.
How big would be the effort to adapt Solr / OpenNLP resources to use them with Vespa?

[ quicklink ].
Vespa integrates with Apache OpenNLP,
so Vespa supports any language that OpenNLP supports.
It’s easy to integrate with new linguistic libraries and we’ve already received CJK contributions to Vespa.

Resources: linguistics

Which search engine is best for a write-heavy application?
Based on my experience, Elasticsearch read performance is impacted when there are heavy writes.

[ quicklink ].
Vespa moved away from indexing architecture similar to Elasticsearch and Solr,
where it used small immutable index segments that were later merged.
Vespa has a mutable in-memory index in front of immutable index segments.
All IO writes are sequential. No shards. Attribute fields are searchable and in-place updateable.
Efficient use of OS buffer cache for random reads from search.
Real-time indexing with Solr and Elasticsearch creates many immutable segments
which all need to be searched (single-threaded execution as well),
so latency is definitely impacted more than with Vespa, which has a memory index + larger immutable index.

Resources: reads and writes,

“Join” is always a problem with SOLR/Elasticsearch. How does Vespa handle it?

[ quicklink ].
Join is supported; scalable join is implemented using the parent/child relationship.
The parent is a global document – distributed across all nodes in the cluster.
Child documents access attribute in-memory fields imported from parent documents.
Can also use the stateless container, deploy a custom Java searcher, do joins on top of multiple searches.

Resources: parent-child,
Vespa overview,

Can people talk a bit more about kubernetes integrations?

[ quicklink ].
Yes, one can run Vespa on K8s.

Resources: vespa-quick-start-kubernetes

How does Vespa compare to FAISS?

[ quicklink ].
FAISS uses HNSW like Vespa.
FAISS can only do nearest neighbor search, returning the ID of the vector, and is very fast.
In Vespa, ANN search can be combined with query filters,
unlike the Open Distro for Elasticsearch k-NN plugin,
which does a post-processing step after retrieving the nearest neighbors.
With a restrictive filter, like last day, that can end up with zero documents.
Vespa combines ANN search and filters.

Vespa has hybrid evaluation;
Term-at-a-time (TAAT) which is much more cache friendly, and document-at-a-time (DAAT).
Can evaluate part of the query tree using TAAT,
then search in the HNSW graph using the documents eligible as an input filter.
Including a filter makes ANN a bit slower, but the value it adds makes it worth it.

FAISS is faster as it does not have an HTTP api and distribution layer with realtime updates –
FAISS is a library, batch oriented.

Resources: using-approximate-nearest-neighbor-search-in-real-world-applications,
approximate nearest neighbor, hnsw,
feature tuning
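As a sketch of what such a combined query can look like in Vespa, the YQL below pairs the nearestNeighbor operator with a structured filter. The field names, the filter, and the targetHits value are illustrative:

```python
# Sketch of a Vespa query body combining ANN retrieval with a YQL filter.
# Field names (image_vector, query_image_vector, year) are illustrative.
query = {
    "yql": 'select * from sources * where '
           '([{"targetHits": 100}]nearestNeighbor(image_vector, query_image_vector)) '
           'and year = 2020;',
    "ranking": "vector_similarity",
    "ranking.features.query(query_image_vector)": "[0.1, 0.2, 0.3]",
    "hits": 10,
}
print(query["yql"])
```

The filter clause restricts which documents are eligible during, not after, the nearest neighbor search.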

Since Vespa has a different approach, is there anything Vespa is learning from Elastic/Solr/Lucene?
Also the other way around, Elastic/Solr learning from Vespa?

[ quicklink].
Both are great engines! Vespa’s toolbox is bigger.
Learned how Elasticsearch became popular:
developer friendliness, nice APIs, great support for analytics, great for handling immutable data.
Lucene has had a large developer crowd for 20 years.

If I remember correctly, FAISS or similar libraries support indexing/searching with the power of GPU,
how does this compare to Vespa/Elastic/Solr?

[ quicklink ].
Vespa is CPU only, but looking at GPU as pretrained language models grow larger.
GPU easier to use in indexing than serving.
We are trying to find models that run efficiently on GPU. Vespa is written in C++,
making use of OpenBLAS and special instructions to get the most out of CPUs.


Given large language model dominance, in 5 years, how much do we need to support manual relevance tuning operations?
Should that be our focus? Or will search engines just be initial retrieval before sending docs to eg. BERT?

[ quicklink ].
BERT and pretrained language models help machines understand text better than before,
with dramatic progress on ranking, roughly 2x BM25 on multiple information retrieval datasets.
However, ranking involves more than just text matching, like click models and site popularity.
In Vespa, ranking with BERT locally on the content nodes,
can combine scoring from language model into LTR framework, taking other signals into account.
There are ways to use BERT that could lead to close to random ranking,
e.g. using BERT as a representation model without fine-tuning for the retrieval task
where there are many many negative (irrelevant) documents.

However, good zero-shot transfer capabilities for interaction-based models
have demonstrated strong ranking accuracy on other data sets.
See Pretrained Transformers for Text Ranking: BERT and Beyond.

Resources: from-research-to-production-scaling-a-state-of-the-art-machine-learning-system

Can you speak about the history of Vespa? All top contributors work at Verizon/Yahoo.
Are you aware of prominent Vespa users beside Verizon? Who’s behind Vespa Cloud?
Is there a (larger) ecommerce shop using Vespa in production already?

[ quicklink ]. Vespa is run by Verizon Media.
In Verizon Media, Vespa is used for search and recommendation (including APAC e-commerce) + Gemini ad serving stack.
Vespa’s background is from Fast Search and Transfer, founded in 1997 from NTNU in Trondheim, Norway.


What are your plans for growing your communities?
Where should we go to ask questions and have discussion?

[ quicklink ].
#vespa on Stack Overflow,
Gitter channel,
#vespa channel of Relevance Slack.
Asking the extended Vespa team to document use cases / blog posts.


What type of node? Helps me understand 50k/node number

Single value update assign of an int field on a c5d.2xlarge, 8 v-cpu, 16GB, 200G SSD. 49K updates/s.

How does Vespa handle a search query containing both dense vector + scalar fields?
I.e. internally, does it first retrieve the top-k docs and then apply the filters?

See the How does Vespa compare to FAISS? question above –
filter first, maybe using TAAT for speed, then top-k.
This is to ensure low latency and non-empty result sets.

Which engine supports the usage of KNN clustering together with vector similarity queries?

Vespa supports approximate nearest neighbor search using HNSW,
which can also be combined with pre-computed KNN clustering
where vectors have been assigned a cluster id at indexing time.
Using the Vespa ranking framework,
one can combine (approximate) nearest neighbor queries with any other computations.
Using tensors and operations on these, custom algorithms can be built.

Resources: tensor user guide,
approximate nearest neighbor HNSW,

Which engine would you use for real-time systems with emphasis on queries latency?

The Vespa Team has helped implement numerous applications in Verizon Media
with millisecond latency requirements and update rates of thousands per second per node.
When the feed operation is ack’ed, the operation is visible.
There is no index refresh delay or immutable batch indexing
as in engines like Solr or Elasticsearch using the batch oriented Lucene library.
Vespa also allows using multiple searcher threads per query to scale latency versus throughput,
functionality which is not exposed in Solr or Elasticsearch.

Does Vespa support IBM ICU libraries? (language processing question as well)

Yes, used in sorting.

For what kind of problem would you recommend Elastic or Solr for (over Vespa)?

See the question above for anything Vespa is learning from Elastic/Solr/Lucene?

Resources: vespa-elastic-solr

Can any of the search engine beat Redis when it comes to read performance? Do we have any benchmarking?

The Vespa Team has not compared Vespa with Redis, as they are built for different use cases.
Vespa is built for Big Data Serving with distributed computations over large, mutable data sets.
Use Redis for in-memory database, cache, and message broker.

All 3 search engines rely on OS file caching for read optimizations.
How does that work in kubernetes when multiple processes/pods/containers are racing against each other for that?

The Vespa Team has not tested specifically for K8s and we would love to learn from the community when tested!
We run multiple Docker multi-process containers on bare-metal and AWS hosts, memory is isolated, but the IO is shared.
We hence monitor IO, but not more than that.

I’d love to read more about the TAAT/DAAT mix and ANN, I don’t follow that yet.
Any chance you can drop a code link or doc link?

See feature-tuning.
We will see if we can publish a paper or article on this subject.

With regard to GPU vs CPU this is also asking “How do you execute code on multi-arch cluster?”.
If you’re on K8s, you may just be calling out across the nodes.
Something like the nboost proxy is an interesting example.

Moving computation to where the data lives is the mantra for both Vespa and the Map Reduce paradigm (Hadoop).
This allows scaling latency and throughput without moving data across the wire.
Vespa integrates with many machine learning techniques and allows,
e.g., using the pre-trained language model relevancy score in combination with other core ranking features.

Using approximate nearest neighbor search to find similar products

In this blog we give an introduction to how to use
Vespa’s approximate nearest neighbor search query operator.

We demonstrate how nearest neighbor search in an image feature vector space can be used to find similar products.
Given an image of a product, we want to find
similar products, but also with the possibility to filter the returned products by inventory status, price or other real world constraints.

We’ll be using the
Amazon Products dataset
as our demo dataset. The subset we use has about 22K products.

  • The dataset contains product information like title, description and price and we show how to map the data into the Vespa document model
  • The dataset also contains binary feature vectors for images, features produced by a neural network model. We’ll use these image vectors and show you how to store and index vectors in Vespa.

We also demonstrate some of the real world challenges with using nearest neighbor search:

  • Combining the nearest neighbor search with filters, for example on inventory status
  • Real time indexing of vectors and update documents with vectors
  • True partial updates to update inventory status at scale

Since we use the Amazon Products dataset, we also recommend these resources:


In this blog we’ll be using pyvespa, a simple Python API built on top of Vespa’s native HTTP APIs.

The Python API is not meant as a production-ready API, but as an API to explore features in Vespa.
It is also useful for training ML models which can be deployed to Vespa for serving.

See also the Complete notebook which powered this blog post.

Exploring the Amazon Product dataset

Let us download the demo data from the Amazon Products dataset; we use the Amazon_Fashion subset.

!wget -nc
!wget -nc
!wget -nc

from amazonutils import *
from pprint import pprint
from itertools import islice           # used below to slice the product iterator
from IPython.display import Image, display

Let us have a look at a selected slice of the dataset:

for p in islice(iter_products('meta_Amazon_Fashion.json.gz'), 221,223):
  display(Image(p['imUrl'], width=128, height=128))

{'asin': 'B0001KHRKU',
'categories': [['Amazon Fashion']],
'imUrl': '',
'salesRank': {'Watches': 220635},
'title': 'Midas Remote Control Watch'}


{'asin': 'B0001LU08U',
'categories': [['Amazon Fashion']],
'imUrl': '',
'salesRank': {'Clothing': 1296873},
'title': 'Solid Genuine Leather Fanny Pack Waist Bag/purse with '


## Build basic search functionality

A Vespa instance is described by a [Vespa application package]( Let us create an application called product and define our [document schema]( We use pyvespa to define our application. This is not a requirement for creating a Vespa application; one can also use a plain text editor.

from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name = "product")
from vespa.package import Field

app_package.schema.add_fields(
    Field(name = "asin", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "description", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "price", type = "float", indexing = ["attribute", "summary"]),
    Field(name = "salesRank", type = "weightedset<string>", indexing = ["summary", "attribute"]),
    Field(name = "imUrl", type = "string", indexing = ["summary"])
)

We define a fieldset, which is a way to combine matching over multiple string fields.
We will only do text queries over the *title* and *description* fields.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "description"])
)

Then define a simple [ranking]( function which uses a linear combination of the [bm25]( text ranking feature over our two free text string fields.

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25",
        first_phase = "0.9*bm25(title) + 0.2*bm25(description)")
)

So let us deploy this application. We use docker in this example. See also [Vespa quick start](

### Deploy the application and start Vespa

from vespa.package import VespaDocker
vespa_docker = VespaDocker(port=8080)

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder="/home/centos/product_search" # include the desired absolute path here
)

Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.

### Index our product data

Pyvespa does expose a feed api, but in this notebook we use the raw [Vespa http /document/v1 feed api](

The HTTP document api is synchronous and the operation is visible in search when acked with a response code 200. In this case, the feed throughput is limited by the client as we are posting one document at a time. For high throughput use cases use the asynchronous feed api, or use more client threads with the synchronous api.

import requests 
session = requests.Session()

def index_document(product):
    asin = product['asin']
    doc = {
        "fields": {
            "asin": asin,
            "title": product.get("title", None),
            "description": product.get('description', None),
            "price": product.get("price", None),
            "imUrl": product.get("imUrl", None),
            "salesRank": product.get("salesRank", None)
        }
    }
    resource = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    request_response =, json=doc)

With our routine defined we can iterate over the data and index the products, one doc at a time:

from tqdm import tqdm
for product in tqdm(iter_products("meta_Amazon_Fashion.json.gz")):
    index_document(product)

24145it [01:46, 226.40it/s]

Our index is now ready; there is no need to perform any additional index maintenance operations like merging segments.
All the data is searchable. Let us define a simple routine to display search results, parsing the [Vespa JSON search response]( format:

def display_hits(res, ranking):
    time = 1000*res['timing']['searchtime'] #convert to ms
    totalCount = res['root']['fields']['totalCount']
    print("Found {} hits in {:.2f} ms.".format(totalCount,time))
    print("Showing top {}, ranked by {}".format(len(res['root']['children']),ranking))
    for hit in res['root']['children']:
        fields = hit['fields']
        print("{}".format(fields.get('title', None)))
        display(Image(fields.get("imUrl"), width=128, height=128))  
        print("documentid: {}".format(fields.get('documentid')))
        if 'inventory' in fields:
            print("Inventory: {}".format(fields.get('inventory')))
        print("asin: {}".format(fields.get('asin')))
        if 'price' in fields:
            print("price: {}".format(fields.get('price',None)))
        if 'priceRank' in fields:
            print("priceRank: {}".format(fields.get('priceRank',None)))  
        print("relevance score: {:.2f}".format(hit.get('relevance')))

### Query our product data

We use the [Vespa HTTP Search API]( to search our product index.

In this example we assume there is a user query 'mens wrist watch' which we use as input to the
[YQL]( query language. Vespa allows combining the structured application logic expressed by YQL with a
user query language called [Vespa simple query language](

In this case we use *type=any* so matching any of our 3 terms is enough to retrieve the document.
In the YQL statement we select the fields we want to return. Only fields which are marked as *summary* in the schema can be returned with the hit result.

We don't mention which fields we want to search, so Vespa uses the fieldset defined earlier called default, which will search both the title and the description fields.

query = {
    'yql': 'select documentid, asin, title, imUrl, price from sources * where userQuery();',
    'query': 'mens wrist watch',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 2
}
display_hits(app.query(body=query).json, "bm25")

Found 3285 hits in 4.00 ms.
Showing top 2, ranked by bm25

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 10.27

Popular Quality Brand New Fashion Mens Boy Leatheroid Quartz Wrist Watch Watches


documentid: id:demo:product::B009EJATDQ
asin: B009EJATDQ
relevance score: 9.67

So there we have our basic search functionality.

## Related products using nearest neighbor search in image feature spaces
Now we have basic search functionality up and running, but the Amazon Product dataset also includes image features which we can also
index in Vespa and use approximate nearest neighbor search on.
Let us load the image feature data. We reduce the vector dimensionality to something more practical and use 256 dimensions.

vectors = []
reduced = iter_vectors_reduced("image_features_Amazon_Fashion.b", 256, 1000)
for asin, v in tqdm(reduced("image_features_Amazon_Fashion.b")):
    vectors.append((asin, v))

22929it [00:04, 4739.67it/s]

We need to re-configure our application to add our image vector field.
We also define an *HNSW* index for it, using *angular* as our [distance metric](

We also need to define the input query vector in the application package.
Without defining our query input tensor we won't be able to perform our nearest neighbor search,
so make sure you remember to include that.

Most changes, like adding or removing a field, are a [live change]( in Vespa; there is no need to re-index the data.

from vespa.package import HNSW

app_package.schema.add_fields(
    Field(name = "image_vector",
          type = "tensor(x[256])",
          indexing = ["attribute", "index"],
          ann = HNSW(distance_metric = "angular"))
)

from vespa.package import QueryTypeField

app_package.query_profile_type.add_fields(
    QueryTypeField(name="ranking.features.query(query_image_vector)", type="tensor(x[256])")
)

We also need to define a ranking profile specifying how we want to score the documents. We use the *closeness* ranking feature. Note that it's also possible to retrieve results using the approximate nearest neighbor search operator and use the first phase ranking function as a re-ranking stage (e.g. by sales popularity).

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(
        name = "vector_similarity",
        first_phase = "closeness(field,image_vector)")
)
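With the *angular* distance metric, Vespa's *closeness* feature is computed as 1/(1 + distance), where distance is the angle between the query and document vectors. A minimal plain-Python sketch of how the score behaves (this is an illustration of the formula, not Vespa code):

```python
import math

def angular_distance(a, b):
    """Angle in radians between two vectors, as in Vespa's 'angular' metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] to guard against floating point drift.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

def closeness(a, b):
    """Vespa's closeness ranking feature: 1 / (1 + distance)."""
    return 1.0 / (1.0 + angular_distance(a, b))
```

An identical pair of vectors has angular distance 0 and closeness 1.0, which is why the query product itself comes back with relevance score 1.00 in the results further down.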

Now, we need to re-deploy our application package to make the changes effective.

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder = "/home/centos/product_search" # include the desired absolute path here
)

## Update the index with image vectors
Now we are ready to feed and index the image vectors.

We update the documents in the index by running partial update operations, adding the vectors with real-time updates of the existing documents.
Partially updating a tensor field, with or without an HNSW index, does not trigger re-indexing.

import requests

session = requests.Session()
for asin, vector in tqdm(vectors):
    update_doc = {
        "fields": {
            "image_vector": {
                "assign": {
                    "values": vector
                }
            }
        }
    }
    url = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    response = session.put(url, json=update_doc)

100%|██████████| 22929/22929 [01:40<00:00, 228.94it/s]

We now want to get similar products using the image feature data.
We do so by first fetching the
vector of the product we want to find similar products for, and use this vector as input to the nearest neighbor search operator of Vespa.
First we define a simple get vector utility to fetch the vector of a given product *asin*.

def get_vector(asin):
    resource = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    response = session.get(resource)
    document = response.json()
    cells = document['fields']['image_vector']['cells']
    vector = {}
    for cell in cells:
        v = cell['value']
        address = cell['address']['x']
        vector[int(address)] = v
    values = []
    for i in range(0, 256):
        values.append(vector[i])
    return values

Let us repeat the query from above to find a product image to search for similar products with:

query = {
    'yql': 'select documentid, asin,title,imUrl,price from sources * where userQuery();',
    'query': 'mens wrist watch',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 1
}
display_hits(app.query(body=query).json, "bm25")

Found 3285 hits in 4.00 ms.
Showing top 1, ranked by bm25

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 10.27

Let us search for similar images using **exact** nearest neighbor search. We ask for the 3 products most similar to the product image with asin id **B00GLP1GTW**.

query = {
    'yql': 'select documentid, asin,title,imUrl,description,price from sources * where \
        ([{"targetNumHits": 3, "approximate": false}]nearestNeighbor(image_vector, query_image_vector));',
    'ranking': 'vector_similarity',
    'hits': 3,
    'presentation.timing': True,
    'ranking.features.query(query_image_vector)': get_vector('B00GLP1GTW')
}
display_hits(app.query(body=query).json, "vector_similarity")

Found 46 hits in 10.00 ms.
Showing top 3, ranked by vector_similarity

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 1.00

Avalon EZC Unisex Low-Vision Silver-Tone Flex Bracelet One-Button Talking Watch, # 2609-1B


documentid: id:demo:product::B000M9GQ0M
asin: B000M9GQ0M
price: 44.95
relevance score: 0.63

White Rubber Strap Digital Watch


documentid: id:demo:product::B003ZYXF1Y
asin: B003ZYXF1Y
relevance score: 0.62

Let us repeat the same query, but this time using the faster approximate version.
When there is an HNSW index on the tensor,
the default behavior is to use approximate:true, so we simply remove the approximation flag.

query = {
    'yql': 'select documentid, asin,title,imUrl,description,price from sources * where \
        ([{"targetNumHits": 3}]nearestNeighbor(image_vector, query_image_vector));',
    'ranking': 'vector_similarity',
    'hits': 3,
    'presentation.timing': True,
    'ranking.features.query(query_image_vector)': get_vector('B00GLP1GTW')
}
display_hits(app.query(body=query).json, "vector_similarity")

Found 3 hits in 6.00 ms.
Showing top 3, ranked by vector_similarity

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)


documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 1.00

Avalon EZC Unisex Low-Vision Silver-Tone Flex Bracelet One-Button Talking Watch, # 2609-1B


documentid: id:demo:product::B000M9GQ0M
asin: B000M9GQ0M
price: 44.95
relevance score: 0.63

White Rubber Strap Digital Watch


documentid: id:demo:product::B003ZYXF1Y
asin: B003ZYXF1Y
relevance score: 0.62

## Combining nearest neighbor search with filters

If we compare the results of the exact and approximate nearest neighbor searches above, the approximate version returned exactly the same hits (perfect recall).
But naturally the first listed product is the very product we used as input, and its closeness score is 1.0 simply because its angular distance to itself is 0.
Since the user is already presented with that product, we want to remove it from the result. We can do that