E-commerce search and recommendation with Vespa.ai

Holiday shopping season is upon us and it’s time for a blog post on E-commerce search and recommendation using Vespa.ai. Vespa.ai is used as the search and recommendation backend at multiple Yahoo e-commerce sites in Asia, like tw.buy.yahoo.com.

This blog post discusses some of the challenges in e-commerce search and recommendation, and shows how they can be solved using the features of Vespa.ai.


Photo by Jonas Leupe on Unsplash

E-commerce search have text ranking requirements where traditional text ranking features like BM25 or TF-IDF might produce poor results. For an introduction to some of the issues with TF-IDF/BM25 see the influence of TF-IDF algorithms in e-commerce search. One example from the blog post is a search for ipad 2 which with traditional TF-IDF ranking will rank ‘black mini ipad cover, compatible with ipad 2’ higher than ‘Ipad 2’ as the former product description has several occurrences of the query terms Ipad and 2.

Vespa allows developers and relevancy engineers to fine tune the text ranking features to meet the domain specific ranking challenges. For example developers can control if multiple occurrences of a query term in the matched text should impact the relevance score. See text ranking occurrence tables and Vespa text ranking types for in-depth details. Also the Vespa text ranking features takes text proximity into account in the relevancy calculation, i.e how close the query terms appear in the matched text. BM25/TF-IDF on the other hand does not take query term proximity into account at all. Vespa also implements BM25 but it’s up to the relevancy engineer to chose which of the rich set of built-in text ranking features in Vespa that is used.

Vespa uses OpenNLP for linguistic processing like tokenization and stemming with support for multiple languages (as supported by OpenNLP).

Your manager might tell you that these items of the product catalog should be prominent in the search results. How to tackle this with your existing search solution? Maybe by adding some synthetic query terms to the original user query, maybe by using separate indexes with federated search or even with a key value store which rarely is in synch with the product catalog search index?

With Vespa it’s easy to promote content as Vespa’s ranking framework is just math and allows the developer to formulate the relevancy scoring function explicitly without having to rewrite the query formulation. Vespa controls ranking through ranking expressions configured in rank profiles which enables full control through the expressive Vespa ranking expression language. The rank profile to use is chosen at query time so developers can design multiple ranking profiles to rank documents differently based on query intent classification. See later section on query classification for more details how query classification can be done with Vespa.

A sample ranking profile which implements a tiered relevance scoring function where sponsored or promoted items are always ranked above non-sponsored documents is shown below. The ranking profile is applied to all documents which matches the query formulation and the relevance score of the hit is the assigned the value of the first-phase expression. Vespa also supports multi-phase ranking.

Sample hand crafted ranking profile defined in the Vespa application package.

The above example is hand crafted but for optimal relevance we do recommend looking at learning to rank (LTR) methods. See learning to Rank using TensorFlow Ranking and learning to Rank using XGBoost. The trained MLR models can be used in combination with the specific business ranking logic. In the example above we could replace the default-ranking function with the trained MLR model, hence combining business logic with MLR models.

Guiding the user through the product catalog by guided navigation or faceted search is a feature which users expects from an e-commerce search solution today and with Vespa, facets and guided navigation is easily implemented by the powerful Vespa Grouping Language.

Sample screenshot from Vespa e-commerce sample application UI demonstrating search facets using Vespa Grouping Language.

The Vespa grouping language supports deep nested grouping and aggregation operations over the matched content. The language also allows pagination within the group(s). For example if grouping hits by category and displaying top 3 ranking hits per category the language allows paginating to render more hits from a specified category group.

Studies (e.g. this study from FlipKart) finds that there is a significant fraction of queries in e-commerce search which suffer from vocabulary mismatch between the user query formulation and the relevant product descriptions in the product catalog. For example, the query “ladies pregnancy dress” would not match a product with description “women maternity gown” due to vocabulary mismatch between the query and the product description. Traditional Information Retrieval (IR) methods like TF-IDF/BM25 would fail retrieving the relevant product right off the bat.

Most techniques currently used to try to tackle the vocabulary mismatch problem are built around query expansion. With the recent advances in NLP using transfer learning with large pre-trained language models, we believe that future solutions will be built around multilingual semantic retrieval using text embeddings from pre-trained deep neural network language models. Vespa has recently announced a sample application on semantic retrieval which addresses the vocabulary mismatch problem as the retrieval is not based on query terms alone, but instead based on the dense text tensor embedding representation of the query and the document. The mentioned sample app reproduces the accuracy of the retrieval model described in the Google blog post about Semantic Retrieval.

Using our query and product title example from the section above, which suffers from the vocabulary mismatch, and instead move away from the textual representation to using the respective dense tensor embedding representation, we find that the semantic similarity between them is high (0.93). The high semantic similarity means that the relevant product would be retrieved when using semantic retrieval. The semantic similarity is in this case defined as the cosine similarity between the dense tensor embedding representations of the query and the product description. Vespa has strong support for expressing and storing tensor fields which one can perform tensor operations (e.g cosine similarity) over for ranking, this functionality is demonstrated in the mentioned sample application.

Below is a simple matrix comparing the semantic similarity of three pairs of (query, product description). The tensor embeddings of the textual representation is obtained with the Universal Sentence Encoder from Google.

Semantic similarity matrix of different queries and product descriptions.

The Universal Sentence Encoder Model from Google is multilingual as it was trained on text from multiple languages. Using these text embeddings enables multilingual retrieval so searches written in Chinese can retrieve relevant products by descriptions written in multiple languages. This is another nice property of semantic retrieval models which is particularly useful in e-commerce search applications with global reach.

Vespa supports deploying stateless machine learned (ML) models which comes handy when doing query classification. Machine learned models which classify the query is commonly used in e-commerce search solutions and the recent advances in natural language processing (NLP) using pre-trained deep neural language models have improved the accuracy of text classification models significantly. See e.g text classification using BERT for an illustrated guide to text classification using BERT. Vespa supports deploying ML models built with TensorFlow, XGBoost and PyTorch through the Open Neural Network Exchange (ONNX) format. ML models trained with mentioned tools can successfully be used for various query classification tasks with high accuracy.

In e-commerce search, classifying the intent of the query or query session can help ranking the results by using an intent specific ranking profile which is tailored to the specific query intent. The intent classification can also determine how the result page is displayed and organised.

Consider a category browse intent query like ‘shoes for men’. A query intent which might benefit from a query rewrite which limits the result set to contain only items which matched the unambiguous category id instead of just searching the product description or category fields for ‘shoes for men’ . Also ranking could change based on the query classification by using a ranking profile which gives higher weight to signals like popularity or price than text ranking features.

Vespa also features a powerful query rewriting language which supports rule based query rewrites, synonym expansion and query phrasing.

Vespa is commonly used for recommendation use cases and e-commerce is no exception.

Vespa is able to evaluate complex Machine Learned (ML) models over many data points (documents, products) in user time which allows the ML model to use real time signals derived from the current user’s online shopping session (e.g products browsed, queries performed, time of day) as model features. An offline batch oriented inference architecture would not be able to use these important real time signals. By batch oriented inference architecture we mean pre-computing the inference offline for a set of users or products and where the model inference results is stored in a key-value store for online retrieval.

Update 2021-05-20: Blog recommendation tutorials are replaced by the
News search and recommendation tutorial:

In our blog recommendation tutorial
we demonstrate how to apply a collaborative filtering model for content recommendation and in
part 2 of the blog recommendation tutorial
we show how to use a neural network trained with TensorFlow to serve recommendations in user time. Similar recommendation approaches are used with success in e-commerce.

Keeping your e-commerce index up to date with real time updates

Vespa is designed for horizontal scaling with high sustainable write and read throughput with low predictable latency. Updating the product catalog in real time is of critical importance for e-commerce applications as the real time information is used in retrieval filters and also as ranking signals. The product description or product title rarely changes but meta information like inventory status, price and popularity are real time signals which will improve relevance when used in ranking. Also having the inventory status reflected in the search index also avoids retrieving content which is out of stock.

Vespa has true native support for partial updates where there is no need to re-index the entire document but only a subset of the document (i.e fields in the document). Real time partial updates can be done at scale against attribute fields which are stored and updated in memory. Attribute fields in Vespa can be updated at rates up to about 40-50K updates/s per content node.

Using Vespa’s support for predicate fields it’s easy to control when content is surfaced in search results and not. The predicate field type allows the content (e.g a document) to express if it should match the query instead of the other way around. For e-commerce search and recommendation we can use predicate expressions to control how product campaigns are surfaced in search results. Some examples of what predicate fields can be used for:

  • Only match and retrieve the document if time of day is in the range 8–16 or range 19–20 and the user is a member. This could be used for promoting content for certain users, controlled by the predicate expression stored in the document. The time of day and member status is passed with the query.
  • Represent recurring campaigns with multiple time ranges.

Above examples are by no means exhaustive as predicates can be used for multiple campaign related use cases where the filtering logic is expressed in the content.

Are you worried that your current search installation will break by the traffic surge associated with the holiday shopping season? Are your cloud VMs running high on disk busy metrics already? What about those long GC pauses in the JVM old generation causing your 95percentile latency go through the roof? Needless to say but any downtime due to a slow search backend causing a

Vespa.ai and the CORD-19 public API

Thiago Martins

Thiago Martins

Vespa Data Scientist

This post was first published at
Vespa.ai and the CORD-19 public API.

The Vespa team has been working non-stop to put together the
cord19.vespa.ai search app
based on the COVID-19 Open Research Dataset (CORD-19) released by the
Allen Institute for AI.
Both the frontend and
the backend
are 100% open-sourced.
The backend is based on vespa.ai, a powerful and open-sourced computation engine.
Since everything is open-sourced, you can contribute to the project in multiple ways.

As a user, you can either search for articles by using the frontend
or perform advanced search by using the
public search API.
As a developer, you can contribute by improving the existing application through pull requests to
the backend and
or you can fork and create your own application,
either locally
or through Vespa Cloud,
to experiment with different ways to match and rank the CORD-19 articles.
My goal here with this piece is to give you an overview of what can be accomplished with Vespa
by using the cord19 search app public API.
This only scratches the surface
but I hope it can help direct you to the right places to learn more about what is possible.

Simple query language

The cord19.vespa.ai query interface supports the Vespa
simple query language
that allows you to quickly perform simple queries. Examples:

Additional resources:

Vespa Search API

In addition to the simple query language,
Vespa has also a more powerful search API
that gives full control in terms of search experience through the
Vespa query language called YQL.
We can then send a wide range of queries by sending a POST request to the search end-point of cord19.vespa.ai.
Following are python code illustrating the API:

import requests # Install via 'pip install requests'

response = requests.post(endpoint, json=body)

Search by query terms

Let’s break down one example to give you a hint of what is possible to do with Vespa search API:

body = {
  'yql'    : 'select title, abstract from sources * where userQuery() and has_full_text=true and timestamp > 1577836800;',
  'hits'   : 5,
  'query'  : 'coronavirus temperature sensitivity',
  'type'   : 'any',
  'ranking': 'bm25'

The match phase:
The body parameter above will select the title and the abstract fields for all articles that match
any ('type': 'any') of the 'query' terms
and that has full text available (has_full_text=true) and timestamp greater than 1577836800.

The ranking phase:
After matching the articles by the criteria described above, Vespa will rank them according to their
BM25 scores ('ranking': 'bm25')
and return the top 5 articles ('hits': 5) according to this rank criteria.

The example above gives only a taste of what is possible with the search API.
We can tailor both the match phase and ranking phase to our needs.
For example, we can use more complex match operators such as the Vespa weakAND,
we can restrict the search to look for a match only in the abstract by adding 'default-index': 'abstract' in the body above.
We can experiment with different ranking function at query time
by changing the 'ranking' parameter to one of the rank-profiles available in the
search definition file.

Additional resources:

  • The Vespa text search tutorial shows how to create a text search app on a step-by-step basis.
    Part 1
    shows how to create a basic app from scratch.
    Part 2
    shows how to collect training data from Vespa and improve the application with ML models.
    Part 3
    shows how to get started with semantic search by using pre-trained sentence embeddings.
  • More YQL examples specific to the cord19 app can be found in
    cord19 API doc.

Search by semantic relevance

In addition to searching by query terms, Vespa supports semantic search.

body = {
    'yql': 'select * from sources * where  ([{"targetNumHits":100}]nearestNeighbor(title_embedding, vector));',
    'hits': 5,
    'ranking.features.query(vector)': embedding.tolist(),
    'ranking.profile': 'semantic-search-title',

The match phase:
In the query above we match at least 100 articles ([{"targetNumHits":100}])
which have the smallest (euclidean) distance between the title_embedding
and the query embedding vector by using the nearestNeighbor operator.

The ranking phase:
After matching we can rank the documents in a variety of ways.
In this case, we use a specific rank-profile named 'semantic-search-title'
that was pre-defined to order the matched articles the distance between title and query embeddings.

The title embeddings have been created while feeding the documents to Vespa
while the query embedding is created at query time and sent to Vespa by the ranking.features.query(vector) parameter.
This Kaggle notebook
illustrates how to perform a semantic search in the cord19 app by using the

Additional resources:

  • Part 3 of the text search tutorial
    shows how to get started with semantic search by using pre-trained sentence embeddings.
  • Go to the Ranking page
    to know more about ranking in general and how to deploy ML models in Vespa (including TensorFlow, XGBoost, etc).

WRITTEN BY: Thiago G. Martins. Working on Vespa.ai. Follow me on Twitter @Thiagogm.

Efficient open-domain question-answering on Vespa.ai

Open-domain question-answering has emerged as a benchmark for measuring a
system’s capability to read, represent, and retrieve general knowledge.
Retrieval-based question-answering systems require connecting various systems
and services, such as BM25 text search, vector similarity search, NLP model
serving, tokenizers, and middleware to glue all this together. Most of these
are core features of Vespa.ai. In this post, we reproduce the state-of-the-art
baseline for retrieval-based question-answering systems within a single,
scalable production ready application on Vespa.ai.


Some of the most effective drivers of scientific progress are benchmarks.
Benchmarks provide a common goal, a purpose, for improving the state-of-the-art
on datasets that are available to everyone. Leaderboards additionally add a
competitive motivation, offering the opportunity to excel among peers. And
rather than just endlessly tinkering to improve relevant metrics, competitions
add deadlines that spur researchers to actually get things done.

Within the field of machine learning, benchmarks have been particularly
important to stimulate innovation and progress. A new competition, the
Efficient Open-Domain Question Answering challenge for NeurIPS 2020, seeks to
advance the state-of-the-art in question answering systems. The goal here is to
develop a system capable of answering questions without any topic restriction.
With all the recent progress in natural language processing, this area has
emerged as a benchmark for measuring a system’s capability to read, represent,
and retrieve general knowledge.

The current retrieval-based state-of-the-art is the Dense Passage Retrieval
system, as described in the Dense
Passage Retrieval for Open-Domain Question Answering
paper. It consists of a set of python
scripts, tools, and models developed primarily for research. There are a lot
of parts in such a system. These include two BERT-based models for encoding
text to embedding vectors, another BERT-based model for extracting answers,
approximate nearest-neighbor similarity search and text-based BM25 methods for
retrieving candidates, tokenizers, and so on. It’s not trivial to bring such a
system to production. We thought it would be interesting to consolidate these
different parts and demonstrate how to build an open-domain question-answering
serving system with Vespa.ai that achieves state-of-the-art accuracy.

Most of these components are core features in Vespa. A while ago, we improved
Vespa’s text search support for term-based retrieval and ranking. We recently
added efficient approximate nearest neighbors for semantic, dense vector
recall. For hybrid retrieval, Vespa supports many types of machine-learned
models, for instance neural networks and decision forests. We have also
improved our support for TensorFlow and PyTorch models to run larger NLP and
Transformer models.

This is interesting because while this has obvious benefits in a research
setting, such systems’ real value lies in their end-use in applications. Vespa
is designed as a highly performant and scalable production-ready system. Thus,
it offers a simplified path to deployment in production without coping with the
complexity of maintaining many different subsystems. That makes Vespa an
attractive package.

During this blog post, we’ll touch upon

  • Fast approximate-nearest neighbors for semantic, dense vector retrieval.
  • Term-based (BM25) retrieval for sparse vector retrieval.
  • Importing of multiple pre-trained BERT-based models in Vespa for encoding embedding vectors and extracting answers.
  • Custom logic for tokenization and other things.

For more details we refer to the companion sample

This post’s goal is to recreate the Dense Passage Retrieval (DPR) paper results
for the Natural Questions
We’ll first go through a high-level overview of how a retrieval-based
question-answering system works in the context of this paper. Then we’ll show
how this can all be implemented on Vespa without any external services or
plugins, and recreate the paper’s state-of-the-art results as measured by the
exact match of answers given a set of questions. We’ll wrap up with a look to
the future and the next post in this series.

Background: the anatomy of a retrieval-based QA system

The Natural Questions benchmark consists of natural language questions and
answers. How to retrieve and represent the knowledge required to answer the
questions is up to each system. There are two main approaches to this:
retrieval and parametric. A retrieval-based system uses search terms or
semantic representation vectors to recall a set of sentences or paragraphs
subsequently evaluated with a machine-learned model to extract the exact
answer. A parametric system stores the answers more or less directly in the
weights of a large neural network model. There has also been research into
hybrid retrieval and parametric systems such as the Retrieval-Augmented
Generation system, which recently improved
the state-of-the-art for the natural question benchmark as a whole. In this
blog post, we’ll focus on a retrieval-based system, but will explore parametric
and hybrid approaches in later blog posts.

A retrieval-based question answering system typically stores its “knowledge” in
an information retrieval system. This can be sentences, paragraphs, or entire
documents. Here we’ll use the Wikipedia dataset where each page is split into
passages of 100 words each. The dataset contains 21 million such passages. When
answering a question, we first retrieve the passages that most likely include
the answer. They are then analyzed with a machine-learned model to extract the
spans that most likely results in the correct response. These stages are called
the “retriever” and “reader”, respectively.

Extracting the answer from passages

The retriever

The retriever is responsible for generating a set of candidate passages. Since
the subsequent reader component is expensive to evaluate, it is crucial to have
an effective retrieval mechanism. There are two main approaches to passage
retrieval: term-based (sparse) such as for BM25, and embedding (dense) vectors,
which each have their strengths and weaknesses.

Term-based (sparse) retrieval

Term-based retrieval is the classic information retrieval method and covers
algorithms such as TF-IDF and BM25. Conceptually, the text is represented by a
vector where each dimension represents a term in a vocabulary. A non-zero value
signifies its presence. As each text only contains a subset of possible terms
in the vocabulary, these vectors are large and sparse. The similarity between
two texts, for instance, a document and a query, can be computed by a dot
product between the sparse vectors with slight modifications (e.g., term
importance) for TF-IDF or BM25. Term-based methods rely on inverted index
structures for efficient retrieval. This can in some cases be further
accelerated by algorithms such as WAND.

Except for any pre-processing such as lemmatization, stemming, and possibly
stop-word removal, terms are matched exactly as found in the text. This can be
a strength as well as a weakness. For salient terms, e.g. names and
places, this cuts down the search space significantly. However, potentially
relevant documents that don’t contain the exact term will not be retrieved
unless one uses query expansion or related techniques. The Dense Passage
Retrieval (DPR) paper uses ElasticSearch as the providing system for BM25.

Embedding-based (dense) retrieval

The number of potential terms in a vocabulary can be vast indeed. The basic
idea behind embedding vectors is to compress this high dimensional sparse
vector to a much smaller dense vector where most dimensions contain a non-zero
value. This has the effect of projecting a query or document vector into a
lower-dimensional space. This can be done so that vectors that are close
geometrically are also close semantically. The DPR paper uses two BERT models
to encode text: one for encoding queries and one for encoding documents. The
two models are trained simultaneously in a two-tower configuration to maximize
the dot product for passages likely to answer the question.

In contrast to the sparse representation, there are no exact methods for
finding the nearest neighbors efficiently. So we trade accuracy for efficiency
in what is called approximate nearest neighbors (ANN). Many different methods
for ANN search have been proposed. Some are compatible with inverted index
structures so they can be readily implemented in existing information retrieval
systems. Examples are k-means clustering, product quantization (and it’s
relatives), and locality sensitive hashing, where the centroids or buckets can
be indexed. A method that is not compatible with inverted indexes is
HNSW (hierarchical navigable small world).
HNSW is based on graph structures, is efficient, and has an attractive
property where the graph can be incrementally built at runtime. This is in
contrast to most other methods that require offline, batch oriented index

Retrieval based on semantic embeddings complements term-based retrieval
well. Semantically similar documents can be recalled even though they don’t
contain the exact same terms. Unlike the bag-of-words approach for term-based
retrieval, word order can provide additional context. Historically, however,
term-based retrieval has outperformed semantic embeddings on question answering
problems, but the DPR paper shows that dense retrieval can be vastly improved
if the encoding has specifically been trained to the task. The DPR paper uses
FAISS with an HNSW index for
similarity search.

The reader

While the retriever component’s job is to produce a set of candidate passages
that hopefully contain the answer to the question, the reader extracts the
passages’ actual answer. This requires some form of natural language
understanding model, and BERT (or other Transformer) models are used. These
models are typically huge and expensive to evaluate, so only a small number of
candidate passages are run through them.

Transformer models take sequences of tokens as input. The tokenization of text
can be done in quite a few different ways to balance vocabulary size and
sequence length. Due to BERT models’ full attention mechanism, evaluation time
increases quadratically with sequence length. So a reasonable balance must be
struck, and BERT-based models use a WordPiece or similar algorithm to split
less common words into subwords.

The reader model’s input is the concatenation of the tokens representing the
question, the document’s title, and the passage itself. The model looks up the
embedding representation for each token, and through a series of layers,
produces a new representation for each token. These representations can then be
used for different tasks. For question-answering, an additional layer is added
to compute the three necessary outputs: the relevance of the passage to the
question, and the start and end indexes of the answer.

BERT for question answering

To extract the final answer, the passage that produced the largest relevance
score is used. The two other outputs of the model are probabilities for each
token of being a start token and an end token. The final answer is chosen by
finding the span with the largest sum of start probability and end probability.
This results in a sequence of tokens, which must be converted to words by the
tokenizer before returning. The DPR paper uses a BERT-based
model to
output span predictions.

Putting all this together

The retrieval-based question-answering system as described above, capable of
both term- and embedding-based retrieval, requires at least the following

  • A BM25 information retrieval system storing the 21 million Wikipedia text passages.
  • An efficient vector similarity search system storing the passage embedding vectors.
  • A model serving system for the three different BERT-based models: query encoder, document encoder, and reader model.
  • A BERT-based tokenizer.
  • Middleware to glue this all together.

The tokenizer generates the token sequence for the text. These tokens are
stored for usage in the reader model. They are also used in a document
BERT-base encoder model to create the embedding vector for dense retrieval. The
text and embedding vectors need to be indexed for fast retrieval.

A similar process is followed for each query. The tokenizer produces a token
sequence used to generate an embedding vector in the query BERT-based encoder.
The first stage, retrieval, is done using either term-based retrieval or
embedding based retrieval. The top-N passages are passed to the Reader model
and ranked accordingly. The best passage is analyzed to extract the best span
containing the answer.

This is a non-trivial list of services that need to be set up to implement a
question-answering system. In the next section we show how to implement all
this as a single Vespa application.

Reproducing the baseline on Vespa

Most of the components mentioned above have become core features in Vespa. In
the following we’ll present an overview of setting this up in Vespa. The
details can be seen in the companion sample


When creating an application with Vespa, one typically starts with a document