and the CORD-19 public API

Thiago Martins

Vespa Data Scientist

This post was first published at and the CORD-19 public API.

The Vespa team has been working non-stop to put together the search app
based on the COVID-19 Open Research Dataset (CORD-19) released by the
Allen Institute for AI.
Both the frontend and the backend are 100% open source.
The backend is based on Vespa, a powerful open-source computation engine.
Since everything is open source, you can contribute to the project in multiple ways.

As a user, you can either search for articles by using the frontend
or perform advanced search by using the
public search API.
As a developer, you can contribute by improving the existing application through pull requests to
the backend and frontend,
or you can fork and create your own application,
either locally
or through Vespa Cloud,
to experiment with different ways to match and rank the CORD-19 articles.
My goal with this piece is to give you an overview of what can be accomplished with Vespa
by using the cord19 search app public API.
This only scratches the surface,
but I hope it helps direct you to the right places to learn more about what is possible.

Simple query language

The query interface supports the Vespa
simple query language
that allows you to quickly perform simple queries. Examples:
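
A few query strings of this kind might look as follows. This is only a sketch: the query strings are illustrative rather than taken from the cord19 documentation, `EXAMPLE_QUERIES` and `to_query_string` are hypothetical names, and passing the text in a `query` URL parameter is an assumption based on the `'query'` key used in the request bodies later in this post. The `+`, `-`, quoted-phrase and `field:term` forms belong to the Vespa simple query language.

```python
from urllib.parse import urlencode

# Illustrative query strings (not taken from the cord19 documentation).
EXAMPLE_QUERIES = {
    # '+' marks a term that must be present.
    "required_terms": "+coronavirus +temperature",
    # '-' marks a term that must not be present.
    "excluded_term": "+coronavirus -MERS",
    # Double quotes turn several words into one phrase.
    "phrase": '+"reproduction number"',
    # 'field:term' restricts a term to a single field.
    "field_restricted": "title:coronavirus abstract:temperature",
}


def to_query_string(query, hits=5):
    """URL-encode a simple-query-language string as request parameters."""
    return urlencode({"query": query, "hits": hits})


for name, query in EXAMPLE_QUERIES.items():
    print(name, "->", to_query_string(query))
```

The encoded string would be appended to whatever search URL you are using; that URL is not assumed here.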

Vespa Search API

In addition to the simple query language,
Vespa also has a more powerful search API
that gives full control over the search experience through the
Vespa query language, YQL.
We can send a wide range of queries by sending a POST request to the search end-point of the cord19 app.
The following Python code illustrates the API:

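import requests # Install via 'pip install requests'

# endpoint: URL of the search end-point; body: a query dict like the examples below
response = requests.post(endpoint, json=body)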
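
For repeated use, the request and the response handling can be wrapped in small helpers. The sketch below is an illustration under assumptions, not the app's client code: `build_request` and `extract_hits` are hypothetical helpers, the endpoint URL is a placeholder that must be replaced with the real search end-point, and the response is assumed to follow Vespa's default JSON result layout (hits under `root.children`, each with `relevance` and `fields`). It uses the standard library's `urllib` rather than `requests` so that it runs without extra packages.

```python
import json
from urllib.request import Request


def build_request(endpoint, body):
    """Prepare (but do not send) a JSON POST request for the search end-point."""
    return Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_hits(result):
    """Return (relevance, fields) pairs from a parsed Vespa JSON response."""
    children = result.get("root", {}).get("children", [])
    return [(hit.get("relevance"), hit.get("fields", {})) for hit in children]


# Placeholder URL: replace it with the actual search end-point.
request = build_request(
    "https://example.invalid/search/",
    {"query": "coronavirus", "hits": 5},
)
print(request.get_method(), request.full_url)

# A hand-written response in the assumed layout, used instead of a live call.
sample = {"root": {"children": [{"relevance": 1.5, "fields": {"title": "A title"}}]}}
print(extract_hits(sample))
```

To actually send the request, pass it to `urllib.request.urlopen` and parse the reply with `json.load`.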

Search by query terms

Let’s break down one example to hint at what is possible with the Vespa search API:

body = {
  'yql'    : 'select title, abstract from sources * where userQuery() and has_full_text=true and timestamp > 1577836800;',
  'hits'   : 5,
  'query'  : 'coronavirus temperature sensitivity',
  'type'   : 'any',
  'ranking': 'bm25'
}

The match phase:
The body parameter above selects the title and abstract fields of all articles that match
any ('type': 'any') of the 'query' terms,
have full text available (has_full_text=true), and have a timestamp greater than 1577836800 (2020-01-01 00:00:00 UTC when read as Unix epoch seconds).

The ranking phase:
After matching the articles by the criteria described above, Vespa will rank them according to their
BM25 scores ('ranking': 'bm25')
and return the top 5 articles ('hits': 5) according to this ranking criterion.
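
The same request body can be generated from a date instead of a hard-coded number. This is a minimal sketch, assuming the `timestamp` field holds Unix epoch seconds (under that assumption 1577836800 is 2020-01-01 00:00:00 UTC); `build_term_query` is a hypothetical helper, not part of the cord19 API.

```python
from datetime import datetime, timezone


def build_term_query(query, since, hits=5, ranking="bm25"):
    """Build a request body like the one above for articles newer than `since`."""
    # Assumption: the 'timestamp' field stores Unix epoch seconds.
    timestamp = int(since.timestamp())
    yql = (
        "select title, abstract from sources * "
        "where userQuery() and has_full_text=true "
        f"and timestamp > {timestamp};"
    )
    return {
        "yql": yql,
        "hits": hits,
        "query": query,
        "type": "any",
        "ranking": ranking,
    }


body = build_term_query(
    "coronavirus temperature sensitivity",
    datetime(2020, 1, 1, tzinfo=timezone.utc),
)
print(body["yql"])
```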

The example above gives only a taste of what is possible with the search API.
We can tailor both the match phase and ranking phase to our needs.
For example, we can use more complex match operators such as the Vespa weakAND,
or we can restrict the search to look for a match only in the abstract by adding 'default-index': 'abstract' to the body above.
We can also experiment with different ranking functions at query time
by changing the 'ranking' parameter to one of the rank-profiles available in the
search definition file.
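
These variations amount to changing a key or two in the request body. The sketch below shows one way to derive them from a base body; `with_overrides` is a hypothetical helper, and `'my-rank-profile'` is a placeholder name that would need to match a rank-profile actually defined in the search definition file.

```python
base_body = {
    "yql": "select title, abstract from sources * where userQuery();",
    "hits": 5,
    "query": "coronavirus temperature sensitivity",
    "type": "any",
    "ranking": "bm25",
}


def with_overrides(body, overrides):
    """Return a copy of `body` with some request parameters added or replaced."""
    return {**body, **overrides}


# Match only against the abstract field.
abstract_only = with_overrides(base_body, {"default-index": "abstract"})

# Rank with a different profile (placeholder name, see the note above).
other_profile = with_overrides(base_body, {"ranking": "my-rank-profile"})

print(abstract_only["default-index"], other_profile["ranking"])
```

Because the helper copies the dictionary, the base body stays untouched and each variant can be sent and compared independently.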

Additional resources:

  • The Vespa text search tutorial shows how to create a text search app on a step-by-step basis.
    Part 1
    shows how to create a basic app from scratch.
    Part 2
    shows how to collect training data from Vespa and improve the application with ML models.
    Part 3
    shows how to get started with semantic search by using pre-trained sentence embeddings.
  • More YQL examples specific to the cord19 app can be found in
    cord19 API doc.

Search by semantic relevance

In addition to searching by query terms, Vespa supports semantic search.

body = {
    'yql': 'select * from sources * where  ([{"targetNumHits":100}]nearestNeighbor(title_embedding, vector));',
    'hits': 5,
    'ranking.features.query(vector)': embedding.tolist(),  # embedding: the query embedding, computed at query time
    'ranking.profile': 'semantic-search-title',
}

The match phase:
In the query above, we use the nearestNeighbor operator to match at least 100 articles ([{"targetNumHits":100}])
with the smallest (Euclidean) distance between the title_embedding
and the query embedding vector.

The ranking phase:
After matching we can rank the documents in a variety of ways.
In this case, we use a specific rank-profile named 'semantic-search-title'
that was pre-defined to order the matched articles by the distance between title and query embeddings.
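
Conceptually, the two phases select the closest title embeddings and then order them by distance. The pure-Python sketch below illustrates that idea with an exhaustive scan over toy two-dimensional vectors; it says nothing about how Vespa implements the nearestNeighbor operator, and `euclidean_distance`, `nearest_titles` and the toy data are all hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_titles(query_vector, title_embeddings, target_num_hits=100, hits=5):
    """Return the ids of the closest titles, nearest first."""
    scored = sorted(
        (euclidean_distance(query_vector, vector), doc_id)
        for doc_id, vector in title_embeddings.items()
    )
    # Match phase: keep the target_num_hits closest documents.
    matched = scored[:target_num_hits]
    # Ranking phase: already ordered by distance, so return the top hits.
    return [doc_id for _, doc_id in matched[:hits]]


toy_embeddings = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [3.0, 4.0]}
print(nearest_titles([0.0, 0.1], toy_embeddings, target_num_hits=2))
```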

The title embeddings were created while feeding the documents to Vespa,
whereas the query embedding is created at query time and sent to Vespa through the ranking.features.query(vector) parameter.
This Kaggle notebook
illustrates how to perform a semantic search in the cord19 app by using the
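
Once a query embedding is available, assembling the request body is mechanical. The following is a sketch under assumptions: `build_semantic_query` is a hypothetical helper, the three-number vector is a stand-in, and a real query embedding must come from the same model, with the same dimension, as the stored title embeddings.

```python
def build_semantic_query(embedding, target_num_hits=100, hits=5):
    """Build a nearest-neighbor request body from a query embedding."""
    # Accept any sequence of numbers (list, tuple, NumPy array, ...).
    vector = [float(x) for x in embedding]
    yql = (
        "select * from sources * where "
        f'([{{"targetNumHits":{target_num_hits}}}]'
        "nearestNeighbor(title_embedding, vector));"
    )
    return {
        "yql": yql,
        "hits": hits,
        "ranking.features.query(vector)": vector,
        "ranking.profile": "semantic-search-title",
    }


# Stand-in vector for illustration only; not a real sentence embedding.
body = build_semantic_query([0.1, 0.2, 0.3])
print(body["yql"])
```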

Additional resources:

  • Part 3 of the text search tutorial
    shows how to get started with semantic search by using pre-trained sentence embeddings.
  • Go to the Ranking page
    to learn more about ranking in general and how to deploy ML models in Vespa (including TensorFlow, XGBoost, etc.).

WRITTEN BY: Thiago G. Martins. Working on Follow me on Twitter @Thiagogm.