Build a basic text search application from python with Vespa

Thiago Martins

Vespa Data Scientist


Introducing the pyvespa simplified API: build a Vespa application from python with a few lines of code.

UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.

This post will introduce you to the simplified pyvespa API that allows us to build a basic text search application from scratch with just a few lines of python code. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.

Photo by Sarah Dorweiler on Unsplash

pyvespa exposes a subset of Vespa API in python. The library’s primary goal is to allow for faster prototyping and facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect and interact with running Vespa applications and evaluate Vespa ranking functions from python. This time, we focus on building and deploying applications from scratch.

Install

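The pyvespa simplified API introduced here was released in version 0.2.0. The version specifier is quoted so that the shell does not interpret >= as an output redirection:

pip3 install "pyvespa>=0.2.0" learntorank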

Define the application

As an example, we will build an application to search through CORD19 sample data.

Create an application package

The first step is to create a Vespa ApplicationPackage:

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="cord19")

Add fields to the Schema

We can then add fields to the application’s Schema created by default in app_package.

from vespa.package import Field

app_package.schema.add_fields(
    Field(
        name = "cord_uid", 
        type = "string", 
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    ),
    Field(
        name = "abstract", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    )
)

  • cord_uid will store the cord19 document ids, while title and abstract are self-explanatory.

  • All the fields, in this case, are of type string.

  • Including "index" in the indexing list means that Vespa will create a searchable index for title and abstract. You can read more about which options are available for indexing in the Vespa documentation.

  • Setting index = "enable-bm25" makes Vespa pre-compute the quantities needed to compute the BM25 score quickly. We will use BM25 to rank the documents retrieved.

Search multiple fields when querying

A Fieldset groups fields together for searching. For example, the default fieldset defined below groups title and abstract together.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)

Define how to rank the documents matched

We can specify how to rank the matched documents by defining a RankProfile. In this case, we define the bm25 rank profile, which sums the BM25 scores computed over the title and abstract fields.

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25", 
        first_phase = "bm25(title) + bm25(abstract)"
    )
)
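
To make the first_phase expression above more concrete, here is a minimal pure-python sketch of a textbook Okapi BM25 score computed per field and summed, mirroring bm25(title) + bm25(abstract). It is not Vespa's implementation: the whitespace tokenizer, the two-document toy corpus, and the parameter values k1 = 1.2 and b = 0.75 are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; Vespa's linguistics are richer).
    return text.lower().split()


def bm25_field(query_terms, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one field of one document."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(tokens) for tokens in corpus_tokens) / n_docs
    if avg_len == 0:
        return 0.0
    term_freq = Counter(doc_tokens)
    score = 0.0
    for term in set(query_terms):
        tf = term_freq[term]
        doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
        if tf == 0 or doc_freq == 0:
            continue
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Hypothetical toy corpus with the same text fields as our schema.
docs = [
    {"title": "Role of endothelin-1 in lung disease",
     "abstract": "Endothelin-1 is a peptide implicated in lung disease"},
    {"title": "Surfactant protein-D and pulmonary host defense",
     "abstract": "Surfactant protein-D participates in host defense"},
]


def first_phase(query, doc):
    # Mirrors the rank profile expression: bm25(title) + bm25(abstract).
    terms = tokenize(query)
    return sum(
        bm25_field(terms, tokenize(doc[field]), [tokenize(d[field]) for d in docs])
        for field in ("title", "abstract")
    )


scores = [first_phase("What is the role of endothelin-1", doc) for doc in docs]
print(scores)
```

The first document shares several terms with the query and gets a positive score, while the second shares none and scores zero.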

Deploy your application

We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can locally deploy our app_package using Docker without leaving the notebook, by creating an instance of VespaDocker, as shown below:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()

app = vespa_docker.deploy(application_package = app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

app now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.

It is important to know that pyvespa simply provides a convenient API to define Vespa application packages from python. vespa_docker.deploy exports Vespa configuration files to disk. Going through those files is an excellent way to start learning about Vespa syntax.

Feed some data

Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame containing 100 rows and the cord_uid, title, and abstract columns required by our schema definition.

from pandas import read_csv

parsed_feed = read_csv(
    "https://thigm85.github.io/data/cord19/parsed_feed_100.csv"
)

    cord_uid  title                                            abstract
0   ug7v899j  Clinical features of culture-proven Mycoplasma…  OBJECTIVE: This retrospective chart review des…
1   02tnwd4m  Nitric oxide: a pro-inflammatory mediator in l…  Inflammatory diseases of the respiratory tract…
2   ejv2xln0  Surfactant protein-D and pulmonary host defense  Surfactant protein-D (SP-D) participates in th…
3   2b73a28n  Role of endothelin-1 in lung disease             Endothelin-1 (ET-1) is a 21 amino acid peptide…
4   9785vg6d  Gene expression in epithelial cells in respons…  Respiratory syncytial virus (RSV) and pneumoni…
..  ...       ...                                              ...
95  63bos83o  Global Surveillance of Emerging Influenza Viru…  BACKGROUND: Effective influenza surveillance r…
96  hqc7u9w3  Transmission Parameters of the 2001 Foot and M…  Despite intensive ongoing research, key aspect…
97  87zt7lew  Efficient replication of pneumonia virus of mi…  Pneumonia virus of mice (PVM; family Paramyxov…
98  wgxt36jv  Designing and conducting tabletop exercises to…  BACKGROUND: Since 2001, state and local health…
99  qbldmef1  Transcript-level annotation of Affymetrix prob…  BACKGROUND: The wide use of Affymetrix microar…

100 rows × 3 columns

We can then iterate through the DataFrame above and feed each row by using the app.feed_data_point method:

  • The schema name is by default set to be equal to the application name, which is cord19 in this case.

  • When feeding data to Vespa, we must have a unique id for each data point. We will use cord_uid here.

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )

You can also inspect the response to each request if desired.

{'pathId': '/document/v1/cord19/cord19/docid/qbldmef1',
 'id': 'id:cord19:cord19::qbldmef1'}
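
The response above shows how the schema name and data_id end up in the document id and in the document API path. As a small sketch, the two hypothetical helpers below rebuild both strings; they assume, as in the response shown, that the namespace equals the schema name.

```python
def document_id(namespace, schema, data_id):
    # Document id of the form id:<namespace>:<document type>::<user-specified id>.
    return f"id:{namespace}:{schema}::{data_id}"


def document_path(namespace, schema, data_id):
    # Path of the same document under /document/v1, as seen in the response above.
    return f"/document/v1/{namespace}/{schema}/docid/{data_id}"


doc_id = document_id("cord19", "cord19", "qbldmef1")
path = document_path("cord19", "cord19", "qbldmef1")
print(doc_id)
print(path)
```

Comparing these strings with the 'id' and 'pathId' of each response is one way to check that every row was stored under the id we expected.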

Query your application

With data fed, we can start to query our text search app. We can use the Vespa Query language directly by sending the required parameters to the body argument of the app.query method.

query = {
    'yql': 'select * from sources * where userQuery()',
    'query': 'What is the role of endothelin-1',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 3
}
res = app.query(body=query)
res.hits[0]
{'id': 'id:cord19:cord19::2b73a28n',
 'relevance': 20.79338929607865,
 'source': 'cord19_content',
 'fields': {'sddocname': 'cord19',
  'documentid': 'id:cord19:cord19::2b73a28n',
  'cord_uid': '2b73a28n',
  'title': 'Role of endothelin-1 in lung disease',
  'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}

We can also define the same query by using the QueryModel abstraction that allows us to specify how we want to match and rank our documents. In this case, we defined that we want to:

  • match our documents using the OR operator, which matches all the documents that share at least one term with the query.
  • rank the matched documents using the bm25 rank profile defined in our application package.

from learntorank.query import QueryModel, OR, Ranking, send_query

res = send_query(
    app=app,
    query="What is the role of endothelin-1", 
    query_model = QueryModel(
        match_phase=OR(), 
        ranking=Ranking(name="bm25")
    )
)
res.hits[0]
{
    'id': 'id:cord19:cord19::2b73a28n',
    'relevance': 20.79338929607865,
    'source': 'cord19_content',
    'fields': {
        'sddocname': 'cord19',
        'documentid': 'id:cord19:cord19::2b73a28n',
        'cord_uid': '2b73a28n',
        'title': 'Role of endothelin-1 in lung disease',
        'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'
    }
}
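
The OR match semantics described above (a document matches if it shares at least one term with the query) can be sketched in a few lines of plain python. This is only an illustration: the whitespace tokenizer and the two shortened documents are assumptions, and Vespa's own matching involves proper linguistic processing.

```python
def tokenize(text):
    # Naive lowercase whitespace tokenizer, used only for this illustration.
    return set(text.lower().split())


def or_match(query, documents, fields=("title", "abstract")):
    """Return the ids of documents sharing at least one term with the query."""
    query_terms = tokenize(query)
    matched = []
    for doc_id, doc in documents.items():
        doc_terms = set()
        for field in fields:
            doc_terms |= tokenize(doc[field])
        if query_terms & doc_terms:
            matched.append(doc_id)
    return matched


# Shortened, hypothetical versions of two documents from the sample data.
documents = {
    "2b73a28n": {
        "title": "Role of endothelin-1 in lung disease",
        "abstract": "Endothelin-1 (ET-1) is a 21 amino acid peptide",
    },
    "ejv2xln0": {
        "title": "Surfactant protein-D and pulmonary host defense",
        "abstract": "Surfactant protein-D (SP-D) participates in host defense",
    },
}

matched = or_match("What is the role of endothelin-1", documents)
print(matched)
```

Because a single shared term is enough, OR typically matches many documents, which is why the ranking step matters.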

Using the Vespa Query Language as in our first example gives you the full power and flexibility that Vespa can offer. In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments, but this is a future post topic.

Jump to Build a basic text search application from python with Vespa: Part 2 or clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Run search engine experiments in Vespa from python

Thiago Martins

Vespa Data Scientist


Three ways to get started with pyvespa.

pyvespa provides a python API to Vespa.
The library’s primary goal is to allow for faster prototyping and facilitate Machine Learning experiments for Vespa applications.

UPDATE 2023-02-13: Code examples are updated to work with the latest releases of pyvespa.

There are three ways you can get value out of pyvespa:

  1. You can connect to a running Vespa application.

  2. You can build and deploy a Vespa application using the pyvespa API.

  3. You can deploy an application from Vespa config files stored on disk.

We will review each of those methods.

Photo by Kristin Hillery on Unsplash

Connect to a running Vespa application

In case you already have a Vespa application running somewhere, you can directly instantiate the Vespa class with the appropriate endpoint. The example below connects to the cord19.vespa.ai application:

from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")

We are then good to go and ready to interact with the application through pyvespa:

app.query(body = {
  'yql': 'select title from sources * where userQuery()',
  'hits': 1,
  'summary': 'short',
  'timeout': '1.0s',
  'query': 'coronavirus temperature sensitivity',
  'type': 'all',
  'ranking': 'default'
}).hits
[{'id': 'index:content/1/ad8f0a6204288c0d497399a2',
  'relevance': 0.36920467353113595,
  'source': 'content',
  'fields': {'title': '<hi>Temperature</hi> <hi>Sensitivity</hi>: A Potential Method for the Generation of Vaccines against the Avian <hi>Coronavirus</hi> Infectious Bronchitis Virus'}}]
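
The title returned above contains <hi> tags around the terms that matched the query. If plain text is needed, for example before putting titles in a DataFrame, the tags can be stripped on the python side. The helper below is a minimal sketch and assumes that <hi> and </hi> are the only markup present.

```python
import re


def strip_highlight(text):
    # Remove opening and closing <hi> tags, keeping the text between them.
    return re.sub(r"</?hi>", "", text)


title = (
    "<hi>Temperature</hi> <hi>Sensitivity</hi>: A Potential Method for the "
    "Generation of Vaccines against the Avian <hi>Coronavirus</hi> "
    "Infectious Bronchitis Virus"
)
clean_title = strip_highlight(title)
print(clean_title)
```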

Build and deploy with pyvespa API

You can also build your Vespa application from scratch using the pyvespa API. Here is a simple example:

from vespa.package import ApplicationPackage, Field, RankProfile

app_package = ApplicationPackage(name = "sampleapp")
app_package.schema.add_fields(
    Field(
        name="title", 
        type="string", 
        indexing=["index", "summary"], 
        index="enable-bm25")
)
app_package.schema.add_rank_profile(
    RankProfile(
        name="bm25", 
        inherits="default", 
        first_phase="bm25(title)"
    )
)

We can then deploy app_package to a Docker container (or directly to VespaCloud):

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

app holds an instance of the Vespa class just like our first example, and we can use it to feed and query the application just deployed.

There is also the possibility to explicitly export app_package to Vespa configuration files (without deploying them). This can be useful when we want to fine-tune our application based on Vespa features not available through the pyvespa API:

from pathlib import Path

# Create the target folder before exporting the configuration files
Path("/tmp/sampleapp").mkdir(parents=True, exist_ok=True)
app_package.to_files("/tmp/sampleapp")

Clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Deploy from Vespa config files

The pyvespa API provides a subset of the functionality available in Vespa. The reason is that pyvespa is meant to be used as an experimentation tool for Information Retrieval (IR) and not for building production-ready applications. So, the python API expands based on the needs we have to replicate common use cases that often require IR experimentation.

If your application requires functionality or fine-tuning not available in pyvespa, you can simply build it directly through Vespa configuration files, as shown in many examples in the Vespa docs. But even in this case, you can still get value out of pyvespa by deploying the application from python based on the Vespa configuration files stored on disk. To show that, we can clone and deploy the news search app covered in this Vespa tutorial:

$ git clone https://github.com/vespa-engine/sample-apps.git

The Vespa configuration files of the news search app are stored in the sample-apps/news/app-3-searching/ folder:

$ tree sample-apps/news/app-3-searching/
sample-apps/news/app-3-searching/
├── schemas/
│   └── news.sd
└── services.xml

1 directory, 2 files
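
If the tree command is not available, the same listing can be produced from python before deploying. The sketch below uses a hypothetical helper, list_package_files, and demonstrates it on a throwaway folder that mimics the layout shown above, so it runs without cloning the repository.

```python
import tempfile
from pathlib import Path


def list_package_files(application_root):
    """Return the relative paths of all files below application_root, sorted."""
    root = Path(application_root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Demo on a temporary folder with the same layout as the news search app.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "schemas").mkdir()
    (Path(tmp) / "schemas" / "news.sd").write_text("")
    (Path(tmp) / "services.xml").write_text("")
    files = list_package_files(tmp)

print(files)
```

Calling list_package_files("sample-apps/news/app-3-searching") on the cloned repository should list the schema and services files shown in the tree output.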

We can then deploy to a Docker container from disk:

from vespa.deployment import VespaDocker

vespa_docker_news = VespaDocker()
app = vespa_docker_news.deploy_from_disk(
    application_name="news",
    application_root="sample-apps/news/app-3-searching")
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

Again, app holds an instance of the Vespa class just like our first example,
and we can use it to feed and query the application just deployed.

Clean up:

vespa_docker_news.container.stop()
vespa_docker_news.container.remove()

Final thoughts

We covered three different ways to connect to a Vespa application from python using the pyvespa library. Those methods provide great workflow flexibility. They allow you to quickly get started with pyvespa experimentation while enabling you to modify Vespa config files to include features not available in the pyvespa API without losing the ability to experiment with the added features.

Build a News recommendation app from python with Vespa: Part 1

Part 1 – News search functionality.

We will build a news recommendation app in Vespa without leaving a python environment. In this first part of the series, we want to develop an application with basic search functionality. Future posts will add recommendation capabilities based on embeddings and other ML models.

UPDATE 2023-02-13: Code examples are updated to work with the latest release of pyvespa.

Photo by Filip Mishevski on Unsplash

This series is a simplified version of Vespa’s News search and recommendation tutorial. We will also use the demo version of the Microsoft News Dataset (MIND) so that anyone can follow along on their laptops.

Dataset

The original Vespa news search tutorial provides a script to download, parse and convert the MIND dataset to Vespa format. To make things easier for you, we made the final parsed data required for this tutorial available for download:

import requests, json

data = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_fields_parsed.json").text
)
data[0]
{'abstract': "Shop the notebooks, jackets, and more that the royals can't live without.",
 'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By',
 'subcategory': 'lifestyleroyals',
 'news_id': 'N3112',
 'category': 'lifestyle',
 'url': 'https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata',
 'date': 20191103,
 'clicks': 0,
 'impressions': 0}

The final parsed data used here is a list where each element is a dictionary containing relevant fields about a news article, such as title and category. We also have information about the number of impressions and clicks the article has received. The demo version of the MIND dataset includes 28,603 news articles.
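
Since the data is a plain list of dictionaries, a quick sanity check needs nothing beyond the standard library. The sketch below summarizes article counts per category and total clicks and impressions; the three records are hypothetical stand-ins with the same keys as the parsed data.

```python
from collections import Counter

# Hypothetical records with a subset of the keys found in the parsed data.
sample = [
    {"news_id": "N1", "category": "lifestyle", "clicks": 0, "impressions": 0},
    {"news_id": "N2", "category": "music", "clicks": 1, "impressions": 4},
    {"news_id": "N3", "category": "music", "clicks": 0, "impressions": 3},
]


def summarize(articles):
    """Count articles per category and total clicks and impressions."""
    return {
        "n_articles": len(articles),
        "categories": Counter(article["category"] for article in articles),
        "clicks": sum(article["clicks"] for article in articles),
        "impressions": sum(article["impressions"] for article in articles),
    }


summary = summarize(sample)
print(summary)
```

Running summarize(data) on the downloaded list gives the same overview for the full demo dataset.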

Install pyvespa

pip3 install pyvespa

Create the search app

Create the application package. app_package will hold all the relevant data related to your application’s specification.

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="news")

Add fields to the schema. Here is a short description of the non-obvious arguments used below:

  • indexing argument: configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing.

  • index argument: configures how Vespa should create the search index.

    • “enable-bm25”: sets up an index compatible with bm25 ranking for text search.

  • attribute argument: configures how Vespa should treat an attribute field.

    • “fast-search”: builds an index for an attribute field. By default, no index is generated for attributes, and search over these defaults to a linear scan.

from vespa.package import Field

app_package.schema.add_fields(
    Field(name="news_id", type="string", indexing=["summary", "attribute"], attribute=["fast-search"]),
    Field(name="category", type="string", indexing=["summary", "attribute"]),
    Field(name="subcategory", type="string", indexing=["summary", "attribute"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="url", type="string", indexing=["index", "summary"]),        
    Field(name="date", type="int", indexing=["summary", "attribute"]),            
    Field(name="clicks", type="int", indexing=["summary", "attribute"]),            
    Field(name="impressions", type="int", indexing=["summary", "attribute"]),                
)

Add a fieldset to the schema. A fieldset allows us to search over multiple fields easily. In this case, searching over the default fieldset is equivalent to searching over title and abstract.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "abstract"])
)

We have enough to deploy the first version of our application. Later in this tutorial, we will incorporate an article’s popularity into the relevance score used to rank the news that matches our queries.

Deploy the app on Docker

If you have Docker installed on your machine, you can deploy the app_package in a local Docker container:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(
    application_package=app_package, 
)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

vespa_docker will parse the app_package and write all the necessary Vespa config files to disk. It will then create the Docker container and use the Vespa config files to deploy the Vespa application. We can then use the app instance to interact with the deployed application, such as for feeding and querying. If you want to know more about what happens behind the scenes, we suggest you go through this getting started with Docker tutorial.

Feed data to the app

We can use the feed_data_point method. We need to specify:

  • data_id: unique id to identify the data point

  • fields: dictionary with keys matching the field names defined in our application package schema.

  • schema: name of the schema we want to feed data to. When we created an application package, we created a schema by default with the same name as the application name, news in our case.

This takes 10 minutes or so:

for article in data:
    res = app.feed_data_point(
        data_id=article["news_id"], 
        fields=article, 
        schema="news"
    )
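
Because fields must have keys matching the schema field names, it can help to check each article before feeding it. The sketch below uses a hypothetical helper, check_fields, and hard-codes the field names from the schema defined earlier; it is a convenience check on the python side, not part of pyvespa.

```python
# Field names copied from the schema definition above.
SCHEMA_FIELDS = {
    "news_id", "category", "subcategory", "title", "abstract",
    "url", "date", "clicks", "impressions",
}


def check_fields(article, schema_fields=SCHEMA_FIELDS):
    """Return (missing, unexpected) field names for one article."""
    keys = set(article)
    return schema_fields - keys, keys - schema_fields


# The first article of the parsed data, as shown earlier.
article = {
    "abstract": "Shop the notebooks, jackets, and more that the royals can't live without.",
    "title": "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
    "subcategory": "lifestyleroyals",
    "news_id": "N3112",
    "category": "lifestyle",
    "url": "https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata",
    "date": 20191103,
    "clicks": 0,
    "impressions": 0,
}

missing, unexpected = check_fields(article)
print(missing, unexpected)
```

Both sets are empty for a well-formed article; any name in unexpected points to a key that the schema does not define.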

Query the app

We can use the Vespa Query API through app.query to unlock the full query flexibility Vespa can offer.

Search over indexed fields using keywords

Select all the fields from documents where default (title or abstract) contains the keyword ‘music’.

res = app.query(body={"yql" : "select * from sources * where default contains 'music'"})
res.hits[0]
{
    'id': 'id:news:news::N14152',
    'relevance': 0.25641557752127125,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N14152',
        'news_id': 'N14152',
        'category': 'music',
        'subcategory': 'musicnews',
        'title': 'Music is hot in Nashville this week',
        'abstract': 'Looking for fun, entertaining music events to check out in Nashville this week? Here are top picks with dates, times, locations and ticket links.',
        'url': 'https://www.msn.com/en-us/music/musicnews/music-is-hot-in-nashville-this-week/ar-BBWImOh?ocid=chopendata',
        'date': 20191101,
        'clicks': 0,
        'impressions': 3
    }
}

Select title and abstract where title contains ‘music’ and default contains ‘festival’.

res = app.query(body = {"yql" : "select title, abstract from sources * where title contains 'music' AND default contains 'festival'"})
res.hits[0]
{
    'id': 'index:news_content/0/988f76793a855e48b16dc5d3',
    'relevance': 0.19587240022210403,
    'source': 'news_content',
    'fields': {
        'title': "At Least 3 Injured In Stampede At Travis Scott's Astroworld Music Festival",
        'abstract': "A stampede Saturday outside rapper Travis Scott's Astroworld musical festival in Houston, left three people injured. Minutes before the gates were scheduled to open at noon, fans began climbing over metal barricades and surged toward the entrance, according to local news reports."
    }
}

Search by document type

Select the title of all the documents with document type equal to news. Our application has only one document type, so the query below retrieves all our documents.

res = app.query(body = {"yql" : "select title from sources * where sddocname contains 'news'"})
res.hits[0]
{
    'id': 'index:news_content/0/698f73a87a936f1c773f2161',
    'relevance': 0.0,
    'source': 'news_content',
    'fields': {
        'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'
    }
}

Search over attribute fields such as date

Since date is not specified with attribute=["fast-search"], there is no index built for it. Therefore, a search over it is equivalent to a linear scan over the values of the field.

res = app.query(body={"yql" : "select title, date from sources * where date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/debbdfe653c6d11f71cc2353',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'title': 'These Cranberry Sauce Recipes Are Perfect for Thanksgiving Dinner',
        'date': 20191110
    }
}
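
The difference between a linear scan and an indexed lookup can be illustrated with plain python data structures. This is a conceptual sketch only: the document ids and dates are hypothetical, and the dictionary below is a rough analogue of an attribute index, not the data structure Vespa builds for fast-search.

```python
from collections import defaultdict

# Hypothetical attribute values: document id -> date.
dates = {"N1": 20191101, "N2": 20191110, "N3": 20191105, "N4": 20191110}


def linear_scan(values, target):
    # Without an index, every value is inspected on every query.
    return [doc_id for doc_id, value in values.items() if value == target]


def build_index(values):
    # Built once up front; lookups then avoid touching every document.
    index = defaultdict(list)
    for doc_id, value in values.items():
        index[value].append(doc_id)
    return index


index = build_index(dates)
scan_hits = linear_scan(dates, 20191110)
index_hits = index.get(20191110, [])
print(scan_hits, index_hits)
```

Both approaches return the same documents; the index trades extra memory and build time for faster lookups.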

Since the default fieldset is formed by indexed fields, Vespa will first filter by all the documents that contain the keyword ‘weather’ within title or abstract, before scanning the date field for ‘20191110’.

res = app.query(body={"yql" : "select title, abstract, date from sources * where default contains 'weather' AND date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/bb88325ae94d888c46538d0b',
    'relevance': 0.27025156546141466,
    'source': 'news_content',
    'fields': {
        'title': 'Weather forecast in St. Louis',
        'abstract': "What's the weather today? What's the weather for the week? Here's your forecast.",
        'date': 20191110
    }
}

We can also perform range searches:

res = app.query(body={"yql" : "select date from sources * where date <= 20191110 AND date >= 20191108"})
res.hits[0]
{
    'id': 'index:news_content/0/c41a873213fdcffbb74987c0',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'date': 20191109
    }
}
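
When experimenting with many ranges, it can be convenient to build the YQL string programmatically. The helper below, range_yql, is hypothetical and only does string formatting; it reproduces the query used above.

```python
def range_yql(field, low, high):
    # int() guards against interpolating arbitrary strings into the query.
    return (
        f"select {field} from sources * "
        f"where {field} <= {int(high)} AND {field} >= {int(low)}"
    )


yql = range_yql("date", 20191108, 20191110)
print(yql)
```

The resulting string can be passed as the "yql" entry of the body sent to app.query.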

Sorting

By default, Vespa sorts the hits by descending relevance score. The relevance score is given by the nativeRank unless something else is specified, as we will do later in this post.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music'"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/5f1b30d14d4a15050dae9f7f',
        'relevance': 0.25641557752127125,
        'source': 'news_content',
        'fields': {
            'title': 'Music is hot in Nashville this week',
            'date': 20191101
        }
    },
    {
        'id': 'index:news_content/0/6a031d5eff95264c54daf56d',
        'relevance': 0.23351089409559303,
        'source': 'news_content',
        'fields': {
            'title': 'Apple Music Replay highlights your favorite tunes of the year',
            'date': 20191105
        }
    }
]

However, we can explicitly order by a given field with the order by keyword.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/d0d7e1c080f0faf5989046d8',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': "Elton John's second farewell tour stop in Cleveland shows why he's still standing after all these years",
            'date': 20191031
        }
    },
    {
        'id': 'index:news_content/0/abf7f6f46ff2a96862075155',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'The best hair metal bands',
            'date': 20191101
        }
    }
]

order by sorts in ascending order by default; we can override that with the desc keyword:

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date desc"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/934a8d976ff8694772009362',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Korg Minilogue XD update adds key triggers for synth sequences',
            'date': 20191113
        }
    },
    {
        'id': 'index:news_content/0/4feca287fdfa1d027f61e7bf',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Tom Draper, Black Music Industry Pioneer, Dies at 79',
            'date': 20191113
        }
    }
]
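
The three orderings above (by relevance, by date ascending, by date descending) correspond to the following python-side sorts. The hits below are hypothetical and only carry the keys needed for the illustration; in practice it is preferable to let Vespa sort, since it sees all matched documents and not only the returned hits.

```python
# Hypothetical hits with just the keys needed for sorting.
hits = [
    {"title": "A", "relevance": 0.25, "date": 20191105},
    {"title": "B", "relevance": 0.23, "date": 20191101},
    {"title": "C", "relevance": 0.10, "date": 20191113},
]

# Default behaviour: descending relevance.
by_relevance = sorted(hits, key=lambda hit: hit["relevance"], reverse=True)
# "order by date": ascending date.
by_date = sorted(hits, key=lambda hit: hit["date"])
# "order by date desc": descending date.
by_date_desc = sorted(hits, key=lambda hit: hit["date"], reverse=True)

print([hit["title"] for hit in by_relevance])
print([hit["title"] for hit in by_date])
print([hit["title"] for hit in by_date_desc])
```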

Grouping

We can use Vespa’s grouping feature to find the three news categories with the highest document counts:

res = app.query(body={"yql" : "select * from sources * where sddocname contains 'news' limit 0 | all(group(category) max(3) order(-count())each(output(count())))"})
res.hits[0]
{
    'id': 'group:root:0',
    'relevance': 1.0,
    'continuation': {
        'this': ''
    },
    'children': [
        {
            'id': 'grouplist:category',
            'relevance': 1.0,
            'label': 'category',
            'continuation': {
                'next': 'BGAAABEBGBC'
            },
            'children': [
                {
                    'id': 'group:string:news',
                    'relevance': 1.0,
                    'value': 'news',
                    'fields': {
                        'count()': 9115
                    }
                },
                {
                    'id': 'group:string:sports',
                    'relevance': 0.6666666666666666,
                    'value': 'sports',
                    'fields': {
                        'count()': 6765
                    }
                },
                {
                    'id': 'group:string:finance',
                    'relevance': 0.3333333333333333,
                    'value': 'finance',
                    'fields': {
                        'count()': 1886
                    }
                }
            ]
        }
    ]
}
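
The nested grouping result is easier to work with once flattened into (category, count) pairs. The sketch below uses a hypothetical helper, group_counts, on a trimmed copy of the response shown above; it assumes the same nesting of 'children' lists and a 'count()' output per group.

```python
def group_counts(root_hit):
    """Flatten a grouping result into (group value, count) pairs."""
    pairs = []
    for group_list in root_hit.get("children", []):
        for group in group_list.get("children", []):
            pairs.append((group["value"], group["fields"]["count()"]))
    return pairs


# Trimmed copy of the grouping response shown above.
root_hit = {
    "id": "group:root:0",
    "children": [
        {
            "id": "grouplist:category",
            "label": "category",
            "children": [
                {"value": "news", "fields": {"count()": 9115}},
                {"value": "sports", "fields": {"count()": 6765}},
                {"value": "finance", "fields": {"count()": 1886}},
            ],
        }
    ],
}

top_categories = group_counts(root_hit)
print(top_categories)
```

With a live application, the same helper can be applied to res.hits[0].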

Use news popularity signal for ranking

Vespa uses nativeRank to compute relevance scores by default. We will create a new rank-profile that includes a popularity signal in our relevance score computation.

from vespa.package import RankProfile, Function

app_package.schema.add_rank_profile(
    RankProfile(
        name="popularity",
        inherits="default",
        functions=[
            Function(
                name="popularity", 
                expression="if (attribute(impressions) > 0, attribute(clicks) / attribute(impressions), 0)"
            )
        ], 
        first_phase="nativeRank(title, abstract) + 10 * popularity"
    )
)
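
To see how the popularity term changes the score, the rank profile expression can be evaluated by hand in python. This is a sketch that mirrors the expressions above; the nativeRank value of 0.25 is a hypothetical number chosen for illustration.

```python
def popularity(clicks, impressions):
    # Click-through rate, or 0 when the article has no impressions,
    # mirroring the if(...) expression in the rank profile function.
    return clicks / impressions if impressions > 0 else 0.0


def first_phase(native_rank, clicks, impressions):
    # Mirrors: nativeRank(title, abstract) + 10 * popularity
    return native_rank + 10 * popularity(clicks, impressions)


score_no_impressions = first_phase(0.25, clicks=0, impressions=0)
score_popular = first_phase(0.25, clicks=1, impressions=4)
print(score_no_impressions, score_popular)
```

With the same text match, an article clicked once in four impressions scores 2.75 instead of 0.25, so the factor of 10 lets popularity dominate the text score in this example.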

Our new rank-profile will be called

Build a News recommendation app from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Part 2 – From news search to news recommendation with embeddings.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of pyvespa.

In this part, we’ll start transforming our application from news search to news recommendation using the embeddings created in this tutorial. An embedding vector will represent each user and news article. We will make the embeddings used available for download to make it easier to follow along with this post. When a user arrives, we retrieve their embedding and use it to retrieve the closest news articles via an approximate nearest neighbor (ANN) search. We also show that Vespa can jointly apply general filtering and ANN search, unlike competing alternatives available in the market.

Photo by Matt Popovich on Unsplash

We assume that you have followed the news search tutorial. Therefore, you should have an app_package variable holding the news search app definition and a Docker container named news running a search application fed with news articles from the demo version of the MIND dataset.

Add a user schema

We need to add another document type to represent a user. We set up the schema to search for a user_id and retrieve the user’s embedding vector.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="user", 
        document=Document(
            fields=[
                Field(
                    name="user_id", 
                    type="string", 
                    indexing=["summary", "attribute"], 
                    attribute=["fast-search"]
                ), 
                Field(
                    name="embedding", 
                    type="tensor<float>(d0[51])", 
                    indexing=["summary", "attribute"]
                )
            ]
        )
    )
)

We build an index for the attribute field user_id by specifying the fast-search attribute. Remember that attribute fields are held in memory and are not indexed by default.

The embedding field is a tensor field. Tensors in Vespa are flexible multi-dimensional data structures and, as first-class citizens, can be used in queries, document fields, and constants in ranking. Tensors can be either dense or sparse or both and can contain any number of dimensions. Please see the tensor user guide for more information. Here we have defined a dense tensor with a single dimension (d0 – dimension 0), representing a vector. 51 is the size of the embeddings used in this post.

We now have one schema for the news and one schema for the user.

[schema.name for schema in app_package.schemas]

Index news embeddings

Similarly to the user schema, we will use a dense tensor to represent the news embeddings. But unlike the user embedding field, we will index the news embedding by including index in the indexing argument and specify that we want to build the index using the HNSW (hierarchical navigable small world) algorithm. The distance metric used is euclidean. Read this blog post to know more about Vespa’s journey to implement ANN search.

from vespa.package import Field, HNSW

app_package.get_schema(name="news").add_fields(
    Field(
        name="embedding", 
        type="tensor<float>(d0[51])", 
        indexing=["attribute", "index"],
        ann=HNSW(distance_metric="euclidean")
    )
)

Recommendation using embeddings

Here, we add a rank-profile whose ranking expression uses the closeness ranking feature, which is computed from the euclidean distance between the query tensor and the document embedding and is used to rank the news articles. This rank-profile depends on the nearestNeighbor search operator, which we’ll get back to below when searching. For now, note that it expects a tensor in the query to use as the initial search point.

from vespa.package import RankProfile

app_package.get_schema(name="news").add_rank_profile(
    RankProfile(
        name="recommendation", 
        inherits="default", 
        first_phase="closeness(field, embedding)"
    )
)

Query Profile Type

The recommendation rank profile above requires that we send a tensor along with the query. For Vespa to bind the correct types, it needs to know the expected type of this query parameter.

from vespa.package import QueryTypeField

app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(user_embedding)",
        type="tensor<float>(d0[51])"
    )
)

This query profile type instructs Vespa to expect a float tensor with dimension d0[51] when the query parameter ranking.features.query(user_embedding) is passed. We’ll see how this works together with the nearestNeighbor search operator below.

Redeploy the application

We made all the required changes to turn our news search app into a news recommendation app. We can now redeploy the app_package to our running container named news.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Finished deployment.
["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session",
 "Session 7 for tenant 'default' created.",
 'Preparing session 7 using http://localhost:19071/application/v2/tenant/default/session/7/prepared',
 "WARNING: Host named 'news' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.",
 "Session 7 for tenant 'default' prepared.",
 'Activating session 7 using http://localhost:19071/application/v2/tenant/default/session/7/active',
 "Session 7 for tenant 'default' activated.",
 'Checksum:   62d964000c4ff4a5280b342cd8d95c80',
 'Timestamp:  1616671116728',
 'Generation: 7',
 '']

Feeding and partial updates: news and user embeddings

To keep this tutorial easy to follow, we make the parsed embeddings available for download. To build them yourself, please follow this tutorial.

import requests, json

user_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_user_embeddings_parsed.json").text
)
news_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_news_embeddings_parsed.json").text
)

We just created the user schema, so we need to feed user data for the first time.

for user_embedding in user_embeddings:
    response = app.feed_data_point(
        schema="user", 
        data_id=user_embedding["user_id"], 
        fields=user_embedding
    )

For the news documents, we just need to update the embedding field added to the news schema.
This takes ten minutes or so:

for news_embedding in news_embeddings:
    response = app.update_data(
        schema="news", 
        data_id=news_embedding["news_id"], 
        fields={"embedding": news_embedding["embedding"]}
    )

Fetch the user embedding

Next, we create a query_user_embedding function to retrieve the user embedding by the user_id. Of course, you could do this more efficiently using a Vespa Searcher as described here, but keeping everything in python at this point makes learning easier.

def parse_embedding(hit_json):
    # Copy the tensor values from the hit into a plain Python list of floats.
    return list(hit_json["fields"]["embedding"]["values"])

def query_user_embedding(user_id):
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding

The function will query Vespa, retrieve the embedding and parse it into a list of floats. Here are the first five elements of the user U63195’s embedding.

query_user_embedding(user_id="U63195")[:5]
[
    0.0,
    -0.1694680005311966,
    -0.0703359991312027,
    -0.03539799898862839,
    0.14579899609088898
]
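
The lookup above assumes that the user exists, so `result.hits[0]` raises an opaque `IndexError` for an unknown `user_id`. A minimal sketch of a more defensive variant follows. It works on a plain list of hits with the same layout that `parse_embedding` expects; the helper name `first_hit_embedding` and the sample hit are hypothetical.

```python
def first_hit_embedding(hits: list) -> list:
    """Return the embedding values of the first hit, or raise a clear error
    when the query matched no user document."""
    if not hits:
        raise LookupError("no user document matched the query")
    return list(hits[0]["fields"]["embedding"]["values"])


# Hand-written hit mimicking the structure used by parse_embedding.
example_hits = [{"fields": {"embedding": {"values": [0.0, -0.169, -0.070]}}}]
embedding = first_hit_embedding(example_hits)
```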

Get recommendations

The following yql instructs Vespa to select the title and the category from the ten news documents closest to the user embedding.

yql = "select title, category from sources news where ({targetHits:10}nearestNeighbor(embedding, user_embedding))" 

We also specify that we want to rank those documents by the recommendation rank-profile that we defined earlier and send the user embedding via the query profile type ranking.features.query(user_embedding) that we also defined in our app_package.

result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)
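
Since the same query body is reused with different YQL strings and rank-profiles, it can help to build it in one place. The sketch below only assembles the dictionary and does not contact Vespa; the helper name `recommendation_body` is an assumption for illustration, while the keys are the ones used in the query above.

```python
def recommendation_body(user_embedding, yql, hits=10, rank_profile="recommendation"):
    """Assemble the query body used in this post for an embedding-based query."""
    return {
        "yql": yql,
        "hits": hits,
        # The embedding is sent as the string form of a Python list,
        # exactly as str(query_user_embedding(...)) does above.
        "ranking.features.query(user_embedding)": str(list(user_embedding)),
        "ranking.profile": rank_profile,
    }


body = recommendation_body(
    user_embedding=[0.0, -0.5],
    yql="select title, category from sources news where "
        "({targetHits:10}nearestNeighbor(embedding, user_embedding))",
)
```

The resulting dictionary can then be passed as `app.query(body=body)`.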

Here are the first two hits out of the ten returned.

[
    {
        'id': 'index:news_content/0/aca03f4ba2274dd95b58db9a',
        'relevance': 0.1460561756063909,
        'source': 'news_content',
        'fields': {
            'category': 'music',
            'title': 'Broadway Star Laurel Griggs Suffered Asthma Attack Before She Died at Age 13'
        }
    },
    {
        'id': 'index:news_content/0/bd02238644c604f3a2d53364',
        'relevance': 0.14591827245062294,
        'source': 'news_content',
        'fields': {
            'category': 'tv',
            'title': "Rip Taylor's Cause of Death Revealed, Memorial Service Scheduled for Later This Month"
        }
    }
]

Combine ANN search with query filters

Vespa ANN search is fully integrated into the Vespa query tree. This integration means that we can include query filters and the ANN search will be applied only to documents that satisfy the filters. No need to do pre- or post-processing involving filters.

The following YQL searches only over news documents that have sports as their category.

yql = "select title, category from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding)) AND " \
      "category contains 'sports'"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)

Here are the first two hits out of the ten returned. Notice the category field.

[
    {
        'id': 'index:news_content/0/375ea340c21b3138fae1a05c',
        'relevance': 0.14417346200569972,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': 'Charles Rogers, former Michigan State football, Detroit Lions star, dead at 38'
        }
    },
    {
        'id': 'index:news_content/0/2b892989020ddf7796dae435',
        'relevance': 0.14404365847394848,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': "'Monday Night Football' commentator under fire after belittling criticism of 49ers kicker for missed field goal"
        }
    }
]
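
To illustrate what the combined query asks for, here is a small brute-force sketch in pure Python: keep only the documents that pass the filter, then return the closest ones by euclidean distance. This is an exact search over a toy list, not Vespa's HNSW-based approximate search, and the function and sample documents are made up for illustration.

```python
import math


def filtered_nearest(docs, query_vector, category, target_hits=10):
    """Exact nearest-neighbor search restricted to one category."""
    # Apply the filter first, so only matching documents compete for the top hits.
    candidates = [d for d in docs if d["category"] == category]
    candidates.sort(key=lambda d: math.dist(d["embedding"], query_vector))
    return candidates[:target_hits]


docs = [
    {"title": "a", "category": "sports", "embedding": [0.0, 1.0]},
    {"title": "b", "category": "music", "embedding": [0.1, 0.0]},
    {"title": "c", "category": "sports", "embedding": [0.2, 0.1]},
]
top = filtered_nearest(docs, query_vector=[0.0, 0.0], category="sports", target_hits=1)
```

Document "b" is the closest overall, but it is excluded by the filter, so the closest sports document "c" is returned instead.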

Next steps

Continue to part 3, or see the conclusion for how to clean up the Docker container instances if you are done with this.

Build sentence/paragraph level QA application from python with Vespa

Thiago Martins


Vespa Data Scientist


Retrieve paragraph and sentence level information with sparse and dense ranking features.

UPDATE 2023-02-14: Code examples are updated to work with the latest release of
pyvespa.

We will walk through the steps necessary to create a question answering (QA) application that can retrieve sentence or paragraph level answers based on a combination of semantic and/or term-based search. We start by discussing the dataset used and the question and sentence embeddings generated for semantic search. We then include the steps necessary to create and deploy a Vespa application to serve the answers. We make all the required data available to feed the application and show how to query for sentence and paragraph level answers based on a combination of semantic and term-based search.


Photo by Brett Jordan on Unsplash

This tutorial is based on earlier work by the Vespa team to reproduce the results of the paper ReQA: An Evaluation for End-to-End Answer Retrieval Models by Ahmad et al. using the Stanford Question Answering Dataset (SQuAD) v1.1.

About the data

We are going to use the Stanford Question Answering Dataset (SQuAD) v1.1 dataset. The data contains paragraphs (denoted here as context), and each paragraph has questions that have answers in the associated paragraph. We have parsed the dataset and organized the data that we will use in this tutorial to make it easier to follow along.

Paragraph

import requests, json

context_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_context_data.json").text
)

Each context data point contains a context_id that uniquely identifies a paragraph, a text field holding the paragraph string, and a questions field holding a list of question ids that can be answered from the paragraph text. We also include a dataset field to identify the data source if we want to index more than one dataset in our application.

{
    'text': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
     'dataset': 'squad',
     'questions': [0, 1, 2, 3, 4],
     'context_id': 0
}

Questions

According to the data point above, context_id = 0 can be used to answer the questions with id = [0, 1, 2, 3, 4]. We can load the file containing the questions and display those first five questions.

from pandas import read_csv

# Note that squad_queries.txt is approx. 1 GB due to the 512-sized question embeddings
questions = read_csv(
    filepath_or_buffer="https://data.vespa.oath.cloud/blog/qa/squad_queries.txt", 
    sep="\t", 
    names=["question_id", "question", "number_answers", "embedding"]
)
questions[["question_id", "question"]].head()
question_id  question
0            To whom did the Virgin Mary allegedly appear i…
1            What is in front of the Notre Dame Main Building?
2            The Basilica of the Sacred heart at Notre Dame…
3            What is the Grotto at Notre Dame?
4            What sits on top of the Main Building at Notre…

Paragraph sentences

To build a more accurate application, we can break the paragraphs down into sentences. For example, the first sentence below comes from the paragraph with context_id = 0 and can answer the question with question_id = 4.

# Note that qa_squad_sentence_data.json is approx. 1 GB due to the 512-sized sentence embeddings
sentence_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_sentence_data.json").text
)
{k:sentence_data[0][k] for k in ["text", "dataset", "questions", "context_id"]}
{
    'text': "Atop the Main Building's gold dome is a golden statue of the Virgin Mary.",
    'dataset': 'squad',
    'questions': [4],
    'context_id': 0
}
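
The post does not describe how the paragraphs were split into sentences, so the following is only a rough sketch of the idea under the assumption of a naive regex split on sentence-ending punctuation. The helper name `split_sentences` is hypothetical, and the `questions` field is left out because mapping questions to sentences requires answer positions that are not shown here.

```python
import re


def split_sentences(context: dict) -> list:
    """Split a context paragraph into sentence records that keep the
    paragraph's dataset and context_id."""
    # Naive rule: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", context["text"].strip())
    return [
        {"text": part, "dataset": context["dataset"], "context_id": context["context_id"]}
        for part in parts
        if part
    ]


context = {
    "text": "Architecturally, the school has a Catholic character. "
            "Atop the Main Building's gold dome is a golden statue of the Virgin Mary.",
    "dataset": "squad",
    "context_id": 0,
}
sentences = split_sentences(context)
```

A real pipeline would likely use a proper sentence tokenizer, since this rule breaks on abbreviations such as "St. Louis".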

Embeddings

We want to combine semantic (dense) and term-based (sparse) signals to answer the questions sent to our application. We have generated embeddings for both the questions and the sentences to implement the semantic search, each having size equal to 512.

questions[["question_id", "embedding"]].head(1)
question_id  embedding
0            [-0.025649750605225563, -0.01708591915667057, …
sentence_data[0]["sentence_embedding"]["values"][0:5] # display the first five elements
[
    -0.005731593817472458,
    0.007575507741421461,
    -0.06413306295871735,
    -0.007967847399413586,
    -0.06464996933937073
]

Here is the script containing the code that we used to generate the sentence and questions embeddings. We used Google’s Universal Sentence Encoder at the time but feel free to replace it with embeddings generated by your preferred model.

Create and deploy the application

We can now build a sentence-level Question answering application based on the data described above.

Schema to hold context information

The context schema will have a document containing the four relevant fields described in the data section. We create an index for the text field and use enable-bm25 to pre-compute data required to speed up the use of BM25 for ranking. The summary indexing indicates that all the fields will be included in the returned context documents. The attribute indexing stores the fields in memory as attributes for sorting, querying, and grouping.

from vespa.package import Document, Field

context_document = Document(
    fields=[
        Field(name="questions", type="array<int>", indexing=["summary", "attribute"]),
        Field(name="dataset", type="string", indexing=["summary", "attribute"]),
        Field(name="context_id", type="int", indexing=["summary", "attribute"]),        
        Field(name="text", type="string", indexing=["summary", "index"], index="enable-bm25"),                
    ]
)

The default fieldset means query tokens will be matched against the text field by default. We defined two rank-profiles (bm25 and nativeRank) to illustrate that we can define and experiment with as many rank-profiles as we want. You can create different ones using the ranking expressions and features available.

from vespa.package import Schema, FieldSet, RankProfile

context_schema = Schema(
    name="context",
    document=context_document, 
    fieldsets=[FieldSet(name="default", fields=["text"])], 
    rank_profiles=[
        RankProfile(name="bm25", inherits="default", first_phase="bm25(text)"), 
        RankProfile(name="nativeRank", inherits="default", first_phase="nativeRank(text)")]
)

Schema to hold sentence information

The document of the sentence schema will inherit the fields defined in the context document to avoid unnecessary duplication of the same field types. In addition, we add the sentence_embedding field, defined to hold a one-dimensional tensor of floats of size 512. We will store the field as an attribute in memory and build an ANN index using the HNSW (hierarchical navigable small world) algorithm. Read this blog post to know more about Vespa’s journey to implement ANN search and the documentation for more information about the HNSW parameters.

from vespa.package import HNSW

sentence_document = Document(
    inherits="context", 
    fields=[
        Field(
            name="sentence_embedding", 
            type="tensor<float>(x[512])", 
            indexing=["attribute", "index"], 
            ann=HNSW(
                distance_metric="euclidean", 
                max_links_per_node=16, 
                neighbors_to_explore_at_insert=500
            )
        )
    ]
)

For the sentence schema, we define three rank profiles. The semantic-similarity uses the Vespa closeness ranking feature, which is defined as 1/(1 + distance) so that sentences with embeddings closer to the question embedding will be ranked higher than sentences that are far apart. The bm25 is an example of a term-based rank profile, and bm25-semantic-similarity combines both term-based and semantic-based signals as an example of a hybrid approach.

sentence_schema = Schema(
    name="sentence", 
    document=sentence_document, 
    fieldsets=[FieldSet(name="default", fields=["text"])], 
    rank_profiles=[
        RankProfile(
            name="semantic-similarity", 
            inherits="default", 
            first_phase="closeness(sentence_embedding)"
        ),
        RankProfile(
            name="bm25", 
            inherits="default", 
            first_phase="bm25(text)"
        ),
        RankProfile(
            name="bm25-semantic-similarity", 
            inherits="default", 
            first_phase="bm25(text) + closeness(sentence_embedding)"
        )
    ]
)
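
As a sanity check on the ranking expressions, the sketch below reproduces them in pure Python, using the definition of closeness stated above, 1/(1 + distance), with euclidean distance. The `bm25_score` argument is a placeholder for the value Vespa would compute for `bm25(text)`; the function names are illustrative.

```python
import math


def closeness(doc_embedding, query_embedding):
    """closeness = 1 / (1 + euclidean distance), so it lies in (0, 1]."""
    distance = math.dist(doc_embedding, query_embedding)
    return 1.0 / (1.0 + distance)


def hybrid_score(bm25_score, doc_embedding, query_embedding):
    """Mirror of the bm25-semantic-similarity first-phase expression."""
    return bm25_score + closeness(doc_embedding, query_embedding)


same = closeness([0.0, 0.0], [0.0, 0.0])  # distance 0
far = closeness([0.0, 0.0], [3.0, 4.0])   # distance 5
```

BM25 is unbounded while closeness never exceeds 1, so in the unweighted sum the term-based signal can dominate; weighting the two terms is a natural next experiment.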

Build the application package

We can now define our qa application by creating an application package with both the context_schema and the sentence_schema that we defined above. In addition, we need to inform Vespa that we plan to send a query ranking feature named query_embedding with the same type that we used to define the sentence_embedding field.

from vespa.package import ApplicationPackage, QueryProfile, QueryProfileType, QueryTypeField

app_package = ApplicationPackage(
    name="qa", 
    schema=[context_schema, sentence_schema], 
    query_profile=QueryProfile(),
    query_profile_type=QueryProfileType(
        fields=[
            QueryTypeField(
                name="ranking.features.query(query_embedding)", 
                type="tensor<float>(x[512])"
            )
        ]
    )
)

Deploy the application

We can deploy the app_package in a Docker container (or to Vespa Cloud):

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

Feed the data

Once deployed, we can use the Vespa instance app to interact with the application. We can start by feeding context and sentence data.

Feeding the sentence data takes about 20 minutes:

for idx, sentence in enumerate(sentence_data):
    result = app.feed_data_point(schema="sentence", data_id=idx, fields=sentence)

Feeding the context data takes about 5 minutes:

for context in context_data:
    result = app.feed_data_point(schema="context", data_id=context["context_id"], fields=context)

Sentence level retrieval

The query below sends the first question embedding (questions.loc[0, "embedding"]) through the ranking.features.query(query_embedding) parameter and uses the nearestNeighbor search operator to retrieve the 100 closest sentences in embedding space, using Euclidean distance as configured in the HNSW settings. The sentences returned will be ranked by the semantic-similarity rank profile defined in the sentence schema.

result = app.query(body={
  'yql': 'select * from sources sentence where ({targetNumHits:100}nearestNeighbor(sentence_embedding,query_embedding))',
  'hits': 100,
  'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
  'ranking.profile': 'semantic-similarity' 
})
{
    'id': 'id:sentence:sentence::2',
    'relevance': 0.5540203635649571,
    'source': 'qa_content',
    'fields': {
        'sddocname': 'sentence',
        'documentid': 'id:sentence:sentence::2',
        'questions': [0],
        'dataset': 'squad',
        'context_id': 0,
        'text': 'It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.'
    }
}
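
Because the semantic-similarity profile ranks by closeness, defined earlier as 1/(1 + distance), the relevance of a hit can be inverted to recover the euclidean distance between the question and the sentence embeddings. A minimal sketch, with a hypothetical helper name:

```python
def distance_from_closeness(relevance: float) -> float:
    """Invert closeness = 1 / (1 + distance) to recover the distance."""
    if not 0.0 < relevance <= 1.0:
        raise ValueError("closeness must lie in (0, 1]")
    return 1.0 / relevance - 1.0


# Relevance of the hit shown above.
distance = distance_from_closeness(0.5540203635649571)
```

For this hit the distance comes out at roughly 0.80.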

Sentence level hybrid retrieval

In addition to sending the query embedding, we can send the question string (questions.loc[0, "question"]) via the query parameter and use the or operator to retrieve documents that satisfy either the semantic operator nearestNeighbor or the term-based operator userQuery. Setting type to any means that the term-based operator will retrieve all the documents that match at least one query token. The retrieved documents will be ranked by the hybrid rank-profile bm25-semantic-similarity.

result = app.query(body={
  'yql': 'select * from sources sentence  where ({targetNumHits:100}nearestNeighbor(sentence_embedding,query_embedding)) or userQuery()',
  'query': questions.loc[0, "question"],
  'type': 'any',
  'hits': 100,
  'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
  'ranking.profile': 'bm25-semantic-similarity'
})
{
    'id': 'id:sentence:sentence::2',
    'relevance': 44.46252359752296,
    'source': 'qa_content',
    'fields': {
        'sddocname': 'sentence',
        'documentid': 'id:sentence:sentence::2',
        'questions': [0],
        'dataset': 'squad',
        'context_id': 0,
        'text': 'It is a replica of the grotto at Lourdes, France where the Virgin 

Build a News recommendation app from python with Vespa: Part 3

Thiago Martins


Vespa Data Scientist


Part 3 – Efficient use of click-through rate via parent-child relationship.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of
pyvespa.

This part of the series introduces a new ranking signal: category click-through rate (CTR). The idea is that we can recommend popular content for users that don’t have a click history yet. Rather than just recommending based on articles, we recommend based on categories. However, these global CTR values can often change continuously, so we need an efficient way to update this value for all documents. We’ll do that by introducing parent-child relationships between documents in Vespa. We will also use sparse tensors directly in ranking. This post replicates this more detailed Vespa tutorial.


Photo by AbsolutVision on Unsplash

We assume that you have followed part 2 of the news recommendation tutorial. Therefore, you should have an app_package variable holding the news app definition and a Docker container named news running the application fed with data from the demo version of the MIND dataset.

Setting up a global category CTR document

If we added a category_ctr field to the news document, we would have to update all the sports documents every time the sports CTR statistic changes. If we assume that the category CTR will change often, this turns out to be inefficient.

For these cases, Vespa introduced the parent-child relationship. Parents are global documents, which are automatically distributed to all content nodes. Other documents can reference these parents and “import” values for use in ranking. The benefit is that the global category CTR values only need to be written to one place: the global document.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="category_ctr",
        global_document=True,
        document=Document(
            fields=[
                Field(
                    name="ctrs", 
                    type="tensor<float>(category{})", 
                    indexing=["attribute"], 
                    attribute=["fast-search"]
                ), 
            ]
        )
    )
)

We implement that by creating a new category_ctr schema and setting global_document=True to indicate that we want Vespa to keep a copy of these documents on all content nodes. Setting a document to be global is required for using it in a parent-child relationship. Note that we use a tensor with a single sparse dimension to hold the ctrs data.

Sparse tensors have strings as dimension addresses rather than a numeric index. More concretely, an example of such a tensor is (using the tensor literal form):

{
    {category: entertainment}: 0.2,
    {category: news}: 0.3,
    {category: sports}: 0.5,
    {category: travel}: 0.4,
    {category: finance}: 0.1,
    ...
}

This tensor holds all the CTR scores for all the categories. When updating this tensor, we can update individual cells, and we don’t need to update the whole tensor. This operation is called tensor modify and can be helpful when you have large tensors.
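
The idea of updating individual cells can be pictured with a plain Python dictionary that maps each category label to its value. This sketch only models the semantics; it is not Vespa's document update API, and the helper name `modify_cells` is made up for illustration.

```python
# Sparse tensor with one mapped dimension, modeled as label -> value.
ctrs = {"entertainment": 0.2, "news": 0.3, "sports": 0.5, "travel": 0.4, "finance": 0.1}


def modify_cells(tensor: dict, updates: dict) -> dict:
    """Return a copy of the tensor in which only the given cells are replaced;
    all other cells keep their current value."""
    modified = dict(tensor)
    modified.update(updates)
    return modified


# Only the sports cell is sent; the other four cells are untouched.
updated = modify_cells(ctrs, {"sports": 0.55})
```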

Importing parent values in child documents

We need to set up two things to use the category_ctr tensor for ranking news documents. We need to reference the parent document (category_ctr in this case) and import the ctrs from the referenced parent document.

app_package.get_schema("news").add_fields(
    Field(
        name="category_ctr_ref",
        type="reference<category_ctr>",
        indexing=["attribute"],
    )
)

The field category_ctr_ref is a field of type reference of the category_ctr document type. When feeding this field, Vespa expects the fully qualified document id. For instance, if our global CTR document has the id id:category_ctr:category_ctr::global, that is the value that we need to feed to the category_ctr_ref field. A document can reference many parent documents.

from vespa.package import ImportedField

app_package.get_schema("news").add_imported_field(
    ImportedField(
        name="global_category_ctrs",
        reference_field="category_ctr_ref",
        field_to_import="ctrs",
    )
)

The imported field defines that we should import the ctrs field from the document referenced in the category_ctr_ref field. We name this as global_category_ctrs, and we can reference this as attribute(global_category_ctrs) during ranking.

Tensor expressions in ranking

Each news document has a category field of type string indicating which category the document belongs to. We want to use this information to select the correct CTR score stored in the global_category_ctrs. Unfortunately, tensor expressions only work on tensors, so we need to add a new field of type tensor called category_tensor to hold category information in a way that can be used in a tensor expression:

app_package.get_schema("news").add_fields(
    Field(
        name="category_tensor",
        type="tensor<float>(category{})",
        indexing=["attribute"],
    )
)

With the category_tensor field as defined above, we can use the tensor expression sum(attribute(category_tensor) * attribute(global_category_ctrs)) to select the specific CTR related to the category of the document being ranked. We implement this expression as a Function in the rank-profile below:

from vespa.package import Function, RankProfile

app_package.get_schema("news").add_rank_profile(
    RankProfile(
        name="recommendation_with_global_category_ctr", 
        inherits="recommendation",
        functions=[
            Function(
                name="category_ctr", 
                expression="sum(attribute(category_tensor) * attribute(global_category_ctrs))"
            ),
            Function(
                name="nearest_neighbor", 
                expression="closeness(field, embedding)"
            )
            
        ],
        first_phase="nearest_neighbor * category_ctr",
        summary_features=[
            "attribute(category_tensor)", 
            "attribute(global_category_ctrs)", 
            "category_ctr", 
            "nearest_neighbor"
        ]
    )
)

In the new rank-profile, the first-phase ranking expression multiplies the nearest-neighbor score by the category CTR score, implemented with the functions nearest_neighbor and category_ctr, respectively. This simple product is a first attempt and might not be the best way to combine the two values.
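
The category_ctr expression can be mimicked with dictionaries to see why it selects a single CTR value: the product of two sparse tensors over the same dimension keeps only the labels present in both, and the sum then collapses the result to a number. The sketch below is a pure-Python model of that expression, not Vespa code; the CTR values are copied from the global CTR data used in this post.

```python
def category_ctr(category_tensor: dict, global_ctrs: dict) -> float:
    """Model of sum(attribute(category_tensor) * attribute(global_category_ctrs))."""
    return sum(
        value * global_ctrs[label]
        for label, value in category_tensor.items()
        if label in global_ctrs  # cells without a matching label contribute nothing
    )


global_ctrs = {
    "sports": 0.05611187986670051,
    "music": 0.05471192953054426,
    "tv": 0.05374837981352176,
}
# A sports article: 1.0 for its own category and no other cells.
category_tensor = {"sports": 1.0}
ctr = category_ctr(category_tensor, global_ctrs)
```

Since the document tensor holds 1.0 for exactly one category, the sum equals that category's global CTR.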

Deploy

We can reuse the same container named news created in the first part of this tutorial.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Finished deployment.

Feed

Next, we will download the global category CTR data, already parsed in the format that is expected by a sparse tensor with the category dimension.

import requests, json

global_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/global_category_ctr_parsed.json").text
)
global_category_ctr
{
    'ctrs': {
        'cells': [
            {'address': {'category': 'entertainment'}, 'value': 0.029266420380943244},
            {'address': {'category': 'autos'}, 'value': 0.028475809103747123},
            {'address': {'category': 'tv'}, 'value': 0.05374837981352176},
            {'address': {'category': 'health'}, 'value': 0.03531784305129329},
            {'address': {'category': 'sports'}, 'value': 0.05611187986670051},
            {'address': {'category': 'music'}, 'value': 0.05471192953054426},
            {'address': {'category': 'news'}, 'value': 0.04420778372641991},
            {'address': {'category': 'foodanddrink'}, 'value': 0.029256852366228187},
            {'address': {'category': 'travel'}, 'value': 0.025144552013730358},
            {'address': {'category': 'finance'}, 'value': 0.03231013195899643},
            {'address': {'category': 'lifestyle'}, 'value': 0.04423279317474416},
            {'address': {'category': 'video'}, 'value': 0.04006693315980292},
            {'address': {'category': 'movies'}, 'value': 0.03335647459420146},
            {'address': {'category': 'weather'}, 'value': 0.04532171803495617},
            {'address': {'category': 'northamerica'}, 'value': 0.0},
            {'address': {'category': 'kids'}, 'value': 0.043478260869565216}
        ]
    }
}

We can feed this data point to the document defined in the category_ctr schema. We will assign the id global to this document, so it can be referenced by the Vespa id id:category_ctr:category_ctr::global.

response = app.feed_data_point(schema="category_ctr", data_id="global", fields=global_category_ctr)

Next, we need to partially update each news document to set the reference field category_ctr_ref and the new category_tensor field, which holds the value 1.0 for the category associated with that document.

news_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/news_category_ctr_update_parsed.json").text
)
news_category_ctr[0]
{
    'id': 'N3112',
    'fields': {
        'category_ctr_ref': 'id:category_ctr:category_ctr::global',
        'category_tensor': {
            'cells': [
                { 'address': {'category': 'lifestyle'}, 'value': 1.0}
            ]
        }
    }
}
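
The update above is a reference to the parent document plus a one-hot tensor over the category dimension. A minimal sketch of how such an update could be assembled for one article follows; build_category_update is a hypothetical helper, and the default parent id is the one used in this tutorial.

```python
def build_category_update(news_id, category,
                          parent_id="id:category_ctr:category_ctr::global"):
    """Build the partial-update payload for one news document.

    Hypothetical helper: it mirrors the structure shown above and is
    not part of pyvespa.
    """
    return {
        "id": news_id,
        "fields": {
            # Reference to the single global CTR document.
            "category_ctr_ref": parent_id,
            # One-hot sparse tensor: 1.0 for the article's own category.
            "category_tensor": {
                "cells": [{"address": {"category": category}, "value": 1.0}]
            },
        },
    }


update = build_category_update("N3112", "lifestyle")
```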

This takes ten minutes or so:

for data_point in news_category_ctr:
    response = app.update_data(schema="news", data_id=data_point["id"], fields=data_point["fields"])
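
Since the loop runs for a while, it can be useful to keep track of updates that did not succeed. The sketch below wraps the loop in a hypothetical helper, apply_updates, that takes the update call as a function. It assumes the returned response object exposes a status_code attribute and that 200 means success, so check this against the pyvespa version you use.

```python
def apply_updates(update_fn, data_points, report_every=10000):
    """Send one update per data point and collect the ids that failed.

    Assumption: update_fn returns an object with a `status_code`
    attribute, where 200 signals success.
    """
    failed = []
    for count, data_point in enumerate(data_points, start=1):
        response = update_fn(data_point)
        if getattr(response, "status_code", None) != 200:
            failed.append(data_point["id"])
        if count % report_every == 0:
            print(f"{count} updates sent, {len(failed)} failed")
    return failed


# Usage with the objects from this tutorial (requires the running app):
# failed = apply_updates(
#     lambda dp: app.update_data(schema="news", data_id=dp["id"], fields=dp["fields"]),
#     news_category_ctr,
# )
```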

Testing the new rank-profile

We will redefine the query_user_embedding function defined in the second part of this tutorial and use it to make a query involving the user U33527 and the recommendation_with_global_category_ctr rank-profile.

def parse_embedding(hit_json):
    # Copy the embedding values of a hit into a plain Python list.
    return list(hit_json["fields"]["embedding"]["values"])

def query_user_embedding(user_id):
    # Look up the user document and return its embedding vector.
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding

yql = "select * from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding))"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U33527")),
        "ranking.profile": "recommendation_with_global_category_ctr"
    }
)

The first hit below is a sports article. Its summary features also list the global CTR tensor, in which the CTR for the sports category is 0.0561. The result of the category_ctr function is therefore 0.0561, as intended. The nearest_neighbor score is 0.149, and the resulting relevance score is 0.00837, the product of the two. So, this worked as expected.

{
    'id': 'id:news:news::N5316',
    'relevance': 0.008369192847921151,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N5316',
        'news_id': 'N5316',
        'category': 'sports',
        'subcategory': 'football_nfl',
        'title': "Matthew Stafford's status vs. Bears uncertain, Sam Martin will play",
        'abstract': "Stafford's start streak could be in jeopardy, according to Ian Rapoport.",
        'url': "https://www.msn.com/en-us/sports/football_nfl/matthew-stafford's-status-vs.-bears-uncertain,-sam-martin-will-play/ar-BBWwcVN?ocid=chopendata",
        'date': 20191112,
        'clicks': 0,
        'impressions': 1,
        'summaryfeatures': {
            'attribute(category_tensor)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'sports'}, 'value': 1.0}
                ]
            },
            'attribute(global_category_ctrs)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'entertainment'}, 'value': 0.029266420751810074},
                    {'address': {'category': 'autos'}, 'value': 0.0284758098423481},
                    {'address': {'category': 'tv'}, 'value': 0.05374838039278984},
                    {'address': {'category': 'health'}, 'value': 0.03531784191727638},
                    {'address': {'category': 'sports'}, 'value': 0.05611187964677811},
                    {'address': {'category': 'music'}, 'value': 0.05471193045377731},
                    {'address': {'category': 'news'}, 'value': 0.04420778527855873},
                    {'address': {'category': 'foodanddrink'}, 'value': 0.029256852343678474},
                    {'address': {'category': 'travel'}, 'value': 0.025144552811980247},
                    {'address': {'category': 'finance'}, 'value': 0.032310131937265396},
                    {'address': {'category': 'lifestyle'}, 'value': 0.044232793152332306},
                    {'address': {'category': 'video'}, 'value': 0.040066931396722794},
                    {'address': {'category': 'movies'}, 'value': 0.033356472849845886},
                    {'address': {'category': 'weather'}, 'value': 0.045321717858314514},
                    {'address': {'category': 'northamerica'}, 'value': 0.0},
                    {'address': {'category': 'kids'}, 'value': 0.043478261679410934}
                ]
            },
            'rankingExpression(category_ctr)': 0.05611187964677811,
            'rankingExpression(nearest_neighbor)': 0.14915188666574342,
            'vespa.summaryFeatures.cached': 0.0
        }
    }
}
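
The numbers in this hit can be checked by hand. The sketch below assumes that category_ctr sums the element-wise product of the two sparse tensors over the category dimension and that the relevance is category_ctr multiplied by nearest_neighbor. The values are consistent with that reading, but the rank-profile definition is the authoritative source. Sparse tensors are represented as plain dicts, and only a few of the CTR cells are copied over.

```python
def sparse_dot(a, b):
    """Sum of products over the labels present in both sparse tensors."""
    return sum(value * b[label] for label, value in a.items() if label in b)


# Values copied from the summaryfeatures of the hit above.
category_tensor = {"sports": 1.0}
global_category_ctrs = {
    "sports": 0.05611187964677811,
    "music": 0.05471193045377731,
    "tv": 0.05374838039278984,
}
nearest_neighbor = 0.14915188666574342

# The one-hot category tensor selects the CTR of the article's category.
category_ctr = sparse_dot(category_tensor, global_category_ctrs)
relevance = category_ctr * nearest_neighbor
print(category_ctr, relevance)
```

Because category_tensor is one-hot, the product simply picks out the sports CTR, which is why rankingExpression(category_ctr) equals that cell.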

Conclusion

This tutorial introduced parent-child relationships and demonstrated them through a global CTR feature that we used in ranking. We also introduced ranking with (sparse) tensor expressions.

Clean up Docker container instances:

vespa_docker.container.stop()
vespa_docker.container.remove()

Build a basic text search application from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Evaluate search engine experiments using Python.

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to perform search engine experiments.

UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.

Photo by Eugene Golovesov on Unsplash

We show how to use the pyvespa API to run search engine experiments based on the text search app we built in the first part of this tutorial series. Specifically, we compare two different matching operators and show how to reduce the number of documents matched by the queries while keeping similar recall and precision metrics.

We assume that you have followed the first tutorial and have a variable app holding the Vespa connection instance that we established there. This connection should be pointing to a Docker container named cord19 running the Vespa application.

Feed additional data points

We will continue to use the CORD19 sample data
that fed the search app in the first tutorial.
In addition, we are going to feed a few additional data points to make it possible to get relevant metrics from our experiments.
We tried to minimize the amount of data required to make this tutorial easy to reproduce.
You can download the additional 494 data points below:

from pandas import read_csv

parsed_feed = read_csv("https://data.vespa.oath.cloud/blog/cord19/parsed_feed_additional.csv")
parsed_feed.head(5)

Feed data

We can then feed the data we just downloaded to the app via the feed_data_point method:

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )
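
One thing to watch for in a loop like this: pandas usually reads empty CSV cells as NaN, and str() turns NaN into the literal text "nan". Whether this file has empty cells is not stated here, so the following is only a defensive sketch. The helper to_text is hypothetical and could replace the str(...) calls when building fields.

```python
import math


def to_text(value):
    """Return '' for missing values (None or NaN), else str(value).

    Hypothetical helper, not part of pyvespa or pandas.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)


# Example: a missing abstract becomes an empty string instead of "nan".
fields = {"title": to_text("Some title"), "abstract": to_text(float("nan"))}
```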

Define query models to compare

A QueryModel is an abstraction that encapsulates all the relevant information controlling how your app matches and ranks documents. Since we are dealing with a simple text search app here, we will start by creating two query models that use BM25 to rank but differ on how they match documents.

from learntorank.query import QueryModel, OR, WeakAnd, Ranking

or_bm25 = QueryModel(
    name="or_bm25",
    match_phase=OR(), 
    ranking=Ranking(name="bm25")
)

The first model is named or_bm25 and will match all the documents that share at least one token with the query.

from learntorank.query import WeakAnd

wand_bm25 = QueryModel(
    name="wand_bm25", 
    match_phase=WeakAnd(hits=10), 
    ranking=Ranking(name="bm25")
)

The second model is named wand_bm25 and uses the WeakAnd operator, which can be thought of as an accelerated OR operator. The next section shows that WeakAnd matches fewer documents without affecting the recall and precision metrics for the case considered here. We also look for the best hits parameter for our specific application.
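
To make the difference between the two match phases more concrete, here is a rough sketch of the kind of YQL each one corresponds to. The functions or_yql and weak_and_yql are hypothetical and are not how learntorank builds its queries internally. The sketch also assumes a fieldset named default and does not escape quotes in the query text.

```python
def _terms(query, field):
    # One "contains" clause per whitespace-separated token.
    return ['{} contains "{}"'.format(field, token) for token in query.split()]


def or_yql(query, field="default"):
    """Match documents that contain at least one query token."""
    return "select * from sources * where " + " or ".join(_terms(query, field))


def weak_and_yql(query, hits, field="default"):
    """Same tokens, but wrapped in weakAnd with a target number of hits."""
    return "select * from sources * where ({targetHits: %d}weakAnd(%s))" % (
        hits,
        ", ".join(_terms(query, field)),
    )


print(or_yql("coronavirus origin"))
print(weak_and_yql("coronavirus origin", hits=10))
```

Both queries use the same terms. The difference is that weakAnd is allowed to skip documents that cannot make it into the top hits, which is where the lower match ratio comes from.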

Run experiments

We can define which metrics we want to compute when running our experiments.

from learntorank.evaluation import MatchRatio, Recall, NormalizedDiscountedCumulativeGain

eval_metrics = [
    MatchRatio(), 
    Recall(at=10), 
    NormalizedDiscountedCumulativeGain(at=10)
]

MatchRatio computes the fraction of the document corpus matched by the queries. This metric will be critical when comparing match phase operators such as the OR and the WeakAnd. In addition, we compute Recall and NDCG metrics.
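
For intuition, the three metrics can be written down for a single query as follows. These are textbook-style definitions given as a sketch. The function names are made up, and learntorank's own implementation may differ in details such as how the ideal ranking for NDCG is built.

```python
import math


def match_ratio(retrieved_docs, docs_available):
    """Fraction of the corpus matched by the query."""
    return retrieved_docs / docs_available


def recall_at(k, ranked_ids, relevant_ids):
    """Share of the relevant documents found in the top k hits."""
    found = set(ranked_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)


def ndcg_at(k, ranked_ids, relevance_by_id):
    """DCG of the top k hits divided by the DCG of an ideal ranking."""
    def dcg(scores):
        # Position 0 is rank 1, discounted by log2(rank + 1).
        return sum(s / math.log2(pos + 2) for pos, s in enumerate(scores))

    gains = [relevance_by_id.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant document (score 2) returned at rank 3.
ranked = ["a", "b", "c"]
print(recall_at(10, ranked, ["c"]), ndcg_at(10, ranked, {"c": 2}))
```

With a single relevant document per query, recall at 10 is either 0 or 1 for each query, while NDCG also rewards placing that document closer to the top.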

We can download labeled data to perform our experiments and compare query models. Our sample data contains 50 queries, and each query has one relevant document associated with it.

import json, requests

labeled_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/cord19/labeled_data.json").text
)
labeled_data[:3]
[{'query_id': 1,
  'relevant_docs': [{'id': 'kqqantwg', 'score': 2}],
  'query': 'coronavirus origin'},
 {'query_id': 2,
  'relevant_docs': [{'id': '526elsrf', 'score': 2}],
  'query': 'coronavirus response to weather changes'},
 {'query_id': 3,
  'relevant_docs': [{'id': '5jl6ltfj', 'score': 1}],
  'query': 'coronavirus immunity'}]

Evaluate

Once we have labeled data, the evaluation metrics to compute, and the query models we want to compare, we can run experiments with the evaluate method. The cord_uid field of the Vespa application should match the id of the relevant documents.

from learntorank.evaluation import evaluate

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
)
evaluation

The result shows that, on average, we match 67% of our document corpus when using the OR operator and 21% when using the WeakAnd operator. The reduction in matched documents did not affect the recall and the NDCG metrics, which stayed at around 0.84 and 0.40, respectively. The match ratio drops even further when we experiment with the hits parameter of WeakAnd later in this tutorial.

There are different options available to configure the output of the evaluate method.

Specify summary statistics

The evaluate method returns the mean, the median, and the standard deviation of the metrics by default. We can customize this by specifying the desired aggregators. Below we choose the mean, the max, and the min as an example.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean", "min", "max"]
)
evaluation

Check detailed metrics output

Some of the metrics have intermediate results that might be of interest. For example, the MatchRatio metric requires us to compute the number of matched documents (retrieved_docs) and the number of documents available to be retrieved (docs_available). We can output those intermediate steps by setting detailed_metrics=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
    detailed_metrics=True
)
evaluation

Get per-query results

When debugging the results, it is often helpful to look at the metrics on a per-query basis, which is available by setting per_query=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    per_query=True
)
evaluation.head(5)

Find optimal WeakAnd parameter

We can use the same evaluation framework to find the optimal hits parameter of the WeakAnd operator for this specific application. To do that, we can define a list of query models that only differ by the hits parameter.

wand_models = [QueryModel(
    name="wand_{}_bm25".format(hits), 
    match_phase=WeakAnd(hits=hits), 
    ranking=Ranking(name="bm25")
) for hits in range(1, 11)]

We can then call evaluate as before and show the match ratio and recall for each of the options defined above.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=wand_models, 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
)
evaluation.loc[["match_ratio", "recall_10"], ["wand_{}_bm25".format(hits) for hits in range(1, 11)]]

As expected, we can see that a higher hits parameter implies a higher match ratio. But the recall metric remains the same as long as we pick hits > 3. So, using WeakAnd with hits = 4 is enough for this specific application and dataset, leading to a further reduction in the number of documents matched on average by our queries.
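
The selection rule used here, the smallest hits value that still reaches the best recall, can also be written as a small helper. The function name is hypothetical, and the recall numbers below are placeholders chosen only to illustrate the rule. They are not the results of this tutorial.

```python
def smallest_hits_with_max_recall(recall_by_hits, tolerance=1e-9):
    """Return the smallest hits value whose recall equals the maximum."""
    best = max(recall_by_hits.values())
    return min(
        hits for hits, recall in recall_by_hits.items()
        if best - recall <= tolerance
    )


# Placeholder values for illustration only (NOT measured results).
recall_by_hits = {1: 0.70, 2: 0.78, 3: 0.82, 4: 0.84, 5: 0.84, 6: 0.84}
best_hits = smallest_hits_with_max_recall(recall_by_hits)
```

In practice, the dict could be filled from the recall_10 row of the evaluation output, with one entry per wand model.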

Clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Conclusion

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to run search engine experiments via the evaluate method. We used a simple example that compares two different match operators and another that optimizes the parameter of one of those operators. Our key finding is that we can reduce the size of the retrieved set of hits without losing recall and precision by using the WeakAnd instead of the OR match operator.

The following Vespa resources are related to the topics explored by the experiments presented here: