Blog search application in Vespa

Update 2021-05-20: This blog post refers to Vespa sample applications that no longer exist. Please refer to the News search and recommendation tutorial for an updated version of the text and sample applications.

Introduction

This is the first of a series of blog posts where data from WordPress.com (WP) is used to highlight how Vespa can be used to store, search and recommend blog posts. The data was made available during a Kaggle challenge to predict which blog posts someone would like based on their past behavior. It contains many ingredients that are necessary to showcase needs, challenges and possible solutions that are useful for those interested in building and deploying such applications in production.

The end goal is to build an application where:

  • Users will be able to search and manipulate the pool of blog posts available.
  • Users will get blog post recommendations from the content pool based on their interest.

This part addresses:

  • A description of the dataset used, as well as other information connected to the data.
  • How to set up a basic blog post search engine using Vespa.

The next parts show how to extend this basic search engine application with machine learned models to create a blog recommendation engine.

Dataset

The dataset contains blog posts written by WP bloggers and actions, in this case ‘likes’, performed by WP readers in blog posts they have interacted with. The dataset is publicly available at Kaggle and was released during a challenge to develop algorithms to help predict which blog posts users would most likely ‘like’ if they were exposed to them. The data includes these fields per blog post:

  • post_id – unique numerical id identifying the blog post
  • date_gmt – string representing the date of blog post creation in GMT format yyyy-mm-dd hh:mm:ss
  • author – unique numerical id identifying the author of the blog post
  • url – blog post URL
  • title – blog post title
  • blog – unique numerical id identifying the blog that the blog post belongs to
  • tags – array of strings representing the tags of the blog post
  • content – body text of the blog post, in HTML format
  • categories – array of strings representing the categories the blog post was assigned to

For the user actions:

  • post_id – unique numerical id identifying the blog post
  • uid – unique numerical id identifying the user that liked post_id
  • dt – date of the interaction in GMT format yyyy-mm-dd hh:mm:ss
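
As an illustration, a single action record could look like this sketch (the values are invented):

{
    "post_id": 1750271,
    "uid": 123456,
    "dt": "2012-02-08 09:15:00"
}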

Downloading raw data

For the purposes of this post, it is sufficient to use the first release of training data that consists of 5 weeks of posts as well as all the ‘like’ actions that occurred during those 5 weeks.

This first release of training data is available here – once downloaded, unzip it. The 1,196,111-line trainPosts.json will be our practice document data. This file is around 5GB in size.

Requirements

Indexing the full data set requires 23GB of disk space. We have tested with a Docker container with 10GB RAM, using settings similar to those described in the Vespa quick start guide. As in the guide, we assume that the $VESPA_SAMPLE_APPS env variable points to the directory with your local clone of the vespa sample apps:

$ docker run -m 10G --detach --name vespa --hostname vespa --privileged --volume $VESPA_SAMPLE_APPS:/vespa-sample-apps --publish 8080:8080 vespaengine/vespa

Searching blog posts

Functional specification:

  • Blog post title, content, tags and categories must all be searchable
  • Allow blog posts to be sorted by both relevance and date
  • Allow grouping of search results by tag or category

In terms of data, Vespa operates with the notion of documents. A document represents a single, searchable item in your system, e.g., a blog post, a photo, or a news article. Each document type must be defined in the Vespa configuration through a search definition. Think of a search definition as being similar to a table definition in a relational database; it consists of a set of fields, each with a given name, a specific type, and some optional properties.

As an example, for this simple blog post search application, we could create the document type blog_post with the following fields:

  • url – of type uri
  • title – of type string
  • content – of type string (string fields can be of any length)
  • date_gmt – of type string (to store the creation date in GMT format)

The data fed into Vespa must match the structure of the search definition, and the hits returned when searching will be in this format as well.
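
As an example, a single feed operation in Vespa's JSON document format could look like the sketch below. The document id reuses 1750271 from the fetch example later in this post; the field values are invented for illustration.

{
    "put": "id:blog-search:blog_post::1750271",
    "fields": {
        "url": "http://example.wordpress.com/2012/02/07/some-post/",
        "title": "A hypothetical blog post title",
        "content": "Body text of the blog post ...",
        "date_gmt": "2012-02-07 13:00:00"
    }
}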

Application Packages

A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done and how data will be processed during feeding and indexing. The search definition, e.g., blog_post.sd, is a required part of an application package — the other required files are services.xml and hosts.xml.

The sample application blog search creates a simple but functional blog post search engine. The application package is found in src/main/application.

Services Specification

services.xml defines the services that make up the Vespa application — which services to run and how many nodes per service:

<?xml version='1.0' encoding='UTF-8'?>
<services version='1.0'>

  <container id='default' version='1.0'>
    <search/>
    <document-api/>
    <nodes>
      <node hostalias="node1"/>
    </nodes>
  </container>

  <content id='blog_post' version='1.0'>
    <search>
      <visibility-delay>1.0</visibility-delay>
    </search>
    <redundancy>1</redundancy>
    <documents>
      <document mode="index" type="blog_post"/>
    </documents>
    <nodes>
      <node hostalias="node1"/>
    </nodes>
    <engine>
      <proton>
        <searchable-copies>1</searchable-copies>
      </proton>
    </engine>
  </content>

</services>
  • <container> defines the container cluster for document, query and result processing
  • <search> sets up the search endpoint for Vespa queries. The default port is 8080.
  • <document-api> sets up the document endpoint for feeding.
  • <nodes> defines the nodes required per service. (See the reference for more on container cluster setup.)
  • <content> defines how documents are stored and searched
  • <redundancy> denotes how many copies to keep of each document.
  • <documents> assigns the document types in the search definition — the content cluster capacity can be increased by adding node elements — see elastic Vespa. (See also the reference for more on content cluster setup.)
  • <nodes> defines the hosts for the content cluster.

Deployment Specification

hosts.xml contains a list of all the hosts/nodes that are part of the application, with an alias for each of them. Here we use a single node:

<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <host name="localhost">
    <alias>node1</alias>
  </host>
</hosts>

Search Definition

The blog_post document type mentioned in src/main/application/services.xml is defined in the search definition. src/main/application/searchdefinitions/blog_post.sd contains the search definition for a document of type blog_post:

search blog_post {

    document blog_post {

        field date_gmt type string {
            indexing: summary
        }

        field language type string {
            indexing: summary
        }

        field author type string {
            indexing: summary
        }

        field url type string {
            indexing: summary
        }

        field title type string {
            indexing: summary | index
        }

        field blog type string {
            indexing: summary
        }

        field post_id type string {
            indexing: summary
        }

        field tags type array<string> {
            indexing: summary
        }

        field blogname type string {
            indexing: summary
        }

        field content type string {
            indexing: summary | index
        }

        field categories type array<string> {
            indexing: summary
        }

        field date type int {
            indexing: summary | attribute
        }

    }


    fieldset default {
        fields: title, content
    }


    rank-profile post inherits default {

        first-phase {
            expression:nativeRank(title, content)
        }

    }

}

document is wrapped inside another element called search. The name following these elements, here blog_post, must be exactly the same for both.

The field property indexing configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing — see indexing language. Each part of the indexing pipeline is separated by the pipe character ‘|’:

  • index – creates a search index for this field
  • attribute – stores this field in memory as an attribute, enabling sorting, querying and grouping
  • summary – includes this field in the document summary in the result set

Deploy the Application Package

Once done with the application package, deploy the Vespa application — build and start Vespa as in the quick start. Deploy the application:

$ cd /vespa-sample-apps/blog-search
$ vespa-deploy prepare src/main/application && vespa-deploy activate

This prints that the application was activated successfully and also the checksum, timestamp and generation for this deployment (more on that later). Pointing a browser to http://localhost:8080/ApplicationStatus returns JSON-formatted information about the active application, including its checksum, timestamp and generation (and should be the same as the values when vespa-deploy activate was run). The generation will increase by 1 each time a new application is successfully deployed, and is the easiest way to verify that the correct version is active.
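
As an alternative to the browser, a quick sketch using Python's requests library (assuming the single-node deployment above) fetches the same status information:

import json, requests

# Fetch the application status; the response includes the checksum,
# timestamp and generation of the active application.
status = requests.get("http://localhost:8080/ApplicationStatus").json()
print(json.dumps(status, indent=2))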

The Vespa node is now configured and ready for use.

Feeding Data

The data fed to Vespa must match the search definition for the document type. The data downloaded from Kaggle, contained in trainPosts.json, must be converted to a valid Vespa document format before it can be fed to Vespa. Find a parser in the utility repository. Since the full data set is unnecessarily large for the purposes of this post, we use only the first 10,000 lines of it, but feel free to load all 1.1M entries:

$ head -10000 trainPosts.json > trainPostsSmall.json
$ python parse.py trainPostsSmall.json > feed.json

Send this to Vespa using one of the tools Vespa provides for feeding. Here we will use the Java feeding API:

$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --verbose --file feed.json --host localhost --port 8080

Note that in the sample-apps/blog-search directory, there is a file with sample data. You may also feed this file using this method.

Track feeding progress

Use the Metrics API to track number of documents indexed:

$ curl -s 'http://localhost:19112/state/v1/metrics' | tr ',' '\n' | grep -A 2 proton.doctypes.blog_post.numdocs

You can also inspect the search node state by running:

$ vespa-proton-cmd --local getState  

Fetch documents

Fetch documents by document id using the Document API:

$ curl -s 'http://localhost:8080/document/v1/blog-search/blog_post/docid/1750271' | python -m json.tool

The first query

Searching with Vespa is done via HTTP GET requests, like:

<host:port>/<search>?<yql=value1>&<param2=value2>...

The only mandatory parameter is the query, using yql=<yql query>. More details can be found in the Search API.

Given the above search definition, where the fields title and content are part of the fieldset default, any document containing the word “music” in one or more of these two fields matches our query below:

$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22%3B' | python -m json.tool

Looking at the output, please note:

  • The field documentid in the output and how it matches the value we assigned to each put operation when feeding data to Vespa.
  • Each hit has a property named relevance, which indicates how well the given document matches our query, using a pre-defined default ranking function. You have full control over ranking — more about ranking and ordering later. The hits are sorted by this value.
  • When multiple hits have the same relevance score, their internal ordering is undefined. However, their internal ordering will not change unless the documents are re-indexed.
  • Add &tracelevel=9 to dump query parsing details.

Other examples

yql=select+title+from+sources+*+where+title+contains+%22music%22%3B

Once more a search for the single term “music”, but this time with the explicit field title. This means that we only want to match documents that contain the word “music” in the field title. As expected, you will see fewer hits for this query than for the previous one.

yql=select+*+from+sources+*+where+default+contains+%22music%22+AND+default+contains+%22festival%22%3B

This is a query for the two terms “music” and “festival”, combined with an AND operation; it finds documents that match both terms — but not just one of them.

yql=select+*+from+sources+*+where+sddocname+contains+%22blog_post%22%3B

This is a single-term query in the special field sddocname for the value “blog_post”. This is a common and useful Vespa trick to get the number of indexed documents for a certain document type (search definition): sddocname is a special and reserved field which is always set to the name of the document type for a given document. The documents are all of type blog_post, and will therefore automatically have the field sddocname set to that value.

This means that the query above really means “Return all documents of type blog_post”, and as such all documents in the index are returned.
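
The functional specification also asked for sorting by date. Since date is an attribute of type int in the search definition, results can be ordered with YQL's order by clause. Here is a sketch using Python's requests library against the deployment above (the curl commands shown earlier work equally well):

import requests

# The same "music" query as before, but ordered by the 'date' attribute,
# newest first, instead of by relevance.
yql = 'select * from sources * where default contains "music" order by date desc;'
response = requests.get("http://localhost:8080/search/", params={"yql": yql})
print(response.json()["root"]["fields"]["totalCount"])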

Build a basic text search application from python with Vespa

Thiago Martins

Vespa Data Scientist


Introducing the pyvespa simplified API: build a Vespa application from python with a few lines of code.

UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.

This post will introduce you to the simplified pyvespa API that allows us to build a basic text search application from scratch with just a few lines of python code. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.

Photo by Sarah Dorweiler on Unsplash

pyvespa exposes a subset of the Vespa API in python. The library’s primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect and interact with running Vespa applications and evaluate Vespa ranking functions from python. This time, we focus on building and deploying applications from scratch.

Install

The pyvespa simplified API introduced here was released in version 0.2.0:

pip3 install "pyvespa>=0.2.0" learntorank

Define the application

As an example, we will build an application to search through
CORD19 sample data.

Create an application package

The first step is to create a Vespa ApplicationPackage:

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="cord19")

Add fields to the Schema

We can then add fields to the application’s Schema created by default in app_package.

from vespa.package import Field

app_package.schema.add_fields(
    Field(
        name = "cord_uid", 
        type = "string", 
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    ),
    Field(
        name = "abstract", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    )
)
  • cord_uid will store the cord19 document ids, while title and abstract are self-explanatory.

  • All the fields, in this case, are of type string.

  • Including "index" in the indexing list means that Vespa will create a searchable index for title and abstract. You can read more about which options are available for indexing in the Vespa documentation.

  • Setting index = "enable-bm25" makes Vespa pre-compute quantities to make it fast to compute the bm25 score. We will use BM25 to rank the documents retrieved.

Search multiple fields when querying

A Fieldset groups fields together for searching. For example, the default fieldset defined below groups title and abstract together.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)

Define how to rank the documents matched

We can specify how to rank the matched documents by defining a RankProfile. In this case, we defined the bm25 rank profile that combines that BM25 scores computed over the title and abstract fields.

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25", 
        first_phase = "bm25(title) + bm25(abstract)"
    )
)

Deploy your application

We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can locally deploy our app_package using Docker without leaving the notebook, by creating an instance of VespaDocker, as shown below:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()

app = vespa_docker.deploy(application_package = app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

app now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.

It is important to know that pyvespa simply provides a convenient API to define Vespa application packages from python. vespa_docker.deploy exports the Vespa configuration files to disk. Going through those files is an excellent way to start learning about Vespa syntax.
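
You can also print the generated schema directly from the notebook. This sketch assumes your pyvespa version exposes the schema_to_text property:

# Assumes schema_to_text is available in your pyvespa version;
# prints the generated .sd schema for inspection.
print(app_package.schema.schema_to_text)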

Feed some data

Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame containing 100 rows and the cord_uid, title, and abstract columns required by our schema definition.

from pandas import read_csv

parsed_feed = read_csv(
    "https://thigm85.github.io/data/cord19/parsed_feed_100.csv"
)
    cord_uid   title                                             abstract
0   ug7v899j   Clinical features of culture-proven Mycoplasma…  OBJECTIVE: This retrospective chart review des…
1   02tnwd4m   Nitric oxide: a pro-inflammatory mediator in l…   Inflammatory diseases of the respiratory tract…
2   ejv2xln0   Surfactant protein-D and pulmonary host defense   Surfactant protein-D (SP-D) participates in th…
3   2b73a28n   Role of endothelin-1 in lung disease              Endothelin-1 (ET-1) is a 21 amino acid peptide…
4   9785vg6d   Gene expression in epithelial cells in respons…   Respiratory syncytial virus (RSV) and pneumoni…
...
95  63bos83o   Global Surveillance of Emerging Influenza Viru…   BACKGROUND: Effective influenza surveillance r…
96  hqc7u9w3   Transmission Parameters of the 2001 Foot and M…   Despite intensive ongoing research, key aspect…
97  87zt7lew   Efficient replication of pneumonia virus of mi…   Pneumonia virus of mice (PVM; family Paramyxov…
98  wgxt36jv   Designing and conducting tabletop exercises to…   BACKGROUND: Since 2001, state and local health…
99  qbldmef1   Transcript-level annotation of Affymetrix prob…   BACKGROUND: The wide use of Affymetrix microar…

100 rows × 3 columns

We can then iterate through the DataFrame above and feed each row by using the app.feed_data_point method:

  • The schema name is by default set to be equal to the application name, which is cord19 in this case.

  • When feeding data to Vespa, we must have a unique id for each data point. We will use cord_uid here.

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )

You can also inspect the response to each request if desired.

{'pathId': '/document/v1/cord19/cord19/docid/qbldmef1',
 'id': 'id:cord19:cord19::qbldmef1'}

Query your application

With data fed, we can start to query our text search app. We can use the Vespa Query Language directly by sending the required parameters to the body argument of the app.query method.

query = {
    'yql': 'select * from sources * where userQuery()',
    'query': 'What is the role of endothelin-1',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 3
}
res = app.query(body=query)
res.hits[0]
{'id': 'id:cord19:cord19::2b73a28n',
 'relevance': 20.79338929607865,
 'source': 'cord19_content',
 'fields': {'sddocname': 'cord19',
  'documentid': 'id:cord19:cord19::2b73a28n',
  'cord_uid': '2b73a28n',
  'title': 'Role of endothelin-1 in lung disease',
  'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}

We can also define the same query by using the QueryModel abstraction, which allows us to specify how we want to match and rank our documents. In this case, we defined that we want to:

  • match our documents using the OR operator, which matches all the documents that share at least one term with the query.
  • rank the matched documents using the bm25 rank profile defined in our application package.
from learntorank.query import QueryModel, OR, Ranking, send_query

res = send_query(
    app=app,
    query="What is the role of endothelin-1", 
    query_model = QueryModel(
        match_phase=OR(), 
        ranking=Ranking(name="bm25")
    )
)
res.hits[0]
{
    'id': 'id:cord19:cord19::2b73a28n',
    'relevance': 20.79338929607865,
    'source': 'cord19_content',
    'fields': {
        'sddocname': 'cord19',
        'documentid': 'id:cord19:cord19::2b73a28n',
        'cord_uid': '2b73a28n',
        'title': 'Role of endothelin-1 in lung disease',
        'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'
    }
}

Using the Vespa Query Language as in our first example gives you the full power and flexibility that Vespa can offer. In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments, but this is a future post topic.

Jump to Build a basic text search application from python with Vespa: Part 2, or clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Build sentence/paragraph level QA application from python with Vespa

Thiago Martins

Vespa Data Scientist


Retrieve paragraph and sentence level information with sparse and dense ranking features.

UPDATE 2023-02-14: Code examples are updated to work with the latest release of pyvespa.

We will walk through the steps necessary to create a question answering (QA) application that can retrieve sentence or paragraph level answers based on a combination of semantic and/or term-based search. We start by discussing the dataset used and the question and sentence embeddings generated for semantic search. We then include the steps necessary to create and deploy a Vespa application to serve the answers. We make all the required data available to feed the application and show how to query for sentence and paragraph level answers based on a combination of semantic and term-based search.

Photo by Brett Jordan on Unsplash

This tutorial is based on earlier work by the Vespa team to reproduce the results of the paper ReQA: An Evaluation for End-to-End Answer Retrieval Models by Ahmad et al., using the Stanford Question Answering Dataset (SQuAD) v1.1 dataset.

About the data

We are going to use the Stanford Question Answering Dataset (SQuAD) v1.1 dataset. The data contains paragraphs (denoted here as context), and each paragraph has questions that have answers in the associated paragraph. We have parsed the dataset and organized the data that we will use in this tutorial to make it easier to follow along.

Paragraph

import requests, json

context_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_context_data.json").text
)

Each context data point contains a context_id that uniquely identifies a paragraph, a text field holding the paragraph string, and a questions field holding a list of question ids that can be answered from the paragraph text. We also include a dataset field to identify the data source if we want to index more than one dataset in our application.

{
    'text': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
     'dataset': 'squad',
     'questions': [0, 1, 2, 3, 4],
     'context_id': 0
}

Questions

According to the data point above, context_id = 0 can be used to answer the questions with id = [0, 1, 2, 3, 4]. We can load the file containing the questions and display those first five questions.

from pandas import read_csv

# Note that squad_queries.txt is approx. 1 GB due to the 512-dimensional question embeddings
questions = read_csv(
    filepath_or_buffer="https://data.vespa.oath.cloud/blog/qa/squad_queries.txt", 
    sep="\t", 
    names=["question_id", "question", "number_answers", "embedding"]
)
questions[["question_id", "question"]].head()
question_id   question
0             To whom did the Virgin Mary allegedly appear i…
1             What is in front of the Notre Dame Main Building?
2             The Basilica of the Sacred heart at Notre Dame…
3             What is the Grotto at Notre Dame?
4             What sits on top of the Main Building at Notre…

Paragraph sentences

To build a more accurate application, we can break the paragraphs down into sentences. For example, the first sentence below comes from the paragraph with context_id = 0 and can answer the question with question_id = 4.

# Note that qa_squad_sentence_data.json is approx. 1 GB due to the 512-dimensional sentence embeddings
sentence_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_sentence_data.json").text
)
{k:sentence_data[0][k] for k in ["text", "dataset", "questions", "context_id"]}
{
    'text': "Atop the Main Building's gold dome is a golden statue of the Virgin Mary.",
    'dataset': 'squad',
    'questions': [4],
    'context_id': 0
}

Embeddings

We want to combine semantic (dense) and term-based (sparse) signals to answer the questions sent to our application. We have generated embeddings for both the questions and the sentences to implement the semantic search, each of size 512.

questions[["question_id", "embedding"]].head(1)
question_id   embedding
0             [-0.025649750605225563, -0.01708591915667057, …
sentence_data[0]["sentence_embedding"]["values"][0:5] # display the first five elements
[
    -0.005731593817472458,
    0.007575507741421461,
    -0.06413306295871735,
    -0.007967847399413586,
    -0.06464996933937073
]

Here is the script containing the code that we used to generate the sentence and question embeddings. We used Google’s Universal Sentence Encoder at the time, but feel free to replace it with embeddings generated by your preferred model.
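
For reference, a minimal sketch of how a question embedding could be generated with the Universal Sentence Encoder from TensorFlow Hub (this assumes tensorflow_hub is installed; it is not the exact script we used):

import tensorflow_hub as hub

# Load the Universal Sentence Encoder and embed one question;
# the output is a 512-dimensional vector.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
query_embedding = encoder(["What is the Grotto at Notre Dame?"])[0].numpy().tolist()
assert len(query_embedding) == 512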

Create and deploy the application

We can now build a sentence-level Question answering application based on the data described above.

Schema to hold context information

The context schema will have a document containing the four relevant fields described in the data section. We create an index for the text field and use enable-bm25 to pre-compute data required to speed up the use of BM25 for ranking. The summary indexing indicates that all the fields will be included in the requested context documents. The attribute indexing stores the fields in memory as attributes for sorting, querying, and grouping.

from vespa.package import Document, Field

context_document = Document(
    fields=[
        Field(name="questions", type="array<int>", indexing=["summary", "attribute"]),
        Field(name="dataset", type="string", indexing=["summary", "attribute"]),
        Field(name="context_id", type="int", indexing=["summary", "attribute"]),        
        Field(name="text", type="string", indexing=["summary", "index"], index="enable-bm25"),                
    ]
)

The default fieldset means query tokens will be matched against the text field by default. We defined two rank-profiles (bm25 and nativeRank) to illustrate that we can define and experiment with as many rank-profiles as we want. You can create different ones using the ranking expressions and features available.

from vespa.package import Schema, FieldSet, RankProfile

context_schema = Schema(
    name="context",
    document=context_document, 
    fieldsets=[FieldSet(name="default", fields=["text"])], 
    rank_profiles=[
        RankProfile(name="bm25", inherits="default", first_phase="bm25(text)"), 
        RankProfile(name="nativeRank", inherits="default", first_phase="nativeRank(text)")]
)

Schema to hold sentence information

The document of the sentence schema will inherit the fields defined in the context document to avoid unnecessary duplication of the same field types. In addition, we add the sentence_embedding field, defined to hold a one-dimensional tensor of floats of size 512. We will store the field as an attribute in memory and build an ANN index using the HNSW (hierarchical navigable small world) algorithm. Read this blog post to learn more about Vespa’s journey to implement ANN search, and the documentation for more information about the HNSW parameters.

from vespa.package import HNSW

sentence_document = Document(
    inherits="context", 
    fields=[
        Field(
            name="sentence_embedding", 
            type="tensor<float>(x[512])", 
            indexing=["attribute", "index"], 
            ann=HNSW(
                distance_metric="euclidean", 
                max_links_per_node=16, 
                neighbors_to_explore_at_insert=500
            )
        )
    ]
)

For the sentence schema, we define three rank profiles. The semantic-similarity uses the Vespa closeness ranking feature, which is defined as 1/(1 + distance) so that sentences with embeddings closer to the question embedding will be ranked higher than sentences that are far apart. The bm25 is an example of a term-based rank profile, and bm25-semantic-similarity combines both term-based and semantic-based signals as an example of a hybrid approach.

sentence_schema = Schema(
    name="sentence", 
    document=sentence_document, 
    fieldsets=[FieldSet(name="default", fields=["text"])], 
    rank_profiles=[
        RankProfile(
            name="semantic-similarity", 
            inherits="default", 
            first_phase="closeness(sentence_embedding)"
        ),
        RankProfile(
            name="bm25", 
            inherits="default", 
            first_phase="bm25(text)"
        ),
        RankProfile(
            name="bm25-semantic-similarity", 
            inherits="default", 
            first_phase="bm25(text) + closeness(sentence_embedding)"
        )
    ]
)

Build the application package

We can now define our qa application by creating an application package with both the context_schema and the sentence_schema that we defined above. In addition, we need to inform Vespa that we plan to send a query ranking feature named query_embedding with the same type that we used to define the sentence_embedding field.

from vespa.package import ApplicationPackage, QueryProfile, QueryProfileType, QueryTypeField

app_package = ApplicationPackage(
    name="qa", 
    schema=[context_schema, sentence_schema], 
    query_profile=QueryProfile(),
    query_profile_type=QueryProfileType(
        fields=[
            QueryTypeField(
                name="ranking.features.query(query_embedding)", 
                type="tensor<float>(x[512])"
            )
        ]
    )
)

Deploy the application

We can deploy the app_package in a Docker container (or to Vespa Cloud):

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

Feed the data

Once deployed, we can use the Vespa instance app to interact with the application. We can start by feeding context and sentence data.

Feeding the sentence data takes about 20 minutes:

for idx, sentence in enumerate(sentence_data):
    result = app.feed_data_point(schema="sentence", data_id=idx, fields=sentence)

The context data takes about 5 minutes to feed:

for context in context_data:
    result = app.feed_data_point(schema="context", data_id=context["context_id"], fields=context)

Sentence level retrieval

The query below sends the first question embedding (questions.loc[0, "embedding"]) through the ranking.features.query(query_embedding) parameter and uses the nearestNeighbor search operator to retrieve the 100 closest sentences in embedding space, using Euclidean distance as configured in the HNSW settings. The sentences returned will be ranked by the semantic-similarity rank profile defined in the sentence schema.

result = app.query(body={
  'yql': 'select * from sources sentence where ({targetNumHits:100}nearestNeighbor(sentence_embedding,query_embedding))',
  'hits': 100,
  'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
  'ranking.profile': 'semantic-similarity' 
})
{
    'id': 'id:sentence:sentence::2',
    'relevance': 0.5540203635649571,
    'source': 'qa_content',
    'fields': {
        'sddocname': 'sentence',
        'documentid': 'id:sentence:sentence::2',
        'questions': [0],
        'dataset': 'squad',
        'context_id': 0,
        'text': 'It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.'
    }
}

Sentence level hybrid retrieval

In addition to sending the query embedding, we can send the question string (questions.loc[0, "question"]) via the query parameter and use the or operator to retrieve documents that satisfy either the semantic operator nearestNeighbor or the term-based operator userQuery. Setting type to any means that the term-based operator will retrieve all the documents that match at least one query token. The retrieved documents will be ranked by the hybrid rank-profile bm25-semantic-similarity.

result = app.query(body={
  'yql': 'select * from sources sentence  where ({targetNumHits:100}nearestNeighbor(sentence_embedding,query_embedding)) or userQuery()',
  'query': questions.loc[0, "question"],
  'type': 'any',
  'hits': 100,
  'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
  'ranking.profile': 'bm25-semantic-similarity'
})
{
    'id': 'id:sentence:sentence::2',
    'relevance': 44.46252359752296,
    'source': 'qa_content',
    'fields': {
        'sddocname': 'sentence',
        'documentid': 'id:sentence:sentence::2',
        'questions': [0],
        'dataset': 'squad',
        'context_id': 0,
        'text': 'It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.'
    }
}

Build a basic text search application from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Evaluate search engine experiments using Python.

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to perform search engine experiments.

UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.

Photo by Eugene Golovesov on Unsplash

We show how to use the pyvespa API to run search engine experiments based on the text search app we built in the first part of this tutorial series. Specifically, we compare two different matching operators and show how to reduce the number of documents matched by the queries while keeping similar recall and precision metrics.

We assume that you have followed the first tutorial and have a variable app holding the Vespa connection instance that we established there. This connection should be pointing to a Docker container named cord19 running the Vespa application.

Feed additional data points

We will continue to use the CORD19 sample data that fed the search app in the first tutorial. In addition, we are going to feed a few additional data points to make it possible to get relevant metrics from our experiments. We tried to minimize the amount of data required to make this tutorial easy to reproduce. You can download the additional 494 data points below:

from pandas import read_csv

parsed_feed = read_csv("https://data.vespa.oath.cloud/blog/cord19/parsed_feed_additional.csv")
parsed_feed.head(5)

Feed data

We can then feed the data we just downloaded to the app via the feed_data_point method:

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )

Define query models to compare

A QueryModel is an abstraction that encapsulates all the relevant information controlling how your app matches and ranks documents. Since we are dealing with a simple text search app here, we will start by creating two query models that use BM25 to rank but differ in how they match documents.

from learntorank.query import QueryModel, OR, WeakAnd, Ranking

or_bm25 = QueryModel(
    name="or_bm25",
    match_phase=OR(), 
    ranking=Ranking(name="bm25")
)

The first model is named or_bm25 and will match all the documents that share at least one token with the query.

from learntorank.query import WeakAnd

wand_bm25 = QueryModel(
    name="wand_bm25", 
    match_phase=WeakAnd(hits=10), 
    ranking=Ranking(name="bm25")
)

The second model is named wand_bm25 and uses the WeakAnd operator, considered an accelerated OR operator. The next section shows that the WeakAnd operator matches fewer documents without affecting the recall and precision metrics for the case considered here. We also analyze the optimal hits parameter to use for our specific application.

Run experiments

We can define which metrics we want to compute when running our experiments.

from learntorank.evaluation import MatchRatio, Recall, NormalizedDiscountedCumulativeGain

eval_metrics = [
    MatchRatio(), 
    Recall(at=10), 
    NormalizedDiscountedCumulativeGain(at=10)
]

MatchRatio computes the fraction of the document corpus matched by the queries. This metric will be critical when comparing match phase operators such as the OR and the WeakAnd. In addition, we compute Recall and NDCG metrics.

We can download labeled data to perform our experiments and compare query models. In our sample data, we have 50 queries, and each has a relevant document associated with it.

import json, requests

labeled_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/cord19/labeled_data.json").text
)
labeled_data[:3]
[{'query_id': 1,
  'relevant_docs': [{'id': 'kqqantwg', 'score': 2}],
  'query': 'coronavirus origin'},
 {'query_id': 2,
  'relevant_docs': [{'id': '526elsrf', 'score': 2}],
  'query': 'coronavirus response to weather changes'},
 {'query_id': 3,
  'relevant_docs': [{'id': '5jl6ltfj', 'score': 1}],
  'query': 'coronavirus immunity'}]

Evaluate

Once we have labeled data, the evaluation metrics to compute, and the query models we want to compare, we can run experiments with the evaluate method. The cord_uid field of the Vespa application should match the id of the relevant documents.

from learntorank.evaluation import evaluate

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
)
evaluation

The result shows that, on average, we match 67% of our document corpus when using the OR operator and 21% when using the WeakAnd operator. The reduction in matched documents did not affect the recall and the NDCG metrics, which stayed at around 0.84 and 0.40, respectively. The Match Ratio will get even better when we experiment with the hits parameter of the WeakAnd further down in this tutorial.

There are different options available to configure the output of the evaluate method.

Specify summary statistics

The evaluate method returns the mean, the median, and the standard deviation of the metrics by default. We can customize this by specifying the desired aggregators. Below we choose the mean, the max, and the min as an example.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean", "min", "max"]
)
evaluation

Check detailed metrics output

Some of the metrics have intermediate results that might be of interest. For example, the MatchRatio metric requires us to compute the number of matched documents (retrieved_docs) and the number of documents available to be retrieved (docs_available). We can output those intermediate steps by setting detailed_metrics=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
    detailed_metrics=True
)
evaluation

Get per-query results

When debugging the results, it is often helpful to look at the metrics on a per-query basis, which is available by setting per_query=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    per_query=True
)
evaluation.head(5)

Find optimal WeakAnd parameter

We can use the same evaluation framework to find the optimal hits parameter of the WeakAnd operator for this specific application. To do that, we can define a list of query models that only differ by the hits parameter.

wand_models = [QueryModel(
    name="wand_{}_bm25".format(hits), 
    match_phase=WeakAnd(hits=hits), 
    ranking=Ranking(name="bm25")
) for hits in range(1, 11)]

We can then call evaluate as before and show the match ratio and recall for each of the options defined above.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=wand_models, 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
)
evaluation.loc[["match_ratio", "recall_10"], ["wand_{}_bm25".format(hits) for hits in range(1, 11)]]

As expected, we can see that a higher hits parameter implies a higher match ratio. But the recall metric remains the same as long as we pick hits > 3. So, using WeakAnd with hits = 4 is enough for this specific application and dataset, leading to a further reduction in the number of documents matched on average by our queries.
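
To make this finding concrete, here is a sketch of the query model we would carry forward, using the same API as above (the model name is arbitrary):

# The configuration suggested by the sweep: WeakAnd with hits=4,
# ranked by the bm25 rank profile.
best_model = QueryModel(
    name="wand_4_bm25",
    match_phase=WeakAnd(hits=4),
    ranking=Ranking(name="bm25")
)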

Clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Conclusion

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to run search engine experiments via the evaluate method. We used a simple example that compares two different match operators and another that optimizes the parameter of one of those operators. Our key finding is that we can reduce the size of the retrieved set of hits without losing recall and precision by using the WeakAnd instead of the OR match operator.

The following Vespa resources are related to the topics explored by the experiments presented here: