Build a basic text search application from python with Vespa

Thiago Martins

Vespa Data Scientist


Introducing the pyvespa simplified API: build a Vespa application from python with a few lines of code.

UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.

This post will introduce you to the simplified pyvespa API that allows us to build a basic text search application from scratch with just a few lines of python. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.

Photo by Sarah Dorweiler on Unsplash

pyvespa exposes a subset of the Vespa API in python. The library’s primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect and interact with running Vespa applications and evaluate Vespa ranking functions from python. This time, we focus on building and deploying applications from scratch.

Install

The pyvespa simplified API introduced here was released in version 0.2.0. Note the quotes, which keep the shell from treating >= as a redirection:

pip3 install "pyvespa>=0.2.0" learntorank

Define the application

As an example, we will build an application to search through
CORD19 sample data.

Create an application package

The first step is to create a Vespa ApplicationPackage:

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="cord19")

Add fields to the Schema

We can then add fields to the application’s Schema created by default in app_package.

from vespa.package import Field

app_package.schema.add_fields(
    Field(
        name = "cord_uid", 
        type = "string", 
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    ),
    Field(
        name = "abstract", 
        type = "string", 
        indexing = ["index", "summary"], 
        index = "enable-bm25"
    )
)

  • cord_uid will store the cord19 document ids, while title and abstract are self-explanatory.

  • All the fields, in this case, are of type string.

  • Including "index" in the indexing list means that Vespa will create a searchable index for title and abstract. You can read more about which options are available for indexing in the Vespa documentation.

  • Setting index = "enable-bm25" makes Vespa pre-compute quantities to make it fast to compute the bm25 score. We will use BM25 to rank the documents retrieved.
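To make the scoring concrete, here is a rough plain-python sketch of the BM25 formula for a single query term. This is not Vespa's actual code, and Vespa's implementation differs in details; k1 and b are shown with the commonly used defaults 1.2 and 0.75.

```python
import math

# Rough sketch of BM25 for one query term (not Vespa's implementation).
# tf: term frequency in the field; doc_len / avg_doc_len: field length stats;
# n_docs: documents in the corpus; doc_freq: documents containing the term.
def bm25_term(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A term that occurs more often in a field scores higher, all else equal:
print(bm25_term(5, 100, 120, 1000, 50) > bm25_term(1, 100, 120, 1000, 50))  # True
```

The enable-bm25 index setting lets Vespa precompute the per-field statistics this formula needs, so the score is cheap to evaluate at query time.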

Search multiple fields when querying

A Fieldset groups fields together for searching. For example, the default fieldset defined below groups title and abstract together.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)

Define how to rank the documents matched

We can specify how to rank the matched documents by defining a RankProfile. In this case, we defined the bm25 rank profile that combines the BM25 scores computed over the title and abstract fields.

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25", 
        first_phase = "bm25(title) + bm25(abstract)"
    )
)

Deploy your application

We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can locally deploy our app_package using Docker without leaving the notebook,
by creating an instance of VespaDocker,
as shown below:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()

app = vespa_docker.deploy(application_package = app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

app now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.

It is important to know that pyvespa simply provides a convenient API to define Vespa application packages from python. vespa_docker.deploy exports the Vespa configuration files to disk. Going through those files is an excellent way to start learning about Vespa syntax.
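For instance, the exported schema file should look roughly like the following. This is a hand-written sketch of what app_package generates, not the verbatim output:

```
schema cord19 {
    document cord19 {
        field cord_uid type string {
            indexing: attribute | summary
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field abstract type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    fieldset default {
        fields: title, abstract
    }
    rank-profile bm25 {
        first-phase {
            expression: bm25(title) + bm25(abstract)
        }
    }
}
```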

Feed some data

Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame containing 100 rows and the cord_uid, title, and abstract columns required by our schema definition.

from pandas import read_csv

parsed_feed = read_csv(
    "https://thigm85.github.io/data/cord19/parsed_feed_100.csv"
)
    cord_uid   title                                             abstract
0   ug7v899j   Clinical features of culture-proven Mycoplasma…   OBJECTIVE: This retrospective chart review des…
1   02tnwd4m   Nitric oxide: a pro-inflammatory mediator in l…   Inflammatory diseases of the respiratory tract…
2   ejv2xln0   Surfactant protein-D and pulmonary host defense   Surfactant protein-D (SP-D) participates in th…
3   2b73a28n   Role of endothelin-1 in lung disease              Endothelin-1 (ET-1) is a 21 amino acid peptide…
4   9785vg6d   Gene expression in epithelial cells in respons…   Respiratory syncytial virus (RSV) and pneumoni…
…   …          …                                                 …
95  63bos83o   Global Surveillance of Emerging Influenza Viru…   BACKGROUND: Effective influenza surveillance r…
96  hqc7u9w3   Transmission Parameters of the 2001 Foot and M…   Despite intensive ongoing research, key aspect…
97  87zt7lew   Efficient replication of pneumonia virus of mi…   Pneumonia virus of mice (PVM; family Paramyxov…
98  wgxt36jv   Designing and conducting tabletop exercises to…   BACKGROUND: Since 2001, state and local health…
99  qbldmef1   Transcript-level annotation of Affymetrix prob…   BACKGROUND: The wide use of Affymetrix microar…

100 rows × 3 columns

We can then iterate through the DataFrame above and feed each row by using the app.feed_data_point method:

  • The schema name is by default set to be equal to the application name, which is cord19 in this case.

  • When feeding data to Vespa, we must have a unique id for each data point. We will use cord_uid here.

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )

You can also inspect the response to each request if desired.

{'pathId': '/document/v1/cord19/cord19/docid/qbldmef1',
 'id': 'id:cord19:cord19::qbldmef1'}
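For example, treating the response as the JSON document shown above (the exact response object differs across pyvespa versions), a hypothetical helper could verify that the returned id matches the data point we fed:

```python
# Hypothetical helper: a fed document's id has the form
# "id:<namespace>:<schema>::<data_id>", so we can check its suffix.
def feed_succeeded(response_json, expected_uid):
    return response_json.get("id", "").endswith("::" + expected_uid)

example = {
    "pathId": "/document/v1/cord19/cord19/docid/qbldmef1",
    "id": "id:cord19:cord19::qbldmef1",
}
print(feed_succeeded(example, "qbldmef1"))  # True
```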

Query your application

With data fed, we can start to query our text search app. We can use the Vespa Query Language directly by sending the required parameters to the body argument of the app.query method.

query = {
    'yql': 'select * from sources * where userQuery()',
    'query': 'What is the role of endothelin-1',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 3
}
res = app.query(body=query)
res.hits[0]
{'id': 'id:cord19:cord19::2b73a28n',
 'relevance': 20.79338929607865,
 'source': 'cord19_content',
 'fields': {'sddocname': 'cord19',
  'documentid': 'id:cord19:cord19::2b73a28n',
  'cord_uid': '2b73a28n',
  'title': 'Role of endothelin-1 in lung disease',
  'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}
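Since each element of res.hits is a dict shaped like the one above, it is easy to condense the results for inspection. A small hypothetical helper in plain python (not part of pyvespa):

```python
# Summarize hits as (cord_uid, relevance, title) tuples.
def summarize_hits(hits):
    return [
        (h["fields"]["cord_uid"], round(h["relevance"], 2), h["fields"]["title"])
        for h in hits
    ]

hit = {
    "id": "id:cord19:cord19::2b73a28n",
    "relevance": 20.79338929607865,
    "fields": {"cord_uid": "2b73a28n", "title": "Role of endothelin-1 in lung disease"},
}
print(summarize_hits([hit]))  # [('2b73a28n', 20.79, 'Role of endothelin-1 in lung disease')]
```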

We can also define the same query by using the QueryModel abstraction that allows us to specify how we want to match and rank our documents. In this case, we defined that we want to:

  • match our documents using the OR operator, which matches all the documents that share at least one term with the query.
  • rank the matched documents using the bm25 rank profile defined in our application package.

from learntorank.query import QueryModel, OR, Ranking, send_query

res = send_query(
    app=app,
    query="What is the role of endothelin-1", 
    query_model = QueryModel(
        match_phase=OR(), 
        ranking=Ranking(name="bm25")
    )
)
res.hits[0]
{
    'id': 'id:cord19:cord19::2b73a28n',
    'relevance': 20.79338929607865,
    'source': 'cord19_content',
    'fields': {
        'sddocname': 'cord19',
        'documentid': 'id:cord19:cord19::2b73a28n',
        'cord_uid': '2b73a28n',
        'title': 'Role of endothelin-1 in lung disease',
        'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'
    }
}

Using the Vespa Query Language as in our first example gives you the full power and flexibility that Vespa can offer.
In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments,
but this is a future post topic.
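As an illustration of that flexibility, YQL lets us mix free-text matching with structured filters in a single query body. A sketch using the fields from our schema (the filter value here is just an example):

```python
# Restrict the free-text match to a specific document via a YQL filter.
filtered_query = {
    "yql": 'select * from sources * where userQuery() and cord_uid contains "2b73a28n"',
    "query": "endothelin-1",
    "ranking": "bm25",
    "hits": 3,
}
# app.query(body=filtered_query) would then only consider matching documents
# whose cord_uid equals "2b73a28n".
```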

Jump to Build a basic text search application from python with Vespa: Part 2
or clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Build a basic text search application from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Evaluate search engine experiments using Python.

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to perform search engine experiments.

UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.

Photo by Eugene Golovesov on Unsplash

We show how to use the pyvespa API to run search engine experiments based on the text search app we built in the first part of this tutorial series. Specifically, we compare two different matching operators and show how to reduce the number of documents matched by the queries while keeping similar recall and precision metrics.

We assume that you have followed the first tutorial and have a variable app holding the Vespa connection instance that we established there. This connection should be pointing to a Docker container named cord19 running the Vespa application.

Feed additional data points

We will continue to use the CORD19 sample data that fed the search app in the first tutorial. In addition, we are going to feed a few additional data points to make it possible to get relevant metrics from our experiments. We tried to minimize the amount of data required to make this tutorial easy to reproduce. You can download the additional 494 data points below:

from pandas import read_csv

parsed_feed = read_csv("https://data.vespa.oath.cloud/blog/cord19/parsed_feed_additional.csv")
parsed_feed.head(5)

Feed data

We can then feed the data we just downloaded to the app via the feed_data_point method:

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )

Define query models to compare

A QueryModel is an abstraction that encapsulates all the relevant information controlling how your app matches and ranks documents. Since we are dealing with a simple text search app here, we will start by creating two query models that use BM25 to rank but differ on how they match documents.

from learntorank.query import QueryModel, OR, WeakAnd, Ranking

or_bm25 = QueryModel(
    name="or_bm25",
    match_phase=OR(), 
    ranking=Ranking(name="bm25")
)

The first model is named or_bm25 and will match all the documents that share at least one token with the query.

wand_bm25 = QueryModel(
    name="wand_bm25", 
    match_phase=WeakAnd(hits=10), 
    ranking=Ranking(name="bm25")
)

The second model is named wand_bm25 and uses the WeakAnd operator, considered an accelerated OR operator. The next section shows that the WeakAnd operator matches fewer documents without affecting the recall and precision metrics for the case considered here. We also analyze the optimal hits parameter to use for our specific application.

Run experiments

We can define which metrics we want to compute when running our experiments.

from learntorank.evaluation import MatchRatio, Recall, NormalizedDiscountedCumulativeGain

eval_metrics = [
    MatchRatio(), 
    Recall(at=10), 
    NormalizedDiscountedCumulativeGain(at=10)
]

MatchRatio computes the fraction of the document corpus matched by the queries. This metric will be critical when comparing match phase operators such as the OR and the WeakAnd. In addition, we compute Recall and NDCG metrics.
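To make the metrics concrete, here is recall@k sketched in plain python. The evaluation framework computes this for us; the sketch is shown only for intuition:

```python
# Fraction of the relevant documents found among the top-k retrieved ids.
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    found = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)

print(recall_at_k(["a", "b", "c"], ["b", "x"]))  # 0.5
```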

We can download labeled data to perform our experiments and compare query models. In our sample data, we have 50 queries, and each has a relevant document associated with them.

import json, requests

labeled_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/cord19/labeled_data.json").text
)
labeled_data[:3]
[{'query_id': 1,
  'relevant_docs': [{'id': 'kqqantwg', 'score': 2}],
  'query': 'coronavirus origin'},
 {'query_id': 2,
  'relevant_docs': [{'id': '526elsrf', 'score': 2}],
  'query': 'coronavirus response to weather changes'},
 {'query_id': 3,
  'relevant_docs': [{'id': '5jl6ltfj', 'score': 1}],
  'query': 'coronavirus immunity'}]

Evaluate

Once we have labeled data, the evaluation metrics to compute, and the query models we want to compare, we can run experiments with the evaluate method. The cord_uid field of the Vespa application should match the id of the relevant documents.

from learntorank.evaluation import evaluate

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
)
evaluation

The result shows that, on average, we match 67% of our document corpus when using the OR operator and 21% when using the WeakAnd operator. The reduction in matched documents did not affect the recall and the NDCG metrics, which stayed at around 0.84 and 0.40, respectively. The Match Ratio will get even better when we experiment with the hits parameter of the WeakAnd further down in this tutorial.

There are different options available to configure the output of the evaluate method.

Specify summary statistics

The evaluate method returns the mean, the median, and the standard deviation of the metrics by default. We can customize this by specifying the desired aggregators. Below we choose the mean, the max, and the min as an example.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean", "min", "max"]
)
evaluation

Check detailed metrics output

Some of the metrics have intermediate results that might be of interest. For example, the MatchRatio metric requires us to compute the number of matched documents (retrieved_docs) and the number of documents available to be retrieved (docs_available). We can output those intermediate steps by setting detailed_metrics=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
    detailed_metrics=True
)
evaluation

Get per-query results

When debugging the results, it is often helpful to look at the metrics on a per-query basis, which is available by setting per_query=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    per_query=True
)
evaluation.head(5)

Find optimal WeakAnd parameter

We can use the same evaluation framework to find the optimal hits parameter of the WeakAnd operator for this specific application. To do that, we can define a list of query models that only differ by the hits parameter.

wand_models = [
    QueryModel(
        name="wand_{}_bm25".format(hits),
        match_phase=WeakAnd(hits=hits),
        ranking=Ranking(name="bm25"),
    )
    for hits in range(1, 11)
]

We can then call evaluate as before and show the match ratio and recall for each of the options defined above.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=wand_models, 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
)
evaluation.loc[["match_ratio", "recall_10"], ["wand_{}_bm25".format(hits) for hits in range(1, 11)]]

As expected, we can see that a higher hits parameter implies a higher match ratio. But the recall metric remains the same as long as we pick hits > 3. So, using WeakAnd with hits = 4 is enough for this specific application and dataset, leading to a further reduction in the number of documents matched on average by our queries.
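Programmatically, picking that operating point amounts to taking the smallest hits value whose recall already matches the best observed recall. A plain-python sketch with illustrative recall values (not the real experiment's numbers):

```python
# (model_name, recall@10) pairs with made-up recall values for illustration.
recall = list(zip(
    ["wand_{}_bm25".format(h) for h in range(1, 7)],
    [0.62, 0.75, 0.82, 0.84, 0.84, 0.84],
))
best = max(r for _, r in recall)
# First (i.e. smallest-hits) model reaching the best recall:
optimal = next(name for name, r in recall if r == best)
print(optimal)  # wand_4_bm25
```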

Clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Conclusion

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to run search engine experiments via the evaluate method. We used a simple example that compares two different match operators and another that optimizes the parameter of one of those operators. Our key finding is that we can reduce the size of the retrieved set of hits without losing recall and precision by using the WeakAnd instead of the OR match operator.

The following Vespa resources are related to the topics explored by the experiments presented here:

Basic HTTP testing of Vespa applications

Jon M Venstad

Principal Vespa Engineer

Håkon Hallingstad

Principal Vespa Engineer


HTTP interfaces are the bread and butter for interacting with a Vespa application.
A typical system test of a Vespa application consists of a sequence of
HTTP requests, and corresponding assertions on the HTTP responses.

The latest addition to the Vespa CLI
is the test command, which makes it easy to develop and run basic HTTP tests,
expressed in JSON format.
As with the document and query commands, endpoint discovery and authentication are
handled by the CLI, leaving developers free to focus on the tests themselves.

Basic HTTP tests are also supported by the CD framework of Vespa Cloud,
allowing applications to be safely, and easily, deployed to production.

Developing and running tests

To get started with Vespa’s basic HTTP tests:

  • Install and configure Vespa CLI
  • Clone the album-recommendation sample app
    vespa clone vespa-cloud/album-recommendation myapp
  • Configure and deploy the application, locally or to the cloud
    vespa deploy --wait 600
  • Run the system tests, or staging setup and tests
    vespa test src/test/application/system-test
  • To enter production in Vespa Cloud, modify the tests, and then
    vespa prod submit
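A test file is a JSON document describing requests and the expected responses; it has roughly this shape (a sketch — see the reference documentation below for the authoritative format):

```json
{
  "name": "basic query",
  "steps": [
    {
      "request": {
        "uri": "/search/?query=metallica"
      },
      "response": {
        "code": 200,
        "body": {
          "root": {
            "fields": {
              "totalCount": 1
            }
          }
        }
      }
    }
  ]
}
```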

For more information, see the reference documentation:
Basic HTTP Testing.