In the August Vespa product update, we mentioned BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
This month, we’re excited to share the following updates with you:
Tensor Float Support
Tensors now support float cell values, for example tensor<float>(key{}, x[100]). Using the 32-bit float type cuts memory footprint in half compared to the 64-bit double, and can increase ranking performance by up to 30%. Vespa’s TensorFlow and ONNX integration now converts to float tensors for higher performance. Read more.
Reduced Memory Use for Text Attributes
Attributes in Vespa are fields stored in columnar form in memory for access during ranking and grouping. From Vespa 7.102, the enum store used to hold attribute data uses a set of smaller buffers instead of one large buffer. This typically cuts static memory usage by 5%, but more importantly reduces peak memory usage (during background compaction) by 30%.
Prometheus Monitoring Support
Integrating with the Prometheus open-source monitoring solution is now easy to do using the new interface to Vespa metrics. Read more.
Query Dispatch Integrated in Container
The Vespa query flow is optimized for multi-phase evaluation over a large set of search nodes. Since Vespa 7.109.10, the dispatch function is integrated into the Vespa Container process, which simplifies the architecture with one less service to manage. Read more.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
Vespa.ai has just published two tutorials to help people get started with text search applications by building scalable solutions with Vespa. The tutorials are based on the full document ranking task released by Microsoft’s MS MARCO dataset team.
The first tutorial helps you create and deploy a basic text search application with Vespa, as well as download, parse, and feed the dataset to a running Vespa instance. It also shows how easy it is to experiment with ranking functions based on built-in ranking features available in Vespa.
The second tutorial shows how to create a training dataset containing Vespa ranking features that allow you to start training ML models to improve the app’s ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context.
Both tutorials are detailed and come with code available to reproduce the steps. Here are the highlights.
Basic text search app in a nutshell
The main task when creating a basic app with Vespa is to write a search definition file containing information about the data you want to feed to the application and how Vespa should match and order the results returned in response to a query.
Apart from some additional details described in the tutorial, the search definition for our text search engine looks like the code snippet below. We have a title and body field containing information about the documents available to be searched. The fieldset keyword indicates that our query will match documents by searching query words in both the title and body fields. Finally, we have defined two rank-profiles, which control how the matched documents will be ranked. The default rank-profile uses nativeRank, which is one of many built-in rank features available in Vespa. The bm25 rank-profile uses the widely known BM25 rank feature.
search msmarco {
    document msmarco {
        field title type string {
            indexing: index | summary
        }
        field body type string {
            indexing: index | summary
        }
    }
    fieldset default {
        fields: title, body
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body)
        }
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body)
        }
    }
}
When we have more than one rank-profile defined, we can choose which one to use at query time by including the ranking parameter in the query:
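The query examples themselves were not included above. As a sketch, assuming a local Vespa instance listening on port 8080 and the simple query syntax, the two requests could look like this:

http://localhost:8080/search/?query=what+is+dad+bod
http://localhost:8080/search/?query=what+is+dad+bod&ranking=bm25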
The first query above does not specify the ranking parameter and will therefore use the default rank-profile. The second query explicitly asks for the bm25 rank-profile to be used instead.
Having multiple rank-profiles allows us to experiment with different ranking functions. There is one relevant document for each query in the MSMARCO dataset. The figure below is the result of an evaluation script that sent more than 5,000 queries to our application and asked for results using both rank-profiles described above. We then tracked the position of the relevant document for each query and plotted the distribution for the first 10 positions.
It is clear that the bm25 rank-profile does a much better job in this case. It places the relevant document in the first positions much more often than the default rank-profile.
Data collection sanity check
After setting up a basic application, we likely want to collect rank feature data to help improve our ranking functions. Vespa allows us to return rank features along with query results, which enables us to create training datasets that combine relevance information with search engine rank information.
There are different ways to create a training dataset in this case, so we believe it is a good idea to establish a sanity check before we start to collect the dataset. The goal of such a sanity check is to increase the likelihood that we catch bugs early and create datasets containing the right information for our task of improving ranking functions.
Our proposal is to use the dataset to train a model with the same features and functional form used by the baseline you want to improve upon. If the dataset is well built and contains useful information about the task you are interested in, you should be able to get results at least as good as those obtained by your baseline on a separate test set.
Since our baseline in this case is the bm25 rank-profile, we should fit a linear model containing only the bm25 features:
a + b * bm25(title) + c * bm25(body)
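As an illustration, here is a minimal sketch of such a sanity-check fit with scikit-learn. The file name and column names are hypothetical placeholders for whatever your data collection code produces:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per (query, document) pair collected from Vespa (hypothetical file and columns).
df = pd.read_csv("collected_features.csv")

X = df[["bm25(title)", "bm25(body)"]]  # the same features used by the bm25 baseline
y = df["relevant"]                     # binary relevance label

# Same functional form as the baseline: a + b * bm25(title) + c * bm25(body)
model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)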
Having this simple procedure in place helped us catch a few silly bugs in our data collection code and got us on the right track faster than would otherwise have been possible. Bugs in your data are hard to catch once you begin experimenting with complex models, as you never know whether a problem comes from the data or from the model. So this is a practice we highly recommend.
How to create a training dataset with Vespa
Asking Vespa to return ranking features in the result set is as simple as setting the ranking.listFeatures parameter to true in the request. Below is the body of a POST request that specifies the query in YQL format and enables rank feature dumping.
body = {
    "yql": 'select * from sources * where (userInput(@userQuery));',
    "userQuery": "what is dad bod",
    "ranking": {"profile": "bm25", "listFeatures": "true"},
}
Vespa returns a number of ranking features by default, but we can explicitly define which features we want by creating a rank-profile, asking it to ignore-default-rank-features, and listing the features we want with the rank-features keyword, as shown below. The random first phase will be used when sampling random documents to serve as a proxy for non-relevant documents.
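The rank-profile itself is not reproduced above. A sketch of what it could look like, with the profile name (collect_rank_features) and the listed features chosen as assumptions for this example:

rank-profile collect_rank_features inherits default {
    first-phase {
        expression: random
    }
    ignore-default-rank-features
    rank-features {
        bm25(title)
        bm25(body)
        nativeRank(title)
        nativeRank(body)
    }
}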
We want a dataset that will help train models that will generalize well when running on a Vespa instance. This implies that we are only interested in collecting documents that are matched by the query because those are the documents that would be presented to the first-phase model in a production environment. Here is the data collection logic:
hits = get_relevant_hit(query, rank_profile, relevant_id)
if hits:  # the relevant document was matched by the query
    hits.extend(get_random_hits(query, rank_profile, n_samples))
    data = annotate_data(hits, query_id, relevant_id)
    append_data(file, data)
For each query, we first send a request to Vespa to get the relevant document associated with the query. If the relevant document is matched by the query, Vespa will return it and we will expand the number of documents associated with the query by sending a second request to Vespa. The second request asks Vespa to return a number of random documents sampled from the set of documents that were matched by the query.
We then parse the hits returned by Vespa and organize the data into a tabular form containing the rank features and a binary variable indicating whether the query-document pair is relevant or not, giving us a tabular dataset at the end. More details can be found in our second tutorial.
Beyond pointwise loss functions
The most straightforward way to train the linear model suggested in our data collection sanity check would be to use a vanilla logistic regression, since our target variable relevant is binary. The most commonly used loss function in this case (binary cross-entropy) is referred to as a pointwise loss function in the LTR literature, as it does not take the relative order of documents into account.
However, as we described in our first tutorial, the metric that we want to optimize in this case is the Mean Reciprocal Rank (MRR). MRR is affected by the relative order of the relevance scores we assign to the list of documents generated by a query, not by their absolute magnitudes. This disconnect between the characteristics of the loss function and the metric of interest might lead to suboptimal results.
For ranking search results, it is preferable to use a listwise loss function when training our model, which takes the entire ranked list into consideration when updating the model parameters. To illustrate this, we trained linear models using the TF-Ranking framework. The framework is built on top of TensorFlow and allows us to specify pointwise, pairwise, and listwise loss functions, among other things.
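To make the distinction concrete, here is a small numeric sketch (not tied to TF-Ranking) that contrasts a pointwise binary cross-entropy with a listwise softmax cross-entropy computed over one query's candidate list:

import numpy as np

def pointwise_bce(scores, relevance):
    # Binary cross-entropy: each (query, document) pair is treated independently.
    probs = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(relevance * np.log(probs) + (1 - relevance) * np.log(1 - probs))

def listwise_softmax(scores, relevance):
    # Softmax cross-entropy over one query's document list: only the scores
    # relative to each other matter, as in listwise LTR losses.
    log_softmax = scores - np.log(np.sum(np.exp(scores)))
    return -np.sum(relevance * log_softmax)

# One query with five candidate documents; the second document is the relevant one.
scores = np.array([2.0, 1.5, 0.3, -0.2, -1.0])
relevance = np.array([0, 1, 0, 0, 0])

print(pointwise_bce(scores, relevance))    # changes if all scores are shifted by a constant
print(listwise_softmax(scores, relevance)) # invariant to shifting all scores by a constant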
We made available the script that we used to train the two models that generated the results displayed in the figure below. The script uses simple linear models but can be useful as a starting point to build more complex ones.
Overall, on average, there is not much difference between these models with respect to MRR, which was expected given the simplicity of the models described here. However, we can see that the model based on a listwise loss function places more documents in the first two positions of the ranked list compared to the pointwise model. We expect the difference in MRR between pointwise and listwise loss functions to increase as we move on to more complex models.
The main goal here was simply to show the importance of choosing better loss functions when dealing with LTR tasks and to give a quick start for those who want to try it in their own Vespa applications. Now it is up to you: check out the tutorials, build something, and let us know how it went. Feedback is welcome!
Introducing the pyvespa simplified API: build a Vespa application from Python with a few lines of code.
UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.
This post will introduce you to the simplified pyvespa API that allows us to build a basic text search application from scratch with just a few lines of Python code. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.
pyvespa exposes a subset of the Vespa API in Python. The library’s primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect and interact with running Vespa applications and evaluate Vespa ranking functions from Python. This time, we focus on building and deploying applications from scratch.
Install
The pyvespa simplified API introduced here was released in version 0.2.0:

pip3 install "pyvespa>=0.2.0" learntorank
Define the application
As an example, we will build an application to search through CORD19 sample data.
Create an application package
The first step is to create a Vespa ApplicationPackage:
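The notebook code is not reproduced above. A minimal sketch with pyvespa, assuming the application is named cord19, could look like this:

from vespa.package import ApplicationPackage, Field

app_package = ApplicationPackage(name="cord19")

# Add the fields described below to the application's schema.
app_package.schema.add_fields(
    Field(name="cord_uid", type="string", indexing=["attribute", "summary"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25"),
)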
cord_uid will store the cord19 document ids, while title and abstract are self explanatory.
All the fields, in this case, are of type string.
Including "index" in the indexing list means that Vespa will create a searchable index for title and abstract. You can read more about which options is available for indexing in the Vespa documentation.
Setting index = "enable-bm25" makes Vespa pre-compute quantities to make it fast to compute the bm25 score. We will use BM25 to rank the documents retrieved.
Search multiple fields when querying
A Fieldset groups fields together for searching. For example, the default fieldset defined below groups title and abstract together.
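As a sketch, the fieldset could be added to the application package like this:

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "abstract"])
)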
We can specify how to rank the matched documents by defining a RankProfile. In this case, we define a bm25 rank profile that combines the BM25 scores computed over the title and abstract fields.
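A sketch of such a rank profile defined with pyvespa:

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(name="bm25", first_phase="bm25(title) + bm25(abstract)")
)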
We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can deploy our app_package locally using Docker, without leaving the notebook, by creating an instance of VespaDocker, as shown below:
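A minimal deployment sketch (the exact import path may differ between pyvespa versions):

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)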
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.
app now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.
It is important to know that pyvespa simply provides a convenient API to define Vespa application packages from Python. vespa_docker.deploy exports Vespa configuration files to disk, and going through those files is an excellent way to start learning about Vespa syntax.
Feed some data
Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame containing 100 rows and the cord_uid, title, and abstract columns required by our schema definition.
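The feeding code is not shown above. A sketch of how it could look, assuming the DataFrame is named parsed_feed (a hypothetical name):

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"]),
    }
    response = app.feed_data_point(
        schema="cord19",
        data_id=str(row["cord_uid"]),
        fields=fields,
    )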
With data fed, we can start to query our text search app. We can use the Vespa Query language directly by sending the required parameters to the body argument of the app.query method.
query = {
    'yql': 'select * from sources * where userQuery()',
    'query': 'What is the role of endothelin-1',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 3
}
res = app.query(body=query)
res.hits[0]
{'id': 'id:cord19:cord19::2b73a28n',
'relevance': 20.79338929607865,
'source': 'cord19_content',
'fields': {'sddocname': 'cord19',
'documentid': 'id:cord19:cord19::2b73a28n',
'cord_uid': '2b73a28n',
'title': 'Role of endothelin-1 in lung disease',
'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}
We can also define the same query by using the QueryModel abstraction, which allows us to specify how we want to match and rank our documents. In this case, we defined that we want to:
match our documents using the OR operator, which matches all the documents that share at least one term with the query.
rank the matched documents using the bm25 rank profile defined in our application package.
from learntorank.query import QueryModel, OR, Ranking, send_query

res = send_query(
    app=app,
    query="What is the role of endothelin-1",
    query_model=QueryModel(match_phase=OR(), ranking=Ranking(name="bm25")),
)
res.hits[0]
{
'id': 'id:cord19:cord19::2b73a28n',
'relevance': 20.79338929607865,
'source': 'cord19_content',
'fields': {
'sddocname': 'cord19',
'documentid': 'id:cord19:cord19::2b73a28n',
'cord_uid': '2b73a28n',
'title': 'Role of endothelin-1 in lung disease',
'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'
}
}
Using the Vespa Query Language as in our first example gives you the full power and flexibility that Vespa can offer. In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments, but that is a topic for a future post.
Jump to Build a basic text search application from Python with Vespa: Part 2, or clean up:
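A minimal cleanup sketch, assuming the vespa_docker instance created above:

vespa_docker.container.stop()
vespa_docker.container.remove()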
We want to enable Vespa users to run their experiments from Python. This tutorial illustrates how to define query models and evaluation metrics to perform search engine experiments.
UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of pyvespa and learntorank.
We show how to use the pyvespa API to run search engine experiments based on the text search app we built in the first part of this tutorial series. Specifically, we compare two different matching operators and show how to reduce the number of documents matched by the queries while keeping similar recall and precision metrics.
We assume that you have followed the first tutorial and have a variable app holding the Vespa connection instance that we established there. This connection should be pointing to a Docker container named cord19 running the Vespa application.
Feed additional data points
We will continue to use the CORD19 sample data that fed the search app in the first tutorial. In addition, we are going to feed a few additional data points to make it possible to get relevant metrics from our experiments. We tried to minimize the amount of data required to make this tutorial easy to reproduce. You can download the additional 494 data points below:
A QueryModel is an abstraction that encapsulates all the relevant information controlling how your app matches and ranks documents. Since we are dealing with a simple text search app here, we will start by creating two query models that use BM25 to rank but differ in how they match documents.
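A sketch of how these two query models could be defined with learntorank; the model names and the WeakAnd hits value are assumptions for this example:

from learntorank.query import QueryModel, OR, WeakAnd, Ranking

or_bm25 = QueryModel(
    name="or_bm25",
    match_phase=OR(),
    ranking=Ranking(name="bm25"),
)

wand_bm25 = QueryModel(
    name="wand_bm25",
    match_phase=WeakAnd(hits=10),
    ranking=Ranking(name="bm25"),
)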
The second model is named wand_bm25 and uses the WeakAnd operator, considered an accelerated OR operator. The next section shows that the WeakAnd operator matches fewer documents without affecting the recall and precision metrics for the case considered here. We also analyze the optimal hits parameter to use for our specific application.
Run experiments
We can define which metrics we want to compute when running our experiments.
MatchRatio computes the fraction of the document corpus matched by the queries. This metric will be critical when comparing match phase operators such as the OR and the WeakAnd. In addition, we compute Recall and NDCG metrics.
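A sketch of how these metrics could be declared with learntorank (class and parameter names given as assumptions):

from learntorank.evaluation import (
    MatchRatio,
    Recall,
    NormalizedDiscountedCumulativeGain,
)

eval_metrics = [
    MatchRatio(),
    Recall(at=10),
    NormalizedDiscountedCumulativeGain(at=10),
]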
We can download labeled data to perform our experiments and compare query models. In our sample data, we have 50 queries, each with a relevant document associated with it.
Once we have labeled data, the evaluation metrics to compute, and the query models we want to compare, we can run experiments with the evaluate method. The cord_uid field of the Vespa application should match the id of the relevant documents.
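As a sketch, and assuming the evaluate signature shown here (including the labeled_data variable holding the 50 labeled queries), the call could look like:

from learntorank.evaluation import evaluate

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=eval_metrics,
    query_model=[or_bm25, wand_bm25],
    id_field="cord_uid",
)
evaluation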
The result shows that, on average, we match 67% of our document corpus when using the OR operator and 21% when using the WeakAnd operator. The reduction in matched documents did not affect the recall and the NDCG metrics, which stayed at around 0.84 and 0.40, respectively. The Match Ratio will get even better when we experiment with the hits parameter of the WeakAnd further down in this tutorial.
There are different options available to configure the output of the evaluate method.
Specify summary statistics
The evaluate method returns the mean, the median, and the standard deviation of the metrics by default. We can customize this by specifying the desired aggregators. Below we choose the mean, the max, and the min as an example.
Some of the metrics have intermediate results that might be of interest. For example, the MatchRatio metric requires us to compute the number of matched documents (retrieved_docs) and the number of documents available to be retrieved (docs_available). We can output those intermediate steps by setting detailed_metrics=True.
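As a sketch, both options could be passed to the same call; apart from detailed_metrics, the argument names are assumptions:

evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=eval_metrics,
    query_model=[or_bm25, wand_bm25],
    id_field="cord_uid",
    aggregators=["mean", "max", "min"],
    detailed_metrics=True,
)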
We can use the same evaluation framework to find the optimal hits parameter of the WeakAnd operator for this specific application. To do that, we can define a list of query models that only differ by the hits parameter.
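A sketch of such a list, with the range of hits values chosen as an assumption:

wand_models = [
    QueryModel(
        name="wand_{}_bm25".format(hits),
        match_phase=WeakAnd(hits=hits),
        ranking=Ranking(name="bm25"),
    )
    for hits in range(1, 11)
]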
As expected, we can see that a higher hits parameter implies a higher match ratio. But the recall metric remains the same as long as we pick hits > 3. So, using WeakAnd with hits = 4 is enough for this specific application and dataset, leading to a further reduction in the number of documents matched on average by our queries.
We want to enable Vespa users to run their experiments from Python. This tutorial illustrates how to define query models and evaluation metrics to run search engine experiments via the evaluate method. We used a simple example that compares two different match operators and another that optimizes the parameter of one of those operators. Our key finding is that we can reduce the size of the retrieved set of hits without losing recall and precision by using the WeakAnd instead of the OR match operator.
The following Vespa resources are related to the topics explored by the experiments presented here:
UPDATE 2023-06-06: use new syntax to configure Bert embedder.
Embeddings are the basis for modern semantic search and neural ranking, so the first step in developing such features is to convert your document and query text to embeddings. Once you have the embeddings, Vespa.ai makes it easy to use them efficiently to find neighbors or evaluate machine-learned models, but you’ve had to create them either on the client side or by writing your own Java component. Now, we’re providing this building block as part of the platform as well.
On Vespa 8.54.61 or higher, simply add this to your services.xml file under <container>:
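The configuration snippet is not reproduced above. A sketch of what it could look like, with the model file paths as assumptions (the component id bert matches the embed expressions used below):

<component id="bert" type="bert-embedder">
    <transformer-model path="models/msmarco-MiniLM-L-6-v3.onnx"/>
    <tokenizer-vocab path="models/vocab.txt"/>
</component>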
The model files here can be any BERT-style model and vocabulary; we recommend this one: huggingface.co/sentence-transformers/msmarco-MiniLM-L-6-v3.
With this deployed, you can automatically convert query text to an embedding by writing embed(bert, "my text") where you would otherwise supply an embedding tensor. For example:
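The query example itself is not included above. As a sketch, assuming a document type named doc, the myEmbedding field from the schema snippet below, and a query tensor named e declared as a rank profile input, the request parameters could look like:

yql=select * from doc where {targetHits: 10}nearestNeighbor(myEmbedding, e)
input.query(e)=embed(bert, "my text")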
And to create an embedding from a document field you can add
field myEmbedding type tensor(x[384]) {
    indexing: input myTextField | embed bert
}
to your schema outside the document block.
Semantic search sample application
To get you started, we have created a complete and minimal sample application using this: simple-semantic-search.
Further reading
This should make it easy to get started with embeddings. If you want to dig deeper into the topic, be sure to check out this blog post series on using pretrained transformer models for search, and this one on efficiency in combining vector search with filters.