Blog search application in Vespa
Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore. Please refer to the News search and recommendation tutorial for an updated version of text and sample applications.
Introduction
This is the first of a series of blog posts where data from WordPress.com (WP) is used to highlight how Vespa can be used to store, search and recommend blog posts. The data was made available during a Kaggle challenge to predict which blog posts someone would like based on their past behavior. It contains many ingredients that are necessary to showcase needs, challenges and possible solutions that are useful for those interested in building and deploying such applications in production.
The end goal is to build an application where:
- Users will be able to search and manipulate the pool of blog posts available.
- Users will get blog post recommendations from the content pool based on their interest.
This part addresses:
- How to describe the dataset used as well as any information connected to the data.
- How to set up a basic blog post search engine using Vespa.
The next parts show how to extend this basic search engine application with machine learned models to create a blog recommendation engine.
Dataset
The dataset contains blog posts written by WP bloggers and actions, in this case ‘likes’, performed by WP readers on blog posts they have interacted with. The dataset is publicly available at Kaggle and was released during a challenge to develop algorithms that predict which blog posts users would most likely ‘like’ if they were exposed to them. The data includes these fields per blog post:
- post_id – unique numerical id identifying the blog post
- date_gmt – string representing date of blog post creation in GMT format yyyy-mm-dd hh:mm:ss
- author – unique numerical id identifying the author of the blog post
- url – blog post URL
- title – blog post title
- blog – unique numerical id identifying the blog that the blog post belongs to
- tags – array of strings representing the tags of the blog posts
- content – body text of the blog post, in html format
- categories – array of strings representing the categories the blog post was assigned to
For the user actions:
- post_id – unique numerical id identifying the blog post
- uid – unique numerical id identifying the user that liked post_id
- dt – date of the interaction in GMT format yyyy-mm-dd hh:mm:ss
Downloading raw data
For the purposes of this post, it is sufficient to use the first release of training data that consists of 5 weeks of posts as well as all the ‘like’ actions that occurred during those 5 weeks.
This first release of training data is available here – once downloaded, unzip it. The 1,196,111 line trainPosts.json will be our practice document data. This file is around 5GB in size.
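Before converting anything, it can help to look at a few raw records. The sketch below assumes that trainPosts.json holds one JSON object per line (which is what the line count and the later use of head suggest). The helper name iter_posts and the sample records are made up for illustration.

```python
import io
import json
from typing import Iterable, Iterator, Optional


def iter_posts(lines: Iterable[str], limit: Optional[int] = None) -> Iterator[dict]:
    """Yield one post dict per non-empty line, stopping after `limit` posts."""
    count = 0
    for line in lines:
        if limit is not None and count >= limit:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.loads(line)
        count += 1


# Made-up sample shaped like the fields listed above (not real Kaggle data).
sample = io.StringIO(
    '{"post_id": "1", "title": "First post", "tags": ["a", "b"]}\n'
    '{"post_id": "2", "title": "Second post", "tags": []}\n'
)
for post in iter_posts(sample, limit=10):
    print(post["post_id"], post["title"], post["tags"])
```

To inspect the real file, pass an open file handle instead of the in-memory sample, for example `open("trainPosts.json", encoding="utf-8")`. Reading line by line avoids loading the full 5GB into memory.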
Requirements
Indexing the full data set requires 23GB disk space. We have tested with a Docker container with 10GB RAM. We used settings similar to those described in the Vespa quick start guide. As in the guide, we assume that the $VESPA_SAMPLE_APPS env variable points to the directory with your local clone of the Vespa sample apps:
$ docker run -m 10G --detach --name vespa --hostname vespa --privileged --volume $VESPA_SAMPLE_APPS:/vespa-sample-apps --publish 8080:8080 vespaengine/vespa
Searching blog posts
Functional specification:
- Blog post title, content, tags and categories must all be searchable
- Allow blog posts to be sorted by both relevance and date
- Allow grouping of search results by tag or category
In terms of data, Vespa operates with the notion of documents. A document represents a single, searchable item in your system, e.g., a blog post, a photo, or a news article. Each document type must be defined in the Vespa configuration through a search definition. Think of a search definition as being similar to a table definition in a relational database; it consists of a set of fields, each with a given name, a specific type, and some optional properties.
As an example, for this simple blog post search application, we could create the document type blog_post with the following fields:
- url – of type uri
- title – of type string
- content – of type string (string fields can be of any length)
- date_gmt – of type string (to store the creation date in GMT format)
The data fed into Vespa must match the structure of the search definition, and the hits returned when searching will be in this format as well.
Application Packages
A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done and how data will be processed during feeding and indexing. The search definition, e.g., blog_post.sd, is a required part of an application package; the other required files are services.xml and hosts.xml.
The sample application blog search creates a simple but functional blog post search engine. The application package is found in src/main/application.
Services Specification
services.xml defines the services that make up the Vespa application — which services to run and how many nodes per service:
<?xml version='1.0' encoding='UTF-8'?>
<services version='1.0'>
  <container id='default' version='1.0'>
    <search/>
    <document-api/>
    <nodes>
      <node hostalias="node1"/>
    </nodes>
  </container>
  <content id='blog_post' version='1.0'>
    <search>
      <visibility-delay>1.0</visibility-delay>
    </search>
    <redundancy>1</redundancy>
    <documents>
      <document mode="index" type="blog_post"/>
    </documents>
    <nodes>
      <node hostalias="node1"/>
    </nodes>
    <engine>
      <proton>
        <searchable-copies>1</searchable-copies>
      </proton>
    </engine>
  </content>
</services>
- <container> defines the container cluster for document, query and result processing.
- <search> sets up the search endpoint for Vespa queries. The default port is 8080.
- <document-api> sets up the document endpoint for feeding.
- <nodes> defines the nodes required per service. (See the reference for more on container cluster setup.)
- <content> defines how documents are stored and searched.
- <redundancy> denotes how many copies to keep of each document.
- <documents> assigns the document types in the search definition. The content cluster capacity can be increased by adding node elements; see elastic Vespa. (See also the reference for more on content cluster setup.)
- <nodes> defines the hosts for the content cluster.
Deployment Specification
hosts.xml contains a list of all the hosts/nodes that are part of the application, with an alias for each of them. Here we use a single node:
<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <host name="localhost">
    <alias>node1</alias>
  </host>
</hosts>
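Every hostalias used in services.xml has to be declared in hosts.xml. As a rough illustration of that relationship, the sketch below cross-checks the two files with Python's standard XML parser. The function name undefined_aliases is hypothetical, and the embedded XML is a trimmed-down copy of the files above (without the XML declaration), not a complete application package.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copies of the two files shown above, for illustration only.
SERVICES_XML = """
<services version='1.0'>
  <container id='default' version='1.0'>
    <nodes><node hostalias="node1"/></nodes>
  </container>
  <content id='blog_post' version='1.0'>
    <nodes><node hostalias="node1"/></nodes>
  </content>
</services>
"""

HOSTS_XML = """
<hosts>
  <host name="localhost"><alias>node1</alias></host>
</hosts>
"""


def undefined_aliases(services_xml: str, hosts_xml: str) -> set:
    """Return host aliases used in services.xml but missing from hosts.xml."""
    services = ET.fromstring(services_xml)
    hosts = ET.fromstring(hosts_xml)
    used = {n.get("hostalias") for n in services.iter("node") if n.get("hostalias")}
    defined = {a.text.strip() for a in hosts.iter("alias") if a.text}
    return used - defined


missing = undefined_aliases(SERVICES_XML, HOSTS_XML)
print("undefined aliases:", sorted(missing))
```

An empty result means every node referenced by a cluster maps to a declared host. This is only a local sanity check; vespa-deploy prepare performs the authoritative validation.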
Search Definition
The blog_post document type mentioned in src/main/application/services.xml is defined in the search definition. src/main/application/searchdefinitions/blog_post.sd contains the search definition for a document of type blog_post:
search blog_post {
    document blog_post {
        field date_gmt type string {
            indexing: summary
        }
        field language type string {
            indexing: summary
        }
        field author type string {
            indexing: summary
        }
        field url type string {
            indexing: summary
        }
        field title type string {
            indexing: summary | index
        }
        field blog type string {
            indexing: summary
        }
        field post_id type string {
            indexing: summary
        }
        field tags type array<string> {
            indexing: summary
        }
        field blogname type string {
            indexing: summary
        }
        field content type string {
            indexing: summary | index
        }
        field categories type array<string> {
            indexing: summary
        }
        field date type int {
            indexing: summary | attribute
        }
    }
    fieldset default {
        fields: title, content
    }
    rank-profile post inherits default {
        first-phase {
            expression: nativeRank(title, content)
        }
    }
}
document is wrapped inside another element called search. The name following these elements, here blog_post, must be exactly the same for both.
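The field property indexing configures the indexing pipeline for a field, which defines how Vespa treats input during indexing (see indexing language). The parts of the indexing pipeline are separated by the pipe character ‘|’:
- index: create a search index for this field.
- attribute: store this field in memory as an attribute, for sorting, searching and grouping.
- summary: include this field in the document summary returned with each hit.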
Deploy the Application Package
Once the application package is complete, build and start Vespa as in the quick start, then deploy the application:
$ cd /vespa-sample-apps/blog-search
$ vespa-deploy prepare src/main/application && vespa-deploy activate
This prints that the application was activated successfully and also the checksum, timestamp and generation for this deployment (more on that later). Pointing a browser to http://localhost:8080/ApplicationStatus returns JSON-formatted information about the active application, including its checksum, timestamp and generation (these should match the values printed when vespa-deploy activate was run). The generation increases by 1 each time a new application is successfully deployed, and is the easiest way to verify that the correct version is active.
The Vespa node is now configured and ready for use.
Feeding Data
The data fed to Vespa must match the search definition for the document type. The data downloaded from Kaggle, contained in trainPosts.json, must be converted to a valid Vespa document format before it can be fed to Vespa. Find a parser in the utility repository. Since the full data set is unnecessarily large for this first part of the series, we use only the first 10,000 lines of it, but feel free to load all 1.1M entries:
$ head -10000 trainPosts.json > trainPostsSmall.json
$ python parse.py trainPostsSmall.json > feed.json
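To give an idea of what such a conversion involves, here is a minimal sketch of turning one raw record into a Vespa put operation. It is not the parse.py from the utility repository. The document id layout (namespace blog-search, type blog_post) is inferred from the Document API URL used later in this post, and deriving the integer date field as yyyymmdd from date_gmt is an assumption made for illustration.

```python
import json
from datetime import datetime

# Fields copied as strings because the search definition declares them as string.
STRING_FIELDS = ("date_gmt", "author", "url", "title", "blog", "content")
ARRAY_FIELDS = ("tags", "categories")


def to_vespa_put(post: dict) -> dict:
    """Build a put operation for one raw blog post record (illustrative sketch)."""
    fields = {"post_id": str(post["post_id"])}
    for name in STRING_FIELDS:
        if name in post:
            fields[name] = str(post[name])
    for name in ARRAY_FIELDS:
        if name in post:
            fields[name] = [str(item) for item in post[name]]
    if "date_gmt" in post:
        created = datetime.strptime(post["date_gmt"], "%Y-%m-%d %H:%M:%S")
        # Assumption: store the date as a sortable integer, e.g. 20120325.
        fields["date"] = int(created.strftime("%Y%m%d"))
    return {"put": f"id:blog-search:blog_post::{post['post_id']}", "fields": fields}


# Made-up record for demonstration.
raw = {
    "post_id": "1750271",
    "date_gmt": "2012-03-25 10:30:00",
    "title": "A post about music",
    "tags": ["music", "festival"],
}
print(json.dumps(to_vespa_put(raw), indent=2))
```

Each output document carries only fields declared in blog_post.sd, which is what "must match the search definition" amounts to in practice.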
Send this to Vespa using one of the tools Vespa provides for feeding. Here we will use the Java feeding API:
$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --verbose --file feed.json --host localhost --port 8080
Note that in the sample-apps/blog-search directory, there is a file with sample data. You may also feed this file using this method.
Track feeding progress
Use the Metrics API to track number of documents indexed:
$ curl -s 'http://localhost:19112/state/v1/metrics' | tr ',' '\n' | grep -A 2 proton.doctypes.blog_post.numdocs
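The same lookup can be done on the parsed JSON instead of with tr and grep. The sketch below assumes the metrics response has the shape {"metrics": {"values": [{"name": ..., "values": {"last": ...}}]}}; that layout and the sample payload are assumptions for illustration, so verify them against the actual output of your Vespa version. The helper name find_metric is hypothetical.

```python
def find_metric(metrics_doc: dict, name_part: str) -> list:
    """Return (name, last value) pairs for metrics whose name contains name_part."""
    entries = metrics_doc.get("metrics", {}).get("values", [])
    return [
        (entry.get("name", ""), entry.get("values", {}).get("last"))
        for entry in entries
        if name_part in entry.get("name", "")
    ]


# Made-up payload in the assumed layout, not real output.
SAMPLE_METRICS = {
    "metrics": {
        "values": [
            {"name": "proton.doctypes.blog_post.numdocs", "values": {"last": 10000}},
            {"name": "some.other.metric", "values": {"last": 1}},
        ]
    }
}
print(find_metric(SAMPLE_METRICS, "blog_post.numdocs"))
```

With a live node, the input would be the parsed body of the metrics URL used in the curl command above.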
You can also inspect the search node state by running:
$ vespa-proton-cmd --local getState
Fetch documents
Fetch documents by document id using the Document API:
$ curl -s 'http://localhost:8080/document/v1/blog-search/blog_post/docid/1750271' | python -m json.tool
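The path in the request above mirrors the document id: the id id:blog-search:blog_post::1750271 maps to namespace blog-search, document type blog_post and the user-specified part 1750271. The sketch below shows that mapping for plain ids only. The function name document_v1_path is hypothetical, and ids with key/value modifiers are deliberately not handled.

```python
from urllib.parse import quote


def document_v1_path(doc_id: str) -> str:
    """Map 'id:<namespace>:<type>::<key>' to its /document/v1 path."""
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not a document id: {doc_id!r}")
    _, namespace, doc_type, key_values, user_part = parts
    if key_values:
        # Ids with modifiers between the colons are out of scope for this sketch.
        raise ValueError("ids with key/value modifiers are not handled here")
    return "/document/v1/{}/{}/docid/{}".format(
        quote(namespace, safe=""), quote(doc_type, safe=""), quote(user_part, safe="")
    )


print("http://localhost:8080" + document_v1_path("id:blog-search:blog_post::1750271"))
```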
The first query
Searching with Vespa is done using HTTP GET requests, like:
<host:port>/<search>?<yql=value1>&<param2=value2>...
The only mandatory parameter is the query, using yql=<yql query>. More details can be found in the Search API.
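Writing URL-encoded YQL by hand is error-prone, so it can be convenient to let a library do the encoding. The following sketch builds a search URL from a readable YQL string using Python's standard library. The helper name search_url and its default host and port are assumptions matching the single-node setup in this post.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(yql: str, host: str = "localhost", port: int = 8080, **params) -> str:
    """Return a search URL with the YQL query and extra parameters URL-encoded."""
    query = urlencode({"yql": yql, **params})
    return f"http://{host}:{port}/search/?{query}"


yql = 'select * from sources * where default contains "music";'
url = search_url(yql)
print(url)

# Decoding the query string gives back the original YQL.
decoded = parse_qs(urlsplit(url).query)["yql"][0]
print(decoded == yql)
```

Additional request parameters can be passed as keyword arguments, for example `search_url(yql, hits=5)`.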
Given the above search definition, where the fields title and content are part of the fieldset default, any document containing the word “music” in one or more of these two fields matches our query below:
$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22%3B' | python -m json.tool
Looking at the output, please note:
- The field documentid in the output and how it matches the value we assigned to each put operation when feeding data to Vespa.
- Each hit has a property named relevance, which indicates how well the given document matches our query, using a pre-defined default ranking function. You have full control over ranking; more about ranking and ordering later. The hits are sorted by this value.
- When multiple hits have the same relevance score their internal ordering is undefined. However, their internal ordering will not change unless the documents are re-indexed.
- Add &tracelevel=9 to dump query parsing details.
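The points above can also be checked programmatically on the parsed response. This sketch assumes the usual JSON result layout, with hits under root.children and each hit carrying a relevance value and a fields object. The sample response is made up, and summarize_hits is a hypothetical helper name.

```python
def summarize_hits(response: dict) -> list:
    """Return (relevance, documentid) pairs in the order the hits were returned."""
    hits = response.get("root", {}).get("children", [])
    return [
        (hit.get("relevance"), hit.get("fields", {}).get("documentid"))
        for hit in hits
    ]


# Made-up response in the assumed layout, not real output.
SAMPLE_RESPONSE = {
    "root": {
        "fields": {"totalCount": 2},
        "children": [
            {"relevance": 0.25, "fields": {"documentid": "id:blog-search:blog_post::1750271"}},
            {"relevance": 0.10, "fields": {"documentid": "id:blog-search:blog_post::42"}},
        ],
    }
}

pairs = summarize_hits(SAMPLE_RESPONSE)
for relevance, documentid in pairs:
    print(relevance, documentid)

# Hits should arrive sorted by descending relevance.
scores = [relevance for relevance, _ in pairs]
print(scores == sorted(scores, reverse=True))
```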
Other examples
yql=select+title+from+sources+*+where+title+contains+%22music%22%3B
Once more a search for the single term “music”, but this time with the explicit field title. This means that we only want to match documents that contain the word “music” in the field title. As expected, you will see fewer hits for this query than for the previous one.
yql=select+*+from+sources+*+where+default+contains+%22music%22+AND+default+contains+%22festival%22%3B
This is a query for the two terms “music” and “festival”, combined with an AND operation; it finds documents that match both terms, not just one of them.
yql=select+*+from+sources+*+where+sddocname+contains+%22blog_post%22%3B
This is a single-term query in the special field sddocname for the value “blog_post”. This is a common and useful Vespa trick to get the number of indexed documents for a certain document type (search definition): sddocname is a special and reserved field which is always set to the name of the document type for a given document. The documents are all of type blog_post, and will therefore automatically have the field sddocname set to that value.
This means that the query above really means “Return all documents of type blog_post”, and as such all documents in the index are returned.
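When only the document count is of interest, the hits themselves are not needed. The sketch below builds the sddocname query with hits=0 and reads the total from the response, assuming the count is reported under root.fields.totalCount. Both helper names are hypothetical and the sample response is made up.

```python
from urllib.parse import urlencode


def count_query_url(doc_type: str, host: str = "localhost", port: int = 8080) -> str:
    """Build a search URL that matches every document of the given type."""
    yql = f'select * from sources * where sddocname contains "{doc_type}";'
    # hits=0 asks for no hit bodies; only the total count is of interest here.
    return f"http://{host}:{port}/search/?" + urlencode({"yql": yql, "hits": 0})


def total_count(response: dict) -> int:
    """Read the number of matching documents from a parsed search response."""
    return int(response["root"]["fields"]["totalCount"])


print(count_query_url("blog_post"))
print(total_count({"root": {"fields": {"totalCount": 10000}}}))  # made-up response
```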