Build a News recommendation app from python with Vespa: Part 1

Part 1 – News search functionality.

We will build a news recommendation app in Vespa without leaving a Python environment. In this first part of the series, we develop an application with basic search functionality. Future posts will add recommendation capabilities based on embeddings and other ML models.

UPDATE 2023-02-13: Code examples are updated to work with the latest release of
pyvespa.

Photo by Filip Mishevski on Unsplash

This series is a simplified version of Vespa’s News search and recommendation tutorial. We will also use the demo version of the Microsoft News Dataset (MIND) so that anyone can follow along on their laptops.

Dataset

The original Vespa news search tutorial provides a script to download, parse and convert the MIND dataset to Vespa format. To make things easier for you, we made the final parsed data required for this tutorial available for download:

import requests, json

data = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_fields_parsed.json").text
)
data[0]
{'abstract': "Shop the notebooks, jackets, and more that the royals can't live without.",
 'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By',
 'subcategory': 'lifestyleroyals',
 'news_id': 'N3112',
 'category': 'lifestyle',
 'url': 'https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata',
 'date': 20191103,
 'clicks': 0,
 'impressions': 0}

The final parsed data used here is a list where each element is a dictionary containing relevant fields about a news article, such as title and category. We also have information about the number of impressions and clicks the article has received. The demo version of the MIND dataset includes 28,603 news articles.
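To get a feel for the parsed records, one quick sanity check is to count the articles per category. The snippet below is a minimal sketch that runs on a small hand-made sample; the `sample` list and the `count_by_category` helper are hypothetical and not part of the tutorial. With the real download you would pass `data` instead.

```python
from collections import Counter

def count_by_category(articles):
    """Count how many articles fall into each category."""
    return Counter(article["category"] for article in articles)

# Hypothetical sample that mimics the structure of the parsed MIND records.
sample = [
    {"news_id": "N1", "category": "lifestyle", "clicks": 0, "impressions": 0},
    {"news_id": "N2", "category": "sports", "clicks": 1, "impressions": 4},
    {"news_id": "N3", "category": "sports", "clicks": 0, "impressions": 2},
]

category_counts = count_by_category(sample)
print(category_counts.most_common())
```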

Install pyvespa

Create the search app

Create the application package. app_package will hold all the relevant data related to your application’s specification.

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="news")

Add fields to the schema. Here is a short description of the non-obvious arguments used below:

  • indexing argument: configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing.

  • index argument: configure how Vespa should create the search index.

    • “enable-bm25”: set up an index compatible with bm25 ranking for text search.
  • attribute argument: configure how Vespa should treat an attribute field.

    • “fast-search”: Build an index for an attribute field. By default, no index is generated for attributes, and search over these defaults to a linear scan.

from vespa.package import Field

app_package.schema.add_fields(
    Field(name="news_id", type="string", indexing=["summary", "attribute"], attribute=["fast-search"]),
    Field(name="category", type="string", indexing=["summary", "attribute"]),
    Field(name="subcategory", type="string", indexing=["summary", "attribute"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="url", type="string", indexing=["index", "summary"]),        
    Field(name="date", type="int", indexing=["summary", "attribute"]),            
    Field(name="clicks", type="int", indexing=["summary", "attribute"]),            
    Field(name="impressions", type="int", indexing=["summary", "attribute"]),                
)

Add a fieldset to the schema. A fieldset allows us to search over multiple fields easily. In this case, searching over the default fieldset is equivalent to searching over title and abstract.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "abstract"])
)

We have enough to deploy the first version of our application. Later in this tutorial, we will include an article’s popularity into the relevance score used to rank the news that matches our queries.

Deploy the app on Docker

If you have Docker installed on your machine, you can deploy the app_package in a local Docker container:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(
    application_package=app_package, 
)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

vespa_docker will parse the app_package and write all the necessary Vespa config files to the disk_folder. It will then create the docker containers and use the Vespa config files to deploy the Vespa application. We can then use the app instance to interact with the deployed application, such as for feeding and querying. If you want to know more about what happens behind the scenes, we suggest you go through this getting started with Docker tutorial.

Feed data to the app

We can use the feed_data_point method. We need to specify:

  • data_id: unique id to identify the data point

  • fields: dictionary with keys matching the field names defined in our application package schema.

  • schema: name of the schema we want to feed data to. When we created an application package, we created a schema by default with the same name as the application name, news in our case.

This takes 10 minutes or so:

for article in data:
    res = app.feed_data_point(
        data_id=article["news_id"], 
        fields=article, 
        schema="news"
    )

Query the app

We can use the Vespa Query API through app.query to unlock the full query flexibility Vespa can offer.

Search over indexed fields using keywords

Select all the fields from documents where default (title or abstract) contains the keyword ‘music’.

res = app.query(body={"yql" : "select * from sources * where default contains 'music'"})
res.hits[0]
{
    'id': 'id:news:news::N14152',
    'relevance': 0.25641557752127125,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N14152',
        'news_id': 'N14152',
        'category': 'music',
        'subcategory': 'musicnews',
        'title': 'Music is hot in Nashville this week',
        'abstract': 'Looking for fun, entertaining music events to check out in Nashville this week? Here are top picks with dates, times, locations and ticket links.', 'url': 'https://www.msn.com/en-us/music/musicnews/music-is-hot-in-nashville-this-week/ar-BBWImOh?ocid=chopendata',
        'date': 20191101,
        'clicks': 0,
        'impressions': 3
    }
}

Select title and abstract where title contains ‘music’ and default contains ‘festival’.

res = app.query(body = {"yql" : "select title, abstract from sources * where title contains 'music' AND default contains 'festival'"})
res.hits[0]
{
    'id': 'index:news_content/0/988f76793a855e48b16dc5d3',
    'relevance': 0.19587240022210403,
    'source': 'news_content',
    'fields': {
        'title': "At Least 3 Injured In Stampede At Travis Scott's Astroworld Music Festival",
        'abstract': "A stampede Saturday outside rapper Travis Scott's Astroworld musical festival in Houston, left three people injured. Minutes before the gates were scheduled to open at noon, fans began climbing over metal barricades and surged toward the entrance, according to local news reports."
    }
}
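The YQL statements used in these queries are plain strings, so they can be assembled programmatically. The helper below is a hypothetical sketch (`build_yql` is not part of pyvespa) that joins selected fields and conditions; it does no escaping, so it is only suitable for trusted input.

```python
def build_yql(fields, conditions, source="*"):
    """Assemble a YQL select statement from field names and where-conditions."""
    select = ", ".join(fields) if fields else "*"
    where = " AND ".join(conditions)
    return f"select {select} from sources {source} where {where}"

yql = build_yql(
    ["title", "abstract"],
    ["title contains 'music'", "default contains 'festival'"],
)
print(yql)
```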

Search by document type

Select the title of all the documents with document type equal to news. Our application has only one document type, so the query below retrieves all our documents.

res = app.query(body = {"yql" : "select title from sources * where sddocname contains 'news'"})
res.hits[0]
{
    'id': 'index:news_content/0/698f73a87a936f1c773f2161',
    'relevance': 0.0,
    'source': 'news_content',
    'fields': {
        'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'
    }
}

Search over attribute fields such as date

Since date is not specified with attribute=["fast-search"], no index is built for it. A search over it is therefore equivalent to a linear scan over the values of the field.

res = app.query(body={"yql" : "select title, date from sources * where date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/debbdfe653c6d11f71cc2353',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'title': 'These Cranberry Sauce Recipes Are Perfect for Thanksgiving Dinner',
        'date': 20191110
    }
}

Since the default fieldset is formed by indexed fields, Vespa will first select the documents that contain the keyword ‘weather’ within title or abstract, and only then scan the date field of those documents for ‘20191110’.

res = app.query(body={"yql" : "select title, abstract, date from sources * where default contains 'weather' AND date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/bb88325ae94d888c46538d0b',
    'relevance': 0.27025156546141466,
    'source': 'news_content',
    'fields': {
        'title': 'Weather forecast in St. Louis',
        'abstract': "What's the weather today? What's the weather for the week? Here's your forecast.",
        'date': 20191110
    }
}

We can also perform range searches:

res = app.query(body={"yql" : "select date from sources * where date <= 20191110 AND date >= 20191108"})
res.hits[0]
{
    'id': 'index:news_content/0/c41a873213fdcffbb74987c0',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'date': 20191109
    }
}
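Range searches work on the date field because it stores dates as YYYYMMDD integers, which sort chronologically. As a small sketch, the hypothetical helpers below (`to_yyyymmdd` and `date_range_yql` are illustrative names, not pyvespa functions) encode Python dates in that layout and build the corresponding range query.

```python
from datetime import date

def to_yyyymmdd(day):
    """Encode a date as the YYYYMMDD integer layout used by the date field."""
    return day.year * 10000 + day.month * 100 + day.day

def date_range_yql(start, end):
    """Build a YQL range query over the integer date field (inclusive bounds)."""
    return (
        "select date from sources * "
        f"where date <= {to_yyyymmdd(end)} AND date >= {to_yyyymmdd(start)}"
    )

range_yql = date_range_yql(date(2019, 11, 8), date(2019, 11, 10))
print(range_yql)
```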

Sorting

By default, Vespa sorts the hits by descending relevance score. The relevance score is given by the nativeRank unless something else is specified, as we will do later in this post.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music'"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/5f1b30d14d4a15050dae9f7f',
        'relevance': 0.25641557752127125,
        'source': 'news_content',
        'fields': {
            'title': 'Music is hot in Nashville this week',
            'date': 20191101
        }
    },
    {
        'id': 'index:news_content/0/6a031d5eff95264c54daf56d',
        'relevance': 0.23351089409559303,
        'source': 'news_content',
        'fields': {
            'title': 'Apple Music Replay highlights your favorite tunes of the year',
            'date': 20191105
        }
    }
]

However, we can explicitly order by a given field with the order by clause.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/d0d7e1c080f0faf5989046d8',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': "Elton John's second farewell tour stop in Cleveland shows why he's still standing after all these years",
            'date': 20191031
        }
    },
    {
        'id': 'index:news_content/0/abf7f6f46ff2a96862075155',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'The best hair metal bands',
            'date': 20191101
        }
    }
]

order by sorts in ascending order by default. We can override that with the desc keyword:

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date desc"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/934a8d976ff8694772009362',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Korg Minilogue XD update adds key triggers for synth sequences',
            'date': 20191113
        }
    },
    {
        'id': 'index:news_content/0/4feca287fdfa1d027f61e7bf',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Tom Draper, Black Music Industry Pioneer, Dies at 79',
            'date': 20191113
        }
    }
]

Grouping

We can use Vespa’s grouping feature to compute the three news categories with the highest number of document counts:

res = app.query(body={"yql" : "select * from sources * where sddocname contains 'news' limit 0 | all(group(category) max(3) order(-count())each(output(count())))"})
res.hits[0]
{
    'id': 'group:root:0',
    'relevance': 1.0,
    'continuation': {
        'this': ''
    },
    'children': [
        {
            'id': 'grouplist:category',
            'relevance': 1.0,
            'label': 'category',
            'continuation': {
                'next': 'BGAAABEBGBC'
            },
            'children': [
                {
                    'id': 'group:string:news',
                    'relevance': 1.0,
                    'value': 'news',
                    'fields': {
                        'count()': 9115
                    }
                },
                {
                    'id': 'group:string:sports',
                    'relevance': 0.6666666666666666,
                    'value': 'sports',
                    'fields': {
                        'count()': 6765
                    }
                },
                {
                    'id': 'group:string:finance',
                    'relevance': 0.3333333333333333,
                    'value': 'finance',
                    'fields': {
                        'count()': 1886
                    }
                }
            ]
        }
    ]
}
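The grouping result is a nested structure, so a small helper makes it easier to read. The sketch below walks the group root shown above and collects (category, count) pairs; `extract_group_counts` is a hypothetical helper, and `group_root` is a trimmed copy of the response above rather than a live query result.

```python
def extract_group_counts(group_root):
    """Collect (value, count) pairs from a grouping result root."""
    counts = []
    for group_list in group_root.get("children", []):
        for group in group_list.get("children", []):
            counts.append((group["value"], group["fields"]["count()"]))
    return counts

# Trimmed copy of the grouping response shown above.
group_root = {
    "id": "group:root:0",
    "children": [
        {
            "id": "grouplist:category",
            "children": [
                {"value": "news", "fields": {"count()": 9115}},
                {"value": "sports", "fields": {"count()": 6765}},
                {"value": "finance", "fields": {"count()": 1886}},
            ],
        }
    ],
}

top_categories = extract_group_counts(group_root)
print(top_categories)
```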

Use news popularity signal for ranking

Vespa uses nativeRank to compute relevance scores by default. We will create a new rank-profile that includes a popularity signal in our relevance score computation.

from vespa.package import RankProfile, Function

app_package.schema.add_rank_profile(
    RankProfile(
        name="popularity",
        inherits="default",
        functions=[
            Function(
                name="popularity", 
                expression="if (attribute(impressions) > 0, attribute(clicks) / attribute(impressions), 0)"
            )
        ], 
        first_phase="nativeRank(title, abstract) + 10 * popularity"
    )
)
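To make the arithmetic of this rank-profile concrete, the sketch below re-implements the two expressions in plain Python. It only mirrors the formulas above; the `native_rank` input is a made-up number, since the real nativeRank value is computed inside Vespa.

```python
def popularity(clicks, impressions):
    """Click-through ratio, or 0 when the article has no impressions."""
    return clicks / impressions if impressions > 0 else 0.0

def first_phase(native_rank, clicks, impressions):
    """Mirror of: nativeRank(title, abstract) + 10 * popularity."""
    return native_rank + 10 * popularity(clicks, impressions)

# Hypothetical inputs: a text score of 0.25 and 3 clicks out of 30 impressions.
score_with_clicks = first_phase(0.25, clicks=3, impressions=30)
score_without_impressions = first_phase(0.25, clicks=0, impressions=0)
print(score_with_clicks, score_without_impressions)
```

With these inputs the popularity term dominates the text score, which shows how strongly the factor of 10 weights the click signal.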

Our new rank-profile will be called

Build a News recommendation app from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Part 2 – From news search to news recommendation with embeddings.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of
pyvespa.

In this part, we start transforming our application from news search to news recommendation using the embeddings created in this tutorial. An embedding vector will represent each user and each news article. We make the embeddings available for download so that this post is easier to follow. When a user arrives, we retrieve their embedding and use it to retrieve the closest news articles via an approximate nearest neighbor (ANN) search. We also show that Vespa can jointly apply general filtering and ANN search, unlike competing alternatives available in the market.

Photo by Matt Popovich on Unsplash

We assume that you have followed the news search tutorial. Therefore, you should have an app_package variable holding the news search app definition and a Docker container named news running a search application fed with news articles from the demo version of the MIND dataset.

Add a user schema

We need to add another document type to represent a user. We set up the schema to search for a user_id and retrieve the user’s embedding vector.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="user", 
        document=Document(
            fields=[
                Field(
                    name="user_id", 
                    type="string", 
                    indexing=["summary", "attribute"], 
                    attribute=["fast-search"]
                ), 
                Field(
                    name="embedding", 
                    type="tensor<float>(d0[51])", 
                    indexing=["summary", "attribute"]
                )
            ]
        )
    )
)

We build an index for the attribute field user_id by specifying the fast-search attribute. Remember that attribute fields are held in memory and are not indexed by default.

The embedding field is a tensor field. Tensors in Vespa are flexible multi-dimensional data structures and, as first-class citizens, can be used in queries, document fields, and constants in ranking. Tensors can be either dense or sparse or both and can contain any number of dimensions. Please see the tensor user guide for more information. Here we have defined a dense tensor with a single dimension (d0 – dimension 0), representing a vector. 51 is the size of the embeddings used in this post.

We now have one schema for the news and one schema for the user.

[schema.name for schema in app_package.schemas]

Index news embeddings

Similarly to the user schema, we will use a dense tensor to represent the news embeddings. But unlike the user embedding field, we will index the news embedding by including index in the indexing argument and specify that we want to build the index using the HNSW (hierarchical navigable small world) algorithm. The distance metric used is euclidean. Read this blog post to know more about Vespa’s journey to implement ANN search.

from vespa.package import Field, HNSW

app_package.get_schema(name="news").add_fields(
    Field(
        name="embedding", 
        type="tensor<float>(d0[51])", 
        indexing=["attribute", "index"],
        ann=HNSW(distance_metric="euclidean")
    )
)

Recommendation using embeddings

Here, we add a rank-profile whose ranking expression uses the closeness ranking feature, which calculates the euclidean distance between the query tensor and the document embedding and uses it to rank the news articles. This rank-profile depends on the nearestNeighbor search operator, which we will get back to below when searching. For now, note that it expects a tensor in the query to use as the initial search point.

from vespa.package import RankProfile

app_package.get_schema(name="news").add_rank_profile(
    RankProfile(
        name="recommendation", 
        inherits="default", 
        first_phase="closeness(field, embedding)"
    )
)
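As a rough illustration of what this rank-profile computes, the sketch below derives a closeness score from the euclidean distance between two vectors. It assumes that closeness is defined as 1 / (1 + distance), which is our reading of the Vespa rank feature documentation; the helper names are hypothetical and the vectors are made up.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closeness(a, b):
    """Assumed definition: 1 / (1 + distance), so identical vectors score 1.0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

same_point = closeness([0.0, 0.0], [0.0, 0.0])
far_point = closeness([0.0, 0.0], [3.0, 4.0])  # distance is 5
print(same_point, far_point)
```

Under this assumption, a smaller distance always gives a higher score, which is why the profile can rank by closeness directly.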

Query Profile Type

The recommendation rank profile above requires that we send a tensor along with the query. For Vespa to bind the correct types, it needs to know the expected type of this query parameter.

from vespa.package import QueryTypeField

app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(user_embedding)",
        type="tensor<float>(d0[51])"
    )
)

This query profile type instructs Vespa to expect a float tensor with dimension d0[51] when the query parameter ranking.features.query(user_embedding) is passed. We’ll see how this works together with the nearestNeighbor search operator below.

Redeploy the application

We made all the required changes to turn our news search app into a news recommendation app. We can now redeploy the app_package to our running container named news.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Finished deployment.
["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session",
 "Session 7 for tenant 'default' created.",
 'Preparing session 7 using http://localhost:19071/application/v2/tenant/default/session/7/prepared',
 "WARNING: Host named 'news' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.",
 "Session 7 for tenant 'default' prepared.",
 'Activating session 7 using http://localhost:19071/application/v2/tenant/default/session/7/active',
 "Session 7 for tenant 'default' activated.",
 'Checksum:   62d964000c4ff4a5280b342cd8d95c80',
 'Timestamp:  1616671116728',
 'Generation: 7',
 '']

Feeding and partial updates: news and user embeddings

To keep this tutorial easy to follow, we make the parsed embeddings available for download. To build them yourself, please follow this tutorial.

import requests, json

user_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_user_embeddings_parsed.json").text
)
news_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_news_embeddings_parsed.json").text
)

We just created the user schema, so we need to feed user data for the first time.

for user_embedding in user_embeddings:
    response = app.feed_data_point(
        schema="user", 
        data_id=user_embedding["user_id"], 
        fields=user_embedding
    )

For the news documents, we just need to update the embedding field added to the news schema.
This takes ten minutes or so:

for news_embedding in news_embeddings:
    response = app.update_data(
        schema="news", 
        data_id=news_embedding["news_id"], 
        fields={"embedding": news_embedding["embedding"]}
    )

Fetch the user embedding

Next, we create a query_user_embedding function to retrieve the user embedding by the user_id. Of course, you could do this more efficiently using a Vespa Searcher as described here, but keeping everything in Python at this point makes learning easier.

def parse_embedding(hit_json):
    # Copy the tensor values of the hit's embedding field into a plain list.
    return list(hit_json["fields"]["embedding"]["values"])

def query_user_embedding(user_id):
    # Look up the user document by user_id and return its embedding as a list.
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding

The function will query Vespa, retrieve the embedding and parse it into a list of floats. Here are the first five elements of the user U63195’s embedding.

query_user_embedding(user_id="U63195")[:5]
[
    0.0,
    -0.1694680005311966,
    -0.0703359991312027,
    -0.03539799898862839,
    0.14579899609088898
]

Get recommendations

The following yql instructs Vespa to select the title and the category from the ten news documents closest to the user embedding.

yql = "select title, category from sources news where ({targetHits:10}nearestNeighbor(embedding, user_embedding))" 

We also specify that we want to rank those documents by the recommendation rank-profile that we defined earlier and send the user embedding via the query profile type ranking.features.query(user_embedding) that we also defined in our app_package.

result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)

Here are the first two hits out of the ten returned.

[
    {
        'id': 'index:news_content/0/aca03f4ba2274dd95b58db9a',
        'relevance': 0.1460561756063909,
        'source': 'news_content',
        'fields': {
            'category': 'music',
            'title': 'Broadway Star Laurel Griggs Suffered Asthma Attack Before She Died at Age 13'
        }
    },
    {
        'id': 'index:news_content/0/bd02238644c604f3a2d53364',
        'relevance': 0.14591827245062294,
        'source': 'news_content',
        'fields': {
            'category': 'tv',
            'title': "Rip Taylor's Cause of Death Revealed, Memorial Service Scheduled for Later This Month"
        }
    }
]

Combine ANN search with query filters

Vespa ANN search is fully integrated into the Vespa query tree. This integration means that we can include query filters and the ANN search will be applied only to documents that satisfy the filters. No need to do pre- or post-processing involving filters.

The following yql searches over news documents that have sports as their category.

yql = "select title, category from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding)) AND " \
      "category contains 'sports'"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)

Here are the first two hits out of the ten returned. Notice the category field.

[
    {
        'id': 'index:news_content/0/375ea340c21b3138fae1a05c',
        'relevance': 0.14417346200569972,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': 'Charles Rogers, former Michigan State football, Detroit Lions star, dead at 38'
        }
    },
    {
        'id': 'index:news_content/0/2b892989020ddf7796dae435',
        'relevance': 0.14404365847394848,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': "'Monday Night Football' commentator under fire after belittling criticism of 49ers kicker for missed field goal"
        }
    }
]
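To illustrate the semantics of combining a filter with a nearest neighbor search, here is a brute-force sketch in plain Python. It is an exact search over a tiny made-up collection, not the HNSW-based approximate search that Vespa runs, and `filtered_nearest` and `docs` are hypothetical names. The point is only that the k hits are chosen among the documents that pass the filter.

```python
import math

def filtered_nearest(docs, query_vector, category, k):
    """Return the k documents of the given category closest to the query vector."""
    candidates = [doc for doc in docs if doc["category"] == category]
    candidates.sort(key=lambda doc: math.dist(doc["embedding"], query_vector))
    return candidates[:k]

# Hypothetical two-dimensional embeddings for four documents.
docs = [
    {"id": "A", "category": "sports", "embedding": [0.0, 0.0]},
    {"id": "B", "category": "music", "embedding": [0.1, 0.0]},
    {"id": "C", "category": "sports", "embedding": [1.0, 1.0]},
    {"id": "D", "category": "sports", "embedding": [0.5, 0.0]},
]

nearest_sports = [doc["id"] for doc in filtered_nearest(docs, [0.0, 0.0], "sports", k=2)]
print(nearest_sports)
```

Document B is closer to the query than D, but it is excluded because it does not satisfy the category filter.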

Next steps

Continue to part 3, or see the conclusion for how to clean up the Docker container instances if you are done with this.

Build a News recommendation app from python with Vespa: Part 3

Thiago Martins

Vespa Data Scientist


Part 3 – Efficient use of click-through rate via parent-child relationship.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of
pyvespa.

This part of the series introduces a new ranking signal: category click-through rate (CTR). The idea is that we can recommend popular content to users who don’t have a click history yet. Rather than recommending based on individual articles, we recommend based on categories. However, these global CTR values can change frequently, so we need an efficient way to update them for all documents. We’ll do that by introducing parent-child relationships between documents in Vespa. We will also use sparse tensors directly in ranking. This post replicates this more detailed Vespa tutorial.

Photo by AbsolutVision on Unsplash

We assume that you have followed part 2 of the news recommendation tutorial. Therefore, you should have an app_package variable holding the news app definition and a Docker container named news running the application fed with data from the demo version of the MIND dataset.

Setting up a global category CTR document

If we add a category_ctr field in the news document, we would have to update all the sports documents every time the sports CTR statistic changes. If the category CTR changes often, this becomes inefficient.

For these cases, Vespa introduced the parent-child relationship. Parents are global documents, which are automatically distributed to all content nodes. Other documents can reference these parents and “import” values for use in ranking. The benefit is that the global category CTR values only need to be written to one place: the global document.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="category_ctr",
        global_document=True,
        document=Document(
            fields=[
                Field(
                    name="ctrs", 
                    type="tensor<float>(category{})", 
                    indexing=["attribute"], 
                    attribute=["fast-search"]
                ), 
            ]
        )
    )
)

We implement that by creating a new category_ctr schema and setting global_document=True to indicate that we want Vespa to keep a copy of these documents on all content nodes. Setting a document to be global is required for using it in a parent-child relationship. Note that we use a tensor with a single sparse dimension to hold the ctrs data.

Sparse tensors have strings as dimension addresses rather than a numeric index. More concretely, an example of such a tensor is (using the tensor literal form):

{
    {category: entertainment}: 0.2,
    {category: news}: 0.3,
    {category: sports}: 0.5,
    {category: travel}: 0.4,
    {category: finance}: 0.1,
    ...
}

This tensor holds all the CTR scores for all the categories. When updating this tensor, we can update individual cells, and we don’t need to update the whole tensor. This operation is called tensor modify and can be helpful when you have large tensors.
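The idea of updating individual cells can be pictured with a plain Python dict standing in for the sparse tensor. This is only a sketch of the semantics, not the Vespa update API; `modify_cells` is a hypothetical helper and the values are made up.

```python
def modify_cells(tensor, updates):
    """Return a copy of a sparse tensor (dict) with only the given cells replaced."""
    modified = dict(tensor)
    modified.update(updates)
    return modified

# Category label -> CTR value, mirroring the single sparse dimension.
ctrs = {"entertainment": 0.2, "news": 0.3, "sports": 0.5}

# Only the sports cell changes; every other cell keeps its value.
updated_ctrs = modify_cells(ctrs, {"sports": 0.55})
print(updated_ctrs)
```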

Importing parent values in child documents

We need to set up two things to use the category_ctr tensor for ranking news documents. We need to reference the parent document (category_ctr in this case) and import the ctrs from the referenced parent document.

app_package.get_schema("news").add_fields(
    Field(
        name="category_ctr_ref",
        type="reference<category_ctr>",
        indexing=["attribute"],
    )
)

The field category_ctr_ref is a field of type reference of the category_ctr document type. When feeding this field, Vespa expects the fully qualified document id. For instance, if our global CTR document has the id id:category_ctr:category_ctr::global, that is the value that we need to feed to the category_ctr_ref field. A document can reference many parent documents.
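The fully qualified ids seen in this series follow the pattern id:namespace:document-type::user-specified-id. A small hypothetical helper (`make_document_id` is not a pyvespa function) can build the value to feed into the reference field, assuming the namespace equals the schema name as in the example id above.

```python
def make_document_id(namespace, document_type, user_specified):
    """Build a fully qualified Vespa document id string."""
    return f"id:{namespace}:{document_type}::{user_specified}"

global_ctr_id = make_document_id("category_ctr", "category_ctr", "global")
print(global_ctr_id)
```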

from vespa.package import ImportedField

app_package.get_schema("news").add_imported_field(
    ImportedField(
        name="global_category_ctrs",
        reference_field="category_ctr_ref",
        field_to_import="ctrs",
    )
)

The imported field defines that we should import the ctrs field from the document referenced in the category_ctr_ref field. We name it global_category_ctrs, and we can reference it as attribute(global_category_ctrs) during ranking.

Tensor expressions in ranking

Each news document has a category field of type string indicating which category the document belongs to. We want to use this information to select the correct CTR score stored in the global_category_ctrs. Unfortunately, tensor expressions only work on tensors, so we need to add a new field of type tensor called category_tensor to hold category information in a way that can be used in a tensor expression:

app_package.get_schema("news").add_fields(
    Field(
        name="category_tensor",
        type="tensor<float>(category{})",
        indexing=["attribute"],
    )
)

With the category_tensor field as defined above, we can use the tensor expression sum(attribute(category_tensor) * attribute(global_category_ctrs)) to select the specific CTR related to the category of the document being ranked. We implement this expression as a Function in the rank-profile below:

from vespa.package import Function, RankProfile

app_package.get_schema("news").add_rank_profile(
    RankProfile(
        name="recommendation_with_global_category_ctr", 
        inherits="recommendation",
        functions=[
            Function(
                name="category_ctr", 
                expression="sum(attribute(category_tensor) * attribute(global_category_ctrs))"
            ),
            Function(
                name="nearest_neighbor", 
                expression="closeness(field, embedding)"
            )
            
        ],
        first_phase="nearest_neighbor * category_ctr",
        summary_features=[
            "attribute(category_tensor)", 
            "attribute(global_category_ctrs)", 
            "category_ctr", 
            "nearest_neighbor"
        ]
    )
)

In the new rank-profile, the first-phase ranking expression multiplies the nearest-neighbor score by the category CTR score, implemented by the functions nearest_neighbor and category_ctr, respectively. Plain multiplication is only a first attempt and might not be the best way to combine those two values.
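To make the category_ctr expression more concrete, here is a minimal sketch in plain Python that mimics what the tensor expression does. It is not how Vespa evaluates tensors internally; the dictionaries, the helper names sparse_multiply and tensor_sum, and the rounded CTR values are assumptions made for illustration.

```python
# Sparse tensors modeled as dicts that map a category label to a value.
# The CTR values are rounded and illustrative only.
category_tensor = {"sports": 1.0}
global_category_ctrs = {"sports": 0.0561, "tv": 0.0537, "travel": 0.0251}


def sparse_multiply(a, b):
    """Multiply cell by cell; only labels present in both tensors survive."""
    return {label: a[label] * b[label] for label in a.keys() & b.keys()}


def tensor_sum(t):
    """Reduce a sparse tensor to a scalar by summing its cells."""
    return sum(t.values())


# Equivalent in spirit to:
# sum(attribute(category_tensor) * attribute(global_category_ctrs))
category_ctr = tensor_sum(sparse_multiply(category_tensor, global_category_ctrs))
print(category_ctr)  # 0.0561
```

Because category_tensor holds a single cell with value 1.0, the product keeps only the CTR of the document's own category, and the sum turns that one-cell tensor into a scalar that can be multiplied with the nearest-neighbor score.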

Deploy

We can reuse the same container named news created in the first part of this tutorial.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Finished deployment.

Feed

Next, we will download the global category CTR data, already parsed into the format expected for a sparse tensor with the category dimension.

import requests, json

global_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/global_category_ctr_parsed.json").text
)
global_category_ctr
{
    'ctrs': {
        'cells': [
            {'address': {'category': 'entertainment'}, 'value': 0.029266420380943244},
            {'address': {'category': 'autos'}, 'value': 0.028475809103747123},
            {'address': {'category': 'tv'}, 'value': 0.05374837981352176},
            {'address': {'category': 'health'}, 'value': 0.03531784305129329},
            {'address': {'category': 'sports'}, 'value': 0.05611187986670051},
            {'address': {'category': 'music'}, 'value': 0.05471192953054426},
            {'address': {'category': 'news'}, 'value': 0.04420778372641991},
            {'address': {'category': 'foodanddrink'}, 'value': 0.029256852366228187},
            {'address': {'category': 'travel'}, 'value': 0.025144552013730358},
            {'address': {'category': 'finance'}, 'value': 0.03231013195899643},
            {'address': {'category': 'lifestyle'}, 'value': 0.04423279317474416},
            {'address': {'category': 'video'}, 'value': 0.04006693315980292},
            {'address': {'category': 'movies'}, 'value': 0.03335647459420146},
            {'address': {'category': 'weather'}, 'value': 0.04532171803495617},
            {'address': {'category': 'northamerica'}, 'value': 0.0},
            {'address': {'category': 'kids'}, 'value': 0.043478260869565216}
        ]
    }
}

We can feed this data point as a document of the category_ctr schema. We assign the id global to this document, so that other documents can refer to it with the Vespa document id id:category_ctr:category_ctr::global.

response = app.feed_data_point(schema="category_ctr", data_id="global", fields=global_category_ctr)
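As a rough illustration of how such a document id is put together, the sketch below builds the id string from its parts. The helper vespa_document_id is hypothetical and not part of pyvespa; it assumes, as the ids shown in this tutorial suggest, that the namespace is the same as the schema name.

```python
def vespa_document_id(schema, data_id, namespace=None):
    """Build an id of the form id:<namespace>:<document-type>::<user-specified-id>.

    Hypothetical helper: the namespace is assumed to default to the schema name.
    """
    namespace = schema if namespace is None else namespace
    return f"id:{namespace}:{schema}::{data_id}"


print(vespa_document_id("category_ctr", "global"))
# id:category_ctr:category_ctr::global
```

The same pattern explains the ids of the news documents, for example id:news:news::N3112.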

We need to perform a partial update on the news documents to set the reference field category_ctr_ref and the new category_tensor field, which holds the value 1.0 for the specific category associated with each document.

news_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/news_category_ctr_update_parsed.json").text
)
news_category_ctr[0]
{
    'id': 'N3112',
    'fields': {
        'category_ctr_ref': 'id:category_ctr:category_ctr::global',
        'category_tensor': {
            'cells': [
                { 'address': {'category': 'lifestyle'}, 'value': 1.0}
            ]
        }
    }
}
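The file above is already parsed, but an update of this shape could be built from a news id and its category along the following lines. This is only a sketch: the function name category_update and its default argument are assumptions, not part of the original parsing script.

```python
def category_update(news_id, category,
                    ctr_doc_id="id:category_ctr:category_ctr::global"):
    """Build a partial-update payload for one news document (hypothetical helper)."""
    return {
        "id": news_id,
        "fields": {
            # Reference to the single global CTR document.
            "category_ctr_ref": ctr_doc_id,
            # Sparse tensor with one cell: 1.0 for the document's own category.
            "category_tensor": {
                "cells": [{"address": {"category": category}, "value": 1.0}]
            },
        },
    }


update = category_update("N3112", "lifestyle")
```

Calling category_update("N3112", "lifestyle") reproduces the structure of the first element shown above.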

This takes ten minutes or so:

for data_point in news_category_ctr:
    response = app.update_data(schema="news", data_id=data_point["id"], fields=data_point["fields"])
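Since the loop runs for a while, it can be useful to track progress and collect the ids of updates that did not succeed. The sketch below is one possible way to do that. The helper apply_updates is hypothetical, and it assumes that the response object exposes a status_code attribute and that 200 means success; the stub fake_update only stands in for app.update_data so the example runs without a Vespa instance.

```python
from types import SimpleNamespace


def apply_updates(data_points, update_fn, schema="news", report_every=1000):
    """Send one update per data point and return the ids that failed.

    Assumption: update_fn returns an object with a status_code attribute.
    """
    failed = []
    for i, data_point in enumerate(data_points, start=1):
        response = update_fn(
            schema=schema, data_id=data_point["id"], fields=data_point["fields"]
        )
        if response.status_code != 200:
            failed.append(data_point["id"])
        if i % report_every == 0:
            print(f"{i}/{len(data_points)} updates sent, {len(failed)} failed")
    return failed


def fake_update(schema, data_id, fields):
    """Stub standing in for app.update_data: fails for one specific id."""
    return SimpleNamespace(status_code=404 if data_id == "missing" else 200)


sample = [{"id": "N1", "fields": {}}, {"id": "missing", "fields": {}}]
failed_ids = apply_updates(sample, fake_update)
print(failed_ids)  # ['missing']
```

With a running application, the call would look like apply_updates(news_category_ctr, app.update_data).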

Testing the new rank-profile

We will redefine the query_user_embedding function defined in the second part of this tutorial and use it to make a query involving the user U33527 and the recommendation_with_global_category_ctr rank-profile.

def parse_embedding(hit_json):
    # Copy the embedding values of a user hit into a plain Python list.
    return list(hit_json["fields"]["embedding"]["values"])

def query_user_embedding(user_id):
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding

yql = "select * from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding))"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U33527")),
        "ranking.profile": "recommendation_with_global_category_ctr"
    }
)

The first hit below is a sports article. The global CTR tensor is also listed in the summary features of the hit, and its CTR score for the sports category is 0.0561. Thus, the result of the category_ctr function is 0.0561, as intended. The nearest_neighbor score is 0.149, and the resulting relevance score, the product of the two, is 0.00837. So, this worked as expected.

{
    'id': 'id:news:news::N5316',
    'relevance': 0.008369192847921151,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N5316',
        'news_id': 'N5316',
        'category': 'sports',
        'subcategory': 'football_nfl',
        'title': "Matthew Stafford's status vs. Bears uncertain, Sam Martin will play",
        'abstract': "Stafford's start streak could be in jeopardy, according to Ian Rapoport.",
        'url': "https://www.msn.com/en-us/sports/football_nfl/matthew-stafford's-status-vs.-bears-uncertain,-sam-martin-will-play/ar-BBWwcVN?ocid=chopendata",
        'date': 20191112,
        'clicks': 0,
        'impressions': 1,
        'summaryfeatures': {
            'attribute(category_tensor)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'sports'}, 'value': 1.0}
                ]
            },
            'attribute(global_category_ctrs)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'entertainment'}, 'value': 0.029266420751810074},
                    {'address': {'category': 'autos'}, 'value': 0.0284758098423481},
                    {'address': {'category': 'tv'}, 'value': 0.05374838039278984},
                    {'address': {'category': 'health'}, 'value': 0.03531784191727638},
                    {'address': {'category': 'sports'}, 'value': 0.05611187964677811},
                    {'address': {'category': 'music'}, 'value': 0.05471193045377731},
                    {'address': {'category': 'news'}, 'value': 0.04420778527855873},
                    {'address': {'category': 'foodanddrink'}, 'value': 0.029256852343678474},
                    {'address': {'category': 'travel'}, 'value': 0.025144552811980247},
                    {'address': {'category': 'finance'}, 'value': 0.032310131937265396},
                    {'address': {'category': 'lifestyle'}, 'value': 0.044232793152332306},
                    {'address': {'category': 'video'}, 'value': 0.040066931396722794},
                    {'address': {'category': 'movies'}, 'value': 0.033356472849845886},
                    {'address': {'category': 'weather'}, 'value': 0.045321717858314514},
                    {'address': {'category': 'northamerica'}, 'value': 0.0},
                    {'address': {'category': 'kids'}, 'value': 0.043478261679410934}
                ]
            },
            'rankingExpression(category_ctr)': 0.05611187964677811,
            'rankingExpression(nearest_neighbor)': 0.14915188666574342,
            'vespa.summaryFeatures.cached': 0.0
        }
    }
}
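The numbers in the hit above can be checked by hand. The short sketch below copies the values from the summary features and recomputes the first-phase expression nearest_neighbor * category_ctr. The comparison uses a relative tolerance because the recomputed product differs slightly in the last digits; that this comes from floating-point precision inside Vespa is an assumption.

```python
import math

# Values copied from the summary features of the hit above.
category_tensor = {"sports": 1.0}
global_ctr_sports = 0.05611187964677811
nearest_neighbor = 0.14915188666574342
reported_relevance = 0.008369192847921151

# category_ctr: the single non-zero cell selects the sports CTR.
category_ctr = category_tensor["sports"] * global_ctr_sports

# First-phase expression: nearest_neighbor * category_ctr.
relevance = nearest_neighbor * category_ctr

matches = math.isclose(relevance, reported_relevance, rel_tol=1e-6)
print(matches)  # True
```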

Conclusion

This tutorial introduced parent-child relationships and demonstrated them through a global CTR feature that we used in ranking. We also introduced ranking with (sparse) tensor expressions.

Clean up Docker container instances:

vespa_docker.container.stop()
vespa_docker.container.remove()