Blog recommendation in Vespa | Vespa Blog

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.

Introduction

This post builds upon the previous blog search application and extends the basic search engine with machine-learned models that recommend blog posts to users arriving at our application. Assume that once a user arrives, we obtain their user identification number, denoted here by user_id, send it to Vespa, and expect to obtain a blog post recommendation list containing 100 blog posts tailored for that specific user.


Collaborative Filtering

We will start our recommendation system by implementing the collaborative filtering algorithm for implicit feedback described in (Hu et al. 2008). The data is said to be implicit because the users did not explicitly rate each blog post they read. Instead, they have “liked” blog posts they likely enjoyed (positive feedback) but had no way to “dislike” blog posts they did not enjoy (absence of negative feedback). Because of that, implicit feedback is said to be inherently noisy: the fact that a user did not “like” a blog post might have many different reasons unrelated to their feelings about that blog post.

In terms of modeling, a big difference between explicit and implicit feedback datasets is that the ratings in explicit feedback datasets are typically unknown for the majority of user-item pairs, and are treated as missing values and ignored by the training algorithm. For an implicit dataset, we instead assume a rating of zero when the user has not liked a blog post. To encode the fact that a value of zero could come from different reasons, we use the concept of confidence introduced by (Hu et al. 2008), which gives positive feedback a higher weight than negative feedback.

Once we train the collaborative filtering model, we will have one vector representing a latent factor for each user and item contained in the training set. Those vectors will later be used in the Vespa ranking framework to make recommendations to a user based on the dot product between the user and document latent factors. An obvious problem with this approach is that new users and new documents will not have latent factors available. This is the well-known cold-start problem and will be addressed with content-based techniques described in future posts.

Evaluation metrics

The evaluation metric used by Kaggle for this challenge was Mean Average Precision at 100 (MAP@100). However, we have no information about which blog posts the users did not like (only positive feedback is available), and we cannot observe user behavior in response to the recommendations we make (this is an offline evaluation, unlike the A/B testing usually performed by companies that run recommendation systems). For those reasons we make a similar remark to the one in (Hu et al. 2008) and prefer recall-oriented measures. Following (Hu et al. 2008), we will use the expected percentile ranking.
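
To make the metric concrete, below is a minimal Python sketch of the expected percentile ranking from (Hu et al. 2008), assuming binary feedback so every held-out like carries the same weight. The two input structures, recommendations (user_id to ranked list of post_ids) and held_out_likes (user_id to set of liked post_ids), are hypothetical.

def expected_percentile_ranking(recommendations, held_out_likes):
    """Return the metric in percent; lower is better, ~50% equals random."""
    rank_sum, hits = 0.0, 0
    for user_id, liked in held_out_likes.items():
        ranked = recommendations.get(user_id, [])
        if len(ranked) < 2:
            continue
        for position, post_id in enumerate(ranked):
            if post_id in liked:
                # percentile rank: 0.0 at the top of the list, 1.0 at the bottom
                rank_sum += position / (len(ranked) - 1)
                hits += 1
    return 100 * rank_sum / hits if hits else None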

Evaluation Framework

Generate training and test sets

In order to evaluate the gains obtained by the recommendation system when we start to improve it with more accurate algorithms, we will split the dataset we have available into training and test sets. The training set will contain document (blog post) and user action (likes) pairs as well as any information available about the documents contained in the training set. There is no additional information about the users besides the blog posts they have liked. The test set will be formed by a series of documents available to be recommended and a set of users to whom we need to make recommendations. This list of test set documents constitutes the Vespa content pool, which is the set of documents stored in Vespa that are available to be served to users. The user actions will be hidden from the test set and used later to evaluate the recommendations made by Vespa.

To create an application that more closely resembles the challenges faced by companies when building their recommendation systems, we decided to construct the training and test sets in such a way that:

  • There will be blog posts that were liked in the training set by one set of users and also liked in the test set by another set of users, even though this information is hidden in the test set. Those cases are interesting for evaluating whether the exploitation (as opposed to exploration) component of the system is working well. That is, whether we are able to identify high-quality blog posts based on the information available during training, and exploit this knowledge by recommending those blog posts to another set of users that might like them as well.
  • There will be blog posts in the test set that have never been seen in the training set. Those cases are interesting for evaluating how the system deals with the cold-start problem. Systems that are too biased towards exploitation will fail to recommend new and unexplored blog posts, leading to a feedback loop that causes the system to focus on a small share of the available content.

A key challenge faced by recommender system designers is how to balance the exploitation/exploration components of their system, and our training/test set split outlined above will try to replicate this challenge in our application. Notice that this split is different from the approach taken by the Kaggle competition where the blog posts available in the test set had never been seen in the training set, which removes the exploitation component of the equation.

The Spark job uses trainPosts.json and creates the folders blog-job/training_set_ids and blog-job/test_set_ids containing files with post_id and user_id pairs:

$ cd blog-recommendation; export SPARK_LOCAL_IP="127.0.0.1"
$ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \
  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \
  --task split_set --input_file ../trainPosts.json \
  --test_perc_stage1 0.05 --test_perc_stage2 0.20 --seed 123 \
  --output_path blog-job/training_and_test_indices
  • test_perc_stage1: The percentage of the blog posts that will be located only on the test set (exploration component).
  • test_perc_stage2: The percentage of the remaining (post_id, user_id) pairs that should be moved to the test set (exploitation component).
  • seed: seed value used in order to replicate results if required.

Compute user and item latent factors

Use the complete training set to compute user and item latent factors. We will leave the discussion about tuning and performance improvement of the model used to the section about model tuning and offline evaluation. Submit the Spark job to compute the user and item latent factors:

$ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \
  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \
  --task collaborative_filtering \
  --input_file blog-job/training_and_test_indices/training_set_ids \
  --rank 10 --numIterations 10 --lambda 0.01 \
  --output_path blog-job/user_item_cf
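
The Spark job itself is written in Scala, but for readers more familiar with Python, the following PySpark sketch does roughly the equivalent: alternating least squares with implicit feedback (Hu et al. 2008), matching the --rank, --numIterations and --lambda flags above. The input path and column handling here are assumptions.

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("blog-cf").getOrCreate()

# Each (user_id, post_id) pair in the training split is one positive
# implicit observation, so the "rating" is simply 1.
pairs = (spark.read.json("blog-job/training_and_test_indices/training_set_ids")
              .select(F.col("user_id").cast("int"), F.col("post_id").cast("int"))
              .withColumn("rating", F.lit(1.0)))

als = ALS(rank=10, maxIter=10, regParam=0.01, implicitPrefs=True,
          userCol="user_id", itemCol="post_id", ratingCol="rating")
model = als.fit(pairs)

# model.userFactors and model.itemFactors hold the 10-dimensional vectors
model.userFactors.show(2, truncate=False)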

Verify the vectors for the latent factors for users and posts:

$ head -1 blog-job/user_item_cf/user_features/part-00000 | python -m json.tool
{
    "user_id": 270,
    "user_item_cf": {
        "user_item_cf:0": -1.750116e-05,
        "user_item_cf:1": 9.730623e-05,
        "user_item_cf:2": 8.515047e-05,
        "user_item_cf:3": 6.9297894e-05,
        "user_item_cf:4": 7.343942e-05,
        "user_item_cf:5": -0.00017635927,
        "user_item_cf:6": 5.7642872e-05,
        "user_item_cf:7": -6.6685796e-05,
        "user_item_cf:8": 8.5506894e-05,
        "user_item_cf:9": -1.7209566e-05
    }
}
$ head -1 blog-job/user_item_cf/product_features/part-00000 | python -m json.tool
{
    "post_id": 20,
    "user_item_cf": {
        "user_item_cf:0": 0.0019320602,
        "user_item_cf:1": -0.004728486,
        "user_item_cf:2": 0.0032499845,
        "user_item_cf:3": -0.006453364,
        "user_item_cf:4": 0.0015929453,
        "user_item_cf:5": -0.00420313,
        "user_item_cf:6": 0.009350027,
        "user_item_cf:7": -0.0015649397,
        "user_item_cf:8": 0.009262732,
        "user_item_cf:9": -0.0030964287
    }
}

At this point, the vectors with latent factors can be added to posts and users.

Add vectors to search definitions using tensors

Modern machine learning applications often make use of large, multidimensional feature spaces and perform complex operations on those features, such as in large logistic regression and deep learning models. It is therefore necessary to have an expressive framework to define and evaluate ranking expressions of such complexity at scale.

Vespa comes with a Tensor framework, which unifies and generalizes scalar, vector and matrix operations, handles the sparseness inherent to most machine learning applications (most cases evaluated by the model lack values for most of the features) and allows models to be continuously updated. Additional information about the Tensor framework can be found in the tensor user guide.

We want to have those latent factors available in a Tensor representation to be used during ranking by the Tensor framework. A tensor field user_item_cf is added to blog_post.sd to hold the blog post latent factor:

field user_item_cf type tensor(user_item_cf[10]) {
	indexing: summary | attribute
	attribute: tensor(user_item_cf[10])
}

field has_user_item_cf type byte {
	indexing: summary | attribute
	attribute: fast-search
}

A new search definition user.sd defines a document type named user to hold information for users:

search user {
    document user {
        field user_id type string {
            indexing: summary | attribute
            attribute: fast-search
        }

        field has_read_items type array<string> {
            indexing: summary | attribute
        }

        field user_item_cf type tensor(user_item_cf[10]) {
            indexing: summary | attribute
            attribute: tensor(user_item_cf[10])
        }

        field has_user_item_cf type byte {
            indexing: summary | attribute
            attribute: fast-search
        }
    }
}

Where:

  • user_id: unique identifier for the user
  • user_item_cf: tensor that will hold the user latent factor
  • has_user_item_cf: flag to indicate the user has a latent factor

Join and feed data

Build the application, then deploy it (in the Docker container):

$ vespa-deploy prepare /vespa-sample-apps/blog-recommendation/target/application && \
  vespa-deploy activate

Wait for app to activate (200 OK):

$ curl -s --head http://localhost:8080/ApplicationStatus

The code to join the latent factors in blog-job/user_item_cf into blog_post and user documents is implemented in tutorial_feed_content_and_tensor_vespa.pig. After joining in the new fields, a Vespa feed is generated and fed to Vespa directly from Pig:

$ pig -Dvespa.feed.defaultport=8080 -Dvespa.feed.random.startup.sleep.ms=0 \
  -x local \
  -f ../blog-tutorial-shared/src/main/pig/tutorial_feed_content_and_tensor_vespa.pig \
  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \
  -param DATA_PATH=../trainPosts.json \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param BLOG_POST_FACTORS=blog-job/user_item_cf/product_features \
  -param USER_FACTORS=blog-job/user_item_cf/user_features \
  -param ENDPOINT=localhost

A successful data join and feed will output:

Input(s):
Successfully read 1196111 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/trainPosts.json"
Successfully read 341416 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/training_and_test_indices/testing_set_ids"
Successfully read 323727 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/product_features"
Successfully read 6290 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/user_features"

Output(s):
Successfully stored 286237 records in: "localhost"


Ranking

Set up a rank function to return the best matching blog posts given some user latent factor. Rank the documents using a dot product between the user and blog post latent factors, i.e. the query tensor and blog post tensor dot product (sum of the product of the two tensors) – from blog_post.sd:

rank-profile tensor {
    first-phase {
        expression {
            sum(query(user_item_cf) * attribute(user_item_cf))
        }
    }
}

Configure the ranking framework to expect that query(user_item_cf) is a tensor, and that it is compatible with the attribute in a query profile type – see search/query-profiles/types/root.xml and search/query-profiles/default.xml:

<query-profile-type id="root" inherits="native">
    <field name="ranking.features.query(user_item_cf)" type="tensor(user_item_cf[10])" />
</query-profile-type>

<query-profile id="default" type="root" />

This configures a ranking feature named query(user_item_cf) with type tensor(user_item_cf[10]), which defines it as an indexed tensor with 10 elements. This is the same as the attribute, hence the dot product can be computed.

Query Vespa with a tensor

Test recommendations by sending a tensor with latent factors: localhost:8080/search/?yql=select%20*%20from%20sources%20blog_post%20where%20has_user_item_cf%20=%201;&ranking=tensor&ranking.features.query(user_item_cf)=%7B%7Buser_item_cf%3A0%7D%3A0.1%2C%7Buser_item_cf%3A1%7D%3A0.1%2C%7Buser_item_cf%3A2%7D%3A0.1%2C%7Buser_item_cf%3A3%7D%3A0.1%2C%7Buser_item_cf%3A4%7D%3A0.1%2C%7Buser_item_cf%3A5%7D%3A0.1%2C%7Buser_item_cf%3A6%7D%3A0.1%2C%7Buser_item_cf%3A7%7D%3A0.1%2C%7Buser_item_cf%3A8%7D%3A0.1%2C%7Buser_item_cf%3A9%7D%3A0.1%7D

The query string, decomposed:

  • yql=select * from sources blog_post where has_user_item_cf = 1 – this selects all documents of type blog_post which have a latent factor tensor
  • restrict=blog_post – search only in blog_post documents
  • ranking=tensor – use the rank-profile tensor in blog_post.sd.
  • ranking.features.query(user_item_cf) – send the tensor as user_item_cf. As this tensor is defined in the query-profile-type, the ranking framework knows its type (i.e. dimensions) and is able to do a dot product with the attribute of same type. The tensor before URL-encoding:

    {
     {user_item_cf:0}:0.1,
     {user_item_cf:1}:0.1,
     {user_item_cf:2}:0.1,
     {user_item_cf:3}:0.1,
     {user_item_cf:4}:0.1,
     {user_item_cf:5}:0.1,
     {user_item_cf:6}:0.1,
     {user_item_cf:7}:0.1,
     {user_item_cf:8}:0.1,
     {user_item_cf:9}:0.1
    }
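
Rather than hand-encoding the tensor, the query URL above can be assembled with a few lines of Python; urlencode takes care of the percent-encoding. The endpoint and field names follow this tutorial's setup.

from urllib.parse import urlencode

cells = ",".join("{{user_item_cf:{}}}:0.1".format(i) for i in range(10))
params = {
    "yql": "select * from sources blog_post where has_user_item_cf = 1;",
    "ranking": "tensor",
    "ranking.features.query(user_item_cf)": "{" + cells + "}",
}
print("http://localhost:8080/search/?" + urlencode(params))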

Query Vespa with user id

The next step is to query Vespa by user id: look up the user profile for the user, get the tensor from it, and recommend documents based on this tensor (like the query in the previous section). The user profiles are fed to Vespa in the user_item_cf field of the user document type.

In short, set up a searcher to retrieve the user profile by user id – then run the query. When the Vespa Container receives a request, it creates a Query representing it and executes the configured chain of Searcher components; our custom searcher looks up the user profile and adds the tensor to the query before it is passed on.
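
The tutorial implements this as a Java Searcher inside the container; the hypothetical Python sketch below performs the same two steps client-side to illustrate the logic, reading the user's latent factor from the Spark job output shown earlier and issuing the same tensor query as in the previous section.

import json
import requests

def recommend(user_id, hits=100):
    # Step 1: look up the user's latent factor vector. The Searcher does this
    # with an internal query against the user document type; here we read the
    # Spark output file directly for simplicity.
    with open("blog-job/user_item_cf/user_features/part-00000") as f:
        factors = {rec["user_id"]: rec["user_item_cf"]
                   for rec in (json.loads(line) for line in f)}
    user_cf = factors[user_id]
    tensor = "{" + ",".join(
        "{{user_item_cf:{}}}:{}".format(i, user_cf["user_item_cf:{}".format(i)])
        for i in range(10)) + "}"

    # Step 2: rank blog posts by the dot product with that tensor.
    return requests.get("http://localhost:8080/search/", params={
        "yql": "select * from sources blog_post where has_user_item_cf = 1;",
        "ranking": "tensor",
        "ranking.features.query(user_item_cf)": tensor,
        "hits": hits,
    }).json()

print(recommend(270)["root"]["fields"]["totalCount"])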

Blog recommendation with neural network models

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.

Introduction

The main objective of this post is to show how to deploy neural network models in Vespa using our Tensor Framework. In fact, any model that can be represented by a series of Tensor operations can be deployed in Vespa; neural networks are just a popular example. In addition, we will introduce the multi-phase ranking support available in Vespa, which can be used to run more expensive models in a later phase, over a reduced number of documents returned by previous phases. This feature allows us to run models that would be prohibitively expensive if we had to evaluate them at query time across all the documents indexed in Vespa.

Model Training

In this section, we will define a neural network model, show how we created a suitable dataset to train the model and train the model using TensorFlow.

The neural network model

In the previous blog post, we computed latent factors for each user and each document and then used a dot-product between user and document vectors to rank the documents available for recommendation to a specific user. In this tutorial we will train a 2-layer fully connected neural network model that will take the same user (u) and document (d) latent factors as input and will output the probability of that specific user liking the document.

More technically, our previous rank function r was given by

r(u,d)=u∗d

while in this tutorial it will be given by

r(u,d,θ)=f(u,d,θ)

where f represents the neural network model described below and θ denotes the neural network parameters that we need to learn from training data.

The specific form of the neural network model used here is

p = sigmoid(h1×W2+b2)
h1 = ReLU(x×W1+b1)

where x=[u,d] is the concatenation of the user and document latent factors, ReLU is the rectifier activation function, sigmoid represents the sigmoid function, and p is the output of the model, which in this case can be interpreted as the probability of the user u liking a blog post d. The parameters of the model are represented by θ=(W1,W2,b1,b2).
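
For concreteness, here is the forward pass in plain NumPy, with the sizes used later in this tutorial: 10-dimensional user and document factors (so x has size 20) and a hidden layer of size 40. The random parameters below are stand-ins for the trained values.

import numpy as np

def forward(u, d, W1, b1, W2, b2):
    x = np.concatenate([u, d])         # x = [u, d], shape (20,)
    h1 = np.maximum(0, x @ W1 + b1)    # h1 = ReLU(x*W1 + b1), shape (40,)
    return 1 / (1 + np.exp(-(h1 @ W2 + b2)))  # p = sigmoid(h1*W2 + b2)

rng = np.random.default_rng(0)
u, d = rng.normal(size=10), rng.normal(size=10)
W1, b1 = rng.normal(size=(20, 40)), np.zeros(40)
W2, b2 = rng.normal(size=(40, 1)), np.zeros(1)
print(forward(u, d, W1, b1, W2, b2))   # probability of user u liking post d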

Training data

For the training dataset, we will start with the (user_id, post_id) rows from the “training_set_ids” generated previously. Then, we remove every row for which there are no latent factors for the user_id or post_id contained in that row. This gives us a dataset with only positive feedback (label = 1), since each row represents one instance of a user_id liking a post_id.

In order to train our model, we need to generate negative feedback (label = 0). So, for each row (user_id, post_id) in the current dataset we will generate N negative feedback rows by randomly sampling post_id_fake from the pool of post_id’s available in the current set, so that for each (user_id, post_id) row with label = 1 we will increase the dataset with N (user_id, post_id_fake) rows with label = 0.
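
A minimal sketch of this sampling scheme, assuming the positive pairs are already loaded as (user_id, post_id) tuples:

import random

def add_negative_samples(positive_pairs, n_negative=5, seed=123):
    rng = random.Random(seed)
    posts = sorted({post_id for _, post_id in positive_pairs})
    liked = set(positive_pairs)
    rows = [(user_id, post_id, 1) for user_id, post_id in positive_pairs]
    for user_id, _ in positive_pairs:
        added = 0
        while added < n_negative:
            fake = rng.choice(posts)
            if (user_id, fake) not in liked:  # never label a real like as negative
                rows.append((user_id, fake, 0))
                added += 1
    return rows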

Find code to generate the dataset in the utility scripts.

Training with TensorFlow

With the training data in hand, we split it into an 80% training set and a 20% validation set and used TensorFlow to train the model. The script can be found in the utility scripts and executed with:

$ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \
                       --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \
                       --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt

The progress of the training can be visualized using TensorBoard:

$ tensorboard --logdir runs/*/summaries/

Model deployment in Vespa

Two Phase Ranking

When a query is sent to Vespa, it will scan all documents available and select the ones (possibly all) that match the query. When the set of documents matching a query is found, Vespa must decide the order of these documents. Unless explicit sorting is used, Vespa decides this order by calculating a number for each document, the rank score, and sorts the documents by this number.

The rank score can be any function that takes as arguments parameters sent by the query, document attributes defined in search definitions, and global parameters not directly linked to query or document parameters. One example of a rank score is the output of the neural network model defined in this tutorial. The model takes the latent factor u associated with a specific user_id (query parameter), the latent factor d associated with the document post_id (document attribute) and the learned model parameters (global parameters not related to a specific query nor document), and returns the probability that user u will like document d.

However, even though Vespa is designed to carry out such calculations optimally, complex expressions become expensive when they must be calculated over every one of a large set of matching documents. To relieve this, Vespa can be configured to run two ranking expressions – a smaller and less accurate one on all hits during the matching phase, and a more expensive and accurate one only on the best hits during the reranking phase. In general this allows a more optimal usage of the CPU budget by dedicating more of the total CPU towards the best candidate hits.

The reranking phase, if specified, will by default be run on the 100 best hits on each search node, after matching and before information is returned upwards to the search container. The number of hits to rerank can be turned up or down as needed. Below is a toy example showing how to configure first and second phase ranking expressions in the rank profile section of search definitions where the second phase rank expression is run on the 200 best hits from first phase on each search node.

search myapp {

    …

    rank-profile default inherits default {

        first-phase {
            expression: nativeRank + query(deservesFreshness) * freshness(timestamp)
        }

        second-phase {
            expression {
                0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +
                0.3 * attributeMatch(keywords)
            }
            rerank-count: 200
        }
    }
}

Constant Tensor files

Once the model has been trained in TensorFlow, export the model parameters (W1,W2,b1,b2) to the application folder as Tensors according to the Vespa Document JSON format.

The complete code to serialize the model parameters using the Vespa Tensor format can be found in the utility scripts, but the following code snippet shows how to serialize the hidden layer weights W1:

serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])

Note that Vespa currently requires dimension names for all the Tensor dimensions (in this case W1 is a matrix, so it has two dimensions).

In the following section, we will add the code below to the blog_post search definition so that the constant tensor W_hidden can be used in our ranking expression.

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

A constant tensor is data that is not specific to a given document type. In the case above we define W_hidden to be a tensor with two dimensions (matrix), where the first dimension is named input and has size 20 and second dimension is named hidden and has size 40. The data were serialized to a JSON file located at constants/W_hidden.json relative to the application package folder.
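
As a sketch of what such a file contains, the snippet below writes a weight matrix in the constant tensor JSON format, one cell per (input, hidden) coordinate; the zero matrix is a stand-in for the trained W1.

import json, os
import numpy as np

def serialize_matrix(W, dim_names, path):
    cells = [{"address": {dim_names[0]: str(i), dim_names[1]: str(j)},
              "value": float(W[i, j])}
             for i in range(W.shape[0]) for j in range(W.shape[1])]
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"cells": cells}, f)

serialize_matrix(np.zeros((20, 40)), ["input", "hidden"], "constants/W_hidden.json")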

Vespa ranking expressions

In order to evaluate the neural network model trained with TensorFlow in the previous section, we need to translate the model structure to a Vespa ranking expression defined in the blog_post search definition. To honor a low-latency response, we will take advantage of the Two Phase Ranking available in Vespa and define the first-phase ranking to be the same ranking function used in the previous blog post, which is a dot-product between the user and document latent factors. After the documents have been sorted by the first-phase ranking function, we will rerank the top 200 documents from each search node using the second-phase ranking given by the neural network model presented above.

Note that we define two ranking profiles in the search definition below. This allows us to decide which ranking profile to use at query time. We define a ranking profile named tensor, which only applies the dot-product between user and document latent factors for all matching documents, and a ranking profile named nn_tensor, which reranks the top 200 documents using the neural network model discussed in the previous section.

We will walk through each part of the blog_post search definition, see blog_post.sd.

As always, we start the search definition with the following line:

    search blog_post {

We define the document type blog_post the same way as in the previous tutorial.

    document blog_post {

      # Field definitions
      # Examples:

      field date_gmt type string {
          indexing: summary
      }
      field language type string {
          indexing: summary
      }

      # Remaining fields as found in previous tutorial

    }

We define a ranking profile named tensor which rank all the matching documents by the dot-product between the document latent factor and the user latent factor. This is the same ranking expression used in the previous tutorial, which include code to retrieve the user latent factor based on the user_id sent by the query to Vespa.

    # Simpler ranking profile without
    # second-phase ranking
    rank-profile tensor {
      first-phase {
          expression {
              sum(query(user_item_cf) * attribute(user_item_cf))
          }
      }
    }

Since we want to evaluate the neural network model we have trained, we need to define where to find the model parameters (W1,W2,b1,b2). See the previous section for how to write the TensorFlow model parameters to Vespa Tensor format.

    # We need to specify the type and the location
    # of the files storing tensor values for each
    # Variable in our TensorFlow model. In this case,
    # W_hidden, b_hidden, W_final, b_final

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

    constant b_hidden {
        file: constants/b_hidden.json
        type: tensor(hidden[40])
    }

    constant W_final {
        file: constants/W_final.json
        type: tensor(hidden[40], final[1])
    }

    constant b_final {
        file: constants/b_final.json
        type: tensor(final[1])
    }

Now, we specify a second rank-profile called nn_tensor that will use the same first phase as the rank-profile tensor but will rerank the top 200 documents using the neural network model as second phase. We refer to the Tensor Reference document for more information regarding the Tensor operations used in the code below.

    # rank profile with neural network model as
    # second phase
    rank-profile nn_tensor {

        # The input to the neural network is the
        # concatenation of the document and query vectors.

        macro nn_input() {
            expression: concat(attribute(user_item_cf), query(user_item_cf), input)
        }

        # Computes the hidden layer

        macro hidden_layer() {
            expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))
        }

        # Computes the output layer

        macro final_layer() {
            expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))
        }


        # First-phase ranking:
        # Dot-product between user and document latent factors

        first-phase {
            expression: sum(query(user_item_cf) * attribute(user_item_cf))
        }

        # Second-phase ranking:
        # Neural network model based on the user and latent factors

        second-phase {
            rerank-count: 200
            expression: sum(final_layer)
        }

    }

}

Offline evaluation

We will now query Vespa and obtain 100 blog post recommendations for each user_id in our test set. Below, we query Vespa using the tensor ranking function, which contains the simpler ranking expression involving the dot-product between user and document latent factors.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=tensor \
  -param OUTPUT=blog-job/cf-metric

We perform the same query routine below, but now using the rank-profile nn_tensor, which reranks the top 200 documents using the neural network model.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=nn_tensor \
  -param OUTPUT=blog-job/cf-metric

The tutorial_compute_metric.pig script can be found in our repo.

Comparing the recommendations obtained by those two ranking profiles against our test set, we see that by deploying a more complex and accurate model in the second-phase ranking, we increased the number of relevant documents (documents read by the user) retrieved from 11948 to 12804 (a more than 7% increase), and those documents appeared higher up in the list of recommendations, as shown by the expected percentile ranking metric introduced in the Vespa tutorial pt. 2, which decreased from 37.1%.

E-commerce search and recommendation with Vespa.ai

Holiday shopping season is upon us and it’s time for a blog post on E-commerce search and recommendation using Vespa.ai. Vespa.ai is used as the search and recommendation backend at multiple Yahoo e-commerce sites in Asia, like tw.buy.yahoo.com.

This blog post discusses some of the challenges in e-commerce search and recommendation, and shows how they can be solved using the features of Vespa.ai.


Photo by Jonas Leupe on Unsplash

E-commerce search has text ranking requirements where traditional text ranking features like BM25 or TF-IDF might produce poor results. For an introduction to some of the issues with TF-IDF/BM25, see the influence of TF-IDF algorithms in e-commerce search. One example from the blog post is a search for ipad 2, which with traditional TF-IDF ranking will rank ‘black mini ipad cover, compatible with ipad 2’ higher than ‘Ipad 2’ because the former product description has several occurrences of the query terms Ipad and 2.

Vespa allows developers and relevancy engineers to fine-tune the text ranking features to meet domain-specific ranking challenges. For example, developers can control whether multiple occurrences of a query term in the matched text should impact the relevance score. See text ranking occurrence tables and Vespa text ranking types for in-depth details. The Vespa text ranking features also take text proximity into account in the relevancy calculation, i.e. how close the query terms appear in the matched text. BM25/TF-IDF, on the other hand, does not take query term proximity into account at all. Vespa also implements BM25, but it is up to the relevancy engineer to choose which of the rich set of built-in text ranking features in Vespa to use.

Vespa uses OpenNLP for linguistic processing like tokenization and stemming with support for multiple languages (as supported by OpenNLP).

Your manager might tell you that certain items of the product catalog should be prominent in the search results. How do you tackle this with your existing search solution? Maybe by adding some synthetic query terms to the original user query, maybe by using separate indexes with federated search, or even with a key-value store which is rarely in sync with the product catalog search index?

With Vespa it’s easy to promote content as Vespa’s ranking framework is just math and allows the developer to formulate the relevancy scoring function explicitly without having to rewrite the query formulation. Vespa controls ranking through ranking expressions configured in rank profiles which enables full control through the expressive Vespa ranking expression language. The rank profile to use is chosen at query time so developers can design multiple ranking profiles to rank documents differently based on query intent classification. See later section on query classification for more details how query classification can be done with Vespa.

A sample ranking profile which implements a tiered relevance scoring function, where sponsored or promoted items are always ranked above non-sponsored documents, is shown below. The ranking profile is applied to all documents which match the query formulation, and the relevance score of a hit is assigned the value of the first-phase expression. Vespa also supports multi-phase ranking.

Sample hand crafted ranking profile defined in the Vespa application package.

The above example is hand-crafted, but for optimal relevance we recommend looking at learning-to-rank (LTR) methods. See Learning to Rank using TensorFlow Ranking and Learning to Rank using XGBoost. The trained MLR models can be used in combination with the specific business ranking logic. In the example above we could replace the default ranking function with the trained MLR model, hence combining business logic with MLR models.

Guiding the user through the product catalog by guided navigation or faceted search is a feature users expect from an e-commerce search solution today, and with Vespa, facets and guided navigation are easily implemented by the powerful Vespa Grouping Language.

Sample screenshot from Vespa e-commerce sample application UI demonstrating search facets using Vespa Grouping Language.

The Vespa grouping language supports deep nested grouping and aggregation operations over the matched content. The language also allows pagination within the group(s). For example if grouping hits by category and displaying top 3 ranking hits per category the language allows paginating to render more hits from a specified category group.

Studies (e.g. this study from FlipKart) find that a significant fraction of queries in e-commerce search suffer from vocabulary mismatch between the user's query formulation and the relevant product descriptions in the product catalog. For example, the query “ladies pregnancy dress” would not match a product with description “women maternity gown” due to vocabulary mismatch between the query and the product description. Traditional Information Retrieval (IR) methods like TF-IDF/BM25 would fail to retrieve the relevant product right off the bat.

Most techniques currently used to try to tackle the vocabulary mismatch problem are built around query expansion. With the recent advances in NLP using transfer learning with large pre-trained language models, we believe that future solutions will be built around multilingual semantic retrieval using text embeddings from pre-trained deep neural network language models. Vespa has recently announced a sample application on semantic retrieval which addresses the vocabulary mismatch problem as the retrieval is not based on query terms alone, but instead based on the dense text tensor embedding representation of the query and the document. The mentioned sample app reproduces the accuracy of the retrieval model described in the Google blog post about Semantic Retrieval.

Taking our query and product title example from the section above, which suffers from vocabulary mismatch, and moving from the textual representation to the respective dense tensor embedding representations, we find that the semantic similarity between them is high (0.93). The high semantic similarity means that the relevant product would be retrieved when using semantic retrieval. The semantic similarity is in this case defined as the cosine similarity between the dense tensor embedding representations of the query and the product description. Vespa has strong support for expressing and storing tensor fields, over which one can perform tensor operations (e.g. cosine similarity) for ranking; this functionality is demonstrated in the mentioned sample application.
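
As a sketch, the cosine similarity itself is a one-liner once the embeddings are in hand; the random vectors below are stand-ins for sentence-encoder outputs.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = np.random.rand(512)    # stand-in for embed("ladies pregnancy dress")
product_embedding = np.random.rand(512)  # stand-in for embed("women maternity gown")
print(cosine_similarity(query_embedding, product_embedding))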

Below is a simple matrix comparing the semantic similarity of three pairs of (query, product description). The tensor embeddings of the textual representations are obtained with the Universal Sentence Encoder from Google.

Semantic similarity matrix of different queries and product descriptions.

The Universal Sentence Encoder Model from Google is multilingual as it was trained on text from multiple languages. Using these text embeddings enables multilingual retrieval so searches written in Chinese can retrieve relevant products by descriptions written in multiple languages. This is another nice property of semantic retrieval models which is particularly useful in e-commerce search applications with global reach.

Vespa supports deploying stateless machine learned (ML) models, which comes in handy when doing query classification. Machine-learned models which classify the query are commonly used in e-commerce search solutions, and the recent advances in natural language processing (NLP) using pre-trained deep neural language models have improved the accuracy of text classification models significantly. See e.g. text classification using BERT for an illustrated guide to text classification using BERT. Vespa supports deploying ML models built with TensorFlow, XGBoost and PyTorch through the Open Neural Network Exchange (ONNX) format. ML models trained with the mentioned tools can successfully be used for various query classification tasks with high accuracy.

In e-commerce search, classifying the intent of the query or query session can help rank the results by using an intent-specific ranking profile tailored to that query intent. The intent classification can also determine how the result page is displayed and organised.

Consider a category browse intent query like ‘shoes for men’. Such a query intent might benefit from a query rewrite which limits the result set to items matching the unambiguous category id, instead of just searching the product description or category fields for ‘shoes for men’. Ranking could also change based on the query classification, by using a ranking profile which gives higher weight to signals like popularity or price than to text ranking features.

Vespa also features a powerful query rewriting language which supports rule based query rewrites, synonym expansion and query phrasing.

Vespa is commonly used for recommendation use cases and e-commerce is no exception.

Vespa is able to evaluate complex machine-learned (ML) models over many data points (documents, products) in user time, which allows the ML model to use real-time signals derived from the current user's online shopping session (e.g. products browsed, queries performed, time of day) as model features. An offline, batch-oriented inference architecture would not be able to use these important real-time signals. By batch-oriented inference architecture we mean pre-computing the inference offline for a set of users or products, where the model inference results are stored in a key-value store for online retrieval.

Update 2021-05-20: Blog recommendation tutorials are replaced by the
News search and recommendation tutorial:

In our blog recommendation tutorial
we demonstrate how to apply a collaborative filtering model for content recommendation and in
part 2 of the blog recommendation tutorial
we show how to use a neural network trained with TensorFlow to serve recommendations in user time. Similar recommendation approaches are used with success in e-commerce.

Keeping your e-commerce index up to date with real time updates

Vespa is designed for horizontal scaling with high sustainable write and read throughput and low, predictable latency. Updating the product catalog in real time is of critical importance for e-commerce applications, as the real-time information is used in retrieval filters and as ranking signals. The product description or product title rarely changes, but meta information like inventory status, price and popularity are real-time signals which will improve relevance when used in ranking. Having the inventory status reflected in the search index also avoids retrieving content which is out of stock.

Vespa has true native support for partial updates, where there is no need to re-index the entire document but only a subset of it (i.e. fields in the document). Real-time partial updates can be done at scale against attribute fields, which are stored and updated in memory. Attribute fields in Vespa can be updated at rates up to about 40-50K updates/s per content node.
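
As an illustration, a partial update through Vespa's /document/v1 API touches only the named fields; the namespace, document type and field names below are hypothetical.

import requests

response = requests.put(
    "http://localhost:8080/document/v1/shopping/product/docid/sku-123",
    json={"fields": {
        "price": {"assign": 129.0},
        "in_stock": {"assign": True},
        "popularity": {"increment": 1},  # arithmetic updates on numeric fields
    }},
)
print(response.json())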

Using Vespa’s support for predicate fields it is easy to control when content is surfaced in search results and when it is not. The predicate field type allows the content (e.g. a document) to express whether it should match the query, instead of the other way around. For e-commerce search and recommendation we can use predicate expressions to control how product campaigns are surfaced in search results. Some examples of what predicate fields can be used for:

  • Only match and retrieve the document if the time of day is in the range 8–16 or the range 19–20 and the user is a member. This could be used for promoting content for certain users, controlled by the predicate expression stored in the document. The time of day and member status are passed with the query.
  • Represent recurring campaigns with multiple time ranges.

Above examples are by no means exhaustive as predicates can be used for multiple campaign related use cases where the filtering logic is expressed in the content.

Are you worried that your current search installation will break under the traffic surge associated with the holiday shopping season? Are your cloud VMs running high on disk-busy metrics already? What about those long GC pauses in the JVM old generation causing your 95th percentile latency to go through the roof? Needless to say, any downtime caused by a slow search backend comes at a high cost during the busiest shopping season of the year.

Parent-child joins and tensors for content recommendation

Aaron Nagao

Senior Software Engineer, Verizon Media


A real-world application of parent-child joins and tensor functions to model topic popularity for content recommendation.

Every time a user visits Yahoo.com, we build the best news stream for that user from tens of thousands of candidate articles on a variety of topics such as sports, finance, and entertainment.

To rank these articles, a key feature is the topic’s click-through rate (CTR), a real-time popularity metric aggregated over all articles of that topic. There are two reasons why topic CTR is an important feature for machine-learned ranking:

  1. It helps the cold-start problem by allowing the model to use the popularity of other articles of that topic as a prior for new articles.
  2. It reduces a high-dimensional categorical feature like a topic to a single numerical feature (its mean CTR)—this mean encoding is needed for a model to incorporate topics.

In this blog post, we describe how we model topic CTRs using two modern features of Vespa: parent-child joins and tensor functions. Vespa, the open source big data serving engine created by Yahoo, powers all of our content recommendations in real-time for millions of users at scale.

To model ranking features in Vespa, most features like an article’s length or age are simple properties of the document which are stored within the Vespa document. In contrast, a topic CTR is shared by all documents of that topic.

A simple approach would be to feed the topic CTR to every document about that topic. But this denormalized approach both duplicates data and complicates CTR updates—the updater must first query for all documents of that topic in order to feed the updated CTR to all of them.

The standard solution for this from relational databases is joins through foreign keys, and such joins are supported in Vespa as parent-child relationships.

We model our relationships by storing all topic CTRs in a single “global” document, and each of our articles has a foreign key that points to that global document, as shown in the Figure. This way, when updating CTRs we only need to update the global document and not each individual article, avoiding data duplication.

Global document

When ranking articles, Vespa uses the foreign key to do a real-time join between each article and the global document to retrieve the topic CTRs. Vespa co-locates the global document on the article content node to reduce the latency of this real-time join, also represented in the Figure.

Vespa seamlessly allows us to write updated CTRs to the global document in real-time while concurrently reading the CTRs for use in article ranking.

The schema definition uses reference for this parent-child relationship:

document globalscores {
    field topic_ctrs type tensor<float>(topic{}) {
        indexing: attribute
        attribute: fast-search
    }
}

document article {
    field doc_topics type tensor<float>(topic{}) {
        indexing: attribute | summary
    }

    field ptr type reference<globalscores> {
        indexing: attribute
    }
}

Now that our data is modeled in Vespa, we want to use the CTRs as features in our ranking model.

One challenge is that some articles have 1 topic (so just 1 topic CTR) while some articles have 5 topics (with 5 CTRs). Since machine learning generally requires each article to have the same number of features, we take the average topic CTR and the maximum topic CTR, aggregating a differing number of CTRs into a fixed number of summary statistics.

We compute the average and maximum using Vespa’s Tensor API. A tensor is a vector with labeled dimensions, and Vespa provides API functions like sum, argmax, and * (elementwise multiplication) that operate on input tensors to compute features.

The rank profile code that computes these ranking features is:

import field ptr.topic_ctrs as global_topic_ctrs {}

rank-profile yahoo inherits default {
    # helper functions
    function AVG_CTR(weights, ctrs) {
        # weighted average CTR
        expression: sum(weights * ctrs) / sum(weights)
    }
    function MAX_CTR(weights, ctrs) {
        # weighted max, then use unweighted CTR
        expression: sum( argmax(weights * ctrs) * ctrs )
    }

    # ranking features
    function TOPIC_AVG_CTR() {
        expression: AVG_CTR(attribute(doc_topics), attribute(global_topic_ctrs))
    }
    function TOPIC_MAX_CTR() {
        expression: MAX_CTR(attribute(doc_topics), attribute(global_topic_ctrs))
    }
}

Rank Profile Walkthrough

Suppose an article is about two topics each with an associated weight:

attribute(doc_topics) = <'US': 0.7, 'Sports': 0.9>

And in the global document, the real-time CTR values for all topics are:

topic_ctrs = <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>

The first import line does a real-time join between the article’s foreign key and the global document, so that the article ranking functions can reference the global CTRs as if they were stored with each article:

global_topic_ctrs = ptr.topic_ctrs = <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>

For both of our features TOPIC_AVG_CTR and TOPIC_MAX_CTR, the first step uses the * elementwise multiplication operator in Vespa’s Tensor API, which effectively does a lookup of the article’s topics in the global tensor:

weights * ctrs
= attribute(doc_topics) * attribute(global_topic_ctrs)
= <'US': 0.7, 'Sports': 0.9> * <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>
= <'US': 0.056, 'Sports': 0.018>

Then TOPIC_AVG_CTR computes a weighted average CTR by summing and normalizing by the weights:

TOPIC_AVG_CTR
= sum(weights * ctrs) / sum(weights)
= sum(<'US': 0.056, 'Sports': 0.018>) / sum(<'US': 0.7, 'Sports': 0.9>)
= 0.074 / 1.6
= 0.046

(Note this weighted average of 0.046 is closer to the Sports CTR=0.02 than the US CTR=0.08 because Sports had a higher topic weight.)

And TOPIC_MAX_CTR finds the CTR of the entity with the maximum weighted CTR:

argmax(weights * ctrs)
= argmax(<'US': 0.056, 'Sports': 0.018>)
= <'US': 1>

argmax(weights * ctrs) * ctrs
= <'US': 1> * <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>
= <'US': 0.08>

(With a final sum to convert the mapped tensor to the scalar 0.08.)
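
The same arithmetic can be reproduced with plain Python dicts; restricting the elementwise product to shared keys mirrors Vespa's sparse tensor product.

doc_topics = {"US": 0.7, "Sports": 0.9}
topic_ctrs = {"US": 0.08, "Sports": 0.02, "Finance": 0.05}

# elementwise product over shared keys, as in weights * ctrs above
weighted = {t: doc_topics[t] * topic_ctrs[t] for t in doc_topics}  # {'US': 0.056, 'Sports': 0.018}

topic_avg_ctr = sum(weighted.values()) / sum(doc_topics.values())  # 0.074 / 1.6 = 0.046
top_topic = max(weighted, key=weighted.get)                        # argmax -> 'US'
topic_max_ctr = topic_ctrs[top_topic]                              # 0.08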

These two examples demonstrate the expressiveness of Vespa’s Tensor API to compute useful features.

Ultimately TOPIC_AVG_CTR and TOPIC_MAX_CTR are two features computed in real-time for every article during ranking. These features can then be added to any machine-learned ranking model—Vespa supports gradient-boosted trees from XGBoost and LightGBM, and neural networks in TensorFlow and ONNX formats.

While we could have stored these topic CTRs in some external store, we would have essentially needed to fetch all of them when ranking as every topic is represented in our pool of articles. Instead, Vespa moves the computation to the data by co-locating the global feature store on the article content node, avoiding network load and reducing system complexity.

We also keep the global CTRs in-memory for even faster performance with the aptly named attribute: fast-search. Ultimately, the Vespa team optimized our use case of parent-child joins and tensors to rank 10,000 articles in just 17.5 milliseconds!

This blog post described just two of the many ranking features used for real-time content recommendation at Verizon Media, all powered by Vespa.

Build a News recommendation app from python with Vespa: Part 1

Part 1 – News search functionality.

We will build a news recommendation app in Vespa without leaving a python environment. In this first part of the series, we want to develop an application with basic search functionality. Future posts will add recommendation capabilities based on embeddings and other ML models.

UPDATE 2023-02-13: Code examples are updated to work with the latest release of
pyvespa.


Photo by Filip Mishevski on Unsplash

This series is a simplified version of Vespa’s News search and recommendation tutorial. We will also use the demo version of the Microsoft News Dataset (MIND) so that anyone can follow along on their laptops.

Dataset

The original Vespa news search tutorial provides a script to download, parse and convert the MIND dataset to Vespa format. To make things easier for you, we made the final parsed data required for this tutorial available for download:

import requests, json

data = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_fields_parsed.json").text
)
data[0]
{'abstract': "Shop the notebooks, jackets, and more that the royals can't live without.",
 'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By',
 'subcategory': 'lifestyleroyals',
 'news_id': 'N3112',
 'category': 'lifestyle',
 'url': 'https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata',
 'date': 20191103,
 'clicks': 0,
 'impressions': 0}

The final parsed data used here is a list where each element is a dictionary containing relevant fields about a news article, such as title and category. We also have information about the number of impressions and clicks the article has received. The demo version of the MIND dataset has 28,603 news articles included.

Install pyvespa
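
pyvespa is available on PyPI:

$ pip3 install pyvespa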

Create the search app

Create the application package. app_package will hold all the relevant data related to your application’s specification.

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="news")

Add fields to the schema. Here is a short description of the non-obvious arguments used below:

  • indexing argument: configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing.

  • index argument: configures how Vespa should create the search index.

    • “enable-bm25”: set up an index compatible with bm25 ranking for text search.
  • attribute argument: configures how Vespa should treat an attribute field.

    • “fast-search”: Build an index for an attribute field. By default, no index is generated for attributes, and search over these defaults to a linear scan.
from vespa.package import Field

app_package.schema.add_fields(
    Field(name="news_id", type="string", indexing=["summary", "attribute"], attribute=["fast-search"]),
    Field(name="category", type="string", indexing=["summary", "attribute"]),
    Field(name="subcategory", type="string", indexing=["summary", "attribute"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="url", type="string", indexing=["index", "summary"]),        
    Field(name="date", type="int", indexing=["summary", "attribute"]),            
    Field(name="clicks", type="int", indexing=["summary", "attribute"]),            
    Field(name="impressions", type="int", indexing=["summary", "attribute"]),                
)

Add a fieldset to the schema. A fieldset allows us to search over multiple fields easily. In this case, searching over the default fieldset is equivalent to searching over title and abstract.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "abstract"])
)

We have enough to deploy the first version of our application. Later in this tutorial, we will include an article’s popularity into the relevance score used to rank the news that matches our queries.

Deploy the app on Docker

If you have Docker installed on your machine, you can deploy the app_package in a local Docker container:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(
    application_package=app_package, 
)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

vespa_docker will parse the app_package and write all the necessary Vespa config files to the disk_folder. It will then create the docker containers and use the Vespa config files to deploy the Vespa application. We can then use the app instance to interact with the deployed application, such as for feeding and querying. If you want to know more about what happens behind the scenes, we suggest you go through this getting started with Docker tutorial.

Feed data to the app

We can use the feed_data_point method. We need to specify:

  • data_id: unique id to identify the data point

  • fields: dictionary with keys matching the field names defined in our application package schema.

  • schema: name of the schema we want to feed data to. When we created an application package, we created a schema by default with the same name as the application name, news in our case.

This takes 10 minutes or so:

for article in data:
    res = app.feed_data_point(
        data_id=article["news_id"], 
        fields=article, 
        schema="news"
    )

Query the app

We can use the Vespa Query API through app.query to unlock the full query flexibility Vespa can offer.

Search over indexed fields using keywords

Select all the fields from documents where default (title or abstract) contains the keyword ‘music’.

res = app.query(body={"yql" : "select * from sources * where default contains 'music'"})
res.hits[0]
{
    'id': 'id:news:news::N14152',
    'relevance': 0.25641557752127125,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N14152',
        'news_id': 'N14152',
        'category': 'music',
        'subcategory': 'musicnews',
        'title': 'Music is hot in Nashville this week',
        'abstract': 'Looking for fun, entertaining music events to check out in Nashville this week? Here are top picks with dates, times, locations and ticket links.',
        'url': 'https://www.msn.com/en-us/music/musicnews/music-is-hot-in-nashville-this-week/ar-BBWImOh?ocid=chopendata',
        'date': 20191101,
        'clicks': 0,
        'impressions': 3
    }
}

Select title and abstract where title contains ‘music’ and default contains ‘festival’.

res = app.query(body = {"yql" : "select title, abstract from sources * where title contains 'music' AND default contains 'festival'"})
res.hits[0]
{
    'id': 'index:news_content/0/988f76793a855e48b16dc5d3',
    'relevance': 0.19587240022210403,
    'source': 'news_content',
    'fields': {
        'title': "At Least 3 Injured In Stampede At Travis Scott's Astroworld Music Festival",
        'abstract': "A stampede Saturday outside rapper Travis Scott's Astroworld musical festival in Houston, left three people injured. Minutes before the gates were scheduled to open at noon, fans began climbing over metal barricades and surged toward the entrance, according to local news reports."
    }
}

Search by document type

Select the title of all the documents with document type equal to news. Our application has only one document type, so the query below retrieves all our documents.

res = app.query(body = {"yql" : "select title from sources * where sddocname contains 'news'"})
res.hits[0]
{
    'id': 'index:news_content/0/698f73a87a936f1c773f2161',
    'relevance': 0.0,
    'source': 'news_content',
    'fields': {
        'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'
    }
}

Search over attribute fields such as date

Since date is not specified with attribute=["fast-search"], there is no index built for it. Therefore, searching over it is equivalent to doing a linear scan over the values of the field.

res = app.query(body={"yql" : "select title, date from sources * where date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/debbdfe653c6d11f71cc2353',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'title': 'These Cranberry Sauce Recipes Are Perfect for Thanksgiving Dinner',
        'date': 20191110
    }
}

Since the default fieldset is formed by indexed fields, Vespa will first filter by all the documents that contain the keyword ‘weather’ within title or abstract, before scanning the date field for ‘20191110’.

res = app.query(body={"yql" : "select title, abstract, date from sources * where default contains 'weather' AND date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/bb88325ae94d888c46538d0b',
    'relevance': 0.27025156546141466,
    'source': 'news_content',
    'fields': {
        'title': 'Weather forecast in St. Louis',
        'abstract': "What's the weather today? What's the weather for the week? Here's your forecast.",
        'date': 20191110
    }
}

We can also perform range searches:

res = app.query(body={"yql" : "select date from sources * where date <= 20191110 AND date >= 20191108"})
res.hits[0]
{
    'id': 'index:news_content/0/c41a873213fdcffbb74987c0',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'date': 20191109
    }
}

Sorting

By default, Vespa sorts the hits by descending relevance score. The relevance score is given by nativeRank unless something else is specified, as we will do later in this post.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music'"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/5f1b30d14d4a15050dae9f7f',
        'relevance': 0.25641557752127125,
        'source': 'news_content',
        'fields': {
            'title': 'Music is hot in Nashville this week',
            'date': 20191101
        }
    },
    {
        'id': 'index:news_content/0/6a031d5eff95264c54daf56d',
        'relevance': 0.23351089409559303,
        'source': 'news_content',
        'fields': {
            'title': 'Apple Music Replay highlights your favorite tunes of the year',
            'date': 20191105
        }
    }
]

However, we can explicitly order by a given field with the order keyword.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/d0d7e1c080f0faf5989046d8',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': "Elton John's second farewell tour stop in Cleveland shows why he's still standing after all these years",
            'date': 20191031
        }
    },
    {
        'id': 'index:news_content/0/abf7f6f46ff2a96862075155',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'The best hair metal bands',
            'date': 20191101
        }
    }
]

order sorts in ascending order by default; we can override that with the desc keyword:

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date desc"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/934a8d976ff8694772009362',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Korg Minilogue XD update adds key triggers for synth sequences',
            'date': 20191113
        }
    },
    {
        'id': 'index:news_content/0/4feca287fdfa1d027f61e7bf',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Tom Draper, Black Music Industry Pioneer, Dies at 79',
            'date': 20191113
        }
    }
]

Grouping

We can use Vespa’s grouping feature to compute the three news categories with the highest number of document counts:

res = app.query(body={"yql" : "select * from sources * where sddocname contains 'news' limit 0 | all(group(category) max(3) order(-count()) each(output(count())))"})
res.hits[0]
{
    'id': 'group:root:0',
    'relevance': 1.0,
    'continuation': {
        'this': ''
    },
    'children': [
        {
            'id': 'grouplist:category',
            'relevance': 1.0,
            'label': 'category',
            'continuation': {
                'next': 'BGAAABEBGBC'
            },
            'children': [
                {
                    'id': 'group:string:news',
                    'relevance': 1.0,
                    'value': 'news',
                    'fields': {
                        'count()': 9115
                    }
                },
                {
                    'id': 'group:string:sports',
                    'relevance': 0.6666666666666666,
                    'value': 'sports',
                    'fields': {
                        'count()': 6765
                    }
                },
                {
                    'id': 'group:string:finance',
                    'relevance': 0.3333333333333333,
                    'value': 'finance',
                    'fields': {
                        'count()': 1886
                    }
                }
            ]
        }
    ]
}
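
Grouping supports more aggregators than count(). As an illustration, a hypothetical variant that outputs the average number of clicks per category (this assumes clicks is defined as an attribute in the schema):

res = app.query(body={"yql" : "select * from sources * where sddocname contains 'news' limit 0 | all(group(category) each(output(avg(clicks))))"})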

Use news popularity signal for ranking

Vespa uses nativeRank to compute relevance scores by default. We will create a new rank-profile that includes a popularity signal in our relevance score computation.

from vespa.package import RankProfile, Function

app_package.schema.add_rank_profile(
    RankProfile(
        name="popularity",
        inherits="default",
        functions=[
            Function(
                name="popularity", 
                expression="if (attribute(impressions) > 0, attribute(clicks) / attribute(impressions), 0)"
            )
        ], 
        first_phase="nativeRank(title, abstract) + 10 * popularity"
    )
)

Our new rank-profile will be called popularity. Once the application is redeployed, it can be selected at query time with the ranking.profile parameter, as sketched below.
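
A minimal sketch of redeploying and querying with the new profile (the ranking.profile parameter is used the same way in the queries later in this post):

app = vespa_docker.deploy(application_package=app_package)
res = app.query(body={
    "yql": "select title from sources * where default contains 'music'",
    "ranking.profile": "popularity"  # rank by nativeRank plus the popularity signal
})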

Build a News recommendation app from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Part 2 – From news search to news recommendation with embeddings.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of
pyvespa.

In this part, we’ll start transforming our application from news search to news recommendation using the embeddings created in this tutorial. An embedding vector will represent each user and news article. We make the embeddings available for download to make it easier to follow along. When a user arrives, we retrieve their embedding and use it to retrieve the closest news articles via an approximate nearest neighbor (ANN) search. We also show that Vespa can jointly apply general filtering and ANN search, unlike competing alternatives available in the market.

Photo by Matt Popovich on Unsplash

We assume that you have followed the news search tutorial. Therefore, you should have an app_package variable holding the news search app definition and a Docker container named news running a search application fed with news articles from the demo version of the MIND dataset.

Add a user schema

We need to add another document type to represent a user. We set up the schema to search for a user_id and retrieve the user’s embedding vector.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="user", 
        document=Document(
            fields=[
                Field(
                    name="user_id", 
                    type="string", 
                    indexing=["summary", "attribute"], 
                    attribute=["fast-search"]
                ), 
                Field(
                    name="embedding", 
                    type="tensor<float>(d0[51])", 
                    indexing=["summary", "attribute"]
                )
            ]
        )
    )
)

We build an index for the attribute field user_id by specifying the fast-search attribute. Remember that attribute fields are held in memory and are not indexed by default.

The embedding field is a tensor field. Tensors in Vespa are flexible multi-dimensional data structures and, as first-class citizens, can be used in queries, document fields, and constants in ranking. Tensors can be dense, sparse, or a combination of both, and can contain any number of dimensions. Please see the tensor user guide for more information. Here we have defined a dense tensor with a single dimension (d0 – dimension 0), representing a vector; 51 is the size of the embeddings used in this post. A feed-side illustration follows.
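
The feed-side value of such a dense tensor field is simply a list of 51 floats. A hedged sketch (the user id and values here are made up; the "values" key mirrors what we read back from query results later in this post):

user_doc = {
    "user_id": "U0",  # hypothetical user
    "embedding": {"values": [0.1] * 51}  # a dense tensor of size 51
}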

We now have one schema for the news and one schema for the user.

[schema.name for schema in app_package.schemas]
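['news', 'user']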

Index news embeddings

Similarly to the user schema, we will use a dense tensor to represent the news embeddings. But unlike the user embedding field, we will index the news embeddings by including index in the indexing argument, and we specify that we want to build the index using the HNSW (hierarchical navigable small world) algorithm, with euclidean as the distance metric. Read this blog post to learn more about Vespa’s journey to implement ANN search.

from vespa.package import Field, HNSW

app_package.get_schema(name="news").add_fields(
    Field(
        name="embedding", 
        type="tensor<float>(d0[51])", 
        indexing=["attribute", "index"],
        ann=HNSW(distance_metric="euclidean")
    )
)

Recommendation using embeddings

Next, we add a rank-profile that uses the closeness ranking feature, which calculates the euclidean distance and uses that to rank the news articles. This rank-profile depends on the nearestNeighbor search operator, which we’ll get back to below when searching. For now, just note that it expects a tensor in the query to use as the initial search point.

from vespa.package import RankProfile

app_package.get_schema(name="news").add_rank_profile(
    RankProfile(
        name="recommendation", 
        inherits="default", 
        first_phase="closeness(field, embedding)"
    )
)

Query Profile Type

The recommendation rank profile above requires that we send a tensor along with the query. For Vespa to bind the correct types, it needs to know the expected type of this query parameter.

from vespa.package import QueryTypeField

app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(user_embedding)",
        type="tensor<float>(d0[51])"
    )
)

This query profile type instructs Vespa to expect a float tensor with dimension d0[51] when the query parameter ranking.features.query(user_embedding) is passed. We’ll see how this works together with the nearestNeighbor search operator below.

Redeploy the application

We made all the required changes to turn our news search app into a news recommendation app. We can now redeploy the app_package to our running container named news.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Finished deployment.
["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session",
 "Session 7 for tenant 'default' created.",
 'Preparing session 7 using http://localhost:19071/application/v2/tenant/default/session/7/prepared',
 "WARNING: Host named 'news' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.",
 "Session 7 for tenant 'default' prepared.",
 'Activating session 7 using http://localhost:19071/application/v2/tenant/default/session/7/active',
 "Session 7 for tenant 'default' activated.",
 'Checksum:   62d964000c4ff4a5280b342cd8d95c80',
 'Timestamp:  1616671116728',
 'Generation: 7',
 '']

Feeding and partial updates: news and user embeddings

To keep this tutorial easy to follow, we make the parsed embeddings available for download. To build them yourself, please follow this tutorial.

import requests, json

user_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_user_embeddings_parsed.json").text
)
news_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_news_embeddings_parsed.json").text
)

We just created the user schema, so we need to feed user data for the first time.

for user_embedding in user_embeddings:
    response = app.feed_data_point(
        schema="user", 
        data_id=user_embedding["user_id"], 
        fields=user_embedding
    )

For the news documents, we just need to update the embedding field added to the news schema.
This takes ten minutes or so:

for news_embedding in news_embeddings:
    response = app.update_data(
        schema="news", 
        data_id=news_embedding["news_id"], 
        fields={"embedding": news_embedding["embedding"]}
    )

Fetch the user embedding

Next, we create a query_user_embedding function to retrieve the user embedding by the user_id. Of course, you could do this more efficiently using a Vespa Searcher as described here, but keeping everything in python at this point makes learning easier.

def parse_embedding(hit_json):
    # The embedding comes back as {"values": [...]}; convert it to a plain list of floats
    return [float(value) for value in hit_json["fields"]["embedding"]["values"]]

def query_user_embedding(user_id):
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding

The function will query Vespa, retrieve the embedding and parse it into a list of floats. Here are the first five elements of the user U63195’s embedding.

query_user_embedding(user_id="U63195")[:5]
[
    0.0,
    -0.1694680005311966,
    -0.0703359991312027,
    -0.03539799898862839,
    0.14579899609088898
]

Get recommendations

The following yql instructs Vespa to select the title and the category from the ten news documents closest to the user embedding.

yql = "select title, category from sources news where ({targetHits:10}nearestNeighbor(embedding, user_embedding))" 

We also specify that we want to rank those documents by the recommendation rank-profile that we defined earlier and send the user embedding via the query profile type ranking.features.query(user_embedding) that we also defined in our app_package.

result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)

Here are the first two hits out of the ten returned.

[
    {
        'id': 'index:news_content/0/aca03f4ba2274dd95b58db9a',
        'relevance': 0.1460561756063909,
        'source': 'news_content',
        'fields': {
            'category': 'music',
            'title': 'Broadway Star Laurel Griggs Suffered Asthma Attack Before She Died at Age 13'
        }
    },
    {
        'id': 'index:news_content/0/bd02238644c604f3a2d53364',
        'relevance': 0.14591827245062294,
        'source': 'news_content',
        'fields': {
            'category': 'tv',
            'title': "Rip Taylor's Cause of Death Revealed, Memorial Service Scheduled for Later This Month"
        }
    }
]

Combine ANN search with query filters

Vespa ANN search is fully integrated into the Vespa query tree. This integration means that we can include query filters and the ANN search will be applied only to documents that satisfy the filters. No need to do pre- or post-processing involving filters.

The following yql searches over news documents that have sports as their category.

yql = "select title, category from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding)) AND " \
      "category contains 'sports'"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)

Here are the first two hits out of the ten returned. Notice the category field.

[
    {
        'id': 'index:news_content/0/375ea340c21b3138fae1a05c',
        'relevance': 0.14417346200569972,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': 'Charles Rogers, former Michigan State football, Detroit Lions star, dead at 38'
        }
    },
    {
        'id': 'index:news_content/0/2b892989020ddf7796dae435',
        'relevance': 0.14404365847394848,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': "'Monday Night Football' commentator under fire after belittling criticism of 49ers kicker for missed field goal"
        }
    }
]

Next steps

Continue to part 3, or see the conclusion for how to clean up the Docker container instances if you are done.

Build a News recommendation app from python with Vespa: Part 3

Thiago Martins

Vespa Data Scientist


Part 3 – Efficient use of click-through rate via parent-child relationship.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of
pyvespa.

This part of the series introduces a new ranking signal: category click-through rate (CTR). The idea is that we can recommend popular content for users that don’t have a click history yet. Rather than just recommending based on articles, we recommend based on categories. However, these global CTR values can often change continuously, so we need an efficient way to update this value for all documents. We’ll do that by introducing parent-child relationships between documents in Vespa. We will also use sparse tensors directly in ranking. This post replicates this more detailed Vespa tutorial.

Photo by AbsolutVision on Unsplash

We assume that you have followed the part2 of the news recommendation tutorial. Therefore, you should have an app_package variable holding the news app definition and a Docker container named news running the application fed with data from the demo version of the MIND dataset.

Setting up a global category CTR document

If we add a category_ctr field to the news document, we would have to update all the sports documents every time the sports CTR statistic changes. If we assume that the category CTR will change often, this turns out to be inefficient.

For these cases, Vespa introduced the parent-child relationship. Parents are global documents, which are automatically distributed to all content nodes. Other documents can reference these parents and “import” values for use in ranking. The benefit is that the global category CTR values only need to be written to one place: the global document.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="category_ctr",
        global_document=True,
        document=Document(
            fields=[
                Field(
                    name="ctrs", 
                    type="tensor<float>(category{})", 
                    indexing=["attribute"], 
                    attribute=["fast-search"]
                ), 
            ]
        )
    )
)

We implement that by creating a new category_ctr schema and setting global_document=True to indicate that we want Vespa to keep a copy of these documents on all content nodes. Setting a document to be global is required for using it in a parent-child relationship. Note that we use a tensor with a single sparse dimension to hold the ctrs data.

Sparse tensors have strings as dimension addresses rather than a numeric index. More concretely, an example of such a tensor is (using the tensor literal form):

{
    {category: entertainment}: 0.2,
    {category: news}: 0.3,
    {category: sports}: 0.5,
    {category: travel}: 0.4,
    {category: finance}: 0.1,
    ...
}

This tensor holds all the CTR scores for all the categories. When updating this tensor, we can update individual cells, and we don’t need to update the whole tensor. This operation is called tensor modify and can be helpful when you have large tensors.
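
Because the tensor has a single sparse dimension, an individual cell can be updated with a partial update. A minimal sketch using Vespa’s document/v1 API directly (the endpoint, operation, and value here are illustrative; we assume the application is reachable on localhost:8080):

import requests

# Replace only the sports cell of the global CTR tensor, leaving the other cells untouched
update = {
    "fields": {
        "ctrs": {
            "modify": {
                "operation": "replace",
                "cells": [
                    {"address": {"category": "sports"}, "value": 0.06}
                ]
            }
        }
    }
}
response = requests.put(
    "http://localhost:8080/document/v1/category_ctr/category_ctr/docid/global",
    json=update
)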

Importing parent values in child documents

We need to set up two things to use the category_ctr tensor when ranking news documents: reference the parent document (category_ctr in this case) and import ctrs from the referenced parent document.

app_package.get_schema("news").add_fields(
    Field(
        name="category_ctr_ref",
        type="reference<category_ctr>",
        indexing=["attribute"],
    )
)

The field category_ctr_ref is a field of type reference of the category_ctr document type. When feeding this field, Vespa expects the fully qualified document id. For instance, if our global CTR document has the id id:category_ctr:category_ctr::global, that is the value that we need to feed to the category_ctr_ref field. A document can reference many parent documents.

from vespa.package import ImportedField

app_package.get_schema("news").add_imported_field(
    ImportedField(
        name="global_category_ctrs",
        reference_field="category_ctr_ref",
        field_to_import="ctrs",
    )
)

The imported field defines that we should import the ctrs field from the document referenced in the category_ctr_ref field. We name it global_category_ctrs, and we can reference it as attribute(global_category_ctrs) during ranking.

Tensor expressions in ranking

Each news document has a category field of type string indicating which category the document belongs to. We want to use this information to select the correct CTR score stored in the global_category_ctrs. Unfortunately, tensor expressions only work on tensors, so we need to add a new field of type tensor called category_tensor to hold category information in a way that can be used in a tensor expression:

app_package.get_schema("news").add_fields(
    Field(
        name="category_tensor",
        type="tensor<float>(category{})",
        indexing=["attribute"],
    )
)

With the category_tensor field as defined above, we can use the tensor expression sum(attribute(category_tensor) * attribute(global_category_ctrs)) to select the specific CTR related to the category of the document being ranked. We implement this expression as a Function in the rank-profile below:

from vespa.package import Function, RankProfile

app_package.get_schema("news").add_rank_profile(
    RankProfile(
        name="recommendation_with_global_category_ctr", 
        inherits="recommendation",
        functions=[
            Function(
                name="category_ctr", 
                expression="sum(attribute(category_tensor) * attribute(global_category_ctrs))"
            ),
            Function(
                name="nearest_neighbor", 
                expression="closeness(field, embedding)"
            )
            
        ],
        first_phase="nearest_neighbor * category_ctr",
        summary_features=[
            "attribute(category_tensor)", 
            "attribute(global_category_ctrs)", 
            "category_ctr", 
            "nearest_neighbor"
        ]
    )
)

In the new rank-profile, the first-phase ranking expression multiplies the nearest-neighbor score by the category CTR score, implemented with the functions nearest_neighbor and category_ctr, respectively. A plain product is just a first attempt and might not be the best way to combine those two values; a hypothetical alternative follows.
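
For example, a weighted sum is an easy variant to experiment with. A hedged sketch (the profile name and weight are made up, not tuned):

app_package.get_schema("news").add_rank_profile(
    RankProfile(
        name="recommendation_weighted_ctr",  # hypothetical alternative profile
        inherits="recommendation",
        functions=[
            Function(
                name="category_ctr",
                expression="sum(attribute(category_tensor) * attribute(global_category_ctrs))"
            ),
            Function(
                name="nearest_neighbor",
                expression="closeness(field, embedding)"
            )
        ],
        first_phase="nearest_neighbor + 10 * category_ctr"
    )
)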

Deploy

We can reuse the same container named news created in the first part of this tutorial.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Finished deployment.

Feed

Next, we will download the global category CTR data, already parsed in the format that is expected by a sparse tensor with the category dimension.

import requests, json

global_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/global_category_ctr_parsed.json").text
)
global_category_ctr
{
    'ctrs': {
        'cells': [
            {'address': {'category': 'entertainment'}, 'value': 0.029266420380943244},
            {'address': {'category': 'autos'}, 'value': 0.028475809103747123},
            {'address': {'category': 'tv'}, 'value': 0.05374837981352176},
            {'address': {'category': 'health'}, 'value': 0.03531784305129329},
            {'address': {'category': 'sports'}, 'value': 0.05611187986670051},
            {'address': {'category': 'music'}, 'value': 0.05471192953054426},
            {'address': {'category': 'news'}, 'value': 0.04420778372641991},
            {'address': {'category': 'foodanddrink'}, 'value': 0.029256852366228187},
            {'address': {'category': 'travel'}, 'value': 0.025144552013730358},
            {'address': {'category': 'finance'}, 'value': 0.03231013195899643},
            {'address': {'category': 'lifestyle'}, 'value': 0.04423279317474416},
            {'address': {'category': 'video'}, 'value': 0.04006693315980292},
            {'address': {'category': 'movies'}, 'value': 0.03335647459420146},
            {'address': {'category': 'weather'}, 'value': 0.04532171803495617},
            {'address': {'category': 'northamerica'}, 'value': 0.0},
            {'address': {'category': 'kids'}, 'value': 0.043478260869565216}
        ]
    }
}

We can feed this data point to the category_ctr schema, assigning it the id global. Other documents can then reference it through the fully qualified Vespa id id:category_ctr:category_ctr::global.

response = app.feed_data_point(schema="category_ctr", data_id="global", fields=global_category_ctr)

We need to perform a partial update on the news documents to include information about the reference field category_ctr_ref and the new category_tensor that will have the value 1.0 for the specific category associated with each document.

news_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/news_category_ctr_update_parsed.json").text
)
news_category_ctr[0]
{
    'id': 'N3112',
    'fields': {
        'category_ctr_ref': 'id:category_ctr:category_ctr::global',
        'category_tensor': {
            'cells': [
                { 'address': {'category': 'lifestyle'}, 'value': 1.0}
            ]
        }
    }
}

This takes ten minutes or so:

for data_point in news_category_ctr:
    response = app.update_data(schema="news", data_id=data_point["id"], fields=data_point["fields"])

Testing the new rank-profile

We will redefine the query_user_embedding function defined in the second part of this tutorial and use it to make a query involving the user U33527 and the recommendation_with_global_category_ctr rank-profile.

def parse_embedding(hit_json):
    # The embedding comes back as {"values": [...]}; convert it to a plain list of floats
    return [float(value) for value in hit_json["fields"]["embedding"]["values"]]

def query_user_embedding(user_id):
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding

yql = "select * from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding))"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U33527")),
        "ranking.profile": "recommendation_with_global_category_ctr"
    }
)

The first hit below is a sports article. The imported global CTR tensor is also included in the summary features, and the CTR score for the sports category is 0.0561. Thus, the category_ctr function evaluates to 0.0561 as intended. The nearest_neighbor score is 0.149, and the resulting relevance score is 0.00836, so this worked as expected.

{
    'id': 'id:news:news::N5316',
    'relevance': 0.008369192847921151,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N5316',
        'news_id': 'N5316',
        'category': 'sports',
        'subcategory': 'football_nfl',
        'title': "Matthew Stafford's status vs. Bears uncertain, Sam Martin will play",
        'abstract': "Stafford's start streak could be in jeopardy, according to Ian Rapoport.",
        'url': "https://www.msn.com/en-us/sports/football_nfl/matthew-stafford's-status-vs.-bears-uncertain,-sam-martin-will-play/ar-BBWwcVN?ocid=chopendata",
        'date': 20191112,
        'clicks': 0,
        'impressions': 1,
        'summaryfeatures': {
            'attribute(category_tensor)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'sports'}, 'value': 1.0}
                ]
            },
            'attribute(global_category_ctrs)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'entertainment'}, 'value': 0.029266420751810074},
                    {'address': {'category': 'autos'}, 'value': 0.0284758098423481},
                    {'address': {'category': 'tv'}, 'value': 0.05374838039278984},
                    {'address': {'category': 'health'}, 'value': 0.03531784191727638},
                    {'address': {'category': 'sports'}, 'value': 0.05611187964677811},
                    {'address': {'category': 'music'}, 'value': 0.05471193045377731},
                    {'address': {'category': 'news'}, 'value': 0.04420778527855873},
                    {'address': {'category': 'foodanddrink'}, 'value': 0.029256852343678474},
                    {'address': {'category': 'travel'}, 'value': 0.025144552811980247},
                    {'address': {'category': 'finance'}, 'value': 0.032310131937265396},
                    {'address': {'category': 'lifestyle'}, 'value': 0.044232793152332306},
                    {'address': {'category': 'video'}, 'value': 0.040066931396722794},
                    {'address': {'category': 'movies'}, 'value': 0.033356472849845886},
                    {'address': {'category': 'weather'}, 'value': 0.045321717858314514},
                    {'address': {'category': 'northamerica'}, 'value': 0.0},
                    {'address': {'category': 'kids'}, 'value': 0.043478261679410934}
                ]
            },
            'rankingExpression(category_ctr)': 0.05611187964677811,
            'rankingExpression(nearest_neighbor)': 0.14915188666574342,
            'vespa.summaryFeatures.cached': 0.0
        }
    }
}
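
We can confirm the arithmetic from the summary features above:

category_ctr = 0.05611187964677811
nearest_neighbor = 0.14915188666574342
print(nearest_neighbor * category_ctr)  # 0.008369192... matches the relevance score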

Conclusion

This tutorial introduced parent-child relationships and demonstrated them through a global CTR feature used in ranking. We also introduced ranking with (sparse) tensor expressions.

Clean up Docker container instances:

vespa_docker.container.stop()
vespa_docker.container.remove()