Blog recommendation with neural network models

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.

Introduction

The main objective of this post is to show how to deploy neural network models in Vespa using our Tensor Framework. In fact, any model that can be represented by a series of Tensor operations can be deployed in Vespa. Neural networks are just a popular example. In addition, we will introduce the multi-phase ranking model available in Vespa, which can be used to run more expensive models in a later phase on the reduced number of documents returned by previous phases. This feature allows us to run models that would be prohibitively expensive to use if we had to run them at query-time across all the documents indexed in Vespa.

Model Training

In this section, we will define a neural network model, show how we created a suitable dataset to train the model and train the model using TensorFlow.

The neural network model

In the previous blog post, we computed latent factors for each user and each document and then used a dot-product between user and document vectors to rank the documents available for recommendation to a specific user. In this tutorial we will train a 2-layer fully connected neural network model that will take the same user (u) and document (d) latent factors as input and will output the probability of that specific user liking the document.

More technically, our previous rank function r was given by

r(u,d)=u∗d

while in this tutorial it will be given by

r(u,d,θ)=f(u,d,θ)

where f represents the neural network model described below and θ is the neural network parameter values that we need to learn from training data.

The specific form of the neural network model used here is

p = sigmoid(h1×W2+b2)
h1 = ReLU(x×W1+b1)

where x=[u,d] is the concatenation of the user and document latent factors, ReLU is the rectifier activation function, sigmoid represents the sigmoid function, and p is the output of the model, which in this case can be interpreted as the probability of the user u liking a blog post d. The parameters of the model are represented by θ=(W1,W2,b1,b2).
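To make the two equations above concrete, here is a minimal sketch of the forward pass in plain Python. It is an illustration only, not the TensorFlow code used for training; the helper names (`affine`, `predict`) and the list-of-rows matrix layout are assumptions made for this example.

```python
import math


def relu(values):
    """Rectifier activation applied element-wise."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Logistic sigmoid for a single number."""
    return 1.0 / (1.0 + math.exp(-value))


def affine(x, W, b):
    """Compute x × W + b for a row vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def predict(u, d, W1, b1, W2, b2):
    """Return p = sigmoid(ReLU(x × W1 + b1) × W2 + b2) with x = [u, d]."""
    x = u + d  # concatenation of user and document latent factors
    h1 = relu(affine(x, W1, b1))
    return sigmoid(affine(h1, W2, b2)[0])
```

With 10-dimensional latent factors for both user and document, `W1` would have 20 rows, which matches the `input[20]` dimension used for the constant tensors later in this post.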

Training data

For the training dataset, we will start with the (user_id, post_id) rows from the “training_set_ids” generated previously. Then, we remove every row for which there are no latent factors for the user_id or post_id contained in that row. This gives us a dataset with only positive feedback (label = 1), since each row represents one instance of a user_id liking a post_id.

In order to train our model, we need to generate negative feedback (label = 0). So, for each row (user_id, post_id) in the current dataset we will generate N negative feedback rows by randomly sampling post_id_fake from the pool of post_id’s available in the current set, so that for each (user_id, post_id) row with label = 1 we will increase the dataset with N (user_id, post_id_fake) rows with label = 0.

Find code to generate the dataset in the utility scripts.
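The negative sampling step described above can be sketched as follows. This is a simplified illustration, not the utility script itself; the function name `add_negative_samples` and the fixed `seed` are assumptions made for this example.

```python
import random


def add_negative_samples(positive_rows, n_negative, seed=0):
    """Return (user_id, post_id, label) rows: each positive row plus N sampled negatives."""
    rng = random.Random(seed)
    # Pool of post ids available in the current (positive-only) dataset.
    post_pool = sorted({post_id for _, post_id in positive_rows})
    dataset = []
    for user_id, post_id in positive_rows:
        dataset.append((user_id, post_id, 1))
        for _ in range(n_negative):
            post_id_fake = rng.choice(post_pool)
            dataset.append((user_id, post_id_fake, 0))
    return dataset
```

Note that, as in the description above, a sampled post_id_fake can coincide with a post the user actually liked; this sketch does not filter such collisions.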

Training with TensorFlow

With the training data in hand, we have split it into 80% training set and 20% validation set and used TensorFlow to train the model. The script used can be found in the utility scripts and executed by

$ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \
                       --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \
                       --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt

The progress of your training can be visualized using Tensorboard

$ tensorboard --logdir runs/*/summaries/
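The 80%/20% split mentioned in this section can be sketched in a few lines of Python. The function name and the shuffling with a fixed seed are assumptions made for this example; the training script may split the data differently.

```python
import random


def split_train_validation(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into a training and a validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```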

Model deployment in Vespa

Two Phase Ranking

When a query is sent to Vespa, it will scan all documents available and select the ones (possibly all) that match the query. When the set of documents matching a query is found, Vespa must decide the order of these documents. Unless explicit sorting is used, Vespa decides this order by calculating a number for each document, the rank score, and sorts the documents by this number.

The rank score can be any function that takes as arguments parameters sent by the query, document attributes defined in search definitions and global parameters not directly linked to query or document parameters. One example of rank score is the output of the neural network model defined in this tutorial. The model takes the latent factor u associated with a specific user_id (query parameter), the latent factor d associated with document post_id (document attribute) and learned model parameters (global parameters not related to a specific query or document) and returns the probability of user u liking document d.

However, even though Vespa is designed to carry out such calculations optimally, complex expressions become expensive when they must be calculated over every one of a large set of matching documents. To relieve this, Vespa can be configured to run two ranking expressions: a smaller and less accurate one on all hits during the matching phase, and a more expensive and accurate one only on the best hits during the reranking phase. In general this allows better use of the CPU budget by dedicating more of the total CPU to the best candidate hits.

The reranking phase, if specified, will by default be run on the 100 best hits on each search node, after matching and before information is returned upwards to the search container. The number of hits to rerank can be turned up or down as needed. Below is a toy example showing how to configure first and second phase ranking expressions in the rank profile section of search definitions where the second phase rank expression is run on the 200 best hits from first phase on each search node.

search myapp {

    …

    rank-profile default inherits default {

        first-phase {
            expression: nativeRank + query(deservesFreshness) * freshness(timestamp)
        }

        second-phase {
            expression {
                0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +
                0.3 * attributeMatch(keywords)
            }
            rerank-count: 200
        }
    }
}
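The idea behind two phase ranking can be sketched outside Vespa in a few lines of Python. This is a conceptual illustration for a single search node, not how Vespa implements it; the function and parameter names are assumptions, and keeping the reranked hits ahead of the remaining ones is a simplification made for this example.

```python
def two_phase_rank(docs, first_phase, second_phase, rerank_count):
    """Rank all docs with a cheap function, then rerank only the best ones."""
    # First phase: cheap score computed for every matching document.
    ranked = sorted(docs, key=first_phase, reverse=True)
    head, tail = ranked[:rerank_count], ranked[rerank_count:]
    # Second phase: expensive score computed only for the top rerank_count hits.
    head = sorted(head, key=second_phase, reverse=True)
    return head + tail
```

The expensive function is called at most `rerank_count` times per node, no matter how many documents match the query.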

Constant Tensor files

Once the model has been trained in TensorFlow, export the model parameters (W1,W2,b1,b2) to the application folder as Tensors according to the Vespa Document JSON format.

The complete code to serialize the model parameters using Vespa Tensor format can be found in the utility scripts, but the following code snippet shows how to serialize the hidden layer weights W1:

serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])

Note that Vespa currently requires dimension names for all the Tensor dimensions (in this case W1 is a matrix, so it has two dimensions).

In the following section, we will use the following code in the blog_post search definition in order to be able to use the constant tensor W_hidden in our ranking expression.

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

A constant tensor is data that is not specific to a given document type. In the case above we define W_hidden to be a tensor with two dimensions (matrix), where the first dimension is named input and has size 20 and second dimension is named hidden and has size 40. The data were serialized to a JSON file located at constants/W_hidden.json relative to the application package folder.
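To give an idea of what such a file contains, here is a minimal sketch that writes a matrix as a list of cells, each with a named address and a value. It is not the serializer from the utility scripts; the function name is hypothetical, and the `cells`/`address`/`value` layout reflects our understanding of the Vespa tensor JSON format, so check the Vespa documentation before relying on it.

```python
import json


def tensor_to_vespa_json(matrix, dimension_names):
    """Serialize a 2-dimensional matrix (list of rows) as a JSON list of tensor cells."""
    row_dim, col_dim = dimension_names
    cells = [
        # Each cell is addressed by one label per named dimension.
        {"address": {row_dim: str(i), col_dim: str(j)}, "value": float(value)}
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    ]
    return json.dumps({"cells": cells})
```

For W_hidden, the call would use `dimension_names=["input", "hidden"]`, matching the dimension names declared in the constant definition above.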

Vespa ranking expressions

In order to evaluate the neural network model trained with TensorFlow in the previous section, we need to translate the model structure to a Vespa ranking expression to be defined in the blog_post search definition. To honor a low-latency response, we will take advantage of the Two Phase Ranking available in Vespa and define the first phase ranking to be the same ranking function used in the previous blog post, which is a dot-product between the user and document latent factors. After the documents have been sorted by the first phase ranking function, we will rerank the top 200 documents from each search node using the second phase ranking given by the neural network model presented above.

Note that we define two ranking profiles in the search definition below. This allows us to decide which ranking profile to use at query time. We define a ranking profile named tensor, which only applies the dot-product between user and document latent factors to all matching documents, and a ranking profile named nn_tensor, which reranks the top 200 documents using the neural network model discussed in the previous section.

We will walk through each part of the blog_post search definition, see blog_post.sd.

As always, we start a search definition with the following line

    search blog_post {

We define the document type blog_post the same way we have done in the previous tutorial.

    document blog_post {

      # Field definitions
      # Examples:

      field date_gmt type string {
          indexing: summary
      }
      field language type string {
          indexing: summary
      }

      # Remaining fields as found in previous tutorial

    }

We define a ranking profile named tensor which ranks all the matching documents by the dot-product between the document latent factor and the user latent factor. This is the same ranking expression used in the previous tutorial, which includes code to retrieve the user latent factor based on the user_id sent by the query to Vespa.

    # Simpler ranking profile without
    # second-phase ranking
    rank-profile tensor {
      first-phase {
          expression {
              sum(query(user_item_cf) * attribute(user_item_cf))
          }
      }
    }

Since we want to evaluate the neural network model we have trained, we need to define where to find the model parameters (W1,W2,b1,b2). See the previous section for how to write the TensorFlow model parameters to Vespa Tensor format.

    # We need to specify the type and the location
    # of the files storing tensor values for each
    # Variable in our TensorFlow model. In this case,
    # W_hidden, b_hidden, W_final, b_final

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

    constant b_hidden {
        file: constants/b_hidden.json
        type: tensor(hidden[40])
    }

    constant W_final {
        file: constants/W_final.json
        type: tensor(hidden[40], final[1])
    }

    constant b_final {
        file: constants/b_final.json
        type: tensor(final[1])
    }

Now, we specify a second rank-profile called nn_tensor that will use the same first phase as the rank-profile tensor but will rerank the top 200 documents using the neural network model as second phase. We refer to the Tensor Reference document for more information regarding the Tensor operations used in the code below.

    # rank profile with neural network model as
    # second phase
    rank-profile nn_tensor {

        # The input to the neural network is the
        # concatenation of the document and query vectors.

        macro nn_input() {
            expression: concat(attribute(user_item_cf), query(user_item_cf), input)
        }

        # Computes the hidden layer

        macro hidden_layer() {
            expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))
        }

        # Computes the output layer

        macro final_layer() {
            expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))
        }


        # First-phase ranking:
        # Dot-product between user and document latent factors

        first-phase {
            expression: sum(query(user_item_cf) * attribute(user_item_cf))
        }

        # Second-phase ranking:
        # Neural network model based on the user and document latent factors

        second-phase {
            rerank-count: 200
            expression: sum(final_layer)
        }

    }

}

Offline evaluation

We will now query Vespa and obtain 100 blog post recommendations for each user_id in our test set. Below, we query Vespa using the tensor ranking profile, which contains the simpler ranking expression involving the dot-product between user and document latent factors.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=tensor \
  -param OUTPUT=blog-job/cf-metric

We perform the same query routine below, but now using the ranking-profile nn_tensor which reranks the top 200 documents using the neural network model.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=nn_tensor \
  -param OUTPUT=blog-job/cf-metric

The tutorial_compute_metric.pig script can be found in our repo.

Comparing the recommendations obtained by those two ranking profiles and our test set, we see that by deploying a more complex and accurate model in the second phase ranking, we increased the number of relevant documents (documents read by the user) retrieved from 11948 to 12804 (more than 7% increase) and those documents retrieved appeared higher up in the list of recommendations, as shown by the expected percentile ranking metric introduced in the Vespa tutorial pt. 2 which decreased from 37.1% to

Optimizing realtime evaluation of neural net models on Vespa

In this blog post we describe how we recently made neural network evaluation over 20 times faster on Vespa’s tensor framework.

Vespa is the open source platform for building applications that carry out scalable real-time data processing, for instance search and recommendation systems. These require significant amounts of computation over large data sets. With advances in machine learning, it is desirable to run more advanced ranking models such as large linear or logistic regression models and artificial neural networks. Because of the tight computational budget at serving time, the evaluation of such models must be done in an efficient and scalable manner.

We introduced the tensor API to help solve such problems. The tensor API allows the concise expression of general computations on many-dimensional data, while simultaneously leaving room for deep optimizations on the platform side.  What we mean by this is that the tensor API is very expressive and supports a large range of model types. The general evaluation of tensors is not necessarily efficient in all cases, so in addition to continually working to increase the baseline performance, we also perform specific optimizations for important use cases. In this blog post we will describe one such important optimization we recently did, which improved neural network evaluation performance by over 20x.

To illustrate the types of optimization we can do, consider the following tensor expression representing a dot product between vectors v1 and v2:

reduce(join(v1, v2, f(x, y)(x * y)), sum)

The dot product is calculated by multiplying the vectors together by using the join operation,
then summing the elements in the vector together using the reduce operation.
The result is a single scalar. A naive implementation would first calculate the join and introduce a temporary tensor before the reduce sums up the cells to a single scalar. Particularly for large tensors with many dimensions, such a temporary tensor can be large and require significant memory allocations. This is obviously not the most efficient path to calculate the resulting tensor.  A general improvement would be to avoid the temporary tensor and reduce to the single scalar directly as the tensors are iterated through.
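The difference between the naive and the improved evaluation can be sketched in Python for dense vectors. This is a conceptual illustration only, not Vespa's implementation; the function names are made up for this example.

```python
def join(a, b, f):
    """Combine two dense vectors of equal length cell by cell."""
    return [f(x, y) for x, y in zip(a, b)]


def dot_naive(v1, v2):
    """Materialize the joined tensor first, then reduce it to a scalar."""
    temporary = join(v1, v2, lambda x, y: x * y)  # temporary tensor
    return sum(temporary)


def dot_fused(v1, v2):
    """Accumulate the scalar directly while iterating, with no temporary tensor."""
    total = 0.0
    for x, y in zip(v1, v2):
        total += x * y
    return total
```

Both functions return the same value; the fused version simply avoids allocating the intermediate result.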

In Vespa, when ranking expressions are compiled, the abstract syntax tree (AST) is analyzed for such optimizations. When known cases are recognized, the most efficient implementation is selected. In the above example, assuming the vectors are dense and they share dimensions, Vespa has optimized hardware accelerated code for doing dot products on vectors. For sparse vectors, Vespa falls back to an implementation for weighted sets which builds hash tables for efficient lookups. This method allows recognition of both large and small optimizations, from simple dot products to specialized implementations for more advanced ranking models. Vespa currently has a few optimizations implemented, and we are adding more as important use cases arise.

We recently set out to improve the performance of evaluating simple neural networks, a case quite similar to the one presented in the previous blog post. The ranking expression to optimize was:

   macro hidden_layer() {
       expression: elu(xw_plus_b(nn_input, constant(W_fc1), constant(b_fc1), x))
   }
   macro final_layer() {
       expression: xw_plus_b(hidden_layer, constant(W_fc2), constant(b_fc2), hidden)
   }
   first-phase {
       expression: final_layer
   }

This represents a simple two-layer neural network.

Whenever a new version of Vespa is built, a large suite of integration and performance tests are run. When we want to optimize a specific use case, we first create a performance test to set a baseline.  With the performance tests we get both historical graphs as well as detailed profiling information and performance statistics sampled from the system under load.  This allows us to identify and optimize any bottlenecks. Also, it adds a bit of gamification to the process.

The graph below shows the performance of a test where 10 000 random documents are ranked according to the evaluation of a simple two-layer neural network:

image

Here, the x-axis represents builds, and the y-axis is the end-to-end latency as measured from a machine firing off queries to a server running the test on Vespa. As can be seen, over the course of optimization the latency was reduced from 150-160 ms to 7 ms, an impressive 20x end-to-end latency improvement.

When a query is received by Vespa, it is first processed in the stateless container. This is usually where applications would process the query, possibly enriching it with additional information. Vespa does a bit of default work here as well, and also transforms the query a bit. For this test, no specific handling was done except this default handling. After initial processing, the query is dispatched to each node in the stateful content layer. For this test, only a single node is used in the content layer, but applications would typically have multiple. The query is processed in parallel on each node utilizing multiple cores and the ranking expression gets executed once for each document that matches the query. For this test with 10 000 documents, the ranking expression and thus the neural network gets evaluated in total 10 000 times before the top N documents are returned to the container layer.

The following steps were taken to optimize this expression, with each step visible as a step in the graph above:

  1. Recognize join with multiplication as part of an inner product.
  2. Optimize for bias addition.
  3. Optimize vector concatenation (which was part of the input to the neural network)
  4. Replace appropriate sub-expressions with the dense vector-matrix product.

The final step in particular gave the biggest percentage-wise performance boost. The solution in total was to recognize the vector-matrix multiplication done in the neural network layer and replace that with specialized code that invokes the existing hardware accelerated dot product code. In the expression above, the operation xw_plus_b is expanded into a reduce of a multiplicative join followed by an additive join. This pattern is what is recognized and performed in one step instead of three.
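To illustrate the three steps that get collapsed into one, here is a sketch in plain Python for a dense vector x, a dense matrix w stored as a list of rows, and a bias vector b. It only mirrors the structure of the computation; Vespa's actual implementation is native, hardware accelerated code, and the function names below are assumptions made for this example.

```python
def xw_plus_b_three_steps(x, w, b):
    """Evaluate x × w + b as join (multiply), reduce (sum), join (add)."""
    n_in, n_out = len(x), len(b)
    # Step 1: multiplicative join produces a temporary matrix.
    product = [[x[i] * w[i][j] for j in range(n_out)] for i in range(n_in)]
    # Step 2: reduce sums the temporary matrix over the input dimension.
    reduced = [sum(product[i][j] for i in range(n_in)) for j in range(n_out)]
    # Step 3: additive join adds the bias.
    return [reduced[j] + b[j] for j in range(n_out)]


def xw_plus_b_fused(x, w, b):
    """Evaluate the same result in one pass without temporaries."""
    out = list(b)  # start from the bias
    for i, xi in enumerate(x):
        row = w[i]
        for j in range(len(out)):
            out[j] += xi * row[j]
    return out
```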

This strategy of optimizing specific use cases allows for a more rapid application development for users of Vespa. Consider the case where some exotic model needs to be run on Vespa. Without the generic tensor API users would have to implement their own custom rank features or wait for the Vespa core developers to implement them. In contrast, with the tensor API, teams can continue their development without external dependencies to the Vespa team.  If necessary, the Vespa team can in parallel implement the optimizations needed to meet performance requirements, as we did in this case with neural networks.

Serving article comments using reinforcement learning of a neural net

Don’t look at the comments. When you allow users to make comments on your content pages you face the problem that not all of them are worth showing — a difficult problem to solve, hence the saying. In this article I’ll show how this problem has been attacked using reinforcement learning at serving time on Yahoo content sites, using the Vespa open source platform to create a scalable production solution.

Yahoo properties such as Yahoo Finance, News and Sports allow users to comment on the articles, similar to many other apps and websites. To support this the team needed a system that can add, find, count and serve comments at scale in real time. Not all comments are equally interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting.

Here I’ll explain how this problem was solved for Yahoo properties by using Vespa — the open source big data serving engine. I’ll start with the basics and then show how comment selection using a neural net and reinforcement learning was implemented.

As mentioned, the team needed a system that can add, find, count, and serve comments at scale in real time. The team chose Vespa, the open big data serving engine for this, as it supports both such basic serving as well as incorporating machine learning at serving time (which we’ll get to below). By storing each comment as a separate document in Vespa, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself, the team could issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article:

image

In addition, this document structure allowed less-used operations such as showing all the articles of a given user and similar.

The Vespa instance used at Yahoo for this stores about a billion comments at any time, and serves about 12,000 queries per second and about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: a single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes and executing the distributed part of queries in parallel. In total, 32 stateless and 96 stateful nodes are spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6–12 shards depending on the traffic patterns of that region.

Some articles on Yahoo pages have a very large number of comments — up to hundreds of thousands are not uncommon, and no user is going to read all of them. Therefore it is necessary to pick the best comments to show each time someone views an article. Vespa does this by finding all the comments for the article, computing a score for each, and picking the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution. This allows executing these queries with low latency and ensures that more comments can be handled by adding more content nodes, without causing an increase in latency.

The input to the ranking function is features which are typically stored in the document (here: a comment) or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, the system keeps track of the reputation of each comment author as a feature.

User actions are sent as update operations to Vespa as they are performed. The information about authors is also continuously changing, but since each author can write many comments it would be wasteful to have to update each comment every time there is new information about the author.
Instead, the author information is stored in a separate document type — one document per author,
and a document reference in Vespa is used to import that author feature into each comment.
This allows updating the author information once and having it automatically take effect for all comments by that author.
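The effect of such a reference can be illustrated with a small Python sketch in which comments hold only the ID of their author, and the author feature is resolved through that ID when needed. The field names (`author_ref`, `reputation`) and document IDs are hypothetical and chosen only for this example; this is not how Vespa stores or resolves references internally.

```python
def author_reputation(comment, authors):
    """Resolve the author feature through the reference stored on the comment."""
    return authors[comment["author_ref"]]["reputation"]


def update_author(authors, author_id, reputation):
    """A single update on the author document; no comment is touched."""
    authors[author_id]["reputation"] = reputation


authors = {"author-1": {"reputation": 0.2}}
comments = [
    {"id": "c1", "author_ref": "author-1"},
    {"id": "c2", "author_ref": "author-1"},
]
```

After one call to `update_author`, every comment by that author sees the new value the next time it is ranked.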

With these features, it’s possible in Vespa to configure a mathematical function as a ranking expression which computes the rank score of each comment to produce a ranked list of the top comments, like the following:

image

Using a neural net and reinforcement learning

The team used to rank comments with a handwritten ranking expression having hardcoded weighting of the features. This is a good way to get started but obviously not optimal. To improve it they needed to decide on a measurable target and use machine learning to optimize towards it.

The ultimate goal is for users to find the comments interesting. This cannot be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time the users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc). The team knew they wanted user interest to go up on average, but there is no way to know what the correct value of the measure of interest might be for any single given list of comments. Therefore it’s hard to create a training set of interest signals for articles (supervised learning), so reinforcement learning was chosen instead: Let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal used as a proxy for user interest, and use this to converge on a model that increases it.

The model chosen here was a neural net with multiple hidden layers, roughly illustrated as follows:

image

The advantage of using a neural net compared to a simple function such as linear regression is that it can capture non-linear relationships in the feature data without anyone having to guess which relationship exists and hand-write functions to capture them (feature engineering).

To explore the space of possible rankings, the team implemented a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. They logged the ranking information and user interest signals such as dwell time to their Hadoop grid where they are joined. This generates a training set each hour which is used to retrain the model using TensorFlow-on-Spark, which produces a new model for the next iteration of the reinforcement learning cycle.
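The article does not describe the sampling algorithm the team used, so the following Python sketch is purely hypothetical: it perturbs a ranked list by swapping neighbouring comments with a small probability `epsilon`. It is meant only to illustrate what "perturbing the ranking" can look like, not to reproduce the production Searcher.

```python
import random


def perturb_ranking(ranked_ids, epsilon, rng):
    """Swap adjacent items with probability epsilon (hypothetical exploration scheme)."""
    result = list(ranked_ids)
    i = 0
    while i < len(result) - 1:
        if rng.random() < epsilon:
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # do not move the same item twice
        else:
            i += 1
    return result
```

A call such as `perturb_ranking(ids, 0.1, random.Random())` returns a list that mostly follows the model's order; logging both the shown order and the observed interest signals gives the training data for the next iteration.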

To implement this on Vespa, the team configured the neural net as the ranking function for comments. This was done as a manually written ranking function over tensors in a rank profile. Here is the production configuration used:

rank-profile neuralNet {
    function get_model_weights(field) {
        expression: if(query(field) == 0, constant(field), query(field))
    }
    function layer_0() { # returns tensor(hidden[9])     
        expression: elu(xw_plus_b(nn_input, get_model_weights(W_0), get_model_weights(b_0), x))   
    }
    function layer_1() { # returns tensor(out[9])
        expression: elu(xw_plus_b(layer_0, get_model_weights(W_1), get_model_weights(b_1), hidden))
    }
    # xw_plus_b returns tensor(out[1]), so sum converts to double   
    function layer_out() {
        expression: sum(xw_plus_b(layer_1, get_model_weights(W_out), get_model_weights(b_out), out))   
    }    
    first-phase {     
        expression: freshnessRank   
    }    
    second-phase {
        expression: layer_out
        rerank-count: 2000   
    }
}
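The get_model_weights function in the profile above picks the tensor sent with the query when there is one, and falls back to the constant deployed with the application otherwise (the expression treats a query feature equal to 0 as not supplied). The same selection logic can be sketched in Python, with plain dictionaries standing in for query features and constants; these stand-ins are assumptions made for this example.

```python
def get_model_weights(name, query_features, constants):
    """Prefer the tensor passed with the query; otherwise use the deployed constant."""
    value = query_features.get(name)
    if value is None:
        return constants[name]
    return value
```

This is what makes it possible to change the weights per query without redeploying the application package.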

More recently Vespa added support for deploying TensorFlow SavedModels directly (as well as similar support for tools saving in the ONNX format), which would also be a good option here since the training happens in TensorFlow.

Neural nets have a pair of weight and bias tensors for each layer, which is what the team wanted the training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, with reinforcement learning it is necessary to be able to update these tensor parameters frequently. This could be achieved by redeploying the application package frequently, as Vespa allows that to be done without restarts or disruption to ongoing queries. However, it is still a somewhat heavy-weight process, so another approach was chosen: Store the neural net parameters as tensors in a separate document type in Vespa, and create a Searcher component which looks up this document on each incoming query, and adds the parameter tensors to it before it’s passed to the content nodes for evaluation.

Here is the full production code needed to accomplish this serving-time operation:

import com.yahoo.document.Document;
import com.yahoo.document.DocumentId;
import com.yahoo.document.Field;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.documentapi.DocumentAccess;
import com.yahoo.documentapi.SyncParameters;
import com.yahoo.documentapi.SyncSession;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;
import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {
    private static final String VESPA_ID_FORMAT = "id:canvass_search:rankingmodel::%s";
    // https://docs.vespa.ai/en/ranking-expressions-features.html#using-query-variables
    private static final String FEATURE_FORMAT = "query(%s)";

    /** To fetch model documents from Vespa index */
    private final SyncSession fetchDocumentSession;
    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession = DocumentAccess.createDefault().createSyncSession(new SyncParameters.Builder().build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        // Fetch model document from Vespa
        String id = String.format(VESPA_ID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(id));
        // Add it to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query)
            );
        }
        return execution.search(query);
    }

    private static void addTensorFromDocumentToQuery(String field, FieldValue value, Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(FEATURE_FORMAT, field), tensor);
        }
    }
}

The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net (where each field below is configured with “indexing: attribute | summary”):

document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … } 
    field W_1 type tensor(hidden[9],out[9]) { … } 
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … } 
    field b_out type tensor(out[1]) { … } 
}

Since updating documents is a lightweight operation it is now possible to make frequent changes to the neural net to implement the reinforcement learning process.

Results

Switching to the neural net model with reinforcement learning has already led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms since the neural net model is more expensive. The response time stays low because in Vespa the neural net is evaluated on all the content nodes (partitions) in parallel. This avoids the bottleneck of sending the data for each comment to be evaluated over the network and allows increasing parallelization indefinitely by adding more content nodes.

However, evaluating the neural net for all comments for outlier articles which have hundreds of thousands of comments would still be very costly. If you read the rank profile configuration shown above, you’ll have noticed the solution to this: Two-phase ranking was used where the comments are first selected by a cheap rank function (termed freshnessRank) and the highest scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.
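The cap can be made concrete with a little arithmetic: each content node evaluates the neural net for at most rerank-count comments, however many comments match on that node. The sketch below uses made-up match counts purely for illustration.

```python
def second_phase_evaluations(matches_per_node, rerank_count):
    """Total number of neural net evaluations for one query across all content nodes."""
    return sum(min(matches, rerank_count) for matches in matches_per_node)
```

For example, with a rerank count of 2000 and three nodes matching 150000, 500 and 2000 comments, the neural net is evaluated 2000 + 500 + 2000 = 4500 times rather than 152500 times.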

Conclusion and future work

In this article I have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa this can be accomplished relatively easily with a standard open source technology.

The team working on this plans to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.