Blog recommendation with neural network models

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.

Introduction

The main objective of this post is to show how to deploy neural network models in Vespa using our Tensor Framework. In fact, any model that can be represented by a series of Tensor operations can be deployed in Vespa; neural networks are just a popular example. In addition, we will introduce the multi-phase ranking model available in Vespa, which can be used to run more expensive models in a later phase on a reduced number of documents returned by previous phases. This feature allows us to run models that would be prohibitively expensive to use if we had to run them at query time across all the documents indexed in Vespa.

Model Training

In this section, we will define a neural network model, show how we created a suitable training dataset, and train the model using TensorFlow.

The neural network model

In the previous blog post, we computed latent factors for each user and each document and then used a dot-product between user and document vectors to rank the documents available for recommendation to a specific user. In this tutorial we will train a 2-layer fully connected neural network model that will take the same user (u) and document (d) latent factors as input and will output the probability of that specific user liking the document.

More technically, our previous rank function r was given by

r(u,d)=u∗d

while in this tutorial it will be given by

r(u,d,θ)=f(u,d,θ)

where f represents the neural network model described below and θ denotes the neural network parameters that we need to learn from the training data.

The specific form of the neural network model used here is

p = sigmoid(h1×W2+b2)
h1 = ReLU(x×W1+b1)

where x=[u,d] is the concatenation of the user and document latent factors, ReLU is the rectifier activation function, sigmoid represents the sigmoid function, and p is the output of the model, which in this case can be interpreted as the probability of the user u liking a blog post d. The parameters of the model are represented by θ=(W1,W2,b1,b2).
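For readers who prefer code to formulas, here is a minimal NumPy sketch of this forward pass (function and variable names are illustrative; the shapes correspond to the constant tensors defined later in this post, e.g. W1 of shape 20×40):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(u, d, W1, b1, W2, b2):
    """Forward pass of the two-layer model: p = sigmoid(ReLU(x*W1 + b1)*W2 + b2)."""
    x = np.concatenate([u, d])       # x = [u, d], e.g. shape (20,)
    h1 = relu(x @ W1 + b1)           # hidden layer, e.g. shape (40,)
    return sigmoid(h1 @ W2 + b2)     # probability of user u liking blog post d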

Training data

For the training dataset, we will start with the (user_id, post_id) rows from the “training_set_ids” generated previously. Then, we remove every row for which there are no latent factors for the user_id or post_id contained in that row. This gives us a dataset with only positive feedback (label = 1), since each row represents one instance of a user_id liking a post_id.

In order to train our model, we need to generate negative feedback (label = 0). So, for each row (user_id, post_id) in the current dataset we will generate N negative feedback rows by randomly sampling post_id_fake from the pool of post_id’s available in the current set, so that for each (user_id, post_id) row with label = 1 we will increase the dataset with N (user_id, post_id_fake) rows with label = 0.
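The actual dataset-generation code lives in the utility scripts; the following is only a rough Python sketch of the sampling scheme described above (function and variable names are made up, and for simplicity it does not exclude a user's own positives from the negative pool):

import random

def add_negative_samples(positive_rows, n_negatives):
    # positive_rows: list of (user_id, post_id) pairs, each an implicit label = 1.
    # Returns (user_id, post_id, label) triples with n_negatives sampled
    # negative rows per positive row.
    post_pool = sorted({post_id for _, post_id in positive_rows})
    dataset = [(user_id, post_id, 1) for user_id, post_id in positive_rows]
    for user_id, _ in positive_rows:
        for _ in range(n_negatives):
            dataset.append((user_id, random.choice(post_pool), 0))
    return dataset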

Find code to generate the dataset in the utility scripts.

Training with TensorFlow

With the training data in hand, we split it into an 80% training set and a 20% validation set and used TensorFlow to train the model. The script used can be found in the utility scripts and executed with:

$ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \
                       --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \
                       --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt

The progress of your training can be visualized using TensorBoard:

$ tensorboard --logdir runs/*/summaries/

Model deployment in Vespa

Two Phase Ranking

When a query is sent to Vespa, it will scan all documents available and select the ones (possibly all) that match the query. When the set of documents matching a query is found, Vespa must decide the order of these documents. Unless explicit sorting is used, Vespa decides this order by calculating a number for each document, the rank score, and sorts the documents by this number.

The rank score can be any function that takes as arguments parameters sent by the query, document attributes defined in search definitions and global parameters not directly linked to query or document parameters. One example of a rank score is the output of the neural network model defined in this tutorial. The model takes the latent factor u associated with a specific user_id (query parameter), the latent factor d associated with the document post_id (document attribute) and the learned model parameters (global parameters not related to a specific query nor document) and returns the probability of user u liking document d.

However, even though Vespa is designed to carry out such calculations optimally, complex expressions become expensive when they must be calculated over every one of a large set of matching documents. To relieve this, Vespa can be configured to run two ranking expressions – a smaller and less accurate one on all hits during the matching phase, and a more expensive and accurate one only on the best hits during the reranking phase. In general this allows a more optimal usage of the CPU budget by dedicating more of the total CPU towards the best candidate hits.

The reranking phase, if specified, will by default be run on the 100 best hits on each search node, after matching and before information is returned upwards to the search container. The number of hits to rerank can be turned up or down as needed. Below is a toy example showing how to configure first and second phase ranking expressions in the rank profile section of search definitions where the second phase rank expression is run on the 200 best hits from first phase on each search node.

search myapp {

    …

    rank-profile default inherits default {

        first-phase {
            expression: nativeRank + query(deservesFreshness) * freshness(timestamp)
        }

        second-phase {
            expression {
                0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +
                0.3 * attributeMatch(keywords)
            }
            rerank-count: 200
        }
    }
}

Constant Tensor files

Once the model has been trained in TensorFlow, export the model parameters (W1,W2,b1,b2) to the application folder as Tensors according to the Vespa Document JSON format.

The complete code to serialize the model parameters using the Vespa Tensor format can be found in the utility scripts, but the following code snippet shows how to serialize the hidden layer weights W1:

serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])

Note that Vespa currently requires dimension names for all the Tensor dimensions (in this case W1 is a matrix, so it has two dimensions).
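Under the hood, the serializer writes the tensor using Vespa's JSON cell format. The real code is in the utility scripts; as a simplified, hedged sketch of what such a serializer does (the function name here is made up):

import json
import numpy as np

def serialize_matrix(matrix, dim_names, file_path):
    # Write a 2-D NumPy array as a Vespa constant tensor JSON file using the
    # verbose "cells" format: each cell lists its address (dimension -> index)
    # and its value.
    cells = []
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            cells.append({
                "address": {dim_names[0]: str(i), dim_names[1]: str(j)},
                "value": float(matrix[i, j]),
            })
    with open(file_path, "w") as f:
        json.dump({"cells": cells}, f)

# e.g. serialize_matrix(W1, ["input", "hidden"], "constants/W_hidden.json")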

In the following section, we will use the following code in the blog_post search definition in order to be able to use the constant tensor W_hidden in our ranking expression.

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

A constant tensor is data that is not specific to a given document type. In the case above we define W_hidden to be a tensor with two dimensions (matrix), where the first dimension is named input and has size 20 and second dimension is named hidden and has size 40. The data were serialized to a JSON file located at constants/W_hidden.json relative to the application package folder.

Vespa ranking expressions

In order to evaluate the neural network model trained with TensorFlow in the previous section, we need to translate the model structure to a Vespa ranking expression defined in the blog_post search definition. To honor a low-latency response, we will take advantage of the Two Phase Ranking available in Vespa and define the first phase ranking to be the same ranking function used in the previous blog post, which is a dot-product between the user and document latent factors. After the documents have been sorted by the first phase ranking function, we will rerank the top 200 documents from each search node using the second phase ranking given by the neural network model presented above.

Note that we define two ranking profiles in the search definition below. This allows us to decide which ranking profile to use at query time. We define a ranking profile named tensor, which only applies the dot-product between user and document latent factors for all matching documents, and a ranking profile named nn_tensor, which reranks the top 200 documents using the neural network model discussed in the previous section.

We will walk through each part of the blog_post search definition, see blog_post.sd.

As always, we start the search definition with the following line:

search blog_post {

We define the document type blog_post the same way we have done in the previous tutorial.

    document blog_post {

      # Field definitions
      # Examples:

      field date_gmt type string {
          indexing: summary
      }
      field language type string {
          indexing: summary
      }

      # Remaining fields as found in previous tutorial

    }

We define a ranking profile named tensor, which ranks all the matching documents by the dot-product between the document latent factor and the user latent factor. This is the same ranking expression used in the previous tutorial, which includes code to retrieve the user latent factor based on the user_id sent by the query to Vespa.

    # Simpler ranking profile without
    # second-phase ranking
    rank-profile tensor {
      first-phase {
          expression {
              sum(query(user_item_cf) * attribute(user_item_cf))
          }
      }
    }

Since we want to evaluate the neural network model we have trained, we need to define where to find the model parameters (W1,W2,b1,b2). See the previous section for how to write the TensorFlow model parameters to Vespa Tensor format.

    # We need to specify the type and the location
    # of the files storing tensor values for each
    # Variable in our TensorFlow model. In this case,
    # W_hidden, b_hidden, W_final, b_final

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

    constant b_hidden {
        file: constants/b_hidden.json
        type: tensor(hidden[40])
    }

    constant W_final {
        file: constants/W_final.json
        type: tensor(hidden[40], final[1])
    }

    constant b_final {
        file: constants/b_final.json
        type: tensor(final[1])
    }

Now, we specify a second rank-profile called nn_tensor that will use the same first phase as the rank-profile tensor but will rerank the top 200 documents using the neural network model as second phase. We refer to the Tensor Reference document for more information regarding the Tensor operations used in the code below.

    # rank profile with neural network model as
    # second phase
    rank-profile nn_tensor {

        # The input to the neural network is the
        # concatenation of the document and query vectors.

        macro nn_input() {
            expression: concat(attribute(user_item_cf), query(user_item_cf), input)
        }

        # Computes the hidden layer

        macro hidden_layer() {
            expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))
        }

        # Computes the output layer

        macro final_layer() {
            expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))
        }


        # First-phase ranking:
        # Dot-product between user and document latent factors

        first-phase {
            expression: sum(query(user_item_cf) * attribute(user_item_cf))
        }

        # Second-phase ranking:
        # Neural network model based on the user and latent factors

        second-phase {
            rerank-count: 200
            expression: sum(final_layer)
        }

    }

}

Offline evaluation

We will now query Vespa and obtain 100 blog post recommendations for each user_id in our test set. Below, we query Vespa using the tensor ranking function, which contains the simpler ranking expression involving the dot-product between user and document latent factors.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=tensor \
  -param OUTPUT=blog-job/cf-metric

We perform the same query routine below, but now using the ranking-profile nn_tensor which reranks the top 200 documents using the neural network model.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=nn_tensor \
  -param OUTPUT=blog-job/cf-metric

The tutorial_compute_metric.pig script can be found in our repo.

Comparing the recommendations obtained by those two ranking profiles and our test set, we see that by deploying a more complex and accurate model in the second phase ranking, we increased the number of relevant documents (documents read by the user) retrieved from 11948 to 12804 (more than 7% increase) and those documents retrieved appeared higher up in the list of recommendations, as shown by the expected percentile ranking metric introduced in the Vespa tutorial pt. 2 which decreased from 37.1% to

Optimizing realtime evaluation of neural net models on Vespa

In this blog post we describe how we recently made neural network evaluation over 20 times faster on Vespa’s tensor framework.

Vespa is the open source platform for building applications that carry out scalable real-time data processing, for instance search and recommendation systems. These require significant amounts of computation over large data sets. With advances in machine learning, it is desirable to run more advanced ranking models such as large linear or logistic regression models and artificial neural networks. Because of the tight computational budget at serving time, the evaluation of such models must be done in an efficient and scalable manner.

We introduced the tensor API to help solve such problems. The tensor API allows the concise expression of general computations on many-dimensional data, while simultaneously leaving room for deep optimizations on the platform side.  What we mean by this is that the tensor API is very expressive and supports a large range of model types. The general evaluation of tensors is not necessarily efficient in all cases, so in addition to continually working to increase the baseline performance, we also perform specific optimizations for important use cases. In this blog post we will describe one such important optimization we recently did, which improved neural network evaluation performance by over 20x.

To illustrate the types of optimization we can do, consider the following tensor expression representing a dot product between vectors v1 and v2:

reduce(join(v1, v2, f(x, y)(x * y)), sum)

The dot product is calculated by multiplying the vectors together by using the join operation,
then summing the elements in the vector together using the reduce operation.
The result is a single scalar. A naive implementation would first calculate the join and introduce a temporary tensor before the reduce sums up the cells to a single scalar. Particularly for large tensors with many dimensions, such a temporary tensor can be large and require significant memory allocations. This is obviously not the most efficient path to calculate the resulting tensor.  A general improvement would be to avoid the temporary tensor and reduce to the single scalar directly as the tensors are iterated through.
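To make the difference concrete, here is a small NumPy analogy of the two evaluation strategies (Vespa's actual implementation is in C++ and hardware accelerated; this is only an illustration):

import numpy as np

v1 = np.random.rand(1000)
v2 = np.random.rand(1000)

# Naive evaluation: the join materializes a temporary tensor of the same size
# as the inputs before reduce sums it down to a single scalar.
tmp = v1 * v2                # join(v1, v2, f(x, y)(x * y))
naive_result = tmp.sum()     # reduce(..., sum)

# Optimized evaluation: accumulate directly while iterating, never
# materializing the temporary tensor (conceptually what np.dot does).
fused_result = np.dot(v1, v2)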

In Vespa, when ranking expressions are compiled, the abstract syntax tree (AST) is analyzed for such optimizations. When known cases are recognized, the most efficient implementation is selected. In the above example, assuming the vectors are dense and they share dimensions, Vespa has optimized hardware accelerated code for doing dot products on vectors. For sparse vectors, Vespa falls back to an implementation for weighted sets which builds hash tables for efficient lookups. This method allows recognition of both large and small optimizations, from simple dot products to specialized implementations for more advanced ranking models. Vespa currently has a few optimizations implemented, and we are adding more as important use cases arise.

We recently set out to improve the performance of evaluating simple neural networks, a case quite similar to the one presented in the previous blog post. The ranking expression to optimize was:

   macro hidden_layer() {
       expression: elu(xw_plus_b(nn_input, constant(W_fc1), constant(b_fc1), x))
   }
   macro final_layer() {
       expression: xw_plus_b(hidden_layer, constant(W_fc2), constant(b_fc2), hidden)
   }
   first-phase {
       expression: final_layer
   }

This represents a simple two-layer neural network.

Whenever a new version of Vespa is built, a large suite of integration and performance tests is run. When we want to optimize a specific use case, we first create a performance test to set a baseline. With the performance tests we get historical graphs as well as detailed profiling information and performance statistics sampled from the system under load. This allows us to identify and optimize any bottlenecks. Also, it adds a bit of gamification to the process.

The graph below shows the performance of a test where 10 000 random documents are ranked according to the evaluation of a simple two-layer neural network:

image

Here, the x-axis represents builds, and the y-axis is the end-to-end latency as measured from a machine firing off queries to a server running the test on Vespa. As can be seen, over the course of optimization the latency was reduced from 150-160 ms to 7 ms, an impressive 20x end-to-end latency improvement.

When a query is received by Vespa, it is first processed in the stateless container. This is usually where applications would process the query, possibly enriching it with additional information. Vespa does a bit of default work here as well, and also transforms the query a bit. For this test, no specific handling was done except this default handling. After initial processing, the query is dispatched to each node in the stateful content layer. For this test, only a single node is used in the content layer, but applications would typically have multiple. The query is processed in parallel on each node utilizing multiple cores and the ranking expression gets executed once for each document that matches the query. For this test with 10 000 documents, the ranking expression and thus the neural network gets evaluated in total 10 000 times before the top N documents are returned to the container layer.

The following steps were taken to optimize this expression, with each step visible as a step in the graph above:

  1. Recognize join with multiplication as part of an inner product.
  2. Optimize for bias addition.
  3. Optimize vector concatenation (which was part of the input to the neural network).
  4. Replace appropriate sub-expressions with the dense vector-matrix product.

It was particularly the final step which gave the biggest percentage-wise performance boost. The solution in total was to recognize the vector-matrix multiplication done in the neural network layer and replace that with specialized code that invokes the existing hardware accelerated dot product code. In the expression above, the operation xw_plus_b expands into a reduce over a multiplicative join followed by an additive join; this is the pattern that is recognized and performed in one step instead of three.

This strategy of optimizing specific use cases allows for a more rapid application development for users of Vespa. Consider the case where some exotic model needs to be run on Vespa. Without the generic tensor API users would have to implement their own custom rank features or wait for the Vespa core developers to implement them. In contrast, with the tensor API, teams can continue their development without external dependencies to the Vespa team.  If necessary, the Vespa team can in parallel implement the optimizations needed to meet performance requirements, as we did in this case with neural networks.

Introducing TensorFlow support

In previous blog posts we have talked about Vespa’s tensor API which enables some advanced ranking capabilities. The primary use case is for machine learned ranking, where you train your models using some machine learning framework, convert the models to Vespa’s tensor format, and deploy them to Vespa. This works well, but converting trained models to Vespa form is cumbersome.

We are now happy to announce a new feature that makes this process a lot easier: TensorFlow import. With this feature you can directly deploy models you’ve trained in TensorFlow to Vespa, and use these models during ranking. This means that the models are executed in parallel over multiple threads and machines for a single query, which makes it possible to evaluate the model over any number of data items and still bound the total response time. In addition the data items to evaluate with the TensorFlow model can be selected dynamically with a query, and with a cheaper first-phase rank function if needed. Since the TensorFlow models are evaluated on the nodes storing the data, we avoid sending any data over the wire for evaluation.

In this post we’d like to introduce this new feature by discussing how it works, some assumptions behind working with TensorFlow and Vespa, and how to use the feature.

Vespa is optimized to evaluate models repeatedly over many data items (documents).  To do this efficiently, we do not evaluate the model using the TensorFlow inference engine. TensorFlow adds a non-trivial amount of overhead and instrumentation which it uses to manage potentially large scale computations. This is significant in our case, since we need to evaluate models on a micro-second scale. Hence our approach is to extract the parameters (weights) into Vespa tensors, and use the model specification in the TensorFlow graph to generate efficient Vespa tensor expressions.

Importing TensorFlow models is as simple as saving the TensorFlow model using the
SavedModel API,
adding those files to the Vespa application package, and referencing the model using the new TensorFlow ranking feature.
For instance, if your files are in models/my_model in the application package:

first-phase {
    expression: sum(tensorflow("my_model/saved"))
}

The above expression runs the model and sums it to a single scalar value to use in ranking. One thing you will have to provide is the input(s), or feed, to the graph. Vespa expects you to provide a macro with the same name as the input placeholder. In the macro you can specify where the input should come from, be it a parameter sent along with the query, a document field (possibly in a parent document) or a constant.
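For completeness, here is a hedged TensorFlow 1.x sketch of exporting a toy model with the SavedModel API; the shapes, names and paths are illustrative, and the input placeholder name is what the corresponding Vespa macro must be called:

import tensorflow as tf

# A toy model: a single dense layer with a sigmoid output.
x = tf.placeholder(tf.float32, shape=(None, 256), name="input")
W = tf.Variable(tf.random_normal([256, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.sigmoid(tf.matmul(x, W) + b, name="output")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Export with the SavedModel API; the exported directory is what you add
    # to the application package, e.g. models/my_model/saved.
    tf.saved_model.simple_save(sess, "models/my_model/saved",
                               inputs={"input": x}, outputs={"output": y})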

As mentioned, Vespa evaluates the imported models once per document. Depending on the requirements of the application, this can impose some natural limitations on the size and complexity of the models that can be evaluated. However, Vespa has a number of other search and rank features that can be used to reduce the search space before running the machine learned models. Typically, one would use the search and first ranking phases to select a relatively small number of candidate documents, which are then given their final rank score in the more computationally expensive second phase model evaluation.

Also note that TensorFlow import is new to Vespa, and we currently only support a subset of the TensorFlow operations. While the supported operations should suffice for many relevant use cases, there are some that are not supported yet due to potentially being too expensive to evaluate per document. For instance, convolutional networks and recurrent networks (LSTMs etc) are not supported. We are continually working to add functionality, if you find that we have some glaring omissions, please let us know.

Going forward we are focusing on further improving performance of our tensor framework for important use cases. We’ll follow up this post with one showing how the performance of evaluation in Vespa compares with TensorFlow serving. We will also add more supported frameworks and our next target is ONNX.

You can read more about this feature in the ranking with TensorFlow models in Vespa documentation. We are excited to announce the TensorFlow support, and we’re eager to hear what you are building with it.

Scaling TensorFlow model evaluation with Vespa

In this blog post we’ll explain how to use Vespa to evaluate TensorFlow models over arbitrarily many data points while keeping total latency constant. We provide benchmark data from our performance lab where we compare evaluation using TensorFlow serving with evaluating TensorFlow models in Vespa.

We recently introduced a new feature that enables direct import of TensorFlow models into Vespa for use at serving time. As mentioned in a previous blog post, our approach to support TensorFlow is to extract the computational graph and parameters of the TensorFlow model and convert it to Vespa’s tensor primitives. We chose this approach over attempting to integrate our backend with the TensorFlow runtime. There were a few reasons for this. One was that we would like to support other frameworks than TensorFlow. For instance, our next target is to support ONNX. Another was that we would like to avoid the inevitable overhead of such an integration, both on performance and code maintenance. Of course, this means a lot of optimization work on our side to make this as efficient as possible, but we do believe it is a better long term solution.

Naturally, we thought it would be interesting to set up some sort of performance comparison between Vespa and TensorFlow for cases that use a machine learning ranking model.

Before we get to that however, it is worth noting that Vespa and TensorFlow serving have an important conceptual difference. With TensorFlow you are typically interested in evaluating a model for a single data point, be that an image for an image classifier, or a sentence for a semantic representation etc. The use case for Vespa is when you need to evaluate the model over many data points. Examples are finding the best document given a text, or images similar to a given image, or computing a stream of recommendations for a user.

So, let’s explore this by setting up a typical search application in Vespa. We’ve based the application in this post on the Vespa
blog recommendation tutorial part 3.
In this application we’ve trained a collaborative filtering model which computes an interest vector for each existing user (which we refer to as the user profile) and a content vector for each blog post. In collaborative filtering these vectors are commonly referred to as latent factors. The application takes a user id as the query, retrieves the corresponding user profile, and searches for the blog posts that best match the user profile. The match is computed by a simple dot-product between the latent factor vectors. This is used as the first phase ranking. We’ve chosen vectors of length 128.

In addition, we’ve trained a neural network in TensorFlow to serve as the second-phase ranking. The user vector and blog post vector are concatenated and represent the input (of size 256) to the neural network. The network is fully connected with 2 hidden layers of size 512 and 128 respectively, and the network has a single output value representing the probability that the user would like the blog post.

In the following we set up two cases we would like to compare. The first is where the imported neural network is evaluated on the content node using Vespa’s native tensors. In the other we run TensorFlow directly on the stateless container node in the Vespa 2-tier architecture. In this case, the additional data required to evaluate the TensorFlow model must be passed back from the content node(s) to the container node. We use Vespa’s fbench utility to stress the system under fairly heavy load.

In this first test, we set up the system on a single host. This means the container and content nodes are running on the same host. We set up fbench so it uses 64 clients in parallel to query this system as fast as possible. 1000 documents per query are evaluated in the first phase and the top 200 documents are evaluated in the second phase. In the following, latency is measured in ms at the 95th percentile and QPS is the actual query rate in queries per second:

  • Baseline: 19.68 ms / 3251.80 QPS
  • Baseline with additional data: 24.20 ms / 2644.74 QPS
  • Vespa ranking: 42.8 ms / 1495.02 QPS
  • TensorFlow batch ranking: 42.67 ms / 1499.80 QPS
  • TensorFlow single ranking: 103.23 ms / 619.97 QPS

Some explanation is in order. The baseline here is the first phase ranking only without returning the additional data required for full ranking. The baseline with additional data is the same but returns the data required for ranking. Vespa ranking evaluates the model on the content backend. Both TensorFlow tests evaluate the model after content has been sent to the container. The difference is that batch ranking evaluates the model in one pass by batching the 200 documents together in a larger matrix, while single evaluates the model once per document, i.e. 200 evaluations. The reason why we test this is that Vespa evaluates the model once per document to be able to evaluate during matching, so in terms of efficiency this is a fairer comparison.

We see in the numbers above for this application that Vespa ranking and TensorFlow batch ranking achieve similar performance. This means that the gains in ranking batch-wise is offset by the cost of transferring data to TensorFlow. This isn’t entirely a fair comparison however, as the model evaluation architecture of Vespa and TensorFlow differ significantly. For instance, we measure that TensorFlow has a much lower degree of cache misses. One reason is that batch-ranking necessitates a more contiguous data layout. In contrast, relevant document data can be spread out over the entire available memory on the Vespa content nodes.

Another significant reason is that Vespa currently uses double floating point precision in ranking and in tensors. In the above TensorFlow model we have used floats, resulting in half the required memory bandwidth. We are considering making the floating point precision in Vespa configurable to improve evaluation speed for cases where full precision is not necessary, such as in most machine learned models.

So we still have some work to do in optimizing our tensor evaluation pipeline, but we are pleased with our results so far. Now, the performance of the model evaluation itself is only a part of the system-wide performance. In order to rank with TensorFlow, we need to move data to the host running TensorFlow. This is not free, so let’s delve a bit deeper into this cost.

The locality of data in relation to where the ranking computation takes place is an important aspect and indeed a core design point of Vespa. If your data is too large to fit on a single machine, or you want to evaluate your model on more data points faster than is possible on a single machine, you need to split your data over multiple nodes. Let’s assume that documents are distributed randomly across all content nodes, which is a very reasonable thing to do. Now, when you need to find the globally top-N documents for a given query, you first need to find the set of candidate documents that match the query. In general, if ranking is done on some other node than where the content is, all the data required for the computation obviously needs to be transferred there. Usually, the candidate set can be large so this incurs a significant cost in network activity, particularly as the number of content nodes increase. This approach can become infeasible quite quickly.

This is why a core design aspect of Vespa is to evaluate models where the content is stored.

image

This is illustrated in the figure above. The problem of transferring data for ranking is compounded as the number of content nodes increase, because to find the global top-N documents, the top-K documents of each content node need to be passed to the external ranker. This means that, if we have C content nodes, we need to transfer C*K documents over the network. This runs into hard network limits as the number of documents and data size for each document increases.
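For example, even the modest setup below, with 3 content nodes and K = 200, already moves 600 candidate documents, plus all the data needed to rank each of them, over the network for every single query.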

Let’s see the effect of this when we change the setup of the same application to run on three content nodes and a single stateless container which runs TensorFlow. In the following graph we plot the 95th percentile latency as we increase the number of parallel requests (clients) from 1 to 30:

image

Here we see that with low traffic, TensorFlow and Vespa are comparable in terms of latency. When we increase the load however, the cost of transmitting the data is the driver for the increase in latency for TensorFlow, as seen in the red line in the graph. The differences between batch and single mode TensorFlow evaluation become smaller as the system as a whole becomes largely network-bound. In contrast, the Vespa application scales much better.

Now, as we increase traffic even further, will the Vespa solution likewise become network-bound? In the following graph we plot the sustained requests per second as we increase clients to 200:

image

Vespa ranking is unable to sustain the same amount of QPS as just transmitting the data (the blue line), which is a hint that the system has become CPU-bound on the evaluation of the model on Vespa. While Vespa can sustain around 3500 QPS, the TensorFlow solution maxes out at 350 QPS which is reached quite early as we increase traffic. As the system is unable to transmit data fast enough, the latency naturally has to increase which is the cause for the linearity in the latency graph above. At 200 clients the average latency of the TensorFlow solution is around 600 ms, while Vespa is around 60 ms.

So, the obvious key takeaway here is that from a scalability point of view it is beneficial to avoid sending data around for evaluation. That is both a key design point of Vespa and a reason why we implemented TensorFlow support in the first place. Running the models where the content is stored allows for better utilization of resources, but perhaps the more interesting aspect is the ability to run more complex or deeper models while still being able to scale the system.

Parent-child in Vespa

Parent-child relationships let you model hierarchical relations in your data. This blog post talks about why and how we added this feature to Vespa, and how you can use it in your own applications. We’ll show some performance numbers and discuss practical considerations.

Introduction

The shortest possible background

Traditional relational databases let you perform joins between tables. Joins enable efficient normalization of data through foreign keys, which means any distinct piece of information can be stored in one place and then referred to (often transitively), rather than to be duplicated everywhere it might be needed. This makes relational databases an excellent fit for a great number of applications.

However, if we require scalable, real-time data processing with millisecond latency our options become more limited. To see why, and to investigate how parent-child can help us, we’ll consider a hypothetical use case.

A grand business idea

Let’s assume we’re building a disruptive startup for serving the cutest possible cat picture advertisements imaginable. Advertisers will run multiple campaigns, each with their own set of ads. Since they will (of course) pay us for this privilege, campaigns will have an associated budget which we have to manage at serving time. In particular, we don’t want to serve ads for a campaign that has spent all its money, as that would be free advertising. We must also ensure that campaign budgets are frequently updated when their ads have been served.

Our initial, relational data model might look like this:

Advertiser:
    id: (primary key)
    company_name: string
    contact_person_email: string

Campaign:
    id: (primary key)
    advertiser_id: (foreign key to advertiser.id)
    name: string
    budget: int

Ad:
    id: (primary key)
    campaign_id: (foreign key to campaign.id)
    cuteness: float
    cat_picture_url: string

This data normalization lets us easily update the budgets for all ads in a single operation, which is important since we don’t want to serve ads for which there is no budget. We can also get the advertiser name for all individual ads transitively via their campaign.

Scaling our expectations

Since we’re expecting our startup to rapidly grow to a massive size, we want to make sure we can scale from day one. As the number of ad queries grow, we ideally want scaling up to be as simple as adding more server capacity.

Unfortunately, scaling joins beyond a single server is a significant design and engineering challenge. As a consequence, most of the new data stores released in the past decade have been of the “NoSQL” variant (which might also be called “non-relational databases”). NoSQL’s horizontal scalability is usually achieved by requiring an application developer to explicitly de-normalize all data. This removes the need for joins altogether. For our use case, we have to store the budget and advertiser name across multiple document types and instances (note the duplicated campaign_budget and advertiser company name fields below):

Advertiser:
    id: (primary key)
    company_name: string
    contact_person_email: string

Campaign:
    id: (primary key)
    advertiser_company_name: string
    name: string
    budget: int

Ad:
    id: (primary key)
    campaign_budget: int
    campaign_advertiser_company_name: string
    cuteness: float
    cat_picture_url: string

Now we can scale horizontally for queries, but updating the budget of a campaign requires updating all its ads. This turns an otherwise O(1) operation into O(n), and we likely have to implement this update logic ourselves as part of our application. We’ll be expecting thousands of budget updates to our cat ad campaigns per second. Multiplying this by an unknown number is likely to overload our servers or lose us money. Or both at the same time.
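In pseudo-Python, the difference in write amplification looks roughly like this (the data structures and function names are purely illustrative):

# Denormalized: one campaign budget change fans out to every ad document.
def update_budget_denormalized(ads_by_campaign, campaign_id, new_budget):
    for ad in ads_by_campaign[campaign_id]:          # O(n) writes per update
        ad["campaign_budget"] = new_budget

# Normalized (parent-child): a single write on the campaign document; ads
# read the budget through their reference at serving time.
def update_budget_normalized(campaigns, campaign_id, new_budget):
    campaigns[campaign_id]["budget"] = new_budget    # O(1) write per update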

A pragmatic middle ground

In the middle between these two extremes of “arbitrary joins” and “no joins at all” we have parent-child relationships. These enable a subset of join functionality, but with enough restrictions that they can be implemented efficiently at scale. One core restriction is that your data relationships must be representable as a directed, acyclic graph (DAG).

As it happens, this is the case with our cat picture advertisement use case; an Advertiser is a parent to 0-n Campaigns, each of which in turn is a parent to 0-n Ads. Being able to represent this natively in our application would get us functionally very close to the original, relational schema.

We’ll see very shortly how this can be directly mapped to Vespa’s parent-child feature support.

Parent-child support in Vespa

Creating the data model

Vespa’s fundamental data model is that of documents. Each document belongs to a particular schema and has a user-provided unique identifier. Such a schema is known as a document type and is specified in a search definition file. A document may have an arbitrary number of fields of different types. Some of these may be indexed, some may be kept in memory, all depending on the schema. A Vespa application may contain many document types.

Here’s how the Vespa equivalent of the above denormalized schema could look (again note where we’re duplicating information):

advertiser.sd:
    search advertiser {
        document advertiser {
            field company_name type string {
                indexing: attribute | summary
            }
            field contact_person_email type string {
                indexing: summary
            }
        }
    }

campaign.sd:
    search campaign {
        document campaign {
            field advertiser_company_name type string {
                indexing: attribute | summary
            }
            field name type string {
                indexing: attribute | summary
            }
            field budget type int {
                indexing: attribute | summary
            }
        }
    }

ad.sd:
    search ad {
        document ad {
            field campaign_budget type int {
                indexing: attribute | summary
                attribute: fast-search
            }
            field campaign_advertiser_company_name type string {
                indexing: attribute | summary
            }
            field cuteness type float {
                indexing: attribute | summary
                attribute: fast-search
            }
            field cat_picture_url type string {
                indexing: attribute | summary
            }
        }
    }

Note that since all documents in Vespa must already have a unique ID, we do not need to model the primary key IDs explicitly.

We’ll now see how little it takes to change this to its normalized equivalent by using parent-child.

Parent-child support adds two new types of declared fields to Vespa; references and imported fields.

A reference field contains the unique identifier of a parent document of a given document type. It is analogous to a foreign key in a relational database, or a pointer in Java/C++. A document may contain many reference fields, with each potentially referencing entirely different documents.

We want each ad to reference its parent campaign, so we add the following to ad.sd:

    field campaign_ref type reference<campaign> {
        indexing: attribute
    }

We also add a reference from a campaign to its advertiser in campaign.sd:

    field advertiser_ref type reference<advertiser> {
        indexing: attribute
    }

Since a reference just points to a particular document, it cannot be directly used in queries. Instead, imported fields are used to access a particular field within a referenced document. Imported fields are virtual; they do not take up any space in the document itself and they cannot be directly written to by put or update operations.

Add to search campaign in campaign.sd:

    import field advertiser_ref.company_name as campaign_company_name {}

Add to search ad in ad.sd:

    import field campaign_ref.budget as ad_campaign_budget {}

You can import a parent field which itself is an imported field. This enables transitive field lookups.

Add to search ad in ad.sd:

    import field campaign_ref.campaign_company_name as ad_campaign_company_name {}

After removing the now redundant fields, our normalized schema looks like this:

advertiser.sd:
    search advertiser {
        document advertiser {
            field company_name type string {
                indexing: attribute | summary
            }
            field contact_person_email type string {
                indexing: summary
            }
        }
    }

campaign.sd:
    search campaign {
        document campaign {
            field advertiser_ref type reference<advertiser> {
                indexing: attribute
            }
            field name type string {
                indexing: attribute | summary
            }
            field budget type int {
                indexing: attribute | summary
            }
        }
        import field advertiser_ref.company_name as campaign_company_name {}
    }

ad.sd:
    search ad {
        document ad {
            field campaign_ref type reference<campaign> {
                indexing: attribute
            }
            field cuteness type float {
                indexing: attribute | summary
                attribute: fast-search
            }
            field cat_picture_url type string {
                indexing: attribute | summary
            }
        }
        import field campaign_ref.budget as ad_campaign_budget {}
        import field campaign_ref.campaign_company_name as ad_campaign_company_name {}
    }

Feeding with references

When feeding documents to Vespa, references are assigned like any other string field:

[
    {
        "put": "id:test:advertiser::acme",
        "fields": {
            "company_name": "ACME Inc. cats and rocket equipment",
            "contact_person_email": "[email protected]"
        }
    },
    {
        "put": "id:acme:campaign::catnip",
        "fields": {
            "advertiser_ref": "id:test:advertiser::acme",
            "name": "Most excellent catnip deals",
            "budget": 500
        }
    },
    {
        "put": "id:acme:ad::1",
        "fields": {
            "campaign_ref": "id:acme:campaign::catnip",
            "cuteness": 100.0,
            "cat_picture_url": "/acme/super_cute.jpg"
        }
    }
]

We can efficiently update the budget of a single campaign, immediately affecting all its child ads:

[
    {
        "update": "id:test:campaign::catnip",
        "fields": {
            "budget": {
                "assign": 450
            }
        }
    }
]
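The feed format above is typically fed in bulk with Vespa’s feeding tools. As a hedged sketch, a single update like this could also be sent programmatically over the /document/v1 HTTP API (host, port and document id here are taken from the example above and assume a default local deployment):

import requests

# Partial update: assign a new budget to the campaign document; all child ads
# immediately see the new value through their imported field.
response = requests.put(
    "http://localhost:8080/document/v1/acme/campaign/docid/catnip",
    json={"fields": {"budget": {"assign": 450}}},
)
print(response.json())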

Querying using imported fields

You can use imported fields in queries as if they were a regular field. Here are some examples using YQL:

Find all ads that still have a budget left in their campaign:

select * from ad where ad_campaign_budget > 0

Find all ads that have less than $500 left in their budget and belong to an advertiser whose company name contains the word “ACME”:

select * from ad where ad_campaign_budget < 500 and ad_campaign_company_name contains "ACME"

Note that imported fields are not part of the default document summary, so you must add them explicitly to a separate summary if you want their values returned as part of a query result:

document-summary my_ad_summary {
    summary ad_campaign_budget type int {}
    summary ad_campaign_company_name type string {}
    summary cuteness type float {}
    summary cat_picture_url type string {}
}

Add summary=my_ad_summary as a query HTTP request parameter to use it.

Global documents

One of the primary reasons why distributed, generalized joins are so hard to do well efficiently is that performing a join on node A might require looking at data that is only found on node B (or node C, or D…). Vespa gets around this problem by requiring that all documents that may be joined against are always present on every single node. This is achieved by marking parent documents as global in the services.xml declaration. Global documents are automatically distributed to all nodes in the cluster. In our use case, both advertisers and campaigns are used as parents:

<documents>
    <document mode="index" type="advertiser" global="true"/>
    <document mode="index" type="campaign" global="true"/>
    <document mode="index" type="ad"/>
</documents>

You cannot deploy an application containing reference fields pointing to non-global document types. Vespa verifies this at deployment time.

Performance

Feeding of campaign budget updates

Scenario: feed 2 million ad documents 4 times to a content cluster with one node, each time with a different ratio between ads and parent campaigns. Treat 1:1 as baseline (i.e. 2 million ads, 2 million campaigns). Measure relative speedup as the ratio of how many fewer campaigns must be fed to update the budget in all ads.

Results

  • 1 ad per campaign: 35000 campaign puts/second
  • 10 ads per campaign: 29000 campaign puts/second, 8.2x relative speedup
  • 100 ads per campaign: 19000 campaign puts/second, 54x relative speedup
  • 1000 ads per campaign: 6200 campaign puts/second, 177x relative speedup

Note that there is some cost associated with higher fan-outs due to internal management of parent-child mappings, so the speedup is not linear with the fan-out.

Searching on ads based on campaign budgets

Scenario: we want to search for all ads having a specific budget value. First measure with all ad budgets denormalized, then using an imported budget field from the ads’ referenced campaign documents. As with the feeding benchmark, we’ll use 1, 10, 100 and 1000 ads per campaign with a total of 2 million ads combined across all campaigns. Measure average latency over 5 runs.

In each case, the budget attribute is declared as fast-search, which means it has a B-tree index. This allows for efficient value and range searches.

Results

  • 1 ad per campaign: denormalized 0.742 ms, imported 0.818 ms, 10.2% slowdown
  • 10 ads per campaign: denormalized 0.986 ms, imported 1.186 ms, 20.2% slowdown
  • 100 ads per campaign: denormalized 0.830 ms, imported 0.958 ms, 15.4% slowdown
  • 1000 ads per campaign: denormalized 0.936 ms, imported 0.922 ms, 1.5% speedup

The observed speedup for the biggest fan-out is likely an artifact of measurement noise.

We can see that although there is generally some overhead when searching on an imported field rather than a denormalized one, it stays modest (roughly 10-20% in these benchmarks).

Introducing ONNX support

ONNX (Open Neural Network eXchange) is an open format for the sharing of neural network and other machine learned models between various machine learning and deep learning frameworks. As the open big data serving engine, Vespa aims to make it simple to evaluate machine learned models at serving time at scale. By adding ONNX support in Vespa in addition to our existing TensorFlow support, we’ve made it possible to evaluate models from all the commonly used ML frameworks with low latency over large amounts of data.

Introduction

With the rise of deep learning in the last few years, we’ve naturally enough seen an increase of deep learning frameworks as well: TensorFlow, PyTorch/Caffe2, MxNet etc. One reason for these different frameworks to exist is that they have been developed and optimized around some characteristic, such as fast training on distributed systems or GPUs, or efficient evaluation on mobile devices. Previously, complex projects with non-trivial data pipelines have been unable to pick the best framework for any given subtask due to lacking interoperability between these frameworks. ONNX is a solution to this problem.

ONNX

ONNX is an open format for AI models, and represents an effort to push open standards in AI forward. The goal is to help increase the speed of innovation in the AI community by enabling interoperability between different frameworks and thus streamlining the process of getting models from research to production.

There is one commonality between the frameworks mentioned above that enables an open format such as ONNX, and that is that they all make use of dataflow graphs in one way or another. While there are differences between each framework, they all provide APIs enabling developers to construct computational graphs and runtimes to process these graphs. Even though these graphs are conceptually similar, each framework has been a siloed stack of API, graph and runtime. The goal of ONNX is to empower developers to select the framework that works best for their project, by providing an extensible computational graph model that works as a common intermediate representation at any stage of development or deployment.

Vespa is an open source project which fits well within such an ecosystem, and we aim to make the process of deploying and serving models to production that have been trained on any framework as smooth as possible. Vespa is optimized toward serving and evaluating over potentially very large datasets while still responding in real time. In contrast to other ML model serving options, Vespa can more efficiently evaluate models over many data points. As such, Vespa is an excellent choice when combining model evaluation with serving of various types of content.

Our ONNX support is quite similar to our TensorFlow support. Importing ONNX models is as simple as adding the model to the Vespa application package (under “models/”) and referencing the model using the new ONNX ranking feature:

    expression: sum(onnx("my_model.onnx"))

The above expression runs the model and sums it to a single scalar value to use in ranking. You will have to provide the inputs to the graph. Vespa expects you to provide a macro with the same name as the input tensor. In the macro you can specify where the input should come from, be it a document field, constant or a parameter sent along with the query. More information can be found in the documentation about ONNX import.
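As an example of producing such a model, here is a hedged sketch of exporting a small PyTorch network to ONNX (the model, sizes and names are illustrative; the input name chosen here is what the corresponding Vespa macro must be called):

import torch
import torch.nn as nn

# An illustrative two-layer model with a sigmoid output.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
dummy_input = torch.randn(1, 256)

# Trace the model with a dummy input and write the ONNX file that goes into
# the application package, e.g. models/my_model.onnx.
torch.onnx.export(model, dummy_input, "my_model.onnx",
                  input_names=["input"], output_names=["output"])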

Internally, Vespa converts the ONNX operations to Vespa’s tensor API. We do the same for TensorFlow import. So the cost of evaluating ONNX and TensorFlow models are the same. We have put a lot of effort in optimizing the evaluation of tensors, and evaluating neural network models can be quite efficient.

ONNX support is also quite new to Vespa, so we do not support all current ONNX operations. Part of the reason we don’t support all operations yet is that some are potentially too expensive to evaluate per document, such as convolutional neural networks and recurrent networks (LSTMs etc). ONNX also contains an extension, ONNX-ML, which contains additional operations for non-neural network cases. Support for this extension will come later at some point. We are continually working to add functionality, so please reach out to us if there is something you would like to have added.

Going forward we are continually working on improving performance as well as supporting more of the ONNX (and ONNX-ML) standard. You can read more about ranking with ONNX models in the Vespa documentation. We are excited to announce ONNX support. Let us know what you are building with it!

Introducing JSON queries

We recently introduced a new addition to the Search API – JSON queries. The search request can now be executed with a POST request, which includes the query-parameters within its payload. Along with this new query we also introduce a new parameter SELECT with the sub-parameters WHERE and GROUPING, which is equivalent to YQL.

The new query

With the Search API’s newest addition, it is now possible to send queries with HTTP POST. The query-parameters have been moved out of the URL and into a POST request body – therefore, no more URL-encoding. You also avoid getting all the queries in the log, which can be an advantage.

This is what a GET query looks like:

GET /search/?param1=value1&param2=value2&...

The general form of the new POST query is:

POST /search/ { param1 : value1, param2 : value2, ... }

The dot-notation is gone, and the query-parameters are now nested under the same key instead.

Let’s take this query:

GET /search/?yql=select+%2A+from+sources+%2A+where+default+contains+%22bad%22%3B&ranking.queryCache=false&ranking.profile=vespaProfile&ranking.matchPhase.ascending=true&ranking.matchPhase.maxHits=15&ranking.matchPhase.diversity.minGroups=10&presentation.bolding=false&presentation.format=json&nocache=true

and write it in the new POST request-format, which will look like this:

POST /search/ { "yql": "select \* from sources \* where default contains \"bad\";", "ranking": { "queryCache": "false", "profile": "vespaProfile", "matchPhase": { "ascending": "true", "maxHits": 15, "diversity": { "minGroups": 10 } } }, "presentation": { "bolding": "false", "format": "json" }, "nocache": true }

With Vespa running (see Quick Start or
Blog Search Tutorial),
you can try building POST-queries with the new querybuilder GUI at http://localhost:8080/querybuilder/, which can help you build queries with e.g. autocompletion of YQL:

image

The Select-parameter

The SELECT-parameter is used with POST queries and is the JSON equivalent of YQL queries, so they cannot be used together. The query-parameter will overwrite SELECT and decide the query’s query-tree.

Where

The SQL-like syntax is gone and the tree-syntax has been enhanced. If you’re used to the query-parameter syntax you’ll feel right at home with this new language. YQL is a regular language that Vespa parses into a query-tree, and you can now build that tree directly in the WHERE-parameter with JSON. Let’s take a look at the YQL select * from sources * where default contains foo and rank(a contains "A", b contains "B");, which will create the following query-tree:

image

You can build the tree above with the WHERE-parameter, like this:

{
    "and" : [
        { "contains" : ["default", "foo"] },
        { "rank" : [
            { "contains" : ["a", "A"] },
            { "contains" : ["b", "B"] }
        ]}
    ]
}

This is equivalent to the YQL above.
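Since the WHERE parameter is plain JSON, the tree can also be assembled programmatically. As a small illustration (the contains helper below is just a hypothetical convenience, not part of any API), the tree above could be built like this in Python:

# Hypothetical helper for building "contains" nodes in the JSON query tree.
def contains(field, value):
    return {"contains": [field, value]}

where = {
    "and": [
        contains("default", "foo"),
        {"rank": [contains("a", "A"), contains("b", "B")]}
    ]
}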

Grouping

The grouping can now be written in JSON, with structure, instead of on a single line. Instead of parentheses we use curly brackets to express the tree structure between the different grouping/aggregation functions, and colons to assign function arguments.

A grouping that groups first by year and then by month can be written like this:

all(group(time.year(a)) each(output(count())
         all(group(time.monthofyear(a)) each(output(count()))))))

and equivalently with the new GROUPING parameter:

"grouping" : [
    {
        "all" : {
            "group" : "time.year(a)",
            "each" : { "output" : "count()" },
            "all" : {
                "group" : "time.monthofyear(a)",
                "each" : { "output" : "count()" },
            }
        }
    }
]
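Putting the pieces together, here is a sketch of a complete POST request that sends both WHERE and GROUPING under the SELECT parameter, again assuming a local Vespa instance and the requests library (the field names are taken from the examples above):

import requests

# WHERE and GROUPING nested under the SELECT parameter of the POST body.
select_query = {
    "select": {
        "where": {"contains": ["default", "foo"]},
        "grouping": [
            {
                "all": {
                    "group": "time.year(a)",
                    "each": {"output": "count()"},
                    "all": {
                        "group": "time.monthofyear(a)",
                        "each": {"output": "count()"}
                    }
                }
            }
        ]
    },
    "presentation": {"format": "json"}
}

response = requests.post("http://localhost:8080/search/", json=select_query)
print(response.json())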

Wrapping it up

In this post we have provided a gentle introduction to the new Vespa POST query feature and the SELECT parameter. You can read more about writing POST queries in the Vespa documentation. More examples of POST queries can be found in the Vespa tutorials.

Please share experiences. Happy searching!

Vespa at Zedge – providing personalization content to millions of iOS, Android & web users

This blog post describes Zedge’s use of Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS and Web). Zedge is now using Vespa in production to serve millions of monthly active users. See the architecture below.

[figure: Zedge's content discovery architecture with Vespa]

What is Zedge?


Zedge’s main product is an app, Zedge Ringtones & Wallpapers, that provides wallpapers, ringtones, game recommendations and notification sounds customized for your mobile device. Zedge apps have been downloaded more than 300 million times combined on iOS and Android and are used by millions of people worldwide each month. Zedge is traded on NYSE under the ticker ZDGE.

People use Zedge apps for self-expression. Setting a wallpaper or ringtone on your mobile device is in many ways similar to selecting clothes, a hairstyle or other fashion statements. In fact, people try a wallpaper or ringtone much as they would try clothes in a dressing room before making a purchase decision: they try different wallpapers or ringtones before deciding on one they want to keep for a while.

The decision to select a wallpaper is not taken lightly, since people view and interact with their mobile device screen (and background wallpaper) hundreds of times per day.

Why Zedge considered Vespa

Zedge apps for iOS, Android and Web depend heavily on search and recommender services to support content discovery. These services had been developed over several years and consisted of multiple subsystems, both internally developed and open source, and technologies for both search and recommender serving. In addition there were numerous big data processing jobs to build and maintain data for content discovery serving. The time and complexity involved in improving the search and recommender services and the corresponding processing jobs had become high, so simplification was overdue.


Vespa seemed like a promising open source technology to consider for Zedge, in particular since it was proven in several ways within Oath (Yahoo):

  1. Scales to handle very large systems, e.g.:

     - Flickr, with billions of images
     - Yahoo Gemini Ads Platform, with more than one hundred thousand requests per second to serve ads to 1 billion monthly active users for services such as Techcrunch, Aol, Yahoo!, Tumblr and Huffpost

  2. Runs stably and requires very little operations support: Oath has a few hundred Vespa-based applications, many of them large, that require less than a handful of operations people to run smoothly.

  3. Rich set of features that Zedge could gain from using:

     - Built-in tensor processing support could simplify calculation and serving of related wallpapers (images) & ringtones/notifications (audio)
     - Built-in support for TensorFlow models to simplify development and deployment of machine learning based search and recommender ranking (at that time in development, according to Oath)
     - Search Chains

  4. Help from core developers of Vespa

The Vespa pilot project

Given the need for content discovery technology and the promising characteristics of Vespa, we started a pilot project with a team of software engineers, SREs and data scientists, with the following goals:

  1. Learn about Vespa through hands-on development
  2. Create a realistic proof of concept using Vespa in a Zedge app
  3. Get initial answers to key questions about Vespa, i.e. enough to decide whether to go for it fully:

     - Which of today’s API services can it simplify and replace?
     - What are the (cloud) production costs with Vespa at Zedge’s scale? (OPEX)
     - What will maintenance and development look like with Vespa? (future CAPEX)
     - Which new (innovation) opportunities does Vespa open up?

The pilot project was a success: we developed a good proof of concept with Vespa in one of our Android apps internally, and decided to start a project to move all recommender and search serving to Vespa. Our impression after the pilot was that the main benefit was making it easier to maintain and develop search/recommender systems, in particular by reducing the amount of code and the complexity of processing jobs.

Autosuggest for search with Vespa

Since autosuggest (for search) requires both low latency and high throughput, we decided it was a good candidate to take to production with Vespa first. Configuration-wise it was similar to regular search (from the pilot), but snippet generation (document summary), which requires access to the document store, was superfluous for autosuggest.

A good approach for autosuggest was to:

  1. Make all document fields used for autosuggest searchable as (in-memory) attributes:

     - https://docs.vespa.ai/en/attributes.html
     - https://docs.vespa.ai/en/reference/search-definitions-reference.html#attribute
     - https://docs.vespa.ai/en/search-definitions.html (basics)

  2. Avoid snippet generation and use of the document store by overriding the document-summary setting in the search definition to only access attributes:

     - https://docs.vespa.ai/en/document-summaries.html
     - https://docs.vespa.ai/en/nativerank.html

[figure: autosuggest architecture]

The figure above illustrates the autosuggest architecture. When the user starts typing in the search field, we fire a query with the search prefix to the Cloudflare worker, which in the case of a cache hit returns the result (possible queries) to the client. In the case of a cache miss, the Cloudflare worker forwards the query to our Vespa instance handling autosuggest.

For the external autosuggest API we use
Cloudflare Workers
(supporting JavaScript on V8, and later perhaps multiple languages with WebAssembly)
to handle API queries from Zedge apps in front of Vespa running in Google Cloud.
This setup allows for simple close-to-user caching of autosuggest results.

Without going into details, we had several recommender and search services to adapt to Vespa. These services were adapted by writing custom Vespa searchers and, in some cases, search chains.

The main change compared to our old recommender and related content services was the degree of dynamicity and freshness of serving, i.e. with Vespa more ranking signals are calculated on the fly using Vespa’s tensor support instead of being precalculated and fed into services periodically. Another benefit of this was that the amount of computational (big data) resources and code for recommender & related content processing was heavily reduced.

Continuous Integration and Testing with Vespa

A main focus was to support testing and deployment of Vespa services with continuous integration (see figure below). We found that a combination of Jenkins (or a similar CI product or service) with Docker Compose worked nicely for testing new Vespa applications, corresponding configurations and data (samples) before deploying to the staging cluster with Vespa on Google Cloud. This way we get a realistic test setup with Docker Compose that is very close to the production environment (even at the hostname level).

[figure: continuous integration and testing setup with Jenkins and Docker Compose]

Monitoring of Vespa with Prometheus and Grafana

For monitoring we created a tool that continuously reads Vespa metrics, stores them in Prometheus (a time series database) and visualizes them with Grafana. The tool can be found at https://github.com/vespa-engine/vespa_exporter. More information about Vespa metrics and monitoring can be found in the Vespa documentation.

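To illustrate the overall pattern (this is not the vespa_exporter code itself, just a minimal sketch), a small Python exporter could periodically poll a Vespa metrics endpoint and re-expose the values for Prometheus to scrape, assuming the /state/v1/metrics API and the requests and prometheus_client libraries:

import time
import requests
from prometheus_client import Gauge, start_http_server

# Assumed Vespa metrics endpoint; the URL and exact JSON layout depend on the deployment.
VESPA_METRICS_URL = "http://localhost:8080/state/v1/metrics"

vespa_metric = Gauge("vespa_metric_average", "Average metric value scraped from Vespa", ["name"])

def scrape_once():
    payload = requests.get(VESPA_METRICS_URL, timeout=5).json()
    for metric in payload.get("metrics", {}).get("values", []):
        # Each entry is assumed to have a metric name and a dict of aggregated values.
        name = metric.get("name")
        average = metric.get("values", {}).get("average")
        if name and average is not None:
            vespa_metric.labels(name=name).set(average)

if __name__ == "__main__":
    start_http_server(9100)  # Port for Prometheus to scrape.
    while True:
        scrape_once()
        time.sleep(30)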

Conclusion

The team quickly got up to speed with Vespa thanks to its good documentation and examples, and it has been running like clockwork since we started using it for real production loads. But this was only our first step with Vespa: consolidating existing search and recommender technologies into a more homogeneous and easier-to-maintain form.

With Vespa as part of our architecture we see many possible paths for evolving our search and recommendation capabilities (e.g. machine learning based ranking such as integration with TensorFlow and ONNX).

Best regards,
Zedge Content Discovery Team

Join us in San Francisco on September 26th for a Meetup

Hi Vespa Community,

Several members from our team will be traveling to San Francisco on September 26th for a meetup and we’d love to chat with you there.

Jon Bratseth (Distinguished Architect) will present a Vespa overview and answer any questions.

To learn more and RSVP, please visit:

https://www.meetup.com/SF-Big-Analytics/events/254461052/

Hope to see you!

The Vespa Team

Sharing Vespa at the SF Big Analytics Meetup

By Jon Bratseth, Distinguished Architect, Oath

I had the wonderful opportunity to present Vespa at the
SF Big Analytics Meetup
on September 26th, hosted by Amplitude.
Several members of the Vespa team (Kim, Frode and Kristian) also attended. We all enjoyed meeting with members of the Big Analytics community to discuss how Vespa could be helpful for their companies.

Thank you to Chester Chen, T.J. Bay, and Jin Hao Wan for planning the meetup, and here’s our presentation, in case you missed it (slides are also available here):