Optimizing realtime evaluation of neural net models on Vespa

In this blog post we describe how we recently made neural network evaluation over 20 times faster on Vespa’s tensor framework.

Vespa is the open source platform for building applications that carry out scalable real-time data processing, for instance search and recommendation systems. These require significant amounts of computation over large data sets. With advances in machine learning, it is desirable to run more advanced ranking models such as large linear or logistic regression models and artificial neural networks. Because of the tight computational budget at serving time, the evaluation of such models must be done in an efficient and scalable manner.

We introduced the tensor API to help solve such problems. The tensor API allows the concise expression of general computations on many-dimensional data, while simultaneously leaving room for deep optimizations on the platform side.  What we mean by this is that the tensor API is very expressive and supports a large range of model types. The general evaluation of tensors is not necessarily efficient in all cases, so in addition to continually working to increase the baseline performance, we also perform specific optimizations for important use cases. In this blog post we will describe one such important optimization we recently did, which improved neural network evaluation performance by over 20x.

To illustrate the types of optimization we can do, consider the following tensor expression representing a dot product between vectors v1 and v2:

reduce(join(v1, v2, f(x, y)(x * y)), sum)

The dot product is calculated by multiplying the vectors together element-wise using the join operation,
then summing the resulting elements using the reduce operation.
The result is a single scalar. A naive implementation would first calculate the join and introduce a temporary tensor before the reduce sums up the cells to a single scalar. Particularly for large tensors with many dimensions, such a temporary tensor can be large and require significant memory allocations. This is obviously not the most efficient path to calculate the resulting tensor.  A general improvement would be to avoid the temporary tensor and reduce to the single scalar directly as the tensors are iterated through.
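
As a rough illustration of the difference (a plain-Java sketch, not Vespa's actual tensor implementation), compare a naive two-pass evaluation that materializes the temporary tensor with a fused single pass:

public final class DotProductSketch {

    // Naive: join(v1, v2, multiply) materializes a temporary tensor,
    // then reduce sums its cells in a second pass.
    static double naive(double[] v1, double[] v2) {
        double[] temporary = new double[v1.length];
        for (int i = 0; i < v1.length; i++)
            temporary[i] = v1[i] * v2[i];          // join
        double sum = 0.0;
        for (double cell : temporary)
            sum += cell;                           // reduce
        return sum;
    }

    // Fused: reduce to the scalar directly while iterating, no temporary allocation.
    static double fused(double[] v1, double[] v2) {
        double sum = 0.0;
        for (int i = 0; i < v1.length; i++)
            sum += v1[i] * v2[i];
        return sum;
    }
}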

In Vespa, when ranking expressions are compiled, the abstract syntax tree (AST) is analyzed for such optimizations. When known cases are recognized, the most efficient implementation is selected. In the above example, assuming the vectors are dense and share dimensions, Vespa has optimized hardware-accelerated code for doing dot products on vectors. For sparse vectors, Vespa falls back to an implementation for weighted sets which builds hash tables for efficient lookups. This method allows recognition of both large and small optimizations, from simple dot products to specialized implementations for more advanced ranking models. Vespa currently has a few optimizations implemented, and we are adding more as important use cases arise.

We recently set out to improve the performance of evaluating simple neural networks, a case quite similar to the one presented in the previous blog post. The ranking expression to optimize was:

   macro hidden_layer() {
       expression: elu(xw_plus_b(nn_input, constant(W_fc1), constant(b_fc1), x))
   }
   macro final_layer() {
       expression: xw_plus_b(hidden_layer, constant(W_fc2), constant(b_fc2), hidden)
   }
   first-phase {
       expression: final_layer
   }

This represents a simple two-layer neural network.
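
For reference, the computation these macros express is an ordinary dense forward pass: xw_plus_b is a vector-matrix product plus a bias vector, and elu is the exponential linear unit activation. A plain-Java sketch of the equivalent math (illustrative only, not how Vespa evaluates it):

public final class TwoLayerNetworkSketch {

    // y_j = b_j + sum_i x_i * W_ij, i.e. xw_plus_b(x, W, b).
    static double[] xwPlusB(double[] x, double[][] W, double[] b) {
        double[] y = new double[b.length];
        for (int j = 0; j < b.length; j++) {
            double sum = b[j];
            for (int i = 0; i < x.length; i++)
                sum += x[i] * W[i][j];
            y[j] = sum;
        }
        return y;
    }

    // Exponential linear unit, applied element-wise.
    static double[] elu(double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++)
            y[i] = x[i] >= 0 ? x[i] : Math.exp(x[i]) - 1;
        return y;
    }

    // final_layer = xw_plus_b(elu(xw_plus_b(input, W_fc1, b_fc1)), W_fc2, b_fc2)
    static double[] forward(double[] input, double[][] W1, double[] b1,
                            double[][] W2, double[] b2) {
        double[] hidden = elu(xwPlusB(input, W1, b1));
        return xwPlusB(hidden, W2, b2);
    }
}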

Whenever a new version of Vespa is built, a large suite of integration and performance tests is run. When we want to optimize a specific use case, we first create a performance test to set a baseline. With the performance tests we get both historical graphs and detailed profiling information and performance statistics sampled from the system under load. This allows us to identify and optimize any bottlenecks. Also, it adds a bit of gamification to the process.

The graph below shows the performance of a test where 10 000 random documents are ranked according to the evaluation of a simple two-layer neural network:

[Figure: end-to-end latency (ms) per build while ranking 10 000 documents with the two-layer neural network]

Here, the x-axis represents builds, and the y-axis is the end-to-end latency as measured from a machine firing off queries to a server running the test on Vespa. As can be seen, over the course of optimization the latency was reduced from 150-160 ms to 7 ms, an impressive 20x end-to-end latency improvement.

When a query is received by Vespa, it is first processed in the stateless container. This is usually where applications would process the query, possibly enriching it with additional information. Vespa performs some default processing and query transformation here as well. For this test, no specific handling was done beyond this default handling. After initial processing, the query is dispatched to each node in the stateful content layer. For this test, only a single node is used in the content layer, but applications would typically have multiple. The query is processed in parallel on each node utilizing multiple cores, and the ranking expression gets executed once for each document that matches the query. For this test with 10 000 documents, the ranking expression and thus the neural network gets evaluated a total of 10 000 times before the top N documents are returned to the container layer.

The following steps were taken to optimize this expression, with each step visible as a step in the graph above:

  1. Recognize join with multiplication as part of an inner product.
  2. Optimize for bias addition.
  3. Optimize vector concatenation (which was part of the input to the neural network).
  4. Replace appropriate sub-expressions with the dense vector-matrix product.

The final step in particular gave the biggest percentage-wise performance boost. In short, the solution was to recognize the vector-matrix multiplication done in the neural network layer and replace it with specialized code that invokes the existing hardware-accelerated dot product code. In the expression above, the operation xw_plus_b is translated into a reduce of a multiplicative join followed by an additive join; this three-operation pattern is what gets recognized and performed in a single step.
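
Conceptually, this amounts to computing one dot product per output cell over a contiguous column of the weight matrix, so that the existing accelerated dot-product routine can be reused. A simplified sketch (not the actual Vespa code):

public final class VectorMatrixSketch {

    // Stand-in for the existing hardware-accelerated dot product kernel.
    static double acceleratedDotProduct(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += a[i] * b[i];
        return sum;
    }

    // x (length n) times W, with W laid out as m contiguous columns of length n,
    // computed as one dot product per output cell.
    static double[] multiply(double[] x, double[][] columnsOfW) {
        double[] result = new double[columnsOfW.length];
        for (int j = 0; j < columnsOfW.length; j++)
            result[j] = acceleratedDotProduct(x, columnsOfW[j]);
        return result;
    }
}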

This strategy of optimizing specific use cases allows for more rapid application development for users of Vespa. Consider the case where some exotic model needs to be run on Vespa. Without the generic tensor API, users would have to implement their own custom rank features or wait for the Vespa core developers to implement them. In contrast, with the tensor API, teams can continue their development without external dependencies on the Vespa team. If necessary, the Vespa team can in parallel implement the optimizations needed to meet performance requirements, as we did in this case with neural networks.

Scaling TensorFlow model evaluation with Vespa

In this blog post we’ll explain how to use Vespa to evaluate TensorFlow models over arbitrarily many data points while keeping total latency constant. We provide benchmark data from our performance lab where we compare evaluation using TensorFlow serving with evaluating TensorFlow models in Vespa.

We recently introduced a new feature that enables direct import of TensorFlow models into Vespa for use at serving time. As mentioned in a previous blog post, our approach to support TensorFlow is to extract the computational graph and parameters of the TensorFlow model and convert it to Vespa’s tensor primitives. We chose this approach over attempting to integrate our backend with the TensorFlow runtime. There were a few reasons for this. One was that we would like to support other frameworks than TensorFlow. For instance, our next target is to support ONNX. Another was that we would like to avoid the inevitable overhead of such an integration, both on performance and code maintenance. Of course, this means a lot of optimization work on our side to make this as efficient as possible, but we do believe it is a better long term solution.

Naturally, we thought it would be interesting to set up some sort of performance comparison between Vespa and TensorFlow for cases that use a machine learning ranking model.

Before we get to that however, it is worth noting that Vespa and TensorFlow serving have an important conceptual difference. With TensorFlow you are typically interested in evaluating a model for a single data point, be that an image for an image classifier, or a sentence for a semantic representation, etc. The use case for Vespa is when you need to evaluate the model over many data points. Examples are finding the best document given a text, or images similar to a given image, or computing a stream of recommendations for a user.

So, let’s explore this by setting up a typical search application in Vespa. We’ve based the application in this post on the Vespa
blog recommendation tutorial part 3.
In this application we’ve trained a collaborative filtering model which computes an interest vector for each existing user (which we refer to as the user profile) and a content vector for each blog post. In collaborative filtering these vectors are commonly referred to as latent factors. The application takes a user id as the query, retrieves the corresponding user profile, and searches for the blog posts that best match the user profile. The match is computed by a simple dot-product between the latent factor vectors. This is used as the first phase ranking. We’ve chosen vectors of length 128.

In addition, we’ve trained a neural network in TensorFlow to serve as the second-phase ranking. The user vector and blog post vector are concatenated and represent the input (of size 256) to the neural network. The network is fully connected with 2 hidden layers of size 512 and 128 respectively, and the network has a single output value representing the probability that the user would like the blog post.

In the following we set up two cases we would like to compare. The first is where the imported neural network is evaluated on the content node using Vespa’s native tensors. In the other we run TensorFlow directly on the stateless container node in the Vespa 2-tier architecture. In this case, the additional data required to evaluate the TensorFlow model must be passed back from the content node(s) to the container node. We use Vespa’s fbench utility to stress the system under fairly heavy load.

In this first test, we set up the system on a single host. This means the container and content nodes are running on the same host. We set up fbench so it uses 64 clients in parallel to query this system as fast as possible. 1000 documents per query are evaluated in the first phase and the top 200 documents are evaluated in the second phase. In the following, latency is measured in ms at the 95th percentile and QPS is the actual query rate in queries per second:

  • Baseline: 19.68 ms / 3251.80 QPS
  • Baseline with additional data: 24.20 ms / 2644.74 QPS
  • Vespa ranking: 42.8 ms / 1495.02 QPS
  • TensorFlow batch ranking: 42.67 ms / 1499.80 QPS
  • TensorFlow single ranking: 103.23 ms / 619.97 QPS

Some explanation is in order. The baseline here is the first phase ranking only without returning the additional data required for full ranking. The baseline with additional data is the same but returns the data required for ranking. Vespa ranking evaluates the model on the content backend. Both TensorFlow tests evaluate the model after content has been sent to the container. The difference is that batch ranking evaluates the model in one pass by batching the 200 documents together in a larger matrix, while single evaluates the model once per document, i.e. 200 evaluations. The reason why we test this is that Vespa evaluates the model once per document to be able to evaluate during matching, so in terms of efficiency this is a fairer comparison.
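
To make the batch/single distinction concrete, here is a minimal sketch (the model is just a stand-in function here, not the TensorFlow or Vespa API):

import java.util.function.Function;

public final class BatchVsSingleSketch {

    // Batch: stack the 200 candidate inputs into one matrix and evaluate the model once.
    static double[] batchRank(double[][] inputs, Function<double[][], double[]> model) {
        return model.apply(inputs);
    }

    // Single: evaluate the model once per document, as Vespa does during matching.
    static double[] singleRank(double[][] inputs, Function<double[][], double[]> model) {
        double[] scores = new double[inputs.length];
        for (int i = 0; i < inputs.length; i++)
            scores[i] = model.apply(new double[][] { inputs[i] })[0];
        return scores;
    }
}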

We see in the numbers above for this application that Vespa ranking and TensorFlow batch ranking achieve similar performance. This means that the gains of ranking batch-wise are offset by the cost of transferring data to TensorFlow. This isn’t entirely a fair comparison however, as the model evaluation architecture of Vespa and TensorFlow differ significantly. For instance, we measure that TensorFlow has a much lower rate of cache misses. One reason is that batch-ranking necessitates a more contiguous data layout. In contrast, relevant document data can be spread out over the entire available memory on the Vespa content nodes.

Another significant reason is that Vespa currently uses double floating point precision in ranking and in tensors. In the above TensorFlow model we have used floats, resulting in half the required memory bandwidth. We are considering making the floating point precision in Vespa configurable to improve evaluation speed for cases where full precision is not necessary, such as in most machine learned models.

So we still have some work to do in optimizing our tensor evaluation pipeline, but we are pleased with our results so far. Now, the performance of the model evaluation itself is only a part of the system-wide performance. In order to rank with TensorFlow, we need to move data to the host running TensorFlow. This is not free, so let’s delve a bit deeper into this cost.

The locality of data in relation to where the ranking computation takes place is an important aspect and indeed a core design point of Vespa. If your data is too large to fit on a single machine, or you want to evaluate your model on more data points faster than is possible on a single machine, you need to split your data over multiple nodes. Let’s assume that documents are distributed randomly across all content nodes, which is a very reasonable thing to do. Now, when you need to find the globally top-N documents for a given query, you first need to find the set of candidate documents that match the query. In general, if ranking is done on some node other than where the content is, all the data required for the computation obviously needs to be transferred there. Usually, the candidate set can be large, so this incurs a significant cost in network activity, particularly as the number of content nodes increases. This approach can become infeasible quite quickly.

This is why a core design aspect of Vespa is to evaluate models where the content is stored.

[Figure: evaluating the model where the content is stored, versus transferring candidate documents to an external ranker]

This is illustrated in the figure above. The problem of transferring data for ranking is compounded as the number of content nodes increases, because to find the global top-N documents, the top-K documents of each content node need to be passed to the external ranker. This means that, if we have C content nodes, we need to transfer C*K documents over the network. This runs into hard network limits as the number of documents and the data size of each document increase.
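
As a back-of-the-envelope example (all numbers are hypothetical and only illustrate how the cost scales): with C = 3 content nodes, K = 200 documents per node and a 256-dimensional float vector per document, each query moves C*K = 600 documents, roughly 600 KB, to the external ranker, and this grows linearly with both C and the per-document payload:

public final class TransferCostSketch {
    public static void main(String[] args) {
        int contentNodes = 3;     // C (hypothetical)
        int topKPerNode  = 200;   // K documents returned per content node
        int vectorLength = 256;   // e.g. a concatenated latent-factor vector
        int bytesPerCell = 4;     // float precision

        long documents = (long) contentNodes * topKPerNode;            // C*K
        long bytesPerQuery = documents * vectorLength * bytesPerCell;

        System.out.printf("documents per query: %d, payload: %.1f KB%n",
                documents, bytesPerQuery / 1024.0);
        // Multiply by the query rate to see how quickly this hits network limits.
    }
}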

Let’s see the effect of this when we change the setup of the same application to run on three content nodes and a single stateless container which runs TensorFlow. In the following graph we plot the 95th percentile latency as we increase the number of parallel requests (clients) from 1 to 30:

[Figure: 95th percentile latency as the number of clients increases from 1 to 30]

Here we see that with low traffic, TensorFlow and Vespa are comparable in terms of latency. When we increase the load however, the cost of transmitting the data is the driver for the increase in latency for TensorFlow, as seen in the red line in the graph. The differences between batch and single mode TensorFlow evaluation become smaller as the system as a whole becomes largely network-bound. In contrast, the Vespa application scales much better.

Now, as we increase traffic even further, will the Vespa solution likewise become network-bound? In the following graph we plot the sustained requests per second as we increase clients to 200:

[Figure: sustained queries per second as the number of clients increases to 200]

Vespa ranking is unable to sustain the same QPS as just transmitting the data (the blue line), which is a hint that the system has become CPU-bound on the evaluation of the model on Vespa. While Vespa can sustain around 3500 QPS, the TensorFlow solution maxes out at 350 QPS, which is reached quite early as we increase traffic. As the system is unable to transmit data fast enough, the latency naturally has to increase, which is the cause of the linearity in the latency graph above. At 200 clients the average latency of the TensorFlow solution is around 600 ms, while Vespa is around 60 ms.

So, the obvious key takeaway here is that from a scalability point of view it is beneficial to avoid sending data around for evaluation. That is both a key design point of Vespa and the reason why we implemented TensorFlow support in the first place. Running the models where the content is stored allows for better utilization of resources, but perhaps the more interesting aspect is the ability to run more complex or deeper models while still being able to scale the system.

Accelerating stateless model evaluation on Vespa

A central architectural feature of Vespa.ai is the division
of work between the stateless container cluster and the content cluster.

Most computation, such as evaluating machine-learned models, happens in
the content cluster. However, it has become increasingly important to
efficiently evaluate models in the container cluster as well, to
process or transform documents or queries before storage or execution.
One prominent example is to generate a vector representation of natural
language text for queries and documents for nearest neighbor retrieval.

We have recently implemented accelerated model evaluation using ONNX Runtime in
the stateless cluster, which opens up new usage areas for Vespa.

Introduction

At Vespa.ai we differentiate between stateful and stateless machine-learned
model evaluation. Stateless model evaluation is what one usually thinks about
when serving machine-learned models in production. For instance, one might have
a stand-alone model server that is called from somewhere in a serving stack.
The result of evaluating a model there only depends upon its input.

In contrast, stateful model serving combines input with stored or persisted
data. This poses some additional challenges. One is that models typically need
to be evaluated many times per query, once per data point. This has been a
focus area of Vespa.ai for quite some time, and we have previously written about
how we accelerate stateful model
evaluation
in Vespa.ai using ONNX Runtime.

However, stateless model evaluation does have its place in Vespa.ai as well.
For instance, transforming query input or document content using Transformer
models. Or finding a vector representation for an image for image similarity
search. Or translating text to another language. The list goes on.

Vespa.ai has actually had stateless model
evaluation for some
time, but we’ve recently added acceleration of ONNX models using ONNX
Runtime. This makes this feature
much more powerful and opens up some new use cases for Vespa.ai. In this
post, we’ll take a look at some capabilities this enables:

  • The automatically generated REST API for model serving.
  • Creating lightweight request handlers for serving models with some custom
    code without the need for content nodes.
  • Adding model evaluation to searchers for query processing and enrichment.
  • Adding model evaluation to document processors for transforming content
    before ingestion.
  • Batch-processing results from the ranking back-end for additional ranking
    models.

We’ll start with a quick overview of where machine-learned models can be
evaluated in Vespa.ai.

Vespa.ai applications: container and content nodes

Vespa.ai is a distributed application
consisting of various types of services on multiple nodes. A Vespa.ai
application is fully defined in an application package. This single unit
contains everything needed to set up an application, including all
configuration, custom components, schemas, and machine-learned models. When the
application package is deployed, the admin cluster takes care of configuring all
the services across all the system’s nodes, including distributing all
models to the nodes that need them.

[Figure: Vespa architecture]

The container nodes process queries or documents before passing them on to the
content nodes. So, when a document is fed to Vespa, content can be transformed
or added before being stored. Likewise, queries can be transformed or enriched
in various ways before being sent for further processing.

The content nodes are responsible for persisting data. They also do most of the
required computation when responding to queries. As that is where the data is,
this avoids the cost of transferring data across the network. Query data is
combined with document data to perform this computation in various ways.

We thus differentiate between stateless and stateful machine-learned model
evaluation. Stateless model evaluation happens on the container nodes and is
characterized by a single model evaluation per query or document. Stateful
model evaluation
happens on the content nodes, and the model is typically
evaluated a number of times using data from both the query and the document.

The exact configuration of the services on the nodes is specified in
services.xml. Here the
number of container and content nodes, and their capabilities, are fully
configured. Indeed, a Vespa.ai application can be set up without any content
nodes at all, running purely stateless container code, including serving
machine-learned models.

This makes it easy to deploy applications. It offers a lot of flexibility
in combining many types of models and computations out of the box without any
plugins or extensions. In the next section, we’ll see how to set up stateless
model evaluation.

Stateless model evaluation

So, by stateless model
evaluation we mean
machine-learned models that are evaluated on Vespa container nodes. This is
enabled by simply adding the model-evaluation tag in services.xml:

...
<container>
    ...
    <model-evaluation/>
    ...
</container>
...

When this is specified, Vespa scans through the models directory in the
application package to find any importable machine-learned models. Currently,
supported models are TensorFlow, ONNX, XGBoost, LightGBM or Vespa’s own
stateless
models.

There are two effects of this. The first is that a REST API for model discovery
and evaluation is automatically enabled. The other is that custom
components can have
a special ModelsEvaluator object dependency injected into their constructors.

[Figure: stateless model evaluation]

In the following we’ll take a look at some of the usages of these, and use the
model-evaluation sample
app
for demonstration.

REST API

The automatically added REST API provides an API for model discovery and
evaluation. This is great for using Vespa as a standalone model server, or
making models available for other parts of the application stack.

To get a list of imported models, call http://host:port/model-evaluation/v1.
For instance:

$ curl -s 'http://localhost:8080/model-evaluation/v1/'
{
    "pairwise_ranker": "http://localhost:8080/model-evaluation/v1/pairwise_ranker",
    "transformer": "http://localhost:8080/model-evaluation/v1/transformer"
}

This application has two models, the transformer model and the
pairwise_ranker model. We can inspect a model to see expected inputs and
outputs:

$ curl -s 'http://localhost:8080/model-evaluation/v1/transformer/output'
{
    "arguments": [
        {
            "name": "input",
            "type": "tensor(d0[],d1[])"
        },
        {
            "name": "onnxModel(transformer).output",
            "type": "tensor<float>(d0[],d1[],d2[16])"
        }
    ],
    "eval": "http://localhost:8080/model-evaluation/v1/transformer/output/eval",
    "function": "output",
    "info": "http://localhost:8080/model-evaluation/v1/transformer/output",
    "model": "transformer"
}

All model inputs and outputs are Vespa tensors. See the tensor user
guide for more information.

This model has one input, with tensor type tensor(d0[],d1[]). This tensor has
two dimensions: d0 is typically a batch dimension, and d1 represents, for
this model, a sequence of tokens. The output, of type tensor<float>(d0[],d1[],d2[16]),
adds a dimension d2 which represents the embedding dimension. So the output is
an embedding representation for each token of the input.
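
As an illustration (the token values below are made up), a single tokenized sequence of length 5 could be passed as a concrete tensor of this type, and the model would then return one 16-dimensional embedding per token:

import com.yahoo.tensor.Tensor;

public final class TransformerInputExample {
    public static void main(String[] args) {
        // Hypothetical token ids for one input sequence of length 5 (batch size 1).
        Tensor input = Tensor.from("tensor(d0[1],d1[5]):[[101, 2032, 7, 564, 102]]");
        System.out.println(input);
        // Evaluating the transformer on this input yields a tensor of type
        // tensor<float>(d0[1],d1[5],d2[16]): one 16-dimensional embedding per token.
    }
}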

By calling /model-evaluation/v1/transformer/eval and passing a URL-encoded input
parameter, Vespa evaluates the model and returns the result as a JSON encoded
tensor.

Please refer to the sample
application
for a runnable example.

Request handlers

The REST API takes exactly the same input as the models it serves. In some
cases one might want to pre-process the input before providing it to the model.
A common example is to tokenize natural language text before passing the token
sequence to a language model such as BERT.

Vespa provides request
handlers
which let applications implement arbitrary HTTP APIs. With custom request
handlers, arbitrary code can be run both before and after model evaluation.

When the model-evaluation tag has been supplied, Vespa makes a special
ModelsEvaluator object available which can be injected into a component
(such as a request handler):

public class MyHandler extends ThreadedHttpRequestHandler {

    private final ModelsEvaluator modelsEvaluator;

    public MyHandler(ModelsEvaluator modelsEvaluator, Context context) {
        super(context);
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public HttpResponse handle(HttpRequest request) {

        // Get the input
        String inputString = request.getProperty("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate the model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor result = evaluator.bind("input", input).evaluate();

        // Perform any post-processing to the tensor
        // ...

        // Construct a response from the result tensor (omitted here)
        HttpResponse response = ...

        return response;
    }
}

A full example can be seen in the MyHandler class in the sample
application
and its unit
test.

As mentioned, arbitrary code can be run here. Pragmatically, it is often more
convenient to put the processing pipeline in the model itself. While not always
possible, this helps protect against divergence between the data processing
pipeline in training and in production.

Document processors

The REST API and request handler can work with a purely stateless application,
such as a model server. However, it is much more common for Vespa.ai applications to
have content. As such, it is fairly common to process incoming documents before
storing them. Vespa provides a chain of document
processors
for this.

Applications can implement custom document processors, and add them to the
processing chain. In the context of model evaluation, a typical task is to use a
machine-learned model to create a vector representation for a natural language
text. The text is first tokenized, then run through a language model such as
BERT to generate a vector representation which is then stored. Such a vector
representation can be for instance used in nearest neighbor
search. Other examples
are sentiment analysis, creating representations of images, object detection,
translating text, and so on.

The ModelsEvaluator can be injected into your component as already seen:

public class MyDocumentProcessor extends DocumentProcessor {

    private final ModelsEvaluator modelsEvaluator;

    public MyDocumentProcessor(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (op instanceof DocumentPut) {
                DocumentPut put = (DocumentPut) op;
                Document document = put.getDocument();

                // Get tokens
                Tensor tokens = (Tensor) document.getFieldValue("tokens").getWrappedValue();

                // Perform any pre-processing to the tensor
                // ...

                // Evaluate the model
                FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
                Tensor result = evaluator.bind("input", tokens).evaluate();

                // Reshape and extract the embedding vector (not shown)
                Tensor embedding = ...

                // Set embedding in document
                document.setFieldValue("embedding", new TensorFieldValue(embedding));
            }
        }
        return Progress.DONE;
    }
}

Notice the code looks a lot like the previous example for the request handler.
The document processor receives a pre-constructed ModelsEvaluator from Vespa
which contains the transformer model. This code receives a tensor contained
in the tokens field, runs that through the transformer model, and puts the
resulting embedding into a new field. This is then stored along with the
document.

Again, a full example can be seen in the MyDocumentProcessor class in the sample
application
and its unit
test.

Searchers: query processing

Similar to document processing, queries are processed along a chain of
searchers.
Vespa provides a default chain of searchers for various tasks, and applications
can provide additional custom searchers as well. In the context of model
evaluation, the use cases are similar to document processing: a typical task
for text search is to generate vector representations for nearest neighbor search.

Again, the ModelsEvaluator can be injected into your component:

public class MySearcher extends Searcher {

    private final ModelsEvaluator modelsEvaluator;

    public MySearcher(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Result search(Query query, Execution execution) {

        // Get the query input
        String inputString = query.properties().getString("input");

        // Convert to a Vespa tensor
        TensorType expectedType = TensorType.fromSpec("tensor<int8>(x[])");
        Tensor input = Tensor.from(expectedType, inputString);

        // Perform any pre-processing to the tensor
        // ...

        // Evaluate model
        FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("transformer");
        Tensor output = evaluator.bind("input", input).evaluate();

        // Reshape and extract the embedding vector (not shown)
        Tensor embedding = ...

        // Add this tensor to query
        query.getRanking().getFeatures().put("query(embedding)", embedding);

        // Continue processing
        return execution.search(query);
    }
}

As before, a full example can be seen in the MySearcher class in the sample
application
and its unit
test.

Searchers: result post-processing

Searchers don’t just process queries before being sent to the back-end: they
are just as useful in post-processing the results from the back-end. A typical
example is to de-duplicate similar results in a search application. Another is
to apply business rules to reorder the results, especially if coming from
various back-ends. In the context of machine learning, one example is to
de-tokenize tokens back to a natural language text.

Post-processing is similar to the example above, but the search is executed
first, and tensor fields from the documents are extracted and used as input to
the models. In the sample application we have a model that compares all results
with each other to perform another phase of ranking. See the MyPostProcessing
searcher
for details.
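
A hedged sketch of what such a post-processing searcher can look like (the field name and the commented-out model call are illustrative, not the sample application's exact code):

public class MyPostProcessingSearcher extends Searcher {

    private final ModelsEvaluator modelsEvaluator;

    public MyPostProcessingSearcher(ModelsEvaluator modelsEvaluator) {
        this.modelsEvaluator = modelsEvaluator;
    }

    @Override
    public Result search(Query query, Execution execution) {
        // Execute the query first: post-processing operates on the returned hits.
        Result result = execution.search(query);
        execution.fill(result);  // ensure document summary fields are available

        for (Hit hit : result.hits()) {
            // "embedding" is an illustrative tensor field stored with each document.
            Tensor embedding = (Tensor) hit.getField("embedding");

            // Use the tensors from the hits as input to an additional model,
            // for instance the pairwise_ranker, and adjust hit relevance:
            // FunctionEvaluator evaluator = modelsEvaluator.evaluatorOf("pairwise_ranker");
            // ...
        }
        return result;
    }
}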

Conclusion

In Vespa.ai, most of the computation required for executing queries has
traditionally been run in the content cluster. This makes sense as it avoids
transmitting data across the network to external systems for evaluation.

IR evaluation metrics with uncertainty estimates

[{'fields': {'doc_id': '7407715',
             'documentid': 'id:PassageRanking:PassageRanking::7407715',
             'sddocname': 'PassageRanking',
             'summaryfeatures': {'bm25(text)': 11.979235042476953,
                                 'vespa.summaryFeatures.cached': 0.0},
             'text': 'The Sky is the Limit also known as TSITL is a global '
                     'effort designed to influence, motivate and inspire '
                     'people all over the world to achieve their goals and '
                     'dreams in life. TSITL’s collaborative community on '
                     'social media provides you with a vast archive of '
                     'motivational pictures/quotes/videos.'},
  'id': 'id:PassageRanking:PassageRanking::7407715',
  'relevance': 11.979235042476953,
  'source': 'PassageRanking_content'},
 {'fields': {'doc_id': '84721',
             'documentid': 'id:PassageRanking:PassageRanking::84721',
             'sddocname': 'PassageRanking',
             'summaryfeatures': {'bm25(text)': 11.310323797415357,
                                 'vespa.summaryFeatures.cached': 0.0},
             'text': 'Sky Customer Service 0870 280 2564. Use the Sky contact '
                     'number to get in contact with the Sky customer services '
                     'team to speak to a representative about your Sky TV, Sky '
                     'Internet or Sky telephone services. The Sky customer '
                     'Services team is operational between 8:30am and 11:30pm '
                     'seven days a week.'},
  'id': 'id:PassageRanking:PassageRanking::84721',
  'relevance': 11.310323797415357,
  'source': 'PassageRanking_content'}]