Blog recommendation with neural network models

Update 2021-05-20:
This blog post refers to Vespa sample applications that do not exist anymore.
Please refer to the
News search and recommendation tutorial
for an updated version of text and sample applications.

Introduction

The main objective of this post is to show how to deploy neural network models in Vespa using our Tensor Framework. In fact, any model that can be represented by a series of Tensor operations can be deployed in Vespa; neural networks are just a popular example. In addition, we will introduce the multi-phase ranking support in Vespa, which can be used to run more expensive models in a later phase over a reduced number of documents returned by previous phases. This feature allows us to run models that would be prohibitively expensive if we had to evaluate them at query time across all the documents indexed in Vespa.

Model Training

In this section, we will define a neural network model, show how we created a suitable dataset to train the model and train the model using TensorFlow.

The neural network model

In the previous blog post, we computed latent factors for each user and each document and then used a dot-product between user and document vectors to rank the documents available for recommendation to a specific user. In this tutorial we will train a 2-layer fully connected neural network model that takes the same user (u) and document (d) latent factors as input and outputs the probability of that specific user liking the document.

More technically, our previous rank function r was given by

r(u,d)=u∗d

while in this tutorial it will be given by

r(u,d,θ)=f(u,d,θ)

where f represents the neural network model described below and θ is the set of neural network parameter values that we need to learn from training data.

The specific form of the neural network model used here is

h1 = ReLU(x×W1+b1)
p = sigmoid(h1×W2+b2)

where x=[u,d] is the concatenation of the user and document latent factors, ReLU is the rectifier activation function, sigmoid represents the sigmoid function, and p is the output of the model, which in this case can be interpreted as the probability of the user u liking a blog post d. The parameters of the model are represented by θ=(W1,W2,b1,b2).
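
To make the model concrete, here is a minimal NumPy sketch of this forward pass. The sizes (10-dimensional latent factors, giving a 20-dimensional input, and a 40-unit hidden layer) are taken from the tensor types used later in this tutorial, and the random parameters stand in for values learned with TensorFlow.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(u, d, W1, b1, W2, b2):
    """Probability of the user with latent factor u liking the document with latent factor d."""
    x = np.concatenate([u, d])      # x = [u, d]
    h1 = relu(x @ W1 + b1)          # hidden layer
    return sigmoid(h1 @ W2 + b2)    # p

rng = np.random.default_rng(0)
u, d = rng.normal(size=10), rng.normal(size=10)
W1, b1 = rng.normal(size=(20, 40)), np.zeros(40)
W2, b2 = rng.normal(size=(40, 1)), np.zeros(1)
print(predict(u, d, W1, b1, W2, b2))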

Training data

For the training dataset, we will start with the (user_id, post_id) rows from the “training_set_ids” generated previously. Then, we remove every row for which there are no latent factors for the user_id or post_id contained in that row. This gives us a dataset with only positive feedback (label = 1), since each row represents one instance of a user_id liking a post_id.

In order to train our model, we also need negative feedback (label = 0). So, for each (user_id, post_id) row in the current dataset we generate N negative feedback rows by randomly sampling post_id_fake from the pool of post_id’s available in the current set, so that for each (user_id, post_id) row with label = 1 we add N (user_id, post_id_fake) rows with label = 0.
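
As an illustration, the snippet below shows one way such negative sampling could be implemented; the data layout and the choice of N are assumptions for the example, and the actual code lives in the utility scripts mentioned below.

import random

def add_negative_samples(positive_rows, n_negatives=5, seed=42):
    """positive_rows: list of (user_id, post_id) pairs, each an implicit label = 1 example.
    Returns (user_id, post_id, label) triples with sampled label = 0 rows added."""
    random.seed(seed)
    all_post_ids = sorted({post_id for _, post_id in positive_rows})
    liked = set(positive_rows)
    dataset = [(user_id, post_id, 1) for user_id, post_id in positive_rows]
    for user_id, _ in positive_rows:
        for _ in range(n_negatives):
            post_id_fake = random.choice(all_post_ids)
            if (user_id, post_id_fake) not in liked:   # skip accidental positives
                dataset.append((user_id, post_id_fake, 0))
    return dataset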

Find code to generate the dataset in the utility scripts.

Training with TensorFlow

With the training data in hand, we split it into an 80% training set and a 20% validation set and used TensorFlow to train the model. The script used can be found in the utility scripts and executed by

$ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \
                       --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \
                       --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt

The progress of your training can be visualized using TensorBoard:

$ tensorboard --logdir runs/*/summaries/

Model deployment in Vespa

Two Phase Ranking

When a query is sent to Vespa, it will scan all documents available and select the ones (possibly all) that match the query. When the set of documents matching a query is found, Vespa must decide the order of these documents. Unless explicit sorting is used, Vespa decides this order by calculating a number for each document, the rank score, and sorts the documents by this number.

The rank score can be any function that takes as arguments parameters sent with the query, document attributes defined in search definitions, and global parameters not directly linked to query or document parameters. One example of a rank score is the output of the neural network model defined in this tutorial. The model takes the latent factor u associated with a specific user_id (query parameter), the latent factor d associated with the document post_id (document attribute) and the learned model parameters (global parameters related to neither a specific query nor a specific document), and returns the probability of user u liking document d.

However, even though Vespa is designed to carry out such calculations optimally, complex expressions become expensive when they must be calculated over every one of a large set of matching documents. To relieve this, Vespa can be configured to run two ranking expressions – a smaller and less accurate one on all hits during the matching phase, and a more expensive and accurate one only on the best hits during the reranking phase. In general this allows a more optimal usage of the CPU budget by dedicating more of the total CPU towards the best candidate hits.

The reranking phase, if specified, will by default be run on the 100 best hits on each search node, after matching and before information is returned upwards to the search container. The number of hits to rerank can be turned up or down as needed. Below is a toy example showing how to configure first and second phase ranking expressions in the rank profile section of search definitions where the second phase rank expression is run on the 200 best hits from first phase on each search node.

search myapp {

    …

    rank-profile default inherits default {

        first-phase {
            expression: nativeRank + query(deservesFreshness) * freshness(timestamp)
        }

        second-phase {
            expression {
                0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +
                0.3 * attributeMatch(keywords)
            }
            rerank-count: 200
        }
    }
}

Constant Tensor files

Once the model has been trained in TensorFlow, export the model parameters (W1,W2,b1,b2) to the application folder as Tensors according to the Vespa Document JSON format.

The complete code to serialize the model parameters using the Vespa Tensor format can be found in the utility scripts, but the following code snippet shows how to serialize the hidden layer weights W1:

serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])

Note that Vespa currently requires dimension names for all the Tensor dimensions (in this case W1 is a matrix, so it has two named dimensions).

In the following section, we will use the following code in the blog_post search definition in order to be able to use the constant tensor W_hidden in our ranking expression.

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

A constant tensor is data that is not specific to a given document type. In the case above we define W_hidden to be a tensor with two dimensions (matrix), where the first dimension is named input and has size 20 and second dimension is named hidden and has size 40. The data were serialized to a JSON file located at constants/W_hidden.json relative to the application package folder.
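
For illustration, here is a minimal Python sketch of what writing such a constant tensor file could look like. The write_constant_tensor helper is hypothetical (the real code is in the utility scripts), and it assumes the cells/address/value JSON layout Vespa uses for tensor values.

import json
import os
import numpy as np

def write_constant_tensor(W, dimension_names, path):
    """Write a 2-D array as a constant tensor JSON file with named dimensions."""
    first, second = dimension_names
    cells = [{"address": {first: str(i), second: str(j)}, "value": float(W[i, j])}
             for i in range(W.shape[0]) for j in range(W.shape[1])]
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"cells": cells}, f)

# W1 has shape (20, 40), matching tensor(input[20],hidden[40])
write_constant_tensor(np.random.rand(20, 40), ("input", "hidden"), "constants/W_hidden.json")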

Vespa ranking expressions

In order to evaluate the neural network model trained with TensorFlow in the previous section, we need to translate the model structure to a Vespa ranking expression to be defined in the blog_post search definition. To honor a low-latency response, we will take advantage of the Two Phase Ranking available in Vespa and define the first phase ranking to be the same ranking function used in the previous blog post, which is a dot-product between the user and document latent factors. After the documents have been sorted by the first phase ranking function, we will rerank the top 200 documents from each search node using the second phase ranking given by the neural network model presented above.

Note that we define two ranking profiles in the search definition below. This allows us to decide which ranking profile to use at query time. We define a ranking profile named tensor, which only applies the dot-product between user and document latent factors to all matching documents, and a ranking profile named nn_tensor, which reranks the top 200 documents using the neural network model discussed in the previous section.

We will walk through each part of the blog_post search definition, see blog_post.sd.

As always, we start the search definition with the following line:

    search blog_post {

We define the document type blog_post the same way we have done in the previous tutorial.

    document blog_post {

      # Field definitions
      # Examples:

      field date_gmt type string {
          indexing: summary
      }
      field language type string {
          indexing: summary
      }

      # Remaining fields as found in previous tutorial

    }

We define a ranking profile named tensor, which ranks all the matching documents by the dot-product between the document latent factor and the user latent factor. This is the same ranking expression used in the previous tutorial, and it includes code to retrieve the user latent factor based on the user_id sent with the query to Vespa.

    # Simpler ranking profile without
    # second-phase ranking
    rank-profile tensor {
      first-phase {
          expression {
              sum(query(user_item_cf) * attribute(user_item_cf))
          }
      }
    }

Since we want to evaluate the neural network model we have trained, we need to define where to find the model parameters (W1,W2,b1,b2). See the previous section for how to write the TensorFlow model parameters to Vespa Tensor format.

    # We need to specify the type and the location
    # of the files storing tensor values for each
    # Variable in our TensorFlow model. In this case,
    # W_hidden, b_hidden, W_final, b_final

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

    constant b_hidden {
        file: constants/b_hidden.json
        type: tensor(hidden[40])
    }

    constant W_final {
        file: constants/W_final.json
        type: tensor(hidden[40], final[1])
    }

    constant b_final {
        file: constants/b_final.json
        type: tensor(final[1])
    }

Now, we specify a second rank-profile called nn_tensor, which uses the same first phase as the rank-profile tensor but reranks the top 200 documents using the neural network model as the second phase. We refer to the Tensor Reference document for more information regarding the Tensor operations used in the code below.

    # rank profile with neural network model as
    # second phase
    rank-profile nn_tensor {

        # The input to the neural network is the
        # concatenation of the document and query vectors.

        macro nn_input() {
            expression: concat(attribute(user_item_cf), query(user_item_cf), input)
        }

        # Computes the hidden layer

        macro hidden_layer() {
            expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))
        }

        # Computes the output layer

        macro final_layer() {
            expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))
        }


        # First-phase ranking:
        # Dot-product between user and document latent factors

        first-phase {
            expression: sum(query(user_item_cf) * attribute(user_item_cf))
        }

        # Second-phase ranking:
        # Neural network model based on the user and latent factors

        second-phase {
            rerank-count: 200
            expression: sum(final_layer)
        }

    }

}

Offline evaluation

We will now query Vespa and obtain 100 blog post recommendations for each user_id in our test set. Below, we query Vespa using the tensor ranking profile, which contains the simpler ranking expression involving the dot-product between user and document latent factors.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=tensor \
  -param OUTPUT=blog-job/cf-metric

We perform the same query routine below, but now using the ranking profile nn_tensor, which reranks the top 200 documents using the neural network model.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=nn_tensor \
  -param OUTPUT=blog-job/cf-metric

The tutorial_compute_metric.pig script can be found in our repo.

Comparing the recommendations obtained by those two ranking profiles and our test set, we see that by deploying a more complex and accurate model in the second phase ranking, we increased the number of relevant documents (documents read by the user) retrieved from 11948 to 12804 (more than 7% increase) and those documents retrieved appeared higher up in the list of recommendations, as shown by the expected percentile ranking metric introduced in the Vespa tutorial pt. 2 which decreased from 37.1% to

Optimizing realtime evaluation of neural net models on Vespa

In this blog post we describe how we recently made neural network evaluation over 20 times faster on Vespa’s tensor framework.

Vespa is the open source platform for building applications that carry out scalable real-time data processing, for instance search and recommendation systems. These require significant amounts of computation over large data sets. With advances in machine learning, it is desirable to run more advanced ranking models such as large linear or logistic regression models and artificial neural networks. Because of the tight computational budget at serving time, the evaluation of such models must be done in an efficient and scalable manner.

We introduced the tensor API to help solve such problems. The tensor API allows the concise expression of general computations on many-dimensional data, while simultaneously leaving room for deep optimizations on the platform side.  What we mean by this is that the tensor API is very expressive and supports a large range of model types. The general evaluation of tensors is not necessarily efficient in all cases, so in addition to continually working to increase the baseline performance, we also perform specific optimizations for important use cases. In this blog post we will describe one such important optimization we recently did, which improved neural network evaluation performance by over 20x.

To illustrate the types of optimization we can do, consider the following tensor expression representing a dot product between vectors v1 and v2:

reduce(join(v1, v2, f(x, y)(x * y)), sum)

The dot product is calculated by multiplying the vectors together using the join operation,
then summing the elements of the resulting vector using the reduce operation.
The result is a single scalar. A naive implementation would first calculate the join and introduce a temporary tensor before the reduce sums up the cells to a single scalar. Particularly for large tensors with many dimensions, such a temporary tensor can be large and require significant memory allocations. This is obviously not the most efficient way to calculate the resulting tensor. A general improvement is to avoid the temporary tensor and reduce to the single scalar directly as the tensors are iterated through.
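
To make the difference concrete, here is a small NumPy sketch contrasting the naive evaluation, which materializes the joined tensor, with a fused evaluation that reduces while iterating. This is only an illustration of the idea, not Vespa's implementation.

import numpy as np

v1 = np.random.rand(100_000)
v2 = np.random.rand(100_000)

def naive_dot(v1, v2):
    joined = v1 * v2       # join: temporary tensor with one cell per input cell
    return joined.sum()    # reduce: sum the temporary down to a scalar

def fused_dot(v1, v2):
    total = 0.0
    for a, b in zip(v1, v2):   # reduce directly while iterating, no temporary tensor
        total += a * b
    return total               # optimized code would vectorize this loop (e.g. np.dot)

assert np.isclose(naive_dot(v1, v2), fused_dot(v1, v2))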

In Vespa, when ranking expressions are compiled, the abstract syntax tree (AST) is analyzed for such optimizations. When known cases are recognized, the most efficient implementation is selected. In the above example, assuming the vectors are dense and share dimensions, Vespa has optimized hardware accelerated code for doing dot products on vectors. For sparse vectors, Vespa falls back to an implementation for weighted sets which builds hash tables for efficient lookups. This method allows recognition of both large and small optimizations, from simple dot products to specialized implementations for more advanced ranking models. Vespa currently has a few optimizations implemented, and we are adding more as important use cases arise.

We recently set out to improve the performance of evaluating simple neural networks, a case quite similar to the one presented in the previous blog post. The ranking expression to optimize was:

   macro hidden_layer() {
       expression: elu(xw_plus_b(nn_input, constant(W_fc1), constant(b_fc1), x))
   }
   macro final_layer() {
       expression: xw_plus_b(hidden_layer, constant(W_fc2), constant(b_fc2), hidden)
   }
   first-phase {
       expression: final_layer
   }

This represents a simple two-layer neural network.

Whenever a new version of Vespa is built, a large suite of integration and performance tests is run. When we want to optimize a specific use case, we first create a performance test to set a baseline. With the performance tests we get both historical graphs as well as detailed profiling information and performance statistics sampled from the system under load. This allows us to identify and optimize any bottlenecks. Also, it adds a bit of gamification to the process.

The graph below shows the performance of a test where 10 000 random documents are ranked according to the evaluation of a simple two-layer neural network:

[Figure: end-to-end latency per Vespa build over the course of the optimization work]

Here, the x-axis represents builds, and the y-axis is the end-to-end latency as measured from a machine firing off queries to a server running the test on Vespa. As can be seen, over the course of the optimization work the latency was reduced from 150-160 ms to 7 ms, an impressive 20x end-to-end latency improvement.

When a query is received by Vespa, it is first processed in the stateless container. This is usually where applications would process the query, possibly enriching it with additional information. Vespa does a bit of default work here as well, and also transforms the query a bit. For this test, no specific handling was done except this default handling. After initial processing, the query is dispatched to each node in the stateful content layer. For this test, only a single node is used in the content layer, but applications would typically have multiple. The query is processed in parallel on each node utilizing multiple cores and the ranking expression gets executed once for each document that matches the query. For this test with 10 000 documents, the ranking expression and thus the neural network gets evaluated in total 10 000 times before the top N documents are returned to the container layer.

The following steps were taken to optimize this expression, with each step visible as a step in the graph above:

  1. Recognize join with multiplication as part of an inner product.
  2. Optimize for bias addition.
  3. Optimize vector concatenation (which was part of the input to the neural network).
  4. Replace appropriate sub-expressions with the dense vector-matrix product.

It was the final step in particular which gave the biggest percentage-wise performance boost. The solution was to recognize the vector-matrix multiplication done in the neural network layer and replace it with specialized code that invokes the existing hardware accelerated dot product code. In the expression above, the operation xw_plus_b corresponds to a reduce of the multiplicative join followed by an additive join; this is what is recognized and performed in one step instead of three.

This strategy of optimizing specific use cases allows for a more rapid application development for users of Vespa. Consider the case where some exotic model needs to be run on Vespa. Without the generic tensor API users would have to implement their own custom rank features or wait for the Vespa core developers to implement them. In contrast, with the tensor API, teams can continue their development without external dependencies to the Vespa team.  If necessary, the Vespa team can in parallel implement the optimizations needed to meet performance requirements, as we did in this case with neural networks.

Vespa Product Updates, May 2019: Deploy Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements

Kristian Aune

Head of Customer Success, Vespa


In last month’s Vespa update, we mentioned Tensor updates, Query tracing and coverage. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to evolve.

For May, we’re excited to share the following feature updates with you:

Multithreaded disk index fusion

Content nodes are now able to sustain a higher feed rate by using multiple threads for disk index fusion. Read more.

Feeding improvements

Cluster-internal communications are now multithreaded out of the box, for  high throughput feeding operations. This fully utilizes a 10 Gbps network and improves utilization of high-CPU content nodes.

Ideal state optimizations

Whenever the content cluster state changes, the ideal state is calculated. This is now optimized (faster and runs less often) and state transitions like node up/down will have less impact on read and write operations. Learn more in the dynamic data distribution documentation.

Download ML models during deploy

One procedure for using/importing ML models in Vespa is to put them in the models directory of the application package. Applications where models are trained frequently in some external system can refer to the model by URL rather than including it in the application package. This use case is now documented in deploying remote models, and solves the challenge of deploying huge models.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Pretrained Transformer Language Models for Search – part 1


Photo by Jamie Street
on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multi-phase retrieval and ranking pipeline using Vespa.ai.
We also evaluate these models on the largest Information Retrieval (IR) relevance dataset, namely the MS Marco Passage ranking dataset.
Furthermore, we demonstrate how to achieve close to state-of-the-art ranking using miniature transformer models with just 22M parameters,
beating large ensemble models with billions of parameters.

Blog posts in this series:

In this first post we give an introduction to Transformers for text ranking and three different methods of applying them for ranking.
We also cover multi-phase retrieval and ranking pipelines, and introduce three different ways to efficiently retrieve
documents in a phased retrieval and ranking pipeline.

Introduction

Since BERT was first applied to search and document ranking, we at the Vespa team have been busy making it easy to use BERT or Transformer models in general, for ranking and question answering with Vespa.ai. In previous work,
we demonstrated how to use BERT as a representation model (bi-encoder), for efficient passage retrieval for question answering.
We also demonstrated how we could accelerate BERT models for production serving using distillation and quantization.

Search, or information retrieval, is going through a neural paradigm shift; some have even called it the BERT revolution.
The introduction of pre-trained language models like BERT has led to a significant advancement of the state of the art in search and document ranking.

[Table: MS MARCO Passage Ranking leaderboard before and after BERT]

The table shows how significant the advancement was when BERT was first applied to the MS MARCO Passage Ranking leaderboard. The state of the art on MS Marco passage ranking advanced by almost 30% within a week,
while improvements up until then had been incremental at best.
Compared to the baseline BM25 text ranking (default Apache Lucene 9 text scoring), applying BERT improved the ranking effectiveness by more than 100%.

The table above is from Pretrained Transformers for Text Ranking: BERT and Beyond, which is a brilliant resource
for understanding how pre-trained transformers models can be used for text ranking.
The MS MARCO Passage ranking relevancy dataset consists of about 8.8M passages, and more than 500 000 queries with at least one judged relevant document.
It is by far the largest IR dataset available in the public domain and is commonly used to evaluate ranking models.

The MS Marco passage ranking dataset queries are split into three different subsets: train, development (dev) and test (eval). The train split can be used to train a ranking model using machine learning. Once a model is built, one can test the effectiveness of the ranking model on the development and test splits. Applying the learned model on the development and test sets is called in-domain usage of the model. If the trained ranking model is applied to a different relevancy dataset, it is usually referred to as out-of-domain usage, or zero-shot. How well models trained on MS Marco query and passage pairs generalize to other domains is out of scope for this blog post, but we can sincerely recommend BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

The official evaluation metric used for the MS Marco Passage ranking leaderboard is MRR@10. The name might sound scary, but it's
a simple way to judge the effectiveness of a ranking algorithm. RR@10 is the Reciprocal Rank of the first relevant passage within the top 10 ranking positions for a given query; @k denotes how deep into the top ranked documents we look for the first relevant document.
The reciprocal rank formula is simply 1/(position of the first relevant hit).
If the judged relevant hit (as judged by a human) is ranked at position 1, the reciprocal rank score is 1. If the relevant hit is found at position 2, the reciprocal rank score is 0.5, and so on. The mean in mean reciprocal rank is simply the mean RR over all queries in the dev or test split, which gives us an overall score.
The MS Marco passage ranking development (dev) set consists of 6,980 queries.
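
As a small illustration, here is a minimal Python sketch of MRR@10; the data structures (a ranking per query and a set of judged relevant passages per query) are assumptions for the example.

def mrr_at_k(ranked_results, qrels, k=10):
    """ranked_results: dict query_id -> list of passage ids, best first.
    qrels: dict query_id -> set of judged relevant passage ids."""
    total = 0.0
    for query_id, ranking in ranked_results.items():
        rr = 0.0
        for position, passage_id in enumerate(ranking[:k], start=1):
            if passage_id in qrels.get(query_id, set()):
                rr = 1.0 / position   # reciprocal rank of the first relevant hit
                break
        total += rr
    return total / len(ranked_results)

# Relevant hit at position 2 for q1 and position 1 for q2 -> (0.5 + 1.0) / 2 = 0.75
print(mrr_at_k({"q1": ["p9", "p3"], "q2": ["p7"]}, {"q1": {"p3"}, "q2": {"p7"}}))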

The query relevance judgment list for the development (dev) set is in the public domain, and researchers can compare methods based on it.
The judgments for the eval query set are not in the public domain. Researchers, or industry practitioners, need to submit their ranking of the queries in the test set to have the MRR@10 evaluated and the ranking run listed on the leaderboard. Note that the MS Marco ranking leaderboards are not run-time constrained, so many of the submissions take days of computation to produce ranked lists for the queries in the dev and eval splits.

There is unfortunately a lot of confusion in the industry on how BERT can successfully be used for text ranking.
The IR research field has moved so fast since the release of BERT in late 2018 that the textbooks on text ranking are already outdated.
Since there is no textbook, industry practitioners need to look at how the research community is applying BERT or Transformer models for ranking.
BERT is a pre-trained language model, and to use it effectively for document or passage ranking, it needs to be fine-tuned for retrieval or ranking.
For examples of not so great ways to use BERT for ranking, see How not to use BERT for Document ranking.

As demonstrated in Pretrained Transformers for Text Ranking: BERT and Beyond, pre-trained language models of the Transformer family achieve best accuracy for text ranking and question answering tasks when used as an interaction model with all-to-all cross-attention between the query and document.
Generally, there are 3 ways to use Transformer models for text ranking and all of them require training data to fine tune for retrieval or ranking.


Figure from ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT illustrating various deep neural networks for ranking. In the following section we give an overview of three of these methods using Transformer models.

Representation based ranking using Transformer models

It is possible to use the Transformer model as the underlying deep neural network for representation based learning. Given training data, one can learn a representation of documents and queries so that relevant documents are closer or more similar in this representation than irrelevant documents. The representation based ranking approach falls under the broad representation learning research field and representation learning can be applied to text, images, videos or even combinations (multi-modal representations).
For text ranking, queries and documents are embedded into a dense embedding space using one or two Transformer models (bi-encoders).
The embedding representation is learned by the training examples given to the model. Once the model has been trained, one can pre-compute the embeddings for all documents, and
at query time, use nearest neighbor search for efficient document retrieval.

[Figure: representation-based (bi-encoder) ranking]

Document retrieval using dense embedding models is commonly referred to as dense retrieval.
The query and the document are embedded independently, and during training the model is given examples of the form (query, relevant passage, negative passage).
The model weights are adjusted per batch of training triplets.
The embedding representation from the Transformer model could be based on, for example, the CLS token of BERT (Classification Token), or on a pooling strategy over the last Transformer layer.

The huge benefit of using representation based similarity on top of Transformer models is that the document representation can be produced offline by encoding them through the trained transformer and unless the model changes, this only needs to be done once when indexing the document. At online serving time, the serving system only needs to obtain the query embedding by running the query through the transformer model and use the resulting query embedding vector as the input to a nearest neighbor search in the dense embedding space to find relevant documents. On the MS Marco Passage ranking set, dense retrieval using a learned representation has demonstrated good results over the last year or so. Dense retrievers achieve much better accuracy (MRR@10 and Recall@1000) than sparse traditional search using exact lexical matching (e.g BM25) and the current state-of-the-art uses a dense retriever as the first phase candidate selection for re-ranking using a more sophisticated (and computationally expensive) all-to-all interaction model.

Since the query is usually short, the online encoding complexity is relatively low and encoding latency is acceptable even on a CPU serving stack. Transformer models with full all-to-all cross attention have quadratic run time complexity in the input sequence length, so the shorter the input sequence, the better the performance. Most online serving systems can also cache the query embedding representation to save computations and reduce latency.
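
To make the serving pattern concrete, here is a minimal NumPy sketch of representation-based retrieval: document embeddings are computed offline and kept in memory, and at query time a single query embedding is scored against all of them (exact search here; an approximate nearest neighbor index replaces the full scan in practice). The random vectors stand in for the output of a trained bi-encoder.

import numpy as np

rng = np.random.default_rng(0)

# Offline: encode all documents once with the trained bi-encoder (random stand-ins here).
doc_embeddings = rng.normal(size=(1000, 384))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def retrieve(query_embedding, k=10):
    """Exact nearest neighbor search by inner product over unit-length vectors."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = doc_embeddings @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

# Online: encode the query (one transformer forward pass) and search.
query_embedding = rng.normal(size=384)
print(retrieve(query_embedding, k=3))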

All to all interaction ranking using Transformers

The “classic” way to use BERT for ranking is to use it as an all-to-all interaction model where both the query and the document are fed through the Transformer model simultaneously and not independently as with the representation based ranking model. For BERT this is usually accomplished with a classification layer on top of the CLS token output, and the ranking task is converted into a classification task where one classifies whether the document is relevant for the query or not (binary classification). This approach is called monoBERT or vanilla BERT, or BERT cat (categorization). It's a straightforward approach, in line with the suggestions of the original BERT paper for task-specific fine-tuning.

[Figure: all-to-all interaction (cross-encoder) ranking]

Similar to the representation model, all-to-all interaction models need to be trained with triplets, and the way we sample the negative (irrelevant) examples is important for the overall effectiveness of the model. The first BERT submission to MS Marco passage ranking used monoBERT to re-rank the top 1K documents from a more efficient sparse first-phase retriever (BM25).

With all to all interaction there is no known way to efficiently pre-compute the document representation offline. Running online inference with cross-attention models over all documents in a collection is computationally prohibitively expensive even for large organizations like Google or Microsoft, so to deploy it for production one needs a way to reduce the number of candidate documents which are fully evaluated using the all to all cross attention model. This has led to increased interest in multi-stage retrieval and ranking architectures but also more efficient Transformer models without quadratic complexity due to the cross attention mechanisms (all to all attention).

Late Interaction using Transformers

An alternative approach for using Transformers for ranking was suggested in ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.

Unlike the all-to-all query-document interaction model, the late contextualized interaction over BERT enables processing the documents offline, since the per-document token contextual embeddings are generated independently of the query tokens. The embedding outputs of the last Transformer layer are calculated at document indexing time and stored in the document. For a passage of 100 tokens we end up with 100 embedding vectors of dimensionality n, where n is a tradeoff between ranking accuracy and storage (memory) footprint. The dimensionality does not necessarily need to be the same as the transformer model's hidden size; using 32 dimensions per token embedding gives almost the same accuracy as the larger 768 dimensions of BERT base. Similarly, one can use lower precision like float16 or quantization (int8) to reduce the memory requirements per dimension.

[Figure: late interaction (ColBERT) ranking]

The ability to run the documents through the model and obtain the per-term contextual embeddings offline significantly speeds up online query evaluation, since at query time one only needs to do one pass through the Transformer model with the query to obtain the contextual query term embeddings, and then calculate the proposed MaxSim operator over the pre-computed per-term contextualized embeddings for the documents we want to re-rank.

Similar to the pure representation based model, we only need to encode the query through the transformer model at query time. The query tokens only attend to other query tokens, and similarly document tokens only attend to other document tokens.

As demonstrated in the

Pretrained Transformer Language Models for Search – part 2


Photo by Rob Fuller on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa.ai. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset. We demonstrate how to achieve close to state of the art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

In the first post in this series we introduced using pre-trained models for ranking. In this second post we study efficient candidate retrievers which can be used to efficiently find candidate documents which are re-ranked using more advanced models.

Multiphase retrieval and ranking

Due to the computational complexity of cross interaction transformer models there has been renewed interest in multiphase retrieval and ranking. In a multiphased retrieval and ranking pipeline, the first phase retrieves candidate documents using a cost-efficient retrieval method and the more computationally complex cross-attention or late interaction model inference is limited to the top ranking documents from the first phase.

[Figure: multi-stage retrieval and ranking architecture, from Phased ranking with Vespa]

An illustration of a multi-stage retrieval and ranking architecture is given in the figure above. The illustration is from Phased ranking with Vespa. The three phases illustrated in the diagram are per content node, which retrieves and re-ranks a subset of the total document volume. In addition, one can also re-rank the global top-scoring documents after the results from the nodes involved in the query are merged to find the global best documents. This step might also involve diversification of the result set before final re-ranking.

Broadly, there are two categories of efficient sub-linear retrieval methods:

  • Sparse retrieval using lexical term matching over inverted indexes, potentially accelerated by the WAND algorithm
  • Dense retrieval using dense vector representation of queries and documents, potentially accelerated by approximate nearest neighbor search algorithms

In the next sections we take a deep dive into these two methods and we also evaluate their effectiveness on the MS Marco Passage Ranking relevancy dataset. We also show how these
two methods can be combined with Vespa.

Sparse lexical retrieval

Classic information retrieval (IR) relies on lexical matching, which has been around since the early days of information retrieval. One example of a popular lexical retrieval scoring function is BM25. Retrieval can be done in sub-linear time using inverted indexes and accelerated by dynamic pruning algorithms like WAND. Dynamic pruning algorithms avoid exhaustively scoring all documents which match at least one of the query terms. In the Vespa document schema below we declare a minimal passage document type which we can use to index the MS Marco Passage ranking dataset introduced in post 1.

search passage {
  document passage {
    field text type string {
      indexing: summary | index
      index: enable-bm25
    }
    field id type int {
      indexing: summary | attribute
    }
  }
  fieldset default {
    fields: text
  }
  rank-profile bm25 {
    first-phase {
      expression: bm25(text)
    }
  }
}

We define a text field which we populate with the passage text. The indexing directive controls how the field is handled: summary means that the text should be returned in the search result page, and index specifies that we want to build inverted index data structures for efficient search and matching. We also define a ranking profile with a single ranking phase using the Vespa bm25(name) text ranking feature, one out of many built-in Vespa text matching ranking features.
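
Once the application is deployed, passages can be fed over HTTP. The sketch below uses the Vespa /document/v1 API from Python; the endpoint, namespace and example text are assumptions for illustration.

import requests

# Hypothetical endpoint and namespace; adjust to your deployment.
ENDPOINT = "http://localhost:8080"

def feed_passage(doc_id, text):
    """Feed one passage document using the Vespa /document/v1 API."""
    url = f"{ENDPOINT}/document/v1/msmarco/passage/docid/{doc_id}"
    response = requests.post(url, json={"fields": {"id": doc_id, "text": text}})
    response.raise_for_status()
    return response.json()

print(feed_passage(0, "Charles de Gaulle (CDG) is the largest international airport in Paris."))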

Once we have indexed our data we can search using the Vespa HTTP POST query API:

  {
    "yql": "select id,text from passage where userQuery();",
    "hits": 10,
    "query": "is cdg airport in main paris?",
    "ranking.profile": "bm25",
    "type": "all"
  }
  • The yql parameter is the Vespa query language, userQuery() is a reference to the query parameter
  • The hits parameter controls the number of hits in the Vespa response
  • The query parameter contains the free text input query from the end user, expressed in the simple query language
  • The ranking.profile parameter chooses the ranking profile to use for the query
  • The type specifies the query type (all, any, phrase), which controls the boolean query logic. All requires that all query terms are found in the document, while any requires that at least one of the query terms matches in the document.

If we use the above query to search the MS Marco Passages we end up ranking only 2 passages and the query takes 7 ms. If we change type to any instead of all we end up ranking 7,926,256 passages (89% of the total collection) and the query takes 120 ms. Exact timing obviously depends on hardware and the number of threads used to evaluate the query, but the main point is that brute-force matching of all documents which contain at least one term is expensive, while restricting to all is too restrictive and fails to recall the relevant documents. So what is the solution to this problem? How can we find the relevant documents without having to fully score almost all passages in the collection?

Meet the dynamic pruning algorithm WAND

The WAND algorithm is described in detail in

Efficient Query Evaluation using a Two-Level Retrieval Process (PDF)

We have determined that our algorithm significantly reduces the total number of full evaluations by more
than 90%, almost without any loss in precision or recall.
At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND or
Weak AND that might be of independent interest

Vespa implements WAND as a query operator, and below is an example of how to use it with our query example from above:

 {
    "yql": "select id, text from passage where ([{\"targetNumHits\": 10}]weakAnd(default contains \"is\", default contains \"cdg\", default contains \"airport\", default contains \"in\", default contains \"main\", default contains \"paris\"));",
    "hits": 10,
    "ranking.profile": "bm25"
  }

Using the above WAND query we fully rank only 2,409 passages with the bm25 ranking profile, and recall at the first positions is the same as with brute force any,
so we did not lose any accuracy but saved a lot of resources.
Using the weakAnd operator, the query takes 12 ms instead of 120 ms with brute force any.
WAND is best used via a custom searcher plugin to avoid tokenization outside of Vespa, which might introduce asymmetric behaviour.
See for example RetrievalModelSearcher,
or use weakAnd.replace,
which rewrites type any queries to use WAND instead.

There are two WAND/WeakAnd implementations in Vespa; in the above example we used weakAnd(), which fully integrates with text processing (tokenization and index statistics like IDF (Inverse Document Frequency)). The alternative is wand(), where the end user controls the query and document side weights explicitly. The wand() operator can be used to implement DeepCT and HDCT: Context-Aware Term Importance Estimation For First Stage Retrieval, as Vespa gives the user full control of query and document term weighting without having to bloat the regular index by repeating terms to increase or lower the term frequency. Read more in Using WAND with Vespa.

Dense Retrieval using bi-encoders over Transformer models

Embedding based models embed or map queries and documents into a latent low-dimensional dense embedding vector space and use vector search to retrieve documents. Dense retrieval can be accelerated by approximate nearest neighbor search, for example by indexing the document vector representations using HNSW graph indexing. In-domain dense retrievers based on the bi-encoder architecture trained on MS Marco passage data have demonstrated that they can outperform sparse lexical retrievers by a large margin. Let us introduce using dense retrievers with Vespa.

In this example we use a pre-trained dense retriever model from Huggingface 🤗, sentence-transformers/msmarco-MiniLM-L-6-v3. The model is based on MiniLM and the output layer has 384 dimensions. The model has just 22.7M trainable parameters, and encoding the query using a quantized model takes approximately 8 ms on CPU. The original model uses mean pooling over the last layer of the MiniLM model, but we also add an L2 normalization to normalize vectors to unit length (1) so that we can use the innerproduct distance metric instead of the angular distance metric. This saves computations during the approximate nearest neighbor search.
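
A minimal sketch of query encoding with this model, assuming the sentence-transformers Python library is available; the extra L2 normalization step mirrors the description above.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-v3")

def encode_query(text):
    """Encode a query into a unit-length 384-dim vector (mean pooling + L2 normalization)."""
    embedding = model.encode(text)                 # mean pooling is built into the model
    return embedding / np.linalg.norm(embedding)   # unit length -> innerproduct == cosine

query_embedding = encode_query("is cdg airport in main paris?")
print(query_embedding.shape)  # (384,)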

We expand our passage document type with a dense tensor field mini_document_embedding and a new ranking profile.

search passage {
  document passage {
    field text type string {
      indexing: summary | index
      index: enable-bm25
    }
    field mini_document_embedding type tensor<float>(d0[384]) {
      indexing: attribute | index
      attribute {
        distance-metric: innerproduct
      }
      index {
        hnsw {
          max-links-per-node: 32
          neighbors-to-explore-at-insert: 500
        }
      }
    }
    field id type int {
      indexing: summary | attribute
    }
  }
  fieldset default {
    fields: text
  }
  rank-profile bm25 {
    first-phase {
      expression: bm25(text)
    }
  }
  rank-profile dense {
    first-phase {
      expression: closeness(field, mini_document_embedding)
    }
  }
}

The mini_document_embedding tensor is dense (denoted by d0[384]) and has dimensionality 384 (determined by the Transformer model we use, and any linear dimension reduction). We use float resolution (4 bytes) for the tensor cell values (valid choices are double, float, bfloat16 and int8). We also define an HNSW index for the field, and we set 2 HNSW indexing parameters, which is an accuracy versus performance tradeoff. See HNSW for details. Accuracy is typically measured by recall@k, comparing brute force nearest neighbor search with the approximate nearest neighbor search at depth k. The dense ranking profile specifies how we want to rank (or actually re-rank) our documents, in this case using the closeness ranking feature: documents close to the query in the embedding space are ranked higher than documents which are far away. At indexing time we need to convert the passage text into the dense vector representation and index it. At query time, we need to encode the query and use approximate nearest neighbor search:

  {
   "yql": "select id, text from passage where [{\"targetNumHits\": 10}]nearestNeighbor(mini_document_embedding, query_embedding);",
   "hits": 10,
   "query": "is cdg airport in main paris?",
   "ranking.profile": "dense",
   "ranking.features.query(query_embedding)": [0.08691329, -0.046273664, -0.010773866,..,..]
  }

In the above example we use the Vespa nearestNeighbor query operator to retrieve the 10 closest documents in embedding space for the input query embedding vector passed in the ranking.features.query(query_embedding) parameter. In this example, query encoding (the forward pass of the query through the encoder to obtain the query embedding) is done outside Vespa, but we can also represent the query encoding model inside Vespa, avoiding complicating our online serving deployment setup:

Representing the bi-encoder model inside Vespa

To represent the bi-encoder query model in Vespa we need to export the Huggingface PyTorch model into ONNX format for efficient serving in Vespa.
We include a notebook in this
sample application
which demonstrates how to transform the model and export it to ONNX format.
Vespa supports evaluating ONNX models for ranking and query encoding.
To speed up evaluation on CPU we use quantized (int) version.
We have demonstrated how to represent query encoders in
Dense passage retrieval with nearest neighbor search.

Hybrid Dense Sparse Retrieval

Recent research indicates that combining dense and sparse retrieval can improve recall, see for example A Replication Study of Dense Passage Retriever. The hybrid approach combines dense and sparse retrieval, but requires search technology which supports both sparse lexical and dense retrieval. Vespa.ai supports hybrid retrieval in the same query by combining the WAND and ANN algorithms. There are two ways to do this:

Disjunction (OR)

  {
   "yql": "select id, text from passage where 
   ([{\"targetNumHits\": 10}]nearestNeighbor(mini_document_embedding, query_embedding)) or  
   ([{\"targetNumHits\": 10}]weakAnd(default contains \"is\"...));",
   "hits": 10,
   "query": "is cdg airport in main paris?",
   "ranking.profile": "hybrid",
   "ranking.features.query(query_embedding)": [0.08691329, -0.046273664, -0.010773866,..,..]
  }

In the above example we combine ANN with WAND using an OR disjunction, and we have a hybrid ranking profile which can combine the dense and sparse ranking signals (e.g. bm25 and vector distance/closeness). Approximately 10 + 10 documents will be exposed to the first-phase ranking function (depending on targetNumHits). It is then up to the first-phase ranking expression to combine the scores of these two different retrieval methods into a final score. See A Replication Study of Dense Passage Retriever for examples of parameter weighting. For example, it could look something like this:

rank-profile hybrid {
  first-phase {
    expression: 0.7*bm25(text) + 2.9*closeness(field, mini_document_embedding)
  }
}

Rank:

Pretrained Transformer Language Models for Search – part 3


Photo by Frank Busch
on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa.ai. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset.
We demonstrate how to achieve close to state-of-the-art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

In the first post in this series
we introduced using pre-trained language models for ranking and three popular methods for using them for text ranking.
In the second post
we studied efficient retrievers that could be used as the first phase in a multiphase retrieval and ranking pipeline.
In this third post we study a re-ranking model which we will deploy as a re-ranker on top of the retriever methods
we studied in the previous post, but first let us recap what a multiphase retrieval and ranking pipeline is.
In a multiphased retrieval and ranking pipeline,
the first phase retrieves candidate documents using a cost-efficient retrieval method
and the more computationally complex cross-attention or late interaction model inference
is limited to the top ranking documents from the first phase.
In this post we will study the Contextualized late interaction over BERT (ColBERT) model
and deploy it as a re-ranking phase on top of the dense retriever that we studied in the previous post.
The ColBERT ranking model was introduced in
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
by Omar Khattab and Matei Zaharia.

Contextualized late interaction over BERT (ColBERT)

In the previous post in this series we introduced a dense retriever using a bi-encoder architecture over a Transformer model (MiniLM). Both queries and documents were encoded by the bi-encoder and represented in the same dense embedding vector space.
We used cosine similarity between the query and the document in this embedding vector space to rank documents for a query,
and we could accelerate the retrieval phase using approximate nearest neighbor search
using angular distance or innerproduct.

Unlike the dense bi-encoder, the contextualized late interaction model represents the query and document as multiple vectors obtained from the last output layer of the Transformer model. Bi-encoders, on the other hand, usually perform a pooling operation over the last transformer layer,
e.g. just using the embedding representation from the CLS output token, or the mean over all token output embeddings.
Also, unlike other text-to-vector representations like Word2Vec, the token vector representation depends on the other tokens in the same input sequence. For example, the token driver in the text Microsoft driver has a different vector representation than driver in the text Taxi driver, as the context is different. This is thanks to the attention mechanism in the Transformer architecture, where each token attends to all other tokens in the same input sequence. We can say that the token output vector representation is contextualized by the other tokens in the input text sequence.

Similar to the single-vector bi-encoder model, queries and documents are encoded independently. Hence, the query tokens only attend to other query tokens, and document tokens only attend to other document tokens. This separation enables offline processing of the documents, which speeds up re-ranking since at re-ranking time we only need to obtain the query token embeddings and load the precomputed document embeddings from storage (e.g. memory). The ColBERT architecture uses a query encoder and a document encoder based on the same Transformer instance, but the input to the model is different for queries and documents. The query encoder pads the query with the BERT mask token up to a configurable maximum query length if the query input text is shorter than this max length. The document input is not padded to a fixed length.
The padding of masked tokens of the query input is explained in the paper:

We denote the padding with masked tokens as query augmentation, a step that allows BERT to produce query-based embeddings at the positions corresponding to these masks. Query augmentation is intended to serve as a soft, differentiable mechanism for learning to expand queries with new terms or to re-weigh existing terms based on their importance for matching the query

The dimensionality used to represent the output token embedding can be reduced using a dimension reduction layer on top of the last transformer output layer. The original token output dimensionality depends on the Transformer model used; for example, the bert-base model uses 768 dimensions while MiniLM uses 384 dimensions. In the ColBERT paper the authors use dimension reduction to 128 dimensions from the original hidden size of 768 dimensions. The authors also demonstrate that reducing the dimensionality further to 32 does not impact ranking accuracy significantly. The dimensionality and the numeric precision used for the vector values matter for both the computational complexity and the storage requirements.
For example, if we use 32 dimensions with bfloat16 (2 bytes per tensor value) precision, we need to store about 32GB of vector data for 9M documents with an average of 60 tokens per document,
while if we use 128 dimensions with float32 (4 bytes) precision, we end up with about 256GB of vector data.
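
These storage estimates can be reproduced with a quick back-of-the-envelope calculation:

def colbert_storage_gb(num_docs, avg_tokens, dims, bytes_per_value):
    """Approximate storage for per-token document embeddings, in gigabytes."""
    return num_docs * avg_tokens * dims * bytes_per_value / 1024**3

print(colbert_storage_gb(9_000_000, 60, 32, 2))    # ~32 GB  (32 dims, bfloat16)
print(colbert_storage_gb(9_000_000, 60, 128, 4))   # ~257 GB (128 dims, float32)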

Ranking with ColBERT – Meet MaxSim

So we now know roughly how the ColBERT architecture works: query text is encoded into a fixed-length bag of token embeddings and document text is encoded into a bag of token embeddings. The missing piece is how we compute the relevancy score of a query-document pair using this representation.

The ColBERT paper introduces the late interaction similarity function which the authors name Maximum Similarity (MaxSim). For a given query and document pair, the MaxSim relevancy score is calculated as follows:

For each query token embedding, perform cosine similarity against all the document token embeddings and track the maximum score per query token.
The overall query-document score is the sum of these maximum cosine scores.
For a query with 32 token embeddings (max query length 32) and a document with 128 tokens, we need to perform 32*128 cosine similarity operations. The MaxSim operator is illustrated in the figure below.


MaxSim illustration from the ColBERT paper

The cosine similarity with unit length vectors can be performed by the inner dot product,
and can be HW accelerated using advanced vector instructions.
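
A minimal NumPy sketch of MaxSim, assuming unit-length token embeddings so that the inner product equals the cosine similarity:

import numpy as np

def maxsim(query_embeddings, doc_embeddings):
    """query_embeddings: (num_query_tokens, dim), doc_embeddings: (num_doc_tokens, dim),
    both with unit-length rows. Returns the MaxSim relevancy score."""
    similarities = query_embeddings @ doc_embeddings.T   # all query x doc cosine similarities
    return similarities.max(axis=1).sum()                # max per query token, then sum

rng = np.random.default_rng(0)
q = rng.normal(size=(32, 32)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(128, 32)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))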

Vespa ColBERT representation

To represent the ColBERT model architecture in Vespa for re-ranking documents we need:

  • Store the document token embeddings in our Vespa document model for fast, on-demand, access in ranking phases
  • Express the MaxSim function with a Vespa ranking expression
  • Map the query text to token ids, and map tokens to token embeddings at run time by invoking the ColBERT query encoder transformer model

We expand the Vespa document schema from the previous post and introduce a new mixed Vespa tensor field called dt. We use this tensor to store the computed bag of token embeddings for the document.
The mixed tensor (combining the sparse dt dimension and the indexed x dimension) allows storing a dynamic number of token embeddings for the document, depending on the length of the document. We could have used an indexed representation, but that would have used more memory, as we would need to determine a max document length.

Vespa Passage document schema

The new document schema including the new dt ColBERT document tensor is given below:

search passage {
  document passage {
    field id type int {...} 
    field text type string {...}
    field mini_document_embedding type tensor<float>(d0[384]){...}
    field dt type tensor<bfloat16>(dt{}, x[32]){
     indexing: attribute
     attribute: fast-search
    }
  }
}

The tensor cell value type we use is bfloat16, which uses 2 bytes per tensor cell value and saves 50% of the memory compared to float precision (4 bytes per value). Vespa supports double, float, bfloat16 and int8 tensor cell value precision types.

We also use 32 dimensions for the per-token embedding representation instead of 128 to further reduce the memory requirement. The indexing statement specifies attribute, which means this field is stored in memory, and fast-search enables a fast, uncompressed in-memory representation which speeds up evaluation over mixed tensor fields. fast-search is only relevant for mixed tensor fields.

Vespa MaxSim operator

We can express the MaxSim operator in Vespa by a tensor ranking expression using sum and reduce tensor functions.

 
sum(
  reduce(
    sum(query(qt) * attribute(dt), x),
    max, dt
  ),
  qt
)

Where attribute(dt) is the ColBERT document tensor field and query(qt) is the ColBERT query tensor representation.
The query(qt) tensor is defined in the
passage schema:

query(qt) tensor<float>(qt{},x[32])

We configure the MaxSim operator in a Vespa ranking profile,
where we use the dense bi-encoder model as our first-phase ranking function and the ColBERT MaxSim as the second-phase ranking expression.
We use a re-ranking count of 1000 (per node);
this setting can also be controlled at query time,
in case we want to explore different re-ranking depths.
The ranking profile is given below.
In this case, we also cast the bfloat16 tensor values
to float to enable the HW accelerations in place for operations on float tensors.

rank-profile dense-colbert {
  first-phase {
    expression: closeness(field,mini_document_embedding)
  }
  second-phase {
    rerank-count: 1000
    expression {
      sum(
        reduce(
          sum(
              query(qt) * cell_cast(attribute(dt), float) , x
          ),
          max, dt
         ),
         qt
      )
    }
  }
}

To obtain the query(qt) ColBERT tensor we need to encode the text query input using the ColBERT query encoder.

Vespa ColBERT query encoder

We have trained a ColBERT model using a 6-layer MiniLM model which can be downloaded from the Huggingface model hub. This model has only 22.7M trainable parameters and can be served with Vespa using ONNX format. We have also included a notebook which demonstrates how to export the PyTorch transformer model to ONNX format and use quantization to further speed up evaluation. Quantization, using int8 weights instead of float, speeds up evaluation of the model by 3x. See the Google colab notebook.

The query encoder is represented in a query document type which has no fields.
It’s a placeholder to be able to represent the ONNX model,
and we use a single empty document so that we can invoke the Vespa ranking framework to evaluate the ONNX model.

schema query {
  document query {}
  onnx-model colbert_encoder {
    file: files/vespa-colMiniLM-L-6-quantized.onnx
    input input_ids: query(input_ids)
    input attention_mask: query(attention_mask)
    output contextual:contextual 
  }
  rank-profile colbert_query_encoder {
    num-threads-per-search: 1
    first-phase {
      expression: random 
    }
    summary-features {
      onnxModel(colbert_encoder).contextual
    }
  }
}

Tokenization and the tensor inputs (input_ids and attention_mask) are generated using a
custom searcher which maps the query text to BERT token ids
and creates the ColBERT masked query input.
See ColBERTSearcher for details.
This searcher produces the mentioned query(qt) tensor which is used by the MaxSim ranking expression.
We use the ColBERT repo’s indexing routine to produce the document token embeddings,
and we also publish a pre-processed dataset with all 8.8M passages including both the mini_document_embedding and ColBERT tensor fields.
See MS Marco Passage Ranking using Transformers vespa sample application.

We evaluate the ranking effectiveness of the ColBERT model deployed as a re-ranking step on top of the dense retriever introduced in the previous post. We use MS Marco Passage Ranking dev query split (6980 queries):

Retrieval method        | Ranking      | MRR@10 | Recall@100 | Recall@200 | Recall@1000
weakAnd (sparse)        | bm25         | 0.185  | 0.66       | 0.73       | 0.85
nearestNeighbor (dense) | innerproduct | 0.310  | 0.82       | 0.87       | 0.94
nearestNeighbor (dense) | ColBERT      | 0.359  | 0.86       | 0.90       | 0.94

The Recall@1000 does not change as the model is used to re-rank the top 1K hits from the dense retriever. The Recall@100 and Recall@200 metrics improve with the ColBERT re-ranking step, and MRR@10 improves from 0.310 to 0.359.
The end-to-end latency, including query encoding of the dense retriever model, ColBERT query encoding, retrieval with nearest neighbor search (with targetHits=1000) and re-ranking with ColBERT, is just 39 ms. Reducing the nearest neighbor search targetHits, or the ColBERT re-ranking depth, can be used to trade accuracy versus cost.

$ ./src/main/python/evaluate_passage_run.py --rank_profile dense-colbert --rerank_hits 1000 --retriever dense  --ann_hits 1000 --hits 10  --trec_format --run_file dev.test --query_split dev --endpoint https://$ENDPOINT:4443/search/
100%|██████████| 6980/6980 [04:27<00:00, 26.07it/s]


Pretrained Transformer Language Models for Search – part 4

Decorative image

Photo by Patrick Hendry on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa.ai. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset. We demonstrate how to achieve close to state of the art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

Blog posts in this series:

In the first post in this series we introduced using pre-trained language models for ranking and three popular methods for using them for text ranking. In the second post we studied efficient retrievers which could be used as the first phase in a multiphase retrieval and ranking pipeline. In the third post we studied the ColBERT re-ranking model.

In this fourth and last post in our blog post series on pre-trained transformer models for search,
we introduce a cross-encoder model with all-to-all interaction between the query and the passage.

We deploy this model as our final ranking stage in our multiphase retrieval and ranking pipeline. Furthermore,
we submit the ranking results to the MS Marco Passage Ranking Leaderboard.

In addition, we benchmark the serving performance of all the retrieval and ranking methods introduced in this blog post series.
Finally, we also release a Vespa sample application,
which lets you try out these state-of-the-art retrieval and ranking methods.

Introduction

In this blog post we study the third option for using transformer models for search and document ranking.
This option is the simplest model to configure and use in Vespa, but also the most computationally expensive model in our multi-phase retrieval and ranking pipeline.
With the cross-attention model we input both the query and the passage to the model, and as we know by now,
the computational complexity of the transformer is quadratic in the input length.
Doubling the sequence length increases the computational complexity by 4x.

The cross-encoder model is a transformer-based model with a classification head on top of the Transformer CLS token (classification token).
The model has been fine-tuned using the MS Marco passage training set and is a binary classifier which classifies
whether a query, document pair is relevant or not.

The cross-encoder model is also based on a 6-layer MiniLM model with only 22.7M parameters, same as the transformer models previously introduced in this blog series. As with the other two transformer models we introduced in previous posts in this series, we integrate this model in Vespa using ONNX format. We demonstrate how to export the model(s) from PyTorch/Transformers to ONNX format in this notebook. The model is hosted on the Huggingface model hub.

We use a quantized version where the original float weights have been quantized to int8 representation to speed up inference on cpu.

Vespa representation of the cross-encoder model

In previous posts we have introduced the Vespa passage schema.
We add a new tensor field to our schema, in which we store the transformer token ids of the processed text.
We haven’t described this in detail before, but the MiniLM model takes as input the sequence of numeric token ids from the
fixed BERT token vocabulary of about 30K unique tokens or subwords.

For example the passage:

Charles de Gaulle (CDG) Airport is close to Paris

Is tokenized to:

['charles', 'de', 'gaulle', '(', 'cd', '##g', ')', 'airport', 'is', 'close', 'to', 'paris']

The subword tokens are mapped to token ids from the fixed vocabulary, e.g. ‘charles’ maps to token id 2798.
The example passage text is represented as a tensor by:

[2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000]

We use the native Vespa WordPiece embedder
to map the text into tensor representation.
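For illustration, the same subword-to-id mapping can be reproduced outside Vespa with the Hugging Face BertTokenizer; this sketch is not part of the Vespa indexing pipeline, which uses the native WordPiece embedder:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Charles de Gaulle (CDG) Airport is close to Paris"

tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)     # ['charles', 'de', 'gaulle', '(', 'cd', '##g', ')', 'airport', ...]
print(token_ids)  # [2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, ...]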

The passage document schema,
including the new text_token_ids field:

search passage {
  document passage {
    field id type int {...} 
    field text type string {...}
    field mini_document_embedding type tensor<float>(d0[384]){...}
    field dt type tensor<bfloat16>(dt{}, x[32]){..}
  }

  field text_token_ids type tensor<float>(d0[128])  {
    indexing: input text | embed tokenizer | attribute | summary
    attribute: paged
  }
}

We store a maximum of 128 tokens, denoted by d0[128]. This is an example of an indexed Vespa tensor type.

Vespa ranking with cross-encoder model

We are going to use the dense retriever model, accelerated by Vespa’s approximate nearest neighbor search, to
efficiently retrieve passages for re-ranking with our transformer-based ranking models. The retrieved hits are
re-ranked with the ColBERT model introduced in the third post,
and finally the top ranking documents from the ColBERT model are re-ranked using the cross-encoder.

The retrieval and ranking pipeline has two re-ranking depth parameters.

  • How many are re-ranked with ColBERT is determined by the target number of hits passed to the nearest neighbor query operator.
  • The number of documents that are re-ranked using the final cross-encoder model is determined by the rank-profile rerank-count property.

See phased ranking with Vespa.
Both these parameters impact end-to-end serving performance and also ranking accuracy as measured by MRR@10.

Both the nearest neighbor search target number of hits and the rerank-count are per content node involved in the query.
This is only relevant for deployments where the document corpus cannot be indexed on a single node due to either space constraints (memory, disk) or serving latency constraints.

Defining the MiniLM cross-encoder

schema passage {
  document passage {...}

  onnx-model minilmranker {
    file: files/ms-marco-MiniLM-L-6-v2-quantized.onnx
    input input_ids: input_ids
    input attention_mask: attention_mask
    input token_type_ids: token_type_ids
  }
}

In the above snippet we define the ONNX model and its inputs; each of the inputs is mapped to a function declared later in the ranking profile. Each function produces a tensor
which is used as input to the model. The file points to the ONNX model file, placed in src/main/application/files/.
Vespa takes care of distributing the model to the content node(s). The inputs
to the model are standard transformer inputs (input_ids, attention_mask and token_type_ids).

The first part of the ranking profile where we define the 3 input functions to the BERT model looks like this:

  rank-profile dense-colbert-mini-lm {
    function input_ids() {
       expression: tokenInputIds(128, query(query_token_ids), attribute(text_token_ids))
    }
    function token_type_ids() {
      expression: tokenTypeIds(128, query(query_token_ids), attribute(text_token_ids))
    }
    function attention_mask() {
      expression: tokenAttentionMask(128, query(query_token_ids), attribute(text_token_ids))
    }
}

For example, the model input input_ids is produced by the function input_ids, which is defined as

  function input_ids() {
       expression: tokenInputIds(128, query(query_token_ids), attribute(text_token_ids))
    }

The tokenInputIds is a built-in Vespa ranking feature
which builds the transformer model input including special tokens like CLS and SEP.

We pass the query(query_token_ids) tensor, which
is sent with the query, and the passage token ids, which are read from the in-memory attribute field (text_token_ids).

The query tensor representation (query(query_token_ids)) is created in a custom query processor RetrievalModelSearcher
which converts the free text query input from the
user to a tensor representation using the same BertTokenizer as used by the custom document processor.

For example for a text query

is CDG in paris?

The query tensor representation becomes:

[2003, 3729, 2290, 1999, 3000, 1029]

The tokenInputIds ranking function creates the concatenated tensor of both query and passage, including the special tokens. Using the example passage
from the previous section with the above query example, our concatenated output with special tokens becomes:

[101, 2003, 3729, 2290, 1999, 3000, 1029, 102, 2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000, 102]

Where 101 is the CLS token id and 102 is the SEP token separating the query from the passage.
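As an illustration of what tokenInputIds produces, here is a small Python sketch (not the Vespa implementation); the truncation and zero-padding behavior shown here is a simplifying assumption:

CLS, SEP, PAD = 101, 102, 0

def token_input_ids(max_length, query_ids, passage_ids):
    # [CLS] query [SEP] passage [SEP], truncated and padded to max_length
    sequence = [CLS] + query_ids + [SEP] + passage_ids + [SEP]
    sequence = sequence[:max_length]
    return sequence + [PAD] * (max_length - len(sequence))

query_ids = [2003, 3729, 2290, 1999, 3000, 1029]
passage_ids = [2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000]
print(token_input_ids(128, query_ids, passage_ids)[:21])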

Cross-Encoder Model

The above figure illustrates the input and output of the cross-encoder transformer model.

Notice the CLS output embedding, which is fed into the
classification layer that predicts the class label (relevant = 1, irrelevant = 0).

Now that we have presented how to represent the cross-encoder model, we can present the remaining parts of our
ranking profile:

rank-profile dense-colbert-mini-lm {
    ...

    function maxSimNormalized() {
      expression {
        sum(
          reduce(
            sum(
              query(qt) * attribute(dt), x
            ),
            max, dt
          ),
          qt
        )/32.0
       }
    }
    function dense() {
      expression: closeness(field, mini_document_embedding)
    }
    
    function crossModel() {
      expression: onnx(minilmranker){d0:0,d1:0}
    }
    
    first-phase {
        expression: maxSimNormalized()
    }
    
    second-phase {
      rerank-count: 24
      expression: 0.2*crossModel() + 1.1*maxSimNormalized() + 0.8*dense()
    }
}

The maxSimNormalized function computes the ColBERT MaxSim function which we introduced in post 3.
Here we also normalize the MaxSim score by dividing it by 32, which is the configured max ColBERT query encoder query length,
so that each query term contributes a maximum score of 1.

The dense() function calculates the cosine similarity as calculated
by the dense retriever introduced in post 2.

In the crossModel() function we calculate the score from cross-encoder introduced in this blog post:

function crossModel() {
  expression: onnx(minilmranker){d0:0,d1:0}
}

The {d0:0,d1:0} accesses the logit score (d0:0 is the batch dimension, which is always of size 1, and d1:0 accesses the logit, which is a proxy for the relevancy).

Ranking profile summarized

  • Retrieve efficiently using the dense retriever model – This is done by the Vespa approximate nearest neighbor search query operator.
  • The k passages retrieved by the nearest neighbor search are re-ranked using the ColBERT MaxSim operator. k is set by the target hits used for the nearest neighbor search.
  • In the last phase, the top ranking 24 passages from the previous phase are evaluated by the cross attention model.
  • The final ranking score is a linear combination of all three ranking scores. The rerank-count can also be adjusted by a query parameter.

Observe that reusing scores from the previous ranking phases does not impact serving performance,
as they are only evaluated once (per hit) and cached.

The linear weights
of the three different transformer scores were obtained by a simple grid search, observing
the ranking accuracy on the dev query split when changing the parameters.
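The sketch below illustrates such a grid search; evaluate_mrr_at_10 is a placeholder for re-running the dev query split against Vespa with the candidate weights and computing MRR@10, and the weight grid itself is an assumption:

from itertools import product

def evaluate_mrr_at_10(w_cross, w_colbert, w_dense):
    # Placeholder so the sketch runs; replace with a real evaluation that
    # queries Vespa with these weights and scores the resulting run.
    return -(abs(w_cross - 0.2) + abs(w_colbert - 1.1) + abs(w_dense - 0.8))

grid = [0.0, 0.2, 0.5, 0.8, 1.1, 1.5]
best = max(product(grid, repeat=3), key=lambda w: evaluate_mrr_at_10(*w))
print("best (cross, colbert, dense) weights:", best)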

MS Marco Passage Ranking Submission

We submitted a run to the MS Marco Passage Ranking leaderboard where we used targetHits 1K for the approximate nearest neighbor search,
so that 1K passages are re-ranked using the ColBERT model and finally 96 passages are re-ranked with the cross-encoder model.

Passage Ranking

Our multi-phase retrieval and ranking pipeline with 3 miniature models performed pretty well,
even beating large models using T5 with 3B parameters.
See MS Marco Passage Ranking Leaderboard.

Model                                   | Eval  | Dev
BM25 (Official baseline)                | 0.165 | 0.167
BM25 (Lucene8, tuned)                   | 0.190 | 0.187
Vespa dense + ColBERT + cross-attention | 0.393 | 0.403

Multi-threaded retrieval and ranking

Vespa has the ability to use multiple threads per search query.
This ability can reduce search latency as the document retrieval and ranking
for a single query can be partitioned, so that each thread works on a subset of the searchable documents in an index.
The number of threads to use is controlled on a per rank profile basis,
but cannot exceed the global setting controlled in the application’s services.xml.

To find optimal settings, we recommend benchmarking starting with one thread per search and increasing until latency does not improve significantly.
See Vespa Scaling Guide for details.

Serving performance versus ranking accuracy

In this section we perform benchmarking where we deploy the system on a Vespa cloud instance using
2 x Xeon Gold 6263CY 2.60GHz (HT enabled, 48 cores, 96 threads) with 256GB memory.

We use a single content node indexing the 9M passages.
All query encodings with the MiniLM-based query encoders, retrieval and re-ranking are performed on this content node.
We also use 2 stateless container nodes with 16 v-cpu each to make sure that we are benchmarking the content node performance.
See Vespa overview on
stateless container nodes versus content nodes.

Running everything of importance on the same node enables us to quantitatively compare the performance of the methods we have introduced in this blog post series.
We benchmark throughput per retrieval and ranking model until we reach about 70% cpu utilization,
and compare obtained throughput and latency. We also include tail latency (99.9 percentile) in the reported result.

We use the vespa-fbench benchmarking utility to
load the cluster (by increasing the number of clients to reach about 70% cpu utilization).

Pre-trained models on Vespa Cloud

UPDATE 2023-06-06: use new syntax to configure Bert embedder.

Decorative image

“searching data using pre-trained models, unreal engine high quality render, 4k, glossy, vivid_colors, intricate_detail” by Stable Diffusion

Vespa can now convert text to embeddings for you automatically,
if you don’t want to bring your own vectors – but you still need to provide the ML models to use.

On Vespa Cloud we’re now making this even simpler, by also providing pre-trained models you can use for such tasks.
To take advantage of this, just pick the models you want from
cloud.vespa.ai/en/model-hub and refer
to them in your application by supplying a model-id where you would otherwise use path or url. For example:

<component id="myEmbedderId" type="bert-embedder">
    <transformer-model model-id="minilm-l6-v2"/>
    <tokenizer-vocab model-id="bert-base-uncased"/>
</component>

You can deploy this to Vespa Cloud to have these models do their job in your application –
no need to include a model in your application and wait for it to be uploaded.

You can use these models both in configurations provided by Vespa, as above, and in your own components,
with your own configurations – see the documentation for details.

We’ll grow the set of models available over time, but the models we provide on Vespa Cloud will always be an
exclusive selection of models that we think are beneficial to use in real applications,
both in terms of performance and model quality.

We hope this will empower many more teams to leverage modern AI in their production use cases.

Simplify Search with Multilingual Embedding Models

Decorative image

Photo by Bruno Martins on Unsplash

This blog post shows how to represent a robust
multilingual embedding model of the E5 family in Vespa. We also
demonstrate how to evaluate the model’s effectiveness on multilingual
information retrieval (IR) datasets.

Introduction

The fundamental concept behind embedding models is transforming
textual data into a continuous vector space, wherein similar items
are brought close together and dissimilar ones are pushed
farther apart. Mapping multilingual texts into a unified vector
embedding space makes it possible to represent and compare queries
and documents from various languages within this shared space.

multilingual embedding model

Meet the E5 family.

Researchers from Microsoft introduced the E5 family of text embedding
models in the paper Text Embeddings by Weakly-Supervised Contrastive
Pre-training. E5 is short for
EmbEddings from bidirEctional Encoder rEpresentations. The same researchers
have also published the model weights on the Huggingface model hub
under a permissive MIT license. There are three
multilingual E5 embedding model variants with different model sizes
and embedding dimensionality. All three models are initialized from
pre-trained transformer models with trained text vocabularies that
handle up to 100 languages.

This model is initialized from
xlm-roberta-base and
continually trained on a mixture of multilingual datasets. It
supports 100 languages from xlm-roberta, but low-resource languages
may see performance degradation.

Similarly, the E5 embedding model family includes three variants
trained only on English datasets.

Choose your E5 Fighter

The embedding model variants allow developers to trade effectiveness
versus serving related costs. Embedding model size and embedding dimensionality
impact task accuracy, model inference, nearest
neighbor search, and storage cost.

These serving-related costs are all roughly linear with model size
and embedding dimensionality. In other words, using an embedding
model with 768 dimensions instead of 384 increases embedding storage
by 2x and nearest neighbor search compute by 2x. Accuracy, however,
is not nearly linear, as demonstrated on the MTEB
leaderboard.

The nearest neighbor search for embedding-based retrieval could be
accelerated by introducing approximate algorithms like
HNSW. HNSW
significantly reduces distance calculations at query time but also
introduces degraded retrieval accuracy because the search is
approximate. Still, the same linear relationship between embedding
dimensionality and distance compute complexity holds.

Model | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets)
Small | 384            | 118              | 57.87                          | 46.64
Base  | 768            | 278              | 59.45                          | 48.88
Large | 1024           | 560              | 61.5                           | 51.43

Comparison of the E5 multilingual models. Accuracy numbers from the MTEB
leaderboard.

Do note that the datasets included in MTEB are biased towards English
datasets, which means that the reported retrieval performance might
not match up with observed accuracy on private datasets, especially
for low-resource languages.

Representing E5 embedding models in Vespa

Vespa’s vector search and embedding inference support allows
developers to build multilingual semantic search applications without
managing separate systems for embedding inference and vector search
over the multilingual embedding representations.

In the following sections, we use the small E5 multilingual variant,
which gives us reasonable accuracy for a much lower cost than the
larger sister E5 variants. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure
complexity.

Exporting E5 to ONNX format for accelerated model inference

To export the embedding model from the Huggingface model hub to
ONNX format for inference in Vespa, we can use the
Optimum library:

$ optimum-cli export onnx --task sentence-similarity -m intfloat/multilingual-e5-small multilingual-e5-small-onnx

The above optimum-cli command exports the HF model to ONNX format that can be imported
and used with the Vespa Huggingface
embedder.
Using the Optimum generated ONNX file and tokenizer configuration
file, we configure Vespa with the following in the Vespa application
package
services.xml
file.

<component id="e5" type="hugging-face-embedder">
  <transformer-model path="model/multilingual-e5-small.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
</component>

That’s it! These two simple steps are all we need to start using the multilingual
E5 model to embed queries and documents with Vespa.

Using E5 with queries and documents in Vespa

The E5 family uses text instructions mixed with the input data to
separate queries and documents. Instead of having two different
models for queries and documents, the E5 family separates queries
and documents by prepending the input with “query:” or “passage:”.

schema doc {
  document doc  {
    field title type string { .. }
    field text type string { .. }
  }
  field embedding type tensor<float>(x[384]) {
    indexing {
      "passage: " . input title . " " . input text | embed | attribute
    }
  }
}

The above Vespa schema language
uses the embed indexing
language
functionality to invoke the configured E5 embedding model, using a
concatenation of the “passage: “ instruction, the title, and
the text. Notice that the embedding tensor
field defines the embedding dimensionality (384).

The above schema uses a single vector
representation per document. With Vespa multi-vector
indexing,
it’s also possible to represent and index multiple vector representations
for the same tensor field.

Similarly, on the query, we can embed the input query text with the
E5 model, now prepending the input user query with “query: “

{
  "yql": "select ..",
  "input.query(q)": "embed(query: the query to encode)", 
}

Evaluation

To demonstrate how to evaluate multilingual embedding models, we
evaluate the small E5 multilingual variant on three information
retrieval (IR) datasets. We use the classic trec-covid dataset, a
part of the BEIR benchmark,
that we have written about in blog
posts
before. We also include two languages from the
MIRACL (Multilingual Information
Retrieval Across a Continuum of Languages) datasets.

All three datasets use
NDCG@10 to
evaluate ranking effectiveness. NDCG is a ranking metric that is
precision-oriented and handles graded relevance judgments.

Dataset         | Included in E5 fine-tuning     | Language | Documents | Queries | Relevance Judgments
BEIR:trec-covid | No                             | English  | 171,332   | 50      | 66,336
MIRACL:sw       | Yes (the train split was used) | Swahili  | 131,924   | 482     | 5,092
MIRACL:yo       | No                             | Yoruba   | 49,043    | 119     | 1,188

IR dataset characteristics

We consider both BEIR:trec-covid and MIRACL:yo as out-of-domain datasets,
as E5 has not been trained or fine-tuned on them since they don’t
contain any training split. Applying E5 to out-of-domain datasets
is called zero-shot, as no training examples (shots) are available.

The Swahili dataset could be categorized as an in-domain dataset
as E5 has been trained on the train split of the dataset. All three
datasets have documents with titles and text
fields. We use the concatenation strategy described in previous sections, inputting both title
and text to the embedding model.

We evaluate the E5 model using exact nearest neighbor
search
without HNSW indexing,
and all experiments are run on an M1 Pro (arm64) laptop using the
open-source Vespa container
image. We contrast
the E5 model results with Vespa BM25.

Dataset         | BM25   | Multilingual E5 (small)
MIRACL:sw       | 0.4243 | 0.6755
MIRACL:yo       | 0.6831 | 0.4187
BEIR:trec-covid | 0.6823 | 0.7139

Retrieval effectiveness for BM25 and E5 small (NDCG@10)

For BEIR:trec-covid, we also evaluated a hybrid combination of E5
and BM25, using a linear combination of the two scores, which lifted
NDCG@10 to 0.7670. This aligns with previous findings, where hybrid
combinations
outperform
each model used independently.
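As an illustration, the sketch below shows one way to linearly combine BM25 and E5 scores per query; the min-max normalization and the alpha weight are assumptions, not the exact setup behind the 0.7670 number:

def normalize(scores):
    # Min-max normalize a {doc_id: score} mapping into the [0, 1] range
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / ((hi - lo) or 1.0) for doc, s in scores.items()}

def hybrid(bm25_scores, e5_scores, alpha=0.5):
    # Linear interpolation of the two normalized score lists
    bm25_n, e5_n = normalize(bm25_scores), normalize(e5_scores)
    docs = set(bm25_n) | set(e5_n)
    return {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * e5_n.get(d, 0.0) for d in docs}

bm25 = {"doc1": 12.3, "doc2": 9.1, "doc3": 4.4}
e5 = {"doc1": 0.81, "doc2": 0.88, "doc4": 0.75}
print(sorted(hybrid(bm25, e5).items(), key=lambda kv: -kv[1]))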

Summary

As demonstrated in the evaluation, multilingual embedding models
can enhance and simplify building multilingual search applications
and provide a solid baseline. Still, as we can see from the evaluation
results, the simple and cheap Vespa BM25 ranking model outperformed
the dense embedding model on the MIRACL Yoruba queries.

This result can largely be explained by the fact that the model had not
been pre-trained on the language (low resource) or tuned for retrieval
with Yoruba queries or documents. This is another reminder of what
we wrote about in a blog post about improving zero-shot
ranking,
where we summarize with a quote from the BEIR paper, which evaluates
multiple models in a zero-shot setting:

In-domain performance is not a good indicator for out-of-domain
generalization. We observe that BM25 heavily underperforms neural
approaches by 7-18 points on in-domain MS MARCO. However, BEIR
reveals it to be a strong baseline for generalization and generally
outperforming many other, more complex approaches. This stresses
the point that retrieval methods must be evaluated on a broad range
of datasets.

In the next blog post, we will look at ways to make embedding
inference cheaper without sacrificing much retrieval effectiveness
by optimizing the embedding model. Furthermore, we will show how
to save 50% of embedding storage using Vespa’s support for bfloat16
precision instead of float, with close to zero impact on retrieval
effectiveness.

If you want to reproduce the retrieval results, or get started
with multilingual embedding search, check out
the new multilingual search sample application.

Representing BGE embedding models in Vespa using bfloat16

Decorative image

Photo by Rafael Drück on Unsplash

This post demonstrates how to use recently announced BGE (BAAI General Embedding)
models in Vespa. The open-sourced (MIT licensed) BGE models
from the Beijing Academy of Artificial Intelligence (BAAI) perform
strongly on the Massive Text Embedding Benchmark (MTEB
leaderboard). We
evaluate the effectiveness of two BGE variants on the
BEIR trec-covid dataset.
Finally, we demonstrate how Vespa’s support for storing and indexing
vectors using bfloat16 precision saves 50% of memory and storage
footprint with close to zero loss in retrieval quality.

Choose your BGE Fighter

When deciding on an embedding model, developers must strike a balance
between quality and serving costs.

Triangle of tradeoffs

These serving-related costs are all roughly linear with model
parameters and embedding dimensionality (for a given sequence
length). For example, using an embedding model with 768 dimensions
instead of 384 increases embedding storage by 2x and nearest neighbor
search compute by 2x.

Quality, however, is not nearly linear, as demonstrated on the MTEB
leaderboard.

Model        | Dimensionality | Model params (M) | Accuracy Average (56 datasets) | Accuracy Retrieval (15 datasets)
bge-small-en | 384            | 33               | 62.11                          | 51.82
bge-base-en  | 768            | 110              | 63.36                          | 53
bge-large-en | 1024           | 335              | 63.98                          | 53.9

A comparison of the English BGE embedding models — accuracy numbers from the MTEB
leaderboard. All
three BGE models outperform OpenAI ada embeddings with 1536
dimensions and unknown model parameters on MTEB.

In the following sections, we experiment with the small and base
BGE variants, which give us reasonable accuracy for a much lower
cost than the large variant. The small model inference complexity
also makes it servable on CPU architecture, allowing iterations and
development locally without managing GPU-related infrastructure
complexity.

Exporting BGE to ONNX format for accelerated model inference

To use the embedding model from the Huggingface model hub in Vespa
we need to export it to ONNX format. We can use
the Transformers Optimum
library for this:

$ optimum-cli export onnx --task sentence-similarity -m BAAI/bge-small-en --optimize O3 bge-small-en

This exports the small model with the highest optimization
level
usable for serving on CPU. We also quantize the optimized ONNX model
using onnxruntime quantization like
this.
Quantization (post-training) converts the float model weights (4
bytes per weight) to byte (int8), enabling faster inference on the
CPU. As demonstrated in this blog
post,
quantization accelerates embedding model inference by 2x on CPU with negligible
impact on retrieval quality.
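The sketch below shows what such post-training (dynamic) quantization can look like with onnxruntime; the file names are illustrative and the linked example may use different options:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bge-small-en/model.onnx",            # optimized float model from optimum-cli
    model_output="bge-small-en/model-quantized.onnx", # int8-weight model used for serving
    weight_type=QuantType.QInt8,
)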

Using BGE in Vespa

Using the Optimum generated ONNX model and
tokenizer files, we configure the Vespa Huggingface
embedder
with the following in the Vespa application
package
services.xml
file.

<component id="bge" type="hugging-face-embedder">
  <transformer-model path="model/model.onnx"/>
  <tokenizer-model path="model/tokenizer.json"/>
  <pooling-strategy>cls</pooling-strategy>
  <normalize>true</normalize>
</component>

BGE uses the CLS special token as the text representation vector
(instead of average pooling). We also specify normalization so that
we can use the prenormalized-angular distance
metric
for nearest neighbor search. See configuration
reference
for details.

With this, we are ready to use the BGE model to embed queries and
documents with Vespa.

Using BGE in Vespa schema

The BGE model family does not use instructions for documents like
the E5
family,
so we don’t need to prepend the input to the document model with
“passage: “ like with the E5 models. Since we configure the Vespa
Huggingface
embedder to
normalize the vectors, we use the optimized prenormalized-angular
distance-metric for the nearest neighbor search.

field embedding type tensor<float>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Note that the above does not enable HNSW
indexing, see
this
blog
post on the tradeoffs related to introducing approximative nearest
neighbor search. The small model embedding is configured with 384
dimensions, while the base model uses 768 dimensions.

field embedding type tensor<float>(x[768]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Using BGE in queries

The BGE model uses query instructions like the E5
family
that are prepended to the input query text. We prepend the instruction
text to the user query as demonstrated in the snippet below:

import requests

session = requests.Session()

query = 'is remdesivir an effective treatment for COVID-19'
body = {
        'yql': 'select doc_id from doc where ({targetHits:10}nearestNeighbor(embedding, q))',
        'input.query(q)': 'embed(Represent this sentence for searching relevant passages: ' + query +  ')', 
        'ranking': 'semantic',
        'hits' : '10' 
 }
response = session.post('http://localhost:8080/search/', json=body)

The BGE query instruction is Represent this sentence for searching
relevant passages:.
We are unsure why they chose such a long query instruction, as
it hurts efficiency since compute complexity is
quadratic
with sequence length.

Experiments

We evaluate the small and base models on the trec-covid test split
from the BEIR benchmark. We
concatenate the title and the abstract as input to the BGE embedding
models, as demonstrated in the Vespa schema snippets in the previous
section.

Dataset         | Documents | Avg document tokens | Queries | Avg query tokens | Relevance Judgments
BEIR trec_covid | 171,332   | 245                 | 50      | 18               | 66,336

Dataset characteristics; tokens are the number of language model
token identifiers (wordpieces)

All experiments are run on an M1 Pro (arm64) laptop with 8 v-CPUs
and 32GB of memory, using the open-source Vespa container
image. No GPU
acceleration and no need to manage CUDA driver compatibility, huge
container images due to CUDA dependencies, or forwarding host GPU
devices to the container.

Sample Vespa JSON
formatted
feed document (prettified) from the
BEIR trec-covid dataset:

{
  "put": "id:miracl-trec:doc::wnnsmx60",
  "fields": {
    "title": "Managing emerging infectious diseases: Is a federal system an impediment to effective laws?",
    "text": "In the 1980's and 1990's HIV/AIDS was the emerging infectious disease. In 2003\u20132004 we saw the emergence of SARS, Avian influenza and Anthrax in a man made form used for bioterrorism. Emergency powers legislation in Australia is a patchwork of Commonwealth quarantine laws and State and Territory based emergency powers in public health legislation. It is time for a review of such legislation and time for consideration of the efficacy of such legislation from a country wide perspective in an age when we have to consider the possibility of mass outbreaks of communicable diseases which ignore jurisdictional boundaries.",
    "doc_id": "wnnsmx60",
    "language": "en"
  }
}

Evaluation results

Model        | Model size (MB) | NDCG@10 BGE | NDCG@10 BM25
bge-small-en | 33              | 0.7395      | 0.6823
bge-base-en  | 104             | 0.7662      | 0.6823

Evaluation results for quantized BGE models.

We contrast both BGE models with the unsupervised
BM25 baseline from
this blog
post.
Both models perform better than the BM25 baseline
on this dataset. We also note that the NDCG@10 numbers obtained
with Vespa are slightly better than those reported on the MTEB leaderboard
for the same dataset. We can also observe that the base model
performs better on this dataset, but is also 2x more costly due to
the size of the embedding model and the embedding dimensionality. The
bge-base model inference could benefit from GPU
acceleration
(without quantization).

Using bfloat16 precision

We evaluate using
bfloat16
instead of float for the tensor representation in Vespa. Using
bfloat16 instead of float reduces memory and storage requirements
by 2x since bfloat16 uses 2 bytes per embedding dimension instead
of 4 bytes for float. See Vespa tensor values
types.

We do not change the type of the query tensor. Vespa will take care
of casting the bfloat16 field representation to float at search
time, allowing CPU acceleration of floating point operations. The
cast operation does come with a small cost (20-30%) compared with
using float, but the saving in memory and storage resource footprint
is well worth it for most use cases.

field embedding type tensor<bfloat16>(x[384]) {
    indexing: input title . " " . input text | embed | attribute
    attribute {
      distance-metric: prenormalized-angular
    }
}

Using bfloat16 instead of float for the embedding tensor.

Model        | NDCG@10 bfloat16 | NDCG@10 float
bge-small-en | 0.7346           | 0.7395
bge-base-en  | 0.7656           | 0.7662

Evaluation results for BGE models – float versus bfloat16 document representation.

By using bfloat16 instead of float to store the vectors, we save
50% of the memory cost and can store 2x more embeddings per instance
type, with almost zero impact on retrieval quality.

Summary

Using the open-source Vespa container image, we’ve explored the
recently announced strong BGE text embedding models with embedding
inference and retrieval on our laptops. The local experimentation
eliminates prolonged feedback loops.

Moreover, the same Vespa configuration files suffice for many
deployment scenarios, whether in on-premise setups, on Vespa Cloud,
or locally on a laptop. The beauty is that dedicated
infrastructure for managing embedding inference and nearest neighbor
search as separate infra systems becomes obsolete with Vespa’s
native embedding
support.

If you are interested in learning more about Vespa, see Vespa Cloud – getting started
or self-serve Vespa – getting started.
Got questions? Join the Vespa community in Vespa Slack.