Pretrained Transformer Language Models for Search – part 1


Photo by Jamie Street
on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multi-phase retrieval and ranking pipeline using Vespa.
We also evaluate these models on the largest Information Retrieval (IR) relevance dataset, namely the MS Marco Passage ranking dataset.
Furthermore, we demonstrate how to achieve close to state-of-the-art ranking using miniature transformer models with just 22M parameters,
beating large ensemble models with billions of parameters.

Blog posts in this series:

In this first post we give an introduction to Transformers for text ranking and three different methods of applying them for ranking.
We also cover multi-phase retrieval and ranking pipelines, and introduce three different ways to efficiently retrieve
documents in a phased retrieval and ranking pipeline.


Since BERT was first applied to search and document ranking, we at the Vespa team have been busy making it easy to use BERT, or Transformer models in general, for ranking and question answering with Vespa. In previous work,
we demonstrated how to use BERT as a representation model (bi-encoder) for efficient passage retrieval for question answering.
We also demonstrated how we could accelerate BERT models for production serving using distillation and quantization.

Search, or information retrieval, is going through a neural paradigm shift; some have even called it the BERT revolution.
The introduction of pre-trained language models such as BERT has led to a significant advancement of the state of the art in search and document ranking.


The table shows how significant the advancement was when first applied to the MS MARCO Passage Ranking leaderboard. The state-of-the-art on MS Marco passage ranking advanced by almost 30% within a week,
while improvements up until then had been incremental at best.
Compared to the baseline BM25 text ranking (default Apache Lucene 9 text scoring), applying BERT improved the ranking effectiveness by more than 100%.

The table above is from Pretrained Transformers for Text Ranking: BERT and Beyond, which is a brilliant resource
for understanding how pre-trained transformer models can be used for text ranking.
The MS MARCO Passage ranking relevancy dataset consists of about 8.8M passages, and more than 500 000 queries with at least one judged relevant document.
It is by far the largest IR dataset available in the public domain and is commonly used to evaluate ranking models.

The MS Marco passage ranking dataset queries are split into three subsets: train, development (dev) and test (eval). The train split can be used to train a ranking model using machine learning. Once a model is built, one can test the effectiveness of the ranking model on the development and test splits. Applying the learned model on the development and test sets is called in-domain usage of the model. If the trained ranking model is applied to a different relevancy dataset, it’s usually referred to as out-of-domain usage, or zero-shot. How well models trained on MS Marco query and passage pairs generalize to other domains is out of scope for this blog post, but we can sincerely recommend BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

The official evaluation metric used for the MS Marco Passage ranking leaderboard is MRR@10. The name might sound scary, but it’s
a simple way to judge the effectiveness of a ranking algorithm. RR@10 is the Reciprocal Rank of the first relevant passage within the top 10 ranking positions for a given query. @k denotes how deep into the ranked list we look for the first relevant document.
The reciprocal rank formula is simply 1/(position of the first relevant hit).
If the judged relevant hit (as judged by a human) is ranked at position 1 the reciprocal rank score is 1. If the relevant hit is found at position 2 the reciprocal rank score is 0.5 and so on. The mean in mean reciprocal rank is simply the mean RR over all queries in the dev or test split which gives us an overall score.
The MS Marco passage ranking development (dev) set consists of 6,980 queries.
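
The MRR@10 definition above can be sketched in a few lines of Python. This is a minimal illustration, not the official MS Marco evaluation script; the function names and the toy rankings and judgments are made up for the example, and a query with no relevant passage in the top 10 is assumed to contribute a reciprocal rank of 0.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """Return 1/position of the first relevant id within the top k, else 0.0."""
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, judgments, k=10):
    """Average RR@k over all judged queries.

    rankings:  {query_id: [doc_id, ...]} ranked best first
    judgments: {query_id: {relevant_doc_id, ...}}
    """
    scores = [
        reciprocal_rank(rankings.get(query_id, []), relevant, k)
        for query_id, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical passage ids): the relevant hit is at position 1
# for q1, at position 2 for q2, and missing from the ranking for q3.
rankings = {"q1": [7, 3, 9], "q2": [4, 8, 2], "q3": [5, 6, 1]}
judgments = {"q1": {7}, "q2": {8}, "q3": {42}}
mrr = mean_reciprocal_rank(rankings, judgments)  # (1 + 0.5 + 0) / 3
```

For the dev split, the same average would be taken over all 6,980 queries.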

The query relevance judgment list for the development (dev) set is in the public domain. Researchers can compare methods based on this.
The judgements for the eval query set are not in the public domain. Researchers, or industry practitioners, need to submit their ranking for the queries in the test set to have the MRR@10 evaluated and the ranking run listed on the leaderboard. Note that the MS Marco ranking leaderboards are not run time constrained, so many of the submissions take days of computation to produce ranked lists for the queries in the dev and eval splits.

There is unfortunately a lot of confusion in the industry on how BERT can successfully be used for text ranking.
The IR research field has moved so fast since the release of BERT in late 2018 that the textbooks on text ranking are already outdated.
Since there is no textbook, industry practitioners need to look at how the research community is applying BERT or Transformer models for ranking.
BERT is a pre-trained language model, and to use it effectively for document or passage ranking, it needs to be fine-tuned for retrieval or ranking.
For examples of not so great ways to use BERT for ranking, see How not to use BERT for Document ranking.

As demonstrated in Pretrained Transformers for Text Ranking: BERT and Beyond, pre-trained language models of the Transformer family achieve best accuracy for text ranking and question answering tasks when used as an interaction model with all-to-all cross-attention between the query and document.
Generally, there are 3 ways to use Transformer models for text ranking and all of them require training data to fine tune for retrieval or ranking.


Figure from ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT illustrating various deep neural networks for ranking. In the following section we give an overview of three of these methods using Transformer models.

Representation based ranking using Transformer models

It is possible to use the Transformer model as the underlying deep neural network for representation based learning. Given training data, one can learn a representation of documents and queries so that relevant documents are closer or more similar in this representation than irrelevant documents. The representation based ranking approach falls under the broad representation learning research field and representation learning can be applied to text, images, videos or even combinations (multi-modal representations).
For text ranking, queries and documents are embedded into a dense embedding space using one or two Transformer models (bi-encoders).
The embedding representation is learned by the training examples given to the model. Once the model has been trained, one can pre-compute the embeddings for all documents, and
at query time, use nearest neighbor search for efficient document retrieval.


Document retrieval using dense embedding models is commonly referred to as dense retrieval.
The query and the document are embedded independently, and during training the model is given examples of the form (query, relevant passage, negative passage).
The weights of the model(s) are adjusted per batch of training triplets.
The embedding representation from the Transformer model could be based on for example the CLS token of BERT (Classification Token), or using a pooling strategy over the last Transformer layer.
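
The two pooling strategies mentioned above can be illustrated with plain Python lists standing in for the last-layer token outputs. This is a sketch under stated assumptions: the helper names are hypothetical, the CLS token is assumed to be the first token of the sequence, and the attention mask used to skip padding positions is an assumption that is not discussed in the text.

```python
def cls_pooling(token_embeddings):
    """Use the output vector of the first ([CLS]) token as the text embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the output vectors of all non-padding tokens."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Toy last-layer output: 4 token positions with 2 dimensions each.
# The last position is padding and is masked out.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vector = cls_pooling(token_embeddings)
mean_vector = mean_pooling(token_embeddings, attention_mask)
```

Either way, the result is a single fixed-size vector per query or document, which is what makes nearest neighbor search over precomputed document embeddings possible.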

The huge benefit of using representation based similarity on top of Transformer models is that the document representation can be produced offline by encoding the documents through the trained transformer. Unless the model changes, this only needs to be done once, when indexing the document. At online serving time, the serving system only needs to obtain the query embedding by running the query through the transformer model, and then use the resulting query embedding vector as the input to a nearest neighbor search in the dense embedding space to find relevant documents. On the MS Marco Passage ranking set, dense retrieval using a learned representation has demonstrated good results over the last year or so. Dense retrievers achieve much better accuracy (MRR@10 and Recall@1000) than traditional sparse search using exact lexical matching (e.g. BM25), and the current state-of-the-art uses a dense retriever as the first phase candidate selection for re-ranking using a more sophisticated (and computationally expensive) all-to-all interaction model.

Since the query is usually short, the online encoding complexity is relatively low and encoding latency is acceptable even on a CPU serving stack. Transformer models with full all-to-all cross attention have run time complexity that is quadratic in the input sequence length, so the shorter the input sequence, the better the performance. Most online serving systems can also cache the query embedding representation to save computations and reduce latency.

All to all interaction ranking using Transformers

The “classic” way to use BERT for ranking is to use it as an all-to-all interaction model where both the query and the document are fed through the Transformer model simultaneously, and not independently as with the representation based ranking model. For BERT this is usually accomplished with a classification layer on top of the CLS token output, and the ranking task is converted into a classification task where one classifies whether the document is relevant for the query or not (binary classification). This approach is called monoBERT, vanilla BERT, or BERT cat (categorization). It’s a straightforward approach and in line with the suggestions of the original BERT paper for how to use BERT for task specific fine tuning.


Similar to the representation model, all-to-all interaction models need to be trained on triplets, and the way we sample the negative (irrelevant) examples is important for the overall effectiveness of the model. The first BERT submission to the MS Marco passage ranking used monoBERT to re-rank the top 1K documents from a more efficient sparse first phase retriever (BM25).

With all to all interaction there is no known way to efficiently pre-compute the document representation offline. Running online inference with cross-attention models over all documents in a collection is computationally prohibitively expensive even for large organizations like Google or Microsoft, so to deploy it for production one needs a way to reduce the number of candidate documents which are fully evaluated using the all to all cross attention model. This has led to increased interest in multi-stage retrieval and ranking architectures but also more efficient Transformer models without quadratic complexity due to the cross attention mechanisms (all to all attention).

Late Interaction using Transformers

An alternative approach for using Transformers for ranking was suggested in ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.

Unlike the all-to-all query document interaction model, the late contextualized interaction over BERT enables processing the documents offline, since the per document token contextual embedding is generated independently of the query tokens. The embedding outputs of the last Transformer layer are calculated at document indexing time and stored in the document. For a passage of 100 tokens we end up with 100 embedding vectors of dimensionality n, where n is a tradeoff between ranking accuracy and storage (memory) footprint. The dimensionality does not necessarily need to be the same as the transformer model’s hidden size. Using 32 dimensions per token embedding gives almost the same accuracy as the larger 768 dimensions of BERT base. Similarly, one can use low precision like float16 or quantization (int8) to reduce the memory requirements per dimension.


The ability to run the document through the model and obtain the per term contextual embeddings offline significantly speeds up online query evaluation, since at query time one only needs to do one pass through the Transformer model with the query to obtain the contextual query term embeddings, and then calculate the proposed MaxSim operator over the pre-computed per term contextualized embeddings for the documents we want to re-rank.

Similar to the pure representation based model, we only need to encode the query through the transformer model at query time. The query tokens only attend to other query tokens, and similarly document tokens only attend to other document tokens.

As demonstrated in the

Pretrained Transformer Language Models for Search – part 2


Photo by Rob Fuller on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset. We demonstrate how to achieve close to state of the art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

In the first post in this series we introduced using pre-trained models for ranking. In this second post we study efficient candidate retrievers which can be used to efficiently find candidate documents which are re-ranked using more advanced models.

Multiphase retrieval and ranking

Due to computational complexity of cross interaction transformer models there has been renewed interest in multiphase retrieval and ranking. In a multiphased retrieval and ranking pipeline, the first phase retrieves candidate documents using a cost efficient retrieval method and the more computationally complex cross-attention or late interaction model inference is limited to the top ranking documents from the first phase.


An illustration of a multi-stage retrieval and ranking architecture is given in the figure above. The illustration is from Phased ranking with Vespa. The three phases illustrated in the diagram are per content node, where each node retrieves and re-ranks a subset of the total document volume. In addition, one can also re-rank the global top scoring documents after the results from the nodes involved in the query are merged to find the global best documents. This step might also involve diversification of the result set before final re-ranking.

Broadly, there are two categories of efficient sub-linear retrieval methods:

  • Sparse retrieval using lexical term matching over inverted indexes, potentially accelerated by the WAND algorithm
  • Dense retrieval using dense vector representation of queries and documents, potentially accelerated by approximate nearest neighbor search algorithms

In the next sections we take a deep dive into these two methods and we also evaluate their effectiveness on the MS Marco Passage Ranking relevancy dataset. We also show how these
two methods can be combined with Vespa.

Sparse lexical retrieval

Classic information retrieval (IR) relies on lexical matching, which has been around since the early days of Information Retrieval. One example of a popular lexical retrieval scoring function is BM25. Retrieval can be done in sub-linear time using inverted indexes and accelerated by dynamic pruning algorithms like WAND. Dynamic pruning algorithms avoid exhaustively scoring all documents which match at least one of the query terms. In the Vespa document schema below we declare a minimal passage document type which we can use to index the MS Marco Passage ranking dataset introduced in post 1.

search passage {
  document passage {
    field text type string {
      indexing: summary | index
      # needed for the bm25(text) rank feature
      index: enable-bm25
    }
    field id type int {
      indexing: summary | attribute
    }
  }
  fieldset default {
    fields: text
  }
  rank-profile bm25 {
    first-phase {
      expression: bm25(text)
    }
  }
}

We define a text field which we populate with the passage text. The indexing directive controls how the field is handled. Here, summary means that the text should be returned in the search result page, and index specifies that we want to build inverted index data structures for efficient search and matching. We also define a ranking profile with only a single ranking phase using the Vespa bm25(name) text ranking feature, one of many built-in Vespa text matching ranking features.

Once we have indexed our data we can search using the Vespa HTTP POST query API:

  {
    "yql": "select id,text from passage where userQuery();",
    "hits": 10,
    "query": "is cdg airport in main paris?",
    "ranking.profile": "bm25",
    "type": "all"
  }

  • The yql parameter is the Vespa query language, userQuery() is a reference to the query parameter
  • The hits parameter controls the number of hits in the Vespa response
  • The query parameter contains the free text input query from the end user. See Simple query language
  • The ranking.profile parameter chooses the ranking profile to use for the query
  • The type specifies the query type (all, any, phrase) which controls the boolean query logic. All requires that all query terms are found in the document while any specifies at least one of the query terms should match in the document.

If we use the above query to search the MS Marco Passages we end up ranking only 2 passages and the query takes 7 ms. If we change type to any instead of all we end up ranking 7,926,256 passages (89% of the total collection) and the query takes 120 ms. Exact timing obviously depends on hardware and the number of threads used to evaluate the query, but the main point is that brute force matching of all documents which contain at least one term is expensive, while restricting to all is too restrictive and fails to recall the relevant documents. So what is the solution to this problem? How can we find the relevant documents without having to fully score almost all passages in the collection?

Meet the dynamic pruning algorithm WAND

The WAND algorithm is described in detail in Efficient Query Evaluation using a Two-Level Retrieval Process (PDF):

We have determined that our algorithm significantly reduces the total number of full evaluations by more
than 90%, almost without any loss in precision or recall.
At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND or
Weak AND that might be of independent interest

Vespa implements WAND as a query operator. Below is an example of how to use it with our query example from above:

  {
    "yql": "select id, text from passage where ([{\"targetNumHits\": 10}]weakAnd(default contains \"is\", default contains \"cdg\", default contains \"airport\", default contains \"in\", default contains \"main\", default contains \"paris\"));",
    "hits": 10,
    "ranking.profile": "bm25"
  }

The above WAND query fully ranks only 2409 passages using the bm25 ranking profile, and recall at the first positions is the same as with brute force any,
so we did not lose any accuracy but saved a lot of resources.
Using the weakAnd operator, the query takes 12 ms instead of 120 ms with brute force any.
WAND is best used through a custom searcher plugin, to avoid tokenization outside of Vespa which might introduce asymmetric behaviour,
for example RetrievalModelSearcher, or by using weakAnd.replace, which rewrites type any queries to use WAND instead.
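
For quick experiments, the weakAnd query string shown earlier can also be assembled on the client side. The sketch below is only an illustration: the weak_and_yql helper is hypothetical, and its regex tokenization is a simplifying assumption that does not match Vespa's linguistic processing, which is exactly the asymmetry a searcher plugin avoids.

```python
import json
import re


def weak_and_yql(query_text, target_hits=10, field="default"):
    """Build a weakAnd YQL string from free text.

    Assumption: terms are lowercased runs of word characters. Vespa's own
    tokenization may differ, so treat this as an experiment aid only.
    """
    terms = re.findall(r"\w+", query_text.lower())
    clauses = ", ".join(f'{field} contains "{term}"' for term in terms)
    annotation = json.dumps({"targetNumHits": target_hits})
    return f"select id, text from passage where ([{annotation}]weakAnd({clauses}));"


# Request body mirroring the weakAnd example in this post.
request_body = {
    "yql": weak_and_yql("is cdg airport in main paris?"),
    "hits": 10,
    "ranking.profile": "bm25",
}
print(json.dumps(request_body, indent=2))
```

Serializing the body with json.dumps takes care of escaping the inner double quotes.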

There are two WAND/WeakAnd implementations in Vespa. In the above example we used weakAnd(), which fully integrates with text processing (tokenization and index statistics like IDF (Inverse Document Frequency)). The alternative is wand(), where the end user can control the query and document side weights explicitly. The wand() operator can be used to implement DeepCT and HDCT: Context-Aware Term Importance Estimation For First Stage Retrieval, as Vespa gives the user full control of query and document term weighting without having to bloat the regular index by repeating terms to increase or lower the term frequency. Read more in Using WAND with Vespa.

Dense Retrieval using bi-encoders over Transformer models

Embedding based models embed or map queries and documents into a latent low dimensional dense embedding vector space and use vector search to retrieve documents. Dense retrieval can be accelerated by using approximate nearest neighbor search, for example by indexing the document vector representation using HNSW graph indexing. In-domain dense retrievers based on a bi-encoder architecture trained on MS Marco passage data have demonstrated that they can outperform sparse lexical retrievers by a large margin. Let us introduce using dense retrievers with Vespa.

In this example we use a pre-trained dense retriever model from Huggingface 🤗, sentence-transformers/msmarco-MiniLM-L-6-v3. The model is based on MiniLM and the output layer has 384 dimensions. The model has just 22.7M trainable parameters, and encoding the query using a quantized model takes approximately 8 ms on CPU. The original model uses mean pooling over the last layer of the MiniLM model, but we also add an L2 normalization to normalize vectors to unit length (1) so that we can use the innerproduct distance metric instead of the angular distance metric. This saves computations during the approximate nearest neighbor search.
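
Why the normalization helps can be checked with a small sketch: once both vectors have unit length, their inner product equals their cosine similarity, so the norms no longer need to be computed during the search. The helper names and toy 2-dimensional vectors below are made up for illustration, and the vectors are assumed to be non-zero.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity, computing both norms explicitly."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-dimensional stand-ins for the 384-dimensional embeddings.
query = [3.0, 4.0]
passage = [1.0, 2.0]

query_unit = l2_normalize(query)
passage_unit = l2_normalize(passage)

# With unit length vectors the plain inner product gives the cosine score.
inner_product_score = dot(query_unit, passage_unit)
cosine_score = cosine(query, passage)
```

Since the document vectors are normalized once at indexing time, the per-candidate work at query time reduces to a single inner product.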

We expand our passage document type with a dense tensor field mini_document_embedding and a new ranking profile.

search passage {
  document passage {
    field text type string {
      indexing: summary | index
      # needed for the bm25(text) rank feature
      index: enable-bm25
    }
    field mini_document_embedding type tensor<float>(d0[384]) {
      indexing: attribute | index
      attribute {
        distance-metric: innerproduct
      }
      index {
        hnsw {
          max-links-per-node: 32
          neighbors-to-explore-at-insert: 500
        }
      }
    }
    field id type int {
      indexing: summary | attribute
    }
  }
  fieldset default {
    fields: text
  }
  rank-profile bm25 {
    first-phase {
      expression: bm25(text)
    }
  }
  rank-profile dense {
    first-phase {
      expression: closeness(field, mini_document_embedding)
    }
  }
}

The mini_document_embedding tensor is dense (denoted by d0[384]) and is of dimensionality 384 (determined by the Transformer model we use, and possible linear dimension reduction). We use float resolution (4 bytes) for the tensor cell values (other valid choices are double, bfloat16 and int8). We also define an HNSW index for the field, and we set 2 HNSW indexing parameters which control an accuracy versus performance tradeoff. See HNSW for details. Accuracy is typically measured by recall@k, comparing brute force nearest neighbor search with the approximate nearest neighbor search at level k. The dense ranking profile specifies how we want to rank (or actually re-rank) our documents; in this case we use the closeness ranking feature. Documents close to the query in the embedding space are ranked higher than documents which are far away. At indexing time we need to convert the passage text into the dense vector representation and index it. At query time, we need to encode the query and use approximate nearest neighbor search:

  {
    "yql": "select id, text from passage where [{\"targetNumHits\": 10}]nearestNeighbor(mini_document_embedding, query_embedding);",
    "hits": 10,
    "query": "is cdg airport in main paris?",
    "ranking.profile": "dense",
    "ranking.features.query(query_embedding)": [0.08691329, -0.046273664, -0.010773866,..,..]
  }

In the above example we use the Vespa nearestNeighbor query operator to retrieve the 10 closest documents in embedding space for the input query embedding vector passed in the ranking.features.query(query_embedding) parameter. In this example, query encoding (the forward pass of the query through the model to obtain the query embedding) is done outside Vespa, but we can also represent the query encoding model inside Vespa, which avoids complicating our online serving deployment setup.

Representing the bi-encoder model inside Vespa

To represent the bi-encoder query model in Vespa we need to export the Huggingface PyTorch model into ONNX format for efficient serving in Vespa.
We include a notebook in this sample application which demonstrates how to transform the model and export it to ONNX format.
Vespa supports evaluating ONNX models for ranking and query encoding.
To speed up evaluation on CPU we use a quantized (int) version.
We have demonstrated how to represent query encoders in Dense passage retrieval with nearest neighbor search.

Hybrid Dense Sparse Retrieval

Recent research indicates that combining dense and sparse retrieval could improve the recall, see for example A Replication Study of Dense Passage Retriever. The hybrid approach combines dense and sparse retrieval but requires search technology which supports both sparse lexical and dense retrieval. Vespa supports hybrid retrieval in the same query by combining the WAND and ANN algorithms. There are two ways to do this:

Disjunction (OR)

  {
    "yql": "select id, text from passage where ([{\"targetNumHits\": 10}]nearestNeighbor(mini_document_embedding, query_embedding)) or ([{\"targetNumHits\": 10}]weakAnd(default contains \"is\"...));",
    "hits": 10,
    "query": "is cdg airport in main paris?",
    "ranking.profile": "hybrid",
    "ranking.features.query(query_embedding)": [0.08691329, -0.046273664, -0.010773866,..,..]
  }

In the above example we combine ANN with WAND using OR (disjunction), and we have a hybrid ranking profile which can combine the dense and sparse ranking signals (e.g. bm25 and vector distance/closeness). Approximately 10 + 10 documents will be exposed to the first-phase ranking function (depending on targetNumHits). It is then up to the first-phase ranking expression to combine the scores of these two different retrieval methods into a final score. See A Replication Study of Dense Passage Retriever for examples of parameters and weighting. For example, it could look something like this:

rank-profile hybrid {
  first-phase {
    expression: 0.7*bm25(text) + 2.9*closeness(field, mini_document_embedding)
  }
}
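
To get a feel for how such a linear combination orders candidates, here is a small Python sketch of the same formula. The weights are the illustrative ones from the expression above; the candidate names and scores are invented, and the sketch assumes that a signal missing for a document simply contributes 0.

```python
def hybrid_score(bm25_score, closeness_score, bm25_weight=0.7, closeness_weight=2.9):
    """Linear combination of a sparse (bm25) and a dense (closeness) signal."""
    return bm25_weight * bm25_score + closeness_weight * closeness_score


# Hypothetical candidates: (bm25 score, closeness score).
candidates = {
    "lexical_only": (12.0, 0.0),  # assumed to have no dense signal
    "dense_only": (0.0, 0.8),     # assumed to have no lexical signal
    "both": (10.0, 0.7),          # has both signals
}

ranked = sorted(
    candidates,
    key=lambda doc_id: hybrid_score(*candidates[doc_id]),
    reverse=True,
)
```

Because bm25 scores are unbounded while closeness is limited to a small range, the weights have to compensate for the difference in scale, which is why they are tuned per dataset.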


Pretrained Transformer Language Models for Search – part 3


Photo by Frank Busch
on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset.
We demonstrate how to achieve close to state-of-the-art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

In the first post in this series
we introduced using pre-trained language models for ranking and three popular methods for using them for text ranking.
In the second post
we studied efficient retrievers that could be used as the first phase in a multiphase retrieval and ranking pipeline.
In this third post we study a re-ranking model which we will deploy as a re-ranker on top of the retriever methods
we studied in the previous post, but first let us recap what a multiphase retrieval and ranking pipeline is.
In a multiphased retrieval and ranking pipeline,
the first phase retrieves candidate documents using a cost-efficient retrieval method
and the more computationally complex cross-attention or late interaction model inference
is limited to the top ranking documents from the first phase.
In this post we will study the Contextualized late interaction over BERT (ColBERT) model
and deploy it as a re-ranking phase on top of the dense retriever that we studied in the previous post.
The ColBERT ranking model was introduced in
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
by Omar Khattab and Matei Zaharia.

Contextualized late interaction over BERT (ColBERT)

In the previous post in this series we introduced a dense retriever using a bi-encoder architecture over a Transformer model (MiniLM). Both queries and documents were encoded by the bi-encoder and represented in the same dense embedding vector space.
We used cosine similarity between the query and the document in this embedding vector space to rank documents for a query,
and we could accelerate the retrieval phase using approximate nearest neighbor search
using angular distance or innerproduct.

Unlike the dense bi-encoder, the contextualized late interaction model represents the query and document as multiple vectors obtained from the last output layer of the Transformer model. Bi-encoders, on the other hand, usually perform a pooling operation over the last transformer layer,
e.g. just using the embedding representation from the CLS output token, or the mean over all token output embeddings.
Also, unlike other text to vector representations like Word2Vec, the token vector representation depends on the other tokens in the same input sequence. For example, the token driver in the text Microsoft driver has a different vector representation than driver in the text Taxi driver, as the context is different. This is thanks to the attention mechanism in the Transformer architecture, where each token attends to all other tokens in the same input sequence. We can say that the token output vector representation is contextualized by the other tokens in the input text sequence.

Similar to the single vector bi-encoder model, queries and documents are encoded independently. Hence, the query tokens only attend to other query tokens, and document tokens only attend to other document tokens. This separation enables offline processing of the documents, which speeds up re-ranking, as at re-ranking time we only need to obtain the query token embeddings and load the precomputed document embeddings from storage (e.g. memory). The ColBERT architecture also uses a query encoder and a document encoder, based on the same Transformer instance. The input to the model is different for queries and documents. The query encoder pads the input with the BERT mask token up to a configurable maximum query length if the query input text is shorter than this max length. The document input is not padded to a fixed length.
The padding of masked tokens of the query input is explained in the paper:

We denote the padding with masked tokens as query augmentation, a step that allows BERT to produce query-based embeddings at the positions corresponding to these masks. Query augmentation is intended to serve as a soft, differentiable mechanism for learning to expand queries with new terms or to re-weigh existing terms based on their importance for matching the query
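
The input side of query augmentation can be sketched as follows. This is a simplified illustration and not the ColBERT implementation: it works on plain token strings, leaves out any special tokens the real encoder adds, and truncating over-long queries is an assumption. The function names are hypothetical.

```python
MASK_TOKEN = "[MASK]"


def augment_query(tokens, max_query_length=32):
    """Pad the query tokens with mask tokens up to a fixed length.

    Assumption: queries longer than max_query_length are truncated.
    """
    truncated = tokens[:max_query_length]
    return truncated + [MASK_TOKEN] * (max_query_length - len(truncated))


def prepare_document(tokens):
    """Document input keeps its own length; no padding to a fixed size."""
    return list(tokens)


query_tokens = ["is", "cdg", "airport", "in", "main", "paris", "?"]
padded_query = augment_query(query_tokens, max_query_length=10)
document_tokens = prepare_document(["paris", "charles", "de", "gaulle", "airport"])
```

The encoder then produces one output embedding per position, including the masked positions, so every query yields the same number of token embeddings.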

The dimensionality used to represent the output token embedding can be reduced using a dimension reduction layer on top of the last output transformer layer. The original token output dimensionality depends on the Transformer model used; for example, the bert-base model uses 768 dimensions while MiniLM uses 384 dimensions. In the ColBERT paper the authors use dimension reduction to 128 dimensions from the original hidden size of 768 dimensions. The authors also demonstrate that reducing the dimensionality further to 32 does not impact ranking accuracy significantly. The dimensionality and the precision used for the vector values matter for both the computational complexity and the storage requirements.
For example, if we use 32 dimensions with bfloat16 (2 bytes per tensor value) precision, we need to store about 32GB of vector data for 9M documents with an average of 60 tokens per document.
If we instead use 128 dimensions with float32 (4 bytes) precision, we end up with about 256GB of vector data.
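
The storage figures follow from a simple multiplication, sketched below. The function name is made up, the calculation covers raw vector data only (no index or other overhead), and reading GB as 2^30 bytes is an assumption made to reproduce the rounded numbers.

```python
def token_embedding_bytes(num_docs, avg_tokens_per_doc, dims, bytes_per_value):
    """Raw size of all document token embeddings in bytes."""
    return num_docs * avg_tokens_per_doc * dims * bytes_per_value


GIB = 2 ** 30  # assumption: GB in the text is read as 2^30 bytes

# 9M documents with 60 tokens on average, as in the example above.
small_bytes = token_embedding_bytes(9_000_000, 60, dims=32, bytes_per_value=2)
large_bytes = token_embedding_bytes(9_000_000, 60, dims=128, bytes_per_value=4)

print(f"32 dims, bfloat16: {small_bytes / GIB:.1f} GiB")
print(f"128 dims, float32: {large_bytes / GIB:.1f} GiB")
```

Going from 32 to 128 dimensions and from 2 to 4 bytes per value multiplies the footprint by 4 * 2 = 8.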

Ranking with ColBERT – Meet MaxSim

So we now know roughly how the ColBERT architecture works: query text is encoded into a fixed length bag of token embeddings, and document text is encoded into a bag of token embeddings. But the missing piece is: how do we compute the relevancy score of a (query, document) pair using this representation?

The ColBERT paper introduces the late interaction similarity function which they name Maximum Similarity (MaxSim): For a given query and document pair the MaxSim relevancy score is calculated as follows:

  • For each query token embedding, compute the cosine similarity against all the document token embeddings and keep the maximum score per query token.
  • The overall query, document score is the sum of these maximum cosine scores.
For a query with 32 token embeddings (max query length 32), and a document with 128 tokens we need to perform 32*128 cosine similarity operations. The MaxSim operator is illustrated in the figure below.


MaxSim illustration from the ColBERT paper

When the vectors are normalized to unit length, the cosine similarity reduces to the inner (dot) product,
which can be hardware accelerated using advanced vector instructions.
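
The MaxSim computation described above can be sketched in plain Python. This is an illustration of the scoring function only; the toy two-dimensional embeddings are made up, and real implementations use vectorized operations.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that the dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_embeddings, document_embeddings):
    """Sum, over query tokens, of the best dot product against all document tokens."""
    score = 0.0
    for q in query_embeddings:
        score += max(dot(q, d) for d in document_embeddings)
    return score


# Toy 2-dimensional embeddings (made up for illustration only).
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
document = [normalize(v) for v in [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]]

print(max_sim(query, document))
```

The nested loop makes the cost explicit: one dot product per query token, document token pair.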

Vespa ColBERT representation

To represent the ColBERT model architecture in Vespa for re-ranking documents we need to:

  • Store the document token embeddings in our Vespa document model for fast, on-demand, access in ranking phases
  • Express the MaxSim function with a Vespa ranking expression
  • Map the query text to token ids, and map tokens to token embeddings at run time by invoking the ColBERT query encoder transformer model

We expand the Vespa document schema from the previous post and introduce a new mixed Vespa tensor field called dt. We use this tensor to store the computed bag of token embeddings for the document.
The mixed tensor (combining sparse dt, and indexed x dimensions) allows storing a dynamic number of token embeddings for the document, depending on the length of the document. We could have used an indexed representation, but that would have used more memory as we would need to determine a max document length.

Vespa Passage document schema

The new document schema including the new dt ColBERT document tensor is given below:

search passage {
  document passage {
    field id type int {...} 
    field text type string {...}
    field mini_document_embedding type tensor<float>(d0[384]){...}
    field dt type tensor<bfloat16>(dt{}, x[32]){
     indexing: attribute
     attribute: fast-search
    }
  }
}

The tensor cell value precision type we use is bfloat16 which is 2 bytes per tensor cell value which saves 50% of the memory compared to float precision (4 bytes per value). Vespa supports double, float, bfloat16 and int8 tensor cell value precision types.
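
To illustrate why bfloat16 needs only half the space, the sketch below treats a bfloat16 value as the upper 16 bits of the float32 encoding. It is a simplified model under a stated assumption: the conversion here truncates the low bits, while an actual implementation may round instead. The helper names are hypothetical.

```python
import struct


def float32_to_bfloat16_bits(value):
    """Keep the upper 16 bits of the IEEE 754 float32 encoding (simple truncation)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16


def bfloat16_bits_to_float32(bits16):
    """Expand 16 stored bits back to float32 by zero-filling the lower 16 bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


original = 0.1234567
restored = bfloat16_bits_to_float32(float32_to_bfloat16_bits(original))
print(original, restored)
```

The value range of float32 is preserved, while the precision drops to roughly two to three significant decimal digits.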

We also use 32 dimensions for the per-token embedding representation instead of 128 to further reduce the memory requirement. The indexing statement specifies attribute, which means this field is stored in memory. The fast-search setting enables a fast uncompressed in-memory representation, which speeds up evaluation over mixed tensor fields. fast-search is only relevant for mixed tensor type fields.

Vespa MaxSim operator

We can express the MaxSim operator in Vespa by a tensor ranking expression using sum and reduce tensor functions.

    sum(
        reduce(sum(query(qt) * attribute(dt), x), max, dt),
        qt
    )

Where attribute(dt) is the ColBERT document tensor field and query(qt) is the ColBERT query tensor representation.
The query(qt) tensor is defined in the passage schema:

query(qt) tensor<float>(qt{},x[32])

We configure the MaxSim operator in a Vespa ranking profile,
where we use the dense bi-encoder model as our first-phase ranking function and the ColBERT MaxSim as the second-phase ranking expression.
We use a re-ranking count of 1000 (per node).
This setting can also be controlled by a query time setting,
in case we want to explore different re-ranking depths.
The ranking profile is given below.
In this case, we also cast the bfloat16 tensor values
to float to enable the hardware accelerations that are in place for operations on float tensors.

rank-profile dense-colbert {
  first-phase {
    expression: closeness(field,mini_document_embedding)
  }
  second-phase {
    rerank-count: 1000
    expression {
      sum(
        reduce(sum(query(qt) * cell_cast(attribute(dt), float), x), max, dt),
        qt
      )
    }
  }
}

To obtain the query(qt) ColBERT tensor we need to encode the text query input using the ColBERT query encoder.

Vespa ColBERT query encoder

We have trained a ColBERT model using a 6-layer MiniLM model which can be downloaded from the Huggingface model hub. This model has only 22.7M trainable parameters and can be served with Vespa using ONNX format. We have also included a notebook which demonstrates how to export the PyTorch transformer model to ONNX format and how to use quantization to further speed up the evaluation. Quantization, using int8 weights instead of float, speeds up evaluation of the model by 3x. See the Google Colab notebook.

The query encoder is represented in a query document type which has no fields.
It’s a placeholder to be able to represent the ONNX model,
and we use a single empty document so that we can invoke the Vespa ranking framework to evaluate the ONNX model.

schema query {
  document query {}
  onnx-model colbert_encoder {
    file: files/vespa-colMiniLM-L-6-quantized.onnx
    input input_ids: query(input_ids)
    input attention_mask: query(attention_mask)
    output contextual:contextual 
  }
  rank-profile colbert_query_encoder {
    num-threads-per-search: 1
    first-phase {
      expression: random 
    }
    summary-features {...}
  }
}

Tokenization and the tensor inputs (input_ids and attention_mask) are handled by a
custom searcher which maps the query text to BERT token ids
and creates the ColBERT masked query input.
See ColBERTSearcher for details.
This searcher produces the query(qt) tensor which is used by the MaxSim ranking expression.
We use the ColBERT repo’s indexing routine to produce the document token embeddings,
and we also publish a pre-processed dataset with all 8.8M passages, including both the mini_document_embedding and ColBERT tensor fields.
See the MS Marco Passage Ranking using Transformers Vespa sample application.

We evaluate the ranking effectiveness of the ColBERT model deployed as a re-ranking step on top of the dense retriever introduced in the previous post. We use MS Marco Passage Ranking dev query split (6980 queries):

Retrieval method         Ranking       MRR@10  Recall@100  Recall@200  Recall@1000
weakAnd (sparse)         bm25          0.185   0.66        0.73        0.85
nearestNeighbor (dense)  innerproduct  0.310   0.82        0.87        0.94
nearestNeighbor (dense)  ColBERT       0.359   0.86        0.90        0.94

The Recall@1000 does not change as the model is used to re-rank the top 1K hits from the dense retriever. The Recall@100 and Recall@200 metrics improve with the ColBERT re-ranking step and MRR@10 improves from 0.310 to 0.359.
The end-to-end latency, including query encoding of the dense retriever model, ColBERT query encoding, retrieval with nearest neighbor search (with targetHits=1000) and re-ranking with ColBERT is just 39 ms. Reducing the nearest neighbor search targetHits, and also the re-ranking depth of the ColBERT model can be used to trade accuracy versus cost.

$ ./src/main/python/ --rank_profile dense-colbert --rerank_hits 1000 --retriever dense  --ann_hits 1000 --hits 10  --trec_format --run_file dev.test --query_split dev --endpoint https://$ENDPOINT:4443/search/
100%|██████████| 6980/6980 [04:27<00:00, 26.07it/s]
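
For reference, the MRR@10 and Recall@k metrics reported in the table can be computed as in the sketch below. This is a minimal illustration, not the evaluation script used for the reported numbers; the function names and the toy run and relevance judgments are hypothetical.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant hit within the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_over_queries(metric, runs, qrels, **kwargs):
    """Average a per-query metric over all queries that have relevance judgments."""
    total = sum(metric(runs.get(q, []), qrels[q], **kwargs) for q in qrels)
    return total / len(qrels)


# Toy run and judgments (hypothetical ids, for illustration only).
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d4", "d2"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}}

print(mean_over_queries(reciprocal_rank_at_k, runs, qrels, k=10))
print(mean_over_queries(recall_at_k, runs, qrels, k=2))
```

A re-ranker can only reorder what the retriever returned, which is why recall at the re-ranking depth stays fixed while MRR@10 and recall at shallower depths can improve.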


Pretrained Transformer Language Models for Search – part 4


Photo by Patrick Hendry on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset. We demonstrate how to achieve close to state-of-the-art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.


In the first post in this series we introduced using pre-trained language models for ranking and three popular methods for using them for text ranking. In the second post we studied efficient retrievers which could be used as the first phase in a multiphase retrieval and ranking pipeline. In the third post we studied the ColBERT re-ranking model.

In this fourth and last post in our blog post series on pre-trained transformer models for search,
we introduce a cross-encoder model with all-to-all interaction between the query and the passage.

We deploy this model as the final ranking stage in our multiphase retrieval and ranking pipeline. Furthermore,
we submit the ranking results to the MS Marco Passage Ranking Leaderboard.

In addition, we benchmark the serving performance of all the retrieval and ranking methods introduced in this blog post series.
Finally, we also release a Vespa sample application,
which lets you try out these state-of-the-art retrieval and ranking methods.


In this blog post we study the third option for using transformer models for search and document ranking.
This option is the simplest model to configure and use in Vespa, but also the most computationally expensive model in our multi-phase retrieval and ranking pipeline.
With the cross attention model we input both the query and the passage to the model and, as we know by now,
the computational complexity of the transformer is quadratic in the input length.
Doubling the sequence length increases the computational complexity by 4x.

The cross-encoder model is a transformer based model with a classification head on top of the Transformer CLS token (classification token).
The model has been fine-tuned using the MS Marco passage training set and is a binary classifier which classifies
whether a query, document pair is relevant or not.

The cross-encoder model is also based on a 6-layer MiniLM model with only 22.7M parameters, same as the transformer models previously introduced in this blog series. As with the other two transformer models we introduced in previous posts in this series, we integrate this model in Vespa using ONNX format. We demonstrate how to export the model(s) from PyTorch/Transformers to ONNX format in this notebook. The model is hosted on the Huggingface model hub.

We use a quantized version where the original float weights have been quantized to int8 representation to speed up inference on cpu.

Vespa representation of the cross-encoder model

In previous posts we have introduced the Vespa passage schema.
We add a new tensor field to our schema and in this tensor field we will store the transformer token ids of the processed text.
We haven’t described this in detail before, but the MiniLM model uses as input the sequence of the numeric token ids from the
fixed BERT token vocabulary of about 30K unique tokens or subwords.

For example the passage:

Charles de Gaulle (CDG) Airport is close to Paris

Is tokenized to:

['charles', 'de', 'gaulle', '(', 'cd', '##g', ')', 'airport', 'is', 'close', 'to', 'paris']

The subword tokens are mapped to token ids from the fixed vocabulary, e.g. ‘charles’ maps to token id 2798.
The example passage text is represented as a tensor by:

[2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000]

We use the native Vespa WordPiece embedder
to map the text into tensor representation.
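
The tokenization steps above can be illustrated with a small greedy longest-match WordPiece sketch. It is not the implementation of the Vespa WordPiece embedder, and it simplifies the text normalization to lowercasing and punctuation splitting. The toy vocabulary contains only the entries needed for the example, using the token ids quoted above.

```python
import re

# Toy vocabulary: only the entries needed for the example passage.
# A real BERT vocabulary has about 30K entries.
VOCAB = {
    "charles": 2798, "de": 2139, "gaulle": 28724, "(": 1006, "cd": 3729,
    "##g": 2290, ")": 1007, "airport": 3199, "is": 2003, "close": 2485,
    "to": 2000, "paris": 3000,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that do not start the word carry the "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        start = end
    return tokens


def tokenize(text, vocab):
    """Lowercase, split on whitespace and punctuation, then apply WordPiece per word."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [token for word in words for token in wordpiece(word, vocab)]


tokens = tokenize("Charles de Gaulle (CDG) Airport is close to Paris", VOCAB)
token_ids = [VOCAB[t] for t in tokens]
print(tokens)
print(token_ids)
```

The word "cdg" is not in the vocabulary as a whole, so it is split into the known prefix "cd" and the continuation piece "##g".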

The passage document schema,
including the new text_token_ids field:

search passage {
  document passage {
    field id type int {...} 
    field text type string {...}
    field mini_document_embedding type tensor<float>(d0[384]){...}
    field dt type tensor<bfloat16>(dt{}, x[32]){...}
  }

  field text_token_ids type tensor<float>(d0[128])  {
    indexing: input text | embed tokenizer | attribute | summary
    attribute: paged
  }
}

We store a maximum of 128 tokens, denoted by d0[128]. This is an example of an indexed Vespa tensor type.

Vespa ranking with cross-encoder model

We are going to use the dense retriever model, accelerated by Vespa’s approximate nearest neighbor search, to
efficiently retrieve passages for re-ranking with our transformer based ranking models. The retrieved hits are
re-ranked with the ColBERT model introduced in the third post,
and finally the top ranking documents from the ColBERT model are re-ranked using the cross-encoder.

The retrieval and ranking pipeline has two re-ranking depth parameters:

  • The number of hits that are re-ranked with ColBERT is determined by the target number of hits passed to the nearest neighbor query operator.
  • The number of documents that are re-ranked using the final cross-encoder model is determined by the rank-profile rerank-count property.

See phased ranking with Vespa.
Both these parameters impact end-to-end serving performance and also ranking accuracy as measured by MRR@10.

Both the nearest neighbor search target number of hits and rerank-count apply per content node involved in the query.
This is only relevant for deployments where the document corpus cannot be indexed on a single node due to either space constraints (memory, disk) or serving latency constraints.

Defining the MiniLM cross-encoder

schema passage {
  document passage {...}

  onnx-model minilmranker {
    file: files/ms-marco-MiniLM-L-6-v2-quantized.onnx
    input input_ids: input_ids
    input attention_mask: attention_mask
    input token_type_ids: token_type_ids
  }
}

In the above snippet we define the ONNX model and its inputs. Each of the inputs is mapped to a function declared later in the ranking profile, and each function produces a tensor
which is used as input to the model. The file setting points to the ONNX model file, placed in src/main/application/files/.
Vespa takes care of distributing the model to the content node(s). The inputs
to the model are the standard transformer inputs (input_ids, attention_mask and token_type_ids).

The first part of the ranking profile where we define the 3 input functions to the BERT model looks like this:

  rank-profile dense-colbert-mini-lm {
    function input_ids() {
       expression: tokenInputIds(128, query(query_token_ids), attribute(text_token_ids))
    }
    function token_type_ids() {
      expression: tokenTypeIds(128, query(query_token_ids), attribute(text_token_ids))
    }
    function attention_mask() {
      expression: tokenAttentionMask(128, query(query_token_ids), attribute(text_token_ids))
    }
    ...
  }

For example, the model input input_ids is mapped to the function input_ids, which is defined as:

  function input_ids() {
       expression: tokenInputIds(128, query(query_token_ids), attribute(text_token_ids))
  }

The tokenInputIds is a built-in Vespa ranking feature
which builds the transformer model input including special tokens like CLS and SEP.

We pass the query(query_token_ids) tensor, which
is sent with the query, and the passage token ids, which are read from the in-memory attribute field (text_token_ids).

The query tensor representation (query(query_token_ids)) is created in a custom query processor RetrievalModelSearcher
which converts the free text query input from the
user to a tensor representation using the same BertTokenizer as used by the custom document processor.

For example for a text query

is CDG in paris?

The query tensor representation becomes:

[2003, 3729, 2290, 1999, 3000, 1029]

The tokenInputIds ranking function will create the concatenated tensor of both query and passage including the special tokens. Using the example passage
from previous section with the above query example our concatenated output with special tokens becomes:

[101, 2003, 3729, 2290, 1999, 3000, 1029, 102, 2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000, 102]

Where 101 is the CLS token id and 102 is the SEP token separating the query from the passage.
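
The three model inputs can be sketched in Python as follows. This illustrates the standard BERT sentence-pair input layout and is not the Vespa implementation of tokenInputIds, tokenTypeIds and tokenAttentionMask. It assumes the bert-base-uncased special token ids, segment id 0 for the query part and 1 for the passage part, and zero padding up to the fixed length. The function name is hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed bert-base-uncased special token ids


def build_cross_encoder_input(query_ids, passage_ids, max_length=128):
    """Return (input_ids, token_type_ids, attention_mask), each padded to max_length."""
    query_part = [CLS_ID] + list(query_ids) + [SEP_ID]
    # Truncate the passage so that the final [SEP] still fits.
    room = max_length - len(query_part) - 1
    passage_part = list(passage_ids)[:room] + [SEP_ID]

    used = len(query_part) + len(passage_part)
    padding = [PAD_ID] * (max_length - used)

    input_ids = query_part + passage_part + padding
    token_type_ids = [0] * len(query_part) + [1] * len(passage_part) + [0] * len(padding)
    attention_mask = [1] * used + [0] * len(padding)
    return input_ids, token_type_ids, attention_mask


query_ids = [2003, 3729, 2290, 1999, 3000, 1029]
passage_ids = [2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000]
input_ids, token_type_ids, attention_mask = build_cross_encoder_input(query_ids, passage_ids)
print(input_ids[:21])
```

The first 21 positions match the concatenated example above; the remaining positions up to 128 are padding that the attention mask tells the model to ignore.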

Cross-Encoder Model

The above figure illustrates the input and output of the cross-encoder transformer model.

Notice the CLS output embedding which is fed into the
classification layer which predicts the class label (Relevant = 1, irrelevant = 0).

Now that we have presented how to represent the cross-encoder model, we can present the remaining parts of our
ranking profile:

rank-profile dense-colbert-mini-lm {

    function maxSimNormalized() {
      expression {
        sum(
          reduce(sum(query(qt) * attribute(dt), x), max, dt),
          qt
        ) / 32.0
      }
    }
    function dense() {
      expression: closeness(field, mini_document_embedding)
    }
    function crossModel() {
      expression: onnx(minilmranker){d0:0,d1:0}
    }
    first-phase {
        expression: maxSimNormalized()
    }
    second-phase {
      rerank-count: 24
      expression: 0.2*crossModel() + 1.1*maxSimNormalized() + 0.8*dense()
    }
}

The maxSimNormalized function computes the ColBERT MaxSim function which we introduced in post 3.
Here we also normalize the MaxSim score by dividing it by 32, which is the configured max query length of the ColBERT query encoder.
Since each query term has a maximum score of 1, the normalized score is at most 1.

The dense() function calculates the cosine similarity as calculated
by the dense retriever introduced in post 2.

In the crossModel() function we calculate the score from cross-encoder introduced in this blog post:

function crossModel() {
  expression: onnx(minilmranker){d0:0,d1:0}
}

The {d0:0,d1:0} slice accesses the logit score: d0:0 selects the first entry of the batch dimension, which always has size 1, and d1:0 selects the logit score, which is a proxy for the relevancy.
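
The raw logit can be used directly for ranking. As a small sketch, assuming a single-logit binary classifier whose relevance probability is the sigmoid of the logit, the ordering by logit and the ordering by probability are identical because the sigmoid is monotonic. The logit values below are hypothetical.

```python
import math


def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical logits for three passages (illustration only).
logits = {"p1": 4.2, "p2": -1.3, "p3": 0.5}

by_logit = sorted(logits, key=logits.get, reverse=True)
by_probability = sorted(logits, key=lambda p: sigmoid(logits[p]), reverse=True)
print(by_logit, by_probability)
```

Skipping the sigmoid saves a small amount of computation per hit and does not change the order produced by the cross-encoder on its own.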

Ranking profile summarized

  • Retrieve efficiently using the dense retriever model. This is done by the Vespa approximate nearest neighbor search query operator.
  • The k passages retrieved by the nearest neighbor search are re-ranked using the ColBERT MaxSim operator. k is set by the target hits used for the nearest neighbor search.
  • In the last phase, the top ranking 24 passages from the previous phase are evaluated by the cross attention model.
  • The final ranking score is a linear combination of all three ranking scores. The rerank-count can also be adjusted by a query parameter.

Observe that reusing scores from the previous ranking phases does not impact serving performance,
as they are only evaluated once (per hit) and cached.
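
The phased ranking logic can be sketched in Python as below. It is a simplified illustration, not how Vespa implements phased ranking: the per-hit scores are hypothetical precomputed values, and hits outside the re-ranking window are simply kept in first-phase order behind the re-ranked ones. The weights are the ones from the rank profile above.

```python
def phased_ranking(hits, rerank_count=24):
    """Order hits by the first-phase score, then re-score only the top rerank_count."""
    first_phase = sorted(hits, key=lambda h: h["max_sim"], reverse=True)
    top, rest = first_phase[:rerank_count], first_phase[rerank_count:]

    def second_phase_score(hit):
        # The expensive cross-encoder score is only needed for hits in this phase.
        return 0.2 * hit["cross"] + 1.1 * hit["max_sim"] + 0.8 * hit["dense"]

    reranked = sorted(top, key=second_phase_score, reverse=True)
    return reranked + rest


# Hypothetical scores for three hits (illustration only).
hits = [
    {"id": "a", "dense": 0.80, "max_sim": 0.70, "cross": 1.0},
    {"id": "b", "dense": 0.75, "max_sim": 0.90, "cross": -2.0},
    {"id": "c", "dense": 0.90, "max_sim": 0.60, "cross": 5.0},
]

print([h["id"] for h in phased_ranking(hits, rerank_count=2)])
print([h["id"] for h in phased_ranking(hits, rerank_count=3)])
```

With a re-ranking depth of 2, hit c never reaches the cross-encoder even though it would have scored highest, which shows how the depth trades accuracy against cost.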

The linear weights
of the three different transformer scores were obtained by a simple grid search, observing
the ranking accuracy on the dev query split when changing parameters.

MS Marco Passage Ranking Submission

We submitted a run to the MS Marco Passage Ranking leaderboard where we used targetHits 1K for the approximate nearest neighbor search,
so that 1K passages are re-ranked using the ColBERT model and finally 96 passages are re-ranked with the cross-encoder model.

Passage Ranking

Our multi-phase retrieval and ranking pipeline with 3 miniature models performed pretty well,
even beating large models using T5 with 3B parameters.
See MS Marco Passage Ranking Leaderboard.

Model                                    MRR@10 (Eval)  MRR@10 (Dev)
BM25 (Official baseline)                 0.165          0.167
BM25 (Lucene8, tuned)                    0.190          0.187
Vespa dense + ColBERT + cross-attention  0.393          0.403

Multi-threaded retrieval and ranking

Vespa has the ability to use multiple threads per search query.
This ability can reduce search latency as the document retrieval and ranking
for a single query can be partitioned, so that each thread works on a subset of the searchable documents in an index.
The number of threads to use is controlled on a per rank profile basis,
but cannot exceed the global setting configured in the application's services.xml.

To find optimal settings, we recommend benchmarking starting with one thread per search and increasing until latency does not improve significantly.
See Vespa Scaling Guide for details.
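
The partitioning idea can be sketched as follows. This only illustrates how a single query can be split across threads and the per-thread results merged; it says nothing about how Vespa implements it, and Python threads do not provide real CPU parallelism for this kind of work. The corpus, the scoring function and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def top_k_in_partition(partition, score, k):
    """Score one slice of the corpus and keep its k best (score, doc_id) pairs."""
    return heapq.nlargest(k, ((score(doc), doc_id) for doc_id, doc in partition))


def search(corpus, score, k=10, num_threads=4):
    """Split the corpus into num_threads slices, rank each in parallel, merge the results."""
    items = list(corpus.items())
    partitions = [items[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial = pool.map(lambda p: top_k_in_partition(p, score, k), partitions)
        merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]


def closeness_to_42(value):
    """Toy scoring function: documents with values near 42 score highest."""
    return -abs(value - 42)


# Toy corpus: document id mapped to a single number (illustration only).
corpus = {f"doc{i}": i for i in range(100)}
result = search(corpus, closeness_to_42, k=3, num_threads=4)
print(result)
```

Each thread only needs to return its own top k hits, so the merge step stays cheap regardless of the corpus size.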

Serving performance versus ranking accuracy

In this section we perform benchmarking where we deploy the system on a Vespa cloud instance using
2 x Xeon Gold 6263CY 2.60GHz (HT enabled, 48 cores, 96 threads) with 256GB memory.

We use a single content node indexing the 9M passages.
All query encoding with the MiniLM based query encoders, retrieval and re-ranking are performed on this content node.
We also use 2 stateless container nodes with 16 vCPUs each to make sure that we are benchmarking the content node performance.
See Vespa overview on
stateless container nodes versus content nodes.

Running everything of importance on the same node enables us to quantitatively compare the performance of the methods we have introduced in this blog post series.
We benchmark throughput per retrieval and ranking model until we reach about 70% CPU utilization,
and compare the obtained throughput and latency. We also include tail latency (99.9 percentile) in the reported result.

We use the vespa-fbench benchmarking utility to
load the cluster (by increasing the number of clients to reach about 70% CPU utilization).

Pre-trained models on Vespa Cloud

UPDATE 2023-06-06: use new syntax to configure Bert embedder.


“searching data using pre-trained models, unreal engine high quality render, 4k, glossy, vivid_colors, intricate_detail” by Stable Diffusion

Vespa can now convert text to embeddings for you automatically,
if you don’t want to bring your own vectors – but you still need to provide the ML models to use.

On Vespa Cloud we’re now making this even simpler, by also providing pre-trained models you can use for such tasks.
To take advantage of this, just pick the models you want and refer
to them in your application by supplying a model-id where you would otherwise use path or url. For example:

<component id="myEmbedderId" type="bert-embedder">
    <transformer-model model-id="minilm-l6-v2"/>
    <tokenizer-vocab model-id="bert-base-uncased"/>
</component>

You can deploy this to Vespa Cloud to have these models do their job in your application –
no need to include a model in your application and wait for it to be uploaded.

You can use these models both in configurations provided by Vespa, as above, and in your own components,
with your own configurations – see the documentation for details.

We’ll grow the set of models available over time, but the models we provide on Vespa Cloud will always be an
exclusive selection of models that we think it is beneficial to use in real applications,
both in terms of performance and model quality.

We hope this will empower many more teams to leverage modern AI in their production use cases.