Customizing Reusable Frozen ML-Embeddings with Vespa

UPDATE 2023-06-06: use new syntax to configure Bert embedder.


Deep-learned embeddings are popular for search and recommendation use cases, and organizations must manage and operate these
embeddings efficiently in production. One emerging strategy which reduces embedding lifecycle complexity is to use
frozen models which output frozen foundational embeddings that are reusable and customizable for different tasks.

This post introduces the concept of using reusable frozen embeddings and tailoring them with Vespa.

Background

Deep Learning for Embeddings Overview

Encoding data objects using deep learning models allows for representing
objects in a high-dimensional vector space. In this latent embedding
vector
space, one can compare the objects using vector distance
functions, which can be used for search, recommendation, classification,
and clustering. There are three primary ways developers can introduce
embedding representations of data in their applications:

  • Using commercial embedding providers
  • Using off-the-shelf open-source embedding models
  • Training custom embedding models

All three incur training and inference (computational) costs, which
are proportional to the model size, the number of objects,
and the object input sizes. In addition, the output vector embedding must
be stored and potentially indexed
for efficient retrieval.
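As a minimal illustration of the second option (an off-the-shelf open-source model), the sketch below embeds two texts and compares them with cosine similarity; the model name is just an example:

# Sketch: embed two texts with an off-the-shelf open-source model and compare them
# with cosine similarity. The model name is only an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim embeddings

query_embedding = model.encode("how to fix a flat bicycle tire")
doc_embedding = model.encode("A step-by-step guide to repairing a bicycle puncture")

# Cosine similarity is one of the vector distance/similarity functions mentioned above.
print(float(util.cos_sim(query_embedding, doc_embedding)))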

Operating and maintaining embeddings in production #MLEmbeddingOps

Suppose we want to modify an embedding model by fine-tuning it for a task or
replacing it with a new model with a different vector dimensionality. Then, all our data objects must be reprocessed
and embedded again. Reprocessing for any model change might be easy for small-scale
applications with a few million data points. Still, it quickly
gets out of hand with larger-scale evolving datasets in production.

Consider a case where we have an evolving dataset of 10M news
articles that we have implemented semantic
search
for, using a model that embeds query and document texts into vector
representations. Our search service has been serving production
traffic for some time, but now the ML team wants to change the
embedding model. To get this new model into production
for online evaluation, we roughly need to follow these steps:

  • Run inference with the new model over all documents in the index to obtain the new vector embeddings.
    This stage requires infrastructure to run inference with the model, or paying an embedding inference provider per inference.
    We must also keep serving the current production embedding model, which is used to embed new documents and the real-time stream of queries.

  • Index the new vector embedding representation to our serving infrastructure used for efficient vector search.
    Suppose we are fortunate enough to be using Vespa, which supports
    multiple embedding fields per document schema. In that case, we can index the new embedding in a new field without duplicating
    other fields or creating an entirely new schema or index. Adding the new tensor field still adds to the serving cost,
    as we double the resource usage footprint related to indexing and storage.

  • After all this, we are finally ready to evaluate the new embedding
    model online. Depending on the outcome of the online evaluation, we can
    garbage collect either the new or old embedding field.

That’s a lot of complexity and cost just to evaluate a model online, but now we can relax.
But wait: our PM now wants to introduce news article recommendations
for the home page, and the ML team is planning on using embeddings for this project. We also hear they are
discussing a related-articles feature they want to launch soon.

Very quickly we will face the challenge of maintaining and operating three different embedding-based use cases, each with model iterations, and going
through the above process. There must be a better way. What if we could somehow reuse the embeddings for multiple
tasks?

Frozen embeddings to the rescue

Frozen embeddings

An emerging industry trend that helps reduce #MLEmbeddingOps complexity is to use frozen foundational
embeddings
that can be reused
for
different tasks without incurring the usual costs related to embedding
versioning, storage, inference, and training.

Vector Space Illustration

A frozen model embeds data Q and D into vector space. By transforming
the Q representation to Q’, the vector distance is reduced
(d(D,Q') < d(D,Q)). This illustrates fine-tuning of metric
distances over embeddings from frozen models. Note that the D representation
does not change, which is a practical property for search and
recommendation use cases with potentially billions of embedding
representations of items.
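To make the figure concrete, here is a toy numeric sketch (the numbers are invented purely for illustration): a learned linear transform moves Q closer to D while D itself stays frozen.

# Toy version of the figure: a learned transform moves Q closer to D,
# while the frozen document vector D itself never changes. Numbers are invented.
import torch

D = torch.tensor([1.0, 0.0])          # frozen document embedding
Q = torch.tensor([0.0, 1.0])          # frozen query embedding from the same model

W = torch.tensor([[0.9, 0.6],         # learned transformation, e.g. a tuned linear layer
                  [0.1, 0.4]])
Q_prime = W @ Q                       # Q' = W Q

print(torch.dist(D, Q).item())        # d(D, Q)  ~ 1.41
print(torch.dist(D, Q_prime).item())  # d(D, Q') ~ 0.57, so d(D, Q') < d(D, Q)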

With frozen embeddings from frozen models, the data is embedded once using a foundational
embedding model. Developers can then tailor the representation to
specific tasks by adding transformation layers. The fixed model will always produce the
same frozen embedding representation for the same input. So as long as the input data
does not change, we will not need to invoke the model again. In our news search example, we
can use the same frozen document embeddings for all three use cases.

The following sections describe different methods
for tailoring frozen embeddings for search or recommendation use
cases.

  • Tuning the query tower in two-tower embedding models
  • Simple query embedding transformations
  • Advanced transformations using Deep Neural Networks

Two-tower models

The industry standard for semantic vector search is a two-tower
architecture based on Transformer models.

This architecture is also called a bi-encoder model, as there is a query and document
encoder. Most of the two-tower architecture models use the same
weights for both the query and document encoder. This is not ideal:
if we tune the model, we need to reprocess all our items
again. By de-coupling the model weights of the query and document
tower, developers can treat the document tower as frozen.
Then, when fine-tuning the model for the
specific task, developers tune the query tower and
freeze the document tower.
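A minimal PyTorch sketch of this training setup, with hypothetical stand-in encoders, freezing the document tower and updating only the query tower:

# Sketch: de-coupled two-tower training where the document tower is frozen and
# only the query tower receives gradient updates. The encoders are stand-ins.
import torch
import torch.nn as nn

query_tower = nn.Sequential(nn.Linear(384, 384))     # stand-in for the query encoder
document_tower = nn.Sequential(nn.Linear(384, 384))  # stand-in for the document encoder

# Freeze the document tower so its weights (and the document embeddings) stay fixed.
for param in document_tower.parameters():
    param.requires_grad = False

# Only the query tower parameters are optimized during fine-tuning.
optimizer = torch.optim.Adam(query_tower.parameters(), lr=1e-4)

query_batch, document_batch = torch.randn(8, 384), torch.randn(8, 384)
with torch.no_grad():
    # In practice the frozen document embeddings can be fetched directly from Vespa.
    document_embeddings = document_tower(document_batch)
query_embeddings = query_tower(query_batch)

loss = -nn.functional.cosine_similarity(query_embeddings, document_embeddings).mean()
loss.backward()
optimizer.step()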

Frozen Query Tower

The frozen document tower and corresponding embeddings significantly reduce the
complexity and cost of serving and training. For example, during training,
there is no need to encode the documents in the training data; they can be fetched
directly from Vespa. This saves at least 2x in computation
during training. In practice,
since documents are generally longer than queries and Transformer models
scale quadratically with input length, the computational saving is
higher than that.
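As a rough back-of-the-envelope example of that claim (token counts are illustrative): if self-attention cost grows with the square of the input length, a 256-token document costs about (256/16)² = 256 times as much to encode as a 16-token query.

# Back-of-the-envelope estimate of the relative self-attention cost, assuming cost
# grows quadratically with input length (token counts are illustrative).
doc_tokens, query_tokens = 256, 16
print((doc_tokens / query_tokens) ** 2)  # 256.0 -> far more than the 2x from skipping the document tower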

On the serving side in Vespa, there is no need to re-process the
documents, as the same input will produce the exact same frozen document embedding representation.
This saves the compute of performing the
inference and avoids introducing new embedding fields or embedding versioning. And,
because Vespa allows deploying multiple query tower models,
applications can test the accuracy of new models without re-processing documents,
which allows for frequent model deployment and evaluations.

Managing complex infrastructure for producing text embedding vectors
can be challenging, especially at query serving time, which demands low
latency, high availability, and high query throughput. Vespa allows
developers to represent embedding
models in Vespa.
Consider the following schema, expressed using Vespa’s schema
definition language:

schema doc {
  document doc {
    field text type string {}
  }
  # The embedding is produced at feed time by the embedder component with id "frozen"
  field embedding type tensor<float>(x[384]) {
    indexing: input text | embed frozen | attribute | index
  }
}

In this case, Vespa will produce embeddings using a frozen embedding
model, and at query time, we can either use the frozen model to
encode the query or a new fine-tuned model. Deploying multiple query tower models allows
for query time A/B testing which increases model deployment velocity and shortens
the ML feedback loop.

curl \
 --json '
  {
   "yql": "select text from doc where {targetHits:10}nearestNeighbor(embedding, q)",
   "input.query(q)": "embed(tuned, dancing with wolves)"
  }' \
 https://vespaendpoint/search/

The first argument of the embed command is the model to use when encoding the query.
For each new query tower model, developers will add the model to a directory in
the Vespa application
package, and
give it a name, which is referenced at query inference time.

Re-deployment of new models is a live change, where Vespa automates
the model distribution to all the nodes in the cluster, without
service interruption or downtime.

<component id="frozen" type="bert-embedder">
    <transformer-model path="models/frozen.onnx"/>
    <tokenizer-vocab path="models/vocab.txt"/>
</component>

<component id="tuned" type="bert-embedder">
    <transformer-model path="models/tuned.onnx"/>
    <tokenizer-vocab path="models/vocab.txt"/>
</component>

Snippet from the Vespa application services.xml file, which defines the models and
their names; see represent embedding models for details.
Finally, how documents are ranked is expressed using
Vespa ranking expressions.

rank-profile default {
  inputs {
    query(q) tensor<float>(x[384])
  }
  first-phase {
    expression: cos(distance(field,embedding))
  }
}

Simple embedding transformation

Simple linear embedding transformation is great for the cases where
developers use an embedding provider and don’t have access to the
underlying model weights. In this case tuning the model weights is
impossible, so the developers cannot adjust the embedding model
towers. However, the simple approach for adapting the model is to
add a linear layer on top of the embeddings obtained from the
provider.

The simplest form is to adjust the query vector representation by
multiplying it with a learned weights matrix. Similarly to the query
tower approach, the document side representation is frozen. This
example implements the transformation using tensor compute
expressions configured
with the Vespa ranking
framework.

rank-profile simple-similarity inherits default {
  constants {
    W tensor<float>(w[128],x[384]): file: constants/weights.json
  }
 
  function transform_query() {
     expression: sum(query(q) * constant(W), w)   
  }

  first-phase {
    # Reduce to a scalar score by summing over the element-wise product
    expression: sum(attribute(embedding) * transform_query())
  }
}

The learned weights are exported from whichever ML framework (e.g.,
PyTorch, scikit-learn) was used to train the matrix weights, and written to a
constant tensor file. Meanwhile, the transform_query function performs a
vector-matrix product, returning a modified vector of the same dimensionality.
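As a sketch of that export step, the snippet below writes a (w[128], x[384]) weight matrix in the JSON "cells" layout of Vespa's tensor format; the linear layer here is untrained and only stands in for whatever weights you actually learned, and you should verify the exact file format against the Vespa documentation for your version.

# Sketch: export a learned (w[128], x[384]) weight matrix as a Vespa constant tensor
# file in the JSON "cells" layout. The linear layer below is untrained and only stands
# in for the trained weights; verify the file format for your Vespa version.
import json
import os
import torch

W = torch.nn.Linear(384, 128, bias=False)    # stand-in for the trained transformation
weights = W.weight.detach().numpy()          # shape (128, 384) -> dims (w, x)

cells = [
    {"address": {"w": str(w), "x": str(x)}, "value": float(weights[w, x])}
    for w in range(weights.shape[0])
    for x in range(weights.shape[1])
]

os.makedirs("constants", exist_ok=True)
with open("constants/weights.json", "w") as f:
    json.dump({"cells": cells}, f)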

This representation is then used to score the documents in the first-phase
ranking expression. Note that this effectively acts as
a re-ranking phase,
since the query tensor used for the nearestNeighbor search is untouched. It's also possible to transform the query tensor before
the nearestNeighbor search, using a custom stateless searcher.

The weight tensor does not necessarily need to be a constant across
all users. For example, one can have a weight tensor per user, as
shown in the recommendation use
case,
to unlock true personalization.

Advanced transformation using Deep Neural Networks

Another approach for customization is to use the query and document
embeddings as input to another Deep Neural Network (DNN) model.
This approach can be combined with the previously mentioned
approaches because it’s applied as a re-scoring model in a phased
ranking pipeline.

import torch
import torch.nn as nn

class CustomEmbeddingSimilarity(nn.Module):
    def __init__(self, dimensionality=384):
        super(CustomEmbeddingSimilarity, self).__init__()
        # MLP over the concatenated query and document embeddings
        self.fc1 = nn.Linear(2*dimensionality, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 1)

    def forward(self, query, document):
        x = torch.cat((query, document), dim=1)
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        x = nn.functional.relu(self.fc3(x))
        return torch.sigmoid(self.fc4(x))

dim = 384
ranker = CustomEmbeddingSimilarity(dimensionality=dim)

# Train the ranker model ..

# Export to ONNX for inference with Vespa
input_names = ["query", "document"]
output_names = ["similarity"]
document = torch.ones(1, dim, dtype=torch.float)
query = torch.ones(1, dim, dtype=torch.float)
args = (query, document)
torch.onnx.export(ranker,
                  args=args,
                  f="custom_similarity.onnx",
                  input_names=input_names,
                  output_names=output_names,
                  opset_version=15)

The above PyTorch model.py snippet defines a custom
DNN-based similarity model which takes the query and document
embedding as input. This model is exported to ONNX
format for accelerated
inference
using Vespa’s support for ranking with
ONNX models.
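Before adding the exported file to the application package, it can be useful to sanity-check it locally with onnxruntime. This is an optional step, not part of the Vespa deployment itself; the input and output names match the export above.

# Optional local sanity check of the exported ONNX model with onnxruntime.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("custom_similarity.onnx")

query = np.ones((1, 384), dtype=np.float32)
document = np.ones((1, 384), dtype=np.float32)

outputs = session.run(["similarity"], {"query": query, "document": document})
print(outputs[0].shape)  # (1, 1): one similarity score per query/document pair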

rank-profile custom-similarity inherits simple-similarity {
  function query() {
    # Match expected tensor input shape
    expression: query(q) * tensor<float>(batch[1]):[1]
  }
  function document() {
    # Match expected tensor input shape
    expression: attribute(embedding) * tensor<float>(batch[1]):[1]
  }
  onnx-model dnn {
    file: models/custom_similarity.onnx
    input "query": query
    input "document": document
    output "similarity": score
  }
  second-phase {
    expression: sum(onnx(dnn).score)
  }
}

This model might be complex, so it is typically used as a second-phase expression, only scoring
the highest-ranking documents from the first-phase expression.

Architecture

The Vespa serving architecture
operates in the following
manner: the stateless containers perform inference using the embedding
model(s). Because the containers are stateless, they can auto-scale quickly
with changes in query and inference volume.
Meanwhile, the stateful content nodes store (and index) the frozen
vector embeddings. Stateful content clusters are scaled
elastically in
proportion to the embedding volume. Additionally, Vespa handles the
deployment of ranking and embedding models.

Summary

In this post, we covered three different ways to use frozen models and frozen embeddings
with Vespa while still allowing for task-specific customization of
the embeddings or the similarity function.


Leveraging frozen embeddings in Vespa with SentenceTransformers


Introduction

Hybrid search is an information retrieval approach that combines traditional lexical search with semantic search based on vector representations of the search subjects.

Despite being fairly easy to set up with Vespa, hybrid search can be tedious to support, especially in a setting with millions of documents and regularly changing user search patterns, which result in frequent re-training of the text embedding model.

One example of such a setting could be e-commerce.

Updating tens of millions of document embeddings every time the similarity model is retrained is not fun. As mentioned in Vespa’s article on the topic, getting a new embedding model into production takes roughly three steps:

  • Re-calculating embeddings for all existing documents using the new embedding model
  • Indexing the new vectors with Vespa
  • Evaluating the new model and getting rid of the old embeddings

As the number of objects with vector representations inside Vespa increases, maintaining them becomes increasingly tedious.

In the aforementioned article, the Vespa team proposes an elegant solution to this problem, called “frozen embeddings”.

Basically, the idea is that we can freeze the vector representations of the documents stored inside Vespa and update only the query representations as the embedding model is retrained in response to changing user search patterns.

In this article, we explore the implementation details of a Vespa search application that leverages frozen embeddings obtained from a model built with the sentence-transformers library.

Model training

The most effective approach to training a semantic similarity model that produces quality embeddings of textual data is based on the two-tower transformer architecture called a bi-encoder.

A bi-encoder is trained on a dataset of text pairs annotated with similarity labels. In an e-commerce setting, for example, a text pair would be (search query, product title) with an associated similarity score or class.

By passing each element of the text pair through the encoder, we obtain two vector representations. Then, using the specified loss function, the encoder weights are updated depending on how close the predicted similarity between these vectors was to the ground-truth similarity.
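As a minimal sketch of that training loop with sentence-transformers (the base model, the example pairs, and the loss are illustrative choices, not recommendations):

# Minimal bi-encoder training sketch with sentence-transformers.
# The base model, the example pairs and the loss are illustrative only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("xlm-roberta-base")  # plain transformer + mean pooling

# (search query, product title) pairs annotated with a similarity score in [0, 1].
train_examples = [
    InputExample(texts=["wireless headphones", "Bluetooth over-ear headphones"], label=0.9),
    InputExample(texts=["wireless headphones", "USB-C charging cable"], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)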

Despite its name, this “two-tower” transformer often uses shared transformer weights. As a result, the same model is used to encode both queries and documents.

Depending on the language of your documents and queries, you would pick pre-trained weights to start your bi-encoder from. One good choice for common European languages is xlm-roberta-base. It supports 100 languages and is one of the most capable small multilingual open-source language models out there.

The simplest way to train a bi-encoder is to use the sentence-transformers package. It is a framework for state-of-the-art sentence, text, and image embeddings built on top of the famous transformers library.

By default, a bi-encoder model built with the sentence-transformers library shares weights between the query and document encoders. The reason behind this design decision is that shared weights lead to better representational ability of the underlying encoder.

But in order to train a model suitable for generating frozen embeddings, we need to make non-trivial changes to the default bi-encoder training procedure provided by the sentence-transformers library.

First, we need to decide how to achieve asymmetry in the representations of query and document.

In the context of sentence-transformers, there are two possible ways to achieve this asymmetry:

  1. Share transformer weights and use 2 dense layers on the top of transformer to generate asymmetric representations of query and document.
     import torch
     import torch.nn as nn
     from sentence_transformers import SentenceTransformer
     from sentence_transformers import models
        
     EMBEDDING_DIM = 384
    
     word_embedding_model = models.Transformer('xlm-roberta-base')
        
     pooling_model = models.Pooling(
         word_embedding_model.get_word_embedding_dimension(),
         pooling_mode_mean_tokens=True,
         pooling_mode_cls_token=False,
         pooling_mode_max_tokens=False
     )
        
     in_features = word_embedding_model.get_word_embedding_dimension()
     out_features = EMBEDDING_DIM
        
     q_dense = models.Dense(
         in_features=in_features,
         out_features=out_features,
         bias=False,
         init_weight=torch.eye(out_features, in_features),
         activation_function=nn.Identity()
     )
        
     d_dense = models.Dense(
         in_features=in_features,
         out_features=out_features,
         bias=False,
         init_weight=torch.eye(out_features, in_features),
         activation_function=nn.Identity()
     )
        
     asym_model = models.Asym({'query': [q_dense], 'doc': [d_dense]})
        
     model = SentenceTransformer(
         modules=[word_embedding_model, pooling_model, asym_model]
     )
    
  2. Use 2 different transformers for query and document:
     q_word_embedding_model = models.Transformer('xlm-roberta-base')
        
     q_pooling_model = models.Pooling(
         q_word_embedding_model.get_word_embedding_dimension(),
         pooling_mode_mean_tokens=True,
         pooling_mode_cls_token=False,
         pooling_mode_max_tokens=False
     )
        
     d_word_embedding_model = models.Transformer('xlm-roberta-base')
        
     d_pooling_model = models.Pooling(
         d_word_embedding_model.get_word_embedding_dimension(),
         pooling_mode_mean_tokens=True,
         pooling_mode_cls_token=False,
         pooling_mode_max_tokens=False
     )
        
     q_model = SentenceTransformer(modules=[q_word_embedding_model, q_pooling_model])
     d_model = SentenceTransformer(modules=[d_word_embedding_model, d_pooling_model])
        
     asym_model = models.Asym({'query': [q_model], 'doc': [d_model]})
     model = SentenceTransformer(modules=[asym_model])
    

Additional dense layers could be added here as well to decrease dimensionality; we skip this for simplicity.

According to experiments conducted with our proprietary data, these two approaches provide similar information retrieval performance. But the first one is much more efficient with regard to disk and memory usage.

This is especially important given the current 2 GB limit on the contents of the models/ folder in Vespa’s application package.

So, we ended up using two dense layers to achieve asymmetry in our bi-encoder.

But how do we train this model correctly to generate frozen embeddings?

There are also two possible approaches.

  1. We could first train the bi-encoder with shared transformer and dense parts using data that consists of (document, document) text pairs. This results in a good document embedder that can be used to generate document embeddings for Vespa and for further training of the bi-encoder with asymmetric dense layers. Note that in subsequent model re-trainings the transformer and document dense layer weights need to be frozen.

  2. We could do an “initial” training of the asymmetric bi-encoder using query-document text pairs and freeze the transformer and document dense layer weights in all subsequent trainings of the bi-encoder. As a result, document embeddings generated by such a model are automatically “frozen” (see the freezing sketch right after this list).
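As a sketch of that freezing step, assuming the asymmetric model from the first construction above (so word_embedding_model, d_dense, q_dense and model refer to those variables):

# Sketch: freeze the shared transformer and the document dense layer before re-training,
# so document embeddings stay frozen while the query dense layer keeps adapting.
# word_embedding_model, d_dense, q_dense and model come from the construction above.
for param in word_embedding_model.parameters():
    param.requires_grad = False
for param in d_dense.parameters():
    param.requires_grad = False

# Only q_dense parameters remain trainable, so subsequent training updates the query
# representation without touching the document embeddings.
print([name for name, p in model.named_parameters() if p.requires_grad])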

We won’t discuss other training details here, since they depend a lot on the dataset format, loss function choice, and training environment.

One additional detail that should be explained is the choice of activation function in the asymmetric layers. By default, sentence-transformers uses the tanh activation function in the models.Dense layer. But since it’s not currently implemented in Vespa’s Tensor API, we decided to use an identity activation without a bias term. This reduces the additional layer to a straightforward matrix multiplication, which can easily be done with the matmul method of the Tensor class.
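For intuition: with an identity activation and no bias, the asymmetric Dense layer collapses to a single matrix multiplication of the mean-pooled transformer output, which is what we will reproduce on the Vespa side. A plain PyTorch sketch with the dimensions used in this article:

# With an identity activation and no bias, the asymmetric Dense layer is just a matrix
# multiplication of the mean-pooled transformer output, i.e. the operation reproduced
# with Tensor.matmul on the Vespa side. Dimensions follow the model in this article.
import torch

pooled = torch.randn(768)           # mean-pooled transformer output
W_query = torch.randn(384, 768)     # query dense layer weights (out_features x in_features)

query_embedding = W_query @ pooled  # 384-dimensional query embedding
print(query_embedding.shape)        # torch.Size([384])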

Model preparation

The format in which the model is stored by sentence-transformers is not directly usable by Vespa.

In order to integrate our asymmetric bi-encoder model into Vespa, we need to make a few preparations.

First, let’s take a look at the structure of the files produced by sentence-transformers as a result of training:

model/
├── 1_Pooling
│   └── config.json
├── 2_Asym
│   ├── 139761904414624_Dense
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── 139761906488608_Dense
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   └── config.json
├── config.json
├── config_sentence_transformers.json
├── eval
│   └── accuracy_evaluation_dev_results.csv
├── modules.json
├── pytorch_model.bin
├── README.md
├── sentence_bert_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json

Out of all these files, we only need four:

model/tokenizer.json     # shared tokenizer
model/pytorch_model.bin  # shared transformer
model/2_Asym/139761904414624_Dense/pytorch_model.bin # query dense layer
model/2_Asym/139761906488608_Dense/pytorch_model.bin # document dense layer

Transformer ONNX export

In order to integrate our transformer into Vespa, we need to export it to ONNX format. This is done using optimum-cli:

optimum-cli export onnx --framework pt --task feature-extraction --model ./model/ ./model/onnx/

The resulting file ./model/onnx/model.onnx needs to be added to the Vespa application package.

Tokenizer export

The tokenizer file is unchanged and can be copied directly to the Vespa application package.

Dense layers export

The dense layer weight matrices need to be exported to plain-text files in a specific format:

import torch

layer_name_2_type = {
    "139761904414624_Dense": "query",
    "139761906488608_Dense": "doc"
}

for k, v in layer_name_2_type.items():
    # Load the Dense layer state dict and write its weight matrix as a Vespa tensor literal
    dense_layer = torch.load(f'model/2_Asym/{k}/pytorch_model.bin')
    with open(f'{v}_dense_layer.txt', 'w') as file:
        tensor_str = f"tensor<float>(x[384],y[768]):{dense_layer['linear.weight'].cpu().numpy().tolist()}"
        file.write(tensor_str)

The resulting files doc_dense_layer.txt and query_dense_layer.txt need to be added to the Vespa application package.

As a result of these actions, your application package’s models folder will have the following structure:

src/main/application/models/
├── doc_dense_layer.txt
├── query_dense_layer.txt
├── tokenizer.json
└── model.onnx

Model integration

With its latest update, Vespa has made it very easy to integrate HuggingFace feature-extraction models as embedders.

But despite having almost all the necessary functionality, it does not allow us to integrate a model with an additional dense layer stacked on top of a transformer with mean pooling.

To achieve this, we need to implement our own DenseAsymmetricHfEmbedder, which differs only slightly from Vespa’s current HuggingFaceEmbedder.

  1. First, define an appropriate package name:
     package com.experimental.search.embedding;
    
  2. Then we need to add a field to store the dense layer weights:
     private final Tensor linearLayer;
    
  3. linearLayer needs to be initialized in the constructor from the weights stored in the plain-text file:
     try {
         String strTensor = Files.readString(Paths.get(config.linearLayer().toString()));
         this.linearLayer = Tensor.from(strTensor);
     } catch (IOException e) {
         throw new RuntimeException(e);
     }
    
  4. Then we need to update the embed method to include the matrix multiplication:
     TensorType intermediate = TensorType.fromSpec("tensor<float>(x[768])");
     var result = poolingStrategy.toSentenceEmbedding(intermediate, tokenEmbeddings, attentionMask);
     var finalResult = linearLayer.matmul(result.rename("x", "y"), "y");
     return normalize ? normalize(finalResult, tensorType) : finalResult;
    

Also, we need to create our own config definition file inside src/resources/configdefinitions, called dense-asymmetric-hf-embedder.def, which differs from hugging-face-embedder.def by two lines:

-namespace=embedding.huggingface
+package=com.experimental.search.embedding;

+linearLayer model

Finally, we need to set up the document and query model configurations in services.xml:

<component id="doc-embedder"
           class="com.experimental.search.embedding.DenseAsymmetricHfEmbedder"
           bundle="search-mvp">
    <config name="com.experimental.search.embedding.dense-asymmetric-hf-embedder">
        <tokenizerPath path="models/tokenizer.json"/>
        <transformerModel path="models/model.onnx"/>
        <linearLayer path="models/doc_dense_layer.txt"/>
        <normalize>true</normalize>
        <transformerTokenTypeIds/>
    </config>
</component>
<component id="query-embedder"
           class="com.experimental.search.embedding.DenseAsymmetricHfEmbedder"
           bundle="search-mvp">
    <config name="com.experimental.search.embedding.dense-asymmetric-hf-embedder">
        <tokenizerPath path="models/tokenizer.json"/>
        <transformerModel path="models/model.onnx"/>
        <linearLayer path="models/query_dense_layer.txt"/>
        <normalize>true</normalize>
        <transformerTokenTypeIds/>
    </config>
</component>

Now we can easily use the doc-embedder in our schema:

field title_embedding type tensor<float>(x[384]) {
    indexing: input title | embed doc-embedder | attribute | index
    attribute {
        distance-metric: innerproduct
    }
}

Or the query-embedder in search requests:

import requests

VESPA_DOC_API_URL = "https://vespaendpoint/search/"  # placeholder for your Vespa query endpoint
text = "our search query"
hits = 10

r = requests.post(
    url=VESPA_DOC_API_URL,
    json={
        "yql": 'select * from product '
               'where ({targetHits:1000, approximate:false}nearestNeighbor(title_embedding, input_embedding))',
        "input.query(input_embedding)": f'embed(query-embedder, "{text}")',
        "hits": hits,
        "ranking": {"profile": "semantic"},
    },
    headers={
        'Content-Type': 'application/json'
    }
)

Conclusions

Using frozen embeddings in your Vespa search application can substantially decrease the effort needed to support it in production under constantly changing search behavior patterns. It makes it much easier to maintain your application and update embedding models.

This specific implementation gives you additional benefits, such as:

  • A plug-and-play training procedure with the sentence-transformers library
  • Shared transformer weights between the document and query models, which decreases memory usage during deployment
  • The possibility to easily decrease the embedding size for objects that do not require high-dimensional representations