Approximate Nearest Neighbor Search in Vespa – Part 1

Searching for the nearest neighbors of a data point in a high dimensional vector space is an important problem for many real-time applications.
For example, in Computer Vision it can be used to find similar images in large image datasets.
In Information Retrieval, pre-trained text embeddings can be used to match documents based on the distance between query and document embeddings.
In many of these applications,
the document corpus is constantly evolving and the search is constrained by query filters applied on the metadata of the data points.
For example, in E-commerce,
a nearest neighbor search for products in a vector space would typically be constrained by product metadata like inventory status and price.

Vespa (the open source big data serving engine) already supports exact
nearest neighbor search
that is integrated with the Vespa query tree and its filter support.
This enables you to get the exact nearest neighbors that meet the filter criteria of the query.
This works well when the number of documents to calculate nearest neighbors for is small,
e.g. when the query filter is strong, or the document corpus is small.
However, as the number of documents to consider increases, we want to trade exactness for performance, and we need an approximate solution.

This blog post is part 1 in a series of blog posts where we share how the Vespa team implemented an approximate nearest neighbor (ANN) search algorithm.
In this first post, we’ll explain why we selected HNSW (Hierarchical Navigable Small World Graphs)
as the baseline algorithm and how we extended it to meet the requirements for integration in Vespa.
Requirements included supporting real-time updates, integration with the Vespa query tree, and being flexible enough to fit a wide range of use cases.

Requirements & Challenges

Vespa is an open source real-time big data serving engine.
In a typical application, the document corpus is constantly evolving.
New documents are added and removed, and metadata is being updated.
A typical use case is news article search and recommendation, keeping a week, month or year’s worth of articles,
continually adding or updating new articles while selectively removing the oldest ones.
In this case, nearest neighbor search over pre-trained text embeddings could be used in combination with query filters over metadata.

Based on the existing functionality and flexibility of Vespa,
we defined a set of requirements an ANN search algorithm had to support to be a good fit for integration:

  • Indexes used in the algorithm must be real-time updateable with low latency and high throughput.
    Data points can be added, updated and removed, and the entire corpus should not be needed when creating the index.
  • Concurrent querying with low latency and high throughput without huge performance impact due to ongoing update operations.
  • Seamless integration with the Vespa query tree and its filter support.
    This enables correct results, compared to just increasing top K for the nearest neighbor search algorithm to compensate when query filters are used.

Algorithms Background

A large number of algorithms exist for ANN search.
Many of these are summarized and analyzed in
Approximate Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement (v1.0)
and
A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search.
In addition, benchmark results for different algorithms over different datasets are summarized in
ANN Benchmarks.

Existing algorithms are mainly focused on a stable document corpus where all data points in the vector space are known up front.
In this case the index structure is built once, and then used for nearest neighbor searches.
To support real-time updates, we looked for algorithms that were either incremental in nature or could be modified to meet this requirement.

There are three broad categories of algorithms: tree-based, graph-based and hash-based.
We chose one from each category for a more detailed exploration.
This selection was based on how they performed in benchmarks within papers and in
ANN Benchmarks,
and how easy the algorithm was to modify to fit our requirements.

We ended up with the following:

  • Annoy
    Annoy is tree-based and described in
    Nearest neighbors and vector models – part 2 – algorithms and data structures.
    It takes a straightforward engineering approach to the ANN problem, and is quite easy to understand and implement.
    An Annoy index consists of N binary trees, where each tree partitions the vector space using random hyperplanes at each node in the tree.
    The implementation is available on github.
  • HNSW – Hierarchical Navigable Small World Graphs
    This is graph-based and described in
    Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
    HNSW builds a hierarchical graph incrementally,
    and has great search performance with high recall, motivating us to prototype it for comparison.
    Each node in the graph represents a point in the vector space, and nodes are linked to other nodes that are close in space.
    The implementation is available on github.
  • RPLSH – Random Projection based Locality Sensitive Hashing
    This is hash-based and described in
    A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search.
    The implementation is available on github.
    Since the random hyperplanes used for projections can be selected up-front
    (only depending on the number of dimensions of the vector space) this approach is data-independent.
    For our purposes, we could use the hash value as a filter,
    where only documents having most bits in common with the hash value of the query data point would be considered for full distance computation.
    This would add little extra memory and CPU usage and would be easy to fit into our existing exact (brute-force) nearest neighbor feature; a small sketch of this filtering idea is shown after this list.
    It was the first algorithm that we implemented as a prototype.
    However, in our prototype we found that getting a significant speedup required a large sacrifice of quality.
    For more info, see Dropping RPLSH below.
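
To make the filtering idea concrete, here is a minimal, self-contained Python sketch (an illustration only, not the C++ prototype): random hyperplanes are drawn up front, every vector is hashed to a bit string, and the Hamming distance between hashes is used to pre-filter candidates before the exact distance computation. The 200-bit threshold is just an example value.

import numpy as np

# Hyperplanes depend only on the dimensionality, so they can be drawn up front.
rng = np.random.default_rng(42)
DIM, BITS = 128, 512
hyperplanes = rng.standard_normal((BITS, DIM))

def rplsh_hash(vector):
    # Project onto each random hyperplane and keep the sign as one bit.
    return hyperplanes @ vector > 0.0

def candidates(query, doc_vectors, doc_hashes, max_hamming=200):
    # Hamming distance between hashes is a cheap pre-filter; only the surviving
    # documents get a full (exact) distance computation.
    q_hash = rplsh_hash(query)
    keep = [i for i, h in enumerate(doc_hashes)
            if np.count_nonzero(h != q_hash) <= max_hamming]
    return sorted(keep, key=lambda i: np.linalg.norm(doc_vectors[i] - query))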

We created prototype implementations of the three algorithms in C++.
This gave us a deeper understanding and the ability to make necessary modifications to support real-time updates and query filters.
The prototypes were also used to do low-level benchmarking and quality testing.

Prototype Benchmark

We benchmarked the three prototype implementations to look at indexing throughput and search throughput with different document corpus sizes.
We used the
1M SIFT (128 dim)
and
1M GIST (960 dim)
datasets, where one vector corresponds to a document.
In this section, we summarize the results of the 1M SIFT tests. Findings were similar with the 1M GIST dataset.

Setup

The tests were executed in a CentOS 7 Docker image using Docker Desktop on a MacBook Pro (15 inch, 2018)
with a 2.6 GHz 6-Core Intel Core i7 CPU and 32GB of memory. The Docker setup was 8 CPUs, 16GB of memory and 1GB of Swap.

The configuration of the algorithms was as follows:

  • Float size: 4 bytes.
  • Annoy: Number of trees = 50. Max points in leaf node = 128.
  • HNSW: Max links per node (M) = 16. Note that level 0 has 32 links per node (2*M). Number of neighbors to explore at insert (efConstruction) = 200.
  • RPLSH: 512 bits used for hash and linear scan of hash table.

Indexing Throughput

The 1M SIFT dataset was split into chunks of the first 100k, 200k, and 500k vectors,
and indexing throughput was tested with these different corpus sizes.
The index structures for all algorithms started empty, and we added one document at a time to simulate real-time feeding.
One thread was used. Necessary adjustments to the prototype implementations were done to accommodate this.
Note that the data points for 10M documents are estimated based on the 100k and 1M data points.

Observations:

  • Indexing throughput depends on corpus size for Annoy and HNSW, where throughput is halved when corpus size is increased by 10x.
  • Indexing throughput for RPLSH is independent of corpus size.
  • Annoy is 4.5 to 5 times faster than HNSW.
  • RPLSH is 23 to 24 times faster than HNSW at 1M documents.

Search Throughput (QPS)

The search throughput was measured after the indexes were built, using a single benchmarking client and search thread.
We used the query vectors from the dataset, and searched for the nearest K=100 neighbors from the query vector.
To get comparable quality results between the algorithms we had to adjust some of the parameters used.
The default behavior for Annoy is to collect K=100 candidates per tree (5000 candidates with 50 trees), and then calculate the brute-force distance for all of them.
To get comparable quality between Annoy and HNSW we had to collect 8000 candidates instead.
For RPLSH, we used the Hamming distance between hashes to pre-filter candidates after collecting 200 candidates into a heap,
skipping full distance calculation for most candidates. For HNSW, we asked for 100 candidates.

Observations:

  • HNSW outperforms Annoy and RPLSH. At corpus size 1M the QPS is 9 times as high as Annoy,
    and 16 times as high as RPLSH at comparable quality.
    Similar observations between hnswlib and Annoy are found in ANN Benchmarks,
    where the QPS of hnswlib is 5-10 times higher at the same quality on all tested datasets.
  • The HNSW search algorithm depends heavily on the number of links between nodes, which again depends on corpus size.
    The QPS is halved when the corpus size is increased by 10x.
    We see the same during indexing as that uses the search algorithm to find candidates to connect to.
  • The Annoy search algorithm is less dependent on corpus size, at least with small corpus sizes.
    The static cost is driven by brute force calculation of distances for the candidate set, 8000 in this case.
    With high corpus size the cost of traversing the binary trees will likely match and exceed the static cost.
    We don’t know where this limit is.
  • For RPLSH, doing exhaustive search (linear scan) of the hashes is more expensive than expected.
    One major consideration here is that the data reduction (from 128 dimensions*32 bit floats, to 512 bits hash) is just a factor 8,
    and indeed we got around 8 times speedup compared to brute-force linear scan.

Index Structures & Memory Usage

The prototype implementations used simple data structures to represent the index structures of the algorithms in memory.
Only one thread was used for indexing and searching, so we didn't have to worry about locking.
These challenges would have to be overcome in the actual implementation of the selected algorithm.
We created a rough index structure design for both Annoy and HNSW to see how much memory the index would use.

In Vespa, a data point in a high dimensional vector space is represented by a
tensor
of rank one with dimension size equal to the dimensionality of the vector space.
To use this, a document type with a tensor field of such type is defined, and documents with tensors are fed to Vespa.
A document type typically consists of multiple other fields as well, for instance, metadata fields and other tensor fields.
The tensors across all documents for a given tensor field are stored in-memory.
The index structures were designed such that they just reference the tensor data points stored in-memory.
In addition, they support one writer thread doing changes, while multiple reader threads are searching, without use of locking.
The actual index structure details for HNSW will be covered in an upcoming blog post.

Annoy Index

An Annoy index consists of multiple binary trees. Each tree consists of split nodes and leaf nodes.
A split node has an offset from the origin, a hyperplane that splits the space in two, and references to its left and right child nodes.
A leaf node contains the document ids of the points that ended up in that portion of space.
The document ids are used to lookup the actual tensor data from memory.
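
A rough sketch of these structures in Python (an illustration only; the actual Annoy implementation and our prototype differ in layout and memory management):

from dataclasses import dataclass
from typing import List, Union

@dataclass
class SplitNode:
    offset: float            # offset of the splitting hyperplane from the origin
    hyperplane: List[float]  # normal vector of the splitting hyperplane
    left: "Node"             # child covering one half-space
    right: "Node"            # child covering the other half-space

@dataclass
class LeafNode:
    doc_ids: List[int]       # documents whose points fell into this region; used
                             # to look up the actual tensor data stored in memory

Node = Union[SplitNode, LeafNode]

@dataclass
class AnnoyStyleIndex:
    trees: List[Node]        # N independent trees built with random hyperplanes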

HNSW Index

An HNSW index consists of navigable small world graphs in a hierarchy.
Each document in the index is represented by a single graph node.
Each node has an array of levels, from level 0 to n.
The number of levels is constant during the lifetime of the node and is drawn randomly when the node is created.
All nodes have links on level 0, while only a decreasing fraction of nodes are present on the higher levels.
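
A rough sketch of such a node structure and the commonly used random level assignment (an illustration only; the actual Vespa index structure is covered in the upcoming blog post):

import math
import random
from dataclasses import dataclass
from typing import List

@dataclass
class HnswNode:
    doc_id: int             # reference to the tensor data stored in memory
    links: List[List[int]]  # one array of links per level, level 0 first

def random_level(m: int = 16) -> int:
    # Levels follow an exponentially decaying distribution,
    # so most nodes only exist on level 0.
    return int(-math.log(1.0 - random.random()) * (1.0 / math.log(m)))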

Build a News recommendation app from python with Vespa: Part 1

Part 1 – News search functionality.

We will build a news recommendation app in Vespa without leaving a python environment. In this first part of the series, we want to develop an application with basic search functionality. Future posts will add recommendation capabilities based on embeddings and other ML models.

UPDATE 2023-02-13: Code examples are updated to work with the latest release of
pyvespa.

Decorative image

Photo by Filip Mishevski on Unsplash

This series is a simplified version of Vespa’s News search and recommendation tutorial. We will also use the demo version of the Microsoft News Dataset (MIND) so that anyone can follow along on their laptops.

Dataset

The original Vespa news search tutorial provides a script to download, parse and convert the MIND dataset to Vespa format. To make things easier for you, we made the final parsed data required for this tutorial available for download:

import requests, json

data = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_fields_parsed.json").text
)
data[0]
{'abstract': "Shop the notebooks, jackets, and more that the royals can't live without.",
 'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By',
 'subcategory': 'lifestyleroyals',
 'news_id': 'N3112',
 'category': 'lifestyle',
 'url': 'https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata',
 'date': 20191103,
 'clicks': 0,
 'impressions': 0}

The final parsed data used here is a list where each element is a dictionary containing relevant fields about a news article, such as title and category. We also have information about the number of impressions and clicks the article has received. The demo version of the MIND dataset contains 28,603 news articles.

Install pyvespa
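
pyvespa is available on PyPI and can be installed with pip install pyvespa.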

Create the search app

Create the application package. app_package will hold all the relevant data related to your application’s specification.

from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="news")

Add fields to the schema. Here is a short description of the non-obvious arguments used below:

  • indexing argument: configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing.

  • index argument: configure how Vespa should create the search index.

    • “enable-bm25”: set up an index compatible with bm25 ranking for text search.
  • attribute argument: configure how Vespa should treat an attribute field.

    • “fast-search”: Build an index for an attribute field. By default, no index is generated for attributes, and search over these defaults to a linear scan.
from vespa.package import Field

app_package.schema.add_fields(
    Field(name="news_id", type="string", indexing=["summary", "attribute"], attribute=["fast-search"]),
    Field(name="category", type="string", indexing=["summary", "attribute"]),
    Field(name="subcategory", type="string", indexing=["summary", "attribute"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="url", type="string", indexing=["index", "summary"]),        
    Field(name="date", type="int", indexing=["summary", "attribute"]),            
    Field(name="clicks", type="int", indexing=["summary", "attribute"]),            
    Field(name="impressions", type="int", indexing=["summary", "attribute"]),                
)

Add a fieldset to the schema. Fieldset allows us to search over multiple fields easily. In this case, searching over the default fieldset is equivalent to searching over title and abstract.

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "abstract"])
)

We have enough to deploy the first version of our application. Later in this tutorial, we will include an article’s popularity into the relevance score used to rank the news that matches our queries.

Deploy the app on Docker

If you have Docker installed on your machine, you can deploy the app_package in a local Docker container:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(
    application_package=app_package, 
)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Finished deployment.

vespa_docker will parse the app_package and write all the necessary Vespa config files to disk. It will then create the Docker container and use the Vespa config files to deploy the Vespa application. We can then use the app instance to interact with the deployed application, such as for feeding and querying. If you want to know more about what happens behind the scenes, we suggest you go through this getting started with Docker tutorial.

Feed data to the app

We can use the feed_data_point method. We need to specify:

  • data_id: unique id to identify the data point

  • fields: dictionary with keys matching the field names defined in our application package schema.

  • schema: name of the schema we want to feed data to. When we created an application package, we created a schema by default with the same name as the application name, news in our case.

This takes 10 minutes or so:

for article in data:
    res = app.feed_data_point(
        data_id=article["news_id"], 
        fields=article, 
        schema="news"
    )

Query the app

We can use the Vespa Query API through app.query to unlock the full query flexibility Vespa can offer.

Search over indexed fields using keywords

Select all the fields from documents where default (title or abstract) contains the keyword ‘music’.

res = app.query(body={"yql" : "select * from sources * where default contains 'music'"})
res.hits[0]
{
    'id': 'id:news:news::N14152',
    'relevance': 0.25641557752127125,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N14152',
        'news_id': 'N14152',
        'category': 'music',
        'subcategory': 'musicnews',
        'title': 'Music is hot in Nashville this week',
        'abstract': 'Looking for fun, entertaining music events to check out in Nashville this week? Here are top picks with dates, times, locations and ticket links.', 'url': 'https://www.msn.com/en-us/music/musicnews/music-is-hot-in-nashville-this-week/ar-BBWImOh?ocid=chopendata',
        'date': 20191101,
        'clicks': 0,
        'impressions': 3
    }
}

Select title and abstract where title contains ‘music’ and default contains ‘festival’.

res = app.query(body = {"yql" : "select title, abstract from sources * where title contains 'music' AND default contains 'festival'"})
res.hits[0]
{
    'id': 'index:news_content/0/988f76793a855e48b16dc5d3',
    'relevance': 0.19587240022210403,
    'source': 'news_content',
    'fields': {
        'title': "At Least 3 Injured In Stampede At Travis Scott's Astroworld Music Festival",
        'abstract': "A stampede Saturday outside rapper Travis Scott's Astroworld musical festival in Houston, left three people injured. Minutes before the gates were scheduled to open at noon, fans began climbing over metal barricades and surged toward the entrance, according to local news reports."
    }
}

Search by document type

Select the title of all the documents with document type equal to news. Our application has only one document type, so the query below retrieves all our documents.

res = app.query(body = {"yql" : "select title from sources * where sddocname contains 'news'"})
res.hits[0]
{
    'id': 'index:news_content/0/698f73a87a936f1c773f2161',
    'relevance': 0.0,
    'source': 'news_content',
    'fields': {
        'title': 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'
    }
}

Search over attribute fields such as date

Since date is not specified with attribute=["fast-search"] there is no index built for it. Therefore, search over it is equivalent to doing a linear scan over the values of the field.

res = app.query(body={"yql" : "select title, date from sources * where date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/debbdfe653c6d11f71cc2353',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'title': 'These Cranberry Sauce Recipes Are Perfect for Thanksgiving Dinner',
        'date': 20191110
    }
}

Since the default fieldset is formed by indexed fields, Vespa will first filter by all the documents that contain the keyword ‘weather’ within title or abstract, before scanning the date field for ‘20191110’.

res = app.query(body={"yql" : "select title, abstract, date from sources * where default contains 'weather' AND date contains '20191110'"})
res.hits[0]
{
    'id': 'index:news_content/0/bb88325ae94d888c46538d0b',
    'relevance': 0.27025156546141466,
    'source': 'news_content',
    'fields': {
        'title': 'Weather forecast in St. Louis',
        'abstract': "What's the weather today? What's the weather for the week? Here's your forecast.",
        'date': 20191110
    }
}

We can also perform range searches:

res = app.query({"yql" : "select date from sources * where date <= 20191110 AND date >= 20191108"})
res.hits[0]
{
    'id': 'index:news_content/0/c41a873213fdcffbb74987c0',
    'relevance': 0.0017429193899782135,
    'source': 'news_content',
    'fields': {
        'date': 20191109
    }
}

Sorting

By default, Vespa sorts the hits by descending relevance score. The relevance score is given by the nativeRank unless something else is specified, as we will do later in this post.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music'"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/5f1b30d14d4a15050dae9f7f',
        'relevance': 0.25641557752127125,
        'source': 'news_content',
        'fields': {
            'title': 'Music is hot in Nashville this week',
            'date': 20191101
        }
    },
    {
        'id': 'index:news_content/0/6a031d5eff95264c54daf56d',
        'relevance': 0.23351089409559303,
        'source': 'news_content',
        'fields': {
            'title': 'Apple Music Replay highlights your favorite tunes of the year',
            'date': 20191105
        }
    }
]

However, we can explicitly order by a given field with the order keyword.

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/d0d7e1c080f0faf5989046d8',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': "Elton John's second farewell tour stop in Cleveland shows why he's still standing after all these years",
            'date': 20191031
        }
    },
    {
        'id': 'index:news_content/0/abf7f6f46ff2a96862075155',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'The best hair metal bands',
            'date': 20191101
        }
    }
]

order sorts in ascending order by default; we can override that with the desc keyword:

res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date desc"})
res.hits[:2]
[
    {
        'id': 'index:news_content/0/934a8d976ff8694772009362',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Korg Minilogue XD update adds key triggers for synth sequences',
            'date': 20191113
        }
    },
    {
        'id': 'index:news_content/0/4feca287fdfa1d027f61e7bf',
        'relevance': 0.0,
        'source': 'news_content',
        'fields': {
            'title': 'Tom Draper, Black Music Industry Pioneer, Dies at 79',
            'date': 20191113
        }
    }
]

Grouping

We can use Vespa’s grouping feature to compute the three news categories with the highest number of document counts:

res = app.query(body={"yql" : "select * from sources * where sddocname contains 'news' limit 0 | all(group(category) max(3) order(-count())each(output(count())))"})
res.hits[0]
{
    'id': 'group:root:0',
    'relevance': 1.0,
    'continuation': {
        'this': ''
    },
    'children': [
        {
            'id': 'grouplist:category',
            'relevance': 1.0,
            'label': 'category',
            'continuation': {
                'next': 'BGAAABEBGBC'
            },
            'children': [
                {
                    'id': 'group:string:news',
                    'relevance': 1.0,
                    'value': 'news',
                    'fields': {
                        'count()': 9115
                    }
                },
                {
                    'id': 'group:string:sports',
                    'relevance': 0.6666666666666666,
                    'value': 'sports',
                    'fields': {
                        'count()': 6765
                    }
                },
                {
                    'id': 'group:string:finance',
                    'relevance': 0.3333333333333333,
                    'value': 'finance',
                    'fields': {
                        'count()': 1886
                    }
                }
            ]
        }
    ]
}

Use news popularity signal for ranking

Vespa uses nativeRank to compute relevance scores by default. We will create a new rank-profile that includes a popularity signal in our relevance score computation.

from vespa.package import RankProfile, Function

app_package.schema.add_rank_profile(
    RankProfile(
        name="popularity",
        inherits="default",
        functions=[
            Function(
                name="popularity", 
                expression="if (attribute(impressions) > 0, attribute(clicks) / attribute(impressions), 0)"
            )
        ], 
        first_phase="nativeRank(title, abstract) + 10 * popularity"
    )
)

Our new rank-profile will be called popularity.
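
After redeploying the application with the updated app_package, the new profile can be selected at query time through the ranking.profile parameter. A minimal sketch (the query below is illustrative):

res = app.query(body={
    "yql": "select * from sources * where default contains 'music'",
    "ranking.profile": "popularity"
})
res.hits[0]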

Build a News recommendation app from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Part 2 – From news search to news recommendation with embeddings.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of
pyvespa.

In this part, we'll start transforming our application from news search to news recommendation using the embeddings created in this tutorial. An embedding vector will represent each user and news article. We make the embeddings available for download so that it is easier to follow along. When a user arrives, we retrieve their embedding and use it to retrieve the closest news articles via an approximate nearest neighbor (ANN) search. We also show that Vespa can jointly apply general filtering and ANN search, unlike competing alternatives available in the market.

Decorative image

Photo by Matt Popovich on Unsplash

We assume that you have followed the news search tutorial. Therefore, you should have an app_package variable holding the news search app definition and a Docker container named news running a search application fed with news articles from the demo version of the MIND dataset.

Add a user schema

We need to add another document type to represent a user. We set up the schema to search for a user_id and retrieve the user’s embedding vector.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="user", 
        document=Document(
            fields=[
                Field(
                    name="user_id", 
                    type="string", 
                    indexing=["summary", "attribute"], 
                    attribute=["fast-search"]
                ), 
                Field(
                    name="embedding", 
                    type="tensor<float>(d0[51])", 
                    indexing=["summary", "attribute"]
                )
            ]
        )
    )
)

We build an index for the attribute field user_id by specifying the fast-search attribute. Remember that attribute fields are held in memory and are not indexed by default.

The embedding field is a tensor field. Tensors in Vespa are flexible multi-dimensional data structures and, as first-class citizens, can be used in queries, document fields, and constants in ranking. Tensors can be either dense or sparse or both and can contain any number of dimensions. Please see the tensor user guide for more information. Here we have defined a dense tensor with a single dimension (d0 – dimension 0), representing a vector. 51 is the size of the embeddings used in this post.

We now have one schema for the news and one schema for the user.

[schema.name for schema in app_package.schemas]
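
This should list the two schemas defined so far:

['news', 'user']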

Index news embeddings

Similarly to the user schema, we will use a dense tensor to represent the news embeddings. But unlike the user embedding field, we will index the news embedding by including index in the indexing argument and specify that we want to build the index using the HNSW (hierarchical navigable small world) algorithm. The distance metric used is euclidean. Read this blog post to know more about Vespa’s journey to implement ANN search.

from vespa.package import Field, HNSW

app_package.get_schema(name="news").add_fields(
    Field(
        name="embedding", 
        type="tensor<float>(d0[51])", 
        indexing=["attribute", "index"],
        ann=HNSW(distance_metric="euclidean")
    )
)

Recommendation using embeddings

Here, we’ve added a ranking expression using the closeness ranking feature, which calculates the euclidean distance and uses that to rank the news articles. This rank-profile depends on using the nearestNeighbor search operator, which we’ll get back to below when searching. But for now, this expects a tensor in the query to use as the initial search point.

from vespa.package import RankProfile

app_package.get_schema(name="news").add_rank_profile(
    RankProfile(
        name="recommendation", 
        inherits="default", 
        first_phase="closeness(field, embedding)"
    )
)

Query Profile Type

The recommendation rank profile above requires that we send a tensor along with the query. For Vespa to bind the correct types, it needs to know the expected type of this query parameter.

from vespa.package import QueryTypeField

app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(user_embedding)",
        type="tensor<float>(d0[51])"
    )
)

This query profile type instructs Vespa to expect a float tensor with dimension d0[51] when the query parameter ranking.features.query(user_embedding) is passed. We’ll see how this works together with the nearestNeighbor search operator below.

Redeploy the application

We made all the required changes to turn our news search app into a news recommendation app. We can now redeploy the app_package to our running container named news.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Finished deployment.
["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session",
 "Session 7 for tenant 'default' created.",
 'Preparing session 7 using http://localhost:19071/application/v2/tenant/default/session/7/prepared',
 "WARNING: Host named 'news' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.",
 "Session 7 for tenant 'default' prepared.",
 'Activating session 7 using http://localhost:19071/application/v2/tenant/default/session/7/active',
 "Session 7 for tenant 'default' activated.",
 'Checksum:   62d964000c4ff4a5280b342cd8d95c80',
 'Timestamp:  1616671116728',
 'Generation: 7',
 '']

Feeding and partial updates: news and user embeddings

To keep this tutorial easy to follow, we make the parsed embeddings available for download. To build them yourself, please follow this tutorial.

import requests, json

user_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_user_embeddings_parsed.json").text
)
news_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_news_embeddings_parsed.json").text
)

We just created the user schema, so we need to feed user data for the first time.

for user_embedding in user_embeddings:
    response = app.feed_data_point(
        schema="user", 
        data_id=user_embedding["user_id"], 
        fields=user_embedding
    )

For the news documents, we just need to update the embedding field added to the news schema.
This takes ten minutes or so:

for news_embedding in news_embeddings:
    response = app.update_data(
        schema="news", 
        data_id=news_embedding["news_id"], 
        fields={"embedding": news_embedding["embedding"]}
    )

Fetch the user embedding

Next, we create a query_user_embedding function to retrieve the user embedding by the user_id. Of course, you could do this more efficiently using a Vespa Searcher as described here, but keeping everything in python at this point makes learning easier.

def parse_embedding(hit_json):
    # The embedding is returned as a list of values in the hit JSON;
    # return it as a plain list of floats.
    return list(hit_json["fields"]["embedding"]["values"])

def query_user_embedding(user_id):
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding

The function will query Vespa, retrieve the embedding and parse it into a list of floats. Here are the first five elements of the user U63195’s embedding.

query_user_embedding(user_id="U63195")[:5]
[
    0.0,
    -0.1694680005311966,
    -0.0703359991312027,
    -0.03539799898862839,
    0.14579899609088898
]

Get recommendations

The following yql instructs Vespa to select the title and the category from the ten news documents closest to the user embedding.

yql = "select title, category from sources news where ({targetHits:10}nearestNeighbor(embedding, user_embedding))" 

We also specify that we want to rank those documents by the recommendation rank-profile that we defined earlier and send the user embedding via the query profile type ranking.features.query(user_embedding) that we also defined in our app_package.

result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)

Here are the first two hits out of the ten returned.

[
    {
        'id': 'index:news_content/0/aca03f4ba2274dd95b58db9a',
        'relevance': 0.1460561756063909,
        'source': 'news_content',
        'fields': {
            'category': 'music',
            'title': 'Broadway Star Laurel Griggs Suffered Asthma Attack Before She Died at Age 13'
        }
    },
    {
        'id': 'index:news_content/0/bd02238644c604f3a2d53364',
        'relevance': 0.14591827245062294,
        'source': 'news_content',
        'fields': {
            'category': 'tv',
            'title': "Rip Taylor's Cause of Death Revealed, Memorial Service Scheduled for Later This Month"
        }
    }
]

Combine ANN search with query filters

Vespa ANN search is fully integrated into the Vespa query tree. This integration means that we can include query filters and the ANN search will be applied only to documents that satisfy the filters. No need to do pre- or post-processing involving filters.

The following yql searches over news documents that have sports as their category.

yql = "select title, category from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding)) AND " \
      "category contains 'sports'"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)

Here are the first two hits out of the ten returned. Notice the category field.

[
    {
        'id': 'index:news_content/0/375ea340c21b3138fae1a05c',
        'relevance': 0.14417346200569972,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': 'Charles Rogers, former Michigan State football, Detroit Lions star, dead at 38'
        }
    },
    {
        'id': 'index:news_content/0/2b892989020ddf7796dae435',
        'relevance': 0.14404365847394848,
        'source': 'news_content',
        'fields': {
            'category': 'sports',
            'title': "'Monday Night Football' commentator under fire after belittling criticism of 49ers kicker for missed field goal"
        }
    }
]

Next steps

Go to part 3, or see the conclusion for how to clean up the Docker container instances if you are done with this.

Pretrained Transformer Language Models for Search – part 1

Decorative image

Photo by Jamie Street
on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multi-phase retrieval and ranking pipeline using Vespa.ai.
We also evaluate these models on the largest Information Retrieval (IR) relevance dataset, namely the MS Marco Passage ranking dataset.
Furthermore, we demonstrate how to achieve close to state-of-the-art ranking using miniature transformer models with just 22M parameters,
beating large ensemble models with billions of parameters.

Blog posts in this series:

In this first post we give an introduction to Transformers for text ranking and three different methods of applying them for ranking.
We also cover multi-phase retrieval and ranking pipelines, and introduce three different ways to efficiently retrieve
documents in a phased retrieval and ranking pipeline.

Introduction

Since BERT was first applied to search and document ranking, we at the Vespa team have been busy making it easy to use BERT or Transformer models in general, for ranking and question answering with Vespa.ai. In previous work,
we demonstrated how to use BERT as a representation model (bi-encoder), for efficient passage retrieval for question answering.
We also demonstrated how we could accelerate BERT models for production serving using distillation and quantization.

Search, or information retrieval, is going through a neural paradigm shift; some have even called it the BERT revolution.
The introduction of the pre-trained language model BERT has led to a significant advancement of the state of the art in search and document ranking.

[Table: state-of-the-art advancement on the MS MARCO Passage Ranking leaderboard]

The table shows how significant the advancement was when first applied to the MS MARCO Passage Ranking leaderboard. The state-of-the-art on MS Marco passage ranking advanced by almost 30% within a week,
while improvements up until then had been incremental at best.
Compared to the baseline BM25 text ranking (default Apache Lucene 9 text scoring), applying BERT improved the ranking effectiveness by more than 100%.

The table above is from Pretrained Transformers for Text Ranking: BERT and Beyond, which is a brilliant resource
for understanding how pre-trained transformer models can be used for text ranking.
The MS MARCO Passage ranking relevancy dataset consists of about 8.8M passages and more than 500,000 queries with at least one judged relevant document.
It is by far the largest IR dataset available in the public domain and is commonly used to evaluate ranking models.

The MS Marco passage ranking dataset queries are split in three different subsets, the train, development (dev) and test (eval). The train split can be used to train a ranking model using machine learning. Once a model is built, one can test the effectiveness of the ranking model on the development and test split. Applying the learned model on the development and test set is called in-domain usage of the model. If the trained ranking model is applied on a different relevancy dataset, it’s usually referred to as out of domain usage, or zero-shot. How well models trained on MS Marco query and passage pairs generalize to other domains is out of scope for this blog post, but we can sincerely recommend BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

The official evaluation metric used for the MS Marco Passage ranking leaderboard is MRR@10. The name might sound scary, but it’s
a trivial way to judge the effectiveness of a ranking algorithm. RR@10 is the Reciprocal Rank of the first relevant passage within the top 10 ranking positions for a given query. @k denotes the depth into the top ranking documents we look for the first relevant document.
The reciprocal rank formula is simply 1/(position of the first relevant hit).
If the judged relevant hit (as judged by a human) is ranked at position 1 the reciprocal rank score is 1. If the relevant hit is found at position 2 the reciprocal rank score is 0.5 and so on. The mean in mean reciprocal rank is simply the mean RR over all queries in the dev or test split which gives us an overall score.
The MS Marco passage ranking development (dev) set consists of 6,980 queries.
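
To make the metric concrete, here is a tiny worked example in Python with made-up rank positions (not actual MS Marco results):

def rr_at_10(rank_of_first_relevant):
    # rank_of_first_relevant is None if no relevant hit was found in the top 10
    return 0.0 if rank_of_first_relevant is None else 1.0 / rank_of_first_relevant

ranks = [1, 2, None, 4]  # position of the first relevant hit for four queries
mrr = sum(rr_at_10(r) for r in ranks) / len(ranks)
print(mrr)               # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375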

The query relevance judgment list for the development (dev) set is in the public domain. Researchers can compare methods based on this.
The judgements for the eval query set are not in the public domain. Researchers, or industry practitioners, need to submit their ranking for the queries in the test set to have the MRR@10 evaluated and the ranking run listed on the leaderboard. Note that the MS Marco ranking leaderboards are not run time constrained, so many of the submissions take days of computation to produce ranked lists for the queries in the dev and eval splits.

There is unfortunately a lot of confusion in the industry on how BERT can successfully be used for text ranking.
The IR research field has moved so fast since the release of BERT in late 2018 that the textbooks on text ranking are already outdated.
Since there is no textbook, industry practitioners need to look at how the research community is applying BERT or Transformer models for ranking.
BERT is a pre-trained language model, and to use it effectively for document or passage ranking, it needs to be fine-tuned for retrieval or ranking.
For examples of not so great ways to use BERT for ranking, see How not to use BERT for Document ranking.

As demonstrated in Pretrained Transformers for Text Ranking: BERT and Beyond, pre-trained language models of the Transformer family achieve best accuracy for text ranking and question answering tasks when used as an interaction model with all-to-all cross-attention between the query and document.
Generally, there are 3 ways to use Transformer models for text ranking and all of them require training data to fine tune for retrieval or ranking.


Figure from ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT illustrating various deep neural networks for ranking. In the following section we give an overview of three of these methods using Transformer models.

Representation based ranking using Transformer models

It is possible to use the Transformer model as the underlying deep neural network for representation based learning. Given training data, one can learn a representation of documents and queries so that relevant documents are closer or more similar in this representation than irrelevant documents. The representation based ranking approach falls under the broad representation learning research field and representation learning can be applied to text, images, videos or even combinations (multi-modal representations).
For text ranking, queries and documents are embedded into a dense embedding space using one or two Transformer models (bi-encoders).
The embedding representation is learned by the training examples given to the model. Once the model has been trained, one can pre-compute the embeddings for all documents, and
at query time, use nearest neighbor search for efficient document retrieval.


Document retrieval using dense embedding models, is commonly referred to as dense retrieval.
The query and the document are embedded independently, and during training the model is given examples of the form (query, relevant passage, negative passage).
The model weights are adjusted per batch of training triplets.
The embedding representation from the Transformer model could be based on, for example, the CLS token of BERT (Classification Token), or on a pooling strategy over the last Transformer layer.

The huge benefit of using representation based similarity on top of Transformer models is that the document representation can be produced offline by encoding them through the trained transformer and unless the model changes, this only needs to be done once when indexing the document. At online serving time, the serving system only needs to obtain the query embedding by running the query through the transformer model and use the resulting query embedding vector as the input to a nearest neighbor search in the dense embedding space to find relevant documents. On the MS Marco Passage ranking set, dense retrieval using a learned representation has demonstrated good results over the last year or so. Dense retrievers achieve much better accuracy (MRR@10 and Recall@1000) than sparse traditional search using exact lexical matching (e.g BM25) and the current state-of-the-art uses a dense retriever as the first phase candidate selection for re-ranking using a more sophisticated (and computationally expensive) all-to-all interaction model.

Since the query is usually short, the online encoding complexity is relatively low and encoding latency is acceptable even on a CPU serving stack. Transformer models with full all-to-all cross-attention have quadratic run time complexity in the input sequence length, so the shorter the input sequence, the better the performance. Most online serving systems can also cache the query embedding representation to save computations and reduce latency.
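
A minimal sketch of this serving pattern (an illustration only; the document embeddings below are random placeholders for the output of a trained bi-encoder, and a real system would use an approximate nearest neighbor index rather than a brute-force scan):

import numpy as np

# Placeholder for embeddings produced offline by a trained bi-encoder.
doc_embeddings = np.random.rand(1000, 384).astype(np.float32)

def retrieve(query_embedding: np.ndarray, k: int = 10):
    # Score every document with a dot product and return the top-k
    # (document id, score) pairs.
    scores = doc_embeddings @ query_embedding
    top_k = np.argsort(-scores)[:k]
    return list(zip(top_k.tolist(), scores[top_k].tolist()))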

All to all interaction ranking using Transformers

The "classic" way to use BERT for ranking is to use it as an all-to-all interaction model where both the query and the document are fed through the Transformer model simultaneously, not independently as with the representation based ranking model. For BERT this is usually accomplished with a classification layer on top of the CLS token output, and the ranking task is converted into a classification task where one classifies whether the document is relevant for the query or not (binary classification). This approach is called monoBERT, vanilla BERT, or BERT cat (categorization). It's a straightforward approach and in line with the suggestions of the original BERT paper for how to use BERT for task-specific fine-tuning.


Similar to the representation model, all-to-all interaction models need to be trained with triplets, and the way we sample the negative (irrelevant) examples is important for the overall effectiveness of the model. The first BERT submission to the MS Marco passage ranking leaderboard used monoBERT to re-rank the top 1K documents from a more efficient sparse first-phase retriever (BM25).

With all to all interaction there is no known way to efficiently pre-compute the document representation offline. Running online inference with cross-attention models over all documents in a collection is computationally prohibitively expensive even for large organizations like Google or Microsoft, so to deploy it for production one needs a way to reduce the number of candidate documents which are fully evaluated using the all to all cross attention model. This has led to increased interest in multi-stage retrieval and ranking architectures but also more efficient Transformer models without quadratic complexity due to the cross attention mechanisms (all to all attention).
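
A minimal sketch of monoBERT-style scoring, assuming the Hugging Face transformers library and torch are installed; the checkpoint name below is an assumption (any sequence-classification model fine-tuned for MS Marco style relevance works):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def score(query: str, passage: str) -> float:
    # Query and passage are fed through the Transformer together (all-to-all
    # cross-attention); the classification head outputs a relevance score.
    inputs = tokenizer(query, passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()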

Late Interaction using Transformers

An alternative approach for using Transformers for ranking was suggested in ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.

Unlike the all-to-all query-document interaction model, the late contextualized interaction over BERT enables processing the documents offline, since the per-document token contextual embeddings are generated independently of the query tokens. The embedding outputs of the last Transformer layer are calculated at document indexing time and stored in the document. For a passage of 100 tokens we end up with 100 embedding vectors of dimensionality n, where n is a tradeoff between ranking accuracy and storage (memory) footprint. The dimensionality does not necessarily need to be the same as the Transformer model's hidden size. Using 32 dimensions per token embedding gives almost the same accuracy as the larger 768 dimensions of BERT base. Similarly, one can use low precision like float16 or quantization (int8) to reduce the memory requirements per dimension.


The ability to process the documents and obtain the per-term contextual embeddings offline significantly speeds up online query evaluation, since at query time one only needs a single pass through the Transformer model with the query to obtain the contextual query term embeddings, and can then calculate the proposed MaxSim operator over the pre-computed per-term contextualized embeddings of the documents we want to re-rank.

Similar to the pure representation-based model, we only need to encode the query through the Transformer model at query time. The query tokens only attend to other query tokens, and similarly, document tokens only attend to other document tokens.
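
A minimal sketch of the MaxSim scoring step (an illustration only, assuming L2-normalized token embeddings so the dot product is the cosine similarity):

import numpy as np

def max_sim(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim).
    # For each query token, take the maximum similarity over all document
    # tokens, then sum over the query tokens.
    sim = query_embs @ doc_embs.T
    return float(sim.max(axis=1).sum())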

As demonstrated in the

Build a News recommendation app from python with Vespa: Part 3

Thiago Martins

Vespa Data Scientist


Part 3 – Efficient use of click-through rate via parent-child relationship.

UPDATE 2023-02-14: Code examples are updated to work with the latest releases of
pyvespa.

This part of the series introduces a new ranking signal: category click-through rate (CTR). The idea is that we can recommend popular content for users that don’t have a click history yet. Rather than just recommending based on articles, we recommend based on categories. However, these global CTR values can often change continuously, so we need an efficient way to update this value for all documents. We’ll do that by introducing parent-child relationships between documents in Vespa. We will also use sparse tensors directly in ranking. This post replicates this more detailed Vespa tutorial.

Decorative image

Photo by AbsolutVision on Unsplash

We assume that you have followed the part2 of the news recommendation tutorial. Therefore, you should have an app_package variable holding the news app definition and a Docker container named news running the application fed with data from the demo version of the MIND dataset.

Setting up a global category CTR document

If we added a category_ctr field in the news document, we would have to update all the sports documents every time there was a change in the sports CTR statistic. If we assume that the category CTR will change often, this turns out to be inefficient.

For these cases, Vespa introduced the parent-child relationship. Parents are global documents, which are automatically distributed to all content nodes. Other documents can reference these parents and “import” values for use in ranking. The benefit is that the global category CTR values only need to be written to one place: the global document.

from vespa.package import Schema, Document, Field

app_package.add_schema(
    Schema(
        name="category_ctr",
        global_document=True,
        document=Document(
            fields=[
                Field(
                    name="ctrs", 
                    type="tensor<float>(category{})", 
                    indexing=["attribute"], 
                    attribute=["fast-search"]
                ), 
            ]
        )
    )
)

We implement that by creating a new category_ctr schema and setting global_document=True to indicate that we want Vespa to keep a copy of these documents on all content nodes. Setting a document to be global is required for using it in a parent-child relationship. Note that we use a tensor with a single sparse dimension to hold the ctrs data.

Sparse tensors have strings as dimension addresses rather than a numeric index. More concretely, an example of such a tensor is (using the tensor literal form):

{
    {category: entertainment}: 0.2,
    {category: news}: 0.3,
    {category: sports}: 0.5,
    {category: travel}: 0.4,
    {category: finance}: 0.1,
    ...
}

This tensor holds all the CTR scores for all the categories. When updating this tensor, we can update individual cells, and we don’t need to update the whole tensor. This operation is called tensor modify and can be helpful when you have large tensors.
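
As a sketch of what a single-cell update could look like against the Vespa Document API (the endpoint and port assume the local Docker deployment used in this tutorial series; verify the exact payload format against the Vespa documentation on tensor modify updates):

import requests

update = {
    "fields": {
        "ctrs": {
            "modify": {
                "operation": "replace",
                "cells": [{"address": {"category": "sports"}, "value": 0.06}],
            }
        }
    }
}
requests.put(
    "http://localhost:8080/document/v1/category_ctr/category_ctr/docid/global",
    json=update,
)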

Importing parent values in child documents

We need to set up two things to use the category_ctr tensor for ranking news documents. We need to reference the parent document (category_ctr in this case) and import the ctrs from the referenced parent document.

app_package.get_schema("news").add_fields(
    Field(
        name="category_ctr_ref",
        type="reference<category_ctr>",
        indexing=["attribute"],
    )
)

The field category_ctr_ref is a field of type reference of the category_ctr document type. When feeding this field, Vespa expects the fully qualified document id. For instance, if our global CTR document has the id id:category_ctr:category_ctr::global, that is the value that we need to feed to the category_ctr_ref field. A document can reference many parent documents.

from vespa.package import ImportedField

app_package.get_schema("news").add_imported_field(
    ImportedField(
        name="global_category_ctrs",
        reference_field="category_ctr_ref",
        field_to_import="ctrs",
    )
)

The imported field defines that we should import the ctrs field from the document referenced in the category_ctr_ref field. We name this as global_category_ctrs, and we can reference this as attribute(global_category_ctrs) during ranking.

Tensor expressions in ranking

Each news document has a category field of type string indicating which category the document belongs to. We want to use this information to select the correct CTR score stored in the global_category_ctrs. Unfortunately, tensor expressions only work on tensors, so we need to add a new field of type tensor called category_tensor to hold category information in a way that can be used in a tensor expression:

app_package.get_schema("news").add_fields(
    Field(
        name="category_tensor",
        type="tensor<float>(category{})",
        indexing=["attribute"],
    )
)

With the category_tensor field as defined above, we can use the tensor expression sum(attribute(category_tensor) * attribute(global_category_ctrs)) to select the specific CTR related to the category of the document being ranked. We implement this expression as a Function in the rank-profile below:

from vespa.package import Function

app_package.get_schema("news").add_rank_profile(
    RankProfile(
        name="recommendation_with_global_category_ctr", 
        inherits="recommendation",
        functions=[
            Function(
                name="category_ctr", 
                expression="sum(attribute(category_tensor) * attribute(global_category_ctrs))"
            ),
            Function(
                name="nearest_neighbor", 
                expression="closeness(field, embedding)"
            )
            
        ],
        first_phase="nearest_neighbor * category_ctr",
        summary_features=[
            "attribute(category_tensor)", 
            "attribute(global_category_ctrs)", 
            "category_ctr", 
            "nearest_neighbor"
        ]
    )
)

In the new rank-profile, we have added a first-phase ranking expression that multiplies the nearest-neighbor score with the category CTR score, implemented with the functions nearest_neighbor and category_ctr, respectively. Simply multiplying the two values is a first attempt and might not be the best way to combine them.
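
To build intuition for what the category_ctr function computes, here is a plain-Python toy equivalent (not Vespa code) of the cell-wise product followed by a sum over the sparse category dimension:

# One-hot "category tensor" for a sports article, and a (truncated) global CTR tensor.
category_tensor = {"sports": 1.0}
global_category_ctrs = {"sports": 0.0561, "news": 0.0442, "travel": 0.0251}

# The cell-wise product followed by a sum selects the CTR of the document's category.
category_ctr = sum(
    value * global_category_ctrs.get(category, 0.0)
    for category, value in category_tensor.items()
)
print(category_ctr)  # 0.0561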

Deploy

We can reuse the same container named news created in the first part of this tutorial.

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Finished deployment.

Feed

Next, we will download the global category CTR data, already parsed in the format that is expected by a sparse tensor with the category dimension.

import requests, json

global_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/global_category_ctr_parsed.json").text
)
global_category_ctr
{
    'ctrs': {
        'cells': [
            {'address': {'category': 'entertainment'}, 'value': 0.029266420380943244},
            {'address': {'category': 'autos'}, 'value': 0.028475809103747123},
            {'address': {'category': 'tv'}, 'value': 0.05374837981352176},
            {'address': {'category': 'health'}, 'value': 0.03531784305129329},
            {'address': {'category': 'sports'}, 'value': 0.05611187986670051},
            {'address': {'category': 'music'}, 'value': 0.05471192953054426},
            {'address': {'category': 'news'}, 'value': 0.04420778372641991},
            {'address': {'category': 'foodanddrink'}, 'value': 0.029256852366228187},
            {'address': {'category': 'travel'}, 'value': 0.025144552013730358},
            {'address': {'category': 'finance'}, 'value': 0.03231013195899643},
            {'address': {'category': 'lifestyle'}, 'value': 0.04423279317474416},
            {'address': {'category': 'video'}, 'value': 0.04006693315980292},
            {'address': {'category': 'movies'}, 'value': 0.03335647459420146},
            {'address': {'category': 'weather'}, 'value': 0.04532171803495617},
            {'address': {'category': 'northamerica'}, 'value': 0.0},
            {'address': {'category': 'kids'}, 'value': 0.043478260869565216}
        ]
    }
}

We can feed this data point as a document of the category_ctr schema and assign it the id global. The document can then be referenced by the full Vespa id id:category_ctr:category_ctr::global.

response = app.feed_data_point(schema="category_ctr", data_id="global", fields=global_category_ctr)

We need to perform a partial update on the news documents to include information about the reference field category_ctr_ref and the new category_tensor that will have the value 1.0 for the specific category associated with each document.

news_category_ctr = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/news/news_category_ctr_update_parsed.json").text
)
news_category_ctr[0]
{
    'id': 'N3112',
    'fields': {
        'category_ctr_ref': 'id:category_ctr:category_ctr::global',
        'category_tensor': {
            'cells': [
                { 'address': {'category': 'lifestyle'}, 'value': 1.0}
            ]
        }
    }
}

This takes ten minutes or so:

for data_point in news_category_ctr:
    response = app.update_data(schema="news", data_id=data_point["id"], fields=data_point["fields"])

Testing the new rank-profile

We will redefine the query_user_embedding function defined in the second part of this tutorial and use it to make a query involving the user U33527 and the recommendation_with_global_category_ctr rank-profile.

def parse_embedding(hit_json):
    # The embedding values are returned as a list of floats under fields.embedding.values
    return list(hit_json["fields"]["embedding"]["values"])

def query_user_embedding(user_id):
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}'".format(user_id)})
    return parse_embedding(result.hits[0])

yql = "select * from sources news where " \
      "({targetHits:10}nearestNeighbor(embedding, user_embedding))"
result = app.query(
    body={
        "yql": yql,        
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U33527")),
        "ranking.profile": "recommendation_with_global_category_ctr"
    }
)

The first hit below is a sports article. The global CTR document is also listed here, and the CTR score for the sports category is 0.0561. Thus, the result of the category_ctr function is 0.0561 as intended. The nearest_neighbor score is 0.149, and the resulting relevance score is 0.00836. So, this worked as expected.

{
    'id': 'id:news:news::N5316',
    'relevance': 0.008369192847921151,
    'source': 'news_content',
    'fields': {
        'sddocname': 'news',
        'documentid': 'id:news:news::N5316',
        'news_id': 'N5316',
        'category': 'sports',
        'subcategory': 'football_nfl',
        'title': "Matthew Stafford's status vs. Bears uncertain, Sam Martin will play",
        'abstract': "Stafford's start streak could be in jeopardy, according to Ian Rapoport.",
        'url': "https://www.msn.com/en-us/sports/football_nfl/matthew-stafford's-status-vs.-bears-uncertain,-sam-martin-will-play/ar-BBWwcVN?ocid=chopendata",
        'date': 20191112,
        'clicks': 0,
        'impressions': 1,
        'summaryfeatures': {
            'attribute(category_tensor)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'sports'}, 'value': 1.0}
                ]
            },
            'attribute(global_category_ctrs)': {
                'type': 'tensor<float>(category{})',
                'cells': [
                    {'address': {'category': 'entertainment'}, 'value': 0.029266420751810074},
                    {'address': {'category': 'autos'}, 'value': 0.0284758098423481},
                    {'address': {'category': 'tv'}, 'value': 0.05374838039278984},
                    {'address': {'category': 'health'}, 'value': 0.03531784191727638},
                    {'address': {'category': 'sports'}, 'value': 0.05611187964677811},
                    {'address': {'category': 'music'}, 'value': 0.05471193045377731},
                    {'address': {'category': 'news'}, 'value': 0.04420778527855873},
                    {'address': {'category': 'foodanddrink'}, 'value': 0.029256852343678474},
                    {'address': {'category': 'travel'}, 'value': 0.025144552811980247},
                    {'address': {'category': 'finance'}, 'value': 0.032310131937265396},
                    {'address': {'category': 'lifestyle'}, 'value': 0.044232793152332306},
                    {'address': {'category': 'video'}, 'value': 0.040066931396722794},
                    {'address': {'category': 'movies'}, 'value': 0.033356472849845886},
                    {'address': {'category': 'weather'}, 'value': 0.045321717858314514},
                    {'address': {'category': 'northamerica'}, 'value': 0.0},
                    {'address': {'category': 'kids'}, 'value': 0.043478261679410934}
                ]
            },
            'rankingExpression(category_ctr)': 0.05611187964677811,
            'rankingExpression(nearest_neighbor)': 0.14915188666574342,
            'vespa.summaryFeatures.cached': 0.0
        }
    }
}

Conclusion

This tutorial introduced parent-child relationships and demonstrated them through a global CTR feature used in ranking. We also introduced ranking with (sparse) tensor expressions.

Clean up Docker container instances:

vespa_docker.container.stop()
vespa_docker.container.remove()

Pretrained Transformer Language Models for Search – part 2

Photo by Rob Fuller on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa.ai. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset. We demonstrate how to achieve close to state of the art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

In the first post in this series we introduced using pre-trained models for ranking. In this second post we study efficient candidate retrievers which can be used to efficiently find candidate documents which are re-ranked using more advanced models.

Multiphase retrieval and ranking

Due to the computational complexity of cross-interaction transformer models, there has been renewed interest in multiphase retrieval and ranking. In a multiphase retrieval and ranking pipeline, the first phase retrieves candidate documents using a cost-efficient retrieval method, and the more computationally complex cross-attention or late-interaction model inference is limited to the top-ranking documents from the first phase.

Multi-stage retrieval and ranking architecture (illustration from Phased ranking with Vespa)

The three phases illustrated in the diagram are per content node, which retrieves and re-ranks a subset of the total document volume. In addition, one can also re-rank the global top-scoring documents after the results from the nodes involved in the query are merged to find the globally best documents. This step might also involve diversification of the result set before final re-ranking.

Broadly, there are two categories of efficient sub-linear retrieval methods:

  • Sparse retrieval using lexical term matching over inverted indexes, potentially accelerated by the WAND algorithm
  • Dense retrieval using dense vector representation of queries and documents, potentially accelerated by approximate nearest neighbor search algorithms

In the next sections we take a deep dive into these two methods, evaluate their effectiveness on the MS Marco Passage Ranking relevancy dataset, and show how the two methods can be combined with Vespa.

Sparse lexical retrieval

Classic information retrieval (IR) relies on lexical matching, which has been around since the early days of Information Retrieval. One example of a popular lexical retrieval scoring function is BM25. Retrieval can be done in sub-linear time using inverted indexes, accelerated by dynamic pruning algorithms like WAND. Dynamic pruning algorithms avoid exhaustively scoring all documents which match at least one of the query terms. In the below Vespa document schema we declare a minimal passage document type which we can use to index the MS Marco Passage ranking dataset introduced in post 1.

search passage {
  document passage {
    field text type string {
      indexing: summary | index
      index: enable-bm25
    }
    field id type int {
      indexing: summary | attribute
    }
  }
  fieldset default {
    fields: text
  }
  rank-profile bm25 {
    first-phase {
      expression: bm25(text)
    }
  }
}

We define a text field which we populate with the passage text. The indexing directive controls how the field is handled: summary means that the text should be returned in the search result page, and index specifies that we want to build inverted index data structures for efficient search and matching. We also define a ranking profile with only a single ranking phase using the Vespa bm25(name) text ranking feature, one of many built-in Vespa text matching ranking features.
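
For intuition, the bm25(text) rank feature computes a score along the lines of the classic BM25 formula. Below is a simplified Python sketch; the document frequencies are made up, and Vespa's exact implementation and parameter defaults may differ:

import math

def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        # Inverse document frequency: rare terms contribute more
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # Saturating term-frequency component with length normalization
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score

doc = "charles de gaulle cdg airport is close to paris".split()
made_up_doc_freq = {"cdg": 10, "airport": 1000, "paris": 5000}
print(bm25(["cdg", "airport", "paris"], doc, made_up_doc_freq,
           num_docs=8_800_000, avg_doc_len=60))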

Once we have indexed our data we can search using the Vespa HTTP POST query api:

  {
    "yql": "select id,text from passage where userQuery();",
    "hits": 10,
    "query": "is cdg airport in main paris?",
    "ranking.profile": "bm25",
    "type": "all"
  }
  • The yql parameter is the Vespa query language; userQuery() is a reference to the query parameter
  • The hits parameter controls the number of hits in the Vespa response
  • The query parameter contains the free text input query from the end user, expressed in the simple query language
  • The ranking.profile parameter chooses the ranking profile to use for the query
  • The type parameter specifies the query type (all, any, phrase), which controls the boolean query logic. all requires that all query terms are found in the document, while any requires that at least one of the query terms matches in the document.

If we use the above query to search the MS Marco Passages, we end up ranking only 2 passages and the query takes 7 ms. If we change type to any instead of all, we end up ranking 7,926,256 passages (89% of the total collection) and the query takes 120 ms. The exact timing obviously depends on the hardware and the number of threads used to evaluate the query, but the main point is that brute-force matching of all documents which contain at least one term is expensive, while requiring all terms to match is too restrictive and fails to recall the relevant documents. So what is the solution to this problem? How can we find the relevant documents without having to fully score almost all passages in the collection?

Meet the dynamic pruning algorithm WAND

The WAND algorithm is described in detail in

Efficient Query Evaluation using a Two-Level Retrieval Process (PDF)

We have determined that our algorithm significantly reduces the total number of full evaluations by more
than 90%, almost without any loss in precision or recall.
At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND or
Weak AND that might be of independent interest

Vespa implements WAND as a query operator; below is an example of how to use it with our query example from above:

 {
    "yql": "select id, text from passage where ([{\"targetNumHits\": 10}]weakAnd(default contains \"is\", default contains \"cdg\", default contains \"airport\", default contains \"in\", default contains \"main\", default contains \"paris\"));",
    "hits": 10,
    "ranking.profile": "bm25"
  }

The above WAND query fully ranks only 2,409 passages using the bm25 ranking profile, and recall at the first positions is the same as with brute-force any, so we did not lose any accuracy but saved a lot of resources: using the weakAnd operator, the query takes 12 ms instead of 120 ms.
WAND is best used via a custom searcher plugin to avoid tokenization outside of Vespa, which might introduce asymmetric behaviour.
See for example RetrievalModelSearcher,
or use weakAnd.replace,
which rewrites type any queries to use WAND instead.

There are two WAND/WeakAnd implementations in Vespa. In the above example we used weakAnd(), which fully integrates with text processing (tokenization and index statistics like IDF (Inverse Document Frequency)). The alternative is wand(), where the end user controls the query and document side weights explicitly. The wand() operator can be used to implement DeepCT and HDCT: Context-Aware Term Importance Estimation For First Stage Retrieval, as Vespa gives the user full control of query and document term weighting without having to bloat the regular index by repeating terms to increase or lower the term frequency. Read more in Using WAND with Vespa.

Dense Retrieval using bi-encoders over Transformer models

Embedding-based models embed or map queries and documents into a latent, low-dimensional dense embedding vector space and use vector search to retrieve documents. Dense retrieval can be accelerated by approximate nearest neighbor search, for example by indexing the document vector representations using HNSW graph indexing. In-domain dense retrievers based on the bi-encoder architecture, trained on MS Marco passage data, have demonstrated that they can outperform sparse lexical retrievers by a large margin. Let us introduce using dense retrievers with Vespa.

In this example we use a pre-trained dense retriever model from Huggingface 🤗, sentence-transformers/msmarco-MiniLM-L-6-v3. The model is based on MiniLM and the output layer has 384 dimensions. The model has just 22.7M trainable parameters, and encoding the query using a quantized model takes approximately 8 ms on CPU. The original model uses mean pooling over the last layer of the MiniLM model, but we also add an L2 normalization to normalize vectors to unit length so that we can use the innerproduct distance metric instead of the angular distance metric. This saves computation during the approximate nearest neighbor search.
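
A sketch of encoding a query with this bi-encoder and normalizing it to unit length (assuming the sentence-transformers Python package is installed; the query text is the example used later in this post):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-v3")

# Encode the query and L2-normalize it so that inner product equals cosine similarity.
query_embedding = model.encode("is cdg airport in main paris?")
query_embedding = query_embedding / np.linalg.norm(query_embedding)
print(query_embedding.shape)  # (384,)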

We expand our passage document type with a dense tensor field mini_document_embedding and a new ranking profile.

search passage {
  document passage {
    field text type string {
      indexing: summary | index
      index: enable-bm25
    }
    field mini_document_embedding type tensor<float>(d0[384]) {
      indexing: attribute | index
      attribute {
        distance-metric: innerproduct
      }
      index {
        hnsw {
          max-links-per-node: 32
          neighbors-to-explore-at-insert: 500
        }
      }
    }
    field id type int {
      indexing: summary | attribute
    }
  }
  fieldset default {
    fields: text
  }
  rank-profile bm25 {
    first-phase {
      expression: bm25(text)
    }
  }
  rank-profile dense {
    first-phase {
      expression: closeness(field, mini_document_embedding)
    }
  }
}

The mini_document_embedding tensor is dense (denoted by d0[384]) and has dimensionality 384 (determined by the Transformer model we use, and possible linear dimension reduction). We use float resolution (4 bytes) for the tensor cell values (valid choices are double, bfloat16 and int8). We also define an HNSW index for the field and set 2 HNSW indexing parameters, which is an accuracy-versus-performance tradeoff. See HNSW for details. Accuracy is typically measured by recall@k, comparing brute-force nearest neighbor search with the approximate nearest neighbor search at level k. The dense ranking profile specifies how we want to rank (or actually re-rank) our documents; in this case we use the closeness ranking feature. Documents close to the query in the embedding space are ranked higher than documents which are far away. At indexing time we need to convert the passage text into the dense vector representation and index it. At query time, we need to encode the query and use approximate nearest neighbor search:

  {
   "yql": "select id, text from passage where ([{\"targetNumHits\": 10}]nearestNeighbor(mini_document_embedding, query_embedding));",
   "hits": 10,
   "query": "is cdg airport in main paris?",
   "ranking.profile": "dense",
   "ranking.features.query(query_embedding)": [0.08691329, -0.046273664, -0.010773866,..,..]
  }

In the above example we use the Vespa nearestNeighbor query operator to retrieve the 10 closest documents in embedding space for the input query embedding vector passed in the ranking.features.query(query_embedding) parameter. In this example, query encoding (the forward pass of the query encoder to obtain the query embedding) is done outside Vespa, but we can also represent the query encoding model inside Vespa, avoiding complicating our online serving deployment setup:

Representing the bi-encoder model inside Vespa

To represent the bi-encoder query model in Vespa we need to export the Huggingface PyTorch model into ONNX format for efficient serving in Vespa.
We include a notebook in this
sample application
which demonstrates how to transform the model and export it to ONNX format.
Vespa supports evaluating ONNX models for ranking and query encoding.
To speed up evaluation on CPU we use a quantized (int8) version.
We have demonstrated how to represent query encoders in
Dense passage retrieval with nearest neighbor search.

Hybrid Dense Sparse Retrieval

Recent research indicates that combining dense and sparse retrieval could improve the recall, see for example A Replication Study of Dense Passage Retriever. The hybrid approach combines dense and sparse retrieval but requires search technology which supports both sparse lexical and dense retrieval. Vespa.ai supports hybrid retrieval in the same query by combining the WAND and ANN algorithms. There are two ways to do this:

Disjunction (OR)

  {
   "yql": "select id, text from passage where 
   ([{\"targetNumHits\": 10}]nearestNeighbor(mini_document_embedding, query_embedding)) or  
   ([{\"targetNumHits\": 10}]weakAnd(default contains \"is\"...));",
   "hits": 10,
   "query": "is cdg airport in main paris?",
   "ranking.profile": "hybrid",
   "ranking.features.query(query_embedding)": [0.08691329, -0.046273664, -0.010773866,..,..]
  }

In the above example we combine ANN with WAND using OR disjunction, and we have a hybrid ranking profile which can combine the dense and sparse ranking signals (e.g. bm25 and vector distance/closeness). Approximately 10 + 10 documents will be exposed to the first-phase ranking function (depending on targetNumHits). It is then up to the first-phase ranking expression to combine the scores of these two different retrieval methods into a final score. See A Replication Study of Dense Passage Retriever for examples of parameter weighting. For example, it could look something like this:

rank-profile hybrid {
  first-phase {
    expression: 0.7*bm25(text) + 2.9*closeness(field, mini_document_embedding)
  }
}


Pretrained Transformer Language Models for Search – part 3

Photo by Frank Busch on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa.ai. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset.
We demonstrate how to achieve close to state-of-the-art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

In the first post in this series
we introduced using pre-trained language models for ranking and three popular methods for using them for text ranking.
In the second post
we studied efficient retrievers that could be used as the first phase in a multiphase retrieval and ranking pipeline.
In this third post we study a re-ranking model which we will deploy as a re-ranker on top of the retriever methods
we studied in the previous post, but first let us recap what a multiphase retrieval and ranking pipeline is.
In a multiphased retrieval and ranking pipeline,
the first phase retrieves candidate documents using a cost-efficient retrieval method
and the more computationally complex cross-attention or late interaction model inference
is limited to the top ranking documents from the first phase.
In this post we will study the Contextualized late interaction over BERT (ColBERT) model
and deploy it as a re-ranking phase on top of the dense retriever that we studied in the previous post.
The ColBERT ranking model was introduced in
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
by Omar Khattab and Matei Zaharia.

Contextualized late interaction over BERT (ColBERT)

In the previous post in this series we introduced a dense retriever using a bi-encoder architecture over a Transformer model (MiniLM). Both queries and documents were encoded by the bi-encoder and represented in the same dense embedding vector space.
We used cosine similarity between the query and the document in this embedding vector space to rank documents for a query,
and we could accelerate the retrieval phase using approximate nearest neighbor search
using angular distance or innerproduct.

Unlike the dense bi-encoder, the contextualized late interaction model represents the query and document as multiple vectors obtained from the last output layer of the Transformer model. Bi-encoders, on the other hand, usually perform a pooling operation over the last transformer layer, e.g. just using the embedding representation of the CLS output token, or the mean over all token output embeddings.
Also, unlike static text-to-vector representations like Word2Vec, the token vector representation depends on the other tokens in the same input sequence. For example, the token driver in the text Microsoft driver has a different vector representation than driver in the text Taxi driver, as the context is different. This is thanks to the attention mechanism in the Transformer architecture, where each token attends to all other tokens in the same input sequence. We can say that the token output vector representation is contextualized by the other tokens in the input text sequence.
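
A small sketch of this effect using the Hugging Face transformers library (bert-base-uncased is chosen purely for illustration):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embedding(text, token):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(token)]

a = token_embedding("Microsoft driver", "driver")
b = token_embedding("Taxi driver", "driver")
# The two embeddings of the same token differ because the context differs.
print(torch.nn.functional.cosine_similarity(a, b, dim=0))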

Similar to the single-vector bi-encoder model, queries and documents are encoded independently: the query tokens only attend to other query tokens, and document tokens only attend to other document tokens. This separation enables offline processing of the documents, which speeds up re-ranking since at re-ranking time we only need to obtain the query token embeddings and load the precomputed document embeddings from storage (e.g. memory). The ColBERT architecture uses a query encoder and a document encoder based on the same Transformer instance, but the input to the model is different for queries and documents. The query encoder pads the query with the BERT mask token up to a configurable maximum query length if the query input text is shorter than this max length. The document input is not padded to a fixed length.
The padding of masked tokens of the query input is explained in the paper:

We denote the padding with masked tokens as query augmentation, a step that allows BERT to produce query-based embeddings at the positions corresponding to these masks. Query augmentation is intended to serve as a soft, differentiable mechanism for learning to expand queries with new terms or to re-weigh existing terms based on their importance for matching the query

The dimensionality used to represent the output token embedding can be reduced using a dimension reduction layer on top of the last output transformer layer. The original token output dimensionality depends on the Transformer model used; for example, the bert-base model uses 768 dimensions while MiniLM uses 384 dimensions. In the ColBERT paper the authors use dimension reduction to 128 dimensions from the original hidden size of 768 dimensions. The authors also demonstrate that reducing the dimensionality further to 32 does not impact ranking accuracy significantly. Both the dimensionality and the precision used for the vector values matter for the computational complexity and the storage requirements.
For example, if we use 32 dimensions with bfloat16 (2 bytes per tensor value) precision, we need to store 32GB of vector data for 9M documents with an average of 60 tokens per document,
while 128 dimensions with float32 (4 bytes) precision end up at about 256GB of vector data.
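
A quick back-of-the-envelope check of those storage numbers (Python arithmetic, using the 9M documents and 60 tokens per document stated above):

docs = 9_000_000
tokens_per_doc = 60

# 32 dimensions at bfloat16 (2 bytes per value) versus 128 dimensions at float32 (4 bytes per value).
bytes_32d_bf16 = docs * tokens_per_doc * 32 * 2
bytes_128d_fp32 = docs * tokens_per_doc * 128 * 4

print(bytes_32d_bf16 / 1024**3)   # ~32 GiB
print(bytes_128d_fp32 / 1024**3)  # ~257 GiB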

Ranking with ColBERT – Meet MaxSim

We now know roughly how the ColBERT architecture works: query text is encoded into a fixed-length bag of token embeddings, and document text is encoded into a bag of token embeddings. The missing piece is how we compute the relevancy score of a query-document pair using this representation.

The ColBERT paper introduces the late interaction similarity function which they name Maximum Similarity (MaxSim): For a given query and document pair the MaxSim relevancy score is calculated as follows:

For each query token embedding, compute the cosine similarity against all the document token embeddings and track the maximum score per query token.
The overall query-document score is the sum of these maximum cosine scores.
For a query with 32 token embeddings (max query length 32) and a document with 128 tokens, we need to perform 32*128 cosine similarity operations. The MaxSim operator is illustrated in the figure below.

MaxSim illustration from the ColBERT paper

The cosine similarity with unit length vectors can be performed by the inner dot product,
and can be HW accelerated using advanced vector instructions.
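
A minimal NumPy sketch of the MaxSim computation using that dot-product trick (toy random embeddings, L2-normalized so the dot product equals cosine similarity):

import numpy as np

def max_sim(query_embeddings: np.ndarray, doc_embeddings: np.ndarray) -> float:
    # (num_query_tokens, num_doc_tokens) cosine similarity matrix
    similarities = query_embeddings @ doc_embeddings.T
    # Max over document tokens per query token, then sum over query tokens
    return float(similarities.max(axis=1).sum())

rng = np.random.default_rng(0)

def unit_vectors(rows, dims):
    x = rng.normal(size=(rows, dims))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

print(max_sim(unit_vectors(32, 128), unit_vectors(120, 128)))  # 32 query tokens vs 120 doc tokens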

Vespa ColBERT representation

To represent the ColBERT model architecture in Vespa for re-ranking documents, we need to:

  • Store the document token embeddings in our Vespa document model for fast, on-demand, access in ranking phases
  • Express the MaxSim function with a Vespa ranking expression
  • Map the query text to token ids, and map tokens to token embeddings at run time by invoking the ColBERT query encoder transformer model

We expand the Vespa document schema from the previous post and introduce a new mixed Vespa tensor field called dt. We use this tensor to store the computed bag of token embeddings for the document.
The mixed tensor (combining sparse dt, and indexed x dimensions) allows storing a dynamic number of token embeddings for the document, depending on the length of the document. We could have used an indexed representation, but that would have used more memory as we would need to determine a max document length.

Vespa Passage document schema

The new document schema including the new dt ColBERT document tensor is given below:

search passage {
  document passage {
    field id type int {...}
    field text type string {...}
    field mini_document_embedding type tensor<float>(d0[384]) {...}
    field dt type tensor<bfloat16>(dt{}, x[32]) {
      indexing: attribute
      attribute: fast-search
    }
  }
}

The tensor cell value precision type we use is bfloat16, which is 2 bytes per tensor cell value and saves 50% of the memory compared to float precision (4 bytes per value). Vespa supports double, float, bfloat16 and int8 tensor cell value precision types.

We also use 32 dimensions for the per-token embedding representation instead of 128, to further reduce the memory requirement. The indexing statement specifies attribute, which means this field is stored in memory, and fast-search enables a fast, uncompressed in-memory representation which speeds up evaluation over mixed tensor fields. fast-search is only relevant for mixed tensor type fields.

Vespa MaxSim operator

We can express the MaxSim operator in Vespa by a tensor ranking expression using sum and reduce tensor functions.

 
sum(
  reduce(
    sum(query(qt) * attribute(dt), x),
    max, dt
  ),
  qt
)

Where attribute(dt) is the ColBERT document tensor field and query(qt) is the ColBERT query tensor representation.
The query(qt) tensor is defined in the
passage schema:

query(qt) tensor<float>(qt{},x[32])

We configure the MaxSim operator in a Vespa ranking profile,
where we use the dense bi-encoder model as our first-phase ranking function and use the ColBERT MaxSim as the second phase ranking expression.
We use re-ranking count of 1000 (per node),
this setting can also be controlled by a query time setting,
in case we want to explore different re-ranking depths.
The ranking profile is given below.
In this case, we also cast the bfloat16 tensor values
to float to enable HW accelerations in place for operations on float tensors.

rank-profile dense-colbert {
  first-phase {
    expression: closeness(field,mini_document_embedding)
  }
  second-phase {
    rerank-count: 1000
    expression {
      sum(
        reduce(
          sum(
              query(qt) * cell_cast(attribute(dt), float) , x
          ),
          max, dt
         ),
         qt
      )
    }
  }
}

To obtain the query(qt) ColBERT tensor we need to encode the text query input using the ColBERT query encoder.

Vespa ColBERT query encoder

We have trained a ColBERT model using a 6-layer MiniLM model which can be downloaded from the Huggingface model hub. This model only has 22.7M trainable parameters and can be served with Vespa using the ONNX format. We have also included a notebook which demonstrates how to export the PyTorch transformer model to ONNX format and how to use quantization to further speed up evaluation: quantized (int8) weights instead of float speed up evaluation of the model by 3x. See the Google colab notebook.

The query encoder is represented in a query document type which has no fields.
It’s a placeholder to be able to represent the ONNX model,
and we use a single empty document so that we can invoke the Vespa ranking framework to evaluate the ONNX model.

schema query {
  document query {}
  onnx-model colbert_encoder {
    file: files/vespa-colMiniLM-L-6-quantized.onnx
    input input_ids: query(input_ids)
    input attention_mask: query(attention_mask)
    output contextual:contextual 
  }
  rank-profile colbert_query_encoder {
    num-threads-per-search: 1
    first-phase {
      expression: random 
    }
    summary-features {
      onnxModel(colbert_encoder).contextual
    }
  }
}

Tokenization and the tensor inputs (input_ids and attention_mask) are generated using a
custom searcher which maps the query text to BERT token ids
and creates the ColBERT masked query input.
See ColBERTSearcher for details.
This searcher produces the mentioned query(qt) tensor which is used by the MaxSim ranking expression.
We use the ColBERT repo’s indexing routine to produce the document token embeddings,
and we also publish a pre-processed dataset with all 8.8M passages including both the mini_document_embedding and ColBERT tensor fields.
See MS Marco Passage Ranking using Transformers vespa sample application.

We evaluate the ranking effectiveness of the ColBERT model deployed as a re-ranking step on top of the dense retriever introduced in the previous post. We use MS Marco Passage Ranking dev query split (6980 queries):

| Retrieval method        | Ranking      | MRR@10 | Recall@100 | Recall@200 | Recall@1000 |
|-------------------------|--------------|--------|------------|------------|-------------|
| weakAnd (sparse)        | bm25         | 0.185  | 0.66       | 0.73       | 0.85        |
| nearestNeighbor (dense) | innerproduct | 0.310  | 0.82       | 0.87       | 0.94        |
| nearestNeighbor (dense) | ColBERT      | 0.359  | 0.86       | 0.90       | 0.94        |

The Recall@1000 does not change as the model is used to re-rank the top 1K hits from the dense retriever. The Recall@100 and Recall@200 metrics improve with the ColBERT re-ranking step and MRR@10 improves from 0.310 to 0.359.
The end-to-end latency, including query encoding of the dense retriever model, ColBERT query encoding, retrieval with nearest neighbor search (with targetHits=1000) and re-ranking with ColBERT is just 39 ms. Reducing the nearest neighbor search targetHits, and also the re-ranking depth of the ColBERT model can be used to trade accuracy versus cost.

$ ./src/main/python/evaluate_passage_run.py --rank_profile dense-colbert --rerank_hits 1000 --retriever dense  --ann_hits 1000 --hits 10  --trec_format --run_file dev.test --query_split dev --endpoint https://$ENDPOINT:4443/search/
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6980/6980 [04:27<00:00, 26.07it/s]


Build a basic text search application from python with Vespa: Part 2

Thiago Martins

Vespa Data Scientist


Evaluate search engine experiments using Python.

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to perform search engine experiments.

UPDATE 2023-02-13: Code examples and links are updated to work with the latest releases of
pyvespa
and learntorank.

Photo by Eugene Golovesov on Unsplash

We show how to use the pyvespa API to run search engine experiments based on the text search app we built in the first part of this tutorial series. Specifically, we compare two different matching operators and show how to reduce the number of documents matched by the queries while keeping similar recall and precision metrics.

We assume that you have followed the first tutorial and have a variable app holding the Vespa connection instance that we established there. This connection should be pointing to a Docker container named cord19 running the Vespa application.

Feed additional data points

We will continue to use the CORD19 sample data
that fed the search app in the first tutorial.
In addition, we are going to feed a few additional data points to make it possible to get relevant metrics from our experiments.
We tried to minimize the amount of data required to make this tutorial easy to reproduce.
You can download the additional 494 data points below:

from pandas import read_csv

parsed_feed = read_csv("https://data.vespa.oath.cloud/blog/cord19/parsed_feed_additional.csv")
parsed_feed.head(5)

Feed data

We can then feed the data we just downloaded to the app via the feed_data_point method:

for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )

Define query models to compare

A QueryModel is an abstraction that encapsulates all the relevant information controlling how your app matches and ranks documents. Since we are dealing with a simple text search app here, we will start by creating two query models that use BM25 to rank but differ on how they match documents.

from learntorank.query import QueryModel, OR, WeakAnd, Ranking

or_bm25 = QueryModel(
    name="or_bm25",
    match_phase=OR(), 
    ranking=Ranking(name="bm25")
)

The first model is named or_bm25 and will match all the documents that share at least one token with the query.

from learntorank.query import WeakAnd

wand_bm25 = QueryModel(
    name="wand_bm25", 
    match_phase=WeakAnd(hits=10), 
    ranking=Ranking(name="bm25")
)

The second model is named wand_bm25 and uses the WeakAnd operator, considered an accelerated OR operator. The next section shows that the WeakAnd operator matches fewer documents without affecting the recall and precision metrics for the case considered here. We also analyze the optimal hits parameter to use for our specific application.

Run experiments

We can define which metrics we want to compute when running our experiments.

from learntorank.evaluation import MatchRatio, Recall, NormalizedDiscountedCumulativeGain

eval_metrics = [
    MatchRatio(), 
    Recall(at=10), 
    NormalizedDiscountedCumulativeGain(at=10)
]

MatchRatio computes the fraction of the document corpus matched by the queries. This metric will be critical when comparing match phase operators such as the OR and the WeakAnd. In addition, we compute Recall and NDCG metrics.
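
As a toy sketch (plain Python with made-up numbers, not the learntorank implementation) of what these metrics capture:

def match_ratio(retrieved_docs: int, docs_available: int) -> float:
    # Fraction of the document corpus matched by a query
    return retrieved_docs / docs_available

def recall_at_10(top_10_ids, relevant_ids) -> float:
    # Fraction of the relevant documents found among the top 10 hits
    return len(set(top_10_ids) & set(relevant_ids)) / len(relevant_ids)

print(match_ratio(6_700, 10_000))                            # 0.67
print(recall_at_10(["kqqantwg", "526elsrf"], ["kqqantwg"]))  # 1.0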

We can download labeled data to perform our experiments and compare query models. In our sample data, we have 50 queries, and each has a relevant document associated with them.

import json, requests

labeled_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/cord19/labeled_data.json").text
)
labeled_data[:3]
[{'query_id': 1,
  'relevant_docs': [{'id': 'kqqantwg', 'score': 2}],
  'query': 'coronavirus origin'},
 {'query_id': 2,
  'relevant_docs': [{'id': '526elsrf', 'score': 2}],
  'query': 'coronavirus response to weather changes'},
 {'query_id': 3,
  'relevant_docs': [{'id': '5jl6ltfj', 'score': 1}],
  'query': 'coronavirus immunity'}]

Evaluate

Once we have labeled data, the evaluation metrics to compute, and the query models we want to compare, we can run experiments with the evaluate method. The cord_uid field of the Vespa application should match the id of the relevant documents.

from learntorank.evaluation import evaluate

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
)
evaluation

Evaluate

The result shows that, on average, we match 67% of our document corpus when using the OR operator and 21% when using the WeakAnd operator. The reduction in matched documents did not affect the recall and the NDCG metrics, which stayed at around 0.84 and 0.40, respectively. The Match Ratio will get even better when we experiment with the hits parameter of the WeakAnd further down in this tutorial.

There are different options available to configure the output of the evaluate method.

Specify summary statistics

The evaluate method returns the mean, the median, and the standard deviation of the metrics by default. We can customize this by specifying the desired aggregators. Below we choose the mean, the max, and the min as an example.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean", "min", "max"]
)
evaluation

Summaries

Check detailed metrics output

Some of the metrics have intermediate results that might be of interest. For example, the MatchRatio metric requires us to compute the number of matched documents (retrieved_docs) and the number of documents available to be retrieved (docs_available). We can output those intermediate steps by setting detailed_metrics=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
    detailed_metrics=True
)
evaluation

detailed

Get per-query results

When debugging the results, it is often helpful to look at the metrics on a per-query basis, which is available by setting per_query=True.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=[or_bm25, wand_bm25], 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    per_query=True
)
evaluation.head(5)

per-query

Find optimal WeakAnd parameter

We can use the same evaluation framework to find the optimal hits parameter of the WeakAnd operator for this specific application. To do that, we can define a list of query models that only differ by the hits parameter.

wand_models = [QueryModel(
    name="wand_{}_bm25".format(hits), 
    match_phase=WeakAnd(hits=hits), 
    ranking=Ranking(name="bm25")
) for hits in range(1, 11)]

We can then call evaluate as before and show the match ratio and recall for each of the options defined above.

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    query_model=wand_models, 
    eval_metrics=eval_metrics, 
    id_field="cord_uid",
    aggregators=["mean"],
)
evaluation.loc[["match_ratio", "recall_10"], ["wand_{}_bm25".format(hits) for hits in range(1, 11)]]

optimal

As expected, we can see that a higher hits parameter implies a higher match ratio. But the recall metric remains the same as long as we pick hits > 3. So, using WeakAnd with hits = 4 is enough for this specific application and dataset, leading to a further reduction in the number of documents matched on average by our queries.

Clean up:

vespa_docker.container.stop()
vespa_docker.container.remove()

Conclusion

We want to enable Vespa users to run their experiments from python. This tutorial illustrates how to define query models and evaluation metrics to run search engine experiments via the evaluate method. We used a simple example that compares two different match operators and another that optimizes the parameter of one of those operators. Our key finding is that we can reduce the size of the retrieved set of hits without losing recall and precision by using the WeakAnd instead of the OR match operator.

The following Vespa resources are related to the topics explored by the experiments presented here:

Pretrained Transformer Language Models for Search – part 4

Photo by Patrick Hendry on Unsplash

Updated 2022-10-21: Added links and clarified some sections

In this blog series we demonstrate how to represent transformer models in a multiphase retrieval and ranking pipeline using Vespa.ai. We also evaluate these models on the largest Information Retrieval relevance dataset, namely the MS Marco Passage ranking dataset. We demonstrate how to achieve close to state of the art ranking using miniature transformer models with just 22M parameters, beating large ensemble models with billions of parameters.

Blog posts in this series:

In the first post in this series we introduced using pre-trained language models for ranking and three popular methods for using them for text ranking. In the second post we studied efficient retrievers which could be used as the first phase in a multiphase retrieval and ranking pipeline. In the third post we studied the ColBERT re-ranking model.

In this fourth and last post in our blog post series on pre-trained transformer models for search,
we introduce a cross-encoder model with all-to-all interaction between the query and the passage.

We deploy this model as the final ranking stage in our multiphase retrieval and ranking pipeline and submit the ranking results to the MS Marco Passage Ranking Leaderboard.

In addition, we benchmark the serving performance of all the retrieval and ranking methods introduced in this blog post series.
Finally, we also release a vespa sample application,
which lets you try out these state-of-the-art retrieval and ranking methods.

Introduction

In this blog post we study the third option for using transformer models for search and document ranking.
This option is the simplest model to configure and use in Vespa, but also the most computationally expensive model in our multi-phase retrieval and ranking pipeline.
With the cross-attention model we input both the query and the passage to the model, and as we know by now,
the computational complexity of the transformer is quadratic with respect to the input length.
Doubling the sequence length increases the computational complexity by 4x.

The cross-encoder model is a transformer-based model with a classification head on top of the Transformer CLS token (classification token).
The model has been fine-tuned using the MS Marco passage training set and is a binary classifier which classifies
whether a query-document pair is relevant or not.

The cross-encoder model is also based on a 6-layer MiniLM model with only 22.7M parameters, same as the transformer models previously introduced in this blog series. As with the other two transformer models we introduced in previous posts in this series, we integrate this model in Vespa using ONNX format. We demonstrate how to export the model(s) from PyTorch/Transformers to ONNX format in this notebook. The model is hosted on the Huggingface model hub.

We use a quantized version where the original float weights have been quantized to int8 representation to speed up inference on cpu.

Vespa representation of the cross-encoder model

In previous posts we have introduced the Vespa passage schema.
We add a new tensor field to our schema and in this tensor field we will store the transformer token ids of the processed text.
We haven’t described this in detail before, but the MiniLM model uses as input the sequence of the numeric token ids from the
fixed BERT token vocabulary of about 30K unique tokens or subwords.

For example the passage:

Charles de Gaulle (CDG) Airport is close to Paris

Is tokenized to:

['charles', 'de', 'gaulle', '(', 'cd', '##g', ')', 'airport', 'is', 'close', 'to', 'paris']

The subword tokens are mapped to token ids from the fixed vocabulary, e.g ‘charles’ maps to token id 2798.
The example passage text is represented as a tensor by:

[2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000]

We use the native Vespa WordPiece embedder
to map the text into tensor representation.
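
For illustration outside of Vespa, the same mapping can be reproduced with the Hugging Face transformers tokenizer (a sketch; the exact ids assume the standard bert-base-uncased WordPiece vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Charles de Gaulle (CDG) Airport is close to Paris"

tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)     # ['charles', 'de', 'gaulle', '(', 'cd', '##g', ')', ...] as shown above
print(token_ids)  # [2798, 2139, 28724, 1006, 3729, 2290, 1007, ...]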

The passage document schema,
including the new text_token_ids field:

search passage {
  document passage {
    field id type int {...} 
    field text type string {...}
    field mini_document_embedding type tensor<float>(d0[384]){...}
    field dt type tensor<bfloat16>(dt{}, x[32]){..}
  }

  field text_token_ids type tensor<float>(d0[128])  {
    indexing: input text | embed tokenizer | attribute | summary
    attribute: paged
  }
}

We store a maximum of 128 tokens, denoted by d0[128]. This is an example of an indexed Vespa tensor type.

Vespa ranking with cross-encoder model

We are going to use the dense retriever model, accelerated by Vespa's approximate nearest neighbor search, to
efficiently retrieve passages for re-ranking with our transformer-based ranking models. The retrieved hits are
re-ranked with the ColBERT model introduced in the third post,
and finally the top-ranking documents from the ColBERT model are re-ranked using the cross-encoder.

The retrieval and ranking pipeline have two re-ranking depth parameters.

  • How many are re-ranked with ColBERT is determined by the target number of hits passed to the nearest neighbor query operator.
  • The number of documents that are re-ranked using the final cross-encoder model is determined by the rank-profile rerank-count property.

See phased ranking with Vespa.
Both these parameters impact end-to-end serving performance and also ranking accuracy as measured by MRR@10.

Both the nearest neighbor search target number of hits and rerank-count are per content node involved in the query.
This is only relevant for deployments where the document corpus cannot be indexed on a single node due to either space constraints (memory, disk) or serving latency constraints.

Defining the MiniLM cross-encoder

schema passage {
  document passage {...}

  onnx-model minilmranker {
    file: files/ms-marco-MiniLM-L-6-v2-quantized.onnx
    input input_ids: input_ids
    input attention_mask: attention_mask
    input token_type_ids: token_type_ids
  }
}

In the above snippet we define the ONNX model and its inputs; each input is mapped to a function declared later in the ranking profile. Each function produces a tensor
which is used as input to the model. The file points to the ONNX model file, placed in src/main/application/files/.
Vespa takes care of distributing the model to the content node(s). The inputs
to the model are standard transformer inputs (input_ids, attention_mask and token_type_ids).

The first part of the ranking profile where we define the 3 input functions to the BERT model looks like this:

  rank-profile dense-colbert-mini-lm {
    function input_ids() {
       expression: tokenInputIds(128, query(query_token_ids), attribute(text_token_ids))
    }
    function token_type_ids() {
      expression: tokenTypeIds(128, query(query_token_ids), attribute(text_token_ids))
    }
    function attention_mask() {
      expression: tokenAttentionMask(128, query(query_token_ids), attribute(text_token_ids))
    }
}

For example, the model input input_ids is mapped to the function input_ids, which is defined as:

  function input_ids() {
       expression: tokenInputIds(128, query(query_token_ids), attribute(text_token_ids))
    }

tokenInputIds is a built-in Vespa ranking feature
which builds the transformer model input, including special tokens like CLS and SEP.

We pass the query(query_token_ids) tensor, which is sent with the query, and the passage token ids, which are read from the in-memory attribute field (text_token_ids).

The query tensor representation (query(query_token_ids)) is created in a custom query processor RetrievalModelSearcher
which converts the free text query input from the
user to a tensor representation using the same BertTokenizer as used by the custom document processor.

For example for a text query

is CDG in paris?

The query tensor representation becomes:

[2003, 3729, 2290, 1999, 3000, 1029]

The tokenInputIds ranking function will create the concatenated tensor of both query and passage, including the special tokens. Using the example passage
from the previous section with the above query example, our concatenated output with special tokens becomes:

[101, 2003, 3729, 2290, 1999, 3000, 1029, 102, 2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000, 102]

Where 101 is the CLS token id and 102 is the SEP token separating the query from the passage.
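
As a plain-Python illustration of how that concatenated input is assembled (this mirrors the example above, not Vespa's internal implementation; padding and truncation to the 128-token maximum are omitted):

CLS, SEP = 101, 102

query_token_ids = [2003, 3729, 2290, 1999, 3000, 1029]
passage_token_ids = [2798, 2139, 28724, 1006, 3729, 2290, 1007, 3199, 2003, 2485, 2000, 3000]

# [CLS] query [SEP] passage [SEP]
input_ids = [CLS] + query_token_ids + [SEP] + passage_token_ids + [SEP]
print(input_ids)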

Cross-Encoder Model

The above figure illustrates the input and output of the cross-encoder transformer model.

Notice the CLS output embedding, which is fed into the classification layer that predicts the class label (relevant = 1, irrelevant = 0).

Now that we have presented how to represent the cross-encoder model, we can present the remaining parts of our
ranking profile:

rank-profile dense-colbert-mini-lm {
    ...

    function maxSimNormalized() {
      expression {
        sum(
          reduce(
            sum(
              query(qt) * attribute(dt), x
            ),
            max, dt
          ),
          qt
        )/32.0
       }
    }
    function dense() {
      expression: closeness(field, mini_document_embedding)
    }
    
    function crossModel() {
      expression: onnx(minilmranker){d0:0,d1:0}
    }
    
    first-phase {
        expression: maxSimNormalized()
    }
    
    second-phase {
      rerank-count: 24
      expression: 0.2*crossModel() + 1.1*maxSimNormalized() + 0.8*dense()
    }
}

The maxSimNormalized function computes the ColBERT MaxSim function which we introduced in post 3.
Here we also normalize the MaxSim score by dividing it by 32, which is the configured maximum ColBERT query encoder query length,
so each term contributes a maximum score of 1.

The dense() function calculates the cosine similarity as computed
by the dense retriever introduced in post 2.

In the crossModel() function we calculate the score from cross-encoder introduced in this blog post:

function crossModel() {
  expression: onnx(minilmranker){d0:0,d1:0}
}

The {d0:0,d1:0} access the logit score. (d0:0 is the batch dimension, which always is of size 1, and d1:0 access the logit score, which is a proxy for the relevancy).

Ranking profile summarized

  • Retrieve efficiently using the dense retriever model – this is done by the Vespa approximate nearest neighbor search query operator.
  • The k passages retrieved by the nearest neighbor search are re-ranked using the ColBERT MaxSim operator. k is set by the target hits used for the nearest neighbor search.
  • In the last phase, the top-ranking 24 passages from the previous phase are evaluated by the cross-attention model.
  • The final ranking score is a linear combination of all three ranking scores. The rerank-count can also be adjusted by a query parameter.

Observe that reusing scores from the previous ranking phases does not impact serving performance,
as they are only evaluated once (per hit) and cached.

The linear weights of the three different transformer scores were obtained by a simple grid search, observing the ranking accuracy on the dev query split while changing the parameters.

MS Marco Passage Ranking Submission

We submitted a run to the MS Marco Passage Ranking leaderboard where we used targetHits 1K for the approximate nearest neighbor search,
so that 1K passages are re-ranked using the ColBERT model and finally 96 passages are re-ranked with the cross-encoder model.

Passage Ranking

Our multi-phase retrieval and ranking pipeline with 3 miniature models performed pretty well,
even beating large models using T5 with 3B parameters.
See MS Marco Passage Ranking Leaderboard.

| Model                                   | Eval  | Dev   |
|-----------------------------------------|-------|-------|
| BM25 (Official baseline)                | 0.165 | 0.167 |
| BM25 (Lucene8, tuned)                   | 0.190 | 0.187 |
| Vespa dense + ColBERT + cross-attention | 0.393 | 0.403 |

Multi-threaded retrieval and ranking

Vespa has the ability to use multiple threads per search query.
This can reduce search latency, as the document retrieval and ranking
for a single query is partitioned so that each thread works on a subset of the searchable documents in an index.
The number of threads to use is controlled per rank profile,
but cannot exceed the global setting controlled in the application services.xml.

To find optimal settings, we recommend benchmarking starting with one thread per search and increasing until latency does not improve significantly.
See Vespa Scaling Guide for details.

Serving performance versus ranking accuracy

In this section we perform benchmarking where we deploy the system on a Vespa cloud instance using
2 x Xeon Gold 6263CY 2.60GHz (HT enabled, 48 cores, 96 threads) with 256GB memory.

We use a single content node indexing the 9M passages.
All query encodings with the MiniLM-based query encoders, retrieval, and re-ranking are performed on this content node.
We also use 2 stateless container nodes with 16 v-cpu each to make sure that we are benchmarking the content node performance.
See Vespa overview on
stateless container nodes versus content nodes.

Running everything of importance on the same node enables us to quantitatively compare the performance of the methods we have introduced in this blog post series.
We benchmark throughput per retrieval and ranking model until we reach about 70% CPU utilization,
and compare the obtained throughput and latency. We also include tail latency (99.9th percentile) in the reported results.

We use the vespa-fbench benchmarking utility to
load the cluster (by increasing the number of clients to reach about 70% CPU utilization).

Billion-scale vector search with Vespa – part one


Photo by Federico Beccari
on Unsplash

NASA estimates that the Milky Way galaxy
contains between
100 and 400 billion stars. A mind-blowingly large number of stars and solar systems,
and also a stunning sight from planet Earth on a clear night.

Many organizations face challenges involving star-sized datasets, even orders of magnitude larger.
AI-powered representations of unstructured data like image, text, and video have enabled search applications
we could not foresee just a few years ago.
For example, in a previous blog post, we covered vision and video search applications using
AI-powered vector representations.

Searching star-sized vector datasets using (approximate)
nearest neighbor search is a fascinating and complex problem with many trade-offs:

  • Accuracy of the approximate versus the exact nearest neighbor search
  • Latency and scaling
  • Scaling search volume (searches per second)
  • Batch indexing versus real-time indexing and handling of updates and vector meta-data
  • Distributed search in large vector datasets which do not fit into a single content node
  • Cost ($), total cost of ownership, including development efforts

This blog series looks at how to represent and search billion-scale vector datasets using Vespa, and we cover many of the mentioned
trade-offs.

In this first post we look at compact binary-coded vector representations which can reduce storage and computational complexity
of both exact and approximate nearest neighbor search. For those who are new to Vespa, we recommend reading the
Vespa overview and the
Vespa quick start guide before diving into this post.

Deep Neural Hashing is a catchy phrase
for using deep neural networks for
learning
compact hash-like binary-coded representations. The goal of representation
learning, deep or not, is to transform any data into a suitable representation
that retains the essential information needed to solve a particular task, for
example, search or retrieval. Representation learning for retrieval usually involves
supervised learning with labeled or pseudo-labeled data from user-item interactions.

Many successful text retrieval systems use supervised representation
learning to transform text queries and documents into continuous
high-dimensional real-valued vector representations. Query and document
similarity, or relevancy, is reduced to a vector distance metric in the
representational vector space. To efficiently retrieve from a large
collection of documents, one can turn to approximate nearest neighbor search
algorithms instead of exact nearest neighbor search.
See for example the
pre-trained transformer language models for search blog post for
an introduction to state-of-the-art text retrieval using dense vector representations.

Recently, exciting research has demonstrated that it is possible to learn a
compact hash-like binary code representation instead of a dense continuous
vector representation without much accuracy loss.
In
Efficient Passage Retrieval with Hashing for Open-domain Question Answering, the authors describe
using a hashing layer on top of the bi-encoder transformer architecture
to train a binary coded representation of documents and queries
instead of continuous real-valued representation.


Illustration from
Efficient Passage Retrieval with Hashing for Open-domain Question Answering

Note that the search is performed in two phases: first a coarse-level search using the
hamming distance over binary codes, then a re-ranking phase using the continuous query vector representation and
an unpacked vector representation derived from the binary code.
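To make the two-phase idea concrete, here is a minimal NumPy sketch (our illustration, not the paper's implementation) of the fine-level step: the packed binary code of a candidate document is unpacked back into a ±1 vector and scored against the continuous query vector with an inner product. The function and variable names are hypothetical.

import numpy as np

def unpack_to_signed(binary_code, dims=128):
    # Unpack the packed bits and map {0, 1} -> {-1.0, +1.0}
    bits = np.unpackbits(binary_code.view(np.uint8))[:dims]
    return bits.astype(np.float32) * 2.0 - 1.0

# Hypothetical example: a continuous query vector and a candidate's packed binary code
query_vector = np.random.normal(size=128).astype(np.float32)
doc_code = np.packbits(np.random.randint(0, 2, 128)).astype(np.int8)

# Fine-level score: inner product between the continuous query
# and the unpacked (approximate) document representation
score = float(np.dot(query_vector, unpack_to_signed(doc_code)))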

A huge advantage over continuous vector
representations is that the binary-coded document representation significantly reduces
the document storage requirements. For example, representing text documents in a
768-dimensional vector space where each dimension uses float precision, the
storage requirement per document becomes 3072 bytes. Using a 768-bit
binary-coded representation instead, the storage requirement per document
becomes 96 bytes, a 32x reduction.
In the mentioned paper, the authors demonstrate that the entire
English Wikipedia consisting of 22M passages can be reduced to 2GB of binary codes.
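As a quick back-of-the-envelope check of the numbers above (plain Python, not tied to any Vespa API):

dims = 768
float_bytes = dims * 4              # 32-bit float per dimension -> 3072 bytes
binary_bytes = dims // 8            # 1 bit per dimension, packed -> 96 bytes
print(float_bytes // binary_bytes)  # 32x reduction

passages = 22_000_000               # English Wikipedia passages from the paper
print(passages * binary_bytes / 1e9)  # roughly 2.1 GB of binary codes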

Searching in binary-coded representations can be done using the hamming distance metric.
The hamming distance between code q and code d is simply the number of bit
positions that differ or, in other words, the number of bits that need to flip
to convert q into d. Hamming distance is efficiently implemented on CPU
architectures using few instructions (xor combined with population count). In
addition, unlike an exact hash table lookup, hamming distance search makes it
possible to rank a set of binary codes by their distance to a binary-coded query.
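As an illustration (a small NumPy sketch of the same idea, not Vespa's internal implementation), the hamming distance between two packed binary codes can be computed with an xor followed by a population count:

import numpy as np

def hamming_distance(q, d):
    # xor the packed bytes, then count the number of set (differing) bits
    return int(np.unpackbits(np.bitwise_xor(q, d).view(np.uint8)).sum())

# Two hypothetical 128-bit codes, each packed into 16 bytes
q = np.packbits(np.random.randint(0, 2, 128)).astype(np.int8)
d = np.packbits(np.random.randint(0, 2, 128)).astype(np.int8)
print(hamming_distance(q, d))  # number of differing bit positions (0..128)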

Compact binary-coded representations paired with hamming
distance are also successfully used for large-scale vision search.
See for example these papers:

Vespa has first-class citizen support for representing high dimensional dense
vectors using the
Vespa tensor field type. Dense vectors are represented as
dense first-order tensors in Vespa. Tensor cell values in Vespa support four
numeric precision types,
in order of increasing precision:

  • int8 (8 bits, 1 byte) per dimension
  • bfloat16 (16 bits, 2 bytes) per dimension
  • float (32 bits, 4 bytes) per dimension
  • double (64 bits, 8 bytes) per dimension

All of these are signed data types. In addition, for dense first-order tensors
(vectors), Vespa supports fast approximate nearest neighbor search using Vespa’s
HNSW implementation.
Furthermore, the nearest neighbor search operator in Vespa
supports several
distance metrics, including euclidean, angular, and bitwise
hamming distance.

To represent binary-coded vectors in Vespa, one should use first-order tensors
with the int8 tensor cell precision type. For example, to represent a 128-bit code using
Vespa tensors, one can use a 16-dimensional dense (indexed) tensor using int8 value type
(16*8 = 128 bits). The
Vespa document schema below defines a numeric id field
of type int, representing the vector document id in addition to the binary-coded
vector using a dense first-order tensor. The b denotes the tensor dimension
name, and finally, [16] denotes the dimensionality.

schema code {
  document code {
    field id type int {}
    field binary_code type tensor<int8>(b[16]) {
      indexing: attribute
      attribute { 
        distance-metric:hamming
      } 
    }
  }
}

Using big-endian ordering for the binary-coded representation, the
most significant bits of the binary code, at positions 0 to 7 inclusive, form the first
vector element at index 0, the bits at positions 8 to 15 inclusive form the second
vector element, and so on. For example, the following snippet uses NumPy
with python to
demonstrate a way to create a binary-coded representation from a 128-dimensional
real-valued vector representation by using an untrained thresholding function (sign
function):

import numpy as np
import binascii

# Sample a 128-dimensional real-valued vector
vector = np.random.normal(3, 2.5, size=(128))
# Binarize with the sign (thresholding) function and pack the 128 bits into 16 int8 values
binary_code = np.packbits(np.where(vector > 0, 1, 0)).astype(np.int8)
# Hexadecimal string representation of the 16 bytes, for example:
str(binascii.hexlify(binary_code), 'utf-8')
'e7ffef5df3bcefffbfffff9fff77fdc7'

The above hexadecimal string representation
can be fed to Vespa using the
Vespa JSON feed format.

{
  "id": "id:bigann:code::834221",
  "fields": {
    "id": 834221,
    "binary_code": {
      "values": "e7ffef5df3bcefffbfffff9fff77fdc7"
    }
  } 
}

Indexing in Vespa is real-time and documents become searchable within single digit
millisecond latency at high throughput. The JSON feed files can be indexed with
high throughput using the
Vespa feed client.

Feeding output stream
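For completeness, here is a small, hypothetical Python sketch of how such a JSON feed file could be generated (the file name and document ids are made up; the field names and document id format match the examples above). The resulting JSONL file can then be fed with the Vespa feed client.

import json
import binascii
import numpy as np

def binary_code_hex(vector):
    # Binarize with the sign function, pack 128 bits into 16 bytes, and hex-encode
    code = np.packbits(np.where(vector > 0, 1, 0)).astype(np.int8)
    return str(binascii.hexlify(code), 'utf-8')

with open("feed.jsonl", "w") as out:
    for doc_id in range(3):  # a few made-up documents
        vector = np.random.normal(size=128)
        doc = {
            "id": "id:bigann:code::%d" % doc_id,
            "fields": {
                "id": doc_id,
                "binary_code": {"values": binary_code_hex(vector)}
            }
        }
        out.write(json.dumps(doc) + "\n")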

Dense first-order tensors can either be in memory all the time or paged in from
secondary storage on-demand at query time, for example, during scoring and
ranking phases in a
multiphased retrieval and ranking pipeline. In any case,
the data is
persisted and stored for durability and data re-balancing.

Furthermore, supporting in-memory and
paged dense first-order tensors enables
storing the original vector representation in the document model at a relatively
low cost (storage hierarchy economics). For example, the following document schema keeps
the original float precision vector on disk using the paged tensor attribute option.

schema code {
  document code {
    field id type int {} 
    field binary_code type tensor<int8>(b[16]) {
      indexing: attribute
      attribute { 
        distance-metric: hamming 
      }
    }
    field vector type tensor<float>(r[128]) {
      indexing: attribute
      attribute: paged
    }
  }
}

Storing the original vector representation on disk using the
paged
attribute option enables phased retrieval and ranking close to the data. First,
one can use the compact in-memory binary code for the coarse level search to
efficiently find a limited number of candidates. Then, the pruned
candidates from the coarse search can be re-scored and re-ranked using a more
advanced scoring function over the original document and query representations. Once
a document is retrieved and exposed to the ranking phases, one can also use more
sophisticated scoring models, for example using Vespa’s support for evaluating
ONNX models.


A two-phased coarse to fine level search using hamming distance as the coarse-level search.
Illustration from
Deep Learning of Binary Hash Codes for Fast Image Retrieval (pdf)
.

The binary-coded representation and the original vector are co-located on the
same Vespa content node(s)
since they live in the same document object. Thus,
re-ranking or fine-level search using the real-valued vector representation does not require any
network round trips to fetch the original vector representation from some
external key-value storage system.

In addition, co-locating both the coarse and fine representations avoids
data consistency and synchronization issues between different data stores.

Using Vespa’s nearest neighbor search query operator, one can search for the
semantically similar documents using the hamming distance metric. The following
example uses the exact version and does not use the approximate version using
Vespa’s HNSW indexing support. The next blog post in this series will compare
exact with approximate search. The following Vespa document schema defines
two Vespa ranking profiles:

schema code {
  document code {
    field id type int {..} 
    field binary_code type tensor<int8>(b[16]) {..}
 }
 rank-profile coarse-ranking {
    num-threads-per-search:12
    first-phase { expression { closeness(field,binary_code) } } 
 }
 rank-profile fine-ranking inherits coarse-ranking {
    second-phase { 
      rerank-count:200
      expression { .. } 
    } 
 }
}

The coarse-ranking
profile ranks documents by the
closeness rank feature, which in our case is the inverse hamming distance.
By default, Vespa sorts documents by descending relevancy score,
hence the closeness(field,name) rank feature uses
1/(1 + distance()) as the relevance score.

The observant reader might have noticed the
num-threads-per-search ranking profile setting.
This setting allows parallelizing the search and ranking using
multiple CPU threads, reducing the overall serving latency at the cost of
increased CPU usage per query. This allows better use of multicore CPU architectures.

The second ranking profile fine-ranking inherits the first phase
ranking function from the coarse-ranking profile and re-ranks the top k results using a more sophisticated model,
for example using the original representation.

The nearest neighbor search is expressed using the Vespa YQL query
language in a Query API HTTP(S) request.

A sample JSON POST query is given below, searching for the 10 nearest neighbors of a binary coded query vector query(q_binary_code):

{
  "yql": "select id from vector where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code));",
  "ranking.profile": "coarse-ranking",
  "ranking.features.query(q_binary_code): [-18,-14,28,...],
  "hits":10
}
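For illustration, the sketch below (our own example, assuming a local Vespa instance on http://localhost:8080 and the requests library) builds the signed int8 query tensor from a binarized query vector and issues the query as a JSON POST to the Query API:

import numpy as np
import requests  # assumed available; any HTTP client works

# Binarize a (hypothetical) 128-dimensional query vector and convert the
# 16 packed bytes to signed int8 values, matching the int8 tensor cell type
query_vector = np.random.normal(size=128)
q_binary_code = np.packbits(np.where(query_vector > 0, 1, 0)).astype(np.int8).tolist()

body = {
    "yql": "select id from code where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code));",
    "ranking.profile": "coarse-ranking",
    "ranking.features.query(q_binary_code)": q_binary_code,
    "hits": 10
}
response = requests.post("http://localhost:8080/search/", json=body)
print(response.json())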

Similarly, using the fine-ranking profile we can also pass the original query vector representation, which might be
used in the second-phase ranking expression.

{
  "yql": "select id from vector where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code));",
  "ranking.profile": "fine-ranking",
  "ranking.features.query(q_binary_code): [-18,-14,28,...],
  "ranking.features.query(q_vector_real): [-0.513,-0.843,0.034,...],
  "hits":10
}

Vespa allows combining the nearest neighbor search query operator with
other query operators and filters. Using filtering reduces the complexity of the nearest neighbor search as fewer candidates
are evaluated. Evaluating fewer documents saves memory bandwidth and CPU instructions.

See also this blog post for more examples of
combining the nearest neighbor query operator with filters. An example of filtering on a bool field type
is given below.

{
  "yql": "select id from vector where ([{\"targetHits\":10}]nearestNeighbor(binary_code, q_binary_code)) and is_visible=true;",
  "ranking.profile": "coarse-ranking",
  "ranking.features.query(q_binary_code): [-18,-14,28,...],
  "hits":10
}

In the above query examples we use the
short dense (indexed) tensor input format.
Note that query input tensors do not support the compact hex string representation. The above examples also assume that an external
system performs the binarization.
Vespa also supports importing ONNX models so that the binarization
could be performed in the Vespa stateless cluster before searching the content cluster(s),
see stateless model evaluation for examples and discussion.

This post introduced our blog post series on billion-scale vector search. Furthermore, we took a deep dive into representing binary codes using
Vespa’s tensor field with int8 tensor cell precision.
We also covered coarse-level to fine-level search and ranking using hamming
distance as the coarse-level nearest neighbor search distance metric.

In the next blog post in this series we will
experiment with a billion-scale vector dataset from big-ann-benchmarks.com.
We will be indexing it using a single Vespa content node, and we will experiment with using both exact and approximate vector search with hamming distance.

The focus of the next post will be to demonstrate some of the mentioned trade-offs from the introduction:

  • Real-time