Improving Product Search with Learning to Rank – part one



Photo by Pawel Czerwinski on Unsplash

Over a few busy years, we have witnessed a neural ranking paradigm shift, where
pre-trained language models, such as BERT,
have revolutionized text ranking
in data-rich settings with lots of labeled data.
Feeding data-hungry neural networks lots of labeled relevance judgments
has proved more effective than methods based on statistical (tabular) text ranking features.

However, historically, tree-based machine learning models
such as Gradient Boosting (GB)
with lambdarank loss have prevailed as state-of-the-art for product search ranking.
In this blog post series, we look at improving product search using learning to rank techniques,
with demonstrable results on the largest publicly available product ranking dataset.
We will train ranking models and measure their effectiveness on this large dataset,
without getting into the details of complex loss functions.

In this first post, we introduce a product ranking dataset and establish multiple ranking baselines
that do not use the labeled relevance judgments in the dataset.
These ranking models are applied in a zero-shot setting, meaning they have not been trained using
domain-specific relevancy labels. The methods include traditional lexical
ranking like BM25 and other Vespa native text ranking features, semantic
off-the-shelf vector search models, and hybrid combinations of sparse and dense
methods. All of the ranking methods are represented end-to-end in Vespa.

Dataset

In this series, we use a large shopping queries
dataset released by Amazon:

We introduce the “Shopping Queries Data Set”, a large dataset of complex search
queries released to foster research in semantic matching of queries and
products. For each query, the dataset provides a list of up to 40 potentially
relevant results, together with ESCI relevance judgments (Exact, Substitute,
Complement, Irrelevant) indicating the relevance of the product to the query.
Each query-product pair is accompanied by additional information. The dataset is
multilingual, containing queries in English, Japanese, and Spanish.

The dataset consists of a large number of queries, where each query has,
on average, about 20 graded query-item relevance judgments:

  • Exact (E): the item is relevant to the query and satisfies all the query
    specifications (e.g., water bottle matching all attributes of a query “plastic
    water bottle 24oz”, such as material and size)

  • Substitute (S): the item is somewhat relevant: it fails to fulfill some
    aspects of the query, but the item can be used as a functional substitute (e.g.,
    fleece for a “sweater” query)

  • Complement (C): the item does not fulfill the query but could be used in
    combination with an exact item (e.g., track pants for “running shoe” query)

  • Irrelevant (I): the item is irrelevant, or it fails to fulfill a central
    aspect of the query (e.g., socks for a “pant” query)

As with most ranking datasets, the dataset is split into a train and test part,
and models can be trained using the train split and evaluated on the test split.
In our work, we focus on the English queries in both splits.

English Train Split

The train split contains 20,888 queries with 419,653 query-item judgments. This
split can be used to train models, using learning to rank techniques. The train
set will be the focus of upcoming blog posts.

English Test Split

In this blog post, we focus on the test split, which contains 8,956 queries with
181,701 query-item judgments.

The judgment label distribution is as follows:

  • Exact (Relevant) 79,708 (44%)
  • Substitute (Somewhat Relevant) 63,563 (35%)
  • Complement 8,099 (4%)
  • Irrelevant 30,331 (17%)

Both splits have about 20 product judgments per query. As we can see from the label
distribution, the dataset has many Exact (relevant) judgments, especially
compared to Complements and Irrelevant. On average, there
are about 9 relevant (Exact) products for every query in the test split.

This is the largest publicly available product relevance
dataset, and it allows us to compare the effectiveness of multiple ranking models
using graded relevance judgments. A large dataset also allows us to report proper
information retrieval ranking metrics and not anecdotal LGTM@10 (Looks Good To Me) metrics.

The dataset lets us evaluate both ranking models applied in a
zero-shot setting, without using the in-domain relevance labels, and models
trained using learning to rank techniques, exploiting the relevance labels in the train split.

Product ranking example

The above image shows query #535 in the test split, with a total of 16 product judgments.
The task is to optimize the ordering (ranking) so that products labeled as Exact are ordered
before Substitutes, Substitutes before Complements, and Complements before Irrelevant products.

Product ranking example
The above image shows a perfect ranking of the products for query #535.
Note that we have converted the textual labels (esci_label) to numeric labels using
E=4, S=3, C=2, I=1. If our ranking function produces this ordering, we
get a perfect score for this query.

Indexing the dataset with Vespa

We use the following Vespa document
schema, which allows us to index all the
products associated with English queries across both splits.
We use a utility script that converts the parquet product data file to a
Vespa JSON
formatted feed file. In total we index about 1.2M products in Vespa.

schema product {

    document product {

        field id type string {
            indexing: summary | index
            rank: filter
            match: word
        }

        field title type string {
            indexing: summary | index
            index: enable-bm25
            match: text
            weight: 300
            bolding: on
        }

        field description type string {
            indexing: summary | index
            index: enable-bm25
            match: text
            weight: 200
        }

        field bullets type string {
            indexing: summary | index
            index: enable-bm25
            match: text
            weight: 200
        }

        field brand type string {
            indexing: summary | index | attribute
            match: text
            weight: 100
        }

        field color type string {
            indexing: summary | index | attribute
            match: text
            weight: 100
        }

    }

    field title_tokens type tensor<float>(d0[32]) {
        indexing: input title | embed tokenizer | attribute
    }

    field description_tokens type tensor<float>(d0[128]) {
        indexing: input description | embed tokenizer | attribute
    }

    field embedding type tensor<float>(d0[384]) {
        indexing {
            input title . input description | embed transformer | attribute | summary | index
        }
        attribute {
            distance-metric: angular 
        }
    }
    fieldset default {
        fields: title, description, bullets, brand  
    }
}

The product data contains title, description, brand, color, and bullets. We use
Vespa’s support for encoding text into vector embeddings; in this case,
we use the title and description as input.
See text embedding made simple
for details on how to use text embedding models in Vespa.
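
As an illustration of what the conversion script could look like, here is a minimal
sketch in Python (not the actual utility script). It assumes the pandas and pyarrow
libraries, the product parquet file from the public dataset repository, and ESCI column
names such as product_id and product_title, so verify these against the files you
download. The embedding and token fields are computed inside Vespa by the embed
expressions in the schema, so the feed only needs the raw text fields.

# Sketch: convert the ESCI product parquet file to a Vespa JSON feed (one operation per line).
# File and column names are assumptions based on the public dataset; adjust if they differ.
import json
import pandas as pd

def text(value):
    return "" if pd.isna(value) else str(value)

products = pd.read_parquet("shopping_queries_dataset_products.parquet")
products = products[products["product_locale"] == "us"]  # keep the English (US) products

with open("feed.jsonl", "w") as out:
    for row in products.itertuples():
        doc = {
            "put": f"id:product:product::{row.product_id}",
            "fields": {
                "id": text(row.product_id),
                "title": text(row.product_title),
                "description": text(row.product_description),
                "bullets": text(row.product_bullet_point),
                "brand": text(row.product_brand),
                "color": text(row.product_color),
            },
        }
        out.write(json.dumps(doc) + "\n")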

Evaluation

The official dataset evaluation metric is
NDCG (Normalized
Discounted Cumulative Gain), a precision-oriented metric commonly used for
ranking datasets with graded relevance judgments. An important observation is
that the task only considers the ranking of the products that have judgment
labels. In other words, we assume that a magic retriever has
retrieved all the judged products (relevant or not), and our task is to
re-rank them so that the most relevant ones are pushed to the top of the
ranked list of products. This is a simpler task than implementing end-to-end
retrieval and ranking over the 1.2M products.
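
To make the metric concrete, here is a minimal sketch of how NDCG can be computed for a
single query, using the E=4, S=3, C=2, I=1 gain mapping from the example above. Note that
NDCG implementations differ in details such as the gain function (linear versus
exponential), so treat this as an illustration rather than the exact computation
trec_eval performs.

# Sketch: NDCG for a single ranked list with graded relevance labels.
# Gains use the label value directly; some implementations use 2^label - 1 instead.
import math

def dcg(gains):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked_gains):
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# Example: a ranking that places a Substitute (3) above an Exact (4) product
label_gain = {"E": 4, "S": 3, "C": 2, "I": 1}
ranked_labels = ["S", "E", "E", "C", "I"]
print(ndcg([label_gain[l] for l in ranked_labels]))  # below 1.0, since the order is not ideal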

Ranking overview

Many deployed search systems need to keep serving costs down and therefore deploy
multi-phased retrieval and ranking
pipelines to reduce computational complexity.
In a retrieval and ranking funnel, one or more diverse retrievers use a
cost-efficient ranking phase to retrieve candidates, and subsequent ranking phases re-rank
the documents, focusing on precision. In a later post in this series, we look at
how to train a retriever function using learning to retrieve and how to
represent retrieval and ranking phases in Vespa.

Baseline zero-shot ranking models

We propose and evaluate seven different baseline ranking models, all represented end-to-end in
Vespa using Vespa’s ranking framework.

Random

Obviously not the greatest baseline, but it allows us to compare the other baselines with
random ranking. We use Vespa’s random
rank-feature.

A pseudorandom number in the range [0,1] which is drawn once per document during rank evaluation.

rank-profile random inherits default {
    first-phase {
        expression: random 
    }
}

BM25 – lexical ranking

BM25 is a tried and true
zero-shot text ranking model. BM25 has only two parameters (k1 and b) and is the ranking
function many turn to when there are no explicit relevancy labels or implicit
click feedback to train ranking models. The BM25 scoring function can also be
accelerated for efficient retrieval using dynamic pruning algorithms like
WAND.
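
For reference, here is a minimal sketch of the classic Okapi BM25 scoring function and
its two free parameters. The k1 and b defaults shown are the commonly used values;
Vespa’s bm25 rank feature may differ in minor details such as the exact IDF smoothing.

# Sketch: Okapi BM25 for a single query/document pair.
# idf and the term/document statistics would come from the index in a real system.
import math

def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score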

rank-profile bm25 inherits default {
    first-phase {
        expression: bm25(title) + bm25(description) 
    }
}

Note that we only use the title and description. All our lexical baselines
use only these two fields.

Vespa nativeRank – lexical ranking

Vespa’s nativeRank is a
lexical ranking function and the default text ranking function in Vespa. It has
similar characteristics to BM25 but also considers the proximity between matched
terms in the document. Unlike BM25, which has an unbound score range, Vespa’s
nativeRank is normalized to a score in the range [0,1].

rank-profile nativeRank inherits default {
    first-phase {
        expression: nativeRank(title) + nativeRank(description) 
    }
}

Vespa fieldMatch – lexical ranking

Vespa fieldMatch
is another lexical ranking function that is more sophisticated than Vespa
nativeRank but also computationally expensive and unsuitable as a retrieval
function. fieldMatch(field), or fieldMatch sub-features such as fieldMatch(field).proximity, are
typically used as re-ranking features.

rank-profile fieldMatch inherits default {
    first-phase {
        expression: fieldMatch(title) + fieldMatch(description) 
    }
}

Vespa semantic ranking using vector similarity – dense retrieval

Using an off-the-shelf dense vector model from
Huggingface, we
encode documents and queries into a 384-dimensional dense vector space. We use
cosine similarity between the query and the document in this shared embedding
vector space as our relevancy measure. The vector embedding model,
sentence-transformers/all-MiniLM-L6-v2, is trained
on a large number of sentence-length texts.

Semantic similarity search over dense representations can be accelerated using
Vespa’s support for approximate nearest neighbor search, making it usable for
efficient retrieval. We use Vespa’s support for dense embedding models in this work.
See text embeddings made simple.

rank-profile semantic inherits default {
    inputs {
        query(query_embedding) tensor<float>(d0[384])
    } 
    first-phase {
        expression: closeness(field, embedding)
    }
} 
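
As a sketch of how this profile could be queried (this is an assumption, not part of the
evaluation setup in this post), the query embedding can be computed client-side with the
same sentence-transformers model and passed as the query input tensor. The endpoint and
request parameter names assume a recent Vespa version; adjust them to your deployment.

# Sketch: query the "semantic" rank-profile with a client-side query embedding.
# Endpoint, YQL, and parameter names are assumptions; adjust to your deployment.
import requests
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_search(query_text, hits=20):
    embedding = model.encode(query_text).tolist()  # 384-dimensional vector
    body = {
        "yql": "select * from product where {targetHits: 100}nearestNeighbor(embedding, query_embedding)",
        "ranking": "semantic",
        "input.query(query_embedding)": embedding,
        "hits": hits,
    }
    return requests.post("http://localhost:8080/search/", json=body).json()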

Vespa semantic ranking using vector similarity – bert-base-uncased

Another model that can be used for vectorization is
bert-base-uncased, which produces 768-dimensional vectors.
This model has not been trained for semantic similarity or semantic search; we
use the pre-trained language model directly. As with the previous semantic
model, we use cosine similarity between the query and the document as our relevancy measure.

rank-profile semantic2 inherits default {
    inputs {
        query(bert_query_embedding) tensor<float>(d0[768])
    } 
    first-phase {
        expression: closeness(field, bert_embedding)
    }
} 

Vespa hybrid – combining lexical and semantic ranking

In this baseline model, we combine the semantic similarity score of the
sentence-transformers/all-MiniLM-L6-v2 model with the lexical score computed by
Vespa’s nativeRank. We use a simple linear combination of the two ranking
models. Since the nativeRank lexical feature is normalized in the range [0,1],
combining it with semantic similarity is more straightforward as we don’t need
to normalize unbound lexical scoring functions, such as BM25.
We use a query(alpha) parameter to control how each method influences the overall score.

rank-profile hybrid inherits default {
    inputs {
        query(alpha): 0.5
        query(query_embedding) tensor<float>(d0[384])
    }
    first-phase {
        expression: query(alpha)*closeness(field, embedding) + (1 - query(alpha))*(nativeRank(title) + nativeRank(description))
    }
}
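
At query time, the default alpha can be overridden per request to shift the balance
between the two signals. Here is a hedged sketch of such a request; the endpoint, the
YQL (combining lexical matching with nearestNeighbor), and the parameter names are
assumptions rather than the exact setup used in this post.

# Sketch: one hybrid request that overrides the default alpha; names and endpoint are assumptions.
import requests
from sentence_transformers import SentenceTransformer

query_text = "plastic water bottle 24oz"
embedding = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(query_text).tolist()

body = {
    "yql": "select * from product where userQuery() or ({targetHits: 100}nearestNeighbor(embedding, query_embedding))",
    "query": query_text,
    "ranking": "hybrid",
    "input.query(alpha)": 0.3,  # weight the combination more towards the lexical nativeRank score
    "input.query(query_embedding)": embedding,
}
response = requests.post("http://localhost:8080/search/", json=body).json()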

Zero-shot Baseline Results

Ranking result

We present the results of our evaluation in the above figure. We execute each
query in the English test split (8.9K queries) and ask Vespa to rank the
documents that we have labels for, using the Vespa search API’s recall
parameter.

Sets a recall parameter to be combined with the query. This is identical to filter, except that recall terms are not exposed to the ranking framework and thus not ranked.
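
Below is a hedged sketch of what such an evaluation request could look like for one test
query. The endpoint, the grammar setting, and in particular the recall term syntax are
assumptions that should be verified against the Vespa Query API reference.

# Sketch: rank only the judged products for one test query; details are assumptions.
import requests

def rank_judged(query_text, judged_product_ids, profile, hits=40):
    body = {
        "yql": "select id from product where userQuery()",
        "query": query_text,
        "type": "any",  # OR the query terms so lexical profiles match broadly
        "ranking": profile,
        # Restrict matching to the judged products; recall terms are not ranked.
        # NOTE: verify the exact recall syntax against the Vespa Query API reference.
        "recall": "+(" + " ".join("id:" + pid for pid in judged_product_ids) + ")",
        "hits": hits,
    }
    # Profiles that do not rely on query term matching may need a different yql (e.g. match-all).
    response = requests.post("http://localhost:8080/search/", json=body).json()
    return [hit["fields"]["id"] for hit in response["root"].get("children", [])]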

We use the trec_eval utility to compute
NDCG, using the relevance judgments. The reported NDCG is the average NDCG
calculated across all 8,956 queries in the test split. These are unique queries,
and we don’t know their frequency distribution on Amazon, but in a real-world
evaluation we would also have to take the query frequency into account, for
example by weighting queries so that popular (head) queries count more
than rare (tail) queries. In other words, we don’t want to introduce a ranking model that improves
NDCG on average but hurts many head queries.
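
To tie this together, the judgments and the rankings can be written in the standard TREC
qrels and run file formats and scored with trec_eval. The following is a minimal sketch,
not the actual evaluation scripts; the 4/3/2/1 gain mapping follows the example earlier
in the post, and the run score column uses the reciprocal rank so that trec_eval’s
score-based sorting preserves the ranking order.

# Sketch: score a run with trec_eval using TREC-format qrels and run files.
import subprocess

label_gain = {"E": 4, "S": 3, "C": 2, "I": 1}

def write_qrels(judgments, path="qrels.txt"):
    """judgments: dict of query_id -> {product_id: esci_label}."""
    with open(path, "w") as f:
        for qid, labels in judgments.items():
            for pid, label in labels.items():
                f.write(f"{qid} 0 {pid} {label_gain[label]}\n")

def write_run(rankings, tag, path="run.txt"):
    """rankings: dict of query_id -> ranked list of product_ids (best first)."""
    with open(path, "w") as f:
        for qid, ranked in rankings.items():
            for rank, pid in enumerate(ranked, start=1):
                f.write(f"{qid} Q0 {pid} {rank} {1.0 / rank} {tag}\n")

def evaluate(qrels_path="qrels.txt", run_path="run.txt"):
    # Reports the NDCG averaged over all queries in the run file.
    result = subprocess.run(["trec_eval", "-m", "ndcg", qrels_path, run_path],
                            capture_output=True, text=True)
    return result.stdout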

Notice that the random ranking function produces a high NDCG score. This is because there are
many Exact (relevant) judgments per query: on average, 9 out of 20 judgments per query are Exact.
This demonstrates the value of including a random baseline, as it puts the scores of the
other ranking models in perspective.

Semantic model 2 is the bert-base-uncased model, which has not been
fine-tuned for retrieval or ranking tasks, only trained with masked language modeling.
Its NDCG score is the closest to the random baseline.
This demonstrates that not every model that encodes text into vectors is
a good fit for ranking. In addition, the bert-base-uncased model is about
5x larger than the model used in semantic model 1, which costs both storage (dimensionality)
and CPU cycles during inference.

Semantic model 1, the sentence-transformers/all-MiniLM-L6-v2 model,
also underperforms compared to the lexical methods. This is a known
shortcoming of dense vectorization models: they might not work well in a zero-shot
setting when applied to a new text domain.

In this dataset, the hybrid combination of lexical (nativeRank) and semantic ranking did not
yield any improvement over the pure lexical ranking methods. We did not tune the
alpha parameter, as tuning parameters to improve accuracy on the test set is
bad machine learning practice.

Summary

In this blog post, we described a large product relevance dataset and
evaluated several baseline ranking methods on this dataset, applied in a zero-shot setting.

In the next blog post in this series, we will use the labeled train split of the dataset to
train multiple ranking models using various learning to rank techniques:

  • A semantic similarity model using a two-tower/
    bi-encoder
    architecture. Building on a pre-tuned model, we will use the training data (query-product labels)
    to fine-tune the model for the product ranking domain.
    We then embed this model into Vespa for semantic search.

  • A semantic
    cross-encoder
    similarity model based on a pre-trained language model. Cross-encoders are more computationally
    expensive than bi-encoders, but generally more accurate than bi-encoders,
    which reduce semantic similarity to a vector similarity.

  • Gradient Boosting (GB)
    models, combining multiple ranking signals. GB models are famous for their
    performance on tabular data and are
    popular in e-commerce search ranking. We include GB methods, as historically,
    these models have been prevalent for product search ranking, including
    statistical features such as sales rank or other business-logic-oriented
    features. We will use a combination of lexical and neural features and feed
    them into the GB model. For run-time inference, we will be using Vespa’s support for
    LightGBM and
    XGBoost models, two popular frameworks for
    GBDT models.

  • Ensemble ranking models. Since Vespa maps different models from different
    frameworks into the same ranking framework, combining multiple models into an
    ensemble model is straightforward.