Using approximate nearest neighbor search to find similar products

In this blog we give an introduction to how to use
Vespa's approximate nearest neighbor search query operator.

We demonstrate how nearest neighbor search in an image feature vector space can be used to find similar products.
Given an image of a product, we want to find
similar products, with the possibility to filter the returned products by inventory status, price or other real-world constraints.

We’ll be using the
Amazon Products dataset
as our demo dataset. The subset we use has about 22K products.

  • The dataset contains product information like title, description and price, and we show how to map the data into the Vespa document model
  • The dataset also contains binary feature vectors for images, produced by a neural network model. We'll use these image vectors and show you how to store and index vectors in Vespa.

We also demonstrate some of the real world challenges with using nearest neighbor search:

  • Combining the nearest neighbor search with filters, for example on inventory status
  • Real-time indexing of vectors and updating documents with vectors
  • True partial updates to update inventory status at scale

Since we use the Amazon Products dataset, we also recommend these resources:

PyVespa

In this blog we’ll be using pyvespa, a simple Python API built on top of Vespa’s native HTTP APIs.

The Python API is not meant as a production-ready API, but as an API to explore features in Vespa.
It is also useful for training ML models, which can be deployed to Vespa for serving.

See also the Complete notebook which powered this blog post.

Exploring the Amazon Product dataset

Let us download the demo data from the Amazon Products dataset; we use the Amazon_Fashion subset.

!wget -nc https://raw.githubusercontent.com/alexklibisz/elastiknn/master/examples/tutorial-notebooks/amazonutils.py
!wget -nc http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Amazon_Fashion.json.gz
!wget -nc http://snap.stanford.edu/data/amazon/productGraph/image_features/categoryFiles/image_features_Amazon_Fashion.b

from amazonutils import *
from pprint import pprint

Let us have a look at a selected slice of the dataset:

for p in islice(iter_products('meta_Amazon_Fashion.json.gz'), 221,223):
  pprint(p)
  display(Image(p['imUrl'], width=128, height=128))

{'asin': 'B0001KHRKU',
'categories': [['Amazon Fashion']],
'imUrl': 'http://ecx.images-amazon.com/images/I/31J78QMEKDL.jpg',
'salesRank': {'Watches': 220635},
'title': 'Midas Remote Control Watch'}

![jpeg](/assets/2021-02-16-image-similarity-search/output_7_1.jpg)

{'asin': 'B0001LU08U',
'categories': [['Amazon Fashion']],
'imUrl': 'http://ecx.images-amazon.com/images/I/41q0XD866wL._SX342_.jpg',
'salesRank': {'Clothing': 1296873},
'title': 'Solid Genuine Leather Fanny Pack Waist Bag/purse with '
'Cellphoneholder'}

![jpeg](/assets/2021-02-16-image-similarity-search/output_7_3.jpg)

## Build basic search functionality

A Vespa instance is described by a [Vespa application package](https://docs.vespa.ai/en/application-packages.html). Let us create an application called *product* and define our [document schema](https://docs.vespa.ai/en/schemas.html). We use pyvespa to define our application. This is not a requirement for creating a Vespa application; one can also use a plain text editor of choice.

from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name = "product")
from vespa.package import Field
app_package.schema.add_fields(        
    Field(name = "asin", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "description", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "price", type = "float", indexing = ["attribute", "summary"]),
    Field(name = "salesRank", type = "weightedset<string>", indexing = ["summary","attribute"]),
    Field(name = "imUrl", type = "string", indexing = ["summary"])
)
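For reference, the schema generated by pyvespa for these fields corresponds roughly to the following hand-written schema definition (a sketch; the exact output may differ in detail):

```
schema product {
    document product {
        field asin type string {
            indexing: attribute | summary
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field description type string {
            indexing: index | summary
            index: enable-bm25
        }
        field price type float {
            indexing: attribute | summary
        }
        field salesRank type weightedset<string> {
            indexing: summary | attribute
        }
        field imUrl type string {
            indexing: summary
        }
    }
}
```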

We define a fieldset, which is a way to combine matching over multiple string fields.
We will only do text queries over the *title* and *description* fields.


from vespa.package import FieldSet
app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "description"])
)

Then we define a simple [ranking](https://docs.vespa.ai/en/ranking.html) function which uses a linear combination of the [bm25](https://docs.vespa.ai/en/reference/bm25.html) text ranking feature over our two free-text string fields.

from vespa.package import RankProfile
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25", 
        first_phase = "0.9*bm25(title) + 0.2*bm25(description)")
)

Let us deploy this application. We use Docker in this example; see also [Vespa quick start](https://docs.vespa.ai/en/vespa-quick-start.html).

### Deploy the application and start Vespa

from vespa.package import VespaDocker
vespa_docker = VespaDocker(port=8080)

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder="/home/centos/product_search" # include the desired absolute path here
)

Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.

### Index our product data

Pyvespa does expose a feed api, but in this notebook we use the raw [Vespa http /document/v1 feed api](https://docs.vespa.ai/en/document-v1-api-guide.html).

The HTTP document API is synchronous, and the operation is visible in search once acknowledged with response code 200. In this case, feed throughput is limited by the client since we post one document at a time. For high-throughput use cases, use the asynchronous feed API, or use more client threads with the synchronous API.

import requests 
session = requests.Session()

def index_document(product):
    asin = product['asin']
    doc = {
        "fields": {
            "asin": asin,
            "title": product.get("title",None),
            "description": product.get('description',None),
            "price": product.get("price",None),
            "imUrl": product.get("imUrl",None),
            "salesRank": product.get("salesRank",None)     
        }
    }
    resource = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    request_response = session.post(resource,json=doc)
    request_response.raise_for_status()

With our routine defined, we can iterate over the data and index the products, one document at a time:

from tqdm import tqdm
for product in tqdm(iter_products("meta_Amazon_Fashion.json.gz")):
    index_document(product)

24145it [01:46, 226.40it/s]

Our index is now ready; there is no need to perform additional index maintenance operations like merging segments.
All the data is searchable. Let us define a simple routine to display search results, parsing the [Vespa JSON search response](https://docs.vespa.ai/en/reference/default-result-format.html) format:

def display_hits(res, ranking):
    time = 1000*res['timing']['searchtime'] #convert to ms
    totalCount = res['root']['fields']['totalCount']
    print("Found {} hits in {:.2f} ms.".format(totalCount,time))
    print("Showing top {}, ranked by {}".format(len(res['root']['children']),ranking))
    print("")
    for hit in res['root']['children']:
        fields = hit['fields']
        print("{}".format(fields.get('title', None)))
        display(Image(fields.get("imUrl"), width=128, height=128))  
        print("documentid: {}".format(fields.get('documentid')))
        if 'inventory' in fields:
            print("Inventory: {}".format(fields.get('inventory')))
        print("asin: {}".format(fields.get('asin')))
        if 'price' in fields:
            print("price: {}".format(fields.get('price',None)))
        if 'priceRank' in fields:
            print("priceRank: {}".format(fields.get('priceRank',None)))  
        print("relevance score: {:.2f}".format(hit.get('relevance')))
        print("")

### Query our product data

We use the [Vespa HTTP Search API](https://docs.vespa.ai/en/query-api.html#http) to search our product index.

In this example we assume there is a user query 'mens wrist watch' which we use as input to the
[YQL](https://docs.vespa.ai/en/query-language.html) query language. Vespa allows combining the structured application logic expressed by YQL with a
user query language called [Vespa simple query language](https://docs.vespa.ai/en/reference/simple-query-language-reference.html).

In this case we use *type=any*, so matching any of our three terms is enough to retrieve the document.
In the YQL statement we select the fields we want returned. Only fields marked as *summary* in the schema can be returned with the hit result.

We do not specify which fields to search, so Vespa uses the fieldset named *default* defined earlier, which searches both the title and the description fields.

query = {
    'yql': 'select documentid, asin,title,imUrl,price from sources * where userQuery();',
    'query': 'mens wrist watch',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 2
}
display_hits(app.query(body=query).json, "bm25")

Found 3285 hits in 4.00 ms.
Showing top 2, ranked by bm25

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)

![jpeg](/assets/2021-02-16-image-similarity-search/output_25_1.jpg)

documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 10.27

Popular Quality Brand New Fashion Mens Boy Leatheroid Quartz Wrist Watch Watches

![jpeg](/assets/2021-02-16-image-similarity-search/output_25_3.jpg)

documentid: id:demo:product::B009EJATDQ
asin: B009EJATDQ
relevance score: 9.67

So there we have our basic search functionality.

## Related products using nearest neighbor search in image feature spaces
Now we have basic search functionality up and running, but the Amazon Products dataset also includes image features, which we can
index in Vespa and use with approximate nearest neighbor search.
Let us load the image feature data. We reduce the vector dimensionality to something more practical and use 256 dimensions.

vectors = []
reduced = iter_vectors_reduced("image_features_Amazon_Fashion.b", 256, 1000)
for asin,v in tqdm(reduced("image_features_Amazon_Fashion.b")):
    vectors.append((asin,v))

22929it [00:04, 4739.67it/s]

We need to re-configure our application to add the image vector field.
We also define an *HNSW* index for it, using *angular* as our [distance metric](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric).

We also need to define the input query vector in the application package.
Without defining the query input tensor we won't be able to perform our nearest neighbor search,
so make sure you remember to include it.

Most changes, like adding or removing a field, are [live changes](https://docs.vespa.ai/en/reference/schema-reference.html#modifying-schemas) in Vespa; there is no need to re-index the data.

from vespa.package import HNSW
app_package.schema.add_fields(
    Field(name = "image_vector", 
          type = "tensor(x[256])", 
          indexing = ["attribute","index"],
          ann=HNSW(
            distance_metric="angular",
            max_links_per_node=16,
            neighbors_to_explore_at_insert=200)
         )
)
from vespa.package import QueryTypeField
app_package.query_profile_type.add_fields(
    QueryTypeField(name="ranking.features.query(query_image_vector)", type="tensor(x[256])")
)

We also need to define a ranking profile specifying how we want to score our documents. We use the *closeness* [ranking feature](https://docs.vespa.ai/en/reference/rank-features.html). Note that it is also possible to retrieve results using the approximate nearest neighbor search operator and use the first-phase ranking function as a re-ranking stage (e.g. by sales popularity).

app_package.schema.add_rank_profile(
    RankProfile(
        name = "vector_similarity", 
        first_phase = "closeness(field,image_vector)")
)
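For intuition, here is a sketch of what the *closeness* feature computes for the *angular* distance metric, assuming the definition in the Vespa rank feature documentation (closeness = 1/(1 + distance), where distance is the angle between the two vectors in radians):

```python
import math

def angular_closeness(a, b):
    """Sketch of closeness(field, image_vector) with the angular distance
    metric: the distance is the angle between the two vectors, and
    closeness = 1 / (1 + distance), so identical vectors score 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    # clamp to guard against floating point drift outside [-1, 1]
    cosine = max(-1.0, min(1.0, dot / norm))
    distance = math.acos(cosine)
    return 1.0 / (1.0 + distance)
```

This is why, in the results below, the seed product itself gets a relevance score of 1.00: its angular distance to itself is 0.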

Now, we need to re-deploy our application package to make the changes effective.

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder="/home/centos/product_search" # include the desired absolute path here
)

## Update the index with image vectors
Now we are ready to feed and index the image vectors.

We update the documents in the index by running partial update operations, adding the vectors using real-time updates of the existing documents.
Partially updating a tensor field, with or without an HNSW index, does not trigger re-indexing.

for asin,vector in tqdm(vectors):
    update_doc = {
        "fields":  {
            "image_vector": {
                "assign": {
                    "values": vector
                }
            }
        }
    }
    url = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    response = session.put(url, json=update_doc)
    response.raise_for_status()

100%|██████████| 22929/22929 [01:40<00:00, 228.94it/s]

We now want to find similar products using the image feature data.
We do so by first fetching the vector of the product we want to find similar products for,
and then using this vector as input to Vespa's nearest neighbor search operator.
First we define a simple utility to fetch the vector of a given product *asin*.

def get_vector(asin):
    """Fetch the image vector of the given product and return it as an
    ordered list of 256 float values."""
    resource = "http://localhost:8080/document/v1/demo/product/docid/{}".format(asin)
    response = session.get(resource)
    response.raise_for_status()
    document = response.json()
    # The tensor is returned in the verbose cells format;
    # order the values by their 'x' index
    cells = document['fields']['image_vector']['cells']
    vector = {int(cell['address']['x']): cell['value'] for cell in cells}
    return [vector[i] for i in range(256)]
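To see what this routine is parsing: /document/v1 returns indexed tensors in a verbose cells format, where each cell carries its index as a string address. A small hypothetical 4-dimensional illustration (not actual dataset values):

```python
# Hypothetical, truncated example of the tensor JSON returned by /document/v1
document_fields = {
    "image_vector": {
        "cells": [
            {"address": {"x": "0"}, "value": 0.12},
            {"address": {"x": "2"}, "value": 0.56},
            {"address": {"x": "1"}, "value": 0.34},
            {"address": {"x": "3"}, "value": 0.78},
        ]
    }
}

def cells_to_list(cells, dims):
    """Order the cell values by their 'x' index so the resulting list
    lines up with the tensor(x[dims]) type expected by the query."""
    by_index = {int(c["address"]["x"]): c["value"] for c in cells}
    return [by_index[i] for i in range(dims)]
```

Note that the cells are not guaranteed to arrive in index order, which is why the values are ordered by address before use.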

Let us repeat the query from above to find a product to use as the source image for similar-product search:

query = {
    'yql': 'select documentid, asin,title,imUrl,price from sources * where userQuery();',
    'query': 'mens wrist watch',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 1
}
display_hits(app.query(body=query).json, "bm25")

Found 3285 hits in 4.00 ms.
Showing top 1, ranked by bm25

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)

![jpeg](/assets/2021-02-16-image-similarity-search/output_40_1.jpg)

documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 10.27

Let us search for similar products using **exact** nearest neighbor search. We ask for the 3 products most similar to the image of the product with asin **B00GLP1GTW**:

query = {
    'yql': 'select documentid, asin,title,imUrl,description,price from sources * where \
    ([{"targetHits":3,"approximate":false}]nearestNeighbor(image_vector,query_image_vector));',
    'ranking': 'vector_similarity',
    'hits': 3, 
    'presentation.timing': True,
    'ranking.features.query(query_image_vector)': get_vector('B00GLP1GTW')
}
display_hits(app.query(body=query).json, "vector_similarity")

Found 46 hits in 10.00 ms.
Showing top 3, ranked by vector_similarity

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)

![jpeg](/assets/2021-02-16-image-similarity-search/output_42_1.jpg)

documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 1.00

Avalon EZC Unisex Low-Vision Silver-Tone Flex Bracelet One-Button Talking Watch, # 2609-1B

![jpeg](/assets/2021-02-16-image-similarity-search/output_42_3.jpg)

documentid: id:demo:product::B000M9GQ0M
asin: B000M9GQ0M
price: 44.95
relevance score: 0.63

White Rubber Strap Digital Watch

![jpeg](/assets/2021-02-16-image-similarity-search/output_42_5.jpg)

documentid: id:demo:product::B003ZYXF1Y
asin: B003ZYXF1Y
relevance score: 0.62

Let us repeat the same query, but this time using the faster approximate version.
When there is an HNSW index on the tensor,
the default behavior is *approximate:true*, so we simply remove the approximation flag:

query = {
    'yql': 'select documentid, asin,title,imUrl,description,price from sources * where \
    ([{"targetHits":3}]nearestNeighbor(image_vector,query_image_vector));',
    'ranking': 'vector_similarity',
    'hits': 3, 
    'presentation.timing': True,
    'ranking.features.query(query_image_vector)': get_vector('B00GLP1GTW')
}
display_hits(app.query(body=query).json, "vector_similarity")

Found 3 hits in 6.00 ms.
Showing top 3, ranked by vector_similarity

Geekbuying 814 Analog Alloy Quartz Men's Wrist Watch - Black (White)

![jpeg](/assets/2021-02-16-image-similarity-search/output_44_1.jpg)

documentid: id:demo:product::B00GLP1GTW
asin: B00GLP1GTW
price: 8.99
relevance score: 1.00

Avalon EZC Unisex Low-Vision Silver-Tone Flex Bracelet One-Button Talking Watch, # 2609-1B

![jpeg](/assets/2021-02-16-image-similarity-search/output_44_3.jpg)

documentid: id:demo:product::B000M9GQ0M
asin: B000M9GQ0M
price: 44.95
relevance score: 0.63

White Rubber Strap Digital Watch

![jpeg](/assets/2021-02-16-image-similarity-search/output_44_5.jpg)

documentid: id:demo:product::B003ZYXF1Y
asin: B003ZYXF1Y
relevance score: 0.62

## Combining nearest neighbor search with filters

If we compare the results of the exact and approximate nearest neighbor searches above, the approximate version returned the same hits (perfect recall).
Naturally, the first product listed is the same product that we used as input; its closeness score is 1.0 simply because the angular distance is 0.
Since the user is already presented with this product, we want to remove it from the result. We can do that