Text-to-image search with Vespa | Vespa Blog

Text-to-image search is a form of search where images are retrieved based on a
textual description. This form of search has, like text search, gone through a
revolution in recent years. Previously, one used traditional information
retrieval techniques based on a textual label associated with each image, thus
not really using the image at all. In contrast, modern approaches are based on
machine-learned representations of actual image content.

For instance, “a child playing football”:

a child playing football

Or “a dog catching a frisbee”:

a dog catching a frisbee

Both of these are the top results from the Vespa text-image sample
This sample application is powered by a pre-trained model, trained specifically
to “understand” both text and image content. Pre-trained models have come to
dominate an increasing number of fields in recent years. This is possibly best
demonstrated by the popularity of BERT and relatives, which has revolutionized
natural language processing. By pre-training a model on large datasets, it
gains a base capability that can be fine-tuned to specific tasks using much
smaller amounts of data.

This particular sample app is based on the CLIP model from
OpenAI, which has been trained on 400 million
(image, text) pairs taken from the internet. CLIP consists of two models: one
for text and one for images. During training, images are associated with
textual descriptions. Like with pre-trained language models, CLIP thus gains a
basic textual understanding of image content.

One of the exciting capabilities CLIP has shown is that this training method
enables a strong capability for zero-shot learning. Most previous image
classification approaches classify images into a fixed set of labels. However,
CLIP can understand and label images with labels it hasn’t seen during
training. Also, it generalizes very well with images it hasn’t encountered
during training.

The sample application uses the
dataset, which consists of 8000 images. These images were not explicitly used
during the training of CLIP. Yet, the model handles them successfully, as seen
in the examples above. This is interesting because it means that you can expect
reasonable results from your own image collection.

In this blog post, we’ll describe how to use the CLIP model to set up a
production-ready text-to-image search application on Vespa. Vespa is an
excellent platform for this as it contains all the necessary capabilities right
out of the box, such as (approximate) nearest neighbor
search and machine
learned model
inference. We’ll
start by looking closer at the CLIP model.


CLIP (Contrastive Language-Image Pre-training) is a neural network that learns
visual concepts from natural language supervision. By training the model from a
very large set of images and captions found on the internet, CLIP learns to
associate textual description with image content. The goal is to create a model
with a certain capability for “understanding” image content without overfitting
to any benchmark such as ImageNet. This pre-trained network can then be
fine-tuned toward a more specific task if required. This is similar to how
natural language understanding models such as BERT are pre-trained on large
corpora and then potentially fine-tuned on much smaller amounts of data for
specific tasks.

Perhaps the most significant contribution of CLIP is the large amount of data
it has been trained on: 400 million image-text pairs. This enables an excellent
capability for zero-shot learning, meaning:

  1. Images can be classified by labels the model hasn’t seen during training.
  2. Images not seen during training can still be classified correctly.

This is exemplified by the following code:

import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
image = preprocess(Image.open("image.png")).unsqueeze(0)
text = clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

The result in this code, probs, contains the relative probabilities that the
image is described by the 3 different text labels: diagram, dog, or cat.
Indeed, any text or number of texts can be passed to the model, which estimates
the relative probabilities between them. This architecture differs from typical
image classification models that contain a fixed number of outputs, one for
each pre-defined label.

CLIP contains two models: one for text and one for images. Both models produce
a representation vector from their inputs. This is the image_features and
text_features above. The final output is the cosine distance between these
two. By training on text and image pairs, the model learns to minimize the
distance between matching text and image representations in a shared semantic


For more details about CLIP, please refer to CLIP’s GitHub

Next, we’ll see how to use CLIP to create an image search application where the
user provides a textual description, and the search system will return the best
matching images.

Recall that the entire CLIP model takes a set of texts and an image to
“classify” the image among the provided texts. We want to find the best
matching images from a single provided text to create a search application. The
key here is the representation vector generated by CLIP’s text and image

Using the image model alone, we can generate a representation vector for each
image. We can then generate a vector for a query text and perform a nearest
neighbor search to find the images with the smallest cosine distance from the
query vector.

Nearest neighbor search consists of calculating the distance between the query
vector and the image representation. This must be done for all images. As the
number of images increases, this becomes infeasible. A common solution is to
create an index of the image representations. Unfortunately, there are no exact
methods for finding the nearest neighbors efficiently, so we trade accuracy for
efficiency in what is called approximate nearest neighbors (ANN).

Many different methods for ANN search have been proposed. Some are compatible
with inverted index structures, so they can be readily implemented in existing
information retrieval systems. Examples are k-means clustering, product
quantization (and its relatives), and locality-sensitive hashing, where the
centroids or buckets can be indexed. A method that is not compatible with
inverted indexes is HNSW (hierarchical
navigable small world). HNSW is based on graph structures, is efficient, and
has an attractive property where the graph can be incrementally built at
runtime. This is in contrast to most other methods that require offline,
batch-oriented index building.

The core of our text-to-image search application is an ANN index. In the
following, we’ll set this up in a Vespa application.

The Vespa text-to-image search application

It’s straightforward to set up an ANN search index in Vespa. All that is needed
is a document schema:

schema image_search {
    document image_search {
        field image_file_name type string {
            indexing: attribute | summary
        field image_features type tensor<float>(x[512]) {
            indexing: attribute | index
            attribute {
                distance-metric: angular
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 500

    rank-profile image_similarity inherits default {
        first-phase {
            expression: closeness(image_features)

Here, the documents representing images contain two fields: one for the image
file name and one for the image representation vector. Even though we could, we
don’t store the image blob in the index. The image_features vector is
represented as a tensor with
512 elements along one dimension. These are indexed using a HNSW
which supports approximate retrieval.

The schema also contains a rank profile. The image documents will be ranked
using the closeness rank feature when using this rank profile. The distance
function is set to angular in the document schema, so the closeness will
use the cosine distance between the provided query vector and the
image_features vector.

Using the above, we can issue a query that performs an approximate nearest

  "yql": "select image_file_name from sources * where ([{\"targetHits\": 10}]nearestNeighbor(image_features, text_features));",
  "hits": 10
  "ranking.features.query(text_features): [0.21,0.12,....],
  "ranking.profile": "image_similarity"

This will return the top 10 images given the provided text_features query
vector. Note that this means you need to pass the text query’s representation
vector; thus, it needs to be calculated outside of Vespa for this particular

The text-image search sample application includes a Python-based search
which uses the CLIP model for this. The app uses
pyvespa and is
particularly suitable for analysis and exploration. We’ll take a closer
look in the next section.

The sample application also includes a stand-alone Vespa
Here we move the logic and CLIP model for generating the text representation
vector into Vespa itself. This enables querying directly with text. The
stand-alone Vespa application is more suitable for production, and we’ll take a
look later.


Pyvespa is the Python
API to Vespa, which provides easy ways to create, modify, deploy and interact
with running Vespa instances. One of the main goals of pyvespa is to allow for
faster prototyping and to facilitate machine learning experiments for Vespa
applications. With pyvespa, it is easy to connect to either a local Vespa
running in a Docker container or Vespa cloud, a
service for running Vespa applications.

The text-image sample app
contains a full end-to-end example of setting up the application, feeding
images, and querying. This example also uses all 6 different model variants
found in CLIP and includes an evaluation analysis. As is shown there, the
ViT-B/32 model is found to be superior.

The streamlit demo
can be set up to query the python application after it is deployed and images are fed.
With this application, one can visually evaluate the differences in the model variants.

streamlist example

Native Vespa application

The Python application is fine for prototyping and running experiments.
However, it contains the code to generate vectors for the text queries. For an
application to run in production (which requires low latency and stability),
the question arises on how we make this feature available. Vespa contains
facilities to run custom code
as well as evaluating machine-learned
models inside custom

The ViT-B/32 model was superior during analysis, so this application uses
only that model. The model itself is put in the Vespa application package under
the models directory. There, it is automatically discovered and made
available to custom components, and a REST API is also provided.

The stand-alone sample
application contains
a searcher that modifies
the query from textual input to a vector representation suitable for nearest
neighbor search. This searcher first tokenizes the text using a custom
byte-pair encoding tokenizer before passing the tokens to the language model.
The query is then modified to an approximate nearest neighbor search using the
resulting vector representation.

After deploying this, the Vespa application can take a string input directly
and return the best matching images.


Pretty much anything can be represented by a vector. Text, images, even
time-based entities such as sound, viewing history or purchase logs. These
vectors can be thought of as points in a high-dimensional space. When similar
objects (according to some metric) are close, we can call this a semantic
space. Interestingly, the origin of the representation does not really matter.
This means we can project entities into a shared semantic space, independent of
what type of entity it is. This is called multi-modal search.

In this post, we’ve explored text-to-image search in Vespa, where users provide
a textual description and Vespa retrieves a set of matching images. For this,
we used the CLIP model from OpenAI, which
consists of two transformer models: one for text and one for images. These
models produce vector representations, and they have been trained so that the
cosine distance is small when the text matches the image. First, we indexed the
vector representations of a set of images. Then, we searched these using the
vector representation from user-provided queries.

One of the strong points of the CLIP model is its zero-shot learning
capability. While we used the
data set here, any set of images can be used. That means it is easy to set up
this application for your own collection of images. However, as with all
pre-trained models, there is room for improvement using fine-tuning. The CLIP
model is, however, a great baseline.

We have also created another sample application for video
This is another example of using the CLIP model to search for videos given a
textual description.