Revolutionizing Semantic Search with Multi-Vector HNSW Indexing in Vespa
Finding data items by nearest neighbor search in vector space has
become popular in recent years, but suffers from one big limitation:
each data item must be well representable by a single vector. This
is often far from possible, for example, when your data is text
documents of non-trivial length. Now we are announcing multi-vector
support in Vespa, which allows you to index multiple vectors per
document and retrieve documents by the closest vector in each document.
Background
Advances in self-supervised deep learning models have revolutionized
how to represent unstructured data like text, audio, images, and
video in a language native to machines: vectors.
Embedding data into vector space using deep neural networks.
Encoding objects using deep learning models represents them
in a high-dimensional vector space. In this latent embedding
vector space, one can compare objects using vector distance
functions, enabling search, classification, and clustering, to name
a few applications.
Documents (squares) and queries (circles) are mapped by a trainable
model to a vector space (here illustrated in two dimensions). The
two nearest neighbors of the query in vector space are documents A
and C. Using representation learning, the model is adjusted (by
gradient descent) so that relevant (q,d) pairs have a low distance,
and irrelevant pairs have a higher distance.
For embedding text data, models based on the
Transformer
architecture have become the de-facto standard. A challenge with
Transformer-based models is their input length limitation due to
the quadratic computational complexity of self-attention. For example,
a popular open-source text embedding
model has
an absolute maximum input length of 512 wordpiece
tokens. Still,
inputs are truncated during training at 128 tokens, and trying to
fit more tokens than used during fine-tuning of the model will
impact the quality of the vector representation. One
can view embedding encoding as a lossy compression technique, where
variable-length texts are compressed into a fixed dimensional vector
representation.
Highlighted text in blue from the
Wikipedia:Metric_space
article. The highlighted text is exactly 128 wordpieces long after
tokenization. This illustrates the narrow context window of
Transformer-based embedding models.
Due to the limited context length of Transformer-based models,
developers who want to index Wikipedia or any other text dataset using
embedding models must split the text input into chunks that align
with the lengths used during fine-tuning of the model for optimal
embedding quality.
Illustration of splitting or chunking of Wikipedia articles into
multiple paragraphs to overcome model length limitations. There are
several strategies for chunking longer text, from simple splitting
to more advanced methods using sliding windows, so the generated
chunks have overlapping wordpieces.
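For illustration, here is a minimal sketch of a sliding-window chunking
strategy in Python. The window and stride sizes are assumptions; in
practice, the chunk length should match the input length used when
fine-tuning the embedding model.

def chunk(tokens: list[str], window: int = 128, stride: int = 64) -> list[list[str]]:
    """Split a token sequence into overlapping chunks of at most `window` tokens."""
    chunks = []
    start = 0
    while start < len(tokens):
        # Adjacent chunks overlap by (window - stride) tokens.
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window covers the tail of the sequence
        start += stride
    return chunks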
When changing the retrieval unit from articles to paragraphs, the
nearest neighbor search in the embedding space retrieves the nearest
paragraphs, not the nearest articles. To map the query-paragraph
distance to a query-article distance, one popular approach is to
use an aggregate function; for example, the minimum of the
query-paragraph distances is used as a proxy for the query-article
distance.
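A minimal sketch of this aggregation in Python, assuming Euclidean
distance for illustration (use whatever metric the embedding model
was trained with):

import math

def article_distance(query: list[float], paragraphs: list[list[float]]) -> float:
    # The query-article distance is the minimum of the
    # query-paragraph distances.
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(euclidean(query, p) for p in paragraphs)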
With minimum distance aggregation, it’s possible to use a vector
search library and index paragraphs, combining the article id and
a paragraph id to form a new primary key. For example, using
incremental paragraph numbers per article, instead of indexing the
article Metric_space, developers could index Metric_space#1,
Metric_space#2, Metric_space#3, and so forth. This might seem
easy at first sight, but there are many disadvantages to this
modeling approach.
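For illustration, a minimal sketch of this composite-key modeling in
Python against a plain key-to-vector map; the helper functions are
hypothetical, and the cascading delete hints at the fan-out problems
discussed next.

# Hypothetical vector index keyed by a single string id.
index: dict[str, list[float]] = {}  # "article_id#paragraph_no" -> embedding

def index_article(article_id: str, paragraph_vectors: list[list[float]]):
    for i, vector in enumerate(paragraph_vectors, start=1):
        index[f"{article_id}#{i}"] = vector  # e.g. "Metric_space#1"

def delete_article(article_id: str):
    # Cascading delete: every paragraph key must be found and removed,
    # or orphaned paragraphs are left behind in the index.
    for key in [k for k in index if k.startswith(f"{article_id}#")]:
        del index[key]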
Relationships and fan-out
The developer needs to manage the fan-out relationship between
articles and paragraphs. When the article changes, the developer must
re-process the article into paragraphs which might leave orphaned
paragraphs in the paragraph index. Deletion of an article requires
a cascading deletion of all associated paragraphs from the index.
Similarly, updating an article-level meta field requires fan-out
to update the paragraphs associated with the article.
Result presentation
The unit of retrieval complicates search result presentation and
linking to the source text. Consider searching Wikipedia; how do
you link to the chunked text with the minimal embedding distance,
which caused the article to be present on the search engine result
page?
Result diversity
The K closest paragraphs in the paragraph-level embedding
space could all be from the same article. Increasing the diversity of
sources is especially important when used for retrieval-augmented
generative question answering, where the generative model also has
context length limitations.
Constrained nearest neighbor search
For constrained nearest neighbor
search
on article-level metadata, developers must duplicate the article
metadata into the paragraph level for efficient filtering in
combination with vector search. Thus, the more properties that
naturally belong on the article level, the higher the duplication
factor.
Hybrid ranking
Hybrid
combinations of text scoring functions, like
BM25, with vector
distance, outperform other efficient retrieval methods in a zero-shot
setting. Important text fields from the article, such as the title,
must be duplicated into the paragraph index to perform efficient
hybrid retrieval.
The following sections describe how to overcome these challenges
using Vespa’s multi-vector HNSW indexing feature. Multi-vector
indexing is available from Vespa 8.144.19. The post also looks
behind the scenes at the industry-unique implementation and describes
other use cases that are greatly simplified by multi-vector indexing
support.
Using multi-vector HNSW indexing with Vespa
This section highlights how developers configure and use multi-vector
indexing with Vespa. The highlights are from an open-source sample
application
that demonstrates the new multi-vector indexing functionality.
Schema
Consider the following schema for Wikipedia articles, expressed
using Vespa’s schema definition
language:
schema wiki {
  document wiki {
    field title type string {}
    field content type string {}
  }
}
This is straightforward, mapping the Wikipedia article schema to a
Vespa schema as one would have done before semantic search with
vector embeddings made its entry. With length-limited embedding
models, developers need to chunk the content into an array of strings
to overcome length limitations.
schema wiki {
  document wiki {
    field title type string {}
    field paragraphs type array<string> {}
    field paragraph_embeddings type tensor<float>(p{},x[384]) {}
  }
}
Notice the paragraph_embeddings
field, which is an example of a
mapped-indexed tensor,
mixing a mapped sparse dimension (p) with an indexed
dimension (x). This tensor type definition is similar to a
classic map data structure, where the key maps to a fixed-size array
of floats.
Using a mapped-indexed tensor allows variable-length articles to
be indexed without wasting resources using a matrix (indexed-indexed)
tensor representation.
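Conceptually, such a field behaves like the following Python structure;
this is only an analogy, not how Vespa stores tensors internally:

# tensor<float>(p{},x[384]) ~ a map from paragraph label to a
# fixed-size array of 384 floats; articles can have any number of
# paragraphs without padding everything to a fixed matrix shape.
paragraph_embeddings: dict[str, list[float]] = {
    "0": [0.12, 0.03] + [0.0] * 382,  # 384 floats per paragraph
    "1": [0.03, 0.02] + [0.0] * 382,
}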
In the above schema example, it’s up to the developer to produce
the paragraphs in array representation and the vectorized embeddings.
JSON Feed
With the above schema, developers can index chunked data, along
with the embeddings per chunk, using the following Vespa JSON feed
format:
{ "put": "id:wikipedia:wiki::Metric_space", "fields": { "title": "Metric space", "paragraphs": [ "In mathematics, a metric space...", "strings can be equipped with the Hamming distance, which measures the number.. " ], "paragraph_embeddings": { "0": [0.12,0.03,....,0.04], "1": [0.03, 0.02,...,0.02] } } }
Note that the developer controls the text chunking strategy. If the
developer wants to map vectors back to the text paragraphs that
produced them, they can use the index in the paragraphs array as the
mapped dimension label. This helps the developer present the best
matching paragraph(s) on the search result page, as described in the
ranking section below.
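A sketch of producing such a feed document in Python; the embed()
helper is hypothetical and stands in for whatever embedding model
produces the 384-dimensional vectors:

def make_feed_document(article_id: str, title: str, chunks: list[str]) -> dict:
    # Use the array index of each chunk as the mapped dimension label,
    # so a vector can be traced back to the paragraph that produced it.
    return {
        "put": f"id:wikipedia:wiki::{article_id}",
        "fields": {
            "title": title,
            "paragraphs": chunks,
            "paragraph_embeddings": {str(i): embed(c) for i, c in enumerate(chunks)},
        },
    }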
Native embedding integration
Managing complex infrastructure for producing text embedding vectors
can be challenging, especially at query serving time, which requires
low latency, high availability, and high query throughput. Vespa
allows developers to represent embedding
models in Vespa.
In this case, the schema becomes:
schema wiki {
  document wiki {
    field title type string {}
    field paragraphs type array<string> {}
  }
  field paragraph_embeddings type tensor<float>(p{},x[384]) {
    indexing: input paragraphs | embed | attribute | index | summary
  }
}
In this case, Vespa will produce the embeddings and use the array index
(0-based) of the paragraph as the mapped dimension label. Regardless
of whether they use native embedders or produce embeddings outside of
Vespa, developers must also configure a distance
metric
that matches the distance metric used during model training.
Querying
The following demonstrates a Vespa query
API request using the native
embedder
functionality.
The native embedder encodes the input text ‘metric spaces’, and
uses the resulting 384-dimensional vector in the nearest neighbor
search. See text embeddings made
simple and
Vespa query language
for details.
curl \
  --json "
  {
    'yql': 'select title,url from wiki where {targetHits:10}nearestNeighbor(paragraph_embeddings, q)',
    'input.query(q)': 'embed(metric spaces)'
  }" \
  https://vespaendpoint/search/
The targetHits
annotation to the nearestNeighbor query operator
is the target number of articles the
nearest neighbor search should expose to Vespa
ranking phases. Retrieving
articles instead of paragraphs overcomes the diversity issue associated
with a paragraph-level index, where all the closest retrieved nearest
neighbors could come from the same article.
Hybrid search
This query example combines the exact search with semantic search,
a hybrid combination that has demonstrated strong zero-shot
accuracy.
curl \
  --json "
  {
    'yql': 'select title,url from wiki where (userQuery()) or ({targetHits:10}nearestNeighbor(paragraph_embeddings, q))',
    'input.query(q)': 'embed(metric spaces)',
    'query': 'metric spaces',
    'ranking': 'hybrid'
  }" \
  https://vespaendpoint/search/
The userQuery() matches against the full article across all
paragraphs and the title. Searching across all the fields improves
recall of the traditional text retrieval component. The semantic
nearestNeighbor component searches at the paragraph level,
finding the closest paragraphs in the paragraph embedding space.
Developers can also use multiple nearestNeighbor query operators
in the query; see the nearest neighbor search in the Vespa
guide.
Multiple nearestNeighbor operators in the same query are convenient
for query rewrites and expansions. Notice also that the query request
includes a ranking parameter.
Ranking
Vespa allows declarative rank
expressions in
the schema. The existing
distance(dimension,name)
rank feature now also supports mapped-indexed tensor fields and returns
the distance of the closest vector. With support for searching
multi-vector fields, two new rank features are introduced:
closest(name)
and
closest(name,label)
The optional label is useful when querying
with multiple labeled nearestNeighbor query
operators.
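For illustration, a query with two labeled nearestNeighbor operators
might be constructed as in the following Python sketch; the label
annotation follows the Vespa query language documentation, while the
endpoint, input names, and query texts are assumptions:

import requests

# Two labeled nearestNeighbor operators, e.g. for a query expansion.
# closest(paragraph_embeddings, q1) can then report whether the q1
# operator produced the closest vector for a given hit.
body = {
    "yql": 'select title from wiki where '
           '({label:"q1",targetHits:10}nearestNeighbor(paragraph_embeddings, q1)) or '
           '({label:"q2",targetHits:10}nearestNeighbor(paragraph_embeddings, q2))',
    "input.query(q1)": "embed(metric spaces)",
    "input.query(q2)": "embed(distance functions)",
}
response = requests.post("https://vespaendpoint/search/", json=body)
print(response.json())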
The output of the closest feature is a tensor with one mapped
dimension and one point (with a value of 1.0). For example:
tensor<float>(p{}):{ 3: 1.0 }
In this example, the vector with label 3 is the closest paragraph
to the query, which retrieved the document into Vespa ranking phases.
The following rank-profile orders the articles by the maximum
paragraph cosine similarity using the mentioned distance feature.
The
match-features
are used to return the closest mapped dimension label. This allows
the developer to present the best matching paragraph on the search
result page.
rank-profile semantic inherits default {
  inputs {
    query(q) tensor<float>(x[384])
  }
  first-phase {
    expression: cos(distance(field, paragraph_embeddings))
  }
  match-features: closest(paragraph_embeddings)
}
Using Vespa tensor
compute expressions,
developers can also compute distance aggregates over all the vectors
in the field, for example, the average distance between the query and
all of a document's paragraph vectors.
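As an illustration of what such aggregates compute, here is a NumPy
sketch; in Vespa this logic would be written as a tensor compute
expression in a rank profile rather than in Python:

import numpy as np

def distance_aggregates(query: np.ndarray, paragraphs: np.ndarray) -> dict:
    # query: shape (384,); paragraphs: one row per mapped label, shape (n, 384).
    distances = np.linalg.norm(paragraphs - query, axis=1)
    return {"min": distances.min(), "mean": distances.mean(), "max": distances.max()}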
Result presentation and snippeting
How results are presented to the user is commonly overlooked when
introducing semantic search, and most vector databases
do not support snippeting or highlighting. With Vespa, developers can
display the best matching paragraph when displaying articles on the
search result page using the closest feature combined with
match-features. Vespa also supports dynamic
summaries
and bolding of query terms in single and multi-valued string fields.
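A sketch of this presentation step in Python, assuming the default
Vespa result JSON layout with match-features enabled and the field
names from the sample schema:

def best_paragraph(hit: dict) -> str:
    # closest(paragraph_embeddings) returns a mapped tensor with a single
    # cell, e.g. {"3": 1.0}; the label is the index into the paragraphs array.
    feature = hit["fields"]["matchfeatures"]["closest(paragraph_embeddings)"]
    label = next(iter(feature["cells"]))
    return hit["fields"]["paragraphs"][int(label)]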
Implementation
This section lifts the curtain on the multi-vector HNSW indexing
implementation in Vespa.
Vespa uses a custom HNSW index
implementation to
support approximate nearest neighbor search. This is a modified
version of the Hierarchical Navigable Small World (HNSW) graph
algorithm. To support multiple
vectors per document, some changes were made to the implementation.
The HNSW index consists of a navigable small world graph in a
hierarchy. A graph node corresponds to a single vector in the
document corpus, and is uniquely identified by a nodeid. A graph
node exists in 1 to N layers. On each layer, graph nodes are linked
to other nodes that are close in the vector space according to the
distance
metric.
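A simplified sketch of this data structure in Python; Vespa's actual
implementation is in C++ and differs in the details:

from dataclasses import dataclass, field

@dataclass
class GraphNode:
    nodeid: int  # uniquely identifies one vector in the document corpus
    # Per-layer adjacency: layers[0] is the bottom layer; each entry holds
    # the nodeids of neighbors close in vector space under the distance metric.
    layers: list[list[int]] = field(default_factory=list)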
The vectors are stored in a tensor field
attribute, and the HNSW
index references these vectors when doing distance calculations.
Each document in the corpus is assigned a unique docid, and this
docid is used when