Announcing vector streaming search: AI assistants at scale without breaking the bank

Photo by Marc Sendra Martorell on Unsplash

If you are using a large language model to build a personal assistant
you often need to give it access to personal data such as email, documents or images.
This is usually done by embedding the data as vectors, indexing them in a vector database, and retrieving by approximate nearest neighbor (ANN) search.

In this post we’ll explain why this is not a good solution for personal data
and introduce an alternative which is an order of magnitude cheaper while actually solving the problem:
Vector streaming search.

Let’s just build an ANN index?

Let’s say you’re building a personal assistant that works with personal data averaging 10k documents per user,
and that you want to scale to a million users – that is 10B documents.
And let’s say you are using typical cost-effective embeddings of 384 bfloat16s – 768 bytes per document.
How efficient can we make this in a vector database?

Let’s try to handle it the normal way by maintaining a global (but sharded) approximate nearest neighbor vector index.
Queries will need to calculate distances for vectors in a random access pattern as they are found in the index,
which means they’ll need to be in memory to deliver interactive latency.
Here, we need 10B * 768 bytes = 7.68 Tb of memory for the vectors,
plus about 20% for the vector index, for a total of about 9.2 Tb of memory to store a single copy of the data.
In practice though you need two copies to be able to deliver a user’s data reliably,
some headroom for other in-memory data (say 10%), and about 35% headroom for working memory.
This gives a grand total of 9.2 * 2 * 1.1 / 0.65 = 31Tb.

If we use nodes with 128Gb memory that works out to 31Tb / 128Gb = 242 nodes.
On AWS, we can use i4i.4xlarge nodes at a cost of about $33 per node per day, so our total cost becomes 242 * 33 = $8000 per day.

Hefty, but at least we get a great solution, right? Well, not really.

The A in ANN stands for approximate – the results from an ANN index will be missing some documents,
including likely some of the very best ones. That is often fine when working with global data,
but is it really acceptable to miss the one crucial mail, photo or document the user needs to complete some task correctly?

In addition – ANN indexes shine when most of the vectors in the data are eligible for a given query,
that is when query filters are weak. But here we need to filter on the user’s own data,
so our filter is very strong indeed and our queries will be quite expensive despite all the effort of building the index.
In fact it would be cheaper to not make use of the index at all (which is what Vespa would automatically do when given these queries).

Lastly, there’s write speed. A realistic speed here is about 8k inserts per node per second.
Since we have 2 * 10B/242 = 82 M documents per node that means it will take about
82M/(8k * 3600) = 2.8 hours to feed the entire data set even though we have this massive amount of powerful nodes.
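
For reference, the back-of-envelope arithmetic above can be reproduced in a few lines of Python (the same rough estimates as in the text, not measurements):

docs         = 10e9                                       # 10B documents
vector_bytes = 384 * 2                                    # 384 bfloat16 values = 768 bytes per document
vectors      = docs * vector_bytes                        # 7.68e12 bytes of raw vector data (7.68 Tb)
with_index   = vectors * 1.2                              # about 20% extra for the ANN index: ~9.2 Tb
total_tb     = round(with_index * 2 * 1.1 / 0.65 / 1e12)  # two copies, 10% other data, 35% headroom: 31 Tb
nodes        = round(total_tb * 1e12 / 128e9)             # 128Gb nodes: 242
cost_per_day = nodes * 33                                 # i4i.4xlarge at about $33/node/day: ~$8000
feed_hours   = (2 * docs / nodes) / (8_000 * 3600)        # ~82M docs per node at 8k inserts/s: just under 3 hours
print(nodes, cost_per_day, round(feed_hours, 1))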

To recap, this solution has four problems as shown in this table:

Regular ANN for personal data
❌ Cost: All the vectors must be in memory, which becomes very expensive.
❌ Coverage: ANN doesn’t find all the best matches, problematic with personal data.
❌ Query performance: Queries are expensive to the point of making an ANN index moot.
❌ Write performance: Writing the data set is slow.

Can we do better?

Let’s consider some alternatives.

The first observation to make is that we are building a global index capable of searching all users’ data at once,
but we are not actually using this capability since we always search in the context of a single user.
So, could we build a single ANN index per user instead?

This actually makes the ANN indexes useful since there is no user filter. However, the other three problems remain.

ANN (approximate nearest neighbor) for personal data
❌ Cost: All the vectors must be in memory, which becomes very expensive.
❌ Coverage: ANN doesn’t find all the best matches, problematic with personal data.
✅ Query performance: One index per user makes queries cheap.
❌ Write performance: Writing the data set is slow.

Can we drop the ANN index and do vector calculations brute force?
This is actually not such a bad option (and Vespa trivially supports it).
Since each user has a limited number of documents, there is no problem getting good latency by brute forcing over a user’s vectors.
However, we still store all the vectors in memory so the cost problem remains.

NN (exact nearest neighbor) for personal data
❌ Cost: All the vectors must be in memory, which becomes very expensive.
✅ Coverage: All the best matches are guaranteed to be found.
✅ Query performance: Cheap enough: One user’s data is a small subset of a node’s data.
✅ Write performance: Writing data is an order of magnitude faster than with ANN.

Can we avoid the memory cost? Vespa provides an option to mark vectors paged,
meaning portions of the data will be swapped out to disk.
However, since this vector store is not localizing the data of each user
we still need a good fraction of the data in memory to stay responsive, and even so both query and write speed will suffer.

NN (exact nearest neighbor) with paged vectors for personal data
🟡 Cost: A significant fraction of data must be in memory.
✅ Coverage: All the best matches are guaranteed to be found.
🟡 Query performance: Reading vectors from disk with random access is slow.
✅ Write performance: Writing data is an order of magnitude faster than with ANN.

Can we do even better, by localizing the vector data of each user
and so avoid the egregious memory cost altogether while keeping good performance?
Yes, with Vespa’s new vector streaming search you can!

Vespa’s streaming search solution lets you make the user id a part of the document id
so that Vespa can use it to co-locate the data of each user on a small set of nodes and on the same chunk of disk.
This allows you to do searches over a user’s data with low latency, without keeping any user’s data in memory
or paying the cost of managing indexes at all.

This mode has been available for a long time for text and metadata search,
and we have now extended it to support vectors and tensors as well, both for search and ranking.

With this mode you can store billions of user vectors, along with other data, on each node without running out of memory,
write it at a very high throughput thanks to Vespa’s log data store, and run queries with:

  • High throughput: Data is co-located on disk, or in memory buffers for recently written data.
  • Low latency regardless of user data size: Vespa will,
    in addition to co-locating a user’s data, also automatically spread it over a sufficient number of nodes to bound query latency.

In addition you’ll see about an order of magnitude higher write throughput per node than with a vector indexing solution.

The resource driving cost instead moves to disk I/O capacity, which is what makes streaming so much cheaper.
To compare with our initial solution which required 242 128Gb nodes – streaming requires 45 bytes of memory per document,
so we’ll be able to cram about 128Gb / 45 bytes * 0.65 = 1.84B documents on each node.
We can then fit two copies of the 10B document corpus on 20B / 1.84B = 11 nodes.

Quite a reduction! In a very successful application you may want a little more to deliver sufficient query capacity
(see the performance case study below), but this is the kind of savings you’ll see for real production systems.
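
The corresponding estimate for streaming, using the 45 bytes of per-document memory overhead mentioned above (again a rough sketch in Python):

docs_per_node = 128e9 / 45 * 0.65          # about 1.85B documents per 128Gb node
nodes         = 2 * 10e9 / docs_per_node   # two copies of the 10B documents: about 11 nodes
print(round(docs_per_node / 1e9, 2), round(nodes))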

Vector streaming search for personal data
✅ Cost: No vector data (or other document data) needs to be in memory.
✅ Coverage: All the best matches are guaranteed to be found.
✅ Query performance: Localized disk reads are fast.
✅ Write performance: Writing data is faster even with less than 1/20 of the nodes.

You can also combine vector streaming search with regular text search
and metadata search with little additional cost, and with advanced machine-learned ranking on the content nodes.
These are features you’ll also need if you want to create an application that gives users high quality responses.

To use streaming search in your application, make these changes to it:

  • Set streaming search mode for the document type in services.xml:
        <documents>
            <document type="my-document-type" mode="streaming" />
        </documents>
  • Feed documents with ids that include the user id of each document by
    setting the group value on ids. Ids will then be of the form id:myNamespace:myType:g=myUserId:myLocalId, where g=myUserId is the new part.
  • Set the user id to search on each query by setting the parameter
    streaming.groupname to the user id, as in the query sketch below.
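
For illustration, a streaming query can then be issued over Vespa’s query API roughly like this (a minimal Python sketch; the endpoint, field names, and placeholder embedding are assumptions matching the case study below):

import requests

user_id = "myUserId"
qemb = "[" + ",".join("0.1" for _ in range(384)) + "]"  # placeholder query embedding

response = requests.post("http://localhost:8080/search/", json={
    "yql": "select * from sources * where {targetHits:10}nearestNeighbor(embedding, qemb)",
    "input.query(qemb)": qemb,
    "streaming.groupname": user_id,  # restricts the search to this user's documents
    "hits": 10,
})
print(response.json())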

See the streaming search documentation for more details,
and try out the vector streaming search sample application to get started.

Performance case study

To measure the performance of Vespa’s vector streaming search we deployed a modified version of the
nearest neighbor streaming performance test
to Vespa Cloud.
We changed the node resources
and count for container and content nodes to fit the large scale use case.

The dataset is synthetically generated and consists of 48B documents, spread across 3.7M users.
The average number of documents per user is around 13000, and the document user distribution is as follows:

Documents per user    Percentage of users
100                   35%
1000                  28%
10000                 22%
50000                 10%
100000                 5%

We used 20 content nodes with the following settings to store around 2.4B documents per content node (redundancy=1).
These nodes equate to the AWS i4i.4xlarge instance with one 3750Gb AWS Nitro local SSD disk.

<nodes deploy:environment="perf" count="20">
    <resources vcpu="16" memory="128Gb" disk="3750Gb" storage-type="local" architecture="x86_64"/>
</nodes>

Vespa Cloud console showing the 20 content nodes allocated to store the dataset.

We used the following settings for container nodes. The node count was adjusted based on the particular test to run.
These nodes equate to the AWS Graviton 2 c6g.2xlarge instance.

<nodes deploy:environment="perf" count="32">
    <resources
        vcpu="8"
        memory="16Gb"
        disk="20Gb"
        storage-type="remote"
        architecture="arm64"/>
</nodes>

Feeding performance

The schema
in the application has two fields:

  • field id type long
  • field embedding type tensor<bfloat16>(x[384])

The embeddings are randomly generated by a document processor
while feeding the documents. In total each document is around 800 bytes, including the document id.
Example document put for user with id 10000021:

{
    "put":"id:test:test:g=10000021:81",
    "fields":{
        "id":81,
        "embedding":[
            0.424140,0.663390,
            ..,
            0.261550,0.860670
        ]
    }
}

To feed the dataset we used three instances of Vespa CLI
running in parallel on a non-AWS machine in the same geographic region (us east).
This machine has 48 CPUs and 256Gb of memory, and used between 40 and 48 CPU cores during feeding.
The total feed throughput was around 300k documents per second, and the total feed time was around 45 hours.
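
As a quick consistency check of these numbers (rough arithmetic, not additional measurements):

total_docs = 48e9
users      = 3.7e6
feed_rate  = 300_000                     # documents per second
print(total_docs / users)                # about 13000 documents per user on average
print(total_docs / feed_rate / 3600)     # about 44 hours of feeding, in line with the ~45 hours observed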

Vespa Cloud console showing feed throughput towards the end of feeding 48B documents.

Vespa Cloud console showing the 32 container nodes allocated when feeding the dataset.

Query performance

To analyze query performance we focused on users with 1k, 10k, 50k and 100k documents each.
For each of these four groups we drew between 160k and 640k random user ids to generate query files with 10k queries each.
Each query uses the nearestNeighbor
query operator to perform an exact nearest neighbor search over all documents for a given user.
Example query for user with id 10000021:

yql=select * from sources * where {targetHits:10}nearestNeighbor(embedding,qemb)
&input.query(qemb)=[...]
&streaming.groupname=10000021
&ranking.profile=default
&presentation.summary=minimal
&hits=10

This query returns the 10 closest documents according to the angular distance between the document embeddings and the query embedding.
See how the default ranking profile
is

Announcing Maximum Inner Product Search

Arne H Juul
Senior Principal Vespa Engineer

Geir Storli
Senior Principal Vespa Engineer


Photo by Nicole Avagliano on Unsplash

We are pleased to announce Vespa’s new feature to solve Maximum Inner Product Search (MIPS) problems,
using an internal transformation to a Nearest Neighbor search.
This is enabled by the new dotproduct
distance metric, used for distance calculations, and an extension to HNSW index structures.

What is MIPS, and why is it useful

The Maximum Inner Product Search (MIPS) problem arises naturally in
recommender systems,
where item recommendations and user preferences are modeled with vectors,
and the scoring is just the dot product (inner product) between the item vector and the query vector.

In recent years MIPS has seen many new applications in the machine learning community as well:

Many openly available models are trained and targeted for MIPS; for example the
Cohere Multilingual Embedding Model
was trained using dot product calculations.

The MIPS problem is closely related to a nearest neighbor search (NNS) with angular distance metric,
which can use the negative dot product as a distance after normalizing the vectors.
However, MIPS also gives higher scores to vectors with larger magnitude.
This means nearest neighbor search cannot be used directly for MIPS;
trying to would mean a vector may not be its own closest neighbor,
which usually has catastrophic consequences for NNS index building.

In some cases, pre-normalizing all vectors to the same magnitude is possible, and then MIPS becomes identical to angular distance.
Therefore, many NNS implementations offer using the negative or inverse of dot product as a distance,
e.g., NMSLIB has negdotprod.

Vespa also has this feature as part of its NNS implementation, named
prenormalized-angular
to emphasize that using it requires the data to be normalized before feeding them into Vespa.

But most MIPS use cases really need the true dot product with non-normalized magnitudes,
and Vespa now offers a direct way to handle this.

We use a transformation first described in
Speeding up the Xbox recommender system using a euclidean transformation for inner-product spaces,
where an extra dimension is added to the vectors.
The value in this dimension is computed based on the maximal norm for all vectors in that dataset
in such a way that distance in the N+1 dimensional space becomes a proxy for the inner-product in the original N dimensions.
A short explanation of the transformation is available at
towardsdatascience.com.

Transformation into 3D hemisphere

Illustration showing how adding an extra dimension transforms points in a 2D plane into points on a 3D hemisphere,
where all vectors have the same magnitude (the radius of the hemisphere).
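
The effect of the transformation can be illustrated with a small numpy sketch (an illustration of the math on synthetic data, not Vespa’s implementation): with M equal to the maximal vector norm, each document vector x gets the extra component sqrt(M^2 - ||x||^2) while the query gets 0, and exact nearest neighbor search in N+1 dimensions then orders documents exactly by their dot product with the query in the original N dimensions.

import numpy as np

rng = np.random.default_rng(0)
docs  = rng.normal(size=(1000, 16)) * rng.uniform(0.5, 2.0, size=(1000, 1))  # non-normalized vectors
query = rng.normal(size=16)

max_norm  = np.linalg.norm(docs, axis=1).max()
extra     = np.sqrt(max_norm**2 - np.linalg.norm(docs, axis=1)**2)  # extra dimension per document
docs_aug  = np.hstack([docs, extra[:, None]])
query_aug = np.append(query, 0.0)                                   # the query gets 0 in the extra dimension

by_distance   = np.argsort(np.linalg.norm(docs_aug - query_aug, axis=1))  # nearest neighbor in N+1 dims
by_dotproduct = np.argsort(-docs @ query)                                 # maximum inner product in N dims
assert (by_distance == by_dotproduct).all()                               # identical ordering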

The original transformation described in the research literature assumes the entire dataset is available for pre-processing in batch.
Alternatively, one could set some parameter beforehand (describing the data globally), such as the maximal norm possible for a vector.
With Vespa, we cannot make these assumptions, as we allow our users to start with an empty index and feed in data – often generated
in real-time – so no such a priori knowledge is available.

Therefore, Vespa will build the HNSW index incrementally and keep track of the maximal vector norm seen so far.
The extra dimension will be computed on demand to allow this value to change as more data is seen.
In practice, even with a large variation, a good approximation is reached very soon,
and the graph will adapt to the parameter change as it grows.

In practice, the extra dimension value is only needed during indexing (HNSW graph construction).
At query time, we can use the negative of the dot product as the distance directly.
This works because HNSW graph traversal only needs to compare distances to find the smaller ones,
so large negative numbers effectively evaluate as “closer” distances.

However, the transformation means that the nearest neighbor search
isn’t actually measuring any sort of distance seen in the original data.
Because of this we have chosen to give non-standard outputs from the Vespa rank-features
distance and
closeness.
For all other distance metrics, the distance rank-feature gives a natural distance measure,
while closeness gives a normalized number with 1.0 indicating a “perfect match”.
With the dotproduct metric, distance instead returns the negative dot product as used by the graph traversal,
and since with MIPS you can always have a better match, closeness gives the raw dot product,
which can take any value (with larger positive numbers indicating a better hit).

Recall experiments

We have experimented with the
Wikipedia simple English dataset using the
dotproduct
distance metric to see if recall is affected by the order in which the documents are fed to Vespa.
This dataset consists of 485851 paragraphs across 187340 Wikipedia documents,
where each paragraph has a 768-dimensional embedding vector generated by the
Cohere Multilingual Embedding Model.
We used the following schema:

schema paragraph {
    document paragraph {
        field id type long {
            indexing: attribute | summary
        }
        field embedding type tensor<float>(x[768]) {
            indexing: attribute | index | summary
            attribute {
                distance-metric: dotproduct
            }
            index {
                hnsw {
                    max-links-per-node: 48
                    neighbors-to-explore-at-insert: 200
                }
            }
        }
    }
    rank-profile default {
        inputs {
            query(paragraph) tensor<float>(x[768])
        }
        first-phase {
            expression: closeness(field,embedding)
        }
    }
    document-summary minimal {
        summary id {}
    }
}

We fed 400k paragraph documents in three different orders: random, ascending, and descending (ordered by the embedding vector norm).
We created 10k queries using the
nearestNeighbor query operator with
targetHits:10
and query embeddings from the last 10k paragraphs in the dataset.
By running each query with approximate:true (ANN with HNSW index)
and approximate:false (brute-force full scan), we can compare the results and calculate the recall@10 for ANN.
The recall can be adjusted by increasing
hnsw.exploreAdditionalHits
to explore more neighbors when searching the HNSW index. The results are summarized in the following table:

exploreAdditionalHits    Order: Random    Order: Ascending    Order: Descending
0                        54.2             55.3                81.9
90                       81.9             90.5                98.6
190                      87.4             95.1                99.6
490                      92.4             98.2                99.8

The best recall is achieved by feeding the documents with the largest embedding vector norms first.
This matches the transform and technique used in the research literature.
However, we still achieve good recall in the random order case, which best matches a real-world scenario.
In this case, the maximal vector norm seen so far increases over time,
and the value in the N+1 dimension for a given vector might also change over time.
This can lead to slight variations in distance calculations for a given vector neighborhood based on when the calculations were performed.

To tune recall for a particular use case and dataset, conduct experiments by adjusting
HNSW index settings and
hnsw.exploreAdditionalHits.
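
For completeness, recall@10 as used above is simply the overlap between the approximate and the exact top-10 result sets for a query, averaged over the 10k queries; a minimal sketch:

def recall_at_10(ann_ids, exact_ids):
    # Fraction of the exact top-10 documents that the approximate top-10 also returned.
    return len(set(ann_ids[:10]) & set(exact_ids[:10])) / 10

# Example: the approximate result misses one of the ten exact neighbors, giving recall@10 = 0.9
print(recall_at_10([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]))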

Summary

Solving Maximum Inner Product Search (MIPS) problems using the new
dotproduct
distance metric and the
nearestNeighbor
query operator is available in Vespa 8.172.18.
Given a vector dataset, no a priori knowledge is needed about the maximal vector norm.
Just feed the dataset as usual, and Vespa will handle the required transformations.

Got questions? Join the Vespa community in Vespa Slack.

Announcing search.vespa.ai

Today, we announce the general availability of search.vespa.ai –
a new search experience for all (almost) Vespa-related content –
powered by Vespa, LangChain, and OpenAI’s chatGPT model.
This post overviews our motivation for building it, its features, limitations, and how we made it:


Over the last year, we have seen a dramatic increase in interest in Vespa
(From 2M pulls to 11M vespaengine/vespa pulls within just a few months),
resulting in many questions on our Slack channel,
like “Can Vespa use GPU?” or
“Can you expire documents from Vespa?”.

Our existing search interface could only present a ranked list of documents for questions like that,
showing a snippet of a matching article on the search result page (SERP).
The user then had to click through to the article and scan for the fragment snippets relevant to the question.
This experience is unwieldy if looking for the reference documentation of a specific Vespa configuration setting
like num-threads-per-search buried in
large reference documentation pages.

We wanted to improve the search experience by displaying a better-formatted response,
avoiding clicking through, and linking directly to the content fragment.
In addition, we wanted to learn more about using a generative large language model to answer questions,
using the top-k retrieved fragments in a so-called retrieval augmented generation (RAG) pipeline.

This post goes through how we built search.vespa.ai – highlights:

  • Creating a search for chunks of information –
    the bit of info the user is looking for.
    The chunks are called paragraphs or fragments in this article
  • Rendering fragments in the result page, using the original layout, including formatting and links.
  • Using multiple ranking strategies to match user queries to fragments:
    Exact matching, text matching, semantic matching,
    and multivector semantic query-to-query matching.
  • Search suggestions and hot links.

The Vespa application powering search.vespa.ai is running in Vespa Cloud.
All the functional components of search.vespa.ai are Open Source and are found in repositories like
vespa-search,
documentation,
and vespa-documentation-search –
it is a great starting point for other applications using features highlighted above!

Getting the Vespa content indexed

The Vespa-related content is spread across multiple git repositories, in different formats like HTML,
Markdown, sample apps, and Jupyter Notebooks.
Jekyll generators make it easy to export this content to a Vespa feed format;
see vespa_index_generator.rb for an example.

First, we needed to convert all sources into a standard format
so that the search result page could display a richer formatted experience
instead of a text blob of dynamic summary snippets with highlighted keywords.

Since we wanted to show full, feature-rich snippets, we first converted all the different source formats to Markdown.
Then, we use the markdown structure to split longer documents into smaller retrieval units or fragments
where each retrieval unit is directly linkable, using URL anchoring (#).
This process was the least exciting thing about the project, with many iterations,
for example, splitting larger reference tables into smaller retrievable units.
We also adapted reference documentation to make the fragments linkable – see hotlinks.
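
As a rough illustration of the splitting step (not the actual conversion code, which lives in the repositories linked above), one can split Markdown on headings and derive the URL anchor from each heading:

import re

def split_into_fragments(markdown: str, base_url: str):
    # Split a Markdown document on level-2 headings into directly linkable fragments.
    fragments = []
    for part in re.split(r"^## ", markdown, flags=re.MULTILINE)[1:]:
        heading, _, body = part.partition("\n")
        anchor = re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")
        fragments.append({
            "title": heading.strip(),
            "content": body.strip(),
            "path": f"{base_url}#{anchor}",  # each retrieval unit gets its own URL anchor
        })
    return fragments
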
The retrievable units are indexed in a
paragraph schema:

schema paragraph {
    document paragraph {
        field path type string {}
        field doc_id type string {}
        field title type string {}
        field content type string {}
        field questions type array<string> {}        
        field content_tokens type int {}
        field namespace type string {}
    }  
    field embedding type tensor<float>(x[384]) {
        indexing: "passage: " . (input title || "") . " " . (input content || "") | embed ..
    }
    field question_embedding type tensor<float>(q{}, x[384]) {
        indexing {
            input questions |
            for_each { "query: " . _ } | embed | ..
        }
    }
}

There are a handful of fields in the input (paragraph document type) and two synthetic fields that are produced by Vespa,
using Vespa’s embedding functionality.
We are mapping different input string fields to two different
Vespa tensor representations.
The content and title fields are concatenated and embedded
to obtain a vector representation of 384 dimensions (using e5-small-v2).
The question_embedding is a multi-vector tensor;
in this case, the embedder embeds each input question.
The output is a multi-vector representation (A mapped-dense tensor).
Since the document volume is low, an exact vector search is all we need,
and we do not enable HNSW indexing of these two embedding fields.

LLM-generated synthetic questions

The questions per fragment are generated by an LLM (chatGPT).
We do this by asking it to generate questions the fragment could answer.
The LLM-powered synthetic question generation is similar to the approach described in
improving-text-ranking-with-few-shot-prompting.
However, we don’t select negatives (irrelevant content for the question) to train a
cross-encoder ranking model.
Instead, we expand the content with the synthetic questions for matching and ranking:

{
    "put": "id:open-p:paragraph::open/en/access-logging.html-",
    "fields": {
        "title": "Access Logging",
        "path": "/en/access-logging.html#",
        "doc_id": "/en/access-logging.html",
        "namespace": "open-p",
        "content": "The Vespa access log format allows the logs to be processed by a number of available tools\n handling JSON based (log) files.\n With the ability to add custom key/value pairs to the log from any Searcher,\n you can easily track the decisions done by container components for given requests.",
        "content_tokens": 58,
        "base_uri": "https://docs.vespa.ai",
        "questions": [
            "What is the Vespa access log format?",
            "How can custom key/value pairs be added?",
            "What can be tracked using custom key/value pairs?"
        ]
    }
},

Example of the Vespa feed format of a fragment from this
reference documentation and three LLM-generated questions.
The embedding representations are produced inside Vespa and not fed with the input paragraphs.
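
The question generation itself can be sketched roughly as follows (an illustrative sketch only; the post does not specify the exact model, prompt, or parameters used):

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_questions(fragment_content: str, n: int = 3) -> list[str]:
    # Ask the LLM for questions that the given documentation fragment answers.
    prompt = (f"Write {n} short questions that the following documentation "
              f"fragment answers. One question per line.\n\n{fragment_content}")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]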

Matching and Ranking

To retrieve relevant fragments for a query, we use a hybrid combination of exact matching, text matching,
and semantic matching (embedding retrieval).
We build the query tree in a custom Vespa Searcher plugin.
The plugin converts the user query text into an executable retrieval query.
The query request searches both in the keyword and embedding fields using logical disjunction.
The YQL equivalent:

where (weakAnd(...) or ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(question_embedding,q))) and namespace contains "open-p"

Example of using hybrid retrieval, also using
multiple nearestNeighbor operators
in the same Vespa query request.

The scoring logic is expressed in Vespa’s ranking framework.
The hybrid retrieval query generates multiple Vespa rank features that can be used to score and rank the fragments.

From the rank profile:

rank-profile hybrid inherits semantic {
    inputs {
        query(q) tensor<float>(x[384])
        query(sw) double: 0.6 #semantic weight
        query(ew) double: 0.2 #keyword weight
    }

    function semantic() {
        expression: cos(distance(field, embedding))
    }
    function semantic_question() {
        expression: max(cos(distance(field, question_embedding)), 0)
    }
    function keywords() {
        expression: (  nativeRank(title) +
                       nativeRank(content) +
                       0.5*nativeRank(path) +
                       query(ew)*elementCompleteness(questions).completeness  ) / 4 +
                     elementCompleteness(questions_exact).completeness
    }
    first-phase {
        expression: query(sw)*(semantic_question + semantic) + (1 - query(sw))*keywords
    }
}

For the keyword matching using weakAnd,
we match the user query against the following fields:

  • The title – including the parent document title and the fragment heading
  • The content – including markup
  • The path
  • LLM-generated synthetic questions that the content fragment is augmented with

This is expressed in Vespa using a fieldset:

fieldset default {
    fields: title, content, path, questions
}

Matching in these fields generates multiple keyword matching rank-features,
like nativeRank(title) and nativeRank(content).
We collapse all these features into a keywords scoring function that combines them into a single score.
The nativeRank text ranking features are normalized between 0 and 1,
which makes them easier to reason about and combine with semantic similarity scores (e.g., cosine similarity).
We use a combination of the content embedding and the question(s) embedding scores for semantic scoring.

Search suggestions

As mentioned earlier, we bootstrapped questions to improve retrieval quality using a generative LLM.
The same synthetic questions are also used to implement search suggestion functionality,
where search.vespa.ai suggests questions to search for based on the typed characters:

search suggestions

This functionality is achieved by indexing the generated questions in a separate Vespa document type.
The search suggestions help users discover content and also help to formulate the question,
giving the user an idea of what kind of queries the system can realistically handle.

Similar to the retrieval and ranking of context described in previous sections,
we use a hybrid query for matching against the query suggestion index,
including a fuzzy query term to handle minor misspelled words.

We also add semantic matching using vector search for longer questions, increasing the recall of suggestions.
To implement this, we use Vespa’s HF embedder with the e5-small-v2 model,
which gives reasonable accuracy at an inference cost low enough to serve per-character type-ahead queries
(yes, there is an embedding inference per character).
See Enhancing Vespa’s Embedding Management Capabilities
and Accelerating Embedding Retrieval
for more details on these tradeoffs.

To cater to navigational queries, where a user uses the search for lookups,
we include hotlinks in the search suggestion drop-down –
clicking on a hotlink will direct the user directly to the reference documentation fragment.
The hotlink functionality is implemented by extracting reserved names from reference documents
and indexing them as documents in the suggestion index.

Reference suggestions are matched using prefix matching for high precision.
The frontend code detects the presence of the meta field with the ranked hint and displays the direct link:

suggestion hotlinks

Retrieval Augmented Generation (RAG)

Retrieval Augmentation for LLM Generation is a concept
that has been written about extensively over the past few months.
In contrast to extractive question-answering,
which answers questions
by finding relevant spans in retrieved texts,
a generative model generates an answer that is not strictly grounded in retrieved text spans.

The generated answer might be hallucinated or incorrect,
even if the retrieved context contains a concrete solution.
To combat (but not eliminate) this:

  • Retrieved fragments or chunks can be displayed fully without clicking through.
  • The retrieved context is the center of the search experience,
    and the LLM-generated abstract is an additional feature of the SERP.
  • The LLM is instructed to cite the retrieved fragments so that a user can verify by navigating the sources.
    (The LLM might still not follow our instructions).
  • Allow filtering on source so that the retrieved context can be focused on particular areas of the documentation.

None of these solves the problem of LLM hallucination entirely!
Still, they help the user identify incorrect information.

Example of a helpful generated abstract.

Example of an incorrect and not helpful abstract.
In this case, there is no explicit information about indentation in the Vespa documentation sources.
The citation does show an example of a schema (with space indentation), but indentation does not matter.

Prompt engineering

By trial and error (inherent LLM prompt brittleness), we ended up with a simple instruction-oriented prompt where we:

  • Set the tone and context (helpful, precise, expert)
  • Some facts and background about Vespa
  • The instructions (asking politely; we don’t want to insult the AI)
  • The top context we retrieved from Vespa – including markdown format
  • The user question

We did not experiment with emerging prompt techniques or chaining of prompts.
The following demonstrates the gist of the prompt,
where the two input variables are {question} and {context},
and {context} holds the fragments from the retrieval and ranking phase:

You are a helpful, precise, factual Vespa expert who answers questions and user instructions about Vespa-related topics. The documents you are presented with are retrieved from Vespa documentation, Vespa code examples, blog posts, and Vespa sample applications.

Facts about Vespa (Vespa.ai):
- Vespa is a battle-proven open-source serving engine.
- Vespa Cloud is the managed service version of Vespa (Vespa.ai).

Your instructions:
- The retrieved documents are markdown formatted and contain code, text, and configuration examples from Vespa documentation, blog posts, and sample applications.
- Answer questions truthfully and factually using only the information presented.
- If you don't know the answer, just say that you don't know, don't make up an answer!
- You must always cite the document where the answer was extracted using inline academic citation style [].
- Use markdown format for code examples.
- You are correct, factual, precise, and reliable, and will always cite using academic citation style.

{context}

Question: {question}
Helpful factual answer:

We use the Typescript API of LangChain,
a popular open-source framework for working with retrieval-augmented generation and LLMs.
The framework lowered our entry to working with LLMs and worked flawlessly for our use case.

Deployment overview

The frontend is implemented in

Announcing our series A funding

A month ago we announced that Vespa is finally
its own company.
Today we’re announcing a $31 million
investment from Blossom Capital.

The spin-out from Yahoo gave us the ability to focus on serving and growing the entire Vespa ecosystem,
and this investment gives us the financial muscle to invest in building a complete platform for
all use cases involving big data and AI online, and serving large and small customers on our
cloud solution.

When we met the Blossom team, we quickly realized they were a great partner for us, with their deep
understanding of what it takes to build a world-class global tech company, and their exceedingly fast
and efficient decision-making (we’re kind of into speed).

Read more in this
TechCrunch article.