Anonymized endpoints and token authentication in Vespa Cloud

Morten Tokle

Principal Software Systems Engineer

Martin Polden

Principal Vespa Engineer

When you deploy a Vespa application on Vespa Cloud your application is assigned an endpoint for each container cluster declared in your application package. This is the endpoint you communicate with when you query or feed documents to your application.

Since the launch of Vespa Cloud, these endpoints have included many dimensions identifying your exact cluster, in the following format: {service}.{instance}.{application}.{tenant}.{zone}. This format allows easy identification of a given endpoint.

However, while this format makes it easy to identify where an endpoint points, it also reveals details of the application that you might want to keep confidential. This is why we are introducing anonymized endpoints in Vespa Cloud. Anonymized endpoints are created on demand when you deploy your application and have the format {generated-id}.{generated-id}.{scope}. As with existing endpoints, details of anonymized endpoints and where they point are shown in the Vespa Cloud Console for your application.
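To make the difference concrete, the sketch below splits an endpoint name into its dot-separated labels and reports what each layout reveals. The function name and the example names are hypothetical, and real endpoint hostnames also carry a domain suffix that is not modeled here. Only the two label layouts come from the text above.

```python
LEGACY_DIMENSIONS = ("service", "instance", "application", "tenant", "zone")
ANONYMIZED_PARTS = ("generated-id-1", "generated-id-2", "scope")


def describe_endpoint(name: str) -> dict:
    """Map the labels of an endpoint name onto the layout it matches.

    Assumes the name holds only the labels from the formats above,
    without any trailing domain suffix.
    """
    labels = name.split(".")
    if len(labels) == len(LEGACY_DIMENSIONS):
        return dict(zip(LEGACY_DIMENSIONS, labels))
    if len(labels) == len(ANONYMIZED_PARTS):
        return dict(zip(ANONYMIZED_PARTS, labels))
    raise ValueError(f"unrecognized endpoint layout: {name!r}")


# Hypothetical names, for illustration only.
legacy = describe_endpoint("search.default.shop.acme.prod-zone")
anonymous = describe_endpoint("a1b2c3d4.e5f6a7b8.z")
print(legacy)      # tenant and application are readable from the name
print(anonymous)   # only generated ids and a scope remain
```

The legacy name exposes the tenant and application to anyone who sees the hostname, while the anonymized name carries no such labels.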

Anonymized endpoints are now the default for all new applications in Vespa Cloud. They have also been enabled for existing applications, with backward compatibility. This means that endpoints in the old format continue to work for now but are marked as deprecated in the Vespa Cloud Console. We will continue to support the previous format for existing applications, but we encourage using the new endpoints.

In addition to making your endpoint details confidential, this new endpoint format allows Vespa Cloud to optimize certificate issuing. It allows for much faster deployments of new applications as they no longer have to wait for a new certificate to be published.

No action is needed to enable this feature. You can find the new anonymized endpoints in the Vespa Console for your application or by running the Vespa CLI command vespa status.

In addition to anonymized endpoints, we are introducing support for data plane authentication using access tokens. Token authentication is intended for cases where mTLS authentication is unavailable or impractical. For example, edge runtimes like the Vercel edge runtime are built on the V8 JavaScript engine, which does not support mTLS authentication. Access tokens are created and defined in the Vespa Cloud Console and referenced in the application package. See instructions for creating and referencing tokens in the application package in the security guide.

Note it’s still required to define a data plane certificate for mTLS authentication; mTLS is still the preferred authentication method for data plane access, and applications configuring token-based authentication will have two distinct endpoints.


Application endpoints in Vespa Console – Deprecated legacy mTLS endpoint name and two anonymized endpoints, one with mTLS support and the other with token authentication. Using token-based authentication on the mTLS endpoint is not supported.

Using data plane authentication tokens

Using the token endpoint from the above screenshot, we can authenticate against it by
adding a standard Authorization HTTP header to the data plane requests, as demonstrated below using curl:

curl -H "Authorization: Bearer vespa_cloud_...."
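The same header can be set from any HTTP client. As a minimal sketch using only the Python standard library, the snippet below builds a request with the Bearer header without sending it. The URL is a placeholder assumption, not a real endpoint, and the token value is the same elided placeholder as in the curl example.

```python
import urllib.request


def build_token_request(url: str, token: str) -> urllib.request.Request:
    """Create a GET request carrying the token as a Bearer credential."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder values; substitute your own token endpoint and token.
request = build_token_request("https://token-endpoint.example/", "vespa_cloud_....")
print(request.get_header("Authorization"))
# Sending is left out here; urllib.request.urlopen(request) would perform the call.
```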


Using the latest release of pyvespa, you can interact with token endpoints by setting an environment variable named VESPA_CLOUD_SECRET_TOKEN. If this environment variable is present, pyvespa will read this and use it when interacting with the token endpoint.

import os
os.environ['VESPA_CLOUD_SECRET_TOKEN'] = "vespa_cloud_...."
from vespa.application import Vespa
vespa = Vespa(url="")

The token is only used for data plane requests. There are no changes concerning control-plane authentication, which requires a valid developer/API key.

We do not plan to add token-based authentication to other Vespa data plane tools like vespa-cli or vespa-feed-client, as these are not designed for lightweight edge runtimes.

Edge Runtimes

This is a minimalistic example of using Cloudflare worker, where we have stored the secret Vespa Cloud token using Cloudflare worker functionality for storing secrets. Note that Cloudflare workers also support mTLS.

export default {
    async fetch(request, env, ctx) {
        // The token is read from a secret stored with the Cloudflare worker
        const secret_token = env.vespa_cloud_secret_key
        return fetch('',
                     {headers:{'Authorization': `Bearer ${secret_token}`}})
    }
}

Consult your preferred edge runtime provider documentation on how to store and access secrets.

Security recommendations

It may be easier to use a token for authentication, but we still recommend using mTLS wherever possible. Before using a token for your application, consider the following recommendations.

Token expiration

While the cryptographic properties of tokens are comparable to certificates, it is recommended that tokens have a shorter expiration. Tokens are part of the request headers and not used to set up the connection. This means they are more likely to be included in e.g. log outputs. The default token expiration in Vespa Cloud is 30 days, but it is possible to create tokens with shorter expiration.
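Because tokens expire, clients need a rotation routine. The sketch below is one hypothetical way to flag a token that is close to expiry, assuming you record the creation time yourself. The 30-day lifetime is the default stated above; the 7-day warning window and the function names are assumptions, and nothing here calls a Vespa API.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_LIFETIME = timedelta(days=30)  # default expiration stated above


def time_left(created_at: datetime, now: datetime,
              lifetime: timedelta = DEFAULT_LIFETIME) -> timedelta:
    """Return how long a token created at `created_at` remains valid."""
    return created_at + lifetime - now


def needs_rotation(created_at: datetime, now: datetime,
                   warn_before: timedelta = timedelta(days=7)) -> bool:
    """True when the token expires within `warn_before` (a hypothetical policy)."""
    return time_left(created_at, now) <= warn_before


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(needs_rotation(created, datetime(2024, 1, 10, tzinfo=timezone.utc)))
print(needs_rotation(created, datetime(2024, 1, 25, tzinfo=timezone.utc)))
```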

Token secret storage

The token value should be treated as a secret, and never be included in source code. Make sure to use a secure way of accessing the tokens and in such a way that they are not exposed in any log output.
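As a small sketch of these two recommendations, the helpers below read the token from the environment instead of source code and produce a log-safe description that never contains the token value. The helper names are hypothetical; the variable name VESPA_CLOUD_SECRET_TOKEN is the one pyvespa reads, as described earlier.

```python
import os


def load_token(env=os.environ, name: str = "VESPA_CLOUD_SECRET_TOKEN") -> str:
    """Fetch the token from the environment and fail loudly when it is missing."""
    token = env.get(name)
    if not token:
        raise RuntimeError(f"environment variable {name} is not set")
    return token


def redact(token: str) -> str:
    """Return a description that is safe to log: the length only, never the value."""
    return f"<redacted token, {len(token)} chars>"


# A dict stands in for the real environment in this example.
token = load_token({"VESPA_CLOUD_SECRET_TOKEN": "vespa_cloud_example"})
print(redact(token))
```

Passing the environment as a parameter keeps the helper easy to test without touching real secrets.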


Keeping your data safe is a number one priority for us. With these changes, we continue to improve the developer friendliness of Vespa Cloud while maintaining the highest level of security. With anonymized endpoints, we improve deployment time for new applications by several minutes, avoiding waiting for certificate issuing. Furthermore, anonymized endpoints eliminate disclosing tenant and application details in certificates and DNS entries.

Managed Vector Search using Vespa Cloud

Photo by israel palacio on Unsplash

There is a growing interest in AI-powered vector representations of unstructured multimodal data
and searching efficiently over these representations. This blog post describes how your organization can unlock the full potential of multimodal AI-powered vector representations using Vespa – the industry-leading open-source big data serving engine.


Deep Learning has revolutionized information extraction from unstructured data like text, audio, image, and videos.
Furthermore, self-supervised learning algorithms like data2vec
accelerate learning representations of speech, vision, text, and multimodal representations
combining these modalities. Pre-training deep neural network models using self-supervised
learning without expensive curated labeled data helps scale machine learning as
adoption and fine-tuning for a specific task requires fewer labeled examples.

Representing unstructured multimodal data as vectors or tensors unlocks new and exciting use cases
that were not easy to foresee just a few years ago. Even a well-established AI-powered use case like
search ranking, which has been using AI to improve the search results for decades,
is going through a neural paradigm shift
driven by language models like BERT.

These emerging multimodal data-to-vector models increase the insight and knowledge organizations can
extract from their unstructured data. As a result, organizations leveraging this
new data paradigm will have a significant competitive advantage over organizations
not participating in this paradigm shift.
Learning from structured and unstructured data has historically
primarily been performed offline.
However, advanced organizations with access to modern infrastructure
and competence have started moving the learning process online,
using real-time,
in-session contextual features to improve AI predictions.

One example of real-time online inference or prediction is within-cart
recommendation systems,
where grocery and e-commerce sites recommend or predict
related items to supplement the user’s current cart contents.
An AI-powered recommendation model for this use case could use item-to-item similarity
or past sparse user-to-item interactions.
Still, without a doubt, using the real-time context, in this case, the cart’s contents,
can improve the model’s accuracy. Furthermore,
creating add-to-cart suggestions for all possible combinations offline is impossible
due to the combinatoric explosion of likely cart items.
This use case also has the challenging property that the number of things to choose from is extensive,
hundreds of millions in the case of Amazon. In addition, business constraints like in-stock status limit the candidate selection.
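The combinatorial explosion is easy to quantify. The sketch below counts the distinct unordered carts of a given size using the binomial coefficient. The catalog size of one million is a hypothetical figure, chosen to be far smaller than the hundreds of millions mentioned above.

```python
import math


def possible_carts(catalog_size: int, cart_size: int) -> int:
    """Number of distinct unordered carts holding `cart_size` different items."""
    return math.comb(catalog_size, cart_size)


catalog = 1_000_000  # hypothetical catalog size
for size in (1, 2, 3):
    print(size, possible_carts(catalog, size))
```

Already at three items the count exceeds 10^17, which is why suggestions have to be computed online from the actual cart rather than precomputed for every combination.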

Building technology and infrastructure to perform computationally complex distributed AI inference
over billions of data items with low user-time serving latency constraints
is one of the most challenging problems in computing.

Vespa – Serving Engine

Vespa, the open-source big data serving engine, specializes in making it easy for
organizations of any size to move AI inference computations online at scale without investing a significant amount of resources in building infrastructure and technology. Vespa is a distributed computation engine that can scale in any dimension.

  • Scale elastically with data volume – handling billion scale
    datasets efficiently without pre-provisioning resources up-front.
  • Scale update and ingestion rates to handle evolving real-time data.
  • Scale with query volume using state-of-the-art retrieval and index structures and fully use modern hardware stacks.

In Vespa, AI is a first-class citizen and not an after-thought. The following Vespa primitives are the
foundational building blocks for building an online AI serving engine:

  • CRUD operations at scale. Dataset sizes vary across organizations and use cases. Handling fast-paced evolving datasets is one of Vespa’s core strengths. Returning to our in-cart recommendation system for a moment, handling in-stock status updates, price changes, or real-time click feedback can dramatically improve the experience – imagine recommending an item out of stock? A lost revenue opportunity and a negative user experience.
  • Document Model. Vespa’s document model supports structured and unstructured field types, including tensor fields representing single-order dense vectors. Vespa’s tensor storage and compute engine
    is built from the ground up.
    The document model with tensor also enables feature-store functionality, accessing real-time features close to the data.
    Features stored as Vespa attributes support in place real-time updates
    at scale (50K updates/s per tensor field per compute node).
  • A feature-rich query language. Vespa’s SQL-like query language
    enables efficient online selection over potentially billions of rows, combining structured and unstructured data in the same query.
  • Machine Learning frameworks and accelerator integrations. Vespa integrates with the most popular machine learning frameworks like
    Tensorflow, PyTorch,
    XGboost, and LightGBM.
    In addition, Vespa integrates with ONNX-Runtime
    for accelerated inference
    with large deep neural network models that accelerate powerful data-to-vector models.
    Vespa handles model versioning,
    distribution, and auto-scaling of online inference computations.
    These framework integrations complement Vespa’s native
    support for tensor storage and calculations over tensors.
  • Efficient Vector Search. AI-powered vector representations are at the core of the unstructured data revolution. Vespa implements a real-time version of the HNSW algorithm for efficient vector search, an implementation that is vetted and verified with multiple vector datasets.
    Vespa supports combining vector search with structured query filters at scale.
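To illustrate what combining vector search with a structured filter means, here is a minimal exact nearest-neighbor scan in plain Python. It is deliberately not HNSW and not Vespa's API: it is the exhaustive baseline whose linear cost approximate indexes are designed to avoid. The document fields and function name are hypothetical.

```python
import math


def exact_nearest(query, docs, k, keep=lambda doc: True):
    """Return the k closest docs by Euclidean distance among those passing `keep`.

    This is an exhaustive scan: every document that passes the filter is compared
    with the query, so the cost grows linearly with the number of documents.
    """
    candidates = [d for d in docs if keep(d)]
    candidates.sort(key=lambda d: math.dist(query, d["vector"]))
    return candidates[:k]


docs = [
    {"id": "a", "vector": [0.0, 0.0], "in_stock": True},
    {"id": "b", "vector": [1.0, 0.0], "in_stock": False},
    {"id": "c", "vector": [2.0, 2.0], "in_stock": True},
]
hits = exact_nearest([0.9, 0.1], docs, k=2, keep=lambda d: d["in_stock"])
print([d["id"] for d in hits])
```

Document "b" is the closest vector but is excluded by the in-stock filter, which mirrors the business constraint from the cart recommendation example.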

Get Started Today with Vector Search using Vespa Cloud.

We have created a getting started with Vector Search sample application which,
in a few steps, shows you how to deploy your Vector search use case to Vespa Cloud.
Check it out at

The sample application features:

  • Deployment to Vespa Cloud environments (dev, perf, and production) and how to perform safe deployments to production using CI/CD
  • Vespa Cloud’s security model
  • Vespa Cloud Auto-Scaling and pricing, optimizing the deployment cost by auto-scaling by resource usage
  • Interacting with Vespa Cloud – indexing your vector data and searching it at scale.

For only $3.36 per hour, your organization can store and search 5M 768-dimensional vectors,
deployed in Vespa Cloud production zones with high availability, supporting thousands
of inserts and queries per second.
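For a rough sense of scale, the arithmetic below estimates the raw size of the vector data and the cost of running for 30 days at the quoted hourly price. The 4 bytes per value (32-bit floats) and the 30-day month are assumptions, and the estimate ignores index structures and redundancy.

```python
HOURLY_PRICE_USD = 3.36      # from the text above
VECTORS = 5_000_000          # from the text above
DIMENSIONS = 768             # from the text above
BYTES_PER_VALUE = 4          # assumption: 32-bit floats

raw_bytes = VECTORS * DIMENSIONS * BYTES_PER_VALUE
monthly_usd = HOURLY_PRICE_USD * 24 * 30   # assumption: a 30-day month

print(f"raw vector data: {raw_bytes / 1e9:.2f} GB")
print(f"30-day cost: ${monthly_usd:,.2f}")
```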

Vespa Cloud Console. Snapshot while auto-scaling of stateless container cluster in progress.

Vespa Cloud Console. Concurrent real-time indexing of vectors while searching. Scale as needed to
meet any low latency serving use case.

With this vector search sample application, you have a great starting point for
implementing your vector search use case, without worrying about managing complex infrastructure.
See also other Vespa sample applications using vector search:

  • State-of-the-art text ranking:
    Vector search with AI-powered representations built on NLP Transformer models for candidate retrieval.
    The application has multi-vector representations for re-ranking, using Vespa’s phased retrieval and ranking
    pipelines. Furthermore, the application shows how embedding models, which map the text data to vector representation, can be
    deployed to Vespa for run-time inference during document and query processing.

  • State-of-the-art image search: AI-powered multi-modal vector representations
    to retrieve images for a text query.

  • State-of-the-art open-domain question answering: AI-powered vector representations
    to retrieve passages from Wikipedia, which are fed into an NLP reader model which extracts the answer. End-to-end represented using Vespa.

These are examples of applications built using AI-powered vector representations.

Vespa is available as a cloud service; see Vespa Cloud – getting started,
or self-serve Vespa – getting started.

Pre-trained models on Vespa Cloud

UPDATE 2023-06-06: use new syntax to configure Bert embedder.

Decorative image

“searching data using pre-trained models, unreal engine high quality render, 4k, glossy, vivid_colors, intricate_detail” by Stable Diffusion

Vespa can now convert text to embeddings for you automatically,
if you don’t want to bring your own vectors – but you still need to provide the ML models to use.

On Vespa Cloud we’re now making this even simpler, by also providing pre-trained models you can use for such tasks.
To take advantage of this, just pick the models you want from and refer
to them in your application by supplying a model-id where you would otherwise use path or url. For example:

<component id="myEmbedderId" type="bert-embedder">
    <transformer-model model-id="minilm-l6-v2"/>
    <tokenizer-vocab model-id="bert-base-uncased"/>
</component>

You can deploy this to Vespa Cloud to have these models do their job in your application –
no need to include a model in your application and wait for it to be uploaded.

You can use these models both in configurations provided by Vespa, as above, and in your own components,
with your own configurations – see the documentation for details.
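Before deploying, it can be handy to list which model ids a component refers to. The sketch below parses the component configuration shown earlier with the Python standard library. The helper is hypothetical and not part of any Vespa tooling.

```python
import xml.etree.ElementTree as ET

COMPONENT_XML = """
<component id="myEmbedderId" type="bert-embedder">
    <transformer-model model-id="minilm-l6-v2"/>
    <tokenizer-vocab model-id="bert-base-uncased"/>
</component>
"""


def referenced_model_ids(xml_text: str) -> dict:
    """Map each child element name to the model-id it references."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.get("model-id")
            for child in root if child.get("model-id") is not None}


print(referenced_model_ids(COMPONENT_XML))
```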

We’ll grow the set of models available over time, but the models we provide on Vespa Cloud will always be an
exclusive selection of models that we think it is beneficial to use in real applications,
both in terms of performance and model quality.

We hope this will empower many more teams to leverage modern AI in their production use cases.

Vespa Cloud on Google Cloud Platform

Kristian Aune

Head of Customer Success, Vespa

Photo by NASA
on Unsplash

Vespa Cloud has run in AWS zones since its start in 2019.
We are now happy to announce Vespa Cloud availability in Google Cloud Platform (GCP) zones!
To add a GCP zone to your application, simply add <region>gcp-us-central1-f</region>
to deployment.xml.

GCP availability makes it easier for users with their current workload in GCP to use Vespa Cloud.
Using a GCP zone can reduce data transfer costs, simplify operations, and cut latencies
by locating everything in the same location and cloud provider.

You can always find the currently supported zones in the zone reference.
Let us know if your workload requires additional zones;
expect a two-week ramp-up time.

GPU-accelerated ML inference in Vespa Cloud

Martin Polden

Principal Vespa Engineer

Photo by Sandro
Katalina on Unsplash

In machine learning, computing model inference is a good candidate for being
accelerated by special-purpose hardware, such as GPUs. Vespa supports
evaluating multiple types of machine-learned models in stateless
containers, for
example TensorFlow,
XGBoost and
LightGBM models. For many use cases,
using a GPU makes it possible to perform model inference with higher
performance, and at a lower price point, compared to using a general-purpose CPU.

Today we’re introducing support for GPU-accelerated ONNX model inference in
Vespa, together with support for GPU instances in Vespa Cloud!

Vespa Cloud

If you’re using Vespa Cloud, you can get started with
GPU instances in AWS zones by updating the <nodes> configuration in your
services.xml file. Our cloud platform will then provision and configure GPU
instances automatically, just like regular instances. See the services.xml
reference documentation for
syntax details and examples.

You can then configure which models to evaluate on the GPU in the
<model-evaluation> element, in services.xml. The GPU device number is
specified as part of the ONNX inference
for your model.

See our pricing page for details on GPU pricing.

Open source Vespa

GPUs are also supported when using open source Vespa. However, when running
Vespa inside a container, special configuration is required to pass GPU devices
to the container engine (e.g. Podman or Docker).

See the Vespa documentation
for a tutorial on how to configure GPUs in a Vespa container.

CORD-19 application benchmark

While implementing support for GPUs in Vespa, we wanted to see if we could find
a real-world use-case demonstrating that a GPU instance can be a better fit than
a CPU instance. We decided to run a benchmark of our CORD-19
application – a
Vespa application serving the COVID-19 Open Research Dataset. Its source code is
available on GitHub.

Our benchmark consisted of a query where the top 30 hits are re-ranked, using a
22M Transformer model using batch inference. The measured latency is end-to-end,
and includes retrieval and inference.
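The re-ranking step can be pictured as follows. This is a simplified sketch, not the CORD-19 application's code: `score_batch` stands in for a single batched model invocation, and the toy scorer in the usage example is purely hypothetical.

```python
def rerank_top_k(hits, score_batch, k=30):
    """Re-order the first k hits by model score, leaving the rest untouched.

    `score_batch` represents one batched model invocation: it takes a list
    of hits and returns one score per hit.
    """
    head, tail = hits[:k], hits[k:]
    scores = score_batch(head)  # one call for the whole batch
    ranked = [hit for _, hit in
              sorted(zip(scores, head), key=lambda pair: pair[0], reverse=True)]
    return ranked + tail


hits = [{"id": i, "text_len": n} for i, n in enumerate([3, 9, 5, 1])]
toy_scorer = lambda batch: [h["text_len"] for h in batch]  # hypothetical scorer
print([h["id"] for h in rerank_top_k(hits, toy_scorer, k=3)])
```

Scoring all top hits in one batch is what lets a GPU process them in parallel instead of one at a time.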

See our recent blog
post for
more information about using a Transformer language model to re-rank results.

We compared the following node configurations:

  • GPU: 4 vCPUs, 16GB memory, 125 GB disk, 1 GPU with 16GB memory (Vespa Cloud
    cost: $1.87/hour)
  • CPU: 16 vCPUs, 32GB memory, 125 GB disk (Vespa Cloud cost: $2.16/hour)


Instance | Clients | Re-rank (batch) | Avg. latency (ms) | 95 pct. latency | QPS | GPU util (%) | CPU util (%)


The GPU of the GPU instance was saturated at 4 clients, with an average
end-to-end request latency at 212 ms and a throughput of 18.8 QPS. The CPU
instance had a higher average latency, at 1011 ms with 4 clients and a
comparatively low throughput of 3.95 QPS.

So, in this example, the average latency is reduced by 79% when using a GPU,
while costing 13% less.
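These percentages follow directly from the figures quoted above. The short calculation below reproduces them and also derives a cost per million queries, which assumes each node runs continuously at the measured 4-client throughput.

```python
gpu = {"latency_ms": 212, "qps": 18.8, "usd_per_hour": 1.87}
cpu = {"latency_ms": 1011, "qps": 3.95, "usd_per_hour": 2.16}


def reduction_pct(baseline: float, improved: float) -> float:
    """Relative reduction from `baseline` to `improved`, in percent."""
    return (baseline - improved) / baseline * 100


def usd_per_million_queries(node: dict) -> float:
    """Cost of one million queries, assuming the node is kept at its measured QPS."""
    queries_per_hour = node["qps"] * 3600
    return node["usd_per_hour"] / queries_per_hour * 1_000_000


latency_cut = reduction_pct(cpu["latency_ms"], gpu["latency_ms"])
cost_cut = reduction_pct(cpu["usd_per_hour"], gpu["usd_per_hour"])
print(f"latency reduced by {latency_cut:.0f}%")
print(f"hourly cost reduced by {cost_cut:.0f}%")
print(f"GPU: ${usd_per_million_queries(gpu):.2f} per million queries")
print(f"CPU: ${usd_per_million_queries(cpu):.2f} per million queries")
```

Because the GPU instance also sustains a higher throughput, the difference per query is larger than the 13% difference in hourly price suggests.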

Private regional endpoints in Vespa Cloud

Jon M Venstad

Principal Vespa Engineer

Decorative image

Photo by Taylor Vick on Unsplash

Vespa Cloud exposes application container clusters through public endpoints, by default.
We’re happy to announce that we now also support private endpoints, in both AWS and GCP;
that is, our users can connect to their Vespa application, in Vespa Cloud, exclusively
through the private network of the cloud provider.

Why use private endpoints

Traffic to private, regional endpoints avoids the trip out onto the public internet,
and both latency and costs are reduced:

Public vs private routing

With private endpoints enabled, it is also possible to disable the public endpoints
of the application, for another layer of access control and security.

How to set up private endpoints in Vespa Cloud

To use this feature, clients must be located within the same region (or availability zone)
as the Vespa clusters they connect to.
Configuring and connecting to the application is done in a few simple steps:

Read more about AWS PrivateLink
or GCP Private Service Connect for further details.