The hardest problem in computing

What is the hardest problem in applied computing? My bet is on big data serving — computing over large data sets online. It requires solving four problems at once: distributed state management, low-latency computation under stochastic load, high availability, and distributed computation, and each of the four is known to be hard to solve on its own.


This is part three in a series of posts about big data serving. In the first
post we covered the
stages organizations advance through as they start putting their data to use.
In the second post,
we saw that moving computation online lets your systems make decisions with
up-to-date information and high precision at lower cost and complexity. In
this post we’ll look at the technological challenges that must be overcome to
achieve this, and see how they are solved in the first open source platform
available for this problem, vespa.ai.

Why big data serving is hard

In big data serving we’re computing over large data sets right at the moment
the computation is needed. A computation typically consists of finding the
right data items, evaluating machine-learned models over each item, and
organizing and aggregating the results.

As an example, consider a system making personalized recommendations of
movies, songs or articles, at the moment when a user shows up. To do this, you
need to find the items that should be considered for the situation at hand,
score each one using machine-learned personalization models, organize the resulting recommendations by categories and the like, and return the necessary data for display to the frontend or app.

Consider how to solve this problem efficiently. Can we solve it the usual way
server applications are implemented, by storing the items to be recommended in
a database and using a stateless middle tier to fetch data items and do the processing and inference? If we have a 10 Gbps (1.25 GB/sec) network and each item is 10 kB, we can evaluate at most 125,000 movies per second. To achieve an end-to-end response time which doesn’t annoy humans — about 400 ms — the backend must typically respond in about 100 ms. This means we can scale to evaluate at most 12,500 items in total per user request. Not good. But worse, this uses all the network capacity available! If we want to mostly return in 100 ms, it means we can only handle 2–3 users per second in total, even with
this low number of items considered. Any more and we need to replicate the
entire database.
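To put the arithmetic above in concrete terms, here is a small back-of-envelope calculation in Python (the 10 Gbps link, 10 kB item size and 100 ms backend budget are the assumptions from the text):

# Back-of-envelope: how many items can a stateless middle tier evaluate
# when every item has to cross the network from the database?
network_bytes_per_sec = 1.25e9    # 10 Gbps is roughly 1.25 GB/s
item_size_bytes = 10e3            # 10 kB per item
backend_budget_sec = 0.1          # ~100 ms of a ~400 ms end-to-end budget

items_per_sec = network_bytes_per_sec / item_size_bytes      # 125,000
items_per_request = items_per_sec * backend_budget_sec       # 12,500

print(f"{items_per_sec:,.0f} items per second over the wire")
print(f"{items_per_request:,.0f} items per 100 ms request")
# Serving even one such request saturates the link for 100 ms, so keeping
# utilization low enough for stable latency leaves room for only 2–3 requests/s.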

How can we do better? The obvious solution is to move the computation to the
data instead of the other way around. This is what systems such as Hadoop do
for batch computing, but those are not applicable here because they do not
provide low latency.

Therefore, to solve the big data serving problem, we need a new kind of system: one which both stores all the data we are working with and is able to compute locally where the data is stored, including making inferences in machine-learned models.

A new kind of system

That sounds like a lot of hard work, but we’re only getting started. To be
able to scale to more data than can be stored and computed over in time on a
single machine, we need distributed storage, and distributed algorithms
executing computation over multiple nodes in parallel to produce a response,
while still meeting latency requirements reliably. And since these systems are
online we need high availability, which translates to storing each piece of
data in multiple distributed copies, and automatically rebuilding copies
whenever nodes fail. To keep the data up to date with reality — one of the
goals of moving computation online — all of this must keep working while the
data is continuously modified. Further, to achieve high availability without
having redundant copies of the entire system, it must be possible to change
data schemas, logic, data layout, hardware and so on, without taking the
system offline at any time.

It’s not just solving all these problems, but solving them so that they work together. Clearly, this is many man-years of work, and years of calendar time
regardless of the amount of money you are willing to spend, as accumulating
the practical detailed experience with what does and does not work just takes
time.

Building all this is out of the question for most applications, which is why
the advantages of computing online are so often left on the table.

Web search to the rescue!

But are there any applications where the effort is economically justifiable?
Turns out there is one — web search.

Web search is the prototypical big data serving application: Computing over
big data sets (the web), including machine-learned model inference (relevance)
— and performing the computation offline is infeasible because there are just
too many queries to precompute them all. Furthermore, web search turned out to
be profitable enough to fund the kind of large multi-decade development
efforts required here.

The companies which funded their own web search technology have long since started applying it to other problems important to them, such as recommendation, personalization and targeting, but they have not made the technology publicly available.

Vespa.ai

Luckily there is an exception to this. My team creates
Vespa.ai — an engine solving the big data serving problem, available as open source. We first started working on this problem in the late nineties as the web search engine alltheweb.com, competing with other web search engines of the day, such as AltaVista. We were eventually
acquired by Yahoo! where we have been well funded ever since to work on
creating an ever better and more general big data serving engine. In 2017
we were able to open source the platform,
making big data serving a viable technology option for the wider world for the
first time.

About 700 man-years of development has gone into building Vespa so far, and
since we’ve been able to keep a stable core team of experts over many years,
most of that time has gone to improving it based on experience, instead of
continuously rebuilding pieces from scratch as developers turn over — an issue that often prevents progress on big and complex problems in the industry.

Many companies have started using it to solve well-known big data serving
problems such as search and recommendation, but we’re also increasingly seeing
people using it in ingenious ways to solve problems that seem completely
different on the surface, but where there is an advantage to computing over
data with low latency. Looks like this will lead to a lot of new and
interesting solutions as people wrap their heads around the possibilities.

Enough for now. In the next and last post in this series, we’re finally set up to dive into the details of how Vespa solves the core problems in big data serving.

Computing with tensors in Vespa


In computer science, a tensor is a data structure that generalizes scalars,
vectors, matrices, and higher-order structures into a single construct.

In Vespa we have introduced a tensor formalism that differs from what one finds in most modern machine learning frameworks. The main differences are:

  • Named dimensions
  • Unified sparse and dense tensor types
  • A small but powerful set of core functions

In this blog post, we’ll explore these and other aspects of tensors in Vespa.
We will also introduce the recently released tensor
playground, a tool to get familiar with and
explore tensors and tensor expressions in an interactive environment.

Tensors are multi-dimensional arrays of numeric values and can be viewed as a
generalization of vectors and matrices. A tensor with one dimension (a
first-order tensor) is a vector, a two-dimensional (second-order) tensor is a
matrix, and so on. A scalar value is a tensor without any dimensions.

In most frameworks, these dimensions have an implicit ordering. For instance, a matrix has two dimensions. A multiplication between two matrices A and B, with sizes (i,j) and (j,k), results in a matrix with size (i,k). Values along the column dimension of A and the row dimension of B are multiplied and summed, so these dimensions must have equal size. If not, for instance if B had size (k,j), B would need a transpose to (j,k) before multiplication.
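To see these implicit shape rules in action, here is a small NumPy illustration (not Vespa code):

import numpy as np

A = np.ones((2, 3))        # shape (i, j)
B = np.ones((3, 4))        # shape (j, k)
C = A @ B                  # inner dimensions match: result has shape (i, k) = (2, 4)

B2 = np.ones((4, 3))       # shape (k, j): inner dimensions no longer match
# A @ B2 would raise a ValueError; a transpose is needed first:
C2 = A @ B2.T              # (2, 3) @ (3, 4) -> (2, 4)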

Implicitly ordered dimensions pose a problem: what do the dimensions represent?
For instance, consider loading a (monochrome) image from a file into a matrix.
It is not immediately clear which dimension represents the height or width, as
that depends on the file format. Additionally, a file containing a set of color
images has two additional dimensions: the number of images, and the color
channels (e.g. RGB). One common ordering of these four dimensions is NCHW (number, channel, height, width). Sometimes the data is stored as NHWC instead, which is the default representation in TensorFlow. Interestingly, NCHW is the preferred format on GPUs.

Knowing which dimension is which is essential when performing various
operations; one generally wants to rotate images by working on the height and
width dimensions, not on the channel and height dimensions. However, after a
series of data manipulations, keeping track of dimensions can quickly become
challenging.
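For example, rotating a batch of images assumed to be in NHWC layout means operating on axes 1 and 2, knowledge the programmer has to carry in their head because the array itself cannot tell them. A NumPy sketch (not Vespa code):

import numpy as np

images = np.zeros((100, 64, 64, 3))           # assumed NHWC: Num x Height x Width x Channel
rotated = np.rot90(images, k=1, axes=(1, 2))  # rotate in the height/width plane

# Had the data actually been NCHW, axes=(1, 2) would silently rotate the
# channel/height plane instead; nothing in the array warns us.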

The tensor itself, however, does not carry any information describing its dimensions. As a result, practitioners often resort to comments such as the following to document the dimension order:

import numpy
import torch

# Num x Height x Width x Channel
tensor = torch.tensor(numpy.load("images.npy"))
tensor.shape
> torch.Size([100, 64, 64, 3])

This is obviously error-prone.

In Vespa, we have taken a different approach and introduced named dimensions.
This enables a strong tensor type system. An example of a tensor type in Vespa
which holds the same data as above:

tensor(num[100], height[64], width[64], channel[3])

This type system provides more formal documentation, which makes tensors easier for humans to work with. It also introduces the ability for tensors and tensor
operations to be semantically verified. This means we can perform static type
inference for all computation, catching errors at an early compile-time stage
rather than during runtime.
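As a toy illustration of what named dimensions buy us, here is a Python sketch of the idea (not Vespa's implementation): with every dimension carrying a name and a size, the output type of combining two tensors can be derived, and mismatches rejected, before any cell value is touched.

# Toy static type inference for combining two tensor types.
# A type is modeled as a dict of dimension name -> size.

def combine_type(a, b):
    for dim in a.keys() & b.keys():
        if a[dim] != b[dim]:
            raise TypeError(f"dimension '{dim}' has sizes {a[dim]} and {b[dim]}")
    return {**a, **b}   # the result type is the union of the input dimensions

images = {"num": 100, "height": 64, "width": 64, "channel": 3}
weights = {"channel": 3}
print(combine_type(images, weights))   # {'num': 100, 'height': 64, 'width': 64, 'channel': 3}
# combine_type(images, {"channel": 4}) would fail here, at "compile time",
# instead of producing garbage at runtime.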

Later in the post, we’ll see how this enables arbitrarily complex computation
with only a minimal, concise set of core operations.

Sparse and dense tensors

The tensor as a multi-dimensional array is often considered to be dense. This
means that all combinations of dimension indexes have a value, even if that
value is 0 or NaN. However, a tensor with many such values could be
represented more efficiently as a sparse tensor, where only the non-empty values
are defined. This can lead to considerable savings in space.
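As a rough illustration in plain Python (not Vespa's internal representation), the same logical data can be stored densely as a full array or sparsely as a map holding only the cells that exist:

# Dense: every combination of indexes has a cell, even if almost all are zero.
dense = [[0.0] * 1000 for _ in range(1000)]      # 1,000,000 cells stored
dense[42][137] = 3.5

# Sparse: only the non-empty cells are stored.
sparse = {("user42", "item137"): 3.5}            # 1 cell stored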

Unfortunately, the internal representation of sparse tensors in most frameworks
makes them incompatible with regular dense tensors. This leads to an entirely
separate set of functions operating on sparse and dense tensors, with functions
to convert between the two.

Vespa supports dense tensors, sparse tensors, and tensors that contain both dense and sparse dimensions, called mixed tensors. A dense tensor is a tensor containing only
“indexed” dimensions. An indexed dimension is indexed by integers, like an
array. The following is an example of a dense tensor containing two indexed
dimensions:

tensor(width[128], height[96])

A sparse tensor is conversely a tensor consisting of only “mapped” dimensions. A
mapped dimension is indexed by strings, like a map or hash table. An example of
a sparse tensor containing two mapped dimensions:

tensor(model_id{}, feature{})

A mixed tensor can contain both types:

tensor(model_version_id{}, first_layer_weights[64], second_layer_weights[128])

This particular example effectively works as a lookup table. By providing the
model_version_id, one can extract a dense tensor (called a dense subspace).
This can be useful, for instance, when a single tensor should hold the weights for multiple versions of a model, e.g., a neural network.
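Conceptually, the mapped dimension acts as a key into a set of dense subspaces, roughly like the following Python sketch (simplified to a single dense dimension; not Vespa's storage format):

import numpy as np

# Roughly: tensor(model_version_id{}, first_layer_weights[64])
weights_by_version = {
    "v1": np.zeros(64),      # dense subspace for model version "v1"
    "v2": np.ones(64),       # dense subspace for model version "v2"
}

first_layer = weights_by_version["v2"]   # look up the weights of one model version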

All in all, this enables a flexible data representation.

Tensor operations

Most frameworks that operate on and manipulate tensors have an extensive library
of functions. For instance, TensorFlow has hundreds of different operations
organized in various groups according to their function. Examples are
constructors, general manipulation, linear algebra, neural networks, and so on.
Some operations work element-wise, some on entire tensors, and some combine
tensors.

All operations have assumptions on their input tensors. For instance, a matrix
multiplication operation requires two matrix tensors with the same column/row
size. Likewise, a dot product operation requires two vectors with the same
vector length. These two operations are essentially the same: the sum of the
products of elements along a dimension. Yet two different operations are
provided: matmul and dotprod.
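In NumPy terms the two are the same computation, a sum of products over a shared dimension; only the surrounding shapes differ (an illustration, not Vespa code):

import numpy as np

a, b = np.arange(3.0), np.arange(3.0)
A, B = np.arange(6.0).reshape(2, 3), np.arange(12.0).reshape(3, 4)

np.dot(a, b)                  # dot product: multiply and sum over the single shared dimension
np.matmul(A, B)               # matrix product: multiply and sum over the shared dimension j

# The same thing written uniformly with einsum:
np.einsum("i,i->", a, b)      # equals np.dot(a, b)
np.einsum("ij,jk->ik", A, B)  # equals np.matmul(A, B)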

A two-dimensional convolution requires an input of four dimensions: the batch size, color channels, height and width. This ordering is called NCHW, which is
the most efficient for GPU processing. However, the default in TensorFlow is
NHWC. Therefore, the operations working on these tensors need to know the
format, which must be provided by the developer as the format is not part of the
tensor.

Another example is the different operations based on tensor type. For instance,
there are two different operations for concat: one for a dense tensor and one
for sparse.

The inevitable consequence is that most frameworks get an extensive library of
operations. The problem with such large libraries is that they hurt interoperability and maintainability.

Assume you train a model on one framework and want to run inference on another.
The common way to represent the model is as a computational graph, where each
vertex is an operation. Evaluating a given graph on another system requires the
implementation of all operations in the graph. To guarantee general
interoperability between two frameworks, all operations in the original
framework must have an implementation. This becomes increasingly infeasible as frameworks grow to have hundreds or thousands of operations.

In addition, many developers add custom operations. This further decreases
interoperability by requiring binary compatibility.

The tensor framework in Vespa, on the other hand, takes a different approach.
Tensors with named dimensions allow for a strong type system with well-founded
semantics. This makes it possible to define a small set of foundational
mathematical operations in which all other computations can be expressed.

Unary operations work on single tensors; examples include filtering and
reduction. Binary operations combine two input tensors to an output tensor, for
instance, joining, extending, or calculating some relation. By combining these operations, one can express complex computations.

Vespa provides just 8 core operations to transform tensors:

  • tensor – construct a new tensor from a given expression
  • map – apply a function to all values
  • reduce – aggregate values along a dimension
  • rename – rename a dimension
  • slice – return a tensor with all cells matching a partial address
  • join – the natural join between two input tensors with a custom expression applied
  • merge – merge two input tensors with a custom expression
  • concat – concatenate two tensors along a given dimension

To aid developers, Vespa additionally provides higher-level, non-core
operations, which are all implemented in terms of these core operations.
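To make this concrete, here is a deliberately naive Python sketch (a teaching aid, not Vespa's implementation) of two of the core operations, join and reduce, over tensors modeled as maps from named-dimension addresses to values. A matrix product, a typical higher-level operation, then falls out as a join followed by a reduce over the shared dimension:

# A tensor is modeled as {address: value}, where an address is a frozenset
# of (dimension name, label) pairs.

def join(a, b, fn):
    out = {}
    for addr_a, va in a.items():
        for addr_b, vb in b.items():
            da, db = dict(addr_a), dict(addr_b)
            if all(da[d] == db[d] for d in da.keys() & db.keys()):
                out[addr_a | addr_b] = fn(va, vb)   # natural join on matching labels
    return out

def reduce_sum(a, dim):
    out = {}
    for addr, v in a.items():
        rest = frozenset((d, l) for d, l in addr if d != dim)
        out[rest] = out.get(rest, 0.0) + v          # sum over the reduced dimension
    return out

# A 2x2 by 2x1 matrix product expressed as join (multiply) + reduce (sum over j):
A = {frozenset({("i", 0), ("j", 0)}): 1.0, frozenset({("i", 0), ("j", 1)}): 2.0,
     frozenset({("i", 1), ("j", 0)}): 3.0, frozenset({("i", 1), ("j", 1)}): 4.0}
B = {frozenset({("j", 0), ("k", 0)}): 5.0, frozenset({("j", 1), ("k", 0)}): 6.0}

C = reduce_sum(join(A, B, lambda x, y: x * y), "j")
# C has dimensions i and k: {("i",0),("k",0)}: 17.0 and {("i",1),("k",0)}: 39.0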

This approach enables interoperability as implementing this small set is all
that is needed to realize complete support for tensor computation. Furthermore,
it makes optimization work more efficient. This is because low-level
optimizations are only required on this set of functions. Higher-level
optimizations can work on whichever chunks of these operations are beneficial,
independent of any chunking into higher-level functions humans happen to find
meaningful.

In this section, we’ll describe Vespa’s full tensor formalism, including
tensors and tensor types, and the core set of tensor operations.

Tensor types

A tensor type consists of a value type and a set of dimensions. The value type is a numeric data type: double, float, int8, bfloat16, etc. The default value type is double.

A dimension consists of a name and a type. The type can be either indexed or mapped, and, if indexed, can optionally include a size. The number of dimensions in the tensor is called its order. A tensor without dimensions (0-order) is a single scalar value. A tensor with one dimension (first-order) is a vector. A tensor with two dimensions (second-order) is a matrix.

Note that this notion of dimensions is not to be confused with the concept of dimensions in vector space, where a vector with 3 elements can represent a vector in a 3-dimensional space.

A value in a dimension is addressed by a label. For mapped dimensions, this label is a string. For indexed dimensions this label is an integer index. Tensors with mapped dimensions only hold values for the labels that exist. Indexed dimensions must have values for all indices.

Some examples:

  • tensor() – a scalar value (double)
  • tensor<float>(x[3]) – An indexed vector with 3 floats
  • tensor(x[2], y[3]) – An indexed matrix of size 2-by-3
  • tensor(name{}) – A (sparse) map of named values
  • tensor<int8>(name{}, x[2]) – A map of named vectors of int8

Tensors

A tensor is simply a tensor type and a set of values. Values are held in cells, each identified by an address consisting of a label in each dimension. A tensor has a string representation defined by the type followed
by the cell values.

Some examples:

  • tensor():3.0 – a scalar value (double)
  • tensor<float>(x[3]):[1.0, 2.0, 3.0] – An indexed vector with 3 floats
  • tensor(x[2], y[3]):[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]] – An indexed matrix of size 2-by-3
  • tensor(name{}):{ {name:foo}:2, {name:bar}:5 } – A (sparse) map of named values
  • tensor(name{}):{ foo:2, bar:5 } – Same as above with inferred dimension (single sparse)
  • tensor<int8>(name{}, x[2]):{foo:[1,2], bar:[3,4]} – A map of named vectors of int8

Tensor functions

Vespa has a small core set of just 8 tensor functions:

  • Creational: tensor
  • Unary: map, reduce, rename, slice
  • Binary: join, merge, concat

These can be combined and used to express potentially complex computation, as
some take general lambda expressions as arguments. Please refer to the tensor function reference for detailed descriptions of these functions.

We’ll provide some examples to demonstrate the expressive power of join and
reduce in the following. These examples are also provided with a link to the
tensor playground. In this interactive
environment, you can experiment with tensor expressions. Feel free to take any of these examples and play around with them.

Vector outer product

Given two tensors representing vectors, A and B:

A: tensor(x[3]):[1,2,3]
B: tensor(y[3]):[4,5,6]

Notice that these tensors have different dimension names. We can multiply
these two tensors together, A * B, and the result is:

tensor(x[3],y[3]):[[4.0, 5.0, 6.0], [8.0, 10.0, 12.0], [12.0, 15.0, 18.0]]

This tensor has type tensor(x[3],y[3]). To see what is happening
here, note that the expression A * B is a convenience form of the underlying
expression:

join(A, B, f(a,b)(a * b))

The join function is the most commonly used function for combining two input tensors.
As input, it takes two tensors and a lambda function (here f(a,b)(a * b)) to
define how to combine matching cells. The resulting tensor is the natural join
between the input tensors, and the cell values are defined by the lambda. The
resulting type is the union of dimensions from