Parent-child in Vespa | Vespa Blog

Parent-child relationships let you model hierarchical relations in your data. This blog post talks about why and how we added this feature to Vespa, and how you can use it in your own applications. We’ll show some performance numbers and discuss practical considerations.

Introduction

The shortest possible background

Traditional relational databases let you perform joins between tables. Joins enable efficient normalization of data through foreign keys, which means any distinct piece of information can be stored in one place and then referred to (often transitively), rather than to be duplicated everywhere it might be needed. This makes relational databases an excellent fit for a great number of applications.

However, if we require scalable, real-time data processing with millisecond latency our options become more limited. To see why, and to investigate how parent-child can help us, we’ll consider a hypothetical use case.

A grand business idea

Let’s assume we’re building a disruptive startup for serving the cutest possible cat picture advertisements imaginable. Advertisers will run multiple campaigns, each with their own set of ads. Since they will (of course) pay us for this privilege, campaigns will have an associated budget which we have to manage at serving time. In particular, we don’t want to serve ads for a campaign that has spent all its money, as that would be free advertising. We must also ensure that campaign budgets are frequently updated when their ads have been served.

Our initial, relational data model might look like this:

Advertiser:
    id: (primary key)
    company_name: string
    contact_person_email: string

Campaign:
    id: (primary key)
    advertiser_id: (foreign key to advertiser.id)
    name: string
    budget: int

Ad:
    id: (primary key)
    campaign_id: (foreign key to campaign.id)
    cuteness: float
    cat_picture_url: string

This data normalization lets us easily update the budgets for all ads in a single operation, which is important since we don’t want to serve ads for which there is no budget. We can also get the advertiser name for all individual ads transitively via their campaign.

Scaling our expectations

Since we’re expecting our startup to rapidly grow to a massive size, we want to make sure we can scale from day one. As the number of ad queries grow, we ideally want scaling up to be as simple as adding more server capacity.

Unfortunately, scaling joins beyond a single server is a significant design and engineering challenge. As a consequence, most of the new data stores released in the past decade have been of the “NoSQL” variant (which might also be called “non-relational databases”). NoSQL’s horizontal scalability is usually achieved by requiring an application developer to explicitly de-normalize all data. This removes the need for joins altogether. For our use case, we have to store budget and advertiser name across multiple document types and instances (duplicated data here marked with bold text):

Advertiser:
    id: (primary key)
    company_name: string
    contact_person_email: string

Campaign:
    id: (primary key)
    advertiser_company_name : string
    name: string
    budget: int

Ad:
    id: (primary key)
    campaign_budget : int
    campaign_advertiser_company_name : string
    cuteness: float
    cat_picture_url: string

Now we can scale horizontally for queries, but updating the budget of a campaign requires updating all its ads. This turns an otherwise O(1) operation into O(n), and we likely have to implement this update logic ourselves as part of our application. We’ll be expecting thousands of budget updates to our cat ad campaigns per second. Multiplying this by an unknown number is likely to overload our servers or lose us money. Or both at the same time.

A pragmatic middle ground

In the middle between these two extremes of “arbitrary joins” and “no joins at all” we have parent-child relationships. These enable a subset of join functionality, but with enough restrictions that they can be implemented efficiently at scale. One core restriction is that your data relationships must be possible to represented as a directed, acyclic graph (DAG).

As it happens, this is the case with our cat picture advertisement use case; Advertiser is a parent to 0-n _Campaign_s, each of which in turn is a parent to 0-n _Ad_s. Being able to represent this natively in our application would get us functionally very close to the original, relational schema.

We’ll see very shortly how this can be directly mapped to Vespa’s parent-child feature support.

Parent-child support in Vespa

Creating the data model

Vespa’s fundamental data model is that of documents. Each document belongs to a particular schema and has a user-provided unique identifier. Such a schema is known as a document type and is specified in a search definition file. A document may have an arbitrary number of fields of different types. Some of these may be indexed, some may be kept in memory, all depending on the schema. A Vespa application may contain many document types.

Here’s how the Vespa equivalent of the above denormalized schema could look (again bolding where we’re duplicating information):

advertiser.sd:
    search advertiser {
        document advertiser {
            field company_name type string {
                indexing: attribute | summary
            }
            field contact_person_email type string {
                indexing: summary
            }
        }
    }

campaign.sd:
    search campaign {
        document campaign {
            field advertiser_company_name type string {
                indexing: attribute | summary
            }
            field name type string {
                indexing: attribute | summary
            }
            field budget type int {
                indexing: attribute | summary
            }
        }
    }

ad.sd:
    search ad {
        document ad {
            field campaign_budget type int {
                indexing: attribute | summary attribute: fast-search
            }
            field campaign_advertiser_company_name type string {
                indexing: attribute | summary
            }
            field cuteness type float {
                indexing: attribute | summary attribute: fast-search
            }
            field cat_picture_url type string {
                indexing: attribute | summary
            }
        }
    }

Note that since all documents in Vespa must already have a unique ID, we do not need to model the primary key IDs explicitly.

We’ll now see how little it takes to change this to its normalized equivalent by using parent-child.

Parent-child support adds two new types of declared fields to Vespa; references and imported fields.

A reference field contains the unique identifier of a parent document of a given document type. It is analogous to a foreign key in a relational database, or a pointer in Java/C++. A document may contain many reference fields, with each potentially referencing entirely different documents.

We want each ad to reference its parent campaign, so we add the following to ad.sd:

    field campaign_ref type reference<campaign> {
        indexing: attribute
    }

We also add a reference from a campaign to its advertiser in campaign.sd:

    field advertiser_ref type reference<advertiser> {
        indexing: attribute
    }

Since a reference just points to a particular document, it cannot be directly used in queries. Instead, imported fields are used to access a particular field within a referenced document. Imported fields are virtual; they do not take up any space in the document itself and they cannot be directly written to by put or update operations.

Add to search campaign in campaign.sd:

    import field advertiser_ref.company_name as campaign_company_name {}

Add to search ad in ad.sd:

    import field campaign_ref.budget as ad_campaign_budget {}

You can import a parent field which itself is an imported field. This enables transitive field lookups.

Add to search ad in ad.sd:

    import field campaign_ref.campaign_company_name as ad_campaign_company_name {}

After removing the now redundant fields, our normalized schema looks like this:

advertiser.sd:
    search advertiser {
        document advertiser {
            field company_name type string {
                indexing: attribute | summary
            }
            field contact_person_email type string {
                indexing: summary
            }
        }
    }

campaign.sd:
    search campaign {
        document campaign {
            field advertiser_ref type reference<advertiser> {
                indexing: attribute
            }
            field name type string {
                indexing: attribute | summary
            }
            field budget type int {
                indexing: attribute | summary
            }
        }
        import field advertiser_ref.company_name as campaign_company_name {}
    }

ad.sd:
    search ad {
        document ad {
            field campaign_ref type reference<campaign> {
                indexing: attribute
            }
            field cuteness type float {
                indexing: attribute | summary attribute: fast-search
            }
            field cat_picture_url type string {
                indexing: attribute | summary
            }
        }
        import field campaign_ref.budget as ad_campaign_budget {}
        import field campaign_ref.campaign_company_name as ad_campaign_company_name {}
    }

Feeding with references

When feeding documents to Vespa, references are assigned like any other string field:

[
    {
        "put": "id:test:advertiser::acme",
        "fields": {
            "company_name": "ACME Inc. cats and rocket equipment",
            "contact_person_email": "[email protected]"
        }
    },
    {
        "put": "id:acme:campaign::catnip",
        "fields": {
            "advertiser_ref": "id:test:advertiser::acme",
            "name": "Most excellent catnip deals",
            "budget": 500
        }
    },
    {
        "put": "id:acme:ad::1",
        "fields": {
            "campaign_ref": "id:acme:campaign::catnip",
            "cuteness": 100.0,
            "cat_picture_url": "/acme/super_cute.jpg"
        }
    }
]

We can efficiently update the budget of a single campaign, immediately affecting all its child ads:

[
    {
        "update": "id:test:campaign::catnip",
        "fields": {
            "budget": {
                "assign": 450
            }
        }
    }
]

Querying using imported fields

You can use imported fields in queries as if they were a regular field. Here are some examples using YQL:

Find all ads that still have a budget left in their campaign:

select * from ad where ad_campaign_budget > 0

Find all ads that have less than $500 left in their budget and belong to an advertiser whose company name contains the word “ACME”:

select * from ad where ad_campaign_budget < 500 and ad_campaign_company_name contains "ACME"

Note that imported fields are not part of the default document summary, so you must add them explicitly to a separate summary if you want their values returned as part of a query result:

document-summary my_ad_summary {
    summary ad_campaign_budget type int {}
    summary ad_campaign_company_name type string {}
    summary cuteness type float {}
    summary cat_picture_url type string {}
}

Add summary=my_ad_summary as a query HTTP request parameter to use it.

Global documents

One of the primary reasons why distributed, generalized joins are so hard to do well efficiently is that performing a join on node A might require looking at data that is only found on node B (or node C, or D…). Vespa gets around this problem by requiring that all documents that may be joined against are always present on every single node. This is achieved by marking parent documents as global in the services.xml declaration. Global documents are automatically distributed to all nodes in the cluster. In our use case, both advertisers and campaigns are used as parents:

<documents>
    <document mode="index" type="advertiser" global="true"/>
    <document mode="index" type="campaign" global="true"/>
    <document mode="index" type="ad"/>
</documents>

You cannot deploy an application containing reference fields pointing to non-global document types. Vespa verifies this at deployment time.

Performance

Feeding of campaign budget updates

Scenario: feed 2 million ad documents 4 times to a content cluster with one node, each time with a different ratio between ads and parent campaigns. Treat 1:1 as baseline (i.e. 2 million ads, 2 million campaigns). Measure relative speedup as the ratio of how many fewer campaigns must be fed to update the budget in all ads.

Results

  • 1 ad per campaign: 35000 campaign puts/second
  • 10 ads per campaign: 29000 campaign puts/second, 8.2x relative speedup
  • 100 ads per campaign: 19000 campaign puts/second, 54x relative speedup
  • 1000 ads percampaign: 6200 campaign puts/second, 177x relative speedup

Note that there is some cost associated with higher fan-outs due to internal management of parent-child mappings, so the speedup is not linear with the fan-out.

Searching on ads based on campaign budgets

Scenario: we want to search for all ads having a specific budget value. First measure with all ad budgets denormalized, then using an imported budget field from the ads’ referenced campaign documents. As with the feeding benchmark, we’ll use 1, 10, 100 and 1000 ads per campaign with a total of 2 million ads combined across all campaigns. Measure average latency over 5 runs.

In each case, the budget attribute is declared as fast-search, which means it has a B-tree index. This allows for efficent value and range searches.

Results

  • 1 ad per campaign: denormalized 0.742 ms, imported 0.818 ms, 10.2% slowdown
  • 10 ads per campaign: denormalized 0.986 ms, imported 1.186 ms, 20.2% slowdown
  • 100 ads per campaign: denormalized 0.830 ms, imported 0.958 ms, 15.4% slowdown
  • 1000 ads per campaign: denormalized 0.936 ms, imported 0.922 ms, 1.5% speedup

The observed speedup for the biggest fan-out is likely an artifact of measurement noise.

We can see that although there is generally some

Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface

In last month’s Vespa update, we mentioned ONNX integration, precise transaction log pruning, grouping on maps, and improvements to streaming search performance. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to feedback and contributions from the community, Vespa continues to evolve.

This month, we’re excited to share the following updates with you:

Parent/Child

We’ve added support for multiple levels of parent-child document references. Documents with references to parent documents can now import fields, with minimal impact on performance. This simplifies updates to parent data as no denormalization is needed and supports use cases with many-to-many relationships, like Product Search. Read more in parent-child.

File URL references in application packages

Serving nodes sometimes require data files which are so large that it doesn’t make sense for them to be stored and deployed in the application package. Such files can now be included in application packages by using the URL reference. When the application is redeployed, the files are automatically downloaded and injected into the components who depend on them.

Batch feed in java client

The new SyncFeedClient provides a simplified API for feeding batches of data with high performance using the Java HTTP client. This is convenient when feeding from systems without full streaming support such as Kafka and DynamoDB.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to see.

Parent-child joins and tensors for content recommendation

Aaron Nagao

Aaron Nagao

Senior Software Engineer, Verizon Media


A real-world application of parent-child joins and tensor functions to model topic popularity for content recommendation.

Every time a user visits Yahoo.com, we build the best news stream for that user from tens of thousands of candidate articles on a variety of topics such as sports, finance, and entertainment.

To rank these articles, a key feature is the topic’s click-through rate (CTR), a real-time popularity metric aggregated over all articles of that topic. There are two reasons why topic CTR is an important feature for machine-learned ranking:

  1. It helps the cold-start problem by allowing the model to use the popularity of other articles of that topic as a prior for new articles.
  2. It reduces a high-dimensional categorical feature like a topic to a single numerical feature (its mean CTR)—this mean encoding is needed for a model to incorporate topics.

In this blog post, we describe how we model topic CTRs using two modern features of Vespa: parent-child joins and tensor functions. Vespa, the open source big data serving engine created by Yahoo, powers all of our content recommendations in real-time for millions of users at scale.

To model ranking features in Vespa, most features like an article’s length or age are simple properties of the document which are stored within the Vespa document. In contrast, a topic CTR is shared by all documents of that topic.

A simple approach would be to feed the topic CTR to every document about that topic. But this denormalized approach both duplicates data and complicates CTR updates—the updater must first query for all documents of that topic in order to feed the updated CTR to all of them.

The standard solution for this from relational databases is joins through foreign keys, and such joins are supported in Vespa as parent-child relationships.

We model our relationships by storing all topic CTRs in a single “global” document, and each of our articles has a foreign key that points to that global document, as shown in the Figure. This way, when updating CTRs we only need to update the global document and not each individual article, avoiding data duplication.

Global document

When ranking articles, Vespa uses the foreign key to do a real-time join between each article and the global document to retrieve the topic CTRs. Vespa co-locates the global document on the article content node to reduce the latency of this real-time join, also represented in the Figure.

Vespa seamlessly allows us to write updated CTRs to the global document in real-time while concurrently reading the CTRs for use in article ranking.

The schema definition uses reference for this parent-child relationship:

document globalscores {
    field topic_ctrs type tensor<float>(topic{}) {
        indexing: attribute
        attribute: fast-search
    }
}

document article {
    field doc_topics type tensor<float>(topic{}) {
        indexing: attribute | summary
    }

    field ptr type reference<globalscores> {
        indexing: attribute
    }
}

Now that our data is modeled in Vespa, we want to use the CTRs as features in our ranking model.

One challenge is that some articles have 1 topic (so just 1 topic CTR) while some articles have 5 topics (with 5 CTRs). Since machine learning generally requires each article to have the same number of features, we take the average topic CTR and the maximum topic CTR, aggregating a differing number of CTRs into a fixed number of summary statistics.

We compute the average and maximum using Vespa’s Tensor API. A tensor is a vector with labeled dimensions, and Vespa provides API functions like sum, argmax, and * (elementwise multiplication) that operate on input tensors to compute features.

The rank profile code that computes these ranking features is:

import field ptr.topic_ctrs as global_topic_ctrs {}

rank-profile yahoo inherits default {
    # helper functions
    function AVG_CTR(weights, ctrs) {
        # weighted average CTR
        expression: sum(weights * ctrs) / sum(weights)
    }
    function MAX_CTR(weights, ctrs) {
        # weighted max, then use unweighted CTR
        expression: sum( argmax(weights * ctrs) * ctrs )
    }

    # ranking features
    function TOPIC_AVG_CTR() {
        expression: AVG_CTR(attribute(doc_topics), attribute(global_topic_ctrs))
    }
    function TOPIC_MAX_CTR() {
        expression: MAX_CTR(attribute(doc_topics), attribute(global_topic_ctrs))
    }
}

Rank Profile Walkthrough

Suppose an article is about two topics each with an associated weight:

attribute(doc_topics) = <'US': 0.7, 'Sports': 0.9>

And in the global document, the real-time CTR values for all topics are:

topic_ctrs = <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>

The first import line does a real-time join between the article’s foreign key and the global document, so that the article ranking functions can reference the global CTRs as if they were stored with each article:

global_topic_ctrs = ptr.topic_ctrs = <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>

For both of our features TOPIC_AVG_CTR and TOPIC_MAX_CTR, the first step uses the * elementwise multiplication operator in Vespa’s Tensor API, which effectively does a lookup of the article’s topics in the global tensor:

weights * ctrs
= attribute(doc_pub) * attribute(global_pub_ctrs)
= <'US': 0.7, 'Sports': 0.9> * <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>
= <'US': 0.056, 'Sports': 0.018>

Then TOPIC_AVG_CTR computes a weighted average CTR by summing and normalizing by the weights:

TOPIC_AVG_CTR
= sum(weights * ctrs) / sum(weights)
= sum(<'US': 0.056, 'Sports': 0.018>) / sum(<'US': 0.7, 'Sports': 0.9>)
= 0.074 / 1.6
= 0.046

(Note this weighted average of 0.046 is closer to the Sports CTR=0.02 than the US CTR=0.08 because Sports had a higher topic weight.)

And TOPIC_MAX_CTR finds the CTR of the entity with the maximum weighted CTR:

argmax(weights * ctrs)
= argmax(<'US': 0.056, 'Sports': 0.018>)
= <'US': 1>

argmax(weights * ctrs) * ctrs
= <'US': 1> * <'US': 0.08, 'Sports': 0.02, 'Finance': 0.05, ...>
= <'US': 0.08>

(With a final sum to convert the mapped tensor to the scalar 0.08.)

These two examples demonstrate the expressiveness of Vespa’s Tensor API to compute useful features.

Ultimately TOPIC_AVG_CTR and TOPIC_MAX_CTR are two features computed in real-time for every article during ranking. These features can then be added to any machine-learned ranking model—Vespa supports gradient-boosted trees from XGBoost and LightGBM, and neural networks in TensorFlow and ONNX formats.

While we could have stored these topic CTRs in some external store, we would have essentially needed to fetch all of them when ranking as every topic is represented in our pool of articles. Instead, Vespa moves the computation to the data by co-locating the global feature store on the article content node, avoiding network load and reducing system complexity.

We also keep the global CTRs in-memory for even faster performance with the aptly named attribute: fast-search. Ultimately, the Vespa team optimized our use case of parent-child joins and tensors to rank 10,000 articles in just 17.5 milliseconds!

This blog post described just two of the many ranking features used for real-time content recommendation at Verizon Media, all powered by Vespa.