Summer internship at Vespa | Vespa Blog

Erlend Solbakken Nikolaisen

Intern I, Summer of 2022


Photo by Arnold Francisca on Unsplash

Through the summer as an intern at Vespa I got the opportunity to learn new technologies and experience what it is like to work for a software company. At the start of my internship I was introduced to the company and told about the projects I would be working on.

During the internship I worked on two projects. The first was to recreate the Vespa Query Builder using React, and the second was to create a solution for visualizing the traces produced by the Vespa engine. Both projects have since been integrated into a client application.

Query Builder

The Query Builder is a tool for creating Vespa queries. It is a website that helps users build queries by letting them select query options from drop-down menus. The old version was an HTML page with some old, hard-to-read JavaScript and a backend handler written in Java, which made it a standalone tool that was hard to integrate with others.

My assignment was to recreate the Query Builder using React, making it a pure JavaScript application. Before starting I spent a day learning about React, then dove into the deep end and began building the application, learning more as I went. The old JavaScript code was difficult to read and did not merge well with React, so much of the functionality had to be recreated from scratch.

The finished application looks very much like the old one, but since it is built in React it is much simpler to embed in other React applications. I also updated the UI somewhat, adding tooltips to buttons to make the application easier to use.

Trace Visualizer

The Trace Visualizer is meant to make it easier to identify bottlenecks in queries. The idea is to remove the need to comb through the raw trace in search of where the problems might be. The solution consists of an application to input and transform the Vespa trace, and the third-party tool Jaeger to visualize the transformed trace.

I started by comparing several existing solutions for visualizing traces and chose Jaeger because it was the simplest to use and the best fit for the use case. Because Jaeger does not support the traces created by Vespa, the traces had to be transformed into a format Jaeger could read. One of the formats Jaeger supports, and the one I used, is similar to OpenTelemetry's trace definition, with spans being the smallest unit of work (more information here: OpenTelemetry tracing).

The first iteration of the transformation tool could handle simple traces from Vespa and turn them into traces that could be imported into Jaeger. The hardest part was figuring out how best to traverse the Vespa trace to find the information Jaeger would need. Just when I thought I had found every special case that needed separate handling, the Vespa trace seemed to produce another one. Vespa traces can also be much more complicated than the simple ones, and the first iteration could not handle those.

Vespa trace:
{
  "trace": {
    "children": [
      ...
      {
        "timestamp": 4,
        "message": "Invoke searcher ..."
      },
      {
        "timestamp": 5,
        "children": [
          {
            "timestamp": 5,
            "message": "Invoke searcher ..."
          },
          {
            "timestamp": 6,
            "message": "Return searcher ..."
          }
        ]
      },
      {
        "timestamp": 8,
        "message": "Return searcher ..."
      }
      ...
      {
        "start_time": "2022-07-28 13:49:47.816 UTC",
        "trace": [
          {
            "traces": [
              {
                "timestamp_ms": 0.051936,
                "event": "Start query setup"
              }
              ...
              {
                "timestamp_ms": 1.045379,
                "event": "Complete query setup"
              }
            ]
          }
        ]
      }
    ]
  }
}
Transformed trace:
{
  "data": [
    {
      "traceID": "db187cb870b90c0ad8cc235fed504c16",
      "spans": [
        {
          "traceID": "db187cb870b90c0ad8cc235fed504c16",
          "spanID": "8182dc73c8bd68ed",
          "operationName": "default",
          "references": [],
          "startTime": 1656923873159000,
          "duration": 2000,
          "tags": [],
          "logs": [],
          "processID": "p0"
        },
        {
          "traceID": "db187cb870b90c0ad8cc235fed504c16",
          "spanID": "52bc94897ad844b6",
          "operationName": "Invoke searcher ...",
          "references": [
            {
              "refType": "CHILD_OF",
              "traceID": "db187cb870b90c0ad8cc235fed504c16",
              "spanID": "8182dc73c8bd68ed"
            }
          ],
          "startTime": 1656923873159000,
          "duration": 1,
          "tags": [],
          "logs": [],
          "processID": "p1"
        },
        ...
        {
          "traceID": "db187cb870b90c0ad8cc235fed504c16",
          "spanID": "d94b2b388d92864d",
          "operationName": "Return searcher ...",
          "references": [
            {
              "refType": "CHILD_OF",
              "traceID": "db187cb870b90c0ad8cc235fed504c16",
              "spanID": "d671eeb306d4784b"
            }
          ],
          "startTime": 1656923873159000,
          "duration": 100,
          "tags": [],
          "logs": [],
          "processID": "p7"
        }
      ]
    }
  ]
}

To make the tool capable of handling the more complicated traces, I first refactored much of the code to make it easier to work with, and then created a recursive function to handle the more complex structures the traces could have. I also implemented better naming of the spans in the transformed trace, to make it easier to see what is happening in each span. By running a regex over the description of the work a span performs, it is possible to find the process the work is being done on and use that as the name of the span.
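The core of the transformation can be sketched in Python. This is a simplified illustration, not the actual client code: the function names, the exact message format matched by the regex, and the fixed one-unit duration are all assumptions made for the sketch.

```python
import re
import uuid

def operation_name(message):
    # Derive a readable span name: pull the searcher class out of messages
    # shaped like "Invoke searcher 'com.yahoo.example.MySearcher'"
    # (illustrative pattern; the real messages vary more).
    match = re.search(r"(?:Invoke|Return) searcher '([^']+)'", message)
    if match:
        return match.group(1).rsplit(".", 1)[-1]  # class name only
    return message

def transform(node, trace_id, parent_id, base_time_us, spans):
    """Recursively walk a Vespa trace node and append Jaeger-format spans.
    Vespa timestamps here are milliseconds relative to the query start;
    Jaeger expects absolute microseconds, hence the conversion."""
    span_id = uuid.uuid4().hex[:16]  # Jaeger span IDs are 16 hex chars
    references = []
    if parent_id is not None:
        references.append({"refType": "CHILD_OF",
                           "traceID": trace_id, "spanID": parent_id})
    spans.append({
        "traceID": trace_id,
        "spanID": span_id,
        "operationName": operation_name(node.get("message", "anonymous")),
        "references": references,
        "startTime": base_time_us + node.get("timestamp", 0) * 1000,
        "duration": 1,  # placeholder; a real tool derives this from child timestamps
        "tags": [],
        "logs": [],
        "processID": "p0",
    })
    for child in node.get("children", []):
        transform(child, trace_id, span_id, base_time_us, spans)
    return spans
```

Each nested `children` list becomes a set of spans with `CHILD_OF` references back to the enclosing span, which is exactly the parent/child structure Jaeger renders as a flame-graph-like timeline.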

Jaeger UI

There is some further work to be done on the naming of spans, as a few can get names that do not reflect the work contained in the span. The timings and durations of spans are also a bit imprecise, because the Vespa trace mostly uses milliseconds for timestamps, with some parts using microseconds, while Jaeger always uses microseconds. This imprecision is small and does not affect the tool's usefulness for finding bottlenecks.

My experience at Vespa

At the start of my internship I was excited to find out what it would be like to work for a software company and to get insight into the workflow. I felt warmly welcomed and was well introduced to the work environment.

At the beginning of my internship it was a bit daunting to have to learn both how the Vespa engine worked and how to use React and JavaScript. It was all completely new to me and felt a bit insurmountable at first, but I always had colleagues who were eager to help me with problems.

I really enjoyed my time working at Vespa with knowledgeable colleagues who could always help me when I was stuck and who have taught me a lot. My experience at Vespa has been very enjoyable and educational, and it will continue to benefit me in the future.

Summer Internship at Vespa | Vespa Blog

This summer, two young men have revolutionized the field of information retrieval! Or at least they tried… Read on for the tale of this year’s summer interns, and see the fruits of our labor in the embedder auto-training sample app.

Automatic Embedder Training with an LLM

Our main project this summer has been developing a system for automatically improving relevance for semantic search. Semantic search utilizes machine-learned text embedders trained on large amounts of annotated data to improve search relevance.

Embedders can be fine-tuned on a specific dataset to improve relevance further for the dataset in question. This requires annotated training data, which traditionally has been created by humans. However, this process is laborious and time-consuming – can it be automated?

Enter large language models! LLMs like ChatGPT have been trained on an enormous amount of data from a multitude of sources, and appear to understand a great deal about the world. Our hypothesis was that it would be possible to use an LLM to generate training data for an embedder.

Query generation

Diagram depicting the query generation pipeline

Training data for text embedders used for information retrieval consists of two parts: queries and query relevance judgments (qrels). Qrels indicate which documents are relevant for which queries, and are used both for training and for rating retrieval performance during evaluation. Our LLM of choice, ChatGPT (3.5-turbo-4k), is driven by a system prompt and a list of messages containing instructions and data. We used the system prompt to inform ChatGPT of its purpose and to provide rules for how queries should be generated.

Generating queries requires a system prompt, example document-query pairs, and a document to generate queries for. Our system generates the system prompt, and optionally generates additional qrels, resulting in the three-step process illustrated by the diagram above.

In the beginning, we handcrafted system prompts while trying to get ChatGPT to generate queries similar to existing training data. After some trial and error, we found that we got better results if we specified rules describing what queries should look like. Later, we devised a way for ChatGPT to generate these rules itself, in an effort to automate the process.

Using the system prompt alone did not appear to yield great results, though. ChatGPT would often ignore the prompt and summarize the input documents instead of creating queries for them. To solve this, we used a technique called few-shot prompting. It works by essentially faking a conversation between the user and ChatGPT, showing the LLM how it’s supposed to answer. Using the aforementioned message list, we simply passed the LLM a couple of examples before showing it the document to generate queries for. This increased the quality of the output drastically at the cost of using more tokens.
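Assembling such a few-shot prompt is mechanical enough to sketch. The helper below is illustrative (the function name and example contents are made up), but the message shapes match the chat-completion convention of alternating `user`/`assistant` turns after a `system` message:

```python
def build_messages(system_prompt, examples, document):
    """Build a few-shot chat prompt: the system prompt, then a faked
    conversation of (document, queries) example pairs showing the model
    how it is supposed to answer, then the document to generate
    queries for."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_doc, example_queries in examples:
        # Pretend the user sent this document...
        messages.append({"role": "user", "content": example_doc})
        # ...and the assistant answered with exactly these queries.
        messages.append({"role": "assistant",
                         "content": "\n".join(example_queries)})
    # Finally, the real document we want queries for.
    messages.append({"role": "user", "content": document})
    return messages
```

The resulting list is what gets sent as the `messages` parameter of a chat-completion request; every example pair added improves output quality but also consumes tokens, which is the trade-off mentioned above.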

After generating queries, we optionally generate additional qrels. This can be necessary for training if the generated queries are relevant for multiple documents in the dataset, because the training script assumes that all matched documents not in the qrels aren’t relevant. Generating qrels works by first querying Vespa with a query generated by ChatGPT, then showing the returned documents and the generated query to ChatGPT and asking it to judge whether or not each document is relevant.
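The qrel-generation step can be sketched as a small loop, shown here with the retrieval and LLM-judging steps abstracted into callables (the function names and the `(query_id, doc_id)` keying are assumptions for the sketch, not the project's actual code):

```python
def generate_qrels(query_id, query, search_fn, judge_fn, qrels):
    """Augment the qrels for one generated query.

    search_fn(query) -> iterable of (doc_id, doc_text), e.g. a Vespa query
    judge_fn(query, doc_text) -> True if the LLM judges the doc relevant
    qrels: dict mapping (query_id, doc_id) -> relevance grade
    """
    for doc_id, doc_text in search_fn(query):
        # Only ask the LLM about documents we haven't already judged.
        if (query_id, doc_id) not in qrels and judge_fn(query, doc_text):
            qrels[(query_id, doc_id)] = 1  # binary relevance judgment
    return qrels
```

Without this step, any retrieved document missing from the qrels would be treated as non-relevant by the training script, even if it actually matches the generated query.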

Training and evaluation

We utilized SentenceTransformers for training, initializing from the E5 model. We started off using scripts provided by SimLM, which got us up and running quickly, but eventually we wanted more control over our training loop.

The training script requires a list of positive (matching) documents and a list of negative (non-matching) documents for each query. The list of positive documents is given by the generated qrels. We assemble a list of negative documents for each query by querying Vespa and marking each returned document not in the qrels as a negative.
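That negative-mining step amounts to a set difference over the retrieved results. A minimal sketch, with the retrieval call abstracted and all names assumed for illustration:

```python
def mine_negatives(query_id, query, qrels, search_fn, max_negatives=50):
    """Collect hard negatives for one query: documents the retriever
    returns that are not judged relevant in the qrels.

    qrels: dict mapping (query_id, doc_id) -> relevance grade
    search_fn(query) -> iterable of (doc_id, doc_text), best first
    """
    positives = {doc_id for (qid, doc_id) in qrels if qid == query_id}
    negatives = []
    for doc_id, _ in search_fn(query):
        if doc_id not in positives:
            negatives.append(doc_id)
        if len(negatives) == max_negatives:
            break
    return negatives
```

Because the documents come from an actual retrieval run, these are "hard" negatives: they look similar enough to the query to be retrieved, which gives the embedder more informative training signal than random negatives would.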

After training we evaluated the model with trec_eval using the nDCG@10 metric. The resulting score was compared to previous training runs and to a baseline evaluation of the model.

We encapsulated the entire training and evaluation procedure into a single Bash script that let us provide the generated queries and qrels as input, and get the evaluation of the trained model as output.
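trec_eval does the bookkeeping across all queries, but the metric itself is small enough to sketch. Here is a stdlib Python version of nDCG@10 for a single query (our pipeline used trec_eval, not this code):

```python
import math

def ndcg_at_10(ranking, judgments):
    """nDCG@10 for a single query.

    ranking:   list of doc IDs the trained model returns, best first
    judgments: dict mapping doc_id -> graded relevance from the qrels
    """
    def dcg(gains):
        # Discounted cumulative gain: later ranks contribute less.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [judgments.get(doc_id, 0) for doc_id in ranking[:10]]
    # The ideal ordering puts the highest-graded documents first.
    ideal = sorted(judgments.values(), reverse=True)[:10]
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; retrieving relevant documents late, or not at all, pulls the score toward 0, which is why the metric is a good fit for comparing trained models against a baseline.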

Results

The results we got were varied. We had the most successful training on the NFCorpus dataset, where we consistently got an evaluation higher than the baseline. Interestingly, we initially got the highest evaluation when training on just 50 queries! We eventually figured out that this was caused by using the small version of the E5 model – using the base version of the model gave us the highest evaluation when training on 400 queries.

Training on other datasets was unfortunately unsuccessful. We tried training on both the FiQA and the NQ dataset, tweaking various parameters, but weren’t able to get an evaluation higher than their baselines.

Limitations and future work

The results we got for NFCorpus are a promising start, and previous research also shows this method to have promise. The next step is to figure out how to apply our system to datasets other than NFCorpus. There’s a wide variety of different options to try:

  • Tweaking various training parameters, e.g. number of epochs and learning rate
  • Different training methods, e.g. knowledge distillation
  • Determining query relevance with a fine-tuned cross-encoder instead of with ChatGPT-generated qrels
  • More data, both in terms of more documents and generating more queries
  • Using a different model than E5

We currently make some assumptions about the datasets we train on that don’t always hold. Firstly, we do few-shot prompting when generating queries by fetching examples from existing training data, but this system is perhaps most useful for datasets without that data. Secondly, we use the ir_datasets package to prepare and manage datasets, but ideally we’d want to fetch documents from e.g. Vespa itself.

Most of our training was done on the relatively small NFCorpus dataset because of the need to re-feed all documents after each training run to generate new embeddings. This becomes a big bottleneck on large datasets. Implementing frozen embeddings, which allow reusing document embeddings between training runs, would solve this problem.

Side quests

The easiest way to learn Vespa is to use it. Before starting on the main project, we spent some time trying out the various interactive tutorials. We also worked on various side projects which were related to the main project in some way.

Embedding service

We created a sample app to create embeddings from arbitrary text, using the various models in the Vespa model hub. This was a great way to learn about Vespa’s stateless Java components and how Vespa works in general.

Pyvespa

Pyvespa is a Python API that enables fast prototyping of Vespa applications. Pyvespa is very useful when working in Python, like we did for our machine learning experiments, but it does not support all of Vespa’s features. In addition, there were some issues with how Pyvespa handled certificates that prevented us from using Pyvespa in combination with an app deployed from the Vespa CLI.

We were encouraged to implement fixes for these problems ourselves. Our main changes were to enable Pyvespa to use existing certificates generated with the Vespa CLI, and to add a function for deploying an application from disk to Vespa Cloud via Pyvespa, allowing us to use all of Vespa's features from Python (this already existed for deploying to Docker, but not for Vespa Cloud). This was very satisfying, as well as a great learning experience.

Our experience at Vespa

We’ve learned a lot during our summer at Vespa, especially about information retrieval and working with LLMs. We’ve also learned a lot about programming and gotten great insight into the workings of a professional software company.

Contributing to an open-source project, especially such a large one as Vespa, has been very exciting. Vespa is powerful, which is awesome, but as new users, there was quite a lot to take in. The project is well documented, however, and includes a great number of sample apps and example use cases, meaning we were usually able to find out how to solve problems on our own. Whenever we got really stuck, there was always someone to ask and talk to. A big shout out to all of our colleagues, and a special thanks to Kristian Aune and Lester Solbakken for their support and daily follow-up during our internship.

Working at Vespa has been a great experience, and we’ve really enjoyed our time here.