March 5, 2024

How to improve document matching when designing a chatbot?

When working with chatbots, you have probably received both good and poor responses to your questions. You may also have seen which documents were used to create an answer, without knowing why they were considered relevant.

This blog post is part of a series on improving chatbots, based on our experience building an in-house assistant for the open-source library tapir. We will first recap how Retrieval Augmented Generation (RAG) works, then describe our baseline solution, how we created an evaluation dataset, the methods we tried, and other interesting methods worth considering to improve document matching.

How does RAG work?

Before going over the exact methods and the description of our solution, let's recap how most chatbots work behind the scenes using RAG.

A RAG-based solution works in a few steps (a minimal sketch follows the list):

  1. The user asks the system a question (via the frontend app). This question is sent to the backend app. 
  2. Next, the question is analysed to find relevant documents in the available knowledge database. 
  3. Once the extra information is retrieved from the knowledge database, it is sent to the text generation model together with the original question. The model then creates an answer based on the provided context, which is returned to the frontend application. 
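
The three steps above can be sketched in a few lines of Python. This is a purely illustrative skeleton with placeholder helpers, not code from any specific library:

```python
from typing import List

def find_relevant_documents(question: str, k: int) -> List[str]:
    # Step 2: in a real system, this queries the knowledge database
    # (e.g., a vector store) for the k most relevant documents.
    knowledge_base = ["tapir is a library for describing HTTP endpoints."]
    return knowledge_base[:k]

def generate_answer(question: str, context: List[str]) -> str:
    # Step 3: in a real system, this calls a text generation model
    # with the original question plus the retrieved context.
    return f"Answer to '{question}' based on {len(context)} context document(s)."

# Step 1: the question arrives from the frontend app.
question = "What is tapir?"
print(generate_answer(question, find_relevant_documents(question, k=3)))
```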

Now that we have a picture of how RAG systems work, let's move on to the baseline solution we created.

Baseline solution

An overview of our baseline solution can be seen above. For the frontend application, we decided to go with Streamlit, an easy-to-use Python library. The backend consists of two elements:

  1. A document retriever using a FAISS vector store, implemented with Langchain. This knowledge database is built from the documentation files in the Tapir GitHub repository.
  2. A text generation model (databricks/dolly-v2-3b) served with the HuggingFace library. Experiments to select the best model are covered in the next blog post in this series, which focuses on evaluating the text generation part of RAG systems. 

Both components (frontend and backend) are wrapped in Docker containers for ease of use and reproducibility of the setup.  
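
For the text generation component, loading the dolly model with the HuggingFace transformers pipeline looks roughly like this (a sketch following the model card's recommended settings; the prompt format is a simplified placeholder, not our exact prompt):

```python
import torch
from transformers import pipeline

# Load the instruction-following model used in our baseline.
generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the model ships its own instruction pipeline
    device_map="auto",
)

# The retrieved documents are injected into the prompt as context.
context = "tapir is a library for describing HTTP endpoints."
question = "What is tapir?"
print(generate_text(f"Answer based on the context.\nContext: {context}\nQuestion: {question}"))
```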

Evaluation preparations

Before we move to the experiments, it is important to first show what our dataset looks like and which metrics we use to measure improvements in document retrieval. In this blog post, we focus only on how relevant documents are retrieved.
Now, let's explore our evaluation dataset.

Evaluation dataset

We first need a dataset to check whether the solution works well and whether an improvement we want to introduce adds any value. For questions about the Tapir library, no datasets are available, so we decided to create our own small dataset (of 50 questions and answers) as a good seed. To do so, we went through the markdown files of the repository and created questions, each with a link to the correct paragraph (called a document from this point on) and keywords that should be present in the correct answer. The dataset takes the form of a JSON file, which can be seen below.
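
An entry could look roughly like this (field names and values are illustrative, shown only to convey the structure of a question, a link to the correct document, and the expected keywords; this is not the exact schema we used):

```json
{
  "question": "How do I define an endpoint in tapir?",
  "document": "doc/endpoint/basics.md",
  "keywords": ["endpoint", "input", "output"]
}
```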

Evaluation metrics

Any good ML engineer knows that a dataset and good metrics are the critical components of any evaluation. We already have the dataset, so we need to find a good metric. As you can imagine, there are quite a few metrics for evaluating information retrieval, but we focus on 3 of them:

  • match@k, which checks whether the right document is included in the k retrieved documents. This one is chosen as our main metric. 
  • precision@k, which corresponds to the percentage of relevant documents in the set of k retrieved documents.
  • recall@k, which corresponds to the percentage of relevant documents retrieved in the set of k, out of the whole set of relevant documents. 

These definitions are a bit abstract, so here is an example. Suppose there are 4 documents explaining “What is the tapir?”. Our system retrieves 5 documents, of which the top 3 are relevant.

Then, the results for precision@k will be as follows:

  • precision@1 = precision@2 = precision@3 = 1, as all retrieved documents are relevant. 
  • precision@4 = 0.75, as only 3 out of the 4 retrieved documents are relevant.
  • precision@5 = 0.6, as only 3 out of the 5 retrieved documents are relevant. 

When considering recall@k, the results will be as follows:

  • recall@1 = 0.25, as 1 out of 4 relevant documents is retrieved. 
  • recall@2 = 0.5, as 2 out of 4 relevant documents are retrieved. 
  • recall@3 = recall@4 = recall@5 = 0.75, as only 3 out of 4 relevant documents are retrieved. 
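
A minimal sketch of how these metrics can be computed in plain Python, assuming documents are identified by, for example, their links (an illustration, not our exact evaluation code):

```python
from typing import List

def match_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """1.0 if any relevant document appears in the top-k results, else 0.0."""
    return float(any(doc in relevant for doc in retrieved[:k]))

def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

# The example from the text: 4 relevant documents, the top 3 of 5 retrieved are relevant.
relevant = ["doc_a", "doc_b", "doc_c", "doc_d"]
retrieved = ["doc_a", "doc_b", "doc_c", "doc_x", "doc_y"]
print(precision_at_k(retrieved, relevant, 4))  # 0.75
print(recall_at_k(retrieved, relevant, 5))     # 0.75
```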

As we now have a dataset and metrics, we can evaluate our baseline solution and, step by step, introduce improvements. 

Improvements

When improving document retrieval, you can target many things: the embedding model used to create the FAISS store, the input used when creating the documents, and finally, an ensemble of retrieval models. One thing not covered here, but often suggested, is prompt engineering. However, changes to the prompt might influence not only the retrieved documents but also the quality of the generated text, as different models require different prompt structures. 

Now, let's move on to the experiments we did while working on our assistant. 

Embedding model

The first thing we checked was the influence of the embedding model used to create the FAISS vector store. Initially, we used sentence-transformers/all-mpnet-base-v2, but we found a useful HuggingFace leaderboard to be a good source of inspiration for alternatives. We explored a variety of options and found that a multilingual model (intfloat/multilingual-e5-base) performed best. Interestingly, CodeBERT, which was created to work with code, performed poorly. This might be because the majority of our questions were not focused on code generation but rather on the operational details of the library. 
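
Swapping the embedding model when building the FAISS store with Langchain is essentially a one-line change. A rough sketch (import paths differ between Langchain versions, and the documents here are placeholders):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Swap the model name here to compare embedding models on the same documents.
embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-base")

# Placeholder documents; in our case, these were paragraphs from the tapir docs.
documents = [
    Document(page_content="tapir is a library for describing HTTP endpoints."),
    Document(page_content="Endpoints can be interpreted as a server or a client."),
]

vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
print(retriever.get_relevant_documents("What is tapir?"))
```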

Paragraph structure

As we knew that context is essential to generating a satisfactory response, we decided to focus on what is used to create a paragraph (the document that will be retrieved). We decided to go with an output that combines the headers representing the markdown document's hierarchical structure with the paragraph's content. 

We also experimented with the best way to connect these information blocks, and comma separation between the elements turned out to work best (see the sketch below). 
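
One way to build such paragraphs is with Langchain's markdown header splitter, keeping the header hierarchy in the metadata and joining it with the content using commas. This is a sketch of the idea, not our exact pipeline (the import path depends on the Langchain version):

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_doc = """# Endpoints
## Defining an endpoint
An endpoint is described by its inputs and outputs.
"""

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "h1"), ("##", "h2")])
sections = splitter.split_text(markdown_doc)

# Combine the header hierarchy with the paragraph content, comma-separated,
# to form the text that gets embedded and stored in the vector store.
paragraphs = [
    ", ".join(list(section.metadata.values()) + [section.page_content])
    for section in sections
]
print(paragraphs[0])
# e.g. "Endpoints, Defining an endpoint, An endpoint is described by its inputs and outputs."
```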

Retrieval model

The last thing worth mentioning, which we experimented with to get the best results, was the choice of retrieval mechanism. According to guides in the Langchain documentation, it is common to combine BM25 with an embedding similarity retriever (like FAISS): the former is good at finding relevant documents based on keywords, while the latter finds relevant documents based on their semantic similarity. We decided our problem was worth exploring that way.
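
A sketch of such an ensemble with Langchain (BM25 needs the rank_bm25 package; the texts are placeholders and the import paths depend on the Langchain version):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

texts = [
    "tapir is a library for describing HTTP endpoints.",
    "Endpoints can be interpreted as a server, a client, or documentation.",
]

# Keyword-based retriever.
bm25_retriever = BM25Retriever.from_texts(texts)
bm25_retriever.k = 3

# Embedding-based retriever.
embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-base")
faiss_retriever = FAISS.from_texts(texts, embeddings).as_retriever(search_kwargs={"k": 3})

# Combine both; the weights control how much each retriever contributes to the ranking.
ensemble = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5])
print(ensemble.get_relevant_documents("What is tapir?"))
```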

Unfortunately, our results did not support that claim: the BM25-based setups scored significantly lower, and the ensemble did not improve on the baseline solution. 

What is left on the table? 

Some methods were not suitable for the small and relatively cost-efficient solution we were working on, but they might still bring good benefits to larger projects.

Elaborate retrievals

Langchain offers other retrievers that require a bit more code adjustment, licenses, or specialised databases (e.g., ElasticSearch, Amazon Kendra, or Azure Cognitive Search).
On the other hand, there is an interesting method called HyDE (Hypothetical Document Embeddings). It works as follows: for an incoming query, you first generate a hypothetical document that would answer it. This hypothetical document is then embedded, and the embedding is used to find existing documents with similar embeddings.
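
The idea can be sketched without any dedicated class (Langchain also ships a HypotheticalDocumentEmbedder chain for this). The generate_hypothetical_answer helper below is a placeholder for whatever text generation model you use:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-base")
vector_store = FAISS.from_texts(
    ["tapir is a library for describing HTTP endpoints."], embeddings
)

def generate_hypothetical_answer(question: str) -> str:
    # Placeholder: in a real setup, this would call the text generation model.
    return "tapir lets you describe HTTP endpoints and interpret them as servers."

question = "What is tapir?"
# Embed the hypothetical answer instead of the question itself...
hypothetical_embedding = embeddings.embed_query(generate_hypothetical_answer(question))
# ...and use that embedding to find real documents that are semantically similar.
print(vector_store.similarity_search_by_vector(hypothetical_embedding, k=3))
```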

Reranking

Reranking is often mentioned as bringing huge benefits to retrieval. With the Ensemble Retriever, we only scratched the surface, as there are plenty of other reranking methods, e.g., Flash Reranker or CohereRerank.
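
As a rough illustration, a reranker such as Cohere's can be plugged in via Langchain's contextual compression retriever. This requires a Cohere API key, and the import paths vary between Langchain versions, so treat it as a sketch:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-base")
base_retriever = FAISS.from_texts(
    ["tapir is a library for describing HTTP endpoints."], embeddings
).as_retriever(search_kwargs={"k": 10})

# The reranker rescores the candidates from the base retriever and keeps the best ones.
reranker = CohereRerank()  # expects COHERE_API_KEY in the environment
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
print(retriever.get_relevant_documents("What is tapir?"))
```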

Complex postprocessing methods

At this point, it is worth mentioning that there are other methods available for improving the retrieval mechanism, like:

  • MultiQueryRetriever, which generates multiple versions of the user query and retrieves the union of all documents retrieved across the query variations (a sketch follows this list).
  • Contextual compression, where a compressed version of each document is returned instead of the full paragraph, which might be full of irrelevant information.
  • Long context reorder, which reorders documents so that the required information does not end up in the middle of a long context, where it is often ignored when generating answers. 
  • A time-weighted vector store retriever, which applies a time decay parameter to favour newer documents. This kind of behaviour might be favourable for news documents. 
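
As an example of the first item, Langchain's MultiQueryRetriever wraps an existing retriever and uses an LLM to rephrase the query. A sketch (the OpenAI model here is just one possible choice; any Langchain-supported LLM works):

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI

embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-base")
base_retriever = FAISS.from_texts(
    ["tapir is a library for describing HTTP endpoints."], embeddings
).as_retriever()

# The LLM generates several rephrasings of the user query; the union of the
# documents retrieved for each rephrasing is returned.
retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=ChatOpenAI(temperature=0),
)
print(retriever.get_relevant_documents("How do I describe an endpoint in tapir?"))
```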

Conclusions

In this blog post, we went over the baseline solution for our in-house chatbot and the various improvements we made along the way. Finally, there are some improvements we have not yet tried but that are still worth exploring. If you are looking for further methods to improve retrieval, consider checking the Langchain guides and the OpenAI survey video.

Reviewed by: Adam Wawrzyński