April 8, 2024

How can you evaluate your chatbot's answers?

When creating a chatbot or comparing available solutions (e.g., GPT Builder, Google Cloud Vector Search + Gemini), you might want a structured approach to evaluating the models. In this blog post, we will review the available evaluation methods, share some insights from our experiments with our own chatbot for the Tapir library, and point out some possible shortcomings of the libraries.

Evaluation metrics and libraries

This section reviews various metrics and libraries for evaluating question-answering systems, starting with more “classic” methods like the BLEU metric family, followed by more hands-on libraries like ragas or the Langchain evaluation module.

BLEU & ROUGE

The first evaluation tool we will explore is the BLEU (Bilingual Evaluation Understudy) metric, which is commonly used in NLP for summarisation and translation problems but can still be considered for Q&A. This precision-based metric compares two texts (generated and reference), counts the number of words (or n-grams) present in both, and divides by the length of the generated text. The most popular variant is BLEU-4, which operates on 4-grams.

Values range between 0 and 1, and even human translations rarely reach a score of 1.0. The table below gives some insight into how to interpret the values of BLEU:

As you can imagine, this metric has its shortcomings: 

  • It cannot be used on a single response but rather on a whole dataset, as it bases its results on n-grams and their statistical relevance. 
  • The metric is also sensitive to synonyms, as it only focuses on the presence of exact words in the answer; e.g., “This is a great blog post” and “That was an amazing read” would get completely different values for this metric, while having more or less the same meaning (see the sketch after this section).
  • The metric penalizes missing content words and function words in the same way. For example, the presence of words like “a” and “Python” in an answer to “What is Tapir?” has the same effect on the metric, even though the latter is much more significant to the essence of the question. 
  • Effectively, the metric ignores the text's grammatical and structural correctness: BLEU-1 for the texts “That was an amazing read” and “That a read amazing was” gives the very same result. 
  • As it uses n-grams, it ignores long-range dependencies in the answer and puts only a minor penalty on grammatically incorrect sentences. 

This metric also has a “recall-oriented” counterpart called ROUGE (Recall-Oriented Understudy for Gisting Evaluation), as well as variants that consider the longest common subsequence or skip-bigrams.
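
To make the synonym problem concrete, here is a minimal sketch of computing BLEU and ROUGE for a single pair of paraphrases. It assumes the nltk and rouge-score packages; the sentences are the examples from the list above.

```python
# Minimal sketch: BLEU and ROUGE on a single pair of paraphrases.
# Assumes the nltk and rouge-score packages are installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "this is a great blog post"
candidate = "that was an amazing read"

# BLEU-4 (default weights) with smoothing, since single short sentences
# otherwise produce degenerate zero scores.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU-4:         {bleu:.3f}")                    # near 0
print(f"ROUGE-1 recall: {rouge['rouge1'].recall:.3f}")  # near 0
```

Both scores end up near zero even though the two sentences mean roughly the same thing, which is exactly the synonym sensitivity described above.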

Ragas

Ragas is a specialized library for evaluating Retrieval Augmented Generation (RAG) pipelines. It offers metrics tailored for each of the steps:

  • generation (faithfulness, answer relevancy, answer correctness, answer semantic similarity, and aspect critique)
  • retrieval (context precision, i.e., precision@k, and context relevance, i.e., recall@k); both metrics are described in a blog post on improving retrieval for chatbots.

In this blog post, we focus on text generation, so we will go more in-depth on the related metrics. All metrics operate on the dataset level with the assistance of GPT. Each request for a metric is sent with the original question, the reference answer, the generated answer, and the provided context. Metrics have values between 0 and 1, where larger is more favourable. 

At this point, it is important to mention the pros and cons of solutions using automated evaluation (in this case, via OpenAI’s GPT):

  • Human evaluation is expensive, time-consuming, and often subjective. 
  • The new evaluation model version might change metric results without your knowledge. Freezing the evaluation model version can mitigate this. 
  • There may be bias towards answers generated by the evaluation model or in a format that is more acceptable to it.

Faithfulness

Faithfulness is defined as maintaining factual consistency with the provided context. A generated answer is deemed faithful when all claims it makes can be inferred from the provided context. 
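
In other words, the score boils down to the fraction of claims in the answer that the context supports. A schematic illustration follows (ragas extracts and verifies the individual claims with an LLM; here they are hard-coded for clarity):

```python
# Schematic illustration of a faithfulness-style score: the fraction of claims
# in the generated answer that can be inferred from the retrieved context.
# In ragas, claim extraction and verification are done by an LLM.
claim_supported_by_context = [True, True, False]  # one unsupported claim

faithfulness = sum(claim_supported_by_context) / len(claim_supported_by_context)
print(round(faithfulness, 2))  # 0.67
```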

Answer correctness

This criterion compares the generated and reference answers, combining semantic similarity and factual similarity in a weighted fashion. 

Answer relevance 

This criterion uses the question and the generated answer to assess whether the answer appropriately addresses the original question. It operates in the following fashion: the LLM is prompted to generate plausible questions based on the provided answer; then, cosine similarity is calculated between them and the original question (using an embedding model of your choice). If the generated questions are almost identical to the original question, the generated answer is deemed relevant. 
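
A schematic sketch of this idea (not ragas' internal implementation) might look as follows; the embedding model and the regenerated questions are illustrative assumptions:

```python
# Schematic illustration of the answer-relevancy idea: regenerate questions
# from the answer, then compare them to the original question via embeddings.
# Assumes the sentence-transformers package; the model name is an example choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

original_question = "What is Tapir?"
# Questions an LLM might generate back from the chatbot's answer:
regenerated_questions = [
    "What is the Tapir library used for?",
    "Which library describes HTTP endpoints as Scala values?",
]

q_emb = model.encode(original_question, convert_to_tensor=True)
g_emb = model.encode(regenerated_questions, convert_to_tensor=True)

# Answer relevancy ~ mean cosine similarity between the original question
# and the questions regenerated from the answer.
relevancy = util.cos_sim(q_emb, g_emb).mean().item()
print(f"answer relevancy ≈ {relevancy:.2f}")
```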

Answer semantic similarity

This criterion considers the semantic similarity between the generated and reference answers. A cross-encoder model is used to calculate the similarity. 
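
A short sketch of scoring two answers with a cross-encoder, assuming the sentence-transformers package (the model name is an example choice, not necessarily the one ragas uses):

```python
# Sketch: semantic similarity of a generated vs. reference answer
# with a cross-encoder fine-tuned for semantic textual similarity.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-base")
score = model.predict([
    ("Tapir is a Scala library for describing HTTP endpoints as values.",
     "Tapir lets you declare HTTP API endpoints as immutable Scala values."),
])
print(score[0])  # higher means more semantically similar
```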

Aspect critique

Ragas can also check various aspects of the answer using GPT. Available criteria are harmfulness, strictness, maliciousness, coherence, correctness, and conciseness. 

Unfortunately, as with the other ragas metrics, it is only possible to assess whole-dataset performance, not per-sample results and reasoning. 
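
For reference, running the generation metrics on a small dataset might look roughly like this; column names and import paths vary between ragas versions, an OpenAI API key is assumed to be configured, and the sample data is made up for illustration:

```python
# A minimal sketch of a ragas evaluation run (dataset-level scores only).
# Column names ("ground_truth" vs "ground_truths") depend on the ragas version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, answer_correctness

samples = {
    "question": ["What is Tapir?"],
    "answer": ["Tapir is a Scala library for describing HTTP endpoints as values."],
    "contexts": [[
        "Tapir lets you describe HTTP API endpoints as immutable Scala values.",
    ]],
    "ground_truth": ["Tapir is a Scala library for declaring HTTP API endpoints."],
}

dataset = Dataset.from_dict(samples)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, answer_correctness])
print(result)  # aggregated, dataset-level scores for each metric
```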

Langchain evaluation

The other library worth mentioning for evaluation is the Langchain evaluation module. It works very similarly to the aspect critique of ragas, but with the difference that it also provides per-sample results with reasoning. 

Available metrics are: 

  • conciseness (how concise, rather than lengthy, the answer is)
  • relevance (how relevant the answer is to the provided context)
  • correctness (whether the generated answer is correct with respect to the reference answer)
  • coherence (whether the generated answer is well structured, organized, and makes sense with respect to the provided input: context + question)
  • harmfulness
  • maliciousness
  • helpfulness
  • controversiality
  • misogyny
  • criminality
  • insensitivity

As with the ragas aspect critique, these metrics work by sending requests to GPT with the original question, the provided context, the reference answer, and the definition of the metric.

For our own LLM project, we decided this is the best tool for evaluation, with only two metrics measured (to limit OpenAI costs): relevance and correctness, as together they capture whether an answer is correct and grounded in the provided context. An example of such an evaluation can be seen below. 
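
The snippet below is an illustrative sketch of such a per-sample evaluation with the Langchain evaluation module, not our exact setup; the evaluator and criterion names come from the module, while the model choice and the example strings are assumptions:

```python
# Sketch: per-sample "correctness" evaluation with Langchain's evaluation module.
# Requires the langchain and langchain-openai packages and an OpenAI API key.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

# "labeled_criteria" evaluators compare the prediction against a reference answer;
# "criteria" evaluators (e.g. relevance) work without a reference.
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=llm)

result = evaluator.evaluate_strings(
    input="What is Tapir?",
    prediction="Tapir is a Scala library for describing HTTP endpoints as values.",
    reference="Tapir is a Scala library for declaring HTTP API endpoints.",
)
print(result["score"])      # 1 (criterion met) or 0 (not met)
print(result["reasoning"])  # per-sample explanation from the evaluation model
```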

Now we can move on to the next step: a comparison of the models we investigated using our dataset. 

Comparison of models 

The LLM leaderboard on HuggingFace is a good resource for getting some inspiration on models to try.

Our results for our Tapir documentation Assistant can be seen below. 

We explored most of the well-known models of around 7B parameters, as we wanted to be able to load them on GPUs with around 16-24GB of VRAM. As our baseline, we used dolly, which had quite good results. After that, we tried the Mistral and MPT models, out of which MPT seemed better than the baseline, but only in terms of correctness. At this point, we decided to explore whether models that have seen a lot of code content would perform well in this setup, so we used code-llama as the text generation model. Unfortunately, this model did not perform well, with a 0.22 score for relevance and 0.12 for correctness, showing that dolly would be a much cheaper and better option. Then we tried Falcon, which performed very poorly, and stable-beluga, which performed best out of the available models, with a 0.64 relevance score and 0.48 correctness. 
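
For context, loading a 7B model in half precision is what keeps it within a 16-24GB VRAM budget. A minimal sketch with transformers (the checkpoint name is an illustrative choice, not necessarily the exact one we used):

```python
# Sketch: loading a 7B model in float16 so it fits on a 16-24GB GPU.
# Requires the transformers and accelerate packages; the model name is an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # ~2 bytes per parameter instead of 4
    device_map="auto",          # place layers on the available GPU(s)
)

inputs = tokenizer("What is Tapir?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```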

For reference, we also checked GPT-4's performance. However, it is worth mentioning that this model was used for both generation and evaluation, which might cause some leakage between results and a tendency to score its own answers higher. As GPT-4 is also an API service, we do not know whether extra context is used when generating answers, which would make the comparison with open-source models unfair. 

Conclusions

In this blog post, we explored a variety of metrics and libraries that allow us to evaluate Q&A systems based on their text-generation components. We went over classical methods like BLEU and ROUGE, followed by the popular ragas library and the more informative Langchain evaluation module. Then, we went over example results, the structure of the Langchain metrics, and their explanations. Finally, we reviewed the aggregated results for SoftwareMill and ReasonField Lab's LLM project. 

Reviewed by: Adam Wawrzyński