ReasonField Lab
VirtusLab Group Company
hello@reasonfieldlab.com
ReasonField Lab 2024, All Rights Reserved
When creating a chatbot or comparing available solutions (e.g., GPT Builder, Google Cloud Vector Search + Gemini), you might wish to have a structured approach to evaluating the models. In this blog post, we will review the available evaluation methods, share some insights from our experiments in this area while building our own chatbot for the Tapir library, and discuss some possible shortcomings of the libraries.
This section will review various metrics and libraries for evaluating question-answering systems, starting with more “classic” methods like the BLEU metric family, followed by more hands-on libraries like ragas or the Langchain evaluation module.
The first evaluation tool we will explore is the BLEU (Bilingual Evaluation Understudy) metric, which is commonly used in NLP for summarisation and translation problems but can still be considered for Q&A. This precision-based metric compares two texts (generated and reference), counts the number of words (or n-grams) present in both, and divides by the length of the generated text. The most popular variant is BLEU-4, which combines precisions for n-grams of length 1 to 4.
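To make the mechanics concrete, here is a minimal, simplified sketch of the BLEU idea in pure Python: clipped n-gram precision combined with a brevity penalty. This is not the official implementation (which adds smoothing and other refinements); for real evaluations, use a library such as sacrebleu or NLTK.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: clipped n-gram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:  # avoid log(0); real BLEU applies smoothing here
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalise candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * geo_mean
```

A perfect copy of the reference scores 1.0, while a candidate sharing no 4-grams with the reference scores 0.0, which already hints at the metric's harshness for free-form answers.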
Values range between 0 and 1, though even human translations rarely achieve a score of 1.0. The table below gives some insight into how to interpret BLEU values:
As you can imagine, this metric has its shortcomings:
This metric also has a “recall-oriented” counterpart called ROUGE (Recall-Oriented Understudy for Gisting Evaluation), along with variants that consider the longest common subsequence (ROUGE-L) or allow skipped words within bigrams (ROUGE-S).
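The recall-oriented flavour is easy to see in a sketch: instead of dividing the n-gram overlap by the candidate's length (precision), ROUGE-N divides by the reference's n-gram count. Again, this is an illustrative simplification, not the full library implementation.

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N recall sketch: overlapping n-grams / n-grams in the reference."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum((cand & ref).values())  # clipped overlap, as in BLEU
    return overlap / max(sum(ref.values()), 1)
```

A short candidate that copies part of the reference gets partial recall credit, whereas BLEU would reward it with high precision: the two metrics penalise opposite failure modes.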
Ragas is a specialized library for evaluating Retrieval Augmented Generation (RAG) pipelines. It offers metrics tailored for each of the steps:
In this blog post, we focus on text generation, so we will go more in-depth on the related metrics. All metrics operate on the dataset level with the assistance of GPT. Each metric request is sent with the original question, the reference answer, the generated answer, and the provided context. Metrics take values between 0 and 1, where higher is better.
At this point, it is important to mention the pros and cons of solutions using automated evaluation (in this case via OpenAI’s GPT), so here they are:
Faithfulness is defined as maintaining factual consistency based on the context provided. A generated answer is deemed faithful when all claims in the generated answer can be inferred from the provided context.
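Conceptually, the metric decomposes the answer into individual claims and checks each one against the context; the score is the fraction of supported claims. The sketch below mirrors that structure in plain Python, with a naive word-overlap check standing in for the LLM judgment that ragas actually performs (`claim_supported` is a hypothetical stand-in, not the ragas API).

```python
def claim_supported(claim: str, context: str) -> bool:
    # Naive stand-in for the LLM verdict used by ragas: a claim counts as
    # supported if all of its words appear somewhere in the context.
    return set(claim.lower().split()) <= set(context.lower().split())

def faithfulness(answer_claims: list[str], context: str) -> float:
    """Fraction of claims in the generated answer supported by the context."""
    if not answer_claims:
        return 0.0
    supported = sum(claim_supported(c, context) for c in answer_claims)
    return supported / len(answer_claims)
```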
This criterion compares the generated and reference answers based on semantic similarity and factual similarity in a weighted fashion.
This criterion uses the question and the generated answer to assess whether the answer appropriately addresses the original question. It operates in the following fashion: an LLM is prompted to generate plausible questions based on the provided answer. Then, cosine similarity is calculated between these generated questions and the original question (using an embedding model of your choice). If the generated questions are almost identical to the original question, the generated answer is deemed relevant.
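The similarity step can be sketched as follows. Note the hedges: `embed` below is a bag-of-words stand-in for a real embedding model, and the question generation itself (the LLM call) is assumed to have already happened.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Bag-of-words stand-in; a real setup would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer_relevance(original_question: str, generated_questions: list[str]) -> float:
    """Mean cosine similarity between LLM-generated questions and the original."""
    q = embed(original_question)
    sims = [cosine(q, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims)
```

If the answer is off-topic, the questions an LLM reconstructs from it will drift away from the original question, and the mean similarity drops accordingly.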
This criterion considers the semantic similarity between the generated and reference answers. To calculate the similarity, a cross-encoder model is used.
Ragas also offers checking various aspects of the answer using GPT. Available criteria are harmfulness, strictness, maliciousness, coherence, correctness, and conciseness.
Unfortunately, as with the other criteria, it is only possible to assess whole-dataset performance, not per-sample results with reasoning.
The other library worth mentioning for evaluation is the Langchain evaluation module. It works very similarly to the aspect critique in ragas, with the difference that it also provides per-sample results with reasoning.
Available metrics are:
As with the ragas aspect critique, these metrics operate by sending requests to GPT with the original question, the provided context, the reference answer, and the definition of the metric.
For our own LLM project, we decided this was the best tool for evaluation, with only two metrics measured (to limit OpenAI costs): relevance and correctness, as together they capture whether an answer is correct and grounded in the provided context. An example of such an evaluation can be seen below.
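To show the shape of this per-sample evaluation, here is a hedged sketch of the grading loop. The `judge` function below is a trivial stand-in for the GPT-backed grader (it is not the Langchain API); the returned score/reasoning fields mirror the kind of per-sample output the module provides.

```python
def judge(question: str, answer: str, reference: str, criterion: str) -> dict:
    # Stand-in for the GPT call; the real grader prompts an LLM with the
    # question, answer, reference, and the criterion's definition.
    score = 1 if reference.lower() in answer.lower() else 0
    reasoning = ("answer matches reference" if score
                 else "answer diverges from reference")
    return {"criterion": criterion, "score": score, "reasoning": reasoning}

def evaluate_dataset(samples, criteria=("relevance", "correctness")):
    """Return per-sample, per-criterion verdicts, each with its reasoning."""
    results = []
    for s in samples:
        for c in criteria:
            results.append(judge(s["question"], s["answer"], s["reference"], c))
    return results
```

The key practical difference from ragas is visible in the return shape: every sample carries its own verdict and a reasoning string, rather than contributing anonymously to one dataset-level number.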
Now we can move on to the next step: a comparison of the models we investigated using our dataset.
The LLM leaderboard on HuggingFace is a good resource for getting some inspiration on models to try.
Our results for our Tapir documentation Assistant can be seen below.
We explored the majority of well-known models in the 7B size range, as we wanted to be able to load them on GPUs with around 16-24GB of VRAM. As our baseline, we used dolly, which had quite good results. After that, we decided to try the Mistral and MPT models, of which MPT seemed to be better than the baseline, but only in terms of correctness. At this point, we decided to explore whether models that had seen a lot of code content would perform well in this setup, so we used code-llama as the text generation model. Unfortunately, this model did not perform well, with a 0.22 score for relevance and 0.12 for correctness, showing that dolly would be a much cheaper and better option. Then we tried Falcon, which performed very poorly, and stable-beluga, which performed best out of the available models, with a 0.64 relevance score and 0.48 correctness.
For reference, we also checked GPT-4's performance. However, it is worth mentioning that this model was used for both generation and evaluation, which might cause some leakage between results and a tendency to favour its own answers. As this is also an API service, we do not know whether extra context is used when generating answers, which would make the comparison to open-source models unfair.
In this blog post, we explored a variety of metrics and libraries for evaluating Q&A systems based on their text-generation components. We went over classical methods like BLEU and ROUGE, followed by the popular ragas library and the more informative Langchain evaluation module. Then, we went over example results, the structure of the Langchain metrics, and their explanations. Finally, we reviewed aggregated results for SoftwareMill and ReasonField Lab's LLM project.
Reviewed by: Adam Wawrzyński