August 6, 2023

A Reasonable Business Use Case for Generative LLMs

No items found.

Use your knowledge base and handcrafted prompt to create a safer Q&A chatbot.

Introduction

The ease of use of generative tools based on artificial intelligence, including the now-famous ChatGPT, Midjourney, etc., has caused an exponential increase in interest and popularity in AI. Laypeople are talking about Artificial intelligence during conversations in the store, on cabs, and the street, and visionaries are spinning plans on how artificial intelligence will change the world and the functioning of the business. Companies suffering from FOMO can't wait until they are able to apply generative AI so they won't be left behind in the race for new customers.

In my opinion, we are moving towards the widespread use of AI-supported tools to improve productivity. However, we must remember that artificial intelligence is not a magic tool that will solve all our problems.We need to be aware of the capabilities and limitations of current generative AI systems to have realistic expectations for them.

Keep reading to see what generative large language models can and cannot handle and how they can be used in AI-driven products. In this article, I will focus on the example of generative neural network models, specifically generative language models similar to ChatGPT.

‍

Capabilities of Generative LLMs

Generative large language models are deep learning models which, during their training, have seen much of the Internet, a plethora of books, scientific articles, etc. Based on this content, they have learned to understand language, infer, and reproduce the patterns noticed in these texts. During this stage, the deep learning model learns to memorize certain patterns and passages, which it then eagerly serves to the user in response to his questions and commands. With the knowledge acquired in this stage, Generative large language models are able to answer questions from the broad range of topics they saw during training. This is the stage when the foundation models acquire knowledge about the world, and learn facts, but also acquire biases. Due to the contamination of toxic statements and texts from the Internet containing prejudices related to race, gender, etc., the model will manifest the same behavior in the generated responses. We would definitely not want users of our product to be exposed to this type of content.

Trained on the instruction following task, they are able to interpret the command the user has provided and, based on it, generate text that most closely (according to the model) matches it. We can assume that in this stage, foundation models learn to infer and understand commands, which they then try to execute.

The good, the bad, and the ugly

What tasks generative large language models excel at and what tasks they fail miserably at are described in detail by Adam Kaczmarek in his article ChatGPT - The Revolutionary Bullshit Parrot. I encourage you to read it in depth, as it is enlightening and entertaining.

Source: https://unsplash.com/photos/agFmImWyPso

To summarize Adam's conclusions, generative models in the form of ChatGPT are good at generating fluff, that is, well-written text that does not necessarily carry a lot of information. This is something that is a direct result of the way the models are taught, and it is an immanent feature of them. What they have problems with is: correctly understanding more complex commands, inferring about the world without providing contextual information, providing verified information on a given topic, and potentially generating harmful content as a response to a properly crafted user request.

Generating well-edited and easy-to-read text is a huge advantage of these models, but the hallucination problems of generative models greatly limit their use in business solutions. No one wants a system whose answers can be toxic and unreliable because they are not based on reliable sources of knowledge but is the result of knowledge encoded in the model's weights. Using such models involves checking information against trusted sources of knowledge to gain confidence in the trustworthiness of the model's answers or to identify lies. If only it were possible to leverage the strengths of generative large language models, i.e., their ability to generate well-read texts and solve the hallucination problem...

Retrieval-augmented question answering

This is where a method using the knowledge base, namely retrieval-augmented question answering, comes in on a white horse.

Source: https://www.redbubble.com/i/poster/Illumination-of-Cat-on-a-horse-by-MersaultMax/136598618.E40HW

The solution to the hallucination problem may be to use the knowledge base as a context based on which the language model will only respond to execute the command. In this way, we avoid using littered knowledge acquired during the model's training. Instead, we use a database of the approved documents and the ability to generate the AI model's fluff.

The architecture of the Q&A system with the knowledge base and request classifier.

The architecture of the proposed solution uses a knowledge base, request classifier, and response.

The knowledge base is divided into smaller fragments. When the system gets a request from a user, the system returns a number of document fragments from the knowledge base that are most similar to the user's request. This stage uses text embeddings and vector similarity to find semantically similar documents. For this, we use vector-based knowledge bases to get high search quality and speed up calculations. Then the documents, along with the client's request, are injected into a specially prepared prompt, in which the documents from the knowledge base will provide the context based on which the generative large language model will execute the client's command.

In addition, for each generated sentence, we can specify the source based on which the response was created. For this, we can create a simpler text classifier neural network model that compares the generated sentence from the model's answer to selected passages from the knowledge base. If we are unable to identify the source of the information, we can assume that a hallucination of the model has occurred. We can discard any hallucination from the response or send a request to the large language model with information that it has made a mistake and should correct its response [1]. With such simpler, more precise instruction, there is a much higher chance that the model will understand the command and correct the response.

The system's additional but necessary component will be methods to disarm malicious user requests. Such instructions may cause the system to respond in an inappropriate, threatening, or abusive manner. This is something we want to avoid for reasons of user safety and the company's reputation. For this purpose, we can use a text classifier neural network to determine whether the user request is malicious.

In the first case, it would return a statement to the user that the request does not meet the system's requirements. In the second case, the request would be sent to the system for processing and generating a response.

There are other methods to solve the problem of malicious user commands and toxic responses. On the wave of popularity of large language models, NVIDIA has prepared an open-source library for disarming malicious prompts called NeMo-Guardrails, which you can use for this task.

For a product using a solution based on the method I described, it will be necessary to monitor the entire system, which consists of many elements: a large language model, a text classifier, and prepared prompts. It will be important to investigate to what extent responses are generated based on documents from the knowledge base [3], to what extent there are biases and toxic responses, or how much vulnerability there is to crafted queries triggering toxic model responses [4]. For the text classifier, it will be important to monitor the performance in detecting hallucinations in the generated responses, and for the prepared prompts, the monitored metric will be to what extent the system correctly executed the user's instructions. All of these parameters should be monitored on an ongoing basis, and if problems arise, the system should be fixed.

Model degradation over time

Source: https://unsplash.com/photos/KwQJyIoaYdg

In enterprise-class solutions, customers expect predictability and stability in operation. Such qualities can only be provided by full control over all components of the system, showed the authors of the article [5], in which they presented how the GPT4 model degraded over the period from March to June 2023.

The article shows that the models behind the API wall that we can use to create NLG-based systems are subject to changes over which we have no control and are not even informed about them. This can result in significant degradation of the solution we create without being able to react to such an event. In an analogous situation, when we have our own large language model, we can always save a backup version of the current model, make changes, and restore the earlier correctly working version in case of failure or degradation. When using external APIs, this may not be possible. In such a situation, the solution manufacturer will be forced to look for another large language model provider or to desperately try to create its own model, which will not differ in quality from the original solution. I believe it is better to protect against this risk beforehand and not expose your users to problems. Regardless of the path chosen, it will be necessary to provide tools to monitor the quality of the models over time.

‍

Business use case

The article shows that the models behind the API wall that we can use to create NLG-based systems are subject to changes over which we have no control and are not even informed about them. This can result in significant degradation of the solution we create without being able to react to such an event. In an analogous situation, when we have our own large language model, we can always save a backup version of the current model, make changes, and in case of failure or degradation restore the earlier correctly working version. When using external APIs, this may not be possible. In such a situation, the solution manufacturer will be forced to look for another large language model provider or to desperately try to create its own model, which will not differ in quality from the original solution. I believe, it is better to protect against this risk beforehand and not expose your users to problems. Regardless of the path chosen, it will be necessary to provide tools to monitor the quality of the models over time.

Now let's look at the business use cases that can benefit from using such a solution. A product based on retrieval-augmented question answering can be successfully used as an extension of existing systems based on keyword search and methods based on vector semantic similarity. User instructions may require the system to perform some operation, analysis, or aggregation compared to classical search systems. It can be implemented as a chatbot, thus allowing interaction within a session, correcting commands, and detailing the expected result.

For users of such systems, only the format of queries and the answers they will get as a result of the search, which will be a natural language, will change. In addition, users will get information on the basis of which specific documents or fragments of documents the answer was generated. If the commands are more detailed, we can expect better performance of searches and therefore increased productivity of the system users.

Source: https://unsplash.com/photos/TVCDj_fFvx8

Imagine a system for analyzing the financial statements of publicly listed companies. Standard systems based on keywords or semantic document similarity and queries can only return relevant parts of documents. However, a chatbot system based on a generative large language model is able to make, for example, a comparison of financial results from several years. If the format of the answer is unsatisfactory, the user can specify that he expects an answer in the form of a comparison table and a short textual description. Such a system would become something like an assistant that is able to perform simple tasks. Analytical skills are strictly dependent on the large language model so that they can be developed for a specific business and application.

‍

Summary

After reading this article, I hope you have a better understanding of the realistic capabilities of generative large language models and the dangers they pose. Because of their flaws, they require a lot of safety nets to identify and defuse dangerous behaviors such as hallucinations or toxic responses. Nonetheless, their potential to execute commands and generate well-edited responses provides an opportunity to exploit them and thus increase the productivity of users of such systems by providing them with a digital assistant that, like a computer from Blade Runner, will perform tedious tasks for them and thus complete the task faster.

‍

Resources

[1] Xiang Yue, Boshi Wang, Kai Zhang, Ziru Chen, Yu Su, & Huan Sun. (2023). Automatic Evaluation of Attribution by Large Language Models.

[2] https://github.com/NVIDIA/NeMo-Guardrails

[3] Nelson F. Liu, Tianyi Zhang, & Percy Liang. (2023). Evaluating Verifiability in Generative Search Engines.

[4] https://haystack.deepset.ai/blog/how-to-prevent-prompt-injections

[5] https://arxiv.org/abs/2307.09009

‍