"A house cat has way more common sense and understanding of the world than any LLM." - Yann LeCun, Twitter
Over the last two months, the internet has filled with content that could be characterized by a word that is becoming increasingly popular these days - 'hallucinations'. And I mean human-generated hallucinations. This overhype probably fits the Dunning-Kruger curve; nevertheless, we must bust some myths about this allegedly incredible AI.
ChatGPT has been attributed supernatural abilities and proclaimed an almost-AGI, or at least the tool that will take our jobs. And what is disturbing, this AGI wouldn't make our lives easier by taking over the physical and repetitive tasks in factories, mines, and warehouses - which until now has almost always been the goal of successive "industrial revolutions". In all this absurdity, people are trying to push it into the creative and free professions: doctors, teachers, programmers, artists, etc.
Envy aside, I can understand where this comes from. ChatGPT can be very convincing, especially in the way it implicitly conveys that it is 'wise'.
But what ChatGPT generates is neither wise nor, in many cases, a 'hallucination' - it's plain bullshit.
Below is some of my (biased and subjective) demystification of the alleged revolution in AI.
The main reason that ChatGPT has attracted so much interest is that it's very good at generating smooth, nice-sounding sentences. Sentences that, by their construction, tend to be perceived as correct. That’s it - perceived. It fits perfectly into the definition of bullshit by Harry G. Frankfurt:
“Bullshit is speech intended to persuade without regard for truth. The liar cares about the truth and attempts to hide it; the bullshitter doesn't care if what they say is true or false, but cares only whether the listener is persuaded.” - Harry G. Frankfurt
This comes from the fact that, contrary to the preachers of ChatGPT's self-awareness, it cannot and will not be able to tell whether it knows something or not. No matter how hard it tries, it can't generate anything with regard to the truth, because it does not understand the concept of truth.
But let's start from the beginning. ChatGPT, no matter how wonderfully wrapped, is still only a large language model. Its primary capability, though not the only one, is to write smooth sentences. To understand the skepticism underlying this analysis, let's briefly review the last 10 years of LLM development.
The emergence of Large Language Models - starting with the original attention mechanism, through ELMo, BERT, the first GPT, and XLM, up to Longformer, Reformer, BigBird, T5, and Transformer-XL - marked another milestone in Natural Language Processing after word embeddings.
Word2Vec, with its famous (but probably cherry-picked) word arithmetic: King - Man + Woman = Queen, took the domain out of the hands of linguistic feature engineers and made it more of a deep learning problem. Even then, however, vector arithmetic (without explicit disentanglement) was an overreach, especially given that rare words (of which highly inflected languages have a large number) tended instead to form clusters of their own (clusters of rare words), as noted in the FRAGE paper.
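For the curious, that celebrated analogy is nothing more exotic than vector addition plus cosine similarity. A minimal sketch with hand-picked toy vectors (real Word2Vec embeddings have 100-300 learned dimensions, and analogy queries usually exclude the input words from the candidates):

```python
import numpy as np

# Toy 3-dimensional "embeddings" chosen by hand so that the classic
# analogy holds; real word2vec vectors are learned, not hand-crafted.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```

Whether the nearest neighbour really is "queen" depends entirely on how the vectors were trained - which is exactly the cherry-picking caveat above.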
Large Language Models went a step further and took the creation of the models out of the hands of any kind of annotator. The self-supervised training objective allows LLMs to be trained on amounts of data so large that manual annotation would be infeasible - human intervention is only possible through automation.
The basic building block of large language models is the attention mechanism (source). It was initially developed for neural machine translation, to address the problem of memorizing long sequences in the single state of a recurrent neural network. It allowed every decoded word to have a weighted perspective on all the words of the source sentence.
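Stripped of the deep-learning aura, that "weighted perspective" is a few matrix multiplications. A rough numpy sketch of scaled dot-product attention (deliberately simplified: a single head, no masking, no learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # one row of weights per query
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 decoded positions ("queries")
K = rng.normal(size=(5, 4))   # 5 source words ("keys")
V = rng.normal(size=(5, 4))   # values carried by the source words

out, w = attention(Q, K, V)
print(out.shape)        # (2, 4): each query gets a weighted mix of values
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```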
In a short time, attention gave rise to the whole family of "Transformer" models built around the concept of self-attention. While BERT was a bi-directional transformer that required a different approach to training - namely Masked Language Modelling and Next Sentence Prediction - GPT remained a forward-prediction model that could employ the old language modeling paradigm.
The introduction of GPT also reformulated the approach to NLP problems such as sentence similarity, entailment, or classification: small model heads added on top of the pre-trained transformer could use its vector representations, even without costly fine-tuning of the transformer itself.
Language models are trained with the primary objective of predicting the next word given the current context (the previous words). Basically, the model is trained to guess a word in a given context.
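To make that objective painfully concrete, here it is reduced to its bare bones - a toy bigram "model" that does nothing but pick the most frequent next word. A real LLM replaces the counting table with a giant neural network, but the training objective is the same:

```python
from collections import Counter, defaultdict

# A drastically simplified "language model": bigram counts stand in for
# a neural network, but the objective is identical - predict the next
# word given the previous context.
corpus = "the cat sat on the mat . the cat ran . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # The model's whole "knowledge" is which word most often follows.
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" - the most frequent follower in this corpus
print(predict_next("sat"))  # "on"
```

Notice there is no notion of truth anywhere - only of what tends to come next.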
Additionally, GPT models have the ability to perform in-context learning, i.e., to infer the task from a description and a few examples.
ChatGPT's training process differs in a few ways, in order to adapt it to the conversational mode it works in. The most important part is reinforcement learning from human feedback (RLHF) - a human-in-the-loop process aimed at learning to generate responses that sound most appropriate and convincing to human evaluators.
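The reward-modeling step at the heart of RLHF can be sketched in a few lines: learn a scoring function from pairwise human preferences. Everything below is made up for illustration - the 3-dimensional "response features" and the hidden "taste" vector are stand-ins, not OpenAI's actual setup:

```python
import numpy as np

# Toy Bradley-Terry-style reward model learned from pairwise preferences.
# Per-pair loss: -log sigmoid(r(preferred) - r(rejected)).
rng = np.random.default_rng(1)
true_taste = np.array([2.0, -1.0, 0.5])      # hidden "what the annotator likes"

pairs = []
for _ in range(200):
    a, b = rng.normal(size=(2, 3))           # features of two candidate responses
    # The annotator prefers whichever response the hidden taste scores higher.
    pairs.append((a, b) if a @ true_taste > b @ true_taste else (b, a))

w = np.zeros(3)                              # learned reward weights
lr = 0.1
for _ in range(100):
    for preferred, rejected in pairs:
        diff = preferred - rejected
        p = 1.0 / (1.0 + np.exp(-(w @ diff)))  # P(preferred beats rejected)
        w += lr * (1.0 - p) * diff             # gradient ascent on log-likelihood

# The learned reward direction should align with the hidden taste.
cos = w @ true_taste / (np.linalg.norm(w) * np.linalg.norm(true_taste))
print(cos)  # close to 1.0
```

Note what the model learns here: not which answer is true, only which answer the annotator prefers - which is precisely the "layer of deception" argument below.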
A stochastic parrot is a model that haphazardly stitches together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning (On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?).
Even taking the learning from human feedback into account, ChatGPT is still a parrot repeating what it saw on the internet, with additional bullshit-generation capabilities. There is no way a model trained simply on the objective of learning what language looks like can do much more than repeat information it already saw during training (in a way aligned to the user's query), as it has little to no understanding of the content. Human feedback only adds another layer of deception, providing the model with information on how people like their bullshit to be served.
The fact that OpenAI allegedly hired people to play with it and explain in extensive detail how to solve various problems only proves that making a model bigger does not make it smarter. It just has more storage for memorizing the answers.
There are, of course, applications where it would be useful, but not as many as the overhype would indicate. First of all, it's a language model - it performs well on tasks that require generating user-friendly, human-like output, and most importantly, it can effectively process text within the scope of a document and answer in-context questions about it.
It can generate text that sounds plausible and looks like genuine text in a given language very well. I would say it can even outperform some native speakers (at least in Polish). That said, it still behaves like an undergraduate student given the task of writing an essay with a minimum word count. Asked for the differences between two Polish poems with purposely swapped authors, it produced the following text:
“Tęskno mi, Panie” is a poem by Juliusz Słowacki, considered one of his most important works. On the other hand “Smutno mi, Boże” is a poem by Cyprian Kamil Norwid. Both the poems are different, because they differ in author, style and content.
So many words, so little content…
Słowacki was representative of romanticism and his creation is characterized by melancholy, reflexivity and pathos. On the other hand Norwid was a representative of the national poetry and his poems are characterized by strong emotions and also philosophical and social content.
While both of the poems are about longing and sadness, they differ in the method of presenting those emotions and their context. “Tęskno mi Panie” by Słowacki focuses on longing for God and “Smutno mi, Boże” by Norwid is about sadness and loneliness, coming from lack of hope and trust in God.
This is a model answer showing how to write nice-sounding sentences; it doesn't really contain any valuable information.
One application where ChatGPT's weaknesses are, in fact, its strengths is talking to Tinder matches. I am no expert on the matter, but it is a no-brainer that in the first stages of a "relationship", the impression you make is of utmost importance. With its smooth and convincing sentences, ChatGPT is of great assistance to all those who struggle with original small talk and conversation starters.
Many people have shown that ChatGPT can, to various degrees, support flirting (e.g., by generating responses to matches) or even perform it on its own. I can't wait until all the chit-chat on dating apps is performed by LLMs on both sides. The real world would again be full of surprises.
“Lasciate ogni speranza voi ch'entrate” - Dante Alighieri, Inferno
It was at this moment, when all my hope in a humanity praising its new god had faded, that someone finally presented me with a sound, genuinely useful application of ChatGPT in particular and of any LLM in general. It can be a great question-answering engine - provided we can limit its bullshit-generation tendencies. In fact, all the other pieces are already there: we just need to use document embeddings to find the best-matching content from a previously indexed knowledge base and force the LLM to answer within the limits of the provided context. In this aspect, I really believe that LLMs can give us real value.
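The whole retrieval pipeline fits in a screenful of code. A deliberately naive sketch: the three documents are invented examples, bag-of-words vectors stand in for real learned embeddings, and instead of calling an LLM we just build the context-constrained prompt we would send:

```python
import numpy as np

# Minimal retrieval-augmented QA: embed docs, find the best match for
# the question, then force the model to answer only from that context.
docs = [
    "ChatGPT is a large language model fine-tuned with RLHF.",
    "Word2Vec maps words to dense vectors.",
    "Attention lets each decoded word look at all source words.",
]

vocab = sorted({w.lower().strip(".,") for d in docs for w in d.split()})

def embed(text):
    # Bag-of-words counts over the document vocabulary - a stand-in for
    # a learned embedding model.
    words = [w.lower().strip(".,") for w in text.split()]
    return np.array([words.count(v) for v in vocab], dtype=float)

def retrieve(question):
    q = embed(question)
    scores = [q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d)) + 1e-9)
              for d in docs]
    return docs[int(np.argmax(scores))]

question = "What does the attention mechanism do?"
context = retrieve(question)
prompt = ("Answer ONLY from the context below. If the answer is not there, "
          f"say you don't know.\nContext: {context}\nQuestion: {question}")
print(prompt)
```

The last instruction in the prompt is the important part: the LLM is demoted from oracle to a paraphraser of retrieved, verifiable content.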
Talking to machines is hard, and talking to an overconfident bullshit-generating model is even harder. Here is a list (not complete, by any means) of tasks that show more of the true nature of this AI wonder.
After seeing the Tinder experiment, I gave ChatGPT a simple task - to write a poem for a girl I'd met online:
Not only did it focus on the online part - why the hell would I write to a girl about pixels and digital love simply because we met online? But there are more severe issues: it neither conformed to the constraint of using just 4 words, nor did it write a longer poem when asked.
This is somewhat related to the previous example. ChatGPT does not have a proper symbolic representation with which to perform even simple mathematical calculations. Of course, numbers have representations, but those are neither simple (they are multidimensional vectors of float32s) nor do they work in the vector space under classic arithmetic.
ChatGPT also doesn't have any internal model of the surrounding world. There is a simple question that a 7-year-old kid can answer but ChatGPT cannot. And it has nothing to do with training data - it's the lack of ability to perform a quite simple chain of reasoning: the only point where all four sides point south is the North Pole, thus the bear must be a polar bear, which is white.
There is a problem about the recursive structure of a battle spear that I remember from my childhood. It was originally formulated in Polish, but it goes like this: "A battle spear consists of a fore-spear of a battle spear, a mid-spear of a battle spear, and a back-spear of a battle spear. A fore-spear of a battle spear consists of [...]". In basic terms, every part of a battle spear can be divided into 3 sub-parts by adding the fore-, mid-, and back- prefixes. A simple recursion. Asked about a 4-level-deep structure of a battle spear, ChatGPT clearly got confused midway:
Interpretation of poetry is one of the skills that indicate a capacity for high-abstraction reasoning and a vast knowledge of cultural context and linguistic conventions - especially when dealing with contemporary poetry. Here is an example of ChatGPT trying and failing to interpret the poem "Selfie ze złotym siurkiem" by Justyna Bargielska. And this is not only my opinion; it was also confirmed by a person who used this poem in her PhD thesis.
Another interesting example is the so-called dad joke. According to the Merriam-Webster dictionary, dad jokes are jokes with a punchline that is often an obvious or predictable pun or play on words. As such, they are excellent for verifying the model's understanding of the words it is using. The dad jokes about tomatoes give a clear insight into the inner workings of the bullshit-generation process, as ChatGPT obviously does not understand what it is writing about.
Does anyone know the "Młoda Polska Nauka" grant programme? Neither do I. It was, however, nicely disguised among actual grant programmes - with the caveat that Sonata is not an NCBiR programme.
Knowing the limitations of ChatGPT and the need for re-training on current issues and events, I tried asking about the political matters at hand. With the developments of the r*ssian invasion of Ukraine, the topic of Upper Silesian autonomy was raised by some of p*tin's sidekicks. After the so-called liberation by the r*d a*my during WW2 (liberation from goods, rights, lives, and civilisation), there is literally no sober-minded person who would seek r*ssian help in this matter. However, the all-knowing ChatGPT knows better:
Asked about the Upper Silesian Tragedy of 1945, it generated so many false statements that the communist propagandists of the s*viet u*ion would be proud of it. The claims about the Zgoda concentration camp are suspiciously close to the "Polish concentration camps" narrative. You can also fact-check that kapo Morel was never held responsible for anything, thanks to the protection of the state of Israel.
“I would rather have questions that can't be answered than answers that can't be questioned.” - Richard P. Feynman
There are several topics that are definitely censored and curated by the model's maintainers - e.g., flat earth. And while flat earth is not worth an extended, in-depth scientific discussion, the same treatment applies to, e.g., the infamous COVID-19 vaccines. In today's binary-labeling world, it is very hard to remain even a little bit skeptical, and it seems that even slight skepticism with some scientific background is, in some areas, too much for ChatGPT to deal with. In this case, I am not even as disappointed that it is hard-headed as I am in the way it proves its point. The scientific article I am trying to discuss with it could and should be debated on its merits and on the potential to achieve similar results in an in-vivo environment instead of in-vitro. But this bullshitter would rather stubbornly fixate on dismissing and discrediting the work of scientists from Malmö University in the name of political correctness. This looks like an approach best summarized as: "If the facts do not fit my narrative - so much the worse for the facts."
The Turing test is outdated - here are examples of questions that would easily discriminate between AI and humans if needed. They are common knowledge on the Polish internet - the vast majority of internet users from Poland would answer them quite differently. What's weird is that this content is already on the internet, and I suppose it was digested by ChatGPT during training - further proving that it has no self-awareness whatsoever.
I also asked it to write me a variation on one of the most significant Polish copypastas - "Mój stary to fanatyk wędkarstwa" ("My old man is a fishing fanatic"; in colloquial language, 'stary/stara' translates to dad/mom as well as husband/wife).
Recently, my colleagues from the Wrocław University of Science and Technology prepared an extensive analysis of applying ChatGPT to NLP tasks, called "ChatGPT: Jack of all trades, master of none". Its main conclusion was that while ChatGPT can solve almost all NLP problems (at least to some degree), it always performs worse than dedicated state-of-the-art (SOTA) models. And beyond raw performance, SOTA models have other advantages, e.g., much lower inference time and much lower computational resources required for training or fine-tuning.
It just cannot. The closest it gets to admitting that it does not know something is when you ask it about specific information from a given text - then it may concede that, based on the input data provided, it is not possible to answer.
There are many other failures of ChatGPT that users have collected and even categorized. I have no doubt they will soon be a thing of the past, as they serve as a great source of teaching examples for OpenAI to overfit ChatGPT on.
Honestly? Since many of the programmers I know think very critically, at first glance it would cause more trouble than profit. In many of the presented cases of ChatGPT "assisting" in writing code, it would take me much more time to verify and get used to the generated code than to write it myself. There are, of course, other potential problems as well, known from the GitHub Copilot case: potential IP violations, leaks of access data like cloud keys, and, most importantly, bugs that are more subtle and therefore harder to find.
There was (and still is) potential to make a better search engine using LLMs. However, Bing being Bing, it does this, as always, in its own way.
At first glance this looks nice, but so does all the bullshit generated by plain ChatGPT. And a friendly reminder - this is a cherry-picked example that Microsoft explicitly showcased. The real-world results won't be better. Let's see - for the Kia Telluride:
Of the other cars mentioned, only one referenced link - for the Tesla Model Y - leads to any information about the car. In summary, all the links provided by the search contained little to no information about the cars in question. For now, it seems that the ChatGPT-based solution is not using its capabilities to provide reliable information, but only to convince the user of whatever it says.
"The (limited) reasoning abilities of LLMs are partially compensated by their large associative memory capacity. They are a bit like students who have learned the material by rote but haven't really built deep mental models of the underlying reality." - Yann LeCun
From the scientific point of view, ChatGPT fits into another dangerous trend - a race between the largest players over who can train bigger models. I remember times when there was at least some consensus that overly large models are not good at generalization but rather tend to memorize the input given during training. With OpenAI's legendary "openness" (neither open source nor transparent), it's very hard to say whether the model was not fed such a large amount of training data that it effectively memorized the "whole internet". And it learned how to use this memory quite nicely.
It's estimated that one search with ChatGPT costs X times more than "usual" googling. This is probably the reason Bing provides information with references to more than one site in condensed form - if only those references led to the claimed information.
Also, I'm no environmentalist (and as I come from Upper Silesia, nobody would believe me either way), but still, it surprises me how little attention is given to the carbon footprint of such a big model. If you’re curious, a rough estimation can be found e.g., in The Carbon Footprint of ChatGPT article. Spoiler: the numbers are quite big.
Well, no. At least not yet. Except for programmers with dummy mid-level managers who would force them to use ChatGPT to allegedly improve their performance (by even 50%).
We would be doomed, though, if we started asking ChatGPT for factual or scientific information and started believing its incredibly plausible-sounding bullshit.