With a hammer in hand, I search for nails … this classic quip applies to the hype around large language models today. We all marvel at the output, make fun of it, search for applications for it, and ignore the alternatives. We can do better, because these models are not at all ready for use.
SwissCognitive Guest Blogger: Patrick Bangert, VP of AI – “Dangers of Tech Myopia – Redefining Business Performance with Generative AI”
Marvelling at the output
Wow, the text sounds so human. It can write a poem! ChatGPT wrote a book report for my English class. These are exclamations we have all heard many times in recent weeks. It is amazing that a computer program is able to write lucid text in a manner, and on topics, specified by the user. This truly is an achievement.
Artificial intelligence (AI) has a six-decade history of attempting something like this, starting with the famous ELIZA system of the 1960s. These are attempts at passing the Turing Test, in which a human judge converses with a partner and tries to determine whether that partner is a human or a computer. The challenge is to make the computer program so human-like that another human cannot tell the difference.
Have we achieved this? No.
Making fun of it
Alongside all the marvel, we find many examples of large language models (LLMs) producing text that is grammatically sound, lucid, and well written but factually wrong. Attempts to explain why elephant eggs are larger than chicken eggs, or assertions that Hillary Clinton was, in fact, the first female president of the USA, are some examples. This stems from the fact that these models do not know anything of the world.
These models work by discovering patterns in existing texts. As the corpus of training text grows, the number of parameters in the model grows, and the length of the patterns the model looks for grows; as a result, the model becomes better able to generate text with complex patterns. These three dimensions are what has given these language models the prefix “large.” The primary difference between the generations of the GPT family is the number of parameters in the model – effectively the model’s memory size.
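As a rough illustration of what “discovering patterns in text” means, consider the toy bigram model below. This is a deliberate simplification (real LLMs use neural networks over far longer contexts, and the corpus here is invented), but the principle is the same: predict the next word purely from frequencies observed in training text, with no notion of truth.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count word -> next-word frequencies across a list of sentences."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

def generate(follows, start, max_words=10):
    """Greedily extend `start` with the most frequently observed next word."""
    out = [start]
    for _ in range(max_words):
        if out[-1] not in follows:
            break  # no continuation ever seen for this word
        out.append(follows[out[-1]].most_common(1)[0][0])
    return " ".join(out)

corpus = [
    "the model writes fluent text",
    "the model writes plausible text",
    "the model knows nothing",
]
follows = train_bigrams(corpus)
print(generate(follows, "the"))
```

Scaling up the corpus, the context length (bigrams to much longer windows), and the parameter count is, loosely speaking, what the “large” in LLM refers to.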
Notably, what the model does not do is discover concepts. The model does not understand the concept of a “chair” or a “table,” but it can use these words in complex sentences. That is ultimately why these models produce eloquent but incorrect sentences. For instance, the name “Hillary Clinton” and the phrase “first female president” have frequently been used in the same context. The somewhat subtle causal connection between these two phrases is, however, lost on an LLM and so can lead to wrong assertions.
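The mechanism can be sketched with a simple co-occurrence count (the sentences below are invented for illustration): phrases that frequently appear together become strongly associated in the statistics, regardless of the causal relation between them.

```python
from collections import Counter

# Invented sentences about a candidacy: the two phrases co-occur often,
# even though none of the sentences asserts the false claim.
corpus = [
    "hillary clinton hoped to become the first female president",
    "could hillary clinton be the first female president",
    "the first female president has not yet been elected",
]

pairs = Counter()
for sentence in corpus:
    if "hillary clinton" in sentence and "first female president" in sentence:
        pairs[("hillary clinton", "first female president")] += 1

# The pair co-occurs in 2 of 3 sentences: a strong statistical
# association, with no representation of who actually won.
print(pairs)
```

A pattern-based model picks up the association; the negation, the conditional mood, and the causal structure that distinguish “hoped to become” from “was” are exactly the kind of conceptual information it does not represent.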
Thus: be very careful before believing any factual statement made by an LLM, no matter how good it sounds.
Searching for applications
Given that LLMs produce high-quality text of questionable factual correctness, what can we realistically use them for? My personal opinion: not very much, and many of those use cases are harmful. We can use them to cheat at school and university, to create misinformation and propaganda at scale, and to produce poems, novels, and visuals without human inspiration. All of this is done at the press of a button and can be time-consuming to detect.
Legitimate uses might include personalized marketing and sales, better chatbot experiences on retail websites or hotlines, and other interactions where factual accuracy does not matter much. However, factual accuracy nearly always matters to some degree. For instance, when asking about gas mileage on a car maker’s website, we do expect a numerical and correct answer. Can we rely on an LLM to provide it? Probably not.
Ignoring the alternatives
Whenever we evaluate new technologies in the software realm, we have to look at them from a financial angle. LLMs can be extremely lucrative for their makers, so it is not in their interest to be cautious and slow, even if the model is not ready to meet the world. In addition, there is a winner-takes-most attitude in technology, where the predominant incarnation attracts virtually all the attention and money; there will be a second and a third, but they will be distant.
What this means is that companies like Aleph Alpha – whose LLM is at least as good as ChatGPT but more compact – are virtually ignored by the community and the press. It also means that research into better architectures is ignored as we chase our own tails, trying to make money and trying to convince ourselves that this immature technology is good enough.
In order to be factually accurate, models will need to learn concepts. To do that, they must go beyond merely looking at patterns in text and acquire some facility for logical reasoning, harking back to the AI of the 1980s, when logic reigned. Combining neural networks and logical reasoning into one coherent framework is the next big evolutionary step.
As we draw a young child’s attention to something and call it a “tree,” the child miraculously learns a concept and is able to correctly call many other things trees. This kind of one-shot concept learning, combined with reasoning such as knowing that one might climb a tree but then risk falling and feeling pain, is precisely the ability that current language models lack, and its absence is what keeps them from being accurate. Until that changes, the Turing Test will continue to be failed and these models will have few real use cases.
In promoting this unrealistic hype, technology companies are doing the AI community a disservice at the altar of money.
Patrick will be speaking at the SwissCognitive World-Leading AI Network AI Conference focused on Redefining Business Performance with Generative AI on 28th March on this topic.