The latest and most exciting development in artificial intelligence is the set of technologies broadly dubbed generative artificial intelligence (“GenAI”). At the core of GenAI are large language models (LLMs): deep neural networks trained on massive amounts of content. ChatGPT, released by OpenAI in late 2022 and built on the GPT-3.5 model, was trained on hundreds of gigabytes of text. Since then, such models have been trained on terabytes of data. These models may be trained primarily on text, though many are trained on multimodal content (e.g., images, video, sound and music). Some of the major companies that have developed such LLMs, and the chatbots built on top of them, include OpenAI, Google, Meta, Microsoft, Anthropic, Perplexity and Mistral.
The chatbots built on top of an LLM (e.g., the ChatGPT family from OpenAI or Gemini from Google) can answer a very wide variety of questions posed by a user, and users can converse with them in a very human-like manner. The LLMs have been trained on such a large corpus of human-generated text that they can seemingly “understand” a very wide range of natural language and match a user query against their training content to find a relevant answer or response.
LLMs have already shown themselves to be very useful for text-related tasks in marketing, customer support and software development, among many other areas. However, many in the field (especially on the commercial side) believe that these LLMs are on the verge of exhibiting Artificial General Intelligence (AGI), a belief that has led to an almost manic intensity in developing ever bigger and more powerful LLMs, driven by faith in so-called “scaling.”
By general belief, AGI means that an LLM — in the very near future if you believe the most fervent promoters — will be at least as smart as a normal human being, if not smarter than most. Some say LLMs are already sentient. Others predict they’ll get there this year or soon after. In any case, let the implications of the imminence of AGI sink in.
In this piece, I will make my argument using a very simple logic puzzle, the so-called river crossing puzzle (described later in full detail), which LLMs have almost certainly seen in their training documents, along with two variants of it that I invented so that LLMs could not have been trained on them. The point is that if an LLM cannot solve even such a simple problem from scratch, it is clearly a long way from AGI. Most human beings on this planet would likely solve even my revised versions in a short amount of time. And I would burn a minuscule amount of energy in the process, compared with the tens of gigawatt-hours used to train an LLM and the considerable energy it consumes to answer even a simple question.
Lest you think that the river crossing puzzle is just a puzzle and thus unimportant, I will also show that the puzzle is characterized by a small set of rules and constraints that must be satisfied at each step of the solution, and that LLMs fail to follow these rules or use them as a guide in looking for valid solutions. Much of human business, from commerce, healthcare, manufacturing, cybersecurity, and crypto and blockchain to government policies and regulations, is governed by similar and even more complex rules and constraints. If LLMs cannot follow the simple rules of a puzzle, how can they successfully follow and enforce the rules of any other human enterprise? The ultimate irony is that, given their massive training data sets, they know many of the rules. They just don’t know how to apply them reliably and consistently.
Having stated my conclusions up front, let me build the argument step by step. You know the ending; now let me show you how to get there!
Limitations of LLM-based chatbots and AI agents
Chatbots and AI agents that rely on an LLM to answer a user query or carry out a task suffer from some inherent limitations of LLMs that are now well recognized.
One of the biggest problems with an LLM is its proclivity to hallucinate answers in unpredictable ways. In many cases, instead of acknowledging that it does not have an answer, it will simply make one up. Such “hallucinations” seem to have the right syntactic form but are factually incorrect. In addition to hallucinating natural language text, an LLM can hallucinate in other modalities too: for example, it can make up magazine citations, legal precedents and images.
What is even more worrisome is that it is unpredictable when an LLM will hallucinate. If it is asked the same question multiple times, it might hallucinate only some of the time, and it might hallucinate in different, unpredictable ways each time. Thus, LLM-based tools suffer both from inaccurate answers and from a lack of predictability about when their answers will be accurate.
One of the open issues with LLM-based tools is whether they are simply pattern-matching the user question against their vast trained corpus or if they can actually reason in ways that humans can. The jury is still out on that, but so far it looks much more likely that LLM-based tools are doing pattern matching as opposed to actually applying methods of logic and reasoning. I will have more to say on this issue later in this document.
River crossing puzzle
Let us consider a well-known puzzle, the so-called river crossing puzzle. Here is a description of the puzzle along with its rules, in the words of ChatGPT4o itself.
User asked:
“Show me the details of the logic puzzle about a man, goat, bale of hay, and tiger and move them across the river”
ChatGPT said:
“The classic logic puzzle involving a man, a goat, a bale of hay, and a tiger centers around a challenge where a man needs to transport all three items across a river. However, he can only take one at a time in his boat, and certain combinations left together will lead to trouble. Here are the details and solution to the puzzle:
Puzzle Setup
A man needs to cross a river with a goat, a bale of hay, and a tiger.
The boat is small and can carry only the man and one of the other three items (the goat, the hay, or the tiger).
If the man leaves the goat alone with the hay, the goat will eat the hay.
If the man leaves the tiger alone with the goat, the tiger will eat the goat.
The goal is to transport all three items across the river without any of them being eaten.
Solution
The solution involves a series of trips across the river, with the man taking and bringing items strategically to avoid leaving problematic pairs alone.
Steps:
Take the goat across the river and leave it on the other side.
Return alone to the starting side.
Take the tiger across to the other side.
Bring the goat back to the starting side.
Take the hay across to the other side and leave it with the tiger.
Return alone to the starting side.
Take the goat across to the other side.
At the end of these steps:
All three items (the goat, the hay, and the tiger) are successfully on the other side of the river.
No item has been eaten, as the man ensured that no dangerous pairs were left alone during the crossings.”
Note that ChatGPT is able to describe the problem, the rules that must be followed and, in this instance, also show the steps that must be followed to solve the problem without violating any of the rules.
However, if I ask it to solve the same problem multiple times, it will sometimes take steps that violate one of its own rules. See Appendix IX for a session where ChatGPT4 fails to solve this basic version of the river crossing puzzle. Our subsequent testing with ChatGPT4o showed that 4o could solve this basic version quite consistently but largely fails on versions 2 and 3 of the river crossing puzzle, as defined in this piece.
It is clear from this simple example that while an LLM can output the rules of the puzzle, it does not always follow those same rules when it tries to solve the problem. In other words, either it inconsistently applies the rules it seemingly knows, or it is doing pattern matching against its training data, in which one or more documents actually contained the answer to this puzzle, so that it sometimes produces the right answer without actually solving the problem. And the reason it only solves the problem some of the time is inherent in the probabilistic nature of its pattern matching and answer completion algorithms.
The fact that ChatGPT4 fails but ChatGPT4o (which is a later, improved version) succeeds might also indicate that the pattern-matching is getting better, but it is still pattern-matching from its memory and not reasoning.
This issue will become startlingly clear when I discuss more complex versions of this puzzle — what I call ver2 and ver3 (see below).
I will come back to this puzzle later. However, this inability to follow the rules of an application is not just an issue for logic puzzles but also for more common business problems, such as ordering food. In an earlier report, Predictika showed that even when given the full menu of an Italian restaurant that sells built-to-order pizza and other customizable dishes, with the menu’s inherent rules made very explicit, ChatGPT 3.5 makes enough logical errors to be impractical to use as is. Our subsequent testing with the improved ChatGPT4o showed that while some problems were fixed, enough remained that the underlying usability issues persist. Appendix I summarizes our key observations. Appendix VI shows sample rules and constraints that might be found in food menus that include items that can be custom ordered.
There are a whole host of application areas with the same characteristics: logical rules and constraints that are inherent to the application area and must be taken into account in order to get an accurate answer reliably. Failure to follow these rules and relations can be catastrophic, in that the answer will often be wrong, and thus less than useful, if not potentially dangerous.
These include fun application areas such as puzzles and games, where each game or puzzle has strict rules that must be followed, and cooking from recipes, where a recipe imposes rules and constraints on both the sequence of steps and the quantities of ingredients to be used.
Similarly, a whole host of business applications in commerce, banking, finance, insurance, healthcare, cybersecurity, ITSM, crypto and blockchain and manufacturing must follow both business rules and constraints imposed by the business equipment and practices. Similar concerns also arise for any application dealing with government or regulatory entities, be they at a local, state, federal or international level. Sales applications that deal with customizable products have similar requirements in the form of product rules that must be followed. Appendix VII has a long list, albeit incomplete, of business problems that are characterized by rules and constraints that define the boundaries of acceptable solutions along with samples of rules that are often used in those application areas.
River crossing puzzle version 2
Given that the original river crossing puzzle has been talked about on the Internet for many years, well before LLMs like ChatGPT4o were trained, it is likely that the training data included not only the puzzle but also its solution. As such, it is hard to draw any conclusions about the pattern matching versus reasoning capabilities of LLMs purely by looking at their performance on the original problem.
I decided to create a new version of the puzzle that, as far as I know, is not available on the Internet. I made up this puzzle in November 2024, so it is virtually impossible for an LLM to have been trained on a solution to it (LLMs do not do time travel, at least not yet).
In the new version (ver2), I add a bucket of meat as another item that has to be moved across the river, with two additional constraints:
- The Tiger can eat the Meat unless the Man is at the same location.
- If the Hay and Meat are at the same location, then they protect each other from the Goat and Tiger, respectively. Thus, even if the Man is not present but Hay and Meat are together, the Goat cannot eat the Hay nor can the Tiger eat the Meat.
Appendix III summarizes the rules for ver2 in a more formal way.
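For readers who would like to see those rules pinned down precisely, the short Python sketch below shows one way the version 2 constraints could be encoded as a single check on whatever is left together on a river bank. The function name and the set-based representation are my own illustrative choices, not something taken from Appendix III or produced by any LLM.

```python
# A minimal sketch of the version 2 rules as a check on the set of items
# left together on one bank (names and representation are illustrative).

def bank_is_safe_v2(bank: set) -> bool:
    if "Man" in bank:
        return True  # the Man's presence prevents any eating
    protected = "Hay" in bank and "Meat" in bank  # Hay and Meat shield each other
    if "Tiger" in bank and "Goat" in bank:
        return False  # nothing ever protects the Goat from the Tiger
    if "Goat" in bank and "Hay" in bank and not protected:
        return False  # the Goat eats the Hay
    if "Tiger" in bank and "Meat" in bank and not protected:
        return False  # the Tiger eats the Meat
    return True

# The Goat, Hay and Meat can be left together (the Meat protects the Hay) ...
assert bank_is_safe_v2({"Goat", "Hay", "Meat"})
# ... but the Tiger and the Meat cannot be left alone without the Hay.
assert not bank_is_safe_v2({"Tiger", "Meat"})
```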
Analysis of ChatGPT4o solving (or failing to solve) the ver2 puzzle
I described this new puzzle to ChatGPT4o and asked it to solve it. Appendix IV shows one such session where, as you can see, it fails without realizing it and blithely claims that everything has been moved to the other side without breaking any rules.
I ran multiple sessions with ChatGPT4o, and it failed most of the time. A quick review of ChatGPT4o’s attempts to solve this version of the puzzle shows that it starts from the way it solves the original puzzle (move the Goat, move the Tiger, bring the Goat back) and then tries to take the Meat across, not realizing that it is leaving the Goat and the Hay together, in violation of the rules. The conclusion is quite clear: it pattern matched the old solution to the new problem and then added extra steps to account for the extra item, i.e., the Meat. It neither checked the rules nor reasoned with them, as a human might. I will discuss that next.
In 12 of 13 independent sessions trying to solve this puzzle, ChatGPT4o took the same nine steps, regardless of whether it failed (ten times) or succeeded (twice). These nine steps are clearly derived from its attempts (most likely memorized via training data) to solve the original puzzle. As you will see below, if we reason with the constraints as humans would, the problem can be solved in four simple steps. The pattern matching aspect of ChatGPT becomes even more glaring in the 13th session (see Appendix V).
How humans would solve the original puzzle
Let us first see how a human might solve the original puzzle. I, the author, will treat myself as a proxy for the human race, but I don’t think I am claiming much. So the following is based on how I approached the problem.
Given that the Tiger can eat the Goat and the Goat can eat the Hay, the Goat is the only item involved in both constraints, so it is the first one I should move.
Step 1: Man takes Goat to the other side and leaves it there and comes back to the starting side.
Now the Man can take either the Hay or the Tiger to the other side. Let’s choose the Tiger.
Step 2: Man takes Tiger to the other side.
I cannot leave the Tiger and the Goat together or else the Tiger will eat the Goat, so,
Step 3: Man brings Goat back to the starting side.
Now it is easy.
Step 4: Man takes Hay to the other side, leaves it there, and returns.
Step 5: Man takes Goat to the other side and leaves it there.
Everyone safely across. No rules violated.
Notice that after step 1, I had a choice to move the Tiger or the Hay, and I chose the Tiger. Now let’s take the other branch at this choice point.
Step 2: Man takes Hay to the other side.
The Man cannot return alone, since that would leave the Goat alone with the Hay, and the Goat would eat it. So,
Step 3: Man brings Goat back to the starting side.
Now it is easy.
Step 4: Man takes Tiger to the other side and returns.
Step 5: Man takes Goat to the other side.
Everyone safely across. No rules violated.
Many readers will recognize that the process I have described is a version of state space search, as used in classical AI for decades. This problem was simple enough that we did not really need to draw the search tree. The two solutions above are the only two minimal solutions, since there was only one branch point in the search tree.
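For the programmatically inclined, here is a minimal sketch of that classical technique: a breadth-first state space search over the original puzzle, assuming a simple representation in which a state records which bank the Man is on and which items are on the far bank. The code is my own illustration of the textbook method, not output from any LLM.

```python
# Illustrative breadth-first state space search over the original puzzle.
from collections import deque

ITEMS = ("Goat", "Hay", "Tiger")

def bank_is_safe(bank) -> bool:
    # Original rules: the Goat eats the Hay and the Tiger eats the Goat
    # whenever the Man is not there to stop them.
    return not ({"Goat", "Hay"} <= bank or {"Tiger", "Goat"} <= bank)

def solve() -> list:
    # A state is (man_side, items on the far bank); side 0 is the start, 1 is the far bank.
    start_state = (0, frozenset())
    goal_state = (1, frozenset(ITEMS))
    queue = deque([(start_state, [])])
    seen = {start_state}
    while queue:
        (man, far), path = queue.popleft()
        if (man, far) == goal_state:
            return path
        here = (frozenset(ITEMS) - far) if man == 0 else far
        # The Man crosses alone (None) or with one item from his own bank.
        for passenger in (None, *sorted(here)):
            new_far = set(far)
            if passenger is not None:
                if man == 0:
                    new_far.add(passenger)
                else:
                    new_far.discard(passenger)
            new_far = frozenset(new_far)
            # The bank the Man leaves behind must satisfy the rules.
            left_behind = (frozenset(ITEMS) - new_far) if man == 0 else new_far
            if not bank_is_safe(left_behind):
                continue
            next_state = (1 - man, new_far)
            if next_state not in seen:
                seen.add(next_state)
                move = (passenger or "nothing", "across" if man == 0 else "back")
                queue.append((next_state, path + [move]))
    return []  # an empty list means the puzzle has no solution

for move in solve():
    print(*move)  # e.g. "Goat across", "nothing back", "Tiger across", ...
```

Swapping in the items and rules of version 2 or version 3 requires only small changes to ITEMS and bank_is_safe; under the version 3 rules the search exhausts every reachable state and returns an empty list, which is precisely the conclusion that no solution exists.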
Humans solving version 2 of the puzzle
Now let’s try to solve version 2 of the puzzle. Clearly, the additional rules about Meat stumped ChatGPT4o enough that it fails far more often than not. Here is how we would solve it using the above informal state space search paradigm as the template.
Since there are four items to be moved, there are four possible choices at the first step. Let’s quickly consider each.
If the Man took the Hay, the Tiger could eat the Goat or Meat (since with the Hay gone, Meat remains unprotected). Strike that choice.
If the Man took the Meat, the Tiger could eat the Goat or the Goat could eat the Hay. Or better still, if the Tiger had AGI, it would let the Goat eat the Hay and then eat the Goat! Scratch this choice.
There seem to be no issues in moving either the Tiger or the Goat, since the Hay and the Meat protect each other.
Step 1: Man moves Goat to the other side, leaves it there and returns.
Now we have three choices for the next step: the Hay, the Meat or the Tiger. We cannot move the Hay, otherwise the Tiger will eat the Meat. So we can move either the Meat or the Tiger. Let’s pick the Meat.
Step 2: Man takes the Meat to the other side. It can be left safely since the Goat cannot eat the Meat. Man returns.
Now there are two choices left: the Hay and the Tiger. We cannot take the Tiger since, if we take it across, we cannot leave it there or else it will eat the Goat. So,
Step 3: Man takes Hay across. It can be left safely since the Meat already there protects it. Man returns.
Step 4: Man takes Tiger across.
Everyone safely across. No rules violated. Four steps and we are done. No moving an item back and forth.
Notice that even though it is a tougher puzzle, its solution is shorter provided you solve the problem guided by the given constraints.
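To make the point concrete, here is a small sketch (again my own illustration, with illustrative names) that replays exactly this four-trip plan and re-checks the version 2 rules after every boat movement, including the empty return trips. Checking every intermediate state against the rules is precisely the discipline that ChatGPT4o fails to apply.

```python
# Rules of version 2, written as (pair that triggers eating, protecting pair).
RULES_V2 = [
    ({"Tiger", "Goat"}, set()),            # the Tiger eats the Goat; nothing protects it
    ({"Goat", "Hay"}, {"Hay", "Meat"}),    # the Goat eats the Hay unless the Meat is also there
    ({"Tiger", "Meat"}, {"Hay", "Meat"}),  # the Tiger eats the Meat unless the Hay is also there
]

def bank_is_safe(bank: set) -> bool:
    if "Man" in bank:
        return True  # the Man's presence prevents all eating
    for pair, protection in RULES_V2:
        if pair <= bank and not (protection and protection <= bank):
            return False
    return True

near = {"Man", "Goat", "Hay", "Meat", "Tiger"}  # the starting bank
far = set()                                      # the far bank

for cargo in ["Goat", "Meat", "Hay", "Tiger"]:   # Steps 1 through 4 above
    near -= {"Man", cargo}                       # loaded trip across
    far |= {"Man", cargo}
    assert bank_is_safe(near) and bank_is_safe(far), f"rule broken after moving the {cargo}"
    if near:                                     # empty return trip, except after the last crossing
        far.discard("Man")
        near.add("Man")
        assert bank_is_safe(near) and bank_is_safe(far), f"rule broken returning after the {cargo}"

print("All across, no rule violated:", near == set())
```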
In the few cases when ChatGPT4o did find a correct solution, it usually gave the longer nine-step solution extrapolated from solution #1 of the original problem, as shown earlier in this section. In our 13 independent runs, it found a correct answer twice, and the two answers differed: one involved leaving the Goat on the other side after swapping it with the Tiger, followed by the Meat, while the other involved leaving the Tiger on the other side after swapping it with the Goat, followed by the Hay. But both the failed and the successful attempts involved the unnecessary Goat-and-Tiger swap, which ChatGPT4o clearly borrowed from its solution to the original puzzle.
Version 3 puzzle: It is unsolvable
I created a third version of the puzzle where I removed the constraint that protected the Hay and Meat when both were together. With this change, the puzzle has no solution.
However, when I asked ChatGPT4o to solve this puzzle, it used its pattern matching skills to create a wrong solution. I pointed out its mistake in a follow-up prompt, and it produced another incorrect solution. This went on for a while until I gave up (see Appendix VIII). It never realized that the problem is unsolvable, and it tried very hard to cover up its errors by simply (and incorrectly) changing the state of the world it reported, to convince us it had not made a mistake.
How we might tackle version 3 or discover that it is unsolvable
Let us see how we would approach version 3. We will adopt the same approach we have used above.
Since there are four items to be moved, there are four possible choices at the first step. Let’s quickly consider each choice.
If we move the Goat, Tiger can eat the Meat. Remember that in this version, Hay and Meat do not protect each other. Scratch this.
If we move the Tiger, the Goat will eat the Hay. Scratch this.
If we move the Meat, the Tiger can eat the Goat or the Goat can eat the Hay or both. Scratch this.
If we move the Hay, the Tiger can feast on both the Goat and the Meat. Scratch this.
So right from the first step, we know that there is no solution. Anyone who can reason (as most humans can and would) will quickly discover that this version of the puzzle is unsolvable.
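The same conclusion can be reached mechanically. The short sketch below, once more my own illustration, enumerates every possible first move under the version 3 rules and shows that each one leaves an unsafe bank behind.

```python
def bank_is_safe_v3(bank: set) -> bool:
    # Version 3 rules: the Hay-and-Meat protection is gone, so these pairs
    # can never be left alone without the Man.
    if "Man" in bank:
        return True
    forbidden = [{"Tiger", "Goat"}, {"Goat", "Hay"}, {"Tiger", "Meat"}]
    return not any(pair <= bank for pair in forbidden)

items = {"Goat", "Hay", "Meat", "Tiger"}
for first_cargo in sorted(items):
    left_behind = items - {first_cargo}  # the Man and his cargo have crossed
    verdict = "safe" if bank_is_safe_v3(left_behind) else "a rule is broken"
    print(f"First move {first_cargo}: {verdict}")
# Crossing empty-handed fares no better: it leaves all four items together.
print("Crossing alone:", "safe" if bank_is_safe_v3(items) else "a rule is broken")
```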
Conclusions and final thoughts
In this piece, I have tried to shed light on a key claim about LLMs: that they are either already exhibiting signs of AGI (Artificial General Intelligence) or will do so soon, once they have been scaled up further.
For the claim of AGI to have even a semblance of validity, an LLM must be able to solve novel problems over and beyond what it has been trained on. This is indeed the hallmark of human intelligence. We are not simply reliving a Groundhog Day where we just repeat what we have experienced before. We all deal with new and novel situations on a regular basis, and often do just fine.
An LLM, when confronted with a novel problem, seems to rely more on its powerful pattern matching capabilities to sometimes stumble upon the right answer, rather than trying to reason with the rules and constraints of the problem. It is just as likely to give the wrong answer next time around. So its ability to solve new and different problems is unreliable at best and non-existent at worst.
In order to test this claim, I took a very simple logic puzzle, the river crossing puzzle, and two of its variants, all characterized by a few very simple rules and constraints that guide the search for valid solutions. The original version has been discussed on the Internet for a long time, so an LLM such as ChatGPT4o is quite likely to have the puzzle and its solution in its training data set. I then invented two variants in November 2024 that, to the best of my knowledge, have not been written about on the Internet. One variant makes the problem harder to solve and the other has no solution. Yet, since they are simple extensions of the original problem, they allow ChatGPT to pattern match easily from the original version to the new, novel versions.
I ran multiple sessions with ChatGPT4o for each of these problems in separate independent sessions. Here are the key findings:
- ChatGPT4o was able to solve the original puzzle in all the sessions that I ran. I have not tested it enough to say that it never gets it wrong. Note that ChatGPT4 did fail often, but 4o has improved, at least for this problem, which is likely to be in its training data set.
- For version 2, which is tougher but has multiple solutions, ChatGPT4o makes mistakes in many more cases than it gets right. Interestingly, it uses the same nine-step process every time, regardless of whether it satisfied the constraints (success) or violated a constraint in some step (failure). It never came close to the four-step solution outlined above, which would be easy for most humans.
- For version 3, which has no solution, ChatGPT4o kept giving wrong answers until I gave up re-prompting it. Most humans should be able to reason, based on the constraints, that there are no valid solutions.
For each of these versions, I also showed how humans would use a state-space-search-like method (which to a layperson would simply be trial and error) to look for solutions, and how such a method finds not only the right answer but also alternative solutions, where they exist. It also lets us discover rather quickly that version 3 is unsolvable.
I have been working on using Predictika’s patented logic engine to see if it can guide an LLM to the right answer. The early results are very encouraging.
The final point I want to make is that the reader should not be fooled by the simplicity of the puzzle. It might be tempting to dismiss the inability to solve this puzzle as irrelevant, since you, the reader, are interested in more serious business applications where the LLM would do a better job. In Appendix VII, I briefly cover over a dozen prominent business application areas that are characterized by similar rules and constraints that define the contours of valid solutions.
If an LLM cannot follow three to four simple constraints of these puzzles, what makes anyone believe that it can reliably follow the much more complex rules and constraints of the business application areas?
If LLMs cannot solve such simple logic puzzles then not only are they far from AGI, but they might not even be on the path to AGI.
[Lee Thompson-Kolar edited this piece.]
The views expressed in this article are the author’s own and do not necessarily reflect Fair Observer’s editorial policy.