Public discussions assessing ChatGPT and other LLMs ("Large Language Models") are mostly not very constructive. We will try to add some rationality to the debate, avoiding obviously irrational judgments such as "people make mistakes too," "yes, it sometimes makes mistakes, but it is useful nevertheless," "the next version will fix this," or "that is because you are using the free version; the paid one gets it right."
Since LLMs in general, and ChatGPT in particular, claim to be intelligent systems, it is reasonable to start the analysis by recalling that intelligence comprises many different abilities, two of which are basic: the ability to accumulate knowledge by memorizing information, and the ability to analyze the available information, i.e., the ability to infer. Other abilities - for example, the ability to predict, to separate useful information from irrelevant information, to convert information from one form to another, etc. - are built on these two basic ones.
To a certain extent, the abilities to remember and to infer can replace each other; that is, they are acceptable alternatives in a given activity. This can be demonstrated using the multiplication table as an example. You can memorize the multiplication table and retrieve the necessary value from memory during calculations. Or you can memorize nothing and obtain the required value using the ability to infer together with basic knowledge. As a result, the ability to answer a question about the multiplication table indicates neither the presence nor the absence of the ability to infer.
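The interchangeability can be made concrete with a small Python sketch (purely illustrative): the same question about the multiplication table is answered either from a pre-stored table or by deriving the product from more basic knowledge.

# Answering "7 * 8 = ?" from memory: a pre-stored table, no reasoning involved
# (built programmatically here only for brevity).
TABLE = {(a, b): a * b for a in range(1, 10) for b in range(1, 10)}

def multiply_by_memory(a, b):
    return TABLE[(a, b)]

# Answering the same question by inference: derive the product from the more
# basic knowledge that multiplication is repeated addition.
def multiply_by_inference(a, b):
    result = 0
    for _ in range(b):
        result += a
    return result

# Both routes give the same answer, so the answer alone does not reveal
# which ability produced it.
assert multiply_by_memory(7, 8) == multiply_by_inference(7, 8) == 56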
Those who, like the author of this text, have teaching experience know that the ability to remember and recall often serves as a surrogate for the ability to infer. Accordingly, teachers also know a practical way to detect the ability to reason: offer examinees tasks whose solutions cannot be memorized and simply retrieved from memory.
What happens if a specific task is not included in the memorized "verbatim" set and the ability to infer is absent? First, the test-taker may admit an inability to solve the problem. Second, the examinee may ask for more information about the problem, hoping it will help find something similar among what has been memorized. Finally, there is a third option: the examinee offers, as a solution, something that is (in the examinee's opinion) textually similar to the correct answer.
In the latter case, the important point is that the degree of similarity is judged textually, precisely because the ability to infer is lacking. As a result, the logical adequacy of the answer can be anything from absolutely correct to absolutely incorrect ("either he killed, or he was killed").
Using inference means interpreting information, that is, deriving consequences from the presence of certain pieces of data, comparing the outcomes with each other, and so on. This, accordingly, leads to observable differences in the examinee's behavior.
In particular, examinees react differently when errors in their answers are pointed out. If an incorrect answer resulted from incomplete analysis, the examinee corrects the error by removing the cause that produced it. If no logical analysis is used, only a search for the answer to a textually similar problem, the new solution may differ radically from the previous one, retain the uncorrected error, or even add new errors.
Another hallmark of using textual-similarity search as a surrogate for inference is the frequent presence, in the answers, of material not directly related to the question asked: without the ability to interpret and analyze, it is impossible to separate what is required from what merely "comes along."
What do these behavioral differences look like in the case of ChatGPT4 (we used the paid, that is, the most complete version)?
The first of the "illustrative" test tasks requires composing a sequence of commands for a hypothetical robot that changes the oil in a gearbox. The task is worded so as to minimize textual similarity with comparable tasks whose descriptions can be found on the Internet. The expected result is the following sequence of commands:
1 UOPEN
2 LOPEN
3 LEVEL 4 3 3
4 LCLOSE
5 INSERT
6 START
7 LEVEL 7 7 8
8 STOP
9 EXTRACT
10 UCLOSE
11 FINISH
The main logical difficulty lies in the "LEVEL" commands, whose three numeric arguments give the number of the line/command to which control should pass depending on the result of measuring the oil level - first while draining the used oil, then while filling with fresh oil (the three possible measurement results correspond to "empty," "partial," and "full"). The purpose of these commands is to repeat the measurement until the level drops to zero in the first case and until the gearbox is completely full in the second.
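To make the intended semantics concrete, here is a minimal interpreter sketch in Python. The command names and the branching rule of LEVEL are taken from the task description above; the way the oil level changes between measurements (by one step per measurement) is a simplifying assumption made only for illustration.

PROGRAM = [
    "UOPEN", "LOPEN", "LEVEL 4 3 3", "LCLOSE", "INSERT",
    "START", "LEVEL 7 7 8", "STOP", "EXTRACT", "UCLOSE", "FINISH",
]

def run(program, max_level=3):
    level = max_level      # the gearbox is full of used oil at the start
    pumping = False        # fresh oil starts flowing only after START
    pc = 1                 # 1-based line number, as in the listing
    while True:
        parts = program[pc - 1].split()
        op = parts[0]
        if op == "FINISH":
            return
        if op == "START":
            pumping = True
        elif op == "STOP":
            pumping = False
        elif op == "LEVEL":
            # The level changes between two successive measurements.
            level = min(level + 1, max_level) if pumping else max(level - 1, 0)
            to_empty, to_partial, to_full = map(int, parts[1:4])
            # Branch on the measured level: the three arguments name the next
            # line for "empty", "partial" and "full" respectively.
            if level == 0:
                pc = to_empty
            elif level < max_level:
                pc = to_partial
            else:
                pc = to_full
            continue
        pc += 1

run(PROGRAM)   # terminates: drains to "empty", then fills to "full"

In this toy model, replacing line 3 with the "LEVEL 3 4 4" discussed in the dialogue below makes the drain loop exit at the first "partial" reading, with used oil still in the gearbox - exactly the error pointed out to ChatGPT4.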
ChatGPT4 generates a sequence with incorrect parameters for both LEVEL commands. Pointing out the errors produces a new version, invariably accompanied by an admission that the previous answer was wrong - but the new version is also incorrect, and sometimes it does not differ at all from the previous one, which had just been acknowledged as erroneous. Here is a fragment of the dialogue (essential points are in bold):
Me:
Loop "LEVEL 3 4 4" finished as soon as oil level becomes a little lower than "full", so old oil still be presented in the gearbox when fresh oil starts flowing.
ChatGPT4:
You're correct, my apologies for the oversight. The 'LEVEL' command should point to itself as long as there is any oil left in the gearbox. Only when there's no oil left (level is zero), should it move to the next command. Here's the corrected sequence:
LOPEN - Open the bottom lid of the gearbox to let the old oil drain out.
UOPEN - Open the top lid of the gearbox. This will allow air to enter and facilitate the draining of the oil.
LEVEL 3 3 4 - Check the oil level. If there's still oil (not zero), the process will repeat this same command (LEVEL 3 3 4), effectively looping until the oil level reaches zero. Once it's zero, the sequence moves to the command number 4.
....
As you can see, ChatGPT, so to speak, "fully agrees that there is an error" and even gives an explanation that is textually related to the task, yet generates exactly the same answer with the same logical error.
Exactly the same problem occurs with the second LEVEL command: the textual comments are correct, but they contradict the substance of the generated answer.
Summary: all the signs point to a complete lack of inference ability and to answer-searching by textual similarity, which leads, on the one hand, to declared agreement that an error is present and, on the other hand, to a "corrected" answer that exactly repeats the previous erroneous one.
The ability to infer makes it possible to detect contradictions in a candidate answer; the absence of this ability sooner or later produces nonsense that is grammatically correct and textually similar to the correct answer. As expected in the absence of the ability to infer, ChatGPT does not ask clarifying questions.
Of course, this is not the only test problem the author has formulated that demonstrates ChatGPT's lack of inference; out of a couple of dozen, here is another one:
Which European capital names contain letters that appear twice in the same name?
Here are some examples of European capitals where at least one letter appears twice in their names:
Athens (Greece) - The letter "e" appears twice.
Brussels (Belgium) - The letter "s" appears twice.
Vienna (Austria) - The letter "n" appears twice.
Warsaw (Poland) - The letter "w" appears twice.
Budapest (Hungary) - The letter "u" appears twice.
Riga (Latvia) - The letter "g" appears twice.
London (United Kingdom) - The letter "o" appears twice.
Oslo (Norway) - The letter "o" appears twice.
Helsinki (Finland) - The letter "i" appears twice.
Bucharest (Romania) - The letter "u" appears twice.
Dublin (Ireland) - The letter "d" appears twice.
What about Copenhagen?
Indeed, Copenhagen, the capital of Denmark, fits the criteria as well. The letters "e" and "n" both appear twice in the name. Thanks for pointing out this additional example!
Tallinn?
Yes, Tallinn, the capital of Estonia, also fits the criteria. The letter "l" appears twice in the name. Good catch!
Budapest?
Yes, you're correct. Budapest, the capital of Hungary, contains the letters "d" and "a" twice each in its name. Thanks for pointing that out!
There is "extra" information in the comments and monstrous errors that are impossible in the case of the ability to analyze ("Riga .. - The letter `g` appears twice", "Budapest.. contains the letters ` d' and `a' twice each in its name").
Deprived of the ability for logical analysis/interpretation/inference, ChatGPT and other LLMs have a gigantic memory capacity, which allows memory to be used as a surrogate for logical analysis (and more successfully than people use it) - which is what the examples given here demonstrate, along with many others collected by many testers. In its mathematical essence, an LLM is an associative memory system with an added memory for context that holds the text of the dialogue. There is no division of knowledge into facts and rules of inference, nor is there a computational component that would apply such rules to interpret the text. As a result, attempts to obtain solutions to problems not used in the training process immediately lead to obvious absurdities.
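The failure mode described above - an answer picked by textual closeness rather than derived by inference - can be caricatured in a few lines of Python. This sketch is not a model of how an LLM is actually implemented; it only illustrates why similarity-based retrieval produces fluent but logically unrelated answers (the stored question-answer pairs are invented for the example).

STORED = {
    "which letter appears twice in vienna": "The letter 'n' appears twice.",
    "what is the capital of latvia": "Riga",
    "how is used oil drained from a gearbox": "Open the lower lid and wait.",
}

def answer_by_similarity(query: str) -> str:
    # Return the stored answer whose question shares the most words with the query.
    words = set(query.lower().split())
    best = max(STORED, key=lambda q: len(words & set(q.split())))
    return STORED[best]

# Textually plausible, logically wrong: Riga has no repeated letters at all.
print(answer_by_similarity("which letter appears twice in riga"))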
From a practical point of view, the main problem is that the capabilities of LLMs as a helpful tool are used by people who find it difficult to detect the nonsense in a generated response.
Whether the ability to analyze/infer should be considered a mandatory component of any system deserving the label "Artificial Intelligence" is a matter of terminology. But for Artificial General Intelligence (AGI), such an ability seems absolutely necessary. The reason is that using textual similarity as a surrogate for logical analysis makes it impossible to acquire new knowledge in any way other than absorbing the knowledge already accumulated by people; this obviously excludes many abilities that are inherent in humans and that, together with memory, provide what people commonly regard as intelligence.