10 Ways GPT-4 Is Impressive but Still Flawed

The system seemed to respond appropriately. But the answer did not consider the height of the doorway, which might also prevent a tank or a car from traveling through.

OpenAI’s chief executive, Sam Altman, said the new bot could reason “a little bit.” But its reasoning skills break down in many situations. The previous version of ChatGPT handled the question a little better because it recognized that both height and width mattered.

OpenAI said the new system could score among the top 10 percent or so of students on the Uniform Bar Examination, which qualifies lawyers in 41 states and territories. It can also score a 1,300 (out of 1,600) on the SAT and a five (out of five) on Advanced Placement high school exams in biology, calculus, macroeconomics, psychology, statistics and history, according to the company’s tests.

Previous versions of the technology failed the Uniform Bar Exam and did not score nearly as high on most Advanced Placement tests.

On a recent afternoon, to demonstrate its test skills, Mr. Brockman fed the new bot a paragraphs-long bar exam question about a man who runs a diesel-truck repair business.

The answer was correct but filled with legalese. So Mr. Brockman asked the bot to explain the answer in plain English for a layperson. It did that, too.

Though the new bot seemed to reason about things that have already happened, it was less adept when asked to form hypotheses about the future. It seemed to draw on what others have said instead of creating new guesses.

When Dr. Etzioni asked the new bot, “What are the important problems to solve in N.L.P. research over the next decade?” — referring to the kind of “natural language processing” research that drives the development of systems like ChatGPT — it could not formulate entirely new ideas.

The new bot still makes stuff up. Called “hallucination,” the problem haunts all the leading chatbots. Because the systems do not have an understanding of what is true and what is not, they may generate text that is completely false.

When asked for the addresses of websites that described the latest cancer research, the new bot sometimes generated internet addresses that did not exist.