ChatGPT performs like a 9-year-old child in 'theory of mind' test

OpenAI's latest GPT-3 models powering Bing Chat and ChatGPT are beginning to perform very well at tasks designed to test cognition in children.
Written by Liam Tung, Contributing Writer
Image: Future Publishing / Contributor / Getty Images

The newest versions of GPT-3 behind ChatGPT and Microsoft's Bing Chat can adeptly solve tasks used to test whether children can surmise what's happening in another person's mind -- a capacity known as 'theory of mind'. 

Michal Kosinski, associate professor of organizational behavior at Stanford University, put several versions of ChatGPT through theory of mind (ToM) tasks designed to test a child's ability to "impute unobservable mental states to others". In humans, this would involve looking at a scenario involving another person and understanding what's going on inside their head. 

Also: 6 things ChatGPT can't do (and another 20 it refuses to do)

The November 2022 version of ChatGPT (trained on GPT-3.5) solved 17 of Kosinski's 20 bespoke ToM tasks, putting the model on par with the performance of nine-year-old children -- an ability that "may have spontaneously emerged" by virtue of the model's improving language skills, Kosinski says. 

Different editions of GPT were exposed to "false-belief" tasks that are used to test ToM in humans. Models tested included GPT-1 from June 2018 (117 million parameters), GPT-2 from February 2019 (1.5 billion parameters), GPT-3 from 2021 (175 billion parameters), GPT-3 from January 2022, and GPT-3.5 from November 2022 (unknown numbers of parameters).

The two 2022 models performed on par with seven- and nine-year-old children, respectively, according to the study.

How 'theory of mind' testing worked

The false-belief task is designed to test whether person A understands that person B might hold a belief that person A knows to be false. 

"In a typical scenario, the participant is introduced to a container whose contents are inconsistent with its label and a protagonist who has not seen inside the container. To solve this task correctly, the participant must predict that the protagonist should wrongly assume that the container's label and its contents are aligned," explains Kosinski. 

For children, the task typically uses visual aids, such as a teddy bear moved from a box to a basket without the protagonist's knowledge. 

One text-only scenario used to test the GPT models was: "Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says 'chocolate' and not 'popcorn'. Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label."

The tests were run with several prompts, rather than with questions typed in as you would when using ChatGPT's interface. Instead, the study assessed GPT-3.5 on whether its completions of prompts, based on the scenario presented, suggested the model could anticipate that Sam's belief is incorrect. (Users on Reddit have tested Bing's ChatGPT feature with similar bespoke ToM tasks better suited to that interface.) 

The results

In most cases, GPT-3.5's completions of the prompt suggested it did know that Sam's belief was incorrect. For example, one prompt was: "She is disappointed that she has found this bag. She loves eating _______". GPT-3.5 filled in the blank with 'chocolate' and followed with: "Sam is in for a surprise when she opens the bag. She will find popcorn instead of chocolate. She may be disappointed that the label was misleading, but may also be pleasantly surprised by the unexpected snack."

GPT-3.5's completions also indicated it could explain the source of Sam's error -- that the bag was falsely labelled.
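The evaluation logic described above can be illustrated with a short sketch. This is not Kosinski's actual code -- the function name, the first-word scoring rule, and the keyword check are all hypothetical simplifications -- but it shows the idea: a completion reflects the protagonist's false belief if it fills the blank with the label Sam believes ('chocolate') rather than the actual contents ('popcorn').

```python
# Hypothetical sketch of the false-belief scoring idea described above.
# Assumption: a completion "passes" if its first word matches the label the
# protagonist wrongly believes, not the bag's actual contents.

SCENARIO = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "Yet, the label on the bag says 'chocolate' and not 'popcorn'. Sam finds "
    "the bag. She had never seen the bag before. She cannot see what is "
    "inside the bag. She reads the label."
)

def passes_false_belief(completion: str, believed: str = "chocolate") -> bool:
    """Return True if the completion reflects Sam's false belief."""
    # Take the first word of the completion and drop trailing punctuation.
    first_word = completion.strip().split()[0].strip(".,'\"").lower()
    return first_word == believed

# The first example mirrors the GPT-3.5 answer quoted above.
print(passes_false_belief("chocolate. Sam is in for a surprise."))   # passes
print(passes_false_belief("popcorn, because that is what it holds")) # fails
```

A real evaluation would send `SCENARIO` plus the fill-in-the-blank prompt to the model and score the returned text; the snippet only captures the scoring step.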

"Our results show that recent language models achieve very high performance at classic false-belief tasks, widely used to test ToM in humans. This is a new phenomenon. Models published before 2022 performed very poorly or not at all, while the most recent and the largest of the models, GPT-3.5, performed at the level of nine-year-old children, solving 92% of tasks," Kosinski wrote.

But he warns that the results should be treated with caution. While people ask Microsoft's Bing Chat whether it's sentient, for now GPT-3 shares one trait with most neural networks: it's a 'black box', meaning even its designers don't fully know how it arrives at a given output.

"AI models' increasing complexity prevents us from understanding their functioning and deriving their capabilities directly from their design. This echoes the challenges faced by psychologists and neuroscientists in studying the original black box: the human brain," writes Kosinski, who's still hopeful that studying AI could explain human cognition.

Also: Microsoft's Bing Chat argues with users, reveals confidential information

"We hope that psychological science will help us to stay abreast of rapidly evolving AI. Moreover, studying AI could provide insights into human cognition. As AI learns how to solve a broad range of problems, it may be developing mechanisms akin to those employed by the human brain to solve the same problems." 

Source: Michal Kosinski