Has ChatGPT rendered the US's education report card irrelevant?

ChatGPT and GPT-4 excelled at the nation's benchmark for answering science questions. The programs' ability to overcome the working memory limit of humans carries 'significant implications' for educators, say researchers.
Written by Tiernan Ray, Senior Contributing Writer

ChatGPT and GPT-4 scored above the majority of students in grades 4, 8, and 12 on a standardized test of science questions.

skynesher/Getty Images

The Nation's Report Card, formally the National Assessment of Educational Progress (NAEP), is a standardized test of student ability in the US that has been administered since 1969 and is overseen by the US Department of Education. The test is widely cited as the benchmark of where students stand in their ability to read, write, do math, understand scientific experiments, and many other areas of competence.

The test had a grim message for teachers, administrators, and parents last year: teenagers' math scores showed the largest decline since the assessment began, amid a general long-term trend of declining math and reading scores.

The decline comes at the same time as the rise of generative artificial intelligence (AI), such as OpenAI's ChatGPT, and obviously, many people are asking if there's a connection. 

"ChatGPT and GPT-4 consistently outperformed the majority of students who answered each individual item in the NAEP science assessments," write Xiaoming Zhai of the University of Georgia, and colleagues at the University's AI4STEM Education Center, and at the University of Alabama's College of Education, in a paper published this week on the arXiv pre-print server, "Can Generative AI and ChatGPT Outperform Humans on Cognitive-Demanding Problem-Solving Tasks in Science?"

The report is "the first study focusing on comparing cutting-edge GAI and K-12 students in problem-solving in science," state Zhai and team. 

There have been numerous studies in the past year showing that ChatGPT can "match human performance in practice and transfer problems, aligning with the most probable outcomes expected from a human sample," which, they write, "underscores ChatGPT's capability to mirror the average success rate of human subjects, thereby showcasing its proficiency in cognitive tasks."

The authors constructed a NAEP exam for ChatGPT and GPT-4 by selecting 33 multiple-choice questions in science problem-solving, along with four "selected response" questions, in which the test-taker picks an appropriate response from a list after reading a passage. The set also includes three scenario-based tasks, each presenting a sequence of connected questions, plus 11 "constructed response" questions and three "extended constructed response" questions, where the test-taker has to write a response rather than choose from offered responses.

An example of a science question could involve an imaginary scenario of a rubber band stretched between two nails, asking the student to articulate why it causes a sound when plucked, and what would make the sound reach a higher pitch. That question requires the student to write a reply about vibrations of the air from the rubber band, and how increasing tension could raise the pitch of the vibration. 


Example of a constructed answer question that tests science reasoning.

University of Georgia

The questions were all oriented to grades 4, 8, and 12. The output from ChatGPT and GPT-4 was compared with the average of the anonymized responses of human test-takers, as provided to the authors by the Department of Education.

ChatGPT and GPT-4 answered the questions with accuracy "above the median" -- and, in fact, the human students scored abysmally compared to the two programs on numerous tests. ChatGPT scored better than 83%, 70%, and 81% of students for grade 4, 8, and 12 questions, and GPT-4 was similar, ahead of 74%, 71%, and 81%, respectively.
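The article does not spell out how those percentages were derived. One plausible reading is a simple percentile rank: the share of students whose score falls below the program's score. The sketch below assumes that reading; the function name `percent_outscored` and the sample scores are hypothetical and are not data from the paper.

```python
from bisect import bisect_left

def percent_outscored(model_score, student_scores):
    """Return the percentage of students who scored strictly below model_score."""
    ranked = sorted(student_scores)
    # bisect_left gives the count of entries strictly less than model_score.
    below = bisect_left(ranked, model_score)
    return 100.0 * below / len(ranked)

# Hypothetical student scores, invented for illustration only.
students = [12, 15, 18, 20, 21, 23, 25, 27, 30, 33]
share = percent_outscored(26, students)
print(f"The program outscored {share:.0f}% of this sample")
```

Under this reading, a figure such as 83% would mean that roughly five in six students in the comparison group scored lower than the program.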

The authors have a theory for what's going on, and it suggests in stark terms the kind of grind that standardized tests create. Human students end up being something like the famous story of John Henry trying to compete against the steam-powered rock drill.

The authors draw upon a framework in psychology known as "cognitive load", which measures how intensely a task challenges the working memory of the human brain, the place where resources are held for a short duration. Akin to computer DRAM, short-term memory has a limited capacity, and things get flushed out of short-term memory as new facts have to be attended to. 

"Cognitive load in science education discusses the mental effort required by students to process and comprehend scientific knowledge and concepts," the authors relate. Specifically, working memory can become taxed by the various facets of a test, which "all compete for these limited working memory resources," such as trying to keep all the variables of a test question in mind at the same time. 

Machines have a greater ability to maintain variables in DRAM, and ChatGPT and GPT-4 can -- through their various neural weights, and the explicit context typed into the prompt -- store vastly more input, the authors emphasize. 

The matter comes to a head when the authors look at the ability of each student correlated to the complexity of the question. The average student gets bogged down as the questions get harder, but ChatGPT and GPT-4 do not.

"For each of the three grade levels, higher average student ability scores are required on NAEP science assessments with increased cognitive demand, however, the performance of both ChatGPT and GPT-4 might not significantly impact the same conditions, except for the lowest grade 4."

In other words: "Their lack of sensitivity to cognitive demand demonstrates GAI's potential to overcome the working memory that humans suffer when using higher-order thinking required by the problems."
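To make the contrast concrete, here is a minimal sketch of what "insensitivity to cognitive demand" could look like in numbers. It is not the authors' analysis, which the article does not detail; it simply computes a Pearson correlation between each question's demand level and an outcome. All values and names (`demand`, `student_ability_needed`, `model_correct`) are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, one entry per question (not data from the paper).
demand = [1, 1, 2, 2, 3, 3]                               # cognitive demand level
student_ability_needed = [140, 150, 160, 165, 180, 190]   # ability score needed to answer correctly
model_correct = [1, 0, 1, 1, 0, 1]                        # 1 if the program answered correctly

r_students = pearson(demand, student_ability_needed)
r_model = pearson(demand, model_correct)
print(f"students: {r_students:.2f}, model: {r_model:.2f}")
```

With these made-up numbers, the student-side correlation is strongly positive while the model-side correlation is essentially zero, which is the kind of pattern the quoted passages describe.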

The authors argue that generative AI's ability to overcome the working memory limit of humans carries "significant implications for the evolution of assessment practices within educational paradigms," and that "there is an imperative for educators to overhaul traditional assessment practices." 

Generative AI is "omnipresent" in students' lives, they note, and so human students are going to use the tools, and also be out-classed by the tools, on standardized tests such as NAEP. 

"Given the noted insensitivity of GAI to cognitive load and its potential role as a tool in students' future professional endeavors, it becomes crucial to recalibrate educational assessments," write Zhai and team. 


Human students' average performance was below that of GPT-4 and ChatGPT on most of the twelfth-grade questions.

University of Georgia

"The focus of these assessments should pivot away from solely measuring cognitive intensity to a greater emphasis on creativity and the application of knowledge in novel contexts," they advise. 

"This shift recognizes the growing importance of innovative thinking and problem-solving skills in a landscape increasingly influenced by advanced GAI technologies." 

Teachers, they note, are "currently unprepared" for what looks to be a "significant shift" in pedagogy. That transformation means it's up to educational institutions to focus on professional development for teachers. 

An interesting footnote to the study is the limitations of the two programs. In certain cases, one program or the other requested additional information for a science question. When one of the programs asked, but the other did not, "The model that did not request additional information often produced unsatisfactory answers." That means, the authors conclude, that "these models heavily rely on the information provided to generate accurate responses." 

The machines are dependent on what's either in the prompt or in the learned parameters of the model. That gap opens a way for humans, perhaps, to excel where neither source contains the insights required for problem-solving activities. 
