Using ChatGPT for accounting? You may want to think again

ChatGPT is a language model that uses probability to create a text-based answer to your question. But math and accounting hinge on accuracy, not probability. Here's why that matters.
Written by Rajiv Rao, Contributing Writer

Over the past year or so, large language model (LLM) ChatGPT has demonstrated an uncanny ability to best humans at some of the things that are the cornerstone of our young professional lives.

It has passed all three notoriously difficult exams for medical school, gotten through the bar exam, and passed an MBA exam from the Wharton School of Business at the University of Pennsylvania.


The scores posted by the LLM were modest passing grades. But its later avatar -- GPT-4 -- is supposedly an even better student than its parent, having sailed through the bar exam with a 90th percentile score and getting near-perfect marks on the GRE Verbal test.

So, it must come as an immense source of both satisfaction and relief for us humans that there is at least one thing that LLMs like ChatGPT are not good at -- in fact, terrible at -- and that is accounting.


Many users of ChatGPT have commented publicly on how the simplest math functions have foxed it. However, there's a sizeable and rigorously executed study into ChatGPT's accounting capabilities that Brigham Young University (BYU) professor of accounting David Wood undertook several months ago.

Testing circumstances

Wood decided to harness the power of the global accounting fraternity via a pitch on social media that solicited help to put ChatGPT through the paces of a global accounting exam of sorts. 

There was a deluge of takers: 327 co-authors from 186 educational institutions located in 14 countries participated in the study. They pooled 25,181 classroom accounting exam questions -- as well as 2,000-plus questions from Wood's own department at BYU -- to pose to ChatGPT.

Typical of a comprehensive accounting examination, questions ranged across all major topics, such as financial accounting, auditing, managerial accounting, tax, and others, and were of different types (multiple choice, short answers, true/false) and difficulty levels.


The results were unequivocal: ChatGPT clocked a 47.4% result which, in and of itself, was not that bad. Students, however, scored an overall average of 76.7% and easily bested the machine.

According to Wood's paper, the LLM did fine on things like auditing, but had trouble getting its artificial neurons around tax, financial, and managerial assessment problems -- and these were sections that involved a lot of math.

AI's math doesn't add up

A lot of people can't quite reconcile AI's inability to do even simple math at times with its fearsome reputation as a potential killer of humanity.


Yet the fact is that ChatGPT is essentially a glorified predictive text program -- it has been fed vast amounts of data and then trained to identify right and wrong answers. 

Its ability to be uncannily humanlike by spitting out conversational answers to questions is because it has been built to understand the patterns inherent in language and the connection between words, but not numbers. (This is why it is called a 'language' model.)

The output of these AI LLMs hinges on probability, and not accuracy. Output, by design, has been architected to represent an answer that has the statistically highest probability for the question asked.
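To make the idea concrete, here is a toy sketch in Python. It is not how ChatGPT is actually implemented: the candidate answers and their probabilities are invented for illustration, and `most_probable` is a hypothetical helper. It only shows that picking the statistically likeliest text is a different operation from computing a result.

```python
# Toy illustration only: the candidates and probabilities below are made up,
# and real language models work on tokens, not whole answers.

def most_probable(candidates):
    """Return the candidate string with the highest assigned probability."""
    return max(candidates, key=candidates.get)

# Hypothetical probabilities a model might assign to continuations of
# the prompt "1,234 x 5,678 =".
candidates = {
    "7,006,652": 0.31,  # the correct product
    "7,006,562": 0.35,  # plausible-looking, but wrong
    "7,016,652": 0.34,  # plausible-looking, but wrong
}

picked = most_probable(candidates)   # chosen by likelihood
exact = f"{1234 * 5678:,}"           # computed by arithmetic

print("Most probable text:", picked)
print("Calculator result: ", exact)
```

In this made-up distribution the likeliest-looking string differs from the true product by two transposed digits, which is the kind of near miss that is harmless in prose and fatal in a ledger.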


And numbers, sadly, don't work like that. 

Answers involving math or many forms of accounting need to be precise and not an approximation. They depend on an exact output, like what a calculator gives you, and are not based on a relationship between words.
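For contrast, here is a minimal sketch of what "exact output" means in an accounting setting, using Python's standard `decimal` module. The line items, the 8.25% tax rate, and the `invoice_total` function are assumptions made up for this example. The point is that the same inputs always produce the same figure, to the cent.

```python
from decimal import Decimal, ROUND_HALF_UP

def invoice_total(line_items, tax_rate):
    """Sum (unit price, quantity) pairs and add tax rounded to the cent."""
    subtotal = sum(
        (Decimal(price) * qty for price, qty in line_items),
        Decimal("0"),
    )
    # Round the tax to two decimal places, half up.
    tax = (subtotal * Decimal(tax_rate)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return subtotal + tax

# Hypothetical invoice: unit prices as strings to avoid binary float error.
items = [("19.99", 3), ("4.05", 10)]
total = invoice_total(items, "0.0825")
print(total)
```

Passing prices as strings keeps binary floating-point rounding out of the calculation, which is one common way to keep currency arithmetic exact.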

Paulo Shakarian, an associate professor at Arizona State University's engineering department, who runs a lab exploring challenges confronting AI, completed a study that measured ChatGPT's performance on mathematical word problems.

Solving these word problems involves multiple steps, each of which requires translating words into mathematical equations. But this sort of multi-step process also requires logical reasoning, which is something the algorithm is not engineered to do.


"Our initial tests on ChatGPT, done in early January, indicate that performance is significantly below the 60% accuracy for state-of-the-art algorithm for math word problem-solvers," adds Shakarian.

Bright spots

So, where does an LLM like ChatGPT excel?

Another professor, Christian Terwiesch, from the Wharton School of Business at the University of Pennsylvania, had a very different experience with a case study typical of those assigned in business schools.

"On some problems, the math was horrible," Terwiesch said.


However, when given a case involving troubleshooting a bottleneck process at a hypothetical iron ore factory in Latin America, ChatGPT excelled.

"Wow! Not only is the answer correct, but it is also superbly explained," Terwiesch wrote in a paper about his experiment. "I don't see any reasons to take points off from this answer: A+!" 

The overall grade for the entire MBA exam was around a B or B-, says Terwiesch, primarily because of the bot's strength in operations management and process analysis, which a lot of workers in finance and management are paid a sizeable amount of money to do.

Another area of high AI competence: ripping through tedious tasks, such as processing invoices, tabulating and categorizing expenses, dealing with data entry, and similar areas.


But most of all, ChatGPT provided Wood, the BYU professor, with an unrivaled ability to introspect on what the staff were teaching students -- and how they were doing so.

"When this technology first came out, everyone was worried that students could now use it to cheat," he said. 

"But opportunities to cheat have always existed. So for us, we're trying to focus on what we can do with this technology now that we couldn't do before to improve the teaching process for faculty and the learning process for students. Testing it out was eye-opening."

Meanwhile, it's probably not a good idea to let an AI LLM do your taxes for you just yet.
