Can AI code? In baby steps only

Don't give up your coding career. Studies have shown generative AI does only a so-so job of coding, and fails spectacularly on hard problems.
Written by Tiernan Ray, Senior Contributing Writer

The first thrilling days after OpenAI released ChatGPT to the public last winter brought evidence of the program's ability to generate computer code, something that was a revelation to developers. At the outset, ChatGPT seemed so good at code, in fact, that even people with little coding knowledge could suddenly use it to generate powerful software, so powerful it could even be used as malware to threaten computer networks.

Many months of experience, and formal research into the matter, have revealed that ChatGPT and other such generative AI cannot really develop programs, per se. The best they can do is offer baby steps, mostly for simple coding problems, which may or may not be helpful to human coders.

"What generative has opened everyone's eyes to is the fact that I can almost have a partner when I'm doing a task that essentially gives me suggestions that move me past creative roadblocks," said Naveen Rao, co-founder and CEO of AI startup MosaicML, which was acquired in August by Databricks. 

At the same time, said Rao, the level of assistance for coding is low. 

"They give you a scaffolding, some things that are repeatable, but they don't give you anything particularly good," he said. "If I say, go solve this really hard problem, then they can't do that, right? They don't even write particularly good code; it's like someone who's been doing it for a year or two, kind of, level code."

Indeed, some studies have found that large language models such as GPT-4 fall well below human coders in overall code quality.

A recent study by Sayed Erfan Arefin and colleagues at Texas Tech University tested GPT-4 and its predecessor, GPT-3.5, on example coding problems from the online platform LeetCode -- problems of the kind asked of job applicants to Google and other tech giants.

The programs were assessed based on two core challenges, "organizing data for efficient access (using appropriate data structures)" and "creating workflows to process data (using effective algorithms)." They were also evaluated on what's called "string manipulation," which intersects with both of the other two. 

When the language models were given what the authors called complete questions, where the programs were supplied with examples of solutions to the questions, GPT-4 answered only 26% of the questions correctly, versus 45% for human respondents. When some information was taken away, GPT-4's ability plummeted to 19% of questions answered correctly. GPT-3.5 was down at around 12% and 10%, respectively. 

The authors also examined the quality of the GPT code, both for success and failure. In either case, they found a consistent problem: GPT often struggled with a basic practice of coding, "defining variables in a consistent manner."


Correctness of GPT-3.5, GPT-4, and humans, for train and test sets, when given either full problem information, with example solutions, or incomplete information.

Texas Tech University

Scale is also an issue for AI code generation. The most encouraging results so far in studies of GPT-4 are mostly on baby problems. 

One study, by David Noever of cyber-security firm PeopleTec, tested how well GPT-4 could find faulty code in code samples, a job similar to that of existing vulnerability-testing programs on the market, such as Snyk, a form of "Static Application Security Testing," or SAST.

In some cases, GPT-4 found more errors than Snyk, Noever reported. But it also missed numerous errors. And it was tested on a grand total of just over 2,000 lines of code. That is minuscule compared to full production applications, which can contain hundreds of thousands to millions of lines of code, across numerous linked files. It's not clear that successes on the toy problems will scale to such complexity.
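For readers unfamiliar with static analysis, the core idea is to inspect source code without running it. The toy Python sketch below is only an illustration of that idea. It is not how Snyk or GPT-4 works, and the rule set `RISKY_CALLS` and the function name `find_risky_calls` are assumptions made up for this example.

```python
import ast

# Hypothetical rule set for illustration; real SAST tools apply far more rules.
RISKY_CALLS = {"eval", "exec"}


def find_risky_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, function name) for each direct call to a risky builtin."""
    findings = []
    # Parse the text into a syntax tree; the scanned code is never executed.
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in RISKY_CALLS
        ):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)


sample = "x = input()\nresult = eval(x)\nprint(result)\n"
print(find_risky_calls(sample))  # prints [(2, 'eval')]
```

A check this simple covers one pattern in one file. Production tools have to follow data across many linked files, which is part of why results on a couple of thousand lines may not carry over.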

A study last month by Zhijie Liu and colleagues at ShanghaiTech University examined the quality of code based upon correctness, understandability, and security. Like Arefin and team at Texas Tech, the researchers challenged ChatGPT on LeetCode tasks, and they also tested its code generation on what's called the Common Weakness Enumeration, a catalog of software vulnerabilities maintained by research firm MITRE.

Liu and team tested ChatGPT on tasks formulated either before or after 2021. Because ChatGPT was trained only on material from before 2021, they wanted to see how the program did on both established and newer challenges.

The results are striking. For the newer problems, called "Aft.," for "after" 2021, Liu and team found very low rates of correctness in ChatGPT's code. "ChatGPT's ability to functionally correct code generation decreases significantly as the difficulty of the problem increases," they write. Only 15.4% of C-language program code was acceptable, and none of it was acceptable for the hardest problems. And, "the code generated by ChatGPT for hard and medium problems is more likely to contain both compile and runtime errors." Human coders taking the test, on average, got 66% right.

For older problems, labeled "Bef.," the share of correct code rises to 31%, which is still low.

The team went through numerous examples and characterized the kinds of wrong answers ChatGPT gave in its lines of code. For example, while an overall program design might be heading in the right direction, a given line of code would show a fundamentally wrong use of something as simple as evaluating a variable, an error it's hard to imagine even a beginner programmer making.


Example of wrong code generated by ChatGPT. The program is supposed to sort boxes into categories by description. In line 12, the code decides that if a box is neither "bulky" nor "heavy," it should be sorted into the category of "both," exactly the opposite of the correct category, "neither."

ShanghaiTech University
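To make that mistake concrete, here is a minimal Python sketch of the kind of error the figure describes. It is not the code from the ShanghaiTech paper. The function names and the two boolean inputs are assumptions made for illustration, and the rules for deciding whether a box counts as bulky or heavy are deliberately left out.

```python
def categorize_buggy(is_bulky: bool, is_heavy: bool) -> str:
    """Illustrative reconstruction of the mistake, not the study's actual code."""
    if is_bulky and is_heavy:
        return "Both"
    if not is_bulky and not is_heavy:
        return "Both"  # bug: a box that is neither bulky nor heavy gets "Both"
    return "Bulky" if is_bulky else "Heavy"


def categorize_fixed(is_bulky: bool, is_heavy: bool) -> str:
    """Same structure with the one wrong branch corrected."""
    if is_bulky and is_heavy:
        return "Both"
    if not is_bulky and not is_heavy:
        return "Neither"
    return "Bulky" if is_bulky else "Heavy"


print(categorize_buggy(False, False))  # prints Both
print(categorize_fixed(False, False))  # prints Neither
```

The overall shape of the two functions is identical, which matches the researchers' point: the design can be right while a single branch returns the opposite of what the description calls for.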

Liu and team arrive at a series of fascinating general conclusions and also mitigating factors. For one, they find that ChatGPT struggles with novel problems: "ChatGPT may have limitations when generating code for unfamiliar or unseen problems in the training dataset, even if the problems are easy with logic from human perspective."

The choice of programming language matters, however: the technology does better with certain programming languages that are "strongly typed" or more "expressive."

"In general, the probability of ChatGPT generating functionally correct code is higher when using languages with more strongly expressive power (e.g., Python3)," they write. 

Another shortcoming is that ChatGPT's code can be convoluted, which makes its errors harder to fix. "The code generation process of ChatGPT may be careless," they write, "and the generated code may fail to meet some of the detailed conditions described, resulting in it being difficult to successfully generate or fix (to functional correct)."

And on the Common Weakness Enumeration test from MITRE, "the code generated by ChatGPT often exhibits relevant vulnerabilities, which is a severe issue," they write. Fortunately, they note, ChatGPT is able to correct many of those vulnerabilities in subsequent prompts when supplied with more detailed information from the MITRE data set.

All three studies suggest it is very early in the use of generative AI for programming. It is, as Rao said, helpful in simple assistant tasks, where the programmer is in charge. 

It's possible that progress will come from new approaches that break programming paradigms. For example, recent Google work trains language models to reach out to the internet for tools to solve tasks. And work by Google's DeepMind unit trains language models to go more deeply into engineering their own prompts to improve performance -- a kind of self-reflexive programming that seems promising.

Something deeper may ultimately be required, says Rao.

"I don't think it can be solved with prompts," says Rao. "I think there's actually some fundamental problems we still have to solve -- there's still something fundamentally missing." 

Added Rao, "We can basically throw so much data at a large neural network that it's a hundred lifetimes or more of human experience, and yet, a human with much less experience can solve novel problems better, and not make certain kinds of basic errors."
