I put GPT-4o through my coding tests and it aced them - except for one weird result

Can OpenAI's GPT-4o code better than its competitors? Yes, but you still need to check its work.
Written by David Gewirtz, Senior Contributing Editor
Ismail Aslandag/Anadolu via Getty Images

Unless you've been hiding on a deserted island somewhere without any internet service, you probably know that OpenAI released its new large language model, GPT-4o, where "o" stands for "omni." The new LLM is supposed to offer a variety of modes, including text, graphics, and voice.

In this article, I'm subjecting the new GPT-4o model to my standard set of coding tests. I've run these tests against a wide range of AIs, with widely varying results. You'll want to read all the way to the end because I did get a surprising result.

If you want to follow along with your own tests, point your browser to this article: How I test an AI chatbot's coding ability - and you can too.

It contains all the standard tests I apply, along with explanations of how they work, and what to look for in the results.

And, with that, let's dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

Here's the user interface of the plugin GPT-4o generated:

Screenshot by David Gewirtz/ZDNET

This was a first. GPT-4o decided to include a JavaScript file that caused the count of lines in both fields to update dynamically. Since the prompt did not specify that JavaScript was disallowed, it's a creative solution.

More to the point, it works. The JavaScript also controls the Randomize button, so if you press the Randomize button multiple times, you get multiple sets of results without the whole page refreshing.
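
GPT-4o's actual plugin isn't reproduced here, but for a sense of the approach, here's a minimal, hypothetical sketch of how a WordPress plugin might wire up a separate JavaScript file and an admin-ajax endpoint so the Randomize button works without a page refresh. All of the names here (randomizer.js, randomizer_shuffle, and so on) are my illustration, not GPT-4o's output.

```php
<?php
/*
 * Hypothetical sketch only -- not GPT-4o's actual output. A separate
 * JavaScript file (randomizer.js, a made-up name) would update the line
 * counts as you type and post to the admin-ajax endpoint below, so the
 * Randomize button works without reloading the page.
 */

// Load the script on the plugin's admin pages.
add_action( 'admin_enqueue_scripts', function () {
    wp_enqueue_script(
        'randomizer-js',
        plugins_url( 'randomizer.js', __FILE__ ),
        array(),
        '1.0',
        true // load in the footer
    );
} );

// AJAX handler the Randomize button calls; returns shuffled lines as JSON.
add_action( 'wp_ajax_randomizer_shuffle', function () {
    $raw   = sanitize_textarea_field( wp_unslash( $_POST['lines'] ?? '' ) );
    $lines = array_values( array_filter( array_map( 'trim', explode( "\n", $raw ) ) ) );

    shuffle( $lines ); // the real test also requires keeping duplicates apart, omitted here

    wp_send_json_success( array( 'lines' => $lines ) );
} );
```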

Also: ChatGPT vs. ChatGPT Plus: Is a paid subscription still worth it?

Lines were arranged correctly. Duplicates were separated from each other per the specification. This is a totally workable piece of code.

My only complaint is that the Randomize button isn't on a line of its own. However, I did not tell ChatGPT to put it on its own line, so the layout isn't the AI's fault.

Here are the aggregate results of this and previous tests:

  • ChatGPT GPT-4o: Interface: good, functionality: good
  • Microsoft Copilot: Interface: adequate, functionality: fail
  • Meta AI: Interface: adequate, functionality: fail
  • Meta Code Llama: Complete failure
  • Google Gemini Advanced: Interface: good, functionality: fail
  • ChatGPT 4: Interface: good, functionality: good
  • ChatGPT 3.5: Interface: good, functionality: good

2. Rewriting a string function

This test focuses on dollars-and-cents conversions. ChatGPT GPT-4o rewrote the code correctly, disallowing any input that isn't a proper dollars-and-cents value and would therefore cause subsequent lines of code to fail.

Also: 6 ways OpenAI just supercharged ChatGPT for free users

I was a little disappointed that the code allows a leading decimal point (e.g., .75) but doesn't prepend a zero to the value (as in 0.75). Still, dollars-and-cents processing code would be able to understand the version without the leading zero and would not fail.

Since I didn't explicitly ask for a prepended zero in that case, it's not something I'll ding the AI over. However, this does show how, even if an AI delivers workable code, you might want to go back in and tweak the prompt to get more of what you really want to see.
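
To make that concrete, here's a hypothetical validation sketch, again my illustration rather than GPT-4o's code, that accepts plain dollars-and-cents input (including .75) and also applies the prepend-a-zero tweak discussed above:

```php
<?php
/*
 * Hypothetical sketch only -- my illustration, not GPT-4o's rewrite.
 * Accepts plain dollars-and-cents input (including ".75") and rejects
 * anything that would trip up the code downstream.
 */
function sanitize_amount( string $input ): ?string {
    $input = trim( $input );

    // Digits, an optional decimal point, and at most two cents digits.
    if ( $input === '' || ! preg_match( '/^\d*(\.\d{1,2})?$/', $input ) ) {
        return null; // reject anything that isn't a dollars-and-cents value
    }

    // The tweak discussed above: turn ".75" into "0.75".
    if ( '.' === $input[0] ) {
        $input = '0' . $input;
    }

    return number_format( (float) $input, 2, '.', '' );
}
```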

Here are the aggregate results of this and previous tests:

  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Succeeded
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

3. Finding an annoying bug

This is an interesting test because the answer isn't immediately obvious. I was originally stumped when I got this error during coding, so I fed it into the first ChatGPT language model. At the time, I was blown away because it actually found the error right away.

Also: How Adobe manages AI ethics concerns while fostering creativity

By contrast, three of the LLMs I tested missed the misdirection inherent in this problem. From the error message presented, it looks like the error is in one part of the code, but it's actually in a completely different area, something you (or an AI) wouldn't know without deep knowledge of the framework involved, in this case WordPress.

The good news: ChatGPT GPT-4o found the problem and correctly described the fix.
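
The actual buggy code from my test isn't reproduced here, but here's a small, hypothetical WordPress example of the kind of misdirection involved: the fatal error is reported from inside WordPress core, while the real mistake is a single argument-count digit in the plugin's own add_filter() call.

```php
<?php
/*
 * Hypothetical illustration of misdirection -- not the actual bug from my test.
 * The registration below claims the callback takes two arguments, but the
 * 'the_content' filter only passes one. PHP's ArgumentCountError is raised
 * from inside wp-includes/class-wp-hook.php, so the error message points at
 * WordPress core even though the real fix is the "2" on the add_filter() line.
 */
add_filter( 'the_content', 'myplugin_append_note', 10, 2 );

function myplugin_append_note( $content, $note ) {
    return $content . '<p>' . esc_html( $note ) . '</p>';
}
```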

Here are the aggregate results of this and previous tests:

  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
  • Meta AI: Succeeded
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

So far, we're at three out of three wins. Let's move on to our last test.

4. Writing a script

In responding to this test, ChatGPT GPT-4o gave me an answer that was actually a bit more than I asked for.

This test asks about a fairly obscure Mac scripting tool called Keyboard Maestro, as well as Apple's scripting language, AppleScript, and Chrome scripting behavior. For the record, Keyboard Maestro is one of the biggest reasons I use Macs over Windows for my daily productivity, because it allows the entire OS and the various applications to be reprogrammed to suit my needs. It's that powerful.

Also: How to use ChatGPT to write code: What it can and can't do for you

In any case, to pass the test, the AI has to properly describe how to solve the problem using a mix of Keyboard Maestro code, AppleScript code, and Chrome API functionality.

As you can see, ChatGPT GPT-4o gave me two versions.

Screenshot by David Gewirtz/ZDNET

Both versions properly talked to Keyboard Maestro, but they differed in how they handled ignoring case. The one on the left was actually erroneous, because AppleScript doesn't have an "as lowercase" capability. The code on the right, which used "contains" (a comparison that ignores case by default), did work.

I'm going to give GPT-4o a passing grade because it did return code that worked. But that's a cautious passing grade because it should have returned just one option, and that option should have been correct. What it did instead was require me to evaluate both results and choose. That could have taken as much time as it would have taken to just write the code myself.

Here are the aggregate results of this and previous tests:

  • ChatGPT GPT-4o: Succeeded, but with reservations
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Succeeded
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Failed

Overall results

Here are the overall results of the four tests:

  • ChatGPT GPT-4o: 4 for 4 (one with reservations)
  • Microsoft Copilot: 0 for 4
  • Meta AI: 1 for 4
  • Meta Code Llama: 1 for 4
  • Google Gemini Advanced: 1 for 4
  • ChatGPT 4: 4 for 4
  • ChatGPT 3.5: 3 for 4

Up until now, my default go-to for programming help has been ChatGPT. It's just worked (except for when it didn't). All of the other AIs failed most of my coding tests. But GPT-4o is weird. That last answer kind of raised the hairs on the back of my neck.

I'm not thrilled with getting two answers for one question, especially when one answer contains code that the language itself doesn't support. What's going on inside GPT-4o that's causing this loss of confidence?

In any case, it's still the best-performing AI for my coding tests, so I'll probably continue to use it and get more acquainted with GPT-4o. Another option is to go back to GPT-3.5 or GPT-4 in ChatGPT Plus. Stay tuned. The next time ChatGPT updates its model, I'm definitely going to rerun these tests and see if it's smart enough to pick the right answer for all four tests.

Have you tried coding with Copilot, Meta AI, Gemini, or ChatGPT? What has your experience been? Let us know in the comments below.
