I pitted Claude 3.5 Sonnet against AI coding tests ChatGPT aced - and it failed creatively

Anthropic claims its updated AI tool is ideal for programming. After subjecting Claude 3.5 Sonnet to my four standard programming tests, I have some advice for you.
Written by David Gewirtz, Senior Contributing Editor

Last week, I got an email from Anthropic announcing that Claude 3.5 Sonnet was available. According to the AI company, "Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations."

The company added: "Claude 3.5 Sonnet is ideal for complex tasks like code generation." I decided to see if that was true.

I'll subject the new Claude 3.5 Sonnet model to my standard set of coding tests -- tests I've run against a wide range of AIs with a wide range of results. Want to follow along with your own tests? Point your browser to "How I test an AI chatbot's coding ability - and you can too," which contains all the standard tests I apply, explanations of how they work, and what to look for in the results.

OK, let's dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

At first, this seemed to have so much promise. Let's start with the user interface Claude 3.5 Sonnet created based on my test prompt.

[Screenshot: the plugin interface generated by Claude 3.5 Sonnet. Screenshot by David Gewirtz/ZDNET]

This is the first time an AI has decided to put the two data fields side-by-side. The layout is clean and looks great.

Claude also decided to do something else I've never seen an AI do. This plugin can be created using just PHP code, which is the code running at the back end of a WordPress server.

But some AI implementations also have added JavaScript code (which runs in the browser to control dynamic user interface features) and CSS code (which controls how the browser displays information).

In a PHP environment, if you need PHP, JavaScript, and CSS, you can either include the CSS and JavaScript right in the PHP code (that's a feature of PHP), or you can put the code in three separate files -- one for PHP, one for JavaScript, and one for CSS.

Usually, when an AI wants to use all three languages, it shows what needs to be cut and pasted into the PHP file, then another block to be cut and pasted into a JavaScript file, and then a third block to be cut and pasted into a CSS file.

But Claude just provided one PHP file and then, when the plugin ran, it auto-generated the JavaScript and CSS files into the plugin's home directory. This is both fairly impressive and somewhat wrong-headed. It's cool that it tried to make the plugin creation process easier, but whether or not a plugin can write to its own folder depends on how the server's operating system is configured -- and there's a very high chance it could fail.

I allowed it in my testing environment, but I'd never allow a plugin to rewrite its own code in a production environment. That's a very serious security flaw.
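
To make the permissions point concrete, here is a minimal Python sketch of the same pattern. It is not Claude's PHP code, and the function name write_assets and the file names are hypothetical. It only illustrates that code which writes files into its own folder succeeds or fails depending on what the operating system allows, so the caller needs a fallback.

```python
from pathlib import Path
from typing import Mapping


def write_assets(plugin_dir: Path, assets: Mapping[str, str]) -> bool:
    """Try to write each asset into plugin_dir.

    Returns False when the operating system refuses the write,
    for example because of permissions or a missing directory.
    """
    try:
        for name, body in assets.items():
            (plugin_dir / name).write_text(body, encoding="utf-8")
    except OSError:
        # PermissionError and FileNotFoundError are both OSError subclasses.
        return False
    return True
```

A caller that gets False back would have to fall back to something else, such as embedding the CSS and JavaScript directly in the page, which is why shipping the three files separately is the more predictable approach.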

Despite the fairly creative nature of Claude's code generation solution, the bottom line is that the plugin failed. Pressing the Randomize button does absolutely nothing. That's sad because, as I said, it had so much promise.

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Interface: good, functionality: fail
  • ChatGPT GPT-4o: Interface: good, functionality: good
  • Microsoft Copilot: Interface: adequate, functionality: fail
  • Meta AI: Interface: adequate, functionality: fail
  • Meta Code Llama: Complete failure
  • Google Gemini Advanced: Interface: good, functionality: fail
  • ChatGPT 4: Interface: good, functionality: good
  • ChatGPT 3.5: Interface: good, functionality: good

2. Rewriting a string function

This test evaluates how well the AI rewrites code to better fit a given need -- in this case, dollars and cents conversions.

The Claude 3.5 Sonnet revision properly removed leading zeros, making sure that entries like "000123" are treated as "123". It properly allows integers and decimals with up to two decimal places (which is the key fix the prompt asked for). It prevents negative values. And it's smart enough to return "0" for any weird or unexpected input, which prevents the code from abnormally ending in an error.

One failure is that it won't accept a decimal value without a leading digit. If the user entered 50 cents as ".50" instead of "0.50", the code would reject the entry. Based on how the original text description for the test is written, it should have allowed this input form.

Although most of the revised code worked, I have to count this as a fail because if the code were pasted into a production project, users would not be able to enter inputs that contained only values for cents.
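
For readers who want to see what those rules look like in code, here is a minimal Python sketch. It is not the test's original function or Claude's output. The name normalize_amount, the regular expression, and the choice to return a cleaned string are all my assumptions for illustration. It follows the behaviors described above: strip leading zeros, allow up to two decimal places, reject negatives, return "0" for anything unexpected, and accept a cents-only entry like ".50".

```python
import re

# Assumed pattern: optional dollar digits, then an optional "." with up to two cent digits.
_AMOUNT = re.compile(r"(\d*)(?:\.(\d{0,2}))?")


def normalize_amount(text: str) -> str:
    """Return a cleaned dollars-and-cents string, or "0" for unexpected input."""
    match = _AMOUNT.fullmatch(text.strip())
    if match is None:
        return "0"  # negatives, letters, or more than two decimal places
    dollars = match.group(1)
    cents = match.group(2) or ""
    if not dollars and not cents:
        return "0"  # empty input or a lone "."
    dollars = dollars.lstrip("0") or "0"  # "000123" becomes "123", "" becomes "0"
    return f"{dollars}.{cents}" if cents else dollars
```

The detail that matters for this test is the `\d*` in front of the decimal point: because the dollar digits are optional, ".50" is accepted and comes back as "0.50" instead of being thrown out.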

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Failed
  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Succeeded
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

3. Finding an annoying bug

The big challenge of this test is that the AI is tasked with finding a bug that's not obvious and -- to solve correctly -- requires knowledge of the WordPress platform. It's also a bug I did not immediately see on my own and, originally, asked ChatGPT to solve (which it did).

Claude not only got this right -- catching the subtlety of the error and correcting it -- but it was also the first AI since I published the full set of tests online to catch the fact that the publishing process introduced an error into the sample query (which I subsequently fixed and republished).

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Succeeded
  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
  • Meta AI: Succeeded
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

So far, we're at two out of three fails. Let's move on to our last test.

4. Writing a script

This test is designed to see how far the AI's programming knowledge goes into specialized programming tools. While AppleScript is fairly common for scripting on Macs, Keyboard Maestro is a commercial application sold by a lone programmer in Australia. I find it indispensable, but it's just one of many such apps on the Mac.

However, when I tested ChatGPT, it knew how to "speak" Keyboard Maestro as well as AppleScript, which shows how broad its programming language knowledge is.

Unfortunately, Claude does not have that knowledge. It did write an AppleScript that attempted to speak to Chrome (that's part of the test parameters), but it ignored the essential Keyboard Maestro component.

Worse, the AppleScript code it generated would fail with an error. In an attempt to ignore case for the match in the test, Claude generated the line:

if theTab's title contains input ignoring case then

This is pretty much a double error: AppleScript's "contains" comparison is already case insensitive by default, so no override was needed, and the phrase "ignoring case" does not belong where Claude placed it. The line caused the script to error out with an "Ignoring can't go after this" syntax error message.

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Failed
  • ChatGPT GPT-4o: Succeeded but with reservations
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Succeeded
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Failed

Overall results

Here are the overall results of the four tests: Claude 3.5 Sonnet passed one (finding the annoying bug) and failed the other three.

I was somewhat bummed about Claude 3.5 Sonnet. The company specifically promised that this version was suited to programming. But as you can see, not so much. It's not that it can't program. It just can't program correctly.

I keep looking for an AI that can best the ChatGPT solutions, especially as platform and programming environment vendors start to integrate these other models directly into the programming process. But, for now, I'm going back to ChatGPT when I need programming help, and that's my advice to you as well.

Have you used an AI to help you program? Which one? How did it go? Let us know in the comments below.


