Yikes! Microsoft Copilot failed every single one of my coding tests

I ran Microsoft Copilot against Meta AI, Code Llama, Gemini Advanced, and ChatGPT. It managed to get every test wrong. That's some kind of record or something.

Written by David Gewirtz, Senior Contributing Editor April 29, 2024 at 7:12 a.m. PT

Recently, my ZDNET colleague and fellow AI explorer Sabrina Ortiz wrote an article entitled "7 reasons I use Copilot instead of ChatGPT." I had never been terribly impressed with Copilot, especially since it failed some fact-checking tests I ran against it last year. But Sabrina made some really good points about the benefits of Microsoft's offering, so I thought I'd give it another try.

To be clear, because Microsoft names everything Copilot, the Copilot I'm testing is the general-purpose chatbot. There is a GitHub version of Copilot, but that runs as an extension inside Visual Studio Code and is available for a monthly or yearly fee. I did not test GitHub Copilot.

Instead, I loaded my standard set of four tests and fed them into the chatbot version of Copilot.

To recap, here is a description of the tests I'm using:

Writing a WordPress plugin: This tests basic web development, using the PHP programming language, inside of WordPress. It also requires a bit of user interface building. If an AI chatbot passes this test, it can help create rudimentary code as an assistant to web developers. I originally documented this test in "I asked ChatGPT to write a WordPress plugin I needed. It did it in less than 5 minutes."
Rewriting a string function: This test evaluates how an AI chatbot updates a utility function for better functionality. If an AI chatbot passes this test, it might be able to help create tools for programmers. If it fails, first-year programming students can probably do a better job. I originally documented this test in "OK, so ChatGPT just debugged my code. For real."
Finding an annoying bug: This test requires intimate knowledge of how WordPress works because the obvious answer is wrong. If an AI chatbot can answer this correctly, then its knowledge base is pretty complete, even with frameworks like WordPress. I originally documented this test in "OK, so ChatGPT just debugged my code. For real."
Writing a script: This test asks an AI chatbot to program using two fairly specialized programming tools not known to many users. It essentially tests the AI chatbot's knowledge beyond the big languages. I originally documented this test in "Google unveils Gemini Code Assist and I'm cautiously optimistic it will help programmers."

Let's dig into the results of each test and see how they compare to previous tests using Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

Here's Copilot's result on the left and the ChatGPT result on the right.

Figure: Copilot (left) vs. ChatGPT (right). Screenshot by David Gewirtz/ZDNET

Unlike ChatGPT, which styled the fields to look uniform, Copilot left that as an exercise for the user, stating "Remember to adjust the styling and error handling as needed."

To test, I inserted a set of names. When I clicked Randomized Lines, I got nothing back in the result field.

A look at the code showed some interesting programming mistakes, indicating that Copilot didn't really know how to write code for WordPress. For example, it assigned the hook intended to process the form to the admin_init action. That action doesn't cause the form to be processed; it's what initializes the admin interface.

It also didn't have code to actually display the randomized lines. It does store them in a value, but it doesn't retrieve and display them. The duplicate check was partially correct in that it did sort names together, but it didn't compare names to each other, so duplicates were still allowed.

Copilot is apparently using a more advanced LLM (GPT-4) than the one behind the free version of ChatGPT (GPT-3.5), which is what I ran these tests on, and yet ChatGPT's results still seem to be better. I find that a bit baffling.

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Interface: adequate, functionality: fail
Meta AI: Interface: adequate, functionality: fail
Meta Code Llama: Complete failure
Google Gemini Advanced: Interface: good, functionality: fail
ChatGPT: Interface: good, functionality: good

2. Rewriting a string function

This test checks dollars-and-cents conversions. While the Copilot-generated code does properly flag an error if a value containing a letter or more than one decimal point is sent to it, it doesn't perform a complete validation.

For example, it allows for leading zeroes. It also allows for more than two digits to the right of the decimal point.

While it does properly generate errors for the more egregious entry mistakes, the values it allows as correct could cause subsequent routines to fail, if they're expecting a strict dollars and cents value.
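To make the gap concrete, here is a minimal Python sketch of what a stricter check might look like. It is not the function from the test, and it is not Copilot's output (neither is shown in this article). The exact rules are assumptions: no leading zeroes (a lone 0 is allowed), and the decimal part, if present, must be exactly two digits. The names STRICT_DOLLARS_AND_CENTS and is_strict_dollars_and_cents are hypothetical.

```python
import re

# Assumed strict format (not taken from the original test):
#   - whole part: "0" or digits with no leading zero
#   - optional decimal part: a point followed by exactly two digits
STRICT_DOLLARS_AND_CENTS = re.compile(r"(0|[1-9][0-9]*)(\.[0-9]{2})?")


def is_strict_dollars_and_cents(value: str) -> bool:
    """Return True only if the whole string matches the assumed strict format."""
    return STRICT_DOLLARS_AND_CENTS.fullmatch(value) is not None


if __name__ == "__main__":
    # Letters and multiple decimal points are rejected, but so are
    # leading zeroes and more than two digits after the decimal point.
    for sample in ["12.50", "0.99", "7", "007.50", "12.505", "1.2.3", "12a"]:
        print(sample, is_strict_dollars_and_cents(sample))
```

A check like this rejects the looser values described above (leading zeroes, extra decimal digits), so routines further down the line can rely on a consistent format.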

If a student turned this in as an assignment, I might give it a C. But if programmers in the real world are relying on Copilot to generate code that won't cause failures down the line, what Copilot generated is just not good enough. I have to give it a fail.

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Succeeded
Google Gemini Advanced: Failed
ChatGPT: Succeeded

3. Finding an annoying bug

Well, this is new. Okay, first, let me back up and put this test into context. This tests the AI's ability to think a few chess moves ahead. The answer that seems obvious isn't the right answer. I got caught by that when I was originally debugging the issue that eventually became this test.

ChatGPT, much to my very great surprise at the time, saw through the "trick" of the problem and correctly identified what the code was doing wrong. To do so, it had to see not just what the code itself said, but how it behaved based on the way the WordPress API worked. Like I said, I was pretty shocked that ChatGPT could be that sophisticated.

Copilot, well, not so much. Copilot suggests I check the spelling of my function name and the WordPress hook name. The WordPress hook is a published thing, so it should be able to confirm, as I did, that it was spelled correctly. And my function is my function, so I can spell it however I want. If I had misspelled it somewhere in the code, the IDE would have very visibly pointed it out.

It also quite happily repeated the problem statement to me, suggesting I solve it. That's what I asked it to do, and it turned it back to me, telling me the problem I told it, and then telling me it would work if I debugged it. Then, it ended with "consider seeking support from the plugin developer or community forums. 😊" -- and yeah, that emoji was part of the AI's response.

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
Meta AI: Succeeded
Meta Code Llama: Failed
Google Gemini Advanced: Failed
ChatGPT: Succeeded

4. Writing a script

I wouldn't originally have tried this test on an AI, but I had tried it on a lark with ChatGPT and it figured it out. So did Gemini Advanced.

The idea with this test is that it asks about a fairly obscure Mac scripting tool called Keyboard Maestro, as well as Apple's scripting language AppleScript, and Chrome scripting behavior. For the record, Keyboard Maestro is one of the single biggest reasons I use Macs over Windows for my daily productivity, because it allows the entire OS and the various applications to be reprogrammed to suit my needs. It's that powerful.

In any case, to pass the test, the AI has to properly describe how to solve the problem using a mix of Keyboard Maestro code, AppleScript code, and Chrome API functionality. Continuing its trend, Copilot didn't do it right. It completely ignored Keyboard Maestro (I'm guessing it's not in its dataset).

In the generated AppleScript, where I asked it to just scan the current window, Copilot repeated the process for all windows, returning results for the wrong window (the last one in the chain).
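The shape of that mistake can be sketched in a few lines of Python. This is only a stand-in for the logic, not the AppleScript Copilot produced (which isn't reproduced here) and not the Chrome scripting interface. The Window class, the scan function, and the sample window titles are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Window:
    """Hypothetical stand-in for a browser window."""
    title: str
    is_front: bool


def scan(window: Window) -> str:
    """Hypothetical scan step; here it just reports which window it saw."""
    return f"scanned {window.title}"


def scan_all_keep_last(windows: List[Window]) -> Optional[str]:
    """The flawed pattern: loop over every window, overwriting the result each time."""
    result = None
    for window in windows:
        result = scan(window)  # each pass replaces the previous result
    return result  # whatever window came last, not the current one


def scan_front_only(windows: List[Window]) -> Optional[str]:
    """The intended pattern: scan only the current (front) window."""
    for window in windows:
        if window.is_front:
            return scan(window)
    return None


if __name__ == "__main__":
    windows = [Window("Inbox", True), Window("Docs", False), Window("News", False)]
    print(scan_all_keep_last(windows))  # result comes from the last window
    print(scan_front_only(windows))     # result comes from the front window
```

Looping over every window and keeping only the final result returns data for whichever window happens to be last in the list, which is how a script can run without errors and still answer for the wrong window.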

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Failed
Google Gemini Advanced: Succeeded
ChatGPT: Succeeded

Overall results

Here are the overall results of the four tests:

Microsoft Copilot: 0 out of 4 succeeded
Meta AI: 1 out of 4 succeeded
Meta Code Llama: 1 out of 4 succeeded
Google Gemini Advanced: 1 out of 4 succeeded
ChatGPT: 4 out of 4 succeeded
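For anyone who wants to double-check the totals, here is a small Python sketch that tallies the per-test results listed in the four sections above. The pass/fail values come straight from those lists; the one judgment call (an assumption) is that a functionality fail on the WordPress plugin test counts as a failed test, even where the interface was rated adequate or good. The names RESULTS and tally are hypothetical.

```python
# Pass/fail per test, in order: WordPress plugin, string function,
# annoying bug, script. True means the test succeeded.
RESULTS = {
    "Microsoft Copilot": [False, False, False, False],
    "Meta AI": [False, False, True, False],
    "Meta Code Llama": [False, True, False, False],
    "Google Gemini Advanced": [False, False, False, True],
    "ChatGPT": [True, True, True, True],
}


def tally(results):
    """Count how many tests each chatbot passed."""
    return {name: sum(passed) for name, passed in results.items()}


if __name__ == "__main__":
    for name, count in tally(RESULTS).items():
        print(f"{name}: {count} out of {len(RESULTS[name])} succeeded")
```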

The results here really surprised me. It's been about five months since I last tested Copilot against other AIs. I fully expected Microsoft to have worked out the bugs. I expected that Copilot would do as well as, or perhaps even better than, ChatGPT. After all, Microsoft is a huge investor in OpenAI (makers of ChatGPT) and Copilot is based on the same language model as ChatGPT.

And yet, it failed spectacularly, turning in the worst score of any of the AIs I've tried by not passing a single coding test. Not one. The last time I tested Copilot, I tried doing some fact-checking using all the AIs. All the other AIs answered the question and gave back fairly usable results. Copilot simply returned the data I asked it to verify, which is similar to the behavior I found in Test 3 above.

I'm not impressed. In fact, I find the results from Microsoft's flagship AI offering to be a little demoralizing. It should be so much better. Ah well, Microsoft does improve its products over time. Maybe by next year.

Have you tried coding with Copilot, Meta AI, Gemini, or ChatGPT? What has your experience been? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
