I tested Meta's Code Llama with 3 AI coding challenges that ChatGPT aced - and it wasn't good

Meta CEO Mark Zuckerberg recently unveiled Code Llama, a 70B parameter AI designed for coding. But how does it stack up against giants like ChatGPT? I put it to the test.
Written by David Gewirtz, Senior Contributing Editor

A few weeks ago, Meta CEO Mark Zuckerberg announced via Facebook that his company is open-sourcing its large language model (LLM) Code Llama, which is an artificial intelligence (AI) engine similar to GPT-3.5 and GPT-4 in ChatGPT.

Zuck announced three interesting things about this LLM: it's being open-sourced, it's designed to help write and edit code, and its model has 70B parameters. The hope is that developers can feed the model more challenging problems, and the engine will be more accurate when it answers.

The open-sourcing issue is interesting. It's an approach that implies that you could download the whole thing, install it on your own server, and use the model to get programming help without ever taking the risk that the Overlords of Facebook will hoover up your code for training or other nefarious purposes.

Doing this work involves setting up a Linux server and doing all sorts of hoop jumps. However, it turns out that the specialists at Hugging Face have already implemented the Code Llama 70B LLM into their HuggingChat interface. So, that's what I'm going to test next.

Getting started with Code Llama

To get started, you'll need to create a free account on Hugging Face. If you already have one (as I do), you can use the 70B Code Llama LLM with that account.

One thing that's important to note is that, while you could install Code Llama on your own server and thereby not share any of your code, the story is far different on Hugging Face. That service says that anything you type in might be shared with the model authors unless you turn off that option in settings:

Screenshot by David Gewirtz/ZDNET

When you log in to HuggingChat, you'll be presented with a blank chat screen. As you can see below, my current LLM is openchat/openchat-3.5-0106, but I'm going to change it to Code Llama -- and I'll show you how.

You change your current model in the settings, which you can get to by hitting the gear icon:

Screenshot by David Gewirtz/ZDNET

Once in settings, click codellama/CodeLlama-70b-Instruct-hf on the left (at 1), verify that the Code Llama LLM has been selected (at 2), and then click Activate (at 3):

Screenshot by David Gewirtz/ZDNET

Now, when you talk to the chat interface, you'll be using the Code Llama model, as verified at the top of the chat interface:

Screenshot by David Gewirtz/ZDNET

To test, I pulled the prompts from a previous coding test run I conducted with Bard (now Gemini) and ran the same tests in HuggingChat.

Test 1: Writing a WordPress plugin

My first test was the creation of a WordPress plugin. ChatGPT performed quite well at this task. Bard was weak, but tried its best. But how about Code Llama? Well, let's see. Here's the prompt:

Write a PHP 8 compatible WordPress plugin that provides a text entry field where a list of lines can be pasted into it and a button, that when pressed, randomizes the lines in the list and presents the results in a second text entry field with no blank lines and makes sure no two identical entries are next to each other (unless there's no other option)…with the number of lines submitted and the number of lines in the result identical to each other. Under the first field, display text stating "Line to randomize: " with the number of nonempty lines in the source field. Under the second field, display text stating "Lines that have been randomized: " with the number of non-empty lines in the destination field.

And here are the results, such as they are:

Screenshot by David Gewirtz/ZDNET

That ain't right on so many levels. First, Code Llama didn't create the plugin header, a very simple set of fields required by all plugins. Then, it generated code that my programming editor's code formatter couldn't interpret, indicating that there are missing elements in the code. 

What's more, the code can't be tested. This is a fail.
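
For reference, the core logic the prompt asks for (drop blank lines, shuffle, and keep identical entries apart unless there's no other option) can be sketched in a few lines. This is only an illustration, written in Python rather than PHP for brevity; it is not the output of any AI tested here, and the function name randomize_lines is hypothetical. It uses a greedy strategy, always picking a most-frequent remaining line that differs from the previous one and breaking ties at random, so the result is randomized but not uniformly so.

```python
import random
from collections import Counter


def randomize_lines(text, rng=random):
    """Return the non-blank lines of `text` in randomized order.

    Identical lines are kept apart whenever that is possible; the
    number of lines returned equals the number of non-blank input lines.
    (Hypothetical helper, for illustration only.)
    """
    # Keep only lines that contain something other than whitespace.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = Counter(lines)
    result = []
    prev = None
    while counts:
        # Prefer lines that differ from the one just placed.
        candidates = [item for item in counts if item != prev]
        if not candidates:
            # No other option: only copies of the previous line remain.
            candidates = list(counts)
        # Among the candidates, take one with the highest remaining
        # count, chosen at random when several are tied.
        top = max(counts[c] for c in candidates)
        choice = rng.choice([c for c in candidates if counts[c] == top])
        result.append(choice)
        counts[choice] -= 1
        if counts[choice] == 0:
            del counts[choice]
        prev = choice
    return result


if __name__ == "__main__":
    shuffled = randomize_lines("apple\nbanana\n\napple\ncherry\napple\n")
    print("\n".join(shuffled))
    print("Lines that have been randomized:", len(shuffled))
```

A real plugin would still need the WordPress plugin header, the two text fields, and the line counts under each field; the sketch covers only the shuffling step.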

Test 2: Rewriting a string function

For this test, I used a string rewriting request I once gave ChatGPT. I wanted to fix a bug in my code that was only allowing integers when it should have been allowing for dollars and cents (in other words, some number of digits, possibly followed by a period, and then, if there is a period, two more digits).

I fed in the same prompt I used in my previous article and got back this response:

Screenshot by David Gewirtz/ZDNET

This code works and I'm satisfied with it.
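
To make the requirement concrete, here is a minimal sketch of the dollars-and-cents rule described above: one or more digits, optionally followed by a period and exactly two more digits. It is written in Python as an illustration only and is not the code from the screenshot; the names AMOUNT_RE and is_valid_amount are hypothetical.

```python
import re

# One or more digits, optionally followed by a period and exactly
# two digits. [0-9] is used instead of \d so that only ASCII digits
# are accepted.
AMOUNT_RE = re.compile(r"[0-9]+(?:\.[0-9]{2})?")


def is_valid_amount(value):
    """Return True if `value` looks like 12 or 12.50 (hypothetical helper)."""
    # fullmatch requires the whole string to match the pattern.
    return AMOUNT_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["12", "12.50", "12.5", "12.", ".50"]:
        print(repr(sample), is_valid_amount(sample))
```

The original bug, allowing only integers, corresponds to a pattern without the optional decimal group.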

Test 3: Finding a bug I couldn't find 

Again, I reused a test I wrote about in a previous article. I'll point you to the original article if you want the details of the problem I tried out on Code Llama. The coding problem is long and fairly convoluted, which is why I couldn't find out what was wrong.

ChatGPT solved the problem immediately; Bard did not. Bard failed because it looked at the surface of the problem, not how the overall code was constructed and needed to run. An analogy is going to the doctor with a headache. One doctor might tell you to take two aspirin and not call him in the morning. The other doctor might try to find out the root cause of the headache and help solve that.

ChatGPT zeroed in on the root cause, and I was able to fix the bug. Bard just looked at the symptoms and didn't come up with a fix.

Unfortunately, Code Llama did exactly the same thing as Bard, looking at just the surface of the problem. The AI made recommendations, but those recommendations didn't improve the situation.

And the winner is...

My test suite is far from comprehensive. But if Code Llama fails on two of the three tests that didn't even slow down ChatGPT, it seems like the AI isn't ready for prime time.

The only reason you might want to use Code Llama over ChatGPT is if you install it on your own server, because then your code won't be shared with Meta. But what good is privacy if the thing doesn't give correct answers?

If ChatGPT hadn't been so good, I probably would have given some points to Code Llama. But we know what's possible with ChatGPT -- and Code Llama is far from that level. In short, it looks like Facebook has to Zuck it up and make some improvements.

To be honest, I expected better and I'm a little disappointed. But if there's one thing tech columnists get used to, it's being a little disappointed by many of the products and projects we look at. I think that's why we get so excited when something stands out and rocks our world. And Code Llama, unfortunately, isn't one of those.

Have you tried any of the AIs for coding help? Which ones have you used? How have they worked out? Let us know in the comments below.
