Programming languages: This open-source AI code generator is very good at writing in C

Researchers trained new model across 12 programming languages.
Written by Liam Tung, Contributing Writer

Researchers from Carnegie Mellon University have released PolyCoder, an automated code generator model that was trained on multiple programming languages, which they say is particularly good at writing code in C.

The researchers hope their open-source PolyCoder can democratize research into the field of AI code generation, which so far is dominated by well-funded companies like OpenAI and Alphabet-owned DeepMind.

"Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs... are not publicly available, leaving many questions about their model and data design decisions," the researchers said.

The researchers point out that OpenAI's Codex, unveiled in August, is available through Microsoft-owned GitHub's Copilot tool, but note that it provides "non-free access" to the model's output through black-box API calls, while the model's weights and training data remain unavailable.

The idea behind automated code generation is that it can save developers time, assuming the output is accurate and doesn't introduce security flaws. DeepMind claimed its recently unveiled AlphaCode code generator ranked in the top 54.3% of human participants in programming competitions. But training the model required "hundreds of petaFLOPS days" in Google's data centers.

"Despite the great success of large language models of code, the strongest models are not publicly available," the researchers note. "This prevents the application of these models outside of well-resourced companies and limits research in this field for low-resourced organizations."

To fix this, the researchers have delivered their own model trained on code from multiple programming languages that they have called "PolyCoder".

The researchers explained: "We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, that was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex." 

The model was trained on data from several GitHub repositories, covering 12 popular programming languages: C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala and TypeScript. The unfiltered dataset totaled 631GB of data and 38.9 million files. The researchers chose the GPT-2 architecture for PolyCoder because of budget constraints.

The researchers claimed some areas of success, particularly in C. However, Codex still trumped it in other languages. 

"Notably, PolyCoder outperforms Codex and all other models in the C language. Comparing the open-source models only, PolyCoder performs better than the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala and TypeScript," the researchers note.

"In the other 11 languages other than C, all other open-source models, including ours, are significantly worse (higher perplexity) than Codex," the researchers said.

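Perplexity, the metric behind these comparisons, measures how surprised a model is by real code: the lower the score, the more probability the model gave to the tokens that actually appeared. The short Python sketch below illustrates the standard definition only. The function name and the sample probabilities are hypothetical and are not taken from the researchers' evaluation.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical probabilities that two models assign to the same four tokens.
confident = [math.log(p) for p in (0.9, 0.8, 0.95, 0.85)]
uncertain = [math.log(p) for p in (0.3, 0.2, 0.5, 0.25)]

# The model that gives the true tokens more probability has lower perplexity.
print(perplexity(confident) < perplexity(uncertain))  # True
```

A model that spreads its guesses evenly over four possible tokens at every step would score a perplexity of 4 under this definition, so the number can be read roughly as how many choices the model is wavering between.
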