On a mission to measure AI-assisted developer productivity, researchers at GitHub recently ran an experiment comparing coding speeds of a group using its Copilot code completion tool versus a group relying on human ability alone.
GitHub Copilot is an AI pair-programming service that launched publicly earlier this year for $10 per user per month or $100 per user per year. Since the launch, researchers have been curious to know whether these AI tools really translate into a boost to developer productivity. The catch is that it's not easy to identify the right metrics to measure performance changes.
Copilot is used as an extension to code editors, such as Microsoft's VS Code. It generates code suggestions in multiple programming languages that users can accept, reject or edit. The suggestions are provided by OpenAI's Codex, a system that translates natural language to code and is based on OpenAI's GPT-3 language model.
Google Research and the Google Brain Team concluded in July, after studying the impact of AI code suggestions on the productivity of more than 10,000 of Google's own developers, that the debate over relative performance speed remains an "open question". That's despite concluding that a combination of traditional rule-based semantic engines and large language models, such as Codex/Copilot, "can be used to significantly improve developer productivity with better code completion".
But how do you measure productivity? Earlier this year, other researchers, using a small sample of 24 developers, found that Copilot didn't necessarily improve task completion time or success rate. However, they found Copilot did save developers the effort of searching online for code snippets to solve particular problems. This is an important indicator of how much an AI tool like Copilot can reduce context switches, when developers hop in and out of an editor to solve a problem.
GitHub also surveyed over 2,600 developers, asking questions like, "Do people feel like GitHub Copilot makes them more productive?" Its researchers also had the benefit of unique access to large-scale telemetry data and published the research in June. Among other things, the researchers found that between 60% and 75% of users feel more fulfilled with their job when using Copilot, feel less frustrated when coding, and are able to focus on more satisfying work.
"In our research, we saw that GitHub Copilot supports faster completion times, conserves developers' mental energy, helps them focus on more satisfying work, and ultimately find more fun in the coding they do," GitHub said.
GitHub researcher Dr. Eirini Kalliamvakou explained the approach: "We conducted multiple rounds of research including qualitative (perceptual) and quantitative (observed) data to assemble the full picture. We wanted to verify: (a) Do users' actual experiences confirm what we infer from telemetry? (b) Does our qualitative feedback generalize to our large user base?"
Kalliamvakou, who was involved in the original study, has now built on it with an experiment involving 95 developers that focussed on the question of coding speed with and without Copilot.
This research found that the group who used Copilot (45 developers) completed the task in an average of 1 hour and 11 minutes. The group who didn't use Copilot (50 developers) completed it in an average of 2 hours and 41 minutes. In other words, the Copilot group took roughly 55% less time than the group without it.
Kalliamvakou also found a higher percentage of the group with Copilot completed the task – 78% of the Copilot group versus 70% in the group without Copilot.
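The 55% figure is easiest to read as time saved relative to the group without Copilot. The article does not spell out how it was derived, so the short Python sketch below is only one plausible reading, built from the averages and group sizes quoted above. The function names are illustrative and are not taken from GitHub's research.

```python
# Averages and group sizes as reported in the article.
COPILOT_MINUTES = 1 * 60 + 11   # 1 hour 11 minutes
CONTROL_MINUTES = 2 * 60 + 41   # 2 hours 41 minutes
COPILOT_GROUP, CONTROL_GROUP = 45, 50


def percent_less_time(treated: float, control: float) -> float:
    """Time saved by the treated group, as a share of the control group's time."""
    return (control - treated) / control * 100


def percent_higher_throughput(treated: float, control: float) -> float:
    """Extra tasks per unit of time the treated group could finish (an alternative reading of 'faster')."""
    return (control / treated - 1) * 100


def approx_completions(group_size: int, completion_rate: float) -> int:
    """Approximate number of developers who finished, given a reported rate."""
    return round(group_size * completion_rate)


if __name__ == "__main__":
    saved = percent_less_time(COPILOT_MINUTES, CONTROL_MINUTES)
    speedup = percent_higher_throughput(COPILOT_MINUTES, CONTROL_MINUTES)
    print(f"Time saved relative to control: {saved:.1f}%")
    print(f"Throughput gain (alternative reading): {speedup:.1f}%")
    print("Approx. finishers with Copilot:", approx_completions(COPILOT_GROUP, 0.78))
    print("Approx. finishers without Copilot:", approx_completions(CONTROL_GROUP, 0.70))
```

Under the first reading the averages give a saving of just under 56%, which is in line with the reported 55%. The second reading gives a much larger number, which is why it matters which definition of "faster" is being used.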
The experiment didn't look at other factors that contribute to productivity, such as context switching. However, GitHub's earlier study found that 73% of developers reported that Copilot helped them stay in the flow.
In an email, Kalliamvakou explained to ZDNET what this figure meant in terms of context switching and developer productivity.
"Reporting 'staying in the flow' certainly implies less context switching, and we have extra evidence. 77% of those surveyed reported that, when using GitHub Copilot, they spend less time searching," she wrote.
"The statement gauges a known context switch for developers, such as looking up documentation, or visiting Q&A sites like Stack Overflow to find answers or ask questions. With GitHub Copilot bringing information into the editor, developers don't need to switch out of the IDE as often," she explained.
But context switching alone can't give the full picture of how AI code suggestions affect productivity. There's also "good" and "bad" context switching, which makes its impact difficult to measure.
During a typical task, developers switch between different activities, tools and information sources a lot, Kalliamvakou explained.
She pointed to a study published in 2014 that found developers spend on average 1.6 minutes on an activity before switching, or switch on average 47 times an hour.
"That's just because of the nature of their work and the multitude of tools they use, so it's considered 'good' context switching. In contrast, there is 'bad' context switching due to delays or interruptions," she said.
"We found in our earlier research that this hurts productivity a lot, as well as developers' own sense of progress. Context switching is hard to measure, because we don't have a good way to distinguish automatically between 'good' and 'bad' instances – or when a switch is part of completing a task versus causes a disruption to developers' flow and productivity. However, there are ways to gauge context switching through self-reports and observations which we do use in our research."
As for Copilot's performance with other languages, Kalliamvakou says she's interested in conducting experiments in the future.
"It was certainly a fun experiment to do. These controlled experiments are quite time-consuming as we try to make them bigger or more comprehensive, but I'd like to explore testing for other languages in the future," she said.
Kalliamvakou posted other key findings from GitHub's large-scale survey in a blog post detailing its quest to find the most suitable metrics to gauge developer productivity.