Microsoft researchers have been working on a deep-learning model that was trained to find software bugs without any real-world bugs to learn from.
While there are dozens of tools available for static analysis of code in various languages to find security flaws, researchers have been exploring techniques that use machine learning to improve the ability to both detect flaws and fix them. That's because finding and fixing bugs in code can be hard and costly, even when using AI to find them.
Researchers at Microsoft Research Cambridge, UK have detailed their work on BugLab, a Python implementation of "an approach for self-supervised learning of bug detection and repair". It's 'self-supervised' in that the two models behind BugLab were trained without labelled data.
This push to train without labelled data was driven by the lack of annotated real-world bugs with which to train bug-finding deep-learning models. While there are vast amounts of source code available for such training, it's largely not annotated.
BugLab aims to find hard-to-detect bugs rather than critical bugs that can already be found through traditional program analyses. The approach promises to avoid the costly process of manually coding a model to find these bugs.
The group claims to have found 19 previously unknown bugs in open-source Python packages from PyPI, as detailed in the paper Self-Supervised Bug Detection and Repair, presented at the recent Neural Information Processing Systems (NeurIPS) 2021 conference.
"BugLab can be taught to detect and fix bugs, without using labelled data, through a "hide and seek" game," explain Miltos Allamanis, a principal researcher at Microsoft Research, and Marc Brockschmidt, a senior principal research manager at Microsoft. Both researchers are authors of the paper.
Beyond reasoning over a piece of code's structure, they believe bugs can be found "by also understanding ambiguous natural language hints that software developers leave in code comments, variable names, and more."
Their approach in BugLab, which uses two competing models, builds on existing self-supervised learning efforts in deep learning, computer vision, and natural language processing (NLP). It is also "inspired by" generative adversarial networks (GANs), the neural networks sometimes used to create deep fakes.
"In our case, we aim to train a bug detection model without using training data from real-life bugs," they note in the paper.
BugLab's two models are a bug selector and a bug detector: "Given some existing code, presumed to be correct, a bug selector model decides if it should introduce a bug, where to introduce it, and its exact form (e.g., replace a specific "+" with a "-"). Given the selector choice, the code is edited to introduce the bug. Then, another model, the bug detector, tries to determine if a bug was introduced in the code, and if so, locate it, and fix it."
Their models are not a GAN because BugLab's "bug selector does not generate a new code snippet from scratch, but instead rewrites an existing piece of code (assumed to be correct)."
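To make the "hide" half of that game concrete, here is a minimal Python sketch of how a single training example could be produced by rewriting existing code. It is not BugLab's implementation: the names SWAPS, introduce_bug and select_bug are hypothetical, the coin flip stands in for the learned bug selector, and the only rewrite covered is the "+"/"-" swap quoted above. It assumes Python 3.9 or later for ast.unparse.

```python
import ast
import random

# Hypothetical rewrite table: only the "+" <-> "-" swap quoted in the article.
SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add}


class _SwapOperator(ast.NodeTransformer):
    """Swap the operator of the target-th swappable binary operation."""

    def __init__(self, target):
        self.target = target
        self.seen = 0

    def visit_BinOp(self, node):
        self.generic_visit(node)  # visit nested operations first
        if type(node.op) in SWAPS:
            if self.seen == self.target:
                node.op = SWAPS[type(node.op)]()
            self.seen += 1
        return node


def count_candidates(source):
    """Count the binary operations that the rewrite table could change."""
    return sum(
        isinstance(n, ast.BinOp) and type(n.op) in SWAPS
        for n in ast.walk(ast.parse(source))
    )


def introduce_bug(source, target):
    """Return the source with the target-th candidate operator swapped."""
    tree = _SwapOperator(target).visit(ast.parse(source))
    return ast.unparse(tree)


def select_bug(source, rng):
    """Toy stand-in for the bug selector (assumption: a coin flip, not a model).

    Returns (code, label): label is None if the code was left alone,
    otherwise the index of the operator that was rewritten.
    """
    n = count_candidates(source)
    if n == 0 or rng.random() < 0.5:
        return ast.unparse(ast.parse(source)), None
    target = rng.randrange(n)
    return introduce_bug(source, target), target


if __name__ == "__main__":
    original = "def total(price, tax):\n    return price + tax\n"
    code, label = select_bug(original, random.Random())
    print(label)
    print(code)
```

Each (code, label) pair is the kind of example a detector could be trained on: it sees only the code and has to predict whether an operator was changed and which one. In BugLab both sides are neural models trained against each other, which this sketch does not attempt to reproduce.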
On the researchers' test dataset of 2,374 real-life Python package bugs, they showed that 26% of bugs can be found and fixed automatically.
However, their technique also raised many false positives, or warnings about code that wasn't actually buggy. For example, while it detected some known bugs, only 19 of the 1,000 warnings BugLab reported (1.9%) were real-life bugs.
Training a neural network without using real bug-training data sounds like a tough nut to crack. For example, some warnings were obviously spurious, yet were raised by the neural models anyway.
"Some reported issues were sufficiently complex that it took us (the human authors) a couple of minutes of thought to conclude that a warning is spurious," they note in the paper.
"Simultaneously, there are some warnings that are "obviously" incorrect to us, but the reasons why the neural models raise them is unclear."
As for the 19 previously unknown bugs they found, the researchers reported 11 of them on GitHub; six of those fixes have been merged and five are pending approval. Some of the 19 bugs were too minor to bother reporting.