A scan of billions of files from 13 percent of all GitHub public repositories over a period of six months has revealed that over 100,000 repos have leaked API tokens and cryptographic keys, with thousands of new repositories leaking new secrets on a daily basis.
The scan was the object of academic research carried out by a team from North Carolina State University (NCSU), and the study's results have been shared with GitHub.
Academics scanned billions of GitHub files
The NCSU study is the most comprehensive and in-depth GitHub scan to date and exceeds any previous research of its kind.
NCSU academics scanned GitHub accounts for a period of nearly six months, between October 31, 2017, and April 20, 2018, and looked for text strings formatted like API tokens and cryptographic keys.
They didn't rely solely on the GitHub Search API to look for these text patterns, as previous research efforts had, but also looked at GitHub repository snapshots recorded in Google's BigQuery database.
Across the six-month period, researchers analyzed billions of files from millions of GitHub repositories.
In a research paper published last month, the three-man NCSU team said they captured and analyzed 4,394,476 files representing 681,784 repos using the GitHub Search API, and another 2,312,763,353 files from 3,374,973 repos that had been recorded in Google's BigQuery database.
NCSU team scanned for API tokens from 11 companies
Inside this gigantic pile of files, researchers looked for text strings that were in the format of particular API tokens or cryptographic keys.
Since not all API tokens and cryptographic keys are in the same format, the NCSU team decided on 15 API token formats (from 15 services belonging to 11 companies, five of which were from the Alexa Top 50), and four cryptographic key formats.
This included API key formats used by Google, Amazon, Twitter, Facebook, Mailchimp, MailGun, Stripe, Twilio, Square, Braintree, and Picatic.
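As a rough illustration of what format-based scanning looks like, the following Python sketch searches a block of text for three widely documented key shapes: an AWS access key ID, a Google API key, and the header of a PEM-encoded RSA private key. The pattern set, the `PATTERNS` name, and the `find_candidate_secrets` helper are assumptions made for this example; they are not the 15 token formats and four key formats the NCSU team actually used.

```python
import re

# Illustrative patterns only: commonly documented key shapes, NOT the
# NCSU team's actual pattern set.
PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Google API keys: "AIza" followed by 35 URL-safe characters.
    # Lookarounds are used instead of \b because the key may end in "-".
    "google_api_key": re.compile(
        r"(?<![0-9A-Za-z_\-])AIza[0-9A-Za-z_\-]{35}(?![0-9A-Za-z_\-])"
    ),
    # Header line of a PEM-encoded RSA private key.
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_candidate_secrets(text):
    """Return (kind, matched_string) pairs for every pattern hit in text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key from AWS documentation.
    sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
    for kind, value in find_candidate_secrets(sample):
        print(kind, value)
```

A match only means a string has the right shape. It says nothing about whether the key is live, which is why a hit from a scan like this is a candidate rather than a confirmed leak.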
Results came back right away, with thousands of leaked API and cryptographic keys found on every day of the research project.
In total, the NCSU team said they found 575,456 API and cryptographic keys, of which 201,642 were unique, all spread over more than 100,000 GitHub projects.
An observation that the research team made in their academic paper was that the "secrets" found using the GitHub Search API and the ones found via the Google BigQuery dataset had little overlap.
"After joining both collections, we determined that 7,044 secrets, or 3.49% of the total, were seen in both datasets. This indicates that our approaches are largely complementary," researchers said.
Furthermore, most of the API tokens and cryptographic keys --93.58 percent-- came from single-owner accounts, rather than multi-owner repositories.
What this means is that the vast majority of API and cryptographic keys found by the NCSU team were most likely valid tokens and keys used in the real world, as multi-owner accounts usually tend to contain test tokens used for shared-testing environments and with in-dev code.
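The overlap figure the researchers quote can be reproduced from the numbers reported above: 7,044 secrets seen in both datasets, out of 201,642 unique secrets. The Python sketch below does that arithmetic and also shows how such an overlap could be computed with a set intersection. The function names and the idea of representing each dataset as a collection of strings are assumptions for illustration, not the team's code.

```python
def overlap_percent(shared, total_unique):
    """Share of unique secrets that appear in both datasets, in percent."""
    return 100.0 * shared / total_unique


def shared_secrets(search_hits, bigquery_hits):
    """Secrets present in both collections (hypothetical string inputs)."""
    return set(search_hits) & set(bigquery_hits)


if __name__ == "__main__":
    # Figures reported in the article: 7,044 shared out of 201,642 unique.
    print(round(overlap_percent(7_044, 201_642), 2))  # 3.49
```

An overlap this small means each source surfaced secrets the other missed, which is the basis for the researchers' statement that the two approaches are complementary.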
Leaked API and crypto keys hang around for weeks
Because the research project took place over a six-month period, researchers also had a chance to observe if and when account owners would realize they had leaked API and cryptographic keys and remove the sensitive data from their code.
The team said that six percent of the API and cryptographic keys they tracked were removed within an hour after they leaked, suggesting that these GitHub owners realized their mistake right away.
Over 12 percent of keys and tokens were gone after a day, while 19 percent had been removed within 16 days.
"This also means 81% of the secrets we discover were not removed," researchers said. "It is likely that the developers for this 81% either do not know the secrets are being committed or are underestimating the risk of compromise."
Research team uncovers some high-profile leaks
The extraordinary quality of these scans was evident when researchers started looking at what some of these leaks were and where they originated.
"In one case, we found what we believe to be AWS credentials for a major website relied upon by millions of college applicants in the United States, possibly leaked by a contractor," the NCSU team said.
"We also found AWS credentials for the website of a major government agency in a Western European country. In that case, we were able to verify the validity of the account, and even the specific developer who committed the secrets. This developer claims in their online presence to have nearly 10 years of development experience."
In another case, researchers also found 564 Google API keys that were being used by an online site to skirt YouTube rate limits and download YouTube videos that they'd later host on another video sharing portal.
"Because the number of keys is so high, we suspect (but cannot confirm) that these keys may have been obtained fraudulently," NCSU researchers said.
Last, but not least, researchers also found 7,280 RSA keys inside OpenVPN config files. By looking at the other settings found inside these configuration files, researchers said that the vast majority of the users had disabled password authentication and were relying solely on the RSA keys for authentication, meaning anyone who found these keys could have gained access to thousands of private networks.
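To make the OpenVPN finding more concrete, here is a minimal Python sketch of how a config file could be checked for that combination: a private key embedded inline and no username/password prompt. It assumes the key is embedded with OpenVPN's inline `<key>` block and that password authentication, when enabled, shows up as the `auth-user-pass` directive. The article does not say which settings the researchers inspected, so this mapping and the `relies_only_on_embedded_key` helper are assumptions.

```python
def relies_only_on_embedded_key(config_text):
    """Return True if an OpenVPN config embeds a private key inline and
    does not request username/password authentication.

    Assumption: "password authentication" corresponds to the
    auth-user-pass directive; the researchers' exact check is not known.
    """
    has_inline_key = False
    asks_for_password = False
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (OpenVPN accepts '#' and ';').
        if not line or line.startswith("#") or line.startswith(";"):
            continue
        if line == "<key>":
            has_inline_key = True
        elif line.split()[0] == "auth-user-pass":
            asks_for_password = True
    return has_inline_key and not asks_for_password


if __name__ == "__main__":
    # Hypothetical config with the key body elided.
    example = "client\nremote vpn.example.com 1194\n<key>\n...\n</key>\n"
    print(relies_only_on_embedded_key(example))
```

A config that passes this check hands network access to whoever holds the file, which is the risk the researchers describe.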
The high quality of the scan results was also apparent when researchers used other API token-scanning tools to analyze their own dataset, to determine the efficiency of their scan system.
"Our results show that TruffleHog is largely ineffective at detecting secrets, as its algorithm only detected 25.236% of the secrets in our Search dataset and 29.39% in the BigQuery dataset," the research team said.
GitHub is aware and on the job
"We have discussed the results with GitHub," Brad Reaves, Assistant Professor at the Department of Computer Science at North Carolina State University, told ZDNet in an interview today. "We were told they are monitoring additional secrets beyond those listed in the documentation, but we weren't given further details."
"Because leakage of this type is so pervasive, it would have been very difficult for us to notify all affected developers. One of the many challenges we faced is that we simply didn't have a way to obtain secure contact information for GitHub developers at scale," Reaves added.
"At the time our paper went to press, we were trying to work with GitHub to do notifications, but given the overlap between our token scanning and theirs, they felt an additional notification was not necessary."
In an email, a GitHub spokesperson told ZDNet that the API keys the NCSU team reported are now likely void tokens, as GitHub was working at the time on a security feature named Token Scanning.
This feature scans GitHub push commits to public repos for API keys and notifies service providers, who may, in turn, revoke the key or notify its owner.
API key leaks --a known issue
The problem of developers leaving their API and cryptographic keys in the source code of apps and websites is not a new one. Amazon has urged web devs to search their code and remove any AWS keys from public repos as far back as 2014, and has even released a tool to help them scan repos before committing any code to a public repo.
Some companies have taken it upon themselves to scan GitHub and other code-sharing repositories for accidentally exposed API keys, and revoke the tokens even before API key owners notice the leak or abuse.
What the NCSU study has done is provide the most in-depth look at this problem to date.
The paper that Reaves authored together with Michael Meli and Matthew R. McNiece is titled "How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories," and is available for download in PDF format.
"Our findings show that credential management in open-source software repositories is still challenging for novices and experts alike," Reaves told us.