The Linux Foundation identifies most important open-source software components and their problems

In its latest study, the Linux Foundation's Core Infrastructure Initiative discovered just how prevalent open-source components are in all software, along with the problems and vulnerabilities they share.
Written by Steven Vaughan-Nichols, Senior Contributing Editor

Red Hat recently reported open-source software now dominates the enterprise. Actually, it does more than that. Another, older study found open-source software makes up 80% to 90% of all software. You may not know that, because many of these programs are built on deeply buried open-source components. Now, The Linux Foundation's Core Infrastructure Initiative (CII) and the Laboratory for Innovation Science at Harvard (LISH) have revealed -- in "Vulnerabilities in the Core, a preliminary report and Census II of open-source software" -- the most frequently used components and the vulnerabilities they share.

This Census II analysis and report is the first major study of its kind but isn't a final analysis. It takes important first steps and lays out a methodology for understanding and addressing the structural and security complexities of open-source software. Specifically, it identifies the most commonly used free and open-source software (FOSS) components in production applications and examines them for potential vulnerabilities.

To create this work, CII and LISH partnered with software composition analysis (SCA) and application security companies such as Snyk and Synopsys Cybersecurity Research Center. They combined private usage data with publicly available datasets to identify over 200 of the most used open-source software projects.

These are not the programs -- Apache, MySQL, Linux -- that probably spring to your mind. For all the foundational importance of those big names, it's the small building-block programs that are most widely used.

They may be small, sometimes less than a hundred lines of code (LoC), but they're vital. As Frank Nagle, a professor at Harvard Business School and co-director of the Census II project, said:

"FOSS was long seen as the domain of hobbyists and tinkerers. However, it has now become an integral component of the modern economy and is a fundamental building block of everyday technologies like smart phones, cars, the Internet of Things, and numerous pieces of critical infrastructure. Understanding which components are most widely used and most vulnerable will allow us to help ensure the continued health of the ecosystem and the digital economy." 

It's those hidden, small programs that, if you don't keep an eye on them, can cause trouble. For example, with the Heartbleed security bug, the real problem wasn't with the relatively well-known OpenSSH security program, but with its lower-level, more obscure cryptographic library, OpenSSL.

If you're an end-user, you've probably never heard of these low-level programs. For that matter, it would be a rare CIO or even CTO who knows about them. As venture capitalist Mike Volpi wrote: "The real customers of open source are the developers who often discover the software, and then download and integrate it into the prototype versions of the projects that they are working on."

Once those open-source components have made their way into a user-facing program, they're not coming out. 

Many of these sub-programs are in JavaScript. In large part, that's because JavaScript packages tend to be small -- 112 LoC on average -- and often perform only a single function. In contrast, the average Python module in the PyPI repository has over 2,200 LoC. So, when you measure programs by dependencies, JavaScript shows up far more often.
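
To make "measuring by dependencies" concrete, here is a minimal Python sketch that counts how many applications pull in each package. It is not the Census II methodology, which combined private and public datasets. The application names, the sample inventories, and the helper functions are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical inventories: each application maps to the packages it pulls in.
# The data is illustrative and is not taken from the Census II datasets.
SAMPLE_APPS: Dict[str, List[str]] = {
    "app-a": ["lodash", "isarray", "inherits"],
    "app-b": ["isarray", "inherits", "qs"],
    "app-c": ["inherits", "minimist", "isarray", "isarray"],
}


def count_dependency_usage(apps: Dict[str, List[str]]) -> Counter:
    """Count how many applications depend on each package."""
    usage: Counter = Counter()
    for deps in apps.values():
        # A package counts once per application, however often it is listed.
        usage.update(set(deps))
    return usage


def most_used(usage: Counter, limit: int) -> List[Tuple[str, int]]:
    """Return the top packages, highest count first, ties broken by name."""
    ranked = sorted(usage.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    print(most_used(count_dependency_usage(SAMPLE_APPS), 3))
```

With counts like these, small single-purpose packages that many applications share rise to the top, even though each one contains very little code.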

The most commonly used JavaScript programs

  • Async: A utility module that provides functions for working with asynchronous JavaScript. Although originally designed for use with Node.js and installable via npm install async, it can also be used directly in the browser. 
  • Inherits: Browser-friendly inheritance fully compatible with standard node.js inherits. 
  • Isarray: Array#isArray for older browsers and deprecated Node.js versions. 
  • Kind-of: Get the native JavaScript type of a value. 
  • Lodash: A modern JavaScript utility library delivering modularity, performance, and extras. 
  • Minimist: Parse argument options. This module is the guts of optimist's argument parser without all the fanciful decoration. 
  • Natives: Calls Node.js's native JavaScript modules. 
  • Qs: A query string parsing and stringifying library with some added security.
  • Readable-stream: Node.js core streams for userland. 
  • String_decoder: Node-core string_decoder for userland. 

The most widely used non-JavaScript software components

  • Com.fasterxml.jackson.core:jackson-core: A core part of Jackson, a JavaScript Object Notation (JSON) processor that defines Streaming API as well as basic shared abstractions. 
  • Com.fasterxml.jackson.core:jackson-databind: General data-binding package for Jackson (2.x): works on streaming API implementations.
  • Com.google.guava:guava: Google core libraries for Java.  
  • Commons-codec: Apache Commons-Codec software provides implementations of common encoders and decoders such as Base64, Hex, Phonetic and URLs. 
  • Commons-io: Commons IO is a library of utilities to assist with developing IO functionality. 
  • Httpcomponents-client: The Apache HttpComponents project is a toolset of low-level Java components focused on HTTP and associated protocols. 
  • Httpcomponents-core: The Apache HttpComponents project is a toolset of low-level Java components focused on HTTP and associated protocols. 
  • Logback-core: The reliable, generic, fast and flexible logging framework for Java. 
  • Org.apache.commons:commons-lang3: A package of Java utility classes for the classes that are in java.lang's hierarchy, or are considered to be so standard as to justify existence in java.lang. 
  • Slf4j: Simple Logging Facade for Java. 

Along the way, the researchers discovered that the myth of the lone open-source developer toiling away for love of coding in their parents' basement is just that: a myth. A programmer may start at home, but they don't stay there -- unless they like their home office.

Instead, the research found a high correlation between being employed and being a top contributor to the most popular FOSS packages. An analysis of 2017 GitHub data found that some of the most active FOSS developers contributed to projects under their Microsoft, Google, IBM, or Intel employee email addresses. In a future study, the CII and partners will further investigate exactly who today's open-source developers are and who they work for.

The researchers also found several common problems with open-source components. 

The first is that there's little rhyme or reason to how programs are named. The lack of a standardized naming schema for software components makes tracking these programs a major pain. That's not unique to open-source software. The National Institute of Standards and Technology (NIST) and National Telecommunications and Information Administration (NTIA) have grappled with this issue for decades.

The Linux Foundation and partners think there's a "critical need for a standardized software component naming schema." They're not wrong. "Until one exists, strategies for software security, transparency, and more will have limited effect. Organizations will remain categorically unable to communicate with each other", and "given the increasing frequency and sophistication of cybersecurity incidents in which the software supply chain plays a part", this is a real problem. 
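
A small Python sketch shows why inconsistent naming makes tracking hard. It assumes Maven-style "group:artifact" names like those in the list above. The three spellings and the matching heuristic are hypothetical illustrations, not something the report prescribes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_coordinate(name: str) -> Tuple[str, str]:
    """Split 'group:artifact' into (group, artifact), lower-cased.

    Names without a colon get an empty group.
    """
    group, _, artifact = name.strip().rpartition(":")
    return group.lower(), artifact.lower()


def group_by_artifact(names: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket raw names by artifact alone, a deliberately naive heuristic."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        buckets[split_coordinate(name)[1]].append(name)
    return dict(buckets)


# Hypothetical spellings of one library as different datasets might record it.
REPORTED = [
    "Com.fasterxml.jackson.core:jackson-core",
    "com.fasterxml.jackson.core:jackson-core",
    "jackson-core",
]

if __name__ == "__main__":
    print(len(set(REPORTED)))           # distinct raw strings
    print(group_by_artifact(REPORTED))  # buckets after naive normalization
```

Exact string comparison treats these as three different components, while the heuristic folds them into one. The same heuristic would also wrongly merge unrelated projects that happen to share an artifact name, which is the kind of ambiguity a standardized schema is meant to remove.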

Another issue you may not have considered -- but is all too common in open-source circles -- is that many of these programs still live in individual developer accounts. "Of the top 10 most-used software packages in our analysis, the CII team found that seven were hosted under individual developer accounts." What happens if one of these accounts is hacked? Would you, farther down the software supply chain, even know? 

You might -- if, for example, a tiny and otherwise obscure Node.js program called left-pad were removed by its developer. This actually happened, and it caused thousands of npm JavaScript programs to fail.
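
To give a sense of how little code such a package can contain, here is a Python sketch of what a left-pad-style helper does. It is not the source of the npm package, which is written in JavaScript. The function name, signature, and error handling are illustrative assumptions.

```python
def left_pad(text: str, length: int, fill: str = " ") -> str:
    """Pad text on the left with fill until it is at least length characters."""
    if len(fill) != 1:
        raise ValueError("fill must be exactly one character")
    missing = length - len(text)
    if missing <= 0:
        # Already long enough: return the input unchanged.
        return text
    return fill * missing + text


if __name__ == "__main__":
    print(left_pad("42", 5, "0"))
```

A helper this small is easy to overlook, but every program that imports it breaks if it disappears from the registry.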

Or, then again, you might not know. In another example, a hacker gained access to the popular Event-Stream JavaScript library and added malicious backdoor code that let him steal Bitcoin. How much? We don't know. We may never know.

Clearly, individual developer accounts need more protection than they're getting. The Linux Foundation's CII badging program and Trust and Security Initiative incorporate developer account security into their controls to mitigate these risks.

The final problem is that open source hasn't escaped the curse of legacy software. Developers move on to newer programs or newer versions of their old programs, but downstream programmers still rely on the old program. These downstream developers are reluctant to move when the replacement package generally does the same job. That's especially true when the new component comes, as it often does, with compatibility bugs. There are also the financial and time costs of switching to new software when there is no guarantee of any added benefit. At the same time, though, since the older but still widely used program isn't being updated, security holes may be discovered and left unpatched.

In short, open-source developers must start addressing the problems of legacy software. Maintaining code is never as much fun as developing new code, but it's necessary work. 

Speaking of work, this report itself is only the beginning. As Jim Zemlin, The Linux Foundation's executive director, said:

"The report begins to give us an inventory of the most important shared software and potential vulnerabilities and is the first step to understand more about these projects so that we can create tools and standards that results in trust and transparency in software."

Tim Mackey, the principal security strategist for the Synopsys Cybersecurity Research Center, added:

"Identifying the most pervasive FOSS components in commercial software ecosystems, combined with a clear understanding of both their security posture and the communities who maintain them, is a critical first step. Beyond that, commercial organizations can do their part by conducting internal reviews of their open-source usage and actively engaging with the appropriate open-source communities to ensure the security and longevity of the components they depend on."

Zemlin concluded:

"Open source is an undeniable and critical part of today's economy, providing the underpinnings for most of our global commerce. Hundreds of thousands of open source software packages are in production applications throughout the supply chain, so understanding what we need to be assessing for vulnerabilities is the first step for ensuring long-term security and sustainability of open source software."
