Open-sourcing data will make big data bigger than ever

Open source changed everything about how we write code. Can open-sourcing data do the same for big data? The experts say yes.
Written by Steven Vaughan-Nichols, Senior Contributing Editor

Free software has been with computing since day one, but proprietary software ruled businesses. It took open source and its licenses to transform how we coded our programs. Today, even Microsoft has embraced open source. Now, The Linux Foundation has created a new open license framework, the Community Data License Agreement (CDLA), which may do for data what open source did for programming.

In Prague, at Open Source Summit Europe, The Linux Foundation announced a new family of open-data licenses. The CDLA licenses are an effort to define a licensing framework to support collaborative communities built around curating and sharing "open" data.

Specifically, the CDLA licenses enable individuals and organizations to share data as easily as they share open-source code. These licensing models are made to help people form communities to assemble, curate, and maintain big data. This will bring new value to data-based communities and businesses and power new data-based applications.

Open-source big data programs such as Hadoop, Spark, and MongoDB have enabled us to transform unstructured data into useful information. Today, the challenge is to assemble the critical mass of data for those tools to analyze. The CDLA licenses are designed to help governments, academic institutions, businesses, and other organizations open and share data, with the goal of creating communities that curate and share data openly.

For example, the Foundation stated, "If automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year."
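To put those quoted figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It assumes decimal units (1GB is 10^9 bytes, 1PB is 10^15 bytes), rounds "nearly a gigabyte" up to a full gigabyte per second, and spreads the driving evenly over a 365-day year. The function name and constants are illustrative and don't come from the Foundation.

```python
# Back-of-the-envelope check of the quoted figures.
# Assumptions (not from the Foundation): decimal units, a rate of exactly
# 1 GB/s, and driving spread evenly across a 365-day year.
GB = 10**9   # bytes
PB = 10**15  # bytes


def implied_driving_hours_per_day(annual_bytes, bytes_per_second, days=365):
    """Return the daily driving hours needed to produce annual_bytes."""
    seconds_per_year = annual_bytes / bytes_per_second
    return seconds_per_year / 3600 / days


hours = implied_driving_hours_per_day(2 * PB, 1 * GB)
print(f"Implied driving time: {hours:.2f} hours per day")
```

Under those assumptions, two petabytes a year works out to roughly an hour and a half of driving per day.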

But how do you legally share valuable data? Until now, there's been no plan for how to legally manage data sharing. Each data-sharing agreement is unique. That's where the CDLA licenses come in.

"Data is the oil of the 21st century," said Mark Radcliffe, partner and global chair of the FOSS Practice Group at global legal powerhouse DLA Piper. "Yet, the legal protection for and licensing of data is in its infancy. Many current licenses take a variety of inconsistent (and frequently incomplete) approaches to the use and licensing of data. The CDLA provides a valuable tool for companies and lawyers in managing the use and licensing of data. In the best tradition of the open source community, The Linux Foundation used a collaborative process to get the best possible agreement. I will be using the CDLA for many of my clients."

There are two CDLA licenses. The first is a sharing license that encourages, but doesn't require, contributions of data back to the data community. This is somewhat like Linux's GNU General Public License version 2 (GPLv2). The other is a permissive license. This puts no additional sharing requirements on open-data recipients or contributors. It's something like the BSD license.

Eben Moglen, Columbia Law School professor of law and founding director of the Software Freedom Law Center (SFLC), explained: "Shared data licensing will do for machine learning and the next phase of information technology evolution what the GPL and the free-software ethos it embodied did for primary software production over the last generation. Clearly expressed, well-designed rules for 'share alike' treatment of collaboratively-produced data will enable massive cooperation and help us resist over-concentrated ownership of the resource most crucial to 21st century social and economic development."

The CDLA licenses have been drafted with the needs of companies, organizations, and communities with valuable data assets to share in mind. The licenses are intended to let contributors and consumers of open data-sets actively use and support the contribution of data in a uniform fashion, while clarifying the terms of that sharing and reducing risk.

In practice, these licenses will give companies, governments, and organizations the following features:

  • Data producers can share data with greater clarity about what recipients may do with it. Data producers can also choose between sharing and permissive licenses and select the model that best aligns with their interests. In either case, data producers should enjoy the clarity of recognized terms and disclaimers of liabilities and warranties.
  • Data communities can standardize on a license or set of licenses that provide the ability to share data on known, equal terms that balance the needs of data producers and data users. Data communities have a high degree of flexibility to add their own governance and requirements for curating data as a community, particularly around areas such as personally identifiable information.
  • Data users who are looking for data-sets to help kick off training an AI system or for any other use will have the ability to find data shared under a known license model with terms that clearly state their rights and responsibilities.

Of course, the CDLA is only a framework. It's still more than we've had before. The CDLA is also data privacy agnostic. It relies on data publishers and curators to create their own governance structure around what data they curate and how. Each data producer or curator must work through various jurisdictional requirements and legal issues.

Why? Because the "CDLA is intended to be an agreement that can be used throughout the world. Since data may be licensed from data providers located in many countries, the CDLA Working Group opted not to specify a law or jurisdiction in favor of encouraging global adoption of the Agreement."

Even without legal enforcement specifications, as Jim Zemlin, The Linux Foundation's executive director, observed, "An open-data license is essential for the friction-less sharing of the data that powers both critical technologies and societal benefits. The success of open source software provides a powerful example of what can be accomplished when people come together around a resource and advance it for the common good. The CDLA licenses are a key step in that direction and will encourage the continued growth of applications and infrastructure."
