This article features information on the AWS Public Datasets program, its business model, experience with clients, and comments provided via an email interview with Jed Sundwall, AWS Global Open Data Lead. Sundwall's comments have been edited for brevity and clarity and complemented with the author's own research and comments.
Recently the world's biggest conference on open data brought together practitioners from government, research, NGOs and business to share their experiences and discuss the way forward. Amazon Web Services (AWS), one of the most prominent open data hosts, was there to participate in the collective quest to build sustainable capacities for data-driven businesses.
This can be a win-win-win for data publishers, hosts and users alike. Typically, governments have acted as both publisher and host for open data, businesses have either abstained or been occasional users, and the non-profit sector has accounted for most of the use. But there's something in open data for everyone, and the cloud is a catalyst in this process.
Publishing open data in the cloud
One thing AWS has been recommending to their customers, according to AWS's Jed Sundwall, is to "make data consistently available online. This might sound strange, but there are a lot of government data products that you can only get by having disks mailed to you, while some other services are simply unreliable. This is one area where cloud infrastructure can really help."
Many government datasets, says Sundwall, have been "technically 'open' (i.e. no license restriction) for years, but very few people have had the ability to access them because of the costs of acquiring copies of the data. A good example is NOAA's NEXRAD data. AWS has a collaborative research and development agreement with NOAA to explore ways to improve public access of their data. Several hundred terabytes of NEXRAD high-resolution radar data are available on Amazon S3."
"NOAA did an analysis of NEXRAD usage and found that making it available on AWS led to a 230 percent increase in usage, while simultaneously leading to a 50 percent decrease in usage of their own servers. The convenience of having the data available close to computing resources on the cloud has made the data much more usable."
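To illustrate how low the barrier to this data has become, here is a minimal sketch of building public HTTPS URLs for NEXRAD archive files on Amazon S3. The bucket name `noaa-nexrad-level2` is the real public bucket; the exact object-key layout shown (`YYYY/MM/DD/SITE/SITE...`) is an assumption based on the dataset's date-and-station organization, so verify it against the bucket listing before relying on it:

```python
# Sketch: constructing public HTTPS URLs for NEXRAD Level II volume
# scans hosted on Amazon S3. The bucket "noaa-nexrad-level2" is real;
# the key layout (YYYY/MM/DD/SITE/SITE<date>_<time>_V06) is an
# assumption about how the archive is organized -- verify before use.
from datetime import datetime

BUCKET = "noaa-nexrad-level2"

def nexrad_key(site: str, ts: datetime) -> str:
    """Build the assumed S3 object key for one radar volume scan."""
    return f"{ts:%Y/%m/%d}/{site}/{site}{ts:%Y%m%d_%H%M%S}_V06"

def nexrad_url(site: str, ts: datetime) -> str:
    """Public HTTPS URL -- no AWS account or credentials required."""
    return f"https://{BUCKET}.s3.amazonaws.com/{nexrad_key(site, ts)}"

# Example: one scan from the KVWX station.
print(nexrad_url("KVWX", datetime(2015, 5, 15, 8, 7, 37)))
# -> https://noaa-nexrad-level2.s3.amazonaws.com/2015/05/15/KVWX/KVWX20150515_080737_V06
```

The point of the sketch is the contrast Sundwall describes: instead of ordering disks by mail, anyone can fetch a specific scan over plain HTTPS.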
Hosting open data for profit and the greater good
Data publishers have a lot to gain by entrusting their data to the cloud, but the benefits may be less obvious for cloud providers that, like AWS, make data available for anyone to use at no cost through the Public Datasets program.
At AWS, Sundwall says, they often "think of this as a lab to explore ways to make more data available to more people, and a convenient way to demonstrate the capabilities of their platforms, give people interesting data to work with as they get started on the cloud, and show public sector customers how they could use the cloud to share their data."
AWS, he notes, "has many public sector customers who are obligated, either by law or by their mission, to share data. AWS covers the costs of hosting data shared through the Public Datasets program." Naturally hosting such datasets comes at a cost, so why do this? One could argue that this builds credibility and corporate social responsibility for AWS, but that's only part of the answer.
Sundwall says the rest is "customer demand for compute. Data is staged for analysis on the cloud for anyone to access, but users must pay for their own compute. At the end of the day, AWS is interested in open data because it's so important to our customers." If you're wondering whether AWS would consider hosting your dataset, or letting you tinker with the datasets they host for free, the answer is yes - maybe.
"AWS largely selects public data sets based on customer demand," Sundwall says. "If we have multiple customers who can run production workloads on the data, we are much more likely to bring the data on." DBpedia, for example, for which there is an ongoing discussion on sustainability, could potentially benefit from this. As for free compute, Sundwall notes, "researchers may apply for promotional credits through the AWS Cloud Credits for Research program."
Getting value out of open data
But the majority of organizations are neither publishing nor hosting open data. What's in it for them? Added value. They say data is the new oil that can do for the digital economy what oil did for the industrial economy, and the analogy is apt if you think of raw data as crude oil.
Although very few organizations can work with crude oil, refined oil products are used everywhere and generate immense value for producers and users. Very few people are able to work with raw government data, and that provides an opportunity for businesses to produce value-added products that meet their customers' needs.
So publishers, AWS says, need to ensure data is:
- Up-to-date and thoroughly documented. If developers have confidence that they understand what the data is and how it can be used, they'll be much more likely to work with it.
- Accurate. In many cases government agencies producing the data are the only arbiters of truth for their data. This is hard work and no one else can do it for them.
Sundwall says that "to make data useful beyond that, the private sector has a lot to offer." He mentions startups like PlanetOS, Astro Digital, FarmLogs, Sinergize, and Building Radar, "taking raw government data and building commercial products based on it." There's more, but benefits can be reaped from open data even without a business model based on it.
The BBC, for example, has been leveraging DBpedia data, URIs and design principles at the core of its Linked Data platform since 2010. This has resulted, among other things, in a 50% increase in traffic in some cases, showcasing the potential for value that sources ranging from McKinsey to the EU Commission are all pointing to. Who would turn down free quality data anyway?
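To get a feel for the kind of structured data the BBC draws on, here is a hedged sketch that builds a query for DBpedia's public SPARQL endpoint (`https://dbpedia.org/sparql` is the real endpoint; the specific query and the assumption that the `dbo:` prefix is predefined there should be verified). The code only constructs the request URL and does not perform any network call:

```python
# Sketch: preparing a SPARQL request for the public DBpedia endpoint.
# Only the GET URL is built here; actually sending it needs network
# access. The "dbo:" prefix is assumed to be predefined by the endpoint.
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"

def build_query(limit: int = 5) -> str:
    """SPARQL: fetch a few programming languages with English abstracts."""
    return (
        "SELECT ?lang ?abstract WHERE { "
        "?lang a dbo:ProgrammingLanguage ; dbo:abstract ?abstract . "
        'FILTER (lang(?abstract) = "en") '
        f"}} LIMIT {limit}"
    )

def request_url(query: str) -> str:
    """Full GET URL asking the endpoint for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)

print(request_url(build_query()))
```

Fetching that URL (e.g. with `urllib.request.urlopen`) would return JSON result bindings, which is the kind of reusable, linked resource the article argues organizations can build on for free.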