Finding the data buried in cloud storage

With cloud object stores becoming the de facto data lakes, a recent survey shows that enterprises are between a rock and a hard place when it comes to finding and accounting for all the data that is piling up.
Written by Tony Baer (dbInsight), Contributor

It's human nature for messes to spread across all empty spaces. We pointed out a trend several months back that for a growing cross section of enterprises, cloud object storage is becoming the de facto data lake. The good news is that cloud object storage is relatively cheap, highly scalable, and increasingly accessible. For instance, most cloud Hadoop services swap in object storage for HDFS, and cloud providers are increasingly delivering services that provide ad hoc query or treat cloud object stores as extended tables for data warehouses.

The flip side of relying on cloud storage as the default target or data lake is the need to reconcile the accumulation of data in a general-purpose target with the need to become more accountable for data privacy or data protection, especially with regulations such as GDPR taking effect.

Chaos Sumo, a company that plans to introduce a search layer for SaaS providers to add atop cloud storage (for now, Amazon S3) in the summer, has just released a survey showing some of the pain points that cloud adopters are feeling.

Admittedly, at 120 respondents, the survey size was modest. And targeted at data ops professionals, the sample was likely skewed towards organizations already embracing the cloud. For instance, 72% indicated that they use some form of cloud object storage today. For those using Amazon S3, 40% of respondents stated they expected that their use of S3 storage would grow at least 50% in the next year.

For enterprises, the primary use was for backup, storage, and archiving. But 28% are already using object storage for data lakes, while another 18% plan to implement one over the next 12-18 months. Not surprisingly, for this AWS-dominated sample, a similar proportion (23%) reported using Amazon Athena today. Roughly half use the Amazon Redshift data warehouse, which, with Spectrum, can now treat S3 as an extended table.

The innovation of tools such as Athena is opening up interactive access to data from a system otherwise optimized for storage, without the need for ETL (although the data must be in some form of semi-structured storage, such as CSV, JSON, Parquet or other formats).
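To make the "no ETL" point concrete: services of this kind generally work by declaring a table definition over files that stay where they are in object storage. The following is a minimal sketch, not taken from the survey or from Chaos Sumo, that builds a Hive-style `CREATE EXTERNAL TABLE` statement of the sort Athena accepts. The helper name `external_table_ddl`, the table and column names, and the bucket path are all hypothetical, and the sketch only assembles the SQL text; it does not connect to AWS.

```python
def external_table_ddl(table, columns, s3_location, fmt="csv"):
    """Build a Hive-style CREATE EXTERNAL TABLE statement as a string.

    Hypothetical helper for illustration: `columns` is a list of
    (name, sql_type) pairs and `s3_location` is the S3 prefix that
    already holds the files. No data is moved or transformed.
    """
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    if fmt == "csv":
        # Plain comma-delimited text files
        storage = "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
    elif fmt == "parquet":
        # Columnar Parquet files
        storage = "STORED AS PARQUET"
    else:
        raise ValueError(f"unsupported format in this sketch: {fmt}")
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"{storage}\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table over log files sitting in a hypothetical bucket
ddl = external_table_ddl(
    "access_logs",
    [("request_time", "string"), ("bytes_sent", "bigint")],
    "s3://example-bucket/logs/",
)
print(ddl)
```

Once a definition like this is registered, ordinary SQL can be run against the files in place, which is the interactive access the survey respondents are reporting.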


Credit: Chaos Sumo

But as the chart shows, as the data is piling up in object storage, a growing minority is concerned about accountability. That has been the advantage of commercial distributions of platforms such as Hadoop and packaged tooling for analytics and data preparation, which feature some form of data lineage, security, and access control as their raison d'etre. By comparison, cloud object stores are naked when it comes to governance or perimeter security -- that has traditionally been the job of the data platform, cloud host, or analytic tool that consumes the data.

So a quarter of the sample are concerned that they will have to move data to analyze it, while smaller, but statistically significant minorities are voicing concern about finding the data, compliance, and security. They are spending significant time cleaning and preparing data -- well over half report spending at least six hours per week, with nearly 40% of respondents saying they devote over 11 hours per week to the task (those are results that the data prep companies would eat up).

Significantly, only 7% of the sample reported that it is currently easy to analyze data squirreled away in object storage. That's where the commercial for the survey sponsor, Chaos Sumo, comes in. The company plans to introduce what it terms a "data fabric" that will open S3 data to Elasticsearch by summer for OEM use by existing SaaS providers. We expect S3 to become a sweet spot for more analytic platforms and tools. For Chaos Sumo, adding search as a utility for SaaS providers to make this data more visible will be yet another step toward taming the cloud storage beast.
