Big Data 2018: Cloud storage becomes the de facto data lake

While AI, IoT, and GDPR grab the headlines, don't forget about the generational impact that cloud migration and streaming will have on big data implementations.
Written by Tony Baer (dbInsight), Contributor

As we bat cleanup with our 2018 predictions for big data, we're going to pick up where Big on Data bros Andrew Brust and George Anadiotis have left off.

Yes, it's getting harder and harder to stay oblivious to the impact of AI, with implications from the geopolitical to the mundane and the positively creepy. It's getting harder to miss the growing impact of IoT on everything from our homes to the way hospitals deliver care, autonomous cars are driven, factories are run, and smart cities are managed. And the arrival of GDPR, which will start taking effect in 2018, is forcing organizations to confront the privacy and national sovereignty implications of the data sitting in everything from transaction databases to data lakes and cloud storage.

But beneath the surface, we're seeing the beginnings of tectonic shifts in how enterprises manage their cloud, streaming analytics, and data lake strategies.


27.5 percent of big data workloads are running in the cloud (Source: Ovum ICT Enterprise Insights)

Multi-cloud moves to the front burner

For our look ahead, we're focusing on how the data is being managed. Rewind the tape to a year ago and we stated that "increasingly, Big Data, whether from IoT or more traditional sources, is going to live and be processed in the cloud." Last year, we forecast that 35-40 percent of new big data workloads would be deployed in the cloud, and that by year end 2018, new deployments would pass the 50 percent threshold.

Our predictions weren't far off the mark; Ovum's latest global survey for all big data workloads shows that 27.5 percent of them are already deployed in the cloud. And according to Ovum research, big data is hardly an outlier for enterprise cloud adoption, which ranges from 26-30 percent across different workloads.

By inertia, most organizations have ended up with the same polyglot environments in the cloud that characterize their data centers. Most organizations use more than one cloud provider, just like on premises where they often have one of everything. Like history repeating itself, this is the consequence of a combination of top-down policies mandating a corporate standard, and departmental decisions made for expedience.

So, just as your organization might have SAP for its accounts payable, different segments might have Workday for HR or Salesforce for CRM. Or maybe they have multiple ERP systems that have not yet been converged as the legacy of M&A. In the cloud, your corporate email system might be on Office 365 while departmental IT groups use AWS for DevTest, and corporate marketing uses Google Analytics.

In 2018, we expect the early majority to start formalizing multi-cloud strategies as cloud evolves from a target for running standalone workloads to enterprise-critical applications. So, as we saw cloud deployment as the sleeper issue for big data in 2017, multi-cloud will become the looming issue for 2018. That's the back story for why Oracle doubled prices for running its database on Amazon's RDS service and why the Aurora OLTP database is now Amazon's fastest growing service (succeeding Redshift before it).

More than a reactive decision about the fears of cloud vendor lock-in, multi-cloud decisions will be about platform choices. When you decide to run an Oracle database or Hadoop cluster on EC2, that is a tactical choice that can be revisited should Azure or Google Cloud change their pricing.

When you choose Aurora, Cosmos DB, Google BigQuery, Oracle Autonomous database 18c, or the IBM Analytics system on the IBM cloud, you are not just choosing your cloud, but your data platform. You are choosing whether the value-add of running a data platform that is native to a specific cloud outweighs concerns over relying on a specific cloud provider. It's like making your Oracle or SQL Server platform decision all over again.

And that's why Amazon and Microsoft are offering database migration services almost as freebies. They want your enterprise database. We also expect that Google Cloud, Oracle, and IBM will actively promote loss-leader database migration offerings in the coming year, and that more enterprises will elevate to the front burner the issue of how many eggs to put in each cloud basket.

Multi-cloud strategies will also figure heavily in organizations determining how to manage the reality of hybrid cloud. Just as few organizations of any size are likely to rely on a single cloud provider, few organizations (apart from startups) are likely to go 100 percent cloud. The transparency of maintaining sensitive customer records on premises, either by design or because of data sovereignty issues, while running analytics in the cloud will become a major factor in cloud platform selection.

Data pipelines shift the center of gravity for real-time processing

Last year, we predicted that "IoT is the use case that will push real time streaming onto the front burner." This year, George Anadiotis forecast not only that streaming is becoming mainstream, "but also [being] analyzed on the fly."

Streaming analytics is hardly new; we've devoted plenty of ink to its rebirth. Streaming can be used for parsing and filtering data, and for detection of patterns or events, before the data is persisted. Explosive growth of IoT data raises the question not only of whether to store all that data, but also of where to process it.
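
To make the idea concrete, here is a minimal Python sketch of parsing, filtering, and event detection on data in motion, before anything is persisted. The sensor readings, the threshold of 80.0, and the helper names (parse_and_filter, detect_spikes) are all hypothetical and invented for illustration; a real deployment would read from a streaming platform instead of an in-memory list.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Reading(NamedTuple):
    sensor: str
    value: float


def parse_and_filter(lines: Iterable[str]) -> Iterator[Reading]:
    """Parse 'sensor,value' lines, dropping any that are malformed."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        try:
            yield Reading(sensor=parts[0], value=float(parts[1]))
        except ValueError:
            continue  # value was not numeric


def detect_spikes(
    readings: Iterable[Reading], threshold: float, run_length: int
) -> Iterator[Reading]:
    """Emit a reading once a sensor exceeds the threshold run_length times in a row."""
    streak: Dict[str, int] = {}
    for r in readings:
        if r.value > threshold:
            streak[r.sensor] = streak.get(r.sensor, 0) + 1
            if streak[r.sensor] == run_length:
                yield r
        else:
            streak[r.sensor] = 0  # run broken, start counting again


# Hypothetical raw feed, including two malformed records.
raw = ["a,70.1", "a,85.0", "garbage", "a,90.2", "b,81.0", "a,60.0", "b,not-a-number", "b,82.5"]

# Only the detected events would be handed on to storage or alerting.
events: List[Reading] = list(
    detect_spikes(parse_and_filter(raw), threshold=80.0, run_length=2)
)
```

Because both stages are generators, each record flows through parsing and detection one at a time, so nothing has to be stored in order to decide what is worth keeping.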

As our appetites get whetted, we want to do more with the data while it is in motion. That not only explains the emergence of Kafka for queuing and distributing the data, but also why data platform providers like SAP, Hortonworks, MapR, and Teradata are getting into the act, not to mention cloud services such as Amazon Kinesis, Azure Data Factory, and Google Cloud Dataflow. Data pipelines allow you to extend real-time processing from basic filtering and transformation to orchestrated processes that could support advanced, predictive analytics and machine learning. We expect that in 2018, data pipelines will become key pillars for streaming analytics, and we'll be hearing a lot more from providers like IBM and Oracle in this space.

Cloud storage becomes the de facto data lake

Because Hadoop was specifically designed for holding data that would not easily fit elsewhere, and lots of it, when you thought data lake, you probably thought Hadoop. We've defined data lakes as governed repositories that become the default ingest points for data. But we are seeing data lake implementations transcending Hadoop. Or as Mike Olson prophetically stated back in 2014, Hadoop would disappear.

It began with federated query tools that have become checkbox items for practically every analytic database. We've seen JSON databases extended for analytic queries via Spark. We've also seen Hadoop providers (e.g., Cloudera and Hortonworks) decouple their data governance services from HDFS. So the data lake is wherever you store data.

Not surprisingly, the cloud providers are having the last word: in the cloud, cloud storage has become the default ingest point for data. And so cloud providers are making their cloud object stores directly queryable. Amazon has made S3 directly accessible to SQL ad hoc query with Athena, and as an extension of the data warehouse with Redshift Spectrum. Google Cloud has long made its cloud storage the default source for BigQuery, while Snowflake, a third-party cloud data warehouse, has done the same.
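
As a rough sketch of what "directly queryable" can look like in practice, the Python snippet below assembles the parameters for an ad hoc SQL query against data sitting in S3, in the shape accepted by the boto3 Athena client's start_query_execution call. The database name, table name, and bucket path are hypothetical placeholders, as is the helper build_athena_request. The actual service call is left commented out because it assumes AWS credentials and an existing Athena table definition.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Assemble keyword arguments for boto3's Athena start_query_execution."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena writes query results to an S3 location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names throughout: example_lake, sensor_events, example-bucket.
request = build_athena_request(
    sql="SELECT device_id, COUNT(*) AS events FROM sensor_events GROUP BY device_id",
    database="example_lake",
    output_location="s3://example-bucket/athena-results/",
)

# With credentials configured, the request could be submitted like this:
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**request)
# query_id = response["QueryExecutionId"]
```

The point of the sketch is that both the data being queried and the query results live in object storage; no separate load step into a database appears anywhere in the request.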

There's more than a bit of irony here. Cloud storage was originally designed for just that: storage. But in a world where cloud object storage accounts for the majority of data by volume, enterprises are going to demand optimized access. In 2018, we expect that virtually all data warehouses and analytic databases will add popular cloud object stores such as S3, Azure Blob Storage, and Google Cloud Storage as supported targets.
