Data 2021 Outlook, Part I: What’s ahead for AI and Cloud Data Warehousing

“Responsible AI” is at the beginning of a long slog, while Cloud Data Lakehouses will supplement but not replace Data Warehouses or Data Lakes.

[Image: 21 is going to be a good year. Credit: The Who]

If there's one obvious prediction that was borne out over the course of what was otherwise a very unpredictable year, it was the accelerating adoption of cloud computing. Just look at the continued healthy double-digit growth rates of each of the major clouds. For enterprises, it was about adapting to the virtual environment and constrained supply chains of a suddenly locked-down world.

A year ago (pre-COVID), we viewed cloud adoption as a series of logical stages, evolving from DevTest to development of new born-in-the-cloud apps and opportunistic adoption of new SaaS services, with the home stretch now coming into view: re-platforming and/or transformation of core enterprise back-end applications. With hindsight, and not surprisingly, the headline for cloud adoption over the past year was use cases that enabled businesses to pivot into what became the new normal: the need to change or develop new services in a landscape where work and consumption were increasingly virtual, and where conventional supply chains came under stress.

Over the past year, the predominant theme in data, analytics, and cloud services was extension. We saw relatively few launches of new database cloud services (Amazon Timestream and Oracle MySQL services being the major introductions of the year). Instead, we saw the extension of existing services with new caching and query federation, along with second-generation launches (or in some cases, relaunches) of databases as cloud-native managed services.

We're splitting our forecast for what's ahead in 2021 into a couple of posts. Today, we'll focus on cloud AI and data warehousing trends. Tomorrow, we'll offer our take on what multi-cloud will mean in the coming year.

RESPONSIBLE AI AND EXPLAINABLE AI WILL BE JOINED AT THE HIP

We're not going to cover the waterfront here. Just over the last few weeks, these pages have seen predictions on the role of human intelligence; demand for AI in job postings; and the short-term impacts of the COVID pandemic on AI, which in turn are being tempered by more realistic expectations for AI's impact in the software market.

Instead, we'll pick up on some comments by industry executives on Responsible AI that were part of Big on Data bro Andrew Brust's annual exhaustive roundup of industry predictions that appeared in these pages last week. The issue is coming to the forefront because AI is becoming more ubiquitous – enterprises are now following the lead set by consumer online services that are increasingly embedding AI into every mundane transaction. And the onramps are getting wider now that AutoML services are expanding their breadth. For instance, a few weeks back, AWS further expanded SageMaker with a new feature store, data pipeline automation, new "jumpstart" prebuilt models, and autopilot capabilities that enable SQL database developers to run predictive models (more about that below).

Ensuring that AI is responsible and as minimally biased as possible is challenging enough if you are a data scientist; that challenge gets magnified when you open the gates to less technical practitioners. There's no way that we're going to turn back the clock and close the gates to all those citizen data scientists. And so, technology is going to have to lend a hand in helping keep AI on the straight and narrow. Explainable AI will be essential to making Responsible AI initiatives effective. While Explainable AI won't be a panacea (it takes humans to develop the criteria for how models self-document), without explainability, efforts to root out bias and unfairness will amount to whack-a-mole efforts.

The challenge is that over the past year, we've not seen much progress in Explainable AI. We outlined the challenges of getting AI out of the black box in our 2020 outlook a year ago, and guess what: the limitations on Explainable AI have changed relatively little since. For instance, Google Cloud's disclosure page has changed only marginally in the ensuing 12 months.

Going forward, Responsible AI won't be a new trend in 2021. We do expect, however, that renewed efforts will be invested in explainability owing to the external pressure of regulation, reflecting the political climate, especially in North America and Western Europe, for making tech companies more accountable. And with that, the goalposts for Responsible AI will keep moving as AI grows more ubiquitous and as demand for public scrutiny continues to grow.

IN-DATABASE MACHINE LEARNING BECOMES CHECKBOX ITEM

Sometimes you can have it both ways.

At first blush, the second wave of cloud-native DBaaS services that embrace microservices and the separation of compute and storage, from providers ranging from Microsoft to SAP, Oracle, Informatica, SAS, and others, might seem at loggerheads with another trend: so-called "pushdown" of data-intensive processing into the database. In the coming year, we'll see more of both.

The push to pushdown is nothing new. From one perspective, one could trace this back to the dawn of mainframe computing, where programs and data were interlocked. But the more modern manifestation emerged with database stored procedures and triggers, which were actually Sybase's calling card back in the 1990s (and the key to why Wall Street customers stubbornly stuck by an untrendy platform that we expect SAP to pump new life into this year).

We're seeing this with the onrush toward in-database ML capabilities. Virtually every cloud data warehousing DBaaS supports some form of training and running of ML models inside the database. In-database ML has become a checkbox item because (1) ML is ravenous for data and (2) it's costly and inefficient to move all that data when there's an alternative to processing it in place. And anyway, in some cases, we might be talking up to petabytes of data; who wants to pay for moving all that, then wait for all that data to get moved?
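The data-movement argument can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a cloud data warehouse, and a simple per-group average as a stand-in for the data-hungry part of an ML workload; the table name, column names, and data are hypothetical, and nothing here reflects any vendor's actual in-database ML API. The point is only to compare how many rows have to leave the database in each approach.

```python
import sqlite3

# Hypothetical toy table standing in for a large warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")
rows = [(i % 10, float(i)) for i in range(10_000)]
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

# Approach 1: move every row to the client, then compute per-sensor means there.
moved = conn.execute("SELECT sensor_id, value FROM readings").fetchall()
sums, counts = {}, {}
for sensor_id, value in moved:
    sums[sensor_id] = sums.get(sensor_id, 0.0) + value
    counts[sensor_id] = counts.get(sensor_id, 0) + 1
client_side = {k: sums[k] / counts[k] for k in sums}

# Approach 2: push the computation down; only the aggregated result set moves.
pushed = conn.execute(
    "SELECT sensor_id, AVG(value) FROM readings GROUP BY sensor_id"
).fetchall()
in_database = dict(pushed)

print(len(moved), "rows moved to the client vs.", len(pushed), "with pushdown")
```

Both approaches arrive at the same averages, but the first ships 10,000 rows out of the database while the second ships 10. At petabyte scale, that gap is what turns in-database processing into a checkbox item.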

Here are a few examples. AWS recently announced previews of ML capabilities in Redshift and its graph database Neptune. Microsoft supports processing ML models in SQL and Spark pools managed by Azure Synapse Analytics. Google BigQuery offers support for running roughly ten different types of ML algorithms in the database. Oracle has long supported in-database R and Python processing. Meanwhile, Snowflake supports feature engineering using SQL pushdown from ML tools such as Dataiku, Alteryx, and Zepl, plus integrations with AutoML tools such as DataRobot, Dataiku, H2O.ai, and Amazon SageMaker, among other capabilities.

CHILLING OUT AT THE LAKEHOUSE

The data warehouse vs. data lake question was the most debated trend cited in Andrew Brust's roundup. Essentially, the discourse boils down to this. Data warehouse proponents cite cloud-native architectures as giving them the scale, and multimodel data support as giving them the ability to handle the variety associated with data lakes. Data lake proponents counter that size matters, especially when you're running data-intensive AI models, and that emerging open source technologies (e.g., the Presto and Trino query engines; table formats such as Iceberg) can make data lakes almost as performant as data warehouses.

The reality is that data warehouses and data lakes each have their own strengths. Yes, cloud data warehouses can now venture into petabyte territory, but the barrier for most enterprises will be economic: at those scales, data lakes will normally be more economical. Likewise, no matter how optimized the query engine, data lakes rely on file scans, and that will never be as efficient as having tables where data can be indexed, compressed, and filtered.

Federated query is associated with joining tables from different databases for a single query. Approaches that push processing down to where the data lives are better suited for the cloud as data movement (result sets only) can be minimized. In the cloud, that means federating query to reach down into cloud object storage. Data warehouses from AWS, Azure, GCP, and Snowflake already reach into cloud storage either through a federated query or their own specialized query engines, and we expect that Oracle and SAP will add those capabilities this year.
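As a toy illustration of the idea, the sketch below joins tables that live in two separate databases within a single query, using SQLite's ATTACH DATABASE from Python's built-in sqlite3 module. This is an analogy only: the database alias "lake", the table names, and the data are hypothetical, and real cloud warehouses federate to object storage through their own engines and syntax.

```python
import sqlite3

# The "main" database plays the warehouse; a second, separately attached
# in-memory database (hypothetical alias "lake") plays the external store.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE lake.events (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
)
conn.executemany(
    "INSERT INTO lake.events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)],
)

# One query spans both databases; only the aggregated result set is returned.
totals = conn.execute(
    """
    SELECT c.region, SUM(e.amount)
    FROM main.customers AS c
    JOIN lake.events AS e ON e.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()

print(totals)  # [('APAC', 7.5), ('EMEA', 17.5)]
```

The caller sees one result set and never handles the raw event rows, which is the property that makes federated approaches attractive when the second store is cloud object storage.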

The Data Lakehouse is a new take that picks up where federated query leaves off. Introduced by Databricks a year ago, it refers to a system that is a hybrid of a data warehouse and data lake. The term has been seconded by Snowflake, and more recently embraced by Informatica (we'll have something more to say about that later this week). For a term introduced barely a year ago, at this point, three's a crowd, which means we'll probably be seeing this term a lot more in the coming year. Data lakehouses don't necessarily use a relational data warehouse as the entry point, but instead rely on "open" data formats, the most likely being Parquet or CSV.

Looking ahead, we don't expect that the data warehouse, reimagined as a relational data lake, or a data lakehouse, will necessarily make data lakes obsolete. Ultimately, it's your developers who will drive the choice. Classic SQL database developers will likely opt for the relational data lake, while data scientists and developers using languages like Java or Python are likely to prefer data lakes, or if their natural skepticism gets addressed, data lakehouses. In many organizations, the choice between data warehouse, data lake, and/or data lakehouse won't be an either-or decision.

Note: Our Data Outlook for 2021 is in two parts. Click here for Part II, addressing what's ahead for multi-cloud.

Disclosure: AWS, Microsoft, Oracle, SAS, and SAP are dbInsight clients.