Data 2022 outlook, part one: Will data clouds get easier? Will streaming get off its own island?

Here's our take on whether cloud providers will get serious in making their data services more enterprise-friendly.
Written by Tony Baer (dbInsight), Contributor

With the pandemic nearing its two-year anniversary, cloud adoption has continued to accelerate. Although dated last March, the most recent State of the Cloud report from Flexera shows a significant acceleration in cloud spending among large enterprises, with the proportion shelling out over $1 million/month doubling over the previous year.

As reported by Larry Dignan last summer, a backlash to cloud migration may be starting to brew based on growing expenses. We've heard anecdotes from technology providers like Vertica that some of their largest clients were actually repatriating workloads from the cloud back to their own data center or colocation facilities. 

So what's on tap for this year? We're dividing our 2022 outlook over two posts. Here, we'll focus on trends with cloud data platforms; tomorrow, we'll share our thoughts on what will happen with data mesh in the coming year.

Looking back on 2021

Last year saw some of the last on-premises database holdouts, such as Vertica and Couchbase, unveil their own cloud managed services. This reflects the reality that, while not all customers are going to deploy in the public cloud, offering an as-a-service option is now a required addition to the portfolio.

Despite the growth in cloud adoption, the database and analytics world did not see dramatic product or cloud service introductions. Instead, it saw a rounding out of portfolios with the addition of serverless options for analytics, and it moved toward pushdown processing in the database or storage tier. Excluding HPE, which unveiled a significant expansion of its GreenLake hybrid cloud platform in midyear, the same was largely true on the hybrid cloud front.

With most providers having planted their stakes in the cloud, the past year was about cloud providers building bridges to make it easier to lift and shift or lift and transform on-premises database deployments. For lift and shift, Microsoft already offered Azure SQL Database Managed Instance to SQL Server customers, and it added a managed instance for Apache Cassandra in 2021.

Meanwhile, AWS introduced its answer to Managed Instance: a new RDS Custom option for SQL Server and Oracle customers requiring special configurations that wouldn't otherwise be supported in RDS. This could be especially useful for instances that support, for example, legacy ERP applications. 

What if you want to continue using your existing SQL skills on a new target? Last year, AWS released Babelfish, an open source utility that can automatically convert most SQL Server T-SQL calls into PostgreSQL's PL/pgSQL dialect. And then there's Datometry, which simply virtualizes your database.

Also in the spirit of lift and shift, last year saw each of the major clouds add or expand database migration services designed to make the process simpler. AWS and Azure already had services that provided guided approaches to migrating from Oracle or SQL Server to MySQL or PostgreSQL. Meanwhile, Google introduced a database migration service that makes the transfer of on-premises MySQL or PostgreSQL to Cloud SQL an almost fully automated process.

Last year, we posited that Responsible AI and Explainable AI would be joined at the hip. A year later, we're hearing about the need for interpretable AI – as what (little) explainable AI has been put to use has not exactly been understandable. This year, we're still hearing calls for AI to be more responsible. We're still in the early stages of what will prove a long slog. The more things change...

Cloud: The burden is currently on the customer

Cloud providers are not going to suddenly stop expanding their portfolios while adding new products and services. But we expect they will pay more attention to identifying synergies across their portfolios, allowing them to create new blended solutions in 2022. The driver? Offering solutions blending their services should move at least some of the burden of integrating capabilities off the shoulders of cloud customers. 

The backdrop to all this is that the cloud was supposed to simplify IT budgeting and operations. In the data world, when customers adopt managed database as a service (DBaaS), such as Amazon Aurora, Azure SQL Database, Google Cloud Spanner, IBM Db2 Warehouse Cloud, or Oracle Autonomous Database, compute and storage instances are typically predetermined, as the DBaaS provider handles the software housekeeping. Serverless, in turn, takes simplification up another notch by dispensing with the need for customers to capacity plan their deployments.

The problem then becomes, are we getting too much of a good thing?

AWS alone has well over 250 services, including, for instance, 11 different container services, 16 databases, and over 30 machine learning (ML) services. It's not much different with Google Cloud or Azure. Google Cloud offers a dozen analytic services, 10 container services, and at least a dozen AI and ML services; Azure offers nearly a dozen DevOps services, 10 hybrid and multi-cloud services, and almost a dozen IoT services.

With tongue in cheek, we were privately relieved when AWS did not introduce a 17th database at the 2021 re:Invent conference.

The breadth of managed offerings in the cloud reflects a growing maturity: cloud providers are expanding the reach of their platform-, database-, and software-as-a-service offerings, serving a wider swath of enterprise compute needs.

What happens when you want to integrate a BI tool with a database? Or add a customer experience chatbot, video recognition system, or an event-alerting capability for a manufacturing process? Or containerize and deploy these as microservices? With such a wealth of choices, the burden has been on the customer to piece them together.

The cloud might start getting easier

The next step for cloud providers is to tap the diversity of their portfolios, identify the synergies, and start bundling solutions that lift part of the burden of integration off the customer's shoulders. We're seeing some early stirrings. For instance, AWS and Google Cloud have made strides to unify their ML development services. As we'll note below, we're seeing some progress in the analytics stack where cloud data warehousing services are beginning to either morph into end-to-end solutions or push down more processing into the database. And we're seeing integration of conversational AI (chatbots) into prescriptive offerings, such as Google Contact Center AI.

Our wish list for 2022 includes embedding some data fabric, cataloging, and federated query capabilities into analytic tools for end users and data scientists, so they don't have to integrate a toolchain to get a coherent view of data. There is an excellent opportunity to embed ML capabilities that learn an end user's or organization's querying patterns and optimize for them, based on SLA and cost requirements.

We'd also like to see prescriptive solutions that tie in different AI services to business applications, such as video recognition for manufacturing quality applications. As we note below, we expect to see streaming integrated more tightly with data warehouses/data lakes and operational database services.

We expect that, in 2022, cloud providers will ramp up efforts to tap the synergies hiding in plain sight in their portfolios -- an initiative that should also heavily involve horizontal and vertical solution partners.

Streaming will start converging with analytics and operational databases

A long elusive goal for operational systems and analytics is unifying data in motion (streaming) with data at rest (data sitting in a database or data lake).

In the coming year, we expect to see streaming and operational systems come closer together. The benefit would be to improve operational decision support by embedding some lightweight analytics or predictive capability. There would be clear benefits for use cases as diverse as Customer 360 and Supply Chain Optimization; Maintenance, Repair, and Overhaul (MRO); capital markets trading; and smart grid balancing. It could also provide real-time feedback loops for ML models. In a world where business is getting digitized, having that predictive loop to support data-driven operational decisions is morphing from luxury to necessity.

The idea of bringing streaming and data at rest together is hardly new; it was spelled out years ago as the Kappa architecture, and there have been isolated implementations on big data platforms -- the former MapR's "converged platform" (now HPE Ezmeral Unified Analytics) comes to mind.

Streaming workloads traditionally run on their own dedicated platforms because of their extreme resource demands. The showstopper keeping streaming on its own island of infrastructure is resource contention.

Streaming applications -- such as parsing real-time capital market feeds, detecting anomalies in the flow of data from physical machines, troubleshooting the operation of networks, or monitoring clinical data -- have typically operated standalone. And because of the need to maintain a light footprint, analytics and queries tend to be simpler than what you could run in a data warehouse or data lake. Specifically, streaming analytics often involves filtering, parsing, and, increasingly, predictive trending.

When there is a handoff to data warehouses or data lakes, in most cases, the data is limited to result sets. For instance, you can run an SQL query on Amazon Kinesis Data Analytics that identifies outliers, persist the results to Redshift, and then perform a query on the combined data for more complex analytics. But it's a multistep operation involving two services, and it's not strictly real-time.
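
To make that multistep handoff concrete, here is a minimal sketch in Python. It is an illustration only, not the Kinesis Data Analytics or Redshift API: a plain generator stands in for the streaming service, an in-memory SQLite database stands in for the warehouse, and the names (`detect_outliers`, `persist_and_query`, the `machines` and `outliers` tables) and the three-standard-deviation threshold are all assumptions made for the example.

```python
import sqlite3
import statistics
from collections import defaultdict, deque


def detect_outliers(stream, window=5, threshold=3.0):
    """Step 1 (streaming side): flag readings far from the recent average.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of that machine's previous `window` readings.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    for machine_id, value in stream:
        history = recent[machine_id]
        if len(history) == window:
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history)
            if stdev > 0 and abs(value - mean) > threshold * stdev:
                yield machine_id, value
        history.append(value)


def persist_and_query(outliers, machines):
    """Steps 2 and 3 (warehouse side): store the result set, then join it."""
    db = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    db.execute("CREATE TABLE machines (machine_id TEXT PRIMARY KEY, site TEXT)")
    db.execute("CREATE TABLE outliers (machine_id TEXT, value REAL)")
    db.executemany("INSERT INTO machines VALUES (?, ?)", machines)
    db.executemany("INSERT INTO outliers VALUES (?, ?)", outliers)
    rows = db.execute(
        "SELECT m.site, COUNT(*) FROM outliers o "
        "JOIN machines m ON m.machine_id = o.machine_id "
        "GROUP BY m.site ORDER BY m.site"
    ).fetchall()
    db.close()
    return rows


# Hypothetical sensor feed: one machine, one obvious spike.
readings = [("m1", v) for v in (10.0, 10.1, 9.9, 10.0, 10.2, 50.0, 10.1)]
flagged = list(detect_outliers(readings))
report = persist_and_query(flagged, [("m1", "plant-a"), ("m2", "plant-b")])
```

Only the flagged rows cross from the streaming side to the warehouse side, and the richer query (the join against reference data) runs after the fact. That is the gap the text describes: two services, two steps, and results that trail the live stream.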

Admittedly, with in-memory operational databases like Redis, you can support the near-instant persistence of streaming data using append-only log data formats, but that is not the same as adding a predictive feedback loop to operational applications.
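
As a rough illustration of what an append-only log format means in practice, the sketch below writes each incoming event as one line and rebuilds current state by replaying the log from the start. It is a generic Python example, not how Redis implements persistence; the `AppendOnlyLog` class, the event shape, and the sensor names are all hypothetical.

```python
import io
import json


class AppendOnlyLog:
    """Toy append-only log: events are only ever added, never rewritten."""

    def __init__(self, sink):
        self.sink = sink  # any writable text file-like object

    def append(self, event):
        # One JSON document per line; flush so the write is handed off at once.
        self.sink.write(json.dumps(event) + "\n")
        self.sink.flush()

    @staticmethod
    def replay(source):
        """Rebuild the latest value per key by reading the log in order."""
        state = {}
        for line in source:
            event = json.loads(line)
            state[event["key"]] = event["value"]
        return state


buffer = io.StringIO()  # stands in for a file on disk
log = AppendOnlyLog(buffer)
log.append({"key": "sensor-1", "value": 20.5})
log.append({"key": "sensor-2", "value": 18.0})
log.append({"key": "sensor-1", "value": 21.0})

buffer.seek(0)
latest = AppendOnlyLog.replay(buffer)
```

The log captures every event cheaply and can reconstruct state, but nothing in it scores, predicts, or feeds a decision back to the application.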

Over the past couple of years, we've seen some hints that streaming is about to become part of operational and analytic data clouds. Confluent kicked open the doors when it released ksqlDB on Confluent Cloud back in 2020. Last year, DataStax introduced the beta for Astra Streaming, built on Apache Pulsar (not Kafka); it's currently a separate service, but we expect that it will be blended in with Astra DB over time. In the Spark universe, Delta Lake can act as a streaming source or sink for Spark Structured Streaming.

The game changer is cloud-native architecture. The elasticity of the cloud eliminates issues of resource contention, while microservices provide more resilient alternatives to classic design patterns involving a central orchestrator or state machine. In turn, Kubernetes (K8s) enables analytic platforms to support elasticity without having to reinvent the wheel for orchestrating compute resources. Converged streaming and operational or analytic systems can run on distributed clusters, which can be partitioned and orchestrated for performing real-time stream analytics, merging results, and correlating with complex operational models.

Such convergence won't replace dedicated streaming services, but there are clear opportunities for cloud incumbents: Amazon Kinesis Data Analytics paired with Redshift or DynamoDB; Azure Stream Analytics with Cosmos DB or Synapse Analytics; Google Cloud Dataflow with BigQuery or Firestore all come to mind. 

But there are also opportunities for real-time in-memory data stores. We're talking to you, Redis, not to mention any of the dozens of time series databases out there.

Data share and share, alike

In hindsight, this looks like a no-brainer. With cloud storage being the de facto data lake, promoting wider access to data should be a win-win for everybody: data providers get more mileage (and potentially, monetization) out of their data; data customers gain access to more diverse data sets; cloud platform providers can sell more utilization (storage and compute); and cloud data warehouses can transform themselves into data destinations. 

From that perspective, it's surprising that it's taken each of the major cloud providers almost five years to catch on to an idea that Snowflake hatched.

Snowflake and AWS have been the most active in promoting data exchanges, although they approached it from opposite directions. Snowflake began with a data-sharing capability aimed across internal departments and later opened a data exchange for third parties. AWS went in reverse order, opening a data exchange on AWS Marketplace a couple of years back, but only over the past year has it been adding capabilities for internal sharing of data for Redshift customers. That required AWS to develop the RA3 instance, which finally separated Redshift data into its own pool.

Snowflake has taken the added step of opening vertical industry sections of its marketplace, making it easier for customers to connect to the right data sets. On the other hand, AWS beat Snowflake to the punch in commercializing its data marketplace by utilizing the existing AWS Marketplace mechanism.

Google followed suit with Analytics Hub for sharing BigQuery data sets, a capability that it will subsequently extend to other assets such as Looker Blocks and Connected Sheets. Microsoft Azure has also gotten into the act.

Over the next year, we expect each of the cloud providers to flesh out their internal and external data exchanges and marketplaces, especially when it comes to commercialization.

Database platforms turn to ML to run themselves

This is the flip side of in-database ML, which we predicted would become a checkbox item in 2021 for cloud data warehouses and data lakes. What we're talking about here is the use of ML under the covers to help run or optimize a database.

Oracle fired the first shot with the Autonomous Database, going full-bore with ML by designing a database that literally runs itself. That's only possible with the breadth of database automation that is largely unique to Oracle Database. But for Oracle's rivals, we're taking a more modest view: applying ML to assist, not replace, the DBA in optimizing specific database operations.

As any experienced DBA will testify, running a database involves lots of figurative "knobs." Examples include physical data placement and storage tiering, the sequence of joins in a complex query, and identifying the right indexes. In the cloud, that could also encompass identifying the optimal hardware instances. Typically, configurations are set by formal rules or based on the DBA's informal knowledge.

Optimizing a database is well-suited for ML. The processes are data rich, as databases generate huge troves of log data. The problem is also well-bounded, as the features are well-defined. And there is significant potential for cost savings, especially when it comes to factoring how to best lay out data or design a query. Cloud DBaaS providers are well-situated to apply ML to optimize the running of their database services, as they control the infrastructure and have rich pools of anonymized operational data on which to build and continually improve models.

We've been surprised, however, that there have been few takers for Oracle's challenge. Just about the only formally productized use of ML (aside from Oracle) is with Azure SQL Database and SQL Managed Instance, where Microsoft offers autotuning of indexes and queries. Indexing is a classic problem of trade-offs: the faster retrieval an index provides vs. the cost and overhead of writes when you have too many indexes. Azure's automated tuning can create indexes when it senses query hot spots, drop indexes that go unused after 90 days, and reinstate previous versions of query plans if newer ones prove slower.
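
The three behaviors described above can be pictured as a small decision routine. The Python sketch below is purely illustrative: it uses fixed rules where a real service would rely on learned models and far richer telemetry, and it is not Microsoft's algorithm or API. The `IndexStats` and `PlanStats` records, the `tuning_actions` function, and the sample names are hypothetical; only the 90-day figure comes from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

UNUSED_DAYS = 90  # the unused-index window cited in the text


@dataclass
class IndexStats:
    name: str
    last_used: date


@dataclass
class PlanStats:
    query: str
    previous_ms: float  # average latency under the earlier plan
    current_ms: float   # average latency under the newer plan


def tuning_actions(today, indexes, plans, hot_columns):
    """Return (action, target) recommendations for a DBA to review."""
    actions = []
    # Hot spots: frequently filtered columns that lack an index.
    for column in hot_columns:
        actions.append(("create_index", column))
    # Indexes that only add write overhead once nobody reads through them.
    for index in indexes:
        if today - index.last_used > timedelta(days=UNUSED_DAYS):
            actions.append(("drop_index", index.name))
    # Plan regressions: fall back when the newer plan is slower.
    for plan in plans:
        if plan.current_ms > plan.previous_ms:
            actions.append(("revert_plan", plan.query))
    return actions


actions = tuning_actions(
    today=date(2022, 1, 10),
    indexes=[
        IndexStats("ix_old", date(2021, 9, 1)),
        IndexStats("ix_new", date(2021, 12, 20)),
    ],
    plans=[PlanStats("q1", 20.0, 35.0), PlanStats("q2", 40.0, 30.0)],
    hot_columns=["orders.customer_id"],
)
```

Returning recommendations rather than applying changes directly mirrors the more modest goal stated earlier: assisting the DBA rather than replacing them.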

Over the coming year, we expect to see more cloud DBaaS services introduce options incorporating ML to optimize the database, promoting to enterprises how they can save money. 

Disclosure: AWS, DataStax, Google Cloud, HPE, IBM, and Oracle are dbInsight clients.

This is the first part of our Data Outlook for 2022. Click here for part two, where we predict that data meshes will stand up to their first serious scrutiny.
