The Data Expeditions, Part II: A new alliance makes a daring charge on the data warehouse

The Data Expeditions, Part II: Where an IBM engineer, who helped steer her company away from locked-in, proprietary, vendor-driven data platforms, steers a new course for herself, not only building a new model for live, streaming data but possibly setting a new precedent for vendors.
Written by Scott Fulton III, Contributor

"We at IBM," stated its vice president of data analytics, Anjul Bhambhri, during a 2015 conference, "believe that Spark is the analytics operating system."

It was a bold declaration, especially coming from a senior executive of a company whose reputation was made, in large part, on the quality of its operating systems. But Spark is not an IBM product. It's actually an Apache project that entered the big data space as an alternative processing engine to MapReduce, one of the original, principal components of Hadoop.
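To make the contrast concrete, here is a toy word count in the classic map/shuffle/reduce shape that MapReduce popularized. This is plain Python, purely illustrative; it is not Hadoop's actual Java API, only a sketch of the processing model Spark set out to supersede.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's list of values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In Hadoop proper, each phase runs in parallel across a cluster, with the shuffle moving data between machines; Spark's pitch was to keep those intermediate results in memory rather than writing them back to disk between stages.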

"If you have structured data, you can use Spark SQL. If you have unstructured data, you can drop to Spark Core," Bhambhri continued. "If you're getting data from one of those firehoses, you can use Spark Streaming. You want to build models, you want to learn from networks of data, use MLlib, use GraphX. And what is really magical is, all of these components interoperate seamlessly."

Suddenly, an open source project for a single component was being presented as a complete data platform. IBM's database platform was, and still is, DB2. It has been "2" since 1983, although the numeral never was a version number. Rather, it was symbolic of a generational shift, from the flat-file, hierarchical databases upon which businesses had been subsisting up to that point, to the relational model pioneered by an IBM engineer, Dr. E. F. Codd.

Yet here, speaking to Spark Summit Europe 2015, was an IBM executive referring directly to a "single toolbox for analytics" -- not an extension, not a connector, not some kind of "plus-pack" for DB2. She was talking as though Spark were the core of a "DB3."

"Spark has expanded the range of big data problems that can now be addressed," Bhambhri told the audience. "And it has given us a very enhanced solution vocabulary. It has really taken distributed computing to a whole new, different level. If you can visualize the value locked in your data, Spark can help you find it."


Adobe is not known for being a database company or a data service provider. In March 2015, it launched a kind of portal service for advertisers to purchase ad inventory across multiple media -- something it called Experience Cloud. Mere weeks after Bhambhri's appearance in Amsterdam, Adobe announced its intention to build out that service as part of an emerging Marketing Cloud platform. It would provide a single provisioning mechanism for creating and then deploying advertising material.

Advertising is becoming -- and in the eyes of its practitioners, has already become -- a business of direct contact and participation. For the first several decades of its modern existence, it leveraged the science of ascertaining which pages or programs would gather people's attention in the future. But the definition of "attention" has changed: It has become the tuning of people's senses to some particular point in the network. Television and the web are segments of this network; they may be different media, but they are becoming less separate entities. So the science of guessing future behavior can actually take a back seat to the process of determining where attention is being paid this moment.

Suddenly, advertising itself becomes a response: A message that speaks to where the viewer is and what she cares about, now. For a modern advertiser to be competitive, it needs to execute the process of ascertainment, analysis, and response in a cycle time that's already shorter than what manufacturers or retailers required from their supply chain management processes just a few years ago.


Adobe needed what Anjul Bhambhri was selling: A system for managing an unfathomable flow of data, in real time, through a network with a variable number of indeterminate sources. Spark began as a data flow framework for Hadoop, but it had evolved into a cluster manager, performing many of the services that MapReduce had relied upon YARN and other Hadoop components to provide. Indeed, Spark was looking more like an operating system. More attractively to Adobe and other enterprises, it was assembling something much more like a real platform.

So, in April 2016, Adobe placed its bet. But not on IBM. Adobe hired Anjul Bhambhri, making her its vice president of platform engineering.

"The next wave in enterprise infrastructure is edge computing," the Adobe VP told ZDNet Scale. "We are moving the compute closer to the source of data, because to reduce the amount of data and flows back and forth between the data center and the public cloud, this is a must. This is what enables real-time decisioning, and the sub-millisecond response time that everyone in the industry is chasing."

Bhambhri perceives the spectrum of enterprise data as a heat map. Cooler data, in her mind, suffers the least from the latencies imposed by storage in public cloud services. Processing engines such as Google's BigQuery may be just fine for cool data. Surprisingly, she told us that much of the data that enterprises leverage for analytics operations is fairly cool.

In the emerging business of real-time advertising, the changing behavior of a thousand or more viewers clicking off a website or changing a channel is too hot an asset to leave to public database services.

"We are building a very intelligent profile," Bhambhri told us, "where we are able to bring in all the behavioral data -- who came to which website when, what did they click on, where they went from there, what they're doing on their mobile devices. We capture, on an annual basis, almost 150 trillion transactions from our customer base. And 57 percent of these are from mobile devices. We help our customers build those profiles which help them understand, what is the optimized experience that they have to deliver to their businesses?"

The problem with relational databases is not the data itself, but the means of storing it. Dr. Codd originally envisioned a logical system whereby the coordination of the relations in the database was defined by what he called a "meta-data catalog" (in so doing, perhaps coining the word "metadata"), and what he reluctantly acknowledged was also called a schema. Codd envisioned other management tools with capabilities to manage this metadata. At the very least, he foresaw the classes of interoperable tools and services that Hadoop would eventually embody; more likely, he laid the groundwork for them.

But within enterprises themselves, schemas would become the rulebooks for how data would function. Oracle, for instance, perceived the schema as the foundation for procedures that made database elements perform as objects in object-oriented contexts. Procedures in PL/SQL that would not look too unfamiliar to someone grounded in COBOL would leverage schemas to explain how data could be used -- and, by extension, place restrictions on any other type of exploitation or experimentation.

The enforcement of these rules became the main bottleneck for distributed data at huge scale. The wave of database methodologies whose collective name, "NoSQL," originally meant "no SQL" eliminated this bottleneck, sometimes substituting another system and sometimes not. Having no rules to enforce, or fewer rules, meant faster processing. But it also led to data inconsistencies that prevented long chains of transactions from being reliably processed at large scale.
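The trade-off can be sketched in a few lines: Enforcing a schema costs a check on every write, while skipping the check makes writes cheap but lets inconsistent records pile up until some downstream job trips over them. This is plain Python and purely illustrative; the stores and the schema here are hypothetical, not any particular database's machinery.

```python
def validate(record, schema):
    # Schema enforcement: every write pays for a type check on every field.
    return all(isinstance(record.get(field), ftype)
               for field, ftype in schema.items())

schema = {"user": str, "amount": int}
strict_store, loose_store = [], []

incoming = [{"user": "ada", "amount": 3},
            {"user": "bob", "amount": "seven"},  # inconsistent: amount is a string
            {"user": "eve", "amount": 5}]

for record in incoming:
    if validate(record, schema):   # the "bottleneck": bad records rejected up front
        strict_store.append(record)
    loose_store.append(record)     # schemaless: everything accepted, unchecked

# Downstream aggregation is reliable on the strict store...
total = sum(r["amount"] for r in strict_store)   # 8
# ...but fails on the loose store once the inconsistent record is hit.
try:
    total_loose = sum(r["amount"] for r in loose_store)
except TypeError:
    total_loose = None
```

The strict store refuses one write and stays aggregable; the loose store takes every write instantly and defers the cost of the inconsistency to whoever reads it later.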

Hadoop resolved this deficiency -- at first without replacing SQL, but later enabling the query language to come back, with the addition of tools such as Apache Hive and Apache Drill. But the way Hadoop pulled off this miracle not only paved the way for its own overthrow in the enterprise, but, effectively, wrote a manual for how something else would do so.

It's as though the invading task force on a strategic ocean outpost forgot to watch its own back.

High noon

The sun is high over the desert island of Datumoj -- the island we introduced to you in Part I, where the history of the world's database technologies is being played out. The production center for corporate databases has been firmly established on the north shore. On the south shore is the island's jewel: The ETL facility, whose three buildings are now all running at maximum production levels. From the eastern shore, SQL Division has pushed west-northwest to build its command post.


Meanwhile, what remains of the Ledger Domain has consolidated, retaining its positions on the high ground by successfully defending its schematic fortress from incursion by the allies. It has formed its own accord with manufacturers, each of which has its own interest in maintaining the high ground. An uneasy truce permits production and ETL to run a direct supply route between each other, through the high territory, but only through one fortress at a time. It's a truce that disrupts the homogeneity of the payloads passing over each supply route. Every payload's manifest is exclusive to its chosen road, ensuring that each fortress maintains some level of control over how payloads are processed.

It is D + 242. In a daring and unannounced northern assault, a trio of units comprising the northern task force makes a daytime landing. They strike just west of the key production facilities, OLTP and OLAP. Their objective is the under-utilized supply route between the production facility and ETL -- the route which bypasses the schematic strongholds.

The assault is led by an unusual, amphibious mechanized brigade called MapReduce. It arrives with all the equipment it would need to overtake the production facility and replace its processors with automated units running in parallel. Also establishing a beachhead to the southwest, near the foot of one of the Schematic mountains, is a highly specialized transport battalion called HDFS. Its aim is to create a network of high-grade supply routes between all the storage batteries on the island, linking them together so that they operate as a single core. Together, MapReduce and HDFS push back what minimal resistance there was, downhill toward the SQL command post to regroup.

Then, an unusual, highly mobile artillery unit, calling itself YARN, takes the northern point of the forgotten western island of Eliro. YARN prides itself on keeping a tight schedule. With superior communications capabilities, YARN's objective is to make its way gradually south toward ETL, assimilating the western supply route mile by mile, and in so doing taking charge of all payload traffic passing over it.

With the original allies regrouping, just days after the Hadoop invasion, an outside force makes an uncalculated, perhaps foolish, attack on the island's eastern shoreline. Calling itself the NoSQL Task Force, it sends its lead battalion, MongoDB Brigade, for the heart of the command post, but the brigade finds itself greatly outnumbered, facing heavy resistance. In a risky maneuver with little chance of success, a breakaway light infantry unit called Cassandra Company lands further south. Perhaps with no real plan in mind, it surges south toward ETL, though with certainly not enough force to capture it.


Light of day

The data part of the data center is in continual conflict. At the core of it all are the schemas and domain models in which the business logic of our organizations is tightly enmeshed. Those models are typically tied to proprietary formats that serve to protect the lofty positions of the brands that IT departments' predecessors, or perhaps their predecessors, first chose. Yes, these models are entrenched. But they don't, to borrow a term, scale.

Along comes an entirely new model that breaks through physical barriers. Hadoop enabled a heretofore unseen degree of scalability, deployed a new parallel operating model for tasks, and established a file system that could finally transcend physical volumes. But, immediately, that model was pushed aside as experimental, or relegated to deployment on platforms where it wouldn't interfere with principal workloads. As late as 2013, commercial Hadoop provider Hortonworks was pitching Hadoop as the perfect data engine for platforms such as OpenStack, where it would face few legacy applications and wouldn't have to bother with the whole integration problem.

"Each organization has its own internal data architectures," said Matei Zaharia, the Stanford University professor who co-created Apache Spark. "A data warehouse and the process for loading it there, might look different than for somewhere else. Every family is unhappy in its own way."

I've spoken to enterprise purchasing decision makers at recent conferences, including some that feature Spark and its associated staging platform, Mesos. They tell me they're intrigued by what they're seeing, but they're not ready to make a commitment just yet. They're waiting, they say, for things to finally "gel" -- to become a platform, a seamless unit, or what a wise lady once likened to an operating system.

These people are looking for a solid, uninterrupted connection between data definitions, transactions, analysis, and processing. Yet, ever since the advent of the PC and the microprocessor, all through the reign of dBASE, the rise of DB2, the market supremacy of Oracle, and the under-the-radar rise of SQL Server that some say continues even today, there has never been a single, end-to-end platform for the processing of actionable, recordable, configurable, valuable data in the enterprise. The dream of these purchasing managers often remains a dream.

"The easiest thing is to adopt new technology for new use cases," said Zaharia. "So, if you have something that's working, and has been working for a couple of decades and its users are a traditional data warehouse, maybe that's fine. But when you have a new use case -- especially one that generates lots of data, like sensors or smart meters -- you design it to use the new technologies, to use Spark or maybe these streaming systems or Cassandra or a data lake. And then at the very least, you want to connect it to data and the data warehouse."

Some organizations, Zaharia conceded, are interested in the idea of moving their transactions out of their existing data warehouses, simply to save costs. And a few, he said, are forced to do so out of an unavoidable need to scale their systems out. These are the organizations that literally have to break something to keep their systems working. But there are many other firms for which the cost/benefit analysis has yet to yield fruit. This from Spark's own creator and the CTO of Databricks, who would arguably benefit most from a positive result.

"Unfortunately, I don't think there's a standard way to do it," he told ZDNet Scale. "But I think definitely for new use cases and new data sources, it makes sense to architect them using more recent tools. And then beyond that, there are patterns you can use to move stuff over, when it makes sense."

That's not at all how Adobe's Anjul Bhambhri sees the situation. Having spent 15 years at IBM, she sees enterprises as being too constrained by their own data warehouses, and jettisoning them outright. Rather than build their successors on-premises, or assemble them themselves, she expects they'll adopt prefabricated platforms in the public cloud, where there already are standard ways to do things.

"Enterprises really have made the leap, and have moved away from their own data centers to the public cloud providers," Bhambhri remarked. "For infrastructure-as-a-service and platform-as-a-service, to get that centralized storage, compute, scale, and elasticity to handle all this inconsistent workload, and to get the multitude of PaaS offerings -- for certain workloads you need a mix. Just SQL is not enough. You need NoSQL. And batch-ingesting the data is not sufficient; you need that real-time ingest."


We have set the stage for the invasion of Spark, and the open-source components that follow in its wake. But the success of its mission is nowhere near certain. To produce an end-to-end platform, or at least the semblance of one, any new scheme has to make peace with the existing schemas, be interoperable with the SQL queries already on the books, and be capable of integrating with ETL if and when such integration is practical and beneficial.


It is a very tall order. For our next stop, we find out whether it's a job for the SMACK Stack. After that, we'll undertake a real-world investigation of a database the size of a whole country, and how it's coping with the fact that it could have used Kafka and real-time streaming eight years ago. Is it too late to perform a nervous system transplant? Until next time, hold fast.

The Data Expeditions

The "Battle of Datumoj" was inspired by World War II's Battle of Morotai. There, an island,which seemed easy enough to liberate just months after D-Day in France, ended up being an active battlefield until V-J Day, and even afterward. The real story of Morotai, its strategic importance, the real regiments that fought there, and the troop movement maps that inspired this series, are available from the World War II Database.
