
We interrupt this revolution: Apache Spark changes the rules of the game

The Data Expeditions, Part III: Where the database market's equivalent of a commando raid brings a new actor to center stage, with the incentive and the opportunity to upset the balance of power and rewrite the rules of the data center.


Mid-afternoon on Datumoj island, D-Day plus 286.

The Hadoop task force has a firm hold on the western coastline. It does not yet control the old supply route between production on the north end and transport on the south. Snipers from the fortress atop the Schematic mountain range peer down at potential targets. On the eastern front, the NoSQL offensive has collapsed. NoSQL surrenders to the SQL camp, changing its callsign to "Not Only SQL," and leaving Cassandra Company stranded by itself in an encampment outside the southern ETL facility.

As part of a support effort, Hadoop has enlisted a long range artillery division, called Spark, to shore up its northern line. On Spark's recommendation, the task force has brought in Mesos, an experienced mechanized cavalry unit, to help it stage a new production and transport operation along the northwest coast, assuming the old route first paved by the original occupying engineers.



Spark and Mesos advocate for a new, "two-level" scheduling objective, with transport operations on one itinerary and production operations on a separate one. Mesos Cavalry Unit has the skill, they argue, to run both operations in parallel. They gain support from the Hadoop task force, and begrudging support from the allies, who are mainly interested in learning whether someone else can divert operations away from the Schematic strongholds without suffering too many casualties.



Tag team

"One of the goals of Spark," remarked Spark's co-creator, Databricks CTO Matei Zaharia, "is to give you a single computing engine that can do many functions together, so you don't have to learn how to manage and develop against four or five different systems to build your application.

"But the thing that we're trying to do with Spark," Zaharia continued, "is let you write your application against the Spark programming interface, or the Spark SQL interface if you're using SQL. Then you can connect to many different data sources underneath in a uniform way." As an example, Zaharia cited a user storing data on Amazon Web Services' S3 object storage service, but then moving that data back on-premises and into Apache Cassandra, which is a non-relational column store born from the NoSQL movement. Although the type of data store changed entirely, the Spark application, said Zaharia, does not have to change, except for specifying the data source.
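Zaharia's point is that only the source-selection step of a Spark job varies with the backend. Running the real thing requires a cluster plus connector JARs, so what follows is a minimal stdlib sketch of that indirection; the `cassandra://` URI shorthand, the host, and the keyspace and table names are invented for illustration, with the actual PySpark call shown in a comment.

```python
# Sketch of Spark's uniform data-source indirection: the transformations in a
# job stay the same; only the reader configuration changes with the backend.
# The real PySpark call (needs a cluster and connector JARs) would be:
#   df = spark.read.format(fmt).options(**opts).load()

def reader_config(uri):
    """Map a (hypothetical) data location onto a Spark reader format + options."""
    if uri.startswith("s3a://"):
        # Amazon S3 via Hadoop's s3a filesystem; Parquet files assumed here
        return "parquet", {"path": uri}
    if uri.startswith("cassandra://"):
        # The spark-cassandra-connector addresses data by keyspace and table
        _, _, _, keyspace, table = uri.split("/")
        return ("org.apache.spark.sql.cassandra",
                {"keyspace": keyspace, "table": table})
    raise ValueError("unsupported source: " + uri)

# When the data moves from S3 to an on-premises Cassandra cluster, only
# this one line changes; the rest of the job is untouched.
fmt, opts = reader_config("cassandra://db-host/sales/orders")
```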

Spark is not a database, nor a database manager, nor an administrative front-end. It is an engine, designed to process tasks at high speed. The Apache Mesos framework (created at UC Berkeley just across the hall from Spark) can manage the scheduling and staging of Spark tasks; Mesosphere's DC/OS is one commercial implementation of the resulting fast -- and by many folks' measures, "real time" -- data processing system.

"The thing that Mesos allows for is an abstraction of the physical resources from the application," explained Joshua Bernstein, Dell EMC's vice president of emerging technologies.

"In traditional deployments, even in the old data warehouse days of Oracle," Bernstein continued, "you ran these applications regardless of whether it was Oracle, Spark, or Hadoop, on dedicated hardware. And that worked fine, except when you needed to start actually developing that application and doing something different with those resources. So what Mesos allows you to do is abstract the underlying hardware from the application. Now with Mesos, I can run Spark; I can also run two instances of Spark, pointing at the same data warehouse underneath. Or I can have a hundred instances of Spark, in a development environment where I don't want to have a hundred separate pieces of physical hardware to support this environment, but I need to be able to run Spark at some sort of scale."
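Bernstein's scenario reduces to ordinary job submissions against a shared pool: Mesos offers resources to each Spark framework, and a per-job cap keeps one instance from starving the others. A configuration sketch, in which the Mesos master URL, core counts, and application files are all assumptions:

```shell
# Two independent Spark drivers sharing one Mesos-managed resource pool.
# spark.cores.max caps how much of the pool each framework may claim.
spark-submit --master mesos://zk://mesos-master:2181/mesos \
             --conf spark.cores.max=8 \
             reports_job.py &

spark-submit --master mesos://zk://mesos-master:2181/mesos \
             --conf spark.cores.max=8 \
             experiments_job.py &
```

The same pattern extends to Bernstein's hundred-instance development environment: more submissions, same hardware, with Mesos arbitrating the offers.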


In Part II, we told you about the schema -- the part of the relational database whose name E. F. Codd probably didn't like very much, and which defines the relationships between tables and elements. In an earlier era of database platforms ("earlier" being a relative term, referring more to newness of development than active practice) procedures were constructed atop these schemas. This relationship helped lock procedures to the databases to which they referred, and thus helped defend the proprietary fortresses.

"Mainframe offload is a very popular enterprise use case [for Kafka], where the applications are built around big mainframes. They use a connector, using Kafka's Connect API, to offload data from mainframe into Kafka, and then change the consumption pattern to go with Kafka."

— Neha Narkhede, chief technology officer, Confluent
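The mainframe-offload pattern Narkhede describes is configured rather than coded: a source connector, registered with Kafka Connect's REST API, streams records from the legacy system into a topic that downstream consumers read instead of the mainframe itself. In the sketch below, the connector class and host are hypothetical placeholders (real deployments use a vendor's mainframe connector); the converter classes are stock Kafka Connect components.

```json
{
  "name": "mainframe-offload",
  "config": {
    "connector.class": "com.example.MainframeSourceConnector",
    "tasks.max": "4",
    "topic": "mainframe.transactions",
    "mainframe.host": "mvs.example.internal",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}
```

POSTed to a Connect worker's `/connectors` endpoint, this registers the connector; consumers then change their consumption pattern, as Narkhede puts it, to subscribe to `mainframe.transactions` rather than querying the mainframe directly.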

Mesos enables a much looser coupling. Essentially any applicable language can produce a procedure that interacts with Spark in a scalable manner. Put another way, you're not stuck with PL/SQL. Database developers today lean toward Scala, a functional language for the Java virtual machine well suited to distributed components, although it's entirely permissible to use Python, the general-purpose language that has become the lingua franca of data science.

Spark and Mesos have made their way forward by severing dependencies and decoupling the interfaces between the key components of the data warehouse. "We don't need the single, global schema where everyone has to agree, in order to put anything in," said Zaharia. "You don't even need to use a single storage system, necessarily."

Dell EMC's Bernstein expects many organizations to run Spark on bare-metal servers, without Mesos or any other framework. While there are plenty of engineering reasons why this may not be preferable in certain situations, it's still going to happen, he said. Corporate politics being what they are, the decision to finally adopt Spark will be an immediate one, and the hardware and infrastructure will already have been chosen for it.

"I think in a perfect world, in a more involved world, you run Spark and you treat Spark as just another application on top of Mesos. No question," he remarked. "One of the powers of Mesos is that you can use it to schedule bare-metal workloads just as easily as you can containerized workloads. Mesos allows you to schedule it and resource-manage it. That's the powerful concept."

Resource management, he argues, brings back into fashion the idea of crafting efficiency into processes -- an idea born from the era of timing UNIVAC translation and loading processes using a hand-held slide rule. In that era, the users of applications placed often unreasonable demands on the hardware for throughput and performance. Engineers had to balance out those demands with the capabilities of their machines, in what we may literally call "real time."

"It's not necessarily the containerization of it, it's the schedulability, if you will," said Bernstein. "That's the future. That's what I think the infrastructure and the concepts look like."

"One of the reasons you see that crust of legacy in data centers is because these organizations have failed to recognize the importance of changing their behavior."

— Joshua Bernstein, vice president of Technology, Emerging Technologies Division, Dell EMC

Open source components tend to be created with the objective of solving specific problems, he explained. In Spark's case, it was to manage data processing tasks in parallel, and in so doing, lend structure to classes of data that would normally be considered unstructured. If a component can continue to serve that original purpose, he believes, citing Linux as an example, it could live on indefinitely. What's more, each vendor in the Linux space can leverage the expertise of all the others, to substantiate their efforts and give them space to flourish. Databricks' Zaharia also cited MySQL and Postgres as examples of open source functionality that has no good reason to be displaced, at least not yet.



The crust of legacy

"The virtue of the modern data warehouse," I wrote in 2013, "is that it serves relevant data to applications that know how to use it. It is not a database. Rather, it is the component that databases turn to when they need to retrieve data. . . Thanks to the emergence of cloud technology, a data warehouse is no longer just one thing with one brand. It's not even just one place, unless you count 'Earth' as a place. It can be the product of many brands and many components working together."

"Hadoop tried to tackle two very different problems: Storage with the Hadoop File System, and computations through MapReduce. The storage part is great if you need to manage your own data center, but it doesn't make sense at all in the cloud."

— Matei Zaharia, CTO and co-founder, Databricks

All this was just so much wishful thinking, and I should have recognized the taste of Kool-Aid. Back then, once it had become clear that Hadoop had called into question the continued existence of data warehouses, the major brands, including and especially IBM and Teradata, launched a kind of "embrace-and-extend" initiative. Indeed the data warehouse was dead, they declared. But just like in a detergent commercial, a newer and more polished version of the same brand may emerge with slightly brighter packaging, and the promise of delivering more value without you having to move an extra muscle.

IBM's initial strategy for embracing structured data, unstructured data, cloud-based object stores, and live data streams, was dubbed the Layered Data Architecture. It was an architecture inasmuch as the piles on my desk were an intentional assembly of paper products.


"One of the reasons you see that crust of legacy in data centers," said Dell EMC's Joshua Bernstein, "is because these organizations have failed to recognize the importance of changing their behavior. They're not resourced -- either financially, culturally, risk-wise, or a wide variety of other reasons. What happens is, these organizations suffer [at the hands of] these younger companies that are resourced, that are doing things more cost-effectively, that can create more value out of their data and make decisions based on their data."

As an unusually negative use case, Bernstein cited Toys 'R' Us. As part of a 2016 initiative to bring its outsourced and cloud-based operations back in-house, and win back some competitive edge against Amazon, the retailer launched a desperate effort to rebuild and retool its supply chain. This came after a Wall Street Journal report earlier in the year revealed that the Toys 'R' Us website kept only about 62 percent of the top 100 selling toys in stock throughout the Black Friday shopping period.

Toys 'R' Us was a customer of Teradata, still the champion of data warehousing. At the time of the initiative, Teradata helped the retailer announce what it called the optimization of its supply chain.

A promotional video produced at that time explained a common technological problem: Split shipments, where portions of orders are fulfilled from two or more distribution centers when they could conceivably be fulfilled from one, faster. "Because things do move so quickly," the retailer's expert said in the video, "and competition's out there doing the same stuff, so how do you keep up?"

"One of the most interesting tidbits out of this," remarked Bernstein, "was that their infrastructure was so old, that it was very, very difficult for the private equity firms to derive any value out of the data they were getting presented with. They were getting data that came out of some archaeological dig, presented as a slide on a PowerPoint. But by the time they saw the data, it was already one quarter old."

The toy retailer filed for bankruptcy last September. It cannot be stated with certainty just yet that an outdated or outmoded supply chain was to blame for the retailer's trauma, nor that modernizing it would help it emerge into prosperity again. But it's obvious that, no matter when it actually realized its future was just as uncertain, Toys 'R' Us lacked the time to effectuate a change in its IT operations.

"I think it's very, very important for companies to evolve," stated Bernstein. "Otherwise, somebody's going to evolve them, or evolve past them. That's sorta what happened with the dinosaurs."


Turncoats

D-Day-plus-291: In a shock to the Hadoop task force, Spark Division and Mesos Cavalry Unit declare their independence. Now officially a rogue force, they make way for two other units to land along the coast that Hadoop had already won: Akka is a mid-range infantry regiment capable of staging quick-and-dirty missions with limited streams of resources, on short notice. And Kafka, an intriguing special engineering battalion, seeks to build a powerful long-range transmitter on Eliro island, potentially coordinating all operations and reducing the use of supply routes for communication.


News of Spark's success on the western front brings Cassandra into the emerging rogue stack, as it takes up a southernmost position. Now the SMACK Stack is geared to launch an all-out assault on the allies' warehouse operations. But its final strategy is as yet undetermined: Does it take out ETL, or supplement it? If it embraces SQL, must it also make peace with the domain fortresses at the top of the hills?


The woolly mammoth

"I think that the word 'Hadoop' is problematic," stated Ted Dunning, chief application architect for data platform provider MapR (a company whose very name derives from a Hadoop component, MapReduce). "It's been used and abused so much that it's very hard to understand what somebody's talking about, when they say, 'Hadoop.'"

By way of metaphor, we've introduced you to the three principal components of Hadoop: The HDFS file system, the YARN scheduler (which has since been supplanted on numerous occasions, including by Spark's own scheduler), and MapReduce. "I think MapReduce is becoming relatively passé," said Dunning, speaking with ZDNet Scale. Though there have been late efforts to extend the service lifetime of Hadoop's components, including such last-blast efforts as Tez and the aptly named LLAP ("Live Long and Process"), Dunning believes, "I think the innovation rate dropped dramatically, and the fashionability of it has dropped dramatically."

What remains of Hadoop in active development, he told us, are projects surrounding Apache Hive, the component that enables existing data warehouse SQL queries to be extended to large HDFS volumes. It's somehow both fitting and ironic that Hadoop's final projects should pertain to integration with data warehouses, in preparation for a future that Hadoop no longer really has.
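The Hive bridge works in exactly this direction: an ordinary SQL statement, written the way a warehouse query would be, executed over files sitting in HDFS. A minimal sketch, with the table, columns, and path invented for illustration:

```sql
-- Expose raw HDFS files to SQL tools without moving or converting them.
CREATE EXTERNAL TABLE web_logs (
  ts       TIMESTAMP,
  user_id  STRING,
  url      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:///data/raw/web_logs';

-- The same query a warehouse would run, now against the HDFS volume.
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```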

"I think what happened with Hadoop was two issues," said Zaharia. "One is that Hadoop as a project tried to tackle two very different problems: Storage with the Hadoop File System, and computations through MapReduce. These were tied together; it was pretty difficult to use HDFS initially without MapReduce, and it was pretty difficult to use MapReduce with anything else. And the storage part is great if you need to manage your own data center, but it doesn't make sense at all in the cloud. For cloud storage, block stores are way more scalable, way more available, and also cheaper than HDFS. And it also doesn't make sense if you want a key/value store such as Cassandra; there's this mismatch, where you can only store stuff in HDFS if you wanted to process it this way."

"What's happening is that this traditional, monolithic Hadoop stack often isn't the right choice anymore for these modern applications," remarked Tobias Knaup, chief technology officer of Mesosphere, which produces the DC/OS commercial Mesos platform. Laying down the gauntlet, Knaup invoked the dreaded word "monolithic" not in reference to data warehouses, or proprietary database platforms, but to Hadoop -- the technology whose very symbol implies traveling in packs.


"Customers instead want to pick and choose the right tools for building their data applications," Knaup continued. He told the story of one customer that produces, among other things, medical records databases for healthcare customers. Such systems can be deployed in a matter of days, rather than as long as ten months. "They can deploy the software that they've developed in-house to production within minutes now, instead of deployments taking multiple hours."

The conflicting strategies to win customers and influence enterprises have always been a complicated form of warfare. Peace, as embodied by the latest open source efforts to collaborate to produce non-proprietary components that don't lock enterprises into single-vendor environments, is among the latest weapons in this effort. The battles may be bloodless, yet you still know it's war by counting the casualties. Even yesterday's allies start to look like the old guard.

"A lot of people ask this question: They say, 'You've displaced this one technology; does this happen all the time?'" remarked Zaharia. "I would say no."

Lookout


From here, we leverage our metaphorical time machine to look deeper into the "real-time" nature of data operations, and how the SMACK Stack may yet unseat the data warehouse. And finally, we'll spot Spark's own obsolescence looming on the horizon, and ask whether the nature of open source development makes it impossible for anything to finally settle down into a stable platform. Until then, hold firm.

The Data Expeditions

The "Battle of Datumoj" was inspired by World War II's Battle of Morotai. There, an island which seemed easy enough to liberate just months after D-Day in France, ended up being an active battlefield until V-J Day, and even afterward. The real story of Morotai, its strategic importance, the real regiments that fought there, and the troop movement maps that inspired this series, are available from the World War II Database.