
The Data Expeditions, Part V: A new wave of change forces the data revolution to adapt or perish

The Data Expeditions, Part V: Where we re-introduce ourselves to the two types of rules that are typically considered "business logic" in an organization, and gauge whether there's time to translate them both into modern data streams before another orchestrator comes charging in.
Written by Scott Fulton III, Contributor


"Our planning and movements systems are still cumbersome. . . We have been schooled on planning by using big matrices, with every cell to be filled before moving forward. We have learned a passion for detail, but not necessarily how to compromise for the sake of urgency. Surely with all modern capabilities, we can be much more timely in deployment planning, as well as operational analyses and preparations."

-Gen. Wesley K. Clark, Former Supreme Allied Commander, Europe
Waging Modern War, 2001

Rarely, and probably never, in history has the old guard, with its antiquated methods, laid down its arms and surrendered without a vigorous fight. History, if it's kind to us, will eventually record that the greatest technological disruptions were caused by the resisters of change rather than by the revolutionaries.


This is the story of the setup for such a defense. It begins on D-Day-plus-298 for the metaphorical island of Datumoj, stuck in the throes of a tense and enduring stalemate. Spark Battalion feels it has one chance for breakthrough, if it can establish offensive positions on the Ledger Domain's mountaintop strongholds. Cutting off their supply routes to the valleys below could starve them just enough to force a bloodless surrender, and a negotiated armistice.

It all sounds like a decent plan, provided Spark doesn't get upstaged by someone else before it can pull it off.


The first domain

In his 2002 book Patterns of Enterprise Application Architecture, Martin Fowler (whose writing would later help popularize continuous delivery, the "CD" part of "CI/CD") presented a kind of object-oriented programming context called a domain model. Fowler didn't invent the idea, but he did put forth the best definition: In a domain such as a business, an object model of the domain incorporates both its behavior and its data.

Essentially, a domain model is how a program written in an object-oriented language such as C++ or Java should map the objects it holds in memory, and the elements of data it creates and manages, to things in the real world. A service, under such a system, would be a program that performs a discrete function and provides an explicit result. Fowler did not have the opportunity this early to define microservice, but it has since come to be known as a class of service designed to be independently scalable, for use in distributed systems architectures.
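To make that concrete, here is a minimal sketch of what a domain object might look like in Java. The class and its discount rule are hypothetical, but they illustrate Fowler's point: the data and the behavior that governs it live together in one object.

```java
// A minimal, hypothetical domain object: data (fields) and
// behavior (a business rule) live together in one class.
public class OrderLine {
    private final String product;
    private final int quantity;
    private final double unitPrice;

    public OrderLine(String product, int quantity, double unitPrice) {
        this.product = product;
        this.quantity = quantity;
        this.unitPrice = unitPrice;
    }

    // Business logic encoded as behavior, not as a stored procedure:
    // orders of ten units or more earn a five percent discount.
    public double extendedPrice() {
        double base = quantity * unitPrice;
        return quantity >= 10 ? base * 0.95 : base;
    }

    public static void main(String[] args) {
        OrderLine line = new OrderLine("widget", 12, 2.50);
        System.out.println(line.extendedPrice()); // prints 28.5
    }
}
```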

"At its worst business logic can be very complex," Fowler wrote. "Rules and logic describe many different cases and slants of behavior, and it's this complexity that objects were designed to work with. A Domain Model creates a web of interconnected objects, where each object represents some meaningful individual, whether as large as a corporation or as small as a single line on an order form. . . If you have complicated and everchanging business rules involving validation, calculations, and derivations, chances are that you'll want an object [domain] model to handle them."

At the core of every enterprise is its digital model -- the way its applications represent how the business works. If a single database platform were to encapsulate the way all businesses could be modeled, in all foreseeable cases, then it could make a compelling case for a rip-and-replace transformation of business applications -- a "digital transformation" that could be measured on a stopwatch.

Such a case might look like this: Suppose a single mechanism could model both the stages and events of core business logic, and the way incoming data should be interpreted to integrate with that logic. A domain model plus an ETL model, in one graph. And suppose the immediate benefit were immediacy itself: the means to stage translation processes in parallel, and execute tasks upon incoming streams in real-time.

SQL was not originally intended to encode business logic. But because the procedures built around SQL, using languages such as PL/SQL, provided both the form and format for the reports upon which businesses relied, the schemas built in the development of these procedures have often been considered the core logic of the business.

Fowler would rather that business logic be encoded as objects in memory, not as database components. However, even his own 2003 treatise on the issue conceded that where one encodes that logic depends upon an organization's own codes of best practice, and perhaps even upon certain matters of convenience.


The domain model is one way to apply a pattern to the objects a program retains in memory, so that they have some manageable relationship to things in the real world. It is the programmatic counterpart to the schema -- the rules with which a relational database associates units of data with the real world. And because the domain model was meant for programmers and not data engineers, it works differently from a schema.

Here's one example: In a typical schema, a record has its own exclusive primary key -- like a license plate number, but something human users would rarely see. Anything sharing the same primary key must, by relation, be part of the same record. In an object-oriented domain model, equality and identity are separate concepts. So if two instances of an object class have the same values and contents, they're still separate items -- they're not part of the same object.
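A short, hypothetical Java fragment makes the distinction plain:

```java
public class EqualityDemo {
    // A value-holding class that defines equality by content.
    static class Customer {
        final String taxId;
        Customer(String taxId) { this.taxId = taxId; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Customer && ((Customer) o).taxId.equals(taxId);
        }

        @Override
        public int hashCode() { return taxId.hashCode(); }
    }

    public static void main(String[] args) {
        Customer a = new Customer("12-3456789");
        Customer b = new Customer("12-3456789");

        System.out.println(a.equals(b)); // true: equal by value
        System.out.println(a == b);      // false: two distinct objects
    }
}
```

In a schema, two rows bearing the same primary key would be one and the same record; in the domain model, they remain two objects that merely happen to be equal.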

These odd behavioral differences have usually been explained away as esoteric quirks, unimportant within each model's respective context. But a decade ago, the work of coping with them gave rise to a legitimate industry: object / relational mapping (ORM). A tool called Hibernate ORM, now stewarded by Red Hat, helps software developers automate the ORM process so that schemas may map more directly to object models.
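Here is a minimal sketch of such a mapping, using the standard JPA annotations that Hibernate implements (the class, table, and column names are hypothetical):

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// The @Id field ties object identity to the schema's primary key,
// so a row and an in-memory object can be kept in correspondence.
@Entity
@Table(name = "customers")
public class CustomerEntity {
    @Id
    @Column(name = "customer_id")
    private Long id;        // maps to the table's primary key

    @Column(name = "tax_id")
    private String taxId;   // maps to an ordinary column

    protected CustomerEntity() {}  // JPA requires a no-argument constructor
}
```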

One means of representing such a mapping is the directed acyclic graph (DAG). It's like a visual flowchart depicting the stages of a process, with a clear start and a clear end. Rendered properly, a DAG can be an intermediary between the object model, which depends more upon behavior, and the schema, which relies more upon state.
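As a rough illustration -- not any particular engine's API -- a DAG of processing stages can be as simple as an adjacency list with one entry point and one exit:

```java
import java.util.List;
import java.util.Map;

public class PipelineDag {
    public static void main(String[] args) {
        // Each stage lists the stages that depend upon it. The graph has
        // a clear start ("extract") and a clear end ("load"), and the
        // independent branches ("cleanse", "derive") may run in parallel.
        Map<String, List<String>> dag = Map.of(
            "extract", List.of("cleanse", "derive"),
            "cleanse", List.of("join"),
            "derive",  List.of("join"),
            "join",    List.of("load"),
            "load",    List.of()
        );
        dag.forEach((stage, next) ->
            System.out.println(stage + " -> " + next));
    }
}
```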

Dr. E. F. Codd, the inventor of the relational model upon which SQL is built, warned against the use of such an interpreter as early as 1990, stating that an object-oriented language could not yet enforce the integrity of data and data systems. "Each new model that comes along," Codd wrote, "must be carefully examined from the standpoint of its technical merit, usability, and comprehensiveness."

Despite Codd's warning, the DAG has evolved to become the tool with which Apache Spark represents the extract / transform / load (ETL) process -- the original maintenance engine of the data warehouse. All jobs in Spark, including ETL and the translation of schemas, may be drawn up as DAG blueprints. Indeed, a Spark SQL query is itself compiled into a DAG of stages, which Spark's built-in DAG scheduler then dispatches across the cluster.
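A minimal Java sketch suggests the shape of it (the paths and table name here are hypothetical); the SQL never executes row-by-row against the source, but is first compiled into a plan of stages:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkEtlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("etl-as-dag")
            .getOrCreate();

        // Register a raw source as a queryable view...
        spark.read().json("/data/raw/orders").createOrReplaceTempView("orders");

        // ...then query it. Spark compiles this SQL into a DAG of
        // stages before any executor runs a single task.
        Dataset<Row> totals = spark.sql(
            "SELECT product, SUM(quantity) AS total FROM orders GROUP BY product");

        totals.write().parquet("/data/curated/order_totals");
        spark.stop();
    }
}
```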

The delay imposed by almost any amount of time spent with cleansing and translation, argued Matei Zaharia, Spark's co-creator and the co-founder and CTO of Databricks, works against the intent of a system that purports to be "real-time."

There will always be a need to cleanse data at some stage of the processing operation, he conceded. In that sense, ETL is not, nor will it become, dead. But the cleansing task can now, he proposed, be engineered into the analysis task, in a type of parallel, "just-in-time" schedule -- probably with the aid of DAG, but perhaps using more of a CI/CD-type pipeline. Like any other process, it would consume time; but running in parallel, it would introduce no delays. He advised using the Databricks Delta platform as a mechanism for tracking data through the stages of transformation, and potentially holding onto older versions as backstops in case transformed data fails certain tests in the pipeline.
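One hedged sketch of that idea, using Spark Structured Streaming's Java API with hypothetical topic, broker, and path names: the cleansing filter becomes just another node in the same DAG, scheduled alongside the analysis rather than ahead of it.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class JustInTimeCleanse {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("jit-cleanse")
            .getOrCreate();

        // Read the raw stream of records from Kafka...
        Dataset<Row> raw = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "orders")
            .load();

        // ...and fold the cleansing step into the same query, so it is
        // scheduled in parallel with ingestion, not as a prior batch stage.
        Dataset<Row> cleansed = raw
            .selectExpr("CAST(value AS STRING) AS line")
            .filter("line IS NOT NULL AND length(line) > 0");

        StreamingQuery query = cleansed.writeStream()
            .format("parquet")
            .option("path", "/data/curated/orders")
            .option("checkpointLocation", "/data/checkpoints/orders")
            .start();

        query.awaitTermination();
    }
}
```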

The DAG gives Spark, and its allies in the SMACK stack, an opening to mount a direct assault on the center of all relational data operations: the schema. Indeed, in just the past few months, researchers at IIT Bombay have launched an effort to leverage DAGs to automate the schema translation process [PDF]. Specifically, their automated code would analyze existing models, translate them, and then analyze the translations to determine whether they would run faster or more efficiently than the existing code they seek to replace.

"Everyone, regardless of what technology they're using, has to do some data transformation and extraction," explained Zaharia. "In that sense, everyone is doing ETL. You can use Spark and Kafka as just another tool to do the ETL process. But they're not using the ETL technology that existed before."

No time flat

"A lot of these data-driven applications are now real-time," remarked Tobias Knaup, CTO of commercial Mesos platform provider Mesosphere. "They're no longer batch, where I can collect data over a week and then run some report, someone looks at that and makes a decision. This data actually flows back into the application in real time."Back when an enterprise chose the data system best suited to the tasks it envisioned for itself, it chose the entire platform -- meaning, it also ended up selecting the database format, and the means with which the schema explaining the business logic was formulated and delivered. Anyone building a domain model would have to revise the ORM process to fit this schema. For that reason, among several others, it was a bad idea to change platforms. The advent of the data warehouse gave organizations the freedom, at least, to integrate everything together on its own schedule. But it left them in a place where those decisions ended up being fixed in stone once they were made.


So the challenge organizations face today is attaining the ability to build new applications that run at the speed modern distributed systems require, while at the same time integrating with the formats that already exist for the applications already in place -- even if those old apps are being phased out over time. Databricks' Zaharia points the way for organizations to meet this challenge: by devising pipelines for staged processes that define both business and domain models, in a parallel scheme that consumes next to no time at all.

"The difficulty, and the land mines that you tread on here. . . is the perception that people have of the environment versus the technology itself," stated Guy Churchward, the CEO of real-time data platform provider DataTorrent, speaking with ZDNet Scale. "Generally, their pre-conception of where they currently sit with the state of their architecture, is probably the hardest thing. Most people don't realize they're driving around in a car that's going to collapse. So they're fat, dumb, and happy, and either they aren't doing anything and they're competitively getting killed, or they're doing something, but the data they have actually has no integrity around it, and they're in essence looking at false data."

Churchward tells us a story that, he said, is a recurring one, where a client or prospective client insists it has already deployed real-time analytics. So he quizzes the client about its latency loop, and it proudly responds: 60 milliseconds. Why that number? The client shows him an architecture where Kafka cleanses the data, then feeds it through a batch processing engine, then parks it in a data lake (a virtualized repository for all data in various states) where it awaits a query from an analytics service.

"Okay, so what you're really doing is real-time analytics on a data lake with stale data," he would say. To which the client would answer, "'No, no, no, we've got real-time data coming in.' Yea, but you're not actually analyzing the data in real-time; you're actually making an analytics call in real-time."

Churchward and Zaharia would agree upon the basic principle that real-time is only real when its processes are engineered to work in parallel with extractions, transformations, and queries. A DAG would help a modern data processing engine to perceive these stages, but not as a discrete and inviolable sequence. Where these two gentlemen part ways is with respect to the mechanism itself. Churchward's DataTorrent is the commercial steward of an open source component called Apache Apex. Introduced in 2015 as a YARN-native processing engine for the Hadoop stack, Apex seeks to zip together batch processing and stream processing in a single engine.

DataTorrent's value proposition is to provide a means for enterprises that have already adopted Hadoop and gotten stuck there to effectively drop in the DataTorrent platform, led by Apex, and move to a whole new nervous system in 60 days' time. Churchward suggests an alternative stack: KASH -- Kafka, Apex, Spark, Hadoop -- to produce a new foundation for real-time applications. From there, he said, it becomes easier for clients to implement machine learning libraries, essentially because the old ETL mechanism has not merely been reimplemented but replaced altogether.

"When we know how the data has to be stored, for all the fixed schemas -- when we know exactly what questions to ask the data," remarked Anjul Bhambhri, Adobe's vice president of platform engineering, "the traditional data warehouses were designed with ETL playing a very fundamental role. Knowing the source, and then knowing exactly what format the target needs to be populated, the rules-based approach of ETL worked."

But as new applications make use of data in different ways, the rules should be flexible, or at least conditional, Bhambhri argues. A rigid, automated process for preparing all data for the limited use cases an organization has already predefined for itself is no longer applicable to a world where the applications themselves could conceivably learn.

"It's not like it's completely gone away -- we still need to do that for some kinds of cleansing," Bhambhri continued. "But when you look at aspects of data science, AI, ML [machine learning], there is a lot of feature engineering that has to be done on this 'massaged' data as well. And this is not heavy-duty ETL, but it needs to be done. ETL is good for when, in a batch mode, data is processed, you know the schema, you know the kinds of reports that have to be generated. But for this world where there is a lot of time-series behavioral data, trying to use an ETL approach can not only be cumbersome, but it is very time consuming."

Remarked Churchward, "If I found out in my stock portfolio that a company was basing its analytics exclusively on a classic, old-style ETL data lake architecture, I would short the stock."


Sunset assault

It is dusk over Datumoj, D-Day-plus-300. There is a storm arriving from the north, and it's not the clouds that are bringing it.

The Spark coalition has established a perimeter around the Ledger Domain's fortresses in the Schematic mountain range. It now controls all traffic in and out of their supply routes. To the group's surprise, though, that traffic has become sparse. What they don't yet realize is that Apex, a mountain infantry division, has been sneaked in by stealth landing forces along the southwest coast. Establishing a truce with the remainder of the Hadoop Task Force, it captured offensive positions along the old western supply route, bypassing the main mountain roads.


But even as they prepare to launch a strategic counter-offensive against Spark, lookouts on Eliro Island spot the distinctive steam plumes of an invading naval force to the north. It's carrying the legendary Kubernetes Marine Expeditionary Corps, accompanied by the Docker Containerized Brigade. It's believed they share the means to deploy a completely new staging and production system for the island, in self-provisioning capsules that can be dropped in place and made fully operational within days. "Microservices," these capsules are called, though no one's ever seen this enemy up close.


Upstage

"I don't think that we're talking about technical differentiation here. I think what we're talking about is market domination," said Ted Dunning, chief application architect for MapR. "And what I see is, Kubernetes is the runaway favorite on GitHub -- the most starred project. And I see that with our corporate customers -- 90-plus percent are adopting Kubernetes for their production instances."

MapR's cloud service partners, Dunning told us, have already deployed Kubernetes to manage over 90 percent of their new server clusters. Some of the nodes in those clusters are still managed by Mesos, but he said that number is declining to around half.

"The trends are dramatic," he declared. "As we all know, trends like this do not always respect technical qualities. Not that I think there's technical deficiencies in Kubernetes -- that's part of the problem: It's really good."

"The only reason why we're talking about Kubernetes," said Joshua Bernstein, vice president of technology for Dell EMC's Emerging Technologies division, "is because Google has done a phenomenal job marketing it. From a purely technical perspective, Kubernetes also has done a very good job with its data model and data abstraction model. On one hand, that gives it flexibility. But it's also very complex, the code base is very young, and regardless of what you think, it's controlled by a single entity, which is Google."

The Cloud Native Computing Foundation -- whose members include not only Google but also Microsoft, Oracle, and now Bernstein's own parent company, Dell Technologies -- might take issue with that last comment. However, his underlying point, that Kubernetes may yet be an immature technology, does bear some scrutiny.

"What's going to be interesting here," Bernstein said, "is that Kubernetes will struggle to run different workloads simultaneously on the same resources. It will be hard to run Spark and Cassandra in the same environment. In fact, that capability in Kubernetes is still something that's being actively developed right now -- we're just beginning to see the inklings of this kind of capability. Mesos is bolder, more mature, and has two-level scheduling, which I think is incredibly powerful. But because it doesn't have the ecosystem, fanfare, and hype around it, it doesn't really get its fair share of respect, to be honest with you. So what you're really trading off here is, people are gravitating towards Kubernetes for hype."

DataTorrent's Guy Churchward tells us his company is preparing for a full-on Kubernetes invasion. Its response would conceivably enable HDFS to be integrated with the Kubernetes orchestrator, allowing Apex to cooperate with respect to scheduling. But then it could bring the KASH stack back in, provisioning Kafka, Spark, and Apex once again for processing real-time streams.


"Look, components don't matter," remarked Churchward, in the midst of a conversation where components certainly did appear to matter. "It doesn't matter whether Spark wins or Apex wins or Kubernetes or Mesos wins, or YARN is there or not, or Hadoop is good or bad. The reality is, you've got to look at it and say, 'I want to land on the island, I want to liberate my data, I want to get a result off of it, I know what it is, and I need it done within two quarters of me thinking about it.' And I also need the flexibility of saying, 'I got it wrong,' and then readjusting my line of sight."

Reconnoiter


As twilight descends over Datumoj, the theater of operations is being set for an epic confrontation. Each of the contestants in this battle has the goal of making the island into a self-service refueling stop, much as Churchward described it. What faces us, and what faces the enterprise, is the likelihood of a showdown between the parallel task model and the distributed microservices model for facilitating real-time streams and traditional batches simultaneously. CTOs and CIOs may be hoping for more time for one platform to emerge victorious. For some of these executives, that decision-making time may have already expired.

Meanwhile, as the silhouette of our metaphorical island fades into starlight, someone else's reality takes its place. In that reality, a new wave of digital transformation is taking place, bringing with it a concept called the "integrated data warehouse." In a world where anything can be successfully marketed, everything old is indeed new again, and it suddenly appears Datumoj Island was just a blip on the radar.

We'll measure the impact of that blip next time. Until then, hold true.


The "Battle of Datumoj" was inspired by World War II's Battle of Morotai. There, an island which seemed easy enough to liberate just months after D-Day in France, ended up being an active battlefield until V-J Day, and even afterward. The real story of Morotai, its strategic importance, the real regiments that fought there, and the troop movement maps that inspired this series, are available from the World War II Database.
