"As time progresses, a database is bound to change in terms of the types of information stored in it. In fact, there is likely to be a significant expansion in the kinds of information stored in any database. Such expansion should not require changes to be made in application programs. . . If each of these requirements is met, the approach can be claimed with some credibility to be adaptable to change. When applied to database management, the object-oriented approaches take a very restrictive and non-adaptable approach to the interpretation and treatment of data." -- Dr. E. F. Codd, The Relational Model for Database Management, Version 2, 1990
You hold in your hand a device that some folks call a slide rule, but whose label identifies it as a "minimum latency calculator." It's two round cardboard discs sandwiched together with a transparent disc on top and a peg in the middle. If you're a UNIVAC I programmer in the early 1950s, this calculator may be the most important tool you own, so you never leave it behind on the break room table.
You have used this tool to accumulate the number of clock cycles consumed by sequences of complex instruction words. In an era before the advent of the "memory drum" -- the forerunner of the ceramic hard disk drive -- data could not be fetched in streams or sequences from their registers. You fetched entries one at a time, in the smallest batches possible. This tool told you how many cycles your chains of batch fetches would consume.
Surely, you're not a woman. As everyone knows (or at least every man), a woman's job is to punch the chads out of the cards, operate the sorter, and insert the cards into the feeder. And also to take the time to clean the break room, because those boys in the lab coats sure will leave messes.
For seven decades, the mechanism of processing data has involved fetching batches from memory or storage, preparing them for examination, placing them in the right slots, observing the results of the examination, executing conditional operations, and preparing memory or storage to handle the results of those operations. In the beginning, it was literally manual labor -- unashamedly called "woman's work." When some portion of it finally became automated, the calculator in your hand told you how efficient your results would be.
Since the 1970s, the automated mechanism at the center of data warehouses has been called ETL -- Extract, Transform, Load. Every revolution in the history of database technology since the advent of ETL has been essentially about one of two things: Empowering it or eliminating it. The timing mechanism that sets the pace of data warehouses, from the time of the last moon landing to today, has been ETL. Even today, some technologies that are advertised as real-time streaming are actually really high-powered ETL engines operating with really small batches, which some engineers argue isn't really streaming, or even really real.
Our next adventure in ZDNet Scale frames this series of revolutions in an unusual, historical context. Yet we won't spend too much time in the distant past. You'll soon see Hadoop, the first harbinger of the data revolution. You'll see the latest sieges upon the architecture of data warehouses. And you'll be introduced to a team of upstart allies, with comic-book callsigns like Spark, Flink, and Akka, that have seized control of the latest revolution and, in so doing, permanently altered the entire landscape.
Your journey begins with that calculator. If you're the person who knows its true value -- beyond the fact that your colleagues would like to swipe it from you -- then you're wearing the uniform of a US Navy lieutenant. So, you understand the historical context you're about to see better than most people alive.
The Korean conflict has begun. America is in a race for technological supremacy against a looming, and largely fictional, image of an ideological competitor. Private institutions such as your employer, Remington Rand, are sharing their discoveries with one another, but privately and very, very carefully. And the men in lab coats talking behind your back have yet to call you "Amazing Grace."
Laziness is the driving force that Rear Admiral Grace Brewster Murray Hopper gave for producing, in 1951, a kind of syntax adjustment and math correction program that became, in retrospect, the first single-pass compiler. She wrote it at a time when programmers merely engineered the code, but other people were hired to plug that code into machines like the UNIVAC I. They'd misinterpret individual characters, Adm. Hopper told an interviewer in 1980 -- for example, mistaking the stand-in character for "space," the Greek delta, for a 4. And the addition of a 4 would mess up the math.
"There sat that beautiful big machine whose sole job was to copy things and do addition," said Hopper. "Why not make the computer do it? That's why I sat down and wrote the first compiler. It was very stupid. What I did was watch myself put together a program and make the computer do what I did."
What Adm. Hopper did was apply a methodology to the translation of instructions and data, bringing Navy sensibility for the first time to the new and, as yet, untamed practice of programming.
Just four years after producing the A-0 compiler, Hopper and her team at Remington Rand would build another compiler that utilized common English words for the first time. With her gift for naming things immediately and moving on to more important things, Hopper called this new compiler B-0. Her civilian employers renamed it Flow-Matic, but down the road, folks thought that sounded like a sewer rooting chemical. So, a later rendition was dubbed AIMACO, and then an even later version called Common Business Oriented Language (COBOL).
As former IBM scientist Jean E. Sammet explained in her 1969 magnum opus, Programming Languages: History and Fundamentals, Flow-Matic introduced the idea that computers were, in fact, data processing machines that ingested data in batches. With each new rendition of the compiler, the means by which those batches were assembled became more and more explicit. Eventually, COBOL was a language that explained how a data processor should iterate through records, as they were gathered together into groups or divisions and then examined single-file.
Sammet described the fundamental dichotomy that COBOL was facing, at the time it was facing it. This series is about that dichotomy, and its persistence even to the present day.
A COBOL program had four divisions, and the records to be processed were all categorized under "DATA DIVISION." "Every attempt has been made," Sammet wrote in 1969, "to provide external descriptions of data, i.e., in terms of letters and numerals and types of usage rather than internal representation and format. This can be done to a very large extent, providing the user is less concerned about efficiency than about compatibility. In other words, if he wishes to establish a standard data description for a file that can be used on many machines, he may have to sacrifice certain specific features that would increase the efficiency on a particular computer."
There was no single, standard COBOL at this time; it was designed to enable descriptions of how to process data for the specific machine being operated. There was no random-access memory, no assembly language mnemonics, and no internal caches. So, the form and format of data were defined by the machine running it. To translate a COBOL program from one machine to another meant rewriting its mechanism for handling data.
Lieutenant Hopper recognized this fact, but didn't see it as a problem. If each procedure was presented to the programmer in an intelligible manner, she believed, then writing a new translation routine would be so rudimentary that even men could be trusted with it.
Hopper's use of divisions to coordinate the rank and file of data records was probably a double entendre. Her military background taught her order and efficiency, and her principal customer for UNIVAC I at the time of its design was NATO.
As the Admiral herself put it, what differentiated her from a mathematician was that she was good at math. She envisioned complex business logic problems as sequences of repeatable steps of simple, explainable actions. In that sense, the Admiral actually invented business automation -- the idea that code could represent a business transaction which could, with military precision, be translated into mathematical transactions.
Grace Hopper -- no doubt unwittingly -- set in motion a strange and serpentine chain of events that continues to this day: a struggle between the people who craft computer code, the people who manage computer data, and the people who design business processes.
The best way to envision this situation more succinctly, in the context of history, is perhaps to project it as the Admiral herself would have done: As a military campaign. It may be bloodless, but in the arena of business, there have been, and may yet continue to be, casualties.
The campaign to capture and hold the modern data center is a struggle for the key strategic leverage point in the campaign to liberate industries, governments, and their people. It has been global warfare. And for our purposes, it has all taken place on a small island where, at this time in history, no one would ever spend a vacation.
The long campaign
It is dawn over Datumoj. From about five thousand feet in altitude, a reconnaissance sortie reveals the silhouette of the strategic hub of data operations: An island about 25 miles wide, jutting out of the metaphorical ocean like a submerged head of broccoli. Its peaks don't look like the kind of challenges that would strike a mountain climber's fancy. But anyone running a supply route over these tree-capped desert peaks will attest to their being treacherous enough.
Datumoj is the staging ground for our representation of the enterprise data center, and the components that would seek to enable it and to control it. It stands for the realization, albeit in this unreal way, that since the dawn of the data center era, the components we've needed to process our data have always been gathered together in one place, as though they shared an island. But their proximity to one another has never been enough to guarantee their compatibility. Certainly the parties involved would deny any sort of conspiracy, especially a joint one. Yet history yields essentially the same outcome.
The original liberators of this island may be considered a kind of allied expeditionary force. For them, Datumoj was a relative cakewalk, at least on "D-Day."
Hopper's Flow-Matic Division struck the critical landing strip on the southernmost peninsula, establishing the first beachhead. Subsequently, forces led by COBOL Regiment sailed north through the strait, seizing five beach points with minimal resistance. Within four days, the allies had seized command of the western coast. Enemy forces, holding true to their ancient traditions of paper ledgers and undecipherable business models, withdrew to the high ground.
From the northernmost point, the allies began composing the business processes that would determine the outcome of the war. But the ETL station was at the southernmost point, meaning that the success of every operation necessitated a treacherous journey along a supply route fortified by artillery points established along the western coast of the island. This way, operators could bypass the high ground where the Ledger Domain -- the keepers of the business rules -- had retreated.
The following is fact, not metaphor: When the concept of the data warehouse was finally, firmly established, several of its practitioners adopted a kind of "air traffic control" metaphor for its nucleus of operation. They conceived what is still called a landing zone for incoming data, followed in some metaphors by a taxi-way for data frames awaiting processing, a gateway for parking, and a runway for takeoff and final delivery.
The automation of the data warehouse has always involved devising a kind of "fire-and-forget" process for assimilating data and making it usable -- of setting the air traffic control station free to do whatever it does. The usability of that data has always been managed by a second, perhaps not so automated, process outside of the warehouse: Essentially crafting the procedure or the query that accesses the data from the warehouse (from ETL) and applies it to some real-world entity such as a report or an analysis.
In our metaphorical data expedition, these two processes occupy opposite points of Datumoj island. Here, we have compressed seven decades of history into 10 months' time (the approximate length of the real-world battle that inspired it).
It is now D-Day plus 96 (D + 96). Engineering companies detached from COBOL Regiment to the north and Flow-Matic Division to the south are still toughing it out over difficult terrain to build a supply route between the two points -- and they're way behind schedule. What both points lack is a single command methodology, which could unify them behind a common purpose.
Help arrives from the southeast, in the form of an allied invading force. Codd's SQL Mechanized Division takes the eastern coast with minimal resistance. Once set up, its objective will be to establish a command post for central operations. Preferably, this command post would be on higher ground, where it can promote the establishment of a single schema for the rank and file of all data. That might suffice if the island were otherwise inhabited. But the Ledger Domain is holed up amid terrain it's prepared to defend to the last soldier.
Up to now, the success of the allies' database operations has depended upon being able to consistently skirt past the Ledger Domain, avoiding conflict wherever possible. But their own logistical tangles, as you'll see down the road, will make conflict inevitable, and will open up opportunities for invading waves of improved technologies.
Our data expedition continues in Part II, with the invasion of Hadoop, followed almost immediately by the emergence of a rogue, rag-tag team of actors led by Apache Spark. You'll see the story of a real-world American institution that places a strategic bet on Spark's success, but not the kind of bet that anyone expected.
In Part III, the emerging Spark coalition establishes a beachhead, and sets up the final siege of ETL. We'll show you a real-world company whose very existence may be in peril for the need to modernize its supply chain and either revolutionize, or completely eliminate, its ETL. Then, in Part IV, we'll introduce you to perhaps the largest active data process in the world, with nearly 1.2 billion users -- a process that doesn't get much coverage in the western hemisphere, but which may already be suffering from obsolescence mere months after its launch. We'll introduce you to a class of data architecture that may yet save it, and in so doing, become the new model for the world's data operations. Whether it annihilates or embraces ETL depends upon whom you ask and when.
Finally, in Part V, one of the most powerful technologies ever to remake the data center produces a headwind that could render the last three decades of database history irrelevant. Will Datumoj defend itself or surrender? Until our next rendezvous, hold tight.