The Data Expeditions, Epilogue: Reassessing the battle for the core of enterprise data
The Data Expeditions, Epilogue: Where we review the epic struggle to capture and remake data warehouses, in the context of a world that doesn't appear to be aware such a struggle exists, and may never have seen it as all that epic.
Last summer at the MesosCon conference, I met an attendee responsible for procuring database software for his organization. He summed up his situation for me this way: All this innovation is really interesting, and it probably sells magazines -- or rather, he corrected himself, makes great headlines for those websites you write for. But folks in his position are all waiting for these innovator guys to do whatever they have to do, to come together and produce a platform. Settle it. Duke it out. Arm-wrestle, or whatever. At some point, basically, the innovation has to stop and business has to start.
This is one of the great dilemmas of living in a capitalist society: Innovation truly never stops. In any technology market, the means of distinguishing oneself, and the basis of one's value proposition, is always improvement. That works against any system that tries to adhere to a dogma of automated production and pre-defined methods.
The history of database technology, unlike much of consumer technology, has not been defined by market dynamics. It has been about capturing and holding territory, and regulating the pace of innovation so that these strongholds may be maintained. A great many of the processes we use today, and will use tomorrow, to manage the data we consume were first conceived and implemented under the watchful eye of Lieutenant Grace Hopper. And she did this, by her own explanation, for the silliest of reasons: to automate the process of correcting everyone else's mistakes. The course of evolution for all of data technology began with her, but immediately became an effort to capture and hold territory.
In our last series for ZDNet Scale, The Data Expeditions, I zeroed in on just the innovative part of data science and database development. That's why I didn't spend much time with data marts, data lakes, or cloud-based data services; those are battlegrounds for the future. Datumoj Island is the staging ground for all the main actors in the architecture of data processing. That architecture is now being innovated so rapidly that efforts to maintain the safe, measured, incrementally slow pace of technology change to which we are unknowingly accustomed crumble like fixed fortifications.
The reason I started with an image of Grace Hopper holding a slide rule is this: Literally all of database technology (and here I'm using the word "literally" literally) began with an effort to measure the time it took to move data from place to place, and to devise processes to reduce that time. And every great innovation since Adm. Hopper and FLOW-MATIC has begun with exactly the same effort: measuring latency and improving performance.
The leaders of the emerging data technology space in the 1950s realized that the key to retaining leadership in that space was not technology but rather people. Back then, they didn't want to admit that fact in Grace Hopper's presence. And you'd think IBM would have learned this lesson already about Anjul Bhambhri, before Adobe hired her away in 2016.
What the emergence of Spark and its "coalition" -- the SMACK Stack -- reveals is that leadership in an open source environment is also more about people than code or machines or methods. This stack works as well as it does because of cooperation. You might think that any series attempting to make such a point about harmonious collaboration would avoid using a metaphor based on World War II. But all great efforts are struggles to overcome obstacles and achieve objectives, and history typically mandates that such efforts take place contentiously rather than consecutively.
Part III: We interrupt this revolution
In open-source development environments, the efforts to replace one technology or method with another are often conceived and implemented by the very same people who created the things being rendered obsolete. That makes it easier to characterize development conferences as almost Olympic village-like experiences.
But for the enterprises that depend on the consequences of these developers' actions, the war toys they play with are incurring real-world casualties. If you're Toys 'R' Us, or another retailer with storefronts and floor space, you're struggling to rebuild and repurpose your entire supply chain. When the tools of that repurposing are continually being obsoleted, the fact that someone else may be capable of starting a fresh, new supply chain with their successors forces you to find a way to do the same.
Part IV: Network effect
If the data system you're constructing is seriously intended to become the foundation (which is what "Aadhaar" means) for a society of over a billion people, then there emerges an inherent vulnerability with any platform intentionally geared for change: Elements of that society will conspire to change it for you -- perhaps openly, sometimes covertly. And unless a majority of that society truly loves and admires you, they will likely succeed once or twice.
Aadhaar is being accused of obsolescence even before it has attained full functionality. Such an accusation doesn't have to have merit, just as an accusation against a politician today need not be true to carry weight (for more, see "boat, swift"). There is a current of change that Aadhaar must resist if it is to survive. But the level of resilience required to pull this off may demand the type of architectural innovation that its original design did not account for.
Part V: The devil we know
The true definition of "real time" in computing has always been variable, and thus to a certain extent, unreal. If we're being exact, it should mean that there is no perceivable gap between the moment of a completed operation and the state of the data to which that operation refers. That's not always reasonable -- even in live television, there's at least a few seconds' gap between a real-world production and a "live feed" (in cable news, possibly even a few years). But it is fair to say that for global logistical operations, whatever lag there may be must not be detrimental to the final result.
When we say that engineers today "demand real time," what we mean is that they are working to build a processing system where elements or packets of data are smoothly processed with a minimum of latency. When we say that customers "demand real time," what they're looking for is a persistent state of immediacy. They may settle for the appearance of it, but only because they have not yet been introduced to a better alternative. DataTorrent spotlights the chief deficiency of most big data engines: They process small "micro-batches" somewhat fast, but often in a non-deterministic way. If the ticking of the clock sounds like John Cleese's footsteps during one of his "Silly Walks," it's not real time.
Why does this matter? Because the architectural decision to break down workloads into smaller batches, and then process those batches in parallel, only remains viable for as long as the volume of those workloads stays within the system's capacity to keep pace. Once volume grows past that point, batches queue up behind one another and latency compounds.
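The non-determinism is easy to see in a toy model. The sketch below is not any real engine's code; it simply assumes an engine that flushes a batch at a fixed interval, so an event's latency depends on where in the batch window it happens to arrive, and two events arriving almost together can see wildly different delays.

```python
# A minimal sketch (hypothetical, not DataTorrent's engine) of why
# micro-batching yields non-deterministic per-event latency.
# The engine only flushes a batch every `interval` seconds, so an
# event's latency is the time remaining until its window closes.

def micro_batch_latencies(arrival_times, interval):
    """Return each event's latency: time until the end of the
    batch window that the event falls into."""
    latencies = []
    for t in arrival_times:
        # The batch containing t is flushed at the next interval boundary.
        batch_flush = (int(t // interval) + 1) * interval
        latencies.append(batch_flush - t)
    return latencies

# Two events arriving within the same 1-second window:
# one near the window's start waits almost the full second,
# one near its end waits almost nothing.
lats = micro_batch_latencies([0.05, 0.95], interval=1.0)
```

Here the first event waits roughly 0.95 seconds and the second roughly 0.05 seconds, despite arriving in the same window. That spread is the irregular "Silly Walks" tick the text describes, and it is what a true per-event streaming engine avoids.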
This series has not really been a history of database architecture. Rather, it has been an effort to land all the major architectures together on the same island, and demonstrate how co-existence has nearly always been the result of an uneasy and delicate truce.
The "great shaking out," where every platform and processor comes together under one flag, has never happened before, and it's a wonder that folks expect it to happen now. The open source community has hit upon one very hopeful architectural principle: allowing components to perform discrete functions however they will, but to remain addressable using a common API method, and to have their inputs and outputs follow a rule set forth by some independent manifest or interface. This "decoupling" enables the kinds of "stacks" we showed you. But they're not quite platforms, at least not insofar as vendors have defined them historically.
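The decoupling principle described above can be sketched in a few lines. The names here are illustrative, not drawn from any real stack: the only thing the pipeline knows is a common contract, so one component can be swapped for a "more adept" one without rewriting anything upstream.

```python
# A minimal sketch of decoupled, swappable components behind one
# common interface. Class and function names are hypothetical.

from typing import Iterable, Protocol


class Processor(Protocol):
    """The shared contract: any engine that can process records."""
    def process(self, records: Iterable[dict]) -> list: ...


class BatchProcessor:
    """One implementation: collect everything, transform in one pass."""
    def process(self, records):
        return [dict(r, processed=True) for r in records]


class StreamProcessor:
    """Another implementation: handle records one at a time."""
    def process(self, records):
        out = []
        for r in records:
            out.append(dict(r, processed=True))
        return out


def run_pipeline(engine: Processor, records):
    # The pipeline addresses the component only through the contract,
    # so either engine can be dropped in without changing this code.
    return engine.process(records)
```

Both engines are interchangeable from the pipeline's point of view, which is exactly what makes the "stack" assemblable; but as the text notes, that interchangeability holds only while the components' interactions follow the prescribed sequence.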
That's not such a bad thing, open source engineers maintain, because this gives you the freedom to swap out components when new and more adept ones come along, which seems reasonable, at least for systems in which the interaction of components is presumed to always follow a prescribed sequence. In the case of India's Aadhaar, however, the original prescription may not have been as capable of guessing the future as its architects expected.
Put another way, choice is nice, unless you're making new choices before you're finished with the existing ones.
All innovation, in any field of endeavor, is a struggle to replace old methods, to improve the way of work, and often to better people's lives. As peaceful as some of us would prefer to be known and remembered, each of our struggles has an active antagonist. More often than not, it's non-belligerent. It may not have a face or even a brand name. But it has an objective, a method, a mission, and backers with the will to succeed. And if we are incapable of recognizing these elements for what they are, then we may find ourselves defeated before our struggle even begins.
The "Battle of Datumoj" was inspired by World War II's Battle of Morotai. There, an island that seemed easy enough to liberate just months after D-Day in France, ended up being an active battlefield until V-J Day, and even afterward. The real story of Morotai, its strategic importance, the real regiments that fought there, and the troop movement maps that inspired this series, are available from the World War II Database.