The Data Expeditions, Part IV: The drive for synchronicity shakes the foundations of the data warehouse
The Data Expeditions, Part IV: Where a component described simply enough as a "messaging queue" strikes at the nerve center of organizations' long-standing three-tier applications, and calls into question whether certain of their principal functions still matter.
The Oxford English Dictionary's 2017 Hindi Word of the Year was "Aadhaar," which in Sanskrit means foundation. News of the linguists' pronouncement, covered in the Times of India, did not even have to mention why, other than to say it's a word that has attracted a great deal of attention. Its tens of thousands of readers already know why.
Today, most Indian citizens don't read or write Sanskrit. For them, Aadhaar is a massive biometric database, generating unique identity codes for more than a billion Indian citizens. These non-secret, 12-digit codes are verified against a colossal personal data store containing individuals' photographs, fingerprints from both hands, and iris scans of both eyes.
After over a decade of preparation and government promotion, the system came online in September 2010, with the objective of applying digital authentication to every business transaction in India. Today, Aadhaar wirelessly networks citizens' handheld devices with the country's designated national payment transaction coordinator. Not only can individuals use their devices in place of their wallets without those wallets becoming susceptible to theft, but Indian citizens in want of food and sustenance can use their digital authentication to receive grain and other disbursements through public distribution stations.
Key to the uniqueness of Aadhaar's architecture is the principle that it is not intended to permit general lookups. By design, its architects pronounced, no individual should be able to obtain a table of citizens' data based on general query criteria.
Aadhaar may well be the world's single most used database system, if it indeed serves the 1.1 billion users claimed by Infosys Technologies (current numbers are projected at 1.19 billion). In a 2017 interview with CNN, Infosys co-founder and chairman Nandan Nilekani told a bewildered correspondent, who appeared to be learning this for the first time, that the Aadhaar ID code had already been associated with some 300 million citizens' bank accounts, constituting what he described as the world's single largest cash transfer program.
"If it goes down, India goes down," said Ted Dunning, chief application architect with data platform provider MapR, which provides several of the operational components for Aadhaar.
"They necessarily have a 150-year planning horizon," Dunning told ZDNet Scale. "Obviously, further out than a few years, the details become a little fuzzy. But what is constant there is the mission, and the mission is to authenticate identity. Not identify people, but authenticate their claims of identity."
An estimated 70 percent of India's citizens have had to pay bribes simply to receive public services, according to a recent Germany-based NGO's survey. Simply by attempting to centralize identity, Aadhaar pervades every aspect of commerce and society, calling into question the extent of citizens' rights to privacy as guaranteed by the country's constitution. A society so conditioned to distrust its government will inevitably distrust such a centralized service of that government, whether it be administered personally or automatically.
The problem with a presumptive level of distrust is that it effectively camouflages the specific types of conduct that earn such distrust. Last January 4, India's Tribune News Service reported the discovery of websites run by automated agents, selling Aadhaar ID codes evidently obtained through the databases. That data was probably collected through accounts attributable to the database's governing body, the Unique Identification Authority of India (UIDAI).
If the reports are true, this particular defect in India's system is clearly institutional, as so many Indian citizens suspected it inevitably would be. Yet immediately, second-hand reports announced that Aadhaar's database was "hacked" and its information leaked -- which would suggest that its basic architecture had failed, not necessarily the people in charge of it.
Speaking with us, MapR's Dunning maintained that the Aadhaar system, at least from a technological perspective, was stable. The reason, he maintained, is that its architects have embraced the realization that "there will be change. There is no way that a system like that can last on exactly the same hardware/software combinations for the next 100, 150 years.
"All the documents I've read, from the very beginning," he told us, "say, 'We know that change is inevitable, and we must adapt to it and deal with it.' And they have designed change capability into the system."
MapR's engine components were actually the first such change; they were not part of the original system. At a July 2012 big data conference in Bangalore with the curious title "The Fifth Element," Aadhaar chief architect Dr. Pramod Varma and colleague Regunath Balasubramanian revealed the original component buildout for the first version. The distribution mechanism was Staged Event-Driven Architecture (SEDA), which, at the time Aadhaar was first designed, must have been the most cutting-edge distributed processing system being discussed in academic circles. It was SEDA, Balasubramanian told the audience, that enabled threads to scale out dynamically.
But SEDA came into being in 2001.
SEDA was created by a three-person UC Berkeley team that included a fellow named Eric Brewer. It proposed several novel, turn-of-the-century concepts. One of them was the use of dynamic resource controllers (DRCs) as oversight mechanisms, distributing tasks to execution threads on demand and throttling down distribution when those threads were overloaded. The controller could detect these overload conditions through periodic reads of the event batches, which were delivered through a kind of message queue. SEDA could even deconstruct applications running on threads into discrete stages, so controllers could test their operational integrity in progress, in what could arguably have been a forerunner of CI/CD.
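A minimal, single-file Python sketch can make the DRC idea concrete: a stage owns an event queue and a worker pool, and a periodic controller "tick" reads the backlog and adds threads when it crosses a threshold. Everything here -- the class names, the thresholds, the `control_tick` method -- is our own illustration of the pattern, not code from SEDA (whose prototype was written in Java) or from Aadhaar.

```python
import queue
import threading

class Stage:
    """A SEDA-style stage: an event queue, a worker pool, and a controller.
    All names here are illustrative, not drawn from any real system."""
    def __init__(self, handler, min_threads=1, max_threads=4, high_water=8):
        self.handler = handler
        self.events = queue.Queue()
        self.max_threads = max_threads
        self.high_water = high_water     # backlog depth that triggers scale-up
        self.workers = []
        self._stop = threading.Event()
        for _ in range(min_threads):
            self._spawn()

    def _spawn(self):
        t = threading.Thread(target=self._work, daemon=True)
        t.start()
        self.workers.append(t)

    def _work(self):
        # Each worker repeatedly pulls one event and runs the stage's handler.
        while not self._stop.is_set():
            try:
                event = self.events.get(timeout=0.05)
            except queue.Empty:
                continue
            self.handler(event)
            self.events.task_done()

    def control_tick(self):
        """One observation by the dynamic resource controller: if the
        backlog crosses the high-water mark, add a worker thread."""
        if self.events.qsize() > self.high_water and len(self.workers) < self.max_threads:
            self._spawn()

    def submit(self, event):
        self.events.put(event)

    def drain(self):
        self.events.join()               # wait for all submitted events to finish
        self._stop.set()

# Demo: push 50 events through one stage, with one controller observation.
results = []
stage = Stage(results.append, min_threads=1, max_threads=3, high_water=4)
for i in range(50):
    stage.submit(i)
stage.control_tick()                     # controller may scale the pool up
stage.drain()
```

The point of the sketch is the division of labor: workers only process events, while the controller only observes queue depth and adjusts resources, which is the separation SEDA's authors proposed.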
The SEDA architecture, to sound like blues lyrics for a moment, did not go nowhere. Neither, for that matter, did Eric Brewer. He is now vice president of infrastructure at Google. And he clearly took the lessons he learned from SEDA with him, in his current role as one of the principal contributors to Kubernetes. The evolution of SEDA's "stages," across several generations, into Kubernetes' "pods" took place with Brewer's direct guidance, and obviously with very good reason.
Recent information about the architectural changes UIDAI may have implemented since 2012, including replacing MapReduce with a MapR component, was removed from UIDAI's website at the time of this writing. But a 2016 study by the Indian Institute of Science in Bangalore [PDF] reveals that the system was designed to guarantee a one-second maximum end-to-end latency for authentication transactions by first diverting enrollment transactions -- bringing new citizens into the system -- into a separate, slower pipeline. There, the expectation for completing batch processing may be comfortably extended from one second to 24 hours.
The study mentions one way that third parties may access certain categories of citizens' data. Called the Know Your Client (KYC) service, it's described as enabling an agency to retrieve a photo and certain details about a citizen, but only upon that person's informed consent. However, KYC would not reveal complete biometric data, such as fingerprint or iris scans. In the list of details the Tribune investigators reportedly obtained without authorization, fingerprints and iris scans were omitted.
It's not a trivial detail. The two pipelines of Aadhaar are mechanically different from one another. The enrollment pipeline is geared for a system that operates much more like a modernized data warehouse, with a staged batch processing mechanism. Each stage in this mechanism refines the data in a fashion that's so similar to ETL (extract / transform / load) that it may as well be called ETL. It utilizes RabbitMQ as a message queue that fires the events triggering successive stages in the process. That's not a revolutionary architecture, but it is a workable one.
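That staged, queue-driven flow is simple to model. Below is a minimal Python simulation of the pattern: the standard library's queue stands in for RabbitMQ, and the three hypothetical stage functions stand in for Aadhaar's actual enrollment steps, which are not public.

```python
import queue

def run_staged_pipeline(records, stages):
    """Chain ETL-like stages with queues, the way a staged enrollment
    pipeline chains them with a message broker: each stage drains its
    inbox, refines each record, and publishes the result as an event
    on the next stage's queue. A single-threaded simulation only."""
    inbox = queue.SimpleQueue()
    for r in records:
        inbox.put(r)
    for stage_fn in stages:
        outbox = queue.SimpleQueue()
        while not inbox.empty():
            outbox.put(stage_fn(inbox.get()))
        inbox = outbox
    out = []
    while not inbox.empty():
        out.append(inbox.get())
    return out

# Hypothetical stages, refining a raw record step by step.
extract   = lambda rec: {"raw": rec}
transform = lambda rec: {**rec, "normalized": rec["raw"].strip().lower()}
load      = lambda rec: rec["normalized"]

print(run_staged_pipeline(["  Asha ", "RAVI"], [extract, transform, load]))
# → ['asha', 'ravi']
```

Each queue boundary is where, in the real system, an event on the broker would trigger the next stage, which is what makes the batch flow restartable stage by stage.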
The authentication pipeline, on the other hand, dared to go where no database had gone before, at least at the time it was conceived. It introduced distributed data clusters with replicated HBase and MySQL data stores, and in-memory cache clusters. In-memory pipelines typically have the virtue of preparing data for processing just prior to the act itself, reducing the time spent in ETL.
If the vulnerability the Tribune investigators reportedly discovered exists in the authentication pipeline rather than enrollment, as the limitations of the retrieved data suggest, then it's the "new," faster side of the operation that's at fault here. Although the Indian Institute of Science study was a test of performance, not security, its practitioners gently suggested that the performance of newer distributed stream processing mechanisms, such as Apache Storm and Spark Streaming, demonstrated that the streaming mechanism utilized by Aadhaar's authentication pipeline was already outmoded.
Transplanting one mechanism for another, however, may not be as simple as MapR's Dunning perceived it -- not a heart for a heart, or a lung for a lung. Imagine instead a nervous system. It's something that intuition tells us must be engineered into the data system in its embryonic state. And the Institute researchers warned of the implications of making the attempt:
"The current SEDA model, which a distributed stream processing system could conceivably replace," the team wrote in 2016, "is one of many Big Data platforms that work together to sustain the operations within UIDAI. Switching from a batch to a streaming architecture can have far-reaching impact, both positive and negative (e.g., on robustness, throughput), that needs to be understood, and the consequent architectural changes to other parts of the software stack validated. Making any design change in a complex, operational system supporting a billion residents each day is not trivial."
Datumoj Island offers us a metaphorical rendition of the battle being played out, not only in the Indian government but in enterprises and public institutions all over the world.
In the compressed history of data warehousing, today is D-Day-plus-295. Just a week and a half earlier, the Hadoop Task Force had established an uneasy, though workable, truce with the original liberating allies. It would enable both forces to co-exist on the same island, so long as the old supply routes were restricted to carrying slower payloads. Faster payloads would take a separate route along the western coast, bypassing the Schematic mountain fortresses.
Spark rode in with the Hadoop Task Force, as a relief for MapReduce Brigade. But now it has brought in Mesos Cavalry Unit to wage an assault on the production facilities to the north. And it has turned the allegiance of Cassandra, which has joined Spark in a raid on the ETL facilities to the south. Spark's surrender offer to the entrenched allied forces is this: Both Hadoop and the old Flo-Matic processes may maintain their existing production facilities, while at the same time making way for new ones. The southern ETL facilities must submit to the oversight of Kafka, an engineering battalion that has established a powerful transmitter station on Eliro Island just to the west. And the SQL command post to the east must allow itself to come under Spark control, directing its instructions to the Mesos staging unit instead of the Ledger Domain, holed up in their Schematic mountain fortresses.
It would be co-existence, but not on the fortress keepers' terms. Even if the allies accede to the new occupiers' demands, the Ledger Domain probably won't. A lasting peace depends on the capability of ETL to service every occupier on the island, each according to its own terms. And that depends, for now, upon Kafka.
"This is not a one-shot process. It oftentimes is very incremental, and it is rooted in a specific problem that companies are facing," explained Neha Narkhede, the chief technology officer of Confluent and the co-creator of the Kafka data center messaging component. She's referring to the transition process within enterprises, from the ETL procedures that prepared data for batch processing and SQL queries, to something that may or may not be called "ETL" depending upon whom you ask.
What may very well get uprooted during this transition process is what used to be considered the nervous system or the supply route of all software in the organization: The enterprise service bus (ESB).
"The new world that digital companies are working towards doesn't include just relational databases and that relational warehouse," Narkhede continued, in an interview with ZDNet Scale, "which is a lot of what ETL tools in the old world were designed for. Today, there are a lot of diverse systems that all need access to different types of data, and that goes way beyond relational databases and the warehouse. So this is a diversity of systems problem that organizations are trying to deal with: How to get data to efficiently flow between all these different kinds of systems without diverging."
"Because a lot of the data that companies are dealing with now is real-time data and streaming data," remarked Mesosphere CTO Tobias Knaup, "Kafka becomes the data nervous system of a company. Because data is always in motion, by using a message queue like Kafka instead of more static databases or file systems, you can often get rid of a lot of the ETL steps. If you introduce delays, you lose your real-time-ness; if you only run your ETL jobs once a day, then your data is up to a day old. But with something like Kafka and its different topics, you can build a data nervous system where data is always up-to-date. Transforming it into different formats, enriching it by various processes, becomes very, very easy."
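Knaup's "nervous system" framing is easier to see in miniature. The toy class below imitates two Kafka essentials -- named topics and independent consumer-group offsets -- in plain Python, so several downstream processes can each read the same stream once, at their own pace. It is a sketch of the pattern only; Kafka's real API, partitioning, and persistence are far richer.

```python
class MiniLog:
    """A toy, in-memory stand-in for a Kafka-style log: named topics,
    append-only records, and per-group read offsets. Illustrative only."""
    def __init__(self):
        self.topics = {}     # topic name -> list of records (the log)
        self.offsets = {}    # (topic, group) -> next index to read

    def produce(self, topic, record):
        self.topics.setdefault(topic, []).append(record)

    def consume(self, topic, group):
        """Return all records this group hasn't seen yet; advance its offset."""
        log = self.topics.get(topic, [])
        start = self.offsets.get((topic, group), 0)
        self.offsets[(topic, group)] = len(log)
        return log[start:]

log = MiniLog()
log.produce("payments", {"amount": 120})
log.produce("payments", {"amount": 45})

# Two independent consumers of the same stream: one enriches, one audits.
enriched = [{**r, "rupees": r["amount"]} for r in log.consume("payments", "enricher")]
count = len(log.consume("payments", "auditor"))
```

Because each group keeps its own offset, adding a new transformation or enrichment process means adding a reader, not rebuilding a pipeline -- which is the property Knaup is pointing at.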
For most of this decade, data center analysts have been promoting the idea that old data warehouses and new streaming data clusters could co-exist. Some co-opted Gartner analysts' notion of "bimodal IT" as meaning the co-existence of a "slow mode" and a "fast mode" for data processing, usually made feasible through some magic form of integration.
Such a co-existence would mean some data is subject to being modeled and transformed (the "T" in "ETL") and other data is exempt. That's the argument that IBM, Teradata, and integration platform maker Informatica made in 2013. At that time, IBM characterized Hadoop as an "active archive" for all data, some of which would be selected by ETL. And Teradata referred to Hadoop as a "refinery" that could expedite existing transformations, and effectively re-stage the existing data warehouse on new ground.
That's actually not diverse enough, as Confluent's Narkhede perceives it. In a February 2017 session at QCon in San Francisco with the attention-grabbing title, "ETL is Dead; Long Live Streams," she made the case for ETL actually not being dead, but rather outmoded and ready for a nervous system transplant.
"If you think about how things worked roughly a decade ago," she told the audience, "data really resided in two popular locations: The operational databases and the warehouse. Most of your reporting ran on the warehouse about once a day, sometimes several times a day. So data didn't really need to move between those two locations, any faster than several times a day.
"This, in turn, influenced the architecture of the tool stack, or the technology," Narkhede continued, "to move data between places -- called ETL -- and also the process of integrating data between sources and destinations, which broadly came to be known as data integration."
Narkhede's point was a compelling one: Because so many enterprise data operations have been running on two tracks at once, their records, tables, and views (their warehoused data) were being prepared for an environment that assumed the presence of both tracks, and translated between them. Meanwhile, their operational data (the by-products of the ETL process) was being deposited... places. When the cloud came into being, it was immediately co-opted as a convenient place to bury operational data -- out of sight, out of mind. Yet that didn't make it go away. In fact, the insertion of the public cloud into the mix injected latencies that hadn't been there before. So when integrations from both tracks were tacked onto the operation, the fast track began slowing down too.
"[Kafka] oftentimes comes in as a net-new, parallel system that gets deployed with a couple of different apps," Narkhede told ZDNet Scale, "and transformation pipelines being transferred to it. Over a period of maybe two to three years, everything moves over from the old way to the new, streaming way. That's basically how companies adopt a streaming platform like Kafka for streaming ETL."
Are the pipelines defined by Kafka, or by a system in which Kafka participates? "A little bit of both," she responded. "Kafka happens to be the core platform that data goes through. There are a couple of different APIs around it, all of which get used in some shape or form for doing what you might call ETL."
One set of interfaces is called the Connect API, which she described as a means for reducing the data exchange process to the act of communicating with two types of connectors: The source and sink (often misspelled as "sync"), which respectively represent the input and output points for data retained by any system with which the connectors are compatible. Combined, these connectors hide the details of integration with different data storage models and systems.
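In rough Python pseudocode (the real Connect API is Java, and these class names are our own), the pattern Narkhede describes reduces data exchange to two small interfaces:

```python
class SourceConnector:
    """Illustrative shape of the source side: pull records out of some
    system while hiding its storage model. Hypothetical names; the real
    Kafka Connect API differs."""
    def __init__(self, rows):
        self.rows = list(rows)   # stands in for a table, file, or feed

    def poll(self):
        batch, self.rows = self.rows, []
        return batch

class SinkConnector:
    """The sink side: accept records and write them to a destination."""
    def __init__(self):
        self.store = []          # stands in for any downstream system

    def put(self, records):
        self.store.extend(records)

def copy_pipeline(source, sink):
    """The whole exchange reduces to: read from source, write to sink."""
    sink.put(source.poll())

src = SourceConnector([{"id": 1}, {"id": 2}])
dst = SinkConnector()
copy_pipeline(src, dst)
print(dst.store)   # → [{'id': 1}, {'id': 2}]
```

Nothing in `copy_pipeline` knows whether the source was a relational table or the sink a data lake, which is the decoupling the Connect API is built to provide.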
The north of the island, in our metaphorical model, becomes decoupled from the south. No longer must a production application tailor the way it handles the data it has already queried to any specific database, data warehouse, data lake, or other data model. More importantly, the design of the database system no longer binds the design of the application that uses it.
Simplifying ETL to a common set of inputs and a common set of outputs, all routed directly, at least theoretically eliminates the performance bottlenecks that originally forced Aadhaar's processes to be subdivided into slower and faster pipelines. On the other hand, it suggests that for Aadhaar to embrace this latest technology, and not be consigned to the ash heap of history along with the punch card sorter, it would require far more than a mere one-to-one component transplant.
Neha Narkhede estimates that a properly staged enterprise transition to a fully functional, Kafka-oriented process model could take as long as three years. No one has estimated how long it would take India's UIDAI to make a similar transition for Aadhaar. Perhaps more important, though, is this pressing question: If the present state of the open source data ecosystem were to stay pretty much the same, does Kafka even have three years? Or, for that matter, does Spark or Mesos?
These are appropriate questions, especially given what appears to be a fast approaching storm on the horizon -- one that has already swept aside the old order of virtualization platforms, and that may only now be making landfall in the realm of the data warehouse. That's where we'll pick up the concluding waypoint of our Data Expeditions series next time. Until then, hold strong.