When Doug Cutting created the Hadoop framework 10 years ago he never expected it to bring massive-scale computing to the corporate world.
"My expectations were more moderate than what we've seen, for sure," he said, speaking at the Strata and Hadoop World conference.
Today Hadoop is used by many household names, helping Facebook analyse traffic for its more than 1.6 billion monthly users and aiding Visa in uncovering billions of dollars-worth of fraud.
The attraction of Hadoop lies in how it allows big data to be processed more cheaply and, in certain respects, more simply. The platform provides a group of technologies that allow very large datasets to be spread across large clusters of commodity servers and processed in parallel.
Yet there are limitations to what the platform can do. Today, the speed at which Hadoop clusters can process very large datasets is capped by the rate at which data is shuttled between secondary storage -- SSDs or even slower spinning discs -- and a computer's memory and CPU.
This I/O bottleneck has arisen because processor speed and efficiency are increasing faster than storage read-write rates.
Putting a petabyte in memory
But now storage technology is poised to undergo a significant shift, one which Cutting said will help take the brakes off big data processing.
This year, Intel plans to release its 3D XPoint storage chips, which can retrieve data 1,000 times faster than the NAND flash typically used in SSDs, while also being able to store data at a density ten times greater than DRAM, which is the type of memory typically used today.
While XPoint will initially be offered as storage in the form of Optane-branded SSDs, Intel is planning to follow that up by releasing XPoint memory modules. Thanks to XPoint storing data at far higher densities than traditional DRAM, these modules will allow servers to have a much larger memory than is the norm today. Intel has talked about Intel Xeon servers being available next year with 6TB of memory, made up of a mixture of DDR4 DRAM and XPoint. That said XPoint won't match DDR4 DRAM for performance. The seven microsecond latency and 78,000 read/write IOPS of pre-release XPoint SSDs is slower than DRAM and no more than 20 times faster than high-performance SSDs by some estimates.
Regardless, Cutting predicts the use of XPoint and other non-volatile memory in Hadoop clusters will open up the platform to new uses, allowing users to process much larger datasets in memory, which in turn will bypass the latency inherent in fetching data from disk.
"If you could have a petabyte of data in memory, accessible from any node within cycles, that's several layers of magnitude performance improvement, if you're doing certain kinds of algorithms," said Cutting, who is now chief architect at Cloudera, which offers its own distribution of Hadoop.
"Things that are very expensive now, like graph operations, various sorts of iterative machine learning algorithms, clustering -- things that have traditionally taken a very long time -- can now be done very quickly and over pretty impressive amounts of data.
"There are still going to be datasets that are too big, and computations that are too slow but I think it will change things a lot," he said, adding that latency related to network traffic would also be lessened by remote direct management access and Gigabit Ethernet switching.
Intel invested an estimated $740m in Cloudera in 2014. As part of the partnership between the two companies, Intel keeps Cloudera informed about new features and hardware in the pipeline, helping ensure this technology can be leveraged by Cloudera's distribution of Hadoop.
"We want to make sure that the tools we provide can take advantage of this," Cutting said about XPoint.
"We've worked very hard to minimise the CPU usage for accessing data structures that are already in memory," he said, adding Cloudera had attempted to prevent unnecessary operations that would cause the CPU to bottleneck the processing of in-memory data.
Hadoop and the cloud
Cutting is also keen to make Hadoop available to a wider number of people by making it easier to spin up Hadoop clusters in the cloud.
It is already possible to build a Hadoop cluster on various cloud platforms. For instance, those running Cloudera's Distribution of Hadoop (CDH), can use Cloudera Director to spin up clusters of virtual servers on Amazon Web Services and Google Cloud Platform.
However, there are still limitations that Cutting says need to be addressed to make the process easier, with Cloudera planning to improve support for feeding data from AWS S3 and other cloud-based storage to Hadoop's data processing engines.
"There are some adaptations we need to make to Hadoop to make it able to work better in the cloud. We need to treat the storage, things like Amazon's S3, as a first-class citizen, along with HDFS [Hadoop Distributed File System] for input and output, so that people can spin up clusters dynamically," he said.
And with clusters more likely to be spun up and down in the cloud, Cutting says Cloudera also want to improve start-up times.
Another problem that Cutting would like to tackle, is making it easier to transfer a Hadoop cluster from one cloud platform to another, with Cutting dismayed at the current state of cloud lock-in.
"We think we could provide some real value in giving people portability between cloud providers. Right now, if you start developing your applications in the cloud you're very rapidly locked into one cloud vendor."
With its distribution of Hadoop, Cutting said Cloudera is in the process of building "a layer that lets people decide whether workloads go on-premise, or into Amazon, Google, Microsoft or other cloud provider."
This functionality is available to some extent via Cloudera Director today, he said, and "it's something that we'll continue to push and make that more seamless".
Looking further into the future of distributed systems, Cutting says an architecture is needed that can instantaneously consult both real-time and historical data to help make real-time decisions.
"There's various ways of doing that today and they all have flaws. I think we will iron that one out soon."
Ultimately, he believes Hadoop's legacy will be to have played a role in making big data the norm, open-source the defacto choice for software and relegating relational databases to a niche.
"We won't be talking about big data, we'll be talking about data systems. The open source stack will no longer be the new thing, it will be the incumbent and the way people operate. Relational systems will be the Cobol-equivalent and very much legacy. We'll have moved a long ways in 10 years."