Hadoop creator: 'Google is living a few years in the future and sending the rest of us messages'

Summary: The co-creator of the Hadoop distributed computing platform on how Google's systems of today are shaping the business systems of tomorrow.


Want to understand the type of systems global businesses will be using in five years? Look at the technology used by Google today.

Enterprise has a history of riding in Google's slipstream. It was in 2004 that Google revealed the technologies that inspired the creation of Hadoop, the platform that is only today starting to be used by business for big data analytics.

Hadoop's co-creator Doug Cutting believes industry will continue to borrow from Google's toolbox, and sees a bright future in enterprise for the recently announced Google Spanner.

"Google is living a few years in the future and sending the rest of us messages," he said at the O'Reilly Strata Conference in London.

Hadoop co-creator Doug Cutting. Image: Tim Bray (http://en.wikipedia.org/wiki/User:TimBray) under licence (http://creativecommons.org/licenses/by-sa/3.0/)

Spanner was unveiled by Google last year as the technology that allows the search giant to provide almost instantaneous access to its services to millions of people worldwide without its software falling over. Primarily it stops Google's systems from getting tangled up while trying to keep up to date with what each other is doing.

In creating Spanner, Google had built a planet-spanning distributed database that allowed its global datacentres to keep in sync without suffering huge latencies. 

At the heart of Spanner is Google's TrueTime service, which allows systems to get accurate timestamps based on readings from atomic clocks and GPS receivers installed in each of Google's datacentres. Because Google can rely on TrueTime systems in different datacentres being in sync, it can ensure applications situated on opposite sides of the world are able to read, write and replicate data without falling out of step with each other.
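
To make the idea concrete, here is a minimal sketch of how a bounded-uncertainty clock can be used to order writes. It follows the "commit wait" approach described in Google's published Spanner design, but it is an illustration only: SimulatedTrueTime, TimeInterval and commit are hypothetical names, the local clock stands in for atomic-clock and GPS readings, and the fixed uncertainty bound is an assumption.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """The true time is assumed to lie somewhere in [earliest, latest]."""
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical stand-in for a TrueTime-style service.

    Assumption: the local monotonic clock is within `uncertainty_s`
    seconds of true time. A real deployment would derive this bound
    from atomic-clock and GPS readings.
    """

    def __init__(self, uncertainty_s=0.005, clock=time.monotonic):
        self.uncertainty_s = uncertainty_s
        self._clock = clock

    def now(self):
        t = self._clock()
        return TimeInterval(t - self.uncertainty_s, t + self.uncertainty_s)


def commit(tt, sleep=time.sleep):
    """Choose a commit timestamp, then wait until it is certainly in the past.

    Once the wait finishes, any later commit (in any datacentre sharing
    the same uncertainty bound) will pick a strictly larger timestamp.
    """
    commit_ts = tt.now().latest
    while tt.now().earliest <= commit_ts:
        sleep(tt.uncertainty_s / 4)
    return commit_ts
```

The cost of the approach is visible in the loop: each commit waits roughly twice the uncertainty bound, which is why tighter clock synchronisation translates directly into lower write latency.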

For Cutting, Spanner shows the future possibilities for open source distributed processing platforms like Hadoop.

Hadoop allows data to be spread over large clusters of commodity servers and processed in parallel. Today the platform is generally used to analyse data that sits outside of online transaction processing (OLTP) systems that are the engine of businesses – the likes of e-commerce, CRM and HR systems.
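
The processing model can be sketched in a few lines. This is not Hadoop's API: it is a minimal, single-machine illustration of the map-then-reduce pattern, in which threads stand in for cluster nodes and in-memory lists stand in for data blocks. The function names map_partition, merge_counts and word_count are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(partitions, workers=4):
    """Process each partition independently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())
```

Because each partition is processed without reference to the others, adding more partitions and more workers scales the map step; only the final merge needs to see every partial result.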

Spanner demonstrates how major corporations may soon use a Hadoop-like platform to run these OLTP systems at a globally distributed scale, said Cutting, who is also chief architect at Hadoop specialist Cloudera.

"I think it [Spanner] is the Holy Grail for big data," he said. "Just a couple of years ago people would talk about OLTP and say 'You can't do that sort of stuff on a Hadoop-like platform'. Google demonstrated that you can."

Google created Spanner because it needed a technology with global reach to underpin its massive software platforms, said Cutting, a need that other large enterprises may struggle with in future.

"In a lot of cases people are served just fine by their existing relational solutions to OLTP problems, and there's no need to drive it to Hadoop.

"[However] as enterprises become more Google-like that might not be satisfactory. I think the rest of us will be driven there as well. In the next couple of years I think we'll see it."

With its Prism system, Facebook is already demonstrating one way of effectively linking a Hadoop cluster that spans multiple datacentres worldwide.

What's next for Hadoop?

The type of processing that can be carried out on a Hadoop cluster is evolving, with the general availability of Hadoop 2 last month refining Hadoop's software tools to make it easier to use clusters for more than batch processing.

As Hortonworks co-founder Arun Murthy told ZDNet, the introduction of a separate job scheduler called YARN widens potential uses for Hadoop.

"It opens up Hadoop to so many new use cases, whether it's real-time event processing, or interactive SQL. Machine learning is another example — people are building native machine-learning apps on top of Hadoop right now, thanks to YARN," he said.

Another use for YARN, said Cutting, is for dynamically reassigning computing resources from the Hadoop cluster according to factors that are important to an organisation, such as who is running the job or at what time, which could be spelled out in SLAs.

Cutting sees Hadoop evolving into an enterprise data hub, a platform for running a variety of enterprise workloads — including batch processing, interactive SQL, enterprise search and advanced analytics — that integrates with existing corporate systems.

"Changes are going very quickly. I think there will be a trend towards seeing it as a kernel for datacentres in the way that Linux is the kernel for single nodes. More and more it will be the centre of a wide range of applications," he said.

"In the last year we've seen huge numbers of the traditional enterprise software vendors start to move and make their software available on top of Hadoop.

"We've had SAS, Tableau [Software], across the board people are starting to see this as a platform that their customers want. I think that's going to accelerate until it's the default platform for most vendors."

Nick Heath is chief reporter for TechRepublic UK. He writes about the technology that IT-decision makers need to know about, and the latest happenings in the European tech scene.

  • no actually not

    They are trying to steer the kind of future you can have and they want it to be their version of the future... see The Matrix... I have other ideas though.

    I didn't come here to tell you how this would end... I came here to show them a world without you...
    • And you ignore the fact that MS adopts whatever Google is doing

      And screws it up.

      It isn't that they want it to be their version of the future, it is just that others like what they see.
      • Sometimes following others is a smart move

        But not always. Sometimes it's better to follow our personal ideas about what to do next and not be constantly looking over our shoulder to see what others are doing.
      • And you ignore the fact that...

        No sorry, fact(s) not in evidence and I ignore nothing. Your short circuit thinking is your own problem.

        I have no doubt that some look at the Matrix or Google and like what they see... others not so much. When you look at the steak (which is a metaphor for the marketing view of the product) and see something juicy and delicious, you're not looking at the whole picture, you're accepting the program. Free your mind.
        • You've got to be the dumbest...

          ...person posting on this site.
          • clearly that would be you

      • Wow. Second post in and you try to spin this to an anti-MS blog.

        What's wrong with sticking to the subject and company at hand in the blog? You seem to become mighty uncomfortable when the scrutiny, or critique is not aimed at MS.

        Why is that?
        • Oooops, my mistake...

          ...BillieF is the dumbest person posting on this site
          • yep, still you

      • WHAT???

        "And you ignore the fact that MS adopts whatever Google is doing And screws it up"

        My lord. What on earth are you talking about??

        What is this? Let's just give MS a bash to the head because we just feel like it?

        You're an idiot.

        I don't see anything that MS has simply taken and screwed up from Google.

        All you're doing with that nonsense is just asking some MS fanatic to come back at you and say that Google steals its ideas from MS and then screws them up. Everyone can talk like an idiot if that's the way you like it better.
  • Hadoop creator: 'Google is living a few years in the future and sending the

    I wonder why businesses are just now catching on to this. I remember reading Google's set up a few years ago on how they do their distributed computing. They had it documented on their site somewhere. I don't like Google but they did find a way for quick network access to data.
  • Get your facts straight...

    It was a couple guys from Yahoo that came up with map reduce and Hadoop. Google just took it and used it. The guys that created it at Yahoo have now formed HortonWorks and they are getting huge support from Microsoft. Google and Cloudera are just riding the coattails of the true innovators. Look at things like the Microsoft Jim Gray Systems Lab at the University of Wisconsin to see where things are really headed.

    • Nope

      Doug Cutting was one of those "guys" from Yahoo! who created Hadoop. The inspiration for the core components of Hadoop came from a Google paper published in 2004 detailing Google MapReduce and Google File System, as Cutting himself details in the following presentation.
      Nick Heath
      • The point is...

        The original white paper for map reduce did come from Google, but the implementation was done at Yahoo. While GFS was the inspiration for the original HDFS, it has since been changed significantly. The point is that Google is not the leader in this field. Most of the original designers from Yahoo went to HortonWorks, and Microsoft and Teradata have been helping them to innovate the platform. All of the innovation from the past couple years has come out of HortonWorks and been added back into the Apache Hadoop open source projects. Microsoft is contributing a layer of abstraction that allows the use of traditional SQL queries to query the Hadoop data. The only thing Google did was add geographically dispersed replication to GFS, a feature already found on several other file systems and RDBMS systems. It's not much of an innovation. Same goes for using Hadoop or any other system for DC management and runbook automation. These are already offered by most ITIL software vendors including BMC, HP, IBM, Microsoft, and ServiceNow. Google is not a huge innovator. All they're doing is taking existing software and turning it into a cloud service like everyone else. Most of the BI innovation these days is just adding abstraction layers on top of data models for presentation to information workers, or changes to storage to try to catch up with the performance gains resulting from Moore's Law. PureStorage is an innovator. They've basically just turned the traditional disk-based SAN into the new tape library.
  • MapReduce was evolution not revolution.

    I do believe that Hadoop is nice & MapReduce is a sound approach to solve a problem that is suitable for parallel processing.
    But neither are revolutionary concepts created by some visionary.

    Distributed parallel computing was a well worn path even by mid 1990's. "Map Reduce" was a standard design pattern for anyone writing s/w for High Performance Computing (HPC) systems. It is at the heart of GRID computing systems (which started around 1997-ish).
    It is also core to many data appliances: eg: Microsoft's Parallel Data Warehouse (PDW) which evolved from Stratus (circa 2002) & Teradata (in the '90's), to name just two.

    It's great that Google are pushing the scale of this technology further. They are at the pointy end.

    Unfortunately for most DBA's this is still an over-hyped niche technology. Few companies have the projects with data sizes that justify this type of massively parallel scale out approach. Most problems can be solved on commodity h/w, with the possible exception of social analytics &/or the largest of multi-nationals. The rest of the time I see DBA's wanting Hadoop to be the solution purely because it is cool & fun to play with.
    ie: "Look, what I used to do on 1 system, I'm doing with 10 VM's"
  • Can Google sell me what I want?

    I want a truly relational database management system (SQL isn't relational and doesn't count).

    Whether it runs on one server or three hundred should be completely invisible to me from a logical perspective.

    It must be possible for me to run it on my own hardware so for security reasons it can be disconnected from the internet.

    When will Google have this product available?