Look at what Google and Amazon are doing with databases: That's your future
It may seem unlikely that ordinary firms will ever be able to emulate the resource-rich web giants when it comes to data architectures. But that possibility may be closer than you think, says Neo Technology CEO Emil Eifrem.
With their thousands of software engineers, huge resources and myriad databases, the Googles, Amazons and Facebooks may seem to inhabit an alternative IT universe.
But what the big web services firms are doing today with data volumes and data types will soon be the common experience for many businesses, according to Neo Technology CEO Emil Eifrem.
Neo Technology is the company that developed the open-source Neo4j graph database, implemented in Java, which is used by businesses such as eBay and Walmart. Graph databases use nodes and the connections between them to describe networks and contexts.
"If you're interested in seeing the future of how data-oriented architectures are likely to evolve, the future is already here — just unevenly distributed," Eifrem said.
"What that means is if you look at some of the big web services — the Googles and the Amazons of the world — they are already today dealing with the volume and shape of data that everyone else will be working on in five years from now."
Those companies have invested a lot of time in building systems that keep data in sync once it has been broken down and distributed across various types of database, each chosen for its efficiency at certain tasks.
"When you receive an email in your Gmail account, that's going to be chopped up into various forms and stored as a simple log over here in maybe the equivalent of a document database. But then over here all the contacts and the keywords are going to be stored in a graph database. Then they have really awesome systems for keeping all that data in sync," he said.
"It's not perfect and it's a lot of hard work. But it's the reality that we're all going to face and a lot of people are already facing it today."
Eifrem said companies are already past the point where a single database is capable of managing all data workloads — and it's misleading for any vendor to suggest it has the answer to all an enterprise's database problems.
"The era of the one-size-fits-all database is over. It used to be when I grew up as a developer that for the architect in the project, when it came to choosing the bottom layer of the stack — the persistence layer — the choice was Microsoft, or IBM, or Oracle, or Sybase. It was a vendor choice," he said.
"They were all the same type of database. But that era has gone forever and it will never come back because data is just so big and so irregularly shaped now that you're always going to be able to get a hundred times improvement, a thousand times improvement, a million times improvement if you get a data technology that is shaped like the shape of your data.
"If you don't go to the pains of choosing that technology and getting that thousand times improvement, then someone else in your vertical will. They're just going to build a much better product and glean so much better insights that they're going to outperform you.
"Eventually, all of us are going to have to [use multiple databases]. I fundamentally disagree if vendors tell you their database is going to be the only database. That's a naive view."
Because companies must choose from a range of database approaches, each with its own strengths and weaknesses and none great at everything, they will have to explore and categorise corporate data carefully.
"Every single dataset today or tomorrow is going to be big. So the role of the data architect in the future is going to be to look at my big dataset and then identify parts of it that are shaped as tables and say, 'That fits really well in my trusted old relational database from Oracle or IBM or whatever,'" Eifrem said.
"This part over here, this is more like what I call tall and skinny tables. Just keys and values, and nothing more," he continued. "You can imagine something like user name and password. That looks very much like a key-value store.
"But over here in this part of my dataset, it's big and messy and connected. It may a social graph; it may be going from point A to point B; or it may be product recommendations, where you want to know that the people who bought similar things to you also bought this thing that you haven't yet bought. That's big and it's connected and it's messy. Let's put that in the graph database.
"That's really the bigger picture here. None of these new databases is horizontally better than any one of the others. You can always find situations where a key-value store will outperform a graph database and where a graph database will outperform a column family or whatever.
"But the really interesting question is not, 'Is this one faster than the other?' but 'In what situations are they used?' — and that comes down to the shape of the data," he said.
The situation with the choice of databases is not helped by the division of the technology into relational and NoSQL, a term widely disliked for its vagueness.
"There are so many ways of slicing that horrible term of NoSQL that no one loves. It's a weird term. It sort of defines something by what it's not," Eifrem said.
"I'm holding an HTC One mobile phone here. Is that a NoSQL database? Well, it's not a SQL database, so I guess it is. That white Ford I'm pointing at — is that a NoSQL database? Well, it doesn't support SQL so I guess it is.
"It's just a weird term and no one likes it but it's what we have and it's a little bit of a rallying cry, a movement of alternative databases. And there are bunches of them — hundreds."
Companies may fail to see how they could ever be in a position to emulate a Google or a Facebook when it comes to the resources needed to manage multiple databases and data integration.
"They have a gazillion times more engineers than a random big enterprise and honestly probably on average they're more talented. What we've seen, however, is as the market demand for more sophisticated and better data technologies picks up across the field — not just at the big high-end web services — there's also an ecosystem that is growing up around it," Eifrem said.
"What first started it was when some of the big web services began to open-source some of their internal technologies. That's what gave us Cassandra, that's what gave us HBase."
Hadoop could also be said to fall into this category, even though Google did not open-source the underlying technology but rather described it in academic papers, which inspired an open-source imitation developed largely inside Yahoo.
"Now what we're seeing is a bunch of vendors that have been at this for quite some time, building out these technologies that it took Facebook seven years and probably hundreds, maybe even thousands, of engineers to build. It's now available off the shelf, either as open source or as commercial or a combination," Eifrem said.
"That's been the democratisation of these technologies that we see happening right now. We're one of them. It used to be that if you wanted to work on data and look at it from a graphic perspective, your choices were to take employment at Google, Facebook or Twitter.
"Now it's available as open source for free. You just download it and put billions of data records into it and do things that only [Facebook CEO] Mark Zuckerberg could do five years ago. It's a very interesting trend where we see these technologies diffuse out of where they typically were birthed inside these big web services, out into the mainstream," he said.