Ashish Thusoo and Joydeep Sen Sarma know a thing or two about big data. They led the team that built Facebook's data infrastructure, and they are also the co-creators of the Apache Hive project and founders of Qubole. Facebook's entire operation and culture are centered on data, and Thusoo and Sarma were a big part of making it happen.
As Thusoo explains, they both felt they had benefited enormously from this experience and wanted to share their insights so that others could benefit as well. To do this, Thusoo and Sarma co-authored a book on creating a data-driven enterprise with DataOps, which is out today and available for free download. Thusoo shared some of his insights, beginning with what DataOps is and why you should care.
DataOps is the equivalent of DevOps for data. The same way DevOps is about enabling a continuous and unobstructed flow of development and deployment of applications, DataOps is about enabling a continuous and unobstructed flow of access to and insights from data. Or in Thusoo's own words:
"DataOps is a new way of managing data that promotes communication between, and integration of, formerly siloed data, teams, and systems. It takes advantage of process change, organizational realignment, and technology to facilitate relationships between everyone who handles data: developers, data engineers, data scientists, analysts, and business users. DataOps closely connects the people who collect and prepare the data, those who analyze the data, and those who put the findings from those analyses to good business use."
When Thusoo started working at Facebook back in 2007, big data was not what it is today. All four V's that define big data -- volume, variety, velocity, and veracity -- were at lower levels. But perhaps more importantly, there was not much previous experience of working with big data and using it to drive decision making in organizations.
"When we started," Thusoo remembers, "the question was still out as to whether having all that data is useful or not. Today I feel the value of data has been proven, and it's more of a question of how to get it." Thusoo bases his argument on the value of big data on a simple premise: data-driven organizations perform better.
Beyond anecdotal evidence and hype, he cites a survey on the financial performance of organizations conducted by The Economist in 2012. According to this survey, organizations that rely on data more than their competitors outperform them financially.
So, if that's settled, the question then is "how do we get there?" Thusoo gives a detailed account of his journey at Facebook in the book, focusing on two "ah-ha" moments.
The first one came about when they leveraged what was then a new set of tools, namely Hadoop and Hive, to enable self-service data access for Facebook employees. That was in early 2008, and the second "ah-ha" moment followed only a few months later.
What Thusoo realized was that by making data more universally accessible within the company, they could actually disrupt the entire industry. And it was not long before this started to unfold.
By developing the infrastructure and putting in the work required to democratize access to data, which Thusoo details in the book, things started happening at Facebook. Things like interns coming up with business-transforming ideas. One intern in particular, Paul Butler, performed analyses using Hadoop and Hive and mapped out how Facebook users were interacting with each other all over the world.
By drawing the interactions between people and their locations, he developed a global map of Facebook's reach. As Butler says, when he shared the image within Facebook, it resonated with many people: "It's not just a pretty picture, it's a reaffirmation of the impact we have in connecting people, even across oceans and borders."
From that point on, things were set in motion and connectivity took on a life of its own. Metrics were devised, experiments were conducted, and the connectivity theme was pushed up the management chain, picked up as a key message for marketing, and used to drive product features such as "People you may know." Bottom-up innovation proper.
This could never have happened in the old world, where a data team was needed to fulfill all requests for data, argues Thusoo. He is adamant about the need to have the right infrastructure in place to remove gatekeepers, noting that "data was clearly too important to be left behind lock and key, accessible only by data engineers. We were on our way to turning Facebook into a data-driven company."
But if infrastructure is the catalyst, it's organizational culture that makes DataOps possible and lets the pearls of insight come to the surface. Thusoo mentions that Facebook was a greenfield organization, and furthermore one whose core business was built around tech and data, so for them adopting such a culture came naturally. Other organizations have established ways of doing things and making decisions:
"Regardless of whether you have acknowledged it, your business already has a culture of decision-making. That culture might not be geared toward a data-driven approach. All too many companies subscribe to the "HIPPO" (highest-paid person in the office) method of decision-making, whereby the senior person in the meeting gets to make the final choice. Needless to say, this HIPPO can be wrong.
But unless you have the data as well as the permission coming from the very top of the organization to argue back, that decision stands. And herein lies the key: to succeed at becoming a data-driven organization, your employees should always use data to start, continue, or conclude every single business decision, no matter how major or minor."
It's the science, stupid
To anyone with a science background, this should sound familiar. It's the quintessence of the scientific method: developing hypotheses and putting them to the test with data. No data, no party. But this principle can also work the other way -- by observing patterns in data and developing theories to account for them. Thusoo considers both equally valid and has seen them work well in practice.
"What you want to use depends on a number of things such as job function, skills and goal. In Facebook, one of our major goals was growth -- how to get to 1 billion users. Part of that was evaluating different templates, layouts and calls to action. People had different theories about this -- some said sleek, sophisticated approaches would work best, others supported simple calls to action. We tested with user groups, analyzed the data and decided to go for simplicity.
In other areas, like security, we took an exploratory approach. We wanted to eliminate fake accounts, but this was a complex issue to deal with. So instead of devising rules, we analyzed data to pick up patterns that would help us figure it out. Both options work. Sometimes you have a wealth of business knowledge, and it makes sense to leverage this, sometimes using techniques like machine learning can help to deal with vast amounts of complex data," says Thusoo.
The collaborative and open culture aspect of DataOps that Thusoo refers to also comes very close to how scientific research communities function: working in groups, having educated arguments backed by data, and deciding on the basis of those arguments rather than hierarchy.
So is this a requirement for organizations that want to apply DataOps? Can DataOps be applied à la carte, or is it a "get with the program" kind of thing? And what happens if organizations really do buy into this?
Getting with the program, changing the world
Thusoo is adamant: "This is an integral part of DataOps. You can't make it work if you don't have an open, collaborative decision making process in place. I understand organizations may have concerns about this, but to me assigning gatekeepers is not the answer. There are tools you can use, there are ways to balance access with too much access. This mixed approach scales, and we want to share this."
So, if all of that is in place, what happens if someone uses access to data to come up with a brilliant discovery that ends up being adopted and driving business value? Could data be used to trace that back, estimate just how much value that generated and distribute it accordingly?
"I don't know whether you could use data for that," says Thusoo. "Probably yes. But again, it comes down to organizational DNA. It's a market economy and culture, you could use data to calculate dividends and the like".
Speaking of markets, let's come full circle: if data-driven organizations perform better, can there be a data-driven way to estimate just how data-driven an organization is? Definitely, according to Thusoo.
"There are metrics that can be used for this, such as the amount of data, or the number of people that have access to data, or the degree to which data drives research & development. And we could correlate these with metrics such as growth rate or rate of innovation. In Facebook for example, in 2011 when I left the company 30 percent of our (internal) users had access to data. You can compare that to 5 percent which is a typical number for other organizations.
Yes, it will require a certain degree of openness and transparency for organizations to do this and get that data out. But the benefits of this approach will far outweigh the disadvantages. Data shows that organizations with open culture have stronger brand appeal too."
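For readers who want to try this kind of measurement themselves, the idea Thusoo describes can be sketched in a few lines of Python. This is only a minimal illustration, not something taken from the book or from Facebook's tooling: the `OrgSnapshot` class, its field names, and the `pearson` helper are hypothetical, and the figures in the usage section are made-up placeholders rather than real measurements.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class OrgSnapshot:
    """Hypothetical record of one organization's data-access figures."""
    name: str
    internal_users: int
    users_with_data_access: int
    growth_rate: float  # e.g. 0.12 for 12 percent annual growth

    @property
    def data_access_share(self) -> float:
        """Fraction of internal users who can access data."""
        if self.internal_users <= 0:
            raise ValueError("internal_users must be positive")
        return self.users_with_data_access / self.internal_users


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder figures for illustration only; NOT real data.
    orgs = [
        OrgSnapshot("org_a", 1000, 300, 0.20),
        OrgSnapshot("org_b", 1000, 50, 0.05),
        OrgSnapshot("org_c", 2000, 240, 0.09),
    ]
    shares = [o.data_access_share for o in orgs]
    growth = [o.growth_rate for o in orgs]
    for o in orgs:
        print(f"{o.name}: {o.data_access_share:.0%} of internal users have data access")
    print(f"correlation with growth rate: {pearson(shares, growth):.2f}")
```

A correlation computed this way says nothing about causation, and three data points are far too few to conclude anything. The sketch only shows the shape of the calculation an organization could run once it has collected figures of its own.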
And what if that works? If organizations get on board with DataOps, will our overall culture eventually change in that direction? Could that drive society at large?
"Technological evolution is already pushing to that direction," says Thusoo. "As with every new technology, some pioneer it, some adopt it early, others see the benefits and get on board. We live in a culture of wrangling with data and alternative facts. Data does not lie. You can try and interpret data your way, but in an open society the true facts will come to the fore."