When it comes to using data to drive business, organizations such as Google or Facebook are iconic. Although a lot can be said about their practices with regard to topics such as privacy, ownership, and governance, there's no denying that they have been pioneering the field, both in technical approach and in culture.
The people who helped build the technical substrate and the data-driven culture upon which Facebook operates call this combination DataOps. When they started in 2007, big data was not what it is today. All four Vs that define big data -- volume, variety, velocity, and veracity -- were at lower levels.
But perhaps more importantly, there was not much previous experience of working with big data and using it to drive decision making in organizations. At that time, the jury was still out as to whether having all that data was useful. Today, the feeling is that the value of data has been proven, and it's more a question of how to get at it.
"Decision makers are used to making judgments. Any CEO understands statistics at a gut level, because that's what they do every day. They may not know the math behind it, but the idea of collecting evidence, iterating on it and basing decisions on this is intuitive for executives."
By putting in the work required to streamline access to data, things start happening -- things like business-transforming ideas coming from likely and unlikely places. There is a well-known story about how a Facebook intern performed analyses mapping how users interact with each other, leading to a global campaign and driving product features and growth. More than feelings and stories, however, there is mounting hard evidence pointing to a simple fact: Data-driven organizations perform better. For example, according to an Economist survey from 2012, organizations that rely on data more than their competitors outperform them financially.
Business Intelligence, Data Warehouses, and Dashboards
The first step in this journey is to acknowledge the effectiveness of data-driven decision making. Then the right infrastructure needs to be in place, and the culture of organizations also needs to shift and adjust.
It's not that data-driven decision making is all that new. But things are different in terms of scope: Collecting data and performing analytics used to be a privilege few organizations could afford, and it mostly happened after the fact, retroactively trying to come up with explanations and solutions.
Business intelligence (BI) and descriptive analytics are the terms that came to be associated with these tools and approaches, respectively. BI is something that many executives have come to familiarize themselves with and even rely on. In the beginning, it meant having a hefty pack of printouts on their desks.
Those printouts would summarize key metrics and key performance indicators (KPIs), such as production, sales, and churn. As executives started getting used to the idea of being able to review these key figures, the practice started going deeper, and bumping into some issues. This information could be too much and yet not enough.
Imagine getting metrics for an organization with thousands of branches or employees. This would be overwhelming, and it would require a substantial amount of time and effort just to scan through, let alone digest. KPIs can help in delivering a bird's-eye view of an organization, but they are not enough.
Even if someone puts in the time and effort to review metrics, what if they discover something that requires closer inspection? For example, how would they be able to focus their attention on a specific branch that is underperforming, see historical metrics, and compare with other branches?
Those are the types of issues that became opportunities driving the evolution of analytics. So we went from printouts and ad-hoc questions that required dedicated teams to collect and process data to answer, to solutions such as visualization, dashboards, and data warehouses.
As the questions started becoming more complicated, the queries required to answer them started becoming more burdensome for databases and teams alike. This led to the emergence of a special type of database, the data warehouse, designed in a way that's optimized to answer analytical queries.
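To make the contrast concrete, here is a toy sketch, using Python's built-in sqlite3 module and made-up branch data, of the aggregate-over-a-dimension query shape that warehouses are optimized for (real warehouses add columnar storage and pre-computed dimensions, but the query looks much the same):

```python
import sqlite3

# Hypothetical miniature "warehouse": a single fact table of sales figures.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "2024-01", 120.0), ("north", "2024-02", 90.0),
     ("south", "2024-01", 200.0), ("south", "2024-02", 210.0)],
)

# An analytical query aggregates over a dimension (here, branch),
# rather than fetching or updating one record as operational queries do.
totals = dict(conn.execute(
    "SELECT branch, SUM(amount) FROM sales GROUP BY branch"
).fetchall())
```

An operational workload would instead issue many small point reads and writes; it is this difference in access pattern that warehouse designs exploit.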
The other thing that helped move analytics forward was visualization and dashboards. In order to quickly get an overview of vast amounts of data, various types of charts were introduced. These charts were hosted in applications called dashboards, aiming to serve as the one-stop shop for reviewing organizational performance.
These dashboards eventually became interactive. What this meant in practice was that users could click on them, drilling down to the underlying data in order to go from a bird's-eye view to the specifics of something that caught their eye. This type of analytics associated with explaining why something has happened is called Diagnostic Analytics.
Diagnostic analytics is tremendously helpful in getting insights, and more and more businesses started developing business analytics programs and becoming data-driven. However, this type of analytics also has limitations, which eventually began to show.
Moving data from operational databases to data warehouses involves a process of extracting, transforming, and loading (ETL), which is cumbersome, error-prone, and time consuming. In addition, data warehouses are built on the notion of pre-calculated dimensions that can help answer certain questions.
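The ETL step described above can be sketched in a few lines. The source rows, field names, and transformation below are hypothetical stand-ins for reading from an operational system and reshaping records into what a warehouse expects:

```python
# Hypothetical ETL sketch: extract rows from an "operational" source,
# transform them into the warehouse's shape, and load them.
def extract():
    # Stands in for querying an operational database.
    return [{"id": 1, "price": "19.99", "qty": "2"},
            {"id": 2, "price": "5.00", "qty": "10"}]

def transform(rows):
    # Cleanse types and derive the fields the warehouse expects.
    return [{"order_id": r["id"], "revenue": float(r["price"]) * int(r["qty"])}
            for r in rows]

def load(rows, warehouse):
    # Stands in for bulk-loading into the warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

Even in this toy form, the fragility is visible: every new question about the data can mean new fields in `transform` and a fresh load cycle.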
When new questions arise, more ETL and data warehouse design and implementation cycles are needed. And to make matters worse, the volume, variety, velocity, and veracity (the four Vs, or 4V) of data started reaching new heights that relational databases, which had practically been the only game in town, had trouble keeping up with.
This gave birth to the term big data, used to describe 4V data, as well as a new array of technologies, most notably Hadoop and NoSQL databases.
NoSQL, which eventually came to be defined as Not Only SQL, refers to a range of database solutions not based on relational models. These databases (document, key-value, columnar stores, and graph) sprang from the need for operational systems that scale out rather than scale up, and that accommodate models best suited to specific issues and domains.
Instead of throwing more powerful hardware at the problem, which was the typical approach for operational relational databases and data warehouses, NoSQL databases took a different approach. They were designed to scale horizontally, aiming to maintain linear scalability by adding more nodes in a distributed, clustered system.
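A rough sketch of this scale-out idea, assuming a simple hash-based placement of keys onto nodes; the node names and keys are invented, and real systems typically use consistent hashing instead, to limit data reshuffling when nodes join or leave:

```python
import hashlib

# Toy partitioning scheme: hash a key to pick one of the cluster's nodes,
# so that data and load spread horizontally as nodes are added.
def node_for(key, nodes):
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
placement = {k: node_for(k, nodes) for k in ["user:1", "user:2", "user:3"]}
```

The point of the exercise: capacity grows by appending to `nodes`, not by buying a bigger machine, which is the essential difference from the scale-up model.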
NoSQL broke with the tradition of relational databases, building on the CAP theorem and introducing eventual consistency. The CAP theorem states that in the presence of a network partition, one has to choose between consistency and availability.
This means that most NoSQL databases departed from the guaranteed consistency model of relational database ACID (atomic, consistent, isolated, and durable) transactions, and instead they rely on the BASE model (basically available, soft state, and eventual consistency). In other words, NoSQL databases may have their benefits, but there are scenarios in which they impose a burden on the applications they support to maintain data integrity.
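A toy model of what eventual consistency means for applications might look like the following; the `Replica` class and the anti-entropy step are invented for illustration and correspond to no particular database:

```python
# Assumed model: writes land on one replica and propagate asynchronously;
# until a sync (anti-entropy) pass runs, another replica can serve stale reads.
class Replica:
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

def anti_entropy(source, target):
    # Simplistic convergence: copy the source's state across.
    target.data.update(source.data)

a, b = Replica(), Replica()
a.write("balance", 100)
stale = b.read("balance")   # None: the write has not reached replica b yet
anti_entropy(a, b)
fresh = b.read("balance")   # 100: the replicas have converged
```

The stale read in the middle is exactly the burden the text describes: the application, not the database, must decide whether a possibly out-of-date answer is acceptable.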
While originally aimed at operational applications, NoSQL had an impact on analytics, too. The poster child of the big data era, however, is Hadoop. Although Hadoop is based on similar design principles as NoSQL, its focus has been analytics all along. Hadoop was based on the premises of:
Using many commodity hardware nodes
Having a distributed file system (HDFS) as the main storage, and
Utilizing a programming model (MapReduce) that leverages data locality and parallelism for efficient computation
This turned out to provide a number of benefits:
Hadoop was more cost efficient than data warehouses, as its architecture enabled it to store more for less
Hadoop could store any type of data, rather than just structured, relational data
Hadoop lent itself well to compute as well as storage
The latter is worth some additional analysis. As the MapReduce framework could be utilized to implement any type of processing, combined with Hadoop's efficient storage and distributed, layered architecture, this essentially led to a decoupling of compute from storage in database architecture.
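As a rough, single-process illustration of the MapReduce model, here is the classic word count; real Hadoop distributes the map, shuffle, and reduce phases across many nodes, but the shape of the computation is the same:

```python
from collections import defaultdict

# Map phase: emit (key, value) pairs from each input record.
def map_phase(document):
    for word in document.split():
        yield word, 1

# Shuffle: group all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: fold each group of values into a result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big compute"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

Because any computation expressible as these three phases can run this way, the framework is not tied to one storage layout or query language, which is what made the decoupling possible.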
As a result, Hadoop and data warehouses often live side by side, and a Hadoop ecosystem is thriving. Many businesses use Hadoop's cost-efficient storage as a data lake, storing any type of data for later processing. Then its programming framework can be used to ETL data into data warehouses.
From big data to predictive analytics, machine learning, and AI
Hadoop, however revolutionary, also had its issues. The ability to store any data for cheap is great, but the MapReduce framework was cumbersome to use and required expertise that's hard to come by. In order to deal with this, a thriving ecosystem has been built on and around Hadoop, eventually offering the ability to abstract away MapReduce via tools and even SQL interfaces.
So, once again, analytics solutions could work across all kinds of back-ends -- relational, NoSQL, and Hadoop. Except in the case of Hadoop, there was one more problem. Hadoop was designed to be a batch processing tool, and using it for interactive querying was stressing it beyond its intended purpose. This, too, has been dealt with, as today there are higher level APIs and frameworks for Hadoop.
But before setting out to check the state of the art in Hadoop and beyond, it's important to emphasize the interplay between storage and compute technology on the one hand and analytics on the other: Progress and requirements in one push the other forward. So, it's worth pausing for a while to see where the big data revolution is taking analytics next.
When sitting on piles of big data, interesting things start becoming possible. Remember the "data as evidence to support decision making" notion? Taking this analogy between human decision making and data-driven decision making further, think of a seasoned expert in any domain, with lots of experience and projects under their belts.
It could be a ball player that seems to know how the next play is going to unfold, or a business analyst who seems to anticipate what the competition is up to. Experts with that level of skill and experience sometimes give the impression they can predict what's going to happen next. In reality, they make reasonable projections, or educated guesses, based on their experience. This is what predictive analytics is also about.
Predictive analytics is about using past data to foresee what is going to happen next. What are sales going to be like next month? Which customers are most likely to unsubscribe? Which transactions are likely to be fraudulent? Which items will a user like? These are the types of questions predictive analytics aims to answer.
These are extremely hard questions to answer. Even identifying what parameters come into play is hard, let alone figuring out the way the parameters interact with each other and coming up with algorithms to express that. This is why dealing with such topics in a procedural, programmatic way is complicated.
This is where machine learning comes in: instead of hand-coding the rules, algorithms infer them from data. Machine learning is not a new approach. Although progress has been made in the last few years, most of the machine learning techniques utilized have been around for decades. What has changed is that today we have the amounts of data and compute power needed to make this approach work.
Machine learning relies on troves of data and on human work and expertise, and it is also computationally intensive. Data needs to be collected, curated, labeled, and fed to the right ML algorithm in the right way. This process is called training the algorithm, and it is a fine art few can claim to master in its entirety at this point.
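The train/predict loop behind this can be sketched with a deliberately simple model. The 1-nearest-neighbour "algorithm" and the churn features below are hypothetical, chosen only so the whole pipeline fits in a few lines; production systems use far richer algorithms and far more data:

```python
# Train: here, "training" is just memorising labelled examples.
def train(examples):
    return list(examples)

# Predict: the label of the closest known example wins.
def predict(model, features):
    def distance(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], features))
    return min(model, key=distance)[1]

# Hypothetical features: (monthly_spend, support_tickets) -> did they churn?
labelled = [((10, 5), "churn"), ((80, 0), "stay"), ((15, 4), "churn")]
model = train(labelled)
guess = predict(model, (12, 6))   # a new customer near the churn cluster
```

Even this toy shows where the "fine art" lives: the choice of features, the labelling of examples, and the match between algorithm and problem all matter more than the prediction step itself.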
However, if successful, the results can be astonishing. ML is being applied to an increasing array of domains, from fraud detection to healthcare, with outcomes equaling or surpassing those achieved by human experts. This has understandably led some to call such approaches artificial intelligence (AI).
While the definition of AI goes beyond our scope here, there is one aspect of it that is interesting when discussing the next, or ultimate, stage of the evolution of analytics: Prescriptive analytics. Prescriptive analytics is about making certain desirable outcomes occur.
To draw on the previous analogy, that would be the equivalent of a coach drawing a play on a whiteboard for their team to win a game, or a strategist coming up with a plan to maximize the influence of their group. Although the distinction from predictive analytics is not clear-cut, it may not be that important at this time either: for most organizations, those would be first-world problems.
"The competition is going to be about data, who has the best data to use. If you're still struggling to move data from one silo to another, it means you're behind at least two or three years. Better allocate resources now, because in five years there will already be the haves and have nots."
It's a strategic decision, leading to strategic advantages and really transforming organizations. But what about the ones that can't or won't afford this, literally and metaphorically?
Not every organization is born digital. Not everyone can build a data team from one day to the next, and even if they wanted to, there just aren't enough proficient data engineers and scientists to go around at this point. And, of course, building infrastructure is a costly business. This is why the cloud can offer a remedy.
On the infrastructure front, the pros and cons of the cloud are well understood by now. The cloud offers elasticity with little to zero upfront investment, and the downside of moving data back and forth is less of an issue when using data from applications that live in the cloud anyway. On the other hand, vendor lock-in is always something to keep in mind.
But beyond storage infrastructure, or analytics tools that live in the cloud, the cloud has more to offer.
The cloud can make up for the lack of expertise in the analytics market, for example by offering ready-to-use libraries and data pipelines. The promise there is that it should be plug and play; the catch is that this means outsourcing core expertise and getting a commoditized offering that is not a differentiator and may well be behind the curve compared to the leaders.
Even beyond this, however, we have started seeing the rise of a new class of integrated analytics platforms in the cloud, going by the name of Insights Platforms as a Service (IPaaS). These platforms promise to do all the heavy lifting, integration, and insight generation across clouds for their users, and they deliver turn-key analytics solutions.
Wrapping up, let's take a look at the latest trends in the evolution of analytics.
We mentioned how building the infrastructure and capacity for analytics is complicated. But what if, instead of having to get the data out of the applications you use and analyze it yourself, such functionality were embedded in the application out of the box? As more and more users are looking to get insights out of their applications, embedded analytics is gaining popularity.
These days we are also seeing convergence across platforms. While the rise of NoSQL signaled fragmentation, it soon became evident that there needed to be some way of getting the big picture out of big data across platforms. As data started accumulating in NoSQL databases, they eventually had to develop their own analytics solutions as well, rather than having to ETL data to data warehouses.
Those solutions were, expectedly, not initially as sophisticated as their relational counterparts. This drove the adoption and adaptation of SQL in NoSQL databases, and today we have many analytics solutions that work across SQL and NoSQL databases.
Real-time processing, aka streaming, is another sign of the times. Both data warehouses and Hadoop impose a lag between the time data is generated and the time it can inform insights. As having the latest data feed insights and algorithms on the fly can yield business benefits, a new class of streaming platforms is rising to prominence.
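A minimal sketch of the streaming idea: maintain a running aggregate over a sliding window as events arrive, rather than reprocessing history in batch. The window size and event values below are arbitrary:

```python
from collections import deque

# Incremental sliding-window sum: each arriving event updates the
# aggregate in O(1), so results are available as the data flows in.
class SlidingWindowSum:
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0

    def push(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            # Evict the oldest event once the window is full.
            self.total -= self.window.popleft()
        return self.total

stream = SlidingWindowSum(size=3)
results = [stream.push(v) for v in [5, 1, 2, 7]]
```

A batch system would recompute such aggregates on a schedule; the streaming version answers after every event, which is the lag reduction the text describes.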
In-memory processing is a related, although orthogonal, trend. As the cost of memory is dropping, and new memory types are becoming available, building platforms that process data in-memory first has become viable. Although in many cases a substantial amount of redesign is needed on the platform side, the idea is that users should benefit from the increased speed of memory compared to disk.
Last but not least, the emergence of specialized hardware goes beyond memory to processors as well. Today, we see specialized chips such as GPUs and FPGAs, with more still in the works, going mainstream. Progress on this front has been rapid, as these chips can offer great advantages for specialized types of workloads such as machine learning. Platforms like Hadoop are expanding to accommodate this hardware, while at the same time specialized platforms built on it, such as GPU databases, are being born.