Big data, crystal balls, and looking glasses: Reviewing 2017, predicting 2018

It's this time of year again, when crystal balls meet data. Here's what's kept the data world busy in 2017, and will most likely continue to do so in 2018.
Written by George Anadiotis, Contributor

Let's begin by getting the obvious out of the way: there's no way we could accurately predict what is going to happen, and you should be very skeptical when people claim otherwise. Even when using data and advanced analysis techniques, there will always be bias and imperfection in the analyses.

We admit to having a subjective point of view, but one informed by monitoring the data industry and its news for the past year, and we try to leverage this to highlight the most important trends going forward. Without further ado, here are the top five things we noted in 2017 and will be keeping an eye on in 2018.

5. Streaming becomes mainstream

In the data world, streaming is not all about Netflix -- although Netflix also does it. Confused? Let's set the record straight. Streaming refers to streams of data processed in real time. The real time part is not all that new -- operational databases have been doing that for years.

What is new, however, is that data is no longer just pushed to some back end for storage to power applications, but also analyzed on the fly. The endless streams of data generated by applications lend their name to this paradigm, but also bring some hard-to-meet requirements to the table.

How do you define querying semantics and implementation when your data is not finite? What kind of processing can you do on such data, how do you combine it with data from other sources or feed it into your machine learning pipelines, and how do you do all of this at production scale?

These are hard issues, which is why data analytics has long resorted to what is called a Lambda architecture: two different layers for processing incoming data, a batch layer working with historical data and a real-time layer working with live data.

This is not ideal: two codebases and two platforms to maintain, which means more effort, more cost, and more chances of discrepancies. Still, as long as the real-time layer was not up to the task of handling everything, it was the only viable option. Now that real-time data processing platforms are maturing, the Lambda architecture is giving way to the Kappa architecture: one real-time layer to rule them all.
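To make the infinite-data querying problem concrete, here is a minimal, purely illustrative Python sketch of the kind of computation a real-time layer performs: grouping an unbounded stream of events into fixed-size tumbling windows and aggregating per window. Real platforms such as Flink or Kafka Streams add out-of-order handling, state management, and fault tolerance; the function name and toy data below are our own, not any platform's API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s):
    """Group a stream of (timestamp_s, key) events into fixed-size
    tumbling windows and count occurrences per key per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_size_s) * window_size_s
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# A toy stream of (timestamp, event-type) pairs.
stream = [(0, "click"), (3, "click"), (7, "buy"), (12, "click")]
print(tumbling_window_counts(stream, window_size_s=10))
# {0: {'click': 2, 'buy': 1}, 10: {'click': 1}}
```

In a Kappa architecture, logic like this runs once, over the live stream, and re-processing history simply means replaying the stream through the same code.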

The use cases driving adoption of real-time, streaming data applications have been IoT and financial services. These are not the only domains where time is money, however, as adopters in programmatic advertising and retail are showing. For example, being able to identify and process rejected credit card transactions in real time can result in up to 80 percent fewer abandoned transactions, and therefore increased sales revenue.

The most prominent platform choices here fall into two camps: Apache open source projects, each backed by a commercial entity offering SLAs and support, and managed cloud services. In the first camp we have Flink / data Artisans, Spark / Databricks, and Kafka / Confluent; in the second, Amazon Kinesis, Azure Stream Analytics, and Google Cloud Dataflow.

Apache Beam is an interesting effort at an interoperability layer between these options, with the goal of offering one common API across all streaming platforms. Beam was started by Google and adopted by Flink, but it seems to be at a stalemate: the Kafka people say they are not interested unless support for tables is added, and the Spark people do not intend to commit any resources to supporting it.
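In spirit, the one-common-API idea works something like the following sketch: a pipeline is declared once as a chain of transforms, and any "runner" can then execute it on its own engine. This is a loose illustration of the portability concept only; all class and method names here are hypothetical, not Beam's actual API.

```python
# Hypothetical sketch of a Beam-style portability layer: the pipeline
# is a description of work, decoupled from the engine that runs it.

class Pipeline:
    def __init__(self):
        self.transforms = []

    def apply(self, fn):
        """Append a per-element transform; returns self for chaining."""
        self.transforms.append(fn)
        return self

class DirectRunner:
    """Executes transforms in-process; a real runner would translate
    the same pipeline into Flink, Spark, or Dataflow jobs instead."""
    def run(self, pipeline, data):
        for fn in pipeline.transforms:
            data = [fn(x) for x in data]
        return data

p = Pipeline().apply(lambda x: x * 2).apply(lambda x: x + 1)
print(DirectRunner().run(p, [1, 2, 3]))  # [3, 5, 7]
```

The point of the separation is that swapping the back end means swapping the runner, not rewriting the pipeline.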

4. Hybrid Transactional Analytical Processing

Traditionally, operational databases and platforms for data analysis have been two different worlds. This has come to be seen as natural, as after all the requirements for use cases that need immediate results and transactional integrity are very different from those that need complex analysis and long-running processing.

Again, however, this leads to a non-ideal situation where data has to be moved back and forth between operational and analytical data platforms. This incurs great cost and complexity, and it means analytics does not take the latest data into account. So what if there were a way to unify transactional databases and data warehouse-like processing?

That's easier said than done, of course, and there are good reasons why it has not been achieved until now. Today, however, there is a name for it -- Hybrid Transactional Analytical Processing (HTAP) -- and, perhaps more importantly, there are real-world efforts at tackling it.
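As a rough illustration of what HTAP promises, here is a toy Python sketch (all names hypothetical, no real product implied) of one store serving both a low-latency transactional write path and an analytical aggregate path over the same live data, with no ETL copy in between:

```python
# Toy illustration of the HTAP idea: transactional point-writes and
# analytical scans hit the same live data, so analytics never lags
# behind an ETL pipeline. Real HTAP engines add indexes, columnar
# layouts, MVCC, and distribution; none of that is modeled here.

class TinyHTAPStore:
    def __init__(self):
        self.rows = {}  # order_id -> amount

    def upsert(self, order_id, amount):
        """Transactional path: low-latency point write."""
        self.rows[order_id] = amount

    def total_revenue(self):
        """Analytical path: aggregate over the same live rows."""
        return sum(self.rows.values())

store = TinyHTAPStore()
store.upsert("o1", 10.0)
store.upsert("o2", 5.0)
store.upsert("o1", 12.0)      # the update is immediately visible below
print(store.total_revenue())  # 17.0
```

The hard engineering problem the vendors below are solving is making both paths fast at once, since row-oriented layouts favor the writes and column-oriented layouts favor the scans.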

Some are based on in-memory approaches, such as GridGain, which started out as an in-memory data grid before expanding into a full-blown transactional database, or SnappyData, which combines an in-memory data grid and transactional database (GemFire) with Apache Spark. Similarly, Splice Machine combines a key-value store from the Hadoop stack (HBase) with its proprietary technology to run operational and analytical workloads under the same hood.

Hadoop vendors also have a say here, as both Cloudera with Kudu and MapR with MapR-DB are attempting to expand Hadoop's traditional focus on analytics so that it can act as an operational database too. Another interesting and little-known approach comes from Swarm64, which is working on giving operational databases analytical superpowers.

3. Insight Platforms as a Service

Remember how we noted data is going the way of the cloud? While there are no signs of this slowing down, there's another interesting trend unfolding: the so-called Insight Platforms as a Service (IPaaS). The thinking behind this is simple: if your data is in the cloud anyway, why not use a platform that is also in the cloud to run analytics on it, and automate as much of the process as possible?

The proposition here is to offer the underlying data management and analytics capabilities as commodities, to get to the real value, which comes from the insights delivered from data. Why would you go to the trouble of setting up and maintaining data collection, storage and pipelines, visualization and analytics tools, complex processing, and machine learning algorithms to get to insights, if you could just subscribe to a platform that does all this for you?

This is a tempting proposition for organizations that see this as a way of side-stepping all the complexity and cost associated with getting the in-house expertise required to set something like this up on their own.

The counter-argument is that even if not all organizations are digital and data-driven already, they will be to a large extent in the near future. So outsourcing everything would perhaps not be very wise -- not to mention that not everyone will be willing or able to offload everything to the cloud, as there are issues, from compliance to cost, associated with this too.

As expected, key offerings in this category come from cloud vendors such as AWS, Microsoft Azure, IBM Watson, and Google Cloud Platform, but there are also independent vendors such as Databricks and Qubole with their own value propositions. Hadoop vendors Cloudera, Hortonworks, and MapR are also transitioning to this space, as they realize it's not so much Hadoop itself that matters anymore, but what you can do with it.

2. Moving up the analytics stack

Traditionally, when talking analytics, people would think of data warehouses, reports, dashboards, and lately also visual interfaces, widgets, and so on. In other words, seeing what has happened in your domain of interest, and perhaps getting an idea of why it has happened by drilling down and correlating.

None of that -- known as descriptive and diagnostic analytics, respectively -- has gone away, but it is last year's news. These capabilities are pretty much a given these days, and as many average Joes are already well versed in the art of data-driven analysis, they can hardly be a differentiating factor for organizations.

As descriptive and diagnostic analytics are getting commoditized, we are moving up the stack towards predictive and prescriptive analytics. Predictive analytics is about being able to forecast what's coming next based on what's happened so far, while prescriptive analytics is about taking the right course of action to make a desirable outcome happen.

Predictive analytics typically utilizes machine learning (ML), a technique based on using past data to train algorithms to predict future data, rather than hand-crafting them procedurally as in traditional software engineering. Prescriptive analytics is an even more complicated step that arguably borders on AI, and one that very few are able to utilize at this point.
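As a minimal illustration of the predictive idea -- learn a model from past data instead of hand-coding rules -- here is a toy ordinary-least-squares fit in Python. The sales numbers and names are illustrative only; real predictive pipelines use far richer features and models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b: the simplest possible
    'train on past data, predict future data' model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Past quarterly sales (toy numbers); predict the next quarter.
quarters = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]
a, b = fit_line(quarters, sales)
print(a * 5 + b)  # 18.0
```

Libraries like Spark MLlib or TensorFlow generalize exactly this pattern -- fit parameters to historical data, then apply them to unseen inputs -- to far larger models and datasets.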

As has been argued before, the explosion in applications of, and hype around, ML is not so much due to progress in algorithms, but rather to the fact that by now we have accumulated enough data and processing power to make ML viable in many cases.

This space is really exploding. It includes everything from ML libraries such as Spark MLlib, Caffe2, TensorFlow, and Deeplearning4j, which people can use to build their own ML algorithms and applications from scratch, to embedded analytics frameworks such as Salesforce Einstein, SAP HANA, or GoodData, which offer such capabilities within their own environments, to names such as Amazon, Facebook, Uber, and YouTube as iconic examples of applications of, and sometimes contributions to, this field.

1. The machine learning feedback loop

The pace of change is catalyzed and accelerated at large by data itself, in a self-fulfilling prophecy of sorts: data-driven product -> more data -> better insights -> more profit -> more investment -> better product -> more data. So while some are still struggling to deal with basic issues related to data collection and storage, governance, security, organizational culture, and skillset, others are more concerned with the higher end of the big data hierarchy of needs.

The above paragraph is a verbatim copy of what we noted last year, and if anything, what it describes has become even more pronounced, to the point of The Economist calling for "a new approach to antitrust rules" for the data economy. Now, why would The Economist do this? Isn't data supposed to fuel innovation and all sorts of wonderful new applications?

Absolutely. Innovation and data-driven automation powered by advances in ML and AI are game-changing. We have even begun to see traces of automating automation in 2017, for example ML frameworks that build ML models. The problem, however, is that this data-driven feedback loop also leads to new monopolies that are left unchecked.

There is a staggering concentration of data, expertise, and infrastructure in the hands of very few players, while lack of awareness and action means the gap is likely to keep widening. And those few players have clear agendas that include one thing: themselves. So to put blind faith in data-driven innovation and automation means to be up for a rude awakening.

"I have not met a single CEO, from Deutsche Bank to JP Morgan, who said to me: 'ok, this will increase our productivity by a huge amount, but it's going to have social impact -- wait, let's think about it'. The most important thing right now is how to move mankind to a higher ground. If people don't wake up, they'll have to scramble up -- that's my 2 cents." -- Chetan Dube, IPSoft CEO
"We're talking about machines displacing people, machines changing the ways in which people work. Who owns the machines? Who should own the machines? Perhaps what we need to think about is the way in which the workers who are working with the machines are part owners of the machines." -- Laura Tyson, former Chair of the US President's Council of Economic Advisers