Boiling frogs. That's what we all are really. We are staying put in our pot, while the temperature is rising. Not dramatically, but steadily. Little by little, not jumping wildly so that we might be alarmed, but unmistakably rising. This well-known metaphor can be applied to a number of things, but here the occasion is the traditional end of year review, new year predictions extravaganza.
At first, sunk deep as we are in the constant flow of new developments, only a couple of things came to mind as the most striking about 2018. Upon deeper reflection, however, it turns out 2018 has been quite a year, having set the groundwork for years to come. Here's the proof.
From big data to flexible, real-time data
Or just data, really. The "big data" moniker had its time, and its purpose. It was coined at a time when data volume, variety, velocity, and veracity were exploding. It helped capture and convey the significance of these properties of data at scale. It served as a catch-all buzzword for what was then a new breed of solutions for storing and processing data that broke away from the stronghold of relational databases.
By now, NoSQL, Hadoop, and cloud databases and storage are commonplace. The tradeoffs one has to make when designing and operating distributed data systems, neatly captured in the CAP theorem and the BASE consistency model, are increasingly well understood among the people who work with such systems.
By now, it's a given: Data from all kinds of sources is generated rapidly, and has to be stored and processed at scale, on premise and in the cloud, and, increasingly, in real-time. It's important to do this for a number of reasons, and there are many options. So, what's the point of even using "big data" anymore? Let's declare this cause won and just move on.
An empirical rule for data systems is that they need 10 years to reach maturity. NoSQL champions such as Apache Cassandra and MongoDB are hitting the 10-year mark; Hadoop is well past that point. Many of the features such solutions originally lacked, such as SQL and transactional support, are now there. Vendors and mergers have been spawned. Protocols have been adopted by incumbents and imitators. Communities have grown.
As the realities of the underlying technology have changed, the architectures and the economics are changing, and the bar is moving for everyone. The flexibility required to operate in multi-cloud and hybrid (on premise and cloud) environments, and the ability to work with data in real-time are becoming key.
Database and Hadoop vendors are adding options for their solutions to operate seamlessly across many environments. Cloud vendors are also moving in this direction, adding the ability to run their solutions on premise. Kubernetes promises to become the de facto operating system for data solutions in all environments. And streaming data frameworks promise to become the de facto gateways for data.
Machine Learning October Fest, AI Galore
One of the most controversial mantras of the big data era has been the prompt to store everything now and figure it out later. In a world where storage is expensive, storing data needs to be meticulously designed upfront, and changes are a pain, this does not make sense. That's not necessarily the world we live in today, but what may mark the decisive blow for this approach is machine learning.
It's practically impossible to miss the machine learning buzz and success stories out there. Machine learning is increasingly being used to power everything from retail and self-driving cars to sports and entertainment. One thing all these machine learning applications have in common is they need troves of data to train the models used to power them. Those old invoices, for example? They may come in handy if you want to train an accounting model.
The other thing you need, obviously, is a machine learning library to help build those models. This is why there are so many frameworks around these days, and choosing the right one for your needs is not easy. Heavyweights such as Facebook can afford to simply build their own. Facebook's PyTorch 1.0, which consolidates previous work on Caffe2, was released in preview in October, but it's far from being the only one.
MLflow was released by the creators of Apache Spark with an emphasis on distribution, and open source fast.ai came out of stealth hoping to democratize machine learning. AWS announced updates to SageMaker, and Google improved its own offering with AutoML, AI Hub and Kubeflow. Neuton came out of nowhere claiming to be faster, more compact, and less demanding than anything the AWSs, Googles, and Facebooks of the world have.
Important as they may be, these frameworks are not what machine learning is all about. Besides having the right expertise and data to train the models, the right infrastructure and deployment process needs to be in place. Adding humans in the loop is one strategy that can be used to integrate machine learning in organizations. Picking the right programming language for your requirements is key. But don't forget: Machine learning does not equal AI, and it takes more than data and code to get there.
Software 2.0, Compute 2.0
The effect of machine learning is profound, changing the paradigm in everything including software itself. It's official: We are entering the Software 2.0 era. Even though the majority of the software we use today is deterministic in the old-fashioned way, that may be about to change. Software is becoming cloud-native and data-driven, and is itself being automated.
Software as we know it has fundamentally been a set of rules, or processes, encoded as algorithms. Of course, over time its complexity has been increasing. APIs enabled modular software development and integration, meaning isolated pieces of software could be combined and/or repurposed. This increased the value of software, but at the cost of also increasing complexity, as it made tracing dependencies and interactions nontrivial.
But what happens when we deploy software based on machine learning approaches is different. Rather than encoding a set of rules, we train models on datasets, and release them into the wild. When situations occur that are not sufficiently represented in the training data, results can be unpredictable. Models will have to be re-trained and validated, and software engineering and operations need to evolve to deal with this new reality.
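To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and chosen only for illustration: the fraud-flagging scenario, the function names, the hard-coded limit of 1000, and the five training examples. The "model" is nothing more than a single learned threshold, which is far simpler than anything used in practice.

```python
# Software 1.0: a person writes the rule, and it stays put until someone edits it.
def is_suspicious_rule(amount):
    return amount > 1000  # hypothetical hard-coded limit


# Software 2.0: the rule is derived from labeled examples.
def train_threshold(examples):
    """Return the threshold that misclassifies the fewest (amount, label) pairs."""
    candidates = sorted(amount for amount, _ in examples)
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in candidates:
        errors = sum((amount > threshold) != label for amount, label in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Hypothetical training data: (transaction amount, was it fraudulent?)
training_data = [(20, False), (75, False), (300, False), (900, True), (1500, True)]
learned_threshold = train_threshold(training_data)


def is_suspicious_learned(amount):
    return amount > learned_threshold


# No training example falls between 300 and 900, so the answer for 600
# depends on where the training procedure happened to put the boundary.
rule_says = is_suspicious_rule(600)
model_says = is_suspicious_learned(600)
```

With this data the learned boundary lands at 300, so the model flags a 600 transaction that the hand-written rule lets through. Neither answer is obviously right, and changing the training data changes the behavior without anyone touching the code, which is why retraining and validation become part of operations.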
Machine learning is also shaping the evolution of hardware. For a long time, hardware architecture has been more or less fixed, with the CPU as its focal point. That's not the case anymore. Machine learning workloads favor specialized chips, which we usually refer to as AI chips. Some are already calling this Compute 2.0. GPUs are the most common example of a specialized chip, but they are not the only game in town.
Intel is working on getting FPGAs in shape to become a viable option for machine learning. Google is putting its weight behind its custom-made TPU chips. AWS is updating its cloud and releasing a custom chip of its own called AWS Inferentia. And there is a slew of startups working on new AI chips, with the most high-profile among those, Graphcore, having just reached unicorn status and released its chips to select partners.
Regulation, governance, licensing
What organizations do with their data is no longer something that only concerns a bunch of geeks. Data has the power to meddle in elections, grant or deny access to finance and healthcare, make or break reputations and fortunes, and make the difference for companies and individuals. It stands to reason that some sort of regulation is needed for something that has become this central for society at large.
The EU is leading the way with GDPR, which came into effect in 2018. GDPR is, in effect, a global regulation, as it concerns anyone active in the EU, or having interactions with EU citizens. As the first regulation in this domain with such far-reaching consequences, GDPR has been met with fear, uncertainty and doubt. By empowering individuals to take control of their data, GDPR forces organizations to get their data governance right.
Organizations need to be able to answer questions such as where their data comes from, how it is used, and whether users are aware of and have consented to their data being collected and processed, and for what purpose. To do this, they need to have the right processes and metadata in place. Data lineage and access rights and policies are part of what we refer to under the umbrella term data governance: Knowing where data comes from, what it's used for, when, why, and by whom.
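What "the right metadata" might look like can be sketched in a few lines of Python. This is an illustrative assumption, not a description of any particular governance product or of what GDPR prescribes: the LineageRecord class, its fields, and the dataset, user, and purpose names are all made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """Hypothetical governance metadata attached to one dataset."""
    dataset: str                 # what the data is
    source: str                  # where it came from
    consented_purposes: set      # what users agreed it may be used for
    access_log: list = field(default_factory=list)

    def request_access(self, user, purpose):
        """Allow access only for a consented purpose, and log every request."""
        allowed = purpose in self.consented_purposes
        self.access_log.append({
            "who": user,
            "why": purpose,
            "when": datetime.now(timezone.utc).isoformat(),
            "allowed": allowed,
        })
        return allowed


record = LineageRecord(
    dataset="customer_emails",
    source="signup_form",
    consented_purposes={"billing"},
)
billing_ok = record.request_access("analyst_1", "billing")
marketing_ok = record.request_access("analyst_2", "marketing")
```

The billing request is allowed and the marketing request is refused, and both end up in the access log. The record therefore answers the questions above directly: where the data comes from (source), what it may be used for (consented_purposes), and when, why, and by whom it was requested (access_log).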
Counter-intuitive as it may seem, regulation such as GDPR may spur innovation. For one, it pushes vendors to respond to market demand by making data governance a first-class citizen and adding features to support it. Furthermore, in machine learning, it boosts the emphasis on explainability. With regulatory frameworks in place for domains such as finance or healthcare, transparent, explainable decisions become a must-have.
We are just beginning to see the effect of regulation on data-related technology and business. In 2019, PSD2, another EU regulation, which forces financial institutions to open their data to third parties, will become effective. This is going to have cascading effects on the market. And let's not forget the infamous EU Copyright Directive, which is about to enforce measures such as upload filters and a link tax.
Last but not least, we are seeing data platforms take note of the reality that is the cloud, and cloud poaching, or "strip mining": The encroachment on open source/open core platforms by cloud providers. Besides adapting their offerings to run in multiple environments, as managed services, or iPaaS, data vendors are reacting by adapting their licenses, too. Confluent and Timescale have done it, and we expect to see more of this.
The Years of the Graph
Calling 2018 the year of the graph was our opener last year. You can call it bias, or foresight, since we have a special relationship with graph databases. Either way, it turns out we were not the only ones. Graph databases have consistently been the leading category in terms of growth and interest as captured by the DB-Engines index since 2014.
Much of that has to do with AWS and Microsoft releasing graph database products, with AWS Neptune going GA in May 2018 and joining Azure Cosmos DB in this lively market that has more than 30 vendors in total. Obviously, they each have their strengths and weaknesses, and not all of them are suitable for all use cases.
Picking up from industry pundits, Gartner included Knowledge Graphs in its hype cycle in August 2018. Whether this makes sense for a technology that is at least 20 years old and what this all means is a different discussion, but the fact remains: Graph is here to stay. Graph really is going mainstream, with analysts such as fellow ZDNet contributor/Ovum analyst Tony Baer giving it a shoutout.
We are seeing the likes of Airbnb, Amazon, eBay, Google, LinkedIn, Microsoft, Uber and Zalando building graphs, and improving their services and their bottom line as a result. We are seeing innovation in this domain, with machine learning being applied to enrich and complement traditional techniques at web scale. We are seeing new standardization efforts under way, aiming to add to existing standards under the auspices of the W3C.
We are seeing vendors upping their game and their financing, and graphs being investigated as a fundamental metaphor upon which software and hardware for a new era can be built. We will be seeing more of this in 2019, and we'll be here to keep track.