It's midyear, which seems to be the time that VCs who put skin in the game give us their assessments of the data and analytics market landscape. Picking up where Big on Data bro George Anadiotis left off with his voluminous coverage on the State of AI, we're directing our focus toward a deep dive on the current market landscape led by FirstMark Capital partner Matt Turck. They've taken a very comprehensive look, and if you want a map of it, you can click on this enlargeable link.
Turck, who is also a prominent tech evangelist in the New York tech community, has led several of the longest-running series of monthly data and smart device technologies meetups in the city. As part of his day job, his team has been issuing these landscape reports since 2012. The title of this year's report is one of the changes -- they are no longer calling this "Big Data" as that term now seems so 2014. We now think of it as just "data," because analyzing nonrelational data is no longer exceptional; the economics of cloud computing have made big compute and big storage affordable; and also because, by the way, there's been an explosion of IoT data and use cases that are increasingly pervading our lives.
And while we're on the topic of technology that's becoming pervasive, there's the emergence of AI. It has spread from online recommendation systems to personal assistants, and now to predictive and prescriptive analytics, and has become a central character in the FirstMark market landscapes. So, not surprisingly, given the broadened scope, this year's report was split into two separate posts (here and here). Part one starts with an overview of sociopolitical and regulatory trends, because data and analytics are impacting people's lives. Part two cuts to the chase, diving into the market landscape.
The social and political landscape could be summarized by the theme of the loss of innocence. Maybe we're jaded, but the abuses committed by the likes of Cambridge Analytica back in the 2016 election thrust the issue outside the ivory tower. Fast forward to 2019 and it appears that tech's free ride from public regulation may be coming to an end. There's GDPR and new privacy laws from the state of California for starters. To get his company off the hook, Mark Zuckerberg is even requesting that Facebook be regulated. Yet, for all the new mandates and concern over privacy, the report notes that we still all love our smart devices, and even in the wake of bad press, Facebook continues to add subscribers.
Part two starts with the elephant in the room. Countering those trolling over Hadoop's death, the report takes a more nuanced view. Unlike five years ago, Hadoop is no longer the sole path to analyzing big data; there are cloud offerings ranging from the complete platform to specialized point services such as Spark, streaming, data transformation, and AI. Furthermore, in the cloud, object storage, not HDFS, is becoming the de facto data lake. But even with the fading of MapR and the merger of Hortonworks and Cloudera, there's still a healthy installed base of at least a couple thousand blue chip customers -- the vast majority on-premises -- that are each paying six or seven figures annually in support (in the open source world, that's the new maintenance). Those workloads are not moving to the cloud overnight.
Nonetheless, the move to the cloud is unmistakable. FirstMark's report aligns with a prediction we made while at Ovum that by 2019, most new big data workloads would start in the cloud. FirstMark expects that, but with a twist. As enterprises consider the cloud for new strategic workloads, there is concern over cloud vendor lock-in. Hybrid has entered the dialogue. It's given infrastructure players like IBM, which missed out on the cloud on the first go-round, along with database and data warehousing household names, maybe some hope for a second wind. Not lost in the conversation is Kubernetes, the sleeper Google open source project that makes hybrid clouds thinkable. That, of course, drove IBM's $34 billion acquisition of Red Hat, and it's very much behind Google's embryonic Anthos offering, repackaging its Kubernetes services so that, conceivably, you could run a Google Cloud native workload (sans the Google hardware) in, dare we ask, AWS?
But we'll take a time out here. Kubernetes is still a diamond in the rough: best practices for security, load balancing, service configurations, and so on remain works in progress. Nonetheless, FirstMark has another spin. They speculate that Kubernetes could spark a move away from cloud-based ML services as data scientists (and we presume, data engineers) want to exert more control over their environments. Our take is that ML is ravenous for data, and so the key enabler, or hurdle, depending on your viewpoint, will be the enterprise's ability and willingness to store or process all that data on premises, capital costs and all. Our view on Kubernetes is that it will prove too complex for all but the most sophisticated enterprise IT organizations, although the mission of third parties like IBM or Pivotal would be to bury all that complexity inside a black box. Have at it.
The report also looks at the state of serverless computing for complex analytics and ML workloads, and similarly concludes it's still too early for prime time. Serverless grew popular with agile development of apps with short-lived processes, or for databases with volatile traffic spikes. The development simplicity of serverless, where you let the system autoscale the amount of compute, has appeal for developers practicing agile, but the long-running processes of machine learning will make serverless hit the wall, as a source linked in the FirstMark report pointed out.
Another area of growing pains will be data management and governance, an issue that is compounded by the spate of new and proposed data privacy laws. To database and BI veterans, these issues are nothing new. When you have so much data, how do you find what to look for? No wonder that data catalogs are popping up right and left -- they are furnished by third parties like Alation and Waterline Data, and built into data platforms like Cloudera's. For instance, Collibra, which is partly backed by Google Ventures, just raised $100 million, but at the same time, that hasn't stopped the Google Cloud folks from unveiling their own data catalog that overlaps with Collibra's turf. But not all data catalogs are created equal; some are highly collaborative tools that employ machine learning to crawl and build queries for accessing the data, while others are glorified data dictionaries.
Data lineage is yet another piece of technology that the FirstMark report regards as emerging -- it's supposed to tell you where the data came from and provide an audit trail as to how it's been used, and preferably, by whom. While data lineage should provide that single source of the truth, the challenge is that analytics tools, data catalogs, and data platforms are each recording their own views of data lineage, providing the latest example of having too much of a good thing.
A survey of the data and analytics landscape in 2019 would not be complete without touching on the latest round of consolidation in the BI space, with Google buying Looker, Salesforce swallowing Tableau, and at a more modest scale, Alteryx buying ClearStory Data, and Logi Analytics buying Zoomdata. There are parallels with the BI consolidation wave of a decade ago that saw Business Objects, Cognos, and Hyperion snapped up by SAP, IBM, and Oracle respectively. FirstMark speculates that this story might not be over yet, asking whether Amazon might mull an acquisition for bulking up QuickSight. Our take is that the next wave of innovation in BI will be from embedding machine learning that acts as a digital assistant to the business analyst in helping select data, cleanse it, and tell the story. We'll likely see much of this innovation surface in existing tools, such as Tableau's Ask Data natural language query, but this could also be the impetus for startups that engineer themselves around natural language and digital assistance, rather than retrofitting it.
As BI democratized analytics, FirstMark is looking at machine learning as the next analytics segment ripe for market development. It segments the space into a couple of buckets. The first, AutoML, which automates much of the grunt work in developing and productionalizing ML models, is being hotly contested by the cloud usual suspects and third parties such as DataRobot. There is a second bucket, primarily the domain of third parties such as Dataiku, RapidMiner, and H2O, that add a heavy collaboration component. We expect that FirstMark's 2020 report will chart the emergence of how these tools -- or others that have yet to emerge from stealth -- explain AI models.
FirstMark also sees a hotbed of AI activity in horizontal services such as computer vision, natural language processing, and voice to text (and vice versa) that are commercializing the deep learning end of the pool. But there's a caveat here, which is that horizontal services, which knock on the door of what Turck terms artificial general intelligence (AI getting closer to human capabilities), are for now relatively shallow (they perform tasks like text translation, but have limited abilities to actually think). So the market is at a much earlier stage of development. There are general services like Amazon Rekognition, and the beginnings of vertical services such as Google Contact Center AI. FirstMark notes significant improvements in baseline capabilities such as NLP.
We've always believed that ultimately, the biggest payoff from AI will be through embedding it into business applications. That is much of the impetus behind SAP's Leonardo initiative, which is not a product or set of products per se; one of its roles is to serve as a lab where SAP identifies opportunities for productization from its client engagements. Maybe it would be too dramatic to call this enterprise AI's final frontier, but FirstMark views this as being 3-4 years into what it implies will be a longer journey.