In what is becoming a valued yearly tradition, we caught up with Benaich and Hogarth to discuss topics that stood out for us in the report.
MLOps, machine learning in production
First off, there is overlap with the topics that Turck covered and Baer reported on, and for good reason. As Baer pointed out, the wave of IPOs and proliferation of unicorns is turning this market into its own sector, and that is impossible to ignore. For an overview of market trends, we encourage readers to have a look at Baer's coverage.
That said, our feeling is that the State of AI 2021 report covers more topics: the latest developments in AI research, industry, talent, and politics, while it also ventures into predictions. In fact, Benaich and Hogarth keep track of their predictions, and they are doing pretty well. For example, in 2020 they correctly predicted the obstacles in Arm's acquisition by Nvidia, and AI and biotech-related IPOs.
As Benaich noted, by virtue of being investors in machine learning companies at different, mostly early, stages, they have access to major AI labs, academic groups, up-and-coming startups, bigger companies, as well as people who work in government. So they try to synthesize all those different angles in a public good product that is open source and aims to holistically inform all stakeholders.
With the increasing power and availability of machine learning models, gains from model improvements have become marginal. In this context, the machine learning community is growing increasingly aware of the importance of better data practices, and more generally better MLOps, to build reliable machine learning products.
Benaich noted that they thought it important to highlight the renewed attention, in more industry-minded academic work, to data quality and the various issues that can reside within data and ultimately propagate to ML models, determining whether models predict well or not:
"A lot of academia was focused on competing on static benchmarks, showing model performance offline on these benchmarks, and then moving into industry. So generation one was a lot about -- let's just get a model that works for a specific problem, and then deal with any issues or any changes whenever they happen.
There's been a huge amount of money and interest and engineering time that's been thrown into MLOps. And this is motivated by the idea that machine learning is not like a static software product that you can write once and forget about. You have to constantly update it, and it's not just [about] updating the model.
You have to look at how your classes might drift over time, or if you're still using the right benchmarks to determine whether a new model that you trained is going to work in production or not. You may see issues like choosing different random seeds for your model and then seeing completely different behavior on real world data, or even that data that you've been using is garbage."
That sounds intuitively right, and probably resonates with anyone who has worked with machine learning models and data pipelines. Now people are giving names to those phenomena, such as distribution shifts (mismatches in dataset versions) and data cascades (issues with data influencing downstream operations). As naming things is the first step toward analyzing them and taking them more seriously, that's a good thing.
Data-centric AI: good data, bad data, distribution shifts and data cascades
A distribution shift happens when data at test or deployment time is different from the training data. In production, this often takes the form of concept drift, where the test data gradually changes over time.
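To make the definition above concrete, here is a minimal sketch of one common way to flag a shift in a single numeric feature: compare the training sample with each later production window using the two-sample Kolmogorov-Smirnov statistic. This is an illustration and not something taken from the report. The synthetic data, the window means, and the `ALERT_THRESHOLD` value are all assumptions chosen for the example.

```python
import bisect
import random


def ks_statistic(reference, live):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref_sorted = sorted(reference)
    live_sorted = sorted(live)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in ref_sorted + live_sorted:
        ref_cdf = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        live_cdf = bisect.bisect_right(live_sorted, x) / len(live_sorted)
        gap = max(gap, abs(ref_cdf - live_cdf))
    return gap


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for a feature as seen at training time.
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Synthetic production windows: each later window sits further from training.
window_means = [0.0, 0.5, 1.0]
windows = [[rng.gauss(m, 1.0) for _ in range(2000)] for m in window_means]
scores = [ks_statistic(training, w) for w in windows]

ALERT_THRESHOLD = 0.1  # hypothetical cut-off; a real one is tuned per feature

for mean, score in zip(window_means, scores):
    status = "shift suspected" if score > ALERT_THRESHOLD else "ok"
    print(f"window mean {mean:.1f}: KS = {score:.3f} ({status})")
```

A check like this only looks at one input feature at a time. It says nothing about whether the relationship between inputs and labels has changed, which is why benchmarks built around realistic shifts matter.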
As machine learning is increasingly used in real-world applications, the need for a solid understanding of distributional shifts becomes paramount. This begins with designing challenging benchmarks, Benaich and Hogarth state in the report.
Benaich believes that it's hard to pin down specific distribution shift examples in the real world, because organizations would probably not want the world to know they were affected by such issues. But one of the areas this could affect would be pricing on various retail websites.
Frequently, there is a machine-learning powered dynamic pricing engine in the back-end, and its output depends on how much information they have about you, noted Benaich. So distribution shift may mean you end up getting a very, very different price for a particular product that you're looking at, depending on which data is being utilized. Interestingly, this exact practice is targeted by China's market regulator.
Benaich emphasized the fact that at least two major new datasets were released that aim to deal with distribution shifts: WILDS, developed by a number of American and Japanese universities and companies, and Shifts, developed by Yandex.
Having more industry-oriented datasets used in academia means that academic projects are ultimately more likely to succeed in production environments, because there is less distribution shift when you move from industry to academia and vice versa, noted Benaich.
Google researchers define data cascades as "compounding events causing negative, downstream effects from data issues". Supported by a survey of 53 practitioners from the US, India, East and West African countries, they warn that current practices undervalue data quality and result in data cascades.
It's a fairly intuitive idea -- the domino effect. If you have a problem at the start, it's likely to carry through by the time you get to the last domino. What's notable is that the overwhelming majority of data scientists report having experienced one of these issues.
As for why these issues happened, the causes were mostly a lack of recognition of the importance of data within the context of AI work, a lack of training in the domain, or a lack of access to enough specialized data for the particular problem being solved.
What that points to is that in the world of machine learning there is more nuance than "good data" and "bad data". As datasets are multi-faceted, with different subsets used in different contexts, and different versions evolving, context is key in defining data quality. The insights from machine learning in production are prompting a shift of focus from model-centric to data-centric AI.
Data-centric AI is a notion developed at Hazy Research, Chris Ré's research group at Stanford. As noted, the importance of data is not new: there are well-established mathematical, algorithmic, and systems techniques for working with data, which have been developed over decades.
What is new is how to build on and re-examine these techniques in light of modern AI models and methods. Just a few years ago, we did not have long-lived AI systems or the current breed of powerful deep models.
Join us next week as we continue the conversation with Benaich and Hogarth, to cover topics such as language models, AI commercialization, and AI-powered biotechnology.