After talking to machine learning and infrastructure engineers at major Internet companies across the US, Europe, and China, two groups of companies emerged. One group has invested hundreds of millions of dollars into infrastructure to allow real-time machine learning and has already seen returns on their investments. The other group still wonders if there's value in real-time machine learning.
The fact that reporting on return on investment is a good way to get attention does not seem to be lost on Chip Huyen. Huyen is a writer and computer scientist who works on infrastructure for real-time machine learning. She is the one who wrote the above introduction to her findings on real-time machine learning in order to crystallize the growing experience she and her colleagues are accumulating.
Huyen has worked with the likes of Netflix, Nvidia, Primer and Snorkel AI before founding her own (stealth) startup. She is a graduate of Stanford, where she also teaches Machine Learning Systems Design, and she was a LinkedIn Top Voice in 2019 and 2020.
In other words, Huyen is very well-positioned to report on what fellow ZDNet contributor Tony Baer described as "a long-elusive goal for operational systems and analytics" in his 2022 data outlook: unifying data in motion (streaming) with data at rest (data sitting in a database or data lake). The ultimate goal in doing that would be to achieve the kind of ROI Huyen reports on.
Machine learning predictions and system updates in real-time
Huyen's analysis refers to real-time machine learning models and systems on two levels. Level 1 is online predictions: ML systems that make predictions in real-time, for which she defines real-time to be on the order of milliseconds to seconds. Level 2 is continual learning: ML systems that incorporate new data and update in real-time, for which she defines real-time to be on the order of minutes.
The gist of why Level 1 systems are important is that, as Huyen puts it, "no matter how great your ML models are, if they take just milliseconds too long to make predictions, users are going to click on something else". As she elaborates, a "non-solution" for fast predictions is making them in batch offline, storing them, and pulling them when needed.
This can work when the input space is finite -- you know exactly how many possible inputs to make predictions for. One example is when you need to generate movie recommendations for your users -- you know exactly how many users there are. So you predict a set of recommendations for each user periodically, such as every few hours.
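The batch approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog, the user IDs and the `score` function are hypothetical stand-ins for a real catalog and a trained model, not anything taken from Huyen's analysis.

```python
from typing import Dict, List

# Hypothetical catalog; a real system would read this from a database.
CATALOG = ["drama_a", "comedy_b", "thriller_c", "documentary_d"]


def score(user_id: str, item: str) -> int:
    """Stand-in for a trained model: a deterministic toy score in [0, 100)."""
    return (sum(map(ord, user_id)) * 31 + sum(map(ord, item)) * 17) % 100


def batch_predict(user_ids: List[str], top_k: int = 2) -> Dict[str, List[str]]:
    """Offline job, run periodically: precompute top-k items for every known user."""
    store: Dict[str, List[str]] = {}
    for user_id in user_ids:
        ranked = sorted(CATALOG, key=lambda item: score(user_id, item), reverse=True)
        store[user_id] = ranked[:top_k]
    return store


def serve(store: Dict[str, List[str]], user_id: str) -> List[str]:
    """Request time: a plain lookup, with no model call at all."""
    return store.get(user_id, [])


store = batch_predict(["alice", "bob"])
print(serve(store, "alice"))        # two precomputed items for a known user
print(serve(store, "new_visitor"))  # empty list: nothing was precomputed
```

The lookup is fast because the expensive work happened earlier, but a user who was not part of the last batch run gets nothing until the next one.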
To make their user input space finite, many applications make their users choose from categories instead of entering open-ended queries, Huyen notes. She then proceeds to show examples of how this approach can produce results that can hurt user experience, from the likes of TripAdvisor and Netflix.
Although such results are tightly coupled with user engagement and retention, they do not amount to a catastrophic failure. Bad results could be catastrophic in other domains, such as autonomous vehicles or fraud detection. Switching from batch predictions to online predictions enables the use of dynamic features to make more relevant predictions.
ML systems need to have two components to be able to do that, Huyen notes. They need fast inference, i.e. models that can make predictions in the order of milliseconds. And they also need real-time pipelines, i.e. pipelines that can process data, input it into models, and return a prediction in real-time.
To achieve faster inference, Huyen goes on to add, models can be made faster, they can be made smaller, or hardware can be made faster. The focus on inference, TinyML, and AI chips that we've been covering in this column is perfectly aligned to this, and naturally, these approaches are not mutually exclusive either.
Huyen also embarked on an analysis of streaming fundamentals and frameworks, something that has also seen wide coverage in this column from early on. Many companies are switching from batch processing to stream processing, from request-driven architecture to event-driven architecture, and this is tied to the popularity of frameworks such as Apache Kafka and Apache Flink. This change is still slow in the US but much faster in China, Huyen notes.
However, there are many reasons why streaming isn't more popular. Companies don't see the benefits; there's a mental shift and high initial investment in infrastructure required, the processing cost is higher, and these frameworks are not Python-native, despite efforts to bridge the gap via Apache Beam.
Huyen prefers the term "continual learning" instead of "online training" or "online learning" for machine learning systems based on models that get updated in real-time. When people hear online training or online learning, they think that a model must learn from each incoming data point.
Very few companies actually do this because this method suffers from catastrophic forgetting -- neural networks abruptly forget previously learned information upon learning new information. Plus, it can be more expensive to run a learning step on only one data point than on a batch.
Huyen did the above analysis in December 2020. In January 2022, she revisited the topic. While her take is that we're still a few years away from mainstream adoption of continual learning, she sees significant investments from companies to move towards online inference. She sketches evolutionary progress towards online prediction.
Towards online prediction
Stage 1 is batch prediction. At this stage, all predictions are pre-computed in batch, generated at a certain interval, e.g. every 4 hours or every day. Typical use cases for batch prediction are collaborative filtering and content-based recommendations. Examples of batch prediction in production are DoorDash's restaurant recommendations, Reddit's subreddit recommendations, and Netflix's recommendations circa 2021.
Huyen notes that Netflix is currently moving its machine learning predictions online. Part of the reason, she goes on to add, is that for users who are new or aren't logged in, there are no pre-computed recommendations personalized to them. By the time the next batch of recommendations is generated, these visitors might have already left without making a purchase because they didn't find anything relevant to them.
Huyen attributes the predominance of batch prediction to legacy batch systems such as Hadoop. Those systems enabled periodic processing of large amounts of data very efficiently, so when companies started with machine learning, they leveraged their existing batch systems to make predictions.
Stage 2 is online prediction with batch features. Features in machine learning are individual measurable properties or characteristics of a phenomenon used to build a model. Batch features are features extracted from historical data, often with batch processing, also called static features or historical features.
Instead of generating predictions before requests arrive, organizations at this stage generate predictions after requests arrive. They collect users' activities on their applications in real-time. However, these events are only used to look up pre-computed embeddings to generate session embeddings.
Here Huyen refers to embeddings in machine learning. Embeddings can be thought of as a way to represent real-world information, such as items or users, as vectors, which is the form of data that machine learning models work with.
The important thing to remember about Stage 2 systems is that they use incoming data from user actions to look up information in pre-computed embeddings. The machine learning models themselves are not updated; it's just that they produce results in real-time.
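A minimal sketch of that lookup pattern follows. The embedding table, the item names, and the choices of averaging item vectors into a session embedding and ranking by cosine similarity are all assumptions made for illustration; Huyen's analysis does not prescribe a specific method.

```python
import math
from typing import Dict, List, Optional

# Hypothetical item embeddings, assumed to be pre-computed by an offline job.
ITEM_EMBEDDINGS: Dict[str, List[float]] = {
    "tent": [1.0, 0.0],
    "sleeping_bag": [0.9, 0.1],
    "laptop": [0.0, 1.0],
    "keyboard": [0.1, 0.9],
}


def session_embedding(viewed: List[str]) -> Optional[List[float]]:
    """Average the pre-computed embeddings of the items seen in this session."""
    vectors = [ITEM_EMBEDDINGS[i] for i in viewed if i in ITEM_EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(viewed: List[str], top_k: int = 1) -> List[str]:
    """Rank unseen items by similarity to the session embedding."""
    emb = session_embedding(viewed)
    if emb is None:
        return []
    candidates = [i for i in ITEM_EMBEDDINGS if i not in viewed]
    ranked = sorted(candidates, key=lambda i: cosine(emb, ITEM_EMBEDDINGS[i]), reverse=True)
    return ranked[:top_k]


print(recommend(["tent"]))    # ['sleeping_bag']
print(recommend(["laptop"]))  # ['keyboard']
```

Nothing is retrained here: the live events only select which stored vectors to combine, which is why this stage can work for visitors with no history.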
The goal of session-based predictions, as per Huyen, is to increase conversion (e.g. converting first-time visitors to new users, or improving click-through rates) and retention. The list of companies that are already doing online inference or have online inference on their 2022 roadmaps is growing, including Netflix, YouTube, Roblox, Coveo, etc.
Huyen notes that every company she has spoken with that has moved to online inference told her they are very happy with their metrics wins. She expects that in the next two years, most recommender systems will be session-based: every click, every view, every transaction will be used to generate fresh, relevant recommendations in near real-time.
Organizations will need to update their models from batch prediction to session-based predictions for this stage. This means that they might need to add new models. Organizations will also need to integrate session data into their prediction service. This can typically be done with streaming infrastructure, which consists of two components, Huyen writes.
The first part is a streaming transport, such as Kafka, AWS Kinesis or GCP Dataflow, to move streaming data (users' activities). The second part is a streaming computation engine, such as Flink SQL, KSQL, or Spark Streaming, to process streaming data.
Many people believe that online prediction is less efficient, both in terms of cost and performance, than batch prediction because processing predictions in batch is more efficient than processing predictions one by one. Huyen believes this is not necessarily true.
Part of the reason is that with online prediction, there's no need to generate predictions for users who are not visiting a site. If only 2% of total users log in daily, and predictions are generated for every user each day, the compute used to generate 98% of those predictions will be wasted. The challenges of this stage will be in inference latency, setting up the streaming infrastructure and having high-quality embeddings.
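The arithmetic behind that 98% figure is simple enough to check directly. The user counts below are hypothetical, chosen only to match the 2% example.

```python
def batch_waste(total_users: int, daily_active_users: int) -> float:
    """Fraction of daily batch predictions computed for users who never show up."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    if not 0 <= daily_active_users <= total_users:
        raise ValueError("daily_active_users must be between 0 and total_users")
    return (total_users - daily_active_users) / total_users


# Hypothetical numbers: 1,000,000 users, of whom 2% (20,000) log in on a given day.
waste = batch_waste(total_users=1_000_000, daily_active_users=20_000)
print(f"{waste:.0%} of batch predictions go unused")
```

This only counts predictions, not cost per prediction, so it says nothing about the other side of the trade-off, such as the latency and infrastructure challenges of serving online.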
Online prediction with complex streaming and batch features
Stage 3 in Huyen's evolutionary scale is online prediction with complex streaming and batch features. Streaming features are features extracted from streaming data, often with stream processing, also called dynamic features or online features.
While companies at Stage 2 require some stream processing, companies at Stage 3 use a lot more streaming features. For example, after a user places an order on DoorDash, the company might need both batch features and streaming features to estimate the delivery time.
Batch features may include the mean preparation time of this restaurant in the past, while streaming features at this moment may include how many other orders they have and how many delivery people are available.
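To make the distinction concrete, here is a small sketch that combines both kinds of features for a delivery estimate. Everything in it is hypothetical: the field names, the 2-minute and 10-minute penalties, and the additive formula are a toy heuristic standing in for a trained model, not how DoorDash or any other company computes estimates.

```python
from dataclasses import dataclass


@dataclass
class BatchFeatures:
    mean_prep_minutes: float  # computed offline from the restaurant's order history


@dataclass
class StreamingFeatures:
    open_orders: int          # orders in the restaurant's queue right now
    available_couriers: int   # couriers free right now


def estimate_delivery_minutes(
    batch: BatchFeatures, stream: StreamingFeatures, travel_minutes: float
) -> float:
    """Toy heuristic (an assumption, not a real model)."""
    queue_delay = 2.0 * stream.open_orders  # assumed 2 minutes per queued order
    courier_wait = 0.0 if stream.available_couriers > 0 else 10.0  # assumed wait
    return batch.mean_prep_minutes + queue_delay + courier_wait + travel_minutes


quiet = estimate_delivery_minutes(BatchFeatures(15.0), StreamingFeatures(3, 2), 12.0)
busy = estimate_delivery_minutes(BatchFeatures(15.0), StreamingFeatures(8, 0), 12.0)
print(quiet, busy)  # 33.0 53.0
```

The batch feature is identical in both calls; only the streaming features move the estimate, which is the information a purely batch system would miss.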
In the case of session-based recommendation discussed in Stage 2, instead of just using item embeddings to create a session embedding, streaming features such as the amount of time the user has spent on the site or the number of purchases an item has had in the last 24 hours may be used.
Examples of companies at this stage include Stripe, Uber, and Faire, for use cases like fraud detection, credit scoring, driving and delivery time estimation, and recommendations.
The number of streaming features for each prediction can be in the hundreds, if not thousands. The feature extraction logic can require complex queries with joins and aggregations along different dimensions. Extracting these features requires efficient stream processing engines.
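A feature like "purchases in the last 24 hours" is a sliding-window aggregate. The sketch below keeps it in memory with a deque per item. The class and method names are hypothetical, and it assumes timestamps (in seconds) arrive in order for each item; a production system would normally delegate this to a stream processing engine.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque


class SlidingWindowCounter:
    """Counts events per item within a trailing time window."""

    def __init__(self, window_seconds: int = 24 * 3600) -> None:
        self.window = window_seconds
        self.events: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def record(self, item_id: str, timestamp: float) -> None:
        """Register one purchase event; timestamps are assumed to be in order."""
        self.events[item_id].append(timestamp)

    def count(self, item_id: str, now: float) -> int:
        """Evict events that fell out of the window, then return what remains."""
        queue = self.events[item_id]
        while queue and queue[0] <= now - self.window:
            queue.popleft()
        return len(queue)


counter = SlidingWindowCounter()
for ts in (0, 3600, 80000):
    counter.record("sku_1", ts)

print(counter.count("sku_1", now=86400))  # 2: the event at t=0 has expired
print(counter.count("sku_1", now=90000))  # 1: only the event at t=80000 remains
```

Multiply this by hundreds of features and many join keys, and the need for a dedicated engine becomes apparent.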
There are some important requirements in order to move machine learning workflows to this stage, as per Huyen. The first one is a mature streaming infrastructure with an efficient stream processing engine that can compute all the streaming features with acceptable latency. The second one is a feature store for managing materialized features and ensuring consistency of stream features during training and prediction.
The third one is a model store. A streaming feature, after being created, needs to be validated. To ensure that a new feature actually helps with your model's performance, you want to add it to a model, which effectively creates a new model, says Huyen. Ideally, a model store should help manage and evaluate models created with new streaming features, but model stores that also evaluate models don't exist yet, she notes.
Last but not least is a better development environment. Data scientists currently work off historical data even when they're creating streaming features, which makes it difficult to come up with and validate new streaming features.
What if we can give data scientists direct access to data streams so that they can quickly experiment and validate new stream features, Huyen asks. Instead of data scientists only having access to historical data, what if they can also access incoming streams of data from their notebooks?
That actually seems to be possible today, for example, with Flink and Kafka notebook integrations. Although we are not certain whether those meet what Huyen is envisioning, it's important to see the big picture here.
It's a complicated topic, and Huyen is laying out a path based on her experience with some of the most technologically advanced organizations. And we have not even touched upon Level 2 -- machine learning systems that incorporate new data and update in real-time.
However, to come full circle, if Huyen's experience is anything to go by, the gains may well justify the investment.