AI has a big data problem. Here's how to fix it

Supervised algorithms require lots of data, and often result in shaky predictions. Is it time for the next stage of AI?


Artificial intelligence has, quite literally, got a big data problem – and one that the COVID-19 crisis has now made impossible to ignore any longer. 

For businesses, governments, and individuals alike, the global pandemic has effectively redefined "normal" life; but while most of us have now adjusted to the change, the same cannot be said of AI systems, which base their predictions on what the past used to look like.

Speaking at the CogX 2020 conference, British mathematician David Barber said: "The deployment of AI systems is currently clunky. Typically, you go out there, collect your data set, label it, train the system and then deploy it. And that's it – you don't revisit the deployed system. But that's not good if the environment is changing."

SEE: Managing AI and ML in the enterprise 2020 (free PDF)    

Barber was referring to supervised machine learning, which he called today's "classical paradigm" in AI, and which consists of teaching algorithms by example. In a supervised model, an AI system is fed a large dataset that has been previously labeled by humans, and which is used to train the technology into recognizing patterns and making predictions.
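The supervised paradigm Barber describes can be sketched with a toy example. The lending features, labels, and the nearest-centroid "model" below are all illustrative stand-ins, not any real bank's system; the point is simply that the model only ever sees human-labeled examples, and its behavior is frozen once deployed.

```python
# A toy illustration of supervised learning: train once on human-labeled
# examples, then predict. Features and labels here are hypothetical.

def train_centroids(examples):
    """Compute one centroid per class from labeled (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is nearest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# (income in $k, credit score) -> lending decision, labeled by humans
labeled = [
    ((80, 720), "approve"), ((95, 750), "approve"),
    ((25, 540), "reject"),  ((30, 580), "reject"),
]
centroids = train_centroids(labeled)
print(predict(centroids, (85, 730)))  # approve
print(predict(centroids, (28, 550)))  # reject
```

When the environment shifts, as with COVID-19, the frozen centroids keep encoding pre-crisis patterns, and nothing in the pipeline revisits them.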

You could train an algorithm to automate the lending decision in a bank for example, based on individuals' incomes or credit scores. Cue COVID-19, along with a whole new set of banking patterns, and the AI system is likely to be at a loss to decide who gets the cash.

Similarly, a few months into the COVID-19 crisis, a US researcher pointed out that algorithms, despite all the training data they have been fed, wouldn't be all that helpful in understanding the nature of the outbreak or its spread across the globe.

Because of the lack of training data about past coronaviruses, the research explains, most of the predictions generated by AI tools were found to be unreliable, and results often failed to reflect the severity of the crisis.

Meanwhile, in healthtech, the makers of AI health tools struggled to update their algorithms due to a lack of relevant data about the virus, resulting in many "symptom finder" chatbots being a little off the mark.

With data from a pre-COVID environment not matching the real world anymore, supervised algorithms are running out of examples to base their predictions on. And to make matters worse, AI systems don't flag their uncertainties to their human operator. 

"The AI won't tell you when it actually isn't confident about the accuracy of its prediction and needs a human to come in," said Barber. "There are many uncertainties in these systems. So it is important that the AI can alert the human when it is not confident about its decision."

This is what Barber described as an "AI co-worker" situation, in which humans and machines interact to make sure that gaps aren't left unfilled. In fact, it is a method within artificial intelligence that is slowly emerging as a particularly efficient one.

Dubbed "active learning", it consists of establishing a teacher-learner relationship between AI systems and human operators. Instead of feeding the algorithm a huge labeled dataset, and letting it draw conclusions – often in a less-than-transparent way – active learning lets the AI system do the bulk of data labeling on its own, and crucially, ask questions when it has a doubt.

The process involves a small pool of human-labeled data, called the seed, which is used to train the algorithm. The AI system is then presented with a larger set of unlabeled data, which the algorithm annotates by itself, based on its training – before integrating the newly labeled data back into the seed.

When the tool isn't confident about a particular label, it can ask for help from a human operator in the form of a query. The choices made by human experts are then fed back into the system, to improve the overall learning process. 
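The loop described above — seed, self-labeling, querying a human on low-confidence items, retraining — can be sketched as follows. The classifier, the margin-based confidence measure, and names like `oracle` and `threshold` are illustrative assumptions, not from any particular active-learning library.

```python
# A minimal sketch of an active-learning round with a toy 1-D classifier.

def fit(seed):
    """'Train' by recording the mean feature value of each class."""
    means = {}
    for x, label in seed:
        means.setdefault(label, []).append(x)
    return {label: sum(v) / len(v) for label, v in means.items()}

def predict_with_confidence(model, x):
    """Predict nearest class mean; confidence = margin over the runner-up."""
    ranked = sorted(model, key=lambda label: abs(x - model[label]))
    best, second = ranked[0], ranked[1]
    margin = abs(x - model[second]) - abs(x - model[best])
    return best, margin

def active_learning_round(model, seed, unlabeled, oracle, threshold=2.0):
    """Self-label confident points; query the human oracle on the rest."""
    for x in unlabeled:
        label, confidence = predict_with_confidence(model, x)
        if confidence < threshold:   # uncertain: ask the human
            label = oracle(x)
        seed.append((x, label))      # fold new labels back into the seed
    return fit(seed)                 # retrain on the grown seed

seed = [(1.0, "neg"), (2.0, "neg"), (8.0, "pos"), (9.0, "pos")]
model = fit(seed)
oracle = lambda x: "pos" if x >= 5 else "neg"  # stands in for a human expert
model = active_learning_round(model, seed, [1.5, 4.8, 5.2, 8.5], oracle)
```

In this run, only the two points near the class boundary trigger a query; the clear-cut ones are labeled by the model itself, which is the source of the labeling savings discussed below.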

The immediate appeal of active learning lies in the much smaller volume of labeled data that is needed to train the system. Supervised algorithms, because they don't learn on their own, require an extensive set of labeled examples to be provided by humans. This translates into long and costly processes to manually label up to billions of data points for any given dataset.

Some platforms, such as Amazon's Mechanical Turk, have even specialized in connecting organizations with a large pool of low-cost labor spread across the globe. "Turkers", as they are called, click through thousands of images a day, annotating data points as requested, all of which will go into training future algorithms. 

Active learning, on the other hand, only requires labeling a small seed pool of data. Barber, in fact, estimates that the process requires annotating as little as a tenth of the data.

He is not the only one to have picked up on this particular perk of the method. Big tech companies, especially, have a strong interest in reducing the volume of labeled data they need to feed their algorithms.

Facebook's AI unit is heavily invested in developing models that can learn from unlabeled data, for various applications including identifying harmful content. The tech giant recently published results showing that its AI team, using a teacher-student method, had successfully trained an image classification algorithm on a record collection of one billion unlabeled images, alongside a "relatively smaller set" of labeled data.
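The teacher-student idea can be sketched in miniature: a teacher trained on the small labeled set pseudo-labels the large unlabeled pool, and a student is then trained on both. The `fit`/`predict` routines below are toy class-mean stand-ins, not Facebook's actual models.

```python
# A hedged sketch of teacher-student training with pseudo-labels.

def fit(examples):
    """Toy 'training': record the mean 1-D feature value per class."""
    means = {}
    for x, label in examples:
        means.setdefault(label, []).append(x)
    return {label: sum(v) / len(v) for label, v in means.items()}

def predict(model, x):
    """Assign the class whose mean is nearest."""
    return min(model, key=lambda label: abs(x - model[label]))

def teacher_student(labeled, unlabeled):
    teacher = fit(labeled)                                  # small labeled set
    pseudo = [(x, predict(teacher, x)) for x in unlabeled]  # machine-made labels
    return fit(labeled + pseudo)                            # student sees both

labeled = [(1.0, "cat"), (9.0, "dog")]
student = teacher_student(labeled, [0.5, 1.5, 8.0, 9.5])
print(predict(student, 2.0))  # cat
```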

But it's not only about shrinking the data-labeling effort: active learning is also more efficient than supervised learning. Being able to ask a human where to focus when faced with a piece of data it is unsure about means that an "active" AI system can not only respond to the unknown, but also learn from it.

In the case of content moderation, an "active" algorithm will make more informed decisions, as it increasingly learns to pick up on more and more subtle forms of content violation. An "active" AI system would also be very efficient at natural language processing or medical imaging. 

Barber added that a high-profile use of the technology is in driverless cars, where videos still need to be segmented and their contents labeled as "pedestrian", "car", "tree" and other objects that the car needs to recognize. Annotating millions of these videos is time-consuming and expensive; on the other hand, letting algorithms learn and ask questions could significantly accelerate the process.

And, when a global pandemic strikes, "active" AI systems would be able to integrate new data in real time, along with some human input, and then adapt their predictions – rather than wait for large datasets to be manually annotated for training.

"If you're developing AI using the traditional approach of collecting large amounts of data and then training a deep-learning model, there's only so fast that can go," Barber told ZDNet. "With the traditional model, you'd be lucky to have a new model live in production in less than a few months. But with active learning, this can take only a few days at most."

The mathematician co-founded Re:infer – a company that leverages active learning to help businesses better understand and automate the processing of emails, calls and chats they receive every day from suppliers.  

Traditionally, building up an algorithm for this specific task would have required manually labeling each sentence from the thousands of customer messages received by a given business, before feeding it as training to an AI system. 

Using active learning, however, the algorithm can quickly learn from a base dataset, and present employees only with the sentences that it is not confident about. According to Barber, the method accelerates the overall time to value by ten to 100 times.

SEE: AI runs smack up against a big data problem in COVID-19 diagnosis

Speaking at the same conference as Barber, Emine Yilmaz, professor of computer science at University College London, agreed that active learning holds much potential. "Where we are headed in the next few years is towards a model where the AI is learning from us," she said.

"A system should be able to say that it is uncertain about a given classification, and that it is having difficulties. It should be able to ask questions to humans directly, just like a child learns," she added.

The new level of interaction between humans and AI is likely to play in the algorithm's favor: Yilmaz argued that the method could overcome the fear that some workers might feel at the prospect of deploying the technology in the workplace. Active AI, in this context, could provide a softer option by which the algorithm acts as a co-worker, and not a replacement.

As smart as this algorithmic co-worker can be, it will still need human help from time to time. And whether or not that sounds like a natural work relationship, the concept of a human-in-the-loop certainly seems like a step up from, and potentially a solution to, AI's big data problem.