Why AI still needs humans in the loop, at least for now

If AI-generated writing becomes indistinguishable from human writing, we'll still need to check it carefully enough to see the mistakes.
Written by Mary Branscombe, Contributor

Large language models can help you write code -- or rewrite adverts, so they look fresh. They can make it easier to quickly grasp the key points of a research paper or a news story by writing and answering questions. Or they can get things embarrassingly wrong.

Large language models like GPT-3 are key to search engines like Google and Bing, as well as providing suggested replies in email and chat, trying to finish your sentence in Word and powering coding assistants like GitHub Copilot.

But they're also not perfect. Considerations of what harm they can do usually focus on what you get by learning from everything that's published on the web, which includes the less positive opinions held by some. Large language models trained on massive text sources such as an online community can end up repeating some rather offensive remarks. And when the model learns from writing that contains common biases, like a set of interviews that refers to men by their titles and women by their first names, or text that assumes men are doctors and women are nurses, those biases are likely to show up in what the model writes.

The possible harms with code generation include the code being wrong but looking right; it's still up to the coder to review the AI-powered suggestions and make sure they understand what they do, but not everyone will do that.

That 'human in the loop' review stage is important to the responsible use of large language models because it's a way to catch a problem before the text is published or the code goes into production. Code licences are one issue when it comes to writing code, but AI-generated text could create all sorts of headaches, some embarrassing and some more serious.
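As a rough illustration of that review stage, here is a minimal sketch of a gate that holds generated text until a person approves it. The class and method names (ReviewQueue, submit, approve, reject) are hypothetical and invented for this example; they do not describe any real product's workflow.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    """Hypothetical gate: generated text waits here until a human decides."""
    pending: list = field(default_factory=list)
    published: list = field(default_factory=list)

    def submit(self, text):
        # Nothing generated by the model is published directly.
        self.pending.append(text)

    def approve(self, text):
        # A human reviewer has read the text and is happy with it.
        self.pending.remove(text)
        self.published.append(text)

    def reject(self, text):
        # A human reviewer caught a problem, so the text is dropped.
        self.pending.remove(text)


queue = ReviewQueue()
queue.submit("Draft advert written by the model")
queue.submit("Draft summary with a factual error")
queue.approve("Draft advert written by the model")
queue.reject("Draft summary with a factual error")
print(queue.published)
```

The point of the design is simply that publishing requires an explicit human action, so a mistake has to get past a person as well as the model.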

The way large language models work is by predicting what the next word in a sentence will be, and the next word after that, and the next word after that, and so on, all the way to the end of the sentence, the paragraph or the code snippet, looking at each word in the context of all the words around it.
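To make the idea of repeated next-word prediction concrete, here is a toy sketch. It is an assumption-laden simplification: it only counts which word follows which in a tiny made-up corpus and looks at a single previous word, whereas a real large language model is a neural network that weighs all the words in the context. The function names are illustrative only.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def generate(model, start_word, max_words=8):
    """Repeatedly append the most likely next word, one word at a time."""
    output = [start_word]
    for _ in range(max_words - 1):
        candidates = model.get(output[-1])
        if not candidates:
            break  # the model has never seen anything follow this word
        next_word, _count = candidates.most_common(1)[0]
        output.append(next_word)
    return " ".join(output)


# Made-up training text, chosen so every prediction has a clear winner.
corpus = "the cat sat on the mat . the cat sat on the sofa ."
model = train_bigram_model(corpus)
print(generate(model, "the", max_words=5))  # prints "the cat sat on the"
```

Even this toy shows the loop the article describes: predict one word, add it to the text, then predict again from the new ending.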

That means a search engine can understand that a search query that asks 'what can aggravate a concussion' is asking about what to do when someone has a head injury, not the symptoms or causes of concussion.

Another approach is to pair large language models with different kinds of machine learning models to avoid entire classes of harms. Picking the most likely word can mean a large language model only gives you obvious answers, like always answering 'birds' when asked 'what can fly' and never 'butterflies' or 'vaccinated airline passengers'. Adding a binary model that distinguishes different kinds of birds might get you 'birds can fly, except for ostriches and penguins and other flightless birds'.
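The pairing described above can be sketched in a few lines. Everything here is a stand-in: the probabilities are invented, and can_fly is a hard-coded placeholder for a trained binary classifier, not how Bing or any real system implements it.

```python
def most_likely_answer(probabilities):
    """Pick the single highest-probability answer, as greedy decoding would."""
    return max(probabilities, key=probabilities.get)


def can_fly(bird):
    """Placeholder for a binary model: True or False for one kind of bird."""
    return bird not in {"ostriches", "penguins"}


def qualified_answer(probabilities, known_birds):
    """Combine the obvious answer with the binary model's exceptions."""
    answer = most_likely_answer(probabilities)
    if answer != "birds":
        return f"{answer} can fly"
    exceptions = [bird for bird in known_birds if not can_fly(bird)]
    if not exceptions:
        return "birds can fly"
    return "birds can fly, except for " + " and ".join(exceptions)


# Invented scores for the question 'what can fly'.
scores = {"birds": 0.7, "butterflies": 0.2, "vaccinated airline passengers": 0.1}
print(most_likely_answer(scores))  # prints "birds"
print(qualified_answer(scores, ["sparrows", "ostriches", "eagles", "penguins"]))
# prints "birds can fly, except for ostriches and penguins"
```

On its own the first function always returns the obvious answer; the second model is what adds the qualification.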

Using a binary model alongside a large language model is one example of how Bing uses multiple AI models to answer questions. Many of them are there to cope with how many different ways we have of saying the same thing.

Information about entities like the Eiffel Tower is stored as vectors so Bing can tell you the tower's height even if your query doesn't include the word Eiffel -- asking 'how tall is the Paris tower' would get you the right answer. The Microsoft Generic Intent Encoder turns search queries into vectors so it can capture what people want to see (and click on) in search results even when the vocabulary they use is semantically very different.
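A small sketch shows why vectors help with differently worded queries. The three-dimensional vectors below are made up by hand for illustration; real systems learn vectors with hundreds of dimensions, and nothing here reflects the actual Generic Intent Encoder.

```python
import math


def cosine_similarity(a, b):
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical dimensions: (is a tower, is in Paris, is in London).
ENTITY_VECTORS = {
    "Eiffel Tower": [0.9, 0.9, 0.0],
    "Big Ben": [0.8, 0.0, 0.9],
    "Louvre": [0.1, 0.9, 0.0],
}


def closest_entity(query_vector):
    """Return the entity whose vector is most similar to the query vector."""
    return max(
        ENTITY_VECTORS,
        key=lambda name: cosine_similarity(query_vector, ENTITY_VECTORS[name]),
    )


# Invented encoding of 'how tall is the Paris tower': a tower, in Paris.
query = [1.0, 1.0, 0.0]
print(closest_entity(query))  # prints "Eiffel Tower"
```

The query never mentions the word Eiffel, but its vector sits closest to the Eiffel Tower's, which is the effect the article describes.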

Bing uses Microsoft's large language models (as does the Azure Cognitive Search Service that lets you create a custom search tool for your own documents and content) to rank search results, pull out snippets from web pages and spotlight the best result. They also highlight key phrases to help you know whether a web page has the information you're looking for, or give you ideas for different terms that might get you better search results. None of that changes the text you're shown, except possibly the emphasis of a sentence.

But Bing also uses a large language model called Turing Natural Language Generation to summarise some of the information from web pages in the search results, rewriting and shortening the snippet you see so it's a better answer to the question you typed in. So far, so useful.

On some Bing searches, you'll see a list of questions under the heading People Also Ask. Originally, that was just related queries some other Bing user had typed, so if you were searching for 'accountancy courses', you'd also see questions like how long it takes to get a qualification as an accountant, to save you time typing in those other searches yourself.

Bing doesn't have question and answer pairs to match every search, so last year Microsoft started using Turing NLG to generate questions and answers for documents in advance, rather than creating them on demand when someone types in a search. That way, more searches get extra ideas and handy nuggets.

The Q&A can show you more details than are in the headline and the snippets you see in results for news stories. But it's only helpful when the question Bing generates to go with the answer is accurate.

Over the summer, one of the questions Bing came up with showed that common metaphors can be a problem for AI tools. Perhaps confused by headlines that reported a celebrity criticising someone's actions as 'slamming' them, one of these Turing-written questions that I saw clearly misunderstood who was doing what in a particular news story.

The generative language model that created the question and answer pair isn't part of Cognitive Search. Microsoft is only offering its GPT-3 service (which can do the same kind of language generation) in a private preview, so it's not as if the average business has to worry about making these kinds of mistakes on their own search pages. But it shows these models can make mistakes, so you need to have a process in place to deal with them.

A search engine isn't going to have a human look at every page of search results before you see them; the point of AI models is to cope with problems where the scale is too large for humans to do it. But businesses might still want to have a human review for the writing they generate with a large language model. Don't take the human out of the loop for everything, just yet.
