Research consortium builds new AI benchmark for understanding language

Conversational AI models had "hit a ceiling" when tested on older benchmarks, limiting their ability to improve.
Written by Campbell Kwan, Contributor

Facebook AI has partnered with New York University (NYU), Google's DeepMind, and the University of Washington (UW) to launch a benchmarking platform that measures the natural language processing (NLP) capabilities of AI -- its ability to understand and interpret language.

The benchmarking platform, called SuperGLUE, builds upon an older platform called GLUE by making a "much harder benchmark with comprehensive human baselines," Facebook AI said. 

SuperGLUE was created because conversational AI systems had "hit a ceiling" on various benchmarks and needed greater challenges to improve their NLP capabilities.

"Within one year of release, several NLP models have already surpassed human baseline performance on the GLUE benchmark. Current models have advanced a surprisingly effective recipe that combines language model pretraining on huge text data sets with simple multitask and transfer learning techniques," Facebook said.   

According to Facebook AI, SuperGLUE comprises new ways of testing a range of difficult NLP tasks, with a focus on innovations in several core areas of machine learning: sample-efficient, transfer, multitask, and self-supervised learning.

Using Google's BERT as a model performance baseline, the benchmark consists of eight tasks. These include Choice of Plausible Alternatives (COPA), a causal reasoning task in which a system is given a premise sentence and must determine either the cause or the effect of that premise from two possible choices, and Recognizing Textual Entailment (RTE), which requires an AI to determine whether the meaning of one text can be inferred from another, among others.
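
For illustration, a COPA-style item pairs a premise with two candidate sentences and asks which is the cause or the effect of the premise. The minimal Python sketch below shows one way such an item and its scoring could be represented; the field names, example text, and `predict` interface are hypothetical placeholders, not the SuperGLUE toolkit's actual format or API.

```python
# Hedged sketch of a COPA-style causal reasoning item. The field names
# and example text are illustrative placeholders, not real dataset rows.
copa_item = {
    "premise": "The man lost his balance on the ladder.",
    "question": "effect",   # ask for the effect of the premise
    "choice1": "He fell off the ladder.",
    "choice2": "He climbed down carefully.",
    "label": 0,              # index of the correct choice (choice1)
}

def score_copa(predict, items):
    """Accuracy of a model on COPA-style items.

    `predict` is any callable mapping an item to 0 or 1; it stands in
    for a real model and is an assumed interface, not an actual API.
    """
    correct = sum(predict(item) == item["label"] for item in items)
    return correct / len(items)

# A trivial baseline that always picks the first choice scores 1.0 here.
print(score_copa(lambda item: 0, [copa_item]))
```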

Upon completion of the benchmark, SuperGLUE provides a single-number metric that summarises an AI's ability across its various NLP tasks.
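
As a rough illustration of how per-task results might be rolled up into that single number, the sketch below takes an unweighted mean of per-task scores; SuperGLUE's actual aggregation may weight or combine task metrics differently, and the task names and scores here are placeholders.

```python
# Hedged sketch: collapse per-task scores into one headline number via
# an unweighted mean. Scores are illustrative placeholders only.
task_scores = {
    "COPA": 74.0,  # loosely echoes the BERT figure cited below
    "RTE": 70.0,
    "WiC": 68.0,
}

overall = sum(task_scores.values()) / len(task_scores)
print(f"Overall benchmark score: {overall:.1f}")  # -> 70.7
```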

According to Facebook AI, humans can obtain 100% accuracy on COPA while Google's BERT achieved only 74%, signifying there is a lot of room for NLP improvement.

The research consortium has also developed a leaderboard and a PyTorch toolkit for bootstrapping research in conjunction with SuperGLUE.

Facebook AI also introduced a separate long-form question answering data set and benchmark back in July, which requires machines to provide long, complex answers -- something existing algorithms had not been challenged to do before. The challenge requires machines to elaborate with in-depth answers to open-ended questions, such as "How do jellyfish function without a brain?"

Meanwhile, Google unveiled a neural network called XLNet in June, which the search giant says outperforms BERT at realistically modelling how language actually appears in real-world documents.
