With the trend toward ever-larger machine learning models, state-of-the-art artificial intelligence research continues to run up against the limits of conventional computing technology.
That's one outcome of the latest mammoth piece of work by researchers at Facebook's AI team. Last week they published a report on their invention, "XLM-R," a natural language model based on the wildly popular Transformer model from Google.
The paper, Unsupervised Cross-lingual Representation Learning at Scale, posted on arXiv, is authored by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov, all with Facebook AI Research.
XLM-R is engineered to handle natural-language tasks across one hundred different languages. It builds on work Conneau did earlier this year with Guillaume Lample at Facebook, the creation of the initial XLM. It's most similar, they write, to a system shown earlier this year by Google researchers that did cross-lingual training on 103 languages.
It's a big improvement over those prior efforts on various benchmark tasks, such as cross-lingual question answering. It makes intriguing progress, in particular, with what are called "low-resource" languages -- those, such as Swahili and Urdu, for which there isn't much textual material suitable as training data.
But XLM-R runs into resource constraints despite using five hundred of Nvidia's most powerful GPUs, the V100. The authors refer to the "curse of multilinguality": As you stuff more and more languages into a single end-to-end Transformer, the low-resource languages benefit from being in the soup, but at some point, everything hits a wall.
That's because while XLM-R is big -- it has 24 layers, 16 "attention heads," and 550 million parameters -- its capacity is still finite; at some point, it reaches a limit to what it can handle.
"Model capacity (i.e. the number of parameters in the model) is constrained due to practical considerations such as memory and speed during training and inference," the authors write.
XLM-R is being asked to handle an enormous amount of training data: 2.5 terabytes of text gathered from the Web using the CommonCrawl program. And it's not even that XLM-R is the biggest network out there. OpenAI's GPT-2, introduced earlier this year, has 48 layers and 1.5 billion parameters in its largest version. Networks keep getting bigger and bigger, as Facebook's head of PyTorch, Joe Spisak, told ZDNet earlier this year.
But XLM-R is running up against some specific limits, such as how big a vocabulary it can accommodate. The authors built it with a baseline of 250,000 "tokens," already five times GPT-2's 50,000, but they argue XLM-R could get better still with many more tokens -- meaning a larger vocabulary.
"With bigger models, we believe that using a vocabulary of up to two million tokens with an adaptive softmax should improve performance even further," they write, "but we leave this exploration to future work. For simplicity and given the computational constraints, we use a vocabulary of 250k for XLM-R."
Tokens are a computational issue because using more of them requires dedicating more parameters of the model to the input layer of the neural network, where words are embedded as vectors, and that means taking some of the finite parameter capacity away from other parts of the network.
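The arithmetic behind that trade-off is easy to sketch. The figures below assume a hidden (embedding) dimension of 1,024, in line with the large configuration described in the paper; they are back-of-envelope illustrations, not the authors' own accounting.

```python
# Rough illustration of how vocabulary size consumes a model's parameter budget.
# Assumption: hidden/embedding dimension of 1024 (as in the large XLM-R
# configuration); numbers are illustrative, not exact.

def embedding_params(vocab_size: int, hidden_dim: int = 1024) -> int:
    """Input-embedding parameters: one hidden_dim-sized vector per token."""
    return vocab_size * hidden_dim

TOTAL = 550_000_000  # XLM-R's reported parameter count

for vocab in (50_000, 250_000, 2_000_000):
    emb = embedding_params(vocab)
    print(f"vocab {vocab:>9,}: {emb / 1e6:6.0f}M embedding params, "
          f"{emb / TOTAL:4.0%} of a 550M-parameter model")
```

At 250,000 tokens, the embedding table alone accounts for nearly half of the 550 million parameters, and a two-million-token vocabulary would exceed the entire model's size on its own -- which is why the authors say a vocabulary that large would need a trick such as the adaptive softmax they mention.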
XLM-R is an example of two important trends in deep learning. One is that scientists are still intent on building bigger and bigger language models to get better benchmark results.
And the other is that those scientists continue to run up against roadblocks in computing capacity. It's another sign that computing is going to have to change if it's going to accommodate what deep learning scientists want to get done.