In an age when, supposedly, language is all the rage, artificial intelligence programs such as ChatGPT are conspicuously narrow: they mostly deal with English to the exclusion of the world's hundreds of other commonly spoken languages.
In a sign of things to come, AI computer startup Cerebras Systems this week announced it has partnered with Abu Dhabi's Inception, a subsidiary of investment firm G42 of the United Arab Emirates, to create what it calls the world's most powerful open-source large language model for Arabic, a language spoken by approximately 400 million people worldwide.
Using the program -- called Jais-Chat -- is just like typing into ChatGPT's prompt, except that Jais-Chat accepts and produces Arabic-language text as well as English. It can, for example, write a letter in Arabic when prompted in English, or take an Arabic-language prompt and generate a response in Arabic.
Trained on a special corpus of Arabic texts much larger than what's commonly available, the program eschews the typical approach of building a generalist program that handles hundreds of languages, in many cases poorly, and instead focuses exclusively on Arabic and English.
When performing tests in Arabic of knowledge, reasoning, and bias -- tests such as the University of California at Berkeley's MMLU, a set of multiple-choice questions, and the Allen Institute for AI's HellaSwag, a sentence-completion task -- Jais-Chat scored a full 10 points higher than leading state-of-the-art language models such as Meta's Llama 2. It beat out top open-source programs such as BLOOM from the BigScience Workshop, and it also beat out specialized language models built exclusively for Arabic.
"Lots of companies talk about democratizing AI," said Andrew Feldman, co-founder and CEO of Cerebras, in an interview with ZDNET. "Here, we're giving the experience of 400 million Arabic speakers a voice in AI -- that is democratizing AI. It is the primary language of 25 nations, so, we thought it was an extraordinary sort of project."
The language disparity in AI has been observed and given considerable attention for some time now. In last year's "No Language Left Behind" (NLLB) effort by Meta Platforms, the company's scientists strove to advance the state of the art in handling 200 languages simultaneously, with a special focus on so-called "low-resource" languages, those without a large corpus of online text that can be used to train the models.
As the Meta authors noted, studies of the field "indicate that while only 25.9 percent of internet users speak English, 63.7 percent of all websites are in English."
"The truth is, the biggest data sets rely on scraping the internet, and the internet's mostly in English, and this is a really unfortunate sort of situation," said Feldman.
Attempts to close the language gap in AI have typically involved generalist AI programs, things such as Meta's NLLB. However, those programs fail to show improvement in a number of languages, including not only low-resource languages such as Oromo (native to Ethiopia and Kenya) but even languages with abundant translation material, such as Greek and Icelandic.
And so-called multi-modal programs such as the NLLB successor, SeamlessM4T from Meta, unveiled this month, try to do many different tasks with dozens of languages using just one model, including speech-to-text transcription and text-to-speech generation. That can weigh down the whole process with extra goals.
Instead of a generalist or a multi-modal approach, lead author Neha Sengupta of Inception, along with the Cerebras team and scholars at the UAE's Mohamed bin Zayed University of Artificial Intelligence, built a program trained on Arabic and English only.
And, they constructed a special data set of Arabic language texts. They compiled 55 billion tokens' worth of data from myriad sources such as Abu El-Khair, a collection of over 5 million articles, spanning 14 years, from major news sources; the Arabic-language version of Wikipedia; and United Nations transcripts, among others.
Then, in an approach that is likely to become exemplary for languages with fewer resources, the authors managed to increase Arabic-language training data from the 55 billion original tokens to 72 billion by performing machine translation of English texts into Arabic. As they describe it, "We further augment the Arabic data by translating 3 billion tokens from English Wikipedia and 15 billion tokens from the Books3 corpus."
The authors then up-sampled the Arabic-language text by 1.6 times, further augmenting the Arabic-language data to a total of 116 billion tokens.
The authors took another novel approach: They combined the Arabic and English texts with billions of tokens from computer code snippets, in various languages, gathered from GitHub. The final data set is 29% Arabic, 59% English, and 12% code.
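The figures reported above fit together as simple arithmetic. Here is a back-of-the-envelope sketch, using only the numbers in this article (the variable names and the implied totals are illustrative, not taken from the paper itself):

```python
# Back-of-the-envelope sketch of the Jais training mix, based only on the
# figures reported above. The implied English and code totals are derived,
# not quoted from the paper.

arabic_tokens = 116e9   # 55B native + ~18B machine-translated, up-sampled 1.6x
arabic_share = 0.29     # reported share of the final data set
english_share = 0.59
code_share = 0.12

# If Arabic is 29% of the mix, the implied total corpus size is:
total_tokens = arabic_tokens / arabic_share
english_tokens = total_tokens * english_share
code_tokens = total_tokens * code_share

print(f"total:   {total_tokens / 1e9:.0f}B tokens")    # ~400B
print(f"english: {english_tokens / 1e9:.0f}B tokens")  # ~236B
print(f"code:    {code_tokens / 1e9:.0f}B tokens")     # ~48B
```

That implied total of roughly 400 billion tokens is still modest by the standards of the largest English-only training runs, which is part of what makes the benchmark results below notable.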
Sengupta and team went beyond simply using a special data set. They also employed several special techniques to represent the vocabulary of Arabic.
The researchers built their own "tokenizer," the algorithm for cutting up text into individual units. The typical tokenizer used by programs such as GPT-3 "is primarily trained on English corpora," they write, so that common Arabic words "are over-segmented into individual characters [which] lowers the performance of the model and increases the computational cost."
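The over-segmentation problem is easy to see in miniature. A byte-level tokenizer that has learned no Arabic merges falls back to emitting one token per raw UTF-8 byte, and every Arabic letter encodes to two bytes -- so an Arabic word fragments into twice as many tokens as it has characters, while an English word of the same length stays compact. A toy sketch (this is the fallback behavior of byte-level BPE in general, not the Jais tokenizer itself):

```python
# Illustration of worst-case segmentation: a byte-level BPE with no learned
# Arabic merges degrades to one token per UTF-8 byte. Arabic letters are
# 2 bytes each in UTF-8, so token counts double relative to characters.

def byte_fallback_tokens(text: str) -> list[int]:
    """Worst-case segmentation: one token per raw byte."""
    return list(text.encode("utf-8"))

word_en = "hello"   # ASCII: 1 byte per character
word_ar = "مرحبا"   # Arabic "hello": 2 bytes per character

print(len(word_en), len(byte_fallback_tokens(word_en)))  # 5 characters -> 5 byte tokens
print(len(word_ar), len(byte_fallback_tokens(word_ar)))  # 5 characters -> 10 byte tokens
```

A tokenizer trained on Arabic text instead learns merges for common Arabic words and morphemes, so frequent words collapse into single tokens -- which is exactly the efficiency the Jais team was after.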
They also employed ALiBi, a state-of-the-art positional-encoding technique developed last year by researchers at the Allen Institute for AI and Meta. ALiBi is much better at handling very long context -- that is, long inputs to a language model typed at the prompt or recalled from memory.
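The idea behind ALiBi is simple: instead of adding position embeddings to the input, it adds a penalty to each attention score that grows linearly with how far back a query looks, with a different slope per attention head. A minimal sketch of that bias computation (the head count and sequence length here are illustrative):

```python
# Minimal sketch of ALiBi's linear attention bias (after Press et al.).
# Head count and sequence length are illustrative, not Jais's actual config.

def alibi_slopes(num_heads: int) -> list[float]:
    # Geometric sequence of slopes, one per head: 2^(-8/n), 2^(-16/n), ...
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    # bias[i][j] = -slope * (i - j) for positions j <= i (causal attention):
    # the farther back position j is from query position i, the bigger the penalty.
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]

slopes = alibi_slopes(8)   # 8 heads -> slopes 1/2, 1/4, ..., 1/256
bias = alibi_bias(slopes[0], 4)
print(bias[3])             # penalties for the last query position
```

Because the penalty is a fixed linear function of distance rather than a learned embedding per position, the same scheme applies unchanged to sequences longer than any seen in training -- which is why it helps with very long context.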
"What we were looking to do was to capture the linguistic nuances in Arabic and the cultural references," said Feldman, who has spent extensive time traveling in the Middle East. "And that's not easy when most of the model is in English."
Enhanced with these and other modifications, the result is a language model called Jais, and its companion chat app, Jais-Chat, measuring 13 billion in "parameters," the neural weights that form the critical active elements of the neural net. Jais is based on the GPT-3 architecture designed by OpenAI, a so-called decoder-only version of Google's Transformer from 2017.
The machine, dubbed Condor Galaxy 1, is composed of 32 of Cerebras's special-purpose AI computers, the CS-2, whose chips, the "Wafer Scale Engine," collectively hold a total of 27 million compute cores, 41 terabytes of memory, and 194 trillion bits per second of bandwidth. They are overseen by 36,352 AMD EPYC x86 server processor cores.
The researchers used a slice of that capacity, 16 machines, to train and "fine-tune" Jais.
At 13 billion parameters, the program punches above its weight. That is a relatively small neural network compared to, say, the 175-billion-parameter GPT-3, and programs with more parameters are generally viewed as more powerful.
"Its pre-trained and fine-tuned capabilities outperform all known open-source Arabic models," write Sengupta and team, "and are comparable to state-of-the-art open-source English models that were trained on larger datasets."
As the authors note, the original Arabic data set of 72 billion tokens wouldn't ordinarily be enough for a model larger than about 4 billion parameters, according to the AI rule of thumb known as the Chinchilla law, formulated by researchers at Google's DeepMind.
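That 4-billion-parameter ceiling follows from the Chinchilla rule of thumb, which says a compute-optimal model wants roughly 20 training tokens per parameter. Applied to the figures in the article:

```python
# The Chinchilla rule of thumb (Hoffmann et al., DeepMind): roughly 20
# training tokens per parameter for a compute-optimal model. Applied to
# the Arabic corpus size reported in the article:

TOKENS_PER_PARAM = 20          # approximate Chinchilla ratio

arabic_tokens = 72e9           # Arabic corpus after translation augmentation
optimal_params = arabic_tokens / TOKENS_PER_PARAM

print(f"{optimal_params / 1e9:.1f}B parameters")  # ~3.6B -- hence "no larger than ~4B"
```

Training a 13-billion-parameter model on that corpus alone would leave it under-trained by this rule, which is why the English and code data, and the up-sampling of Arabic, matter.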
In fact, not only does Jais-Chat in its 13-billion-parameter form top Llama 2 and others; a smaller version of the program, with just 6.7 billion parameters, also achieves higher scores on the same standardized tests, MMLU and HellaSwag.
"What was interesting was that the Arabic made the English better, too," said Feldman, referring to Jais's performance on the evaluations. "We ended up with a model that's as good as LlaMA in English, even though we trained it on about a tenth of the data."
The work not only sets new benchmark scores in Arabic but also dramatically shortens the time needed to train such a model compared with standard GPU chips of the kind sold by Nvidia, the dominant AI vendor.
Cerebras estimates that distributing the work and training Jais on a 512-node GPU cluster would take between 60 and 100 days, versus just 21 days on the Condor Galaxy 1.
"It would have taken 20 days just to configure a GPU cluster before you ran the model," quipped Feldman. "And that's an extraordinarily expensive cluster."
The Jais programs are the latest in a string of contributions by Cerebras to the open-source software effort in the wake of OpenAI and Google scaling back their disclosure. Another program trained on Condor Galaxy 1, called BTLM-3B-8K, is the number one model for a 3-billion-parameter configuration on Hugging Face at the moment, with over a million downloads, noted Feldman.
"We built a supercomputer, we've got people using it, we're moving the open-source community forward," said Feldman, "that's all goodness."