When Google artificial intelligence scientists revealed a significant new program -- the Pathways Language Model (PaLM) -- a year ago, they spent several hundred words in a technical paper describing the new AI techniques used to achieve the program's results.
PaLM-2 is a new state-of-the-art language model. We have small, medium, and large variants that use stacked layers based on the Transformer architecture, with varying parameters depending on model size. Further details of model size and architecture are withheld from external publication.
The deliberate refusal to disclose the so-called architecture of PaLM 2 -- the way the program is constructed -- is at variance not only with the prior PaLM paper but also with the entire history of AI publishing, which has been mostly based on open-source software code and which has customarily included substantial details about program architecture.
(There is also a blog post summarizing the new elements of PaLM 2, but without technical detail.)
PaLM 2, like GPT-4, is a generative AI program that can produce clusters of text in response to prompts, allowing it to perform a number of tasks such as question answering and software coding.
Like OpenAI, Google is reversing course on decades of open publishing in AI research. It was a Google research paper in 2017, "Attention Is All You Need," that revealed in intimate detail a breakthrough program called The Transformer. That program was swiftly adopted by much of the AI research community, and by industry, to develop natural language processing programs.
Among those offshoots is the ChatGPT program unveiled in the fall by OpenAI, the program that sparked global excitement over generative AI.
None of the authors of that original paper, including Ashish Vaswani, are listed among the PaLM 2 authors.
In a sense, then, by disclosing in its single paragraph that PaLM 2 is a descendant of The Transformer, and refusing to disclose anything else, the company's researchers are making clear both their contribution to the field and their intent to end that tradition of sharing breakthrough research.
The rest of the paper focuses on background about the training data used, and on the benchmark scores on which the program shines.
This material does offer a key insight, picking up on the research literature on AI: There is an ideal balance between the amount of data with which a machine learning program is trained and the size of the program.
The authors were able to put the PaLM 2 program on a diet by finding the right balance of the program's size relative to the amount of training data, so that the program itself is far smaller than the original PaLM program, they write. That seems significant, given that the trend of AI has been in the opposite direction of late, to greater and greater scale.
As the authors write,
The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. Our evaluation results show that PaLM 2 models significantly outperform PaLM on a variety of tasks, including natural language generation, translation, and reasoning. These results suggest that model scaling is not the only way to improve performance. Instead, performance can be unlocked by meticulous data selection and efficient architecture/objectives. Moreover, a smaller but higher quality model significantly improves inference efficiency, reduces serving cost, and enables the model's downstream application for more applications and users.
There is a sweet spot, the PaLM 2 authors are saying, in the balance between program size and the amount of training data. Compared to PaLM, the PaLM 2 programs show marked improvement in accuracy on benchmark tests, as the authors outline in a single table:
In that way, they are building on observations of the past two years of practical research in the scale of AI programs.
For example, a widely cited work by Jordan Hoffmann and colleagues last year at Google's DeepMind coined what's come to be known as the Chinchilla rule of thumb, a formula for how to balance the amount of training data against the size of the program.
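As a rough illustration -- a sketch of the Chinchilla heuristic as commonly summarized, not anything taken from the PaLM 2 paper -- the rule of thumb says a compute-optimal model should be trained on roughly 20 tokens of data per model parameter, with training compute approximated as C ≈ 6 × N × D floating-point operations for N parameters and D tokens:

```python
# A minimal sketch of the Chinchilla rule of thumb, assuming the
# commonly cited approximations: about 20 training tokens per model
# parameter, and training compute C ~= 6 * N * D FLOPs.

TOKENS_PER_PARAM = 20  # approximate ratio from the Chinchilla work

def optimal_split(compute_flops):
    """Given a FLOP budget, return (params, tokens) under the heuristic.

    From C = 6 * N * D and D = 20 * N, it follows that
    N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = (compute_flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Example: the original 70-billion-parameter Chinchilla model was
# trained on about 1.4 trillion tokens -- exactly the 20:1 ratio.
budget = 6 * 70e9 * 1.4e12
params, tokens = optimal_split(budget)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")
```

The point of the heuristic is that for a fixed compute budget, a smaller model trained on more data can beat a larger model trained on less -- the trade-off the PaLM 2 authors say they exploited.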
The PaLM 2 scientists arrive at slightly different numbers from Hoffmann and team, but their findings validate what that paper said. They show their results head-to-head with the Chinchilla work in a single table of scaling:
That insight is in keeping with efforts by young companies such as Snorkel, a three-year-old AI startup based in San Francisco, which in November unveiled tools for labeling training data. The premise of Snorkel is that better curation of data can reduce the amount of compute required.
This focus on a sweet spot is a bit of a departure from the original PaLM. With that model, Google emphasized the scale of training the program, noting it was "the largest TPU-based system configuration used for training to date," referring to Google's TPU computer chips.
No such boasts are made this time around. For all that is withheld in the new PaLM 2 work, you could say it does confirm the trend away from size for the sake of size, and toward a more thoughtful treatment of scale and ability.