In a paper posted last week by Google's DeepMind unit, researchers Chengrun Yang and team created a program called OPRO that makes large language models try different prompts until they reach one that gets closest to solving a task. It's a way to automate the kinds of trial and error that a person would do by typing.
The research paper, "Large Language Models as Optimizers," posted on the arXiv pre-print server, details an experiment in how to "optimize" anything with a language model, that is, how to make the program produce better and better answers that get closer to some ideal state.
Yang and team decided, instead of explicitly programming that ideal state, to use large language models to state in natural language the ideal to be reached. That allows the AI program to adapt to constantly changing requests for optimization on different tasks.
As Yang and co-authors write, the language-handling flexibility of large language models "lays out a new possibility for optimization: instead of formally defining the optimization problem and deriving the update step with a programmed solver, we describe the optimization problem in natural language, then instruct the LLM to iteratively generate new solutions based on the problem description and the previously found solutions."
At the heart of the OPRO program is an algorithm called "Meta-Prompt." Meta-Prompt looks back over prior prompts and measures how well those prompts did in solving a given problem. It then generates multiple new prompts that it can try out to find the best one.
In effect, Meta-Prompt is like a person sitting at the keyboard typing lots of new possibilities based on what they've seen work and not work before. Meta-Prompt can be hooked up to any large language model to produce the actual prompts and answers. The authors test several large language models, including GPT-3 and GPT-4, and Google's own PaLM 2 language model.
The authors start by testing OPRO on baby problems. One is linear regression, in which the program is prompted to "minimize a function," meaning, find a pair of numbers that are similar to past examples but produce a smaller numerical value as their result.
The point is that the language model is able to find solutions to a math problem, simply by prompting, that would normally be approached by a program built for that problem alone -- a "solver," as it's called. As the authors write, "LLMs properly capture the optimization directions on small-scale problems merely based on the past optimization trajectory provided in the meta-prompt."
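To make the regression example concrete, here is a small sketch of the kind of "past optimization trajectory" a language model might be shown. The data, the squared-error objective and the text layout are all assumptions chosen for illustration; `loss` and `trajectory_text` are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical data generated from y = 3x + 2, with no noise.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [3.0 * x + 2.0 for x in XS]


def loss(w: float, b: float) -> float:
    """Sum of squared errors for the line y = w*x + b on the data above."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(XS, YS))


def trajectory_text(pairs: List[Tuple[float, float]]) -> str:
    """List past (w, b) guesses with their values, worst first, best last."""
    scored = sorted(((loss(w, b), w, b) for w, b in pairs), reverse=True)
    return "\n".join(
        f"input: w={w:g}, b={b:g}\nvalue: {value:g}" for value, w, b in scored
    )


print(trajectory_text([(2.0, 2.0), (0.0, 0.0), (3.0, 2.0)]))
print("Give a new (w, b) pair, different from those above, with a smaller value.")
```

A conventional solver would compute the best pair directly; the point of the experiment is that the model only ever sees text like this and has to infer which direction to move.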
It turns out that the art of writing a good prompt for a large language model can itself be viewed as a task to be optimized.
Researchers have known that for some time. Scientists at Microsoft earlier this year proposed what they called "Automatic Prompt Optimization." That approach automatically edits the writing of the prompt to improve it. Yang and team went further. Instead of merely editing a previous prompt to make it better, Meta-Prompt generates entirely new prompts.
As they put it, "Each optimization step in our work generates new prompts that aim to increase the test accuracy based on a trajectory of previously generated prompts, instead of editing one input prompt according to natural language feedback or requiring the new prompt to follow the same semantic meaning."
After the baby problems, Yang and team set out to see how well Meta-Prompt can optimize prompts.
They test Meta-Prompt on some benchmark evaluations where getting the prompt right has been shown to improve performance.
One is "GSM8K," introduced in 2021 by OpenAI, a series of grade school math word problems such as, "Beth bakes 4, 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?"
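For reference, the arithmetic behind that sample problem is short. The sketch below assumes the question means four batches of two dozen cookies each; the function name is hypothetical.

```python
def cookies_per_person(batches: int, dozens_per_batch: int, people: int) -> int:
    """Total cookies baked, split evenly among the given number of people."""
    total = batches * dozens_per_batch * 12  # 12 cookies in a dozen
    return total // people


print(cookies_per_person(batches=4, dozens_per_batch=2, people=16))  # 96 cookies / 16 people = 6
```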
A second test is a derivative of BIG-bench, the reasoning test introduced last year by Google and dozens of collaborating organizations. The new version by Google authors, called BIG-bench Hard, introduced this year, focuses on reasoning problems where large language models have failed in the past to achieve human-level accuracy.
The BIG-bench problems are "diverse," as the Google authors wrote in the original paper, "drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond."
The authors compare their automatically-generated prompts for both tasks to prompts crafted "by hand," as exemplified in the 2022 work of Takeshi Kojima and team at The University of Tokyo and Google Research.
Famously, Kojima and team found they could improve the ability of large language models on tasks like GSM8K and BIG-bench simply by adding the phrase "Let's think step by step" at the start of the model's answer, with no worked examples required. That phrase, they found, was sufficient to induce "chain-of-thought" steps on the part of the language model.
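Mechanically, attaching such a phrase is a matter of string assembly. The sketch below is an assumption about layout rather than anyone's published template: the "Q:" and "A:" markers, the `with_instruction` helper and its two position names are hypothetical, and where the phrase sits is treated as a choice.

```python
def with_instruction(question: str, instruction: str, position: str = "answer_start") -> str:
    """Attach an instruction to a question, before it or at the start of the answer."""
    if position == "question_start":
        return f"{instruction}\nQ: {question}\nA:"
    if position == "answer_start":
        return f"Q: {question}\nA: {instruction}"
    raise ValueError(f"unknown position: {position}")


print(with_instruction("What is 17 + 25?", "Let's think step by step."))
```

The same helper works for any candidate phrase, which is what makes the phrase itself something that can be searched over.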
With Meta-Prompt, Yang and team find they can automatically generate prompts with phrases similar to "Let's think step by step" but better -- or, more optimal, in their vernacular.
Sometimes, the automatically generated prompts become very intricate. For example, on the BIG-bench reasoning task called "temporal_sequence," a language model is provided with some givens of a scenario and then asked to answer what time something happened, such as:
Today, Richard went to the swimming pool. Between what times could they have gone?
We know that: Richard woke up at 7am. Samantha saw Richard walking in the garden from 7am to 8am. Mark saw Richard working out at the gym from 8am to 9am. David saw Richard attending class at the school from 9am to 10am. Andrew saw Richard waiting at the train station from 10am to 4pm. The swimming pool was closed after 5pm. Between what times could Richard have gone to the swimming pool?
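The expected answer to this scenario can be worked out mechanically, which is a useful check on what the language model is being asked to do. A minimal sketch, assuming times are written as whole hours on a 24-hour clock and that "closed after 5pm" means the window ends at 17; `free_windows` is a hypothetical helper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start hour, end hour), 24-hour clock


def free_windows(start: int, end: int, busy: List[Interval]) -> List[Interval]:
    """Return the gaps between start and end not covered by any busy interval."""
    windows: List[Interval] = []
    cursor = start
    for b_start, b_end in sorted(busy):
        if b_start > cursor:  # a gap before this sighting
            windows.append((cursor, min(b_start, end)))
        cursor = max(cursor, b_end)
    if cursor < end:  # a gap after the last sighting
        windows.append((cursor, end))
    return [(s, e) for s, e in windows if s < e]


# Richard woke at 7 and the pool closed after 17 (5pm).
sightings = [(7, 8), (8, 9), (9, 10), (10, 16)]
print(free_windows(7, 17, sightings))  # [(16, 17)], i.e. 4pm to 5pm
```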
Yang and team found that Meta-Prompt did better as it compiled very complex prompts such as the following:
"To determine the possible time period when a person went to a place, first identify all the time periods when the person was not seen doing anything else and the place was open. Then, rule out any time periods during which the person was seen doing something else. The remaining time periods are the possible times when the person could have gone to the place."
Overall, they found, "our optimized prompts outperform human-designed prompts on GSM8K and Big-Bench Hard by a significant margin, sometimes over 50%."
There's more work to be done, however, to optimize the algorithm that optimizes the prompts.
In particular, OPRO's Meta-Prompt is not able to extrapolate from negative examples. "We tried including error cases in the meta-prompt rather than randomly sampling from the training set at each optimization step," they observe, "but the results are similar, indicating that the error cases alone are not informative enough for the optimizer LLM to grasp the cause of the wrong prediction."
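The two ways of choosing examples that the authors contrast could be sketched like this. It is a simplified illustration, not their implementation: the data layout, the `pick_exemplars` helper and the sample questions are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (question, correct answer)


def pick_exemplars(
    train: List[Example],
    predictions: Dict[str, str],
    k: int,
    errors_only: bool,
    rng: random.Random,
) -> List[Example]:
    """Sample up to k examples, either from everything or only from mistakes."""
    if errors_only:
        pool = [ex for ex in train if predictions.get(ex[0]) != ex[1]]
    else:
        pool = list(train)
    return rng.sample(pool, min(k, len(pool)))


train = [("2+2", "4"), ("3+5", "8"), ("7-4", "3"), ("6*2", "12")]
predictions = {"2+2": "4", "3+5": "9", "7-4": "3", "6*2": "10"}  # two wrong answers
rng = random.Random(0)
print(pick_exemplars(train, predictions, k=2, errors_only=False, rng=rng))
print(pick_exemplars(train, predictions, k=2, errors_only=True, rng=rng))
```

Showing only the mistakes tells the optimizer which questions went wrong but not why, which is consistent with the authors' observation that the results were similar either way.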
Maybe, then, your next programming job is figuring out how to best prompt the Meta-Prompt to create better prompts.