How Nvidia uses GPT-4 to make AI better at Minecraft

The Voyager program can both devise new challenges in the game and continuously refine its strategies for success. Sometimes it hallucinates non-existent items, though, such as an "acacia axe."
Written by Tiernan Ray, Senior Contributing Writer

A comparison of Nvidia's Voyager against other automated agents proceeding up through the game's so-called tech tree of achievements. The program is measurably faster at accomplishing new tasks, and it's so far the only automation of Minecraft that can unlock the highly prized diamond level of implements. The numbers along the bottom of the graphic represent the number of prompt iterations for the programs.

Guanzhi Wang et al

Like checkmate in chess, the ability to fashion a diamond tool in the video game Minecraft, one of the game's high-level challenges, is becoming mundane for artificial intelligence.

And now, something like memory is coming to AI's ability in the popular computer game. 

AI programs have been widely developed to play Minecraft without human intervention, with heavy investment in all kinds of approaches. For example, OpenAI, the creator of ChatGPT, has spent large sums to hire human players of the game and capture video footage that can be used to train AI to play by imitating people's moves.

A team led by Zihao Wang of Peking University in Beijing in February described what the team believes is "the first multi-task agent that can robustly accomplish 70+ Minecraft tasks." 

But the state of the art moves fast. A team led by Nvidia last week said it had come up with the first "lifelong learning agent," one that refines its approach to the game by trying out different techniques and then saves its achievements to a library of techniques.

Matched against other automatic systems, the technology consistently achieves milestones in Minecraft faster. 

The program, called Voyager, is described in a paper -- posted on the arXiv pre-print server -- penned by Guanzhi Wang of Nvidia and Caltech, and colleagues from UT Austin, Stanford, and Arizona State University. An advisor to the team is Nvidia's senior director of AI research Anima Anandkumar. (The paper and additional material are also posted by Nvidia on a companion Web site.)

Voyager makes use of GPT-4, the latest "large language model" from ChatGPT creator OpenAI. GPT-4 was unveiled in March, although OpenAI declined to describe the technical aspects of the program. According to OpenAI, GPT-4 is better than prior versions, and better than many other large language models, or LLMs, at many tasks for which ChatGPT is used, such as answering natural-language challenges and writing code.

GPT-4 is used in three ways in Voyager. One is to take the current inventory of possessions in Minecraft and use it to come up with a new challenge for the Voyager program. Given an inventory description at the prompt in natural language, with formatting for easy parsing, such as,

Inventory (5/36): {'oak_planks': 3, 'stick': 4, 'crafting_table': 1, 'stone': 3, 'wooden_pickaxe': 1},

GPT-4 will output a natural language description of a new challenge, such as to craft a stone pickaxe, along with the statement of why that is an appropriate new task, as, for example, 

Reasoning: Since you have a wooden pickaxe and some stones, it would be beneficial to upgrade your pickaxe to a stone pickaxe for better efficiency. 

Task: Craft 1 stone pickaxe.
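The exchange above can be sketched in a few lines of Python. This is only an illustration of the idea, not Voyager's actual prompt or code: the function names, the 36-slot capacity default, and the prompt wording are all assumptions made for the example, and no model is called. The reply is simply the text quoted above.

```python
import re

def format_inventory(inventory, capacity=36):
    """Render the inventory in the style quoted in the article."""
    return f"Inventory ({len(inventory)}/{capacity}): {inventory}"

def build_curriculum_prompt(inventory):
    # Hypothetical prompt wording, for illustration only.
    lines = [
        "Propose the next Minecraft task for the player.",
        format_inventory(inventory),
        "Answer with two lines: 'Reasoning: ...' and 'Task: ...'.",
    ]
    return "\n".join(lines)

def parse_curriculum_reply(reply):
    """Pull the 'Reasoning:' and 'Task:' lines out of a model reply."""
    reasoning = re.search(r"^Reasoning:\s*(.+)$", reply, re.MULTILINE)
    task = re.search(r"^Task:\s*(.+)$", reply, re.MULTILINE)
    if task is None:
        raise ValueError("reply has no 'Task:' line")
    return {
        "reasoning": reasoning.group(1).strip() if reasoning else "",
        "task": task.group(1).strip(),
    }

inventory = {"oak_planks": 3, "stick": 4, "crafting_table": 1,
             "stone": 3, "wooden_pickaxe": 1}
prompt = build_curriculum_prompt(inventory)

# A reply copied from the example in the text, standing in for GPT-4.
reply = (
    "Reasoning: Since you have a wooden pickaxe and some stones, it would be "
    "beneficial to upgrade your pickaxe to a stone pickaxe for better efficiency.\n"
    "\n"
    "Task: Craft 1 stone pickaxe."
)
parsed = parse_curriculum_reply(reply)
print(parsed["task"])
```

Keeping the reply in a fixed two-line shape is what makes it easy for a program, rather than a person, to pick up the next task.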

A second function of GPT-4 in Voyager is to take that new challenge as input and generate code to make the next move in Minecraft. Each piece of code GPT-4 writes is run in Minecraft, and the resulting feedback is fed back into GPT-4, which then refines the code.

It's well-known that GPT-4 can refine program code. The authors describe this trial-and-error coding process as "iterative prompting," because of the loop of code, feedback, and recode via the GPT-4 prompt. A second instance of GPT-4 is used as a critic to test each code invention and determine whether it is successful. That is known as "self-verification."

For example, if the initial program code instructs Minecraft to fashion an "acacia axe," an axe made of acacia wood, it will fail because there is no such thing as an acacia axe in Minecraft. The failure of that instruction is handled by Voyager as an "execution error," and the program revises its Minecraft code and tries again.
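A rough sketch of that loop, in Python, might look like the following. It is a toy under stated assumptions, not the authors' implementation: the code generator, the game, and the critic are all hypothetical stand-in functions so the example runs on its own, and the "wooden axe" correction is invented purely to show an execution error being fed back.

```python
def iterative_prompting(task, generate_code, run_in_game, critic, max_rounds=4):
    """Write code, run it, feed back errors, ask the critic, and retry."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        code = generate_code(task, feedback)
        result = run_in_game(code)
        if result["error"]:
            # An execution error goes back into the next prompt.
            feedback = f"Execution error: {result['error']}"
            continue
        if critic(task, result["inventory"]):
            # Self-verification passed: the task counts as done.
            return {"success": True, "code": code, "rounds": round_number}
        feedback = "Critic: task not completed yet"
    return {"success": False, "code": None, "rounds": max_rounds}

# Hypothetical stand-ins for GPT-4 and for Minecraft.
KNOWN_ITEMS = {"wooden_axe"}

def fake_generate_code(task, feedback):
    # The first attempt hallucinates an item; after feedback it corrects itself.
    return "craft('wooden_axe')" if feedback else "craft('acacia_axe')"

def fake_run_in_game(code):
    item = code.split("'")[1]
    if item not in KNOWN_ITEMS:
        return {"error": f"no item named {item}", "inventory": {}}
    return {"error": None, "inventory": {item: 1}}

def fake_critic(task, inventory):
    return inventory.get("wooden_axe", 0) >= 1

outcome = iterative_prompting(
    "Craft 1 wooden axe", fake_generate_code, fake_run_in_game, fake_critic
)
print(outcome)
```

In this toy run the first round fails on the nonexistent acacia axe and the second round succeeds, which is the pattern the paragraph above describes.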

The most interesting part comes with what's called a library, where Voyager stores those bits of code it has tried and tested and found successful, which are known as "skills."

In much the way that GPT-4 predicts the next word in a sentence, Voyager can mine this library for suggested actions in the future. GPT-4 starts with a "query" -- something like "craft an iron pickaxe" -- then it searches the library for the "key" -- the stored description of a skill -- and retrieves the required skill as the output, the "value" of that query-key combo, much like a database search.
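The query, key, and value lookup can be illustrated with a small Python sketch. The similarity measure here is an assumption chosen to keep the example self-contained: it compares simple word counts, whereas a real system would more likely compare learned text embeddings. The class name, skill descriptions, and code strings are likewise made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts, standing in for a learned text embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SkillLibrary:
    def __init__(self):
        self._skills = []  # each entry: (key vector, description, code)

    def add(self, description, code):
        # The description is the "key"; the code is the "value".
        self._skills.append((embed(description), description, code))

    def retrieve(self, query, top_k=1):
        # Rank stored skills by how closely their key matches the query.
        query_vec = embed(query)
        ranked = sorted(self._skills,
                        key=lambda skill: cosine(query_vec, skill[0]),
                        reverse=True)
        return [(desc, code) for _, desc, code in ranked[:top_k]]

library = SkillLibrary()
library.add("craft a stone pickaxe from cobblestone and sticks", "craftStonePickaxe()")
library.add("smelt raw iron in a furnace", "smeltIron()")
library.add("craft an iron pickaxe from iron ingots and sticks", "craftIronPickaxe()")

best = library.retrieve("craft an iron pickaxe")
print(best[0][1])
```

The point of matching on descriptions rather than exact names is that a new task only has to resemble a stored skill for that skill to be found and reused.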

Using what are called ablation studies -- removing parts of the program -- Wang and team find that the most critical element in the entire Voyager construction is the critic, the self-verification unit.


Examples of how Voyager can output more sophisticated results when being given human feedback during its gameplay. 

Guanzhi Wang et al

"Self-verification is the most important among all the feedback types" that Voyager receives, they write. 

"Removing the module leads to a significant drop (−73%) in the discovered item count," from which they deduce that "Self-verification serves as a critical mechanism to decide when to move on to a new task or reattempt a previously unsuccessful task."

To test Voyager against the state of the art in automated Minecraft, the authors cobble together some other AI programs because, as they put it, "there are no LLMs that play Minecraft out of the box."

The programs they test against, which constitute their baseline, include MineDojo, a program developed by some of the same contributors last year that won an "outstanding paper award" at the NeurIPS AI conference; ReAct, a Google invention introduced this year that prompts a large language model to "perform dynamic reasoning" in problem-solving, in this case, Minecraft; and AutoGPT, an adaptation of GPT-4 that automates the language model's next action, posted on GitHub, developed by contract development house Significant Gravitas.

Compared to these other approaches, the authors write, Voyager reaches goals much faster. "Voyager's superiority is evident in its ability to consistently make new strides, discovering 63 unique items within 160 prompting iterations, 3.3× many novel items compared to its counterparts," they write. "Voyager unlocks the wooden level 15.3× faster (in terms of the prompting iterations), the stone level 8.5× faster, the iron level 6.4× faster."

And, "Voyager is the only one to unlock the diamond level of the tech tree." (Obtaining a diamond pickaxe is one of the hardest tasks in Minecraft. Diamond-based tools last longer and can do more damage, and their power in other ways becomes important for end-game activities such as enchanting tables and making netherite equipment.)

They also found that the program keeps its capacity to progress in the game even when its inventory is emptied and it is dropped into a brand-new world.

To test what's called "zero-shot generalization," they write, "we clear the agent's inventory, reset it to a newly instantiated world, and test it with unseen tasks," against a plain-vanilla GPT-4. "Voyager can consistently solve all the tasks, while baselines cannot solve any task within 50 prompting iterations."

There's much to be done in future directions, Wang and team write. For one, GPT-4 can't yet process images. If it could, Voyager could get visual feedback from the graphics of the game, they hypothesize. 

Another direction is to use real-time human feedback as either "critic" or "curriculum" or both, to advance the choices Voyager makes. In fact, in experiments they perform, "We demonstrate that given human feedback, Voyager is able to construct complex 3D structures in Minecraft, such as a Nether Portal and a house."

Voyager is expensive from a compute standpoint, they observe. "The GPT-4 API incurs significant costs. It is 15× more expensive than GPT-3.5. Nevertheless, Voyager requires the quantum leap in code generation quality from GPT-4, which GPT-3.5 and open-source LLMs cannot provide."

And, yes, Voyager is prone to hallucination in this task, just as language models are in everything they do. The acacia axe is one example, and Voyager comes up with other "unachievable tasks," the authors note, such as crafting a "copper sword" or "copper chest plate," which "are items that do not exist within the game."

Furthermore, they note, "hallucinations also occur during the code generation process," such as "using cobblestone as a fuel input, despite being an invalid fuel source in the game."
