Can generative AI solve computer science's greatest unsolved problem?

The question -- Does P = NP? -- is a grand theoretical challenge. Scientists took the approach of spoon-feeding ideas to GPT-4 to seek an answer.
Written by Tiernan Ray, Senior Contributing Writer
Lightbulbs hanging from the ceiling with one switched on

Computer scientists contemplate whether the time required to compute a solution is out of reach for the hardest problems. The question of Does P = NP? is now treated as a multi-prompt session with the GPT-4 language model. The greatest insight of the work may be how to prune past chat sessions to maintain a discussion.

artpartner-images/Getty Images

When computer scientists hang out at cocktail parties, they're apt to chat, among other things, about the single most important unsolved problem in computer science: the question, Does P = NP

Formulated nearly 50 years ago, the question of whether P equals NP is a deep meditation on what can ultimately be achieved with computers. The question, which has implications for fields such as cryptography and quantum computing, has resisted a convincing answer despite decades of intense study. Now, that effort has enlisted the help of generative AI.

Also: DeepMind's RT-2 makes robot control a matter of AI chat

In a paper titled "Large Language Model for Science: A Study on P vs. NP,"  lead author Qingxiu Dong and colleagues program OpenAI's GPT-4 large language model using what they call a Socratic Method, several turns of chat via prompt with GPT-4. (The paper was posted this month on the arXiv pre-print server by scientists at Microsoft, Peking University, Beihang University in Beijing, and Beijing Technology and Business University.)

The team's method amounts to taking arguments from a prior paper and spoon-feeding them to GPT-4 to prompt useful responses.

Dong and team observe that GPT-4 demonstrates arguments to conclude that P does not, in fact, equal NP. And they claim that the work shows that large language models can do more than spit back vast quantities of text, they can also "discover novel insights" that may lead to "scientific discoveries," a prospect they christen "LLMs for Science."

To grasp what the authors are doing, it's necessary to know a little bit about the P = NP problem. 

Formulated independently in the 1970s by computer scientists Stephen Cook and Leonid Levin, the P versus NP problem -- or, "P = NP," as it is often referred to -- is a question of how easy it is to solve a given problem with a computer. The letter P represents problems that have been shown to be feasible to solve, meaning, the time to compute a solution is not out of reach; and the solution to which is also easy to verify, meaning, to check that the answer is correct. 

Also: Microsoft, TikTok give generative AI a sort of memory

The letters NP, by contrast, stand for problems whose answer is also relatively easy to verify, just like P, but for which there is no easy way known to compute a solution. It's common to cite the game Sudoku as an example of NP: any filled-in Sudoku game can be quite easily checked for accuracy, but the task of finding a solution grows exponentially in terms of time required as the game grid gets larger. (If you want to dive into the heavy theoretical details of P = NP, try Cook's 2000 paper on the problem.)

The problem, then, Does P = NP? asks whether those problems that we think are hard to solve, NP, but which we know are easy to verify, might actually turn out to be both easily verified and easily solved, just like P problems. 

A negative answer, that P doesn't equal NP, would mean some problems are beyond the ability of computers to solve even with tremendous computing budgets -- an upper bound on computing, in other words. Challenges such as cracking some encryption would then seem more formidable, beyond computing's reach. 

To tackle P = NP, Dong and team build from a trend of the past several years of "reasoning" with large language models.  As exemplified in the 2022 work of Takeshi Kojima and team at The University of Tokyo and Google Research, it's possible to improve the ability of large language models on certain tasks simply by adding the phrase "Let's think step by step" at the beginning of the prompt, accompanied by an example answer. That phrase, they found, was sufficient to induce "chain-of-thought" steps on the part of the language model.

Also: Generative AI: Just don't call it an 'artist' say scholars in Science magazine

It's the same chain-of-thought type of procedure Dong and team are after with their Socratic Method. Through 97 prompt rounds, the authors coax GPT-4 with a variety of requests that get into the nitty-gritty of the mathematics of P = NP, prepending each of their prompts with a leading statement to condition GPT-4, such as, "You are a wise philosopher," "You are a mathematician skilled in probability theory" -- in other words, the now familiar game of getting GPT-4 to play a role, or, "persona" to stylize its text generation. 


Sample of one of the chat rounds of the highly theoretical discussion.

Microsoft, Peking University, Beihang University in Beijing, and Beijing Technology and Business University

Their strategy is to induce GPT-4 to prove that P does not, in fact, equal NP, by first assuming that it does with an example and then finding a way that the example falls apart -- an approach known as proof by contradiction.

The interesting thing is that two of the authors of the research, Ke Xu and Guangyan Zhou, have separately released a paper this month in which they directly reason about P = NP in traditional formal mathematical terms. In that paper, they conclude that P does not equal NP. 

What Dong and Xu and Zhou and team are doing, then, is akin to reconstructing their formal math paper by leading GPT-4 through the language of their own reasoning, prompt by prompt. In fact, out of the 73-page paper, 67 pages are a complete printing of each of the 97 prompts and the complete response from GPT-4. It's like a giant exercise in prompt engineering to reconstruct an argument.

Whether the output that Dong and team have achieved with GPT-4 actually proves P does not equal NP is hard to say because Xu and Zhou's paper is itself very new. On the site Semantic Scholar, which gathers citations of papers, there are no citations yet for the paper -- other than their own paper with Dong and team. There is some discussion of the GPT-4 paper by various interested readers on the AI site HuggingFace that you can check out.

So, the world has yet to accept their argument. 

More importantly for people who like Gen AI, the authors argue that their dialogue in prompts shows the prospect for large language models to do more than merely mimic human textual creations. 

Also: ChatGPT: What The New York Times and others are getting terribly wrong about it

"Our investigation highlights the potential capability of GPT-4 to collaborate with humans in exploring exceptionally complex and expert-level problems," they write. Throughout the 67 pages of prompts and responses, they highlight passages that they deem "the insightful parts" of what GPT-4 spits out.

Just how insightful those responses are is probably also a topic that needs its own investigation. Some scientists have found large language models to be particularly shallow in how they string together citations and descriptions.

However, an interesting telltale item pops up in the margins of the paper, where Dong and team annotate the replies of GPT-4 with their observations about the quality of the replies. 

In one of those parenthetical notes, the authors write that each of GPT-4's preceding responses has been incorporated as background in the latest prompt -- except where the authors chose to prune the responses to keep only the most relevant bits.

Also: ChatGPT 'lacked depth and insight,' say prestigious science journal editors

"If the model provides multiple solutions, we only include the most valuable solution in the conversation history," they write in the margin on page 7. "This strategy enables GPT-4 to concentrate on pertinent information, thereby enhancing its overall efficiency and effectiveness."

In other words, there was a certain helpful curation of the way that GPT-4 used past history in what's called its "context window," all of the prior rounds upon which it can draw. Dong and team were engaged in a very selective prompt engineering to guide GPT-4 through the thread of an argument. That bears upon the practice of "retrieval-augmented generation," or "RAG," the current interest in using past chat data as new input to a large language model.

Also: ChatGPT lies about scientific results, needs open-source alternatives, say researchers

That may be one of the most significant contributions of the whole exercise: Whether or not it solves P = NP, a new frontier in prompt engineering could move programs closer to RAG in order to give chat sessions greater depth. When you think back only recently in time to chat sessions, they tended to be inane, often wandering off topic. 

Through 97 rounds, Dong and team managed to keep the machine on point, and there's something to be said for that.

Editorial standards