Telling AI model to “take a deep breath” boosts math scores in study – Ars Technica


Google DeepMind researchers recently developed a technique to improve the math capabilities of AI language models like ChatGPT by using other AI models to improve prompting, the written instructions that tell the AI model what to do. The study found that human-style encouragement dramatically improved math skills, consistent with previous results.

In a paper titled “Large Language Models as Optimizers” listed this month on arXiv, DeepMind scientists present Optimization by PROmpting (OPRO), a method for improving the performance of large language models (LLMs) such as OpenAI’s ChatGPT and Google’s PaLM 2. The new approach sidesteps the limitations of traditional math-based optimizers by using natural language to guide LLMs in solving problems. “Natural language” is a fancy way of saying everyday human speech.

“Instead of formally defining an optimization problem and obtaining an update step with a programmed solver,” the researchers write, “we describe the optimization problem in natural language, then instruct the LLM to iteratively generate new solutions based on the problem description and previously found solutions.”

Generally, in machine learning, algorithmic techniques such as derivative-based optimizers serve as a guide to improve the performance of AI models. Visualize model performance as a curve on a graph: the goal is to find the lowest point on this curve because that is where the model makes the fewest errors. By using the slope of the curve to make adjustments, the optimizer helps the model get closer to that ideal low point, making it more accurate and efficient at whatever task it is designed to do.
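To make the curve picture concrete, here is a minimal sketch of a derivative-based optimizer in Python. The error curve, starting point, and learning rate are all made up for illustration; real model training works on far more complicated curves.

```python
def loss(w):
    # Toy error curve (an assumption for illustration); its lowest point is at w = 3.
    return (w - 3.0) ** 2


def slope(w):
    # Derivative of the toy error curve above.
    return 2.0 * (w - 3.0)


def gradient_descent(start, learning_rate=0.1, steps=100):
    # Repeatedly step "downhill", in the direction opposite to the slope.
    w = start
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w


best_w = gradient_descent(start=0.0)
print(best_w, loss(best_w))
```

Each step moves the value a little closer to the bottom of the curve, which is the behavior OPRO tries to imitate without a formula for the slope.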

Rather than relying on formal mathematical definitions to perform this task, OPRO uses “meta-prompts” described in natural language to set the stage for the optimization process. The LLM then generates candidate solutions based on the problem description and previous solutions and tests them, assigning a quality score to each.

In OPRO, two large language models play different roles: a scorer LLM evaluates an objective function such as accuracy, while an optimizer LLM generates new solutions based on previous results and natural language descriptions. Different pairings of scorer and optimizer LLMs are evaluated, including PaLM 2 and GPT variants. OPRO can optimize prompts for the scorer LLM by having the optimizer iteratively generate higher-scoring prompts. These scores help the system identify the best solutions, which are then fed back into the “meta-prompt” for the next round of optimization.
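The loop described above can be sketched roughly as follows. This is not the researchers' code: the function names, the meta-prompt wording, and the two toy stand-ins for the scorer and optimizer LLMs are all assumptions made so the example runs without any AI model.

```python
def build_meta_prompt(task_description, scored_prompts, top_k=3):
    # Combine the task description with the best-scoring prompts found so far.
    best = sorted(scored_prompts, key=lambda pair: pair[1])[-top_k:]
    history = "\n".join(f"text: {p}\nscore: {s}" for p, s in best)
    return (
        f"{task_description}\n\n"
        f"Previous instructions and scores:\n{history}\n\n"
        "Write a new instruction that scores higher."
    )


def opro_loop(task_description, seed_prompts, propose, score, rounds=3):
    # propose: stands in for the optimizer LLM (meta-prompt -> new prompts).
    # score: stands in for the scorer LLM (prompt -> quality score).
    scored = [(p, score(p)) for p in seed_prompts]
    for _ in range(rounds):
        meta_prompt = build_meta_prompt(task_description, scored)
        for candidate in propose(meta_prompt):
            scored.append((candidate, score(candidate)))
    return max(scored, key=lambda pair: pair[1])


def toy_score(prompt):
    # Arbitrary placeholder, NOT a real accuracy measure: longer prompts score higher.
    return len(prompt.split())


def toy_propose(meta_prompt):
    # Placeholder: a real optimizer LLM would read meta_prompt and write new prompts.
    return ["Think step by step.", "Work through the problem carefully, step by step."]


best_prompt, best_score = opro_loop(
    "Find an instruction that improves math accuracy.",
    ["Solve it."],
    toy_propose,
    toy_score,
)
print(best_prompt, best_score)
```

In a real setup, `toy_score` would be replaced by running the scorer LLM on a set of math problems and measuring accuracy, and `toy_propose` by a call to the optimizer LLM.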

“Take a deep breath and work on this step by step”

Perhaps the most interesting part of the DeepMind study is the effect of certain phrases on the output. Phrases like “let’s think step by step” prompted each AI model to produce more accurate results when tested against math problem data sets. (This technique became widely known in May 2022, thanks to a now-famous paper titled “Large Language Models are Zero-Shot Reasoners.”)

Consider a simple word problem, such as, “Beth bakes four two-dozen batches of cookies in a week. If these cookies are shared equally among 16 people, how many cookies does each person consume?” The 2022 paper found that instead of feeding a chatbot such a word problem by itself, you get better results if you prefix it with “let’s think step by step” and then paste in the problem. The accuracy of the AI model’s results almost always improves, and the technique works well with ChatGPT.
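As a small illustration, the prefixing trick described above amounts to simple string handling, and the cookie problem itself is a short piece of arithmetic. The helper name `with_step_by_step` is hypothetical, and no chatbot is called here.

```python
def with_step_by_step(question, prefix="Let's think step by step."):
    # Put the encouragement phrase in front of the word problem.
    return f"{prefix}\n\n{question}"


question = (
    "Beth bakes four two-dozen batches of cookies in a week. "
    "If these cookies are shared equally among 16 people, "
    "how many cookies does each person consume?"
)
prompt = with_step_by_step(question)
print(prompt)

# The answer a model should reach, worked out directly.
batches = 4
cookies_per_batch = 2 * 12                    # two dozen
total_cookies = batches * cookies_per_batch   # 96 cookies
cookies_each = total_cookies // 16            # shared equally among 16 people
print(cookies_each)  # 6
```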

Interestingly, in this latest study, DeepMind researchers found “take a deep breath and work on this problem step by step” to be the most effective prompt when used with Google’s PaLM 2 language model. This phrase achieved the top accuracy score of 80.2 percent in tests against GSM8K, a data set of grade-school math word problems. In comparison, PaLM 2 achieved only 34 percent accuracy on GSM8K without any special prompting, and 71.8 percent accuracy with the classic “let’s think step by step” prompt.

So why does this work? Of course, large language models can’t take deep breaths because they don’t have lungs or bodies. They do not think and reason like humans, either. What “reasoning” they do (and “reasoning” is a contentious word among some, although it is loosely used as a term of art in AI) is derived from large data sets of language phrases scraped from books and the web. Those include things like question-and-answer forums, which contain many instances of “let’s take a deep breath” or “think step by step” before a more carefully reasoned solution is presented. Those phrases may help the LLM tap into better answers or produce better examples of reasoning or problem solving from the data absorbed into its neural network weights.

While figuring out the best ways to give LLMs human-like encouragement is a bit puzzling for us humans, this is not a problem for OPRO because the technique uses large language models to discover these more effective prompting phrases. DeepMind researchers think the biggest win for OPRO is its ability to sift through many possible prompts to find the one that gives the best results for a particular problem. This may allow people to get much more useful or accurate results from LLMs in the future.
