Transcript
John: Welcome to Advanced Topics in Natural Language Processing. Today's lecture is on 'Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.' We've seen a clear trend in the field, starting with the original 'Chain-of-Thought Prompting' paper, which showed how few-shot examples could elicit reasoning. Then came 'Large Language Models are Zero-Shot Reasoners,' which simplified things by just adding a magic phrase. This paper, from researchers at Singapore Management University and others, asks if we can refine that zero-shot approach to be more robust. Go ahead, Noah?
Noah: Excuse me, Professor. Could you quickly recap the key difference between few-shot and zero-shot Chain-of-Thought? Just to make sure we're all on the same page.
John: Of course. Few-shot CoT provides the model with several complete examples of a problem and its step-by-step solution before asking the target question. It's effective, but requires manual effort to create those examples. Zero-shot CoT, on the other hand, provides no examples at all. It simply appends a phrase like 'Let's think step by step' to the prompt, which was surprisingly effective but prone to certain errors.
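John: To make that distinction concrete, here's a rough sketch of the two prompt styles. This isn't from the paper itself; the example question and the query_model helper are just placeholders for whatever completion API you happen to use.

```python
# Illustrative contrast between few-shot CoT and zero-shot CoT prompts.
# The question text and query_model() are placeholders, not taken from the paper.

question = ("A baker sold 14 cakes in the morning and 23 in the afternoon. "
            "How many cakes did the baker sell in total?")

# Few-shot CoT: hand-written worked examples precede the target question.
few_shot_cot_prompt = (
    "Q: Tom has 3 apples and buys 4 more. How many apples does he have now?\n"
    "A: Tom starts with 3 apples. Buying 4 more gives 3 + 4 = 7. The answer is 7.\n\n"
    f"Q: {question}\nA:"
)

# Zero-shot CoT: no examples, only a reasoning trigger appended to the question.
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

# reasoning = query_model(zero_shot_cot_prompt)  # any LLM completion call would go here
```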
Noah: Okay, so the motivation here is to get the performance of few-shot without the manual labor of creating examples. What specific errors were they trying to fix?
John: Precisely. The authors identified three main pitfalls in standard Zero-shot CoT. Through an error analysis, they found calculation errors, missing-step errors where the model skips a crucial logical step, and semantic misunderstanding errors. This paper primarily targets the first two: calculation and missing steps. The core idea is that a more structured prompt can guide the model to a more robust reasoning process, much like how a human would approach a complex problem. Instead of just telling it to 'think step by step,' they propose a two-stage approach: first, devise a plan, and second, execute that plan.
Noah: So they're trying to elicit an emergent planning capability in the model?
John: That's a good way to put it. Their method, which they call Plan-and-Solve, or PS prompting, uses a more prescriptive trigger phrase: 'Let’s first understand the problem and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step.' This explicitly encourages the model to decompose the problem first, which they hypothesized would reduce missing-step errors. They also proposed an enhanced version, PS+, which adds even more detailed instructions.
Noah: What kind of details did they add for PS+? It seems like this is getting into very specific prompt engineering.
John: It is, but it's principled. The PS+ prompt instructs the model to 'extract relevant variables and their corresponding numerals' and to 'calculate intermediate results, paying attention to calculation and commonsense.' These instructions directly target the error types they identified. By forcing the model to explicitly list variables, it's less likely to miss one. By reminding it to be careful with calculations, it's more likely to be accurate. The hypothesis is that these aren't just arbitrary words, but specific cognitive guardrails for the model's reasoning process.
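John: And assembling these prompts takes very little machinery. Here's a rough sketch, not the authors' code: the PS trigger is quoted from the paper, the PS+ trigger is paraphrased from the instructions I just described, the answer-extraction pass mirrors the pattern Zero-shot CoT uses, and query_model stands in for a generic completion call.

```python
# Sketch of Plan-and-Solve (PS) and PS+ prompting with a two-pass answer extraction.
# The PS+ wording below is paraphrased, not the exact paper text.

def query_model(prompt: str) -> str:
    """Placeholder for whatever LLM completion API you use."""
    raise NotImplementedError

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract relevant variables and their "
    "corresponding numerals, and devise a plan. Then, let's carry out the plan, "
    "calculate intermediate results (pay attention to calculation and commonsense), "
    "and solve the problem step by step."
)

def plan_and_solve(question: str, trigger: str = PS_PLUS_TRIGGER) -> str:
    # Pass 1: elicit the plan and the step-by-step reasoning.
    reasoning_prompt = f"Q: {question}\nA: {trigger}"
    reasoning = query_model(reasoning_prompt)
    # Pass 2: feed the reasoning back and ask only for the final answer,
    # mirroring the answer-extraction step used in Zero-shot CoT.
    extraction_prompt = (reasoning_prompt + " " + reasoning +
                         "\nTherefore, the answer (arabic numerals) is")
    return query_model(extraction_prompt)
```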
Noah: How did they test this? Was it just on arithmetic tasks, where planning and variable extraction are obvious wins?
John: That's a critical question. They did focus heavily on arithmetic reasoning, using six different benchmarks like GSM8K and AQuA. But to test generalizability, they also included two commonsense reasoning datasets—CommonsenseQA and StrategyQA—and two symbolic reasoning tasks. They used GPT-3's text-davinci-003 engine and compared their results against strong baselines, including standard Zero-shot CoT, Auto-CoT, and even manual few-shot CoT.
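John: Mechanically, the comparison boils down to an accuracy loop over each benchmark. Here's a rough sketch of that loop, assuming the plan_and_solve helper from before and a deliberately naive numeric parser; it is not the authors' evaluation code.

```python
import re

def extract_number(text):
    # Naive parser: take the last number mentioned in the model's answer.
    matches = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def accuracy(dataset, prompt_fn):
    # dataset: iterable of (question, gold_answer) pairs, e.g. GSM8K test items.
    correct = 0
    for question, gold in dataset:
        predicted = extract_number(prompt_fn(question))
        correct += int(predicted is not None and abs(predicted - float(gold)) < 1e-6)
    return correct / len(dataset)

# acc_ps_plus = accuracy(gsm8k_test, plan_and_solve)
# acc_zero_shot = accuracy(gsm8k_test, zero_shot_cot)   # same loop for each baseline
```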
Noah: And what were the results? Did this more complex prompt actually outperform the simpler methods?
John: The results were quite convincing. On the arithmetic tasks, PS+ outperformed Zero-shot CoT by an average of 6.3%. More notably, its performance was comparable to Manual-CoT and even better than Auto-CoT, which automates the selection of few-shot examples. This suggests that a well-designed zero-shot prompt can nearly eliminate the need for in-context examples on these tasks. They also saw consistent, albeit smaller, gains in commonsense and symbolic reasoning.
Noah: Wait, so did they confirm that the prompt actually reduced the specific errors they were targeting?
John: Yes, their error analysis showed that PS+ reduced calculation errors from 7% down to 5% and, more significantly, cut missing-step errors nearly in half, from 12% down to 7%. They also found a strong negative correlation: when the model's output contained an explicit plan and variable definitions, the incidence of these errors was much lower. This provides empirical evidence that the prompt is working as intended.
Noah: So it seems very effective for those two error types. But what about the third one you mentioned, semantic misunderstanding? Did the planning help with that at all?
John: That's the main limitation they acknowledge. The rate of semantic misunderstanding errors remained largely unchanged. This suggests that while we can guide the model to be more methodical in its process, fixing fundamental misinterpretations of the problem itself is a harder challenge that this prompting strategy doesn't solve. It improves the structure of the reasoning, but not necessarily the initial comprehension.
Noah: That makes sense. It seems this work really strengthens the case for structured, zero-shot prompting as a viable alternative to few-shot methods. How does this fit with other approaches like Program-of-Thoughts, which offload computation to an external tool?
John: An excellent connection. Program-of-Thoughts, or PoT, also tackles calculation errors, but it does so by having the LLM write code that is then executed. PS+ is different because it aims to improve the LLM's internal numerical reasoning abilities through better prompting, rather than outsourcing it. The results show PS+ was competitive with, and on several datasets superior to, Zero-shot PoT. This implies that improving the model's native reasoning process is a powerful and complementary direction to tool-assisted reasoning.
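John: To make that contrast concrete, a PoT-style pipeline asks the model for code rather than prose and lets a Python interpreter do the arithmetic, roughly like the sketch below. Again, this illustrates the idea rather than the PoT authors' implementation, and query_model is the same placeholder as before.

```python
# Rough sketch of Program-of-Thoughts-style offloading, for contrast with PS+.
# The prompt wording and query_model() are illustrative placeholders.

def program_of_thoughts(question):
    prompt = (f"Q: {question}\n"
              "# Write Python code that computes the answer and stores it in `ans`.\n")
    generated_code = query_model(prompt)   # the model returns Python source, not prose
    namespace = {}
    exec(generated_code, namespace)        # the interpreter, not the LLM, does the arithmetic
    return namespace.get("ans")
```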
John: So, to wrap up, this paper demonstrates that we can significantly enhance zero-shot reasoning by designing prompts that mirror human problem-solving frameworks. The Plan-and-Solve approach effectively reduces common errors in calculation and logic by explicitly instructing the model to decompose the problem and execute its plan carefully. The key takeaway is that unlocking an LLM's reasoning potential may be less about giving it examples and more about giving it a better cognitive strategy to follow. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.