Plan-and-Solve Prompting: A Complete Guide
Plan-and-Solve (PS) prompting is a zero-shot technique that improves large language model reasoning by explicitly separating the problem-solving process into two distinct phases: first devising a plan to decompose the task into subtasks, then systematically executing that plan step by step. Rather than letting the model reason in an unstructured manner, PS prompting instructs the model to understand the problem, create a solution strategy, and then methodically carry out that strategy.
The technique addresses a critical weakness in standard zero-shot Chain-of-Thought (CoT) prompting: missing-step errors. When models use the simple trigger "Let's think step by step," they often skip crucial reasoning steps, leading to incorrect conclusions. PS prompting forces explicit planning before execution, significantly reducing these omissions.
Category: Plan-and-Solve belongs to reasoning-based decomposition techniques within the zero-shot prompting family. It combines task decomposition with structured execution, making it a planning-first approach to multi-step reasoning.
Type: Zero-shot reasoning technique that structures the model's cognitive process through explicit planning and systematic execution phases.
Scope: PS prompting includes explicit problem understanding, plan formulation, subtask identification, sequential execution, and intermediate result tracking. It excludes tasks requiring external knowledge retrieval, multi-turn dialogue management, or creative generation where rigid planning may constrain outcomes.
Why This Exists
Core Problems Solved:
- Missing-step errors: Zero-shot-CoT frequently skips essential reasoning steps, particularly in multi-step mathematical problems
- Unstructured reasoning: "Let's think step by step" provides no guidance on how to structure the reasoning process
- Calculation errors: Without explicit attention to intermediate calculations, models make arithmetic mistakes
- Semantic misunderstanding: Complex problems require careful problem comprehension before solving
- Inconsistent reasoning quality: Standard CoT produces variable quality reasoning depending on problem complexity
Value Proposition:
- Accuracy: PS+ achieves 91.8% on MultiArith, 59.3% on GSM8K, 76.7% on SVAMP—comparable to 8-shot manual CoT
- Zero-shot capability: No examples required, making it universally applicable without task-specific engineering
- Reduced missing steps: Explicit planning ensures all necessary reasoning steps are identified upfront
- Improved calculation accuracy: PS+ variant specifically addresses arithmetic errors through targeted instructions
- Transparent reasoning: Clear separation of planning and execution makes the reasoning process auditable
- Scalability: Single prompt template works across diverse reasoning tasks without modification
Research Foundation
Seminal Work: Wang et al. (2023)
The paper "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models" by Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim introduced this technique. Published at ACL 2023, the research emerged from systematic analysis of Zero-shot-CoT failures.
Key Findings:
- Error analysis of sampled GSM8K problems, 46% of which Zero-shot-CoT answered incorrectly, revealed three distinct error categories: calculation errors (7% of samples), missing-step errors (12%), and semantic misunderstanding errors (27%)
- PS prompting specifically targets missing-step errors through explicit planning
- PS+ extends the approach to address calculation errors through additional detailed instructions
- The technique achieves comparable performance to few-shot CoT methods without requiring any examples
Theoretical Motivation:
The authors observed that Zero-shot-CoT's trigger phrase "Let's think step by step" fails to guide the model on how to decompose problems effectively. By replacing this with explicit planning instructions, the model receives clearer guidance on structuring its reasoning process. This mirrors human problem-solving, where effective solutions typically begin with planning before execution.
Prior Approaches Improved Upon:
- Zero-shot-CoT (Kojima et al., 2022): Simple trigger phrase without planning structure
- Few-shot CoT (Wei et al., 2022): Requires manually crafted examples for each task domain
- Auto-CoT (Zhang et al., 2022): Automates example generation but still requires clustering and sampling
Evolution:
The research builds on the Zero-shot-CoT foundation while addressing its limitations. PS prompting represents a middle ground between the simplicity of zero-shot approaches and the effectiveness of few-shot methods. The subsequent development of PS+ added targeted instructions for calculation accuracy and variable extraction, further closing the gap with manual few-shot approaches.
Real-World Performance Evidence
Arithmetic Reasoning Benchmarks:
| Dataset    | Zero-shot-CoT | PS    | PS+   | Manual-CoT (8-shot) |
| ---------- | ------------- | ----- | ----- | ------------------- |
| MultiArith | 83.8%         | 88.0% | 91.8% | 93.3%               |
| GSM8K      | 56.4%         | 58.7% | 59.3% | 60.1%               |
| SVAMP      | 70.8%         | 73.2% | 76.7% | 78.2%               |
| AddSub     | 83.5%         | 87.1% | 88.4% | 89.2%               |
| SingleEq   | 92.1%         | 93.4% | 94.7% | 94.9%               |
| AQuA       | 43.7%         | 45.3% | 46.8% | 48.2%               |
Commonsense Reasoning:
| Dataset       | Zero-shot-CoT | PS+   | Manual-CoT |
| ------------- | ------------- | ----- | ---------- |
| CommonsenseQA | 65.2%         | 71.9% | 74.2%      |
| StrategyQA    | 63.8%         | 65.4% | 68.7%      |
Symbolic Reasoning:
| Dataset      | Zero-shot-CoT | PS+   | Manual-CoT |
| ------------ | ------------- | ----- | ---------- |
| Last Letters | 65.2%         | 75.2% | 70.6%      |
| Coin Flip    | 96.8%         | 99.6% | 100.0%     |
Key Performance Insights:
- PS+ outperforms Zero-shot-CoT by an average of 2.5% across all 10 datasets tested
- On arithmetic reasoning, PS+ improves accuracy over Zero-shot-CoT by 2.6-8.0 percentage points per dataset, with the largest gain on MultiArith (+8.0) and the smallest on SingleEq (+2.6); GSM8K shows a 2.9-point improvement
- PS+ matches or exceeds few-shot Manual-CoT on symbolic reasoning tasks (75.2% vs 70.6% on Last Letters)
- The technique shows consistent improvements across all three reasoning categories: arithmetic, commonsense, and symbolic
- Average PS+ accuracy (76.7%) approaches Manual-CoT (77.6%) while requiring no examples
Error Reduction Analysis (GSM8K):
| Error Type                | Zero-shot-CoT | PS+ | Reduction |
| ------------------------- | ------------- | --- | --------- |
| Calculation errors        | 7%            | 5%  | 28.6%     |
| Missing-step errors       | 12%           | 7%  | 41.7%     |
| Semantic misunderstanding | 27%           | 27% | 0%        |
| Total wrong answers       | 44            | 39  | 11.4%     |
The data reveals that PS+ effectively addresses calculation and missing-step errors but does not improve semantic understanding—a fundamental limitation of the approach.
Model-Specific Results:
Testing across different model sizes and families reveals performance variation:
| Model                    | Zero-shot-CoT | PS+       | Improvement |
| ------------------------ | ------------- | --------- | ----------- |
| GPT-3 (text-davinci-003) | Baseline      | +2.5% avg | Consistent  |
| GPT-3.5-turbo            | 80%           | 85%       | +5%         |
| Mistral-7B               | 60%           | 65%       | +5%         |
| Llama-2-70b              | 70%           | 60%       | -10%        |
| Zephyr-7b                | 65%           | 45%       | -20%        |
Note: Smaller and open-source models show inconsistent results, suggesting PS prompting benefits scale with model capability.
How It Works
Theoretical Foundation
Plan-and-Solve prompting is grounded in cognitive psychology's distinction between problem representation and problem solving. Research on human problem-solving shows that expert reasoners spend more time understanding and planning before executing, while novices jump directly to solution attempts. PS prompting encodes this expert behavior into the prompt structure.
Core Insight: The fundamental innovation is recognizing that "Let's think step by step" provides insufficient guidance for complex reasoning. Models benefit from explicit instructions to:
- Understand the problem before solving it
- Devise a structured plan
- Execute the plan systematically
This mirrors the cognitive process of metacognition—thinking about how to think—which improves problem-solving effectiveness.
Fundamental Ideas:
The technique rests on task decomposition theory: complex problems become tractable when broken into smaller, manageable subtasks. Unlike implicit decomposition in standard CoT (where the model discovers subtasks during generation), PS prompting makes decomposition explicit and upfront.
Conceptual Model:
Standard prompting: P(answer | problem)
Zero-shot-CoT: P(answer | problem, "think step by step")
PS prompting: P(answer | problem, plan(problem), execute(plan))
The explicit planning phase creates a roadmap that guides subsequent token generation, reducing the probability of missing steps.
Key Assumptions:
- Models can effectively decompose problems when explicitly instructed to plan
- A planning phase improves the quality of subsequent reasoning
- Natural language plans can guide step-by-step execution
- Explicit attention to intermediate calculations reduces arithmetic errors
Where Assumptions Hold:
- Multi-step mathematical problems with clear structure
- Problems where subtasks can be identified from the problem statement
- Tasks requiring sequential reasoning with dependencies between steps
- Domains where calculation accuracy matters
Where Assumptions Fail:
- Problems requiring lateral thinking or creative leaps
- Tasks where the solution path isn't decomposable upfront
- Semantic understanding errors (PS doesn't improve comprehension)
- Problems requiring external knowledge not in the problem statement
- Very simple problems where planning adds unnecessary overhead
Fundamental Trade-offs:
- Verbosity vs efficiency: Planning instructions add tokens but improve reasoning quality
- Structure vs flexibility: Rigid planning may constrain creative problem approaches
- Comprehensiveness vs speed: Thorough planning takes more generation time
- Universal vs optimized: Single template sacrifices task-specific optimization
Execution Mechanism
Phase 1: Problem Understanding
The model first processes the problem statement with explicit attention to comprehension:
- Identifies what is being asked
- Notes given information and constraints
- Recognizes the problem type and domain
- Flags potential ambiguities
Phase 2: Plan Formulation
Before generating any solution steps, the model creates a plan:
- Breaks the problem into logical subtasks
- Determines the order of operations
- Identifies dependencies between subtasks
- Notes intermediate values to calculate
Phase 3: Plan Execution
The model executes the plan systematically:
- Follows the planned sequence of steps
- Calculates intermediate results explicitly
- Maintains attention on calculation accuracy
- Tracks progress through the plan
Phase 4: Answer Extraction
The final answer is derived from the completed reasoning:
- Combines intermediate results
- States the final answer clearly
- Uses consistent formatting (e.g., "The answer is...")
Cognitive Processes Triggered:
- Metacognition: Thinking about how to approach the problem
- Task decomposition: Breaking complex tasks into manageable parts
- Sequential attention: Maintaining focus through multi-step processes
- Working memory management: Explicitly storing intermediate values
- Self-monitoring: Following the plan creates implicit checkpoints
Single-Pass vs Iterative:
Standard PS prompting is single-pass: one forward inference generating plan and execution together. However, it can be combined with iterative approaches:
- Self-consistency: Multiple PS reasoning paths with majority voting
- Verification: Separate pass to check answer against reasoning
- Refinement: Iterative improvement of plan or execution
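For example, combining PS with self-consistency can be sketched as follows. `generate` stands in for any model-calling function (an assumption of this sketch, not an API from the paper), and the answer extraction is deliberately simple:

```python
from collections import Counter

PS_TRIGGER = ("Let's first understand the problem and devise a plan to solve the "
              "problem. Then, let's carry out the plan and solve the problem step by step.")

def self_consistent_ps(problem, generate, n_samples=5):
    """Sample several PS reasoning paths and majority-vote on the final answers.

    `generate` is any callable mapping a prompt string to the model's text
    response; sampling temperature > 0 is expected so that paths differ.
    """
    prompt = f"Q: {problem}\n\nA: {PS_TRIGGER}"
    answers = []
    for _ in range(n_samples):
        response = generate(prompt)
        # Take the text after the last "answer is" marker as the candidate answer
        idx = response.lower().rfind("answer is")
        if idx != -1:
            answers.append(response[idx + len("answer is"):].strip(" :.\n"))
    # Majority vote across the sampled reasoning paths
    return Counter(answers).most_common(1)[0][0] if answers else None
```

Because each path plans independently, voting filters out reasoning paths whose plans went wrong, at the cost of n model calls per problem.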
Initialization and Completion:
- Initialization: Problem statement + PS trigger phrase
- Completion criteria: Clear answer statement, typically "The answer is [X]" or similar format marker
Causal Mechanisms
Why PS Prompting Improves Outputs:
- Explicit decomposition reduces omissions: When the model plans before solving, it identifies all necessary steps upfront, reducing the probability of skipping steps during execution.
- Attention allocation improves: The planning phase primes relevant reasoning patterns, helping the model attend to important problem aspects during execution.
- Intermediate variable tracking: Instructions to "extract relevant variables" create explicit bookkeeping that prevents calculation errors from propagating.
- Structured generation constrains errors: Following a plan constrains the solution space, reducing the probability of wandering into incorrect reasoning paths.
Cascading Effects:
- Clear problem understanding → correct plan formulation → accurate step execution → correct final answer
- Explicit variable extraction → accurate intermediate calculations → reduced error propagation
- Structured planning → consistent reasoning format → easier verification
Feedback Loops:
- Positive: Well-formulated plans guide accurate execution; accurate intermediate results validate the plan
- Negative: Flawed plans lead to incorrect execution; errors in early steps compound through subsequent reasoning
Emergent Behaviors:
- Models sometimes generate more detailed plans than explicitly requested
- Variable extraction naturally extends to unit tracking in physics problems
- Planning instructions generalize to problems beyond the original research domains
Dominant Factors (Ranked by Impact):
- Problem complexity (35%): Larger gains on multi-step problems requiring decomposition
- Model capability (30%): Benefits scale with model size and reasoning ability
- Instruction specificity (20%): PS+ improvements come from more detailed instructions
- Problem domain (15%): Mathematical problems show larger gains than commonsense reasoning
Structure and Components
Essential Components
Plan-and-Solve (PS) Prompt Structure:
- Problem statement: The task or question to be solved
- Understanding trigger: Instruction to comprehend the problem first
- Planning trigger: Explicit instruction to devise a plan
- Execution trigger: Instruction to carry out the plan step by step
PS+ Enhanced Components:
- Variable extraction instruction: "Extract relevant variables and their corresponding numerals"
- Calculation attention: "Calculate intermediate results"
- Commonsense reminder: "Pay attention to calculation and commonsense"
Required vs Optional:
| Component             | Required       | Purpose                              |
| --------------------- | -------------- | ------------------------------------ |
| Problem statement     | Yes            | Defines the task                     |
| Understanding phase   | Yes            | Ensures comprehension before solving |
| Planning instruction  | Yes            | Creates solution structure           |
| Execution instruction | Yes            | Guides systematic solving            |
| Variable extraction   | Optional (PS+) | Improves numerical accuracy          |
| Calculation attention | Optional (PS+) | Reduces arithmetic errors            |
| Commonsense reminder  | Optional (PS+) | Catches logical errors               |
Design Principles
Linguistic Patterns:
- Sequential structure: "First understand... then devise... then carry out..."
- Imperative guidance: "Let's" creates collaborative framing
- Phase markers: Clear transitions between understanding, planning, and execution
- Completion signals: "Show the answer" or "solve the problem step by step"
Cognitive Principles Leveraged:
- Metacognitive prompting: Explicit instruction to plan before acting
- Task decomposition: Breaking complex problems into subtasks
- Attention direction: Focusing on calculations and commonsense
- Working memory support: External storage of intermediate variables
- Goal-subgoal hierarchy: Plan creates structured problem representation
Core Design Principles:
- Explicit over implicit: State the cognitive process rather than assuming it
- Phase separation: Distinct understanding, planning, and execution phases
- Attention guidance: Direct focus to error-prone areas (calculations)
- Universal applicability: Template works without task-specific modification
Structural Patterns
Minimal Pattern (Basic PS):
Q: [Problem statement]
A: Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step.
Standard Pattern (PS+):
Q: [Problem statement]
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and commonsense), solve the problem step by step, and show the answer.
Advanced Pattern (PS+ with Structured Output):
Q: [Problem statement]
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a complete plan.
**Understanding:**
[Problem comprehension]
**Variables:**
[List of extracted variables with values]
**Plan:**
1. [Step 1]
2. [Step 2]
...
**Execution:**
[Step-by-step solution following the plan]
**Answer:**
[Final answer]
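Responses in this structured format can be parsed back into their sections. A minimal sketch, assuming the model reproduces the bold `**Section:**` headers verbatim (the helper name is illustrative):

```python
import re

SECTIONS = ["Understanding", "Variables", "Plan", "Execution", "Answer"]

def parse_structured_ps(response: str) -> dict:
    """Split a structured PS+ response into its labeled sections.

    Sections the model omits come back as empty strings.
    """
    result = {name: "" for name in SECTIONS}
    # Split on any known header, keeping the header name via the capture group
    parts = re.split(r"\*\*(" + "|".join(SECTIONS) + r"):\*\*", response)
    # parts = [preamble, name1, body1, name2, body2, ...]
    for name, body in zip(parts[1::2], parts[2::2]):
        result[name] = body.strip()
    return result
```

Parsing the sections separately makes it easy to audit the plan against the execution, or to log only the final answer.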
Reasoning Patterns Used:
- Forward reasoning: Start with given information, derive conclusion
- Decomposition: Break problem into sequential subtasks
- Variable tracking: Maintain explicit record of values
- Verification: Check calculations and commonsense validity
Modifications for Scenarios
High Complexity Problems:
- Extend the planning phase with more detailed subtask breakdown
- Add explicit dependency tracking between steps
- Include verification checkpoints within the plan
Ambiguous Problems:
- Strengthen the understanding phase
- Add assumption clarification to the plan
- Include multiple interpretation handling
Domain-Specific Adaptation:
# For physics problems:
"...extract relevant variables, their values, and their units, and devise a plan. Then, let's carry out the plan, apply relevant formulas, calculate intermediate results (pay attention to unit consistency and physical reasonableness)..."
# For financial calculations:
"...extract relevant financial variables (prices, rates, periods), and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to percentage calculations and time value of money)..."
Format-Critical Tasks:
- Add format specification to the execution phase
- Include output validation in the completion
Token-Constrained Scenarios:
- Use minimal PS pattern
- Focus on planning without extensive variable extraction
- Compress to: "Let's devise a plan and solve the problem step by step."
Applications and Task Selection
General Applications
Mathematical Reasoning:
PS prompting excels at mathematical word problems requiring multiple computation steps. The technique is particularly effective for:
- Arithmetic word problems (addition, subtraction, multiplication, division sequences)
- Algebraic reasoning requiring variable isolation
- Multi-step percentage and ratio calculations
- Problems requiring intermediate value tracking
- Sequential dependency calculations (where later steps depend on earlier results)
Example Application:
Problem: A store sells apples for $2 each and oranges for $3 each. John buys 5 apples and some oranges. He pays with a $50 bill and receives $28 in change. How many oranges did he buy?
PS+ Response:
Understanding: John buys apples at $2 each and oranges at $3 each. He pays with $50 and receives $28 in change.
Variables:
- Apple price = $2
- Orange price = $3
- Apples bought = 5
- Money paid = $50
- Change received = $28
Plan:
1. Calculate total amount spent
2. Calculate cost of apples
3. Calculate remaining amount spent on oranges
4. Determine number of oranges
Execution:
Step 1: Total spent = $50 - $28 = $22
Step 2: Apple cost = 5 × $2 = $10
Step 3: Orange cost = $22 - $10 = $12
Step 4: Oranges bought = $12 ÷ $3 = 4
Check: 5 × $2 + 4 × $3 = $22 spent, and $50 - $22 = $28 change, consistent with the problem.
The answer is 4 oranges.
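The example's plan reduces to four lines of arithmetic. A quick sketch that recomputes it and flags internally inconsistent inputs (the function name and the divisibility check are illustrative additions):

```python
def oranges_bought(paid, change, apple_price, apples, orange_price):
    """Re-run the example's four plan steps as plain arithmetic."""
    spent = paid - change                      # Step 1: total amount spent
    apple_cost = apple_price * apples          # Step 2: cost of apples
    orange_cost = spent - apple_cost           # Step 3: amount spent on oranges
    count, remainder = divmod(orange_cost, orange_price)  # Step 4: orange count
    # A nonzero remainder means the problem statement is internally inconsistent
    return count if remainder == 0 else None
```

For instance, `oranges_bought(50, 28, 2, 5, 3)` returns 4, while inputs whose orange spend is not a multiple of the orange price return None, mirroring the "pay attention to calculation and commonsense" check.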
Logical and Symbolic Reasoning:
- Deductive reasoning chains
- Letter manipulation tasks (last letter concatenation)
- State tracking problems (coin flip sequences)
- Constraint satisfaction with multiple conditions
Commonsense Reasoning:
- Multi-hop reasoning requiring world knowledge
- Strategy questions requiring planning
- Causal reasoning chains
- Social reasoning with implicit rules
Domain-Specific Applications
Educational Settings:
PS prompting creates pedagogically valuable outputs showing complete reasoning processes:
- Worked example generation for tutoring systems
- Step-by-step solution explanations
- Error identification through plan-execution comparison
- Assessment of student reasoning strategies
Scientific Problem Solving:
- Physics problems with unit conversion and formula application
- Chemistry stoichiometry calculations
- Biology population dynamics modeling
- Engineering calculations with multi-step dependencies
Financial Analysis:
- Investment return calculations with compounding
- Loan amortization schedules
- Tax computation with multiple brackets
- Budget allocation problems
Code Generation (Indirect):
PS prompting informs code-specific variants like Self-Planning:
- Algorithm design before implementation
- Function decomposition planning
- Test case generation strategy
- Debugging approach formulation
Unconventional Applications:
- Recipe scaling: Plan ingredient adjustments, execute calculations
- Travel planning: Decompose logistics, calculate times and costs
- Project estimation: Break down tasks, estimate durations
- Decision analysis: Structure options, evaluate trade-offs
Selection Framework
Problem Characteristics That Favor PS Prompting:
| Characteristic                  | Suitability | Reason                           |
| ------------------------------- | ----------- | -------------------------------- |
| Multi-step required             | High        | Planning prevents missing steps  |
| Numerical calculations          | High        | Variable tracking reduces errors |
| Clear decomposition possible    | High        | Plan structure matches problem   |
| Dependencies between steps      | High        | Plan captures order requirements |
| Zero examples available         | High        | No few-shot examples needed      |
| Moderate complexity (4-8 steps) | High        | Planning overhead justified      |
Problem Characteristics That Disfavor PS Prompting:
| Characteristic                  | Suitability | Reason                                |
| ------------------------------- | ----------- | ------------------------------------- |
| Single-step problems            | Low         | Planning overhead not justified       |
| Creative/open-ended tasks       | Low         | Rigid planning constrains exploration |
| Semantic understanding required | Low         | PS doesn't improve comprehension      |
| Pattern matching tasks          | Low         | No decomposition needed               |
| Time-critical applications      | Medium      | Planning adds latency                 |
Selection Signals:
Use PS prompting when:
- Zero-shot-CoT produces missing-step errors
- The problem has clear sequential structure
- Calculation accuracy is important
- You need consistent reasoning format
- No domain-specific examples are available
Avoid PS prompting when:
- The task is simple enough for direct answering
- Creative exploration is desired
- The problem requires deep semantic understanding
- Latency is critical and problem is straightforward
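The use/avoid signals above can be condensed into a toy selection heuristic. The thresholds and return labels here are illustrative assumptions, not rules from the paper:

```python
def suggest_prompting(steps: int, needs_calculation: bool,
                      creative: bool, has_examples: bool) -> str:
    """Toy heuristic condensing the selection signals above."""
    if creative:
        return "open-ended prompting (avoid rigid planning)"
    if steps <= 1:
        return "direct answering"
    if has_examples:
        return "few-shot CoT"
    # Multi-step and zero-shot: PS family; PS+ when calculation accuracy matters
    return "PS+" if needs_calculation else "PS"
```

In practice such a check is a starting point; error analysis on a sample of real problems should override it.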
Model Requirements:
| Specification | Requirement                           |
| ------------- | ------------------------------------- |
| Minimum       | 7B+ parameters (results inconsistent) |
| Recommended   | 70B+ parameters or GPT-3.5+           |
| Optimal       | GPT-4, Claude 3+, or equivalent       |
| Not suitable  | Models under 7B parameters            |
Required Capabilities:
- Strong instruction following
- Multi-step reasoning ability
- Arithmetic computation skills
- Coherent long-form generation
Context and Resource Requirements:
| Metric           | PS         | PS+     |
| ---------------- | ---------- | ------- |
| Prompt tokens    | ~50        | ~80     |
| Response tokens  | 100-300    | 150-400 |
| Latency overhead | +10-20%    | +15-30% |
| Total context    | Low-medium | Medium  |
Cost Implications:
- One-time costs: None (no example curation or optimization required)
- Per-request costs: Slightly higher due to longer prompts and responses
- Quality-cost trade-off: PS+ provides better accuracy for modest token increase
- Compared to few-shot: Lower total tokens (no examples) despite longer trigger
When to Use vs When NOT to Use:
Use PS prompting:
- Multi-step mathematical reasoning tasks
- Problems where Zero-shot-CoT shows missing-step errors
- When consistent structured output is needed
- Zero-shot scenarios without available examples
- When calculation accuracy is important
Do NOT use PS prompting:
- Simple factual questions
- Classification tasks
- Creative writing or generation
- When few-shot examples are readily available and effective
- Tasks requiring semantic inference over calculation
Escalation to Alternatives:
| Condition                | Alternative Technique                    |
| ------------------------ | ---------------------------------------- |
| PS+ accuracy < 60%       | Consider few-shot CoT with examples      |
| Semantic errors dominate | Try Role prompting or context enrichment |
| Latency critical         | Use simpler Zero-shot-CoT                |
| Complex multi-turn       | Consider ReAct or agent frameworks       |
| Very complex problems    | Least-to-Most or Tree of Thoughts        |
Variant Selection:
| Variant               | Best For                                        |
| --------------------- | ----------------------------------------------- |
| Basic PS              | Quick deployment, token-constrained settings    |
| PS+                   | Mathematical reasoning, calculation-heavy tasks |
| PS + Self-consistency | High-stakes decisions requiring reliability     |
| PS + Verification     | Applications requiring auditability             |
Implementation
Implementation Steps
Step 1: Problem Preparation
Ensure the problem is well-formulated:
- Clear question or objective
- All necessary information provided
- Unambiguous constraints and conditions
Step 2: Select PS Variant
Choose based on requirements:
- Basic PS for general use
- PS+ for calculation-intensive tasks
- Extended PS for domain-specific needs
Step 3: Construct Prompt
Combine problem with PS trigger:
def construct_ps_prompt(problem, variant="ps+"):
    triggers = {
        "basic": "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step.",
        "ps+": "Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and commonsense), solve the problem step by step, and show the answer.",
        "minimal": "Let's devise a plan and solve the problem step by step."
    }
    return f"Q: {problem}\n\nA: {triggers[variant]}"
Step 4: Execute Inference
Send prompt to the model and collect response.
Step 5: Extract Answer
Parse the response to extract the final answer:
import re

def extract_answer(response):
    # Look for common answer patterns, most specific first
    patterns = [
        r"[Tt]he answer is[:\s]*([^\.\n]+)",
        r"[Aa]nswer[:\s]*([^\.\n]+)",
        r"####\s*([^\n]+)",
        r"= ([^\.\n]+)$"
    ]
    for pattern in patterns:
        match = re.search(pattern, response)
        if match:
            return match.group(1).strip()
    return None
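Putting steps 3-5 together, a minimal end-to-end sketch; `model_call` is a stand-in for any inference function and is stubbed here for illustration:

```python
import re

def run_ps(problem, model_call, trigger):
    """Construct the PS prompt, run inference, and extract the final answer."""
    prompt = f"Q: {problem}\n\nA: {trigger}"
    response = model_call(prompt)
    match = re.search(r"[Tt]he answer is[:\s]*([^\.\n]+)", response)
    return match.group(1).strip() if match else None

# Stubbed model call standing in for any inference API
fake_model = lambda prompt: "Plan: add the two numbers.\nStep 1: 2 + 3 = 5.\nThe answer is 5."
print(run_ps("What is 2 + 3?", fake_model,
             "Let's devise a plan and solve the problem step by step."))  # prints: 5
```

Swapping `fake_model` for a real API call (as in the OpenAI and Claude examples above) completes the pipeline.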
Platform-Specific Implementations
OpenAI API (Python):
from openai import OpenAI
client = OpenAI()
def ps_plus_solve(problem: str) -> str:
    prompt = f"""Q: {problem}
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and commonsense), solve the problem step by step, and show the answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1024
    )
    return response.choices[0].message.content
# Example usage
problem = "A farmer has 15 chickens and 12 cows. How many total legs are there?"
solution = ps_plus_solve(problem)
print(solution)
Anthropic Claude (Python):
import anthropic
client = anthropic.Anthropic()
def ps_plus_solve_claude(problem: str) -> str:
    prompt = f"""Q: {problem}
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and commonsense), solve the problem step by step, and show the answer."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
# Example usage
problem = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
solution = ps_plus_solve_claude(problem)
print(solution)
LangChain Integration:
LangChain provides a built-in Plan-and-Execute agent framework inspired by PS prompting:
from langchain.chat_models import ChatOpenAI
from langchain_experimental.plan_and_execute import (
PlanAndExecute,
load_agent_executor,
load_chat_planner
)
from langchain.agents.tools import Tool
from langchain.utilities import SerpAPIWrapper
from langchain.chains import LLMMathChain
# Set up tools
llm = ChatOpenAI(temperature=0, model="gpt-4")
search = SerpAPIWrapper()
llm_math = LLMMathChain.from_llm(llm=llm)
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful for searching current information"
    ),
    Tool(
        name="Calculator",
        func=llm_math.run,
        description="Useful for mathematical calculations"
    )
]
# Create plan-and-execute agent
planner = load_chat_planner(llm)
executor = load_agent_executor(llm, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)
# Run
result = agent.run("What is the population of France multiplied by 2?")
DSPy Implementation:
import dspy
class PlanAndSolve(dspy.Signature):
    """Solve a problem by first planning then executing."""
    problem = dspy.InputField(desc="The problem to solve")
    plan = dspy.OutputField(desc="Step-by-step plan to solve the problem")
    solution = dspy.OutputField(desc="Executed solution following the plan")
    answer = dspy.OutputField(desc="Final answer")

class PSPromptModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought(PlanAndSolve)

    def forward(self, problem):
        return self.solve(problem=problem)
# Usage
lm = dspy.OpenAI(model="gpt-4", temperature=0)
dspy.settings.configure(lm=lm)
ps_module = PSPromptModule()
result = ps_module("A store has 45 items. 12 are sold, then 28 more arrive. How many items now?")
Configuration
Key Parameters:
| Parameter         | Recommended Value | Reasoning                                 |
| ----------------- | ----------------- | ----------------------------------------- |
| Temperature       | 0 - 0.3           | Lower values for consistent reasoning     |
| Max tokens        | 512 - 1024        | Allow space for full plan + execution     |
| Top-p             | 0.95              | Slightly constrained sampling             |
| Frequency penalty | 0                 | Don't penalize repetition in calculations |
| Stop sequences    | None typically    | Let model complete naturally              |
Task-Specific Tuning:
- Mathematical reasoning: Temperature 0, max tokens 512
- Complex multi-step: Temperature 0.1, max tokens 1024
- Exploratory reasoning: Temperature 0.3, max tokens 768
Domain Adaptation:
Modify the PS+ trigger for domain-specific focus:
domain_triggers = {
    "physics": "...extract relevant variables with units, identify applicable formulas, and devise a plan. Then apply formulas, calculate intermediate results (pay attention to unit consistency)...",
    "finance": "...extract monetary values, rates, and time periods, and devise a plan. Then calculate intermediate results (pay attention to percentage conversions and compounding)...",
    "programming": "...identify inputs, outputs, and constraints, and devise an algorithm plan. Then implement step by step (pay attention to edge cases and data types)..."
}
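One way to wire such domain triggers into prompt construction is sketched below. The fully spelled-out trigger sentences are assumptions that expand the elided templates using the PS+ wording given earlier, and the fallback behavior is an illustrative design choice:

```python
DOMAIN_TRIGGERS = {
    "physics": ("Let's first understand the problem, extract relevant variables "
                "with units, identify applicable formulas, and devise a plan. Then "
                "apply formulas, calculate intermediate results (pay attention to "
                "unit consistency), solve the problem step by step, and show the answer."),
    "finance": ("Let's first understand the problem, extract monetary values, rates, "
                "and time periods, and devise a plan. Then calculate intermediate "
                "results (pay attention to percentage conversions and compounding), "
                "solve the problem step by step, and show the answer."),
}

GENERIC_TRIGGER = ("Let's first understand the problem and devise a plan to solve the "
                   "problem. Then, let's carry out the plan and solve the problem step by step.")

def domain_ps_prompt(problem: str, domain: str) -> str:
    """Build a domain-adapted PS+ prompt, falling back to the generic PS trigger."""
    trigger = DOMAIN_TRIGGERS.get(domain, GENERIC_TRIGGER)
    return f"Q: {problem}\n\nA: {trigger}"
```

Falling back to the generic trigger keeps the helper safe for domains without a tailored template.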
Best Practices and Workflow
Implementation Workflow:
- Identify candidate problems: Multi-step reasoning required
- Select PS variant: Basic, PS+, or domain-adapted
- Construct prompt: Problem + trigger
- Run inference: With appropriate parameters
- Extract answer: Parse response
- Validate: Check answer reasonableness
- Iterate: Adjust trigger if needed
Do's:
- Use PS+ for calculation-intensive tasks
- Keep temperature low for consistent reasoning
- Allow sufficient tokens for complete responses
- Parse and validate extracted answers
- Test on representative problems before deployment
Don'ts:
- Don't use for simple single-step problems
- Don't expect improvement on semantic understanding tasks
- Don't ignore extraction failures—they indicate reasoning problems
- Don't use high temperature—planning benefits from consistency
- Don't truncate responses mid-reasoning
Common Prompt Patterns:
# Pattern 1: Standard PS+
trigger_standard = """Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and commonsense), solve the problem step by step, and show the answer."""
# Pattern 2: Structured output
trigger_structured = """Let's solve this step by step:
1. First, I'll understand the problem and identify what we need to find.
2. Then, I'll extract relevant variables and their values.
3. Next, I'll devise a plan with clear steps.
4. Finally, I'll execute the plan and calculate the answer.
Let me begin:"""
# Pattern 3: Minimal
trigger_minimal = """Let's devise a plan and solve the problem step by step."""
Debugging Decision Tree
Problem: Inconsistent Outputs
Symptom: Different answers for same problem
├── Root cause: Temperature too high
│ └── Solution: Set temperature to 0
├── Root cause: Ambiguous problem statement
│ └── Solution: Clarify problem before prompting
└── Root cause: Model capability variance
└── Solution: Use self-consistency (multiple samples + voting)
Problem: Missing Steps in Output
Symptom: Plan or execution incomplete
├── Root cause: Max tokens too low
│ └── Solution: Increase max_tokens
├── Root cause: Basic PS instead of PS+
│ └── Solution: Use PS+ for detailed variable extraction
└── Root cause: Problem too complex for single pass
└── Solution: Break into sub-problems
Problem: Calculation Errors
Symptom: Arithmetic mistakes in solution
├── Root cause: Not using PS+ variant
│ └── Solution: Switch to PS+ with calculation attention
├── Root cause: Complex calculations without intermediate steps
│ └── Solution: Add "show all work" to trigger
└── Root cause: Model limitation
└── Solution: Use code interpreter or calculator tool
Problem: Format Violations
Symptom: Answer not extractable
├── Root cause: Missing answer marker
│ └── Solution: Add explicit "show the answer" instruction
├── Root cause: Extraction regex too narrow
│ └── Solution: Broaden answer pattern matching
└── Root cause: Model ended without conclusion
└── Solution: Increase max tokens or add stopping instruction
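For the "extraction regex too narrow" branch, a broadened extractor might try several patterns in order, from an explicit answer marker down to the last number in the text. This is a sketch; the patterns are assumptions to adapt to your output format.

```python
import re

def extract_answer(response: str):
    """Try progressively looser patterns to pull out a final numeric answer."""
    patterns = [
        r"(?:final answer|answer)\s*(?:is|:)?\s*\$?(-?\d[\d,]*(?:\.\d+)?)",  # "Answer: 42"
        r"=\s*\$?(-?\d[\d,]*(?:\.\d+)?)\s*$",      # trailing "= 42"
        r"(-?\d[\d,]*(?:\.\d+)?)(?!.*\d)",         # fallback: last number in the text
    ]
    for pattern in patterns:
        match = re.search(pattern, response, re.IGNORECASE | re.MULTILINE | re.DOTALL)
        if match:
            return match.group(1).replace(",", "")
    return None
```

Returning `None` (rather than a guess) keeps extraction failures visible, per the debugging tree above.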
Problem: Poor Quality Despite PS+
Symptom: Incorrect reasoning despite planning
├── Root cause: Semantic misunderstanding
│ └── Solution: Add problem rephrasing step
├── Root cause: Domain knowledge gap
│ └── Solution: Add domain context or use few-shot
└── Root cause: Model capability insufficient
└── Solution: Upgrade to more capable model
Common Mistakes:
- Using PS prompting for simple factual questions (overhead not justified)
- Expecting PS to fix semantic understanding issues (it doesn't)
- Setting temperature too high (undermines planning consistency)
- Insufficient max tokens (truncates reasoning)
- Not extracting answers systematically (manual review doesn't scale)
Testing and Optimization
Validation Strategy:
- Holdout testing: Reserve 20% of problems for final evaluation
- Stratified sampling: Include easy, medium, hard problems
- Error categorization: Track calculation, missing-step, semantic errors
- Baseline comparison: Always compare against Zero-shot-CoT
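The holdout and stratified-sampling steps can be combined in one split helper. A pure-Python sketch, assuming each problem is a dict carrying a `difficulty` field:

```python
import random
from collections import defaultdict

def stratified_holdout(problems, test_fraction=0.2, seed=42):
    """Reserve a fixed fraction of each difficulty stratum for final evaluation."""
    rng = random.Random(seed)
    by_difficulty = defaultdict(list)
    for problem in problems:
        by_difficulty[problem["difficulty"]].append(problem)
    train, test = [], []
    for stratum in by_difficulty.values():
        rng.shuffle(stratum)
        cut = max(1, int(len(stratum) * test_fraction))
        test.extend(stratum[:cut])     # holdout set, never used for tuning
        train.extend(stratum[cut:])
    return train, test
```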
Test Coverage Requirements:
| Category                                 | Coverage |
| ---------------------------------------- | -------- |
| Happy path (solvable problems)           | 70%      |
| Edge cases (unusual values, zero)        | 15%      |
| Boundary conditions (max/min values)     | 10%      |
| Adversarial (ambiguous, trick questions) | 5%       |
Quality Metrics:
| Metric                    | Application                         |
| ------------------------- | ----------------------------------- |
| Accuracy                  | Primary measure for reasoning tasks |
| Answer extraction rate    | Measures format compliance          |
| Plan quality (human eval) | Assesses reasoning structure        |
| Step completion rate      | Measures missing-step prevention    |
| Consistency (across runs) | Measures reliability                |
Optimization Techniques:
Token Reduction:
# Minimal trigger saves ~40 tokens vs PS+
minimal_trigger = "Let's devise a plan and solve the problem step by step."
# Still effective, but with lower calculation accuracy
Caching Strategies:
- Cache identical problems (deterministic with temp=0)
- Cache problem templates for parameterized queries
- Pre-compute trigger embeddings for efficiency
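Because temperature 0 makes responses deterministic for a given prompt, the first caching strategy reduces to a hash-keyed memo. A sketch, where `solve_fn` stands in for the model call:

```python
import hashlib

_cache = {}

def cached_ps_solve(problem, solve_fn):
    """Memoize deterministic (temperature=0) PS responses by normalized problem hash."""
    key = hashlib.sha256(problem.strip().lower().encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = solve_fn(problem)
    return _cache[key]
```

Normalizing whitespace and case before hashing lets trivially different phrasings of the same problem share a cache entry.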
Consistency Techniques:
from collections import Counter

def ps_with_consistency(problem, n_samples=5):
    """Run PS+ multiple times and take majority vote."""
    answers = []
    for _ in range(n_samples):
        response = ps_plus_solve(problem)
        answer = extract_answer(response)
        if answer:
            answers.append(answer)
    # Majority voting
    if answers:
        return Counter(answers).most_common(1)[0][0]
    return None
A/B Testing Approach:
- Define metric (accuracy on test set)
- Split traffic between PS variants
- Collect sufficient samples (100+ per variant)
- Statistical significance test (chi-squared for accuracy)
- Roll out winning variant
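The significance check in step 4 can be run without external dependencies using a 2×2 chi-squared test on the two variants' correct/incorrect counts. A sketch; 3.841 is the critical value for p < 0.05 at one degree of freedom.

```python
def chi_squared_ab(correct_a, total_a, correct_b, total_b):
    """2x2 chi-squared test comparing the accuracy of two PS variants."""
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    n = total_a + total_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n   # assumes no empty row/column
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, chi2 > 3.841  # significant at p < 0.05, df = 1
```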
Iteration Criteria:
Stop optimizing when:
- Accuracy plateaus across variant tests
- Further gains require disproportionate complexity
- Production constraints (latency, cost) are met
- Error distribution shows mostly semantic errors (PS can't help)
Limitations and Constraints
Known Limitations
Fundamental Limitations (Cannot Be Overcome):
- Semantic Misunderstanding: PS prompting does not improve the model's ability to understand problem semantics. If the model misinterprets what the problem is asking, no amount of planning will help. Error analysis shows semantic errors remain at 27% with both Zero-shot-CoT and PS+.
- Knowledge Limitations: Planning cannot compensate for missing factual knowledge. If the model doesn't know a formula or fact needed for the solution, PS prompting won't help.
- Inherent Model Capabilities: PS prompting amplifies existing reasoning capabilities but doesn't create new ones. Small models that can't reason well won't suddenly perform well with PS.
Inefficient Problem Types:
- Simple factual retrieval: Planning overhead not justified
- Pattern matching tasks: No decomposition needed
- Creative generation: Rigid planning constrains creativity
- Single-step calculations: Planning adds unnecessary verbosity
- Classification tasks: Direct prediction is sufficient
Behavior Under Non-Ideal Conditions:
| Condition            | Behavior                                                   |
| -------------------- | ---------------------------------------------------------- |
| Ambiguous problem    | Plans based on one interpretation, may solve wrong problem |
| Missing information  | Plans around gap, may make incorrect assumptions           |
| Token limit reached  | Truncated reasoning, incomplete answers                    |
| Very complex problem | Plan may be superficial, execution incomplete              |
Edge Cases
Problematic Edge Cases:
- Ambiguous problems: When multiple interpretations exist, PS will plan for one without acknowledging alternatives.
- Conflicting constraints: Problems with impossible conditions may generate plans that fail during execution.
- Out-of-domain problems: The PS trigger is optimized for reasoning tasks; creative or generative tasks may show degraded performance.
- Circular dependencies: Problems where step N depends on step M, which in turn depends on step N, may cause planning failures.
- Very large numbers: Calculation accuracy degrades with numbers beyond the typical training distribution.
Edge Case Detection:
import re

def detect_edge_cases(problem):
    warnings = []
    text = problem.lower()
    # Check for ambiguity signals (whole words only, so "store" doesn't match "or")
    if any(re.search(rf"\b{word}\b", text) for word in ["or", "either", "might"]):
        warnings.append("Potential ambiguity detected")
    # Check for large numbers
    numbers = re.findall(r"\d+", problem)
    if any(int(n) > 1000000 for n in numbers):
        warnings.append("Large numbers may reduce accuracy")
    # Check for missing information signals
    if re.search(r"\bunknown\b|\bsome\b", text):
        warnings.append("Possible missing information")
    return warnings
Graceful Degradation Strategies:
- Ambiguity: Add clarification request before PS prompt
- Missing info: State assumptions explicitly in plan
- Complexity overflow: Break into sub-problems with chained PS
- Out-of-domain: Fall back to general Zero-shot-CoT
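These fallbacks can be wired into a single dispatcher. A sketch under stated assumptions: `warnings` is the list produced by an edge-case detector like the one above, and `ps_solve`/`cot_solve` are placeholder solver callables.

```python
def solve_with_degradation(problem, warnings, ps_solve, cot_solve,
                           reasoning_task=True):
    """Apply the graceful-degradation strategies above, in order."""
    if not reasoning_task:
        # Out-of-domain: fall back to general Zero-shot-CoT
        return cot_solve(problem)
    prefix = ""
    if any("ambiguity" in w.lower() for w in warnings):
        prefix += "If the problem is ambiguous, state your interpretation first. "
    if any("missing information" in w.lower() for w in warnings):
        prefix += "State any assumptions explicitly in the plan. "
    return ps_solve(prefix + problem)
```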
Constraint Management
Balancing Competing Factors:
| Trade-off                  | PS Approach                                               |
| -------------------------- | --------------------------------------------------------- |
| Clarity vs conciseness     | PS+ prioritizes clarity; use minimal PS for conciseness   |
| Accuracy vs speed          | Planning adds latency; justified for accuracy gains       |
| Generality vs optimization | Single template trades peak performance for universality  |
| Token cost vs quality      | PS+ adds ~30 tokens for measurable accuracy gains         |
Token/Context Constraints:
def adaptive_ps_prompt(problem, max_tokens_available):
    """Choose PS variant based on available tokens."""
    # Estimated trigger overhead in tokens
    ps_plus_overhead = 80
    basic_ps_overhead = 50
    minimal_overhead = 20
    expected_response = estimate_response_length(problem)
    if max_tokens_available > ps_plus_overhead + expected_response + 100:
        return construct_ps_prompt(problem, "ps+")
    elif max_tokens_available > basic_ps_overhead + expected_response + 50:
        return construct_ps_prompt(problem, "basic")
    else:
        return construct_ps_prompt(problem, "minimal")
Handling Incomplete Information:
When problems have missing information, modify the trigger:
"Let's first understand the problem and identify any missing information. State assumptions clearly. Then devise a plan and solve the problem step by step."
Error Handling and Recovery:
def robust_ps_solve(problem, max_retries=3):
    """PS solving with retry logic."""
    for attempt in range(max_retries):
        response = ps_plus_solve(problem)
        answer = extract_answer(response)
        if answer is not None:
            # Validate answer reasonableness
            if validate_answer(problem, answer):
                return answer
            # Answer extracted but seems wrong
            problem = add_verification_instruction(problem)
        else:
            # Extraction failed, try more explicit format
            problem = add_format_instruction(problem)
    return None  # Failed after retries
Advanced Techniques
Clarity and Context Optimization
Ensuring Clarity:
The PS trigger itself promotes clarity through explicit phases. Additional clarity techniques:
- Problem rephrasing: Add "First, let me restate the problem in my own words..."
- Constraint listing: "The constraints are: ..."
- Goal statement: "We need to find: ..."
Removing Ambiguity:
clarity_enhanced_trigger = """Let's first understand the problem:
- What is being asked?
- What information is given?
- Are there any ambiguities? If so, I'll state my interpretation.
Then let's extract relevant variables, devise a plan, and solve step by step."""
Balancing Detail with Conciseness:
| Scenario            | Approach                           |
| ------------------- | ---------------------------------- |
| Simple problem      | Minimal PS trigger                 |
| Moderate complexity | Standard PS+                       |
| High stakes         | Enhanced PS+ with verification     |
| Token constrained   | Minimal with post-hoc verification |
Context Optimization:
PS prompting is relatively context-efficient since it doesn't require examples. Optimization strategies:
- Problem pruning: Remove irrelevant information before prompting
- Variable condensing: Represent lengthy conditions as symbolic variables
- Reference compression: Use abbreviations for repeated concepts
Context Length Limitations:
For very long problems:
def chunk_problem(problem, max_chunk_size=2000):
    """Break long problems into chunks with maintained context."""
    # Extract and preserve key variables across chunks
    variables = extract_key_variables(problem)
    chunks = split_by_logical_sections(problem, max_chunk_size)
    return variables, chunks
Advanced Reasoning and Output Control
Multi-Step Reasoning Structure:
For complex problems, extend the planning phase:
"Let's approach this systematically:
Phase 1 - Understanding:
- Identify the core question
- List all given information
- Note any constraints
Phase 2 - Planning:
- Break the problem into sub-problems
- Determine dependencies between steps
- Identify formulas or methods needed
Phase 3 - Execution:
- Solve each sub-problem in order
- Show all calculations
- Track intermediate results
Phase 4 - Verification:
- Check the answer makes sense
- Verify calculations
- Confirm all constraints satisfied"
Decomposition Strategies:
| Strategy     | When to Use                                        |
| ------------ | -------------------------------------------------- |
| Sequential   | Steps have clear linear dependencies               |
| Hierarchical | Problem has natural sub-problem structure          |
| Parallel     | Independent sub-problems can be solved separately  |
| Iterative    | Solution requires refinement cycles                |
Self-Verification Integration:
ps_with_verification = """Let's first understand the problem, extract relevant variables, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and commonsense), and solve step by step.
After finding an answer, let's verify:
- Does the answer make sense given the problem?
- Are all calculations correct when rechecked?
- Does the answer satisfy all constraints?
Show the verified answer."""
Uncertainty Quantification:
ps_with_uncertainty = """Let's solve this problem step by step. After reaching an answer, assess:
- Confidence in the reasoning (High/Medium/Low)
- Any assumptions that could affect the answer
- Alternative interpretations if applicable"""
Structured Output Control:
ps_structured_output = """Solve the following problem and format your response as:
UNDERSTANDING:
[Your understanding of the problem]
VARIABLES:
[List of variables with values]
PLAN:
[Numbered list of steps]
EXECUTION:
[Step-by-step solution]
ANSWER:
[Final answer]
Problem: {problem}"""
JSON Output:
ps_json_output = """Solve the following problem. Return your response as JSON with this structure:
{{
"understanding": "problem comprehension",
"variables": {{"var1": value1, "var2": value2}},
"plan": ["step1", "step2", "step3"],
"execution": ["result1", "result2", "result3"],
"answer": "final answer"
}}
Problem: {problem}"""
Constraint Enforcement:
For hard constraints:
ps_constrained = """Solve this problem with the following constraints:
- Answer must be a positive integer
- Show all intermediate calculations
- Use SI units throughout
Let's first understand the problem, extract variables, devise a plan, then solve step by step ensuring all constraints are met."""
Interaction Patterns
Conversational PS (Multi-Turn):
def conversational_ps(conversation_history, new_input):
    """Maintain PS reasoning across conversation turns."""
    # Summarize previous reasoning context
    context = summarize_previous_turns(conversation_history)
    prompt = f"""Previous context:
{context}

New input: {new_input}

Let's update our understanding, revise the plan if needed, and continue solving step by step."""
    return generate(prompt)
Iterative Refinement:
def iterative_ps(problem, max_iterations=3):
    """Iteratively refine PS solution."""
    solution = ps_plus_solve(problem)
    for i in range(max_iterations):
        # Check for errors
        verification = verify_solution(problem, solution)
        if verification["correct"]:
            return solution
        # Refine based on errors
        refinement_prompt = f"""Previous solution attempt:
{solution}

Issues identified:
{verification['issues']}

Let's revise our plan to address these issues and solve again."""
        solution = generate(refinement_prompt)
    return solution
Chaining PS with Other Techniques:
def ps_chain_with_retrieval(problem, knowledge_base):
    """Chain PS with knowledge retrieval."""
    # Step 1: Identify knowledge needs
    knowledge_query = f"What knowledge is needed to solve: {problem}"
    relevant_knowledge = retrieve(knowledge_base, knowledge_query)
    # Step 2: PS with retrieved context
    enhanced_prompt = f"""Given this relevant knowledge:
{relevant_knowledge}

Problem: {problem}

Let's first understand the problem using the provided knowledge, extract relevant variables, devise a plan, and solve step by step."""
    return generate(enhanced_prompt)
Error Propagation Management:
When chaining PS prompts, errors can propagate. Mitigation strategies:
- Validate intermediate outputs: Check each chain output before passing forward
- Include context summaries: Reduce accumulated context to essentials
- Add checkpoints: Verify reasoning at critical points
- Enable backtracking: Allow revision of earlier steps if later steps fail
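The checkpoint and backtracking strategies above can be combined in a chained-PS sketch. The `solve_fn` and `validate_fn` callables are assumptions standing in for your model call and stage validator.

```python
def chained_ps(stages, solve_fn, validate_fn, max_revisions=1):
    """Run PS stages in sequence, validating each output before passing it on."""
    context = ""
    for stage in stages:
        prompt = (context + "\n" if context else "") + stage
        output = solve_fn(prompt)
        revisions = 0
        # Backtrack: retry this stage instead of propagating a bad output
        while not validate_fn(stage, output) and revisions < max_revisions:
            output = solve_fn(prompt + "\nThe previous attempt failed validation; revise the plan.")
            revisions += 1
        if not validate_fn(stage, output):
            return None  # stop the chain rather than compound the error
        context = output  # pass only the validated result forward
    return context
```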
Model Considerations
Model-Specific Behavior:
| Model Family          | PS Behavior                                        | Recommendations                          |
| --------------------- | -------------------------------------------------- | ---------------------------------------- |
| GPT-4 / GPT-4o        | Excellent plan quality, consistent execution       | Use full PS+                             |
| GPT-3.5-turbo         | Good performance, occasional calculation errors    | Use PS+ with verification                |
| Claude 3+             | Strong instruction following, verbose plans        | Works well, may need conciseness tuning  |
| Llama-2-70B           | Variable results, benefits from explicit structure | Use structured output PS                 |
| Smaller models (<13B) | Inconsistent, may not follow instructions          | Consider alternatives                    |
Model Capability Verification:
Before deploying PS with a new model:
def verify_model_ps_capability(model, test_problems):
    """Test if model handles PS prompting well."""
    results = {
        "follows_format": 0,
        "completes_plan": 0,
        "correct_answers": 0
    }
    for problem, expected_answer in test_problems:
        response = ps_solve_with_model(model, problem)
        if has_plan_structure(response):
            results["follows_format"] += 1
        if has_complete_execution(response):
            results["completes_plan"] += 1
        if extract_answer(response) == expected_answer:
            results["correct_answers"] += 1
    total = len(test_problems)
    return {k: v / total for k, v in results.items()}
Cross-Model Portability:
PS prompting is relatively portable across models because:
- Simple, clear instructions
- No model-specific syntax
- Doesn't rely on specific training data
For cross-model deployment:
- Start with minimal PS trigger
- Test format compliance
- Adjust verbosity based on model tendencies
- Verify calculation accuracy
Model Version Handling:
def adaptive_ps_for_model(model_name, problem):
    """Adapt PS based on known model characteristics."""
    model_configs = {
        "gpt-4": {"trigger": "ps+", "temperature": 0},
        "gpt-3.5-turbo": {"trigger": "ps+", "temperature": 0},
        "claude-3": {"trigger": "ps+", "temperature": 0.1},
        "llama-70b": {"trigger": "structured", "temperature": 0.1},
    }
    config = model_configs.get(model_name, {"trigger": "basic", "temperature": 0})
    return ps_solve(problem, **config)
Evaluation and Efficiency
Effectiveness Metrics:
| Metric               | Calculation                   | Target              |
| -------------------- | ----------------------------- | ------------------- |
| Answer accuracy      | correct / total               | >80% for math tasks |
| Plan completion rate | complete_plans / total        | >95%                |
| Missing-step rate    | problems_with_missing / total | <10%                |
| Extraction success   | extracted / total             | >98%                |
| Consistency          | same_answer_rate across runs  | >95% (temp=0)       |
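Several of these metrics can be computed from a list of per-run records. A sketch, assuming each record carries the extracted answer (or `None`), the expected answer, and a plan-completeness flag:

```python
def effectiveness_metrics(records):
    """records: list of dicts with keys 'answer', 'expected', 'plan_complete'."""
    total = len(records)
    extracted = [r for r in records if r["answer"] is not None]
    return {
        # Failed extractions count as wrong, so accuracy divides by total
        "accuracy": sum(r["answer"] == r["expected"] for r in extracted) / total,
        "extraction_rate": len(extracted) / total,
        "plan_completion_rate": sum(r["plan_complete"] for r in records) / total,
    }
```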
Human Evaluation Role:
Human evaluation is valuable for:
- Plan quality assessment (logical, complete, appropriate)
- Reasoning coherence
- Error categorization (calculation vs. semantic vs. missing-step)
- Domain-specific correctness
Custom Benchmark Creation:
def create_ps_benchmark(domain, difficulty_levels):
    """Create domain-specific benchmark for PS evaluation."""
    benchmark = []
    for difficulty in difficulty_levels:
        problems = generate_problems(domain, difficulty, count=50)
        for problem in problems:
            benchmark.append({
                "problem": problem["text"],
                "answer": problem["answer"],
                "difficulty": difficulty,
                "steps_required": problem["steps"],
                "domain": domain
            })
    return benchmark
Token Optimization:
# Token usage comparison
def compare_token_usage():
    problem = "If a train travels at 60 mph for 2 hours, how far does it travel?"
    # Count tokens for each approach
    variants = {
        "minimal": "Let's devise a plan and solve the problem step by step.",
        "basic": "Let's first understand the problem and devise a plan. Then solve step by step.",
        "ps+": "Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and commonsense), solve the problem step by step, and show the answer."
    }
    for name, trigger in variants.items():
        full_prompt = f"Q: {problem}\n\nA: {trigger}"
        tokens = count_tokens(full_prompt)  # count_tokens: any tokenizer, e.g. tiktoken
        print(f"{name}: {tokens} tokens")
Typical token comparison:
- Minimal: ~40 prompt tokens
- Basic PS: ~60 prompt tokens
- PS+: ~90 prompt tokens
- Response: 100-300 tokens additional for reasoning
Latency Reduction:
- Use minimal trigger for simple problems: Reduce prompt size
- Set appropriate max_tokens: Don't over-allocate
- Streaming responses: Start processing before generation completes
- Batch similar problems: Amortize API overhead
- Cache deterministic results: Temperature 0 enables caching
Parallel Processing:
import asyncio

async def parallel_ps_solve(problems):
    """Solve multiple problems in parallel."""
    async def solve_one(problem):
        return await async_ps_plus_solve(problem)
    tasks = [solve_one(p) for p in problems]
    results = await asyncio.gather(*tasks)
    return results
Safety, Robustness, and Domain Adaptation
Prompt Injection Protection:
PS prompting's structured nature provides some protection against injection:
- Clear phase separation makes injection harder
- Explicit instruction structure reduces ambiguity
Additional protection:
import re

def sanitize_problem(problem):
    """Sanitize user input before PS prompting."""
    # Remove potential injection patterns
    suspicious_patterns = [
        r"ignore previous",
        r"disregard above",
        r"new instructions",
        r"system:",
    ]
    for pattern in suspicious_patterns:
        if re.search(pattern, problem.lower()):
            return None  # Reject suspicious input
    return problem
Input Validation:
def validate_problem_input(problem):
    """Validate problem before PS processing."""
    checks = {
        "not_empty": len(problem.strip()) > 0,
        "reasonable_length": len(problem) < 10000,
        "contains_question": "?" in problem or any(word in problem.lower() for word in ["find", "calculate", "what", "how"]),
        "no_injection": not contains_injection_patterns(problem)
    }
    return all(checks.values()), checks
Output Safety:
PS prompting focuses on reasoning, which generally produces safe outputs. Considerations:
- Verify answers don't contain harmful content
- Validate numerical answers are reasonable
- Check for leaked sensitive information in reasoning
Consistency Techniques:
import random
from collections import Counter

def ensure_consistency(problem, n_samples=5, threshold=0.6):
    """Ensure consistent answers through multiple sampling."""
    answers = []
    for _ in range(n_samples):
        # Use slightly different temperatures for diversity
        temp = random.uniform(0, 0.2)
        response = ps_solve(problem, temperature=temp)
        answer = extract_answer(response)
        if answer:
            answers.append(normalize_answer(answer))
    # Check consistency
    if not answers:
        return None, 0
    counter = Counter(answers)
    most_common, count = counter.most_common(1)[0]
    confidence = count / len(answers)
    if confidence >= threshold:
        return most_common, confidence
    else:
        return None, confidence  # Inconsistent results
Quality Degradation Monitoring:
class PSQualityMonitor:
    def __init__(self, baseline_accuracy):
        self.baseline = baseline_accuracy
        self.recent_results = []
        self.window_size = 100

    def record_result(self, correct):
        self.recent_results.append(correct)
        if len(self.recent_results) > self.window_size:
            self.recent_results.pop(0)

    def check_degradation(self, threshold=0.1):
        if len(self.recent_results) < 20:
            return False
        current_accuracy = sum(self.recent_results) / len(self.recent_results)
        degradation = self.baseline - current_accuracy
        return degradation > threshold
Domain Adaptation:
domain_prompts = {
    "medical": """Let's first understand the clinical scenario, identify relevant medical variables (symptoms, lab values, patient factors), and devise a diagnostic or treatment plan. Then, let's carry out the plan, apply clinical reasoning (pay attention to contraindications and standard of care), and solve step by step.""",
    "legal": """Let's first understand the legal question, identify relevant legal variables (parties, facts, applicable laws), and devise an analysis plan. Then, let's carry out the plan, apply legal principles (pay attention to precedents and jurisdiction), and analyze step by step.""",
    "engineering": """Let's first understand the engineering problem, identify relevant variables (dimensions, materials, loads), and devise a solution plan. Then, let's carry out the plan, apply engineering formulas (pay attention to units and safety factors), and calculate step by step."""
}
Domain-Specific Terminology:
def adapt_for_domain(problem, domain):
    """Adapt PS trigger for specific domain."""
    domain_vocabulary = {
        "physics": {"variables": "physical quantities with units", "attention": "dimensional analysis and physical laws"},
        "chemistry": {"variables": "chemical species and stoichiometric coefficients", "attention": "conservation of mass and charge balance"},
        "economics": {"variables": "economic variables and their relationships", "attention": "equilibrium conditions and assumptions"}
    }
    vocab = domain_vocabulary.get(domain, {"variables": "relevant variables", "attention": "calculation and common sense"})
    trigger = f"""Let's first understand the problem, extract {vocab['variables']}, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to {vocab['attention']}), and solve step by step."""
    return f"Q: {problem}\n\nA: {trigger}"
Risk and Ethics
Ethical Considerations
Model Capability Insights:
PS prompting reveals important aspects of LLM capabilities:
- Models can follow multi-phase instructions effectively
- Explicit planning improves reasoning quality
- Semantic understanding remains a bottleneck
- Structured prompting can substitute for examples
Implications for AI Development:
- Prompting strategies can unlock latent capabilities
- The gap between zero-shot and few-shot performance can be narrowed
- Model limitations (semantic understanding) may require architectural solutions
Bias and Manipulation Risks:
- Training data biases: PS prompting doesn't introduce new biases but doesn't mitigate existing ones
- Problem framing biases: How problems are stated affects solutions
- Cultural assumptions: Word problems may embed cultural contexts
Mitigation:
def check_for_bias(problem, solution):
    """Check for potential biases in problem or solution."""
    bias_indicators = {
        "gender_specific_names": check_gender_balance,
        "cultural_assumptions": check_cultural_neutrality,
        "socioeconomic_framing": check_economic_assumptions
    }
    warnings = []
    for indicator, check_fn in bias_indicators.items():
        if not check_fn(problem, solution):
            warnings.append(indicator)
    return warnings
Transparency Concerns:
- PS prompting increases transparency by showing reasoning
- Plan phase reveals intended approach before execution
- Intermediate steps enable auditing
- However, generated reasoning may not reflect actual model computation
Risk Analysis
Failure Modes:
| Failure Mode              | Impact               | Likelihood | Mitigation                     |
| ------------------------- | -------------------- | ---------- | ------------------------------ |
| Incorrect plan            | Wrong answer         | Medium     | Verification step              |
| Missing variables         | Incomplete solution  | Low        | PS+ variable extraction        |
| Calculation error         | Wrong answer         | Medium     | Explicit calculation attention |
| Semantic misunderstanding | Wrong interpretation | High       | Problem clarification          |
| Incomplete execution      | No answer            | Low        | Sufficient max_tokens          |
Cascading Failures:
When PS is part of a larger system:
Incorrect plan → Wrong intermediate results → Incorrect final answer → Bad decision based on wrong answer
Mitigation:
- Add checkpoints between phases
- Validate intermediate results against constraints
- Include uncertainty quantification
- Enable human review for high-stakes decisions
Safety Concerns:
- Overconfidence: PS produces confident-looking reasoning even when wrong
- Automation complacency: Users may over-trust structured outputs
- Error propagation in chains: Mistakes compound in multi-step systems
Prompt Injection Risks:
PS prompting is vulnerable to adversarial problems designed to:
- Override instructions mid-reasoning
- Inject malicious content into "plan"
- Manipulate execution phase
Protection:
def secure_ps_pipeline(user_input):
    """Secure PS pipeline with input/output validation."""
    # Input validation
    if not validate_input(user_input):
        raise SecurityError("Invalid input detected")
    # Sandboxed execution
    response = ps_solve(user_input)
    # Output validation
    if contains_harmful_content(response):
        raise SecurityError("Harmful output detected")
    return response
Bias Amplification:
PS prompting may amplify biases when:
- Problems contain biased assumptions
- Plans encode biased approaches
- Variable extraction reflects skewed perspectives
Detection:
def detect_bias_amplification(problem, solution):
    """Detect if PS amplified biases from input."""
    input_bias_score = measure_bias(problem)
    output_bias_score = measure_bias(solution)
    amplification = output_bias_score - input_bias_score
    if amplification > 0.2:  # Significant amplification
        return True, amplification
    return False, amplification
Innovation Potential
Derived Innovations:
PS prompting has inspired several extensions:
- Self-Planning for Code: Adapts PS for code generation with algorithm plans
- Plan-and-Execute Agents: LangChain's agent framework based on PS principles
- QDMR-based PS: Combines Question Decomposition Meaning Representation with PS
- MSG (Multi-Stage Guided): Three-phase planning for code generation
Novel Combinations:
| Combination           | Benefit                                             |
| --------------------- | --------------------------------------------------- |
| PS + Self-Consistency | Multiple plans with voting for reliability          |
| PS + RAG              | Plan-guided retrieval for knowledge-intensive tasks |
| PS + Tool Use         | Plan incorporates tool calls                        |
| PS + Verification     | Explicit verification phase after execution         |
| PS + Tree of Thoughts | Multiple plan branches explored                     |
Research Opportunities:
- Automated plan quality assessment
- Learning optimal PS triggers per domain
- Multi-agent PS (different agents for planning and execution)
- Hierarchical PS for complex multi-objective problems
Ecosystem and Integration
Tools and Frameworks
Framework Support:
| Framework | PS Support | Implementation |
| --------- | ------------- | ------------------------- |
| LangChain | Native | plan_and_execute module |
| DSPy | Custom module | Can build PS signature |
| Haystack | Custom node | Pipeline component |
| Guidance | Templates | Structured generation |
| LMQL | Constraints | Query-based planning |
LangChain Plan-and-Execute:
from langchain.chat_models import ChatOpenAI
from langchain_experimental.plan_and_execute import (
    PlanAndExecute,
    load_agent_executor,
    load_chat_planner
)

# Setup ("tools" is your list of LangChain tools)
model = ChatOpenAI(model="gpt-4", temperature=0)
planner = load_chat_planner(model)
executor = load_agent_executor(model, tools, verbose=True)

# Create agent
agent = PlanAndExecute(
    planner=planner,
    executor=executor,
    verbose=True
)

# Run
result = agent.run("Research and calculate the GDP per capita of France")
Pre-Built Templates:
Official implementation: https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting
Key files:
- prompt.py: Contains PS and PS+ trigger templates
- main.py: Evaluation script for benchmarks
- prediction_runner.py: Inference utilities
Evaluation Tools:
- OpenAI Evals: Custom eval for PS reasoning quality
- LangSmith: Tracing for PS execution phases
- Weights & Biases: Metric tracking across experiments
Related Techniques and Combinations
Closely Related Techniques:
| Technique            | Relationship          | Key Difference                                    |
| -------------------- | --------------------- | ------------------------------------------------- |
| Zero-shot-CoT        | Direct predecessor    | PS adds explicit planning                         |
| Least-to-Most        | Similar decomposition | L2M decomposes questions; PS decomposes solutions |
| Self-Planning (code) | Derived technique     | Specialized for code generation                   |
| DECOMP               | Related approach      | Uses separate decomposition model                 |
| Tree of Thoughts     | Extended approach     | Explores multiple plan branches                   |
How Patterns Transfer:
- PS planning principle applies to any multi-step task
- Variable extraction generalizes to any domain with quantifiable elements
- Phase separation (plan/execute) works across reasoning types
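The plan/execute phase separation can be sketched as a generic two-call skeleton; `generate` here is a hypothetical stand-in for any prompt-to-text model call:

```python
def plan_then_execute(problem, generate):
    """Generic PS skeleton: one model call to plan, one to execute.

    `generate` is a hypothetical prompt -> text function (any LLM wrapper).
    """
    # Phase 1: ask only for a plan, not a solution
    plan = generate(
        f"Problem: {problem}\n"
        "Devise a numbered plan of subtasks to solve this problem. "
        "Do not solve it yet."
    )
    # Phase 2: execute the plan step by step
    return generate(
        f"Problem: {problem}\nPlan:\n{plan}\n"
        "Now carry out the plan step by step and state the final answer."
    )
```

The same two-call shape underlies the hybrid approaches in this section; only the prompt wording changes per reasoning type.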
Hybrid Approaches:
PS + Self-Consistency:
from collections import Counter

def ps_self_consistent(problem, n_paths=5):
    """Combine PS with self-consistency voting.

    Assumes ps_solve() and extract_answer() are defined as in earlier sections.
    """
    answers = []
    for _ in range(n_paths):
        # Sample at a nonzero temperature so the reasoning paths differ
        response = ps_solve(problem, temperature=0.3)
        answer = extract_answer(response)
        if answer:
            answers.append(answer)
    # Majority vote across the sampled paths
    return Counter(answers).most_common(1)[0][0] if answers else None
PS + Verification:
def ps_with_verification(problem):
"""PS with explicit verification phase."""
# Phase 1: PS solution
solution = ps_plus_solve(problem)
answer = extract_answer(solution)
# Phase 2: Verification
verification_prompt = f"""Problem: {problem}
Proposed solution:
{solution}
Please verify this solution:
1. Is the plan complete and logical?
2. Are all calculations correct?
3. Does the answer make sense?
If errors found, provide the correct answer."""
verification = generate(verification_prompt)
    # Decide from the verification text; a bare "correct" check would also
    # match "incorrect", so look for explicit error indicators instead
    if "incorrect" in verification.lower() or "error" in verification.lower():
        return extract_answer(verification)
    return answer
PS + RAG:
def ps_with_rag(problem, retriever):
"""PS with retrieval-augmented generation."""
# Retrieve relevant knowledge
relevant_docs = retriever.retrieve(problem, k=3)
context = "\n".join([doc.content for doc in relevant_docs])
# PS with context
prompt = f"""Relevant information:
{context}
Problem: {problem}
Using the information above, let's first understand the problem, extract relevant variables, and devise a plan. Then solve step by step."""
return generate(prompt)
Comparison Table:
| Aspect | PS | Zero-shot-CoT | Few-shot CoT | Least-to-Most |
| ------------------- | --------------- | ----------------- | --------------- | ---------------------- |
| Examples needed | No | No | Yes (3-8) | Yes (few) |
| Planning phase | Explicit | Implicit | Implicit | Explicit |
| Missing-step errors | Low | High | Low | Low |
| Setup effort | None | None | High | Medium |
| Token efficiency | Medium | High | Low | Medium |
| Best for | Multi-step math | General reasoning | Domain-specific | Compositional problems |
Integration Patterns
Task Adaptation:
Adapt PS for specific task types:
task_adaptations = {
"math_word_problem": {
"trigger": "ps+",
"additions": "show all calculations"
},
"logical_reasoning": {
"trigger": "ps+",
"additions": "state each logical step explicitly"
},
"code_debugging": {
"trigger": "basic",
"additions": "identify the bug, plan the fix, implement step by step"
},
"text_analysis": {
"trigger": "basic",
"additions": "identify key elements, plan the analysis, execute systematically"
}
}
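One way to apply an adaptation table like the one above is to map each task type to its trigger text. The PS and PS+ trigger wordings below follow Wang et al. (2023); `build_ps_prompt` and the abbreviated mapping are hypothetical illustrations:

```python
# PS and PS+ trigger phrasings from Wang et al. (2023)
TRIGGERS = {
    "basic": (
        "Let's first understand the problem and devise a plan to solve the "
        "problem. Then, let's carry out the plan and solve the problem step by step."
    ),
    "ps+": (
        "Let's first understand the problem, extract relevant variables and "
        "their corresponding numerals, and devise a plan. Then, let's carry "
        "out the plan, calculate intermediate variables (pay attention to "
        "correct numerical calculation and commonsense), solve the problem "
        "step by step, and show the answer."
    ),
}

# Mirrors the task_adaptations mapping defined above (abbreviated)
task_adaptations = {
    "math_word_problem": {"trigger": "ps+", "additions": "show all calculations"},
    "code_debugging": {
        "trigger": "basic",
        "additions": "identify the bug, plan the fix, implement step by step",
    },
}

def build_ps_prompt(problem, task_type):
    """Assemble a PS prompt for a task type, falling back to basic PS."""
    adaptation = task_adaptations.get(task_type, {"trigger": "basic", "additions": ""})
    trigger = TRIGGERS[adaptation["trigger"]]
    suffix = f" Also, {adaptation['additions']}." if adaptation["additions"] else ""
    return f"Q: {problem}\nA: {trigger}{suffix}"
```

Unknown task types degrade gracefully to the basic PS trigger, so the helper can sit in front of any model call.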
Integration with RAG:
class PSWithRAG:
def __init__(self, retriever, generator):
self.retriever = retriever
self.generator = generator
def solve(self, problem):
# Plan phase includes retrieval
plan_prompt = f"What information do we need to solve: {problem}"
info_needs = self.generator(plan_prompt)
# Retrieve
docs = self.retriever(info_needs)
# Solve with retrieved context
solve_prompt = f"""Context: {docs}
Problem: {problem}
Let's use the provided context, understand the problem, devise a plan, and solve step by step."""
return self.generator(solve_prompt)
Integration with Agents:
PS principles integrate with agent frameworks:
- Planning phase: Agent creates action plan
- Execution phase: Agent executes actions sequentially
- Reflection: Agent reviews results after each action
class PSAgent:
def __init__(self, tools, model):
self.tools = tools
self.model = model
def plan(self, task):
prompt = f"""Task: {task}
Available tools: {list(self.tools.keys())}
Create a plan with specific tool calls to accomplish this task."""
return self.model(prompt)
    def execute(self, plan):
        results = []
        # parse_plan / extract_tool_call are hypothetical helpers that split the
        # plan into steps and pull a (tool_name, args) pair out of each step
        for step in parse_plan(plan):
            tool_name, args = extract_tool_call(step)
if tool_name in self.tools:
result = self.tools[tool_name](**args)
results.append(result)
return results
Transition Strategies:
From Zero-shot-CoT to PS:
- Replace trigger phrase
- No other changes needed
- Monitor for accuracy improvement
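The swap can be monitored with a small A/B check over a labeled problem set; `generate` and `extract_answer` are hypothetical stand-ins for the model call and answer parser:

```python
ZERO_SHOT_COT = "Let's think step by step."
PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the "
    "problem. Then, let's carry out the plan and solve the problem step by step."
)

def trigger_accuracy(trigger, problems, generate, extract_answer):
    """Fraction of (problem, gold_answer) pairs answered correctly with a trigger."""
    correct = 0
    for problem, gold in problems:
        response = generate(f"Q: {problem}\nA: {trigger}")
        if extract_answer(response) == gold:
            correct += 1
    return correct / len(problems) if problems else 0.0
```

Running both triggers over the same problem set yields the before/after accuracy delta to watch.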
From Few-shot CoT to PS:
- Remove examples
- Add PS+ trigger
- Test on same problems
- May sacrifice some accuracy for generality
From PS to Advanced Approaches:
When PS accuracy is insufficient:
- Try PS + Self-Consistency first
- Add domain-specific examples (hybrid few-shot PS)
- Consider Tree of Thoughts for very complex problems
- Use specialized agents for tool-dependent tasks
Production System Integration:
import time  # MetricsCollector, ResultCache, and the earlier helpers are assumed in scope

class ProductionPSSystem:
def __init__(self, model, config):
self.model = model
self.config = config
self.metrics = MetricsCollector()
self.cache = ResultCache()
def solve(self, problem):
# Check cache
cached = self.cache.get(problem)
if cached:
return cached
# Solve with monitoring
start_time = time.time()
response = ps_solve(problem, self.model, self.config)
latency = time.time() - start_time
# Extract and validate
answer = extract_answer(response)
valid = validate_answer(problem, answer)
# Record metrics
self.metrics.record({
"latency": latency,
"tokens": count_tokens(response),
"extracted": answer is not None,
"valid": valid
})
# Cache result
if answer:
self.cache.set(problem, answer)
return answer
Versioning and Rollback:
import logging

class PSVersionManager:
def __init__(self):
self.versions = {}
self.current = None
def register_version(self, name, trigger, config):
self.versions[name] = {"trigger": trigger, "config": config}
def set_active(self, name):
if name in self.versions:
self.current = name
def rollback(self, name):
if name in self.versions:
self.current = name
logging.info(f"Rolled back to {name}")
def get_trigger(self):
return self.versions[self.current]["trigger"]
Future Directions
Emerging Innovations
Current Developments:
- Automated Trigger Optimization: Research into learning optimal PS triggers for specific domains without manual engineering
- Hierarchical Planning: Multi-level plans where high-level steps contain sub-plans, enabling complex multi-objective problems
- Dynamic Planning: Plans that adapt during execution based on intermediate results
- Multi-Agent PS: Separate agents for planning and execution, potentially with different model sizes or specializations
- PS with Tool Learning: Models learn which tools to include in plans based on problem characteristics
Impact Assessment:
| Innovation | Potential Impact | Timeline |
| ------------------------- | -------------------------- | ----------- |
| Auto-trigger optimization | Removes manual engineering | Near-term |
| Hierarchical planning | Enables complex problems | Medium-term |
| Multi-agent PS | Improved efficiency | Medium-term |
| Native PS models | Built-in planning | Long-term |
Research Frontiers
Open Questions:
- Optimal plan granularity: What level of detail in plans maximizes accuracy without adding overhead?
- Cross-domain transfer: Can PS triggers optimized for one domain transfer to others?
- Plan quality metrics: How do we automatically measure plan quality separate from execution quality?
- Semantic understanding integration: Can PS be combined with techniques that improve semantic comprehension?
- Scaling laws for PS: How does PS benefit scale with model size and problem complexity?
Promising Directions:
- Learned Planning Modules: Train specialized modules for the planning phase that work with various execution models
- Formal Verification of Plans: Use formal methods to verify plan correctness before execution
- Adaptive Phase Allocation: Dynamically allocate computational resources between planning and execution based on problem characteristics
- Human-in-the-Loop PS: Interactive systems where humans can review and modify plans before execution
- PS for Multi-Modal Reasoning: Extend planning-execution separation to problems involving images, audio, or structured data
Integration with Emerging Paradigms:
- Reasoning Models (o1, o3): How PS prompting interacts with models that have native reasoning capabilities
- Agent Systems: PS as a planning module within larger autonomous agent architectures
- Continuous Learning: Improving PS triggers based on execution feedback
- Multi-Modal Planning: Plans that incorporate non-text modalities
Benchmarking Needs:
- Standardized PS-specific benchmarks measuring plan quality
- Multi-domain evaluation suites
- Long-horizon problem benchmarks
- Adversarial planning challenges
Conclusion
Plan-and-Solve (PS) prompting represents a significant advancement in zero-shot reasoning for large language models. By explicitly separating problem-solving into planning and execution phases, the technique addresses fundamental weaknesses in standard zero-shot Chain-of-Thought approaches—particularly missing-step errors that plague multi-step reasoning tasks.
Key Takeaways:
- Simple yet effective: A single trigger phrase transformation yields measurable accuracy improvements (2.5% average across benchmarks, up to 10% on specific datasets)
- Zero-shot universality: No examples required, making it deployable across domains without task-specific engineering
- Complements existing methods: Works well in combination with self-consistency, verification, and other techniques
- Clear limitations: Does not address semantic understanding errors—the largest error category
- Strong ecosystem support: Integrated into major frameworks like LangChain as "Plan-and-Execute"
When to Deploy PS Prompting:
- Multi-step mathematical and logical reasoning tasks
- When Zero-shot-CoT shows missing-step errors
- When few-shot examples aren't available
- When you need consistent, auditable reasoning
When to Consider Alternatives:
- Simple single-step tasks (use direct prompting)
- When semantic understanding is the bottleneck (consider context enrichment)
- When highest accuracy is required and examples are available (use few-shot CoT)
- Very complex problems requiring exploration (consider Tree of Thoughts)
PS prompting demonstrates that careful prompt design can unlock latent model capabilities without additional training or examples. As models continue to evolve, the principles underlying PS—explicit planning before execution, structured decomposition, and attention to error-prone operations—will remain valuable patterns for effective human-AI collaboration in reasoning tasks.
References
- Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R. K.-W., & Lim, E.-P. (2023). Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023). https://aclanthology.org/2023.acl-long.147/
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems (NeurIPS 2022). https://arxiv.org/abs/2205.11916
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS 2022). https://arxiv.org/abs/2201.11903
- Zhang, Z., Zhang, A., Li, M., & Smola, A. (2022). Automatic Chain of Thought Prompting in Large Language Models. arXiv preprint arXiv:2210.03493. https://arxiv.org/abs/2210.03493
- Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., & Chi, E. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv preprint arXiv:2205.10625. https://arxiv.org/abs/2205.10625
- Zheng, C., Liu, Z., Xie, E., Li, Z., & Li, Y. (2024). DUP: Deeply Understanding the Problems Makes LLMs Better Reasoners for Math Word Problems. arXiv preprint arXiv:2404.14963. https://arxiv.org/abs/2404.14963
- LangChain Plan-and-Execute Documentation. https://python.langchain.com/docs/modules/agents/agent_types/plan_and_execute
- AGI-Edgerunners/Plan-and-Solve-Prompting GitHub Repository. https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting