Complexity-based prompting: a complete guide

Few-shot chain-of-thought lives or dies on which examples you pick, and most teams pick them by gut feeling. Complexity-based prompting replaces the gut with one rule: count the reasoning steps in each candidate example and keep the longest. That single structural criterion, plus the same trick applied at decoding time, set the state of the art at publication on GSM8K, MultiArith, and MathQA. With greedy decoding alone it averages +5.3 points over handcrafted CoT on GPT-3, and the largest single gain is +18 points, on MathQA with Codex. Source: Fu, Peng, Sabharwal, Clark, and Khot (University of Edinburgh and the Allen Institute for AI), "Complexity-Based Prompting for Multi-Step Reasoning," ICLR 2023, arXiv:2210.00720.

See it work

You have a pool of annotated math examples with step counts from 2 to 11, and room for 8 demonstrations. Two ways to fill the slots:

Random selection (8 examples, mixed depth):
  "Tom has 8 apples..."        2 steps
  "A box holds..."             3 steps
  "John has 4 bags..."         5 steps
  "A store sells 5 types..."   9 steps
  ... (average ~4 steps each)

Complexity selection (8 examples, deepest first):
  "A baker makes 3 cakes..."  11 steps
  "A store sells 5 types..."   9 steps
  "A train travels..."         8 steps
  "Maria earns..."             7 steps
  ... (every example 6+ steps)

Same annotation budget, same prompt format, same model. On GSM8K validation with GPT-3, the random set lands around 52.5% and the complexity set around 58.5% — a six-point swing from nothing but a sort order. Then you push the same idea into decoding: sample 50 chains for each test question, throw away the 10 shortest, and take a majority vote over the 40 longest. That output-side filter is what carries GSM8K to 82.9% on Codex.

The mental model

Think of a tutor choosing worked examples for a study packet. They could hand you ten quick one-liners, or three problems that each grind through rate conversion, unit tracking, a branch, and a verification. The hard problems teach more, because seeing a full solution to a hard problem reveals structure — how to decompose, when to apply each operation, how to check your work — that an easy problem never exposes.

The example that took the most steps to solve is the one that teaches the model the most. Length is a proxy for the richness of the reasoning schema inside.

How it works

The technique has two phases. Phase 1 chooses the demonstrations once, before any test query. Phase 2 (optional but where the biggest gains live) filters the model's own outputs at decoding time. Both use the same scoring function: count the newline-separated lines in a reasoning chain.

Collect a pool. Gather (question, reasoning chain, answer) triples. You need only a modest pool — 15 to 25 examples to select 8 from. No full corpus required.
Score each example. Count the steps (non-empty newline-separated lines). No parsing, no embeddings, no domain knowledge.
Rank and select. Sort descending, take the top M (M=8 is standard; 4 still beats random).
Build the prompt. Lay the demos out in standard few-shot CoT format. Use Question: as the prefix (it beats Q: by about 4 points) and newlines as the step separator.
Decode. Greedy for one chain, or sample N=50 at temperature 0.7 for Phase 2.
Filter and vote (Phase 2). Score all 50 chains the same way, keep the top K=40, parse each answer, majority vote.

Why it works

The paper isolates the cause with a clean experiment: hold total reasoning steps fixed at 72 and vary how they're distributed.

Factor	What the evidence shows	Effect
Per-example depth	8 examples × 9 steps (58.5%) beat 24 examples × 3 steps (51.0%) at the same 72 total steps	Largest — depth, not token count, is the active ingredient
Complexity-based demo selection	Beats random and centroid selection; beats retrieval on GSM8K and MultiArith (not MathQA) with a far smaller annotation budget	+5 to +6 pp average (greedy)
Complexity-based consistency	Voting over the longest K chains beats voting over all N	+2 to +4 pp over plain Self-Consistency
`Question:` prefix vs `Q:`	Larger than expected for a formatting change	+4 pp on GSM8K validation
Newline separator vs period	Complex prompts win under every separator tested, but newline is strongest	+4 pp on GSM8K validation (58.5% vs 54.5%)

Two mechanisms drive it. A long chain exposes a richer, more articulated problem-solving schema for the model's in-context attention to latch onto. And problems that need many steps are intrinsically hard problems, so their solutions encode more generalizable reasoning than easy-problem solutions. On the output side, correct solutions to hard problems tend to be longer; shortcuts and lucky guesses tend to be short. Filtering to the longest chains preferentially keeps the correctly-derived answers — voting over the shortest K always does worse, which pins down the direction of causality.
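The output-side argument can be made concrete with a toy vote. The sketch below is illustrative only: the sample chains and answers are invented, and `count_steps` and `vote` are hypothetical helper names, not code from the paper. It shows how a majority of short shortcut chains can win a plain vote while losing a vote restricted to the longest chains.

```python
from collections import Counter

def count_steps(chain: str) -> int:
    """Number of non-empty lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())

def vote(samples: list, k: int, longest: bool = True) -> str:
    """Majority answer among the k longest (or k shortest) chains.

    `samples` is a list of (chain_text, parsed_answer) pairs.
    """
    ranked = sorted(samples, key=lambda s: count_steps(s[0]), reverse=longest)
    return Counter(answer for _, answer in ranked[:k]).most_common(1)[0][0]

# Invented toy data: three short chains share a wrong shortcut answer ("12"),
# two longer chains work the problem through and reach "18".
toy_samples = [
    ("guess", "12"),
    ("a\nb", "12"),
    ("a\nb", "12"),
    ("a\nb\nc\nd\ne", "18"),
    ("a\nb\nc\nd\ne\nf", "18"),
]

all_vote = vote(toy_samples, k=5)                      # plain self-consistency: "12"
longest_vote = vote(toy_samples, k=3)                  # keep the 3 longest: "18"
shortest_vote = vote(toy_samples, k=3, longest=False)  # keep the 3 shortest: "12"
```

The toy data is rigged to make the effect visible; on a real task the size of the gap depends on how strongly chain length correlates with correctness.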

The paper's incremental ablation on GSM8K validation stacks the contributions: handcrafted CoT 43.5% → add "Let's think step by step" 48.5% → add complexity-based selection 54.0% → switch to the Question: prefix 58.0% → add complexity-based consistency 71.0%. And the technique is robust to the separator choice — complex prompts win under newline (58.5%), period (54.5%), semicolon (54.0%), and explicit "Step i:" labels (52.0%) — but newline is consistently strongest.

Where it shines

All numbers below are from Fu et al. on GPT-3 (text-davinci-002) and Codex (code-davinci-002), both 175B parameters.

Benchmark	Handcrafted CoT	Complex CoT (greedy)	Complex + majority vote (N=50, K=40)
GSM8K (n=1,319), GPT-3	48.1%	55.4%	72.6%
GSM8K, Codex	61.0%	66.6%	82.9%
MultiArith (n=600), GPT-3	90.8%	94.2%	98.7%
MultiArith, Codex	95.8%	95.8%	99.8%
MathQA (n=600), GPT-3	30.1%	36.0%	50.2%
MathQA, Codex	29.3%	47.3%	60.0%

MathQA is the standout: Codex jumps from 29.3% to 47.3% on greedy decoding alone (+18 pp), the paper's largest single gain, suggesting the handcrafted demos were badly miscalibrated for that benchmark's algebraic structure. Averaged across benchmarks under greedy decoding, complexity selection adds +5.3 points on GPT-3 and +6.2 on Codex.

The picture on StrategyQA (multi-hop commonsense) and two BigBench Hard tasks (Penguins in a Table, Date Understanding) is more mixed, because some of those tasks have a ceiling on how deep the reasoning can go:

Task	GPT-3 handcrafted → complex	Codex handcrafted → complex
StrategyQA	66.9% → 77.0% (+10.1)	73.1% → 73.9% (+0.8)
Penguins in a Table	76.7% → 79.5% (+2.8)	78.1% → 80.8% (+2.7)
Date Understanding	82.8% → 82.4% (-0.4)	86.0% → 86.8% (+0.8)

The technique fits multi-step arithmetic, algebraic word problems, multi-hop commonsense, and structured extraction or proof-style tasks — anywhere a correct solution naturally decomposes into a variable number of verbalizable steps and the hard cases need more of them. It adds little on single-step lookup, classification, summarization, translation, or creative generation, where step count is not a quality signal. Date and table tasks with bounded depth (4–5 steps no matter the difficulty) hit that ceiling and gain little.

When to use it (and when not)

Reach for it when you have a small pool of at least 8 annotated demonstrations, the task is genuinely multi-step with verifiable intermediate steps, you see high variance across differently curated example sets, you need annotation efficiency (no budget for a retrieval corpus), or you already run Self-Consistency and want a free upgrade.

Skip it when your test distribution is highly heterogeneous (retrieval or a hybrid wins), you have zero annotations (use Auto-CoT or Zero-Shot-CoT), the task is creative or open-ended, or you need sub-50ms latency (then Phase 2's sampling is off the table — limit to Phase 1).

Phase 2 costs roughly 50× a greedy call. Sampling N=50 chains per question has the same cost profile as Self-Consistency: about 50× the generation cost of a single greedy decode, and up to 50× the latency if the samples are drawn sequentially rather than in parallel. Phase 1 alone runs at ~1.5–2× a zero-shot query and still buys most of the structural gain. For high-throughput production, ship Phase 1 and reserve Phase 2 for accuracy-critical paths.
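A back-of-the-envelope calculator helps when budgeting Phase 2. This is a rough sketch under stated assumptions: `CallShape` and `relative_cost` are hypothetical names, the token counts and prices are placeholders, and whether the prompt is billed once or once per sample depends on your provider and on caching.

```python
from dataclasses import dataclass

@dataclass
class CallShape:
    prompt_tokens: int       # few-shot demos plus the test question
    completion_tokens: int   # one generated reasoning chain

def relative_cost(call: CallShape, n_samples: int, price_in: float,
                  price_out: float, prompt_billed_once: bool) -> float:
    """Cost of sampling n chains, as a multiple of one greedy chain."""
    greedy = call.prompt_tokens * price_in + call.completion_tokens * price_out
    # Assumption: some setups charge the prompt once for all samples,
    # others charge it again for every sample.
    prompt_copies = 1 if prompt_billed_once else n_samples
    sampled = (prompt_copies * call.prompt_tokens * price_in
               + n_samples * call.completion_tokens * price_out)
    return sampled / greedy

# Placeholder numbers: an 8-demo prompt of 2000 tokens and 300-token chains,
# with equal (unit) prices for input and output tokens.
shape = CallShape(prompt_tokens=2000, completion_tokens=300)
independent = relative_cost(shape, 50, 1.0, 1.0, prompt_billed_once=False)
shared = relative_cost(shape, 50, 1.0, 1.0, prompt_billed_once=True)
```

With fifty independent calls the multiplier is exactly 50. When the long demo prefix is paid for only once, the multiplier falls well below that, because the prompt dominates the per-call cost.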

Model fit. This is an emergent-ability technique. It needs a model that can already do multi-step CoT (roughly 100B+ parameters). The paper found text-curie-001 (6.7B) gets essentially zero benefit and Flan-T5 (11B) only a marginal +1.5 pp. GPT-3 and Codex (175B) show large gains, and stronger modern models inherit the benefit (with a narrower margin as their baseline rises). If your model can't produce a reasonable chain zero-shot, complexity selection has nothing to amplify.

Escalation. When you hit the accuracy ceiling, layer in semantic filtering (complexity plus similarity), add Self-Refine or Reflexion to improve chains before voting, move to Tree-of-Thoughts to search the reasoning space, or fine-tune on high-complexity chains if the distribution is stable.

Scenario	Recommended variant
Accuracy-critical, cost-unconstrained	Phase 1 + Phase 2 (N=50, K=40)
Accuracy-important, cost-sensitive	Phase 1 only (greedy)
Highly heterogeneous test set	Phase 1 with complexity-plus-diversity selection
Very small pool (under 8 examples)	Phase 1 with a minimum step threshold, not a top-M cutoff
Already running Self-Consistency	Add Phase 1 as a drop-in demo-selection improvement
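For the very-small-pool row, a threshold selector might look like the sketch below. It is a minimal illustration, not the paper's procedure: `select_by_threshold` is a hypothetical name, and the default of 6 steps is an assumption borrowed from this guide's "6+ steps" rule of thumb.

```python
def count_steps(chain: str) -> int:
    """Number of non-empty lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())

def select_by_threshold(pool: list, min_steps: int = 6, max_demos: int = 8) -> list:
    """Keep every example with at least `min_steps` steps, deepest first.

    Unlike a top-M cutoff, this can return fewer than `max_demos` examples,
    which avoids padding the prompt with shallow demonstrations.
    """
    deep = [ex for ex in pool if count_steps(ex["chain"]) >= min_steps]
    deep.sort(key=lambda ex: count_steps(ex["chain"]), reverse=True)
    return deep[:max_demos]
```

If the function returns an empty list, the pool has no deep examples at all, and annotating a few harder problems is probably a better next step than lowering the threshold.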

Structure and components

You need four things, and three of them are trivial:

A candidate pool of (question, reasoning chain, answer) triples — 8 to 20 is plenty, only modestly larger than the M you want to select.
A scoring function that maps a chain to an integer step count (count non-empty newline-separated lines).
A selection step that ranks the pool and takes the top M.
A standard CoT prompt format with Question: prefix and newline step separators.

For Phase 2 you add temperature sampling, the same step-count scorer applied to outputs, and an answer parser for the majority vote. You do not need embeddings, a retrieval corpus, a separate tuning set, or any fine-tuning.

Design principles worth internalizing: keep the metric deliberately surface-level (the paper tested fancier proxies like question length and formula length, and plain step count won); each step should state exactly one new fact or inference; and watch for fake complexity. Redundant restatement, splitting 3 × 15 = 45 into three micro-steps, and narrative padding ("now that we have the total, we move on...") all inflate the count without adding reasoning, and they teach the model to pad its outputs. To raise genuine step count, add unit tracking, intermediate verification, explicit sub-goal framing, or a cited formula at each step.
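A crude audit can help a reviewer spot fake complexity before reading the demos in full. The sketch below is a heuristic of my own, not something the paper proposes: it assumes an arithmetic task where a genuine step usually contains a number, and `audit_chain` is a hypothetical name.

```python
import re

def audit_chain(chain: str) -> dict:
    """Flag lines that look like padding in an arithmetic reasoning chain.

    A line is suspect if it repeats an earlier line (ignoring case and
    punctuation) or contains no digit at all. Both rules are assumptions
    suited to math word problems, not general-purpose tests.
    """
    lines = [ln.strip() for ln in chain.split("\n") if ln.strip()]
    seen = set()
    suspects = []
    for ln in lines:
        normalized = re.sub(r"\W+", " ", ln.lower()).strip()
        is_repeat = normalized in seen
        has_number = bool(re.search(r"\d", ln))
        if is_repeat or not has_number:
            suspects.append(ln)
        seen.add(normalized)
    return {
        "steps": len(lines),
        "suspect_lines": suspects,
        "effective_steps": len(lines) - len(suspects),
    }
```

Treat the output as a reading aid rather than an automatic filter: a legitimate verification step written in words would also be flagged by the no-digit rule.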

The core algorithm

The whole technique is about ten lines. Scoring and selection:

def complexity_score(chain: str) -> int:
    """Count non-empty newline-separated reasoning steps."""
    return sum(1 for line in chain.split("\n") if line.strip())

def select_demonstrations(pool: list[dict], m: int = 8) -> list[dict]:
    """Top-m examples by reasoning-chain complexity."""
    return sorted(pool, key=lambda x: complexity_score(x["chain"]), reverse=True)[:m]

def build_prompt(demos: list[dict], question: str) -> str:
    parts = [f"Question: {d['question']}\n{d['chain']}\nThe answer is {d['answer']}."
             for d in demos]
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

Phase 2 — sample, filter to the longest K, and vote:

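import re
from collections import Counter

def extract_answer(chain: str):
    """Return the text after the last 'The answer is', or None if absent."""
    # Assumes outputs follow the demo format and end with "The answer is X."
    matches = re.findall(r"The answer is\s*(.+)", chain)
    return matches[-1].strip().rstrip(".") if matches else None

def complexity_consistency(prompt, generate_fn, n=50, k=40, temperature=0.7, min_steps=2):
    """Sample n chains, keep the k longest, return the majority answer."""
    chains = generate_fn(prompt, n, temperature)
    # Drop degenerate chains; fall back to all chains if none pass the threshold.
    valid = [c for c in chains if complexity_score(c) >= min_steps] or chains
    top_k = sorted(valid, key=complexity_score, reverse=True)[:k]
    answers = [a for a in (extract_answer(c) for c in top_k) if a is not None]
    if not answers:
        # No parseable answer among the kept chains: try the first sample.
        return extract_answer(chains[0]) if chains else None
    return Counter(answers).most_common(1)[0][0]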

The generate_fn abstraction keeps it model-agnostic. A concrete wrapper, here on the Claude API (which samples one chain per call, so you loop):

import anthropic
client = anthropic.Anthropic()

# The default model ID is only an example; pass whichever model your account can access.
def generate_claude(prompt, n, temperature, model="claude-opus-4-6"):
    chains = []
    for _ in range(n):
        msg = client.messages.create(
            model=model, max_tokens=1024, temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        chains.append(msg.content[0].text)
    return chains

OpenAI-style APIs accept n>1 in a single call, which makes Phase 2 one request instead of fifty; local engines like vLLM batch it natively. The format string (Question: / chain / The answer is X) and the scorer are identical across providers — only the call syntax changes.
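When the API returns one chain per call, the fifty requests do not have to run back to back. One way to overlap them, sketched here with the standard library only, is a thread pool. `generate_parallel` and `sample_once` are hypothetical names; `sample_once(prompt, temperature)` stands for any function that performs a single API call and returns the chain text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def generate_parallel(sample_once: Callable[[str, float], str], prompt: str,
                      n: int, temperature: float, max_workers: int = 8) -> List[str]:
    """Draw n chains by calling `sample_once` concurrently.

    Threads suit this workload because each call spends its time waiting
    on the network. Results come back in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(sample_once, prompt, temperature) for _ in range(n)]
        return [future.result() for future in futures]
```

To plug it into a function that expects `generate_fn(prompt, n, temperature)`, wrap it in a lambda such as `lambda p, n, t: generate_parallel(my_sampler, p, n, t)`. Keep `max_workers` within your provider's rate limits; the value 8 is an arbitrary placeholder.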

Configuration

Parameter	Default	Range	Effect
M (demos selected)	8	4–12	Higher M → longer prompt, more schema variety; diminishing returns past 8
N (samples per question)	50	10–100	Higher N → more stable vote, higher cost; 50 is the paper's sweet spot
K (top chains kept)	40	20 to N-1	Lower K → stronger filter; near 60–80% of N is best; never set K=N (that's plain Self-Consistency)
Temperature	0.7	0.5–1.0	Higher → more chain diversity; too high adds noise
Max tokens per chain	512	256–1024	Must fit the longest expected chain; set generously to avoid truncation
Step separator	`\n`	`\n`, `.`, `;`	Newline is consistently best; switching to period or semicolon costs about 4 to 4.5 pp on GSM8K validation
Question prefix	`Question:`	`Q:`, `Question:`	`Question:` beats `Q:` on math tasks

Smaller models (30B–100B, weaker CoT) want M=4–6, lower temperature (0.5–0.6), more samples (N up to 100) for a stable majority, and a stricter minimum-step filter. RLHF-aligned chat models reward conciseness, which compresses the step-count distribution and weakens the filter — counter it with an explicit instruction like "show all intermediate steps; do not skip steps."
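The defaults in the table can be captured in one validated settings object, so a bad combination such as K=N fails early. This is a sketch with hypothetical names (`ComplexityConfig`, `SMALLER_MODEL`); the smaller-model preset picks single values from the ranges suggested above, and K=80 there is my own choice at 80% of N.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComplexityConfig:
    m: int = 8                  # demonstrations selected
    n: int = 50                 # chains sampled per question
    k: int = 40                 # longest chains kept for the vote
    temperature: float = 0.7
    max_tokens: int = 512
    separator: str = "\n"
    question_prefix: str = "Question:"

    def __post_init__(self):
        if self.m < 1:
            raise ValueError("m must be at least 1")
        if not 1 <= self.k < self.n:
            raise ValueError("k must satisfy 1 <= k < n; k == n is plain Self-Consistency")
        if self.temperature <= 0:
            raise ValueError("sampling needs a temperature above 0")

# Illustrative preset for a weaker model: fewer demos, cooler sampling, more samples.
SMALLER_MODEL = ComplexityConfig(m=6, n=100, k=80, temperature=0.5)
```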

Implementation workflow

Collect 15–25 problems from the target domain with full reasoning chains. Annotate the hardest ones first — easy problems produce short chains that won't be selected anyway.
Score everything by step count. Manually read the top 10 to confirm they're genuinely multi-step, not padded.
Select the top 8. Run greedy on a held-out validation set of 50–100 examples and record accuracy.
If that's good enough, ship Phase 1 (greedy) for cost efficiency.
If you need more accuracy, enable Phase 2 (N=50, K=40) and re-measure on validation.
If Phase 2 underwhelms, tune K on the validation set; the task may have low chain-length variance.
Re-score and refresh the pool as new, harder problem types appear.
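Tuning K does not require resampling: the same N chains per validation question can be re-voted for every candidate K. The sketch below assumes you have already stored each sampled chain with its parsed answer; `vote_top_k` and `sweep_k` are hypothetical names.

```python
from collections import Counter

def count_steps(chain: str) -> int:
    """Number of non-empty lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())

def vote_top_k(samples: list, k: int):
    """Majority answer among the k longest chains; None if nothing parsed."""
    ranked = sorted(samples, key=lambda s: count_steps(s[0]), reverse=True)[:k]
    answers = [answer for _, answer in ranked if answer is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

def sweep_k(validation: list, k_values: list) -> dict:
    """Validation accuracy for each candidate k.

    `validation` is a list of (samples, gold_answer) pairs, where `samples`
    is a list of (chain_text, parsed_answer) pairs for one question.
    """
    results = {}
    for k in k_values:
        correct = sum(vote_top_k(samples, k) == gold for samples, gold in validation)
        results[k] = correct / len(validation)
    return results
```

A flat accuracy curve across K values is itself informative: it suggests chain length carries little signal on this task, which matches the low-variance case described in the workflow.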

To bootstrap a pool without writing chains by hand, use a strong model under Zero-Shot-CoT to draft chains for your questions, then have a reviewer correct them — model-assisted annotation typically cuts effort by 60–70%. Because the demo prefix is fixed across all queries, enable prefix caching (Anthropic and OpenAI both support it) to avoid re-paying for it on every call.

Do	Don't
Use `\n` as the separator in demos and in output scoring	Mix separators across demonstrations
Read the top-selected demos before deploying	Trust the step count blindly
Set a minimum step threshold (≥3) to drop degenerate chains	Set K=N in Phase 2 (wastes the filter)
Test on a held-out set, including a different sub-domain	Assume one sub-domain's gains transfer everywhere
Refresh the pool as the task drifts	Lock the pool forever

Debugging

No better than handcrafted CoT? Check that selected demos are genuinely complex (6+ steps); a shallow pool degenerates to arbitrary selection. Check the model is large enough. Check the task is actually multi-step.
Chains come out short despite complex demos? Temperature too low or max_tokens too small — raise temperature to 0.7–0.9 and tokens to 768–1024.
Majority vote wrong even at N=50? K may be too aggressive (amplifying a few long-but-wrong chains) or the demos are from a mismatched sub-domain. Raise K toward N; verify domain match.
Inconsistent across runs? High temperature with small K. Raise N, lower temperature slightly, or raise K.
Phase 2 not beating greedy? Low natural chain-length variance, so the filter can't discriminate. If plain Self-Consistency also doesn't help, the task won't benefit from multi-sample aggregation.
Hallucinated intermediate steps? Complex demos can teach elaborate-but-wrong chains. Add a verification step to each demo ("Let me check: ..."), which both teaches self-checking and raises the complexity score.
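Several of these checks come down to how much the sampled chain lengths actually vary. A quick diagnostic, sketched here with the standard library, summarizes the step counts for one question's samples; `length_spread` is a hypothetical name, and what counts as "too little" spread is left to your judgment on validation data.

```python
from statistics import mean, pstdev

def count_steps(chain: str) -> int:
    """Number of non-empty lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())

def length_spread(chains: list) -> dict:
    """Summary statistics of step counts across sampled chains."""
    if not chains:
        raise ValueError("need at least one chain")
    counts = [count_steps(c) for c in chains]
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "stdev": pstdev(counts),  # population standard deviation
    }
```

A standard deviation near zero means every chain has about the same length, so keeping the K longest is close to picking K chains at random and the filter cannot add much over plain Self-Consistency.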

Limitations

Scale dependency is fundamental, not fixable by prompting. Below the emergent-reasoning threshold, the model can't do sustained CoT, so there's nothing for complexity to improve. Verify zero-shot CoT ability before reaching for this.

Step count is a surface proxy. It over-counts a long cascade of trivial arithmetic and under-counts a short chain packed with dense inference. It's the strongest proxy available without semantic analysis, but you still need to read the selected demos.

Depth can crowd out diversity. Taking the top M by complexity can pull every demo from one hard sub-type, leaving the model weak on others. For heterogeneous tests, stratify: group the pool by sub-type, then take the most complex example within each.
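The stratified variant can be sketched as a round-robin over sub-types. This is one possible reading of the advice above, not a method from the paper: it assumes each pool entry carries a hand-assigned `subtype` label, and the round-robin that fills the remaining slots is my own extension of "take the most complex example within each".

```python
from collections import defaultdict

def count_steps(chain: str) -> int:
    """Number of non-empty lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())

def select_stratified(pool: list, m: int = 8, type_key: str = "subtype") -> list:
    """Pick up to m demos, cycling through sub-types, deepest first within each."""
    groups = defaultdict(list)
    for example in pool:
        groups[example[type_key]].append(example)
    for members in groups.values():
        members.sort(key=lambda ex: count_steps(ex["chain"]), reverse=True)

    selected = []
    rank = 0  # 0 = deepest example of each sub-type, 1 = second deepest, ...
    while len(selected) < m and any(rank < len(g) for g in groups.values()):
        for members in groups.values():
            if rank < len(members) and len(selected) < m:
                selected.append(members[rank])
        rank += 1
    return selected
```

Compared with a pure top-M sort, this trades some average depth for coverage, so it is worth comparing both selections on a held-out set before committing to one.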

It needs annotations. Unlike Auto-CoT, the technique selects from human-quality chains; it doesn't generate them. The cost is moderate (8–20 examples) but non-zero.

Phase 2's silent overconfidence. A near-unanimous vote feels like high confidence, but if every sampled chain makes the same systematic error, the vote confidently returns the wrong answer. High vote share isn't a reliable confidence signal when chains share a prompt and an error mode.

Audit selected demos for bias in high-stakes domains. The selection criterion is blind to content, and complex chains have more surface area for implicit bias. If the deepest medical examples all happen to be older male cardiovascular cases (longer differentials), the model gets primed toward that frame. Review the intermediate steps, not just the final answers, before deploying in healthcare, legal, or financial settings.

Advanced techniques and ecosystem

A few extensions earn their keep. Verification steps at the end of each demo raise the score and teach self-checking. Sub-problem framing (labeled sub-goals) adds genuine structure for 15+ step problems. Complexity-plus-retrieval hybrids — score = α·complexity + (1-α)·similarity — directly fix the heterogeneous-distribution weakness; tune α on validation (α=1.0 is pure complexity, lower for varied tasks). Complexity-plus-Auto-CoT generates chains with Zero-Shot-CoT, scores them, and selects the deepest — a fully annotation-free pipeline.
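The hybrid score can be prototyped without an embedding model. The sketch below makes two assumptions of its own: complexity is normalized by the deepest chain in the pool so both terms lie in [0, 1], and similarity is plain word-overlap (Jaccard) as a stand-in for whatever semantic similarity you would use in practice. `select_hybrid`, `jaccard`, and the default α of 0.7 are hypothetical.

```python
def count_steps(chain: str) -> int:
    """Number of non-empty lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a placeholder for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def select_hybrid(pool: list, question: str, m: int = 8, alpha: float = 0.7) -> list:
    """Rank demos by alpha * complexity + (1 - alpha) * similarity to the question."""
    if not pool:
        return []
    # Guard against a pool whose chains are all empty.
    max_steps = max(count_steps(ex["chain"]) for ex in pool) or 1

    def score(example: dict) -> float:
        complexity = count_steps(example["chain"]) / max_steps
        similarity = jaccard(example["question"], question)
        return alpha * complexity + (1 - alpha) * similarity

    return sorted(pool, key=score, reverse=True)[:m]
```

Because the ranking now depends on the test question, selection runs once per query instead of once per task, which also means the demo prefix changes between queries.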

Frameworks slot it in cleanly: a LangChain BaseExampleSelector that sorts by step count, a DSPy teleprompter that bootstraps demos by complexity, or custom selectors in Haystack, LlamaIndex, and Semantic Kernel. The FranxYao Complexity-Based-Prompting and chain-of-thought-hub repos ship complexity-ranked templates for GSM8K, MultiArith, MathQA, and several BigBench Hard tasks, and EleutherAI's LM Evaluation Harness covers the benchmarks for replication.

The selection-criterion ablation shows where the technique sits among alternatives (validation accuracy):

Selection criterion	Annotation needed	GSM8K	MultiArith	MathQA
Random	Small pool	52.5%	86.5%	33.0%
Centroid (embedding)	Small pool	52.0%	92.0%	32.0%
Retrieval	Full corpus (10K+)	56.0%	88.0%	69.5%
Complexity (step count)	Small pool	58.5%	93.0%	42.5%

Complexity beats retrieval on two of three benchmarks (GSM8K and MultiArith) at a fraction of the annotation cost. The exception is telling: retrieval crushes complexity on MathQA (69.5% vs 42.5%), because when test questions are structurally diverse, semantic proximity beats raw depth. How the technique relates to its neighbors:

Technique	Relation
Self-Consistency (Wang et al., 2022)	Complexity-based consistency is a direct extension — it filters SC's sample pool by length
Auto-CoT (Zhang et al., 2022)	Complementary — generate a pool with Auto-CoT, then select by complexity
Least-to-Most (Zhou et al., 2022)	Orthogonal — L2M changes the reasoning structure; complexity changes which demos teach it
Active-Prompt (Diao et al., 2023)	Sibling — both pick demos, but by answer uncertainty vs step count
Zero-Shot-CoT (Kojima et al., 2022)	The fallback when no annotated pool exists

Complexity-based prompting also reads as an early step toward test-time compute scaling: it showed in 2022 that more reasoning at inference — longer demos, longer sampled chains — yields better answers, the same bet that o1-class (2024) and DeepSeek-R1 (2025) models later internalized in training. Phase 1 still helps those long-thinking models; Phase 2's external sampling-and-voting is largely superseded by their built-in extended reasoning.

The headline result, in context. Complexity-based prompting plus complexity-based consistency took GSM8K to 82.9% on Codex and MultiArith to 99.8%, state-of-the-art at publication, with no fine-tuning and only eight annotated demonstrations in the prompt. The entire mechanism is a sort by line count applied twice: once to the prompt, once to the outputs. Few techniques offer that ratio of payoff to implementation cost.

Summary

What: select few-shot CoT demonstrations by reasoning-step count (keep the longest), and optionally filter sampled outputs the same way before a majority vote.
Why: deeper worked examples expose a richer reasoning schema, and long correct chains beat short shortcuts — the depth-vs-breadth experiment shows 8×9-step demos beat 24×3-step demos at the same total length (58.5% vs 51.0%).
When: multi-step reasoning (4+ steps), a homogeneous test distribution, and at least 8 annotated demos on hand; skip it for shallow, heterogeneous, or creative tasks.
Where: arithmetic, algebraic, and multi-hop reasoning shine (GSM8K, MultiArith, MathQA, StrategyQA); bounded-depth date and table tasks barely move.
How: count newline steps, sort descending, take the top 8 with a Question: prefix; for Phase 2 sample N=50 at T=0.7, keep the K=40 longest, vote.
Which model: 100B+ with real CoT ability — text-curie-001 (6.7B) gained nothing, GPT-3 and Codex (175B) gained +5.3 and +6.2 points on average, up to +18 on MathQA.
Cost: Phase 1 runs at ~1.5–2× a zero-shot call; Phase 2 multiplies cost by ~50×, so ship Phase 1 by default and reserve Phase 2 for accuracy-critical work.
