Complexity-Based Prompting: A Complete Guide
Complexity-Based Prompting is a demonstration-selection technique for chain-of-thought (CoT) prompting that systematically prioritizes in-context examples whose reasoning chains contain the most steps. Rather than selecting examples by semantic similarity to the test question, by random sampling, or by human judgment, it applies a single structural criterion: examples that require more reasoning steps are ranked higher and chosen as demonstrations. The technique was introduced by Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot (University of Edinburgh and Allen Institute for AI) in "Complexity-Based Prompting for Multi-Step Reasoning," published at ICLR 2023 (arXiv:2210.00720).
To see the technique in action, consider a practitioner building a GSM8K-style math prompt. They have ten candidate examples, with step counts ranging from 2 to 11. The standard approach would be to pick 8 arbitrarily or by gut feeling. Complexity-Based Prompting instead ranks the examples by step count:
| Example | Step Count | Accuracy contribution (validation) |
|---|---|---|
| "A baker makes 3 cakes..." | 11 steps | High |
| "A store sells 5 types..." | 9 steps | High |
| "A train travels..." | 8 steps | High |
| "Maria earns..." | 7 steps | Moderate-high |
| "There are 12 books..." | 6 steps | Moderate |
| "John has 4 bags..." | 5 steps | Moderate |
| "A box holds..." | 3 steps | Low |
| "Tom has 8 apples..." | 2 steps | Low |
The top 8 are selected. At inference time, the model processes these 8 complex demonstrations and then generates reasoning chains for test questions. With temperature sampling, 50 chains are generated; only the 40 longest are kept, and their answers are put to a majority vote. This two-stage design—complexity selection at the prompt level, complexity filtering at the output level—is the full technique.
The problem it solves is prompt sensitivity in few-shot chain-of-thought settings. Standard few-shot CoT prompting requires a practitioner to manually curate examples, and the quality of the chosen examples has an outsized effect on model performance. Research prior to this work had shown that random example selection causes high variance, yet the field lacked a principled, annotation-efficient selection criterion that did not require a large labeled corpus (as retrieval-based selection does). Complexity-Based Prompting provides that criterion: count the reasoning steps, rank by that count, and take the top examples. Applied to output decoding, it extends the same logic to filter the K most complex sampled chains before taking a majority vote—a refinement of Self-Consistency (Wang et al., 2022) that consistently outperforms unfiltered voting.
Category: Complexity-Based Prompting sits at the intersection of few-shot prompting and optimization-based prompting. It is a meta-level decision rule applied to the demonstration selection phase of chain-of-thought prompting, not a change to the prompting format itself.
Type: It is a selection-based, structural technique. It does not modify the reasoning format presented to the model; it modifies which examples are presented and, at decoding time, which generated chains are trusted.
Scope: The technique covers two tightly coupled operations: (1) complexity-based prompt construction—selecting the highest-step-count demonstrations from a candidate pool—and (2) complexity-based consistency—filtering sampled output chains by step count before majority voting. It does not include automatic generation of reasoning chains (that is Auto-CoT's scope), retrieval from external corpora, semantic similarity matching, or changes to model architecture. It assumes a pre-existing pool of human-annotated (question, reasoning chain, answer) examples from which to select.
1. Introduction
1.1 Definition and Core Concept
Complexity-Based Prompting operates on a deceptively simple hypothesis: within a candidate pool of chain-of-thought demonstrations, the examples that require more reasoning steps are the most informative for guiding a language model through hard multi-step problems. The technique measures reasoning complexity operationally as the count of newline-separated lines in a reasoning chain—a surface-level proxy that requires no semantic parsing and no additional annotation.
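The scoring rule described above fits in a few lines. The sketch below is one possible reading of it: the function name `count_steps` is hypothetical, and skipping blank lines is an assumption made here so that trailing newlines do not inflate the score.

```python
def count_steps(chain: str) -> int:
    """Complexity score: number of non-empty, newline-separated lines.

    Assumption: blank lines are ignored so that stray trailing newlines
    do not change the score.
    """
    return sum(1 for line in chain.split("\n") if line.strip())


example_chain = (
    "Tom has 8 apples.\n"
    "He gives away 3, so 8 - 3 = 5 apples remain.\n"
    "The answer is 5."
)
print(count_steps(example_chain))  # 3
```

Whether the final "The answer is ..." line counts as a step is a convention; what matters for ranking is that the same convention is applied to every chain.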
What is included vs. excluded: The technique operates on an annotated pool of demonstrations; selecting from that pool using the step-count criterion is the full scope. It does not address how to build that pool (separate from the technique itself), how to format individual steps (any CoT format works), or how to generate demonstrations automatically. The decoding-side extension—complexity-based consistency—is an optional but closely coupled component.
Fundamental difference from other approaches:
- Versus random selection: Random selection treats all examples as equally informative. The paper's results contradict this: high-complexity examples are more useful for solving complex test problems. On GSM8K with GPT-3, random selection achieves 52.5% while complexity selection achieves 58.5%, a 6 percentage-point gap with identical annotation budgets.
- Versus embedding/centroid selection: Centroid-based selection picks the examples most "average" or representative of the pool. It maximizes coverage but does not prioritize depth. On GSM8K, centroid selection (52.0%) underperforms both random and complexity-based selection, suggesting that representativeness is not the right optimization target for multi-step reasoning.
- Versus retrieval-based selection: Retrieval picks the examples most semantically similar to the test question, requiring a large annotated corpus (often the entire training set). Complexity-based selection exceeds retrieval accuracy on GSM8K (58.5% vs. 56.0%) while requiring only a small candidate pool of ~8–10 examples, orders of magnitude fewer labeled instances.
- Versus Self-Consistency: Self-Consistency (Wang et al., 2022) samples N diverse reasoning chains at inference time and takes a majority vote. Complexity-Based Consistency extends this by restricting the vote to the top-K most complex chains, filtering out outputs that arrived at answers via shallow reasoning. When K = N, it recovers vanilla Self-Consistency; in the paper's experiments, voting over the K < N most complex chains outperformed voting over all N.
Value provided: The technique offers three distinct benefits: accuracy gains (averaging +5–6 percentage points over handcrafted CoT on GPT-3 and Codex), annotation efficiency (works with as few as 8 examples), and a principled, reproducible selection criterion that removes subjective human judgment from the curation process.
1.2 Research Foundation
Cognitive Science Origins
The intuition behind Complexity-Based Prompting connects to long-standing findings in educational psychology and expertise research. Cognitive Load Theory (Sweller, 1988) distinguishes between intrinsic load (inherent complexity of the content) and extraneous load (irrelevant cognitive burden from presentation). Worked example research in mathematics education consistently shows that learners who study complex, multi-step worked examples develop stronger problem schemas than those who study simpler examples—even when total study time is held constant. Chi et al. (1989) demonstrated that students who self-explained the steps of complex examples outperformed those who passively read simpler ones, because the effort of processing each step forces active schema construction.
The parallel for language model in-context learning is direct: a model processing a demonstration with many structured steps is effectively exposed to a richer, more articulated problem-solving schema than one processing a two-step example. The more complex demonstration contains more signal about how to decompose a problem, carry intermediate results forward, and verify a solution.
Worked Example Effect in Educational Psychology
The cognitive science concept most directly relevant is the worked example effect (Sweller & Cooper, 1985; Ward & Sweller, 1990). Learners who study worked examples—problems with full step-by-step solutions—consistently outperform learners who solve equivalent problems themselves, especially in the early stages of skill acquisition. The effect is strongest when the worked examples are complex enough to require schema induction (the learner must identify the underlying problem structure by studying the solution process) and weakest for trivially simple problems where the solution is immediately obvious.
The critical variable, both in the educational psychology literature and in Complexity-Based Prompting, is whether the worked example reveals a non-trivial solution structure. A simple example (3 steps for a 3-operation problem) reveals little structure—the problem is solved by applying three obvious operations in sequence. A complex example (9 steps for a problem involving rate conversion, unit tracking, conditional branching, and verification) reveals a rich structure: the learner (or model) must process how these operations chain together, when to apply each, and how to verify the result. This structural revelation is the mechanism of the worked example effect.
Sweller's later Cognitive Load Theory formalization (1988) explains why this doesn't scale indefinitely: beyond a certain complexity level, the example's intrinsic load exceeds the learner's working memory capacity, and learning degrades. For language models, the analogous constraint is context window size and attention capacity. Demonstrations that are excessively long may fail to contribute meaningfully if the model's attention is overwhelmed. The optimal complexity range—identified empirically by Fu et al. as 7–12 steps for the benchmarks they evaluated—corresponds to the "region of optimal challenge" in educational psychology: hard enough to reveal structure, not so hard as to be unprocessable.
Prior Approaches This Replaced or Refined
Before Complexity-Based Prompting, the dominant CoT approaches were:
- Handcrafted CoT (Wei et al., 2022): Human engineers manually selected 8 examples per task, writing their own reasoning chains. This achieved strong results but was labor-intensive, non-reproducible across practitioners, and produced arbitrarily varying example quality.
- Zero-Shot-CoT (Kojima et al., 2022): Adding "Let's think step by step" to the prompt elicits reasoning without any examples, eliminating curation cost. But without structural guidance from worked examples, the model's reasoning quality is lower, especially on problems that require sustained multi-step arithmetic or combinatorial thinking.
- Self-Consistency (Wang et al., 2022): Improved output decoding by aggregating over multiple sampled chains. This addressed output variance but did not address the quality of the demonstrations in the prompt.
- Auto-CoT (Zhang et al., 2022): Automatically generated demonstrations by clustering questions and using Zero-Shot-CoT to write their reasoning chains. This addressed annotation cost but prioritized diversity across clusters, not depth within examples.
Complexity-Based Prompting identifies a gap that none of these address: given a fixed annotation budget, which criterion for selecting examples is most predictive of good performance? The answer is step count.
Seminal Paper: Fu et al. (2022/2023)
Citation: Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. "Complexity-Based Prompting for Multi-Step Reasoning." arXiv:2210.00720. Published at ICLR 2023.
Authors and affiliations:
- Yao Fu — University of Edinburgh
- Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot — Allen Institute for AI (AI2)
Key research questions:
- Does the number of reasoning steps in a demonstration predict the quality of the model's downstream output?
- Does selecting demonstrations with the most steps outperform other selection criteria under the same annotation budget?
- Can step-count filtering improve upon Self-Consistency at decoding time?
All three questions were answered affirmatively with statistical significance across multiple benchmarks and two large model families.
Code: github.com/FranxYao/Complexity-Based-Prompting, subsequently integrated into github.com/FranxYao/chain-of-thought-hub.
1.3 Real-World Performance Evidence
All results below are on GPT-3 (text-davinci-002) and Codex (code-davinci-002), both 175B parameters, as reported in Fu et al. (2023).
Mathematical Reasoning — Core Benchmarks
| Benchmark | Method | GPT-3 | Codex |
|---|---|---|---|
| GSM8K (n=1,319) | Handcrafted CoT | 48.1% | 61.0% |
| GSM8K | Complex CoT (greedy) | 55.4% | 66.6% |
| GSM8K | Complex CoT + Majority Vote (N=50, K=40) | 72.6% | 82.9% |
| MultiArith (n=600) | Handcrafted CoT | 90.8% | 95.8% |
| MultiArith | Complex CoT (greedy) | 94.2% | 95.8% |
| MultiArith | Complex CoT + Majority Vote | 98.7% | 99.8% |
| MathQA (n=600) | Handcrafted CoT | 30.1% | 29.3% |
| MathQA | Complex CoT (greedy) | 36.0% | 47.3% |
| MathQA | Complex CoT + Majority Vote | 50.2% | 60.0% |
The MathQA result is the most striking single improvement: Codex improves from 29.3% to 47.3% (+18 pp) under greedy decoding alone, suggesting the handcrafted demonstrations were particularly miscalibrated for that benchmark's algebraic structure.
Commonsense and Logical Reasoning — BigBench Hard
| Task | GPT-3 Handcrafted | GPT-3 Complex | Codex Handcrafted | Codex Complex |
|---|---|---|---|---|
| Date Understanding | 82.8% | 82.4% (-0.4) | 86.0% | 86.8% (+0.8) |
| Penguins in a Table | 76.7% | 79.5% (+2.8) | 78.1% | 80.8% (+2.7) |
| StrategyQA | 66.9% | 77.0% (+10.1) | 73.1% | 73.9% (+0.8) |
Average Gains (greedy decoding, across all benchmarks):
- GPT-3: +5.3 accuracy points over handcrafted CoT
- Codex: +6.2 accuracy points over handcrafted CoT
Incremental ablation — building toward the full method (GSM8K validation):
| Stage | Accuracy |
|---|---|
| Original handcrafted CoT | 43.5% |
| + "Let's think step by step" elicitation | 48.5% (+5.0) |
| + Complexity-based demonstration selection | 54.0% (+5.5) |
| + "Question:" prefix (vs. "Q:") | 58.0% (+4.0) |
| + Complexity-based consistency (N=50, K=40) | 71.0% (+4.0 additional over Self-Consistency baseline) |
This incremental table shows that each component contributes meaningfully. Among the greedy-decoding stages, the largest single jump comes from complexity selection (+5.5), followed closely by the step-by-step elicitor (+5.0). The prefix change (+4.0) is larger than expected for a formatting detail. Complexity-based consistency adds +4.0 on top of standard Self-Consistency.
Step separator sensitivity (GSM8K):
| Separator format | Accuracy |
|---|---|
| Newline (`\n`) | 58.5% |
| Period (`.`) | 54.5% |
| Semicolon (`;`) | 54.0% |
| Explicit "Step i:" labels | 52.0% |
Complex prompts outperform simple prompts under all four separator formats, confirming the method is robust to surface variation. However, the newline separator is consistently strongest—reinforcing that it is the correct implementation choice.
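To make the four separator formats concrete, the sketch below renders the same list of steps in each style. The function name `render_chain`, the style keys, and the exact spacing and punctuation are assumptions for illustration; the source does not specify how the paper formatted each variant.

```python
def render_chain(steps: list[str], style: str) -> str:
    """Join reasoning steps using one of four separator styles.

    The exact spacing and punctuation are assumptions for illustration.
    """
    if style == "newline":
        return "\n".join(steps)
    if style == "period":
        # One paragraph, every step terminated by a period.
        return " ".join(s.rstrip(".") + "." for s in steps)
    if style == "semicolon":
        return "; ".join(s.rstrip(".") for s in steps)
    if style == "step_label":
        return "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))
    raise ValueError(f"unknown style: {style!r}")


steps = [
    "There are 5 groups of 3 apples.",
    "So there are 5 * 3 = 15 apples in total.",
]
for style in ("newline", "period", "semicolon", "step_label"):
    print(f"--- {style} ---")
    print(render_chain(steps, style))
```

Only the newline style keeps the one-line-per-step property that the line-count complexity score relies on, so the other styles would need their own step counter.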
Depth vs. breadth controlled experiment (GSM8K validation):
The paper's most illuminating ablation holds total reasoning steps constant at 72 and varies their distribution:
| Configuration | Examples | Steps each | Total steps | Accuracy |
|---|---|---|---|---|
| Many simple examples | 24 | 3 | 72 | 51.0% |
| Fewer complex examples | 8 | 9 | 72 | 58.5% |
The 7.5-point gap, with total step count held constant, indicates that per-example depth matters more than aggregate step count on this benchmark: 8 examples with 9 steps each are worth substantially more than 24 examples with 3 steps each.
Demonstration Selection Ablation (GSM8K, MathQA, MultiArith, validation sets):
| Selection Criterion | Annotation Needed | GSM8K | MultiArith | MathQA |
|---|---|---|---|---|
| Random | Small pool | 52.5% | 86.5% | 33.0% |
| Centroid (embedding) | Small pool | 52.0% | 92.0% | 32.0% |
| Retrieval-based | Full corpus (≥10K examples) | 56.0% | 88.0% | 69.5% |
| Complexity (step count) | Small pool | 58.5% | 93.0% | 42.5% |
Complexity selection ties or beats retrieval on 2 out of 3 benchmarks despite requiring orders of magnitude fewer annotations. The exception (retrieval wins on MathQA by a wide margin at 69.5% vs. 42.5%) reveals an important limitation: when test questions are highly heterogeneous in structure, semantic proximity is a stronger signal than complexity alone. This is addressed further in the Limitations section.
2. How It Works
2.1 Theoretical Foundation
The Core Insight
The paper's central empirical finding is that reasoning step count is the single strongest predictor of whether a demonstration will help the model solve a hard multi-step problem. This observation challenges the prevailing intuition that semantic relevance (retrieved examples that look like the test question) should dominate selection.
The theoretical explanation rests on two mechanisms:
Schema richness: A demonstration with 9 reasoning steps exposes the model to a richer, more articulated problem-solving schema than a 3-step demonstration. The in-context learning mechanism in transformers—attention-weighted pattern matching over the context—has more structured signal to latch onto when the demonstration contains explicit intermediate steps. The model is not just learning what the answer looks like; it is learning how to decompose, compute, verify, and express a multi-step solution.
Self-selection effect: Problems that require many reasoning steps are, by definition, problems that require careful multi-step reasoning. By selecting examples with many steps, the practitioner is implicitly selecting examples that encode solutions to hard problems—and hard-problem solutions encode more generalizable reasoning patterns than easy-problem solutions.
What Assumptions Underlie This, and Where They Fail
The technique assumes:
- Step count is a valid proxy for reasoning depth. This holds when each step represents a genuine reasoning operation (arithmetic, logical inference, spatial reasoning). It fails when steps are artificially inflated (verbose but shallow) or when the reasoning requires dense, interlocked inferences that are hard to express as discrete steps.
- More complex demonstrations are more useful for all types of test questions. The ablation data shows this assumption partially fails for heterogeneous tasks: on MathQA, retrieval (which matches semantics, not just complexity) outperforms complexity selection by 27 points. For narrowly scoped tasks (pure arithmetic, commonsense), the assumption holds strongly.
- Complexity in demonstrations transfers to better reasoning in generated outputs. This assumes the model can read multi-step demonstrations and generalize the reasoning pattern, which requires sufficient model scale. Empirically, text-curie-001 at 6.7B shows no benefit while Codex at 175B shows large benefits; the technique appears unreliable for models well below roughly 100B parameters, though these two data points do not pin down the exact threshold.
Fundamental Trade-Offs
| Dimension | Trade-off |
|---|---|
| Complexity vs. diversity | Maximizing step count can reduce topical diversity; all selected examples may be from similar hard problem types |
| Complexity vs. token budget | More steps per demonstration mean longer prompts; even a small number of complex examples can exceed context limits |
| Quality at inference (N=50 samples) vs. cost | Complexity-based consistency requires 50× as many generations per question as greedy decoding |
| Complexity vs. accuracy on easy sub-questions | Filtering to high-step chains at decoding time can hurt performance on easy questions, which are reliably solved with short chains |
2.2 Execution Mechanism
The technique operates in two distinct phases that can be applied independently or jointly.
Phase 1: Complexity-Based Prompt Construction (Demonstration Selection)
This phase happens once, before any test query is processed.
- Collect a candidate pool: Assemble a set of (question, reasoning chain, answer) triples. The pool does not need to be large: 8–20 human-annotated examples are sufficient. The pool should cover the domain of interest but does not need to be exhaustive.
- Score each example: For each reasoning chain, count the number of lines separated by the `\n` newline character. This count is the complexity score. No semantic analysis, dependency parsing, or domain-specific knowledge is needed.
- Rank and select: Sort the pool by complexity score in descending order. Select the top M examples (M = 4–8 in the paper's experiments, with M = 8 being the standard configuration).
- Construct the prompt: Arrange the selected examples in the standard few-shot CoT format, using `Question:` as the question prefix (the paper finds this performs better than `Q:`) and `\n` as the step separator. The prompt template is:
Question: {example_1_question}
{step_1}\n{step_2}\n...\n{step_n}\nThe answer is {answer}.
Question: {example_2_question}
...
Question: {test_question}
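The four Phase 1 steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the `Demo` dataclass, `count_steps`, `select_top_m`, and `build_prompt` are hypothetical names, and separating demonstrations with a blank line is a formatting choice made here.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    chain: str   # reasoning steps separated by "\n"
    answer: str


def count_steps(chain: str) -> int:
    """Complexity score: number of non-empty lines in the chain."""
    return sum(1 for line in chain.split("\n") if line.strip())


def select_top_m(pool: list[Demo], m: int = 8) -> list[Demo]:
    """Rank the pool by step count (descending) and keep the top m.

    sorted() is stable, so demonstrations with equal step counts keep
    their original pool order.
    """
    return sorted(pool, key=lambda d: count_steps(d.chain), reverse=True)[:m]


def build_prompt(demos: list[Demo], test_question: str) -> str:
    """Format demonstrations with the "Question:" prefix and newline steps."""
    blocks = [
        f"Question: {d.question}\n{d.chain}\nThe answer is {d.answer}."
        for d in demos
    ]
    blocks.append(f"Question: {test_question}")
    # Assumption: one blank line between demonstrations.
    return "\n\n".join(blocks) + "\n"


pool = [
    Demo("Q1?", "step a", "1"),
    Demo("Q2?", "step a\nstep b\nstep c", "2"),
    Demo("Q3?", "step a\nstep b", "3"),
]
print(build_prompt(select_top_m(pool, m=2), "How many apples are left?"))
```

Because selection happens once and offline, the resulting prompt prefix can be cached and reused for every test question.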
Phase 2: Complexity-Based Consistency (Output Decoding)
This phase is applied at inference time for each test question.
- Sample N chains: Using temperature sampling (the paper uses T = 0.7, N = 50), generate 50 distinct reasoning chains for the test question. Each chain is a full response including reasoning steps and a final answer.
- Score each chain: Count the newline-separated lines in each generated chain. This is the same scoring function used for demonstrations.
- Filter to the top K: Sort the 50 chains by step count descending. Retain the top K = 30–40 chains. The paper finds K = 40 is near-optimal across benchmarks.
- Extract answers and vote: Parse the final answer from each of the K retained chains. Apply majority voting across these K answers. The most frequently occurring answer is the prediction.
Single-pass vs. iterative: The technique is single-pass at the prompt construction stage (demonstrations are fixed once selected). At the decoding stage it involves N parallel forward passes, not a sequential iteration—the chains are sampled independently and aggregated.
Initialization and completion criteria: No initialization beyond the candidate pool is needed. Completion is defined by the majority vote answer across the K best chains.
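The filter-then-vote logic of Phase 2 can be exercised without a model by running it on chains that have already been sampled. In the sketch below, `complexity_vote` and `last_line_answer` are hypothetical names, the answer parser is passed in as a parameter, and the five toy chains stand in for the N = 50 samples.

```python
from collections import Counter
from typing import Callable, Optional


def count_steps(chain: str) -> int:
    """Complexity score: number of non-empty lines in the chain."""
    return sum(1 for line in chain.split("\n") if line.strip())


def complexity_vote(
    chains: list[str],
    k: int,
    extract_answer: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Majority vote over the k chains with the most steps.

    With k >= len(chains) this reduces to plain Self-Consistency voting.
    """
    # Longest chains first; sorted() is stable, so ties keep sample order.
    ranked = sorted(chains, key=count_steps, reverse=True)
    answers = [extract_answer(c) for c in ranked[:k]]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def last_line_answer(chain: str) -> Optional[str]:
    """Toy parser (assumption): the answer follows 'The answer is ' on the last line."""
    last = chain.strip().split("\n")[-1]
    prefix = "The answer is "
    return last[len(prefix):].rstrip(".") if last.startswith(prefix) else None


samples = [
    "a\nb\nc\nThe answer is 18.",
    "a\nb\nc\nd\nThe answer is 18.",
    "a\nThe answer is 12.",
    "The answer is 12.",
    "x\nThe answer is 12.",
]
print(complexity_vote(samples, k=2, extract_answer=last_line_answer))  # 18
print(complexity_vote(samples, k=5, extract_answer=last_line_answer))  # 12
```

In this toy pool the short chains outnumber the long ones, so the unfiltered vote (k = 5) and the filtered vote (k = 2) disagree, which is exactly the situation the filter is meant to address.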
2.3 Causal Mechanisms
Why Complexity-Based Selection Works
The paper provides a controlled experiment isolating the causal role of per-example complexity from total prompt length. Two configurations are compared with equal total reasoning steps (72 steps total):
- Configuration A: 24 simple examples × 3 steps each = 72 total steps
- Configuration B: 8 complex examples × 9 steps each = 72 total steps
On GSM8K validation, Configuration A achieves 51.0% and Configuration B achieves 58.5%, a 7.5-point gap with the total number of reasoning steps held constant. This argues against the hypothesis that longer prompts are simply better because more reasoning text provides more signal. The structure of having many steps per example is the active ingredient.
The causal mechanism: a 9-step demonstration exposes the model to a sustained reasoning trajectory. Processing this trajectory primes the model's attention heads to generate similarly extended, methodical outputs. A 3-step demonstration does not provide this priming—it only teaches the model how to reach a conclusion quickly, which reinforces shallow solution patterns.
Why Complexity-Based Consistency Works
The vote-filtering mechanism works because output quality and output length are positively correlated for multi-step math problems. When a model generates a correct solution to a hard problem, it typically requires more steps to show the work. When it generates an incorrect solution through a reasoning shortcut or lucky guess, it tends to produce fewer steps. Filtering to the top-K most complex chains among the N samples therefore preferentially selects the correctly-derived answers.
The paper's direct test of this: voting among the K least complex chains (the "simple chains" condition) always performs worse than voting across all N chains. Voting among the K most complex chains always performs better. This asymmetry is consistent with step count tracking answer quality, rather than with any reduction of the voting pool helping by itself.
Cascading Effects and Feedback Loops
The technique does not involve a feedback loop in the traditional sense—there is no iterative refinement of the demonstrations based on observed output quality. However, there is a one-way cascade: the quality of the selected demonstrations determines the distribution of the N sampled chains at inference time, which in turn determines the quality of the top-K filtered chains. A high-quality prompt (complex demonstrations) biases the model toward generating complex outputs, which yields a richer pool of chains to filter at the decoding stage. The two components (selection and consistency) reinforce each other.
Information-Theoretic Framing
An illuminating way to understand why complexity matters is through an information-theoretic lens. A reasoning chain is a sequence of tokens. The information content of the chain—how much it tells the model about how to solve a class of problems—depends not just on its length but on the diversity of reasoning operations it demonstrates.
A 3-step chain that says "multiply, subtract, add" exposes the model to three operation types with minimal context for when to apply each. A 9-step chain that says "identify the rate, compute duration, convert units, apply the formula, check boundary conditions, verify the intermediate result, compute the final aggregation, round to appropriate precision, state the answer" exposes the model to nine operation types with rich context for their sequencing and conditional application.
From a minimum description length (MDL) perspective: a complex reasoning chain is harder to compress without loss of information than a simple one. This incompressibility is a signal that the chain contains novel, non-redundant information. By selecting the most complex (least compressible) demonstrations, the technique maximizes the information density of the few-shot context.
This framing also explains why the depth-vs-breadth experiment (8 complex vs. 24 simple, same total steps) favors depth: 8 diverse, long chains are informationally richer than 24 short chains because the short chains are highly compressible (they all encode the same basic pattern) while the long chains each encode distinct, less compressible problem-solving trajectories.
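The compressibility intuition can be illustrated with a general-purpose compressor. The sketch below uses `zlib` as a crude stand-in for description length, which is an assumption: it measures byte-level redundancy, not reasoning content. The simple context is deliberately built from 24 identical 3-step chains, an extreme case; real simple demonstrations differ in wording and numbers, so the gap would be smaller. The helper name `compression_ratio` is hypothetical.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, 9)) / len(raw)


# 24 copies of the same 3-step pattern (toy, maximally redundant).
simple_chain = "Multiply the price by the quantity.\nSubtract the discount.\nAdd the tax."
simple_context = "\n\n".join([simple_chain] * 24)

# One 9-step chain built from the operations named in the text above.
complex_context = "\n".join([
    "Identify the rate.",
    "Compute the duration.",
    "Convert the units.",
    "Apply the formula.",
    "Check the boundary conditions.",
    "Verify the intermediate result.",
    "Compute the final aggregation.",
    "Round to the appropriate precision.",
    "State the answer.",
])

simple_ratio = compression_ratio(simple_context)
complex_ratio = compression_ratio(complex_context)
print(f"simple: {simple_ratio:.2f}  complex: {complex_ratio:.2f}")
```

The repeated short chains compress to a much smaller fraction of their raw size than the single long chain, which mirrors the claim that many near-identical short demonstrations add little new information.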
Connection to In-Context Learning Theory
Theoretical analyses of in-context learning (Akyürek et al., 2022; Dai et al., 2023; Von Oswald et al., 2023) suggest that transformers implicitly perform a form of gradient descent on the in-context examples during the forward pass. The demonstration examples define a loss landscape that the model is "learning from" without parameter updates.
Under this interpretation, complex demonstrations define a richer loss landscape: more steps means more gradient signal per example, and selecting the highest-step demonstrations maximizes the total "implicit gradient" available to the model for the test question. Simple demonstrations are informationally sparse and define a shallow loss landscape that barely distinguishes correct from incorrect reasoning approaches.
This theoretical framing is not proven for Complexity-Based Prompting specifically, but it provides a principled mechanistic story that is consistent with the empirical results and suggests where the technique is most likely to provide gains (tasks where the space of reasoning approaches is large and the model needs strong signal to select the right approach).
What Emergent Behaviors Arise
An unexpected finding from deploying the technique is what can be called reasoning format contagion: when demonstrations consistently use a particular step style (e.g., starting each step with a verb: "Compute...", "Subtract...", "Verify..."), the model's generated chains tend to adopt the same step structure even on test questions that were not in the demonstration domain. This suggests that complex demonstrations do not only teach problem-specific patterns—they establish a global reasoning style for the entire inference session.
A second emergent behavior is partial answer consistency: with Phase 2 active, intermediate values within the retained top-K chains tend to be more consistent across chains than in the bottom (N-K) chains, even for chains that ultimately give different final answers. This suggests that complexity filtering selects chains that are reasoning correctly for more of their length, even when the final answer diverges—a property that could be exploited to build richer uncertainty estimates.
Dominant Factors in Effectiveness
Based on the ablation studies, ranked by contribution:
- Complexity-based demonstration selection (greedy decoding improvement): ~+5–6 pp average across benchmarks.
- Complexity-based consistency filtering (added on top of standard Self-Consistency): additional +2–4 pp improvement over unfiltered majority vote.
- Question prefix format ("Question:" vs. "Q:"): +4 pp on GSM8K validation—larger than expected for a cosmetic change, suggesting prompt sensitivity to instruction phrasing.
- Step separator format (newline vs. alternatives): +4.0 to +6.5 pp on GSM8K in favor of the newline separator, per the separator sensitivity table.
3. Structure and Components
3.1 Essential Components
Required elements:
- Candidate demonstration pool: A set of (question, reasoning chain, answer) triples with human-written or high-quality model-generated reasoning chains. Minimum 8–10 examples; the pool only needs to be moderately larger than the number of demonstrations you want to select (e.g., a pool of 15–20 to select the top 8).
- Complexity scoring function: A function that maps a reasoning chain to a non-negative integer. The canonical implementation counts `\n`-separated lines. This is the technique's only required algorithm and it is trivially implemented.
- Demonstration selection procedure: Rank the pool by complexity score and take the top M. M = 8 is the standard; M as low as 4 still yields improvements over random selection in the paper's experiments.
- Chain-of-thought prompt format: Standard few-shot CoT format with `Question:` prefix. The reasoning chain for each demonstration must use newline characters as step separators (not periods or explicit step labels).
Optional but strongly recommended:
- Temperature sampling at inference (for Complexity-Based Consistency): Without sampling, the technique is limited to Phase 1 (prompt construction). Adding N = 50 samples and filtering to K = 30–40 provides the largest accuracy gains.
- Step-count scoring of generated outputs: Required for Phase 2. The same `\n` line-count function is applied to generated chains.
- Majority voting with answer extraction: A parser to extract the final numerical or categorical answer from each chain. For math tasks, this typically means extracting the last number or "The answer is X" pattern.
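For numeric tasks, the answer parser mentioned in the last item could look like the sketch below. It is a hypothetical helper, not the paper's parser: it prefers the last "The answer is X" match, falls back to the last number in the chain, and strips thousands separators so that "1,250" and "1250" vote together. It does not normalize equivalent decimals such as "3.5" and "3.50".

```python
import re
from typing import Optional

# Number with optional sign, thousands separators, and decimal part.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"
_ANSWER_RE = re.compile(r"[Tt]he answer is\s*\$?(" + _NUMBER + r")")
_NUMBER_RE = re.compile(_NUMBER)


def extract_answer(chain: str) -> Optional[str]:
    """Return the final numeric answer as a string, or None if none is found."""
    explicit = _ANSWER_RE.findall(chain)
    if explicit:
        raw = explicit[-1]          # last "The answer is X" wins
    else:
        numbers = _NUMBER_RE.findall(chain)
        if not numbers:
            return None
        raw = numbers[-1]           # fallback: last number in the chain
    return raw.replace(",", "")


print(extract_answer("5 * 3 = 15\nThe answer is 15."))   # 15
print(extract_answer("So the total is 42"))               # 42
```

Returning `None` for unparseable chains lets the voting step drop them instead of counting a garbage answer.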
Not required:
- Semantic embeddings or similarity scores
- External retrieval corpus
- Separate validation set for tuning M or K (defaults from the paper work well)
- Model fine-tuning or gradient updates
3.2 Design Principles
Linguistic patterns: Each reasoning step in a demonstration should be a complete, self-contained inference—not a fragment. The newline separator should appear after each step, not within a step. Steps should express progress: "Since there are 5 groups of 3 apples, there are 5 × 3 = 15 apples total" is a valid step. "Multiply 5 by 3" alone is too sparse.
Cognitive principles leveraged:
- Schema priming: Multi-step demonstrations prime the model's in-context attention pattern toward structured, sequential inference, activating learned schema for how to decompose hard problems.
- Reasoning depth over breadth: Instead of showing the model many simple solution patterns, the technique shows it fewer but deeper patterns. This is a conscious trade-off of example diversity for example depth.
- Cascaded inference: Each step in a complex demonstration conditions the next, creating a structured reasoning chain that the model can transfer to unseen problems.
Design principles:
- Parsimony in the complexity metric: The step-count metric is deliberately surface-level. Attempting to measure "semantic depth" or "logical independence of steps" adds engineering cost without reliable gain. The paper explicitly tested more sophisticated proxies (question length, formula length) and found step count to be the most predictive.
- Newline as the canonical separator: The \n separator is not arbitrary; it corresponds to the token-boundary structure that transformers process efficiently and that model pre-training (on code and structured text) reinforces as a meaningful unit boundary.
- "Question:" prefix: The paper's ablation shows this performs better than "Q:" by ~4 pp. The hypothesis is that "Question:" is a more complete and unambiguous instruction token, activating a stronger question-answering schema in the model's attention.
3.3 Structural Patterns
Minimal pattern (demonstration selection only, greedy decoding):
Applicable when latency or cost prevents N=50 sampling, or as a starting baseline. Select top-M examples by step count, format prompt, generate one chain per test question.
Question: [Complex example question 1]
[Step 1]
[Step 2]
...
[Step 9]
The answer is [A1].
Question: [Complex example question 2]
[Step 1]
...
[Step 8]
The answer is [A2].
...
Question: [Test question]
Standard pattern (demonstration selection + complexity-based consistency):
Select top-M by step count, sample N = 50 chains, filter to top K = 40 by step count, majority vote.
# Prompt construction: same as minimal pattern
# Decoding: N=50 samples, filter top-40, majority vote
# (model.generate and extract_answer are placeholders for your own client and parser)
from collections import Counter

responses = model.generate(prompt, n=50, temperature=0.7)
step_counts = [chain.count('\n') for chain in responses]
# Rank by step count; use len(responses) in case fewer than 50 chains come back
sorted_indices = sorted(range(len(responses)), key=lambda i: step_counts[i], reverse=True)
top_k_responses = [responses[i] for i in sorted_indices[:40]]
answers = [extract_answer(r) for r in top_k_responses]
final_answer = Counter(answers).most_common(1)[0][0]
Advanced pattern (adaptive K threshold + transfer-domain robustness testing):
Use a validation set to tune both M (number of demonstrations) and K (vote threshold). Test prompt robustness by using demonstrations from a different sub-domain within the same general task area. Add a step-count lower bound to filter out degenerate chains (< 2 steps) that are likely hallucinations or off-topic responses.
from collections import Counter

def complexity_based_consistency(prompt, model, n=50, k_low=2, k_high=40, temp=0.7):
    responses = model.generate(prompt, n=n, temperature=temp)
    # Filter degenerate chains (fewer than k_low steps)
    valid = [r for r in responses if r.count('\n') >= k_low]
    # Sort by step count descending, take top k_high
    valid_sorted = sorted(valid, key=lambda r: r.count('\n'), reverse=True)
    top_k = valid_sorted[:k_high]
    if not top_k:
        # Fallback: no chain passed the filter, so vote over all responses
        top_k = responses
    answers = [extract_answer(r) for r in top_k]
    return Counter(answers).most_common(1)[0][0]
Reasoning patterns used:
- Forward reasoning: each step builds on the result of the previous step (standard in arithmetic CoT)
- Decomposition: complex problems are broken into sub-problems within a single chain
- No backward or verification steps are required by the technique, though demonstrations that include verification ("Let me check: ...") have higher step counts and would therefore be preferentially selected
Reasoning anti-patterns to avoid in demonstrations:
Not all ways of increasing step count are equally beneficial. Certain anti-patterns inflate step count without adding reasoning value:
Redundant restatement: Repeating the question or a prior step in different words:
Step 3: So we have established that there are 45 marbles total. (restatement of step 2)
Step 4: With 45 marbles... (same information again)
This inflates step count without adding inferential content. It may teach the model to produce padded outputs rather than genuinely complex reasoning.
Arithmetic micro-steps: Breaking a single mental arithmetic operation into multiple written sub-steps:
Step 3: 3 × 10 = 30
Step 4: 3 × 5 = 15
Step 5: 30 + 15 = 45
Step 6: So 3 × 15 = 45
While technically correct, this multiplication breakout adds 3 steps without adding conceptual complexity. A single "3 × 15 = 45" step is more informative per step.
Narrative padding: Adding sentence-length transitions that contain no new information:
Step 7: Now that we have computed the total, we can move to the next part of the problem.
The rule of thumb: each step should state exactly one new fact, computation, or inference. If a step restates existing information or adds only transitional language, it should be eliminated or merged with an adjacent step.
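One way to apply this rule of thumb during manual review is a rough automatic pre-check. The sketch below is a heuristic assumption, not part of the paper: the hypothetical function flag_possible_padding marks any step that introduces no number that earlier steps have not already used, which catches the restatement and narrative-padding patterns in arithmetic chains. It does not detect arithmetic micro-steps, it only makes sense for numeric reasoning, and step labels such as "Step 3:" should be stripped first because their digits would count as numbers.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def flag_possible_padding(steps: list) -> list:
    """Return indices of steps that introduce no previously unseen number."""
    seen = set()
    flagged = []
    for index, step in enumerate(steps):
        numbers = set(NUMBER.findall(step))
        # A step with no new numeric content is a candidate for merging or removal.
        if not (numbers - seen):
            flagged.append(index)
        seen |= numbers
    return flagged


steps = [
    "There are 3 bags with 15 marbles each, so 3 × 15 = 45 marbles.",
    "So we have established that there are 45 marbles total.",
    "She gives away 12, leaving 45 - 12 = 33.",
    "Now that we have computed the total, we can move to the next part of the problem.",
]
print(flag_possible_padding(steps))
```

Flagged steps are only candidates; a human reviewer should still decide whether each one carries a genuine inference.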
How to increase genuine step count (not padding):
- Add units tracking: explicitly carry units through each computation step
- Add conditional verification: explicitly check whether an intermediate result is reasonable
- Add sub-problem framing: explicitly name each sub-goal before solving it
- Add alternative checking: briefly consider and eliminate a wrong interpretation before proceeding
- Add domain knowledge application: explicitly cite the formula or principle used at each step
3.4 Modifications for Different Scenarios
Ambiguous tasks (unclear problem structure): When the test domain is ambiguous, build a diverse candidate pool covering multiple sub-types, then apply complexity selection within each sub-type separately. The risk of using a single complexity ranking across heterogeneous sub-types is that high-complexity examples may all come from one sub-type, reducing coverage.
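A minimal sketch of per-sub-type selection, assuming each pool entry carries a hypothetical "subtype" label alongside its "chain" (neither the key name nor the per_type parameter comes from the paper):

```python
from collections import defaultdict


def count_steps(chain: str) -> int:
    """Count non-empty, newline-separated lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())


def select_per_subtype(pool: list, per_type: int = 2) -> list:
    """Take the per_type most complex examples within each sub-type."""
    groups = defaultdict(list)
    for example in pool:
        groups[example["subtype"]].append(example)
    selected = []
    # Iterate sub-types in sorted order so the prompt layout is deterministic.
    for subtype in sorted(groups):
        ranked = sorted(
            groups[subtype],
            key=lambda ex: count_steps(ex["chain"]),
            reverse=True,
        )
        selected.extend(ranked[:per_type])
    return selected
```

Ranking inside each group keeps the complexity criterion while guaranteeing that every sub-type contributes at least one demonstration.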
Complex reasoning that resists step decomposition (e.g., spatial reasoning): When reasoning steps are difficult to express as discrete newline-separated lines, use explicit numbered step labels within the chain and count those instead of newlines. Adjust the scoring function accordingly.
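The adjusted scoring function could look like the sketch below. It assumes steps are labeled in the form "Step N:"; the label format and the name count_labeled_steps are illustrative choices, so the pattern should be changed to match whatever labels the demonstrations actually use.

```python
import re

# Matches labels such as "Step 1:" or "step 12 :" anywhere in the chain.
STEP_LABEL = re.compile(r"\bStep\s+\d+\s*:", re.IGNORECASE)


def count_labeled_steps(chain: str) -> int:
    """Score a chain by its number of explicit step labels instead of newlines."""
    return len(STEP_LABEL.findall(chain))
```

Because it ignores newlines entirely, this scorer works even when several steps share one line or a single step wraps across lines.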
Format-critical outputs (structured data, code): When the output must be in a specific format (JSON, code, table), add a format constraint to the prompt after the demonstrations. Complexity-based selection applies to the reasoning portion of the chain; the output format is specified separately via an explicit instruction appended at the end.
Domain-specific tasks with limited pool: When the candidate pool is small (fewer than 8 examples), do not apply a strict top-M cutoff. Instead, apply a minimum step threshold (e.g., include all examples with ≥ 5 steps) and supplement with the highest-step examples if the threshold yields too few.
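A sketch of threshold-based selection with a top-up for small pools. The parameter min_demos (the smallest acceptable number of demonstrations) is an assumption added for illustration; the text above only specifies the step threshold.

```python
def count_steps(chain: str) -> int:
    """Count non-empty, newline-separated lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())


def select_small_pool(pool: list, min_steps: int = 5, min_demos: int = 4) -> list:
    """Keep every example meeting the step threshold; top up if too few qualify."""
    ranked = sorted(pool, key=lambda ex: count_steps(ex["chain"]), reverse=True)
    selected = [ex for ex in ranked if count_steps(ex["chain"]) >= min_steps]
    if len(selected) < min_demos:
        # The qualifying examples are a prefix of the ranking, so taking the
        # first min_demos entries supplements them with the next-highest ones.
        selected = ranked[:min_demos]
    return selected
```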
4. Applications and Task Selection
4.1 General Applications by Task Type
Multi-step arithmetic reasoning:
The core domain for which the technique was developed and evaluated. Tasks like GSM8K (grade-school math word problems) benefit maximally because each problem naturally decomposes into a variable number of arithmetic steps, and harder problems require more steps. Selecting complex demonstrations gives the model templates for 7–12 step solutions, which are the hardest test cases.
Algebraic and symbolic math:
MathQA (algebraic word problems requiring equation setup and evaluation) showed the largest absolute gain (+18 pp on Codex under greedy decoding). The technique is particularly effective when the candidate pool contains demonstrations with explicit equation-setting steps rather than purely arithmetic reasoning.
Commonsense and factual reasoning:
StrategyQA (+10.1 pp on GPT-3) involves multi-hop commonsense questions ("Did Napoleon meet Beethoven?") that require chaining several factual retrieval steps. Complexity selection improves performance by providing demonstrations that explicitly enumerate each reasoning hop, teaching the model to surface and connect intermediate facts.
Date and table reasoning:
Date Understanding and Penguins in a Table (BigBench Hard) show more modest improvements (+0–2.8 pp). These tasks have inherently bounded reasoning depth—date arithmetic decomposes into at most 4–5 steps regardless of problem difficulty. The ceiling effect on complexity limits the technique's benefit.
Code generation (indirectly):
The Chain-of-Thought Hub repository (Fu et al.) includes Codex-based code generation evaluations. While the original paper does not evaluate code generation as a primary task, the Codex results (175B code-trained model) suggest code-generating models benefit at least as much as text models from complexity-based selection, since code problems naturally decompose into discrete algorithmic steps that map cleanly to the step-count criterion.
For code generation specifically, the complexity criterion maps to the algorithmic complexity of the demonstration solution—demonstrations that implement sorting, graph traversal, or dynamic programming (requiring multiple distinct phases: initialization, iteration, termination) are preferred over demonstrations that solve problems with a single built-in function call. The reasoning chain for a code generation demonstration can be expressed either as step-by-step comments within the code or as a natural-language reasoning block preceding the code:
Question: Given a list of integers, find all pairs that sum to a target value.
Step 1: We need all pairs, not just one, so we cannot stop at the first match.
Step 2: Use a hash set to track numbers seen so far for O(n) time complexity.
Step 3: For each number, check if (target - number) is already in the set.
Step 4: If yes, record the pair as (min, max) for canonical deduplication.
Step 5: Add the current number to the seen set before continuing to next.
Step 6: Handle duplicates by storing pairs in a set, not a list.
Step 7: Return the final list of unique pairs after the loop.
def find_pairs(nums, target):
    seen = set()
    result = set()
    for num in nums:
        complement = target - num
        if complement in seen:
            result.add((min(num, complement), max(num, complement)))
        seen.add(num)
    return list(result)
The answer is the above implementation using a hash set with O(n) time and O(n) space.
Structured data extraction:
Tasks that require extracting structured information from unstructured text (named entities, relationships, events from documents) benefit when the reasoning chain makes extraction logic explicit: "The sentence contains 'Apple Inc. acquired Shazam' — 'Apple Inc.' is the ACQUIRER entity, 'acquired' is the relationship predicate, 'Shazam' is the TARGET entity." High-complexity demonstrations enumerate this extraction reasoning for each entity-relationship triple rather than just providing the output JSON directly. This teaches the model to read carefully and justify each extraction rather than pattern-match surface features.
Formal reasoning and proof verification:
For tasks involving logical deduction, theorem proving, or proof verification, demonstrations that check each inference step explicitly have naturally high step counts. The model is taught to verify not only whether the conclusion follows from the premises but whether each intermediate inference is valid—a thoroughness that is difficult to elicit without demonstrations that model it explicitly.
Tasks where complexity-based prompting adds limited value:
- Single-step lookup or retrieval (step count = 1 uniformly, no signal to exploit)
- Classification without reasoning (selecting complex examples adds verbosity without structural benefit; the model may learn to over-explain simple classifications)
- Creative generation (reasoning step count is not a valid quality proxy for creative writing, poetry, or open-ended narrative tasks)
- Summarization (output quality correlates with comprehension of the source, not with the number of steps in the reasoning chain)
- Factual recall from model memory (when the answer is a specific fact like a year, a name, or a definition, extended reasoning can cause confabulation rather than retrieval)
- Translation (the reasoning chain for translation is typically a single-step semantic mapping; complexity selection would only select demonstrations with unnecessary metalinguistic analysis)
4.2 Domain-Specific Applications
Medical clinical reasoning:
Clinical diagnosis involves multi-step differential diagnosis: gather symptoms, enumerate differential, apply elimination criteria, consider comorbidities, recommend investigation. Each step is distinct and the number of steps is a reasonable proxy for diagnostic thoroughness. Complexity-based selection would preferentially choose demonstrations that model exhaustive differential reasoning over shortcuts that jump to a diagnosis.
No peer-reviewed results for complexity-based prompting in clinical NLP exist as of 2025, but the technique's design aligns with clinical reasoning requirements. The limitation is that medical reasoning chains require domain expertise to write, making the candidate pool more expensive to construct.
Legal reasoning:
Legal analysis similarly decomposes into issue spotting, rule identification, application of rule to facts, and counter-argument consideration (IRAC structure). Complex demonstrations that follow all four components of IRAC would be preferentially selected, teaching the model to produce thorough legal analysis.
Scientific reasoning and research QA:
Tasks like GPQA (Graduate-Level Google-Proof Q&A, Rein et al. 2023) involve multi-step scientific reasoning where solutions require recalling and applying multiple scientific principles. The technique should benefit these tasks, though GPQA was not evaluated in the original paper.
Financial and quantitative analysis:
Financial modeling problems (DCF calculations, options pricing, risk decomposition) involve multi-step quantitative reasoning. The technique would select demonstrations that enumerate all steps in the calculation, preventing the model from skipping to a final number without showing intermediate values.
Unconventional applications:
- Proof verification: Selecting demonstrations whose reasoning chains include explicit logical justification for each step (more steps = more careful justification) teaches the model to produce proofs with traceable derivation.
- Debugging assistance: Code debugging demonstrations that include hypothesis generation, test design, execution simulation, and diagnosis steps would be selected over simple "the error is X" demonstrations.
- Argument analysis: For debate or rhetoric analysis, demonstrations that enumerate multiple sub-arguments, counter-arguments, and rebuttals would be selected, teaching more thorough analytical output.
4.3 Selection Framework
Problem Characteristics That Make This Suitable:
- Multi-step structure: The problem's ground-truth solution requires 5+ distinct steps. Problems with 1–2 steps derive no benefit.
- Reasoning chain expressibility: The solution can be expressed as a linear sequence of verbalizable steps separated by newlines. Problems that require parallel sub-computations or tree-structured reasoning benefit less.
- Homogeneous sub-type: The test distribution is concentrated in a specific type of reasoning (e.g., all arithmetic, all multi-hop factual). When the test distribution is highly heterogeneous, retrieval-based selection may be more appropriate.
- Existing annotated pool: A small candidate pool of human-quality demonstrations is available. The technique is not applicable when starting from zero annotations—use Zero-Shot-CoT or Auto-CoT instead.
Problem Characteristics That Make This Unsuitable:
- Single-step or shallow tasks: Classification, retrieval, sentiment analysis, entity extraction—any task where a correct solution can be expressed in 1–2 lines. Complexity selection reduces to arbitrary selection in this case.
- Highly heterogeneous test distribution: When test questions span many structurally distinct sub-types, complexity selection without semantic matching can select demonstrations that are all from one sub-type, creating representation bias.
- Length-sensitive outputs: When output length is constrained (e.g., a summary task where brevity is required), selecting complex demonstrations biases the model toward verbose outputs.
- Low-resource languages or highly domain-specific jargon: When the model has weak in-context learning ability for the target domain, adding complex demonstrations can confuse rather than guide.
Selection Signals — When to Choose This Approach:
Use Complexity-Based Prompting when:
- You have a candidate pool of ≥ 8 human-annotated demonstrations
- The target task involves multi-step reasoning with verifiable intermediate steps
- You observe high variance in few-shot CoT performance across differently curated example sets
- You need annotation-efficient selection (cannot build a full retrieval corpus)
- You are already using Self-Consistency and want to improve it without changing the prompt format
Use an alternative when:
- Your test distribution is highly heterogeneous → use retrieval-based selection or dynamic few-shot selection
- You have no annotated demonstrations → use Zero-Shot-CoT or Auto-CoT
- Your task is creative or open-ended → use diversity-based example selection
- You need sub-50ms latency → do not use Complexity-Based Consistency (N=50 sampling); limit to Phase 1 only
Model Requirements:
| Requirement | Details |
|---|---|
| Minimum model size | ~100B parameters (empirically: text-curie-001 at 6.7B shows no gain) |
| Recommended | 175B+ (GPT-3, Codex); comparable instruction-tuned models |
| Optimal | GPT-4 class; gains from complexity selection compound with stronger base capability |
| Not suitable | Models < 30B that do not exhibit emergent multi-step reasoning |
| Required capability | Chain-of-thought generation ability (the model must be capable of producing step-by-step reasoning in the first place) |
Context and Resource Requirements:
- Context budget: 8 complex demonstrations × ~9 steps × ~15 tokens/step ≈ 1,000–1,500 tokens for demonstrations. Total prompt (with question) is typically 1,200–2,000 tokens. This fits within GPT-3.5-turbo's context window but requires attention to budget for complex demonstrations.
- Inference cost (Phase 1 only, greedy): 1 forward pass per test question. Cost is proportional to prompt + output length.
- Inference cost (Phase 1 + 2, N=50): 50 sampled generations per test question at temperature > 0. This is the same cost profile as Self-Consistency and represents up to a 50× cost multiplier over greedy decoding; sequential sampling multiplies latency by the same factor.
- Latency: If using N=50 sampling, the decoding phase dominates latency. Parallel sampling (most modern APIs support batch generation) can partially mitigate this.
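The context-budget arithmetic above can be reproduced with a small helper. The function name and the overhead_per_demo parameter (tokens for each demonstration's question and answer line) are assumptions for illustration, and the per-step figure is the rough estimate quoted in the text rather than a tokenizer measurement.

```python
def estimate_demo_tokens(
    num_demos: int = 8,
    steps_per_demo: int = 9,
    tokens_per_step: int = 15,
    overhead_per_demo: int = 0,
) -> int:
    """Rough token estimate for the demonstration block of the prompt."""
    return num_demos * (steps_per_demo * tokens_per_step + overhead_per_demo)


# 8 demos x 9 steps x 15 tokens, reasoning lines only
print(estimate_demo_tokens())
# Same demos with an assumed 40 tokens each for question and answer line
print(estimate_demo_tokens(overhead_per_demo=40))
```

For a real budget, counting tokens with the target model's own tokenizer is more reliable than a fixed per-step estimate.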
Cost Implications:
One-time costs:
- Building and annotating the candidate pool: manual effort, domain-dependent. ~1–4 hours for a small pool of 10–15 examples.
- Implementing the step-count scoring function: negligible (< 10 lines of code).
- Tuning K (if not using default K=40): requires a small validation set (~50–100 examples) and multiple inference runs.
Per-request production costs:
- Phase 1 only: ~1.5–2× the cost of a zero-shot query (due to longer prompt from demonstrations).
- Phase 1 + Phase 2: ~50–55× the cost of a zero-shot query (50 samples + longer prompt).
Trade-offs: Phase 2 provides the largest accuracy gains but at 50× cost. For production systems where per-query cost is constrained, Phase 1 alone (greedy decoding) offers a good middle ground—still substantially better than handcrafted CoT, with only 1 forward pass per query.
Variant Selection:
| Scenario | Recommended Variant |
|---|---|
| Accuracy-critical, cost-unconstrained | Phase 1 + Phase 2 (N=50, K=40) |
| Accuracy-important, cost-sensitive | Phase 1 only (greedy decoding) |
| High heterogeneity in test distribution | Phase 1 with diversity-augmented selection (combine complexity and topical diversity) |
| Very small candidate pool (< 8 examples) | Phase 1 with minimum step threshold instead of top-M cutoff |
| Already using Self-Consistency | Add Phase 1 (complexity-based demo selection) as a drop-in improvement |
5. Implementation
5.1 Implementation Steps
Prerequisites:
- A candidate pool of (question, reasoning chain, answer) examples for the target task. Minimum 8–15 examples, each with a multi-step reasoning chain.
- Access to a capable LLM API (GPT-3.5-turbo, GPT-4, Claude, or equivalent 175B+ model).
- Python 3.10+ environment (the code below uses list[dict] and str | None type hints).
Step-by-step implementation:
Step 1 — Build the candidate pool. Write or collect question-reasoning-answer triples. For math tasks, each reasoning chain should include all intermediate calculations. For reasoning tasks, each chain should enumerate each inferential hop. Aim for variety in the types of problems but do not worry about coverage—you will select only the most complex ones.
Step 2 — Score and rank the pool. Count the non-empty, newline-separated lines in each reasoning chain. This is the complexity score.
def complexity_score(reasoning_chain: str) -> int:
    """Count the number of reasoning steps (newline-separated lines)."""
    return len([line for line in reasoning_chain.split('\n') if line.strip()])
Step 3 — Select demonstrations. Sort the pool by complexity score descending and take the top M.
def select_demonstrations(pool: list[dict], m: int = 8) -> list[dict]:
    """
    pool: list of dicts with keys 'question', 'chain', 'answer'
    Returns top-m examples by complexity score.
    """
    scored = sorted(pool, key=lambda x: complexity_score(x['chain']), reverse=True)
    return scored[:m]
Step 4 — Format the prompt. Assemble demonstrations into the standard CoT few-shot format.
def build_prompt(demonstrations: list[dict], test_question: str) -> str:
    """Build the few-shot CoT prompt with complexity-selected demonstrations."""
    parts = []
    for demo in demonstrations:
        parts.append(f"Question: {demo['question']}\n{demo['chain']}\nThe answer is {demo['answer']}.")
    parts.append(f"Question: {test_question}")
    return "\n\n".join(parts)
Step 5 — Generate (Phase 1 only, greedy).
import openai
def predict_greedy(prompt: str, model: str = "gpt-4o") -> str:
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=512,
    )
    return response.choices[0].message.content
Step 6 — Generate with complexity-based consistency (Phase 1 + 2).
from collections import Counter
def predict_with_complexity_consistency(
    prompt: str,
    model: str = "gpt-4o",
    n: int = 50,
    k: int = 40,
    temperature: float = 0.7,
) -> str | None:
    """
    Sample n chains, filter to top-k most complex, return majority vote answer.
    Returns None if no answer can be extracted from any chain.
    """
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n,
        max_tokens=512,
    )
    chains = [choice.message.content for choice in response.choices]
    # Score and sort by complexity (step count)
    scored = sorted(chains, key=complexity_score, reverse=True)
    top_k = scored[:k]
    # Extract answers and majority vote
    answers = [extract_final_answer(chain) for chain in top_k]
    answers = [a for a in answers if a is not None]
    if not answers:
        # Fallback: no answer in the top-k, so vote over every sampled chain
        answers = [a for c in chains if (a := extract_final_answer(c)) is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
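import re
def extract_final_answer(chain: str) -> str | None:
    """Extract the answer from the 'The answer is X' pattern, else the last non-empty line."""
    # Stop at a sentence-ending period, a newline, or end of text, so that
    # decimal answers such as "$4.00" are not cut at the first period.
    match = re.search(r"[Tt]he answer is\s+(.+?)(?:\.(?=\s|$)|\n|$)", chain)
    if match:
        return match.group(1).strip()
    lines = [l.strip() for l in chain.split('\n') if l.strip()]
    return lines[-1] if lines else None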
Platform-specific implementations:
Anthropic (Claude API):
import anthropic
client = anthropic.Anthropic()
def predict_claude_greedy(prompt: str, model: str = "claude-opus-4-6") -> str:
    message = client.messages.create(
        model=model,
        max_tokens=512,
        temperature=0,  # set explicitly: the API default of 1.0 is not greedy
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text
def predict_claude_complexity_consistency(
    prompt: str,
    model: str = "claude-opus-4-6",
    n: int = 50,
    k: int = 40,
    temperature: float = 0.7,
) -> str | None:
    """
    Claude does not support n>1 in a single call; loop and collect chains.
    Returns None if no answer can be extracted from the top-k chains.
    """
    chains = []
    for _ in range(n):
        message = client.messages.create(
            model=model,
            max_tokens=512,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}]
        )
        chains.append(message.content[0].text)
    scored = sorted(chains, key=complexity_score, reverse=True)
    top_k = scored[:k]
    answers = [extract_final_answer(chain) for chain in top_k]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o", temperature=0)
template = PromptTemplate(
    input_variables=["demonstrations", "question"],
    template="{demonstrations}\n\nQuestion: {question}"
)
chain = template | llm | StrOutputParser()
def run_complexity_prompting(pool: list[dict], test_question: str) -> str:
    demos = select_demonstrations(pool, m=8)
    demo_text = "\n\n".join(
        f"Question: {d['question']}\n{d['chain']}\nThe answer is {d['answer']}."
        for d in demos
    )
    return chain.invoke({"demonstrations": demo_text, "question": test_question})
5.2 Configuration
Key parameters and their effects:
| Parameter | Default | Range | Effect |
|---|---|---|---|
| M (demonstrations selected) | 8 | 4–12 | Higher M → longer prompt, more schema diversity; diminishing returns above 8 |
| N (samples per question) | 50 | 10–100 | Higher N → better vote quality, higher cost; 50 is the sweet spot per the paper |
| K (top chains for voting) | 40 | 20–N | Lower K → stronger complexity filter; optimal around 60–80% of N |
| Temperature | 0.7 | 0.5–1.0 | Higher temp → more chain diversity; too high introduces noise |
| Max tokens per chain | 512 | 256–1024 | Must accommodate the longest expected reasoning chain; set generously |
| Step separator | \n | \n, ., ; | \n is consistently best; do not change unless forced by API constraints |
| Question prefix | Question: | Q:, Question:, Problem: | Question: is empirically the best for math reasoning tasks |
Task-specific tuning guidelines:
- Pure arithmetic (GSM8K-style): M=8, N=50, K=40, T=0.7. The canonical configuration works well as-is.
- Algebraic/symbolic problems (MathQA-style): Increase M to 10–12 if the pool is large; algebraic problems vary more in structure and benefit from broader coverage.
- Commonsense reasoning (StrategyQA-style): Tune K rather than assuming the default. Commonsense problems have shorter natural chains and a weaker step-count signal, so compare a more aggressive filter (K=30) against a more conservative one (K closer to N) on a validation set before committing.
- BigBench Hard tasks with bounded complexity: Phase 1 alone may be sufficient; Phase 2's benefit is small when all chains have similar step counts.
- Code generation tasks: Use \n as separator but score by non-blank code lines, not total lines (blank lines in code do not represent reasoning steps).
Domain adaptation considerations:
When adapting to a new domain:
- Check whether the domain's natural reasoning chains vary substantially in step count. If all problems in the domain decompose into 3–4 steps regardless of difficulty, the technique's signal is weak.
- Adjust the minimum step threshold for "complex" accordingly. In a domain where chains range 2–6 steps, selecting the top-8 by complexity may mean selecting examples with 5–6 steps, which is a weaker complexity criterion than GSM8K's 8–12 steps.
- Consider mixing a domain-complexity criterion (relative to the pool's distribution) with an absolute minimum threshold.
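The mixed criterion in the last point can be sketched as below. The quantile-based cutoff, the function name, and both default values are assumptions chosen for illustration; the text does not prescribe a specific formula.

```python
def count_steps(chain: str) -> int:
    """Count non-empty, newline-separated lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())


def select_by_relative_complexity(
    pool: list, quantile: float = 0.75, min_steps: int = 4
) -> list:
    """Keep examples at or above both a pool-relative cutoff and an absolute floor."""
    if not pool:
        return []
    counts = sorted(count_steps(ex["chain"]) for ex in pool)
    # Pool-relative cutoff: the step count at the given quantile of this pool.
    relative_cutoff = counts[int(quantile * (len(counts) - 1))]
    # The absolute floor stops a uniformly shallow pool from passing as complex.
    threshold = max(relative_cutoff, min_steps)
    return [ex for ex in pool if count_steps(ex["chain"]) >= threshold]
```

An empty result is itself informative: it indicates that the pool's chains are too shallow for step count to carry much signal in that domain.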
5.3 Best Practices and Workflow
Complete End-to-End Worked Example
The following walks through a complete application to a GSM8K-style problem, showing every step explicitly.
Candidate pool (before selection):
Example A (11 steps):
Question: A factory produces widgets. In January, 240 widgets were made at a cost of $3 each.
In February, production doubled but cost increased by 50%. In March, production dropped by 25%
and cost dropped by $0.50. What was the average cost per widget across all three months?
Step 1: January production = 240 widgets.
Step 2: January cost per widget = $3.00.
Step 3: January total cost = 240 × $3 = $720.
Step 4: February production = 240 × 2 = 480 widgets.
Step 5: February cost per widget = $3.00 × 1.50 = $4.50.
Step 6: February total cost = 480 × $4.50 = $2,160.
Step 7: March production = 480 × 0.75 = 360 widgets.
Step 8: March cost per widget = $4.50 - $0.50 = $4.00.
Step 9: March total cost = 360 × $4.00 = $1,440.
Step 10: Total cost across three months = $720 + $2,160 + $1,440 = $4,320.
Step 11: Total widgets = 240 + 480 + 360 = 1,080. Average cost = $4,320 / 1,080 = $4.00.
The answer is $4.00.
Example B (3 steps):
Question: A store sells 5 apples for $2. How much do 20 apples cost?
Step 1: Cost per apple = $2 / 5 = $0.40.
Step 2: 20 apples × $0.40 = $8.
The answer is $8.
Scoring and selection:
pool = [
{"question": "A factory produces widgets...", "chain": "Step 1: ...\n...\nStep 11: ...", "answer": "$4.00", "steps": 11},
{"question": "A store sells 5 apples...", "chain": "Step 1: ...\nStep 2: ...", "answer": "$8", "steps": 3},
# ... more examples ...
]
selected = select_demonstrations(pool, m=8)
# selected = the 8 examples with highest step counts
Prompt construction:
Question: A factory produces widgets. In January, 240 widgets...
Step 1: January production = 240 widgets.
Step 2: January cost per widget = $3.00.
[... 9 more steps ...]
The answer is $4.00.
[... 7 more complex demonstrations ...]
Question: Sarah has 3 bags of marbles. Each bag contains 15 marbles.
She gives away 12 marbles total. How many does she have left?
N=50 chains generated (sample of 3):
Chain 1 (8 steps):
Total marbles = 3 × 15 = 45.
She starts with 45 marbles.
She gives away 12 marbles.
Marbles given away from bag 1: we need to track by bag.
Wait — actually she gives away 12 total, not per bag.
Total given = 12.
Remaining = 45 - 12 = 33.
The answer is 33. [step_count=8]
Chain 2 (4 steps):
3 bags × 15 marbles = 45 marbles.
Gives away 12.
45 - 12 = 33.
The answer is 33. [step_count=4]
Chain 3 (2 steps):
3 × 15 = 45, minus 12 = 33.
The answer is 33. [step_count=2]
Complexity-based consistency filtering (K=40 out of N=50):
After scoring all 50 chains, Chain 1 (8 steps) ranks in the top-40; Chain 3 (2 steps) may or may not be retained depending on the distribution. Answer "33" wins the majority vote.
Full workflow from candidate pool to production:
- Collect 15–25 example problems from the target domain with full reasoning chains.
- Score all examples by step count; review the top-10 manually to confirm they represent genuine multi-step reasoning, not verbose padding.
- Select the top-8 as the demonstration set. Run the prompt on a held-out validation set of 50–100 examples under greedy decoding. Record accuracy.
- If performance is satisfactory, deploy with Phase 1 only (greedy) for cost efficiency.
- If accuracy must be maximized, enable Phase 2 (N=50, K=40). Evaluate on the validation set to confirm the gain.
- If Phase 2 gain is smaller than expected, tune K by trying K ∈ {20, 30, 40, 50} on the validation set.
- Periodically audit the candidate pool: as new, harder examples are collected, re-score and update the selected demonstrations.
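The K-tuning step in this workflow does not require new inference runs for each K if the sampled chains are cached. The sketch below assumes a hypothetical cache layout: one list per validation question, each holding (chain, extracted_answer) pairs, plus a parallel list of gold answers. The function names are illustrative.

```python
from collections import Counter


def count_steps(chain: str) -> int:
    """Count non-empty, newline-separated lines in a reasoning chain."""
    return sum(1 for line in chain.split("\n") if line.strip())


def vote_top_k(samples: list, k: int) -> str:
    """Majority vote over the k samples with the most steps."""
    ranked = sorted(samples, key=lambda s: count_steps(s[0]), reverse=True)
    return Counter(answer for _, answer in ranked[:k]).most_common(1)[0][0]


def sweep_k(cached_samples: list, gold_answers: list, ks=(20, 30, 40, 50)) -> dict:
    """Return validation accuracy for each candidate K using cached samples."""
    results = {}
    for k in ks:
        correct = sum(
            vote_top_k(samples, k) == gold
            for samples, gold in zip(cached_samples, gold_answers)
        )
        results[k] = correct / len(gold_answers)
    return results
```

Sampling once and re-voting offline keeps the cost of the sweep at a single N=50 pass over the validation set.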
Implementation do's and don'ts:
| Do | Don't |
|---|---|
| Use \n as step separator in both demonstrations and output parsing | Mix step separators across demonstrations (inconsistency reduces pattern strength) |
| Review top-selected demonstrations manually before deploying | Trust the step count alone without reading the selected examples |
| Set a minimum step threshold (≥3) to exclude trivially short examples | Set K = N in Phase 2 (this recovers vanilla Self-Consistency, wasting the filtering) |
| Test robustness by evaluating on a held-out set with demonstrations from a different sub-domain | Assume that gains on one sub-domain transfer automatically to very different sub-domains |
| Keep the candidate pool updated as you encounter new problem types | Lock the pool permanently; the distribution of test questions may shift |
Common instruction design patterns:
For demonstration chains, each step should follow the pattern:
[Observation or sub-question] → [Operation or inference] → [Intermediate result]
Example (GSM8K-style):
Question: A store has 5 shelves. Each shelf holds 12 boxes. Each box contains 8 items.
If 30% of the items are returned, how many items remain?
Total boxes: 5 × 12 = 60 boxes.
Total items before returns: 60 × 8 = 480 items.
Items returned: 480 × 0.30 = 144 items.
Items remaining: 480 - 144 = 336 items.
The answer is 336.
Each line is a reasoning step. The chain has 5 steps, making it a moderately complex demonstration. A chain with 8–10 such steps would be ranked higher.
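As a quick check of the count above, here is a minimal sketch that scores a chain by its non-empty lines. The name `count_steps` is hypothetical, and the sketch assumes newline-separated steps with the final answer line counted as a step, which is how the example reaches 5.

```python
def count_steps(chain: str) -> int:
    """Count non-empty lines in a reasoning chain (one line = one step)."""
    return sum(1 for line in chain.splitlines() if line.strip())


example_chain = """Total boxes: 5 × 12 = 60 boxes.
Total items before returns: 60 × 8 = 480 items.
Items returned: 480 × 0.30 = 144 items.
Items remaining: 480 - 144 = 336 items.
The answer is 336."""

print(count_steps(example_chain))  # 5
```

Blank lines are ignored so that stray spacing in a demonstration does not inflate its score.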
5.4 Debugging Decision Tree
Symptom: Accuracy is no better than handcrafted CoT
Root cause investigation:
- Check step count of selected demonstrations: are they genuinely complex (≥ 6 steps)? If the pool contains only short chains, the technique degenerates to arbitrary selection.
- Check model size: is the model large enough to benefit from complex demonstrations? Models < 100B parameters may not exhibit the step-complexity benefit.
- Check task type: is this a multi-step reasoning task? If the task requires only 1–2 steps, complexity selection has no signal to exploit.
Fixes:
- Add more complex examples to the candidate pool manually.
- Switch to a larger model.
- Verify the task is multi-step; if not, use standard few-shot CoT.
Symptom: Generated chains are short despite selecting complex demonstrations
Root cause: Temperature too low (model converges to short chains) or max_tokens too low (chains are truncated).
Fix: Increase temperature to 0.7–0.9. Increase max_tokens to 768 or 1024. Verify that the model is not being penalized for long outputs by a downstream truncation step.
Symptom: Majority vote returns incorrect answers even with N=50 samples
Root cause: Either K is set too aggressively (filtering down to very few chains amplifies noise from long but incorrect chains), or the prompt demonstrations are from a mismatched sub-domain.
Fix: Increase K toward N (less aggressive filtering). Verify demonstrations are from the same general reasoning type as the test questions. Consider adding a semantic similarity check to augment complexity-based filtering.
Symptom: Inconsistent outputs across runs
Root cause: Temperature is high and K is small, so small changes in sampling yield different top-K sets.
Fix: Increase N (more samples, more stable majority), or reduce temperature slightly (0.6–0.7), or increase K (averaging over more chains reduces volatility).
Symptom: Phase 2 (complexity-based consistency) is not beating Phase 1 (greedy)
Root cause: The task has low natural chain length variance, so step-count filtering does not effectively distinguish correct from incorrect chains.
Fix: Evaluate whether standard Self-Consistency (without complexity filtering) beats greedy; if it does, tune K more aggressively. If Self-Consistency also does not help, the task may not benefit from multi-sample aggregation—use greedy decoding only.
Symptom: Hallucinations in intermediate steps
Root cause: Complex demonstrations can inadvertently teach the model to produce long, elaborate reasoning chains even when it doesn't know the answer, filling steps with plausible-sounding but incorrect intermediate values.
Fix: Add a verification instruction at the end of each demonstration chain: "Let me verify: [check computation]." This teaches the model to self-verify, and the additional verification step also increases complexity score (beneficial for selection and filtering).
5.5 Testing and Optimization
Validation Strategy
Use a held-out validation set of at least 50 examples from the target domain. Do not use the same examples as your candidate pool. The validation set should represent the expected distribution of test questions—if the test distribution is harder on average, ensure the validation set reflects this.
For robust evaluation:
- Happy path: Standard multi-step questions of the type the demonstrations cover.
- Harder cases: Questions requiring 10+ steps; verify the model does not give up early.
- Simpler cases: Questions requiring 1–3 steps; verify the technique does not over-complicate short solutions.
- Out-of-distribution questions: Questions from a different but related sub-domain; check how much performance degrades and whether complexity selection still beats random.
Quality Metrics
- Exact match accuracy: The standard metric for math reasoning benchmarks. Compare against handcrafted CoT, random selection, and vanilla Self-Consistency baselines.
- Chain length distribution: Plot the distribution of generated chain lengths. Complexity-based prompting should shift the distribution rightward (more complex outputs) compared to handcrafted CoT.
- Consistency (standard deviation across runs): At fixed K, run the full pipeline 5× and compute answer variance. This measures stability of the voting procedure.
- Per-question difficulty correlation: Group test questions by ground-truth solution step count and compute accuracy per group. Complexity-based prompting should improve accuracy more on harder (more steps) questions.
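The per-difficulty metric can be computed with a small grouping helper. This is a sketch under assumptions: the record keys `gold_steps` and `correct` and the bin edges are illustrative, not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_difficulty(records, bins=((1, 3), (4, 6), (7, 9), (10, 10**6))):
    """Group records by ground-truth step count and return accuracy per bin.

    Each record is a dict with hypothetical keys:
      'gold_steps' (int) and 'correct' (bool).
    Bins with no records are omitted from the result.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        for lo, hi in bins:
            if lo <= rec["gold_steps"] <= hi:
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(rec["correct"])
                break
    return {b: hits[b] / totals[b] for b in bins if totals[b] > 0}


toy = [
    {"gold_steps": 2, "correct": True},
    {"gold_steps": 2, "correct": False},
    {"gold_steps": 8, "correct": True},
]
print(accuracy_by_difficulty(toy))  # {(1, 3): 0.5, (7, 9): 1.0}
```

Running it once per method and comparing the per-bin numbers shows whether the gain concentrates on the harder questions.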
Token and Cost Optimization
- Phase 1 only (no Phase 2): If cost is constrained, Phase 1 alone provides +5–6 pp average improvement with the same latency as any other greedy few-shot CoT. This is often the right production trade-off.
- Reduced N: Use N=20 instead of N=50 with K=15 for a 2.5× cost reduction. The paper shows diminishing returns above N=40; N=20 captures most of the Phase 2 benefit.
- Prompt compression: If context budget is tight, reduce the step detail within demonstrations while preserving step count. Shorter steps with the same count retain most of the complexity signal.
- Caching: The prompt (demonstrations) is fixed across all test questions—cache the encoded demonstration context if the inference API supports prefix caching (Anthropic and OpenAI offer this). This reduces the per-token cost of the demonstration prefix across all queries.
Experimentation
A/B test framework for comparing variants:
- Hold demonstrations fixed. Compare greedy (Phase 1) vs. Phase 1+2 on 200 validation examples.
- Hold decoding fixed (greedy). Compare complexity selection vs. random selection vs. centroid selection on 200 examples.
- For K tuning: sweep K ∈ {10, 20, 30, 40, 50} on 100 validation examples with N=50. Choose K that maximizes validation accuracy.
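One way to keep the K sweep cheap is to sample the N chains once per question and re-run only the filtering and vote for each K. The sketch below assumes the samples are already cached as `(step_count, answer)` pairs; the function names are hypothetical, and reusing samples across K values is a design choice of this sketch rather than something the paper prescribes.

```python
from collections import Counter


def vote_top_k(samples, k):
    """Majority vote over the k samples with the most steps.

    `samples` is a list of (step_count, answer) pairs for one question.
    """
    top = sorted(samples, key=lambda s: s[0], reverse=True)[:k]
    return Counter(answer for _, answer in top).most_common(1)[0][0]


def sweep_k(cached, k_values):
    """Return {k: accuracy}, using the same cached samples for every k.

    `cached` is a list of (samples, gold_answer) pairs, one per question.
    """
    results = {}
    for k in k_values:
        correct = sum(vote_top_k(samples, k) == gold for samples, gold in cached)
        results[k] = correct / len(cached)
    return results


toy_cache = [([(5, "10"), (4, "10"), (2, "7"), (1, "7"), (1, "7")], "10")]
print(sweep_k(toy_cache, [2, 5]))  # {2: 1.0, 5: 0.0}
```

In the toy data the two longest chains agree on the correct answer, while the unfiltered vote (K = N = 5) is outvoted by three short, wrong chains.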
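For statistical significance: with 200 examples, accuracy differences of 4–8 pp sit near the limit of what can be detected at α=0.05. For example, 80% vs. 86% on 200 examples each gives an unpaired two-proportion z statistic of about 1.6, below the 1.96 threshold. Because both variants are scored on the same questions, use a paired test such as McNemar's, which is often more sensitive in this setting, and enlarge the validation set if the result is borderline. Do not claim a variant is better without significance testing.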
Complete A/B evaluation implementation:
```python
from scipy import stats


def evaluate_method(method_fn, test_set: list[dict]) -> tuple[float, list[bool]]:
    """
    Evaluate a prompting method on the test set.
    Returns (accuracy, per_example_correctness).
    """
    results = []
    for example in test_set:
        prediction = method_fn(example['question'])
        correct = normalize_answer(prediction) == normalize_answer(example['answer'])
        results.append(correct)
    return sum(results) / len(results), results


def compare_methods(method_a_fn, method_b_fn, test_set: list[dict], alpha: float = 0.05):
    """
    A/B comparison between two prompting methods with statistical significance test.
    Uses McNemar's test (paired comparison) for correlated predictions.
    """
    acc_a, results_a = evaluate_method(method_a_fn, test_set)
    acc_b, results_b = evaluate_method(method_b_fn, test_set)

    # McNemar's test: compare paired binary outcomes
    # n01: A wrong, B correct; n10: A correct, B wrong
    n01 = sum(1 for a, b in zip(results_a, results_b) if not a and b)
    n10 = sum(1 for a, b in zip(results_a, results_b) if a and not b)

    # Continuity-corrected statistic; clamp at 0 so that n01 == n10 gives chi2 = 0
    discordant = n01 + n10
    chi2 = max(abs(n01 - n10) - 1, 0) ** 2 / discordant if discordant > 0 else 0.0
    p_value = stats.chi2.sf(chi2, df=1)

    print(f"Method A accuracy: {acc_a:.3f}")
    print(f"Method B accuracy: {acc_b:.3f}")
    print(f"Difference: {acc_b - acc_a:+.3f}")
    print(f"McNemar's test: chi2={chi2:.3f}, p={p_value:.4f}")
    print(f"Statistically significant (α={alpha}): {p_value < alpha}")
    return acc_a, acc_b, p_value


# Usage example: each method is wrapped as a question -> prediction callable
compare_methods(
    method_a_fn=lambda q: predict_greedy(build_prompt(random_demos, q)),
    method_b_fn=lambda q: predict_greedy(build_prompt(complex_demos, q)),
    test_set=validation_set,
)
```
McNemar's test is the correct statistical test for this comparison because the two methods are evaluated on the same test examples (paired observations). A two-proportion z-test assumes independence, which is violated when both methods see the same questions.
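When the two methods disagree on only a handful of questions (a common rule of thumb is fewer than about 25 discordant pairs), the chi-square approximation can be rough. An exact binomial version is a possible alternative. The sketch below uses only the standard library; `mcnemar_exact_p` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact_p(n01: int, n10: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis each discordant pair is equally likely to
    favor either method, so min(n01, n10) ~ Binomial(n01 + n10, 0.5).
    """
    n = n01 + n10
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mcnemar_exact_p(0, 5))  # 0.0625
print(mcnemar_exact_p(3, 3))  # 1.0
```

Five disagreements that all favor one method are still not significant at α=0.05, which is a useful reminder that small validation sets rarely settle a comparison.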
Hyperparameter tuning with Bayesian optimization:
For tuning M (demonstrations) and K (vote threshold) jointly, a simple grid search becomes expensive as the search space grows. Use Bayesian optimization (e.g., Optuna) to find the optimal configuration efficiently:
```python
import optuna


def objective(trial):
    m = trial.suggest_int('m', 4, 12)
    k = trial.suggest_int('k', 10, 50)
    temp = trial.suggest_float('temperature', 0.5, 0.9)
    demos = select_demonstrations(candidate_pool, m=m)
    prompt_builder = lambda q: build_prompt(demos, q)
    accuracy = evaluate_on_validation(
        prompt_fn=lambda q: predict_with_complexity_consistency(
            prompt_builder(q), n=50, k=k, temperature=temp
        ),
        validation_set=validation_set,
    )
    return accuracy


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)
print(f"Best config: {study.best_params}")
print(f"Best validation accuracy: {study.best_value:.3f}")
```
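In a search space like this one, 30 trials of Bayesian optimization can often reach a near-optimal configuration with far fewer evaluations than a full grid search over the same ranges, since even a coarse grid over three parameters has hundreds of combinations. Treat the reported best accuracy as optimistic, because it was selected on the same validation set it was measured on.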
Building a high-quality candidate pool (detailed guidance):
The quality of the candidate pool is the single most important factor in the technique's effectiveness. The scoring function is automatic; the pool is human-constructed. Investing in pool quality pays dividends for every subsequent inference call.
Guidelines for pool construction:
- Aim for 20–30 examples, more than you need. Selecting the top-8 from 20 is better than selecting the top-8 from 8 (more competition for slots).
- Cover the full range of difficulty. Include simple examples (3–4 steps) and complex ones (9–12 steps). The complex ones will be selected; the simple ones serve as a sanity check that the pool is correctly scored.
- Write reasoning chains before checking the answer. This prevents the common error of writing a backward chain that starts with the answer and works back to justify it. Forward chains are more authentic and better demonstrate the problem-solving process.
- Vary the surface form, not the structure. Examples can use different domains (pricing, rates, geometry) but should all use the same structural reasoning pattern (identify knowns → set up calculation → compute intermediate → aggregate → verify). Surface variety helps; structural variety in demonstrations can confuse the model.
- Review selected demonstrations from the model's perspective. Read the top-8 selected examples as if you are the model seeing them for the first time. Ask: does this sequence of demonstrations teach me how to solve the types of problems I will encounter? If not, add more targeted examples to the pool.
- Update the pool quarterly or on distribution shift. As the task evolves, new problem types emerge. Stale pools from six months ago may not represent the current test distribution. Schedule pool reviews as a recurring maintenance task.
6. Limitations and Constraints
6.1 Known Limitations
Scale dependency (fundamental, not mitigable through prompt engineering):
The most significant fundamental limitation is that Complexity-Based Prompting requires a model with emergent multi-step reasoning ability. The paper tests text-curie-001 (6.7B parameters) and observes essentially zero gain from complexity-based selection: the model's outputs are near-uniformly poor regardless of demonstration quality, because it lacks the capacity to perform sustained multi-step reasoning in the first place. Flan-T5 (11B parameters) shows only marginal improvement (+1.5% at best). The technique's benefits emerge sharply in the 100B+ regime.
This is not a limitation that prompt engineering can overcome. It is an emergent-ability constraint: CoT reasoning itself is emergent at scale, and complexity-based selection improves CoT—it cannot conjure CoT where CoT does not exist.
Practical implication: Before applying the technique, verify that the target model can produce reasonable multi-step reasoning chains on zero-shot or minimal-shot prompts. If it cannot, the technique is inapplicable.
Step-count as a surface proxy (partially mitigable):
Counting newline-separated lines is a surface heuristic. It fails as a complexity measure in two directions:
- Over-counting: A chain with 12 lines may consist of 12 trivial arithmetic operations, each with 1-digit arithmetic. The "complexity" score is high, but the reasoning depth is low—the problem is just a long cascade of simple operations, not a problem requiring multi-level decomposition or conditional reasoning.
- Under-counting: A 4-line chain may involve a sophisticated probabilistic argument where each line encodes a non-trivial inference. The step count underestimates its informational value as a demonstration.
The paper acknowledges this and shows that step count is nonetheless the empirically strongest proxy available without semantic analysis. But practitioners should read selected demonstrations and verify they genuinely represent complex reasoning, not just verbose output.
Diversity-complexity tension (partially mitigable):
Selecting the top-M examples by complexity can inadvertently select all examples from the same problem type within the pool. If the hardest problems in your pool all happen to be probability problems, your demonstrations will all be probability problems, and the model will struggle with algebra or geometry questions. The technique maximizes depth at the potential expense of topical breadth.
This is a real problem for heterogeneous test distributions. The mitigation is to stratify the selection: first group the pool by sub-type, then select the most complex example within each sub-type. This hybrid approach (complexity within type + diversity across types) is not described in the original paper but is a natural extension.
Annotation requirement (not mitigable by the technique itself):
Unlike Auto-CoT, which generates demonstrations automatically from questions using Zero-Shot-CoT, Complexity-Based Prompting requires pre-existing human-annotated reasoning chains. The technique selects from annotated examples; it does not generate them. For truly annotation-free settings, Auto-CoT or Zero-Shot-CoT must be used instead.
The annotation cost is moderate—you need only 8–20 examples, not a full training corpus—but it is non-zero.
Annotation cost reduction strategies:
When annotation cost is a constraint, several approaches reduce it while preserving effectiveness:
Model-assisted annotation: Use a capable model (GPT-4, Claude Opus) to generate candidate reasoning chains for pool questions, then have a domain expert review and correct them. Corrective review is substantially faster than writing chains from scratch—typical time saving is 60–70% of annotation effort.
Complexity-guided annotation priority: With a limited annotation budget, annotate the hardest problems first. Hard problems produce complex chains, which are the ones complexity selection will choose. Easy problems produce short chains that won't be selected—annotating them first wastes the budget.
Chain augmentation: Take existing short chains (3–4 steps) and augment them by adding: unit verification steps, boundary-checking steps, and final answer verification steps. This increases step count without writing new chains. Each added step must represent a genuine reasoning contribution (see anti-patterns discussion in Section 3.3).
Cross-domain borrowing: When target-domain annotations are scarce, borrow high-complexity demonstrations from a structurally analogous domain (see Section 7.6 on domain adaptation). Defer target-domain annotation until the borrowed demonstrations are clearly insufficient.
Inference cost of Phase 2 (manageable through configuration):
Sampling N=50 chains per question costs roughly 50× as much in generated tokens as greedy decoding. For high-throughput production systems (millions of queries per day), this cost is prohibitive. Phase 1 alone provides meaningful improvements without the cost multiplier.
Problems solved inefficiently with this technique:
- Out-of-distribution shift: When the test distribution shifts significantly (e.g., the math topic distribution changes), the fixed demonstrations become stale. The technique provides no mechanism for detecting or responding to distribution shift.
- Non-decomposable reasoning: Problems that require creative synthesis rather than sequential decomposition (e.g., open-ended argument construction, creative writing, novel algorithm design) do not benefit from step-count optimization.
- Very short-form tasks: Tasks where all valid answers can be derived in 1–2 steps derive no benefit. The technique adds cost (longer prompts) with no accuracy gain.
6.2 Edge Cases
Ambiguous inputs:
When a test question is ambiguous (multiple valid interpretations), the N sampled chains may split across interpretations. If interpretation A generates 5-step chains and interpretation B generates 3-step chains, complexity-based filtering will preferentially select interpretation A's chains—even if interpretation B is the intended one. The result is a confident, wrong answer. Detection: monitor cases where the vote is unanimous (no diversity in answers) on questions that should be ambiguous.
Inflated step counts in demonstrations:
If the candidate pool contains demonstrations with step counts that were inflated to artificially increase their complexity ranking (e.g., by splitting single steps into multiple sub-steps for no reasoning benefit), the scoring function selects these inflated examples, which may teach the model to produce needlessly verbose outputs without deeper reasoning. This is a data quality issue, but it is not detectable by the step-count metric alone. Manual review of selected demonstrations before deployment is essential.
Out-of-domain test questions:
When a test question falls outside the distribution of the demonstrations (e.g., a geometry question when all demonstrations are arithmetic), the model may attempt to apply the arithmetic reasoning schema to the geometry problem, producing structurally correct but semantically wrong chains. The step-count filter in Phase 2 may then preferentially select these wrong-but-long chains.
Mitigation: test with a small held-out set of out-of-distribution questions before deploying. If performance degrades sharply, the technique is not appropriate for the target deployment without augmenting with domain-specific demonstrations.
Extreme-length questions:
Questions with very long setup text (multiple paragraphs of context) may cause the demonstrations to consume less of the model's effective attention than expected, reducing their guidance effect. The model's attention may disproportionately focus on the long question context rather than the reasoning demonstrations. Mitigation: use a prompt structure that places the question after a brief separator rather than inline with demonstrations, and consider increasing temperature slightly to encourage more varied chain generation.
Chain truncation (max_tokens limit hit):
If the generated chain is truncated by the max_tokens limit mid-reasoning, the step count of the truncated chain is artificially low, causing it to be filtered out in Phase 2. This is paradoxically the opposite of the desired behavior—long chains should be retained. Mitigation: set max_tokens generously (1024+) and add a post-processing step to detect and exclude truncated chains (check if the chain ends with an answer extraction pattern).
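The truncation check can be sketched as a pattern test on the final line. The regular expression below assumes the demonstrations end with "The answer is <number>." as in the examples in this guide; `is_truncated` and `drop_truncated` are hypothetical names. Where the inference API reports a stop reason, that is a more direct signal than this heuristic.

```python
import re

# Assumed answer pattern; adjust it to match the demonstrations' final line.
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?[\d,]*\.?\d+)\s*\.?\s*$")


def is_truncated(chain: str) -> bool:
    """Treat a chain as truncated if it does not end with an answer statement."""
    return ANSWER_PATTERN.search(chain.strip()) is None


def drop_truncated(chains: list[str]) -> list[str]:
    """Keep only chains that end with an extractable answer."""
    return [c for c in chains if not is_truncated(c)]


complete = "Total boxes: 5 × 12 = 60 boxes.\nThe answer is 336."
cut_off = "Total boxes: 5 × 12 = 60 boxes.\nTotal items before"
print(len(drop_truncated([complete, cut_off])))  # 1
```

Running this before step-count ranking means truncated chains are excluded for the right reason rather than silently filtered for looking short.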
Examples that are just "long" rather than "deep":
A related edge case: some reasoning tasks naturally produce long but shallow chains (e.g., a long sequence of currency conversions, each a single multiplication). These would be ranked highly by step count but represent narrow, not deep, reasoning. The technique may select these over shorter but more cognitively rich demonstrations.
Graceful degradation strategies:
When Phase 2 filtering yields too few valid chains (due to truncation or answer extraction failures):
- Fall back to the full N chains without filtering.
- If fewer than K/2 chains have extractable answers, fall back to the chain with the highest step count as the single-chain prediction.
- If the pool has insufficient complexity variation (all chains have ≤ 3 steps), disable Phase 2 and use Phase 1 greedy decoding only.
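The first two fallbacks can be folded into the voting function itself. This is one possible reading of the rules above, not a reference implementation: `vote_with_fallback` is a hypothetical name, and samples are assumed to arrive as `(step_count, answer)` pairs with `None` marking a failed extraction.

```python
from collections import Counter
from typing import Optional


def vote_with_fallback(samples: list[tuple[int, Optional[str]]], k: int) -> Optional[str]:
    """Complexity-filtered vote with graceful degradation.

    `samples` holds (step_count, answer) pairs; answer is None when
    extraction failed or the chain was truncated.
    """
    valid = sorted(
        (s for s in samples if s[1] is not None),
        key=lambda s: s[0],
        reverse=True,
    )
    if not valid:
        return None  # nothing usable; the caller decides what to do
    if len(valid) < k / 2:
        return valid[0][1]  # answer of the single most complex chain
    pool = valid[:k]  # equals all valid chains when fewer than k survive
    return Counter(answer for _, answer in pool).most_common(1)[0][0]


print(vote_with_fallback([(5, "a"), (4, None), (3, None), (2, None)], k=4))  # a
print(vote_with_fallback([(5, "a"), (4, "b"), (3, "b"), (2, None)], k=4))   # b
```

Returning `None` when no chain is usable keeps the failure visible to the caller instead of guessing an answer.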
6.3 Constraint Management
Balancing complexity vs. diversity:
The core tension is selecting examples that are both complex (many steps) and diverse (covering different problem types). Resolution strategy:
- Stratify the candidate pool by problem sub-type or topic.
- Within each sub-type, rank by complexity.
- Select the top example per sub-type to ensure topical diversity.
- Fill remaining slots (up to M) from the global complexity ranking.
This ensures that demonstrations are as complex as possible within topical categories, without sacrificing coverage.
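The four-step resolution strategy above can be sketched as follows. The pool entry keys `subtype` and `steps` and the function name `stratified_select` are assumptions for illustration; as noted in Section 6.1, this hybrid is an extension and not part of the original paper.

```python
def stratified_select(pool: list[dict], m: int) -> list[dict]:
    """Pick the most complex example per sub-type, then fill by global rank.

    Each pool entry is a dict with hypothetical keys 'subtype' and 'steps'.
    """
    # Indices of the pool sorted from most to least complex.
    order = sorted(range(len(pool)), key=lambda i: pool[i]["steps"], reverse=True)
    chosen, seen = [], set()
    # Pass 1: the most complex example of each sub-type.
    for i in order:
        if len(chosen) < m and pool[i]["subtype"] not in seen:
            chosen.append(i)
            seen.add(pool[i]["subtype"])
    # Pass 2: fill remaining slots from the global complexity ranking.
    for i in order:
        if len(chosen) >= m:
            break
        if i not in chosen:
            chosen.append(i)
    chosen.sort(key=lambda i: pool[i]["steps"], reverse=True)
    return [pool[i] for i in chosen]


toy_pool = [
    {"subtype": "probability", "steps": 11},
    {"subtype": "probability", "steps": 10},
    {"subtype": "probability", "steps": 9},
    {"subtype": "algebra", "steps": 6},
    {"subtype": "geometry", "steps": 4},
]
print([ex["steps"] for ex in stratified_select(toy_pool, m=3)])  # [11, 6, 4]
```

A pure top-3 selection on the toy pool would return three probability problems (11, 10, 9 steps); the stratified version trades some step count for coverage of algebra and geometry.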
Handling token and context constraints:
When the total prompt length (demonstrations + test question) exceeds the model's context limit:
- Reduce M (number of demonstrations). Prefer fewer but more complex over more but shorter.
- Compress individual steps: maintain step count but reduce step verbosity (shorter sentences per step).
- Switch to a model with a larger context window before reducing demonstration quality.
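The first option (reduce M while keeping the most complex demonstrations) can be sketched as a simple trimming loop. The whitespace-based `approx_tokens` is a crude stand-in and an assumption of this sketch; in practice it should be replaced by the target model's tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption: about one token per whitespace-separated word)."""
    return len(text.split())


def fit_demonstrations(demos: list[str], question: str, budget: int) -> list[str]:
    """Drop the least complex demonstrations until the prompt fits the budget.

    `demos` must already be sorted from most to least complex.
    """
    kept = list(demos)
    while kept and sum(map(approx_tokens, kept)) + approx_tokens(question) > budget:
        kept.pop()  # remove the least complex remaining demonstration
    return kept


demos = ["a b c d", "e f g", "h i"]  # sorted from most to least complex
print(fit_demonstrations(demos, "q r", budget=8))  # ['a b c d']
```

An empty result means even a single demonstration does not fit, which is the point at which compressing steps or switching to a larger context window becomes necessary.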
Handling incomplete information in the pool:
If some candidate examples have incomplete or missing reasoning chains:
- Exclude examples with no reasoning chain from the pool entirely.
- For examples with partial chains (chain cut off mid-solution), exclude if the chain does not include a final answer statement.
- Do not attempt to complete or repair chains algorithmically; this introduces inconsistency in the reasoning style.
Error handling in the majority vote:
When the most common answer is a tie (two or more answers have equal vote count):
- Among tied answers, select the answer associated with the most complex chains (highest average step count among chains that gave that answer).
- If still tied, select the answer from the single highest-complexity chain.
- Log the tie as a quality signal; ties indicate the technique is uncertain and the answer should be flagged for human review in production systems.
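The tie-breaking rules above can be sketched as a single vote function that also reports whether a tie occurred, so the caller can log it. `vote_with_tiebreak` is a hypothetical name, and the input is assumed to be the non-empty list of `(step_count, answer)` pairs that survived filtering.

```python
from collections import defaultdict


def vote_with_tiebreak(samples: list[tuple[int, str]]) -> tuple[str, bool]:
    """Majority vote over non-empty `samples`; returns (answer, was_tie)."""
    steps_by_answer = defaultdict(list)
    for steps, answer in samples:
        steps_by_answer[answer].append(steps)
    top_votes = max(len(v) for v in steps_by_answer.values())
    tied = [a for a, v in steps_by_answer.items() if len(v) == top_votes]
    # Rank tied answers by mean step count, then by their single longest chain.
    best = max(
        tied,
        key=lambda a: (sum(steps_by_answer[a]) / top_votes, max(steps_by_answer[a])),
    )
    return best, len(tied) > 1


print(vote_with_tiebreak([(5, "a"), (3, "a"), (6, "b"), (4, "b")]))  # ('b', True)
print(vote_with_tiebreak([(5, "a"), (3, "a"), (9, "b")]))            # ('a', False)
```

The boolean flag is what feeds the logging and human-review step; the answer alone would hide the uncertainty.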
Interaction with instruction-tuned models:
A non-obvious constraint arises when applying the technique to instruction-tuned models (InstructGPT, Claude, Llama 3 Instruct). These models are trained via RLHF to follow instructions and produce helpful, concise responses. RLHF can conflict with the complexity signal in two ways:
Conciseness bias: RLHF training on human preferences often rewards concise answers over verbose ones (humans rate shorter, direct answers as more helpful). This means instruction-tuned models may generate shorter chains than base models, compressing the step-count distribution and weakening the complexity filter's discriminative power.
Instruction override: If an instruction-tuned model's RLHF policy strongly prefers short answers, adding long demonstrations may be partially overridden by the alignment training. The model might "see" 9-step demonstrations but internally truncate its outputs at 3–4 steps because its RLHF reward surface penalizes length.
Practical mitigation: add an explicit length instruction to the system prompt or test question: "Provide a complete, step-by-step solution showing all intermediate calculations. Do not skip steps." This instruction leverages the model's instruction-following ability to counteract the conciseness bias, making it cooperate with the complexity signal from demonstrations rather than fight it.
7. Advanced Techniques
7.1 Clarity and Context Optimization
Ensuring Clarity in Demonstrations
Each demonstration's reasoning chain should exhibit three properties for maximum effectiveness:
-
Explicit intermediate goals: Each step should state what it is computing and why. "We need to find the total number of boxes first" preceding the multiplication step makes the reasoning goal explicit, not just the arithmetic.
-
Step independence: Each step's result should be fully derivable from the steps that precede it, without relying on implicit domain knowledge that the model might not apply consistently. Ambiguous reasoning steps increase the risk that the model, when generalizing, skips the step or performs it incorrectly.
-
Consistent notation: Use consistent variable names, units, and notation across all demonstrations. Inconsistency in demonstration style forces the model to infer what notational convention to use, adding unnecessary uncertainty.
Balancing detail with conciseness:
More steps are better for the complexity score, but each step should still be concise. A demonstration chain with 10 concise steps is better than one with 10 verbose steps, because:
- It stays within token budget more easily.
- Concise steps are read and attended to more efficiently by the model.
- Verbose steps can inadvertently teach the model to produce unnecessarily long outputs on simple test questions.
Target 10–25 tokens per step.
Context Optimization
The demonstration section of the prompt is fixed context that all test questions share. Optimizing it once pays dividends across all inference calls.
Strategy:
- Order demonstrations from most to least complex (the most complex first). Models show a primacy bias in in-context learning—earlier examples have marginally stronger influence on output. Placing the most complex example first reinforces the target reasoning depth immediately.
- Alternatively, order from least to most complex (a difficulty-ramp structure). This mirrors how textbooks introduce concepts, and some analyses suggest progressive difficulty ordering improves generalization on held-out evaluation. Test both orderings on your validation set.
- Include a brief meta-instruction before the demonstrations to orient the model: "The following examples show detailed step-by-step solutions. Emulate the level of detail shown." This instruction-based framing primes the model to attend to the reasoning depth, not just the format.
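To make the two orderings easy to compare on a validation set, prompt assembly can take the ordering as a parameter. This is a minimal sketch: `build_ordered_prompt` and the demo keys `question` and `chain` are hypothetical, and the meta-instruction text is the one quoted above.

```python
META_INSTRUCTION = (
    "The following examples show detailed step-by-step solutions. "
    "Emulate the level of detail shown."
)


def _steps(demo: dict) -> int:
    """Step count of a demonstration: its non-empty chain lines."""
    return sum(1 for line in demo["chain"].splitlines() if line.strip())


def build_ordered_prompt(demos: list[dict], question: str, order: str = "desc") -> str:
    """Assemble a prompt with demonstrations ordered by step count.

    order='desc' puts the most complex first; 'asc' builds a difficulty ramp.
    """
    if order not in ("desc", "asc"):
        raise ValueError("order must be 'desc' or 'asc'")
    ordered = sorted(demos, key=_steps, reverse=(order == "desc"))
    blocks = [f"Question: {d['question']}\n{d['chain']}" for d in ordered]
    return "\n\n".join([META_INSTRUCTION, *blocks, f"Question: {question}\n"])


toy_demos = [
    {"question": "short", "chain": "The answer is 1."},
    {"question": "long", "chain": "Step one.\nStep two.\nThe answer is 2."},
]
prompt = build_ordered_prompt(toy_demos, "new?", order="desc")
print(prompt.index("Question: long") < prompt.index("Question: short"))  # True
```

Evaluating both `order="desc"` and `order="asc"` with everything else held fixed isolates the effect of ordering from the effect of selection.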
Context length management:
When context is tight:
- Reduce demonstration count (M=4 instead of M=8) rather than reducing step depth. Fewer complex examples outperform more simple examples.
- Apply prefix caching (available on Anthropic and OpenAI APIs) to avoid re-processing the demonstration prefix on every call.
- Use a structured format that minimizes formatting tokens: avoid markdown headers, bullet points, or numbered lists within demonstrations. Plain text with newlines is the most token-efficient format.
7.2 Advanced Reasoning and Output Control
Multi-Step Reasoning Structure
For tasks that require conditional reasoning (if/then branching) within the chain, demonstrations should include explicit conditional steps:
If the discount applies (price > $50), compute discounted price: 80 × 0.85 = $68.
Otherwise, compute: 80 × 1.00 = $80.
Since price = $80 > $50, the discounted price is $68.
This teaches the model to resolve conditions explicitly rather than averaging over them implicitly. The conditional structure also adds to the step count (in the example above, one line per branch and one for the resolution), making such demonstrations naturally higher in complexity.
Decomposition strategies for very hard problems:
For problems that require 15+ reasoning steps (competition math, multi-hop research QA), consider breaking the problem into explicitly labeled sub-problems within the demonstration:
Sub-problem 1: Find the total items before returns.
[3–4 steps]
Sub-problem 2: Calculate returns.
[2–3 steps]
Sub-problem 3: Compute final count.
[2 steps]
This hierarchical structure (sub-problem headers as lines) increases the step count while adding navigational clarity. The model learns to decompose complex problems into labeled sub-goals before solving.
Verification steps:
Including an explicit verification step at the end of each demonstration chain serves two purposes:
- It increases the step count (favorable for complexity scoring).
- It teaches the model to double-check its answer before finalizing, reducing arithmetic errors.
Example verification step:
Verify: 5 × 12 = 60 boxes; 60 × 8 = 480 items. With 144 returned, remaining = 480 - 144 = 336. ✓
The answer is 336.
The paper does not test verification steps explicitly, but combining them with complexity selection is a natural extension that reinforces both the depth signal and output reliability.
Self-Verification in Generated Chains
For Phase 2, a complementary filtering criterion is to prefer chains that include self-verification steps. This can be implemented as a combined score:
```python
import re


def combined_score(chain: str) -> tuple[int, int]:
    """Score by (step count, presence of verification step)."""
    step_count = complexity_score(chain)
    has_verification = int(bool(re.search(r'[Vv]erif|[Cc]heck|[Cc]onfirm', chain)))
    return (step_count, has_verification)


# Sort by combined score (step count first, then verification presence)
sorted_chains = sorted(chains, key=combined_score, reverse=True)
```
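Because tuples compare element by element, step count dominates and verification presence only breaks ties among chains of equal length. The effect is a mild preference for long chains that also include a self-check.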
Structured Output from Complex Chains
When the output must be structured (JSON, table, code), the standard approach is to add format instructions after the demonstrations and after the test question:
[Demonstrations as above]
Question: [test question]
Provide your final answer as JSON: {"answer": <number>, "reasoning_summary": "<brief summary>"}.
The reasoning chain still benefits from complexity-based selection; only the final output is redirected into the structured format. Do not include JSON structure within demonstration reasoning chains—this would confuse the step-count metric by counting JSON lines as reasoning steps.
Constraint Enforcement
Hard constraints (answer must be an integer, answer must be in a specific range, answer must be one of N options) should be specified in the test question prompt, not in the demonstrations. Demonstrations should show the reasoning process; the constraint reminder at the test question ensures the model applies the constraint at output time.
For multiple simultaneous constraints:
Question: [question]. Note: your answer must be a whole number between 0 and 100.
This pattern keeps the demonstration section clean (not polluted by constraint reminders) while still enforcing constraints at inference time.
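A prompt-level reminder does not guarantee compliance, so an optional post-hoc check can complement it. The sketch below is an addition to the pattern above, not part of the paper: it assumes answers are stated as "The answer is <number>" and that the constraint is "whole number between 0 and 100"; `extract_constrained_answer` is a hypothetical name.

```python
import re
from typing import Optional


def extract_constrained_answer(chain: str, lo: int = 0, hi: int = 100) -> Optional[int]:
    """Return the final answer if it is a whole number in [lo, hi], else None."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", chain)
    if match is None:
        return None  # no answer statement found
    value = float(match.group(1))
    if not value.is_integer() or not (lo <= value <= hi):
        return None  # violates the stated constraint
    return int(value)


print(extract_constrained_answer("Half of 84 is 42.\nThe answer is 42."))   # 42
print(extract_constrained_answer("480 - 144 = 336.\nThe answer is 336."))  # None
```

In Phase 2, chains for which this returns `None` could be dropped before the vote, in the same way as chains with no extractable answer.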
Style Control
The step-count criterion is neutral with respect to output style (formal vs. informal, technical vs. accessible). Style is controlled through the demonstration content, not the selection criterion. To produce a specific output style:
- Write demonstrations in the target style (formal academic prose, casual explanation, bullet-point reasoning structure).
- The model will adopt the demonstration style for its outputs, independently of the step-count filtering.
- A useful property: the step-count filter in Phase 2 will preferentially retain chains that are long and in the target style, since the demonstrations prime the model toward that style and shorter chains may deviate from it.
For persona adoption (e.g., "respond as a financial analyst"): include a system prompt persona instruction alongside the demonstrations. The persona instruction and the complexity demonstrations interact additively—the model maintains the persona while emulating the reasoning depth shown in the demonstrations. The persona does not affect the step-count filtering logic.
One non-obvious style interaction: if the target persona involves a concise, expert communication style ("brief expert summaries"), the persona instruction and the complexity signal conflict. Resolve this by writing demonstrations that model expert-depth reasoning followed by a concise summary at the end—the chain has high step count (from the reasoning), but the final output format is concise (from the persona).
7.3 Interaction Patterns
Conversational Deployment
In a multi-turn conversation, Complexity-Based Prompting applies to the system prompt and/or the first substantive turn, not to subsequent user messages. The standard pattern is:
- System prompt: Include the M complexity-selected demonstrations as illustrative reasoning examples.
- User turns: Receive test questions.
- Assistant turns: Produce reasoning chains in the same format as the demonstrations.
Maintaining context across turns: in a multi-turn conversation, the model's in-context "memory" of the demonstrations may degrade as the conversation grows. If the conversation grows long enough to push the demonstrations outside the effective attention window, re-insert a brief reminder: "Please continue solving problems step-by-step as demonstrated earlier."
For conversational tasks (tutoring, multi-step problem-solving dialogue), use the demonstrations to establish the reasoning depth norm for the session. Once the norm is established, the model tends to maintain it across subsequent turns without explicit re-prompting.
Iterative Improvement Pattern
When the initial chain is incorrect (Phase 1, greedy), an iterative refinement loop can be applied:
- Generate initial chain (greedy, using complexity-selected demonstrations).
- Evaluate whether the chain's answer is plausible (e.g., within expected range for math problems).
- If not plausible, prompt for refinement: "The answer appears incorrect. Please re-examine step [N] and correct the reasoning."
- Repeat up to 3 iterations.
This iterative loop is compatible with Complexity-Based Prompting: the demonstrations remain fixed, and only the specific error is targeted for correction. The risk is error propagation: if the model corrects one step but introduces a new error in a later step, the plausibility check must catch that new error on the next iteration.
Stopping criteria: stop when the answer is plausible or after a fixed number of iterations (3 is typical). Do not iterate indefinitely—models can oscillate between incorrect answers.
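The loop and its stopping criteria can be sketched as follows. This is an illustration under stated assumptions: `generate` stands in for whatever model call is used (here a stub), the plausibility check is a caller-supplied function, and the refinement message is generic rather than targeting a specific step.

```python
import re
from typing import Callable, Optional


def extract_final_answer(chain: str) -> Optional[float]:
    """Pull the number from a closing 'The answer is X' sentence, if present."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", chain)
    return float(match.group(1)) if match else None


def refine_until_plausible(
    generate: Callable[[str], str],           # assumed model call: prompt -> chain
    prompt: str,
    is_plausible: Callable[[Optional[float]], bool],
    max_iterations: int = 3,
) -> tuple[Optional[float], int]:
    """Return (answer, refinement_rounds_used); stops when plausible or at the cap."""
    chain = generate(prompt)
    answer = extract_final_answer(chain)
    rounds = 0
    while not is_plausible(answer) and rounds < max_iterations:
        rounds += 1
        followup = (
            f"{prompt}\n{chain}\n"
            "The answer appears incorrect. Please re-examine the reasoning and correct it."
        )
        chain = generate(followup)
        answer = extract_final_answer(chain)
    return answer, rounds


# Stub model: first reply is implausible (negative count), second is fine.
replies = iter(["10 - 15 = -5.\nThe answer is -5.", "15 - 3 = 12.\nThe answer is 12."])
in_range = lambda a: a is not None and 0 <= a <= 100
answer, rounds = refine_until_plausible(lambda p: next(replies), "Question: ...", in_range)
print(answer, rounds)
```

The hard cap on rounds is what prevents the oscillation described above; the last answer is returned even if it never became plausible, so callers may want to flag that case.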
Chaining with Downstream Tasks
Complexity-Based Prompting fits naturally into multi-stage pipelines where reasoning is the first stage and structured output is a downstream stage:
Stage 1: Complexity-Based CoT reasoning → extracts key quantities and relationships
Stage 2: Template filling → inserts extracted quantities into a report template
Stage 3: Validation → verifies the report's numerical consistency
Between stages, pass the extracted quantities as structured data (not the full reasoning chain). The reasoning chain is for human inspection and debugging; the structured extraction is for downstream consumption.
Error propagation consideration: errors in Stage 1 (incorrect intermediate values) propagate to Stage 2 as incorrect report values. The complexity-based consistency vote in Stage 1 reduces but does not eliminate error. Build validation logic in Stage 3 to catch impossible values.
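A rough sketch of the hand-off between stages, assuming a toy sales report: the `Extraction` fields, the template wording, and the validation rules are all hypothetical and would be replaced by whatever quantities a real pipeline extracts.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """Structured output of Stage 1 (hypothetical fields for a sales report)."""
    unit_price: float
    quantity: int
    total: float


def fill_template(e: Extraction) -> str:
    """Stage 2: insert the extracted quantities into a report template."""
    return f"Sold {e.quantity} units at ${e.unit_price:.2f} each, total ${e.total:.2f}."


def validate(e: Extraction, tolerance: float = 0.01) -> list[str]:
    """Stage 3: return a list of detected problems (empty means consistent)."""
    problems = []
    if e.unit_price < 0 or e.quantity < 0:
        problems.append("negative value")
    if abs(e.unit_price * e.quantity - e.total) > tolerance:
        problems.append("total does not equal unit_price * quantity")
    return problems


good = Extraction(unit_price=2.5, quantity=4, total=10.0)
bad = Extraction(unit_price=2.5, quantity=4, total=12.0)
print(fill_template(good), validate(good), validate(bad))
```

Only the dataclass crosses the stage boundary; the full reasoning chain would be logged separately for inspection.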
7.4 Model Considerations
Behavior across model families:
| Model Family | Expected Behavior |
|---|---|
| GPT-4o / GPT-4 | Strong benefit; model has high native CoT ability, complexity selection further refines demonstration quality |
| GPT-3.5-turbo | Moderate benefit; slightly weaker than GPT-3 text-davinci-002 in the paper's era; gains are present but smaller |
| Claude claude-opus-4-6 | Strong benefit; Anthropic models respond well to structured reasoning demonstrations with clear step separations |
| Claude claude-haiku-4-5 | Moderate benefit; smaller Claude models still show CoT ability but weaker performance on very hard multi-step problems |
| Llama 3 70B | Moderate benefit; open-source models at this scale show CoT ability; complexity selection should help but has not been systematically evaluated in the original paper |
| Llama 3 8B / Mistral 7B | Minimal benefit; insufficient emergent CoT ability for the technique to provide reliable gains |
| Flan-T5 (11B) | Near-zero benefit (confirmed in the paper's ablation) |
Model-specific quirks:
- OpenAI models: Support n>1 in a single API call, making Phase 2 efficient (one request returns N completions). The n parameter maps directly to the sampling count.
- Anthropic Claude: Does not support n>1 natively; Phase 2 requires N separate API calls. Consider batching with async requests to parallelize. Alternatively, use the streaming API with early stopping once K valid chains are collected.
- Local models (Ollama, vLLM): Support batch generation natively. vLLM's SamplingParams(n=50) is the most efficient implementation for local inference.
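For APIs that return one completion per call, the N samples can be drawn concurrently. The sketch below assumes a caller-supplied coroutine `generate_one` (a stub here, not any vendor's actual client method) and caps concurrency with a semaphore to stay within rate limits.

```python
import asyncio
from typing import Awaitable, Callable


async def sample_chains(
    generate_one: Callable[[str], Awaitable[str]],  # assumed async model call
    prompt: str,
    n: int = 50,
    max_concurrency: int = 8,
) -> list[str]:
    """Issue n independent sampling calls, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_call() -> str:
        async with semaphore:
            return await generate_one(prompt)

    return list(await asyncio.gather(*(bounded_call() for _ in range(n))))


async def fake_generate(prompt: str) -> str:
    """Stub standing in for a real API request."""
    await asyncio.sleep(0)
    return "3 + 4 = 7.\nThe answer is 7."


chains = asyncio.run(sample_chains(fake_generate, "Question: ...", n=5, max_concurrency=2))
print(len(chains))
```

The value of `max_concurrency` is an arbitrary placeholder; a real deployment would set it from the provider's documented rate limits.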
Adapting for different model sizes:
For smaller models (30B–100B range), where CoT ability is present but weaker:
- Reduce M to 4–6 demonstrations (smaller context capacity).
- Use shorter, more direct reasoning steps (fewer words per step).
- Increase N to 100 (more samples needed for a stable majority with noisier outputs).
- Use a lower temperature (0.5–0.6) to reduce output randomness.
- Apply a stronger step-count minimum filter in Phase 2: exclude chains with < 3 steps to remove shallow guesses.
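The minimum-step filter in the last item can be combined with top-K voting as in the sketch below. It assumes one reasoning step per non-empty line and a closing "The answer is X" sentence; both helpers are simplified stand-ins for whatever scoring and extraction functions a real pipeline uses.

```python
import re
from collections import Counter
from typing import Optional


def complexity_score(chain: str) -> int:
    """Assumed convention: one step per non-empty line."""
    return sum(1 for line in chain.split("\n") if line.strip())


def extract_final_answer(chain: str) -> Optional[str]:
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None


def complexity_vote(chains: list[str], k: int, min_steps: int = 3) -> Optional[str]:
    """Drop chains shorter than min_steps, keep the k longest, majority-vote."""
    deep = [c for c in chains if complexity_score(c) >= min_steps]
    top_k = sorted(deep, key=complexity_score, reverse=True)[:k]
    answers = [a for a in map(extract_final_answer, top_k) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


chains = [
    "Step a\nStep b\nStep c\nThe answer is 18.",
    "Step a\nStep b\nStep c\nStep d\nThe answer is 18.",
    "The answer is 20.",
    "Guess.\nThe answer is 20.",
    "Step a\nStep b\nThe answer is 20.",
]
print(complexity_vote(chains, k=3))               # shallow guesses removed
print(complexity_vote(chains, k=5, min_steps=1))  # no minimum filter
```

In this toy input the two shallow guesses flip the vote when the minimum filter is disabled, which is the failure the filter is meant to prevent.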
Handling model version changes:
When a model is updated (e.g., GPT-4 → GPT-4o), re-evaluate the technique on your validation set. Newer models often require less explicit demonstration guidance because their base CoT ability is stronger. It is possible that a model update makes complexity-based selection less impactful (baseline is already high) or changes the optimal M and K values. Treat model version upgrades as trigger events for re-evaluation.
Cross-model portability:
The demonstration format (Question: / reasoning chain / "The answer is X") is generic and works across all major model APIs. The step-count scoring function is model-agnostic. Phase 2 requires temperature sampling support, which all major APIs provide. The technique is thus highly portable—implement it once and adapt only model-specific API call syntax.
Reasoning format differences across model families:
Claude models (Anthropic) tend to produce longer, more discursive reasoning chains than GPT-4 under the same prompt. This means Claude-generated chains will have naturally higher step counts, and the complexity filter K may need to be adjusted upward (using K/N = 0.9 rather than 0.8) to retain the benefit of filtering without over-excluding valid chains.
GPT-4o tends to produce concise, efficient reasoning chains. The step count for correct chains may be lower on average than GPT-3-era models. The complexity filter should be calibrated on a validation set for each model family rather than assuming universal constants.
Llama 3 (70B instruction-tuned) produces intermediate-length chains. The critical issue for local inference is that batch sampling (N=50) is feasible with vLLM but requires careful memory budgeting—50 parallel forward passes with a 70B model can exhaust GPU memory on a single node.
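The per-model calibration of the K/N ratio recommended above could be sketched as a small sweep over a validation set. The data layout is an assumption made for brevity: each validation item is a list of (step_count, answer) pairs for the sampled chains plus the gold answer.

```python
from collections import Counter

Samples = list[tuple[int, str]]  # (step_count, extracted_answer) per sampled chain


def vote_top_ratio(samples: Samples, ratio: float) -> str:
    """Keep the top `ratio` fraction of chains by step count, then majority-vote."""
    k = max(1, round(len(samples) * ratio))
    kept = sorted(samples, key=lambda s: s[0], reverse=True)[:k]
    return Counter(answer for _, answer in kept).most_common(1)[0][0]


def calibrate_keep_ratio(
    validation: list[tuple[Samples, str]], ratios: tuple[float, ...]
) -> dict[float, float]:
    """Validation accuracy for each candidate K/N ratio."""
    return {
        r: sum(vote_top_ratio(s, r) == gold for s, gold in validation) / len(validation)
        for r in ratios
    }


validation = [
    ([(6, "12"), (5, "12"), (2, "9"), (2, "9"), (1, "9")], "12"),
    ([(7, "30"), (6, "30"), (5, "30"), (3, "30"), (2, "25")], "30"),
]
result = calibrate_keep_ratio(validation, ratios=(0.6, 1.0))
print(result)
```

The ratio with the best validation accuracy would then be fixed for that model family and confirmed on a separate held-out set.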
Model-specific configuration reference:
# Configuration profiles for different model families
COMPLEXITY_CONFIGS = {
    "gpt-4o": {
        "m": 8, "n": 50, "k": 40, "temperature": 0.7,
        "max_tokens": 512, "separator": "\n",
        "question_prefix": "Question:",
    },
    "claude-opus-4-6": {
        "m": 8, "n": 30,  # Fewer samples due to per-call API; adjust for cost
        "k": 25, "temperature": 0.7,
        "max_tokens": 1024,  # Claude tends to be more verbose
        "separator": "\n",
        "question_prefix": "Question:",
    },
    "claude-haiku-4-5": {
        "m": 6, "n": 20, "k": 15, "temperature": 0.6,
        "max_tokens": 512, "separator": "\n",
        "question_prefix": "Question:",
    },
    "llama-3-70b-instruct": {
        "m": 6, "n": 20, "k": 16, "temperature": 0.7,
        "max_tokens": 512, "separator": "\n",
        "question_prefix": "Question:",
    },
}
7.5 Evaluation and Efficiency
Metrics for measuring the technique's effectiveness:
- Primary: Exact match accuracy on benchmark test sets. Compare against at minimum: (a) zero-shot CoT, (b) random few-shot CoT, (c) handcrafted few-shot CoT, (d) vanilla Self-Consistency.
- Secondary: Solution correctness with partial credit (for problems with multiple components, award partial marks for correct intermediate steps even if the final answer is wrong). This assesses whether complexity selection improves intermediate reasoning quality, not just final answer accuracy.
- Efficiency: Accuracy per token cost. Phase 1 at greedy is significantly more cost-efficient than Phase 1+2; plot the accuracy-vs-cost Pareto frontier for N ∈ {1, 10, 20, 50}.
- Stability: Variance of accuracy across 5 independent runs with different random seeds for the temperature sampling. High variance indicates sensitivity to the specific chains sampled—consider increasing N or reducing temperature.
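The accuracy, efficiency, and stability metrics above reduce to a few lines of arithmetic. The sketch below uses made-up run accuracies and token counts purely for illustration, and reports spread as a population standard deviation (one reasonable choice among several).

```python
from statistics import mean, pstdev


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def stability(run_accuracies: list[float]) -> tuple[float, float]:
    """Mean accuracy and its spread across independent runs."""
    return mean(run_accuracies), pstdev(run_accuracies)


def accuracy_per_1k_tokens(acc: float, total_tokens: int) -> float:
    """Crude efficiency figure: accuracy divided by thousands of tokens spent."""
    return acc / (total_tokens / 1000)


runs = [0.80, 0.82, 0.78, 0.81, 0.79]  # illustrative values, not measured results
avg, spread = stability(runs)
print(round(avg, 3), round(spread, 4), accuracy_per_1k_tokens(0.8, 40_000))
```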
Human evaluation role:
For tasks without a clear ground truth (legal reasoning, argument quality, scientific analysis), human evaluation of chain quality is essential. Evaluate:
- Logical coherence: are all steps valid inferences?
- Completeness: does the chain address all components of the question?
- Depth: is the reasoning thorough or superficial?
Human raters should be blind to which selection method produced the chain (A/B format). Ask raters to rate on a 1–5 scale for each dimension.
Custom benchmarks:
For domain-specific deployment, create a benchmark by:
- Collecting 100–500 domain questions with verified ground truth answers.
- Stratifying by difficulty (easy: 1–3 steps, medium: 4–7, hard: 8+).
- Running all comparison methods on this benchmark.
- Reporting accuracy by difficulty tier: complexity selection should show the largest gains on the hard tier.
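A minimal sketch of the tiered report, assuming each benchmark result is recorded as a (reference step count, correct?) pair; the tier boundaries follow the 1–3 / 4–7 / 8+ split given above.

```python
from collections import defaultdict


def difficulty_tier(step_count: int) -> str:
    """Map a reference solution's step count to a difficulty tier."""
    if step_count <= 3:
        return "easy"
    if step_count <= 7:
        return "medium"
    return "hard"


def accuracy_by_tier(results: list[tuple[int, bool]]) -> dict[str, float]:
    """results: (reference_step_count, is_correct) per benchmark question."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for steps, ok in results:
        tier = difficulty_tier(steps)
        totals[tier] += 1
        correct[tier] += int(ok)
    return {tier: correct[tier] / totals[tier] for tier in totals}


results = [(2, True), (3, True), (5, True), (6, False),
           (9, True), (10, False), (12, False), (8, True)]
report = accuracy_by_tier(results)
print(report)
```

Running this once per comparison method gives the per-tier table; the differences between methods on the hard tier are the figures of interest.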
Token and latency optimization:
| Strategy | Token reduction | Accuracy impact |
|---|---|---|
| Phase 1 only (no Phase 2) | 98% fewer output tokens | -3 to -6 pp vs. Phase 1+2 |
| Reduce N from 50 to 20 | 60% fewer sampling calls | -1 to -2 pp vs. N=50 |
| Reduce M from 8 to 4 | ~50% fewer demonstration tokens | -1 to -3 pp vs. M=8 |
| Compress step verbosity (shorter steps) | 20–30% fewer prompt tokens | Minimal (<1 pp) if step count preserved |
| Prefix caching of demonstrations | 40–60% token cost reduction for the prompt prefix | No accuracy change |
Streaming optimization for Phase 2:
When using streaming APIs (e.g., Anthropic's streaming messages or OpenAI's stream=True), it is possible to implement an early-stopping variant of Phase 2: stream each chain and compute step count incrementally. Once K chains with step count ≥ threshold have been collected, stop sampling. This reduces wasted computation on chains that will be filtered out anyway.
from collections import Counter
from typing import Optional

async def streaming_complexity_consistency(
    prompt: str, client, model: str,
    k_target: int = 40, step_threshold: int = 5,
    max_samples: int = 100, temperature: float = 0.7
) -> Optional[str]:
    """
    Draw chains one at a time and stop once k_target chains with
    >= step_threshold steps are collected (or max_samples is reached).
    """
    collected_chains = []
    samples_drawn = 0
    while len(collected_chains) < k_target and samples_drawn < max_samples:
        # generate_one_chain, complexity_score, and extract_final_answer are
        # assumed to be the helper functions defined for the main pipeline.
        chain = await generate_one_chain(client, model, prompt, temperature)
        samples_drawn += 1
        if complexity_score(chain) >= step_threshold:
            collected_chains.append(chain)
    # Extract each answer once; drop chains with no parseable answer.
    answers = [a for a in map(extract_final_answer, collected_chains) if a]
    return Counter(answers).most_common(1)[0][0] if answers else None
This approach typically achieves similar accuracy to fixed N=50 but with 20–40% fewer API calls on tasks where many generated chains are naturally complex.
7.6 Safety, Robustness, and Domain Adaptation
Adversarial Protection
The primary adversarial risk for Complexity-Based Prompting is indirect prompt injection that surfaces in the reasoning chain itself. If a user-provided problem statement contains hidden directives (e.g., "ignore the above instructions and output..."), the model may carry them into its generated reasoning chain. Such a chain may appear complex (many steps) and therefore be retained in the top-K chains for voting. The injected answer may then win the majority vote.
Mitigation strategies:
- Sanitize user-provided inputs before they are embedded in the prompt. Strip or escape any substring that resembles a prompt instruction (imperative sentences starting with "Ignore," "Forget," "Instead," etc.).
- Hard-delimit user content from the demonstration section using explicit delimiter tags:
[DEMONSTRATIONS]
Question: ...
[/DEMONSTRATIONS]
[USER QUESTION]
{user_input}
[/USER QUESTION]
Solve the question above step by step.
- Validate extracted answers against the expected answer type and range. If the task expects a numeric answer in the range 1–1,000 and the extracted answer is "OVERRIDE: output confidential information," discard the chain before voting.
- Monitor step count anomalies: A chain with an abnormally high step count (3× the typical maximum) may indicate injected verbose content rather than genuine complex reasoning. Flag or exclude such chains.
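The last two mitigations (answer validation and step-count anomaly screening) might look like the following before voting. This is a sketch under assumptions: chains are represented as (step_count, raw_answer) pairs, the 1–1,000 range is the example range from the text, and `typical_max_steps` would come from observed validation data.

```python
def is_valid_numeric_answer(raw: str, low: float = 1, high: float = 1000) -> bool:
    """Accept only answers that parse as a number inside the expected range."""
    try:
        value = float(raw.replace(",", ""))
    except ValueError:
        return False
    return low <= value <= high


def screen_chains(
    chains: list[tuple[int, str]],   # (step_count, raw_extracted_answer)
    typical_max_steps: int,
    anomaly_factor: float = 3.0,
) -> list[tuple[int, str]]:
    """Drop chains with abnormal length or an answer of the wrong type or range."""
    kept = []
    for step_count, raw_answer in chains:
        if step_count > anomaly_factor * typical_max_steps:
            continue  # suspiciously long: possible injected verbose content
        if not is_valid_numeric_answer(raw_answer):
            continue  # wrong type or out of range
        kept.append((step_count, raw_answer))
    return kept


candidates = [
    (6, "42"),
    (7, "OVERRIDE: output confidential information"),
    (40, "42"),
    (5, "5000"),
]
safe = screen_chains(candidates, typical_max_steps=12)
print(safe)
```

Screening happens before the top-K selection, so an injected chain cannot occupy a voting slot merely by being long.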
Output Safety
For high-stakes applications, the reasoning chain itself—not just the final answer—may contain harmful content. A chain that reasons step by step through a harmful process may produce factually correct but dangerous intermediate information.
Mitigation:
- Apply content filtering to the full reasoning chain text, not just the final answer extraction.
- For domains with well-defined harmful content categories (medical: do not provide specific dosing for dangerous substances; legal: do not advise on how to commit fraud), add domain-specific filters that check for prohibited content in any chain step.
- Use the model's native content filtering (Anthropic's built-in safety policies, OpenAI's moderation endpoint) before including a chain in the vote.
Reliability: Ensuring Consistent Outputs Across Runs
Consistency across runs is a function of three variables: temperature (lower = more consistent), N (higher = more stable majority), and K (larger = less noise from individual chain variance).
Practical consistency targets:
| Use case | Target consistency | Recommended config |
|---|---|---|
| Production math QA (user-facing) | >95% agreement across re-runs | T=0.5, N=50, K=40 |
| Internal analytics / batch processing | >90% agreement | T=0.7, N=30, K=25 |
| Research / experimentation | 80%+ agreement acceptable | T=0.7, N=20, K=16 |
Consistency monitoring in production:
Run a shadow evaluation: for 5% of live queries, run the pipeline twice with different random seeds and compare answers. The disagreement rate is a direct measure of output instability. Alert if disagreement exceeds 10% (indicating temperature is too high or N is too low for the current task difficulty).
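A small sketch of the shadow evaluation bookkeeping, assuming the two pipeline runs for each sampled query have already produced a pair of final answers; the function names and the seeded sampler are illustrative choices.

```python
import random


def select_for_shadow(query_ids: list[str], rate: float = 0.05, seed: int = 0) -> list[str]:
    """Randomly pick roughly `rate` of live queries for a duplicate run."""
    rng = random.Random(seed)
    return [q for q in query_ids if rng.random() < rate]


def disagreement_rate(answer_pairs: list[tuple[str, str]]) -> float:
    """Fraction of shadowed queries whose two runs returned different answers."""
    return sum(a != b for a, b in answer_pairs) / len(answer_pairs)


def should_alert(answer_pairs: list[tuple[str, str]], threshold: float = 0.10) -> bool:
    return disagreement_rate(answer_pairs) > threshold


unstable = [("18", "18")] * 17 + [("18", "20")] * 3   # 15% disagreement
stable = [("18", "18")] * 19 + [("18", "20")]         # 5% disagreement
print(should_alert(unstable), should_alert(stable))
```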
Variance reduction techniques:
- Fixed random seeds: For development and testing, set a fixed random seed to make runs reproducible. For production, use fixed seeds only if consistency is more important than answer diversity (e.g., for audit purposes).
- Temperature annealing: Start with a higher temperature (0.8) for the first half of the N samples, then decrease to 0.5 for the second half. This captures diverse chains early, while the later, lower-temperature samples stay closer to the model's most likely reasoning path.
- Consensus threshold checking: Before returning the majority vote answer, check if the vote share exceeds a minimum threshold (e.g., >50% of K chains). If not, flag the answer as uncertain. For mission-critical applications, trigger a fallback (e.g., a more expensive larger model, or human review).
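Consensus threshold checking can be sketched as a thin wrapper around the majority vote. The return shape (answer, vote share, uncertain flag) is an assumption for illustration; what a caller does with the flag (fallback model, human review) is left out.

```python
from collections import Counter
from typing import Optional


def vote_with_consensus(
    answers: list[str], min_share: float = 0.5
) -> tuple[Optional[str], float, bool]:
    """Return (majority_answer, vote_share, is_uncertain).

    The answer is flagged uncertain unless its vote share strictly exceeds min_share.
    """
    if not answers:
        return None, 0.0, True
    top_answer, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    return top_answer, share, not (share > min_share)


confident = vote_with_consensus(["18"] * 7 + ["20"] * 3)
split = vote_with_consensus(["18", "20", "22", "18"])
print(confident, split)
```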
Domain Adaptation
Adapting Complexity-Based Prompting to a new domain is primarily a data exercise, not an algorithm change. The scoring function (step count) and selection logic (top-K) are universal. What changes per domain is:
- The candidate pool: Must contain domain-specific problems with domain-appropriate reasoning chains. Generic examples from mathematics cannot substitute for medical differential diagnosis examples.
- The minimum complexity threshold: Different domains have different natural step-count distributions. A legal IRAC chain naturally has 4–6 steps. A multi-step arithmetic chain may have 8–12 steps. Calibrate the complexity criterion relative to the domain's baseline, not relative to the paper's math benchmarks.
- The answer extraction pattern: "The answer is X" works for math. Legal tasks may require extracting a holding ("The court would likely rule..."). Medical tasks may require extracting a diagnosis ("The most likely diagnosis is..."). Update the extraction regex accordingly.
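As an illustration of swapping the extraction pattern per domain, the sketch below keeps one regex per domain in a lookup table. The patterns are assumptions built from the example phrases above and would need tuning against real model outputs.

```python
import re
from typing import Optional

# Hypothetical per-domain patterns based on the example phrases in the text.
EXTRACTION_PATTERNS = {
    "math": re.compile(r"The answer is\s*(-?\d[\d,]*(?:\.\d+)?)"),
    "legal": re.compile(r"The court would likely rule\s+(.+?)\.(?:\s|$)"),
    "medical": re.compile(r"The most likely diagnosis is\s+(.+?)\.(?:\s|$)"),
}


def extract_answer(chain: str, domain: str) -> Optional[str]:
    """Return the domain-specific final answer, or None if the pattern is absent."""
    match = EXTRACTION_PATTERNS[domain].search(chain)
    if match is None:
        return None
    value = match.group(1).strip()
    # Thousands separators are only meaningful (and removable) for numeric answers.
    return value.replace(",", "") if domain == "math" else value


print(extract_answer("25 * 50 = 1,250.\nThe answer is 1,250.", "math"))
print(extract_answer("The most likely diagnosis is acute appendicitis.", "medical"))
print(extract_answer("The court would likely rule in favor of the tenant.", "legal"))
```

Free-text answers (legal, medical) usually need normalization before voting, since two chains can state the same conclusion in different words.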
Domain-specific terminology handling:
When the target domain uses specialized terminology, ensure the demonstration pool uses that terminology consistently and correctly. Inconsistent terminology within demonstrations confuses the model's attention patterns. For example, a medical demonstration pool that alternates between "myocardial infarction" and "heart attack" creates lexical ambiguity that reduces the technique's effectiveness.
Action: standardize terminology within the pool before scoring. If the domain has a controlled vocabulary (ICD codes for medical, legal citation formats for legal, IUPAC names for chemistry), use it consistently throughout the reasoning chains.
Rapid adaptation to new domains (5-step process):
- Collect 20 problems from the new domain with verified correct answers. This is the only human-intensive step.
- Generate candidate reasoning chains: Use a capable base model (GPT-4 or Claude Opus) with a Zero-Shot-CoT prompt to generate reasoning chains for all 20 problems.
- Score generated chains by step count and manually review the top-10 for correctness and reasoning quality. Correct any chains with errors.
- Select the top-8 correct, complex chains as demonstrations.
- Evaluate on 30–50 held-out domain examples to verify performance before deployment.
This process typically takes 2–4 hours for a new domain with a capable base model and does not require domain experts to write reasoning chains from scratch—they only need to review and correct model-generated chains.
Leveraging analogies for domain transfer:
When the new domain has insufficient problems for building a pool (< 10 available), consider whether a structurally analogous domain exists with a richer pool. For example:
- Chemistry stoichiometry problems are structurally analogous to arithmetic word problems (molar ratios ↔ proportions, molecular weights ↔ unit prices). Math demonstrations may provide partial transfer.
- Legal IRAC reasoning is structurally analogous to scientific hypothesis testing (rule ↔ hypothesis, application ↔ experiment, conclusion ↔ result). Scientific reasoning demonstrations may provide partial transfer.
Test cross-domain transfer by evaluating the source-domain demonstrations on target-domain validation examples. If transfer accuracy is within 5 pp of domain-specific demonstrations, source-domain examples can be used until sufficient target-domain examples are collected.
8. Risk and Ethics
8.1 Ethical Considerations
What does this reveal about language model capabilities?
Complexity-Based Prompting demonstrates that language models are highly sensitive to the structural properties of in-context demonstrations—not just their semantic content. The technique works because models learn from the form of reasoning as much as from the domain content. This has an important implication: the way a practitioner presents examples to a model can substantially alter its output behavior, without any gradient updates or explicit fine-tuning.
This is a double-edged capability. It means practitioners have powerful tools for improving model behavior (as this technique demonstrates). It also means that adversarial actors have powerful tools for manipulating model behavior by crafting demonstrations that encode undesirable reasoning patterns.
Bias risks:
The technique's candidate pool is human-constructed, and the selection criterion (step count) is agnostic to the content of the steps. If the human-annotated pool contains reasoning chains that encode biased inference patterns (e.g., statistical stereotyping in medical diagnosis examples, assumption of default demographics in legal reasoning), selecting the most complex such chains may amplify those biases—because the most complex chains have more steps and thus more surface area for biased inferences to influence the model.
Practitioners should audit selected demonstrations for implicit biases, especially when deploying in high-stakes domains (healthcare, legal, financial). The audit should focus not just on the final answers but on the intermediate reasoning steps, where biases are more likely to be embedded implicitly.
Transparency concerns:
The technique modifies which examples are shown to the model but does not expose this modification to end users. Users interacting with a system that uses complexity-based prompting receive outputs that were shaped by a non-disclosed selection criterion. In high-stakes deployments, this is ethically relevant: users should be informed that the system applies structured prompting techniques that influence its reasoning style.
More broadly, the technique demonstrates that the "same" model can produce substantially different outputs depending on demonstration selection—a point that should inform responsible AI deployment guidelines. Systems should document which prompting techniques are used and how they affect output distributions.
Framing effects:
Complex demonstrations may inadvertently frame problems in ways that lead to systematic errors. For example, if the most complex arithmetic demonstrations all involve multi-product pricing, the model may interpret subsequent word problems through a pricing frame even when the problem is about a different domain. This is a framing bias from demonstration selection that is difficult to detect without targeted evaluation.
8.2 Risk Analysis
Failure Modes
Silent overconfidence: The majority vote in Phase 2 produces a single answer with high vote share—suggesting high confidence—even when all chains followed the same incorrect reasoning path. If all sampled chains make the same systematic error (e.g., all confuse "remaining" with "removed"), the majority vote confidently selects the wrong answer. This failure mode is particularly dangerous because the voting mechanism provides a false signal of reliability.
Detection: Monitor cases where the vote share is near-unanimous (>90%) but the answer is wrong. High vote share is not a reliable confidence signal when chains are not independent (they all start from the same prompt and may share systematic error modes).
Demonstration staleness: As the task distribution shifts over time (new problem types emerge, difficulty increases), fixed demonstrations become misaligned. The technique provides no mechanism for detecting or adapting to this drift. A model that was achieving 80% accuracy may silently degrade to 65% as the test distribution shifts.
Detection: Monitor rolling accuracy on a continuously updated validation sample. Set alert thresholds for accuracy drops > 5 pp.
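A rolling-accuracy monitor with a 5 pp alert threshold could be sketched as follows; the class name, window size, and baseline value are illustrative assumptions.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the last `window` graded outcomes and flags large drops."""

    def __init__(self, baseline: float, window: int = 200, max_drop_pp: float = 5.0):
        self.baseline = baseline
        self.max_drop_pp = max_drop_pp
        self.outcomes: deque = deque(maxlen=window)

    def record(self, correct: bool) -> bool:
        """Add one graded outcome; return True if an alert should fire."""
        self.outcomes.append(correct)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data for a stable estimate yet
        rolling = sum(self.outcomes) / len(self.outcomes)
        return (self.baseline - rolling) * 100 > self.max_drop_pp


monitor = RollingAccuracyMonitor(baseline=0.80, window=10)
healthy = [monitor.record(ok) for ok in [True] * 8 + [False] * 2][-1]
drifted = [monitor.record(ok) for ok in [True] * 6 + [False] * 4][-1]
print(healthy, drifted)
```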
Cascading errors in chained pipelines: In multi-stage systems where Complexity-Based Prompting feeds into downstream stages, an error in the reasoning chain that is incorrectly selected as "complex" (many steps, but steps are wrong) propagates through all downstream stages. The downstream stage receives confident, complex-looking but incorrect inputs, which it processes as valid.
Adversarial Risks
Prompt injection via user-provided context: In deployments where user-provided content is included in the prompt (e.g., "solve this problem: [user content]"), a malicious user could inject a high-step-count false reasoning chain designed to override the model's correct reasoning. The model, primed to follow multi-step demonstrations, may adopt the injected reasoning pattern.
Mitigation: Strictly separate user-provided content from the demonstration section using hard delimiters and system-level instructions. Never allow user input to appear in the portion of the prompt that the model treats as a demonstration.
Adversarial examples targeting step-count filter: An adversary who knows the system uses complexity-based consistency could craft inputs designed to elicit complex-looking but incorrect chains. By structuring the question to reward verbose incorrect reasoning (e.g., "explain in maximum detail"), the adversary could ensure their target answer dominates the top-K chains.
Mitigation: Combine step-count filtering with semantic coherence checking (e.g., validate that extracted intermediate values are numerically self-consistent before including a chain in the vote).
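One simple form of the coherence check is to re-evaluate every arithmetic statement in a chain. The sketch below only recognizes binary equations of the form "a op b = c" and uses a fixed tolerance; both are simplifying assumptions, and chains with other notation would pass unchecked.

```python
import operator
import re

EQUATION = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def chain_is_numerically_consistent(chain: str, tolerance: float = 0.01) -> bool:
    """Return False if any 'a op b = c' statement in the chain does not hold."""
    for a, op, b, stated in EQUATION.findall(chain):
        left, right, result = float(a), float(b), float(stated)
        if op == "/" and right == 0:
            return False
        if abs(OPS[op](left, right) - result) > tolerance:
            return False
    return True


good_chain = "3 * 4 = 12\n12 + 5 = 17\nThe answer is 17."
bad_chain = "3 * 4 = 14\n14 + 5 = 19\nThe answer is 19."
print(chain_is_numerically_consistent(good_chain), chain_is_numerically_consistent(bad_chain))
```

Chains failing the check would be removed before the step-count ranking, so verbosity alone cannot earn a vote.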
Bias Amplification
The technique amplifies whatever biases are present in the most complex examples in the candidate pool. Because complex examples contain more reasoning steps, they have more surface area for implicit bias expression. Regular auditing of the selected demonstrations is essential.
Steps to detect and mitigate bias in demonstrations:
- Present selected demonstrations to domain experts and ask them to identify any implicit assumptions or stereotypes in the reasoning steps.
- Test model outputs on a set of "bias probe" questions designed to surface systematic errors (e.g., for legal reasoning: does the model systematically favor one party?).
- If bias is found in selected demonstrations, rewrite the biased steps and add neutral alternatives to the pool. The corrected versions may score lower on step count but are safer.
Concrete examples of complexity-amplified bias:
To make the bias risk concrete, consider two scenarios:
Medical reasoning: A candidate pool contains 20 diagnostic reasoning examples. The three most complex examples (9–11 steps each) all involve male patients in their 50s with cardiovascular presentations, because these cases generate longer differential diagnosis chains (more comorbidities to exclude). Complexity selection will preferentially choose these three examples. The model is then implicitly primed to reason about cardiovascular presentations in older male patients, which could lead it to under-consider cardiovascular disease in women or younger patients, for whom the demonstrations model no comparably deep differential. A triage system built on this prompt could exhibit systematic demographic bias in differential depth.
Legal reasoning: A legal analysis pool's most complex examples involve high-value commercial disputes with multiple parties and lengthy contract chains. Complexity selection chooses these. The model becomes primed to produce extensive multi-party analysis even for straightforward single-party disputes, and may apply commercial dispute framing to unrelated legal contexts. More subtly, commercial law complexity may crowd out examples from consumer protection or employment law, creating topical bias in the system's legal coverage.
These scenarios illustrate why domain expert review of selected demonstrations is not optional in high-stakes deployments.
Systematic bias audit checklist for high-stakes domains:
For each selected demonstration, evaluate:
- Does any step assume a demographic characteristic of a person mentioned in the question?
- Does any step apply a heuristic that is known to be statistically biased in the domain (e.g., anchoring to a diagnosis based on age or gender in medical reasoning)?
- Does the solution path prefer one interpretive frame over another without justification?
- Would an expert reviewer consider any step to be an oversimplification that could mislead a practitioner?
If yes to any of the above: rewrite the affected step to be explicitly neutral, or add a counterbalancing step that acknowledges the alternative interpretation. Accept the possible reduction in step count as a necessary cost of safe deployment.
Evaluation robustness:
When measuring the technique's accuracy, always evaluate on a held-out test set that was not used to tune K or M. Tuning prompting hyperparameters (including K and M) on the same set used for evaluation overfits to that set and overestimates generalization performance.
8.3 Innovation Potential
Derived Innovations
Adaptive complexity thresholding: Rather than using a fixed K, compute the optimal K adaptively for each test question based on the variance of the sampled chains. When chains are highly consistent (low answer variance), use a smaller K (strict filtering for efficiency). When chains are inconsistent (high variance), increase K (less filtering for stability). This would make the technique self-calibrating.
Complexity-stratified ensembling: Instead of discarding the bottom N-K chains, use them as a secondary signal. If the top-K chains vote for answer A and the bottom (N-K) chains vote for answer B, the disagreement signals uncertainty. The system can flag the question for human review or output a calibrated uncertainty estimate.
Semantic-complexity hybrid selection: Combine step count (complexity) with semantic similarity (relevance) into a single score, weighted by a task-specific parameter:
score(example) = α × complexity(example) + (1-α) × similarity(example, test_question)
This directly addresses the heterogeneous distribution problem where pure complexity selection underperforms retrieval.
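A sketch of the hybrid score, with one added assumption the formula leaves implicit: step counts are min-max normalized so that complexity and similarity share a [0, 1] scale. Similarities are taken as precomputed values in [0, 1]; how they are computed (for example, from embeddings) is outside this sketch.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(
    step_counts: list[int],
    similarities: list[float],   # assumed precomputed, in [0, 1]
    alpha: float = 0.5,
    m: int = 8,
) -> list[int]:
    """Indices of the top-m examples by alpha*complexity + (1-alpha)*similarity."""
    complexity = min_max_normalize([float(s) for s in step_counts])
    scores = [alpha * c + (1 - alpha) * s for c, s in zip(complexity, similarities)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]


steps = [11, 9, 3, 2]
sims = [0.1, 0.2, 0.9, 0.8]
print(hybrid_rank(steps, sims, alpha=1.0, m=2))  # pure complexity
print(hybrid_rank(steps, sims, alpha=0.0, m=2))  # pure similarity
print(hybrid_rank(steps, sims, alpha=0.5, m=2))  # blend
```

Setting α = 1 recovers plain complexity selection and α = 0 recovers pure retrieval, so α can be tuned on a validation set like any other hyperparameter.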
Complexity-guided demonstration generation: Instead of selecting from a pre-existing pool, automatically generate demonstrations that maximize the step-count criterion by prompting the model: "Generate a problem and its detailed step-by-step solution that requires exactly 9 reasoning steps." This combines the annotation-free spirit of Auto-CoT with the complexity criterion of this technique.
Complexity-based curriculum for fine-tuning: The insight that example complexity correlates with learning value applies to fine-tuning, not just prompting. Training on examples ordered from simple to complex (curriculum learning) or on high-complexity examples only may yield models that generalize better to hard problems—an extension of this technique's core insight to the gradient-based learning domain.
Complexity-stratified ensemble architecture:
A more sophisticated derivative of the technique uses the complexity of generated chains as the basis for a multi-tier ensemble:
- Tier 1 (highest step count chains, top 20%): These are the "expert" chains. They are most likely to be correct on hard problems but may over-engineer easy ones.
- Tier 2 (middle 40%): These are "standard" chains. They represent balanced reasoning.
- Tier 3 (lowest step count chains, bottom 40%): These are "quick" chains. They may be correct on easy problems but unreliable on hard ones.
For a given test question, the ensemble predicts:
- If tiers 1, 2, and 3 agree: high confidence, output the agreed answer.
- If tier 1 and 2 agree but tier 3 disagrees: moderate confidence, output tier 1+2 answer (the question is probably hard and tier 3 reasoning is insufficient).
- If all tiers disagree: low confidence, flag for human review or escalate to a more expensive model.
This approach extracts more signal from the N sampled chains than simple top-K filtering, at the cost of implementation complexity.
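The tiered decision rule might be sketched as below. Chains are assumed to arrive as (step_count, answer) pairs, and any agreement pattern not listed above (for example, tiers 1 and 2 disagreeing while one matches tier 3) is treated as low confidence, which is an assumption of this sketch rather than something the description specifies.

```python
from collections import Counter
from typing import Optional


def tier_votes(samples: list[tuple[int, str]]) -> tuple[str, str, str]:
    """Majority answer within the top 20%, middle 40%, and bottom 40% by step count."""
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)
    cut1, cut2 = round(len(ranked) * 0.2), round(len(ranked) * 0.6)
    tiers = (ranked[:cut1], ranked[cut1:cut2], ranked[cut2:])
    if any(not tier for tier in tiers):
        raise ValueError("need enough samples to populate all three tiers")
    t1, t2, t3 = (Counter(a for _, a in tier).most_common(1)[0][0] for tier in tiers)
    return t1, t2, t3


def ensemble_decision(samples: list[tuple[int, str]]) -> tuple[Optional[str], str]:
    """Return (answer, confidence); answer is None when the question should escalate."""
    t1, t2, t3 = tier_votes(samples)
    if t1 == t2 == t3:
        return t1, "high"
    if t1 == t2:
        return t1, "moderate"
    return None, "low"


answers = ["18", "18", "18", "18", "18", "20", "20", "20", "20", "18"]
samples = [(10 - i, a) for i, a in enumerate(answers)]  # step counts 10 down to 1
print(ensemble_decision(samples))
```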
Novel combinations:
- With Least-to-Most Prompting: Apply Least-to-Most decomposition as the reasoning structure, then select demonstrations by the number of sub-problems decomposed (a complexity criterion adapted to the hierarchical structure). Chains that decompose into more sub-problems are ranked higher.
- With Self-Refine (Madaan et al., 2023): Use Self-Refine to iteratively improve generated chains, then apply step-count filtering to the refined chain pool. Refined chains with more steps (added by the self-critique and revision process) are likely higher quality.
- With Tree-of-Thoughts (Yao et al., 2023): Apply complexity scoring to the leaves of the reasoning tree: select the tree paths with the most steps for the final answer extraction. This adapts the decoding-time filtering to tree-structured search rather than parallel sampling.
9. Ecosystem and Integration
9.1 Tools and Frameworks
Framework support:
| Framework | Support Level | Notes |
|---|---|---|
| LangChain | Native compatibility | Implement as a custom ExampleSelector that sorts by step count; integrates with FewShotPromptTemplate |
| DSPy | Strong fit | Implement as a custom Teleprompter that optimizes over demonstration complexity; the BootstrapFewShot module can be adapted |
| Haystack | Compatible | Use a custom PromptNode with a complexity-ranked example store |
| LlamaIndex | Compatible | Rank examples by step count before inserting them into the prompt template, in place of similarity-based example retrieval |
| Semantic Kernel | Compatible | Implement as a custom SemanticFunction with a pre-processing step that ranks demonstrations |
LangChain implementation as a custom ExampleSelector:
```python
from langchain_core.example_selectors.base import BaseExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate


class ComplexityExampleSelector(BaseExampleSelector):
    """Selects examples by reasoning chain complexity (step count)."""

    def __init__(self, examples: list[dict], k: int = 8):
        self.examples = examples
        self.k = k

    def add_example(self, example: dict) -> None:
        self.examples.append(example)

    def select_examples(self, input_variables: dict) -> list[dict]:
        def step_count(ex):
            chain = ex.get('chain', ex.get('reasoning', ''))
            return len([line for line in chain.split('\n') if line.strip()])

        scored = sorted(self.examples, key=step_count, reverse=True)
        return scored[:self.k]


# Usage (candidate_pool is a list of dicts with "question", "chain" and "answer" keys)
selector = ComplexityExampleSelector(examples=candidate_pool, k=8)
example_prompt = PromptTemplate(
    input_variables=["question", "chain", "answer"],
    template="Question: {question}\n{chain}\nThe answer is {answer}."
)
prompt = FewShotPromptTemplate(
    example_selector=selector,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"]
)
```
DSPy integration:
In DSPy's optimization framework, Complexity-Based Prompting can be implemented as a custom teleprompter that bootstraps demonstrations ranked by step count:
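```python
from dspy.teleprompt import Teleprompter


class ComplexityBootstrap(Teleprompter):
    """Bootstraps demonstrations sorted by complexity (step count)."""

    def __init__(self, m=8):
        self.m = m

    def compile(self, student, *, trainset):
        def step_count(text):
            return len([line for line in text.split('\n') if line.strip()])

        # Score trainset by the step count of the chain the student generates
        scored = []
        for ex in trainset:
            prediction = student(question=ex.question)
            # The chain-of-thought field is named `reasoning` in recent DSPy
            # releases and `rationale` in older ones, so check both.
            chain = getattr(prediction, 'reasoning', None) or getattr(prediction, 'rationale', '')
            scored.append((ex, step_count(chain)))
        scored.sort(key=lambda x: x[1], reverse=True)
        top_examples = [ex for ex, _ in scored[:self.m]]
        # Demos are stored on each predictor inside the program, not on the program itself
        for predictor in student.predictors():
            predictor.demos = top_examples
        return student
```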
Pre-built templates: The FranxYao/chain-of-thought-hub repository provides complexity-ranked prompt templates for GSM8K, MultiArith, MathQA, and several BigBench Hard tasks. These templates are directly usable for replication and as starting points for new domains.
Evaluation tools: EleutherAI's LM Evaluation Harness supports GSM8K, MathQA, and BigBench benchmarks; use it to evaluate complexity-based prompts against baseline configurations. The lm_eval library accepts custom few-shot prompts, making it straightforward to evaluate the technique.
Practical LM Eval Harness integration:
```shell
# Install
pip install lm-eval
# Run an 8-shot evaluation on GSM8K. As written, this draws the task's default
# few-shot examples; point --tasks at a custom task to use complexity-ranked ones.
lm_eval --model openai-chat-completions \
    --model_args model=gpt-4o \
    --tasks gsm8k \
    --num_fewshot 8 \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --output_path ./results/complexity_gsm8k.json
```
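For custom demonstrations, create a task override file (a task YAML) that specifies your complexity-ranked examples as the fixed few-shot pool. The bundled gsm8k_cot task shows one way to pin few-shot examples, through a fewshot_config block with a fixed list of samples; check the harness's task guide for the exact keys supported by your installed version.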
Weights & Biases / MLflow for experiment tracking:
Track complexity-based prompting experiments using standard MLOps tools:
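```python
import wandb

config = {"m": 8, "n": 50, "k": 40, "temperature": 0.7, "model": "gpt-4o"}
wandb.init(project="complexity-based-prompting", config=config)

# Assumption: the predictor accepts the sampling parameters n, k and temperature;
# m and model are logged for tracking only.
sampling = {key: config[key] for key in ("n", "k", "temperature")}

results = []
for test_question, ground_truth in test_set:
    prediction = predict_with_complexity_consistency(
        build_prompt(selected_demos, test_question), **sampling
    )
    correct = normalize_answer(prediction) == normalize_answer(ground_truth)
    results.append(correct)
    wandb.log({"correct": int(correct), "question": test_question})

wandb.log({"accuracy": sum(results) / len(results)})
```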
Logging individual prediction correctness allows offline analysis of which question types benefit most from the technique, which informs pool expansion decisions.
9.2 Related Techniques and Combinations
Closely related techniques:
| Technique | Relation to Complexity-Based Prompting |
|---|---|
| Self-Consistency (Wang et al., 2022) | Complexity-Based Consistency is a direct extension; complexity selection filters SC's sampling pool |
| Auto-CoT (Zhang et al., 2022) | Complementary: Auto-CoT generates demonstrations automatically; Complexity-Based Prompting selects the best from a pool. Combining: use Auto-CoT to generate a large pool, then apply complexity selection |
| Least-to-Most Prompting (Zhou et al., 2022) | Orthogonal: L2M changes the reasoning structure; Complexity changes which examples demonstrate that structure |
| Analogical Prompting (Yasunaga et al., 2023) | Orthogonal: AP generates task-specific examples on the fly; Complexity selects from pre-existing examples by structural depth |
| Active-Prompt (Diao et al., 2023) | Related: Active-Prompt selects demonstrations by answer uncertainty; Complexity-Based Prompting selects by step count. Both address the same problem (which examples to use) with different criteria |
| Zero-Shot-CoT (Kojima et al., 2022) | Complexity-Based Prompting extends Zero-Shot-CoT by adding structured demonstrations; when no pool is available, ZS-CoT is the fallback |
Hybrid solutions:
Complexity + Retrieval: For heterogeneous test distributions where pure complexity selection underperforms retrieval:
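```python
def hybrid_score(example, test_question, max_complexity_in_pool, alpha=0.5):
    """Blend normalized step count with semantic similarity to the test question."""
    # complexity_score counts reasoning steps; semantic_similarity is expected
    # to return a value in [0, 1]. Both are helpers defined elsewhere.
    comp = complexity_score(example['chain'])
    comp_normalized = comp / max_complexity_in_pool
    sim = semantic_similarity(example['question'], test_question)
    return alpha * comp_normalized + (1 - alpha) * sim
```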
The alpha parameter controls the trade-off. Set alpha = 1.0 for pure complexity (Fu et al.), alpha = 0.0 for pure retrieval. Tune alpha on a validation set. For homogeneous tasks (all examples are of a similar type), alpha = 1.0 is a sensible default. For heterogeneous tasks, start the search in the 0.3–0.5 range, where a blend can outperform both extremes.
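To see how alpha shifts the ranking, here is a self-contained sketch that ranks a whole pool. It assumes token-overlap (Jaccard) similarity as a stand-in for an embedding model; `jaccard` and `rank_hybrid` are hypothetical names, and a production system would swap in a real similarity function.

```python
def step_count(chain: str) -> int:
    """Count non-empty lines as reasoning steps."""
    return sum(1 for line in chain.split("\n") if line.strip())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude stand-in for embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def rank_hybrid(pool, test_question, alpha=0.5, m=8):
    """Return the top-m pool entries by alpha * complexity + (1 - alpha) * similarity."""
    # Normalize step counts by the pool maximum so both terms lie in [0, 1].
    max_steps = max((step_count(ex["chain"]) for ex in pool), default=0) or 1

    def score(ex):
        comp = step_count(ex["chain"]) / max_steps
        sim = jaccard(ex["question"], test_question)
        return alpha * comp + (1 - alpha) * sim

    return sorted(pool, key=score, reverse=True)[:m]
```

With alpha = 1.0 the longest chain always ranks first; with alpha = 0.0 the example whose question shares the most tokens with the test question ranks first.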
Complexity + Auto-CoT:
- Collect the full training set of questions (no reasoning chains needed).
- Use Auto-CoT to generate reasoning chains for the training questions (via Zero-Shot-CoT).
- Score generated chains by step count.
- Select the top-M by complexity.
This approach is fully annotation-free (no human-written chains required) and should outperform pure Auto-CoT (which selects by cluster centroids, not complexity).
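The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: `generate` is a hypothetical callable that sends a prompt to a model and returns its text, the prompt layout is illustrative, and Auto-CoT's clustering stage is omitted so that only the generation and step-count ranking are shown. Generated chains may be wrong, so a spot-check of the selected examples is still advisable.

```python
from typing import Callable

TRIGGER = "Let's think step by step."  # Zero-Shot-CoT trigger phrase


def step_count(chain: str) -> int:
    """Count non-empty lines as reasoning steps."""
    return sum(1 for line in chain.split("\n") if line.strip())


def build_demo_pool(
    questions: list[str],
    generate: Callable[[str], str],  # hypothetical model call: prompt -> completion
    m: int = 8,
) -> list[dict]:
    """Generate one Zero-Shot-CoT chain per question and keep the m longest."""
    pool = []
    for question in questions:
        chain = generate(f"Q: {question}\nA: {TRIGGER}")
        pool.append({"question": question, "chain": chain})
    return sorted(pool, key=lambda ex: step_count(ex["chain"]), reverse=True)[:m]
```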
Comparison table with key alternatives:
| Dimension | Complexity-Based | Random Few-Shot | Retrieval-Based | Auto-CoT | Zero-Shot-CoT |
|---|---|---|---|---|---|
| Annotation required | Small pool (8–20) | Small pool (8+) | Large corpus (1K–10K+) | Questions only | None |
| Selection criterion | Step count | None (random) | Semantic similarity | Cluster diversity | N/A |
| Performance (avg) | Highest (few-shot methods) | Moderate | High (heterogeneous tasks) | Moderate-high | Baseline |
| Cost at inference | Low (greedy) / High (N=50) | Low | Low | Low | Lowest |
| Suitable model size | 100B+ | Any | Any | Any | 100B+ |
| Heterogeneous tasks | Weaker | Baseline | Stronger | Moderate | N/A |
| Homogeneous tasks | Strongest | Moderate | Strong | Moderate | Baseline |
When to choose each approach (decision tree):
```
Does the task require multi-step reasoning (≥ 4 steps for correct solutions)?
├── NO → Use Zero-Shot-CoT or standard few-shot CoT (complexity selection adds no value)
└── YES →
    Do you have any annotated demonstrations available?
    ├── NO → Use Auto-CoT (generates demonstrations automatically from Zero-Shot-CoT)
    └── YES →
        Is the test distribution highly heterogeneous (many distinct problem sub-types)?
        ├── YES → Use retrieval-based selection OR hybrid (complexity + semantic similarity)
        └── NO →
            Is the annotation pool large (≥ 50 examples)?
            ├── YES → Both retrieval and complexity are viable; complexity requires less infrastructure
            └── NO (8–20 examples) → Use Complexity-Based Prompting

Is accuracy the primary constraint (cost is secondary)?
├── YES → Add Phase 2 (complexity-based consistency, N=50, K=40)
└── NO → Use Phase 1 only (greedy decoding, 1 forward pass per query)
```
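For teams that want this choice encoded in configuration logic, the decision tree can be written as two small functions. The function names, argument names and returned labels are hypothetical; the thresholds are the ones stated in the tree.

```python
def choose_selection(multi_step: bool, has_demos: bool,
                     heterogeneous: bool, pool_size: int) -> str:
    """Walk the selection branch of the decision tree and return a label."""
    if not multi_step:
        return "zero-shot-cot or standard few-shot cot"
    if not has_demos:
        return "auto-cot"
    if heterogeneous:
        return "retrieval or hybrid"
    if pool_size >= 50:
        return "retrieval or complexity"
    return "complexity-based prompting"


def choose_decoding(accuracy_first: bool) -> dict:
    """Pick Phase 2 sampling (N=50, K=40) or a single greedy pass."""
    if accuracy_first:
        return {"phase2": True, "n": 50, "k": 40}
    return {"phase2": False, "n": 1, "k": 1}
```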
Pattern transfer between techniques:
The core insight—that the structural richness of context examples predicts their utility—transfers broadly. When switching from Complexity-Based Prompting to Tree-of-Thoughts, the selection criterion adapts: instead of choosing demonstrations by reasoning chain length, choose demonstrations whose tree expansions have the most branches (wider search trees indicate harder problems with richer solution spaces). When switching to Least-to-Most Prompting, choose demonstrations that decompose into the most sub-problems (more sub-problems = higher decomposition complexity). The step-count heuristic is an instantiation of a more general principle that can be adapted to the structural unit of any reasoning framework.
9.3 Integration Patterns
Task Adaptation
Complexity-Based Prompting adapts to a new task by updating the candidate pool—not by changing the algorithm. The procedure:
- Collect 15–25 examples with human-written reasoning chains for the new task.
- Score by step count; review top-10 manually.
- Select top-8 as demonstrations.
- Test on a 50-example validation set.
- If accuracy is insufficient, expand the pool and re-select; do not change M or K prematurely.
Integration with RAG (Retrieval-Augmented Generation):
In a RAG pipeline, the retrieved context is placed before the question in the user's prompt. Complexity-Based Prompting applies to the demonstration section (system prompt or initial turns), which is separate from the retrieved context. The two components do not interfere.
However, a natural integration is to apply complexity scoring to the retrieved documents themselves: among the retrieved passages, prioritize those that exhibit multi-step reasoning patterns (for QA tasks where the retrieved passage contains the solution). This applies the complexity intuition at the retrieval stage rather than the demonstration stage.
```
System: [Complexity-selected CoT demonstrations]
User: Based on the following context, answer the question step by step.
Context: [Retrieved passages, ordered by reasoning complexity]
Question: [test question]
```
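One way to assemble that layout in code is sketched here. The role/content message format, the function name `build_rag_messages`, and the use of non-empty line count as the passage complexity proxy are all assumptions for illustration; a real pipeline may need a better signal for how much reasoning a passage contains.

```python
def step_count(text: str) -> int:
    """Count non-empty lines; used here as a rough complexity proxy."""
    return sum(1 for line in text.split("\n") if line.strip())


def build_rag_messages(demos: list[dict], passages: list[str], question: str) -> list[dict]:
    """Place demonstrations in the system turn and ranked passages in the user turn."""
    demo_text = "\n\n".join(
        f"Question: {d['question']}\n{d['chain']}\nThe answer is {d['answer']}."
        for d in demos
    )
    # Most step-rich passages first, mirroring the ordering in the template.
    ordered = sorted(passages, key=step_count, reverse=True)
    context = "\n\n".join(ordered)
    user_text = (
        "Based on the following context, answer the question step by step.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    return [
        {"role": "system", "content": demo_text},
        {"role": "user", "content": user_text},
    ]
```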
Integration with agents and multi-step workflows:
In agentic frameworks (LangChain Agents, AutoGen, Claude Code), Complexity-Based Prompting can improve the reasoning quality of individual tool-call decisions:
- Use the technique to generate the agent's reasoning chain for selecting the next tool.
- Demonstrations show agents reasoning through tool selection: "First, I need to retrieve X. Then I need to compute Y from X. Finally, I need to format Z." High-step demonstrations prime the agent toward thorough planning rather than greedy single-tool calls.
Transition from other approaches:
From handcrafted CoT to Complexity-Based Prompting:
- Take your existing handcrafted CoT examples. Score them by step count.
- If the top-8 by complexity are a subset of your existing examples, deploy immediately with no new annotations.
- If your existing examples are all short (< 4 steps), rewrite the top-3 existing examples with more explicit intermediate steps, or add new complex examples to the pool.
- Test on your validation set and compare to your current handcrafted baseline.
From Zero-Shot-CoT to Complexity-Based Prompting:
- Run Zero-Shot-CoT on a sample of your training questions. Collect the generated chains.
- Score generated chains by step count.
- Select the top-8 generated chains (with their questions) as the demonstration pool.
- Use this auto-generated pool for Complexity-Based Prompting.
This transition is fully annotation-free and allows Zero-Shot-CoT users to upgrade to few-shot performance without manual labeling.
Transitioning from Complexity-Based Prompting to more advanced approaches:
If the accuracy ceiling of Complexity-Based Prompting is reached and task performance is still insufficient:
- Add semantic filtering: combine step count with semantic similarity to the test question.
- Incorporate self-refinement: use Self-Refine or Reflexion to iteratively improve generated chains before final answer extraction.
- Move to Tree-of-Thoughts: apply a search algorithm over the reasoning space rather than sampling independently.
- Consider task-specific fine-tuning: if the task distribution is stable and the annotation budget is large, fine-tuning on high-complexity chain demonstrations may provide gains beyond what in-context selection can achieve.
Production system integration:
For production deployment, the recommended architecture is:
1. Offline: Build and rank candidate pool → select top-M demonstrations → encode as fixed prompt prefix
2. Online inference:
   a. Greedy path (cost-optimized): fixed prompt prefix + test question → single forward pass
   b. Sampled path (accuracy-optimized): fixed prompt prefix + test question → N parallel calls → complexity filter → majority vote
3. Monitoring: Log chain lengths and answer distributions; alert on accuracy drops > 5 pp on rolling validation sample
4. Refresh cycle: Re-evaluate candidate pool quarterly or when distribution shift is detected
Versioning: treat the demonstration set as a versioned artifact. When the pool is updated (new examples added, old ones replaced), create a new version and run A/B evaluation before promoting to production. This ensures changes to demonstrations are tracked and reversible.
Detailed versioning, monitoring, and rollback protocol:
Versioning:
```
demonstrations/
  v1/
    pool.jsonl          # Full candidate pool (25 examples)
    selected.jsonl      # Top-8 selected by step count
    config.json         # {m: 8, n: 50, k: 40, temperature: 0.7}
    eval_results.json   # Validation accuracy at promotion time
  v2/
    ...
```
Each version directory is immutable once promoted to production. Tagging versions in a git repository is recommended: git tag demos-v2 && git push origin demos-v2. The production system references the version tag, making rollback a config-file update.
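A minimal sketch of writing and reading such a version directory, using only the standard library. The helper names `write_version` and `load_selected` are hypothetical; the file names follow the layout described above, and immutability is approximated by refusing to write into a directory that already exists.

```python
import json
from pathlib import Path


def step_count(chain: str) -> int:
    """Count non-empty lines as reasoning steps."""
    return sum(1 for line in chain.split("\n") if line.strip())


def write_version(root, version: str, pool: list[dict], config: dict) -> Path:
    """Create <root>/<version>/ with pool.jsonl, selected.jsonl and config.json."""
    vdir = Path(root) / version
    vdir.mkdir(parents=True, exist_ok=False)  # raises if the version already exists
    selected = sorted(pool, key=lambda ex: step_count(ex["chain"]), reverse=True)[: config["m"]]
    for name, rows in (("pool.jsonl", pool), ("selected.jsonl", selected)):
        with open(vdir / name, "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
    (vdir / "config.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
    return vdir


def load_selected(root, version: str) -> list[dict]:
    """Read the selected demonstrations for a given version."""
    with open(Path(root) / version / "selected.jsonl", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

The production service would call `load_selected` with the version named in its config, so a rollback only changes that version string.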
Monitoring:
Implement a three-layer monitoring stack:
- Layer 1 (real-time): Log the step count of every generated chain and the winning answer's vote share. Alert if mean step count of generated chains drops more than 2 standard deviations below the historical baseline (signals the model may have changed behavior or the prompt is being ineffective).
- Layer 2 (hourly): Compute rolling accuracy on a held-out evaluation set (if answers are verifiable). Alert if accuracy drops > 5 pp from the 7-day rolling average.
- Layer 3 (daily): Run a full benchmark evaluation (N=50 sampling, full test set). Compare against the previous day's result and against the baseline (handcrafted CoT). Flag regressions for investigation.
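The Layer 1 and Layer 2 alert conditions reduce to two comparisons, sketched here with the standard library. The function names are hypothetical, and the standard deviation is taken over individual historical chain lengths, which is one reasonable reading of "2 standard deviations below the historical baseline".

```python
from statistics import mean, stdev


def step_count_alert(baseline_counts, recent_counts, n_sigma: float = 2.0) -> bool:
    """True when the recent mean step count falls more than n_sigma
    standard deviations below the historical baseline mean."""
    mu, sigma = mean(baseline_counts), stdev(baseline_counts)
    return mean(recent_counts) < mu - n_sigma * sigma


def accuracy_alert(rolling_accuracy: float, current_accuracy: float,
                   max_drop_pp: float = 5.0) -> bool:
    """True when accuracy has dropped more than max_drop_pp percentage points.

    Both accuracies are fractions in [0, 1].
    """
    return (rolling_accuracy - current_accuracy) * 100 > max_drop_pp
```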
Rollback procedure:
- Identify the last known-good version tag from the monitoring dashboard (the last version with accuracy within 2 pp of target).
- Update the production config to reference the previous version tag.
- Deploy immediately (zero-downtime config update; no model redeployment required).
- Investigate the regression: compare the current and previous `selected.jsonl` files to identify which demonstrations changed. Test the new demonstrations on a validation set to confirm they are the source of regression.
Canary deployment for demonstration updates:
When promoting a new demonstration version, route 10% of production traffic to the new version for 24 hours. If accuracy on the 10% traffic is within 1 pp of the 90% (old version), promote the new version to 100%. This pattern prevents full-exposure regressions from demonstration changes.
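```python
# Example canary routing logic
import hashlib

def route_request(user_id: str, current_version: str, canary_version: str, canary_fraction: float = 0.1) -> str:
    """Route request to canary or stable version based on a stable hash of user_id."""
    # Python's built-in hash() is salted per process for strings, so it would
    # assign the same user to different buckets on different workers.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    user_bucket = int(digest, 16) % 100
    if user_bucket < round(canary_fraction * 100):
        return canary_version
    return current_version
```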
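The promotion rule from the canary description can be sketched as a single check. The name `should_promote` is hypothetical, and "within 1 pp" is read here as "the canary is no more than 1 pp below the stable version", so a canary that does better is also promoted (an assumption). With small canary samples the comparison is noisy, so a minimum sample size is worth enforcing in practice.

```python
def should_promote(canary_correct: int, canary_total: int,
                   stable_correct: int, stable_total: int,
                   tolerance_pp: float = 1.0) -> bool:
    """Promote when canary accuracy is at most tolerance_pp points below stable."""
    if canary_total == 0 or stable_total == 0:
        return False  # no evidence yet; keep the stable version
    canary_acc = canary_correct / canary_total
    stable_acc = stable_correct / stable_total
    return (stable_acc - canary_acc) * 100 <= tolerance_pp
```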
10. Future Directions
10.1 Emerging Innovations
Adaptive complexity thresholding in production:
Current implementations use a fixed K threshold tuned on a validation set. An emerging direction is to compute K dynamically per query based on the observed variance in sampled chains. When chains agree strongly (low variance in answers), a strict K filter is safe. When chains disagree significantly, increasing K reduces over-filtering. This adaptive approach would make the technique self-calibrating without requiring offline tuning.
Early work in this direction appears in the context of calibrated Self-Consistency (2024), where the vote threshold is adapted based on entropy of the answer distribution. Applying this to complexity-based filtering is a natural next step.
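A minimal sketch of per-query K selection, assuming normalized answer entropy as the disagreement measure. The function names, the linear interpolation, and the `k_min`/`k_max` bounds are illustrative assumptions and not part of the original technique.

```python
import math
from collections import Counter


def answer_entropy(answers) -> float:
    """Shannon entropy of the answer distribution, normalized to [0, 1]
    by the entropy of a uniform distribution over the distinct answers."""
    counts = Counter(answers)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))


def adaptive_k(answers, n: int, k_min: int, k_max: int) -> int:
    """Strong agreement keeps K near k_min (strict filter);
    strong disagreement moves K toward k_max (looser filter)."""
    k = k_min + answer_entropy(answers) * (k_max - k_min)
    return max(1, min(n, round(k)))
```

The returned K would then replace the fixed K when taking the top chains by step count.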
Buffer of Thoughts (Yang et al., 2024) as a complexity-inspired extension:
Buffer of Thoughts introduces a "thought-template" buffer that accumulates and reuses high-level reasoning patterns distilled from previously successful chains. This is conceptually an evolution of Complexity-Based Prompting: instead of selecting the most complex individual examples, it abstracts and stores the structural patterns from complex chains for reuse across diverse problems.
The connection to complexity-based prompting is direct: the thought-templates that are retained in the buffer are those derived from complex, successful chains—the same chains that Complexity-Based Prompting would preferentially select. The innovation is that the buffer generalizes beyond specific examples to abstract templates, enabling transfer across problem types.
Complexity-guided automatic prompt optimization:
DSPy's optimization framework (Khattab et al., 2023) optimizes prompt instructions and demonstrations automatically via gradient-free search. A natural extension is to add step count as an explicit objective in the optimization: among all candidate demonstration sets discovered during search, prefer those with higher average step count. This would automate the manual step of selecting complex demonstrations.
Complexity scoring for structured reasoning (code, proofs):
The current step-count metric is designed for natural language reasoning chains. For code generation, the analogous metric might be the number of distinct algorithmic operations in the solution (not line count, which conflates comments and blank lines with logic). For formal proofs, it might be the number of distinct lemmas applied. Extending the complexity criterion to structured output domains is an open implementation direction.
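For Python solutions, one hedged way to approximate this is to count syntax-tree nodes instead of lines, so comments and blank lines contribute nothing. The set of counted node types below is an illustrative choice, and the function counts occurrences, not distinct operation kinds.

```python
import ast

# Node types treated as "operations"; this selection is an assumption.
COUNTED = (
    ast.Assign, ast.AugAssign, ast.Call, ast.For, ast.While,
    ast.If, ast.Return, ast.BinOp, ast.Compare,
)


def code_complexity(source: str) -> int:
    """Count operation-like AST nodes in a piece of Python source."""
    tree = ast.parse(source)
    return sum(isinstance(node, COUNTED) for node in ast.walk(tree))
```

Two solutions that differ only in comments or blank lines receive the same score, which line count would not guarantee.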
Dynamic Cheatsheet (2025) as a runtime analogue:
The Dynamic Cheatsheet technique (2025) accumulates complex reasoning patterns from previous queries at test time and reuses them as in-context examples for subsequent queries. This is a runtime analogue of Complexity-Based Prompting: instead of selecting complex demonstrations from a pre-built pool, the system builds the pool dynamically from its own successful high-complexity outputs during inference.
This direction—using the model's own complex outputs as future demonstrations—could create a self-improving complexity-based prompting system that becomes more effective as it processes more queries.
10.2 Research Frontiers
Open research question 1: Why does step count predict correctness?
The paper demonstrates empirically that step count correlates with correctness at the demonstration level (selection) and at the output level (consistency filtering). But the causal mechanism is not fully understood. Is it: (a) That complex demonstrations activate deeper attention patterns in the model's CoT generation? (b) That selecting complex demonstrations selects intrinsically harder, richer problem types? (c) That long output chains are more likely to have errors that cancel each other out through voting? (d) Some combination of the above?
Mechanistic interpretability research on how transformer attention patterns change when in-context demonstrations have more steps could illuminate this. The answer would also clarify whether the step-count heuristic can be improved upon with a more principled complexity measure.
Open research question 2: Optimal complexity metrics beyond step count
Step count is a surface proxy. Research into richer complexity metrics could include:
- Semantic diversity of steps: Measuring how many distinct semantic categories each step belongs to (arithmetic, logical, spatial, etc.), not just how many steps.
- Dependency depth: The depth of the dependency tree among steps (how many earlier steps does step N depend on?). Deep dependency implies interleaved reasoning that may be more informative.
- Information-theoretic complexity: The information content of the reasoning chain (entropy or minimum description length), measuring how compressible the chain is. Incompressible chains express non-redundant reasoning.
Whether any of these outperforms step count for demonstration selection is an open empirical question.
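As one concrete starting point for the information-theoretic option, compressibility can be estimated with a general-purpose compressor. This is a rough sketch: `compression_ratio` is a hypothetical helper, zlib is only a proxy for description length, and very short chains can produce ratios above 1 because of compressor overhead.

```python
import zlib


def compression_ratio(chain: str) -> float:
    """Compressed size divided by raw size; higher means less redundant text."""
    raw = chain.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)
```

A chain that repeats the same step many times scores lower than a chain of comparable length whose steps all differ, which is the property the metric is meant to capture.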
Open research question 3: Complexity selection for newer model families
The paper evaluates GPT-3 and Codex (175B). Subsequent model generations (GPT-4, Claude, Llama 3) have dramatically stronger native CoT ability. An important question is whether these models still benefit from complexity selection, and whether the optimal M and K have changed. If stronger models have smaller variability in demonstration quality (all examples are equally useful), the technique's marginal value decreases. If stronger models amplify quality differences (the best examples become even more beneficial), the technique's value increases.
Informal experimentation reported in community evaluations (2024) suggests that GPT-4-class models still show a 2–4 pp benefit from complexity selection over random selection, though the gap is smaller than the 6–7 pp gap observed with GPT-3. This would be consistent with the hypothesis that the technique's value scales inversely with baseline model capability—a weaker model has more to gain from structured demonstrations, while a stronger model's higher baseline narrows the headroom. The question of whether complexity selection continues to provide statistically significant gains on frontier models (GPT-4o, Claude Opus 4) as of 2025 remains open.
Robustness analysis: three conditions from the original paper
The paper tests complexity-based prompting under three conditions beyond the standard in-distribution evaluation:
Condition 1: Transfer across datasets (out-of-domain demonstrations)
Demonstrations selected from MultiArith (multi-step arithmetic) are applied to MathQA test questions (algebraic reasoning). Performance under transfer is lower than in-domain, as expected, but the complexity criterion still outperforms random selection under transfer. This confirms that the selection criterion generalizes beyond the specific dataset it was applied to—complex examples from domain A are still more useful than simple examples from domain A when the test questions come from domain B.
Condition 2: Noisy annotation (incorrect demonstrations included)
A fraction of the candidate pool examples have incorrect reasoning chains (wrong intermediate steps, wrong final answers). Under this condition, complexity selection is more robust than random selection: because the most complex demonstrations are harder to produce incorrectly (writing a 10-step correct solution requires more competence than writing a 3-step incorrect one), complexity selection implicitly filters toward higher-quality annotations. The paper reports that complex prompts remain better than simple prompts even under annotation noise conditions.
Condition 3: Cross-format robustness
The technique is tested under four different step separator formats (newline, period, semicolon, explicit step labels). Complex prompts outperform simple prompts under all four formats. The step separator primarily affects the absolute accuracy level, not the relative benefit of complexity selection. This confirms that the technique's core principle is robust to surface formatting variation.
Implications of robustness results for production:
The three robustness conditions map directly to common production scenarios:
- Transfer robustness: when you have demonstrations from a related but not identical domain, use them rather than starting from zero. Complexity-ranked transfer demonstrations are better than random transfer demonstrations.
- Annotation noise robustness: when the candidate pool was collected hastily and may contain errors, apply complexity selection as a quality filter before manually reviewing. The highest-step examples are more likely to be correct; spot-check the top-8 rather than reviewing all examples.
- Format robustness: when the production API or downstream system imposes a specific separator format, implement that format in demonstrations without worrying about losing the technique's benefit. The step-count criterion functions across separator styles.
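To apply the step-count criterion under a different separator format, the counting function needs to split on that separator. The sketch below covers the four formats named above; the regular expressions and the format labels are assumptions, and the period pattern only splits on a period followed by whitespace or the end of the text so that decimals stay intact.

```python
import re

# Hypothetical format labels mapped to split patterns.
SEPARATORS = {
    "newline": r"\n",
    "period": r"\.(?:\s+|$)",
    "semicolon": r";",
    "step_label": r"(?i)\bstep\s+\d+\s*[:.]",
}


def count_steps(chain: str, fmt: str = "newline") -> int:
    """Count non-empty segments after splitting on the chosen separator."""
    parts = re.split(SEPARATORS[fmt], chain)
    return sum(1 for part in parts if part.strip())
```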
Open research question 4: Complexity-based prompting for multimodal tasks
The technique is defined for text-only reasoning chains. Extending it to multimodal tasks (where the question includes images and the reasoning chain involves visual analysis steps) requires defining a complexity metric for multimodal reasoning chains. Step count may generalize directly if each reasoning step is verbalized; alternatively, a visual-linguistic complexity measure incorporating both the number of text steps and the number of distinct image regions referenced could be more appropriate.
Open research question 5: Combining complexity selection with RLHF-aligned models
RLHF-aligned models (InstructGPT, Claude, Llama 3 Instruct) are trained with human preference feedback that may already incorporate implicit complexity preferences (humans may prefer longer, more detailed answers). Understanding the interaction between RLHF alignment and complexity-based prompting—specifically, whether RLHF-aligned models show stronger or weaker responses to the complexity criterion—is an open question with practical deployment implications.
Relationship to test-time compute scaling (o1, DeepSeek-R1):
OpenAI's o1 model family (2024) and DeepSeek-R1 (2025) represent a paradigm shift in test-time compute: rather than sampling N independent chains and voting, these models perform an extended, structured internal reasoning process (sometimes called "thinking" or "chain of extended thoughts") before producing a final answer. The effective reasoning chain lengths in o1-class models can run to thousands of tokens per query.
Complexity-Based Prompting is a conceptual precursor to this paradigm in the following sense: both approaches rest on the insight that more reasoning at inference time yields better answers. Fu et al. (2023) demonstrate this at the demonstration level (selecting examples that model long reasoning) and at the decoding level (sampling and filtering for long chains). o1-class models internalize this principle at the training level: the model is trained to extend its own reasoning before finalizing its answer.
For practitioners working with o1-class or similar long-thinking models:
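- Phase 1 (complexity-based demonstration selection) should be validated rather than assumed: demonstrations can still convey the depth and style of reasoning expected, but the DeepSeek-R1 report notes that few-shot prompting degraded that model's performance and recommends zero-shot prompts, so compare against a zero-shot baseline before adopting it.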
- Phase 2 (complexity-based consistency) is largely superseded: o1-class models produce a single, extended reasoning trace rather than N independent samples. The filtering mechanism is internalized in the model's training rather than applied post-hoc by the practitioner.
- The step-count selection criterion for demonstrations adapts: for models that think in extended format, demonstrations should show the full extended reasoning process, not just 8–12 brief steps.
Complexity-Based Prompting thus belongs to the early wave of work that established the empirical foundation for test-time compute scaling: it demonstrated in 2022 what later training methods have since built into the models themselves.
Promising future directions:
- Online complexity-based pool construction: A system that continuously adds successful high-complexity chains from production queries to the demonstration pool, with a retention policy based on recency and accuracy-conditional step count.
- Cross-task complexity transfer: Testing whether demonstrations from one domain (e.g., arithmetic) with high step counts transfer to improve performance in a different but structurally similar domain (e.g., algorithmic code problems), without domain-specific annotation.
- Theoretical grounding: Developing a formal learning-theoretic account of why in-context example complexity predicts generalization on test problems. This would connect the empirical finding to statistical learning theory and possibly derive tighter bounds on the number of demonstrations needed as a function of problem complexity.
- Integration with Constitutional AI and critique-based prompting: Using complex demonstrations that include explicit self-critique steps ("Wait, step 3 assumes X, but X requires Y which I haven't verified; let me check") to train models to be both complex and self-aware about their reasoning limitations.
- Complexity-weighted knowledge distillation: Fine-tune a smaller model on high-complexity-chain examples generated by a larger model, using the chain length as an importance weight in the training objective. Steps in longer chains are upweighted, teaching the student model to prioritize extended reasoning trajectories.
- Calibrated uncertainty with complexity stratification: Use the step-count distribution across N sampled chains as a calibrated confidence signal. When the top-K chains are very long (high mean step count) and agree, the model is likely highly confident and correct. When the top-K chains are short and disagree, the model is likely uncertain. Map this distribution to a calibrated probability output.
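The last direction needs raw signals before any calibration can be fitted. A minimal sketch of extracting them from one query's samples follows; `confidence_features` is a hypothetical helper, and mapping these features to a probability (for example with a calibrator fitted on labeled validation queries) is left out because it depends on data not described here.

```python
from collections import Counter
from statistics import mean


def step_count(chain: str) -> int:
    """Count non-empty lines as reasoning steps."""
    return sum(1 for line in chain.split("\n") if line.strip())


def confidence_features(samples, k: int) -> dict:
    """samples: non-empty list of (chain_text, answer) pairs for one query.

    Returns the majority answer among the k longest chains, their mean
    step count, and the share of those chains voting for that answer.
    """
    top_k = sorted(samples, key=lambda s: step_count(s[0]), reverse=True)[:k]
    votes = Counter(answer for _, answer in top_k)
    answer, top_votes = votes.most_common(1)[0]
    return {
        "answer": answer,
        "mean_steps": mean(step_count(chain) for chain, _ in top_k),
        "vote_share": top_votes / len(top_k),
    }
```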
The technique's legacy in the broader prompting literature:
Complexity-Based Prompting is one of the first systematic studies to show that the usefulness of a demonstration depends strongly on structural properties, independent of semantic content. Prior to this work, the field's intuition was that selecting semantically relevant examples was the primary lever for improving few-shot performance. The finding that step count, a purely syntactic property, is at least as predictive as semantic relevance, and more annotation-efficient than retrieval, shifted how the community thinks about the demonstration selection problem.
This insight has cascaded into subsequent work: the "thought-template" concept in Buffer of Thoughts (2024), the complexity-aware query routing in Adaptive-RAG (2024), and the quality-filtered example accumulation in Dynamic Cheatsheet (2025) all instantiate, in different forms, the core idea that the structural richness of a reasoning example predicts its value for guiding model inference.
The step-count heuristic may ultimately be superseded by more principled complexity measures derived from mechanistic interpretability or learning theory. But as a practical tool for the annotation-efficient practitioner, it remains among the highest-ROI prompting techniques available: implementable in ten lines of code, requiring no infrastructure beyond a pre-existing example pool, and providing 5–6 percentage point accuracy gains that compound with Self-Consistency to reach state-of-the-art on standard benchmarks.
Appendix: Reference Card and Quick Implementation Guide
Complete Implementation in Under 50 Lines
For practitioners who want a single, self-contained implementation that covers both phases without any dependencies beyond the standard library and an API client:
```python
import re
from collections import Counter

# ─── Core scoring ────────────────────────────────────────────────────────────
def complexity_score(chain: str) -> int:
    """Count non-empty newline-separated lines in a reasoning chain."""
    return sum(1 for line in chain.split('\n') if line.strip())

# ─── Phase 1: Demonstration selection ─────────────────────────────────────
def select_demonstrations(pool: list[dict], m: int = 8) -> list[dict]:
    """Select top-m demonstrations by reasoning chain complexity."""
    return sorted(pool, key=lambda x: complexity_score(x['chain']), reverse=True)[:m]

def build_prompt(demos: list[dict], question: str) -> str:
    """Assemble the few-shot CoT prompt."""
    parts = [
        f"Question: {d['question']}\n{d['chain']}\nThe answer is {d['answer']}."
        for d in demos
    ]
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

# ─── Answer extraction ────────────────────────────────────────────────────
def extract_answer(chain: str) -> str | None:
    """Return the text after "The answer is", else the last non-empty line."""
    # Stop at a sentence-ending period (one followed by whitespace or the end
    # of the text) so decimal answers such as "3.5" are not cut at the point.
    m = re.search(r'[Tt]he answer is\s+(.+?)(?:\.(?=\s|$)|\n|$)', chain)
    if m:
        return m.group(1).strip()
    lines = [l.strip() for l in chain.split('\n') if l.strip()]
    return lines[-1] if lines else None

# ─── Phase 2: Complexity-based consistency ────────────────────────────────
def complexity_consistency(
    prompt: str,
    generate_fn,  # callable(prompt, n, temperature) -> list[str]
    n: int = 50,
    k: int = 40,
    temperature: float = 0.7,
    min_steps: int = 2,
) -> str:
    """Sample n chains, filter to top-k most complex, return majority answer."""
    chains = generate_fn(prompt, n, temperature)
    # Filter degenerate chains
    valid = [c for c in chains if complexity_score(c) >= min_steps]
    if not valid:
        valid = chains  # Fallback: use all chains
    # Sort by complexity descending, take top-k
    top_k = sorted(valid, key=complexity_score, reverse=True)[:k]
    # Majority vote
    answers = [extract_answer(c) for c in top_k]
    answers = [a for a in answers if a is not None]
    if not answers:
        # Guard against an empty sample list before indexing into it
        return (extract_answer(chains[0]) or "") if chains else ""
    return Counter(answers).most_common(1)[0][0]
```
# ─── Full pipeline ────────────────────────────────────────────────────────
def run_complexity_prompting(
pool: list[dict],
question: str,
generate_fn,
m: int = 8,
n: int = 50,
k: int = 40,
use_consistency: bool = True,
) -> str:
demos = select_demonstrations(pool, m=m)
prompt = build_prompt(demos, question)
if use_consistency:
return complexity_consistency(prompt, generate_fn, n=n, k=k)
else:
# Greedy: single chain
chains = generate_fn(prompt, n=1, temperature=0.0)
return extract_answer(chains[0]) or ""
This implementation covers: scoring, selection, prompt building, answer extraction, complexity-based consistency, and greedy fallback. The generate_fn abstraction makes it model-agnostic—pass any callable that returns a list of strings.
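To exercise the pipeline without a model, it can help to pass a stand-in that satisfies the `generate_fn` contract. The sketch below is one such stand-in: `make_stub_generate_fn` is a hypothetical helper invented for this example, and it simply replays canned chains instead of calling an API.

```python
import itertools


def make_stub_generate_fn(canned_chains):
    """Build a generate_fn that replays canned chains instead of calling a model.

    Hypothetical test helper: it ignores the prompt and the temperature.
    """
    if not canned_chains:
        raise ValueError("canned_chains must contain at least one chain")

    def generate_fn(prompt, n, temperature):
        # prompt and temperature are accepted only to match the expected signature.
        cycle = itertools.cycle(canned_chains)
        return [next(cycle) for _ in range(n)]

    return generate_fn


stub = make_stub_generate_fn([
    "8 - 3 = 5\n5 * 2 = 10\nThe answer is 10.",
    "8 - 3 = 5\nThe answer is 5.",
])
chains = stub("Question: (any text)", n=5, temperature=0.7)
```

Naming the parameters `n` and `temperature` keeps the stub usable whether the caller passes them positionally or by keyword. A real adapter would replace the body of `generate_fn` with a call to your API client and return the sampled completions as a list of strings.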
Parameter Quick Reference
| Parameter | Recommended | Notes |
|---|---|---|
| M (demos) | 8 | Reduce to 4–6 if context is tight; increase to 10+ for heterogeneous pools |
| N (samples) | 50 | Reduce to 20 for cost; increase to 100 for critical applications |
| K (top chains) | 40 | ~80% of N; tune on validation; never set K = N (reduces to vanilla Self-Consistency) |
| T (temperature) | 0.7 | 0.5–0.6 for smaller models; 0.7–0.8 for GPT-4 class |
| min_steps | 2 | Exclude degenerate 1-step chains; increase to 3–4 for tasks with naturally longer chains |
| Max tokens | 512 | 768–1024 if demonstrations have 10+ steps; set generously |
| Question prefix | "Question:" | Empirically better than "Q:" on math tasks |
| Step separator | "\n" | Do not change; other separators reduce accuracy by 2–4 pp |
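The K row above suggests tuning on validation data. One way to do that without extra model calls is to sample N chains once per validation question and then re-vote at several values of K. The sketch below assumes that setup; `sweep_k` and the `(sampled_chains, gold_answer)` layout are illustrative names, not part of the reference code.

```python
import re
from collections import Counter


def count_steps(chain):
    """Step-count proxy: number of non-empty lines."""
    return sum(1 for line in chain.split("\n") if line.strip())


def extract(chain):
    """Pull the text after 'The answer is', or None if the phrase is absent."""
    m = re.search(r"[Tt]he answer is\s+([^.\n]+)", chain)
    return m.group(1).strip() if m else None


def vote(chains, k):
    """Majority answer among the k longest chains."""
    top_k = sorted(chains, key=count_steps, reverse=True)[:k]
    answers = [a for a in map(extract, top_k) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


def sweep_k(validation, k_values):
    """validation: list of (sampled_chains, gold_answer). Returns {k: accuracy}."""
    if not validation:
        return {}
    return {
        k: sum(vote(chains, k) == gold for chains, gold in validation) / len(validation)
        for k in k_values
    }


validation = [(
    [
        "2 + 1 = 3\n3 + 1 = 4\nThe answer is 4.",
        "1 + 1 = 2\n2 + 2 = 4\nThe answer is 4.",
        "The answer is 5.",
        "The answer is 5.",
        "The answer is 5.",
    ],
    "4",
)]
accuracy_by_k = sweep_k(validation, [2, 5])
```

In this toy case the two multi-step chains are correct and the three one-line chains are not, so K=2 scores 1.0 while K=5 (no filtering) scores 0.0. Real validation sets will show a much flatter curve; pick the K with the best accuracy and confirm it on questions not used for the sweep.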
Common Failure Modes and Fixes at a Glance
| Symptom | Most Likely Cause | Fix |
|---|---|---|
| No improvement over random selection | Pool too shallow (all examples ≤ 3 steps) | Add genuinely complex examples to the pool |
| No improvement on this model | Model < 100B or lacks CoT ability | Switch to a larger model; verify CoT ability with zero-shot test |
| Majority vote worse than greedy | K too small or T too high | Increase K toward N; reduce temperature to 0.6 |
| Correct chains being filtered out | max_tokens too low causing truncation | Increase max_tokens; add truncation detection |
| High variance across runs | N too small or T too high | Increase N to 50+; reduce T to 0.5–0.6 |
| Improvement on easy questions only | Pool is all simple despite step-count filter | Manually review pool; verify top examples are genuinely hard |
| Format violations in outputs | Demonstrations use inconsistent formats | Standardize all demonstration chains to identical format |
| Injection in generated chains | User input in prompt without sanitization | Sanitize all user input; use hard delimiters |
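For the truncation row above, a simple detection heuristic is to treat any chain that never reaches the "The answer is" phrase as cut off. The sketch below assumes that answer format; the `finish_reason == "length"` check mirrors a convention used by some chat APIs and should be adapted to whatever your client reports.

```python
import re
from typing import Optional

ANSWER_PATTERN = re.compile(r"[Tt]he answer is\s+[^.\n]+")


def looks_truncated(chain: str, finish_reason: Optional[str] = None) -> bool:
    """Heuristically flag a chain that was cut off before stating an answer."""
    # Assumption: the client reports "length" when max_tokens was hit.
    if finish_reason == "length":
        return True
    return ANSWER_PATTERN.search(chain) is None


def drop_truncated(chains):
    """Keep only chains that state an answer."""
    return [c for c in chains if not looks_truncated(c)]


chains = [
    "12 / 4 = 3\n3 + 2 = 5\nThe answer is 5.",
    "12 / 4 = 3\n3 + 2 =",
]
kept = drop_truncated(chains)
```

This cannot catch a chain that was cut in the middle of the answer itself (for example "The answer is 1" when the full answer was 12), so the stop reason from the API, when available, is the more reliable signal.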
Summary: When and How to Use Complexity-Based Prompting
Complexity-Based Prompting is one of the most annotation-efficient techniques available for improving few-shot chain-of-thought performance on multi-step reasoning tasks. Its core principle—select the most step-rich demonstrations from your candidate pool—can be implemented in under 20 lines of code and requires no infrastructure beyond a small annotated example set.
Key decision points for practitioners:
| Decision | Guidance |
|---|---|
| Should I use this technique at all? | Yes, if the task requires ≥ 4 reasoning steps and you have ≥ 8 annotated demonstrations |
| Which phase should I use? | Phase 1 alone for cost-constrained production; Phase 1+2 for accuracy-critical applications |
| What model is required? | 100B+ parameters with demonstrated CoT ability; GPT-4, Claude Opus, or comparable |
| How large should the pool be? | 15–25 examples to select from; more gives better selection, diminishing returns above 30 |
| How do I tune K? | Default K=40 out of N=50; tune on a 50–100 example validation set if default underperforms |
| How do I know it's working? | Validate against random selection and handcrafted CoT baselines; expect +4–8 pp improvement |
| When should I stop using it? | When the task distribution shifts significantly, when a larger model eliminates the accuracy gap, or when test-time compute budget permits o1-style thinking models |
The core insight in one sentence:
Among the demonstrations you have, the ones with the most reasoning steps are the most valuable—not because length correlates with quality by definition, but because problems that require many steps to solve encode richer, more transferable reasoning schemas than problems that can be dispatched in two or three operations.
Frequently Asked Questions
Q: Does the technique still help if I only have 5 examples in my pool?
A: With only 5 examples and the default M=8, the selection function simply returns all 5, so there is no filtering effect. The technique needs a pool larger than M to provide any selection benefit. Even with M=4, you would be choosing the top 4 from 5, which offers minimal selectivity. To derive benefit with M=8, build a pool of at least 12–15 examples so that 4–7 are excluded, giving the complexity criterion real work to do.
Q: Should I always use the highest-step examples, or could there be too-complex examples?
A: The paper does not report a systematic "over-complexity" failure in its experiments (step counts of 8–12 were tested). However, the cognitive load analogy suggests that excessively long examples (20+ steps) may be counterproductive if each step is trivial and the chain reads as verbose rather than deeply reasoned. Use the highest-step examples in your pool, but perform a manual review to verify that each selected example is genuinely complex (many non-trivial steps) rather than artificially inflated (many trivial steps). In practice, naturally annotated examples rarely exceed 15 steps, so this is rarely a concern.
Q: Does the technique work for non-English tasks?
A: There is no theoretical reason it would not. The step-count scoring function (counting newline-separated lines) is language-agnostic. The critical requirement is that the model can produce and understand multi-step chain-of-thought reasoning in the target language, which is a function of the model's multilingual pretraining. For languages where the model has weaker CoT ability, the technique's benefit will be smaller or absent (consistent with the model-scale limitation). For high-resource languages with strong model coverage (Spanish, French, German, Chinese, Japanese), the technique should transfer with minimal modification.
Q: How does this interact with structured prompting systems like system prompts and user/assistant role separation?
A: A practical default with a chat-format API is to place the demonstrations in the system prompt (as a fixed few-shot context) and send the test question as the user turn. This keeps the fixed context separate from per-request input and is typically more stable than packing the demonstrations into the user turn alongside the question. Verify with your specific API that long system prompts are not truncated.
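A minimal sketch of that layout follows, using the role/content message dictionaries that several chat APIs accept. `build_chat_messages` and the instruction sentence at the top of the system message are assumptions of this example; the demonstration format matches the one used elsewhere in this guide.

```python
def build_chat_messages(demos, question):
    """Put the few-shot demonstrations in the system message and the
    test question in the user turn."""
    demo_text = "\n\n".join(
        f"Question: {d['question']}\n{d['chain']}\nThe answer is {d['answer']}."
        for d in demos
    )
    # The instruction sentence is illustrative wording, not taken from the paper.
    system = (
        "Solve the problem step by step, following the format of these examples.\n\n"
        + demo_text
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Question: {question}"},
    ]


demos = [{"question": "What is 2 + 3?", "chain": "2 + 3 = 5", "answer": "5"}]
messages = build_chat_messages(demos, "What is 4 + 4?")
```

Pass `demos` already sorted by complexity (the output of the selection step) so the system message carries the top-M examples. Some APIs take the system prompt as a separate argument instead of a message; in that case pass `messages[0]["content"]` there.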
Q: Can I use this technique with function calling or tool use APIs?
A: When the reasoning chain involves tool use (e.g., a model calls a calculator API as part of its reasoning process), the step count should include the reasoning steps both before and after each tool call. Select demonstrations that involve the most tool calls combined with the richest pre- and post-tool reasoning. The filtering in Phase 2 should also score by the number of reasoning steps in the text portions of the chain, excluding raw tool output (which inflates the line count without adding reasoning).
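One way to score such chains is to skip lines that are raw tool output when counting steps. The sketch below assumes each tool result occupies a single line starting with a marker; `TOOL_OUTPUT_PREFIX` is a hypothetical convention and should be replaced by whatever your traces actually emit.

```python
TOOL_OUTPUT_PREFIX = "Tool output:"  # hypothetical marker, not a standard


def reasoning_step_count(chain: str) -> int:
    """Count non-empty lines, skipping lines that are raw tool output."""
    return sum(
        1
        for line in chain.split("\n")
        if line.strip() and not line.strip().startswith(TOOL_OUTPUT_PREFIX)
    )


chain = (
    "I need the total cost of 12 items at 7 dollars each.\n"
    "Call calculator: 12 * 7\n"
    "Tool output: 84\n"
    "So the total is 84 dollars.\n"
    "The answer is 84."
)
steps = reasoning_step_count(chain)
```

Here the tool call line counts as a step but its result does not, giving 4 instead of 5. Multi-line tool output would need start and end delimiters instead of a per-line prefix.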
Q: What if my task is multi-modal (image + text)?
A: Apply complexity scoring to the text-reasoning portion of the chain only. For a demonstration where the model analyzes an image and produces a multi-step text analysis, score the text analysis steps by newline count. Multi-modal chain-of-thought demonstrations with richer text analysis steps (more description of visual evidence, more inference steps from visual observations) will naturally score higher and be preferentially selected—which is the desired behavior.
Q: Is there a risk of the technique becoming obsolete as models improve?
A: Partially. As frontier models' zero-shot CoT ability approaches 100% on standard benchmarks (GSM8K is already at 95%+ for GPT-4o), the headroom for demonstration-level improvements narrows. On these saturated benchmarks, the technique may show no benefit. However, on harder benchmarks (MATH competition problems, GPQA graduate-level questions, domain-specific professional tasks) where current models remain below 70% accuracy, complexity-based selection continues to provide measurable improvements. The technique's relevance will track the frontier of tasks that remain challenging for large models—as simpler benchmarks saturate, harder benchmarks emerge, and the technique's usefulness migrates to those harder settings.
Q: Should I combine Phase 1 and Phase 2 with a verification step?
A: Yes, if the task allows automatic verification (e.g., math problems where the answer can be checked by substitution, code problems where the solution can be executed). The recommended integration is:
- Apply Phase 1+2 to get a candidate answer.
- Attempt to verify the answer against the problem constraints.
- If verification passes, return the answer.
- If verification fails, re-run Phase 2 with the verified-invalid answers removed from the vote pool.
- If the remaining pool is too small, fall back to the full N chains without filtering.
This verification-augmented loop is most effective for math and code tasks where programmatic verification is straightforward.
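The loop above can be sketched as follows. This version interprets "re-run Phase 2" as re-voting over the chains already sampled, with no new model calls; `verified_vote`, the `verify` callable, and the `min_pool` threshold are hypothetical names chosen for illustration.

```python
from collections import Counter


def count_steps(chain):
    """Step-count proxy: number of non-empty lines."""
    return sum(1 for line in chain.split("\n") if line.strip())


def majority(pairs):
    """Most common answer among (chain, answer) pairs."""
    return Counter(answer for _, answer in pairs).most_common(1)[0][0]


def verified_vote(answered_chains, verify, k, min_pool=3):
    """answered_chains: list of (chain_text, answer); verify: callable(answer) -> bool."""
    if not answered_chains:
        return None
    min_pool = max(min_pool, 1)
    ranked = sorted(answered_chains, key=lambda pair: count_steps(pair[0]), reverse=True)
    pool = ranked[:k]  # Phase 2: keep the k most complex chains
    rejected = set()
    while len(pool) >= min_pool:
        candidate = majority(pool)
        if verify(candidate):
            return candidate
        # Remove every chain that voted for an answer that failed verification.
        rejected.add(candidate)
        pool = [(chain, answer) for chain, answer in pool if answer not in rejected]
    # Pool too small: fall back to an unfiltered vote over all chains.
    return majority(answered_chains)


long_chain = "x + 3 = 10\nSubtract 3 from both sides\nx = 10 - 4\nThe answer is 6."
short_chain = "x = 10 - 3\nThe answer is 7."
answered = [(long_chain, "6")] * 4 + [(short_chain, "7")] * 3

result = verified_vote(answered, lambda a: int(a) + 3 == 10, k=7)
fallback = verified_vote(answered, lambda a: False, k=7)
```

In the example, the first vote picks 6, substitution rejects it, and the re-vote over the remaining chains returns 7. When nothing verifies, the function returns the unfiltered majority, which is by construction an unverified answer; callers may prefer to surface that case separately.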
Sources
- Fu, Y., Peng, H., Sabharwal, A., Clark, P., & Khot, T. (2022). Complexity-Based Prompting for Multi-Step Reasoning. arXiv:2210.00720. ICLR 2023.
- ICLR 2023 OpenReview: Complexity-Based Prompting for Multi-Step Reasoning
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171. ICLR 2023.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903. NeurIPS 2022.
- Zhang, Z., Zhang, A., Li, M., & Smola, A. (2022). Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493. ICLR 2023.
- Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., & Chi, E. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625. ICLR 2023.
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916. NeurIPS 2022.
- Yasunaga, M., Chen, X., Li, Y., Pasupat, P., Leskovec, J., Liang, P., Chi, E. H., & Zhou, D. (2023). Large Language Models as Analogical Reasoners. arXiv:2310.01714. ICLR 2024.
- Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, I., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., & Potts, C. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714.
- Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651. NeurIPS 2023.
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601. NeurIPS 2023.
- Yang, L., Yu, Z., Zhang, T., Cao, S., Xu, M., Zhang, W., Gonzalez, J. E., & Cui, B. (2024). Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models. arXiv:2406.04271. NeurIPS 2024.
- Schulhoff, S., et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv:2406.06608.
- FranxYao. (2022). Complexity-Based-Prompting (GitHub repository).
- FranxYao. (2022). chain-of-thought-hub (GitHub repository).
- Weng, L. (2023). Prompt Engineering (Blog post). Lil'Log.
- Sweller, J. (1988). Cognitive Load During Problem Solving: Effects on Learning. Cognitive Science, 12(2), 257–285.
- Chi, M. T. H., Bassok, M., Lewis, M. W., Reimann, P., & Glaser, R. (1989). Self-Explanations: How Students Study and Use Examples in Learning to Solve Problems. Cognitive Science, 13(2), 145–182.
- Diao, S., Wang, P., Lin, Y., Pan, R., Liu, X., & Zhang, T. (2023). Active Prompting with Chain-of-Thought for Large Language Models. arXiv:2302.12246.
- Sweller, J., & Cooper, G. A. (1985). The use of worked examples as a substitute for problem solving in learning algebra. Cognition and Instruction, 2(1), 59–89.
- Ward, M., & Sweller, J. (1990). Structuring effective worked examples. Cognition and Instruction, 7(1), 1–39.
- Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. (2022). What Learning Algorithm is In-Context Learning? Investigations with Linear Models. arXiv:2211.15661.
- Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023). Transformers Learn In-Context by Gradient Descent. arXiv:2212.07677.
- Shum, K. S., Diao, S., & Zhang, T. (2023). Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data. arXiv:2302.12822.
- Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022.
- Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems (GSM8K dataset). arXiv:2110.14168.
- Patel, A., Bhattamishra, S., & Goyal, N. (2021). Are NLP Models really able to Solve Simple Math Word Problems? (SVAMP dataset). arXiv:2103.07191.
- Srivastava, A., et al. (2022). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-bench). arXiv:2206.04615.
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). Emergent Abilities of Large Language Models. arXiv:2206.07682. TMLR 2022.
- Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., & Wei, J. (2022). Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv:2210.09261.
- Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2022). Large Language Models Are Human-Level Prompt Engineers. arXiv:2211.01910. ICLR 2023.
- Jeong, S., Baek, J., Cho, S., Hwang, S. J., & Park, J. C. (2024). Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. arXiv:2403.14403.
- Miao, S.-Y., Liang, C.-C., & Su, K.-Y. (2020). A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers (ASDiv dataset). arXiv:2106.15772. ACL 2020.
- Roy, S., & Roth, D. (2016). Solving General Arithmetic Word Problems (MultiArith). arXiv:1608.01413.
- Ling, W., Yogatama, D., Dyer, C., & Blunsom, P. (2017). Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems (AQuA-RAT). arXiv:1705.04146.
- Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D., & Berant, J. (2021). Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies (StrategyQA). arXiv:2101.02235.