Analogical prompting: a complete guide

Hand-writing few-shot examples for every task is a chore, and zero-shot "let's think step by step" flies blind. Analogical prompting skips both: it tells the model to recall a few related problems, solve those first, then use its own freshly-made examples to crack the real one. The model becomes its own example bank — and the examples are picked for the exact problem in front of it, not chosen by a human weeks earlier. That's enough to beat human-written few-shot on the hard stuff: GSM8K 77.8% versus 76.7%, and on the tougher MATH benchmark 37.3% versus 34.9% — with zero human-labeled examples (Yasunaga et al., 2023, ICLR 2024).

See it work

Hand a model a probability problem cold and watch it grab the wrong formula. Make it recall a related problem first, and it walks into the answer.

Problem: A bag has 3 red and 5 blue marbles. You draw 2 without
replacement. P(both red)?

Cold:   "Each draw is 3/8, so 3/8 × 3/8 = 9/64."
        ✗ treated the draws as independent — forgot the marble doesn't go back.

Analogical: "Recall a related problem first."
  Recalls: "P(drawing 2 aces from a deck)" → (4/52)(3/51), because the
  deck shrinks after the first card.
  Then the original: (3/8)(2/7) = 6/56 = 3/28.   ✓

Nobody taught the model anything between those two runs. Writing out the deck-of-cards solution lit up the "the pool shrinks after you draw" pattern, and that pattern carried straight into the marble problem. That's the entire trick: warm up the right machinery by solving an analogue, then ride it into the target.
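The arithmetic in the two runs is easy to check by brute force. The sketch below enumerates every pair of marbles and compares the count against both formulas; the function name `p_both_red` is made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

def p_both_red(red: int, blue: int) -> Fraction:
    """Exact P(both red) when drawing 2 marbles without replacement."""
    marbles = ["R"] * red + ["B"] * blue
    pairs = list(combinations(range(len(marbles)), 2))  # every unordered pair
    hits = sum(1 for i, j in pairs if marbles[i] == marbles[j] == "R")
    return Fraction(hits, len(pairs))

without_replacement = p_both_red(3, 5)        # enumeration: 3 of 28 pairs
chain_rule = Fraction(3, 8) * Fraction(2, 7)  # the analogical run's formula
independent = Fraction(3, 8) ** 2             # the cold run's mistake

print(without_replacement, chain_rule, independent)  # 3/28 3/28 9/64
```

The enumeration agrees with the shrinking-pool formula and not with the independent-draws one.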

The mental model

Before working a new proof, a mathematician doesn't start from nothing — they mentally flip through similar theorems they've already solved. Analogical prompting bottles that habit. You tell the model: recall problems like this one, solve them, then tackle mine.

Few-shot hands the model examples. Analogical prompting makes the model fetch its own — picked for this exact problem.

This matters because of a quirk cognitive science documented decades ago: humans rarely apply a relevant analogy spontaneously (Gick and Holyoak, 1980). They need a nudge to recall it. The instruction "recall related problems" is that nudge — it cues retrieval the model wouldn't do on its own. Gentner's structure-mapping theory (1983) explains why the recalled examples help: transfer works when the relationships line up, not the surface details. "Two trains closing on each other" maps onto "two reactions consuming a shared reagent" because both are two agents converging on a shared resource at given rates. Hofstadter (2001) pushed the idea furthest — that analogy is the core of cognition itself — which lands neatly on transformers, since they're relational mappers at scale.

How it works

It's a single API call, but the model's completion moves through distinct phases.

Problem encoding. The model reads the target and forms a representation of its domain and structure. This gates what it recalls next.
Exemplar self-generation. Prompted to "recall K distinct related problems and their solutions," it writes K worked examples. Generating each solution — not just the problem statement — is what activates the procedural knowledge.
Knowledge generation (optional). A stronger variant first writes a short "core concepts" tutorial, then the exemplars. Moving from abstract knowledge to concrete examples to the solution mirrors how good teaching is sequenced, and it wins on harder tasks.
Solution. With the exemplars now in context, the model solves the original, inheriting their approach, format, and depth.
Answer extraction. A marker (#### answer for math, a code block for code) lets you pull the final answer with a regex.

Along the way you see emergent behavior: the model identifies the problem subtype without being told (generating probability exemplars for a probability problem), calibrates exemplar difficulty to the target, and inherits format conventions in code.

Why it works

Four mechanisms, roughly ranked by how much of the effect they carry:

Factor	Share	What it does
Domain knowledge activation	~40%	Writing an exemplar retrieves and "warms up" related knowledge before the real problem.
Solution template conditioning	~35%	Exemplar solutions show the right depth, format, and step decomposition for the domain.
Structural alignment	~20%	Same-structure exemplars apply the same procedures, cutting logical errors.
Diversity coverage	~5%	Distinct exemplars span subtypes, helping when the target sits at a boundary.

When exemplars are good, the effect compounds: better exemplars give a sharper template, which gives fewer errors. When they're bad, the cascade runs in reverse and accuracy can dip below zero-shot CoT, which is the technique's central risk.

Where it shines

Analogical prompting works best on multi-step reasoning with recognizable patterns — the kind where an expert would say "ah, this is a two-pointer problem" or "that's basically Monty Hall."

Mathematical reasoning (its strongest domain): GSM8K 77.8% vs 76.7% few-shot CoT and ~75% zero-shot CoT; MATH 37.3% vs 34.9% few-shot and ~32% zero-shot (+2.4 points over few-shot, ~5 over zero-shot). MATH gains are biggest because its problems span algebra, geometry, probability, combinatorics, and number theory — breadth that makes any fixed example set structurally mismatched. PaLM 2-L hit 81.7% on GSM8K.
Code generation: on Codeforces Level-A problems (2023, to avoid contamination), Acc@1 of 15% vs 11% few-shot, Acc@10 of 29% vs 27% — plus cleaner, better-commented code inherited from self-generated analogues.
Logical reasoning: BIG-Bench Hard Word Sorting 75.2% vs 68.4% zero-shot CoT; Logical Deduction 41.6% vs 36.4%.
Commonsense and scientific QA: self-generated situational primes help; the knowledge variant shines on multi-step science by surfacing the relevant laws first.

Averaged across math, code, logic, and commonsense, the paper reports roughly +4%. A related approach, Thought Propagation (Yu et al., ICLR 2024), extended analogical reasoning to graph tasks and showed +12% on shortest-path optimality, +13% human preference on creative writing, and +15% on LLM-agent planning.

Domain applications: rare-disease diagnosis (recall structurally similar symptom profiles before a novel case), legal reasoning (precedent law is institutionalized analogy), scientific hypothesis generation and drug discovery (cross-domain transfer), and education (generate worked examples at rising difficulty). Less useful for plain extraction, binary classification, or pure generation — there the overhead rarely pays.

Where it doesn't help. BIG-Bench Temporal Sequences showed no gain (57.6% vs 58.0%). The paper's qualitative pass (50 correct, 50 incorrect) found 70% of correct answers had relevant exemplars — and that the dominant failure was exemplars easier than the target, leading the model to underestimate the difficulty.

When to use it (and when not)

Reach for it when:

The task is structured reasoning (math, code, logic) with recognizable solution patterns.
Hand-curating few-shot examples is impractical — many tasks, fast prototyping.
Zero-shot CoT half-works but makes systematic errors on specific subtypes.
The problem space is broad enough that one fixed example set can't cover it.
The model is GPT-4-class or comparable.

Skip it when:

The model lacks the domain knowledge to generate accurate exemplars (misinformation cascade risk).
The answer needs specific facts — names, dates, statistics — that self-generated exemplars might confabulate.
The task is simple enough that direct answering suffices.
Latency or cost is tight: the technique roughly doubles or triples output tokens.

Cost lives in the output. Self-generating K exemplars with solutions adds 800–2000 tokens for K=3, pushing total output to 2–3× zero-shot CoT. At GPT-4o output pricing (~$15 per 1M tokens), an extra ~1,500 tokens is about $0.023 per call. Negligible once, material at scale. Latency rises with it — 5–15s extra on math, 20–30s for K=5.
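The per-call figure is simple arithmetic, and it is worth re-running with your own price and volume. A minimal sketch, assuming the ~$15 per 1M output tokens quoted above; `extra_cost_usd` is a hypothetical helper, not part of any SDK.

```python
def extra_cost_usd(extra_output_tokens: int, price_per_million_usd: float,
                   calls: int = 1) -> float:
    """Added spend from the exemplar tokens alone (output side only)."""
    return extra_output_tokens / 1_000_000 * price_per_million_usd * calls

per_call = extra_cost_usd(1_500, 15.0)
per_million_calls = extra_cost_usd(1_500, 15.0, calls=1_000_000)

print(f"${per_call:.4f} per call, ${per_million_calls:,.0f} per million calls")
```

The same overhead that rounds to a couple of cents per call becomes a five-figure line item at a million calls.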

Model fit: below roughly 70B-equivalent capability, exemplar quality degrades to where the technique stops helping (and below ~7B it can hurt). GPT-4-class, Claude 3.5+, Gemini 1.5 Pro+ are the reliable zone. Unlike fixed few-shot, the technique improves as models improve — the model picks better examples on its own.

Escalate when: accuracy stays below threshold after a few prompt iterations (switch to retrieval-augmented few-shot from a curated set), exemplar solutions are consistently wrong (fine-tune on domain data), or your latency SLA can't absorb the extra tokens (cache pre-generated exemplars per subtype).

Technique	Human labeling	Problem-specific	Token cost	Best for
Analogical	Zero	High (per problem)	High (2–3×)	Complex reasoning, diverse domains
Few-shot CoT	High	Low (fixed)	Medium	Narrow, well-defined tasks
Zero-shot CoT	Zero	None	Minimal	Simple–medium reasoning
Auto-CoT	Medium (corpus)	Medium (clustered)	Medium	Automated few-shot selection

Structure and components

A complete invocation has four required parts and a few high-impact options.

Required: (1) task/domain framing, (2) a self-recall instruction specifying K, distinctness, and "include solutions," (3) an implicit exemplar format, and (4) the target problem placed after the recall instruction.

Optional but high-impact: a knowledge-generation step before exemplars (wins on hard tasks); structural delimiters (# headers) for reliable extraction; explicit diversity language ("distinct," "different subtypes") to prevent exemplar collapse; and a verification instruction to catch bad exemplar solutions.

The design rests on cognitive principles: schema abstraction (multiple examples reveal the general schema), elaborative encoding (generating a solution is a stronger memory event than reading one), and graceful degradation — if the model lacks knowledge, it falls back toward zero-shot CoT rather than failing outright.

Patterns

Standard pattern (recommended, K=3, best balance for math and reasoning):

[Problem statement]

Before solving, recall 3 related problems with distinct solution approaches
and give their complete step-by-step solutions. Then solve the original.

# Related problems:
## Problem 1: [model generates]
## Solution 1: [model generates]
## Problem 2: [distinct problem]
## Solution 2: [solution]
## Problem 3: [distinct problem]
## Solution 3: [solution]

# Now solving the original problem:
[model solves]
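The standard pattern can be assembled and parsed with a few lines of code. This is a sketch, not a reference implementation: `build_standard_prompt` and `split_sections` are hypothetical names, and the parser assumes the model reproduces the headers from the pattern above exactly.

```python
import re

FINAL_HEADER = "# Now solving the original problem:"

def build_standard_prompt(problem: str, k: int = 3) -> str:
    """Problem first, then the recall instruction and the required headers."""
    return (
        f"{problem}\n\n"
        f"Before solving, recall {k} related problems with distinct solution approaches\n"
        "and give their complete step-by-step solutions. Then solve the original.\n\n"
        "Use exactly these headers:\n"
        "# Related problems:\n"
        "## Problem i: and ## Solution i: (one pair per related problem)\n"
        f"{FINAL_HEADER}"
    )

def split_sections(completion: str) -> tuple[list[tuple[str, str]], str]:
    """Return ([(problem, solution), ...], final_solution) from a completion."""
    head, _, final = completion.partition(FINAL_HEADER)
    pairs = re.findall(
        r"## Problem \d+:(.*?)## Solution \d+:(.*?)(?=## Problem \d+:|\Z)",
        head,
        flags=re.DOTALL,
    )
    return [(p.strip(), s.strip()) for p, s in pairs], final.strip()

# A hand-written completion standing in for real model output.
sample = (
    "# Related problems:\n"
    "## Problem 1: Two aces from a deck?\n## Solution 1: (4/52)(3/51)\n"
    "## Problem 2: Two hearts from a deck?\n## Solution 2: (13/52)(12/51)\n"
    f"{FINAL_HEADER}\n(3/8)(2/7) = 3/28\n#### 3/28"
)
exemplars, final = split_sections(sample)
print(len(exemplars), "exemplars; final section ends with:", final.splitlines()[-1])
```

Keeping the exemplars separate from the final section is what makes it possible to grade them on their own later.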

Knowledge + Exemplars (best for complex, multi-concept problems — hard MATH, advanced code):

[Problem statement]

Step 1: Identify the core concepts needed. Write a brief tutorial.
Step 2: Recall 3 distinct related problems that use these concepts; solve them.
Step 3: Using the concepts and related problems, solve the original.
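As a template, the knowledge variant is just the problem followed by the three steps in a fixed order. A minimal sketch with a hypothetical `build_knowledge_prompt` helper; the wording of the steps is copied from the pattern above.

```python
def build_knowledge_prompt(problem: str, k: int = 3) -> str:
    """Problem first, then the three steps, with knowledge ahead of exemplars."""
    if k < 1:
        raise ValueError("k must be at least 1")
    steps = [
        "Step 1: Identify the core concepts needed. Write a brief tutorial.",
        f"Step 2: Recall {k} distinct related problems that use these concepts; solve them.",
        "Step 3: Using the concepts and related problems, solve the original.",
    ]
    return f"{problem.strip()}\n\n" + "\n".join(steps)

print(build_knowledge_prompt("How many 4-digit numbers have digits that sum to 9?", k=5))
```

Building the steps from a list keeps the order fixed, so the tutorial always precedes the exemplars.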

A minimal pattern (K=2, no diversity machinery) suits tight token budgets, and self-consistency + analogical — sample the full prompt N times at temperature above 0 and majority-vote — buys maximum accuracy on high-stakes problems.

Scenario tweaks: for ambiguous problems, add an explicit "first identify the problem type" step. For format-critical tasks, include one exemplar in the exact target format and the model inherits it. For domain tasks, frame the system prompt as an expert in that sub-domain.

One platform example

One call plus an answer extractor, shown with the OpenAI Python SDK; the structure ports to any provider:

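import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analogical_prompt(problem: str, k: int = 3, domain: str = "math",
                      temperature: float = 0.0) -> str:
    """Ask the model to recall k related problems, solve them, then solve `problem`."""
    prompt = f"""Problem: {problem}

Recall {k} distinct related {domain} problems covering different subtypes,
and give their complete step-by-step solutions. Then solve the original,
using insights from them. End with: #### <final answer>"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 0 for math/logic; raise it for self-consistency
        max_tokens=3000,          # room for K=3 exemplars + solution
    )
    return resp.choices[0].message.content or ""  # content can be None

def extract_answer(text: str) -> str:
    """Return the final '#### <answer>', else the last non-empty line."""
    # Take the LAST marker: recalled exemplars may carry their own '####' lines.
    matches = re.findall(r"####\s*(.+)", text)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""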

Configuration

Setting	Guidance
Temperature	0 for math/logic; 0.1–0.3 scientific; 0.2–0.4 code; 0.5–0.7 creative/commonsense; 0.7 for self-consistency
max_tokens	~1000–1500 (K=2), 2000–3000 (K=3), 4000–6000 (K=5 + knowledge)
K (exemplars)	K=3 is the sweet spot; K=5 for hard MATH/code; K=2 minimum; above 5 is diminishing returns
Knowledge variant	Use when the domain has terminology the model won't invoke spontaneously
Top-p	Default 1.0 at temperature 0; 0.9 for higher-temperature creative tasks

Variant selection

Scenario	Variant	K
Simple math (GSM8K)	Standard	3
Hard math (MATH)	Knowledge + Exemplars	5
Code (easy–medium)	Standard	3
Code (competitive)	Knowledge + Exemplars	5
Commonsense	Minimal	2–3
Domain (medical, legal)	Knowledge + Exemplars	3–5
Max accuracy, no cost limit	Knowledge + Exemplars + Self-Consistency	5
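The table translates directly into a lookup. The sketch below is one way to encode it; the scenario keys, the `Recipe` class, and the fallback to standard K=3 are assumptions for illustration, and where the table gives a range (2–3, 3–5) the lower end is used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Recipe:
    variant: str
    k: int
    self_consistency: bool = False

# Keys are hypothetical labels for the table rows above.
RECIPES = {
    "simple_math": Recipe("standard", 3),
    "hard_math": Recipe("knowledge+exemplars", 5),
    "code_easy": Recipe("standard", 3),
    "code_competitive": Recipe("knowledge+exemplars", 5),
    "commonsense": Recipe("minimal", 2),            # table says 2-3
    "domain": Recipe("knowledge+exemplars", 3),     # table says 3-5
    "max_accuracy": Recipe("knowledge+exemplars", 5, self_consistency=True),
}

def pick_recipe(scenario: str) -> Recipe:
    """Unknown scenarios fall back to the standard pattern (an assumption)."""
    return RECIPES.get(scenario, RECIPES["simple_math"])

print(pick_recipe("hard_math"))
print(pick_recipe("something_else"))
```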

Implementation workflow

Domain analysis (~30 min): identify subtypes and what a correct solution format looks like.
Prompt construction (1–2 hr): start with the standard pattern; manually inspect 3–5 outputs — are exemplars relevant, are their solutions correct, does the final answer actually use them?
Variant exploration (~1 hr): if exemplars are relevant but answers still wrong, try Knowledge + Exemplars; if exemplars are the wrong subtype, add explicit subtype identification.
K tuning (~30 min): test K=2, 3, 5 on an eval set; K=3 usually wins.
Delimiters and extraction (~30 min): add headers, test the regex parser against edge cases.
Evaluation (1–2 hr): compare against zero-shot CoT and fixed few-shot CoT. If you don't beat zero-shot CoT, the problem is exemplar quality or model capability.
Production hardening (2–4 hr): malformed-output handling, retries, optional self-consistency for high-stakes calls.
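The evaluation step comes down to scoring every strategy on the same problems. The sketch below shows that wiring with exact match. The dataset and both solvers are toy stubs standing in for real model calls, so the printed scores show only how the harness behaves and say nothing about the technique.

```python
from typing import Callable

Solver = Callable[[str], str]  # problem text -> extracted final answer

def exact_match_accuracy(solver: Solver, dataset: list[tuple[str, str]]) -> float:
    """Fraction of problems where the solver's answer equals the gold answer."""
    if not dataset:
        raise ValueError("dataset is empty")
    correct = sum(solver(problem).strip() == gold.strip() for problem, gold in dataset)
    return correct / len(dataset)

def compare(solvers: dict[str, Solver], dataset: list[tuple[str, str]]) -> dict[str, float]:
    """Score every strategy on the same problems so the numbers are comparable."""
    return {name: exact_match_accuracy(fn, dataset) for name, fn in solvers.items()}

# Toy data and stub solvers, only to show the wiring.
DATASET = [("2 + 2", "4"), ("3 * 3", "9"), ("10 - 7", "3"), ("8 / 2", "4")]
GOLD = dict(DATASET)

def stub_zero_shot(problem: str) -> str:
    return "4"            # stub: always answers 4

def stub_analogical(problem: str) -> str:
    return GOLD[problem]  # stub: looks up the gold answer

scores = compare({"zero_shot_cot": stub_zero_shot, "analogical": stub_analogical}, DATASET)
print(scores)
```

In a real run each solver would wrap a model call plus answer extraction, and a fixed few-shot CoT solver would be added as the third baseline.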

Do and don't

Do: always specify K explicitly; always require diversity ("distinct" / "different subtypes"); test exemplar quality separately before measuring end-to-end accuracy; use the knowledge variant for multi-concept tasks; apply self-consistency (N=5) for high-stakes answers; log exemplar quality in production as an early warning.

Don't: use it for simple factual lookups; rely on it for real-time or proprietary data; push K above 5 without evidence; treat self-generated exemplars as ground truth (the most common production error); reorder knowledge after exemplars (knowledge-first wins); or run it on sub-GPT-3.5-class models without heavy validation.

Debugging

Symptom	Root cause	Fix
Exemplars are the wrong type	Recall instruction too vague	Add a subtype-identification step before recall
Exemplars right, answer wrong	Model isn't bridging	Add "using the patterns above, solve…"; or use the knowledge variant
Exemplar solutions contain errors	Plausible-but-wrong generation	Add a verification step; drop K to 2; add a secondary validation call
Model skips exemplars, answers directly	Instruction too weak	"You must recall and solve related problems BEFORE the original"
Merged or missing sections	Format conflation	Mandate exact section headers
Hallucinations despite good exemplars	Drift mid-solution	Self-consistency (N=5); if systematic, the model lacks the knowledge
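The structural symptoms in the table (skipped exemplars, merged or missing sections) can be caught mechanically before anyone reads the output. A rough sketch, assuming the `## Problem i:` / `## Solution i:` headers and the `####` answer marker used elsewhere in this guide; `diagnose` and its symptom strings are hypothetical.

```python
import re

FINAL_HEADER = "# Now solving the original problem:"

def diagnose(completion: str, k: int = 3) -> list[str]:
    """Return structural symptoms found in a completion (empty list = looks fine)."""
    symptoms = []
    problems = re.findall(r"## Problem \d+:", completion)
    solutions = re.findall(r"## Solution \d+:", completion)
    final_at = completion.find(FINAL_HEADER)

    if not problems:
        symptoms.append("skipped exemplars")
    elif len(problems) < k or len(solutions) < len(problems):
        symptoms.append("merged or missing sections")

    if final_at == -1:
        symptoms.append("no final-solution header")
    elif problems and completion.find("## Problem") > final_at:
        symptoms.append("answered before recalling")

    if not re.search(r"####\s*\S", completion):
        symptoms.append("no answer marker")
    return symptoms

good = (
    "# Related problems:\n"
    "## Problem 1: a\n## Solution 1: b\n"
    "## Problem 2: c\n## Solution 2: d\n"
    "## Problem 3: e\n## Solution 3: f\n"
    f"{FINAL_HEADER}\nwork\n#### 42"
)
direct = "The answer is 42.\n#### 42"

print(diagnose(good))
print(diagnose(direct))
```

A non-empty result is a reasonable trigger for a retry with a stricter instruction. It says nothing about whether the exemplar content is right, which still needs grading.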

Testing and how to prove it

Build a three-tier set: happy path (60%, where zero-shot CoT half-works), edge cases (25%, subtype boundaries and unusual structures), and adversarial (15%, misleading surface features). Score with task-appropriate metrics — exact match for math, Acc@1/Acc@10 for code, label match for logic, LLM-as-judge for open-ended.

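Self-consistency wraps the single call in a sampling loop:

# Self-consistency: sample N analogical runs, majority-vote the answer
from collections import Counter

def analogical_self_consistency(problem, solve, k=3, n=5):
    # solve(problem, k) returns one completion and must sample at
    # temperature > 0 (about 0.7), e.g. a variant of analogical_prompt with
    # the temperature raised. At temperature 0 the n runs come back
    # near-identical and the vote adds nothing.
    answers = [extract_answer(solve(problem, k)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

The technique also adds intermediate outputs you can grade directly.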

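Three technique-specific metrics matter: exemplar relevance rate (aim above 70%; below 60% is too low to help), exemplar accuracy rate (aim above 80%; below 75% means errors are propagating), and solution-exemplar coherence (embedding similarity; low coherence with correct answers means the model succeeded without the exemplars, so reconsider using the technique there). For sizing an A/B test: under the usual two-proportion approximation (two-sided α = 0.05, 80% power), detecting a 3-point gain around 77% accuracy needs roughly 2,900 problems per arm when the arms use different problems. Running both prompts on the same problems and using a paired test can need far fewer, depending on how often the two prompts disagree.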
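Both the exemplar metrics and the A/B sizing are quick to compute. The sketch below is illustrative: the helper names are hypothetical, the warning thresholds are the ones quoted above, and the sample-size function uses the textbook normal approximation for an unpaired two-proportion test.

```python
from math import ceil
from statistics import NormalDist

def exemplar_rates(grades: list[tuple[bool, bool]]) -> dict[str, float]:
    """grades holds one (relevant, solution_correct) pair per generated exemplar."""
    if not grades:
        raise ValueError("no graded exemplars")
    n = len(grades)
    return {
        "relevance": sum(relevant for relevant, _ in grades) / n,
        "accuracy": sum(correct for _, correct in grades) / n,
    }

def quality_flags(rates: dict[str, float]) -> list[str]:
    """Apply the warning thresholds from the text (60% relevance, 75% accuracy)."""
    flags = []
    if rates["relevance"] < 0.60:
        flags.append("relevance too low to help")
    if rates["accuracy"] < 0.75:
        flags.append("exemplar errors likely propagating")
    return flags

def problems_per_arm(p_base: float, p_new: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Unpaired two-proportion sample size, two-sided, normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil(z**2 * variance / (p_new - p_base) ** 2)

rates = exemplar_rates([(True, True), (True, False), (False, True), (True, True)])
print(rates, quality_flags(rates))
print(problems_per_arm(0.77, 0.80))
```

For 77% versus 80% this unpaired approximation lands near 2,900 problems per arm. A paired design on shared problems can need fewer, because only the problems where the two prompts disagree carry information.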

Token optimization without losing much: a "concise but complete" instruction cuts 20–40% of tokens with under 1 point accuracy loss on simple tasks; dropping K from 5 to 3 saves 30–40% at 0.5–2 points; bulleting the knowledge phase saves 20–30%.

Limitations

Parametric knowledge ceiling. Exemplars are only as good as training coverage. For niche subspecialties or post-cutoff developments, the model generates wrong, generic, or off-domain examples. No prompt engineering fixes absent knowledge.
Error propagation. A wrong exemplar solution actively misleads the target. The dominant failure mode is exemplars easier than the target, causing under-estimation of difficulty.
Token cost. 2–3× output tokens is structural, not a tuning problem — generating examples before answering always costs more.
Model-size dependency. Below ~70B-equivalent, exemplar quality collapses; smaller models do better with retrieval-based fixed few-shot.
Temporal boundary. Self-generated exemplars can't know recent events — pair with retrieval for anything current.

Inefficient for: simple factual lookup, rule-based binary classification, summarizing provided text, or any problem whose statement already contains all the context. The knowledge-only fallback (concept summary, no exemplars) still retains ~50–60% of the benefit at near zero-shot token cost when budgets are tight.

Advanced techniques

The highest-leverage refinement is a meta-cognitive bridging step between exemplars and solution — have the model state what pattern the exemplars share, how it maps to the target, and what the target needs that the analogues don't. This forces explicit structural mapping and cuts superficial pattern-matching. For rigid outputs, embed the schema in the exemplar instruction and the model inherits the format. For uncertainty, ask it to rate confidence and flag where the analogies may not apply — a clean signal for routing to human review.

For very hard problems, chaining applies analogical prompting per sub-phase (parsing, then core reasoning, then verification), passing summarized state — not full exemplar lists — between stages. Iterative re-analogization targets a specific failure: "the previous solution erred at step X; recall 3 problems focused on that sub-problem, then revise." And the most cost-effective production pattern is caching: pre-generate high-quality exemplar blocks for your common subtypes offline, classify the incoming problem, and inject the matching block — exemplar quality at near zero-shot cost, with live generation as the fallback for unseen subtypes.
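The caching pattern can be sketched in a few lines. Everything here is a stand-in: the cached blocks would be generated and verified offline, and the keyword `classify` function is a toy placeholder for whatever classifier you actually use.

```python
from typing import Optional

# Hypothetical pre-generated, pre-verified exemplar blocks keyed by subtype.
EXEMPLAR_CACHE: dict[str, str] = {
    "probability": (
        "## Problem: What is P(drawing 2 aces from a 52-card deck)?\n"
        "## Solution: (4/52)(3/51) = 1/221, because the deck shrinks after the first card."
    ),
    "rate": (
        "## Problem: Two trains 300 km apart approach at 60 and 90 km/h. When do they meet?\n"
        "## Solution: Closing speed is 150 km/h, so 300/150 = 2 hours."
    ),
}

def classify(problem: str) -> Optional[str]:
    """Toy keyword classifier; a real system might use a small model here."""
    text = problem.lower()
    if any(word in text for word in ("probability", "marble", "dice")):
        return "probability"
    if any(word in text for word in ("km/h", "mph", "speed")):
        return "rate"
    return None

def build_prompt(problem: str, k: int = 3) -> tuple[str, bool]:
    """Return (prompt, cache_hit); unseen subtypes fall back to live recall."""
    subtype = classify(problem)
    block = EXEMPLAR_CACHE.get(subtype) if subtype else None
    if block is not None:
        prompt = f"# Related problems:\n{block}\n\n# Now solve the original problem:\n{problem}"
        return prompt, True
    prompt = (
        f"{problem}\n\nBefore solving, recall {k} related problems with distinct "
        "solution approaches and give their complete step-by-step solutions. "
        "Then solve the original."
    )
    return prompt, False

marble = "A bag has 3 red and 5 blue marbles. You draw 2 without replacement. P(both red)?"
print(build_prompt(marble)[1])                           # True: cached block injected
print(build_prompt("Prove sqrt(2) is irrational.")[1])   # False: live recall
```

On a cache hit the model generates only the final solution, which is where the near zero-shot output cost comes from.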

Risk and ethics

The technique has an unusual transparency upside: the exemplar phase exposes the model's reasoning context, so biases that a direct answer would hide become visible. Showing users "here are the cases I'm drawing from" is a real trust mechanism in medicine, law, and education.

Confident wrong exemplars are the top risk. A fluent, coherent, but subtly incorrect exemplar — wrong formula, bad precedent, logical fallacy — propagates its error into a final answer whose reasoning looks sound. That's harder to catch than a bluntly wrong answer. Mitigate with an exemplar-verification step, self-consistency (divergent answers flag instability), and a secondary validation call for high-stakes use.
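For math exemplars, one cheap form of exemplar verification is to recompute any plain arithmetic the model wrote down. The sketch below is deliberately narrow and is an assumption about how you might do this, not a method from the paper. It checks only explicit chains such as `3/8 * 2/7 = 6/56 = 3/28`, evaluates them with a whitelisted AST walker instead of `eval`, and silently skips anything it cannot parse, including implicit multiplication like `(3/8)(2/7)`.

```python
import ast
import operator
import re
from fractions import Fraction

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

# A run of digits/operators containing at least one '=' on a single line.
_CHAIN = re.compile(r"[\d(][\d \t.+\-*/()]*=[\d \t.+\-*/()=]*[\d)]")

def _eval(node: ast.AST) -> Fraction:
    """Evaluate a tiny arithmetic subset exactly; reject everything else."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")

def safe_eval(expr: str) -> Fraction:
    return _eval(ast.parse(expr.strip(), mode="eval"))

def find_bad_equalities(solution: str) -> list[str]:
    """Return the arithmetic chains whose sides do not all agree."""
    bad = []
    for line in solution.splitlines():
        for match in _CHAIN.finditer(line):
            try:
                values = [safe_eval(part) for part in match.group(0).split("=")]
            except (ValueError, SyntaxError, ZeroDivisionError):
                continue  # not plain arithmetic; leave it to another checker
            if any(a != b for a, b in zip(values, values[1:])):
                bad.append(match.group(0).strip())
    return bad

print(find_bad_equalities("Then 3/8 * 2/7 = 6/56 = 3/28."))   # []
print(find_bad_equalities("So 3/8 * 3/8 = 9/60 of the time."))  # ['3/8 * 3/8 = 9/60']
```

A check like this catches slips in computation only. A fluent exemplar that applies the wrong formula correctly will still pass, so it complements self-consistency and a secondary validation call and does not replace them.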

Other failure modes: domain-inappropriate exemplars (mitigate with explicit domain framing and a subtype step), knowledge-cutoff staleness (pair with RAG, use date-aware prompts), and cascading errors in agentic chains (validate exemplar quality per step, add human checkpoints before irreversible actions). Because exemplars are model-generated, the technique resists adversarial example injection — but it doesn't validate that a recalled exemplar type is appropriate, so standard prompt-injection and jailbreak mitigations still apply, and you should content-filter exemplar text, not just final answers. Training-data bias amplifies through self-referential generation; counter it with a diversity instruction that spans contexts and populations, and test whether accuracy gains hold uniformly across subgroups — demographic skew in medical or legal exemplars is a concrete safety concern.

Ecosystem and integration

No major framework ships analogical prompting as a named module as of early 2026 — it's plain prompt engineering, so any template-capable framework (LangChain, DSPy, LlamaIndex, Haystack) supports it. LlamaIndex is the natural fit for the RAG-augmented variant; DSPy can optimize the recall instruction itself. For monitoring, log both exemplar outputs and final answers (LangSmith, Weights & Biases); standard benchmarks (BIG-Bench Hard, the EleutherAI harness, HELM) cover evaluation.

Technique	Relationship	Key difference
Few-shot CoT	Self-generating variant of it	Fixed human examples vs model-generated
Auto-CoT	Both automate examples	Auto-CoT retrieves from a corpus; analogical generates from parametric memory
Zero-shot CoT	Builds on it	Zero-shot elicits structure only; analogical adds worked examples
Self-Generated ICL	Near-identical	Analogical specifically emphasizes structural relatedness
Thought Propagation	Direct extension	TP propagates exemplar solutions across a problem graph
Step-Back / Generate-Knowledge	Shared "generate context first" family	They generate principles/facts; analogical generates solved examples

Hybrids worth knowing: Analogical + RAG grounds exemplars in retrieved facts (cuts confabulation); Analogical + Step-Back pairs an abstract principle with concrete instantiations; Analogical + Self-Consistency (in the original paper) majority-votes diverse runs for +3–8 points; Analogical + Chain-of-Verification checks each exemplar before use.

Transitions: to migrate from few-shot, run both in parallel on a validation set, and if analogical matches or beats it, swap the fixed examples for the recall instruction (keep the best human examples as a low-quality-exemplar fallback). When analogical plateaus, add self-consistency, move to Thought Propagation for graph-structured tasks, or fine-tune if the model's domain exemplars are consistently wrong. In production, version the recall instruction separately (it's the most model-sensitive component) and keep a zero-shot CoT rollback behind a prompt flag.

Future directions

Adaptive K: the model estimates difficulty and generates more exemplars only when needed — early work suggests 20–30% token savings at equal accuracy.
First-class exemplar verification: checking self-generated solutions against constraints (math consistency, code execution) before use, killing the worst failure mode.
Cross-model exemplar injection: a strong model generates exemplars, a cheaper model solves — letting small models punch above their weight on structured reasoning.
Multimodal analogical prompting: "recall similar images with their analysis, then analyze this one" — explored at the Analogy-Angle II workshop (ACL 2025).
Self-growing exemplar libraries: verify solved problems and add them back, evolving from pure self-generation toward a hybrid self-generation/retrieval store.

Open questions remain on which dimensions of structural similarity actually drive transfer, the attention-level mechanism behind exemplar influence, whether exemplar quality can be predicted before solving, and how the technique degrades under distribution shift and across languages. Derived work is already here: DEFINE (ACL 2025 Findings) extends it to narrative decision-making, and Thought Propagation to graph reasoning.

The result that anchors the technique. Yasunaga et al. (2023) showed a model generating its own examples beat human-written few-shot CoT on the hardest tasks — MATH 37.3% vs 34.9%, GSM8K 77.8% vs 76.7%, Codeforces Acc@1 15% vs 11% — with zero labeling cost. The lesson: a model's own problem-specific examples can out-teach a human's fixed ones.

Summary

Analogical prompting makes the model recall and solve K related problems first, then use them as self-made few-shot examples for the target.
It beats human few-shot CoT on hard, diverse tasks at zero labeling cost: MATH 37.3% vs 34.9%, GSM8K 77.8% vs 76.7%, Codeforces Acc@1 15% vs 11%.
The gain comes mostly from knowledge activation (~40%) and solution-template conditioning (~35%); the central risk is confident-but-wrong exemplars propagating errors.
Use the Knowledge + Exemplars variant and K=5 for hard multi-concept problems; standard K=3 otherwise; layer self-consistency for high stakes.
It needs a GPT-4-class model, costs 2–3× output tokens, and can't help when the knowledge isn't in the model — pair with retrieval for current or proprietary facts.
Verify exemplar quality separately, expose exemplars for transparency, and keep a zero-shot CoT fallback ready.

Sources:

Explore Unread

Great job! You've read all available articles

Analogical prompting: a complete guide

See it work

Hand a model a probability problem cold and watch it grab the wrong formula. Make it recall a related problem first, and it walks into the answer.

Problem: A bag has 3 red and 5 blue marbles. You draw 2 without
replacement. P(both red)?

Cold:   "Each draw is 3/8, so 3/8 × 3/8 = 9/64."
        ✗ treated the draws as independent — forgot the marble doesn't go back.

Analogical: "Recall a related problem first."
  Recalls: "P(drawing 2 aces from a deck)" → (4/52)(3/51), because the
  deck shrinks after the first card.
  Then the original: (3/8)(2/7) = 6/56 = 3/28.   ✓

The mental model

Few-shot hands the model examples. Analogical prompting makes the model fetch its own — picked for this exact problem.

How it works

It's a single API call, but the model's completion moves through distinct phases.

Problem encoding. The model reads the target and forms a representation of its domain and structure. This gates what it recalls next.
Exemplar self-generation. Prompted to "recall K distinct related problems and their solutions," it writes K worked examples. Generating each solution — not just the problem statement — is what activates the procedural knowledge.
Knowledge generation (optional). A stronger variant first writes a short "core concepts" tutorial, then the exemplars. Abstract knowledge to concrete examples to solution mirrors how good teaching is sequenced, and it wins on harder tasks.
Solution. With the exemplars now in context, the model solves the original, inheriting their approach, format, and depth.
Answer extraction. A marker (#### answer for math, a code block for code) lets you pull the final answer with a regex.

Why it works

Four mechanisms, roughly ranked by how much of the effect they carry:

Factor	Share	What it does
Domain knowledge activation	~40%	Writing an exemplar retrieves and "warms up" related knowledge before the real problem.
Solution template conditioning	~35%	Exemplar solutions show the right depth, format, and step decomposition for the domain.
Structural alignment	~20%	Same-structure exemplars apply the same procedures, cutting logical errors.
Diversity coverage	~5%	Distinct exemplars span subtypes, helping when the target sits at a boundary.

Where it shines

Analogical prompting works best on multi-step reasoning with recognizable patterns — the kind where an expert would say "ah, this is a two-pointer problem" or "that's basically Monty Hall."

Mathematical reasoning (its strongest domain): GSM8K 77.8% vs 76.7% few-shot CoT and ~75% zero-shot CoT; MATH 37.3% vs 34.9% few-shot and ~32% zero-shot (+2.4 points over few-shot, ~5 over zero-shot). MATH gains are biggest because its problems span algebra, geometry, probability, combinatorics, and number theory — breadth that makes any fixed example set structurally mismatched. PaLM 2-L hit 81.7% on GSM8K.
Code generation: on Codeforces Level-A problems (2023, to avoid contamination), Acc@1 of 15% vs 11% few-shot, Acc@10 of 29% vs 27% — plus cleaner, better-commented code inherited from self-generated analogues.
Logical reasoning: BIG-Bench Hard Word Sorting 75.2% vs 68.4% zero-shot CoT; Logical Deduction 41.6% vs 36.4%.
Commonsense and scientific QA: self-generated situational primes help; the knowledge variant shines on multi-step science by surfacing the relevant laws first.

When to use it (and when not)

Reach for it when:

The task is structured reasoning (math, code, logic) with recognizable solution patterns.
Hand-curating few-shot examples is impractical — many tasks, fast prototyping.
Zero-shot CoT half-works but makes systematic errors on specific subtypes.
The problem space is broad enough that one fixed example set can't cover it.
The model is GPT-4-class or comparable.

Skip it when:

The model lacks the domain knowledge to generate accurate exemplars (misinformation cascade risk).
The answer needs specific facts — names, dates, statistics — that self-generated exemplars might confabulate.
The task is simple enough that direct answering suffices.
Latency or cost is tight: the technique roughly doubles to triples output tokens.

Technique	Human labeling	Problem-specific	Token cost	Best for
Analogical	Zero	High (per problem)	High (2–3×)	Complex reasoning, diverse domains
Few-shot CoT	High	Low (fixed)	Medium	Narrow, well-defined tasks
Zero-shot CoT	Zero	None	Minimal	Simple–medium reasoning
Auto-CoT	Medium (corpus)	Medium (clustered)	Medium	Automated few-shot selection

Structure and components

A complete invocation has four required parts and a few high-impact options.

Patterns

Standard pattern (recommended, K=3, best balance for math and reasoning):

[Problem statement]

Before solving, recall 3 related problems with distinct solution approaches
and give their complete step-by-step solutions. Then solve the original.

# Related problems:
## Problem 1: [model generates] ## Solution 1: [model generates]
## Problem 2: [distinct problem]  ## Solution 2: [solution]
## Problem 3: [distinct problem]  ## Solution 3: [solution]

# Now solving the original problem:
[model solves]

Knowledge + Exemplars (best for complex, multi-concept problems — hard MATH, advanced code):

[Problem statement]

Step 1: Identify the core concepts needed. Write a brief tutorial.
Step 2: Recall 3 distinct related problems that use these concepts; solve them.
Step 3: Using the concepts and related problems, solve the original.

One platform example

Consolidated to a single concise function — the structure ports to any provider:

from openai import OpenAI
import re

client = OpenAI()

def analogical_prompt(problem: str, k: int = 3, domain: str = "math") -> str:
    prompt = f"""Problem: {problem}

Recall {k} distinct related {domain} problems covering different subtypes,
and give their complete step-by-step solutions. Then solve the original,
using insights from them. End with: #### <final answer>"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic for math/logic
        max_tokens=3000,      # room for K=3 exemplars + solution
    )
    return resp.choices[0].message.content

def extract_answer(text: str) -> str:
    m = re.search(r"####\s*(.+)", text)
    return m.group(1).strip() if m else text.splitlines()[-1].strip()

Configuration

Setting	Guidance
Temperature	0 for math/logic; 0.1–0.3 scientific; 0.2–0.4 code; 0.5–0.7 creative/commonsense; 0.7 for self-consistency
max_tokens	~1000–1500 (K=2), 2000–3000 (K=3), 4000–6000 (K=5 + knowledge)
K (exemplars)	K=3 is the sweet spot; K=5 for hard MATH/code; K=2 minimum; above 5 is diminishing returns
Knowledge variant	Use when the domain has terminology the model won't invoke spontaneously
Top-p	Default 1.0 at temperature 0; 0.9 for higher-temperature creative tasks
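
The table can be encoded as a small lookup so every call site uses the same settings. This is a sketch under stated assumptions: the names `SamplingConfig` and `sampling_config` are hypothetical, the temperatures are midpoints of the ranges above, the token budgets are the upper ends, and top-p falls back to 1.0 wherever the table is silent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    top_p: float
    max_tokens: int


# Midpoints of the temperature ranges in the table (an assumption; tune per task).
TEMPERATURE = {"math": 0.0, "scientific": 0.2, "code": 0.3, "creative": 0.6}

# Upper ends of the max_tokens ranges in the table, keyed by K.
MAX_TOKENS = {2: 1500, 3: 3000, 5: 6000}


def sampling_config(task: str, k: int = 3, self_consistency: bool = False) -> SamplingConfig:
    if task not in TEMPERATURE:
        raise ValueError(f"unknown task type: {task!r}")
    if k not in MAX_TOKENS:
        raise ValueError("the table only covers K = 2, 3, 5")
    # Self-consistency needs diverse samples, so it overrides the task default.
    temperature = 0.7 if self_consistency else TEMPERATURE[task]
    # Table: 0.9 for creative tasks; 1.0 elsewhere is this sketch's default.
    top_p = 0.9 if task == "creative" else 1.0
    return SamplingConfig(temperature, top_p, MAX_TOKENS[k])
```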

Variant selection

Scenario	Variant	K
Simple math (GSM8K)	Standard	3
Hard math (MATH)	Knowledge + Exemplars	5
Code (easy–medium)	Standard	3
Code (competitive)	Knowledge + Exemplars	5
Commonsense	Minimal	2–3
Domain (medical, legal)	Knowledge + Exemplars	3–5
Max accuracy, no cost limit	Knowledge + Exemplars + Self-Consistency	5
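
The same table works as a lookup when the scenario is known at call time. A minimal sketch: the scenario keys and the `select_variant` helper are hypothetical names, and K is stored as a (low, high) range so the caller decides whether to favor cost or accuracy.

```python
# Rows of the variant-selection table; K is a (low, high) range.
VARIANTS = {
    "simple_math": ("Standard", (3, 3)),
    "hard_math": ("Knowledge + Exemplars", (5, 5)),
    "code_easy_medium": ("Standard", (3, 3)),
    "code_competitive": ("Knowledge + Exemplars", (5, 5)),
    "commonsense": ("Minimal", (2, 3)),
    "domain": ("Knowledge + Exemplars", (3, 5)),
    "max_accuracy": ("Knowledge + Exemplars + Self-Consistency", (5, 5)),
}


def select_variant(scenario: str, prefer_cheaper: bool = True) -> tuple[str, int]:
    """Return (variant, K); pick the low end of the K range when cost matters."""
    variant, (low, high) = VARIANTS[scenario]
    return variant, (low if prefer_cheaper else high)
```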

Implementation workflow

Domain analysis (~30 min): identify subtypes and what a correct solution format looks like.
Prompt construction (1–2 hr): start with the standard pattern; manually inspect 3–5 outputs — are exemplars relevant, are their solutions correct, does the final answer actually use them?
Variant exploration (~1 hr): if exemplars are relevant but answers still wrong, try Knowledge + Exemplars; if exemplars are the wrong subtype, add explicit subtype identification.
K tuning (~30 min): test K=2, 3, 5 on an eval set; K=3 usually wins.
Delimiters and extraction (~30 min): add headers, test the regex parser against edge cases.
Evaluation (1–2 hr): compare against zero-shot CoT and fixed few-shot CoT. If you don't beat zero-shot CoT, the problem is exemplar quality or model capability.
Production hardening (2–4 hr): malformed-output handling, retries, optional self-consistency for high-stakes calls.
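
The K-tuning step is easy to script. The sketch below assumes a `solve(problem, k)` callable that returns the extracted final answer as a string (for example, the prompt function and parser from the platform example wired together); `tune_k` and `best_k` are hypothetical names, and exact string match stands in for whatever grader the task really needs.

```python
from typing import Callable, Iterable


def tune_k(
    solve: Callable[[str, int], str],
    eval_set: Iterable[tuple[str, str]],
    ks: tuple[int, ...] = (2, 3, 5),
) -> dict[int, float]:
    """Return accuracy per K on (problem, expected_answer) pairs."""
    examples = list(eval_set)
    if not examples:
        raise ValueError("eval_set is empty")
    scores = {}
    for k in ks:
        correct = sum(
            solve(problem, k).strip() == expected.strip()
            for problem, expected in examples
        )
        scores[k] = correct / len(examples)
    return scores


def best_k(scores: dict[int, float]) -> int:
    # Highest accuracy wins; ties go to the smaller K because it costs fewer tokens.
    return min(scores, key=lambda k: (-scores[k], k))
```

Breaking ties toward the smaller K reflects the cost side of the trade-off: if K=3 and K=5 score the same, the extra exemplars are paying for nothing.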

Do and don't

Debugging

Symptom	Root cause	Fix
Exemplars are the wrong type	Recall instruction too vague	Add a subtype-identification step before recall
Exemplars right, answer wrong	Model isn't bridging	Add "using the patterns above, solve…"; or use the knowledge variant
Exemplar solutions contain errors	Plausible-but-wrong generation	Add a verification step; drop K to 2; add a secondary validation call
Model skips exemplars, answers directly	Instruction too weak	"You must recall and solve related problems BEFORE the original"
Merged or missing sections	Format conflation	Mandate exact section headers
Hallucinations despite good exemplars	Drift mid-solution	Self-consistency (N=5); if systematic, the model lacks the knowledge
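
Two of these symptoms, skipped exemplars and merged or missing sections, can be caught mechanically before anyone reads the output. A minimal sketch, assuming the model was told to use the exact headers from the standard pattern; `check_structure` is a hypothetical helper and checks only layout, not whether the exemplars are relevant or correct.

```python
def check_structure(output: str, k: int = 3) -> list[str]:
    """Return a list of structural problems found in one model output."""
    issues = []
    final_header = "# Now solving the original problem:"
    if "# Related problems:" not in output:
        issues.append("missing '# Related problems:' header")
    if final_header not in output:
        issues.append("missing final-solution header")
    # Every exemplar needs both a problem header and a solution header.
    for i in range(1, k + 1):
        for part in ("Problem", "Solution"):
            if f"## {part} {i}:" not in output:
                issues.append(f"missing '## {part} {i}:'")
    # Exemplars must come before the final solution, not after it.
    first_exemplar = output.find("## Problem 1:")
    final_at = output.find(final_header)
    if 0 <= final_at < first_exemplar:
        issues.append("final solution appears before the exemplars")
    return issues
```

An empty list means the layout is intact; a non-empty one is a reasonable trigger for a retry with the stronger instruction from the table.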

Testing and how to prove it

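The technique adds intermediate outputs you can grade directly: the recalled exemplars and their solutions, not only the final answer. For the final answer itself, self-consistency gives a simple robustness check: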

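# Self-consistency: sample N analogical runs, majority-vote the answer.
# `sample(problem, k)` must call the model at temperature > 0 (0.7 per the
# configuration table); at temperature 0 the runs come back near-identical
# and the vote is meaningless.
from collections import Counter

def analogical_self_consistency(problem, sample, k=3, n=5):
    answers = [extract_answer(sample(problem, k)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]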
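
The vote itself can be tested offline, with no model calls, by feeding it canned answers. The sketch below also returns the winner's share of the votes; treating that share as a rough confidence signal is an assumption of this example, not something the guide measures, and `vote_with_agreement` is a hypothetical name.

```python
from collections import Counter


def vote_with_agreement(answers: list[str]) -> tuple[str, float]:
    """Majority-vote extracted answers; also return the winner's share of votes."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize lightly so "3/28" and " 3/28 " count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)
```

A low share, say a winner with two votes out of five, is a cue to inspect the exemplars from those runs rather than trust the majority.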

Limitations

Parametric knowledge ceiling. Exemplars are only as good as training coverage. For niche subspecialties or post-cutoff developments, the model generates wrong, generic, or off-domain examples. No prompt engineering fixes absent knowledge.
Error propagation. A wrong exemplar solution actively misleads the target. The dominant failure mode is exemplars easier than the target, causing under-estimation of difficulty.
Token cost. 2–3× output tokens is structural, not a tuning problem — generating examples before answering always costs more.
Model-size dependency. Below ~70B-equivalent, exemplar quality collapses; smaller models do better with retrieval-based fixed few-shot.
Temporal boundary. Self-generated exemplars can't know recent events — pair with retrieval for anything current.
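
The token-cost point is simple arithmetic, and it compounds with self-consistency. A rough sketch: `analogical_cost` is a hypothetical helper, the default multiplier of 2.5 is just the middle of the 2–3× range above, and the price is whatever your provider charges, not a real quote.

```python
def analogical_cost(
    baseline_output_tokens: int,
    price_per_1k_output: float,
    multiplier: float = 2.5,
    n_samples: int = 1,
) -> float:
    """Estimate the output-token cost of one analogical call (or N sampled calls)."""
    # Exemplars inflate the output; self-consistency multiplies it again by N.
    total_tokens = baseline_output_tokens * multiplier * n_samples
    return total_tokens * price_per_1k_output / 1000
```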

Advanced techniques

Risk and ethics

Ecosystem and integration

Technique	Relationship	Key difference
Few-shot CoT	Self-generating variant of it	Fixed human examples vs model-generated
Auto-CoT	Both automate examples	Auto-CoT clusters a pool of dataset questions and generates their chains with zero-shot CoT; analogical needs no question pool and generates exemplars from parametric memory
Zero-shot CoT	Builds on it	Zero-shot elicits structure only; analogical adds worked examples
Self-Generated ICL	Near-identical	Analogical specifically emphasizes structural relatedness
Thought Propagation	Direct extension	TP propagates exemplar solutions across a problem graph
Step-Back / Generate-Knowledge	Shared "generate context first" family	They generate principles/facts; analogical generates solved examples

Future directions

Adaptive K: the model estimates difficulty and generates more exemplars only when needed — early work suggests 20–30% token savings at equal accuracy.
First-class exemplar verification: checking self-generated solutions against constraints (math consistency, code execution) before use, killing the worst failure mode.
Cross-model exemplar injection: a strong model generates exemplars, a cheaper model solves — letting small models punch above their weight on structured reasoning.
Multimodal analogical prompting: "recall similar images with their analysis, then analyze this one" — explored at the Analogy-Angle II workshop (ACL 2025).
Self-growing exemplar libraries: verify solved problems and add them back, evolving from pure self-generation toward a hybrid self-generation/retrieval store.
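
Two of these directions, exemplar verification and cross-model injection, compose naturally and can be sketched today with plain callables. Everything here is illustrative: `generate_exemplars`, `verify`, and `solve` are hypothetical hooks you would back with a strong model, a checker (code execution, arithmetic re-check), and a cheaper model respectively.

```python
from typing import Callable


def solve_with_injected_exemplars(
    problem: str,
    generate_exemplars: Callable[[str, int], list[tuple[str, str]]],
    verify: Callable[[str, str], bool],
    solve: Callable[[str], str],
    k: int = 3,
) -> str:
    """Strong model writes exemplars, a checker filters them, a cheap model solves."""
    exemplars = generate_exemplars(problem, k)
    # Keep only exemplars whose solution passes the checker.
    kept = [(q, a) for q, a in exemplars if verify(q, a)]
    if not kept:
        # Nothing survived verification: fall back to the bare problem.
        return solve(problem)
    blocks = [
        f"## Problem {i}: {q}\n## Solution {i}: {a}"
        for i, (q, a) in enumerate(kept, start=1)
    ]
    prompt = "\n".join(
        ["# Related problems:", *blocks, "# Now solving the original problem:", problem]
    )
    return solve(prompt)
```

Dropping failed exemplars rather than repairing them keeps the sketch simple; a production version might regenerate until K verified exemplars are available.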

Summary

Analogical prompting makes the model recall and solve K related problems first, then use them as self-made few-shot examples for the target.
It beats human few-shot CoT on hard, diverse tasks at zero labeling cost: MATH 37.3% vs 34.9%, GSM8K 77.8% vs 76.7%, Codeforces Acc@1 15% vs 11%.
The gain comes mostly from knowledge activation (~40%) and solution-template conditioning (~35%); the central risk is confident-but-wrong exemplars propagating errors.
Use the Knowledge + Exemplars variant and K=5 for hard multi-concept problems; standard K=3 otherwise; layer self-consistency for high stakes.
It needs a GPT-4-class model, costs 2–3× output tokens, and can't help when the knowledge isn't in the model — pair with retrieval for current or proprietary facts.
Verify exemplar quality separately, expose exemplars for transparency, and keep a zero-shot CoT fallback ready.
