Consistency-based Self-Adaptive Prompting (COSP): A Complete Guide
Consistency-based Self-Adaptive Prompting (COSP) is a zero-shot automatic prompting technique that bridges the gap between zero-shot simplicity and few-shot effectiveness. Rather than requiring manually crafted examples or labeled data, COSP leverages an LLM's own predictions to automatically construct high-quality pseudo-demonstrations. The technique identifies which self-generated responses are most likely correct by measuring consistency across multiple outputs, then uses these reliable examples to guide subsequent inference.
The core insight is elegant: confident and consistent predictions are more likely correct. When a model repeatedly arrives at the same answer through different reasoning paths, that answer is probably right. COSP exploits this principle by having the model generate multiple responses to unlabeled questions, scoring them based on consistency and quality metrics, and selecting the best ones as demonstrations for a second inference pass.
Category: COSP belongs to ensembling and self-adaptive prompting techniques. It combines elements of zero-shot Chain-of-Thought prompting with automated few-shot example selection.
Type: Optimization-based and meta-cognitive technique that uses the model's own outputs to improve subsequent performance through automatic demonstration selection.
Scope: COSP includes automatic generation and selection of pseudo-demonstrations, consistency-based scoring, diversity enforcement, and two-stage inference. It excludes manual example curation, labeled data requirements, and fine-tuning. The technique specifically targets reasoning tasks where answers can be compared for consistency.
Why COSP Exists
Core Problems Solved:
- Manual demonstration burden: Few-shot prompting requires carefully crafted examples, which is time-consuming and requires domain expertise
- Labeled data dependency: Traditional few-shot approaches need ground-truth labels, limiting applicability to new domains
- Zero-shot performance gap: Pure zero-shot methods often underperform compared to few-shot, especially on complex reasoning tasks
- Example quality sensitivity: Few-shot performance varies significantly based on example selection, but optimal selection is non-obvious
- Domain adaptation cost: Creating new demonstrations for each domain or task is expensive and doesn't scale
Value Proposition:
- Zero labeled data requirement: Works with only unlabeled test samples and the LLM itself
- Accuracy improvement: Up to 15% gains over zero-shot baselines
- Few-shot parity: Matches or exceeds manually-crafted few-shot performance on many reasoning tasks
- Automatic adaptation: Self-adapts to different tasks without human intervention
- Scalability: Can be applied to any task where answer consistency is measurable
- Cost efficiency: Eliminates human effort in example curation while maintaining quality
Research Foundation
Seminal Work: Wan et al. (2023)
The paper "Better Zero-Shot Reasoning with Self-Adaptive Prompting" by Xingchen Wan, Ruoxi Sun, Hanjun Dai, Sercan O. Arik, and Tomas Pfister at Google Cloud AI Research and Google DeepMind introduced COSP. Published at ACL 2023 (Findings), this work established that LLMs can effectively bootstrap their own demonstrations by identifying high-confidence outputs.
Key Findings:
- Performance gains: Up to 15% improvement compared to zero-shot baselines
- Few-shot parity: Matches or exceeds 5-shot CoT baselines across multiple reasoning benchmarks
- Consistency-correctness correlation: Normalized entropy of answer distributions is a strong proxy for correctness
- Multi-model validation: Demonstrated effectiveness across PaLM-62B, PaLM-540B, and GPT-3 (code-davinci-001)
Building on Prior Work:
COSP builds directly on several foundational techniques:
- Zero-Shot CoT (Kojima et al., 2022): The "Let's think step by step" trigger that enables reasoning without examples
- Self-Consistency (Wang et al., 2022): Multiple sampling and majority voting to improve reliability
- Auto-CoT (Zhang et al., 2022): Automatic demonstration generation through clustering
COSP's innovation was combining these elements with a principled scoring function that balances consistency, repetition avoidance, and diversity.
Follow-up Work: Universal Self-Adaptive Prompting (USP)
Published at EMNLP 2023, USP extended COSP's principles beyond reasoning tasks to general NLU and NLG applications. While COSP focuses on tasks with clear, verifiable answers, USP introduces task-specific confidence measures for classification, short-form generation, and long-form generation.
Real-World Performance Evidence
Benchmark Results:
COSP was evaluated on six reasoning benchmarks across three LLMs:
Arithmetic Reasoning:
- MultiArith: Significant gains over zero-shot CoT, approaching few-shot performance
- GSM8K: Improvements with standard COSP; COSP-FS (few-shot in stage 1) outperformed 5-shot CoT
- AddSub: Consistent improvements across all models
- SingleEq: Strong performance gains
Commonsense Reasoning:
- CommonsenseQA: Improvements over zero-shot baseline
- StrategyQA: Gains demonstrated across model sizes
Model-Specific Results:
| Model | Zero-Shot vs COSP | vs Few-Shot |
| ------------------------ | ------------------ | ----------------------------------- |
| PaLM-62B | 10-15% improvement | Matches/exceeds 5-shot on 2/5 tasks |
| PaLM-540B | Significant gains | Matches/exceeds 5-shot on 3/5 tasks |
| GPT-3 (code-davinci-001) | 10-15% improvement | Competitive with 5-shot |
Key Finding: All LLMs using COSP outperformed zero-shot prompting on all tasks except GPT-3 on GSM8K, which required the COSP-FS variant with initial few-shot prompting for demonstration generation.
Comparative Performance:
- vs Zero-Shot CoT: Consistent, often large outperformance across all model and task configurations
- vs 5-Shot CoT with labeled examples: On par or better in majority of cases
- Complex tasks (GSM8K): Required COSP-FS variant, indicating that more difficult problems benefit from few-shot bootstrapping
How COSP Works
Theoretical Foundation
COSP is grounded in a fundamental observation about LLM behavior: when models are confident about an answer, they tend to produce the same answer consistently across multiple generations. This consistency-correctness correlation provides a signal for identifying high-quality outputs without requiring ground-truth labels.
Core Insight: The model's own uncertainty, measured through output consistency, serves as a reliable proxy for correctness. By sampling multiple responses and measuring their agreement, we can identify which predictions are trustworthy enough to serve as demonstrations for subsequent inference.
Conceptual Model:
Traditional Few-Shot: Human selects examples → LLM uses examples → Output
Zero-Shot CoT: Trigger phrase → LLM reasons → Output
COSP: LLM generates candidates → Score by consistency → Select best → LLM uses self-generated examples → Output
Fundamental Ideas:
- Self-consistency as quality signal: Answers that appear repeatedly across different reasoning paths are more likely correct
- Automatic demonstration construction: The model can generate its own high-quality examples
- Multi-criteria selection: Combining consistency, repetition avoidance, and diversity yields better demonstrations than any single criterion
- Two-stage inference: Using self-generated demonstrations in a second pass improves over single-pass zero-shot
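The "self-consistency as quality signal" idea above can be sketched in a few lines: given several sampled answers to one question, the majority answer's share of the votes serves as a confidence proxy. This is a toy illustration of the principle, not the paper's exact scoring function.

```python
from collections import Counter

def consistency_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    counter = Counter(sampled_answers)
    answer, votes = counter.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Five reasoning paths for one question, four of which agree:
answer, confidence = consistency_confidence(["11", "11", "11", "9", "11"])
# answer == "11", confidence == 0.8 -> a likely-correct, demonstration-worthy output
```

High agreement marks the response as a strong pseudo-demonstration candidate; low agreement flags a question the model finds difficult.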
Assumptions and Where They Fail:
Assumption 1: Consistent answers are more likely correct
- Holds: When the model has relevant knowledge and the task has deterministic answers
- Fails: When the model consistently produces the same wrong answer (confident but incorrect), or when tasks have multiple valid answers
Assumption 2: The model can generate useful reasoning chains
- Holds: With sufficiently large models (100B+ parameters) on reasoning tasks
- Fails: With smaller models that generate incoherent reasoning, or on tasks outside the model's competence
Assumption 3: Repetition in reasoning indicates poor quality
- Holds: When repetition reflects genuine redundancy or confusion
- Fails: When repetition is legitimate emphasis or necessary recapitulation
Assumption 4: Diverse demonstrations improve performance
- Holds: When different examples cover different aspects of the problem space
- Fails: When diversity introduces inconsistent or contradictory patterns
Fundamental Trade-offs:
| Trade-off | COSP's Balance |
| ----------------------------- | ------------------------------------------------------------ |
| Automation vs Control | Fully automated selection, less human control |
| Computational cost vs Quality | Multiple generations required, but no labeling cost |
| Generality vs Specialization | Task-agnostic scoring, may not capture task-specific quality |
| Consistency vs Diversity | Balances both through scoring function |
Execution Mechanism
COSP operates in two distinct stages:
Stage 1: Pseudo-Demonstration Generation and Selection
- Input: Unlabeled questions/problems from the target domain
- Generation: For each of the n questions, generate m reasoning chains using Zero-Shot CoT with non-zero temperature
- Scoring: Compute a composite score for each question-response pair based on:
  - Normalized entropy of the answer distribution (consistency)
  - Repetitiveness within the reasoning chain
- Selection: Rank all n × m candidates by score and select the k with the lowest scores
- Output: Set of k pseudo-demonstrations (question + reasoning + answer)
Stage 2: Test Inference
- Prompt construction: Concatenate selected pseudo-demonstrations with test question
- Generation: Generate multiple reasoning chains for the test question
- Aggregation: Apply majority voting across chains to determine final answer
- Output: Final predicted answer
Detailed Execution Flow:
Stage 1:
Questions Q₁...Qₙ → [Zero-Shot CoT with temp > 0] →
For each Qᵢ: Generate m responses Rᵢ₁...Rᵢₘ →
Extract answers Aᵢ₁...Aᵢₘ →
Compute entropy(Aᵢ₁...Aᵢₘ) →
Compute repetitiveness(Rᵢⱼ) for each response →
Score = entropy + λ × repetitiveness →
Select k lowest-scoring (Qᵢ, Rᵢⱼ, Aᵢⱼ) tuples
Stage 2:
Test question Qₜₑₛₜ →
Prepend selected demonstrations →
Generate multiple reasoning paths →
Majority vote on answers →
Return final answer
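The Stage 1 scoring step in the flow above can be made concrete with a small, self-contained sketch: normalized entropy over the extracted answers plus a weighted repetitiveness term. The λ weight and the toy repetitiveness values are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def normalized_entropy(answers):
    """Normalized Shannon entropy of the answer distribution (0 = unanimous)."""
    counts = Counter(answers)
    total = len(answers)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    max_h = math.log(total)
    return h / max_h if max_h > 0 else 0.0

def cosp_score(answers, repetitiveness, lam=0.2):
    """Lower is better: consistent answers and non-repetitive reasoning."""
    return normalized_entropy(answers) + lam * repetitiveness

# A unanimous question outranks (scores lower than) a split one:
unanimous = cosp_score(["42", "42", "42", "42"], repetitiveness=0.1)
split = cosp_score(["42", "17", "8", "42"], repetitiveness=0.1)
assert unanimous < split
```

Ranking all n × m candidate responses by this score and keeping the k lowest yields the pseudo-demonstration set used in Stage 2.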
Cognitive Processes Triggered:
- Pattern recognition: Selected demonstrations prime the model to recognize problem structure
- Reasoning template application: High-quality demonstrations provide reasoning templates
- Answer format alignment: Consistent demonstration format guides output formatting
- Confidence calibration: Multiple sampling enables uncertainty estimation
Is This Single-Pass or Multi-Stage?
COSP is inherently multi-stage:
- Stage 1: Multiple generation passes for candidate creation (n × m generations)
- Stage 2: Multiple generation passes for self-consistency voting
- Minimum API calls: n × m (demonstration generation) + s per test question (self-consistency samples)
Completion Criteria:
- Stage 1 completes when k demonstrations are selected
- Stage 2 completes when majority vote determines the answer
- No iterative refinement between stages (single demonstration selection)
Causal Mechanisms
Why COSP Improves Outputs:
- Quality filtering: Low-entropy responses are more likely correct; selecting these provides better demonstrations than random selection
- Noise reduction: Repetition penalty filters out degenerate reasoning chains that might confuse the model
- Coverage improvement: Diversity encouragement ensures demonstrations cover different problem types or reasoning patterns
- Bootstrapping effect: Using the model's own confident outputs creates a positive feedback loop where good reasoning begets better reasoning
- Format consistency: Self-generated demonstrations naturally match the model's preferred output format
Cascading Effects:
High-quality demonstrations selected →
Better reasoning patterns primed →
More accurate intermediate steps →
Correct final answers →
(If used iteratively) Even better demonstration pool
Feedback Loops:
- Positive: Correct demonstrations improve test accuracy; in iterative settings, this could improve future demonstration quality
- Negative: If initial zero-shot performance is poor, the demonstration pool may lack high-quality candidates, limiting gains
Emergent Behaviors:
- Self-calibration: The entropy scoring implicitly identifies which questions the model finds difficult
- Automatic difficulty stratification: COSP+ variant uses entropy to provide more demonstrations for harder questions
- Domain adaptation: Without explicit programming, COSP adapts to domain-specific reasoning patterns present in the unlabeled questions
Dominant Factors in Effectiveness (Ranked):
- Model capability (40%): Larger models generate higher-quality candidates and better utilize demonstrations
- Consistency-correctness correlation (25%): How well entropy predicts correctness for the specific task
- Demonstration diversity (20%): Coverage of different problem types in selected demonstrations
- Scoring function calibration (15%): Appropriate balance between consistency and repetition penalties
Structure and Components
Essential Components
Required Components:
- Unlabeled question pool: Set of questions from the target domain (no labels needed)
- Zero-Shot CoT trigger: Reasoning elicitation phrase (e.g., "Let's think step by step")
- Scoring function: Weighted combination of entropy and repetitiveness
- Selection mechanism: Ranking and top-k selection
- Aggregation method: Majority voting for final answer
Optional Components:
- Few-shot bootstrap (COSP-FS): Initial few-shot prompting in Stage 1 for complex tasks
- Adaptive demonstration count (COSP+): Variable k based on question difficulty
- Custom repetition detection: Domain-specific repetitiveness scoring
- Diversity constraints: Explicit diversity requirements in selection
Component Hierarchy:
COSP System
├── Stage 1: Demonstration Generation
│ ├── Question Pool (required)
│ ├── Zero-Shot CoT Generator (required)
│ ├── Answer Extractor (required)
│ └── Scoring Module
│ ├── Entropy Calculator (required)
│ ├── Repetitiveness Calculator (required)
│ └── Diversity Enforcer (optional)
├── Stage 2: Test Inference
│ ├── Prompt Constructor (required)
│ ├── Multi-path Generator (required)
│ └── Majority Voter (required)
└── Variants
├── COSP-FS (optional)
└── COSP+ (optional)
Design Principles
Linguistic Patterns:
COSP relies on standard CoT linguistic patterns in generated demonstrations:
- Sequential markers: "First," "Then," "Next," "Finally"
- Reasoning connectors: "Therefore," "Thus," "So," "Because"
- Calculation language: "Let's calculate," "Computing," "This gives us"
- Conclusion signals: "The answer is," "Therefore, the answer is"
Cognitive Principles Leveraged:
- Metacognition: Using consistency as a self-assessment of knowledge certainty
- Learning by example: Demonstrations prime specific reasoning patterns
- Redundancy detection: Recognizing that repetitive reasoning indicates confusion
- Ensemble wisdom: Multiple perspectives (diverse demonstrations) improve robustness
Core Design Principles:
| Principle | Implementation in COSP |
| ---------------------- | -------------------------------------------- |
| Self-reliance | Uses model's own outputs, no external labels |
| Quality over quantity | Selects few high-quality demonstrations |
| Uncertainty awareness | Entropy measures confidence |
| Diversity preservation | Avoids selecting redundant examples |
| Simplicity | Straightforward scoring function |
Structural Patterns
Minimal Pattern:
[Stage 1 - Implicit]
Generate responses to unlabeled questions
Select most consistent ones
[Stage 2 - Prompt]
Q: [Selected question 1]
A: [Selected reasoning and answer 1]
Q: [Selected question 2]
A: [Selected reasoning and answer 2]
Q: [Test question]
A: Let's think step by step.
Standard Pattern:
[Stage 1 Prompt - for each unlabeled question]
Q: {unlabeled_question}
A: Let's think step by step.
[Repeat m times with temperature > 0, collect answers]
[Score all n×m responses]
[Select top k]
[Stage 2 Prompt]
Q: {selected_question_1}
A: {selected_reasoning_1}. The answer is {selected_answer_1}.
Q: {selected_question_2}
A: {selected_reasoning_2}. The answer is {selected_answer_2}.
Q: {selected_question_3}
A: {selected_reasoning_3}. The answer is {selected_answer_3}.
Q: {test_question}
A: Let's think step by step.
Advanced Pattern (COSP-FS for Complex Tasks):
[Stage 1 Prompt - with few-shot bootstrap]
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many balls does he have?
A: Roger started with 5 balls. He bought 2 × 3 = 6 balls. Total: 5 + 6 = 11. The answer is 11.
Q: {unlabeled_question}
A: Let's think step by step.
[Generate m responses, score, select k]
[Stage 2 - same as standard pattern with selected demonstrations]
COSP+ Pattern (Adaptive Demonstration Count):
[After Stage 1 scoring]
For test question with entropy E:
If E < threshold_low: use k_min demonstrations
If E > threshold_high: use k_max demonstrations
Else: use k_standard demonstrations
[Stage 2 with variable demonstration count based on difficulty]
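The COSP+ rule above reduces to a small threshold function: the test question's answer entropy (from Stage 1-style sampling) determines how many demonstrations to attach. The threshold and k values here are illustrative assumptions, not values from the paper.

```python
def adaptive_k(entropy, k_min=2, k_standard=3, k_max=5,
               threshold_low=0.2, threshold_high=0.6):
    """Choose a demonstration count from the test question's answer entropy.

    Low entropy -> the model is already confident, so fewer demonstrations
    suffice; high entropy -> a harder question that gets more demonstrations.
    """
    if entropy < threshold_low:
        return k_min
    if entropy > threshold_high:
        return k_max
    return k_standard

assert adaptive_k(0.1) == 2   # confident question
assert adaptive_k(0.4) == 3   # typical question
assert adaptive_k(0.9) == 5   # difficult question
```

This spends the extra prompt-length budget only where the entropy signal says it is needed.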
Modifications for Different Scenarios
High-Complexity Tasks (e.g., GSM8K):
- Use COSP-FS variant with few-shot bootstrap in Stage 1
- Increase m (samples per question) for better candidate pool
- Increase k (selected demonstrations) for more context
- Consider COSP+ for adaptive demonstration count
Ambiguous or Open-Ended Tasks:
- COSP may struggle; consider USP for such tasks
- Increase diversity weight in selection
- Use domain-specific repetitiveness detection
- May need task-specific consistency metrics
Format-Critical Tasks:
- Ensure demonstration format matches expected output format
- Add explicit format instructions in Stage 2 prompt
- Consider post-processing to extract structured answers
Domain-Specific Applications:
- Use domain-specific unlabeled questions for demonstration generation
- May need custom answer extraction for non-standard formats
- Consider domain terminology in repetitiveness scoring
Resource-Constrained Settings:
- Reduce m (fewer candidates per question)
- Reduce n (smaller question pool)
- Use greedy selection instead of global ranking
- Consider caching demonstrations across similar queries
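For the caching suggestion above, a minimal sketch that persists selected demonstrations as JSON, so the expensive Stage 1 runs once and is reused across queries and sessions. The file path and record shape are assumptions for illustration.

```python
import json
from pathlib import Path

def save_demonstrations(demos, path="cosp_demos.json"):
    """Persist Stage 1 output so it can be reused without regeneration."""
    Path(path).write_text(json.dumps(demos, indent=2))

def load_demonstrations(path="cosp_demos.json"):
    """Load cached demonstrations; return None if Stage 1 hasn't run yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else None

demos = [{"question": "2 + 3?", "reasoning": "2 + 3 = 5.", "answer": "5"}]
save_demonstrations(demos, "demos.json")
assert load_demonstrations("demos.json") == demos
```

Remember that cached demonstrations are domain-specific: regenerate the cache when the question distribution changes.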
Applications and Task Selection
General Applications
Arithmetic Reasoning:
COSP excels at mathematical word problems where answers are unambiguous:
- Multi-step arithmetic (MultiArith, AddSub, SingleEq)
- Grade school math (GSM8K with COSP-FS)
- Algebraic problem solving
- Numerical reasoning tasks
Commonsense Reasoning:
Effective on structured commonsense tasks:
- StrategyQA (yes/no commonsense questions)
- CommonsenseQA (multiple choice)
- Physical reasoning with discrete answers
- Temporal and spatial reasoning
Logical Reasoning:
Applicable to logic tasks with verifiable answers:
- Syllogistic reasoning
- Deductive logic problems
- Constraint satisfaction
- Symbolic reasoning tasks
Question Answering:
Works well for extractive and discrete QA:
- Factoid questions with clear answers
- Reading comprehension with extractable answers
- Multi-hop QA requiring reasoning chains
Domain-Specific Applications
Education:
- Automated tutoring for math problems
- Self-improving problem solution generation
- Adaptive difficulty assessment using entropy scores
- Homework assistance systems
Scientific Computing:
- Unit conversion and dimensional analysis
- Scientific calculation problems
- Data interpretation tasks
- Experimental design reasoning
Business Analytics:
- Financial calculations with clear answers
- Metric computations
- Quantitative business problems
- ROI and cost-benefit analysis
Legal and Compliance:
- Regulatory compliance checking (yes/no determinations)
- Policy interpretation with discrete outcomes
- Contract clause analysis with binary decisions
Unconventional Applications:
- Code output prediction: Predicting program outputs for given inputs
- Game strategy: Move selection in deterministic games
- Puzzle solving: Logic puzzles with verifiable solutions
- Scheduling optimization: Constraint satisfaction problems
Selection Framework
Problem Characteristics Favoring COSP:
| Characteristic | Why It Helps |
| ------------------------------ | -------------------------------------- |
| Deterministic answers | Enables consistency measurement |
| Reasoning required | Benefits from CoT demonstrations |
| Multiple valid reasoning paths | Allows diverse demonstration selection |
| Clear answer extraction | Enables entropy calculation |
| Domain with unlabeled examples | Provides demonstration candidates |
Scenarios COSP is Optimized For:
- Zero-shot reasoning with no labeled data available
- Tasks where few-shot example creation is expensive
- Domains requiring automatic adaptation
- Applications needing consistent, reproducible reasoning
- Settings where answer correctness can be verified post-hoc
Scenarios NOT Recommended For:
- Open-ended generation: No clear answer consistency metric (use USP instead)
- Highly subjective tasks: Multiple valid answers break consistency assumption
- Very simple tasks: Overhead not justified for single-step problems
- Tasks with no similar unlabeled data: Cannot generate relevant demonstrations
- Real-time applications: Multiple generation passes add latency
- Small models: Requires 100B+ parameters for quality reasoning
Selection Signals - When to Choose COSP:
✓ You have unlabeled questions from the target domain
✓ Answers can be compared for consistency (discrete, extractable)
✓ Zero-shot CoT underperforms but you lack labeled examples
✓ Task involves multi-step reasoning
✓ Computational budget allows multiple API calls
✓ Few-shot examples would require significant domain expertise
Selection Signals - When NOT to Choose COSP:
✗ Task has subjective or open-ended answers
✗ Real-time latency constraints (< 2 seconds)
✗ No unlabeled questions available from target domain
✗ Using native reasoning models (o1, o3) that don't need demonstrations
✗ Simple retrieval or pattern matching tasks
✗ Very limited computational budget
Model Requirements:
| Level | Specification |
| ------------ | ---------------------------------------------------- |
| Minimum | 100B+ parameters, instruction-tuned |
| Recommended | PaLM-62B, GPT-3.5, Claude 2, Llama 70B+ |
| Optimal | PaLM-540B, GPT-4, Claude 3, Llama 405B |
| Not Suitable | Models < 50B, base models without instruction tuning |
Required Model Capabilities:
- Zero-Shot CoT reasoning ability
- Consistent output format
- Temperature-controlled sampling
- Sufficient context window for demonstrations (4K+ tokens)
Context and Resource Requirements:
| Resource | Typical Usage |
| ------------------- | ------------------------------------------------- |
| Unlabeled questions | 10-50 questions for demonstration pool |
| Generation budget | n × m + s calls (e.g., 20 × 5 + 10 = 110 calls) |
| Context window | 2000-4000 tokens for k=3 demonstrations |
| Latency | 10-60 seconds total (Stage 1 can be pre-computed) |
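The generation budget above is simple arithmetic: a one-time Stage 1 cost of n × m calls (amortizable across queries), plus s self-consistency calls per test question. A one-line sanity check:

```python
def cosp_call_budget(n, m, s, num_test_questions):
    """Total LLM calls: one-time Stage 1 (n*m) plus s per test question."""
    return n * m + s * num_test_questions

# The worked example: n=20, m=5, s=10, one test question -> 110 calls
assert cosp_call_budget(20, 5, 10, 1) == 110
```

For large test sets the per-question s term dominates, which is why Stage 1 caching matters less than tuning s.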
Cost Implications:
One-Time Costs (Stage 1):
- n × m API calls for demonstration generation
- Embedding API calls for repetitiveness calculation
- Can be amortized across many test queries
Per-Request Costs (Stage 2):
- s API calls for self-consistency (typically 5-10)
- Longer prompts due to demonstrations
- 2-5x cost of simple zero-shot
Cost-Quality Trade-offs:
| Configuration | Cost | Expected Gain |
| --------------------------- | -------------- | ---------------- |
| COSP (m=2, k=3, s=5) | ~5x zero-shot | +10-15% accuracy |
| COSP-lite (m=2, k=2, s=3) | ~3x zero-shot | +7-12% accuracy |
| COSP-heavy (m=5, k=5, s=10) | ~10x zero-shot | +12-18% accuracy |
Variant Selection Guide:
| Variant | Best For |
| --------------- | --------------------------------------------------------- |
| COSP (standard) | Most reasoning tasks, moderate complexity |
| COSP-FS | Complex tasks (GSM8K), when zero-shot candidates are poor |
| COSP+ | Heterogeneous difficulty, adaptive resources |
| COSP-lite | Cost-constrained, still want automation |
When to Escalate to Alternatives:
- To USP: When task is classification, summarization, or open-ended
- To Manual Few-Shot: When COSP underperforms and you have expert examples
- To Fine-Tuning: When COSP ceiling reached and large labeled dataset available
- To Native Reasoning Models: When using o1/o3 (built-in reasoning superior)
Implementation
Step-by-Step Implementation
Prerequisites:
- Access to LLM API with temperature control
- Embedding API for repetitiveness scoring (optional but recommended)
- Set of unlabeled questions from target domain
- Answer extraction logic for the task
Phase 1: Setup and Configuration
# Configuration parameters
config = {
"n": 20, # Number of unlabeled questions
"m": 5, # Reasoning chains per question
"k": 3, # Demonstrations to select
"s": 5, # Self-consistency samples
"temperature": 0.7, # For diverse sampling
"trade_off": 0.2, # Repetitiveness weight
}
Phase 2: Stage 1 - Demonstration Generation
import openai
import numpy as np
from collections import Counter
def generate_candidates(questions, m, temperature):
"""Generate m reasoning chains for each question."""
candidates = []
for q in questions:
for _ in range(m):
prompt = f"Q: {q}\nA: Let's think step by step."
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=500
)
reasoning = response.choices[0].message.content
answer = extract_answer(reasoning)
candidates.append({
"question": q,
"reasoning": reasoning,
"answer": answer
})
return candidates
def extract_answer(reasoning):
"""Extract final answer from reasoning chain."""
import re
# Customize based on task format
match = re.search(r'(?:answer is|=)\s*(-?\d+(?:\.\d+)?)', reasoning.lower())
return match.group(1) if match else reasoning.split()[-1]
Phase 3: Scoring Function
def compute_entropy(answers):
"""Compute normalized entropy of answer distribution."""
if len(answers) <= 1:
return 0.0
counter = Counter(answers)
total = len(answers)
probabilities = [count / total for count in counter.values()]
entropy = -sum(p * np.log(p) for p in probabilities if p > 0)
max_entropy = np.log(len(answers))
return entropy / max_entropy if max_entropy > 0 else 0.0
def compute_repetitiveness(reasoning, get_embedding):
"""Compute repetitiveness score using sentence embeddings."""
import re
# Split into sentences
sentences = re.split(r'[.!?]+', reasoning)
sentences = [s.strip() for s in sentences if s.strip()]
if len(sentences) <= 1:
return 0.0
# Get embeddings
embeddings = [get_embedding(s) for s in sentences]
# Compute pairwise cosine similarities
similarities = []
for i in range(len(embeddings)):
for j in range(i + 1, len(embeddings)):
sim = cosine_similarity(embeddings[i], embeddings[j])
similarities.append(sim)
return np.mean(similarities) if similarities else 0.0
def cosine_similarity(a, b):
"""Compute cosine similarity between two vectors."""
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def score_candidates(candidates, m, trade_off, get_embedding):
"""Score all candidates and return sorted list."""
# Group by question
questions = {}
for c in candidates:
q = c["question"]
if q not in questions:
questions[q] = []
questions[q].append(c)
scored = []
for q, responses in questions.items():
answers = [r["answer"] for r in responses]
entropy = compute_entropy(answers)
for r in responses:
rep = compute_repetitiveness(r["reasoning"], get_embedding)
score = entropy + trade_off * rep
scored.append({**r, "score": score, "entropy": entropy})
return sorted(scored, key=lambda x: x["score"])
Phase 4: Demonstration Selection
def select_demonstrations(scored_candidates, k):
"""Select top k demonstrations with diversity."""
selected = []
seen_questions = set()
for candidate in scored_candidates:
# Optional: enforce question diversity
if candidate["question"] not in seen_questions:
selected.append(candidate)
seen_questions.add(candidate["question"])
if len(selected) >= k:
break
return selected
Phase 5: Stage 2 - Test Inference
def build_prompt(demonstrations, test_question):
"""Construct prompt with demonstrations."""
prompt_parts = []
for demo in demonstrations:
prompt_parts.append(f"Q: {demo['question']}")
prompt_parts.append(f"A: {demo['reasoning']}")
prompt_parts.append("")
prompt_parts.append(f"Q: {test_question}")
prompt_parts.append("A: Let's think step by step.")
return "\n".join(prompt_parts)
def cosp_inference(test_question, demonstrations, s, temperature=0.7):
"""Run COSP inference with self-consistency."""
prompt = build_prompt(demonstrations, test_question)
answers = []
for _ in range(s):
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=500
)
reasoning = response.choices[0].message.content
answer = extract_answer(reasoning)
answers.append(answer)
# Majority vote
counter = Counter(answers)
return counter.most_common(1)[0][0]
Complete COSP Pipeline:
def cosp_pipeline(unlabeled_questions, test_questions, config):
"""Complete COSP pipeline."""
# Stage 1: Generate and select demonstrations
print("Stage 1: Generating candidates...")
candidates = generate_candidates(
unlabeled_questions,
config["m"],
config["temperature"]
)
print("Scoring candidates...")
scored = score_candidates(
candidates,
config["m"],
config["trade_off"],
get_embedding_function()
)
demonstrations = select_demonstrations(scored, config["k"])
print(f"Selected {len(demonstrations)} demonstrations")
# Stage 2: Run inference on test questions
print("Stage 2: Running inference...")
results = []
for test_q in test_questions:
answer = cosp_inference(
test_q,
demonstrations,
config["s"]
)
results.append({"question": test_q, "answer": answer})
return results, demonstrations
Platform-Specific Implementations
OpenAI API:
import openai
client = openai.OpenAI(api_key="your-key")
def get_embedding(text):
response = client.embeddings.create(
model="text-embedding-ada-002",
input=text
)
return response.data[0].embedding
def generate_response(prompt, temperature=0.7):
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=500
)
return response.choices[0].message.content
Anthropic Claude:
import anthropic
client = anthropic.Anthropic(api_key="your-key")
def generate_response_claude(prompt, temperature=0.7):
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return message.content[0].text
LangChain Integration:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
class COSPChain:
def __init__(self, model_name="gpt-4", config=None):
self.llm = ChatOpenAI(model=model_name, temperature=0.7)
self.config = config or {"m": 5, "k": 3, "s": 5}
self.demonstrations = None
def generate_demonstrations(self, questions):
"""Stage 1: Generate and select demonstrations."""
candidates = []
for q in questions:
for _ in range(self.config["m"]):
prompt = f"Q: {q}\nA: Let's think step by step."
response = self.llm.invoke([HumanMessage(content=prompt)])
candidates.append({
"question": q,
"reasoning": response.content,
"answer": self._extract_answer(response.content)
})
self.demonstrations = self._select_best(candidates)
return self.demonstrations
def inference(self, test_question):
"""Stage 2: Run inference with demonstrations."""
if not self.demonstrations:
raise ValueError("Must generate demonstrations first")
prompt = self._build_prompt(test_question)
answers = []
for _ in range(self.config["s"]):
response = self.llm.invoke([HumanMessage(content=prompt)])
answers.append(self._extract_answer(response.content))
return max(set(answers), key=answers.count)
Configuration
Key Parameters:
| Parameter | Description | Recommended Value |
| ------------- | ------------------------ | ----------------- |
| n | Unlabeled questions | 20-50 |
| m | Samples per question | 3-7 |
| k | Selected demonstrations | 3-5 |
| s | Self-consistency samples | 5-10 |
| temperature | Sampling temperature | 0.5-0.8 |
| trade_off (λ) | Repetitiveness weight | 0.1-0.3 |
Task-Specific Tuning:
Arithmetic Reasoning:
- Higher m (5-7) for better coverage
- Standard k (3-5)
- Lower temperature (0.5-0.7) for more coherent math
- Consider COSP-FS for GSM8K
Commonsense Reasoning:
- Standard m (3-5)
- k = 3 typically sufficient
- Higher temperature (0.7-0.8) for diversity
- Lower trade_off for more consistency focus
Complex Multi-Step:
- Use COSP-FS variant
- Higher k (5-7) for more context
- Consider COSP+ for adaptive selection
- May need domain-specific answer extraction
Temperature Settings:
| Setting                     | Temperature | Use Case             |
| --------------------------- | ----------- | -------------------- |
| Stage 1 generation          | 0.7-0.9     | Diverse candidates   |
| Stage 2 inference           | 0.5-0.7     | Balanced exploration |
| Final answer (low variance) | 0.0-0.3     | Deterministic output |
Best Practices and Workflow
Do's:
- Pre-compute demonstrations: Run Stage 1 offline and cache demonstrations
- Validate answer extraction: Test extraction logic before full pipeline
- Monitor entropy distribution: Check that scoring differentiates candidates
- Use diverse unlabeled questions: Cover different problem types
- Start with standard config: Tune parameters only if needed
- Verify demonstration quality: Manually inspect selected demonstrations
- Test on held-out set: Don't evaluate on demonstration source questions
Don'ts:
- Don't use COSP with o1/o3: Native reasoning models don't need demonstrations
- Don't apply to subjective tasks: Consistency metric won't work
- Don't skip repetitiveness scoring: It filters degenerate outputs
- Don't use too few unlabeled questions: Need sufficient candidate pool
- Don't ignore computational costs: Budget API calls appropriately
- Don't assume demonstrations transfer: Re-generate for new domains
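Several of the do's and don'ts above (validating extraction, standardizing answer formats) hinge on reliable answer parsing. A minimal regex-based extractor for numeric answers might look like this; the pattern assumes a "the answer is N" convention and is a sketch, not the method's prescribed extractor:

```python
import re

def extract_answer(text):
    """Pull the final numeric answer from a reasoning chain.
    Assumes answers follow the 'the answer is N' convention; falls back
    to the last number appearing in the text."""
    m = re.search(r"answer is\s*:?\s*(-?\d+(?:\.\d+)?)", text, re.IGNORECASE)
    if m:
        return m.group(1)
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None
```

Test this on a handful of known outputs before running the full pipeline.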
Workflow:
1. Collect unlabeled questions (15-30 min)
└─ Gather 20-50 representative questions from target domain
2. Configure COSP parameters (5 min)
└─ Set n, m, k, s, temperature based on task type
3. Run Stage 1 (5-30 min depending on n×m)
└─ Generate candidates, score, select demonstrations
4. Validate demonstrations (10 min)
└─ Manually inspect selected demonstrations for quality
5. Test on sample (10 min)
└─ Run Stage 2 on 5-10 test questions
6. Evaluate and iterate (15-30 min)
└─ Measure accuracy, adjust parameters if needed
7. Deploy (ongoing)
└─ Use cached demonstrations for production inference
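Steps 3 and 5 of this workflow reduce to a thin two-stage driver. In this sketch, `generate_candidates`, `select_best`, and `self_consistent_answer` are hypothetical callables standing in for the generation, scoring, and inference logic described earlier:

```python
def run_cosp(unlabeled_questions, test_questions,
             generate_candidates, select_best, self_consistent_answer, k=3):
    """Minimal two-stage COSP driver: build demonstrations once (Stage 1),
    then answer each test question with self-consistency (Stage 2)."""
    candidates = generate_candidates(unlabeled_questions)  # n × m samples
    demos = select_best(candidates, k=k)                   # entropy + repetitiveness scoring
    return {q: self_consistent_answer(q, demos) for q in test_questions}
```

Because Stage 1 runs once, `demos` is the natural thing to cache for production inference.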
Debugging Decision Tree
Problem: Low overall accuracy
Root Cause Analysis:
├─ Poor demonstration quality?
│ └─ Check: Manually inspect selected demonstrations
│ └─ Fix: Increase n or m, use COSP-FS, improve question pool
├─ Wrong answer extraction?
│ └─ Check: Verify extraction on known examples
│ └─ Fix: Improve regex or parsing logic
├─ Insufficient demonstrations?
│ └─ Check: Try increasing k
│ └─ Fix: Use k=5 instead of k=3
└─ Task not suitable for COSP?
└─ Check: Are answers deterministic and comparable?
└─ Fix: Consider USP or manual few-shot
Problem: Inconsistent results across runs
Root Cause Analysis:
├─ High temperature in Stage 2?
│ └─ Fix: Reduce to 0.3-0.5
├─ Too few self-consistency samples?
│ └─ Fix: Increase s to 7-10
└─ Ambiguous answer format?
└─ Fix: Standardize answer extraction
Problem: Demonstrations are low quality
Root Cause Analysis:
├─ Unlabeled questions too difficult?
│ └─ Fix: Use COSP-FS with bootstrap examples
├─ Scoring function not discriminating?
│ └─ Check: Entropy distribution should have variance
│ └─ Fix: Increase m for better entropy estimation
└─ Repetitiveness scoring broken?
└─ Check: Verify embedding function works
└─ Fix: Use different embedding model
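The repetitiveness check above is embedding-based in the original method. As a cheap stand-in for debugging, repeated n-grams can flag degenerate outputs; this is an approximation for quick diagnosis, not the paper's metric:

```python
from collections import Counter

def repetitiveness_score(text, n=3):
    """Fraction of word trigrams that occur more than once;
    higher values indicate repetitive, degenerate reasoning."""
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)
```

A score near 1.0 on a candidate's reasoning is a strong hint that the embedding-based scoring should have filtered it.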
Problem: High latency
Root Cause Analysis:
├─ Too many candidates (n×m)?
│ └─ Fix: Reduce n to 15, m to 3
├─ Stage 1 not cached?
│ └─ Fix: Pre-compute and cache demonstrations
└─ Too many self-consistency samples?
└─ Fix: Reduce s to 3-5
Common Mistakes:
| Mistake                         | Impact                        | Prevention                   |
| ------------------------------- | ----------------------------- | ---------------------------- |
| Using COSP for open-ended tasks | Poor accuracy                 | Check task suitability first |
| Not caching Stage 1             | Wasted computation            | Always cache demonstrations  |
| Wrong answer format             | Bad entropy scores            | Validate extraction logic    |
| Too small question pool         | Limited demonstration quality | Use 20+ unlabeled questions  |
| Ignoring repetitiveness         | Degenerate demonstrations     | Always include rep scoring   |
Testing and Optimization
Validation Strategy:
- Held-out test set: Never use test questions in demonstration pool
- Cross-validation: For limited data, use k-fold on unlabeled questions
- Ablation testing: Compare COSP vs zero-shot vs manual few-shot
- Component analysis: Test with/without repetitiveness, diversity
Test Coverage:
- Standard cases (60%): Typical problems from target domain
- Edge cases (25%): Unusual inputs, boundary conditions
- Hard cases (15%): Known difficult problems
Quality Metrics:
| Metric                | Description              | Target         |
| --------------------- | ------------------------ | -------------- |
| Accuracy              | Correct answers / total  | Task-dependent |
| Gain over zero-shot   | COSP acc - zero-shot acc | > 5%           |
| Consistency           | Variance across runs     | Low (< 5%)     |
| Demonstration quality | Manual quality score     | High (4/5+)    |
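The first two metrics can be computed directly once gold labels exist for a held-out set; a minimal sketch:

```python
def evaluate_gain(cosp_answers, zero_shot_answers, gold):
    """Accuracy for COSP and a zero-shot baseline, plus the gain over
    zero-shot (target > 5% per the quality metrics above)."""
    def accuracy(preds):
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)
    cosp_acc = accuracy(cosp_answers)
    zs_acc = accuracy(zero_shot_answers)
    return {"cosp": cosp_acc, "zero_shot": zs_acc, "gain": cosp_acc - zs_acc}
```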
Optimization Techniques:
Token Efficiency:
- Compress demonstration reasoning (remove filler)
- Use shorter unlabeled questions
- Reduce k for simpler tasks
Latency Optimization:
- Cache Stage 1 demonstrations
- Batch API calls where possible
- Use streaming for long responses
- Consider parallel Stage 2 calls
Cost Optimization:
- Start with COSP-lite (m=2, k=2, s=3)
- Only increase if accuracy insufficient
- Re-use demonstrations across similar queries
- Use cheaper model for Stage 1 if quality sufficient
Iteration Criteria:
- Stop if accuracy within 2% of few-shot baseline
- Stop if increasing parameters shows < 1% gain
- Maximum 3-4 parameter iterations
- Focus on demonstration quality over quantity
Limitations and Constraints
Known Limitations
1. Task Scope Restriction
COSP is designed for reasoning tasks with deterministic, comparable answers. It cannot effectively handle:
- Open-ended generation (summarization, creative writing)
- Subjective evaluations (style assessment, preference ranking)
- Tasks with multiple equally valid answers
Why fundamental: The consistency metric requires answers that can be meaningfully compared. Without this, the entire selection mechanism breaks down.
2. Consistency-Correctness Assumption
COSP assumes consistent answers are correct, but models can be consistently wrong:
- Systematic biases produce consistent incorrect answers
- Common misconceptions may be repeated confidently
- Majority of training data may contain errors
When this fails:
- Out-of-distribution problems
- Problems requiring uncommon knowledge
- Tasks where model training data is biased
3. Computational Overhead
COSP requires significant additional computation:
- Stage 1: n × m API calls for demonstration generation
- Stage 2: s API calls per test question
- Embedding calls for repetitiveness scoring
Cannot be eliminated: This is inherent to the multi-sampling approach.
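The call counts are easy to budget up front; a small illustrative helper:

```python
def cosp_call_budget(n, m, s, num_test_questions):
    """Total LLM calls: Stage 1 candidate generation (n × m) plus
    Stage 2 self-consistency sampling (s per test question)."""
    stage1 = n * m
    stage2 = s * num_test_questions
    return {"stage1": stage1, "stage2": stage2, "total": stage1 + stage2}
```

With the typical config (n=20, m=5, s=5) and 100 test questions, Stage 1 costs 100 calls and Stage 2 costs 500, so caching Stage 1 amortizes its cost across every future query.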
4. Model Size Dependency
Like other CoT-based methods, COSP requires large models:
- Minimum ~100B parameters for quality reasoning
- Smaller models generate incoherent chains
- Demonstration selection can't fix fundamentally poor reasoning
5. Cold Start Problem
COSP needs unlabeled questions from the target domain:
- New domains without examples can't use COSP
- Quality depends on question pool representativeness
- Mismatched questions lead to irrelevant demonstrations
6. No Iterative Improvement
Standard COSP is two-stage without feedback:
- Selected demonstrations are fixed
- No mechanism to improve based on test performance
- Errors in Stage 1 propagate to Stage 2
Edge Cases
Highly Homogeneous Answers:
When all m samples produce the same answer:
- Entropy = 0, but might still be wrong
- Detection: Check for zero/near-zero entropy across all questions
- Handling: Increase temperature, verify with alternative methods
No Clear Answer Pattern:
When answers are uniformly distributed:
- Entropy is maximum, no confident prediction
- Detection: High entropy across many questions
- Handling: May indicate task unsuitability or need for COSP-FS
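Both edge cases are visible in the normalized entropy of the answer distribution; a sketch of how it can be computed:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Normalized Shannon entropy of the answer distribution:
    0.0 when all m samples agree, 1.0 when answers are uniformly spread."""
    counts = Counter(answers)
    if len(counts) <= 1:
        return 0.0
    n = len(answers)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))
```

Near-zero entropy across all questions signals the homogeneous case above; near-1.0 entropy everywhere signals no clear answer pattern.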
Conflicting High-Quality Demonstrations:
When selected demonstrations suggest different reasoning patterns:
- Detection: Check demonstration consistency
- Handling: Enforce stricter diversity constraints or reduce k
Domain Shift:
When test questions differ significantly from unlabeled pool:
- Detection: Poor test accuracy despite good demonstration quality
- Handling: Expand unlabeled question pool, use domain adaptation
Graceful Degradation:
If COSP accuracy < zero-shot:
→ Fall back to zero-shot CoT
→ Check task suitability
If Stage 1 produces no good candidates:
→ Switch to COSP-FS
→ Reduce scoring thresholds
If latency exceeds budget:
→ Reduce s (fewer self-consistency samples)
→ Use cached demonstrations longer
Constraint Management
Balancing Consistency vs Diversity:
- High consistency weight → may select redundant demonstrations
- High diversity weight → may select inconsistent demonstrations
- Approach: Start with standard trade_off (0.2), adjust based on results
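The trade-off enters as a weighted sum over the two scores. A hedged sketch of selection under this scheme, assuming entropy and repetitiveness are both normalized to [0, 1] (lower combined score wins; the diversity constraint discussed earlier is omitted for brevity):

```python
def candidate_score(entropy, repetitiveness, trade_off=0.2):
    """Combined COSP score: consistent (low-entropy) and non-degenerate
    (low-repetitiveness) candidates score lowest."""
    return entropy + trade_off * repetitiveness

def select_k_best(candidates, k=3, trade_off=0.2):
    """Return the k candidates with the lowest combined score.
    Each candidate is a dict with 'entropy' and 'repetitiveness' keys."""
    ranked = sorted(
        candidates,
        key=lambda c: candidate_score(c["entropy"], c["repetitiveness"], trade_off),
    )
    return ranked[:k]
```

Raising `trade_off` penalizes redundant phrasing more; lowering it leans harder on consistency alone.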
Token/Context Constraints:
When demonstrations exceed context window:
- Reduce k (fewer demonstrations)
- Compress demonstration reasoning
- Select shorter questions/answers
- Use models with larger context
Handling Incomplete Information:
When unlabeled questions lack context:
- Include context in question format
- Use COSP-FS to establish reasoning patterns
- Add explicit assumption-stating in prompts
Error Recovery:
| Error                      | Recovery                          |
| -------------------------- | --------------------------------- |
| All candidates low quality | Use COSP-FS bootstrap             |
| Extraction fails           | Fall back to full-text matching   |
| API timeout                | Retry with exponential backoff    |
| Context overflow           | Reduce k, compress demonstrations |
Advanced Techniques
Clarity and Context Optimization
Ensuring Demonstration Clarity:
- Format demonstrations consistently
- Include clear reasoning transitions
- Show explicit calculations
- End with unambiguous answer format
Example of Clear Demonstration:
Q: A store has 45 apples. They sell 12 in the morning and 18 in the afternoon. How many apples remain?
A: Let's solve this step by step.
Step 1: Start with 45 apples.
Step 2: Sold in morning: 45 - 12 = 33 apples remain.
Step 3: Sold in afternoon: 33 - 18 = 15 apples remain.
Therefore, the answer is 15.
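Demonstrations in this shape can be assembled programmatically so that every selected example follows the same template; a simple illustrative helper:

```python
def format_demonstration(question, reasoning, answer):
    """Render a demonstration in the consistent Q/A format shown above,
    ending with an unambiguous answer statement."""
    return f"Q: {question}\nA: {reasoning}\nTherefore, the answer is {answer}."
```

Using one formatter for all k demonstrations also keeps answer extraction trivial in Stage 2.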
Context Optimization:
- Include only relevant demonstrations
- Order by relevance to test question (if measurable)
- Remove redundant reasoning steps
- Balance completeness with conciseness
Demonstration Design Principles:
| Principle    | Implementation                     |
| ------------ | ---------------------------------- |
| Clarity      | Explicit steps, clear language     |
| Completeness | All necessary reasoning shown      |
| Consistency  | Same format across demonstrations  |
| Relevance    | Similar to expected test questions |
Advanced Reasoning Patterns
Multi-Step Verification:
Add verification to demonstrations:
...calculation steps...
Let me verify: 15 + 12 + 18 = 45 ✓
The answer is 15.
Decomposition Pattern:
For complex problems, demonstrate decomposition:
Q: [Complex problem]
A: Let me break this into parts.
Part 1: [Subproblem] → [Solution]
Part 2: [Subproblem] → [Solution]
Combining: [Integration] → [Final answer]
Self-Correction Pattern:
Demonstrate error detection:
...initial reasoning...
Wait, let me check: [verification]
Actually, I made an error. [correction]
The correct answer is [corrected answer].
Interaction Patterns
Iterative COSP (Advanced):
def iterative_cosp(questions, iterations=2):
    """Refine demonstrations over multiple passes. generate_with_demos and
    select_best are assumed to implement the generation and scoring logic above."""
    demonstrations = None
    for i in range(iterations):
        # Generate candidates (conditioning on previous demos if available)
        candidates = generate_with_demos(questions, demonstrations)
        # Select new demonstrations
        demonstrations = select_best(candidates)
    return demonstrations
Chaining with Other Techniques:
COSP can be combined with:
- RAG: Retrieve relevant knowledge, then apply COSP
- Self-Refinement: Use COSP output as input to refinement
- Verification: Add verification step after COSP inference
Model Considerations
Model-Specific Adaptations:
| Model          | Adaptation                                 |
| -------------- | ------------------------------------------ |
| GPT-4          | Standard COSP works well                   |
| GPT-3.5        | May need COSP-FS, lower expectations       |
| Claude         | Works well, may need format adjustment     |
| PaLM           | Original evaluation model, standard config |
| Llama 70B+     | Needs more demonstrations (k=5+)           |
| Smaller models | Not recommended                            |
Cross-Model Considerations:
- Demonstrations generated by one model may not transfer well to another
- Re-run Stage 1 when switching models
- Answer format may differ between models
Handling Model Updates:
- Re-evaluate COSP periodically with model updates
- Demonstrations may become suboptimal with new model versions
- Monitor performance for degradation
Efficiency Optimization
Token Minimization:
def compress_demonstration(demo):
"""Compress demonstration while preserving key information."""
# Remove filler phrases
compressed = demo.replace("Let me think about this.", "")
compressed = compressed.replace("So, ", "")
# Combine short steps
# ... additional compression logic
return compressed
Batch Processing:
import asyncio

async def batch_cosp_inference(test_questions, demos, batch_size=5):
    """Process multiple questions in parallel; async_inference is assumed
    to be an async wrapper around Stage 2 inference."""
    async def process_one(q):
        return await async_inference(q, demos)
    results = []
    for i in range(0, len(test_questions), batch_size):
        batch = test_questions[i:i + batch_size]
        batch_results = await asyncio.gather(*[process_one(q) for q in batch])
        results.extend(batch_results)
    return results
Caching Strategy:
import hashlib
import json
class COSPCache:
def __init__(self, cache_file="cosp_cache.json"):
self.cache_file = cache_file
self.cache = self._load_cache()
def _hash_questions(self, questions):
return hashlib.md5(json.dumps(sorted(questions)).encode()).hexdigest()
def get_demonstrations(self, questions):
key = self._hash_questions(questions)
return self.cache.get(key)
def store_demonstrations(self, questions, demos):
key = self._hash_questions(questions)
self.cache[key] = demos
self._save_cache()
Safety and Robustness
Output Validation:
def validate_cosp_output(answer, expected_format):
    """Validate COSP output meets expected format."""
    if expected_format == "numeric":
        try:
            float(answer)
            return True
        except (TypeError, ValueError):
            return False
    elif expected_format == "yes_no":
        return answer.lower() in ["yes", "no"]
    # ... additional formats
    return True
Consistency Monitoring:
import logging
from collections import Counter

def monitor_consistency(results, threshold=0.7):
    """Monitor self-consistency across inference runs."""
    counter = Counter(results)
    most_common_count = counter.most_common(1)[0][1]
    consistency = most_common_count / len(results)
    if consistency < threshold:
        logging.warning(f"Low consistency: {consistency:.2%}")
        return False, consistency
    return True, consistency
Fallback Mechanisms:
import logging

def cosp_with_fallback(question, demos, config):
    """COSP with fallback to zero-shot; cosp_inference_with_confidence and
    zero_shot_cot are assumed defined elsewhere."""
    try:
        answer, consistency = cosp_inference_with_confidence(question, demos, config)
        if consistency < 0.5:
            # Fall back to simple zero-shot CoT
            return zero_shot_cot(question)
        return answer
    except Exception as e:
        logging.error(f"COSP failed: {e}")
        return zero_shot_cot(question)
Domain Adaptation
Adapting to New Domains:
- Collect 20-50 unlabeled questions from new domain
- Run Stage 1 with domain questions
- Validate demonstration quality manually
- Test on held-out domain examples
- Adjust parameters if needed
Domain-Specific Considerations:
| Domain  | Adaptation                                               |
| ------- | -------------------------------------------------------- |
| Medical | Use medical terminology in questions, validate carefully |
| Legal   | Longer reasoning chains, may need k=5+                   |
| Code    | Adjust answer extraction for code outputs                |
| Math    | Standard COSP works well                                 |
Quick Domain Adaptation:
def quick_domain_adapt(domain_questions, generic_demos=None):
"""Quick adaptation to new domain."""
# If we have generic demos, use COSP-FS approach
if generic_demos:
candidates = generate_with_demos(domain_questions, generic_demos)
else:
candidates = generate_candidates(domain_questions)
# Select domain-specific demonstrations
return select_best(candidates)
Risk and Ethics
Ethical Considerations
What COSP Reveals About LLMs:
- Models have implicit confidence that correlates with correctness
- Self-generated content can improve model performance
- Consistency is a learnable, measurable property
Bias Considerations:
- Demonstrations inherit biases from model's zero-shot outputs
- Consistent biased answers will be selected as demonstrations
- No mechanism to detect or correct systematic biases
Transparency Concerns:
- Selected demonstrations may not represent optimal reasoning
- Users may not understand why certain demonstrations were chosen
- Automated selection obscures human oversight
Risk Analysis
Failure Modes:
| Failure Mode                   | Likelihood | Impact | Mitigation                       |
| ------------------------------ | ---------- | ------ | -------------------------------- |
| Consistently wrong answers     | Medium     | High   | Validate demonstrations manually |
| Biased demonstration selection | Medium     | Medium | Audit selected demonstrations    |
| Poor quality on edge cases     | High       | Medium | Test on diverse cases            |
| Format extraction errors       | Medium     | Low    | Robust parsing, fallbacks        |
Cascading Failures:
Bad demonstrations selected →
Incorrect reasoning patterns primed →
Test inference follows bad patterns →
Systematic errors on test set
Safety Concerns:
- Prompt injection: Unlabeled questions could contain adversarial content
- Mitigation: Sanitize inputs, validate question format
Bias Amplification:
- COSP may amplify existing model biases by selecting "confident" biased outputs
- Detection: Audit demonstrations for bias patterns
- Mitigation: Diverse question pool, explicit fairness constraints
Innovation Potential
Derived Innovations:
- Task-adaptive COSP: Automatically detect task type and adjust scoring
- Continuous learning COSP: Update demonstrations based on feedback
- Multi-model COSP: Use different models for generation vs selection
- Hierarchical COSP: Multi-level demonstration selection
Novel Combinations:
- COSP + RAG: Retrieve knowledge, then select demonstrations
- COSP + Verification: Add automated verification of selected demonstrations
- COSP + Active Learning: Use entropy to identify questions needing labels
Ecosystem and Integration
Tools and Frameworks
Framework Support:
| Framework  | COSP Support                   |
| ---------- | ------------------------------ |
| LangChain  | Custom chain implementation    |
| DSPy       | Can implement as custom module |
| Instructor | Built-in COSP guidance         |
| Haystack   | Custom pipeline component      |
Evaluation Tools:
- Standard accuracy metrics
- Consistency measurement
- Demonstration quality scoring
- Entropy distribution analysis
Pre-built Resources:
- Instructor library includes COSP documentation and examples
- LearnPrompting provides educational materials
- Research implementations available (reference original paper)
Related Techniques
Closely Related:
| Technique        | Relationship                                               |
| ---------------- | ---------------------------------------------------------- |
| Self-Consistency | COSP uses self-consistency for scoring and inference       |
| Auto-CoT         | Similar goal (automatic demos), different selection method |
| USP              | Extends COSP to general NLP tasks                          |
| Zero-Shot CoT    | Foundation that COSP builds upon                           |
Comparison Table:
| Aspect              | COSP      | Auto-CoT         | Self-Consistency | Manual Few-Shot |
| ------------------- | --------- | ---------------- | ---------------- | --------------- |
| Labeled data        | No        | No               | No               | Yes             |
| Automatic selection | Yes       | Yes (clustering) | N/A              | No              |
| Multi-stage         | Yes       | Yes              | No               | No              |
| Task scope          | Reasoning | Reasoning        | Any              | Any             |
| Compute cost        | High      | Medium           | Medium           | Low             |
Integration Patterns
With RAG Systems:
def cosp_with_rag(question, retriever, demos):
"""Integrate COSP with retrieval."""
# Retrieve relevant context
context = retriever.retrieve(question)
# Augment question with context
augmented_q = f"Context: {context}\n\nQuestion: {question}"
# Run COSP inference
return cosp_inference(augmented_q, demos)
With Agent Systems:
class COSPAgent:
"""Agent that uses COSP for reasoning steps."""
def __init__(self, tools, demo_pool):
self.tools = tools
self.cosp_demos = self._generate_demos(demo_pool)
def reason(self, task):
# Use COSP for reasoning
reasoning = cosp_inference(task, self.cosp_demos)
# Extract action from reasoning
action = self._parse_action(reasoning)
return action
Transition Strategies:
From Zero-Shot to COSP:
- Collect unlabeled questions from production queries
- Run Stage 1 to generate demonstrations
- A/B test COSP vs zero-shot
- Gradually increase COSP traffic if positive
From COSP to Fine-Tuning:
- Use COSP-generated demonstrations as training data
- Filter to high-confidence examples
- Fine-tune on selected examples
- Compare fine-tuned model to COSP
Production Integration
Deployment Architecture:
┌─────────────────────────────────────────┐
│ Production System │
├─────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Demo Cache │ │ Test Query │ │
│ │ (Stage 1) │───▶│ (Stage 2) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Periodic │ │ LLM API │ │
│ │ Refresh │ │ │ │
│ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────┘
Monitoring:
- Track accuracy over time
- Monitor demonstration quality scores
- Alert on consistency degradation
- Log entropy distributions
Versioning:
- Version demonstration sets
- Track which demo version produced each result
- Enable rollback to previous demonstrations
Future Directions
Emerging Innovations
Active COSP:
Using entropy scores to actively identify questions that need human labels:
- High entropy questions are difficult for the model
- Prioritize these for human annotation
- Combine with active learning frameworks
Continuous COSP:
Updating demonstrations based on feedback:
- Track which demonstrations lead to correct answers
- Dynamically adjust demonstration pool
- Learn optimal scoring weights from outcomes
Multi-Modal COSP:
Extending to vision-language tasks:
- Generate demonstrations from image-text pairs
- Measure consistency across visual reasoning
- Select demonstrations covering diverse visual patterns
Research Frontiers
Open Questions:
- Optimal scoring function: Is entropy + repetitiveness optimal, or can we learn better scoring?
- Transfer across tasks: Can demonstrations transfer between similar tasks?
- Scaling laws: How does COSP performance scale with n, m, k?
- Theoretical foundations: Why exactly does consistency correlate with correctness?
Promising Directions:
- Learned demonstration selection: Train a model to select demonstrations
- Compositional COSP: Build complex demonstrations from simple ones
- Adversarial robustness: Make COSP robust to adversarial questions
- Efficiency improvements: Reduce computational overhead while maintaining quality
Integration with Native Reasoning:
As models like o1/o3 incorporate native reasoning:
- COSP principles may be internalized in model training
- External COSP may become unnecessary for advanced models
- Techniques may shift to improving native reasoning consistency
Quick Reference
COSP at a Glance
Purpose: Automatic zero-shot demonstration selection
Key Innovation: Consistency-based scoring for self-generated examples
Input: Unlabeled questions + test questions
Output: Improved reasoning accuracy
Stage 1: Generate n×m candidates → Score by entropy + repetitiveness → Select k best
Stage 2: Use selected demonstrations → Self-consistency inference → Majority vote
Typical Config: n=20, m=5, k=3, s=5, temperature=0.7, trade_off=0.2
Decision Checklist
Use COSP when:
☑ Task has deterministic, comparable answers
☑ Zero-shot underperforms
☑ No labeled data available
☑ Have unlabeled questions from domain
☑ Computational budget allows multi-sampling
Don't use COSP when:
☒ Open-ended or subjective task
☒ Using native reasoning models (o1, o3)
☒ Real-time latency requirements
☒ No relevant unlabeled questions available
☒ Small models (< 100B parameters)
Performance Expectations
vs Zero-Shot CoT: +10-15% accuracy
vs 5-Shot Manual: Comparable or better on 50-60% of tasks
Compute Cost: 5-10x zero-shot
Latency: 10-60 seconds (Stage 1 cacheable)
References
Primary Paper:
- Wan, X., Sun, R., Dai, H., Arik, S. O., & Pfister, T. (2023). Better Zero-Shot Reasoning with Self-Adaptive Prompting. Findings of ACL 2023.
Related Work:
- Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022.
- Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.
- Zhang, Z., et al. (2022). Automatic Chain of Thought Prompting in Large Language Models. ICLR 2023.
- Wan, X., et al. (2023). Universal Self-Adaptive Prompting. EMNLP 2023.