Consistency-based Self-Adaptive Prompting (COSP): A Complete Guide
Consistency-based Self-Adaptive Prompting (COSP) is a zero-shot automatic prompting technique that bridges the gap between zero-shot simplicity and few-shot effectiveness. Rather than requiring manually crafted examples or labeled data, COSP leverages an LLM's own predictions to automatically construct high-quality pseudo-demonstrations. The technique identifies which self-generated responses are most likely correct by measuring consistency across multiple outputs, then uses these reliable examples to guide subsequent inference.
The core insight is elegant: confident and consistent predictions are more likely correct. When a model repeatedly arrives at the same answer through different reasoning paths, that answer is probably right. COSP exploits this principle by having the model generate multiple responses to unlabeled questions, scoring them based on consistency and quality metrics, and selecting the best ones as demonstrations for a second inference pass.
Category: COSP belongs to ensembling and self-adaptive prompting techniques. It combines elements of zero-shot Chain-of-Thought prompting with automated few-shot example selection.
Type: Optimization-based and meta-cognitive technique that uses the model's own outputs to improve subsequent performance through automatic demonstration selection.
Scope: COSP includes automatic generation and selection of pseudo-demonstrations, consistency-based scoring, diversity enforcement, and two-stage inference. It excludes manual example curation, labeled data requirements, and fine-tuning. The technique specifically targets reasoning tasks where answers can be compared for consistency.
Why COSP Exists
Core Problems Solved:
- Manual demonstration burden: Few-shot prompting requires carefully crafted examples, which is time-consuming and requires domain expertise
- Labeled data dependency: Traditional few-shot approaches need ground-truth labels, limiting applicability to new domains
- Zero-shot performance gap: Pure zero-shot methods often underperform compared to few-shot, especially on complex reasoning tasks
- Example quality sensitivity: Few-shot performance varies significantly based on example selection, but optimal selection is non-obvious
- Domain adaptation cost: Creating new demonstrations for each domain or task is expensive and doesn't scale
Value Proposition:
- Zero labeled data requirement: Works with only unlabeled test samples and the LLM itself
- Accuracy improvement: Up to 15% gains over zero-shot baselines
- Few-shot parity: Matches or exceeds manually-crafted few-shot performance on many reasoning tasks
- Automatic adaptation: Self-adapts to different tasks without human intervention
- Scalability: Can be applied to any task where answer consistency is measurable
- Cost efficiency: Eliminates human effort in example curation while maintaining quality
Research Foundation
Seminal Work: Wan et al. (2023)
The paper "Better Zero-Shot Reasoning with Self-Adaptive Prompting" by Xingchen Wan, Ruoxi Sun, Hanjun Dai, Sercan O. Arik, and Tomas Pfister at Google Cloud AI Research and Google DeepMind introduced COSP. Published at ACL 2023 (Findings), this work established that LLMs can effectively bootstrap their own demonstrations by identifying high-confidence outputs.
Key Findings:
- Performance gains: Up to 15% improvement compared to zero-shot baselines
- Few-shot parity: Matches or exceeds 5-shot CoT baselines across multiple reasoning benchmarks
- Consistency-correctness correlation: Normalized entropy of answer distributions is a strong proxy for correctness
- Multi-model validation: Demonstrated effectiveness across PaLM-62B, PaLM-540B, and GPT-3 (code-davinci-001)
Building on Prior Work:
COSP builds directly on several foundational techniques:
- Zero-Shot CoT (Kojima et al., 2022): The "Let's think step by step" trigger that enables reasoning without examples
- Self-Consistency (Wang et al., 2022): Multiple sampling and majority voting to improve reliability
- Auto-CoT (Zhang et al., 2022): Automatic demonstration generation through clustering
COSP's innovation was combining these elements with a principled scoring function that balances consistency, repetition avoidance, and diversity.
Follow-up Work: Universal Self-Adaptive Prompting (USP)
Published at EMNLP 2023, USP extended COSP's principles beyond reasoning tasks to general NLU and NLG applications. While COSP focuses on tasks with clear, verifiable answers, USP introduces task-specific confidence measures for classification, short-form generation, and long-form generation.
Real-World Performance Evidence
Benchmark Results:
COSP was evaluated on six reasoning benchmarks across three LLMs:
Arithmetic Reasoning:
- MultiArith: Significant gains over zero-shot CoT, approaching few-shot performance
- GSM8K: Improvements with standard COSP; COSP-FS (few-shot in stage 1) outperformed 5-shot CoT
- AddSub: Consistent improvements across all models
- SingleEq: Strong performance gains
Commonsense Reasoning:
- CommonsenseQA: Improvements over zero-shot baseline
- StrategyQA: Gains demonstrated across model sizes
Model-Specific Results:
| Model | Zero-Shot vs COSP | vs Few-Shot |
| ------------------------ | ------------------ | ----------------------------------- |
| PaLM-62B | 10-15% improvement | Matches/exceeds 5-shot on 2/5 tasks |
| PaLM-540B | Significant gains | Matches/exceeds 5-shot on 3/5 tasks |
| GPT-3 (code-davinci-001) | 10-15% improvement | Competitive with 5-shot |
Key Finding: All LLMs using COSP outperformed zero-shot prompting on all tasks except GPT-3 on GSM8K, which required the COSP-FS variant with initial few-shot prompting for demonstration generation.
Comparative Performance:
- vs Zero-Shot CoT: Consistent, often large outperformance across all model and task configurations
- vs 5-Shot CoT with labeled examples: On par or better in majority of cases
- Complex tasks (GSM8K): Required COSP-FS variant, indicating that more difficult problems benefit from few-shot bootstrapping
How COSP Works
Theoretical Foundation
COSP is grounded in a fundamental observation about LLM behavior: when models are confident about an answer, they tend to produce the same answer consistently across multiple generations. This consistency-correctness correlation provides a signal for identifying high-quality outputs without requiring ground-truth labels.
Core Insight: The model's own uncertainty, measured through output consistency, serves as a reliable proxy for correctness. By sampling multiple responses and measuring their agreement, we can identify which predictions are trustworthy enough to serve as demonstrations for subsequent inference.
Conceptual Model:
Traditional Few-Shot: Human selects examples → LLM uses examples → Output
Zero-Shot CoT: Trigger phrase → LLM reasons → Output
COSP: LLM generates candidates → Score by consistency → Select best → LLM uses self-generated examples → Output
Fundamental Ideas:
- Self-consistency as quality signal: Answers that appear repeatedly across different reasoning paths are more likely correct
- Automatic demonstration construction: The model can generate its own high-quality examples
- Multi-criteria selection: Combining consistency, repetition avoidance, and diversity yields better demonstrations than any single criterion
- Two-stage inference: Using self-generated demonstrations in a second pass improves over single-pass zero-shot
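The "self-consistency as quality signal" idea above can be sketched in a few lines: given several sampled answers to one question, the majority answer's share of the votes serves as a confidence proxy. This is a toy illustration of the principle, not the paper's exact scoring function.

```python
from collections import Counter

def consistency_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    counter = Counter(sampled_answers)
    answer, votes = counter.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Five reasoning paths for one question, four of which agree:
answer, confidence = consistency_confidence(["11", "11", "11", "9", "11"])
# answer == "11", confidence == 0.8 -> a likely-correct, demonstration-worthy output
```

High agreement marks the response as a strong pseudo-demonstration candidate; low agreement flags a question the model finds difficult.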
Assumptions and Where They Fail:
Assumption 1: Consistent answers are more likely correct
- Holds: When the model has relevant knowledge and the task has deterministic answers
- Fails: When the model consistently produces the same wrong answer (confident but incorrect), or when tasks have multiple valid answers
Assumption 2: The model can generate useful reasoning chains
- Holds: With sufficiently large models (100B+ parameters) on reasoning tasks
- Fails: With smaller models that generate incoherent reasoning, or on tasks outside the model's competence
Assumption 3: Repetition in reasoning indicates poor quality
- Holds: When repetition reflects genuine redundancy or confusion
- Fails: When repetition is legitimate emphasis or necessary recapitulation
Assumption 4: Diverse demonstrations improve performance
- Holds: When different examples cover different aspects of the problem space
- Fails: When diversity introduces inconsistent or contradictory patterns
Fundamental Trade-offs:
| Trade-off | COSP's Balance |
| ----------------------------- | ------------------------------------------------------------ |
| Automation vs Control | Fully automated selection, less human control |
| Computational cost vs Quality | Multiple generations required, but no labeling cost |
| Generality vs Specialization | Task-agnostic scoring, may not capture task-specific quality |
| Consistency vs Diversity | Balances both through scoring function |
Execution Mechanism
COSP operates in two distinct stages:
Stage 1: Pseudo-Demonstration Generation and Selection
- Input: Unlabeled questions/problems from the target domain
- Generation: For each of the n questions, generate m reasoning chains using Zero-Shot CoT with non-zero temperature
- Scoring: Compute a composite score for each question-response pair based on:
  - Normalized entropy of the answer distribution (consistency)
  - Repetitiveness within the reasoning chain
- Selection: Rank all n × m candidates by score and select the k with the lowest scores
- Output: Set of k pseudo-demonstrations (question + reasoning + answer)
Stage 2: Test Inference
- Prompt construction: Concatenate selected pseudo-demonstrations with test question
- Generation: Generate multiple reasoning chains for the test question
- Aggregation: Apply majority voting across chains to determine final answer
- Output: Final predicted answer
Detailed Execution Flow:
Stage 1:
Questions Q₁...Qₙ → [Zero-Shot CoT with temp > 0] →
For each Qᵢ: Generate m responses Rᵢ₁...Rᵢₘ →
Extract answers Aᵢ₁...Aᵢₘ →
Compute entropy(Aᵢ₁...Aᵢₘ) →
Compute repetitiveness(Rᵢⱼ) for each response →
Score = entropy + λ × repetitiveness →
Select k lowest-scoring (Qᵢ, Rᵢⱼ, Aᵢⱼ) tuples
Stage 2:
Test question Qₜₑₛₜ →
Prepend selected demonstrations →
Generate multiple reasoning paths →
Majority vote on answers →
Return final answer
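The Stage 1 scoring step in the flow above can be made concrete with a small, self-contained sketch: normalized entropy over the extracted answers plus a weighted repetitiveness term. The λ weight and the toy repetitiveness values are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def normalized_entropy(answers):
    """Normalized Shannon entropy of the answer distribution (0 = unanimous)."""
    counts = Counter(answers)
    total = len(answers)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    max_h = math.log(total)
    return h / max_h if max_h > 0 else 0.0

def cosp_score(answers, repetitiveness, lam=0.2):
    """Lower is better: consistent answers and non-repetitive reasoning."""
    return normalized_entropy(answers) + lam * repetitiveness

# A unanimous question outranks (scores lower than) a split one:
unanimous = cosp_score(["42", "42", "42", "42"], repetitiveness=0.1)
split = cosp_score(["42", "17", "8", "42"], repetitiveness=0.1)
assert unanimous < split
```

Ranking all n × m candidate responses by this score and keeping the k lowest yields the pseudo-demonstration set used in Stage 2.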
Cognitive Processes Triggered:
- Pattern recognition: Selected demonstrations prime the model to recognize problem structure
- Reasoning template application: High-quality demonstrations provide reasoning templates
- Answer format alignment: Consistent demonstration format guides output formatting
- Confidence calibration: Multiple sampling enables uncertainty estimation
Is This Single-Pass or Multi-Stage?
COSP is inherently multi-stage:
- Stage 1: Multiple generation passes for candidate creation (n × m generations)
- Stage 2: Multiple generation passes for self-consistency voting
- Minimum API calls: n × m (demonstration generation) + s per test question (self-consistency samples)
Completion Criteria:
- Stage 1 completes when k demonstrations are selected
- Stage 2 completes when majority vote determines the answer
- No iterative refinement between stages (single demonstration selection)
Causal Mechanisms
Why COSP Improves Outputs:
- Quality filtering: Low-entropy responses are more likely correct; selecting these provides better demonstrations than random selection
- Noise reduction: Repetition penalty filters out degenerate reasoning chains that might confuse the model
- Coverage improvement: Diversity encouragement ensures demonstrations cover different problem types or reasoning patterns
- Bootstrapping effect: Using the model's own confident outputs creates a positive feedback loop where good reasoning begets better reasoning
- Format consistency: Self-generated demonstrations naturally match the model's preferred output format
Cascading Effects:
High-quality demonstrations selected →
Better reasoning patterns primed →
More accurate intermediate steps →
Correct final answers →
(If used iteratively) Even better demonstration pool
Feedback Loops:
- Positive: Correct demonstrations improve test accuracy; in iterative settings, this could improve future demonstration quality
- Negative: If initial zero-shot performance is poor, the demonstration pool may lack high-quality candidates, limiting gains
Emergent Behaviors:
- Self-calibration: The entropy scoring implicitly identifies which questions the model finds difficult
- Automatic difficulty stratification: COSP+ variant uses entropy to provide more demonstrations for harder questions
- Domain adaptation: Without explicit programming, COSP adapts to domain-specific reasoning patterns present in the unlabeled questions
Dominant Factors in Effectiveness (Ranked):
- Model capability (40%): Larger models generate higher-quality candidates and better utilize demonstrations
- Consistency-correctness correlation (25%): How well entropy predicts correctness for the specific task
- Demonstration diversity (20%): Coverage of different problem types in selected demonstrations
- Scoring function calibration (15%): Appropriate balance between consistency and repetition penalties
Structure and Components
Essential Components
Required Components:
- Unlabeled question pool: Set of questions from the target domain (no labels needed)
- Zero-Shot CoT trigger: Reasoning elicitation phrase (e.g., "Let's think step by step")
- Scoring function: Weighted combination of entropy and repetitiveness
- Selection mechanism: Ranking and top-k selection
- Aggregation method: Majority voting for final answer
Optional Components:
- Few-shot bootstrap (COSP-FS): Initial few-shot prompting in Stage 1 for complex tasks
- Adaptive demonstration count (COSP+): Variable k based on question difficulty
- Custom repetition detection: Domain-specific repetitiveness scoring
- Diversity constraints: Explicit diversity requirements in selection
Component Hierarchy:
COSP System
├── Stage 1: Demonstration Generation
│ ├── Question Pool (required)
│ ├── Zero-Shot CoT Generator (required)
│ ├── Answer Extractor (required)
│ └── Scoring Module
│ ├── Entropy Calculator (required)
│ ├── Repetitiveness Calculator (required)
│ └── Diversity Enforcer (optional)
├── Stage 2: Test Inference
│ ├── Prompt Constructor (required)
│ ├── Multi-path Generator (required)
│ └── Majority Voter (required)
└── Variants
├── COSP-FS (optional)
└── COSP+ (optional)
Design Principles
Linguistic Patterns:
COSP relies on standard CoT linguistic patterns in generated demonstrations:
- Sequential markers: "First," "Then," "Next," "Finally"
- Reasoning connectors: "Therefore," "Thus," "So," "Because"
- Calculation language: "Let's calculate," "Computing," "This gives us"
- Conclusion signals: "The answer is," "Therefore, the answer is"
Cognitive Principles Leveraged:
- Metacognition: Using consistency as a self-assessment of knowledge certainty
- Learning by example: Demonstrations prime specific reasoning patterns
- Redundancy detection: Recognizing that repetitive reasoning indicates confusion
- Ensemble wisdom: Multiple perspectives (diverse demonstrations) improve robustness
Core Design Principles:
| Principle | Implementation in COSP |
| ---------------------- | -------------------------------------------- |
| Self-reliance | Uses model's own outputs, no external labels |
| Quality over quantity | Selects few high-quality demonstrations |
| Uncertainty awareness | Entropy measures confidence |
| Diversity preservation | Avoids selecting redundant examples |
| Simplicity | Straightforward scoring function |
Structural Patterns
Minimal Pattern:
[Stage 1 - Implicit]
Generate responses to unlabeled questions
Select most consistent ones
[Stage 2 - Prompt]
Q: [Selected question 1]
A: [Selected reasoning and answer 1]
Q: [Selected question 2]
A: [Selected reasoning and answer 2]
Q: [Test question]
A: Let's think step by step.
Standard Pattern:
[Stage 1 Prompt - for each unlabeled question]
Q: {unlabeled_question}
A: Let's think step by step.
[Repeat m times with temperature > 0, collect answers]
[Score all n×m responses]
[Select top k]
[Stage 2 Prompt]
Q: {selected_question_1}
A: {selected_reasoning_1}. The answer is {selected_answer_1}.
Q: {selected_question_2}
A: {selected_reasoning_2}. The answer is {selected_answer_2}.
Q: {selected_question_3}
A: {selected_reasoning_3}. The answer is {selected_answer_3}.
Q: {test_question}
A: Let's think step by step.
Advanced Pattern (COSP-FS for Complex Tasks):
[Stage 1 Prompt - with few-shot bootstrap]
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many balls does he have?
A: Roger started with 5 balls. He bought 2 × 3 = 6 balls. Total: 5 + 6 = 11. The answer is 11.
Q: {unlabeled_question}
A: Let's think step by step.
[Generate m responses, score, select k]
[Stage 2 - same as standard pattern with selected demonstrations]
COSP+ Pattern (Adaptive Demonstration Count):
[After Stage 1 scoring]
For test question with entropy E:
If E < threshold_low: use k_min demonstrations
If E > threshold_high: use k_max demonstrations
Else: use k_standard demonstrations
[Stage 2 with variable demonstration count based on difficulty]
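The COSP+ rule above reduces to a small threshold function: the test question's answer entropy (from Stage 1-style sampling) determines how many demonstrations to attach. The threshold and k values here are illustrative assumptions, not values from the paper.

```python
def adaptive_k(entropy, k_min=2, k_standard=3, k_max=5,
               threshold_low=0.2, threshold_high=0.6):
    """Choose a demonstration count from the test question's answer entropy.

    Low entropy -> the model is already confident, so fewer demonstrations
    suffice; high entropy -> a harder question that gets more demonstrations.
    """
    if entropy < threshold_low:
        return k_min
    if entropy > threshold_high:
        return k_max
    return k_standard

assert adaptive_k(0.1) == 2   # confident question
assert adaptive_k(0.4) == 3   # typical question
assert adaptive_k(0.9) == 5   # difficult question
```

This spends the extra prompt-length budget only where the entropy signal says it is needed.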
Modifications for Different Scenarios
High-Complexity Tasks (e.g., GSM8K):
- Use COSP-FS variant with few-shot bootstrap in Stage 1
- Increase m (samples per question) for better candidate pool
- Increase k (selected demonstrations) for more context
- Consider COSP+ for adaptive demonstration count
Ambiguous or Open-Ended Tasks:
- COSP may struggle; consider USP for such tasks
- Increase diversity weight in selection
- Use domain-specific repetitiveness detection
- May need task-specific consistency metrics
Format-Critical Tasks:
- Ensure demonstration format matches expected output format
- Add explicit format instructions in Stage 2 prompt
- Consider post-processing to extract structured answers
Domain-Specific Applications:
- Use domain-specific unlabeled questions for demonstration generation
- May need custom answer extraction for non-standard formats
- Consider domain terminology in repetitiveness scoring
Resource-Constrained Settings:
- Reduce m (fewer candidates per question)
- Reduce n (smaller question pool)
- Use greedy selection instead of global ranking
- Consider caching demonstrations across similar queries
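For the caching suggestion above, a minimal sketch that persists selected demonstrations as JSON, so the expensive Stage 1 runs once and is reused across queries and sessions. The file path and record shape are assumptions for illustration.

```python
import json
from pathlib import Path

def save_demonstrations(demos, path="cosp_demos.json"):
    """Persist Stage 1 output so it can be reused without regeneration."""
    Path(path).write_text(json.dumps(demos, indent=2))

def load_demonstrations(path="cosp_demos.json"):
    """Load cached demonstrations; return None if Stage 1 hasn't run yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else None

demos = [{"question": "2 + 3?", "reasoning": "2 + 3 = 5.", "answer": "5"}]
save_demonstrations(demos, "demos.json")
assert load_demonstrations("demos.json") == demos
```

Remember that cached demonstrations are domain-specific: regenerate the cache when the question distribution changes.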
Applications and Task Selection
General Applications
Arithmetic Reasoning:
COSP excels at mathematical word problems where answers are unambiguous:
- Multi-step arithmetic (MultiArith, AddSub, SingleEq)
- Grade school math (GSM8K with COSP-FS)
- Algebraic problem solving
- Numerical reasoning tasks
Commonsense Reasoning:
Effective on structured commonsense tasks:
- StrategyQA (yes/no commonsense questions)
- CommonsenseQA (multiple choice)
- Physical reasoning with discrete answers
- Temporal and spatial reasoning
Logical Reasoning:
Applicable to logic tasks with verifiable answers:
- Syllogistic reasoning
- Deductive logic problems
- Constraint satisfaction
- Symbolic reasoning tasks
Question Answering:
Works well for extractive and discrete QA:
- Factoid questions with clear answers
- Reading comprehension with extractable answers
- Multi-hop QA requiring reasoning chains
Domain-Specific Applications
Education:
- Automated tutoring for math problems
- Self-improving problem solution generation
- Adaptive difficulty assessment using entropy scores
- Homework assistance systems
Scientific Computing:
- Unit conversion and dimensional analysis
- Scientific calculation problems
- Data interpretation tasks
- Experimental design reasoning
Business Analytics:
- Financial calculations with clear answers
- Metric computations
- Quantitative business problems
- ROI and cost-benefit analysis
Legal and Compliance:
- Regulatory compliance checking (yes/no determinations)
- Policy interpretation with discrete outcomes
- Contract clause analysis with binary decisions
Unconventional Applications:
- Code output prediction: Predicting program outputs for given inputs
- Game strategy: Move selection in deterministic games
- Puzzle solving: Logic puzzles with verifiable solutions
- Scheduling optimization: Constraint satisfaction problems
Selection Framework
Problem Characteristics Favoring COSP:
| Characteristic | Why It Helps |
| ------------------------------ | -------------------------------------- |
| Deterministic answers | Enables consistency measurement |
| Reasoning required | Benefits from CoT demonstrations |
| Multiple valid reasoning paths | Allows diverse demonstration selection |
| Clear answer extraction | Enables entropy calculation |
| Domain with unlabeled examples | Provides demonstration candidates |
Scenarios COSP is Optimized For:
- Zero-shot reasoning with no labeled data available
- Tasks where few-shot example creation is expensive
- Domains requiring automatic adaptation
- Applications needing consistent, reproducible reasoning
- Settings where answer correctness can be verified post-hoc
Scenarios NOT Recommended For:
- Open-ended generation: No clear answer consistency metric (use USP instead)
- Highly subjective tasks: Multiple valid answers break consistency assumption
- Very simple tasks: Overhead not justified for single-step problems
- Tasks with no similar unlabeled data: Cannot generate relevant demonstrations
- Real-time applications: Multiple generation passes add latency
- Small models: Requires 100B+ parameters for quality reasoning
Selection Signals - When to Choose COSP:
✓ You have unlabeled questions from the target domain
✓ Answers can be compared for consistency (discrete, extractable)
✓ Zero-shot CoT underperforms but you lack labeled examples
✓ Task involves multi-step reasoning
✓ Computational budget allows multiple API calls
✓ Few-shot examples would require significant domain expertise
Selection Signals - When NOT to Choose COSP:
✗ Task has subjective or open-ended answers
✗ Real-time latency constraints (< 2 seconds)
✗ No unlabeled questions available from target domain
✗ Using native reasoning models (o1, o3) that don't need demonstrations
✗ Simple retrieval or pattern matching tasks
✗ Very limited computational budget
Model Requirements:
| Level | Specification |
| ------------ | ---------------------------------------------------- |
| Minimum | 100B+ parameters, instruction-tuned |
| Recommended | PaLM-62B, GPT-3.5, Claude 2, Llama 70B+ |
| Optimal | PaLM-540B, GPT-4, Claude 3, Llama 405B |
| Not Suitable | Models < 50B, base models without instruction tuning |
Required Model Capabilities:
- Zero-Shot CoT reasoning ability
- Consistent output format
- Temperature-controlled sampling
- Sufficient context window for demonstrations (4K+ tokens)
Context and Resource Requirements:
| Resource | Typical Usage |
| ------------------- | ------------------------------------------------- |
| Unlabeled questions | 10-50 questions for demonstration pool |
| Generation budget | n × m + s calls (e.g., 20 × 5 + 10 = 110 calls) |
| Context window | 2000-4000 tokens for k=3 demonstrations |
| Latency | 10-60 seconds total (Stage 1 can be pre-computed) |
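The generation budget above is simple arithmetic: a one-time Stage 1 cost of n × m calls (amortizable across queries), plus s self-consistency calls per test question. A one-line sanity check:

```python
def cosp_call_budget(n, m, s, num_test_questions):
    """Total LLM calls: one-time Stage 1 (n*m) plus s per test question."""
    return n * m + s * num_test_questions

# The worked example: n=20, m=5, s=10, one test question -> 110 calls
assert cosp_call_budget(20, 5, 10, 1) == 110
```

For large test sets the per-question s term dominates, which is why Stage 1 caching matters less than tuning s.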
Cost Implications:
One-Time Costs (Stage 1):
- n × m API calls for demonstration generation
- Embedding API calls for repetitiveness calculation
- Can be amortized across many test queries
Per-Request Costs (Stage 2):
- s API calls for self-consistency (typically 5-10)
- Longer prompts due to demonstrations
- 2-5x cost of simple zero-shot
Cost-Quality Trade-offs:
| Configuration | Cost | Expected Gain |
| --------------------------- | -------------- | ---------------- |
| COSP (m=2, k=3, s=5) | ~5x zero-shot | +10-15% accuracy |
| COSP-lite (m=2, k=2, s=3) | ~3x zero-shot | +7-12% accuracy |
| COSP-heavy (m=5, k=5, s=10) | ~10x zero-shot | +12-18% accuracy |
Variant Selection Guide:
| Variant | Best For |
| --------------- | --------------------------------------------------------- |
| COSP (standard) | Most reasoning tasks, moderate complexity |
| COSP-FS | Complex tasks (GSM8K), when zero-shot candidates are poor |
| COSP+ | Heterogeneous difficulty, adaptive resources |
| COSP-lite | Cost-constrained, still want automation |
When to Escalate to Alternatives:
- To USP: When task is classification, summarization, or open-ended
- To Manual Few-Shot: When COSP underperforms and you have expert examples
- To Fine-Tuning: When COSP ceiling reached and large labeled dataset available
- To Native Reasoning Models: When using o1/o3 (built-in reasoning superior)
Implementation
Step-by-Step Implementation
Prerequisites:
- Access to LLM API with temperature control
- Embedding API for repetitiveness scoring (optional but recommended)
- Set of unlabeled questions from target domain
- Answer extraction logic for the task
Phase 1: Setup and Configuration
# Configuration parameters
config = {
"n": 20, # Number of unlabeled questions
"m": 5, # Reasoning chains per question
"k": 3, # Demonstrations to select
"s": 5, # Self-consistency samples
"temperature": 0.7, # For diverse sampling
"trade_off": 0.2, # Repetitiveness weight
}
Phase 2: Stage 1 - Demonstration Generation
import openai
import numpy as np
from collections import Counter
def generate_candidates(questions, m, temperature):
"""Generate m reasoning chains for each question."""
candidates = []
for q in questions:
for _ in range(m):
prompt = f"Q: {q}\nA: Let's think step by step."
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=500
)
reasoning = response.choices[0].message.content
answer = extract_answer(reasoning)
candidates.append({
"question": q,
"reasoning": reasoning,
"answer": answer
})
return candidates
def extract_answer(reasoning):
"""Extract final answer from reasoning chain."""
import re
# Customize based on task format
match = re.search(r'(?:answer is|=)\s*(-?\d+(?:\.\d+)?)', reasoning.lower())
return match.group(1) if match else reasoning.split()[-1]
Phase 3: Scoring Function
def compute_entropy(answers):
"""Compute normalized entropy of answer distribution."""
if len(answers) <= 1:
return 0.0
counter = Counter(answers)
total = len(answers)
probabilities = [count / total for count in counter.values()]
entropy = -sum(p * np.log(p) for p in probabilities if p > 0)
max_entropy = np.log(len(answers))
return entropy / max_entropy if max_entropy > 0 else 0.0
def compute_repetitiveness(reasoning, get_embedding):
"""Compute repetitiveness score using sentence embeddings."""
import re
# Split into sentences
sentences = re.split(r'[.!?]+', reasoning)
sentences = [s.strip() for s in sentences if s.strip()]
if len(sentences) <= 1:
return 0.0
# Get embeddings
embeddings = [get_embedding(s) for s in sentences]
# Compute pairwise cosine similarities
similarities = []
for i in range(len(embeddings)):
for j in range(i + 1, len(embeddings)):
sim = cosine_similarity(embeddings[i], embeddings[j])
similarities.append(sim)
return np.mean(similarities) if similarities else 0.0
def cosine_similarity(a, b):
"""Compute cosine similarity between two vectors."""
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def score_candidates(candidates, m, trade_off, get_embedding):
"""Score all candidates and return sorted list."""
# Group by question
questions = {}
for c in candidates:
q = c["question"]
if q not in questions:
questions[q] = []
questions[q].append(c)
scored = []
for q, responses in questions.items():
answers = [r["answer"] for r in responses]
entropy = compute_entropy(answers)
for r in responses:
rep = compute_repetitiveness(r["reasoning"], get_embedding)
score = entropy + trade_off * rep
scored.append({**r, "score": score, "entropy": entropy})
return sorted(scored, key=lambda x: x["score"])
Phase 4: Demonstration Selection
def select_demonstrations(scored_candidates, k):
"""Select top k demonstrations with diversity."""
selected = []
seen_questions = set()
for candidate in scored_candidates:
# Optional: enforce question diversity
if candidate["question"] not in seen_questions:
selected.append(candidate)
seen_questions.add(candidate["question"])
if len(selected) >= k:
break
return selected
Phase 5: Stage 2 - Test Inference
def build_prompt(demonstrations, test_question):
"""Construct prompt with demonstrations."""
prompt_parts = []
for demo in demonstrations:
prompt_parts.append(f"Q: {demo['question']}")
prompt_parts.append(f"A: {demo['reasoning']}")
prompt_parts.append("")
prompt_parts.append(f"Q: {test_question}")
prompt_parts.append("A: Let's think step by step.")
return "\n".join(prompt_parts)
def cosp_inference(test_question, demonstrations, s, temperature=0.7):
"""Run COSP inference with self-consistency."""
prompt = build_prompt(demonstrations, test_question)
answers = []
for _ in range(s):
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=500
)
reasoning = response.choices[0].message.content
answer = extract_answer(reasoning)
answers.append(answer)
# Majority vote
counter = Counter(answers)
return counter.most_common(1)[0][0]
Complete COSP Pipeline:
def cosp_pipeline(unlabeled_questions, test_questions, config):
"""Complete COSP pipeline."""
# Stage 1: Generate and select demonstrations
print("Stage 1: Generating candidates...")
candidates = generate_candidates(
unlabeled_questions,
config["m"],
config["temperature"]
)
print("Scoring candidates...")
scored = score_candidates(
candidates,
config["m"],
config["trade_off"],
get_embedding_function()
)
demonstrations = select_demonstrations(scored, config["k"])
print(f"Selected {len(demonstrations)} demonstrations")
# Stage 2: Run inference on test questions
print("Stage 2: Running inference...")
results = []
for test_q in test_questions:
answer = cosp_inference(
test_q,
demonstrations,
config["s"]
)
results.append({"question": test_q, "answer": answer})
return results, demonstrations
Platform-Specific Implementations
OpenAI API:
import openai
client = openai.OpenAI(api_key="your-key")
def get_embedding(text):
response = client.embeddings.create(
model="text-embedding-ada-002",
input=text
)
return response.data[0].embedding
def generate_response(prompt, temperature=0.7):
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=500
)
return response.choices[0].message.content
Anthropic Claude:
import anthropic
client = anthropic.Anthropic(api_key="your-key")
def generate_response_claude(prompt, temperature=0.7):
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return message.content[0].text
LangChain Integration:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
class COSPChain:
def __init__(self, model_name="gpt-4", config=None):
self.llm = ChatOpenAI(model=model_name, temperature=0.7)
self.config = config or {"m": 5, "k": 3, "s": 5}
self.demonstrations = None
def generate_demonstrations(self, questions):
"""Stage 1: Generate and select demonstrations."""
candidates = []
for q in questions:
for _ in range(self.config["m"]):
prompt = f"Q: {q}\nA: Let's think step by step."
response = self.llm.invoke([HumanMessage(content=prompt)])
candidates.append({
"question": q,
"reasoning": response.content,
"answer": self._extract_answer(response.content)
})
self.demonstrations = self._select_best(candidates)
return self.demonstrations
def inference(self, test_question):
"""Stage 2: Run inference with demonstrations."""
if not self.demonstrations:
raise ValueError("Must generate demonstrations first")
prompt = self._build_prompt(test_question)
answers = []
for _ in range(self.config["s"]):
response = self.llm.invoke([HumanMessage(content=prompt)])
answers.append(self._extract_answer(response.content))
return max(set(answers), key=answers.count)
Configuration
Key Parameters:
| Parameter | Description | Recommended Value |
| ------------- | ------------------------ | ----------------- |
| n | Unlabeled questions | 20-50 |
| m | Samples per question | 3-7 |
| k | Selected demonstrations | 3-5 |
| s | Self-consistency samples | 5-10 |
| temperature | Sampling temperature | 0.5-0.8 |
| trade_off (λ) | Repetitiveness weight | 0.1-0.3 |
Task-Specific Tuning:
Arithmetic Reasoning:
- Higher m (5-7) for better coverage
- Standard k (3-5)
- Lower temperature (0.5-0.7) for more coherent math
- Consider COSP-FS for GSM8K
Commonsense Reasoning:
- Standard m (3-5)
- k = 3 typically sufficient
- Higher temperature (0.7-0.8) for diversity
- Lower trade_off for more consistency focus
Complex Multi-Step:
- Use COSP-FS variant
- Higher k (5-7) for more context
- Consider COSP+ for adaptive selection
- May need domain-specific answer extraction
Temperature Settings:
| Setting                     | Temperature | Use Case             |
| --------------------------- | ----------- | -------------------- |
| Stage 1 generation          | 0.7-0.9     | Diverse candidates   |
| Stage 2 inference           | 0.5-0.7     | Balanced exploration |
| Final answer (low variance) | 0.0-0.3     | Deterministic output |
Best Practices and Workflow
Do's:
- Pre-compute demonstrations: Run Stage 1 offline and cache demonstrations
- Validate answer extraction: Test extraction logic before full pipeline
- Monitor entropy distribution: Check that scoring differentiates candidates
- Use diverse unlabeled questions: Cover different problem types
- Start with standard config: Tune parameters only if needed
- Verify demonstration quality: Manually inspect selected demonstrations
- Test on held-out set: Don't evaluate on demonstration source questions
Don'ts:
- Don't use COSP with o1/o3: Native reasoning models don't need demonstrations
- Don't apply to subjective tasks: Consistency metric won't work
- Don't skip repetitiveness scoring: It filters degenerate outputs
- Don't use too few unlabeled questions: Need sufficient candidate pool
- Don't ignore computational costs: Budget API calls appropriately
- Don't assume demonstrations transfer: Re-generate for new domains
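Several of the do's and don'ts above (validating extraction, standardizing answer formats) hinge on reliable answer parsing. A minimal regex-based extractor for numeric answers might look like this; the pattern assumes a "the answer is N" convention and is a sketch, not the method's prescribed extractor:

```python
import re

def extract_answer(text):
    """Pull the final numeric answer from a reasoning chain.
    Assumes answers follow the 'the answer is N' convention; falls back
    to the last number appearing in the text."""
    m = re.search(r"answer is\s*:?\s*(-?\d+(?:\.\d+)?)", text, re.IGNORECASE)
    if m:
        return m.group(1)
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None
```

Test this on a handful of known outputs before running the full pipeline.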
Workflow:
1. Collect unlabeled questions (15-30 min)
└─ Gather 20-50 representative questions from target domain
2. Configure COSP parameters (5 min)
└─ Set n, m, k, s, temperature based on task type
3. Run Stage 1 (5-30 min depending on n×m)
└─ Generate candidates, score, select demonstrations
4. Validate demonstrations (10 min)
└─ Manually inspect selected demonstrations for quality
5. Test on sample (10 min)
└─ Run Stage 2 on 5-10 test questions
6. Evaluate and iterate (15-30 min)
└─ Measure accuracy, adjust parameters if needed
7. Deploy (ongoing)
└─ Use cached demonstrations for production inference
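Steps 3 and 5 of this workflow reduce to a thin two-stage driver. In this sketch, `generate_candidates`, `select_best`, and `self_consistent_answer` are hypothetical callables standing in for the generation, scoring, and inference logic described earlier:

```python
def run_cosp(unlabeled_questions, test_questions,
             generate_candidates, select_best, self_consistent_answer, k=3):
    """Minimal two-stage COSP driver: build demonstrations once (Stage 1),
    then answer each test question with self-consistency (Stage 2)."""
    candidates = generate_candidates(unlabeled_questions)  # n × m samples
    demos = select_best(candidates, k=k)                   # entropy + repetitiveness scoring
    return {q: self_consistent_answer(q, demos) for q in test_questions}
```

Because Stage 1 runs once, `demos` is the natural thing to cache for production inference.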
Debugging Decision Tree
Problem: Low overall accuracy
Root Cause Analysis:
├─ Poor demonstration quality?
│ └─ Check: Manually inspect selected demonstrations
│ └─ Fix: Increase n or m, use COSP-FS, improve question pool
├─ Wrong answer extraction?
│ └─ Check: Verify extraction on known examples
│ └─ Fix: Improve regex or parsing logic
├─ Insufficient demonstrations?
│ └─ Check: Try increasing k
│ └─ Fix: Use k=5 instead of k=3
└─ Task not suitable for COSP?
└─ Check: Are answers deterministic and comparable?
└─ Fix: Consider USP or manual few-shot
Problem: Inconsistent results across runs
Root Cause Analysis:
├─ High temperature in Stage 2?
│ └─ Fix: Reduce to 0.3-0.5
├─ Too few self-consistency samples?
│ └─ Fix: Increase s to 7-10
└─ Ambiguous answer format?
└─ Fix: Standardize answer extraction
Problem: Demonstrations are low quality
Root Cause Analysis:
├─ Unlabeled questions too difficult?
│ └─ Fix: Use COSP-FS with bootstrap examples
├─ Scoring function not discriminating?
│ └─ Check: Entropy distribution should have variance
│ └─ Fix: Increase m for better entropy estimation
└─ Repetitiveness scoring broken?
└─ Check: Verify embedding function works
└─ Fix: Use different embedding model
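The repetitiveness check above is embedding-based in the original method. As a cheap stand-in for debugging, repeated n-grams can flag degenerate outputs; this is an approximation for quick diagnosis, not the paper's metric:

```python
from collections import Counter

def repetitiveness_score(text, n=3):
    """Fraction of word trigrams that occur more than once;
    higher values indicate repetitive, degenerate reasoning."""
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)
```

A score near 1.0 on a candidate's reasoning is a strong hint that the embedding-based scoring should have filtered it.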
Problem: High latency
Root Cause Analysis:
├─ Too many candidates (n×m)?
│ └─ Fix: Reduce n to 15, m to 3
├─ Stage 1 not cached?
│ └─ Fix: Pre-compute and cache demonstrations
└─ Too many self-consistency samples?
└─ Fix: Reduce s to 3-5
Common Mistakes:
| Mistake                         | Impact                        | Prevention                   |
| ------------------------------- | ----------------------------- | ---------------------------- |
| Using COSP for open-ended tasks | Poor accuracy                 | Check task suitability first |
| Not caching Stage 1             | Wasted computation            | Always cache demonstrations  |
| Wrong answer format             | Bad entropy scores            | Validate extraction logic    |
| Too small question pool         | Limited demonstration quality | Use 20+ unlabeled questions  |
| Ignoring repetitiveness         | Degenerate demonstrations     | Always include rep scoring   |
Testing and Optimization
Validation Strategy:
- Held-out test set: Never use test questions in demonstration pool
- Cross-validation: For limited data, use k-fold on unlabeled questions
- Ablation testing: Compare COSP vs zero-shot vs manual few-shot
- Component analysis: Test with/without repetitiveness, diversity
Test Coverage:
- Standard cases (60%): Typical problems from target domain
- Edge cases (25%): Unusual inputs, boundary conditions
- Hard cases (15%): Known difficult problems
Quality Metrics:
| Metric                | Description              | Target         |
| --------------------- | ------------------------ | -------------- |
| Accuracy              | Correct answers / total  | Task-dependent |
| Gain over zero-shot   | COSP acc - zero-shot acc | > 5%           |
| Consistency           | Variance across runs     | Low (< 5%)     |
| Demonstration quality | Manual quality score     | High (4/5+)    |
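The first two metrics can be computed directly once gold labels exist for a held-out set; a minimal sketch:

```python
def evaluate_gain(cosp_answers, zero_shot_answers, gold):
    """Accuracy for COSP and a zero-shot baseline, plus the gain over
    zero-shot (target > 5% per the quality metrics above)."""
    def accuracy(preds):
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)
    cosp_acc = accuracy(cosp_answers)
    zs_acc = accuracy(zero_shot_answers)
    return {"cosp": cosp_acc, "zero_shot": zs_acc, "gain": cosp_acc - zs_acc}
```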
Optimization Techniques:
Token Efficiency:
- Compress demonstration reasoning (remove filler)
- Use shorter unlabeled questions
- Reduce k for simpler tasks
Latency Optimization:
- Cache Stage 1 demonstrations
- Batch API calls where possible
- Use streaming for long responses
- Consider parallel Stage 2 calls
Cost Optimization:
- Start with COSP-lite (m=2, k=2, s=3)
- Only increase if accuracy insufficient
- Re-use demonstrations across similar queries
- Use cheaper model for Stage 1 if quality sufficient
Iteration Criteria:
- Stop if accuracy within 2% of few-shot baseline
- Stop if increasing parameters shows < 1% gain
- Maximum 3-4 parameter iterations
- Focus on demonstration quality over quantity
Limitations and Constraints
Known Limitations
1. Task Scope Restriction
COSP is designed for reasoning tasks with deterministic, comparable answers. It cannot effectively handle:
- Open-ended generation (summarization, creative writing)
- Subjective evaluations (style assessment, preference ranking)
- Tasks with multiple equally valid answers
Why fundamental: The consistency metric requires answers that can be meaningfully compared. Without this, the entire selection mechanism breaks down.
2. Consistency-Correctness Assumption
COSP assumes consistent answers are correct, but models can be consistently wrong:
- Systematic biases produce consistent incorrect answers
- Common misconceptions may be repeated confidently
- Majority of training data may contain errors
When this fails:
- Out-of-distribution problems
- Problems requiring uncommon knowledge
- Tasks where model training data is biased
3. Computational Overhead
COSP requires significant additional computation:
- Stage 1: n × m API calls for demonstration generation
- Stage 2: s API calls per test question
- Embedding calls for repetitiveness scoring
Cannot be eliminated: This is inherent to the multi-sampling approach.
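The call counts are easy to budget up front; a small illustrative helper:

```python
def cosp_call_budget(n, m, s, num_test_questions):
    """Total LLM calls: Stage 1 candidate generation (n × m) plus
    Stage 2 self-consistency sampling (s per test question)."""
    stage1 = n * m
    stage2 = s * num_test_questions
    return {"stage1": stage1, "stage2": stage2, "total": stage1 + stage2}
```

With the typical config (n=20, m=5, s=5) and 100 test questions, Stage 1 costs 100 calls and Stage 2 costs 500, so caching Stage 1 amortizes its cost across every future query.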
4. Model Size Dependency
Like other CoT-based methods, COSP requires large models:
- Minimum ~100B parameters for quality reasoning
- Smaller models generate incoherent chains
- Demonstration selection can't fix fundamentally poor reasoning
5. Cold Start Problem
COSP needs unlabeled questions from the target domain:
- New domains without examples can't use COSP
- Quality depends on question pool representativeness
- Mismatched questions lead to irrelevant demonstrations
6. No Iterative Improvement
Standard COSP is two-stage without feedback:
- Selected demonstrations are fixed
- No mechanism to improve based on test performance
- Errors in Stage 1 propagate to Stage 2
Edge Cases
Highly Homogeneous Answers:
When all m samples produce the same answer:
- Entropy = 0, but might still be wrong
- Detection: Check for zero/near-zero entropy across all questions
- Handling: Increase temperature, verify with alternative methods
No Clear Answer Pattern:
When answers are uniformly distributed:
- Entropy is maximum, no confident prediction
- Detection: High entropy across many questions
- Handling: May indicate task unsuitability or need for COSP-FS
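Both edge cases are visible in the normalized entropy of the answer distribution; a sketch of how it can be computed:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Normalized Shannon entropy of the answer distribution:
    0.0 when all m samples agree, 1.0 when answers are uniformly spread."""
    counts = Counter(answers)
    if len(counts) <= 1:
        return 0.0
    n = len(answers)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))
```

Near-zero entropy across all questions signals the homogeneous case above; near-1.0 entropy everywhere signals no clear answer pattern.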
Conflicting High-Quality Demonstrations:
When selected demonstrations suggest different reasoning patterns:
- Detection: Check demonstration consistency
- Handling: Enforce stricter diversity constraints or reduce k
Domain Shift:
When test questions differ significantly from unlabeled pool:
- Detection: Poor test accuracy despite good demonstration quality
- Handling: Expand unlabeled question pool, use domain adaptation
Graceful Degradation:
If COSP accuracy < zero-shot:
→ Fall back to zero-shot CoT
→ Check task suitability
If Stage 1 produces no good candidates:
→ Switch to COSP-FS
→ Reduce scoring thresholds
If latency exceeds budget:
→ Reduce s (fewer self-consistency samples)
→ Use cached demonstrations longer
Constraint Management
Balancing Consistency vs Diversity:
- High consistency weight → may select redundant demonstrations
- High diversity weight → may select inconsistent demonstrations
- Approach: Start with standard trade_off (0.2), adjust based on results
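The trade-off enters as a weighted sum over the two scores. A hedged sketch of selection under this scheme, assuming entropy and repetitiveness are both normalized to [0, 1] (lower combined score wins; the diversity constraint discussed earlier is omitted for brevity):

```python
def candidate_score(entropy, repetitiveness, trade_off=0.2):
    """Combined COSP score: consistent (low-entropy) and non-degenerate
    (low-repetitiveness) candidates score lowest."""
    return entropy + trade_off * repetitiveness

def select_k_best(candidates, k=3, trade_off=0.2):
    """Return the k candidates with the lowest combined score.
    Each candidate is a dict with 'entropy' and 'repetitiveness' keys."""
    ranked = sorted(
        candidates,
        key=lambda c: candidate_score(c["entropy"], c["repetitiveness"], trade_off),
    )
    return ranked[:k]
```

Raising `trade_off` penalizes redundant phrasing more; lowering it leans harder on consistency alone.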
Token/Context Constraints:
When demonstrations exceed context window:
- Reduce k (fewer demonstrations)
- Compress demonstration reasoning
- Select shorter questions/answers
- Use models with larger context
Handling Incomplete Information:
When unlabeled questions lack context:
- Include context in question format
- Use COSP-FS to establish reasoning patterns
- Add explicit assumption-stating in prompts
Error Recovery:
| Error                      | Recovery                          |
| -------------------------- | --------------------------------- |
| All candidates low quality | Use COSP-FS bootstrap             |
| Extraction fails           | Fall back to full-text matching   |
| API timeout                | Retry with exponential backoff    |
| Context overflow           | Reduce k, compress demonstrations |
Advanced Techniques
Clarity and Context Optimization
Ensuring Demonstration Clarity:
- Format demonstrations consistently
- Include clear reasoning transitions
- Show explicit calculations
- End with unambiguous answer format
Example of Clear Demonstration:
Q: A store has 45 apples. They sell 12 in the morning and 18 in the afternoon. How many apples remain?
A: Let's solve this step by step.
Step 1: Start with 45 apples.
Step 2: Sold in morning: 45 - 12 = 33 apples remain.
Step 3: Sold in afternoon: 33 - 18 = 15 apples remain.
Therefore, the answer is 15.
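Demonstrations in this shape can be assembled programmatically so that every selected example follows the same template; a simple illustrative helper:

```python
def format_demonstration(question, reasoning, answer):
    """Render a demonstration in the consistent Q/A format shown above,
    ending with an unambiguous answer statement."""
    return f"Q: {question}\nA: {reasoning}\nTherefore, the answer is {answer}."
```

Using one formatter for all k demonstrations also keeps answer extraction trivial in Stage 2.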
Context Optimization:
- Include only relevant demonstrations
- Order by relevance to test question (if measurable)
- Remove redundant reasoning steps
- Balance completeness with conciseness
Demonstration Design Principles:
| Principle    | Implementation                     |
| ------------ | ---------------------------------- |
| Clarity      | Explicit steps, clear language     |
| Completeness | All necessary reasoning shown      |
| Consistency  | Same format across demonstrations  |
| Relevance    | Similar to expected test questions |
Advanced Reasoning Patterns
Multi-Step Verification:
Add verification to demonstrations:
...calculation steps...
Let me verify: 15 + 12 + 18 = 45 ✓
The answer is 15.
Decomposition Pattern:
For complex problems, demonstrate decomposition:
Q: [Complex problem]
A: Let me break this into parts.
Part 1: [Subproblem] → [Solution]
Part 2: [Subproblem] → [Solution]
Combining: [Integration] → [Final answer]
Self-Correction Pattern:
Demonstrate error detection:
...initial reasoning...
Wait, let me check: [verification]
Actually, I made an error. [correction]
The correct answer is [corrected answer].
Interaction Patterns
Iterative COSP (Advanced):
def iterative_cosp(questions, iterations=2):
    """Refine demonstrations over multiple passes. generate_with_demos and
    select_best are assumed to implement the generation and scoring logic above."""
    demonstrations = None
    for i in range(iterations):
        # Generate candidates (conditioning on previous demos if available)
        candidates = generate_with_demos(questions, demonstrations)
        # Select new demonstrations
        demonstrations = select_best(candidates)
    return demonstrations
Chaining with Other Techniques:
COSP can be combined with:
- RAG: Retrieve relevant knowledge, then apply COSP
- Self-Refinement: Use COSP output as input to refinement
- Verification: Add verification step after COSP inference
Model Considerations
Model-Specific Adaptations:
| Model          | Adaptation                                 |
| -------------- | ------------------------------------------ |
| GPT-4          | Standard COSP works well                   |
| GPT-3.5        | May need COSP-FS, lower expectations       |
| Claude         | Works well, may need format adjustment     |
| PaLM           | Original evaluation model, standard config |
| Llama 70B+     | Needs more demonstrations (k=5+)           |
| Smaller models | Not recommended                            |
Cross-Model Considerations:
- Demonstrations generated by one model may not transfer well to another
- Re-run Stage 1 when switching models
- Answer format may differ between models
Handling Model Updates:
- Re-evaluate COSP periodically with model updates
- Demonstrations may become suboptimal with new model versions
- Monitor performance for degradation
Efficiency Optimization
Token Minimization:
def compress_demonstration(demo):
"""Compress demonstration while preserving key information."""
# Remove filler phrases
compressed = demo.replace("Let me think about this.", "")
compressed = compressed.replace("So, ", "")
# Combine short steps
# ... additional compression logic
return compressed
Batch Processing:
import asyncio

async def batch_cosp_inference(test_questions, demos, batch_size=5):
    """Process multiple questions in parallel; async_inference is assumed
    to be an async wrapper around Stage 2 inference."""
    async def process_one(q):
        return await async_inference(q, demos)
    results = []
    for i in range(0, len(test_questions), batch_size):
        batch = test_questions[i:i + batch_size]
        batch_results = await asyncio.gather(*[process_one(q) for q in batch])
        results.extend(batch_results)
    return results
Caching Strategy:
import hashlib
import json
class COSPCache:
def __init__(self, cache_file="cosp_cache.json"):
self.cache_file = cache_file
self.cache = self._load_cache()
def _hash_questions(self, questions):
return hashlib.md5(json.dumps(sorted(questions)).encode()).hexdigest()
def get_demonstrations(self, questions):
key = self._hash_questions(questions)
return self.cache.get(key)
def store_demonstrations(self, questions, demos):
key = self._hash_questions(questions)
self.cache[key] = demos
self._save_cache()
Safety and Robustness
Output Validation:
def validate_cosp_output(answer, expected_format):
    """Validate COSP output meets expected format."""
    if expected_format == "numeric":
        try:
            float(answer)
            return True
        except (TypeError, ValueError):
            return False
    elif expected_format == "yes_no":
        return answer.lower() in ["yes", "no"]
    # ... additional formats
    return True
Consistency Monitoring:
import logging
from collections import Counter

def monitor_consistency(results, threshold=0.7):
    """Monitor self-consistency across inference runs."""
    counter = Counter(results)
    most_common_count = counter.most_common(1)[0][1]
    consistency = most_common_count / len(results)
    if consistency < threshold:
        logging.warning(f"Low consistency: {consistency:.2%}")
        return False, consistency
    return True, consistency
Fallback Mechanisms:
import logging

def cosp_with_fallback(question, demos, config):
    """COSP with fallback to zero-shot; cosp_inference_with_confidence and
    zero_shot_cot are assumed defined elsewhere."""
    try:
        answer, consistency = cosp_inference_with_confidence(question, demos, config)
        if consistency < 0.5:
            # Fall back to simple zero-shot CoT
            return zero_shot_cot(question)
        return answer
    except Exception as e:
        logging.error(f"COSP failed: {e}")
        return zero_shot_cot(question)
Domain Adaptation
Adapting to New Domains:
- Collect 20-50 unlabeled questions from new domain
- Run Stage 1 with domain questions
- Validate demonstration quality manually
- Test on held-out domain examples
- Adjust parameters if needed
Domain-Specific Considerations:
| Domain  | Adaptation                                               |
| ------- | -------------------------------------------------------- |
| Medical | Use medical terminology in questions, validate carefully |
| Legal   | Longer reasoning chains, may need k=5+                   |
| Code    | Adjust answer extraction for code outputs                |
| Math    | Standard COSP works well                                 |
Quick Domain Adaptation:
def quick_domain_adapt(domain_questions, generic_demos=None):
"""Quick adaptation to new domain."""
# If we have generic demos, use COSP-FS approach
if generic_demos:
candidates = generate_with_demos(domain_questions, generic_demos)
else:
candidates = generate_candidates(domain_questions)
# Select domain-specific demonstrations
return select_best(candidates)
Risk and Ethics
Ethical Considerations
What COSP Reveals About LLMs:
- Models have implicit confidence that correlates with correctness
- Self-generated content can improve model performance
- Consistency is a learnable, measurable property
Bias Considerations:
- Demonstrations inherit biases from model's zero-shot outputs
- Consistent biased answers will be selected as demonstrations
- No mechanism to detect or correct systematic biases
Transparency Concerns:
- Selected demonstrations may not represent optimal reasoning
- Users may not understand why certain demonstrations were chosen
- Automated selection obscures human oversight
Risk Analysis
Failure Modes:
| Failure Mode                   | Likelihood | Impact | Mitigation                       |
| ------------------------------ | ---------- | ------ | -------------------------------- |
| Consistently wrong answers     | Medium     | High   | Validate demonstrations manually |
| Biased demonstration selection | Medium     | Medium | Audit selected demonstrations    |
| Poor quality on edge cases     | High       | Medium | Test on diverse cases            |
| Format extraction errors       | Medium     | Low    | Robust parsing, fallbacks        |
Cascading Failures:
Bad demonstrations selected →
Incorrect reasoning patterns primed →
Test inference follows bad patterns →
Systematic errors on test set
Safety Concerns:
- Prompt injection: Unlabeled questions could contain adversarial content
- Mitigation: Sanitize inputs, validate question format
Bias Amplification:
- COSP may amplify existing model biases by selecting "confident" biased outputs
- Detection: Audit demonstrations for bias patterns
- Mitigation: Diverse question pool, explicit fairness constraints
Innovation Potential
Derived Innovations:
- Task-adaptive COSP: Automatically detect task type and adjust scoring
- Continuous learning COSP: Update demonstrations based on feedback
- Multi-model COSP: Use different models for generation vs selection
- Hierarchical COSP: Multi-level demonstration selection
Novel Combinations:
- COSP + RAG: Retrieve knowledge, then select demonstrations
- COSP + Verification: Add automated verification of selected demonstrations
- COSP + Active Learning: Use entropy to identify questions needing labels
Ecosystem and Integration
Tools and Frameworks
Framework Support:
| Framework  | COSP Support                   |
| ---------- | ------------------------------ |
| LangChain  | Custom chain implementation    |
| DSPy       | Can implement as custom module |
| Instructor | Built-in COSP guidance         |
| Haystack   | Custom pipeline component      |
Evaluation Tools:
- Standard accuracy metrics
- Consistency measurement
- Demonstration quality scoring
- Entropy distribution analysis
Pre-built Resources:
- Instructor library includes COSP documentation and examples
- LearnPrompting provides educational materials
- Research implementations available (reference original paper)
Related Techniques
Closely Related:
| Technique        | Relationship                                               |
| ---------------- | ---------------------------------------------------------- |
| Self-Consistency | COSP uses self-consistency for scoring and inference       |
| Auto-CoT         | Similar goal (automatic demos), different selection method |
| USP              | Extends COSP to general NLP tasks                          |
| Zero-Shot CoT    | Foundation that COSP builds upon                           |
Comparison Table:
| Aspect              | COSP      | Auto-CoT         | Self-Consistency | Manual Few-Shot |
| ------------------- | --------- | ---------------- | ---------------- | --------------- |
| Labeled data        | No        | No               | No               | Yes             |
| Automatic selection | Yes       | Yes (clustering) | N/A              | No              |
| Multi-stage         | Yes       | Yes              | No               | No              |
| Task scope          | Reasoning | Reasoning        | Any              | Any             |
| Compute cost        | High      | Medium           | Medium           | Low             |
Integration Patterns
With RAG Systems:
def cosp_with_rag(question, retriever, demos):
"""Integrate COSP with retrieval."""
# Retrieve relevant context
context = retriever.retrieve(question)
# Augment question with context
augmented_q = f"Context: {context}\n\nQuestion: {question}"
# Run COSP inference
return cosp_inference(augmented_q, demos)
With Agent Systems:
class COSPAgent:
"""Agent that uses COSP for reasoning steps."""
def __init__(self, tools, demo_pool):
self.tools = tools
self.cosp_demos = self._generate_demos(demo_pool)
def reason(self, task):
# Use COSP for reasoning
reasoning = cosp_inference(task, self.cosp_demos)
# Extract action from reasoning
action = self._parse_action(reasoning)
return action
Transition Strategies:
From Zero-Shot to COSP:
- Collect unlabeled questions from production queries
- Run Stage 1 to generate demonstrations
- A/B test COSP vs zero-shot
- Gradually increase COSP traffic if positive
From COSP to Fine-Tuning:
- Use COSP-generated demonstrations as training data
- Filter to high-confidence examples
- Fine-tune on selected examples
- Compare fine-tuned model to COSP
Production Integration
Deployment Architecture:
┌─────────────────────────────────────────┐
│ Production System │
├─────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Demo Cache │ │ Test Query │ │
│ │ (Stage 1) │───▶│ (Stage 2) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Periodic │ │ LLM API │ │
│ │ Refresh │ │ │ │
│ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────┘
Monitoring:
- Track accuracy over time
- Monitor demonstration quality scores
- Alert on consistency degradation
- Log entropy distributions
Versioning:
- Version demonstration sets
- Track which demo version produced each result
- Enable rollback to previous demonstrations
Future Directions
Emerging Innovations
Active COSP:
Using entropy scores to actively identify questions that need human labels:
- High entropy questions are difficult for the model
- Prioritize these for human annotation
- Combine with active learning frameworks
Continuous COSP:
Updating demonstrations based on feedback:
- Track which demonstrations lead to correct answers
- Dynamically adjust demonstration pool
- Learn optimal scoring weights from outcomes
Multi-Modal COSP:
Extending to vision-language tasks:
- Generate demonstrations from image-text pairs
- Measure consistency across visual reasoning
- Select demonstrations covering diverse visual patterns
Research Frontiers
Open Questions:
- Optimal scoring function: Is entropy + repetitiveness optimal, or can we learn better scoring?
- Transfer across tasks: Can demonstrations transfer between similar tasks?
- Scaling laws: How does COSP performance scale with n, m, k?
- Theoretical foundations: Why exactly does consistency correlate with correctness?
Promising Directions:
- Learned demonstration selection: Train a model to select demonstrations
- Compositional COSP: Build complex demonstrations from simple ones
- Adversarial robustness: Make COSP robust to adversarial questions
- Efficiency improvements: Reduce computational overhead while maintaining quality
Integration with Native Reasoning:
As models like o1/o3 incorporate native reasoning:
- COSP principles may be internalized in model training
- External COSP may become unnecessary for advanced models
- Techniques may shift to improving native reasoning consistency
Quick Reference
COSP at a Glance
Purpose: Automatic zero-shot demonstration selection
Key Innovation: Consistency-based scoring for self-generated examples
Input: Unlabeled questions + test questions
Output: Improved reasoning accuracy
Stage 1: Generate n×m candidates → Score by entropy + repetitiveness → Select k best
Stage 2: Use selected demonstrations → Self-consistency inference → Majority vote
Typical Config: n=20, m=5, k=3, s=5, temperature=0.7, trade_off=0.2
Decision Checklist
Use COSP when:
☑ Task has deterministic, comparable answers
☑ Zero-shot underperforms
☑ No labeled data available
☑ Have unlabeled questions from domain
☑ Computational budget allows multi-sampling
Don't use COSP when:
☒ Open-ended or subjective task
☒ Using native reasoning models (o1, o3)
☒ Real-time latency requirements
☒ No relevant unlabeled questions available
☒ Small models (< 100B parameters)
Performance Expectations
vs Zero-Shot CoT: +10-15% accuracy
vs 5-Shot Manual: Comparable or better on 50-60% of tasks
Compute Cost: 5-10x zero-shot
Latency: 10-60 seconds (Stage 1 cacheable)
References
Primary Paper:
- Wan, X., Sun, R., Dai, H., Arik, S. O., & Pfister, T. (2023). Better Zero-Shot Reasoning with Self-Adaptive Prompting. Findings of ACL 2023.
Related Work:
- Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022.
- Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.
- Zhang, Z., et al. (2022). Automatic Chain of Thought Prompting in Large Language Models. ICLR 2023.
- Wan, X., et al. (2023). Universal Self-Adaptive Prompting. EMNLP 2023.