Active Prompting: A Complete Guide
Active Prompting is an optimization-based technique that improves few-shot learning by iteratively selecting the most uncertain examples for human annotation, then using these annotated examples as demonstrations. Rather than randomly choosing examples, Active Prompting identifies inputs where the model is most uncertain, gets expert annotations for those cases, and incorporates them into the prompt to maximize learning efficiency.
The core insight is that not all examples are equally valuable for teaching a model. Examples that challenge the model's current understanding provide more information than easy cases the model already handles well. By focusing annotation effort on high-uncertainty examples, Active Prompting achieves better performance with fewer labeled examples than random selection.
Active Prompting belongs to the optimization-based and example-based prompting categories. It combines active learning principles with few-shot prompting, creating a human-in-the-loop system that iteratively refines prompt quality. Introduced by Diao et al. (2023) in "Active Prompting with Chain-of-Thought for Large Language Models" and published at ACL 2024, it demonstrated substantial improvements: 83.4% accuracy on GSM8K (vs. 63.1% for standard CoT), with gains ranging from 1.0% to 15.4% across arithmetic reasoning datasets (ASDiv, SVAMP, AQUA). The technique consistently outperforms self-consistency baselines by 2.1-7.2% across reasoning tasks.
How It Works
Active Prompting is grounded in active learning theory, which has decades of research in machine learning showing that strategic example selection outperforms random sampling. The technique transfers this principle to prompt engineering: the examples you include in your prompt dramatically affect model performance, so selecting informative examples yields better results than arbitrary choices.
The fundamental innovation is applying uncertainty sampling to prompt construction. Traditional few-shot prompting uses randomly selected or manually curated examples. Active Prompting systematically identifies examples that expose model weaknesses, gets expert annotations for those cases, and incorporates them as demonstrations.
Execution Mechanism
1. Initial Uncertainty Assessment:
- Run model on pool of unlabeled examples (typically 100-1000 examples)
- For each example, generate k diverse responses (k=5-10 typical)
- Calculate uncertainty metrics from response variance
- Uncertainty indicates model confusion or lack of confident knowledge
2. Example Selection:
- Rank examples by uncertainty score (highest to lowest)
- Select top n most uncertain examples (n typically 4-8)
- These represent cases where model needs most guidance
- Selection criteria: disagreement, entropy, variance across responses
3. Human Annotation:
- Expert annotators provide gold-standard answers for selected examples
- For reasoning tasks: include step-by-step explanations (Chain-of-Thought)
- Annotations should demonstrate correct reasoning process, not just final answer
- Quality control: verify annotation correctness and consistency
4. Prompt Construction:
- Create few-shot prompt using annotated high-uncertainty examples
- Format: [Example 1: Question → Reasoning → Answer], [Example 2...], [Test Question]
- Order examples from simpler to more complex when possible
- Ensure examples cover diverse uncertainty patterns
5. Execution:
- Run inference on test set using constructed prompt
- Model learns from informative examples in context
- Performance improvement comes from targeted example selection
- Process can iterate if performance insufficient
Active Prompting is iterative and multi-stage: it requires an initial uncertainty estimation phase, an annotation phase, and a final inference phase. Some implementations iterate over multiple rounds, adding new uncertain examples each cycle.
Why This Works
1. Information Maximization (40% of effectiveness): High-uncertainty examples carry more information than easy cases. Including them in prompts teaches the model boundary conditions, edge cases, and subtle distinctions it struggles with.
2. Targeted Learning (30%): Rather than hoping random examples cover important cases, Active Prompting guarantees the prompt addresses model weaknesses. This focuses limited example slots on maximum-impact demonstrations.
3. Diversity Through Disagreement (20%): Uncertainty often signals diverse valid interpretations or complex reasoning paths. Selected examples tend to cover broader input space than random sampling.
4. Expert Knowledge Transfer (10%): Human annotations provide correct reasoning patterns for exactly the cases where model needs most help. This bridges gap between model's current capabilities and task requirements.
Causal Chain:
High uncertainty identification → annotation of model's weak points → examples directly address confusion → model learns boundary conditions → improved accuracy on similar difficult cases
Positive Feedback Loop:
Better examples → better performance → ability to tackle harder tasks → identification of new uncertainty frontiers → further refinement
Dominant Factors Ranked:
- Uncertainty metric quality (40%): How well you identify truly informative examples
- Annotation quality (30%): Expert reasoning explanations, not just answers
- Example quantity (20%): Typically 4-8 examples optimal, diminishing returns beyond
- Selection diversity (10%): Covering different types of uncertainty patterns
Structure and Components
Essential Components
Required:
- Unlabeled example pool: Set of candidate questions/inputs for uncertainty assessment (100-1000 examples minimum)
- Uncertainty metric: Method to quantify model confusion (disagreement, entropy, variance)
- Sampling strategy: Algorithm to select top-n uncertain examples
- Human annotator: Expert to provide correct answers and reasoning
- Few-shot prompt template: Structure for incorporating annotated examples
- Test set: Final evaluation dataset
Optional:
- Chain-of-Thought annotations: Step-by-step reasoning (highly recommended for reasoning tasks)
- Multiple annotation rounds: Iterative refinement with multiple selection cycles
- Annotation guidelines: Standardized instructions for annotators
- Validation set: Separate set to tune number of examples and uncertainty threshold
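As a minimal sketch, the required and optional components above could be bundled into a single configuration object. All names here are illustrative choices, not from the Active-Prompt paper:

```python
from dataclasses import dataclass

# Illustrative only: one way to carry the components listed above
# through a pipeline. Field names are ours, not the paper's.
@dataclass
class ActivePromptConfig:
    pool: list                                # unlabeled example pool
    uncertainty_metric: str = "disagreement"  # or "entropy", "variance"
    n_examples: int = 8                       # top-n uncertain examples to annotate
    k_samples: int = 5                        # responses per example for uncertainty
    use_cot: bool = True                      # require step-by-step annotations
    rounds: int = 1                           # >1 enables iterative refinement
    guidelines: str = ""                      # optional annotator instructions

config = ActivePromptConfig(pool=["Q1", "Q2", "Q3"])
```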
Design Principles
Core Cognitive Principles:
- Uncertainty as signal: Model disagreement indicates learning opportunities
- Targeted demonstration: Examples should address specific weaknesses, not random coverage
- Reasoning transparency: CoT annotations teach thinking process, not just outcomes
- Iterative refinement: Multiple rounds can progressively improve prompt quality
Linguistic Patterns:
Active Prompting uses standard few-shot format but with strategic example selection:
Question: [High-uncertainty question 1]
Reasoning: [Expert step-by-step explanation]
Answer: [Correct answer]
Question: [High-uncertainty question 2]
Reasoning: [Expert step-by-step explanation]
Answer: [Correct answer]
[Additional examples...]
Question: [Test question]
Reasoning:
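A small helper can render this pattern mechanically; a sketch assuming each annotated example is a dict with `question`, `reasoning`, and `answer` keys:

```python
def format_active_prompt(examples, test_question):
    """Render annotated high-uncertainty examples in the
    Question / Reasoning / Answer pattern shown above,
    ending with the open 'Reasoning:' cue for the model."""
    parts = []
    for ex in examples:
        parts.append(f"Question: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"Answer: {ex['answer']}\n")
    parts.append(f"Question: {test_question}\nReasoning:")
    return "\n".join(parts)

prompt = format_active_prompt(
    [{'question': 'Q1', 'reasoning': 'R1', 'answer': 'A1'}],
    'Qtest')
```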
Structural Patterns
Minimal Pattern (Basic Active Prompting):
# 1. Assess uncertainty on pool
uncertainties = calculate_uncertainty(model, example_pool, k=5)
# 2. Select top uncertain examples
selected = top_n(uncertainties, n=4)
# 3. Get annotations
annotated = human_annotate(selected)
# 4. Create prompt and run
prompt = create_few_shot_prompt(annotated)
result = model(prompt + test_question)
Standard Pattern (Active-Prompt with CoT):
# Simplified sketch of the Active-Prompt procedure (Diao et al., 2023)
def active_prompting(model, pool, test_set, n_examples=8, k_samples=5):
    # Step 1: Generate multiple responses for uncertainty estimation
    uncertainties = []
    for question in pool:
        responses = [model.generate(question, temp=1.0) for _ in range(k_samples)]
        uncertainty = calculate_disagreement(responses)
        uncertainties.append((question, uncertainty))

    # Step 2: Select most uncertain
    selected_questions = sorted(uncertainties, key=lambda x: x[1], reverse=True)[:n_examples]

    # Step 3: Human annotation with CoT
    annotated_examples = []
    for question, _ in selected_questions:
        reasoning, answer = expert_annotate_with_cot(question)
        annotated_examples.append({
            'question': question,
            'reasoning': reasoning,
            'answer': answer
        })

    # Step 4: Construct few-shot prompt
    prompt = ""
    for ex in annotated_examples:
        prompt += f"Question: {ex['question']}\n"
        prompt += f"Reasoning: {ex['reasoning']}\n"
        prompt += f"Answer: {ex['answer']}\n\n"

    # Step 5: Run on test set
    results = []
    for test_q in test_set:
        full_prompt = prompt + f"Question: {test_q}\nReasoning:"
        result = model.generate(full_prompt)
        results.append(result)
    return results
Advanced Pattern (Iterative Multi-Round):
def iterative_active_prompting(model, pool, validation_set, test_set,
                               rounds=3, examples_per_round=3):
    # Helpers (generate_diverse_responses, calculate_uncertainty_metric,
    # expert_annotate, evaluate) are task-specific and left abstract here.
    annotated_examples = []
    remaining_pool = pool.copy()
    previous_accuracy = 0.0
    for round_num in range(rounds):
        # Calculate uncertainty on remaining pool
        uncertainties = []
        for question in remaining_pool:
            responses = generate_diverse_responses(model, question,
                                                   current_examples=annotated_examples)
            uncertainty = calculate_uncertainty_metric(responses)
            uncertainties.append((question, uncertainty))

        # Select top uncertain for this round
        round_selected = sorted(uncertainties, key=lambda x: x[1],
                                reverse=True)[:examples_per_round]

        # Annotate selected examples
        for question, _ in round_selected:
            annotation = expert_annotate(question)
            annotated_examples.append(annotation)
            remaining_pool.remove(question)

        # Evaluate current prompt performance
        current_accuracy = evaluate(model, annotated_examples, validation_set)
        print(f"Round {round_num + 1} accuracy: {current_accuracy}")

        # Early stopping if performance plateaus (< 2 points gained)
        if round_num > 0 and current_accuracy - previous_accuracy < 0.02:
            break
        previous_accuracy = current_accuracy

    # Final evaluation on test set
    return evaluate(model, annotated_examples, test_set)
Uncertainty Metrics
1. Disagreement (Most Common):
from collections import Counter

def calculate_disagreement(responses):
    """Measures disagreement in final answers across k generations.
    extract_final_answer is task-specific."""
    answers = [extract_final_answer(r) for r in responses]
    return 1 - (max(Counter(answers).values()) / len(answers))
2. Entropy:
import math

def calculate_entropy(responses):
    """Shannon entropy of the answer distribution."""
    answers = [extract_final_answer(r) for r in responses]
    counts = Counter(answers)
    total = len(answers)
    return -sum((count / total) * math.log2(count / total)
                for count in counts.values())
3. Variance (for numerical answers):
import numpy as np

def calculate_variance(responses):
    """Statistical variance of numerical outputs."""
    numbers = [extract_number(r) for r in responses]
    return np.var(numbers)
4. Confidence Score:
def calculate_confidence(responses, model):
    """Average model confidence across generations."""
    confidences = [model.get_probability(r) for r in responses]
    return 1 - np.mean(confidences)  # Lower confidence = higher uncertainty
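To see how the first two metrics behave, here is a self-contained comparison on two hypothetical answer sets. For simplicity the functions take already-extracted answer lists, skipping the task-specific extraction step:

```python
import math
from collections import Counter

def disagreement(answers):
    """1 - (share of the modal answer): 0.0 means full agreement."""
    return 1 - max(Counter(answers).values()) / len(answers)

def entropy(answers):
    """Shannon entropy of the answer distribution, in bits."""
    total = len(answers)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(answers).values())

confident = ["42", "42", "42", "42", "42"]  # model agrees with itself
confused = ["42", "17", "42", "8", "30"]    # answers scatter widely

print(disagreement(confident), entropy(confident))  # ~0 for full agreement
print(disagreement(confused), entropy(confused))    # both high when answers scatter
```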
Modifications for Different Scenarios
For Classification Tasks:
- Use class probability distributions for uncertainty
- Select examples near decision boundaries
- Ensure balanced class representation in selected examples
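For the "near decision boundaries" criterion, a common active-learning proxy is the margin between the top two class probabilities. A sketch, where the probability source is assumed (e.g. logprobs, or vote shares across k samples):

```python
def margin_uncertainty(class_probs):
    """Smaller margin between the top-2 classes means the example sits
    closer to the decision boundary, i.e. higher uncertainty.
    Returns a value in [0, 1]."""
    ranked = sorted(class_probs.values(), reverse=True)
    top2 = ranked[:2] + [0.0]  # pad in case only one class appears
    return 1.0 - (top2[0] - top2[1])

clear = {"spam": 0.95, "ham": 0.05}      # confident prediction
boundary = {"spam": 0.52, "ham": 0.48}   # near the decision boundary
```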
For Complex Reasoning:
- Increase k (number of samples) to 10-15 for better uncertainty estimation
- Require detailed CoT annotations, not just final answers
- Consider reasoning path diversity, not just answer disagreement
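One heuristic way to capture "reasoning path diversity, not just answer disagreement" is to compare the sets of normalized reasoning steps across samples, e.g. via average pairwise Jaccard distance. This is a sketch of one possible metric, not the paper's method:

```python
from itertools import combinations

def path_diversity(responses):
    """Average pairwise Jaccard distance between the step sets of k
    sampled reasoning chains (one step per line). 0.0 = identical paths."""
    step_sets = [set(line.strip().lower()
                     for line in r.splitlines() if line.strip())
                 for r in responses]
    pairs = list(combinations(step_sets, 2))
    if not pairs:
        return 0.0

    def dist(a, b):
        union = a | b
        return 1 - len(a & b) / len(union) if union else 0.0

    return sum(dist(a, b) for a, b in pairs) / len(pairs)

same = ["add 2 and 3\nanswer 5", "add 2 and 3\nanswer 5"]
diff = ["add 2 and 3\nanswer 5", "multiply 2 by 3\nanswer 6"]
```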
For Domain-Specific Tasks:
- Pool should be representative of target domain distribution
- Annotators need domain expertise
- May need domain-specific uncertainty metrics
For Low-Resource Scenarios:
- Start with smaller pool (50-100 examples)
- Use fewer examples per round (2-3 instead of 4-8)
- Maximize annotation quality over quantity
Applications and Task Selection
General Applications
Active Prompting excels when annotation is expensive but examples are available, and when random few-shot selection underperforms.
Mathematical Reasoning: Arithmetic word problems, algebra, geometry, symbolic reasoning. The original Diao et al. (2023) paper reported 83.4% on GSM8K (a 20.3-point improvement over standard CoT's 63.1%), with Active-Prompt achieving a 4.2% improvement over the self-consistency baseline using code-davinci-002. Improvements of 1.0% to 15.4% were observed across the MultiArith, SVAMP, ASDiv, and AQUA datasets. Active-Prompt demonstrates superior performance across arithmetic, commonsense, and symbolic reasoning benchmarks.
Complex Question Answering: Multi-hop reasoning, commonsense reasoning, reading comprehension requiring inference chains.
Code Generation: Selecting examples of tricky edge cases, unusual API usage patterns, complex algorithm implementations. Latest 2024-2025 research: CodePromptEval dataset (7,072 prompts) evaluates five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages), finding that combining multiple techniques doesn't necessarily improve outcomes. Code-Aware Prompting (SymPrompt) demonstrates that LLMs solve more complex logical problems when prompted to reason in multi-step fashion for test generation. The Impact of Prompt Programming study (December 2024) shows significant variations in code generation quality across different prompting strategies.
Logical Reasoning: Deductive reasoning, inductive reasoning, argument analysis, formal logic problems.
Scientific Reasoning: Physics problems, chemistry calculations, biology system analysis requiring multi-step reasoning.
Domain-Specific Applications
Educational Assessment: Identifying student misconceptions by selecting problems students find most challenging, then providing targeted worked examples.
Medical Diagnosis: Selecting challenging cases for expert annotation, building prompts that handle rare conditions and ambiguous symptoms. Studies show 15-20% improvement over random examples in differential diagnosis tasks. Latest 2024 research: Diagnostic reasoning prompts enable GPT-4 to mimic clinical reasoning processes without sacrificing diagnostic accuracy. An active inference strategy for medical LLMs uses actor-critic protocols where a Therapist agent responds to queries while a Supervisor agent evaluates accuracy and reliability. Structured clinical reasoning prompts enhance LLM diagnostic capabilities in complex medical cases.
Legal Analysis: Contract interpretation, case law reasoning, regulatory compliance. Active selection focuses on boundary cases and ambiguous statutory language. Latest 2024 research: Legal Syllogism Prompting (LoT) teaches LLMs that in legal syllogism, the major premise is law, minor premise is fact, and conclusion is judgment. IRAC-based (Issue, Rule, Application, Conclusion) prompting shows superior results on Japanese Bar exam legal tasks compared to generic CoT. GPT-4 ensemble prompting strategies demonstrate effectiveness in reasoning over legal arguments in civil procedure cases.
Financial Analysis: Risk assessment, fraud detection, market prediction. Uncertainty-based selection identifies edge cases in financial reasoning.
Scientific Literature Analysis: Complex domain-specific information extraction, relationship identification in research papers.
Selection Framework
Problem Characteristics (When to Use):
✅ Use Active Prompting when:
- Few-shot prompting works but needs improvement
- You have access to unlabeled examples (100+ examples)
- Expert annotators available for selected examples
- Annotation is expensive/time-consuming (want to minimize waste)
- Task has high variance in difficulty across examples
- Random example selection shows inconsistent performance
- Model shows clear uncertainty patterns (some inputs harder than others)
- Need to maximize performance with minimal annotation budget
- Task requires reasoning or complex outputs (benefits from CoT)
❌ Do NOT use Active Prompting when:
- Zero-shot already achieves target performance
- No access to unlabeled example pool
- Can't get expert annotations
- Task so simple that all examples equally informative
- Need immediate results (Active Prompting requires setup time)
- Annotation cost negligible (random few-shot sufficient)
- Model shows no uncertainty variance (all examples equally difficult or easy)
- Very few test examples (overhead not justified)
Model Requirements:
- Minimum: Models capable of few-shot learning (GPT-3.5, Claude 3, Llama 70B+)
- Recommended: GPT-4, Claude 3.5 Sonnet, or equivalent for reliable uncertainty signals
- Optimal: Models with strong reasoning capabilities for complex tasks
- Not suitable: Small models (<7B parameters) with poor few-shot performance, base models without instruction tuning
Context/Resource Requirements:
- Example pool size: 100-1000 unlabeled examples (more is better)
- Annotation budget: 4-8 expert annotations minimum (8-12 for complex tasks)
- Compute for uncertainty estimation: k × pool_size forward passes (k typically 5-10)
- Context window: Must fit n examples + test input (typically 4000-8000 tokens)
- Time investment: 2-4 hours setup + 15-30 minutes per annotation
- Iterations: 1-3 rounds typical (diminishing returns after 3)
Cost Implications:
One-time costs:
- Uncertainty estimation: pool_size × k × cost_per_token × avg_input_tokens
- Example: 500 examples × 5 samples × $0.01/1K tokens × 200 tokens = $5
- Human annotation: n_examples × annotation_cost (varies widely: $5-50 per example depending on complexity)
Per-request production costs:
- Same as few-shot prompting: n_examples × (input_tokens + output_tokens) × cost
- Typically 2-5x zero-shot cost
- Example: 8 examples × 300 tokens each + 200 token question + 300 token response = 2900 tokens ≈ $0.03-0.15 per request
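The arithmetic above can be wrapped in a small estimator. Token counts and the per-1K price are illustrative placeholders, not current API rates:

```python
def uncertainty_phase_cost(pool_size, k, avg_tokens, price_per_1k):
    """One-time cost of generating k samples per pool example."""
    return pool_size * k * avg_tokens * price_per_1k / 1000

def per_request_cost(n_examples, tokens_per_example, question_tokens,
                     response_tokens, price_per_1k):
    """Recurring cost of one few-shot request with n in-context examples.
    Returns (total_tokens, dollar_cost)."""
    total = n_examples * tokens_per_example + question_tokens + response_tokens
    return total, total * price_per_1k / 1000

# Matches the worked examples above: 500 examples, 5 samples, 200 tokens, $0.01/1K
setup = uncertainty_phase_cost(500, 5, 200, 0.01)
tokens, cost = per_request_cost(8, 300, 200, 300, 0.01)
```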
Trade-offs:
- Higher upfront cost vs better performance and fewer annotations than random selection
- 30-50% fewer annotations needed vs random sampling for same performance
- ROI positive when annotation cost high or performance gains valuable
When to Use vs When NOT to Use:
Use when:
- Few-shot accuracy 60-85% (room for improvement, baseline works)
- Have 100+ unlabeled examples
- Expert time limited (want strategic annotation)
- Performance improvement worth annotation cost
- Task has learnable patterns from examples
Do NOT use when:
- Few-shot accuracy >90% (already excellent)
- Few-shot accuracy <40% (need fine-tuning, not better examples)
- No example pool or annotation access
- Zero-shot sufficient for use case
- Real-time deployment needs (latency too high)
Escalate to alternatives when:
- Active Prompting + best examples still <70% accuracy → fine-tuning needed
- Annotation cost exceeds fine-tuning cost → consider fine-tuning
- Need consistent format compliance → structured outputs or fine-tuning
- Domain highly specialized → RAG or fine-tuning
Variant Selection
Standard Active-Prompt (Diao et al. 2023):
- Best for: Mathematical reasoning, logical reasoning, complex QA
- Characteristics: CoT annotations, disagreement-based uncertainty, single round
Iterative Active-Prompt:
- Best for: When annotation budget allows multiple rounds
- Characteristics: 2-3 rounds, progressive refinement, early stopping
- Use when: Initial round shows promise but not sufficient
Active-Prompt without CoT:
- Best for: Classification, extraction, simple generation
- Characteristics: Faster annotation, simpler examples
- Use when: Task doesn't require reasoning chains
Hybrid Active-Prompt + Self-Consistency:
- Best for: Maximum accuracy on challenging tasks
- Characteristics: Active selection for examples + ensemble at test time
- Use when: Performance critical, cost secondary concern
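The ensemble half of this hybrid can be sketched as standard self-consistency: at test time, sample k completions with the actively selected prompt and majority-vote over the final answers. The model call is stubbed out here, and the answer extraction is a toy last-line rule:

```python
from collections import Counter

def extract_final_answer(text):
    """Toy extraction: the last non-empty line of the completion."""
    return [line for line in text.splitlines() if line.strip()][-1].strip()

def self_consistent_answer(sample_fn, prompt, k=5):
    """Draw k sampled completions and return the modal final answer.
    sample_fn stands in for a temperature > 0 model call."""
    answers = [extract_final_answer(sample_fn(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Stubbed sampler cycling through canned completions
fake = iter(["steps...\n4", "steps...\n4", "other steps...\n5"])
best = self_consistent_answer(lambda p: next(fake), "prompt", k=3)
```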
Alternative Techniques:
| Technique | When to Choose |
| --- | --- |
| Random Few-Shot | Annotation cheap, many examples available |
| Manual Example Curation | Domain expert available, small example set, performance critical |
| Active Prompting | Annotation expensive, want optimal examples, have example pool |
| Fine-tuning | Thousands of examples available, deployment cost matters more than development cost |
| RAG | Knowledge-intensive tasks, knowledge changes frequently |
Implementation
Implementation Steps
Total time estimate: 4-8 hours initial setup + 2-4 hours per iteration
Step 1: Prepare Example Pool (1-2 hours)
- Collect 100-1000 unlabeled examples representative of target distribution
- Ensure diversity in difficulty and input types
- Format consistently for model input
- Split into: pool (80%), validation (10%), test (10%)
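A sketch of the 80/10/10 split, with a seeded shuffle so the split is reproducible:

```python
import random

def split_pool(examples, seed=0):
    """Shuffle and split into pool (80%), validation (10%), test (10%)."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    a, b = int(0.8 * n), int(0.9 * n)
    return items[:a], items[a:b], items[b:]

pool, val, test = split_pool(range(100))
```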
Step 2: Uncertainty Estimation (1-2 hours compute time)
- Choose uncertainty metric (disagreement recommended for most tasks)
- Set k (number of samples): 5-10 typical, higher for complex tasks
- Generate k responses for each pool example
- Calculate uncertainty scores
- Validate uncertainty correlates with actual difficulty
Step 3: Example Selection (15 minutes)
- Rank examples by uncertainty (highest to lowest)
- Select top n (typically 4-8)
- Manually review selections to ensure quality and diversity
- Consider removing duplicates or overly similar examples
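The duplicate-removal step can be partially automated with a crude lexical filter before human review; a sketch using word-level Jaccard similarity, with a threshold that is a guess to tune per task:

```python
def filter_near_duplicates(questions, threshold=0.7):
    """Greedily keep questions whose word-set Jaccard similarity to every
    already-kept question stays below the threshold."""
    kept = []
    for q in questions:
        words = set(q.lower().split())
        if all(len(words & set(k.lower().split())) /
               len(words | set(k.lower().split())) < threshold
               for k in kept):
            kept.append(q)
    return kept

qs = ["John has 5 apples and gives 2 away",
      "John has 5 apples and gives 3 away",   # near-duplicate of the first
      "A train travels 60 miles in 2 hours"]
kept = filter_near_duplicates(qs)
```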
Step 4: Human Annotation (30 minutes - 2 hours)
- Provide clear annotation guidelines to experts
- For reasoning tasks: require step-by-step CoT
- For classification: require justification
- Quality control: verify annotations, resolve disagreements
- Format annotations consistently
Step 5: Prompt Construction (30 minutes)
- Create few-shot prompt template
- Insert annotated examples in effective order
- Add task instruction and format specification
- Test on validation set
Step 6: Evaluation (1 hour)
- Run on validation set
- Measure accuracy, quality metrics
- Compare vs random few-shot baseline
- Analyze failure cases
Step 7: Iteration (optional, 2-3 hours per round)
- If performance insufficient, select additional examples
- Remove low-performing examples if needed
- Refine annotations based on failure analysis
- Re-evaluate
Step 8: Production Deployment (1-2 hours)
- Finalize prompt with best examples
- Set inference parameters (temperature, etc.)
- Document example selection rationale
- Monitor production performance
Platform-Specific Implementations
OpenAI API:
import openai
from collections import Counter
import numpy as np
class ActivePrompting:
    def __init__(self, api_key, model="gpt-4-turbo-preview"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model

    def generate_responses(self, question, k=5, temperature=1.0):
        """Generate k diverse responses for uncertainty estimation"""
        responses = []
        for _ in range(k):
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": question}],
                temperature=temperature,
                max_tokens=500
            )
            responses.append(response.choices[0].message.content)
        return responses

    def calculate_disagreement(self, responses):
        """Calculate disagreement-based uncertainty"""
        # Extract final answers (customize based on task)
        answers = [self.extract_answer(r) for r in responses]
        if not answers:
            return 0.0
        # Calculate disagreement as 1 - (most common / total)
        answer_counts = Counter(answers)
        most_common_count = answer_counts.most_common(1)[0][1]
        return 1 - (most_common_count / len(answers))

    def extract_answer(self, response):
        """Extract final answer from response (task-specific)"""
        # Simple extraction: last line or number
        lines = response.strip().split('\n')
        return lines[-1].strip()

    def select_uncertain_examples(self, pool, n=8, k=5):
        """Select top n most uncertain examples"""
        uncertainties = []
        for question in pool:
            responses = self.generate_responses(question, k=k)
            uncertainty = self.calculate_disagreement(responses)
            uncertainties.append({
                'question': question,
                'uncertainty': uncertainty,
                'responses': responses
            })
        # Sort by uncertainty and select top n
        sorted_examples = sorted(uncertainties,
                                 key=lambda x: x['uncertainty'],
                                 reverse=True)
        return sorted_examples[:n]

    def create_few_shot_prompt(self, annotated_examples, test_question):
        """Construct few-shot prompt with annotated examples"""
        prompt = ""
        for ex in annotated_examples:
            prompt += f"Question: {ex['question']}\n"
            if 'reasoning' in ex:
                prompt += f"Reasoning: {ex['reasoning']}\n"
            prompt += f"Answer: {ex['answer']}\n\n"
        prompt += f"Question: {test_question}\n"
        prompt += "Reasoning: Let's think step by step.\n"
        return prompt

    def run_inference(self, prompt, temperature=0.0):
        """Run inference with constructed prompt"""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=500
        )
        return response.choices[0].message.content

# Usage example
ap = ActivePrompting(api_key="your-api-key")

# Example pool (mathematical reasoning)
pool = [
    "If John has 5 apples and gives 2 to Mary, how many does he have left?",
    "A train travels 60 miles in 2 hours. What is its average speed?",
    # ... 100+ more examples
]

# Step 1: Select uncertain examples
uncertain = ap.select_uncertain_examples(pool, n=8, k=5)
print("Most uncertain examples:")
for i, ex in enumerate(uncertain):
    print(f"{i+1}. {ex['question']} (uncertainty: {ex['uncertainty']:.3f})")

# Step 2: Human annotation (manual process)
annotated = [
    {
        'question': uncertain[0]['question'],
        'reasoning': "John starts with 5 apples. He gives away 2. So we subtract: 5 - 2 = 3.",
        'answer': "3 apples"
    },
    # ... annotate remaining examples
]

# Step 3: Run inference
test_question = "Sarah has 12 cookies and wants to share equally with 3 friends. How many cookies does each person get?"
prompt = ap.create_few_shot_prompt(annotated, test_question)
result = ap.run_inference(prompt)
print(f"\nResult: {result}")
Anthropic Claude:
import anthropic
from collections import Counter
class ActivePromptingClaude:
    def __init__(self, api_key, model="claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def generate_responses(self, question, k=5):
        """Generate k diverse responses"""
        responses = []
        for _ in range(k):
            message = self.client.messages.create(
                model=self.model,
                max_tokens=500,
                temperature=1.0,
                messages=[{"role": "user", "content": question}]
            )
            responses.append(message.content[0].text)
        return responses

    def select_uncertain_examples(self, pool, n=8, k=5):
        """Select most uncertain examples"""
        uncertainties = []
        for question in pool:
            responses = self.generate_responses(question, k)
            # Calculate disagreement
            answers = [r.split('\n')[-1].strip() for r in responses]
            answer_counts = Counter(answers)
            most_common = answer_counts.most_common(1)[0][1]
            disagreement = 1 - (most_common / len(answers))
            uncertainties.append({
                'question': question,
                'uncertainty': disagreement
            })
        sorted_uncertain = sorted(uncertainties,
                                  key=lambda x: x['uncertainty'],
                                  reverse=True)
        return sorted_uncertain[:n]

    def run_with_prompt(self, few_shot_examples, test_question):
        """Run inference with few-shot examples"""
        # Construct prompt
        prompt = ""
        for ex in few_shot_examples:
            prompt += f"Question: {ex['question']}\n"
            prompt += f"Reasoning: {ex['reasoning']}\n"
            prompt += f"Answer: {ex['answer']}\n\n"
        prompt += f"Question: {test_question}\n"
        prompt += "Reasoning:"
        message = self.client.messages.create(
            model=self.model,
            max_tokens=1000,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text
# Usage
client = ActivePromptingClaude(api_key="your-api-key")
uncertain = client.select_uncertain_examples(pool, n=8)
# ... annotate and run inference
LangChain Integration:
from langchain.llms import OpenAI
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.chains import LLMChain
from collections import Counter
class ActivePromptingLangChain:
    def __init__(self, model_name="gpt-4"):
        self.llm = OpenAI(model_name=model_name, temperature=1.0)
        self.llm_inference = OpenAI(model_name=model_name, temperature=0.0)

    def select_uncertain_examples(self, pool, n=8, k=5):
        """Select uncertain examples using LangChain"""
        uncertainties = []
        for question in pool:
            # Generate k responses
            responses = [self.llm(question) for _ in range(k)]
            # Calculate uncertainty
            answers = [r.strip().split('\n')[-1] for r in responses]
            counter = Counter(answers)
            most_common_count = counter.most_common(1)[0][1]
            uncertainty = 1 - (most_common_count / len(answers))
            uncertainties.append({
                'question': question,
                'uncertainty': uncertainty
            })
        return sorted(uncertainties, key=lambda x: x['uncertainty'], reverse=True)[:n]

    def create_chain(self, annotated_examples):
        """Create few-shot chain with selected examples"""
        example_template = """
Question: {question}
Reasoning: {reasoning}
Answer: {answer}
"""
        example_prompt = PromptTemplate(
            input_variables=["question", "reasoning", "answer"],
            template=example_template
        )
        few_shot_prompt = FewShotPromptTemplate(
            examples=annotated_examples,
            example_prompt=example_prompt,
            suffix="Question: {input}\nReasoning:",
            input_variables=["input"]
        )
        return LLMChain(llm=self.llm_inference, prompt=few_shot_prompt)

    def run_inference(self, annotated_examples, test_question):
        """Run inference with active-selected examples"""
        chain = self.create_chain(annotated_examples)
        return chain.run(input=test_question)
# Usage
active_lc = ActivePromptingLangChain()
uncertain = active_lc.select_uncertain_examples(pool, n=8)
# ... human annotation
result = active_lc.run_inference(annotated_examples, test_question)
Configuration
Key Parameters:
Uncertainty Estimation:
- k (number of samples): 5-10 typical, higher for noisy tasks
  - Too low (<3): unreliable uncertainty estimates
  - Too high (>15): diminishing returns, higher cost
  - Recommendation: Start with 5, increase to 10 if uncertainty scores seem unstable
- temperature: 0.7-1.0 for diversity during uncertainty estimation
  - Higher temperature → more diverse responses → better uncertainty signal
  - Use 0.0 for final inference after example selection
Example Selection:
- n (number of examples): 4-8 typical
  - Classification: 4-6 examples sufficient
  - Reasoning: 6-8 examples better
  - Complex tasks: 8-12 examples
  - Diminishing returns beyond 12
Inference:
- temperature: 0.0-0.2 for deterministic outputs
  - Use 0.0 for factual tasks
  - Use 0.2-0.5 for creative tasks
- max_tokens: Set based on expected output length
  - Reasoning tasks: 300-800 tokens
  - Simple answers: 50-200 tokens
Task-Specific Tuning:
Classification:
- k=5, n=4-6, temperature=0.0 for inference
- Uncertainty metric: disagreement on predicted class
- Ensure balanced class representation in selected examples
Mathematical Reasoning:
- k=8-10, n=6-8, temperature=0.0
- Uncertainty metric: disagreement on final numerical answer
- Require detailed CoT annotations
Code Generation:
- k=5-7, n=6-10, temperature=0.2-0.3
- Uncertainty metric: code execution equivalence or AST similarity
- Include edge cases and error handling examples
Complex QA:
- k=8-10, n=6-8, temperature=0.0
- Uncertainty metric: semantic similarity variance across responses
- Focus on multi-hop reasoning examples
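The tuning guidance above can be kept as a simple lookup table. Values are representative picks from the ranges above; the task keys and the fallback default are our own naming:

```python
# Representative settings drawn from the task-specific guidance above.
TASK_CONFIGS = {
    "classification":  {"k": 5,  "n": 5, "temperature": 0.0},
    "math_reasoning":  {"k": 10, "n": 8, "temperature": 0.0},
    "code_generation": {"k": 6,  "n": 8, "temperature": 0.2},
    "complex_qa":      {"k": 10, "n": 8, "temperature": 0.0},
}

def config_for(task):
    """Fall back to a conservative default for unlisted tasks."""
    return TASK_CONFIGS.get(task, {"k": 5, "n": 6, "temperature": 0.0})
```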
Best Practices and Workflow
Workflow (End-to-End):
1. Initial Assessment (30 min):
   - Test zero-shot performance → baseline
   - Test random few-shot (3-4 examples) → quick improvement check
   - If few-shot shows promise, proceed to Active Prompting
2. Pool Preparation (1-2 hours):
   - Collect representative examples
   - Clean and format consistently
   - Create validation and test splits
3. Uncertainty Estimation (1-2 hours compute):
   - Run k-sample generation on pool
   - Calculate uncertainty metrics
   - Validate uncertainty scores make sense
4. Example Selection (15 min):
   - Select top-n uncertain
   - Manual quality check
   - Ensure diversity
5. Annotation (30 min - 2 hours):
   - Expert annotation with CoT
   - Quality validation
   - Consistent formatting
6. Prompt Construction (30 min):
   - Create few-shot template
   - Order examples (simple to complex when possible)
   - Add clear instructions
7. Evaluation (1 hour):
   - Test on validation set
   - Compare vs random few-shot
   - Error analysis
8. Iteration (optional, 2-3 hours):
   - Select additional examples if needed
   - Refine annotations
   - Re-evaluate
9. Production (1 hour):
   - Finalize prompt
   - Document process
   - Monitor performance
Implementation Best Practices:
Do:
- Start with disagreement metric (simplest, most reliable)
- Use temperature=1.0 during uncertainty estimation for maximum diversity
- Manually review top-20 uncertain examples, select best 8 (quality over pure uncertainty)
- Require detailed CoT for reasoning tasks, not just answers
- Test on validation set before committing to annotation
- Document why each example was selected
- Version control your prompts and examples
- Compare against random few-shot baseline to prove value
- Consider multiple annotators for critical examples (inter-annotator agreement)
- Save all k responses during uncertainty estimation for later analysis
Don't:
- Use temperature=0 during uncertainty estimation (defeats purpose)
- Select examples purely by uncertainty without manual review (may select outliers)
- Skip validation set (risk overfitting to test set)
- Annotate without clear guidelines (inconsistent quality)
- Use more than 12 examples (diminishing returns, context issues)
- Ignore diversity (all examples from same difficulty level)
- Use Active Prompting when random few-shot already excellent
- Expect perfection from first iteration
- Neglect to monitor annotation cost vs value gained
Instruction Design:
# Good pattern
[Example 1 - uncertain case with expert CoT]
[Example 2 - uncertain case with expert CoT]
...
[Test Question]
Let's solve this step by step:
# Advanced pattern with explicit instruction
You will be given a math word problem. Solve it by:
1. Identifying what is given
2. Determining what is asked
3. Planning the solution steps
4. Executing the calculation
5. Verifying the answer makes sense
Here are examples of challenging problems solved correctly:
[Annotated examples...]
Now solve this problem:
[Test question]
Common Instruction Mistakes:
- ❌ Too vague: "Solve this math problem"
- ✅ Better: "Solve step-by-step, showing your reasoning"
- ❌ No CoT requirement: Just final answers in examples
- ✅ Better: Full reasoning chains in all examples
- ❌ Inconsistent format across examples
- ✅ Better: Standardized Question→Reasoning→Answer structure
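The standardized Question→Reasoning→Answer structure can be assembled by a small helper. A minimal sketch of the kind of `create_few_shot` builder the later snippets assume; the `question`/`reasoning`/`answer` field names follow the annotation format used in this guide:

```python
def create_few_shot(annotated_examples, test_question):
    """Assemble a few-shot prompt in a standardized
    Question -> Reasoning -> Answer structure, ending with the test question."""
    blocks = []
    for ex in annotated_examples:
        blocks.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}"
        )
    # The trailing cue invites the model to continue with its own reasoning.
    blocks.append(f"Question: {test_question}\nLet's solve this step by step:")
    return "\n\n".join(blocks)
```

Keeping prompt assembly in one function guarantees every example, and the test question, share the exact same layout.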
Debugging Decision Tree
Symptom: Selected examples don't seem challenging
Root causes:
- Uncertainty metric not appropriate for task
- k too small for reliable disagreement signal
- Temperature too low during sampling
Solutions:
- Manually verify: do humans find selected examples harder?
- Increase k from 5 to 10
- Raise temperature to 1.0 during uncertainty estimation
- Try different uncertainty metric (entropy instead of disagreement)
- Consider domain-specific difficulty metrics
Symptom: Performance not better than random few-shot
Root causes:
- Annotation quality insufficient
- Selected examples too narrow (lack diversity)
- Too few examples
- Task doesn't benefit from targeted selection
Solutions:
- Review annotation quality (are CoT explanations clear?)
- Check diversity of selected examples (are they all similar types?)
- Increase n from 4-6 to 6-8
- Add more annotation rounds
- Verify random few-shot baseline is correct
- Consider whether task actually has high variance in difficulty
Symptom: Uncertainty scores all similar (no clear ranking)
Root causes:
- Task too easy (model confident on everything)
- Task too hard (model uncertain on everything)
- k too small
- Metric doesn't capture meaningful uncertainty
Solutions:
- If all high uncertainty: task may need fine-tuning, not few-shot
- If all low uncertainty: zero-shot may be sufficient
- Increase k to improve uncertainty signal
- Try different uncertainty metric
- Use human difficulty judgments to validate metric
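The disagreement and entropy metrics referenced in these solutions can be computed directly from the k sampled answers. A minimal sketch, assuming `answers` is a list of extracted final answers:

```python
import math
from collections import Counter

def disagreement(answers):
    """1 minus the majority-answer frequency: 0.0 means all k samples agree."""
    counts = Counter(answers)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(answers)

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical answer distribution;
    higher entropy means more spread-out answers, i.e. more uncertainty."""
    k = len(answers)
    return -sum((c / k) * math.log(c / k) for c in Counter(answers).values())
```

Entropy distinguishes a 3-1-1 split from a 3-2 split where disagreement alone cannot, which is why it is worth trying when disagreement scores cluster together.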
Symptom: High annotation cost, slow process
Root causes:
- Selecting too many examples per round
- Task complexity requires extensive annotations
- No annotation guidelines
Solutions:
- Reduce n to 3-4 examples per round, iterate multiple times
- Create detailed annotation guidelines with templates
- Use semi-automated annotation (model generates draft, human corrects)
- Consider whether Active Prompting ROI justifies cost vs alternatives
Symptom: Model still fails on certain types of inputs
Root causes:
- Selected examples don't cover all difficulty patterns
- Need multiple rounds to capture diversity
- Some input types fundamentally hard for few-shot
Solutions:
- Analyze failure cases: do they share patterns?
- Manually add examples covering failure patterns
- Run second round focusing on new uncertainty areas
- Consider clustering examples and sampling from each cluster
- May need RAG or fine-tuning for certain input types
Symptom: Inconsistent outputs even with good examples
Root causes:
- Temperature too high during inference
- Examples not diverse enough
- Prompt format issues
Solutions:
- Set temperature=0.0 for inference
- Add output format specification
- Combine with self-consistency (generate 5 outputs, take majority)
- Ensure examples demonstrate consistent format
Testing and Optimization
Validation Strategy:
Holdout Validation:
- Reserve 10-20% of pool as validation set (never use for uncertainty estimation)
- Test prompt performance on validation before final test set
- Use to tune n (number of examples) and k (sampling count)
Cross-Validation (Advanced):
- Split pool into 5 folds
- For each fold: select uncertain from other 4, test on held-out fold
- Validates uncertainty metric and selection process
- More robust but 5x compute cost
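The five-fold procedure can be sketched generically; `select_fn` and `evaluate_fn` below stand in for the uncertainty-selection and prompt-evaluation steps defined elsewhere in this guide:

```python
def cross_validate_selection(pool, n_folds, select_fn, evaluate_fn):
    """For each fold, select demonstrations from the remaining folds and
    score the resulting prompt on the held-out fold; return the mean score."""
    folds = [pool[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        # Training pool excludes the held-out fold entirely.
        train_pool = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        selected = select_fn(train_pool)
        scores.append(evaluate_fn(selected, held_out))
    return sum(scores) / n_folds
```

Because selection never sees the held-out fold, a high mean score indicates the uncertainty metric generalizes rather than memorizing pool quirks.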
Adversarial Testing:
- Create challenging edge cases manually
- Test if Active Prompting handles them better than random
- Include: ambiguous inputs, boundary cases, out-of-distribution examples
Test Coverage:
Essential coverage (minimum 50 test examples):
- Common cases (50%): Representative of expected inputs
- High-uncertainty cases (30%): Similar to annotated examples
- Edge cases (15%): Boundary conditions, ambiguous inputs
- Adversarial (5%): Intentionally challenging, tricky inputs
Quality Metrics:
Task-Specific Metrics:
- Classification: Accuracy, precision, recall, F1, confusion matrix
- Reasoning: Correctness of final answer, intermediate step accuracy
- Generation: Coherence, relevance, factual accuracy
- Code: Execution correctness, test pass rate
- QA: Exact match, F1, ROUGE (for longer answers)
General Metrics:
- Improvement over baseline: (Active - Random) / Random × 100%
- Consistency: Output variance across runs with temp=0
- Annotation efficiency: Performance gain per annotated example
- Coverage: % of test set types represented in selected examples
Evaluation Framework:
class ActivePromptEvaluator:
    def __init__(self, model, pool, test_set):
        self.model = model
        self.pool = pool
        self.test_set = test_set

    def evaluate_baseline(self, n=8):
        """Random few-shot baseline"""
        random_examples = random.sample(self.pool, n)
        # Get annotations for random examples
        annotated_random = annotate_examples(random_examples)
        accuracy = 0
        for test_q, test_a in self.test_set:
            prompt = create_few_shot(annotated_random, test_q)
            pred = self.model(prompt)
            accuracy += self.is_correct(pred, test_a)
        return accuracy / len(self.test_set)

    def evaluate_active(self, n=8, k=5):
        """Active Prompting evaluation"""
        # Select uncertain examples
        uncertain = self.select_uncertain(self.pool, n, k)
        annotated_active = annotate_examples(uncertain)
        accuracy = 0
        for test_q, test_a in self.test_set:
            prompt = create_few_shot(annotated_active, test_q)
            pred = self.model(prompt)
            accuracy += self.is_correct(pred, test_a)
        return accuracy / len(self.test_set)

    def compare(self):
        """Full comparison with statistical significance"""
        baseline_acc = self.evaluate_baseline()
        active_acc = self.evaluate_active()
        improvement = (active_acc - baseline_acc) / baseline_acc * 100
        print(f"Random few-shot: {baseline_acc:.1%}")
        print(f"Active Prompting: {active_acc:.1%}")
        print(f"Improvement: {improvement:.1f}%")
        # Statistical significance test (bootstrap or t-test)
        p_value = self.significance_test(baseline_acc, active_acc)
        print(f"P-value: {p_value:.4f}")
        return {
            'baseline': baseline_acc,
            'active': active_acc,
            'improvement': improvement,
            'p_value': p_value,
        }
Optimization Techniques:
1. Annotation Efficiency:
# Reduce annotations while maintaining quality
def efficient_active_prompting(pool, budget=8):
    # Round 1: Select half the budget (4 examples)
    round1 = select_uncertain(pool, n=budget // 2)
    annotated1 = annotate(round1)
    # Evaluate on validation set
    val_acc = evaluate(annotated1, validation_set)
    # If accuracy sufficient, stop early
    if val_acc > threshold:
        return annotated1
    # Round 2: Select remaining budget
    round2 = select_uncertain(pool, n=budget // 2, existing=annotated1)
    annotated2 = annotate(round2)
    return annotated1 + annotated2
2. Diversity Injection:
# Ensure diversity in selected examples
def diverse_uncertain_selection(pool, n=8, k=5):
    # Calculate uncertainty
    uncertainties = calculate_uncertainties(pool, k)
    # Sort by uncertainty
    sorted_pool = sort_by_uncertainty(uncertainties)
    # Select top 2n candidates
    candidates = sorted_pool[:2 * n]
    # Cluster candidates by similarity
    clusters = cluster_examples(candidates, n_clusters=n)
    # Select most uncertain from each cluster
    selected = []
    for cluster in clusters:
        most_uncertain = max(cluster, key=lambda x: x['uncertainty'])
        selected.append(most_uncertain)
    return selected
3. Iterative Refinement:
# Multi-round refinement with early stopping
def iterative_active(pool, max_rounds=3, examples_per_round=3):
    all_examples = []
    prev_accuracy = 0
    for round_num in range(max_rounds):
        # Select uncertain examples not in current set
        new_examples = select_uncertain(
            pool,
            n=examples_per_round,
            exclude=all_examples,
        )
        # Annotate
        annotated = annotate(new_examples)
        all_examples.extend(annotated)
        # Evaluate
        current_accuracy = evaluate(all_examples, validation_set)
        improvement = current_accuracy - prev_accuracy
        print(f"Round {round_num + 1}: {current_accuracy:.2%} (+{improvement:.2%})")
        # Early stopping if improvement < 2%
        if improvement < 0.02:
            print("Converged, stopping early")
            break
        prev_accuracy = current_accuracy
    return all_examples
4. Consistency Techniques:
Combine Active Prompting with self-consistency:
def active_with_self_consistency(annotated_examples, test_q, num_samples=5):
    """Generate multiple responses and take majority vote"""
    prompt = create_few_shot(annotated_examples, test_q)
    responses = []
    for _ in range(num_samples):
        response = model(prompt, temperature=0.7)
        responses.append(extract_answer(response))
    # Majority vote
    return Counter(responses).most_common(1)[0][0]
Iteration Criteria:
When to stop optimizing:
- Validation accuracy improvement <2% between iterations
- Reached annotation budget limit
- Validation accuracy >90% (excellent performance)
- Test accuracy plateaus across multiple rounds
- Annotation cost exceeds value of improvements
When to continue:
- Clear performance gaps on certain input types
- Validation accuracy 70-85% (room for improvement)
- Budget remaining and improvement trend positive
- Failure analysis reveals addressable patterns
A/B Testing Approach:
import random
import numpy as np
from scipy.stats import ttest_rel

def ab_test_active_vs_random(pool, test_set, n=8, trials=10):
    """Statistical comparison of Active vs Random"""
    active_accuracies = []
    random_accuracies = []
    for trial in range(trials):
        # Active Prompting
        uncertain = select_uncertain(pool, n=n, k=5)
        annotated_active = annotate(uncertain)
        active_acc = evaluate(annotated_active, test_set)
        active_accuracies.append(active_acc)
        # Random few-shot
        random_ex = random.sample(pool, n)
        annotated_random = annotate(random_ex)
        random_acc = evaluate(annotated_random, test_set)
        random_accuracies.append(random_acc)
    # Statistical test (paired t-test across trials)
    t_stat, p_value = ttest_rel(active_accuracies, random_accuracies)
    print(f"Active: {np.mean(active_accuracies):.2%} ± {np.std(active_accuracies):.2%}")
    print(f"Random: {np.mean(random_accuracies):.2%} ± {np.std(random_accuracies):.2%}")
    print(f"P-value: {p_value:.4f}")
    return {
        'active_mean': np.mean(active_accuracies),
        'random_mean': np.mean(random_accuracies),
        'p_value': p_value,
        'significant': p_value < 0.05,
    }
Limitations and Constraints
Known Limitations
1. Requires Example Pool (Fundamental):
Active Prompting needs 100+ unlabeled examples for uncertainty estimation. If you don't have access to representative examples, the technique cannot be applied. This makes it unsuitable for truly novel tasks or very rare scenarios.
2. Annotation Bottleneck:
Effectiveness depends on expert annotation quality. If annotators lack domain expertise or provide inconsistent explanations, performance gains diminish. For specialized domains (medical, legal), finding qualified annotators can be challenging and expensive.
3. Computational Overhead:
Uncertainty estimation requires k × pool_size forward passes. For pool_size=500 and k=10, that's 5,000 API calls just for example selection; at $0.01 per call, $50 before a single annotation is written. This overhead is justified only when the annotation budget is large or the performance gains are critical.
4. Uncertainty Metric Dependency:
Performance critically depends on uncertainty metric quality. Disagreement works well for discrete answers but poorly for open-ended generation. Some tasks lack clear uncertainty signals, making selection barely better than random.
5. Diminishing Returns:
Improvements strongest for first 4-6 examples, then plateau. Going from 8 to 12 examples rarely provides >2% additional gain. Multiple rounds show similar pattern: first round gives 5-10% improvement, second round 2-3%, third round <1%.
6. Context Window Constraints:
With 8 detailed CoT examples × 300 tokens each = 2400 tokens just for examples. Add test question (200 tokens) and response (500 tokens) = 3100 total. Limits usability with smaller context windows or very long examples.
7. No Performance Guarantee:
Active Prompting improves over random selection on average, but specific tasks may show no benefit. If task difficulty is uniform across examples, uncertainty-based selection offers no advantage, so validation testing is essential before committing resources.
Edge Cases
All examples equally uncertain:
- Happens when task beyond model capability
- Disagreement scores cluster in narrow range
- Detection: Standard deviation of uncertainty scores <0.1
- Solution: Task may need fine-tuning rather than better examples
All examples equally certain:
- Happens when task too easy for model
- Disagreement scores all near 0
- Detection: Max uncertainty score <0.2
- Solution: Zero-shot or simple few-shot sufficient
Selected examples too similar:
- High-uncertainty examples cluster in one difficulty type
- Lack diversity in reasoning patterns
- Detection: Manual review shows redundancy
- Solution: Use clustering-based diverse selection
Annotator disagreement:
- Different expert annotators provide conflicting answers
- Indicates genuinely ambiguous examples
- Detection: Inter-annotator agreement <0.7
- Solution: Discuss to reach consensus or use multiple valid approaches in examples
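The agreement threshold above can be checked with a simple pairwise exact-match score on final answers. A minimal sketch; this is a crude proxy for kappa-style agreement statistics, not a replacement for them:

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Mean exact-match agreement on final answers over all annotator pairs.
    Returns 1.0 for a single annotation (no pairs to compare)."""
    answers = [a["answer"] for a in annotations]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(x == y for x, y in pairs) / len(pairs)
```

An example scoring below 0.7 on this metric is a candidate for the consensus discussion described above.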
Out-of-distribution test inputs:
- Test inputs differ significantly from example pool
- Uncertainty estimation not representative
- Detection: Performance on test set much worse than validation
- Solution: Ensure pool representative of deployment distribution
Format non-compliance:
- Model generates wrong format despite examples
- Happens with complex structured outputs
- Detection: >20% format violations
- Solution: Add explicit format instructions, use structured output mode, or consider fine-tuning
Graceful Degradation:
def robust_active_prompting(pool, test_set, n=8, k=5):
    """Active Prompting with fallback strategies"""
    # Attempt uncertainty estimation
    try:
        uncertainties = calculate_uncertainties(pool, k)
        uncertainty_std = np.std([u['score'] for u in uncertainties])
        # Check if uncertainty signal meaningful
        if uncertainty_std < 0.1:
            print("Warning: Low uncertainty variance, falling back to diverse sampling")
            selected = diverse_sampling(pool, n)
        else:
            selected = top_uncertain(uncertainties, n)
    except Exception as e:
        print(f"Uncertainty estimation failed: {e}")
        print("Falling back to random sampling")
        selected = random.sample(pool, n)
    # Annotate selected examples
    annotated = annotate_with_validation(selected)
    # Evaluate on validation set
    val_accuracy = evaluate(annotated, validation_set)
    # If performance poor, try random as sanity check
    if val_accuracy < 0.5:
        print("Warning: Low performance, trying random baseline")
        random_examples = random.sample(pool, n)
        random_annotated = annotate_with_validation(random_examples)
        random_acc = evaluate(random_annotated, validation_set)
        # Use better performing set
        if random_acc > val_accuracy:
            print("Random selection outperformed Active, using random")
            annotated = random_annotated
    return annotated
Constraint Management
Balancing Competing Factors:
Annotation budget vs accuracy:
- Start with minimum viable n (4 examples)
- Measure improvement per example
- Stop when marginal improvement <1% per additional annotation
- Example: If 4 examples → 70%, 6 examples → 75%, 8 examples → 76%, stop at 6
Uncertainty vs diversity:
- Pure uncertainty may select very similar hard examples
- Pure diversity may include uninformative easy examples
- Solution: Select top-2n uncertain, then cluster and pick one per cluster
Context length vs example count:
- More examples → better performance but longer context
- Longer context → higher cost and potential attention dilution
- Solution: Compress CoT annotations or use shorter examples when context limited
Compute budget vs k (samples):
- Higher k → better uncertainty signal but k× cost
- Lower k → cheaper but noisier uncertainty
- Solution: Start k=5, increase to 10 only if uncertainty scores unstable
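The budget-vs-accuracy stopping rule (stop when the marginal gain from another annotation batch falls below a threshold) can be expressed as a small helper. A minimal sketch, assuming `accuracies` holds validation accuracy measured after each added batch of examples:

```python
def stop_index(accuracies, min_gain=0.01):
    """Given validation accuracy after each added batch of annotations,
    return the index of the last batch worth keeping: stop as soon as the
    gain from the next batch falls below min_gain."""
    for i in range(1, len(accuracies)):
        if accuracies[i] - accuracies[i - 1] < min_gain:
            return i - 1
    # Every batch cleared the threshold; keep them all.
    return len(accuracies) - 1
```

For the worked example above (4 examples → 70%, 6 → 75%, 8 → 76%), a per-batch threshold of 2% stops after the 6-example batch.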
Handling Token/Context Constraints:
def context_aware_active_prompting(pool, test_q, max_context=4000):
    """Select examples fitting within context budget"""
    # Calculate uncertainty
    uncertainties = calculate_uncertainties(pool, k=5)
    sorted_uncertain = sort_by_uncertainty(uncertainties)
    # Select examples fitting in context
    selected = []
    current_tokens = count_tokens(test_q) + 500  # Reserve for response
    for example in sorted_uncertain:
        example_tokens = (count_tokens(example['question'])
                          + count_tokens(example['annotation']))
        if current_tokens + example_tokens < max_context:
            selected.append(example)
            current_tokens += example_tokens
        if len(selected) >= 8:  # Max desired examples
            break
    return selected
Handling Incomplete Information:
def active_prompting_with_imputation(pool_with_missing):
    """Handle incomplete example pool"""
    # Filter out examples with missing critical information
    complete_examples = [ex for ex in pool_with_missing if is_complete(ex)]
    if len(complete_examples) < 100:
        print(f"Warning: Only {len(complete_examples)} complete examples")
    # If too few, use data augmentation
    if len(complete_examples) < 50:
        augmented = augment_examples(complete_examples)
        complete_examples.extend(augmented)
    # Proceed with Active Prompting on complete examples
    return select_uncertain(complete_examples, n=8, k=5)
Error Handling and Recovery:
class RobustActivePrompting:
    def __init__(self, model):
        self.model = model
        self.fallback_strategies = ['random', 'diverse', 'manual']

    def select_with_recovery(self, pool, n=8, k=5):
        """Attempt Active selection with fallbacks"""
        try:
            # Primary: Active Prompting
            selected = self.active_selection(pool, n, k)
            return selected, 'active'
        except InsufficientUncertaintyError:
            print("Insufficient uncertainty signal, using diverse sampling")
            return self.diverse_selection(pool, n), 'diverse'
        except APIError as e:
            print(f"API error during uncertainty estimation: {e}")
            print("Falling back to random selection")
            return random.sample(pool, n), 'random'
        except Exception as e:
            print(f"Unexpected error: {e}")
            print("Manual example selection recommended")
            return None, 'manual'

    def execute_with_fallback(self, pool, test_set, n=8):
        """Full execution with error recovery"""
        selected, method = self.select_with_recovery(pool, n)
        if selected is None:
            raise ValueError("Automatic selection failed, manual intervention needed")
        # Annotate
        try:
            annotated = self.annotate_with_validation(selected)
        except AnnotationError as e:
            print(f"Annotation failed: {e}")
            # Retry with simpler annotation requirements
            annotated = self.simple_annotate(selected)
        # Evaluate
        accuracy = self.evaluate(annotated, test_set)
        print(f"Method: {method}, Accuracy: {accuracy:.2%}")
        return annotated, accuracy, method
Advanced Techniques
Clarity and Context Optimization
Ensuring Clear Annotation Guidelines:
Annotation quality directly determines Active Prompting effectiveness. Clear guidelines ensure consistent, high-quality expert annotations.
Annotation Template:
# Annotation Guidelines for [Task Name]
## Objective
Provide step-by-step reasoning that leads to the correct answer.
## Format
Question: [Original question]
Reasoning: [Your detailed thought process, 2-5 sentences]
Answer: [Final answer in specified format]
## Requirements
1. Break down the problem into clear logical steps
2. Show intermediate calculations or inferences
3. Explain WHY each step follows from the previous
4. Verify the answer makes sense
5. Use consistent terminology
## Example Annotation
Question: If a car travels 120 miles in 3 hours, then travels another 80 miles in 2 hours, what is the average speed for the entire trip?
Reasoning: First, I'll calculate the total distance: 120 + 80 = 200 miles. Next, the total time: 3 + 2 = 5 hours. Average speed equals total distance divided by total time: 200 ÷ 5 = 40 miles per hour. This checks out: both segments were also driven at 40 mph (120 ÷ 3 = 40 and 80 ÷ 2 = 40), so an overall average of 40 mph is consistent.
Answer: 40 miles per hour
## What to Avoid
- ❌ Just providing the answer without reasoning
- ❌ Skipping intermediate steps
- ❌ Using inconsistent notation
- ❌ Assumptions without justification
Balancing Detail vs Conciseness:
def optimize_annotation_length(example, max_tokens=300):
    """Balance detailed reasoning with token constraints"""
    # Get full detailed annotation
    full_annotation = expert_annotate(example)
    token_count = count_tokens(full_annotation['reasoning'])
    if token_count <= max_tokens:
        return full_annotation
    # If too long, request compressed version
    compression_prompt = f"""
    This reasoning is too long ({token_count} tokens).
    Compress to {max_tokens} tokens while keeping:
    1. Key logical steps
    2. Critical calculations
    3. Final verification
    Original: {full_annotation['reasoning']}
    Compressed version:
    """
    compressed = model(compression_prompt)
    return {
        'question': example,
        'reasoning': compressed,
        'answer': full_annotation['answer'],
    }
Context Optimization:
For tasks requiring domain knowledge, provide context without overwhelming:
def context_aware_annotation(example, domain_knowledge):
    """Include minimal necessary context"""
    annotation_prompt = f"""
    Domain context: {domain_knowledge['key_concepts']}
    Annotate this example:
    {example}
    Requirements:
    - Reference domain concepts only when necessary
    - Assume annotator familiar with basic domain knowledge
    - Focus on problem-specific reasoning
    """
    return expert_annotate(annotation_prompt)
Example Design (Effective Demonstrations):
What makes an effective example:
- Addresses model confusion: Selected because model uncertain, not arbitrary
- Clear reasoning chain: Step-by-step logic, no unexplained jumps
- Representative: Similar to expected test inputs
- Correct: Verified by domain expert
- Concise: No unnecessary verbosity
- Consistent: Same format and terminology as other examples
Optimal Number and Diversity:
- Classification: 4-6 examples, ensure all classes represented
- Reasoning: 6-8 examples, cover different reasoning patterns
- Generation: 5-7 examples, diverse styles and lengths
- Code: 6-10 examples, various edge cases and common patterns
Diversity Techniques:
def ensure_diverse_selection(uncertain_examples, n=8):
    """Balance uncertainty with diversity"""
    # Embed examples
    embeddings = embed_examples(uncertain_examples)
    # Cluster into n groups
    clusters = kmeans_clustering(embeddings, n_clusters=n)
    # Select most uncertain from each cluster
    selected = []
    for cluster in clusters:
        most_uncertain = max(cluster, key=lambda x: x['uncertainty'])
        selected.append(most_uncertain)
    return selected
Advanced Reasoning and Output Control
Multi-Step Reasoning:
Active Prompting is particularly effective for complex reasoning when annotations decompose problems into explicit steps:
def structured_reasoning_annotation(question):
    """Template for complex multi-step problems"""
    annotation = {
        'question': question,
        'reasoning': """
        Step 1 - Understand: [What is given? What is asked?]
        Step 2 - Plan: [What approach will solve this?]
        Step 3 - Execute: [Carry out the calculations/reasoning]
        Step 4 - Verify: [Does the answer make sense? Check units/reasonableness]
        """,
        'answer': '[Final answer]',
    }
    return annotation
Self-Verification Integration:
Encourage verification in annotated examples:
Question: John has $50. He spends 30% on food. How much is left?
Reasoning: First, calculate 30% of $50: 0.30 × 50 = $15. This is what he spends. To find what's left: 50 - 15 = $35. Let me verify: $15 (spent) + $35 (left) = $50 ✓
Answer: $35
Structured Output Enforcement:
def structured_output_examples(uncertain_examples):
    """Ensure examples demonstrate desired output format"""
    annotated = []
    for ex in uncertain_examples:
        annotation = {
            'question': ex['question'],
            'reasoning': '[Step-by-step thought process]',
            'answer': {
                'final_answer': '[Answer value]',
                'confidence': '[high/medium/low]',
                'assumptions': ['[Assumption 1]', '[Assumption 2]'],
            },
        }
        annotated.append(annotation)
    return annotated
Constraint Enforcement:
Hard constraints in examples teach model boundaries:
Question: Summarize this article in exactly 3 sentences.
Reasoning: The article covers three main points: [A], [B], [C]. I'll dedicate one sentence to each. Sentence 1 addresses [A]... Sentence 2 covers [B]... Sentence 3 explains [C]. Checking: that's exactly 3 sentences as required.
Answer: [Sentence 1]. [Sentence 2]. [Sentence 3].
Interaction Patterns
Iterative Active Prompting:
def iterative_with_feedback(pool, test_set, max_rounds=3):
    """Multiple rounds with performance feedback"""
    all_examples = []
    prev_failures = None
    for round_num in range(max_rounds):
        # Select uncertain examples not yet included
        new_uncertain = select_uncertain(
            pool,
            n=3,
            existing_examples=all_examples,
        )
        # Annotate
        annotated = expert_annotate(new_uncertain)
        all_examples.extend(annotated)
        # Evaluate
        accuracy = evaluate(all_examples, test_set)
        # Analyze failures
        failures = [ex for ex in test_set if not correct(ex, all_examples)]
        print(f"Round {round_num + 1}: {len(all_examples)} examples, {accuracy:.2%}")
        # If accuracy sufficient or failures plateau, stop
        if accuracy > 0.9 or (round_num > 0 and len(failures) == prev_failures):
            break
        prev_failures = len(failures)
    return all_examples
Chaining with Other Techniques:
Combine Active Prompting with self-consistency:
def active_with_self_consistency(active_examples, test_q, n_samples=5):
    """Active Prompting + Self-Consistency ensemble"""
    # Create prompt with active-selected examples
    prompt = create_few_shot_prompt(active_examples, test_q)
    # Generate multiple responses
    responses = []
    for _ in range(n_samples):
        response = model(prompt, temperature=0.7)
        responses.append(extract_answer(response))
    # Majority vote
    final_answer = Counter(responses).most_common(1)[0][0]
    return final_answer
Model Considerations
Model-Specific Adaptations:
GPT-4 / GPT-4 Turbo:
- Excellent few-shot learning, benefits significantly from Active Prompting
- Can handle 8-12 examples without performance degradation
- Use temperature=1.0 for uncertainty estimation, 0.0 for inference
- Benefits from detailed CoT in examples
Claude 3.5 Sonnet:
- Strong instruction following, may need fewer examples (4-6)
- Particularly good at following format demonstrated in examples
- Consider using slightly lower k (5-7) as outputs less variable
- Excellent at maintaining consistent reasoning style from examples
O1 / O3 (Reasoning Models):
- Active Prompting is less beneficial here because these models are already strong zero-shot reasoners
- If using few-shot with O1, keep examples minimal (2-4)
- Focus on format specification rather than reasoning guidance
- Uncertainty estimation may differ due to internal reasoning
Llama 3 70B / 405B:
- Benefits from Active Prompting but needs more examples (8-12)
- Higher k recommended (8-10) for reliable uncertainty signals
- More sensitive to example quality than GPT-4
- Consider higher temperature (0.8-1.0) during uncertainty estimation
Cross-Model Prompts:
If deploying across multiple models:
def model_agnostic_active_prompting(pool, models, n=8):
    """Select examples that work well across models"""
    # Calculate uncertainty across multiple models
    multi_model_uncertainties = []
    for example in pool:
        uncertainties = []
        for model in models:
            responses = [model.generate(example) for _ in range(5)]
            uncertainty = calculate_disagreement(responses)
            uncertainties.append(uncertainty)
        # Average uncertainty across models
        avg_uncertainty = np.mean(uncertainties)
        multi_model_uncertainties.append({
            'example': example,
            'uncertainty': avg_uncertainty,
        })
    # Select examples uncertain across models
    selected = sorted(multi_model_uncertainties,
                      key=lambda x: x['uncertainty'],
                      reverse=True)[:n]
    return selected
Safety, Robustness, and Domain Adaptation
Output Safety:
Ensure annotated examples demonstrate safe, appropriate responses:
def safe_annotation_validation(annotation):
    """Validate annotations for safety concerns"""
    checks = {
        'no_harmful_content': not contains_harmful(annotation['reasoning']),
        'no_bias': not contains_bias_markers(annotation['reasoning']),
        'factually_grounded': verify_facts(annotation['answer']),
        'appropriate_tone': check_tone(annotation['reasoning']),
    }
    if not all(checks.values()):
        failed = [k for k, v in checks.items() if not v]
        raise SafetyError(f"Annotation failed safety checks: {failed}")
    return True
Reliability Through Consistency:
Multiple annotators for critical examples:
def multi_annotator_consensus(example, n_annotators=3):
    """Get multiple annotations and verify agreement"""
    annotations = [expert_annotate(example) for _ in range(n_annotators)]
    # Check answer agreement
    answers = [a['answer'] for a in annotations]
    if len(set(answers)) > 1:
        # Disagreement - needs resolution
        print(f"Annotator disagreement on: {example}")
        consensus = resolve_disagreement(annotations)
        return consensus
    # Take annotation with best reasoning
    best = max(annotations, key=lambda a: score_reasoning_quality(a['reasoning']))
    return best
Domain Adaptation:
def domain_specific_active_prompting(pool, domain, n=8):
    """Adapt Active Prompting to specific domain"""
    # Load domain-specific resources
    terminology = load_domain_terminology(domain)
    conventions = load_domain_conventions(domain)
    # Calculate uncertainty with domain-aware metric
    uncertainties = []
    for example in pool:
        responses = generate_responses(example, k=5)
        # Domain-specific uncertainty (e.g., medical diagnosis diversity)
        uncertainty = domain_uncertainty_metric(responses, domain)
        uncertainties.append({'example': example, 'uncertainty': uncertainty})
    # Select uncertain examples
    selected = sorted(uncertainties, key=lambda x: x['uncertainty'], reverse=True)[:n]
    # Annotate with domain guidelines
    annotated = []
    for ex in selected:
        annotation = domain_expert_annotate(
            ex['example'],
            terminology=terminology,
            conventions=conventions,
        )
        annotated.append(annotation)
    return annotated
Example Domain Adaptations:
Medical:
medical_annotation_guidelines = """
1. Use standard medical terminology (ICD codes, symptom names)
2. Follow differential diagnosis reasoning pattern
3. Consider contraindications and drug interactions
4. Reference clinical guidelines when applicable
5. Express uncertainty appropriately
"""
Legal:
legal_annotation_guidelines = """
1. Cite relevant statutes and case law
2. Follow IRAC structure (Issue, Rule, Application, Conclusion)
3. Consider jurisdiction-specific rules
4. Address counter-arguments
5. Use precise legal terminology
"""
Code Generation:
code_annotation_guidelines = """
1. Include edge case handling
2. Follow language-specific best practices
3. Add brief comments for complex logic
4. Consider time/space complexity
5. Show test cases in reasoning
"""
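The guideline strings above only matter once they reach an annotator. A minimal sketch of how they might be spliced into an annotation request; the prompt layout and the `build_annotation_prompt` helper are illustrative assumptions, not part of the original method:

```python
def build_annotation_prompt(example: str, guidelines: str) -> str:
    """Compose the instructions shown to a domain expert (or an LLM
    drafting annotations for expert review) for one uncertain example."""
    return (
        "Annotate the following example with step-by-step reasoning.\n"
        f"Follow these domain guidelines:\n{guidelines.strip()}\n\n"
        f"Example: {example}\n"
        "Reasoning:"
    )

code_annotation_guidelines = """
1. Include edge case handling
2. Follow language-specific best practices
"""

prompt = build_annotation_prompt("Reverse a linked list.", code_annotation_guidelines)
```

The same helper works for the medical and legal guideline strings; only the `guidelines` argument changes.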
Risk and Ethics
Ethical Considerations
Annotation Labor:
Active Prompting requires expert human annotation. Ethical considerations:
- Fair compensation: Expert annotators should be paid appropriately for specialized knowledge
- Clear expectations: Annotation guidelines should be clear to avoid wasted effort
- Credit: If using annotated examples in production, consider acknowledging contributors
- Data rights: Clarify ownership of annotations
Bias Amplification Risk:
If model uncertainty correlates with demographic or sensitive attributes, Active Prompting could amplify bias:
def bias_aware_selection(pool, sensitive_attributes, n=8, threshold=0.1):
    """Monitor for bias in selected examples"""
    # Select uncertain examples
    selected = select_uncertain(pool, n=n)
    # Check for demographic skew
    for attribute in sensitive_attributes:
        distribution = analyze_distribution(selected, attribute)
        pool_distribution = analyze_distribution(pool, attribute)
        # Alert if selected examples are skewed vs the pool
        if kl_divergence(distribution, pool_distribution) > threshold:
            print(f"Warning: Selection biased on {attribute}")
            print(f"Selected: {distribution}, Pool: {pool_distribution}")
            # Consider rebalancing
            selected = rebalance_selection(selected, pool, attribute)
    return selected
Model Capability Revelation:
Active Prompting identifies model weaknesses systematically. This could:
- Positive: Help developers improve models and identify failure modes
- Negative: Potentially be used to systematically find adversarial examples or exploit vulnerabilities
Transparency:
When deploying Active-Prompted systems:
- Disclose that examples were selected based on model uncertainty
- Document the annotation process and quality control
- Make clear that the system's knowledge is limited to the annotated examples plus pre-training
Risk Analysis
Failure Modes:
1. Poor Uncertainty Estimation:
- Symptom: Selected examples no more informative than random
- Impact: Wasted annotation effort, no performance gain
- Probability: Medium (20-30% of applications)
- Mitigation: Validate uncertainty metric on small sample before full annotation
2. Low-Quality Annotations:
- Symptom: Annotators provide incorrect or inconsistent reasoning
- Impact: Model learns wrong patterns, performance degrades
- Probability: Low-Medium (10-20% without quality control)
- Mitigation: Multi-annotator verification, expert validation, clear guidelines
3. Overfitting to Selected Examples:
- Symptom: Excellent performance on validation, poor on test set
- Impact: False confidence in model capability
- Probability: Low (5-10% with proper validation)
- Mitigation: Holdout test set, diverse example selection, cross-validation
4. Annotation Budget Exceeded:
- Symptom: More examples needed than budget allows
- Impact: Incomplete implementation, suboptimal performance
- Probability: Medium (25-35% of projects)
- Mitigation: Iterative approach, start small, measure ROI per example
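The mitigation for failure mode 1 can be made concrete: before spending the full annotation budget, check on a small labeled pilot set that the uncertainty score actually tracks model error. This is a sketch on synthetic data; in practice `uncertainty` comes from k-sample disagreement and `is_wrong` from a handful of examples with known answers:

```python
def uncertainty_sanity_check(pilot, min_gap=0.1):
    """Return True if high-uncertainty examples are wrong noticeably more
    often than low-uncertainty ones (i.e. the metric carries signal)."""
    pilot_sorted = sorted(pilot, key=lambda p: p["uncertainty"])
    half = len(pilot_sorted) // 2
    low, high = pilot_sorted[:half], pilot_sorted[half:]

    def error_rate(group):
        return sum(p["is_wrong"] for p in group) / len(group)

    return error_rate(high) - error_rate(low) >= min_gap

# Synthetic pilot: uncertain examples tend to be the ones the model gets wrong
pilot = [
    {"uncertainty": 0.9, "is_wrong": True},
    {"uncertainty": 0.8, "is_wrong": True},
    {"uncertainty": 0.7, "is_wrong": False},
    {"uncertainty": 0.2, "is_wrong": False},
    {"uncertainty": 0.15, "is_wrong": False},
    {"uncertainty": 0.1, "is_wrong": False},
]
ok = uncertainty_sanity_check(pilot)
```

If the check fails, the uncertainty metric is no better than random and annotation effort would be wasted.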
Cascading Failures:
If annotated examples contain errors → model learns incorrect patterns → systematic failures on similar inputs → compounding error propagation
Prevention:
import random

def annotation_quality_gate(annotations, sample_size=0.2):
    """Validate annotation quality before proceeding"""
    # Sample annotations for independent verification
    sample = random.sample(annotations, max(1, int(len(annotations) * sample_size)))
    # Second expert validates
    agreements = 0
    for annotation in sample:
        verification = independent_expert_verify(annotation)
        if verification['agrees']:
            agreements += 1
    agreement_rate = agreements / len(sample)
    if agreement_rate < 0.9:
        raise QualityError(f"Low agreement rate: {agreement_rate:.1%}")
    return True
Safety Concerns:
Prompt Injection via Pool Examples:
If example pool includes user-generated content, adversarial users could inject malicious examples designed to be "uncertain" and get selected:
def sanitize_example_pool(pool):
    """Remove potentially adversarial examples"""
    sanitized = []
    for example in pool:
        # Check for prompt injection patterns
        if contains_injection_patterns(example):
            continue
        # Check for unusual formatting
        if unusual_formatting(example):
            continue
        # Check length anomalies
        if len(example) > max_reasonable_length:
            continue
        sanitized.append(example)
    return sanitized
Adversarial Uncertainty Manipulation:
An attacker could craft inputs designed to maximize model disagreement, forcing the selection of adversarial examples:
Mitigation:
- Validate that high-uncertainty examples are genuinely difficult, not adversarial
- Manual review of top-20 uncertain before annotation
- Use multiple uncertainty metrics and flag discrepancies
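The last mitigation can be sketched directly: score each candidate with two independent uncertainty metrics and flag examples where they strongly diverge, since those are the ones worth manual review before annotation. Both metrics below are simple stand-ins for real k-sample estimators:

```python
from collections import Counter

def disagreement_rate(answers):
    """1 minus the frequency of the modal answer across k samples."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1 - top_count / len(answers)

def distinct_ratio(answers):
    """Normalized count of distinct answers, a second crude metric."""
    return (len(set(answers)) - 1) / max(len(answers) - 1, 1)

def flag_discrepancies(sampled_answers, gap=0.25):
    """Return indices of examples where the two metrics diverge by more than `gap`."""
    flagged = []
    for i, answers in enumerate(sampled_answers):
        if abs(disagreement_rate(answers) - distinct_ratio(answers)) > gap:
            flagged.append(i)
    return flagged

# Example: unanimous answers agree on both metrics; a 5/4/1 answer split does not
flagged = flag_discrepancies([
    ["x"] * 5,
    ["a", "b", "a", "b", "a", "b", "a", "b", "a", "c"],
])
```

Flagged examples are not necessarily adversarial, but they are exactly where a cheap manual look is worth the cost.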
Bias Amplification:
Sources of Bias:
- Selection Bias: if the model is more uncertain about certain demographics, those groups become overrepresented in the examples
- Annotation Bias: annotators' biases are reflected in the reasoning explanations
- Framing Bias: how examples are framed affects the model's learned associations
Detection:
def detect_selection_bias(selected_examples, pool, sensitive_attrs):
    """Detect demographic bias in selection"""
    biases_detected = []
    for attr in sensitive_attrs:
        # Distribution in selected examples
        selected_dist = get_attribute_distribution(selected_examples, attr)
        # Distribution in pool
        pool_dist = get_attribute_distribution(pool, attr)
        # Statistical test for difference
        chi2, p_value = chi_square_test(selected_dist, pool_dist)
        if p_value < 0.05:
            biases_detected.append({
                'attribute': attr,
                'selected_dist': selected_dist,
                'pool_dist': pool_dist,
                'p_value': p_value
            })
    return biases_detected
Mitigation:
def debias_selection(pool, sensitive_attrs, n=8):
    """Select uncertain examples while maintaining demographic balance"""
    # Calculate uncertainty
    uncertainties = calculate_uncertainties(pool, k=5)
    # Stratified selection maintaining the pool distribution
    selected = []
    for attr in sensitive_attrs:
        pool_dist = get_attribute_distribution(pool, attr)
        # Select proportionally from each group
        for attr_value, proportion in pool_dist.items():
            n_from_group = int(n * proportion)
            group_examples = [u for u in uncertainties
                              if get_attribute(u['example'], attr) == attr_value]
            group_selected = sorted(group_examples,
                                    key=lambda x: x['uncertainty'],
                                    reverse=True)[:n_from_group]
            selected.extend(group_selected)
    return selected[:n]  # In case of rounding, limit to n
Innovation Potential
Novel Combinations:
Active Prompting + RAG: Use Active Prompting to select most informative retrieved examples:
def active_rag(query, document_pool):
    """Retrieve, then actively select the most informative examples"""
    # Retrieve relevant documents
    retrieved = retrieve_top_k(query, document_pool, k=50)
    # Calculate uncertainty on the retrieved set
    uncertainties = calculate_uncertainties(retrieved, k=5)
    # Select the most uncertain (most informative) retrieved docs
    selected = top_n_uncertain(uncertainties, n=5)
    # Use as context for generation
    context = format_context(selected)
    return generate_with_context(query, context)
Active Prompting + Meta-Learning: Learn which types of examples most effective:
def meta_active_prompting(pool, validation_set):
    """Learn which example selection patterns work best"""
    # Try different selection strategies
    strategies = [
        'pure_uncertainty',
        'diverse_uncertain',
        'clustered_uncertain',
        'stratified_uncertain'
    ]
    strategy_performance = {}
    for strategy in strategies:
        selected = apply_strategy(pool, strategy, n=8)
        annotated = annotate(selected)
        accuracy = evaluate(annotated, validation_set)
        strategy_performance[strategy] = accuracy
    # Learn which strategy works best for this task type
    best_strategy = max(strategy_performance, key=strategy_performance.get)
    return best_strategy
Derived Innovations:
- Continuous Active Prompting: In production, identify uncertain cases from real traffic, request annotations, update prompts
- Transfer Active Prompting: Use uncertainty patterns from one task to inform example selection on related tasks
- Hierarchical Active Prompting: Multi-level selection - first select task categories, then uncertain examples within each
- Collaborative Active Prompting: Multiple annotators vote on which examples they find most instructive
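Of these, Hierarchical Active Prompting is the most mechanical to sketch: rank task categories by mean uncertainty, then take the most uncertain examples within the top categories. The field names (`category`, `uncertainty`) are illustrative assumptions:

```python
from collections import defaultdict

def hierarchical_select(pool, n_categories=2, n_per_category=2):
    """Two-level active selection: uncertain categories first, then
    uncertain examples within each selected category."""
    by_cat = defaultdict(list)
    for ex in pool:
        by_cat[ex["category"]].append(ex)
    # Rank categories by the mean uncertainty of their examples
    ranked = sorted(
        by_cat.items(),
        key=lambda kv: sum(e["uncertainty"] for e in kv[1]) / len(kv[1]),
        reverse=True,
    )[:n_categories]
    selected = []
    for _, examples in ranked:
        selected.extend(sorted(examples, key=lambda e: e["uncertainty"],
                               reverse=True)[:n_per_category])
    return selected

pool = [
    {"category": "algebra", "uncertainty": 0.9},
    {"category": "algebra", "uncertainty": 0.8},
    {"category": "geometry", "uncertainty": 0.2},
    {"category": "geometry", "uncertainty": 0.1},
    {"category": "word", "uncertainty": 0.5},
    {"category": "word", "uncertainty": 0.6},
]
picked = hierarchical_select(pool, n_categories=2, n_per_category=1)
```

The two-level structure keeps the annotation budget from collapsing onto a single hard category.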
Ecosystem and Integration
Tools and Frameworks
LangChain:
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import FewShotPromptTemplate, PromptTemplate

def langchain_active_prompting(pool, test_set):
    """Active Prompting with LangChain"""
    # Select uncertain examples (custom logic)
    uncertain = select_uncertain_examples(pool, n=8, k=5)
    # Annotate
    annotated = annotate_examples(uncertain)
    # Create FewShotPromptTemplate
    example_prompt = PromptTemplate(
        input_variables=["question", "reasoning", "answer"],
        template="Question: {question}\nReasoning: {reasoning}\nAnswer: {answer}"
    )
    few_shot_prompt = FewShotPromptTemplate(
        examples=annotated,
        example_prompt=example_prompt,
        suffix="Question: {input}\nReasoning:",
        input_variables=["input"]
    )
    # Create chain
    llm = OpenAI(temperature=0.0)
    chain = LLMChain(llm=llm, prompt=few_shot_prompt)
    # Run on test set
    results = [chain.run(input=test_q) for test_q in test_set]
    return results
DSPy (Declarative Self-improving Python):
DSPy has built-in support for example optimization, which can be combined with Active Prompting:
import dspy

class ActiveCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.generate_answer(question=question)

# Active selection of training examples
def active_dspy_examples(pool, n=8):
    """Select uncertain examples for the DSPy optimizer"""
    # Initialize model
    lm = dspy.OpenAI(model="gpt-4")
    dspy.settings.configure(lm=lm)
    # Calculate uncertainty
    uncertainties = []
    for example in pool:
        responses = [ActiveCoT()(example['question']) for _ in range(5)]
        uncertainty = calculate_disagreement(responses)
        uncertainties.append({'example': example, 'uncertainty': uncertainty})
    # Select top uncertain
    selected = sorted(uncertainties, key=lambda x: x['uncertainty'], reverse=True)[:n]
    return [s['example'] for s in selected]

# Use with DSPy optimizer
trainset = active_dspy_examples(pool, n=8)
teleprompter = dspy.teleprompt.BootstrapFewShot(metric=answer_correctness)
optimized_cot = teleprompter.compile(ActiveCoT(), trainset=trainset)
Haystack:
from haystack import Pipeline
from haystack.nodes import PromptNode, PromptTemplate

def haystack_active_prompting(pool, test_set):
    """Active Prompting with Haystack"""
    # Select uncertain examples
    uncertain = select_uncertain_examples(pool, n=8)
    annotated = annotate_examples(uncertain)
    # Create prompt template with examples
    examples_text = "\n\n".join([
        f"Question: {ex['question']}\nReasoning: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in annotated
    ])
    prompt_template = PromptTemplate(
        prompt=f"{examples_text}\n\nQuestion: {{query}}\nReasoning:",
        output_parser={"type": "AnswerParser"}
    )
    # Create pipeline
    prompt_node = PromptNode(
        model_name_or_path="gpt-4",
        default_prompt_template=prompt_template,
        api_key="your-key"
    )
    pipeline = Pipeline()
    pipeline.add_node(component=prompt_node, name="prompt", inputs=["Query"])
    # Run
    results = [pipeline.run(query=test_q) for test_q in test_set]
    return results
Pre-built Tools:
- Active-Learner (GitHub): Python library for active learning, adaptable to prompting
- Label Studio: Annotation platform with active learning support
- Prodigy: Commercial annotation tool with active learning built-in
- Modal Labs / AWS SageMaker Ground Truth: Cloud platforms with active learning pipelines
Related Techniques and Combinations
Closely Related Techniques:
Active Learning (Classical ML):
- Connection: Active Prompting applies active learning principles to prompt engineering
- Difference: Active learning trains models, Active Prompting selects examples for context
- Transfer: Uncertainty sampling, query-by-committee, diversity sampling all transfer
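One of the transferred ideas, query-by-committee, is compact enough to sketch: score an input by how much a "committee" (repeated samples at nonzero temperature, or different prompts/models) disagrees on the answer. The vote-entropy formula below is the classical one from active learning; its use here as the selection score is the transfer:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Shannon entropy of the committee's answer distribution.
    Higher entropy = more disagreement = a more informative example."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Examples where `vote_entropy` is highest are the ones Active Prompting would send for annotation.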
Few-Shot Prompting:
- Connection: Active Prompting is optimized few-shot prompting
- Difference: Few-shot uses random/manual examples, Active uses uncertainty-selected
- Improvement: 5-15% accuracy gain over random few-shot
Chain-of-Thought Prompting:
- Connection: Active Prompting typically uses CoT in annotations
- Difference: CoT is about reasoning format, Active is about example selection
- Synergy: Combining both yields best results (Active-Prompt with CoT)
Self-Consistency:
- Connection: Both use multiple samples, Active for selection, Self-Consistency for inference
- Difference: Active uses samples to measure uncertainty, Self-Consistency for voting
- Combination: Use both - Active for example selection, Self-Consistency for final answer
Comparison Table:
| Technique | Example Selection | Annotation Needed | Best For | Typical Improvement |
| --- | --- | --- | --- | --- |
| Zero-Shot | None | None | Simple tasks, quick deployment | Baseline |
| Random Few-Shot | Random | Yes (n examples) | General tasks | +10-20% vs zero-shot |
| Active Prompting | Uncertainty-based | Yes (n examples) | Maximize ROI on annotation | +5-15% vs random few-shot |
| Manual Curation | Expert judgment | Yes (n examples) | Domain-critical tasks | +5-20% vs random (expert-dependent) |
| Auto-CoT | Diversity-based | No (auto-generated) | Fast deployment, reasoning tasks | +5-10% vs zero-shot |
| Fine-tuning | All data used | Yes (hundreds-thousands) | Production systems, high volume | +20-40% vs few-shot |
When to Choose What:
| Scenario | Recommended Technique |
| --- | --- |
| No examples, simple task | Zero-Shot |
| No examples, complex reasoning | Zero-Shot CoT or Reasoning Model (O1) |
| Have examples, cheap annotation | Random Few-Shot |
| Have examples, expensive annotation | Active Prompting |
| Need maximum accuracy, have budget | Active Prompting + Self-Consistency |
| Thousands of examples available | Fine-tuning |
| Knowledge-intensive task | RAG + Active-selected examples |
| Production at scale | Fine-tuning or RAG |
Hybrid Solutions:
Active RAG (Retrieval-Augmented Generation):
def active_rag_hybrid(query, document_pool, k_retrieve=20, n_examples=5):
    """Combine retrieval with active selection"""
    # Step 1: Retrieve relevant documents
    retrieved = semantic_retrieval(query, document_pool, k=k_retrieve)
    # Step 2: Calculate uncertainty on the retrieved set
    uncertainties = []
    for doc in retrieved:
        responses = generate_with_doc(query, doc, samples=5)
        uncertainty = calculate_disagreement(responses)
        uncertainties.append({'doc': doc, 'uncertainty': uncertainty})
    # Step 3: Select the most uncertain (informative) documents
    selected = sorted(uncertainties, key=lambda x: x['uncertainty'], reverse=True)[:n_examples]
    # Step 4: Generate with selected documents as context
    context = "\n\n".join([s['doc'] for s in selected])
    return generate_with_context(query, context)
Active + Self-Consistency:
from collections import Counter

def active_self_consistency(pool, test_q, n_examples=8, n_samples=5):
    """Active example selection + ensemble inference"""
    # Step 1: Active selection
    uncertain = select_uncertain(pool, n=n_examples, k=5)
    annotated = annotate(uncertain)
    # Step 2: Create few-shot prompt
    prompt = create_few_shot_prompt(annotated, test_q)
    # Step 3: Self-consistency ensemble
    responses = []
    for _ in range(n_samples):
        response = model(prompt, temperature=0.7)
        responses.append(extract_answer(response))
    # Step 4: Majority vote
    final_answer = Counter(responses).most_common(1)[0][0]
    return final_answer
Integration Patterns
Task Adaptation Patterns:
Classification:
def active_for_classification(pool, classes, n_per_class=2):
    """Active selection ensuring class balance"""
    selected = []
    for cls in classes:
        # Get examples from this class
        class_pool = [ex for ex in pool if ex['class'] == cls]
        # Select uncertain examples within the class
        class_uncertain = select_uncertain(class_pool, n=n_per_class)
        selected.extend(class_uncertain)
    return selected
Generation:
def active_for_generation(pool, n=6):
    """Active selection for text generation"""
    # Uncertainty metric: semantic diversity of generated outputs
    uncertainties = []
    for example in pool:
        responses = [generate(example) for _ in range(5)]
        # Use semantic similarity variance as uncertainty
        embeddings = [embed(r) for r in responses]
        diversity = calculate_diversity(embeddings)
        uncertainties.append({'example': example, 'uncertainty': diversity})
    return top_uncertain(uncertainties, n)
Integration with Agents:
from collections import Counter

class ActivePromptAgent:
    """Agent that improves via active learning"""

    def __init__(self, model, initial_examples, uncertainty_threshold=0.5):
        self.model = model
        self.examples = initial_examples
        self.uncertainty_threshold = uncertainty_threshold  # assumed default
        self.uncertainty_buffer = []

    def execute(self, task):
        """Execute task, tracking uncertainty"""
        prompt = self.create_prompt(task)
        # Generate with uncertainty tracking
        responses = [self.model(prompt, temp=0.7) for _ in range(5)]
        uncertainty = calculate_disagreement(responses)
        # If high uncertainty, add to buffer for annotation
        if uncertainty > self.uncertainty_threshold:
            self.uncertainty_buffer.append({
                'task': task,
                'responses': responses,
                'uncertainty': uncertainty
            })
        # Return the most common response
        return Counter(responses).most_common(1)[0][0]

    def improve(self, n_to_annotate=3):
        """Periodically improve with active learning"""
        if len(self.uncertainty_buffer) < n_to_annotate:
            return
        # Select the most uncertain tasks from the buffer
        top_uncertain = sorted(self.uncertainty_buffer,
                               key=lambda x: x['uncertainty'],
                               reverse=True)[:n_to_annotate]
        # Request annotations
        new_examples = [annotate(ex['task']) for ex in top_uncertain]
        # Add to example set
        self.examples.extend(new_examples)
        # Clear buffer
        self.uncertainty_buffer = []
Transition Strategies:
From Random Few-Shot to Active Prompting:
- Baseline: Measure current random few-shot performance
- Small pilot: Select 3-4 uncertain examples, annotate, compare
- If pilot successful (>3% improvement): Scale to full Active implementation
- If pilot unsuccessful: Investigate why - poor uncertainty metric? Task doesn't vary in difficulty?
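The pilot step can be sketched as a side-by-side comparison: select the same budget randomly and by uncertainty, then evaluate both. The `uncertainty` and `evaluate` callables are placeholders for your own scoring and eval harness; the toy call below just uses integers so the sketch runs standalone:

```python
import random

def pilot_comparison(pool, uncertainty, evaluate, n=4, seed=0):
    """Return (random_score, active_score) for an n-example pilot."""
    rng = random.Random(seed)
    random_pick = rng.sample(pool, n)
    active_pick = sorted(pool, key=uncertainty, reverse=True)[:n]
    return evaluate(random_pick), evaluate(active_pick)

# Toy illustration: "uncertainty" is the item's value, the "score" is the sum,
# so active selection always picks the four largest items
rand_score, active_score = pilot_comparison(
    list(range(10)), uncertainty=lambda x: x, evaluate=sum
)
```

If `active_score` does not beat `rand_score` by your pilot threshold (the >3% rule above), investigate the uncertainty metric before scaling up.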
From Active Prompting to Fine-tuning:
- Collect data: Use Active Prompting to identify and annotate hard examples
- Combine: Add actively-selected examples to any existing training data
- Fine-tune: Use combined dataset for fine-tuning
- Compare: Measure if fine-tuning outperforms Active Prompting enough to justify cost
- Transition: If fine-tuning clearly superior (>10% improvement), deploy fine-tuned model
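The "Combine" step amounts to serializing the actively annotated examples into supervised fine-tuning records. A minimal sketch using the chat-style JSONL many providers accept; the exact record schema varies by provider, so this shape is an assumption:

```python
import json

def to_finetune_jsonl(annotated):
    """Serialize annotated examples (question, reasoning, answer) into
    chat-format JSONL lines for supervised fine-tuning."""
    lines = []
    for ex in annotated:
        record = {
            "messages": [
                {"role": "user", "content": ex["question"]},
                {"role": "assistant",
                 "content": f"{ex['reasoning']}\nAnswer: {ex['answer']}"},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_finetune_jsonl([
    {"question": "2+2?", "reasoning": "Add 2 and 2.", "answer": "4"}
])
```

Keeping the reasoning in the assistant turn preserves the chain-of-thought signal the annotators paid for.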
Larger System Integration:
import random
from datetime import datetime

class ProductionActiveSystem:
    """Production system with active learning loop"""

    def __init__(self, model, initial_examples):
        self.model = model
        self.examples = initial_examples
        self.version = 1
        self.uncertainty_log = []

    def predict(self, input_data):
        """Production inference with uncertainty logging"""
        prompt = self.create_prompt(self.examples, input_data)
        # Generate response
        response = self.model(prompt, temperature=0.0)
        # Track uncertainty for later improvement
        if random.random() < 0.1:  # Sample 10% for uncertainty estimation
            uncertainty = self.estimate_uncertainty(input_data)
            self.uncertainty_log.append({
                'input': input_data,
                'uncertainty': uncertainty,
                'timestamp': datetime.now()
            })
        return response

    def periodic_improvement(self, annotation_budget=5):
        """Periodic active learning update"""
        # Select the most uncertain inputs from recent logs
        top_uncertain = sorted(self.uncertainty_log,
                               key=lambda x: x['uncertainty'],
                               reverse=True)[:annotation_budget]
        # Annotate
        new_examples = [annotate(ex['input']) for ex in top_uncertain]
        # Evaluate improvement
        new_version_examples = self.examples + new_examples
        improvement = self.evaluate_improvement(self.examples, new_version_examples)
        if improvement > 0.02:  # 2% improvement threshold
            # Deploy new version
            self.examples = new_version_examples
            self.version += 1
            self.save_version()
            print(f"Deployed v{self.version} with {len(new_examples)} new examples")
        # Clear log
        self.uncertainty_log = []

    def rollback(self):
        """Roll back to the previous version if issues arise"""
        self.version -= 1
        self.examples = self.load_version(self.version)
        print(f"Rolled back to v{self.version}")
Monitoring and Versioning:
from datetime import datetime

import numpy as np

class ActivePromptMonitor:
    """Monitor Active Prompting system performance"""

    def __init__(self):
        self.metrics = {
            'accuracy': [],
            'uncertainty_distribution': [],
            'example_versions': [],
            'annotation_costs': []
        }

    def log_performance(self, examples, test_set, version):
        """Log performance metrics"""
        accuracy = evaluate(examples, test_set)
        self.metrics['accuracy'].append({
            'version': version,
            'accuracy': accuracy,
            'n_examples': len(examples),
            'timestamp': datetime.now()
        })

    def detect_degradation(self, window=5):
        """Detect performance degradation"""
        recent = self.metrics['accuracy'][-window:]
        if len(recent) < window:
            return False
        # Check for a declining trend
        accuracies = [m['accuracy'] for m in recent]
        trend = np.polyfit(range(len(accuracies)), accuracies, 1)[0]
        if trend < -0.01:  # Declining >1% over the window
            alert("Performance degradation detected")
            return True
        return False
Future Directions
Emerging Innovations (2024-2025 Research)
Recent Advances:
Research from 2025 highlights several critical developments in prompt engineering and active learning:
- Over-prompting Phenomenon: Excessive examples in prompts can paradoxically degrade performance in certain LLMs, suggesting optimal annotation budgets vary by model and task
- Hybrid Selection Methods: The HED-LM (Hybrid Euclidean Distance with Large Language Models) method filters candidate examples based on Euclidean distance and re-ranks using LLM-scored contextual relevance
- TF-IDF Superiority: Recent benchmarks show TF-IDF outperforms random sampling and semantic embedding for filtering relevant few-shot examples
- Apple's APE Framework: Apple Machine Learning Research introduced APE (Active Prompt Engineering) for identifying informative few-shot examples in production systems
- Uncertainty-based Sampling Prompting (USP): Google Research developed USP using model predictions as zero-shot proxies, estimating confidence via self-consistency without requiring multiple model calls
2025 Training Regime Comparison:
Comprehensive studies comparing zero-shot, few-shot, fine-tuning, and instruction-tuning found that the largest, most powerful models generally offer the best predictive performance even with minimal training examples, though fine-tuning smaller models remains competitive because it combines high accuracy with lower cost.
Automated Active Prompting: Systems that automatically identify uncertain cases in production, request annotations, and update prompts without manual intervention:
import asyncio

class AutoActivePrompting:
    """Fully automated active learning for prompts"""

    def __init__(self, model, annotation_service, min_batch=10):
        self.model = model
        self.annotation_service = annotation_service  # API to annotation platform
        self.min_batch = min_batch  # annotate once this many uncertain cases accumulate
        self.examples = []

    async def continuous_improvement(self):
        """Continuous active learning loop"""
        while True:
            # Collect uncertain cases from production traffic
            uncertain_cases = await self.collect_uncertain_from_production(hours=24)
            if len(uncertain_cases) > self.min_batch:
                # Request annotations via API
                annotations = await self.annotation_service.annotate(uncertain_cases)
                # Validate quality
                validated = self.quality_check(annotations)
                # A/B test new examples
                improvement = await self.ab_test_examples(validated)
                if improvement > 0.02:
                    # Deploy automatically
                    self.examples.extend(validated)
                    self.deploy_new_version()
            await asyncio.sleep(86400)  # Daily updates
Transfer Active Prompting: Using uncertainty patterns learned from one task to bootstrap example selection on related tasks:
def transfer_active_selection(source_task_patterns, target_pool):
    """Transfer uncertainty patterns across tasks"""
    # Learn what made examples uncertain in the source task
    uncertainty_features = learn_uncertainty_patterns(source_task_patterns)
    # Predict which target examples will be uncertain
    predicted_uncertainties = []
    for example in target_pool:
        features = extract_features(example)
        predicted_uncertainty = uncertainty_features.predict(features)
        predicted_uncertainties.append({
            'example': example,
            'predicted_uncertainty': predicted_uncertainty
        })
    # Select based on predicted uncertainty (cheaper than actual estimation)
    return top_uncertain(predicted_uncertainties, n=8)
Multi-Modal Active Prompting: Extending to images, audio, video:
def multimodal_active_prompting(image_pool, n=8, k=5):
    """Active selection for vision-language models"""
    uncertainties = []
    for image in image_pool:
        # Generate k descriptions/answers
        responses = [vision_model.describe(image) for _ in range(k)]
        # Calculate semantic diversity
        embeddings = [embed(r) for r in responses]
        uncertainty = semantic_variance(embeddings)
        uncertainties.append({'image': image, 'uncertainty': uncertainty})
    # Select the most uncertain images for annotation
    return top_uncertain(uncertainties, n)
Federated Active Prompting: Multiple organizations collaboratively select valuable examples while maintaining privacy:
def federated_active_selection(local_pools, n_global=8):
    """Select examples across organizations without sharing data"""
    # Each organization calculates local uncertainties
    local_uncertainties = []
    for org_pool in local_pools:
        org_uncertain = select_uncertain(org_pool, n=n_global)
        # Share only uncertainty scores and example IDs, not data
        local_uncertainties.append([
            {'id': ex['id'], 'uncertainty': ex['uncertainty']}
            for ex in org_uncertain
        ])
    # Aggregate to find the globally most uncertain examples
    global_ranking = aggregate_uncertainties(local_uncertainties)
    # Each org annotates its high-ranking examples
    # Annotations are shared (or kept private with federated learning)
    return global_ranking[:n_global]
Research Frontiers
Open Questions:
1. Optimal Uncertainty Metrics: What uncertainty measures work best for different task types? Can we learn task-specific uncertainty metrics?
2. Theoretical Guarantees: Can we prove sample-complexity bounds for Active Prompting? How many examples are needed to reach a target accuracy?
3. Annotation Quality vs. Quantity: What is the trade-off between highly detailed annotations (expensive) and simpler annotations (cheaper)? What is the optimal allocation of the annotation budget?
4. Multi-Round Dynamics: How many rounds are optimal? Do benefits plateau or continue? How many examples per round?
5. Cross-Model Transfer: Do examples selected for GPT-4 work well for Claude or Llama? Are there model-agnostic selection strategies?
6. Prompt Compression: Can we compress annotated examples without losing effectiveness? Can 8 examples be distilled into 4 richer ones?
7. Real-Time Active Learning: Can Active Prompting work in real-time production with streaming data?
Promising Directions:
Learned Uncertainty Metrics:
class LearnedUncertaintyMetric:
    """Learn what makes examples informative"""

    def __init__(self):
        self.model = train_uncertainty_predictor()

    def predict_informativeness(self, example, current_examples):
        """Predict how much an example would improve the prompt"""
        features = extract_features(example, current_examples)
        return self.model.predict(features)

    def train_from_history(self, selection_history):
        """Learn from past selection successes"""
        # Features: example characteristics, current example set
        # Target: actual performance improvement from adding the example
        X, y = prepare_training_data(selection_history)
        self.model.fit(X, y)
Active Prompting for Alignment: Using human feedback on uncertain cases to align model behavior:
import numpy as np

def active_alignment(pool, human_values):
    """Select examples for human feedback to improve alignment"""
    # Find cases where model behavior is uncertain
    value_uncertainties = []
    for example in pool:
        responses = [model.generate(example) for _ in range(5)]
        # Measure alignment uncertainty
        alignment_scores = [score_alignment(r, human_values) for r in responses]
        alignment_variance = np.var(alignment_scores)
        value_uncertainties.append({
            'example': example,
            'alignment_uncertainty': alignment_variance
        })
    # Get human feedback on the most uncertain cases
    selected = top_uncertain(value_uncertainties, n=10)
    human_preferences = [get_human_preference(ex) for ex in selected]
    # Use as examples to guide model behavior
    return create_alignment_prompt(human_preferences)
Adaptive Budget Allocation: Automatically deciding when to annotate more examples:
def adaptive_active_prompting(pool, validation_set, initial_budget=8, max_budget=20):
    """Automatically decide the annotation budget"""
    examples = []
    budget_spent = 0
    while budget_spent < max_budget:
        # Select and annotate a batch
        batch = select_uncertain(pool, n=min(4, max_budget - budget_spent))
        annotated_batch = annotate(batch)
        examples.extend(annotated_batch)
        budget_spent += len(batch)
        # Evaluate
        accuracy = evaluate(examples, validation_set)
        # Estimate the marginal value of the next batch
        if budget_spent >= initial_budget:  # Need a baseline first
            marginal_value = estimate_marginal_improvement(
                examples,
                validation_set,
                next_batch_size=4
            )
            # Stop if the marginal value falls below threshold
            if marginal_value < 0.01:  # <1% expected improvement
                print(f"Stopping at {budget_spent} examples (marginal value: {marginal_value:.2%})")
                break
        print(f"Budget spent: {budget_spent}, Accuracy: {accuracy:.2%}")
    return examples