K-Nearest Neighbor (KNN) Prompting: A Complete Guide
K-Nearest Neighbor (KNN) Prompting is a retrieval-based technique that improves few-shot learning by selecting the most semantically similar examples from a candidate pool to serve as in-context demonstrations. Rather than randomly picking examples or relying on manual curation, KNN Prompting encodes both the candidate examples and the test input into a shared embedding space, then retrieves the k nearest neighbors as exemplars for the prompt.
The core insight is that example relevance matters far more than example quantity. A few well-chosen demonstrations that closely match the test input's structure, domain, and reasoning patterns teach the model more effectively than many randomly selected ones. By leveraging embedding similarity to automate this selection, KNN Prompting consistently outperforms random few-shot baselines across a wide range of NLP tasks.
KNN Prompting belongs to the example-based and retrieval-augmented prompting categories. It is a few-shot prompting optimization technique that addresses a well-documented problem: in-context learning performance is highly sensitive to which examples appear in the prompt, with even small changes causing large variance (Liu et al., 2022; Lu et al., 2022). There are two major lines of research under this umbrella:
- KNN-based exemplar selection (KATE) — introduced by Liu et al. (2022) in "What Makes Good In-Context Examples for GPT-3?", which uses sentence embeddings to retrieve the most similar training examples as in-context demonstrations, showing performance nearly comparable to fine-tuning when applied to GPT-3.
- KNN Prompting for calibration-free inference — introduced by Xu et al. (2023) in "kNN Prompting: Beyond-Context Learning with Calibration-Free Nearest Neighbor Inference" (ICLR 2023), which goes further by using LLM output distributions as representations and performing nearest neighbor classification directly. It achieves average absolute gains of +3.56 (4-shot) and +7.07 (8-shot) over state-of-the-art calibration methods across 10 classification tasks, with cross-task standard deviation dropping from 9.14 (standard ICL) to 3.83.
Both approaches share the fundamental principle of using similarity-based retrieval to improve in-context learning, but they operate at different levels: KATE selects which examples go into the prompt, while kNN Prompting uses the output distributions themselves for nearest neighbor inference.
How It Works
Theoretical Foundation
KNN Prompting is grounded in two foundational ideas:
1. Retrieval-Augmented Learning: The kNN Language Model (kNN-LM) by Khandelwal et al. (2020) demonstrated that augmenting a pretrained language model with a nearest neighbor lookup over a datastore of cached representations can substantially improve performance without additional training. kNN-LM achieved a state-of-the-art perplexity of 15.79 on Wikitext-103, a 2.9-point improvement over the base model. They also showed that retrieving nearest neighbors from a corpus can outperform training on it — adding kNN retrieval over a 3B-token datastore to a model trained on 100M tokens improved perplexity from 19.59 to 13.73.
2. Example Sensitivity in ICL: Research by Liu et al. (2022) and Lu et al. (2022) established that in-context learning is extremely sensitive to which demonstrations are selected and how they are ordered. Random selection leads to high variance and suboptimal performance. This motivated using structured retrieval rather than arbitrary example choice.
Core Innovation: The key insight of KNN-based exemplar selection is that semantic similarity in embedding space is a reliable proxy for example relevance in in-context learning. Examples that are "close" to the test input in embedding space share structural and semantic properties that make them effective demonstrations. For kNN Prompting (Xu et al., 2023), the innovation extends further: rather than using embeddings to select examples for the prompt, it uses the full language model output probability distribution as a representation, performing calibration-free nearest neighbor classification without directly mapping LLM outputs to task labels.
Key Assumptions and Where They Fail:
- Embedding quality reflects task relevance: Assumes the embedding model captures the similarity dimensions relevant to the task. Fails when task-relevant similarity differs from general semantic similarity (e.g., two sentences about different topics but requiring the same reasoning pattern).
- Similar inputs benefit from similar demonstrations: Assumes that if test input X is similar to training example Y, then Y is a good demonstration for X. Fails for tasks where surface similarity is misleading (e.g., similar-looking math problems requiring different approaches).
- Embedding space is well-structured: Assumes nearest neighbors in embedding space are meaningfully similar. Fails with poor embedding models or highly specialized domains where general embeddings lack discriminative power.
Fundamental Trade-offs:
| Trade-off | Description |
| --- | --- |
| Retrieval quality vs speed | Better embeddings improve selection but increase compute cost |
| Specificity vs diversity | Very similar examples may lack diversity; diverse examples may be less relevant |
| Token cost vs example count | More retrieved examples improve coverage but consume context window |
| Infrastructure complexity vs performance | Embedding stores add system complexity for selection improvements |
Execution Mechanism
KNN Prompting operates differently depending on the variant, but both follow a two-phase structure:
Variant 1: KNN-Based Exemplar Selection (KATE-style)
Phase 1 — Preprocessing (offline):
- Collect a pool of candidate examples with their labels/completions
- Encode all candidates using a sentence embedding model (e.g., RoBERTa, Sentence-BERT, OpenAI embeddings)
- Store embeddings in an indexed datastore for efficient retrieval
Phase 2 — Inference (per query):
- Encode the test input using the same embedding model
- Compute distance (cosine similarity, L2, or dot product) between test embedding and all candidate embeddings
- Retrieve the k nearest candidates as in-context examples
- Construct a few-shot prompt with retrieved examples and the test input
- Query the LLM with the constructed prompt
- Return the LLM's response
This approach is single-pass from the LLM's perspective — the retrieval step happens before the LLM call.
Variant 2: KNN Prompting for Calibration-Free Inference (Xu et al., 2023)
Phase 1 — Meta-Test Stage (building the datastore):
- Select a small set of anchor examples (in-context demonstrations)
- For each training example, construct a prompt using the anchor examples plus the training example as the test input
- Query the LLM and cache the complete output probability distribution as a key, paired with the training example's true label as the value
- Build a datastore of (distribution, label) pairs
Phase 2 — Formal Test Stage (inference):
- Construct the same prompt structure with anchor examples plus the test input
- Query the LLM to get the output probability distribution
- Compute KL divergence between the test distribution and all cached training distributions
- Find the k nearest neighbors by smallest KL divergence
- Aggregate the labels of the k nearest neighbors (majority vote)
- Return the predicted label
This approach requires multiple LLM calls during datastore construction but enables calibration-free inference that scales beyond context window limitations.
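The two stages above can be sketched in a few lines of Python. Here `get_output_distribution` is a hypothetical callable standing in for the LLM query (anchor prompt + input → output probability distribution over label words), and the datastore is a plain list for clarity — this is a sketch of the mechanism, not the authors' implementation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def build_datastore(train_examples, get_output_distribution):
    """Meta-test stage: cache (output distribution, true label) pairs."""
    return [(get_output_distribution(ex['input']), ex['label'])
            for ex in train_examples]

def knn_predict(test_input, datastore, get_output_distribution, k=3):
    """Formal test stage: KL-nearest neighbors + majority vote."""
    test_dist = get_output_distribution(test_input)
    # Sort cached entries by KL divergence from the test distribution
    scored = sorted(datastore, key=lambda pair: kl_divergence(test_dist, pair[0]))
    top_labels = [label for _, label in scored[:k]]
    return max(set(top_labels), key=top_labels.count)
```

In practice `get_output_distribution` would wrap an LLM call that returns next-token probabilities for the prompt built from the fixed anchor examples plus the given input.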
Why This Works
1. Semantic Relevance Alignment (35% of effectiveness): By selecting examples semantically close to the test input, KNN Prompting ensures the demonstrations share relevant vocabulary, structure, and domain characteristics. The LLM receives demonstrations that closely mirror the problem it needs to solve, reducing the cognitive leap from examples to test case.
2. Calibration-Free Distribution Matching (25%): For the Xu et al. variant, using the full output distribution rather than just label probabilities captures richer information about how the LLM "perceives" each input. Two inputs that produce similar output distributions likely require similar processing, regardless of what the top-1 predicted token is. This sidesteps the calibration problem entirely — biases in the output distribution affect all examples similarly, so nearest neighbor matching effectively cancels them out.
3. Bias Reduction Through Retrieval (20%): Random example selection introduces bias — the model might get examples that happen to favor certain answer patterns. KNN retrieval produces consistent, input-dependent example sets that reduce this variance. The standard deviation of kNN Prompting across tasks (3.83) is less than half that of standard ICL (9.14), demonstrating substantially more stable performance.
4. Beyond-Context Scaling (20%): The datastore-based variant can leverage thousands of training examples for nearest neighbor lookup without fitting them into the context window. The scaling trend holds from 2 shots up to 1024 shots (ten successive doublings) and across model sizes from 0.8B to 30B parameters.
Causal Chain:
Semantic encoding of examples → distance computation in embedding space → selection of most relevant demonstrations → LLM receives contextually appropriate examples → reduced ambiguity in task interpretation → improved output quality
Positive Feedback Loop:
Better example selection → more consistent outputs → more reliable performance metrics → better ability to tune k and embedding model → further improved selection
Negative Feedback Loop:
Poor embedding model → retrieves superficially similar but semantically irrelevant examples → performance degrades below random selection → misleading signal that KNN approach doesn't work
Structure and Components
Essential Components
Required:
- Candidate example pool: Set of labeled examples to select from (minimum 50-100 for meaningful retrieval, 500+ recommended)
- Embedding model: Sentence encoder to convert text into vector representations (Sentence-BERT, OpenAI embeddings, RoBERTa, etc.)
- Distance metric: Method to compute similarity between embeddings (cosine similarity, L2 distance, dot product)
- k parameter: Number of nearest neighbors to retrieve (typically 3-8)
- Few-shot prompt template: Structure for incorporating retrieved examples with the test input
Required for Xu et al. variant (additionally):
- Anchor examples: Small fixed set of in-context demonstrations used when querying training data
- Distribution datastore: Cache of LLM output probability distributions for training examples
- KL divergence computation: Method to compare probability distributions
Optional:
- Vector index (FAISS, Annoy, HNSW): For efficient approximate nearest neighbor search over large datastores
- Fine-tuned embedding model: Encoder fine-tuned on task-related data (e.g., RoBERTa fine-tuned on NLI/STS-B)
- Diversity filtering: Mechanism to ensure retrieved examples aren't redundant
- Example ordering strategy: Method to arrange retrieved examples in the prompt
- Reranking model: Secondary model to rerank retrieved candidates based on task-specific criteria
Design Principles
Core Cognitive Principles:
- Similarity-driven learning: Humans learn better from examples that closely match the target scenario, and LLMs exhibit the same property in-context
- Pattern recognition: LLMs excel at recognizing patterns from demonstrations — similar examples create stronger, more coherent patterns
- Implicit task specification: The retrieved examples implicitly communicate task requirements, format, and reasoning style more effectively than abstract instructions
- Distributional reasoning: For the Xu et al. variant, the full output distribution captures latent representations of how the model processes an input, enabling matching at a deeper level than surface text similarity
Linguistic Patterns:
KNN Prompting uses standard few-shot format, with the distinguishing feature being automated, similarity-driven example selection:
```
[Retrieved Example 1 - most similar to test input]
Input: {retrieved_input_1}
Output: {retrieved_output_1}

[Retrieved Example 2 - second most similar]
Input: {retrieved_input_2}
Output: {retrieved_output_2}

...

[Test Input]
Input: {test_input}
Output:
```
Design Principles:
- Maximize relevance: Every example slot should be filled with the most relevant available demonstration
- Maintain diversity within relevance: If top-k neighbors are too similar to each other, they provide redundant information — consider diversity-aware selection
- Consistent formatting: Retrieved examples must follow the same format regardless of their source
- Embedding model alignment: The embedding model should capture the dimensions of similarity that matter for the task
Structural Patterns
Minimal Pattern (Basic KNN Selection):
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Assumes: candidates is a list of {'input': ..., 'output': ...} dicts,
# test_input is a string
k = 5

# Encode candidates (normalized so dot product = cosine similarity)
model = SentenceTransformer('all-MiniLM-L6-v2')
candidate_texts = [ex['input'] for ex in candidates]
candidate_embeddings = model.encode(candidate_texts, normalize_embeddings=True)

# Encode test input and find nearest
test_embedding = model.encode([test_input], normalize_embeddings=True)
similarities = np.dot(candidate_embeddings, test_embedding.T).flatten()
top_k_indices = np.argsort(similarities)[-k:][::-1]

# Build prompt with retrieved examples
prompt = ""
for idx in top_k_indices:
    prompt += f"Input: {candidates[idx]['input']}\nOutput: {candidates[idx]['output']}\n\n"
prompt += f"Input: {test_input}\nOutput:"
```
Standard Pattern (KNN with Index and Reranking):
```python
import faiss
from sentence_transformers import SentenceTransformer
import numpy as np

class KNNPrompting:
    def __init__(self, embedding_model='all-MiniLM-L6-v2', k=5):
        self.encoder = SentenceTransformer(embedding_model)
        self.k = k
        self.index = None
        self.candidates = []

    def build_index(self, candidates):
        """Build FAISS index from candidate examples"""
        self.candidates = candidates
        texts = [ex['input'] for ex in candidates]
        embeddings = self.encoder.encode(texts, normalize_embeddings=True)
        # Inner product on normalized vectors = cosine similarity
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(embeddings.astype('float32'))

    def retrieve(self, test_input):
        """Retrieve k nearest examples"""
        test_embedding = self.encoder.encode(
            [test_input], normalize_embeddings=True
        ).astype('float32')
        distances, indices = self.index.search(test_embedding, self.k)
        retrieved = []
        for idx, dist in zip(indices[0], distances[0]):
            retrieved.append({
                **self.candidates[idx],
                'similarity': float(dist)
            })
        return retrieved

    def build_prompt(self, test_input, task_instruction=""):
        """Build few-shot prompt with retrieved examples"""
        retrieved = self.retrieve(test_input)
        prompt = task_instruction + "\n\n" if task_instruction else ""
        for ex in retrieved:
            prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
        prompt += f"Input: {test_input}\nOutput:"
        return prompt
```
Advanced Pattern (KNN Prompting with Diversity and Caching):
```python
import faiss
from sentence_transformers import SentenceTransformer
import numpy as np

class AdvancedKNNPrompting:
    def __init__(self, embedding_model='all-MiniLM-L6-v2', k=5,
                 diversity_weight=0.3):
        self.encoder = SentenceTransformer(embedding_model)
        self.k = k
        self.diversity_weight = diversity_weight
        self.index = None
        self.candidates = []
        self.embeddings = None
        self.cache = {}

    def build_index(self, candidates):
        """Build FAISS index, keeping embeddings for diversity scoring"""
        self.candidates = candidates
        texts = [ex['input'] for ex in candidates]
        self.embeddings = self.encoder.encode(texts, normalize_embeddings=True)
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(self.embeddings.astype('float32'))

    def retrieve_diverse(self, test_input):
        """Retrieve k examples balancing similarity and diversity (MMR-style greedy)"""
        cache_key = hash(test_input)
        if cache_key in self.cache:
            return self.cache[cache_key]
        test_emb = self.encoder.encode(
            [test_input], normalize_embeddings=True
        ).astype('float32')
        # Retrieve more than k candidates, then greedily rerank
        n_candidates = min(self.k * 4, len(self.candidates))
        distances, indices = self.index.search(test_emb, n_candidates)
        pool = list(zip(indices[0], distances[0]))
        selected = []
        selected_embeddings = []
        while pool and len(selected) < self.k:
            best_pos, best_score, best_diversity = None, -np.inf, 0.0
            for pos, (idx, sim) in enumerate(pool):
                candidate_emb = self.embeddings[idx]
                # Diversity penalty: similarity to already-selected examples
                if selected_embeddings:
                    max_sim_to_selected = max(
                        float(np.dot(candidate_emb, sel_emb))
                        for sel_emb in selected_embeddings
                    )
                    diversity_score = 1 - max_sim_to_selected
                else:
                    diversity_score = 1.0
                combined_score = (
                    (1 - self.diversity_weight) * float(sim) +
                    self.diversity_weight * diversity_score
                )
                if combined_score > best_score:
                    best_pos, best_score, best_diversity = pos, combined_score, diversity_score
            # Take the candidate with the best combined score this round
            idx, sim = pool.pop(best_pos)
            selected.append({
                **self.candidates[idx],
                'similarity': float(sim),
                'diversity': best_diversity,
                'combined': best_score
            })
            selected_embeddings.append(self.embeddings[idx])
        self.cache[cache_key] = selected
        return selected

    def build_prompt(self, test_input, task_instruction="",
                     max_tokens=3000):
        """Build token-aware prompt with retrieved examples"""
        retrieved = self.retrieve_diverse(test_input)
        prompt = task_instruction + "\n\n" if task_instruction else ""
        token_estimate = len(prompt.split()) * 1.3  # rough token estimate
        examples_added = 0
        for ex in retrieved:
            example_text = f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
            example_tokens = len(example_text.split()) * 1.3
            if token_estimate + example_tokens > max_tokens:
                break
            prompt += example_text
            token_estimate += example_tokens
            examples_added += 1
        prompt += f"Input: {test_input}\nOutput:"
        return prompt, examples_added
```
Prompting Patterns Used:
- Few-shot pattern: Retrieved examples serve as in-context demonstrations
- Structured output: Format demonstrated consistently across all retrieved examples
- Order matters: Examples typically ordered by decreasing similarity (most similar first or last, depending on the model)
Reasoning Patterns:
- Forward reasoning: Retrieved examples demonstrate the input→output mapping the model should follow
- Pattern recognition: Similar examples help the model recognize the underlying pattern
- Analogical reasoning: The model draws analogies between retrieved examples and the test input
Modifications for Scenarios
For Ambiguous Tasks:
- Increase k to provide more diverse examples that cover different interpretations
- Add task instruction to disambiguate alongside the retrieved examples
- Use diversity-weighted retrieval to ensure multiple perspectives are represented
For Complex Reasoning:
- Retrieve examples that demonstrate similar reasoning chains, not just similar surface text
- Consider using reasoning-path embeddings rather than input-only embeddings
- Combine with Chain-of-Thought: retrieve examples with CoT annotations
For Format-Critical Tasks:
- Ensure all retrieved examples demonstrate the exact required format
- Filter candidates to only include correctly formatted examples before building the index
- Consider post-retrieval format validation
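Pre-index format filtering can be as simple as a regex check over candidate outputs. A minimal sketch — the sentiment label pattern below is a hypothetical example of a "required format":

```python
import re

def filter_well_formatted(candidates, output_pattern):
    """Keep only candidates whose output exactly matches the required format."""
    rx = re.compile(output_pattern)
    return [ex for ex in candidates if rx.fullmatch(ex['output'])]
```

Running this once before building the index guarantees that every retrievable example demonstrates the expected output format.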
For Domain-Specific Tasks:
- Use domain-specific or fine-tuned embedding models (e.g., PubMedBERT for medical, LegalBERT for legal)
- Build separate indices for each domain if multi-domain
- Augment retrieval with domain-specific metadata filtering
Applications and Task Selection
General Applications
KNN Prompting is broadly applicable to any task where labeled examples exist and example relevance varies by input.
Text Classification: Sentiment analysis, topic classification, intent detection, spam filtering. KNN retrieval selects examples from the same topical area or with similar linguistic patterns, giving the model the most relevant class demonstrations. Liu et al. (2022) showed that retrieval-based ICL with GPT-3 achieved performance nearly comparable to fine-tuning on multiple classification benchmarks.
Named Entity Recognition and Information Extraction: Retrieving examples with similar entity types, sentence structures, or domain terminology. Particularly effective when entity types vary across domains.
Question Answering: Selecting QA pairs where the question structure, topic, or reasoning type matches the test question. Multi-hop QA benefits from retrieving examples that demonstrate similar chain-of-reasoning patterns.
Text Generation and Summarization: Retrieving examples with similar input length, style, or content type to guide the model's generation. Effective for ensuring consistent tone and formatting.
Machine Translation: Selecting translation pairs with similar vocabulary, sentence structure, or domain terminology. Domain-specific translation benefits significantly from relevant example retrieval.
Code Generation: Retrieving code examples with similar function signatures, libraries used, or algorithmic patterns. Effective for API-specific tasks where the relevant API usage needs to be demonstrated.
Domain-Specific Applications
Clinical NLP: Retrieving similar patient case descriptions for clinical decision support. Domain-specific embeddings (BioSentVec, PubMedBERT) improve retrieval quality for medical text. Applications include diagnostic reasoning, ICD coding, and clinical note summarization.
Legal Analysis: Selecting precedent cases with similar legal issues, statutes, or fact patterns. Legal-domain embeddings capture jurisdictional and doctrinal similarity. Applications include case outcome prediction, contract analysis, and regulatory compliance.
Scientific Literature: Retrieving papers with similar methodology, findings, or domain focus for literature review assistance, claim verification, and experiment design suggestions.
Financial Analysis: Selecting similar financial reports, market conditions, or risk scenarios for analysis templates. Effective for earnings call analysis, risk assessment, and financial QA.
Customer Support: Retrieving similar past support tickets with their resolutions to generate contextually appropriate responses. Production systems at scale use this approach for automated ticket routing and suggested responses.
Selection Framework
Problem Characteristics (When to Use KNN Prompting):
- Few-shot prompting works but performance varies with example choice
- A pool of labeled examples exists (50+ minimum, 500+ recommended)
- Inputs vary in topic, structure, or domain such that different examples are relevant to different inputs
- Task benefits from contextually relevant demonstrations
- Need consistent, automated example selection (no manual curation per query)
- Performance requires improvement over random few-shot without fine-tuning
Scenarios Optimized For:
- High-variance input spaces where a single set of examples cannot serve all queries
- Classification tasks with many categories
- Domain-specific tasks where relevant terminology and patterns vary
- Production systems processing diverse queries at scale
- Tasks where embedding similarity correlates with example usefulness
Scenarios NOT Recommended For:
- Zero-shot performance already sufficient (no examples needed)
- Candidate pool too small (<50 examples) for meaningful retrieval
- Task where all examples are equally relevant regardless of input (e.g., simple formatting tasks)
- Inputs are homogeneous (every query similar, so any example works)
- Embedding similarity does not capture task-relevant dimensions
Selection Signals:
| Signal | Indicates KNN Prompting Suitable |
| --- | --- |
| High variance in random few-shot performance | Yes — example choice matters |
| Performance improves with manually curated examples | Yes — automated curation will help |
| Diverse input types/domains | Yes — different inputs need different examples |
| Large labeled candidate pool available | Yes — more retrieval options |
| Embedding similarity correlates with task similarity | Yes — retrieval will be meaningful |
Model Requirements:
- Minimum: Any model supporting few-shot learning (GPT-3.5, Claude 3 Haiku, Llama 7B+)
- Recommended: GPT-4, Claude 3.5 Sonnet, Llama 70B+ for best few-shot performance
- Optimal: Models with strong in-context learning capabilities and large context windows
- Not suitable: Models with very small context windows (<2K tokens) or poor few-shot learning ability
- For Xu et al. variant: Requires access to output probability distributions (autoregressive LMs with logit access)
Context/Resource Requirements:
- Embedding computation: One-time cost to embed all candidates; fast for modern embedding models (1000 examples in seconds)
- Storage: Embedding vectors (768-1536 dimensions × number of candidates × 4 bytes)
- Retrieval latency: ~1-10ms with FAISS index; negligible vs LLM inference time
- Context window: k examples × average example length + test input + response space
- Typical token usage: 4-8 examples × 100-300 tokens each = 400-2400 tokens for examples alone
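The budget arithmetic above can be sketched as a quick pre-flight check, using the same rough words × 1.3 heuristic as the patterns in this guide (a real tokenizer would be more accurate; the function names here are illustrative):

```python
def estimate_prompt_tokens(examples, test_input, tokens_per_word=1.3):
    """Rough token estimate for a few-shot prompt (word count x 1.3 heuristic)."""
    words = sum(len(f"Input: {ex['input']} Output: {ex['output']}".split())
                for ex in examples)
    words += len(f"Input: {test_input} Output:".split())
    return int(words * tokens_per_word)

def fits_budget(examples, test_input, max_tokens, response_reserve=500):
    """True if examples + test input + reserved response space fit the window."""
    return estimate_prompt_tokens(examples, test_input) + response_reserve <= max_tokens
```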
Cost Implications:
One-time costs:
- Embedding all candidates: ~$0.01-0.10 per 1000 examples (OpenAI embeddings) or free (open-source models)
- Building FAISS index: negligible compute cost
- Infrastructure: embedding model hosting if using open-source
Per-request production costs:
- Embedding the test input: ~$0.00001 per query (OpenAI) or free (self-hosted)
- Nearest neighbor search: negligible
- LLM inference: Same as standard few-shot prompting (determined by k and example length)
- Total overhead vs random few-shot: <$0.001 per request
Trade-offs:
- Minimal additional cost for meaningful performance improvement
- Infrastructure complexity is the main cost, not compute
- Open-source embedding models eliminate per-query embedding costs entirely
When to Use vs When NOT to Use:
Use when:
- Random few-shot accuracy 50-85% with high variance across example sets
- Have 100+ labeled candidate examples
- Input distribution is diverse (different topics, domains, structures)
- Can deploy an embedding model alongside the LLM
- Need automated, consistent example selection at scale
- Performance gains justify the infrastructure setup
Do NOT use when:
- Zero-shot accuracy >90% (examples unnecessary)
- Random few-shot accuracy >90% with low variance (example choice doesn't matter)
- Candidate pool <50 examples (insufficient for meaningful retrieval)
- All inputs near-identical (any examples equally relevant)
- Cannot host embedding model or embedding API
- Real-time latency requirements cannot accommodate embedding step (rare — embedding is fast)
Escalate to alternatives when:
- KNN-selected few-shot still <60% accuracy → consider fine-tuning
- Need to leverage thousands of examples → consider Xu et al. kNN Prompting variant or fine-tuning
- Embedding similarity does not capture task-relevant dimensions → consider supervised retriever (EPR, UDR)
- Need guaranteed format compliance → consider structured output APIs or fine-tuning
Variant Selection
KNN Exemplar Selection (KATE-style, Liu et al. 2022):
- Best for: General few-shot tasks, production systems, any LLM
- Characteristics: Simple, fast, works with any LLM API, no logit access needed
- Infrastructure: Embedding model + vector index
- Use when: Need practical, deployable example selection
kNN Prompting (Xu et al., 2023):
- Best for: Classification tasks, research settings, maximum accuracy
- Characteristics: Calibration-free, scales beyond context window, requires logit access
- Infrastructure: LLM with probability output + distribution datastore
- Use when: Have logit access, classification tasks, need to leverage large training sets
Vote-k (Su et al., 2023):
- Best for: Diverse exemplar selection from unlabeled pools
- Characteristics: Graph-based, emphasizes diversity over pure similarity
- Use when: Worried about redundancy in retrieved examples
EPR (Rubin et al., 2022):
- Best for: Maximum retrieval quality with labeled training data
- Characteristics: Supervised retriever, task-specific training, 30%+ improvement over random
- Use when: Can invest in training a task-specific retriever
UDR (Li et al., 2023):
- Best for: Multi-task settings, unified retrieval across tasks
- Characteristics: Multi-task list-wise ranking, generalizes across tasks
- Use when: Need a single retriever serving multiple tasks
Alternative Techniques:
| Technique | When to Choose |
| --- | --- |
| Random Few-Shot | Small candidate pool, simple task, no retrieval infrastructure |
| Manual Curation | Domain expert available, fixed example set, high-stakes |
| KNN Prompting | Diverse inputs, large pool, automated selection needed |
| EPR/UDR | Can train supervised retriever, maximum retrieval quality |
| Fine-tuning | Thousands of examples, deployment cost matters, maximum accuracy |
| RAG | Knowledge-intensive, external documents needed beyond examples |
Implementation
Implementation Steps
Step 1: Prepare Candidate Pool
- Collect labeled examples representative of the target task and input distribution
- Ensure pool covers the range of expected inputs (topics, difficulty levels, formats)
- Verify label quality — retrieval amplifies both good and bad examples
- Format consistently: each example needs input text and expected output
- Recommended size: 500-5000 examples (more is better, with diminishing returns)
Step 2: Select and Configure Embedding Model
- Choose embedding model based on task and infrastructure:
  - General purpose: `all-MiniLM-L6-v2` (fast, good baseline), `all-mpnet-base-v2` (better quality)
  - OpenAI: `text-embedding-3-small` or `text-embedding-3-large` (best quality, API cost)
  - Domain-specific: fine-tuned models (e.g., trained on NLI/STS-B data) for improved retrieval
- Validate that embedding similarity correlates with task-relevant similarity on a small sample
- Encode all candidate example inputs into vectors
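The validation step above can be made concrete with a leave-one-out check: the fraction of labeled samples whose nearest neighbor (by cosine similarity) shares their label. If this is near chance, embedding similarity is not capturing task-relevant similarity. A sketch with toy vectors; `nn_label_accuracy` is an illustrative helper, not from the cited papers:

```python
import numpy as np

def nn_label_accuracy(embeddings, labels):
    """Fraction of examples whose nearest neighbor (cosine) shares their label."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    nearest = sims.argmax(axis=1)
    return float(np.mean([labels[i] == labels[j] for i, j in enumerate(nearest)]))
```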
Step 3: Build Vector Index
- Choose index type based on pool size:
  - <10,000 examples: exact search (FAISS `IndexFlatIP`) — no approximation needed
  - 10,000-1M examples: approximate search (FAISS `IndexIVFFlat` or `IndexHNSWFlat`)
  - 1M+ examples: approximate search with quantization
- Build and save the index
- Test retrieval quality on sample queries
Step 4: Configure Retrieval Parameters
- Set k (number of neighbors): start with 5, tune between 3-8
- Choose distance metric: cosine similarity (default), L2, or dot product
- Optionally add diversity filtering or reranking
- Optionally add label distribution constraints (ensure class balance in retrieved set)
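The optional class-balance constraint can be sketched as a two-pass selection over precomputed similarities: first guarantee each class a slot, then fill the rest by pure similarity. `retrieve_balanced` is an illustrative scheme, not from the cited papers:

```python
import numpy as np

def retrieve_balanced(similarities, labels, k, min_per_class=1):
    """Top-k by similarity, guaranteeing each class min_per_class slots when possible."""
    order = np.argsort(similarities)[::-1]  # indices by decreasing similarity
    selected = []
    # First pass: take the best example(s) of each class
    for cls in sorted(set(labels)):
        cls_order = [i for i in order if labels[i] == cls]
        selected.extend(cls_order[:min_per_class])
    # Second pass: fill remaining slots by pure similarity
    for i in order:
        if len(selected) >= k:
            break
        if i not in selected:
            selected.append(int(i))
    return selected[:k]
```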
Step 5: Build Prompt Template
- Design prompt structure: instruction (optional) + retrieved examples + test input
- Determine example ordering: most similar first vs last (test both)
- Set token budget: ensure k examples + test input + expected response fit in context window
- Add any task-specific instructions
Step 6: Evaluate and Tune
- Run on validation set (held-out from candidate pool)
- Compare vs random few-shot baseline
- Tune k, embedding model, diversity weight, example ordering
- Analyze retrieval quality: are retrieved examples actually relevant?
- Check for failure patterns: certain input types where retrieval fails
Step 7: Deploy
- Set up embedding model serving (local or API)
- Deploy vector index (in-memory or persistent storage)
- Integrate retrieval step into LLM inference pipeline
- Monitor retrieval quality and LLM performance
- Periodically update candidate pool and rebuild index
Platform-Specific Implementations
OpenAI API:
```python
import openai
import numpy as np
from typing import List, Dict

class KNNPromptingOpenAI:
    def __init__(self, api_key: str,
                 embedding_model: str = "text-embedding-3-small",
                 chat_model: str = "gpt-4-turbo-preview",
                 k: int = 5):
        self.client = openai.OpenAI(api_key=api_key)
        self.embedding_model = embedding_model
        self.chat_model = chat_model
        self.k = k
        self.candidates = []
        self.embeddings = None

    def embed_texts(self, texts: List[str]) -> np.ndarray:
        """Embed texts using OpenAI API"""
        response = self.client.embeddings.create(
            model=self.embedding_model,
            input=texts
        )
        return np.array([item.embedding for item in response.data])

    def build_index(self, candidates: List[Dict]):
        """Build embedding index from candidates"""
        self.candidates = candidates
        texts = [ex['input'] for ex in candidates]
        self.embeddings = self.embed_texts(texts)
        # Normalize for cosine similarity
        norms = np.linalg.norm(self.embeddings, axis=1, keepdims=True)
        self.embeddings = self.embeddings / norms

    def retrieve(self, test_input: str) -> List[Dict]:
        """Retrieve k nearest examples"""
        test_emb = self.embed_texts([test_input])
        test_emb = test_emb / np.linalg.norm(test_emb)
        similarities = np.dot(self.embeddings, test_emb.T).flatten()
        top_k = np.argsort(similarities)[-self.k:][::-1]
        return [
            {**self.candidates[idx], 'similarity': float(similarities[idx])}
            for idx in top_k
        ]

    def generate(self, test_input: str,
                 task_instruction: str = "") -> str:
        """Full KNN prompting pipeline"""
        retrieved = self.retrieve(test_input)
        # Build few-shot prompt from retrieved examples
        examples_text = "\n\n".join([
            f"Input: {ex['input']}\nOutput: {ex['output']}"
            for ex in retrieved
        ])
        user_content = ""
        if task_instruction:
            user_content += task_instruction + "\n\n"
        user_content += examples_text
        user_content += f"\n\nInput: {test_input}\nOutput:"
        response = self.client.chat.completions.create(
            model=self.chat_model,
            messages=[{"role": "user", "content": user_content}],
            temperature=0.0,
            max_tokens=500
        )
        return response.choices[0].message.content

# Usage
knn = KNNPromptingOpenAI(api_key="your-api-key")
candidates = [
    {"input": "The food was amazing and service excellent", "output": "Positive"},
    {"input": "Terrible experience, never going back", "output": "Negative"},
    {"input": "It was okay, nothing special", "output": "Neutral"},
    # ... hundreds more examples
]
knn.build_index(candidates)
```
result = knn.generate(
test_input="The pasta was decent but the wait was too long",
task_instruction="Classify the sentiment of the following review."
)
print(result)
Anthropic Claude:
import anthropic
import numpy as np
from sentence_transformers import SentenceTransformer
class KNNPromptingClaude:
def __init__(self, api_key: str, k: int = 5,
model: str = "claude-sonnet-4-20250514"):
self.client = anthropic.Anthropic(api_key=api_key)
self.model = model
self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
self.k = k
self.candidates = []
self.embeddings = None
def build_index(self, candidates):
"""Build index using local sentence transformer"""
self.candidates = candidates
texts = [ex['input'] for ex in candidates]
self.embeddings = self.encoder.encode(
texts, normalize_embeddings=True
)
def retrieve(self, test_input):
"""Retrieve k nearest examples"""
test_emb = self.encoder.encode(
[test_input], normalize_embeddings=True
)
similarities = np.dot(self.embeddings, test_emb.T).flatten()
top_k = np.argsort(similarities)[-self.k:][::-1]
return [self.candidates[idx] for idx in top_k]
def generate(self, test_input, task_instruction=""):
"""Full pipeline with Claude"""
retrieved = self.retrieve(test_input)
examples_text = "\n\n".join([
f"Input: {ex['input']}\nOutput: {ex['output']}"
for ex in retrieved
])
user_content = ""
if task_instruction:
user_content += task_instruction + "\n\n"
user_content += examples_text
user_content += f"\n\nInput: {test_input}\nOutput:"
message = self.client.messages.create(
model=self.model,
max_tokens=500,
temperature=0.0,
messages=[{"role": "user", "content": user_content}]
)
return message.content[0].text
# Usage
knn_claude = KNNPromptingClaude(api_key="your-api-key")
knn_claude.build_index(candidates)
result = knn_claude.generate(
test_input="The hotel room was clean but noisy",
task_instruction="Classify the sentiment."
)
LangChain Integration:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import LLMChain
def langchain_knn_prompting(candidates, test_input, task_instruction=""):
"""KNN Prompting using LangChain's built-in semantic selector"""
# Format candidates for LangChain
examples = [
{"input": ex["input"], "output": ex["output"]}
for ex in candidates
]
# Create semantic similarity selector (KNN under the hood)
example_selector = SemanticSimilarityExampleSelector.from_examples(
examples,
OpenAIEmbeddings(),
FAISS,
k=5
)
# Define example format
example_prompt = PromptTemplate(
input_variables=["input", "output"],
template="Input: {input}\nOutput: {output}"
)
# Create few-shot template
few_shot_prompt = FewShotPromptTemplate(
example_selector=example_selector,
example_prompt=example_prompt,
prefix=task_instruction if task_instruction else "",
suffix="Input: {input}\nOutput:",
input_variables=["input"]
)
# Compose prompt and model (LCEL pipe syntax; LLMChain is deprecated)
llm = ChatOpenAI(model="gpt-4", temperature=0.0)
chain = few_shot_prompt | llm
return chain.invoke({"input": test_input}).content
Xu et al. kNN Prompting Implementation (Research Variant):
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.special import rel_entr
class KNNPromptingXu:
"""Implementation of Xu et al. (2023) kNN Prompting
for calibration-free nearest neighbor inference."""
def __init__(self, model_name="gpt2-xl", k=5):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.model.eval()
self.k = k
self.datastore_keys = [] # Output distributions
self.datastore_values = [] # Labels
def get_output_distribution(self, prompt):
"""Get LM output probability distribution for a prompt"""
inputs = self.tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
outputs = self.model(**inputs)
# Get distribution over vocabulary at last token position
logits = outputs.logits[0, -1, :]
distribution = torch.softmax(logits, dim=0).numpy()
return distribution
def build_datastore(self, training_examples, anchor_prompt):
"""Build datastore by caching distributions for training data"""
self.datastore_keys = []
self.datastore_values = []
for example in training_examples:
# Construct prompt: anchor examples + training input
full_prompt = anchor_prompt + f"\nInput: {example['input']}\nOutput:"
# Cache output distribution as key
distribution = self.get_output_distribution(full_prompt)
self.datastore_keys.append(distribution)
# Store true label as value
self.datastore_values.append(example['label'])
def predict(self, test_input, anchor_prompt):
"""Predict by finding nearest neighbors in distribution space"""
# Get test distribution
test_prompt = anchor_prompt + f"\nInput: {test_input}\nOutput:"
test_dist = self.get_output_distribution(test_prompt)
# Compute KL divergence to all datastore entries
distances = []
for stored_dist in self.datastore_keys:
# Symmetric KL divergence
kl_forward = np.sum(rel_entr(test_dist + 1e-10, stored_dist + 1e-10))
kl_backward = np.sum(rel_entr(stored_dist + 1e-10, test_dist + 1e-10))
kl_symmetric = (kl_forward + kl_backward) / 2
distances.append(kl_symmetric)
# Find k nearest neighbors
distances = np.array(distances)
nearest_indices = np.argsort(distances)[:self.k]
# Majority vote over nearest neighbor labels
neighbor_labels = [self.datastore_values[i] for i in nearest_indices]
from collections import Counter
prediction = Counter(neighbor_labels).most_common(1)[0][0]
return prediction
Configuration
Key Parameters:
Embedding Model Selection:
| Model | Dimensions | Speed | Quality | Cost |
| ------------------------ | ---------- | ------ | ------------- | --------------- |
| all-MiniLM-L6-v2 | 384 | Fast | Good | Free |
| all-mpnet-base-v2 | 768 | Medium | Better | Free |
| text-embedding-3-small | 1536 | API | High | $0.02/1M tokens |
| text-embedding-3-large | 3072 | API | Highest | $0.13/1M tokens |
| Fine-tuned on task data | Varies | Varies | Best for task | Training cost |
k (number of neighbors):
- Too low (k=1-2): Insufficient examples, high variance
- Optimal range (k=3-8): Good balance of relevance and coverage
- Too high (k>10): Context window pressure, diminishing returns, potentially includes less relevant examples
- Recommendation: Start with k=5, tune on validation set
Distance Metric:
- Cosine similarity: Best default for normalized embeddings, handles varying text lengths
- L2 distance: Works well with unnormalized embeddings
- Dot product: Fastest for normalized embeddings (equivalent to cosine)
- Note: For models producing normalized embeddings, all three metrics yield identical rankings
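The ranking-equivalence note can be checked directly. A small NumPy experiment with random unit vectors confirms that descending cosine order matches ascending L2 order:

```python
import numpy as np

rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 32))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # normalize rows
q = rng.normal(size=32)
q /= np.linalg.norm(q)

cosine = embs @ q                      # dot product == cosine for unit vectors
l2 = np.linalg.norm(embs - q, axis=1)  # ||a-b||^2 = 2 - 2*(a.b) for unit vectors

# Highest cosine similarity and smallest L2 distance give the same ranking
assert (np.argsort(-cosine) == np.argsort(l2)).all()
```

This is why normalizing embeddings once at index-build time is worthwhile: it lets you use the cheapest metric (dot product) without changing results.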
Task-Specific Tuning:
Classification:
- k=3-5, ensure at least one example per expected class
- Consider label-balanced retrieval (equal examples per class)
- Cosine similarity works well
Reasoning/QA:
- k=5-8, prioritize reasoning pattern similarity over topic similarity
- Consider using reasoning-step embeddings rather than question-only embeddings
- May benefit from CoT-annotated examples
Generation:
- k=3-5, balance example length with context budget
- Style consistency more important than topic similarity
- Consider output-aware retrieval (embed both input and output)
Code Generation:
- k=5-8, retrieve examples using similar function signatures or docstrings
- Consider code-specific embedding models (CodeBERT, UniXcoder)
- Include diverse API usage patterns
Best Practices and Workflow
Workflow (End-to-End):
1. Baseline Assessment:
- Test zero-shot performance → establishes minimum
- Test random few-shot (k=5 random examples) → establishes few-shot baseline
- If random few-shot already >90% with low variance, KNN Prompting likely unnecessary
2. Pool Preparation:
- Collect and clean labeled examples
- Remove duplicates and near-duplicates
- Verify label quality on random sample
- Split: retrieval pool (80%), validation (10%), test (10%)
3. Embedding and Index Setup:
- Embed all pool examples
- Build vector index
- Verify retrieval quality on sample queries
4. Tuning:
- Test k values from 3 to 8 on validation set
- Compare embedding models if multiple available
- Test with/without diversity filtering
- Test example ordering (most similar first vs last)
5. Evaluation:
- Full evaluation on validation set
- Compare vs random few-shot baseline
- Analyze failure cases and retrieval quality
- Run on held-out test set for final numbers
6. Deployment:
- Set up embedding model serving
- Deploy vector index
- Integrate into inference pipeline
- Monitor and maintain
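The 80/10/10 split in Pool Preparation can be sketched in a few lines (a hypothetical `split_pool` helper; the fixed seed keeps the split reproducible across runs):

```python
import random

def split_pool(examples, seed=42):
    """Shuffle and split into retrieval pool / validation / test (80/10/10)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_pool, n_val = int(n * 0.8), int(n * 0.1)
    pool = shuffled[:n_pool]
    val = shuffled[n_pool:n_pool + n_val]
    test = shuffled[n_pool + n_val:]
    return pool, val, test

pool, val, test = split_pool(list(range(100)))
print(len(pool), len(val), len(test))  # 80 10 10
```

Only `pool` goes into the retrieval index; `val` drives tuning and `test` is touched once for final numbers.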
Implementation Best Practices:
Do:
- Validate that embedding similarity correlates with task-relevant similarity before committing
- Start with a good general-purpose embedding model before investing in fine-tuning
- Include diversity filtering if top-k neighbors tend to be near-duplicates
- Monitor retrieval quality in production — inputs may drift
- Cache embeddings and retrieval results when inputs repeat
- Normalize embeddings for consistent cosine similarity computation
- Test on diverse inputs during validation, not just typical cases
- Keep candidate pool up to date as task requirements evolve
Don't:
- Assume any embedding model works — validate retrieval quality
- Use k > 8 without checking context window limits
- Skip the random few-shot baseline comparison (you need to prove KNN helps)
- Build the index on the test set (data leakage)
- Ignore diversity — 5 near-identical examples waste context window
- Use embedding models trained on vastly different domains without validation
- Deploy without monitoring — embedding quality can degrade with distribution shift
Debugging Decision Tree
Symptom: KNN Prompting performs worse than random few-shot
Root causes:
- Embedding model doesn't capture task-relevant similarity
- Candidate pool quality is poor (noisy labels)
- k too high (including irrelevant examples)
Solutions:
- Try different embedding model (switch from general to domain-specific)
- Audit candidate pool labels for accuracy
- Reduce k from 5 to 3
- Add manual review: are retrieved examples actually relevant?
- Consider supervised retriever (EPR) if unsupervised fails
Symptom: Retrieved examples are near-duplicates of each other
Root causes:
- Candidate pool lacks diversity
- Embedding space has dense clusters
- No diversity filtering
Solutions:
- Add diversity-weighted retrieval (MMR — Maximal Marginal Relevance)
- Deduplicate candidate pool before building index
- Retrieve top-2k candidates, then subsample for diversity
- Use clustering to ensure examples come from different clusters
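Deduplicating the pool before indexing can be as simple as a greedy cosine-threshold filter. A sketch assuming L2-normalized embeddings and an illustrative 0.95 threshold (tune per task):

```python
import numpy as np

def deduplicate_pool(embeddings: np.ndarray, threshold: float = 0.95):
    """Greedily keep rows whose cosine similarity to every
    already-kept row is below the threshold (rows must be normalized)."""
    kept = []
    for i in range(len(embeddings)):
        if all(embeddings[i] @ embeddings[j] < threshold for j in kept):
            kept.append(i)
    return kept

embs = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(deduplicate_pool(embs))  # [0, 2]: row 1 is a near-duplicate of row 0
```

The greedy scan is O(n²) in the worst case; for large pools, run it per cluster after a rough clustering pass.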
Symptom: Good retrieval quality but LLM still performs poorly
Root causes:
- Retrieved examples are relevant but demonstrate wrong patterns
- k too high, overwhelming the LLM with context
- Example ordering suboptimal
- Task fundamentally hard for few-shot
Solutions:
- Review what the examples actually demonstrate — relevant input doesn't guarantee useful output
- Reduce k and see if fewer, more focused examples help
- Test different example orderings
- Add task instruction alongside examples
- Consider that few-shot may be insufficient — escalate to fine-tuning
Symptom: Retrieval is slow
Root causes:
- Using exact search on large candidate pool
- Embedding model too slow for real-time use
- No caching for repeated queries
Solutions:
- Switch to approximate nearest neighbor search (FAISS IVF, HNSW)
- Use smaller embedding model (MiniLM instead of mpnet)
- Cache embeddings for frequently seen inputs
- Pre-compute and cache retrieval results for common query types
- Use GPU acceleration for embedding computation
Symptom: Performance degrades over time in production
Root causes:
- Input distribution has shifted from when pool was built
- Candidate pool is stale (task or domain has evolved)
- Embedding model mismatch with new input types
Solutions:
- Monitor retrieval similarity scores — declining scores indicate distribution shift
- Periodically update candidate pool with recent, relevant examples
- Rebuild index when pool changes significantly
- Set up alerts for low average similarity scores
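A rolling-window monitor is one way to operationalize these alerts. A minimal sketch; the window size and alert threshold are assumptions to tune per deployment:

```python
from collections import deque

class SimilarityMonitor:
    """Tracks a rolling mean of top-1 retrieval similarity scores
    and flags when it drops below a threshold (distribution shift)."""
    def __init__(self, window: int = 500, alert_below: float = 0.5):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, top1_similarity: float) -> bool:
        """Record one score; return True if the rolling mean looks degraded."""
        self.scores.append(top1_similarity)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.alert_below

monitor = SimilarityMonitor(window=3, alert_below=0.5)
print([monitor.record(s) for s in [0.8, 0.7, 0.3, 0.2, 0.1]])
# [False, False, False, True, True]
```

In production this would feed an alerting system rather than return a bool, but the core signal (declining average similarity) is the same.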
Symptom: Inconsistent outputs across similar inputs
Root causes:
- Small differences in input lead to different retrieved examples
- Boundary cases in embedding space
- Temperature too high during LLM inference
Solutions:
- Increase k to smooth out boundary effects
- Set temperature=0.0 for deterministic LLM outputs
- Use ensemble: retrieve with multiple embedding models and merge results
- Add diversity filtering to reduce sensitivity to small input changes
Testing and Optimization
Validation Strategy:
Holdout Validation:
- Reserve 10-20% of candidate pool as validation set
- Never include validation examples in the retrieval index
- Use validation to tune k, embedding model, diversity weight
- Final evaluation on separate held-out test set
Retrieval Quality Testing:
- For each validation query, check if retrieved examples are actually relevant (human judgment)
- Measure Precision@k: fraction of retrieved examples that are relevant
- Measure nDCG: whether more relevant examples rank higher
Adversarial Testing:
- Test with out-of-domain inputs: does retrieval gracefully handle unfamiliar queries?
- Test with adversarial inputs: does embedding manipulation affect retrieval?
- Test with edge cases: very short inputs, very long inputs, ambiguous inputs
Test Coverage:
- Common cases (50%): Representative inputs from expected distribution
- Domain boundary cases (20%): Inputs at the edge between categories or topics
- Short/long inputs (15%): Varying input lengths to test embedding robustness
- Out-of-distribution (10%): Inputs not well represented in candidate pool
- Adversarial (5%): Intentionally challenging or misleading inputs
Quality Metrics:
Retrieval Metrics:
- Precision@k: Fraction of retrieved examples judged relevant
- Recall@k: Fraction of all relevant examples retrieved
- nDCG@k: Normalized discounted cumulative gain — penalizes relevant examples ranked low
- Mean Reciprocal Rank: Average rank of first relevant result
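Precision@k and reciprocal rank are straightforward to compute from per-query relevance judgments; a small sketch:

```python
def precision_at_k(relevant: list[bool]) -> float:
    """Fraction of the retrieved examples judged relevant."""
    return sum(relevant) / len(relevant)

def reciprocal_rank(relevant: list[bool]) -> float:
    """1 / rank of the first relevant result (0.0 if none)."""
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            return 1 / rank
    return 0.0

# Relevance judgments for one query's top-5 retrieval
judgments = [False, True, True, False, True]
print(precision_at_k(judgments))   # 0.6
print(reciprocal_rank(judgments))  # 0.5
```

Averaging `reciprocal_rank` over all validation queries gives MRR; averaging `precision_at_k` gives the pool-level Precision@k reported above.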
Task Performance Metrics:
- Classification: Accuracy, F1, precision, recall
- Generation: BLEU, ROUGE, semantic similarity, human evaluation
- QA: Exact match, F1, answer relevance
- Code: Execution correctness, test pass rate
General Metrics:
- Improvement over random baseline: (KNN - Random) / Random × 100%
- Consistency: Variance in output quality across runs
- Robustness: Performance on adversarial or OOD inputs
- Efficiency: Latency overhead from retrieval step
Optimization Techniques:
1. Embedding Model Selection:
def compare_embedding_models(candidate_pool, validation_set, models):
"""Compare embedding models for retrieval quality"""
results = {}
for model_name in models:
knn = KNNPrompting(embedding_model=model_name, k=5)
knn.build_index(candidate_pool)
accuracy = evaluate(knn, validation_set)
avg_similarity = average_retrieval_similarity(knn, validation_set)
results[model_name] = {
'accuracy': accuracy,
'avg_similarity': avg_similarity
}
return results
2. k Optimization:
def optimize_k(knn_system, validation_set, k_range=range(1, 11)):
"""Find optimal k value"""
results = {}
for k in k_range:
knn_system.k = k
accuracy = evaluate(knn_system, validation_set)
results[k] = accuracy
optimal_k = max(results, key=results.get)
return optimal_k, results
3. Diversity-Aware Retrieval (MMR):
def mmr_retrieval(query_emb, candidate_embs, candidates, k=5,
lambda_param=0.7):
"""Maximal Marginal Relevance for diverse retrieval"""
similarities = np.dot(candidate_embs, query_emb.T).flatten()
selected_indices = []
remaining = list(range(len(candidates)))
for _ in range(k):
if not remaining:
break
mmr_scores = []
for idx in remaining:
relevance = similarities[idx]
# Max similarity to already selected
if selected_indices:
redundancy = max(
np.dot(candidate_embs[idx], candidate_embs[s])
for s in selected_indices
)
else:
redundancy = 0
mmr = lambda_param * relevance - (1 - lambda_param) * redundancy
mmr_scores.append((idx, mmr))
best_idx = max(mmr_scores, key=lambda x: x[1])[0]
selected_indices.append(best_idx)
remaining.remove(best_idx)
return [candidates[i] for i in selected_indices]
4. Caching Strategy:
import hashlib
class CachedKNNPrompting:
def __init__(self, knn_system, cache_size=10000):
self.knn = knn_system
self.cache_size = cache_size
self._cache = {}
def retrieve_cached(self, test_input):
"""Retrieve with caching for repeated inputs"""
cache_key = hashlib.md5(test_input.encode()).hexdigest()
if cache_key in self._cache:
return self._cache[cache_key]
result = self.knn.retrieve(test_input)
if len(self._cache) >= self.cache_size:
# Evict oldest entry
oldest = next(iter(self._cache))
del self._cache[oldest]
self._cache[cache_key] = result
return result
Iteration Criteria:
When to stop optimizing:
- Validation accuracy improvement <1% from further tuning
- Retrieval Precision@k >0.8 (most retrieved examples are relevant)
- Performance gap vs random few-shot consistently >5%
- Further k increases show no improvement
- Embedding model comparison shows no significant differences
When to continue:
- Retrieval quality clearly poor (irrelevant examples being retrieved)
- Performance barely better than random few-shot (<3% improvement)
- Specific input categories where retrieval consistently fails
- Have not tested domain-specific embedding models
A/B Testing:
import random
def ab_test_knn_vs_random(candidate_pool, test_set, k=5, trials=20):
"""Statistical comparison of KNN vs random selection"""
knn = KNNPrompting(k=k)
knn.build_index(candidate_pool)
knn_accuracies = []
random_accuracies = []
for trial in range(trials):
# KNN selection (deterministic)
knn_results = evaluate(knn, test_set)
knn_accuracies.append(knn_results)
# Random selection (different random seed each trial)
random_examples = random.sample(candidate_pool, k)
random_results = evaluate_with_fixed_examples(random_examples, test_set)
random_accuracies.append(random_results)
from scipy.stats import ttest_rel
t_stat, p_value = ttest_rel(knn_accuracies, random_accuracies)
print(f"KNN: {np.mean(knn_accuracies):.2%} ± {np.std(knn_accuracies):.2%}")
print(f"Random: {np.mean(random_accuracies):.2%} ± {np.std(random_accuracies):.2%}")
print(f"P-value: {p_value:.4f}")
return {'knn_mean': np.mean(knn_accuracies),
'random_mean': np.mean(random_accuracies),
'p_value': p_value}
Limitations and Constraints
Known Limitations
1. Embedding Quality Dependency (Fundamental):
KNN Prompting is only as good as its embedding model. If the embedding model doesn't capture the dimensions of similarity relevant to the task, retrieval will return superficially similar but functionally irrelevant examples. This is particularly problematic for tasks where surface-level text similarity doesn't predict example usefulness (e.g., math problems that look similar but require different techniques).
2. Computational Cost for Large Pools:
Since KNN calculates similarity between the test input and all candidates in the pool, it can be computationally expensive for very large datasets. While approximate nearest neighbor indices (FAISS, Annoy) mitigate this for single queries, the embedding computation for all candidates must still happen upfront. For pools exceeding millions of examples, storage and index management become nontrivial.
3. Context Window Pressure:
Retrieved examples consume context window tokens. With k=5 examples averaging 200 tokens each, that's 1000 tokens before the test input and response. This limits k for models with smaller context windows and for tasks requiring long examples. The token cost of examples trades directly against the space available for test input and model response.
4. No Guarantee of Diversity:
Pure nearest neighbor retrieval can return near-duplicate examples when the candidate pool has dense clusters. Five very similar examples waste four example slots that could demonstrate different aspects of the task. Diversity filtering (MMR) helps but introduces its own hyperparameter and can reduce average relevance.
5. Sparse Distribution Problem (Xu et al. variant):
For the distribution-based kNN Prompting variant, the kNN distribution support is sparse — it only assigns probability mass to nearest neighbors. This means it may miss tokens needed for certain predictions, particularly in zero-shot or low-data settings where the datastore is small.
6. Static Retrieval:
KNN Prompting retrieves based on the initial input, not adapting to intermediate model outputs. For multi-turn or iterative tasks, the initially retrieved examples may become less relevant as the conversation progresses. There's no feedback loop between the model's output and the retrieval process.
7. Infrastructure Overhead:
Deploying KNN Prompting requires maintaining an embedding model, vector index, and candidate pool alongside the LLM. While the computational overhead is minimal, the engineering complexity is non-trivial for production systems. This is qualitatively different from simply calling an LLM API.
Edge Cases
Ambiguous inputs where multiple example types are equally relevant:
- The test input falls equidistant between different clusters in embedding space
- Retrieved examples may be a mixture of different categories
- Detection: Low maximum similarity score, or top-k examples spanning multiple categories
- Solution: Increase k to cover multiple interpretations, or add disambiguation instruction
Out-of-distribution inputs:
- Test input is fundamentally different from anything in the candidate pool
- All similarity scores are low, retrieved examples are irrelevant
- Detection: Maximum similarity score below a threshold (e.g., cosine similarity < 0.5)
- Solution: Fall back to zero-shot or manual examples when similarity is too low
Adversarial inputs designed to manipulate retrieval:
- Attacker crafts inputs to retrieve specific examples that cause the model to produce desired outputs
- Detection: Unusual patterns in retrieval (always retrieving same examples, or very different from typical)
- Solution: Monitor retrieval patterns, add randomization, validate outputs
Very short or very long inputs:
- Short inputs produce low-information embeddings with unreliable similarity
- Long inputs may match on irrelevant details
- Detection: Input length far outside the candidate pool's typical range
- Solution: Normalize input length, use passage-level embeddings for long inputs, increase k for short inputs
Label imbalance in retrieved set:
- If the candidate pool has class imbalance, retrieved examples may all belong to the majority class
- Detection: Check label distribution of retrieved examples
- Solution: Stratified retrieval ensuring minimum representation per class
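Stratified retrieval can be sketched as taking the best example per label first, then filling remaining slots by raw similarity (a hypothetical helper with illustrative labels):

```python
from collections import defaultdict

def stratified_retrieve(candidates, similarities, k=5):
    """Top-k by similarity, but guarantee each label appears at least once.
    candidates: dicts with a 'label' key; similarities: parallel list."""
    order = sorted(range(len(candidates)),
                   key=lambda i: similarities[i], reverse=True)
    by_label = defaultdict(list)
    for i in order:
        by_label[candidates[i]["label"]].append(i)
    # Best example of each class first...
    chosen = [idxs[0] for idxs in by_label.values()][:k]
    # ...then fill remaining slots with the globally most similar
    for i in order:
        if len(chosen) >= k:
            break
        if i not in chosen:
            chosen.append(i)
    return [candidates[i] for i in chosen]

cands = [{"input": "great", "label": "pos"}, {"input": "good", "label": "pos"},
         {"input": "bad", "label": "neg"}, {"input": "fine", "label": "pos"},
         {"input": "awful", "label": "neg"}]
sims = [0.9, 0.8, 0.3, 0.7, 0.2]
picked = stratified_retrieve(cands, sims, k=3)
print([c["label"] for c in picked])  # ['pos', 'neg', 'pos']
```

Note that plain top-3 retrieval here would return three "pos" examples; the stratified version sacrifices one slot to keep "neg" represented.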
Graceful Degradation:
class RobustKNNPrompting:
def __init__(self, knn_system, similarity_threshold=0.3):
self.knn = knn_system
self.threshold = similarity_threshold
def generate_with_fallback(self, test_input, task_instruction=""):
"""KNN prompting with graceful fallback"""
retrieved = self.knn.retrieve(test_input)
# Check retrieval quality
avg_similarity = np.mean([ex['similarity'] for ex in retrieved])
if avg_similarity < self.threshold:
# Low similarity — fall back to zero-shot
print(f"Warning: Low retrieval quality ({avg_similarity:.3f}). "
f"Falling back to zero-shot.")
return self.zero_shot_generate(test_input, task_instruction)
# Filter out low-quality retrievals
quality_retrieved = [
ex for ex in retrieved
if ex['similarity'] >= self.threshold
]
if len(quality_retrieved) < 2:
# Too few quality examples — use zero-shot with instruction
return self.zero_shot_generate(test_input, task_instruction)
return self.knn.generate_with_examples(
test_input, quality_retrieved, task_instruction
)
def zero_shot_generate(self, test_input, task_instruction):
"""Fallback to zero-shot when retrieval fails"""
prompt = task_instruction + f"\n\nInput: {test_input}\nOutput:"
return self.knn.llm_generate(prompt)
Constraint Management
Balancing Relevance vs Diversity:
Pure relevance retrieval may return redundant examples. Pure diversity selection may return irrelevant ones. The MMR (Maximal Marginal Relevance) approach balances both:
- Lambda = 1.0: Pure relevance (standard KNN)
- Lambda = 0.5: Equal weight to relevance and diversity
- Lambda = 0.7: Mild diversity preference (good default)
- Tune lambda on validation set based on task needs
Handling Token/Context Constraints:
def token_aware_retrieval(knn_system, test_input, max_example_tokens=2000):
"""Retrieve examples fitting within token budget"""
# Retrieve more candidates than needed
candidates = knn_system.retrieve_top_n(test_input, n=knn_system.k * 2)
selected = []
total_tokens = 0
for candidate in candidates:
example_tokens = estimate_tokens(candidate['input'] + candidate['output'])
if total_tokens + example_tokens <= max_example_tokens:
selected.append(candidate)
total_tokens += example_tokens
if len(selected) >= knn_system.k:
break
return selected
Handling Incomplete Candidate Pool:
When the candidate pool doesn't cover all expected input types:
- Monitor which inputs get low similarity scores
- Prioritize adding examples for underrepresented input types
- Use zero-shot fallback for inputs with no good matches
- Periodically audit retrieval quality on production traffic
Error Handling and Recovery:
- Embedding API failure: Cache recent embeddings, fall back to cached results or random selection
- Index corruption: Maintain index backups, rebuild from stored embeddings
- Candidate pool staleness: Set up periodic refresh schedule
- Embedding model version change: Rebuild entire index when embedding model updates
Advanced Techniques
Clarity and Context Optimization
Ensuring Retrieval Clarity:
The quality of KNN Prompting depends on what gets embedded. Embedding only the input text is the simplest approach, but may miss task-relevant context:
- Input-only embedding: Simple, fast, works for most tasks. Misses output-conditional relevance.
- Input+output embedding: Embeds the full example. Better for generation tasks where output style matters.
- Task-description-augmented embedding: Prepends task description to input before embedding. Helps the embedding model focus on task-relevant features.
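The three strategies differ only in what text is handed to the encoder; a hypothetical `embedding_text` helper makes the options explicit:

```python
def embedding_text(example: dict, mode: str = "input",
                   task_description: str = "") -> str:
    """Build the text that gets embedded for one candidate example."""
    if mode == "input":
        return example["input"]
    if mode == "input+output":
        return f"{example['input']}\n{example['output']}"
    if mode == "task-augmented":
        return f"{task_description}\n{example['input']}"
    raise ValueError(f"unknown mode: {mode}")

ex = {"input": "The food was great", "output": "Positive"}
print(embedding_text(ex, "input+output"))
```

Whichever mode is chosen, it must be applied identically to pool examples and test inputs, or the embedding spaces will not match.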
Context Optimization:
For tasks requiring domain knowledge, the retrieved examples should carry relevant context:
def context_enriched_retrieval(knn, test_input, domain_context=""):
"""Retrieve examples and add domain context to prompt"""
retrieved = knn.retrieve(test_input)
# Build prompt with context
prompt = ""
if domain_context:
prompt += f"Domain context: {domain_context}\n\n"
prompt += "Here are examples of similar tasks:\n\n"
for ex in retrieved:
prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
prompt += f"Now complete:\nInput: {test_input}\nOutput:"
return prompt
Handling Context Length Limitations:
When retrieved examples are long, compress or truncate them:
- Truncate examples to key portions (first N tokens of input, full output)
- Summarize long examples before including in prompt
- Reduce k to fit within context budget
- Use tiered approach: include full nearest example, abbreviated versions for remaining
Example Design:
What makes a retrieved example effective:
- Relevant input: Semantically close to the test query
- Clear output: Unambiguous, correctly formatted answer
- Appropriate length: Long enough to be informative, short enough to not waste context
- Correct label: Incorrect examples in the pool actively harm performance
- Representative: Should represent a genuine instance of the task, not an edge case
Optimal Number and Diversity:
- Classification: k=3-5, ensure label diversity in retrieved set
- Generation: k=3-5, balance style diversity with topical relevance
- QA: k=5-7, cover different reasoning patterns
- Code: k=5-8, include different implementation approaches for similar problems
Advanced Reasoning and Output Control
Multi-Step Reasoning with KNN:
For reasoning tasks, retrieve examples that demonstrate similar reasoning chains:
def reasoning_aware_retrieval(knn, test_input, reasoning_type):
"""Retrieve examples matching reasoning pattern"""
# Encode with reasoning context
enriched_input = f"[{reasoning_type}] {test_input}"
retrieved = knn.retrieve(enriched_input)
# Filter to ensure CoT examples
cot_examples = [ex for ex in retrieved if 'reasoning' in ex]
return cot_examples
Self-Verification:
Build verification into the prompt by retrieving examples that include verification steps:
Input: What is 15% of 240?
Reasoning: To find 15% of 240, I calculate 0.15 × 240 = 36. Verification: 36/240 = 0.15 = 15% ✓
Output: 36
Structured Output:
Ensure all retrieved examples demonstrate the exact required format. Pre-filter the candidate pool to only include correctly formatted examples:
def format_filtered_retrieval(knn, test_input, format_validator):
"""Only retrieve examples matching required format"""
# Retrieve extra candidates to account for filtering
candidates = knn.retrieve_top_n(test_input, n=knn.k * 3)
# Filter by format compliance
formatted = [
ex for ex in candidates
if format_validator(ex['output'])
]
return formatted[:knn.k]
Constraint Enforcement:
When the task has hard constraints (word count, format, content restrictions), ensure retrieved examples demonstrate constraint compliance:
- Filter candidate pool to only include constraint-compliant examples
- Add explicit constraint statement in the prompt instruction
- Use retrieved examples as implicit demonstrations of constraint adherence
Interaction Patterns
Conversational KNN:
For multi-turn conversations, update retrieval based on conversation context:
```python
def conversational_knn(knn, conversation_history, new_message):
    """Update retrieval based on conversation context"""
    # Concatenate recent turns for a richer embedding
    context = " ".join(
        msg['content'] for msg in conversation_history[-3:]
    )
    enriched_query = context + " " + new_message
    # Retrieve based on the full context
    return knn.retrieve(enriched_query)
```
Iterative Refinement:
Use feedback from model outputs to improve retrieval:
```python
def iterative_knn(knn, test_input, validator, max_iterations=3):
    """Iteratively refine retrieval based on output quality"""
    original_k = knn.k
    try:
        for _ in range(max_iterations):
            result = knn.generate(test_input)
            if validator(result):
                return result
            # Widen the retrieval window and retry
            knn.k += 2
        return result  # Best effort after max_iterations
    finally:
        knn.k = original_k  # Restore so later calls are unaffected
```
Chaining KNN with Other Techniques:
```python
from collections import Counter

def knn_with_cot_and_self_consistency(knn, test_input, n_samples=5):
    """KNN retrieval + CoT + Self-Consistency"""
    # Step 1: KNN retrieval for relevant examples
    retrieved = knn.retrieve(test_input)
    # Step 2: Build a CoT prompt with the retrieved examples
    prompt = "Solve step by step:\n\n"
    for ex in retrieved:
        prompt += f"Q: {ex['input']}\nA: Let's think step by step. "
        prompt += f"{ex['reasoning']}\nThe answer is {ex['output']}.\n\n"
    prompt += f"Q: {test_input}\nA: Let's think step by step."
    # Step 3: Self-consistency - generate multiple responses
    responses = [llm(prompt, temperature=0.7) for _ in range(n_samples)]
    answers = [extract_answer(r) for r in responses]
    # Step 4: Majority vote
    return Counter(answers).most_common(1)[0][0]
```
Model Considerations
GPT-4 / GPT-4 Turbo:
- Strong in-context learner, benefits from relevant examples
- Can handle k=8-10 examples with large context window
- Embedding: OpenAI `text-embedding-3-large` for best alignment
- Sensitive to example quality — retrieval quality matters
Claude 3.5 Sonnet / Claude 3 Opus:
- Excellent instruction following, retrieved examples should focus on demonstrating format and reasoning
- May need fewer examples (k=3-5) due to strong in-context learning
- Embedding: Any high-quality model (no native Claude embedding model; use open-source or OpenAI)
- Particularly benefits from well-structured examples
Llama 3 70B / 405B:
- Benefits significantly from KNN Prompting (larger models better at leveraging context)
- May need more examples (k=5-8) compared to GPT-4
- Embedding: Open-source models preferred (Sentence-BERT variants)
- More sensitive to example order — experiment with placing the most similar example first vs. last
Smaller Models (7B-13B):
- Limited in-context learning ability reduces KNN Prompting effectiveness
- Keep k=2-4 to avoid context window overload
- Focus on very high relevance over diversity
- May benefit more from the Xu et al. distribution-matching variant
Cross-Model Considerations:
- Embedding model choice is independent of LLM — same index works across models
- Optimal k may differ across LLMs — tune per model
- Example formatting preferences differ — some models prefer structured, others flexible
- Test retrieval effectiveness per model, not just once
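Since the optimal k differs per model, a simple grid search on a validation set is usually enough. A minimal sketch, assuming a hypothetical `validate_fn(k)` callback that runs the full retrieve-and-generate loop at that k and returns accuracy (the toy accuracy curve below is purely illustrative):

```python
def tune_k(validate_fn, k_values=(2, 3, 5, 8)):
    """Pick the k with the best validation accuracy for a given model."""
    scores = {k: validate_fn(k) for k in k_values}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy validation curve: accuracy peaks at k=5 for this hypothetical model
toy_curve = {2: 0.71, 3: 0.74, 5: 0.79, 8: 0.77}
best_k, scores = tune_k(lambda k: toy_curve[k])
```

Running this once per target LLM gives a per-model k without touching the shared embedding index.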
Safety, Robustness, and Domain Adaptation
Adversarial Protection:
KNN Prompting introduces a retrieval attack surface. An attacker who can influence the candidate pool can manipulate which examples get retrieved:
- Pool poisoning: Injecting malicious examples that are designed to be retrieved for certain queries
- Mitigation: Validate all pool examples before inclusion, use trusted data sources only, monitor for unusual retrieval patterns
Input manipulation:
- Attacker crafts inputs to trigger retrieval of specific examples
- Mitigation: Input sanitization, monitor for anomalous retrieval patterns, rate limiting
Output Safety:
Retrieved examples can contain biased or harmful content that gets amplified in the model's output:
```python
def safe_retrieval(knn, test_input, safety_filter):
    """Filter retrieved examples for safety"""
    retrieved = knn.retrieve_top_n(test_input, n=knn.k * 2)
    safe_examples = [
        ex for ex in retrieved
        if safety_filter.is_safe(ex['input']) and safety_filter.is_safe(ex['output'])
    ]
    return safe_examples[:knn.k]
```
Reliability:
Ensure consistent outputs by:
- Using temperature=0.0 for deterministic LLM inference
- Caching retrieval results to ensure same input always gets same examples
- Monitoring similarity score distributions for drift
- Setting up alerts for degraded retrieval quality
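The caching point above can be sketched as a thin wrapper around any retriever. `CachedRetriever` is a hypothetical helper, with a stand-in retrieval function in place of a real index:

```python
class CachedRetriever:
    """Wrap a retriever so identical inputs always get identical examples."""

    def __init__(self, retrieve_fn):
        self._retrieve = retrieve_fn
        self._cache = {}
        self.calls = 0  # count of underlying retrievals, useful for monitoring

    def retrieve(self, text):
        # Serve from cache when possible; otherwise hit the real retriever once
        if text not in self._cache:
            self.calls += 1
            self._cache[text] = self._retrieve(text)
        return self._cache[text]

# Usage with a stand-in retrieval function
retriever = CachedRetriever(lambda q: [f"example-for-{q}"])
a = retriever.retrieve("refund policy")
b = retriever.retrieve("refund policy")
```

Repeated queries return the exact same example set, which removes one source of output variance; the cache should be invalidated whenever the candidate pool is updated.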
Domain Adaptation:
Adapting KNN Prompting to new domains:
- Quick adaptation: Use general-purpose embeddings + domain-specific candidate pool. Works for most domains with minimal setup.
- Better adaptation: Use domain-specific embedding model (BioSentVec for medical, LegalBERT for legal, CodeBERT for code). Improves retrieval quality for domain-specific similarity.
- Best adaptation: Fine-tune embedding model on domain data for in-context learning relevance. Requires training data but yields best results.
Handling domain-specific terminology:
- Domain-specific embedding models capture terminology better than general models
- Augment candidate pool with domain glossary examples
- Consider metadata filtering (retrieve only from relevant subdomain)
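The metadata-filtering idea can be sketched as follows, with word overlap standing in for embedding similarity and a hypothetical `subdomain` tag on each pool entry:

```python
def metadata_filtered_retrieve(pool, query_terms, subdomain, k=3):
    """Retrieve only from the relevant subdomain, ranking by term overlap.

    A stand-in for real embedding retrieval: similarity is approximated
    here by word overlap with `query_terms` (a set of lowercase words).
    """
    # Hard filter: never retrieve outside the requested subdomain
    in_domain = [ex for ex in pool if ex['subdomain'] == subdomain]

    def overlap(ex):
        return len(set(ex['input'].lower().split()) & query_terms)

    return sorted(in_domain, key=overlap, reverse=True)[:k]

pool = [
    {'input': 'myocardial infarction symptoms', 'subdomain': 'cardiology'},
    {'input': 'contract breach remedies', 'subdomain': 'legal'},
    {'input': 'arrhythmia treatment options', 'subdomain': 'cardiology'},
]
hits = metadata_filtered_retrieve(pool, {'arrhythmia', 'symptoms'}, 'cardiology', k=2)
```

In a real system the hard filter would be a metadata predicate pushed down into the vector store query, with embedding similarity replacing the overlap score.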
Quick domain transfer using analogies:
- Build separate indices for each domain
- When entering a new domain with few examples, bootstrap with analogous examples from related domains
- Gradually replace analogous examples with genuine domain examples as they become available
Risk and Ethics
Ethical Considerations
Data Privacy in Candidate Pools:
Candidate pools may contain sensitive information (personal data, confidential documents, proprietary content). When these are retrieved and included in prompts:
- The data gets sent to LLM APIs (potential privacy violation)
- Model outputs may include or reference sensitive details
- Mitigation: Anonymize candidate pools, use on-premise models for sensitive data, implement access controls on the index
Bias in Retrieval:
If the candidate pool reflects societal biases, KNN retrieval can amplify them:
- Training data with gender, racial, or cultural biases produces biased embeddings
- Examples reflecting historical discrimination get retrieved and reinforced
- Models may anchor on biased patterns in retrieved examples
- Mitigation: Audit candidate pool for bias, use debiased embedding models, monitor output fairness metrics
Transparency:
When deploying KNN-prompted systems:
- Users should know that responses are influenced by retrieved examples
- The retrieval process should be auditable — which examples were retrieved and why
- Document the candidate pool composition and embedding model used
- Provide explanations when challenged: "This response was based on similar cases in our database"
Model Capability Revelation:
KNN Prompting reveals how LLMs respond to different types of demonstrations, which could:
- Positive: Improve understanding of model behavior, enable better prompt engineering
- Negative: Enable adversaries to craft examples that systematically manipulate model outputs
Risk Analysis
Failure Modes:
1. Poor Retrieval Quality:
- Symptom: Retrieved examples irrelevant to test input
- Impact: Performance worse than random few-shot
- Probability: Medium (15-25% without validation)
- Mitigation: Validate embedding model before deployment, monitor retrieval similarity scores
2. Candidate Pool Poisoning:
- Symptom: Incorrect or malicious examples in the pool
- Impact: Systematic errors or harmful outputs for queries that trigger poisoned examples
- Probability: Low in controlled environments, higher with user-contributed pools
- Mitigation: Validate all pool entries, use trusted sources, monitor for anomalies
3. Distribution Shift:
- Symptom: Performance degrades over time as inputs change
- Impact: Retrieval returns increasingly irrelevant examples
- Probability: Medium-High (30-40% over months without maintenance)
- Mitigation: Periodic pool refresh, similarity score monitoring, automated drift detection
4. Embedding Model Mismatch:
- Symptom: High similarity scores but retrieved examples are not useful
- Impact: False confidence in retrieval quality
- Probability: Medium (20-30% with generic embeddings)
- Mitigation: Validate retrieval quality with human judgment, not just similarity scores
Cascading Failures:
Incorrect retrieval → wrong examples in prompt → model anchors on incorrect patterns → systematic errors on similar inputs → users lose trust in system
Prevention: Multi-layer validation — check retrieval quality, validate LLM output, monitor user feedback
Bias Amplification:
Sources of Bias:
- Candidate pool bias: If pool overrepresents certain demographics, topics, or viewpoints, retrieval amplifies this
- Embedding bias: Embedding models encode societal biases that affect similarity computation
- Proximity bias: Examples semantically close to the query may share the query's biases rather than providing corrective perspective
Detection and Mitigation:
```python
from collections import Counter

def audit_retrieval_bias(knn, test_queries, sensitive_attributes):
    """Audit retrieval for demographic or topical bias"""
    bias_report = {}
    for attr in sensitive_attributes:
        attr_distributions = []
        for query in test_queries:
            retrieved = knn.retrieve(query)
            attr_values = [get_attribute(ex, attr) for ex in retrieved]
            attr_distributions.append(Counter(attr_values))
        # Check whether certain attribute values are systematically
        # over- or under-represented across queries
        aggregated = sum(attr_distributions, Counter())
        total = sum(aggregated.values())
        bias_report[attr] = {
            value: count / total
            for value, count in aggregated.items()
        }
    return bias_report
```
Innovation Potential
Novel Combinations:
KNN + Active Prompting: Use KNN retrieval to select candidate pool, then apply Active Prompting to identify which retrieved examples are most informative:
- KNN retrieves top-20 relevant examples per query
- Active Prompting selects the 5 most uncertain/informative from the retrieved set
- Combines relevance (KNN) with informativeness (Active) for optimal example selection
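One minimal way to sketch the second stage, assuming a hypothetical `confidence_fn` that scores the model's confidence on each retrieved example (low confidence meaning high informativeness):

```python
def select_informative(retrieved, confidence_fn, n=5):
    """From KNN-retrieved candidates, keep the n the model is least sure about."""
    # Sort least-confident first; those are the most informative demonstrations
    scored = sorted(retrieved, key=confidence_fn)
    return scored[:n]

# Toy: six retrieved examples with stand-in confidence scores
examples = [('a', 0.95), ('b', 0.40), ('c', 0.80),
            ('d', 0.55), ('e', 0.99), ('f', 0.60)]
chosen = select_informative(examples, confidence_fn=lambda ex: ex[1], n=3)
```

In practice the confidence score would come from the model itself, for example disagreement across sampled answers, as in Active Prompting.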
KNN + RAG: Use KNN Prompting for example selection alongside RAG for knowledge retrieval. Examples demonstrate the format and reasoning, while RAG provides factual grounding.
Dynamic KNN with Feedback: Update the candidate pool and index based on model performance feedback — successful examples get higher retrieval priority, failed examples get demoted or replaced.
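A sketch of the priority update, blending similarity with a hypothetical per-example success-rate table built from past feedback:

```python
def feedback_weighted_rank(candidates, success_rates, alpha=0.5):
    """Re-rank candidates by blending similarity with past success rate.

    `candidates` are (example_id, similarity) pairs; `success_rates` maps
    example_id to the fraction of past uses that produced a validated
    output. Both are hypothetical bookkeeping structures.
    """
    def score(item):
        ex_id, sim = item
        # Unseen examples get a neutral 0.5 prior
        return (1 - alpha) * sim + alpha * success_rates.get(ex_id, 0.5)

    return sorted(candidates, key=score, reverse=True)

candidates = [('ex1', 0.90), ('ex2', 0.85), ('ex3', 0.80)]
success = {'ex1': 0.20, 'ex2': 0.95, 'ex3': 0.90}
ranked = feedback_weighted_rank(candidates, success)
```

Here the most similar example (`ex1`) is demoted because it has historically failed, which is exactly the promotion/demotion behavior described above.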
Cross-Modal KNN: Extend to multimodal settings — retrieve visually similar images, acoustically similar audio clips, or structurally similar code snippets as demonstrations.
Ecosystem and Integration
Tools and Frameworks
LangChain:
- Built-in `SemanticSimilarityExampleSelector` implements KNN Prompting directly
- Integrates with multiple vector stores (FAISS, Chroma, Pinecone, Weaviate)
- `FewShotPromptTemplate` handles prompt construction with selected examples
- Supports custom example selectors for advanced retrieval strategies
LlamaIndex:
- Vector store indices with configurable similarity search
- `SimilarityPostprocessor` for filtering and reranking
- Integration with multiple embedding models and LLMs
DSPy:
- `BootstrapFewShot` optimizer can be combined with KNN-selected training examples
- Programmatic prompt optimization with retrieved demonstrations
- Supports automatic example curation and optimization
FAISS (Facebook AI Similarity Search):
- Industry-standard library for efficient similarity search
- Supports exact and approximate nearest neighbor algorithms
- GPU acceleration for large-scale deployment
- Used by Khandelwal et al. (2020) in the original kNN-LM work
Sentence-Transformers:
- Pre-trained models for generating sentence embeddings
- `all-MiniLM-L6-v2` and `all-mpnet-base-v2` are popular defaults
- Supports fine-tuning on custom data for domain-specific embeddings
Vector Databases (Production):
- Pinecone: Managed vector database with built-in similarity search
- Weaviate: Open-source vector database with hybrid search
- Chroma: Lightweight, developer-friendly vector store
- Milvus: Open-source, production-grade vector database
- Qdrant: High-performance vector similarity search engine
Evaluation Tools:
- BEIR Benchmark: Standardized benchmark for information retrieval evaluation
- MTEB (Massive Text Embedding Benchmark): Compare embedding model quality
- Ragas: Evaluation framework for retrieval-augmented systems
Related Techniques and Combinations
Closely Related Techniques:
KATE (Liu et al., 2022):
- Direct ancestor — introduced kNN-based example selection for ICL
- Uses RoBERTa embeddings with cosine similarity
- Demonstrated that retrieval-based ICL approaches fine-tuning performance
- Foundation for all subsequent KNN Prompting work
EPR (Rubin et al., 2022):
- Supervised extension of KATE with a trained retriever
- Two-stage: BM25 recall → trained scorer for reranking
- 30%+ improvement over random selection
- Higher quality but requires training data for the retriever
UDR (Li et al., 2023):
- Unified multi-task retriever
- Single model serves multiple tasks
- Avoids per-task retriever training
- Better generalization but potentially lower per-task quality
Vote-k (Su et al., 2023):
- Graph-based diverse selection
- Balances diversity with representativeness
- Uses cosine similarity graph + confidence-based ranking
- Better diversity but may sacrifice relevance
CEIL (Ye et al., 2023):
- Models joint probability of entire example set
- Uses conditional DPP for compositional selection
- Captures inter-example relationships
- More complex but accounts for example interactions
kNN-LM (Khandelwal et al., 2020):
- Foundational work: augments LM with nearest neighbor lookup
- Uses cached hidden representations as datastore keys
- Interpolates kNN and LM distributions
- Inspired kNN Prompting but operates at token level rather than example level
Comparison Table:
| Technique | Retrieval | Training Required | Diversity | Scalability | Best For |
| --- | --- | --- | --- | --- | --- |
| Random Few-Shot | None | No | By chance | N/A | Baseline, simple tasks |
| KNN (KATE) | Embedding similarity | No | Low | High | General automated selection |
| KNN + MMR | Similarity + diversity | No | Medium-High | High | Diverse input spaces |
| Vote-k | Graph-based | No | High | Medium | Unlabeled pool selection |
| EPR | Trained retriever | Yes | Medium | Medium | Maximum per-task quality |
| UDR | Multi-task retriever | Yes | Medium | High | Multi-task settings |
| CEIL | Joint probability | Yes | High | Low | Compositional selection |
| kNN Prompting (Xu) | Distribution matching | No | N/A | Very High | Classification, large pools |
Integration Patterns
Task Adaptation:
Classification:
```python
def knn_for_classification(knn, test_input, classes, min_per_class=1):
    """KNN with class balance guarantee"""
    # Retrieve extra candidates
    candidates = knn.retrieve_top_n(test_input, n=knn.k * 3)
    selected = []
    class_counts = {cls: 0 for cls in classes}
    # First pass: fill the per-class minimum from the top candidates
    for candidate in candidates:
        cls = candidate['output']
        if cls in class_counts and class_counts[cls] < min_per_class:
            selected.append(candidate)
            class_counts[cls] += 1
        if sum(class_counts.values()) >= len(classes) * min_per_class:
            break
    # Second pass: fill remaining slots by similarity
    for candidate in candidates:
        if candidate not in selected and len(selected) < knn.k:
            selected.append(candidate)
    return selected
```
Integration with RAG:
```python
def knn_plus_rag(knn_system, rag_system, test_input, task_instruction):
    """Combine KNN for examples + RAG for knowledge"""
    # Retrieve similar examples (KNN)
    examples = knn_system.retrieve(test_input)
    # Retrieve relevant knowledge documents (RAG)
    documents = rag_system.retrieve(test_input)
    # Build the combined prompt
    prompt = task_instruction + "\n\n"
    prompt += "Relevant information:\n"
    for doc in documents:
        prompt += f"- {doc['content']}\n"
    prompt += "\nExamples:\n"
    for ex in examples:
        prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
    prompt += f"Input: {test_input}\nOutput:"
    return prompt
```
Integration with Agents:
```python
class KNNAgent:
    """Agent that uses KNN retrieval for in-context examples"""

    def __init__(self, knn_system, llm):
        self.knn = knn_system
        self.llm = llm

    def execute(self, task, tools=None):
        """Execute task with KNN-retrieved examples"""
        # Retrieve relevant examples
        examples = self.knn.retrieve(task)
        # Build the agent prompt with the examples
        system_prompt = "You are a helpful assistant. "
        system_prompt += "Here are examples of similar tasks:\n\n"
        for ex in examples:
            system_prompt += f"Task: {ex['input']}\nResult: {ex['output']}\n\n"
        # Execute with the LLM
        return self.llm.generate(
            system=system_prompt,
            user=f"Task: {task}",
            tools=tools,
        )
```
Transition Strategies:
From Random Few-Shot to KNN Prompting:
- Measure random few-shot baseline (run 10+ times with different random selections)
- Set up embedding model and index with existing candidate pool
- Compare KNN retrieval vs random on validation set
- If improvement >3%, deploy KNN; if not, investigate embedding model quality
- Monitor in production and iterate
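The baseline comparison in the steps above can be sketched with hypothetical accuracy callbacks standing in for real pipeline runs over a validation set:

```python
import random
from statistics import mean, stdev

def compare_selection(knn_accuracy_fn, random_accuracy_fn, runs=10, seed=0):
    """Compare KNN selection against repeated random-selection baselines.

    Both accuracy callbacks are hypothetical: each is assumed to run the
    full few-shot pipeline over a validation set and return accuracy.
    """
    rng = random.Random(seed)  # fixed seed keeps the comparison reproducible
    random_scores = [random_accuracy_fn(rng) for _ in range(runs)]
    knn_score = knn_accuracy_fn()
    return {
        'random_mean': mean(random_scores),
        'random_stdev': stdev(random_scores),
        'knn': knn_score,
        'improvement': knn_score - mean(random_scores),
    }

# Toy callbacks standing in for real pipeline runs
report = compare_selection(
    knn_accuracy_fn=lambda: 0.82,
    random_accuracy_fn=lambda rng: 0.70 + rng.random() * 0.06,
)
```

Running the random baseline 10+ times also yields its standard deviation, which is worth reporting alongside the mean since sensitivity to example choice is the problem KNN Prompting targets.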
From KNN Prompting to Fine-tuning:
- Use KNN Prompting insights to identify which examples are most valuable
- Collect performance data on which retrieved examples led to best outputs
- Build training dataset from high-performing example-output pairs
- Fine-tune and compare against KNN Prompting
- If fine-tuning clearly superior (>10% improvement), transition
From KNN to Supervised Retriever (EPR/UDR):
- Collect data on which retrieved examples actually helped (label retrieval quality)
- Train supervised retriever on this data
- Compare supervised retriever vs unsupervised KNN on validation set
- Deploy if improvement justifies training cost and complexity
Larger System Integration:
```python
import numpy as np

class ProductionKNNSystem:
    """Production system with KNN Prompting"""

    def __init__(self, embedding_model, llm_client, vector_store,
                 quality_threshold=0.5):
        self.embedding_model = embedding_model
        self.llm = llm_client
        self.store = vector_store
        # Average similarity below this triggers a degradation alert
        self.quality_threshold = quality_threshold
        self.version = 1

    def predict(self, input_data):
        """Production inference"""
        # Embed the input
        embedding = self.embedding_model.encode(input_data)
        # Retrieve examples
        examples = self.store.search(embedding, k=5)
        # Build prompt and generate
        prompt = self.build_prompt(examples, input_data)
        response = self.llm.generate(prompt)
        # Log for monitoring
        self.log_prediction(input_data, examples, response)
        return response

    def update_pool(self, new_examples):
        """Add new examples to the candidate pool"""
        embeddings = self.embedding_model.encode(
            [ex['input'] for ex in new_examples]
        )
        self.store.add(new_examples, embeddings)
        self.version += 1

    def monitor_quality(self, window_hours=24):
        """Monitor retrieval quality over a recent window"""
        recent_logs = self.get_recent_logs(window_hours)
        avg_similarity = np.mean([
            log['max_similarity'] for log in recent_logs
        ])
        if avg_similarity < self.quality_threshold:
            alert(f"Retrieval quality degraded: avg similarity {avg_similarity:.3f}")

    def rollback(self, target_version):
        """Roll back to a previous pool version"""
        self.store.restore(target_version)
        self.version = target_version
```
Versioning and Monitoring:
- Version the candidate pool alongside the code
- Track embedding model version (changing model invalidates all embeddings)
- Monitor: average similarity score, prediction accuracy, latency, cache hit rate
- Set up alerts for similarity score drops (indicates distribution shift)
- Implement A/B testing framework for pool updates
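The similarity-drop alert can be sketched as a simple comparison against a stored baseline; the threshold and toy scores below are illustrative, not recommended values:

```python
from statistics import mean

def check_similarity_drift(baseline_scores, recent_scores, drop_threshold=0.1):
    """Flag drift when recent average similarity falls well below baseline.

    `baseline_scores` come from a healthy reference window; `recent_scores`
    from the monitoring window. Returns (drifted, observed drop).
    """
    baseline_avg = mean(baseline_scores)
    recent_avg = mean(recent_scores)
    drop = baseline_avg - recent_avg
    return drop > drop_threshold, drop

# Toy windows: retrieval similarity has visibly degraded
baseline = [0.82, 0.78, 0.85, 0.80]
recent = [0.61, 0.66, 0.63, 0.64]
drifted, drop = check_similarity_drift(baseline, recent)
```

Wired into the monitoring loop, a `True` result would fire the same alerting path used for accuracy or latency regressions.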
Future Directions
Emerging Innovations
Nearest Neighbor Speculative Decoding (2024): Recent work by Sun et al. (2024) combines kNN retrieval with speculative decoding for faster LLM inference. By predicting likely next tokens from nearest neighbor matches, the system can speculatively decode multiple tokens in parallel, reducing inference latency while maintaining output quality.
bias-kNN (2024): Rather than treating LLM biases as problems to correct, bias-kNN (presented at IEEE ICSC 2024) leverages biased output distributions as primary features for kNN classification. This approach consistently outperforms traditional ICL in few-shot scenarios and exhibits enhanced stability across varied labeled data samples and diverse templates.
kNN-ICL (NAACL 2024): Zhao et al. proposed kNN-ICL, which simplifies prompt engineering by building nearest neighbor inference on top of any ICL design strategy. It provides access to all demonstration examples without context window limitations, significantly improving comprehension of complex requests.
Dynamic Few-Shot Prompting: Production systems are increasingly using dynamic example selection that adapts not just to the input, but to the model's confidence and the conversation context. This moves beyond static KNN retrieval toward adaptive, context-aware demonstration selection.
Learned Retrieval for ICL: IDEAL (ICLR 2024) introduces influence-driven selective annotations that identify optimal data subsets for ICL in an unsupervised, end-to-end manner. DQ-LoRe (ICLR 2024) uses dual queries and low-rank approximation for exemplar selection, achieving significant improvements on reasoning tasks.
Research Frontiers
Open Questions:
- Optimal Similarity Dimensions: What aspects of similarity matter most for ICL? Surface text similarity? Reasoning structure? Output format? Can we learn task-specific similarity functions?
- Joint Example Set Optimization: Current KNN retrieves each example independently. How do we optimize the set jointly, accounting for inter-example relationships (diversity, coverage, complementarity)?
- Adaptive k: Should k vary per query? Easy queries may need fewer examples, hard queries more. Can we predict optimal k dynamically?
- Cross-Lingual Retrieval: Can KNN Prompting work across languages — retrieving examples in one language to serve as demonstrations for another?
- Scaling Laws for Retrieval: How does retrieval quality scale with pool size, embedding dimension, and model capacity? Are there theoretical bounds?
- Retrieval vs Generation of Examples: When is retrieving real examples better than having the model generate synthetic ones (SG-ICL)? Under what conditions does each approach dominate?
- Privacy-Preserving Retrieval: How do we implement KNN Prompting when the candidate pool contains sensitive data that shouldn't be sent to external LLM APIs?
Promising Directions:
Hierarchical Retrieval: Multi-level retrieval that first identifies the relevant domain/task, then retrieves examples within that domain. Reduces search space and improves relevance for multi-domain systems.
Embedding Model Co-Training: Training the embedding model jointly with the downstream task to optimize for ICL relevance rather than general semantic similarity. Early results show significant improvements over generic embeddings.
Real-Time Pool Evolution: Systems that continuously update the candidate pool based on production traffic, user feedback, and model performance. The pool becomes a living dataset that improves over time.
Multimodal KNN Prompting: Extending KNN retrieval to multimodal settings — retrieving image-text pairs, code-documentation pairs, or audio-transcript pairs as demonstrations for multimodal LLMs.
Theoretical Foundations: Developing formal guarantees for KNN Prompting: when does retrieval provably help? What are sample complexity bounds? Under what conditions does KNN selection converge to optimal example sets?