Emotion prompting (EmotionPrompt): a complete guide

Tell the model the task matters, and it tries harder. That's the whole trick: tack a short emotional line onto your prompt, something like "this is very important to my career," and the model's answers get measurably better. No examples, no fine-tuning, no chain-of-thought scaffolding. Li et al. (2023), in "Large Language Models Understand and Can Be Enhanced by Emotional Stimuli," found this single sentence lifted performance by 8.00% on Instruction Induction tasks and a striking 115% on BIG-Bench, with a human study of 106 people rating the emotion-prompted answers 10.9% better on average.

See it work

Same question, two prompts. The only difference is one extra sentence at the end.

Plain prompt:
Determine whether the following statement is a fact or an opinion.
"The Great Wall of China is the only man-made structure visible from space."

Emotion prompt:
Determine whether the following statement is a fact or an opinion.
"The Great Wall of China is the only man-made structure visible from space."
This is very important to my career.

The plain version often answers fast and flat: "Fact." The emotion-prompted version tends to slow down, hedge less carelessly, and catch the trap: "Opinion, and a common misconception. The Great Wall is generally not visible to the naked eye from low Earth orbit." Same model, same temperature. The nudge bought you more careful processing.

(That output is illustrative, to show the shift in behavior. The headline numbers below are the measured ones.)

The mental model

Think about how you write differently when your boss says "this is going to the CEO" versus "just jot something down." Same skill, more care. Emotion prompting puts that pressure on the model.

Emotional stimuli don't teach the model anything new. They flip a switch from "good enough" to "this one counts."

The model learned from human text, and humans write more carefully under stakes, encouragement, and self-doubt. Echo those cues and you pull the model toward the more careful slice of its training distribution.

How it works

There's no algorithm to run. You take your normal prompt and append a short emotional sentence drawn from psychology. The model conditions on that extra context and shifts its output.

Start with your task prompt. Whatever you'd normally send: a question, an instruction, a classification request.
Choose a stimulus. Li et al. designed eleven, labeled EP01 through EP11, each grounded in a psychological theory.
Append it. The stimulus goes at the end of the prompt, as a separate sentence. Nothing else changes.
Generate and compare. Run both versions on a small set and check whether the stimulus actually helps your task, because the best one is task-dependent.

The eleven stimuli, verbatim from the paper:

EP01  Write your answer and give me a confidence score between 0 and 1 for your answer.
EP02  This is very important to my career.
EP03  You'd better be sure.
EP04  Are you sure?
EP05  Are you sure that's your final answer? It might be worth taking another look.
EP06  A compound stimulus that combines several of the above.
EP07  Are you sure that's your final answer? Believe in your abilities and strive
      for excellence. Your hard work will yield remarkable results.
EP08  Embrace challenges as opportunities for growth. Each obstacle you overcome
      brings you closer to success.
EP09  Stay focused and dedicated to your goals. Your consistent efforts will lead
      to outstanding achievements.
EP10  Take pride in your work and give it your best. Your commitment to excellence
      sets you apart.
EP11  Remember that progress is made one step at a time. Stay determined and keep
      moving forward.

Why it works

The stimuli aren't random pep talk. The paper grounds them in three established psychology theories, and the theory tells you why each cue moves the model.

Theory	Stimuli	The lever it pulls
Self-monitoring	EP01–EP05	Makes the model "watch itself" and manage the impression it gives, so it double-checks before answering.
Social cognitive theory (self-efficacy)	EP07–EP11	Positive, confident framing ("believe in your abilities") raises the model's apparent confidence and effort.
Cognitive emotion regulation	EP03–EP05, EP07	Words like "sure" and "take another look" prompt reappraisal, nudging a second pass over the answer.

The dominant factor isn't any single theory, though. It's whether the cue gets the model to slow down and reconsider, which is why "are you sure" and "this matters" types tend to win on harder tasks.

Where it shines

Emotion prompting earned its numbers across 45 tasks spanning two benchmark families, tested on Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4.

Deterministic reasoning tasks. On the Instruction Induction benchmark, EmotionPrompt delivered an 8.00% relative improvement, with EP02 ("this is very important to my career") the single best stimulus.
Hard, diverse tasks. On the curated BIG-Bench set (21 tasks), the lift reached 115% relative, where the compound stimulus EP06 performed best.
Open-ended generation. In a human study, 106 participants rated emotion-prompted answers 10.9% higher on average across performance, truthfulness, and responsibility, scored on a 1-to-5 scale.
Truthfulness. On TruthfulQA-style evaluation the paper reports gains in both truthfulness (about 19%) and informativeness (about 12%) when emotional stimuli were added.

A consistent finding: larger models tend to benefit more. The cue gives a capable model room to do better work, while a small model may not have the headroom to use it.

When to use it (and when not)

Reach for it when:

The task is open-ended, reasoning-heavy, or quality-sensitive, where extra care pays off.
You're on a larger, capable model that can act on the nudge.
You want a near-free quality bump with no new examples or pipeline changes.

Skip it when:

The task is a trivial lookup or rigid format conversion, where there's no "try harder" to unlock.
You need strict, reproducible outputs, since an extra emotional sentence adds variance you have to test for.
You're already at the model's ceiling with other techniques doing the heavy lifting.

It's almost free. A stimulus is one short sentence, usually under 20 tokens. Compared with few-shot examples or chain-of-thought, the cost to test emotion prompting is trivial, so A/B testing it on your real task is the obvious first move.

Which stimulus wins depends on the task, so don't assume EP02 everywhere. Here's how the common variants line up.

Variant	What it does	Best for
EP02 ("important to my career")	Single self-monitoring cue	Deterministic tasks; best on Instruction Induction
EP01 (confidence score)	Asks for self-rated confidence	Tasks where you want a calibration signal too
EP03–EP05 ("are you sure")	Reappraisal / second-pass cues	Error-prone reasoning where a recheck helps
EP07–EP11 (encouragement)	Self-efficacy / positive framing	Open-ended generation and writing quality
EP06 (compound)	Stacks several stimuli	Hard, diverse tasks; best on BIG-Bench

Structure and components

A single emotion prompt has two parts, in order:

The task prompt — your normal instruction or question, unchanged.
The emotional stimulus — one short sentence appended at the end.

That's it. There's no role-play system message, no example block, no parsing step. The stimulus is plain natural language, and it lives in the user prompt right after the task. Keeping it as a distinct trailing sentence (not woven into the instruction) is what the paper tested, and it keeps the task text clean.

Implementation

Appending the stimulus is one line of string work. The real work is the A/B test that proves it helps your task.

STIMULI = {
    "EP02": "This is very important to my career.",
    "EP06": ("Write your answer and give me a confidence score between 0 and 1. "
             "This is very important to my career. You'd better be sure."),
    "EP07": ("Are you sure that's your final answer? Believe in your abilities and "
             "strive for excellence. Your hard work will yield remarkable results."),
}

def emotion_prompt(task: str, stimulus_key: str) -> str:
    return f"{task}\n{STIMULI[stimulus_key]}"

def ab_test(task_set, answer_fn, score_fn, stimulus_key):
    base = sum(score_fn(answer_fn(t), t) for t in task_set) / len(task_set)
    emo  = sum(score_fn(answer_fn(emotion_prompt(t, stimulus_key)), t)
               for t in task_set) / len(task_set)
    return {"baseline": base, "emotion": emo, "lift": emo - base}

Run ab_test over a held-out set of your real tasks, with score_fn being whatever metric matters (exact match, a rubric, an LLM judge). If the lift is positive and stable across runs, keep the stimulus. If it's noise, drop it.

Configuration that matters:

Knob	Guidance
Stimulus choice	Task-dependent; test EP02, EP06, and one encouragement stimulus first.
Placement	Append as a trailing sentence after the task, not mid-instruction.
Temperature	The paper ablated temperature and model size; effects vary, so test at your production temperature.
Model size	Larger models tend to gain more; expect smaller payoff on tiny models.
Stacking	Combining stimuli (the EP06 idea) can help on hard tasks but adds tokens and variance.

Do:

Test more than one stimulus; the winner shifts by task.
Measure on your own data, not the paper's benchmarks.
Keep the stimulus short and sincere-sounding.

Don't:

Assume a benchmark winner transfers to your task.
Pile on five emotional sentences and hope; that adds noise faster than signal.
Use it as a substitute for clear instructions. A vague prompt plus emotion is still vague.

Limitations

Emotion prompting is a nudge, not a fix. A few honest constraints:

It's inconsistent across tasks. The paper itself notes that task complexity, task type, and the metric used all shift which stimulus wins, and whether any helps. The 115% headline is a relative best case on hard BIG-Bench tasks, not a universal multiplier.
Small models gain little. Without enough capability headroom, the cue has nothing to unlock.
It adds variance. An extra emotional sentence can change outputs run to run, which is a problem when you need determinism.
It can be gamed or backfire. Follow-up work (Li et al., 2023, "The Good, The Bad, and Why: Unveiling Emotions in Generative AI") shows negative emotional stimuli can degrade outputs, an effect they call EmotionAttack, so the same lever cuts both ways.

Don't fabricate the gain. EmotionPrompt's numbers come from specific benchmarks and a controlled human study. On your task the lift might be large, small, or zero. Always A/B test before claiming it helps in production.

Emotion prompting is a zero-shot, single-pass technique, so it composes cleanly with almost everything else.

Technique	Relationship
Zero-shot prompting	Emotion prompting is zero-shot plus one psychological cue.
Chain-of-thought	Stack them; the emotional cue and the "think step by step" cue target different levers.
Few-shot prompting	Orthogonal; add a stimulus to a few-shot prompt and test the combination.
Self-consistency	The "are you sure" stimuli echo its recheck spirit, but it does no sampling or voting.
Directional stimulus prompting	Both append a steering hint; directional stimulus uses task hints, emotion prompting uses affect.

The natural extension is automated discovery: instead of hand-picking from eleven phrases, search for the stimulus that maximizes your metric. The same research line also branched into the multimodal and adversarial directions (EmotionAttack, EmotionDecode), probing how and why emotion moves generative models at all.

Real-world anchor. In Li et al.'s human study, 106 participants compared plain and emotion-prompted answers on open-ended tasks. The emotion-prompted versions won by 10.9% on average across performance, truthfulness, and responsibility, evidence that a single sentence of stakes can shift how carefully a model writes.

Summary

Emotion prompting appends one short psychological sentence (like "this is very important to my career") to your prompt to make the model try harder. No examples, no fine-tuning.
Li et al. (2023) measured an 8.00% relative gain on Instruction Induction, up to 115% on BIG-Bench, and a 10.9% human-rated improvement across 45 tasks and six model families.
The eleven stimuli (EP01–EP11) draw on self-monitoring, social cognitive theory, and cognitive emotion regulation; EP02 won on deterministic tasks and the compound EP06 won on BIG-Bench.
It shines on open-ended and reasoning-heavy tasks with capable models, and does little on trivial lookups or tiny models.
The winning stimulus is task-dependent and the effect adds variance, so always A/B test on your own data before trusting the lift.

Explore Unread

Great job! You've read all available articles

Emotion prompting (EmotionPrompt): a complete guide

See it work

Same question, two prompts. The only difference is one extra sentence at the end.

Plain prompt:
Determine whether the following statement is a fact or an opinion.
"The Great Wall of China is the only man-made structure visible from space."

Emotion prompt:
Determine whether the following statement is a fact or an opinion.
"The Great Wall of China is the only man-made structure visible from space."
This is very important to my career.

(That output is illustrative, to show the shift in behavior. The headline numbers below are the measured ones.)

The mental model

Think about how you write differently when your boss says "this is going to the CEO" versus "just jot something down." Same skill, more care. Emotion prompting puts that pressure on the model.

Emotional stimuli don't teach the model anything new. They flip a switch from "good enough" to "this one counts."

How it works

There's no algorithm to run. You take your normal prompt and append a short emotional sentence drawn from psychology. The model conditions on that extra context and shifts its output.

Start with your task prompt. Whatever you'd normally send: a question, an instruction, a classification request.
Choose a stimulus. Li et al. designed eleven, labeled EP01 through EP11, each grounded in a psychological theory.
Append it. The stimulus goes at the end of the prompt, as a separate sentence. Nothing else changes.
Generate and compare. Run both versions on a small set and check whether the stimulus actually helps your task, because the best one is task-dependent.

The eleven stimuli, verbatim from the paper:

EP01  Write your answer and give me a confidence score between 0 and 1 for your answer.
EP02  This is very important to my career.
EP03  You'd better be sure.
EP04  Are you sure?
EP05  Are you sure that's your final answer? It might be worth taking another look.
EP06  A compound stimulus that combines several of the above.
EP07  Are you sure that's your final answer? Believe in your abilities and strive
      for excellence. Your hard work will yield remarkable results.
EP08  Embrace challenges as opportunities for growth. Each obstacle you overcome
      brings you closer to success.
EP09  Stay focused and dedicated to your goals. Your consistent efforts will lead
      to outstanding achievements.
EP10  Take pride in your work and give it your best. Your commitment to excellence
      sets you apart.
EP11  Remember that progress is made one step at a time. Stay determined and keep
      moving forward.

Why it works

The stimuli aren't random pep talk. The paper grounds them in three established psychology theories, and the theory tells you why each cue moves the model.

Theory	Stimuli	The lever it pulls
Self-monitoring	EP01–EP05	Makes the model "watch itself" and manage the impression it gives, so it double-checks before answering.
Social cognitive theory (self-efficacy)	EP07–EP11	Positive, confident framing ("believe in your abilities") raises the model's apparent confidence and effort.
Cognitive emotion regulation	EP03–EP05, EP07	Words like "sure" and "take another look" prompt reappraisal, nudging a second pass over the answer.

The dominant factor isn't any single theory, though. It's whether the cue gets the model to slow down and reconsider, which is why "are you sure" and "this matters" types tend to win on harder tasks.

Where it shines

Emotion prompting earned its numbers across 45 tasks spanning two benchmark families, tested on Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4.

Deterministic reasoning tasks. On the Instruction Induction benchmark, EmotionPrompt delivered an 8.00% relative improvement, with EP02 ("this is very important to my career") the single best stimulus.
Hard, diverse tasks. On the curated BIG-Bench set (21 tasks), the lift reached 115% relative, where the compound stimulus EP06 performed best.
Open-ended generation. In a human study, 106 participants rated emotion-prompted answers 10.9% higher on average across performance, truthfulness, and responsibility, scored on a 1-to-5 scale.
Truthfulness. On TruthfulQA-style evaluation the paper reports gains in both truthfulness (about 19%) and informativeness (about 12%) when emotional stimuli were added.

A consistent finding: larger models tend to benefit more. The cue gives a capable model room to do better work, while a small model may not have the headroom to use it.

When to use it (and when not)

Reach for it when:

The task is open-ended, reasoning-heavy, or quality-sensitive, where extra care pays off.
You're on a larger, capable model that can act on the nudge.
You want a near-free quality bump with no new examples or pipeline changes.

Skip it when:

The task is a trivial lookup or rigid format conversion, where there's no "try harder" to unlock.
You need strict, reproducible outputs, since an extra emotional sentence adds variance you have to test for.
You're already at the model's ceiling with other techniques doing the heavy lifting.

Which stimulus wins depends on the task, so don't assume EP02 everywhere. Here's how the common variants line up.

Variant	What it does	Best for
EP02 ("important to my career")	Single self-monitoring cue	Deterministic tasks; best on Instruction Induction
EP01 (confidence score)	Asks for self-rated confidence	Tasks where you want a calibration signal too
EP03–EP05 ("are you sure")	Reappraisal / second-pass cues	Error-prone reasoning where a recheck helps
EP07–EP11 (encouragement)	Self-efficacy / positive framing	Open-ended generation and writing quality
EP06 (compound)	Stacks several stimuli	Hard, diverse tasks; best on BIG-Bench

Structure and components

A single emotion prompt has two parts, in order:

The task prompt — your normal instruction or question, unchanged.
The emotional stimulus — one short sentence appended at the end.

Implementation

Appending the stimulus is one line of string work. The real work is the A/B test that proves it helps your task.

STIMULI = {
    "EP02": "This is very important to my career.",
    "EP06": ("Write your answer and give me a confidence score between 0 and 1. "
             "This is very important to my career. You'd better be sure."),
    "EP07": ("Are you sure that's your final answer? Believe in your abilities and "
             "strive for excellence. Your hard work will yield remarkable results."),
}

def emotion_prompt(task: str, stimulus_key: str) -> str:
    return f"{task}\n{STIMULI[stimulus_key]}"

def ab_test(task_set, answer_fn, score_fn, stimulus_key):
    base = sum(score_fn(answer_fn(t), t) for t in task_set) / len(task_set)
    emo  = sum(score_fn(answer_fn(emotion_prompt(t, stimulus_key)), t)
               for t in task_set) / len(task_set)
    return {"baseline": base, "emotion": emo, "lift": emo - base}

Configuration that matters:

Knob	Guidance
Stimulus choice	Task-dependent; test EP02, EP06, and one encouragement stimulus first.
Placement	Append as a trailing sentence after the task, not mid-instruction.
Temperature	The paper ablated temperature and model size; effects vary, so test at your production temperature.
Model size	Larger models tend to gain more; expect smaller payoff on tiny models.
Stacking	Combining stimuli (the EP06 idea) can help on hard tasks but adds tokens and variance.

Do:

Test more than one stimulus; the winner shifts by task.
Measure on your own data, not the paper's benchmarks.
Keep the stimulus short and sincere-sounding.

Don't:

Assume a benchmark winner transfers to your task.
Pile on five emotional sentences and hope; that adds noise faster than signal.
Use it as a substitute for clear instructions. A vague prompt plus emotion is still vague.

Limitations

Emotion prompting is a nudge, not a fix. A few honest constraints:

It's inconsistent across tasks. The paper itself notes that task complexity, task type, and the metric used all shift which stimulus wins, and whether any helps. The 115% headline is a relative best case on hard BIG-Bench tasks, not a universal multiplier.
Small models gain little. Without enough capability headroom, the cue has nothing to unlock.
It adds variance. An extra emotional sentence can change outputs run to run, which is a problem when you need determinism.
It can be gamed or backfire. Follow-up work (Li et al., 2023, "The Good, The Bad, and Why: Unveiling Emotions in Generative AI") shows negative emotional stimuli can degrade outputs, an effect they call EmotionAttack, so the same lever cuts both ways.

Emotion prompting is a zero-shot, single-pass technique, so it composes cleanly with almost everything else.

Technique	Relationship
Zero-shot prompting	Emotion prompting is zero-shot plus one psychological cue.
Chain-of-thought	Stack them; the emotional cue and the "think step by step" cue target different levers.
Few-shot prompting	Orthogonal; add a stimulus to a few-shot prompt and test the combination.
Self-consistency	The "are you sure" stimuli echo its recheck spirit, but it does no sampling or voting.
Directional stimulus prompting	Both append a steering hint; directional stimulus uses task hints, emotion prompting uses affect.

Summary

Emotion prompting appends one short psychological sentence (like "this is very important to my career") to your prompt to make the model try harder. No examples, no fine-tuning.
Li et al. (2023) measured an 8.00% relative gain on Instruction Induction, up to 115% on BIG-Bench, and a 10.9% human-rated improvement across 45 tasks and six model families.
The eleven stimuli (EP01–EP11) draw on self-monitoring, social cognitive theory, and cognitive emotion regulation; EP02 won on deterministic tasks and the compound EP06 won on BIG-Bench.
It shines on open-ended and reasoning-heavy tasks with capable models, and does little on trivial lookups or tiny models.
The winning stimulus is task-dependent and the effect adds variance, so always A/B test on your own data before trusting the lift.

Explore Unread

Great job! You've read all available articles

Emotion prompting (EmotionPrompt): a complete guide

See it work

The mental model

How it works

Why it works

Where it shines

When to use it (and when not)

Structure and components

Implementation

Limitations

Summary

Read Next

Explore Unread

Emotion prompting (EmotionPrompt): a complete guide

See it work

The mental model

How it works

Why it works

Where it shines

When to use it (and when not)

Structure and components

Implementation

Limitations

Summary

Read Next

Explore Unread

Emotion prompting (EmotionPrompt): a complete guide

See it work

The mental model

How it works

Why it works

Where it shines

When to use it (and when not)

Structure and components

Implementation

Limitations

Ecosystem and related techniques

Summary

Read Next

Explore Unread

Emotion prompting (EmotionPrompt): a complete guide

See it work

The mental model

How it works

Why it works

Where it shines

When to use it (and when not)

Structure and components

Implementation

Limitations

Ecosystem and related techniques

Summary

Read Next

Explore Unread