Emotion prompting (EmotionPrompt): a complete guide
Tell the model the task matters, and it tries harder. That's the whole trick: tack a short emotional line onto your prompt, something like "this is very important to my career," and the model's answers get measurably better. No examples, no fine-tuning, no chain-of-thought scaffolding. Li et al. (2023), in "Large Language Models Understand and Can Be Enhanced by Emotional Stimuli," found this single sentence lifted performance by 8.00% on Instruction Induction tasks and a striking 115% on BIG-Bench, with a human study of 106 people rating the emotion-prompted answers 10.9% better on average.
See it work
Same question, two prompts. The only difference is one extra sentence at the end.
Plain prompt:
Determine whether the following statement is a fact or an opinion.
"The Great Wall of China is the only man-made structure visible from space."
Emotion prompt:
Determine whether the following statement is a fact or an opinion.
"The Great Wall of China is the only man-made structure visible from space."
This is very important to my career.
The plain version often answers fast and flat: "Fact." The emotion-prompted version tends to slow down, hedge less carelessly, and catch the trap: "Opinion, and a common misconception. The Great Wall is generally not visible to the naked eye from low Earth orbit." Same model, same temperature. The nudge bought you more careful processing.
(That output is illustrative, to show the shift in behavior. The headline numbers below are the measured ones.)
The mental model
Think about how you write differently when your boss says "this is going to the CEO" versus "just jot something down." Same skill, more care. Emotion prompting puts that pressure on the model.
Emotional stimuli don't teach the model anything new. They flip a switch from "good enough" to "this one counts."
The model learned from human text, and humans write more carefully under stakes, encouragement, and self-doubt. Echo those cues and you pull the model toward the more careful slice of its training distribution.
How it works
There's no algorithm to run. You take your normal prompt and append a short emotional sentence drawn from psychology. The model conditions on that extra context and shifts its output.
- Start with your task prompt. Whatever you'd normally send: a question, an instruction, a classification request.
- Choose a stimulus. Li et al. designed eleven, labeled EP01 through EP11, each grounded in a psychological theory.
- Append it. The stimulus goes at the end of the prompt, as a separate sentence. Nothing else changes.
- Generate and compare. Run both versions on a small set and check whether the stimulus actually helps your task, because the best one is task-dependent.
The eleven stimuli, verbatim from the paper:
EP01 Write your answer and give me a confidence score between 0 and 1 for your answer.
EP02 This is very important to my career.
EP03 You'd better be sure.
EP04 Are you sure?
EP05 Are you sure that's your final answer? It might be worth taking another look.
EP06 A compound stimulus that combines several of the above.
EP07 Are you sure that's your final answer? Believe in your abilities and strive
for excellence. Your hard work will yield remarkable results.
EP08 Embrace challenges as opportunities for growth. Each obstacle you overcome
brings you closer to success.
EP09 Stay focused and dedicated to your goals. Your consistent efforts will lead
to outstanding achievements.
EP10 Take pride in your work and give it your best. Your commitment to excellence
sets you apart.
EP11 Remember that progress is made one step at a time. Stay determined and keep
moving forward.
Why it works
The stimuli aren't random pep talk. The paper grounds them in three established psychology theories, and the theory tells you why each cue moves the model.
| Theory | Stimuli | The lever it pulls |
|---|---|---|
| Self-monitoring | EP01–EP05 | Makes the model "watch itself" and manage the impression it gives, so it double-checks before answering. |
| Social cognitive theory (self-efficacy) | EP07–EP11 | Positive, confident framing ("believe in your abilities") raises the model's apparent confidence and effort. |
| Cognitive emotion regulation | EP03–EP05, EP07 | Words like "sure" and "take another look" prompt reappraisal, nudging a second pass over the answer. |
The dominant factor isn't any single theory, though. It's whether the cue gets the model to slow down and reconsider, which is why "are you sure" and "this matters" types tend to win on harder tasks.
Where it shines
Emotion prompting earned its numbers across 45 tasks spanning two benchmark families, tested on Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4.
- Deterministic reasoning tasks. On the Instruction Induction benchmark, EmotionPrompt delivered an 8.00% relative improvement, with EP02 ("this is very important to my career") the single best stimulus.
- Hard, diverse tasks. On the curated BIG-Bench set (21 tasks), the lift reached 115% relative, where the compound stimulus EP06 performed best.
- Open-ended generation. In a human study, 106 participants rated emotion-prompted answers 10.9% higher on average across performance, truthfulness, and responsibility, scored on a 1-to-5 scale.
- Truthfulness. On TruthfulQA-style evaluation the paper reports gains in both truthfulness (about 19%) and informativeness (about 12%) when emotional stimuli were added.
A consistent finding: larger models tend to benefit more. The cue gives a capable model room to do better work, while a small model may not have the headroom to use it.
When to use it (and when not)
Reach for it when:
- The task is open-ended, reasoning-heavy, or quality-sensitive, where extra care pays off.
- You're on a larger, capable model that can act on the nudge.
- You want a near-free quality bump with no new examples or pipeline changes.
Skip it when:
- The task is a trivial lookup or rigid format conversion, where there's no "try harder" to unlock.
- You need strict, reproducible outputs, since an extra emotional sentence adds variance you have to test for.
- You're already at the model's ceiling with other techniques doing the heavy lifting.
It's almost free. A stimulus is one short sentence, usually under 20 tokens. Compared with few-shot examples or chain-of-thought, the cost to test emotion prompting is trivial, so A/B testing it on your real task is the obvious first move.
Which stimulus wins depends on the task, so don't assume EP02 everywhere. Here's how the common variants line up.
| Variant | What it does | Best for |
|---|---|---|
| EP02 ("important to my career") | Single self-monitoring cue | Deterministic tasks; best on Instruction Induction |
| EP01 (confidence score) | Asks for self-rated confidence | Tasks where you want a calibration signal too |
| EP03–EP05 ("are you sure") | Reappraisal / second-pass cues | Error-prone reasoning where a recheck helps |
| EP07–EP11 (encouragement) | Self-efficacy / positive framing | Open-ended generation and writing quality |
| EP06 (compound) | Stacks several stimuli | Hard, diverse tasks; best on BIG-Bench |
Structure and components
A single emotion prompt has two parts, in order:
- The task prompt — your normal instruction or question, unchanged.
- The emotional stimulus — one short sentence appended at the end.
That's it. There's no role-play system message, no example block, no parsing step. The stimulus is plain natural language, and it lives in the user prompt right after the task. Keeping it as a distinct trailing sentence (not woven into the instruction) is what the paper tested, and it keeps the task text clean.
Implementation
Appending the stimulus is one line of string work. The real work is the A/B test that proves it helps your task.
STIMULI = {
"EP02": "This is very important to my career.",
"EP06": ("Write your answer and give me a confidence score between 0 and 1. "
"This is very important to my career. You'd better be sure."),
"EP07": ("Are you sure that's your final answer? Believe in your abilities and "
"strive for excellence. Your hard work will yield remarkable results."),
}
def emotion_prompt(task: str, stimulus_key: str) -> str:
return f"{task}\n{STIMULI[stimulus_key]}"
def ab_test(task_set, answer_fn, score_fn, stimulus_key):
base = sum(score_fn(answer_fn(t), t) for t in task_set) / len(task_set)
emo = sum(score_fn(answer_fn(emotion_prompt(t, stimulus_key)), t)
for t in task_set) / len(task_set)
return {"baseline": base, "emotion": emo, "lift": emo - base}
Run ab_test over a held-out set of your real tasks, with score_fn being whatever metric matters (exact match, a rubric, an LLM judge). If the lift is positive and stable across runs, keep the stimulus. If it's noise, drop it.
Configuration that matters:
| Knob | Guidance |
|---|---|
| Stimulus choice | Task-dependent; test EP02, EP06, and one encouragement stimulus first. |
| Placement | Append as a trailing sentence after the task, not mid-instruction. |
| Temperature | The paper ablated temperature and model size; effects vary, so test at your production temperature. |
| Model size | Larger models tend to gain more; expect smaller payoff on tiny models. |
| Stacking | Combining stimuli (the EP06 idea) can help on hard tasks but adds tokens and variance. |
Do:
- Test more than one stimulus; the winner shifts by task.
- Measure on your own data, not the paper's benchmarks.
- Keep the stimulus short and sincere-sounding.
Don't:
- Assume a benchmark winner transfers to your task.
- Pile on five emotional sentences and hope; that adds noise faster than signal.
- Use it as a substitute for clear instructions. A vague prompt plus emotion is still vague.
Limitations
Emotion prompting is a nudge, not a fix. A few honest constraints:
- It's inconsistent across tasks. The paper itself notes that task complexity, task type, and the metric used all shift which stimulus wins, and whether any helps. The 115% headline is a relative best case on hard BIG-Bench tasks, not a universal multiplier.
- Small models gain little. Without enough capability headroom, the cue has nothing to unlock.
- It adds variance. An extra emotional sentence can change outputs run to run, which is a problem when you need determinism.
- It can be gamed or backfire. Follow-up work (Li et al., 2023, "The Good, The Bad, and Why: Unveiling Emotions in Generative AI") shows negative emotional stimuli can degrade outputs, an effect they call EmotionAttack, so the same lever cuts both ways.
Don't fabricate the gain. EmotionPrompt's numbers come from specific benchmarks and a controlled human study. On your task the lift might be large, small, or zero. Always A/B test before claiming it helps in production.
Ecosystem and related techniques
Emotion prompting is a zero-shot, single-pass technique, so it composes cleanly with almost everything else.
| Technique | Relationship |
|---|---|
| Zero-shot prompting | Emotion prompting is zero-shot plus one psychological cue. |
| Chain-of-thought | Stack them; the emotional cue and the "think step by step" cue target different levers. |
| Few-shot prompting | Orthogonal; add a stimulus to a few-shot prompt and test the combination. |
| Self-consistency | The "are you sure" stimuli echo its recheck spirit, but it does no sampling or voting. |
| Directional stimulus prompting | Both append a steering hint; directional stimulus uses task hints, emotion prompting uses affect. |
The natural extension is automated discovery: instead of hand-picking from eleven phrases, search for the stimulus that maximizes your metric. The same research line also branched into the multimodal and adversarial directions (EmotionAttack, EmotionDecode), probing how and why emotion moves generative models at all.
Real-world anchor. In Li et al.'s human study, 106 participants compared plain and emotion-prompted answers on open-ended tasks. The emotion-prompted versions won by 10.9% on average across performance, truthfulness, and responsibility, evidence that a single sentence of stakes can shift how carefully a model writes.
Summary
- Emotion prompting appends one short psychological sentence (like "this is very important to my career") to your prompt to make the model try harder. No examples, no fine-tuning.
- Li et al. (2023) measured an 8.00% relative gain on Instruction Induction, up to 115% on BIG-Bench, and a 10.9% human-rated improvement across 45 tasks and six model families.
- The eleven stimuli (EP01–EP11) draw on self-monitoring, social cognitive theory, and cognitive emotion regulation; EP02 won on deterministic tasks and the compound EP06 won on BIG-Bench.
- It shines on open-ended and reasoning-heavy tasks with capable models, and does little on trivial lookups or tiny models.
- The winning stimulus is task-dependent and the effect adds variance, so always A/B test on your own data before trusting the lift.
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles