SimToM: perspective-taking for theory-of-mind reasoning

Ask a language model where Sally will look for her marble, and it tends to tell you where the marble actually is — not where Sally thinks it is. The model sees the whole story, so it answers from an omniscient view it can't switch off. SimToM (Simulated Theory of Mind) fixes this with one structural move: before answering, it builds a filtered version of the story containing only what the target character witnessed, then answers from inside that smaller world. On the ToMi benchmark, this took GPT-4 from 25.5% to 87.75% on false-belief questions, and lifted GPT-3.5 from 41.0% to 70.5% on BigToM (Wilf, Lee, Liang, and Morency, "Think Twice," arXiv:2311.10227, ACL 2024).

See it work

Here's the classic false-belief setup. Sally hides a marble, leaves, and Anne moves it while she's gone.

Story: Sally puts a marble in a basket. Sally leaves the room. Anne takes the
marble from the basket and puts it in a box. Sally comes back into the room.

Question: Where will Sally look for the marble?

--- Zero-shot GPT-3.5 ---
Sally will look in the box. The marble was moved to the box by Anne.   ✗

--- SimToM ---
Stage 1 (what does Sally know?):
  - Sally puts a marble in a basket.
  - Sally comes back into the room.
  (The marble move is excluded — Sally was absent for it.)

Stage 2 (answer as Sally, from those events only):
  I will look in the basket. That's where I put the marble before I left,
  and I have no reason to believe it moved.   ✓

Zero-shot gives the right answer to the wrong question — where the marble is, not where Sally will look. SimToM never shows Stage 2 the marble-move event, so the wrong answer simply isn't available to grab.

The mental model

Think of a stage actor preparing a scene. They don't play their character with the script's full knowledge of how the plot ends — they deliberately forget everything the character hasn't lived through yet, then react only to what's in front of them. SimToM hands the model that smaller script.

Don't ask the model to suppress what it knows. Take away the knowledge it should never have had in the first place.

This is why SimToM is different from "think step by step." Chain-of-thought changes how carefully the model reasons over the story; SimToM changes what story the model reasons over. For false-belief tasks, the second is the intervention that matters.

How it works

SimToM splits one inference into two sequential API calls. The first decides what the character knows; the second answers as that character.

Pick the target. Find which character the question is about (usually named in the question), and the character names and locations in the story.
Stage 1 — perspective-taking. Give the model the full story plus an explicit knowledge rule: a character knows events they witness; once they leave a location, they stop learning what happens there until they return. The model returns only the events that character saw.
Stage 2 — question-answering. Give the model the filtered events, not the original story, with first-person framing ("You are Sally. Based on the above…") and the question. It answers from the character's limited view.

The two calls must be physically separate. Merging them (the SimToM-Single ablation) drops accuracy by 19–27 percentage points across models, because a single forward pass attends to the full story regardless of which sub-task it's resolving — the omniscient narrative bleeds into the restricted answer.
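
To make the separation concrete, here is a minimal provider-agnostic sketch. The names `run_simtom`, `llm`, and `fake_llm` are illustrative and not taken from the paper's code; the fake model only stands in for a real API so the example runs offline and lets you inspect what each call actually received.

```python
from typing import Callable, List

RULE = (
    "A character knows about all events they directly witness.\n"
    "If they leave a location, they no longer know events there until they return."
)

def run_simtom(story: str, character: str, question: str,
               llm: Callable[[str], str]) -> str:
    """Two separate calls: Stage 2 receives only Stage 1's output, never the story."""
    stage1_prompt = (
        f"The following is a sequence of events:\n\n{story}\n\n"
        f"Which events does {character} know about?\n{RULE}\n\n"
        f"List only the events {character} knows about, one per line."
    )
    perspective = llm(stage1_prompt).strip()

    # The original story is deliberately absent from this second prompt.
    stage2_prompt = (
        f"{perspective}\n\n"
        f"You are {character}. Based on the above information, answer:\n\n"
        f"{question}\n\nAnswer:"
    )
    return llm(stage2_prompt).strip()

# Hypothetical stand-in for a real model: records every prompt it receives.
seen_prompts: List[str] = []

def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    if prompt.startswith("The following is a sequence of events"):
        return "Sally puts a marble in a basket.\nSally comes back into the room."
    return "I will look in the basket."

story = (
    "Sally puts a marble in a basket. Sally leaves the room. "
    "Anne takes the marble from the basket and puts it in a box. "
    "Sally comes back into the room."
)
answer = run_simtom(story, "Sally", "Where will Sally look for the marble?", fake_llm)
print(answer)
print("box" in seen_prompts[1])  # whether the second call ever saw the box
```

Swapping `fake_llm` for a real client wrapper changes nothing structurally: the word "box" reaches the second call only if Stage 1 lets it through.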

Why it works

The deep claim: false-belief failure is an information-state failure, not a reasoning failure. The model reasons fine; it just reasons from the wrong viewpoint. At the architecture level, self-attention is epistemically neutral — it weights tokens by relevance, with no mechanism to mask events a referenced character never saw. Both "basket" and "box" sit in context competing for the answer. SimToM removes "box" from Stage 2's context entirely, so the competition disappears. Ranked by how much they drive results:

Factor	Why it dominates
Stage 1 accuracy (~60% of outcome variance)	A wrong filter feeds Stage 2 bad input; correct reasoning then gives a wrong answer. The Oracle gap measures this directly.
Model capability	Stronger models build better filters in Stage 1 and reason better in Stage 2.
Clarity of the knowledge rule	Spelling out the leave/return rule sharply improves Stage 1 completions.
Story complexity	More characters and location changes degrade Stage 1, and the error cascades.

The ceiling is striking: SimToM-Oracle — feeding Stage 2 human-annotated correct perspectives — hits ~96% on both benchmarks' false-belief questions. So Stage 2 is nearly solved; almost all remaining error lives in Stage 1. On ToMi with GPT-3.5, SimToM scores 81.0% vs Oracle's ~96%, a ~15 pp gap that better perspective extraction would recover.

Where it shines

SimToM helps exactly when a question targets one character's restricted viewpoint in a narrative with information asymmetry. The headline numbers, all on false-belief subsets:

BigToM false-belief (% accuracy):

Model	0-shot	0-shot CoT	SimToM	Gain vs 0-shot
Llama2-7b-chat	47.5	31.5	70.5	+23.0 pp
Llama2-13b-chat	41.25	52.25	61.75	+20.5 pp
GPT-3.5-Turbo	41.0	56.25	70.5	+29.5 pp
GPT-4	89.0	93.25	92.0	+3.0 pp (−1.25 vs CoT)

ToMi false-belief (% accuracy):

Model	0-shot	0-shot CoT	SimToM	Gain vs 0-shot
Llama2-7b-chat	28.25	24.0	40.0	+11.75 pp
Llama2-13b-chat	39.25	16.5	35.5	−3.75 pp
GPT-3.5-Turbo	67.25	34.0	81.0	+13.75 pp
GPT-4	25.5	74.25	87.75	+62.25 pp

A few things worth staring at. GPT-3.5 is the largest clean win on BigToM (+29.5 pp over zero-shot, +14.25 pp over CoT), and on ToMi CoT actually hurt GPT-3.5, dropping it from 67.25% zero-shot to 34.0%. GPT-4's 0-shot ToMi score is a dismal 25.5%, below chance for a binary task, meaning it confidently applies a wrong heuristic; SimToM lifts it to 87.75%. Two exceptions: GPT-4 on BigToM regresses 1.25 pp vs CoT (BigToM was generated by GPT-4, which may flatter the baseline), and Llama-2-13b on ToMi loses 3.75 pp vs zero-shot, a reminder that a bad Stage 1 filter can give Stage 2 worse input than no filter at all.
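
The gain columns are plain differences, so they are easy to recheck. A small sketch that recomputes them from the accuracies in the two tables above; the dictionary layout and the `gains` helper are just for illustration.

```python
from typing import Dict, Tuple

# Accuracies (%) copied from the tables above: (0-shot, 0-shot CoT, SimToM).
RESULTS: Dict[str, Dict[str, Tuple[float, float, float]]] = {
    "BigToM": {
        "Llama2-7b-chat": (47.5, 31.5, 70.5),
        "Llama2-13b-chat": (41.25, 52.25, 61.75),
        "GPT-3.5-Turbo": (41.0, 56.25, 70.5),
        "GPT-4": (89.0, 93.25, 92.0),
    },
    "ToMi": {
        "Llama2-7b-chat": (28.25, 24.0, 40.0),
        "Llama2-13b-chat": (39.25, 16.5, 35.5),
        "GPT-3.5-Turbo": (67.25, 34.0, 81.0),
        "GPT-4": (25.5, 74.25, 87.75),
    },
}

def gains(benchmark: str, model: str) -> Tuple[float, float]:
    """Return (SimToM minus 0-shot, SimToM minus CoT) in percentage points."""
    zero_shot, cot, simtom = RESULTS[benchmark][model]
    return simtom - zero_shot, simtom - cot

for benchmark, rows in RESULTS.items():
    for model in rows:
        vs_zero, vs_cot = gains(benchmark, model)
        print(f"{benchmark:7s} {model:16s} "
              f"vs 0-shot {vs_zero:+6.2f} pp   vs CoT {vs_cot:+6.2f} pp")
```

Printing both differences side by side makes it obvious which rows beat zero-shot but not CoT, and which beat neither.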

Beyond benchmarks, the same "filter to one party's knowledge" move maps onto real domains: tracking what a patient has been told vs. what the chart says, what a party to a contract was formally notified of, what a student has been taught before assigning a problem, what a support customer was told by a prior agent, what a teammate knew before a PR they missed shipped a breaking change, or what another agent can see in an incomplete-information game like Hanabi. OpenToM (Xu et al., 2024) confirms the scope precisely: SimToM-style perspective-taking improves physical-world mental states (object locations, event awareness) but not psychological-world ones (desires, emotions).

When to use it (and when not)

Reach for SimToM when:

The question asks "What does X think/know/believe?", "Where will X look?", or "Why did X do Y?" — where X is not the omniscient narrator.
At least two agents have different information access, and the gap comes from who witnessed which event.
The story marks clearly who is present for what (event-list or temporal-marker prose).

Skip it when:

The question is about the actual ground-truth world state — filtering can only hurt.
The belief must be inferred from behavioral or emotional cues, not event-witnessing (desires, emotions).
There's one character, or their knowledge equals the full story.
The story is so short that CoT already suffices and two calls aren't worth it.

Cost is the trade-off, not setup. SimToM is two API calls instead of one — roughly 1.4–1.8× the token cost and about 2× the latency of single-pass inference. Adding a one-shot example (SimToM-Domain) pushes that to ~1.8–2×. For false-belief tasks the accuracy gains justify it; for trivial first-order cases with clear narratives, plain CoT is cheaper.
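
For a back-of-envelope estimate on your own workload, the token overhead can be sketched as below. This is a rough model under assumed token counts, not a measurement: the function name, every parameter, and the example numbers are hypothetical, and it ignores chat-template and system-prompt overhead.

```python
def simtom_token_ratio(story_tokens: int, rule_tokens: int, question_tokens: int,
                       answer_tokens: int, filter_ratio: float) -> float:
    """Total SimToM tokens (both calls, input plus output) over single-pass tokens."""
    perspective_tokens = filter_ratio * story_tokens
    single_pass = story_tokens + question_tokens + answer_tokens
    # Stage 1 reads the story and the rule, and writes the filtered perspective.
    stage1 = story_tokens + rule_tokens + perspective_tokens
    # Stage 2 reads the perspective and the question, and writes the answer.
    stage2 = perspective_tokens + question_tokens + answer_tokens
    return (stage1 + stage2) / single_pass

ratio = simtom_token_ratio(story_tokens=400, rule_tokens=50, question_tokens=20,
                           answer_tokens=30, filter_ratio=0.3)
print(f"{ratio:.2f}x")
```

With these assumed counts the ratio comes out at roughly 1.64×. A shorter filtered perspective pushes it down; a longer rule or a few-shot example pushes it up.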

Model fit. Below 7B parameters SimToM is unvalidated and likely hurts. Llama-2-7b is the floor (filtering is rough but still helps); Llama-2-13b+ and GPT-3.5 are the middle tier, usually reliable, though Llama-2-13b regressed on ToMi (−3.75 pp); GPT-4 and Claude Opus/Sonnet give near-Oracle Stage 1. For data-governed deployments, a local Llama-3-70B reaches roughly 70–75% on ToMi vs GPT-4's 87.75%, a fair trade for keeping text on-premises.

Escalate when: you need second-order or higher beliefs (move to Decompose-ToM or recursive SimToM); Stage 1 stays poor despite prompt tuning (fine-tune a Stage 1 extractor, or human-in-the-loop it à la SimToM-Oracle); latency is a hard ceiling (fall back to single-stage CoT, accepting accuracy loss).

Variant	Best for
Zero-shot SimToM	First deployment, novel domains without examples
SimToM-Domain (1-shot)	Production first-order ToM — ~20 pp boost at modest cost
Recursive SimToM	Second-order belief tasks
SimToM + CoT in Stage 2	When belief-to-action inference is complex after filtering
SimToM-Oracle	Research baselines; high-stakes work with human review of Stage 1

Structure and components

Four required pieces, two per stage. Stage 1 needs the story block (the full narrative), the knowledge rule (presence-based witnessing, stated explicitly — without it, filters come out wrong), and the target character named exactly as in the story. Stage 2 needs the filtered perspective from Stage 1 (which replaces the story) and first-person grounding ("You are X…"), then the question. Optional add-ons: a domain-specific few-shot example per stage (SimToM-Domain) and an output-format constraint on Stage 1 (a clean numbered list, no commentary).

The first-person framing isn't cosmetic. It's a direct operationalization of Gordon's "ascent routine" from Simulation Theory — make assertions about the world as the character would — and it activates the model's first-person narration patterns, which align better with perspective-constrained answering than third-person ones.

Here's the canonical two-stage template:

=== Stage 1: perspective-taking ===
The following is a sequence of events:

{story}

Which events does {character} know about?
A character knows about all events they directly witness.
If a character is in a location, they know all events that happen there.
If they leave, they no longer know events there until they return.

List only the events {character} knows about, one per line.

=== Stage 2: question-answering ===
{stage_1_output}

You are {character}. Based on the above information, answer:

{question}

Answer:

What SimToM does and doesn't cover. It natively handles factual, true, and displaced beliefs (a once-correct belief gone stale). It does not handle desires, emotional states, probabilistic/graded beliefs, counterfactuals ("what if she had stayed?"), or implicit-inference knowledge. Second-order beliefs ("A thinks B believes X") need recursive application. And the base presence rule mishandles testimony: if A tells B about an event B missed, B was present for the telling but not for the event itself, so a filter that tracks presence alone still treats the event as unknown to B. An extended rule must treat communication as a second knowledge channel (and, for deception, model whether B trusts A).
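
The testimony gap is easiest to see in a symbolic toy version of the filter. The sketch below is not how SimToM runs Stage 1 (that is an LLM call); it assumes presence has already been annotated per event, and the `Event` fields `told_to` and `reports` are hypothetical names for the second knowledge channel. Trust and deception are not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set

@dataclass(frozen=True)
class Event:
    text: str
    present: FrozenSet[str]          # characters who witness this event
    told_to: Optional[str] = None    # listener, if this event is testimony
    reports: Optional[int] = None    # index of the earlier event being reported

def known_events(events: List[Event], character: str, use_testimony: bool) -> List[str]:
    known: Set[int] = set()
    for i, ev in enumerate(events):
        if character in ev.present:
            known.add(i)  # channel 1: direct witnessing
            if use_testimony and ev.told_to == character and ev.reports is not None:
                known.add(ev.reports)  # channel 2: learned by being told
    return [events[i].text for i in sorted(known)]

everyone = frozenset({"Sally", "Anne"})
events = [
    Event("Sally puts a marble in a basket.", everyone),
    Event("Sally leaves the room.", everyone),
    Event("Anne moves the marble to a box.", frozenset({"Anne"})),
    Event("Sally comes back into the room.", everyone),
    Event("Anne tells Sally she moved the marble.", everyone, told_to="Sally", reports=2),
]

base = known_events(events, "Sally", use_testimony=False)
extended = known_events(events, "Sally", use_testimony=True)
print(base)
print(extended)
```

Under the presence-only rule the move never enters Sally's perspective; with the second channel enabled it is restored at its chronological position.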

Configuration

Setting	Stage 1	Stage 2
Temperature	0 (deterministic filter)	0 for classification; 0.3–0.7 for open-ended
Max tokens	~1.5× story length	50–100 classification; 200–400 explanatory
Stop sequence	`"\n\n"` to cut trailing commentary	only if format demands
Output format	free-form list, optionally "no commentary"	answer-format constraints (e.g., multiple choice) go here, not in Stage 1

Stage 1 output typically runs 40–70% of the story's token count. Don't add CoT to Stage 1 — it bloats the perspective and dilutes the filtered signal. Add it only to Stage 2 if the belief-to-action step is genuinely complex.
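
The table translates directly into per-stage settings. A minimal sketch, assuming a hypothetical `StageConfig` container; the specific Stage 2 values (0.5, 300, 100) are picks from within the ranges above, not recommendations from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class StageConfig:
    temperature: float
    max_tokens: int
    stop: Optional[Tuple[str, ...]] = None

def simtom_configs(story_tokens: int, open_ended: bool = False) -> Tuple[StageConfig, StageConfig]:
    # Stage 1: deterministic filter, budget of about 1.5x the story, stop at a blank line.
    stage1 = StageConfig(temperature=0.0,
                         max_tokens=int(1.5 * story_tokens),
                         stop=("\n\n",))
    # Stage 2: deterministic for classification, mildly sampled for open-ended answers.
    stage2 = StageConfig(temperature=0.5 if open_ended else 0.0,
                         max_tokens=300 if open_ended else 100)
    return stage1, stage2

stage1_cfg, stage2_cfg = simtom_configs(story_tokens=200)
print(stage1_cfg)
print(stage2_cfg)
```

Keeping the two configurations as separate objects makes it harder to accidentally raise the Stage 1 temperature when tuning Stage 2.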

Implementation workflow

Preprocess the story (names, locations, event order) → identify the target character and question type → build the Stage 1 prompt → call at temperature 0 → build Stage 2 from the filtered output → call → extract the answer. One concise implementation:

from openai import OpenAI

client = OpenAI()

def simtom(story: str, character: str, question: str, model: str = "gpt-4") -> str:
    # Stage 1: perspective-taking
    stage1 = f"""The following is a sequence of events:

{story}

Which events does {character} know about?
A character knows about all events they directly witness.
If they leave a location, they no longer know events there until they return.

List only the events {character} knows about, one per line."""
    perspective = client.chat.completions.create(
        model=model, temperature=0,
        messages=[{"role": "user", "content": stage1}],
    ).choices[0].message.content.strip()

    # Stage 2: question-answering from the filtered perspective only
    stage2 = f"""{perspective}

You are {character}. Based on the above information, answer:

{question}

Answer:"""
    return client.chat.completions.create(
        model=model, temperature=0,
        messages=[{"role": "user", "content": stage2}],
    ).choices[0].message.content.strip()

It ports cleanly across providers. Claude likes XML-delimited structure (<story>…</story>, <character>…</character>); GPT models do better with the knowledge rule in the system message; Llama-2 chat models need their [INST] chat template applied (Llama-3 instruct models use a different header-token template, so apply whichever one matches the checkpoint). DSPy is a natural fit: model the two stages as two modules and let MIPRO jointly tune both prompts against a labeled set, since Stage 1 phrasing is the single biggest lever. For batches, fire all Stage 1 calls concurrently, then all Stage 2 calls, and cache Stage 1 by (story_hash, character) so re-querying the same character on the same story costs nothing.
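
The batching and caching advice can be sketched with the standard library alone. `Stage1Cache` and `fake_stage1` are hypothetical names; in practice the wrapped function would be whatever issues your Stage 1 call, and a thread pool is only one reasonable way to run I/O-bound requests concurrently.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Stage1Fn = Callable[[str, str], str]  # (story, character) -> filtered perspective

class Stage1Cache:
    def __init__(self, stage1: Stage1Fn):
        self._stage1 = stage1
        self._cache: Dict[Tuple[str, str], str] = {}

    @staticmethod
    def key(story: str, character: str) -> Tuple[str, str]:
        return hashlib.sha256(story.encode("utf-8")).hexdigest(), character

    def get_many(self, pairs: List[Tuple[str, str]], max_workers: int = 8) -> List[str]:
        # Collect cache misses once each, even if a pair repeats in the batch.
        missing: Dict[Tuple[str, str], Tuple[str, str]] = {}
        for story, character in pairs:
            k = self.key(story, character)
            if k not in self._cache and k not in missing:
                missing[k] = (story, character)
        if missing:
            # pool.map preserves input order, so results line up with the keys.
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                results = list(pool.map(lambda p: self._stage1(*p), missing.values()))
            self._cache.update(zip(missing.keys(), results))
        return [self._cache[self.key(s, c)] for s, c in pairs]

# Hypothetical stand-in for a real Stage 1 call; it records how often it runs.
calls: List[Tuple[str, str]] = []

def fake_stage1(story: str, character: str) -> str:
    calls.append((story, character))
    return f"[{character}'s view of a {len(story)}-char story]"

cache = Stage1Cache(fake_stage1)
story = "Sally puts a marble in a basket. Sally leaves the room."
first = cache.get_many([(story, "Sally"), (story, "Anne"), (story, "Sally")])
second = cache.get_many([(story, "Sally")])
print(len(calls))
```

The cache key includes the character, so two characters in the same story never share a perspective. It deliberately does not include the question, which is only safe when all questions refer to the same point in the story.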

Do: state the knowledge rule explicitly; keep Stage 1 at temperature 0; validate Stage 1 outputs (shorter than the story, no post-departure events) before trusting Stage 2; name the character exactly; try SimToM-Domain if zero-shot Stage 1 is weak.

Don't: merge the two stages; add CoT to Stage 1; use SimToM for ground-truth questions; reuse one Stage 1 output across questions that need different time-indexed knowledge states.

A useful rule-of-thumb on phrasing the knowledge rule: first-person framing ("Imagine you are X — what did you personally witness?") tends to win on larger models because it front-loads the perspective-taking; explicit location-tracking and negative-framing ("do NOT include events X missed") are the other two variants worth A/B testing.

Debugging

Symptom	Likely cause	Fix
Stage 2 gives the ground-truth answer	Stage 1 leaked events the character missed, or Stage 2 got the original story	Tighten the rule ("only events personally witnessed"); confirm Stage 2 starts with the filtered output, not the story
Stage 1 output ≈ whole story	Over-inclusion	Add explicit exclusion ("do not include events that happened while X was away")
Stage 1 too short, misses witnessed events	Over-filtering / name mismatch	"Include all events X witnessed, not just key ones"; check name matches exactly
Stage 2 says "not enough information"	Stage 1 correctly filtered everything relevant	Technically correct; let it state its best expectation, or treat as uncertain
Inconsistent answers across runs	Non-zero Stage 1 temperature	Set Stage 1 to 0; if Stage 2 varies, self-consistency on Stage 2 only
SimToM below zero-shot	Stage 1 so inaccurate it adds error (the Llama-2-13b/ToMi −3.75 pp case)	Switch to SimToM-Domain, use a stronger Stage 1 model, or human-review Stage 1

Quick diagnostics without gold labels: a Stage 1/story length ratio above 0.9 signals over-inclusion, below 0.15 signals over-exclusion (aim 0.3–0.8); a Stage 2 "I don't know" rate above 20% means over-filtering; low overlap (below 0.7) across paraphrased stories means Stage 1 is surface-matching, not applying the rule.
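
The first two diagnostics are cheap enough to run on every request. A minimal sketch using the thresholds quoted above, with whitespace word counts as a stand-in for token counts (an assumption; swap in your tokenizer if the difference matters). The function names and the marker phrases are illustrative.

```python
from typing import List

def stage1_flags(story: str, perspective: str,
                 low: float = 0.15, high: float = 0.9) -> List[str]:
    """Label-free length check on a Stage 1 output."""
    story_len = len(story.split())
    ratio = len(perspective.split()) / story_len if story_len else 0.0
    flags: List[str] = []
    if ratio > high:
        flags.append("over-inclusion")
    elif ratio < low:
        flags.append("over-exclusion")
    return flags

def dont_know_rate(answers: List[str]) -> float:
    """Share of Stage 2 answers that decline to answer."""
    markers = ("i don't know", "not enough information")
    hits = sum(any(m in a.lower() for m in markers) for a in answers)
    return hits / len(answers) if answers else 0.0

story = (
    "Sally puts a marble in a basket. Sally leaves the room. "
    "Anne takes the marble from the basket and puts it in a box. "
    "Sally comes back into the room."
)
good = "Sally puts a marble in a basket. Sally comes back into the room."

print(stage1_flags(story, good))
print(stage1_flags(story, story))
print(stage1_flags(story, "Sally leaves."))
print(dont_know_rate(["I don't know.", "The basket.", "In the box.",
                      "Not enough information to say."]))
```

A length ratio inside the band does not prove the filter is right; it only catches the two gross failure modes before Stage 2 runs.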

Testing and proving it works

Always score false-belief and true-belief questions separately — SimToM's gains live in the false-belief subset, and aggregate numbers hide it. SimToM must not hurt true-belief accuracy; if it does, Stage 1 is dropping events the character actually saw. Reserve a 20% holdout, tune on the rest, and hand-annotate 20–30 Stage 1 outputs for an event-level F1 — the most informative single diagnostic; if Stage 1 accuracy is below ~80%, fix it before touching Stage 2.
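
Event-level F1 needs only a set comparison once the events are normalized. The sketch below assumes Stage 1 copies events close to verbatim, so exact matching after light normalization is adequate; paraphrased outputs would need fuzzy or embedding-based matching instead. `event_f1` and `_norm` are illustrative names.

```python
from typing import Iterable, Tuple

def _norm(event: str) -> str:
    # Lowercase, drop list bullets and trailing periods, collapse whitespace.
    return " ".join(event.lower().strip(" .-\t").split())

def event_f1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of predicted known-events against annotated ones."""
    pred = {_norm(e) for e in predicted if e.strip()}
    ref = {_norm(e) for e in gold if e.strip()}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

predicted = [
    "- Sally puts a marble in a basket.",
    "- Anne takes the marble from the basket and puts it in a box.",  # leaked event
    "- Sally comes back into the room.",
]
gold = [
    "Sally puts a marble in a basket.",
    "Sally leaves the room.",
    "Sally comes back into the room.",
]
precision, recall, f1 = event_f1(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Reporting precision and recall separately is worth the extra column: low precision means ground truth is leaking into Stage 2, low recall means the filter is dropping events the character saw.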

For comparing methods on the same questions, use McNemar's test (the errors are paired); a p-value below 0.05 with Cohen's h above 0.2 supports a real improvement. For 80% power to detect a 5 pp difference at α=0.05 you need roughly 400 binary questions; about 100 for a 10 pp difference. Run the five-perturbation adversarial suite from the Ullman/Shapira critiques — name swap, object swap, order permutation, length padding, location renaming — since a genuine structural fix should survive all of them.
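
Both statistics fit in a few lines of standard-library Python. This sketch uses the exact (binomial) form of McNemar's test on the discordant pairs; the example counts are invented for illustration, and a statistics package would be the better choice for anything beyond a quick check.

```python
from math import asin, comb, sqrt

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: questions the baseline got right and SimToM got wrong
    c: questions the baseline got wrong and SimToM got right
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Invented example: 12 discordant questions, 10 of them in SimToM's favour.
p_value = mcnemar_exact(b=2, c=10)
effect = cohens_h(0.81, 0.34)  # GPT-3.5 on ToMi: SimToM vs 0-shot CoT
print(f"p = {p_value:.4f}, h = {effect:.2f}")
```

Only the questions where the two methods disagree enter the test, which is why paired evaluation on the same question set matters more than the raw accuracy gap.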

This is also where SimToM's superiority over more-of-the-same reasoning shows: Self-Consistency CoT manages only 33.5% on ToMi (GPT-3.5) and Tree-of-Thoughts also trails — voting over omniscient chains never introduces the epistemic constraint SimToM adds.

Limitations

Presence-only filtering. Stage 1 tracks physical presence, so it misses knowledge gained by testimony, inference, or reputation — common in naturalistic text.
Stage 1 quality ceiling. Oracle shows ~96% is reachable, but model-generated filters fall short on longer, multi-character, multi-location stories. This gap is irreducible without better models or fine-tuning.
First-order only, natively. Second-order and up need recursion, which multiplies calls and compounds Stage 1 error. Decompose-ToM (Zhao et al., 2025) beats SimToM by +28.13 pp for GPT-4o (and +22.5 pp for Llama-3-70B) at second order on Hi-ToM.
Binary knowledge. No degrees of certainty or partial witnessing.
One character per call. Comparing beliefs means running Stage 1 multiple times.
Long context decays. On FANToM, SimToM shows a 4% gap between short and long conversations, vs Decompose-ToM's 0.9% — more transitions mean more Stage 1 errors.

Watch the edge cases. Events spanning a departure, "noticed it's missing but not where it went," ambiguous presence ("nearby," "distracted"), pronoun-only references, and nested/reported speech all break the binary rule. Default to an over-exclusion bias ("when in doubt, exclude") since over-inclusion — letting ground truth leak into Stage 2 — is the more damaging error.

Advanced techniques

Recursive SimToM for second-order beliefs. To answer "Where does Sally think Anne will look?", first run standard SimToM for Anne to get her perspective and belief. Then build a second-order story from Sally's viewpoint that encodes what Sally knows about Anne's information state (e.g., that Sally saw Anne leave before the move), run Stage 1 for Sally over it, and ask Stage 2 what Sally thinks Anne believes. It works but costs 4+ calls and compounds error — for production higher-order work, Decompose-ToM's recursive architecture is more systematic.

Other extensions: add a Stage 2 self-verification pass ("check your answer against the events listed"); request structured JSON output with a confidence field (low confidence flags ambiguous Stage 1 for review); feed multiple characters' filtered perspectives into one Stage 2 for common-ground questions; combine with Self-Consistency on Stage 2 (5 samples at temperature 0.5, majority-vote) while keeping Stage 1 fixed; or pair with RAG so the retrieved chunks become the "story" (sort them chronologically first — Stage 1 depends on event order).
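
The Self-Consistency extension is the simplest of these to wire up. A minimal sketch, assuming a hypothetical `sample` callable that sends one prompt to your model at a non-zero temperature and returns the text; the canned replies below replace it so the example runs offline.

```python
from collections import Counter
from typing import Callable, List

def vote_stage2(perspective: str, character: str, question: str,
                sample: Callable[[str], str], n: int = 5) -> str:
    """Sample Stage 2 n times over one fixed Stage 1 output and majority-vote."""
    prompt = (
        f"{perspective}\n\nYou are {character}. "
        f"Based on the above information, answer:\n\n{question}\n\nAnswer:"
    )
    # Light normalization so "Basket." and "basket" count as the same vote.
    answers: List[str] = [sample(prompt).strip().lower().rstrip(".") for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Canned replies standing in for five sampled completions.
replies = iter(["basket", "Basket.", "box", "basket", "box"])

winner = vote_stage2(
    perspective="Sally puts a marble in a basket.\nSally comes back into the room.",
    character="Sally",
    question="Where will Sally look for the marble?",
    sample=lambda prompt: next(replies),
)
print(winner)
```

Stage 1 is computed once and reused for every sample, so the vote averages over answer noise without re-rolling the filter. Free-text answers would need an answer-extraction step before counting.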

Risks and ethics

Scaffolded ToM is not genuine ToM. Oracle's ~96% means Stage 2 reasons well when handed the right perspective — but the model doesn't build that perspective spontaneously. So a SimToM system fails silently whenever its assumptions break (implied departures, naturalistic prose, unusual structure). Document the assumptions your deployment relies on and test the boundaries; don't treat the model as having robust theory of mind.

Two more concerns. Transparency: only Stage 2 is usually shown, but the hidden Stage 1 filter determines the whole answer — in legal or medical use, log and surface both for audit, and prefer on-premises models (HIPAA/GDPR) over sending PHI/PII to third-party APIs. Manipulation and bias: a crafted story can make Stage 1 misrepresent a character's knowledge, so never present generated "perspectives" as factual claims about real people; and audit Stage 1 quality across character demographics, since systematic mis-filtering would yield biased belief attributions. Standard prompt-injection mitigations (delimit the story, treat embedded instructions as narrative text) and content filtering on Stage 2 cover the role-play jailbreak risk.

Ecosystem and alternatives

SimToM is grounded in Simulation Theory from cognitive science (Gordon, 1986; Heal, 1986; Goldman, 2006) — we understand others by simulating their situation, not by applying a rulebook. Its rival, Theory Theory (Fodor, 1983; Gopnik & Wellman, 1992), says we apply an internalized folk-psychology theory; chain-of-thought implicitly bets on Theory Theory, which is exactly why CoT fails here — the problem is too much information, not too little reasoning. The paper draws directly on developmental evidence that cueing children with "what did Sally see?" lets them pass false-belief tasks earlier (Siegal & Beattie, 1991; Lewis & Osborne, 1990); Stage 1 is that cue for an LLM.

How it stacks up against the alternatives:

Dimension	SimToM	Zero-shot CoT	Self-Consistency	Decompose-ToM	Fine-tuning
Mechanism	Context partitioning	Reasoning elicitation	Voting over paths	Recursive decomposition	Weight update
API calls / question	2	1	5–20	4–8+	1 at inference
False-belief (GPT-3.5, ToMi)	81.0%	34.0%	33.5%	≈ SimToM at 1st order	Task-dependent
Higher-order ToM	Weak (1st-order)	Weak	Weak	Strong (2nd–4th)	Potentially strong
Token overhead	~1.5×	1×	5–20×	3–6×	1× (high upfront)
Training data	None	None	None	None	Required
Explainability	High (Stage 1 visible)	Medium	Low	Medium	Low

Choosing between them: SimToM over CoT whenever there's explicit false-belief structure (the +47 pp ToMi gap for GPT-3.5 settles it). SimToM over Self-Consistency is no contest on ToM (81.0% vs 33.5%). SimToM over Decompose-ToM for first-order tasks (simpler, cheaper, comparable) — switch to Decompose-ToM only for second-order+ or FANToM-style long context. Fine-tuning beats both only at very high, single-task query volume.

Where it connects: SimToM has no dedicated framework yet (early 2026) but drops into LangChain (a two-step chain), LlamaIndex (a two-call pipeline), or DSPy (two modules). The official repo is shawnsihyunlee/simulatedtom, with evaluate_tomi.py / evaluate_bigtom.py and W&B tracking. Hybrids worth knowing: SimToM + fine-tuned Stage 1 (a small 7B extractor trained on Oracle annotations, paired with a larger Stage 2 model); SimToM as a tool inside multi-agent frameworks (AutoGen, CrewAI) for modeling other agents' beliefs; and a neural-symbolic split where Stage 1 extracts (character, event, present) tuples for a formal epistemic reasoner. UniToMBench (Li et al., 2025) builds directly on SimToM's perspective/reasoning separation. To migrate to SimToM: identify ToM questions, wire the two stages, validate Stage 1 on 20 cases, add a one-shot example if it's below 80%, then benchmark against your CoT baseline before shipping.

The benchmark landscape, for context: ToMi (Le et al., 2019, fixed false-belief template), BigToM (Gandhi et al., 2023, GPT-4-generated, broader narratives — hence the GPT-4 confound), FANToM (Kim et al., 2023, conversational asymmetry), Hi-ToM (Wu et al., 2023, up to 4th-order), and OpenToM (Xu et al., 2024, longer stories with personalities). The broader debate — Kosinski (2023, arXiv:2302.02083) claimed GPT-4 passes at a 9-year-old's level; Ullman (2023) and Shapira et al. (2023) showed performance collapses under trivial perturbations — is what motivated a structural fix in the first place.

Future directions

The clearest next step follows straight from the Oracle ceiling: train a dedicated, small (~7B) perspective extractor for Stage 1, since that's where almost all the remaining error sits — cheaper, faster, and more accurate than a frontier model doing Stage 1. Other live threads: communication-aware filtering that tracks testimony alongside witnessing; dynamic perspective updating for streaming dialogue without re-running Stage 1; BDI extension that filters desires and goals, not just knowledge, to cover OpenToM's psychological-world questions; multi-modal SimToM filtering what a character could have seen or heard; and using perspective-correctness as an RLHF reward so models eventually do this natively, no scaffolding required. The open mechanistic question remains whether SimToM enables genuine simulation or just removes the wrong-answer attractor — causal ablation studies on Stage 1 contents could tell them apart.

The result that frames the technique: GPT-4 answered ToMi false-belief questions correctly just 25.5% of the time zero-shot — confidently wrong, below chance. The same model with SimToM scaffolding reached 87.75%. The reasoning ability was there all along; it was pointed at the wrong information. SimToM's whole contribution is fixing what the model gets to see, not how hard it thinks.

Summary

What: SimToM (Simulated Theory of Mind) is a two-stage prompt — filter the story to what a character witnessed, then answer as that character from the filter alone.
Why: False-belief failure is an information-state problem, not a reasoning one. Self-attention can't mask events a character didn't see; SimToM removes them from context instead.
When: Use it for questions about a character's belief/knowledge with witness-based information asymmetry; skip it for ground-truth, single-character, or desire/emotion questions.
Where: Narrative false-belief tasks, and real analogues — patient vs. chart, party vs. contract, student vs. curriculum, agent vs. game state.
How: Two separate API calls (never merged — merging costs 19–27 pp), Stage 1 at temperature 0 with an explicit knowledge rule, Stage 2 with first-person grounding over the filtered events only.
Which: Zero-shot or 1-shot (SimToM-Domain, ~+20 pp) for first-order; recursive SimToM or Decompose-ToM for second-order+; fine-tuned Stage 1 at high volume.
Headline: +62.25 pp for GPT-4 on ToMi and +29.5 pp for GPT-3.5 on BigToM; Oracle's ~96% shows the remaining gap is Stage 1 quality (Wilf et al., arXiv:2311.10227, ACL 2024).

Explore Unread

Great job! You've read all available articles

SimToM: perspective-taking for theory-of-mind reasoning

See it work

Here's the classic false-belief setup. Sally hides a marble, leaves, and Anne moves it while she's gone.

Story: Sally puts a marble in a basket. Sally leaves the room. Anne takes the
marble from the basket and puts it in a box. Sally comes back into the room.

Question: Where will Sally look for the marble?

--- Zero-shot GPT-3.5 ---
Sally will look in the box. The marble was moved to the box by Anne.   ✗

--- SimToM ---
Stage 1 (what does Sally know?):
  - Sally puts a marble in a basket.
  - Sally comes back into the room.
  (The marble move is excluded — Sally was absent for it.)

Stage 2 (answer as Sally, from those events only):
  I will look in the basket. That's where I put the marble before I left,
  and I have no reason to believe it moved.   ✓

The mental model

Don't ask the model to suppress what it knows. Take away the knowledge it should never have had in the first place.

How it works

SimToM splits one inference into two sequential API calls. The first decides what the character knows; the second answers as that character.

Pick the target. Find which character the question is about (usually named in the question), and the character names and locations in the story.
Stage 1 — perspective-taking. Give the model the full story plus an explicit knowledge rule: a character knows events they witness; once they leave a location, they stop learning what happens there until they return. The model returns only the events that character saw.
Stage 2 — question-answering. Give the model the filtered events, not the original story, with first-person framing ("You are Sally. Based on the above…") and the question. It answers from the character's limited view.

Why it works

Factor	Why it dominates
Stage 1 accuracy (~60% of outcome variance)	A wrong filter feeds Stage 2 bad input; correct reasoning then gives a wrong answer. The Oracle gap measures this directly.
Model capability	Stronger models build better filters in Stage 1 and reason better in Stage 2.
Clarity of the knowledge rule	Spelling out the leave/return rule sharply improves Stage 1 completions.
Story complexity	More characters and location changes degrade Stage 1, and the error cascades.

Where it shines

SimToM helps exactly when a question targets one character's restricted viewpoint in a narrative with information asymmetry. The headline numbers, all on false-belief subsets:

BigToM false-belief (% accuracy):

Model	0-shot	0-shot CoT	SimToM	Gain vs 0-shot
Llama2-7b-chat	47.5	31.5	70.5	+23.0 pp
Llama2-13b-chat	41.25	52.25	61.75	+20.5 pp
GPT-3.5-Turbo	41.0	56.25	70.5	+29.5 pp
GPT-4	89.0	93.25	92.0	+3.0 pp (−1.2 vs CoT)

ToMi false-belief (% accuracy):

Model	0-shot	0-shot CoT	SimToM	Gain vs 0-shot
Llama2-7b-chat	28.25	24.0	40.0	+11.75 pp
Llama2-13b-chat	39.25	16.5	35.5	−3.75 pp
GPT-3.5-Turbo	67.25	34.0	81.0	+13.75 pp
GPT-4	25.5	74.25	87.75	+62.25 pp

When to use it (and when not)

Reach for SimToM when:

The question asks "What does X think/know/believe?", "Where will X look?", or "Why did X do Y?" — where X is not the omniscient narrator.
At least two agents have different information access, and the gap comes from who witnessed which event.
The story marks clearly who is present for what (event-list or temporal-marker prose).

Skip it when:

The question is about the actual ground-truth world state — filtering can only hurt.
The belief must be inferred from behavioral or emotional cues, not event-witnessing (desires, emotions).
There's one character, or their knowledge equals the full story.
The story is so short that CoT already suffices and two calls aren't worth it.

Variant	Best for
Zero-shot SimToM	First deployment, novel domains without examples
SimToM-Domain (1-shot)	Production first-order ToM — ~20 pp boost at modest cost
Recursive SimToM	Second-order belief tasks
SimToM + CoT in Stage 2	When belief-to-action inference is complex after filtering
SimToM-Oracle	Research baselines; high-stakes work with human review of Stage 1

Structure and components

Here's the canonical two-stage template:

=== Stage 1: perspective-taking ===
The following is a sequence of events:

{story}

Which events does {character} know about?
A character knows about all events they directly witness.
If a character is in a location, they know all events that happen there.
If they leave, they no longer know events there until they return.

List only the events {character} knows about, one per line.

=== Stage 2: question-answering ===
{stage_1_output}

You are {character}. Based on the above information, answer:

{question}

Answer:

Configuration

Setting	Stage 1	Stage 2
Temperature	0 (deterministic filter)	0 for classification; 0.3–0.7 for open-ended
Max tokens	~1.5× story length	50–100 classification; 200–400 explanatory
Stop sequence	`"\n\n"` to cut trailing commentary	only if format demands
Output format	free-form list, optionally "no commentary"	constrain only here, never Stage 1

Implementation workflow

from openai import OpenAI

client = OpenAI()

def simtom(story: str, character: str, question: str, model: str = "gpt-4") -> str:
    # Stage 1: perspective-taking
    stage1 = f"""The following is a sequence of events:

{story}

Which events does {character} know about?
A character knows about all events they directly witness.
If they leave a location, they no longer know events there until they return.

List only the events {character} knows about, one per line."""
    perspective = client.chat.completions.create(
        model=model, temperature=0,
        messages=[{"role": "user", "content": stage1}],
    ).choices[0].message.content.strip()

    # Stage 2: question-answering from the filtered perspective only
    stage2 = f"""{perspective}

You are {character}. Based on the above information, answer:

{question}

Answer:"""
    return client.chat.completions.create(
        model=model, temperature=0,
        messages=[{"role": "user", "content": stage2}],
    ).choices[0].message.content.strip()

Don't: merge the two stages; add CoT to Stage 1; use SimToM for ground-truth questions; reuse one Stage 1 output across questions that need different time-indexed knowledge states.
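To check the "never merge" rule without spending API calls, the same two-stage flow can be written against an injected completion function. This is a sketch under assumptions: `simtom_with` and the `complete` callable (prompt in, text out) are hypothetical names, not part of any library, and the prompt text is copied from the template above.

```python
from typing import Callable

STAGE1_TEMPLATE = """The following is a sequence of events:

{story}

Which events does {character} know about?
A character knows about all events they directly witness.
If a character is in a location, they know all events that happen there.
If they leave, they no longer know events there until they return.

List only the events {character} knows about, one per line."""

STAGE2_TEMPLATE = """{perspective}

You are {character}. Based on the above information, answer:

{question}

Answer:"""


def simtom_with(complete: Callable[[str], str], story: str, character: str, question: str) -> str:
    """Run both stages through `complete`, a hypothetical prompt -> text function."""
    # Stage 1 sees the full story and returns the character's filtered view.
    perspective = complete(STAGE1_TEMPLATE.format(story=story, character=character)).strip()
    # Stage 2 is built from the filtered view only; `story` is deliberately not passed in.
    stage2_prompt = STAGE2_TEMPLATE.format(
        perspective=perspective, character=character, question=question
    )
    return complete(stage2_prompt).strip()
```

With a stub for `complete` that records each prompt, a unit test can assert that the second prompt contains none of the events Stage 1 dropped, which is the property the two-call structure is meant to guarantee.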

Debugging

Symptom	Likely cause	Fix
Stage 2 gives the ground-truth answer	Stage 1 leaked events the character missed, or Stage 2 got the original story	Tighten the rule ("only events personally witnessed"); confirm Stage 2 starts with the filtered output, not the story
Stage 1 output ≈ whole story	Over-inclusion	Add explicit exclusion ("do not include events that happened while X was away")
Stage 1 too short, misses witnessed events	Over-filtering / name mismatch	"Include all events X witnessed, not just key ones"; check name matches exactly
Stage 2 says "not enough information"	Stage 1 correctly filtered everything relevant	Technically correct; let it state its best expectation, or treat as uncertain
Inconsistent answers across runs	Non-zero Stage 1 temperature	Set Stage 1 to 0; if Stage 2 varies, self-consistency on Stage 2 only
SimToM below zero-shot	Stage 1 so inaccurate it adds error (the Llama-2-13b/ToMi −3.75 pp case)	Switch to SimToM-Domain, use a stronger Stage 1 model, or human-review Stage 1

Testing and proving it works

Limitations

Presence-only filtering. Stage 1 tracks physical presence, so it misses knowledge gained by testimony, inference, or reputation — common in naturalistic text.
Stage 1 quality ceiling. Oracle shows ~96% is reachable, but model-generated filters fall short on longer, multi-character, multi-location stories. Closing that gap takes better models or fine-tuning.
First-order only, natively. Second-order and up need recursion, which multiplies calls and compounds Stage 1 error. Decompose-ToM (Zhao et al., 2025) beats SimToM by 28.13 pp for GPT-4o (and 22.5 pp for Llama-3-70B) at second order on Hi-ToM.
Binary knowledge. No degrees of certainty or partial witnessing.
One character per call. Comparing beliefs means running Stage 1 multiple times.
Long context decays. On FANToM, SimToM shows a 4% gap between short and long conversations, vs Decompose-ToM's 0.9% — more transitions mean more Stage 1 errors.
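To make the recursion cost in the "First-order only" item concrete, here is one possible reading of recursive SimToM, sketched under assumptions: the story is filtered through each character in a belief chain in turn, and the final answer comes from the innermost character. `recursive_simtom`, the `chain` argument, and the `complete` callable (prompt in, text out) are hypothetical names, and this nesting order is an illustration rather than the paper's specification.

```python
from typing import Callable, Sequence

FILTER_PROMPT = """The following is a sequence of events:

{story}

Which events does {character} know about?
A character knows about all events they directly witness.
If a character is in a location, they know all events that happen there.
If they leave, they no longer know events there until they return.

List only the events {character} knows about, one per line."""

ANSWER_PROMPT = """{perspective}

You are {character}. Based on the above information, answer:

{question}

Answer:"""


def recursive_simtom(
    complete: Callable[[str], str],
    story: str,
    chain: Sequence[str],
    question: str,
) -> str:
    """Filter through each character in `chain`, outermost first, then answer.

    For "Where does Anne think Sally will look?", chain would be ["Anne", "Sally"].
    Total calls: len(chain) filter calls plus one answer call.
    """
    if not chain:
        raise ValueError("chain must name at least one character")
    view = story
    for character in chain:
        # Each filter sees only the previous filter's output, so any error made
        # at one level is inherited by every level after it.
        view = complete(FILTER_PROMPT.format(story=view, character=character)).strip()
    return complete(
        ANSWER_PROMPT.format(perspective=view, character=chain[-1], question=question)
    ).strip()
```

Under this reading a second-order question costs three calls instead of two, and because each filter runs on the previous filter's output, a missed or leaked event at the first level cannot be recovered later.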

Advanced techniques

Risks and ethics

Ecosystem and alternatives

How it stacks up against the alternatives:

Dimension	SimToM	Zero-shot CoT	Self-Consistency	Decompose-ToM	Fine-tuning
Mechanism	Context partitioning	Reasoning elicitation	Voting over paths	Recursive decomposition	Weight update
API calls / question	2	1	5–20	4–8+	1 at inference
False-belief (GPT-3.5, ToMi)	81.0%	34.0%	33.5%	≈ SimToM at 1st order	Task-dependent
Higher-order ToM	Weak (1st-order)	Weak	Weak	Strong (2nd–4th)	Potentially strong
Token overhead	~1.5×	1×	5–20×	3–6×	1× (high upfront)
Training data	None	None	None	None	Required
Explainability	High (Stage 1 visible)	Medium	Low	Medium	Low

Future directions

Summary

What: SimToM (Simulated Theory of Mind) is a two-stage prompt — filter the story to what a character witnessed, then answer as that character from the filter alone.
Why: False-belief failure is an information-state problem, not a reasoning one. Self-attention can't mask events a character didn't see; SimToM removes them from context instead.
When: Use it for questions about a character's belief/knowledge with witness-based information asymmetry; skip it for ground-truth, single-character, or desire/emotion questions.
Where: Narrative false-belief tasks, and real analogues — patient vs. chart, party vs. contract, student vs. curriculum, agent vs. game state.
How: Two separate API calls (never merged — merging costs 19–27 pp), Stage 1 at temperature 0 with an explicit knowledge rule, Stage 2 with first-person grounding over the filtered events only.
Which: Zero-shot or 1-shot (SimToM-Domain, ~+20 pp) for first-order; recursive SimToM or Decompose-ToM for second-order+; fine-tuned Stage 1 at high volume.
Headline: +62.25 pp for GPT-4 on ToMi and +29.5 pp for GPT-3.5 on BigToM; Oracle's ~96% shows the remaining gap is Stage 1 quality (Wilf et al., arXiv:2311.10227, ACL 2024).
