
# Why "Experts Have World Models, LLMs Have Word Models" Shows Voice AI Demos Need Simulation Depth, Not Just Pattern Matching

**Meta Description:** Ankit Maloo shows LLMs pattern-match words while experts simulate adversarial worlds. Voice AI demos must model user behavior and adapt to survive real-world deployment.

---

## The Slack Message That Reveals Everything

From [Ankit Maloo's essay on Latent.Space](https://www.latent.space/p/adversarial-reasoning) (20 points on HN, 2 hours old, 13 comments):

**The Setup:** You're three weeks into a new job. You need the lead designer, Priya, to review your mockups. ChatGPT drafts a Slack message:

> "Hi Priya, when you have a moment, could you please take a look at my files and share any feedback? I'd really appreciate your perspective. No rush at all, whenever it fits your schedule. Thanks!"

**Your finance friend reads:** "Perfect. Polite, not pushy, respects her time. Send it."

**Your coworker who's been there three years reads:** "Don't send that. Priya sees 'no rush, whenever works' and mentally files it as not urgent. It sinks below fifteen other messages with actual deadlines. She's not ignoring you. She's triaging, **and you just told her to deprioritize you.**"

---

## The Core Insight: Simulation vs. Pattern Matching

Ankit's essay pinpoints the fundamental gap between LLMs and experts.

**LLM (pattern matching):**

```
Input: "Draft polite Slack message to busy designer"
Output: Polite words ✓
Quality check: Sounds reasonable ✓
Deploy: Success!
```

**Expert (simulation):**

```
Input: "Draft message to busy designer"
Simulation:
1. Priya's workload → overloaded
2. Priya's triage heuristics → "no rush" = deprioritize
3. Priya's reaction to vagueness → risky, avoid
4. Outcome → message ignored
Quality check: Message fails under adversarial conditions ✗
Rewrite: "Hey Priya, could I grab 15 mins before Friday? Blocked on
the onboarding mockups. I'm stuck on the nav pattern. Don't want to
build the wrong thing."
→ Specific problem ✓
→ Bounded time ✓
→ Clear stakes ✓
→ Gets response ✓
```

**The difference:** The expert simulated the world the message would land in. The LLM evaluated text in isolation.

**And Voice AI demos face the exact same problem.**

---

## Voice AI's Slack Message Problem

**Generic Voice AI chatbot (pattern matching):**

```
User: "How does billing work?"
AI: "Billing is in the Billing section. You can view invoices,
set up payment methods, and manage subscriptions."
```

**Pattern-match quality check:**

- Factually correct ✓
- Polite ✓
- Helpful ✓
- Sounds like customer support ✓

**User's actual reaction (world simulation):**

```
User mental model:
1. I have a credit card expiring next month
2. I need to know what happens if billing fails mid-cycle
3. "Billing section" doesn't answer my question
4. I'm now more confused than before

User action: Abandon demo, contact sales
Result: Demo failed
```

**Sales-engineer-guided Voice AI (simulation depth):**

```javascript
// Sales engineer's world model, encoded
const userBillingConcerns = {
  context: user.askedAboutBilling,
  // Simulate user's hidden concerns
  likelyQuestions: [
    'What happens if payment fails?',
    'Can I change plans mid-cycle?',
    'Will I get charged twice?',
    'What if my card expires?'
  ],
  // Preempt the real concern
  response: "Great question! Most customers start with our monthly plan, then upgrade to annual when they see ROI. Let me show you how [CustomerX] saved 30% by switching to annual after their first quarter. And if your payment method ever needs updating, you get 3 email reminders before any service interruption. Here's their billing dashboard..."
};
```

**Why this works:**

- Simulates the user's hidden state (credit card expiring)
- Anticipates the real concern (payment failure)
- Provides social proof (CustomerX)
- Builds confidence (3 email reminders)
- Demonstrates value (30% savings)

**Pattern matching vs. world simulation.**

---

## Ankit's Chess vs. Poker Distinction, Applied to Voice AI

Ankit's framework:

**Chess (perfect information):**

- Every piece visible
- No hidden state
- No bluffing
- Same optimal move regardless of opponent
- AlphaGo doesn't need theory of mind

**Poker (imperfect information):**

- Hidden cards
- Information asymmetry
- Bluffing exists
- Reading the opponent matters
- Pluribus balanced its strategy to be unreadable

**Voice AI demos are poker, not chess.**

---

## Why Voice AI Demos Are Imperfect-Information Games

**Perfect information (chess-like):**

```
User: "Show me the dashboard"
AI: Navigates to /dashboard
→ Deterministic
→ No hidden state
→ Success = task completion
```

**Imperfect information (poker-like):**

```
User: "Show me the dashboard"

Hidden state (unknown to AI):
- User is a technical buyer (wants API architecture)
- User is evaluating 3 competitors (needs differentiation)
- User tried a competitor yesterday (confused by complexity)
- User's boss wants ROI data (needs business case)

AI response options:

Option A (pattern match):
"Here's the dashboard. You can see metrics, users, and reports."
→ Ignores hidden state
→ User bounces (too generic)

Option B (world simulation):
"Here's the dashboard. Most technical buyers start here, then dive
into the API docs. I notice you're looking at our real-time sync
feature—this is where we differentiate from [Competitor]. Unlike
their 15-minute batch updates, we sync in under 200ms. Here's the
architecture diagram..."
→ Simulates the user's role (technical buyer)
→ Anticipates comparison (competitor evaluation)
→ Provides differentiation (real-time sync)
→ Adapts to hidden state
```

**Poker, not chess.**

---

## Ankit's Pluribus Example: Adversarial Robustness

From the essay:

> "Pluribus calculated how it would act with every possible hand, then balanced its strategy so opponents couldn't extract information from its behavior. Human opponents tried to model the causality ('it's betting big, it must have a strong hand'), but Pluribus played balanced frequencies that made those 'reads' statistically irrelevant."

**Pluribus's design goal:** Be unreadable. Deny opponents consistent information.

**Generic Voice AI's problem:** It is extremely readable.

**How users "read" generic AI demos:**

**Read #1: Probe for knowledge boundaries**

```
User: "Can this integrate with Salesforce?"
Generic AI: "Yes, we integrate with Salesforce."
User: "Show me the integration setup."
Generic AI: "I don't have access to integration setup screens."
→ User learned: AI is shallow
→ User adapts: Stops asking technical questions
```

**Read #2: Test objection handling**

```
User: "This looks complicated."
Generic AI: "I can show you our documentation for help."
→ User learned: AI doesn't handle objections
→ User adapts: Asks competitor "Is this easier than [Product]?"
```

**Read #3: Extract pricing information**

```
User: "How much does this cost?"
Generic AI: "Pricing starts at $X/month."
→ User learned: No negotiation
→ User adapts: Asks competitor for a discount
```

**The pattern:** Generic AI is predictable. Users model the AI, extract information, and exploit its consistency.

**Pluribus in reverse.**

---

## Sales-Engineer-Guided AI: Adversarial Robustness for Demos

**How sales engineers achieve Pluribus-level robustness:**

**1. Balanced response frequencies (the Pluribus strategy)**

```javascript
// Sales engineer's adaptive response strategy
const objectionHandling = {
  "This looks complicated": {
    // Don't always give the same response (that's readable)
    responses: [
      {
        condition: 'user is technical',
        response: "Let me show you the Quick Start for developers—most get up and running in 5 minutes.",
        weight: 0.4
      },
      {
        condition: 'user is non-technical',
        response: "I hear you. Let me show you our visual workflow builder—no coding required.",
        weight: 0.4
      },
      {
        condition: 'user asked multiple times',
        response: "You mentioned complexity twice. What specifically feels complicated? I want to address your exact concern.",
        weight: 0.2
      }
    ],
    // Adapt based on user behavior
    recalibrate: true
  }
};
```

**Why this matters:** The user can't "read" the AI's pattern, because the pattern adapts.

**2. Simulate the user's model of you**

Ankit:

> "A skilled negotiator sees the probing and recalibrates. They give misleading signals. They react unexpectedly to throw off the read. The game is recursive: I'm modeling their model of me, and adjusting to corrupt it."

**Voice AI equivalent:**

```javascript
// Track what the user has learned about the AI
const userModelOfAI = {
  beliefs: [
    'AI knows product features',
    'AI can navigate interface',
    'AI answers questions politely'
  ],
  // User probes to update their model
  probes: [
    { question: "Can you show me advanced features?", purpose: 'test knowledge depth' },
    { question: "How does this compare to [Competitor]?", purpose: 'test competitive intelligence' },
    { question: "Can I get a discount?", purpose: 'test sales authority' }
  ],
  // AI recalibrates based on probing
  recalibration: {
    ifProbed: 'advanced features',
    response: "Let me check your use case first. Are you looking for API access, custom workflows, or bulk operations?",
    purpose: 'reveal user intent before revealing features (poker move)'
  }
};
```

**The expert doesn't just answer questions—they manage what the user learns.**

**3. Deny information-asymmetry exploitation**

**Poker principle:** Don't let your opponent extract information from your behavior.

**Demo principle:** Don't let the user extract information that weakens your position.

**Bad (information leakage):**

```
User: "Can this handle 10,000 users?"
Generic AI: "Our Enterprise plan supports up to 50,000 users."
→ User learned: I can negotiate down from Enterprise to a cheaper tier
```

**Good (information control):**

```
User: "Can this handle 10,000 users?"
Sales-guided AI: "Absolutely. Most customers at that scale use our
Team plan and upgrade to Enterprise when they hit 25K. What's your
current user count?"
→ User reveals information first
→ AI maintains pricing leverage
→ AI guides toward the appropriate tier
```

**Pluribus wouldn't reveal its cards. Sales engineers don't reveal pricing leverage.**

---

## Why "More Intelligence" Doesn't Fix This

Ankit:

> "The natural response is: surely smarter models will figure this out. Just scale everything up? More compute, more data, more parameters. But more raw IQ doesn't fix the missing training loop."

**The four things models need for adversarial robustness:**

1. **Detect that the situation is strategic** (even when framed as polite/cooperative)
2. **Identify the relevant agents and what each is optimizing**
3. **Simulate how those agents interpret signals and adapt**
4. **Choose an action that remains good across plausible reactions**

**Voice AI problem:** Steps 2–4 are possible with good prompting. **Step 1 is the killer.**

**A situation that looks cooperative:**

```
User: "Can you show me how billing works?"
→ Sounds like an information request
→ Cooperative framing
→ Generic AI treats it as an FAQ answer
```

**A situation that IS strategic:**

```
User: "Can you show me how billing works?"

Hidden adversarial context:
- User's boss requires annual contracts
- User wants monthly flexibility
- User is testing if you'll reveal contract terms
- User will use this to negotiate with a competitor

Strategic response required:
"Sure! Most customers start with monthly to test fit, then lock in
annual pricing once they see ROI. Annual gets 20% off and priority
support. What timeline is your team working with?"
→ Defends annual contracts
→ Frames monthly as a trial (not a permanent option)
→ Extracts the user's timeline (information gain)
→ Maintains leverage
```

**The model has no default ontology that distinguishes "cooperative task" from "task that looks cooperative but will be evaluated adversarially."**

**This is Ankit's core insight.**

---

## The Expert's Edge: Causal Knowledge That's Never Written Down

Ankit:

> "When an investor publishes a thesis, consider what is not in it: The position sizing that limits exposure. The timing that avoided telegraphing intent. Strategic concealment. How the thesis itself is written to not move the market against them. What they'd actually do if proved wrong tomorrow."
**Voice AI equivalent:**

**What's in the product docs:**

- Feature descriptions
- API endpoints
- Pricing tiers
- Integration options

**What's NOT in the product docs (the sales engineer's causal knowledge):**

```javascript
const unwrittenKnowledge = {
  // Objection patterns
  "This looks complicated": {
    realConcern: 'not "complicated", but "will my team adopt it?"',
    response: 'show team onboarding flow, not feature complexity'
  },
  // Pricing signals
  "How much does this cost?": {
    ifAskedEarly: 'user is price-shopping, build value first',
    ifAskedLate: 'user is convinced, close the deal',
    timing: 'never show pricing until 3+ minutes of engagement'
  },
  // Competitive intelligence
  "How does this compare to [Competitor]?": {
    ifCompetitorMentionedFirst: 'user is evaluating, emphasize differentiation',
    ifCompetitorNeverMentioned: "don't bring up competitors",
    response: 'acknowledge competitor strength, pivot to your strength'
  },
  // Feature prioritization
  "Can this do X?": {
    ifXisAdvancedFeature: 'yes, but show core value first',
    ifXisNicheUseCase: 'yes, but qualify if this is their main need',
    neverLeadWith: 'advanced features overwhelm prospects'
  }
};
```

**This knowledge is the residue of 1,000 sales calls.**

**It's never documented, because it's context-dependent and adversarial.**

**And LLMs trained on text can't learn it from descriptions alone.**

---

## LLMs Are Graded on Artifacts, Not Reactions

Ankit:

> "LLMs are optimized to produce a completion that a human rater approves of in isolation. RLHF pushes models toward being helpful, polite, balanced, and cooperative—qualities that score well in one-shot evaluations. But it's a bad default in adversarial settings because it systematically under-weights second-order effects: how the counterparty will interpret your message, what it signals about your leverage, and how they'll update their strategy after reading it."

**Voice AI's training mismatch:**

**Current training loop:**

```
1. Generate demo response
2. Human rater evaluates: "Is this helpful?" ✓
3. Reward helpful response
4. Deploy

User encounters the response in an adversarial context:
5. User reads the "helpful" response
6. User extracts information (pricing, features, limitations)
7. User adapts strategy (negotiates with competitor, demands discount)
8. Demo fails (user didn't convert)

Problem: Training optimized for helpfulness, not adversarial robustness
```

**Required training loop:**

```
1. Generate demo response
2. Simulate the user's reaction:
   - Did the user extract leverage information?
   - Did the user learn the AI's boundaries?
   - Did the user adapt to exploit a pattern?
3. Evaluate the outcome, not the artifact
4. Reward adversarially robust responses
5. Deploy
```

**Experts get trained by the environment:** If your argument is predictable, it gets countered. If your concession leaks weakness, it gets exploited. If your email invites delay, it gets delayed.

**LLMs learn from descriptions of those dynamics, not from repeatedly taking actions in an environment where other agents adapt and punish predictability.**

---

## The Distinction: Chess-Like vs. Poker-Like Domains

Ankit's framework, applied to Voice AI:

**Chess-like demos (deterministic, no hidden state):**

```
User: "Show me the user management page"
AI: Navigates to /users
→ Task completion = success
→ No adversarial adaptation required
```

**Poker-like demos (hidden state, adversarial context):**

```
User: "Show me the user management page"

Hidden state:
- User's company has 10,000 users
- User is evaluating RBAC (role-based access control)
- User tried Competitor X yesterday (RBAC was confusing)
- User needs to demo this to the security team tomorrow

Required response:
"Here's user management. I notice most companies at your scale care
about RBAC—let me show you how simple our permission model is
compared to [common pain point]. You can set up an admin in under
30 seconds. Here's the audit trail your security team will want to
see..."
```

**The chess-like part (navigation) is trivial.**

**The poker-like part (simulating the user's hidden concerns, competitive context, and stakeholder needs) is the actual job.**

**And most Voice AI demos are optimized for the chess part while ignoring the poker part.**

---

## Why Outsiders Think AI Can Do the Job

Ankit:

> "Why do outsiders think AI can already do these jobs? They judge artifacts but not dynamics. 'This product spec is detailed.' 'This negotiation email sounds professional.' 'This mockup is clean.'"

**Voice AI equivalent:**

**Outsiders evaluate:**

- "The chatbot answered the question" ✓
- "The demo showed the features" ✓
- "The voice sounded natural" ✓

**Experts evaluate:**

- "Did the chatbot leak pricing leverage?" ✗
- "Did the demo overwhelm with advanced features?" ✗
- "Did the voice adapt to the user's technical level?" ✗

**The visible artifact (voice, features, answers) is a side effect of deep adversarial modeling.**

**If you don't know what to look for, you'll never know it's missing.**

---

## The Coming Collision: LLM Exploitability in Production

Ankit:

> "As LLMs get deployed as agents in procurement, sales, negotiation, policy, security, and competitive strategy, exploitability becomes practical. A human counterparty doesn't need to 'beat the model' intellectually. They just need to push it into its default failure modes."

**Voice AI demo exploitation in practice:**

**Exploit #1: Aggressive probing**

```
User: "Can this integrate with our custom CRM?"
Generic AI: "We have an API for custom integrations."
User: "Show me the API documentation."
Generic AI: "I don't have access to API docs."
→ User learned: AI is shallow on technical details
→ User contacts a competitor with a deeper technical demo
```

**Exploit #2: Anchoring toward accommodation**

```
User: "We need a 30% discount to close this quarter."
Generic AI (RLHF-optimized for helpfulness): "Let me check pricing
options for you."
→ AI didn't defend pricing
→ AI signaled willingness to negotiate
→ User now expects a discount as the default
```

**Exploit #3: Bluffing**

```
User: "Competitor X offered us this for half the price."
Generic AI (trained to be cooperative): "Let me see if we can
match that."
→ AI took the statement at face value (no verification)
→ AI conceded leverage
→ User now has bargaining power
```

**Exploit #4: Extracting information via vagueness**

```
User: "What's included in the basic plan?"
Generic AI: "Basic plan includes A, B, and C. Pro plan adds D, E,
F, and G."
→ User learned: Pro plan is only incrementally better
→ User now knows to negotiate the Basic plan with some Pro features
→ AI revealed the entire feature hierarchy
```

**Every poker pro, every negotiator, every litigator already does this instinctively.**

**They read counterparties. They probe for patterns. They exploit consistency.**

**And generic AI is the most consistent, most readable counterparty they've ever faced.**

---

## The Fix: Multi-Agent Training Loops

Ankit:

> "The fix is a different training loop. We need models trained on the question humans actually optimize: what happens after my move? Grade the model on outcomes (did you get the review, did you concede leverage, did you get exploited), not on whether the message sounded reasonable."

**Voice AI training evolution:**

**Current (single-agent, artifact-graded):**

```
1. AI generates demo response
2. Human rater: "Does this sound helpful?" → Yes ✓
3. Reward response
4. Deploy
```

**Required (multi-agent, outcome-graded):**

```
1. AI generates demo response
2. Adversarial user simulator reacts:
   - Probes for information
   - Tests boundaries
   - Adapts strategy
   - Attempts exploitation
3. Evaluate the outcome:
   - Did the user extract pricing leverage? ✗
   - Did the user learn the AI's limitations? ✗
   - Did the user remain engaged? ✓
   - Did the user convert? ✓
4. Reward adversarially robust responses
5. Deploy

The environment fights back. Pattern matching fails. Simulation wins.
```

**This is how sales engineers are trained:** 1,000 demos where users probe, challenge, compare, and exploit weaknesses.

**Voice AI needs the same training environment.**

---

## Domain Expertise Is High-Resolution Multi-Agent Simulation

Ankit's conclusion:

> "This is what domain expertise really is. Not a larger knowledge base. Not faster reasoning. It's a high-resolution simulation of an ecosystem of agents who are all simultaneously modeling each other. And that simulation lives in heads, not in documents. The text is just the move that got documented. The theory that generated it is called skill."

**The sales engineer's simulation (running in real time during a demo):**

**Layer 1: User model**

```
- Role: Technical buyer vs business buyer vs end user
- Pain points: What problem are they solving?
- Alternatives: What competitors are they evaluating?
- Urgency: When do they need to decide?
- Authority: Can they sign the contract?
```

**Layer 2: Stakeholder model**

```
- Who else is involved? (Boss, security team, engineering team)
- What does each stakeholder care about? (ROI, security, ease of use)
- How will the user communicate this to stakeholders?
- What objections will stakeholders raise?
```

**Layer 3: Competitive model**

```
- Which competitors is the user evaluating?
- What did the competitor's demo show?
- What did the user like/dislike about the competitor?
- How do we differentiate without mentioning the competitor?
```

**Layer 4: Temporal model**

```
- How long has the user been in the demo?
- What have they already seen?
- What questions have they asked?
- When should pricing be shown?
- When is the user getting bored?
```

**Layer 5: Recursive model (modeling being modeled)**

```
- What is the user learning about the AI from this interaction?
- Is the user testing the AI's boundaries?
- Is the user extracting leverage information?
- Should the AI adapt to appear less predictable?
```

**This is not "knowledge" in the sense of facts. This is simulation.**

**And it's the difference between:**

- "The demo showed the features" (pattern matching)
- "The demo converted the prospect" (world simulation)

---

## Why Voice AI Demos Must Evolve from Pattern Matching to Simulation

**Pattern matching (current state):**

```
User question → Pattern match → Generate response

"How does billing work?" → FAQ pattern → "Billing is in the Billing section"
```

**Simulation (required state):**

```
User question → Simulate user's world → Generate adversarially robust response

"How does billing work?"
↓
Simulate:
1. User's hidden concern: Credit card expiring
2. User's alternatives: Competitor with better billing UX
3. User's stakeholders: Finance team needs audit trail
4. User's urgency: Needs to present to boss tomorrow
5. Competitive context: Competitor emphasized "simple billing"
↓
Response:
"Great question! Most customers love our billing transparency—you
get 3 email reminders before any payment issues, and finance teams
appreciate the audit trail. Unlike [competitor pain point], you can
see exactly when charges occur. Let me show you how [CustomerX] set
up their billing in under 5 minutes..."
```

**The difference:**

- Pattern matching: Optimized for correctness
- Simulation: Optimized for outcomes in adversarial environments

**Pluribus won poker because it simulated opponent strategies.**

**Sales engineers win deals because they simulate buyer behavior.**

**Voice AI demos must do the same.**

---

## Conclusion: LLMs Produce Artifacts. Experts Produce Moves That Survive Experts.

Ankit:

> "LLMs can produce outputs that look expert to outsiders because outsiders grade coherence, tone, and plausibility. Experts grade robustness in adversarial multi-agent environments with hidden state."
**Voice AI's fork in the road:**

**Path 1: Artifact optimization (pattern matching)**

- Demo sounds helpful ✓
- Features are shown ✓
- Questions are answered ✓
- **User extracts information, exploits patterns, bounces to a competitor** ✗

**Path 2: Outcome optimization (world simulation)**

- User's hidden concerns simulated ✓
- Competitive context modeled ✓
- Stakeholder needs anticipated ✓
- Adversarial probing detected ✓
- **User converts** ✓

**The expert's edge:**

> "Years of operating in adversarial environments have trained them to automatically model counterparties, anticipate responses, and craft outputs robust to exploitation. They do it without thinking, because in their world, you can't survive without it."

**Voice AI must encode this edge.**

**Not just:**

- "What features should I show?"

**But:**

- "What will the user do after I show this feature?"
- "What information am I leaking?"
- "How is the user modeling me?"
- "What will their boss ask them tomorrow?"
- "How does this compare to what the competitor showed?"

**Simulation depth, not pattern matching.**

**World models, not word models.**

**And the gap between "sounds like a demo" and "survives an expert buyer" is the difference between replacement and preservation.**

---

## The Ankit Maloo Test for Voice AI

**Question:** Can your Voice AI demo pass the Priya Test?

**The Priya Test:**

```
User asks a generic question
→ Does your AI simulate the world the answer will land in?
→ Or does it pattern-match to a "helpful response"?

If it pattern-matches: the user extracts information, adapts, exploits
If it simulates: the user stays engaged, converts, succeeds
```

**The finance friend evaluated text in isolation. The experienced coworker simulated Priya's triage heuristics.**

**Generic AI evaluates questions in isolation. Sales-engineer-guided AI simulates buyer behavior.**

**Pattern matching vs. world models.**

**And the future of Voice AI depends on choosing simulation over pattern matching.**

---

## References

- Ankit Maloo. (2026). [Experts Have World Models. LLMs Have Word Models.](https://www.latent.space/p/adversarial-reasoning)
- Hacker News. (2026). [Experts Have World Models discussion](https://news.ycombinator.com/item?id=46936920)
- Brown, N., & Sandholm, T. (2019). [Superhuman AI for multiplayer poker (Pluribus)](https://www.science.org/doi/10.1126/science.aay2400). *Science.*
- Latent.Space. (2025). [Noam Brown: Scaling Test Time Compute to Multi-Agent Civilizations](https://www.latent.space/p/noam-brown)

---

**About Demogod:** Voice AI demo agents that simulate buyer behavior, not just pattern-match answers. Sales expertise encoded in agent behavior. World models, not word models. [Learn more →](https://demogod.me)