
# Why Pydantic's Monty Shows Voice AI Demo Agents Need Sandboxing From Day One (And How to Execute User Commands Safely)

**Meta Description:** Pydantic built Monty—a minimal secure Python interpreter in Rust for AI-generated code. Starts in microseconds, blocks filesystem/network, snapshots execution. Voice AI demos need the same sandboxing: user-initiated actions can't access production data or break workflows.

---

## The 0.06ms Python Interpreter That Blocks Everything

From [Pydantic's Monty announcement](https://github.com/pydantic/monty) (112 points on HN, 5 hours old, 49 comments): **"A minimal, secure Python interpreter written in Rust for use by AI."**

Pydantic (creators of Pydantic AI, 1.3k GitHub stars in days) just released Monty—a Python interpreter that starts in **0.06 milliseconds**, blocks filesystem/network access by default, and lets you snapshot/resume execution at any point.

**Why build an entire Python interpreter from scratch?** Because LLMs that write code need sandboxing, and existing solutions don't work:

- **Docker**: 195ms startup (3,250x slower than Monty)
- **Pyodide**: 2,800ms WASM cold start (46,666x slower)
- **Sandboxing services**: 1,033ms network + container startup (17,216x slower)
- **YOLO Python (`exec()`)**: 0.1ms startup but zero security

**Monty's design:**

```python
import pydantic_monty

code = """
data = fetch(url)  # External function call pauses execution
len(data)
"""

m = pydantic_monty.Monty(code, inputs=['url'], external_functions=['fetch'])

# Start execution - pauses when fetch() is called
result = m.start(inputs={'url': 'https://example.com'})
print(result.function_name)  # 'fetch'
print(result.args)  # ('https://example.com',)

# Perform the actual fetch with YOUR control, then resume
result = result.resume(return_value='hello world')
print(result.output)  # 11
```

**The principle: the LLM writes code, Monty executes it safely, and external functions stay under your control.**

**This isn't just about LLM-generated code.** It's about voice AI demo agents too.
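Before turning to voice, it's worth making the pause-and-resume pattern above concrete. The control flow (not Monty's real implementation) can be mimicked in plain Python with a generator; `ExternalCall` and `run_sandboxed` are illustrative names, not part of `pydantic_monty`:

```python
# Sketch of Monty's pause/resume control flow, mimicked with a plain
# generator. The sandboxed code yields a request whenever it needs an
# external function; the HOST decides how that call actually runs.
from dataclasses import dataclass


@dataclass
class ExternalCall:
    function_name: str
    args: tuple


def agent_code(url):
    # Yielding here is the equivalent of Monty pausing at fetch()
    data = yield ExternalCall("fetch", (url,))
    return len(data)


def run_sandboxed(gen, handlers):
    """Drive the generator; each external call goes through the host's handlers."""
    try:
        request = next(gen)
        while True:
            result = handlers[request.function_name](*request.args)
            request = gen.send(result)  # resume with the host's return value
    except StopIteration as done:
        return done.value


handlers = {"fetch": lambda url: "hello world"}  # host-controlled fetch
print(run_sandboxed(agent_code("https://example.com"), handlers))  # 11
```

Any function name not in `handlers` simply fails, which is the whole point: the sandboxed code can request effects, but only the host can perform them.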
---

## The Voice AI Demo Sandboxing Problem No One's Solving

Voice AI demo agents face the exact same challenge as LLM code execution:

**What needs sandboxing:**

- User says "show me billing" → Agent navigates to `/settings/billing`
- User says "export my data" → Agent triggers export workflow
- User says "delete this report" → Agent initiates delete action
- User says "email this to sales@company.com" → Agent sends email

**How most SaaS companies handle Voice AI demos:**

```
1. Parse user voice command
2. LLM generates action plan
3. Execute action directly on production DOM
4. Hope nothing breaks
5. Hope user doesn't say something dangerous
```

**This is like running `exec()` on LLM-generated Python with full system access.**

**The failure modes are identical:**

**Monty problem: LLM generates malicious code**

```python
# LLM generates this
import os
os.system('rm -rf /')  # Destroys filesystem
```

**Voice AI problem: User issues malicious command**

```
User: "Delete all users in the database"
Agent (parsing): Detects "delete" + "users" + "database"
Agent (executing): Finds delete button, clicks it
Result: Production data destroyed
```

**Monty's answer: Block filesystem access, control external functions.**

**Voice AI's answer should be: Block dangerous actions, control execution boundaries.**

---

## What Monty Actually Does (And Why Voice AI Needs the Same)

Monty's architecture has three security layers that map directly to Voice AI demo needs:

### Layer 1: External Function Control (Pause on Dangerous Operations)

**Monty's approach:**

```python
code = """
async def agent(prompt, messages):
    while True:
        output = await call_llm(prompt, messages)  # Pauses here
        if isinstance(output, str):
            return output
        messages.extend(output)
"""

m = pydantic_monty.Monty(
    code,
    inputs=['prompt'],
    external_functions=['call_llm'],  # Only this function allowed
)

# Execution pauses at call_llm()
# Developer controls what happens
snapshot = m.start(inputs={'prompt': 'test'})
print(snapshot.function_name)  # 'call_llm'

# Developer verifies, then resumes
result = snapshot.resume(return_value='response')
```

**Voice AI equivalent:**

```javascript
// Voice AI demo with execution boundaries
class DemoAgent {
  async executeUserCommand(command) {
    const action = await this.parseCommand(command);  // Parse action intent

    if (action.type === 'delete') {
      // PAUSE execution, ask for confirmation
      return {
        paused: true,
        action: 'delete',
        target: action.target,
        requiresConfirmation: true
      };
    }

    if (action.type === 'email') {
      // PAUSE execution, verify recipient
      return {
        paused: true,
        action: 'email',
        recipient: action.recipient,
        requiresApproval: true
      };
    }

    // Safe actions proceed
    if (action.type === 'navigate' || action.type === 'show') {
      return this.executeNavigation(action);
    }
  }
}
```

**Monty pauses on external functions. Voice AI should pause on dangerous actions.**

### Layer 2: Resource Limits (Prevent Runaway Execution)

**Monty's limits:**

```python
from monty import MontyRun, ResourceTracker

tracker = ResourceTracker(
    max_memory_bytes=10_000_000,  # 10MB memory limit
    max_allocations=100_000,      # Max allocation count
    max_stack_depth=1000,         # Prevent infinite recursion
    max_execution_time_ms=5000    # 5 second timeout
)

runner = MontyRun(code, "script.py", ["x"], [])
result = runner.run([10], tracker)
```

**If a limit is exceeded:**

- Memory limit → execution stopped
- Stack depth → recursion prevented
- Execution time → timeout triggered

**Voice AI equivalent:**

```javascript
class DemoResourceTracker {
  constructor() {
    this.maxDOMOperations = 50;     // Max 50 DOM interactions
    this.maxNavigationSteps = 10;   // Max 10 page navigations
    this.maxExecutionTime = 30000;  // 30 second timeout
    this.maxAPICallsPerDemo = 20;   // Max 20 API requests
  }

  async trackExecution(demoSession) {
    const start = Date.now();
    let domOps = 0;
    let navSteps = 0;
    let apiCalls = 0;

    for await (const step of demoSession.steps()) {
      // Check execution time
      if (Date.now() - start > this.maxExecutionTime) {
        throw new DemoTimeoutError('Demo exceeded 30 second limit');
      }
      // Check DOM operations
      if (step.type === 'dom_interaction' && ++domOps > this.maxDOMOperations) {
        throw new DemoResourceError('Exceeded 50 DOM operations');
      }
      // Check navigation steps
      if (step.type === 'navigation' && ++navSteps > this.maxNavigationSteps) {
        throw new DemoResourceError('Exceeded 10 navigation steps');
      }
      // Check API calls
      if (step.type === 'api_call' && ++apiCalls > this.maxAPICallsPerDemo) {
        throw new DemoResourceError('Exceeded 20 API calls');
      }
    }
  }
}
```

**Monty prevents runaway code. Voice AI should prevent runaway demos.**

### Layer 3: Snapshotting (Pause, Resume, Fork)

**Monty's snapshot capability:**

```python
# Serialize parsed code to avoid re-parsing
m = pydantic_monty.Monty('x + 1', inputs=['x'])
data = m.dump()

# Later, restore and run
m2 = pydantic_monty.Monty.load(data)
print(m2.run(inputs={'x': 41}))  # 42

# Serialize execution state mid-flight
m = pydantic_monty.Monty('fetch(url)', inputs=['url'], external_functions=['fetch'])
progress = m.start(inputs={'url': 'https://example.com'})
state = progress.dump()

# Later, restore and resume (e.g., in a different process)
progress2 = pydantic_monty.MontySnapshot.load(state)
result = progress2.resume(return_value='response data')
```

**Voice AI equivalent:**

```javascript
class DemoSnapshot {
  async pauseDemo(session) {
    return {
      sessionId: session.id,
      currentState: {
        url: window.location.href,
        domSnapshot: this.captureDOM(),
        conversationHistory: session.messages,
        pendingAction: session.nextAction
      },
      pausedAt: Date.now(),
      resumeToken: this.generateResumeToken()
    };
  }

  async resumeDemo(snapshot) {
    // Restore session state
    await this.navigateTo(snapshot.currentState.url);
    await this.restoreDOM(snapshot.currentState.domSnapshot);
    await this.loadConversation(snapshot.currentState.conversationHistory);

    // Resume from the pending action
    if (snapshot.currentState.pendingAction) {
      return this.executePendingAction(snapshot.currentState.pendingAction);
    }
  }

  async forkDemo(snapshot, newIntent) {
    // Create two execution paths from one snapshot
    const original = await this.resumeDemo(snapshot);
    const forked = await this.resumeDemo(snapshot);

    // Execute different actions
    original.execute(snapshot.currentState.pendingAction);
    forked.execute(newIntent);

    // Compare outcomes
    return { original, forked };
  }
}
```

**Why this matters:**

- **Pause**: User says "wait" mid-demo → snapshot state, resume later
- **Resume**: User returns after 5 minutes → restore the exact demo state
- **Fork**: Test "what if the user said X instead" without re-running the entire demo

**Monty snapshots execution. Voice AI should snapshot demo sessions.**

---

## The Startup Time Problem: Why Docker Doesn't Work for Demos

**Monty's comparison table:**

| Solution | Startup Time | Security | Cost |
|----------|-------------|----------|------|
| Monty | 0.06ms | Strict | Free |
| Docker | 195ms | Good | Free |
| Pyodide | 2,800ms | Poor | Free |
| E2B/Modal | 1,033ms | Strict | Paid |
| YOLO `exec()` | 0.1ms | None | Free |

**Why startup time kills Voice AI demos:**

**Docker approach (195ms per command):**

```
User: "Show me billing"
→ Spin up Docker container: 195ms
→ Execute navigation: 50ms
→ Total: 245ms

User: "Now show reports"
→ Spin up Docker container: 195ms
→ Execute navigation: 50ms
→ Total: 245ms

10 commands = 2,450ms total, 1,950ms of it just container startup
```

**Monty approach (0.06ms per command):**

```
User: "Show me billing"
→ Start Monty interpreter: 0.06ms
→ Execute navigation: 50ms
→ Total: 50.06ms

User: "Now show reports"
→ Start Monty interpreter: 0.06ms
→ Execute navigation: 50ms
→ Total: 50.06ms

10 commands = 0.6ms interpreter startup + 500ms execution = 500.6ms total
```

**Docker adds 1,950ms (1.95 seconds) of latency over 10 commands.**

**Voice AI demos need to feel instant.**
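The arithmetic behind that comparison takes only a few lines to check. A minimal sketch; the 50 ms navigation cost is the figure assumed in the walkthrough above, and `demo_latency_ms` is an illustrative helper:

```python
# Back-of-envelope check of cumulative perceived latency for a demo
# session of n voice commands. Startup figures come from Monty's
# published comparison; 50 ms per action is this article's assumption.
def demo_latency_ms(n_commands, startup_ms, action_ms=50.0):
    """Total latency: each command pays sandbox startup plus the action itself."""
    return n_commands * (startup_ms + action_ms)


docker = demo_latency_ms(10, 195.0)  # 2450.0 ms
monty = demo_latency_ms(10, 0.06)    # ~500.6 ms
print(f"Docker overhead across 10 commands: {docker - monty:.1f} ms")
```

The fixed per-action cost dominates the embedded-sandbox case, while Docker's startup cost is roughly four times the useful work per command.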
**195ms per command destroys that feeling.**

---

## Why "Just Use Pyodide" Doesn't Work Either

**Pyodide's security problem**, from Monty's comparison:

> "Relies on browser/WASM sandbox - not designed for server-side isolation, python code can run arbitrary code in the JS runtime, only deno allows isolation, memory limits are hard/impossible to enforce with deno"

**Voice AI runs server-side, not browser-side.**

**The attack vector:**

```javascript
// User issues a voice command: "Calculate 2+2"
// The LLM generates Python that escapes into the JS runtime:
const llmGeneratedCode = `
import js
js.eval('fetch("/api/users").then(r => r.json()).then(console.log)')
`;

// Pyodide executes it
pyodide.runPython(llmGeneratedCode);

// Result: Pyodide Python calls JS, JS calls the production API, data leaks
```

**Pyodide's 2,800ms cold start makes it unusable anyway.**

**Monty's answer:**

- No access to browser APIs (runs server-side in Rust)
- No access to the filesystem (explicitly controlled)
- No access to the network (external functions only)

**Voice AI needs server-side sandboxing, not browser WASM.**

---

## The Real Cost: Production Failures vs Sandbox Overhead

**Why Pydantic built Monty instead of using Docker:**

**Docker costs (per AI code execution):**

```
Container startup: 195ms
Container teardown: 50ms
Image storage: 50MB (python:3.14-alpine)
Orchestration complexity: High
Network isolation setup: Moderate
File mounting overhead: Low
Cold start penalty: 195ms every time
```

**Total cost: ~245ms per execution + infrastructure complexity.**

**Monty costs (per AI code execution):**

```
Interpreter startup: 0.06ms
Interpreter teardown: 0ms (just drop the object)
Binary size: 4.5MB (pip install pydantic-monty)
Orchestration complexity: None
Security isolation: Built-in
File mounting: Controlled via external functions
Cold start penalty: 0.06ms every time
```

**Total cost: 0.06ms + a simple API.**

**Voice AI demo cost comparison:**

**Approach #1: No sandboxing (YOLO mode)**

```
User: "Delete all reports"
Agent (parsing): Detects delete action
Agent (executing): Finds delete button, clicks it
Result: Production reports deleted
Cost: $0 infrastructure, $100K+ in lost data
```

**Approach #2: Docker sandbox**

```
User: "Show me billing"
→ Spin up container: 195ms
→ Execute in sandbox: 50ms
→ User perceives lag
Cost: 245ms latency, medium infrastructure complexity
```

**Approach #3: Monty-style embedded sandbox**

```
User: "Show me billing"
→ Start sandbox: 0.06ms
→ Execute with boundaries: 50ms
→ User perceives instant response
Cost: 50.06ms latency, low infrastructure complexity
```

**The ROI:**

- **No sandbox**: $0 upfront, $100K+ in production failures
- **Docker sandbox**: Moderate cost, noticeable latency
- **Embedded sandbox**: Minimal cost, imperceptible latency

**Monty chose embedded. Voice AI should too.**

---

## Building a Voice AI Demo Sandbox: The Monty Architecture

**Monty's design for LLM-generated code:**

```
1. Parse code into AST (one-time)
2. Execute with controlled external functions
3. Pause at external function calls
4. Developer approves/modifies
5. Resume execution
6. Snapshot state at any point
```

**Voice AI equivalent for user commands:**

### Layer 1: Command Parsing (AST Equivalent)

```javascript
class DemoCommandParser {
  parse(voiceCommand) {
    const intent = this.extractIntent(voiceCommand);
    const entities = this.extractEntities(voiceCommand);

    return {
      intent: intent.type,          // 'navigate', 'delete', 'export', etc.
      target: entities.target,      // 'billing', 'users', 'reports'
      parameters: entities.params,  // { format: 'csv', recipient: 'email@...' }
      riskLevel: this.assessRisk(intent, entities)
    };
  }

  assessRisk(intent, entities) {
    const dangerousIntents = ['delete', 'modify', 'email', 'export'];
    const productionTargets = ['users', 'database', 'settings'];

    if (dangerousIntents.includes(intent.type)) {
      return 'high';
    }
    if (productionTargets.includes(entities.target)) {
      return 'medium';
    }
    return 'low';
  }
}
```

### Layer 2: Execution with Boundaries (External Function Equivalent)

```javascript
class DemoExecutor {
  async execute(command) {
    // High-risk commands require approval
    if (command.riskLevel === 'high') {
      return {
        status: 'paused',
        reason: 'high_risk_action',
        action: command,
        requiresApproval: true
      };
    }

    // Medium-risk commands log but proceed
    if (command.riskLevel === 'medium') {
      await this.logAction(command);
    }

    // Execute within resource limits
    const tracker = new DemoResourceTracker();
    return await tracker.trackExecution(() => {
      return this.executeCommand(command);
    });
  }

  async executeCommand(command) {
    switch (command.intent) {
      case 'navigate':
        return this.navigateTo(command.target);
      case 'export':
        // Pause for approval
        return this.pauseForApproval('export', command.parameters);
      case 'delete':
        // Block completely in demo mode
        throw new DemoSecurityError('Delete operations disabled in demo');
      case 'show':
        return this.showElement(command.target);
    }
  }
}
```

### Layer 3: Snapshot and Resume (State Management)

```javascript
class DemoStateManager {
  async snapshot(session) {
    return {
      code: session.currentCommand,
      inputs: session.commandParameters,
      progress: {
        stepsCompleted: session.steps.filter(s => s.completed),
        currentStep: session.currentStep,
        nextStep: session.nextStep
      },
      state: {
        url: window.location.href,
        dom: this.captureDOMState(),
        conversation: session.messages
      },
      timestamp: Date.now()
    };
  }

  async resume(snapshot) {
    // Restore the environment
    await this.restoreEnvironment(snapshot.state);
    // Resume from the current step
    return this.executeFromStep(snapshot.progress.currentStep);
  }

  async fork(snapshot, alternativeCommand) {
    // Original path
    const original = await this.resume(snapshot);

    // Alternative path
    const forked = structuredClone(snapshot);
    forked.code = alternativeCommand;
    const alternative = await this.resume(forked);

    // Compare outcomes
    return { original, alternative };
  }
}
```

**Just like Monty controls LLM code execution, Voice AI controls demo command execution.**

---

## The Production Safety Principle

**Monty's core principle:**

> "To run code written by agents... LLMs can work faster, cheaper and more reliably if they're asked to write Python code, instead of relying on traditional tool calling. Monty makes that possible without the complexity of a sandbox or risk of running code directly on the host."

**Voice AI's core principle should be:**

> "To execute commands issued by users... Voice demos can work faster, cheaper and more reliably if they execute user commands directly, instead of relying on pre-defined workflows. Sandboxing makes that possible without the complexity of Docker or the risk of running commands on production."

**The parallel is exact:**

| Monty (LLM Code) | Voice AI (User Commands) |
|------------------|--------------------------|
| LLM generates Python | User issues voice command |
| Monty parses code | Agent parses intent |
| External functions pause | Dangerous actions pause |
| Developer approves | Demo mode blocks |
| Resume execution | Execute safely |
| Snapshot state | Snapshot session |

**Both need:**

1. **Parsing**: Understand what's being requested
2. **Boundaries**: Control what's allowed
3. **Pausing**: Stop at dangerous points
4. **Resources**: Limit execution scope
5. **Snapshotting**: Pause/resume/fork

**Monty proves this architecture works for AI-generated code.**

**Voice AI needs to prove it works for user-generated commands.**

---

## Conclusion: The Monty Lesson for Voice AI Demos

Pydantic built Monty because existing solutions for running LLM-generated code were too slow (Docker at 195ms), too insecure (Pyodide), or too complex (sandboxing services).

**Monty's answer:**

- 0.06ms startup (3,250x faster than Docker)
- Strict security (filesystem/network blocked)
- Simple integration (pip install pydantic-monty)
- Snapshotting (pause/resume/fork)

**Voice AI demo agents face the same problem:**

- Running user commands on production is too dangerous
- Docker sandboxing is too slow for real-time demos
- YOLO execution destroys data

**Voice AI's answer should be:**

- Embedded sandboxing (a Monty-style interpreter)
- Risk assessment (pause on dangerous actions)
- Resource tracking (prevent runaway demos)
- State snapshotting (pause/resume sessions)

**The principle Monty teaches: don't choose between speed and security. Build infrastructure that delivers both.**

Monty proves that 0.06ms startup plus strict security is achievable for AI-generated code. Voice AI demos need to prove the same for user-generated commands.

**Because the alternative—choosing between fast-but-dangerous and slow-but-safe—isn't a choice at all.**

**It's a compromise that kills products.**

---

## References

- Pydantic. (2026). [Monty: A minimal, secure Python interpreter written in Rust for use by AI](https://github.com/pydantic/monty)
- Cloudflare. (2025). [Code Mode: LLMs Writing Code Instead of JSON](https://blog.cloudflare.com/code-mode/)
- Anthropic. (2025). [Programmatic Tool Calling](https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling)
- Anthropic. (2026). [Code Execution with MCP](https://www.anthropic.com/engineering/code-execution-with-mcp)

---

**About Demogod:** Voice AI demo agents built with sandboxed execution from day one. Command parsing, risk assessment, resource tracking, state snapshotting. Execute user commands safely without Docker latency or YOLO risks. Speed and security, not compromise. [Learn more →](https://demogod.me)