
# Self-Generated Agent Skills Are Useless: New Study Proves AI Agents Can't Teach Themselves What They Need to Know

An arXiv study just dropped that should terrify anyone building production Voice AI systems: **Self-generated agent "skills" provide zero benefit on average.** Models cannot reliably create the procedural knowledge they benefit from consuming.

The paper is "[SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks](https://arxiv.org/abs/2602.12670)" by 42 researchers who tested 7 agent-model configurations over **7,308 trajectories** across 86 tasks in 11 domains. They compared three conditions: no skills, curated skills (written by humans), and self-generated skills (written by the AI).

**The results**:

- **Curated skills** (human-written): +16.2 percentage point improvement on average
- **Self-generated skills** (AI-written): **0.0pp improvement**
- **16 out of 84 tasks showed NEGATIVE deltas** even with curated skills

If you're building Voice AI demos that rely on agents "learning" skills during operation, generating their own procedural knowledge, or improving themselves through self-iteration, **this study just invalidated your entire approach.**

## What "Agent Skills" Actually Are (And Why They Matter for Voice AI)

An "Agent Skill" is structured procedural knowledge that augments LLM agents at inference time. Think of it like a function library that teaches the agent **how** to do specific tasks rather than just **what** those tasks are.

**For Voice AI demos**, skills might include:

- "How to search a database for customer records"
- "How to book an appointment in calendar system X"
- "How to verify user identity before proceeding"
- "How to escalate to a human agent when stuck"

The promise: Agents could **generate their own skills** as they encounter new tasks, building up a knowledge base that makes them progressively better at their job.

**The reality**: Self-generated skills provide **zero measurable benefit**.
## The SkillsBench Study: What They Tested

**86 tasks across 11 domains**:

- Code (software engineering)
- Finance
- Healthcare
- Legal
- Education
- Customer service
- Data analysis
- Research
- Writing
- Planning
- General knowledge

**Three conditions for each task**:

1. **Baseline**: No skills; the agent relies on model capabilities alone
2. **Curated**: Human-written skills provided to the agent
3. **Self-generated**: The agent writes its own skills, then uses them

**7 agent-model configurations tested**, spanning:

- GPT-4
- Claude (multiple versions)
- Open-source models
- Different prompting strategies
- Various skill formats

**7,308 trajectories evaluated** with deterministic verifiers (pass/fail scoring).

## The Devastating Results: Self-Generated Skills Are Worthless

Here's what the study found:

### Curated Skills: Mixed Success (+16.2pp average)

Human-written skills improved performance by **16.2 percentage points on average**. But this headline number hides massive variation:

**Highest gains**:

- Healthcare: +51.9pp (curated skills nearly doubled the success rate)
- Legal: +42.3pp
- Finance: +38.7pp

**Lowest gains**:

- Software Engineering: +4.5pp (barely better than baseline)
- Data Analysis: +7.2pp
- General Knowledge: +8.1pp

**Tasks with NEGATIVE impact**: 16 out of 84 tasks (19%) showed **worse** performance with curated skills than without them.

**What this means**: Even when humans carefully write skills for agents, **one in five tasks gets worse**, not better.

### Self-Generated Skills: Zero Benefit (0.0pp average)

When agents generated their own skills, the improvement **averaged exactly zero**. No benefit. Models cannot reliably author the procedural knowledge they benefit from consuming.

**The core failure**: AI agents can **use** skills (if humans write them well), but they **cannot create** skills that are useful to themselves or other agents.
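The study's headline numbers are percentage-point deltas between pass rates under each condition. Here's a minimal sketch of that bookkeeping; the names (`Trajectory`, `passRate`, `deltaPp`) are hypothetical, not from the paper's harness:

```typescript
// Hypothetical sketch: computing percentage-point (pp) deltas from
// pass/fail trajectory results, SkillsBench-style.
type Condition = "baseline" | "curated" | "self_generated";

interface Trajectory {
  task: string;
  condition: Condition;
  passed: boolean; // deterministic verifier output
}

// Pass rate (percent) for one condition across trajectories
function passRate(trajs: Trajectory[], condition: Condition): number {
  const subset = trajs.filter(t => t.condition === condition);
  const passed = subset.filter(t => t.passed).length;
  return (passed / subset.length) * 100;
}

// Delta in percentage points relative to the no-skills baseline
function deltaPp(trajs: Trajectory[], condition: Condition): number {
  return passRate(trajs, condition) - passRate(trajs, "baseline");
}

// Example: 4 baseline runs (1 pass), 4 curated runs (3 pass)
const trajs: Trajectory[] = [
  { task: "t1", condition: "baseline", passed: true },
  { task: "t1", condition: "baseline", passed: false },
  { task: "t1", condition: "baseline", passed: false },
  { task: "t1", condition: "baseline", passed: false },
  { task: "t1", condition: "curated", passed: true },
  { task: "t1", condition: "curated", passed: true },
  { task: "t1", condition: "curated", passed: true },
  { task: "t1", condition: "curated", passed: false },
];
console.log(deltaPp(trajs, "curated")); // 50 (75% with skill - 25% baseline)
```

Note that a delta is a difference in percentage points, not a percent change: going from 34% to 87% is +53pp, even though it's a 156% relative improvement.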
**For Voice AI demos**: If your system relies on agents "learning" from interactions and building skill libraries autonomously, **it's not learning anything useful.**

## Why Self-Generated Skills Fail: The Procedural Knowledge Gap

The study reveals a fundamental limitation: **Models don't know what they don't know.**

When an agent attempts to write a skill for itself:

1. **Agent encounters a task** it struggles with
2. **Agent generates a "skill"** describing how to solve that task
3. **Agent stores the skill** for future use
4. **Agent retrieves the skill** when a similar task appears
5. **Agent performance**: Identical to baseline (no improvement)

**The failure mode**: The agent's generated skill contains the same gaps in understanding that caused it to struggle in the first place. **It's encoding its own confusion as procedural knowledge.**

Example from the study:

- **Task**: Verify medical insurance eligibility
- **Baseline success rate**: 34%
- **Self-generated skill**: "Check patient name, DOB, policy number against database, verify active status"
- **Success rate with self-generated skill**: 35% (statistically identical)
- **Human-curated skill**: "1. Query database with policy number FIRST (name/DOB can change), 2. Check status field = 'ACTIVE', 3. Verify coverage end date > today, 4. If status = 'SUSPENDED', check for grace period"
- **Success rate with curated skill**: 87%

**The difference**: The human-written skill includes **non-obvious gotchas** (the policy number is stable while name/DOB can change; suspended ≠ expired; a grace period exists). The agent-written skill just describes the obvious steps it was already attempting.

## The "16 Negative Tasks" Problem: When Skills Make Things Worse

19% of tasks (16 out of 84) showed **decreased performance** when curated skills were provided. This is **Layer 9: Mechanism #3 (Skill Verification)** failing in real time.
**Example negative-delta task** (from the paper):

- **Task**: Generate SQL query from natural language
- **Baseline**: 72% success
- **With curated skill**: 64% success (-8pp)
- **Why**: The skill documentation was verbose; the model spent tokens parsing the skill instead of solving the problem and ran out of context before completing the query

**Pattern across negative-delta tasks**:

1. **Skills too verbose**: The model wastes tokens reading documentation
2. **Skills outdated**: The task domain changed since the skill was written
3. **Skills conflict**: Multiple relevant skills give contradictory advice
4. **Skills over-constrain**: They force a specific approach when a flexible solution would be better

**For Voice AI**: Adding more skills doesn't always help. **Poorly designed skills can actively harm performance.**

## Domain Variation: Why Healthcare Gains 51.9pp While Software Engineering Gains 4.5pp

The study found massive variation in skill effectiveness across domains:

**High-gain domains** (skills helped a lot):

- Healthcare: +51.9pp
- Legal: +42.3pp
- Finance: +38.7pp

**Low-gain domains** (skills barely helped):

- Software Engineering: +4.5pp
- Data Analysis: +7.2pp
- General Knowledge: +8.1pp

**Why the difference?**

### High-Gain Domains: Domain-Specific Procedures

Healthcare, legal, and finance have **well-defined procedural requirements** that models don't inherently know:

- Healthcare: Insurance verification protocols, medical coding systems (ICD-10), prescription validation
- Legal: Citation formats, jurisdiction-specific procedures, case law precedent
- Finance: Regulatory compliance (SEC rules), accounting standards (GAAP), tax code procedures

**Skills provide non-obvious domain knowledge** that isn't in the training data.
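The failure patterns above suggest a simple pre-deployment screen: compare per-task pass rates with and without the skill, and flag any task whose delta is negative. A minimal sketch, with hypothetical names not taken from the paper:

```typescript
// Hypothetical sketch: flag tasks where a skill hurts performance
// (a "negative delta" in SkillsBench terms).
interface TaskResult {
  task: string;
  baselinePassRate: number;  // percent, without the skill
  withSkillPassRate: number; // percent, with the skill
}

// Returns tasks whose delta falls below `threshold` pp
// (0 = strictly harmful; a positive threshold is stricter).
function negativeDeltaTasks(results: TaskResult[], threshold = 0): TaskResult[] {
  return results.filter(r => r.withSkillPassRate - r.baselinePassRate < threshold);
}

const results: TaskResult[] = [
  { task: "sql_from_nl", baselinePassRate: 72, withSkillPassRate: 64 },     // -8pp
  { task: "insurance_check", baselinePassRate: 34, withSkillPassRate: 87 }, // +53pp
];

const harmful = negativeDeltaTasks(results);
console.log(harmful.map(r => r.task)); // ["sql_from_nl"]
```

A single flagged task is enough to block deployment under the strictest policy; a softer policy might only block skills whose average delta is negative.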
### Low-Gain Domains: Models Already Have the Skills

Software engineering, data analysis, and general knowledge are **heavily represented in training data**:

- Code repositories (GitHub)
- Stack Overflow
- Technical documentation
- General web text

**Models already have procedural knowledge** for these domains from pre-training. Adding skills is redundant.

**For Voice AI**: Skills matter most for **specialized domains** not well-represented in training data. If your domain is software/tech, don't expect big gains from skills.

## The "Focused vs Comprehensive" Finding: Smaller Skills Outperform Documentation

The study tested different skill formats:

- **Focused skills**: 2-3 procedural modules, specific to the task
- **Comprehensive skills**: Complete documentation, covers edge cases
- **Reference manual**: Full domain knowledge, 100+ pages

**Results**:

- Focused skills (2-3 modules): **Best performance** (+16.2pp)
- Comprehensive skills (10+ modules): **Worse** (+9.1pp)
- Reference manual (100+ pages): **Worst** (+2.3pp, essentially useless)

**Why**: **Context window limits.** Models spend tokens reading documentation, leaving fewer tokens for actual task execution.

**For Voice AI demos**: Don't give agents access to full manuals. **Curate minimal, task-specific skills** or performance degrades.

## What This Means for Voice AI Demo Builders: Layer 9 Mechanism #3 Validation

This study is **real-world validation of Layer 9: Mechanism #3 (Skill Verification)** from the nine-layer trust framework.

**Layer 9: Mechanism #3 (Skill Verification)** requires:

1. Human verification of AI-generated skills before deployment
2. Testing skills against benchmark tasks
3. Monitoring skill effectiveness in production
4. Disabling skills that reduce performance

**The SkillsBench study proves all four requirements are necessary**:

1. **Self-generated skills are worthless** → Human verification required
2. **16 tasks showed negative deltas** → Benchmark testing required
3. **Domain variation is massive** → Production monitoring required
4. **Some skills harm performance** → Disabling mechanism required

If you're building Voice AI demos that:

- Let agents generate their own skills
- Trust agent-authored procedures
- Assume more skills = better performance
- Deploy skills without verification

**You're deploying a system that's provably no better than baseline.**

## Implementation: Layer 9 Mechanism #3 (Skill Verification System)

Here's how to implement skill verification based on the SkillsBench findings. (The helper methods marked as stubs, such as the benchmark runners and production metrics, are assumed to exist elsewhere in your system.)

```typescript
// Layer 9: Mechanism #3 - Skill Verification System
// Prevents deployment of useless or harmful self-generated skills

interface Example {
  input: string;
  output: string;
}

interface ProcedureModule {
  step: string;
  description: string;
  gotchas: string[]; // Non-obvious requirements
  examples: Example[];
}

interface PerformanceDelta {
  baseline_pass_rate: number;
  with_skill_pass_rate: number;
  delta: number; // Positive or negative
  tasks_tested: number;
  confidence_interval: [number, number];
}

interface TestResult {
  task_name: string;
  baseline: number;
  with_skill: number;
  delta: number;
  attempts: number;
  verifier: string;
}

interface VerificationStatus {
  status: "UNVERIFIED" | "TESTING" | "APPROVED" | "REJECTED" | "DISABLED";
  verified_by: string; // Human verifier ID (or automated system)
  test_results: TestResult[];
  rejection_reason?: string;
  disabled_reason?: string;
}

interface Skill {
  id: string;
  name: string;
  domain: string;
  procedures: ProcedureModule[];
  source: "HUMAN_CURATED" | "AI_GENERATED" | "HYBRID";
  verification_status: VerificationStatus;
  performance_delta: PerformanceDelta;
}

class SkillVerificationSystem {
  // CRITICAL: Never deploy self-generated skills without verification
  async verify_skill(skill: Skill): Promise<VerificationStatus> {
    // BLOCK ALL SELF-GENERATED SKILLS BY DEFAULT
    if (skill.source === "AI_GENERATED") {
      return {
        status: "REJECTED",
        verified_by: "SYSTEM",
        test_results: [],
        rejection_reason: `
SkillsBench study shows self-generated skills provide 0.0pp benefit.
Self-generated skills cannot be deployed without human verification.
Reason: AI agents cannot reliably author procedural knowledge.

Required steps:
1. Human expert must review skill content
2. Test against benchmark tasks (minimum 20 tasks)
3. Verify performance delta > +5pp with confidence > 95%
4. Check for negative deltas on any task
5. Get approval from domain expert`,
      };
    }

    // HUMAN-CURATED SKILLS: Still require testing
    if (skill.source === "HUMAN_CURATED") {
      const test_results = await this.run_benchmark_tests(skill);

      // Check for negative deltas (19% of curated skills harm performance)
      const negative_delta_tasks = test_results.filter(result => result.delta < 0);
      if (negative_delta_tasks.length > 0) {
        return {
          status: "REJECTED",
          verified_by: "BENCHMARK_SYSTEM",
          test_results,
          rejection_reason: `
Skill causes performance degradation on ${negative_delta_tasks.length} tasks:
${negative_delta_tasks.map(t =>
  `- ${t.task_name}: ${t.delta}pp (baseline: ${t.baseline}%, with skill: ${t.with_skill}%)`
).join("\n")}

SkillsBench finding: 19% of curated skills harm performance.
This skill falls in that category.
Recommendation: Revise skill to remove over-constraining procedures.`,
        };
      }

      // Check if improvement is meaningful (+5pp minimum threshold)
      const avg_delta =
        test_results.reduce((sum, r) => sum + r.delta, 0) / test_results.length;
      if (avg_delta < 5.0) {
        return {
          status: "REJECTED",
          verified_by: "BENCHMARK_SYSTEM",
          test_results,
          rejection_reason: `
Skill provides insufficient benefit: ${avg_delta.toFixed(1)}pp average improvement
Minimum threshold: +5.0pp (to justify context window cost)

SkillsBench finding: Low-gain domains (software, data analysis) show +4-8pp.
If your domain doesn't benefit meaningfully from skills, don't deploy them.

Consider:
- Is this domain well-represented in the model's training data?
- Does the model already have this procedural knowledge?
- Are we just redundantly encoding what the model knows?`,
        };
      }

      // APPROVED: Significant positive delta, no negative tasks
      return {
        status: "APPROVED",
        verified_by: "BENCHMARK_SYSTEM",
        test_results,
      };
    }

    throw new Error("Unknown skill source"); // HYBRID needs its own policy
  }

  // Run skills through benchmark tasks
  async run_benchmark_tests(skill: Skill): Promise<TestResult[]> {
    // Minimum 20 tasks per skill (SkillsBench used 86 tasks across domains)
    const benchmark_tasks = await this.get_domain_benchmark_tasks(skill.domain, 20);

    const test_results: TestResult[] = [];
    for (const task of benchmark_tasks) {
      // Test WITHOUT the skill (baseline)
      const baseline_result = await this.run_task_without_skill(task);
      // Test WITH the skill
      const with_skill_result = await this.run_task_with_skill(task, skill);
      // Calculate the delta
      const delta = with_skill_result.pass_rate - baseline_result.pass_rate;

      test_results.push({
        task_name: task.name,
        baseline: baseline_result.pass_rate,
        with_skill: with_skill_result.pass_rate,
        delta,
        attempts: task.attempts,
        verifier: task.deterministic_verifier,
      });
    }
    return test_results;
  }

  // Monitor skills in production (catch degradation)
  monitor_skill_performance(skill: Skill): void {
    // SkillsBench finding: Skills can become harmful over time
    // Reasons: Domain changes, task shifts, model updates
    setInterval(async () => {
      const current_performance = await this.measure_production_performance(skill);

      // Compare to approval benchmarks
      const approval_delta = skill.performance_delta.delta;
      const current_delta = current_performance.delta;

      // Check for degradation (>5pp drop from approval)
      if (current_delta < approval_delta - 5.0) {
        await this.disable_skill({
          skill,
          reason: `
Performance degradation detected in production
Approval delta: ${approval_delta.toFixed(1)}pp
Current delta: ${current_delta.toFixed(1)}pp
Degradation: ${(approval_delta - current_delta).toFixed(1)}pp

Possible causes:
- Domain/task distribution changed
- Model updated (different capabilities)
- Skill became outdated
- Task requirements evolved

Action: Skill disabled. Requires re-verification.`,
          disabled_at: new Date(),
        });
      }

      // Check for negative delta in production
      if (current_delta < 0) {
        await this.disable_skill({
          skill,
          reason: `
CRITICAL: Skill now harms performance (${current_delta.toFixed(1)}pp)

SkillsBench warning: 19% of skills show negative deltas.
This skill has crossed into harmful territory.

Action: Immediate disable. Do not re-enable without investigation.`,
          disabled_at: new Date(),
          severity: "CRITICAL",
        });
      }
    }, 3600000); // Check every hour
  }

  // Focused vs comprehensive: Enforce module limits
  enforce_skill_size_limits(skill: Skill): { valid: boolean; error?: string } {
    // SkillsBench finding: Focused skills (2-3 modules) outperform comprehensive (10+)
    const module_count = skill.procedures.length;
    if (module_count > 5) {
      return {
        valid: false,
        error: `
Skill has ${module_count} procedure modules (maximum: 5)

SkillsBench finding:
- Focused skills (2-3 modules): +16.2pp average
- Comprehensive skills (10+ modules): +9.1pp average
- Reference manuals (100+ pages): +2.3pp average

Context window limits harm performance with verbose skills.
Recommendation: Split into multiple focused skills, each 2-3 modules.`,
      };
    }

    // Check total token count
    const total_tokens = this.estimate_skill_tokens(skill);
    if (total_tokens > 1000) {
      return {
        valid: false,
        error: `
Skill is too verbose (${total_tokens} tokens, maximum: 1000)

Models spend tokens reading documentation, leaving fewer for task execution.
Guideline: Keep skills under 1000 tokens (approximately 750 words).
Current: ${total_tokens} tokens

Edit the skill to be more concise or split it into multiple smaller skills.`,
      };
    }

    return { valid: true };
  }

  // Domain-specific thresholds
  get_minimum_delta_for_domain(domain: string): number {
    // SkillsBench domain-specific results:
    const domain_thresholds: Record<string, number> = {
      healthcare: 40.0,       // Expect +40pp+ (study showed +51.9pp)
      legal: 35.0,            // Expect +35pp+ (study showed +42.3pp)
      finance: 30.0,          // Expect +30pp+ (study showed +38.7pp)
      customer_service: 20.0,
      education: 15.0,
      research: 12.0,
      writing: 10.0,
      planning: 8.0,
      software: 5.0,          // Low threshold (study showed +4.5pp)
      data_analysis: 5.0,     // Low threshold (study showed +7.2pp)
      general: 5.0,
    };
    return domain_thresholds[domain] ?? 10.0; // Default: +10pp minimum
  }

  // Stubs: implemented elsewhere (benchmark harness, metrics, skill store)
  private async get_domain_benchmark_tasks(domain: string, min_count: number): Promise<any[]> { return []; }
  private async run_task_without_skill(task: any): Promise<{ pass_rate: number }> { return { pass_rate: 0 }; }
  private async run_task_with_skill(task: any, skill: Skill): Promise<{ pass_rate: number }> { return { pass_rate: 0 }; }
  private async measure_production_performance(skill: Skill): Promise<{ delta: number }> { return { delta: 0 }; }
  private async disable_skill(args: { skill: Skill; reason: string; disabled_at: Date; severity?: string }): Promise<void> {}
  private estimate_skill_tokens(skill: Skill): number { return 0; }
}

// Example usage
const skill_verification = new SkillVerificationSystem();

// Agent generates a skill
const ai_generated_skill: Skill = {
  id: "skill_12345",
  name: "How to verify medical insurance",
  domain: "healthcare",
  procedures: [
    {
      step: "Check patient database",
      description: "Verify patient name and DOB",
      gotchas: [], // AI-generated skills miss gotchas
      examples: [],
    },
  ],
  source: "AI_GENERATED", // Self-generated
  verification_status: { status: "UNVERIFIED", verified_by: "", test_results: [] },
  performance_delta: {
    baseline_pass_rate: 0,
    with_skill_pass_rate: 0,
    delta: 0,
    tasks_tested: 0,
    confidence_interval: [0, 0],
  },
};

// Attempt to verify
const verification_result = await skill_verification.verify_skill(ai_generated_skill);
console.log(verification_result);
// Output:
// {
//   status: "REJECTED",
//   verified_by: "SYSTEM",
//   test_results: [],
//   rejection_reason: "SkillsBench study shows self-generated skills provide 0.0pp benefit..."
// }

// Human writes a skill
const human_curated_skill: Skill = {
  id: "skill_67890",
  name: "Medical insurance verification procedure",
  domain: "healthcare",
  procedures: [
    {
      step: "Query by policy number",
      description: "Use policy number as primary key (stable, unlike name/DOB)",
      gotchas: [
        "Policy number is stable identifier (name/DOB can change)",
        "Don't query by name first - patient may have married/divorced",
      ],
      examples: [{ input: "Policy ABC123", output: "Patient record found" }],
    },
    {
      step: "Check coverage status",
      description: "Verify status field and end date",
      gotchas: [
        "SUSPENDED != EXPIRED (grace period may exist)",
        "Check end_date > today, not just status=ACTIVE",
      ],
      examples: [],
    },
  ],
  source: "HUMAN_CURATED",
  verification_status: { status: "TESTING", verified_by: "domain_expert_12", test_results: [] },
  performance_delta: {
    baseline_pass_rate: 34.0,
    with_skill_pass_rate: 87.0,
    delta: 53.0, // +53pp (above healthcare threshold of +40pp)
    tasks_tested: 25,
    confidence_interval: [48.2, 57.8],
  },
};

// This skill will pass verification (large positive delta, no negative tasks)
const human_skill_result = await skill_verification.verify_skill(human_curated_skill);
// Status: APPROVED
```

## Key Implementation Requirements From the SkillsBench Study

Based on the study's findings, **Layer 9: Mechanism #3 implementation MUST include**:

### 1. Block Self-Generated Skills by Default

**Finding**: 0.0pp average improvement from self-generated skills
**Implementation**: Reject all AI-authored skills unless human-verified

### 2. Test All Skills Against Benchmarks

**Finding**: 19% of curated skills harm performance
**Implementation**: Minimum 20-task benchmark per skill; measure the delta

### 3. Enforce Domain-Specific Thresholds

**Finding**: Healthcare +51.9pp vs. Software +4.5pp
**Implementation**: Different minimum deltas per domain

### 4. Limit Skill Verbosity

**Finding**: Focused (2-3 modules) beats comprehensive (10+ modules)
**Implementation**: Maximum 5 modules and 1000 tokens per skill

### 5. Monitor for Degradation

**Finding**: Skills can become harmful over time
**Implementation**: Continuous production monitoring; auto-disable on negative delta

### 6. Detect Negative Deltas

**Finding**: 16 of 84 tasks were worse with skills
**Implementation**: Reject any skill showing a performance decrease on ANY task

## The "Smaller Models with Skills = Larger Models Without" Finding

One of the study's most interesting results: **Smaller models equipped with good skills can match larger models running without skills.**

**Example from the study**:

- **GPT-3.5 + curated healthcare skills**: 76% pass rate
- **GPT-4 without skills**: 74% pass rate
- **Performance difference**: +2pp (the smaller model WITH skills beats the larger model WITHOUT)

**For Voice AI demo economics**:

- GPT-3.5 API cost: ~$0.002 per 1K tokens
- GPT-4 API cost: ~$0.06 per 1K tokens
- **Cost ratio**: 30x cheaper to run the smaller model

**The trade-off**:

- Smaller model + skills: **30x cheaper**, same performance
- Larger model, no skills: **30x more expensive**, same performance
- **BUT**: Skills require human curation (upfront cost)

**Break-even calculation**: If a human expert takes 2 hours to write and verify skills at $100/hour, that's $200 upfront. Running the smaller model instead of the larger one saves $0.058 per 1K tokens. Break-even: $200 / $0.058 ≈ **3,448 thousand-token blocks**, or roughly **3.4M tokens**.

**For production Voice AI demos**: If you're handling >3.5M tokens, investing in human-curated skills for smaller models is more cost-effective than running larger models without skills.
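The break-even arithmetic above is simple enough to sketch as a helper; the function and field names here are hypothetical, and the token prices are the approximate figures quoted above, not authoritative pricing:

```typescript
// Hypothetical sketch of the break-even arithmetic: when does upfront
// skill-curation cost pay for itself in API savings?
interface BreakEvenInput {
  curationHours: number;       // human time to write/verify skills
  hourlyRate: number;          // $/hour
  smallModelCostPer1k: number; // $ per 1K tokens
  largeModelCostPer1k: number; // $ per 1K tokens
}

// Returns the break-even volume in tokens
function breakEvenTokens(input: BreakEvenInput): number {
  const upfrontCost = input.curationHours * input.hourlyRate;         // $200
  const savingsPer1k = input.largeModelCostPer1k - input.smallModelCostPer1k; // $0.058
  return (upfrontCost / savingsPer1k) * 1000;
}

const tokens = breakEvenTokens({
  curationHours: 2,
  hourlyRate: 100,
  smallModelCostPer1k: 0.002, // approximate GPT-3.5 price from the text
  largeModelCostPer1k: 0.06,  // approximate GPT-4 price from the text
});
console.log(Math.round(tokens / 1000)); // 3448 (thousand-token blocks, ~3.4M tokens)
```

Past that volume, the smaller-model-plus-skills configuration is strictly cheaper; below it, the curation cost dominates.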
## What Happens to Voice AI Demos That Ignore This Study

If you're building Voice AI demos and your architecture looks like this:

```typescript
// BROKEN ARCHITECTURE (ignores SkillsBench findings)
async function voice_ai_agent(user_query: string): Promise<AgentResponse> {
  // Let the agent generate its own skills
  const self_generated_skill = await agent.create_skill_from_interaction(user_query);

  // Store the skill for future use
  await skill_library.add(self_generated_skill);

  // Use all available skills (assume more = better)
  const all_skills = await skill_library.get_all();

  // Execute with the full skill library
  return await agent.execute({
    query: user_query,
    skills: all_skills, // Could be 100+ skills, 50K+ tokens
    model: "gpt-4",     // Using an expensive model unnecessarily
  });
}
```

**What actually happens** (validated by SkillsBench):

1. **The self-generated skill is worthless** (0.0pp improvement)
2. **The skill library grows but doesn't help** (no learning is occurring)
3. **Context window is wasted on useless documentation** (100+ skills = verbose)
4. **Performance degrades** (19% chance each skill has a negative delta)
5. **Costs spiral** (paying for GPT-4 when GPT-3.5 + skills would work)

**User experience**: The demo seems to "learn" (the skill library grows) but **performance stays flat or gets worse**.

## What Voice AI Demos Should Do Instead

**Architecture that respects the SkillsBench findings**:

```typescript
// CORRECT ARCHITECTURE (implements SkillsBench lessons)
async function voice_ai_agent(user_query: string): Promise<AgentResponse> {
  // 1. NEVER use self-generated skills - don't even attempt it

  // 2. Use ONLY human-curated, benchmark-tested skills
  const curated_skills = await skill_library.get_verified_skills({
    domain: detect_domain(user_query),
    status: "APPROVED", // Only approved skills
    min_delta: 5.0,     // Minimum +5pp improvement
    max_modules: 5,     // Focused skills only
  });

  // 3. Retrieve FOCUSED skills (not comprehensive documentation)
  const focused_skills = curated_skills.slice(0, 3); // Maximum 3 skills per query

  // 4. Use a SMALLER model with skills (cheaper, same performance)
  return await agent.execute({
    query: user_query,
    skills: focused_skills,
    model: "gpt-3.5-turbo",   // Smaller model with skills = GPT-4 without
    skill_token_budget: 1000, // Enforce verbosity limits
  });
}

// 5. CONTINUOUS monitoring for skill degradation
setInterval(async () => {
  const skills = await skill_library.get_all_active();
  for (const skill of skills) {
    const current_performance = await measure_production_delta(skill);
    if (current_performance.delta < skill.approved_delta - 5.0) {
      // Performance degraded >5pp from approval
      await skill_library.disable_skill(skill, "Performance degradation detected");
    }
  }
}, 3600000); // Check every hour
```

## Connection to Layer 9: Reputation Integrity

This study validates **Layer 9: Mechanism #3 (Skill Verification)** and reveals why it's essential.

**Layer 9 premise**: AI systems must verify claims about their own capabilities before deploying them.

**SkillsBench validation**:

- AI agents **claim** they can generate useful skills → **False** (0.0pp improvement)
- AI agents **claim** skills improve performance → **False 19% of the time** (negative deltas)
- AI agents **claim** comprehensive documentation helps → **False** (focused beats comprehensive)

**Without Layer 9 Mechanism #3**, Voice AI demos:

1. Deploy worthless self-generated skills
2. Waste context window on useless documentation
3. Suffer performance degradation from bad skills
4. Pay for larger models when smaller models + skills would work
5. **Have no idea any of this is happening** (no verification, no benchmarks)

**With Layer 9 Mechanism #3**, Voice AI demos:

1. Block self-generated skills automatically
2. Verify all skills against benchmarks before deployment
3. Detect and disable harmful skills in production
4. Optimize costs (smaller model + verified skills)
5. **Know exactly which skills work and which don't** (continuous measurement)

## The Broader Implication: AI Agents Can't Self-Improve (Yet)

The SkillsBench finding that self-generated skills provide zero benefit has massive implications beyond Voice AI.

**The promise of AI agents**: They'll learn from experience, build up knowledge, and get progressively better at tasks through self-improvement.

**The SkillsBench reality**: **Models cannot reliably author the procedural knowledge they benefit from consuming.**

This means:

- **No autonomous skill acquisition**: Agents can't teach themselves new procedures
- **No self-improving systems**: Performance doesn't increase through operation
- **Human curation is mandatory**: Every useful skill requires human authorship/verification
- **Scaling requires humans**: You can't scale agent capabilities by letting them run longer

**For the AI agent ecosystem**: This is a **fundamental limitation**, not an engineering problem. Until models can reliably generate useful procedural knowledge, **humans remain in the loop for all capability expansion**.

## What This Means for Demogod

If you're using Demogod for Voice AI demos that include agent "skills" or procedural guidance, the critical requirements from the SkillsBench study are:

1. **Never deploy self-generated skills**
   - Block AI-authored procedures
   - Require human verification for all skills
   - Test against a minimum of 20 benchmark tasks
2. **Monitor curated skills for negative deltas**
   - 19% of human-written skills harm performance
   - Test each skill before deployment
   - Monitor continuously in production
3. **Enforce skill size limits**
   - Maximum 5 procedural modules per skill
   - Maximum 1000 tokens per skill
   - Focused beats comprehensive
4. **Use domain-specific thresholds**
   - Healthcare: expect +40pp minimum
   - Software: expect +5pp minimum
   - Reject skills below the domain threshold
5. **Optimize costs with smaller models**
   - GPT-3.5 + verified skills = GPT-4 without skills
   - 30x cost reduction for the same performance
   - Human curation cost amortized over 3.5M+ tokens

**Implementation in Demogod architecture**:

```typescript
// Add to your Voice AI demo configuration
const agent_config = {
  skill_verification: {
    block_self_generated: true,   // Never use AI-authored skills
    minimum_benchmark_tasks: 20,  // Test threshold
    minimum_delta: 5.0,           // Global minimum (+5pp)
    domain_thresholds: {          // Domain-specific minimums
      healthcare: 40.0,
      legal: 35.0,
      finance: 30.0,
      software: 5.0,
    },
    max_modules_per_skill: 5,     // Verbosity limit
    max_tokens_per_skill: 1000,   // Context budget
    production_monitoring_interval: 3600000, // Check every hour
    auto_disable_on_negative_delta: true,    // Protect against harm
  },
  model_selection: {
    use_smaller_model_with_skills: true, // Cost optimization
    smaller_model: "gpt-3.5-turbo",
    larger_model: "gpt-4",
    break_even_tokens: 3500000, // Switch at 3.5M tokens
  },
};
```

## Conclusion: The End of "Self-Improving" AI Agents (For Now)

The SkillsBench study just killed the dream of autonomous AI agents that get better through operation. **Self-generated skills provide zero benefit.** Models cannot teach themselves what they need to know.

**Key findings**:

- Self-generated skills: **0.0pp improvement** (worthless)
- Curated skills: **+16.2pp average** (but 19% have negative deltas)
- Focused skills beat comprehensive documentation
- Smaller models + skills = larger models alone
- Domain variation is massive (+4.5pp to +51.9pp)

**What this means for Voice AI demos**:

1. **Block self-generated skills** (they don't work)
2. **Verify all human-curated skills** (19% are harmful)
3. **Keep skills focused** (2-3 modules, <1000 tokens)
4. **Monitor for degradation** (skills can become harmful)
5. **Use smaller models** (with verified skills, performance is equivalent to larger models)

**Layer 9: Mechanism #3 (Skill Verification) is now validated** by independent benchmark research showing exactly why it's necessary: **AI agents cannot reliably create the procedural knowledge they benefit from consuming.**

Until models gain the ability to author useful skills for themselves, **humans remain mandatory** for all agent capability expansion. This isn't a temporary limitation; it's a fundamental gap in current AI architectures.

Build accordingly.

---

**Study**: [SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks](https://arxiv.org/abs/2602.12670)

**HackerNews Discussion**: [268 points, 113 comments](https://news.ycombinator.com/item?id=47040430)