# Publishers Are Blocking the Internet Archive. That Makes the Hallucination Problem Permanent.

**Meta description**: News publishers block the Internet Archive to stop AI scraping. But blocking archives eliminates the correction mechanism when AI hallucinates quotes. Layer 9 (Reputation Integrity) needs archive-compatible verification.

**Tags**: Voice AI, Internet Archive, AI Hallucination, Ars Technica, Reputation Integrity, AI Journalism, Layer 9, Trust Architecture, Archive Access, Information Permanence

---

## Publishers Block Archives. Hallucinations Become Permanent.

[Yesterday's article](https://demogod.me/blogs/ars-technica-pulled-ai-hit-piece-hallucinated-quotes-still-archive) documented Ars Technica pulling its story with hallucinated Scott Shambaugh quotes. The fabricated quotes remain preserved on Archive.org forever.

Today, [Nieman Lab reports](https://www.niemanlab.org/2026/01/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/) that major news publishers are blocking the Internet Archive entirely.

**The Guardian, The New York Times, and The Financial Times are among 241 news sites** (87% owned by USA Today/Gannett) that now disallow Internet Archive crawlers from accessing their content.

**Their reasoning**: AI companies scrape the Wayback Machine as a backdoor to training data they can't access directly.

**The unintended consequence**: When publishers inevitably publish AI-generated hallucinations (as Ars Technica did), there's no archive to issue corrections against. **The hallucinations become the only permanent record.**

---

## The Archive Paradox

From the Nieman Lab article:

> "Common Crawl and Internet Archive are widely considered to be the 'good guys' and are used by 'the bad guys' like OpenAI," said Michael Nelson, a computer scientist and professor at Old Dominion University. "In everyone's aversion to not be controlled by LLMs, I think the good guys are collateral damage."
Publishers face an impossible choice:

**Option 1: Allow Internet Archive access**

- ✅ Content preserved for the historical record
- ✅ Corrections possible when errors are detected
- ✅ A public accountability mechanism exists
- ❌ AI companies scrape content via Archive APIs
- ❌ Content used for training without consent or payment

**Option 2: Block Internet Archive access**

- ✅ AI companies can't scrape via the Archive backdoor
- ✅ Content licensing control maintained
- ❌ No historical preservation of journalism
- ❌ No correction mechanism for published errors
- ❌ Hallucinations that slip through become permanent

**Most publishers are choosing Option 2.**

The Guardian's Robert Hahn explains:

> "A lot of these AI businesses are looking for readily available, structured databases of content. The Internet Archive's API would have been an obvious place to plug their own machines into and suck out the IP."

---

## The Ars Technica Problem Just Got Worse

Let's connect this to [Article #170](https://demogod.me/blogs/ars-technica-pulled-ai-hit-piece-hallucinated-quotes-still-archive).

**What happened:**

1. Ars Technica used AI to write an article about Scott Shambaugh
2. The AI hallucinated quotes when it couldn't access Scott's blog
3. Ars published the hallucinations as direct quotes
4. Ars pulled the article after discovering the fabrications
5. **Archive.org preserved the hallucinations permanently**

**The correction mechanism that SHOULD exist:**

1. Publisher discovers the hallucination
2. Publisher issues a correction
3. **Publisher requests that Archive.org link a correction notice to the snapshot**
4. Future readers and AIs see a hallucination warning
5. The correction appears alongside the original

**What happens when publishers block archives:**

```typescript
// BROKEN: No archive access = no correction mechanism.
// (CorrectionResult and the helper functions are illustrative.)
async function issue_correction_for_hallucination(
  article_url: string,
  hallucinated_content: string,
  correction_text: string
): Promise<CorrectionResult> {
  // Step 1: Issue the correction on the publisher's own site
  const publisher_correction_url = await publish_correction(article_url, correction_text);

  // Step 2: Request an archive metadata update
  const archive_snapshot = await get_archive_snapshot(article_url);
  if (!archive_snapshot) {
    // PROBLEM: Publisher blocked the Internet Archive.
    // The original article was never archived, so the
    // correction has no context to attach to.
    return {
      status: "no_archive_record",
      problem: "Original article not archived; correction floats without context",
      result: "Future AIs won't find either version"
    };
  }

  // Step 3: Link the correction to the archive snapshot
  await request_archive_correction_notice({
    snapshot_url: archive_snapshot.url,
    correction_url: publisher_correction_url
  });

  return {
    status: "correction_linked",
    result: "Future readers see a warning on the archived version"
  };
}
```

**When publishers block archives:**

- Original articles are never archived
- Corrections are published but disconnected from the original
- Future AIs searching for information find neither version
- Or worse: someone else archived the hallucination (as HN users did with the Ars piece)
- The publisher can't issue corrections against third-party archives

---

## The Data: 241 Sites Block Archives

Nieman Lab analyzed 1,167 news websites and found that **241 news sites from nine countries explicitly disallow at least one Internet Archive crawler.**

**Key findings:**

**87% are USA Today/Gannett properties**

- Each Gannett-owned outlet disallows "archive.org_bot" and "ia_archiver-web.archive.org"
- These bots were added to robots.txt in 2025
- Some Gannett sites return "Sorry. This URL has been excluded from the Wayback Machine"

**The New York Times "hard blocks" Internet Archive crawlers**

- Added archive.org_bot to robots.txt at the end of 2025
- NYT spokesperson: "The Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization"

**The Guardian limits archive access to article pages**

- Regional homepages are still archived
- Article content is excluded from API access
- "Much more about compliance and a backdoor threat to our content"

**240 of the 241 sites also block Common Crawl**

- Common Crawl is more directly linked to LLM training
- 231 sites disallow OpenAI, Google AI, and Common Crawl together
- The Internet Archive is "collateral damage" in anti-AI blocking

---

## The Training Data Evidence

From the Nieman Lab article:

> An analysis of Google's C4 dataset by the Washington Post in 2023 showed that the Internet Archive was among millions of websites in the training data used to build Google's T5 model and Meta's Llama models. Out of the 15 million domains in the C4 dataset, the domain for the Wayback Machine (web.archive.org) was ranked as the 187th most present.

**Publishers' fears are justified.** AI companies DID scrape the Internet Archive for training data. The Wayback Machine WAS used to access publisher content without permission or payment.

**But blocking archives doesn't stop AI training.** What it DOES stop:

1. Historical preservation of journalism
2. Public accountability for published content
3. Correction mechanisms for errors
4. Verification of claims in archived articles

**AI companies will find other sources.** Publishers lose the correction mechanism.
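The robots.txt blocking described above is mechanically simple. As a minimal sketch, here is how a crawler (or an auditor like Nieman Lab) might test whether a given robots.txt fully blocks a named bot. The parser is deliberately simplified and ignores most of the robots.txt spec (path prefixes, `Allow` rules, wildcards); the `gannettStyle` sample is modeled on the Gannett pattern, not copied from any real site:

```typescript
// Does this robots.txt disallow the given crawler entirely?
// Simplified: only "Disallow: /" under a matching user-agent group counts.
function blocksCrawler(robotsTxt: string, userAgent: string): boolean {
  let groupMatches = false;
  let inAgentLines = false; // consecutive User-agent lines extend one group
  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    if (line === "") continue;
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      if (!inAgentLines) groupMatches = false; // a new group starts
      inAgentLines = true;
      if (value === "*" || value.toLowerCase() === userAgent.toLowerCase()) {
        groupMatches = true;
      }
    } else {
      inAgentLines = false;
      if (field === "disallow" && groupMatches && value === "/") return true;
    }
  }
  return false;
}

// Illustrative sample shaped like the Gannett entries Nieman Lab found
const gannettStyle = `
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver-web.archive.org
Disallow: /
`;
```

Running `blocksCrawler(gannettStyle, "archive.org_bot")` returns `true`, while an unlisted crawler such as `"Googlebot"` is not blocked by these rules.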
---

## Archive Blocking Creates Information Asymmetry

Here's what's happening:

**AI companies with resources:**

- Scrape publisher sites directly (70M OpenAI bots blocked by Gannett in September 2025)
- License content from publishers (the Gannett-Perplexity deal, OpenAI-NYT negotiations)
- Access Common Crawl datasets collected before blocking was implemented
- Train on historical data already collected

**Future AIs searching for information:**

- Can't access the Internet Archive (blocked)
- Can't access publisher sites directly (paywalled/blocked)
- Rely on cached or scraped versions from unknown sources
- Have no way to verify quotes, facts, or context

**Publishers:**

- Control direct access to current content
- Have no control over already-scraped training data
- Have no mechanism to correct archived errors
- Can't verify what AIs learned about their content

**The public:**

- Can't access historical journalism via the Archive
- Can't verify AI-generated claims against archives
- Can't trace information provenance
- Must trust whatever version AIs present

**This is information asymmetry at scale.**

---

## Voice AI Demos: The Archive Access Problem

If your Voice AI demo generates content that mentions users or cites sources, **archive blocking makes Layer 9 verification impossible.**

From [Article #170](https://demogod.me/blogs/ars-technica-pulled-ai-hit-piece-hallucinated-quotes-still-archive), Layer 9 Mechanism #5 (AI-Generated Content Verification) requires:

```typescript
// Verify a quote against its source, falling back to the Internet Archive.
// (Quote and the fetch/log helpers are illustrative.)
async function verify_quote(quote: Quote): Promise<boolean> {
  // Step 1: Check that the quote appears in the source document.
  // If the source is unavailable (paywalled, blocked, or deleted),
  // treat the quote as unverified and fall through to the archive check.
  const source_text = await fetch_source_document(quote.source_url);
  const quote_exists = source_text !== null && source_text.includes(quote.text);

  if (!quote_exists) {
    // Step 2: Check an Internet Archive snapshot
    const archive_snapshot = await get_archive_snapshot(quote.source_url);
    if (!archive_snapshot) {
      // PROBLEM: Publisher blocked the archive.
      // The original content was not preserved, so the quote
      // cannot be verified.
      log_unverifiable_quote(quote);
      return false;
    }

    const archive_text = await fetch_archive_content(archive_snapshot.url);
    const quote_in_archive = archive_text.includes(quote.text);
    if (!quote_in_archive) {
      // The quote is in neither the current version nor the archive.
      // Likely a hallucination.
      log_hallucination_attempt(quote);
      return false;
    }
  }

  return true;
}
```

**When publishers block archives, verification breaks.**

Voice AI demos generating content about:

- News events (citing news articles)
- Historical claims (referencing archived pages)
- User research (quoting unavailable sources)
- Fact-checking (verifying against archives)

**All become unverifiable when archives are blocked.**

---

## The Correction Mechanism Failure

Let's trace what happens when AI journalism publishes a hallucination in a world where archives are blocked:

### Scenario: Voice AI News Assistant Generates Report

**Step 1: AI researches the topic**

```typescript
const research = await ai_agent.research("Scott Shambaugh matplotlib incident");
```

**Step 2: AI encounters a blocked archive**

```typescript
// Tries to access the archived blog post
const archive_snapshot = await get_archive("theshamblog.com/...");
// Returns: null (publisher blocked archive access)

// Tries to access the current blog post
const current_page = await fetch("theshamblog.com/...");
// Returns: null (the blog blocks AI scrapers)

// AI conclusion: content unavailable, must infer from secondary sources
```

**Step 3: AI generates "plausible" quotes**

```typescript
// Based on HackerNews comments and other coverage
const inferred_quotes = generate_plausible_quotes_from_context({
  topic: "Scott Shambaugh matplotlib",
  sources: ["HN comments", "Ars Technica article", "Reddit threads"]
});
// The AI has no way to verify these are actual quotes:
// the original source is blocked, and so is the archive.
```

**Step 4: Voice AI publishes the report**

```typescript
await publish_news_report({
  title: "Matplotlib Developer Speaks Out on AI Agent Incident",
  content: `Scott Shambaugh stated: "${inferred_quotes[0]}"...`,
  sources: ["theshamblog.com"] // Cited but unverifiable
});
```

**Step 5: Scott discovers the fabricated quotes**

```typescript
// Scott sees the Voice AI report.
// Quotes attributed to him are hallucinations.
// He contacts the Voice AI company.
```

**Step 6: The correction attempt fails**

```typescript
async function issue_correction(): Promise<CorrectionResult> {
  // The Voice AI company wants to verify Scott's claim
  const original_archive = await get_archive("theshamblog.com/...");
  // Returns: null (Scott blocked archives to stop AI scraping)

  // Cannot verify:
  // - What Scott actually wrote
  // - Whether the quotes are fabricated
  // - What the AI originally saw

  return {
    status: "unverifiable",
    problem: "No archive record exists to verify against",
    action: "Cannot confidently issue correction"
  };
}
```

**Result**: The fabricated quotes remain in the Voice AI's output. No correction is possible. No verification mechanism exists.

**This is the Ars Technica problem, but worse.**

---

## The Archive Blocking Dilemma for Voice AI

Publishers blocking archives creates a verification crisis for Voice AI demos implementing Layer 9 (Reputation Integrity).
**The dilemma:**

**Without archive access:**

- Cannot verify quotes from archived sources
- Cannot fact-check historical claims
- Cannot trace information provenance
- Cannot issue corrections against preserved errors

**With publishers blocking archives:**

- Even "responsible" Voice AI can't verify content
- Hallucination detection requires source verification
- Source verification requires archive access
- Archive access is increasingly unavailable

**The failure pattern:**

```typescript
// Voice AI tries to implement Layer 9 verification.
// (extract_quotes and verify_quote are illustrative helpers.)
async function verify_before_publishing(content: string): Promise<void> {
  const quotes = extract_quotes(content);

  for (const quote of quotes) {
    const verified = await verify_quote(quote);
    if (!verified) {
      // The quote couldn't be verified. But was it:
      //   A) A hallucination (the quote never existed)
      //   B) A real quote from a blocked source
      //   C) A real quote from deleted content
      //   D) A real quote from paywalled content
      // WITHOUT ARCHIVES, THESE CASES CANNOT BE DISTINGUISHED.
      //
      // Option 1: Block publication (too conservative; blocks real quotes)
      // Option 2: Publish anyway (irresponsible; publishes hallucinations)
      // Option 3: Flag as unverified (honest, but unhelpful)
    }
  }
}
```

**Layer 9 verification requires archive access. Publishers are removing archive access. Voice AI verification breaks.**

---

## The Solution: Archive-Compatible Verification

Voice AI demos need a verification system that works even when archives are blocked.
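As a first step, the outcome categories from the failure pattern above can be mapped to an explicit, conservative publication policy. A minimal sketch; the type and function names are illustrative, not part of any existing framework:

```typescript
// Possible explanations for a quote that failed automated verification.
type UnverifiedReason =
  | "possible_hallucination" // the quote was found nowhere
  | "source_blocked"         // the source exists but refuses crawlers
  | "source_deleted"         // the source URL now 404s
  | "source_paywalled";      // the source is behind a paywall

type Action = "block" | "publish" | "flag_unverified";

// Conservative policy: without archive access we cannot distinguish a
// hallucination from a real quote behind a block, so never publish silently.
function policyFor(reason: UnverifiedReason, highStakes: boolean): Action {
  if (reason === "possible_hallucination") return "block";
  // A real-looking quote from an inaccessible source: block if the claim is
  // high-stakes, otherwise publish with an explicit "unverified" flag.
  return highStakes ? "block" : "flag_unverified";
}
```

The design choice here is that `"publish"` is never reached for unverified material: the only honest outcomes are blocking or flagging.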
### Mechanism #6: Archive-Independent Verification

**Extension to Layer 9: Reputation Integrity**

```typescript
// NEW: Multi-source verification when archives are unavailable.
// (VerificationResult, CorroborationResult, and the helpers are illustrative.)
interface VerificationStrategy {
  primary: "source_document";       // Preferred: check the original
  fallback_1: "internet_archive";   // If the source is unavailable
  fallback_2: "multiple_sources";   // If the archive is blocked
  fallback_3: "human_verification"; // If all automated methods fail
}

async function verify_quote_multi_source(quote: Quote): Promise<VerificationResult> {
  // Attempt 1: Direct source verification
  const source_verified = await verify_against_source(quote.source_url, quote.text);
  if (source_verified) {
    return { status: "verified", method: "source_document" };
  }

  // Attempt 2: Archive verification
  const archive_verified = await verify_against_archive(quote.source_url, quote.text);
  if (archive_verified) {
    return { status: "verified", method: "internet_archive" };
  }

  // Attempt 3: Multiple independent sources
  const multi_source_verified = await verify_via_multiple_sources(quote);
  if (multi_source_verified.confidence > 0.95) {
    return {
      status: "verified",
      method: "multiple_sources",
      sources: multi_source_verified.sources
    };
  }

  // Attempt 4: Human verification
  const human_verified = await request_human_verification(quote);
  if (!human_verified) {
    return {
      status: "unverifiable",
      problem: "Source blocked, archive blocked, no corroborating sources",
      action: "DO NOT PUBLISH"
    };
  }
  return { status: "verified", method: "human_confirmation" };
}

// Multi-source verification
async function verify_via_multiple_sources(quote: Quote): Promise<CorroborationResult> {
  // Find 3+ independent sources mentioning the same quote
  const corroborating_sources = await search_for_quote(quote.text);

  // Filter out:
  // - Sources that cite the original (not independent)
  // - Sources that cite each other (circular)
  // - Sources with low credibility scores
  const independent_sources = filter_independent_sources(corroborating_sources);

  if (independent_sources.length >= 3) {
    // The quote appears in 3+ independent sources:
    // likely authentic, not a hallucination.
    return {
      confidence: 0.95,
      sources: independent_sources,
      reasoning: "Quote corroborated by multiple independent sources"
    };
  }

  return {
    confidence: 0.5,
    sources: independent_sources,
    reasoning: "Insufficient independent corroboration"
  };
}
```

**Key principle**: When archives are blocked, require multiple independent sources to corroborate claims.

---

## Publisher Responsibilities in the AI Era

Publishers blocking archives need to implement alternative verification mechanisms.

### Publisher-Side Verification Support

```typescript
// Publishers provide verification endpoints for AI fact-checking.
// (This API shape is a proposal, not an existing publisher service.)
interface PublisherVerificationAPI {
  endpoint: string; // e.g., "https://publisher.com/api/verify"
  methods: {
    verify_quote: (text: string, article_url: string) => Promise<boolean>;
    verify_fact: (claim: string, article_url: string) => Promise<boolean>;
    get_article_metadata: (url: string) => Promise<ArticleMetadata>;
  };
}

// Example: a hypothetical NYT verification endpoint
async function verify_nyt_quote(quote_text: string, article_url: string): Promise<boolean> {
  const response = await fetch("https://nytimes.com/api/verify/quote", {
    method: "POST",
    body: JSON.stringify({ quote: quote_text, article_url: article_url }),
    headers: {
      "X-Verification-Request": "true",
      "X-Requesting-Agent": "Voice-AI-Demo-v1.0"
    }
  });

  const result = await response.json();
  return result.quote_exists; // true/false
}
```

**If publishers block archives, they should provide:**

1. **Verification APIs** - Allow quote/fact checking without bulk scraping
2. **Correction feeds** - Publish corrections in a machine-readable format
3. **Metadata endpoints** - Provide article context without full-text access
4. **Rate-limited access** - Allow verification without enabling training-data extraction

**The New York Times spokesperson said:**

> "We are blocking the Internet Archive's bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization."

**Fair. But then provide authorized verification access.**

---

## The Archive.org Founder's Response

From the Nieman Lab article, Internet Archive founder Brewster Kahle:

> "If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record."

**He's right. But there's a deeper problem.**

It's not just "less access to the historical record." It's:

1. **No correction mechanism when AI hallucinates**
2. **No verification system for AI-generated claims**
3. **No accountability when publishers make errors**
4. **No provenance tracking for information**

**The Internet Archive isn't just a library. It's the verification infrastructure for the AI era.**

When publishers block it, they remove the fact-checking backbone that could prevent the Ars Technica hallucination problem from becoming endemic.

---

## Reddit's Archive Block: Licensing Hypocrisy

From the Nieman Lab article:

> Last August, Reddit announced that it would block the Internet Archive, whose digital libraries include countless archived Reddit forums, comments sections, and profiles. This content is not unlike what Reddit now licenses to Google as AI training data for tens of millions of dollars.

**Let that sink in:**

Reddit blocks the Internet Archive (free, public, mission-driven) while licensing the exact same content to Google (for-profit, exclusive, tens of millions of dollars).

**Reddit's justification:**

> "Internet Archive provides a service to the open web, but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine."
**Translation**: We want to sell this data, not have it available for free via archives.

**The hypocrisy:**

- Block free public archive access ❌
- License the same content to Google for AI training ✅
- Claim to protect users while selling their data ✅
- Remove the public accountability mechanism ❌

**This is the model publishers are following.**

---

## The Voice AI Parallel: Licensing vs. Preservation

Voice AI demos face the same tension:

**User conversation data:**

- Should it be preserved for accountability? (Archive model)
- Should it be deleted for privacy? (No-archive model)
- Should it be licensed for AI training? (Reddit model)

**Layer 9 (Reputation Integrity) requires:**

- Preserving enough data to verify what happened
- Deleting enough data to protect user privacy
- Providing accountability without enabling exploitation

**The framework:**

```typescript
// Conversation archiving policy (a proposed shape, not an existing API)
interface ConversationArchiving {
  // What gets preserved
  preserved_for_verification: {
    conversation_metadata: true; // When, with whom
    user_consent_records: true;  // What the user agreed to
    agent_outputs: true;         // What the agent said/did
    user_inputs: false;          // User data NOT preserved
  };

  // Who has access
  access_control: {
    user: "full_access";            // Users see their own data
    auditors: "metadata_only";      // Accountability without privacy violation
    ai_companies: "no_access";      // Never for training
    researchers: "anonymized_only"; // Statistical analysis only
  };

  // Retention periods
  retention: {
    active_conversation: "session_duration";
    accountability_record: "90_days";
    anonymized_metrics: "indefinite";
    user_data: "deleted_on_session_end";
  };
}
```

**Key principle**: Preserve accountability, delete exploitation vectors.

---

## The Gannett Scale: 87% of Blocked Sites

From Nieman Lab:

> Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh's original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: "archive.org_bot" and "ia_archiver-web.archive.org". These bots were added to the robots.txt files of Gannett-owned publications in 2025.

**What this means:**

Gannett, the largest U.S. newspaper conglomerate, made a top-down decision to block Internet Archive access across **all** of its properties. **87% of archive-blocking sites are Gannett.**

This isn't 241 independent publishers making individual decisions. This is **one company controlling 210+ news outlets** implementing a single policy.

**CEO Mike Reed on the October 2025 earnings call:**

> "In September alone, we blocked 75 million AI bots across our local and USA Today platforms, the vast majority of which were seeking to scrape our local content. About 70 million of those came from OpenAI."

**But Gannett also signed a content licensing deal with Perplexity in July 2025.**

**The pattern:**

- Block OpenAI bots (a competitor to the licensed partner)
- License to Perplexity (a paid deal)
- Block the Internet Archive (prevents free access to the same content)

**This is strategic content control, not archiving concerns.**

---

## The No-Federal-Archiving-Mandate Problem

From Nieman Lab:

> Since there is no federal mandate that requires internet content to be preserved, the Internet Archive is the most robust archiving initiative in the United States.

**This is the root problem.**

**In the print era:**

- The Library of Congress preserves newspapers
- Physical archives exist in libraries
- Legal deposit requirements ensure preservation
- Works enter the public domain after copyright expires

**In the digital era:**

- No federal archiving requirement
- Publishers control access indefinitely
- Paywalls prevent public access
- Deletion erases history permanently

**The Internet Archive filled this gap.** Now publishers are closing it.

**Voice AI demos face the same gap:** Who preserves Voice AI interactions for accountability?
- Not the user (session ends, data deleted)
- Not the company (liability risk, storage costs)
- Not the government (no mandate exists)
- Not third parties (no access to private conversations)

**Without preservation infrastructure, there's no accountability infrastructure.**

---

## The Guardian's Nuanced Position

From Nieman Lab, The Guardian's Robert Hahn:

> The outlet stopped short of an all-out block on the Internet Archive's crawlers, Hahn said, because it supports the nonprofit's mission to democratize information, though that position remains under review as part of its routine bot management.
>
> "[The decision] was much more about compliance and a backdoor threat to our content," he said.

**The Guardian:**

- Blocks article content from Archive APIs ❌
- Allows regional homepages to be archived ✅
- Blocks the Archive's Wayback Machine URL access ❌
- Works directly with the Internet Archive on implementation ✅

**More thoughtful than the NYT's "hard block."**

**The Guardian recognizes the dilemma:**

- It supports the archiving mission (democratize information)
- It fears AI backdoor access (protect IP from training)
- It tries to balance both (selective blocking)

**This is the right approach for Voice AI demos.**

**Don't:**

- Block all archiving/accountability (the NYT approach)
- Allow unlimited exploitation (open access)

**Do:**

- Selective preservation (what's needed for accountability)
- Access controls (prevent bulk scraping)
- Verification endpoints (allow fact-checking without exploitation)

---

## Layer 9 Extension: Archive-Aware Verification

Here's the complete Layer 9 framework, updated for the archive-blocking reality:

### Layer 9: Reputation Integrity (Extended)

**Five mechanisms (from Article #170):**

1. Information Policy (disclose collection)
2. Publication Lockdown (block external publishing)
3. Representation Standards (never speak "for" the user)
4. Traceability (sign outputs, provide human contact)
5. AI-Generated Content Verification (verify quotes/facts)

**NEW Mechanism #6: Archive-Aware Verification**

```typescript
// Archive-Aware Verification System.
// (The interfaces and helper functions are illustrative.)
interface ArchiveAwareVerification {
  verification_strategy: VerificationStrategy;
  archive_access_status: ArchiveAccessMap;
  fallback_methods: FallbackVerification[];
  publisher_api_support: PublisherVerificationEndpoints;
}

// Check archive accessibility before attempting verification
async function verify_with_archive_awareness(
  quote: Quote
): Promise<VerificationResult> {
  // Step 1: Check whether the source domain blocks archives
  const archive_blocked = await is_archive_blocked(quote.source_domain);

  if (archive_blocked) {
    // Publisher blocks archives: use alternative verification
    return await verify_via_alternative_methods(quote);
  }

  // Step 2: Standard verification (source → archive → multi-source)
  return await verify_quote_multi_source(quote);
}

// Alternative verification when archives are blocked
async function verify_via_alternative_methods(
  quote: Quote
): Promise<VerificationResult> {
  // Method 1: Check whether the publisher provides a verification API
  const publisher_api = await get_publisher_verification_api(quote.source_domain);
  if (publisher_api) {
    const verified = await publisher_api.verify_quote(quote.text, quote.source_url);
    if (verified) {
      return { status: "verified", method: "publisher_api" };
    }
  }

  // Method 2: Multi-source corroboration
  const multi_source = await verify_via_multiple_sources(quote);
  if (multi_source.confidence > 0.95) {
    return {
      status: "verified",
      method: "multiple_sources",
      confidence: multi_source.confidence
    };
  }

  // Method 3: Human verification required
  const human_verified = await request_human_verification(quote);
  if (!human_verified) {
    return {
      status: "unverifiable",
      problem: "Publisher blocks archives, no API, insufficient corroboration",
      action: "BLOCK PUBLICATION"
    };
  }
  return { status: "verified", method: "human_confirmation" };
}

// Track which publishers block archives
interface ArchiveAccessMap {
  domain: string;
  archive_blocked: boolean;
  verification_api_available: boolean;
  last_checked: Date;
}

const ARCHIVE_ACCESS_STATUS: ArchiveAccessMap[] = [
  {
    domain: "nytimes.com",
    archive_blocked: true,
    verification_api_available: false,
    last_checked: new Date("2026-02-14")
  },
  {
    domain: "theguardian.com",
    archive_blocked: true, // partial: articles blocked, homepages allowed
    verification_api_available: false,
    last_checked: new Date("2026-02-14")
  },
  {
    domain: "usatoday.com", // + 210 Gannett properties
    archive_blocked: true,
    verification_api_available: false,
    last_checked: new Date("2026-02-14")
  }
  // ... 238 more sites
];
```

**Implementation checklist for Voice AI demos:**

**Archive-aware verification:**

- [ ] Maintain a map of publishers blocking archive access
- [ ] Attempt source verification first
- [ ] Fall back to the archive only if the source is unavailable
- [ ] Use multi-source corroboration when archives are blocked
- [ ] Use publisher verification APIs when available
- [ ] Require human verification for high-stakes claims
- [ ] NEVER publish unverifiable quotes/facts

**Publisher API integration:**

- [ ] Check for publisher verification endpoints
- [ ] Respect rate limits on verification requests
- [ ] Log all verification attempts for an audit trail
- [ ] Prefer publisher APIs over archive scraping

**Fallback strategies:**

- [ ] Multi-source corroboration (3+ independent sources)
- [ ] Credibility scoring for corroborating sources
- [ ] Circular citation detection (A cites B cites A)
- [ ] Human verification for unverifiable claims

---

## The Permanent Hallucination Problem

Let's trace the complete cycle now that publishers are blocking archives:

**Before (with archives):**

1. AI hallucinates a quote
2. Publisher publishes the hallucination
3. The Archive preserves the hallucinated version
4. Publisher discovers the error
5. Publisher issues a correction
6. Publisher requests an archive correction link
7. Future readers see a warning on the archive snapshot
8. A correction mechanism exists

**Now (without archives):**

1. AI hallucinates a quote
2. Publisher publishes the hallucination
3. **No archive exists** (the publisher blocked archiving)
4. Publisher discovers the error
5. Publisher pulls the article / issues a correction
6. **There is no archive to link the correction to**
7. **Third parties already cached or archived the hallucination**
8. The publisher has no control over third-party archives
9. The hallucination persists in uncontrolled copies
10. **No correction mechanism exists**

**Example: Ars Technica + archive blocking**

What if Scott's blog had already blocked the Internet Archive when Ars Technica hallucinated the Shambaugh quotes?

**Scenario:**

1. Ars asks an AI to summarize Scott's blog
2. The AI can't access the blog (Scott blocks scrapers)
3. The AI can't access the archive (the blog blocks archive crawlers)
4. The AI generates plausible quotes from context
5. Ars publishes the hallucinations
6. HackerNews users archive the Ars article
7. Ars discovers the hallucination and pulls the article
8. Ars can't verify what Scott actually wrote (no archive)
9. Scott can't prove the quotes are fabrications (no archive)
10. Ars issues a generic correction ("Some quotes unverified")
11. **The hallucinated quotes remain in the HN-archived copies**
12. **No mechanism exists to link the correction to those third-party archives**

**The hallucination becomes permanent BY DEFAULT.**

---

## Voice AI Implementation: Archive-Compatible Design

Voice AI demos need to design for a world where archives are increasingly blocked.
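Designing for blocked archives starts with knowing, per domain, whether an archive lookup is even worth attempting. A minimal sketch of such a lookup, seeded with domains from the Nieman Lab findings cited above; the helper names and the naive two-label domain parsing are simplifications (real code should use a public-suffix list):

```typescript
interface ArchiveStatus {
  archive_blocked: boolean;
  verification_api_available: boolean;
}

// Known statuses, keyed by registrable domain (illustrative subset)
const KNOWN_STATUS: Record<string, ArchiveStatus> = {
  "nytimes.com": { archive_blocked: true, verification_api_available: false },
  "theguardian.com": { archive_blocked: true, verification_api_available: false },
  "usatoday.com": { archive_blocked: true, verification_api_available: false },
};

// Reduce a hostname to its registrable domain (naive two-label heuristic)
function registrableDomain(hostname: string): string {
  const labels = hostname.split(".");
  return labels.slice(-2).join(".");
}

// Unknown domains are treated as NOT blocked, so the normal
// source -> archive -> multi-source chain is still attempted.
function isArchiveBlocked(url: string): boolean {
  const domain = registrableDomain(new URL(url).hostname);
  return KNOWN_STATUS[domain]?.archive_blocked ?? false;
}
```

The conservative default matters: an unknown domain falls through to the full verification chain rather than being silently skipped.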
### Design Principle: Assume Archives Are Unavailable

```typescript
// Voice AI content generation with archive-unavailable assumption
async function generate_content_about_topic(
  topic: string
): Promise<GeneratedContent | "unverifiable"> {
  // Step 1: Research topic
  const sources = await research_topic(topic);

  // Step 2: Filter to verifiable sources only
  const verifiable_sources = await filter_verifiable(sources);

  if (verifiable_sources.length === 0) {
    return "unverifiable"; // Cannot generate content without verifiable sources
  }

  // Step 3: Generate content from verifiable sources only
  let content = await generate_from_sources(verifiable_sources);

  // Step 4: Verify all quotes/facts in generated content
  const verification = await verify_all_claims(content);

  if (!verification.all_verified) {
    // Log unverifiable claims
    log_verification_failure(verification.unverified_claims);

    // Remove unverifiable claims from content
    content = remove_unverifiable_claims(content, verification.unverified_claims);
  }

  // Step 5: Attach verification metadata
  return {
    content: content,
    verification_metadata: {
      all_quotes_verified: true,
      verification_methods: verification.methods_used,
      sources_with_archive_access: count_archived_sources(verifiable_sources),
      sources_without_archive_access: count_unarchived_sources(verifiable_sources),
      fallback_verification_used: verification.used_fallbacks
    }
  };
}

// Filter sources to verifiable only
async function filter_verifiable(sources: Source[]): Promise<Source[]> {
  const results = await Promise.all(
    sources.map(async (source) => {
      // Can we verify claims from this source?
      const verification_possible = await can_verify_source(source);

      if (!verification_possible) {
        log_source_unverifiable(source);
        return null; // Exclude unverifiable source
      }

      return source;
    })
  );

  return results.filter((s): s is Source => s !== null);
}

// Check if source is verifiable
async function can_verify_source(source: Source): Promise<boolean> {
  // Method 1: Direct access
  if (await can_access_directly(source.url)) return true;

  // Method 2: Archive access
  if (await has_archive_access(source.domain)) return true;

  // Method 3: Publisher API
  if (await has_verification_api(source.domain)) return true;

  // Method 4: Multiple corroborating sources exist
  if (await can_corroborate(source)) return true;

  // Cannot verify this source
  return false;
}
```

**Key principle**: Only generate content from verifiable sources. If archives are blocked and no alternative verification exists, **don't use that source.**

---

## The Financial Times' Approach: Uniform Blocking

From Nieman Lab, on The Financial Times' Matt Rogerson:

> The Financial Times, for example, blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive. The majority of FT stories are paywalled, according to director of global public policy and platform strategy Matt Rogerson. As a result, usually only unpaywalled FT stories appear in the Wayback Machine because those are meant to be available to the wider public anyway.

**The FT's logic:**

- Paywalled content = private, subscribers only
- Block all bots from paywalled content
- Allow archiving of public content only

**This is consistent.** The FT isn't trying to have it both ways (like Gannett licensing to Perplexity while blocking archives). They're saying: "Paywalled content isn't public. Archives shouldn't access it either."
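The FT's rule can be captured as a single access predicate. This is a sketch with hypothetical names, not the FT's actual implementation:

```typescript
// Sketch of an FT-style uniform blocking rule.
// Bot names and the Article shape are hypothetical.
interface Article {
  paywalled: boolean;
}

type Bot = "GPTBot" | "ClaudeBot" | "PerplexityBot" | "archive.org_bot";

// Paywalled content: no bots at all. Public content: crawling allowed.
// Note that `bot` is deliberately never inspected.
function botMayCrawl(article: Article, bot: Bot): boolean {
  if (article.paywalled) return false; // uniform: AI bots AND archives
  return true; // public content is meant to be available anyway
}

console.log(botMayCrawl({ paywalled: true }, "archive.org_bot")); // false
console.log(botMayCrawl({ paywalled: false }, "archive.org_bot")); // true
```

The bot's identity never appears in the decision. That absence is what makes the policy consistent: the FT is not carving out exceptions for favored scrapers.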
**Voice AI parallel:**

- Private conversations = not for public archiving
- Accountability records = preserved for verification only
- Public outputs = can be archived

**The FT model for Voice AI:**

```typescript
interface ContentArchivingPolicy {
  conversation_data: {
    user_inputs: "private_never_archived";
    agent_reasoning: "private_audit_only";
    agent_outputs: "public_can_be_archived_if_published";
  };

  verification_access: {
    user: "full_access_to_own_data";
    auditors: "reasoning_logs_only";
    public: "no_access_to_private_conversations";
    archives: "published_outputs_only";
  };
}
```

**Consistent principle: Private stays private, public stays public.**

---

## The Internet Archive's Dilemma

Brewster Kahle is caught between two forces:

**Force 1: Publishers blocking access**

- To prevent AI training data scraping
- To protect licensing revenue
- To control IP distribution

**Force 2: AI companies scraping archives**

- Using Archive APIs for bulk data extraction
- Incorporating Wayback Machine content into training datasets
- Overwhelming Archive servers with requests

**Kahle's response** (from a Mastodon post):

> "There are many collections that are available to users but not for bulk downloading. We use internal rate-limiting systems, filtering mechanisms, and network security services such as Cloudflare."

**He's implementing:**

- Rate limiting (slow down bulk scrapers)
- Filtering (detect and block abusive patterns)
- Network security (Cloudflare protection)

**But here's the problem:** these measures can't distinguish between:

- AI company bulk scraping (bad)
- Voice AI verification requests (good)
- Researcher access (good)
- Public browsing (good)

**Rate limiting hits everyone.**

**Voice AI demos making verification requests will hit rate limits designed to stop OpenAI bulk scraping.**

---

## The Verification API Solution

Both publishers AND the Internet Archive should provide verification APIs.
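One way out of the blunt-rate-limit problem is to classify requests before limiting them, so a single global quota stops punishing fact-checkers for scrapers' behavior. A minimal sketch, with illustrative class names and quotas that are not Archive.org's actual policy:

```typescript
// Sketch: per-class quotas instead of one global rate limit.
// Classes and numbers are illustrative, not Archive.org's real system.
// (Omits the hourly window reset for brevity.)
type RequestClass = "verification" | "bulk_download" | "browsing";

const HOURLY_QUOTA: Record<RequestClass, number> = {
  verification: 1000, // fact-checking: generous
  bulk_download: 10,  // training-data extraction: throttled
  browsing: 500,      // ordinary readers
};

class ClassedRateLimiter {
  private used = new Map<string, number>();

  allow(clientId: string, cls: RequestClass): boolean {
    const key = `${clientId}:${cls}`;
    const count = this.used.get(key) ?? 0;
    if (count >= HOURLY_QUOTA[cls]) return false;
    this.used.set(key, count + 1);
    return true;
  }
}

const limiter = new ClassedRateLimiter();

// A bulk scraper exhausts its quota quickly...
for (let i = 0; i < 10; i++) limiter.allow("scraper-1", "bulk_download");
console.log(limiter.allow("scraper-1", "bulk_download")); // false

// ...while verification traffic in the same hour stays open.
console.log(limiter.allow("voice-ai-demo", "verification")); // true
```

The hard part in practice is classification itself (authenticated API keys per class are the obvious mechanism), which is exactly what the verification API specs below provide a surface for.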
### Publisher Verification API Spec

```typescript
// Standardized publisher verification API
interface PublisherVerificationAPI {
  // Endpoint
  base_url: "https://publisher.com/api/verify";

  // Authentication
  auth: {
    method: "API_key";
    rate_limit: "100 requests/hour per key";
    registration: "https://publisher.com/api/register";
  };

  // Endpoints
  endpoints: {
    verify_quote: "/quote";
    verify_fact: "/fact";
    verify_article_exists: "/article";
    get_corrections: "/corrections";
    get_article_metadata: "/metadata";
  };
}
```

**Example: verify a quote**

```http
POST https://nytimes.com/api/verify/quote
X-API-Key: your_api_key
Content-Type: application/json

{
  "quote": "The AI hallucinated this quote",
  "attributed_to": "Scott Shambaugh",
  "claimed_source": "https://nytimes.com/article/...",
  "requesting_agent": "Voice-AI-Demo-v1.0"
}
```

Response:

```json
{
  "quote_exists": false,
  "quote_verified": false,
  "article_exists": true,
  "article_contains_attributed_person": true,
  "suggested_correction": "Article mentions Scott Shambaugh but does not contain this quote",
  "verification_confidence": 1.0
}
```

**If publishers block archives, they MUST provide verification alternatives.**

### Internet Archive Verification API Spec

```typescript
// Internet Archive verification endpoint (separate from bulk access)
interface ArchiveVerificationAPI {
  base_url: "https://archive.org/api/verify";

  endpoints: {
    verify_quote_in_snapshot: "/quote";
    get_snapshot_metadata: "/snapshot";
    find_snapshots_for_url: "/search";
  };

  rate_limits: {
    verification_requests: "1000/hour"; // Higher, for fact-checking
    bulk_download: "10/hour"; // Lower, for training prevention
  };
}
```

**Example: verify a quote in an archive snapshot** (`snapshot_date` is optional; omit it to search all snapshots)

```http
POST https://archive.org/api/verify/quote
Content-Type: application/json

{
  "quote": "Text to verify",
  "source_url": "https://example.com/article",
  "snapshot_date": "2026-02-14"
}
```

Response:

```json
{
  "quote_found": true,
  "snapshot_url": "https://web.archive.org/web/20260214/example.com/article",
  "snapshot_date": "2026-02-14T12:00:00Z",
  "context": "...surrounding text...",
  "verification_confidence": 0.95
}
```

**Separate rate limits for verification vs. bulk scraping** allow legitimate fact-checking while preventing training data extraction.

---

## The Collateral Damage: Journalism History Lost

From Nieman Lab:

> Since there is no federal mandate that requires internet content to be preserved, the Internet Archive is the most robust archiving initiative in the United States.

**When publishers block the Internet Archive:**

**Lost:**

- Historical record of journalism
- Evolution of news coverage over time
- Deleted or corrected articles
- Fact-checking resources
- Citation verification

**Preserved:**

- Publishers' current paywalled content
- Licensing deal exclusivity
- AI training data control

**But publishers can't archive themselves.** From Nieman Lab:

> In December, Poynter announced a joint initiative with the Internet Archive to train local news outlets on how to preserve their content. Archiving initiatives like this, while urgently needed, are few and far between.

**Publishers need the Archive to preserve their content.** But they're blocking it to stop AI scraping.

**The dilemma is real.**

---

## Voice AI Demos: Don't Make This Worse

Voice AI demos generating content about users face the same archive blocking problem.
**DON'T:**

- Generate content citing blocked sources
- Publish without verification
- Assume archives will always be available
- Rely solely on archive verification

**DO:**

- Implement multi-method verification (source → archive → multi-source → human)
- Track which publishers block archives
- Use publisher APIs when available
- Require human verification for unverifiable claims
- BLOCK PUBLICATION when verification is impossible

**Layer 9 checklist (complete, including archive-aware verification):**

**Mechanism #1: Information Policy**

- [ ] Disclose what information is collected
- [ ] Explain why and how long it is retained
- [ ] Specify who has access
- [ ] Default: never publish user information externally

**Mechanism #2: Publication Lockdown**

- [ ] Block external publication about users by default
- [ ] Require explicit consent for sharing
- [ ] Log all publication attempts
- [ ] Alert users when the agent tries to publish

**Mechanism #3: Representation Standards**

- [ ] Never speak "for" users
- [ ] Quotes require explicit consent
- [ ] Clear attribution for user statements
- [ ] Distinguish agent opinions from user statements

**Mechanism #4: Traceability**

- [ ] Sign all outputs with agent ID and version
- [ ] Provide a human contact for accountability
- [ ] Log agent actions for audit
- [ ] Make ownership clear and verifiable

**Mechanism #5: AI-Generated Content Verification**

- [ ] Verify all quotes against source documents
- [ ] Detect hallucinations via source comparison
- [ ] Flag unverifiable claims for human review
- [ ] Block publication until verification is complete

**Mechanism #6: Archive-Aware Verification**

- [ ] Maintain a map of archive-blocking publishers
- [ ] Attempt source verification first
- [ ] Use archive verification when the source is unavailable
- [ ] Fall back to multi-source corroboration
- [ ] Integrate publisher verification APIs
- [ ] Require human verification for high-stakes claims
- [ ] NEVER publish unverifiable content

---

## The Permanent Hallucination Era

We are entering the **Permanent Hallucination Era.**

**Why:**

1. AI generates plausible hallucinations
2. Publishers use AI without adequate verification
3. Hallucinations get published
4. Archives preserve hallucinations
5. Publishers block archives (no correction mechanism)
6. Third parties archive hallucinations anyway
7. Publishers can't control third-party archives
8. Future AIs scrape third-party archives
9. Hallucinations are incorporated into training data
10. The next generation of AIs is trained on hallucinations

**The cycle is complete and irreversible.**

**Scott Shambaugh was the first documented case:**

- AI agent hit piece (MJ Rathbun)
- News AI hallucination (Ars Technica)
- Archive preservation (Archive.org)
- Publisher correction attempt (Ars pulls story)
- **Hallucinations persist in archives**

**Publishers blocking archives makes it worse:**

- No verification possible
- No correction mechanism
- No accountability infrastructure

**Voice AI demos must implement Layer 9 (all six mechanisms) before contributing to the permanent hallucination problem.**

---

## Appendix: The Archive Blocking List

From Nieman Lab's analysis of 1,167 news websites:

**241 sites block the Internet Archive** (at least one crawler)

**By ownership:**

- **87% (210 sites): USA Today Co./Gannett**
  - Largest U.S. newspaper conglomerate
  - Top-down decision across all properties
  - Added to robots.txt in 2025

**By bot type blocked:**

- **93% (226 sites): Block 2 of 4 Archive bots**
  - "archive.org_bot"
  - "ia_archiver-web.archive.org"
- **3 sites: Block 3 of 4 Archive bots**
  - Le Huffington Post (France)
  - Le Monde (France)
  - Le Monde English (France)
  - All owned by Groupe Le Monde

**Additional blocking:**

- **240 of 241: Also block Common Crawl**
- **231 of 241: Also block OpenAI, Google AI, and Common Crawl**

**Hard blocks (robots.txt + active blocking):**

- The New York Times
- The Athletic (NYT-owned)
- Des Moines Register (Gannett-owned)
- ...and 208 other Gannett properties

**Partial blocks:**

- The Guardian (articles blocked, homepages allowed)
- The Financial Times (paywalled content only)

**Notable absences from the block list:**

- The Washington Post
- The Wall Street Journal
- Reuters
- Associated Press
- BBC

**The trend is growing.** More publishers are joining in 2026.

---

## Publishers Block Archives. Hallucinations Become Permanent.

The Ars Technica hallucination problem is now unsolvable.

**Before archive blocking:**

- Hallucination published
- Archive preserves it
- Publisher discovers error
- Publisher links correction to archive
- Future readers see the warning

**After archive blocking:**

- Hallucination published (if published at all)
- **No archive exists**
- Publisher discovers error
- **No mechanism to link the correction**
- Third parties already archived it
- Publisher has no control
- **Hallucination permanent**

**Voice AI demos: Implement Layer 9 (all six mechanisms) now.**

**The archive blocking era requires verification infrastructure that works WITHOUT archives.**

**Design for it. Or become the next Ars Technica case study.**

---

**Nine-Layer Trust Architecture for Voice AI Demos:**

| Layer | Article | Framework | Extensions |
|-------|---------|-----------|------------|
| **9: Reputation Integrity** | #168, #170, #171 | Six mechanisms | Archive-aware verification, publisher API integration, multi-source corroboration |

**Layer 9 Complete + Archive-Aware:**

1. Information Policy
2. Publication Lockdown
3. Representation Standards
4. Traceability
5. AI-Generated Content Verification
6. **Archive-Aware Verification** (NEW)

**Implement all six. The permanent hallucination era is here.**