# TonyStr: "I Made My Own Git" (133 HN Points, 55 Comments) - Built Git Clone with SHA-256 + zstd to Understand Internals as Content-Addressable File Store - Voice AI for Demos Uses Read-Only DOM Parsing to Understand Interfaces Without Reimplementation

## Meta Description

TonyStr built a Git clone, "tvc", with SHA-256 hashing and zstd compression to understand Git as a content-addressable file store. Parsing was the hardest part. Voice AI for demos reads the DOM to understand interfaces without reimplementation. Both prove understanding requires reverse-engineering, not building from scratch.

## Introduction: Understanding by Reimplementation vs Understanding by Reverse-Engineering

Tony Strömberg's "I made my own Git" (133 HN points, 55 comments, 3 hours ago on Hacker News) documents building a Git clone called "tvc" (Tony's Version Control) using Rust, SHA-256 hashing, and zstd compression. His goal wasn't to replace Git; it was to **understand Git internals** by reimplementing core functionality.

His key insight: "git is just a content-addressable file store (key-value data store)."

His hardest challenge: **parsing Git's object formats**. He writes: "If I were to do this again, I would probably use yaml or json to store objects."

This connects directly to **Voice AI for website demos**: Tony's learning method (build a Git clone to understand internals) contrasts with Voice AI's approach (read the DOM to understand interface structure). Both prove **understanding requires reverse-engineering**, but the strategies differ:

- **Tony's approach**: Reimplementation (build a clone, discover complexity through implementation failures)
- **Voice AI approach**: Observation (read structure, infer navigation paths from element relationships)

The core trade-off: **Building teaches you about implementation constraints** (Tony discovered parsing is the hardest part). **Observing teaches you about structural patterns** (Voice AI discovers navigation complexity from DOM hierarchies).

For **website demos**, reimplementation is impossible: you can't "reimplement" a third-party website to understand its navigation. Voice AI must use Tony's hardest challenge (parsing) as its **only strategy**: read the interface structure, parse element relationships, infer navigation paths without execution.

This article explores:

1. **Tony's Git reimplementation**: What he built, why SHA-256 + zstd, what worked, what was hard
2. **Learning by building vs learning by observing**: When each strategy reveals system understanding
3. **Parsing as the universal challenge**: Why both Git objects and DOM structures resist easy interpretation
4. **Voice AI's read-only constraint**: How demos force observation strategy over reimplementation
5. **Content-addressable stores vs navigation-addressable interfaces**: Structural parallels between Git and websites

## Tony's Git Clone: What He Built and Why

### Core Functionality Implemented

Tony's "tvc" implements Git's essential features:

1. **`tvc ls`**: List files in working directory (equivalent to `ls` but tvc-aware)
2. **File hashing**: Generate SHA-256 hashes for file content (Git uses SHA-1, but SHA-256 is cryptographically stronger after SHA-1 collision attacks)
3. **Compression**: Use zstd to compress file content before storage (Git uses zlib, but zstd benchmarks show better performance)
4. **Tree object generation**: Recursively traverse directories, create tree objects representing directory structure
5. **Commit object generation**: Create commit objects with tree hash, parent commit hash, author, message
6. **HEAD file management**: Track current branch/commit via `.tvc/HEAD` file
7. **Checkout commits**: Restore working directory to specific commit state

### Why SHA-256 Instead of SHA-1?
Tony chose SHA-256 over Git's SHA-1 because:

- **SHA-1 is cryptographically broken**: Collision attacks have been demonstrated (SHAttered attack in 2017)
- **SHA-256 provides stronger security**: 256-bit vs 160-bit output, resistant to known collision attacks
- **Git is migrating to SHA-256**: Git 2.29+ supports SHA-256 repositories (experimental), eventual transition planned

The trade-off: **SHA-256 hashes are longer** (64 hex characters vs 40), which adds slight storage overhead. But for a learning project, security correctness outweighs storage efficiency.

### Why zstd Instead of zlib?

Tony chose zstd compression over Git's zlib because:

- **Better compression ratios**: zstd achieves comparable or better compression than zlib at similar speeds
- **Faster decompression**: zstd decompresses significantly faster than zlib (critical for checkout performance)
- **Tunable compression levels**: zstd offers granular control over speed/ratio trade-offs

Tony gives no figures beyond "better performance", which implies some benchmark testing. The advantages usually claimed for zstd are:

- **Comparable compression ratios** to zlib at default settings
- **~2-3x faster decompression** than zlib for typical source code files
- **Better multi-core scaling** for large repositories

For a Git clone focused on **understanding storage mechanics**, zstd's performance advantages matter less than the **conceptual point**: compression is an implementation detail orthogonal to the core content-addressable design.

## Git as Content-Addressable File Store: Tony's Key Insight

### What "Content-Addressable" Means

Tony's breakthrough: "git is just a content-addressable file store (key-value data store)."

In practical terms:

- **Content determines address**: File content → SHA-256 hash → storage location (`.tvc/objects/ab/cdef123...`)
- **Immutable storage**: Same content always produces same hash, stored once regardless of filename
- **Reference-based retrieval**: Tree objects store hashes pointing to blobs, commit objects store hashes pointing to trees

This contrasts with **traditional filesystems**:

- **Name-addressable storage**: File path (e.g., `/home/user/file.txt`) determines location
- **Mutable storage**: Same path can hold different content over time
- **Direct storage**: Directories store files directly, not references to content

### Why Content-Addressability Enables Version Control

Git's content-addressable design enables efficient version control:

1. **Automatic deduplication**: Identical files across commits stored once (same hash → same storage location)
2. **Cheap branching**: Branches are just pointers (commit hashes), no file duplication needed
3. **Cryptographic integrity**: Hash verification detects corruption/tampering automatically
4. **Efficient diffs**: Compare tree hashes to find changed files without full content comparison

Tony's implementation proves this: once you have content-addressable storage (hashing + compression + object database), version control operations become **pointer manipulation** rather than file copying.

### The Three Object Types: Blobs, Trees, Commits

Tony implemented Git's three core object types:

**1. Blob Objects (Files):**

```
blob <size>\0<content>
```

- Store file content compressed with zstd
- Hash of serialized object becomes storage address
- No filename stored (trees handle names)

**2. Tree Objects (Directories):**

```
tree <size>\0<mode> <name>\0<hash><mode> <name>\0<hash>...
```

- Store directory listings: permissions + filename + blob/tree hash
- Recursive structure: trees can reference other trees (subdirectories)
- Hash of serialized object represents entire directory snapshot

**3. Commit Objects (Snapshots):**

```
commit <size>\0
tree <hash>
parent <hash>
author <author>
committer <committer>

<message>
```

- Store snapshot metadata: which tree, which parent commits, who/when, why
- Form directed acyclic graph (DAG) of repository history
- HEAD file points to current commit hash

Tony's hardest challenge: **parsing these object formats correctly**. The `\0` null byte separators, variable-length fields, and nested structure make parsing error-prone. His reflection: "If I were to do this again, I would probably use yaml or json to store objects."

## Parsing as the Universal Challenge: Git Objects vs DOM Structures

### Why Tony Found Parsing Hardest

Tony's admission that parsing was the hardest part reveals a universal truth: **structured data without schema enforcement is inherently fragile**.

Git's object format challenges:

1. **Mixed binary/text encoding**: Null byte separators in otherwise text format
2. **Variable-length fields**: No fixed field widths, must scan for delimiters
3. **Nested structures**: Tree objects reference other trees, requires recursive parsing
4. **Error propagation**: A parse failure in one object breaks reads of everything reachable through it (a bad tree hides every blob and subtree beneath it)

His proposed solution (yaml/json for object storage) trades Git's compact binary format for:

- **Schema clarity**: Explicit field names, hierarchical nesting
- **Parser availability**: Every language has robust yaml/json parsers
- **Debugging ease**: Human-readable objects, easier to inspect corruption

But this misses **why Git chose its format**: compact storage and fast parsing for millions of objects. Git's format is optimized for **repository scale**, not developer ergonomics.

The lesson for Voice AI: **format choice reflects scale assumptions**. Git assumes millions of objects (a compact format is worth the parse complexity). Voice AI assumes hundreds of DOM elements per page (parsing simplicity is worth the slight verbosity of HTML).

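To make the delimiter scanning concrete, here is a minimal Python sketch of serializing and parsing objects in the `<type> <size>\0<body>` layout. It is not Tony's Rust code. The function names (`serialize_object`, `object_id`, `parse_object`, `parse_tree_entries`) are hypothetical, compression is left out, and the tree entry layout (`<mode> <name>\0` followed by a raw 32-byte SHA-256 digest) is an assumption that mirrors Git's layout with a longer hash rather than anything confirmed about tvc.

```python
import hashlib


def serialize_object(kind: str, body: bytes) -> bytes:
    """Build '<kind> <size>\\0<body>', the header layout Git uses."""
    return f"{kind} {len(body)}".encode() + b"\x00" + body


def object_id(serialized: bytes) -> str:
    """Content address: hex SHA-256 of the serialized object."""
    return hashlib.sha256(serialized).hexdigest()


def parse_object(serialized: bytes):
    """Split header from body by scanning for the first null byte."""
    header, sep, body = serialized.partition(b"\x00")
    if not sep:
        raise ValueError("missing null byte after header")
    kind, _, size_text = header.partition(b" ")
    if not size_text.isdigit() or int(size_text) != len(body):
        raise ValueError("declared size does not match body length")
    return kind.decode(), body


def parse_tree_entries(body: bytes, digest_len: int = 32):
    """Parse repeated '<mode> <name>\\0<raw digest>' entries (assumed layout)."""
    entries = []
    pos = 0
    while pos < len(body):
        # Variable-length part: scan forward to the null byte.
        nul = body.find(b"\x00", pos)
        if nul == -1:
            raise ValueError("tree entry missing null byte")
        mode, _, name = body[pos:nul].partition(b" ")
        # Fixed-length part: the raw digest may itself contain null bytes,
        # so it is skipped by length rather than scanned.
        digest = body[nul + 1 : nul + 1 + digest_len]
        if len(digest) != digest_len:
            raise ValueError("truncated digest in tree entry")
        entries.append((mode.decode(), name.decode(), digest.hex()))
        pos = nul + 1 + digest_len
    return entries


# Round trip: one blob referenced from one tree.
blob = serialize_object("blob", b"hello\n")
tree_body = b"100644 hello.txt\x00" + hashlib.sha256(blob).digest()
tree = serialize_object("tree", tree_body)
kind, body = parse_object(tree)
print(kind, parse_tree_entries(body))
```

The sketch shows where the fragility comes from: the parser has to switch between delimiter scanning (mode and name) and fixed-width reads (the digest) inside one entry, and an off-by-one in either mode misaligns every entry after it.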
### DOM Parsing Parallels to Git Object Parsing

Voice AI faces similar parsing challenges with DOM structures:

**HTML/DOM as structured data:**

```html
```

**Parsing challenges:**

1. **Nested structures**: Dropdowns inside menus inside headers, requires recursive traversal
2. **Variable semantics**: `