<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://avishek.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://avishek.net/" rel="alternate" type="text/html" /><updated>2026-05-18T05:48:57+05:30</updated><id>https://avishek.net/feed.xml</id><title type="html">Total Internal Reflection</title><subtitle>Technology and Art</subtitle><entry><title type="html">Engineering Log: An AI Harness for RedDragon</title><link href="https://avishek.net/2026/04/16/engineering-log-reddragon-testing-harness.html" rel="alternate" type="text/html" title="Engineering Log: An AI Harness for RedDragon" /><published>2026-04-16T00:00:00+05:30</published><updated>2026-04-16T00:00:00+05:30</updated><id>https://avishek.net/2026/04/16/engineering-log-reddragon-testing-harness</id><content type="html" xml:base="https://avishek.net/2026/04/16/engineering-log-reddragon-testing-harness.html"><![CDATA[<p><em><a href="https://github.com/avishek-sen-gupta/red-dragon">RedDragon</a> is a multi-language interpreter/compiler that lowers 16 languages (Python, Java, Go, TypeScript, Rust, C, C++, C#, JavaScript, Kotlin, Lua, PHP, Ruby, Pascal, Scala, COBOL) down to a shared IR. With 13,000+ tests across 16 frontends, keeping quality high and coverage visible requires a harness that operates at several layers simultaneously. This post documents that harness in full: how language feature coverage is tracked through feature enums, how the pre-commit pipeline is structured, and how Claude Code’s hook system is wired to enforce workflow discipline at edit time.</em></p>

<p><em>This is a work in progress. The harness described here reflects the current state of the tooling, but it is actively evolving: new gates get added as pain points emerge, non-blocking warnings get promoted to blockers as violation counts drop, and the hook and skills infrastructure grows alongside the codebase. Treat this as a snapshot, not a finished design.</em></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ul>
  <li><a href="#claudemd-the-llm-guidance-layer">CLAUDE.md: The LLM Guidance Layer</a></li>
  <li><a href="#language-feature-coverage-tracking">Language Feature Coverage Tracking</a>
    <ul>
      <li><a href="#feature-enums-self-documenting-members">Feature Enums: Self-Documenting Members</a></li>
      <li><a href="#the-covers-decorator">The @covers Decorator</a></li>
      <li><a href="#the-coverage-audit-script">The Coverage Audit Script</a></li>
      <li><a href="#what-uncovered-means">What Uncovered Means</a></li>
    </ul>
  </li>
  <li><a href="#the-pre-commit-pipeline">The Pre-commit Pipeline</a>
    <ul>
      <li><a href="#talisman-secret-detection">Talisman: Secret Detection</a></li>
      <li><a href="#terminology-guard">Terminology Guard</a></li>
      <li><a href="#black-auto-format">Black: Auto-format</a></li>
      <li><a href="#python-fp-lint-functional-programming-warnings">Python FP Lint: Functional Programming Warnings</a></li>
      <li><a href="#import-linter-architectural-contracts">import-linter: Architectural Contracts</a></li>
      <li><a href="#full-test-suite">Full Test Suite</a></li>
      <li><a href="#beads-backup">Beads Backup</a></li>
    </ul>
  </li>
  <li><a href="#claude-code-hooks-prepost-tool-use">Claude Code Hooks: Pre/Post Tool Use</a>
    <ul>
      <li><a href="#userpromptsubmit-contextual-invariant-injection">UserPromptSubmit: Contextual Invariant Injection</a></li>
      <li><a href="#pretooluse-the-gate-layer">PreToolUse: The Gate Layer</a></li>
    </ul>
  </li>
  <li><a href="#skills-and-workflow-automation">Skills and Workflow Automation</a>
    <ul>
      <li><a href="#superpowers-core-development-lifecycle-skills">Superpowers: Core Development Lifecycle Skills</a></li>
      <li><a href="#project-level-skills">Project-Level Skills</a></li>
      <li><a href="#plugins-context-injector-and-python-fp-lint">Plugins: Context Injector and Python FP Lint</a></li>
    </ul>
  </li>
  <li><a href="#the-full-picture">The Full Picture</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ul>

<hr />

<h2 id="claudemd-the-llm-guidance-layer">CLAUDE.md: The LLM Guidance Layer</h2>

<p><code class="language-plaintext highlighter-rouge">CLAUDE.md</code> is Claude Code’s project instruction file. It is loaded at the start of every session and provides the model with baseline knowledge about the project. RedDragon’s <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> is <strong>thin by design</strong>: it delegates to a set of imported files rather than encoding everything inline.</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># RedDragon: Agent Instructions</span>

<span class="gh">#import .claude/core/project-context.md</span>
<span class="gh">#import .claude/core/workflow.md</span>
<span class="gh">#import .claude/core/implementation.md</span>
<span class="gh">#import .claude/core/tools-search.md</span>
<span class="gh">#import .claude/conditional/design-principles.md</span>
</code></pre></div></div>

<p>The four core files cover: what the project is and how it is structured (<code class="language-plaintext highlighter-rouge">project-context.md</code>), how to interact and commit (<code class="language-plaintext highlighter-rouge">workflow.md</code>), coding conventions and guard rules (<code class="language-plaintext highlighter-rouge">implementation.md</code>), and when to use which search tool (<code class="language-plaintext highlighter-rouge">tools-search.md</code>). The conditional <code class="language-plaintext highlighter-rouge">design-principles.md</code> is always loaded here as well; the <code class="language-plaintext highlighter-rouge">UserPromptSubmit</code> hook additionally injects it dynamically when a prompt is classified as relevant.</p>

<p>This layer tells the model <em>how to work</em>: which test directories to use, when to brainstorm before coding, how to handle Talisman warnings, which formatting tool to run. It is effective for orienting the model and establishing defaults.</p>

<p>What it cannot do is <em>guarantee</em> those defaults are followed. Guidance in a prompt is a suggestion; it degrades across long contexts, gets overridden by competing instructions, and fails silently. <a href="https://arxiv.org/abs/2307.03172">Liu et al. (2023), <em>Lost in the Middle</em></a> showed that retrieval performance degrades most severely for information in the middle of the context window, which is exactly where <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> ends up as conversation history grows. The <code class="language-plaintext highlighter-rouge">UserPromptSubmit</code> hook re-injects invariant files on every prompt to keep guidance recent, but even that is probabilistic. <strong>“Most cases” is not a constraint.</strong> Everything below this section exists because LLM guidance alone is not sufficient.</p>

<hr />

<h2 id="language-feature-coverage-tracking">Language Feature Coverage Tracking</h2>

<p><em>Each of the 16 language frontends has a feature enum that enumerates every construct the language supports, implemented or not. A decorator on test methods links tests to enum members. An audit script computes the gap.</em></p>

<h3 id="feature-enums-self-documenting-members">Feature Enums: Self-Documenting Members</h3>

<p>Each of the 16 language frontends has a dedicated <code class="language-plaintext highlighter-rouge">features.py</code> file at <code class="language-plaintext highlighter-rouge">interpreter/frontends/{lang}/features.py</code> (COBOL lives at <code class="language-plaintext highlighter-rouge">interpreter/cobol/features.py</code>). Each file contains a single <code class="language-plaintext highlighter-rouge">XxxFeature(Enum)</code> class whose members use <strong>string values as documentation</strong> rather than <code class="language-plaintext highlighter-rouge">auto()</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PythonFeature</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Semantic features of the Python language.</span><span class="sh">"""</span>

    <span class="c1"># Declarations
</span>    <span class="n">VARIABLE_DECLARATION</span> <span class="o">=</span> <span class="sh">"</span><span class="s">simple name = value assignments at statement level</span><span class="sh">"</span>
    <span class="n">FUNCTION_DECLARATION</span> <span class="o">=</span> <span class="sh">"</span><span class="s">def f(...): function definitions</span><span class="sh">"</span>
    <span class="n">CLASS</span>                <span class="o">=</span> <span class="sh">"</span><span class="s">class C: class body definitions</span><span class="sh">"</span>

    <span class="c1"># Control flow
</span>    <span class="n">IF_ELSE</span>          <span class="o">=</span> <span class="sh">"</span><span class="s">if / elif / else conditional branching</span><span class="sh">"</span>
    <span class="n">WHILE_LOOP</span>       <span class="o">=</span> <span class="sh">"</span><span class="s">while cond: loop statements</span><span class="sh">"</span>
    <span class="n">FOR_LOOP</span>         <span class="o">=</span> <span class="sh">"</span><span class="s">for x in iterable: loop statements</span><span class="sh">"</span>
    <span class="n">MATCH_STATEMENT</span>  <span class="o">=</span> <span class="sh">"</span><span class="s">match subject: structural pattern matching (Python 3.10+)</span><span class="sh">"</span>

    <span class="c1"># Pattern Matching
</span>    <span class="n">CAPTURE_PATTERN</span>  <span class="o">=</span> <span class="sh">"</span><span class="s">case x: name capture in match cases</span><span class="sh">"</span>
    <span class="n">SEQUENCE_PATTERN</span> <span class="o">=</span> <span class="sh">"</span><span class="s">case [a, b]: sequence destructuring patterns</span><span class="sh">"</span>
    <span class="n">MAPPING_PATTERN</span>  <span class="o">=</span> <span class="sh">'</span><span class="s">case {</span><span class="sh">"</span><span class="s">key</span><span class="sh">"</span><span class="s">: v}: mapping destructuring patterns</span><span class="sh">'</span>
    <span class="bp">...</span>
</code></pre></div></div>

<p>The string value serves a dual purpose: it is the human-readable description shown when the enum is displayed in audit reports, and it documents in the source itself exactly which construct each member covers. There are currently <strong>948 features</strong> defined across the 16 language enums.</p>
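
<p>As a rough illustration of how the descriptions travel with the members, the sketch below iterates a cut-down enum and formats each member the way a report line might look. The <code class="language-plaintext highlighter-rouge">DemoFeature</code> class and the <code class="language-plaintext highlighter-rouge">describe</code> helper are hypothetical names used only for this example; they are not part of RedDragon.</p>

```python
from enum import Enum


class DemoFeature(Enum):
    """Hypothetical, cut-down stand-in for a real XxxFeature enum."""

    IF_ELSE = "if / elif / else conditional branching"
    WHILE_LOOP = "while cond: loop statements"


def describe(enum_cls: type[Enum]) -> list[str]:
    # Each member carries its own documentation in .value,
    # so a report needs no separate lookup table.
    return [f"{member.name}: {member.value}" for member in enum_cls]


for line in describe(DemoFeature):
    print(line)
```

<p>One consequence of using strings as values in a standard Python <code class="language-plaintext highlighter-rouge">Enum</code>: two members with identical description strings become aliases of each other, and iteration skips aliases, so descriptions need to be unique within an enum.</p>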

<h3 id="the-covers-decorator">The @covers Decorator</h3>

<p>Every test method that exercises a language feature is annotated with <code class="language-plaintext highlighter-rouge">@covers</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">tests.covers</span> <span class="kn">import</span> <span class="n">covers</span>
<span class="kn">from</span> <span class="n">interpreter.frontends.java.features</span> <span class="kn">import</span> <span class="n">JavaFeature</span>

<span class="k">class</span> <span class="nc">TestJavaInterface</span><span class="p">:</span>
    <span class="nd">@covers</span><span class="p">(</span><span class="n">JavaFeature</span><span class="p">.</span><span class="n">INTERFACE</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">test_interface_method_lowering</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">result</span> <span class="o">=</span> <span class="nf">run</span><span class="p">(</span><span class="sh">"""</span><span class="s">
            interface Greeter { String greet(String name); }
        </span><span class="sh">"""</span><span class="p">,</span> <span class="sh">"</span><span class="s">java</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">assert</span> <span class="n">result</span> <span class="o">==</span> <span class="bp">...</span>
</code></pre></div></div>

<p>The decorator adds <strong>no behavior at runtime</strong>: it returns the original function unwrapped and only attaches a <code class="language-plaintext highlighter-rouge">_covers</code> frozenset to the function object:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">covers</span><span class="p">(</span><span class="o">*</span><span class="n">features</span><span class="p">:</span> <span class="n">Enum</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Callable</span><span class="p">[[</span><span class="n">_F</span><span class="p">],</span> <span class="n">_F</span><span class="p">]:</span>
    <span class="k">def</span> <span class="nf">_decorator</span><span class="p">(</span><span class="n">func</span><span class="p">:</span> <span class="n">_F</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">_F</span><span class="p">:</span>
        <span class="n">func</span><span class="p">.</span><span class="n">_covers</span> <span class="o">=</span> <span class="nf">frozenset</span><span class="p">(</span><span class="n">features</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">func</span>
    <span class="k">return</span> <span class="n">_decorator</span>
</code></pre></div></div>

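<p>This means <strong>zero test overhead</strong>: the test function is not wrapped, so calling it costs nothing extra. The annotation only matters to the audit script, which reads it from the source.</p>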
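
<p>A small self-contained sketch of the same decorator in use, showing that the decorated object is still the original function and that the metadata can be read back with <code class="language-plaintext highlighter-rouge">getattr</code>. The <code class="language-plaintext highlighter-rouge">DemoFeature</code> enum and the test function are hypothetical; the <code class="language-plaintext highlighter-rouge">_F</code> type variable is an assumption about how the original is typed.</p>

```python
from collections.abc import Callable
from enum import Enum
from typing import TypeVar

_F = TypeVar("_F", bound=Callable[..., object])  # assumed definition of _F


class DemoFeature(Enum):  # hypothetical stand-in for a real feature enum
    INTERFACE = "interface declarations"


def covers(*features: Enum) -> Callable[[_F], _F]:
    def _decorator(func: _F) -> _F:
        func._covers = frozenset(features)  # metadata only; func is returned unwrapped
        return func
    return _decorator


@covers(DemoFeature.INTERFACE)
def test_interface_lowering() -> None:
    pass


# Undecorated functions simply have no _covers attribute, hence the default.
tagged = getattr(test_interface_lowering, "_covers", frozenset())
print(tagged == frozenset({DemoFeature.INTERFACE}))
```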

<hr />

<h3 id="the-coverage-audit-script">The Coverage Audit Script</h3>

<p><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/scripts/feature_coverage_audit.py"><code class="language-plaintext highlighter-rouge">scripts/feature_coverage_audit.py</code></a> performs a three-phase static analysis:</p>

<p><strong>Phase 1: Feature module discovery.</strong> It globs <code class="language-plaintext highlighter-rouge">interpreter/frontends/*/features.py</code> and <code class="language-plaintext highlighter-rouge">interpreter/cobol/features.py</code>, imports each module, and collects every enum member along with its string description.</p>
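
<p>A minimal sketch of what this discovery step could look like, assuming modules are loaded by file path with <code class="language-plaintext highlighter-rouge">importlib</code>. The helper name <code class="language-plaintext highlighter-rouge">load_feature_enums</code> and the throwaway directory layout are hypothetical; the actual script may import modules differently.</p>

```python
import importlib.util
import inspect
import tempfile
from enum import Enum
from pathlib import Path


def load_feature_enums(path: Path) -> dict[str, dict[str, str]]:
    """Import one features.py by path and return {EnumName: {MEMBER: description}}."""
    spec = importlib.util.spec_from_file_location(f"features_{path.parent.name}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return {
        name: {member.name: str(member.value) for member in cls}
        for name, cls in inspect.getmembers(module, inspect.isclass)
        if issubclass(cls, Enum) and cls is not Enum  # skip the imported base class
    }


# Demo on a temporary tree that mimics frontends/{lang}/features.py.
with tempfile.TemporaryDirectory() as tmp:
    features_py = Path(tmp) / "frontends" / "demo" / "features.py"
    features_py.parent.mkdir(parents=True)
    features_py.write_text(
        "from enum import Enum\n"
        "class DemoFeature(Enum):\n"
        "    IF_ELSE = 'if / else branching'\n"
    )
    found = {}
    for path in sorted(Path(tmp).glob("frontends/*/features.py")):
        found.update(load_feature_enums(path))

print(found)
```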

<p><strong>Phase 2: Test file scanning.</strong> It uses Python’s <code class="language-plaintext highlighter-rouge">ast</code> module (<strong>not regex</strong>) to walk all <code class="language-plaintext highlighter-rouge">test_*.py</code> files under <code class="language-plaintext highlighter-rouge">tests/unit/</code> and <code class="language-plaintext highlighter-rouge">tests/integration/</code>, extracting every <code class="language-plaintext highlighter-rouge">@covers(XxxFeature.MEMBER)</code> reference <strong>without executing any code</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_covers_refs_in_file</span><span class="p">(</span><span class="n">path</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">frozenset</span><span class="p">[</span><span class="n">FeatureRef</span><span class="p">]:</span>
    <span class="n">tree</span> <span class="o">=</span> <span class="n">ast</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="n">path</span><span class="p">.</span><span class="nf">read_text</span><span class="p">())</span>
    <span class="k">return</span> <span class="nf">frozenset</span><span class="p">(</span>
        <span class="nc">FeatureRef</span><span class="p">(</span><span class="n">enum_class_name</span><span class="o">=</span><span class="n">arg</span><span class="p">.</span><span class="n">value</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span> <span class="n">member_name</span><span class="o">=</span><span class="n">arg</span><span class="p">.</span><span class="n">attr</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">ast</span><span class="p">.</span><span class="nf">walk</span><span class="p">(</span><span class="n">tree</span><span class="p">)</span>
        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="p">(</span><span class="n">ast</span><span class="p">.</span><span class="n">FunctionDef</span><span class="p">,</span> <span class="n">ast</span><span class="p">.</span><span class="n">AsyncFunctionDef</span><span class="p">))</span>
        <span class="ow">and</span> <span class="n">node</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="nf">startswith</span><span class="p">(</span><span class="sh">"</span><span class="s">test_</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">dec</span> <span class="ow">in</span> <span class="n">node</span><span class="p">.</span><span class="n">decorator_list</span>
        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">dec</span><span class="p">,</span> <span class="n">ast</span><span class="p">.</span><span class="n">Call</span><span class="p">)</span>
        <span class="ow">and</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">dec</span><span class="p">.</span><span class="n">func</span><span class="p">,</span> <span class="n">ast</span><span class="p">.</span><span class="n">Name</span><span class="p">)</span>
        <span class="ow">and</span> <span class="n">dec</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="nb">id</span> <span class="o">==</span> <span class="sh">"</span><span class="s">covers</span><span class="sh">"</span>
        <span class="k">for</span> <span class="n">arg</span> <span class="ow">in</span> <span class="n">dec</span><span class="p">.</span><span class="n">args</span>
        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">arg</span><span class="p">,</span> <span class="n">ast</span><span class="p">.</span><span class="n">Attribute</span><span class="p">)</span> <span class="ow">and</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">arg</span><span class="p">.</span><span class="n">value</span><span class="p">,</span> <span class="n">ast</span><span class="p">.</span><span class="n">Name</span><span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div>

<p><strong>Phase 3: Gap computation.</strong> For each language, the covered set (enum members that appear in at least one <code class="language-plaintext highlighter-rouge">@covers</code>) is subtracted from the full member set to produce the uncovered list.</p>
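
<p>The gap computation amounts to a per-enum set difference. A sketch under the assumption that covered references are available as <code class="language-plaintext highlighter-rouge">(enum class name, member name)</code> pairs; the function name <code class="language-plaintext highlighter-rouge">uncovered</code> and the sample data are made up for illustration.</p>

```python
def uncovered(
    all_members: dict[str, frozenset[str]],
    covered_refs: frozenset[tuple[str, str]],
) -> dict[str, list[str]]:
    """Return, per enum class, the members that no @covers reference mentions."""
    return {
        enum_name: sorted(
            member for member in members if (enum_name, member) not in covered_refs
        )
        for enum_name, members in all_members.items()
    }


all_members = {"DemoFeature": frozenset({"IF_ELSE", "WHILE_LOOP", "MATCH_STATEMENT"})}
covered_refs = frozenset({("DemoFeature", "IF_ELSE"), ("DemoFeature", "WHILE_LOOP")})
print(uncovered(all_members, covered_refs))
```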

<p>The output is a JSON coverage report plus a summary table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  c            :  33 /  48 covered  (15 uncovered)
  cobol        :  87 / 106 covered  (19 uncovered)
  cpp          :  38 /  84 covered  (46 uncovered)
  csharp       :  71 /  94 covered  (23 uncovered)
  go           :  40 /  44 covered  (4 uncovered)
  java         :  50 /  72 covered  (22 uncovered)
  javascript   :  33 /  38 covered  (5 uncovered)
  kotlin       :  51 /  59 covered  (8 uncovered)
  lua          :  19 /  25 covered  (6 uncovered)
  pascal       :  38 /  47 covered  (9 uncovered)
  php          :  46 /  55 covered  (9 uncovered)
  python       :  35 /  55 covered  (20 uncovered)
  ruby         :  67 /  72 covered  (5 uncovered)
  rust         :  46 /  60 covered  (14 uncovered)
  scala        :  43 /  53 covered  (10 uncovered)
  typescript   :  16 /  36 covered  (20 uncovered)
  ──────────────────────────────────────────────────
  TOTAL: 948 features across 16 languages, 235 uncovered
</code></pre></div></div>

<p>The script also supports <code class="language-plaintext highlighter-rouge">--gaps-doc docs/frontend-lowering-gaps.md</code> to regenerate a full Markdown report with a summary table and per-language uncovered feature lists.</p>

<hr />

<h3 id="what-uncovered-means">What Uncovered Means</h3>

<p>The feature enums are meant to be a <strong>comprehensive inventory of the language</strong>: every feature the language has, whether or not RedDragon supports it yet. An enum member with no <code class="language-plaintext highlighter-rouge">@covers</code> annotation is not a mistake in the enum; it is a <strong>documented gap</strong>: RedDragon’s frontend does not yet handle this construct.</p>

<p>This makes the coverage audit a <strong>living gap analysis</strong> as much as a test quality report. Running the audit tells you two things simultaneously: which features are implemented and tested, and which features are known to be missing. As of this writing there are 948 features across 16 languages, with 235 uncovered when the per-language rows of the summary table are added up. The number moves as new features are added to the enums and implementation catches up.</p>

<hr />

<h2 id="the-pre-commit-pipeline">The Pre-commit Pipeline</h2>

<p><em>Seven gates run in sequence on every <code class="language-plaintext highlighter-rouge">git commit</code>. The hook logic lives in <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/.claude/hooks/pre-commit"><code class="language-plaintext highlighter-rouge">.claude/hooks/pre-commit</code></a> (<strong>versioned in the repo</strong>); <code class="language-plaintext highlighter-rouge">.git/hooks/pre-commit</code> simply delegates to it. Blocking gates exit non-zero to abort the commit; non-blocking gates warn and continue.</em></p>

<pre><code class="language-mermaid">flowchart TD
    classDef gate    fill:#b91c1c,stroke:#7f1d1d,color:#fff
    classDef autofix fill:#0f766e,stroke:#134e4a,color:#fff
    classDef warn    fill:#b45309,stroke:#78350f,color:#fff
    classDef stop    fill:#1f2937,stroke:#111827,color:#f87171
    classDef start   fill:#1e3a8a,stroke:#1e3a8a,color:#fff
    classDef done    fill:#15803d,stroke:#14532d,color:#fff

    GC(["git commit"]):::start

    GC        --&gt; TAL
    TAL       --&gt;|pass| TG
    TAL       --&gt;|fail| S1

    TG        --&gt;|pass| BLK
    TG        --&gt;|violation| S2

    BLK       --&gt; FPL
    FPL       --&gt; IMP

    IMP       --&gt;|pass| PYT
    IMP       --&gt;|fail| S3

    PYT       --&gt;|pass| BD
    PYT       --&gt;|fail| S4

    BD        --&gt;|pass| OK
    BD        --&gt;|fail| S5

    TAL["Talisman\nsecret detection"]:::gate
    TG["Terminology guard\nchecks staged diff"]:::gate
    BLK["Black\nauto-format + re-stage"]:::autofix
    FPL(["python-fp-lint\nFP warnings only"]):::warn
    IMP["import-linter\n5 architectural contracts"]:::gate
    PYT["pytest\n13k+ tests  -x -q"]:::gate
    BD["Beads\nbackup"]:::gate

    S1["blocked"]:::stop
    S2["blocked"]:::stop
    S3["blocked"]:::stop
    S4["blocked"]:::stop
    S5["blocked"]:::stop

    OK(["commit created"]):::done
</code></pre>
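
<p>The control flow in the diagram can be sketched in a few lines. This is an illustration of the blocking versus non-blocking distinction only, not the hook itself; the <code class="language-plaintext highlighter-rouge">Gate</code> type, <code class="language-plaintext highlighter-rouge">run_gates</code>, and the stub gates are hypothetical.</p>

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    run: Callable[[], int]  # returns a process-style exit code
    blocking: bool


def run_gates(gates: tuple[Gate, ...]) -> int:
    """Run gates in order; stop at the first failing blocking gate."""
    for gate in gates:
        code = gate.run()
        if code == 0:
            continue
        if gate.blocking:
            print(f"{gate.name}: failed, commit blocked")
            return code
        print(f"{gate.name}: warnings only, continuing")
    return 0


# Stub gates standing in for the real tools.
demo = (
    Gate("secret-scan", lambda: 0, blocking=True),
    Gate("fp-lint", lambda: 1, blocking=False),
    Gate("tests", lambda: 0, blocking=True),
)
print(run_gates(demo))
```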

<h3 id="talisman-secret-detection">Talisman: Secret Detection</h3>

<p><a href="https://github.com/thoughtworks/talisman">Talisman</a> scans all staged files for patterns that look like secrets: API keys, tokens, private key headers, high-entropy strings. It runs before everything else. If it flags something, the commit is blocked and you must either fix the content or add an explicit whitelist entry to <code class="language-plaintext highlighter-rouge">.talismanrc</code>. The <code class="language-plaintext highlighter-rouge">.talismanrc</code> policy is <strong>append-only</strong>: existing entries are never modified, new entries are added at the end.</p>

<hr />

<h3 id="terminology-guard">Terminology Guard</h3>

<p>The terminology guard is a two-part system that prevents sensitive terms from entering tracked artifacts. A custom script at <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/scripts/check-terminology"><code class="language-plaintext highlighter-rouge">scripts/check-terminology</code></a> scans staged content against a blocklist of domain-specific vocabulary. It operates independently of Talisman: Talisman looks for secrets, while the terminology guard looks for project-specific identifiers, and both block on failure. This matters because RedDragon is used in consulting contexts where client names, system codes, and vendor identifiers must never appear in a public repository.</p>

<p><strong>The blocklist</strong> lives at <code class="language-plaintext highlighter-rouge">~/.config/git/blocklist.txt</code>, one regex pattern per line, case-sensitive, applied via <code class="language-plaintext highlighter-rouge">grep -E</code>. It covers project names, government domain identifiers, vendor framework names, and reference system codes. An exclude-list at <code class="language-plaintext highlighter-rouge">~/.config/git/blocklist-exclude.txt</code> exempts files that match specific glob patterns.</p>

<p><strong>At commit time</strong>, <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/scripts/check-terminology"><code class="language-plaintext highlighter-rouge">scripts/check-terminology</code></a> scans the staged diff using <code class="language-plaintext highlighter-rouge">git diff --cached --diff-filter=ACMR -U0</code>. It only looks at added lines (lines starting with <code class="language-plaintext highlighter-rouge">+</code>, excluding the <code class="language-plaintext highlighter-rouge">+++</code> header). If a forbidden term appears in a new or modified line, the commit is blocked with a formatted table showing the commit reference, file:line location, matched term, and 60-character context snippet with the term highlighted:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Forbidden terms in staged changes
──────────────────────────────────────────────────────────────────────────────────────────
  COMMIT         LOCATION                       TERM       CONTEXT
──────────────────────────────────────────────────────────────────────────────────────────
  staged         src/parser.py:42               "REDACTED" ...client = REDACTEDClient()...
──────────────────────────────────────────────────────────────────────────────────────────
  1 hit(s)

  Blocklist: ~/.config/git/blocklist.txt
  Fix the content or update the blocklist to proceed.
</code></pre></div></div>
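
<p>The added-lines-only scan can be sketched as follows. This is a Python approximation of the idea, not the shell script: <code class="language-plaintext highlighter-rouge">Hit</code>, <code class="language-plaintext highlighter-rouge">scan_added_lines</code>, and the sample term <code class="language-plaintext highlighter-rouge">Acme</code> are hypothetical, and Python’s <code class="language-plaintext highlighter-rouge">re</code> dialect differs in places from <code class="language-plaintext highlighter-rouge">grep -E</code>.</p>

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    path: str
    line: int
    term: str


# Hunk header of a unified diff; group 1 is the first line number on the new side.
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def scan_added_lines(diff_text: str, patterns: list[str]) -> list[Hit]:
    """Scan a -U0 unified diff and report blocklist matches on added lines only."""
    compiled = [re.compile(p) for p in patterns]
    hits: list[Hit] = []
    path, line_no = "", 0
    for raw in diff_text.splitlines():
        if raw.startswith("+++ "):
            # File header, not content. (Simplification: an added line whose own
            # text begins with "++ " would be misread as a header here.)
            path = raw[4:].removeprefix("b/")
        elif (m := HUNK.match(raw)):
            line_no = int(m.group(1))
        elif raw.startswith("+"):
            for rx in compiled:
                found = rx.search(raw[1:])
                if found:
                    hits.append(Hit(path, line_no, found.group(0)))
            line_no += 1  # with -U0 only added lines advance the new-side counter
    return hits


demo_diff = "\n".join([
    "diff --git a/src/parser.py b/src/parser.py",
    "--- a/src/parser.py",
    "+++ b/src/parser.py",
    "@@ -41,0 +42,2 @@",
    "+client = AcmeClient()",
    "+print('ok')",
])
print(scan_added_lines(demo_diff, ["Acme"]))
```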

<p><strong>For history scanning</strong>, <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/scripts/scan-history"><code class="language-plaintext highlighter-rouge">scripts/scan-history</code></a> performs a retroactive audit of the entire git history using the same blocklist. It runs two passes:</p>

<ol>
  <li><strong>File contents</strong>: uses <code class="language-plaintext highlighter-rouge">git log --all -G "$PATTERN"</code> to find commits whose diffs add or remove a line matching the pattern, then walks each affected file at that commit SHA with <code class="language-plaintext highlighter-rouge">git show "$sha:$file"</code> to extract matching lines and their locations.</li>
  <li><strong>Commit messages</strong>: scans <code class="language-plaintext highlighter-rouge">git log --all --format='%h %s %b'</code> for forbidden terms in the subject and body of every commit message.</li>
</ol>

<p>Both scans produce the same formatted table output. The script is run manually (not as a hook) to audit historical exposure, typically when the blocklist is updated.</p>

<p>The shared formatting logic (<code class="language-plaintext highlighter-rouge">print_table_header</code>, <code class="language-plaintext highlighter-rouge">print_table_row</code>, <code class="language-plaintext highlighter-rouge">snippet_around</code>) lives in <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/scripts/lib-terminology.sh"><code class="language-plaintext highlighter-rouge">scripts/lib-terminology.sh</code></a>, which both the pre-commit guard and the history scanner source. This ensures consistent output format between the commit gate and the retrospective scan.</p>

<p><strong>At the AI agent level</strong>, the <a href="https://github.com/avishek-sen-gupta/context-injector/blob/main/hooks/bd-terminology-guard.sh"><code class="language-plaintext highlighter-rouge">bd-terminology-guard.sh</code></a> PreToolUse hook intercepts Beads write commands before they reach the issue tracker. This catches the case where the AI composes an issue title or body containing a sensitive term; without this hook, the term could end up in the issue database even though it never touches a file.</p>

<hr />

<h3 id="black-auto-format">Black: Auto-format</h3>

<p><a href="https://github.com/psf/black">Black</a> runs on the entire repository. It is non-blocking: rather than rejecting unformatted code, it reformats the files and re-stages them (<code class="language-plaintext highlighter-rouge">git add -u</code>) before the later gates run. A commit with non-Black-formatted code still goes through, but what ends up staged is the Black-formatted version.</p>

<hr />

<h3 id="python-fp-lint-functional-programming-warnings">Python FP Lint: Functional Programming Warnings</h3>

<p><a href="https://github.com/avishek-sen-gupta/python-fp-lint"><code class="language-plaintext highlighter-rouge">python-fp-lint</code></a> is a custom linter that enforces functional programming conventions across <code class="language-plaintext highlighter-rouge">interpreter/</code>. It runs three backends in sequence, each responsible for a different class of violation.</p>

<p>For the structural rules, the obvious alternative to ast-grep (Backend 1 below) would be Semgrep, but Semgrep’s rule registry and advanced features are gated behind account creation and a paywall, and I firmly believe that <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/PHILOSOPHY.md">useful software should be free</a>. ast-grep is <strong>fully open, runs entirely offline</strong>, and its YAML rule format is simpler for custom rules. For the patterns needed here (single-node structural matches on mutation syntax), it is more than sufficient.</p>

<p><strong>Backend 1: <a href="https://github.com/ast-grep/ast-grep">ast-grep</a> (structural pattern rules).</strong> Twenty-eight custom YAML rules match syntactic mutation patterns that ast-grep can detect with a single-node structural match. The rules cover direct mutation (subscript assignment <code class="language-plaintext highlighter-rouge">d[k] = v</code>, augmented assignment <code class="language-plaintext highlighter-rouge">x += y</code>, attribute mutation <code class="language-plaintext highlighter-rouge">self.x = y</code>, loop mutation), collection mutation methods (<code class="language-plaintext highlighter-rouge">list.append/insert/remove/pop/extend</code>, <code class="language-plaintext highlighter-rouge">set.add/discard</code>, <code class="language-plaintext highlighter-rouge">dict.update/clear/setdefault</code>), and annotation hygiene (<code class="language-plaintext highlighter-rouge">list</code>/<code class="language-plaintext highlighter-rouge">dict</code> parameter types, unfrozen dataclasses, <code class="language-plaintext highlighter-rouge">None</code>-defaulted parameters). Using ast-grep here rather than regex is deliberate: patterns such as <code class="language-plaintext highlighter-rouge">$OBJ[$KEY] = $VAL</code> or <code class="language-plaintext highlighter-rouge">@dataclass</code> without <code class="language-plaintext highlighter-rouge">frozen=True</code> span several tokens and would otherwise require fragile regexes or a full parser to match correctly.</p>
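
<p>To make the rule categories concrete, here is a small sketch contrasting a mutation shape of the kind described above with a non-mutating alternative. The function and class names are hypothetical, and whether each line would pass or fail a specific RedDragon rule is an assumption; the point is only the style the rules push toward.</p>

```python
from collections.abc import Mapping
from dataclasses import dataclass, replace


@dataclass(frozen=True)  # frozen, so attribute assignment on instances raises
class Scope:
    names: tuple[str, ...] = ()


def bump_mutating(counts: dict[str, int], key: str) -> None:
    counts[key] = counts.get(key, 0) + 1  # subscript assignment: $OBJ[$KEY] = $VAL


def bump_pure(counts: Mapping[str, int], key: str) -> dict[str, int]:
    return {**counts, key: counts.get(key, 0) + 1}  # new dict, argument untouched


def declare(scope: Scope, name: str) -> Scope:
    # Instead of appending to a list in place, build a new frozen value.
    return replace(scope, names=(*scope.names, name))


before = {"a": 1}
after = bump_pure(before, "a")
print(before, after)
print(declare(Scope(("x",)), "y"))
```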

<p><strong>Backend 2: <a href="https://github.com/astral-sh/ruff">Ruff</a> (general code quality).</strong> The selected rule sets are <code class="language-plaintext highlighter-rouge">E</code>/<code class="language-plaintext highlighter-rouge">W</code> (pycodestyle), <code class="language-plaintext highlighter-rouge">F</code> (Pyflakes), <code class="language-plaintext highlighter-rouge">I</code> (isort), <code class="language-plaintext highlighter-rouge">B</code> (flake8-bugbear), <code class="language-plaintext highlighter-rouge">UP</code> (pyupgrade), <code class="language-plaintext highlighter-rouge">SIM</code> (flake8-simplify), <code class="language-plaintext highlighter-rouge">RUF</code> (Ruff-specific), <code class="language-plaintext highlighter-rouge">BLE</code> (blind-except), <code class="language-plaintext highlighter-rouge">T20</code> (print statements), <code class="language-plaintext highlighter-rouge">TID252</code> (tidy-imports), and <code class="language-plaintext highlighter-rouge">C901</code> (McCabe complexity). These cover standard quality checks that have nothing to do with FP style but are useful to have in a single pass.</p>

<p><strong>Backend 3: <a href="https://github.com/serge-sans-paille/beniget">beniget</a> (def-use chain reassignment detection).</strong> Variable reassignment (binding the same name twice in the same scope) cannot be detected by ast-grep, because the pattern spans multiple statements and requires knowing whether a name was already defined earlier in the same scope. <a href="https://github.com/serge-sans-paille/beniget">beniget</a> builds a def-use chain for each Python module by walking the AST and tracking every definition and use of every name. The <code class="language-plaintext highlighter-rouge">ReassignmentGate</code> queries <code class="language-plaintext highlighter-rouge">duc.locals</code> to find names with more than one definition node in the same scope and reports the second and subsequent assignments as violations.</p>

<p><strong>Current limitation: this backend only catches local variable reassignment. Object member reassignment (<code class="language-plaintext highlighter-rouge">self.x = y</code>) is not detected here. The <code class="language-plaintext highlighter-rouge">no-attribute-augmented-mutation</code> ast-grep rule catches the augmented form (<code class="language-plaintext highlighter-rouge">self.x += y</code>), but it does not cover simple attribute writes.</strong></p>

<p>The FP linter is currently <strong>non-blocking</strong> (<code class="language-plaintext highlighter-rouge">|| true</code>): the codebase has accumulated violations from before the linter existed, and the migration toward functional patterns is incremental. The intent is to make it blocking once the violation count reaches zero.</p>

<p>The linter lives in its own venv at <code class="language-plaintext highlighter-rouge">~/.claude/plugins/python-fp-lint/venv/</code> and is invoked directly:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.claude/plugins/python-fp-lint/venv/bin/python <span class="se">\</span>
    <span class="nt">-m</span> python_fp_lint check <span class="s2">"</span><span class="nv">$REPO_ROOT</span><span class="s2">/interpreter/"</span> <span class="o">||</span> <span class="nb">true</span>
</code></pre></div></div>

<hr />

<h3 id="import-linter-architectural-contracts">import-linter: Architectural Contracts</h3>

<p><a href="https://import-linter.readthedocs.io/">import-linter</a> enforces five architectural contracts:</p>

<table>
  <thead>
    <tr>
      <th>Contract</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>VM must not import frontend lowering code</td>
      <td>Prevents lowering logic bleeding into the VM</td>
    </tr>
    <tr>
      <td>IR module must not import other interpreter modules</td>
      <td>Keeps IR as a pure data type layer</td>
    </tr>
    <tr>
      <td>Project module must not import VM internals</td>
      <td>Enforces the project/VM boundary</td>
    </tr>
    <tr>
      <td>Language frontends must not import each other</td>
      <td>Prevents cross-frontend coupling</td>
    </tr>
    <tr>
      <td>COBOL module only imported by frontend factory</td>
      <td>Isolates the COBOL frontend</td>
    </tr>
  </tbody>
</table>

<p>These are <strong>blocking</strong>. A contract violation fails the commit.</p>
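<p>As an illustration of what a “must not import” contract checks, here is a minimal sketch using the standard library <code class="language-plaintext highlighter-rouge">ast</code> module. It is not how import-linter works internally, and it only inspects direct, absolute imports in a single source string. The package names (<code class="language-plaintext highlighter-rouge">interpreter.vm</code>, <code class="language-plaintext highlighter-rouge">interpreter.frontends</code>) and function names are assumptions made for the example, not RedDragon’s actual layout.</p>

```python
import ast


def imported_modules(source):
    """Yield the absolute dotted module names imported by `source`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield alias.name
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            yield node.module


def _within(module, package):
    """True if `module` is `package` itself or one of its submodules."""
    return module == package or module.startswith(package + ".")


def forbidden_imports(importer, source, source_pkg, forbidden_pkg):
    """Return the imports that break 'source_pkg must not import forbidden_pkg'.

    `importer` is the dotted name of the module whose `source` is checked.
    Modules outside `source_pkg` are not subject to the contract.
    """
    if not _within(importer, source_pkg):
        return []
    return sorted({m for m in imported_modules(source) if _within(m, forbidden_pkg)})
```

<p>A non-empty result corresponds to a contract violation, which in the real pipeline fails the commit.</p>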

<hr />

<h3 id="full-test-suite">Full Test Suite</h3>

<p><code class="language-plaintext highlighter-rouge">poetry run python -m pytest tests/ -x -q</code> runs all 13,000+ tests with fail-fast (<code class="language-plaintext highlighter-rouge">-x</code>). This is the most expensive gate (~50 seconds) but <strong>it runs last, after all the cheaper gates have passed</strong>. It is <strong>blocking</strong>.</p>

<p>One deliberate omission: <strong>coverage measurement</strong> (<code class="language-plaintext highlighter-rouge">--cov</code>) is not part of the pre-commit hook. At 13,000+ tests, adding coverage instrumentation locally would push commit time past the point of acceptability. Coverage runs in CI instead, where the cost is paid asynchronously and doesn’t interrupt the development loop.</p>

<hr />

<h3 id="beads-backup">Beads Backup</h3>

<p><a href="https://beads.sh">Beads</a> is the issue tracker used for this project. The pre-commit hook calls <code class="language-plaintext highlighter-rouge">bd backup</code> to snapshot the issue database before the commit lands. This ensures the issue tracker state is always in sync with the code history. <strong>Blocking</strong>: if the backup fails, the commit is blocked.</p>

<hr />

<h2 id="claude-code-hooks-prepost-tool-use">Claude Code Hooks: Pre/Post Tool Use</h2>

<p><em>Shell scripts wired into Claude Code’s event system. <code class="language-plaintext highlighter-rouge">UserPromptSubmit</code> injects project invariants into every prompt. <code class="language-plaintext highlighter-rouge">PreToolUse</code> blocks edits that violate architectural or terminology constraints.</em></p>

<p>Claude Code’s hook system allows shell scripts to intercept tool calls before they execute (PreToolUse). RedDragon uses this to enforce workflow discipline <strong>at the point where the AI agent is making edits, not just at commit time</strong>.</p>

<p>All hooks are configured in <code class="language-plaintext highlighter-rouge">.claude/settings.json</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"PreToolUse"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="p">{</span><span class="w"> </span><span class="nl">"matcher"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Grep"</span><span class="p">,</span><span class="w">  </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"...ast-grep advisory..."</span><span class="w"> </span><span class="p">}]</span><span class="w"> </span><span class="p">},</span><span class="w">
      </span><span class="p">{</span><span class="w"> </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w">            </span><span class="p">[{</span><span class="w"> </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bd-terminology-guard.sh"</span><span class="w"> </span><span class="p">}]</span><span class="w"> </span><span class="p">},</span><span class="w">
      </span><span class="p">{</span><span class="w"> </span><span class="nl">"matcher"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Edit"</span><span class="p">,</span><span class="w">  </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"lint-check.sh"</span><span class="w"> </span><span class="p">}]</span><span class="w"> </span><span class="p">},</span><span class="w">
      </span><span class="p">{</span><span class="w"> </span><span class="nl">"matcher"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Write"</span><span class="p">,</span><span class="w"> </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"lint-check.sh"</span><span class="w"> </span><span class="p">}]</span><span class="w"> </span><span class="p">}</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"UserPromptSubmit"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user-prompt-submit.sh"</span><span class="w"> </span><span class="p">}]</span><span class="w"> </span><span class="p">}]</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h3 id="userpromptsubmit-contextual-invariant-injection">UserPromptSubmit: Contextual Invariant Injection</h3>

<p>Every time a message is submitted, <a href="https://github.com/avishek-sen-gupta/context-injector/blob/main/hooks/user-prompt-submit.sh"><code class="language-plaintext highlighter-rouge">user-prompt-submit.sh</code></a> classifies the prompt using keyword matching and injects the relevant invariant files into the conversation context. The classification is purely lexical (no LLM involved):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Keyword families → context categories</span>
<span class="k">if </span><span class="nb">grep</span> <span class="nt">-qiEw</span> <span class="s1">'implement|add|build|create|fix|feature|...'</span><span class="p">;</span> <span class="k">then
  </span><span class="nv">DESIGN</span><span class="o">=</span>1<span class="p">;</span> <span class="nv">TESTING</span><span class="o">=</span>1<span class="p">;</span> <span class="nv">REFACTORING</span><span class="o">=</span>1<span class="p">;</span> <span class="nv">SKILLS</span><span class="o">=</span>1
<span class="k">fi
if </span><span class="nb">grep</span> <span class="nt">-qiEw</span> <span class="s1">'test|tdd|assert|coverage|...'</span><span class="p">;</span> <span class="k">then
  </span><span class="nv">TESTING</span><span class="o">=</span>1
<span class="k">fi
if </span><span class="nb">grep</span> <span class="nt">-qiEw</span> <span class="s1">'refactor|rename|extract|move|...'</span><span class="p">;</span> <span class="k">then
  </span><span class="nv">DESIGN</span><span class="o">=</span>1<span class="p">;</span> <span class="nv">REFACTORING</span><span class="o">=</span>1<span class="p">;</span> <span class="nv">SKILLS</span><span class="o">=</span>1
<span class="k">fi
if </span><span class="nb">grep</span> <span class="nt">-qiEw</span> <span class="s1">'review|pr|diff|...'</span><span class="p">;</span> <span class="k">then
  </span><span class="nv">REVIEW</span><span class="o">=</span>1
<span class="k">fi</span>
</code></pre></div></div>

<p>The injected invariant files live in <code class="language-plaintext highlighter-rouge">.claude/core/</code> (always injected) and <code class="language-plaintext highlighter-rouge">.claude/conditional/</code> (injected when the corresponding category fires). This means a prompt like “implement COBOL array support” automatically injects design principles, testing patterns, refactoring guidance, and tools/skills context, while a prompt like “what does this function do?” only injects core context.</p>

<p><strong>What gets injected.</strong> The invariant files are split into two directories:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">.claude/core/</code>: four files injected on every prompt regardless of classification: <code class="language-plaintext highlighter-rouge">project-context.md</code> (language, toolchain, module map), <code class="language-plaintext highlighter-rouge">workflow.md</code> (interaction style, commit protocol), <code class="language-plaintext highlighter-rouge">implementation.md</code> (coding conventions, guard rules), <code class="language-plaintext highlighter-rouge">tools-search.md</code> (when to use ast-grep vs grep vs the graph MCP).</li>
  <li><code class="language-plaintext highlighter-rouge">.claude/conditional/</code>: five files injected only when the corresponding category fires: <code class="language-plaintext highlighter-rouge">design-principles.md</code> (DESIGN), <code class="language-plaintext highlighter-rouge">testing-patterns.md</code> (TESTING), <code class="language-plaintext highlighter-rouge">refactoring.md</code> (REFACTORING), <code class="language-plaintext highlighter-rouge">tools-skills.md</code> (SKILLS), <code class="language-plaintext highlighter-rouge">code-review.md</code> (REVIEW).</li>
</ul>

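<p>A prompt classified as DESIGN+TESTING+REFACTORING+SKILLS (e.g. “implement COBOL array support”) injects eight files: the four core files plus the four matching conditional files, leaving out only <code class="language-plaintext highlighter-rouge">code-review.md</code>. A prompt classified as none of the above (e.g. “what does this function do?”) injects only the four core files.</p>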
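<p>The classification and file selection can be sketched in a few lines of Python. This is an illustrative re-expression of the shell logic, not the hook itself: the keyword lists contain only the words visible in the excerpt above (the real lists are longer), word-boundary regexes approximate <code class="language-plaintext highlighter-rouge">grep -w</code>, and the function names are hypothetical.</p>

```python
import re

# Only the keywords visible in the excerpt; the real lists are longer.
KEYWORD_FAMILIES = [
    (("implement", "add", "build", "create", "fix", "feature"),
     {"DESIGN", "TESTING", "REFACTORING", "SKILLS"}),
    (("test", "tdd", "assert", "coverage"), {"TESTING"}),
    (("refactor", "rename", "extract", "move"), {"DESIGN", "REFACTORING", "SKILLS"}),
    (("review", "pr", "diff"), {"REVIEW"}),
]

CORE_FILES = ["project-context.md", "workflow.md", "implementation.md", "tools-search.md"]
CONDITIONAL_FILES = {
    "DESIGN": "design-principles.md",
    "TESTING": "testing-patterns.md",
    "REFACTORING": "refactoring.md",
    "SKILLS": "tools-skills.md",
    "REVIEW": "code-review.md",
}


def classify(prompt):
    """Return the set of categories whose keyword family matches the prompt."""
    categories = set()
    for keywords, fired in KEYWORD_FAMILIES:
        # Whole-word, case-insensitive match, approximating `grep -qiEw`.
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            categories |= fired
    return categories


def files_to_inject(prompt):
    """Core files always; conditional files only for categories that fired."""
    fired = classify(prompt)
    conditional = [name for cat, name in CONDITIONAL_FILES.items() if cat in fired]
    return ([f".claude/core/{name}" for name in CORE_FILES]
            + [f".claude/conditional/{name}" for name in conditional])
```

<p>Because the match is purely lexical, a prompt that merely mentions a keyword in passing fires the category just as an actual implementation request would. That is the trade-off for keeping the hook fast and free of any LLM call.</p>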

<p><strong>The <code class="language-plaintext highlighter-rouge">/ctx</code> slash command</strong> controls the injection gate. It is implemented as a Claude Code user command at <code class="language-plaintext highlighter-rouge">~/.claude/commands/ctx.md</code>, which delegates to the <code class="language-plaintext highlighter-rouge">ctx</code> binary at <code class="language-plaintext highlighter-rouge">~/.claude/plugins/context-injector/bin/ctx</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/ctx          <span class="c"># toggle on ↔ off</span>
/ctx on       <span class="c"># enable injection</span>
/ctx off      <span class="c"># disable injection</span>
/ctx status   <span class="c"># print {"active": true|false}</span>
</code></pre></div></div>

<pre><code class="language-mermaid">%%{init: {'theme': 'base', 'themeVariables': {'edgeLabelBackground': '#ffffff'}, 'flowchart': {'curve': 'basis'}}}%%
flowchart TD
    CTX["🔒 /tmp/ctx-locks/&amp;ltmd5&amp;gt\n&lt;i&gt;ctx toggle&lt;/i&gt;"]:::lock

    subgraph hooks["Claude Code Hooks"]
        UPS["📨 UserPromptSubmit\n&lt;i&gt;user-prompt-submit.sh&lt;/i&gt;"]:::hook
        PTU["🔧 PreToolUse\n&lt;i&gt;pre-tool-use.sh&lt;/i&gt;"]:::hook
        SS["🚀 SessionStart\n&lt;i&gt;session-start.sh&lt;/i&gt;"]:::hook
    end

    subgraph content["Project Context"]
        CORE["📁 .claude/core/*.md\n&lt;i&gt;always injected&lt;/i&gt;"]:::core
        COND["📂 .claude/conditional/*.md\n&lt;i&gt;keyword-matched&lt;/i&gt;"]:::cond
    end

    CTX --&gt;|"ctx on?"| UPS
    CTX --&gt;|"ctx on?"| PTU
    CTX --&gt;|"ctx on?"| SS

    UPS --&gt;|"inject"| CORE
    UPS --&gt;|"classify keywords → inject"| COND
    PTU --&gt;|"code-review agent? → inject"| COND
    SS --&gt;|"inject once at start"| CORE

    classDef hook fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef lock fill:#fef9c3,stroke:#ca8a04,color:#713f12
    classDef core fill:#dcfce7,stroke:#16a34a,color:#14532d
    classDef cond fill:#fce7f3,stroke:#db2777,color:#831843
</code></pre>

<p>The state is a per-project lockfile at <code class="language-plaintext highlighter-rouge">/tmp/ctx-locks/&lt;project-hash&gt;</code>. The hash is derived from <code class="language-plaintext highlighter-rouge">$PWD</code>, so each project has an independent toggle; switching it off in one repo doesn’t affect another. The hook <a href="https://github.com/avishek-sen-gupta/context-injector/blob/main/hooks/user-prompt-submit.sh"><code class="language-plaintext highlighter-rouge">user-prompt-submit.sh</code></a> checks for the lockfile at the start of every run; if absent, it exits immediately without injecting anything.</p>
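<p>A minimal sketch of this per-project lockfile scheme follows. It is not the <code class="language-plaintext highlighter-rouge">ctx</code> binary’s code: the exact string that gets hashed and the use of MD5 over the directory path are assumptions, and the function names are hypothetical. The lock root is a parameter so the sketch can be exercised outside <code class="language-plaintext highlighter-rouge">/tmp/ctx-locks</code>.</p>

```python
import hashlib
from pathlib import Path

DEFAULT_LOCK_ROOT = "/tmp/ctx-locks"


def lock_path(project_dir, lock_root=DEFAULT_LOCK_ROOT):
    """Map a project directory to its lockfile (assumed: MD5 of the path string)."""
    digest = hashlib.md5(str(project_dir).encode("utf-8")).hexdigest()
    return Path(lock_root) / digest


def set_injection(project_dir, active, lock_root=DEFAULT_LOCK_ROOT):
    """Create the lockfile to enable injection, remove it to disable."""
    path = lock_path(project_dir, lock_root)
    if active:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    else:
        path.unlink(missing_ok=True)


def injection_active(project_dir, lock_root=DEFAULT_LOCK_ROOT):
    """The hook's first check: no lockfile means exit without injecting."""
    return lock_path(project_dir, lock_root).exists()
```

<p>Keying the state on the project path keeps the toggle independent per repository, and storing it under <code class="language-plaintext highlighter-rouge">/tmp</code> means it does not survive a reboot on systems that clear that directory.</p>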

<p>The injection mechanism works through Claude Code’s <code class="language-plaintext highlighter-rouge">UserPromptSubmit</code> hook return value: the hook writes its output to stdout, and Claude Code prepends that output to the user’s message before it reaches the model. From the model’s perspective, the invariants appear at the top of the incoming message, not as a separate system prompt. This means the injected content is visible in the conversation and can be referenced explicitly.</p>

<p><code class="language-plaintext highlighter-rouge">/ctx off</code> is useful when you want a raw conversation without the invariant overhead: exploring an unfamiliar part of the codebase, doing a quick calculation, or asking a question where the injected design principles would be noise. <code class="language-plaintext highlighter-rouge">/ctx on</code> (or toggling back) restores normal operation; no session restart required.</p>

<hr />

<h3 id="pretooluse-the-gate-layer">PreToolUse: The Gate Layer</h3>

<p>Three hooks run before tool calls:</p>

<p><strong>1. ast-grep advisory (on <code class="language-plaintext highlighter-rouge">Grep</code>):</strong> When the agent is about to run a Grep, an advisory fires reminding it to prefer <code class="language-plaintext highlighter-rouge">ast-grep</code> for structural patterns (constructor shapes, multi-line call sites) and reserve <code class="language-plaintext highlighter-rouge">Grep</code> for simple keyword/import/constant searches.</p>

<p><strong>2. Terminology guard (all <code class="language-plaintext highlighter-rouge">Bash</code> tool calls with <code class="language-plaintext highlighter-rouge">bd</code> write commands):</strong> The <a href="https://github.com/avishek-sen-gupta/context-injector/blob/main/hooks/bd-terminology-guard.sh"><code class="language-plaintext highlighter-rouge">bd-terminology-guard.sh</code></a> hook intercepts any <code class="language-plaintext highlighter-rouge">bd create</code>, <code class="language-plaintext highlighter-rouge">bd update</code>, <code class="language-plaintext highlighter-rouge">bd note</code>, or similar Beads write commands and checks them against the blocklist. If a sensitive term appears in the issue text, the command is blocked (exit 2) before it reaches Beads.</p>

<p><strong>3. Python FP lint gate (on <code class="language-plaintext highlighter-rouge">Edit</code> and <code class="language-plaintext highlighter-rouge">Write</code>):</strong> The <a href="https://github.com/avishek-sen-gupta/context-injector/blob/main/hooks/lint-check.sh"><code class="language-plaintext highlighter-rouge">lint-check.sh</code></a> hook receives the full tool event JSON on stdin (including the file path and the new/modified content), runs <code class="language-plaintext highlighter-rouge">python_fp_lint hook-check</code>, and blocks the edit if violations appear in the modified line range. Critically, this operates on the <strong>diff range</strong>: it only flags violations introduced or modified by the current edit, <strong>not pre-existing violations in unchanged lines</strong>. This prevents the linter from blocking edits in files that already had violations before you touched them.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Exit 0 = allow, Exit 2 = block</span>
<span class="nb">cat</span> | <span class="nv">$LINT_CMD</span> hook-check
</code></pre></div></div>

<p>The gate is toggled with <code class="language-plaintext highlighter-rouge">/lint on</code> and <code class="language-plaintext highlighter-rouge">/lint off</code>, which create/remove a lockfile at <code class="language-plaintext highlighter-rouge">/tmp/ctx-lint/&lt;project-hash&gt;</code>.</p>
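<p>The diff-range behaviour of the lint gate can be sketched as follows. This is an assumption-laden illustration rather than the <code class="language-plaintext highlighter-rouge">hook-check</code> implementation: it uses <code class="language-plaintext highlighter-rouge">difflib</code> to work out which lines of the new content were inserted or replaced, and takes the violation line numbers as given. The function names are hypothetical.</p>

```python
import difflib


def modified_lines(old, new):
    """Return 1-based line numbers in `new` that were inserted or replaced."""
    matcher = difflib.SequenceMatcher(a=old.splitlines(), b=new.splitlines(), autojunk=False)
    changed = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.update(range(j1 + 1, j2 + 1))
    return changed


def hook_exit_code(old, new, violation_lines):
    """Return 2 (block) if a violation sits on a modified line, else 0 (allow)."""
    return 2 if modified_lines(old, new) & set(violation_lines) else 0
```

<p>With this filter, a file that already contains a violation on an untouched line can still be edited, while a new violation on an added line blocks the edit.</p>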

<hr />

<h2 id="skills-and-workflow-automation">Skills and Workflow Automation</h2>

<p><em>Hooks enforce runtime constraints at the tool level. Skills encode workflow knowledge at the task level: how to approach a problem, what questions to ask, what patterns to follow. They are loaded on demand via Claude Code’s Skill tool.</em></p>

<h3 id="superpowers-core-development-lifecycle-skills">Superpowers: Core Development Lifecycle Skills</h3>

<p>The <a href="https://github.com/anthropics/claude-plugins-official">Superpowers</a> plugin provides a suite of lifecycle skills that cover the full development loop:</p>

<table>
  <thead>
    <tr>
      <th>Skill</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">brainstorming</code></td>
      <td>Structured option exploration before touching code. Presents trade-offs, seeks user input before committing to an approach.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">writing-plans</code></td>
      <td>Produces a detailed, phased implementation plan from a spec before any coding begins.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">executing-plans</code></td>
      <td>Executes a written plan using subagent orchestration, one step at a time.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">test-driven-development</code></td>
      <td>Red/green/refactor loop with explicit constraints on what to write at each phase.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">systematic-debugging</code></td>
      <td>Root cause analysis before patching. Forces diagnosis before solution.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">verification-before-completion</code></td>
      <td>Checklist run before claiming work is complete: tests pass, no regressions, no TODOs left behind.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">requesting-code-review</code></td>
      <td>Invokes specialized review agents before merging.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">receiving-code-review</code></td>
      <td>Structures how to process and respond to review feedback.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dispatching-parallel-agents</code></td>
      <td>When 2+ independent tasks exist, launches them as concurrent subagents.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">subagent-driven-development</code></td>
      <td>Orchestrates independent implementation tasks across a plan in parallel.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">finishing-a-development-branch</code></td>
      <td>Checklist for branch completion: cleanup, docs, tests, commit hygiene.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">writing-skills</code></td>
      <td>For creating or editing skill files themselves.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">using-git-worktrees</code></td>
      <td>Isolates feature work in a temporary worktree to avoid polluting the working directory.</td>
    </tr>
  </tbody>
</table>

<p>The key constraint from <code class="language-plaintext highlighter-rouge">using-superpowers</code> is that skills are loaded <strong>before any response</strong>, even before a clarifying question. The rule: if there is <strong>even a 1% chance a skill applies</strong>, invoke it first. Crucially, the developer does not need to explicitly trigger a skill; the agent is expected to recognise when one applies and invoke it without being asked.</p>

<hr />

<h3 id="project-level-skills">Project-Level Skills</h3>

<p>RedDragon has six project-specific skills in <code class="language-plaintext highlighter-rouge">.claude/skills/</code>:</p>

<p><strong><code class="language-plaintext highlighter-rouge">tdd</code></strong>: The project’s TDD skill extends the Superpowers TDD skill with RedDragon-specific conventions: where integration tests go (<code class="language-plaintext highlighter-rouge">tests/integration/</code>), what a correct <code class="language-plaintext highlighter-rouge">@covers</code> annotation looks like, how to use <code class="language-plaintext highlighter-rouge">run()</code> to exercise the VM rather than unit-testing the lowering step in isolation.</p>

<p><strong><code class="language-plaintext highlighter-rouge">audit-asserts</code></strong>: Periodic audit of test name vs. assertion mismatches. Batches test files across parallel agents (batches of ~20), collects P0/P1/P2 violations, deduplicates against existing Beads issues, and files new ones. The most recent audit (#16 as of this writing) covered 331 files and found 0 P0 violations.</p>

<p><strong><code class="language-plaintext highlighter-rouge">grill-me</code></strong>: Interview mode. When invoked, the AI asks probing questions about every aspect of a plan or design, walking each branch of the decision tree and providing a recommended answer for each question. Used before committing to significant architectural changes. Useful for surfacing assumptions that haven’t been examined.</p>

<p><strong><code class="language-plaintext highlighter-rouge">migration-planner</code></strong>: Activates during brainstorming when the task is a large-scale type migration (e.g., replacing <code class="language-plaintext highlighter-rouge">str</code> with a domain type across function signatures, dict keys, and data classes). It imposes a scope-first discipline (count before you plan, don’t estimate) and offers three migration strategies (big-bang, parallel-run, incremental) with trade-off analysis.</p>

<p><strong><code class="language-plaintext highlighter-rouge">documentation</code></strong>: Updates all living docs (README, ADRs, IR lowering gaps, frontend design docs, HTML presentations) to reflect current codebase state. Does not touch immutable spec files under <code class="language-plaintext highlighter-rouge">docs/superpowers/specs/</code>.</p>

<p><strong><code class="language-plaintext highlighter-rouge">improve-codebase-architecture</code></strong>: Explores the codebase for deep module opportunities (à la Ousterhout), surfaces integration risk at module seams, and proposes refactors as Beads RFC issues.</p>

<hr />

<h3 id="plugins-context-injector-and-python-fp-lint">Plugins: Context Injector and Python FP Lint</h3>

<p><strong><a href="https://github.com/avishek-sen-gupta/context-injector">Context Injector</a></strong> is the plugin that powers the <code class="language-plaintext highlighter-rouge">UserPromptSubmit</code> hook described above. Toggle: <code class="language-plaintext highlighter-rouge">/ctx on</code> / <code class="language-plaintext highlighter-rouge">/ctx off</code>.</p>

<p><strong><a href="https://github.com/avishek-sen-gupta/python-fp-lint">Python FP Lint</a></strong> enforces functional programming conventions at edit time. The PreToolUse gate blocks edits that introduce FP violations into the modified lines. Toggle: <code class="language-plaintext highlighter-rouge">/lint on</code> / <code class="language-plaintext highlighter-rouge">/lint off</code>. The same linter runs in the pre-commit hook as a non-blocking warning pass over the full <code class="language-plaintext highlighter-rouge">interpreter/</code> directory.</p>

<p><strong>Additional plugins and skill sources used in this project:</strong></p>

<ul>
  <li><a href="https://github.com/tirth8205/code-review-graph">code-review-graph</a>: builds a Tree-sitter AST graph of the codebase for token-efficient, impact-aware code review via MCP</li>
  <li><a href="https://github.com/thedotmack/claude-mem">claude-mem</a>: persistent cross-session memory with timeline, search, and smart-outline MCP tools</li>
  <li><a href="https://github.com/ast-grep/agent-skill">ast-grep agent-skill</a>: structural code search skill; the CLAUDE.md mandate to prefer <code class="language-plaintext highlighter-rouge">ast-grep</code> over plain <code class="language-plaintext highlighter-rouge">grep</code> is enforced by a PreToolUse advisory wired to every <code class="language-plaintext highlighter-rouge">Grep</code> call</li>
</ul>

<hr />

<h2 id="the-full-picture">The Full Picture</h2>

<pre><code class="language-mermaid">flowchart LR
    classDef skill    fill:#6d28d9,stroke:#4c1d95,color:#fff
    classDef hook     fill:#1d4ed8,stroke:#1e3a8a,color:#fff
    classDef blocker  fill:#b91c1c,stroke:#7f1d1d,color:#fff
    classDef autofix  fill:#0f766e,stroke:#134e4a,color:#fff
    classDef advisory fill:#b45309,stroke:#78350f,color:#fff
    classDef coverage fill:#374151,stroke:#111827,color:#d1d5db
    classDef hdr      fill:#0f172a,stroke:#0f172a,color:#94a3b8

    subgraph L1["LAYER 1  ·  DESIGN TIME  ·  skills, invoked on demand"]
        direction TB
        D0["SKILL · on-demand"]:::hdr
        D1["brainstorming · structured option exploration"]:::skill
        D2["writing-plans · phased implementation plan"]:::skill
        D3["tdd · red / green / refactor loop"]:::skill
        D4["grill-me · stress-test designs before coding"]:::skill
        D5["audit-asserts · test name vs. assertion audit"]:::skill
        D6["migration-planner · scope before estimate"]:::skill
        D0 --- D1 --- D2 --- D3 --- D4 --- D5 --- D6
    end

    subgraph L2["LAYER 2  ·  EDIT TIME  ·  Claude Code hooks"]
        direction TB
        H0["HOOK · every tool call"]:::hdr
        H2["UserPromptSubmit · keyword classify + invariant inject  ·  toggleable"]:::hook
        H3["PreToolUse / Grep · prefer ast-grep advisory  ·  warn"]:::advisory
        H4["PreToolUse / Bash bd · terminology guard  ·  BLOCKING"]:::blocker
        H5["PreToolUse / Edit+Write · FP lint gate, diff-range  ·  BLOCKING"]:::blocker
        H0 --- H2 --- H3 --- H4 --- H5
    end

    subgraph L3["LAYER 3  ·  COMMIT TIME  ·  pre-commit hook, sequential"]
        direction TB
        C0["GATE · git commit"]:::hdr
        C1["Talisman · secret detection  ·  BLOCKING"]:::blocker
        C2["Terminology guard · staged diff  ·  BLOCKING"]:::blocker
        C3["Black · auto-format + re-stage  ·  AUTO-FIX"]:::autofix
        C4["python-fp-lint · FP warnings  ·  non-blocking"]:::advisory
        C5["import-linter · 5 architectural contracts  ·  BLOCKING"]:::blocker
        C6["pytest · 13k+ tests, fail-fast  ·  BLOCKING"]:::blocker
        C7["Beads backup  ·  BLOCKING"]:::blocker
        C0 --- C1 --- C2 --- C3 --- C4 --- C5 --- C6 --- C7
    end

    FC["@covers decorator + audit script\n948 features  ·  16 languages  ·  205 uncovered\ncontinuous / run anytime"]:::coverage

    L1 ==&gt; L2 ==&gt; L3
    FC -.-&gt; L3
</code></pre>

<p>The harness operates at three layers, with feature coverage sitting orthogonal to them as a continuous property rather than a gate:</p>

<p><strong>Layer 1, design time (skills):</strong> <code class="language-plaintext highlighter-rouge">brainstorming</code> structures option exploration before coding begins; <code class="language-plaintext highlighter-rouge">tdd</code> enforces red/green/refactor discipline; <code class="language-plaintext highlighter-rouge">grill-me</code> examines designs before committing to them; <code class="language-plaintext highlighter-rouge">audit-asserts</code> audits test name vs. assertion consistency; <code class="language-plaintext highlighter-rouge">migration-planner</code> scopes large changes before estimating.</p>

<p><strong>Layer 2, edit time (Claude Code hooks):</strong> Every prompt gets relevant invariants injected (<code class="language-plaintext highlighter-rouge">UserPromptSubmit</code>). Every edit to <code class="language-plaintext highlighter-rouge">interpreter/</code> is checked for FP violations in the modified line range (<code class="language-plaintext highlighter-rouge">PreToolUse</code>). Every Beads write is screened for sensitive terms (<code class="language-plaintext highlighter-rouge">PreToolUse</code>).</p>

<p><strong>Layer 3, commit time (pre-commit pipeline):</strong> Secret detection, terminology, formatting, FP warnings, architectural contracts, the full test suite, and issue tracker backup run in sequence. Most are blocking; a violation stops the commit.</p>

<p>The three layers are designed so that violations are caught close to where they are introduced: design problems during brainstorming, bad edits before they are written to disk, and bad commits before they enter the history.</p>
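<p>As a rough illustration of the edit-time layer, here is a minimal sketch of the decision logic behind a blocking terminology guard for Beads writes. It assumes the Claude Code convention that a <code class="language-plaintext highlighter-rouge">PreToolUse</code> hook receives the tool call as JSON (with <code class="language-plaintext highlighter-rouge">tool_name</code> and <code class="language-plaintext highlighter-rouge">tool_input</code>) and blocks the call by exiting with code 2. The blocklist terms, the <code class="language-plaintext highlighter-rouge">decide</code> function, and the sample command are hypothetical, and the sketch is written in Python for readability; the guard described in this post is a shell script over a flat text file.</p>

```python
# Hypothetical blocklist; the harness in this post keeps its terms in a flat text file.
BLOCKED_TERMS = frozenset({"acme-internal", "project-nightjar"})


def decide(payload: dict, blocked_terms: frozenset = BLOCKED_TERMS) -> tuple[int, str]:
    """Return (exit_code, message) for one PreToolUse event.

    Exit code 2 asks Claude Code to block the tool call; 0 lets it through.
    """
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    # Only screen Beads invocations ("bd ..."); other shell commands pass.
    if not command.strip().startswith("bd "):
        return 0, ""
    lowered = command.lower()
    hits = sorted(term for term in blocked_terms if term in lowered)
    if hits:
        return 2, "blocked terms in Beads write: " + ", ".join(hits)
    return 0, ""


# A real hook would read the event with json.load(sys.stdin), print the
# message to stderr, and exit with the returned code.
event = {"tool_name": "Bash", "tool_input": {"command": "bd create 'Fix acme-internal parser'"}}
print(decide(event))
```

<p>Keeping the decision in a pure function means the guard itself can be unit-tested like any other code, which fits the post's preference for deterministic checks over prompt instructions.</p>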

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>The goal behind this harness is <strong>reliability, discoverability, and rigour</strong> in an AI-assisted development context. Each of those properties requires a different kind of investment.</p>

<p><strong>Reliability / Repeatability / Reproducibility</strong> means that a constraint either holds or it does not; it does not depend on whether the AI had the right context, or whether the right instruction was in scope, or whether the model happened to follow it this session. Every gate in the pre-commit pipeline, every PreToolUse hook, every import-linter contract, and every <code class="language-plaintext highlighter-rouge">@covers</code> annotation is a <em>deterministic check</em>. It runs the same way every time and produces the same result for the same input. Asking the AI to “remember not to import across module boundaries” or “make sure you annotate every new test with <code class="language-plaintext highlighter-rouge">@covers</code>” is <em>a prompt instruction, not a constraint</em>. <strong>Prompt instructions degrade across context boundaries, get overridden by other instructions, and fail silently.</strong> A blocking git hook does not forget.</p>

<p><strong>Discoverability</strong> requires that the codebase itself be <em>automation-friendly</em>:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">@covers</code> decorator exists so that coverage auditing can be done statically, without executing tests or parsing output.</li>
  <li>The feature enum string values exist so that audit reports are self-describing without a separate documentation file.</li>
  <li>The <code class="language-plaintext highlighter-rouge">VAR_DEFINITION_OPCODES</code> frozenset in <code class="language-plaintext highlighter-rouge">ir.py</code> exists so that dataflow analysis can identify definition sites without pattern-matching over opcode names scattered across the codebase.</li>
</ul>

<p>These are small structural choices that make the codebase <em>queryable by scripts and agents without interpretation</em>. <strong>A codebase that lacks these hooks forces automation to guess; a codebase that provides them allows automation to be exact.</strong></p>
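<p>To make the "queryable without interpretation" point concrete, the following is a minimal sketch of a static coverage audit. It assumes that <code class="language-plaintext highlighter-rouge">@covers</code> takes feature enum members as positional arguments; the <code class="language-plaintext highlighter-rouge">PythonFeature</code> members, the test names, and both helper functions are hypothetical and are not RedDragon's actual audit script. The test source is only parsed, never executed.</p>

```python
import ast
from enum import Enum


class PythonFeature(Enum):
    # Hypothetical members; string values keep audit reports self-describing.
    LIST_COMPREHENSION = "list comprehension"
    WALRUS = "assignment expression (walrus)"
    MATCH_STATEMENT = "match statement"


# Stand-in for a test file; it is parsed as text, so covers() need not exist here.
TEST_SOURCE = '''
@covers(PythonFeature.LIST_COMPREHENSION)
def test_list_comprehension_lowers_to_loop():
    ...

@covers(PythonFeature.WALRUS)
def test_walrus_binds_and_returns():
    ...
'''


def covered_feature_names(source: str) -> set[str]:
    """Collect enum member names used in @covers(...) without running tests."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        for deco in node.decorator_list:
            is_covers = (
                isinstance(deco, ast.Call)
                and isinstance(deco.func, ast.Name)
                and deco.func.id == "covers"
            )
            if not is_covers:
                continue
            for arg in deco.args:
                # Matches attribute access such as PythonFeature.WALRUS.
                if isinstance(arg, ast.Attribute):
                    names.add(arg.attr)
    return names


def uncovered(source: str) -> list[str]:
    """Return the human-readable values of features that no test claims."""
    covered = covered_feature_names(source)
    return sorted(f.value for f in PythonFeature if f.name not in covered)


print(uncovered(TEST_SOURCE))  # prints ['match statement']
```

<p>Because the report is built from the enum's string values, the output names the missing feature directly, with no separate lookup table.</p>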

<p><strong>Rigour</strong> comes from a specific division of labour: <em>build constraints using AI, but do not rely on the AI to impose them</em>. The terminology blocklist was written with AI assistance; the blocklist itself is a flat text file checked by a shell script. The import-linter contracts were designed with AI help; the contracts run as a deterministic linter step. The skills encode workflow knowledge refined through conversation; the skills themselves are Markdown files loaded and followed mechanically. In each case, the AI did the authoring work, but <em>the resulting artifact is a static, executable constraint that runs without the AI</em>. <strong>The non-deterministic layer is reserved for work that genuinely requires judgment:</strong> exploring design trade-offs, writing implementation code, interpreting test failures, deciding what to file as an issue.</p>

<p><strong>Nested feedback loops.</strong> The harness is structured as a set of loops, each operating at a different granularity and cost:</p>

<ul>
  <li><strong>Design time (skills):</strong> fire before a line of code is written; a skill invocation costs nothing.</li>
  <li><strong>Edit time (Claude Code hooks):</strong> fire before a change is saved; a PreToolUse block costs one rejected edit.</li>
  <li><strong>Commit time (pre-commit pipeline):</strong> fires before a change enters history; a failed commit costs a full test run.</li>
  <li><strong>Continuous (coverage audit):</strong> runs orthogonal to all three; a gap costs a backlog issue.</li>
</ul>

<p>Each loop catches a different class of problem. The loops are deliberately ordered so that the cheapest checks run first. <strong>Problems caught in an inner loop are cheaper to fix than problems that escape to an outer one.</strong> The design goal is not to make any single gate comprehensive, but to ensure that each class of violation has a loop it cannot escape.</p>

<p>This is not a finished design. Some gates are still non-blocking warnings that will become blockers as violation counts drop. Some constraints that currently exist only as prompt instructions will eventually become code. The direction is consistent: <em>push enforcement into deterministic mechanisms wherever possible</em>, and make what remains explicit and auditable.</p>]]></content><author><name>avishek</name></author><category term="Software Engineering" /><category term="AI-Assisted Development" /><category term="Testing" /><category term="Developer Tooling" /><category term="Workflow" /><category term="Python" /><category term="RedDragon" /><summary type="html"><![CDATA[RedDragon is a multi-language interpreter/compiler that lowers 16 languages (Python, Java, Go, TypeScript, Rust, C, C++, C#, JavaScript, Kotlin, Lua, PHP, Ruby, Swift, Scala, COBOL) down to a shared IR. With 13,000+ tests across 16 frontends, keeping quality high and coverage visible requires a harness that operates at several layers simultaneously. 
This post documents that harness in full: how language feature coverage is tracked through feature enums, how the pre-commit pipeline is structured, and how Claude Code’s hook system is wired to enforce workflow discipline at edit time.]]></summary></entry><entry><title type="html">Engineering Log: Testing agentic coding workflows via Large Refactorings</title><link href="https://avishek.net/2026/03/24/testing-agentic-development-systems-gsd-v2.html" rel="alternate" type="text/html" title="Engineering Log: Testing agentic coding workflows via Large Refactorings" /><published>2026-03-24T00:00:00+05:30</published><updated>2026-03-24T00:00:00+05:30</updated><id>https://avishek.net/2026/03/24/testing-agentic-development-systems-gsd-v2</id><content type="html" xml:base="https://avishek.net/2026/03/24/testing-agentic-development-systems-gsd-v2.html"><![CDATA[<p><em>Two agentic development frameworks applied to the same multi-layer type migration across a 13,000-test compiler pipeline. The first provided process discipline but stalled partway through. The second completed the work because it planned more thoroughly and stopped to ask.</em></p>

<p><em>The code is at <a href="https://github.com/avishek-sen-gupta/red-dragon">avishek-sen-gupta/red-dragon</a>. RedDragon’s design is described in <a href="/2026/03/01/designing-reddragon-multi-language-code-analysis.html">Designing RedDragon</a>. The completed elimination plan is at <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/design/eliminate-irinstruction-plan.md">the design document</a>. Previous posts in this series: <a href="/2026/03/12/experiences-building-with-coding-assistant.html">Building Non-Trivial Systems with an AI Coding Assistant</a>, <a href="/2026/03/13/anatomy-of-a-refactoring-using-ai.html">Anatomy of a Refactoring Using AI</a>.</em></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ul>
  <li><a href="#context">Context</a></li>
  <li><a href="#what-gsd-2-is">What GSD 2 Is</a></li>
  <li><a href="#what-i-adopted">What I Adopted</a>
    <ul>
      <li><a href="#1-enforced-phases-per-unit-of-work">1. Enforced Phases Per Unit of Work</a></li>
      <li><a href="#2-complexity-classification">2. Complexity Classification</a></li>
      <li><a href="#3-verification-gate">3. Verification Gate</a></li>
      <li><a href="#4-fresh-context-for-heavy-tasks">4. Fresh Context for Heavy Tasks</a></li>
      <li><a href="#5-state-on-disk">5. State on Disk</a></li>
    </ul>
  </li>
  <li><a href="#teething-issues">Teething Issues</a>
    <ul>
      <li><a href="#the-patch-on-patch-problem">The Patch-on-Patch Problem</a></li>
      <li><a href="#brainstorm-wasnt-enforced">Brainstorm Wasn’t Enforced</a></li>
      <li><a href="#retroactive-issue-filing">Retroactive Issue Filing</a></li>
      <li><a href="#weak-integration-test-assertions">Weak Integration Test Assertions</a></li>
      <li><a href="#the-theoretical-bug">The Theoretical Bug</a></li>
    </ul>
  </li>
  <li><a href="#the-reorganised-claudemd">The Reorganised CLAUDE.md</a></li>
  <li><a href="#the-test-replacing-a-stringly-typed-ir-with-domain-types">The Test: Replacing a Stringly-Typed IR with Domain Types</a>
    <ul>
      <li><a href="#the-problem-strings-all-the-way-down">The Problem: Strings All the Way Down</a></li>
    </ul>
  </li>
  <li><a href="#chapter-1-codelabel-the-first-domino">Chapter 1: CodeLabel, The First Domino</a>
    <ul>
      <li><a href="#the-attempted-cascade">The Attempted Cascade</a></li>
      <li><a href="#the-revert-and-the-lesson">The Revert and the Lesson</a></li>
      <li><a href="#what-actually-shipped">What Actually Shipped</a></li>
      <li><a href="#the-coercion-validator-mistake">The Coercion Validator Mistake</a></li>
      <li><a href="#the-refactoring-principles-that-crystallised">The Refactoring Principles That Crystallised</a></li>
    </ul>
  </li>
  <li><a href="#chapter-2-register-the-pydantic-problem">Chapter 2: Register, The Pydantic Problem</a>
    <ul>
      <li><a href="#the-boolean-truthiness-problem">The Boolean Truthiness Problem</a></li>
      <li><a href="#the-branch_if-comma-convention">The BRANCH_IF Comma Convention</a></li>
    </ul>
  </li>
  <li><a href="#chapter-3-typed-instructions">Chapter 3: Typed Instructions</a>
    <ul>
      <li><a href="#the-decision-one-class-per-opcode">The Decision: One Class Per Opcode</a></li>
      <li><a href="#the-original-four-phase-plan">The Original Four-Phase Plan</a></li>
      <li><a href="#the-plan-that-replaced-the-plan">The Plan That Replaced the Plan</a></li>
      <li><a href="#the-missing-length-field">The Missing Length Field</a></li>
      <li><a href="#layer-1-execution">Layer 1 Execution</a></li>
    </ul>
  </li>
  <li><a href="#chapter-4-completing-the-migration-with-superpowers">Chapter 4: Completing the Migration with Superpowers</a>
    <ul>
      <li><a href="#why-gsd-v2-stalled-after-layer-1">Why GSD v2 Stalled After Layer 1</a></li>
      <li><a href="#what-superpowers-did-differently">What Superpowers Did Differently</a></li>
      <li><a href="#the-audit-that-revised-the-plan">The Audit That Revised the Plan</a></li>
      <li><a href="#layers-2-through-5-execution">Layers 2 Through 5: Execution</a></li>
      <li><a href="#layer-5-was-not-cleanup">Layer 5 Was Not Cleanup</a></li>
    </ul>
  </li>
  <li><a href="#learnings">Learnings</a>
    <ul>
      <li><a href="#on-type-hints-in-python">On Type Hints in Python</a></li>
      <li><a href="#on-scope">On Scope</a></li>
      <li><a href="#on-type-migrations">On Type Migrations</a></li>
      <li><a href="#on-plans">On Plans</a></li>
      <li><a href="#on-tooling">On Tooling</a></li>
    </ul>
  </li>
  <li><a href="#takeaways">Takeaways</a></li>
</ul>

<hr />

<h2 id="context">Context</h2>

<p>I’ve been building <a href="https://github.com/avishek-sen-gupta/red-dragon">RedDragon</a>, a multi-language code analysis engine, across 400+ sessions with Claude Code. I <a href="/2026/03/12/experiences-building-with-coding-assistant.html">wrote about that experience</a> earlier: how CLAUDE.md rules evolved reactively, how structured memory (Beads issues, ADRs, gap analyses) solved the session-to-session continuity problem, how TDD changed the quality of AI-generated tests.</p>

<p>What I didn’t have was a <em>systematic</em> workflow framework. My CLAUDE.md was a collection of individually sensible rules, accumulated reactively over months. It worked, but it was disorganised: duplicate rules scattered across sections, no enforced phase ordering, no complexity-aware ceremony.</p>

<p>The entire refactoring started because I turned on <strong>Pyright</strong>. I come from statically typed languages (Java, C#, C++), and writing a 13,000-test Python project without a type checker had been bothering me. When I enabled Pyright with strict settings, the IR layer lit up with errors. <code class="language-plaintext highlighter-rouge">label: str | None</code> caused 83+ narrowing failures at sites that assumed non-None. <code class="language-plaintext highlighter-rouge">operands: list[Any]</code> was a black hole that Pyright couldn’t reason about at all. <code class="language-plaintext highlighter-rouge">result_reg: str</code> with <code class="language-plaintext highlighter-rouge">""</code> as a sentinel meant every presence check was invisible to the type checker. The type errors weren’t bugs in the running code. They were a <em>map of every place where the code was fragile</em>, where a wrong assumption would pass silently.</p>

<p>Fixing those errors properly meant replacing the stringly-typed IR with domain types and per-opcode instruction classes. That became <em>the</em> refactoring.</p>

<p>I’d been using Claude Code with Superpowers for most of the project’s development. I wanted to try something different, both to evaluate <a href="https://github.com/gsd-build/gsd-2">GSD 2</a> as a tool and to see what workflow ideas I could take away from it. Then I stress-tested those ideas against the refactoring: replacing the single flat instruction class with a union of 31 per-opcode typed dataclasses, and migrating registers and labels from raw strings to domain types.</p>

<p>This post covers both: what I adopted from GSD 2 and later from Superpowers, and what happened when those patterns met a 13,000-test type migration.</p>

<hr />

<h2 id="what-gsd-2-is">What GSD 2 Is</h2>

<p>I’d been aware of the original <a href="https://github.com/gsd-build/gsd">Get Shit Done prompt framework</a> for a while, and had started using it lightly through Claude Code, but I’d never fully embraced it. GSD 2 gave me the opportunity to try the framework wholeheartedly. <a href="https://github.com/gsd-build/gsd-2">GSD 2</a> is a standalone CLI built on the Pi SDK that controls a coding agent’s session programmatically. It’s the evolution of the original GSD, from markdown prompts injected into Claude Code to a TypeScript application that manages context windows, dispatches work, tracks cost, detects stuck loops, and recovers from crashes. Importantly, GSD 2 let me keep using my existing infrastructure (my own ADR documents, Beads for issue tracking, and the CLAUDE.md conventions I’d built up over months) while layering its workflow discipline on top.</p>

<p>The <em>workflow patterns</em> I’m describing here (enforced phases, complexity classification, verification gates, fresh-context discipline, state-on-disk) are not GSD-2-specific. They’re engineering discipline that applies to any AI coding assistant: Claude Code directly, Cursor, Windsurf, Codex, or whatever comes next. The patterns are about how you structure work, not which tool runs it.</p>

<p>Here’s what I took from it.</p>

<hr />

<h2 id="what-i-adopted">What I Adopted</h2>

<h3 id="1-enforced-phases-per-unit-of-work">1. Enforced Phases Per Unit of Work</h3>

<p>GSD 2 has a dispatch pipeline: research → plan → implement → verify. Every unit of work goes through these phases in order. The agent doesn’t skip ahead.</p>

<p>My previous CLAUDE.md had “The workflow is Brainstorm → Discuss → Plan → Write tests → Implement → Fix → Commit → Refactor”, but it was a suggestion, not an enforced rule. Multiple times during the multi-file linker work, implementation started before brainstorming was complete. The linker went through five rounds of compensating patches because the initial design was written without understanding the VM’s dispatch model. A proper brainstorm phase (reading the actual VM code, not just designing on paper) would have caught the mismatch on day one.</p>

<p>The new rule: <strong>every non-trivial task goes through brainstorm → plan → test-first → implement → self-review → verify → commit, in that order. Do not skip phases.</strong></p>

<pre><code class="language-mermaid">flowchart LR
    B("🧠 Brainstorm"):::phase1 --&gt; P("📐 Plan"):::phase2 --&gt; T("🧪 Test first"):::test --&gt; I("⚙️ Implement"):::phase3 --&gt; R("🔍 Self-review"):::review --&gt; V("✅ Verify gate"):::gate --&gt; C("📦 Commit"):::done

    classDef phase1 fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef phase2 fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef test fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef phase3 fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef review fill:#fff9c4,stroke:#f9a825,stroke-width:2px,color:#5c4a0a
    classDef gate fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef done fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#1a3a1a
</code></pre>

<p>I also added a self-review step between implement and verify: scan your own diff for workaround guards, weak assertions, mutation in loops, stale docs, and missing tests before running the verification gate.</p>

<h3 id="2-complexity-classification">2. Complexity Classification</h3>

<p>GSD 2 classifies each unit as light, standard, or heavy before dispatching. The classification determines model selection and timeout.</p>

<p>I adapted this for ceremony level:</p>

<ul>
  <li><strong>Light</strong> (&lt; 50 lines, single file): brief brainstorm. Adding a node type to a dispatch table.</li>
  <li><strong>Standard</strong> (50–300 lines, follows existing patterns): brainstorm identifies the pattern being followed.</li>
  <li><strong>Heavy</strong> (300+ lines, new abstractions, multiple subsystems): brainstorm must produce a written design with trade-offs before any code. Break into independently-committable units. Do not attempt in a single pass.</li>
</ul>

<p>The multi-file linker was Heavy (500+ lines, new package, touched VM, registry, API, MCP). It should have been broken into smaller units from the start. Instead it was attempted as one continuous stream, which led to the patch-on-patch problem.</p>

<pre><code class="language-mermaid">flowchart TD
    T("📋 New task"):::start --&gt; C{"Classify&lt;br/&gt;complexity"}:::decide

    C -- "&lt; 50 lines,&lt;br/&gt;single file" --&gt; L("💡 Light&lt;br/&gt;&lt;i&gt;Brief brainstorm&lt;/i&gt;&lt;br/&gt;&lt;i&gt;e.g., add node to dispatch table&lt;/i&gt;"):::light
    C -- "50–300 lines,&lt;br/&gt;2–5 files" --&gt; S("📦 Standard&lt;br/&gt;&lt;i&gt;Identify pattern being followed&lt;/i&gt;&lt;br/&gt;&lt;i&gt;e.g., import extraction for new lang&lt;/i&gt;"):::standard
    C -- "300+ lines,&lt;br/&gt;new abstractions" --&gt; H("🏗️ Heavy&lt;br/&gt;&lt;i&gt;Written design + trade-offs&lt;/i&gt;&lt;br/&gt;&lt;i&gt;Break into committable units&lt;/i&gt;&lt;br/&gt;&lt;i&gt;Re-read code before each phase&lt;/i&gt;"):::heavy

    classDef start fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef decide fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef light fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef standard fill:#fff9c4,stroke:#f9a825,stroke-width:2px,color:#5c4a0a
    classDef heavy fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
</code></pre>
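<p>The classification above is simple enough to write down as a function. This is a minimal sketch under my own assumptions: the <code class="language-plaintext highlighter-rouge">classify</code> signature is hypothetical, and treating a small change that spans several files as Standard is a choice the thresholds above do not spell out.</p>

```python
from enum import Enum


class Complexity(Enum):
    LIGHT = "light"
    STANDARD = "standard"
    HEAVY = "heavy"


def classify(lines_changed: int, files_touched: int, new_abstractions: bool) -> Complexity:
    """Map a rough size estimate to a ceremony level."""
    # New abstractions or 300+ lines always force the heavy path.
    if new_abstractions or lines_changed >= 300:
        return Complexity.HEAVY
    # 50+ lines, or anything spanning several files, is at least standard.
    if lines_changed >= 50 or files_touched > 1:
        return Complexity.STANDARD
    # What remains is small and confined to a single file.
    return Complexity.LIGHT


# The multi-file linker: 500+ lines and a new package.
print(classify(lines_changed=500, files_touched=4, new_abstractions=True))  # prints Complexity.HEAVY
```

<p>The estimate happens before any code exists, so the inputs are guesses; the value of the function is that the guess is made explicitly and the ceremony level follows from it mechanically.</p>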

<h3 id="3-verification-gate">3. Verification Gate</h3>

<p>GSD 2 runs automated verification (lint, test, typecheck) with auto-fix retries before marking a unit complete.</p>

<p>My CLAUDE.md had “run black” and “run tests” as separate bullet points in different sections. The <code class="language-plaintext highlighter-rouge">.importlinter</code> check wasn’t mentioned at all, which is how the CI pipeline broke silently with stale module paths for who knows how long.</p>

<p>The new rule: <strong>three checks in order before every commit: <code class="language-plaintext highlighter-rouge">black</code>, <code class="language-plaintext highlighter-rouge">lint-imports</code>, <code class="language-plaintext highlighter-rouge">pytest</code>. All three must pass.</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry run python <span class="nt">-m</span> black <span class="nb">.</span>         <span class="c"># formatting</span>
poetry run lint-imports               <span class="c"># architectural contracts</span>
poetry run python <span class="nt">-m</span> pytest tests/    <span class="c"># full test suite</span>
</code></pre></div></div>

<p>Having this as a single named concept (“verification gate”) instead of scattered bullets makes it harder to skip.</p>
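<p>One way to make the gate a single named thing in practice is a small runner that executes the three commands in order and stops at the first failure. The sketch below reuses the exact commands shown above; the <code class="language-plaintext highlighter-rouge">run_gate</code> and <code class="language-plaintext highlighter-rouge">run_command</code> names are hypothetical, and the injectable runner exists only so the ordering logic can be exercised without Poetry installed.</p>

```python
import subprocess
from typing import Callable, Optional, Sequence

# The three commands from the gate above, in the order they must pass.
GATE = (
    ("black", ["poetry", "run", "python", "-m", "black", "."]),
    ("lint-imports", ["poetry", "run", "lint-imports"]),
    ("pytest", ["poetry", "run", "python", "-m", "pytest", "tests/"]),
)


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_gate(runner: Callable[[Sequence[str]], int] = run_command) -> Optional[str]:
    """Run each check in order; return the name of the first failure, or None."""
    for name, argv in GATE:
        if runner(argv) != 0:
            return name  # later checks are skipped
    return None


def fake_runner(argv: Sequence[str]) -> int:
    """Stand-in runner that fails only the import-linter step."""
    return 1 if "lint-imports" in argv else 0


print(run_gate(fake_runner))  # prints lint-imports
```

<p>Calling <code class="language-plaintext highlighter-rouge">run_gate()</code> with no arguments would run the real commands; a non-<code class="language-plaintext highlighter-rouge">None</code> result names the check to fix before committing.</p>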

<h3 id="4-fresh-context-for-heavy-tasks">4. Fresh Context for Heavy Tasks</h3>

<p>GSD 2 creates a fresh agent session for every dispatched unit. The LLM starts with a clean context window containing only the pre-inlined artifacts it needs. This prevents quality degradation from context accumulation.</p>

<p>I can’t create fresh sessions mid-conversation. But the underlying insight applies: <strong>design documents can anchor you to a flawed model.</strong> During the linker work, the initial design document described a 7-step linking process with import tables, CALL_FUNCTION rewriting, and entry-module-first ordering. Every one of those assumptions was wrong. But because the design document was in context, the implementation followed it faithfully, then needed five patches to compensate for the mismatches.</p>

<p>The new rule: <strong>for Heavy tasks, re-read the actual code before each phase. Don’t implement from a design document without verifying its assumptions against the code you’re modifying.</strong></p>

<h3 id="5-state-on-disk">5. State on Disk</h3>

<p>GSD 2’s <code class="language-plaintext highlighter-rouge">.gsd/</code> directory is the sole source of truth. No in-memory state survives across sessions. This enables crash recovery and session resumption.</p>

<p>I already had Beads for issue tracking and ADRs for decisions. What I was missing was the discipline of <strong>backup before every commit</strong>. Beads state lives in a Dolt database; if a session dies before backup, the issue state is lost. The <code class="language-plaintext highlighter-rouge">bd backup</code> command exports everything to JSONL files committed to the repo.</p>

<p>The new rule: <strong><code class="language-plaintext highlighter-rouge">bd backup</code> before every commit. Issues filed before work starts. Prefer committed partial results over uncommitted complete attempts.</strong> If a session might end, commit with a <code class="language-plaintext highlighter-rouge">WIP:</code> prefix and file an issue for the remainder.</p>

<pre><code class="language-mermaid">flowchart TB
    S(("🚀 Session start")):::start
    S --&gt; R("📜 CLAUDE.md&lt;br/&gt;&lt;i&gt;Rules: how to behave&lt;/i&gt;"):::rules
    S --&gt; A("📚 ADRs&lt;br/&gt;&lt;i&gt;Why past decisions&lt;br/&gt;were made&lt;/i&gt;"):::adrs
    S --&gt; B("📋 Beads issues&lt;br/&gt;&lt;i&gt;What to work on next&lt;/i&gt;"):::issues
    S --&gt; D("📊 Docs&lt;br/&gt;&lt;i&gt;Living design docs&lt;/i&gt;"):::docs
    R &amp; A &amp; B &amp; D --&gt; O("✅ Agent is oriented"):::done

    O --&gt; W("⚙️ Work"):::work
    W --&gt; BK("💾 bd backup"):::backup
    BK --&gt; CM("📦 git commit + push"):::commit
    CM --&gt;|"more work"| W
    CM --&gt;|"session ending"| END("🏁 Clean state on disk"):::done

    classDef start fill:#e8f4fd,stroke:#4a90d9,stroke-width:3px,color:#1a3a5c
    classDef rules fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef adrs fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef issues fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef docs fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef done fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef work fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef backup fill:#fff9c4,stroke:#f9a825,stroke-width:2px,color:#5c4a0a
    classDef commit fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<hr />

<h2 id="teething-issues">Teething Issues</h2>

<h3 id="the-patch-on-patch-problem">The Patch-on-Patch Problem</h3>

<p>The linker was where this adoption was most tested. The original design promised “zero downstream changes”: namespace labels, merge IR, and the existing VM/CFG/registry would just work. They didn’t.</p>

<p>The VM resolves function calls by looking up variable names in the scope chain, not by label. The VM’s CONST handler converts label strings into <code class="language-plaintext highlighter-rouge">BoundFuncRef</code>/<code class="language-plaintext highlighter-rouge">ClassRef</code> via symbol tables. Constructor dispatch uses the class name from ClassRef to look up methods. None of this is label-based. The design document didn’t account for any of it.</p>

<p>What followed: patch 1 (rewrite import stubs as CONST label bindings), patch 2 (pass symbol tables to ExecutionStrategies), patch 3 (keep bare names in symbol tables, not namespaced), patch 4 (insert BRANCH instructions to chain module entries), patch 5 (drop variable import stubs so they don’t overwrite concrete values).</p>

<p>Five patches. Each one “fixed” the immediate symptom. The underlying problem was that the design was wrong: linking at the label level doesn’t work when the VM dispatches at the scope level.</p>

<pre><code class="language-mermaid">flowchart TD
    D("📄 Design:&lt;br/&gt;namespace labels,&lt;br/&gt;rewrite references"):::design
    D --&gt; P1("🩹 Patch 1:&lt;br/&gt;rewrite import stubs&lt;br/&gt;as CONST bindings"):::patch
    P1 --&gt; P2("🩹 Patch 2:&lt;br/&gt;pass symbol tables&lt;br/&gt;to ExecutionStrategies"):::patch
    P2 --&gt; P3("🩹 Patch 3:&lt;br/&gt;bare names in&lt;br/&gt;symbol tables"):::patch
    P3 --&gt; P4("🩹 Patch 4:&lt;br/&gt;BRANCH chains&lt;br/&gt;between modules"):::patch
    P4 --&gt; P5("🩹 Patch 5:&lt;br/&gt;drop variable&lt;br/&gt;import stubs"):::patch
    P5 --&gt; STOP("🛑 Stop.&lt;br/&gt;The design is wrong."):::stop
    STOP --&gt; RW("♻️ Rewrite:&lt;br/&gt;499 → 311 lines&lt;br/&gt;one clean model"):::rewrite

    classDef design fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef patch fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef stop fill:#ffcdd2,stroke:#b71c1c,stroke-width:3px,color:#5c0a0a
    classDef rewrite fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#1a3a1a
</code></pre>

<p>The fix was a rewrite: strip per-module entry labels, concatenate in dependency order, drop import stubs, emit one shared entry label. 499 lines → 311 lines. One clean model instead of five compensating transforms.</p>
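<p>The shape of that rewrite can be shown on a toy model. This is a sketch under stated assumptions, not RedDragon's linker: instructions are modelled as hypothetical <code class="language-plaintext highlighter-rouge">(opcode, operand)</code> tuples, <code class="language-plaintext highlighter-rouge">IMPORT_STUB</code> and <code class="language-plaintext highlighter-rouge">DEF</code> are made-up opcodes, and dependency ordering is delegated to the standard library's <code class="language-plaintext highlighter-rouge">graphlib.TopologicalSorter</code>.</p>

```python
from graphlib import TopologicalSorter

# Toy instruction encoding (hypothetical): (opcode, operand) tuples.
MODULES = {
    "main": [("LABEL", "entry"), ("IMPORT_STUB", "add"), ("CALL", "add")],
    "util": [("LABEL", "entry"), ("DEF", "add")],
}
# Maps each module to the modules it depends on: main imports from util.
DEPENDS_ON = {"main": {"util"}, "util": set()}


def link(modules: dict, depends_on: dict) -> list:
    """Concatenate modules in dependency order under one shared entry label."""
    linked = [("LABEL", "entry")]  # the single shared entry label
    # static_order() yields dependencies before the modules that need them.
    for name in TopologicalSorter(depends_on).static_order():
        for opcode, operand in modules[name]:
            if opcode == "LABEL" and operand == "entry":
                continue  # strip the per-module entry label
            if opcode == "IMPORT_STUB":
                continue  # the definition already precedes this module
            linked.append((opcode, operand))
    return linked


print(link(MODULES, DEPENDS_ON))
# prints [('LABEL', 'entry'), ('DEF', 'add'), ('CALL', 'add')]
```

<p>Because definitions are emitted before their users, the imported name is already bound in scope by the time the call runs, which matches a VM that dispatches at the scope level rather than by label.</p>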

<p>If I’d had the “stop and consult when patching” rule from the start, the rewrite would have happened after patch 2, not patch 5.</p>

<h3 id="brainstorm-wasnt-enforced">Brainstorm Wasn’t Enforced</h3>

<p>Several times during the session, I had to redirect: “brainstorm with me first”, “don’t go haring off on your own”, “should this have been a separate issue?”</p>

<p>The phase ordering was written down but not enforced by habit. The AI’s default is to start implementing the moment it understands the problem. The brainstorm phase needs to be explicit: present options, discuss trade-offs, take input. Not a brief internal consideration before diving into code.</p>

<p>This led to a new interaction rule: <strong>brainstorm collaboratively. Present options and trade-offs to the user. Do not pick an approach and start implementing without discussion.</strong></p>

<h3 id="retroactive-issue-filing">Retroactive Issue Filing</h3>

<p>The multi-file work started without filing issues. Bugs were discovered and fixed inline: no issue, no claim, no close. Three linker bugs (<code class="language-plaintext highlighter-rouge">red-dragon-khfp</code>, <code class="language-plaintext highlighter-rouge">red-dragon-qyvn</code>, <code class="language-plaintext highlighter-rouge">red-dragon-q00i</code>) were filed retroactively after I noticed the gap.</p>

<p>The fix was simple: make issue filing the first step, not an afterthought. “File an issue before starting work” was already in CLAUDE.md but wasn’t being followed consistently.</p>

<h3 id="weak-integration-test-assertions">Weak Integration Test Assertions</h3>

<p>After shipping multi-file support for all 16 languages, I discovered that 9 of the 15 tree-sitter language integration tests had weak assertions like <code class="language-plaintext highlighter-rouge">assert "add" in result</code> instead of <code class="language-plaintext highlighter-rouge">assert result["answer"] == 30</code>. These tests would pass even if the cross-module call returned a symbolic value instead of a concrete one.</p>

<p>This was a violation of the self-review checklist that I’d just added. The checklist says “weak assertions” is an anti-pattern to scan for. The tests were written before the checklist existed.</p>

<p>Strengthening them exposed two additional bugs: a Pascal frontend gap (<code class="language-plaintext highlighter-rouge">begin...end.</code> block not lowered) and a Pascal function return bug (<code class="language-plaintext highlighter-rouge">Result</code> variable not returned). Both were pre-existing single-file bugs, not linker issues, but the weak multi-file assertions had masked them.</p>

<p>This prompted two new rules. First: <strong>after writing tests, review every assertion for specificity.</strong> Replace <code class="language-plaintext highlighter-rouge">assert x is not None</code> and <code class="language-plaintext highlighter-rouge">assert "name" in result</code> with concrete value assertions like <code class="language-plaintext highlighter-rouge">assert result == 30</code>. If a concrete assertion isn’t possible, document why. Weak assertions that would pass even when the feature is broken are worse than no assertions because they give false confidence.</p>
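<p>The difference is easy to demonstrate. In this minimal sketch the two result dictionaries are hypothetical stand-ins for what a run might bind: one where the cross-module call stayed symbolic, and one where it evaluated to 30.</p>

```python
# Hypothetical final variable bindings from two runs of the same program.
broken_result = {"add": "function add", "answer": "sym_0"}  # call stayed symbolic
working_result = {"add": "function add", "answer": 30}      # call was evaluated


def weak_check(result: dict) -> bool:
    # Passes as long as the name exists, whatever the call returned.
    return "add" in result


def concrete_check(result: dict) -> bool:
    # Passes only when the call produced the expected value.
    return result["answer"] == 30


print(weak_check(broken_result), concrete_check(broken_result))    # prints True False
print(weak_check(working_result), concrete_check(working_result))  # prints True True
```

<p>The weak check cannot tell the two runs apart; only the concrete check fails on the broken one.</p>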

<p>Second: <strong>every <code class="language-plaintext highlighter-rouge">xfail</code> must have a corresponding Beads issue</strong>, and the <code class="language-plaintext highlighter-rouge">xfail</code> reason string must reference the issue ID. An <code class="language-plaintext highlighter-rouge">xfail</code> without a tracked issue is a gap that never gets fixed. It just sits there, green in the test report, invisible in the backlog.</p>

<h3 id="the-theoretical-bug">The Theoretical Bug</h3>

<p>One of the review findings (<code class="language-plaintext highlighter-rouge">red-dragon-lzae</code>) was about a fragile state machine in the import stub dropper. After brainstorming, we determined the bug was theoretical: no frontend currently produces the problematic instruction pattern, and adding “just in case” code would violate the “no speculative code without tests” principle. We closed it as won’t-fix with a documentation note.</p>

<p>It took a brainstorming session to reach that conclusion. The first instinct (mine and the AI’s) was to write a test and fix it. Brainstorming first (“is this actually a bug?”) saved time.</p>

<hr />

<h2 id="the-reorganised-claudemd">The Reorganised CLAUDE.md</h2>

<p>The adoption process surfaced how disorganised my CLAUDE.md had become. “Run black” appeared in three places. “Run tests” appeared in three places. “Brainstorm before work” appeared twice. The verification gate was split across two sections.</p>

<p>I rewrote it from scratch: deduplicated, restructured into a logical reading order (Project Context → Task Tracking → Workflow → Design Principles → Programming Patterns → Testing → Code Review → Implementation → Interaction), and consolidated related guidance.</p>

<p>245 lines → 168 lines. Same rules, less duplication, clearer structure.</p>

<hr />

<h2 id="the-test-replacing-a-stringly-typed-ir-with-domain-types">The Test: Replacing a Stringly-Typed IR with Domain Types</h2>

<p>With the GSD 2 patterns in place, the next major piece of work was replacing the IR’s instruction representation. This was where the patterns would be tested under sustained pressure.</p>

<p>By late March 2026, RedDragon had ~12,900 tests, 121 architectural decision records, 550+ tracked issues, and a problem: the IR was stringly-typed. Every instruction was an <code class="language-plaintext highlighter-rouge">IRInstruction</code> with an <code class="language-plaintext highlighter-rouge">opcode: Opcode</code> enum and <code class="language-plaintext highlighter-rouge">operands: list[Any]</code>. Labels were <code class="language-plaintext highlighter-rouge">str | None</code>. Registers were bare strings like <code class="language-plaintext highlighter-rouge">"%r0"</code>. Field names, variable names, function names, operator symbols, and register references were all <code class="language-plaintext highlighter-rouge">str</code>, distinguished only by position in the operands list.</p>

<h3 id="the-problem-strings-all-the-way-down">The Problem: Strings All the Way Down</h3>

<p>The core instruction type looked like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">IRInstruction</span><span class="p">:</span>
    <span class="n">opcode</span><span class="p">:</span> <span class="n">Opcode</span>
    <span class="n">operands</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Any</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
    <span class="n">result_reg</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">""</span>
    <span class="n">label</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">source_location</span><span class="p">:</span> <span class="n">SourceLocation</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>

<p>A <code class="language-plaintext highlighter-rouge">BINOP</code> instruction had <code class="language-plaintext highlighter-rouge">operands = ["+", "%r1", "%r2"]</code>. The operator is at index 0. The left register is at index 1. The right register is at index 2. Nothing in the type system distinguishes them. A <code class="language-plaintext highlighter-rouge">STORE_FIELD</code> had <code class="language-plaintext highlighter-rouge">operands = ["%r0", "name", "%r1"]</code>: object register, field name, value register. Swap indices 0 and 2 and you get silent data corruption, not a type error.</p>

<p>The <code class="language-plaintext highlighter-rouge">label</code> field was <code class="language-plaintext highlighter-rouge">str | None</code>. 72% of instructions had <code class="language-plaintext highlighter-rouge">label=None</code>, causing 83+ pyright type errors at sites that assumed non-None. The <code class="language-plaintext highlighter-rouge">result_reg</code> field was <code class="language-plaintext highlighter-rouge">str</code>, with the empty string <code class="language-plaintext highlighter-rouge">""</code> as the sentinel for “no result register.” Comparing a register to <code class="language-plaintext highlighter-rouge">""</code> to check presence was scattered across dozens of files.</p>

<p>This stringly-typed design had served the project well through rapid prototyping. 15 frontends were built against it, plus a VM, a CFG builder, a dataflow analysis engine, a type inference system, an interprocedural analysis pipeline, a multi-file linker, and an MCP server. But as pyright and import-linter tightened the quality gates, the strings became the dominant source of type errors, and every new feature required the same positional-indexing boilerplate.</p>

<p>The goal: replace <code class="language-plaintext highlighter-rouge">IRInstruction</code> entirely with a union of per-opcode frozen dataclasses, where every register is a <code class="language-plaintext highlighter-rouge">Register</code> object, every label is a <code class="language-plaintext highlighter-rouge">CodeLabel</code> object, and every field is named.</p>

<p>What actually happened was messier. The migration proceeded in three chapters, each teaching its own lessons.</p>

<hr />

<h2 id="chapter-1-codelabel-the-first-domino">Chapter 1: CodeLabel, The First Domino</h2>

<p><code class="language-plaintext highlighter-rouge">IRInstruction.label</code> was <code class="language-plaintext highlighter-rouge">str | None</code>. The plan was straightforward: introduce a <code class="language-plaintext highlighter-rouge">CodeLabel(value: str)</code> type and a <code class="language-plaintext highlighter-rouge">NoCodeLabel</code> null-object singleton, replace <code class="language-plaintext highlighter-rouge">str | None</code> everywhere, and push the type through adjacents.</p>
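<p>A minimal sketch of that shape, assuming a frozen dataclass and a subclass-based null object. The <code class="language-plaintext highlighter-rouge">is_present()</code> method on labels and the <code class="language-plaintext highlighter-rouge">describe()</code> helper are assumptions for illustration. The real definitions may differ.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeLabel:
    value: str

    def is_present(self) -> bool:
        return True

    def __str__(self) -> str:
        return self.value


@dataclass(frozen=True)
class _NoCodeLabel(CodeLabel):
    """Null object: stands in for the old `None` case."""

    value: str = ""

    def is_present(self) -> bool:
        return False


# Single shared instance, used wherever `label=None` used to appear.
NoCodeLabel = _NoCodeLabel()


def describe(label: CodeLabel) -> str:
    # No `None` branch needed: every label answers the same question.
    return f"{label}:" if label.is_present() else "(unlabelled)"
```

<p>Because the field is always a <code class="language-plaintext highlighter-rouge">CodeLabel</code>, call sites no longer need an <code class="language-plaintext highlighter-rouge">is None</code> guard before using it.</p>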

<h3 id="the-attempted-cascade">The Attempted Cascade</h3>

<p>The first session went well. <code class="language-plaintext highlighter-rouge">CodeLabel</code> and <code class="language-plaintext highlighter-rouge">NoCodeLabel</code> were defined. <code class="language-plaintext highlighter-rouge">IRInstruction.label</code> was updated. 49 files were touched. 12,850 tests passed. The issue was closed.</p>

<p>Then I looked at the adjacents. <code class="language-plaintext highlighter-rouge">BasicBlock.label</code> was <code class="language-plaintext highlighter-rouge">str</code>. <code class="language-plaintext highlighter-rouge">cfg.blocks</code> was <code class="language-plaintext highlighter-rouge">dict[str, BasicBlock]</code>. <code class="language-plaintext highlighter-rouge">StateUpdate.next_label</code> was <code class="language-plaintext highlighter-rouge">str</code>. <code class="language-plaintext highlighter-rouge">label_to_idx</code> used string keys. The <code class="language-plaintext highlighter-rouge">CodeLabel</code> type existed, but 19 sites in the interpreter still accessed <code class="language-plaintext highlighter-rouge">.value</code> directly, breaking the abstraction.</p>

<p>I opened a new issue: <em>“CodeLabel: eliminate .value leaks, propagate CodeLabel into BasicBlock, StateUpdate, cfg.blocks.”</em></p>

<p>And then I tried to do it in one pass.</p>

<p>The cascade touched ~40 files. Changing <code class="language-plaintext highlighter-rouge">BasicBlock.label</code> from <code class="language-plaintext highlighter-rouge">str</code> to <code class="language-plaintext highlighter-rouge">CodeLabel</code> cascaded into <code class="language-plaintext highlighter-rouge">cfg.blocks</code> dict keys, which cascaded into <code class="language-plaintext highlighter-rouge">build_cfg()</code>, which cascaded into <code class="language-plaintext highlighter-rouge">build_registry()</code>, which cascaded into the executor’s block lookup, which cascaded into the linker’s namespace operations, which cascaded into the type inference engine’s function signature lookup. Each change was individually simple. Together they formed a dependency chain that crossed every layer of the architecture.</p>

<h3 id="the-revert-and-the-lesson">The Revert and the Lesson</h3>

<p>I reverted everything. The commit message reads:</p>

<blockquote>
  <p>Attempted CodeLabel propagation into cfg_types/cfg.py. Cascade touches ~40 files. Reverted to clean state, needs a dedicated session.</p>
</blockquote>

<p>The lesson was not “the scope was too large.” The lesson was that I had violated my own complexity classification. This was obviously Heavy work (300+ lines, new abstractions, multiple subsystems) but I’d treated it as Standard because each individual change seemed small. The individual changes <em>were</em> small. The cascade was not.</p>

<h3 id="what-actually-shipped">What Actually Shipped</h3>

<p>The next session took a different approach. Instead of propagating <code class="language-plaintext highlighter-rouge">CodeLabel</code> into all adjacents at once, I identified three independently committable rings:</p>

<ol>
  <li>
    <p><strong>Eliminate <code class="language-plaintext highlighter-rouge">.value</code> leaks in the immediate adjacents.</strong> Push <code class="language-plaintext highlighter-rouge">CodeLabel</code> into <code class="language-plaintext highlighter-rouge">BasicBlock.label</code>, <code class="language-plaintext highlighter-rouge">cfg.blocks</code> keys, <code class="language-plaintext highlighter-rouge">StateUpdate.next_label</code>, and the handler/registry layer. One commit. 12,850 tests.</p>
  </li>
  <li>
    <p><strong>Push wrapping upstream.</strong> <code class="language-plaintext highlighter-rouge">branch_targets()</code> was returning <code class="language-plaintext highlighter-rouge">list[str]</code> and callers were wrapping each element in <code class="language-plaintext highlighter-rouge">CodeLabel(...)</code>. Change <code class="language-plaintext highlighter-rouge">branch_targets()</code> to return <code class="language-plaintext highlighter-rouge">list[CodeLabel]</code> directly. Remove downstream wrapping. One commit.</p>
  </li>
  <li>
    <p><strong>Remove <code class="language-plaintext highlighter-rouge">_coerce_label</code>.</strong> A helper function that auto-converted strings to <code class="language-plaintext highlighter-rouge">CodeLabel</code>. It was masking call sites that should have been explicitly updated. One commit.</p>
  </li>
</ol>
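<p>The second ring, pushing wrapping upstream, can be sketched as follows. The function bodies and the raw-string argument are assumptions for illustration. Only the change in return type from <code class="language-plaintext highlighter-rouge">list[str]</code> to <code class="language-plaintext highlighter-rouge">list[CodeLabel]</code> comes from the actual work.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeLabel:
    value: str


# Before: the producer returns raw strings, so every consumer wraps them.
def branch_targets_raw(raw: str) -> list[str]:
    return raw.split(",")


def successors_before(raw: str) -> list[CodeLabel]:
    return [CodeLabel(target) for target in branch_targets_raw(raw)]


# After: the producer wraps once, at the origin.
def branch_targets(raw: str) -> list[CodeLabel]:
    return [CodeLabel(target) for target in raw.split(",")]


def successors_after(raw: str) -> list[CodeLabel]:
    # Consumers pass the domain type through untouched.
    return branch_targets(raw)
```

<p>Both versions produce the same labels. The difference is that the wrapping now lives in one place instead of being repeated at each call site.</p>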

<p>Each ring was a session. Each ring had its own issue. Each ring went through the verification gate independently. The end state was the same as the one the abandoned single-pass attempt had aimed for, but this time it arrived safely.</p>

<h3 id="the-coercion-validator-mistake">The Coercion Validator Mistake</h3>

<p>During the <code class="language-plaintext highlighter-rouge">CodeLabel</code> work, the AI proposed a Pydantic <code class="language-plaintext highlighter-rouge">field_validator</code> that would auto-convert incoming strings to <code class="language-plaintext highlighter-rouge">CodeLabel</code>. I initially accepted this. It made tests pass immediately, with no caller needing an update.</p>

<p>That was the problem. The validator <em>masked</em> every call site that was passing a raw string. Instead of failing loudly (forcing me to find and fix each caller), it silently converted. When I later tried to remove the validator, I discovered dozens of sites that had never been updated.</p>

<p>This produced a rule that went into CLAUDE.md’s new “Refactoring Principles” section:</p>

<blockquote>
  <p><strong>No coercion validators.</strong> Do not add Pydantic <code class="language-plaintext highlighter-rouge">field_validator</code> or <code class="language-plaintext highlighter-rouge">__post_init__</code> hacks that auto-convert strings to the domain type. These mask call sites that should be explicitly updated. If Pydantic rejects a value, the caller is wrong. Fix the caller.</p>
</blockquote>
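<p>The masking effect can be reproduced with plain dataclasses. In this sketch, <code class="language-plaintext highlighter-rouge">CoercingBlock</code> and <code class="language-plaintext highlighter-rouge">StrictBlock</code> are hypothetical classes. The explicit type check in <code class="language-plaintext highlighter-rouge">StrictBlock</code> stands in for the rejection that Pydantic would perform, since standard-library dataclasses do not validate field types on their own.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeLabel:
    value: str


@dataclass
class CoercingBlock:
    label: CodeLabel

    def __post_init__(self) -> None:
        # Coercion: a raw string is silently repaired, so the stale
        # caller never fails and never gets fixed.
        if isinstance(self.label, str):
            self.label = CodeLabel(self.label)


@dataclass
class StrictBlock:
    label: CodeLabel

    def __post_init__(self) -> None:
        # Stand-in for model validation: reject anything that is not
        # already the domain type, so the stale caller fails loudly.
        if not isinstance(self.label, CodeLabel):
            raise TypeError(f"expected CodeLabel, got {type(self.label).__name__}")
```

<p>Constructing <code class="language-plaintext highlighter-rouge">CoercingBlock("L1")</code> succeeds and hides the unconverted caller. <code class="language-plaintext highlighter-rouge">StrictBlock("L1")</code> raises <code class="language-plaintext highlighter-rouge">TypeError</code> at the exact site that still needs updating.</p>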

<h3 id="the-refactoring-principles-that-crystallised">The Refactoring Principles That Crystallised</h3>

<p>The <code class="language-plaintext highlighter-rouge">CodeLabel</code> experience generated a full set of type-propagation guidelines. Each one came from a specific mistake:</p>

<table>
  <thead>
    <tr>
      <th>Principle</th>
      <th>Mistake it prevents</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No coercion validators</td>
      <td>Masking unconverted callers</td>
    </tr>
    <tr>
      <td>Push wrapping to the origin</td>
      <td>Re-wrapping at every consumer</td>
    </tr>
    <tr>
      <td>No defensive <code class="language-plaintext highlighter-rouge">isinstance</code> checks</td>
      <td>Inconsistent callers surviving</td>
    </tr>
    <tr>
      <td>Domain methods over string extraction</td>
      <td><code class="language-plaintext highlighter-rouge">.value</code> leaks everywhere</td>
    </tr>
    <tr>
      <td>Validate at construction, not at use</td>
      <td>Double-wrapping going undetected</td>
    </tr>
    <tr>
      <td>Separate name generation from label generation</td>
      <td><code class="language-plaintext highlighter-rouge">fresh_label()</code> producing non-labels</td>
    </tr>
    <tr>
      <td>Serialization at the boundary</td>
      <td><code class="language-plaintext highlighter-rouge">str()</code> in the middle of pipelines</td>
    </tr>
    <tr>
      <td>File issues for the next ring</td>
      <td>Attempting everything in one session</td>
    </tr>
  </tbody>
</table>
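<p>Two of these principles fit in one small sketch: validating at construction so that double-wrapping fails immediately, and exposing a domain method instead of letting callers reach into <code class="language-plaintext highlighter-rouge">.value</code>. The <code class="language-plaintext highlighter-rouge">is_function_entry()</code> method and the <code class="language-plaintext highlighter-rouge">func_</code> prefix are invented for illustration and are not taken from the RedDragon codebase.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeLabel:
    value: str

    def __post_init__(self) -> None:
        # Validate at construction: CodeLabel(CodeLabel("x")) is rejected
        # here rather than surfacing later as a confusing lookup miss.
        if not isinstance(self.value, str):
            raise TypeError(f"label value must be str, got {type(self.value).__name__}")

    def is_function_entry(self) -> bool:
        # Domain method (hypothetical): callers ask the label a question
        # instead of writing label.value.startswith(...) themselves.
        return self.value.startswith("func_")
```

<p>With the check in the constructor, use sites need no defensive <code class="language-plaintext highlighter-rouge">isinstance</code> tests: any <code class="language-plaintext highlighter-rouge">CodeLabel</code> that exists is already well-formed.</p>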

<p>These principles were extracted from the CodeLabel work and written into <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> before the Register migration began. They saved significant time on every subsequent step.</p>

<hr />

<h2 id="chapter-2-register-the-pydantic-problem">Chapter 2: Register, The Pydantic Problem</h2>

<p>With <code class="language-plaintext highlighter-rouge">CodeLabel</code> done, the next target was <code class="language-plaintext highlighter-rouge">result_reg: str</code>. The pattern was established: introduce <code class="language-plaintext highlighter-rouge">Register(name: str)</code> and a <code class="language-plaintext highlighter-rouge">NoRegister</code> singleton, replace <code class="language-plaintext highlighter-rouge">str</code>, and push the type through adjacents.</p>

<p>The definition was clean:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Register</span><span class="p">:</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="n">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">name</span>
    <span class="k">def</span> <span class="nf">__eq__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">other</span><span class="p">:</span> <span class="nb">object</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="nb">str</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">other</span>  <span class="c1"># compatibility bridge
</span>        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">Register</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">other</span><span class="p">.</span><span class="n">name</span>
        <span class="k">return</span> <span class="nb">NotImplemented</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">NoRegister</code> was the null object: <code class="language-plaintext highlighter-rouge">__str__</code> returned <code class="language-plaintext highlighter-rouge">""</code>, <code class="language-plaintext highlighter-rouge">is_present()</code> returned <code class="language-plaintext highlighter-rouge">False</code>.</p>

<h3 id="the-boolean-truthiness-problem">The Boolean Truthiness Problem</h3>

<p><code class="language-plaintext highlighter-rouge">IRInstruction.result_reg</code> was <code class="language-plaintext highlighter-rouge">str</code>, and the empty string <code class="language-plaintext highlighter-rouge">""</code> was the sentinel for “no result register.” Throughout the codebase, truthiness checks like <code class="language-plaintext highlighter-rouge">if inst.result_reg:</code> were used to test whether an instruction produced a result.</p>

<p>When <code class="language-plaintext highlighter-rouge">result_reg</code> became <code class="language-plaintext highlighter-rouge">Register</code>, the truthiness semantics broke. A <code class="language-plaintext highlighter-rouge">Register("")</code> is truthy in Python (it’s a dataclass instance, not an empty string). <code class="language-plaintext highlighter-rouge">NoRegister</code> is also truthy (same reason). Every <code class="language-plaintext highlighter-rouge">if inst.result_reg:</code> check now always evaluated to <code class="language-plaintext highlighter-rouge">True</code>.</p>

<p>The initial fix was to add <code class="language-plaintext highlighter-rouge">__bool__</code> to <code class="language-plaintext highlighter-rouge">Register</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">__bool__</span><span class="p">(</span><span class="n">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="k">return</span> <span class="nf">bool</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
</code></pre></div></div>

<p>This restored the old behaviour: <code class="language-plaintext highlighter-rouge">Register("")</code> was falsy, <code class="language-plaintext highlighter-rouge">Register("%r0")</code> was truthy, <code class="language-plaintext highlighter-rouge">NoRegister</code> was falsy.</p>

<p>But <code class="language-plaintext highlighter-rouge">__bool__</code> on a domain type is a code smell. It couples the type to the semantics of one particular usage pattern (truthiness as “is present”). The better fix was explicit: replace every <code class="language-plaintext highlighter-rouge">if inst.result_reg:</code> with <code class="language-plaintext highlighter-rouge">if inst.result_reg.is_present():</code>. This meant changing 30+ sites across the codebase.</p>

<p>The <code class="language-plaintext highlighter-rouge">__bool__</code> method was added, used as a temporary bridge, and then removed in a separate commit after all call sites were migrated to <code class="language-plaintext highlighter-rouge">is_present()</code>. The commit message: <em>“Register: remove __bool__, use explicit is_present() everywhere.”</em></p>

<p>The <code class="language-plaintext highlighter-rouge">CodeLabel</code> principles applied directly: push semantics to the domain type, don’t rely on Python’s implicit protocols.</p>
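<p>The truthiness trap and the explicit alternative can be demonstrated in a few lines. This is a simplified sketch: it omits the string-comparison bridge shown earlier, and the subclass-based <code class="language-plaintext highlighter-rouge">NoRegister</code> is an assumption about how the singleton might be built.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    name: str

    def is_present(self) -> bool:
        return True

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class _NoRegister(Register):
    name: str = ""

    def is_present(self) -> bool:
        return False


NoRegister = _NoRegister()


def has_result_implicit(result_reg: Register) -> bool:
    # With no __bool__ or __len__ defined, every instance is truthy,
    # so this check can no longer distinguish "no register".
    return bool(result_reg)


def has_result_explicit(result_reg: Register) -> bool:
    # The domain type answers the question directly.
    return result_reg.is_present()
```

<p>The implicit check returns <code class="language-plaintext highlighter-rouge">True</code> even for <code class="language-plaintext highlighter-rouge">NoRegister</code>. The explicit check returns <code class="language-plaintext highlighter-rouge">False</code> for it, with no dependence on Python’s truth-value protocol.</p>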

<h3 id="the-branch_if-comma-convention">The BRANCH_IF Comma Convention</h3>

<p><code class="language-plaintext highlighter-rouge">BRANCH_IF</code> stored its branch targets as <code class="language-plaintext highlighter-rouge">operands = ["%r0", "L_true,L_false"]</code>. Two labels concatenated with a comma in a single string. Callers split on comma to recover the label list.</p>

<p>This was not a design choice. It was an accidental constraint from the early days when <code class="language-plaintext highlighter-rouge">operands</code> was <code class="language-plaintext highlighter-rouge">list[Any]</code> and nobody thought about it. The comma string had survived hundreds of sessions and 12,000+ tests because it worked.</p>

<p>This was the right time to replace it, since we were already touching <code class="language-plaintext highlighter-rouge">BRANCH_IF</code> for the <code class="language-plaintext highlighter-rouge">Register</code> migration. The fix introduced a <code class="language-plaintext highlighter-rouge">branch_targets: tuple[CodeLabel, ...]</code> field on the typed <code class="language-plaintext highlighter-rouge">BranchIf</code> instruction, eliminating the comma-separated string entirely.</p>

<p>But the migration was tricky because the comma convention was load-bearing in <code class="language-plaintext highlighter-rouge">cfg.py</code>, <code class="language-plaintext highlighter-rouge">dataflow.py</code>, and <code class="language-plaintext highlighter-rouge">registry.py</code>. Each file had its own slightly different approach to splitting the string:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cfg.py
</span><span class="n">targets</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="nf">split</span><span class="p">(</span><span class="sh">"</span><span class="s">,</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># dataflow.py
</span><span class="k">for</span> <span class="n">target</span> <span class="ow">in</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="nf">split</span><span class="p">(</span><span class="sh">"</span><span class="s">,</span><span class="sh">"</span><span class="p">):</span>

<span class="c1"># registry.py
</span><span class="n">branches</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="nf">split</span><span class="p">(</span><span class="sh">"</span><span class="s">,</span><span class="sh">"</span><span class="p">)</span> <span class="k">if</span> <span class="sh">"</span><span class="s">,</span><span class="sh">"</span> <span class="ow">in</span> <span class="nf">str</span><span class="p">(</span><span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="k">else</span> <span class="p">[</span><span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
</code></pre></div></div>

<p>Three files, three variations, one undocumented convention. The structured <code class="language-plaintext highlighter-rouge">branch_targets</code> field eliminated all of them.</p>
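<p>A sketch of the structured replacement, under stated assumptions: the <code class="language-plaintext highlighter-rouge">cond_reg</code> field name, the <code class="language-plaintext highlighter-rouge">parse_legacy_targets()</code> helper, and the <code class="language-plaintext highlighter-rouge">successors()</code> consumer are hypothetical. Only the <code class="language-plaintext highlighter-rouge">branch_targets: tuple[CodeLabel, ...]</code> field comes from the migration itself.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeLabel:
    value: str


@dataclass(frozen=True)
class Register:
    name: str


@dataclass(frozen=True)
class BranchIf:
    cond_reg: Register  # assumed field name for the condition register
    branch_targets: tuple[CodeLabel, ...]


def parse_legacy_targets(raw: str) -> tuple[CodeLabel, ...]:
    # The comma split happens once, at the boundary with the old format.
    return tuple(CodeLabel(part) for part in raw.split(","))


def successors(inst: BranchIf) -> list[CodeLabel]:
    # Consumers read a typed field; no string splitting, no special cases.
    return list(inst.branch_targets)
```

<p>Once the targets are a tuple of labels, the CFG builder, the dataflow pass, and the registry can all read the same field and no longer need their own parsing logic.</p>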

<hr />

<h2 id="chapter-3-typed-instructions">Chapter 3: Typed Instructions</h2>

<p>Chapters 1 and 2 had fixed the <em>metadata</em> fields on <code class="language-plaintext highlighter-rouge">IRInstruction</code>: <code class="language-plaintext highlighter-rouge">label</code> became <code class="language-plaintext highlighter-rouge">CodeLabel</code>, <code class="language-plaintext highlighter-rouge">result_reg</code> became <code class="language-plaintext highlighter-rouge">Register</code>. But the actual payload of each instruction, the thing that made a <code class="language-plaintext highlighter-rouge">BINOP</code> different from a <code class="language-plaintext highlighter-rouge">STORE_FIELD</code>, still lived in <code class="language-plaintext highlighter-rouge">operands: list[Any]</code>.</p>

<p>This was the field that pyright couldn’t reason about at all. It accepted anything: strings, ints, <code class="language-plaintext highlighter-rouge">CodeLabel</code> objects, lists of <code class="language-plaintext highlighter-rouge">CodeLabel</code> objects, <code class="language-plaintext highlighter-rouge">SpreadArguments</code> wrappers. The type checker saw <code class="language-plaintext highlighter-rouge">list[Any]</code> and gave up. Every read from <code class="language-plaintext highlighter-rouge">operands</code> was untyped, every write was unchecked, and every positional convention was invisible to static analysis.</p>

<p>The <code class="language-plaintext highlighter-rouge">operands</code> bag was also the source of the most insidious class of bugs: <em>silent positional swaps</em>. A <code class="language-plaintext highlighter-rouge">STORE_FIELD</code> had <code class="language-plaintext highlighter-rouge">operands = [obj_reg, field_name, value_reg]</code>. If a frontend accidentally emitted <code class="language-plaintext highlighter-rouge">operands = [field_name, obj_reg, value_reg]</code>, the type system had nothing to say. All three values were strings. The bug would surface as wrong runtime behaviour, not a type error. Every consumer of the IR had its own positional destructuring logic, and every one was a potential source of index bugs that no tool could catch statically.</p>

<h3 id="the-decision-one-class-per-opcode">The Decision: One Class Per Opcode</h3>

<p>The IR had a single <code class="language-plaintext highlighter-rouge">IRInstruction</code> class carrying every instruction in the system. A <code class="language-plaintext highlighter-rouge">CONST</code> loading a literal, a <code class="language-plaintext highlighter-rouge">CALL_METHOD</code> dispatching to an object, a <code class="language-plaintext highlighter-rouge">TRY_PUSH</code> setting up an exception handler: all the same class, distinguished only by the <code class="language-plaintext highlighter-rouge">opcode</code> enum and the contents of the <code class="language-plaintext highlighter-rouge">operands</code> bag. Every consumer had to destructure positionally:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># What is operands[0]? Depends on the opcode.
</span><span class="k">if</span> <span class="n">inst</span><span class="p">.</span><span class="n">opcode</span> <span class="o">==</span> <span class="n">Opcode</span><span class="p">.</span><span class="n">BINOP</span><span class="p">:</span>
    <span class="n">operator</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>   <span class="c1"># str: "+"
</span>    <span class="n">left_reg</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>   <span class="c1"># str: "%r0"
</span>    <span class="n">right_reg</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>  <span class="c1"># str: "%r1"
</span><span class="k">elif</span> <span class="n">inst</span><span class="p">.</span><span class="n">opcode</span> <span class="o">==</span> <span class="n">Opcode</span><span class="p">.</span><span class="n">STORE_FIELD</span><span class="p">:</span>
    <span class="n">obj_reg</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>    <span class="c1"># str: "%r0"
</span>    <span class="n">field_name</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># str: "name"
</span>    <span class="n">value_reg</span> <span class="o">=</span> <span class="n">inst</span><span class="p">.</span><span class="n">operands</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>  <span class="c1"># str: "%r1"
</span></code></pre></div></div>

<p>The same index meant different things depending on the opcode. Swap two operands in a <code class="language-plaintext highlighter-rouge">STORE_FIELD</code> and the error is silent: both are strings, either fits in any slot of <code class="language-plaintext highlighter-rouge">operands</code>, and the type system has nothing to say about it.</p>

<p>The alternative was a discriminated union: one frozen dataclass per opcode, each with named fields of the correct type. A <code class="language-plaintext highlighter-rouge">Binop</code> has <code class="language-plaintext highlighter-rouge">operator: str</code>, <code class="language-plaintext highlighter-rouge">left: Register</code>, <code class="language-plaintext highlighter-rouge">right: Register</code>. A <code class="language-plaintext highlighter-rouge">StoreField</code> has <code class="language-plaintext highlighter-rouge">obj_reg: Register</code>, <code class="language-plaintext highlighter-rouge">field_name: str</code>, <code class="language-plaintext highlighter-rouge">value_reg: Register</code>. Swapping <code class="language-plaintext highlighter-rouge">obj_reg</code> and <code class="language-plaintext highlighter-rouge">value_reg</code> is still possible, but swapping <code class="language-plaintext highlighter-rouge">field_name</code> and <code class="language-plaintext highlighter-rouge">value_reg</code> is a type error because one is <code class="language-plaintext highlighter-rouge">str</code> and the other is <code class="language-plaintext highlighter-rouge">Register</code>.</p>

<p>The trade-off was migration cost. A single <code class="language-plaintext highlighter-rouge">IRInstruction</code> class was the interface for every frontend (15 tree-sitter frontends, an LLM frontend, a COBOL frontend), every handler (28 VM handlers), the CFG builder, the dataflow engine, the type inference system, the interprocedural analysis pipeline, the linker, and the MCP server. Replacing it required touching all of them. The compatibility bridge (<code class="language-plaintext highlighter-rouge">to_typed()</code>/<code class="language-plaintext highlighter-rouge">to_flat()</code> converters, <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code>, <code class="language-plaintext highlighter-rouge">operands</code> properties on the typed classes) existed to make this migration incremental rather than big-bang.</p>
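<p>The converter pair can be pictured as a round trip between the two representations. The sketch below handles only <code class="language-plaintext highlighter-rouge">BINOP</code>, and <code class="language-plaintext highlighter-rouge">FlatInstruction</code> is a simplified stand-in for <code class="language-plaintext highlighter-rouge">IRInstruction</code>. The signatures of the real <code class="language-plaintext highlighter-rouge">to_typed()</code> and <code class="language-plaintext highlighter-rouge">to_flat()</code> are not shown in this post, so the ones here are assumptions.</p>

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Register:
    name: str


@dataclass
class FlatInstruction:
    """Simplified stand-in for the legacy IRInstruction."""

    opcode: str
    operands: list[Any] = field(default_factory=list)


@dataclass(frozen=True)
class Binop:
    operator: str
    left: Register
    right: Register


def to_typed(inst: FlatInstruction) -> Binop:
    # The positional convention is decoded in exactly one place.
    if inst.opcode != "BINOP":
        raise NotImplementedError(f"sketch only handles BINOP, got {inst.opcode}")
    operator, left, right = inst.operands
    return Binop(operator=operator, left=Register(left), right=Register(right))


def to_flat(inst: Binop) -> FlatInstruction:
    # Re-encode for consumers that have not been migrated yet.
    return FlatInstruction("BINOP", [inst.operator, inst.left.name, inst.right.name])
```

<p>A bridge like this lets migrated and unmigrated consumers coexist: each subsystem can switch to the typed form in its own commit while the rest keep reading the flat one.</p>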

<p>The design produced 31 frozen dataclasses sharing a common <code class="language-plaintext highlighter-rouge">InstructionBase</code> with <code class="language-plaintext highlighter-rouge">source_location</code>, plus per-class fields for registers, labels, operators, names, and branch targets. The <code class="language-plaintext highlighter-rouge">Instruction</code> union type enumerated them:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Instruction</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">Const</span> <span class="o">|</span> <span class="n">LoadVar</span> <span class="o">|</span> <span class="n">DeclVar</span> <span class="o">|</span> <span class="n">StoreVar</span> <span class="o">|</span> <span class="n">Symbolic</span> <span class="o">|</span>
    <span class="n">Binop</span> <span class="o">|</span> <span class="n">Unop</span> <span class="o">|</span>
    <span class="n">CallFunction</span> <span class="o">|</span> <span class="n">CallMethod</span> <span class="o">|</span> <span class="n">CallUnknown</span> <span class="o">|</span>
    <span class="n">LoadField</span> <span class="o">|</span> <span class="n">StoreField</span> <span class="o">|</span> <span class="n">LoadFieldIndirect</span> <span class="o">|</span>
    <span class="n">LoadIndex</span> <span class="o">|</span> <span class="n">StoreIndex</span> <span class="o">|</span>
    <span class="n">LoadIndirect</span> <span class="o">|</span> <span class="n">StoreIndirect</span> <span class="o">|</span> <span class="n">AddressOf</span> <span class="o">|</span>
    <span class="n">NewObject</span> <span class="o">|</span> <span class="n">NewArray</span> <span class="o">|</span>
    <span class="n">Label_</span> <span class="o">|</span> <span class="n">Branch</span> <span class="o">|</span> <span class="n">BranchIf</span> <span class="o">|</span>
    <span class="n">Return_</span> <span class="o">|</span> <span class="n">Throw_</span> <span class="o">|</span>
    <span class="n">TryPush</span> <span class="o">|</span> <span class="n">TryPop</span> <span class="o">|</span>
    <span class="n">AllocRegion</span> <span class="o">|</span> <span class="n">LoadRegion</span> <span class="o">|</span> <span class="n">WriteRegion</span> <span class="o">|</span>
    <span class="n">SetContinuation</span> <span class="o">|</span> <span class="n">ResumeContinuation</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Consumer dispatch changed from <code class="language-plaintext highlighter-rouge">dict[Opcode, handler]</code> to <code class="language-plaintext highlighter-rouge">dict[type, handler]</code>, and opcode checks from <code class="language-plaintext highlighter-rouge">inst.opcode == Opcode.LABEL</code> to <code class="language-plaintext highlighter-rouge">isinstance(inst, Label_)</code>. Generic transformations (register rebasing in the linker, label namespacing in multi-file compilation) went from operand-by-operand iteration to <code class="language-plaintext highlighter-rouge">inst.map_registers(fn)</code> and <code class="language-plaintext highlighter-rouge">inst.map_labels(fn)</code>, which introspect the dataclass fields and return a new frozen instance via <code class="language-plaintext highlighter-rouge">dataclasses.replace()</code>.</p>
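<p>The two changes described above can be sketched roughly as follows. This is a minimal illustration, not RedDragon's actual code: the two-class instruction set, the handler functions, the <code class="language-plaintext highlighter-rouge">rebase</code> helper, and the choice to detect registers by inspecting field values rather than annotations are all assumptions made for brevity.</p>

```python
from dataclasses import dataclass, fields, replace
from typing import Callable


@dataclass(frozen=True)
class Register:
    name: str


@dataclass(frozen=True)
class InstructionBase:
    def map_registers(self, fn: Callable[[Register], Register]) -> "InstructionBase":
        # Assumption: a field counts as a register if its current value is a Register.
        updates = {
            f.name: fn(getattr(self, f.name))
            for f in fields(self)
            if isinstance(getattr(self, f.name), Register)
        }
        # replace() builds a new frozen instance; the original is left untouched.
        return replace(self, **updates)


@dataclass(frozen=True)
class Binop(InstructionBase):
    result_reg: Register
    operator: str
    left: Register
    right: Register


@dataclass(frozen=True)
class Label_(InstructionBase):
    name: str


def handle_binop(inst: Binop) -> str:
    return f"binop {inst.operator}"


def handle_label(inst: Label_) -> str:
    return f"label {inst.name}"


# Dispatch is keyed on the instruction class instead of an Opcode enum.
HANDLERS: dict[type, Callable] = {Binop: handle_binop, Label_: handle_label}


def dispatch(inst: InstructionBase) -> str:
    return HANDLERS[type(inst)](inst)


def rebase(reg: Register, offset: int = 10) -> Register:
    # Hypothetical rebasing rule for registers named like "%r<number>".
    return Register(f"%r{int(reg.name[2:]) + offset}")


add = Binop(Register("%r2"), "+", Register("%r0"), Register("%r1"))
moved = add.map_registers(rebase)
```

<p>Because the generic transformation only looks at fields, adding a new instruction class needs no change to the linker-style code that calls <code class="language-plaintext highlighter-rouge">map_registers()</code>.</p>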

<h3 id="the-original-four-phase-plan">The Original Four-Phase Plan</h3>

<p>The initial design session produced a four-phase plan:</p>

<ol>
  <li><strong>Define typed instruction classes</strong>: 30 frozen dataclasses, one per opcode</li>
  <li><strong>Migrate handler consumers</strong>: VM handlers use <code class="language-plaintext highlighter-rouge">to_typed()</code> for named field access</li>
  <li><strong>Migrate remaining consumers</strong>: CFG, dataflow, type inference, linker, etc.</li>
  <li><strong>Migrate frontend producers</strong>: replace <code class="language-plaintext highlighter-rouge">emit(Opcode.X, ...)</code> with <code class="language-plaintext highlighter-rouge">emit_inst(TypedInstruction(...))</code></li>
</ol>

<p>Phases 1–3 were completed first. Phase 4 started with <code class="language-plaintext highlighter-rouge">common/</code> (190 calls migrated) and the <code class="language-plaintext highlighter-rouge">emit_inst()</code> infrastructure.</p>

<p>Then came the realisation: <strong>the plan was incomplete.</strong> The register <em>operand</em> fields inside typed instructions were still <code class="language-plaintext highlighter-rouge">str</code>. A <code class="language-plaintext highlighter-rouge">Binop</code> had <code class="language-plaintext highlighter-rouge">left: str</code> and <code class="language-plaintext highlighter-rouge">right: str</code>. Changing those to <code class="language-plaintext highlighter-rouge">Register</code> was a separate layer of work that the original plan hadn’t accounted for.</p>

<h3 id="the-plan-that-replaced-the-plan">The Plan That Replaced the Plan</h3>

<p>The original four phases were dissolved into a five-layer plan, documented in ADR-121 and a dedicated design document:</p>

<pre><code class="language-mermaid">flowchart TD
    L1("Layer 1&lt;br/&gt;Register fields in&lt;br/&gt;instruction classes&lt;br/&gt;&lt;i&gt;str → Register&lt;/i&gt;"):::layer
    L2("Layer 2&lt;br/&gt;map_registers/map_labels&lt;br/&gt;on InstructionBase"):::layer
    L3("Layer 3&lt;br/&gt;Migrate consumers&lt;br/&gt;&lt;i&gt;5 sub-issues&lt;/i&gt;"):::layer
    L4("Layer 4&lt;br/&gt;Migrate producers&lt;br/&gt;&lt;i&gt;19 sub-issues&lt;/i&gt;"):::layer
    L5("Layer 5&lt;br/&gt;Remove the bridge&lt;br/&gt;&lt;i&gt;delete IRInstruction&lt;/i&gt;"):::layer

    L1 --&gt; L2
    L1 --&gt; L3
    L1 --&gt; L4a("Layer 4a&lt;br/&gt;lower_expr() → Register"):::sub
    L2 --&gt; L3a("3a: handlers&lt;br/&gt;+ executor"):::sub
    L2 --&gt; L3d("3d: linker&lt;br/&gt;+ compiler"):::sub
    L1 --&gt; L3b("3b: type&lt;br/&gt;inference"):::sub
    L1 --&gt; L3c("3c: CFG +&lt;br/&gt;dataflow"):::sub
    L1 --&gt; L3e("3e: LLM +&lt;br/&gt;COBOL + misc"):::sub
    L4a --&gt; L4f("17 frontend&lt;br/&gt;sub-issues&lt;br/&gt;&lt;i&gt;~1,357 emit() calls&lt;/i&gt;"):::sub

    L3 --&gt; L5
    L4f --&gt; L5

    classDef layer fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef sub fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
</code></pre>

<p>The five-layer plan had 28 independently-trackable sub-issues. Each sub-issue was a self-contained commit with a test gate. The <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code> compatibility bridge would remain until Layer 5, making all preceding layers individually safe.</p>

<p>The key insight was decomposition granularity. Layer 4 alone had 17 sub-issues, one per frontend, because each frontend could be migrated independently. The Rust frontend had 127 <code class="language-plaintext highlighter-rouge">emit()</code> calls. The COBOL emit context had 9. Mixing them in one commit would have made review and revert impossible.</p>
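<p>One way to see what the decomposition buys is to compute which units become unblocked together. The sketch below feeds an approximation of the dependency diagram above into the standard library's <code class="language-plaintext highlighter-rouge">graphlib.TopologicalSorter</code>. The node names are shorthand, the 17 frontend sub-issues are collapsed into one node, and making Layer 5 depend on every Layer 3 sub-issue is my reading of the diagram rather than something it states explicitly.</p>

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (an approximation of the diagram above)
DEPENDS_ON = {
    "L1": set(),
    "L2": {"L1"},
    "3a": {"L2"},
    "3b": {"L1"},
    "3c": {"L1"},
    "3d": {"L2"},
    "3e": {"L1"},
    "4a": {"L1"},
    "4-frontends": {"4a"},  # the per-frontend sub-issues, collapsed into one node
    "L5": {"3a", "3b", "3c", "3d", "3e", "4-frontends"},  # assumed: needs all of Layer 3
}


def waves(graph: dict[str, set[str]]) -> list[set[str]]:
    """Group nodes into waves; everything in a wave can be worked on in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = set(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


WAVES = waves(DEPENDS_ON)
```

<p>Under these assumptions the second wave contains Layer 2, 3b, 3c, 3e, and 4a: everything that depends only on Layer 1 can start as soon as it lands.</p>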

<h3 id="the-missing-length-field">The Missing Length Field</h3>

<p>Phase 1 of the original plan defined 30 typed instruction classes. Each class was a frozen dataclass with named fields instead of positional operands. The round-trip test (<code class="language-plaintext highlighter-rouge">to_typed(inst).to_flat() == inst</code>) was the primary verification.</p>

<p>The <code class="language-plaintext highlighter-rouge">LoadRegion</code> instruction was defined as:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">LoadRegion</span><span class="p">(</span><span class="n">InstructionBase</span><span class="p">):</span>
    <span class="n">result_reg</span><span class="p">:</span> <span class="n">Register</span> <span class="o">=</span> <span class="n">NO_REGISTER</span>
    <span class="n">region_reg</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">""</span>
    <span class="n">offset_reg</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">""</span>
</code></pre></div></div>

<p>But <code class="language-plaintext highlighter-rouge">LOAD_REGION</code> in the IR had four operands: region register, offset register, <em>length</em>, and a field name. The <code class="language-plaintext highlighter-rouge">length</code> field was missing from the typed class. The round-trip test caught this immediately (the <code class="language-plaintext highlighter-rouge">to_flat()</code> output had fewer operands than the input) but the bug existed in the first commit and had to be fixed as a follow-up.</p>

<p>The cause was straightforward: the AI generated the class from the opcode name and the most common operand patterns, without checking the actual <code class="language-plaintext highlighter-rouge">_handle_load_region</code> handler for the full operand list. The IR reference document listed the correct operands, but the class definition was generated from pattern inference, not from the spec.</p>

<p>Commit message: <em>“fix: LoadRegion typed instruction, add missing length field.”</em></p>

<p>The lesson: <strong>generated code that passes a round-trip test is not necessarily correct; the test verifies internal consistency, not completeness.</strong> A class with 3 fields that round-trips perfectly to a 3-operand instruction is still wrong if the real instruction has 4 operands.</p>
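<p>A small sketch makes the distinction concrete. The flat representation, the class name <code class="language-plaintext highlighter-rouge">LoadRegionDraft</code>, and the fixtures below are hypothetical stand-ins, not the project's types. The point is only that whether the round-trip check fails depends on whether the fixture carries the full operand list.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlatInst:
    """Hypothetical stand-in for the flat instruction: opcode plus positional operands."""
    opcode: str
    result_reg: str
    operands: tuple


@dataclass(frozen=True)
class LoadRegionDraft:
    """Typed class as first generated: no length field."""
    result_reg: str = ""
    region_reg: str = ""
    offset_reg: str = ""

    def to_flat(self) -> FlatInst:
        return FlatInst("LOAD_REGION", self.result_reg, (self.region_reg, self.offset_reg))


def to_typed(inst: FlatInst) -> LoadRegionDraft:
    # Any operand beyond the first two is silently dropped.
    region_reg, offset_reg, *_dropped = inst.operands
    return LoadRegionDraft(inst.result_reg, region_reg, offset_reg)


def round_trips(inst: FlatInst) -> bool:
    return to_typed(inst).to_flat() == inst


# A fixture that happens to omit the length operand round-trips cleanly...
truncated_fixture = FlatInst("LOAD_REGION", "%r3", ("%r1", "%r2"))
# ...while one that includes it exposes the missing field.
full_fixture = FlatInst("LOAD_REGION", "%r3", ("%r1", "%r2", 8))
```

<p>Here <code class="language-plaintext highlighter-rouge">round_trips(truncated_fixture)</code> is true and <code class="language-plaintext highlighter-rouge">round_trips(full_fixture)</code> is false, so the test is only as complete as the fixtures fed to it.</p>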

<h3 id="layer-1-execution">Layer 1 Execution</h3>

<p>Layer 1 was the riskiest single commit: changing 31 fields across 21 instruction classes from <code class="language-plaintext highlighter-rouge">str</code> to <code class="language-plaintext highlighter-rouge">Register</code>, plus updating all <code class="language-plaintext highlighter-rouge">operands</code> properties, <code class="language-plaintext highlighter-rouge">to_typed()</code> converters, and <code class="language-plaintext highlighter-rouge">to_flat()</code> converters.</p>

<p>The <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code> bridge was supposed to make this safe: all existing string comparisons would continue to work. And it did, mostly. But several edge cases escaped the bridge:</p>

<p><strong>1. COBOL literal-in-call-operand.</strong> COBOL frontends occasionally put literal values (not register names) in call argument positions. These weren’t registers; they were strings like <code class="language-plaintext highlighter-rouge">"SPACES"</code> or <code class="language-plaintext highlighter-rouge">"100"</code>. Wrapping them in <code class="language-plaintext highlighter-rouge">Register(...)</code> was semantically wrong. The fix was an <code class="language-plaintext highlighter-rouge">_as_register()</code> helper that checked whether the operand looked like a register name before wrapping.</p>

<p><strong>2. <code class="language-plaintext highlighter-rouge">_resolve_reg</code> fallback path.</strong> When the VM’s <code class="language-plaintext highlighter-rouge">_resolve_reg</code> couldn’t find a register in the frame, it returned the raw operand as a fallback (for symbolic execution of incomplete programs). After the migration, this fallback returned a <code class="language-plaintext highlighter-rouge">Register</code> object instead of a string, which leaked into <code class="language-plaintext highlighter-rouge">TypedValue</code> wrappers and broke downstream comparisons. The fix: <code class="language-plaintext highlighter-rouge">_resolve_reg</code> fallback returns <code class="language-plaintext highlighter-rouge">str(operand)</code>, not the raw <code class="language-plaintext highlighter-rouge">Register</code>.</p>

<p><strong>3. <code class="language-plaintext highlighter-rouge">_is_temporary_register</code> in dataflow.</strong> This function subscripted the register name to check character patterns: <code class="language-plaintext highlighter-rouge">reg[1] == 'r' and reg[2:].isdigit()</code>. After migration, <code class="language-plaintext highlighter-rouge">reg</code> was a <code class="language-plaintext highlighter-rouge">Register</code> object, and subscripting it didn’t work. The fix: convert to <code class="language-plaintext highlighter-rouge">str</code> before subscripting.</p>

<p><strong>4. StoreIndex with SpreadArguments.</strong> Some frontends emitted <code class="language-plaintext highlighter-rouge">STORE_INDEX</code> with a <code class="language-plaintext highlighter-rouge">SpreadArguments</code> in the <code class="language-plaintext highlighter-rouge">value_reg</code> position, a convention for array unpacking. The <code class="language-plaintext highlighter-rouge">value_reg</code> field was typed as <code class="language-plaintext highlighter-rouge">Register</code>, not <code class="language-plaintext highlighter-rouge">Register | SpreadArguments</code>, so the type system rejected it. The fix: preserve <code class="language-plaintext highlighter-rouge">SpreadArguments</code> in <code class="language-plaintext highlighter-rouge">StoreIndex.value_reg</code> with a note documenting the frontend convention.</p>

<p>Each of these was a small fix. Together they took as long as the main migration. The commit message for Layer 1 runs to 14 lines documenting the edge cases.</p>

<p>All 12,924 tests passed. 22 xfailed. 1 skipped.</p>

<hr />

<h2 id="chapter-4-completing-the-migration-with-superpowers">Chapter 4: Completing the Migration with Superpowers</h2>

<p>Layer 1 was the riskiest single commit, and the GSD v2 workflow patterns had gotten me through it. But the five-layer plan sat at roughly 15% complete: Layers 2 through 5 still needed ~2,200 sites migrated across ~100 files, and that remaining work stalled.</p>

<h3 id="why-gsd-v2-stalled-after-layer-1">Why GSD v2 Stalled After Layer 1</h3>

<p>The GSD v2 patterns I’d adopted were <em>rules in CLAUDE.md</em>: enforce phases, classify complexity, run the verification gate. They worked as guardrails. But they didn’t help with the <em>orchestration</em> problem: who breaks down the next 4 layers into sub-issues? Who dispatches work? Who reviews each piece before moving to the next?</p>

<p>With GSD v2 patterns, the agent was too eager. It would see the five-layer plan, start implementing Layer 2, hit an edge case, make a judgment call without consulting me, then discover the judgment was wrong three layers later. The “stop and consult when patching” rule was in CLAUDE.md, but the agent treated it as a suggestion, not a gate.</p>

<p>GSD v2 gave me <em>process discipline</em> (phases, verification, state on disk) but not <em>decision discipline</em> (when to pause, when to decompose, when to ask).</p>

<p>There were also practical issues. As the session grew long, GSD v2’s TUI would behave erratically: freezing the terminal window, scrolling to random sections of the session unprompted. This made it harder to stay oriented during multi-hour work.</p>

<p>A caveat: I did not investigate all of GSD v2’s configuration options in depth. It has settings for model selection, timeout behaviour, stuck-loop detection, and other knobs that might have helped. What I’m describing is the out-of-the-box experience with default settings applied to a Heavy refactoring.</p>

<h3 id="what-superpowers-did-differently">What Superpowers Did Differently</h3>

<p>I switched back to the <a href="https://github.com/anthropics/claude-code/tree/main/plugins/superpowers">Superpowers</a> plugin for Claude Code, which I’d used for much of the project’s earlier development.</p>

<p><strong>Brainstorming was a real gate.</strong> Superpowers’ brainstorming skill forced a structured conversation before any implementation. When I said “execute this plan,” it first asked five clarifying questions, one at a time, with multiple choice options and a recommendation. Layer 4 decomposition strategy? One sub-issue per frontend (option A), grouped by size tier (B), or grouped by language family (C)? I picked A. The <code class="language-plaintext highlighter-rouge">lower_expr()</code> signature change: bundled with frontend migration or separate atomic commit? I picked separate. These decisions shaped the entire execution.</p>

<p>With GSD v2 patterns alone, the agent would have made these decisions silently and started coding.</p>

<p><strong>Two-stage review after every task.</strong> Superpowers dispatched a fresh subagent per task, then ran a spec compliance review (“did you build what was requested?”) followed by a code quality review (“is it well-built?”). The spec reviewer caught the Layer 1 <code class="language-plaintext highlighter-rouge">StoreIndex.value_reg</code> <code class="language-plaintext highlighter-rouge">SpreadArguments</code> guard issue. The code quality reviewer caught the <code class="language-plaintext highlighter-rouge">types.UnionType</code> vs <code class="language-plaintext highlighter-rouge">typing.Union</code> problem for Python 3.13 that would have <em>broken <code class="language-plaintext highlighter-rouge">map_registers()</code> at runtime</em>.</p>

<p>With GSD v2, self-review was a checklist item. With Superpowers, review was a separate agent with adversarial instructions.</p>

<p><strong>Issues filed for discovered work.</strong> During Layer 3e, the agent discovered that <code class="language-plaintext highlighter-rouge">registry.py</code> had 3 <code class="language-plaintext highlighter-rouge">opcode==</code> comparisons that couldn’t be migrated because the instruction list was mixed-type. With GSD v2 patterns, this would have been noted in a commit message and forgotten. With Superpowers, I asked “are there issues filed for any of the new-found follow-up work?” The agent checked, found 4 undocumented items, and filed them immediately, each blocking the epic. The dependency graph stayed honest.</p>

<p><strong>The plan was revised during execution.</strong> After Layer 3 completed, the agent checked whether the plan’s assumptions still held. The Layer 3a estimate of “~100 sites” turned out to be 4 sites (the executor was already clean). The Layer 3e estimate of “~40 sites” turned out to be 128 (COBOL ir_encoders dominated). The plan was rewritten mid-execution with corrected counts, updated dependency graphs, and new prerequisite issues. The design document became a living record rather than a stale spec.</p>

<h3 id="the-audit-that-revised-the-plan">The Audit That Revised the Plan</h3>

<p>The session started with an audit. I asked the agent to examine <code class="language-plaintext highlighter-rouge">red-dragon-0ibe</code> and its design doc. It dispatched an exploration subagent that surveyed the entire codebase: every <code class="language-plaintext highlighter-rouge">to_typed()</code> call, every <code class="language-plaintext highlighter-rouge">inst.operands[</code> access, every <code class="language-plaintext highlighter-rouge">opcode==</code> comparison, every <code class="language-plaintext highlighter-rouge">IRInstruction(</code> construction. The findings were a table with file, line, and count for each migration marker.</p>

<p>The audit revealed that the design doc’s estimates were wrong in both directions. Handlers had 0 <code class="language-plaintext highlighter-rouge">to_typed()</code> calls (not the estimated 72). COBOL <code class="language-plaintext highlighter-rouge">ir_encoders.py</code> had 106 <code class="language-plaintext highlighter-rouge">IRInstruction</code> constructions (not counted in the original plan at all). Type inference had a full Opcode-keyed dispatch table that needed replacement (the design doc had described this work as “~30 sites, annotation changes only”).</p>

<p>This upfront survey took <strong>10 minutes</strong>. It saved hours of mid-implementation replanning.</p>
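<p>A survey of this kind is cheap to reproduce by hand. The sketch below counts the four migration markers named above per file. It is my own approximation of such an audit, not what the exploration subagent ran. The regular expressions are assumptions about how the markers appear in source, and it reports counts per file rather than individual line numbers.</p>

```python
import re
from collections import Counter
from pathlib import Path

# Marker name -> pattern. The patterns are assumptions about source formatting.
MARKERS = {
    "to_typed(": re.compile(r"\bto_typed\("),
    "inst.operands[": re.compile(r"\binst\.operands\["),
    "opcode==": re.compile(r"\bopcode\s*=="),
    "IRInstruction(": re.compile(r"\bIRInstruction\("),
}


def survey(root) -> Counter:
    """Count each migration marker per Python file under `root`."""
    counts: Counter = Counter()
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for marker, pattern in MARKERS.items():
            hits = len(pattern.findall(text))
            if hits:
                counts[(path.relative_to(root).as_posix(), marker)] = hits
    return counts


def totals_by_marker(counts: Counter) -> Counter:
    """Collapse the per-file counts into one total per marker."""
    totals: Counter = Counter()
    for (_file, marker), hits in counts.items():
        totals[marker] += hits
    return totals
```

<p>Comparing <code class="language-plaintext highlighter-rouge">totals_by_marker(survey("."))</code> against the estimates in a design doc is usually enough to show which estimates were guesses.</p>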

<h3 id="layers-2-through-5-execution">Layers 2 Through 5: Execution</h3>

<p>With the revised plan, execution was systematic:</p>

<p><strong>Layer 2</strong> (2 commits, small): <code class="language-plaintext highlighter-rouge">Register.rebase(offset)</code> and <code class="language-plaintext highlighter-rouge">map_registers()</code>/<code class="language-plaintext highlighter-rouge">map_labels()</code> on <code class="language-plaintext highlighter-rouge">InstructionBase</code>. The <code class="language-plaintext highlighter-rouge">types.UnionType</code> fix (caught by the plan reviewer) was baked into the implementation. 22 new tests.</p>

<p><strong>Layer 3</strong> (4 commits, 5 sub-issues): Consumer migration. The agent dispatched sub-issues in parallel where dependencies allowed — 3b, 3c, 3e could start immediately after Layer 1, without waiting for Layer 2. Layer 3a was deferred entirely (blocked by COBOL literal operands). Layer 3d used the new <code class="language-plaintext highlighter-rouge">map_registers()</code> in the linker, deleting the old <code class="language-plaintext highlighter-rouge">_rebase_operand</code> helper.</p>

<p><strong>Layer 4a</strong> (1 commit): <code class="language-plaintext highlighter-rouge">lower_expr() -&gt; Register</code> signature change across 33 files, ~316 methods. Single atomic mechanical commit.</p>

<p><strong>Layer 4</strong> (15 commits): One per frontend. The agent dispatched batches: 5 small frontends first (C, Lua, TypeScript, C++, Scala), then 4 medium (Go, Python, JavaScript, Pascal), then 6 large (PHP, C#, Java, Kotlin, Ruby, Rust). Plus <code class="language-plaintext highlighter-rouge">_base.py</code>/<code class="language-plaintext highlighter-rouge">context.py</code> and COBOL <code class="language-plaintext highlighter-rouge">emit_context.py</code>. ~1,300 <code class="language-plaintext highlighter-rouge">emit()</code> calls converted to <code class="language-plaintext highlighter-rouge">emit_inst()</code>.</p>

<p><strong>Prerequisites</strong> (3 commits): Before Layer 5 could start, 4 discovered issues needed resolution. COBOL <code class="language-plaintext highlighter-rouge">ir_encoders.py</code> literal operands → emit CONST instructions for literals instead. 59 remaining COBOL frontend <code class="language-plaintext highlighter-rouge">emit()</code> calls. Registry mixed-type list. VM/dataflow boundary cleanup.</p>

<p><strong>Layer 5</strong> (3 commits): Delete <code class="language-plaintext highlighter-rouge">emit()</code> methods. Add standalone <code class="language-plaintext highlighter-rouge">__str__</code> (replacing <code class="language-plaintext highlighter-rouge">to_flat()</code> delegation). Delete <code class="language-plaintext highlighter-rouge">to_flat()</code> and all flat-to-typed converters. Remove <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code> — which broke ~300 test sites that were fixed in a single commit. Delete <code class="language-plaintext highlighter-rouge">IRInstruction</code> class.</p>

<p>Total: <strong>~32 commits, ~100 files modified, zero test regressions</strong> at any point. 12,944 tests passing throughout.</p>

<p>The subagent execution times (including test suite runs after each change) give a sense of effort per layer:</p>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>Subagent time</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Layer 1 (Register fields)</td>
      <td>~64 min</td>
      <td>Riskiest commit; 4 edge cases discovered</td>
    </tr>
    <tr>
      <td>Layer 2a+2b (rebase, map_registers)</td>
      <td>~19 min</td>
      <td>Small, well-specified</td>
    </tr>
    <tr>
      <td>Layer 3 (5 sub-issues)</td>
      <td>~94 min</td>
      <td>3b and 3e ran in parallel; 3e dominated at 22 min</td>
    </tr>
    <tr>
      <td>Layer 4a (signatures)</td>
      <td>~15 min</td>
      <td>Mechanical find-and-replace</td>
    </tr>
    <tr>
      <td>Layer 4 (15 frontends)</td>
      <td>~64 min</td>
      <td>Three batches; largest frontends took longest</td>
    </tr>
    <tr>
      <td>Prerequisites (pyww, oczk, b6m4, j1o6)</td>
      <td>~68 min</td>
      <td>pyww dominated at 57 min</td>
    </tr>
    <tr>
      <td>Layer 5A-E (delete emit, to_flat)</td>
      <td>~49 min</td>
      <td> </td>
    </tr>
    <tr>
      <td>Layer 5F (<code class="language-plaintext highlighter-rouge">Register.__eq__</code> removal)</td>
      <td>~107 min</td>
      <td>300+ test sites fixed in one pass</td>
    </tr>
    <tr>
      <td>Layer 5G (delete IRInstruction)</td>
      <td>~53 min</td>
      <td>100 files, type annotation cascade</td>
    </tr>
    <tr>
      <td><strong>Total subagent time</strong></td>
      <td><strong>~533 min (~9 hours)</strong></td>
      <td>Single session, wall clock ~12 hours with review/planning</td>
    </tr>
  </tbody>
</table>

<p>The wall clock time was longer than the subagent time because of the planning, brainstorming, review, and consultation work between subagent dispatches. The brainstorming conversation at the start of the session (decomposing layers, choosing strategies, resolving open questions) took roughly an hour before any code was written.</p>

<pre><code class="language-mermaid">timeline
    title IRInstruction Elimination
    Layer 1-2 : L1 Register fields, 33 fields
               : L2a rebase method
               : L2b map_registers, map_labels
    Layer 3   : 3b type inference dispatch
               : 3c CFG, dataflow isinstance
               : 3d linker map_registers
               : 3e LLM, COBOL, registry, 128 sites
    Layer 4   : 4a lower_expr to Register
               : 15 frontends, 1300 sites
    Prerequisites : COBOL literal operands
                  : COBOL frontend emit
                  : registry cleanup
    Layer 5   : inline_ir refactor
               : delete emit, to_flat
               : remove Register eq str
               : delete IRInstruction
</code></pre>

<h3 id="layer-5-was-not-cleanup">Layer 5 Was Not Cleanup</h3>

<p>The plan described Layer 5 as “remove the bridge.” The actual scope was larger.</p>

<p>The plan said “delete IRInstruction, to_typed/to_flat, operands properties, Register.__eq__(str).” The survey said ~100 sites. The reality was that <code class="language-plaintext highlighter-rouge">IRInstruction</code> was referenced in 116 imports across 36 files. <code class="language-plaintext highlighter-rouge">to_typed()</code> was called at 6 normalization entry points. <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code> was load-bearing in ~300 test assertions and ~100 production sites.</p>

<p>Removing <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code> was the most labour-intensive step. Every <code class="language-plaintext highlighter-rouge">assert inst.left == "%r0"</code> became <code class="language-plaintext highlighter-rouge">assert inst.left == Register("%r0")</code>. Every <code class="language-plaintext highlighter-rouge">frame.registers["%r0"]</code> became <code class="language-plaintext highlighter-rouge">frame.registers[Register("%r0")]</code>. The agent fixed 300+ sites across 46 test files in a single commit — a mechanical migration, but one that required understanding each comparison’s context to choose the right fix pattern (<code class="language-plaintext highlighter-rouge">.name</code> extraction vs <code class="language-plaintext highlighter-rouge">Register()</code> wrapping vs dict key normalisation).</p>
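<p>The three fix patterns look roughly like this once the string-equality bridge is gone. The <code class="language-plaintext highlighter-rouge">Register</code> below is a bare frozen dataclass standing in for the real class, and the frame dictionary is a hypothetical example.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    name: str


left = Register("%r0")

# Without the bridge, a Register no longer compares equal to its name.
equals_raw_string = left == "%r0"

# Pattern 1: wrap the expected value in Register().
wrapped_matches = left == Register("%r0")

# Pattern 2: extract .name when the test really cares about the text.
name_matches = left.name == "%r0"

# Pattern 3: normalise dict keys once, at the boundary.
raw_frame = {"%r0": 42, "%r1": 7}
registers = {Register(name): value for name, value in raw_frame.items()}
looked_up = registers[Register("%r0")]
```

<p>Which pattern fits depends on what the assertion is about: identity of the register, its printed form, or a lookup in a register-keyed mapping.</p>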

<p>The <code class="language-plaintext highlighter-rouge">IRInstruction</code> class itself was replaced with a factory function of the same name: callers keep the same <code class="language-plaintext highlighter-rouge">IRInstruction(opcode=Opcode.X, operands=[...])</code> interface, but the return value is now a typed <code class="language-plaintext highlighter-rouge">InstructionBase</code> subclass. This preserved backward compatibility for ~50 remaining construction sites (mostly in test helpers and COBOL <code class="language-plaintext highlighter-rouge">inline_ir</code>) while eliminating the class.</p>
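<p>A factory of this shape might look like the sketch below. It is illustrative only: the two opcodes, the operand layouts, and the <code class="language-plaintext highlighter-rouge">_BUILDERS</code> table are assumptions, and the real factory covers the full instruction set.</p>

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Callable, Sequence


class Opcode(Enum):
    LABEL = auto()
    BRANCH = auto()


@dataclass(frozen=True)
class InstructionBase:
    pass


@dataclass(frozen=True)
class Label_(InstructionBase):
    name: str


@dataclass(frozen=True)
class Branch(InstructionBase):
    target: str


# Hypothetical operand layouts: each builder maps positional operands to named fields.
_BUILDERS: dict[Opcode, Callable[[Sequence[Any]], InstructionBase]] = {
    Opcode.LABEL: lambda ops: Label_(name=ops[0]),
    Opcode.BRANCH: lambda ops: Branch(target=ops[0]),
}


def IRInstruction(*, opcode: Opcode, operands: Sequence[Any] = ()) -> InstructionBase:
    """Same call shape as the old class, but returns a typed instruction."""
    return _BUILDERS[opcode](operands)


inst = IRInstruction(opcode=Opcode.BRANCH, operands=["L_exit"])
```

<p>Callers written against the old constructor keep working, while everything downstream only ever sees <code class="language-plaintext highlighter-rouge">InstructionBase</code> subclasses.</p>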

<p>The remaining work — converting those ~50 construction sites to direct typed instruction construction and removing the factory — is tracked as a follow-up. The factory is a thin shim, not a compatibility bridge. The bridge is gone.</p>

<hr />

<h2 id="learnings">Learnings</h2>

<h3 id="on-type-hints-in-python">On Type Hints in Python</h3>

<p>I spent years writing Java, C#, C, C++, and Ruby before this project, and tried Scala and Haskell along the way. In the statically typed ones, the compiler catches the kind of errors that were invisible in this codebase for months: passing a variable name where a register was expected, comparing a label to None without narrowing, indexing into a bag of Any and hoping the position was right. Python’s dynamism made the rapid prototyping phase fast. Fifteen frontends, a VM, a type inference engine, an interprocedural analysis pipeline, all built in a few months. But the cost <em>accumulated silently</em>.</p>

<p>Turning on Pyright made that cost visible. The 83+ errors in the IR layer were not bugs in the running code. They were a <em>map of fragility</em>: every place where a wrong assumption would pass at runtime and corrupt data quietly. Fixing them properly required domain types, not just annotations. <code class="language-plaintext highlighter-rouge">str</code> with <code class="language-plaintext highlighter-rouge">""</code> as a sentinel became <code class="language-plaintext highlighter-rouge">Register</code> with a <code class="language-plaintext highlighter-rouge">NoRegister</code> null object. <code class="language-plaintext highlighter-rouge">str | None</code> became <code class="language-plaintext highlighter-rouge">CodeLabel</code> with <code class="language-plaintext highlighter-rouge">NoCodeLabel</code>. <code class="language-plaintext highlighter-rouge">operands: list[Any]</code> became 31 frozen dataclasses with named typed fields.</p>

<p>The investment was significant: 32 commits, 100 files, two days of migration. The type hint tightening is still a work in progress; 18 primitive <code class="language-plaintext highlighter-rouge">str</code> and <code class="language-plaintext highlighter-rouge">int</code> fields remain on the instruction classes, and the broader codebase has areas where annotations are loose or absent. But the return is already visible: Pyright now catches at the edit site what previously surfaced as a wrong value three layers deep in the VM. For a project of this size, built primarily through AI-assisted development where the AI generates most of the code, that static safety net is worth more than any number of runtime tests. <strong>Tests verify behaviour you thought to check. Types prevent mistakes you didn’t think to check for.</strong></p>

<h3 id="on-scope">On Scope</h3>

<p><strong>Scope is measured by the cascade, not the initial change.</strong> Changing <code class="language-plaintext highlighter-rouge">BasicBlock.label</code> from <code class="language-plaintext highlighter-rouge">str</code> to <code class="language-plaintext highlighter-rouge">CodeLabel</code> was 5 lines. That change cascaded into 40 files because the label was used as a dict key, a lookup value, a comparison target, and a format string argument across every architectural layer. I treated it as “lots of small changes” and attempted it in one session. The session was reverted. The same work, decomposed into four rings with separate issues and separate sessions, shipped cleanly.</p>

<p><strong>File issues for the next ring.</strong> The CodeLabel migration into <code class="language-plaintext highlighter-rouge">IRInstruction.label</code> was one commit. Propagation into <code class="language-plaintext highlighter-rouge">BasicBlock</code> and <code class="language-plaintext highlighter-rouge">cfg.blocks</code> was a separate issue. Propagation into <code class="language-plaintext highlighter-rouge">FuncRef.label</code> and <code class="language-plaintext highlighter-rouge">ClassRef.label</code> was a third. Each ring was bounded, independently committable, and went through the verification gate on its own.</p>

<pre><code class="language-mermaid">flowchart TD
    R0("Ring 0&lt;br/&gt;IRInstruction.label&lt;br/&gt;&lt;i&gt;CodeLabel / NoCodeLabel&lt;/i&gt;"):::ring0
    R1("Ring 1&lt;br/&gt;BasicBlock, CFG,&lt;br/&gt;StateUpdate, handlers"):::ring1
    R2("Ring 2&lt;br/&gt;FuncRef, ClassRef,&lt;br/&gt;FunctionEntry, InstructionLocation"):::ring2
    R3("Ring 3&lt;br/&gt;Remove _coerce_label,&lt;br/&gt;push wrapping upstream"):::ring3

    R0 -- "separate issue" --&gt; R1
    R1 -- "separate issue" --&gt; R2
    R2 -- "separate issue" --&gt; R3

    classDef ring0 fill:#e8f4fd,stroke:#4a90d9,stroke-width:3px,color:#1a3a5c
    classDef ring1 fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef ring2 fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef ring3 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p><strong>Classify complexity before starting, not after.</strong> The multi-file linker was obviously Heavy (500+ lines, a new package, and changes to the VM, registry, API, and MCP), yet it was treated as a single unbroken implementation session. Breaking it into independently committable units from the start would have produced cleaner history and caught issues earlier.</p>

<h3 id="on-type-migrations">On Type Migrations</h3>

<p><strong>Never auto-coerce during a migration.</strong> A Pydantic validator that auto-converted <code class="language-plaintext highlighter-rouge">str</code> to <code class="language-plaintext highlighter-rouge">CodeLabel</code> made all tests pass immediately. It also masked every call site that needed updating. A failing test at a call site is <em>information</em>: the caller hasn’t been migrated. Silencing it with auto-coercion destroys that information.</p>
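<p>A minimal sketch of the difference, using plain dataclasses rather than Pydantic to stay self-contained. <code class="language-plaintext highlighter-rouge">CoercingBlock</code>, <code class="language-plaintext highlighter-rouge">StrictBlock</code>, and <code class="language-plaintext highlighter-rouge">unmigrated_caller</code> are hypothetical names for illustration, not RedDragon classes:</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeLabel:
    """Hypothetical stand-in for the real CodeLabel domain type."""
    name: str


@dataclass
class CoercingBlock:
    """Anti-pattern: silently wraps raw strings, so unmigrated callers keep passing."""
    label: CodeLabel

    def __post_init__(self):
        if isinstance(self.label, str):
            self.label = CodeLabel(self.label)


@dataclass
class StrictBlock:
    """Migration-friendly: an unmigrated caller fails loudly at the call site."""
    label: CodeLabel

    def __post_init__(self):
        if not isinstance(self.label, CodeLabel):
            raise TypeError(f"label must be CodeLabel, got {type(self.label).__name__}")


def unmigrated_caller(block_cls):
    # Still passes a raw string, as pre-migration code would.
    return block_cls(label="entry")
```

<p>With the coercing variant, <code class="language-plaintext highlighter-rouge">unmigrated_caller</code> succeeds and the stale call site stays invisible. With the strict variant it raises <code class="language-plaintext highlighter-rouge">TypeError</code>, which is exactly the information the migration needs.</p>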

<p><strong>Audit implicit protocol usages.</strong> Replacing <code class="language-plaintext highlighter-rouge">str</code> result registers with <code class="language-plaintext highlighter-rouge">Register</code> objects broke every <code class="language-plaintext highlighter-rouge">if inst.result_reg:</code> check, because a dataclass instance is truthy by default in Python (unless the class defines <code class="language-plaintext highlighter-rouge">__bool__</code> or <code class="language-plaintext highlighter-rouge">__len__</code>), whereas an empty string is falsy. When replacing a primitive with a domain type, audit <code class="language-plaintext highlighter-rouge">bool()</code>, <code class="language-plaintext highlighter-rouge">len()</code>, <code class="language-plaintext highlighter-rouge">in</code>, <code class="language-plaintext highlighter-rouge">[]</code>, comparison operators, and string formatting. Each one is a potential semantic break that passes silently.</p>
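<p>The truthiness break can be reproduced in a few lines. This is an illustrative sketch: the <code class="language-plaintext highlighter-rouge">Register</code> shape, the <code class="language-plaintext highlighter-rouge">NO_REGISTER</code> sentinel, and <code class="language-plaintext highlighter-rouge">is_present()</code> are assumptions, not the project’s actual definitions:</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    """Hypothetical domain type replacing a raw register-name string."""
    name: str

    def is_present(self):
        return self.name != ""


NO_REGISTER = Register("")

# Before the migration: an empty string meant "no result register".
old_result_reg = ""
assert not old_result_reg

# After the migration: the same implicit check is silently always true.
new_result_reg = NO_REGISTER
assert bool(new_result_reg) is True      # truthy despite meaning "absent"
assert not new_result_reg.is_present()   # the explicit check gives the intended answer
```

<p>An explicit method or sentinel comparison keeps the intent visible at each call site, which is one way to make the audit mechanical.</p>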

<p><strong>Edge cases cluster at boundaries.</strong> The four Layer 1 edge cases (COBOL literals in call operands, <code class="language-plaintext highlighter-rouge">_resolve_reg</code> fallback, <code class="language-plaintext highlighter-rouge">_is_temporary_register</code> subscripting, SpreadArguments in StoreIndex) all occurred where the type change met a convention or assumption from a different subsystem. The type change itself was mechanical. The <em>boundary interactions</em> were where the bugs lived.</p>

<p><strong>Stringly-typed representations accumulate hidden conventions.</strong> <code class="language-plaintext highlighter-rouge">BRANCH_IF</code> stored branch targets as a comma-separated string. This survived 10 months and 12,000+ tests. Three files independently implemented the same split logic, and none of it was documented. A convention like this only surfaces when a refactoring forces someone to touch the representation.</p>
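<p>As a rough illustration of the convention and of a typed alternative: the label names and the <code class="language-plaintext highlighter-rouge">BranchTargets</code> class below are hypothetical, and the exact string format used by <code class="language-plaintext highlighter-rouge">BRANCH_IF</code> is assumed to be two labels joined by a comma.</p>

```python
from dataclasses import dataclass


def split_targets(raw):
    """The hidden convention: every consumer must know to split on a comma."""
    true_label, false_label = raw.split(",")
    return true_label, false_label


@dataclass(frozen=True)
class BranchTargets:
    """Hypothetical typed replacement: the structure is part of the type."""
    if_true: str
    if_false: str


stringly = "L_then,L_else"
typed = BranchTargets(if_true="L_then", if_false="L_else")
```

<p>With the typed form there is nothing to re-implement in each consumer, and a type checker can see when a caller passes only one target.</p>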

<h3 id="on-plans">On Plans</h3>

<p><strong>The end state was clear; the intermediate steps were not.</strong> The original four-phase plan correctly described the goal (typed instructions with Register fields, no IRInstruction). What it underestimated was the number of intermediate layers needed to get there safely. Operand-level register migration was a whole layer the plan didn’t account for. The plan had to be replaced mid-execution with a five-layer plan decomposed into 28 sub-issues. Getting the destination right is the easy part. Mapping the route, with each step <em>independently committable and testable</em>, is where the planning effort belongs.</p>

<p><strong>Round-trip tests verify consistency, not completeness.</strong> <code class="language-plaintext highlighter-rouge">LoadRegion</code> was defined with 3 fields, but the actual instruction has 4 operands. The round-trip test passed perfectly because a 3-field class round-trips correctly to a 3-operand instruction; nothing in the test knew about the fourth operand. Verification should cross-reference against the authoritative source (the handler, the IR spec), not just test internal consistency.</p>
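<p>A small sketch of why the two checks differ. The field names and the <code class="language-plaintext highlighter-rouge">SPEC_OPERAND_COUNT</code> table are invented for illustration; in practice the count would come from the handler or the IR spec:</p>

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LoadRegion:
    """Deliberately incomplete: three fields, mirroring the bug described above."""
    result_reg: str
    region: str
    offset: str

    def to_operands(self):
        return [self.result_reg, self.region, self.offset]

    @classmethod
    def from_operands(cls, operands):
        return cls(*operands)


# Hypothetical authoritative source: operand counts taken from the handler or IR spec.
SPEC_OPERAND_COUNT = {"LOAD_REGION": 4}


def round_trips(inst):
    """Internal consistency: encode then decode gives back the same object."""
    return type(inst).from_operands(inst.to_operands()) == inst


def matches_spec(cls, opcode):
    """Completeness: the class has as many fields as the spec has operands."""
    return len(fields(cls)) == SPEC_OPERAND_COUNT[opcode]
```

<p>Here <code class="language-plaintext highlighter-rouge">round_trips</code> returns <code class="language-plaintext highlighter-rouge">True</code> for any instance while <code class="language-plaintext highlighter-rouge">matches_spec(LoadRegion, "LOAD_REGION")</code> returns <code class="language-plaintext highlighter-rouge">False</code>. Only the second check would have flagged the missing operand.</p>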

<p><strong>Start with an audit, not implementation.</strong> The Superpowers session started by surveying every <code class="language-plaintext highlighter-rouge">to_typed()</code> call, every <code class="language-plaintext highlighter-rouge">inst.operands[</code> access, every <code class="language-plaintext highlighter-rouge">opcode==</code> comparison, every <code class="language-plaintext highlighter-rouge">IRInstruction(</code> construction. This took 10 minutes and revealed that the design doc’s estimates were wrong in both directions: handlers had 0 <code class="language-plaintext highlighter-rouge">to_typed()</code> calls (not 72), COBOL <code class="language-plaintext highlighter-rouge">ir_encoders.py</code> had 106 constructions (not counted at all). The upfront survey saved hours of mid-implementation replanning.</p>
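<p>An audit of this kind can be approximated with a short script. The sketch below is not the survey the agent actually ran: the regexes are my approximation of the four patterns named above, and the sample sources are made up.</p>

```python
import re
from collections import Counter

# Patterns taken from the survey described above; the regexes are an approximation.
PATTERNS = {
    "to_typed_calls": re.compile(r"\bto_typed\("),
    "operand_indexing": re.compile(r"\binst\.operands\["),
    "opcode_comparisons": re.compile(r"\bopcode\s*=="),
    "ir_constructions": re.compile(r"\bIRInstruction\("),
}


def audit(sources):
    """Count pattern hits per file. sources maps a file name to its text."""
    report = {}
    for filename, text in sources.items():
        counts = Counter()
        for name, pattern in PATTERNS.items():
            counts[name] = len(pattern.findall(text))
        report[filename] = dict(counts)
    return report


# Made-up sample input standing in for real source files.
sample = {
    "handlers.py": "if inst.opcode == Opcode.BINOP:\n    lhs = inst.operands[0]\n",
    "ir_encoders.py": "a = IRInstruction(op)\nb = IRInstruction(op)\n",
}
report = audit(sample)
```

<p>Text matching over-counts and under-counts at the margins (comments, aliased names), but it is cheap enough to run before every layer and compare against the plan’s estimates.</p>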

<h3 id="on-tooling">On Tooling</h3>

<p><strong>Start with the verification gate from day one.</strong> The <code class="language-plaintext highlighter-rouge">.importlinter</code> failure was avoidable. Having “run black” and “run tests” as separate loose rules instead of a single named gate made it easy to skip lint-imports entirely.</p>

<p><strong>Enforce brainstorm-before-implement from session one.</strong> The linker patch-on-patch problem cost more time than the proper rewrite. A brainstorm phase that reads actual code, not just designs on paper, would have caught the VM dispatch mismatch immediately.</p>

<p><strong>File issues before work, not after.</strong> Retroactive issue filing loses the intent: why was this work started? What was the expected outcome? Issues filed before work serve as specifications. Issues filed after are bookkeeping.</p>

<hr />

<h2 id="takeaways">Takeaways</h2>

<p><strong>The GSD 2 patterns held up as guardrails.</strong> Five workflow patterns were adopted from GSD 2 and then tested against the largest refactoring in the project:</p>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>What It Solved</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Enforced phases</td>
      <td>Skipped brainstorm → wrong design → patch-on-patch</td>
    </tr>
    <tr>
      <td>Complexity classification</td>
      <td>Heavy work attempted as single pass → messy history</td>
    </tr>
    <tr>
      <td>Verification gate</td>
      <td>Scattered “run X” rules → missed lint-imports</td>
    </tr>
    <tr>
      <td>Fresh context for heavy tasks</td>
      <td>Design document anchoring → flawed assumptions carried into implementation</td>
    </tr>
    <tr>
      <td>State on disk</td>
      <td>Missing backups → lost issue state on session crash</td>
    </tr>
  </tbody>
</table>

<p><strong>Superpowers added what GSD v2 lacked.</strong> GSD v2 gave me process discipline. Superpowers gave me decision discipline and orchestration:</p>

<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>GSD v2 (CLAUDE.md rules)</th>
      <th>Superpowers (plugin)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Brainstorming</td>
      <td>“brainstorm first” rule, often skipped</td>
      <td>Structured gate: one question at a time, options, recommendations</td>
    </tr>
    <tr>
      <td>Decomposition</td>
      <td>Manual, done by me before the session</td>
      <td>Agent proposes, I choose: “one per frontend” vs “grouped by family”</td>
    </tr>
    <tr>
      <td>Review</td>
      <td>Self-review checklist</td>
      <td>Two-stage: spec compliance reviewer + code quality reviewer (separate agents)</td>
    </tr>
    <tr>
      <td>Discovered work</td>
      <td>Noted in commit messages, often lost</td>
      <td>Filed as issues with dependency links to the epic</td>
    </tr>
    <tr>
      <td>Plan revision</td>
      <td>Manual, often deferred</td>
      <td>Agent resurveys after each layer, rewrites estimates, updates dependency graph</td>
    </tr>
    <tr>
      <td>Consultation</td>
      <td>“stop and consult” rule, inconsistently followed</td>
      <td>Natural pause points at each brainstorming question and review finding</td>
    </tr>
  </tbody>
</table>

<p>The distinction matters. GSD v2 tells the agent <em>what to do</em> (brainstorm, then plan, then implement). Superpowers tells the agent <em>when to stop and ask</em> (is this the right decomposition? did the spec reviewer approve? should this discovered issue block the epic?). For a refactoring that took 32 commits across 100 files, the stopping points were where the real decisions happened.</p>

<p><strong>They serve different sweet spots.</strong> I would still reach for GSD v2 for greenfield work of moderate complexity where I want to move fast: adding a new frontend, building a new analysis pass, scaffolding a new module. GSD v2’s autonomous dispatch is an advantage when the work is well-understood and doesn’t require mid-stream design decisions. For Heavy refactoring work that crosses architectural boundaries and accumulates edge cases, Superpowers’ consultation model is the better fit.</p>

<p><strong>The patterns are tool-agnostic.</strong> They work whether you’re using GSD 2, Superpowers, Cursor, or any other AI coding assistant. The common thread is treating AI-assisted development as an engineering practice that needs the same discipline as human development: phase gates, complexity awareness, verification, and persistent state.</p>

<table>
  <thead>
    <tr>
      <th>Moment in the refactoring</th>
      <th>Ad-hoc (before either framework)</th>
      <th>With GSD v2 patterns</th>
      <th>With Superpowers</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Linker design was wrong</td>
      <td>5 patches before realising</td>
      <td>Would have stopped at patch 2 (“stop and consult” rule)</td>
      <td>Would not have started: brainstorming gate asks “have you read the VM dispatch code?”</td>
    </tr>
    <tr>
      <td>Layer 4 decomposition (17 frontends)</td>
      <td>One long session, hoping for the best</td>
      <td>Classified as Heavy, broken into units manually</td>
      <td>Agent proposed 3 options (per-frontend, by size tier, by family), I picked one</td>
    </tr>
    <tr>
      <td>COBOL literal operands discovered during Layer 1</td>
      <td>Noted in commit message, forgotten</td>
      <td>Filed as issue retroactively after I noticed</td>
      <td>Agent asked “are there issues filed for follow-up work?”, filed 4 issues with dependency links</td>
    </tr>
    <tr>
      <td>Layer 3 site counts were wrong (~40 estimated, 128 actual)</td>
      <td>Discovered mid-implementation, improvised</td>
      <td>Discovered mid-implementation, adjusted manually</td>
      <td>Discovered by upfront audit before implementation started</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">types.UnionType</code> vs <code class="language-plaintext highlighter-rouge">typing.Union</code> bug in <code class="language-plaintext highlighter-rouge">map_registers</code></td>
      <td>Would have surfaced as a runtime crash in Layer 3</td>
      <td>Self-review checklist might have caught it</td>
      <td>Code quality reviewer (separate agent) caught it during plan review</td>
    </tr>
    <tr>
      <td>Removing <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code> (300+ sites)</td>
      <td>Would not have attempted incrementally</td>
      <td>Would have attempted, likely stalled on scope</td>
      <td>Agent fixed 300+ sites in one pass, choosing the right fix pattern per site</td>
    </tr>
  </tbody>
</table>

<p><strong>Compatibility bridges make incremental migration possible.</strong> <code class="language-plaintext highlighter-rouge">Register.__eq__(str)</code> was a deliberate compatibility bridge: it made <code class="language-plaintext highlighter-rouge">Register</code> compare equal to raw strings, letting 12,924 tests pass without modification during the migration. The plan explicitly called for removing it in Layer 5, after all consumers had been updated. Without the bridge, every layer would have been a big-bang change.</p>
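<p>A bridge of this kind might look roughly like the following. This is a sketch under assumptions, not the project’s <code class="language-plaintext highlighter-rouge">Register</code>: in particular, hashing like the underlying string is my own choice so that dict lookups keep working in both directions.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)
class Register:
    """Hypothetical sketch of a Register with a temporary str-equality bridge."""
    name: str

    def __eq__(self, other):
        if isinstance(other, Register):
            return self.name == other.name
        if isinstance(other, str):
            # Compatibility bridge: remove once every consumer compares Registers.
            return self.name == other
        return NotImplemented

    def __hash__(self):
        # Hash like the underlying string so Register and str keys find each other.
        return hash(self.name)
```

<p>Removing the bridge later means deleting the <code class="language-plaintext highlighter-rouge">str</code> branch, at which point every remaining string comparison starts failing and can be fixed site by site.</p>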

<p><strong>Reverts are progress.</strong> The abandoned CodeLabel cascade (attempted, reverted, and documented in the Beads issue) was not wasted time. It produced the scope analysis that made the ring-by-ring approach possible.</p>

<p><strong>The plan will be wrong. Plan for that.</strong> The four-phase plan became a five-layer plan with 28 sub-issues. The CodeLabel single-pass attempt became a four-ring incremental approach. In both cases, the revised plan was better than the original, not because planning was futile, but because the first attempt revealed the <em>real shape of the work</em>.</p>

<p><strong>Write down your refactoring principles before the next refactoring.</strong> The “Refactoring Principles” section of CLAUDE.md, written after the CodeLabel experience, prevented at least three of the same mistakes during the Register migration. Rules extracted from failures and codified before the next attempt compound over time.</p>

<p><strong>The AI writes the code. You manage the scope.</strong> Across this entire migration, the AI’s code was overwhelmingly correct at the mechanical level: updating 31 fields, threading kwargs, updating converters, formatting with Black. Where it failed was scope: generating a class without checking the handler spec, proposing coercion validators that masked call sites, not flagging that operand fields needed their own migration layer. The human’s job in a type migration is not typing. It’s asking <em>“what does this cascade into?”</em> at every step.</p>

<p>The AI writes code faster than I can. It doesn’t write <em>better</em> code without guardrails.</p>]]></content><author><name>avishek</name></author><category term="Software Engineering" /><category term="AI-Assisted Development" /><category term="Developer Tooling" /><category term="Workflow" /><category term="Refactoring" /><category term="Compilers" /><category term="Type Safety" /><summary type="html"><![CDATA[Two agentic development frameworks applied to the same multi-layer type migration across a 13,000-test compiler pipeline. The first provided process discipline but stalled partway through. The second completed the work because it planned more thoroughly and stopped to ask.]]></summary></entry><entry><title type="html">Engineering Log: Anatomy of a Refactoring Using AI</title><link href="https://avishek.net/2026/03/13/anatomy-of-a-refactoring-using-ai.html" rel="alternate" type="text/html" title="Engineering Log: Anatomy of a Refactoring Using AI" /><published>2026-03-13T00:00:00+05:30</published><updated>2026-03-13T00:00:00+05:30</updated><id>https://avishek.net/2026/03/13/anatomy-of-a-refactoring-using-ai</id><content type="html" xml:base="https://avishek.net/2026/03/13/anatomy-of-a-refactoring-using-ai.html"><![CDATA[<p><em>Tracing the full arc of a multi-phase refactoring — from “Java string concatenation crashes the VM” to “every value in the system carries its type” — done across a dozen sessions with Claude Code over two days.</em></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ul>
  <li><a href="#the-problem">The Problem</a></li>
  <li><a href="#the-tools">The Tools</a>
    <ul>
      <li><a href="#superpowers-enforced-design-discipline">Superpowers: Enforced Design Discipline</a></li>
      <li><a href="#beads-local-first-task-tracking">Beads: Local-First Task Tracking</a></li>
      <li><a href="#code-simplifier-automated-cleanup-after-every-change">Code Simplifier: Automated Cleanup After Every Change</a></li>
    </ul>
  </li>
  <li><a href="#phase-1-typedvalue-and-binopcoercionstrategy">Phase 1: TypedValue and BinopCoercionStrategy</a>
    <ul>
      <li><a href="#the-design-decision-that-shaped-everything">The Design Decision That Shaped Everything</a></li>
      <li><a href="#the-boundary-table">The Boundary Table</a></li>
    </ul>
  </li>
  <li><a href="#phase-2-handler-migration">Phase 2: Handler Migration</a>
    <ul>
      <li><a href="#the-serializedeserialize-roundtrip">The Serialize/Deserialize Roundtrip</a></li>
    </ul>
  </li>
  <li><a href="#phase-3-return-values">Phase 3: Return Values</a>
    <ul>
      <li><a href="#the-constructor-bug">The Constructor Bug</a></li>
    </ul>
  </li>
  <li><a href="#phases-46-heap-and-closures">Phases 4–6: Heap and Closures</a>
    <ul>
      <li><a href="#the-double-wrapping-landmine">The Double-Wrapping Landmine</a></li>
    </ul>
  </li>
  <li><a href="#phase-7-cleaning-up-after-ourselves">Phase 7: Cleaning Up After Ourselves</a></li>
  <li><a href="#detour-builtins">Detour: Builtins</a>
    <ul>
      <li><a href="#builtinresult-side-effects-should-be-declarative">BuiltinResult: Side Effects Should Be Declarative</a></li>
      <li><a href="#builtin-args-the-atomic-commit-problem">Builtin Args: The Atomic Commit Problem</a></li>
    </ul>
  </li>
  <li><a href="#more-detours">More Detours</a>
    <ul>
      <li><a href="#binopcoercionstrategy-return-type">BinopCoercionStrategy Return Type</a></li>
      <li><a href="#unopcoercionstrategy">UnopCoercionStrategy</a></li>
      <li><a href="#demo-scripts-and-llm-path-leaks">Demo Scripts and LLM Path Leaks</a></li>
      <li><a href="#the-question-that-came-after">The Question That Came After</a></li>
    </ul>
  </li>
  <li><a href="#the-shape-of-the-work">The Shape of the Work</a></li>
  <li><a href="#takeaways">Takeaways</a></li>
</ul>

<hr />

<h2 id="the-problem">The Problem</h2>

<p><a href="https://github.com/avishek-sen-gupta/red-dragon">RedDragon</a> is a multi-language code analysis engine with a universal IR, deterministic VM, and a type system. It parses 15 languages, lowers them to IR, and executes the IR on a virtual machine.</p>

<p>The VM stored values as raw Python primitives — <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">str</code>, <code class="language-plaintext highlighter-rouge">float</code>, <code class="language-plaintext highlighter-rouge">bool</code>. Type information lived in a completely separate structure: a <code class="language-plaintext highlighter-rouge">TypeEnvironment</code> built by a static inference pass before execution. The operators themselves never saw types. They received raw values via <code class="language-plaintext highlighter-rouge">_resolve_reg</code> and used Python’s native operators.</p>

<p>This created an obvious problem. When Java code did <code class="language-plaintext highlighter-rouge">"int:" + 42</code>, Python raised <code class="language-plaintext highlighter-rouge">TypeError</code> because it can’t concatenate <code class="language-plaintext highlighter-rouge">str</code> and <code class="language-plaintext highlighter-rouge">int</code>. The VM caught the exception and degraded the result to a <code class="language-plaintext highlighter-rouge">SymbolicValue</code> — a placeholder meaning “I don’t know what this is.” The concrete information was gone. Java, C#, Kotlin, and Scala all auto-stringify non-string operands in string concatenation. The VM had no way to implement this because type information was absent at the point of operation.</p>
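<p>The failure mode can be sketched in a few lines. <code class="language-plaintext highlighter-rouge">legacy_binop</code> and the <code class="language-plaintext highlighter-rouge">hint</code> field are hypothetical simplifications of the behaviour described above, not the VM’s real handler:</p>

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolicValue:
    """Placeholder for a value the VM could not compute concretely."""
    hint: str


def legacy_binop(op, lhs, rhs):
    """Apply Python's native operator; degrade to symbolic on a type error."""
    try:
        return op(lhs, rhs)
    except TypeError:
        return SymbolicValue(hint=f"{type(lhs).__name__} op {type(rhs).__name__}")
```

<p>Calling <code class="language-plaintext highlighter-rouge">legacy_binop(operator.add, "int:", 42)</code> yields a <code class="language-plaintext highlighter-rouge">SymbolicValue</code> instead of the string a Java program would produce, because nothing at the point of operation knows the source language’s rules.</p>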

<p>The fix looked straightforward: make type information available to operators. What followed was a refactoring that touched almost every layer of the VM, exposed hidden assumptions in constructor handling, revealed that builtins were bypassing the state management contract, and prompted several side detours into coercion protocols, demo scripts, and the question of whether two separate type-tracking mechanisms were still both necessary.</p>

<p>This post traces that arc. It’s written partly as documentation and partly because the shape of the work — the way a focused fix expanded into a system-wide migration, the side detours, the bugs that only surfaced because something else changed — is characteristic of refactoring work in general. The AI didn’t change the nature of that work. It changed the speed.</p>

<p>All of this was done through conversations with Claude Code, using three tools that shaped the work as much as the code decisions did: <a href="https://github.com/anthropics/claude-code-plugins">Superpowers</a> for enforced design discipline, <a href="https://github.com/anthropics/beads">Beads</a> for local-first task tracking, and <a href="https://github.com/anthropics/claude-code-plugins">Code Simplifier</a> for automated post-implementation cleanup. The next section describes all three.</p>

<p>The refactoring spanned about a dozen sessions over two days. I’m including specific moments from those conversations — places where I had to course-correct, where I got frustrated with the codebase or the AI’s approach, where a question I asked led to discovering something unexpected — because the texture of those interactions is part of the story.</p>

<hr />

<h2 id="the-tools">The Tools</h2>

<p>Three tools ran alongside Claude Code throughout this migration. None are part of Claude Code itself — they’re open-source plugins that layer structure on top of it. I’m describing them here because the post references them repeatedly, and their constraints shaped how the work unfolded.</p>

<h3 id="superpowers-enforced-design-discipline">Superpowers: Enforced Design Discipline</h3>

<p><a href="https://github.com/anthropics/claude-code-plugins">Superpowers</a> is a skill system for Claude Code. Skills are structured prompts that activate automatically based on the task at hand. They don’t add capabilities the AI doesn’t have — they enforce workflows the AI would otherwise skip.</p>

<p>The skills that mattered for this migration:</p>

<p><strong>Brainstorming.</strong> Every major phase started here. The brainstorming skill runs a structured dialogue: it asks one clarifying question at a time, proposes alternative approaches with explicit trade-offs, and refuses to produce a design spec until you’ve agreed on the direction. It won’t let you skip to implementation. This is the skill that caught the serialize/deserialize split (Phase 2) and the <code class="language-plaintext highlighter-rouge">BuiltinResult</code> design (Detour) — cases where the obvious approach wasn’t the best one, and the skill’s insistence on proposing alternatives surfaced something simpler.</p>

<p>The brainstorming pipeline looks like this:</p>

<pre><code class="language-mermaid">flowchart LR
    Q("🔍 Clarifying&lt;br/&gt;questions&lt;br/&gt;&lt;i&gt;one at a time&lt;/i&gt;"):::ask --&gt; A("⚖️ Alternative&lt;br/&gt;approaches&lt;br/&gt;&lt;i&gt;with trade-offs&lt;/i&gt;"):::think
    A --&gt; D{"🎯 Design&lt;br/&gt;decision&lt;br/&gt;&lt;i&gt;user chooses&lt;/i&gt;"}:::decide
    D -- "more questions" --&gt; Q
    D -- "agreed" --&gt; S("📋 Design spec"):::output

    classDef ask fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef think fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef decide fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a

    linkStyle 2 stroke:#2e7d32,stroke-width:2px
</code></pre>

<p>The early phases (1–3) had longer brainstorming cycles — eight questions in Phase 1 before any code was discussed. By the later phases, the cycles were shorter because the patterns were established: the skill would propose an approach, I’d confirm it matched the established pattern, and we’d move to planning.</p>

<p><strong>Writing-plans.</strong> Takes the design spec from brainstorming and breaks it into granular TDD steps — test first, then implementation, then verification. Each step is small enough to be a single commit. The plan for Phase 5 (heap fields) had 12 steps; the plan for the builtin args migration (Detour) explicitly mandated zero intermediate commits because the interface change was atomic.</p>

<p><strong>Subagent-driven development.</strong> Dispatches fresh Claude Code agents per task from the plan, each with its own context window. The dispatching agent reviews each sub-agent’s work before accepting it. This is where the review caught the <code class="language-plaintext highlighter-rouge">value is not None</code> guard in Phase 3 — a sub-agent had taken a shortcut that violated the design spec, and the reviewing agent flagged it. Without the two-stage review, that shortcut would have shipped.</p>

<p>The full pipeline:</p>

<pre><code class="language-mermaid">flowchart LR
    B("🧠 Brainstorm"):::phase1 --&gt; SP("📋 Spec"):::phase2
    SP --&gt; PL("📐 Plan&lt;br/&gt;&lt;i&gt;TDD steps&lt;/i&gt;"):::phase2
    PL --&gt; SA("🤖 Sub-agents&lt;br/&gt;&lt;i&gt;one per task&lt;/i&gt;"):::phase3
    SA --&gt; RV("🔎 Review&lt;br/&gt;&lt;i&gt;two-stage&lt;/i&gt;"):::phase4
    RV --&gt; CM("✅ Commit"):::phase5

    classDef phase1 fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef phase2 fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef phase3 fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef phase4 fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef phase5 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>The discipline this pipeline enforces is not novel — brainstorm, spec, plan, implement, review is how careful engineering has always worked. What’s different is that a tool <em>enforces</em> the steps. When you’re twelve sessions into a migration and tempted to just start coding the next phase, the skill doesn’t let you. It asks its questions first.</p>

<h3 id="beads-local-first-task-tracking">Beads: Local-First Task Tracking</h3>

<p><a href="https://github.com/anthropics/beads">Beads</a> is a local-first issue tracker that lives alongside the repo as flat files. No server, no web UI, no syncing. Tasks are created and queried from the command line: <code class="language-plaintext highlighter-rouge">bd create</code>, <code class="language-plaintext highlighter-rouge">bd ready</code>, <code class="language-plaintext highlighter-rouge">bd update &lt;id&gt; --status closed</code>.</p>

<p>The features that mattered:</p>

<p><strong>Dependencies.</strong> Every Beads task can declare dependencies on other tasks. <code class="language-plaintext highlighter-rouge">bd ready</code> shows only tasks whose dependencies are all closed — the next unblocked work. This turned the dependency graph in <a href="#the-shape-of-the-work">The Shape of the Work</a> from a diagram into an operational tool. After closing a phase, <code class="language-plaintext highlighter-rouge">bd ready</code> surfaced whatever was next — the next planned phase or a detour that had just become unblocked.</p>
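<p>The rule behind <code class="language-plaintext highlighter-rouge">bd ready</code>, as described here, is simple enough to model. The sketch below is my own in-memory illustration of that rule; it says nothing about how Beads stores or queries tasks internally, and the task ids are invented.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical in-memory model of a tracked task; not Beads' storage format."""
    id: str
    status: str = "open"
    deps: list = field(default_factory=list)


def ready(tasks):
    """Return ids of open tasks whose dependencies are all closed."""
    by_id = {t.id: t for t in tasks}
    return [
        t.id
        for t in tasks
        if t.status == "open" and all(by_id[d].status == "closed" for d in t.deps)
    ]


tasks = [
    Task("phase-1", status="closed"),
    Task("phase-2", deps=["phase-1"]),
    Task("phase-3", deps=["phase-2"]),
]
```

<p>With these three tasks, <code class="language-plaintext highlighter-rouge">ready(tasks)</code> returns only <code class="language-plaintext highlighter-rouge">phase-2</code>; once that is closed, <code class="language-plaintext highlighter-rouge">phase-3</code> becomes the next unblocked item.</p>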

<p><strong>Instant triage.</strong> When a detour surfaced mid-session — the constructor bug in Phase 3, the builtin side-effect problem after Phase 7 — it got filed as a new Beads task in two seconds: <code class="language-plaintext highlighter-rouge">bd create "BuiltinResult: builtins bypass apply_update" --dep red-dragon-xyz</code>. The dependency was set, and the task appeared in <code class="language-plaintext highlighter-rouge">bd ready</code> at the right time. Without this, mid-session discoveries would have been either fixed immediately (derailing the current work) or forgotten.</p>

<p><strong>Mid-session pivots.</strong> The clearest example: midway through brainstorming the heap fields migration (<code class="language-plaintext highlighter-rouge">red-dragon-f6i</code>), I realized handlers needed to be migrated first. I interrupted, ran <code class="language-plaintext highlighter-rouge">bd update red-dragon-f6i --status deferred</code>, created the handler migration task, and pivoted. When the handler migration was done and closed, <code class="language-plaintext highlighter-rouge">bd ready</code> surfaced the deferred heap fields task automatically. The tracker absorbed the pivot without losing the deferred work.</p>

<p><strong>Session boundaries.</strong> Each Beads task has an ID (like <code class="language-plaintext highlighter-rouge">red-dragon-gsl</code>) that appears in commit messages and in the brainstorming/planning conversations. When a new Claude Code session starts, the first thing I do is <code class="language-plaintext highlighter-rouge">bd ready</code> — the task list is the handoff between sessions. The AI doesn’t need to remember what happened last session; the tracker tells it what’s next.</p>

<p>Over a dozen sessions and five detours, the pattern was: close a task → <code class="language-plaintext highlighter-rouge">bd ready</code> → claim the next one → brainstorm → plan → implement → commit → close. The tracker turned “I should also fix this other thing I just noticed” from a context-switching hazard into a two-second operation.</p>

<h3 id="code-simplifier-automated-cleanup-after-every-change">Code Simplifier: Automated Cleanup After Every Change</h3>

<p><a href="https://github.com/anthropics/claude-code-plugins">Code Simplifier</a> is a Claude Code plugin that runs as a dedicated review agent after implementation work completes. It focuses on code that was just modified — not the entire codebase — and refines it for clarity, consistency, and maintainability without changing behaviour.</p>

<p>In a migration like this, where sub-agents are churning out handler-by-handler changes across dozens of files, the code that lands is functional but not always clean. A sub-agent focused on migrating <code class="language-plaintext highlighter-rouge">_handle_store_field</code> to produce <code class="language-plaintext highlighter-rouge">TypedValue</code> will get the types right but might leave behind redundant intermediate variables, unnecessarily verbose conditionals, or naming inconsistencies with the surrounding code. Code Simplifier catches this.</p>

<p>What it does:</p>

<ul>
  <li><strong>Reduces unnecessary complexity.</strong> Nested ternaries become <code class="language-plaintext highlighter-rouge">if</code>/<code class="language-plaintext highlighter-rouge">else</code> chains. Three-line variable assignments that exist only to be passed once get inlined. Guard clauses replace deep nesting.</li>
  <li><strong>Eliminates redundant code.</strong> During the transition phases, handlers accumulated isinstance checks, temporary unwrap/rewrap sequences, and defensive guards that were necessary mid-migration but dead after the phase completed. Code Simplifier flagged many of these before Phase 7’s explicit cleanup pass.</li>
  <li><strong>Enforces consistency.</strong> When 15 handler groups are migrated one at a time across multiple sessions, naming conventions drift. One handler might call the coerced value <code class="language-plaintext highlighter-rouge">lhs_coerced</code>, another <code class="language-plaintext highlighter-rouge">coerced_lhs</code>, another <code class="language-plaintext highlighter-rouge">left</code>. Code Simplifier normalises these.</li>
</ul>

<p>The key constraint: it only touches recently modified code. It won’t “improve” stable code you didn’t ask about. This prevents the scope creep that happens when cleanup tools audit everything — you end up with a 200-file diff when you wanted a 3-file fix.</p>

<p>In practice, I invoked it after each major phase commit. The simplifier would produce a small follow-up diff — typically 10–30 lines changed — that tightened the code the sub-agents had just written. These were fast reviews because the behavioural correctness was already established by the tests; the simplifier was only adjusting form, not function.</p>

<hr />

<h2 id="phase-1-typedvalue-and-binopcoercionstrategy">Phase 1: TypedValue and BinopCoercionStrategy</h2>

<p>The first design session used the brainstorming skill to work through the problem space. The skill’s process is structured: it asks one clarifying question at a time, proposes approaches with trade-offs, and won’t let you skip to implementation until a design is approved. In this case, the dialogue went through eight questions before code was discussed.</p>

<p>The skill started with motivation: <em>“What’s the primary goal — language-correct operators, eliminating the side-car type system, or both?”</em> I said both. Then it asked whether <code class="language-plaintext highlighter-rouge">TypedValue</code> should wrap everything (even values with unknown types) or only values with known types. It recommended wrapping everything — even with <code class="language-plaintext highlighter-rouge">UNKNOWN</code> — to eliminate all “is this typed or raw?” branching. I agreed, and this turned out to be the single most important design decision of the entire migration.</p>

<p>The next questions drilled into specifics. Should <code class="language-plaintext highlighter-rouge">TypedValue</code> subsume <code class="language-plaintext highlighter-rouge">SymbolicValue</code> and <code class="language-plaintext highlighter-rouge">Pointer</code>, or wrap them? The skill proposed wrapping — keeping <code class="language-plaintext highlighter-rouge">SymbolicValue</code> as a value inside <code class="language-plaintext highlighter-rouge">TypedValue</code> rather than replacing it — which preserved the existing constraint-tracking machinery without duplication. Then: how should BINOP consume types? The skill proposed two approaches: (A) unwrap, operate on raw values, rewrap with inferred type, or (B) full type-driven dispatch where operators receive <code class="language-plaintext highlighter-rouge">TypedValue</code> throughout. It recommended A as the simpler path. I agreed, but added a requirement: <em>“BINOP should have access to a pluggable language-specific TypeConversionStrategy.”</em> That addition — mine, not the skill’s — became <code class="language-plaintext highlighter-rouge">BinopCoercionStrategy</code>.</p>

<p>The skill then proposed three migration approaches:</p>

<ol>
  <li><strong>Big Bang</strong> — change everything at once. Rejected as too risky.</li>
  <li><strong>Incremental with accessor protocol</strong> — wrap at <code class="language-plaintext highlighter-rouge">apply_update</code>, migrate handlers one by one. Recommended and chosen.</li>
  <li><strong>Transparent wrapper with magic methods</strong> — <code class="language-plaintext highlighter-rouge">TypedValue.__add__</code> delegates to the underlying value. The skill flagged this as violating the simplest-mechanism principle: <code class="language-plaintext highlighter-rouge">isinstance(val, int)</code> checks throughout the codebase would silently fail.</li>
</ol>
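<p>The objection to the third option is easy to reproduce in isolation. The following is a minimal sketch, not RedDragon code: <code class="language-plaintext highlighter-rouge">TransparentValue</code> and <code class="language-plaintext highlighter-rouge">describe</code> are hypothetical names invented for the illustration. It shows arithmetic working through the wrapper while an existing <code class="language-plaintext highlighter-rouge">isinstance</code> check quietly takes the wrong branch.</p>

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TransparentValue:
    """Hypothetical wrapper that forwards arithmetic to the wrapped value."""

    value: Any
    type_name: str

    def __add__(self, other: Any) -> Any:
        # Delegate to the underlying value so the wrapper "looks like" a number.
        other_raw = other.value if isinstance(other, TransparentValue) else other
        return self.value + other_raw


def describe(val: Any) -> str:
    # Stand-in for the many isinstance checks that already exist in handlers.
    if isinstance(val, int):
        return "int"
    return "not an int"


wrapped = TransparentValue(42, "Int")
total = wrapped + 1               # arithmetic still works through the wrapper
kind_raw = describe(42)           # takes the int branch
kind_wrapped = describe(wrapped)  # takes the fallback branch, and nothing raises
```

<p>Nothing fails loudly here, which is the problem: the wrapper passes every arithmetic test and still changes which branch the surrounding code takes.</p>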

<p>The last question was about scope: <em>“Where should language information live — on the value or on the strategy?”</em> The skill recommended the strategy: <em>“A Java Int and a C# Int are the same value with the same type; the difference is in the coercion rules applied to them.”</em> I agreed.</p>

<p>That session produced two things: <code class="language-plaintext highlighter-rouge">TypedValue</code> and <code class="language-plaintext highlighter-rouge">BinopCoercionStrategy</code>.</p>

<p><code class="language-plaintext highlighter-rouge">TypedValue</code> is a frozen dataclass:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">TypedValue</span><span class="p">:</span>
    <span class="n">value</span><span class="p">:</span> <span class="n">Any</span>       <span class="c1"># The raw Python value
</span>    <span class="nb">type</span><span class="p">:</span> <span class="n">TypeExpr</span>   <span class="c1"># The inferred or declared type
</span></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">BinopCoercionStrategy</code> is a protocol with two methods:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BinopCoercionStrategy</span><span class="p">(</span><span class="n">Protocol</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">coerce</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">op</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">lhs</span><span class="p">:</span> <span class="n">TypedValue</span><span class="p">,</span> <span class="n">rhs</span><span class="p">:</span> <span class="n">TypedValue</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">TypedValue</span><span class="p">,</span> <span class="n">TypedValue</span><span class="p">]:</span>
        <span class="bp">...</span>
    <span class="k">def</span> <span class="nf">result_type</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">op</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">lhs</span><span class="p">:</span> <span class="n">TypedValue</span><span class="p">,</span> <span class="n">rhs</span><span class="p">:</span> <span class="n">TypedValue</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">TypeExpr</span><span class="p">:</span>
        <span class="bp">...</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">coerce()</code> transforms operands before the operator runs. <code class="language-plaintext highlighter-rouge">result_type()</code> infers the output type. The executor calls both, wraps the result in <code class="language-plaintext highlighter-rouge">TypedValue</code>, and stores it.</p>

<p>Here’s the full round trip of a binary operation like <code class="language-plaintext highlighter-rouge">"int:" + 42</code> in Java:</p>

<pre><code class="language-mermaid">flowchart TD
    subgraph read ["① Read"]
        R1("%r1 → TypedValue('int:', String)"):::reg --&gt; RESOLVE1("_resolve_binop_operand"):::fn
        R2("%r2 → TypedValue(42, Int)"):::reg --&gt; RESOLVE2("_resolve_binop_operand"):::fn
    end

    subgraph coerce ["② Coerce"]
        RESOLVE1 --&gt; LHS("lhs: TypedValue('int:', String)"):::tv
        RESOLVE2 --&gt; RHS("rhs: TypedValue(42, Int)"):::tv
        LHS --&gt; COERCE("⚖️ BinopCoercionStrategy.coerce('+', lhs, rhs)&lt;br/&gt;&lt;i&gt;JavaBinopCoercion: stringify rhs&lt;/i&gt;"):::strategy
        RHS --&gt; COERCE
        COERCE --&gt; COERCED_L("TypedValue('int:', String)"):::tv
        COERCE --&gt; COERCED_R("TypedValue('42', String)"):::tvchanged
    end

    subgraph compute ["③ Compute"]
        COERCED_L -- ".value" --&gt; EVAL("Operators.eval_binop('+', 'int:', '42')"):::fn
        COERCED_R -- ".value" --&gt; EVAL
        EVAL --&gt; RESULT("result = 'int:42'"):::raw
    end

    subgraph typeinfer ["④ Type + Wrap"]
        LHS --&gt; RTYPE("BinopCoercionStrategy.result_type()"):::strategy
        RHS --&gt; RTYPE
        RTYPE --&gt; TYPE("String"):::type
        RESULT --&gt; WRAP("typed('int:42', String)"):::fn
        TYPE --&gt; WRAP
    end

    subgraph store ["⑤ Store"]
        WRAP --&gt; STORE("%r3 → TypedValue('int:42', String)"):::reg
        STORE --&gt; APPLY("apply_update()"):::fn
    end

    classDef reg fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef fn fill:#f5f5f5,stroke:#616161,stroke-width:1px,color:#212121
    classDef tv fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef tvchanged fill:#fff9c4,stroke:#f9a825,stroke-width:2px,color:#5c3a0a
    classDef strategy fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef raw fill:#fff3e0,stroke:#e8a735,stroke-width:1px,color:#5c3a0a
    classDef type fill:#fce4ec,stroke:#c62828,stroke-width:1px,color:#5c0a0a
</code></pre>

<p><code class="language-plaintext highlighter-rouge">DefaultBinopCoercion</code> leaves operands alone: its <code class="language-plaintext highlighter-rouge">coerce()</code> passes them through unchanged, and its <code class="language-plaintext highlighter-rouge">result_type()</code> infers types from operator categories (comparisons return <code class="language-plaintext highlighter-rouge">Bool</code>, arithmetic follows numeric promotion rules). <code class="language-plaintext highlighter-rouge">JavaBinopCoercion</code> overrides <code class="language-plaintext highlighter-rouge">coerce()</code> to auto-stringify the non-string operand when <code class="language-plaintext highlighter-rouge">+</code> is used with a <code class="language-plaintext highlighter-rouge">String</code>.</p>
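<p>To make the two strategies concrete, here is a rough sketch of how they might look. Several things are assumptions made for the illustration: types are plain strings rather than the real <code class="language-plaintext highlighter-rouge">TypeExpr</code>, the operator table is abbreviated, <code class="language-plaintext highlighter-rouge">execute_binop</code> is a hypothetical stand-in for the executor, and the Java strategy also overrides <code class="language-plaintext highlighter-rouge">result_type()</code> so that the original operands can be passed to it, as the diagram suggests.</p>

```python
import operator
from dataclasses import dataclass
from typing import Any

# Assumption: types are plain strings here; the real TypeExpr is richer.
TypeExpr = str

# Abbreviated operator table, just enough for the sketch.
OPS = {"+": operator.add, "-": operator.sub, "==": operator.eq, "!=": operator.ne}
COMPARISONS = frozenset({"==", "!="})


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: TypeExpr


class DefaultBinopCoercion:
    def coerce(self, op: str, lhs: TypedValue, rhs: TypedValue) -> tuple[TypedValue, TypedValue]:
        # No-op: operands pass through unchanged.
        return lhs, rhs

    def result_type(self, op: str, lhs: TypedValue, rhs: TypedValue) -> TypeExpr:
        if op in COMPARISONS:
            return "Bool"
        if lhs.type == rhs.type:
            return lhs.type
        if {lhs.type, rhs.type} == {"Int", "Float"}:
            return "Float"  # simplified numeric promotion
        return "UNKNOWN"


class JavaBinopCoercion(DefaultBinopCoercion):
    def coerce(self, op: str, lhs: TypedValue, rhs: TypedValue) -> tuple[TypedValue, TypedValue]:
        if op == "+" and "String" in (lhs.type, rhs.type):
            return self._stringify(lhs), self._stringify(rhs)
        return lhs, rhs

    def result_type(self, op: str, lhs: TypedValue, rhs: TypedValue) -> TypeExpr:
        # Assumption: the post only says coerce() is overridden.
        if op == "+" and "String" in (lhs.type, rhs.type):
            return "String"
        return super().result_type(op, lhs, rhs)

    @staticmethod
    def _stringify(tv: TypedValue) -> TypedValue:
        # str() stands in for Java's string conversion rules.
        return tv if tv.type == "String" else TypedValue(str(tv.value), "String")


def execute_binop(strategy: DefaultBinopCoercion, op: str, lhs: TypedValue, rhs: TypedValue) -> TypedValue:
    # Coerce, compute on raw values, then rewrap with the inferred type.
    coerced_lhs, coerced_rhs = strategy.coerce(op, lhs, rhs)
    raw = OPS[op](coerced_lhs.value, coerced_rhs.value)
    return TypedValue(raw, strategy.result_type(op, lhs, rhs))


java_result = execute_binop(JavaBinopCoercion(), "+", TypedValue("int:", "String"), TypedValue(42, "Int"))
default_result = execute_binop(DefaultBinopCoercion(), "+", TypedValue(1, "Int"), TypedValue(2, "Int"))
comparison = execute_binop(DefaultBinopCoercion(), "==", TypedValue(1, "Int"), TypedValue(1, "Int"))
```

<p>With the Java strategy the mixed <code class="language-plaintext highlighter-rouge">String + Int</code> case yields a <code class="language-plaintext highlighter-rouge">String</code>-typed <code class="language-plaintext highlighter-rouge">'int:42'</code>. The same operands under the default strategy would reach the raw operator uncoerced and raise a Python <code class="language-plaintext highlighter-rouge">TypeError</code>.</p>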

<h3 id="the-design-decision-that-shaped-everything">The Design Decision That Shaped Everything</h3>

<p>The critical decision was: <strong>“language on strategy, not value.”</strong> A Java <code class="language-plaintext highlighter-rouge">Int</code> and a C# <code class="language-plaintext highlighter-rouge">Int</code> are the same <code class="language-plaintext highlighter-rouge">TypedValue</code>. The difference is in the injected coercion strategy. This meant we didn’t need a <code class="language-plaintext highlighter-rouge">JavaInt</code> vs. <code class="language-plaintext highlighter-rouge">CSharpInt</code> distinction. The coercion strategy is selected once at the top of the execution pipeline based on the source language, then threaded through via dependency injection.</p>
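<p>One way to picture “selected once, then threaded through” is a small registry consulted when the executor is built. This is only a sketch under assumptions: the <code class="language-plaintext highlighter-rouge">STRATEGIES</code> table, <code class="language-plaintext highlighter-rouge">strategy_for</code>, <code class="language-plaintext highlighter-rouge">Executor</code>, and <code class="language-plaintext highlighter-rouge">build_executor</code> are hypothetical names, and the fallback to the default strategy for unlisted languages is a guess about how the real pipeline behaves.</p>

```python
from dataclasses import dataclass


class DefaultBinopCoercion:
    """Placeholder for the pass-through strategy."""


class JavaBinopCoercion(DefaultBinopCoercion):
    """Placeholder for the Java-specific strategy."""


# Hypothetical registry: language name to strategy class.
STRATEGIES: dict[str, type[DefaultBinopCoercion]] = {
    "java": JavaBinopCoercion,
}


def strategy_for(language: str) -> DefaultBinopCoercion:
    # Assumption: languages without an entry fall back to the default strategy.
    return STRATEGIES.get(language, DefaultBinopCoercion)()


@dataclass
class Executor:
    # Injected once at construction; handlers never look at the language again.
    coercion: DefaultBinopCoercion


def build_executor(language: str) -> Executor:
    return Executor(coercion=strategy_for(language))


java_executor = build_executor("java")
lua_executor = build_executor("lua")
```

<p>The values themselves carry no language tag in this arrangement; swapping the language means swapping one constructor argument.</p>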

<p>The other critical decision: <strong>every value is <code class="language-plaintext highlighter-rouge">TypedValue</code>, even when the type is <code class="language-plaintext highlighter-rouge">UNKNOWN</code></strong>. This eliminated all branching on “is this typed or raw?” throughout the codebase. It sounded like over-wrapping at first. It turned out to be the decision that kept the migration tractable.</p>

<h3 id="the-boundary-table">The Boundary Table</h3>

<p>The spec documented five boundary crossings where values moved between storage locations, along with the wrapping and unwrapping that happened at each. This table became the roadmap for every subsequent phase. Each phase was essentially: pick a boundary, push <code class="language-plaintext highlighter-rouge">TypedValue</code> one layer deeper, update the read sites, run the tests.</p>

<pre><code class="language-mermaid">flowchart LR
    subgraph Frame ["Stack Frame"]
        direction TB
        REG("📦 Registers&lt;br/&gt;&lt;code&gt;%r1 → TypedValue&lt;/code&gt;"):::reg
        VAR("📌 Local Vars&lt;br/&gt;&lt;code&gt;x → TypedValue&lt;/code&gt;"):::var
    end

    subgraph Heap ["Heap"]
        OBJ("🗄️ HeapObject.fields&lt;br/&gt;&lt;code&gt;name → TypedValue&lt;/code&gt;"):::heap
    end

    subgraph Closure ["Closure Environment"]
        BIND("🔗 Bindings&lt;br/&gt;&lt;code&gt;captured_x → TypedValue&lt;/code&gt;"):::closure
    end

    REG -- "STORE_FIELD" --&gt; OBJ
    OBJ -- "LOAD_FIELD" --&gt; REG
    VAR -- "capture" --&gt; BIND
    BIND -- "function entry" --&gt; VAR
    REG -- "CALL_FUNCTION" --&gt; VAR
    REG -- "STORE_VAR" --&gt; VAR
    VAR -- "LOAD_VAR" --&gt; REG

    classDef reg fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef var fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef heap fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef closure fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
</code></pre>

<p>Each arrow is a boundary crossing. Before the migration, values were unwrapped to raw primitives and re-wrapped at each crossing. After: <code class="language-plaintext highlighter-rouge">TypedValue</code> flows through intact.</p>

<p>After this phase: ~11,274 tests passing.</p>

<hr />

<h2 id="phase-2-handler-migration">Phase 2: Handler Migration</h2>

<p>Phase 1 wrapped values in <code class="language-plaintext highlighter-rouge">TypedValue</code> at the <code class="language-plaintext highlighter-rouge">apply_update</code> boundary — the function that takes a <code class="language-plaintext highlighter-rouge">StateUpdate</code> and applies it to the VM state. Every handler still produced raw values. <code class="language-plaintext highlighter-rouge">apply_update</code> wrapped them.</p>

<p>This worked but created a pointless roundtrip. Handlers called <code class="language-plaintext highlighter-rouge">_serialize_value()</code> to flatten objects into JSON-compatible structures, then <code class="language-plaintext highlighter-rouge">apply_update</code> called <code class="language-plaintext highlighter-rouge">_deserialize_value()</code> to reconstruct them. For locally-executed instructions (the vast majority), this was a no-op. The serialization path only existed for the LLM fallback, where an LLM returns a JSON <code class="language-plaintext highlighter-rouge">StateUpdate</code> that needs deserialization.</p>

<p>This phase had an interesting origin. I was partway through brainstorming the heap fields migration (Phase 5) when I realized that handlers were still producing raw values — the heap fields work would be building on a half-migrated foundation. I interrupted: <em>“I think then we pause this plan for now, and do red-dragon-132, so that values always arrive as TypedValue.”</em> The heap fields task was deferred in Beads (<code class="language-plaintext highlighter-rouge">bd update red-dragon-f6i --status deferred</code>), and the handler migration jumped the queue. This is one of those moments where the issue tracker earned its keep — deferring a task mid-brainstorm without losing it.</p>

<h3 id="the-serializedeserialize-roundtrip">The Serialize/Deserialize Roundtrip</h3>

<p>The brainstorming for the <code class="language-plaintext highlighter-rouge">apply_update</code> split was another case where I had to push back on the AI’s first instinct. The AI proposed a dual-path <code class="language-plaintext highlighter-rouge">apply_update</code> with isinstance branching — check if the incoming value is <code class="language-plaintext highlighter-rouge">TypedValue</code> and take one path, otherwise take the raw path. I said: <em>“I think <code class="language-plaintext highlighter-rouge">apply_update</code> should be split into <code class="language-plaintext highlighter-rouge">apply_update</code> (which accepts only TypedValue) and <code class="language-plaintext highlighter-rouge">apply_update_raw</code> (the path which LLMs take).”</em> Then, on reflection: <em>“On second thoughts, what should probably happen is that in the else clause (the LLM path), the raw update should be transformed into a TypedValue update… that way, the updates from both clauses are the canonical TypedValue update.”</em> The AI adopted this — a <code class="language-plaintext highlighter-rouge">materialize_raw_update</code> function that converts LLM responses into <code class="language-plaintext highlighter-rouge">TypedValue</code> updates before they enter the standard pipeline.</p>

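<p>The fix gave <code class="language-plaintext highlighter-rouge">apply_update</code> two input paths that converge on one canonical <code class="language-plaintext highlighter-rouge">TypedValue</code> update:</p>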
<ul>
  <li><strong>Local path:</strong> Handlers produce <code class="language-plaintext highlighter-rouge">TypedValue</code> directly. <code class="language-plaintext highlighter-rouge">apply_update</code> stores them with lightweight type coercion.</li>
  <li><strong>LLM path:</strong> A new <code class="language-plaintext highlighter-rouge">materialize_raw_update</code> function takes raw values from LLM JSON responses, deserializes them, coerces them, and wraps them in <code class="language-plaintext highlighter-rouge">TypedValue</code>.</li>
</ul>

<p><strong>Before:</strong> serialize → deserialize roundtrip</p>

<pre><code class="language-mermaid">flowchart LR
    H1("Handler"):::fn -- "_serialize_value()" --&gt; SU1("StateUpdate&lt;br/&gt;&lt;i&gt;raw/JSON&lt;/i&gt;"):::raw
    SU1 -- "_deserialize_value()" --&gt; AU1("apply_update()"):::fn
    AU1 -- "wrap" --&gt; VM1("VM State"):::state

    classDef fn fill:#f5f5f5,stroke:#616161,stroke-width:1px,color:#212121
    classDef raw fill:#fce4ec,stroke:#c62828,stroke-width:1px,color:#5c0a0a
    classDef state fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
</code></pre>

<p><strong>After:</strong> dual path — local handlers and LLM backends converge on the same <code class="language-plaintext highlighter-rouge">apply_update()</code></p>

<pre><code class="language-mermaid">flowchart TD
    H2("🖥️ Local Handler"):::local -- "produces TypedValue" --&gt; SU2("StateUpdate&lt;br/&gt;&lt;i&gt;TypedValue&lt;/i&gt;"):::tv
    LLM("🌐 LLM Backend"):::llm -- "JSON response" --&gt; RAW("Raw StateUpdate"):::raw
    RAW -- "materialize_raw_update()" --&gt; SU2
    SU2 -- "coerce_local_update()" --&gt; AU2("apply_update()"):::fn
    AU2 --&gt; VM2("VM State"):::state

    classDef fn fill:#f5f5f5,stroke:#616161,stroke-width:1px,color:#212121
    classDef raw fill:#fce4ec,stroke:#c62828,stroke-width:1px,color:#5c0a0a
    classDef tv fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef state fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef local fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px,color:#1a3a1a
    classDef llm fill:#fff3e0,stroke:#e8a735,stroke-width:1px,color:#5c3a0a
</code></pre>

<p>The local path — the common case — is a direct pipeline with no serialization overhead. The LLM path gets its own materialization function that handles the JSON-to-TypedValue conversion before merging into the same <code class="language-plaintext highlighter-rouge">StateUpdate</code>.</p>
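<p>The convergence can be sketched in a few lines. This is an illustration under assumptions, not the actual implementation: <code class="language-plaintext highlighter-rouge">StateUpdate</code> is reduced to a single hypothetical <code class="language-plaintext highlighter-rouge">register_writes</code> field, types are plain strings, <code class="language-plaintext highlighter-rouge">infer_type</code> is an invented helper, and the coercion step is left out.</p>

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str


@dataclass
class StateUpdate:
    # Hypothetical single-field update; the real StateUpdate carries more.
    register_writes: dict[str, Any] = field(default_factory=dict)


def infer_type(raw: Any) -> str:
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(raw, bool):
        return "Bool"
    if isinstance(raw, int):
        return "Int"
    if isinstance(raw, float):
        return "Float"
    if isinstance(raw, str):
        return "String"
    return "UNKNOWN"


def materialize_raw_update(raw: StateUpdate) -> StateUpdate:
    """LLM path: wrap every raw value so the result is a canonical TypedValue update."""
    return StateUpdate(
        register_writes={
            reg: TypedValue(value, infer_type(value))
            for reg, value in raw.register_writes.items()
        }
    )


def apply_update(registers: dict[str, TypedValue], update: StateUpdate) -> None:
    """Single entry point: accepts TypedValue only, whichever path produced it."""
    for reg, tv in update.register_writes.items():
        if not isinstance(tv, TypedValue):
            raise TypeError(f"{reg}: expected TypedValue, got {type(tv).__name__}")
        registers[reg] = tv


registers: dict[str, TypedValue] = {}

# Local path: the handler already produced a TypedValue.
apply_update(registers, StateUpdate({"%r1": TypedValue(42, "Int")}))

# LLM path: JSON is parsed, materialized, then applied through the same function.
llm_json = '{"register_writes": {"%r2": "hello", "%r3": true}}'
raw_update = StateUpdate(**json.loads(llm_json))
apply_update(registers, materialize_raw_update(raw_update))
```

<p>The design choice worth noticing is that <code class="language-plaintext highlighter-rouge">apply_update</code> has no raw branch at all in this sketch. A raw value that skips materialization is rejected rather than silently wrapped.</p>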

<p>The migration touched every handler in the executor — about 15 handler groups — done one at a time in dependency order: simple value handlers first (<code class="language-plaintext highlighter-rouge">_handle_const</code>, <code class="language-plaintext highlighter-rouge">_handle_store_var</code>), then loads (<code class="language-plaintext highlighter-rouge">_handle_load_var</code>, <code class="language-plaintext highlighter-rouge">_handle_load_field</code>), then objects, then operators, then the call chain. Each group was a separate commit.</p>

<p>The spec documented six different serialization patterns across handlers (raw value, <code class="language-plaintext highlighter-rouge">_serialize_value(val)</code>, <code class="language-plaintext highlighter-rouge">sym.to_dict()</code>, <code class="language-plaintext highlighter-rouge">SymbolicValue</code> object directly, <code class="language-plaintext highlighter-rouge">Pointer</code> object directly, heap address string). Each had its own migration path.</p>

<p>After this phase: ~11,449 tests passing.</p>

<hr />

<h2 id="phase-3-return-values">Phase 3: Return Values</h2>

<p><code class="language-plaintext highlighter-rouge">_handle_return</code> was still serializing return values via <code class="language-plaintext highlighter-rouge">_serialize_value(val)</code>, and <code class="language-plaintext highlighter-rouge">_handle_return_flow</code> was deserializing them back. This was the same roundtrip as Phase 2, just for return values.</p>

<p>The migration exposed a conflation that had been hiding in the return value semantics.</p>

<h3 id="the-constructor-bug">The Constructor Bug</h3>

<p>This was the most frustrating part of the migration, and the brainstorming conversation around it was contentious.</p>

<p>Before TypedValue, <code class="language-plaintext highlighter-rouge">return_value = None</code> meant two different things: “this instruction doesn’t have a return value” and “the function returned None/null.” These were indistinguishable. For most code this didn’t matter. For constructors, it did.</p>

<p>Constructors in RedDragon work by allocating a heap object, running the constructor body (which stores fields via STORE_FIELD), and handing the new object back to the caller. The return mechanism had a guard: <code class="language-plaintext highlighter-rouge">if return_value is not None</code>. This guard prevented constructors from accidentally clobbering their result register with <code class="language-plaintext highlighter-rouge">None</code>. Constructors deliver the <code class="language-plaintext highlighter-rouge">this</code> pointer through a different mechanism (STORE_VAR into the caller’s local), not through <code class="language-plaintext highlighter-rouge">return_value</code>.</p>

<p>When <code class="language-plaintext highlighter-rouge">return_value</code> became a <code class="language-plaintext highlighter-rouge">TypedValue</code>, the guard broke. <code class="language-plaintext highlighter-rouge">TypedValue(None, Void)</code> is not <code class="language-plaintext highlighter-rouge">None</code>, so the <code class="language-plaintext highlighter-rouge">is not None</code> identity check passes and the constructor’s result register gets overwritten with a void value.</p>
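<p>The failure mode fits in a few lines. In this sketch <code class="language-plaintext highlighter-rouge">old_guard_writes</code> is a hypothetical reduction of the pre-migration guard, and the type is a plain string rather than a real type expression.</p>

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str


def old_guard_writes(return_value: Any) -> bool:
    """Hypothetical reduction of the old guard: write unless the value is None."""
    return return_value is not None


# Before the migration a constructor's missing return value was a bare None.
before = old_guard_writes(None)

# After the migration the same "nothing to return" arrives wrapped.
after = old_guard_writes(TypedValue(None, "Void"))
```

<p>The first call leaves the result register alone; the second reports that a write should happen, because a wrapper object is never <code class="language-plaintext highlighter-rouge">None</code> regardless of what it wraps.</p>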

<p>The brainstorming for the fix started with me pushing on the void/None distinction. The AI’s initial proposal had a <code class="language-plaintext highlighter-rouge">value is not None</code> guard on the return path — essentially preserving the old ambiguity under a new name. I pushed back: <em>“Why is there a None check though?”</em> This forced an explicit discussion of void vs null semantics. Then: <em>“I’m not comfortable with passing a naked None back.”</em> And finally: <em>“I want the ‘no return value is possible because it is void’ scenario to also be represented by a different TypedValue.”</em> This led to adding <code class="language-plaintext highlighter-rouge">VOID</code> to the type system.</p>

<p>But the real frustration came after implementation. The implementer agent had used <code class="language-plaintext highlighter-rouge">value is not None</code> as a guard anyway — silently discarding both Void and None TypedValues. I caught this in review: <em>“So you completely discarded creating Void and None TypedValues?”</em> followed by <em>“Produce a plan which accommodates the proper behaviour of using TypedValue, and not your coding convenience.”</em> The fix was redesigned from scratch.</p>

<p>The eventual fix came in two commits:</p>
<ol>
  <li>Constructor detection via scope chain inspection — if the current frame is a constructor, skip return value writes to the result register.</li>
  <li>Replace the <code class="language-plaintext highlighter-rouge">result_reg=None</code> hack (constructors had been setting their result register to <code class="language-plaintext highlighter-rouge">None</code> to prevent writes) with a clean <code class="language-plaintext highlighter-rouge">is_ctor</code> flag on <code class="language-plaintext highlighter-rouge">StackFrame</code>.</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">is_ctor</code> idea came from me during the brainstorming: <em>“Tentative idea: <code class="language-plaintext highlighter-rouge">_try_class_constructor_call</code> pushes <code class="language-plaintext highlighter-rouge">is_ctor</code> onto the constructor frame… the <code class="language-plaintext highlighter-rouge">_handle_return_flow</code> guard only assigns the return value if it is not Void.”</em> The AI refined this into the implementation pattern.</p>

<p>This was a classic refactoring discovery: the old code worked, but only because of an accidental coupling between “None means no value” and “constructors shouldn’t write return values.” TypedValue made the coupling visible by removing the ambiguity. But it was also a case where the brainstorming process — the back-and-forth about what void means, the insistence on not taking shortcuts — produced a cleaner design than either party would have reached alone.</p>

<p>The fix introduced a three-state return type:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">typed(None, scalar("Void"))</code> — void return (no value to write)</li>
  <li><code class="language-plaintext highlighter-rouge">typed(None, UNKNOWN)</code> — explicit <code class="language-plaintext highlighter-rouge">return None</code></li>
  <li><code class="language-plaintext highlighter-rouge">typed(42, scalar("Int"))</code> — concrete return value</li>
</ul>

<pre><code class="language-mermaid">flowchart TD
    RET("_handle_return()"):::fn

    RET --&gt; CTOR{"🏗️ Constructor?&lt;br/&gt;&lt;i&gt;frame.is_ctor&lt;/i&gt;"}:::decide
    CTOR -- "yes" --&gt; VOID("TypedValue(None, Void)"):::void
    CTOR -- "no, has operand" --&gt; RESOLVE("_resolve_reg(operand)"):::fn
    CTOR -- "no, no operand" --&gt; VOID

    RESOLVE --&gt; TV("typed_from_runtime(val)"):::fn
    TV --&gt; SU("StateUpdate&lt;br/&gt;&lt;i&gt;return_value + call_pop&lt;/i&gt;"):::state
    VOID --&gt; SU

    SU --&gt; POP("⬆️ Pop frame"):::fn
    POP --&gt; FLOW{"_handle_return_flow()"}:::decide
    FLOW -- "Void" --&gt; SKIP("🚫 Skip result_reg write"):::void
    FLOW -- "concrete" --&gt; WRITE("✅ caller.registers[result_reg]&lt;br/&gt;= TypedValue"):::tv

    classDef fn fill:#f5f5f5,stroke:#616161,stroke-width:1px,color:#212121
    classDef decide fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef void fill:#fce4ec,stroke:#c62828,stroke-width:1px,color:#5c0a0a
    classDef state fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef tv fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>
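<p>A compact way to see why three states are enough is to run each through a return-flow guard that checks the type rather than the value. The sketch below makes assumptions: types are plain strings instead of <code class="language-plaintext highlighter-rouge">scalar(...)</code> expressions, <code class="language-plaintext highlighter-rouge">handle_return_flow</code> is a simplified stand-in for <code class="language-plaintext highlighter-rouge">_handle_return_flow</code>, and the register contents are invented.</p>

```python
from dataclasses import dataclass
from typing import Any

# Assumption: plain strings stand in for the real type expressions.
VOID = "Void"
UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str


def typed(value: Any, type_name: str) -> TypedValue:
    return TypedValue(value, type_name)


def handle_return_flow(caller_registers: dict[str, TypedValue], result_reg: str, return_value: TypedValue) -> None:
    # The guard inspects the type, so a Void return and an explicit None stay distinct.
    if return_value.type == VOID:
        return
    caller_registers[result_reg] = return_value


# Void return (for example from a constructor frame): the register is left untouched.
ctor_regs = {"%r3": typed("obj_0", "Pointer")}  # hypothetical heap address
handle_return_flow(ctor_regs, "%r3", typed(None, VOID))

# Explicit `return None`: a real value, so it is written.
none_regs: dict[str, TypedValue] = {}
handle_return_flow(none_regs, "%r3", typed(None, UNKNOWN))

# Concrete return value.
int_regs: dict[str, TypedValue] = {}
handle_return_flow(int_regs, "%r3", typed(42, "Int"))
```

<p>Note that the wrapped Python value is <code class="language-plaintext highlighter-rouge">None</code> in the first two cases; only the type tells them apart, which is exactly the information the old bare <code class="language-plaintext highlighter-rouge">None</code> could not carry.</p>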

<p>After this phase: ~11,449 tests passing, plus the constructor fix.</p>

<hr />

<h2 id="phases-46-heap-and-closures">Phases 4–6: Heap and Closures</h2>

<p>Three more storage locations to migrate: <code class="language-plaintext highlighter-rouge">HeapWrite.value</code> (Phase 4), <code class="language-plaintext highlighter-rouge">HeapObject.fields</code> (Phase 5), and <code class="language-plaintext highlighter-rouge">ClosureEnvironment.bindings</code> (Phase 6).</p>

<p>Phase 4 was modest: <code class="language-plaintext highlighter-rouge">HeapWrite.value</code> carries <code class="language-plaintext highlighter-rouge">TypedValue</code>, but <code class="language-plaintext highlighter-rouge">apply_update</code> unwraps it before storing in <code class="language-plaintext highlighter-rouge">HeapObject.fields</code>. The heap stays raw for now.</p>

<p>Phase 5 pushed <code class="language-plaintext highlighter-rouge">TypedValue</code> into the heap itself. <code class="language-plaintext highlighter-rouge">HeapObject.fields</code> stores <code class="language-plaintext highlighter-rouge">TypedValue</code> directly. This was the largest single phase because every read site — <code class="language-plaintext highlighter-rouge">_handle_load_field</code>, <code class="language-plaintext highlighter-rouge">_handle_load_index</code>, constructor field access, builtin method dispatch — had to stop re-wrapping values that were already wrapped.</p>

<p>Phase 6 was the simplest: three write sites and one read site for closure bindings.</p>

<h3 id="the-double-wrapping-landmine">The Double-Wrapping Landmine</h3>

<p>Phase 5 was where I spent the most time reading code and trying to understand the flow. The heap is read from many places — field access, index access, constructor field initialization, alias variable resolution — and each site had its own slightly different wrapping logic. I kept asking <em>“what does this value look like when it arrives here?”</em> and tracing through the call chain to find out. The codebase had grown large enough that I couldn’t hold the full picture in my head, and the AI’s summaries sometimes glossed over details that mattered.</p>

<p>The persistent source of bugs was that <code class="language-plaintext highlighter-rouge">typed_from_runtime()</code> is not idempotent. If you pass it a <code class="language-plaintext highlighter-rouge">TypedValue</code>, it wraps it inside another <code class="language-plaintext highlighter-rouge">TypedValue</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typed_from_runtime(TypedValue(42, Int))
→ TypedValue(value=TypedValue(42, Int), type=UNKNOWN)
</code></pre></div></div>

<p>Every read site that previously called <code class="language-plaintext highlighter-rouge">typed_from_runtime(raw_value)</code> unconditionally needed an isinstance guard to prevent double-wrapping. The plan warned about intermediate breakage: tasks 1–4 changed write sites to store <code class="language-plaintext highlighter-rouge">TypedValue</code>, but read sites still called <code class="language-plaintext highlighter-rouge">typed_from_runtime()</code> unconditionally until tasks 5–7. The test suite was broken between those two groups of tasks. This was acceptable because both groups were committed together, atomically.</p>

<pre><code class="language-mermaid">flowchart TD
    subgraph write ["Write Site (Phase 4–5)"]
        HANDLER("Handler"):::fn -- "typed_from_runtime(val)" --&gt; TV1("TypedValue(42, Int)"):::tv
        TV1 -- "HeapWrite" --&gt; HEAP("🗄️ heap.fields['x']&lt;br/&gt;= TypedValue(42, Int)"):::heap
    end

    HEAP --&gt; SPLIT{"Read site calls&lt;br/&gt;typed_from_runtime()?"}:::decide

    subgraph bad ["❌ Before fix: double-wrapped"]
        SPLIT -- "yes" --&gt; DOUBLE("TypedValue(&lt;br/&gt;  value = TypedValue(42, Int),&lt;br/&gt;  type = UNKNOWN&lt;br/&gt;)"):::danger
    end

    subgraph good ["✅ After fix: pass-through"]
        SPLIT -- "no" --&gt; PASS("TypedValue(42, Int)&lt;br/&gt;&lt;i&gt;intact&lt;/i&gt;"):::tv
    end

    classDef fn fill:#f5f5f5,stroke:#616161,stroke-width:1px,color:#212121
    classDef tv fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef heap fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef decide fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef danger fill:#ffcdd2,stroke:#c62828,stroke-width:3px,color:#b71c1c
</code></pre>
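<p>The transitional guard can be sketched as a thin wrapper around the non-idempotent function. Assumptions in this sketch: <code class="language-plaintext highlighter-rouge">typed_from_runtime</code> is reduced to a toy that only recognises ints, types are plain strings, and <code class="language-plaintext highlighter-rouge">ensure_typed</code> is a hypothetical name for the guard pattern rather than a function from the codebase.</p>

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str


def typed_from_runtime(raw: Any) -> TypedValue:
    # Toy version: wraps whatever it is given, so it is not idempotent.
    type_name = "Int" if isinstance(raw, int) else "UNKNOWN"
    return TypedValue(raw, type_name)


def ensure_typed(val: Any) -> TypedValue:
    # Transitional guard: pass already-wrapped values through untouched.
    if isinstance(val, TypedValue):
        return val
    return typed_from_runtime(val)


stored = TypedValue(42, "Int")                # what a migrated write site stores
double_wrapped = typed_from_runtime(stored)   # unguarded read site
passed_through = ensure_typed(stored)         # guarded read site
freshly_wrapped = ensure_typed(42)            # raw value from an unmigrated site
```

<p>A guard like this is only useful while raw and wrapped values coexist. Once every write site stores <code class="language-plaintext highlighter-rouge">TypedValue</code>, the first branch is the only one that ever runs and the guard becomes dead code.</p>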

<p>After Phase 6: ~11,481 tests passing.</p>

<hr />

<h2 id="phase-7-cleaning-up-after-ourselves">Phase 7: Cleaning Up After Ourselves</h2>

<p>All storage locations now stored <code class="language-plaintext highlighter-rouge">TypedValue</code>. The isinstance guards added during the transition — <code class="language-plaintext highlighter-rouge">if isinstance(val, TypedValue)</code> — were dead code. Phase 7 removed them all and narrowed type annotations. This was a cleanup commit, not a behavioral change.</p>

<hr />

<h2 id="detour-builtins">Detour: Builtins</h2>

<p>The TypedValue migration was “done” after Phase 7. But the work exposed two more problems in the builtin system.</p>

<h3 id="builtinresult-side-effects-should-be-declarative">BuiltinResult: Side Effects Should Be Declarative</h3>

<p>This detour started with me asking <em>“what else?”</em> after the main migration was done. The AI ran an audit and flagged the builtins. The fix went through another round of brainstorming — a shorter one this time, since the pattern was established. The brainstorming skill proposed three approaches: (A) builtins return <code class="language-plaintext highlighter-rouge">ExecutionResult</code> directly, (B) a lightweight <code class="language-plaintext highlighter-rouge">BuiltinResult</code> dataclass with value + side effects, or (C) split builtins into two tables (pure vs heap-mutating). I chose B.</p>

<p>But the interesting moment was a correction I made to the AI’s initial proposal. It had suggested that only the heap-mutating builtins return <code class="language-plaintext highlighter-rouge">BuiltinResult</code>, while pure builtins continue returning raw values. I pushed back: <em>“Pure builtins should also return BuiltinResult.”</em> The point was uniform interface — the caller shouldn’t need isinstance branching to figure out what a builtin returned. This was the same principle that drove the “every value is TypedValue, even when the type is UNKNOWN” decision in Phase 1. A Beads task was filed (<code class="language-plaintext highlighter-rouge">red-dragon-vva</code>), and the plan was generated from the spec.</p>

<p>RedDragon has about 40 built-in functions (<code class="language-plaintext highlighter-rouge">len</code>, <code class="language-plaintext highlighter-rouge">range</code>, <code class="language-plaintext highlighter-rouge">print</code>, <code class="language-plaintext highlighter-rouge">slice</code>, plus 25 COBOL-specific byte manipulation builtins). Most are pure — they take arguments and return a value. Two are not: <code class="language-plaintext highlighter-rouge">_builtin_array_of</code> creates a heap object, and <code class="language-plaintext highlighter-rouge">_builtin_object_rest</code> copies fields from an existing heap object. These two wrote directly to <code class="language-plaintext highlighter-rouge">vm.heap</code> as a side effect, bypassing the <code class="language-plaintext highlighter-rouge">apply_update</code> pipeline.</p>

<p>This had always been the case, but it became conspicuous once every other state change flowed through <code class="language-plaintext highlighter-rouge">StateUpdate</code>. I hadn’t been thinking about builtins at all during the TypedValue planning — they seemed orthogonal. They weren’t. The fix introduced <code class="language-plaintext highlighter-rouge">BuiltinResult</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">BuiltinResult</span><span class="p">:</span>
    <span class="n">value</span><span class="p">:</span> <span class="n">TypedValue</span>
    <span class="n">new_objects</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">NewObject</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
    <span class="n">heap_writes</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">HeapWrite</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
</code></pre></div></div>

<p>All builtins return <code class="language-plaintext highlighter-rouge">BuiltinResult</code>. The executor unpacks it into the <code class="language-plaintext highlighter-rouge">StateUpdate</code>. No builtin directly mutates <code class="language-plaintext highlighter-rouge">vm.heap</code>.</p>

<p><strong>Before:</strong> builtins bypass StateUpdate</p>

<pre><code class="language-mermaid">flowchart LR
    B1("_builtin_array_of()"):::fn -- "⚡ direct write" --&gt; HEAP1("vm.heap"):::danger
    B1 -- "raw value" --&gt; H1("Handler"):::fn
    H1 -- "StateUpdate&lt;br/&gt;&lt;i&gt;value only&lt;/i&gt;" --&gt; AU1("apply_update()"):::fn
    AU1 --&gt; VM1("VM State"):::state

    classDef fn fill:#f5f5f5,stroke:#616161,stroke-width:1px,color:#212121
    classDef danger fill:#ffcdd2,stroke:#c62828,stroke-width:3px,color:#b71c1c
    classDef state fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
</code></pre>

<p><strong>After:</strong> all effects are declarative</p>

<pre><code class="language-mermaid">flowchart LR
    B2("_builtin_array_of()"):::fn --&gt; BR("📋 BuiltinResult&lt;br/&gt;&lt;i&gt;value + new_objects&lt;br/&gt;+ heap_writes&lt;/i&gt;"):::builtin
    BR --&gt; H2("Handler unpacks"):::fn
    H2 --&gt; AU2("apply_update()"):::fn
    AU2 --&gt; VM2("VM State&lt;br/&gt;&lt;i&gt;atomic update&lt;/i&gt;"):::state

    classDef fn fill:#f5f5f5,stroke:#616161,stroke-width:1px,color:#212121
    classDef state fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef builtin fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>The migration was done in eight commits, with an isinstance bridge during the transition so that old-style builtins (returning raw values) and new-style builtins (returning <code class="language-plaintext highlighter-rouge">BuiltinResult</code>) could coexist. The bridge was removed in the last commit.</p>
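<p>The unpacking step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the fields on <code class="language-plaintext highlighter-rouge">NewObject</code>, <code class="language-plaintext highlighter-rouge">HeapWrite</code> and <code class="language-plaintext highlighter-rouge">StateUpdate</code>, the string type names, the fixed address and the <code class="language-plaintext highlighter-rouge">unpack</code> helper are all assumptions made for the example, and arguments are shown already wrapped for brevity.</p>

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str = "UNKNOWN"  # stand-in for the real type objects


@dataclass(frozen=True)
class NewObject:
    addr: str        # hypothetical field names
    type_hint: str


@dataclass(frozen=True)
class HeapWrite:
    addr: str
    field_name: str
    value: TypedValue


@dataclass(frozen=True)
class BuiltinResult:
    value: TypedValue
    new_objects: list[NewObject] = field(default_factory=list)
    heap_writes: list[HeapWrite] = field(default_factory=list)


@dataclass
class StateUpdate:
    # Hypothetical shape: only the parts needed for this sketch.
    register_writes: dict[str, TypedValue] = field(default_factory=dict)
    new_objects: list[NewObject] = field(default_factory=list)
    heap_writes: list[HeapWrite] = field(default_factory=list)


def builtin_len(args: list[TypedValue]) -> BuiltinResult:
    # Pure builtin: same return type, with empty effect lists.
    return BuiltinResult(value=TypedValue(len(args[0].value), "Int"))


def builtin_array_of(args: list[TypedValue]) -> BuiltinResult:
    # Heap-creating builtin: describes the allocation instead of performing it.
    addr = "arr_0"  # a real VM would allocate a fresh address here
    writes = [HeapWrite(addr, str(i), arg) for i, arg in enumerate(args)]
    return BuiltinResult(
        value=TypedValue(addr, "Pointer"),
        new_objects=[NewObject(addr, "Array")],
        heap_writes=writes,
    )


def unpack(result_reg: str, result: BuiltinResult) -> StateUpdate:
    # The handler copies the value and the declared effects into one update,
    # which is then applied in a single place.
    return StateUpdate(
        register_writes={result_reg: result.value},
        new_objects=list(result.new_objects),
        heap_writes=list(result.heap_writes),
    )
```

<p>Because the pure and heap-creating builtins share a return type, the caller in this sketch needs no branching: <code class="language-plaintext highlighter-rouge">unpack</code> treats both the same way, and a pure builtin simply contributes empty effect lists.</p>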

<h3 id="builtin-args-the-atomic-commit-problem">Builtin Args: The Atomic Commit Problem</h3>

<p>After builtins returned <code class="language-plaintext highlighter-rouge">TypedValue</code> via <code class="language-plaintext highlighter-rouge">BuiltinResult</code>, they still <em>received</em> raw Python primitives. <code class="language-plaintext highlighter-rouge">_resolve_reg</code> stripped the <code class="language-plaintext highlighter-rouge">TypedValue</code> wrapper before passing arguments. This was a hole: type information was available at the call site but discarded before the builtin could see it.</p>

<p>The fix was conceptually simple: change all builtins from <code class="language-plaintext highlighter-rouge">list[Any]</code> to <code class="language-plaintext highlighter-rouge">list[TypedValue]</code> arguments. The implementation was not, because <strong>every builtin and every call site had to change simultaneously</strong>. Changing the executor to pass <code class="language-plaintext highlighter-rouge">TypedValue</code> args without changing the builtins (which do things like <code class="language-plaintext highlighter-rouge">args[0] + args[1]</code>) would break all 11,530+ tests.</p>
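<p>A small sketch of why the two sides cannot change independently. The <code class="language-plaintext highlighter-rouge">add_old</code> and <code class="language-plaintext highlighter-rouge">add_new</code> functions and the string type names are hypothetical stand-ins, not RedDragon builtins.</p>

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str = "UNKNOWN"


def add_old(args: list[Any]) -> Any:
    # Old style: assumes raw Python values.
    return args[0] + args[1]


def add_new(args: list[TypedValue]) -> TypedValue:
    # New style: reaches through .value and can see the operand types.
    left, right = args
    result_type = "Float" if "Float" in (left.type, right.type) else left.type
    return TypedValue(left.value + right.value, result_type)


typed_args = [TypedValue(1, "Int"), TypedValue(2.5, "Float")]

# A new-style caller paired with an old-style builtin fails immediately,
# because TypedValue does not define "+".
try:
    add_old(typed_args)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# Once both sides agree on the interface, the call works and keeps the type.
total = add_new(typed_args)
```

<p>With every builtin written like <code class="language-plaintext highlighter-rouge">add_old</code>, switching the executor first fails on each call, which is the situation that forced the single commit.</p>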

<p>The plan explicitly mandated: no intermediate commits. All production code changes happen without committing; the single commit occurs after the full test suite passes. This was the only phase where the atomic commit constraint was hard — in every other phase, at least some intermediate states were testable.</p>

<p>This phase touched all 40+ builtins, all method builtins, all 25 COBOL builtins, and the parameter binding code in user function and constructor calls.</p>

<p>After builtins: ~11,530 tests passing.</p>

<hr />

<h2 id="more-detours">More Detours</h2>

<p>The main migration was done. But work kept surfacing.</p>

<h3 id="binopcoercionstrategy-return-type">BinopCoercionStrategy Return Type</h3>

<p>This one came from me staring at the code and asking: <em>“Why does <code class="language-plaintext highlighter-rouge">BinopCoercionStrategy.coerce()</code> return <code class="language-plaintext highlighter-rouge">tuple[Any, Any]</code>?”</em></p>

<p>We’d just spent two days making every value in the system a <code class="language-plaintext highlighter-rouge">TypedValue</code>. The protocol that coerces binary operands — the thing that started this entire migration — was still stripping types at its return boundary. The AI had implemented it that way in Phase 1, before the rest of the migration made <code class="language-plaintext highlighter-rouge">TypedValue</code> universal. Nobody caught it because the callers immediately re-wrapped the values. But it meant type information was being discarded and re-inferred at the coercion boundary, which defeated the purpose.</p>

<p>It was a one-commit fix, but I insisted it be filed as its own Beads task and done separately — changing the protocol return type is a distinct concern from the builtin migration, and mixing them would have muddied the commit history. Beads made this easy: file a task, give it a dependency on the current work, and it shows up in <code class="language-plaintext highlighter-rouge">bd ready</code> at the right time.</p>

<h3 id="unopcoercionstrategy">UnopCoercionStrategy</h3>

<p>Phase 1 introduced <code class="language-plaintext highlighter-rouge">BinopCoercionStrategy</code> for binary operators. Unary operators (<code class="language-plaintext highlighter-rouge">-x</code>, <code class="language-plaintext highlighter-rouge">not x</code>, <code class="language-plaintext highlighter-rouge">~x</code>, <code class="language-plaintext highlighter-rouge">#x</code>) had no equivalent — <code class="language-plaintext highlighter-rouge">_handle_unop</code> still used <code class="language-plaintext highlighter-rouge">_resolve_reg</code> (raw values) and <code class="language-plaintext highlighter-rouge">typed_from_runtime</code> (runtime type inference). The fix followed the same pattern: a <code class="language-plaintext highlighter-rouge">UnopCoercionStrategy</code> protocol with <code class="language-plaintext highlighter-rouge">coerce()</code> and <code class="language-plaintext highlighter-rouge">result_type()</code>, a <code class="language-plaintext highlighter-rouge">DefaultUnopCoercion</code> implementation, and threading through the executor pipeline via kwargs.</p>
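<p>A rough sketch of what such a protocol could look like. Only the names <code class="language-plaintext highlighter-rouge">UnopCoercionStrategy</code>, <code class="language-plaintext highlighter-rouge">DefaultUnopCoercion</code>, <code class="language-plaintext highlighter-rouge">coerce()</code> and <code class="language-plaintext highlighter-rouge">result_type()</code> come from the text above; the method signatures, the string type names, the operator table and <code class="language-plaintext highlighter-rouge">handle_unop</code> are assumptions for illustration.</p>

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str = "UNKNOWN"


class UnopCoercionStrategy(Protocol):
    # Assumed signatures; the real protocol may differ.
    def coerce(self, op: str, operand: TypedValue) -> TypedValue: ...
    def result_type(self, op: str, operand: TypedValue) -> str: ...


class DefaultUnopCoercion:
    def coerce(self, op: str, operand: TypedValue) -> TypedValue:
        return operand  # pass-through: no language-specific conversion

    def result_type(self, op: str, operand: TypedValue) -> str:
        if op == "not":
            return "Bool"
        if op == "#":
            return "Int"  # length operator
        return operand.type  # "-" and "~" keep the operand type


UNOPS = {
    "-": lambda v: -v,
    "not": lambda v: not v,
    "~": lambda v: ~v,
    "#": len,
}


def handle_unop(
    op: str,
    operand: TypedValue,
    coercion: UnopCoercionStrategy = DefaultUnopCoercion(),
) -> TypedValue:
    # The operand stays wrapped on the way in and on the way out, so the
    # result type comes from the strategy rather than from runtime inference.
    coerced = coercion.coerce(op, operand)
    return TypedValue(UNOPS[op](coerced.value), coercion.result_type(op, coerced))
```

<p>A language frontend with different rules would supply its own strategy object; in this sketch it is passed as a keyword argument, loosely mirroring the kwargs threading mentioned above.</p>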

<h3 id="demo-scripts-and-llm-path-leaks">Demo Scripts and LLM Path Leaks</h3>

<p>After the main migration, the AI reported “all tests pass” and I asked: <em>“Do any of the script files need to be updated?”</em> They did. Five demo scripts in the <code class="language-plaintext highlighter-rouge">scripts/</code> directory used <code class="language-plaintext highlighter-rouge">_format_val</code> to display VM state. None of them handled <code class="language-plaintext highlighter-rouge">TypedValue</code>. After the heap fields migration, the scripts showed output like <code class="language-plaintext highlighter-rouge">fields={'x': TypedValue(value=42, type=ScalarType(name='Int'))}</code> instead of <code class="language-plaintext highlighter-rouge">fields={'x': 42}</code>.</p>

<p>The AI’s first fix added an <code class="language-plaintext highlighter-rouge">_unwrap()</code> helper with isinstance guards. I pushed back: <em>“Why can’t <code class="language-plaintext highlighter-rouge">_unwrap()</code> unconditionally take a TypedValue and unwrap it instead of all the isinstance nonsense?”</em> The local variables always store <code class="language-plaintext highlighter-rouge">TypedValue</code> now — that was the whole point. The guard was leftover thinking from the transition period. The fix was to just use <code class="language-plaintext highlighter-rouge">.value</code> directly.</p>

<p>Then I asked the AI to <em>actually run</em> each of the scripts. Not just run tests — the scripts aren’t tests, they’re demos that call live LLM backends. The AI had been saying “all tests pass” as if that covered everything. It didn’t. Running the scripts exposed formatting bugs that the test suite couldn’t catch.</p>

<p>This prompted an external test infrastructure: a pytest marker <code class="language-plaintext highlighter-rouge">@pytest.mark.external</code> and a configuration that excludes external tests by default in both local runs and CI. The scripts now have real tests; they just don’t run in the normal suite.</p>
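<p>One common way to wire up such a marker is a <code class="language-plaintext highlighter-rouge">conftest.py</code> along these lines. This is a sketch of the general pytest pattern, not the project's actual configuration: the <code class="language-plaintext highlighter-rouge">--run-external</code> flag is a hypothetical name, and this version skips external tests rather than deselecting them.</p>

```python
# conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    # Hypothetical opt-in flag; external tests stay off unless it is passed.
    parser.addoption(
        "--run-external",
        action="store_true",
        default=False,
        help="run tests that call live external services",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "external: test calls a live external service"
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-external"):
        return  # explicit opt-in: leave the collected tests untouched
    skip_external = pytest.mark.skip(reason="needs --run-external")
    for item in items:
        if "external" in item.keywords:
            item.add_marker(skip_external)
```

<p>With this in place, a plain <code class="language-plaintext highlighter-rouge">pytest</code> run reports the external tests as skipped, and the same default applies in CI unless the flag is passed there.</p>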

<p>A separate leak was found in <code class="language-plaintext highlighter-rouge">LLMPlausibleResolver._parse_llm_response</code>, which parsed LLM JSON responses into <code class="language-plaintext highlighter-rouge">StateUpdate</code> values. The parser produced bare values — not <code class="language-plaintext highlighter-rouge">TypedValue</code> — which entered the now-TypedValue-only pipeline. I’d asked <em>“Are there still any bare values passed around anywhere else in the system?”</em> and the audit found three sites in the LLM resolver. The main migration had been so focused on the local execution path that the LLM fallback path was missed.</p>
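<p>The shape of that fix can be sketched as wrapping at the parsing boundary, so nothing bare gets past it. The JSON layout (a <code class="language-plaintext highlighter-rouge">register_writes</code> object), the string type names and both function names below are assumptions for illustration; the real parser is not shown in this post.</p>

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str = "UNKNOWN"


def infer_type(raw: Any) -> str:
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(raw, bool):
        return "Bool"
    if isinstance(raw, int):
        return "Int"
    if isinstance(raw, float):
        return "Float"
    if isinstance(raw, str):
        return "String"
    return "UNKNOWN"


def parse_llm_register_writes(response_text: str) -> dict[str, TypedValue]:
    # Every value coming out of the LLM response is wrapped before it
    # can enter the TypedValue-only pipeline.
    payload = json.loads(response_text)
    raw_writes = payload.get("register_writes", {})
    return {reg: TypedValue(raw, infer_type(raw)) for reg, raw in raw_writes.items()}
```

<p>Values the sketch cannot classify, such as JSON <code class="language-plaintext highlighter-rouge">null</code>, are still wrapped with type <code class="language-plaintext highlighter-rouge">UNKNOWN</code>, in line with the “every value is TypedValue” rule.</p>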

<h3 id="the-question-that-came-after">The Question That Came After</h3>

<p>After the migration was done, the documentation updated, and the detours resolved, I looked at the codebase and asked: <em>“We’re now storing types alongside every value in TypedValue. Are the separate <code class="language-plaintext highlighter-rouge">register_types</code> and <code class="language-plaintext highlighter-rouge">var_types</code> dictionaries in <code class="language-plaintext highlighter-rouge">TypeEnvironment</code> still required?”</em></p>

<p>The answer turned out to be yes — but for a narrower reason than before. <code class="language-plaintext highlighter-rouge">TypedValue.type</code> carries the <em>runtime-inferred</em> type (what the computation produced). <code class="language-plaintext highlighter-rouge">TypeEnvironment</code> carries the <em>declared</em> type (what the source code said). These can differ: Python’s <code class="language-plaintext highlighter-rouge">4 / 2</code> produces <code class="language-plaintext highlighter-rouge">2.0</code> (a float), but if the variable was declared <code class="language-plaintext highlighter-rouge">int d</code>, the declared type is <code class="language-plaintext highlighter-rouge">Int</code>. Write-time coercion exists to reconcile the two.</p>
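<p>The division example can be made concrete with a small sketch of write-time coercion. The <code class="language-plaintext highlighter-rouge">coerce_on_write</code> function, the converter table and the string type names are hypothetical; they only illustrate reconciling a runtime-inferred type with a declared one.</p>

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str = "UNKNOWN"


# Hypothetical converters keyed by declared type name.
CONVERTERS = {"Int": int, "Float": float, "String": str}


def coerce_on_write(value: TypedValue, declared_type: Optional[str]) -> TypedValue:
    # No declaration, or types already agree: store the value unchanged.
    if declared_type is None or declared_type == value.type:
        return value
    convert = CONVERTERS.get(declared_type)
    if convert is None:
        return value  # unknown declared type: leave the runtime value alone
    return TypedValue(convert(value.value), declared_type)


# Runtime-inferred: Python's true division yields a float.
quotient = TypedValue(4 / 2, "Float")

# Declared: the source said `int d`, so the stored value is reconciled to Int.
stored = coerce_on_write(quotient, "Int")
```

<p>If the operation itself already produced an <code class="language-plaintext highlighter-rouge">Int</code>, the first branch returns the value untouched, which is the no-op case discussed next.</p>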

<p>But the overlap is real. If the pre-operation coercion strategies were made aware of declared types, they could produce the correct type directly, and write-time coercion would become a no-op for locally-executed instructions. It would only remain necessary for LLM-produced updates. I filed this as a future investigation task. The two-layer architecture works, but it’s worth asking whether both layers are still earning their keep now that TypedValue has changed the landscape.</p>

<p>This is the kind of question that only becomes askable after a migration is done. Before TypedValue, the question was meaningless — types and values lived in different structures by necessity. After TypedValue, the redundancy is visible.</p>

<hr />

<h2 id="the-shape-of-the-work">The Shape of the Work</h2>

<p>Here’s what the migration looked like as a dependency graph:</p>

<pre><code class="language-mermaid">flowchart TD
    P1("Phase 1&lt;br/&gt;TypedValue + BinopCoercion"):::main
    P2("Phase 2&lt;br/&gt;Handler migration"):::main
    P3("Phase 3&lt;br/&gt;Return values"):::main
    P4("Phase 4&lt;br/&gt;HeapWrite.value"):::main
    P5("Phase 5&lt;br/&gt;HeapObject.fields"):::main
    P6("Phase 6&lt;br/&gt;Closure bindings"):::main
    P7("Phase 7&lt;br/&gt;Guard cleanup"):::main

    B1("Detour&lt;br/&gt;BuiltinResult"):::detour
    B2("Detour&lt;br/&gt;Builtin TypedValue args"):::detour
    D1("Detour&lt;br/&gt;BinopCoercion return type"):::detour
    D2("Detour&lt;br/&gt;UnopCoercionStrategy"):::detour
    D3("Detour&lt;br/&gt;Demo scripts"):::detour
    D4("Detour&lt;br/&gt;LLM path leak"):::detour
    D5("Detour&lt;br/&gt;External tests"):::detour

    DOC("📄 Doc&lt;br/&gt;Type system update"):::doc

    P1 --&gt; P2 --&gt; P3 --&gt; P4 --&gt; P5 --&gt; P6 --&gt; P7
    P5 --&gt; B1 --&gt; B2
    P1 --&gt; D1 --&gt; D2
    P5 --&gt; D3 --&gt; D5
    B2 --&gt; D4
    P7 --&gt; DOC
    D2 --&gt; DOC

    classDef main fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef detour fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef doc fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a

    linkStyle 0,1,2,3,4,5 stroke:#4a90d9,stroke-width:3px
    linkStyle 6,7,8,9,10,11 stroke:#e8a735,stroke-width:2px,stroke-dasharray:5
</code></pre>

<p>The main sequence (Phases 1–7) was planned. The detours were not. Most of them started with me asking <em>“what next?”</em> or <em>“what else?”</em> after a phase completed — essentially asking the AI to audit the codebase for things I hadn’t thought of. This is a pattern I use a lot: finish a unit of work, commit, then ask the AI to look for fallout. It’s more effective than trying to anticipate everything upfront.</p>

<p>Beads kept this manageable. Each node in the graph above was a Beads task — <code class="language-plaintext highlighter-rouge">red-dragon-gsl</code> for the initial TypedValue design, <code class="language-plaintext highlighter-rouge">red-dragon-132</code> for handler migration, <code class="language-plaintext highlighter-rouge">red-dragon-vva</code> for BuiltinResult, <code class="language-plaintext highlighter-rouge">red-dragon-x9r</code> for builtin args, <code class="language-plaintext highlighter-rouge">red-dragon-d5c</code> for the BinopCoercion return type fix, and so on. When a detour surfaced — the constructor bug, the builtin side-effect problem, the PHP enum — it got filed immediately as a new task with its dependencies. Running <code class="language-plaintext highlighter-rouge">bd ready</code> after closing a task showed me the next unblocked item, which might be the next planned phase or a detour that had just become unblocked.</p>

<p>There was a concrete example of this working well. Midway through brainstorming the heap fields migration (<code class="language-plaintext highlighter-rouge">red-dragon-f6i</code>), I realized handlers needed to be migrated first. I interrupted the brainstorming, ran <code class="language-plaintext highlighter-rouge">bd update red-dragon-f6i --status deferred</code>, created the handler migration task, and pivoted. When the handler migration was done and closed, <code class="language-plaintext highlighter-rouge">bd ready</code> surfaced the deferred heap fields task automatically. Without the tracker, that kind of mid-session pivot would have meant losing track of the deferred work.</p>

<p>Each major phase also went through the brainstorming skill before implementation began. The early phases (1–3) had longer brainstorming cycles because the patterns weren’t established yet. By the later phases, the brainstorming rounds were shorter — the skill would propose an approach, I’d confirm it matched the established pattern, and we’d move to planning. The skill’s insistence on proposing alternatives before committing to a direction caught at least two cases where the obvious approach wasn’t the best one (the serialize/deserialize split in Phase 2, and the <code class="language-plaintext highlighter-rouge">BuiltinResult</code> design over direct <code class="language-plaintext highlighter-rouge">StateUpdate</code> returns).</p>

<p>Each detour emerged from one of three causes:</p>

<ol>
  <li><strong>The migration exposed a pre-existing problem</strong> (constructor bug, builtins bypassing <code class="language-plaintext highlighter-rouge">apply_update</code>, <code class="language-plaintext highlighter-rouge">BinopCoercionStrategy</code> return type).</li>
  <li><strong>The migration broke something downstream</strong> (demo scripts, LLM path leak).</li>
  <li><strong>The migration created an obvious gap</strong> (<code class="language-plaintext highlighter-rouge">UnopCoercionStrategy</code> — if binops have injectable coercion, why don’t unops?).</li>
</ol>

<p>This is, I think, the normal shape of a refactoring. The plan covers the main sequence. The detours are where the actual learning happens.</p>

<p>Some numbers:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Duration</td>
      <td>~2 days</td>
    </tr>
    <tr>
      <td>Sessions</td>
      <td>~12</td>
    </tr>
    <tr>
      <td>Phases</td>
      <td>9 major + 5 detours</td>
    </tr>
    <tr>
      <td>Commits</td>
      <td>~60</td>
    </tr>
    <tr>
      <td>Test count (start)</td>
      <td>~11,274</td>
    </tr>
    <tr>
      <td>Test count (end)</td>
      <td>~11,545</td>
    </tr>
    <tr>
      <td>Files touched</td>
      <td>~40</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="takeaways">Takeaways</h2>

<p><strong>A refactoring is a sequence of discoveries, not a sequence of steps.</strong> The plan covered the main migration. The constructor bug, the builtin side-effect problem, the double-wrapping landmine, the LLM path leak — none of these were in the original plan. They emerged because changing one thing made something else visible.</p>

<p><strong>“Every value is TypedValue, even when the type is UNKNOWN” was the single most important decision.</strong> It eliminated all branching on “is this typed or raw?” and made the migration monotonic — each phase pushed <code class="language-plaintext highlighter-rouge">TypedValue</code> one layer deeper without introducing conditional paths. Every phase that added isinstance guards for transition purposes removed them later. The guards were temporary scaffolding, not permanent complexity.</p>

<p><strong>Splitting the local and LLM execution paths was worth the upfront cost.</strong> The serialize/deserialize roundtrip existed because one code path served both local execution and LLM fallback. Once we split them, local execution became a direct path (handler produces TypedValue, <code class="language-plaintext highlighter-rouge">apply_update</code> stores it) and the LLM path got its own <code class="language-plaintext highlighter-rouge">materialize_raw_update</code> function. This separated two concerns that had been coupled since the beginning.</p>

<p><strong>Side effects should be declarative.</strong> The BuiltinResult migration wasn’t in the original plan. It became obvious once every other state change flowed through <code class="language-plaintext highlighter-rouge">StateUpdate</code> — two builtins were writing directly to <code class="language-plaintext highlighter-rouge">vm.heap</code>, and that was suddenly conspicuous. Making heap mutations declarative via <code class="language-plaintext highlighter-rouge">BuiltinResult(new_objects=..., heap_writes=...)</code> was the natural conclusion. The refactoring didn’t create this problem; it made it visible.</p>

<p><strong>Non-idempotent wrapping functions are dangerous.</strong> <code class="language-plaintext highlighter-rouge">typed_from_runtime(TypedValue(...))</code> produces a double-wrapped value. This was the single most error-prone aspect of the migration. In hindsight, making <code class="language-plaintext highlighter-rouge">typed_from_runtime</code> idempotent (detecting and passing through already-wrapped values) would have prevented an entire class of bugs.</p>
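<p>A minimal sketch of the difference, assuming a simplified <code class="language-plaintext highlighter-rouge">TypedValue</code> with string type names. Both functions below are illustrative stand-ins, not the project's <code class="language-plaintext highlighter-rouge">typed_from_runtime</code>.</p>

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TypedValue:
    value: Any
    type: str = "UNKNOWN"


def infer_type(raw: Any) -> str:
    if isinstance(raw, bool):
        return "Bool"
    if isinstance(raw, int):
        return "Int"
    if isinstance(raw, str):
        return "String"
    return "UNKNOWN"


def typed_from_runtime_naive(raw: Any) -> TypedValue:
    # Wraps whatever it is given, including an already-wrapped value.
    return TypedValue(raw, infer_type(raw))


def typed_from_runtime_idempotent(raw: Any) -> TypedValue:
    # Passing through wrapped values makes f(f(x)) == f(x).
    if isinstance(raw, TypedValue):
        return raw
    return TypedValue(raw, infer_type(raw))


once = typed_from_runtime_naive(42)
double = typed_from_runtime_naive(once)         # TypedValue wrapping a TypedValue
safe = typed_from_runtime_idempotent(typed_from_runtime_idempotent(42))
```

<p>In the naive version the double-wrapped value also loses its type: the outer wrapper is tagged <code class="language-plaintext highlighter-rouge">UNKNOWN</code>, so the bug tends to surface far from where it was introduced.</p>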

<p><strong>Atomic commits are sometimes unavoidable.</strong> Most phases allowed incremental commits. The builtin args migration did not — changing the interface without changing all callers simultaneously would break 11,530+ tests. The plan explicitly mandated no intermediate commits. This is a real constraint when doing interface-level changes in a system with high test coverage.</p>

<p><strong>The brainstorm → spec → plan → implement pipeline prevented false starts.</strong> Every phase that went through the brainstorming skill produced a spec before any code was written. The spec forced me to articulate what was changing, what the boundary conditions were, and what the migration path looked like. Twice, the brainstorming skill proposed an approach I hadn’t considered that turned out to be simpler (the serialize/deserialize split, the <code class="language-plaintext highlighter-rouge">BuiltinResult</code> design). The discipline of writing down the design before implementing it is not new — but having a tool that <em>enforces</em> the step, asks the right questions, and proposes alternatives makes it harder to skip when you’re tempted to just start coding.</p>

<p><strong>A local issue tracker changes how you handle surprises.</strong> Detours are the normal shape of a refactoring. The question is whether they derail you or get absorbed into the work. Beads made it trivial to file a new task the moment a detour surfaced, set its dependencies, and continue with the current work. When the current task was done, <code class="language-plaintext highlighter-rouge">bd ready</code> surfaced whatever was next — planned phase or newly-filed detour. The tracker turned “I should also fix this other thing I just noticed” from a context-switching hazard into a two-second operation. Over a dozen sessions and five detours, that added up.</p>

<p><strong>The AI didn’t change the nature of the work. It changed the throughput.</strong> The design decisions, the boundary table, the phase ordering, the detour triaging — all of that is the same work a human would do. The AI handled the mechanical parts: updating 40+ builtins to accept <code class="language-plaintext highlighter-rouge">TypedValue</code> args, threading kwargs through five layers of function signatures, updating test assertions across eight test files. The refactoring took two days. Without the AI, the same refactoring would have taken longer, but the intellectual structure would have been identical. The bottleneck was never typing speed. It was understanding what needed to change and in what order.</p>

<hr />

<p><em>The code is at <a href="https://github.com/avishek-sen-gupta/red-dragon">avishek-sen-gupta/red-dragon</a>. The type system documentation, updated after this migration, is at <code class="language-plaintext highlighter-rouge">docs/type-system.md</code>.</em></p>]]></content><author><name>avishek</name></author><category term="Software Engineering" /><category term="Refactoring" /><category term="AI-Assisted Development" /><category term="Type Systems" /><summary type="html"><![CDATA[Tracing the full arc of a multi-phase refactoring — from “Java string concatenation crashes the VM” to “every value in the system carries its type” — done across a dozen sessions with Claude Code over two days.]]></summary></entry><entry><title type="html">Engineering Log: Building Non-Trivial Systems with an AI Coding Assistant</title><link href="https://avishek.net/2026/03/12/experiences-building-with-coding-assistant.html" rel="alternate" type="text/html" title="Engineering Log: Building Non-Trivial Systems with an AI Coding Assistant" /><published>2026-03-12T00:00:00+05:30</published><updated>2026-03-12T00:00:00+05:30</updated><id>https://avishek.net/2026/03/12/experiences-building-with-coding-assistant</id><content type="html" xml:base="https://avishek.net/2026/03/12/experiences-building-with-coding-assistant.html"><![CDATA[<p><em>Notes from building a multi-language code analysis engine across 400+ conversation sessions with Claude Code.</em></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ul>
  <li><a href="#context">Context</a></li>
  <li><a href="#reddragon-how-the-architecture-emerged">RedDragon: How the Architecture Emerged</a>
    <ul>
      <li><a href="#the-initial-session-feb-2526">The Initial Session (Feb 25–26)</a></li>
      <li><a href="#the-determinism-pivot">The Determinism Pivot</a></li>
      <li><a href="#deterministic-frontends-for-15-languages">Deterministic Frontends for 15 Languages</a></li>
      <li><a href="#the-implementation-rhythm">The Implementation Rhythm</a></li>
    </ul>
  </li>
  <li><a href="#growing-the-test-suite">Growing the Test Suite</a>
    <ul>
      <li><a href="#cross-language-testing-via-rosetta-and-exercism">Cross-Language Testing via Rosetta and Exercism</a></li>
      <li><a href="#the-dispatch-audit-loop">The Dispatch Audit Loop</a></li>
    </ul>
  </li>
  <li><a href="#the-assertion-audit-or-why-green-tests-may-not-imply-a-working-system">The Assertion Audit, or, Why Green Tests may not imply a working system</a>
    <ul>
      <li><a href="#weak-assertion-patterns">Weak Assertion Patterns</a></li>
      <li><a href="#the-audit-process">The Audit Process</a></li>
      <li><a href="#bugs-found-behind-weak-assertions">Bugs Found Behind Weak Assertions</a></li>
      <li><a href="#assertion-audit-lessons">Assertion Audit Lessons</a></li>
    </ul>
  </li>
  <li><a href="#guardrails-claudemd">Guardrails: CLAUDE.md</a>
    <ul>
      <li><a href="#build-rules">Build Rules</a></li>
      <li><a href="#testing-rules">Testing Rules</a></li>
      <li><a href="#programming-rules">Programming Rules</a></li>
      <li><a href="#the-workflow-evolution">The Workflow Evolution</a></li>
    </ul>
  </li>
  <li><a href="#structured-agent-memory">Structured Agent Memory</a>
    <ul>
      <li><a href="#the-continuity-problem">The Continuity Problem</a></li>
      <li><a href="#issue-tracking-with-beads">Issue Tracking with Beads</a></li>
      <li><a href="#gap-analysis-as-planning">Gap Analysis as Planning</a></li>
      <li><a href="#the-type-system-evolution">The Type System Evolution</a></li>
      <li><a href="#memory-files">Memory Files</a></li>
      <li><a href="#the-quick-win-trap">The Quick Win Trap</a></li>
    </ul>
  </li>
  <li><a href="#patterns-and-observations">Patterns and Observations</a>
    <ul>
      <li><a href="#the-anonymous-class-story-or-why-the-ai-reaches-for-new-infrastructure">The Anonymous Class Story, or, Why the AI Reaches for New Infrastructure</a></li>
    </ul>
  </li>
  <li><a href="#what-i-would-change">What I Would Change</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ul>

<hr />

<h2 id="context">Context</h2>

<p>Over February–March 2026, I built <strong><a href="https://github.com/avishek-sen-gupta/red-dragon">RedDragon</a></strong> — a multi-language code analysis engine with a universal IR, deterministic VM, type system, and iterative dataflow analysis — almost entirely through conversations with Claude Code. RedDragon was built in an initial session, then refined across 237+ more, with 73 additional sessions on its precursor project. That’s roughly 400+ human-AI conversation sessions total.</p>

<p>This post documents what I learned about directing an AI to build a system of this scale.</p>

<hr />

<p><img src="/assets/pipeline-viz.gif" alt="Demo" /></p>

<h2 id="reddragon-how-the-architecture-emerged">RedDragon: How the Architecture Emerged</h2>

<h3 id="the-initial-session-feb-2526">The Initial Session (Feb 25–26)</h3>

<p>On Feb 25, I opened a fresh session and described what I wanted: a universal symbolic interpreter that parses source in any language, lowers it to a flat IR, builds a CFG, and executes it symbolically, handling missing imports and unknown externals gracefully.</p>

<p>The first thing I asked: <em>“Is there an existing IR/VM that already does this?”</em> There wasn’t a good fit for what I needed — symbolic execution of incomplete programs across 15 languages — so we proceeded.</p>

<p>The git log from that day shows the progression:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>3bfbead  Initial implementation of LLM symbolic interpreter
7eb9721  Local function dispatch, builtins, and scope chain
ed4a3c8  Break up interpreter.py into modular package
b459cdc  Make VM fully deterministic: replace all LLM fallbacks
         with symbolic value creation
bd51810  Add LLM-based frontend for multi-language source-to-IR lowering
4b6f815  Add closure support
6bdd973  Add deterministic tree-sitter frontends for 14 languages
         (346 tests, 0 failures)
</code></pre></div></div>

<p>Seven commits, each one a distinct architectural decision. The ADR log (which I had Claude write retroactively) captures the reasoning behind each.</p>

<h3 id="the-determinism-pivot">The Determinism Pivot</h3>

<p>The initial design had the LLM deciding state changes at each execution step. After implementing it, I asked: <em>“Given that the IR is always bounded, shouldn’t the IR execution be deterministic?”</em></p>

<p>This turned out to matter. We replaced all LLM calls in the VM with symbolic value creation. When the VM encountered an unresolved import, it created a <code class="language-plaintext highlighter-rouge">SymbolicValue</code> with a descriptive hint instead of asking an LLM. The value propagated through computation deterministically. The entire execution became reproducible.</p>
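<p>The idea can be sketched in a few lines. Only the name <code class="language-plaintext highlighter-rouge">SymbolicValue</code> and the notion of a descriptive hint come from the text above; the fields, the naming scheme and the toy <code class="language-plaintext highlighter-rouge">VM</code> class are assumptions for illustration.</p>

```python
import operator
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SymbolicValue:
    name: str  # hypothetical fields
    hint: str


CONCRETE_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def describe(value: Any) -> str:
    return value.name if isinstance(value, SymbolicValue) else repr(value)


class VM:
    def __init__(self) -> None:
        self.counter = 0

    def fresh_symbolic(self, hint: str) -> SymbolicValue:
        # Names come from a counter, so the same program yields the same names.
        name = f"sym_{self.counter}"
        self.counter += 1
        return SymbolicValue(name, hint)

    def load_import(self, module: str, known_modules: dict) -> Any:
        if module in known_modules:
            return known_modules[module]
        # Unresolved import: create a placeholder instead of asking an LLM.
        return self.fresh_symbolic(f"unresolved import: {module}")

    def binop(self, op: str, left: Any, right: Any) -> Any:
        if isinstance(left, SymbolicValue) or isinstance(right, SymbolicValue):
            # Anything computed from a symbolic value is itself symbolic.
            return self.fresh_symbolic(f"{describe(left)} {op} {describe(right)}")
        return CONCRETE_OPS[op](left, right)


def run_once():
    vm = VM()
    numpy_module = vm.load_import("numpy", known_modules={})
    total = vm.binop("+", numpy_module, 1)
    concrete = vm.binop("*", 2, 3)
    return numpy_module, total, concrete
```

<p>Calling <code class="language-plaintext highlighter-rouge">run_once()</code> twice gives identical results, which is the property that makes this kind of execution testable without any LLM in the loop.</p>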

<p>With the VM deterministic, the LLM’s role narrowed to one thing: translating source code to IR. And even that was constrained — the prompt provided all opcode schemas, concrete patterns, and a worked example. The LLM was acting as a mechanical translator, not a reasoning engine.</p>

<p>This was the decision that shaped everything else. Once execution was deterministic, everything became testable. The entire test suite runs with zero LLM calls.</p>

<h3 id="deterministic-frontends-for-15-languages">Deterministic Frontends for 15 Languages</h3>

<p>Rather than using the LLM at runtime to lower source to IR, I asked: <em>“How hard is it to write deterministic logic to lower ASTs to IR for 15 languages?”</em></p>

<p>Not that hard, with tree-sitter and a dispatch table engine. Claude generated tree-sitter-based frontends for 14 languages in a single session. Each frontend extends a <code class="language-plaintext highlighter-rouge">BaseFrontend</code> class with two dispatch tables (one for statements, one for expressions) mapping AST node types to handler methods. Common constructs (<code class="language-plaintext highlighter-rouge">if/else</code>, <code class="language-plaintext highlighter-rouge">while</code>, <code class="language-plaintext highlighter-rouge">for</code>, <code class="language-plaintext highlighter-rouge">return</code>) are handled in the base class. Language-specific constructs override or extend.</p>

<p>Sub-millisecond. Zero LLM calls. Fully testable. 346 tests on day one.</p>
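<p>The two-table shape can be sketched roughly as follows. This is not RedDragon’s actual code: it uses plain dicts in place of tree-sitter nodes, and every method name, the tuple-based IR, and the <code class="language-plaintext highlighter-rouge">LuaLikeFrontend</code> subclass are assumptions made for illustration. Only <code class="language-plaintext highlighter-rouge">BaseFrontend</code> and the two-dispatch-table idea come from the text above.</p>

```python
class BaseFrontend:
    """Lowers dict-shaped AST nodes to IR tuples via two dispatch tables."""

    def __init__(self):
        self.ir = []
        self.next_reg = 0
        # Constructs shared by most languages live in the base tables.
        self.stmt_dispatch = {"return_statement": self.lower_return}
        self.expr_dispatch = {"integer": self.lower_integer}

    def fresh_reg(self):
        reg = f"%{self.next_reg}"
        self.next_reg += 1
        return reg

    def lower_stmt(self, node):
        handler = self.stmt_dispatch.get(node["type"])
        if handler is None:
            # Unhandled statements surface as SYMBOLIC so audits can find them.
            self.ir.append(("SYMBOLIC", f"unsupported:{node['type']}"))
            return
        handler(node)

    def lower_expr(self, node):
        return self.expr_dispatch[node["type"]](node)

    def lower_integer(self, node):
        reg = self.fresh_reg()
        self.ir.append(("CONST", reg, node["value"]))
        return reg

    def lower_return(self, node):
        self.ir.append(("RETURN", self.lower_expr(node["value"])))


class LuaLikeFrontend(BaseFrontend):
    """A language-specific frontend extends the inherited tables."""

    def __init__(self):
        super().__init__()
        self.expr_dispatch["nil"] = self.lower_nil

    def lower_nil(self, node):
        reg = self.fresh_reg()
        self.ir.append(("CONST", reg, None))
        return reg


frontend = LuaLikeFrontend()
frontend.lower_stmt({"type": "return_statement", "value": {"type": "nil"}})
frontend.lower_stmt({"type": "goto_statement"})
print(frontend.ir)
```

<p>Because lowering is a dictionary lookup plus a method call, each handler can be unit-tested with a hand-built node and no parser in the loop.</p>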

<p>When the LLM frontend hit context window limits on large files, we added a chunked frontend that decomposes files into per-function chunks via tree-sitter, lowers each independently, then renumbers registers and reassembles.</p>
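<p>The renumber-and-reassemble step is the part that is easy to get wrong, so here is a hedged sketch of it. The textual IR format (<code class="language-plaintext highlighter-rouge">%N</code> registers inside instruction strings) and all function names are assumptions for illustration; the sketch also assumes each chunk numbers its registers from zero.</p>

```python
import re
from itertools import accumulate

REGISTER = re.compile(r"%(\d+)")


def renumber(chunk, offset):
    """Shift every %N register in a chunk by offset."""
    def shift(match):
        return f"%{int(match.group(1)) + offset}"
    return [REGISTER.sub(shift, instruction) for instruction in chunk]


def register_count(chunk):
    """Number of registers a chunk uses, assuming numbering starts at 0."""
    numbers = [int(n) for instruction in chunk for n in REGISTER.findall(instruction)]
    return max(numbers, default=-1) + 1


def reassemble(chunks):
    """Concatenate chunks, giving each one a disjoint register range."""
    counts = [register_count(chunk) for chunk in chunks]
    offsets = [0, *accumulate(counts)]
    return [
        instruction
        for chunk, offset in zip(chunks, offsets)
        for instruction in renumber(chunk, offset)
    ]


chunks = [
    ["%0 = CONST 1", "%1 = ADD %0 %0"],
    ["%0 = CONST 2", "RETURN %0"],
]
print(reassemble(chunks))
```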

<h3 id="the-implementation-rhythm">The Implementation Rhythm</h3>

<p>The initial session followed a repeating cycle:</p>

<ol>
  <li><strong>Implement a feature</strong> (30–60 minutes)</li>
  <li><strong>Run it on real code</strong> and inspect the output</li>
  <li><strong>Identify the next gap</strong> (“any other language features not covered?”)</li>
  <li><strong>Audit for completeness</strong>, then batch-implement all gaps</li>
  <li><strong>Clean up immediately</strong>: refactor, split large files, reorganise tests</li>
</ol>

<p>I didn’t let technical debt accumulate. When <code class="language-plaintext highlighter-rouge">interpreter.py</code> hit 1,200 lines, I said <em>“break up interpreter.py, it’s too big.”</em> When the registry module grew three responsibilities, I split it into three files. When tests were in a flat directory, I separated them into <code class="language-plaintext highlighter-rouge">unit/</code> and <code class="language-plaintext highlighter-rouge">integration/</code>.</p>

<p>Filling language-specific gaps across all 15 frontends was systematic: ask Claude to audit every frontend for missing constructs, prioritise by impact, say <em>“implement all the critical and common ones”</em>, push, re-audit. This cycle repeated 4–5 times, each time catching a smaller set of remaining gaps.</p>

<hr />

<h2 id="growing-the-test-suite">Growing the Test Suite</h2>

<h3 id="cross-language-testing-via-rosetta-and-exercism">Cross-Language Testing via Rosetta and Exercism</h3>

<p>The test count tells the progression:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initial frontends:    346 tests
After tooling:       ~700 tests
Rosetta suite:     ~1,200 tests
Exercism (final):   7,268 tests
COBOL + audit:      8,569 tests
Type system:       ~9,400 tests
Gap analysis:     10,152 tests
</code></pre></div></div>

<p>The Rosetta cross-language test suite (15 algorithms across 15 languages) and the Exercism integration suite drove most of this growth. Each exercise exposed new frontend gaps, VM limitations, and edge cases, and each fix was immediately verified across all 15 languages.</p>

<p>All tests run without LLM calls and are deterministic.</p>

<h3 id="the-dispatch-audit-loop">The Dispatch Audit Loop</h3>

<p>After every batch of frontend work, I ran a two-pass audit:</p>

<p><strong>Pass 1 (Dispatch Comparison):</strong> Parse source samples in all 15 languages, collect every AST node type that appears, compare against the frontend’s dispatch tables, and classify unhandled types as structural (harmless, consumed by parent handlers) or substantive (gaps that produce <code class="language-plaintext highlighter-rouge">SYMBOLIC</code>).</p>

<p><strong>Pass 2 (Runtime SYMBOLIC check):</strong> Lower the source through each frontend, scan the resulting IR for <code class="language-plaintext highlighter-rouge">SYMBOLIC</code> instructions with <code class="language-plaintext highlighter-rouge">"unsupported:"</code> operands.</p>
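<p>Pass 2 amounts to a filter over the lowered IR. A minimal sketch, assuming the IR is available as <code class="language-plaintext highlighter-rouge">(opcode, operand)</code> pairs (a simplification; the real instruction shape is not shown in this post) and using a hypothetical helper name:</p>

```python
def unsupported_constructs(ir):
    """Return the construct names carried by SYMBOLIC unsupported: operands."""
    prefix = "unsupported:"
    return [
        operand[len(prefix):]
        for opcode, operand in ir
        if opcode == "SYMBOLIC" and operand.startswith(prefix)
    ]


ir = [
    ("CONST", "1"),
    ("SYMBOLIC", "unsupported:goto_statement"),
    ("SYMBOLIC", "import:os"),  # symbolic, but not a lowering gap
    ("RETURN", "%0"),
]
print(unsupported_constructs(ir))
```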

<p>The classification heuristic itself went through three iterations:</p>

<ol>
  <li><strong>Naive:</strong> Flag everything not in a dispatch table. Hundreds of false positives, because nodes like <code class="language-plaintext highlighter-rouge">parameter_list</code> are consumed by parent handlers.</li>
  <li><strong>Parent heuristic:</strong> Flag unhandled nodes only if their immediate parent isn’t handled. This reduced false positives but still reported 259 substantive gaps.</li>
  <li><strong>Block-reachability analysis:</strong> Walk the AST and identify which unhandled nodes are direct named children of block-iterated nodes. Only these can reach <code class="language-plaintext highlighter-rouge">_lower_stmt</code> and produce <code class="language-plaintext highlighter-rouge">SYMBOLIC</code>. This dropped substantive gaps from 259 to 1.</li>
</ol>
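<p>The third iteration can be sketched as below. This is an illustration under assumptions, not the project’s audit script: the AST is a nested dict containing only named nodes, and <code class="language-plaintext highlighter-rouge">handled</code> and <code class="language-plaintext highlighter-rouge">block_types</code> are hypothetical inputs standing for the dispatch-table keys and the node types whose children are iterated as statements.</p>

```python
def walk(node):
    """Yield every node in a dict-shaped AST, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def classify_unhandled(root, handled, block_types):
    """Split unhandled node types into (substantive, structural) sets."""
    all_types = {node["type"] for node in walk(root)}
    # Only direct children of block-iterated nodes reach statement lowering.
    substantive = {
        child["type"]
        for node in walk(root)
        if node["type"] in block_types
        for child in node.get("children", [])
        if child["type"] not in handled
    }
    structural = all_types - handled - substantive
    return substantive, structural


tree = {"type": "module", "children": [
    {"type": "function_definition", "children": [
        {"type": "parameter_list", "children": [{"type": "identifier"}]},
        {"type": "block", "children": [
            {"type": "return_statement"},
            {"type": "goto_statement"},
        ]},
    ]},
]}

substantive, structural = classify_unhandled(
    tree,
    handled={"function_definition", "return_statement"},
    block_types={"module", "block"},
)
print(sorted(substantive), sorted(structural))
```

<p>In this toy tree, <code class="language-plaintext highlighter-rouge">parameter_list</code> is unhandled but never reaches statement lowering, so it lands in the structural set; only <code class="language-plaintext highlighter-rouge">goto_statement</code> is reported as a gap.</p>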

<p>The audit loop ran dozens of times:</p>

<pre><code class="language-mermaid">flowchart LR
    A("🔍 Audit all&lt;br/&gt;frontends"):::audit --&gt; G{"Gaps&lt;br/&gt;found?"}:::decide
    G -- "Yes (34 → 19 → 12 → ...)" --&gt; I("🔧 Batch-implement&lt;br/&gt;all gaps"):::impl
    I --&gt; T("✅ Add tests"):::test
    T --&gt; A
    G -- "No: 0 gaps,&lt;br/&gt;0 SYMBOLIC" --&gt; D("🏁 Done"):::done

    classDef audit fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef decide fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef impl fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef test fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef done fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#1a3a1a
</code></pre>

<p>This pattern (audit, batch-fix, re-audit) was more effective than trying to enumerate every missing feature upfront.</p>

<hr />

<h2 id="the-assertion-audit-or-why-green-tests-may-not-imply-a-working-system">The Assertion Audit, or Why Green Tests May Not Imply a Working System</h2>

<p>By March 2026, the test suite had grown to ~8,400 tests across ~130 files. All green. The question I’d been putting off: <strong>if every test passes, how do I know each test is actually checking what it says it’s checking?</strong></p>

<p>When an AI writes your tests, you get volume and coverage breadth. What you don’t get is <em>assertion depth</em>. The AI produces a test for every function, every edge case, every language, but each individual assertion may be checking the easy thing (does it not crash?) rather than the hard thing (does it produce the right output?). The AI optimises for the test <em>passing</em>, not for the test <em>verifying</em>.</p>

<h3 id="weak-assertion-patterns">Weak Assertion Patterns</h3>

<p>Over two days and 11 audit passes, I had Claude scan every test file, comparing each test’s name against its actual assertions. The patterns that emerged:</p>

<ul>
  <li><strong>OR-fallback assertions.</strong> Tests like <code class="language-plaintext highlighter-rouge">assert Opcode.BRANCH_IF in opcodes or Opcode.BRANCH in opcodes</code> — where <code class="language-plaintext highlighter-rouge">BRANCH</code> (unconditional jump) exists in virtually every program, making the assertion tautologically true. 23+ instances across Scala match/case, C# switch expressions, COBOL PERFORM ordering, and IR stats tests.</li>
  <li><strong>Existence-only checks.</strong> <code class="language-plaintext highlighter-rouge">assert len(writes) &gt;= 1</code> on WRITE_REGION, satisfied by DATA DIVISION initial-value writes, leaving the PROCEDURE statement under test untested. The strengthened version decoded the EBCDIC bytes and checked specific values.</li>
  <li><strong>Cross-product matching.</strong> <code class="language-plaintext highlighter-rouge">assert any(bi &lt; pi for bi in branch_if_indices for pi in print_indices)</code> — <code class="language-plaintext highlighter-rouge">any()</code> over a cross-product matches if <em>any</em> <code class="language-plaintext highlighter-rouge">BRANCH_IF</code> appears before <em>any</em> print, even from unrelated parts of the program.</li>
  <li><strong>Silent parametrised passes.</strong> Bare <code class="language-plaintext highlighter-rouge">return</code> in parametrised tests for excluded languages — 11 languages were showing as green in the closure test report with zero assertions executed. The fix was <code class="language-plaintext highlighter-rouge">pytest.skip()</code> with a reason string.</li>
  <li><strong>Tautological guards.</strong> <code class="language-plaintext highlighter-rouge">if "x" in result.definitions:</code> where <code class="language-plaintext highlighter-rouge">result.definitions</code> was a list of <code class="language-plaintext highlighter-rouge">Definition</code> objects, not a dict. The <code class="language-plaintext highlighter-rouge">in</code> check always returned <code class="language-plaintext highlighter-rouge">False</code>, so the assertion never fired.</li>
</ul>
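<p>To make the first and third patterns concrete, here is a small sketch that treats a program as a flat list of opcode names (an assumption; real tests inspect IR instructions). The <code class="language-plaintext highlighter-rouge">strong_*</code> functions show one possible strengthening, not necessarily the one applied in the project.</p>

```python
def weak_or_fallback(opcodes):
    # BRANCH appears in almost every program, so this rarely fails.
    return "BRANCH_IF" in opcodes or "BRANCH" in opcodes


def strong_branch_check(opcodes):
    # Name the one opcode the test is actually about.
    return "BRANCH_IF" in opcodes


def indices(opcodes, wanted):
    return [i for i, op in enumerate(opcodes) if op == wanted]


def weak_ordering(opcodes):
    # True if any BRANCH_IF precedes any PRINT, however unrelated.
    return any(
        pi > bi
        for bi in indices(opcodes, "BRANCH_IF")
        for pi in indices(opcodes, "PRINT")
    )


def strong_ordering(opcodes):
    # One possible strengthening: the first PRINT follows the first BRANCH_IF.
    branches, prints = indices(opcodes, "BRANCH_IF"), indices(opcodes, "PRINT")
    return bool(branches) and bool(prints) and prints[0] > branches[0]


# A lowering that lost its conditional still satisfies the OR-fallback.
no_conditional = ["CONST", "BRANCH", "PRINT"]
# A print emitted before the conditional still satisfies the cross-product.
unguarded_print = ["PRINT", "CONST", "BRANCH_IF", "PRINT"]

print(weak_or_fallback(no_conditional), strong_branch_check(no_conditional))
print(weak_ordering(unguarded_print), strong_ordering(unguarded_print))
```

<p>In both cases the weak form passes on a program that is wrong, while the strengthened form fails, which is the behaviour a test is supposed to have.</p>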

<h3 id="the-audit-process">The Audit Process</h3>

<p><strong>Phase 1: Discrepancy audits.</strong> Tests whose names contradicted what the code did. A diamond-shape test that asserted stadium shape; <code class="language-plaintext highlighter-rouge">test_constructor_sets_fields</code> that never verified field values.</p>

<p><strong>Phase 2: Name-vs-assertion audits.</strong> Does each test assert what its name claims? 52 violations across 22 files.</p>

<p><strong>Phase 3: Priority-based audits.</strong> P0 (false confidence), P1 (missing key assertion), P2 (weak/generic), P3 (cosmetic). Re-scans after each fix batch drove the count down: 82 to 56 to 17.</p>

<p><strong>Phase 4: Reconciliation.</strong> The violation list kept changing between audits. Items fixed reappeared with different wording. The fix was anchoring: starting from the previous audit’s known remaining items and verifying each against the current code.</p>

<p>The governing principle throughout: <strong>strengthen the assertion to match the name, never weaken the name to match the assertion.</strong> Renaming moves the problem. Strengthening closes the gap.</p>

<h3 id="bugs-found-behind-weak-assertions">Bugs Found Behind Weak Assertions</h3>

<p>P0 fixes exposed genuine bugs:</p>

<p><strong>Pascal bare-except.</strong> The Pascal frontend silently dropped bare <code class="language-plaintext highlighter-rouge">except</code> blocks (without <code class="language-plaintext highlighter-rouge">on E: Exception do</code> wrapper). The test passed because it only checked that <code class="language-plaintext highlighter-rouge">STORE_VAR "x"</code> existed, satisfied by the try body alone. Strengthening the assertion exposed the bug.</p>

<p><strong>C# else-if chain lowering.</strong> A weak assertion masked incomplete lowering of chained else-if blocks.</p>

<p><strong><code class="language-plaintext highlighter-rouge">test_no_self_dependency_without_loop</code>.</strong> A guard checking membership on a list of <code class="language-plaintext highlighter-rouge">Definition</code> objects always returned <code class="language-plaintext highlighter-rouge">False</code>, so the assertion never executed.</p>

<h3 id="assertion-audit-lessons">Assertion Audit Lessons</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Audit passes</td>
      <td>11</td>
    </tr>
    <tr>
      <td>Unique violations (deduplicated)</td>
      <td>~90</td>
    </tr>
    <tr>
      <td>Violations fixed</td>
      <td>~75</td>
    </tr>
    <tr>
      <td>Frontend bugs exposed</td>
      <td>2</td>
    </tr>
    <tr>
      <td>False-pass tests eliminated</td>
      <td>5</td>
    </tr>
  </tbody>
</table>

<p>The test count went <em>up</em>, not down. Strengthening assertions sometimes meant splitting one weak test into multiple specific ones.</p>

<p>Key lessons:</p>

<ul>
  <li><strong>Green tests are necessary but not sufficient.</strong> 15 P0 violations where the test would pass even when the feature was broken.</li>
  <li><strong>OR-fallback assertions are the most dangerous pattern.</strong> One side trivially satisfied by unrelated instructions.</li>
  <li><strong>Audit stability requires anchoring.</strong> Fresh scans produce inconsistent results. The reconciliation approach is necessary.</li>
  <li><strong>Fixing assertions requires running the code.</strong> Many fix attempts failed because the assertion assumed a representation that didn’t match reality. The cycle (write assertion, run, discover actual representation, fix, re-run) never happened voluntarily.</li>
  <li><strong>Parametrised tests need explicit skips, not silent returns.</strong></li>
</ul>

<hr />

<h2 id="guardrails-claudemd">Guardrails: CLAUDE.md</h2>

<p>The file that had the most impact on consistency wasn’t any Python module. It was <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>, the development rules file that Claude Code reads at the start of every session. The rules evolved over the project’s lifetime, each one added in response to a specific failure mode.</p>

<h3 id="build-rules">Build Rules</h3>

<ul>
  <li><strong>“Before committing anything, run all tests, fixing them if necessary.”</strong> This prevented test count regression across 292 commits. A companion rule covers removals: if test assertions are being <em>removed</em>, ask for human review first.</li>
  <li><strong>“Before committing anything, run <code class="language-plaintext highlighter-rouge">poetry run black</code> on the full codebase.”</strong> CI enforces this.</li>
  <li><strong>“Before committing anything, update the README based on the diffs.”</strong> Without this, the README would have drifted within the first week.</li>
  <li><strong>“For each feature, treat it as an independent commit / push, with its own testing.”</strong> Atomic, reviewable commits. Combined with “do not start a new task until the current one is committed,” this prevented half-finished features from accumulating across sessions.</li>
  <li><strong>“Once a design is finalised, document it as an ADR.”</strong> This produced 100+ architectural decision records that serve as the project’s institutional memory.</li>
</ul>

<h3 id="testing-rules">Testing Rules</h3>

<ul>
  <li><strong>“When fixing tests, do not blindly change test assertions to make the test pass.”</strong> Without this, the AI’s default behaviour is to modify the assertion to match whatever the code produces, regardless of whether the code is correct.</li>
  <li><strong>“Make sure you are not creating any special implementation behaviour just to get the tests to pass.”</strong> Without this, the AI occasionally added if-branches in production code solely to satisfy a test expectation.</li>
  <li><strong>“Do not use <code class="language-plaintext highlighter-rouge">unittest.mock.patch</code>. Use proper dependency injection.”</strong> This forced every external dependency to be injectable. The entire VM, all 15 frontends, and all analysis passes are testable in isolation.</li>
  <li><strong>“For every bug you fix, make sure you have a test that fails without the bug fix.”</strong></li>
</ul>
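<p>The dependency-injection rule is easiest to see with the one component that would otherwise need patching: an LLM call. The sketch below is hypothetical throughout (the <code class="language-plaintext highlighter-rouge">LLMFrontend</code> class, its <code class="language-plaintext highlighter-rouge">complete</code> parameter, and the prompt text are invented for illustration); it shows the shape of the rule, not the project’s implementation.</p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LLMFrontend:
    """Lowers source to IR text through an injected completion function."""
    complete: Callable[[str], str]

    def lower(self, source: str) -> list[str]:
        response = self.complete(f"Lower to IR:\n{source}")
        return [line for line in response.splitlines() if line.strip()]


def fake_complete(prompt: str) -> str:
    """Test double: deterministic, no network, no patching."""
    return "%0 = CONST 1\n\nRETURN %0\n"


frontend = LLMFrontend(complete=fake_complete)
print(frontend.lower("return 1"))
```

<p>The test passes a plain function where production code would pass a real client, so nothing has to be monkey-patched at import time.</p>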

<h3 id="programming-rules">Programming Rules</h3>

<ul>
  <li><strong>“STOP USING FOR LOOPS WITH MUTATIONS IN THEM.”</strong> This forced a functional style: list comprehensions, <code class="language-plaintext highlighter-rouge">map</code>, <code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">reduce</code> instead of mutable accumulators.</li>
  <li><strong>“Categorically avoid defensive programming.”</strong> Defensive code hides bugs. A <code class="language-plaintext highlighter-rouge">None</code> check that silently returns an empty list masks the fact that a value should never have been <code class="language-plaintext highlighter-rouge">None</code>. Without this rule, the AI adds defensive checks reflexively.</li>
  <li><strong>“If a function has a non-None return type, never return None.”</strong> Use the null object pattern instead.</li>
  <li><strong>“When writing <code class="language-plaintext highlighter-rouge">if</code> conditions, prefer early return.”</strong> Without this, the AI nests the happy path inside increasingly deep conditionals.</li>
  <li><strong>“Do not use static methods.”</strong> Static methods resist dependency injection and create hidden coupling.</li>
  <li><strong>“Use a ports-and-adapter type architecture. Adhere to ‘Functional Core, Imperative Shell’.”</strong> The VM handlers are pure functions returning <code class="language-plaintext highlighter-rouge">StateUpdate</code> data objects. The dataflow module is a pure analysis pass. I/O lives at the edges.</li>
  <li><strong>“Parameters in functions, if they must have default values, must have those values as empty structures corresponding to the non-empty types.”</strong> Empty dicts, empty lists, never <code class="language-plaintext highlighter-rouge">None</code>.</li>
</ul>
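<p>Three of these rules (no mutating loops, null object instead of <code class="language-plaintext highlighter-rouge">None</code>, early return) combine naturally. A minimal sketch, with a hypothetical <code class="language-plaintext highlighter-rouge">Definition</code> record whose fields are assumed for the example:</p>

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Definition:
    name: str
    line: int


# Null object: callers get a Definition-shaped value instead of None.
NO_DEFINITION = Definition(name="", line=-1)


def find_definition(definitions, name):
    """Return the first match, or the null object when nothing matches."""
    return next((d for d in definitions if d.name == name), NO_DEFINITION)


def describe(definition):
    # Early return keeps the happy path unindented.
    if definition == NO_DEFINITION:
        return "undefined"
    return f"{definition.name} defined on line {definition.line}"


def last_line_by_name(definitions):
    """Fold into a new dict at each step rather than mutating an accumulator."""
    return reduce(lambda acc, d: {**acc, d.name: d.line}, definitions, {})


definitions = [Definition("x", 1), Definition("y", 2), Definition("x", 5)]
print(describe(find_definition(definitions, "y")))
print(describe(find_definition(definitions, "z")))
print(last_line_by_name(definitions))
```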

<h3 id="the-workflow-evolution">The Workflow Evolution</h3>

<p>The workflow encoded in CLAUDE.md changed over time:</p>

<p><strong>Early workflow:</strong></p>

<pre><code class="language-mermaid">flowchart LR
    B1("🧠 Brainstorm"):::phase1 --&gt; P1("📐 Plan"):::phase2 --&gt; I1("⚙️ Implement"):::phase3 --&gt; T1("🧪 Test"):::phase4

    classDef phase1 fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef phase2 fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef phase3 fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef phase4 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p><strong>Revised workflow (TDD):</strong></p>

<pre><code class="language-mermaid">flowchart LR
    B2("🧠 Brainstorm"):::phase1 --&gt; D2("⚖️ Discuss&lt;br/&gt;trade-offs"):::phase1 --&gt; P2("📐 Plan"):::phase2 --&gt; T2("🧪 Write&lt;br/&gt;tests"):::test --&gt; I2("⚙️ Implement"):::phase3 --&gt; F2("🔧 Fix&lt;br/&gt;tests"):::test --&gt; C2("✅ Commit"):::done --&gt; R2("♻️ Refactor"):::done

    classDef phase1 fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef phase2 fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef phase3 fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef test fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef done fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>Early: <strong>Brainstorm -&gt; Plan -&gt; Implement -&gt; Test.</strong> Tests came after implementation. This was the root cause of the weak assertion patterns the audit uncovered — when the AI writes tests <em>after</em> the code exists, it tends to assert what the code <em>does</em> rather than what it <em>should do</em>. The test becomes a description of current behaviour, not a specification of correct behaviour.</p>

<p>The assertion audit made this cost concrete. After spending two days fixing ~75 violations — OR-fallbacks, existence-only checks, silent parametrised passes — I changed the workflow to test-first. The AI writes tests that encode expected behaviour <em>before</em> implementation. It then writes code to make them pass. This inverts the incentive: the test defines the target, and the code adapts to meet it, rather than the test adapting to describe whatever the code produced.</p>

<p>The type system work (Phase 3 onward) was built entirely under this TDD workflow. The difference was visible — the type inference tests asserted specific return types for specific expressions, not just “inference produced a result.”</p>

<p>Every rule was reactive. “STOP USING FOR LOOPS WITH MUTATIONS” came after mutation bugs. “Don’t blindly change test assertions” came after watching the AI weaken tests to make them pass. “Categorically avoid defensive programming” came after silent <code class="language-plaintext highlighter-rouge">None</code> checks masked real bugs. Each rule represents a mistake that happened at least once.</p>

<hr />

<h2 id="structured-agent-memory">Structured Agent Memory</h2>

<h3 id="the-continuity-problem">The Continuity Problem</h3>

<p>By session 200, the project had outgrown ad-hoc session management. At 8,500+ tests, 15 frontends, 66 ADRs, and a type system under active development, I’d open a new conversation and describe what I wanted — but determining <em>what</em> was getting harder. Which frontend gaps remained? Which were P0 vs. P1? What depended on what?</p>

<p>The deeper problem was agent amnesia. Each conversation started from zero. Claude didn’t remember that the TypeExpr migration was complete, that Pascal <code class="language-plaintext highlighter-rouge">declLabels</code> was already handled, that <code class="language-plaintext highlighter-rouge">poetry run python -m pytest</code> was the correct test command. At 50 sessions, this was a minor annoyance. At 200, it was the main cost.</p>

<p>The solution was a <strong>structured memory layer</strong> — persistent artefacts that the agent reads at session start. Four components: a curated memory file, a gap analysis document, an issue tracker, and architectural decision records. Together they answer: <em>what’s the current state?</em>, <em>what’s left to do?</em>, <em>what are the rules?</em>, and <em>why were past decisions made?</em></p>

<h3 id="issue-tracking-with-beads">Issue Tracking with Beads</h3>

<p>The inflection point was a frontend lowering gap analysis: cross-referencing every frontend’s dispatch table against its tree-sitter grammar. Result: 25 P0, 187 P1, and ~326 P2 gaps across 15 languages. The P0s were resolved in three commits. But 187 P1 gaps couldn’t be managed as a mental list.</p>

<p>I chose <a href="https://github.com/eqlabs/beads">Beads</a>, a local-first issue tracker that stores data in a Dolt database alongside the repo. It supports hierarchical issues (epics -&gt; stories -&gt; tasks), dependency chains, labels, and priority classifications from the command line.</p>

<p>The 187 P1 gaps became a structured breakdown:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>red-dragon-gvu [epic]: 187 P1 frontend lowering gaps
├── gvu.1 [epic]: Cross-language pattern matching (25 tasks across 6 languages)
├── gvu.2 [epic]: Class/OOP features (14 tasks)
├── gvu.3 [epic]: Type system and generics (11 tasks)
├── gvu.4 [epic]: Async/coroutine support (8 tasks)
├── gvu.5 [epic]: Destructuring and rest patterns (7 tasks)
├── gvu.6 [epic]: Metaprogramming/macros (9 tasks)
├── gvu.7 [epic]: Module system and imports (8 tasks)
└── gvu.8 [epic]: Language-specific features (remaining)
</code></pre></div></div>

<p>This changed how sessions started. Instead of describing work from memory, I could say <em>“what’s the next open task under gvu.1?”</em> The issue tracker became the session boundary — work was defined before the session began, not discovered during it.</p>

<p>Beads data is backed up as JSONL files committed to the repository, so issue state travels with the code. On a new machine, <code class="language-plaintext highlighter-rouge">bd backup restore</code> rebuilds the full database — 168 issues, 538 events, 163 dependencies.</p>

<h3 id="gap-analysis-as-planning">Gap Analysis as Planning</h3>

<p>The gap analysis document (<code class="language-plaintext highlighter-rouge">docs/frontend-lowering-gaps.md</code>) became a living planning artefact. Each P1 gap had a row with language, node type, description, and status. As gaps were resolved, status flipped to DONE. As new gaps were discovered, they were added.</p>

<p>Many P1 gaps clustered thematically. Pattern matching was a gap in 6 languages simultaneously. Class/OOP features were missing in 5. The right unit of work wasn’t “fix Go’s missing <code class="language-plaintext highlighter-rouge">iota</code>” — it was “implement pattern matching across all 6 languages that need it.” The themed epics emerged from the data, not from upfront planning.</p>

<p>The analysis also forced honest assessment of what “done” meant. Claude had initially classified some gaps as “no-ops.” When I asked Claude to verify each claimed no-op against the actual tree-sitter AST, most contained meaningful content. TypeScript’s <code class="language-plaintext highlighter-rouge">ambient_declaration</code> had full type signatures. C#’s <code class="language-plaintext highlighter-rouge">unsafe_statement</code> wrapped executable blocks. Out of 8 claimed no-ops, only 2 were genuine. The lesson: <strong>verify against the AST, not against assumptions about what a node type name implies.</strong></p>

<h3 id="the-type-system-evolution">The Type System Evolution</h3>

<p>The largest post-sprint development was the type system. RedDragon started with string-based type hints — <code class="language-plaintext highlighter-rouge">"Int"</code>, <code class="language-plaintext highlighter-rouge">"String"</code>, <code class="language-plaintext highlighter-rouge">"List&lt;Int&gt;"</code> — threaded through the IR as operand annotations. This worked for simple inference but couldn’t represent parameterised types, union types, or subtype relationships.</p>

<p>The evolution happened in phases, each driven by a concrete limitation:</p>

<p><strong>Phase 1: TypeExpr ADT.</strong> Replaced string type hints with an algebraic data type: <code class="language-plaintext highlighter-rouge">ScalarType("Int")</code>, <code class="language-plaintext highlighter-rouge">ParameterizedType("List", (ScalarType("Int"),))</code>, <code class="language-plaintext highlighter-rouge">UnknownType</code>. A <code class="language-plaintext highlighter-rouge">parse_type()</code> function handled roundtripping from legacy strings. String-compatible equality was preserved for backward compatibility during the migration.</p>

<p><strong>Phase 2: TypeGraph.</strong> A directed acyclic graph encoding subtype relationships (<code class="language-plaintext highlighter-rouge">Int &lt;: Number &lt;: Object</code>, <code class="language-plaintext highlighter-rouge">String &lt;: Object</code>). Covariant <code class="language-plaintext highlighter-rouge">is_subtype_expr()</code> for parameterised types. <code class="language-plaintext highlighter-rouge">common_supertype_expr()</code> for join operations.</p>
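<p>A compact sketch of Phases 1 and 2 together. The type and function names are taken from the text, but the bodies are assumptions: the graph is simplified to a single-parent tree rooted at <code class="language-plaintext highlighter-rouge">Object</code> (a real DAG may give a type several supertypes), the field names are invented, and the join is shown for scalar types only.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarType:
    name: str


@dataclass(frozen=True)
class ParameterizedType:
    constructor: str
    arguments: tuple


# Each type maps to its direct supertype; Object is the root.
PARENT = {"Int": "Number", "Number": "Object", "String": "Object"}


def ancestors(name):
    """Return name followed by its supertypes, nearest first."""
    return [name] + (ancestors(PARENT[name]) if name in PARENT else [])


def is_subtype_expr(sub, sup):
    """Covariant subtype check over scalar and parameterised types."""
    if isinstance(sub, ScalarType) and isinstance(sup, ScalarType):
        return sup.name in ancestors(sub.name)
    if isinstance(sub, ParameterizedType) and isinstance(sup, ParameterizedType):
        return (
            sub.constructor == sup.constructor
            and len(sub.arguments) == len(sup.arguments)
            and all(is_subtype_expr(a, b) for a, b in zip(sub.arguments, sup.arguments))
        )
    return False


def common_supertype_expr(left, right):
    """Join of two scalar types: the nearest ancestor they share."""
    right_chain = ancestors(right.name)
    return ScalarType(next(n for n in ancestors(left.name) if n in right_chain))


list_of_int = ParameterizedType("List", (ScalarType("Int"),))
list_of_number = ParameterizedType("List", (ScalarType("Number"),))
print(is_subtype_expr(list_of_int, list_of_number))
print(common_supertype_expr(ScalarType("Int"), ScalarType("String")))
```

<p>Covariance here means a list of <code class="language-plaintext highlighter-rouge">Int</code> is accepted where a list of <code class="language-plaintext highlighter-rouge">Number</code> is expected, but not the other way round.</p>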

<p><strong>Phase 3: Interface-aware inference.</strong> When the inference engine encountered <code class="language-plaintext highlighter-rouge">animal.speak()</code> where <code class="language-plaintext highlighter-rouge">animal</code> was typed as interface <code class="language-plaintext highlighter-rouge">Animal</code>, it couldn’t resolve the return type. The fix was a chain walk: check interface method types when the class’s own methods don’t have type information. This required seeding <code class="language-plaintext highlighter-rouge">interface_implementations</code> from 5 frontends.</p>
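<p>The chain walk can be sketched as a two-step lookup. Everything except the <code class="language-plaintext highlighter-rouge">interface_implementations</code> name is an assumption here: types are plain strings, <code class="language-plaintext highlighter-rouge">method_types</code> is a hypothetical table keyed by <code class="language-plaintext highlighter-rouge">(type, method)</code>, and the <code class="language-plaintext highlighter-rouge">Dog</code> class is invented for the example.</p>

```python
UNKNOWN = "Unknown"


def resolve_return_type(type_name, method, method_types, interface_implementations):
    """Look up a method's return type on the type itself, then on its interfaces."""
    own = method_types.get((type_name, method), UNKNOWN)
    if own != UNKNOWN:
        return own
    candidates = (
        method_types.get((interface, method), UNKNOWN)
        for interface in interface_implementations.get(type_name, ())
    )
    return next((t for t in candidates if t != UNKNOWN), UNKNOWN)


# Only the interface declares a return type for speak().
method_types = {("Animal", "speak"): "String"}
# Seeded by the frontends: which interfaces each class implements.
interface_implementations = {"Dog": ("Animal",)}

print(resolve_return_type("Dog", "speak", method_types, interface_implementations))
print(resolve_return_type("Dog", "fetch", method_types, interface_implementations))
```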

<p>Each phase was documented as an ADR, tested with both unit and integration tests, and committed independently. The type system alone accounts for ~1,500 tests and 34 ADRs.</p>

<p>A later phase of type system work — migrating every runtime value to carry its type via a <code class="language-plaintext highlighter-rouge">TypedValue</code> wrapper — became a multi-phase refactoring that exposed hidden assumptions in constructor handling, revealed that builtins were bypassing the state management contract, and prompted half a dozen side detours. I wrote about that migration separately in <a href="/2026/03/13/anatomy-of-a-refactoring-using-ai.html">Anatomy of a Refactoring Using AI</a>.</p>

<h3 id="memory-files">Memory Files</h3>

<p>Claude Code supports a persistent memory directory (<code class="language-plaintext highlighter-rouge">.claude/projects/.../memory/MEMORY.md</code>) loaded at session start. My memory file contains:</p>

<ul>
  <li><strong>Key references</strong>: Links to audit results, gap analyses, and their current status</li>
  <li><strong>Workflow reminders</strong>: Commands that have been forgotten before</li>
  <li><strong>Project state</strong>: Current test count, resolved issues, known gotchas</li>
  <li><strong>Type system state</strong>: Which migration phases are complete</li>
  <li><strong>Future work pointers</strong>: What’s been deferred and why</li>
</ul>

<p>The memory file is curated, not appended. Outdated information is removed. Entries are updated when facts change.</p>

<p>The distinction between the memory file and <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>: <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> encodes <em>rules</em>. The memory file encodes <em>state</em>. Together with the gap analysis (<em>plan</em>) and the issue tracker (<em>work queue</em>), they form a four-layer structured memory:</p>

<pre><code class="language-mermaid">flowchart TB
    S(("🚀 New session")):::start
    R("📜 CLAUDE.md&lt;br/&gt;&lt;i&gt;Rules: how to behave&lt;/i&gt;"):::rules
    M("🧠 MEMORY.md&lt;br/&gt;&lt;i&gt;State: what's been done&lt;/i&gt;"):::state
    G("📊 Gap analysis&lt;br/&gt;&lt;i&gt;Plan: what's left to do&lt;/i&gt;"):::plan
    B("📋 Beads issues&lt;br/&gt;&lt;i&gt;Work queue: what to do next&lt;/i&gt;"):::queue
    O("✅ Agent is oriented"):::done

    S --&gt; R &amp; M &amp; G &amp; B
    R &amp; M &amp; G &amp; B --&gt; O

    classDef start fill:#e8f4fd,stroke:#4a90d9,stroke-width:3px,color:#1a3a5c
    classDef rules fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef state fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef plan fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef queue fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef done fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>Artefact</th>
      <th>Purpose</th>
      <th>Update frequency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Rules</td>
      <td><code class="language-plaintext highlighter-rouge">CLAUDE.md</code></td>
      <td>How to behave</td>
      <td>When failure modes are discovered</td>
    </tr>
    <tr>
      <td>State</td>
      <td><code class="language-plaintext highlighter-rouge">MEMORY.md</code></td>
      <td>What’s been done</td>
      <td>Every few sessions</td>
    </tr>
    <tr>
      <td>Plan</td>
      <td>Gap analysis doc</td>
      <td>What’s left to do</td>
      <td>After each implementation batch</td>
    </tr>
    <tr>
      <td>Work queue</td>
      <td>Beads issues</td>
      <td>What to do next</td>
      <td>After each task closes</td>
    </tr>
  </tbody>
</table>

<p>A fresh session can orient itself in seconds rather than minutes.</p>

<h3 id="the-quick-win-trap">The Quick Win Trap</h3>

<p>When managing 168 issues, there is a temptation to chase quick wins — tasks that appear trivial. During the gap analysis breakdown, Claude identified ~16 “quick wins” including 8 claimed no-ops. When I asked Claude to verify each one against the actual AST structure, the list shrank to 7, with only 2 genuine no-ops. The rest needed real handlers.</p>

<p><strong>What looks like a no-op from the node type name often isn’t.</strong> <code class="language-plaintext highlighter-rouge">ambient_declaration</code> sounds like metadata; it’s actually <code class="language-plaintext highlighter-rouge">declare const VERSION: string</code>. <code class="language-plaintext highlighter-rouge">unsafe_statement</code> sounds like a compiler pragma; it’s actually a block wrapper around executable code.</p>

<p>The AI will optimise for closing tickets, not for closing them correctly. The human’s job is to challenge the classification before the work begins.</p>

<hr />

<h2 id="patterns-and-observations">Patterns and Observations</h2>

<ul>
  <li>
    <p><strong>Brainstorm, probe, crystallise.</strong> I didn’t start with fixed architectures. I started with a problem, brainstormed approaches with Claude, then implemented each and tested on real data. The deterministic VM emerged from asking “shouldn’t this be deterministic?” after seeing the LLM-based approach work.</p>
  </li>
  <li>
    <p><strong>The plan document as interface.</strong> After brainstorming and discussing trade-offs, I’d formulate a plan document covering context, phases, file-by-file changes, and verification steps. The plan is specific enough for unambiguous execution but high-level enough to retain architectural control. I went through this cycle ~15 times.</p>
  </li>
  <li><strong>Brainstorming with Superpowers.</strong> Later in the project, I started using the <a href="https://github.com/nicobailey/claude-code-superpowers">Superpowers</a> plugin for Claude Code, specifically its brainstorming skill. Why it fit my workflow:
    <ul>
      <li><strong>I’m not used to writing large upfront specs.</strong> I prefer to explore the design space before committing to a choice, and Superpowers’ brainstorming mode let me describe a problem and have it explore multiple approaches interactively.</li>
      <li><strong>Design through small focused decisions.</strong> I want the design to evolve through a series of yes/no decisions on individual aspects, not a single monolithic plan. The brainstorming skill converges on a design through exactly this kind of incremental refinement.</li>
      <li><strong>Customising how information is presented.</strong> For larger refactorings, instead of handing me a full implementation spec to review (which I would skim), I asked it to show me the salient aspects of the final design — the key trade-offs, the data structures that would change, the migration strategy — as a list of focused questions I could approve or reject individually. This turned spec review from a passive reading exercise into an active decision-making process.</li>
    </ul>
  </li>
  <li>
    <p><strong>Breadth over depth.</strong> Tasks like “generate frontends for 14 languages” or “audit all 130 test files for weak assertions” are where the AI works well. These breadth tasks, which apply a consistent pattern across many targets, would have taken me days by hand. It needed more guidance on depth: closure capture semantics (snapshot vs. shared environment), when to fall back to <code class="language-plaintext highlighter-rouge">SYMBOLIC</code> rather than crash, and whether an assertion is vacuous. These required me to probe with specific test cases.</p>
  </li>
  <li>
    <p><strong>Empirical validation over specification.</strong> I rarely specified exact behaviour upfront. I implemented a feature, ran it on real code, and judged the results. The AI made this feedback loop fast enough to be practical.</p>
  </li>
  <li><strong>Terse directives after trust.</strong> Early prompts were detailed. By mid-project: <em>“do all of them”</em>, <em>“push”</em>, <em>“commit and push this”</em>. Trust built through consistent execution.</li>
</ul>

<pre><code class="language-mermaid">flowchart LR
    E("📝 Sessions 1–20&lt;br/&gt;&lt;b&gt;Detailed specs&lt;/b&gt;&lt;br/&gt;&lt;i&gt;full context + constraints&lt;/i&gt;"):::early
    M("💬 Sessions 20–100&lt;br/&gt;&lt;b&gt;Short directives&lt;/b&gt;&lt;br/&gt;&lt;i&gt;'implement all the&lt;br/&gt;critical and common ones'&lt;/i&gt;"):::mid
    L("⚡ Sessions 100+&lt;br/&gt;&lt;b&gt;Minimal prompts&lt;/b&gt;&lt;br/&gt;&lt;i&gt;'do all of them'&lt;/i&gt;&lt;br/&gt;&lt;i&gt;'push'&lt;/i&gt;"):::late

    E --&gt; M --&gt; L

    classDef early fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
    classDef mid fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef late fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<ul>
  <li>
    <p><strong>The AI hallucinated audit findings.</strong> During the assertion audit, the AI reported violations that didn’t exist or had already been fixed. Parallel agents flagged different things, applied the priority criteria inconsistently, and re-reported fixed items under different wording. The reconciliation pass caught this. The auditor itself needs auditing.</p>
  </li>
  <li>
    <p><strong>CLAUDE.md rules are reactive.</strong> Every rule was added in response to a specific failure. They accumulate over time, and each one represents a mistake that happened at least once.</p>
  </li>
  <li>
    <p><strong>Screenshot-driven debugging.</strong> For the CFG visualisation work, I’d generate a diagram, screenshot it, paste it into the conversation, and ask “why does it look so disjointed?” Claude could see the rendering and diagnose layout issues. The visualisation went through five rounds.</p>
  </li>
</ul>

<h3 id="the-anonymous-class-story-or-why-the-ai-reaches-for-new-infrastructure">The Anonymous Class Story, or, Why the AI Reaches for New Infrastructure</h3>

<p>TypeScript allows assigning anonymous classes to variables: <code class="language-plaintext highlighter-rouge">const MyClass = class { constructor() { ... } }</code>. When someone writes <code class="language-plaintext highlighter-rouge">new MyClass()</code>, the VM needs to resolve <code class="language-plaintext highlighter-rouge">MyClass</code> — but <code class="language-plaintext highlighter-rouge">MyClass</code> isn’t a declared class name. It’s a variable that <em>holds</em> a class.</p>

<p>The first design Claude proposed: a new <code class="language-plaintext highlighter-rouge">class_aliases</code> dictionary in the class registry, populated during lowering, with a <code class="language-plaintext highlighter-rouge">resolve_class_name()</code> method that checks aliases before the main registry. New data structure, new resolution method, new lowering logic to populate it.</p>

<p>I asked: <em>“Why is this so complicated?”</em></p>

<p>Second attempt: a pointer chain mechanism. The variable would store a pointer to the class entry, and <code class="language-plaintext highlighter-rouge">_handle_new_object</code> would follow the pointer chain. Still new infrastructure — a new pointer type and a resolution protocol.</p>

<p>I asked: <em>“Why can’t it just be a regular variable living on the stack?”</em></p>

<p>Third attempt — the one that shipped: at <code class="language-plaintext highlighter-rouge">_handle_new_object</code> time, if the class name isn’t in the registry, check if it’s a variable in scope. If so, dereference it and use the result as the class name. The variable store <em>already was</em> the lookup table. Five lines of code. Zero new data structures.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The entire fix
</span><span class="k">if</span> <span class="n">class_name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">class_registry</span><span class="p">:</span>
    <span class="n">resolved</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">current_frame</span><span class="p">.</span><span class="nf">lookup</span><span class="p">(</span><span class="n">class_name</span><span class="p">)</span>
    <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">resolved</span><span class="p">,</span> <span class="nb">str</span><span class="p">):</span>
        <span class="n">class_name</span> <span class="o">=</span> <span class="n">resolved</span>
</code></pre></div></div>
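<p>To see the fallback run outside the VM, here is a minimal self-contained sketch. The <code class="language-plaintext highlighter-rouge">Frame</code> class, the <code class="language-plaintext highlighter-rouge">class_name_for_new</code> helper, and the <code class="language-plaintext highlighter-rouge">__anon_class_0</code> registry key are hypothetical stand-ins for illustration, not RedDragon’s actual names.</p>

```python
class Frame:
    """Hypothetical stack frame: a plain name -> value mapping."""

    def __init__(self, variables):
        self.variables = dict(variables)

    def lookup(self, name):
        # Returns None when the name is not bound in this frame.
        return self.variables.get(name)


def class_name_for_new(class_name, class_registry, frame):
    """Resolve the operand of `new X()` to a registered class name."""
    if class_name not in class_registry:
        resolved = frame.lookup(class_name)
        # Assumption: a variable holding a class stores that class's registry key.
        if isinstance(resolved, str):
            class_name = resolved
    return class_name


registry = {"__anon_class_0": {"methods": ["constructor"]}}
frame = Frame({"MyClass": "__anon_class_0", "count": 3})

print(class_name_for_new("MyClass", registry, frame))         # __anon_class_0
print(class_name_for_new("__anon_class_0", registry, frame))  # __anon_class_0
print(class_name_for_new("count", registry, frame))           # count (not a class)
```

<p>The third call shows why the <code class="language-plaintext highlighter-rouge">isinstance</code> guard matters: a variable holding a non-string value is left alone, so the usual "unknown class" handling still applies.</p>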

<p>The pattern repeated immediately. The next step was seeding the variable’s type as <code class="language-plaintext highlighter-rouge">Type[ClassName]</code> — a metatype — so the type inference engine could track it. Claude proposed a string-encoded <code class="language-plaintext highlighter-rouge">"Type[ClassName]"</code> representation. But the codebase already had <code class="language-plaintext highlighter-rouge">ParameterizedType</code> in its <code class="language-plaintext highlighter-rouge">TypeExpr</code> ADT. The metatype was just <code class="language-plaintext highlighter-rouge">ParameterizedType("Type", (ScalarType("ClassName"),))</code>. A one-line convenience constructor, no new types.</p>
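<p>A rough sketch of what such a convenience constructor might look like. The two dataclasses below are simplified stand-ins for the real <code class="language-plaintext highlighter-rouge">TypeExpr</code> cases; their field names, and the <code class="language-plaintext highlighter-rouge">metatype</code> and <code class="language-plaintext highlighter-rouge">render</code> helpers, are assumptions made for this example.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarType:
    name: str  # assumed field name


@dataclass(frozen=True)
class ParameterizedType:
    constructor: str  # assumed field name
    arguments: tuple  # assumed field name


def metatype(class_name):
    """Build Type[ClassName] from the existing ADT cases (hypothetical helper)."""
    return ParameterizedType("Type", (ScalarType(class_name),))


def render(type_expr):
    """Produce a string only for display, never as an intermediate form."""
    if isinstance(type_expr, ScalarType):
        return type_expr.name
    inner = ", ".join(render(arg) for arg in type_expr.arguments)
    return f"{type_expr.constructor}[{inner}]"


seeded = metatype("MyClass")
print(render(seeded))  # Type[MyClass]
```

<p>Because the value is structured, code that needs the wrapped class can read it from the arguments directly instead of parsing a string.</p>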

<p>That metatype work then surfaced a deeper issue: the type extraction pipeline was converting <code class="language-plaintext highlighter-rouge">TypeExpr</code> objects to strings, passing the strings through seed methods, and then parsing them back into <code class="language-plaintext highlighter-rouge">TypeExpr</code> on the other side. The round-trip was pointless. This led to a migration across all 15 frontends that changed the seed methods to accept <code class="language-plaintext highlighter-rouge">TypeExpr</code> directly and eliminated the string intermediary. The migration touched ~30 files and all 11,193 tests passed without a single failure, because it was removing accidental complexity, not adding new behaviour.</p>

<p>Three iterations to reach a 5-line solution. Each iteration was simpler than the last. The AI’s instinct at each step was to <em>add</em> — a new registry, a new pointer type, a new string encoding. The human’s role was to ask <em>“doesn’t the existing system already do this?”</em> until the answer was yes.</p>

<p>This is the most common design failure mode I’ve observed: <strong>the AI builds new infrastructure before checking whether the existing system already solves the problem.</strong> It’s not a capability limitation — Claude understood the variable store, the class registry, and the TypeExpr ADT perfectly well. It just didn’t <em>start</em> from them. It started from the problem and worked forward, rather than starting from the existing system and asking what was missing.</p>

<p>The fix isn’t a CLAUDE.md rule (though I added one). It’s a conversational habit: before accepting any design, ask <em>“what existing mechanism does this duplicate?”</em></p>

<hr />

<h2 id="what-i-would-change">What I Would Change</h2>

<ul>
  <li>
    <p><strong>Start with the audit earlier.</strong> The two-pass dispatch audit should have existed from the first batch of frontends, not after 50 sessions.</p>
  </li>
  <li>
    <p><strong>Invest in cross-language tests from day one.</strong> The Rosetta and Exercism suites exposed more bugs than all the language-specific unit tests combined. A single exercise tested across 15 languages covers more surface area than 50 unit tests in one language.</p>
  </li>
  <li>
    <p><strong>Be more aggressive about the functional core.</strong> Even with the FP rules in CLAUDE.md, some mutation crept in, especially in the VM executor. The dataflow module is almost purely functional and is the easiest module to test. The correlation is not a coincidence.</p>
  </li>
</ul>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Building non-trivial systems with an AI coding assistant is about architectural direction, not prompting. The human’s role is strategic: choosing problems, evaluating approaches empirically, making pivot decisions, and encoding quality standards. The AI’s role is tactical: implementing plans, auditing for completeness, applying patterns at breadth, and maintaining consistency.</p>

<p>The workflow I converged on (brainstorm, discuss trade-offs, plan, test-first, implement, clean up) emerged through trial and error across 400+ sessions.</p>

<p>What changed between the early sessions and the later ones was the emergence of structured agent memory. The early sessions were exploratory, driven by momentum. The later sessions were systematic — the agent started each conversation knowing the project state, the active work queue, and the governing rules. The human’s role expanded from architect to memory curator: defining work, challenging classifications, and maintaining the persistent artefacts that make each session productive from its first message.</p>

<p>The limiting factor in AI-assisted development is not the AI’s capability — it’s the AI’s memory. A capable agent with no memory rediscovers context every session. A capable agent with structured memory picks up where the last session left off.</p>

<hr />

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/ir-reference.md">IR Reference</a> — The 28-opcode instruction set</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/notes-on-vm-design.md">Notes on VM Design</a> — Deterministic execution model and symbolic values</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/notes-on-frontend-design.md">Notes on Frontend Design</a> — Tree-sitter dispatch table architecture</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/notes-on-dataflow-design.md">Notes on Dataflow Design</a> — Iterative dataflow analysis</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/type-system.md">Type System</a> — TypeExpr ADT, TypeGraph, and inference</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/frontend-lowering-gaps.md">Frontend Lowering Gaps</a> — Gap analysis across 15 languages</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/ir-lowering-gaps.md">IR Lowering Gaps</a> — IR-level lowering gaps</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/master/docs/architectural-design-decisions.md">Architectural Decision Records</a> — 100+ ADRs documenting design decisions</li>
  <li><a href="/2026/03/13/anatomy-of-a-refactoring-using-ai.html">Anatomy of a Refactoring Using AI</a> — The TypedValue migration: a multi-phase refactoring traced in detail</li>
</ul>]]></content><author><name>avishek</name></author><category term="Software Engineering" /><category term="Compilers" /><category term="Program Analysis" /><category term="AI-Assisted Development" /><summary type="html"><![CDATA[Notes from building a multi-language code analysis engine across 400+ conversation sessions with Claude Code.]]></summary></entry><entry><title type="html">RedDragon: A Deterministic Compiler Pipeline with LLM-Assisted Repair for Analysing Incomplete Code and Unknown Languages</title><link href="https://avishek.net/2026/03/01/designing-reddragon-multi-language-code-analysis.html" rel="alternate" type="text/html" title="RedDragon: A Deterministic Compiler Pipeline with LLM-Assisted Repair for Analysing Incomplete Code and Unknown Languages" /><published>2026-03-01T00:00:00+05:30</published><updated>2026-03-01T00:00:00+05:30</updated><id>https://avishek.net/2026/03/01/designing-reddragon-multi-language-code-analysis</id><content type="html" xml:base="https://avishek.net/2026/03/01/designing-reddragon-multi-language-code-analysis.html"><![CDATA[<p><em>A universal IR with per-opcode typed instructions, 15 deterministic frontends, LLM-assisted repair/lowering/execution, a deterministic VM with class hierarchy support, overload resolution, and cross-language slicing, a structured type system with generics/unions/variance/traits and interface-aware inference, and iterative dataflow analysis.</em></p>

<p><strong>GitHub</strong>: <a href="https://github.com/avishek-sen-gupta/red-dragon">avishek-sen-gupta/red-dragon</a></p>

<p><img src="/assets/pipeline-viz.gif" alt="RedDragon Pipeline Demo" /></p>

<hr />

<h2 id="the-problem">The Problem</h2>

<p>I wanted to analyse source code across many languages (trace data flow, build control flow graphs, understand how variables depend on each other) without writing a separate analyser for each language. The conventional approach is to build language-specific tooling (Roslyn for C#, javac’s AST for Java, etc.), but that means duplicating every downstream analysis pass for every language. I wanted <strong>one representation, one analyser, many languages</strong>.</p>

<p>Established IRs exist for this kind of work. LLVM IR covers C, C++, Rust, Swift, and others. WebAssembly targets a growing set of languages. GraalVM’s Truffle framework provides a polyglot execution layer. I considered all of these and chose to build my own for three reasons:</p>

<ul>
  <li>No single existing IR covered the full set of languages I wanted to analyse (Python, Ruby, JavaScript, TypeScript, PHP, Lua, Scala, Kotlin, Go, Java, C#, C, C++, Rust, Pascal, and COBOL).</li>
  <li>Existing IRs assume programs are complete and all dependencies are resolved. They are not designed for incomplete code with missing imports, unresolved externals, or partial extracts.</li>
  <li>I wanted to integrate LLM-based lowering and LLM-assisted execution as first-class features of the pipeline, and grafting that onto an existing IR’s toolchain would have taken more time than building a purpose-built one.</li>
</ul>

<p>The twist: I wanted to handle <em>incomplete</em> programs gracefully. Real-world code depends on imports, frameworks, and external systems that aren’t available during static analysis. Most tools crash or give up when they hit an unresolved reference. I wanted mine to keep going, creating <strong>symbolic placeholders</strong> for unknowns and tracing data flow through them.</p>

<p><a href="https://github.com/avishek-sen-gupta/red-dragon">RedDragon</a> is the result. It parses source in 15 languages, lowers it to a universal intermediate representation, builds control flow graphs, performs iterative dataflow analysis, and executes programs via a deterministic virtual machine, all with <strong>zero LLM calls</strong> for programs with concrete inputs. RedDragon is a work in progress.</p>

<p>This post covers how the system is designed: the IR, the frontends, the VM, type inference, and the dataflow analysis.</p>

<h3 id="motivation">Motivation</h3>

<p>RedDragon is the natural extension of my earlier work on <a href="https://github.com/avishek-sen-gupta/cobol-rekt">Cobol-REKT</a>, a COBOL reverse engineering toolkit. That project taught me the value of building analysis tools for legacy code, and left me wanting to generalise the approach across languages. I’d also built a small VM in Prolog previously, and wanted to attempt a more complete one — with a proper type system, dataflow analysis, and multi-language support.</p>

<p>Beyond the technical goals, this project was also a deliberate exercise in AI-assisted software development. I wanted to actively practise and improve my skills in directing an AI to build non-trivial systems — understanding where it works well, where it doesn’t, and how to structure the collaboration. The <a href="/2026/03/12/experiences-building-with-coding-assistant">companion post</a> covers that side of the experience.</p>

<p>RedDragon is a work in progress, and this post is a snapshot of where it stands today. My understanding of compiler design, type systems, and program analysis continues to evolve alongside the project.</p>

<h3 id="core-theses">Core Theses</h3>

<p>RedDragon explores three ideas about analysing frequently-incomplete code, the kind found in legacy migrations, decompiled binaries, partial extracts, and codebases with missing dependencies:</p>

<ol>
  <li>
    <p><strong>Deterministic language frontends with LLM-assisted repair.</strong> Tree-sitter frontends (15 languages) and a ProLeap bridge (COBOL) handle well-formed source deterministically. When tree-sitter hits malformed syntax, an optional LLM repair loop fixes only the broken fragments and re-parses, maximising deterministic coverage for real-world incomplete code. All paths produce the same universal IR.</p>
  </li>
  <li>
    <p><strong>Full LLM frontends for unsupported languages.</strong> For languages without a tree-sitter frontend, an LLM lowers source to IR entirely, supporting any language without new parser code. A chunked variant splits large files into per-function chunks via tree-sitter, lowering each independently. The LLM acts as a <em>compiler frontend</em>, constrained by a formal IR schema with concrete patterns. It’s translating syntax, not reasoning about semantics.</p>
  </li>
  <li>
    <p><strong>A VM that integrates LLMs only at the boundaries where information is genuinely missing.</strong> When execution hits missing dependencies, unresolved imports, or unknown externals, a configurable resolver can invoke an LLM to produce plausible state changes, keeping execution moving through incomplete programs instead of halting at the first unknown. When source is complete and all dependencies are present, the entire pipeline (parse → lower → execute) is deterministic with zero LLM calls.</p>
  </li>
</ol>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#architecture-overview">Architecture Overview</a></li>
  <li><a href="#a-worked-example-source-to-execution">A Worked Example: Source to Execution</a></li>
  <li><a href="#the-ir-33-opcodes-to-rule-them-all">The IR: 33 Opcodes to Rule Them All</a></li>
  <li><a href="#frontends-four-strategies-one-output">Frontends: Four Strategies, One Output</a></li>
  <li><a href="#llm-assisted-ast-repair">LLM-Assisted AST Repair</a></li>
  <li><a href="#llm-frontend-lowering-unknown-languages">LLM Frontend: Lowering Unknown Languages</a></li>
  <li><a href="#the-dispatch-table-engine">The Dispatch Table Engine</a></li>
  <li><a href="#lowering-equivalence">Lowering Equivalence</a></li>
  <li><a href="#the-deterministic-vm">The Deterministic VM</a></li>
  <li><a href="#llm-assisted-vm-execution">LLM-Assisted VM Execution</a></li>
  <li><a href="#dataflow-analysis">Dataflow Analysis</a></li>
  <li><a href="#type-inference">Type Inference</a></li>
  <li><a href="#cross-language-type-inference-in-practice">Cross-Language Type Inference in Practice</a></li>
  <li><a href="#cross-language-verification-via-exercism">Cross-Language Verification via Exercism</a></li>
  <li><a href="#references">References</a></li>
</ol>

<hr />

<h2 id="architecture-overview">Architecture Overview</h2>

<p>RedDragon follows a classic compiler pipeline:</p>

<pre><code class="language-mermaid">%%{ init: { "flowchart": { "curve": "stepBefore" } } }%%
flowchart TD
    src("📄 Source Code&lt;br/&gt;&lt;i&gt;15 languages&lt;/i&gt;"):::input
    frontend("🔧 Frontend&lt;br/&gt;&lt;i&gt;deterministic or LLM-based&lt;/i&gt;"):::transform
    cfg("🔀 CFG Builder"):::analysis
    typeinf("🏷️ Type Inference"):::analysis
    vm("⚙️ VM"):::engine
    dataflow("📊 Dataflow Analysis"):::engine

    src --&gt; frontend
    frontend --&gt;|"list[TypedInstruction]"| cfg
    frontend --&gt;|"list[TypedInstruction]"| typeinf
    typeinf --&gt;|"TypeEnvironment"| vm
    cfg --&gt; vm
    cfg --&gt; dataflow

    classDef input fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef transform fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef analysis fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef engine fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>Every stage operates on the same flat IR. The type inference pass, VM, and dataflow analysis are all language-agnostic. They don’t know whether the instructions came from Python, Rust, or COBOL.</p>

<hr />

<h2 id="a-worked-example-source-to-execution">A Worked Example: Source to Execution</h2>

<p>To make the pipeline concrete, here’s a complete trace of a simple program through every stage. This is the same pipeline that runs for all 15 languages.</p>

<h3 id="source-python">Source (Python)</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">classify</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">x</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">label</span> <span class="o">=</span> <span class="sh">"</span><span class="s">positive</span><span class="sh">"</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">label</span> <span class="o">=</span> <span class="sh">"</span><span class="s">negative</span><span class="sh">"</span>
    <span class="k">return</span> <span class="n">label</span>

<span class="n">result</span> <span class="o">=</span> <span class="nf">classify</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="stage-1-lowering-to-ir">Stage 1: Lowering to IR</h3>

<p>The Python tree-sitter frontend parses this and emits flat three-address code. The function body is wrapped in a skip-over pattern (a <code class="language-plaintext highlighter-rouge">BRANCH</code> jumps past it so it’s not executed at definition time):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>branch end_classify_0                    # skip over body
func_classify_0:                         # entry point
  %0 = symbolic param:x                  # parameter binding
  decl_var x %0
  %1 = load_var x                        # if x &gt; 0
  %2 = const 0
  %3 = binop &gt; %1 %2
  branch_if %3 if_true_0,if_false_0
if_true_0:
  %4 = const "positive"
  decl_var label %4
  branch if_end_0
if_false_0:
  %5 = const "negative"
  store_var label %5                     # assignment to existing var
  branch if_end_0
if_end_0:
  %6 = load_var label
  return %6
end_classify_0:
  %7 = const &lt;function:classify@func_classify_0&gt;
  decl_var classify %7
  %8 = const 5
  decl_var x %8
  %9 = call_function classify %8         # classify(5)
  decl_var result %9
</code></pre></div></div>

<p>Every instruction is a flat dataclass with an opcode, operands, a destination register, and a source location tracing it back to the original line and column. No nested expressions. <code class="language-plaintext highlighter-rouge">x &gt; 0</code> decomposes into <code class="language-plaintext highlighter-rouge">LOAD_VAR</code>, <code class="language-plaintext highlighter-rouge">CONST</code>, <code class="language-plaintext highlighter-rouge">BINOP</code>.</p>
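<p>As an illustration of that shape, the sketch below models an instruction as a frozen dataclass and flattens <code class="language-plaintext highlighter-rouge">x &gt; 0</code> by hand. It is a simplification: the single <code class="language-plaintext highlighter-rouge">Instruction</code> class, its field names, and the hard-coded register numbers are assumptions for this example, and the real IR uses per-opcode typed instructions.</p>

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Opcode(Enum):
    LOAD_VAR = auto()
    CONST = auto()
    BINOP = auto()


@dataclass(frozen=True)
class SourceLocation:
    line: int
    column: int


@dataclass(frozen=True)
class Instruction:
    opcode: Opcode
    operands: tuple
    dest: Optional[str]  # destination register, e.g. "%3"
    location: SourceLocation


def lower_greater_than(var_name, literal, location):
    """Flatten `var > literal` into three instructions with no nesting."""
    return [
        Instruction(Opcode.LOAD_VAR, (var_name,), "%1", location),
        Instruction(Opcode.CONST, (literal,), "%2", location),
        Instruction(Opcode.BINOP, (">", "%1", "%2"), "%3", location),
    ]


instructions = lower_greater_than("x", 0, SourceLocation(line=2, column=7))
for inst in instructions:
    print(inst.dest, "=", inst.opcode.name.lower(), *inst.operands)
# %1 = load_var x
# %2 = const 0
# %3 = binop > %1 %2
```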

<h3 id="stage-2-cfg-construction">Stage 2: CFG Construction</h3>

<p>The CFG builder splits the IR at every <code class="language-plaintext highlighter-rouge">LABEL</code> and after every <code class="language-plaintext highlighter-rouge">BRANCH</code>/<code class="language-plaintext highlighter-rouge">BRANCH_IF</code>/<code class="language-plaintext highlighter-rouge">RETURN</code>/<code class="language-plaintext highlighter-rouge">THROW</code>, then wires edges based on branch targets:</p>

<pre><code class="language-mermaid">flowchart TD
    entry(["&lt;b&gt;entry&lt;/b&gt;&lt;br/&gt;BRANCH end_classify_0"]):::entry
    func("&lt;b&gt;func_classify_0&lt;/b&gt;&lt;br/&gt;SYMBOLIC param:x · DECL_VAR x&lt;br/&gt;LOAD x · CONST 0 · BINOP &gt;&lt;br/&gt;BRANCH_IF"):::block
    if_true("&lt;b&gt;if_true_0&lt;/b&gt;&lt;br/&gt;CONST &amp;quot;positive&amp;quot;&lt;br/&gt;DECL_VAR label · BRANCH"):::branch
    if_false("&lt;b&gt;if_false_0&lt;/b&gt;&lt;br/&gt;CONST &amp;quot;negative&amp;quot;&lt;br/&gt;STORE_VAR label · BRANCH"):::branch
    if_end(["&lt;b&gt;if_end_0&lt;/b&gt;&lt;br/&gt;LOAD label&lt;br/&gt;RETURN"]):::exit
    end_classify("&lt;b&gt;end_classify_0&lt;/b&gt;&lt;br/&gt;CONST function · DECL_VAR classify&lt;br/&gt;CONST 5 · DECL_VAR x&lt;br/&gt;CALL classify · DECL_VAR result"):::block

    entry --&gt; end_classify
    entry -.-&gt;|"skip"| func
    func -- "T" --&gt; if_true
    func -- "F" --&gt; if_false
    if_true --&gt; if_end
    if_false --&gt; if_end
    end_classify -.-&gt;|"call"| func

    classDef entry fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef block fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef branch fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef exit fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>
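<p>The splitting rule can be sketched in a few lines. This is an illustrative approximation, not the project’s CFG builder: instructions are reduced to <code class="language-plaintext highlighter-rouge">(opcode, operands)</code> tuples, the helper names are hypothetical, only control-flow instructions from the example are kept, and the dashed call edge from the diagram is not modelled.</p>

```python
TERMINATORS = {"branch", "branch_if", "return", "throw"}


def split_blocks(instructions):
    """Split (opcode, operands) pairs into a list of (name, body) blocks."""
    blocks = []
    name, body, labelled = "entry", [], False
    for opcode, operands in instructions:
        if opcode == "label":
            # A label always starts a new block; close the open one first.
            if body or labelled:
                blocks.append((name, body))
            name, body, labelled = operands[0], [], True
            continue
        body.append((opcode, operands))
        if opcode in TERMINATORS:
            # A terminator ends the block it belongs to.
            blocks.append((name, body))
            name, body, labelled = f"anon_{len(blocks)}", [], False
    if body or labelled:
        blocks.append((name, body))
    return blocks


def wire_edges(blocks):
    """Connect blocks by branch targets, falling through otherwise."""
    edges = []
    for index, (name, body) in enumerate(blocks):
        opcode, operands = body[-1] if body else (None, ())
        if opcode == "branch":
            edges.append((name, operands[0]))
        elif opcode == "branch_if":
            # operands are (condition, true_label, false_label)
            edges.extend((name, target) for target in operands[1:])
        elif opcode not in ("return", "throw") and index + 1 < len(blocks):
            edges.append((name, blocks[index + 1][0]))
    return edges


program = [
    ("branch", ("end_classify_0",)),
    ("label", ("func_classify_0",)),
    ("branch_if", ("%3", "if_true_0", "if_false_0")),
    ("label", ("if_true_0",)),
    ("branch", ("if_end_0",)),
    ("label", ("if_false_0",)),
    ("branch", ("if_end_0",)),
    ("label", ("if_end_0",)),
    ("return", ("%6",)),
    ("label", ("end_classify_0",)),
    ("decl_var", ("result", "%9")),
]

blocks = split_blocks(program)
edges = wire_edges(blocks)
print([name for name, _ in blocks])
print(edges)
```

<p>On this input the sketch yields six blocks and the five solid edges in the diagram above.</p>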

<h3 id="stage-3-vm-execution-0-llm-calls">Stage 3: VM Execution (0 LLM calls)</h3>

<p>The deterministic VM executes step by step. When it hits <code class="language-plaintext highlighter-rouge">CALL_FUNCTION classify</code>, it pushes a new stack frame, binds the parameter <code class="language-plaintext highlighter-rouge">x = 5</code>, and jumps to <code class="language-plaintext highlighter-rouge">func_classify_0</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step  1: branch end_classify_0          → skip to end_classify_0
step  2: const &lt;function:classify&gt;       → %7 = &lt;function:classify@func_classify_0&gt;
step  3: decl_var classify %7            → classify = &lt;function&gt;
step  4: const 5                         → %8 = 5
step  5: decl_var x %8                   → x = 5
step  6: call_function classify %8       → push frame, jump to func_classify_0
step  7: symbolic param:x               → %0 = 5 (bound from caller)
step  8: decl_var x %0                   → x = 5
step  9: load_var x                      → %1 = 5
step 10: const 0                         → %2 = 0
step 11: binop &gt; %1 %2                   → %3 = True (5 &gt; 0)
step 12: branch_if %3 if_true,if_false   → True, jump to if_true_0
step 13: const "positive"                → %4 = "positive"
step 14: decl_var label %4               → label = "positive"
step 15: branch if_end_0                 → jump to if_end_0
step 16: load_var label                  → %6 = "positive"
step 17: return %6                       → pop frame, return "positive"
step 18: decl_var result %9              → result = "positive"

Final state: result = "positive"  (18 steps, 0 LLM calls)
</code></pre></div></div>

<h3 id="stage-4-dataflow-analysis">Stage 4: Dataflow Analysis</h3>

<p>The reaching definitions analysis traces through the register chain. The raw def-use chain only says that <code class="language-plaintext highlighter-rouge">result</code> depends on <code class="language-plaintext highlighter-rouge">%9</code>. Tracing further, <code class="language-plaintext highlighter-rouge">%9</code> comes from <code class="language-plaintext highlighter-rouge">CALL_FUNCTION</code> on <code class="language-plaintext highlighter-rouge">classify</code> with argument <code class="language-plaintext highlighter-rouge">%8</code>; inside the call, <code class="language-plaintext highlighter-rouge">label</code> is set to <code class="language-plaintext highlighter-rouge">"positive"</code> (the branch taken), then loaded into <code class="language-plaintext highlighter-rouge">%6</code> and returned. Once the registers are collapsed away, the dependency graph says: <code class="language-plaintext highlighter-rouge">result</code> depends on <code class="language-plaintext highlighter-rouge">classify</code> and <code class="language-plaintext highlighter-rouge">x</code>.</p>
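<p>The step from register-level def-use to variable-level dependencies can be approximated as follows. This is a deliberately small, straight-line sketch and not the iterative analysis RedDragon runs: it ignores branches and callee bodies, records only direct dependencies, and assumes a variant of the program in which the argument reaches the call through a <code class="language-plaintext highlighter-rouge">load_var x</code>. The tuple layout and function name are hypothetical.</p>

```python
def variable_dependencies(instructions):
    """Map each variable to the names its value was directly computed from.

    Each instruction is (opcode, dest_register, operands).
    """
    reg_sources = {}  # register -> set of names feeding it
    var_deps = {}
    for opcode, dest, operands in instructions:
        if opcode == "load_var":
            reg_sources[dest] = {operands[0]}
        elif opcode in ("decl_var", "store_var"):
            name, register = operands
            var_deps[name] = set(reg_sources.get(register, set()))
        else:
            sources = set()
            if opcode == "call_function":
                # The callee name itself counts as a dependency.
                sources.add(operands[0])
            for operand in operands:
                sources |= reg_sources.get(operand, set())
            reg_sources[dest] = sources
    return var_deps


program = [
    ("const", "%0", (5,)),
    ("decl_var", None, ("x", "%0")),
    ("load_var", "%1", ("x",)),
    ("call_function", "%2", ("classify", "%1")),
    ("decl_var", None, ("result", "%2")),
]

deps = variable_dependencies(program)
print(sorted(deps["result"]))  # ['classify', 'x']
```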

<p>The <a href="#llm-assisted-vm-execution">LLM-Assisted VM Execution</a> section shows what happens when the source has missing dependencies: the VM creates symbolic placeholders that propagate deterministically, preserving dataflow tracing even without concrete values.</p>
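<p>Deterministic propagation of a placeholder can be pictured like this. The <code class="language-plaintext highlighter-rouge">Symbolic</code> class, its <code class="language-plaintext highlighter-rouge">description</code> field, and the <code class="language-plaintext highlighter-rouge">binop</code> helper are hypothetical names for illustration; the VM’s actual symbolic values may carry different information.</p>

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbolic:
    """Placeholder for a value that could not be computed concretely."""
    description: str


OPERATORS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def describe(value):
    return value.description if isinstance(value, Symbolic) else repr(value)


def binop(op, left, right):
    """Evaluate concretely when possible; otherwise build a new placeholder."""
    if isinstance(left, Symbolic) or isinstance(right, Symbolic):
        # The result records which unknowns it was derived from.
        return Symbolic(f"({describe(left)} {op} {describe(right)})")
    return OPERATORS[op](left, right)


rate = Symbolic("call:tax_service.get_rate")  # unresolved external
total = binop("*", 100, rate)

print(binop("+", 2, 3))   # 5
print(total.description)  # (100 * call:tax_service.get_rate)
```

<p>No model is consulted here: the unknown simply flows through the computation, so later analysis can still see that <code class="language-plaintext highlighter-rouge">total</code> was derived from the unresolved call.</p>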

<hr />

<h2 id="the-ir-33-opcodes-to-rule-them-all">The IR: 33 Opcodes to Rule Them All</h2>

<p><em>See also: <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/ir-reference.md">IR Reference</a></em></p>

<p>The intermediate representation is a <strong>flattened three-address code</strong> with 33 opcodes, grouped by role:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Value producers:   CONST, LOAD_VAR, LOAD_FIELD, LOAD_INDEX,
                   NEW_OBJECT, NEW_ARRAY, BINOP, UNOP,
                   CALL_FUNCTION, CALL_METHOD, CALL_UNKNOWN,
                   CALL_CTOR

Value consumers:   STORE_VAR, STORE_FIELD, STORE_INDEX, DECL_VAR

Control flow:      BRANCH, BRANCH_IF, LABEL, RETURN, THROW,
                   TRY_PUSH, TRY_POP

Pointers:          ADDRESS_OF, LOAD_INDIRECT, LOAD_FIELD_INDIRECT,
                   STORE_INDIRECT

Regions:           ALLOC_REGION, WRITE_REGION, LOAD_REGION

Continuations:     SET_CONTINUATION, RESUME_CONTINUATION

Escape hatch:      SYMBOLIC
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">DECL_VAR</code> is a recent addition that separates variable <em>declarations</em> from <em>assignments</em>. Previously, <code class="language-plaintext highlighter-rouge">STORE_VAR</code> handled both, always writing to the current stack frame. This meant assignments inside nested functions could never modify outer-scope variables — a fundamental scope chain bug. <code class="language-plaintext highlighter-rouge">DECL_VAR</code> writes only to the current frame (for <code class="language-plaintext highlighter-rouge">let x = ...</code>, <code class="language-plaintext highlighter-rouge">var x = ...</code>), while <code class="language-plaintext highlighter-rouge">STORE_VAR</code> now walks the scope chain to find existing bindings. All 15 frontends emit <code class="language-plaintext highlighter-rouge">DECL_VAR</code> for declarations and <code class="language-plaintext highlighter-rouge">STORE_VAR</code> for assignments.</p>
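
<p>The difference between the two opcodes can be sketched with a toy scope chain. This is a minimal illustration, not RedDragon’s VM: the <code class="language-plaintext highlighter-rouge">Frame</code> class and both helper functions are hypothetical names, and the fallback for a <code class="language-plaintext highlighter-rouge">STORE_VAR</code> with no existing binding is an assumption.</p>

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Frame:
    """One stack frame; `parent` is the enclosing scope (hypothetical name)."""
    variables: dict[str, object] = field(default_factory=dict)
    parent: Frame | None = None


def decl_var(frame: Frame, name: str, value: object) -> None:
    # DECL_VAR: always bind in the current frame, shadowing outer bindings.
    frame.variables[name] = value


def store_var(frame: Frame, name: str, value: object) -> None:
    # STORE_VAR: walk outward and update the nearest existing binding.
    scope: Frame | None = frame
    while scope is not None:
        if name in scope.variables:
            scope.variables[name] = value
            return
        scope = scope.parent
    # Assumption: with no existing binding, bind in the current frame.
    frame.variables[name] = value


outer = Frame()
inner = Frame(parent=outer)
decl_var(outer, "count", 0)
store_var(inner, "count", 1)   # updates the binding in the outer frame
decl_var(inner, "count", 99)   # shadows it in the inner frame only
```

<p>After these three operations the outer frame holds <code class="language-plaintext highlighter-rouge">count = 1</code> and the inner frame holds its own <code class="language-plaintext highlighter-rouge">count = 99</code>, which is the behaviour the old single-opcode design could not express.</p>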

<p>The first 21 opcodes handle all general-purpose lowering across 15 languages. <code class="language-plaintext highlighter-rouge">TRY_PUSH</code> and <code class="language-plaintext highlighter-rouge">TRY_POP</code> model structured exception handling (pushing/popping handler labels onto the VM’s exception stack). <code class="language-plaintext highlighter-rouge">ADDRESS_OF</code> supports pointer aliasing: <code class="language-plaintext highlighter-rouge">&amp;x</code> on a primitive promotes the variable to a heap object and returns a typed <code class="language-plaintext highlighter-rouge">Pointer(base, offset)</code>, enabling <code class="language-plaintext highlighter-rouge">*ptr = 99</code> to update the original variable through the alias (see <a href="#pointer-aliasing">Pointer Aliasing</a>). The three region opcodes (<code class="language-plaintext highlighter-rouge">ALLOC_REGION</code>, <code class="language-plaintext highlighter-rouge">WRITE_REGION</code>, <code class="language-plaintext highlighter-rouge">LOAD_REGION</code>) provide byte-addressed memory for COBOL-style overlays, REDEFINES, and packed data layouts. The two continuation opcodes (<code class="language-plaintext highlighter-rouge">SET_CONTINUATION</code>, <code class="language-plaintext highlighter-rouge">RESUME_CONTINUATION</code>) model COBOL’s PERFORM return semantics, where control transfers to a named paragraph and returns to the caller on completion. The extended opcodes are language-agnostic in the IR and VM; they happen to be emitted by specific frontends but could serve broader use cases.</p>

<p>Every instruction is a frozen dataclass with named fields: an opcode, typed operands (registers as <code class="language-plaintext highlighter-rouge">Register</code> objects, labels as <code class="language-plaintext highlighter-rouge">CodeLabel</code> objects, operators as <code class="language-plaintext highlighter-rouge">BinopKind</code>/<code class="language-plaintext highlighter-rouge">UnopKind</code> enums), a destination register, and a source location tracing it back to the original code. Each opcode has its own dataclass (e.g., <code class="language-plaintext highlighter-rouge">Binop</code>, <code class="language-plaintext highlighter-rouge">LoadVar</code>, <code class="language-plaintext highlighter-rouge">CallFunction</code>) rather than a generic container with positional operands. No nested expressions. <code class="language-plaintext highlighter-rouge">a + b * c</code> decomposes into:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%0 = load_var b
%1 = load_var c
%2 = binop * %0 %1
%3 = load_var a
%4 = binop + %3 %2
</code></pre></div></div>
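
<p>A minimal sketch of this flattening, using Python’s standard <code class="language-plaintext highlighter-rouge">ast</code> module in place of tree-sitter. The <code class="language-plaintext highlighter-rouge">flatten</code> function is hypothetical and emits text rather than instruction dataclasses. Names are lowered to <code class="language-plaintext highlighter-rouge">load_var</code>, literals to <code class="language-plaintext highlighter-rouge">const</code>, and operands are visited left to right, so the registers are numbered differently from the listing above.</p>

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def flatten(expr: str) -> list[str]:
    """Lower a Python arithmetic expression to flat three-address lines."""
    lines: list[str] = []

    def fresh(text: str) -> str:
        # Every emitted line defines exactly one new register.
        reg = f"%{len(lines)}"
        lines.append(f"{reg} = {text}")
        return reg

    def visit(node: ast.expr) -> str:
        if isinstance(node, ast.Name):
            return fresh(f"load_var {node.id}")
        if isinstance(node, ast.Constant):
            return fresh(f"const {node.value!r}")
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = visit(node.left)
            right = visit(node.right)
            return fresh(f"binop {OPS[type(node.op)]} {left} {right}")
        # Anything else falls through to a symbolic placeholder.
        return fresh(f'symbolic "unsupported:{type(node).__name__}"')

    visit(ast.parse(expr, mode="eval").body)
    return lines


for line in flatten("a + b * c"):
    print(line)
```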

<p>This verbosity is the <strong>trade-off for universality</strong>. CFG construction, dataflow analysis, and VM execution all operate on the same flat list. Adding a new language means emitting these opcodes; everything downstream works automatically.</p>

<h3 id="source-location-traceability">Source Location Traceability</h3>

<p>Every instruction carries a <code class="language-plaintext highlighter-rouge">SourceLocation</code> with start/end line and column, captured from the tree-sitter AST node that generated it. The IR’s string representation appends this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%0 = const 10  # 1:4-1:6
</code></pre></div></div>

<p>This means any IR instruction, any VM execution step, any dataflow dependency can be traced back to the exact span of source code that produced it. When a symbolic value appears in the output, its provenance chain leads back to specific source lines.</p>
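
<p>As a rough illustration of how a frozen instruction dataclass can carry its location, here is a sketch with a single opcode. The field names (<code class="language-plaintext highlighter-rouge">start_line</code>, <code class="language-plaintext highlighter-rouge">result_reg</code>, and so on) and the <code class="language-plaintext highlighter-rouge">Const</code> class are assumptions made for the example, not the project’s actual definitions.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLocation:
    start_line: int
    start_col: int
    end_line: int
    end_col: int

    def __str__(self) -> str:
        return f"{self.start_line}:{self.start_col}-{self.end_line}:{self.end_col}"


@dataclass(frozen=True)
class Const:
    """Hypothetical stand-in for one per-opcode instruction dataclass."""
    result_reg: str
    value: object
    location: SourceLocation

    def __str__(self) -> str:
        # The source span is appended as a trailing comment.
        return f"{self.result_reg} = const {self.value}  # {self.location}"


inst = Const("%0", 10, SourceLocation(1, 4, 1, 6))
print(inst)
```

<p>Because both classes are frozen, an instruction cannot be mutated after lowering, so the location it was created with stays attached through every later stage.</p>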

<h3 id="control-flow-and-functions">Control Flow and Functions</h3>

<p>All control flow is explicit: labels, conditional branches, and unconditional jumps. There are no structured <code class="language-plaintext highlighter-rouge">if</code>/<code class="language-plaintext highlighter-rouge">while</code>/<code class="language-plaintext highlighter-rouge">for</code> constructs. <code class="language-plaintext highlighter-rouge">BRANCH_IF</code> carries its branch targets as a typed <code class="language-plaintext highlighter-rouge">branch_targets: tuple[CodeLabel, ...]</code> field. The CFG builder splits the IR into basic blocks at every <code class="language-plaintext highlighter-rouge">LABEL</code> and after every <code class="language-plaintext highlighter-rouge">BRANCH</code>/<code class="language-plaintext highlighter-rouge">BRANCH_IF</code>/<code class="language-plaintext highlighter-rouge">RETURN</code>/<code class="language-plaintext highlighter-rouge">THROW</code>, then wires edges based on the branch targets. Loops become back-edges: a <code class="language-plaintext highlighter-rouge">while</code> loop’s <code class="language-plaintext highlighter-rouge">BRANCH</code> at the end of the body points back to the condition’s label. Function definitions use the <strong>skip-over pattern</strong> shown in the <a href="#a-worked-example-source-to-execution">worked example</a>: a <code class="language-plaintext highlighter-rouge">BRANCH</code> jumps past the body at definition time, and a <code class="language-plaintext highlighter-rouge">FunctionRegistry</code> scans the IR for <code class="language-plaintext highlighter-rouge">SYMBOLIC "param:"</code> markers to extract parameter names, map class names to method labels, and build linearized parent chains for class inheritance (see <a href="#class-hierarchy-and-inherited-method-dispatch">Class Hierarchy and Inherited Method Dispatch</a>).</p>
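
<p>The block-splitting rule is simple enough to sketch directly. The version below works on plain <code class="language-plaintext highlighter-rouge">(opcode, *operands)</code> tuples rather than instruction dataclasses, and it stops before edge wiring; both simplifications are assumptions made to keep the example short.</p>

```python
TERMINATORS = {"BRANCH", "BRANCH_IF", "RETURN", "THROW"}


def split_basic_blocks(instructions: list[tuple]) -> list[list[tuple]]:
    """Split a flat list of (opcode, *operands) tuples into basic blocks."""
    blocks: list[list[tuple]] = []
    current: list[tuple] = []
    for inst in instructions:
        opcode = inst[0]
        if opcode == "LABEL" and current:
            blocks.append(current)      # a label always starts a new block
            current = []
        current.append(inst)
        if opcode in TERMINATORS:
            blocks.append(current)      # a terminator always ends the block
            current = []
    if current:
        blocks.append(current)
    return blocks


program = [
    ("LABEL", "entry"),
    ("CONST", "%0", 0),
    ("BRANCH", "loop_cond"),
    ("LABEL", "loop_cond"),
    ("BRANCH_IF", "%0", "loop_body", "loop_end"),
    ("LABEL", "loop_body"),
    ("BRANCH", "loop_cond"),            # back-edge to the loop condition
    ("LABEL", "loop_end"),
    ("RETURN", "%0"),
]
blocks = split_basic_blocks(program)
```

<p>The nine instructions fall into four blocks, one per label, and the <code class="language-plaintext highlighter-rouge">BRANCH</code> that ends <code class="language-plaintext highlighter-rouge">loop_body</code> is the back-edge that makes the loop visible in the CFG.</p>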

<h3 id="four-call-variants">Four Call Variants</h3>

<p>The IR distinguishes four kinds of calls:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">CALL_FUNCTION</code></strong>: static calls where the target is a known name. Fields: <code class="language-plaintext highlighter-rouge">func_name</code>, <code class="language-plaintext highlighter-rouge">args</code>.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">CALL_METHOD</code></strong>: method calls on objects. Fields: <code class="language-plaintext highlighter-rouge">obj_reg</code>, <code class="language-plaintext highlighter-rouge">method_name</code>, <code class="language-plaintext highlighter-rouge">args</code>.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">CALL_UNKNOWN</code></strong>: dynamic calls where the target is a computed expression (a variable holding a function reference, or a closure). Fields: <code class="language-plaintext highlighter-rouge">target_reg</code>, <code class="language-plaintext highlighter-rouge">args</code>.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">CALL_CTOR</code></strong>: explicit constructor calls, separating <code class="language-plaintext highlighter-rouge">new ClassName(args)</code> from regular function calls. Fields: <code class="language-plaintext highlighter-rouge">class_name</code>, <code class="language-plaintext highlighter-rouge">args</code>. This avoids the need for runtime heuristics to distinguish constructor calls from function calls, and gives the VM a clean dispatch path for object allocation + initialisation.</li>
</ul>

<p>The frontend decides which to emit based on the AST: <code class="language-plaintext highlighter-rouge">foo(x)</code> emits <code class="language-plaintext highlighter-rouge">CALL_FUNCTION</code>, <code class="language-plaintext highlighter-rouge">obj.foo(x)</code> emits <code class="language-plaintext highlighter-rouge">CALL_METHOD</code>, <code class="language-plaintext highlighter-rouge">some_var(x)</code> where <code class="language-plaintext highlighter-rouge">some_var</code> isn’t a known function emits <code class="language-plaintext highlighter-rouge">CALL_UNKNOWN</code>, and <code class="language-plaintext highlighter-rouge">new Foo(x)</code> emits <code class="language-plaintext highlighter-rouge">CALL_CTOR</code>.</p>

<h3 id="object-and-array-construction">Object and Array Construction</h3>

<p>Objects and arrays are created via <code class="language-plaintext highlighter-rouge">NEW_OBJECT</code>/<code class="language-plaintext highlighter-rouge">NEW_ARRAY</code> followed by <code class="language-plaintext highlighter-rouge">STORE_FIELD</code>/<code class="language-plaintext highlighter-rouge">STORE_INDEX</code> for each member. An array literal <code class="language-plaintext highlighter-rouge">[1, 2, 3]</code> lowers to:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%0 = const 3
%1 = new_array list %0
%2 = const 0
%3 = const 1
store_index %1 %2 %3         # array[0] = 1
%4 = const 1
%5 = const 2
store_index %1 %4 %5         # array[1] = 2
%6 = const 2
%7 = const 3
store_index %1 %6 %7         # array[2] = 3
</code></pre></div></div>

<p>This verbose expansion means the VM and dataflow analysis see every individual element assignment, which matters for tracking which values flow into which positions.</p>
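
<p>A small sketch that reproduces the expansion above for any list of constants. The function name and the textual output format are illustrative only; a real frontend would emit instruction objects and lower arbitrary element expressions, not just constants.</p>

```python
from itertools import count


def lower_array_literal(elements: list[object]) -> list[str]:
    """Emit new_array followed by one store_index per element."""
    regs = count()
    lines: list[str] = []

    def const(value: object) -> str:
        reg = f"%{next(regs)}"
        lines.append(f"{reg} = const {value}")
        return reg

    size = const(len(elements))
    array = f"%{next(regs)}"
    lines.append(f"{array} = new_array list {size}")
    for index, element in enumerate(elements):
        index_reg = const(index)        # position in the array
        value_reg = const(element)      # value stored at that position
        lines.append(f"store_index {array} {index_reg} {value_reg}")
    return lines


for line in lower_array_literal([1, 2, 3]):
    print(line)
```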

<h3 id="the-symbolic-escape-hatch">The SYMBOLIC Escape Hatch</h3>

<p><code class="language-plaintext highlighter-rouge">SYMBOLIC</code> is the escape hatch. When a frontend encounters a construct it doesn’t handle, it emits <code class="language-plaintext highlighter-rouge">SYMBOLIC "unsupported:list_comprehension"</code> instead of crashing. The VM treats it as a symbolic value that propagates through execution. Parameters use it too (<code class="language-plaintext highlighter-rouge">SYMBOLIC "param:x"</code>), as do caught exceptions (<code class="language-plaintext highlighter-rouge">SYMBOLIC "caught_exception:ValueError"</code>).</p>

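<p>Over time, <code class="language-plaintext highlighter-rouge">unsupported:</code> emissions get replaced with real IR as frontends gain coverage. The project’s history is essentially the story of systematically eliminating every last <code class="language-plaintext highlighter-rouge">unsupported:</code> <code class="language-plaintext highlighter-rouge">SYMBOLIC</code>; the <code class="language-plaintext highlighter-rouge">param:</code> and <code class="language-plaintext highlighter-rouge">caught_exception:</code> forms are deliberate uses and stay.</p>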

<hr />

<h2 id="frontends-four-strategies-one-output">Frontends: Four Strategies, One Output</h2>

<p><em>See also: <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/notes-on-frontend-design.md">Frontend Design</a> · <a href="https://github.com/avishek-sen-gupta/red-dragon/tree/main/docs/frontend-design">Per-Language Frontend Docs</a></em></p>

<p>All four frontend strategies produce the same list of typed instructions. They differ in speed, coverage, and determinism:</p>

<p><strong>1. Deterministic frontends (15 languages):</strong> Python, JavaScript, TypeScript, Java, Ruby, Go, PHP, C#, C, C++, Rust, Kotlin, Scala, Lua, Pascal. These use tree-sitter for parsing and a dispatch-table-based recursive descent for lowering. <strong>Sub-millisecond. Zero LLM calls. Fully testable.</strong> Each frontend is modularised into separate files for expressions, control flow, and declarations, inheriting from a shared <code class="language-plaintext highlighter-rouge">BaseFrontend</code>. An optional <strong>AST repair decorator</strong> can wrap any deterministic frontend to handle malformed source (see the next section).</p>

<p><strong>2. COBOL frontend (ProLeap bridge):</strong> COBOL source is parsed by the ProLeap COBOL parser (a Java-based parser producing an Abstract Syntax Graph), bridged to Python via a shaded JAR that emits JSON ASGs. The frontend includes a complete type system: PIC clause parsing (zoned decimal, COMP/COMP-1/COMP-2, packed decimal, alphanumeric, EBCDIC), REDEFINES overlays with byte-addressed memory regions, OCCURS arrays with subscript resolution, level-88 condition names with value ranges, and paragraph-based control flow via named continuations. COBOL-specific IR is emitted using the region and continuation opcodes.</p>

<p><strong>3. LLM frontend:</strong> For languages without a deterministic frontend. The source is sent to an LLM constrained by a formal schema: all 33 opcode specs, concrete patterns, and worked examples. The LLM acts as a mechanical compiler frontend, not a reasoning engine. This distinction matters: the prompt doesn’t ask <em>“what does this code do?”</em> It asks <em>“translate this into these specific opcodes.”</em></p>

<p><strong>4. Chunked LLM frontend:</strong> For large files that overflow context windows. Tree-sitter decomposes the file into per-function chunks, each is LLM-lowered independently, registers and labels are renumbered to avoid collisions, and the chunks are reassembled into a single IR.</p>

<hr />

<h2 id="llm-assisted-ast-repair">LLM-Assisted AST Repair</h2>

<p>Real-world source code is often malformed: missing semicolons, unclosed brackets, incomplete extracts pasted from documentation, partial files from legacy migrations. Tree-sitter is tolerant of errors (it produces ERROR and MISSING nodes in the AST rather than refusing to parse), but those error nodes reach the dispatch chain and produce <code class="language-plaintext highlighter-rouge">SYMBOLIC "unsupported:ERROR"</code> emissions. The deterministic frontend keeps going, but the analysis loses information at every error node.</p>

<p>The AST repair facility recovers that information. It’s implemented as a decorator (<code class="language-plaintext highlighter-rouge">RepairingFrontendDecorator</code>) that wraps any deterministic frontend. When the source is clean, the decorator adds <strong>no LLM overhead</strong>: it checks <code class="language-plaintext highlighter-rouge">tree.root_node.has_error</code>, finds no errors, and delegates directly to the inner frontend without making an LLM call. When errors exist, it runs a repair loop:</p>

<pre><code class="language-mermaid">%%{ init: { "flowchart": { "curve": "stepBefore" } } }%%
flowchart TD
    parse("📄 Parse source&lt;br/&gt;&lt;i&gt;tree-sitter&lt;/i&gt;"):::step
    check{"has_error?"}:::decide
    delegate("✅ Delegate to&lt;br/&gt;inner frontend"):::success
    extract("🔍 Extract&lt;br/&gt;error spans"):::step
    prompt("📝 Build&lt;br/&gt;repair prompt"):::step
    llm("🤖 Send to LLM"):::llm
    patch("🩹 Patch source&lt;br/&gt;&lt;i&gt;at error spans&lt;/i&gt;"):::step
    reparse("📄 Re-parse&lt;br/&gt;&lt;i&gt;patched source&lt;/i&gt;"):::step
    recheck{"has_error?"}:::decide
    fallback("⚠️ Fall back to&lt;br/&gt;original source"):::fallback

    parse --&gt; check
    check --&gt;|"No"| delegate
    check --&gt;|"Yes"| extract
    extract --&gt; prompt --&gt; llm --&gt; patch --&gt; reparse --&gt; recheck
    recheck --&gt;|"No"| delegate
    recheck --&gt;|"Yes (retries left)"| extract
    recheck --&gt;|"Yes (exhausted)"| fallback
    fallback --&gt; delegate

    classDef step fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef decide fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef llm fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef success fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef fallback fill:#fce4ec,stroke:#c62828,stroke-width:1px,color:#5c0a0a
</code></pre>

<p>The loop has four stages, each implemented as a pure function:</p>

<h3 id="1-error-span-extraction">1. Error Span Extraction</h3>

<p>The extractor walks the tree-sitter AST recursively, collecting every ERROR and MISSING node. Each node’s byte offsets are expanded to cover full source lines (so the LLM sees complete lines, not mid-line fragments). Overlapping or adjacent spans are merged to avoid sending redundant context. Each merged span gets N lines of surrounding context (configurable, default 3) attached as <code class="language-plaintext highlighter-rouge">context_before</code> and <code class="language-plaintext highlighter-rouge">context_after</code>.</p>
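
<p>The merge-and-context step can be sketched without tree-sitter, assuming the error nodes have already been expanded to whole-line spans. <code class="language-plaintext highlighter-rouge">LineSpan</code>, <code class="language-plaintext highlighter-rouge">merge_spans</code>, and <code class="language-plaintext highlighter-rouge">with_context</code> are hypothetical names; only the behaviour (merge overlapping or adjacent spans, attach 3 lines of context by default) comes from the description above.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineSpan:
    start: int  # first line of the span, 0-based, inclusive
    end: int    # last line of the span, 0-based, inclusive


def merge_spans(spans: list[LineSpan]) -> list[LineSpan]:
    """Merge overlapping or adjacent line spans into maximal spans."""
    merged: list[LineSpan] = []
    for span in sorted(spans, key=lambda s: (s.start, s.end)):
        # Adjacent means the span starts on the line right after the last one.
        if merged and merged[-1].end + 1 >= span.start:
            last = merged.pop()
            merged.append(LineSpan(last.start, max(last.end, span.end)))
        else:
            merged.append(span)
    return merged


def with_context(span: LineSpan, lines: list[str], context: int = 3):
    """Return (context_before, context_after) for a merged span."""
    before = lines[max(0, span.start - context):span.start]
    after = lines[span.end + 1:span.end + 1 + context]
    return before, after
```

<p>Merging before attaching context matters: two errors on neighbouring lines would otherwise be sent as two fragments whose context windows contain each other’s broken code.</p>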

<h3 id="2-prompt-construction">2. Prompt Construction</h3>

<p>The prompter builds a structured repair prompt from the error spans. The system prompt constrains the LLM to fix <em>only</em> syntax errors and return <em>only</em> the repaired code, with no markdown wrapping and no explanations. Each error span becomes a delimited section in the user prompt showing the broken code with its surrounding context:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Context before:
def process(data):
    result = []

# Broken code:
    for item in data
        result.append(item.value

# Context after:
    return result
</code></pre></div></div>

<p>Multiple error spans are separated by a <code class="language-plaintext highlighter-rouge">===FRAGMENT===</code> delimiter. The LLM returns repaired fragments separated by the same delimiter.</p>
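
<p>A sketch of both directions of that exchange: building the user prompt from fragments and splitting the reply back apart. The function names are hypothetical, and rejecting a reply whose fragment count does not match is an assumption about how a mismatch would be handled.</p>

```python
DELIMITER = "===FRAGMENT==="


def build_user_prompt(fragments: list[tuple[str, str, str]]) -> str:
    """Each fragment is (context_before, broken_code, context_after)."""
    sections = [
        f"# Context before:\n{before}\n\n"
        f"# Broken code:\n{broken}\n\n"
        f"# Context after:\n{after}"
        for before, broken, after in fragments
    ]
    return f"\n{DELIMITER}\n".join(sections)


def split_response(response: str, expected: int) -> list[str]:
    """Split the LLM reply into repaired fragments."""
    parts = [part.strip("\n") for part in response.split(DELIMITER)]
    if len(parts) != expected:
        # Assumption: a count mismatch is treated as a failed repair attempt.
        raise ValueError(f"expected {expected} fragments, got {len(parts)}")
    return parts
```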

<h3 id="3-source-patching">3. Source Patching</h3>

<p>The patcher applies repaired fragments back to the original source bytes. It processes spans from end-of-file backward so that earlier byte offsets remain valid as later spans are replaced. This is the same technique compilers use for applying source-level fixups.</p>
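
<p>The back-to-front ordering is easiest to see in code. This sketch assumes non-overlapping spans given as <code class="language-plaintext highlighter-rouge">(start_byte, end_byte, replacement)</code> with an exclusive end; the tuple layout and the function name are illustrative.</p>

```python
def apply_patches(source: bytes, patches: list[tuple[int, int, bytes]]) -> bytes:
    """Apply (start_byte, end_byte, replacement) patches, end exclusive.

    Assumes the spans do not overlap.
    """
    patched = source
    # Replace the last span first so the offsets of earlier spans stay valid.
    for start, end, replacement in sorted(patches, reverse=True):
        patched = patched[:start] + replacement + patched[end:]
    return patched


source = b"for item in data\n    total += item\n"
patches = [
    (0, 16, b"for item in data:"),       # add the missing colon
    (21, 34, b"total += item.value"),    # complete the partial expression
]
print(apply_patches(source, patches).decode())
```

<p>Applied front to back, the first replacement would lengthen the source by one byte and the second span’s offsets would then point one byte too early.</p>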

<h3 id="4-re-parse-and-retry">4. Re-parse and Retry</h3>

<p>The patched source is re-parsed with tree-sitter. If the parse is now clean (<code class="language-plaintext highlighter-rouge">has_error</code> is false), the repaired source is passed to the inner deterministic frontend for lowering. If errors remain and retry budget allows (default: 3 attempts), the loop repeats from step 1 with the partially-repaired source. If all retries are exhausted, the decorator falls back to the original source, accepting <code class="language-plaintext highlighter-rouge">SYMBOLIC</code> emissions for the error nodes rather than crashing.</p>
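
<p>The control flow of the whole loop fits in a few lines once the parser check and the repair round trip are passed in as callables. Both <code class="language-plaintext highlighter-rouge">has_error</code> and <code class="language-plaintext highlighter-rouge">repair_once</code> are hypothetical stand-ins here (for the tree-sitter check and for the extract, prompt, LLM, and patch steps), so this is a sketch of the retry and fallback logic only.</p>

```python
from typing import Callable


def repair_source(
    source: bytes,
    has_error: Callable[[bytes], bool],
    repair_once: Callable[[bytes], bytes],
    max_retries: int = 3,
) -> bytes:
    """Return a cleanly parsing source, or the original if repair fails."""
    if not has_error(source):
        return source                  # clean input: no repair attempt at all
    candidate = source
    for _ in range(max_retries):
        # Each attempt starts from the partially repaired source.
        candidate = repair_once(candidate)
        if not has_error(candidate):
            return candidate
    return source                      # budget exhausted: fall back


# Toy stand-ins: "error" means unbalanced parentheses, "repair" appends one.
unbalanced = lambda text: text.count(b"(") != text.count(b")")
append_paren = lambda text: text + b")"
print(repair_source(b"f((x", unbalanced, append_paren))
```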

<h3 id="design-properties">Design Properties</h3>

<p><strong>The LLM fixes syntax, not semantics.</strong> The prompt constrains the LLM to syntactic repair: fix the missing semicolon, close the bracket, complete the partial statement. This keeps the repair narrowly scoped and verifiable (the repaired source either parses cleanly or it doesn’t).</p>

<p><strong>Graceful degradation.</strong> If the LLM returns garbage, the retry budget is eventually exhausted and the decorator falls back to the original source. In the worst case the lowered IR is identical to what it would be without repair; the only added cost is the failed LLM calls.</p>

<hr />

<h2 id="llm-frontend-lowering-unknown-languages">LLM Frontend: Lowering Unknown Languages</h2>

<p>The 15 deterministic frontends cover a fixed set of languages. For everything else (Haskell, Elixir, Perl, R, Fortran, or any language with parseable source), the LLM frontend lowers source directly to IR. No tree-sitter grammar needed, no dispatch table, no language-specific code. The LLM acts as the entire compiler frontend.</p>

<h3 id="the-prompt-as-a-formal-schema">The Prompt as a Formal Schema</h3>

<p>The LLM frontend works by constraining the LLM to a <strong>mechanical translation task</strong>. The system prompt is a specification containing:</p>

<ol>
  <li><strong>The instruction format</strong>: every IR instruction is a JSON object with <code class="language-plaintext highlighter-rouge">opcode</code>, <code class="language-plaintext highlighter-rouge">result_reg</code>, <code class="language-plaintext highlighter-rouge">operands</code>, <code class="language-plaintext highlighter-rouge">label</code>, and <code class="language-plaintext highlighter-rouge">source_location</code> fields.</li>
  <li><strong>All 33 opcode definitions</strong>: grouped into value producers (<code class="language-plaintext highlighter-rouge">CONST</code>, <code class="language-plaintext highlighter-rouge">LOAD_VAR</code>, <code class="language-plaintext highlighter-rouge">BINOP</code>, <code class="language-plaintext highlighter-rouge">CALL_FUNCTION</code>, …), consumers/control flow (<code class="language-plaintext highlighter-rouge">STORE_VAR</code>, <code class="language-plaintext highlighter-rouge">BRANCH_IF</code>, <code class="language-plaintext highlighter-rouge">RETURN</code>, …), and special instructions (<code class="language-plaintext highlighter-rouge">SYMBOLIC</code>, <code class="language-plaintext highlighter-rouge">LABEL</code>).</li>
  <li><strong>Critical patterns</strong>: exact lowering templates for function definitions (the skip-over pattern with <code class="language-plaintext highlighter-rouge">BRANCH</code>/<code class="language-plaintext highlighter-rouge">LABEL</code>/<code class="language-plaintext highlighter-rouge">SYMBOLIC param:</code>/implicit <code class="language-plaintext highlighter-rouge">RETURN None</code>), class definitions, constructor calls, method calls, and if/elif/else chains.</li>
  <li><strong>A complete worked example</strong>: a <code class="language-plaintext highlighter-rouge">fib(n)</code> function lowered to 30 IR instructions, showing every convention in context.</li>
  <li><strong>Rules</strong>: the first instruction is always <code class="language-plaintext highlighter-rouge">LABEL "entry"</code>, every expression is flattened into registers, string literals include quotes in the operand, booleans are <code class="language-plaintext highlighter-rouge">"True"</code>/<code class="language-plaintext highlighter-rouge">"False"</code>, return only the JSON array with no markdown fences.</li>
</ol>

<p>The user prompt is simply: <em>“Lower the following {language} source code into IR instructions:”</em> followed by the raw source. The opcode definitions and patterns are the specification; the LLM is the translator.</p>

<h3 id="parsing-and-validation">Parsing and Validation</h3>

<p>The LLM’s response (a JSON array of instruction objects) goes through three stages:</p>

<ol>
  <li><strong>Fence stripping</strong>: markdown code fences are removed if present (LLMs add them reflexively despite being told not to).</li>
  <li><strong>JSON parsing</strong>: each object is mapped to an <code class="language-plaintext highlighter-rouge">IRInstruction</code>, validating that every opcode string matches the <code class="language-plaintext highlighter-rouge">Opcode</code> enum. Unknown opcodes raise <code class="language-plaintext highlighter-rouge">IRParsingError</code>.</li>
  <li><strong>Entry label validation</strong>: if the first instruction isn’t <code class="language-plaintext highlighter-rouge">LABEL "entry"</code>, one is auto-prepended with a warning. This ensures the CFG builder always finds a valid entry point.</li>
</ol>

<p>If JSON parsing fails, the frontend retries up to 3 times (configurable). Each retry makes a fresh LLM call. If all retries are exhausted, the parsing error propagates.</p>
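
<p>The three validation stages can be sketched as follows. This version keeps instructions as plain dictionaries instead of mapping them to <code class="language-plaintext highlighter-rouge">IRInstruction</code>, checks against a deliberately tiny opcode set, and assumes the entry label lives in the <code class="language-plaintext highlighter-rouge">label</code> field; all three are simplifications for the example.</p>

```python
import json

FENCE = "`" * 3  # markdown code fence marker
KNOWN_OPCODES = {"LABEL", "CONST", "BINOP", "RETURN", "SYMBOLIC"}  # subset only


class IRParsingError(ValueError):
    """Raised when an instruction uses an opcode outside the known set."""


def strip_fences(text: str) -> str:
    """Drop a leading and trailing markdown fence line, if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].startswith(FENCE):
        lines = lines[:-1]
    return "\n".join(lines)


def parse_llm_ir(response: str) -> list[dict]:
    instructions = json.loads(strip_fences(response))
    for inst in instructions:
        if inst.get("opcode") not in KNOWN_OPCODES:
            raise IRParsingError(f"unknown opcode: {inst.get('opcode')!r}")
    first = instructions[0] if instructions else {}
    if first.get("opcode") != "LABEL" or first.get("label") != "entry":
        # Guarantee that the CFG builder finds an entry point.
        instructions.insert(0, {"opcode": "LABEL", "label": "entry"})
    return instructions
```

<p>Malformed JSON surfaces as <code class="language-plaintext highlighter-rouge">json.JSONDecodeError</code> from <code class="language-plaintext highlighter-rouge">json.loads</code>, which is the failure a retry wrapper would catch before making a fresh LLM call.</p>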

<h3 id="chunked-llm-frontend-scaling-to-large-files">Chunked LLM Frontend: Scaling to Large Files</h3>

<p>A single LLM call can’t handle a 2,000-line file: the source plus the system prompt plus the response would overflow the context window. The chunked frontend solves this by decomposing the file before calling the LLM.</p>

<p>The decomposition uses tree-sitter for structural splitting (even though the language may not have a deterministic <em>lowering</em> frontend, tree-sitter grammars exist for most languages). The <code class="language-plaintext highlighter-rouge">ChunkExtractor</code> walks the top-level children of the parse tree and classifies each as a function, class, or top-level statement. Contiguous top-level statements are grouped into a single chunk. Functions and classes are emitted first, then top-level groups, preserving the definition-before-use ordering that the skip-over pattern requires.</p>

<p>Each chunk is lowered independently through the standard <code class="language-plaintext highlighter-rouge">LLMFrontend</code>. The <code class="language-plaintext highlighter-rouge">IRRenumberer</code> then fixes up the results:</p>

<ul>
  <li><strong>Register renumbering</strong>: each chunk’s registers start at <code class="language-plaintext highlighter-rouge">%0</code>, so the renumberer offsets them (<code class="language-plaintext highlighter-rouge">%0</code> in chunk 2 becomes <code class="language-plaintext highlighter-rouge">%47</code> if chunk 1 used registers up to <code class="language-plaintext highlighter-rouge">%46</code>).</li>
  <li><strong>Label renumbering</strong>: each chunk’s labels get a <code class="language-plaintext highlighter-rouge">_chunkN</code> suffix to avoid collisions (<code class="language-plaintext highlighter-rouge">if_true_2</code> in chunk 1 becomes <code class="language-plaintext highlighter-rouge">if_true_2_chunk1</code>).</li>
  <li><strong>Function reference fixup</strong>: the <code class="language-plaintext highlighter-rouge">&lt;function:foo@func_foo_0&gt;</code> convention embeds the label in a string literal. The renumberer patches these to match the suffixed labels.</li>
</ul>
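
<p>The three fixups can be illustrated on textual IR with two regular expressions. This is a sketch under stated assumptions: the real <code class="language-plaintext highlighter-rouge">IRRenumberer</code> presumably works on instruction objects rather than strings, and <code class="language-plaintext highlighter-rouge">renumber_chunk</code> with its parameters is a hypothetical name.</p>

```python
import re

REGISTER = re.compile(r"%(\d+)")


def renumber_chunk(lines: list[str], labels: set[str],
                   reg_offset: int, chunk_index: int) -> list[str]:
    """Offset registers and suffix labels in one chunk's textual IR."""
    suffix = f"_chunk{chunk_index}"
    label_pattern = (
        re.compile(r"\b(" + "|".join(map(re.escape, sorted(labels))) + r")\b")
        if labels else None
    )
    renumbered = []
    for line in lines:
        # Shift every register by the number of registers used so far.
        line = REGISTER.sub(lambda m: f"%{int(m.group(1)) + reg_offset}", line)
        if label_pattern:
            # Also rewrites labels embedded in function-reference strings.
            line = label_pattern.sub(lambda m: m.group(1) + suffix, line)
        renumbered.append(line)
    return renumbered
```

<p>Because the label pattern matches whole words anywhere in the line, the label inside a function-reference string is patched by the same pass that renames the label definition.</p>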

<p>The entry label is stripped from each chunk’s output and a single <code class="language-plaintext highlighter-rouge">LABEL "entry"</code> is prepended to the combined result. If a chunk fails (the LLM returns unparseable JSON), a <code class="language-plaintext highlighter-rouge">SYMBOLIC "chunk_error:chunk_name"</code> placeholder is inserted and lowering continues with the next chunk.</p>

<h3 id="end-to-end-example-haskell">End-to-End Example: Haskell</h3>

<p>Haskell has no deterministic frontend in RedDragon. Here is the full pipeline for a Haskell program with pattern-matched recursion, imports, and external function calls:</p>

<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">import</span> <span class="nn">Data.Char</span> <span class="p">(</span><span class="nf">toUpper</span><span class="p">,</span> <span class="nf">ord</span><span class="p">)</span>

<span class="n">factorial</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-&gt;</span> <span class="kt">Int</span>
<span class="n">factorial</span> <span class="mi">0</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">factorial</span> <span class="n">n</span> <span class="o">=</span> <span class="n">n</span> <span class="o">*</span> <span class="n">factorial</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>

<span class="n">x</span> <span class="o">=</span> <span class="n">factorial</span> <span class="mi">5</span>
<span class="n">ch</span> <span class="o">=</span> <span class="n">toUpper</span> <span class="sc">'a'</span>
<span class="n">code</span> <span class="o">=</span> <span class="n">ord</span> <span class="n">ch</span>
<span class="n">total</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">code</span>
</code></pre></div></div>

<p>The LLM frontend receives this source along with the 180-line system prompt and produces IR following the same conventions as the deterministic frontends. The <code class="language-plaintext highlighter-rouge">factorial</code> function is lowered using the skip-over pattern:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>entry:
  branch end_factorial_1          # skip over function body

func_factorial_0:
  %0 = symbolic "param:n"
  decl_var n %0
  %1 = load_var n
  %2 = const 0
  %3 = binop == %1 %2
  branch_if %3 if_true_2,if_false_3

if_true_2:
  %4 = const 1
  return %4

if_false_3:
  %5 = load_var n
  %6 = const 1
  %7 = binop - %5 %6
  %8 = call_function factorial %7
  %9 = load_var n
  %10 = binop * %9 %8
  return %10
  %11 = const "None"
  return %11

end_factorial_1:
  %12 = const "&lt;function:factorial@func_factorial_0&gt;"
  decl_var factorial %12
</code></pre></div></div>

<p>The top-level bindings follow the same pattern: <code class="language-plaintext highlighter-rouge">CONST</code> + <code class="language-plaintext highlighter-rouge">CALL_FUNCTION</code> + <code class="language-plaintext highlighter-rouge">DECL_VAR</code> for each declaration. At execution time, <code class="language-plaintext highlighter-rouge">factorial(5)</code> resolves to 120 deterministically (concrete recursive call). <code class="language-plaintext highlighter-rouge">toUpper('a')</code> and <code class="language-plaintext highlighter-rouge">ord(ch)</code> are unresolved <code class="language-plaintext highlighter-rouge">Data.Char</code> functions, handled by the <code class="language-plaintext highlighter-rouge">UnresolvedCallResolver</code>: the default <code class="language-plaintext highlighter-rouge">SymbolicResolver</code> traces dependencies through symbolic placeholders; the opt-in <code class="language-plaintext highlighter-rouge">LLMPlausibleResolver</code> produces concrete values (<code class="language-plaintext highlighter-rouge">'A'</code>, <code class="language-plaintext highlighter-rouge">65</code>, <code class="language-plaintext highlighter-rouge">total = 185</code>). From Haskell source to executed VM state, with <strong>zero language-specific code</strong>.</p>

<hr />

<h2 id="the-dispatch-table-engine">The Dispatch Table Engine</h2>

<p><em>See also: <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/frontend-design/base-frontend.md">Base Frontend Design</a></em></p>

<p>The heart of the deterministic frontends is a <code class="language-plaintext highlighter-rouge">BaseFrontend</code> class that all 15 languages inherit from. It uses two dispatch tables (one for statements, one for expressions) mapping tree-sitter AST node types to handler methods.</p>

<p>The lowering dispatch chain:</p>

<pre><code class="language-mermaid">%%{ init: { "flowchart": { "curve": "stepBefore" } } }%%
flowchart TD
    lower("lower(root)"):::fn
    block("_lower_block(root)&lt;br/&gt;&lt;i&gt;iterate named children&lt;/i&gt;"):::fn
    stmt("_lower_stmt(child)&lt;br/&gt;&lt;i&gt;skip noise/comments; try STMT_DISPATCH&lt;/i&gt;"):::fn
    expr("_lower_expr(child)&lt;br/&gt;&lt;i&gt;fallback: try EXPR_DISPATCH&lt;/i&gt;"):::fn
    sym("⚠️ SYMBOLIC('unsupported:X')&lt;br/&gt;&lt;i&gt;final fallback&lt;/i&gt;"):::fallback

    lower --&gt; block --&gt; stmt --&gt; expr --&gt; sym

    classDef fn fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef fallback fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#5c0a0a
</code></pre>
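
<p>The chain in the diagram can be sketched with two dictionaries and a fallback. <code class="language-plaintext highlighter-rouge">Node</code> is a hypothetical stand-in for a tree-sitter node and <code class="language-plaintext highlighter-rouge">MiniFrontend</code> handles only two node types, so this shows the dispatch order rather than the real <code class="language-plaintext highlighter-rouge">BaseFrontend</code>.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node."""
    type: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


class MiniFrontend:
    NOISE_TYPES = {"comment"}

    def __init__(self) -> None:
        self.lines: list[str] = []
        self.next_reg = 0
        self.stmt_dispatch = {"return_statement": self._lower_return}
        self.expr_dispatch = {"integer": self._lower_integer}

    def _fresh(self, text: str) -> str:
        reg = f"%{self.next_reg}"
        self.next_reg += 1
        self.lines.append(f"{reg} = {text}")
        return reg

    def lower(self, root: Node) -> list[str]:
        for child in root.children:
            self._lower_stmt(child)
        return self.lines

    def _lower_stmt(self, node: Node) -> None:
        if node.type in self.NOISE_TYPES:
            return
        handler = self.stmt_dispatch.get(node.type)
        if handler:
            handler(node)
        else:
            self._lower_expr(node)       # fallback: try it as an expression

    def _lower_expr(self, node: Node) -> str:
        handler = self.expr_dispatch.get(node.type)
        if handler:
            return handler(node)
        # Final fallback: emit a placeholder instead of crashing.
        return self._fresh(f'symbolic "unsupported:{node.type}"')

    def _lower_integer(self, node: Node) -> str:
        return self._fresh(f"const {node.text}")

    def _lower_return(self, node: Node) -> None:
        value = self._lower_expr(node.children[0])
        self.lines.append(f"return {value}")
```

<p>Supporting a new construct in this scheme means adding one entry to a table and one handler method, which is why the remaining unsupported node types can be worked through one at a time.</p>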

<p>Common constructs (<code class="language-plaintext highlighter-rouge">if/else</code>, <code class="language-plaintext highlighter-rouge">while</code>, <code class="language-plaintext highlighter-rouge">for</code>, <code class="language-plaintext highlighter-rouge">for-each</code> with destructuring, <code class="language-plaintext highlighter-rouge">return</code>, <code class="language-plaintext highlighter-rouge">function_definition</code>, <code class="language-plaintext highlighter-rouge">class_definition</code>, <code class="language-plaintext highlighter-rouge">try/catch</code>) are handled in the base class. For-loop destructuring (<code class="language-plaintext highlighter-rouge">for (const [k, v] of arr)</code> in JS/TS, <code class="language-plaintext highlighter-rouge">for ((k, v) in map)</code> in Kotlin, structured bindings in C++) decomposes binding patterns into individual <code class="language-plaintext highlighter-rouge">LOAD_INDEX</code>/<code class="language-plaintext highlighter-rouge">LOAD_FIELD</code> + <code class="language-plaintext highlighter-rouge">STORE_VAR</code> instructions per iteration. Language-specific constructs override or extend. Overridable constants handle the small but persistent differences across grammars:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python says "True", Go says "true", Lua says "true"
</span><span class="n">TRUE_LITERAL</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">True</span><span class="sh">"</span>    <span class="c1"># default
</span><span class="n">FALSE_LITERAL</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">False</span><span class="sh">"</span>
<span class="n">NONE_LITERAL</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">None</span><span class="sh">"</span>

<span class="c1"># Python puts the body in "body", Go puts it in "block"
</span><span class="n">FUNC_BODY_FIELD</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">body</span><span class="sh">"</span>
<span class="n">IF_CONSEQUENCE_FIELD</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">consequence</span><span class="sh">"</span>
</code></pre></div></div>

<p>All 15 languages <strong>canonicalise their native null/boolean forms</strong> to Python-form at lowering time. <code class="language-plaintext highlighter-rouge">nil</code>, <code class="language-plaintext highlighter-rouge">null</code>, <code class="language-plaintext highlighter-rouge">undefined</code>, <code class="language-plaintext highlighter-rouge">NULL</code> all become <code class="language-plaintext highlighter-rouge">"None"</code>. <code class="language-plaintext highlighter-rouge">true</code>, <code class="language-plaintext highlighter-rouge">True</code>, <code class="language-plaintext highlighter-rouge">TRUE</code> all become <code class="language-plaintext highlighter-rouge">"True"</code>. This means the VM only handles one set of literals, regardless of source language.</p>
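<p>A minimal sketch of how the overridable constants and the canonicalisation step could fit together. Only the constant names come from the snippet above; the class names, the <code class="language-plaintext highlighter-rouge">canonical_literal</code> method, and the Go-like values are assumptions made for illustration, not RedDragon's actual code.</p>

```python
class BaseFrontend:
    """Hypothetical base class; only the constant names come from the post."""

    TRUE_LITERAL: str = "True"
    FALSE_LITERAL: str = "False"
    NONE_LITERAL: str = "None"
    FUNC_BODY_FIELD: str = "body"

    def canonical_literal(self, text: str) -> str:
        """Map this grammar's native spelling to the Python-form literal."""
        native_to_canonical = {
            self.TRUE_LITERAL: "True",
            self.FALSE_LITERAL: "False",
            self.NONE_LITERAL: "None",
        }
        # Anything that is not a null/boolean literal passes through unchanged.
        return native_to_canonical.get(text, text)


class GoLikeFrontend(BaseFrontend):
    """Assumed overrides for a Go-style grammar."""

    TRUE_LITERAL = "true"
    FALSE_LITERAL = "false"
    NONE_LITERAL = "nil"
    FUNC_BODY_FIELD = "block"
```

<p>Because the subclass only swaps constants, the lowering logic in the base class stays untouched, and everything downstream sees a single spelling for each literal.</p>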

<p>Adding support for a new AST node type is mechanical: write a handler method, register it in the dispatch table. This is what made the systematic coverage push possible. When the audit flagged 34 missing node types across 15 languages, implementing them was straightforward because each one followed the same pattern.</p>
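<p>The fallback chain in the diagram, and the "one handler plus one table entry" pattern, can be sketched as follows. The <code class="language-plaintext highlighter-rouge">Node</code> stand-in, the node type names, and the string-based IR are hypothetical simplifications; the real frontends work on tree-sitter nodes and emit structured instructions.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a tree-sitter node (hypothetical, for illustration)."""
    type: str
    children: list = field(default_factory=list)


class SketchFrontend:
    NOISE_TYPES = frozenset({"comment"})

    def __init__(self):
        self.ir = []
        # Supporting a new node type means one new entry here plus one method.
        self.STMT_DISPATCH = {"return_statement": self._lower_return}
        self.EXPR_DISPATCH = {"integer": self._lower_integer}

    def lower(self, root):
        for child in root.children:
            self._lower_stmt(child)
        return self.ir

    def _lower_stmt(self, node):
        if node.type in self.NOISE_TYPES:
            return  # skip noise/comments
        handler = self.STMT_DISPATCH.get(node.type)
        if handler is not None:
            handler(node)
        else:
            self._lower_expr(node)  # fallback: try the expression table

    def _lower_expr(self, node):
        handler = self.EXPR_DISPATCH.get(node.type)
        if handler is not None:
            handler(node)
        else:
            # Final fallback: record the gap instead of crashing.
            self.ir.append(f"SYMBOLIC unsupported:{node.type}")

    def _lower_return(self, node):
        for child in node.children:
            self._lower_expr(child)
        self.ir.append("RETURN")

    def _lower_integer(self, node):
        self.ir.append("CONST")
```

<p>An unsupported node surfaces as a <code class="language-plaintext highlighter-rouge">SYMBOLIC</code> marker in the output, which is what lets an audit count missing node types mechanically.</p>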

<hr />

<h2 id="lowering-equivalence">Lowering Equivalence</h2>

<p>If 15 frontends lower the same algorithm from 15 different languages, does the resulting IR look the same? It should. The whole point of a universal IR is that downstream analysis (VM execution, dataflow, CFG) doesn’t depend on the source language. If two frontends produce structurally different IR for the same logic, it means one of them has a bug or an unnecessary inefficiency.</p>

<p>The lowering equivalence tests verify this directly. For each algorithm, the test lowers the source through all 15 deterministic frontends, extracts the function body from the IR (scanning for the <code class="language-plaintext highlighter-rouge">LABEL func_&lt;name&gt;_N</code> … <code class="language-plaintext highlighter-rouge">LABEL end_&lt;name&gt;_N</code> markers), strips <code class="language-plaintext highlighter-rouge">LABEL</code> instructions (which vary in naming across languages), and compares the resulting opcode sequences.</p>
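<p>The extraction step described above could look roughly like this. The representation of IR as <code class="language-plaintext highlighter-rouge">(opcode, operand)</code> tuples and the function name <code class="language-plaintext highlighter-rouge">extract_body_opcodes</code> are assumptions for the sketch; only the label naming scheme comes from the text.</p>

```python
def extract_body_opcodes(ir, func_name):
    """Return the opcodes between the func/end LABELs, with LABELs stripped.

    `ir` is a list of (opcode, operand) pairs; this tuple shape is an
    assumption made for the sketch.
    """
    start_prefix = f"func_{func_name}_"
    end_prefix = f"end_{func_name}_"
    inside = False
    opcodes = []
    for opcode, operand in ir:
        if opcode == "LABEL":
            if operand.startswith(start_prefix):
                inside = True
            elif inside and operand.startswith(end_prefix):
                break
            continue  # label names differ per language, so never compare them
        if inside:
            opcodes.append(opcode)
    return opcodes
```

<p>Two frontends are then considered equivalent for a given function when the lists returned for each of them compare equal.</p>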

<h3 id="iterative-factorial-full-equivalence">Iterative Factorial: Full Equivalence</h3>

<p>The iterative factorial is implemented in all 15 languages using the same structure: initialise <code class="language-plaintext highlighter-rouge">result = 1</code> and <code class="language-plaintext highlighter-rouge">i = 2</code>, loop while <code class="language-plaintext highlighter-rouge">i &lt;= n</code>, multiply <code class="language-plaintext highlighter-rouge">result *= i</code>, increment <code class="language-plaintext highlighter-rouge">i</code>, return <code class="language-plaintext highlighter-rouge">result</code>. Despite the syntactic differences (Python’s <code class="language-plaintext highlighter-rouge">while</code>, Go’s <code class="language-plaintext highlighter-rouge">for</code>, Rust’s <code class="language-plaintext highlighter-rouge">loop</code> with <code class="language-plaintext highlighter-rouge">break</code>, Pascal’s <code class="language-plaintext highlighter-rouge">while...do</code>), all 15 frontends produce the identical opcode sequence:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SYMBOLIC, DECL_VAR, CONST, DECL_VAR, CONST, DECL_VAR,
LOAD_VAR, LOAD_VAR, BINOP, BRANCH_IF, LOAD_VAR, LOAD_VAR,
BINOP, STORE_VAR, LOAD_VAR, CONST, BINOP, STORE_VAR, BRANCH,
LOAD_VAR, RETURN, CONST, RETURN
</code></pre></div></div>

<p>This is a 23-opcode sequence: parameter binding, two initialisations (<code class="language-plaintext highlighter-rouge">result</code> and <code class="language-plaintext highlighter-rouge">i</code>), the loop condition (<code class="language-plaintext highlighter-rouge">LOAD_VAR i</code>, <code class="language-plaintext highlighter-rouge">LOAD_VAR n</code>, <code class="language-plaintext highlighter-rouge">BINOP &lt;=</code>, <code class="language-plaintext highlighter-rouge">BRANCH_IF</code>), the loop body (<code class="language-plaintext highlighter-rouge">BINOP *</code>, <code class="language-plaintext highlighter-rouge">STORE_VAR result</code>, <code class="language-plaintext highlighter-rouge">BINOP +</code>, <code class="language-plaintext highlighter-rouge">STORE_VAR i</code>, <code class="language-plaintext highlighter-rouge">BRANCH</code> back), and the return path. Every frontend, from C to Lua to Scala, produces exactly this sequence.</p>

<h3 id="recursive-factorial-partial-equivalence">Recursive Factorial: Partial Equivalence</h3>

<p>The recursive variant (<code class="language-plaintext highlighter-rouge">if n &lt;= 1: return 1; return n * factorial(n - 1)</code>) achieves equivalence across 11 of 15 frontends. Four languages (Kotlin, Pascal, Rust, Scala) emit minor redundant instructions: an extra <code class="language-plaintext highlighter-rouge">STORE_VAR</code>/<code class="language-plaintext highlighter-rouge">LOAD_VAR</code> pair or an unreachable <code class="language-plaintext highlighter-rouge">BRANCH</code>. These are semantically correct (the VM produces the right answer) but structurally non-identical. The test is marked <code class="language-plaintext highlighter-rouge">xfail</code> with <code class="language-plaintext highlighter-rouge">strict=True</code>, so it will fail loudly when the frontends are fixed, signalling that the xfail should be removed.</p>

<p>The structural differences are instructive:</p>

<ul>
  <li><strong>Kotlin</strong> emits an extra <code class="language-plaintext highlighter-rouge">LOAD_VAR</code> before the return, because Kotlin’s tree-sitter grammar wraps the return value in a <code class="language-plaintext highlighter-rouge">parenthesized_expression</code> that the frontend lowers as a separate load.</li>
  <li><strong>Rust</strong> emits an extra <code class="language-plaintext highlighter-rouge">BRANCH</code> after the implicit return, because Rust’s expression-position blocks produce a trailing unconditional jump to the function end label.</li>
  <li><strong>Pascal</strong> and <strong>Scala</strong> have similar minor redundancies from language-specific AST structures.</li>
</ul>

<p>These are all candidates for frontend-level peephole optimisations: removing dead stores, eliminating redundant loads, pruning unreachable branches. The equivalence test makes the gap visible and quantifiable.</p>
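<p>As one illustration of such a pass, here is a sketch that prunes instructions which can never execute because they follow an unconditional control transfer. It is not taken from RedDragon; the tuple-based IR and the choice of <code class="language-plaintext highlighter-rouge">RETURN</code>, <code class="language-plaintext highlighter-rouge">BRANCH</code>, and <code class="language-plaintext highlighter-rouge">THROW</code> as terminators are assumptions, and every label is conservatively treated as a possible jump target.</p>

```python
# Opcodes after which control never falls through (an assumption of this sketch).
TERMINATORS = frozenset({"RETURN", "BRANCH", "THROW"})


def prune_unreachable(ir):
    """Drop instructions between an unconditional transfer and the next LABEL."""
    kept = []
    reachable = True
    for opcode, operand in ir:
        if opcode == "LABEL":
            reachable = True  # a label may be a jump target, so resume keeping
        if reachable:
            kept.append((opcode, operand))
        if opcode in TERMINATORS:
            reachable = False
    return kept
```

<p>Applied to the Rust case above, the trailing <code class="language-plaintext highlighter-rouge">BRANCH</code> that follows the implicit return is removed, while everything from the next label onwards is kept.</p>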

<h3 id="what-equivalence-tests-catch">What Equivalence Tests Catch</h3>

<p>The equivalence tests complement the Exercism execution tests. Exercism verifies that all 15 frontends produce the <em>correct answer</em>. Equivalence tests verify that they produce the <em>same IR structure</em>. A frontend could produce the correct answer through a longer, less efficient IR path (extra stores, redundant loads, unnecessary branches). The execution test would pass; the equivalence test would fail.</p>

<p>This distinction matters for analysis quality. Redundant instructions can introduce spurious dependencies in the dataflow graph, create unnecessary basic blocks in the CFG, or slow down the VM. <strong>Taken together with the execution tests, structural equivalence is a stricter bar than semantic correctness alone.</strong></p>

<hr />

<h2 id="the-deterministic-vm">The Deterministic VM</h2>

<p><em>See also: <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/notes-on-vm-design.md">VM Design</a></em></p>

<p>With the default symbolic resolver, the VM is <strong>fully deterministic</strong>. Unknown values are <em>created</em> as symbolic placeholders that propagate through computation, rather than being resolved via LLM calls, so execution is reproducible across runs.</p>

<h3 id="vm-state-frames-heap-and-closures">VM State: Frames, Heap, and Closures</h3>

<p>The VM’s state is held in a single <code class="language-plaintext highlighter-rouge">VMState</code> dataclass:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">VMState</span><span class="p">:</span>
    <span class="n">heap</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">HeapObject</span><span class="p">]</span>          <span class="c1"># flat object store
</span>    <span class="n">call_stack</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">StackFrame</span><span class="p">]</span>         <span class="c1"># LIFO execution frames
</span>    <span class="n">path_conditions</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>           <span class="c1"># branch assumptions
</span>    <span class="n">symbolic_counter</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>            <span class="c1"># fresh-name generator
</span>    <span class="n">closures</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">ClosureEnvironment</span><span class="p">]</span>  <span class="c1"># shared mutable cells
</span></code></pre></div></div>

<p>Each <code class="language-plaintext highlighter-rouge">StackFrame</code> has a <strong>two-level namespace</strong>: <code class="language-plaintext highlighter-rouge">registers</code> for IR temporaries (<code class="language-plaintext highlighter-rouge">%0</code>, <code class="language-plaintext highlighter-rouge">%1</code>, …) and <code class="language-plaintext highlighter-rouge">local_vars</code> for source-level named variables. This separation keeps the three-address code machinery invisible to the analysis layer, which only cares about named variables.</p>

<p>The heap is a flat dictionary mapping addresses (<code class="language-plaintext highlighter-rouge">"obj_0"</code>, <code class="language-plaintext highlighter-rouge">"arr_1"</code>) to <code class="language-plaintext highlighter-rouge">HeapObject</code> instances. Each <code class="language-plaintext highlighter-rouge">HeapObject</code> stores a <code class="language-plaintext highlighter-rouge">type_hint</code> and a <code class="language-plaintext highlighter-rouge">fields</code> dictionary. Arrays use stringified indices as field keys. This uniform representation means the VM doesn’t distinguish between object field access and array indexing at the storage level.</p>
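<p>A small sketch of that uniform representation. The <code class="language-plaintext highlighter-rouge">type_hint</code> and <code class="language-plaintext highlighter-rouge">fields</code> attributes come from the description above; the <code class="language-plaintext highlighter-rouge">load</code> helper and the sample values are hypothetical, and raw Python values stand in for the typed values the real heap stores.</p>

```python
from dataclasses import dataclass, field


@dataclass
class HeapObject:
    type_hint: str
    fields: dict = field(default_factory=dict)


# Flat store: addresses map straight to objects, with no nesting.
heap = {
    "obj_0": HeapObject("Point", {"x": 3, "y": 4}),
    "arr_1": HeapObject("Array", {"0": 10, "1": 20}),  # stringified indices
}


def load(heap, address, key):
    """One lookup serves both field access and array indexing in this sketch."""
    return heap[address].fields[str(key)]
```

<p>Stringifying the key at the access site is what lets an integer index and a field name share the same dictionary.</p>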

<h3 id="the-execution-loop">The Execution Loop</h3>

<p>The VM executes one instruction per step in a bounded loop. Each step fetches, dispatches, coerces, applies, and resolves control flow:</p>

<pre><code class="language-mermaid">flowchart TD
    START(("🔄 Step N")):::start

    FETCH("Fetch instruction&lt;br/&gt;&lt;i&gt;block.instructions[ip]&lt;/i&gt;"):::step
    LABEL{"LABEL?"}:::decide
    DISPATCH("🔧 LocalExecutor.execute()&lt;br/&gt;&lt;i&gt;dispatch table → handler&lt;/i&gt;"):::step
    HANDLED{"handled?"}:::decide
    LLM("🤖 LLM Backend&lt;br/&gt;&lt;i&gt;interpret_instruction()&lt;/i&gt;"):::llm
    COERCE_L("coerce_local_update()"):::coerce
    COERCE_M("materialize_raw_update()&lt;br/&gt;&lt;i&gt;deserialize + coerce + wrap&lt;/i&gt;"):::coerce

    CALL{"call_push&lt;br/&gt;+ next_label?"}:::decide
    CALL_SETUP("⬇️ Push frame&lt;br/&gt;&lt;i&gt;save return_label, return_ip,&lt;br/&gt;result_reg&lt;/i&gt;"):::step
    APPLY("apply_update()&lt;br/&gt;&lt;i&gt;registers, vars, heap,&lt;br/&gt;closures, objects&lt;/i&gt;"):::apply

    RET{"RETURN&lt;br/&gt;or THROW?"}:::decide
    RETFLOW("⬆️ Pop frame&lt;br/&gt;&lt;i&gt;write return value to&lt;br/&gt;caller's result_reg&lt;/i&gt;"):::step
    STOP(("🛑 End")):::stop
    JUMP{"next_label&lt;br/&gt;set?"}:::decide
    NEXT_BLOCK("Jump to block&lt;br/&gt;&lt;i&gt;ip = 0&lt;/i&gt;"):::step
    NEXT_IP("ip += 1"):::step

    START --&gt; FETCH
    FETCH --&gt; LABEL
    LABEL -- "yes" --&gt; NEXT_IP
    LABEL -- "no" --&gt; DISPATCH
    DISPATCH --&gt; HANDLED
    HANDLED -- "yes" --&gt; COERCE_L
    HANDLED -- "no" --&gt; LLM
    LLM --&gt; COERCE_M
    COERCE_L --&gt; CALL
    COERCE_M --&gt; CALL
    CALL -- "yes" --&gt; CALL_SETUP
    CALL_SETUP --&gt; APPLY
    CALL -- "no" --&gt; APPLY
    APPLY --&gt; RET
    RET -- "yes" --&gt; RETFLOW
    RETFLOW -- "top-level" --&gt; STOP
    RETFLOW -- "has caller" --&gt; NEXT_BLOCK
    RET -- "no" --&gt; JUMP
    JUMP -- "yes" --&gt; NEXT_BLOCK
    JUMP -- "no" --&gt; NEXT_IP
    NEXT_IP --&gt; START
    NEXT_BLOCK --&gt; START

    classDef start fill:#e8f4fd,stroke:#4a90d9,stroke-width:3px,color:#1a3a5c
    classDef stop fill:#fce4ec,stroke:#c62828,stroke-width:3px,color:#5c0a0a
    classDef step fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef decide fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef llm fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef coerce fill:#fff9c4,stroke:#f9a825,stroke-width:2px,color:#5c3a0a
    classDef apply fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>The critical design point: <strong>every path converges on <code class="language-plaintext highlighter-rouge">apply_update()</code></strong>. Whether the instruction was executed locally or by the LLM backend, the state change flows through the same single mutator. The LLM path adds a materialization step (deserialize JSON, coerce types, wrap in <code class="language-plaintext highlighter-rouge">TypedValue</code>), but the state application is identical.</p>

<p>The call dispatch path saves the return address (<code class="language-plaintext highlighter-rouge">return_label</code>, <code class="language-plaintext highlighter-rouge">return_ip</code>) and result register on the new frame <em>before</em> <code class="language-plaintext highlighter-rouge">apply_update</code> runs, so parameter bindings land in the callee’s frame automatically. On return, the saved metadata tells the loop where to resume in the caller.</p>

<h3 id="opcode-dispatch">Opcode Dispatch</h3>

<p>The <code class="language-plaintext highlighter-rouge">LocalExecutor</code> maps each of the 33 <code class="language-plaintext highlighter-rouge">Opcode</code> enum values to a handler function via a static dispatch table:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DISPATCH</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="n">Opcode</span><span class="p">,</span> <span class="n">Any</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">Opcode</span><span class="p">.</span><span class="n">CONST</span><span class="p">:</span> <span class="n">_handle_const</span><span class="p">,</span>
    <span class="n">Opcode</span><span class="p">.</span><span class="n">BINOP</span><span class="p">:</span> <span class="n">_handle_binop</span><span class="p">,</span>
    <span class="n">Opcode</span><span class="p">.</span><span class="n">CALL_FUNCTION</span><span class="p">:</span> <span class="n">_handle_call_function</span><span class="p">,</span>
    <span class="n">Opcode</span><span class="p">.</span><span class="n">LOAD_FIELD</span><span class="p">:</span> <span class="n">_handle_load_field</span><span class="p">,</span>
    <span class="c1"># ... all 33 opcodes
</span><span class="p">}</span>
</code></pre></div></div>

<p>Every handler receives the instruction, the VM state, the CFG, and a function registry, and returns an <code class="language-plaintext highlighter-rouge">ExecutionResult</code>. <strong>No handler mutates the VM directly.</strong> Instead, each constructs a <code class="language-plaintext highlighter-rouge">StateUpdate</code> describing the desired mutations.</p>

<h3 id="stateupdate-the-communication-contract">StateUpdate: The Communication Contract</h3>

<p><code class="language-plaintext highlighter-rouge">StateUpdate</code> is the universal contract between handlers and the state engine. It’s a pure data object listing all effects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">StateUpdate</span><span class="p">:</span>
    <span class="n">register_writes</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">TypedValue</span><span class="p">]</span>  <span class="c1"># %0 = TypedValue(42, Int)
</span>    <span class="n">var_writes</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">TypedValue</span><span class="p">]</span>       <span class="c1"># x = TypedValue("hello", String)
</span>    <span class="n">heap_writes</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">HeapWrite</span><span class="p">]</span>            <span class="c1"># obj.field = TypedValue(...)
</span>    <span class="n">new_objects</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">NewObject</span><span class="p">]</span>            <span class="c1"># allocate on heap
</span>    <span class="n">call_push</span><span class="p">:</span> <span class="n">StackFramePush</span> <span class="o">|</span> <span class="bp">None</span>        <span class="c1"># push new frame
</span>    <span class="n">call_pop</span><span class="p">:</span> <span class="nb">bool</span>                          <span class="c1"># pop frame on return
</span>    <span class="n">path_condition</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span>              <span class="c1"># branch assumption
</span>    <span class="n">next_label</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span>                  <span class="c1"># jump target
</span></code></pre></div></div>

<p>All values flowing through <code class="language-plaintext highlighter-rouge">StateUpdate</code> are <code class="language-plaintext highlighter-rouge">TypedValue</code> objects — frozen dataclasses pairing a raw Python value with its <code class="language-plaintext highlighter-rouge">TypeExpr</code> type. This means type information is carried alongside every value throughout the execution pipeline, from handler output through state mutation to subsequent reads.</p>

<p>This separation of <em>computation</em> (handlers) from <em>mutation</em> (<code class="language-plaintext highlighter-rouge">apply_update</code>) is a deliberate <strong>functional core / imperative shell</strong> split. The handlers are pure functions that return data. The mutation is centralised in one place.</p>
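<p>The split can be illustrated with a deliberately reduced version of the contract. The two-field <code class="language-plaintext highlighter-rouge">StateUpdate</code>, the <code class="language-plaintext highlighter-rouge">Frame</code> class, and the handler signatures below are simplifications assumed for the sketch; the real handlers take the instruction, VM state, CFG, and registry, and carry <code class="language-plaintext highlighter-rouge">TypedValue</code> objects rather than raw values.</p>

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StateUpdate:
    """Reduced to two fields for the sketch."""
    register_writes: dict = field(default_factory=dict)
    var_writes: dict = field(default_factory=dict)


@dataclass
class Frame:
    registers: dict = field(default_factory=dict)
    local_vars: dict = field(default_factory=dict)


def handle_const(result_reg, literal):
    """Functional core: describes the write, touches no state."""
    return StateUpdate(register_writes={result_reg: literal})


def handle_store_var(frame, name, source_reg):
    """Functional core: reads the frame, returns a description of the write."""
    return StateUpdate(var_writes={name: frame.registers[source_reg]})


def apply_update(frame, update):
    """Imperative shell: the only function here that mutates the frame."""
    frame.registers.update(update.register_writes)
    frame.local_vars.update(update.var_writes)
```

<p>Since handlers only return data, they can be unit-tested by comparing the returned <code class="language-plaintext highlighter-rouge">StateUpdate</code> against an expected one, without constructing or inspecting a full VM.</p>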

<h3 id="typedvalue-runtime-type-propagation">TypedValue: Runtime Type Propagation</h3>

<p>Every runtime value in the VM is a <code class="language-plaintext highlighter-rouge">TypedValue</code> — a frozen dataclass pairing a raw value with its inferred type:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">TypedValue</span><span class="p">:</span>
    <span class="n">value</span><span class="p">:</span> <span class="n">Any</span>       <span class="c1"># int, str, SymbolicValue, Pointer, ...
</span>    <span class="nb">type</span><span class="p">:</span> <span class="n">TypeExpr</span>   <span class="c1"># ScalarType("Int"), ParameterizedType("Array", ...), ...
</span></code></pre></div></div>

<p>Registers, local variables, heap fields, and closure bindings all store <code class="language-plaintext highlighter-rouge">TypedValue</code>. This bridges the static type inference pass (which runs before execution) with runtime coercion: when a <code class="language-plaintext highlighter-rouge">BINOP</code> operates on two <code class="language-plaintext highlighter-rouge">TypedValue</code> operands, it can inspect their types to apply language-specific coercion rules (e.g., Java’s <code class="language-plaintext highlighter-rouge">String + int</code> auto-stringification) without consulting the <code class="language-plaintext highlighter-rouge">TypeEnvironment</code> at every step.</p>

<p>Two helper functions and one constant simplify construction: <code class="language-plaintext highlighter-rouge">typed(value, type_expr)</code> for explicit typing, <code class="language-plaintext highlighter-rouge">typed_from_runtime(value)</code> for inferring type from a Python value, and <code class="language-plaintext highlighter-rouge">VOID_RETURN</code> (a canonical <code class="language-plaintext highlighter-rouge">TypedValue(None, Void)</code>) for procedures that return nothing.</p>

<h3 id="two-layer-type-coercion">Two-Layer Type Coercion</h3>

<p>Type coercion operates at two points:</p>

<p><strong>Layer 1 (Pre-operation):</strong> Pluggable <code class="language-plaintext highlighter-rouge">BinopCoercionStrategy</code> and <code class="language-plaintext highlighter-rouge">UnopCoercionStrategy</code> coerce operands <em>before</em> evaluation. The default strategies are no-ops, but language-specific strategies override them — <code class="language-plaintext highlighter-rouge">JavaBinopCoercion</code> auto-stringifies when one operand of <code class="language-plaintext highlighter-rouge">+</code> is a <code class="language-plaintext highlighter-rouge">String</code>. Each strategy also infers the result type.</p>

<p><strong>Layer 2 (Write-time):</strong> <code class="language-plaintext highlighter-rouge">TypeConversionRules</code> coerce values when storing into typed registers (Int → Float widening, Float → Int narrowing, Bool → Int promotion). This layer operates inside <code class="language-plaintext highlighter-rouge">apply_update()</code>.</p>
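<p>A rough sketch of the two layers, using plain Python values and string type names in place of <code class="language-plaintext highlighter-rouge">TypedValue</code> and <code class="language-plaintext highlighter-rouge">TypeExpr</code>. The function names are hypothetical, and the narrowing rule (truncation toward zero) is an assumption rather than documented behaviour.</p>

```python
def java_binop_coerce(op, lhs, rhs):
    """Layer 1: before evaluating +, stringify both sides if either is a str."""
    if op == "+" and (isinstance(lhs, str) or isinstance(rhs, str)):
        return str(lhs), str(rhs)
    return lhs, rhs  # default behaviour: no coercion


def convert_on_write(value, declared_type):
    """Layer 2: coerce a value to the declared type of its destination."""
    is_bool = isinstance(value, bool)
    if declared_type == "Float" and isinstance(value, int) and not is_bool:
        return float(value)   # Int -> Float widening
    if declared_type == "Int" and isinstance(value, float):
        return int(value)     # Float -> Int narrowing (truncates toward zero)
    if declared_type == "Int" and is_bool:
        return int(value)     # Bool -> Int promotion
    return value
```

<p>Keeping the layers separate means operator semantics vary per language while storage rules stay in one place.</p>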

<h3 id="builtinresult">BuiltinResult</h3>

<p>Built-in functions return a <code class="language-plaintext highlighter-rouge">BuiltinResult</code> — a uniform type distinguishing pure builtins from those with side effects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">BuiltinResult</span><span class="p">:</span>
    <span class="n">value</span><span class="p">:</span> <span class="n">Any</span>
    <span class="n">new_objects</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">NewObject</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
    <span class="n">heap_writes</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">HeapWrite</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
</code></pre></div></div>

<p>Pure builtins (<code class="language-plaintext highlighter-rouge">len</code>, <code class="language-plaintext highlighter-rouge">abs</code>, <code class="language-plaintext highlighter-rouge">max</code>) return <code class="language-plaintext highlighter-rouge">BuiltinResult(value=result)</code> with empty side-effect lists. Heap-mutating builtins (<code class="language-plaintext highlighter-rouge">list.append</code>, <code class="language-plaintext highlighter-rouge">array_of</code>) express mutations as structured <code class="language-plaintext highlighter-rouge">new_objects</code> and <code class="language-plaintext highlighter-rouge">heap_writes</code> that the executor applies through the same <code class="language-plaintext highlighter-rouge">apply_update()</code> path. This eliminated the previous inconsistency where some builtins returned bare values and others returned tuples with side effects.</p>

<h3 id="apply_update-the-single-mutator"><code class="language-plaintext highlighter-rouge">apply_update()</code>: The Single Mutator</h3>

<p>All state changes flow through <code class="language-plaintext highlighter-rouge">apply_update()</code>, which applies a <code class="language-plaintext highlighter-rouge">StateUpdate</code> in strict order: new objects, register writes, heap writes, path conditions, call push, variable writes, call pop. The ordering matters: call push (step 5) happens <em>before</em> variable writes (step 6), so parameter bindings automatically land in the new frame without special-casing. Variable writes also handle closure synchronisation: if a variable is in the frame’s <code class="language-plaintext highlighter-rouge">captured_var_names</code>, the write is mirrored to the shared <code class="language-plaintext highlighter-rouge">ClosureEnvironment</code>.</p>
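<p>The effect of that ordering can be seen in a stripped-down version. Steps 1 to 4 are omitted, <code class="language-plaintext highlighter-rouge">call_push</code> is reduced to a callee name instead of a <code class="language-plaintext highlighter-rouge">StackFramePush</code>, and closure mirroring is left out; all of these are simplifications assumed for the sketch.</p>

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Frame:
    function_name: str
    local_vars: dict = field(default_factory=dict)


@dataclass
class StateUpdate:
    var_writes: dict = field(default_factory=dict)
    call_push: Optional[str] = None  # callee name; a simplification
    call_pop: bool = False


def apply_update(call_stack, update):
    # Steps 1-4 (new objects, register writes, heap writes, path conditions)
    # are omitted from this sketch.
    if update.call_push is not None:                     # step 5
        call_stack.append(Frame(update.call_push))
    call_stack[-1].local_vars.update(update.var_writes)  # step 6
    if update.call_pop:                                  # step 7
        call_stack.pop()
```

<p>A single update carrying both a push and the parameter bindings leaves the bindings in the callee's frame and the caller's variables untouched, with no special case in the handler.</p>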

<h3 id="symbolic-value-propagation">Symbolic Value Propagation</h3>

<p>When execution hits an unresolved import or function, the VM creates a <code class="language-plaintext highlighter-rouge">SymbolicValue</code> with a descriptive hint:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sym_0 (hint: "math.sqrt(16)")
</code></pre></div></div>

<p>This symbolic value propagates through computation deterministically. Each handler checks whether its operands are symbolic. If either operand of a <code class="language-plaintext highlighter-rouge">BINOP</code> is symbolic, the result is a fresh symbolic with a constraint recording the expression:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sym_0 + 1  →  sym_1 (constraint: "sym_0 + 1")
</code></pre></div></div>

<p>Field access on a symbolic object creates a symbolic field with <strong>lazy heap materialisation</strong>: the first access to <code class="language-plaintext highlighter-rouge">sym_0.x</code> allocates a synthetic heap entry and caches a symbolic value for <code class="language-plaintext highlighter-rouge">x</code>, so subsequent accesses to the same field return the same symbolic. This deduplication is important for dataflow analysis, where repeated reads of the same field should trace back to the same definition.</p>
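<p>A sketch of both behaviours: constraint-recording propagation and deduplicated field access. The class and method names are hypothetical, and a plain dictionary cache stands in for the synthetic heap entry that the real VM allocates.</p>

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class SymbolicValue:
    name: str
    constraint: str = ""


class SymbolicSketch:
    def __init__(self):
        self._ids = count()      # plays the role of symbolic_counter
        self._field_cache = {}   # (object name, field) -> SymbolicValue

    def fresh(self, constraint=""):
        return SymbolicValue(f"sym_{next(self._ids)}", constraint)

    def binop(self, op, lhs, rhs):
        if isinstance(lhs, SymbolicValue) or isinstance(rhs, SymbolicValue):
            # Any symbolic operand makes the result symbolic.
            return self.fresh(f"{self._show(lhs)} {op} {self._show(rhs)}")
        if op == "+":
            return lhs + rhs
        raise NotImplementedError(op)

    def load_field(self, obj, field_name):
        key = (obj.name, field_name)
        if key not in self._field_cache:
            # First access: materialise lazily, then reuse on later reads.
            self._field_cache[key] = self.fresh(f"{obj.name}.{field_name}")
        return self._field_cache[key]

    @staticmethod
    def _show(value):
        return value.name if isinstance(value, SymbolicValue) else repr(value)
```

<p>Because the counter is the only source of fresh names, the same program yields the same symbolic names on every run.</p>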

<p>Concrete operations that fail (division by zero, unsupported operator) produce an <code class="language-plaintext highlighter-rouge">UNCOMPUTABLE</code> sentinel, which triggers symbolic fallback rather than crashing.</p>

<p>The trade-off is that symbolic branches always take the true path (a simplification), and symbolic values can’t be resolved to concrete results without help.</p>

<h3 id="the-unresolvedcallresolver">The UnresolvedCallResolver</h3>

<p>For the latter trade-off, a configurable <code class="language-plaintext highlighter-rouge">UnresolvedCallResolver</code> uses the Strategy pattern. Two strategies ship with RedDragon: a default symbolic resolver (zero LLM calls, fully deterministic) and an opt-in LLM-based resolver that produces plausible concrete values. The next section covers this mechanism in detail.</p>

<h3 id="closures">Closures</h3>

<p>One subtle design iteration worth mentioning: closure capture semantics. The initial implementation captured variables by snapshot (copy at definition time). This broke counter factories:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">make_counter</span><span class="p">():</span>
    <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">def</span> <span class="nf">inc</span><span class="p">():</span>
        <span class="k">nonlocal</span> <span class="n">count</span>
        <span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">return</span> <span class="n">count</span>
    <span class="k">return</span> <span class="n">inc</span>
</code></pre></div></div>

<p>With snapshot capture, <code class="language-plaintext highlighter-rouge">inc()</code> always reads <code class="language-plaintext highlighter-rouge">count = 0</code>. The fix was shared <code class="language-plaintext highlighter-rouge">ClosureEnvironment</code> cells: all closures from the same scope share a mutable environment, matching Python/JavaScript semantics. When a nested function is created, the enclosing frame’s variables are copied into a <code class="language-plaintext highlighter-rouge">ClosureEnvironment</code>. On each call, captured variables are injected into the new frame, and <code class="language-plaintext highlighter-rouge">apply_update()</code> mirrors writes back to the shared environment. This is the kind of deep correctness issue that only surfaces through specific test cases. It’s documented as ADR-019 in the project’s <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/architectural-design-decisions.md">architectural decision records</a>.</p>
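<p>The difference between the two capture strategies can be simulated directly. The <code class="language-plaintext highlighter-rouge">ClosureEnvironment</code> name is from the text; its body and the two <code class="language-plaintext highlighter-rouge">call_inc</code> functions are hypothetical stand-ins for what the VM does when it executes <code class="language-plaintext highlighter-rouge">inc()</code>.</p>

```python
class ClosureEnvironment:
    """Shared mutable cells (name from the post; body is a sketch)."""

    def __init__(self, bindings):
        self.bindings = dict(bindings)


def call_inc(env, captured_var_names=("count",)):
    # On call: inject captured variables into a fresh frame.
    frame_vars = {name: env.bindings[name] for name in captured_var_names}
    # Body of inc(): count += 1
    frame_vars["count"] += 1
    # Mirror writes to captured names back to the shared environment.
    for name in captured_var_names:
        env.bindings[name] = frame_vars[name]
    return frame_vars["count"]


def call_inc_snapshot(snapshot):
    # Snapshot capture: the copy taken at definition time is never updated.
    frame_vars = dict(snapshot)
    frame_vars["count"] += 1
    return frame_vars["count"]
```

<p>Two successive calls return 1 then 2 with the shared environment, but 1 and 1 with the snapshot, which is the counter-factory failure described above.</p>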

<h3 id="class-hierarchy-and-inherited-method-dispatch">Class Hierarchy and Inherited Method Dispatch</h3>

<p>OOP languages encode class hierarchies differently: Java uses <code class="language-plaintext highlighter-rouge">extends</code>, Python lists bases in the class signature, Ruby uses <code class="language-plaintext highlighter-rouge">&lt;</code>, C++ has access-specified base lists, and so on. RedDragon handles all of these through a single mechanism in the <code class="language-plaintext highlighter-rouge">FunctionRegistry</code>, without adding any new IR opcodes.</p>

<p>Each frontend extracts parent class names from its language-specific tree-sitter nodes and encodes them in the class reference string: <code class="language-plaintext highlighter-rouge">&lt;class:Dog@class_Dog_0:Animal&gt;</code> records that <code class="language-plaintext highlighter-rouge">Dog</code> extends <code class="language-plaintext highlighter-rouge">Animal</code>. The registry’s <code class="language-plaintext highlighter-rouge">_scan_classes</code> method collects these direct parents, and <code class="language-plaintext highlighter-rouge">_expand_parent_chains</code> transitively expands them into a linearized list. For a chain <code class="language-plaintext highlighter-rouge">C extends B extends A</code>, <code class="language-plaintext highlighter-rouge">class_parents["C"]</code> becomes <code class="language-plaintext highlighter-rouge">["B", "A"]</code>.</p>
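<p>One way the transitive expansion could work is a depth-first walk, sketched below. This is an assumption about the approach, not the actual body of <code class="language-plaintext highlighter-rouge">_expand_parent_chains</code>, and it is a simple depth-first order rather than a C3-style linearisation.</p>

```python
def expand_parent_chains(direct_parents):
    """Turn {class: [direct parents]} into {class: [ancestors, nearest first]}."""

    def ancestors(cls, seen):
        chain = []
        for parent in direct_parents.get(cls, []):
            if parent in seen:
                continue  # guard against cycles and repeated ancestors
            seen.add(parent)
            chain.append(parent)
            chain.extend(ancestors(parent, seen))
        return chain

    return {cls: ancestors(cls, {cls}) for cls in direct_parents}
```

<p>Expanding once, up front, keeps method lookup at execution time to a flat list walk instead of a recursive search.</p>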

<p>At execution time, when <code class="language-plaintext highlighter-rouge">CALL_METHOD</code> resolves a method on an object, the executor first looks in the child class’s method table. On a miss, it walks <code class="language-plaintext highlighter-rouge">class_parents</code> until it finds a matching method:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Walk parent chain for inherited methods
</span><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">registry</span><span class="p">.</span><span class="n">class_parents</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">type_hint</span><span class="p">,</span> <span class="p">[]):</span>
    <span class="n">parent_methods</span> <span class="o">=</span> <span class="n">registry</span><span class="p">.</span><span class="n">class_methods</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">parent</span><span class="p">,</span> <span class="p">{})</span>
    <span class="n">candidate</span> <span class="o">=</span> <span class="n">parent_methods</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">method_name</span><span class="p">,</span> <span class="sh">""</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">candidate</span> <span class="ow">and</span> <span class="n">candidate</span> <span class="ow">in</span> <span class="n">cfg</span><span class="p">.</span><span class="n">blocks</span><span class="p">:</span>
        <span class="n">func_label</span> <span class="o">=</span> <span class="n">candidate</span>
        <span class="k">break</span>
</code></pre></div></div>

<p>Method override works naturally: the child’s method table is checked first, so a redefined method in the child shadows the parent’s version. Multi-level inheritance (C → B → A) resolves at any depth.</p>

<p>Ten OOP frontends extract parent classes: Java, Python, C#, Kotlin, Ruby, JavaScript, TypeScript, Scala, PHP, and C++. Each uses a shared <code class="language-plaintext highlighter-rouge">make_class_ref</code> helper to encode parents into the class reference string, keeping the language-specific code minimal. The five non-OOP frontends (C, Go, Rust, Lua, Pascal) are unaffected.</p>

<h3 id="overload-resolution">Overload Resolution</h3>

<p>When a class defines multiple methods with the same name but different parameter types (common in Java, C#, Kotlin, Scala, C++), the VM needs to select the right overload at call time. The <code class="language-plaintext highlighter-rouge">OverloadResolver</code> uses a composable strategy pattern:</p>

<p><strong><code class="language-plaintext highlighter-rouge">ArityThenTypeStrategy</code></strong> ranks candidates in two passes:</p>

<ol>
  <li><strong>Arity distance</strong>: the absolute difference between the candidate’s parameter count and the call site’s argument count (smaller is better)</li>
  <li><strong>Type compatibility score</strong>: exact match (2 points), coercion or subtype match (1 point), mismatch (-1 point)</li>
</ol>

<p>The type scoring uses <code class="language-plaintext highlighter-rouge">TypeGraph.is_subtype_expr()</code> for inheritance-aware dispatch: a <code class="language-plaintext highlighter-rouge">Dog</code> argument scores 2 against <code class="language-plaintext highlighter-rouge">foo(Dog)</code> (exact match) but only 1 against <code class="language-plaintext highlighter-rouge">foo(Animal)</code> (subtype match), so the more specific overload wins. Primitive coercions (Int → Float, Bool → Int) also score 1: compatible, but below an exact match. Since the resolver receives <code class="language-plaintext highlighter-rouge">list[TypedValue]</code> arguments, full type metadata is available without consulting the <code class="language-plaintext highlighter-rouge">TypeEnvironment</code>.</p>
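<p>The two-pass ranking can be approximated as a sort key. This is a sketch only: the <code class="language-plaintext highlighter-rouge">SUBTYPES</code> and <code class="language-plaintext highlighter-rouge">COERCIONS</code> tables are hypothetical stand-ins for the <code class="language-plaintext highlighter-rouge">TypeGraph</code>, types are plain strings rather than <code class="language-plaintext highlighter-rouge">TypedValue</code>s, and summing per-argument scores is an assumption about how the scores are aggregated.</p>

```python
EXACT, COMPATIBLE, MISMATCH = 2, 1, -1

# Hypothetical stand-ins for the type graph and the primitive coercion table.
SUBTYPES = {("Dog", "Animal")}
COERCIONS = {("Int", "Float"), ("Bool", "Int")}


def type_score(arg_type, param_type):
    if arg_type == param_type:
        return EXACT
    if (arg_type, param_type) in SUBTYPES or (arg_type, param_type) in COERCIONS:
        return COMPATIBLE
    return MISMATCH


def pick_overload(candidates, arg_types):
    """candidates maps a label to its parameter types; returns the best label."""
    def rank(label):
        params = candidates[label]
        arity_distance = abs(len(params) - len(arg_types))
        score = sum(type_score(a, p) for a, p in zip(arg_types, params))
        # Smaller arity distance wins first, then the higher type score.
        return (arity_distance, -score)
    return min(candidates, key=rank)


overloads = {"foo_animal": ["Animal"], "foo_dog": ["Dog"], "foo_pair": ["Dog", "Dog"]}
chosen = pick_overload(overloads, ["Dog"])
```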

<h3 id="symbol-table">Symbol Table</h3>

<p>The <code class="language-plaintext highlighter-rouge">SymbolTable</code> is a pre-execution data structure extracted from the IR by each frontend during lowering. It maps class names to their fields, methods, static fields, and parent classes — information that was previously recovered at runtime by scanning IR instructions.</p>

<p>The symbol table is populated via a <code class="language-plaintext highlighter-rouge">_extract_symbols</code> pre-pass hook on <code class="language-plaintext highlighter-rouge">BaseFrontend</code>. All 15 language frontends implement symbol extraction. The executor receives the symbol table at construction time and uses it for:</p>

<ul>
  <li><strong>Cross-class field resolution</strong>: <code class="language-plaintext highlighter-rouge">SymbolTable.resolve_field</code> walks the class hierarchy to find which class owns a field, enabling field access in subclass constructors without explicit <code class="language-plaintext highlighter-rouge">this.parent_field</code> qualification.</li>
  <li><strong>Implicit-this field access</strong>: In Java, C#, and C++, methods can reference fields without <code class="language-plaintext highlighter-rouge">this.</code> — <code class="language-plaintext highlighter-rouge">age = 10</code> inside a constructor means <code class="language-plaintext highlighter-rouge">this.age = 10</code>. The executor checks the symbol table to determine whether an unqualified name refers to a class field, and if so, rewrites the access as a <code class="language-plaintext highlighter-rouge">STORE_FIELD</code>/<code class="language-plaintext highlighter-rouge">LOAD_FIELD</code> on the implicit <code class="language-plaintext highlighter-rouge">this</code> parameter.</li>
  <li><strong>Static method dispatch</strong>: <code class="language-plaintext highlighter-rouge">Class::method()</code> calls (PHP, C++, Rust) are lowered with a <code class="language-plaintext highlighter-rouge">ClassRef</code> marker in the <code class="language-plaintext highlighter-rouge">CALL_METHOD</code> instruction. The executor uses the symbol table to verify the target is a static method and dispatches without requiring an object instance.</li>
</ul>
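<p>As an illustration of the first two uses, here is a minimal sketch of hierarchy-aware field lookup and the implicit-<code class="language-plaintext highlighter-rouge">this</code> rewrite. The <code class="language-plaintext highlighter-rouge">ClassInfo</code> shape, the free-standing <code class="language-plaintext highlighter-rouge">resolve_field</code> signature, the <code class="language-plaintext highlighter-rouge">rewrite_store</code> helper, and the <code class="language-plaintext highlighter-rouge">STORE_VAR</code> fallback are all assumptions made for the example.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ClassInfo:
    fields: set = field(default_factory=set)
    parents: list = field(default_factory=list)


def resolve_field(classes, class_name, field_name):
    """Return the name of the class that declares field_name, or None."""
    pending = [class_name]
    seen = set()
    while pending:
        current = pending.pop(0)
        if current in seen or current not in classes:
            continue
        seen.add(current)
        info = classes[current]
        if field_name in info.fields:
            return current
        pending.extend(info.parents)
    return None


classes = {
    "Animal": ClassInfo(fields={"age"}),
    "Dog": ClassInfo(fields={"breed"}, parents=["Animal"]),
}


def rewrite_store(class_name, name):
    # An unqualified name that resolves to a field becomes a store on `this`.
    owner = resolve_field(classes, class_name, name)
    return ("STORE_FIELD", "this", name) if owner else ("STORE_VAR", name)
```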

<h3 id="pointer-aliasing">Pointer Aliasing</h3>

<p>C and Rust programs use <code class="language-plaintext highlighter-rouge">&amp;x</code> to take the address of a variable. In most analysis tools, this creates an aliasing relationship that’s tracked through a separate alias analysis pass. RedDragon handles it directly in the VM through a KLEE-inspired <strong>promote-on-address-of</strong> model.</p>

<p>When the VM encounters <code class="language-plaintext highlighter-rouge">ADDRESS_OF</code>, it promotes the target variable from a primitive to a <code class="language-plaintext highlighter-rouge">HeapObject</code> and returns a typed <code class="language-plaintext highlighter-rouge">Pointer(base, offset)</code> value. Subsequent writes through the pointer (<code class="language-plaintext highlighter-rouge">*ptr = 99</code>) go through the heap and update the original variable’s storage. This means aliasing is exact: <code class="language-plaintext highlighter-rouge">*ptr</code> and <code class="language-plaintext highlighter-rouge">x</code> always see the same value, with no approximation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%0 = const 42
decl_var x %0
%1 = address_of x               # promote x to heap, return Pointer
decl_var ptr %1
%2 = const 99
store_field ptr "value" %2       # *ptr = 99; x is now 99
%3 = load_var x                  # %3 = 99 (reads through heap)
</code></pre></div></div>

<p>The model supports nested pointers (<code class="language-plaintext highlighter-rouge">int **pp</code>), pointer arithmetic (<code class="language-plaintext highlighter-rouge">ptr + 1</code> offsets into arrays), pointer subtraction (returns the offset difference between same-base pointers), pointer relational comparisons (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, <code class="language-plaintext highlighter-rouge">==</code> on offsets within the same base), struct pointers (arrow operator via <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code>), and array pointer decay. The C and Rust frontends emit <code class="language-plaintext highlighter-rouge">ADDRESS_OF</code> for <code class="language-plaintext highlighter-rouge">&amp;identifier</code> expressions.</p>
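<p>The promotion step itself is small. The following toy VM is a sketch under assumptions (the <code class="language-plaintext highlighter-rouge">TinyVM</code> class, its address format, and the list-backed heap cell are invented for illustration); it shows only why reads of <code class="language-plaintext highlighter-rouge">x</code> and writes through <code class="language-plaintext highlighter-rouge">ptr</code> cannot disagree once both go through the same heap slot.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    base: str      # heap address of the promoted object
    offset: int = 0


class TinyVM:
    """Hypothetical miniature of promote-on-address-of; not RedDragon's VM."""

    def __init__(self):
        self.vars = {}
        self.heap = {}
        self.promoted = {}  # variable name -> heap address

    def address_of(self, name):
        if name not in self.promoted:
            addr = "obj_" + str(len(self.heap))
            self.heap[addr] = [self.vars.pop(name)]  # move the value to the heap
            self.promoted[name] = addr
        return Pointer(self.promoted[name])

    def load_var(self, name):
        if name in self.promoted:
            return self.heap[self.promoted[name]][0]  # read through the heap
        return self.vars[name]

    def store_through(self, ptr, value):
        self.heap[ptr.base][ptr.offset] = value


vm = TinyVM()
vm.vars["x"] = 42
ptr = vm.address_of("x")
vm.store_through(ptr, 99)   # *ptr = 99
x_after = vm.load_var("x")
```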

<h3 id="slicing">Slicing</h3>

<p>All 15 languages support array and string slicing in execution. Each language’s slice syntax is lowered to the same <code class="language-plaintext highlighter-rouge">SLICE</code> IR operation, and the VM handles the semantics uniformly:</p>

<ul>
  <li><strong>Python:</strong> <code class="language-plaintext highlighter-rouge">a[1:3]</code>, <code class="language-plaintext highlighter-rouge">a[::2]</code>, <code class="language-plaintext highlighter-rouge">a[-2:]</code></li>
  <li><strong>Ruby:</strong> <code class="language-plaintext highlighter-rouge">arr[1..3]</code> (inclusive range), <code class="language-plaintext highlighter-rouge">arr[1...3]</code> (exclusive), <code class="language-plaintext highlighter-rouge">arr[start, length]</code> (positional)</li>
  <li><strong>Rust:</strong> <code class="language-plaintext highlighter-rouge">arr[1..3]</code> (exclusive), <code class="language-plaintext highlighter-rouge">arr[1..=3]</code> (inclusive)</li>
  <li><strong>Go:</strong> <code class="language-plaintext highlighter-rouge">a[1:3]</code>, <code class="language-plaintext highlighter-rouge">a[2:]</code>, string slicing</li>
  <li><strong>Kotlin:</strong> <code class="language-plaintext highlighter-rouge">subList()</code> and <code class="language-plaintext highlighter-rouge">substring()</code> via <code class="language-plaintext highlighter-rouge">METHOD_TABLE</code> dispatch</li>
</ul>
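<p>One way to picture the uniform handling is as a normalisation step in front of a single slice routine. The helper names below are hypothetical and the sketch ignores negative indices in inclusive ranges; it only shows how inclusive, exclusive, and start-plus-length forms can reduce to the same exclusive bounds.</p>

```python
def normalise_range(start, stop, inclusive=False):
    """Map a language-level range onto exclusive (start, stop) bounds."""
    return (start, stop + 1) if inclusive else (start, stop)


def normalise_start_length(start, length):
    """Ruby's arr[start, length] form expressed as exclusive bounds."""
    return (start, start + length)


def run_slice(seq, bounds, step=1):
    # Stand-in for the VM's SLICE handler: works on lists and strings alike.
    start, stop = bounds
    return seq[start:stop:step]


arr = [10, 20, 30, 40, 50]
ruby_inclusive = run_slice(arr, normalise_range(1, 3, inclusive=True))  # arr[1..3]
rust_exclusive = run_slice(arr, normalise_range(1, 3))                  # arr[1..3]
ruby_positional = run_slice(arr, normalise_start_length(1, 2))          # arr[1, 2]
```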

<h3 id="rest-patterns-and-variadic-parameters">Rest Patterns and Variadic Parameters</h3>

<p>JavaScript and TypeScript support rest patterns in destructuring and function parameters:</p>

<p><strong>Array destructuring:</strong> <code class="language-plaintext highlighter-rouge">const [a, ...rest] = arr</code> — the frontend emits a <code class="language-plaintext highlighter-rouge">SLICE</code> operation to extract remaining elements into the rest variable.</p>

<p><strong>Object destructuring:</strong> <code class="language-plaintext highlighter-rouge">const {a, ...rest} = obj</code> — remaining properties are spread into a new object on the heap.</p>

<p><strong>Function rest parameters:</strong> <code class="language-plaintext highlighter-rouge">function f(a, ...args)</code> — the frontend injects an <code class="language-plaintext highlighter-rouge">arguments</code> array into the function’s local scope and emits a <code class="language-plaintext highlighter-rouge">SLICE</code> to extract the variadic portion into the rest parameter.</p>
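<p>In Python terms, the array and function cases amount to the following. This is a semantic sketch rather than the emitted IR: the helper names are hypothetical, and missing fixed slots are modelled as <code class="language-plaintext highlighter-rouge">None</code> in place of JavaScript’s <code class="language-plaintext highlighter-rouge">undefined</code>.</p>

```python
def destructure_array(arr, fixed_count):
    """const [a, ...rest] = arr: fixed slots by index, the remainder by slice."""
    padded = list(arr) + [None] * fixed_count   # missing slots read as None
    return padded[:fixed_count], list(arr[fixed_count:])


def bind_rest_params(param_names, rest_name, arguments):
    """function f(a, ...args): bind named params, then slice the remainder."""
    fixed, rest = destructure_array(arguments, len(param_names))
    frame = dict(zip(param_names, fixed))
    frame[rest_name] = rest
    frame["arguments"] = list(arguments)   # the injected arguments array
    return frame


(a,), rest = destructure_array([1, 2, 3], 1)
frame = bind_rest_params(["a"], "args", [10, 20, 30])
```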

<h3 id="anonymous-class-resolution">Anonymous Class Resolution</h3>

<p>TypeScript and JavaScript allow assigning anonymous classes to variables: <code class="language-plaintext highlighter-rouge">const MyClass = class { ... }</code>. When <code class="language-plaintext highlighter-rouge">new MyClass()</code> is encountered, the VM resolves it by checking the variable store if the name isn’t in the class registry:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">class_name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">class_registry</span><span class="p">:</span>
    <span class="n">resolved</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">current_frame</span><span class="p">.</span><span class="nf">lookup</span><span class="p">(</span><span class="n">class_name</span><span class="p">)</span>
    <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">resolved</span><span class="p">,</span> <span class="nb">str</span><span class="p">):</span>
        <span class="n">class_name</span> <span class="o">=</span> <span class="n">resolved</span>
</code></pre></div></div>

<p>This reuses the existing variable store as the lookup table, so no new data structures are needed.</p>

<h3 id="language-prelude-classes">Language Prelude Classes</h3>

<p>Some languages have standard library types that are integral to idiomatic code but don’t exist as user-defined classes in the source. Rust’s <code class="language-plaintext highlighter-rouge">Box&lt;T&gt;</code> and <code class="language-plaintext highlighter-rouge">Option&lt;T&gt;</code> are examples: linked list implementations use <code class="language-plaintext highlighter-rouge">Box::new(node)</code> and <code class="language-plaintext highlighter-rouge">Some(value)</code> pervasively.</p>

<p>Rather than adding VM-level special cases, each frontend can override an <code class="language-plaintext highlighter-rouge">_emit_prelude</code> hook to emit synthetic class definitions during lowering:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">Box::new(expr)</code></strong> is a pass-through — it returns its argument directly. In RedDragon’s reference-based VM, all values are already heap-allocated, so <code class="language-plaintext highlighter-rouge">Box</code> adds no indirection. This matches Rust’s auto-deref semantics transparently.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">Option</code></strong> is emitted as a real class with <code class="language-plaintext highlighter-rouge">__init__(self, value)</code>, <code class="language-plaintext highlighter-rouge">unwrap(self)</code>, and <code class="language-plaintext highlighter-rouge">as_ref(self)</code> methods. <code class="language-plaintext highlighter-rouge">Some(expr)</code> lowers to <code class="language-plaintext highlighter-rouge">CALL_FUNCTION "Option"</code>.</li>
</ul>

<p>The prelude hook is a no-op by default. Only the Rust frontend overrides it. The registry recognises prelude class labels via a prefix constant, so prelude classes participate in method dispatch and type inference like any user-defined class.</p>
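<p>Rendered as Python rather than IR, the two prelude entries behave roughly as below. The method names come from the list above; the <code class="language-plaintext highlighter-rouge">value</code> attribute name and the choice to have <code class="language-plaintext highlighter-rouge">as_ref</code> return the same object are assumptions.</p>

```python
class Option:
    """Python rendering of the synthetic class; the real one is emitted as IR."""

    def __init__(self, value):
        self.value = value

    def unwrap(self):
        return self.value

    def as_ref(self):
        # Assumption: references and values coincide in a reference-based VM.
        return self


def box_new(expr):
    # Box::new is a pass-through: no wrapper object is created.
    return expr


node = {"val": 1, "next": None}
boxed = box_new(node)
some = Option(boxed)
```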

<h3 id="structural-pattern-matching">Structural Pattern Matching</h3>

<p>Languages like Python, Rust, C#, and Scala have structural pattern matching constructs (<code class="language-plaintext highlighter-rouge">match</code>/<code class="language-plaintext highlighter-rouge">switch</code> with destructuring), where a subject value is tested against a sequence of patterns that can bind variables, decompose structures, and guard with conditions. RedDragon handles these through a common <strong>Pattern ADT</strong> (<code class="language-plaintext highlighter-rouge">interpreter/frontends/common/patterns.py</code>) shared across frontends.</p>

<p>The ADT defines pattern types that compose recursively:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LiteralPattern(value)                    # 42, "hello", true
WildcardPattern()                        # _
CapturePattern(name)                     # x (bind to variable)
OrPattern(alternatives)                  # 1 | 2 | 3
SequencePattern(elements)                # (a, b, _): tuple destructuring
ClassPattern(class_name, positional, keyword)  # Point { x, y } or Some(x)
ValuePattern(parts)                      # Color::Red, Color.Red
AsPattern(inner, name)                   # x: Int (isinstance + bind)
</code></pre></div></div>

<p>Two shared compiler functions translate patterns into IR without any new opcodes:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">compile_pattern_test(ctx, subject_reg, pattern)</code></strong> emits a chain of <code class="language-plaintext highlighter-rouge">BINOP ==</code>, <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code>, <code class="language-plaintext highlighter-rouge">LOAD_INDEX</code>, and <code class="language-plaintext highlighter-rouge">BRANCH_IF</code> instructions that evaluate to a boolean test register. Nested patterns recurse and AND the results together.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">compile_pattern_bindings(ctx, subject_reg, pattern)</code></strong> emits <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code>/<code class="language-plaintext highlighter-rouge">LOAD_INDEX</code> + <code class="language-plaintext highlighter-rouge">DECL_VAR</code> instructions that extract and bind matched components to local variables.</li>
</ul>
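<p>To make the test/bindings split concrete, the sketch below interprets a subset of the ADT directly in Python instead of emitting IR. The dataclass definitions are simplified assumptions, and <code class="language-plaintext highlighter-rouge">pattern_test</code> and <code class="language-plaintext highlighter-rouge">pattern_bindings</code> are hypothetical analogues of the two compiler functions: the first computes the boolean the emitted test chain would produce, the second the variables the binding pass would declare.</p>

```python
from dataclasses import dataclass


@dataclass
class LiteralPattern:
    value: object


@dataclass
class WildcardPattern:
    pass


@dataclass
class CapturePattern:
    name: str


@dataclass
class OrPattern:
    alternatives: list


@dataclass
class SequencePattern:
    elements: list


def pattern_test(subject, pattern):
    """Boolean result that the emitted test chain would leave in a register."""
    if isinstance(pattern, LiteralPattern):
        return subject == pattern.value
    if isinstance(pattern, (WildcardPattern, CapturePattern)):
        return True
    if isinstance(pattern, OrPattern):
        return any(pattern_test(subject, alt) for alt in pattern.alternatives)
    if isinstance(pattern, SequencePattern):
        if not isinstance(subject, (list, tuple)):
            return False
        if len(subject) != len(pattern.elements):
            return False
        # Nested results are ANDed together, as in the recursive compiler.
        return all(pattern_test(s, p) for s, p in zip(subject, pattern.elements))
    raise TypeError("unsupported pattern")


def pattern_bindings(subject, pattern):
    """Variables that the binding pass would declare after a successful test."""
    if isinstance(pattern, CapturePattern):
        return {pattern.name: subject}
    if isinstance(pattern, SequencePattern):
        bound = {}
        for s, p in zip(subject, pattern.elements):
            bound.update(pattern_bindings(s, p))
        return bound
    return {}


pattern = SequencePattern([LiteralPattern(1), CapturePattern("b"), WildcardPattern()])
matched = pattern_test((1, 2, 3), pattern)
bound = pattern_bindings((1, 2, 3), pattern)
```

<p>Keeping the test free of side effects and running the bindings only after it succeeds means a failed arm never leaves partially bound variables behind.</p>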

<p>Each frontend maps its language’s tree-sitter AST nodes to the Pattern ADT through a <code class="language-plaintext highlighter-rouge">parse_&lt;lang&gt;_pattern</code> function:</p>

<ul>
  <li><strong>Python</strong> maps <code class="language-plaintext highlighter-rouge">match_statement</code> cases: <code class="language-plaintext highlighter-rouge">case_pattern</code>, <code class="language-plaintext highlighter-rouge">class_pattern</code>, <code class="language-plaintext highlighter-rouge">as_pattern</code>, <code class="language-plaintext highlighter-rouge">or_pattern</code>.</li>
  <li><strong>C#</strong> maps <code class="language-plaintext highlighter-rouge">switch_expression</code> arms: <code class="language-plaintext highlighter-rouge">constant_pattern</code>, <code class="language-plaintext highlighter-rouge">declaration_pattern</code>, <code class="language-plaintext highlighter-rouge">var_pattern</code>, <code class="language-plaintext highlighter-rouge">property_pattern</code>, <code class="language-plaintext highlighter-rouge">discard_pattern</code>.</li>
  <li><strong>Rust</strong> maps <code class="language-plaintext highlighter-rouge">match_expression</code> arms: <code class="language-plaintext highlighter-rouge">integer_literal</code> → <code class="language-plaintext highlighter-rouge">LiteralPattern</code>, <code class="language-plaintext highlighter-rouge">identifier</code> → <code class="language-plaintext highlighter-rouge">CapturePattern</code>, <code class="language-plaintext highlighter-rouge">tuple_struct_pattern</code> → <code class="language-plaintext highlighter-rouge">ClassPattern</code> (with prelude variant resolution: <code class="language-plaintext highlighter-rouge">Some(x)</code> becomes <code class="language-plaintext highlighter-rouge">ClassPattern("Option", ...)</code>), <code class="language-plaintext highlighter-rouge">struct_pattern</code> → <code class="language-plaintext highlighter-rouge">ClassPattern</code> with keyword args, <code class="language-plaintext highlighter-rouge">or_pattern</code>, <code class="language-plaintext highlighter-rouge">tuple_pattern</code>, <code class="language-plaintext highlighter-rouge">scoped_identifier</code> → <code class="language-plaintext highlighter-rouge">ValuePattern</code>. Rust <code class="language-plaintext highlighter-rouge">match</code> is expression-style (returns a value), so the frontend uses <code class="language-plaintext highlighter-rouge">compile_pattern_test</code>/<code class="language-plaintext highlighter-rouge">compile_pattern_bindings</code> directly rather than the statement-level <code class="language-plaintext highlighter-rouge">compile_match</code>, storing each arm’s result into a shared result variable.</li>
</ul>

<p>Because the Pattern ADT compiles down to existing IR opcodes (<code class="language-plaintext highlighter-rouge">BINOP</code>, <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code>, <code class="language-plaintext highlighter-rouge">LOAD_INDEX</code>, <code class="language-plaintext highlighter-rouge">BRANCH_IF</code>, <code class="language-plaintext highlighter-rouge">DECL_VAR</code>), no VM changes are needed. The VM, CFG builder, and dataflow analysis all handle pattern matching transparently.</p>

<h3 id="built-in-functions">Built-in Functions</h3>

<p>The VM includes a small table of built-in functions (<code class="language-plaintext highlighter-rouge">len</code>, <code class="language-plaintext highlighter-rouge">range</code>, <code class="language-plaintext highlighter-rouge">print</code>, <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">float</code>, <code class="language-plaintext highlighter-rouge">str</code>, <code class="language-plaintext highlighter-rouge">bool</code>, <code class="language-plaintext highlighter-rouge">abs</code>, <code class="language-plaintext highlighter-rouge">max</code>, <code class="language-plaintext highlighter-rouge">min</code>, plus array constructors) that are resolved before falling through to the <code class="language-plaintext highlighter-rouge">UnresolvedCallResolver</code>. Each built-in handles symbolic arguments gracefully: <code class="language-plaintext highlighter-rouge">len</code> of a symbolic list returns a symbolic, <code class="language-plaintext highlighter-rouge">range</code> with symbolic bounds returns <code class="language-plaintext highlighter-rouge">UNCOMPUTABLE</code>, and so on. This keeps common operations concrete without requiring language-specific runtime support.</p>
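<p>A sketch of what “handles symbolic arguments gracefully” might look like for two of the built-ins. The <code class="language-plaintext highlighter-rouge">SymbolicValue</code> shape, the <code class="language-plaintext highlighter-rouge">UNCOMPUTABLE</code> sentinel object, and the handler signature are assumptions made for illustration; only the observable behaviour (symbolic in, symbolic or <code class="language-plaintext highlighter-rouge">UNCOMPUTABLE</code> out) follows the description above.</p>

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolicValue:
    name: str
    hint: str = ""


UNCOMPUTABLE = object()  # sentinel: the built-in cannot produce a result


def builtin_len(args, fresh_symbol):
    (target,) = args
    if isinstance(target, SymbolicValue):
        return fresh_symbol("len(" + target.name + ")")
    return len(target)


def builtin_range(args, fresh_symbol):
    if any(isinstance(a, SymbolicValue) for a in args):
        return UNCOMPUTABLE
    return list(range(*args))


BUILTINS = {"len": builtin_len, "range": builtin_range}


def make_symbol_factory():
    counter = itertools.count()
    return lambda hint: SymbolicValue("sym_" + str(next(counter)), hint)


fresh = make_symbol_factory()
concrete = BUILTINS["len"]([[1, 2, 3]], fresh)
symbolic = BUILTINS["len"]([SymbolicValue("sym_x")], fresh)
bounds = BUILTINS["range"]([0, SymbolicValue("sym_n")], fresh)
```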

<hr />

<h2 id="llm-assisted-vm-execution">LLM-Assisted VM Execution</h2>

<p>The deterministic VM handles known functions, built-ins, class constructors, and concrete operations without external help. But real-world code calls libraries, frameworks, and system functions that don’t exist in the IR. When the VM exhausts all internal resolution paths (built-in table, function registry, class constructors, string/list indexing conversion), it delegates to an <code class="language-plaintext highlighter-rouge">UnresolvedCallResolver</code>. For a program with 50 function calls where 45 are to local functions and built-ins, only 5 hit the resolver.</p>

<p>This is the third and final point where an LLM can enter the pipeline. The first is at the frontend (lowering source to IR). The second is AST repair (fixing malformed syntax). This third point is at runtime: resolving calls to functions whose implementations are unavailable.</p>

<h3 id="the-strategy-pattern">The Strategy Pattern</h3>

<p>The resolver is a pluggable strategy, selected at pipeline configuration time:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">UnresolvedCallResolver</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
    <span class="nd">@abstractmethod</span>
    <span class="k">def</span> <span class="nf">resolve_call</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">func_name</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">inst</span><span class="p">,</span> <span class="n">vm</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ExecutionResult</span><span class="p">:</span> <span class="bp">...</span>

    <span class="nd">@abstractmethod</span>
    <span class="k">def</span> <span class="nf">resolve_method</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">obj_desc</span><span class="p">,</span> <span class="n">method_name</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">inst</span><span class="p">,</span> <span class="n">vm</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ExecutionResult</span><span class="p">:</span> <span class="bp">...</span>
</code></pre></div></div>

<p>Both <code class="language-plaintext highlighter-rouge">resolve_call</code> (for <code class="language-plaintext highlighter-rouge">CALL_FUNCTION</code> and <code class="language-plaintext highlighter-rouge">CALL_UNKNOWN</code>) and <code class="language-plaintext highlighter-rouge">resolve_method</code> (for <code class="language-plaintext highlighter-rouge">CALL_METHOD</code>) return an <code class="language-plaintext highlighter-rouge">ExecutionResult</code> containing a <code class="language-plaintext highlighter-rouge">StateUpdate</code>, the same data object that every opcode handler returns. This means the resolver’s output flows through the same <code class="language-plaintext highlighter-rouge">apply_update()</code> path as everything else. No special cases.</p>

<h3 id="symbolicresolver-default">SymbolicResolver (Default)</h3>

<p>The default resolver creates a fresh <code class="language-plaintext highlighter-rouge">SymbolicValue</code> for any unknown call (e.g., <code class="language-plaintext highlighter-rouge">sym_0 (hint: "math.sqrt(16)")</code>). The symbolic propagation described in the <a href="#symbolic-value-propagation">VM section</a> takes over from there: dependent operations produce constrained symbolics, and the dataflow analysis traces dependencies through them. This is the right default for most analysis tasks, where knowing that a dependency exists matters more than knowing its concrete value.</p>
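<p>A minimal sketch of such a resolver, assuming heavily simplified <code class="language-plaintext highlighter-rouge">StateUpdate</code> and <code class="language-plaintext highlighter-rouge">ExecutionResult</code> shapes, an instruction modelled as a dict with a <code class="language-plaintext highlighter-rouge">result_reg</code> key, and a symbolic value modelled as a plain dict. Only <code class="language-plaintext highlighter-rouge">resolve_call</code> is shown; <code class="language-plaintext highlighter-rouge">resolve_method</code> would follow the same pattern.</p>

```python
import itertools
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class StateUpdate:  # hypothetical, simplified shape
    register_writes: dict = field(default_factory=dict)


@dataclass
class ExecutionResult:  # hypothetical, simplified shape
    update: StateUpdate


class UnresolvedCallResolver(ABC):
    @abstractmethod
    def resolve_call(self, func_name, args, inst, vm) -> ExecutionResult: ...


class SymbolicResolver(UnresolvedCallResolver):
    def __init__(self):
        self._ids = itertools.count()

    def resolve_call(self, func_name, args, inst, vm):
        rendered = ", ".join(repr(a) for a in args)
        symbol = {"name": "sym_" + str(next(self._ids)),
                  "hint": func_name + "(" + rendered + ")"}
        # `inst` is modelled as a dict carrying the destination register.
        return ExecutionResult(StateUpdate({inst["result_reg"]: symbol}))


resolver = SymbolicResolver()
result = resolver.resolve_call("math.sqrt", [16], {"result_reg": "%5"}, vm=None)
```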

<h3 id="llmplausibleresolver-opt-in">LLMPlausibleResolver (Opt-In)</h3>

<p>When concrete results matter (e.g., verifying that a computed value matches an expected output), the <code class="language-plaintext highlighter-rouge">LLMPlausibleResolver</code> asks an LLM to produce a plausible return value.</p>

<p>The resolver sends a structured JSON prompt containing:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"call"</span><span class="p">:</span><span class="w"> </span><span class="s2">"math.sqrt(16)"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mi">16</span><span class="p">],</span><span class="w">
  </span><span class="nl">"result_reg"</span><span class="p">:</span><span class="w"> </span><span class="s2">"%5"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"state"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"local_vars"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="nl">"x"</span><span class="p">:</span><span class="w"> </span><span class="mi">16</span><span class="p">,</span><span class="w"> </span><span class="nl">"y"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sym_0"</span><span class="p">},</span><span class="w">
    </span><span class="nl">"heap"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"language"</span><span class="p">:</span><span class="w"> </span><span class="s2">"python"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The system prompt constrains the LLM to return a JSON object with:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">value</code>: the concrete return value (or null if unknowable)</li>
  <li><code class="language-plaintext highlighter-rouge">heap_writes</code>: any side effects as <code class="language-plaintext highlighter-rouge">[{"obj_addr": "...", "field": "...", "value": ...}]</code></li>
  <li><code class="language-plaintext highlighter-rouge">var_writes</code>: any variable mutations as <code class="language-plaintext highlighter-rouge">{"name": value}</code></li>
  <li><code class="language-plaintext highlighter-rouge">reasoning</code>: a short explanation</li>
</ul>

<p>For standard library functions (<code class="language-plaintext highlighter-rouge">math.sqrt</code>, <code class="language-plaintext highlighter-rouge">string.upper</code>, <code class="language-plaintext highlighter-rouge">list.append</code>), the prompt instructs the LLM to compute the exact result. For unknown functions, it asks for a best estimate based on the name and arguments.</p>

<p>The response is parsed into a <code class="language-plaintext highlighter-rouge">StateUpdate</code> and applied through the same <code class="language-plaintext highlighter-rouge">apply_update()</code> path. If the LLM returns invalid JSON or the call fails for any reason, the resolver falls back to <code class="language-plaintext highlighter-rouge">SymbolicResolver</code> automatically. The worst case is identical to not using LLM resolution at all.</p>
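<p>The parse-or-fall-back logic might look like the following. The function names and the exact validation rules are assumptions; the field names match the response schema listed above, and the fallback is represented by a caller-supplied callable rather than the real <code class="language-plaintext highlighter-rouge">SymbolicResolver</code>.</p>

```python
import json


def parse_llm_reply(raw_text):
    """Return a validated reply dict, or None to signal the symbolic fallback."""
    try:
        reply = json.loads(raw_text)
    except (TypeError, ValueError):
        return None  # not JSON at all
    if not isinstance(reply, dict) or "value" not in reply:
        return None
    heap_writes = reply.get("heap_writes", [])
    var_writes = reply.get("var_writes", {})
    if not isinstance(heap_writes, list) or not isinstance(var_writes, dict):
        return None
    return {
        "value": reply["value"],
        "heap_writes": heap_writes,
        "var_writes": var_writes,
        "reasoning": reply.get("reasoning", ""),
    }


def resolve(raw_text, symbolic_fallback):
    parsed = parse_llm_reply(raw_text)
    return parsed if parsed is not None else symbolic_fallback()


good = resolve('{"value": 4.0, "reasoning": "sqrt of 16"}', lambda: "sym_0")
bad = resolve("I think the answer is 4", lambda: "sym_0")
```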

<h3 id="worked-example-analysing-code-with-missing-dependencies">Worked Example: Analysing Code with Missing Dependencies</h3>

<p>Consider a Python program that uses <code class="language-plaintext highlighter-rouge">requests</code> (not available in the IR) and a local function:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">requests</span>

<span class="k">def</span> <span class="nf">extract_name</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">data</span><span class="p">[</span><span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">][</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">]</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">https://api.example.com/users/1</span><span class="sh">"</span><span class="p">)</span>
<span class="n">body</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="nf">json</span><span class="p">()</span>
<span class="n">name</span> <span class="o">=</span> <span class="nf">extract_name</span><span class="p">(</span><span class="n">body</span><span class="p">)</span>
<span class="n">greeting</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Hello, </span><span class="sh">"</span> <span class="o">+</span> <span class="n">name</span>
</code></pre></div></div>

<p>The deterministic frontend lowers this to IR. <code class="language-plaintext highlighter-rouge">extract_name</code> gets a full function definition (skip-over pattern, parameter binding, <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code> chain). <code class="language-plaintext highlighter-rouge">requests.get</code> and <code class="language-plaintext highlighter-rouge">response.json()</code> are calls to unresolved externals.</p>

<p><strong>With SymbolicResolver:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>response  = sym_0 (hint: "requests.get('https://api.example.com/users/1')")
body      = sym_1 (hint: "sym_0.json()")
</code></pre></div></div>

<p>The VM enters <code class="language-plaintext highlighter-rouge">extract_name</code> with <code class="language-plaintext highlighter-rouge">data = sym_1</code>. The <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code> for <code class="language-plaintext highlighter-rouge">data["user"]</code> triggers lazy heap materialisation: a synthetic heap entry is created for <code class="language-plaintext highlighter-rouge">sym_1</code>, and a symbolic field <code class="language-plaintext highlighter-rouge">user</code> is cached. Then <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code> for <code class="language-plaintext highlighter-rouge">["name"]</code> creates another symbolic. The result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name      = sym_3 (hint: "sym_2.name")  where sym_2 = sym_1["user"]
greeting  = sym_4 (constraint: "'Hello, ' + sym_3")
</code></pre></div></div>

<p>The dataflow analysis traces <code class="language-plaintext highlighter-rouge">greeting</code> back through <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">body</code>, and <code class="language-plaintext highlighter-rouge">response</code> to the <code class="language-plaintext highlighter-rouge">requests.get</code> call. The <strong>dependency chain is fully preserved</strong> even though no concrete HTTP call was made.</p>

<p><strong>With LLMPlausibleResolver:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>response  = &lt;plausible Response object&gt;
body      = {"user": {"name": "Alice", "email": "alice@example.com"}}
name      = "Alice"
greeting  = "Hello, Alice"
</code></pre></div></div>

<p>The LLM produces a plausible JSON response for the API call. <code class="language-plaintext highlighter-rouge">response.json()</code> returns a plausible dict. <code class="language-plaintext highlighter-rouge">extract_name</code> runs concretely on the plausible data. The final values are concrete and inspectable, at the cost of one LLM call per unresolved external.</p>

<hr />

<h2 id="dataflow-analysis">Dataflow Analysis</h2>

<p><em>See also: <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/notes-on-dataflow-design.md">Dataflow Design</a></em></p>

<p>The dataflow module (<code class="language-plaintext highlighter-rouge">interpreter/dataflow.py</code>) performs <strong>iterative intraprocedural analysis</strong> over the CFG in five stages:</p>

<ol>
  <li><strong>Collect definitions</strong>: identify every point where a variable or register is assigned</li>
  <li><strong>Reaching definitions</strong>: GEN/KILL worklist fixpoint iteration over the CFG</li>
  <li><strong>Def-use chains</strong>: link each use to the definition(s) that reach it</li>
  <li><strong>Raw dependency graph</strong>: trace through register chains to discover direct named-variable-to-named-variable dependencies</li>
  <li><strong>Transitive closure</strong>: propagate indirect dependencies to produce the full dependency graph</li>
</ol>

<p>The analysis is forward, may-approximate (over-approximate), and intraprocedural (single function/module scope). It covers all value-producing opcodes including the byte-addressed memory region operations (<code class="language-plaintext highlighter-rouge">ALLOC_REGION</code>, <code class="language-plaintext highlighter-rouge">LOAD_REGION</code>, <code class="language-plaintext highlighter-rouge">WRITE_REGION</code>), so COBOL programs get full dataflow tracking.</p>

<h3 id="reaching-definitions">Reaching Definitions</h3>

<p>Standard GEN/KILL worklist iteration over the dataflow equations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>reach_in(B)  = ∪ { reach_out(P) | P ∈ predecessors(B) }
reach_out(B) = GEN(B) ∪ (reach_in(B) − KILL(B))
</code></pre></div></div>

<p>The lattice is the power set of all definitions (finite) and the GEN/KILL transfer functions are monotone, so the iteration is guaranteed to reach a fixpoint. A safety cap of 1,000 iterations prevents runaway on pathological CFGs.</p>
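<p>As a rough illustration of the equations above, here is a minimal worklist sketch. It is not the code in <code class="language-plaintext highlighter-rouge">interpreter/dataflow.py</code>: the function name, the dictionary-based CFG shape, and the <code class="language-plaintext highlighter-rouge">x@entry</code>-style definition ids are assumptions made for this example.</p>

```python
from collections import deque


def reaching_definitions(blocks, successors, max_iterations=1000):
    """Hypothetical input shapes, chosen for this sketch only.

    blocks:     {label: [(variable, definition_id), ...]} in program order
    successors: {label: [label, ...]}
    Returns (reach_in, reach_out), each {label: set of definition_ids}.
    """
    # Every definition of each variable, needed to build the KILL sets.
    defs_of = {}
    for statements in blocks.values():
        for variable, definition in statements:
            defs_of.setdefault(variable, set()).add(definition)

    gen, kill = {}, {}
    for label, statements in blocks.items():
        last = {}
        for variable, definition in statements:
            last[variable] = definition  # a later definition shadows earlier ones
        gen[label] = set(last.values())
        kill[label] = set()
        for variable, definition in last.items():
            kill[label] |= defs_of[variable] - {definition}

    predecessors = {label: [] for label in blocks}
    for label, targets in successors.items():
        for target in targets:
            predecessors[target].append(label)

    reach_in = {label: set() for label in blocks}
    reach_out = {label: set(gen[label]) for label in blocks}
    worklist = deque(blocks)
    iterations = 0
    while worklist and iterations != max_iterations:
        iterations += 1
        label = worklist.popleft()
        reach_in[label] = set().union(*(reach_out[p] for p in predecessors[label]))
        new_out = gen[label] | (reach_in[label] - kill[label])
        if new_out != reach_out[label]:
            reach_out[label] = new_out
            worklist.extend(successors.get(label, ()))  # re-examine affected blocks
    return reach_in, reach_out


# A four-block if/else CFG: x is redefined on one branch only.
blocks = {
    "entry": [("x", "x@entry")],
    "if_true": [("x", "x@if_true")],
    "if_false": [("y", "y@if_false")],
    "merge": [],
}
successors = {
    "entry": ["if_true", "if_false"],
    "if_true": ["merge"],
    "if_false": ["merge"],
    "merge": [],
}
reach_in, reach_out = reaching_definitions(blocks, successors)
```

<p>Only blocks whose <code class="language-plaintext highlighter-rouge">reach_out</code> actually changed push their successors back onto the worklist, which is what keeps the iteration count small on typical CFGs.</p>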

<h3 id="the-register-chain-problem">The Register Chain Problem</h3>

<p>The interesting part is translating from register-level def-use chains to human-readable variable dependencies. The IR uses temporary registers (<code class="language-plaintext highlighter-rouge">%0</code>, <code class="language-plaintext highlighter-rouge">%1</code>, …) for all intermediate values. A statement like <code class="language-plaintext highlighter-rouge">y = x + 1</code> becomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%0 = LOAD_VAR x
%1 = CONST 1
%2 = BINOP +, %0, %1
     STORE_VAR y, %2
</code></pre></div></div>

<p>The raw def-use chain says “<code class="language-plaintext highlighter-rouge">y</code> depends on <code class="language-plaintext highlighter-rouge">%2</code>”. But a human wants to know “<code class="language-plaintext highlighter-rouge">y</code> depends on <code class="language-plaintext highlighter-rouge">x</code>”. The dependency graph builder traces through the register chain: <code class="language-plaintext highlighter-rouge">%2</code> comes from <code class="language-plaintext highlighter-rouge">BINOP</code> on <code class="language-plaintext highlighter-rouge">%0</code> and <code class="language-plaintext highlighter-rouge">%1</code>; <code class="language-plaintext highlighter-rouge">%0</code> comes from <code class="language-plaintext highlighter-rouge">LOAD_VAR x</code>; <code class="language-plaintext highlighter-rouge">%1</code> is a constant. Therefore <code class="language-plaintext highlighter-rouge">y</code> depends on <code class="language-plaintext highlighter-rouge">x</code>. Transitive closure extends this across multi-step computations.</p>
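<p>A minimal sketch of that tracing step, under simplifying assumptions: instructions are plain tuples, each temporary register is assigned exactly once, and the code is straight-line. The real builder works from def-use chains rather than a single producer map, and the helper names here are hypothetical.</p>

```python
def named_sources(register, producers):
    """Follow a register back to the named variables it was computed from."""
    instruction = producers[register]
    opcode = instruction[0]
    if opcode == "LOAD_VAR":
        return {instruction[2]}
    if opcode == "CONST":
        return set()  # constants contribute no variable dependency
    if opcode == "BINOP":
        _, _, _, left, right = instruction
        return named_sources(left, producers) | named_sources(right, producers)
    raise ValueError(f"unhandled opcode: {opcode}")


def variable_dependencies(instructions):
    """Map each stored variable to the named variables it directly depends on.

    Tuple shapes (assumed for this sketch):
      ("LOAD_VAR", dest, var), ("CONST", dest, value),
      ("BINOP", dest, op, left, right), ("STORE_VAR", var, src)
    """
    producers = {}  # register name to the instruction that defined it
    dependencies = {}
    for instruction in instructions:
        if instruction[0] == "STORE_VAR":
            _, variable, source = instruction
            found = named_sources(source, producers)
            dependencies.setdefault(variable, set()).update(found)
        else:
            producers[instruction[1]] = instruction
    return dependencies


# y = x + 1, lowered as in the listing above
program = [
    ("LOAD_VAR", "%0", "x"),
    ("CONST", "%1", 1),
    ("BINOP", "%2", "+", "%0", "%1"),
    ("STORE_VAR", "y", "%2"),
]
dependencies = variable_dependencies(program)
```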

<h3 id="worked-example-diamond-dependencies">Worked Example: Diamond Dependencies</h3>

<p>Consider this program with diamond dependencies, function calls, and multi-operand expressions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span>
<span class="n">e</span> <span class="o">=</span> <span class="n">c</span> <span class="o">+</span> <span class="n">d</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">e</span> <span class="o">-</span> <span class="n">a</span>

<span class="k">def</span> <span class="nf">square</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span>

<span class="n">g</span> <span class="o">=</span> <span class="nf">square</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">g</span> <span class="o">+</span> <span class="n">f</span>
<span class="n">total</span> <span class="o">=</span> <span class="n">h</span> <span class="o">+</span> <span class="n">e</span> <span class="o">+</span> <span class="n">b</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">c</code> and <code class="language-plaintext highlighter-rouge">d</code> both depend on <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> (the diamond). <code class="language-plaintext highlighter-rouge">g</code> depends on <code class="language-plaintext highlighter-rouge">c</code> through the function call. <code class="language-plaintext highlighter-rouge">total</code> depends on three variables directly. The IR for just the main body (omitting the function) looks like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%0  = const 1           → a = 1
%1  = const 2           → b = 2
%2  = load_var a
%3  = load_var b
%4  = binop + %2 %3     → c = a + b
%5  = load_var a
%6  = load_var b
%7  = binop * %5 %6     → d = a * b
%8  = load_var c
%9  = load_var d
%10 = binop + %8 %9     → e = c + d
%11 = load_var e
%12 = load_var a
%13 = binop - %11 %12   → f = e - a
%14 = load_var c
%15 = call_function square %14  → g = square(c)
%16 = load_var g
%17 = load_var f
%18 = binop + %16 %17   → h = g + f
%19 = load_var h
%20 = load_var e
%21 = load_var b
%22 = binop + %19 %20   → (partial)
%23 = binop + %22 %21   → total = h + e + b
</code></pre></div></div>

<p>The dependency graph builder traces through the register chains. For example, <code class="language-plaintext highlighter-rouge">c</code> depends on <code class="language-plaintext highlighter-rouge">%4</code> (a <code class="language-plaintext highlighter-rouge">BINOP</code>), which reads <code class="language-plaintext highlighter-rouge">%2</code> (from <code class="language-plaintext highlighter-rouge">LOAD_VAR a</code>) and <code class="language-plaintext highlighter-rouge">%3</code> (from <code class="language-plaintext highlighter-rouge">LOAD_VAR b</code>), so <code class="language-plaintext highlighter-rouge">c → {a, b}</code>. Applying this recursively across all variables:</p>

<p><code class="language-plaintext highlighter-rouge">c → {a, b}</code>, <code class="language-plaintext highlighter-rouge">d → {a, b}</code>, <code class="language-plaintext highlighter-rouge">e → {c, d}</code>, <code class="language-plaintext highlighter-rouge">f → {e, a}</code>, <code class="language-plaintext highlighter-rouge">g → {c}</code> (through the function call), <code class="language-plaintext highlighter-rouge">h → {g, f}</code>, <code class="language-plaintext highlighter-rouge">total → {h, e, b}</code>.</p>

<p>The direct dependency graph:</p>

<pre><code class="language-mermaid">flowchart BT
    a("a"):::source
    b("b"):::source
    c("c"):::mid
    d("d"):::mid
    e("e"):::mid
    f("f"):::mid
    g("g&lt;br/&gt;&lt;i&gt;square(c)&lt;/i&gt;"):::mid
    h("h"):::mid
    total("total"):::sink

    a --&gt; c
    b --&gt; c
    a --&gt; d
    b --&gt; d
    c --&gt; e
    d --&gt; e
    a --&gt; f
    e --&gt; f
    c --&gt; g
    f --&gt; h
    g --&gt; h
    b --&gt; total
    e --&gt; total
    h --&gt; total

    classDef source fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef mid fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef sink fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#1a3a1a
</code></pre>

<p><code class="language-plaintext highlighter-rouge">total</code> directly depends on <code class="language-plaintext highlighter-rouge">h</code>, <code class="language-plaintext highlighter-rouge">e</code>, and <code class="language-plaintext highlighter-rouge">b</code>. The transitive closure adds <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">c</code>, <code class="language-plaintext highlighter-rouge">d</code>, <code class="language-plaintext highlighter-rouge">f</code>, and <code class="language-plaintext highlighter-rouge">g</code>, giving <code class="language-plaintext highlighter-rouge">total → {a, b, c, d, e, f, g, h}</code>. This means a change to any of these variables could affect <code class="language-plaintext highlighter-rouge">total</code>.</p>
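<p>The closure step can be reproduced with a few lines of graph traversal. This is an illustrative sketch over the direct dependencies listed above, with a hypothetical function name, not the module's actual implementation.</p>

```python
def transitive_closure(direct):
    """direct: {variable: set of variables it depends on directly}."""
    closure = {}
    for variable in direct:
        seen = set()
        stack = list(direct[variable])
        while stack:
            dependency = stack.pop()
            if dependency in seen:
                continue  # already expanded; this also makes cycles safe
            seen.add(dependency)
            stack.extend(direct.get(dependency, ()))
        closure[variable] = seen
    return closure


direct = {
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"c", "d"},
    "f": {"e", "a"},
    "g": {"c"},
    "h": {"g", "f"},
    "total": {"h", "e", "b"},
}
closure = transitive_closure(direct)
```

<p>The <code class="language-plaintext highlighter-rouge">seen</code> set matters beyond this example: loop-carried dependencies can make a variable depend on itself, and without the check the traversal would not terminate.</p>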

<h3 id="branching-and-multiple-reaching-definitions">Branching and Multiple Reaching Definitions</h3>

<p>On a diamond CFG (if/else), the analysis can produce multiple reaching definitions for the same variable at the merge point:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>entry:    x = 10        → reach_out = {x@entry}
if_true:  x = 20        → reach_out = {x@if_true}
if_false: y = 30        → reach_out = {x@entry, y@if_false}
merge:    use(x)        → reach_in = {x@entry, x@if_true, y@if_false}
</code></pre></div></div>

<p>At the merge block, <code class="language-plaintext highlighter-rouge">x</code> has <em>two</em> reaching definitions (from <code class="language-plaintext highlighter-rouge">entry</code> and <code class="language-plaintext highlighter-rouge">if_true</code>). This correctly models the fact that the value of <code class="language-plaintext highlighter-rouge">x</code> at the merge point depends on which branch was taken. The def-use chain links the use of <code class="language-plaintext highlighter-rouge">x</code> in <code class="language-plaintext highlighter-rouge">merge</code> to both definitions.</p>

<h3 id="decoupling">Decoupling</h3>

<p>The dataflow module has <strong>no dependencies on the VM, frontends, or backends</strong>. It’s a pure analysis pass over the CFG, decoupled from the imperative shell (parsing, I/O, LLM calls). Its input is a <code class="language-plaintext highlighter-rouge">CFG</code> object; its output is a <code class="language-plaintext highlighter-rouge">DataflowResult</code> containing definitions, block facts, def-use chains, and both raw and transitive dependency graphs.</p>

<hr />

<h2 id="type-inference">Type Inference</h2>

<p><em>See also: <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/type-system.md">Type System Design</a></em></p>

<p>The type inference module (<code class="language-plaintext highlighter-rouge">interpreter/type_inference.py</code>) is a <strong>static analysis pass</strong> that runs after lowering but before VM execution. It walks the IR instructions in a fixpoint loop — repeating until no new types are discovered — and produces an immutable <code class="language-plaintext highlighter-rouge">TypeEnvironment</code> mapping registers and variables to canonical types. The VM then uses this environment for type-aware coercion at write time.</p>

<h3 id="type-representation-the-typeexpr-adt">Type Representation: The TypeExpr ADT</h3>

<p>Types are represented as an algebraic data type (<code class="language-plaintext highlighter-rouge">TypeExpr</code>) with nine variants:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ScalarType(name)                          # Int, String, Bool, MyClass
EnumType(name)                            # Color, Direction (enum declarations)
AnnotationType(name)                      # @Override, @Deprecated (annotation types)
StructPatternType(name)                   # Point { x, y } (struct pattern matching)
ParameterizedType(constructor, arguments) # Array[Int], Map[String, Int], Pointer[Pointer[Float]]
FunctionType(params, return_type)         # Fn(Int, String) -&gt; Bool
UnionType(members)                        # Union[Int, String], Optional = Union[T, Null]
TypeVar(name, bound)                      # T, T: Number (bounded type variable)
UnknownType                               # sentinel for unresolved types
</code></pre></div></div>

<p>All TypeExpr values produce canonical string representations and — critically for the migration from the original string-based system — compare equal to those strings (<code class="language-plaintext highlighter-rouge">ScalarType("Int") == "Int"</code>). This made it possible to migrate the inference engine, type resolver, and coercion rules to structured types incrementally without breaking existing tests.</p>
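<p>One way to get that string compatibility in Python is to route both equality and hashing through the canonical string. The sketch below shows the idea for two of the nine variants; the base-class layout is an assumption, and the real classes may be organised differently.</p>

```python
from dataclasses import dataclass


class TypeExpr:
    """Base class: equality and hashing go through the canonical string."""

    def __eq__(self, other):
        if isinstance(other, (str, TypeExpr)):
            return str(self) == str(other)
        return NotImplemented

    def __hash__(self):
        # Same hash as the canonical string, so either form works as a dict key.
        return hash(str(self))


@dataclass(frozen=True, eq=False)  # eq=False keeps the inherited __eq__/__hash__
class ScalarType(TypeExpr):
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True, eq=False)
class ParameterizedType(TypeExpr):
    constructor: str
    arguments: tuple

    def __str__(self):
        inner = ", ".join(str(argument) for argument in self.arguments)
        return f"{self.constructor}[{inner}]"


legacy_table = {"Int": "integer coercion rule"}  # keyed by plain strings
rule = legacy_table[ScalarType("Int")]           # structured lookup still hits
```

<p>Because the hash matches the string's hash, tables keyed by the old string names keep working while call sites migrate to structured types one at a time.</p>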

<h3 id="the-type-ontology">The Type Ontology</h3>

<p>The base type hierarchy is a DAG (<code class="language-plaintext highlighter-rouge">TypeGraph</code>) with subtype queries and least-upper-bound (LUB) computation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Any
├── Number
│   ├── Int
│   └── Float
├── String
├── Bool
├── Object
└── Array
</code></pre></div></div>

<p>The graph is pluggable: <code class="language-plaintext highlighter-rouge">TypeGraph</code> is constructed from a tuple of <code class="language-plaintext highlighter-rouge">TypeNode</code> values and can be extended without mutating the original. Subtype checks (<code class="language-plaintext highlighter-rouge">is_subtype</code>) and common-supertype queries (<code class="language-plaintext highlighter-rouge">common_supertype</code>) traverse the DAG via BFS. At runtime the graph is extended with user-defined class types, interface/trait nodes, and parameterized type rules.</p>
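<p>A compact sketch of such a graph, assuming each node is reduced to a <code class="language-plaintext highlighter-rouge">(name, parents)</code> pair instead of a full <code class="language-plaintext highlighter-rouge">TypeNode</code>. The method names <code class="language-plaintext highlighter-rouge">extend</code> and <code class="language-plaintext highlighter-rouge">ancestors</code> are hypothetical; only <code class="language-plaintext highlighter-rouge">is_subtype</code> and <code class="language-plaintext highlighter-rouge">common_supertype</code> come from the text.</p>

```python
from collections import deque


class TypeGraph:
    def __init__(self, nodes):
        # nodes: iterable of (name, parent_names) pairs
        self._parents = {name: tuple(parents) for name, parents in nodes}

    def extend(self, extra_nodes):
        """Return a new graph; the original is left untouched."""
        return TypeGraph(tuple(self._parents.items()) + tuple(extra_nodes))

    def ancestors(self, name):
        """The type itself plus all supertypes, nearest first (BFS order)."""
        order, queue, seen = [], deque([name]), {name}
        while queue:
            current = queue.popleft()
            order.append(current)
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return order

    def is_subtype(self, child, parent):
        return parent in self.ancestors(child)

    def common_supertype(self, left, right):
        right_ancestors = set(self.ancestors(right))
        for candidate in self.ancestors(left):
            if candidate in right_ancestors:
                return candidate  # first shared ancestor in BFS order from left
        return "Any"


BASE = TypeGraph((
    ("Any", ()),
    ("Number", ("Any",)),
    ("Int", ("Number",)),
    ("Float", ("Number",)),
    ("String", ("Any",)),
    ("Bool", ("Any",)),
    ("Object", ("Any",)),
    ("Array", ("Any",)),
))
with_user_class = BASE.extend((("Dog", ("Object",)),))
```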

<p><strong>Parameterized types.</strong> <code class="language-plaintext highlighter-rouge">ParameterizedType</code> values carry subtype checking through their arguments. <code class="language-plaintext highlighter-rouge">Array[Int]</code> is a subtype of <code class="language-plaintext highlighter-rouge">Array[Number]</code> because <code class="language-plaintext highlighter-rouge">Int</code> is a subtype of <code class="language-plaintext highlighter-rouge">Number</code> (covariant by default). A raw constructor (<code class="language-plaintext highlighter-rouge">Array</code>) is a supertype of any parameterised variant (<code class="language-plaintext highlighter-rouge">Array[Int]</code>).</p>

<p><strong>Union types.</strong> <code class="language-plaintext highlighter-rouge">UnionType</code> members are flattened (nested unions merge), deduplicated, and singleton-eliminated (<code class="language-plaintext highlighter-rouge">Union[Int]</code> simplifies to <code class="language-plaintext highlighter-rouge">Int</code>). <code class="language-plaintext highlighter-rouge">Optional</code> is sugar for <code class="language-plaintext highlighter-rouge">Union[T, Null]</code>. A union is a subtype of a parent if all members are subtypes; a child is a subtype of a union parent if it is a subtype of at least one member.</p>
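<p>The normalisation and subtype rules for unions are small enough to sketch directly. In this example a union is modelled as a <code class="language-plaintext highlighter-rouge">frozenset</code> of type names and the base subtype check is passed in as a function; both choices are assumptions for illustration, not the shape of the real <code class="language-plaintext highlighter-rouge">UnionType</code>.</p>

```python
def make_union(*members):
    """Flatten nested unions, deduplicate, and collapse singletons."""
    flat = set()
    for member in members:
        flat |= member if isinstance(member, frozenset) else {member}
    if len(flat) == 1:
        return next(iter(flat))  # Union[Int] simplifies to Int
    return frozenset(flat)


def optional(inner):
    return make_union(inner, "Null")


def is_subtype(child, parent, base_is_subtype):
    if isinstance(child, frozenset):
        # A union fits only if every member fits.
        return all(is_subtype(m, parent, base_is_subtype) for m in child)
    if isinstance(parent, frozenset):
        # A plain type fits a union if it fits at least one member.
        return any(is_subtype(child, m, base_is_subtype) for m in parent)
    return base_is_subtype(child, parent)


def toy_base(child, parent):
    """Stand-in for the TypeGraph check: Int and Float sit under Number."""
    return child == parent or (child, parent) in {("Int", "Number"), ("Float", "Number")}


numeric = make_union("Int", make_union("Float", "Int"))
fits = is_subtype(numeric, "Number", toy_base)
```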

<p><strong>Function types.</strong> <code class="language-plaintext highlighter-rouge">FunctionType</code> follows standard variance: parameters are <strong>contravariant</strong> (a function accepting <code class="language-plaintext highlighter-rouge">Number</code> is a subtype of one requiring <code class="language-plaintext highlighter-rouge">Int</code>), return types are <strong>covariant</strong> (returning <code class="language-plaintext highlighter-rouge">Int</code> is a subtype of returning <code class="language-plaintext highlighter-rouge">Number</code>).</p>

<p><strong>Type aliases.</strong> A <code class="language-plaintext highlighter-rouge">type_aliases</code> registry maps alias names to <code class="language-plaintext highlighter-rouge">TypeExpr</code> targets. Resolution is transitive with cycle protection: <code class="language-plaintext highlighter-rouge">IntPtr → Pointer[Int]</code>, <code class="language-plaintext highlighter-rouge">NestedPtr → Pointer[IntPtr] → Pointer[Pointer[Int]]</code>.</p>
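<p>Transitive resolution with cycle protection might look like the following sketch. Types are modelled as either a name or a <code class="language-plaintext highlighter-rouge">(constructor, arguments)</code> tuple, and a cyclic alias is simply left unexpanded; how the real registry reports cycles is not stated here, so treat that behaviour as an assumption.</p>

```python
def resolve_alias(type_expr, aliases, active=frozenset()):
    """Expand aliases inside type_expr until none remain.

    type_expr: a type name (str) or a (constructor, arguments) tuple.
    active:    alias names currently being expanded, used to detect cycles.
    """
    if isinstance(type_expr, tuple):
        constructor, arguments = type_expr
        resolved = tuple(resolve_alias(a, aliases, active) for a in arguments)
        return (constructor, resolved)
    if type_expr not in aliases:
        return type_expr  # an ordinary type name
    if type_expr in active:
        return type_expr  # cycle detected: stop expanding
    return resolve_alias(aliases[type_expr], aliases, active | {type_expr})


aliases = {
    "IntPtr": ("Pointer", ("Int",)),
    "NestedPtr": ("Pointer", ("IntPtr",)),
}
resolved = resolve_alias("NestedPtr", aliases)
```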

<p><strong>Interface and trait typing.</strong> <code class="language-plaintext highlighter-rouge">TypeNode</code> values carry a <code class="language-plaintext highlighter-rouge">kind</code> field (<code class="language-plaintext highlighter-rouge">"class"</code> or <code class="language-plaintext highlighter-rouge">"interface"</code>). <code class="language-plaintext highlighter-rouge">TypeGraph.extend_with_interfaces()</code> adds interface nodes and wires implementing classes to them, so <code class="language-plaintext highlighter-rouge">is_subtype(MyClass, Serializable)</code> works when <code class="language-plaintext highlighter-rouge">MyClass</code> implements <code class="language-plaintext highlighter-rouge">Serializable</code>.</p>

<p><strong>Variance annotations.</strong> A per-constructor variance registry maps constructors to per-argument variance (<code class="language-plaintext highlighter-rouge">COVARIANT</code>, <code class="language-plaintext highlighter-rouge">CONTRAVARIANT</code>, or <code class="language-plaintext highlighter-rouge">INVARIANT</code>). For example, <code class="language-plaintext highlighter-rouge">MutableList</code> can be marked invariant so that <code class="language-plaintext highlighter-rouge">MutableList[Int]</code> is <em>not</em> a subtype of <code class="language-plaintext highlighter-rouge">MutableList[Number]</code>. Unregistered constructors default to covariant.</p>
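<p>To make the variance rules concrete, here is a small sketch of a per-argument check. The enum values mirror the names in the text, but the registry shape (constructor name mapped to a list of variances) and the <code class="language-plaintext highlighter-rouge">Consumer</code> constructor are hypothetical.</p>

```python
from enum import Enum


class Variance(Enum):
    COVARIANT = "covariant"
    CONTRAVARIANT = "contravariant"
    INVARIANT = "invariant"


def parameterized_is_subtype(child, parent, variance_registry, base_is_subtype):
    """child, parent: (constructor, arguments) tuples of plain type names."""
    child_constructor, child_arguments = child
    parent_constructor, parent_arguments = parent
    if child_constructor != parent_constructor:
        return False
    if len(child_arguments) != len(parent_arguments):
        return False
    # Unregistered constructors default to covariant in every position.
    default = [Variance.COVARIANT] * len(child_arguments)
    variances = variance_registry.get(child_constructor, default)
    for c_arg, p_arg, variance in zip(child_arguments, parent_arguments, variances):
        if variance is Variance.COVARIANT:
            ok = base_is_subtype(c_arg, p_arg)
        elif variance is Variance.CONTRAVARIANT:
            ok = base_is_subtype(p_arg, c_arg)  # direction flips
        else:
            ok = c_arg == p_arg                 # invariant: exact match only
        if not ok:
            return False
    return True


def toy_base(child, parent):
    """Stand-in for the TypeGraph check: Int sits under Number."""
    return child == parent or (child, parent) == ("Int", "Number")


registry = {
    "MutableList": [Variance.INVARIANT],
    "Consumer": [Variance.CONTRAVARIANT],
}
array_ok = parameterized_is_subtype(("Array", ("Int",)), ("Array", ("Number",)), registry, toy_base)
```

<p>The contravariant branch is the same rule that <code class="language-plaintext highlighter-rouge">FunctionType</code> applies to its parameters.</p>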

<p><strong>Bounded type variables.</strong> <code class="language-plaintext highlighter-rouge">TypeVar("T", bound=ScalarType("Number"))</code> represents a generic parameter constrained to <code class="language-plaintext highlighter-rouge">Number</code> subtypes. <code class="language-plaintext highlighter-rouge">Array[Int]</code> is a subtype of <code class="language-plaintext highlighter-rouge">Array[T: Number]</code> because <code class="language-plaintext highlighter-rouge">Int</code> satisfies the bound.</p>

<h3 id="two-sources-of-type-information">Two Sources of Type Information</h3>

<p>Type information enters the system through two paths:</p>

<p><strong>1. Frontend type extraction (12 statically-typed languages).</strong> During lowering, each frontend extracts type annotations from the tree-sitter AST and normalises them to canonical types via a per-language type map. Simple types are straightforward: Java’s <code class="language-plaintext highlighter-rouge">int</code> → <code class="language-plaintext highlighter-rouge">Int</code>, Rust’s <code class="language-plaintext highlighter-rouge">f64</code> → <code class="language-plaintext highlighter-rouge">Float</code>, Go’s <code class="language-plaintext highlighter-rouge">string</code> → <code class="language-plaintext highlighter-rouge">String</code>. Generic types are extracted structurally: a shared <code class="language-plaintext highlighter-rouge">extract_normalized_type()</code> function walks tree-sitter’s <code class="language-plaintext highlighter-rouge">generic_type</code> / <code class="language-plaintext highlighter-rouge">generic_name</code> / <code class="language-plaintext highlighter-rouge">user_type</code> nodes recursively, decomposes each component through the language’s type map, and emits bracket notation (<code class="language-plaintext highlighter-rouge">List[Int]</code>, <code class="language-plaintext highlighter-rouge">Map[String, Array[Float]]</code>). C pointer types use a depth-counting approach: <code class="language-plaintext highlighter-rouge">int **p</code> becomes <code class="language-plaintext highlighter-rouge">Pointer[Pointer[Int]]</code>. All extracted types are parsed into <code class="language-plaintext highlighter-rouge">TypeExpr</code> objects and seeded into a <code class="language-plaintext highlighter-rouge">TypeEnvironmentBuilder</code> that the inference pass merges before its walk. Dynamically-typed languages (JavaScript, Ruby, Lua) skip this step — they have no annotations to extract.</p>

<p><strong>2. Inference from IR structure.</strong> The inference pass itself infers types from the IR opcodes, covering both dynamically-typed languages (where all types come from inference) and filling gaps in statically-typed code (e.g., inferring the type of an expression result).</p>

<h3 id="the-inference-algorithm">The Inference Algorithm</h3>

<p>The inference walk runs to fixpoint over the flat IR. A dispatch table maps opcodes to handler functions (control flow, pointer, region, and continuation instructions with no typeable results are skipped). Each handler is a small function with no side effects beyond reading from and writing to an <code class="language-plaintext highlighter-rouge">_InferenceContext</code>, a mutable bundle of maps storing <code class="language-plaintext highlighter-rouge">TypeExpr</code> values:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">register_types</code>: <code class="language-plaintext highlighter-rouge">%0</code> → <code class="language-plaintext highlighter-rouge">ScalarType("Int")</code>, <code class="language-plaintext highlighter-rouge">%3</code> → <code class="language-plaintext highlighter-rouge">ScalarType("Bool")</code>, …</li>
  <li><code class="language-plaintext highlighter-rouge">var_types</code>: <code class="language-plaintext highlighter-rouge">x</code> → <code class="language-plaintext highlighter-rouge">ScalarType("Int")</code>, <code class="language-plaintext highlighter-rouge">items</code> → <code class="language-plaintext highlighter-rouge">ParameterizedType("Array", [ScalarType("String")])</code>, …</li>
  <li><code class="language-plaintext highlighter-rouge">func_return_types</code>: <code class="language-plaintext highlighter-rouge">factorial</code> → <code class="language-plaintext highlighter-rouge">ScalarType("Int")</code>, …</li>
  <li><code class="language-plaintext highlighter-rouge">func_param_types</code>: <code class="language-plaintext highlighter-rouge">factorial</code> → <code class="language-plaintext highlighter-rouge">[("n", ScalarType("Int"))]</code>, …</li>
  <li><code class="language-plaintext highlighter-rouge">tuple_element_types</code>: <code class="language-plaintext highlighter-rouge">%5</code> → <code class="language-plaintext highlighter-rouge">{0: ScalarType("Int"), 1: ScalarType("String")}</code>, …</li>
</ul>

<p>The key inference rules:</p>

<table>
  <thead>
    <tr>
      <th>Opcode</th>
      <th>Rule</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">CONST</code></td>
      <td>Literal analysis: <code class="language-plaintext highlighter-rouge">42</code> → Int, <code class="language-plaintext highlighter-rouge">3.14</code> → Float, <code class="language-plaintext highlighter-rouge">"hello"</code> → String, <code class="language-plaintext highlighter-rouge">True</code>/<code class="language-plaintext highlighter-rouge">False</code> → Bool</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">LOAD_VAR</code></td>
      <td>Copy type from <code class="language-plaintext highlighter-rouge">var_types</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">STORE_VAR</code></td>
      <td>Inherit type from source register</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">BINOP</code></td>
      <td>Delegate to <code class="language-plaintext highlighter-rouge">TypeResolver</code> — comparison ops → Bool, arithmetic follows promotion rules (Int + Float → Float)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">UNOP</code></td>
      <td>Fixed types for <code class="language-plaintext highlighter-rouge">not</code>/<code class="language-plaintext highlighter-rouge">!</code> → Bool, <code class="language-plaintext highlighter-rouge">#</code>/<code class="language-plaintext highlighter-rouge">~</code> → Int; otherwise inherit operand type</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">CALL_FUNCTION</code></td>
      <td>Look up <code class="language-plaintext highlighter-rouge">func_return_types</code>, then <code class="language-plaintext highlighter-rouge">_BUILTIN_RETURN_TYPES</code> (<code class="language-plaintext highlighter-rouge">len</code> → Int, <code class="language-plaintext highlighter-rouge">str</code> → String, <code class="language-plaintext highlighter-rouge">range</code> → Array)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">CALL_METHOD</code></td>
      <td>Class-scoped dispatch → <code class="language-plaintext highlighter-rouge">func_return_types</code> fallback → builtin method table (60+ methods: <code class="language-plaintext highlighter-rouge">.upper()</code> → String, <code class="language-plaintext highlighter-rouge">.split()</code> → Array, <code class="language-plaintext highlighter-rouge">.find()</code> → Int, <code class="language-plaintext highlighter-rouge">.startswith()</code> → Bool, etc.)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">RETURN</code></td>
      <td>Backfill <code class="language-plaintext highlighter-rouge">func_return_types</code> from the return expression’s type</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">NEW_OBJECT</code></td>
      <td>Use the class name as the type</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">STORE_FIELD</code> / <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code></td>
      <td>Track and retrieve per-class field types</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">STORE_INDEX</code> / <code class="language-plaintext highlighter-rouge">LOAD_INDEX</code></td>
      <td>Track and retrieve array/tuple element types (tuples track per-index: <code class="language-plaintext highlighter-rouge">Tuple[Int, String]</code>)</td>
    </tr>
  </tbody>
</table>

<p>The <code class="language-plaintext highlighter-rouge">RETURN</code> backfill deserves mention: when the pass encounters a <code class="language-plaintext highlighter-rouge">RETURN</code> instruction inside a function, it records the return expression’s type in <code class="language-plaintext highlighter-rouge">func_return_types</code>. This means callers later in the IR can resolve the function’s return type even without an explicit annotation. The fixpoint loop extends this to <strong>forward references</strong>: when function A calls function B defined later in the IR, the first pass learns B’s return type from its <code class="language-plaintext highlighter-rouge">RETURN</code>, and the second pass propagates it to A’s call site. Chains of arbitrary depth (A → B → C) resolve correctly.</p>
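<p>The forward-reference behaviour can be seen in a stripped-down model of the fixpoint loop. Each function is reduced to a single return expression that is either a literal of known type or a call to another function; this encoding and the function name are assumptions for the sketch, not the structure of the real pass.</p>

```python
def infer_return_types(functions):
    """functions: {name: ("const", type_name) or ("call", callee_name)},
    listed in IR order. Returns {name: inferred return type}."""
    known = {}
    changed = True
    while changed:  # repeat until a full pass discovers nothing new
        changed = False
        for name, (kind, payload) in functions.items():
            if kind == "const":
                inferred = payload
            else:
                inferred = known.get(payload)  # None until the callee is known
            if inferred is not None and known.get(name) != inferred:
                known[name] = inferred
                changed = True
    return known


# A is defined first but depends on B, which depends on C.
functions = {
    "A": ("call", "B"),
    "B": ("call", "C"),
    "C": ("const", "Int"),
}
return_types = infer_return_types(functions)
```

<p>In this model each extra link in a forward-reference chain costs one more pass, and the loop stops on the first pass that changes nothing, so mutually recursive functions with no typed base case simply stay unresolved.</p>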

<h3 id="type-aware-coercion-in-the-vm">Type-Aware Coercion in the VM</h3>

<p>The inference pass produces a frozen <code class="language-plaintext highlighter-rouge">TypeEnvironment</code> (backed by <code class="language-plaintext highlighter-rouge">MappingProxyType</code> for immutability). Since all VM values are <code class="language-plaintext highlighter-rouge">TypedValue</code> objects carrying their type alongside their value, coercion operates at two layers (described in more detail in the <a href="#two-layer-type-coercion">Two-Layer Type Coercion</a> section above):</p>

<p><strong>Layer 1 (Pre-operation):</strong> Pluggable <code class="language-plaintext highlighter-rouge">BinopCoercionStrategy</code> and <code class="language-plaintext highlighter-rouge">UnopCoercionStrategy</code> coerce operands before evaluation. Language-specific strategies (e.g., <code class="language-plaintext highlighter-rouge">JavaBinopCoercion</code> for <code class="language-plaintext highlighter-rouge">String + int</code> auto-stringification) override the defaults.</p>

<p><strong>Layer 2 (Write-time):</strong> The <code class="language-plaintext highlighter-rouge">TypeConversionRules</code> interface coerces values when storing into typed registers. The default rules (<code class="language-plaintext highlighter-rouge">DefaultTypeConversionRules</code>) handle:</p>

<ul>
  <li><strong>Widening</strong>: Int → Float (lossless promotion)</li>
  <li><strong>Narrowing</strong>: Float → Int (truncate toward zero, matching C/Java/COBOL semantics)</li>
  <li><strong>Bool promotion</strong>: Bool → Int</li>
  <li><strong>Arithmetic result types</strong>: Int ÷ Int → Int (floor division), Int + Float → Float</li>
  <li><strong>Comparison results</strong>: any comparison → Bool</li>
</ul>

<p>This two-layer coercion means the VM produces correctly typed results regardless of the source language. A Go program that declares <code class="language-plaintext highlighter-rouge">var x int = 7 / 2</code> gets <code class="language-plaintext highlighter-rouge">3</code> (integer division), not <code class="language-plaintext highlighter-rouge">3.5</code>. A COBOL program with PIC 9(4) fields truncates float assignments. The type system makes this automatic rather than requiring per-language special cases in the VM.</p>
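<p>A minimal sketch of the write-time layer, covering only the widening, narrowing, and Bool promotion rules listed above. The function name and the use of plain Python values instead of <code class="language-plaintext highlighter-rouge">TypedValue</code> objects are assumptions; this is not the <code class="language-plaintext highlighter-rouge">DefaultTypeConversionRules</code> implementation.</p>

```python
import math


def coerce_on_write(value, declared_type):
    """Coerce a Python value to the declared canonical type of its target."""
    # bool is a subclass of int in Python, so test for it first.
    if isinstance(value, bool):
        return int(value) if declared_type == "Int" else value
    if declared_type == "Float" and isinstance(value, int):
        return float(value)       # widening: Int to Float
    if declared_type == "Int" and isinstance(value, float):
        return math.trunc(value)  # narrowing: truncate toward zero
    return value                  # already compatible, or no rule applies


stored = coerce_on_write(7 / 2, "Int")  # a float result written to an Int target
```

<p>Using <code class="language-plaintext highlighter-rouge">math.trunc</code> rather than <code class="language-plaintext highlighter-rouge">math.floor</code> is what gives truncation toward zero for negative values.</p>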

<h3 id="selfthis-typing">Self/This Typing</h3>

<p>Inside class definitions, the inference pass recognises <code class="language-plaintext highlighter-rouge">self</code>, <code class="language-plaintext highlighter-rouge">this</code>, and <code class="language-plaintext highlighter-rouge">$this</code> parameter names and assigns them the enclosing class type. This enables method return type resolution: when <code class="language-plaintext highlighter-rouge">self.method()</code> is called, the pass knows the receiver’s class and can look up the method’s return type in the class-scoped method type map.</p>

<h3 id="interface-aware-inference">Interface-Aware Inference</h3>

<p>When a variable is typed as an interface (<code class="language-plaintext highlighter-rouge">Animal animal = ...</code>) and a method is called on it (<code class="language-plaintext highlighter-rouge">animal.speak()</code>), the inference pass needs to resolve the return type without knowing the concrete class. The solution is a chain walk: when class method lookup fails, the pass walks the <code class="language-plaintext highlighter-rouge">interface_implementations</code> map to find the interface that the variable’s type implements, then looks up the method’s return type from the interface’s method definitions.</p>

<p>Five frontends (Java, C#, TypeScript, Kotlin, Go) seed <code class="language-plaintext highlighter-rouge">interface_implementations</code> during lowering. Interfaces are lowered as <code class="language-plaintext highlighter-rouge">CLASS</code> blocks with method definitions, so their return types are available in the function registry.</p>
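
<p>A rough sketch of the chain walk follows. The dictionary shapes and the <code class="language-plaintext highlighter-rouge">resolve_method_return</code> helper are assumptions for illustration; only the name <code class="language-plaintext highlighter-rouge">interface_implementations</code> comes from the description above, and the real map may be structured differently.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical data shapes, using plain strings for type names.
method_return_types = {
    "Animal": {"speak": "String"},   # interface lowered as a CLASS block
    "Dog": {"fetch": "Bool"},        # concrete class with its own method
}
interface_implementations = {"Dog": ["Animal"]}


def resolve_method_return(receiver_type, method):
    """Look on the receiver's own class first, then walk its interfaces."""
    direct = method_return_types.get(receiver_type, {})
    if method in direct:
        return direct[method]
    for interface in interface_implementations.get(receiver_type, []):
        found = resolve_method_return(interface, method)
        if found is not None:
            return found
    return None   # unknown: the caller leaves the register untyped


print(resolve_method_return("Dog", "fetch"))    # Bool
print(resolve_method_return("Dog", "speak"))    # String
print(resolve_method_return("Dog", "bark"))     # None
</code></pre></div></div>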

<h3 id="method-signatures">Method Signatures</h3>

<p>Class methods are stored in a <code class="language-plaintext highlighter-rouge">method_signatures</code> dictionary keyed by <code class="language-plaintext highlighter-rouge">ScalarType</code>, with a <code class="language-plaintext highlighter-rouge">FunctionKind</code> enum (<code class="language-plaintext highlighter-rouge">UNBOUND</code>, <code class="language-plaintext highlighter-rouge">INSTANCE</code>, <code class="language-plaintext highlighter-rouge">STATIC</code>) distinguishing method types. This class-scoped storage eliminates method name collisions — the same method name in different classes no longer overwrites a single flat entry — and supports method overload accumulation.</p>
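
<p>As an illustration of why class-scoped storage avoids collisions, here is a small sketch. The <code class="language-plaintext highlighter-rouge">Signature</code> record and the <code class="language-plaintext highlighter-rouge">register</code> helper are hypothetical, and plain strings stand in for the <code class="language-plaintext highlighter-rouge">ScalarType</code> keys.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict
from dataclasses import dataclass
from enum import Enum, auto


class FunctionKind(Enum):
    UNBOUND = auto()
    INSTANCE = auto()
    STATIC = auto()


@dataclass(frozen=True)
class Signature:
    # Hypothetical record; field names are illustrative only.
    params: tuple
    return_type: str
    kind: FunctionKind


# class name, then method name, then the list of accumulated overloads
method_signatures = defaultdict(lambda: defaultdict(list))


def register(class_name, method, signature):
    method_signatures[class_name][method].append(signature)


register("Dog", "speak", Signature((), "String", FunctionKind.INSTANCE))
register("Cat", "speak", Signature((), "String", FunctionKind.INSTANCE))
register("Dog", "speak", Signature(("Int",), "String", FunctionKind.INSTANCE))

print(len(method_signatures["Dog"]["speak"]))  # 2
print(len(method_signatures["Cat"]["speak"]))  # 1
</code></pre></div></div>

<p>With a single flat map keyed only by method name, the second and third registrations would have overwritten the first.</p>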

<h3 id="design-properties-1">Design Properties</h3>

<p><strong>Pure function.</strong> <code class="language-plaintext highlighter-rouge">infer_types()</code> takes a list of IR instructions and a <code class="language-plaintext highlighter-rouge">TypeResolver</code>, returns a <code class="language-plaintext highlighter-rouge">TypeEnvironment</code>. No mutation of the input instructions. No side effects.</p>

<p><strong>Fixpoint convergence.</strong> The pass repeats until no new types are discovered, resolving forward references across function boundaries. Convergence is measured by the combined size of <code class="language-plaintext highlighter-rouge">register_types</code> and <code class="language-plaintext highlighter-rouge">func_return_types</code>. Programs without forward references converge in one pass (no performance penalty). Each handler’s “skip if already known” guards prevent clobbering types from earlier passes while allowing unfilled gaps to be resolved on subsequent passes.</p>

<p><strong>Pluggable ontology.</strong> The <code class="language-plaintext highlighter-rouge">TypeGraph</code>, <code class="language-plaintext highlighter-rouge">TypeConversionRules</code>, and <code class="language-plaintext highlighter-rouge">TypeResolver</code> are all injected. A different language family (e.g., one with unsigned integers or decimal types) can supply its own rules without changing the inference engine.</p>

<p><strong>Structured types end-to-end.</strong> The entire type pipeline — from frontend extraction through inference, coercion, and environment output — operates on <code class="language-plaintext highlighter-rouge">TypeExpr</code> objects exclusively. The <code class="language-plaintext highlighter-rouge">str | TypeExpr</code> union that existed during the migration has been fully removed: all seed sites, all 15 frontends, and all downstream consumers (inference context, type resolver, conversion rules) work with the structured representation directly. There are no string serialization boundaries.</p>
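
<p>The fixpoint property is easiest to see on a toy example. The sketch below is not the real <code class="language-plaintext highlighter-rouge">infer_types()</code>: the tuple-based IR, the handful of opcodes, and the <code class="language-plaintext highlighter-rouge">infer_to_fixpoint</code> name are assumptions. It does follow the description above in two respects: convergence is measured by the combined size of the two maps, and every write is guarded so that known types are never clobbered.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def infer_to_fixpoint(instructions):
    """Repeat a forward pass until no new types are discovered."""
    register_types = {}
    func_return_types = {}
    passes = 0
    while True:
        before = len(register_types) + len(func_return_types)
        passes += 1
        current_func = None
        for op, *args in instructions:
            if op == "FUNC":
                current_func = args[0]
            elif op == "CONST":
                dest, value = args
                # setdefault is the "skip if already known" guard
                register_types.setdefault(dest, type(value).__name__)
            elif op == "RETURN":
                reg = args[0]
                if reg in register_types:
                    func_return_types.setdefault(current_func, register_types[reg])
            elif op == "CALL":
                dest, callee = args
                if callee in func_return_types:
                    register_types.setdefault(dest, func_return_types[callee])
        after = len(register_types) + len(func_return_types)
        if after == before:
            return register_types, func_return_types, passes


# main calls f before f has been seen: a forward reference.
program = [
    ("FUNC", "main"),
    ("CALL", "%0", "f"),
    ("RETURN", "%0"),
    ("FUNC", "f"),
    ("CONST", "%1", 42),
    ("RETURN", "%1"),
]

registers, returns, passes = infer_to_fixpoint(program)
print(registers["%0"], returns["main"], passes)   # int int 3
</code></pre></div></div>

<p>In this sketch the final pass only confirms that nothing changed, so the forward reference costs three passes in total: one to learn the return type of <code class="language-plaintext highlighter-rouge">f</code>, one to propagate it into <code class="language-plaintext highlighter-rouge">main</code>, and one to detect convergence.</p>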

<hr />

<h2 id="cross-language-type-inference-in-practice">Cross-Language Type Inference in Practice</h2>

<p><em>See also: <a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/ir-lowering-gaps.md">IR Lowering Gaps</a></em></p>

<p>The type inference engine described above is language-agnostic: it operates on IR instructions without knowing which frontend produced them. But making it work <em>correctly</em> across 15 languages required solving language-specific lowering gaps — places where idiomatic code in one language produced IR that the inference pass couldn’t reason about.</p>

<h3 id="the-end-to-end-flow">The End-to-End Flow</h3>

<p>The full pipeline from source to typed environment looks like this:</p>

<pre><code class="language-mermaid">%%{ init: { "flowchart": { "curve": "stepBefore" } } }%%
flowchart LR
    subgraph Frontend ["Frontend (per-language)"]
        SRC("📄 Source Code"):::input --&gt; PARSE("🌳 tree-sitter Parse"):::step
        PARSE --&gt; LOWER("⬇️ Lower to IR"):::step
        LOWER --&gt; SEED("🌱 Seed Type&lt;br/&gt;Environment Builder"):::step
    end
    subgraph Inference ["Type Inference (language-agnostic)"]
        SEED --&gt; MERGE("🔗 Merge Seeds"):::infer
        MERGE --&gt; WALK("🔄 Fixpoint Loop&lt;br/&gt;&lt;i&gt;over IR&lt;/i&gt;"):::infer
        WALK --&gt; ENV("❄️ Frozen&lt;br/&gt;TypeEnvironment"):::output
    end
    ENV --&gt; VM("⚙️ VM&lt;br/&gt;&lt;i&gt;type-aware coercion&lt;/i&gt;"):::engine

    classDef input fill:#e8f4fd,stroke:#4a90d9,stroke-width:2px,color:#1a3a5c
    classDef step fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef infer fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
    classDef engine fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>The critical contract: <strong>frontends must produce IR that the inference pass can consume</strong>. If a frontend drops a return value, the inference pass has nothing to propagate. If a field access is lowered as a variable load instead of a field load, field tracking breaks.</p>

<h3 id="field-type-tracking-across-oop-languages">Field Type Tracking Across OOP Languages</h3>

<p>Eight OOP languages (Python, Java, C#, C++, JavaScript, TypeScript, PHP, Scala) support field type tracking through <code class="language-plaintext highlighter-rouge">this</code>/<code class="language-plaintext highlighter-rouge">self</code>. The mechanism:</p>

<pre><code class="language-mermaid">%%{ init: { "flowchart": { "curve": "stepBefore" } } }%%
flowchart TD
    STORE("STORE_FIELD %this 'age' %val&lt;br/&gt;&lt;i&gt;%this typed as Dog, %val typed as Int&lt;/i&gt;"):::ir
    FT("🏷️ field_types['Dog']['age'] = Int"):::infer
    LOAD("LOAD_FIELD %this 'age'&lt;br/&gt;&lt;i&gt;%this typed as Dog&lt;/i&gt;"):::ir
    RESULT("✅ %result typed as Int"):::output

    STORE --&gt; FT --&gt; LOAD --&gt; RESULT

    classDef ir fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef infer fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>This only works if:</p>
<ol>
  <li>The <code class="language-plaintext highlighter-rouge">self</code>/<code class="language-plaintext highlighter-rouge">this</code> parameter register is typed as the enclosing class (handled by each frontend’s <code class="language-plaintext highlighter-rouge">_emit_this_param</code> / <code class="language-plaintext highlighter-rouge">_emit_self_param</code>)</li>
  <li><code class="language-plaintext highlighter-rouge">this.field</code> access is lowered as <code class="language-plaintext highlighter-rouge">LOAD_FIELD</code>, not <code class="language-plaintext highlighter-rouge">LOAD_VAR</code></li>
</ol>
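
<p>Both conditions show up in a small sketch of the tracking step. The tuple-based IR and the <code class="language-plaintext highlighter-rouge">track_field_types</code> function are assumptions for illustration; only the opcode names and the <code class="language-plaintext highlighter-rouge">field_types</code> map come from the diagram above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def track_field_types(instructions, seed_types):
    """Single pass over toy IR tuples; returns (field_types, register_types)."""
    types = dict(seed_types)
    field_types = {}
    for op, *args in instructions:
        if op == "STORE_FIELD":
            obj, field, val = args
            if obj in types and val in types:
                field_types.setdefault(types[obj], {})[field] = types[val]
        elif op == "LOAD_FIELD":
            dest, obj, field = args
            known = field_types.get(types.get(obj), {})
            if field in known:
                types.setdefault(dest, known[field])
        # LOAD_VAR and other opcodes carry no field information here
    return field_types, types


seed = {"%this": "Dog", "%val": "Int"}   # condition 1: the receiver is typed
good = [("STORE_FIELD", "%this", "age", "%val"),
        ("LOAD_FIELD", "%result", "%this", "age")]
bad = [("STORE_FIELD", "%this", "age", "%val"),
       ("LOAD_VAR", "%result", "age")]   # violates condition 2

print(track_field_types(good, seed)[1].get("%result"))  # Int
print(track_field_types(bad, seed)[1].get("%result"))   # None
</code></pre></div></div>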

<h3 id="return-backfill-and-expression-bodied-functions">Return Backfill and Expression-Bodied Functions</h3>

<p>Return backfill infers a function’s return type from its <code class="language-plaintext highlighter-rouge">RETURN</code> instructions:</p>

<pre><code class="language-mermaid">%%{ init: { "flowchart": { "curve": "stepBefore" } } }%%
flowchart LR
    CONST("%0 = CONST 42&lt;br/&gt;&lt;i&gt;typed as Int&lt;/i&gt;"):::ir
    RET("RETURN %0"):::ir
    BF("🔄 Backfill&lt;br/&gt;func_return_types['f'] = Int"):::infer
    CALL("result = f()&lt;br/&gt;CALL_FUNCTION"):::ir
    TYPED("✅ %result typed as Int"):::output

    CONST --&gt; RET --&gt; BF --&gt; CALL --&gt; TYPED

    classDef ir fill:#fff3e0,stroke:#e8a735,stroke-width:2px,color:#5c3a0a
    classDef infer fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#3a0a3a
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a3a1a
</code></pre>

<p>This works for explicit <code class="language-plaintext highlighter-rouge">return</code> statements. But three languages have idioms where the return value is implicit:</p>

<ul>
  <li><strong>Scala</strong> expression-bodied functions: <code class="language-plaintext highlighter-rouge">def f() = 42</code> (the body <em>is</em> the return value)</li>
  <li><strong>Kotlin</strong> expression-bodied functions: <code class="language-plaintext highlighter-rouge">fun f() = 42</code></li>
  <li><strong>Ruby</strong> implicit return: the last expression in a method is the return value</li>
</ul>

<p>In all three cases, the original frontend lowering discarded the expression value and unconditionally emitted a default nil return. The inference pass saw <code class="language-plaintext highlighter-rouge">RETURN nil</code> and could not backfill the actual type.</p>

<h3 id="the-fix-pattern">The Fix Pattern</h3>

<p>The fix for all three languages followed the same principle: <strong>detect when a function body is a bare expression rather than a block of statements, and wire the expression’s register to the RETURN instruction</strong>.</p>

<p>For Scala, <code class="language-plaintext highlighter-rouge">lower_function_def</code> checks if <code class="language-plaintext highlighter-rouge">body_node</code> is a block type or a bare expression. If bare, it calls <code class="language-plaintext highlighter-rouge">lower_expr</code> and emits <code class="language-plaintext highlighter-rouge">RETURN</code> with the result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Before: def getAge(): Int = this.age
func_getAge:
  %0 = load_var this    # ← wrong: should be load_field
  %1 = load_var age     # ← body iterated as children
  %2 = const ()
  return %2             # ← default nil return

# After:
func_getAge:
  %0 = load_var this
  %1 = load_field %0 age   # ← field_expression lowered correctly
  return %1                 # ← expression value returned
</code></pre></div></div>

<p>For Ruby, a <code class="language-plaintext highlighter-rouge">_lower_body_with_implicit_return</code> helper identifies the last named child of the method body. If it’s an expression (not a statement like <code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">while</code>, or <code class="language-plaintext highlighter-rouge">return</code>), it lowers it via <code class="language-plaintext highlighter-rouge">lower_expr</code> and returns its register:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Before: def get_age; @age; end
func_get_age:
  %0 = load_var self
  %1 = load_field %0 age    # value loaded...
  %2 = const None
  return %2                  # ...but discarded

# After:
func_get_age:
  %0 = load_var self
  %1 = load_field %0 age
  return %1                  # implicit return wired
</code></pre></div></div>

<h3 id="cross-language-type-inference-test-matrix">Cross-Language Type Inference Test Matrix</h3>

<p>With all three gaps fixed, the integration test suite verifies type inference scenarios across all applicable languages:</p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Languages Tested</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>BINOP (Int + Int → Int)</td>
      <td>Java, Go, C, C++, C#, Rust, Python, JavaScript, TypeScript, Kotlin, Scala, PHP, Lua, Ruby, Pascal</td>
    </tr>
    <tr>
      <td>BINOP (Int + Float → Float)</td>
      <td>Java, Go, C, C++, C#, Rust, Python, JavaScript, TypeScript, Kotlin, Scala, PHP, Lua, Ruby</td>
    </tr>
    <tr>
      <td>Comparison → Bool</td>
      <td>Java, Go, C, C++, C#, Rust, Python, JavaScript, TypeScript, Kotlin, Scala, PHP, Lua, Ruby</td>
    </tr>
    <tr>
      <td>UNOP (not/!) → Bool</td>
      <td>Java, C, C++, C#, Python, JavaScript, TypeScript, Kotlin, Scala, PHP, Lua, Ruby</td>
    </tr>
    <tr>
      <td>Return backfill</td>
      <td>Lua, PHP, TypeScript, Kotlin, Scala</td>
    </tr>
    <tr>
      <td>Typed param seeding</td>
      <td>Java, Go, C, C++, C#, Rust, TypeScript, Kotlin, Scala</td>
    </tr>
    <tr>
      <td>Field type tracking (OOP)</td>
      <td>Python, Java, C#, C++, JavaScript, TypeScript, PHP, Scala</td>
    </tr>
    <tr>
      <td>CALL_METHOD return types</td>
      <td>Python, Java, C#, C++, JavaScript, TypeScript, Kotlin, Scala, PHP, Ruby</td>
    </tr>
    <tr>
      <td>NEW_OBJECT typing</td>
      <td>JavaScript, TypeScript, PHP, Ruby, Scala</td>
    </tr>
    <tr>
      <td>Builtin method return types</td>
      <td>Python, JavaScript, Java, Ruby, Kotlin</td>
    </tr>
    <tr>
      <td>Forward reference resolution</td>
      <td>Python, JavaScript, Ruby</td>
    </tr>
    <tr>
      <td>Interface method return types</td>
      <td>Java, C#, TypeScript, Kotlin, Go</td>
    </tr>
  </tbody>
</table>

<p>Each cell is a parametrized pytest fixture — a failure in one language doesn’t mask failures in others.</p>

<hr />

<h2 id="cross-language-verification-via-exercism">Cross-Language Verification via Exercism</h2>

<p>The broadest verification effort was the Exercism integration test suite. The idea: take Exercism’s canonical test cases (which define expected inputs and outputs for programming exercises), write equivalent solutions in all 15 languages, and verify that RedDragon’s pipeline produces the correct answer for every case in every language.</p>

<p>The 18 exercises span a range of constructs: modulo and boolean logic (leap), while loops (collatz-conjecture), accumulators (difference-of-squares), string operations (two-fer, hamming, reverse-string, rna-transcription), nested loops (isogram, pangram), multi-branch classification (bob), float arithmetic (space-age), and multi-pass validation (luhn). For each exercise, every canonical test case generates tests across three dimensions:</p>

<ol>
  <li><strong>Lowering quality</strong>: does the IR contain any <code class="language-plaintext highlighter-rouge">unsupported:</code> SYMBOLIC? (15 tests per exercise)</li>
  <li><strong>Cross-language consistency</strong>: do all 15 languages produce structurally equivalent IR? (2 tests per exercise)</li>
  <li><strong>VM execution correctness</strong>: does the VM produce the expected output? (cases × languages tests)</li>
</ol>

<p>The argument substitution mechanism deserves a mention: a <code class="language-plaintext highlighter-rouge">build_program()</code> helper finds the <code class="language-plaintext highlighter-rouge">answer = f(default_arg)</code> line in each solution and substitutes new arguments for each canonical test case. This works across languages with different assignment syntaxes (<code class="language-plaintext highlighter-rouge">=</code>, <code class="language-plaintext highlighter-rouge">:=</code>, <code class="language-plaintext highlighter-rouge">: type =</code>) via regex.</p>
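
<p>One way such a substitution could work is sketched below. The regular expression and the function body are assumptions, not the project’s actual <code class="language-plaintext highlighter-rouge">build_program()</code>; the sketch only shows that a single pattern can cover the three assignment syntaxes mentioned above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Group 1: everything up to and including the opening parenthesis of the call.
# Group 2: the default arguments. Group 3: the closing parenthesis and the rest.
# Covers `answer = f(...)`, `answer := f(...)` and `answer: Type = f(...)`.
ANSWER_LINE = re.compile(
    r"^(.*\banswer\b[^=\n]*=\s*\w+\()(.*)(\).*)$",
    re.MULTILINE,
)


def build_program(source, new_args):
    """Replace the arguments of the first `answer = f(...)` line (sketch)."""
    result, count = ANSWER_LINE.subn(
        lambda m: m.group(1) + new_args + m.group(3), source, count=1
    )
    if count == 0:
        raise ValueError("no answer line found")
    return result


python_src = "def f(y):\n    return y % 4 == 0\n\nanswer = f(2000)\n"
go_src = "answer := f(2000)"
kotlin_src = "val answer: Boolean = f(2000)"

print(build_program(go_src, "1996"))       # answer := f(1996)
print(build_program(kotlin_src, "1996"))   # val answer: Boolean = f(1996)
</code></pre></div></div>

<p>Using a function as the replacement keeps backslashes or quotes in the substituted arguments from being interpreted as regex replacement escapes.</p>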

<p>The Exercism suite surfaced more bugs than any other test approach. Each exercise exposed new gaps: Ruby’s <code class="language-plaintext highlighter-rouge">parenthesized_statements</code> vs Python’s <code class="language-plaintext highlighter-rouge">parenthesized_expression</code>, Rust’s expression-position loops, Pascal’s single-quote string escaping, PHP’s <code class="language-plaintext highlighter-rouge">.</code> concatenation operator. Every gap found was a bug fixed.</p>

<hr />

<h2 id="the-numbers">The Numbers</h2>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Supported languages</td>
      <td>15 (deterministic) + COBOL (ProLeap) + any (LLM)</td>
    </tr>
    <tr>
      <td>IR opcodes</td>
      <td>33</td>
    </tr>
    <tr>
      <td>Tests (all passing)</td>
      <td>13,005</td>
    </tr>
    <tr>
      <td>LLM calls at test time</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Exercism exercises</td>
      <td>18 (across 15 languages)</td>
    </tr>
    <tr>
      <td>Rosetta algorithms</td>
      <td>15 (across 15 languages)</td>
    </tr>
    <tr>
      <td>Type inference scenarios</td>
      <td>12 (verified across up to 15 languages each)</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>RedDragon started as a question: <em>“Can I build a single system that analyses code in any language?”</em> It evolved into a compiler pipeline with:</p>

<ul>
  <li>15 deterministic frontends</li>
  <li>a COBOL frontend via ProLeap</li>
  <li>LLM-assisted AST repair</li>
  <li>a fully typed instruction set (33 per-opcode frozen dataclasses with <code class="language-plaintext highlighter-rouge">Register</code>, <code class="language-plaintext highlighter-rouge">CodeLabel</code>, <code class="language-plaintext highlighter-rouge">BinopKind</code>/<code class="language-plaintext highlighter-rouge">UnopKind</code>, and <code class="language-plaintext highlighter-rouge">TypeExpr</code> domain types replacing all raw strings)</li>
  <li>a structured type system (an algebraic TypeExpr ADT with generics, unions, function types, enums, annotation types, struct patterns, tuples, type aliases, interface/trait typing, variance annotations, and bounded type variables)</li>
  <li>static type inference with fixpoint convergence, interface-aware chain walk, and structural generic extraction across 12 statically-typed languages</li>
  <li>a deterministic VM with class hierarchy support (inherited method dispatch via linearized parent chains across 10 OOP languages)</li>
  <li><code class="language-plaintext highlighter-rouge">TypedValue</code>-based runtime type propagation with two-layer coercion</li>
  <li>type-aware overload resolution</li>
  <li>a symbol table with cross-class field resolution and implicit-this support</li>
  <li>cross-language slicing</li>
  <li>rest pattern destructuring</li>
  <li>structural pattern matching via a shared Pattern ADT</li>
  <li>byte-addressed memory regions and named continuations</li>
  <li>cross-language verification</li>
</ul>

<p><strong>None of the individual components are novel.</strong> TAC IR, dispatch tables, worklist dataflow, and forward type inference are all textbook techniques. The value, if any, is in applying them together to a practical multi-language analysis tool.</p>

<hr />

<h2 id="references">References</h2>

<p>Design documents and detailed specs from the RedDragon repository:</p>

<ul>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/ir-reference.md">IR Reference</a> — Full specification of all 33 opcodes, per-opcode typed instruction classes, and lowering conventions</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/notes-on-frontend-design.md">Frontend Design</a> — Architecture of the frontend subsystem: dispatch tables, AST repair, LLM frontends</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/tree/main/docs/frontend-design">Per-Language Frontend Docs</a> — Exhaustive per-file documentation for all 15 language frontends and the base frontend</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/notes-on-vm-design.md">VM Design</a> — VM internals: state model, opcode dispatch, symbolic propagation, closures, class hierarchy</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/notes-on-dataflow-design.md">Dataflow Design</a> — Reaching definitions, def-use chains, and dependency graph construction</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/type-system.md">Type System Design</a> — Type ontology, inference algorithm, coercion rules, and cross-language type extraction</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/architectural-design-decisions.md">Architectural Decision Records</a> — Chronological log of key design decisions (ADR-001 through ADR-121)</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/docs/ir-lowering-gaps.md">IR Lowering Gaps</a> — Tracking document for cross-language type inference lowering gaps</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/PHILOSOPHY.md">Project Philosophy</a> — Design principles and engineering values</li>
  <li><a href="https://github.com/avishek-sen-gupta/red-dragon/blob/main/CONTRIBUTING.md">Contributing Guide</a> — How to contribute to the project</li>
</ul>

<p><em>This post has not been written or edited by AI.</em></p>]]></content><author><name>avishek</name></author><category term="Software Engineering" /><category term="Compilers" /><category term="Program Analysis" /><category term="AI-Assisted Development" /><summary type="html"><![CDATA[A universal IR with per-opcode typed instructions, 15 deterministic frontends, LLM-assisted repair/lowering/execution, a deterministic VM with class hierarchy support, overload resolution, and cross-language slicing, a structured type system with generics/unions/variance/traits and interface-aware inference, and iterative dataflow analysis.]]></summary></entry><entry><title type="html">Building a simple Virtual Machine in Prolog</title><link href="https://avishek.net/2025/07/03/building-vm-in-prolog.html" rel="alternate" type="text/html" title="Building a simple Virtual Machine in Prolog" /><published>2025-07-03T00:00:00+05:30</published><updated>2025-07-03T00:00:00+05:30</updated><id>https://avishek.net/2025/07/03/building-vm-in-prolog</id><content type="html" xml:base="https://avishek.net/2025/07/03/building-vm-in-prolog.html"><![CDATA[<p>In this post, I’ll talk about how I wrote a small <strong>Virtual Machine</strong> in <strong>Prolog</strong> which can both interpret concrete assembly language-like programs, and run basic symbolic executions, which are useful in data flow analyses of programs. The full code is available in this <a href="https://github.com/asengupta/prolog-exercises/blob/main/prolog_examples/symbolic_executor.pl">repository</a>.</p>

<p><em>This post has not been written or edited by AI.</em></p>

<h2 id="building-a-simple-virtual-machine">Building a simple Virtual Machine</h2>

<p>One of my favourite exercises when learning a new language is to build something that exercises its non-trivial capabilities, like <strong>pattern matching</strong> or the flexibility of its <strong>data structures</strong>, or that exposes the <strong>brevity</strong> of expressing ideas in the language. Building a simple virtual machine forces one to reckon with ideas like expression trees, recursive traversals, term rewriting, etc.</p>

<p>I’ve also been reading recently about <strong>symbolic execution</strong> as a way to perform dataflow analysis. As an added challenge, I chose to enhance the concrete interpretation of a program with symbolic execution capabilities.</p>

<h2 id="foundational-operations-from-the-ground-up">Foundational Operations from the Ground Up</h2>

<p>We need a <strong>dictionary</strong> implementation. SWI-Prolog has built-in dictionaries, but since we are building everything from scratch, we will write a very naive implementation using only lists. This implementation does not enforce every dictionary invariant (for example, nothing stops you from starting off with duplicate keys), but let’s assume the happy path for now.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">get2</span><span class="p">(</span><span class="nv">_</span><span class="p">,[],</span><span class="ss">empty</span><span class="p">).</span>
<span class="ss">get2</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span> <span class="p">[(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">VX</span><span class="p">))|</span><span class="nv">_</span><span class="p">],</span><span class="nv">VX</span><span class="p">)</span> <span class="p">:-</span> <span class="p">!.</span>
<span class="ss">get2</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span> <span class="p">[</span><span class="nv">_</span><span class="p">|</span><span class="nv">T</span><span class="p">],</span><span class="nv">R</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">get2</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">T</span><span class="p">,</span><span class="nv">R</span><span class="p">).</span>

<span class="ss">put2_</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">),[],</span><span class="nv">Replaced</span><span class="p">,</span><span class="nv">R</span><span class="p">)</span> <span class="p">:-</span> <span class="nv">Replaced</span><span class="o">-&gt;</span><span class="nv">R</span><span class="o">=</span><span class="p">[];</span><span class="nv">R</span><span class="o">=</span><span class="p">[</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">)].</span>
<span class="ss">put2_</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">),[</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">_</span><span class="p">)|</span><span class="nv">T</span><span class="p">],</span><span class="nv">_</span><span class="p">,[</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">)|</span><span class="nv">RX</span><span class="p">])</span> <span class="p">:-</span> <span class="ss">put2_</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">),</span><span class="nv">T</span><span class="p">,</span><span class="ss">true</span><span class="p">,</span><span class="nv">RX</span><span class="p">).</span>
<span class="ss">put2_</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">),[</span><span class="nv">H</span><span class="p">|</span><span class="nv">T</span><span class="p">],</span><span class="nv">Replaced</span><span class="p">,[</span><span class="nv">H</span><span class="p">|</span><span class="nv">RX</span><span class="p">])</span> <span class="p">:-</span> <span class="ss">put2_</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">),</span><span class="nv">T</span><span class="p">,</span><span class="nv">Replaced</span><span class="p">,</span><span class="nv">RX</span><span class="p">).</span>

<span class="ss">put2</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">),</span><span class="nv">Map</span><span class="p">,</span><span class="nv">R</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">put2_</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">K</span><span class="p">,</span><span class="nv">V</span><span class="p">),</span><span class="nv">Map</span><span class="p">,</span><span class="ss">false</span><span class="p">,</span><span class="nv">R</span><span class="p">).</span>
</code></pre></div></div>

<p>To represent entries in a dictionary, we use the <code class="language-plaintext highlighter-rouge">K-V</code> compound term, which is basically syntactic sugar for <code class="language-plaintext highlighter-rouge">-(K,V)</code>. These entries live inside a list. Both <code class="language-plaintext highlighter-rouge">get2</code> and <code class="language-plaintext highlighter-rouge">put2</code> behave in predictable ways, except when the dictionary has duplicate keys. In that case:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">get2(K,Map,V)</code> returns the value of the first matching key.</li>
  <li><code class="language-plaintext highlighter-rouge">put2(-(K,V),InputMap,OutputMap)</code> modifies all matching keys with the value <code class="language-plaintext highlighter-rouge">V</code>.</li>
</ul>

<p>In our current implementation, we will not worry about duplicate entries yet.</p>

<p>We will also need push/pop operations on stacks. This is very simple. Note that the top of the stack is always the leftmost element.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">push_</span><span class="p">(</span><span class="nv">V</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">UpdatedStack</span><span class="p">)</span> <span class="p">:-</span> <span class="nv">UpdatedStack</span><span class="o">=</span><span class="p">[</span><span class="nv">V</span><span class="p">|</span><span class="nv">Stack</span><span class="p">].</span>
<span class="ss">pop_</span><span class="p">([],</span><span class="ss">empty</span><span class="p">,[]).</span>
<span class="ss">pop_</span><span class="p">([</span><span class="nv">H</span><span class="p">|</span><span class="nv">Rest</span><span class="p">],</span><span class="nv">H</span><span class="p">,</span><span class="nv">Rest</span><span class="p">).</span>
</code></pre></div></div>

<h2 id="logging">Logging</h2>

<p>We will be <strong>logging</strong> quite a bit inside the rules. Thus it is important to have a structured way of logging different levels, like <code class="language-plaintext highlighter-rouge">DEBUG</code>, <code class="language-plaintext highlighter-rouge">INFO</code>, <code class="language-plaintext highlighter-rouge">WARNING</code>, etc. This is what a basic logging setup looks like:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">log_with_level</span><span class="p">(</span><span class="nv">LogLevel</span><span class="p">,</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">format</span><span class="p">(</span><span class="ss">string</span><span class="p">(</span><span class="nv">Message</span><span class="p">),</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">),</span><span class="ss">format</span><span class="p">(</span><span class="ss">'[~w]: ~w~n'</span><span class="p">,[</span><span class="nv">LogLevel</span><span class="p">,</span><span class="nv">Message</span><span class="p">]).</span>

<span class="ss">debug</span><span class="p">(</span><span class="nv">Message</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'DEBUG'</span><span class="p">,</span><span class="nv">Message</span><span class="p">,[]).</span>
<span class="ss">debug</span><span class="p">(</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'DEBUG'</span><span class="p">,</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">).</span>

<span class="ss">info</span><span class="p">(</span><span class="nv">Message</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'INFO'</span><span class="p">,</span><span class="nv">Message</span><span class="p">,[]).</span>
<span class="ss">info</span><span class="p">(</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'INFO'</span><span class="p">,</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">).</span>

<span class="ss">warning</span><span class="p">(</span><span class="nv">Message</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'WARN'</span><span class="p">,</span><span class="nv">Message</span><span class="p">,[]).</span>
<span class="ss">warning</span><span class="p">(</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'WARN'</span><span class="p">,</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">).</span>

<span class="ss">error</span><span class="p">(</span><span class="nv">Message</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'ERROR'</span><span class="p">,</span><span class="nv">Message</span><span class="p">,[]).</span>
<span class="ss">error</span><span class="p">(</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">log_with_level</span><span class="p">(</span><span class="ss">'ERROR'</span><span class="p">,</span><span class="nv">FormatString</span><span class="p">,</span><span class="nv">Args</span><span class="p">).</span>

<span class="ss">dont_log</span><span class="p">(</span><span class="nv">_</span><span class="p">).</span>
<span class="ss">dont_log</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">).</span>
</code></pre></div></div>
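
<p>As a quick illustration of how these predicates behave, the sketch below repeats <code class="language-plaintext highlighter-rouge">log_with_level</code> and <code class="language-plaintext highlighter-rouge">info</code> from above so that it loads on its own, and adds a hypothetical <code class="language-plaintext highlighter-rouge">demo_logging</code> predicate which is not part of the original code. It assumes a Prolog whose <code class="language-plaintext highlighter-rouge">format/3</code> accepts a <code class="language-plaintext highlighter-rouge">string(...)</code> sink, as SWI-Prolog’s does.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Copied from the definitions above so that this file loads on its own.
log_with_level(LogLevel,FormatString,Args) :-
    format(string(Message),FormatString,Args),
    format('[~w]: ~w~n',[LogLevel,Message]).

info(Message) :- log_with_level('INFO',Message,[]).
info(FormatString,Args) :- log_with_level('INFO',FormatString,Args).

% demo_logging/0 is a hypothetical name used only for this illustration.
demo_logging :-
    info('Starting VM'),
    info('Loaded ~w instructions, IP=~w',[4,const(0)]).
</code></pre></div></div>

<p>Running <code class="language-plaintext highlighter-rouge">demo_logging</code> should print <code class="language-plaintext highlighter-rouge">[INFO]: Starting VM</code> followed by <code class="language-plaintext highlighter-rouge">[INFO]: Loaded 4 instructions, IP=const(0)</code>, each on its own line.</p>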

<h2 id="minimal-instruction-set">Minimal instruction set</h2>

<p>The minimal instruction set comprises the following:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">mov(reg,reg|constant)</code></li>
  <li><code class="language-plaintext highlighter-rouge">cmp(reg,reg|constant)</code></li>
  <li><code class="language-plaintext highlighter-rouge">label(name)</code></li>
  <li><code class="language-plaintext highlighter-rouge">j(label)</code></li>
  <li><code class="language-plaintext highlighter-rouge">jz(label|address)</code></li>
  <li><code class="language-plaintext highlighter-rouge">jnz(label|address)</code></li>
  <li><code class="language-plaintext highlighter-rouge">push(reg|constant)</code></li>
  <li><code class="language-plaintext highlighter-rouge">pop(reg)</code></li>
  <li><code class="language-plaintext highlighter-rouge">call(label)</code></li>
  <li><code class="language-plaintext highlighter-rouge">ret</code></li>
  <li><code class="language-plaintext highlighter-rouge">hlt</code></li>
  <li><code class="language-plaintext highlighter-rouge">term(string)</code></li>
  <li><code class="language-plaintext highlighter-rouge">nop</code></li>
  <li><code class="language-plaintext highlighter-rouge">inc(reg)</code></li>
  <li><code class="language-plaintext highlighter-rouge">dec(reg)</code></li>
  <li><code class="language-plaintext highlighter-rouge">mul(reg,reg|constant)</code></li>
</ul>

<h2 id="registers-flags-and-other-data-structures">Registers, Flags, and other Data Structures</h2>

<p>For convenience, I chose not to have a fixed number of registers; thus, you can use any symbol as a register. In this respect, we will be treating registers more like conventional variables.</p>

<p>There will be one special register called the Instruction Pointer (IP). This will point to the next instruction to be executed. Jump instructions like <code class="language-plaintext highlighter-rouge">j</code>, <code class="language-plaintext highlighter-rouge">jnz</code>, and <code class="language-plaintext highlighter-rouge">jz</code> can modify the IP to change the flow of the program.</p>

<p>The other useful data structure will be the stack, which is operated by <code class="language-plaintext highlighter-rouge">push</code>, <code class="language-plaintext highlighter-rouge">pop</code>, <code class="language-plaintext highlighter-rouge">call</code>, and <code class="language-plaintext highlighter-rouge">ret</code> (the last two use a stack to keep track of return addresses when entering and leaving procedures).</p>

<p>There will be one flag called the Zero Flag. It would probably be better named the Equals Flag, because it is set to zero if the two sides of a <code class="language-plaintext highlighter-rouge">cmp</code> are equal, and to -1 or +1 otherwise, depending upon their relative ordering.</p>

<p>Data can be of two types:</p>

<ul>
  <li><strong>Concrete data</strong>, like numbers, which would be represented like <code class="language-plaintext highlighter-rouge">const(5)</code></li>
  <li><strong>Symbolic data</strong>, which stands in for concrete data, and is used for <strong>symbolic execution</strong>, which we’ll introduce in <a href="#symbolic-execution-and-world-splits">Symbolic Execution and World Splits</a>. These are represented as <code class="language-plaintext highlighter-rouge">sym(x)</code>, <code class="language-plaintext highlighter-rouge">sym(abcd)</code>, etc.</li>
</ul>

<h2 id="memory-model">Memory Model</h2>

<p>For the purposes of this simple VM, I chose not to have any memory. I may add it later, and then I will update this post accordingly.</p>

<h2 id="execution-model">Execution Model</h2>

<p>The execution model is simple, and similar to how we’d expect a very simple single-threaded VM to behave. Every instruction is sequentially mapped to a specific memory address (simple incrementing integers for our purposes). The Instruction Pointer starts at 0. At every instruction, something can happen. Actions include:</p>

<ul>
  <li><strong>Moving data</strong> between registers</li>
  <li><strong>Loading constants</strong> into registers</li>
  <li><strong>Incrementing / Decrementing</strong> registers</li>
  <li><strong>Push / pop</strong> values to / from the stack</li>
  <li><strong>Unconditional / Conditional jumps</strong> to another address</li>
  <li><strong>Compare</strong> registers to other registers or constants</li>
  <li><strong>Halt</strong> (effectively exit the program)</li>
  <li>Declare a <strong>label</strong></li>
  <li><strong>Call</strong> and <strong>return</strong> from a <strong>procedure</strong> (defined as a label)</li>
</ul>

<p><strong>Jumps work by modifying the value of the Instruction Pointer to the destination address.</strong> Procedure calls work similarly, but with an added side effect: the address of the instruction after the <code class="language-plaintext highlighter-rouge">call</code> is pushed onto the stack: when a <code class="language-plaintext highlighter-rouge">ret</code> is encountered, the topmost value is popped off the stack and is assigned back to the Instruction Pointer. This simulates the return from the procedure.</p>
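
<p>The jump and call mechanics described above can be sketched in a few clauses. This is only an illustration and not the interpreter’s actual code: the predicate names <code class="language-plaintext highlighter-rouge">jump_to</code>, <code class="language-plaintext highlighter-rouge">call_proc</code>, and <code class="language-plaintext highlighter-rouge">return_from_proc</code> are hypothetical, and addresses are plain integers here for brevity.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Hypothetical helpers; addresses are plain integers in this sketch.

% jump_to(+Target, -NewIP): a jump simply overwrites the Instruction Pointer.
jump_to(Target, Target).

% call_proc(+Target, +IP, +Stack, -NewIP, -NewStack):
% push the address of the instruction after the call, then jump to the target.
call_proc(Target, IP, Stack, Target, [ReturnAddress|Stack]) :-
    ReturnAddress is IP + 1.

% return_from_proc(+Stack, -NewIP, -NewStack):
% pop the topmost return address and assign it to the Instruction Pointer.
return_from_proc([ReturnAddress|Rest], ReturnAddress, Rest).
</code></pre></div></div>

<p>For example, <code class="language-plaintext highlighter-rouge">call_proc(7, 2, [], IP1, S1)</code> gives <code class="language-plaintext highlighter-rouge">IP1 = 7</code> and <code class="language-plaintext highlighter-rouge">S1 = [3]</code>, and a subsequent <code class="language-plaintext highlighter-rouge">return_from_proc(S1, IP2, S2)</code> gives <code class="language-plaintext highlighter-rouge">IP2 = 3</code> and <code class="language-plaintext highlighter-rouge">S2 = []</code>.</p>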

<h2 id="building-the-navigation-maps">Building the navigation maps</h2>

<p>There are a couple of mappings we need to build to be able to jump to arbitrary locations because of changes to the IP.</p>

<ul>
  <li>Mapping <strong>labels to memory addresses</strong></li>
  <li>Mapping <strong>memory addresses to instructions</strong></li>
</ul>

<p>These mappings are done in the following code:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">instruction_pointer_map</span><span class="p">([],</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">IPMap</span><span class="p">).</span>
<span class="ss">instruction_pointer_map</span><span class="p">([</span><span class="nv">Instr</span><span class="p">|</span><span class="nv">T</span><span class="p">],</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">IPCounter</span><span class="p">,</span><span class="nv">FinalIPMap</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">put2</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="nv">IPCounter</span><span class="p">,</span><span class="nv">Instr</span><span class="p">),</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">UpdatedIPMap</span><span class="p">),</span>
                                                                 <span class="ss">plusOne</span><span class="p">(</span><span class="nv">IPCounter</span><span class="p">,</span><span class="nv">UpdatedIPCounter</span><span class="p">),</span>
                                                                 <span class="ss">instruction_pointer_map</span><span class="p">(</span><span class="nv">T</span><span class="p">,</span><span class="nv">UpdatedIPMap</span><span class="p">,</span><span class="nv">UpdatedIPCounter</span><span class="p">,</span><span class="nv">FinalIPMap</span><span class="p">).</span>
<span class="ss">label_map</span><span class="p">([],</span><span class="nv">LabelMap</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">).</span>
<span class="ss">label_map</span><span class="p">([</span><span class="ss">label</span><span class="p">(</span><span class="nv">Label</span><span class="p">)|</span><span class="nv">T</span><span class="p">],</span><span class="nv">LabelMap</span><span class="p">,</span><span class="nv">IPCounter</span><span class="p">,</span><span class="nv">FinalLabelMap</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">put2</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="ss">label</span><span class="p">(</span><span class="nv">Label</span><span class="p">),</span><span class="nv">IPCounter</span><span class="p">),</span><span class="nv">LabelMap</span><span class="p">,</span><span class="nv">UpdatedLabelMap</span><span class="p">),</span>
                                                                 <span class="ss">plusOne</span><span class="p">(</span><span class="nv">IPCounter</span><span class="p">,</span><span class="nv">UpdatedIPCounter</span><span class="p">),</span>
                                                                 <span class="ss">label_map</span><span class="p">(</span><span class="nv">T</span><span class="p">,</span><span class="nv">UpdatedLabelMap</span><span class="p">,</span><span class="nv">UpdatedIPCounter</span><span class="p">,</span><span class="nv">FinalLabelMap</span><span class="p">),</span>
                                                                 <span class="p">!.</span>
<span class="ss">label_map</span><span class="p">([</span><span class="nv">_</span><span class="p">|</span><span class="nv">T</span><span class="p">],</span><span class="nv">LabelMap</span><span class="p">,</span><span class="nv">IPCounter</span><span class="p">,</span><span class="nv">FinalLabelMap</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">plusOne</span><span class="p">(</span><span class="nv">IPCounter</span><span class="p">,</span><span class="nv">UpdatedIPCounter</span><span class="p">),</span>
                                                     <span class="ss">label_map</span><span class="p">(</span><span class="nv">T</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">,</span><span class="nv">UpdatedIPCounter</span><span class="p">,</span><span class="nv">FinalLabelMap</span><span class="p">).</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">label_map</code> predicate simply steps through the full list of instructions, adding a mapping from the label to the current address (incremented at every instruction through <code class="language-plaintext highlighter-rouge">plusOne</code>) whenever it encounters a <code class="language-plaintext highlighter-rouge">label</code> term.</p>

<p>The <code class="language-plaintext highlighter-rouge">instruction_pointer_map</code> predicate simply assigns every instruction that it finds to an incrementing counter.</p>
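
<p>To make the shape of the two maps concrete, here is a compact alternative formulation. It is not the code used in this post: it builds plain lists of <code class="language-plaintext highlighter-rouge">Key-Value</code> pairs with <code class="language-plaintext highlighter-rouge">findall/3</code> and <code class="language-plaintext highlighter-rouge">nth0/3</code> instead of going through <code class="language-plaintext highlighter-rouge">put2</code>, and the names <code class="language-plaintext highlighter-rouge">address_map</code> and <code class="language-plaintext highlighter-rouge">label_addresses</code> are hypothetical.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:- use_module(library(lists)).

% address_map(+Program, -Map): pair every address (starting at 0) with its instruction.
% Hypothetical name; the map is a plain list of const(Address)-Instruction pairs.
address_map(Program, Map) :-
    findall(const(Address)-Instr, nth0(Address, Program, Instr), Map).

% label_addresses(+Program, -Map): keep only the label(...) entries,
% paired with the address at which each one occurs.
label_addresses(Program, Map) :-
    findall(label(Name)-const(Address), nth0(Address, Program, label(Name)), Map).
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">Program = [label(start),inc(reg(a)),label(end),hlt]</code>, <code class="language-plaintext highlighter-rouge">label_addresses(Program,Map)</code> yields <code class="language-plaintext highlighter-rouge">[label(start)-const(0),label(end)-const(2)]</code>. Note that, as in the original predicates, a label occupies an address of its own.</p>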

<h2 id="symbolic-execution-and-world-splits">Symbolic Execution and World Splits</h2>

<p><strong>Symbolic execution</strong> is a technique used to determine the provenance of data in a piece of code. Assume you have a very simplified code segment, like so:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">Code</span><span class="o">=</span><span class="p">[</span>
    <span class="ss">mvc</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">hl</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">10</span><span class="p">)),</span> <span class="c1">% load 10 into register HL</span>
    <span class="ss">mvc</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">bc</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">5</span><span class="p">)),</span>  <span class="c1">% load 5 into register BC</span>
    <span class="ss">inc</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">hl</span><span class="p">)),</span>           <span class="c1">% increment HL by one</span>
    <span class="ss">mul</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">hl</span><span class="p">),</span><span class="ss">reg</span><span class="p">(</span><span class="ss">bc</span><span class="p">))</span>    <span class="c1">% multiply contents of HL with that of BC, store the result in HL</span>
<span class="p">].</span>
</code></pre></div></div>

<p>Now, suppose you wished to determine what sort of data transformations were taking place in this code. You might want to do this for different reasons:</p>

<ul>
  <li>Determine if there are any optimisations that can be made, eg: an addition of zero can be eliminated, since it does not change the answer.</li>
  <li>Understand the data transformation for reverse engineering the business logic of this code</li>
</ul>

<p>Now, you can run this piece of code on concrete numbers and generate lots of test cases for different values in <code class="language-plaintext highlighter-rouge">hl</code>, <code class="language-plaintext highlighter-rouge">bc</code>, etc. However, as a human, you may be able to induce a generic rule which explains the behaviour of this piece of code, which is:</p>

<p><code class="language-plaintext highlighter-rouge">hl=(hl+1)*bc</code></p>

<p>To deduce this rule, note that you didn’t use concrete numbers; you used symbols like <code class="language-plaintext highlighter-rouge">hl</code> and <code class="language-plaintext highlighter-rouge">bc</code> to represent the values that these registers could store. Symbolic execution does exactly this: instead of storing concrete numbers in registers, we store symbols. When we “execute” the program, every operation which modifies these symbols essentially records a log of the operations applied to them. So for example:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">Code</span><span class="o">=</span><span class="p">[</span>
<span class="ss">mvc</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">hl</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">a</span><span class="p">)),</span> <span class="c1">% load symbol a into register HL</span>
<span class="ss">mvc</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">bc</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">b</span><span class="p">)),</span>  <span class="c1">% load symbol b into register BC</span>
<span class="ss">inc</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">hl</span><span class="p">)),</span>           <span class="c1">% increment HL by one, HL now holds inc(sym(a))</span>
<span class="ss">mul</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">hl</span><span class="p">),</span><span class="ss">reg</span><span class="p">(</span><span class="ss">bc</span><span class="p">))</span>    <span class="c1">%  HL now holds mul(inc(sym(a)),sym(b))</span>
<span class="p">].</span>
</code></pre></div></div>

<p>Thus at the end <code class="language-plaintext highlighter-rouge">hl</code>’s contents are <code class="language-plaintext highlighter-rouge">mul(inc(sym(a)),sym(b))</code>, which is interpreted as <code class="language-plaintext highlighter-rouge">hl=(hl+1)*bc</code>.</p>

<p><strong>Symbolic execution is a powerful technique for program analysis.</strong> There is however one wrinkle we need to take care of when building a symbolic interpreter: <strong>branching</strong>.</p>

<p>Consider the instruction <code class="language-plaintext highlighter-rouge">jz(label(some_label))</code>. During concrete execution, we can look at the value of the Zero Flag, and then determine whether we want to jump to <code class="language-plaintext highlighter-rouge">some_label</code> or continue with the normal execution flow. However, the Zero Flag is set based on comparison between two concrete values: what if those values are symbols? You cannot meaningfully compare <code class="language-plaintext highlighter-rouge">sym(a)</code> and <code class="language-plaintext highlighter-rouge">sym(b)</code> numerically: they represent a range of values.</p>

<p><strong>So, then the question becomes: which path do we take?</strong></p>

<p>The image below is an example of the sort of situation you might encounter at a branch point.</p>

<p><img src="/assets/images/symbolic-execution-worlds.png" alt="Symbolic Execution World Splitting" /></p>

<p><strong>The answer is that we take both paths.</strong> Effectively, we split our execution world into two branches: one which makes the jump, and the other which continues with normal execution. These two branches then continue on as individual threads to completion. Of course, if these branches encounter more conditional jump instructions, more sub-worlds split out of these as well, and so on.</p>

<p><strong>Symbolic execution thus explores all possibilities of a program.</strong></p>

<p>One issue is that this can easily result in a <strong>combinatorial explosion of paths</strong>, as each binary branch doubles the number of effective worlds. Symbolic execution engines tackle this in various ways. However, for our simple VM, we will simply keep splitting our world into new branches whenever we encounter conditional jumps.</p>
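
<p>The splitting rule can be sketched as a single predicate which, given the Zero Flag, lists the addresses that execution may continue from. This is a hypothetical helper rather than the interpreter’s own code: the name <code class="language-plaintext highlighter-rouge">successor_ips</code> is made up, addresses are plain integers, and it assumes that <code class="language-plaintext highlighter-rouge">jz</code> jumps exactly when the flag holds <code class="language-plaintext highlighter-rouge">const(0)</code>.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% successor_ips(+ZeroFlag, +IP, +Target, -NextIPs)
% Hypothetical helper: the addresses a jz at address IP may continue to.

% The flag is a concrete zero: only the jump is taken.
successor_ips(const(0), _, Target, [Target]).

% The flag is a concrete non-zero value: only the fall-through is taken.
successor_ips(const(V), IP, _, [Next]) :-
    V =\= 0,
    Next is IP + 1.

% The flag is symbolic: we cannot decide, so both worlds are explored.
successor_ips(sym(_), IP, Target, [Target, Next]) :-
    Next is IP + 1.
</code></pre></div></div>

<p>For instance, <code class="language-plaintext highlighter-rouge">successor_ips(sym(x), 4, 9, Worlds)</code> gives <code class="language-plaintext highlighter-rouge">Worlds = [9,5]</code>, whereas <code class="language-plaintext highlighter-rouge">successor_ips(const(0), 4, 9, Worlds)</code> gives <code class="language-plaintext highlighter-rouge">Worlds = [9]</code>.</p>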

<h2 id="arithmetic-operations">Arithmetic operations</h2>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">plusOne</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">X</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">inc</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">X</span><span class="p">)))).</span>
<span class="ss">plusOne</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">X</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">PlusOne</span><span class="p">))</span> <span class="p">:-</span> <span class="nv">PlusOne</span> <span class="ss">is</span> <span class="nv">X</span><span class="o">+</span><span class="m">1</span><span class="p">.</span>

<span class="ss">minusOne</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">X</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">dec</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">X</span><span class="p">)))).</span>
<span class="ss">minusOne</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">X</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">MinusOne</span><span class="p">))</span> <span class="p">:-</span> <span class="nv">MinusOne</span> <span class="ss">is</span> <span class="nv">X</span><span class="o">-</span><span class="m">1</span><span class="p">.</span>

<span class="ss">product</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">Product</span><span class="p">))</span> <span class="p">:-</span> <span class="nv">Product</span> <span class="ss">is</span> <span class="nv">LHS</span><span class="o">*</span><span class="nv">RHS</span><span class="p">.</span>
<span class="ss">product</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">product</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">)))).</span>
<span class="ss">product</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">product</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">RHS</span><span class="p">)))).</span>
<span class="ss">product</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">product</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">)))).</span>
</code></pre></div></div>

<p>The predicate definitions take into account meaningful combinations of constants and symbols. As a rule of thumb, any expression containing even one symbol cannot be simplified (save degenerate cases like adding / subtracting 0, multiplying by 1 or 0, subtracting an expression from itself, etc., but we leave those aside for now for simplicity). Thus the only arithmetic simplification we do is when everything in an expression is a constant. In all other cases, the predicate creates a new compound term which records the operation being performed; this term will ultimately be inspected at the end of a run as part of the value of a register.</p>
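
<p>To see the difference between the concrete and symbolic cases, the sketch below repeats <code class="language-plaintext highlighter-rouge">plusOne</code> and <code class="language-plaintext highlighter-rouge">product</code> from above so that it loads on its own, and adds a hypothetical <code class="language-plaintext highlighter-rouge">inc_then_mul</code> helper (not part of the original code) which mirrors the increment-then-multiply example program.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Copied from the definitions above so that this file loads on its own.
plusOne(sym(X),sym(inc(sym(X)))).
plusOne(const(X),const(PlusOne)) :- PlusOne is X+1.

product(const(LHS),const(RHS),const(Product)) :- Product is LHS*RHS.
product(sym(LHS),sym(RHS),sym(product(sym(LHS),sym(RHS)))).
product(sym(LHS),const(RHS),sym(product(sym(LHS),const(RHS)))).
product(const(LHS),sym(RHS),sym(product(const(LHS),sym(RHS)))).

% inc_then_mul/3 is a hypothetical helper: increment HL, then multiply by BC.
inc_then_mul(HL, BC, Result) :-
    plusOne(HL, Incremented),
    product(Incremented, BC, Result).
</code></pre></div></div>

<p>With constants, <code class="language-plaintext highlighter-rouge">inc_then_mul(const(10),const(5),R)</code> gives <code class="language-plaintext highlighter-rouge">R = const(55)</code>. With symbols, <code class="language-plaintext highlighter-rouge">inc_then_mul(sym(a),sym(b),R)</code> gives <code class="language-plaintext highlighter-rouge">R = sym(product(sym(inc(sym(a))),sym(b)))</code>, which is the recorded history of operations rather than a number.</p>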

<h2 id="comparison">Comparison</h2>

<p>Comparison works similarly to arithmetic operations, in that actual comparison is <em>usually</em> only performed when both sides are constants. The special case is that if both sides are symbolic expressions with the same structure, that also counts as equality. Otherwise a new compound term logging the comparison operator (wrapped in a symbol itself) is returned.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">equate</span><span class="p">(</span><span class="nv">LHS</span><span class="p">,</span><span class="nv">LHS</span><span class="p">,</span><span class="ss">const</span><span class="p">(</span><span class="m">0</span><span class="p">)).</span>
<span class="ss">equate</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">cmp</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">)))).</span>
<span class="ss">equate</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">cmp</span><span class="p">(</span><span class="ss">sym</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">RHS</span><span class="p">)))).</span>
<span class="ss">equate</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="ss">cmp</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">sym</span><span class="p">(</span><span class="nv">RHS</span><span class="p">)))).</span>
<span class="ss">equate</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">1</span><span class="p">))</span> <span class="p">:-</span> <span class="nv">LHS</span> <span class="o">&lt;</span> <span class="nv">RHS</span><span class="p">.</span>
<span class="ss">equate</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="nv">LHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="nv">RHS</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="o">-</span><span class="m">1</span><span class="p">))</span> <span class="p">:-</span> <span class="nv">LHS</span> <span class="o">&gt;</span> <span class="nv">RHS</span><span class="p">.</span>
</code></pre></div></div>

<h2 id="virtual-machine-state">Virtual Machine state</h2>

<p>The virtual machine state must reflect the exact configuration of registers, flags, stack, etc. These are represented as:</p>

<p><code class="language-plaintext highlighter-rouge">vmState(IP,Stack,CallStack,Registers,flags(zero(v1),hlt(v2),branch(v3)))</code></p>

<ul>
  <li><strong>IP:</strong> <strong>Instruction Pointer</strong>, represented as a <code class="language-plaintext highlighter-rouge">const()</code></li>
  <li><strong>Stack:</strong> The program writer’s <strong>stack</strong></li>
  <li><strong>CallStack:</strong> We need a stack to store return addresses when performing procedure calls. However, using the program writer’s stack would mess with expectations of the programmer of what should be at the top of the stack. Therefore, a separate call stack is maintained.</li>
  <li><strong>Registers:</strong> A map of the registers with their values</li>
  <li><strong>Flags:</strong> The only flag that a user can set and access (indirectly) is the <strong>Zero Flag</strong>. However, there are two other (hidden) flags which indicate whether the program is about to halt or branch. The <code class="language-plaintext highlighter-rouge">branch()</code> compound term is useful to keep track of when to split worlds during symbolic execution. The <code class="language-plaintext highlighter-rouge">hlt()</code> compound term is to keep track of a halt condition (which can be an explicit <code class="language-plaintext highlighter-rouge">HLT</code> instruction, or the flow falling off the end of a program without a <code class="language-plaintext highlighter-rouge">HLT</code> instruction).</li>
</ul>

<p>This <code class="language-plaintext highlighter-rouge">vmState</code> compound term is passed around everywhere and essentially is equivalent to a <code class="language-plaintext highlighter-rouge">struct</code> in C (SWI-Prolog has dedicated facilities for representing structures, but I’ve deliberately kept the code as implementation-neutral as possible).</p>

<p>Accessing data inside this structure is very easy, thanks to unification and pattern matching; for example, if we wished to only extract the value of the Zero Flag, and not care about the other values, we can simply write:</p>

<p><code class="language-plaintext highlighter-rouge">vmState(_,_,_,_,flags(zero(ZeroFlagValue),_,_))</code></p>
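
<p>The same pattern-matching idea works for both reading and “updating” the state. The two predicates below are a minimal sketch with hypothetical names (<code class="language-plaintext highlighter-rouge">zero_flag</code> and <code class="language-plaintext highlighter-rouge">with_zero_flag</code>); they are not taken from the interpreter. Since Prolog terms are immutable, an update is expressed as a relation between an old state and a new one.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Hypothetical accessors; the names are illustrative only.

% zero_flag(+State, -Value): read the Zero Flag and ignore everything else.
zero_flag(vmState(_,_,_,_,flags(zero(Value),_,_)), Value).

% with_zero_flag(+StateIn, +Value, -StateOut):
% StateOut is StateIn with only the Zero Flag replaced.
with_zero_flag(vmState(IP,Stack,CallStack,Registers,flags(zero(_),Hlt,Branch)),
               Value,
               vmState(IP,Stack,CallStack,Registers,flags(zero(Value),Hlt,Branch))).
</code></pre></div></div>

<p>For example, starting from <code class="language-plaintext highlighter-rouge">S0 = vmState(const(0),[],[],[],flags(zero(const(0)),hlt(false),branch(false)))</code> (the empty register map and the <code class="language-plaintext highlighter-rouge">false</code> values are placeholders assumed for this example), the query <code class="language-plaintext highlighter-rouge">with_zero_flag(S0,const(1),S1), zero_flag(S1,Z)</code> gives <code class="language-plaintext highlighter-rouge">Z = const(1)</code> and leaves every other field untouched.</p>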

<h2 id="inner-single-world-loop">Inner single world loop</h2>

<p>For reference, this is the high-level predicate flow that we will refer to while explaining the predicates.</p>

<p><img src="/assets/images/symbolic-interpreter-predicate-flow.png" alt="Symbolic Interpreter Flow" /></p>

<p>Before looking at how branches in symbolic execution are handled, it is instructive to understand how a single-world, concrete execution flow works. The inner loop, which evaluates a world, starts with:</p>

<ul>
  <li>The <strong>original program</strong>, which is simply a list of instructions</li>
  <li>The <code class="language-plaintext highlighter-rouge">ExecutionMode</code> is either <code class="language-plaintext highlighter-rouge">symbolic</code> or <code class="language-plaintext highlighter-rouge">concrete</code>. For the purposes of this explanation, let us assume that it is concrete.</li>
  <li>The <code class="language-plaintext highlighter-rouge">StateIn</code> is simply the initial VM state which is described in <a href="#virtual-machine-state">Virtual Machine state</a></li>
  <li>The <code class="language-plaintext highlighter-rouge">vmMaps</code> is a tuple of the IP Map and the label map, as described in <a href="#building-the-navigation-maps">Building the navigation maps</a></li>
  <li>The last variable represents the final output of evaluating this particular world. In particular, the <code class="language-plaintext highlighter-rouge">TraceOut</code> contains the program trace for this execution. Since we are only doing a concrete flow, there will be no <code class="language-plaintext highlighter-rouge">ChildWorlds</code>. The code below represents the part of the <code class="language-plaintext highlighter-rouge">vm</code> predicate which is pertinent to this discussion.</li>
</ul>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">vm</span><span class="p">(</span><span class="nv">Program</span><span class="p">,</span><span class="nv">ExecutionMode</span><span class="p">,</span><span class="nv">StateIn</span><span class="p">,</span><span class="ss">vmMaps</span><span class="p">(</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">),</span><span class="ss">world</span><span class="p">(</span><span class="nv">StateIn</span><span class="p">,</span><span class="nv">TraceOut</span><span class="p">,</span><span class="nv">ChildWorlds</span><span class="p">))</span> <span class="p">:-</span>
                              <span class="ss">exec_</span><span class="p">(</span><span class="ss">vmMaps</span><span class="p">(</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">),</span>
                                  <span class="nv">StateIn</span><span class="p">,[],</span>
                                  <span class="ss">traceOut</span><span class="p">(</span><span class="nv">FinalTrace</span><span class="p">,</span><span class="nv">VmStateOut</span><span class="p">),</span>
                                  <span class="ss">env</span><span class="p">(</span><span class="ss">log</span><span class="p">(</span><span class="ss">debug</span><span class="p">,</span><span class="ss">info</span><span class="p">,</span><span class="ss">warning</span><span class="p">,</span><span class="ss">error</span><span class="p">),</span><span class="nv">ExecutionMode</span><span class="p">)),</span>
                              <span class="nv">VmStateOut</span><span class="o">=</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">FinalIP</span><span class="p">,</span><span class="nv">FinalStack</span><span class="p">,</span><span class="nv">FinalCallStack</span><span class="p">,</span><span class="nv">FinalRegisters</span><span class="p">,</span><span class="nv">FinalVmFlags</span><span class="p">),</span>
                              <span class="ss">minusOne</span><span class="p">(</span><span class="nv">FinalIP</span><span class="p">,</span><span class="nv">LastInstrIP</span><span class="p">),</span>
                              <span class="nv">TraceOut</span><span class="o">=</span><span class="ss">traceOut</span><span class="p">(</span><span class="nv">FinalTrace</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">LastInstrIP</span><span class="p">,</span><span class="nv">FinalStack</span><span class="p">,</span><span class="nv">FinalCallStack</span><span class="p">,</span><span class="nv">FinalRegisters</span><span class="p">,</span><span class="nv">FinalVmFlags</span><span class="p">)),</span>
                              <span class="p">...</span>
                              <span class="p">.</span>
</code></pre></div></div>

<p>The core interpretation is triggered by the <code class="language-plaintext highlighter-rouge">exec_</code> predicate. The remaining lines involve taking the output VM state, adjusting its final IP value (which is one address beyond the last executed instruction) to point to the last executed instruction, and repackaging it to bind it to the output variable <code class="language-plaintext highlighter-rouge">TraceOut</code>.</p>

<p>Let’s look at the internal <code class="language-plaintext highlighter-rouge">exec_</code> loop.
The <code class="language-plaintext highlighter-rouge">exec_</code> predicate has two cases:</p>

<ul>
  <li>The base case is when the <code class="language-plaintext highlighter-rouge">hlt</code> flag is set to true (<code class="language-plaintext highlighter-rouge">hlt(true)</code>). This indicates that a <code class="language-plaintext highlighter-rouge">HLT</code> instruction has been encountered in the execution of the previous instruction. At this point, the <code class="language-plaintext highlighter-rouge">TraceOut</code> variable is bound to the accumulated program trace and the current VM state, and returned to the <code class="language-plaintext highlighter-rouge">vm</code> predicate.</li>
</ul>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">exec_</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="ss">hlt</span><span class="p">(</span><span class="ss">true</span><span class="p">),</span><span class="nv">BranchFlag</span><span class="p">)),</span>
                  <span class="nv">TraceAcc</span><span class="p">,</span>
                  <span class="ss">traceOut</span><span class="p">(</span><span class="nv">TraceAcc</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="ss">hlt</span><span class="p">(</span><span class="ss">true</span><span class="p">),</span><span class="nv">BranchFlag</span><span class="p">))),</span>
                  <span class="ss">env</span><span class="p">(</span><span class="ss">log</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="nv">Info</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">),</span><span class="nv">_</span><span class="p">))</span> <span class="p">:-</span> <span class="ss">call</span><span class="p">(</span><span class="nv">Info</span><span class="p">,</span><span class="ss">'EXITING PROGRAM LOOP!!!'</span><span class="p">).</span>
</code></pre></div></div>
<ul>
  <li>For the general case (when the last instruction was not <code class="language-plaintext highlighter-rouge">HLT</code>), we first get the instruction corresponding to the current IP value from the <code class="language-plaintext highlighter-rouge">IPMap</code> structure. This is then passed to <code class="language-plaintext highlighter-rouge">exec_helper</code> with all the associated context.</li>
</ul>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">exec_</span><span class="p">(</span><span class="ss">vmMaps</span><span class="p">(</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">),</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">),</span><span class="nv">TraceAcc</span><span class="p">,</span><span class="nv">StateOut</span><span class="p">,</span><span class="nv">Env</span><span class="p">)</span> <span class="p">:-</span>
                                                    <span class="ss">get2</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">Instr</span><span class="p">),</span>
                                                    <span class="ss">exec_helper</span><span class="p">(</span><span class="nv">Instr</span><span class="p">,</span><span class="ss">vmMaps</span><span class="p">(</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">),</span>
                                                        <span class="ss">vmState</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">),</span><span class="nv">TraceAcc</span><span class="p">,</span><span class="nv">StateOut</span><span class="p">,</span><span class="nv">Env</span><span class="p">).</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">exec_helper</code> predicate is where the interpretation actually happens. This also has two cases.</p>

<ul>
  <li>The base case is when an empty instruction is encountered. This can happen if the program does not contain a <code class="language-plaintext highlighter-rouge">HLT</code> instruction, and execution falls off the end of the program. This is treated as equivalent to an implicit <code class="language-plaintext highlighter-rouge">HLT</code> instruction. Thus, the <code class="language-plaintext highlighter-rouge">TraceOut</code> variable is bound to whatever context is already present. The only modification is the <code class="language-plaintext highlighter-rouge">hlt</code> flag, which is explicitly set to true.</li>
</ul>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">exec_helper</span><span class="p">(</span><span class="ss">empty</span><span class="p">,</span><span class="nv">VmMaps</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">BranchFlag</span><span class="p">)),</span>
                    <span class="nv">TraceAcc</span><span class="p">,</span>
                    <span class="ss">traceOut</span><span class="p">(</span><span class="nv">TraceAcc</span><span class="p">,</span><span class="nv">ExitState</span><span class="p">),</span>
                    <span class="ss">env</span><span class="p">(</span><span class="ss">log</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="nv">Info</span><span class="p">,</span><span class="nv">Warn</span><span class="p">,</span><span class="nv">Error</span><span class="p">),</span><span class="nv">ExecutionMode</span><span class="p">))</span> <span class="p">:-</span>
                            <span class="nv">ExitState</span><span class="o">=</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="ss">hlt</span><span class="p">(</span><span class="ss">true</span><span class="p">),</span><span class="nv">BranchFlag</span><span class="p">)),</span>
                            <span class="ss">call</span><span class="p">(</span><span class="nv">Warn</span><span class="p">,</span><span class="ss">'No other instruction found, but no HLT is present. Halting program.'</span><span class="p">),</span>
                            <span class="ss">exec_</span><span class="p">(</span><span class="nv">VmMaps</span><span class="p">,</span><span class="nv">ExitState</span><span class="p">,</span><span class="nv">TraceAcc</span><span class="p">,</span><span class="ss">traceOut</span><span class="p">(</span><span class="nv">TraceAcc</span><span class="p">,</span><span class="nv">ExitState</span><span class="p">),</span><span class="ss">env</span><span class="p">(</span><span class="ss">log</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="nv">Info</span><span class="p">,</span><span class="nv">Warn</span><span class="p">,</span><span class="nv">Error</span><span class="p">),</span><span class="nv">ExecutionMode</span><span class="p">)).</span>
</code></pre></div></div>

<ul>
  <li>The general case is the actual interpretation of the instruction. Before the <code class="language-plaintext highlighter-rouge">interpret</code> predicate is called, a <code class="language-plaintext highlighter-rouge">NextIP</code> variable is initialised to the current IP incremented by one, to indicate the next instruction that will be executed, <strong>assuming there are no jumps</strong>. This is so that a conditional jump can either return the same <code class="language-plaintext highlighter-rouge">NextIP</code> (i.e., no jump), or return a different IP (indicating a jump).</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">interpret</code> predicate is then called, which has different cases, depending upon the instruction encountered. The various cases are described in <a href="#some-notes-on-instruction-interpretation">Instruction Interpretation</a>.</p>

<p>The true branch of the <code class="language-plaintext highlighter-rouge">shouldBranch()</code> if-then-else is only taken during symbolic execution, so we’ll not worry about that for the moment. The else branch is the concrete execution flow. This part is straightforward: it calls the <code class="language-plaintext highlighter-rouge">exec_</code> predicate recursively with the updated IP, and prepends a trace entry for the instruction just executed to the trace returned by that call.</p>

<p>This mutually recursive call between <code class="language-plaintext highlighter-rouge">exec_</code> and <code class="language-plaintext highlighter-rouge">exec_helper</code> continues until a halt condition is reached.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">exec_helper</span><span class="p">(...)</span> <span class="p">:-</span>
        <span class="p">...,</span>
        <span class="ss">plusOne</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">NextIP</span><span class="p">),</span>
        <span class="ss">interpret</span><span class="p">(</span><span class="nv">Instr</span><span class="p">,</span><span class="nv">VmMaps</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">NextIP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">),</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">UpdatedIP</span><span class="p">,</span><span class="nv">UpdatedStack</span><span class="p">,</span><span class="nv">UpdatedCallStack</span><span class="p">,</span><span class="nv">UpdatedRegisters</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">),</span><span class="ss">env</span><span class="p">(</span><span class="ss">log</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="nv">Info</span><span class="p">,</span><span class="nv">Warning</span><span class="p">,</span><span class="nv">Error</span><span class="p">),</span><span class="nv">ExecutionMode</span><span class="p">)),</span>
        <span class="p">(</span><span class="ss">shouldBranch</span><span class="p">(</span><span class="nv">UpdatedVmFlags</span><span class="p">)</span><span class="o">-&gt;</span>
            <span class="p">(</span>
                <span class="p">...</span>
            <span class="p">);</span>
            <span class="p">(</span>
                <span class="ss">call</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="ss">'Next IP is ~w'</span><span class="p">,[</span><span class="nv">UpdatedIP</span><span class="p">]),</span>
                <span class="ss">exec_</span><span class="p">(</span><span class="nv">VmMaps</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">UpdatedIP</span><span class="p">,</span><span class="nv">UpdatedStack</span><span class="p">,</span><span class="nv">UpdatedCallStack</span><span class="p">,</span><span class="nv">UpdatedRegisters</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">),</span><span class="nv">TraceAcc</span><span class="p">,</span><span class="ss">traceOut</span><span class="p">(</span><span class="nv">RemainingTrace</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">FinalIP</span><span class="p">,</span><span class="nv">FinalStack</span><span class="p">,</span><span class="nv">FinalCallStack</span><span class="p">,</span><span class="nv">FinalRegisters</span><span class="p">,</span><span class="nv">FinalVmFlags</span><span class="p">)),</span><span class="ss">env</span><span class="p">(</span><span class="ss">log</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="nv">Info</span><span class="p">,</span><span class="nv">Warning</span><span class="p">,</span><span class="nv">Error</span><span class="p">),</span><span class="nv">ExecutionMode</span><span class="p">)),</span>
                <span class="nv">FinalTrace</span><span class="o">=</span><span class="p">[</span><span class="ss">traceEntry</span><span class="p">(</span><span class="nv">Instr</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">UpdatedIP</span><span class="p">,</span><span class="nv">UpdatedStack</span><span class="p">,</span><span class="nv">UpdatedCallStack</span><span class="p">,</span><span class="nv">UpdatedRegisters</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">))|</span><span class="nv">RemainingTrace</span><span class="p">]</span>
            <span class="p">)</span>
        <span class="p">),</span>
        <span class="p">!.</span>
</code></pre></div></div>
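
<p>To see the shape of this mutual recursion in isolation, here is a deliberately stripped-down sketch. It is not the interpreter’s code: the predicate names <code class="language-plaintext highlighter-rouge">toy_exec/3</code>, <code class="language-plaintext highlighter-rouge">toy_step/4</code> and <code class="language-plaintext highlighter-rouge">fetch/3</code>, the simplified <code class="language-plaintext highlighter-rouge">state(IP,Acc,Halted)</code> term, the pair-list IP map and the single <code class="language-plaintext highlighter-rouge">inc</code> instruction are all assumptions made for illustration. Tracing, logging and branching are left out.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Toy version of the exec_/exec_helper mutual recursion (a sketch, not the real interpreter).
% State is simplified to state(IP, Acc, Halted); the IP map is a list of IP-Instr pairs.

% Hypothetical lookup: yields the atom empty when no instruction exists at IP.
fetch(IP, IPMap, Instr) :-
    (   memberchk(IP-I, IPMap)
    -&gt;  Instr = I
    ;   Instr = empty
    ).

% Base case: the halt flag is set, so return the state as-is.
toy_exec(_, state(IP, Acc, true), state(IP, Acc, true)) :- !.
% General case: fetch the instruction at IP and hand it to the helper.
toy_exec(IPMap, state(IP, Acc, false), Out) :-
    fetch(IP, IPMap, Instr),
    toy_step(Instr, IPMap, state(IP, Acc, false), Out).

% Falling off the end of the program is treated as an implicit HLT.
toy_step(empty, IPMap, state(IP, Acc, _), Out) :- !,
    toy_exec(IPMap, state(IP, Acc, true), Out).
% An explicit HLT sets the halt flag; the IP still advances by one.
toy_step(hlt, IPMap, state(IP, Acc, _), Out) :- !,
    NextIP is IP + 1,
    toy_exec(IPMap, state(NextIP, Acc, true), Out).
% inc bumps the accumulator and moves on to the next instruction.
toy_step(inc, IPMap, state(IP, Acc, _), Out) :-
    NextIP is IP + 1,
    Acc1 is Acc + 1,
    toy_exec(IPMap, state(NextIP, Acc1, false), Out).
</code></pre></div></div>

<p>With these definitions, the query <code class="language-plaintext highlighter-rouge">toy_exec([0-inc,1-inc,2-hlt], state(0,0,false), Out)</code> binds <code class="language-plaintext highlighter-rouge">Out</code> to <code class="language-plaintext highlighter-rouge">state(3,2,true)</code>, with the final IP one past the <code class="language-plaintext highlighter-rouge">hlt</code> instruction. Running the program <code class="language-plaintext highlighter-rouge">[0-inc]</code>, which has no <code class="language-plaintext highlighter-rouge">hlt</code>, ends in <code class="language-plaintext highlighter-rouge">state(1,1,true)</code> via the <code class="language-plaintext highlighter-rouge">empty</code> case.</p>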

<h2 id="some-notes-on-instruction-interpretation">Some notes on Instruction Interpretation</h2>

<p>Many of the cases for the <code class="language-plaintext highlighter-rouge">interpret</code> predicate are about retrieving the values of (one or two) registers, performing some manipulation on them (symbolic or arithmetic) and storing the result in the register, as in this example for the <code class="language-plaintext highlighter-rouge">MUL</code> instruction.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">interpret</span><span class="p">(</span><span class="ss">mul</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="nv">LHSRegister</span><span class="p">),</span><span class="ss">reg</span><span class="p">(</span><span class="nv">RHSRegister</span><span class="p">)),</span><span class="nv">_</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">NextIP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">),</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">NextIP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">UpdatedRegisters</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">),</span><span class="ss">env</span><span class="p">(</span><span class="ss">log</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">),</span><span class="nv">_</span><span class="p">))</span> <span class="p">:-</span>
                <span class="ss">get2</span><span class="p">(</span><span class="nv">LHSRegister</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">LHSValue</span><span class="p">),</span>
                <span class="ss">get2</span><span class="p">(</span><span class="nv">RHSRegister</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">RHSValue</span><span class="p">),</span>
                <span class="ss">call</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="ss">'LHS is ~w,~w'</span><span class="p">,[</span><span class="nv">LHSRegister</span><span class="p">,</span><span class="nv">LHSValue</span><span class="p">]),</span>
                <span class="ss">call</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="ss">'RHS is ~w,~w'</span><span class="p">,[</span><span class="nv">RHSRegister</span><span class="p">,</span><span class="nv">RHSValue</span><span class="p">]),</span>
                <span class="ss">product</span><span class="p">(</span><span class="nv">LHSValue</span><span class="p">,</span><span class="nv">RHSValue</span><span class="p">,</span><span class="nv">Product</span><span class="p">),</span>
                <span class="ss">call</span><span class="p">(</span><span class="nv">Debug</span><span class="p">,</span><span class="ss">'And result is ~w'</span><span class="p">,[</span><span class="nv">Product</span><span class="p">]),</span>
                <span class="ss">update_reg</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="nv">LHSRegister</span><span class="p">),</span><span class="nv">Product</span><span class="p">),</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">UpdatedRegisters</span><span class="p">).</span>
</code></pre></div></div>

<p>The two <code class="language-plaintext highlighter-rouge">get2</code> calls retrieve the values of <code class="language-plaintext highlighter-rouge">LHSRegister</code> and <code class="language-plaintext highlighter-rouge">RHSRegister</code>. <code class="language-plaintext highlighter-rouge">product</code> calculates their (symbolic or arithmetic) product. <code class="language-plaintext highlighter-rouge">update_reg</code> updates the <code class="language-plaintext highlighter-rouge">LHSRegister</code> with the product.</p>
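
<p>The helper predicates used above are not shown in this post, so the following is only a sketch of how they might behave. It assumes the registers are a list of <code class="language-plaintext highlighter-rouge">Name-Value</code> pairs and that a symbolic product is represented as an unevaluated <code class="language-plaintext highlighter-rouge">L*R</code> term; both are assumptions, and the actual definitions may differ.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Hypothetical helpers; the interpreter's own definitions are not shown in this post.
% Registers are assumed to be a list of Name-Value pairs.

% Look up the value stored under Key.
get2(Key, Pairs, Value) :- memberchk(Key-Value, Pairs).

% Arithmetic product when both operands are numbers,
% otherwise an unevaluated symbolic term L*R.
product(L, R, P) :- number(L), number(R), !, P is L * R.
product(L, R, L*R).

% Replace the value of register Name, leaving the other registers untouched.
% A register that is not present yet is appended at the end.
update_reg(reg(Name)-Value, [], [Name-Value]).
update_reg(reg(Name)-Value, [Name-_|Rest], [Name-Value|Rest]) :- !.
update_reg(Update, [Other|Rest], [Other|Updated]) :-
    update_reg(Update, Rest, Updated).

% Example queries:
% ?- product(2, 3, P).                       % P = 6.
% ?- product(x, 3, P).                       % P = x*3.
% ?- update_reg(reg(a)-6, [a-2,b-3], Regs).  % Regs = [a-6,b-3].
</code></pre></div></div>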

<p>The two instructions that modify the VM state differently are the <code class="language-plaintext highlighter-rouge">JZ</code> and <code class="language-plaintext highlighter-rouge">JNZ</code> instructions: their behaviour depends upon whether the <code class="language-plaintext highlighter-rouge">ExecutionMode</code> is <code class="language-plaintext highlighter-rouge">symbolic</code> or <code class="language-plaintext highlighter-rouge">concrete</code>.</p>

<h2 id="symbolic-execution-and-world-splitting-the-outer-loop">Symbolic Execution and World splitting: The outer loop</h2>

<p>Let’s talk about how symbolic execution works. The symbolic execution mode is controlled by two variables:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">ExecutionMode</code> variable which can either be <code class="language-plaintext highlighter-rouge">symbolic</code> or <code class="language-plaintext highlighter-rouge">concrete</code>.</li>
  <li>The <code class="language-plaintext highlighter-rouge">branch</code> flag which is explicitly set to true by conditional jump instructions, only when the <code class="language-plaintext highlighter-rouge">ExecutionMode</code> is <code class="language-plaintext highlighter-rouge">symbolic</code>. To see this link, look at the two cases for the <code class="language-plaintext highlighter-rouge">JZ</code> (Jump if Zero) instruction.</li>
</ul>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">interpret</span><span class="p">(</span><span class="ss">jz</span><span class="p">(</span><span class="nv">JumpIP</span><span class="p">),</span><span class="nv">_</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">OldNextIP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">),</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">UpdatedIP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">),</span><span class="ss">env</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="ss">mode</span><span class="p">(</span><span class="ss">symbolic</span><span class="p">)))</span> <span class="p">:-</span> <span class="ss">interpret_symbolic_condition</span><span class="p">(</span><span class="nv">OldNextIP</span><span class="p">,</span><span class="nv">JumpIP</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">,</span><span class="ss">isZero</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">,</span><span class="nv">UpdatedIP</span><span class="p">).</span>
<span class="ss">interpret</span><span class="p">(</span><span class="ss">jz</span><span class="p">(</span><span class="nv">JumpIP</span><span class="p">),</span><span class="nv">_</span><span class="p">,</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">OldNextIP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">),</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">UpdatedIP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">),</span><span class="ss">env</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="ss">mode</span><span class="p">(</span><span class="ss">concrete</span><span class="p">)))</span> <span class="p">:-</span> <span class="ss">interpret_condition</span><span class="p">(</span><span class="nv">OldNextIP</span><span class="p">,</span><span class="nv">JumpIP</span><span class="p">,</span><span class="nv">VmFlags</span><span class="p">,</span><span class="ss">isZero</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">,</span><span class="nv">UpdatedIP</span><span class="p">).</span>
</code></pre></div></div>

<p>The first case triggers when the mode is <code class="language-plaintext highlighter-rouge">symbolic</code>, and calls the <code class="language-plaintext highlighter-rouge">interpret_symbolic_condition</code> predicate. This predicate has a single clause, which leaves the IP at <code class="language-plaintext highlighter-rouge">OldNextIP</code> and directly sets <code class="language-plaintext highlighter-rouge">branch(true)</code>:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">interpret_symbolic_condition</span><span class="p">(</span><span class="nv">OldNextIP</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="nv">HltFlag</span><span class="p">,</span><span class="nv">_</span><span class="p">),</span><span class="nv">_</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="nv">HltFlag</span><span class="p">,</span><span class="ss">branch</span><span class="p">(</span><span class="ss">true</span><span class="p">)),</span><span class="nv">OldNextIP</span><span class="p">).</span>
</code></pre></div></div>

<p>The second case triggers when the mode is <code class="language-plaintext highlighter-rouge">concrete</code> and calls the <code class="language-plaintext highlighter-rouge">interpret_condition</code> predicate which is described in <a href="#some-notes-on-instruction-interpretation">Instruction Interpretation</a>.</p>
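
<p>For contrast with the symbolic clause, a concrete condition check could look roughly like the sketch below. This is a guess at the shape, not the interpreter’s definition: the names <code class="language-plaintext highlighter-rouge">interpret_condition_sketch/6</code> and <code class="language-plaintext highlighter-rouge">condition_holds/2</code> and the <code class="language-plaintext highlighter-rouge">isNotZero</code> tag are hypothetical, and the sketch assumes that the Zero Flag is stored as <code class="language-plaintext highlighter-rouge">zero(true)</code> or <code class="language-plaintext highlighter-rouge">zero(false)</code> and that <code class="language-plaintext highlighter-rouge">JumpIP</code> is already a resolved address.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Hypothetical sketch of a concrete condition check; the real interpret_condition is not shown here.
% Assumes zero(true)/zero(false) flag values and an already-resolved JumpIP.

% A condition tag holds or not, depending on the Zero Flag.
condition_holds(isZero,    flags(zero(true),_,_)).
condition_holds(isNotZero, flags(zero(false),_,_)).

% Flags pass through unchanged; only the IP is decided here.
% Jump when the condition holds, otherwise fall through to OldNextIP.
interpret_condition_sketch(OldNextIP, JumpIP, VmFlags, Condition, VmFlags, UpdatedIP) :-
    (   condition_holds(Condition, VmFlags)
    -&gt;  UpdatedIP = JumpIP
    ;   UpdatedIP = OldNextIP
    ).

% Example queries:
% ?- interpret_condition_sketch(5, 9, flags(zero(true),hlt(false),branch(false)), isZero, _, IP).
% IP = 9.
% ?- interpret_condition_sketch(5, 9, flags(zero(false),hlt(false),branch(false)), isZero, _, IP).
% IP = 5.
</code></pre></div></div>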

<p>Where is <code class="language-plaintext highlighter-rouge">branch(true)</code> actually used? This is in the <code class="language-plaintext highlighter-rouge">exec_helper</code> predicate, reproduced here with the pertinent code:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">exec_helper</span><span class="p">(...)</span> <span class="p">:-</span>
    <span class="p">...,</span>
    <span class="p">(</span><span class="ss">shouldBranch</span><span class="p">(</span><span class="nv">UpdatedVmFlags</span><span class="p">)</span><span class="o">-&gt;</span>
        <span class="p">(</span>
            <span class="ss">terminateForBranch</span><span class="p">(</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">UpdatedIP</span><span class="p">,</span><span class="nv">UpdatedStack</span><span class="p">,</span><span class="nv">UpdatedCallStack</span><span class="p">,</span><span class="nv">UpdatedRegisters</span><span class="p">,</span><span class="nv">UpdatedVmFlags</span><span class="p">),</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">FinalIP</span><span class="p">,</span><span class="nv">FinalStack</span><span class="p">,</span><span class="nv">FinalCallStack</span><span class="p">,</span><span class="nv">FinalRegisters</span><span class="p">,</span><span class="nv">FinalVmFlags</span><span class="p">)),</span>
            <span class="nv">FinalTrace</span><span class="o">=</span><span class="nv">TraceAcc</span>
        <span class="p">);</span>
        <span class="p">(</span>
            <span class="p">...</span>
        <span class="p">)</span>
    <span class="p">),</span>
    <span class="p">!.</span>

</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">shouldBranch()</code> term is true only when the <code class="language-plaintext highlighter-rouge">branch(true)</code> flag is set. At this point, <code class="language-plaintext highlighter-rouge">exec_helper</code> simply returns the trace accumulated so far and the VM state as-is, effectively ending the execution of this thread. This is because, beyond this point, two new world threads need to be created and interpreted as their own worlds with identical starting points.</p>
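
<p>To make the flag check concrete, here is a minimal sketch. It assumes the flags term has the shape <code class="language-plaintext highlighter-rouge">flags(ZeroFlag, hlt(H), branch(B))</code> used elsewhere in this post; the names <code class="language-plaintext highlighter-rouge">sketch_should_branch</code> and <code class="language-plaintext highlighter-rouge">sketch_step_outcome</code> are hypothetical and are not the definitions from the actual source.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Sketch only: assumes the layout flags(ZeroFlag, hlt(H), branch(B)).
% The check succeeds purely by pattern matching on the third field.
sketch_should_branch(flags(_, _, branch(true))).

% Classify what an interpreter loop would do next for a given flags term.
sketch_step_outcome(Flags, Outcome) :-
    (   sketch_should_branch(Flags)
    -&gt;  Outcome = stop_for_branch
    ;   Outcome = keep_executing
    ).
</code></pre></div></div>

<p>Here <code class="language-plaintext highlighter-rouge">sketch_step_outcome(flags(z, hlt(false), branch(true)), O)</code> binds <code class="language-plaintext highlighter-rouge">O</code> to <code class="language-plaintext highlighter-rouge">stop_for_branch</code>, where <code class="language-plaintext highlighter-rouge">z</code> is just a placeholder for the zero flag. Any other value of the branch field gives <code class="language-plaintext highlighter-rouge">keep_executing</code>.</p>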

<p><strong>Where does this world splitting take place?</strong>
This happens in the <code class="language-plaintext highlighter-rouge">vm()</code> predicate, reproduced here with the relevant code:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">vm</span><span class="p">(...)</span> <span class="p">:-</span>
      <span class="p">...,</span>
      <span class="p">(</span><span class="ss">shouldTerminateWorld</span><span class="p">(</span><span class="nv">FinalVmFlags</span><span class="p">)</span><span class="o">-&gt;</span><span class="p">(...);</span>
        <span class="p">(</span>
          <span class="nv">NewStartIP_One</span><span class="o">=</span><span class="nv">FinalIP</span><span class="p">,</span>
          <span class="ss">branchDestination</span><span class="p">(</span><span class="nv">LastInstrIP</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">,</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">NewStartIP_Two</span><span class="p">),</span>
          <span class="nv">Branches</span><span class="o">=</span><span class="p">[</span><span class="nv">NewStartIP_One</span><span class="p">,</span><span class="nv">NewStartIP_Two</span><span class="p">],</span>
          <span class="ss">info</span><span class="p">(</span><span class="s2">"Branches are: ~w"</span><span class="p">,[</span><span class="nv">Branches</span><span class="p">]),</span>
          <span class="ss">explore</span><span class="p">(</span><span class="nv">Program</span><span class="p">,</span><span class="nv">ExecutionMode</span><span class="p">,</span><span class="nv">VmStateOut</span><span class="p">,</span><span class="ss">vmMaps</span><span class="p">(</span><span class="nv">IPMap</span><span class="p">,</span><span class="nv">LabelMap</span><span class="p">),</span><span class="nv">Branches</span><span class="p">,[],</span><span class="nv">ChildWorlds</span><span class="p">)</span>
        <span class="p">)</span>
      <span class="p">).</span>
</code></pre></div></div>

<p>At this point, we are back at the top, and we need to decide whether to halt the execution entirely, or whether to branch out into different worlds. <strong>How do we determine this?</strong></p>

<p><strong>Answer: We know we are back at the top, but we also know that we aren’t in a HALT condition (an explicit <code class="language-plaintext highlighter-rouge">HALT</code> instruction or execution flow falling off the end), therefore we must be at a branch point.</strong> So we extract two IP values: <code class="language-plaintext highlighter-rouge">NewStartIP_One</code> (the default execution flow IP value) and <code class="language-plaintext highlighter-rouge">NewStartIP_Two</code> (the jump IP value). We then recursively call the <code class="language-plaintext highlighter-rouge">explore</code> predicate, which is the top-level entry predicate for our VM.</p>

<h2 id="execution-entry-point">Execution entry point</h2>

<p>Let’s look at the <code class="language-plaintext highlighter-rouge">explore</code> predicate.
The important variable in this predicate is the <strong>list of Instruction Pointers</strong>. If there is only one IP value (which is what happens when this predicate is first invoked), there is only one branch / world. If there are two (or more, though that never happens with this VM), there are multiple branches which need to be symbolically interpreted independently, starting from the same set of initial conditions (the VM state, etc.). <code class="language-plaintext highlighter-rouge">explore</code> is thus called recursively for every conditional branch until there are none left, which signals the end of the symbolic execution tree.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">explore</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">,[],</span><span class="nv">WorldAcc</span><span class="p">,</span><span class="nv">WorldAcc</span><span class="p">).</span>
<span class="ss">explore</span><span class="p">(</span><span class="nv">Program</span><span class="p">,</span><span class="nv">ExecutionMode</span><span class="p">,</span><span class="nv">VmState</span><span class="p">,</span><span class="nv">VmMaps</span><span class="p">,[</span><span class="nv">IP</span><span class="p">|</span><span class="nv">OtherIPs</span><span class="p">],</span><span class="nv">WorldAcc</span><span class="p">,[</span><span class="nv">WorldOut</span><span class="p">|</span><span class="nv">OtherWorldOuts</span><span class="p">])</span> <span class="p">:-</span>
        <span class="nv">VmState</span><span class="o">=</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="nv">_</span><span class="p">,</span><span class="nv">_</span><span class="p">)),</span>
        <span class="nv">FreshState</span><span class="o">=</span><span class="ss">vmState</span><span class="p">(</span><span class="nv">IP</span><span class="p">,</span><span class="nv">Stack</span><span class="p">,</span><span class="nv">CallStack</span><span class="p">,</span><span class="nv">Registers</span><span class="p">,</span><span class="ss">flags</span><span class="p">(</span><span class="nv">ZeroFlag</span><span class="p">,</span><span class="ss">hlt</span><span class="p">(</span><span class="ss">false</span><span class="p">),</span><span class="ss">branch</span><span class="p">(</span><span class="ss">false</span><span class="p">))),</span>
        <span class="ss">vm</span><span class="p">(</span><span class="nv">Program</span><span class="p">,</span><span class="nv">ExecutionMode</span><span class="p">,</span><span class="nv">FreshState</span><span class="p">,</span><span class="nv">VmMaps</span><span class="p">,</span><span class="nv">WorldOut</span><span class="p">),</span>
        <span class="ss">explore</span><span class="p">(</span><span class="nv">Program</span><span class="p">,</span><span class="nv">ExecutionMode</span><span class="p">,</span><span class="nv">VmState</span><span class="p">,</span><span class="nv">VmMaps</span><span class="p">,</span><span class="nv">OtherIPs</span><span class="p">,</span><span class="nv">WorldAcc</span><span class="p">,</span><span class="nv">OtherWorldOuts</span><span class="p">),</span>
        <span class="p">!.</span>
</code></pre></div></div>
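
<p>The recursion pattern is easier to see with the VM stripped out. The following is a toy sketch, not the real code: <code class="language-plaintext highlighter-rouge">toy_vm</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">vm</code> that only records where a world started, and <code class="language-plaintext highlighter-rouge">toy_explore</code> drops the program, mode, state and accumulator arguments.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Hypothetical stand-in for vm/5: a "world" is just a record of its start IP.
toy_vm(IP, world(started_at(IP))).

% No start IPs left: no more worlds to produce.
toy_explore([], []).
% One world per start IP, in the same order as the IP list.
toy_explore([IP|OtherIPs], [World|OtherWorlds]) :-
    toy_vm(IP, World),
    toy_explore(OtherIPs, OtherWorlds).
</code></pre></div></div>

<p>Asking <code class="language-plaintext highlighter-rouge">toy_explore([4,7], Worlds)</code> binds <code class="language-plaintext highlighter-rouge">Worlds</code> to <code class="language-plaintext highlighter-rouge">[world(started_at(4)), world(started_at(7))]</code>. In the real predicate, each of those worlds can itself contain child worlds, because <code class="language-plaintext highlighter-rouge">vm</code> calls <code class="language-plaintext highlighter-rouge">explore</code> again at every branch point.</p>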

<p>If you wish to trace the flow yourself and are using <strong>SWI-Prolog</strong>, I’d highly recommend using the <strong>graphical debugger</strong>. It has full support for breakpoints, <strong>step over/step into</strong>, <strong>forcing predicates to fail</strong>, <strong>on-the-fly edit-and-recompile</strong>, and more.</p>

<p><img src="/assets/images/swi-prolog-graphical-debugger.png" alt="SWI-Prolog Graphical Debugger" /></p>

<h2 id="example-programs">Example Programs</h2>

<p>These are a couple of example programs you can run with this VM.</p>

<h3 id="factorial">Factorial</h3>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">execute_symbolic</span><span class="p">([</span>
     <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r0</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">5</span><span class="p">)),</span>
     <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">1</span><span class="p">)),</span>
     <span class="ss">call</span><span class="p">(</span><span class="ss">label</span><span class="p">(</span><span class="ss">factorial</span><span class="p">)),</span>
     <span class="ss">push</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="m">30</span><span class="p">)),</span>
     <span class="ss">cmp</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">121</span><span class="p">)),</span>
     <span class="ss">hlt</span><span class="p">,</span>
     <span class="ss">label</span><span class="p">(</span><span class="ss">factorial</span><span class="p">),</span>
     <span class="ss">cmp</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r0</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">0</span><span class="p">)),</span>
     <span class="ss">jz</span><span class="p">(</span><span class="ss">label</span><span class="p">(</span><span class="ss">base</span><span class="p">)),</span>
     <span class="ss">push</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r0</span><span class="p">)),</span>
     <span class="ss">dec</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r0</span><span class="p">)),</span>
     <span class="ss">call</span><span class="p">(</span><span class="ss">label</span><span class="p">(</span><span class="ss">factorial</span><span class="p">)),</span>
     <span class="ss">pop</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r0</span><span class="p">)),</span>
     <span class="ss">mul</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">),</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r0</span><span class="p">)),</span>
     <span class="ss">ret</span><span class="p">,</span>

     <span class="ss">label</span><span class="p">(</span><span class="ss">base</span><span class="p">),</span>
     <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">1</span><span class="p">)),</span>
     <span class="ss">ret</span>
<span class="p">],</span>
<span class="ss">mode</span><span class="p">(</span><span class="ss">concrete</span><span class="p">),</span><span class="nv">AllWorlds</span><span class="p">),</span><span class="ss">print_worlds</span><span class="p">(</span><span class="nv">AllWorlds</span><span class="p">,</span><span class="m">0</span><span class="p">).</span>
</code></pre></div></div>
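
<p>If you want something to cross-check the VM run against, a plain recursive factorial in Prolog is a handy reference. This sketch is independent of the VM and is not part of the linked source. Assuming <code class="language-plaintext highlighter-rouge">mul(reg(r1),reg(r0))</code> stores the product in <code class="language-plaintext highlighter-rouge">r1</code>, the assembly program above should leave the same value in <code class="language-plaintext highlighter-rouge">r1</code> that this predicate computes for 5.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Reference implementation only; the name factorial/2 is my own choice.
factorial(0, 1).
factorial(N, F) :-
    N &gt; 0,               % guard so that the recursion stops at 0
    N1 is N - 1,
    factorial(N1, F1),
    F is N * F1.
</code></pre></div></div>

<p>The query <code class="language-plaintext highlighter-rouge">factorial(5, F)</code> binds <code class="language-plaintext highlighter-rouge">F</code> to 120.</p>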

<h3 id="symbolic-execution-example">Symbolic execution example</h3>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">execute_symbolic</span><span class="p">([</span>
    <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">10</span><span class="p">)),</span><span class="ss">inc</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">)),</span>
    <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r2</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">20</span><span class="p">)),</span>
    <span class="ss">cmp</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">),</span> <span class="ss">const</span><span class="p">(</span><span class="m">0</span><span class="p">)),</span>
    <span class="ss">jnz</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="m">7</span><span class="p">)),</span>
    <span class="ss">cmp</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r2</span><span class="p">),</span> <span class="ss">const</span><span class="p">(</span><span class="m">1</span><span class="p">)),</span>
    <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r1</span><span class="p">),</span><span class="ss">const</span><span class="p">(</span><span class="m">30</span><span class="p">)),</span>
    <span class="ss">jz</span><span class="p">(</span><span class="ss">const</span><span class="p">(</span><span class="m">10</span><span class="p">)),</span>
    <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r3</span><span class="p">),</span> <span class="ss">const</span><span class="p">(</span><span class="m">40</span><span class="p">)),</span>
    <span class="ss">hlt</span><span class="p">,</span>
    <span class="ss">mov</span><span class="p">(</span><span class="ss">reg</span><span class="p">(</span><span class="ss">r3</span><span class="p">),</span> <span class="ss">const</span><span class="p">(</span><span class="m">50</span><span class="p">)),</span>
    <span class="ss">hlt</span>
<span class="p">],</span>
<span class="ss">mode</span><span class="p">(</span><span class="ss">symbolic</span><span class="p">),</span><span class="nv">AllWorlds</span><span class="p">),</span><span class="ss">print_worlds</span><span class="p">(</span><span class="nv">AllWorlds</span><span class="p">,</span><span class="m">0</span><span class="p">).</span>
</code></pre></div></div>

<h2 id="program-statistics">Program statistics</h2>

<ul>
  <li>The first observation I had about the code is that it is <strong>dense</strong>. There is <strong>minimal syntactic overhead</strong> of the kind associated with imperative or object-oriented programming. Unification ensures that there are <strong>no verbose assignment statements</strong> and no special syntax to pack and unpack structures. There are a couple of places where assignment is used, but mostly to improve readability.</li>
  <li>To get a sense of the expressiveness of the language due to its unification and pattern matching features, the original concrete interpreter (without the symbolic interpretation) was <strong>less than 50 lines of code</strong>.</li>
  <li>The current version of the symbolic interpreter, excluding the data structures built from the ground up, takes up around <strong>120 lines of code</strong>.</li>
</ul>

<h2 id="references">References</h2>
<ul>
  <li><a href="https://github.com/asengupta/prolog-exercises/blob/main/ilp/prolog_examples/symbolic_executor.pl">Symbolic Interpreter</a></li>
</ul>]]></content><author><name>avishek</name></author><category term="Prolog" /><category term="Logic Programming" /><category term="Virtual Machine" /><category term="Symbolic Execution" /><summary type="html"><![CDATA[In this post, I’ll talk about how I wrote a small Virtual Machine in Prolog which can both interpret concrete assembly language-like programs, and run basic symbolic executions, which are useful in data flow analyses of programs. The full code is available in this repository.]]></summary></entry><entry><title type="html">Automated Reasoning: A brief overview of Prolog</title><link href="https://avishek.net/2025/06/28/a-brief-overview-of-prolog.html" rel="alternate" type="text/html" title="Automated Reasoning: A brief overview of Prolog" /><published>2025-06-28T00:00:00+05:30</published><updated>2025-06-28T00:00:00+05:30</updated><id>https://avishek.net/2025/06/28/a-brief-overview-of-prolog</id><content type="html" xml:base="https://avishek.net/2025/06/28/a-brief-overview-of-prolog.html"><![CDATA[<p>In this post, I present an abbreviated overview of <strong>Prolog</strong> and the paradigm of <strong>Logic Programming</strong>. I’ll discuss why I think it makes for such a powerful domain modelling language (with examples), and a gateway into the techniques of automated symbolic reasoning.</p>

<p><em>This post has not been written or edited by AI.</em></p>

<h2 id="why-prolog">Why Prolog?</h2>
<p>I’ve always been interested in the evolution of AI, and the potential for a synthesis of older techniques (formal and otherwise) and newer ones (deep learning) in my current area of focus (reverse engineering). My intuition tells me that combining many of these approaches will only serve to make application of AI more “robust”. I use that term in a rather loose sense for the moment. One interpretation might be that to make something robust is to make it more “deterministic” or “reproducible”, but this can only cover a subset of qualities of such a system. However, automated reasoning is a principled, methodical way of exploring a subspace of a larger, more ill-defined problem space: hence my interest in this space. After all, this early AI research did give us Lisp :-)</p>

<p><strong>Prolog</strong> was the European answer to automating the reasoning process, dating from the early 1970s (the American one was Lisp). It belongs to the family of logic programming languages, a paradigm distinct from the better-known imperative, functional, and object-oriented ones. Several of its ideas appealed to me. Not surprisingly, Prolog, its subsets, and derivative approaches (Datalog, Answer Set Programming, etc.) have been used in applications involving different types of reasoning tasks. There is a treasure trove of papers surveying the history of Prolog <a href="https://github.com/dtonhofer/prolog_notes/tree/master/other_notes/about_papers_of_interest">here</a> if you are interested.</p>

<p>From my reading, there are several kinds of reasoning, and the categorisation depends upon different parameters. I only list the types of reasoning which interest me.</p>

<ul>
  <li><strong>Deductive:</strong> Incontrovertible conclusions from facts. This is what base Prolog does.</li>
  <li><strong>Inductive:</strong> Plausible “general” facts from observed facts. <strong>Inductive Logic Programming (ILP)</strong> systems like Popper build on top of Prolog to provide such capabilities.</li>
  <li><strong>Abductive:</strong> Generalised “probable” rules from facts. Libraries like ILASP and Potassco’s <strong>clingo</strong> provide capabilities like Answer Set Programming.</li>
  <li><strong>Non-monotonic:</strong> Facts are tentative, and can be invalidated in the face of new facts. ILASP, etc. support non-monotonic reasoning.</li>
  <li><strong>Probabilistic Logic:</strong> Facts are assigned probabilities, conclusions are probabilistic in nature. <strong>Probabilistic Graphical Models</strong> embody such logic.</li>
</ul>

<p>Here’s a very informative diagram of the logic programming landscape, borrowed from <a href="https://github.com/dtonhofer/prolog_notes">David Tonhofer’s extensive notes on Prolog</a>:</p>

<p><img src="/assets/images/quick_map_of_lp_landscape.svg" alt="LP Landscape" /></p>

<p>For the purposes of this post, I will be using <a href="https://www.swi-prolog.org/">SWI-Prolog</a>, though you can also use other implementations like <a href="https://ciao-lang.org/">Ciao</a>, <a href="http://www.gprolog.org/">GNU Prolog</a>, etc. In an upcoming post, I will go through the design of a simple-but-nontrivial <strong>Virtual Machine</strong> together with <strong>symbolic execution</strong>, written in Prolog in about <strong>200 lines of code</strong>.</p>

<h2 id="a-very-brief-introduction-to-prolog">A very brief introduction to Prolog</h2>

<p>The way of thinking about <strong>logic programming</strong> is close to - but not the same as - <strong>functional programming</strong>. Briefly, these are some of the salient points.</p>

<ul>
  <li><strong>Data is immutable.</strong> In this it is similar to functional programming.</li>
  <li>You don’t write statements. You write <strong>facts</strong> and <strong>predicates</strong>. These relate <em>things</em> to other <em>things</em>.</li>
  <li>Facts hold unconditionally, i.e., they are <em>true</em>.</li>
  <li>Any fact not defined is treated as <em>false</em>.</li>
  <li><strong>Predicates hold conditionally</strong>, based on whether other predicates or facts hold or not.</li>
  <li>The actual act of <strong>computation happens when asserting the truth</strong> of these predicates.</li>
  <li>Prolog is superb at <strong>pattern matching</strong> and <strong>symbolic manipulation</strong>. This comes partly from its unification mechanism (I discuss it <a href="#prolog-concepts-unification">here</a>). Personally, I have not seen this level of capability built into any other language I have used (including OCaml, which is another language I’m currently learning).</li>
</ul>

<p>So, here’s a simple directed graph in Prolog.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">node</span><span class="p">(</span><span class="ss">a</span><span class="p">).</span>
<span class="ss">node</span><span class="p">(</span><span class="ss">b</span><span class="p">).</span>
<span class="ss">node</span><span class="p">(</span><span class="ss">c</span><span class="p">).</span>
<span class="ss">node</span><span class="p">(</span><span class="ss">d</span><span class="p">).</span>
<span class="ss">node</span><span class="p">(</span><span class="ss">e</span><span class="p">).</span>
<span class="ss">node</span><span class="p">(</span><span class="ss">f</span><span class="p">).</span>

<span class="ss">edge</span><span class="p">(</span><span class="ss">a</span><span class="p">,</span><span class="ss">b</span><span class="p">).</span>
<span class="ss">edge</span><span class="p">(</span><span class="ss">b</span><span class="p">,</span><span class="ss">c</span><span class="p">).</span>
<span class="ss">edge</span><span class="p">(</span><span class="ss">b</span><span class="p">,</span><span class="ss">d</span><span class="p">).</span>
<span class="ss">edge</span><span class="p">(</span><span class="ss">c</span><span class="p">,</span><span class="ss">e</span><span class="p">).</span>
<span class="ss">edge</span><span class="p">(</span><span class="ss">d</span><span class="p">,</span><span class="ss">e</span><span class="p">).</span>
</code></pre></div></div>

<p>…and, that’s it. Facts (and predicates) can be any combination of letters/numbers (the exact rules are obviously more rigorous, but if you can name a variable in your favourite programming language, you can probably use it as a fact name). The only constraint is that only <strong>variables</strong> can start with an uppercase letter.</p>

<p>Also note that there were no quotes around <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">c</code>, etc. They are <strong>atoms</strong>, which are basically freeform symbols that you can use in lieu of strings (strings are a different type).</p>
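
<p>The difference between atoms and strings can be checked directly with the built-in type tests <code class="language-plaintext highlighter-rouge">atom/1</code> and <code class="language-plaintext highlighter-rouge">string/1</code>. This small sketch assumes SWI-Prolog 7 or later with its default settings, where double-quoted text is read as a string object; other Prolog systems may treat double quotes differently. The predicate name <code class="language-plaintext highlighter-rouge">text_kind</code> is made up for this illustration.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Assumes SWI-Prolog 7+ defaults: "..." is a string, '...' and bare names are atoms.
text_kind(X, atom)   :- atom(X).
text_kind(X, string) :- string(X).
</code></pre></div></div>

<p>With this loaded, <code class="language-plaintext highlighter-rouge">text_kind(a, K)</code> and <code class="language-plaintext highlighter-rouge">text_kind('hello world', K)</code> both give <code class="language-plaintext highlighter-rouge">K = atom</code>, while <code class="language-plaintext highlighter-rouge">text_kind("a", K)</code> gives <code class="language-plaintext highlighter-rouge">K = string</code>. Single quotes are only needed when an atom contains spaces or starts with an uppercase letter.</p>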

<p>The simplest thing you can do with the above facts is query them. For example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- edge(a,b).
true.
</code></pre></div></div>

<p>This query simply checks whether there is an edge from <code class="language-plaintext highlighter-rouge">a</code> to <code class="language-plaintext highlighter-rouge">b</code>. This is forward inference, but we can also do reverse inference. We can turn a question about our facts on its head: suppose we want to know the outgoing edges of <code class="language-plaintext highlighter-rouge">b</code>. In most other programming languages, you’d have to write code that walks the edge list and collects the edges starting at <code class="language-plaintext highlighter-rouge">b</code>. In Prolog, you can simply ask for this information.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- edge(b,N).
N = c ;
N = d.
</code></pre></div></div>

<p>No extra programming needed. This is because Prolog leverages the concepts of unification, backtracking and goal resolution. <strong>Effectively, you can run your program both forwards and backwards.</strong> This is extremely powerful, and I discuss another example <a href="#declarative-structural-matching">later</a>.</p>
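
<p>The same facts answer the opposite question just as easily: leave the first argument unbound and Prolog finds the incoming edges instead. The sketch below repeats the edge facts so that it can be loaded on its own; <code class="language-plaintext highlighter-rouge">incoming</code> is a helper name I made up, and it uses the standard <code class="language-plaintext highlighter-rouge">findall/3</code> to collect every solution into a list.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>edge(a,b).
edge(b,c).
edge(b,d).
edge(c,e).
edge(d,e).

% Ask the edge facts "backwards": which nodes have an edge into To?
incoming(To, Froms) :- findall(From, edge(From, To), Froms).
</code></pre></div></div>

<p>Here <code class="language-plaintext highlighter-rouge">incoming(e, Froms)</code> gives <code class="language-plaintext highlighter-rouge">Froms = [c,d]</code>, and <code class="language-plaintext highlighter-rouge">incoming(a, Froms)</code> gives the empty list because nothing points at <code class="language-plaintext highlighter-rouge">a</code>.</p>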

<p>Let’s write a simple predicate. This predicate will hold true if <code class="language-plaintext highlighter-rouge">a</code> is a node.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">a_exists</span> <span class="p">:-</span> <span class="ss">node</span><span class="p">(</span><span class="ss">a</span><span class="p">).</span>
</code></pre></div></div>
<p>…and, that’s it. Internally, it checks whether the fact <code class="language-plaintext highlighter-rouge">node(a)</code> exists. Read the <code class="language-plaintext highlighter-rouge">:-</code> as <strong>‘LHS is true if RHS is true’</strong>.</p>

<p>Suppose we want to test if a node with a symbol of our choosing exists. We write:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">node_exists</span><span class="p">(</span><span class="nv">N</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">node</span><span class="p">(</span><span class="nv">N</span><span class="p">).</span>
</code></pre></div></div>

<p>Here, note that <code class="language-plaintext highlighter-rouge">N</code> is a variable which we bind to a ‘thing’ when asking the question. You can also not bind it, but that’s a different kind of question, which we will get to in a bit.</p>

<p>Now you can ask the Prolog interpreter:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- node_exists(a).
true.

?- node_exists(efgh).
false.
</code></pre></div></div>

<p>Note that we have replaced the <code class="language-plaintext highlighter-rouge">a</code> with <code class="language-plaintext highlighter-rouge">N</code>, a variable, which means that <code class="language-plaintext highlighter-rouge">node_exists</code> takes in a <strong>thing</strong>. It is important to note that this thing can really be anything: an atom, a string, another predicate, etc. As long as Prolog can find a rule which matches the sort of thing we inject into <code class="language-plaintext highlighter-rouge">node_exists</code>, it will consider this rule for further evaluation.</p>

<p>Suppose we wish to check if a node has multiple outgoing edges. We can write:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">has_multiple_outgoing_edges</span><span class="p">(</span><span class="nv">N</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">N</span><span class="p">,</span><span class="nv">A</span><span class="p">),</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">N</span><span class="p">,</span><span class="nv">B</span><span class="p">),</span> <span class="nv">A</span> <span class="err">\</span><span class="o">=</span> <span class="nv">B</span><span class="p">,</span> <span class="p">!.</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">,</code> in this case represents the logical AND. This means that <code class="language-plaintext highlighter-rouge">has_multiple_outgoing_edges(N)</code> is true if:</p>

<ul>
  <li>There is an edge from <code class="language-plaintext highlighter-rouge">N</code> to <code class="language-plaintext highlighter-rouge">A</code></li>
  <li>There is an edge from <code class="language-plaintext highlighter-rouge">N</code> to <code class="language-plaintext highlighter-rouge">B</code></li>
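  <li><code class="language-plaintext highlighter-rouge">A</code> is not the same as <code class="language-plaintext highlighter-rouge">B</code>. <code class="language-plaintext highlighter-rouge">\=</code> represents <strong>“not equal to”</strong> (more precisely, “does not unify with”).</li>
</ul>

<p>The trailing <code class="language-plaintext highlighter-rouge">!</code> is a <strong>cut</strong>: once one such pair of edges has been found, it tells Prolog to stop looking for further solutions, so the query answers <code class="language-plaintext highlighter-rouge">true</code> only once.</p>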

<p>We can check whether this works:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- has_multiple_outgoing_edges(b).
true.

?- has_multiple_outgoing_edges(a).
false.
</code></pre></div></div>
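
<p>An alternative way to express the same check is to count the outgoing edges and compare the count. This is a sketch of a different technique, not a replacement for the rule above. It relies on <code class="language-plaintext highlighter-rouge">aggregate_all/3</code> from SWI-Prolog’s aggregate library, and the names <code class="language-plaintext highlighter-rouge">out_degree</code> and <code class="language-plaintext highlighter-rouge">multiple_outgoing</code> are my own. Both predicates are meant to be called with <code class="language-plaintext highlighter-rouge">N</code> already bound to a node.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>edge(a,b).
edge(b,c).
edge(b,d).
edge(c,e).
edge(d,e).

% SWI-Prolog: aggregate_all(count, Goal, Count) counts the solutions of Goal.
out_degree(N, Degree) :- aggregate_all(count, edge(N, _), Degree).

% A node has multiple outgoing edges if its out-degree exceeds 1.
multiple_outgoing(N) :- out_degree(N, Degree), Degree &gt; 1.
</code></pre></div></div>

<p>With these definitions, <code class="language-plaintext highlighter-rouge">out_degree(b, D)</code> gives <code class="language-plaintext highlighter-rouge">D = 2</code> and <code class="language-plaintext highlighter-rouge">out_degree(f, D)</code> gives <code class="language-plaintext highlighter-rouge">D = 0</code>. Counting needs no cut, and the threshold is easy to change.</p>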

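<p>Let’s check if a node is unconnected or not, i.e., whether it has neither outgoing nor incoming edges. We can write:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">is_unconnected</span><span class="p">(</span><span class="nv">N</span><span class="p">)</span> <span class="p">:-</span> <span class="err">\</span><span class="o">+</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">N</span><span class="p">,</span><span class="nv">_</span><span class="p">),</span> <span class="err">\</span><span class="o">+</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">_</span><span class="p">,</span><span class="nv">N</span><span class="p">).</span>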
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">\+</code> negates a condition (logical NOT). The <code class="language-plaintext highlighter-rouge">_</code> implies that there is a value there, but we don’t care about the value enough to put it into a variable.</p>

<p>So, we can try:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- is_unconnected(f).
true.

?- is_unconnected(a).
false.
</code></pre></div></div>
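
<p>One caveat: <code class="language-plaintext highlighter-rouge">\+</code> never binds variables, so asking <code class="language-plaintext highlighter-rouge">is_unconnected(X)</code> with <code class="language-plaintext highlighter-rouge">X</code> unbound simply fails instead of listing the unconnected nodes. A common workaround is to generate candidates first and negate afterwards. The sketch below does that, under the assumption that “unconnected” means no edges in either direction; <code class="language-plaintext highlighter-rouge">isolated_node</code> is a name I made up for it.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>node(a).
node(b).
node(c).
node(d).
node(e).
node(f).

edge(a,b).
edge(b,c).
edge(b,d).
edge(c,e).
edge(d,e).

% node(N) binds N first, so both negations are tested on a concrete node.
isolated_node(N) :- node(N), \+ edge(N, _), \+ edge(_, N).
</code></pre></div></div>

<p>Now <code class="language-plaintext highlighter-rouge">findall(N, isolated_node(N), Ns)</code> gives <code class="language-plaintext highlighter-rouge">Ns = [f]</code>. Note that <code class="language-plaintext highlighter-rouge">e</code> is excluded: it has no outgoing edges, but two edges point into it.</p>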

<p>We can perform more meaningful semantic reasoning though. Suppose, we are interested in knowing whether a node is reachable from another node. We can define a predicate <code class="language-plaintext highlighter-rouge">can_reach</code> like so:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">can_reach</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">To</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">To</span><span class="p">),</span> <span class="p">!.</span>
<span class="ss">can_reach</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">To</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">Z</span><span class="p">),</span> <span class="ss">can_reach</span><span class="p">(</span><span class="nv">Z</span><span class="p">,</span> <span class="nv">To</span><span class="p">).</span>
</code></pre></div></div>

<p>The base case is simply that a node can reach another if there is an edge between them.
The general case is a recursive definition: it says that A can reach B, if A has an edge to Z (some undefined node), and if Z in turn can reach B.</p>

<p>Thus for example, if we run <code class="language-plaintext highlighter-rouge">can_reach(a,c)</code>, we conceptually reason in the following manner:</p>

<ul>
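  <li>Can <code class="language-plaintext highlighter-rouge">a</code> reach <code class="language-plaintext highlighter-rouge">c</code>? There is no direct edge from <code class="language-plaintext highlighter-rouge">a</code> to <code class="language-plaintext highlighter-rouge">c</code>, so the base case fails and the general case applies: the answer is yes if <code class="language-plaintext highlighter-rouge">a</code> has an edge to some node Z, and Z can reach <code class="language-plaintext highlighter-rouge">c</code>.</li>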
  <li>The only node that <code class="language-plaintext highlighter-rouge">a</code> has an edge to is <code class="language-plaintext highlighter-rouge">b</code>, so we want to answer the question of whether <code class="language-plaintext highlighter-rouge">b</code> can reach <code class="language-plaintext highlighter-rouge">c</code>.</li>
  <li>This triggers the base case because <code class="language-plaintext highlighter-rouge">b</code> has a direct <code class="language-plaintext highlighter-rouge">edge</code> to <code class="language-plaintext highlighter-rouge">c</code>.</li>
  <li>Therefore, there is a node Z (in this case <code class="language-plaintext highlighter-rouge">b</code>) to which <code class="language-plaintext highlighter-rouge">a</code> has an edge, and which can reach <code class="language-plaintext highlighter-rouge">c</code>.</li>
</ul>
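
<p>To try this trace yourself, here is a minimal self-contained sketch. The <code class="language-plaintext highlighter-rouge">edge</code> facts are an assumption, chosen only to be consistent with the description above (<code class="language-plaintext highlighter-rouge">a</code> connects only to <code class="language-plaintext highlighter-rouge">b</code>, and <code class="language-plaintext highlighter-rouge">b</code> connects to <code class="language-plaintext highlighter-rouge">c</code>); substitute your own graph as needed.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Assumed facts (a hypothetical graph, for illustration only).
edge(a, b).
edge(b, c).
edge(b, d).
edge(c, e).
edge(d, e).

% Base case: a direct edge exists.
can_reach(From, To) :- edge(From, To), !.
% General case: hop to a neighbour Z, then recurse from Z.
can_reach(From, To) :- edge(From, Z), can_reach(Z, To).

% ?- can_reach(a, c).   % succeeds, via a -&gt; b -&gt; c
% ?- can_reach(e, a).   % fails, since e has no outgoing edges
</code></pre></div></div>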

<p>Here are the last two examples, before we talk about <strong>structural matching</strong>. The first one prints all the elements of a list. As in functional programming languages, lists are a core data structure in Prolog, and splitting them into a head and a tail is the canonical way of processing them. So, we will be using recursive rules.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">printall</span><span class="p">([]).</span>
<span class="ss">printall</span><span class="p">([</span><span class="nv">H</span><span class="p">|</span><span class="nv">T</span><span class="p">])</span> <span class="p">:-</span> <span class="ss">writeln</span><span class="p">(</span><span class="nv">H</span><span class="p">),</span> <span class="ss">printall</span><span class="p">(</span><span class="nv">T</span><span class="p">).</span>
</code></pre></div></div>

<p>The first rule is the base recursive rule, for when we have an empty list. The second one is the general rule, which separates out a list into its head and tail (car, cdr), writes out the head, and calls <code class="language-plaintext highlighter-rouge">printall</code> on the tail.</p>
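
<p>As a quick check of the two rules, the sketch below repeats the definition and shows the output you should expect for a three-element list (the list contents are arbitrary):</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>printall([]).
printall([H|T]) :- writeln(H), printall(T).

% ?- printall([a,b,c]).
% a
% b
% c
% true.
</code></pre></div></div>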

<p>As one last example, let’s reverse a list.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">reverse2</span><span class="p">([],</span><span class="nv">Acc</span><span class="p">,</span><span class="nv">Acc</span><span class="p">).</span>
<span class="ss">reverse2</span><span class="p">([</span><span class="nv">H</span><span class="p">|</span><span class="nv">T</span><span class="p">],</span><span class="nv">Acc</span><span class="p">,</span><span class="nv">Result</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">reverse2</span><span class="p">(</span><span class="nv">T</span><span class="p">,[</span><span class="nv">H</span><span class="p">|</span><span class="nv">Acc</span><span class="p">],</span><span class="nv">Result</span><span class="p">),!.</span>
</code></pre></div></div>

<p>The first argument is the list being passed in (and gradually getting decomposed into successive heads and tails). The second one is the accumulator which keeps getting built up with the result (it is always <code class="language-plaintext highlighter-rouge">[]</code> initially). The final argument is the actual result, to which the reversed list will be bound. You can run this like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- reverse2([1,2,3],[],R).
R = [3, 2, 1].
</code></pre></div></div>

<p>A list of common list processing idioms may be found <a href="https://github.com/dtonhofer/prolog_notes/tree/master/other_notes/about_list_processing">here</a>.</p>

<p>Predicates may look like functions, but they are only ever <code class="language-plaintext highlighter-rouge">true</code> or <code class="language-plaintext highlighter-rouge">false</code>. The results of any computation are always bound to any unresolved variables that you specify when invoking them. In this case, <code class="language-plaintext highlighter-rouge">Result</code> is the unresolved variable.</p>

<ul>
  <li>The first rule is the simple one: if the list is empty, it binds the third parameter (the result) to whatever the accumulator is at that point. This is an example of unification: very simplistically, you don’t explicitly assign values to variables; specifying the value in the slot where a variable sits is enough to bind it to the variable.</li>
  <li>The second rule once again keeps recursively decomposing the original list, but at the point of recursion, it adds the head to whatever the accumulator is (remember, appending in this case happens at the front of the list). The <code class="language-plaintext highlighter-rouge">!</code> is called a cut operator, and in this case is not strictly needed for forward inference, but I have it there to demonstrate backward inference, so you can technically ignore it for the moment.</li>
</ul>

<p>Thus, if we pass in a list <code class="language-plaintext highlighter-rouge">[1,2,3,4,5]</code>, the accumulator takes the values <code class="language-plaintext highlighter-rouge">[1]</code>, <code class="language-plaintext highlighter-rouge">[2,1]</code>, <code class="language-plaintext highlighter-rouge">[3,2,1]</code>, <code class="language-plaintext highlighter-rouge">[4,3,2,1]</code>, and <code class="language-plaintext highlighter-rouge">[5,4,3,2,1]</code> on successive recursive calls, with each new head added at the front. Also note that I use the term ‘append’ rather loosely, since there is no mutation: values in Prolog are always immutable.</p>

<p>So we can try:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- reverse2([1,2,3,4,5],[],Reversed).
Reversed = [5, 4, 3, 2, 1].
</code></pre></div></div>

<p>The amazing part is that you can <strong>reverse</strong> this operation, i.e., ask Prolog: if I gave you the reversed list, what was the original list?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- reverse2(Original,[],[5,4,3,2,1]).
Original = [1, 2, 3, 4, 5].
</code></pre></div></div>

<p>Here is one way you can concatenate two lists:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">concat</span><span class="p">([],</span><span class="nv">RHS</span><span class="p">,</span><span class="nv">RHS</span><span class="p">)</span> <span class="p">:-</span> <span class="p">!.</span>
<span class="ss">concat</span><span class="p">([</span><span class="nv">H</span><span class="p">|</span><span class="nv">T</span><span class="p">],</span><span class="nv">Acc</span><span class="p">,[</span><span class="nv">H</span><span class="p">|</span><span class="nv">R</span><span class="p">])</span> <span class="p">:-</span> <span class="ss">concat</span><span class="p">(</span><span class="nv">T</span><span class="p">,</span><span class="nv">Acc</span><span class="p">,</span><span class="nv">R</span><span class="p">).</span>
</code></pre></div></div>

<p>The base recursive case is when the left side is empty, so that the concatenation result is simply <code class="language-plaintext highlighter-rouge">RHS</code>.
The general case gradually peels off the head of the left list and makes it the head of the result; the rest of the result is the concatenation of the remaining tail with the right list, so the original order is preserved.</p>
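
<p>A short usage sketch of <code class="language-plaintext highlighter-rouge">concat</code>, run in both directions (the sample lists are arbitrary):</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>concat([],RHS,RHS) :- !.
concat([H|T],Acc,[H|R]) :- concat(T,Acc,R).

% Forwards: build the concatenation.
% ?- concat([1,2],[3,4],R).
% R = [1, 2, 3, 4].

% Backwards: given the right side and the result, recover the left side.
% ?- concat(X,[3,4],[1,2,3,4]).
% X = [1, 2].
</code></pre></div></div>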

<h2 id="compound-terms-the-same-thing">Compound Terms: The same thing</h2>

<p>Compound terms, which look just like predicates, can be pattern-matched too. Write:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">recognise_letters</span><span class="p">(</span><span class="ss">some_fn</span><span class="p">(</span><span class="ss">a</span><span class="p">))</span> <span class="p">:-</span> <span class="ss">writeln</span><span class="p">(</span><span class="ss">'a was first parameter'</span><span class="p">),!.</span>
<span class="ss">recognise_letters</span><span class="p">(</span><span class="ss">some_fn</span><span class="p">(</span><span class="ss">b</span><span class="p">))</span> <span class="p">:-</span> <span class="ss">writeln</span><span class="p">(</span><span class="ss">'b was first parameter'</span><span class="p">),!.</span>
<span class="ss">recognise_letters</span><span class="p">(</span><span class="ss">some_fn</span><span class="p">(</span><span class="nv">_</span><span class="p">))</span> <span class="p">:-</span> <span class="ss">writeln</span><span class="p">(</span><span class="ss">'Something other than a or b'</span><span class="p">).</span>
</code></pre></div></div>

<p>You will trigger one of the rules above as long as the arity of the given <code class="language-plaintext highlighter-rouge">some_fn</code> matches, as shown in the examples below. This structural matching of compound terms is extremely powerful and can be as arbitrary as you like.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- recognise_letters(some_fn(a)).
a was first parameter
true.

?- recognise_letters(some_fn(b)).
b was first parameter
true.

?- recognise_letters(some_fn(c)).
Something other than a or b
true.

?- recognise_letters(some_fn(c,12)).
false.
</code></pre></div></div>

<h2 id="evaluation-does-not-happen-by-default">Evaluation does not happen by default</h2>

<p>If you write the following, you will see that Prolog doesn’t actually evaluate <code class="language-plaintext highlighter-rouge">1+2</code> to <code class="language-plaintext highlighter-rouge">3</code>. This is because <code class="language-plaintext highlighter-rouge">1+2</code> is actually a compound term <code class="language-plaintext highlighter-rouge">+(1,2)</code>, and not (yet) an arithmetic expression.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- X=1+2,writeln(X).
1+2
X = 1+2.
</code></pre></div></div>

<p>If you want to actually evaluate, you’d use the <code class="language-plaintext highlighter-rouge">is</code> operator.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- Y is 1+2.
Y = 3.
</code></pre></div></div>
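
<p>The difference between unifying and evaluating can be seen in a small sketch. The helper <code class="language-plaintext highlighter-rouge">double</code> is a hypothetical name introduced only for this illustration; it receives an unevaluated term and evaluates it with <code class="language-plaintext highlighter-rouge">is</code>.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Expr arrives as an unevaluated term such as +(1,2);
% 'is' evaluates the whole arithmetic expression on its right.
double(Expr, Result) :- Result is Expr * 2.

% ?- double(1+2, R).
% R = 6.

% ?- 1+2 = 3.      % false: +(1,2) and 3 are different terms
% ?- 1+2 =:= 3.    % true: =:= evaluates both sides before comparing
</code></pre></div></div>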

<p>This allows us to pass predicates around and write higher-order functions. Here is a (slightly-contrived) example:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">print_mapped_number</span><span class="p">(</span><span class="nv">Number</span><span class="p">,</span><span class="nv">Pred</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">call</span><span class="p">(</span><span class="nv">Pred</span><span class="p">,</span> <span class="nv">Number</span><span class="p">,</span> <span class="nv">Result</span><span class="p">),</span><span class="ss">format</span><span class="p">(</span><span class="ss">'Result is: ~w'</span><span class="p">,</span> <span class="nv">Result</span><span class="p">).</span>
<span class="ss">add_one</span><span class="p">(</span><span class="nv">Number</span><span class="p">,</span> <span class="nv">Result</span><span class="p">)</span> <span class="p">:-</span> <span class="nv">Result</span> <span class="ss">is</span> <span class="nv">Number</span> <span class="o">+</span> <span class="m">1</span><span class="p">.</span>
</code></pre></div></div>

<p>Now if you write:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- print_mapped_number(2,add_one).
Result is: 3
true.
</code></pre></div></div>

<p>Effectively, <code class="language-plaintext highlighter-rouge">add_one</code> is a predicate which isn’t evaluated until <code class="language-plaintext highlighter-rouge">call</code> is used on it.</p>
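
<p>The same idea extends to whole lists. The sketch below assumes SWI-Prolog, where <code class="language-plaintext highlighter-rouge">maplist/3</code> is available by default; it calls the given predicate on each pair of corresponding elements.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add_one(Number, Result) :- Result is Number + 1.

% maplist(Pred, Xs, Ys) succeeds if call(Pred, X, Y) holds for each X in Xs
% and the corresponding Y in Ys.
% ?- maplist(add_one, [1,2,3], Results).
% Results = [2, 3, 4].
</code></pre></div></div>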

<p>You can also perform structural construction and deconstruction of compound terms very easily:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- a(1,2)=..X.
X = [a, 1, 2].

?- X=..[a, 1, 2].
X = a(1, 2).
</code></pre></div></div>

<p>This opens up a world of possibilities for metaprogramming, term rewriting, and other applications.</p>
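
<p>As a small taste of term rewriting, here is a sketch that swaps the two arguments of any binary compound term. The name <code class="language-plaintext highlighter-rouge">swap_args</code> is hypothetical, introduced only for this example.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Deconstruct the term into [Functor, A, B], then rebuild it with
% the arguments in the opposite order.
swap_args(Term, Swapped) :-
    Term =.. [Functor, A, B],
    Swapped =.. [Functor, B, A].

% ?- swap_args(edge(a,b), S).
% S = edge(b, a).
</code></pre></div></div>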

<p>What about more structured data?</p>

<p>You can simply represent more complex structured data, by nesting compound terms, like in the example below. Prolog’s unification automatically binds the corresponding variables.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">print_person</span><span class="p">((</span><span class="nv">FirstName</span><span class="p">,</span> <span class="nv">LastName</span><span class="p">),</span><span class="nv">Age</span><span class="p">,(</span><span class="nv">AddressLine1</span><span class="p">,</span> <span class="nv">AddressLine2</span><span class="p">))</span> <span class="p">:-</span> <span class="ss">format</span><span class="p">(</span><span class="ss">'First name:~w, Last name: ~w, Age: ~w, AddressLine1: ~w, AddressLine2: ~w'</span><span class="p">,</span> <span class="p">[</span><span class="nv">FirstName</span><span class="p">,</span> <span class="nv">LastName</span><span class="p">,</span> <span class="nv">Age</span><span class="p">,</span> <span class="nv">AddressLine1</span><span class="p">,</span> <span class="nv">AddressLine2</span><span class="p">]).</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- print_person(("John", "Doe"),30,("1 Some Street", "Some City")).
First name:John, Last name: Doe, Age: 30, AddressLine1: 1 Some Street, AddressLine2: Some City
true.
</code></pre></div></div>
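
<p>Anonymous pairs work, but named functors often make nested data easier to read. The following is only a sketch: the functors <code class="language-plaintext highlighter-rouge">person</code>, <code class="language-plaintext highlighter-rouge">name</code>, and <code class="language-plaintext highlighter-rouge">address</code>, and the predicate <code class="language-plaintext highlighter-rouge">print_person_record</code>, are hypothetical names chosen for illustration.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% The head pattern destructures the nested term in a single unification.
print_person_record(person(name(First, Last), Age, address(Line1, Line2))) :-
    format('~w ~w (~w) lives at ~w, ~w~n', [First, Last, Age, Line1, Line2]).

% ?- print_person_record(person(name('John', 'Doe'), 30, address('1 Some Street', 'Some City'))).
% John Doe (30) lives at 1 Some Street, Some City
% true.
</code></pre></div></div>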

<h2 id="declarative-structural-matching">Declarative Structural Matching</h2>

<p>Let’s talk about how you can do non-trivial pattern matching declaratively, and in the process, we will concretise the idea of being able to run Prolog programs backwards and forwards. For example, suppose you have a list, and you want to check whether it starts with the elements <code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, and ends with <code class="language-plaintext highlighter-rouge">5</code>, <code class="language-plaintext highlighter-rouge">6</code>, irrespective of what exists in the middle. In most other languages, you’d want to extract the first 2 elements and the last 2 elements (after some boundary checking), and then check those against the patterns you are looking for. In Prolog, since you can (sort of) run programs both backwards and forwards, you can declare how you would construct a list which shows the desired pattern, and run that construction backwards.</p>

<p>So, let’s say you have three lists <code class="language-plaintext highlighter-rouge">Prefix</code>, <code class="language-plaintext highlighter-rouge">Middle</code>, and <code class="language-plaintext highlighter-rouge">Suffix</code>. How would you construct a list containing all of them?</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">join_3</span><span class="p">(</span><span class="nv">Prefix</span><span class="p">,</span> <span class="nv">Middle</span><span class="p">,</span> <span class="nv">Suffix</span><span class="p">,</span> <span class="nv">Result</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">concat</span><span class="p">(</span><span class="nv">Prefix</span><span class="p">,</span><span class="nv">RHS</span><span class="p">,</span><span class="nv">Result</span><span class="p">),</span> <span class="ss">concat</span><span class="p">(</span><span class="nv">Middle</span><span class="p">,</span><span class="nv">Suffix</span><span class="p">,</span><span class="nv">RHS</span><span class="p">).</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">RHS</code> is defined here as the concatenation of <code class="language-plaintext highlighter-rouge">Middle</code> and <code class="language-plaintext highlighter-rouge">Suffix</code>, and <code class="language-plaintext highlighter-rouge">Result</code> is defined as the concatenation of <code class="language-plaintext highlighter-rouge">Prefix</code> and <code class="language-plaintext highlighter-rouge">RHS</code>. That’s pretty logical, and you can test this out like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- join_3([1,2],[3,4],[5,6],R).
R = [1, 2, 3, 4, 5, 6].
</code></pre></div></div>

<p>Now, make the result already-determined, and make the middle a <em>don’t-care</em> variable, and see what Prolog tells you:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- join_3([1,2],_,[5,6],[1,2,3,4,5,6]).
true.

?- join_3([1,2],_,[5,6],[1,2,3,4,5,6,7]).
false.
</code></pre></div></div>

<p>Here, <strong>you’re literally asking whether this relation holds given the prefix, the suffix, and the end result</strong>: effectively you are performing <strong>pattern matching</strong>. Note how you did not have to extract out elements manually and do tedious equality checking. <strong>You just specified how a list conforming to the structural pattern you are looking for, might be built, and you just run it backwards.</strong> This is a very strong motivating example of you instructing the machine <strong>WHAT to do, but NOT how</strong> to do it.</p>
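
<p>The same relation can also extract the part you previously ignored. As a sketch, replacing the don’t-care variable with a named one makes Prolog report what sits in the middle:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>concat([],RHS,RHS) :- !.
concat([H|T],Acc,[H|R]) :- concat(T,Acc,R).

join_3(Prefix, Middle, Suffix, Result) :- concat(Prefix,RHS,Result), concat(Middle,Suffix,RHS).

% ?- join_3([1,2],Middle,[5,6],[1,2,3,4,5,6]).
% Middle = [3, 4].

% An empty middle matches too:
% ?- join_3([1,2],Middle,[5,6],[1,2,5,6]).
% Middle = [].
</code></pre></div></div>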

<h2 id="prolog-concepts-unification">Prolog Concepts: Unification</h2>

<p>In automated reasoning, unification is a technique used to solve equations where the left and right hand sides of the equation are symbolic expressions. For example, the following are examples of such equations:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">% The free variable is X and the solution is X=c</span>
<span class="ss">f</span><span class="p">(</span><span class="ss">a</span><span class="p">,</span><span class="ss">b</span><span class="p">,</span><span class="nv">X</span><span class="p">)</span> <span class="o">=</span> <span class="ss">f</span><span class="p">(</span><span class="ss">a</span><span class="p">,</span><span class="ss">b</span><span class="p">,</span><span class="ss">c</span><span class="p">).</span>

<span class="c1">% The free variables are X and Y and the solution is {X=c, Y=a}</span>
<span class="ss">f</span><span class="p">(</span><span class="ss">a</span><span class="p">,</span><span class="ss">b</span><span class="p">,</span><span class="nv">X</span><span class="p">)</span> <span class="o">=</span> <span class="ss">f</span><span class="p">(</span><span class="nv">Y</span><span class="p">,</span><span class="ss">b</span><span class="p">,</span><span class="ss">c</span><span class="p">).</span>

</code></pre></div></div>
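
<p>Such equations can be posed directly at the prompt using the <code class="language-plaintext highlighter-rouge">=</code> operator, which asks Prolog to unify its two sides. The third query is an extra illustration of a failed unification, where the functors differ:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- f(a,b,X) = f(a,b,c).
X = c.

?- f(a,b,X) = f(Y,b,c).
X = c,
Y = a.

?- f(a,b,X) = g(a,b,c).
false.
</code></pre></div></div>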

<p>If the right hand side has no free variables, it is basically the same as pattern matching. Consider the earlier example of reversing a list:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">reverse2</span><span class="p">([],</span><span class="nv">Acc</span><span class="p">,</span><span class="nv">Acc</span><span class="p">).</span>
<span class="ss">reverse2</span><span class="p">([</span><span class="nv">H</span><span class="p">|</span><span class="nv">T</span><span class="p">],</span><span class="nv">Acc</span><span class="p">,</span><span class="nv">Result</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">reverse2</span><span class="p">(</span><span class="nv">T</span><span class="p">,[</span><span class="nv">H</span><span class="p">|</span><span class="nv">Acc</span><span class="p">],</span><span class="nv">Result</span><span class="p">),!.</span>
</code></pre></div></div>

<p>Let’s trace the process of reversing a small list <code class="language-plaintext highlighter-rouge">[1,2]</code> and see how unification works.</p>

<h3 id="step-1-general-rule-fires-but-is-unresolved">Step 1: General Rule fires but is unresolved</h3>
<p>When we run <code class="language-plaintext highlighter-rouge">reverse2([1,2],[],Result)</code>, the general rule triggers with the following unifications.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">Acc</code> is unified with []</li>
  <li><code class="language-plaintext highlighter-rouge">H</code> is unified with 1</li>
  <li><code class="language-plaintext highlighter-rouge">T</code> is unified with <code class="language-plaintext highlighter-rouge">[2]</code></li>
</ul>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">reverse2</span><span class="p">([</span><span class="m">1</span><span class="p">|[</span><span class="m">2</span><span class="p">]],[],</span><span class="nv">Result</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">reverse2</span><span class="p">([</span><span class="m">2</span><span class="p">],[</span><span class="m">1</span><span class="p">|[]],</span><span class="nv">Result</span><span class="p">).</span>
</code></pre></div></div>
<p>At this point, this equation cannot be fully solved since <code class="language-plaintext highlighter-rouge">Result</code> is still unbound. However, this triggers the general rule again when trying to determine the truth of the right hand side.</p>

<h3 id="step-2-general-rule-fires-but-is-unresolved">Step 2: General Rule fires but is unresolved</h3>

<p>A similar situation occurs here too, with the following unifications.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">Acc</code> is unified with <code class="language-plaintext highlighter-rouge">[1|[]]</code></li>
  <li><code class="language-plaintext highlighter-rouge">H</code> is unified with 2</li>
  <li><code class="language-plaintext highlighter-rouge">T</code> is unified with <code class="language-plaintext highlighter-rouge">[]</code></li>
</ul>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">reverse2</span><span class="p">([</span><span class="m">2</span><span class="p">|[]],[</span><span class="m">1</span><span class="p">|[]],</span><span class="nv">Result</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">reverse2</span><span class="p">([],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]],</span><span class="nv">Result</span><span class="p">).</span>
</code></pre></div></div>

<p>As in <a href="#step-1-general-rule-fires-but-is-unresolved">Step 1</a>, this equation is also unsolved because of the unbound <code class="language-plaintext highlighter-rouge">Result</code> variable. To determine the truth of the right hand side this time though, it is the base case which fires.</p>

<h3 id="step-3-base-rule-fires-and-is-resolved">Step 3: Base Rule fires and is resolved</h3>
<p>The base case is triggered, and here the third parameter (<code class="language-plaintext highlighter-rouge">Result</code>) is finally unified with the value of <code class="language-plaintext highlighter-rouge">Acc</code> (which is <code class="language-plaintext highlighter-rouge">[2|[1|[]]]</code>), since the same <code class="language-plaintext highlighter-rouge">Acc</code> variable appears in the second and third place.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">reverse2</span><span class="p">([],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]]).</span>
</code></pre></div></div>

<p>Now that this fact has been determined to be true, it is time to unroll and determine the truth values of the preceding rules in the stack.</p>

<h3 id="step-4-general-rule-resolution-is-resolved">Step 4: General Rule Resolution is resolved</h3>
<p>Now the <a href="#step-2-general-rule-fires-but-is-unresolved">unresolved rule in Step 2</a> can be solved, since <code class="language-plaintext highlighter-rouge">Result</code> is also known, thus <code class="language-plaintext highlighter-rouge">Result</code> is unified here with the value <code class="language-plaintext highlighter-rouge">[2|[1|[]]]</code>.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">reverse2</span><span class="p">([</span><span class="m">2</span><span class="p">|[]],[</span><span class="m">1</span><span class="p">|[]],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]])</span> <span class="p">:-</span> <span class="ss">reverse2</span><span class="p">([],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]]).</span>
</code></pre></div></div>

<p>Now the left hand side is fully determined (and thus true); this means that the preceding rule in the stack is ready to be resolved.</p>

<h3 id="step-5-general-rule-resolution-is-resolved">Step 5: General Rule Resolution is resolved</h3>

<p>Solving the <a href="#step-1-general-rule-fires-but-is-unresolved">previously-unresolved rule in Step 1</a> with the <code class="language-plaintext highlighter-rouge">Result</code> variable from <a href="#step-4-general-rule-resolution-is-resolved">Step 4</a> unifies the <code class="language-plaintext highlighter-rouge">Result</code> variable on both sides with <code class="language-plaintext highlighter-rouge">[2|[1|[]]]</code>:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">reverse2</span><span class="p">([</span><span class="m">1</span><span class="p">|[</span><span class="m">2</span><span class="p">]],[],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]])</span> <span class="p">:-</span> <span class="ss">reverse2</span><span class="p">([</span><span class="m">2</span><span class="p">],[</span><span class="m">1</span><span class="p">|[]],[</span><span class="m">2</span><span class="p">|[</span><span class="m">1</span><span class="p">|[]]]).</span>
</code></pre></div></div>

<p>This final result <code class="language-plaintext highlighter-rouge">[2|[1|[]]]</code> is bound to our query variable and shown with syntactic sugar as <code class="language-plaintext highlighter-rouge">[2,1]</code>.</p>
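
<p>You can confirm at the prompt that the bar notation and the comma notation denote the same list, and that a head/tail pattern unifies against either form:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- X = [2|[1|[]]].
X = [2, 1].

?- [H|T] = [2,1].
H = 2,
T = [1].
</code></pre></div></div>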

<h2 id="prolog-concepts-backtracking">Prolog Concepts: Backtracking</h2>

<p>Prolog programs essentially form execution trees with nodes being either terminal leaves (facts), or dependent predicates with child predicates arranged in a logical expression comprised of AND, OR, and NOT.</p>

<p>Let’s revisit our graph reachability example. We had written:</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ss">can_reach</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">To</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">To</span><span class="p">),</span> <span class="p">!.</span>
<span class="ss">can_reach</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">To</span><span class="p">)</span> <span class="p">:-</span> <span class="ss">edge</span><span class="p">(</span><span class="nv">From</span><span class="p">,</span> <span class="nv">Z</span><span class="p">),</span> <span class="ss">can_reach</span><span class="p">(</span><span class="nv">Z</span><span class="p">,</span> <span class="nv">To</span><span class="p">).</span>
</code></pre></div></div>

<p>Let us ask <code class="language-plaintext highlighter-rouge">can_reach(b,e).</code>. Note that in the graph, <code class="language-plaintext highlighter-rouge">b</code> has 2 paths to <code class="language-plaintext highlighter-rouge">e</code>, via <code class="language-plaintext highlighter-rouge">c</code> and <code class="language-plaintext highlighter-rouge">d</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?- can_reach(b,e).
true ;
true.
</code></pre></div></div>

<p>You will notice that it returns two truth values. Each of these corresponds to one of the viable paths. Conceptually, Prolog constructs the following abstract execution tree.</p>

<pre><code class="language-mermaid">graph TD;
can_reach_be["can_reach(b,e)"]--&gt;chol[AND];
chol[AND] --&gt; edge_bZ["edge(b,Z)"];
chol[AND] --&gt; can_reach_Ze["can_reach(Z,e)"];
style chol fill:#006f00,stroke:#000,stroke-width:2px,color:#fff
</code></pre>

<p>This abstract execution tree gets concretised and constrained by the space of available solutions. In this case, the two values that <code class="language-plaintext highlighter-rouge">Z</code> can take are <code class="language-plaintext highlighter-rouge">c</code> and <code class="language-plaintext highlighter-rouge">d</code>. Thus, the execution tree runs for each of these paths.</p>

<pre><code class="language-mermaid">graph TD;
can_reach_bc["can_reach(b,e)"]--&gt;path_Z_is_c;
can_reach_bc["can_reach(b,e)"]--&gt;path_Z_is_d;
path_Z_is_c["Z=c"];
path_Z_is_d["Z=d"];
path_Z_is_c --&gt; and_Zc[AND];
and_Zc[AND] --&gt; edge_bc_Zc["edge(b,c)"];
and_Zc[AND] --&gt; can_reach_Ze["can_reach(c,e)"];
path_Z_is_d --&gt; and_Zd[AND];
and_Zd[AND] --&gt; edge_bc_Zd["edge(b,d)"];
and_Zd[AND] --&gt; can_reach_de["can_reach(d,e)"];
style and_Zc fill:#006f00,stroke:#000,stroke-width:2px,color:#fff
style and_Zd fill:#006f00,stroke:#000,stroke-width:2px,color:#fff
</code></pre>

<p>Once a solution is found, Prolog <strong>walks back up the tree</strong>. At each preceding level, it attempts to find more viable paths to exhaust the space of all possible solutions. In this case, both paths <code class="language-plaintext highlighter-rouge">Z=c</code> and <code class="language-plaintext highlighter-rouge">Z=d</code> are viable, so <code class="language-plaintext highlighter-rouge">true</code> is returned twice, once for each of those paths.</p>
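
<p>Backtracking can also be driven programmatically. The sketch below uses the standard <code class="language-plaintext highlighter-rouge">findall/3</code> predicate to collect every intermediate node <code class="language-plaintext highlighter-rouge">Z</code> that Prolog finds while exhausting the tree. The <code class="language-plaintext highlighter-rouge">edge</code> facts are an assumption, chosen to match the two paths described above.</p>

<div class="language-prolog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Assumed facts (hypothetical, for illustration only).
edge(b, c).
edge(b, d).
edge(c, e).
edge(d, e).

can_reach(From, To) :- edge(From, To), !.
can_reach(From, To) :- edge(From, Z), can_reach(Z, To).

% findall/3 forces backtracking over every solution of the goal
% and gathers the bindings of Z into a list.
% ?- findall(Z, (edge(b,Z), can_reach(Z,e)), Zs).
% Zs = [c, d].
</code></pre></div></div>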

<h3 id="notes-on-logical-operator-syntax">Notes on Logical Operator Syntax</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">,</code> represents <strong>AND</strong>. All the predicates we have seen so far use this operator. <code class="language-plaintext highlighter-rouge">p :- a,b.</code> means <strong>“<code class="language-plaintext highlighter-rouge">p</code> is true if both <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are true”</strong>.</li>
  <li><code class="language-plaintext highlighter-rouge">;</code> represents <strong>OR</strong>. <code class="language-plaintext highlighter-rouge">p :- a;b.</code> means “<code class="language-plaintext highlighter-rouge">p</code> is true if either <code class="language-plaintext highlighter-rouge">a</code> or <code class="language-plaintext highlighter-rouge">b</code> is true”.</li>
  <li><code class="language-plaintext highlighter-rouge">\+</code> represents <strong>NOT</strong>. <code class="language-plaintext highlighter-rouge">p :- \+a.</code> means “<code class="language-plaintext highlighter-rouge">p</code> is true if <code class="language-plaintext highlighter-rouge">a</code> is false”.</li>
</ul>

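<p>The usual operator precedence rules apply: <code class="language-plaintext highlighter-rouge">,</code> binds more tightly than <code class="language-plaintext highlighter-rouge">;</code>, so you need parentheses to override precedence, for example: <code class="language-plaintext highlighter-rouge">p :- (a;b),c.</code></p>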

<p>Relational operators are a little different from “conventional” programming languages.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">=</code> checks for <strong>structural equality</strong>: <code class="language-plaintext highlighter-rouge">f(a,b) = f(a,b)</code> succeeds. It also does double duty as assignment. To be honest, these are one and the same, since unification applies to both sides.</li>
  <li><code class="language-plaintext highlighter-rouge">=&lt;</code> represents <strong>“less than or equal to”</strong>. <code class="language-plaintext highlighter-rouge">&lt;</code> is <strong>“less than”</strong>.</li>
  <li><code class="language-plaintext highlighter-rouge">&gt;=</code> represents <strong>“greater than or equal to”</strong>. <code class="language-plaintext highlighter-rouge">&gt;</code> is <strong>“greater than”</strong>.</li>
</ul>

<p>There is a <strong>ternary operator syntax</strong>. We write it as <code class="language-plaintext highlighter-rouge">a -&gt; b; c</code> which is read as <strong>“if <code class="language-plaintext highlighter-rouge">a</code> is true, then <code class="language-plaintext highlighter-rouge">b</code>, else <code class="language-plaintext highlighter-rouge">c</code>”</strong>. Note that since <code class="language-plaintext highlighter-rouge">;</code> is being used as a separator, any logical expressions you write must be suitably bracketed.</p>
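
<p>For readers coming from conventional languages, a rough analogy: when every goal is simply true or false, the logical operators above line up with Python’s boolean operators. This is only a sketch. Prolog goals are proof searches that can bind variables and succeed more than once, and <code class="language-plaintext highlighter-rouge">\+</code> is negation as failure rather than classical NOT. The names <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code> and <code class="language-plaintext highlighter-rouge">c</code> are placeholders.</p>

<pre><code class="language-python">a, b, c = True, False, False

p_and = a and b    # p :- a, b.
p_or = a or b      # p :- a ; b.
p_not = not a      # p :- \+ a.

# "," binds tighter than ";", just as "and" binds tighter than "or".
unbracketed = a or b and c    # grouped like  a ; (b, c)
bracketed = (a or b) and c    # grouped like  (a ; b), c

# If-then-else:  a -> b ; c
if_then_else = b if a else c

print(p_and, p_or, p_not, unbracketed, bracketed, if_then_else)
</code></pre>

<p>With these values, <code class="language-plaintext highlighter-rouge">unbracketed</code> is true while <code class="language-plaintext highlighter-rouge">bracketed</code> is false, which is exactly the difference the parentheses make in <code class="language-plaintext highlighter-rouge">p :- (a;b),c.</code></p>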

<h2 id="my-personal-thoughts-on-prolog">My Personal Thoughts on Prolog</h2>

<h3 id="strengths">Strengths</h3>
<p>Prolog is an extremely powerful <strong>declarative modelling</strong> tool, in my opinion. Extremely low overhead in defining concepts and relations is one of the key draws for me, especially when <strong>rapidly prototyping</strong> concepts and relations. Unification plays a large part in allowing this level of brevity. In addition, being able to <strong>run logic backwards and forwards</strong> has given me a completely different way of looking at programming. The deductive reasoning semantics allow us to <strong>declaratively perform pattern matching</strong> with a naturalness I’ve not seen in other languages.</p>

<p>I will write about examples of doing static analysis on programs using Datalog, a widely-used subset of Prolog, in an upcoming post.</p>

<p>Many other forms of reasoning libraries (inductive, abductive, etc.) use Prolog (or a variant of it) as a base. Thus, it is emphatically a gateway to all work on <strong>automated reasoning</strong>, <strong>theorem proving</strong>, etc.</p>

<h3 id="usage">Usage</h3>
<p>Prolog is best used as an <strong>embedded component</strong> in a larger codebase written in a more conventional language, in order to play to its strengths. It has <strong>interfaces for Java, Python</strong>, and other more conventional programming languages.</p>

<h3 id="performance-and-scaling">Performance and Scaling</h3>
<p>Prolog’s performance used to be one of the sticking points, but there are fast implementations available, and <strong>Datalog</strong> (a non-Turing complete subset of Prolog) comfortably scales to <strong>millions of facts</strong>.</p>

<h2 id="applications">Applications</h2>

<ul>
  <li><a href="https://docs.datomic.com/query/query-data-reference.html">Datomic</a> uses Datalog as its query language (effectively, a much more expressive SQL).</li>
  <li>Prolog is used for pattern matching in the IBM Watson question answering system.</li>
  <li><a href="https://www.newscientist.com/article/dn7584-space-station-gets-hal-like-computer/">Clarissa</a>, a voice-operated procedure browser aboard the International Space Station, uses Prolog (<a href="https://sicstus.sics.se/customers.html">source</a>).</li>
  <li>Prolog has been used as a bytecode verifier in Java (see <a href="https://openjdk.org/jeps/8267650">here</a>).</li>
</ul>

<h2 id="footnote-how-i-learned-prolog">Footnote: How I learned Prolog</h2>

<p>This is the first language I’ve learned for the most part by working with an LLM to design exercises and evaluate my submissions.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/dtonhofer/prolog_notes">David Tonhofer’s Notes on Prolog</a></li>
  <li><a href="https://www.researchgate.net/publication/241111987_Programming_Paradigms_for_Dummies_What_Every_Programmer_Should_Know">Programming Paradigms for Dummies: What Every Programmer Should Know - Peter Van Roy (2012)</a></li>
  <li><a href="https://eu.swi-prolog.org/pldoc/man?section=byrd-box-model">Byrd Box Model and Ports</a></li>
</ul>]]></content><author><name>avishek</name></author><category term="Logic Programming" /><category term="Automated Reasoning" /><category term="Prolog" /><summary type="html"><![CDATA[In this post, I present an abbreviated overview of Prolog and the paradigm of Logic Programming. I’ll discuss why I think it makes for such a powerful domain modelling language (with examples), and a gateway into the techniques of automated symbolic reasoning.]]></summary></entry><entry><title type="html">Building an HLASM grammar for reverse engineering, from scratch</title><link href="https://avishek.net/2025/06/26/building-hlasm-grammar-from-scratch.html" rel="alternate" type="text/html" title="Building an HLASM grammar for reverse engineering, from scratch" /><published>2025-06-26T00:00:00+05:30</published><updated>2025-06-26T00:00:00+05:30</updated><id>https://avishek.net/2025/06/26/building-hlasm-grammar-from-scratch</id><content type="html" xml:base="https://avishek.net/2025/06/26/building-hlasm-grammar-from-scratch.html"><![CDATA[<p>This post talks about a technique to build an <strong>ANTLR grammar for HLASM</strong> (mainframe assembler) from scratch, without writing the grammar of the entire instruction set by hand. The technique creates a parser which reads a table of instruction formats from IBM’s official documentation, and automates the creation of the actual HLASM grammar based on these instruction formats.</p>

<p>The parser is used in <a href="https://github.com/avishek-sen-gupta/tape-z">Tape/Z</a>.</p>

<p><em>This post has not been written or edited by AI.</em></p>

<h2 id="existing-parsers">Existing parsers</h2>
<p>I needed a grammar for mainframe assembler for working on some of my reverse engineering experiments, while building <a href="https://github.com/avishek-sen-gupta/tape-z">Tape/Z</a>. The only requirement I had was that the grammar needed to be declarative, since I needed to generate parsers from it, potentially in multiple languages. It could be ANTLR, Tree-sitter, etc.</p>

<p>I had previously built the parsing component of <a href="https://github.com/avishek-sen-gupta/cobol-rekt">Cobol-REKT</a> by extracting and reusing parts of the LSP server from the <a href="https://github.com/eclipse-che4z/che-che4z-lsp-for-cobol">che-che4z-lsp-for-cobol</a> project. The grammar for COBOL/DB2, etc. in that library was written in ANTLR. However, when I surveyed existing support for HLASM, the following were the only ones I could find:</p>

<ul>
  <li><a href="https://github.com/z390development/z390">z390 emulator</a>: Programmatic Java parser, no explicit grammar</li>
  <li><a href="https://github.com/eclipse-che4z/che-che4z-lsp-for-hlasm">che-che4z-lsp-for-hlasm</a>: There used to be an ANTLR grammar, but it had many C++-specific extensions, and the most recent version has been completely rewritten in pure C++.</li>
</ul>

<p>Given that none of the above fit my requirement, I considered what it would take to write my own HLASM grammar.</p>

<h2 id="practical-difficulties-of-hand-writing-a-large-grammar">Practical difficulties of hand-writing a large grammar</h2>

<p>Now, HLASM by itself is not complicated. It’s simple opcodes, registers, and offsets. The problem is one of scale. z/OS assembly has well <strong>over 2000 instructions</strong>, being a CISC instruction set. They are all well-documented, but writing all of them by hand was not practical within the timelines I was looking at.</p>

<p>Thankfully, the formats for these instructions are documented very precisely <a href="https://www.ibm.com/docs/en/hla-and-tf/1.6.0?topic=instructions-table-all-supported">here</a>. Looking at that, I considered what it might take to automate the creation of this grammar.</p>

<h2 id="the-instruction-format-meta-parser">The instruction format Meta-Parser</h2>

<p>The table in the HLASM spec formally specifies the operand formats. Take for example the operand format for the <code class="language-plaintext highlighter-rouge">A</code> opcode.</p>

<p><img src="/assets/images/example-opcode-format.png" alt="Example opcode format" /></p>

<p>The idea is: <strong>what if we could parse this format and generate the actual desired grammar from the format parse tree?</strong> The “meta-parser” (for lack of a better term to describe its role) is less than 40 lines, and looks like so (the full grammar is documented <a href="https://github.com/avishek-sen-gupta/tape-z/blob/main/hlasm-parser/grammar/HlasmFormatParser.g4">here</a>):</p>

<pre><code class="language-antlrv4">control_register: CONTROL_REGISTER;
access_register: ACCESS_REGISTER;
floating_point_register_pair: FLOATING_POINT_REGISTER_PAIR;
floating_point_register: FLOATING_POINT_REGISTER;
index_register: INDEX_REGISTER;
base_register: BASE_REGISTER;
...
optionalSignedImmediateValue: OPTIONAL_OPEN_PAREN COMMA signed_immediate_value OPTIONAL_CLOSE_PAREN;
optionalRegister: OPTIONAL_OPEN_PAREN COMMA register_operand OPTIONAL_CLOSE_PAREN;
operands: ((operand COMMA)* operand)*;
operand: displacement | floating_point_register_pair | floating_point_register | index_register
    | base_register | register_pair | register_operand |  control_register | immediate_value | signed_immediate_value
    | length_field | mask_field | relative_immediate_operand | vector_register_pair | vector_register
    | optionalMaskField | optionalSignedImmediateValue | optionalRegister | access_register;
</code></pre>

<h2 id="the-hlasm-grammar-generator">The HLASM grammar generator</h2>

<p>The steps are pretty simple in and of themselves:</p>

<ul>
  <li><strong>Copy the HTML table</strong> into Google Sheets, and <strong>export that into a CSV</strong>. This gives us the formats ready for ingestion into the format parser.</li>
  <li><strong>Parse the operand formats</strong> for each instruction.</li>
  <li>Make a <strong>visitor</strong> (<code class="language-plaintext highlighter-rouge">HLASMParseRuleBuilderVisitor</code>) visit the operand format’s <code class="language-plaintext highlighter-rouge">ParseTree</code> and <strong>build object representations</strong> of the final grammar elements we wish to emit.</li>
  <li><strong>Add extra rules</strong> that might not have been listed in the base instruction table.</li>
  <li><strong>Emit the string representation</strong> of the resulting rule objects into a <code class="language-plaintext highlighter-rouge">.g4</code> file.</li>
  <li>…err, profit? :-)</li>
</ul>
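
<p>The last few steps can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the actual generator (which is the <code class="language-plaintext highlighter-rouge">HLASMParseRuleBuilderVisitor</code> walking an ANTLR parse tree): the CSV columns <code class="language-plaintext highlighter-rouge">mnemonic</code> and <code class="language-plaintext highlighter-rouge">operands</code>, the space-separated operand kinds, and the helper names <code class="language-plaintext highlighter-rouge">emit_rule</code> and <code class="language-plaintext highlighter-rouge">generate_grammar</code> are all made up for the example.</p>

<pre><code class="language-python">import csv
import io

# Hypothetical, pre-simplified CSV. The real IBM table uses a richer
# operand notation, which is what the format meta-parser is for.
SAMPLE_CSV = """mnemonic,operands
A,register displacement
ACONTROL,
ADBR,floatingPointRegister floatingPointRegister
"""


def emit_rule(index, mnemonic, operand_kinds):
    """Build the text of one ANTLR rule for a single instruction."""
    name = f"{mnemonic.lower()}_rule_{index}"
    if not operand_kinds:
        # Instructions without operands match on the mnemonic alone.
        return f"{name}: '{mnemonic}';"
    first = f"(operand_1_{operand_kinds[0]})"
    rest = "".join(
        f" (COMMA operand_{position}_{kind})"
        for position, kind in enumerate(operand_kinds[1:], start=2)
    )
    return f"{name}: '{mnemonic}' ({first}{rest});"


def generate_grammar(csv_text):
    """Turn every CSV row into one grammar rule, numbered from 1."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        emit_rule(index, row["mnemonic"], row["operands"].split())
        for index, row in enumerate(rows, start=1)
    ]


rules = generate_grammar(SAMPLE_CSV)
print("\n".join(rules))
</code></pre>

<p>The first emitted line is <code class="language-plaintext highlighter-rouge">a_rule_1: 'A' ((operand_1_register) (COMMA operand_2_displacement));</code>. Going through intermediate rule objects instead of concatenating strings directly, as the real generator does, makes it easier to add the extra hand-written rules before emitting.</p>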

<p><img src="/assets/images/tapez-hlasm-parser-metaparser.png" alt="HLASM Parser/Meta-Parser" /></p>

<p>There was one more wrinkle. The generated parser had so many <code class="language-plaintext highlighter-rouge">if</code> conditions and <code class="language-plaintext highlighter-rouge">switch...case</code> statements in the top-level scope that it exceeded some internal JDK limit, and the compiler refused to compile it. Thus, I had to break the 2000+ rules into groups of 400 (a rather arbitrary number), and use that as the first level of matching in the final grammar. Hey, you learn something new every day!</p>
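
<p>The grouping itself is mechanical. Here is a minimal sketch, assuming hypothetical rule names (<code class="language-plaintext highlighter-rouge">op_rule_N</code>, <code class="language-plaintext highlighter-rouge">instruction_group_N</code>, <code class="language-plaintext highlighter-rouge">instruction</code>); the real generated grammar may name and arrange its groups differently.</p>

<pre><code class="language-python">def chunk(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    return [items[start:start + size] for start in range(0, len(items), size)]


def group_rules(rule_names, group_size=400):
    """Emit one alternation per group, plus a top-level rule over the groups."""
    groups = chunk(rule_names, group_size)
    group_lines = [
        f"instruction_group_{number}: {' | '.join(group)};"
        for number, group in enumerate(groups, start=1)
    ]
    top_level = " | ".join(
        f"instruction_group_{number}" for number in range(1, len(groups) + 1)
    )
    return [f"instruction: {top_level};"] + group_lines


# 2000 placeholder rule names stand in for the generated instruction rules.
rule_names = [f"op_rule_{i}" for i in range(1, 2001)]
grammar = group_rules(rule_names)
print(grammar[0])
</code></pre>

<p>With 2000 rules and groups of 400, the top-level rule only has to choose between five alternatives, and each group rule carries a fifth of the original alternation.</p>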

<p>Here’s an extract from the resulting grammar:</p>

<pre><code class="language-antlrv4">...
a_rule_1: 'A' ((operand_1_register) (COMMA operand_2_displacement));
acontrol_rule_2: 'ACONTROL';
actr_rule_3: 'ACTR';
ad_rule_4: 'AD' ((operand_1_floatingPointRegister) (COMMA operand_2_displacement));
adata_rule_5: 'ADATA';
adb_rule_6: 'ADB' ((operand_1_floatingPointRegister) (COMMA operand_2_displacement));
adbr_rule_7: 'ADBR' ((operand_1_floatingPointRegister) (COMMA operand_2_floatingPointRegister));
adr_rule_8: 'ADR' ((operand_1_floatingPointRegister) (COMMA operand_2_floatingPointRegister));
...
</code></pre>

<p>and an example parse tree:</p>

<p><img src="/assets/images/example-hlasm-parse-tree.png" alt="Example HLASM parse tree" /></p>

<p>The full grammar is <a href="https://github.com/avishek-sen-gupta/tape-z/blob/main/hlasm-parser/grammar/HlasmParser.g4">here</a>.</p>

<h2 id="current-limitations">Current Limitations</h2>

<p>The instruction formats in the reference are for the base instructions. In practice, most HLASM programs have some higher-abstraction level addressing formats which allow them to use symbols, expressions, etc. which are then lowered by the assembler/macro processor to the form which will ultimately be translated into machine code.</p>

<p>Thus, I had to add some extra operand forms to accommodate this. However, this list is not exhaustive. For example, this parser will not parse operands with the length operator (<code class="language-plaintext highlighter-rouge">L'&lt;symbol-or-literal&gt;</code>).</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://www.ibm.com/docs/en/hla-and-tf/1.6.0?topic=instructions-table-all-supported">Table of all supported HLASM instructions</a></li>
  <li><a href="https://github.com/avishek-sen-gupta/tape-z">Tape/Z</a></li>
</ul>]]></content><author><name>avishek</name></author><category term="Parsers" /><category term="Reverse Engineering" /><category term="HLASM" /><category term="ANTLR" /><summary type="html"><![CDATA[This post talks about a technique to build an ANTLR grammar for HLASM (mainframe assembler) from scratch, without writing the grammar of the entire instruction set by hand. The technique creates a parser which reads a table of instruction formats from IBM’s official documentation, and automates the creation of the actual HLASM grammar based on these instruction formats.]]></summary></entry><entry><title type="html">Inductor: Automated Hypothesis Verification using LLMs and Hierarchical Bayes-like models</title><link href="https://avishek.net/2025/06/23/inductor-hypothesis-verification-hierarchical-bayes.html" rel="alternate" type="text/html" title="Inductor: Automated Hypothesis Verification using LLMs and Hierarchical Bayes-like models" /><published>2025-06-23T00:00:00+05:30</published><updated>2025-06-23T00:00:00+05:30</updated><id>https://avishek.net/2025/06/23/inductor-hypothesis-verification-hierarchical-bayes</id><content type="html" xml:base="https://avishek.net/2025/06/23/inductor-hypothesis-verification-hierarchical-bayes.html"><![CDATA[<p>We look at how a <strong>Hierarchical Bayes</strong>-like model can be used to recursively decompose a hypothesis into sub-hypotheses to form an <strong>inference tree</strong>. The beliefs of these sub-hypotheses are updated based on the strength of the evidence gathered using <strong>MCP tools</strong>.</p>

<p>These beliefs are propagated upwards through the inference tree to indicate the aggregate confidence of the original root hypothesis. This concept is demonstrated in a library called <a href="https://github.com/asengupta/inductor">Inductor</a>.</p>

<p><em>This post has not been written or edited by AI.</em></p>

<h2 id="motivation">Motivation</h2>
<p>For the last year or so, I’ve been heavily involved in building <strong>reverse engineering</strong> tooling dealing with legacy code. This legacy code includes the usual suspects (COBOL, HLASM), but can also include code written in more “modern” stacks, like Java, C#, etc. Much of this tooling is driven through LLMs (isn’t everything these days :-) ?).</p>

<p>However, these efforts have also forced some deeper introspection on my part about how humans deal with comprehending legacy code. There are several studies on models of human comprehension of code (both novices and experts), but for the purposes of this post, I will restrict myself to my own (obviously incomplete) mental model of how I resolve uncertainty when attempting to validate or invalidate a hypothesis.</p>

<p>This could be a hypothesis about anything, for example:</p>

<ul>
  <li><em>This routine does not modify this variable.</em></li>
  <li><em>This piece of code branches off unconditionally, but does return to the point of origin.</em></li>
  <li><em>This memory address represents a particular quantity in the domain.</em></li>
  <li>… and so on</li>
</ul>

<p>Most of the time, we (I?) look for <strong>signals</strong> which strengthen or weaken my belief in the hypothesis. Some studies call these signals <strong>beacons</strong>. Aggregating all these signals gives me a rough idea of how valid my hypothesis is. It is important to note that this is a sliding scale from <strong>“This is definitely false”</strong> to <strong>“This is definitely true”</strong>, with values like <strong>“I’m still not sure”</strong> in between.</p>

<p>This seems to be a good fit for <strong>Bayesian reasoning</strong>. For the purposes of this experiment, I adopted a simple approach which is analogous to using a <strong>Hierarchical Bayes Model</strong> with a <strong>Beta-Bernoulli conjugate</strong> for prior-posterior belief calculations (more on that <a href="#analogy-with-hierarchical-bayes-with-beta-bernoulli-conjugate">here</a>).</p>

<p>Let’s talk of <strong>hypothesis decomposition</strong>. Whenever I have a hypothesis, I’m subconsciously breaking it down into smaller hypotheses that I can prove/disprove. Then, I would go and gather evidence for/against these smaller hypotheses, and go back and assess my confidence in my original hypothesis. Essentially, we can think of this as building an <strong>inference tree</strong>, like so:</p>

<p><img src="/assets/images/inductor-hypothesis-decomposition.png" alt="Hypothesis Decomposition" /></p>

<p>The question is: <strong>is this something reproducible through LLMs and Bayes-like techniques?</strong>
This sort of hierarchical modelling is something found in <strong>Hierarchical Bayes Models</strong>. We will use something similar, but much simpler. We will simply sum up weighted combinations of the evidence for and against the corresponding sub-hypotheses.</p>

<p>I have thus encapsulated some of my learning and experiments into a library called <a href="https://github.com/asengupta/inductor">Inductor</a>. I originally meant for it to help me explore inductive logic programming techniques, hence the name (I still might :-) ).</p>

<h2 id="how-do-we-validate-a-hypothesis">How do we validate a hypothesis?</h2>

<ul>
  <li><strong>Propose</strong> a hypothesis (User / LLM)</li>
  <li><strong>Decompose</strong> the hypothesis with initial levels of belief (LLM) like so:
    <ul>
      <li>At every level of decomposition, ask the LLM to decide whether any more decomposition into sub-hypotheses is required. <strong>If decomposition is not required, the sub-hypothesis ends with a set of leaf evidence nodes.</strong> These are the pieces of evidence that will be gathered to determine the strength of the sub-hypothesis.</li>
    </ul>
  </li>
  <li><strong>Gather</strong> evidence for sub-hypotheses: This is where the LLM picks the best of the available <strong>MCP tools</strong> to gather specific pieces of evidence. The LLM decides whether the evidence supports the sub-hypothesis or not.</li>
  <li><strong>Propagate</strong> <em>beliefs</em> upward based on summing the (potentially weighted) counts of for/against evidence, all the way up to the original hypothesis (Deterministic)</li>
  <li>The final weighted counts of the for/against evidence at the root provide us with a <strong>degree of belief in the root hypothesis</strong>.</li>
</ul>

<p>The flow would look something like the following:</p>

<p><img src="/assets/images/inductor-belief-aggregation.png" alt="Hypothesis Belief Aggregation" /></p>

<p>In this case, we can simply aggregate the counts of the evidence into two buckets:</p>

<ul>
  <li>Evidence supporting the sub-hypothesis</li>
  <li>Evidence not supporting the sub-hypothesis</li>
</ul>
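
<p>The deterministic propagation step can be sketched as a small recursive function. This is a minimal illustration, not Inductor’s actual data model: the <code class="language-plaintext highlighter-rouge">Evidence</code> and <code class="language-plaintext highlighter-rouge">Hypothesis</code> classes, their fields, and the example findings are all hypothetical.</p>

<pre><code class="language-python">from dataclasses import dataclass, field


@dataclass
class Evidence:
    description: str
    supports: bool        # the LLM's verdict for this piece of evidence
    weight: float = 1.0   # optional weighting of the evidence


@dataclass
class Hypothesis:
    statement: str
    children: list = field(default_factory=list)   # sub-hypotheses
    evidence: list = field(default_factory=list)   # leaf evidence nodes


def aggregate(node):
    """Return (weight_for, weight_against) summed over the whole subtree."""
    weight_for = sum(e.weight for e in node.evidence if e.supports)
    weight_against = sum(e.weight for e in node.evidence if not e.supports)
    for child in node.children:
        child_for, child_against = aggregate(child)
        weight_for += child_for
        weight_against += child_against
    return weight_for, weight_against


root = Hypothesis(
    "The program uses a lot of registers",
    children=[
        Hypothesis("Sub-hypothesis A", evidence=[
            Evidence("hypothetical finding 1", supports=True),
            Evidence("hypothetical finding 2", supports=False),
        ]),
        Hypothesis("Sub-hypothesis B", evidence=[
            Evidence("hypothetical finding 3", supports=False, weight=2.0),
        ]),
    ],
)

totals = aggregate(root)
print(totals)  # (1.0, 3.0)
</code></pre>

<p>The two numbers at the root are the “for” and “against” buckets; turning them into a single belief value is a separate step.</p>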

<h2 id="motivating-example">Motivating example</h2>

<p>As a demonstration, I took a simple HLASM program, ran it through <a href="https://github.com/avishek-sen-gupta/tape-z">Tape/Z</a> to parse its structure, and exposed its various functionalities to the Langgraph system through an MCP server. Examples of the functionalities exposed were:</p>

<ul>
  <li>Regex search</li>
  <li>Cyclomatic complexity</li>
  <li>Code in specific sections (labels)</li>
  <li>…etc.</li>
</ul>

<p>The hypothesis that I asked it to verify was that the program uses a lot of registers. This is shown in the screenshot below.</p>

<p><img src="/assets/images/inductor-step-01.png" alt="Inductor Step 1" /></p>

<p>Beyond this point, the <strong>Hypothesis Decomposer</strong> component of Inductor starts recursively decomposing this hypothesis into an inference tree, as shown by the progression of screenshots below (I intentionally limited the number of branches at each step to 2 for speed of demonstration):</p>

<p><img src="/assets/images/inductor-step-02.png" alt="Inductor Step 2" />
<img src="/assets/images/inductor-step-03.png" alt="Inductor Step 3" /></p>

<p>…and so on, until the leaves of evidence are reached.</p>

<p><img src="/assets/images/inductor-step-08.png" alt="Inductor Step 8" />
<img src="/assets/images/inductor-step-09.png" alt="Inductor Step 9" /></p>

<p>At this point, the inference tree has been built, and the <strong>Hypothesis Validator</strong> component goes into action, starting to collect evidence, and aggregating the strengths up the hierarchy of the tree. The result is an updated strength of the root hypothesis. As shown in the screenshot below, the original strength was 0.5 (equally likely to be true or false), and the posterior strength came down to 0.4, indicating a weakened belief in the root hypothesis.</p>

<p><img src="/assets/images/inductor-before-after.png" alt="Inductor Prior and Posterior" /></p>

<h2 id="architecture-of-inductor">Architecture of Inductor</h2>

<p>The overall architecture consists of several parts, some of them more experimental than others at this point. They reflect my early attempts to build a CLI for exploring a system for the purposes of reverse engineering using MCP tools. The whole thing is probably meant to be plugged into a larger system. It is essentially a medium-sized Langgraph graph, with the following components:</p>

<ul>
  <li><strong>Executive Agent:</strong> Guides the overall exploration process</li>
  <li><strong>Hypothesis Decomposer:</strong> Recursively decomposes a hypothesis into an inference tree</li>
  <li><strong>Hypothesis Validator:</strong> Explores evidence related to hypotheses and propagates beliefs upwards</li>
  <li><strong>Hypothesizer:</strong> Generates hypotheses about code functionality</li>
  <li><strong>Free Explorer:</strong> Allows for free exploration of the codebase using MCP tools</li>
  <li><strong>System Query:</strong> Answers questions about the MCP tools themselves</li>
</ul>

<p><img src="/assets/images/inductor-macro-structure.png" alt="Overall Architecture" /></p>

<h3 id="hypothesis-decomposer-design">Hypothesis Decomposer: Design</h3>

<p>The <strong>Hypothesis Decomposer</strong> component builds the inference tree recursively. There were a couple of options for designing this.</p>

<ul>
  <li>Build a <strong>smaller independent Langgraph graph</strong> where each task node corresponds to either aggregation from its child nodes or evidence gathering using MCP tools.</li>
  <li>Build an <strong>iterative graph loop</strong> and keep track of the recursion state in the agent context. This requires more bookkeeping (like storing current recursion information in a stack), and nodes with dedicated logic to decide when to unroll the recursive call.</li>
</ul>

<p>In the end, I decided to go with the second option, because it seemed more straightforward; however, I may try out the first approach at some point. The subgraph which implements this component is shown below.</p>

<p><img src="/assets/images/inductor-hypothesis-decomposer-langgraph.png" alt="Hypothesis Decomposer" /></p>

<h3 id="hypothesis-validator-design">Hypothesis Validator: Design</h3>

<p>I followed a very similar approach to the <a href="#hypothesis-decomposer-design">Hypothesis Decomposer</a> component, except this time, we are traversing the inference tree instead of building it. Similar stack-based bookkeeping of the recursion state applies here.</p>

<p>The subgraph which implements this component is shown below.</p>

<p><img src="/assets/images/inductor-hypothesis-validator-langgraph.png" alt="Hypothesis Validator" /></p>
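
<p>To give a feel for the stack-based bookkeeping, here is a minimal sketch of traversing an inference tree with an explicit stack instead of recursive calls. It is not the Langgraph implementation: the dictionary-based tree, the <code class="language-plaintext highlighter-rouge">supports</code> and <code class="language-plaintext highlighter-rouge">children</code> keys, and the function name are assumptions made for the example, and evidence is unweighted for brevity.</p>

<pre><code class="language-python">def aggregate_iteratively(root):
    """Post-order traversal using an explicit stack of pending nodes."""
    totals = {}               # id(node) maps to (for_count, against_count)
    stack = [(root, False)]   # (node, have its children been scheduled?)
    while stack:
        node, expanded = stack.pop()
        children = node.get("children", [])
        if not children:
            # Leaf evidence node: contributes one count to one bucket.
            totals[id(node)] = (1, 0) if node["supports"] else (0, 1)
        elif not expanded:
            # First visit: come back to this node after its children.
            stack.append((node, True))
            stack.extend((child, False) for child in children)
        else:
            # Second visit: all children are done, so "unroll" and sum.
            totals[id(node)] = (
                sum(totals[id(child)][0] for child in children),
                sum(totals[id(child)][1] for child in children),
            )
    return totals[id(root)]


tree = {"children": [
    {"children": [{"supports": True}, {"supports": False}]},
    {"children": [{"supports": False}, {"supports": True}, {"supports": False}]},
]}

result = aggregate_iteratively(tree)
print(result)  # (2, 3)
</code></pre>

<p>The <code class="language-plaintext highlighter-rouge">expanded</code> flag plays the role of the dedicated “should I unroll now?” decision: a node is only summed on its second visit, after everything below it has been processed.</p>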

<h2 id="caveats-and-limitations">Caveats and Limitations</h2>

<ul>
  <li>I use the term <strong>“belief”</strong> pretty loosely. It is intentionally constrained to be between 0 and 1 to reflect a possible Bayesian probability. That more formal interpretation will probably require more refined modelling. I discuss the analogy briefly in <a href="#analogy-with-hierarchical-bayes-with-beta-bernoulli-conjugate">Analogy with Hierarchical Bayes</a>.</li>
  <li><strong>All sub-hypotheses are assumed to be independent of each other.</strong> This is often not true. Overlapping, dependent sub-hypotheses need causal connections between them which would normally affect belief propagation.</li>
  <li><strong>The Beta-Bernoulli conjugate-like calculations have been chosen for their simplicity</strong>, and aren’t necessarily the best fit to represent the prior and posterior. More sophisticated probability distribution modelling using MCMC, etc. should ideally be done.</li>
  <li><strong>The prompting needs to be more precise</strong> to get more contextually correct sub-hypotheses, because the current prompting leads to inferring some sub-hypotheses which make little sense in the context of HLASM programs (for example: how would an HLASM program contain complex functions?).</li>
</ul>

<h2 id="analogy-with-hierarchical-bayes-with-beta-bernoulli-conjugate">Analogy with Hierarchical Bayes with Beta-Bernoulli conjugate</h2>

<p>The above technique is analogous to a <strong>Hierarchical Bayes Model</strong> where probability distributions are modelled by <strong>Beta distributions</strong>, and posterior distributions are calculated using the conjugacy of the <strong>Beta-Bernoulli pair</strong>.</p>

<p>The <strong>Beta distribution</strong> is a customisable probability distribution, whose shape can be controlled by two parameters \(\alpha\) and \(\beta\). The formula for the distribution is given as:</p>

\[f(x;\alpha,\beta) = k \cdot x^{\alpha-1} \cdot (1-x)^{\beta-1}\]

<p>where \(k\) is a normalising constant chosen such that the area under the probability density is 1.
The shapes of the Beta distribution for different values of \(\alpha\) and \(\beta\) are shown below (taken from Wikipedia). As you can see, the shape can vary widely depending upon the parameter combination.</p>

<p><img src="/assets/images/beta-distribution.png" alt="Beta Distribution shapes" /></p>

<p><strong>The Beta distribution is frequently used to model experiments with discrete success/failure counts.</strong> If the results observed so far are represented by a Beta distribution, then the updated distribution after observing further experiments (successes and failures) is again a Beta distribution, obtained by simply adding the new success count to \(\alpha\) and the new failure count to \(\beta\). This is analogous to summing up the counts of evidence supporting and not supporting the sub-hypotheses and propagating them up the inference tree.</p>
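
<p>As a small worked example of that update rule, the sketch below starts from a uniform Beta(1, 1) prior and adds some evidence counts. The <code class="language-plaintext highlighter-rouge">Beta</code> class is a hypothetical helper, and the counts are made up for illustration; they are not taken from an actual Inductor run.</p>

<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    alpha: float
    beta: float

    def mean(self):
        """Expected value of the distribution: alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)

    def update(self, successes, failures):
        """Conjugate update: add the observed counts to the parameters."""
        return Beta(self.alpha + successes, self.beta + failures)


prior = Beta(1.0, 1.0)                               # uniform prior, mean 0.5
posterior = prior.update(successes=1, failures=2)    # made-up evidence counts

print(prior.mean(), posterior.mean())  # 0.5 0.4
</code></pre>

<p>One supporting and two opposing pieces of evidence move the mean from 0.5 to 0.4: the belief weakens, but a single update does not collapse it to zero.</p>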

<h2 id="references">References</h2>
<ul>
  <li><a href="https://github.com/asengupta/inductor">Inductor</a></li>
  <li><a href="https://github.com/avishek-sen-gupta/tape-z">Tape/Z</a></li>
  <li><a href="https://sites.cc.gatech.edu/reverse/repository/cogmodels.pdf">Cognitive Models of Program Comprehension</a></li>
</ul>]]></content><author><name>avishek</name></author><category term="Hierarchical Bayes" /><category term="Large Language Models" /><category term="Reasoning" /><summary type="html"><![CDATA[We look at how a Hierarchical Bayes-like model can be used to recursively decompose a hypothesis into sub-hypotheses to form an inference tree. The beliefs of these sub-hypotheses are updated based on the strength of the evidence gathered using MCP tools.]]></summary></entry><entry><title type="html">An Ode to the Generalist</title><link href="https://avishek.net/2023/03/13/ode-to-the-generalist.html" rel="alternate" type="text/html" title="An Ode to the Generalist" /><published>2023-03-13T00:00:00+05:30</published><updated>2023-03-13T00:00:00+05:30</updated><id>https://avishek.net/2023/03/13/ode-to-the-generalist</id><content type="html" xml:base="https://avishek.net/2023/03/13/ode-to-the-generalist.html"><![CDATA[<p>This post is probably a spiritual successor to <a href="/2021/11/06/resilient-knowledge-bases.html">Resilient Knowledge Bases</a>.</p>

<p><strong>I fear for the death of the Generalist.</strong></p>

<p>The Generalist is characterised not by his extreme capability to specialise in one particular area, though that is not an uncommon trait. <strong>He is characterised by his ability to shapeshift into whatever form derives the maximum value around a particular set of circumstances.</strong> Business Analyst not around? He can write good stories, and hold the fort well enough until the BA returns. QA not around? He can devise a reasonable QA strategy. A project needs some UX practices in place? The Generalist will understand enough from first principles, as well as digest literature on the latest library, and produce something halfway decent. None of these cases implies that the Generalist stands alone. He knows his limitations full well.</p>

<p><strong>The Generalist is a Born Troubleshooter.</strong> This comes from his experience in having to figure his way out of numerous unfamiliar problems as they struck. He will be the first person you turn to on your project when the build is broken at 5 pm on a Friday evening, and you cannot figure out why it’s breaking. To be fair, the Generalist will not have all the information that is needed to redeem the situation. This leads us to the next trait of the Generalist…</p>

<p><strong>The Generalist knows the Right Questions to ask.</strong> He is not infallible, and he is not omniscient. He knows his limitations, especially when confronted with an unfamiliar domain. But he has built a mental framework which he can use to zoom in and out of the situation to map out an unfamiliar terrain or problem space. You may or may not have heard of a Role Playing Game system called Microscope. In that, players collaborate in building the history of a (usually fictional) world or universe. The flexibility of the system comes from the fact that at any point, a player can zoom out to describe events of a historical scale, like the rise and fall of a civilisation; equally he can zoom in to describe events of a single (fictional) person’s day and how that ultimately affected the outcome of a large-scale event. That is how the Generalist operates; sometimes he will ask 10,000 feet-level questions; sometimes he will ask why the contents of a particular register changed from <em>0x20</em> to <em>0xFF</em>. These are not random questions he asks; he is simply figuring out the lay of the land, particularly the interesting spots.</p>

<p><strong>The Generalist always has a Plan</strong>; and he lays it out freely. He knows that one of his primary objectives is to help others – who have more context of the problem, and have more knowledge – reach conclusions or resolutions. More often than not, he will sketch out his thinking to others, encouraging them to fill in the gaps, and point out the loopholes. He will want you to realise that <em>“Oh, my event history ordering is incorrect because sometimes the events are reaching the database out of order, and OMG I can’t depend on server clocks to establish causality”</em>. That insight came from you, not him; his MO is to state the evident facts and build a chain of reasoning to help you – the expert – reach the conclusion that was already present in your head, but not accessed. <strong>The Generalist is thus a Team Player.</strong></p>

<p><strong>The Generalist has his Fundamentals firmly in place.</strong> He is not buffeted by the whims and fancies of the latest frameworks and libraries. That is not to say that he does not learn these, or is immune to the charms of a particularly compelling programming language. But he does not despair simply because he has not used the latest technology on his project. He makes bets on things that will force him to expand his dictionary of fundamentals, and learns those. He may not remember the latest APIs, but he understands the spirit behind the learning, and is fully capable of jumping into the saddle of hands-on implementation if called for. But, the Generalist is always Learning. If he has yet to understand Functional Programming fully, he will attempt to incorporate that thinking into his toolbox. If he feels that learning Vim is worth it, because it is usually the lowest common denominator on all Unix-like operating systems, he will do it. He will select a technology, pluck the hardy core idea behind this technology, and file it away for the future.</p>

<p><strong>The Generalist is a fierce Specialist in his preferred area of specialisation.</strong> He may be known for his expertise in this area, but he does not let this define him. He hones his knowledge and capability in this area with a single-minded fervour, because he knows that if something is worth doing, it is worth doing extremely well. He does this not to show off, but because he believes in attaining some semblance of mastery in his chosen discipline.</p>

<p><strong>I fear for the death of the Generalist.</strong></p>]]></content><author><name>avishek</name></author><category term="Software Engineering" /><summary type="html"><![CDATA[This post is probably a spiritual successor to Resilient Knowledge Bases.]]></summary></entry></feed>