A multi-agent pipeline that diagnoses verification debt, repairs what it can, and produces machine-readable certificates — so humans audit the certificate, not the code.
In 2023, generating a 200-line PR cost $0.30 and verifying it cost ~$50. Today, the same PR costs $0.0015 to generate, but verification still costs ~$50. The ratio has exploded from roughly 167:1 to over 33,000:1, and the gap is widening exponentially.
Cv / Ci → ∞
Cost-to-Verify ÷ Cost-to-Implement — the ratio that matters, tracked per PR, per module
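The arithmetic behind the headline numbers, as a minimal sketch. The per-PR costs come from the figures above; everything else is illustrative arithmetic, not part of the protocol spec:

```python
# Cv/Ci for the costs quoted above. Purely illustrative arithmetic.
COST_TO_VERIFY = 50.00  # ~200 reviewable lines at a sustainable human review rate

generation_cost = {
    "2023 (GPT-4)": 0.30,
    "today (DeepSeek V4-Flash)": 0.0015,
    "after another 10x price drop": 0.00015,
}

for era, c_i in generation_cost.items():
    print(f"{era:>30}: Cv/Ci ≈ {COST_TO_VERIFY / c_i:,.0f}:1")
```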
Agent A submits code. The pipeline classifies it — KnownGroundTruth, NovelBehavior, GeneratedCode, or GeneratedTests.
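A sketch of the intake step, assuming a simple enum. The four category names are the protocol's; the routing heuristics and field names here are illustrative only:

```python
from enum import Enum

class SubmissionClass(Enum):
    KNOWN_GROUND_TRUTH = "KnownGroundTruth"  # matches a vetted reference oracle
    NOVEL_BEHAVIOR = "NovelBehavior"         # behavior with no prior ground truth
    GENERATED_CODE = "GeneratedCode"         # AI-authored implementation
    GENERATED_TESTS = "GeneratedTests"       # AI-authored verification artifacts

def classify(submission: dict) -> SubmissionClass:
    """Illustrative routing only; the real classifier is a pipeline agent."""
    if submission.get("has_reference_oracle"):
        return SubmissionClass.KNOWN_GROUND_TRUTH
    if submission.get("artifact_kind") == "tests":
        return SubmissionClass.GENERATED_TESTS
    if submission.get("changes_observable_behavior"):
        return SubmissionClass.NOVEL_BEHAVIOR
    return SubmissionClass.GENERATED_CODE
```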
Semantic correctness, behavioral contract, security surface, structural integrity, adversarial surface, documentation coverage, and more.
10,000 scenarios replayed in a deterministic sandbox. Behavioral contracts extracted from the spec — never from the implementation.
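A minimal sketch of a spec-derived behavioral contract as a property-based test (Hypothesis in Python). The spec clause and `apply_discount` are hypothetical stand-ins; the point is that the assertion quotes the spec, never the implementation:

```python
from hypothesis import given, settings, strategies as st

# Hypothetical spec clause: "the charged amount is never negative and never
# exceeds the cart total." The contract below is derived from that sentence,
# not from reading the code under test.

def apply_discount(total_cents: int, percent: int) -> int:
    clamped = min(max(percent, 0), 100)
    return total_cents - (total_cents * clamped) // 100

@settings(max_examples=10_000, derandomize=True)  # deterministic replay, 10,000 scenarios
@given(total=st.integers(min_value=0, max_value=10**9),
       percent=st.integers(min_value=-50, max_value=150))
def test_discount_contract(total, percent):
    charged = apply_discount(total, percent)
    assert 0 <= charged <= total  # the behavioral contract, straight from the spec
```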
Agent E derives η from signals, tracks Cv/Ci, computes verification debt, and signs an in-toto attestation.
The human reviews the certificate — not the code. Auto-Approve, Human Review Recommended, or Human Review Required.
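A sketch of the three-tier gate. The ρ > 0.30 hard stop is stated by the protocol; the η and debt cut-offs below are illustrative placeholders, not the published thresholds:

```python
def route(eta: float, rho: float, debt_hours: float) -> str:
    """Map certificate signals to a review tier (thresholds illustrative)."""
    if rho > 0.30:
        return "Human Review Required"  # CannotVerify: artifacts too correlated
    if eta >= 0.95 and debt_hours == 0.0:
        return "Auto-Approve"
    if eta >= 0.80:
        return "Human Review Recommended"
    return "Human Review Required"

print(route(eta=0.97, rho=0.08, debt_hours=0.0))  # -> Auto-Approve
```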
Root-cause analysis, with citations to the AI Verification Debt whitepaper.
In March 2023, GPT-4 cost $30/$60 per 1M tokens. By May 2026, DeepSeek V4-Flash costs $0.14/$0.28 — two orders of magnitude cheaper. But human review is still capped at ~200 lines of code per hour, the same sustainable rate documented by SmartBear and Cisco since 2006. The bottleneck shifted from generation to verification overnight.
If the same model generates both the implementation and its tests, an error mode the model systematically reproduces passes through both layers undetected. The automated filtering efficiency η is not an independent variable — it degrades dynamically as the shared generation loop introduces correlated failures. The protocol detects this via the correlation penalty ρ and enforces provider-family diversity across pipeline agents.
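A minimal sketch of the provider-family diversity rule: the family that generated the implementation may not also author or score its verification artifacts. Role names and model identifiers here are illustrative assumptions:

```python
def family(model_id: str) -> str:
    return model_id.split("/")[0]  # e.g. "vendor-a/model-x" -> "vendor-a"

def diversity_violations(assignment: dict[str, str]) -> list[str]:
    """Return roles whose model shares a family with the generator."""
    gen = family(assignment["generator"])
    return [role for role, model in assignment.items()
            if role != "generator" and family(model) == gen]

pipeline = {
    "generator": "vendor-a/model-x",
    "test_author": "vendor-b/model-y",
    "verifier": "vendor-c/model-z",
}
print(diversity_violations(pipeline))  # [] -> no correlated roles
```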
Every AI-generated feature shipped without a corresponding verification investment degrades your ability to autonomously ship the next feature. The agent generating the next PR cannot self-evaluate its impact on unverified code — it has no ground truth about what the unverified module actually does. The protocol tracks this as verification debt per module, with interest rates for hot paths.
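An illustrative debt model: unverified lines owe review-hours, compounding faster on hot paths. The ~200 LOC/hour rate comes from the text; the weekly interest rates are assumptions, not the spec's:

```python
def verification_debt(unverified_loc: int, weeks_unverified: int,
                      hot_path: bool, loc_per_review_hour: int = 200) -> float:
    principal = unverified_loc / loc_per_review_hour  # review-hours owed today
    weekly_rate = 0.05 if hot_path else 0.01          # hypothetical interest rates
    return principal * (1 + weekly_rate) ** weeks_unverified

print(verification_debt(400, weeks_unverified=12, hot_path=True))  # ~3.6 hours
```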
Property-based testing (QuickCheck), fuzzing, SAST, deterministic replay sandboxes, behavioral contract differencing, SLSA/in-toto attestation — all exist as isolated tools. What's missing is an orchestration and provenance layer that connects them into a coherent verification pipeline. The protocol is that layer: a unified spec that coordinates five agents across these tools and produces a single auditable certificate.
Projecting forward: if the price war between US and Chinese labs continues and inference costs drop another 10×, the ratio passes ~330,000:1. The team that invests in verification infrastructure can safely ship 10× more code than the team that invests in better generation alone. The team that does not will drown in verification debt within this decade.
The protocol is both a specification and a system prompt — it can be loaded into any capable AI model playing a pipeline role, or read as a standalone guide for building verification infrastructure.
Automated filtering efficiency η is computed mechanically from seven observable signals: m (mutation kill rate), o (oracle agreement), b (branch coverage), f (fuzz survival), s (SAST clean rate), t (static depth), and d (doc coverage).
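One way this can combine mechanically, as a sketch: the seven signals are the spec's, but the weighted-geometric-mean form and these weights are assumptions, not the published formula:

```python
import math

WEIGHTS = {"m": 0.25, "o": 0.20, "b": 0.15, "f": 0.15, "s": 0.10, "t": 0.10, "d": 0.05}

def eta(signals: dict[str, float]) -> float:
    """All signals lie in [0, 1]; a weak signal drags eta down multiplicatively."""
    return math.prod(signals[k] ** w for k, w in WEIGHTS.items())

print(eta({"m": 0.92, "o": 0.97, "b": 0.88, "f": 0.99, "s": 1.0, "t": 0.85, "d": 0.70}))
```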
Quantifies how dependent verification artifacts are on the generator — family overlap, version overlap, AST similarity, and shared mutation survival patterns. ρ > 0.30 triggers CannotVerify regardless of η.
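A sketch of ρ from the four overlap signals named above, each in [0, 1]. Equal weighting is an assumption; the 0.30 hard stop is the spec's:

```python
def rho(family_overlap: float, version_overlap: float,
        ast_similarity: float, shared_mutation_survival: float) -> float:
    parts = (family_overlap, version_overlap, ast_similarity, shared_mutation_survival)
    return sum(parts) / len(parts)

def verdict(eta: float, r: float) -> str:
    if r > 0.30:
        return "CannotVerify"  # regardless of eta, per the protocol
    return "Verified" if eta >= 0.90 else "NeedsReview"  # illustrative eta cut

print(verdict(eta=0.97, r=rho(1.0, 0.4, 0.2, 0.1)))  # -> CannotVerify
```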
Detects when the same author wrote the spec and the code within 7 days — a form of correlator-break laundering. Flagged in the certificate for human audit, priced into ρ.
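A minimal sketch of the check: flag when one author produced both the spec and the code inside the 7-day window. The commit field names are illustrative:

```python
from datetime import datetime, timedelta

def spec_laundering_flag(spec_commit: dict, code_commit: dict) -> bool:
    same_author = spec_commit["author"] == code_commit["author"]
    gap = abs(spec_commit["timestamp"] - code_commit["timestamp"])
    return same_author and gap <= timedelta(days=7)

print(spec_laundering_flag(
    {"author": "agent-a", "timestamp": datetime(2026, 5, 1)},
    {"author": "agent-a", "timestamp": datetime(2026, 5, 4)},
))  # True -> written into the certificate and priced into rho
```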
Documentation gaps, missing tests, and type mismatches are auto-remediated. Behavior-changing fixes are human-only. Five-gate repair verification prevents regressions.
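An illustrative remediation policy table: the finding categories come from the text, the tier labels are assumptions. The invariant is that behavior-changing fixes never auto-apply:

```python
AUTO_REMEDIABLE = {"doc_gap", "missing_test", "type_mismatch"}

def remediation_tier(finding: str) -> str:
    if finding in AUTO_REMEDIABLE:
        return "auto"    # repaired, then re-checked through the repair gates
    return "human-only"  # anything behavior-changing stays with a human

print(remediation_tier("type_mismatch"), remediation_tier("behavior_change"))
```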
JSON certificates with in-toto attestation. Downstream CI gates, deploy pipelines, and audit dashboards consume them. Markdown rendering for human auditors.
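A hypothetical certificate body, for shape only; the field names are illustrative, not the published schema, and the real payload is wrapped in an in-toto attestation:

```python
import json

certificate = {
    "protocol_version": "5.2.5",
    "subject": {"pr": "<pr-url>", "module": "<module-path>"},
    "classification": "GeneratedCode",
    "eta": 0.92,
    "rho": 0.11,
    "verification_debt_hours": 0.0,
    "decision": "Auto-Approve",
}
print(json.dumps(certificate, indent=2))  # what CI gates and dashboards consume
```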
5% of certificates sampled monthly for ground-truth comparison. Brier scoring detects calibration drift. Weights auto-recalibrate against observed defect outcomes.
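A Brier-score sketch for the calibration check: compare each sampled certificate's predicted defect probability with the audit's ground truth (1 = defect found). The sample data below is made up:

```python
def brier(predictions: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

preds = [0.05, 0.10, 0.02, 0.40, 0.07]
truth = [0, 0, 0, 1, 0]
print(f"Brier = {brier(preds, truth):.3f}")  # rising month-over-month = drift
```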
Read the full protocol — 14 sections, 2 appendices, and 7 critical pitfalls with 35+ rule-level safeguards. Version 5.2.5.
Read the Protocol →
Or clone the repo to integrate it into your CI/CD pipeline.