A multi-agent pipeline that diagnoses verification debt, repairs what it can, and produces machine-readable certificates — so humans audit the certificate, not the code.
In 2023, generating a 200-line PR cost $0.30 and verifying it cost ~$50. Today, the same PR costs $0.0015 to generate, but verification still costs ~$50. The ratio has exploded from roughly 167:1 to over 33,000:1, and the gap is widening exponentially.
Cv / Ci → ∞
Cost-to-Verify ÷ Cost-to-Implement — the ratio that matters, tracked per PR, per module
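The arithmetic behind the headline numbers, as a minimal sketch. The per-PR costs come from the figures above; everything else is illustrative arithmetic, not part of the protocol spec:

```python
# Cv/Ci for the costs quoted above. Purely illustrative arithmetic.
COST_TO_VERIFY = 50.00  # ~200 reviewable lines at a sustainable human review rate

generation_cost = {
    "2023 (GPT-4)": 0.30,
    "today (DeepSeek V4-Flash)": 0.0015,
    "after another 10x price drop": 0.00015,
}

for era, c_i in generation_cost.items():
    print(f"{era:>30}: Cv/Ci ≈ {COST_TO_VERIFY / c_i:,.0f}:1")
```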
Agent A submits code. The pipeline classifies it — KnownGroundTruth, NovelBehavior, GeneratedCode, or GeneratedTests.
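A sketch of the intake step, assuming a simple enum. The four category names are the protocol's; the routing heuristics and field names here are illustrative only:

```python
from enum import Enum

class SubmissionClass(Enum):
    KNOWN_GROUND_TRUTH = "KnownGroundTruth"  # matches a vetted reference oracle
    NOVEL_BEHAVIOR = "NovelBehavior"         # behavior with no prior ground truth
    GENERATED_CODE = "GeneratedCode"         # AI-authored implementation
    GENERATED_TESTS = "GeneratedTests"       # AI-authored verification artifacts

def classify(submission: dict) -> SubmissionClass:
    """Illustrative routing only; the real classifier is a pipeline agent."""
    if submission.get("has_reference_oracle"):
        return SubmissionClass.KNOWN_GROUND_TRUTH
    if submission.get("artifact_kind") == "tests":
        return SubmissionClass.GENERATED_TESTS
    if submission.get("changes_observable_behavior"):
        return SubmissionClass.NOVEL_BEHAVIOR
    return SubmissionClass.GENERATED_CODE
```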
Semantic correctness, behavioral contract, security surface, structural integrity, adversarial surface, documentation coverage, and more.
10,000 scenarios replayed in a deterministic sandbox. Behavioral contracts extracted from the spec — never from the implementation.
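A minimal sketch of a spec-derived behavioral contract as a property-based test (Hypothesis in Python). The spec clause and `apply_discount` are hypothetical stand-ins; the point is that the assertion quotes the spec, never the implementation:

```python
from hypothesis import given, settings, strategies as st

# Hypothetical spec clause: "the charged amount is never negative and never
# exceeds the cart total." The contract below is derived from that sentence,
# not from reading the code under test.

def apply_discount(total_cents: int, percent: int) -> int:
    clamped = min(max(percent, 0), 100)
    return total_cents - (total_cents * clamped) // 100

@settings(max_examples=10_000, derandomize=True)  # deterministic replay, 10,000 scenarios
@given(total=st.integers(min_value=0, max_value=10**9),
       percent=st.integers(min_value=-50, max_value=150))
def test_discount_contract(total, percent):
    charged = apply_discount(total, percent)
    assert 0 <= charged <= total  # the behavioral contract, straight from the spec
```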
Agent E derives η from signals, tracks Cv/Ci, computes verification debt, and signs an in-toto attestation.
The human reviews the certificate — not the code. Auto-Approve, Human Review Recommended, or Human Review Required.
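A sketch of the three-tier gate. The ρ > 0.30 hard stop is stated by the protocol; the η and debt cut-offs below are illustrative placeholders, not the published thresholds:

```python
def route(eta: float, rho: float, debt_hours: float) -> str:
    """Map certificate signals to a review tier (thresholds illustrative)."""
    if rho > 0.30:
        return "Human Review Required"  # CannotVerify: artifacts too correlated
    if eta >= 0.95 and debt_hours == 0.0:
        return "Auto-Approve"
    if eta >= 0.80:
        return "Human Review Recommended"
    return "Human Review Required"

print(route(eta=0.97, rho=0.08, debt_hours=0.0))  # -> Auto-Approve
```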
Root-cause analysis, with citations to the AI Verification Debt whitepaper.
In March 2023, GPT-4 cost $30/$60 per 1M tokens. By May 2026, DeepSeek V4-Flash costs $0.14/$0.28 — two orders of magnitude cheaper. But human review is still capped at ~200 lines of code per hour, the same sustainable rate documented by SmartBear and Cisco since 2006. The bottleneck shifted from generation to verification overnight.
If the same model generates both the implementation and its tests, an error mode the model systematically reproduces passes through both layers undetected. The automated filtering efficiency η is not an independent variable — it degrades dynamically as the shared generation loop introduces correlated failures. The protocol detects this via the correlation penalty ρ and enforces provider-family diversity across pipeline agents.
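A minimal sketch of the provider-family diversity rule: the family that generated the implementation may not also author or score its verification artifacts. Role names and model identifiers here are illustrative assumptions:

```python
def family(model_id: str) -> str:
    return model_id.split("/")[0]  # e.g. "vendor-a/model-x" -> "vendor-a"

def diversity_violations(assignment: dict[str, str]) -> list[str]:
    """Return roles whose model shares a family with the generator."""
    gen = family(assignment["generator"])
    return [role for role, model in assignment.items()
            if role != "generator" and family(model) == gen]

pipeline = {
    "generator": "vendor-a/model-x",
    "test_author": "vendor-b/model-y",
    "verifier": "vendor-c/model-z",
}
print(diversity_violations(pipeline))  # [] -> no correlated roles
```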
Every AI-generated feature shipped without a corresponding verification investment degrades your ability to autonomously ship the next feature. The agent generating the next PR cannot self-evaluate its impact on unverified code — it has no ground truth about what the unverified module actually does. The protocol tracks this as verification debt per module, with interest rates for hot paths.
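An illustrative debt model: unverified lines owe review-hours, compounding faster on hot paths. The ~200 LOC/hour rate comes from the text; the weekly interest rates are assumptions, not the spec's:

```python
def verification_debt(unverified_loc: int, weeks_unverified: int,
                      hot_path: bool, loc_per_review_hour: int = 200) -> float:
    principal = unverified_loc / loc_per_review_hour  # review-hours owed today
    weekly_rate = 0.05 if hot_path else 0.01          # hypothetical interest rates
    return principal * (1 + weekly_rate) ** weeks_unverified

print(verification_debt(400, weeks_unverified=12, hot_path=True))  # ~3.6 hours
```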
Property-based testing (QuickCheck), fuzzing, SAST, deterministic replay sandboxes, behavioral contract differencing, SLSA/in-toto attestation — all exist as isolated tools. What's missing is an orchestration and provenance layer that connects them into a coherent verification pipeline. The protocol is that layer: a unified spec that coordinates five agents across these tools and produces a single auditable certificate.
Projecting forward: if the price war between US and Chinese labs continues and inference costs drop another 10×, the ratio passes ~330,000:1. The team that invests in verification infrastructure can safely ship 10× more code than the team that invests in better generation alone. The team that does not will drown in verification debt within this decade.
The protocol is both a specification and a system prompt — it can be loaded into any capable AI model playing a pipeline role, or read as a standalone guide for building verification infrastructure.
Automated filtering efficiency η is computed mechanically from seven observable signals: m (mutation kill rate), o (oracle agreement), b (branch coverage), f (fuzz survival), s (SAST clean rate), t (static depth), and d (doc coverage).
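One way this can combine mechanically, as a sketch: the seven signals are the spec's, but the weighted-geometric-mean form and these weights are assumptions, not the published formula:

```python
import math

WEIGHTS = {"m": 0.25, "o": 0.20, "b": 0.15, "f": 0.15, "s": 0.10, "t": 0.10, "d": 0.05}

def eta(signals: dict[str, float]) -> float:
    """All signals lie in [0, 1]; a weak signal drags eta down multiplicatively."""
    return math.prod(signals[k] ** w for k, w in WEIGHTS.items())

print(eta({"m": 0.92, "o": 0.97, "b": 0.88, "f": 0.99, "s": 1.0, "t": 0.85, "d": 0.70}))
```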
Quantifies how dependent verification artifacts are on the generator — family overlap, version overlap, AST similarity, and shared mutation survival patterns. ρ > 0.30 triggers CannotVerify regardless of η.
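A sketch of ρ from the four overlap signals named above, each in [0, 1]. Equal weighting is an assumption; the 0.30 hard stop is the spec's:

```python
def rho(family_overlap: float, version_overlap: float,
        ast_similarity: float, shared_mutation_survival: float) -> float:
    parts = (family_overlap, version_overlap, ast_similarity, shared_mutation_survival)
    return sum(parts) / len(parts)

def verdict(eta: float, r: float) -> str:
    if r > 0.30:
        return "CannotVerify"  # regardless of eta, per the protocol
    return "Verified" if eta >= 0.90 else "NeedsReview"  # illustrative eta cut

print(verdict(eta=0.97, r=rho(1.0, 0.4, 0.2, 0.1)))  # -> CannotVerify
```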
Detects when the same author wrote the spec and the code within 7 days — a form of correlator-break laundering. Flagged in the certificate for human audit, priced into ρ.
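A minimal sketch of the check: flag when one author produced both the spec and the code inside the 7-day window. The commit field names are illustrative:

```python
from datetime import datetime, timedelta

def spec_laundering_flag(spec_commit: dict, code_commit: dict) -> bool:
    same_author = spec_commit["author"] == code_commit["author"]
    gap = abs(spec_commit["timestamp"] - code_commit["timestamp"])
    return same_author and gap <= timedelta(days=7)

print(spec_laundering_flag(
    {"author": "agent-a", "timestamp": datetime(2026, 5, 1)},
    {"author": "agent-a", "timestamp": datetime(2026, 5, 4)},
))  # True -> written into the certificate and priced into rho
```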
Documentation gaps, missing tests, and type mismatches are auto-remediated. Behavior-changing fixes are human-only. Five-gate repair verification prevents regressions.
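An illustrative remediation policy table: the finding categories come from the text, the tier labels are assumptions. The invariant is that behavior-changing fixes never auto-apply:

```python
AUTO_REMEDIABLE = {"doc_gap", "missing_test", "type_mismatch"}

def remediation_tier(finding: str) -> str:
    if finding in AUTO_REMEDIABLE:
        return "auto"    # repaired, then re-checked through the repair gates
    return "human-only"  # anything behavior-changing stays with a human

print(remediation_tier("type_mismatch"), remediation_tier("behavior_change"))
```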
JSON certificates with in-toto attestation. Downstream CI gates, deploy pipelines, and audit dashboards consume them. Markdown rendering for human auditors.
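A hypothetical certificate body, for shape only; the field names are illustrative, not the published schema, and the real payload is wrapped in an in-toto attestation:

```python
import json

certificate = {
    "protocol_version": "5.2.5",
    "subject": {"pr": "<pr-url>", "module": "<module-path>"},
    "classification": "GeneratedCode",
    "eta": 0.92,
    "rho": 0.11,
    "verification_debt_hours": 0.0,
    "decision": "Auto-Approve",
}
print(json.dumps(certificate, indent=2))  # what CI gates and dashboards consume
```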
5% of certificates sampled monthly for ground-truth comparison. Brier scoring detects calibration drift. Weights auto-recalibrate against observed defect outcomes.
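A Brier-score sketch for the calibration check: compare each sampled certificate's predicted defect probability with the audit's ground truth (1 = defect found). The sample data below is made up:

```python
def brier(predictions: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

preds = [0.05, 0.10, 0.02, 0.40, 0.07]
truth = [0, 0, 0, 1, 0]
print(f"Brier = {brier(preds, truth):.3f}")  # rising month-over-month = drift
```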
Read the full protocol — 14 sections, 2 appendices, and 7 critical pitfalls with 35+ rule-level safeguards. Version 5.2.5.
Read the Protocol →
Or clone the repo to integrate it into your CI/CD pipeline.