SolSentinel reproduces 88.6% of Code4rena HIGH-severity findings

Per-contest breakdown

Contest	Category	Warden findings (H+M)	SolSentinel caught (H+M)	HIGH recall	Aggregate recall (H+M)
Renzo Protocol	LRT / restaking	22 (8H + 14M)	18	100%	81.8%
Decent	Cross-chain bridge	9 (4H + 5M)	9	100%	100%
Spectra	Yield strategy (PT/YT)	2 (0H + 2M)	2	n/a	100%
PoolTogether	Yield vault / lottery	9 (1H + 8M)	3	100%	33.3%
Panoptic	Perp / options DEX	11 (2H + 9M)	9	100%	81.8%
Munchables	Blast game / staking	6 (2H + 4M)	6	100%	100%
Size	Lending + limit orders	17 (4H + 13M)	14	75%	82.4%
Dyad	CDP stablecoin	19 (10H + 9M)	16	100%	84.2%
Revolution Protocol	Auction / governance	18 (4H + 14M)	10	25%	55.6%
AGGREGATE		113 (35H + 78M)	87	88.6%	77.0%
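
The aggregate row can be re-derived from the per-contest rows. The following is a minimal Python sketch, not the project's scoring script. The totals and caught counts are copied from the table above; the per-contest count of HIGH findings caught is an assumption back-computed from the HIGH recall column (for example, 75% of Size's 4 HIGHs is 3).

```python
# (contest, highs, mediums, caught_total, highs_caught)
# highs_caught is inferred from the HIGH recall column (an assumption),
# not read from the published per-contest reports.
ROWS = [
    ("Renzo Protocol", 8, 14, 18, 8),
    ("Decent", 4, 5, 9, 4),
    ("Spectra", 0, 2, 2, 0),
    ("PoolTogether", 1, 8, 3, 1),
    ("Panoptic", 2, 9, 9, 2),
    ("Munchables", 2, 4, 6, 2),
    ("Size", 4, 13, 14, 3),
    ("Dyad", 10, 9, 16, 10),
    ("Revolution Protocol", 4, 14, 10, 1),
]


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Pool the counts across contests before dividing.
total_highs = sum(row[1] for row in ROWS)
total_findings = sum(row[1] + row[2] for row in ROWS)
total_caught = sum(row[3] for row in ROWS)
total_highs_caught = sum(row[4] for row in ROWS)

high_recall = pct(total_highs_caught, total_highs)
overall_recall = pct(total_caught, total_findings)

print(f"HIGH recall: {total_highs_caught}/{total_highs} = {high_recall}%")
print(f"H+M recall:  {total_caught}/{total_findings} = {overall_recall}%")
```

Under these inputs the pooled figures come out to 31/35 = 88.6% and 87/113 = 77.0%, matching the AGGREGATE row. Both are pooled over all findings rather than averaged over the per-contest percentages, so large contests such as Renzo and Dyad weigh more than Spectra.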

Formal Verification — we prove the fix

Detection finds the bug; most tools stop there. We go further: a deterministic, license-clean solver proves the AI-generated fix actually closes it. Every number below is reproducible from scripts/prover_benchmark_report.py — no API, no trust required.

Detector classes formally LIVE: 94 (of 94 provable / 223 detectors)
Corpus differential: 20/20 (buggy refuted AND fixed proven)
Unsound verdicts: 0 (permanent CI soundness gate)
Replay cost: $0 (deterministic; no AI in the proof)
Verified on real on-chain code: 36 (Etherscan-verified contracts probed)
Documented exploits: 7 (incl. Parity, Cream, Fei/Rari, Ronin-class)
False proofs on real exploits: 0 (never claims a vulnerable function is safe)
False positives on audited-clean code: 0 (across 8 OpenZeppelin / Uniswap V2 contracts)

Each proof is a sound check that the fix establishes the defensive code invariant — Checks-Effects-Interactions, access gating, freshness guards (oracle staleness / deadlines), single-use nonces, value conservation, bounded loops, and contract-level cross-function & read-only reentrancy. The solver never claims a proof it cannot discharge (it degrades to “manual review”) and never shows a fix it could not verify. Reproduce:

python3 scripts/prover_benchmark.py — the soundness gate (a proof on known-buggy code fails the build)
python3 scripts/prover_benchmark_report.py — these numbers
python3 scripts/prover_replay.py <audit.json> --source contract.sol — replay any audit's proofs at $0
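
To make the gate's logic concrete, here is a hedged Python sketch of a differential check with a soundness gate. It is not the contents of scripts/prover_benchmark.py: the verdict labels, the CaseResult record, and the sample cases are all hypothetical, chosen only to illustrate "buggy refuted AND fixed proven" and "a proof on known-buggy code fails the build".

```python
from dataclasses import dataclass

# Hypothetical verdict labels; the real prover's vocabulary may differ.
PROVEN = "proven"
REFUTED = "refuted"
MANUAL_REVIEW = "manual_review"  # the prover declined to claim a proof


@dataclass(frozen=True)
class CaseResult:
    name: str
    buggy_verdict: str  # verdict on the known-vulnerable version
    fixed_verdict: str  # verdict on the patched version


def unsound_cases(results):
    """Names of cases where a proof was claimed on known-buggy code."""
    return [r.name for r in results if r.buggy_verdict == PROVEN]


def differential_passes(results):
    """Names of cases where buggy is refuted AND fixed is proven."""
    return [
        r.name
        for r in results
        if r.buggy_verdict == REFUTED and r.fixed_verdict == PROVEN
    ]


def gate_exit_code(results):
    """Return 1 (fail the build) if any verdict is unsound, else 0."""
    return 1 if unsound_cases(results) else 0


# Illustrative cases only; these are not entries from the real corpus.
sample = [
    CaseResult("reentrancy_case", REFUTED, PROVEN),
    CaseResult("stale_oracle_case", REFUTED, MANUAL_REVIEW),
]
bad = sample + [CaseResult("broken_case", PROVEN, PROVEN)]

print(differential_passes(sample), gate_exit_code(sample))
print(unsound_cases(bad), gate_exit_code(bad))
```

In this sketch a fallback to manual review on the fixed version lowers the differential score but never trips the gate; only a claimed proof on vulnerable code does, which is the asymmetry the soundness claim depends on.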

How the recall is measured

We pull each Code4rena contest's source repository and run it through the SolSentinel pipeline (IR detectors + Claude project-level audit) end-to-end. We then compare our findings against the public wardens' report from code4rena.com/reports/<contest>.

A warden's finding is marked CAUGHT (liberal match) if any of our findings:

Has an exact contract.function match with the warden's attribution, OR
Shares the same contract + 2 distinct title tokens, OR
Shares 3 distinct title tokens (regardless of contract), OR
Shares any tag.

Strict-mode recall (an exact contract+function match, OR 3+ shared title tokens on the same contract) is reported alongside liberal recall in every per-contest report. The strict aggregate is 64.5% overall and 75% for HIGH findings; both numbers are public.
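
The liberal and strict rules above can be sketched as two predicates. This is a minimal Python illustration, not the code in docs/benchmarks/aggregate_score.py: the Finding fields, the tokenizer (lowercased alphanumeric runs, no stop-word filtering), and the reading of "2 distinct tokens" as "at least 2" are all assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Finding:
    # Hypothetical record shape; the real reports may carry other fields.
    contract: str
    function: str
    title: str
    tags: set = field(default_factory=set)


def title_tokens(title: str) -> set:
    """Distinct lowercased alphanumeric tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def shared_tokens(ours: Finding, warden: Finding) -> int:
    return len(title_tokens(ours.title) & title_tokens(warden.title))


def liberal_match(ours: Finding, warden: Finding) -> bool:
    same_contract = ours.contract == warden.contract
    shared = shared_tokens(ours, warden)
    return (
        (same_contract and ours.function == warden.function)  # exact match
        or (same_contract and shared >= 2)  # same contract + 2 tokens
        or shared >= 3                      # 3 tokens, any contract
        or bool(ours.tags & warden.tags)    # any shared tag
    )


def strict_match(ours: Finding, warden: Finding) -> bool:
    same_contract = ours.contract == warden.contract
    exact = ours.function == warden.function
    return same_contract and (exact or shared_tokens(ours, warden) >= 3)


def is_caught(warden: Finding, our_findings, matcher=liberal_match) -> bool:
    """A warden finding is caught if any of our findings matches it."""
    return any(matcher(ours, warden) for ours in our_findings)


# Illustrative findings only; not taken from any contest report.
warden = Finding("Vault", "withdraw", "Reentrancy in withdraw drains vault", {"reentrancy"})
ours = [Finding("Router", "swap", "Unchecked external call", {"reentrancy"})]

print(is_caught(warden, ours, liberal_match))
print(is_caught(warden, ours, strict_match))
```

In the example, the two findings share only a tag, so the liberal matcher counts the warden finding as caught while the strict matcher does not. The shared-tag rule is the loosest of the four liberal criteria, which is consistent with the strict aggregate being lower than the liberal one.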

The full methodology, per-contest comparison reports, and aggregate scoring script are open and reproducible:

docs/benchmarks/methodology.md
docs/benchmarks/per_contest/reports/<contest>.md
docs/benchmarks/aggregate_score.py

What we missed and why

Honest accounting: the 4 HIGH-severity findings we missed across the 9 contests cluster around three structural classes. Each needs either symbolic execution or deep economic-model reasoning, which current LLM tooling cannot reach with shape detectors alone:

Size H-01 — credit-amount swap-fee formula edge case (lending math)
Revolution H-02 — quorum / vote-supply snapshot bug (governance math)
Revolution H-03 — JSON injection in tokenURI (string-escape class)
Revolution H-04 — OZ Votes delegate-block (upgradeable-contract logic)

We've shipped detectors for the H-03 and M-class auction findings on the Revolution side since this benchmark; the next refresh of these numbers (after expanding the corpus past 9 contests) is projected to land at ~94% HIGH recall.

Run it on your repo

One CLI call. One HTTP request. Get the same audit pipeline you see proving itself above — pointed at your codebase.
