88.6% HIGH-Severity Recall
Across 9 Code4rena Audits

SolSentinel reproduces most of what professional smart-contract auditors find, in minutes rather than weeks. Recall is measured against the public Code4rena wardens' reports for 9 contests, each in a distinct DeFi protocol category. Every number on this page is verifiable.

HIGH recall: 88.6% (31 of 35 HIGH findings)
Aggregate recall: 77.0% (87 of 113 H+M findings)
Medium recall: 71.8% (56 of 78 MEDIUM findings)
Contests probed: 9 (each a distinct DeFi category)
Reproducible: 100% (every number from `aggregate_score.py`)
Total runtime: 50m (~5-10 min per audit)

Per-contest breakdown

| Contest | Category | Wardens' H+M findings | SolSentinel caught | HIGH recall | Aggregate recall |
|---|---|---|---|---|---|
| Renzo Protocol | LRT / restaking | 22 (8H + 14M) | 18 | 100% | 81.8% |
| Decent | Cross-chain bridge | 9 (4H + 5M) | 9 | 100% | 100% |
| Spectra | Yield strategy (PT/YT) | 2 (0H + 2M) | 2 | n/a | 100% |
| PoolTogether | Yield vault / lottery | 9 (1H + 8M) | 3 | 100% | 33.3% |
| Panoptic | Perp / options DEX | 11 (2H + 9M) | 9 | 100% | 81.8% |
| Munchables | Blast game / staking | 6 (2H + 4M) | 6 | 100% | 100% |
| Size | Lending + limit orders | 17 (4H + 13M) | 14 | 75% | 82.4% |
| Dyad | CDP stablecoin | 19 (10H + 9M) | 16 | 100% | 84.2% |
| Revolution Protocol | Auction / governance | 18 (4H + 14M) | 10 | 25% | 55.6% |
| AGGREGATE | | 113 (35H + 78M) | 87 | 88.6% | 77.0% |
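
The aggregate row can be re-derived from the per-contest rows. The sketch below is not the project's `aggregate_score.py`; it is a minimal, independent recomputation that assumes only the counts printed in the table. The HIGH-caught column is not printed directly, so it is derived here from each contest's HIGH recall times its HIGH count (for example, 75% of 4 HIGH findings in Size gives 3). The names `CONTESTS`, `recall`, and `aggregate` are illustrative.

```python
# Counts transcribed from the per-contest table:
# (contest, HIGH total, MEDIUM total, H+M caught, HIGH caught)
# HIGH caught is derived from the published HIGH recall (an assumption of this sketch).
CONTESTS = [
    ("Renzo Protocol", 8, 14, 18, 8),
    ("Decent", 4, 5, 9, 4),
    ("Spectra", 0, 2, 2, 0),
    ("PoolTogether", 1, 8, 3, 1),
    ("Panoptic", 2, 9, 9, 2),
    ("Munchables", 2, 4, 6, 2),
    ("Size", 4, 13, 14, 3),
    ("Dyad", 10, 9, 16, 10),
    ("Revolution Protocol", 4, 14, 10, 1),
]


def recall(caught, total):
    """Recall as a percentage rounded to one decimal; None when there is nothing to catch."""
    return round(100 * caught / total, 1) if total else None


def aggregate(contests):
    """Sum counts across contests and return (caught, total, recall %) per severity bucket."""
    high_total = sum(h for _, h, _, _, _ in contests)
    medium_total = sum(m for _, _, m, _, _ in contests)
    caught = sum(c for _, _, _, c, _ in contests)
    high_caught = sum(hc for _, _, _, _, hc in contests)
    medium_caught = caught - high_caught  # everything caught that is not HIGH
    total = high_total + medium_total
    return {
        "high": (high_caught, high_total, recall(high_caught, high_total)),
        "medium": (medium_caught, medium_total, recall(medium_caught, medium_total)),
        "aggregate": (caught, total, recall(caught, total)),
    }


if __name__ == "__main__":
    for bucket, (caught, total, pct) in aggregate(CONTESTS).items():
        print(f"{bucket}: {caught}/{total} = {pct}%")
```

Summing raw counts before dividing matters: averaging the nine per-contest percentages would weight a 2-finding contest the same as a 22-finding one.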

How the recall is measured

We pull each Code4rena contest's source repository and run it through the SolSentinel pipeline (IR detectors + Claude project-level audit) end-to-end. We then compare our findings against the public wardens' report from code4rena.com/reports/<contest>.

A warden's finding is marked CAUGHT (liberal match) if any of our findings:

Strict-mode recall (an exact contract and function match, or 3 or more shared tokens on the same contract) is reported alongside liberal recall in every per-contest report. The strict-mode figures are 64.5% aggregate recall and 75% HIGH recall; both numbers are public.
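
To make the strict rule concrete, here is a minimal sketch of how such a matcher could look. It is not SolSentinel's scoring code. The `Finding` fields, the tokenizer (lowercased alphanumeric words of 4 or more characters), and the choice to compare tokens from free-text descriptions are all assumptions made for illustration; the page states only the rule itself.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical minimal shape of a finding; field names are assumptions."""
    contract: str
    function: str
    description: str


def tokens(text):
    """Assumed tokenizer: lowercased alphanumeric words with at least 4 characters."""
    return {t for t in re.findall(r"[a-z0-9_]+", text.lower()) if len(t) >= 4}


def strict_match(warden, ours, min_shared=3):
    """Same contract AND (same function OR at least min_shared shared tokens)."""
    if warden.contract.lower() != ours.contract.lower():
        return False
    if warden.function and warden.function.lower() == ours.function.lower():
        return True
    shared = tokens(warden.description) & tokens(ours.description)
    return len(shared) >= min_shared


def strict_recall(warden_findings, our_findings):
    """Count warden findings matched by at least one of our findings."""
    caught = sum(
        any(strict_match(w, o) for o in our_findings) for w in warden_findings
    )
    return caught, len(warden_findings)


if __name__ == "__main__":
    # Made-up findings, for illustration only.
    warden = Finding("Vault", "withdraw", "Reentrancy in withdraw drains vault shares")
    ours = [
        Finding("Oracle", "withdraw", "stale price used in withdraw"),
        Finding("Vault", "redeem", "vault shares drained through reentrancy callback"),
    ]
    print(strict_recall([warden], ours))
```

In this made-up example the first candidate is rejected because the contract differs, and the second matches on three shared tokens (`vault`, `shares`, `reentrancy`) even though the function names differ.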

The full methodology, per-contest comparison reports, and aggregate scoring script are open and reproducible:

What we missed and why

Honest accounting: the 4 HIGH-severity findings we missed across the 9 contests (3 in Revolution Protocol, 1 in Size) cluster around three structural classes. Each needs either symbolic execution or deep economic-model reasoning, which current LLM tooling cannot reach with shape detectors alone:

We've shipped detectors for the H-03 and M-class auction findings on the Revolution side since this benchmark; the next refresh of these numbers (after expanding the corpus past 9 contests) is projected to land at ~94% HIGH recall.

Run it on your repo

One CLI call. One HTTP request. Get the same audit pipeline you see proving itself above — pointed at your codebase.

Start a free audit →