SolSentinel reproduces most of what professional smart-contract auditors find, in minutes rather than weeks. Measured against the public Code4rena wardens' reports for 9 contests spanning distinct DeFi protocol categories, it caught 87 of 113 High and Medium findings. Every number on this page is verifiable.
| Contest | Category | Wardens H+M | SolSentinel caught | High recall | Overall recall (H+M) |
|---|---|---|---|---|---|
| Renzo Protocol | LRT / restaking | 22 (8H + 14M) | 18 | 100% | 81.8% |
| Decent | Cross-chain bridge | 9 (4H + 5M) | 9 | 100% | 100% |
| Spectra | Yield strategy (PT/YT) | 2 (0H + 2M) | 2 | n/a | 100% |
| PoolTogether | Yield vault / lottery | 9 (1H + 8M) | 3 | 100% | 33.3% |
| Panoptic | Perp / options DEX | 11 (2H + 9M) | 9 | 100% | 81.8% |
| Munchables | Polygon game / staking | 6 (2H + 4M) | 6 | 100% | 100% |
| Size | Lending + limit orders | 17 (4H + 13M) | 14 | 75% | 82.4% |
| Dyad | CDP stablecoin | 19 (10H + 9M) | 16 | 100% | 84.2% |
| Revolution Protocol | Auction / governance | 18 (4H + 14M) | 10 | 25% | 55.6% |
| AGGREGATE | | 113 (35H + 78M) | 87 | 88.6% | 77.0% |
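The aggregate row can be re-derived from the per-contest rows. The following is a minimal sketch of that arithmetic, not the project's own `aggregate_score.py`. The `Contest` class and `pct` helper are hypothetical names, and the per-contest HIGH-caught counts are an assumption back-computed from the High recall percentages in the table (for example, 75% of 4 HIGH findings is 3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contest:
    name: str
    high: int         # wardens' HIGH findings (from the table)
    medium: int       # wardens' MEDIUM findings (from the table)
    caught: int       # H+M findings SolSentinel caught (from the table)
    high_caught: int  # ASSUMPTION: derived as High recall % x high


CONTESTS = [
    Contest("Renzo Protocol", 8, 14, 18, 8),
    Contest("Decent", 4, 5, 9, 4),
    Contest("Spectra", 0, 2, 2, 0),
    Contest("PoolTogether", 1, 8, 3, 1),
    Contest("Panoptic", 2, 9, 9, 2),
    Contest("Munchables", 2, 4, 6, 2),
    Contest("Size", 4, 13, 14, 3),
    Contest("Dyad", 10, 9, 16, 10),
    Contest("Revolution Protocol", 4, 14, 10, 1),
]


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as shown in the table."""
    return round(100.0 * numerator / denominator, 1)


total_findings = sum(c.high + c.medium for c in CONTESTS)
total_caught = sum(c.caught for c in CONTESTS)
total_high = sum(c.high for c in CONTESTS)
total_high_caught = sum(c.high_caught for c in CONTESTS)

overall_recall = pct(total_caught, total_findings)
high_recall = pct(total_high_caught, total_high)

print(f"findings={total_findings} caught={total_caught}")
print(f"overall recall={overall_recall}% high recall={high_recall}%")
```

Both aggregates are pooled over all findings rather than averaged per contest, which is why a small contest such as Spectra (2 findings) barely moves the overall figure.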
We pull each Code4rena contest's source repository and run it through the SolSentinel pipeline (IR detectors + Claude project-level audit) end-to-end. We then compare our findings against the public wardens' report from code4rena.com/reports/<contest>.
A warden's finding is marked CAUGHT (liberal match) if any of our findings:

- contract.function match with the warden's attribution, OR

Strict-mode recall (exact contract+function match OR 3+ tokens on same contract) is reported alongside liberal in every per-contest report. The strict aggregate is 64.5% overall and 75% on HIGH findings; both numbers are public.
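To make the strict-mode rule concrete, here is a minimal sketch of how such a matcher could look. It is an illustration under stated assumptions, not SolSentinel's implementation: the `Finding` class, `tokens`, and `strict_match` are hypothetical names, and "tokens" is assumed here to mean lowercase words shared between the two finding descriptions, which the page does not specify.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    contract: str
    function: str
    description: str


def tokens(text: str) -> set[str]:
    """ASSUMPTION: a token is a lowercase run of letters, digits, or underscores."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def strict_match(ours: Finding, warden: Finding, min_shared: int = 3) -> bool:
    """Exact contract+function match, OR min_shared+ common tokens on the same contract."""
    if ours.contract != warden.contract:
        return False  # both strict criteria require the same contract
    if ours.function == warden.function:
        return True
    shared = tokens(ours.description) & tokens(warden.description)
    return len(shared) >= min_shared


# Hypothetical findings, not taken from any contest report.
warden = Finding("Vault", "withdraw", "reentrancy in withdraw drains vault shares")
exact = Finding("Vault", "withdraw", "state updated after external call")
by_tokens = Finding("Vault", "redeem", "reentrancy drains shares via redeem")
other_contract = Finding("Router", "withdraw", "reentrancy in withdraw drains vault shares")

print(strict_match(exact, warden))
print(strict_match(by_tokens, warden))
print(strict_match(other_contract, warden))
```

A real matcher would likely filter stop words before counting shared tokens; this sketch omits that step to stay short.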
The full methodology, per-contest comparison reports, and aggregate scoring script are open and reproducible:
- docs/benchmarks/methodology.md
- docs/benchmarks/per_contest/reports/<contest>.md
- docs/benchmarks/aggregate_score.py

Honest accounting: the 4 HIGH-severity findings we missed across 9 contests cluster around three structural classes that need either symbolic execution or deep economic-model reasoning that current LLM tooling cannot reach with shape detectors alone:
We've shipped detectors for the H-03 and M-class auction findings on the Revolution side since this benchmark; the next refresh of these numbers (after expanding the corpus past 9 contests) is projected to land at ~94% HIGH recall.
One CLI call. One HTTP request. Get the same audit pipeline you see proving itself above — pointed at your codebase.
Start a free audit →