LLM-based code review from products like Claude Code is a real step forward. It can reason about what code is trying to do and catch problems that traditional scanners miss.
But it still isn’t a complete answer on its own. The bugs that cause real incidents often show up only when you follow an end-to-end flow, switch roles, and test what the running system actually allows.
So we ran a simple benchmark: we built three apps with AI coding tools, then measured how four tools performed on the same code and deployments: LLM-only review (Claude Code), traditional code scanners (Snyk and Invicti), and Neo, which pairs code review with runtime testing.

Verified rate = verified findings ÷ (verified findings + false positives). The 18 Critical + High issues represent all unique severe vulnerabilities identified across the benchmark.
Across three AI-generated apps, we confirmed 70 exploitable vulnerabilities, including 18 Critical and High issues.
Neo returned the most verified findings (62) with the least noise, while Claude Code found 40 verified issues but also produced 21 more false positives than Neo did. The difference was runtime testing: Neo could validate behavior in the running app, which cut the noise and surfaced issues that weren’t obvious from the code alone.
That showed up in the unique coverage. Neo found 20 verified vulnerabilities that no other tool caught; Claude Code’s 4 unique findings were all Low or Informational severity.
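The verified-rate metric from the caption is just a ratio. A minimal sketch in Python, using hypothetical false-positive counts for illustration (the exact per-tool totals appear in the chart, not the text):

```python
def verified_rate(verified: int, false_positives: int) -> float:
    """Verified rate = verified findings / (verified findings + false positives)."""
    total = verified + false_positives
    if total == 0:
        return 0.0
    return verified / total

# Hypothetical counts, chosen only to match the relative gap described in
# the text (one tool with 21 more false positives than the other).
print(f"{verified_rate(62, 5):.0%}")   # 62 verified, 5 false positives
print(f"{verified_rate(40, 26):.0%}")  # 40 verified, 26 false positives
```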
Many of Neo’s unique findings were serious “this should never be possible” failures.
One subtle point: Neo and Claude Code were both running on Opus 4.6, but Neo iterated across multiple loops, carried forward context and artifacts, and re-tested hypotheses against the running app, which is how it covered more corner cases and business logic paths than a code-only pass.
LLM-based code review from tools like Claude Code is far more capable than traditional scanners, but the biggest gains come when you can iterate, test real flows in a running build, and separate what’s exploitable from what’s just suspicious.
LLM-powered code review is a big improvement over traditional scanners in detecting security risks, but it is still a hypothesis engine.
The most damaging bugs in these apps weren’t single-line mistakes. They depended on state, identity, and sequencing, meaning you had to log in as different roles, follow a workflow, and see what the running system actually allows.
That creates two failure modes for code-only review: it misses issues that only become visible when you exercise an end-to-end flow, and it flags issues that look plausible in code but aren’t exploitable once framework behavior, validation, and runtime controls kick in.
That’s why we saw both misses and false positives from code-only review. Here are two examples.
In the banking app, Neo found that a user could dispute a small transaction, then submit a refund amount far larger than the original charge, and the system would credit their account anyway.
Code review alone struggles with these risks because nothing here looks like a classic “bad function.” The bug is a missing business rule, and Neo found it by reasoning through the dispute and refund path in the code and then testing the running build to validate its hypothesis.
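The missing rule is easy to state once you see it. A minimal sketch of the check the dispute path needed (names and structure are hypothetical, not the app’s actual code):

```python
class RefundError(Exception):
    pass

def validate_refund(original_charge: float, requested_refund: float) -> float:
    """Reject refunds that exceed the disputed transaction's amount.

    This is the business rule the app was missing: nothing tied the
    refund amount back to the original charge.
    """
    if requested_refund <= 0:
        raise RefundError("refund must be positive")
    if requested_refund > original_charge:
        raise RefundError("refund cannot exceed the original charge")
    return requested_refund

# A $5.00 dispute cannot yield a $5,000.00 credit.
validate_refund(5.00, 5.00)        # passes
# validate_refund(5.00, 5000.00)   # raises RefundError
```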

Claude flagged a profile update endpoint because it looked like the app might accept whatever fields a user sends. If that were true, a normal user could try to slip in something like “make me an admin” alongside a harmless profile change.
In the running build, that escalation didn’t work. The request is gated to a small allowlist of safe profile fields, and everything else is ignored before the update logic runs. So the handler looks suspicious in isolation, but the behavior isn’t exploitable.
This is the kind of finding that’s hard to resolve from code alone, but quick to validate once you can test the running flow.

Why this was a false positive: the handler looks dangerous in isolation, but the request schema acts as an allowlist, so privileged fields are ignored before the update loop runs.
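In code, that gate is a small allowlist filter. A hedged sketch of the pattern (field names are illustrative, not the app’s actual schema):

```python
# Only these fields may be updated via the profile endpoint; anything
# else in the request body is silently dropped before the update runs.
PROFILE_ALLOWLIST = {"display_name", "email", "avatar_url"}

def sanitize_profile_update(payload: dict) -> dict:
    """Keep only allowlisted fields from an untrusted request body."""
    return {k: v for k, v in payload.items() if k in PROFILE_ALLOWLIST}

# An attacker smuggles privileged fields alongside a normal change...
attack = {"display_name": "mallory", "role": "admin", "is_verified": True}
# ...but the allowlist strips them before the update logic ever sees them.
print(sanitize_profile_update(attack))  # {'display_name': 'mallory'}
```

Seen in isolation, the update handler downstream of this filter looks like it accepts arbitrary fields, which is exactly why a code-only pass flags it.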

One of the three apps was a digital banking app called VaultBank with accounts, transfers, loan deposits, and two-person approval workflows.
For Claude Code, we ran /security-review plus two follow-up prompts: one asking for a second pass to find additional issues, and one asking it to re-check and verify its own findings. Neo used a longer security-engineer prompt that included runtime testing. Neither prompt was seeded with “answers” or steered toward specific vulnerabilities.

Our research team manually triaged all 112 unique issues raised by the four tools. Pictured above are the issues raised for the VaultBank app.
Unlike code-only scanning, runtime validation requires a running build to test against, typically a preview or staging environment.
Deeper validation runs like this benchmark can take several hours. In practice, teams tune Neo to match their release velocity: PR diff reviews with targeted runtime checks finish in about 20 minutes, while deeper runs take several hours and are reserved for bigger changes or scheduled full-app testing (for example weekly).
Snyk and Invicti are strong at catching classic vulnerabilities that map well to signatures and known patterns, like common injection-style bugs. In this benchmark, neither surfaced any of the High or Critical issues we confirmed. What we saw instead is that these AI-generated apps generally didn’t fail in the “classic pattern” ways, and the serious problems clustered around authorization, workflows, and business logic.
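For contrast, here is the kind of “classic pattern” a signature-based scanner catches reliably, next to the kind of workflow flaw it has no pattern for. An illustrative sketch, not code from the benchmark apps:

```python
import sqlite3

def find_user_unsafe(conn, username: str):
    # Classic, signature-matchable SQL injection: user input concatenated
    # into the query string. Pattern scanners flag this shape reliably.
    return conn.execute(
        "SELECT * FROM users WHERE name = '" + username + "'"
    ).fetchall()

def find_user_safe(conn, username: str):
    # The parameterized fix, which scanners also recognize.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()

def approve_transfer(transfer: dict, approver: str) -> dict:
    # No dangerous sink, no known-bad signature -- but if nothing checks
    # that the approver differs from the requester, a two-person approval
    # workflow quietly collapses into one. There is no signature for that.
    transfer["approved_by"] = approver
    transfer["status"] = "approved"
    return transfer
```

The first two functions are what pattern matching is built for; the third is the class of bug that clustered in these apps.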
These conclusions matter even more for real-life applications. The benchmark apps had roles and workflows, but real systems have years of accumulated changes, more services and integrations, and less low-hanging fruit because mature teams already run baseline scanners. What remains are the context-dependent logic flaws and cross-role workflow breaks that are harder to reason about and easier to miss, which is why automated runtime testing is even more critical for production apps.
Part 2 will include the exact prompts used to generate the apps, a short architecture breakdown of each app (stack, roles, screenshots), our severity and validation rubric, and deeper walkthroughs: Neo-only findings, Claude false positives, and what Neo missed.
If you want to sanity-check this on your own code, run a Claude Code review on a change, then ask: “Can we prove this is exploitable in our running app without a human spending time reproducing it?” If the answer is no, that verification gap is what Neo is built to close.
If you want to see what the automated version of that loop looks like, reach out for a demo.