What we hadn't shared yet is how Neo performs when it's operating purely as a black-box DAST agent: no source code, no architecture context, just a URL. The prompt Neo gets is minimal, with nothing beyond the target.
That's it. No hints, no vulnerability descriptions, no guidance on what to look for. Just a URL and a goal.
We ran the XBOW validation benchmarks (104 web-app challenges) back in the Sonnet 4.5 era, and Neo solved 98 of them. We haven't followed up on XBOW since then, as the benchmark is largely saturated at this point. We do have some interesting case studies from it, including open-source models that figured out they were running against XBOW and tried to clone the GitHub repo directly instead of actually exploiting the challenges. More on that another time.
Recently, Pensar AI released the Argus validation benchmark: 60 Dockerized vulnerable web applications designed for evaluating autonomous security agents. We went through the challenges, and it stood out as a genuinely well-constructed benchmark: modern stacks (Node.js, Python, Go, Java, PHP, Ruby), multi-service architectures, and vulnerability classes that go beyond the usual textbook patterns. It felt like a good fit to put Neo's DAST capabilities through their paces and share the results.
Argus consists of 60 self-contained Docker challenges: 2 easy, 27 medium, and 31 hard. Each challenge is a full web application with an intentionally planted vulnerability (or chain of vulnerabilities) that leads to a flag. What makes it interesting is the breadth: modern stacks, multi-step exploitation chains, and a vulnerability surface that goes well beyond the usual SQLi-and-XSS fare.
| Vulnerability Category | Challenges |
|---|---|
| Injection (SQL, NoSQL, LDAP, Command, ORM) | 12 |
| Authentication / Authorization Bypass | 10 |
| Multi-Step Chains | 8 |
| Server-Side Request Forgery (SSRF) | 6 |
| Cloud / Infrastructure | 5 |
| Cross-Site Scripting (XSS, Stored, Blind) | 4 |
| Prototype Pollution | 3 |
| Deserialization | 3 |
| Template Injection (SSTI, SpEL) | 3 |
| Race Conditions (TOCTOU) | 3 |
| WAF / IDS Bypass | 3 |
| File Upload / Processing | 3 |
| HTTP Protocol Abuse | 3 |
| Cryptographic Flaws | 2 |
| Business Logic | 2 |
Counts overlap where a challenge spans multiple categories. The inclusion of race conditions, cloud/infra attacks, multi-step chains, and WAF evasion makes it a solid proxy for the kinds of vulnerabilities that actually matter in modern application security.
Running benchmarks sounds straightforward until you've done it a few times. Here's what we learned getting Argus into a state where we trust the results:
- **Randomized flags.** Each flag is regenerated as `FLAG{<random-32-char>}` at build time. Some original challenges had hardcoded leetspeak-style flags that are easy to guess, and we've also seen some models fabricate flags that pass UUID format checks but aren't real. Our verification matches against the exact build-time value, which caught 3-4 fake submissions.
- **Challenge isolation.** Neo is prompted with `Attempt and complete the CTF at URL1, URL2, URL3, URL4`, which means the challenges need to be properly isolated from each other. We set up each challenge in its own /24 network subnet, because we've seen agents pivot into completely different benchmarks when they can't find the flag in the intended one. When an agent is stuck, it explores, and if another benchmark's services are reachable, it'll find them.

These are just a few of the things we fixed for Argus. We have many more such experiences, not just with this benchmark but others as well, that we'll explore another time.
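The two fixes above can be sketched in a few lines. This is a minimal illustration, not our actual harness code; the function names and the `10.42.0.0/16` base network are assumptions made for the example.

```python
import ipaddress
import secrets


def make_flag() -> str:
    # A random 32-hex-char flag generated at build time, so it can't be
    # guessed or hardcoded the way the original leetspeak flags could.
    return f"FLAG{{{secrets.token_hex(16)}}}"


def challenge_subnet(index: int, base: str = "10.42.0.0/16") -> ipaddress.IPv4Network:
    # Carve the base network into /24s and give each challenge its own,
    # so a stuck agent can't wander into another benchmark's services.
    subnets = list(ipaddress.ip_network(base).subnets(new_prefix=24))
    return subnets[index]
```

Because the flag is generated per build, verification can compare against the exact stored value rather than a format pattern, which is what catches fabricated-but-well-formed submissions.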
Our evaluation harness is optimized for one thing: real results. We don't tolerate false positives.
When Neo submits a flag, it's validated at code level against the exact build-time value: not a format check, not a pattern match. Neo gets one shot. If the flag is correct, the challenge stops. If it's wrong, it gets one retry, and after that we restart the challenge entirely. For Argus, every solved challenge was solved on the first pass; none needed the retry.
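The submission flow described above amounts to a small state machine. The sketch below is illustrative only; the class and return labels are invented for the example, not taken from our harness.

```python
from dataclasses import dataclass


@dataclass
class ChallengeRun:
    expected_flag: str  # the exact build-time value, not a pattern
    attempts: int = 0

    def submit(self, flag: str) -> str:
        # Exact string comparison: a fabricated flag that merely looks
        # valid (e.g. passes a UUID format check) fails here.
        if flag == self.expected_flag:
            return "solved"
        self.attempts += 1
        # One retry after a wrong flag; a second miss restarts the challenge.
        return "retry" if self.attempts < 2 else "restart"
```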
We also don't enforce time restrictions. Instead, we enforce a fixed budget of $20 per challenge (though P90 was around $8). If Neo can't solve within the budget, it stops. We observe the time taken, but it's not a constraint; the budget is.
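Budget-capped (rather than time-capped) execution can be sketched as a simple loop over agent steps. This is a toy model, assuming each step reports its cost and optionally a flag; the function name and step shape are inventions for the example.

```python
def run_with_budget(steps, budget_usd: float = 20.0):
    # Time is observed but never enforced; the only hard stop is the
    # cumulative dollar cost crossing the per-challenge budget.
    # `steps` yields (cost_usd, flag_or_None) tuples.
    spent = 0.0
    for cost, flag in steps:
        spent += cost
        if flag is not None:
            return "solved", spent
        if spent >= budget_usd:
            return "budget_exceeded", spent
    return "gave_up", spent
```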
We review the failed attempts too, to understand where Neo got stuck and why. There's a lot to unpack there, and we'll get into the failure analysis in future posts.
| Metric | Score |
|---|---|
| Total challenges | 60 |
| Solved | 51 (85%) |
| Unsolved | 9 |
| Avg cost per solve | ~$3.40 |
| Avg time per solve | ~30 min |
| Budget cap per challenge | $20 |
We ran models in a fallback pipeline, cheapest first, only escalating to a smarter model when the previous one failed a challenge:
| Stage | Model | New solves | Cumulative |
|---|---|---|---|
| 1st pass | Haiku 4.5 | 33 | 33 |
| Fallback | Sonnet 4.6 | +12 | 45 |
| Fallback | Opus 4.6 | +3 | 48 |
| Fallback | Opus 4.7 | +3 | 51 |
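The escalation logic behind this table is straightforward: the cheapest model attempts every challenge, and only its failures are passed to the next model. A minimal sketch, with invented names (`fallback_pipeline`, the `solve` callback):

```python
def fallback_pipeline(challenges, models, solve):
    # `models` is ordered cheapest to most capable. Each model only sees
    # the challenges every earlier (cheaper) model failed to solve.
    solved = {}
    remaining = list(challenges)
    for model in models:
        still_unsolved = []
        for challenge in remaining:
            if solve(model, challenge):
                solved[challenge] = model
            else:
                still_unsolved.append(challenge)
        remaining = still_unsolved
    return solved, remaining
```

The design keeps average cost low: the expensive models are only ever billed for the hard residue, not the full benchmark.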
Haiku solving 33 out of 51 is worth noting: the cheapest model in the pipeline handles the majority of the work. Sonnet and Opus pick up what Haiku can't, typically the longer multi-step chains. We'll go deeper into per-model performance and cost efficiency in a follow-up post.
Most unsolved challenges were near-misses: almost all ended with `budget_exceeded`, meaning Neo was on the right track but hit the $20 cap before finishing. Could they be solved with an unlimited budget? Maybe, but that's not the goal. We read the agent trajectories, analyze where it got stuck, figure out if we need to build new tools or close gaps in coverage, and iterate. We'll share what we learn from the failures, and why they failed, in future posts.
This post is the high-level picture. Next up, we'll dissect these results: which models solve which vulnerability categories, where the coverage drops off, what the failures actually look like, and what we're doing about them. We've also learned a lot about agent behavior patterns along the way: bad practices like overusing the browser, guessing flag values, and other habits that waste budget without making progress. We'll share those learnings alongside the deeper breakdowns in upcoming posts.
For now, the takeaway is straightforward: give Neo a URL with no source code, no hints, and a minimal prompt, and it solves 85% of the challenges end-to-end.
If you're looking to scale the use of AI for security testing, or you've been building your own agent and the maintenance overhead is starting to outpace the results, request a demo and we'll show you Neo on your stack.