What we hadn't shared yet is how Neo performs when it's operating purely as a black-box DAST agent: no source code, no architecture context, just a URL. The prompt Neo gets is minimal, with nothing beyond the target.
That's it. No hints, no vulnerability descriptions, no guidance on what to look for. Just a URL and a goal.
We ran the XBOW validation benchmarks (104 web-app challenges) back in the Sonnet 4.5 era, and Neo solved 98 of them. We haven't followed up on XBOW since then, as the benchmark is largely saturated at this point. We do have some interesting case studies from it, including open-source models that figured out they were running against XBOW and tried to clone the GitHub repo directly instead of actually exploiting the challenges. More on that another time.
Recently, Pensar AI released the Argus validation benchmark: 60 Dockerized vulnerable web applications designed for evaluating autonomous security agents. We went through the challenges, and it stood out as a genuinely well-constructed benchmark: modern stacks (Node.js, Python, Go, Java, PHP, Ruby), multi-service architectures, and vulnerability classes that go beyond the usual textbook patterns. It felt like a good fit to put Neo's DAST capabilities through their paces and share the results.
Argus consists of 60 self-contained Docker challenges: 2 easy, 27 medium, and 31 hard. Each challenge is a full web application with an intentionally planted vulnerability (or chain of vulnerabilities) that leads to a flag. What makes it interesting is the breadth: modern stacks, multi-step exploitation chains, and a vulnerability surface that goes well beyond the usual SQLi-and-XSS fare.
| Vulnerability Category | Challenges |
|---|---|
| Injection (SQL, NoSQL, LDAP, Command, ORM) | 12 |
| Authentication / Authorization Bypass | 10 |
| Multi-Step Chains | 8 |
| Server-Side Request Forgery (SSRF) | 6 |
| Cloud / Infrastructure | 5 |
| Cross-Site Scripting (XSS, Stored, Blind) | 4 |
| Prototype Pollution | 3 |
| Deserialization | 3 |
| Template Injection (SSTI, SpEL) | 3 |
| Race Conditions (TOCTOU) | 3 |
| WAF / IDS Bypass | 3 |
| File Upload / Processing | 3 |
| HTTP Protocol Abuse | 3 |
| Cryptographic Flaws | 2 |
| Business Logic | 2 |
Counts overlap where a challenge spans multiple categories. The inclusion of race conditions, cloud/infra attacks, multi-step chains, and WAF evasion makes it a solid proxy for the kinds of vulnerabilities that actually matter in modern application security.
Running benchmarks sounds straightforward until you've done it a few times. Here's what we learned getting Argus into a state where we trust the results:
- **Randomized flags.** Each flag is regenerated as `FLAG{<random-32-char>}` at build time. Some original challenges had hardcoded leetspeak-style flags that are easy to guess, and we've also seen some models fabricate flags that pass UUID format checks but aren't real. Our verification matches against the exact build-time value, which caught 3-4 fake submissions.
- **Challenge isolation.** Neo is prompted with `Attempt and complete the CTF at URL1, URL2, URL3, URL4`, which means the challenges need to be properly isolated from each other. We set up each challenge in its own /24 network subnet, because we've seen agents pivot into completely different benchmarks when they can't find the flag in the intended one. When an agent is stuck, it explores, and if another benchmark's services are reachable, it'll find them.

These are just a few of the things we fixed for Argus. We have many more such experiences, not just with this benchmark but others as well, that we'll explore another time.
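The two fixes above can be sketched in a few lines. This is a minimal illustration, not our actual harness code; the function names and the `10.42.0.0/16` base network are assumptions made for the example.

```python
import ipaddress
import secrets


def make_flag() -> str:
    # A random 32-hex-char flag generated at build time, so it can't be
    # guessed or hardcoded the way the original leetspeak flags could.
    return f"FLAG{{{secrets.token_hex(16)}}}"


def challenge_subnet(index: int, base: str = "10.42.0.0/16") -> ipaddress.IPv4Network:
    # Carve the base network into /24s and give each challenge its own,
    # so a stuck agent can't wander into another benchmark's services.
    subnets = list(ipaddress.ip_network(base).subnets(new_prefix=24))
    return subnets[index]
```

Because the flag is generated per build, verification can compare against the exact stored value rather than a format pattern, which is what catches fabricated-but-well-formed submissions.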
Our evaluation harness is optimized for one thing: real results. We don't tolerate false positives.
When Neo submits a flag, it's validated at code level against the exact build-time value: not a format check, not a pattern match. Neo gets one shot. If the flag is correct, the challenge stops. If it's wrong, it gets one retry, and after that we restart the challenge entirely. For Argus, every solved challenge was solved on the first pass; none needed the retry.
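The submission flow described above amounts to a small state machine. The sketch below is illustrative only; the class and return labels are invented for the example, not taken from our harness.

```python
from dataclasses import dataclass


@dataclass
class ChallengeRun:
    expected_flag: str  # the exact build-time value, not a pattern
    attempts: int = 0

    def submit(self, flag: str) -> str:
        # Exact string comparison: a fabricated flag that merely looks
        # valid (e.g. passes a UUID format check) fails here.
        if flag == self.expected_flag:
            return "solved"
        self.attempts += 1
        # One retry after a wrong flag; a second miss restarts the challenge.
        return "retry" if self.attempts < 2 else "restart"
```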
We also don't enforce time restrictions. Instead, we enforce a fixed budget of $20 per challenge (though P90 was around $8). If Neo can't solve within the budget, it stops. We observe the time taken, but it's not a constraint; the budget is.
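Budget-capped (rather than time-capped) execution can be sketched as a simple loop over agent steps. This is a toy model, assuming each step reports its cost and optionally a flag; the function name and step shape are inventions for the example.

```python
def run_with_budget(steps, budget_usd: float = 20.0):
    # Time is observed but never enforced; the only hard stop is the
    # cumulative dollar cost crossing the per-challenge budget.
    # `steps` yields (cost_usd, flag_or_None) tuples.
    spent = 0.0
    for cost, flag in steps:
        spent += cost
        if flag is not None:
            return "solved", spent
        if spent >= budget_usd:
            return "budget_exceeded", spent
    return "gave_up", spent
```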
We review the failed attempts too, to understand where Neo got stuck and why. There's a lot to unpack there, and we'll get into the failure analysis in future posts.
| Metric | Score |
|---|---|
| Total challenges | 60 |
| Solved | 51 (85%) |
| Unsolved | 9 |
| Avg cost per solve | ~$3.40 |
| Avg time per solve | ~30 min |
| Budget cap per challenge | $20 |
We ran models in a fallback pipeline, cheapest first, only escalating to a smarter model when the previous one failed a challenge:
| Stage | Model | New solves | Cumulative |
|---|---|---|---|
| 1st pass | Haiku 4.5 | 33 | 33 |
| Fallback | Sonnet 4.6 | +12 | 45 |
| Fallback | Opus 4.6 | +3 | 48 |
| Fallback | Opus 4.7 | +3 | 51 |
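The escalation logic behind this table is straightforward: the cheapest model attempts every challenge, and only its failures are passed to the next model. A minimal sketch, with invented names (`fallback_pipeline`, the `solve` callback):

```python
def fallback_pipeline(challenges, models, solve):
    # `models` is ordered cheapest to most capable. Each model only sees
    # the challenges every earlier (cheaper) model failed to solve.
    solved = {}
    remaining = list(challenges)
    for model in models:
        still_unsolved = []
        for challenge in remaining:
            if solve(model, challenge):
                solved[challenge] = model
            else:
                still_unsolved.append(challenge)
        remaining = still_unsolved
    return solved, remaining
```

The design keeps average cost low: the expensive models are only ever billed for the hard residue, not the full benchmark.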
Haiku solving 33 out of 51 is worth noting: the cheapest model in the pipeline handles the majority of the work. Sonnet and Opus pick up what Haiku can't, typically the longer multi-step chains. We'll go deeper into per-model performance and cost efficiency in a follow-up post.
Most unsolved challenges were near-misses: almost all ended with `budget_exceeded`, meaning Neo was on the right track but hit the $20 cap before finishing. Could they be solved with an unlimited budget? Maybe, but that's not the goal. We read the agent trajectories, analyze where it got stuck, figure out if we need to build new tools or close gaps in coverage, and iterate. We'll share what we learn from the failures, and why they failed, in future posts.
This post is the high-level picture. Next up, we'll dissect these results: which models solve which vulnerability categories, where the coverage drops off, what the failures actually look like, and what we're doing about them. We've also learned a lot about agent behavior patterns along the way: bad practices like overusing the browser, guessing flag values, and other habits that waste budget without making progress. We'll share those learnings alongside the deeper breakdowns in upcoming posts.
For now, the takeaway is straightforward: give Neo a URL with no source code, no hints, and a minimal prompt, and it solves 85% of the challenges end-to-end.
If you're looking to scale the use of AI for security testing, or you've been building your own agent and the maintenance overhead is starting to outpace the results, request a demo and we'll show you Neo on your stack.