Full disclosure: I'm a researcher at CyberArk Labs.
This is a technical deep dive from our threat research team, no marketing fluff, just code and methodology.
Static analysis tools like CodeQL are great at identifying "maybe" issues, but the signal-to-noise ratio is often poor. You get thousands of alerts, and triaging all of them by hand is rarely practical.
We built an open-source tool, Vulnhalla, to address this issue. It feeds CodeQL's "haystack" of alerts to GPT-4o, which reasons about the surrounding code context to verify whether each alert is legitimate.
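To make the idea concrete, here is a minimal sketch of that kind of triage loop. It is not Vulnhalla's actual code. It assumes the alerts come from a SARIF file (the format CodeQL can export), and every name in it (`iter_alerts`, `code_context`, `build_prompt`, `triage`, and the `ask_llm` / `read_source` callables) is hypothetical. The model call is left as a plug-in function so the sketch does not assume any particular API client, and the prompt wording is only an illustration.

```python
from typing import Callable, Dict, Iterator, List


def iter_alerts(sarif: dict) -> Iterator[Dict]:
    """Yield one flat record per result in a parsed SARIF log."""
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            phys = locations[0].get("physicalLocation", {})
            yield {
                "rule": result.get("ruleId", "unknown"),
                "message": result.get("message", {}).get("text", ""),
                "file": phys.get("artifactLocation", {}).get("uri", ""),
                "line": phys.get("region", {}).get("startLine", 1),
            }


def code_context(source: str, line: int, radius: int = 20) -> str:
    """Return the lines around a 1-based line number, each prefixed with its number."""
    lines = source.splitlines()
    start = max(line - 1 - radius, 0)
    end = min(line + radius, len(lines))
    return "\n".join(f"{i + 1}: {lines[i]}" for i in range(start, end))


def build_prompt(alert: Dict, context: str) -> str:
    """Assemble the question sent to the model (illustrative wording only)."""
    return (
        "A static analyzer raised this alert:\n"
        f"Rule: {alert['rule']}\n"
        f"Message: {alert['message']}\n"
        f"Location: {alert['file']}:{alert['line']}\n\n"
        f"Surrounding code:\n{context}\n\n"
        "Start your answer with TRUE POSITIVE or FALSE POSITIVE, then explain."
    )


def triage(
    sarif: dict,
    read_source: Callable[[str], str],  # hypothetical: maps a file URI to its text
    ask_llm: Callable[[str], str],      # hypothetical: sends a prompt, returns the reply
    radius: int = 20,
) -> List[Dict]:
    """Return the alerts that the model did not explicitly dismiss."""
    kept = []
    for alert in iter_alerts(sarif):
        context = code_context(read_source(alert["file"]), alert["line"], radius)
        verdict = ask_llm(build_prompt(alert, context)).strip().upper()
        # Only an explicit FALSE POSITIVE drops the alert; unclear answers
        # stay in the queue for a human to look at.
        if not verdict.startswith("FALSE"):
            kept.append(alert)
    return kept
```

In this sketch, `sarif` would come from something like `json.load()` on the exported results file, and `ask_llm` would wrap whichever model client you use. Keeping ambiguous verdicts is a deliberate choice here: a filter that silently drops anything it cannot parse would hide exactly the bugs you are looking for.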
The sheer volume of false positives can trick us into treating a codebase as "clean enough" simply because nobody can physically get through the backlog. That is frustrating, and the real vulnerabilities are still there, hidden in the noise.
Once we used GPT-4o to strip away ~96% of the false positives, we uncovered confirmed CVEs in the Linux kernel, FFmpeg, Redis, Bullet3, and RetroArch. We found these in just two days of running the tool and triaging the output (total API cost under $80).
Running the tool for longer, and with improved models, would likely surface many more vulnerabilities.
Write-up & Tool: