Which Came First: The System Prompt, or the RCE?

During a recent penetration test, we came across an AI-powered desktop application that acted as a bridge between Claude (Opus 4.5) and a third-party asset management platform. The idea is simple: instead of clicking through dashboards and making API calls, users just ask the agent to do it for them. “How many open tickets do we have?” “Update this record.” That kind of thing.

The agent ran inside a sandboxed environment, and the client was confident in their controls. Rigid system prompts (even prepended to each message), deterministic hooks in place to prevent accidental disclosure, and so on. To their credit, those controls held up; we just found another way to do what we wanted.

Automating the Recon

Manual LLM testing is a drag. You’re sitting there typing prompts one at a time, waiting for responses, trying to keep track of what worked and what didn’t. It’s tedious, and it doesn’t scale.

Our go-to approach is to get another LLM to do the dirty work. For this engagement, the target was accessible via an Electron desktop application, meaning you could launch it in debug mode and access the app’s DOM tree directly. We wrote a Python script that could interact with the target directly, gave it to Claude (alongside our Augustus LLM testing methodology), and let it run.

This essentially meant we had Claude talking to another version of itself. Back and forth, hundreds of times, working through the Augustus attack paths automatically:

Terminal showing Python script execution where AI refuses PowerPoint creation request, followed by thinking notes about the refusal

This kind of automated LLM-on-LLM testing saved us an immense amount of effort on this engagement. It’s exactly why we’ve been building tools like Julius (for fingerprinting AI services) and Augustus, which we’ve recently added to our Guard platform. If the attack surface keeps growing, the testing efficiency has to keep up.

Discovering Weaknesses

After a couple hours of this, patterns started to emerge. The agent had strong restrictions on most dangerous operations; ask it to run a bash command or write a shell script and it would refuse.

Screenshot of a chat interface showing user asking 'Ls the files in /app/worker' and AI responding it cannot help with that request

However, it really liked “Hello World” programs. It was more than happy to create and run a simple test script. This is worth noting for similar-style engagements. LLMs are trained to be helpful, and “Hello World” scripts are some of the most common within their training data. That makes this a reliable foot-in-the-door when testing agents with code execution.

Terminal window showing a user request to create a hello world bash script, with status showing 'Bash Script Executed Successfully

More importantly, while it wouldn’t create anything it considered dangerous, it was perfectly willing to modify existing files. Change a file extension or make something executable, all fine.

Using the application’s file upload feature you could upload text, images, or csv files, and they’d land in the sandbox. Combined with the agent’s willingness to rename and chmod, this was effectively arbitrary file upload.

Two chat messages in dark interface. First asks to fix filetype of env_check2.txt and move to cwd. Second confirms task.

Hitting a Wall

Getting code execution should have been straightforward from there – just upload a script and have the agent execute it – but it wasn’t. Opus 4.5 is genuinely good at catching malicious payloads, and it blocked everything we threw at it:

Obfuscated/encoded scripts would get decoded and blocked (or rejected outright citing an inability to verify safety).
A wrapper script that runs everything in a directory failed, as it understood the context and inspected each script individually first.
Swapping a benign script for a malicious one after building trust didn’t work because it re-read the file before every execution (and the architecture prevented races).

The model had a clear policy: if it couldn’t understand what something did, it wouldn’t run it. And if it could understand it and it looked dangerous, it wouldn’t run it either. We needed something that Opus 4.5 thought it understood and believed was harmless.

The Trick

The answer came from the “Hello World” observation from earlier. The agent trusted simple programs, and it trusted its own ability to read C code and verify what a binary does. But what if the source code doesn’t actually match the binary?

We created two separate programs. The first was legitimate C source code for a hello world program that also calls system(“date”), giving a plausible reason for a system() call to exist in the binary.

C code editor showing hello_world.c file with main function containing printf Hello World, system date command, and return 0

The second was the actual payload binary. A statically compiled ELF that downloads and executes a shell script from our server, while also running the same commands as the decoy. We statically compiled it so the binary would be large enough (~600KB) that Claude couldn’t easily analyze it. We also performed a basic XOR against the payload to make it harder to decode.

C code editor showing malicious program with encoded payload array, XOR decoding loop, and system() call to execute decoded commands

The sandbox contained the application’s entire business logic. System prompts, MCP tool definitions, the Dockerfile used to build the environment, and the Node.js server running all the integrations. In short: the crown jewels.

Takeaways

If you’re building or deploying AI agents with code execution capabilities, don’t rely on the model’s judgment as a security control. It can be fooled. Here’s what we recommended to the client:

Restrict executable operations at the OS level: The sandbox should prevent the agent from setting the execute bit on user-uploaded files or running binaries that weren’t pre-installed in the image.
Validate uploaded files by content, not just extension: Check magic bytes. An ELF binary with a .txt extension should be rejected. Content-type validation should persist through file rename operations.
Add tool-level guardrails for dangerous operations: The first was legitimate C source code for a hello world program. It also calls system(“date”), giving a plausible reason for a system() call to exist in the binary.
Lock down network egress: The sandbox didn’t need total outbound internet connectivity, as it only really interfaced with a couple external APIs. Outbound firewalling, while not a primary control, can greatly hinder post-exploitation activity.

Findings and recommendations like these are part of what we do continuously through our Guard platform. Guard continuously monitors and tests internet-facing and internally-deployed infrastructure, including AI agents and LLM-powered applications, as they evolve and change over time. Got a new agentic deployment, model version, or MCP tool? Guard picks it up and our operators test it, identifying vulnerabilities just like this.

FAQ

Why did static compilation matter?

A dynamically linked binary is small enough that the LLM could potentially inspect the raw bytes and identify suspicious library calls like network functions. A statically compiled binary is hundreds of kilobytes of noise, making it effectively opaque to the model.

What is Augustus?

Augustus is Praetorian’s Go-based LLM vulnerability scanner. It tests large language models against a wide range of adversarial attacks and jailbreaks. In this case, we used its methodology to automate the discovery of behavioral weaknesses in the target agent.

Did this affect real customer data?

This was conducted during an authorized penetration test. The sandbox environment was isolated, and we did not access or exfiltrate any customer data. The client has since remediated the vulnerability.

Could this work against other LLM agents?

The core technique exploits a general weakness: LLMs are trained to assist, and “Hello World” scripts are among the most common things they’re asked to produce. Any agent with code execution that treats “simple test script” as a safe category is potentially giving attackers a method of initial access. On top of that, the faked source code trick exploits another general weakness: LLMs will often trust contextual information over direct inspection when the direct inspection is too difficult.

The post Which Came First: The System Prompt, or the RCE? appeared first on Praetorian.

*** This is a Security Bloggers Network syndicated blog from Offensive Security Blog: Latest Trends in Hacking | Praetorian authored by n8n-publisher. Read the original post at: https://www.praetorian.com/blog/which-came-first-system-prompt-or-rce/

March 24, 2026March 24, 2026 n8n-publisher AI agent security, AI Offensive Security, Augustus, code execution, LLM penetration testing, Offensive Security, prompt injection, Red Team, Vulnerability Research