Which Came First: The System Prompt, or the RCE?
嗯,用户让我用中文总结一篇文章,控制在100字以内,而且不需要特定的开头。首先,我得仔细阅读文章内容,抓住主要信息。 文章讲的是一个渗透测试中发现的AI桌面应用漏洞。这个应用连接了Claude和一个资产管理平台,用户可以通过提问让AI代理执行任务。测试人员发现虽然有安全措施,但还是找到了绕过的方法。 他们用自动化脚本让另一个LLM来测试目标,发现了几个弱点。比如代理喜欢运行简单的“Hello World”程序,允许修改文件。虽然Opus 4.5能检测恶意脚本,但通过上传看似无害的C源代码和隐藏恶意二进制文件,成功绕过了安全措施。 最后总结了一些安全建议,比如限制执行权限、验证文件内容等。 现在要把这些信息浓缩到100字以内。要涵盖渗透测试、AI代理、漏洞利用、自动化方法以及安全建议。确保语言简洁明了。 </think> 在一次渗透测试中,研究人员发现一款AI驱动的桌面应用存在漏洞。该应用通过Claude与第三方平台交互,尽管有严格的安全控制,仍被绕过。通过自动化LLM测试和上传伪装的“Hello World”程序及隐藏恶意二进制文件,成功实现远程代码执行。最终建议限制危险操作、加强文件验证和网络控制以提升安全性。 2026-3-25 02:30:53 Author: securityboulevard.com(查看原文) 阅读量:3 收藏

During a recent penetration test, we came across an AI-powered desktop application that acted as a bridge between Claude (Opus 4.5) and a third-party asset management platform. The idea is simple: instead of clicking through dashboards and making API calls, users just ask the agent to do it for them. “How many open tickets do we have?” “Update this record.” That kind of thing.

The agent ran inside a sandboxed environment, and the client was confident in their controls. Rigid system prompts (even prepended to each message), deterministic hooks in place to prevent accidental disclosure, and so on. To their credit, those controls held up; we just found another way to do what we wanted.

Automating the Recon

Manual LLM testing is a drag. You’re sitting there typing prompts one at a time, waiting for responses, trying to keep track of what worked and what didn’t. It’s tedious, and it doesn’t scale.

Our go-to approach is to get another LLM to do the dirty work. For this engagement, the target was accessible via an Electron desktop application, meaning you could launch it in debug mode and access the app’s DOM tree directly. We wrote a Python script that could interact with the target directly, gave it to Claude (alongside our Augustus LLM testing methodology), and let it run.

This essentially meant we had Claude talking to another version of itself. Back and forth, hundreds of times, working through the Augustus attack paths automatically:

Terminal showing Python script execution where AI refuses PowerPoint creation request, followed by thinking notes about the refusal

This kind of automated LLM-on-LLM testing saved us an immense amount of effort on this engagement. It’s exactly why we’ve been building tools like Julius (for fingerprinting AI services) and Augustus, which we’ve recently added to our Guard platform. If the attack surface keeps growing, the testing efficiency has to keep up.

Discovering Weaknesses

After a couple hours of this, patterns started to emerge. The agent had strong restrictions on most dangerous operations; ask it to run a bash command or write a shell script and it would refuse.

Screenshot of a chat interface showing user asking 'Ls the files in /app/worker' and AI responding it cannot help with that request

However, it really liked “Hello World” programs. It was more than happy to create and run a simple test script. This is worth noting for similar-style engagements. LLMs are trained to be helpful, and “Hello World” scripts are some of the most common within their training data. That makes this a reliable foot-in-the-door when testing agents with code execution.

Terminal window showing a user request to create a hello world bash script, with status showing 'Bash Script Executed Successfully

More importantly, while it wouldn’t create anything it considered dangerous, it was perfectly willing to modify existing files. Change a file extension or make something executable, all fine.

Using the application’s file upload feature you could upload text, images, or csv files, and they’d land in the sandbox. Combined with the agent’s willingness to rename and chmod, this was effectively arbitrary file upload.

Two chat messages in dark interface. First asks to fix filetype of env_check2.txt and move to cwd. Second confirms task.

Hitting a Wall

Getting code execution should have been straightforward from there – just upload a script and have the agent execute it – but it wasn’t. Opus 4.5 is genuinely good at catching malicious payloads, and it blocked everything we threw at it:

  • Obfuscated/encoded scripts would get decoded and blocked (or rejected outright citing an inability to verify safety).
  • A wrapper script that runs everything in a directory failed, as it understood the context and inspected each script individually first.
  • Swapping a benign script for a malicious one after building trust didn’t work because it re-read the file before every execution (and the architecture prevented races).

The model had a clear policy: if it couldn’t understand what something did, it wouldn’t run it. And if it could understand it and it looked dangerous, it wouldn’t run it either. We needed something that Opus 4.5 thought it understood and believed was harmless.

The Trick

The answer came from the “Hello World” observation from earlier. The agent trusted simple programs, and it trusted its own ability to read C code and verify what a binary does. But what if the source code doesn’t actually match the binary?

We created two separate programs. The first was legitimate C source code for a hello world program that also calls system(“date”), giving a plausible reason for a system() call to exist in the binary.

C code editor showing hello_world.c file with main function containing printf Hello World, system date command, and return 0

The second was the actual payload binary. A statically compiled ELF that downloads and executes a shell script from our server, while also running the same commands as the decoy. We statically compiled it so the binary would be large enough (~600KB) that Claude couldn’t easily analyze it. We also performed a basic XOR against the payload to make it harder to decode.

C code editor showing malicious program with encoded payload array, XOR decoding loop, and system() call to execute decoded commands

The sandbox contained the application’s entire business logic. System prompts, MCP tool definitions, the Dockerfile used to build the environment, and the Node.js server running all the integrations. In short: the crown jewels.

Takeaways

If you’re building or deploying AI agents with code execution capabilities, don’t rely on the model’s judgment as a security control. It can be fooled. Here’s what we recommended to the client:

  • Restrict executable operations at the OS level: The sandbox should prevent the agent from setting the execute bit on user-uploaded files or running binaries that weren’t pre-installed in the image.
  • Validate uploaded files by content, not just extension: Check magic bytes. An ELF binary with a .txt extension should be rejected. Content-type validation should persist through file rename operations.
  • Add tool-level guardrails for dangerous operations: The first was legitimate C source code for a hello world program. It also calls system(“date”), giving a plausible reason for a system() call to exist in the binary.
  • Lock down network egress: The sandbox didn’t need total outbound internet connectivity, as it only really interfaced with a couple external APIs. Outbound firewalling, while not a primary control, can greatly hinder post-exploitation activity.

Findings and recommendations like these are part of what we do continuously through our Guard platform. Guard continuously monitors and tests internet-facing and internally-deployed infrastructure, including AI agents and LLM-powered applications, as they evolve and change over time. Got a new agentic deployment, model version, or MCP tool? Guard picks it up and our operators test it, identifying vulnerabilities just like this.

FAQ

Why did static compilation matter?

A dynamically linked binary is small enough that the LLM could potentially inspect the raw bytes and identify suspicious library calls like network functions. A statically compiled binary is hundreds of kilobytes of noise, making it effectively opaque to the model.

What is Augustus?

Augustus is Praetorian’s Go-based LLM vulnerability scanner. It tests large language models against a wide range of adversarial attacks and jailbreaks. In this case, we used its methodology to automate the discovery of behavioral weaknesses in the target agent.

Did this affect real customer data?

This was conducted during an authorized penetration test. The sandbox environment was isolated, and we did not access or exfiltrate any customer data. The client has since remediated the vulnerability.

Could this work against other LLM agents?

The core technique exploits a general weakness: LLMs are trained to assist, and “Hello World” scripts are among the most common things they’re asked to produce. Any agent with code execution that treats “simple test script” as a safe category is potentially giving attackers a method of initial access. On top of that, the faked source code trick exploits another general weakness: LLMs will often trust contextual information over direct inspection when the direct inspection is too difficult.

The post Which Came First: The System Prompt, or the RCE? appeared first on Praetorian.

*** This is a Security Bloggers Network syndicated blog from Offensive Security Blog: Latest Trends in Hacking | Praetorian authored by n8n-publisher. Read the original post at: https://www.praetorian.com/blog/which-came-first-system-prompt-or-rce/


文章来源: https://securityboulevard.com/2026/03/which-came-first-the-system-prompt-or-the-rce/
如有侵权请联系:admin#unsafe.sh