An attack class that passes every current LLM filter

An attack class that passes every current LLM filter
好的，我现在要帮用户总结一下这篇文章的内容。用户的要求是用中文，控制在100字以内，不需要特定的开头，直接写描述。首先，我需要快速浏览文章内容。文章主要讲的是一个新的攻击类别，能够绕过所有当前的LLM过滤器。攻击者通过在上下文中埋下普通语言，改变模型的推理角度，而不需要任何攻击性特征或覆盖命令。这种攻击方式在现有的防御机制下很难被检测到，因为它看起来像正常事实。接下来，我需要提取关键点：新攻击类别、绕过过滤器、利用普通语言改变模型推理、无攻击特征、难以检测、可能影响生产管道。然后，我要把这些点浓缩成100字以内的中文描述。要注意逻辑连贯，涵盖主要发现和影响。最后，检查一下是否符合用户的所有要求：字数控制、直接描述、中文表达。 </think> 文章提出一种新型攻击类别，通过在上下文中嵌入普通语言改变大型语言模型的推理角度，使其偏离预期决策方向。该方法无需攻击性特征或覆盖指令，现有防御机制无法检测。研究显示其对多个前沿模型产生显著影响，并可能存在于未明原因的决策偏移中。 2026-3-30 14:42:27 Author: www.reddit.com(查看原文) 阅读量:5 收藏

https://shapingrooms.com/research

I opened OWASP issue #807 a few weeks ago proposing a new attack class. The paper is published today following coordinated disclosure to Anthropic, OpenAI, Google, xAI, CERT/CC, OWASP, and agentic framework maintainers.

Here is what I found.

Ordinary language buried in prior context shifts how a model reasons about a consequential decision before any instruction arrives. No adversarial signature. No override command. The model executes its instructions faithfully, just from a different starting angle than the operator intended.

I know that sounds like normal context sensitivity. It isn't, or at least the effect size is much larger than I expected. Matched control text of identical length and semantic similarity produced significantly smaller directional shifts. This specific class of language appears to be modeled differently. I documented binary decision reversals with paired controls across four frontier models.

The distinction from prompt injection: there is no payload. Current defenses scan for facts disguised as commands. This is frames disguised as facts. Nothing for current filters to catch.

In agentic pipelines it gets worse. Posture installs in Agent A, survives summarization, and by Agent C reads as independent expert judgment. No phrase to point to in the logs. The decision was shaped before it was made.

If you have seen unexplained directional drift in a pipeline and couldn't find the source, this may be what you were looking at. The lens might give you something to work with.

I don't have all the answers. The methodology is black-box observational, no model internals access, small N on the propagation findings. Limitations are stated plainly in the paper. This needs more investigation, larger N, and ideally labs with internals access stress-testing it properly.

If you want to verify it yourself, demos are at https://shapingrooms.com/demos - run them against any frontier model. If you have a production pipeline that processes retrieved documents or passes summaries between agents, it may be worth applying this lens to your own context flow.

Happy to discuss methodology, findings, or pushback on the framing. The OWASP thread already has some useful discussion from independent researchers who have documented related patterns in production.

GitHub issue: https://github.com/OWASP/www-project-top-10-for-large-language-model-applications/issues/807

文章来源: https://www.reddit.com/r/netsec/comments/1s7sijy/an_attack_class_that_passes_every_current_llm/
如有侵权请联系:admin#unsafe.sh