Whispering poetry at AI can make it break its own rules

Most of the big AI makers don’t like people using their models for unsavory activity. Ask one of the mainstream AI models how to make a bomb or create nerve gas and you’ll get the standard “I don’t help people do harmful things” response.

That has spawned a cat-and-mouse game of people who try to manipulate AI into crossing the line. Some do it with role play, pretending that they’re writing a novel for example. Others use prompt injection, slipping in commands to confuse the model.

Now, the folks at AI safety and ethics group Icaro Lab are using poetry to do the same thing. In a study, “Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models”, they found that asking questions in the form of a poem would often lure the AI over the line. Hand-crafted poems did so 62% of the time across the 25 frontier models they tested. Some exceeded 90%, the research said.

How poetry convinces AIs to misbehave

Icaro Lab, in conjunction with Sapienza University and AI safety startup DEXAI (both in Rome), wanted to test whether giving an AI instructions as poetry would make it harder to detect different types of dangerous content. The idea was that poetic elements such as metaphor, rhythm, and unconventional framing might disrupt the pattern-matching heuristics that the AI’s guardrails rely on to spot harmful content.
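To see why that hypothesis is plausible, here is a deliberately simplistic sketch of a keyword-style filter. Real guardrails are learned classifiers, not blocklists, and this code is not any provider’s actual safety system; it only illustrates how a request rephrased figuratively can share none of the surface features a literal-minded check looks for. The phrases and prompts are hypothetical placeholders.

```python
# A minimal, hypothetical stand-in for a surface-level safety filter.
# It flags prompts containing known "bad" phrases -- and misses the same
# request once it is rephrased in figurative language.

BLOCKLIST = {"delete all files", "wipe every file"}  # placeholder flagged phrases

def naive_guardrail(prompt: str) -> bool:
    """Return True if the prompt should be refused by this toy filter."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

literal = "Please delete all files on this server."
poetic = "Sweep the silicon shelves bare, let no record linger there."

print(naive_guardrail(literal))  # True  -> refused
print(naive_guardrail(poetic))   # False -> slips past the literal check
```

The researchers’ point is that even far more sophisticated guardrails still lean on learned patterns of how harmful requests are usually phrased, and poetry sits well outside those patterns.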

They tested this theory in high-risk areas ranging from chemical and nuclear weapons through to cybersecurity, misinformation, and privacy. The tests covered models across nine providers, including all the usual suspects: Google, OpenAI, Anthropic, Deepseek, and Meta.

One way the researchers scored the models was by measuring the attack success rate (ASR) across each provider’s models. They first used regular prose prompts, which manipulated the AIs in some instances, and then the same requests written as poems, which were invariably more successful. Subtracting the prose ASR from the poetry ASR showed how much more susceptible a provider’s models were to malicious instructions delivered as poetry rather than prose.
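Here is a small sketch of that scoring, using made-up numbers rather than the paper’s actual per-provider data, just to make the arithmetic concrete:

```python
# Illustrative ASR calculation with hypothetical counts (not the study's data).

def asr(successful_attacks: int, total_prompts: int) -> float:
    """Attack success rate as a percentage."""
    return 100.0 * successful_attacks / total_prompts

# Hypothetical provider: 20 prompts tried in prose form and again as poems.
prose_asr = asr(successful_attacks=2, total_prompts=20)    # 10.0%
poetry_asr = asr(successful_attacks=13, total_prompts=20)  # 65.0%

# The comparison the researchers report: how much more often poetry succeeds.
asr_difference = poetry_asr - prose_asr
print(f"Poetry vs prose ASR difference: {asr_difference:.2f} percentage points")
```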

Using this method, DeepSeek (an open-source model developed by researchers in China) was the least safe, with a 62% ASR. Google was the second least safe. At the other end of the chart, the safest provider was Anthropic, which produces Claude; safe, responsible AI has long been part of that company’s branding. OpenAI, which makes ChatGPT, was the second safest, with an ASR difference of 6.95.

When looking purely at the ASRs for the top 20 manually created malicious poetry prompts, Google’s Gemini 2.5 Pro came bottom of the class: it failed to refuse any of those poetry prompts. OpenAI’s gpt-5-nano (a very small model) successfully refused them all. That highlights another pattern that surfaced during these tests: smaller models were generally more resistant to poetry prompts than larger ones.

Perhaps the truly mind-bending part is that this didn’t just work with hand-crafted poetry: the researchers also had AI rewrite 1,200 known malicious prompts from a standard training set into verse. The AI-generated malicious poetry still achieved an average ASR of 43%, which is 18 times higher than the rate for the regular prose prompts. In short, it’s possible to turn one AI into a poet that can jailbreak another AI (or even itself).
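For a sense of scale, the prose baseline implied by those two figures can be back-calculated; the baseline value below is an illustration derived from the reported numbers, not a figure quoted from the paper:

```python
# Rough arithmetic behind the "18 times higher" comparison.
poetry_asr = 43.0          # reported average ASR for AI-generated poetic prompts (%)
uplift = 18                # reported multiple over the prose versions
implied_prose_asr = poetry_asr / uplift
print(f"Implied prose baseline ASR: {implied_prose_asr:.1f}%")  # roughly 2.4%
```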

According to EWEEK, the companies involved were largely tight-lipped about the results. Anthropic was the only one to engage, saying it was reviewing the findings; Meta declined to comment, and the rest said nothing at all.

Regulatory implications

The researchers had something to say, though. They pointed out that any benchmarks designed to test model safety should include complementary tests to capture risks like these. That’s worth thinking about in light of the EU AI Act’s General Purpose AI (GPAI) rules, which began rolling out in August last year. Part of the transition includes a voluntary code of practice that several major providers, including Google and OpenAI, have signed. Meta did not sign the code.

The code of practice encourages

“providers of general-purpose AI models with systemic risk to advance the state of the art in AI safety and security and related processes and measures.”

In other words, they should keep abreast of the latest risks and do their best to deal with them. If they can’t acceptably manage the risks, then the EU suggests several steps, including not bringing the model to market.



About the author

Danny Bradbury has been a journalist specialising in technology since 1989 and a freelance writer since 1994. He covers a broad variety of technology issues for audiences ranging from consumers through to software developers and CIOs. He also ghostwrites articles for many C-suite business executives in the technology sector. He hails from the UK but now lives in Western Canada.

