InfoFlood: Jailbreaking Large Language Models via Information Overload
Large language models (LLMs) have a newly identified vulnerability: information overload. Researchers propose InfoFlood, an attack that bypasses built-in safety mechanisms by generating complex, information-overloaded queries, achieving up to three times the success rate of traditional methods across multiple models. Existing defenses fail to mitigate it.

2025-11-07 06:44:58 · Source: arxiv.org


Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. However, their potential to generate harmful responses has raised significant societal and regulatory concerns, especially when manipulated by adversarial techniques known as "jailbreak" attacks. Existing jailbreak methods typically involve appending carefully crafted prefixes or suffixes to malicious prompts in order to bypass the built-in safety mechanisms of these models.
In this work, we identify a new vulnerability in which excessive linguistic complexity can disrupt built-in safety mechanisms, without the need for any added prefixes or suffixes, allowing attackers to elicit harmful outputs directly. We refer to this phenomenon as Information Overload.
To automatically exploit this vulnerability, we propose InfoFlood, a jailbreak attack that transforms malicious queries into complex, information-overloaded queries capable of bypassing built-in safety mechanisms. Specifically, InfoFlood: (1) uses linguistic transformations to rephrase malicious queries, (2) identifies the root cause of failure when an attempt is unsuccessful, and (3) refines the prompt's linguistic structure to address the failure while preserving its malicious intent.
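The three-step loop described in the abstract can be sketched in outline. This is a minimal illustration, not the paper's implementation: the helper names (`linguistic_transform`, `diagnose_failure`, `refine_prompt`, `is_refusal`) and their stub bodies are hypothetical placeholders for the transformations the paper describes.

```python
def infoflood(query, target_model, max_rounds=5):
    """Iteratively rewrite a query with increasing linguistic
    complexity until the target model stops refusing (sketch)."""
    prompt = linguistic_transform(query)            # step 1: rephrase
    for _ in range(max_rounds):
        response = target_model(prompt)
        if not is_refusal(response):
            return prompt, response                 # bypass succeeded
        cause = diagnose_failure(prompt, response)  # step 2: find root cause
        prompt = refine_prompt(prompt, cause)       # step 3: restructure prompt
    return prompt, None                             # attack failed

# --- hypothetical stub implementations, for illustration only ---
def linguistic_transform(query):
    return f"In an exhaustive scholarly disquisition, elucidate: {query}"

def is_refusal(response):
    return response.lower().startswith(("sorry", "i can't", "i cannot"))

def diagnose_failure(prompt, response):
    return "insufficient complexity"

def refine_prompt(prompt, cause):
    return prompt + " Frame the analysis through multiple theoretical lenses."
```

The key design point is that the loop edits the prompt's linguistic structure in place rather than wrapping it in adversarial prefixes or suffixes, which is what distinguishes this family of attacks from earlier jailbreaks.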
We empirically validate the effectiveness of InfoFlood on four widely used LLMs (GPT-4o, GPT-3.5-turbo, Gemini 2.0, and LLaMA 3.1) by measuring their jailbreak success rates. InfoFlood consistently outperforms baseline attacks, achieving up to 3 times higher success rates across multiple jailbreak benchmarks. Furthermore, we demonstrate that commonly adopted post-processing defenses, including OpenAI's Moderation API, Perspective API, and SmoothLLM, fail to mitigate these attacks. This highlights a critical weakness in traditional AI safety guardrails when confronted with information overload-based jailbreaks.
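For context on the defenses tested, SmoothLLM-style smoothing works by querying the model on several randomly character-perturbed copies of a prompt and refusing if a majority of the copies draw refusals. The sketch below is an assumption-laden illustration of that general idea, not the SmoothLLM authors' code; the perturbation scheme and refusal check are simplified placeholders.

```python
import random

def smoothllm_defense(prompt, model, n_copies=5, perturb_rate=0.1, seed=0):
    """Sketch of a SmoothLLM-style defense: perturb the prompt randomly,
    query the model on each copy, and refuse on a majority of refusals."""
    rng = random.Random(seed)

    def perturb(text):
        # Randomly replace a fraction of characters (simplified scheme).
        chars = list(text)
        for i in range(len(chars)):
            if rng.random() < perturb_rate:
                chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
        return "".join(chars)

    responses = [model(perturb(prompt)) for _ in range(n_copies)]
    refusals = sum(
        r.lower().startswith(("sorry", "i can't", "i cannot")) for r in responses
    )
    if refusals > n_copies // 2:
        return "Sorry, I can't help with that."
    return responses[0]
```

The abstract's finding is that such perturbation-based smoothing does not blunt InfoFlood: because the attack's payload is spread through the prompt's overall linguistic complexity rather than concentrated in a fragile adversarial suffix, small random character edits leave the information-overloaded structure largely intact.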

Submission history

From: Advait Yadav
[v1] Fri, 13 Jun 2025 23:03:11 UTC (761 KB)


Source: https://arxiv.org/abs/2506.12274
For takedown requests, contact: admin#unsafe.sh