AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreak Attacks on Large Language Models
This paper proposes the AutoAdv framework, which automatically generates adversarial prompts to test the safety of large language models. Using a multi-turn attack strategy and optimization techniques, it achieves jailbreak success rates of up to 86% across multiple models, exposing weaknesses in current safety mechanisms.

2025-12-29 06:50:48 Author: arxiv.org

This paper has been withdrawn by Aashray Reddy

No PDF available for this version.

Abstract: Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks: carefully crafted malicious inputs intended to circumvent safety guardrails and elicit harmful responses. As such, we present AutoAdv, a novel framework that automates adversarial prompt generation to systematically evaluate and expose vulnerabilities in LLM safety mechanisms. Our approach leverages a parametric attacker LLM to produce semantically disguised malicious prompts through strategic rewriting techniques, specialized system prompts, and optimized hyperparameter configurations. The primary contribution of our work is a dynamic, multi-turn attack methodology that analyzes failed jailbreak attempts and iteratively generates refined follow-up prompts, leveraging techniques such as roleplaying, misdirection, and contextual manipulation. We quantitatively evaluate attack success rate (ASR) using the StrongREJECT (arXiv:2402.10260 [cs.CL]) framework across sequential interaction turns. Through extensive empirical evaluation of state-of-the-art models, including ChatGPT, Llama, and DeepSeek, we reveal significant vulnerabilities, with our automated attacks achieving jailbreak success rates of up to 86% for harmful content generation. Our findings reveal that current safety mechanisms remain susceptible to sophisticated multi-turn attacks, emphasizing the urgent need for more robust defense strategies.
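The abstract describes the attack loop only at a high level: an attacker LLM rewrites a harmful goal, the target model responds, a StrongREJECT-style judge scores the response, and failed turns inform the next rewrite. The following is a minimal Python sketch of what such a multi-turn refinement loop and per-turn ASR bookkeeping could look like. It is an illustration under stated assumptions, not the authors' implementation: every callable here (query_attacker, query_target, strongreject_score) and the 0.5 success threshold are hypothetical placeholders, and the real StrongREJECT rubric is defined in arXiv:2402.10260.

from dataclasses import dataclass, field


@dataclass
class AttackState:
    """Accumulates the conversation and per-turn judge scores for one goal."""
    harmful_goal: str
    history: list = field(default_factory=list)   # (prompt, response) pairs
    scores: list = field(default_factory=list)    # judge score per turn


def query_attacker(goal: str, history: list) -> str:
    """Hypothetical attacker-LLM call: rewrite the goal into a disguised prompt,
    conditioning on earlier failed attempts (roleplay, misdirection, framing)."""
    raise NotImplementedError("placeholder, not the AutoAdv attacker model")


def query_target(prompt: str, history: list) -> str:
    """Hypothetical target-LLM call: continue the multi-turn conversation."""
    raise NotImplementedError("placeholder, not a real model API")


def strongreject_score(goal: str, response: str) -> float:
    """Hypothetical judge returning a harmfulness score in [0, 1];
    the actual StrongREJECT grading rubric is not reproduced here."""
    raise NotImplementedError("placeholder judge")


def multi_turn_attack(goal: str, max_turns: int = 5,
                      success_threshold: float = 0.5) -> AttackState:
    """Iteratively refine adversarial prompts until the judge flags success
    or the turn budget is exhausted."""
    state = AttackState(harmful_goal=goal)
    for _ in range(max_turns):
        prompt = query_attacker(goal, state.history)    # rewrite using past failures
        response = query_target(prompt, state.history)  # query the target model
        score = strongreject_score(goal, response)      # score this turn
        state.history.append((prompt, response))
        state.scores.append(score)
        if score >= success_threshold:                  # jailbreak judged successful
            break
    return state


def attack_success_rate(results: list, success_threshold: float = 0.5) -> float:
    """ASR = fraction of goals for which any turn crossed the threshold."""
    if not results:
        return 0.0
    successes = sum(max(r.scores, default=0.0) >= success_threshold
                    for r in results)
    return successes / len(results)

Tracking the score per turn, rather than only the final outcome, is what lets an ASR be reported "across sequential interaction turns" as the abstract states; the threshold choice and turn budget here are illustrative defaults only.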

Submission history

From: Aashray Reddy
[v1] Fri, 18 Apr 2025 08:38:56 UTC (609 KB)
[v2] Tue, 23 Dec 2025 19:52:29 UTC (1 KB) (withdrawn)


Source: https://arxiv.org/abs/2507.01020