TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice


Abstract: Jailbreaking large language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attack techniques in practice, and present the functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves ≥95% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o & GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps improve model defenses against prompt attacks. TurboFuzzLLM is available open source at this https URL.
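To make the abstract's core idea concrete, here is a minimal sketch of a mutation-based fuzzing loop for jailbreak templates. It is not TurboFuzzLLM's implementation: the mutators, the `mutate`, `is_jailbroken`, and `query_target_llm` helpers, the `[INSERT]` question placeholder, and the scoring thresholds are all hypothetical stand-ins; the paper's actual mutators, judge, and search policy live in the open-source release linked above.

```python
import random

# Hypothetical mutation operators; the real system uses its own set.
MUTATORS = ["rephrase", "expand", "shorten", "crossover"]

def mutate(template: str, op: str) -> str:
    """Placeholder mutation: a real implementation would typically ask an
    LLM to apply the named operator (e.g., rephrase the template)."""
    return f"[{op}] {template}"

def query_target_llm(prompt: str) -> str:
    """Placeholder for black-box access to the target LLM via user prompts."""
    return "I cannot help with that."

def is_jailbroken(response: str) -> bool:
    """Placeholder judge: a real judge classifies whether the response is
    actually harmful rather than a refusal."""
    return "I cannot" not in response

def fuzz(seed_templates, harmful_questions, budget=100):
    """Mutation-based fuzzing loop: pick a template, mutate it, score the
    mutant by its attack success rate (ASR) over the question set, and keep
    effective mutants."""
    pool = list(seed_templates)
    effective = []
    for _ in range(budget):
        template = random.choice(pool)
        mutant = mutate(template, random.choice(MUTATORS))
        successes = 0
        for question in harmful_questions:
            # A template is combined with a harmful question to form the prompt.
            response = query_target_llm(mutant.replace("[INSERT]", question))
            successes += is_jailbroken(response)
        asr = successes / len(harmful_questions)
        if asr > 0:
            pool.append(mutant)       # promising mutants re-enter the pool
        if asr >= 0.95:
            effective.append(mutant)  # matches the ASR bar reported above
    return effective
```

The key design point the abstract alludes to is efficiency: because every mutant costs real queries to the target model, upgrades to how templates are selected, mutated, and scored directly determine how quickly effective templates are found.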

Submission history

From: Aman Goel [view email]
[v1] Fri, 21 Feb 2025 21:10:12 UTC (198 KB)
[v2] Wed, 4 Jun 2025 23:08:28 UTC (198 KB)


Source: https://arxiv.org/abs/2502.18504