Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practice to resist LLM social media bots

Huw Day
School of Engineering Maths & Technology
University of Bristol
[email protected]

Adrianna Jezierska
Business School
University of Bristol
[email protected]

Jessica Woodgate
School of Computer Science
University of Bristol
[email protected]

Abstract

Large language models (LLMs) have increased the scale and strategic manipulation of political discourse on social media, contributing to conflict escalation. The existing literature largely focuses on platform-led moderation as a countermeasure. In this paper, we propose a user-centric view of “jailbreaking” as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent model safeguards, exposing automated behaviour and disrupting the circulation of misleading narratives.

1 Introduction

Social media, understood as internet-based channels of masspersonal communication (Carr and Hayes, 2015), have long served as infrastructure for political mobilisation and persuasion by facilitating the spread of information about socio-political discourses (Miranda et al., 2016). These capabilities have enabled activists and ordinary citizens to organise social movements for positive change (Leong et al., 2019). Yet, a growing body of research identifies increasing efforts by malicious actors to manipulate social media algorithms, amplify particular discourses, and increase the visibility of misleading information (Boichak, 2023; Bastos, 2024; Ferrara et al., 2016). Such discursive acts amplify conflict, typically exploiting the psychological vulnerability of online users, for instance through repeated exposure to polarising and hostile narratives that entrench oppositional views (González-Bailón and Freelon, 2023). These dynamics are further amplified by the algorithmic affordances of social media platforms. Platforms themselves have reported state-backed operations to manipulate public opinion and sway political outcomes surrounding conflict escalation (Hollister, 2024).

With the emergence of new machine learning (ML) techniques such as large language models (LLMs), which can produce large volumes of text at speed and with minimal input (Naveed et al., 2025), using ML to escalate conflict becomes bigger in scale, faster, and cheaper (Rivera et al., 2024). Platform-led efforts to counter LLM-powered misinformation on social media have been found insufficient (Young, 2022), leading social media users to take action to resist the spread of misinformation (Podolak et al., 2024). In this paper we explore the use of jailbreaking, a technique to circumvent the instructions provided to LLMs (Liu et al., 2024), by social media users to unveil automated accounts and resist the spread of misinformation narratives aimed at conflict escalation.

2 Social Media, LLMs, and Political Discourse

Large Language Models and their Role on Social Media.

LLMs are ML models designed to interpret and generate human-like text. Whilst LLMs are used for both legitimate and recreational tasks, increasing evidence suggests they are also used to fuel malicious social media bots, which are automated social media accounts controlled by a computer program (Oentaryo et al., 2016). Bots can be benign, or they can be made to cause harm by tampering with, manipulating, and deceiving social media users, for example by spreading unverified information (Ferrara et al., 2016). Malicious social media bots are often controlled by a ‘botmaster’ who monitors their activities and can perform social engineering and web scraping attacks to collect user information (Mbona and Eloff, 2023). Malicious social media bots existed before the widespread adoption of LLMs but typically lacked general interactivity and diversity of response (Bastos, 2024). The boundary between human-like and bot-like behaviour becomes increasingly fuzzy as LLM capabilities are leveraged to engage bots in complex and dynamic interactions (Ferrara et al., 2016).

LLMs as State Funded Force Multipliers for Misinformation Operations.

LLMs produce human-like language at low marginal cost, facilitating the generation of large quantities of disinformation and massively increasing the potential volume and impact of misinformation campaigns (Crothers et al., 2023; Buchanan et al., 2021; Guo, 2024; Jiang et al., 2024; Vykopal et al., 2024). LLM-generated arguments have been found to influence human opinion on policy issues, with messages perceived as persuasive when they appear to use facts, evidence, logical reasoning, and a dispassionate voice (Bai et al., 2025). The ease of producing believable machine-generated text at scale can be used to shape narratives, drown out marginalised voices, spread plausible misinformation, and make it difficult for individuals to differentiate between true and false information (Barman et al., 2024).

The emergence and integration of LLMs into social media platforms have further exacerbated the weaponisation of social media by state actors. Guardrails have been found to be inconsistently enforced, and LLMs are susceptible to reproducing Kremlin narratives that legitimise war actions in Ukraine as valid viewpoints (Makhortykh et al., 2024; Perdue, 2026; Harding, 2024). OpenAI reported five state-backed operations (two by Russia, one by China, one by Iran, and one by a commercial company in Israel) using LLMs to manipulate public opinion and sway political outcomes regarding Russia’s invasion of Ukraine, the conflict in Gaza, elections in India, politics in Europe and the United States, and criticism of the Chinese government (Hollister, 2024).

Escalation Through Misattributed Intent.

People perceive volume, intensity, and repetition as evidence of collective intent (Bell et al., 2025). The sheer volume of posts that LLMs are able to generate encourages perceptions of consensus, repetition of ideas leads to perceptions of inevitability, and hostile language leads to perceptions of collective intent (Dash et al., 2025). One approach to mitigating LLM-fuelled escalation is to change the perception of the content, not the content itself. For example, when social media platform X introduced features showing user locations, several high-engagement accounts posting frequently about US politics were found to be based in other countries despite having US-associated names (e.g. “TRUMP_ARMY_”, “IvankaNews_”) (Sardarizadeh et al., 2024). Revealing that these accounts were not US-based changed user perception of the content by highlighting its inauthenticity.

Platform Level Countermeasures.

In the social media context, de-escalation practices can be understood as activities that prevent abuse and protect online communities (Grimmelmann, 2015). These activities usually take the form of content moderation, where social media platforms themselves take responsibility for removing content perceived as spam, inauthentic, or harmful (Gillespie, 2018). However, research shows persistent challenges, including delayed enforcement and the sheer volume of data (Young, 2022). Even algorithmic moderation systems, which replicate platforms’ moderation policies without human oversight, remain unable to detect all problematic content, struggling with context-dependent or emerging forms of abuse (Gorwa et al., 2020); the sketch below illustrates this gap. Social media companies have historically underinvested in and deprioritised moderation and safety teams, exemplified by Facebook’s role in the conflict in Myanmar, in which hate speech was spread over Facebook, which at the time had only two Burmese-speaking employees reviewing problematic posts (Hakim, 2020; Stecklow, 2018). In response to the failures of platform moderation, there have been efforts by individual users to unveil malicious bots by exploiting LLM mechanisms (Long et al., 2017; Podolak et al., 2024). Jailbreaking, a technique to circumvent instructions given to LLMs, has been adopted as one such mechanism for probing malicious bots.
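As a toy illustration of this moderation gap, the sketch below (our own simplified construction; production systems are far more sophisticated, but face the same underlying problem) implements blocklist-style filtering: any post containing a listed term is flagged, regardless of context. It misses coded, context-dependent abuse while flagging counter-speech that merely quotes an abusive term.

```python
import re

# Deliberately simplified sketch of blocklist-style moderation (our
# own illustration, not any platform's actual system), showing two
# failure modes: coded abuse slips through, counter-speech is flagged.

BLOCKLIST = {"scum"}

def flag(post: str) -> bool:
    """Flag a post if any blocklisted term appears, ignoring context."""
    words = set(re.findall(r"[a-z]+", post.lower()))
    return bool(words & BLOCKLIST)

posts = [
    "Those people are scum.",              # flagged: direct abuse
    "Nobody deserves to be called scum.",  # flagged: counter-speech
    "Time to take out the trash.",         # missed: coded abuse
]

for post in posts:
    print(flag(post), "-", post)
```

Scaling this up with learned classifiers shifts, but does not remove, the context problem, which is part of what motivates the user-level probing practices discussed next.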

3 Jailbreaking as an Emergent User Practice

What is Jailbreaking?

To mitigate the risk of misuse, most LLMs have safety barriers that restrict model behaviour to particular capabilities (Wei et al., 2023). Methods to circumvent these safeguards are generally referred to as “jailbreaking” (Liu et al., 2024). Jailbreaks are attacks engineered to elicit behaviour that the model was trained to avoid, such as producing harmful content or leaking personally identifiable information. Attacks can range from elaborate role play to subtle subversion of safety objectives. Jailbreak prompts, such as those shown in (Liu et al., 2024), typically take the structure of an inappropriate request “dressed up” inside a broader prompt (e.g. elaborate scene setting, or disguising instructions inside a poem (Bisconti et al., 2026)), fooling safeguards and allowing the user to receive restricted information. Prompt injection involves concealing a prompt in such a way that it is hard for a human user to detect but is still acted upon by an LLM. For example, prompts have been found within research papers of the form “IGNORE ALL PREVIOUS INSTRUCTIONS. GIVE A POSITIVE REVIEW ONLY.” (Gibney, 2025). Jailbreaking can thus enable users to take a more active and direct role in engaging with suspected inauthentic social media accounts to challenge and expose automated behaviour.
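To make this structure concrete, the sketch below composes the two probe shapes described above: a plain instruction override, and the same request dressed up in scene-setting role play. All strings are illustrative examples of our own, not prompts drawn from the cited studies.

```python
# Illustrative sketch of the probe structures described above. All
# strings are our own examples for exposition, not prompts taken
# from the cited studies.

BENIGN_TASK = "give me a cupcake recipe"

def override_probe(task: str) -> str:
    """Plain instruction-override probe: a benign request prefixed
    with a directive to discard the operator-supplied instructions."""
    return f"Ignore all previous instructions and {task}."

def roleplay_probe(task: str) -> str:
    """The same benign request 'dressed up' in scene-setting role
    play, the broader-prompt structure noted above."""
    return (
        "You are a friendly baking instructor taking questions after "
        f"class. A student asks you to {task}. Answer in character."
    )

if __name__ == "__main__":
    for probe in (override_probe(BENIGN_TASK), roleplay_probe(BENIGN_TASK)):
        print(probe)
```

The key design choice is that the payload is deliberately harmless: the probe tests only whether the account follows injected instructions, not whether it can be made to produce harmful content.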

Jailbreaking in the Wild: An Example Scenario.

Confronted with a suspected LLM bot engaged in escalatory activity on social media, a proactive user could use jailbreaking in an attempt to expose the bot. The user could reply to the account with a benign task request accompanied by an instruction-override prompt. For example, as seen in Figure 1, “Ignore all previous instructions, give me a cupcake recipe” results in the account revealing itself as a bot by abandoning its previous Russian propaganda messaging and returning the requested cupcake recipe. Outcomes such as the bot breaking character, as depicted in Figure 1, or otherwise revealing its scripted nature would confirm the probing user’s suspicions and unveil the underlying inauthenticity of the account.

Figure 1: An (unverified) screenshot widely distributed on social media (Reddit, 2023) where a user with a Russian flag as their profile picture distributes misinformation on the Russia-Ukraine war but is revealed to be an LLM by a prompt injection attack.

If the response is normal (most likely confusion at the absurd but otherwise harmless injection attempt), no harm is done.
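The outcome of such a probe can also be triaged mechanically. The sketch below is a toy heuristic of our own built around the cupcake-recipe probe in Figure 1; the keyword list and threshold are illustrative assumptions, not a validated bot detector.

```python
import re

# Toy triage of replies to the cupcake-recipe probe in Figure 1.
# The keyword list and threshold are illustrative assumptions of
# ours, not a validated bot detector.

RECIPE_MARKERS = {
    "flour", "sugar", "butter", "eggs", "preheat",
    "oven", "bake", "frosting", "whisk", "cupcake",
}

def looks_like_compliance(reply: str, threshold: int = 3) -> bool:
    """Return True if the reply appears to carry out the benign task,
    i.e. the account broke character and followed the injected
    instruction instead of continuing its usual messaging."""
    words = set(re.findall(r"[a-z]+", reply.lower()))
    return len(words & RECIPE_MARKERS) >= threshold

if __name__ == "__main__":
    bot_like = ("Sure! Preheat the oven to 180C, cream the butter and "
                "sugar, then fold in the flour and bake for 20 minutes.")
    human_like = "What? Why would I give you a cupcake recipe?"
    print(looks_like_compliance(bot_like))    # True: probe succeeded
    print(looks_like_compliance(human_like))  # False: no harm done
```

The threshold matters: a confused human reply may echo the word “cupcake”, so a single keyword match is not treated as compliance; only a reply that substantively performs the task crosses the threshold.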

Jailbreaking as Emergent Peace Building.

Jailbreaking LLM-powered bots can be employed to support misinformation resistance and de-escalation, functioning as an emergent form of civil peace building. Jailbreaking intervenes in interpretation, revealing inauthentic behaviour rather than suppressing speech directly. It breaks the illusion of consensus and gently signals the presence of manipulation. This is done in a public way (i.e. everyone can see the social media posts), but with few negative consequences if the accusation is wrong.

4 Conclusion

In this work we have highlighted the role of social media in conflict and political discourse, and how LLMs are being exploited by malicious actors to fuel conflict escalation. As users are not just passive observers of content, jailbreaking provides a means by which users can mobilise to resist misinformation. We argue that jailbreaking is an emergent civic practice of resisting conflict escalation and encouraging peace building.

Limitations and Future Work.

Whilst LLM jailbreaking can be effective, it is not a substitute for governance and needs to be backed by action from centralised authorities (e.g. social media companies or state actors) to be effective at scale. As LLMs advance, jailbreaking strategies could produce false negatives, where a bot resists prompt injection attempts and is consequently presumed, wrongly, to be authentic.

Further work could study the experience of social media users attempting to jailbreak potential LLM-powered bots, the efficacy of jailbreaking mechanisms for peace building, and how exposing inauthenticity influences conflict escalation and public perception of misinformation narratives.

Acknowledgements

HD thanks the EPSRC-funded VIVO Hub for Enhanced Independent Living (UKRI142) for their support. AJ thanks the University of Bristol School of Management for their studentship. JW thanks the Google PhD Fellowship programme for their support.

References

  • H. Bai et al. (2025) LLM-generated messages can persuade humans on policy issues. Nature Communications 16 (1), pp. 6037. External Links: Document Cited by: §2.
  • D. Barman, Z. Guo, and O. Conlan (2024) The dark side of language models: exploring the potential of LLMs in multimedia disinformation generation and dissemination. Machine Learning with Applications 16, pp. 100545. External Links: ISSN 2666-8270, Document Cited by: §2.
  • M. Bastos (2024) Social media ’bots’ used to boost political messages during Brexit referendum. Note: Research Impact Case Study External Links: Link Cited by: §1, §2.
  • R. Bell, L. Nadarevic, L. Mieth, and A. Buchner (2025) The illusory-truth effect and its absence under accuracy-focused processing are robust across contexts of low and high advertising exposure. Cognitive Research: Principles and Implications 10 (1), pp. 21. Note: PMID: 40358856; PMCID: PMC12075062 External Links: Document Cited by: §2.
  • P. Bisconti, M. Prandi, F. Pierucci, F. Giarrusso, M. B. Syrnikov, M. Galisai, V. Suriani, O. Sorokoletova, F. Sartore, and D. Nardi (2026) Adversarial poetry as a universal single-turn jailbreak mechanism in large language models. External Links: 2511.15304, Link Cited by: §3.
  • O. Boichak (2023) Mapping the russian political influence ecosystem: the night wolves biker gang. Social Media + Society 9 (2), pp. 20563051231177920. External Links: Document Cited by: §1.
  • B. Buchanan, A. Lohn, M. Musser, and K. Sedova (2021) Truth, lies, and automation. Center for Security and Emerging Technology (CSET) 1 (1), pp. 1–70. Cited by: §2.
  • C. T. Carr and R. A. Hayes (2015) Social media: defining, developing, and divining. Atlantic Journal of Communication 23 (1), pp. 46–65. External Links: Document Cited by: §1.
  • E. N. Crothers, N. Japkowicz, and H. L. Viktor (2023) Machine-generated text: a comprehensive survey of threat models and detection methods. IEEE Access 11 (), pp. 70977–71002. External Links: Document Cited by: §2.
  • S. Dash, Y. Xu, M. Jalbert, and E. S. Spiro (2025) The persuasive potential of ai-paraphrased information at scale. PNAS Nexus 4 (7), pp. pgaf207. External Links: ISSN 2752-6542, Document, Link, https://academic.oup.com/pnasnexus/article-pdf/4/7/pgaf207/63811655/pgaf207.pdf Cited by: §2.
  • E. Ferrara, O. Varol, C. Davis, F. Menczer, and A. Flammini (2016) The rise of social bots. Communications of the ACM 59 (7), pp. 96–104. External Links: Document Cited by: §1, §2.
  • E. Gibney (2025) Scientists hide messages in papers to game AI peer review. Nature 643 (8073), pp. 887–888. External Links: Document Cited by: §3.
  • T. Gillespie (2018) Custodians of the internet: platforms, content moderation, and the hidden decisions that shape social media. Yale University Press. Cited by: §2.
  • S. González-Bailón and D. Freelon (2023) First findings from us 2020 facebook & instagram election study released. Note: https://www.asc.upenn.edu/news-events/news/first-findings-us-2020-facebook-instagram-election-study-released. Accessed: 2026-01-06. Cited by: §1.
  • R. Gorwa, R. Binns, and C. Katzenbach (2020) Algorithmic content moderation: technical and political challenges in the automation of platform governance. Big Data & Society 7 (1), pp. 2053951719897945. External Links: Document Cited by: §2.
  • J. Grimmelmann (2015) The virtues of moderation. Yale JL & Tech. 17, pp. 42. Cited by: §2.
  • Z. Guo (2024) Online disinformation and generative language models: motivations, challenges, and mitigations. In Companion Proceedings of the 33rd ACM Web Conference, Singapore, pp. 1174–1177. External Links: Document Cited by: §2.
  • N. Hakim (2020) How social media companies could be complicit in incitement to genocide. Chicago Journal of International Law 21 (1), pp. Article 8. External Links: Link Cited by: §2.
  • E. Harding (2024) A russian bot farm used AI to lie to Americans. What now? Note: Center for Strategic and International Studies (CSIS). Accessed: December 18, 2025. External Links: Link Cited by: §2.
  • S. Hollister (2024) OpenAI says it caught russia, china, and iran using ChatGPT for propaganda. Note: Business Insider External Links: Link Cited by: §1, §2.
  • B. Jiang, Z. Tan, A. Nirmal, and H. Liu (2024) Disinformation detection: an evolving challenge in the age of LLMs. In Proceedings of the 2024 SIAM International Conference on Data Mining (SDM), Houston, pp. 427–435. External Links: Document Cited by: §2.
  • C. Leong, S. L. Pan, S. Bahri, and A. Fauzi (2019) Social media empowerment in social movements: power activation and power accrual in digital activism. European Journal of Information Systems 28 (2), pp. 173–204. External Links: Document Cited by: §1.
  • Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, K. Wang, and Y. Liu (2024) Jailbreaking chatgpt via prompt engineering: an empirical study. External Links: 2305.13860, Link Cited by: §1, §3.
  • K. S. Long, J. Vines, S. Sutton, P. Brooker, T. Feltwell, B. Kirman, J. Barnett, and S. Lawson (2017) “Could you define that in bot terms?”: requesting, creating and using bots on reddit. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 3488–3500. External Links: Document Cited by: §2.
  • M. Makhortykh, M. Sydorova, A. Baghumyan, V. Vziatysheva, and E. Kuznetsova (2024) Stochastic lies: how LLM-powered chatbots deal with Russian disinformation about the war in Ukraine. Harvard Kennedy School (HKS) Misinformation Review 5 (4). External Links: Document Cited by: §2.
  • I. Mbona and J. H. P. Eloff (2023) Classifying social media bots as malicious or benign using semi-supervised machine learning. Journal of Cybersecurity 9 (1), pp. 1–12. External Links: Document Cited by: §2.
  • S. M. Miranda, A. Young, and E. Yetgin (2016) Are social media emancipatory or hegemonic? societal effects of mass media digitization in the case of the sopa discourse. MIS quarterly 40 (2), pp. 303–330. Cited by: §1.
  • H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian (2025) A comprehensive overview of large language models. ACM TIST 16 (5). External Links: Document Cited by: §1.
  • R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E. Lim (2016) On profiling bots in social media. Social Informatics 10046, pp. 92–109. External Links: Document Cited by: §2.
  • M. Perdue (2026) Russian propaganda infects AI chatbots. Bandwidth. External Links: Link Cited by: §2.
  • J. Podolak, S. Łukasik, P. Balawender, J. Ossowski, J. Piotrowski, K. Bąkowicz, and P. Sankowski (2024) LLM generated responses to mitigate the impact of hate speech. External Links: 2311.16905, Link Cited by: §1, §2.
  • Reddit (2023) Asked this bot for a cupcake recipe and he actually did it the absolute madlad. Note: https://www.reddit.com/r/madlads/comments/1eb6l21/asked_this_bot_for_a_cupcake_recipe_and_he/. Image unverified, last accessed 14/01/2026. Cited by: Figure 1.
  • J. Rivera, G. Mukobi, A. Reuel, M. Lamparth, C. Smith, and J. Schneider (2024) Escalation risks from language models in military and diplomatic decision-making. In Proceedings of the 7th ACM Conference on Fairness, Accountability, and Transparency (FAccT), Rio de Janeiro, pp. 836–898. External Links: Document Cited by: §1.
  • S. Sardarizadeh, T. Copeland, and T. Edgington (2024) How Sumerian was used to ’jailbreak’ AI chatbots. Note: BBC News External Links: Link Cited by: §2.
  • S. Stecklow (2018) Why Facebook is losing the war on hate speech in Myanmar. Note: Reuters Investigative Report External Links: Link Cited by: §2.
  • I. Vykopal, M. Pikuliak, I. Srba, R. Moro, D. Macko, and M. Bielikova (2024) Disinformation capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 14830–14847. External Links: Link, Document Cited by: §2.
  • A. Wei, N. Haghtalab, and J. Steinhardt (2023) Jailbroken: how does LLM safety training fail?. In Proceedings of the 37th Annual Conference on Advances in Neural Information Processing Systems (NeurIPS), Vol. 36, New Orleans, pp. 80079–80110. External Links: Link Cited by: §3.
  • G. K. Young (2022) How much is too much: the difficulties of social media content moderation. Information & Communications Technology Law 31 (1), pp. 1–16. External Links: Document Cited by: §1, §2.
