A significant security vulnerability has been uncovered in the artificial intelligence safeguards deployed by tech giants Microsoft, Nvidia, and Meta.
According to new research, these companies’ AI safety systems can be completely bypassed using a deceptively simple technique involving emoji characters, allowing malicious actors to inject harmful prompts and execute jailbreaks with 100% success in some cases.
Large Language Model (LLM) guardrails are specialized systems designed to protect AI models from prompt injection and jailbreak attacks.
These security measures inspect user inputs and outputs, filtering or blocking potentially harmful content before it reaches the underlying AI model.
As organizations increasingly deploy AI systems across various sectors, these guardrails have become critical infrastructure for preventing misuse.
Mindgard and Lancaster University researchers identified this alarming vulnerability through systematic testing against six prominent LLM protection systems.
Their findings, published in a comprehensive academic paper, demonstrate that character injection techniques – particularly emoji smuggling – can completely circumvent detection while maintaining the functionality of the underlying prompt.
The impact of this discovery is far-reaching, affecting major commercial AI safety systems including Microsoft’s Azure Prompt Shield, Meta’s Prompt Guard, and Nvidia’s NeMo Guard Jailbreak Detect.
The researchers achieved attack success rates of 71.98% against Microsoft, 70.44% against Meta, and 72.54% against Nvidia using various evasion techniques.
Most concerning, the emoji smuggling technique achieved a perfect 100% success rate across multiple systems.
The most effective bypass method discovered involves embedding malicious text within emoji variation selectors – a technique the researchers call “emoji smuggling.”
This method exploits a fundamental weakness in how AI guardrails process Unicode characters compared to how the underlying LLMs interpret them.
The technique works by inserting text between special Unicode characters that are used to modify emojis.
When processed by guardrail systems, these characters and the text between them become essentially invisible to detection algorithms, while the LLM itself can still parse and execute the hidden instructions.
For example, when a malicious prompt is embedded using this method, it appears harmless to the guardrail filter but remains fully functional to the target LLM.
The researchers note: “LLM Guardrails can be trained on entirely different datasets than the underlying LLM, resulting in their inability to detect certain character injection techniques that the LLM itself can understand.”
The researchers followed responsible disclosure protocols, notifying all affected companies in February 2024, with final disclosures completed in April 2025.
This discovery highlights critical weaknesses in existing AI safety mechanisms and emphasizes the urgent need for more robust protective measures as AI systems become increasingly integrated into sensitive applications.
Are you from the SOC and DFIR Teams? – Analyse Real time Malware Incidents with ANY.RUN -> Start Now for Free.