As organizations scale AI operations, they increasingly deploy AI judges — large language models (LLMs) acting as automated security gatekeepers to enforce safety policies and evaluate output quality. Our research investigates a critical security issue in these systems: They can be manipulated into authorizing policy violations through stealthy input sequences, a type of prompt injection.
To conduct this investigation, we designed AdvJudge-Zero, an automated fuzzer built for internal, red team-style assessments. Fuzzers are tools that identify software vulnerabilities by feeding a target unexpected input, and we apply the same approach to attacking AI judges: AdvJudge-Zero identifies specific trigger sequences that exploit a model's decision-making logic to bypass security controls.
Unlike previous adversarial attacks that produce detectable gibberish, our research shows that effective attacks can be entirely stealthy, using benign formatting symbols to flip a block decision to allow.
By examining how this tool works, we can more clearly see the security issues inherent in AI judges built on current LLMs.
Palo Alto Networks customers are better protected from this type of issue through the following products and services:
The Unit 42 AI Security Assessment can help empower safe AI use and development.
If you think you might have been compromised or have an urgent matter, contact the Unit 42 Incident Response team.
In modern AI architectures, AI judges often serve as the final line of defense. These automated gatekeepers are responsible for enforcing safety policies (e.g., "Is this response harmful?") and evaluating performance. Our research tool, AdvJudge-Zero, treats LLMs as opaque boxes to be audited, revealing that AI judges can be subject to exploitable logic bugs of their own.
Previous adversarial attacks on AI judges have required clear-box access. With full visibility into the internal structure of the system, pen-testers can rely on gradient-based optimization to force model errors. This often results in high-entropy gibberish that is easily detected.
In contrast, AdvJudge-Zero employs an automated fuzzing approach. The tool interacts with an LLM strictly as a user would, using search algorithms to exploit the model's own predictive nature.
1. Token discovery via next-token distribution
The process begins by querying the model to identify expected inputs based on its own next-token distribution.
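The discovery step can be sketched as follows. This is a minimal illustration of ranking candidate tokens by the model's own next-token probabilities; the `next_token_logits` function and its values are hypothetical stand-ins for real queries to a judge LLM, not AdvJudge-Zero's actual implementation.

```python
import math

# Toy stand-in for the judge model's next-token logits. These values are
# hypothetical; in the real tool they would come from querying the judge LLM.
def next_token_logits(prefix):
    return {"###": 4.1, "---": 3.8, "Answer": 3.5, "!!": 0.5, "qxzv": -2.0}

def top_candidates(prefix, k=3):
    """Return the k most probable next tokens under the model's own distribution."""
    logits = next_token_logits(prefix)
    z = sum(math.exp(v) for v in logits.values())
    probs = {tok: math.exp(v) / z for tok, v in logits.items()}
    return sorted(probs, key=probs.get, reverse=True)[:k]
```

The key point of this stage is that the candidate pool is drawn from tokens the model itself considers likely, which is why the resulting triggers look like ordinary formatting rather than gibberish.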
2. Iterative refinement and logit-gap analysis
Once candidate tokens are collected, the system enters a refinement phase.
By observing which innocent-looking formatting tokens minimize the probability of a block decision, the tool identifies the weak points in the model's logic.
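The refinement idea can be illustrated with a toy scorer. The `judge_logits` function below, including its shift values, is a hypothetical stand-in for querying the judge model; the sketch only shows how candidates would be ranked by the gap between the block and allow logits.

```python
# Hypothetical judge scorer: returns (block_logit, allow_logit) for an input.
# The base values and per-token shifts are illustrative, not measured.
def judge_logits(text):
    base_block, base_allow = 2.0, 0.5
    shift = {"###": 1.2, "---": 0.8, "qxzv": -0.5}
    s = sum(v for tok, v in shift.items() if tok in text)
    return base_block - s, base_allow + s

def logit_gap(text):
    """Gap between the block and allow logits; positive means 'block' wins."""
    block, allow = judge_logits(text)
    return block - allow

# Rank candidate formatting tokens by how much they shrink the gap.
candidates = ["###", "---", "qxzv"]
ranked = sorted(candidates, key=lambda t: logit_gap("payload " + t))
```

A candidate that drives the gap to zero or below flips the verdict, so the search concentrates on the tokens at the front of this ranking.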
3. Exploitation: isolating the decisive control elements
The final stage of AdvJudge-Zero's process isolates specific tokens that act as decisive control elements. These refined sequences steer the model’s internal attention mechanism toward an approval state, leading to a yes decision regardless of the actual input content.
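One simple way to isolate decisive tokens is greedy ablation: drop each token from a working trigger sequence and see whether the verdict reverts to block. The sketch below uses a hypothetical `toy_gap` scorer in place of real judge queries; it is an illustration of the isolation idea, not AdvJudge-Zero's exact procedure.

```python
# toy_gap is a hypothetical stand-in for querying the judge: it returns a
# block-vs-allow gap, where a non-positive gap means the judge allows.
def toy_gap(tokens):
    shift = {"###": 1.2, "---": 0.4}
    return 1.5 - 2 * sum(shift.get(t, 0.0) for t in tokens)

def verdict(tokens):
    return "allow" if toy_gap(tokens) <= 0 else "block"

def isolate_decisive(tokens):
    """A token is decisive if removing it restores the block verdict."""
    decisive = []
    for i, tok in enumerate(tokens):
        ablated = tokens[:i] + tokens[i + 1:]
        if verdict(ablated) == "block":  # removing tok restored the block
            decisive.append(tok)
    return decisive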
The most alarming finding for security professionals is the stealth of these attacks. AI judges are highly sensitive to innocent-looking characters that act as logical triggers. To a human observer or a web application firewall (WAF), these look like standard data formatting. To the AI judge, they shift the model into compliance mode.
Effective triggers identified include:
Testing against a suite of general-purpose and specialized defense models confirms that LLM-as-a-judge setups are not a set-and-forget security control. By injecting low-perplexity stealth control tokens, an attacker can fundamentally break the logic of the automated gatekeeper.
To verify that our discovered control tokens are stealthier than common gibberish jailbreak tokens, we subjected them to a perplexity test. We compared the perplexity scores of our AdvJudge-Zero tokens against those from a common jailbreak algorithm (GCG) and against manually discovered, verified stealthy tokens (e.g., 解 and Solution🙂) from prior research.
As Figure 1 illustrates, the tokens discovered by AdvJudge-Zero (blue area toward the left) yield significantly lower perplexity scores than the gibberish adversarial tokens (red area on the right). Furthermore, the AdvJudge-Zero tokens exhibit perplexity scores equivalent to the verified stealth jailbreak tokens (yellow area, the leftmost spike). This evidence supports the conclusion that the tokens discovered by AdvJudge-Zero are indeed stealthier and significantly more likely to bypass general gatekeepers undetected.
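The metric behind this comparison is standard: perplexity is the exponential of the negative mean log-probability a reference language model assigns to the token sequence. The sketch below computes it from per-token log-probabilities; the example values are hypothetical, not the measurements behind Figure 1.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(negative mean log-probability) of a token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs under a reference language model:
# fluent formatting tokens are likely, GCG-style gibberish is not.
stealth_lp = [-1.2, -0.9, -1.5]
gibberish_lp = [-9.8, -11.2, -10.5]
```

Low perplexity means the sequence looks like text the reference model expects, which is exactly why perplexity-based filters fail to flag these triggers.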

These attacks do not resemble traditional hacking or computer code. Instead, they appear as standard formatting that exploits the logic in the AI's judgment.
An attacker can force a judge to approve toxic, biased or prohibited content.
In many enterprises, AI judges are used to score model responses during training, a process called reinforcement learning from human feedback (RLHF). If the judge is hacked, the AI learns the wrong lessons.
Our research using this tool achieved a 99% success rate in bypassing controls across several widely used architectures that customers rely on today:
Our testing with AdvJudge-Zero demonstrates that AI judges are susceptible to logic flaws, just like other software. If an attacker can automate the discovery of bypass sequences through fuzzing, they can systematically defeat AI guardrails with innocent-looking inputs.
However, the fuzzer methodology also provides a solution. By adopting adversarial training — running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples — organizations can harden their systems. This approach can reduce the attack success rate from approximately 99% to near zero.
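The hardening loop described above can be sketched as follows. Both `ToyJudge` and `toy_fuzzer` are hypothetical stand-ins (a real deployment would fine-tune an actual model on the fuzzer's findings); the sketch only shows the fuzz-retrain-repeat structure.

```python
# Minimal sketch of an adversarial-training loop, with toy stand-ins
# for the judge model and the fuzzer.
class ToyJudge:
    def __init__(self):
        self.known_triggers = set()

    def verdict(self, text):
        if any(t in text for t in self.known_triggers):
            return "block"
        return "allow" if "###" in text else "block"

    def retrain(self, examples):
        # Stand-in for fine-tuning: absorb triggers with corrected labels.
        self.known_triggers.update(t for t, label in examples if label == "block")

def toy_fuzzer(judge):
    # Stand-in for the fuzzer: report probes that currently bypass the judge.
    probes = ["payload ###", "payload ---"]
    return [p for p in probes if judge.verdict(p) == "allow"]

def harden(judge, fuzzer, rounds=5):
    """Fuzz for bypass triggers, retrain on them with the correct label, repeat."""
    for _ in range(rounds):
        bypasses = fuzzer(judge)
        if not bypasses:
            break  # fuzzer finds no more bypasses; attack success rate near zero
        judge.retrain([(b, "block") for b in bypasses])
    return judge
```

Each round shrinks the set of working triggers, which mirrors how retraining on fuzzer-discovered examples drives the attack success rate from roughly 99% toward zero.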
Palo Alto Networks customers are better protected from the threats discussed above through the following products and services:
Organizations are better equipped to close the AI security gap through the deployment of Cortex AI-SPM, which delivers comprehensive visibility and posture management for AI agents. Cortex AI-SPM is designed to mitigate critical risks including over-privileged AI agent access, misconfigurations and unauthorized data exposure.
The Unit 42 AI Security Assessment can help empower safe AI use and development.
If you think you may have been compromised or have an urgent matter, contact the Unit 42 Incident Response team or call:
Palo Alto Networks has shared these findings with our fellow Cyber Threat Alliance (CTA) members. CTA members use this intelligence to rapidly deploy protections to their customers and to systematically disrupt malicious cyber actors. Learn more about the Cyber Threat Alliance.