MAI-2024-0020
Published: May 16, 2026
Updated: May 16, 2026
Large Language Models (LLMs) employed as safety judges are susceptible to the "Emoji Attack," a form of prompt injection that exploits token segmentation bias. The attacker inserts emojis inside tokens, which forces the tokenizer to split each affected token into sub-tokens whose embeddings differ from the embedding of the original token. As a result, the judge LLM misclassifies harmful content as benign. The attack is most effective when each emoji is placed at the position that maximizes the discrepancy between the sub-token embeddings and the original token embedding.
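The segmentation effect described above can be illustrated with a toy example. This is a minimal sketch, not the tokenizer of any real judge model: the vocabulary `TOY_VOCAB`, the greedy longest-match function `toy_tokenize`, and the helper `insert_inside` are all hypothetical names chosen for illustration.

```python
# Toy illustration only: this vocabulary and the greedy longest-match
# tokenizer are hypothetical stand-ins, not any real model's tokenizer.
TOY_VOCAB = {"harmful", "ha", "rm", "ful", "😊"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens


def insert_inside(word, position, marker="😊"):
    """Return `word` with `marker` inserted at character offset `position`."""
    return word[:position] + marker + word[position:]


clean_tokens = toy_tokenize("harmful")
perturbed_tokens = toy_tokenize(insert_inside("harmful", 2))

print(clean_tokens)      # the whole word maps to one token
print(perturbed_tokens)  # the same word is now split into several sub-tokens
```

In this toy setting the unmodified word is a single vocabulary entry, while the perturbed word is split around the emoji. A judge that relies on the embedding of the whole-word token never sees that token in the perturbed input.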
Mitigation steps:
**For AI Developers:**
* Implement advanced character filtering mechanisms that analyze context and embedding changes, rather than solely removing unusual characters.
* Develop detection mechanisms that identify patterns indicative of the Emoji Attack, focusing on unusual character placement within tokens.
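One way to approach the detection idea above is to flag emoji-like characters that sit between two alphanumeric characters, since ordinary usage places emojis between or after words rather than inside them. The following is a rough sketch under stated assumptions: `is_emoji_like` is a heuristic built on Unicode general categories (not a complete emoji classifier), and the function names are hypothetical.

```python
import unicodedata

# Zero-width joiner and variation selector-16, which often accompany emojis.
_JOINERS = {"\u200d", "\ufe0f"}


def is_emoji_like(ch):
    """Heuristic (assumption): treat 'Symbol, other' code points at or above
    U+2000, skin-tone modifiers, and emoji joiners as emoji-like."""
    cp = ord(ch)
    if ch in _JOINERS or 0x1F3FB <= cp <= 0x1F3FF:
        return True
    return cp >= 0x2000 and unicodedata.category(ch) == "So"


def find_intraword_emoji(text):
    """Return (start_index, run) for each emoji-like run flanked on both
    sides by alphanumeric characters, i.e. placed inside a word."""
    hits = []
    i, n = 0, len(text)
    while i < n:
        if not is_emoji_like(text[i]):
            i += 1
            continue
        start = i
        while i < n and is_emoji_like(text[i]):
            i += 1
        before_ok = start > 0 and text[start - 1].isalnum()
        after_ok = i < n and text[i].isalnum()
        if before_ok and after_ok:
            hits.append((start, text[start:i]))
    return hits


def strip_intraword_emoji(text):
    """Return a normalized copy with only the intra-word runs removed."""
    out, last = [], 0
    for start, run in find_intraword_emoji(text):
        out.append(text[last:start])
        last = start + len(run)
    out.append(text[last:])
    return "".join(out)
```

Rather than silently discarding the flagged characters, a pipeline could submit both the original text and the output of `strip_intraword_emoji` to the judge and treat a disagreement between the two verdicts as a signal for review. Emojis at word boundaries are left untouched, so ordinary messages are not altered.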
**For Model Trainers/Fine-tuners:**
* Enhance judge LLMs to increase robustness against token segmentation bias.
* Utilize diverse and robust evaluation metrics beyond simple unsafe prediction ratios when assessing LLM safety.
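As an illustration of looking beyond a single unsafe prediction ratio, the sketch below compares a judge's verdicts on clean inputs against its verdicts on perturbed copies of the same inputs and reports how often a flagged sample stops being flagged. It assumes a judge exposed as a callable that returns `True` for unsafe text; `robustness_report`, `toy_judge`, and the sample strings are hypothetical and exist only to make the example runnable.

```python
def robustness_report(judge, clean, perturbed):
    """Compare judge verdicts on paired clean and perturbed texts.

    `judge` is assumed to be a callable returning True when text is unsafe.
    """
    if not clean or len(clean) != len(perturbed):
        raise ValueError("clean and perturbed must be non-empty and paired")

    clean_flags = [bool(judge(t)) for t in clean]
    pert_flags = [bool(judge(t)) for t in perturbed]

    flagged_clean = sum(clean_flags)
    # A flip is a sample flagged unsafe when clean but not after perturbation.
    flips = sum(1 for c, p in zip(clean_flags, pert_flags) if c and not p)

    return {
        "unsafe_ratio_clean": flagged_clean / len(clean),
        "unsafe_ratio_perturbed": sum(pert_flags) / len(perturbed),
        "flip_rate": flips / flagged_clean if flagged_clean else 0.0,
    }


def toy_judge(text):
    """Stand-in keyword judge for demonstration; not a real safety model."""
    return "harmful" in text.lower()


clean = ["harmful request", "harmful plan", "friendly note", "harmful idea"]
perturbed = ["ha😊rmful request", "harmful plan", "friendly note", "ha😊rmful idea"]

report = robustness_report(toy_judge, clean, perturbed)
print(report)
```

Reporting the flip rate alongside the two ratios separates a judge that is weak everywhere from one that is accurate on clean text but brittle under perturbation. The same structure extends to per-perturbation breakdowns if several perturbation types are tested.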
CVSS v4
Base Score: 8.7
Attack Vector: NETWORK
Attack Complexity: LOW
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: NONE
Vulnerable System Integrity: HIGH
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: NONE
Subsequent System Availability: NONE
CVSS v3
Base Score: 7.5
Attack Vector: NETWORK
Attack Complexity: LOW
Privileges Required: NONE
User Interaction: NONE
Scope: UNCHANGED
Confidentiality: NONE
Integrity: HIGH
Availability: NONE
AIVSS
Base Score: 4.7