MAI-2024-0020 | Mend Vulnerability Database

Vulnerability DatabaseMAI-2024-0020

MAI-2024-0020

Published:May 16, 2026

Updated:June 17, 2026

Large Language Models (LLMs) employed as safety judges are susceptible to a vulnerability known as the "Emoji Attack," which is a form of prompt injection exploiting token segmentation bias. This technique involves the strategic insertion of emojis within tokens, thereby altering sub-token embeddings. As a result, the judge LLM is deceived into misclassifying harmful content as benign. The attack's efficacy is heightened by the precise placement of emojis to maximize the discrepancy between the embeddings of sub-tokens and the original token. Mitigation steps: **For AI Developers:** * Implement advanced character filtering mechanisms that analyze context and embedding changes, rather than solely removing unusual characters. * Develop detection mechanisms that identify patterns indicative of the Emoji Attack, focusing on unusual character placement within tokens. **For Model Trainers/Fine-tuners:** * Enhance judge LLMs to increase robustness against token segmentation bias. * Utilize diverse and robust evaluation metrics beyond simple unsafe prediction ratios when assessing LLM safety.

Related Resources (1)

https://arxiv.org/abs/2411.01077

Do you need more information?

CVSS v4

Base Score:

8.7

Attack Vector

NETWORK

Attack Complexity

LOW

Attack Requirements

NONE

Privileges Required

NONE

User Interaction

NONE

Vulnerable System Confidentiality

NONE

Vulnerable System Integrity

HIGH

Vulnerable System Availability

NONE

Subsequent System Confidentiality

NONE

Subsequent System Integrity

NONE

Subsequent System Availability

NONE

CVSS v3

Base Score:

7.5

Attack Vector

NETWORK

Attack Complexity

LOW

Privileges Required

NONE

User Interaction

NONE

Scope

UNCHANGED

Confidentiality

NONE

Integrity

HIGH

Availability

NONE

AIVSS

Base Score:

4.7