MAI-2024-0060
Published: May 16, 2026
Updated: May 16, 2026
Large Language Models (LLMs) are susceptible to a sophisticated semantic mirror jailbreak attack. This attack utilizes a genetic algorithm to craft jailbreak prompts that closely resemble benign prompts in semantic terms, thereby circumventing defenses that rely on semantic similarity metrics. The attack is designed to optimize for both semantic resemblance to the original query and the capacity to provoke harmful responses from the model.
Mitigation steps:
**For AI Developers:**
* Implement advanced safety mechanisms that incorporate contextual analysis and intention detection, moving beyond basic semantic similarity checks.
* Deploy sophisticated detection systems capable of identifying subtle manipulations, even when they are semantically similar to benign prompts.
* Rate-limit queries exhibiting high semantic similarity, particularly if originating from the same source, to mitigate potential abuse.
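The rate-limiting mitigation above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `embed` is a bag-of-words stand-in for a real sentence-embedding model, and the class name `SimilarityRateLimiter` and its `threshold`, `max_similar`, and `window_seconds` parameters are hypothetical choices made for this example.

```python
import math
import re
import time
from collections import Counter, defaultdict, deque


def embed(text):
    """Stand-in embedding (assumption): bag-of-words token counts.

    A real deployment would likely use a sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SimilarityRateLimiter:
    """Hypothetical per-source limiter for bursts of near-duplicate prompts."""

    def __init__(self, threshold=0.8, max_similar=3, window_seconds=60.0):
        self.threshold = threshold            # similarity at or above this counts as "similar"
        self.max_similar = max_similar        # similar prior prompts tolerated per window
        self.window_seconds = window_seconds  # sliding window length in seconds
        self._history = defaultdict(deque)    # source -> deque of (timestamp, vector)

    def allow(self, source, prompt, now=None):
        now = time.monotonic() if now is None else now
        history = self._history[source]
        # Drop entries that have fallen out of the sliding window.
        while history and now - history[0][0] > self.window_seconds:
            history.popleft()
        vec = embed(prompt)
        similar = sum(1 for _, old in history if cosine(vec, old) >= self.threshold)
        # Record the prompt even when it is rejected, so repeated retries
        # keep counting against the same source.
        history.append((now, vec))
        return similar < self.max_similar
```

A genetic-algorithm search tends to submit many close variants of one query, so counting similar prompts per source within a time window is one way to slow it down. The limiter complements, rather than replaces, the contextual and intent-based checks listed above.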
**For Model Trainers/Fine-tuners:**
* Regularly update safety models and filters to address emerging threats, utilizing adversarial training to enhance resistance against this type of attack.
CVSS v4
Base Score: 6.3
Attack Vector: NETWORK
Attack Complexity: HIGH
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: NONE
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: LOW
Subsequent System Availability: NONE
CVSS v3
Base Score: 4.0
Attack Vector: NETWORK
Attack Complexity: HIGH
Privileges Required: NONE
User Interaction: NONE
Scope: CHANGED
Confidentiality: NONE
Integrity: LOW
Availability: NONE
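As a cross-check, the CVSS v3 base score listed above can be reproduced from its metrics. The sketch below assumes the CVSS v3.1 base-score equations and metric weights published by FIRST; the function and dictionary names are illustrative only.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}
PR_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
PR_CHANGED = {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.5}


def roundup(x):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(av, ac, pr, ui, scope, c, i, a):
    changed = scope == "CHANGED"
    # Impact sub-score base (ISS).
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr_weight = (PR_CHANGED if changed else PR_UNCHANGED)[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total, 10)) if changed else roundup(min(total, 10))


# Metrics as listed in this advisory.
score = base_score("NETWORK", "HIGH", "NONE", "NONE", "CHANGED",
                   "NONE", "LOW", "NONE")
print(score)  # 4.0
```

With these inputs the impact sub-score is about 1.44 and the exploitability sub-score about 2.22; multiplying their sum by the changed-scope factor 1.08 gives roughly 3.95, which rounds up to 4.0.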
AIVSS
Base Score: 3.8