MAI-2024-0045 | Mend Vulnerability Database

MAI-2024-0045

Published: May 16, 2026

Updated: June 17, 2026

This vulnerability in large language models (LLMs) enables jailbreaking with near-perfect success through iterative prompt refinement and self-explanation. An attacker uses the LLM's own capabilities to refine adversarial prompts, asking the model to explain why previous attempts failed. Repeating this process yields prompts that circumvent the model's safety mechanisms and elicit harmful content. A "Rate+Enhance" step is then applied to further increase the harmfulness of the generated output.

Mitigation steps:

**For AI Developers:**

* Implement robust prompt filtering mechanisms that resist iterative refinement and self-explanation techniques.
* Incorporate safety measures that go beyond simple word-count or keyword-based filtering to detect malicious intent within generated prompts.

**For Model Trainers/Fine-tuners:**

* Develop detection mechanisms that identify and block prompts generated through self-jailbreaking methods by analyzing patterns in language and query structure.
* Conduct rigorous red-teaming and adversarial testing to identify and address vulnerabilities before model deployment.
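One way to approach the "patterns in language and query structure" mitigation is to watch a whole session rather than a single prompt. The following is a minimal sketch under stated assumptions, not a vetted defense: `SessionMonitor`, `EXPLAIN_REFUSAL`, the similarity threshold, and the retry limit are all hypothetical names and values chosen for illustration, and the refusal flag is assumed to come from the serving layer. It flags a session in which the user asks why a request was refused and then keeps submitting close variants of a refused prompt.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List

# Hypothetical phrasings of "explain your refusal"; a real deployment
# would need a much broader signal than this short pattern.
EXPLAIN_REFUSAL = re.compile(
    r"\b(?:why (?:did|do|would) you (?:refuse|decline|reject)"
    r"|explain (?:why|your refusal))\b",
    re.IGNORECASE,
)


@dataclass
class SessionMonitor:
    similarity_threshold: float = 0.6  # assumed value, tune on real traffic
    max_refined_retries: int = 2       # assumed value
    refused_prompts: List[str] = field(default_factory=list)
    explanation_requests: int = 0
    refined_retries: int = 0

    def _resembles_refused(self, prompt: str) -> bool:
        # Character-level similarity against every previously refused prompt.
        return any(
            SequenceMatcher(None, prompt.lower(), old.lower()).ratio()
            >= self.similarity_threshold
            for old in self.refused_prompts
        )

    def record_turn(self, prompt: str, was_refused: bool) -> None:
        # Requests to explain a refusal are counted and not treated as retries.
        if EXPLAIN_REFUSAL.search(prompt):
            self.explanation_requests += 1
            return
        if self._resembles_refused(prompt):
            self.refined_retries += 1
        if was_refused:
            self.refused_prompts.append(prompt)

    def is_suspicious(self) -> bool:
        # Flag only when both signals of the refinement loop are present.
        return (
            self.explanation_requests >= 1
            and self.refined_retries >= self.max_refined_retries
        )


monitor = SessionMonitor()
base = "<placeholder request>"
monitor.record_turn(base, was_refused=True)
monitor.record_turn("Why did you refuse that request?", was_refused=False)
monitor.record_turn(base + " rephrased once", was_refused=True)
monitor.record_turn(base + " rephrased twice", was_refused=True)
print(monitor.is_suspicious())
```

Character-level similarity is a deliberately simple stand-in here. Refined prompts may differ substantially in wording, so a production detector would more plausibly compare semantic embeddings, and it would have to account for an attacker who spreads attempts across sessions.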

Related Resources

https://arxiv.org/abs/2405.13077

CVSS v4

Base Score: 6.9

Attack Vector: NETWORK

Attack Complexity: HIGH

Attack Requirements: NONE

Privileges Required: NONE

User Interaction: NONE

Vulnerable System Confidentiality: NONE

Vulnerable System Integrity: LOW

Vulnerable System Availability: NONE

Subsequent System Confidentiality: NONE

Subsequent System Integrity: HIGH

Subsequent System Availability: NONE
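The CVSS v4 metrics listed above are often exchanged as a compact vector string. The sketch below assembles one from the values on this page, assuming "CVSS v4" means CVSS v4.0 and using that specification's standard metric abbreviations. The names `METRICS`, `ABBREV`, and `to_vector` are illustrative, and the code does not compute or verify the base score.

```python
# (abbreviation, metric name, value as listed in this advisory)
METRICS = [
    ("AV", "Attack Vector", "NETWORK"),
    ("AC", "Attack Complexity", "HIGH"),
    ("AT", "Attack Requirements", "NONE"),
    ("PR", "Privileges Required", "NONE"),
    ("UI", "User Interaction", "NONE"),
    ("VC", "Vulnerable System Confidentiality", "NONE"),
    ("VI", "Vulnerable System Integrity", "LOW"),
    ("VA", "Vulnerable System Availability", "NONE"),
    ("SC", "Subsequent System Confidentiality", "NONE"),
    ("SI", "Subsequent System Integrity", "HIGH"),
    ("SA", "Subsequent System Availability", "NONE"),
]

# Single-letter codes for the values that appear in this advisory.
# NETWORK and NONE both map to "N"; the metric prefix disambiguates them.
ABBREV = {"NETWORK": "N", "HIGH": "H", "LOW": "L", "NONE": "N"}


def to_vector(metrics):
    """Join metric:value pairs into a CVSS v4.0 style vector string."""
    parts = [f"{code}:{ABBREV[value]}" for code, _name, value in metrics]
    return "CVSS:4.0/" + "/".join(parts)


print(to_vector(METRICS))
# CVSS:4.0/AV:N/AC:H/AT:N/PR:N/UI:N/VC:N/VI:L/VA:N/SC:N/SI:H/SA:N
```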

CVSS v3

Base Score:

Attack Vector: NETWORK

Attack Complexity: HIGH

Privileges Required: NONE

User Interaction: NONE

Scope: CHANGED

Confidentiality: NONE

Integrity: LOW

Availability: NONE

AIVSS

Base Score: 4.8