MAI-2024-0036 | Mend Vulnerability Database

MAI-2024-0036

Published: May 16, 2026

Updated: June 17, 2026

Large Language Models (LLMs) are susceptible to sophisticated jailbreak attacks that exploit benign data mirroring techniques. In this attack vector, adversaries construct a local "mirror model" by training it on non-malicious data sourced from the target LLM. This mirror model replicates the target's operational characteristics and is subsequently employed to craft adversarial prompts. These prompts are strategically deployed against the target LLM, effectively circumventing content moderation systems due to the absence of overtly malicious elements during the initial data collection phase.

Mitigation steps:

**For AI Developers:**

* Implement robust input detection systems to identify subtle patterns indicative of shadow model attacks, focusing on statistically unusual query sequences or prompt variations.
* Develop adaptable safety mechanisms that dynamically adjust response strategies based on real-time threat assessments, ensuring regular updates to dynamic threat models.

**For Model Trainers/Fine-tuners:**

* Train LLMs using diverse datasets that include a wide range of potential adversarial prompts, incorporating shadow model techniques to enhance safety alignment.
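The first developer mitigation, flagging statistically unusual prompt variations, could be approached roughly as follows. This is a minimal sketch, not a method taken from the advisory or the referenced paper: the `VariationMonitor` class, its thresholds, and the use of string similarity as the signal are all assumptions made for illustration. It flags a client that sends many near-identical prompts within a short window.

```python
from collections import defaultdict, deque
from difflib import SequenceMatcher

# Hypothetical thresholds (assumptions); tune against real traffic.
WINDOW = 20         # recent prompts remembered per client
SIMILARITY = 0.8    # ratio at or above which two prompts count as variants
MAX_VARIANTS = 5    # variants within the window that trigger a flag


class VariationMonitor:
    """Flags clients that submit many near-duplicate prompt variations."""

    def __init__(self, window=WINDOW, similarity=SIMILARITY,
                 max_variants=MAX_VARIANTS):
        self.similarity = similarity
        self.max_variants = max_variants
        # One bounded history of recent prompts per client identifier.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, client_id, prompt):
        """Record a prompt; return True if the client looks suspicious."""
        recent = self.history[client_id]
        # Count earlier prompts in the window that closely resemble this one.
        variants = sum(
            1 for earlier in recent
            if SequenceMatcher(None, earlier, prompt).ratio() >= self.similarity
        )
        recent.append(prompt)
        return variants >= self.max_variants


if __name__ == "__main__":
    monitor = VariationMonitor(max_variants=3)
    for n in range(1, 5):
        text = f"Explain how photosynthesis works in plants, version {n}"
        print(n, monitor.observe("client-a", text))
```

Character-level similarity is only one possible signal. Because the attack described above gathers its training data through benign queries, a deployment would likely need to combine this with other indicators such as query volume or topic coverage per client.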

Related Resources (1)

https://arxiv.org/abs/2410.21083

CVSS v4

Base Score: 8.2

* Attack Vector: NETWORK
* Attack Complexity: HIGH
* Attack Requirements: NONE
* Privileges Required: NONE
* User Interaction: NONE
* Vulnerable System Confidentiality: NONE
* Vulnerable System Integrity: HIGH
* Vulnerable System Availability: NONE
* Subsequent System Confidentiality: NONE
* Subsequent System Integrity: NONE
* Subsequent System Availability: NONE

CVSS v3

Base Score: 5.9

* Attack Vector: NETWORK
* Attack Complexity: HIGH
* Privileges Required: NONE
* User Interaction: NONE
* Scope: UNCHANGED
* Confidentiality: NONE
* Integrity: HIGH
* Availability: NONE
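The CVSS v3 base score can be reproduced from the metric values listed above. The sketch below assumes the CVSS v3.1 base-score equations and metric weights for the Scope: UNCHANGED case; the advisory does not state which v3 minor version was used, and the function names here are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification (Scope: Unchanged).
WEIGHTS = {
    "AV": {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2},
    "AC": {"LOW": 0.77, "HIGH": 0.44},
    "PR": {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27},
    "UI": {"NONE": 0.85, "REQUIRED": 0.62},
    "CIA": {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0},
}


def roundup(value):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_unchanged(av, ac, pr, ui, c, i, a):
    """Base score for a vector whose Scope is UNCHANGED."""
    w = WEIGHTS
    # Impact sub-score from the three CIA impact weights.
    iss = 1 - (1 - w["CIA"][c]) * (1 - w["CIA"][i]) * (1 - w["CIA"][a])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"][av] * w["AC"][ac] * w["PR"][pr] * w["UI"][ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# Metric values as listed in this advisory.
score = base_score_unchanged(
    av="NETWORK", ac="HIGH", pr="NONE", ui="NONE",
    c="NONE", i="HIGH", a="NONE",
)
print(score)  # 5.9
```

With these inputs the impact term is about 3.60 and the exploitability term about 2.22, so the sum of roughly 5.82 rounds up to the listed 5.9.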

AIVSS

Base Score: 5.7