MAI-2024-0066 | Mend Vulnerability Database


MAI-2024-0066

Published: May 16, 2026

Updated: June 17, 2026

This vulnerability affects the safety alignment mechanisms of large language models (LLMs) and enables a "weak-to-strong" jailbreaking attack. The attacker uses a smaller, adversarially trained ("unsafe") LLM to manipulate the decoding probabilities of a larger, safety-aligned ("safe") LLM. The attack exploits the observation that the decoding distributions of safe and unsafe LLMs diverge significantly for the initial tokens, and that this divergence diminishes as generation proceeds. By algebraically combining the probability distributions of the initial tokens from the safe and unsafe models, an attacker can bypass the safety protocols of the larger model. The technique requires only a single forward pass per example in the target LLM, which makes the attack computationally efficient.

Mitigation steps:

**For AI Developers:**

* Implement stricter input validation and filtering mechanisms to prevent malicious inputs from exploiting vulnerabilities.
* Continuously monitor outputs for signs of malicious behavior and promptly address identified issues.

**For Model Trainers/Fine-tuners:**

* Develop and apply more robust safety alignment techniques that reduce susceptibility to manipulation of the initial decoding distributions.
* Implement gradient ascent defense strategies that adjust model parameters based on harmful generations, increasing resistance to specific attack vectors.
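The divergence observation in the description can be made concrete with a small analysis sketch. The Python below computes a per-position KL divergence between the next-token distributions of two models. It is only an illustration: the function names are hypothetical, and the three-token distributions are made-up toy numbers chosen to show the reported pattern, not measurements from any real model.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def per_position_divergence(safe_steps, unsafe_steps):
    """Compute KL(safe || unsafe) separately for each decoding position."""
    return [kl_divergence(s, u) for s, u in zip(safe_steps, unsafe_steps)]

# Hypothetical next-token distributions over a 3-token vocabulary
# at decoding positions 0, 1 and 2 (toy values, assumed for illustration).
SAFE_STEPS = [[0.90, 0.05, 0.05], [0.60, 0.30, 0.10], [0.40, 0.35, 0.25]]
UNSAFE_STEPS = [[0.05, 0.90, 0.05], [0.40, 0.45, 0.15], [0.38, 0.36, 0.26]]

divergences = per_position_divergence(SAFE_STEPS, UNSAFE_STEPS)
for position, value in enumerate(divergences):
    print(f"position {position}: KL = {value:.4f}")
```

With these toy inputs the divergence is largest at the first position and shrinks afterwards, which is the shape of the pattern the advisory describes. Tracking such a curve on real model outputs would require access to both models' token probabilities.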

Related Resources (1)

https://arxiv.org/abs/2401.17256


CVSS v4

Base Score: 1.8

Attack Vector: LOCAL

Attack Complexity: HIGH

Attack Requirements: NONE

Privileges Required: HIGH

User Interaction: NONE

Vulnerable System Confidentiality: NONE

Vulnerable System Integrity: LOW

Vulnerable System Availability: NONE

Subsequent System Confidentiality: NONE

Subsequent System Integrity: LOW

Subsequent System Availability: NONE

CVSS v3

Base Score: 2.5

Attack Vector: LOCAL

Attack Complexity: HIGH

Privileges Required: HIGH

User Interaction: NONE

Scope: CHANGED

Confidentiality: NONE

Integrity: LOW

Availability: NONE
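As a cross-check, the CVSS v3 base score can be recomputed from the metric values listed above. The sketch below assumes the v3.1 base-score equations and metric weights from the FIRST specification (the page does not say whether v3.0 or v3.1 was used). The function and dictionary names are illustrative.

```python
import math

# Metric weights, assumed to follow the CVSS v3.1 specification.
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}
PR_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
PR_CHANGED = {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.5}

def roundup(value):
    """Round up to one decimal place using the integer-based v3.1 method."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0

def base_score(av, ac, pr, ui, scope, c, i, a):
    """Compute the CVSS v3.1 base score from metric value names."""
    changed = scope == "CHANGED"
    # Impact sub-score base: combined loss across confidentiality, integrity, availability.
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    # The Privileges Required weight depends on whether scope changes.
    pr_weight = (PR_CHANGED if changed else PR_UNCHANGED)[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))

score = base_score("LOCAL", "HIGH", "HIGH", "NONE", "CHANGED", "NONE", "LOW", "NONE")
print(score)  # 2.5
```

Under these assumptions the result agrees with the listed base score of 2.5. The CVSS v4 score is not reproduced here because v4 scoring relies on a lookup-based method rather than a closed-form equation.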

AIVSS

Base Score: 2.3