MAI-2025-0020 | Mend Vulnerability Database

Published: May 16, 2026

Updated: June 17, 2026

Large Language Models (LLMs) that use gradient-based optimization techniques to defend against jailbreaking attacks are susceptible to attacks with enhanced transferability because their objective functions contain superfluous constraints. Two constraints are identified: the "response pattern constraint," which mandates a specific initial response phrase, and the "token tail constraint," which penalizes deviations in the response beyond a predetermined prefix. Both restrict the search space and reduce the effectiveness of attacks across models. Removing these constraints significantly increases the success rate of attacks when they are transferred to target models.

Mitigation steps:

**For AI Developers:**

* Develop more robust safety mechanisms that are less susceptible to manipulation via gradient-based attacks, considering methods beyond simple token-level prediction.

**For Model Trainers/Fine-tuners:**

* Re-evaluate the objective function used in gradient-based LLM safety mechanisms so that only the response-pattern constraints essential for safety remain.
* Relax or remove constraints on the "token tail," allowing more variability in output while maintaining core safety restrictions.
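To make the "token tail constraint" described above concrete, the following is a minimal sketch, not the formulation used in the referenced paper. It assumes the objective is a mean negative log-likelihood over the tokens of a fixed target string; the function `target_loss`, the parameter `prefix_len`, and the numeric values are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def target_loss(token_nll: Sequence[float], prefix_len: Optional[int] = None) -> float:
    """Mean negative log-likelihood over target tokens (illustrative only).

    prefix_len=None scores every target token, prefix and tail alike.
    prefix_len=k scores only the first k tokens, dropping the tail term.
    """
    scored = token_nll if prefix_len is None else token_nll[:prefix_len]
    if not scored:
        raise ValueError("no tokens to score")
    return sum(scored) / len(scored)


# Hypothetical per-token NLL values for one fixed target string.
nll = [0.2, 0.4, 3.0, 5.0]

full = target_loss(nll)        # tail tokens dominate the objective
relaxed = target_loss(nll, 2)  # only the two-token prefix is constrained
```

With these made-up numbers the full objective is driven almost entirely by the two tail tokens, while the relaxed objective depends only on the prefix. This is the sense in which a tail term narrows the search space.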

Related Resources (1)

https://arxiv.org/abs/2503.01865

CVSS v4

Base Score: 6.3

Attack Vector: NETWORK
Attack Complexity: HIGH
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: NONE
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: LOW
Subsequent System Availability: NONE

CVSS v3

Base Score:

Attack Vector: NETWORK
Attack Complexity: HIGH
Privileges Required: NONE
User Interaction: NONE
Scope: CHANGED
Confidentiality: NONE
Integrity: LOW
Availability: NONE
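
The CVSS v3 panel above lists metric values but no base score. As a rough sketch, the score implied by those metrics can be computed from the base-score equations of the CVSS v3.1 specification. Treating the metrics as v3.1 is an assumption, since the entry only says "v3", and the function and table names below are illustrative rather than part of any library.

```python
import math

# Metric weights as defined in the CVSS v3.1 specification.
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}
PR_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
PR_CHANGED = {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.5}


def roundup(x: float) -> float:
    """Round up to one decimal place using the v3.1 integer-based rule."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(av, ac, pr, ui, scope, c, i, a) -> float:
    """CVSS v3.1 base score from metric names (assumed v3.1 equations)."""
    changed = scope == "CHANGED"
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr_weight = (PR_CHANGED if changed else PR_UNCHANGED)[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total, 10) if changed else min(total, 10))


# Metrics exactly as listed in the CVSS v3 panel of this entry.
score = base_score("NETWORK", "HIGH", "NONE", "NONE",
                   "CHANGED", "NONE", "LOW", "NONE")
print(score)  # 4.0
```

Under these assumptions the listed metrics yield 4.0. This is a derived value, not a score published in the entry.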

AIVSS

Base Score: 3.8