MAI-2025-0020
Published: May 16, 2026
Updated: May 16, 2026
Large Language Models (LLMs) that rely on gradient-based optimization techniques to defend against jailbreaking attacks are susceptible to attacks with enhanced transferability, because the objective functions involved contain superfluous constraints. Two constraints are identified: the "response pattern constraint," which mandates a specific initial response phrase, and the "token tail constraint," which penalizes deviations in the response beyond a predetermined prefix. Both restrict the search space and reduce the effectiveness of attacks across models. Removing these constraints significantly increases the success rate of attacks transferred to target models.
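To make the "token tail constraint" concrete, here is a toy numeric sketch of how tokens beyond a prefix contribute to a likelihood-based objective. It assumes the objective is a mean negative log-likelihood over a target response, which the advisory does not state explicitly. The function names, the `prefix_len` parameter, and the log-probability values are hypothetical and chosen only for illustration.

```python
import math
from typing import Optional, Sequence


def mean_nll(token_logprobs: Sequence[float], start: int = 0,
             end: Optional[int] = None) -> float:
    """Mean negative log-likelihood over target positions [start, end)."""
    window = token_logprobs[start:end]
    if not window:
        raise ValueError("empty token window")
    return -math.fsum(window) / len(window)


def constrained_loss(token_logprobs: Sequence[float]) -> float:
    """Assumed constrained form: every target token, prefix and tail, is scored."""
    return mean_nll(token_logprobs)


def tail_relaxed_loss(token_logprobs: Sequence[float], prefix_len: int) -> float:
    """Assumed relaxed form: only the first `prefix_len` tokens are scored."""
    return mean_nll(token_logprobs, 0, prefix_len)


# Hypothetical per-token log-probabilities of a target response under some model:
# the first three tokens are likely, the two tail tokens are not.
logprobs = [-0.1, -0.2, -0.3, -2.0, -3.0]
print(f"constrained:  {constrained_loss(logprobs):.2f}")
print(f"tail-relaxed: {tail_relaxed_loss(logprobs, 3):.2f}")
```

In this toy case the two unlikely tail tokens dominate the constrained value, which illustrates how scoring the tail narrows the set of inputs that an optimizer would consider good.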
Mitigation steps:
**For AI Developers:**
* Develop more robust safety mechanisms that are less susceptible to manipulation via gradient-based attacks, considering methods beyond simple token-level prediction.
**For Model Trainers/Fine-tuners:**
* Re-evaluate the objective function used in gradient-based LLM safety mechanisms, retaining only those response pattern constraints that are essential for safety.
* Relax or remove constraints on the "token tail," allowing for more variability in output while maintaining core safety restrictions.
CVSS v4
Base Score: 6.3
Attack Vector: NETWORK
Attack Complexity: HIGH
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: NONE
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: LOW
Subsequent System Availability: NONE
CVSS v3
Base Score: 4
Attack Vector: NETWORK
Attack Complexity: HIGH
Privileges Required: NONE
User Interaction: NONE
Scope: CHANGED
Confidentiality: NONE
Integrity: LOW
Availability: NONE
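The CVSS v3 base score can be reproduced from the metric values listed for it. The sketch below assumes the CVSS v3.1 base-score equations and metric weights published by FIRST, since the advisory says only "CVSS v3". The function and dictionary names are illustrative.

```python
import math

# Metric weights as given in the CVSS v3.1 specification.
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}
# Privileges Required weights depend on whether Scope is changed.
PR = {
    False: {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27},
    True: {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.5},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(av: str, ac: str, pr: str, ui: str, scope_changed: bool,
               conf: str, integ: str, avail: str) -> float:
    """Compute a CVSS v3.1 base score from metric names."""
    iss = 1 - (1 - CIA[conf]) * (1 - CIA[integ]) * (1 - CIA[avail])
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[scope_changed][pr] * UI[ui]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10))


# Metric values from this advisory's CVSS v3 panel.
score = base_score("NETWORK", "HIGH", "NONE", "NONE", True, "NONE", "LOW", "NONE")
print(score)  # prints 4.0
```

Under these assumptions the computed value matches the listed base score of 4.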
AIVSS
Base Score: 3.8