Mend.io Vulnerability Database
The largest open source vulnerability database
MAI-2024-0051
Published: May 16, 2026
Updated: May 16, 2026
Large Language Models (LLMs) that utilize reinforcement learning from human feedback (RLHF) are susceptible to jailbreaking attacks due to reward function misspecification. The alignment process involves a reward function intended to evaluate the quality of responses, yet it often fails to accurately differentiate between safe and adversarial prompts. This vulnerability allows adversaries to construct prompts that produce harmful outputs, circumventing the model's safety constraints. The issue arises from a disparity in the implicit rewards assigned to safe versus harmful responses, enabling attackers to exploit this misspecification and bypass established safety mechanisms.

Mitigation steps:

**For AI Developers:**

* Implement defense mechanisms to detect and mitigate adversarial prompts by analyzing implicit rewards and other subtle indicators of malicious intent.
* Conduct robustness testing and red teaming exercises to identify vulnerabilities related to reward misspecification, utilizing systems like ReMiss for proactive mitigation.

**For Model Trainers/Fine-tuners:**

* Design improved reward functions that accurately capture desired behaviors and generalize effectively to out-of-distribution inputs, including adversarial prompts.
* Utilize advanced data filtering and augmentation techniques to minimize the impact of noisy or malicious human feedback, thereby reducing reward misspecification.
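The disparity in implicit rewards described above can be made concrete with a small sketch. The advisory does not define how an implicit reward is computed, so this example assumes a DPO-style log-likelihood ratio between the aligned model and a reference model. The names `ScoredResponse`, `implicit_reward`, `reward_gap`, and `is_misspecified`, the threshold, and all log-probability values are hypothetical and chosen only for illustration; this is not the ReMiss implementation.

```python
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    """Sequence log-probabilities of one response under two models (hypothetical inputs)."""
    logp_aligned: float    # log-probability under the aligned (RLHF-tuned) model
    logp_reference: float  # log-probability under the reference (pre-alignment) model


def implicit_reward(resp: ScoredResponse) -> float:
    # Assumption: implicit reward is the log-likelihood ratio aligned/reference.
    return resp.logp_aligned - resp.logp_reference


def reward_gap(safe: ScoredResponse, harmful: ScoredResponse) -> float:
    # Positive gap: the safe response is rewarded more than the harmful one.
    return implicit_reward(safe) - implicit_reward(harmful)


def is_misspecified(safe: ScoredResponse, harmful: ScoredResponse,
                    threshold: float = 0.0) -> bool:
    # Flag prompts where the harmful response is rewarded at least as much.
    return reward_gap(safe, harmful) <= threshold


# Illustrative (made-up) numbers for a benign prompt and an adversarial prompt.
benign_gap = reward_gap(ScoredResponse(-12.0, -20.0), ScoredResponse(-30.0, -18.0))
adversarial_gap = reward_gap(ScoredResponse(-25.0, -22.0), ScoredResponse(-15.0, -19.0))

print(benign_gap)       # 20.0
print(adversarial_gap)  # -7.0
```

Under these assumptions, a gap at or below the threshold marks a prompt for which the harmful response is no longer penalized relative to the safe one. A defense could use that signal to flag prompts for review, and a red-teaming run could use it to rank candidate adversarial prompts.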
CVSS v4
Base Score: 6.3
Attack Vector: NETWORK
Attack Complexity: HIGH
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: NONE
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: NONE
Subsequent System Availability: NONE
CVSS v3
Base Score: 3.7
Attack Vector: NETWORK
Attack Complexity: HIGH
Privileges Required: NONE
User Interaction: NONE
Scope: UNCHANGED
Confidentiality: NONE
Integrity: LOW
Availability: NONE
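As a cross-check, the CVSS v3 base score can be recomputed from the listed metrics. The sketch below assumes the CVSS v3.1 base-score equations and metric weights, and covers only the Scope: UNCHANGED case that applies to this entry. The function and dictionary names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification (Scope: UNCHANGED values for PR).
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
PR_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"NONE": 0.0, "LOW": 0.22, "HIGH": 0.56}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_unchanged(av, ac, pr, ui, c, i, a) -> float:
    """Base score for Scope: UNCHANGED vectors only."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR_UNCHANGED[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# Metrics as listed in this entry.
score = base_score_unchanged(
    av="NETWORK", ac="HIGH", pr="NONE", ui="NONE",
    c="NONE", i="LOW", a="NONE",
)
print(score)  # 3.7
```

With these inputs the impact sub-score is about 1.41 and the exploitability sub-score about 2.22, so the sum of roughly 3.63 rounds up to the listed 3.7.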
AIVSS
Base Score: 4.3