MAI-2024-0051 | Mend Vulnerability Database

MAI-2024-0051

Published: May 16, 2026

Updated: June 17, 2026

Large Language Models (LLMs) that utilize reinforcement learning from human feedback (RLHF) are susceptible to jailbreaking attacks due to reward function misspecification. The alignment process involves a reward function intended to evaluate the quality of responses, yet it often fails to accurately differentiate between safe and adversarial prompts. This vulnerability allows adversaries to construct prompts that produce harmful outputs, circumventing the model's safety constraints. The issue arises from a disparity in the implicit rewards assigned to safe versus harmful responses, enabling attackers to exploit this misspecification and bypass established safety mechanisms.

Mitigation steps:

**For AI Developers:**

* Implement defense mechanisms to detect and mitigate adversarial prompts by analyzing implicit rewards and other subtle indicators of malicious intent.
* Conduct robustness testing and red teaming exercises to identify vulnerabilities related to reward misspecification, utilizing systems like ReMiss for proactive mitigation.

**For Model Trainers/Fine-tuners:**

* Design improved reward functions that accurately capture desired behaviors and generalize effectively to out-of-distribution inputs, including adversarial prompts.
* Utilize advanced data filtering and augmentation techniques to minimize the impact of noisy or malicious human feedback, thereby reducing reward misspecification.
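The implicit-reward analysis mentioned in the developer mitigations can be sketched roughly as follows. The advisory does not define "implicit reward"; this sketch assumes one common formulation, the log-probability ratio between the aligned model and a reference (pre-alignment) model. The function names, the toy log-probabilities, and the zero threshold are all hypothetical, and in practice the per-token log-probabilities would come from scoring the same response under both models.

```python
from typing import Sequence


def implicit_reward(aligned_logprobs: Sequence[float],
                    reference_logprobs: Sequence[float]) -> float:
    """Assumed implicit reward: log pi_aligned(y|x) - log pi_ref(y|x).

    Each argument holds per-token log-probabilities of the same response y
    given the same prompt x, scored under the respective model.
    """
    return sum(aligned_logprobs) - sum(reference_logprobs)


def reward_gap(safe_aligned: Sequence[float], safe_ref: Sequence[float],
               harmful_aligned: Sequence[float], harmful_ref: Sequence[float]) -> float:
    """Implicit reward of a safe response minus that of a harmful response."""
    return (implicit_reward(safe_aligned, safe_ref)
            - implicit_reward(harmful_aligned, harmful_ref))


def is_suspicious(gap: float, threshold: float = 0.0) -> bool:
    """Flag prompts where the harmful response is rewarded at least as much
    as the safe one. The threshold is a hypothetical tuning knob."""
    return gap <= threshold


# Toy numbers only (not measurements): the aligned model raises the
# likelihood of the refusal and lowers that of the harmful completion.
gap = reward_gap(
    safe_aligned=[-0.5, -0.5], safe_ref=[-1.0, -1.0],
    harmful_aligned=[-2.0, -2.0], harmful_ref=[-1.0, -1.0],
)
print(gap, is_suspicious(gap))
```

A small or negative gap for a given prompt would suggest that the alignment signal does not separate safe from harmful responses there, which is the condition the advisory describes as exploitable.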

Related Resources (1)

https://arxiv.org/abs/2406.14393

CVSS v4

Base Score: 6.3
Attack Vector: NETWORK
Attack Complexity: HIGH
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: NONE
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: NONE
Subsequent System Availability: NONE

CVSS v3

Base Score: 3.7
Attack Vector: NETWORK
Attack Complexity: HIGH
Privileges Required: NONE
User Interaction: NONE
Scope: UNCHANGED
Confidentiality: NONE
Integrity: LOW
Availability: NONE
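The CVSS v3 base score above can be reproduced from the listed metrics. The sketch below assumes the CVSS v3.1 base-score equations and metric weights (the advisory says only "CVSS v3") and handles only the Scope: UNCHANGED case that applies here; the function and dictionary names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification (Scope: Unchanged values).
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
PR = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_unchanged(av, ac, pr, ui, c, i, a) -> float:
    """Base score for Scope: UNCHANGED vectors only."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# Metrics exactly as listed in this advisory's CVSS v3 section.
score = base_score_unchanged(
    av="NETWORK", ac="HIGH", pr="NONE", ui="NONE",
    c="NONE", i="LOW", a="NONE",
)
print(score)  # 3.7
```

With these inputs the impact sub-score is about 1.41 and the exploitability sub-score about 2.22, so the sum of roughly 3.63 rounds up to the listed 3.7. The CVSS v4 score is derived from a lookup-based method rather than a closed-form equation, so it is not reproduced here.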

AIVSS

Base Score: 4.3