MAI-2023-0003 | Mend Vulnerability Database

Vulnerability DatabaseMAI-2023-0003

MAI-2023-0003

Published:May 16, 2026

Updated:June 17, 2026

A vulnerability has been identified within the fine-tuning API of GPT-4, which permits adversaries to bypass the Reinforcement Learning from Human Feedback (RLHF) safety mechanisms. By employing a relatively small set of meticulously crafted prompt-response pairs, attackers can fine-tune the model to produce harmful content. This content includes instructions for illegal activities and the creation of hazardous materials, which the base model is designed to refuse to generate. Mitigation steps: **For AI Developers:** * Implement post-processing mechanisms to detect and filter harmful content generated by the model. * Restrict access to the API to trusted users and organizations. **For Model Trainers/Fine-tuners:** * Implement robust input sanitization and filtering to prevent malicious prompts from being used in the fine-tuning process, including detecting and blocking prompts designed to elicit harmful responses. * Continuously monitor fine-tuning datasets for the presence of malicious or harmful content, incorporating automated detection systems into the fine-tuning process. * Develop more robust Reinforcement Learning from Human Feedback (RLHF) techniques that are less susceptible to fine-tuning attacks.

Related Resources (1)

https://arxiv.org/abs/2311.05553

Do you need more information?

CVSS v4

Base Score:

8.3

Attack Vector

NETWORK

Attack Complexity

LOW

Attack Requirements

NONE

Privileges Required

LOW

User Interaction

NONE

Vulnerable System Confidentiality

NONE

Vulnerable System Integrity

HIGH

Vulnerable System Availability

NONE

Subsequent System Confidentiality

NONE

Subsequent System Integrity

HIGH

Subsequent System Availability

NONE

CVSS v3

Base Score:

7.7

Attack Vector

NETWORK

Attack Complexity

LOW

Privileges Required

LOW

User Interaction

NONE

Scope

CHANGED

Confidentiality

NONE

Integrity

HIGH

Availability

NONE

AIVSS

Base Score:

5.7