MAI-2024-0018 | Mend Vulnerability Database

Published: May 16, 2026

Updated: June 17, 2026

Large Language Models (LLMs) that have undergone safety fine-tuning are susceptible to a sophisticated attack known as Response-Guided Question Augmentation (ReG-QA). This method exploits the disparity in safety alignment between the processes of question generation and answer formulation. By introducing toxic answers generated by an unaligned LLM to a safety-aligned LLM, ReG-QA facilitates the creation of semantically related, naturally phrased questions that effectively bypass established safety protocols, leading to the generation of undesirable responses. Notably, this attack circumvents the need for adversarial prompt engineering or model optimization.

Mitigation steps:

**For AI Developers:**

* Implement robust filtering mechanisms that utilize deeper analysis of both input and output content, surpassing standard perplexity checks.
* Develop defenses that prioritize semantic understanding over reliance on surface-level features of the input.

**For Model Trainers/Fine-tuners:**

* Improve the symmetry of safety training to enhance generalization capabilities for question generation from unsafe answers.
* Investigate and address the "reversal curse," ensuring safety training effectiveness in both directions (question to answer and answer to question).
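The first mitigation for AI developers (filtering both input and output content) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the advisory or the referenced paper: `guarded_generate`, the scorer signatures, the threshold, and the toy stand-in functions are assumptions chosen only to show the control flow. In practice the scorers would be trained semantic classifiers rather than string checks.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical scorer signatures: each returns an estimated harm probability in [0, 1].
InputScorer = Callable[[str], float]
OutputScorer = Callable[[str, str], float]


@dataclass
class GuardResult:
    allowed: bool
    blocked_at: str  # "input", "output", or "" when nothing was blocked
    text: str


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    score_input: InputScorer,
    score_output: OutputScorer,
    threshold: float = 0.5,  # assumed cut-off, to be tuned per deployment
    refusal: str = "Request declined.",
) -> GuardResult:
    """Call the model only if the prompt passes; release the response only if it passes too."""
    if score_input(prompt) >= threshold:
        return GuardResult(False, "input", refusal)
    response = generate(prompt)
    # Score the response together with its prompt: a naturally phrased
    # question can pass the input check while the answer is still unsafe.
    if score_output(prompt, response) >= threshold:
        return GuardResult(False, "output", refusal)
    return GuardResult(True, "", response)


# Toy stand-ins so the sketch runs end to end (assumptions, not real classifiers).
def toy_generate(prompt: str) -> str:
    return "[UNSAFE] placeholder" if "benign-looking" in prompt else "A harmless answer."


def toy_score_input(prompt: str) -> float:
    return 1.0 if "[ATTACK]" in prompt else 0.0


def toy_score_output(prompt: str, response: str) -> float:
    return 1.0 if "[UNSAFE]" in response else 0.0


if __name__ == "__main__":
    for p in ["[ATTACK] obvious jailbreak", "a benign-looking question", "hello"]:
        result = guarded_generate(p, toy_generate, toy_score_input, toy_score_output)
        print(f"{p!r}: allowed={result.allowed}, blocked_at={result.blocked_at!r}")
```

The second toy prompt is the case relevant to ReG-QA: it passes the input check because nothing about its surface form looks adversarial, and only the output check stops the response.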

Related Resources (1)

https://arxiv.org/abs/2412.03235

CVSS v4

Base Score: 8.7

Attack Vector: NETWORK

Attack Complexity: LOW

Attack Requirements: NONE

Privileges Required: NONE

User Interaction: NONE

Vulnerable System Confidentiality: NONE

Vulnerable System Integrity: HIGH

Vulnerable System Availability: NONE

Subsequent System Confidentiality: NONE

Subsequent System Integrity: NONE

Subsequent System Availability: NONE

CVSS v3

Base Score: 7.5

Attack Vector: NETWORK

Attack Complexity: LOW

Privileges Required: NONE

User Interaction: NONE

Scope: UNCHANGED

Confidentiality: NONE

Integrity: HIGH

Availability: NONE
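The CVSS v3 base score above can be reproduced from the listed metric values. The sketch below assumes the CVSS v3.1 base equations and metric weights published by FIRST (the page says only "CVSS v3", so the minor version is an assumption) and handles only the Scope: UNCHANGED case that applies here. The function and table names are illustrative.

```python
import math

# CVSS v3.1 metric weights (Privileges Required values are for Scope: UNCHANGED).
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
PR = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a) -> float:
    """Base score for Scope: UNCHANGED only."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


if __name__ == "__main__":
    score = base_score_scope_unchanged(
        av="NETWORK", ac="LOW", pr="NONE", ui="NONE",
        c="NONE", i="HIGH", a="NONE",
    )
    print(score)  # 7.5
```

With Integrity as the only impacted property, the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is 8.22 × 0.85 × 0.77 × 0.85 × 0.85 ≈ 3.89, which sum to about 7.48 and round up to 7.5.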

AIVSS

Base Score: 5.2