MAI-2025-0016 | Mend Vulnerability Database

MAI-2025-0016

Published: May 16, 2026

Updated: June 17, 2026

Large Language Models (LLMs) integrated into hate speech detection systems are susceptible to adversarial attacks and model extraction. Adversarial attacks strategically alter hate speech text so that it evades detection, while model extraction lets an attacker build a surrogate model that replicates the behavior of the targeted system. Both weaknesses compromise the integrity and efficacy of hate speech detection frameworks.

Mitigation steps:

**For AI Developers:**

* Diversify detection methods beyond LLMs by incorporating human-in-the-loop verification.
* Increase scrutiny of user queries to detect anomalous patterns that might indicate adversarial attacks.

**For Model Trainers/Fine-tuners:**

* Regularly update hate speech detection models with new data, including examples generated by advanced LLMs and adversarial attacks.
* Implement robust defenses against adversarial attacks through techniques like adversarial training and robust optimization.
* Employ techniques to detect model stealing attempts, such as monitoring query patterns and distributions.
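The query-monitoring mitigations above could be approached in many ways. The following is a minimal sketch, not a method taken from the advisory or the linked paper: it assumes that probing for evasions or extracting a model tends to produce many near-duplicate queries from one client, and it flags a client once enough of its recent queries closely resemble a new one. The class name `QueryMonitor`, its parameters, and the thresholds are all hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict, deque
from difflib import SequenceMatcher


class QueryMonitor:
    """Hypothetical per-client monitor for near-duplicate query bursts."""

    def __init__(self, window=50, max_near_duplicates=5, similarity=0.9):
        # All three defaults are illustrative assumptions, not recommended values.
        self.max_near_duplicates = max_near_duplicates
        self.similarity = similarity
        # Keep only the last `window` queries seen from each client.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, client_id, text):
        """Record a query; return True if the client looks anomalous."""
        recent = self.history[client_id]
        # Count earlier queries that are similar to, but not identical to,
        # the new one (small perturbations of the same text).
        near = sum(
            1
            for old in recent
            if old != text
            and SequenceMatcher(None, old, text).ratio() >= self.similarity
        )
        recent.append(text)
        return near >= self.max_near_duplicates
```

A caller would invoke `observe(client_id, text)` on every request and route flagged clients to rate limiting or human review. Character-level similarity is only one possible signal; embedding distance or query-distribution statistics may work better in practice.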

Related Resources (1)

https://arxiv.org/abs/2501.16750

CVSS v4

Base Score: 6.9

Attack Vector: NETWORK

Attack Complexity: LOW

Attack Requirements: NONE

Privileges Required: NONE

User Interaction: NONE

Vulnerable System Confidentiality: LOW

Vulnerable System Integrity: LOW

Vulnerable System Availability: NONE

Subsequent System Confidentiality: NONE

Subsequent System Integrity: NONE

Subsequent System Availability: NONE

CVSS v3

Base Score: 6.5

Attack Vector: NETWORK

Attack Complexity: LOW

Privileges Required: NONE

User Interaction: NONE

Scope: UNCHANGED

Confidentiality: LOW

Integrity: LOW

Availability: NONE
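The CVSS v3 base score above can be reproduced from the listed metrics. The sketch below assumes the CVSS v3.1 base equations and metric weights (the entry says only "v3") and handles the Scope: UNCHANGED case only; the function and table names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification (Scope: UNCHANGED values for PR).
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
PR = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}


def roundup(value):
    """Round up to one decimal place, as defined in CVSS v3.1."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a):
    """Base score for Scope: UNCHANGED vectors only."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score_scope_unchanged(
    "NETWORK", "LOW", "NONE", "NONE", "LOW", "LOW", "NONE"
)
print(score)  # 6.5
```

With these inputs the impact sub-score is about 2.51 and the exploitability sub-score about 3.89; their sum, roughly 6.40, rounds up to 6.5.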

AIVSS

Base Score: 4.3