Mend.io Vulnerability Database
The largest open source vulnerability database
MAI-2025-0016
Published: May 16, 2026
Updated: May 16, 2026
Large Language Models (LLMs) integrated into hate speech detection systems are susceptible to adversarial attacks and model extraction. Adversarial attacks strategically alter hate speech text to evade detection, while model extraction lets an attacker build a surrogate model that replicates the behavior of the targeted system. Both weaken the integrity and efficacy of hate speech detection frameworks.

Mitigation steps:

**For AI Developers:**

* Diversify detection methods beyond LLMs by incorporating human-in-the-loop verification
* Increase scrutiny of user queries to detect anomalous patterns that might indicate adversarial attacks

**For Model Trainers/Fine-tuners:**

* Regularly update hate speech detection models with new data, including examples generated by advanced LLMs and adversarial attacks
* Implement robust defenses against adversarial attacks through techniques like adversarial training and robust optimization
* Employ techniques to detect model stealing attempts, such as monitoring query patterns and distributions
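The query-monitoring mitigations above can be illustrated with a minimal sketch. It assumes that adversarial probing and extraction attempts often show up as many near-duplicate queries from one client, each a small perturbation of the last. The `QueryMonitor` class, its thresholds, and the use of character-level similarity are illustrative assumptions, not part of this advisory; a production system would likely combine several signals.

```python
from collections import defaultdict, deque
from difflib import SequenceMatcher


class QueryMonitor:
    """Hypothetical helper: flag clients whose recent queries look like probing."""

    def __init__(self, window=50, similarity=0.8, max_near_duplicates=5):
        self.similarity = similarity                    # ratio treated as "near duplicate"
        self.max_near_duplicates = max_near_duplicates  # review threshold (assumed value)
        # Keep only the last `window` queries per client.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, client_id, text):
        """Record a query; return True if the client should be reviewed."""
        history = self._history[client_id]
        # Count earlier queries from this client that closely resemble the new one.
        near_duplicates = sum(
            1
            for previous in history
            if SequenceMatcher(None, previous, text).ratio() >= self.similarity
        )
        history.append(text)
        return near_duplicates >= self.max_near_duplicates


monitor = QueryMonitor(window=10, similarity=0.8, max_near_duplicates=3)
base = "this group of people is terrible"
# Single-character substitutions, as an attacker might try when probing a filter.
variants = [
    base,
    base.replace("e", "3", 1),
    base.replace("i", "1", 1),
    base.replace("o", "0", 1),
    base.replace("t", "7", 1),
]
flags = [monitor.observe("client-1", v) for v in variants]
print(flags)
```

A flagged client would typically be routed to rate limiting or human review instead of being blocked outright, since benign users also rephrase their queries.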
Related Resources (1)
CVSS v4
Base Score: 6.9
Attack Vector: NETWORK
Attack Complexity: LOW
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: LOW
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: NONE
Subsequent System Availability: NONE
CVSS v3
Base Score: 6.5
Attack Vector: NETWORK
Attack Complexity: LOW
Privileges Required: NONE
User Interaction: NONE
Scope: UNCHANGED
Confidentiality: LOW
Integrity: LOW
Availability: NONE
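As a worked check, the CVSS v3 base score can be reproduced from the metric values listed above. The sketch below assumes the base-score equations and metric weights published in the CVSS v3.1 specification and handles only the Scope: UNCHANGED case that applies here; the function and table names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification (Scope: UNCHANGED values for PR).
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
PR = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}


def roundup(value):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a):
    """Base score for Scope: UNCHANGED vectors only."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score_scope_unchanged(
    "NETWORK", "LOW", "NONE", "NONE", "LOW", "LOW", "NONE"
)
print(score)  # 6.5
```

With these inputs the impact sub-score is about 2.51 and the exploitability sub-score about 3.89; their sum, 6.40, rounds up to the listed 6.5.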
AIVSS
Base Score: 4.3