MAI-2025-0021
Published: May 16, 2026
Updated: May 16, 2026
This white-box vulnerability allows an adversary with full access to a model's parameters to circumvent the safety alignment of a Large Language Model (LLM). By identifying and selectively pruning the parameters that make the model reject harmful prompts, the attacker removes the refusal behavior. The method uses a "twin prompt" strategy to separate safety-related parameters from those the model needs for its core functionality, so the pruning can be targeted with negligible impact on overall performance.
Mitigation steps:
**For AI Developers:**
* Restrict direct access to model parameters, especially for open-source models, to prevent unauthorized modifications.
* Implement model integrity checks to detect unauthorized parameter modifications.
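
As an illustration of the integrity-check item above, the following is a minimal sketch that records a SHA-256 digest for every file in a model directory and later reports any file whose digest differs. The function names (`sha256_file`, `build_manifest`, `changed_files`) and the one-directory-per-model layout are assumptions made for this example, not part of the advisory. A check like this can only detect tampering with a copy you control; it does not stop someone from modifying their own copy of the weights.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir: Path) -> dict[str, str]:
    """Map each file's relative path to its digest (hypothetical layout)."""
    return {
        p.relative_to(model_dir).as_posix(): sha256_file(p)
        for p in sorted(model_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(model_dir: Path, trusted: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified since `trusted`."""
    current = build_manifest(model_dir)
    return sorted(
        name
        for name in trusted.keys() | current.keys()
        if trusted.get(name) != current.get(name)
    )
```

In practice the trusted manifest would be stored, and ideally signed, somewhere separate from the weights, and `changed_files` would run before the model is loaded for serving.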
**For Model Trainers/Fine-tuners:**
* Develop robust safety alignment techniques that distribute safety mechanisms across a larger, less identifiable portion of the model's parameters.
* Explore and implement advanced detection mechanisms to identify signs of parameter pruning or other modifications indicative of attacks.
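
One simple signal for the detection item above is a jump in the share of exactly-zero weights relative to a trusted baseline. The sketch below assumes that pruning is implemented by zeroing parameters, which the advisory does not state; the function names, the `tolerance` default, and the use of plain lists of floats in place of real tensors are likewise assumptions for illustration.

```python
from typing import Mapping, Sequence


def zero_fraction(values: Sequence[float]) -> float:
    """Fraction of entries that are exactly zero (0.0 for an empty tensor)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v == 0.0) / len(values)


def flag_pruned_tensors(
    baseline: Mapping[str, Sequence[float]],
    candidate: Mapping[str, Sequence[float]],
    tolerance: float = 0.01,
) -> list[str]:
    """Return names of tensors that are missing from `candidate` or whose
    zero fraction exceeds the baseline's by more than `tolerance`."""
    flagged = []
    for name, reference in baseline.items():
        if name not in candidate:
            flagged.append(name)
            continue
        increase = zero_fraction(candidate[name]) - zero_fraction(reference)
        if increase > tolerance:
            flagged.append(name)
    return sorted(flagged)
```

A sparsity check of this kind would miss modifications that rescale or perturb weights instead of zeroing them, so it is best treated as a complement to a hash-based integrity check, not a replacement.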
CVSS v4
Base Score: 5.9
* Attack Vector: NETWORK
* Attack Complexity: HIGH
* Attack Requirements: PRESENT
* Privileges Required: HIGH
* User Interaction: NONE
* Vulnerable System Confidentiality: NONE
* Vulnerable System Integrity: HIGH
* Vulnerable System Availability: NONE
* Subsequent System Confidentiality: NONE
* Subsequent System Integrity: LOW
* Subsequent System Availability: NONE
CVSS v3
Base Score: 5.8
* Attack Vector: NETWORK
* Attack Complexity: HIGH
* Privileges Required: HIGH
* User Interaction: NONE
* Scope: CHANGED
* Confidentiality: NONE
* Integrity: HIGH
* Availability: NONE
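
The CVSS v3 base score can be reproduced from the metrics listed above. The sketch below applies the base-score equations and metric weights from the CVSS v3.1 specification; the page does not say which v3 minor version was used, so v3.1 is an assumption, and the function and table names are made up for this example.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
PR_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
PR_CHANGED = {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.5}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}


def roundup(x: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    i = int(round(x * 100000))
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(av, ac, pr, ui, scope_changed, c, i, a) -> float:
    """Compute the CVSS v3.1 base score from metric names."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss * 0.9731 - 0.02) ** 13
        pr_weight = PR_CHANGED[pr]
    else:
        impact = 6.42 * iss
        pr_weight = PR_UNCHANGED[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10))


# Metrics as listed in this advisory.
score = base_score("NETWORK", "HIGH", "HIGH", "NONE", True, "NONE", "HIGH", "NONE")
print(score)  # 5.8
```

With these inputs the impact sub-score is roughly 3.99 and the exploitability sub-score roughly 1.31. Their sum, scaled by 1.08 for the changed scope, is about 5.72, which rounds up to the listed 5.8.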
AIVSS
Base Score: 3.8