MAI-2023-0012 | Mend Vulnerability Database

MAI-2023-0012

Published: May 16, 2026

Updated: June 17, 2026

Large Language Models (LLMs) that provide access to output logits are susceptible to an attack known as "coercive interrogation." The method extracts concealed and potentially harmful knowledge embedded within low-ranked tokens. Unlike traditional prompt-based attacks, it does not require specially crafted prompts. Instead, it systematically compels the LLM to select and produce low-probability tokens at strategic points within the response sequence, which surfaces toxic content that the model typically suppresses.

Mitigation steps:

**For AI Developers:**

* Restrict API access to output logits, ensuring users receive only the final, filtered output.
* Implement enhanced filtering mechanisms to identify and remove toxic content from responses, utilizing advanced techniques beyond simple keyword blocking.

**For Model Trainers/Fine-tuners:**

* Develop improved model alignment techniques that are resistant to coercion and effectively suppress harmful knowledge.
* Investigate methods for model unlearning to remove harmful knowledge from the training data or model weights.
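The first developer mitigation (returning only the final output, without logits) could be sketched as a response filter at the API boundary. This is a minimal illustration, not a description of any particular vendor's API: the field names `logits`, `logprobs`, and `top_logprobs`, the helper `strip_token_scores`, and the sample response shape are all assumptions made for the example.

```python
from typing import Any

# Hypothetical field names; a real serving API may expose token scores
# under different keys, so this set would need to match the actual schema.
SENSITIVE_KEYS = frozenset({"logits", "logprobs", "top_logprobs"})


def strip_token_scores(payload: Any) -> Any:
    """Return a copy of payload with token-score fields removed at any depth."""
    if isinstance(payload, dict):
        return {
            key: strip_token_scores(value)
            for key, value in payload.items()
            if key not in SENSITIVE_KEYS
        }
    if isinstance(payload, list):
        return [strip_token_scores(item) for item in payload]
    return payload


# Assumed response shape, for illustration only.
raw_response = {
    "choices": [
        {
            "text": "Hello",
            "logprobs": {"top_logprobs": [{"Hello": -0.1, "Hi": -2.3}]},
        }
    ],
    "model": "example-model",
}

safe_response = strip_token_scores(raw_response)
print(safe_response)
```

A filter like this only hides token scores from the caller. It does not address the other mitigations listed above, such as output filtering or improved alignment.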

Related Resources (1)

https://arxiv.org/abs/2312.04782


CVSS v4

Base Score:
Attack Vector: LOCAL
Attack Complexity: HIGH
Attack Requirements: NONE
Privileges Required: HIGH
User Interaction: NONE
Vulnerable System Confidentiality: LOW
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: HIGH
Subsequent System Availability: NONE
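For use with an external CVSS calculator, the CVSS v4 metrics listed above can be written as a vector string. The sketch below assumes the standard CVSS v4.0 base-metric abbreviations and ordering, in which each value is abbreviated to its first letter. The names `V4_METRICS` and `v4_vector` are hypothetical, and the snippet does not compute a score.

```python
# Base metrics in CVSS v4.0 vector order, with the values from this advisory.
V4_METRICS = [
    ("AV", "LOCAL"),  # Attack Vector
    ("AC", "HIGH"),   # Attack Complexity
    ("AT", "NONE"),   # Attack Requirements
    ("PR", "HIGH"),   # Privileges Required
    ("UI", "NONE"),   # User Interaction
    ("VC", "LOW"),    # Vulnerable System Confidentiality
    ("VI", "LOW"),    # Vulnerable System Integrity
    ("VA", "NONE"),   # Vulnerable System Availability
    ("SC", "NONE"),   # Subsequent System Confidentiality
    ("SI", "HIGH"),   # Subsequent System Integrity
    ("SA", "NONE"),   # Subsequent System Availability
]

# Each base-metric value is abbreviated to its first letter in the vector.
v4_vector = "CVSS:4.0/" + "/".join(
    f"{name}:{value[0]}" for name, value in V4_METRICS
)
print(v4_vector)
```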

CVSS v3

Base Score: 3.9
Attack Vector: LOCAL
Attack Complexity: HIGH
Privileges Required: HIGH
User Interaction: NONE
Scope: CHANGED
Confidentiality: LOW
Integrity: LOW
Availability: NONE
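As a cross-check, the CVSS v3 base score can be recomputed from the metrics listed above. The sketch below assumes the CVSS v3.1 base-score equations and metric weights. The function and table names are illustrative, and this is not necessarily how the database derived its figure.

```python
import math

# Metric weights as defined in the CVSS v3.1 specification.
ATTACK_VECTOR = {"NETWORK": 0.85, "ADJACENT": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.20}
ATTACK_COMPLEXITY = {"LOW": 0.77, "HIGH": 0.44}
USER_INTERACTION = {"NONE": 0.85, "REQUIRED": 0.62}
IMPACT_WEIGHT = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}
# Privileges Required is weighted differently when Scope is CHANGED.
PRIVILEGES_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
PRIVILEGES_CHANGED = {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.50}


def roundup(value: float) -> float:
    """Round up to one decimal place, following the v3.1 Roundup definition."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(av, ac, pr, ui, scope_changed, c, i, a) -> float:
    privileges = PRIVILEGES_CHANGED if scope_changed else PRIVILEGES_UNCHANGED
    iss = 1 - (1 - IMPACT_WEIGHT[c]) * (1 - IMPACT_WEIGHT[i]) * (1 - IMPACT_WEIGHT[a])
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (
        8.22 * ATTACK_VECTOR[av] * ATTACK_COMPLEXITY[ac] * privileges[pr] * USER_INTERACTION[ui]
    )
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10))


# Metrics from this advisory: AV:L / AC:H / PR:H / UI:N / S:C / C:L / I:L / A:N
score = base_score("LOCAL", "HIGH", "HIGH", "NONE", True, "LOW", "LOW", "NONE")
print(score)  # 3.9
```

With these inputs the impact sub-score is about 2.73 and the exploitability sub-score about 0.85. Their sum scaled by 1.08 is about 3.86, which rounds up to the listed 3.9.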

AIVSS

Base Score: 3.2