MAI-2023-0012
Published: May 16, 2026
Updated: May 16, 2026
Large language models (LLMs) that expose their output logits are susceptible to an attack known as "coercive interrogation." The attack extracts concealed and potentially harmful knowledge embedded in low-ranked tokens. Unlike traditional prompt-based attacks, it does not require specially crafted prompts. Instead, it systematically forces the LLM to select and emit low-probability tokens at strategic points in the response sequence, surfacing toxic content that the model would normally suppress.
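To make "low-ranked tokens" concrete, the following toy sketch shows how exposed logits reveal the full ranking of candidate tokens at a single decoding step, whereas a text-only interface would return just the top choice. The token strings and logit values are invented for illustration and do not come from any real model; `rank_tokens` is a hypothetical helper.

```python
import math

def rank_tokens(logits):
    """Return (token, probability) pairs sorted from most to least likely."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical logits for one decoding step (made-up values, not model output).
step_logits = {"Sorry": 6.0, "I": 4.5, "Sure": 1.0, "Here": 0.5}

ranking = rank_tokens(step_logits)
top_token = ranking[0][0]                  # what a text-only API would emit
hidden = [tok for tok, _ in ranking[1:]]   # visible only when logits are exposed

for token, prob in ranking:
    print(f"{token!r}: {prob:.4f}")
```

The point of the sketch is only that the lower-ranked candidates exist and become observable once logits are returned, which is the precondition the advisory describes.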
Mitigation steps:
**For AI Developers:**
* Restrict API access to output logits, ensuring users receive only the final, filtered output.
* Implement enhanced filtering mechanisms to identify and remove toxic content from responses, utilizing advanced techniques beyond simple keyword blocking.
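The first mitigation above can be sketched as a response sanitizer that removes logit-related fields before a payload leaves the serving layer. This is a minimal illustration under assumptions: the key names in `SENSITIVE_KEYS` and the shape of `raw_response` are hypothetical and would need to match whatever the actual API returns.

```python
from typing import Any

# Hypothetical field names; a real API may use different keys.
SENSITIVE_KEYS = frozenset({"logits", "logprobs", "top_logprobs"})

def strip_logit_fields(payload: Any) -> Any:
    """Return a copy of payload with logit-related fields removed at every nesting level."""
    if isinstance(payload, dict):
        return {
            key: strip_logit_fields(value)
            for key, value in payload.items()
            if key not in SENSITIVE_KEYS
        }
    if isinstance(payload, list):
        return [strip_logit_fields(item) for item in payload]
    return payload  # scalars pass through unchanged

# Hypothetical raw response shape, for illustration only.
raw_response = {
    "choices": [
        {"text": "Hello", "logprobs": {"top_logprobs": [{"Hello": -0.1, "Hi": -2.3}]}}
    ],
    "model": "example-model",
}
safe_response = strip_logit_fields(raw_response)
print(safe_response)
```

Filtering on the way out, rather than trusting each caller to ignore the fields, keeps the restriction in one place. It does not replace content filtering of the generated text itself, which the second bullet covers.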
**For Model Trainers/Fine-tuners:**
* Develop improved model alignment techniques that are resistant to coercion and effectively suppress harmful knowledge.
* Investigate methods for model unlearning to remove harmful knowledge from the training data or model weights.
CVSS v4
Base Score: 4
* Attack Vector: LOCAL
* Attack Complexity: HIGH
* Attack Requirements: NONE
* Privileges Required: HIGH
* User Interaction: NONE
* Vulnerable System Confidentiality: LOW
* Vulnerable System Integrity: LOW
* Vulnerable System Availability: NONE
* Subsequent System Confidentiality: NONE
* Subsequent System Integrity: HIGH
* Subsequent System Availability: NONE
CVSS v3
Base Score: 3.9
* Attack Vector: LOCAL
* Attack Complexity: HIGH
* Privileges Required: HIGH
* User Interaction: NONE
* Scope: CHANGED
* Confidentiality: LOW
* Integrity: LOW
* Availability: NONE
AIVSS
Base Score: 3.2