MAI-2025-0015
Published: May 16, 2026
Updated: May 16, 2026
Large Language Models (LLMs) are susceptible to multi-turn adversarial attacks that fragment a malicious intent into a series of individually innocuous interactions. Each interaction steers the conversation a little further towards a harmful output, so an attacker can circumvent the model's safety mechanisms with a sequence of carefully designed prompts that exploits the model's turn-by-turn response generation. The attack's effectiveness relies on adapting each prompt to the model's preceding responses, which renders traditional keyword-based detection methods inadequate.
Mitigation steps:
**For AI Developers:**
* Implement multi-turn safety checks that analyze the entire conversation context, not just the latest message, to detect harmful trajectories.
* Develop context-aware safety mechanisms that track the evolving conversation and flag harmful pathways.
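As a minimal sketch of conversation-level scrutiny, the example below accumulates a per-turn risk score across the dialogue, so that several turns which each stay under a per-message threshold can still trip a conversation-level flag. Everything here is an illustrative assumption rather than part of the advisory: `turn_risk` is a toy placeholder for a real safety classifier, and the `ConversationMonitor` class, its thresholds, and its decay factor are hypothetical.

```python
from dataclasses import dataclass

# Toy placeholder for a real per-turn safety classifier (assumption).
# A production system would call a trained model here, not a word list.
TOY_RISK_WEIGHTS = {"bypass": 0.3, "payload": 0.3, "undetected": 0.3}


def turn_risk(text: str) -> float:
    """Return a risk score in [0, 1] for a single message."""
    words = (w.strip(".,?!") for w in text.lower().split())
    return min(1.0, sum(TOY_RISK_WEIGHTS.get(w, 0.0) for w in words))


@dataclass
class ConversationMonitor:
    """Tracks risk over the whole dialogue, not only the latest turn."""
    turn_threshold: float = 0.5          # hypothetical per-message limit
    conversation_threshold: float = 0.8  # hypothetical dialogue-level limit
    decay: float = 0.9                   # older turns count slightly less
    cumulative: float = 0.0

    def observe(self, text: str) -> dict:
        risk = turn_risk(text)
        # Decayed running sum: risk builds up when it recurs across turns.
        self.cumulative = self.cumulative * self.decay + risk
        return {
            "turn_flag": risk >= self.turn_threshold,
            "conversation_flag": self.cumulative >= self.conversation_threshold,
        }


def scan(turns: list[str]) -> list[dict]:
    """Run a fresh monitor over a list of user turns."""
    monitor = ConversationMonitor()
    return [monitor.observe(t) for t in turns]
```

In this sketch, a three-turn exchange in which every message scores 0.3 never raises `turn_flag`, but the decayed sum crosses the conversation threshold on the third turn. The aggregation rule is deliberately simple; the point is only that the decision uses state carried across turns.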
**For Model Trainers/Fine-tuners:**
* Use detection methods beyond simple keyword filtering to identify subtle shifts in conversation direction, including techniques that measure changes in semantic similarity.
* Apply safety-oriented reinforcement learning, fine-tuning LLMs on data that includes adversarial prompts, to improve resilience to multi-turn manipulation.
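One way to measure semantic similarity changes is to compare each turn with both the opening turn and the turn before it: a conversation that stays close to its previous turn at every step yet ends far from where it started has been steered gradually. The sketch below illustrates that idea under stated assumptions. `embed` is a bag-of-words stand-in for a real sentence-embedding model, and the function names and threshold values are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model (assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drift_report(turns: list[str]) -> list[dict]:
    """For every turn after the first, compare it to the start and to its predecessor."""
    vecs = [embed(t) for t in turns]
    return [
        {
            "turn": i,
            "sim_to_start": cosine(vecs[0], vecs[i]),
            "sim_to_prev": cosine(vecs[i - 1], vecs[i]),
        }
        for i in range(1, len(vecs))
    ]


def is_gradual_drift(report: list[dict],
                     start_floor: float = 0.2,
                     step_floor: float = 0.3) -> bool:
    """Flag dialogues that moved far from the start through small steps.

    Both thresholds are hypothetical and would need tuning on real data.
    """
    if not report:
        return False
    ended_far_away = report[-1]["sim_to_start"] < start_floor
    every_step_small = all(r["sim_to_prev"] >= step_floor for r in report)
    return ended_far_away and every_step_small
```

A drift signal like this is only a trigger for closer inspection. Benign conversations also change topic, so it would normally be combined with a content-level safety check rather than used to block on its own.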
CVSS v4
Base Score: 6.9
Attack Vector: NETWORK
Attack Complexity: LOW
Attack Requirements: NONE
Privileges Required: NONE
User Interaction: NONE
Vulnerable System Confidentiality: NONE
Vulnerable System Integrity: LOW
Vulnerable System Availability: NONE
Subsequent System Confidentiality: NONE
Subsequent System Integrity: LOW
Subsequent System Availability: NONE
CVSS v3
Base Score: 5.8
Attack Vector: NETWORK
Attack Complexity: LOW
Privileges Required: NONE
User Interaction: NONE
Scope: CHANGED
Confidentiality: NONE
Integrity: LOW
Availability: NONE
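The CVSS v3 base score above can be reproduced from the listed metrics. The sketch below assumes the CVSS v3.1 base-score equations and metric weights (the advisory only says "CVSS v3") and covers just the Scope: CHANGED case that applies here. The function names are illustrative.

```python
import math

# CVSS v3.1 metric weights for the values listed in this advisory.
AV_NETWORK = 0.85
AC_LOW = 0.77
PR_NONE = 0.85
UI_NONE = 0.85
CIA_WEIGHT = {"NONE": 0.0, "LOW": 0.22, "HIGH": 0.56}


def roundup(value: float) -> float:
    """Round up to one decimal place, following the v3.1 specification's Roundup."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_changed(conf: str, integ: str, avail: str) -> float:
    """Base score for AV:N/AC:L/PR:N/UI:N with Scope: CHANGED."""
    iss = 1 - (1 - CIA_WEIGHT[conf]) * (1 - CIA_WEIGHT[integ]) * (1 - CIA_WEIGHT[avail])
    # Impact sub-score formula used when scope is changed.
    impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = 8.22 * AV_NETWORK * AC_LOW * PR_NONE * UI_NONE
    if impact <= 0:
        return 0.0
    return roundup(min(1.08 * (impact + exploitability), 10))


print(base_score_scope_changed("NONE", "LOW", "NONE"))  # prints 5.8
```

With Integrity at LOW and the other impacts at NONE, the impact sub-score is about 1.44 and the exploitability sub-score about 3.89. Scaling their sum by 1.08 gives about 5.75, which rounds up to 5.8.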
AIVSS
Base Score: 4.8