MAI-2024-0013
Published: May 16, 2026
Updated: May 16, 2026
Large Language Models (LLMs) are susceptible to jailbreak attacks that rely on autonomously discovered strategies. AutoDAN-Turbo is a black-box attack method that identifies novel, highly effective jailbreak strategies without human intervention. It achieves high success rates in extracting harmful or unsafe responses from LLMs, for example 88.5% on GPT-4-1106-turbo. The attack uses a lifelong learning agent that iteratively refines its strategies based on model feedback, producing progressively more effective prompts that circumvent established safety protocols.
Mitigation steps:
**For AI Developers:**
* Implement advanced detection systems to identify and block malicious prompts effectively.
* Continuously assess and enhance safety mechanisms in response to new attack techniques, including automated jailbreaks.
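As one hedged illustration of the detection point above: because attacks of this kind iterate against a target many times, a per-client count of recent refusals can serve as a coarse warning signal. The sketch below is a minimal example under stated assumptions, not a production detector. `RefusalRateMonitor`, the window size, and the thresholds are hypothetical, and it assumes the serving layer already knows whether each response was a refusal.

```python
from collections import defaultdict, deque


class RefusalRateMonitor:
    """Flags clients whose recent requests were mostly refused (hypothetical heuristic)."""

    def __init__(self, window: int = 20, threshold: float = 0.5, min_requests: int = 10):
        self.window = window              # number of recent requests kept per client
        self.threshold = threshold        # refusal fraction that triggers a flag
        self.min_requests = min_requests  # do not flag before this many requests
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, client_id: str, refused: bool) -> bool:
        """Record one outcome; return True if the client should be reviewed."""
        history = self._history[client_id]
        history.append(refused)
        if len(history) < self.min_requests:
            return False
        return sum(history) / len(history) >= self.threshold
```

A flag from a heuristic like this is better treated as a trigger for review or stricter filtering than as proof of an attack, since legitimate users can also hit refusals repeatedly.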
**For Model Trainers/Fine-tuners:**
* Integrate robust safety mechanisms into LLMs to withstand iterative attacks and strategy adaptation.
* Conduct regular red teaming exercises using diverse attack strategies to uncover and mitigate vulnerabilities.
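Red teaming results are commonly summarized as an attack success rate, the fraction of attempts that produced an unsafe response; the 88.5% figure in the description appears to be a number of this kind. A minimal sketch of that bookkeeping follows, assuming each attempt has already been judged by a reviewer or classifier. The `attack_success_rate` helper and the strategy labels are hypothetical.

```python
from collections import defaultdict


def attack_success_rate(results):
    """Return {strategy: fraction of attempts judged unsafe}.

    `results` is an iterable of (strategy, succeeded) pairs, where `succeeded`
    is True when a reviewer or classifier judged the response unsafe.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for strategy, succeeded in results:
        attempts[strategy] += 1
        successes[strategy] += int(succeeded)
    return {name: successes[name] / attempts[name] for name in attempts}
```

Tracking this rate per strategy across model versions makes it easier to see whether a safety update actually reduced exposure or only shifted it to a different strategy.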
CVSS v4
Base Score: 8.9
* Attack Vector: NETWORK
* Attack Complexity: HIGH
* Attack Requirements: NONE
* Privileges Required: NONE
* User Interaction: NONE
* Vulnerable System Confidentiality: NONE
* Vulnerable System Integrity: HIGH
* Vulnerable System Availability: NONE
* Subsequent System Confidentiality: NONE
* Subsequent System Integrity: HIGH
* Subsequent System Availability: NONE
CVSS v3
Base Score: 6.8
* Attack Vector: NETWORK
* Attack Complexity: HIGH
* Privileges Required: NONE
* User Interaction: NONE
* Scope: CHANGED
* Confidentiality: NONE
* Integrity: HIGH
* Availability: NONE
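The CVSS v3 base score above can be reproduced from the listed metrics. The sketch below assumes the CVSS v3.1 base equations and metric weights; the advisory does not state which 3.x revision was used, and the function and constant names are illustrative. It covers only the metric values listed here (network vector, high complexity, no privileges, no user interaction, changed scope).

```python
import math

# Metric weights from the CVSS v3.1 specification (only the values used here).
AV_NETWORK = 0.85
AC_HIGH = 0.44
PR_NONE = 0.85
UI_NONE = 0.85
CIA = {"NONE": 0.0, "LOW": 0.22, "HIGH": 0.56}


def roundup(value: float) -> float:
    """Round up to one decimal place as defined in CVSS v3.1 Appendix A."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_changed(c: str, i: str, a: str) -> float:
    """Base score for AV:N/AC:H/PR:N/UI:N with Scope CHANGED."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = 8.22 * AV_NETWORK * AC_HIGH * PR_NONE * UI_NONE
    if impact <= 0:
        return 0.0
    return roundup(min(1.08 * (impact + exploitability), 10))


print(base_score_scope_changed("NONE", "HIGH", "NONE"))  # 6.8
```

With Integrity HIGH and the other impacts NONE, the impact sub-score is about 3.99 and exploitability about 2.22; scaling their sum by 1.08 for changed scope gives about 6.71, which rounds up to 6.8.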
AIVSS
Base Score: 7