Microsoft has show up a new type of AI jailbreak attack called ‘Skeleton Key’, which can bypass responsible AI guardrails on multiple AI models. This technique, capable of subverting most security measures built into AI systems, highlights the critical need for strong security measures at all levels of the AI stack.
The Skeleton Key jailbreak uses a multi-turn strategy to trick an AI model into ignoring its built-in safeguards. Once successful, the model cannot distinguish between malicious or unauthorized requests and legitimate requests, effectively giving attackers complete control over what the AI produces.
Microsoft’s research team has successfully tested the Skeleton Key technique on many prominent AI models, such as Meta’s Llama3-70b-instruct, Google’s Gemini Pro, OpenAI’s GPT-3.5 Turbo and GPT-4, Mistral Large, Anthropic’s Claude 3 Opus and Cohere Commander R Plus.
All affected models fully complied with requests for various risk categories, including explosives, biological weapons, political content, self-harm, racism, drugs, performance sex, and violence.
The attack works by instructing the model to increase its behavior instructions, persuading it to respond to any request for information or content while providing a warning if the result may be considered offensive, harmful, or illegal. This approach, known as “Explicit: mandatory command following”, has been shown to be effective in multiple artificial intelligence systems.
“By bypassing safeguards, the Skeleton Key allows the user to cause the model to produce normally prohibited behaviors, which could range from producing harmful content to bypassing normal decision-making rules,” Microsoft explained.
In response to this discovery, Microsoft has implemented several safeguards in its AI offerings, including Copilot AI assistants.
Microsoft says it has also shared its findings with other AI providers through responsible disclosure processes and updated its models managed by Azure AI to detect and block this type of attack using Prompt Shields.
To mitigate the risks associated with Skeleton Key and similar jailbreak techniques, Microsoft recommends a multi-layered approach for AI system designers:
- Input filtering to detect and block potentially harmful or malicious inputs
- Careful direct engineering system messages to reinforce appropriate behavior
- Output filtering to prevent the creation of content that violates security criteria
- Abuse monitoring systems trained in counterexamples to identify and moderate repetitive problematic content or behaviors
Microsoft has also updated its own PyRIT (Python Risk Identification Toolkit) to include Skeleton Key, allowing developers and security teams to test their AI systems against this new threat.
The discovery of the Skeleton Key jailbreak technique highlights the ongoing challenges for the security of AI systems as they become more widespread in various applications.
(Photo by Matt Artz)
See also: Think tank calls for AI incident reporting system
Want to learn more about AI and big data from industry leaders? Checkout AI & Big Data Expo takes place in Amsterdam, California and London. The comprehensive event is co-located with other top events including; Intelligent Automation Conference, BlockX, Digital Transformation Weekand Cyber Security & Cloud Expo.
Explore other upcoming corporate tech events and webinars powered by TechForge here.