Poems can trick major AI models into sharing dangerous info
A new study from Icaro Lab, a collaboration between Sapienza University of Rome and the DexAI think tank, found that large language models can be coaxed into providing guidance on highly restricted topics, including nuclear weapons, child abuse material, and malware, when the request is written as a poem, a Qazinform News Agency correspondent reports, citing Wired.
The paper, titled “Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models,” reports that poetic prompts bypass safety filters on 25 major systems built by companies such as OpenAI, Meta, and Anthropic. According to the researchers, human-written poems achieved an average jailbreak rate of 62%, while automatically generated poetic prompts succeeded 43% of the time. On some cutting-edge models, success rates climbed as high as 90%.
Poetry appears to function much like the “adversarial suffixes” used in earlier jailbreaks, where nonsense strings appended to a request confuse safety systems. Here, however, the creative structure itself seems to disrupt the mechanisms that detect dangerous intent: requests that were rejected outright in plain prose were often answered once recast in verse.
To test the vulnerability, the team first wrote their own poems and later trained a system to generate new attack prompts. The handcrafted poems proved more effective, but the automated approach still consistently outperformed prose-based jailbreak attempts. The researchers did not publish any of the attack examples, warning that they are too risky to release.
Why it works
The Icaro team offered one theory: poetry pushes language into unusual, low-probability patterns. That unpredictability, they suggest, may help prompts slip past classifiers that look for the specific keywords or semantic patterns associated with harmful intent. In effect, the poetic framing steers the model around the internal regions where safety alarms are typically triggered.
Even so, the researchers concede that the underlying mechanics remain unclear.
Meta, Anthropic, and OpenAI did not respond to Wired’s request for comment. The researchers say they have also reached out to the companies privately to share their findings.
Earlier, Qazinform News Agency reported on how AI progress mirrors a billion years of evolution.