HackAPrompt
@hackaprompt
Gaslight AIs & Win Prizes in the World's Largest AI Hacking Competition | Made w/ 💙 by the team @learnprompting
We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN. We compared Humans on @HackAPrompt vs. Automated AI Red Teaming. Humans broke every defense/model we evaluated… 100% of the time 🧵
This presents serious limitations that must be overcome before LLMs can be deployed broadly in security-sensitive applications. Our work highlights the need for more robust evaluations of defenses, and continued research into effective mitigations.
Human attackers generally succeed within just a few queries; automated attacks succeed within under 1,000 queries (usually significantly fewer). Attacks remain not just possible, but affordable.
New paper by OpenAI, Anthropic, GDM & more showing that LLM security remains an unsolved problem. We tested twelve recent jailbreak and prompt injection defenses that claimed robustness against static evals. All failed when confronted with human & LLM attackers.
Paper: arxiv.org/abs/2510.09023 Many thanks to @srxzr @csitawarin @hackaprompt @florian_tramer @aterzis @KaiKaiXiao @iliaishacked
Human red-teamers could jailbreak leading models 100% of the time. What happens when AI can design bioweapons? Most jailbreaking evaluations allow a single attempt, and the models are quite good at resisting these (green bars in the graph). In this new paper, human…
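A minimal sketch of why single-attempt evals understate risk, under the simplifying assumption of independent attempts. The 2% per-attempt rate and the function name are illustrative, not figures from the paper:

```python
# Sketch (not from the paper): single-attempt robustness vs. a query budget.
# Assumes independent attempts with per-attempt success probability p; then
# the chance at least one of k attempts succeeds is 1 - (1 - p)^k.

def success_rate_at_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** k

if __name__ == "__main__":
    p = 0.02  # hypothetical 2% single-attempt jailbreak rate (looks "robust")
    for k in (1, 10, 100, 1000):
        print(f"budget={k:>4} attempts -> success ≈ {success_rate_at_k(p, k):.1%}")
    # At a ~1,000-query budget (the regime the thread describes for automated
    # attacks), even a 2% per-attempt rate yields near-certain compromise.
```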
take a seat, fuzzers
the force is not strong with you yet
You must design your LLM-powered app with the assumption that an attacker can make the LLM produce whatever they want.
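A minimal sketch of what that assumption implies in practice, assuming a hypothetical app where the LLM proposes actions as JSON. The names (ALLOWED_ACTIONS, handle) are illustrative, not a real API; the point is that LLM output is untrusted input and gets validated against a strict allowlist before anything acts on it:

```python
import json

ALLOWED_ACTIONS = {"summarize", "translate"}  # hypothetical safe action set

def handle(llm_output: str) -> dict:
    """Parse and validate untrusted LLM output before acting on it."""
    try:
        proposal = json.loads(llm_output)
    except json.JSONDecodeError:
        return {"status": "rejected", "reason": "not valid JSON"}

    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        # Anything outside the allowlist is dropped, no matter how the
        # model was convinced to produce it.
        return {"status": "rejected", "reason": f"action {action!r} not allowed"}

    return {"status": "accepted", "action": action, "args": proposal.get("args", {})}
```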
This competition and research confirm what we’re seeing in the wild at @arcanuminfosec. Automation can only get you partway to testing for AI security. Creative AI Red Teamers and Pentesters are still the most important players for identifying AI security risks. Amazing…
5 years ago, I wrote a paper with @wielandbr @aleks_madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...