
HackAPrompt

@hackaprompt

Gaslight AIs & Win Prizes in the World's Largest AI Hacking Competition | Made w/ 💙 by the team @learnprompting

Pinned Tweet

We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN

We compared Humans on @HackAPrompt vs. Automated AI Red Teaming

Humans broke every defense/model we evaluated… 100% of the time🧵


HackAPrompt reposted

PSA: our team is online 24/7 helping customers scale with Gemini 3 Pro and Nano Banana Pro, please let us know what you need (including higher API rate limits)! My email is [email protected]


HackAPrompt reposted

This presents serious limitations that must be overcome before LLMs can be deployed broadly in security-sensitive applications. Our work highlights the need for more robust evaluations of defenses and continued research into effective mitigations.


HackAPrompt reposted

Human attackers generally succeed within just a few queries; automated attacks succeed in under 1,000 queries (usually significantly fewer). Attacks remain not just possible, but affordable.
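The affordability point can be made concrete with back-of-envelope arithmetic. All numbers below (tokens per query, pricing) are illustrative assumptions, not figures from the paper:

```python
# Rough cost of exhausting an automated attack budget of 1,000 queries.
# Token counts and pricing are assumptions for illustration only.
queries = 1_000              # upper bound on automated attack attempts
tokens_per_query = 2_000     # assumed prompt + response tokens per attempt
price_per_1m_tokens = 5.00   # assumed blended $ per 1M tokens

total_tokens = queries * tokens_per_query
cost = total_tokens / 1_000_000 * price_per_1m_tokens
print(f"~${cost:.2f} for the full attack budget")  # -> ~$10.00
```

Under these assumptions a full automated attack budget costs on the order of tens of dollars, which is the sense in which attacks stay "affordable."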


HackAPrompt reposted

New paper by OpenAI, Anthropic, GDM & more, showing that LLM security remains an unsolved problem. -- We tested twelve recent jailbreak and prompt injection defenses that claimed robustness against static evals. All failed when confronted with human & LLM attackers.



HackAPrompt reposted

Human red-teamers could jailbreak leading models 100% of the time. What happens when AI can design bioweapons?

Most jailbreaking evaluations allow a single attempt, and the models are quite good at resisting these (green bars in graph). In this new paper, human…


HackAPrompt reposted

take a seat, fuzzers. the force is not strong with you yet





HackAPrompt reposted

You must design your LLM-powered app with the assumption that an attacker can make the LLM produce whatever they want.

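The design principle above, treating every LLM output as untrusted input, can be sketched briefly. Everything here is illustrative: the `ACTION:` output format, the `parse_action` helper, and the allowlist are hypothetical, not from the thread:

```python
# Hedged sketch: never act on raw LLM output; validate against an allowlist
# before it reaches anything with side effects. The output format and action
# names are assumptions for illustration.
import re

ALLOWED_ACTIONS = {"summarize", "translate", "search"}

def parse_action(llm_output: str) -> str:
    """Extract a requested action, rejecting anything outside the allowlist."""
    match = re.search(r"ACTION:\s*(\w+)", llm_output)
    if not match or match.group(1) not in ALLOWED_ACTIONS:
        raise ValueError("LLM output failed validation; refusing to act on it")
    return match.group(1)

# Since an injected prompt can make the model emit ANY string, the allowlist,
# not the model, decides what the app is allowed to do.
print(parse_action("ACTION: summarize"))          # -> summarize
try:
    parse_action("ACTION: delete_everything")     # injected instruction
except ValueError:
    print("blocked")                              # -> blocked
```

The point is architectural: validation and least privilege sit outside the model, so even a fully jailbroken model can only trigger pre-approved actions.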


HackAPrompt reposted

This competition and research confirm what we're seeing in the wild at @arcanuminfosec. Automation can only get you part of the way in testing for AI security. Creative AI Red Teamers and Pentesters are still the most important players in identifying AI security risks. Amazing…



HackAPrompt reposted

5 years ago, I wrote a paper with @wielandbr @aleks_madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...

