hackaprompt's profile picture. Gaslight AIs & Win Prizes in the World's Largest AI Hacking Competition | Made w/ 💙 by the team @learnprompting

HackAPrompt

@hackaprompt

Gaslight AIs & Win Prizes in the World's Largest AI Hacking Competition | Made w/ 💙 by the team @learnprompting

ปักหมุด

We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN We compared Humans on @HackAPrompt vs. Automated AI Red Teaming Humans broke every defense/model we evaluated… 100% of the time🧵

hackaprompt's tweet image. We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN

We compared Humans on @HackAPrompt vs. Automated AI Red Teaming

Humans broke every defense/model we evaluated… 100% of the time🧵

HackAPrompt รีโพสต์แล้ว
foundjuliette's tweet image.

HackAPrompt รีโพสต์แล้ว

HackAPrompt รีโพสต์แล้ว

This presents serious limitations that must be overcome before LLMs can be deployed broadly in security sensitive applications. Our work highlights the need for more robust evaluations of defenses, and continued research into effective mitigations.


HackAPrompt รีโพสต์แล้ว

Human attackers generally succeed within just a few queries, automated attacks under 1_000 queries (usually significantly so). Attacks remain not just possible, but affordable.


HackAPrompt รีโพสต์แล้ว

New paper by OpenAI, Anthropic, GDM & more, showing that LLM security remains an unsolved problem. -- We tested twelve recent jailbreak and prompt injection defenses that claimed robustness against static evals. All failed when confronted with human & LLM attackers.

foundjuliette's tweet image. New paper by OpenAI, Anthropic, GDM & more, showing that LLM security remains an unsolved problem. -- We tested twelve recent jailbreak and prompt injection defenses that claimed robustness against static evals. All failed when confronted with human & LLM attackers.

HackAPrompt รีโพสต์แล้ว

We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN We compared Humans on @HackAPrompt vs. Automated AI Red Teaming Humans broke every defense/model we evaluated… 100% of the time🧵

hackaprompt's tweet image. We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN

We compared Humans on @HackAPrompt vs. Automated AI Red Teaming

Humans broke every defense/model we evaluated… 100% of the time🧵

HackAPrompt รีโพสต์แล้ว

Human red-teamers could jailbreak leading models 100% of the time. What happens when AI can design bioweapons? * * * Most jailbreaking evaluations allow a single attempt, and the models are quite good at resisting these (green bars in graph). In this new paper, human…


HackAPrompt รีโพสต์แล้ว

take a seat, fuzzers the force is not strong with you yet

We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN We compared Humans on @HackAPrompt vs. Automated AI Red Teaming Humans broke every defense/model we evaluated… 100% of the time🧵

hackaprompt's tweet image. We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN

We compared Humans on @HackAPrompt vs. Automated AI Red Teaming

Humans broke every defense/model we evaluated… 100% of the time🧵


PointCrow's Funeral


HackAPrompt รีโพสต์แล้ว

You must design your LLM-powered app with the assumption that an attacker can make the LLM produce whatever they want.

We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN We compared Humans on @HackAPrompt vs. Automated AI Red Teaming Humans broke every defense/model we evaluated… 100% of the time🧵

hackaprompt's tweet image. We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN

We compared Humans on @HackAPrompt vs. Automated AI Red Teaming

Humans broke every defense/model we evaluated… 100% of the time🧵


HackAPrompt รีโพสต์แล้ว

This competition and research confirms what we’re seeing in the wild at @arcanuminfosec . Automation can only get you part way to testing for AI security. Creative AI Red Teamers and Pentesters are still the most important players to identify AI security risks. Amazing…

We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN We compared Humans on @HackAPrompt vs. Automated AI Red Teaming Humans broke every defense/model we evaluated… 100% of the time🧵

hackaprompt's tweet image. We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN

We compared Humans on @HackAPrompt vs. Automated AI Red Teaming

Humans broke every defense/model we evaluated… 100% of the time🧵


HackAPrompt รีโพสต์แล้ว

5 years ago, I wrote a paper with @wielandbr @aleks_madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...

florian_tramer's tweet image. 5 years ago, I wrote a paper with @wielandbr @aleks_madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks.

Has anything changed?

Nope...

United States เทรนด์

Loading...

Something went wrong.


Something went wrong.