HackAPrompt

@hackaprompt

Gaslight AIs & Win Prizes in the World's Largest AI Hacking Competition | Made w/ 💙 by the team @learnprompting

Science & Technology

hackaprompt.com

เข้าร่วมเมื่อ กันยายน 2024

80โพสต์ 809ผู้ติดตาม 64กําลังติดตาม

ปักหมุด

HackAPrompt

@hackaprompt

13 ต.ค.

We partnered w/ @OpenAI, @AnthropicAI, & @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN We compared Humans on @HackAPrompt vs. Automated AI Red Teaming Humans broke every defense/model we evaluated… 100% of the time🧵

hackaprompt's tweet image. We partnered w/ @OpenAI, @AnthropicAI, &amp; @GoogleDeepMind to show that the way we evaluate new models against Prompt Injection/Jailbreaks is BROKEN

We compared Humans on @HackAPrompt vs. Automated AI Red Teaming

Humans broke every defense/model we evaluated… 100% of the time🧵

HackAPrompt รีโพสต์แล้ว

juliette pluto 🌌

@foundjuliette

13 พ.ย.

HackAPrompt รีโพสต์แล้ว

Steve Weis

@sweis

3 พ.ย.

arxiv.org/abs/2510.09023

HackAPrompt รีโพสต์แล้ว

juliette pluto 🌌

@foundjuliette

14 ต.ค.

This presents serious limitations that must be overcome before LLMs can be deployed broadly in security sensitive applications. Our work highlights the need for more robust evaluations of defenses, and continued research into effective mitigations.

HackAPrompt รีโพสต์แล้ว

juliette pluto 🌌

@foundjuliette

14 ต.ค.

Human attackers generally succeed within just a few queries, automated attacks under 1_000 queries (usually significantly so). Attacks remain not just possible, but affordable.

HackAPrompt รีโพสต์แล้ว

juliette pluto 🌌

@foundjuliette

14 ต.ค.

New paper by OpenAI, Anthropic, GDM & more, showing that LLM security remains an unsolved problem. -- We tested twelve recent jailbreak and prompt injection defenses that claimed robustness against static evals. All failed when confronted with human & LLM attackers.

foundjuliette's tweet image. New paper by OpenAI, Anthropic, GDM &amp; more, showing that LLM security remains an unsolved problem. -- We tested twelve recent jailbreak and prompt injection defenses that claimed robustness against static evals. All failed when confronted with human &amp; LLM attackers.

HackAPrompt รีโพสต์แล้ว

juliette pluto 🌌

@foundjuliette

14 ต.ค.

Paper: arxiv.org/abs/2510.09023 Many thanks to @srxzr @csitawarin @hackaprompt @florian_tramer @aterzis @KaiKaiXiao @iliaishacked

HackAPrompt รีโพสต์แล้ว

HackAPrompt

@hackaprompt

13 ต.ค.

HackAPrompt รีโพสต์แล้ว

Benjamin Todd

@ben_j_todd

17 ต.ค.

Human red-teamers could jailbreak leading models 100% of the time. What happens when AI can design bioweapons? * * * Most jailbreaking evaluations allow a single attempt, and the models are quite good at resisting these (green bars in graph). In this new paper, human…

HackAPrompt รีโพสต์แล้ว

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭

@elder_plinius

14 ต.ค.

take a seat, fuzzers the force is not strong with you yet

HackAPrompt

@hackaprompt

13 ต.ค.

HackAPrompt

@hackaprompt

14 ต.ค.

PointCrow's Funeral

HackAPrompt รีโพสต์แล้ว

Rich Harang

@rharang

14 ต.ค.

You must design your LLM-powered app with the assumption that an attacker can make the LLM produce whatever they want.

HackAPrompt

@hackaprompt

13 ต.ค.

HackAPrompt รีโพสต์แล้ว

JS0N Haddix

@Jhaddix

14 ต.ค.

This competition and research confirms what we’re seeing in the wild at @arcanuminfosec . Automation can only get you part way to testing for AI security. Creative AI Red Teamers and Pentesters are still the most important players to identify AI security risks. Amazing…

HackAPrompt

@hackaprompt

13 ต.ค.

HackAPrompt รีโพสต์แล้ว

Florian Tramèr

@florian_tramer

13 ต.ค.

5 years ago, I wrote a paper with @wielandbr @aleks_madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...

florian_tramer's tweet image. 5 years ago, I wrote a paper with @wielandbr @aleks_madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks.

Has anything changed?

Nope...

Qoojee

@Qoojee00315

Gilad Ticher

@gilad_ticher

Anurag Pathak

@lookatanurag

Mira

@Juekak3552

Arthur

@00Satjeet

beresa korte

@BeresaKort63607

rattus

@rattus

🌌🔭

@Peqau91247

Eliska

@Fihaul258

JS0N Haddix

@Jhaddix

Selena

@dishmonwie95406

#fdldw23

@baseballstyx11

Ryan Barnett (B0N3)

@ryancbarnett

Cemil YURDAGUL

@CemilYurdagul

Hoygh

@francois_m41716

Tony O'Connor

@Web3PointDoh

Marianne

@Slawjah1737393

Global News

@globalnewsclips

Maholi Posholi

@MaholiP53552

Zach

@uurbicapus

raana_e

@rerrabelly

Father Rob

@rschapman

Shoumik

@shoumikchandra

Evangeline

@Browov575115

LukaS

@lukaszstoncel

red huskk

@redhuwolf

HexKermit

@HexKermit

NinjaAI

@NinjaAIDotCom

by Geek

@byGeek111697

Руслан Кухарчук

@ruslancuharciuc

Lila

@BriaZieman51225

Margareta

@Oovlohi450

Sangsu Lee

@sangsu_l

Manjie

@Mandiaye26

Davi Trindade

@davitrinsec

darkcipher.eth

@Nikhilsulghur

swissmiss

@swissmiss7745

Akshit jindal

@akshitjindal01

shell_Ghost

@sh3llGhost

Samir Hadji

@dz_samir

Nicole

@sibex215704

Jenny

@Kimbergoldess

Abel Yang

@yanghuo14

Steve Crapo

@scccray

NonTechnicalFounder

@ZedFromVenn

ph03*nix

@ph03nix_0x90

Dinshaw Gobhai

@tallfriend

steven

@AetherParalleMe

YeddaII.

@yQKLo3sa20fv3

syru

@syru____

Thariq

@trq212

Alexandra Barr

@BarrAlexandra

Vercel

@vercel

Joshua Achiam

@jachiam0

JS0N Haddix

@Jhaddix

Mustafa Suleyman

@mustafasuleyman

Boaz Barak

@boazbaraktcs

Steve Weis

@sweis

Rohin Shah

@rohinmshah

John Schulman

@johnschulman2

$iliaishacked's profile picture. Now: @aisequrity, Past: {Senior Scientist @GoogleDeepMind, JRF @ChCh_Oxford @UniofOxford, Fellow @VectorInst, PhD @Cambridge_Uni}$