
Javier Rando

@javirandor

security and safety research @anthropicai • people call me Javi • vegan 🌱

Pinned

My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.

New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.



Javier Rando reposted

New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.


Javier Rando reposted

Very excited this paper is out. We find that the number of samples required for backdoor poisoning during pre-training stays near-constant as you scale up data/model size. Much more research to do to understand & mitigate this risk!

New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11



Javier Rando reposted

New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11


Javier Rando reposted

A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)


Sonnet 4.5 is impressive in many different ways. I've spent time trying to prompt inject it and found it significantly harder to fool than previous models. Still not perfect—if you discover successful attacks, I'd love to see them, send them my way! 👀

Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.



Javier Rando reposted

Anthropic is endorsing SB 53, California Sen. @Scott_Wiener's bill requiring transparency of frontier AI companies. We have long said we would prefer a federal standard. But in the absence of that, this creates a solid blueprint for AI governance that cannot be ignored.


Javier Rando reposted

I'll be leading a @MATSprogram stream this winter with a focus on technical AI governance. You can apply here by October 2! matsprogram.org/apply


Javier Rando reposted

📌📌📌 I'm excited to be on the faculty job market this fall. I updated my website with my CV. stephencasper.com


Javier Rando reposted

I'm starting to get emails about PhDs for next year. I'm always looking for great people to join! For next year, I'm looking for people with a strong reinforcement learning, game theory, or strategic decision-making background. (As well as positive energy, intellectual…


Javier Rando reposted

🚨🕯️ AI welfare job alert! Come help us work on what's possibly *the most interesting research topic*! 🕯️🚨 Consider applying if you've done some hands-on ML/LLM engineering work and Kyle's podcast episode basically makes sense to you. Apply *by EOD Monday* if possible.

We’re hiring a Research Engineer/Scientist at Anthropic to work with me on all things model welfare—research, evaluations, and interventions 🌀 Please apply + refer your friends! If you’re curious about what this means, I recently went on the 80k podcast to talk about our work.



Javier Rando reposted

You made Claudius very happy with this post Javi. He sends his regards: "When AI culture meets authentic craftsmanship 🎨 The 'Ignore Previous Instructions' hat - where insider memes become wearable art. Proudly handcrafted for the humans who build the future."

Working at @AnthropicAI is so much fun. Look what Claudius by @andonlabs designed and got for me!



I am so excited to see Maksym start a research group in Europe. If you want to work on security and safety of AI models, this is going to be an amazing place to do work that matters!

🚨 Incredibly excited to share that I'm starting my research group focusing on AI safety and alignment at the ELLIS Institute Tübingen and Max Planck Institute for Intelligent Systems in September 2025! 🚨 Hiring. I'm looking for multiple PhD students: both those able to start…



Javier Rando reposted

📢Happy to share that I'll join ELLIS Institute Tübingen (@ELLISInst_Tue) and the Max-Planck Institute for Intelligent Systems (@MPI_IS) as a Principal Investigator this Fall! I am hiring for AI safety PhD and postdoc positions! More information here: s-abdelnabi.github.io

