davidelson's profile picture. AGI Alignment/Safety @ Google DeepMind. Opinions my own

David Elson

@davidelson

AGI Alignment/Safety @ Google DeepMind. Opinions my own

Gemini 3.0 Frontier Safety report ⬇️

Frontier Safety Framework report for Gemini 3 Pro, presenting risk assessments and evaluation results in CBRN, Cybersecurity, Harmful Manipulation, Machine Learning R&D and Misalignment domains. storage.googleapis.com/deepmind-media…



New paper, following up on our chain-of-thought faithfulness work from a few months ago, about how we can make sure that LLM thoughts are staying faithful and monitorable.

CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵

emmons_scott's tweet image. CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes.

Would we even notice if it starts slipping away? 🧵


New paper showing that when LLMs chew over tough problems, they tend to think clearly and transparently -- making them easier to monitor for bad behavior ⬇️

Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…

emmons_scott's tweet image. Is CoT monitoring a lost cause due to unfaithfulness? 🤔

We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes!

Our finding: "When Chain of Thought is Necessary, Language Models…


David Elson reposted

We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer boards.greenhouse.io/deepmind/jobs/…… Research Scientist boards.greenhouse.io/deepmind/jobs/…


Some promising results on keeping AIs from scheming against you - or at least removing the incentive for them to do this.

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

davlindner's tweet image. New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward?

Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!

Inspired by myopic optimization but better performance – details in🧵


Loading...

Something went wrong.


Something went wrong.