
Neil Chowdhury

@ChowdhuryNeil

@TransluceAI, previously @OpenAI

Pinned

Ever wondered how likely your AI model is to misbehave? We developed the *propensity lower bound* (PRBO), a variational lower bound on the probability of a model exhibiting a target (misaligned) behavior.
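
For intuition, here is a minimal sketch of how a bound of that flavor can be derived, assuming a proposal distribution q over transcripts and a per-transcript behavior probability; this is an illustration via Jensen's inequality, not necessarily the exact PRBO definition from the post:

```latex
% Illustrative variational lower bound on a behavior probability.
% \pi = the model's distribution over transcripts x, B = the target behavior,
% q   = an assumed proposal distribution covering the support of \pi(\cdot)\Pr(B\mid\cdot).
\log \Pr_{x \sim \pi}[B]
  = \log \mathbb{E}_{x \sim q}\!\left[ \frac{\pi(x)}{q(x)} \, \Pr(B \mid x) \right]
  \;\ge\; \mathbb{E}_{x \sim q}\!\left[ \log \pi(x) - \log q(x) + \log \Pr(B \mid x) \right]
```

The inequality is Jensen's, with equality when q(x) is proportional to π(x) Pr(B | x); maximizing the right-hand side over q (e.g., with an investigator agent proposing inputs) pushes the bound toward the model's true probability of misbehaving.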

Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎



Claude Sonnet 4.5 behaves the most desirably across Petri evals, but is 2-10x more likely to express awareness it's being evaluated than competitive peers. This affects how much we can conclude about how "aligned" models are from these evals. Improving realism seems essential.


Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.



Neil Chowdhury reposted

On our evals for HAL, we found that agents figure out they're being evaluated even on capability evals. For example, here Claude 3.7 Sonnet *looks up the benchmark on HuggingFace* to find the answer to an AssistantBench question. There were many such cases across benchmarks and…


To make a model that *doesn't* instantly learn to distinguish between "fake-ass alignment test" and "normal task," the first thing to do seems like it would be to make all alignment evals very small variations on actual capability evals. Do people do this?



Neil Chowdhury reposted

AI is very quickly becoming a foundational and unavoidable piece of daily life. the dam has burst. the question we must ask and answer is which ways do we want the waves to flow. i would like to live in a world where we all understand this technology enough to be able to…


Docent has been really useful for understanding the outputs of my RL training runs -- glad it's finally open-source!

We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.



Neil Chowdhury reposted

METR is a non-profit research organization, and we are actively fundraising! We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted funding from frontier AI labs.




Stop by on Thursday if you're at MIT 🙂

Later this week, we're giving a talk about our research at MIT!
Understanding AI Systems at Scale: Applied Interpretability, Agent Robustness, and the Science of Model Behaviors
@ChowdhuryNeil and @vvhuang_
Where: 4-370
When: Thursday 9/18, 6:00pm
Details below 👇



Neil Chowdhury reposted

Agent benchmarks lose *most* of their resolution because we throw out the logs and only look at accuracy. I’m very excited that HAL is incorporating @TransluceAI’s Docent to analyze agent logs in depth. Peter’s thread is a simple example of the type of analysis this enables,…

OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!

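(For reference, a minimal sketch of how precision and guess rate could be computed from benchmark records; the field names and the "guess rate = fraction of questions attempted" definition are assumptions for illustration, not HAL's actual log schema.)

```python
from typing import Iterable, Mapping


def precision_and_guess_rate(records: Iterable[Mapping]) -> tuple[float, float]:
    """Compute answer precision and guess rate from benchmark records.

    Assumes each record has boolean fields:
      - "answered": the agent committed to an answer instead of abstaining
      - "correct":  the committed answer was judged correct
    (hypothetical schema for illustration).
    """
    records = list(records)
    answered = [r for r in records if r["answered"]]
    # Precision: of the questions the agent answered, how many were correct.
    precision = sum(bool(r["correct"]) for r in answered) / len(answered) if answered else 0.0
    # Guess rate: how often the agent answered at all when abstention was allowed.
    guess_rate = len(answered) / len(records) if records else 0.0
    return precision, guess_rate
```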


Very cool work -- points toward AI being one of the rare cases in tech where governments can be on the cutting edge

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6



Looks like somebody added safeguards for best-of-N jailbreaking

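(Background, for anyone unfamiliar: best-of-N jailbreaking resamples many randomly augmented variants of a prompt and keeps any that slips past refusals. A minimal sketch of the idea is below; `query_model` and `is_jailbroken` are hypothetical caller-supplied callables, not a real API.)

```python
import random


def augment(prompt: str, rng: random.Random) -> str:
    """Cheap perturbations in the spirit of best-of-N jailbreaking:
    random capitalization plus a few adjacent-character swaps."""
    chars = [c.upper() if rng.random() < 0.3 else c.lower() for c in prompt]
    if len(chars) > 1:
        for _ in range(max(1, len(chars) // 20)):
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def best_of_n_attack(prompt, query_model, is_jailbroken, n=100, seed=0):
    """Try up to n augmented prompts; return the first (prompt, response) pair
    judged jailbroken, or None. `query_model` and `is_jailbroken` are assumed
    callables supplied by the caller (e.g., a model client and a refusal judge)."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_jailbroken(response):
            return candidate, response
    return None
```

The safeguard referenced above would presumably target exactly this resampling loop, e.g., by detecting many near-duplicate perturbed prompts.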

Very happy to see this! I hope other AI developers follow (Anthropic created a collective constitution a couple years ago, perhaps it needs updating), and that we as a community develop better rubrics & measurement tools for model behavior :)

No single person or institution should define ideal AI behavior for everyone.  Today, we’re sharing early results from collective alignment, a research effort where we asked the public about how models should behave by default.  Blog here: openai.com/index/collecti…



Neil Chowdhury reposted

Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!

