David D. Baek

@dbaek__

PhD Student @ MIT EECS / Mechanistic Interpretability, Scalable Oversight

David D. Baek reposted

🚨 AI Safety Arms Race: Even after OpenAI’s emergent-misalignment patching, we can easily leverage their SFT API to obtain a Turncoat GPT model (without even adversarial fine-tuning, and it easily bypasses the detection from @johnschulman2’s recent work) that produces even more…


David D. Baek reposted

BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)

David D. Baek reposted

Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal...


David D. Baek reposted

Interested in the science of language models but tired of neural scaling laws? Here's a new perspective: our new paper presents neural thermodynamic laws -- thermodynamic concepts and laws naturally emerge in language model training! AI is naturAl, not Artificial, after all.

David D. Baek reposted

1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regimes (e.g. data scarcity). But we find that SAE probes consistently underperform! Our takeaway: mech interp should use stronger baselines to measure progress 🧵

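To make the comparison in the thread concrete, here is a minimal probing sketch on synthetic data: a logistic-regression probe trained on raw activations (the stronger baseline) versus one trained on SAE latent codes. The data, the random ReLU "SAE" encoder, and all dimensions are illustrative stand-ins, not the paper's actual setup or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in "activations": in practice these would be residual-stream
# vectors from a language model; here they are synthetic (illustrative only).
n, d_model, d_sae = 2000, 64, 512
X = rng.normal(size=(n, d_model))
y = (X @ rng.normal(size=d_model) > 0).astype(int)  # synthetic binary concept

# Stand-in SAE encoder: ReLU(W x + b). A real SAE is trained to reconstruct
# activations sparsely; this random one only illustrates the pipeline shape.
W = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b = 0.1 * rng.normal(size=d_sae)
Z = np.maximum(X @ W + b, 0.0)  # SAE latent codes

Xtr, Xte, Ztr, Zte, ytr, yte = train_test_split(X, Z, y, random_state=0)

baseline = LogisticRegression(max_iter=2000).fit(Xtr, ytr)   # probe on raw activations
sae_probe = LogisticRegression(max_iter=2000).fit(Ztr, ytr)  # probe on SAE latents

print("raw-activation probe acc:", baseline.score(Xte, yte))
print("SAE-latent probe acc:   ", sae_probe.score(Zte, yte))
```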

David D. Baek reposted

(1/N) LLMs represent numbers on a helix? And use trigonometry to do addition? Answers below 🧵
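A toy sketch of the idea the thread teases: if each number is embedded as a helix (a linear coordinate plus cos/sin pairs at several periods), then addition can be performed by rotation via the trigonometric angle-addition identities. The periods below are illustrative choices, not taken from the paper.

```python
import numpy as np

PERIODS = [2, 5, 10, 100]  # illustrative Fourier periods (hypothetical choice)

def helix(a):
    """Embed integer a as a helix: one linear coordinate plus
    cos/sin pairs at several periods."""
    feats = [float(a)]
    for T in PERIODS:
        theta = 2 * np.pi * a / T
        feats += [np.cos(theta), np.sin(theta)]
    return np.array(feats)

def rotate_add(ha, hb):
    """'Clock'-style addition: combine the two helices using
    cos(x+y) = cos x cos y - sin x sin y and
    sin(x+y) = sin x cos y + cos x sin y."""
    out = [ha[0] + hb[0]]  # linear parts add directly
    for i in range(1, len(ha), 2):
        ca, sa = ha[i], ha[i + 1]
        cb, sb = hb[i], hb[i + 1]
        out += [ca * cb - sa * sb, sa * cb + ca * sb]
    return np.array(out)

# Rotating the helices together lands exactly on the helix of the sum:
a, b = 37, 25
assert np.allclose(rotate_add(helix(a), helix(b)), helix(a + b))
```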

