
dron

@_dron_h

math/music/ai nerd | research @GoodfireAI | prev cambridge, bair, polaris | giving a semantics to the syntax

another very neat result! big day for interp ✨

Overall, our results add to a growing picture that feature geometry can be a lot more intricate than mere directions, with dynamics offering a signal to isolate this geometry. E.g., see video below, where we show TFA elicits user vs. assistant manifolds in chat data! (12/14)



this is such a neat result!

One of my favorite parts of this work was looking at the circuits our model implements. It counts "[" characters in Python lists with a single attention head, which simply averages a "[" token detector over the sequence dimension.
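A minimal numpy sketch of the mechanism described in the tweet above: uniform attention averages a "[" detector over the sequence, and rescaling that average by the sequence length recovers the count. The token ids and detector here are illustrative stand-ins, not the paper's actual model.

```python
import numpy as np

def count_via_uniform_attention(token_ids: np.ndarray, open_bracket_id: int) -> float:
    seq_len = len(token_ids)
    detector = (token_ids == open_bracket_id).astype(float)  # "[" detector: 1 on "[" tokens, 0 elsewhere
    attn = np.full(seq_len, 1.0 / seq_len)                   # uniform attention weights
    mean_detector = attn @ detector                          # head output: average detector value
    return mean_detector * seq_len                           # rescale the average back to a count

tokens = np.array([5, 42, 42, 7, 42])  # pretend 42 is the "[" token id
print(count_via_uniform_attention(tokens, open_bracket_id=42))  # 3.0
```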



dron reposted

New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
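A toy sketch of the predictive-coding idea, not the paper's actual Temporal Feature Analysis protocol: fit a linear next-step predictor on activation trajectories, then split each activation into the part explained by the dynamics and a surprise residual. The random activations stand in for real LLM activations.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))             # stand-in for (tokens, d_model) LLM activations

X, Y = acts[:-1], acts[1:]                     # predict the next token's activation from the current one
W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # linear dynamics model: Y ≈ X @ W

predicted = X @ W                              # the part of each activation explained by the dynamics
residual = Y - predicted                       # the "surprise" left over at each step
print("variance explained by dynamics:", 1 - residual.var() / Y.var())
```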


dron reposted

New research: are prompting and activation steering just two sides of the same coin? @EricBigelow @danielwurgaft @EkdeepL and coauthors argue they are: ICL and steering have formally equivalent effects. (1/4)
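A self-contained toy illustrating the mediation intuition behind the claimed equivalence (the paper's formal argument is its own): if in-context examples influence later computation only through the activation shift they induce at some layer, then prepending them (ICL) and adding that shift (steering) are interchangeable downstream. Here this holds by construction; every name is a toy stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
W_down = rng.normal(size=(8, 4))                   # stands in for all layers after the intervention point
downstream = lambda h: np.tanh(h @ W_down)

h_base = rng.normal(size=8)                        # activation at some layer, no context
h_icl = rng.normal(size=8)                         # same position, with in-context examples prepended

delta = h_icl - h_base                             # the "steering vector" the context induces
print(np.allclose(downstream(h_icl), downstream(h_base + delta)))  # True: identical downstream effect
```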


dron reposted

How is memorized data stored in a model? We disentangle MLP weights in LMs and ViTs into rank-1 components based on their curvature in the loss, and find representational signatures of both generalizing structure and memorized training data


dron reposted

LLMs memorize a lot of training data, but memorization is poorly understood. Where does it live inside models? How is it stored? How much is it involved in different tasks? @jack_merullo_ & @srihita_raju's new paper examines all of these questions using loss curvature! (1/7)
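A hedged sketch of the kind of decomposition described above; the paper's exact procedure may differ. We split a toy MLP weight matrix into rank-1 components via SVD and score each by the loss curvature along its direction, estimated with a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))                    # inputs to a toy MLP layer
W = rng.normal(size=(32, 32)) / np.sqrt(32)       # the layer's weight matrix
Y = X @ W + 0.1 * rng.normal(size=(256, 32))      # regression targets (toy)

loss = lambda W_: np.mean((X @ W_ - Y) ** 2)

U, S, Vt = np.linalg.svd(W)                       # W = sum_i S[i] * outer(U[:, i], Vt[i])
eps = 1e-3
for i in range(3):                                # curvature along each rank-1 direction
    D = np.outer(U[:, i], Vt[i])
    curv = (loss(W + eps * D) - 2 * loss(W) + loss(W - eps * D)) / eps**2
    print(f"component {i}: singular value {S[i]:.3f}, loss curvature {curv:.3f}")
```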


my fav result: random forest SAE probes generalize best! this lets us handle real-world distribution shifts from small synthetic datasets to ever-changing production data. this is an area where interpretability seems to really shine. really thoughtful science from the team here


Why use LLM-as-a-judge when you can get the same performance for 15–500x cheaper?

Our new research with @RakutenGroup on PII detection finds that SAE probes:
- transfer from synthetic to real data better than normal probes
- match GPT-5 Mini performance at 1/15 the cost

(1/6)
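A minimal sketch of a random-forest SAE probe, with synthetic stand-ins for the SAE feature activations and PII labels (the real pipeline reads features out of a model's activations via a trained SAE):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_docs, n_feats = 500, 1000
# Sparse stand-in for SAE feature activations: mostly zero, occasionally active.
sae_acts = rng.exponential(1.0, (n_docs, n_feats)) * (rng.random((n_docs, n_feats)) < 0.02)
labels = (sae_acts[:, :5].sum(axis=1) > 0.5).astype(int)   # toy "contains PII" labels tied to 5 features

probe = RandomForestClassifier(n_estimators=100, random_state=0)
probe.fit(sae_acts, labels)
print("train accuracy:", probe.score(sae_acts, labels))
```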



dron reposted

🔎Did someone steal your language model? We can tell you, as long as you shuffled your training data🔀. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?🚨


i joined as a research fellow to train the largest-ever open source SAEs on R1. we don't mess around here! i had autonomy and the team's trust to go and do something real, with mentorship from an insanely talent-dense team. join us! you'll have real impact on our research.

Are you a high-agency, early- to mid-career researcher or engineer who wants to work on AI interpretability? We're looking for several Research Fellows and Research Engineering Fellows to start this fall.



i hope to never write another line of pandas again

Agents for experimental research != agents for software development. This is a key lesson we've learned after several months refining agentic workflows! More takeaways on effectively using experimenter agents + a key tool we're open-sourcing to enable them: 🧵




has been and will continue to be a fun and fruitful partnership!

We're excited to announce a collaboration with @MayoClinic! We're working to improve personalized patient outcomes by extracting richer, more reliable signals from genomic & digital pathology models. That could mean novel biomarkers, personalized diagnostics, & more.




dron reposted

During my summer at Goodfire, I ended up thinking a bit about sparse autoencoder scaling laws, and whether the existence of "feature manifolds" could impact SAE scaling behavior, with @livgorton and @banburismus_ 🙏: arxiv.org/abs/2509.02565

Does making an SAE bigger let you explain more of your model's features? New research from @ericjmichaud_ models SAE scaling dynamics, and explores whether SAEs will pack increasingly many latents onto a few multidimensional features, rather than learning more features.
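For flavor, a tiny sketch of the kind of scaling-law fit such an analysis rests on: fit loss ≈ a · width^(-b) to SAE reconstruction losses across dictionary widths. All numbers here are made up for illustration.

```python
import numpy as np

widths = np.array([1024, 4096, 16384, 65536])      # SAE dictionary sizes (made up)
losses = np.array([0.31, 0.22, 0.16, 0.12])        # reconstruction losses (made up)

slope, log_a = np.polyfit(np.log(widths), np.log(losses), 1)
print(f"loss ~= {np.exp(log_a):.3f} * width^{slope:.3f}")
```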



dron reposted

Excited to share our work digging into how Evo 2 represents species relatedness, or phylogeny. Genetics provides a good quantitative measure of relatedness, so we could use it to probe the model and test whether its internal geometry reflects that structure.

Arc Institute trained their foundation model Evo 2 on DNA from all domains of life. What has it learned about the natural world? Our new research finds that it represents the tree of life, spanning thousands of species, as a curved manifold in its neuronal activations. (1/8)
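A hedged sketch of the probing idea described above, as representational similarity analysis on synthetic data (the real study uses Evo 2 activations and genetic distances): correlate pairwise genetic distances with pairwise distances between per-species model embeddings. The synthetic activations are built to encode relatedness, so the test comes out positive.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_species = 50
traits = rng.normal(size=(n_species, 10))                  # stand-in genetic coordinates per species
genetic_dist = pdist(traits)                               # pairwise "genetic" distances

# Stand-in activations that partially encode relatedness (plus noise),
# playing the role of per-species Evo 2 embeddings.
acts = traits @ rng.normal(size=(10, 128)) + 0.5 * rng.normal(size=(n_species, 128))
act_dist = pdist(acts)                                     # pairwise distances in activation space

rho, p = spearmanr(genetic_dist, act_dist)                 # representational similarity test
print(f"geometry/phylogeny agreement: rho={rho:.2f} (p={p:.1e})")
```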



i saw early versions of this work when i was still in school and it made waiting to join this team very difficult... very cool results! @_MichaelPearce

Arc Institute trained their foundation model Evo 2 on DNA from all domains of life. What has it learned about the natural world? Our new research finds that it represents the tree of life, spanning thousands of species, as a curved manifold in its neuronal activations. (1/8)




dron reposted

What if adversarial examples aren't a bug, but a direct consequence of how neural networks process information? We've found evidence that superposition – the way networks represent many more features than they have neurons – might cause adversarial examples.
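A toy numpy illustration of the intuition, not the paper's construction: when feature directions are packed into fewer dimensions than there are features, they overlap, and the smallest perturbation that flips one feature's readout into an interfering neighbor's shrinks as the interference grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_flip_eps(E: np.ndarray) -> float:
    """Smallest L2 nudge to an input x = E[0] that makes some other feature's
    readout (E[i] @ x) overtake feature 0's, moving along (E[i] - E[0])."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit feature directions
    c = (E[1:] @ E[0]).max()                           # worst-case interference with feature 0
    return np.sqrt((1 - c) / 2)                        # readouts tie at this perturbation size

print("orthogonal features:   ", min_flip_eps(np.eye(10)))                    # ~0.707
print("50 features in 10 dims:", min_flip_eps(rng.normal(size=(50, 10))))     # smaller
```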


dron reposted

New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they crop up only rarely, and are then found by bewildered users. How can we find these efficiently? (1/7)

