
hûn

@cloned_ID

enjoyed 379 world models and counting

hûn reposted

By decomposing weights using loss curvature, you can identify components used for memorization vs generalization. High-curvature = shared mechanisms used across data. Low-curvature = idiosyncratic directions for memorized examples. You can then ablate the memorization weights!

LLMs memorize a lot of training data, but memorization is poorly understood. Where does it live inside models? How is it stored? How much is it involved in different tasks? @jack_merullo_ & @srihita_raju's new paper examines all of these questions using loss curvature! (1/7)
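A rough sketch of the ablation idea in PyTorch. It swaps the paper's weight decomposition for a much cruder diagonal Fisher proxy of loss curvature, so it is only illustrative; `model`, `weight`, `data_loader`, and `loss_fn` are hypothetical placeholders.

```python
import torch

def ablate_low_curvature(model, weight, data_loader, loss_fn, keep_frac=0.9):
    """Zero out the lowest-curvature entries of one weight tensor.

    Curvature is approximated by the diagonal Fisher (mean squared
    gradient over a data sample) -- a crude stand-in for the paper's
    actual decomposition."""
    fisher = torch.zeros_like(weight)
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        fisher += weight.grad.pow(2)
    # High curvature ~ shared/generalizing; low curvature ~ memorized.
    n_drop = weight.numel() - int(keep_frac * weight.numel())
    threshold = fisher.flatten().kthvalue(max(n_drop, 1)).values
    with torch.no_grad():
        weight[fisher <= threshold] = 0.0
```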



hûn reposted

Our work on how to initialize transformers at the edge of chaos so as to improve trainability is now published in @PhysRevE. See also thread: x.com/SuryaGanguli/s… This initialization scheme was used in OLMO2: x.com/SuryaGanguli/s…

Researchers analyzed forward signal and gradient back propagation in deep, randomly-initialized transformers and proposed simple, geometrically-meaningful criteria for hyperparameter initialization that ensure the trainability of deep transformers. 🔗 go.aps.org/3WVMa4R

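For flavor, a generic depth-aware initialization in the same spirit: shrink residual-branch output projections by 1/sqrt(2L) so that summing 2L residual contributions leaves activation variance O(1) through the stack. This is an assumption-laden sketch, not the paper's actual criteria, and the `out_proj`/`fc2` name matching is hypothetical.

```python
import math
import torch.nn as nn

def edge_of_chaos_init(model: nn.Module, n_layers: int, d_model: int):
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue  # skip biases and norm gains
        std = 1.0 / math.sqrt(d_model)
        # Residual-branch outputs get an extra 1/sqrt(2L) so depth
        # neither amplifies nor kills forward signal and gradients.
        if "out_proj" in name or "fc2" in name:  # hypothetical names
            std /= math.sqrt(2 * n_layers)
        nn.init.normal_(p, mean=0.0, std=std)
```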


hûn reposted

OpenAI's Stargate supercluster


hûn reposted

Anthropic, Dec 2024 vs May 2025


Sonnet 4.5 games alignment evaluations a lot, enough to cause serious doubts. This seems caused by training on alignment evals (eval awareness could be reinforced as it yields better apparent behavior). I expect eval awareness to be an increasingly big problem going forward. 1/



hûn reposted

FP16 can have a smaller training-inference gap than BFloat16, and thus is a better fit for RL. Even the differences between RL algorithms vanish once FP16 is adopted. Surprising!

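The precision half of this is easy to see numerically: FP16 keeps 10 mantissa bits to BF16's 7, so it rounds same-scale values roughly 8x more finely, at the cost of a much narrower exponent range. A quick check:

```python
import torch

x = torch.randn(1_000_000)  # float32 reference activations

for dtype in (torch.float16, torch.bfloat16):
    err = (x - x.to(dtype).float()).abs().mean()
    print(dtype, f"mean round-trip error: {err:.2e}")

# fp16's error comes out ~8x smaller, so training and inference kernels
# that round slightly differently stay much closer in fp16 -- the
# training-inference gap referred to above.
```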


hûn reposted

Training an LLM to predict temperature and top-P parameters for its tokens. It is simply trained using autoregressive loss. Hmm.

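One plausible wiring, purely as a guess at what this could look like (the paper's actual architecture and losses may differ): small heads on the hidden state emit per-token temperature and top-p, and the temperature trains end-to-end simply by rescaling the logits inside the ordinary autoregressive cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplingParamHeads(nn.Module):
    """Hypothetical per-token heads for temperature and top-p."""
    def __init__(self, d_model: int):
        super().__init__()
        self.temp = nn.Linear(d_model, 1)
        self.top_p = nn.Linear(d_model, 1)

    def forward(self, hidden):
        t = F.softplus(self.temp(hidden)).squeeze(-1) + 1e-3  # > 0
        p = torch.sigmoid(self.top_p(hidden)).squeeze(-1)     # in (0, 1)
        return t, p

def autoregressive_loss(logits, temperature, targets):
    # Temperature rescales logits inside the usual cross-entropy, so it
    # gets gradients for free; top-p here would only act at sampling
    # time (a simplification of whatever the paper actually does).
    scaled = logits / temperature.unsqueeze(-1)
    return F.cross_entropy(scaled.flatten(0, 1), targets.flatten())
```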

hûn reposted

New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.


hûn reposted

LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)

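The injectivity claim is easy to poke at empirically on a toy model: map every short prompt to its last hidden state and look for collisions. A sketch with a stand-in recurrent "LM" (the paper works with real transformers and also gives a recovery algorithm; this only illustrates the no-collision property):

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)
V, D = 20, 16  # toy vocab size and hidden width

emb = nn.Embedding(V, D)
rnn = nn.GRU(D, D, batch_first=True)  # stand-in for an LLM

@torch.no_grad()
def last_hidden(tokens):
    out, _ = rnn(emb(torch.tensor([tokens])))
    return out[0, -1]

seen = {}
for prompt in itertools.product(range(V), repeat=2):  # all length-2 prompts
    key = tuple(last_hidden(list(prompt)).round(decimals=5).tolist())
    assert key not in seen, f"collision: {prompt} vs {seen[key]}"
    seen[key] = prompt
print(f"no collisions among {len(seen)} prompts")
```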

hûn reposted

LLMs, trained only on text, might already know more about other modalities than we realized; we just need to find ways to elicit it. Project page: sophielwang.com/sensory w/ @phillip_isola and @thisismyhat


hûn reposted

The use of hedging words in Science papers has been steadily decreasing


hûn reposted

This is a really cool & creative technique. I'm very excited about "training models to verbalize information about their own cognition" as a new frontier in interpretability research.

New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵



hûn reposted

New research paper with Anthropic and Thinking Machines

AI companies use model specifications to define desirable behaviors during training. Are model specs clearly expressing what we want models to do? And do different frontier models have different personalities?

We generated…


hûn reposted

Great question. The last chess paper (pnas.org/doi/10.1073/pn…) did not involve a language model - it's just a pure AZ chess-playing machine (output is a chess move) that we extracted superhuman chess concepts from. This method assumes a language model, so it is not directly applicable.…


hûn reposted

🔎Did someone steal your language model? We can tell you, as long as you shuffled your training data🔀. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?🚨
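A hedged guess at the shape of such a test (not necessarily the paper's actual statistic): Alice's shuffle is random and secret, so any statistical dependence between it and the suspect model's behavior is evidence of reuse. For instance, a permutation test on per-example loss vs. training position, with toy numbers below standing in for real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000

position = rng.permutation(N)  # Alice's secret training order
# Toy stand-in for losses Alice measures under the suspect model;
# later-seen examples score slightly lower loss if her run was reused.
loss = 1.0 - 2e-5 * position + rng.normal(0, 0.1, N)

observed = np.corrcoef(position, loss)[0, 1]
# Null distribution: correlation if ordering and losses were independent.
null = [np.corrcoef(rng.permutation(position), loss)[0, 1]
        for _ in range(1_000)]
p_value = np.mean(np.abs(null) >= abs(observed))
print(f"corr={observed:.3f}, p={p_value:.3f}")  # tiny p => order leaked in
```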


hûn reposted

What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:


hûn reposted

How does training data shape model behavior? Well, it’s complicated… 1/10


hûn reposted

It looks like AI music is following the same path as AI text: 1) Appears to have passed the Turing Test, people are only 50/50 in identifying older Suno vs. human songs (but 60/40 when two songs are the same genre) 2) Same fast development, new models are getting better quickly.


hûn reposted

Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games. Every finite 2p0s…


Self play works so well in chess, go, and poker because those games are two-player zero-sum. That simplifies a lot of problems. The real world is messier, which is why we haven’t seen many successes from self play in LLMs yet. Btw @karpathy did great and I mostly agree with him!
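The convergence claim is easy to demo on the smallest 2p0s game: fictitious self-play on matching pennies, where each player best-responds to the opponent's empirical mixture and the action frequencies approach the minimax strategy (50/50). A toy sketch:

```python
import numpy as np

payoff = np.array([[1., -1.],
                   [-1., 1.]])  # row player's payoff in matching pennies

counts_row = np.ones(2)
counts_col = np.ones(2)
for _ in range(100_000):
    # Each player best-responds to the opponent's empirical mixture.
    row_br = np.argmax(payoff @ (counts_col / counts_col.sum()))
    col_br = np.argmin((counts_row / counts_row.sum()) @ payoff)
    counts_row[row_br] += 1
    counts_col[col_br] += 1

print(counts_row / counts_row.sum())  # -> ~[0.5, 0.5], the minimax mix
```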



hûn reposted

In this thread, we summarize the new types of prompt injection attacks we uncovered. You can read more details on them in today’s blog: brave.com/blog/unseeable…

