
L (@CodeTitanium)

Pinned

Looks like the Noam tax 💰 did pay off


L reposted

I've written a companion to this record talking about various tricks used in Modded NanoGPT's implementation of Muon. Do check it out! varunneal.github.io/essays/muon

@varunneal has set a new NanoGPT Speedrun WR of 132s with a batch size schedule, achieving an absurd peak of 30 steps/s! github.com/KellerJordan/m…. Read more on his improvements at varunneal.github.io/essays/muon, which include Cautious Weight Decay (@Tim38463182 et al) and Normuon tuning.
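
The batch size schedule itself isn't spelled out in the tweet; see the linked essay for the real thing. As a purely illustrative sketch, a schedule of this kind might look like the following (function name and ramp shape are hypothetical, not from the speedrun code):

    def batch_size_at_step(step: int, total_steps: int,
                           min_bsz: int = 128, max_bsz: int = 1024) -> int:
        # Hypothetical linear ramp: small batches early for cheap, fast steps,
        # growing toward the full batch as training progresses. The actual
        # Modded NanoGPT schedule may differ in shape and endpoints.
        frac = min(step / max(total_steps, 1), 1.0)
        return int(min_bsz + frac * (max_bsz - min_bsz))

    # e.g. batch_size_at_step(0, 1000) -> 128, batch_size_at_step(500, 1000) -> 576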



L reposted

Really excited about our new paper which derives compute-optimal hyperparameter scaling for matrix-preconditioned optimizers. Muon is _particularly_ effective if we scale its hyperparameters properly, with significant gains over Adam in training LLMs!

Optimizers are only as good as our ability to predict good hyperparameters for them when used at scale. With robust hyperparameter transfer, we find Muon and Shampoo consistently beat AdamW by 1.4x and 1.3x in compute for training language models up to 1.4B parameters. 1/n




L reposted

Putnam, the world's hardest college-level math test, ended yesterday at 4pm PT. By noon today, AxiomProver had autonomously solved 9/12 problems in Lean (at 3:58pm PT yesterday, it was 8/12). Our score would've been #1 of ~4000 participants last year and Putnam Fellow (top 5) in recent years.


L reposted

This is incredible performance!

Putnam, the world's hardest undergrad math contest, ended 4pm PT yesterday. By 3:58pm, AxiomProver @axiommathai autonomously solved 8/12 of Putnam2025 in Lean, a 100% verifiable language. Last year, our score would've been #4 of ~4000 and a Putnam Fellow (top 10 in recent yrs)



L reposted

Interesting new variation of Muon called Turbo-Muon by @ThibautBoissin. It uses "almost orthogonal preconditioning" before Newton-Schulz -> reduces number of NS iterations needed to orthogonalize by 1 -> savings in per-iteration complexity. arxiv.org/abs/2512.04632
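
For readers unfamiliar with the Newton-Schulz (NS) step being discussed, here is a minimal sketch of the quintic Newton-Schulz orthogonalization used in Muon-style optimizers (coefficients follow the widely used Modded NanoGPT values; Turbo-Muon's extra preconditioning step is not shown):

    import torch

    def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Approximately maps G to the nearest semi-orthogonal matrix (U V^T from
        # its SVD) via a quintic Newton-Schulz iteration.
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (G.norm() + 1e-7)           # scale so singular values are <= 1
        transposed = X.size(0) > X.size(1)
        if transposed:
            X = X.T                         # work with the smaller Gram matrix
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * (A @ A)) @ X
        return X.T if transposed else X

    # Turbo-Muon's claim: a cheap "almost orthogonal" preconditioner applied to G
    # first lets `steps` drop by one for similar orthogonalization quality.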


L reposted

For anyone interested in understanding orthogonalization/spectral-based methods, here are the slides from our #neurips25 oral, which I tried to make more broadly about the topic.


L reposted

Opus 4.5 is too good to be true. I think we've reached the "more than good enough" level; everything beyond this point may even be too much.


L reposted

what the actual fuck how are people not talking about the fact that they found bio-essential sugars on an asteroid


L reposted

🤔

Attending @NeurIPSConf? Stop by our poster "Do Language Models Use Their Depth Efficiently?" with @chrmanning and @ChrisGPotts today at poster #4011 in Exhibit Hall C, D, E from 4:30pm.



L reposted

GPT-5 generated the key insight for a paper accepted to Physics Letters B, a serious and reputable peer-reviewed journal.

This is the key step in which GPT5 proposes the main idea of the paper, completely de novo:



L reposted

We need a talent tracker for all DeepSeek staff. The rest aren't fodder whatsoever; there are only so many people that China sends to the IOI.

Yi Qian = IOI Gold Medal and ICPC Gold Medal
Yuxiang Luo = IOI Gold Medal and ICPC Gold Medal
Yuyang Zhou = IOI Gold Medal and ICPC Gold Medal
Zuofan Wu = IOI Gold Medal
Zhizhen Ren = IOI Gold Medal
Xiangwen Wang = ICPC Gold Medal and 3401 Codeforces rating 👑



L reposted

My message to labs is simple: Please multiply your reward function by some multiplicative factor of the form 10/log(total used token count) and subtract an additive factor of the form 10/log(number of reasoning tokens used after the correct answer is first mentioned in CoT).
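
Taken literally, the proposed shaping would look like the toy sketch below (names are made up and the guards against log(1) = 0 are mine, not part of the tweet):

    import math

    def shaped_reward(base_reward: float, total_tokens: int,
                      tokens_after_answer: int) -> float:
        # Multiplicative factor of the form 10 / log(total token count), minus an
        # additive factor of the form 10 / log(tokens emitted after the correct
        # answer first appears in the chain of thought).
        total = max(total_tokens, 2)         # avoid log(1) = 0
        after = max(tokens_after_answer, 2)  # avoid log(1) = 0
        return base_reward * (10.0 / math.log(total)) - 10.0 / math.log(after)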


L reposted

New paper studies when spectral gradient methods (e.g., Muon) help in deep learning:

1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices are low-stable-rank.
2. We then explain why spectral methods can perform well despite this.

Long thread
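
Stable rank, the quantity in point 1, is a standard measure of effective rank; a quick illustrative check (not code from the paper) is:

    import torch

    def stable_rank(A: torch.Tensor) -> torch.Tensor:
        # Stable rank = ||A||_F^2 / ||A||_2^2. It never exceeds rank(A), and a
        # small value means a few singular values dominate, i.e. the matrix is
        # effectively low rank even if numerically full rank.
        fro_sq = torch.linalg.matrix_norm(A, ord='fro') ** 2
        spec_sq = torch.linalg.matrix_norm(A, ord=2) ** 2
        return fro_sq / spec_sq

    # Example: stable_rank(torch.randn(4096, 768)) is large, while activations
    # with one dominant direction give a value near 1.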


L reposted

Couldn’t resist

We just tested Aleph prover on this version of Erdős problem #124 and were able to prove it in less than 2.5 hours and under $200 in cost: gist.github.com/winger/a2c27e4… x.com/vladtenev/stat…



L reposted

DeepSeek V3.2 finally corrected the advantage calculation (in red). But they still keep the biased length norm term (in blue), which can lead to unhealthy response length behavior as we showed in arxiv.org/pdf/2503.20783. Check also tinker-docs.thinkingmachines.ai/losses to see how TML…
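
For readers following along, here is a minimal sketch of the two pieces under debate: the group-relative advantage and the per-response length normalization. This is the generic GRPO formulation, not DeepSeek's exact V3.2 implementation:

    import torch

    def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # GRPO advantage: normalize each response's reward against the group of
        # responses sampled for the same prompt.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    def response_loss(token_ratios: torch.Tensor, advantage: torch.Tensor,
                      length_normalize: bool = True) -> torch.Tensor:
        # Sum of per-token policy-gradient terms for one response; every token
        # shares the sequence-level advantage. Dividing by the response length
        # (length_normalize=True) is the "length norm" term the thread argues
        # biases training toward particular response lengths.
        loss = -(token_ratios * advantage).sum()
        if length_normalize:
            loss = loss / token_ratios.numel()
        return loss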


🚀 Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents!

🔹 DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.
🔹 DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.

📄 Tech…



L reposted

no merging silliness. It is just a DSM-V2 plus everything else


incorporates == merge ?



L reposted

NEW DEEPSEEK MODEL RELEASE 🚨

Two models:
v3.2 - GPT-5 level performance
v3.2 speciale - Gemini-3.0-Pro level performance, gold medal level for IMO, CMO, ICPC World Finals & IOI 2025

GRPO with modifications:
- unbiased KL estimate
- off-policy sequence masking
- keep MoE…
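
On the "unbiased KL estimate" bullet: the usual low-variance unbiased estimator in this setting is the k3 form from Schulman's KL-approximation note. Sketch below, under the assumption that this is the estimator meant (check the tech report for the exact form):

    import torch

    def kl_estimate_k3(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
        # Per-token estimate of KL(policy || ref) from samples drawn by the policy.
        # With r = p_ref / p_policy:  k3 = r - log r - 1.
        # E_policy[r] = 1, so the estimator is unbiased, and it is nonnegative
        # pointwise, giving lower variance than the naive -log r estimator.
        log_ratio = logp_ref - logp_policy
        return torch.exp(log_ratio) - log_ratio - 1.0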



L reposted

The theorem statement details always matter a lot, and the ruling on this one is that it solved a version of the problem, but a harder version is still open. It's still very exciting, though.

