Fusheng Liu
@mathlfs
PhD student @ National University of Singapore
mathematicallfs@gmail.com

Fusheng Liu reposted

@Kimi_Moonshot:
🚀 Introducing our new tech report: Muon is Scalable for LLM Training

We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale

✨ Highlights:
• ~2x computational efficiency vs AdamW…
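As a reading aid, here is a minimal NumPy sketch of the two tweaks named above as I understand them: decoupled weight decay folded into the Muon update, and a shape-aware per-parameter scale so the update RMS stays comparable to AdamW's. The Newton-Schulz coefficients follow Keller Jordan's open-source Muon; the 0.2·sqrt(max(fan_out, fan_in)) factor is my assumption of what "per-parameter update scale" means here, not a quote from the report.

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G using the quintic Newton-Schulz
    iteration from the open-source Muon implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.1):
    """One Muon-style step on a 2-D weight matrix.

    Assumed details (my reading, not quoted from the report):
    - decoupled, AdamW-style weight decay: W <- W - lr * wd * W
    - per-parameter scale 0.2 * sqrt(max(fan_out, fan_in)) so the update
      RMS roughly matches AdamW's and AdamW learning rates can be reused.
    """
    momentum = beta * momentum + grad          # heavy-ball momentum buffer
    O = newton_schulz_orth(momentum)           # orthogonalized update direction
    scale = 0.2 * np.sqrt(max(W.shape))        # shape-aware rescaling (assumption)
    W = W - lr * (scale * O + weight_decay * W)
    return W, momentum

# toy usage: one 256x512 linear layer with a random gradient
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(256, 512))
m = np.zeros_like(W)
W, m = muon_step(W, rng.normal(size=W.shape), m)
```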

Fusheng Liu reposted

@deepseek_ai:
🚀 Introducing DeepSeek-V3!

Biggest leap forward yet:
⚡ 60 tokens/second (3x faster than V2!)
💪 Enhanced capabilities
🛠 API compatibility intact
🌍 Fully open-source models & papers

🐋 1/n

Highly recommend this user-friendly project if you are starting with LM pretraining and want to build your own model/optimizer. The repo is easy to understand, easy to edit, and easy to implement new ideas in with minimal effort. Well done Keller! Looking forward to your records on ViT :)

I enjoy getting NanoGPT training speed records. I’m also interested in making my formulation of NanoGPT speedrunning an accessible benchmark on which other people find it easy to try new ideas. To that end, I have tried to keep the code of the current record short, and minimize…



Fusheng Liu reposted

@danielhanchen:
Fixed a bug which caused all training losses to diverge for large gradient accumulation sizes.

1. First reported by @bnjmn_marie, GA is supposed to be mathematically equivalent to full batch training, but losses did not match.
2. We reproed the issue, and further investigation…
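For readers who have not hit this: a minimal sketch of one way gradient accumulation can silently drift from full-batch training, namely averaging per-micro-batch mean losses when the micro-batches contain different numbers of tokens. (That this normalization detail is the bug fixed above is my reading of the thread, not a confirmed statement.)

```python
import numpy as np

rng = np.random.default_rng(0)

# token-level losses for 4 micro-batches of *different* lengths
micro_batches = [rng.random(n) for n in (3, 5, 17, 7)]

# full-batch objective: mean loss over ALL tokens
full_batch = np.concatenate(micro_batches).mean()

# naive accumulation: average of per-micro-batch means
naive_ga = np.mean([mb.mean() for mb in micro_batches])

# corrected accumulation: sum the losses, divide once by the total token count
weighted_ga = sum(mb.sum() for mb in micro_batches) / sum(len(mb) for mb in micro_batches)

print(f"full batch  : {full_batch:.6f}")
print(f"naive GA    : {naive_ga:.6f}  (biased when lengths differ)")
print(f"weighted GA : {weighted_ga:.6f}  (matches full batch)")
assert np.isclose(weighted_ga, full_batch)
```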

Mamba at ICLR :)


Fusheng Liu reposted

@TacoCohen:
Harm's Law of Smol Models (HLSM) tells us how much we need to scale up the data size (k_D) as we scale down the model size (k_N), if we wish to preserve the loss of a Chinchilla-optimal model.
harmdevries.com/post/model-siz…
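To make the law concrete, here is a small numerical sketch assuming the Chinchilla parametric loss L(N, D) = E + A·N^(-α) + B·D^(-β), which is what the linked post builds on: the required data multiplier k_D solves L(k_N·N, k_D·D) = L(N, D). The constants and the operating point below are illustrative assumptions from memory of the Chinchilla fits, not values taken from the post.

```python
import numpy as np

# Chinchilla-style parametric loss L(N, D) = E + A * N**-alpha + B * D**-beta
# (constants roughly as fitted by Hoffmann et al.; treat them as illustrative)
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def k_data(k_N, N, D):
    """Data multiplier k_D keeping L(k_N*N, k_D*D) equal to L(N, D)."""
    bracket = 1.0 + (A * N**-alpha) / (B * D**-beta) * (1.0 - k_N**-alpha)
    if bracket <= 0:
        raise ValueError("model too small: no amount of data recovers the loss")
    return bracket ** (-1.0 / beta)

# illustrative Chinchilla-ish operating point: 70B params, 1.4T tokens
N, D = 70e9, 1.4e12
for k_N in (0.75, 0.5, 0.3):
    print(f"k_N = {k_N:.2f}  ->  k_D ~ {k_data(k_N, N, D):.2f}")
```

With these numbers, halving the model size (k_N = 0.5) asks for roughly 1.7x the data, and shrinking to 30% asks for roughly 3x, which is the qualitative trade-off the tweet is pointing at.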

Fusheng Liu reposted

Afraid of sum-of-squares (SOS) relaxations? Read this new blog post for a smooth ride in the Fourier domain. francisbach.com/sums-of-square…

