I crossed an interesting threshold yesterday, which I think many other mathematicians have been crossing recently as well. In the middle of trying to prove a result, I identified a statement that looked true and that would, if true, be useful to me. 1/3
you see that new attention variant? you’re going to be writing our custom CUDA kernel for it
Insane result, and while the smaller training-inference gap makes sense, I cannot believe it has such a huge effect
FP16 can have a smaller training-inference gap than BFloat16 and thus fits RL better. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
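One intuition for the claim (my reading, not the paper's experiment): BF16 keeps FP32's exponent range but only 7 mantissa bits, while FP16 has 10 mantissa bits, so values that stay inside FP16's range are rounded about 8x more finely, and the training-side numerics track the inference-side numerics more closely. A minimal sketch that measures the round-trip rounding error on stand-in logits:

```python
# Minimal sketch (not from the paper): compare rounding error of fp16 vs bf16
# on values in a typical logit range, as one intuition for why fp16 can track
# inference-time numerics more closely than bf16.
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float64) * 5.0  # stand-in for logits

def roundtrip_err(x, dtype):
    return (x - x.to(dtype).to(torch.float64)).abs().mean().item()

print("fp16 mean abs rounding error:", roundtrip_err(x, torch.float16))
print("bf16 mean abs rounding error:", roundtrip_err(x, torch.bfloat16))
# fp16 has 10 mantissa bits vs bf16's 7, so for values well inside fp16's
# dynamic range its rounding error is roughly 8x smaller -- the hypothesized
# source of the smaller training-inference gap in the quoted result.
```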
between minimax ditching linear attention and kimi finding a variant that outperforms full attention I’m completely confused
Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi…
You see: - a new arch that is better and faster than full attention verified with Kimi-style solidness. I see: - Starting with inferior performance even on short contexts. Nothing works and nobody knows why. - Tweaking every possible hyper-parameter to grasp what is wrong. -…
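For context on what a linear-attention "delta" variant like KDA roughly looks like: below is a hedged sketch of a gated delta-rule recurrence, the family KDA is reported to belong to. The shapes, gating, and normalization are illustrative assumptions, not Moonshot's kernel; the point is only that a fixed-size state replaces the growing KV cache.

```python
# Hedged sketch of a gated delta-rule linear-attention recurrence (the family
# KDA is reported to belong to). Shapes, gating, and scaling are illustrative
# assumptions, not the open-sourced KDA kernel.
import torch

def gated_delta_attention(q, k, v, beta, gate):
    """q, k: (T, d_k); v: (T, d_v); beta, gate: (T,) in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k)                # fixed-size recurrent state, no KV cache
    outs = []
    for t in range(T):
        kt, vt, qt = k[t], v[t], q[t]
        S = gate[t] * S                      # decay the old state
        pred = S @ kt                        # what the state currently predicts for key kt
        S = S + beta[t] * torch.outer(vt - pred, kt)   # delta-rule (error-correcting) write
        outs.append(S @ qt)
    return torch.stack(outs)                 # (T, d_v)

T, d_k, d_v = 16, 8, 8
q, k, v = (torch.randn(T, d) for d in (d_k, d_k, d_v))
beta = torch.sigmoid(torch.randn(T))
gate = torch.sigmoid(torch.randn(T) + 3.0)   # decay factors close to 1
print(gated_delta_attention(q, k, v, beta, gate).shape)  # torch.Size([16, 8])
```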
7/12 You can clearly see the scaling in performance: With just 1 loop (T=1), the model performs poorly. At T=4 loops, its reasoning and math capabilities are much improved, reaching SOTA levels. More “thinking” directly translates to stronger capabilities.
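The quoted text does not name the model, so the following is only the generic mechanism behind "T loops": a weight-tied block applied T times, which buys extra effective depth and compute per token without extra parameters. A minimal sketch under that assumption:

```python
# Hedged sketch of weight-tied looping: the same transformer block is applied
# T times, so T=4 spends 4x the compute of T=1 with identical parameter count.
# This is the generic mechanism, not the specific architecture in the thread.
import torch
import torch.nn as nn

class LoopedEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, T=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.T = T

    def forward(self, x):
        for _ in range(self.T):   # reuse the same weights on every loop
            x = self.block(x)
        return x

x = torch.randn(2, 10, 256)
print(LoopedEncoder(T=1)(x).shape, LoopedEncoder(T=4)(x).shape)
```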
> you are
> MiniMax M2 pretrain lead
> now everyone’s asking why you went full attention again
> “aren’t you supposed to be the efficiency gang?”
> efficient attention? yeah we tried
> we trained small hybrids
> looked good
> matched full attention on paper
> scaled up
>…
MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behalf of pre-training lead Haohai Sun. (zhihu.com/question/19653…) I. Introduction As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock…
You can do TRM/HRM with reinforcement learning!
HRM-Agent: Using the Hierarchical Reasoning Model in Reinforcement Learning Paper: arxiv.org/abs/2510.22832 The Hierarchical Reasoning Model (HRM) has impressive reasoning abilities given its small size, but has only been applied to supervised, static, fully-observable problems.
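For readers new to HRM: its core is a two-timescale recurrence, with a fast low-level module updated every step and a slow high-level module updated every K steps. The schematic below is my paraphrase of that structure (names, sizes, and the GRU cells are illustrative), not the HRM-Agent code or its RL training loop.

```python
# Schematic of the two-timescale recurrence used by HRM-style models: a fast
# low-level RNN updates every step, a slow high-level RNN updates every K
# steps and re-contextualizes the fast one. Purely illustrative.
import torch
import torch.nn as nn

class TwoTimescaleCore(nn.Module):
    def __init__(self, d_in=32, d=64, K=4):
        super().__init__()
        self.low = nn.GRUCell(d_in + d, d)   # fast module: sees input + slow state
        self.high = nn.GRUCell(d, d)         # slow module: summarizes the fast one
        self.K = K

    def forward(self, obs_seq):              # obs_seq: (T, B, d_in)
        B = obs_seq.shape[1]
        d = self.low.hidden_size
        z_low = obs_seq.new_zeros(B, d)
        z_high = obs_seq.new_zeros(B, d)
        for t, obs in enumerate(obs_seq):
            z_low = self.low(torch.cat([obs, z_high], dim=-1), z_low)
            if (t + 1) % self.K == 0:         # slow update every K fast steps
                z_high = self.high(z_low, z_high)
        return z_low, z_high                  # states a policy/value head would consume

core = TwoTimescaleCore()
print([z.shape for z in core(torch.randn(8, 2, 32))])
```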
                                                                            🎉 Our paper "Tensor Product Attention Is All You Need" has been accepted as NeurIPS 2025 Spotlight (Top 3%)! The Camera Ready version of TPA has been publicly available on the arXiv now: arxiv.org/abs/2501.06425 ⚡️TPA is stronger and faster than GQA and MLA, and is compatible…
                                            🎉Our paper, "Tensor Product Attention Is All You Need," has been accepted for a spotlight presentation at the 2025 Conference on Neural Information Processing Systems (NeurIPS 2025)! See arxiv.org/abs/2501.06425 and github.com/tensorgi/TPA for details!
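A rough sketch of the factorization idea as I read it: instead of caching full per-head keys and values, each token stores small rank-R factors over (heads x head_dim), and K/V are reconstructed as sums of outer products, which shrinks the KV cache. Ranks, scaling, and projections below are illustrative; see arXiv:2501.06425 for the actual TPA formulation.

```python
# Rough sketch of tensor-product factorization for attention keys: cache small
# per-token factors and reconstruct K on the fly. Illustrative only.
import torch
import torch.nn as nn

n_heads, d_head, d_model, R = 8, 64, 512, 2

to_head_factor = nn.Linear(d_model, R * n_heads)   # a_r(x): per-head mixing factor
to_dim_factor  = nn.Linear(d_model, R * d_head)    # b_r(x): per-dimension content factor

x = torch.randn(4, d_model)                        # 4 tokens
a = to_head_factor(x).view(4, R, n_heads)
b = to_dim_factor(x).view(4, R, d_head)

# Cache 'a' and 'b' (R*(n_heads+d_head) = 144 numbers/token) instead of the
# full key tensor (n_heads*d_head = 512 numbers/token), then reconstruct:
K = torch.einsum('trh,trd->thd', a, b) / R         # (tokens, heads, d_head)
print(K.shape)
```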
I have a thing for empirical deep dives into learning dynamics like the one in this paper. Sounds like muP mostly helps early training, while wd affects the long term.
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
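A hedged sketch of the two knobs the thread is about, in an AdamW-style update: muP scales hidden-layer learning rates like base_width/width so a tuned LR transfers from a small proxy model, and "independent" weight decay applies -wd * w instead of -lr * wd * w, so the decay strength does not shrink along with the width-scaled LR. The exact scaling rules vary by layer type; this is not the paper's recipe.

```python
# Toy illustration of muP-style LR scaling plus independent weight decay.
# Constants and the per-layer scaling rule are assumptions for illustration.
base_width, width = 256, 4096
base_lr, wd = 3e-3, 1e-4

lr_hidden = base_lr * base_width / width     # muP: hidden-layer LR transfers from the proxy

def adamw_like_step(w, adam_direction, lr, wd, independent=True):
    """adam_direction = m_hat / (sqrt(v_hat) + eps), computed elsewhere."""
    w = w - lr * adam_direction
    if independent:
        w = w - wd * w          # decay decoupled from the (width-scaled) LR
    else:
        w = w - lr * wd * w     # standard AdamW coupling: decay shrinks with lr
    return w

print(f"hidden-layer lr at width {width}: {lr_hidden:.2e}")
```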
                                                                            I think we're going to start seeing more continual learning techniques like this one that dynamically adapt which parameters are updated -- some really important ideas in here
🧠 How can we equip LLMs with memory that allows them to continually learn new things? In our new paper with @AIatMeta, we show how sparsely finetuning memory layers enables targeted updates for continual learning, w/ minimal interference with existing knowledge. While full…
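An illustrative sketch of the idea as described in the tweet: a memory layer is a large table of learnable slots accessed by top-k key lookup, so a continual-learning step only produces gradients for the retrieved slots, leaving everything else untouched. This is my paraphrase of the setup, not the paper's code.

```python
# Illustrative sketch of sparse memory-layer finetuning: only the top-k
# retrieved slots receive gradient, so updates are targeted with minimal
# interference. Not the paper's implementation.
import torch
import torch.nn as nn

d, n_slots, k = 64, 10_000, 4
keys   = nn.Parameter(torch.randn(n_slots, d) * 0.02)
values = nn.Parameter(torch.randn(n_slots, d) * 0.02)

def memory_layer(h):                       # h: (B, d)
    scores = h @ keys.T                    # (B, n_slots)
    top, idx = scores.topk(k, dim=-1)      # sparse access: k slots per token
    w = top.softmax(-1)                    # (B, k)
    return torch.einsum('bk,bkd->bd', w, values[idx]), idx

h = torch.randn(8, d)
out, idx = memory_layer(h)
loss = out.pow(2).mean()                   # stand-in objective on new data
loss.backward()

# Only the retrieved rows of `values` have nonzero gradient, so an optimizer
# step changes a handful of slots -- the "targeted update" the tweet describes.
print("value rows with gradient:", values.grad.abs().sum(-1).nonzero().numel())
```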
                                                                            well well well... happy perplexity death day
nice
diffusion transformers have come a long way, but most still lean on the old 2021 sd-vae for their latent space. that causes a few big issues: 1. outdated backbones make the architecture more complex than it needs to be. the sd-vae runs at around 450 gflops, while a simple ViT-B…
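A rough sketch of the direction that tweet points at: replace the convolutional 2021-era sd-vae with a plain ViT-style patch autoencoder whose latents a diffusion transformer then models. Sizes and depth below are illustrative, and this is not any specific paper's tokenizer.

```python
# Rough sketch of a ViT-style patch encoder for a diffusion latent space:
# patchify -> a few transformer blocks -> per-patch latent. Illustrative only.
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    def __init__(self, patch=16, d_model=768, d_latent=32, depth=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        block = nn.TransformerEncoderLayer(d_model, 12, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)
        self.to_latent = nn.Linear(d_model, d_latent)

    def forward(self, img):                                      # img: (B, 3, H, W)
        tokens = self.patchify(img).flatten(2).transpose(1, 2)   # (B, N, d_model)
        return self.to_latent(self.blocks(tokens))               # (B, N, d_latent)

z = PatchEncoder()(torch.randn(1, 3, 256, 256))
print(z.shape)   # (1, 256, 32): latents for a diffusion transformer to model
```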
btw this discovery is by Mehtaab Sawhney (msawhney in the picture) mit.edu/~msawhney who has been doing lots of AWESOME work in combinatorics for many years now!
Mehtaab Sawhney — Clay Research Fellow and Assistant Professor at Columbia University. Research in combinatorics, probability, analytic number theory, and theoretical computer science.
gpt5-pro is superhuman at literature search: it just solved Erdos Problem #339 (listed as open in the official database erdosproblems.com/forum/thread/3…) by realizing that it had actually been solved 20 years ago h/t @MarkSellke for pointing this out to me!
Things that have happened since Jan 2025: - Gemini 2.5 Deep Think got a gold medal on the IMO - an unreleased OpenAI model got a gold medal on the IMO and IOI and a perfect score on the ICPC - the same unreleased OpenAI model coded for 9 hours autonomously and got second place in the…
Matt Walsh is now "feeling the AGI." Meanwhile there's been ~zero change in real capabilities since we hit Midwit Wall (120 IQ) in January 2025. It's over. Send all these AI scams to zero prompto. 🚨🐻📉
                                                                            🚀 NorMuon: Muon + Neuron-wise adaptive learning rates: +21.7% training efficiency vs Adam, +11.3% vs Muon on 1.1B pretrain. 🚀 Distributed Normuon: A highly efficient FSDP2 implementation. Paper 👉 arxiv.org/abs/2510.05491 #LLM #AI #DeepLearning #Optimizer
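A hedged sketch of the two ingredients named in that tweet: Muon's Newton-Schulz orthogonalization of the momentum matrix, plus a neuron-wise (per-output-row) adaptive second-moment scale. The exact NorMuon algorithm, constants, and the FSDP2 implementation are in arXiv:2510.05491; treat the helper names and the way the two pieces are combined here as assumptions.

```python
# Toy NorMuon-flavored step: orthogonalize the momentum (Muon), then apply a
# per-neuron adaptive scale. Not the paper's algorithm; illustrative only.
import torch

def newton_schulz_orthogonalize(M, steps=5):
    # Quintic Newton-Schulz iteration commonly used by Muon-style optimizers.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def normuon_like_step(W, G, mom, row_v, lr=0.02, mu=0.95, beta2=0.95, eps=1e-8):
    mom.mul_(mu).add_(G)                                          # momentum on raw gradients
    O = newton_schulz_orthogonalize(mom)                          # Muon: orthogonalized update
    row_v.mul_(beta2).add_((1 - beta2) * O.pow(2).mean(dim=1))    # per-neuron second moment
    O = O / (row_v.sqrt().unsqueeze(1) + eps)                     # neuron-wise adaptive scale
    W.add_(O, alpha=-lr)

W = torch.randn(512, 256); G = torch.randn_like(W)
mom, row_v = torch.zeros_like(W), torch.zeros(512)
normuon_like_step(W, G, mom, row_v)
print(W.shape)
```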
Ok, I was unfair to @jm_alexia. I had no time to read the TRM paper, and I should have. It's a very good-faith attempt to rescue the idea of enabling huge effective depth for a tiny model, without any of the shaky premises or unwarranted approximations of HRM. It may be a real, big finding.
TRM is one of the best papers I've read in the past few years - it truly shows the unfiltered process of a researcher: 1) See awesome paper, get hyped about it 2) Read it - looks cool 3) Run it - doesn't work 4) Find glaring mistakes 5) Fix the issues x.com/jm_alexia/stat…
Steve Jobs predicted ChatGPT in 1985.