
Yu Zhang 🐈🐙
@yzhang_cs
@Kimi_Moonshot; PhD Student @ Soochow University; working on efficient methods for LLMs; disciple of parallel programming; INTP
We’re also shipping fla-core in lock-step with flash-linear-attention: a minimal, forever-in-sync companion package that depends on nothing except triton + torch. Need only the fused Norm, CausalConv, and linear-attention kernels, without worrying about transformers? fla-core is enough. pypi.org/project/fla-co…
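For anyone curious what a fla-core-only setup looks like, here is a rough sketch; the import paths and constructor arguments are assumptions based on the flash-linear-attention repo, so check them against the installed version.

```python
# Rough sketch of a fla-core-only setup (no transformers anywhere in the dependency tree).
# Import paths and constructor signatures below are assumed from the flash-linear-attention
# repo and may differ across versions.
import torch
from fla.modules import RMSNorm, ShortConvolution   # fused norm + causal short conv (assumed paths)
from fla.ops.linear_attn import chunk_linear_attn   # chunked linear-attention kernel (assumed path)

hidden_size = 1024
norm = RMSNorm(hidden_size)                 # assumed: hidden_size as the first argument
conv = ShortConvolution(hidden_size, 4)     # assumed: (hidden_size, kernel_size)
```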
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
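For context, this is roughly the recurrence behind the (gated) delta rule that Gated DeltaNet builds on, written as a naive per-step loop in plain PyTorch for intuition only; the real kernels are chunked and fused, and the exact gating parameterization may differ from what is sketched here.

```python
# Naive per-step form of the gated delta rule (intuition only; production kernels are chunked):
#   S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T,   o_t = S_t q_t
# alpha_t is a decay gate, beta_t a write strength; variable names here are mine.
import torch
import torch.nn.functional as F

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,). Returns outputs of shape (T, d_v)."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)
    outs = []
    for t in range(T):
        qt, kt, vt = q[t], k[t], v[t]
        # decay the old memory, erase what was stored under k_t, then write the new value
        S = alpha[t] * (S - beta[t] * torch.outer(S @ kt, kt)) + beta[t] * torch.outer(vt, kt)
        outs.append(S @ qt)
    return torch.stack(outs)

# toy call with L2-normalized keys, as is common for delta-rule models
o = gated_delta_rule(torch.randn(16, 8), F.normalize(torch.randn(16, 8), dim=-1),
                     torch.randn(16, 8), torch.rand(16), torch.rand(16))
```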
Sharing a preview of Dragon, a new hybrid GDN-Transformer that outperforms traditional architectures! We are an EU 🇪🇺🇫🇷-based company developing foundational models. What's new, what's different? 🧵👇
💪🦾 Agentless Training as Skill Prior for SWE-Agents
We recently released the technical report for Kimi-Dev. Here is the story we’d love to share behind it: (1/7)

⚡️500 TPS on DeepSeek-V3.1 🚀
What if your LLM inference automatically got faster the more you used it? Introducing ATLAS from the Together AI Turbo research team. Read more: togetherai.link/atlas Here’s Together AI Founder and Chief Scientist @tri_dao introducing ATLAS:
🎉 Excited to share that our paper “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models” — the first comprehensive survey on hallucinations in LLMs — has been accepted to Computational Linguistics (CL) and will be presented at EMNLP 2025!…

🙋
The Canadian visa is a joke. So please don't take it seriously. It's their loss.
Why does linear attention need Short Conv? kexue.fm/archives/11320
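The linked post is in Chinese; the pattern it discusses is, roughly, a small depthwise causal convolution applied to the q/k/v projections before the linear-attention mixer, giving every token cheap access to its immediate neighbors. A plain-PyTorch sketch of that pattern (kernel size, activation, and shapes are illustrative, not taken from the post):

```python
# Illustrative "short conv" block: depthwise causal Conv1d over the sequence dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortCausalConv(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)  # depthwise

    def forward(self, x):                          # x: (B, T, D)
        x = x.transpose(1, 2)                      # (B, D, T)
        x = F.pad(x, (self.kernel_size - 1, 0))    # left padding only => causal
        return F.silu(self.conv(x)).transpose(1, 2)

B, T, D = 2, 32, 64
x = torch.randn(B, T, D)
q = ShortCausalConv(D)(x)   # in practice q, k, v each get their own short conv
```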
🚀 New preprint, ExGRPO: Learning to Reason from Experience
On-policy RLVR wastes valuable rollouts after one use. We ask: what makes a reasoning experience truly useful?
🔑 Key ideas:
1. Correctness & entropy as indicators of valuable experience
2. Medium-difficulty questions =…

Big win for @Lei_Wang_1999 on TileLang: an endorsement from the whale!
DeepSeek V3.2 breakdown
1. Sparse attention via a lightning indexer + top_k attention (toy sketch below)
2. Uses V3.1 Terminus + 1T continued-pretraining tokens
3. 5 specialized models (coding, math, etc.) via RL, then distillation for the final ckpt
4. GRPO. Reward functions for length penalty, language…
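On item 1, a toy of the "cheap indexer scores past tokens, softmax attention runs only over the top-k of them" idea. This is not DeepSeek's implementation; the scorer, masking, and shapes are made up for illustration.

```python
# Toy top-k sparse attention: a cheap (here random) index score picks which past tokens
# each query may attend to; full softmax attention then runs only on that subset.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, index_scores, topk=8):
    """q, k, v: (T, d); index_scores: (T, T) float scores from a cheap indexer. Single head, causal."""
    T, d = q.shape
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    index_scores = index_scores.masked_fill(~causal, float('-inf'))
    kth = index_scores.topk(min(topk, T), dim=-1).values[..., -1:]   # per-row k-th best score
    keep = index_scores >= kth                                       # top-k past positions per query
    attn = (q @ k.T) / d ** 0.5
    attn = attn.masked_fill(~(keep & causal), float('-inf'))
    return F.softmax(attn, dim=-1) @ v

T, d = 32, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = topk_sparse_attention(q, k, v, index_scores=torch.randn(T, T), topk=8)
```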

DeepSeek is using TileLang instead of Triton. TileLang is a really elegant language! It also reminds me of this surface-level blog I wrote when first learning about it: it takes fewer than 100 lines of code to reach 630 TFLOPS for the softmax attention forward pass in TileLang (1.3x of FA2).

🛠 Open Source Release
🔗 Model: huggingface.co/deepseek-ai/De…
🔗 Tech report: github.com/deepseek-ai/De…
🔗 Key GPU kernels in TileLang & CUDA (use TileLang for rapid research prototyping!)
4/n
man, how is Kimi so different? It's like a model from a parallel universe.

Some H800 performance results for the Attention Sink kernel in TileLang. Take a look if you want to build some variants :) github.com/tile-ai/tilela…


Sharing our second Connectionism research post on Modular Manifolds, a mathematical approach to refining training at each layer of the neural network
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.…
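To make the idea concrete: one very simple instance of "optimizer plus a manifold constraint on a weight matrix" is to keep a square weight orthogonal by retracting it with a QR decomposition after every optimizer step. This is only a toy of the general idea, not the Modular Manifolds construction from the post.

```python
# Toy manifold-constrained training loop: after each SGD step, retract the weight back
# onto the manifold of orthogonal matrices via QR (with a sign fix for determinism).
import torch

W = torch.nn.Parameter(torch.linalg.qr(torch.randn(64, 64)).Q)  # start on the manifold
opt = torch.optim.SGD([W], lr=1e-2)

def retract_(w):
    q, r = torch.linalg.qr(w.detach())
    w.copy_(q * torch.sign(torch.diagonal(r)))   # fix QR's sign ambiguity

for _ in range(10):
    x = torch.randn(32, 64)
    loss = (x @ W.T).pow(2).mean()               # stand-in objective
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        retract_(W)                              # project back onto the constraint set
```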

(1/6) Triton kernels are a great way to understand ML models, but tutorials are scattered. The learning method for me was just to read real, high-performance code, so I wrote a blog that walks through the design and intuitions behind FLA's softmax attention kernel. Also a thread 🧵
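If you want a tiny warm-up before reading a real kernel, here is the classic row-wise softmax in Triton; it is nowhere near the FLA attention kernel the thread walks through, but it shows the basic "one program per row, masked vector loads" pattern that the bigger kernels build on.

```python
# Minimal Triton row-wise softmax (expects a 2D CUDA tensor whose rows fit in one block).
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, y_ptr, n_cols, x_stride, y_stride, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * x_stride + cols, mask=mask, other=float('-inf'))
    x = x - tl.max(x, axis=0)          # subtract the row max for numerical stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(y_ptr + row * y_stride + cols, y, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](x, y, n_cols, x.stride(0), y.stride(0), BLOCK=BLOCK)
    return y

x = torch.randn(128, 1000, device='cuda')
torch.testing.assert_close(softmax(x), torch.softmax(x, dim=-1))
```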

Check out @nathancgy4's awesome Deltaformer PR and stay tuned for a post on the architecture soon!
SEED's paper on associative memory and DeltaFormer is still one of my favorites 🎉 so I'm happy to share that DeltaFormer is now supported in FLA (flash-linear-attention)! Learned so much from @yzhang_cs and Mingyu.

I really don't want this guy winning..

Why do FFNs use ReLU instead of more precise kernels like Exp? "We propose the following hypothesis: A kernel with lower retrieval precision encourages a more polysemantic key–value memory: multiple unrelated facts can be stored under the same key space." A great and inspiring read!
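The framing in that quote can be written down in a couple of lines: treat the FFN as a key-value memory where the first weight matrix holds keys, the second holds values, and the activation is the retrieval kernel. Toy shapes, purely illustrative.

```python
# FFN as a key-value memory: relu(x @ K^T) retrieves values "softly", so many unrelated
# keys can fire weakly and share slots (polysemanticity); a sharper kernel such as exp
# retrieves more selectively. The weights here are random stand-ins.
import torch
import torch.nn.functional as F

d_model, n_mem = 64, 256
K = torch.randn(n_mem, d_model)    # "keys"   ~ first FFN weight matrix
V = torch.randn(n_mem, d_model)    # "values" ~ second FFN weight matrix
x = torch.randn(d_model)

relu_out = F.relu(x @ K.T) @ V     # lower-precision retrieval
exp_out = torch.exp(x @ K.T) @ V   # a "more precise" kernel in the paper's sense
```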

Still one of the most insane lines I've read in any ML paper.

SwiGLU-style gates working so well for attention (and not just in the FFN layers) is a beautiful thing. As it turns out, the "divine benevolence" might just come down to better inductive biases for controlling where information goes.
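For readers who haven't seen it, "SwiGLU-style gates for attention" typically means multiplying the attention output elementwise by a learned gate such as silu(W_g x) before the output projection; a hedged sketch of that pattern, not any specific model's code:

```python
# Output gating for attention, SwiGLU-style: attn_out is modulated elementwise by a
# learned gate computed from the pre-attention hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionOutput(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_model, bias=False)
        self.w_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, attn_out):    # x: (B, T, D) pre-attention hidden states
        return self.w_out(attn_out * F.silu(self.w_gate(x)))

B, T, D = 2, 32, 64
x, attn_out = torch.randn(B, T, D), torch.randn(B, T, D)
y = GatedAttentionOutput(D)(x, attn_out)
```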

Big day for AI agents! Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents. I’ll walk you through the highlights in this thread. (1/N)
