
Yu Zhang 🐈🐙

@yzhang_cs

@Kimi_Moonshot; PhD Student @ Soochow University; working on efficient methods for LLMs; disciple of parallel programming; INTP

Yu Zhang 🐈🐙 reposted

Weight Decay and Learning Rate from an EMA View kexue.fm/archives/11459 Derives optimal WD and LR schedules from this perspective.


Yu Zhang 🐈🐙 reposted

The 语言即世界 studio has partnered with Weibo on a brand-new program, 《未竟之约》 (The Unfinished Promise). In shifting time and space, expression and observation have no starting point and no end; setting out from this moment, let's head into the unfinished journey together 🎧🎧🎧

zhang_benita's tweet image.

Yu Zhang 🐈🐙 reposted
atasteoff's tweet image. #NeurIPS2025

Yu Zhang 🐈🐙 reposted

Hope you enjoyed yesterday’s poster! We were honored to have @SonglinYang4, @Xinyu2ML, @wen_kaiyue, and many other esteemed researchers visit and share their guidance! 🚀

yifan_zhang_'s tweet image.

🧐🧐🧐

I am the greatest scientist in China in the last 100 years.

Jiankui_He's tweet image.


Yu Zhang 🐈🐙 reposted

I will talk about recent developments in linear attention, like GLA, DeltaNet, GDN, and KDA; the application of linear attention in the latest frontier models like Qwen3-Next and Kimi Linear; its strong coherence with test-time learning; and the exciting future ahead! Also looking…

We're super excited to host our fourth NeurIPS reading group with @Yikang_Shen, including lunch sponsored by @StrikerVP! We'll gather 25-30 researchers to discuss Linear Attention and DeltaNet, including their usage in recent models like Qwen3-Next and Kimi Linear. RSVP below!

DeltaInstitutes's tweet image.


Yu Zhang 🐈🐙 reposted

I will go to NeurIPS 2025 @ San Diego during Dec. 2-7 for my spotlight paper "Tensor Product Attention Is All You Need", and I'm also excited to meet all of you there to discuss anything interesting, exciting, and enlightening about AI, LLMs, and the next trend of innovation.

YIFENGLIU_AI's tweet image.

Yu Zhang 🐈🐙 reposted

Details on the trinity architecture that we settled on!

stochasticchasm's tweet image.

Yu Zhang 🐈🐙 reposted

Open source is probably the new or final form of publication for academia.


Yu Zhang 🐈🐙 reposted

If Gemini-3 proved continual scaling of pretraining, DeepSeek-V3.2-Speciale proves scaling RL with large context. We spent a year pushing DeepSeek-V3 to its limits. The lesson: post-training bottlenecks are solved by refining methods and data, not just by waiting for a better base.

🚀 Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents! 🔹 DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API. 🔹 DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now. 📄 Tech…

deepseek_ai's tweet image.



Yu Zhang 🐈🐙 reposted

let's appreciate this model name :)

victormustar's tweet image.

Yu Zhang 🐈🐙 reposted

enough preamble. The most important part in every Whale paper, as I've said so many times over these years, is “Conclusion, Limitation, and Future Work”. They say: Frontier has no knowledge advantage. Compute is the only serious differentiator left. Time to get more GPUs.

teortaxesTex's tweet image.

Feels like model size is still what's fundamental. I was stunned by Gemini (几米奶) today: it completed the derivation of the parallel form of KDA almost without a single error orz


Yu Zhang 🐈🐙 reposted

Congrats to @Alibaba_Qwen! Great to hear that our Attention Sink study from 2 years ago led to strong architecture improvements and more stable model training😀

🏆 We are incredibly honored to announce that our paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" has received the NeurIPS 2025 Best Paper Award! A huge congratulations to our dedicated research team for pushing the boundaries…

Alibaba_Qwen's tweet image.


Yu Zhang 🐈🐙 reposted

Tencent YouTu Lab just dropped SSA: Sparse Sparse Attention for efficient LLM processing. This new framework for long-context inference achieves state-of-the-art results by explicitly encouraging sparser attention distributions, outperforming existing methods in perplexity across huge…

HuggingPapers's tweet image.

Sleep at 6:00 a.m. 👻

Healthy sleep schedules: Sleep at 8:00PM → Wake up at 3:30AM Sleep at 8:30PM → Wake up at 4:00AM Sleep at 9:00PM → Wake up at 4:30AM Sleep at 9:30PM → Wake up at 5:00AM Sleep at 10:00PM → Wake up at 5:30AM Sleep at 10:30PM → Wake up at 6:00AM Sleep at 11:00PM → Wake up at…



Yu Zhang 🐈🐙 reposted

Interested in how we can use ideas from Flash Attention for more efficient linear RNN kernels? I am heading to NeurIPS in San Diego to present our work on Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels.

maxmbeck's tweet image.

Yu Zhang 🐈🐙 reposted

Lol what a shitshow @iclr_conf I'm sure the new ACs will take the rebuttals into account in a meaningful way when they decide to keep the original scores. What a waste of effort for everyone who spent time on rebuttals, and what a stupid reaction to the leak 🤦

BlackHC's tweet image.

Yu Zhang 🐈🐙 reposted

CuteDSL 4.3.1 is here 🚀 Major host-overhead optimization (10-40µs down to ~2µs in hot loops), streamlined PyTorch interop (pass torch.Tensors directly, no more conversions needed), and export and use in more languages and envs. All powered by the Apache TVM-FFI ABI.

tqchenml's tweet image.
