
Yu Zhang 🐈🐙
@yzhang_cs
@Kimi_Moonshot; PhD Student @ Soochow University; working on efficient methods for LLMs; disciple of parallel programming; INTP
We’re also shipping fla-core in lock-step with flash-linear-attention: a minimal, forever-in-sync companion package that depends on nothing except triton + torch. Need only the fused Norm, CausalConv, and linear-attention kernels, without worrying about transformers? fla-core is enough. pypi.org/project/fla-co…
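For anyone curious what a fla-core-only setup looks like, here is a rough sketch; the import paths and constructor arguments are assumptions based on the flash-linear-attention repo, so check them against the installed version.

```python
# Rough sketch of a fla-core-only setup (no transformers anywhere in the dependency tree).
# Import paths and constructor signatures below are assumed from the flash-linear-attention
# repo and may differ across versions.
import torch
from fla.modules import RMSNorm, ShortConvolution   # fused norm + causal short conv (assumed paths)
from fla.ops.linear_attn import chunk_linear_attn   # chunked linear-attention kernel (assumed path)

hidden_size = 1024
norm = RMSNorm(hidden_size)                 # assumed: hidden_size as the first argument
conv = ShortConvolution(hidden_size, 4)     # assumed: (hidden_size, kernel_size)
```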
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
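For context, this is roughly the recurrence behind the (gated) delta rule that Gated DeltaNet builds on, written as a naive per-step loop in plain PyTorch for intuition only; the real kernels are chunked and fused, and the exact gating parameterization may differ from what is sketched here.

```python
# Naive per-step form of the gated delta rule (intuition only; production kernels are chunked):
#   S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T,   o_t = S_t q_t
# alpha_t is a decay gate, beta_t a write strength; variable names here are mine.
import torch
import torch.nn.functional as F

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,). Returns outputs of shape (T, d_v)."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)
    outs = []
    for t in range(T):
        qt, kt, vt = q[t], k[t], v[t]
        # decay the old memory, erase what was stored under k_t, then write the new value
        S = alpha[t] * (S - beta[t] * torch.outer(S @ kt, kt)) + beta[t] * torch.outer(vt, kt)
        outs.append(S @ qt)
    return torch.stack(outs)

# toy call with L2-normalized keys, as is common for delta-rule models
o = gated_delta_rule(torch.randn(16, 8), F.normalize(torch.randn(16, 8), dim=-1),
                     torch.randn(16, 8), torch.rand(16), torch.rand(16))
```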
Sharing a preview of Dragon, a new hybrid GDN-Transformer that outperforms traditional architectures! We are an EU 🇪🇺🇫🇷-based company developing foundational models. What's new, what's different? 🧵👇
💪🦾 Agentless Training as Skill Prior for SWE-Agents
We recently released the technical report for Kimi-Dev. Here is the story we’d love to share behind it: (1/7)

⚡️500 TPS on DeepSeek-V3.1 🚀
What if your LLM inference automatically got faster the more you used it? Introducing ATLAS from the Together AI Turbo research team. Read more: togetherai.link/atlas Here’s Together AI Founder and Chief Scientist @tri_dao introducing ATLAS:
🎉 Excited to share that our paper “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models” — the first comprehensive survey on hallucinations in LLMs — has been accepted to Computational Linguistics (CL) and will be presented at EMNLP 2025!…

🙋
The Canadian visa is a joke. So please don't take it seriously. It's their loss.
Why does linear attention need Short Conv? kexue.fm/archives/11320
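The linked post is in Chinese; the pattern it discusses is, roughly, a small depthwise causal convolution applied to the q/k/v projections before the linear-attention mixer, giving every token cheap access to its immediate neighbors. A plain-PyTorch sketch of that pattern (kernel size, activation, and shapes are illustrative, not taken from the post):

```python
# Illustrative "short conv" block: depthwise causal Conv1d over the sequence dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortCausalConv(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)  # depthwise

    def forward(self, x):                          # x: (B, T, D)
        x = x.transpose(1, 2)                      # (B, D, T)
        x = F.pad(x, (self.kernel_size - 1, 0))    # left padding only => causal
        return F.silu(self.conv(x)).transpose(1, 2)

B, T, D = 2, 32, 64
x = torch.randn(B, T, D)
q = ShortCausalConv(D)(x)   # in practice q, k, v each get their own short conv
```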
🚀 New preprint, ExGRPO: Learning to Reason from Experience
On-policy RLVR wastes valuable rollouts after one use. We ask: what makes a reasoning experience truly useful?
🔑 Key ideas:
1. Correctness & entropy as indicators of valuable experience
2. Medium-difficulty questions =…

Big win for @Lei_Wang_1999 on TileLang: an endorsement from the whale!
DeepSeek V3.2 breakdown
1. Sparse attention via a lightning indexer + top_k attention (toy sketch below)
2. Uses V3.1 Terminus + 1T continued-pretraining tokens
3. 5 specialized models (coding, math, etc.) via RL, then distillation for the final ckpt
4. GRPO. Reward functions for length penalty, language…
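On item 1, a toy of the "cheap indexer scores past tokens, softmax attention runs only over the top-k of them" idea. This is not DeepSeek's implementation; the scorer, masking, and shapes are made up for illustration.

```python
# Toy top-k sparse attention: a cheap (here random) index score picks which past tokens
# each query may attend to; full softmax attention then runs only on that subset.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, index_scores, topk=8):
    """q, k, v: (T, d); index_scores: (T, T) float scores from a cheap indexer. Single head, causal."""
    T, d = q.shape
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    index_scores = index_scores.masked_fill(~causal, float('-inf'))
    kth = index_scores.topk(min(topk, T), dim=-1).values[..., -1:]   # per-row k-th best score
    keep = index_scores >= kth                                       # top-k past positions per query
    attn = (q @ k.T) / d ** 0.5
    attn = attn.masked_fill(~(keep & causal), float('-inf'))
    return F.softmax(attn, dim=-1) @ v

T, d = 32, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = topk_sparse_attention(q, k, v, index_scores=torch.randn(T, T), topk=8)
```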

DeepSeek is using TileLang instead of Triton. TileLang is a really elegant language! It also reminds me of this surface-level blog I wrote when first learning about it: it takes fewer than 100 lines of code to reach 630 TFLOPS for the softmax attention forward pass in TileLang (1.3x of FA2).

🛠 Open Source Release
🔗 Model: huggingface.co/deepseek-ai/De…
🔗 Tech report: github.com/deepseek-ai/De…
🔗 Key GPU kernels in TileLang & CUDA (use TileLang for rapid research prototyping!)
4/n
man, how is Kimi so different? It's like a model from a parallel universe.

Some H800 performance results for the Attention Sink kernel in TileLang. Take a look if you want to build some variants :) github.com/tile-ai/tilela…


Sharing our second Connectionism research post on Modular Manifolds, a mathematical approach to refining training at each layer of the neural network
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.…
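To make the idea concrete: one very simple instance of "optimizer plus a manifold constraint on a weight matrix" is to keep a square weight orthogonal by retracting it with a QR decomposition after every optimizer step. This is only a toy of the general idea, not the Modular Manifolds construction from the post.

```python
# Toy manifold-constrained training loop: after each SGD step, retract the weight back
# onto the manifold of orthogonal matrices via QR (with a sign fix for determinism).
import torch

W = torch.nn.Parameter(torch.linalg.qr(torch.randn(64, 64)).Q)  # start on the manifold
opt = torch.optim.SGD([W], lr=1e-2)

def retract_(w):
    q, r = torch.linalg.qr(w.detach())
    w.copy_(q * torch.sign(torch.diagonal(r)))   # fix QR's sign ambiguity

for _ in range(10):
    x = torch.randn(32, 64)
    loss = (x @ W.T).pow(2).mean()               # stand-in objective
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        retract_(W)                              # project back onto the constraint set
```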

(1/6) Triton kernels are a great way to understand ML models, but tutorials are scattered. The learning method for me was just to read real, high-performance code, so I wrote a blog that walks through the design and intuitions behind FLA's softmax attention kernel. Also a thread 🧵
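If you want a tiny warm-up before reading a real kernel, here is the classic row-wise softmax in Triton; it is nowhere near the FLA attention kernel the thread walks through, but it shows the basic "one program per row, masked vector loads" pattern that the bigger kernels build on.

```python
# Minimal Triton row-wise softmax (expects a 2D CUDA tensor whose rows fit in one block).
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, y_ptr, n_cols, x_stride, y_stride, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * x_stride + cols, mask=mask, other=float('-inf'))
    x = x - tl.max(x, axis=0)          # subtract the row max for numerical stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(y_ptr + row * y_stride + cols, y, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](x, y, n_cols, x.stride(0), y.stride(0), BLOCK=BLOCK)
    return y

x = torch.randn(128, 1000, device='cuda')
torch.testing.assert_close(softmax(x), torch.softmax(x, dim=-1))
```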

Check out @nathancgy4's awesome Deltaformer PR and stay tuned for a post on the architecture soon!
SEED's paper on associative memory and DeltaFormer is still one of my favorites 🎉 so I'm happy to share that DeltaFormer is now supported in FLA (flash-linear-attention)! Learned so much from @yzhang_cs and Mingyu.

I really don't want this guy winning..

Why do FFNs use ReLU instead of more precise kernels like Exp? "We propose the following hypothesis: A kernel with lower retrieval precision encourages a more polysemantic key–value memory: multiple unrelated facts can be stored under the same key space." A great and inspiring read!
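The framing in that quote can be written down in a couple of lines: treat the FFN as a key-value memory where the first weight matrix holds keys, the second holds values, and the activation is the retrieval kernel. Toy shapes, purely illustrative.

```python
# FFN as a key-value memory: relu(x @ K^T) retrieves values "softly", so many unrelated
# keys can fire weakly and share slots (polysemanticity); a sharper kernel such as exp
# retrieves more selectively. The weights here are random stand-ins.
import torch
import torch.nn.functional as F

d_model, n_mem = 64, 256
K = torch.randn(n_mem, d_model)    # "keys"   ~ first FFN weight matrix
V = torch.randn(n_mem, d_model)    # "values" ~ second FFN weight matrix
x = torch.randn(d_model)

relu_out = F.relu(x @ K.T) @ V     # lower-precision retrieval
exp_out = torch.exp(x @ K.T) @ V   # a "more precise" kernel in the paper's sense
```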

Still one of the most insane lines I've read in any ML paper.

SwiGLU-style gates working so well for attention (and not just in the FFN layers) is a beautiful thing. As it turns out, the "divine benevolence" might just come down to better inductive biases for controlling where information goes.
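For readers who haven't seen it, "SwiGLU-style gates for attention" typically means multiplying the attention output elementwise by a learned gate such as silu(W_g x) before the output projection; a hedged sketch of that pattern, not any specific model's code:

```python
# Output gating for attention, SwiGLU-style: attn_out is modulated elementwise by a
# learned gate computed from the pre-attention hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionOutput(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_model, bias=False)
        self.w_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, attn_out):    # x: (B, T, D) pre-attention hidden states
        return self.w_out(attn_out * F.silu(self.w_gate(x)))

B, T, D = 2, 32, 64
x, attn_out = torch.randn(B, T, D), torch.randn(B, T, D)
y = GatedAttentionOutput(D)(x, attn_out)
```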

Big day for AI agents! Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents. I’ll walk you through the highlights in this thread. (1/N)
