Yu Zhang 🐈🐙

@yzhang_cs

@Kimi_Moonshot; PhD Student @ Soochow University; working on efficient methods for LLMs; disciple of parallel programming; INTP

Pinned Tweet

We’re also shipping fla-core in lock-step with flash-linear-attention: a minimal, forever-in-sync companion pkg that carries nothing except triton+torch. Need only the fused Norm, CausalConv, and linear-attn kernels, without any transformers worries? fla-core is enough. pypi.org/project/fla-co…
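For context on what those linear-attn kernels compute, here is a minimal, unfused PyTorch reference of causal linear attention: a running state S accumulates k_t v_t outer products and each query reads from it. This is only an illustrative sketch of the underlying math, not the fla-core API or its fused Triton kernels.

```python
import torch

def causal_linear_attention(q, k, v):
    """Naive reference: o_t = q_t @ S_t, where S_t = S_{t-1} + k_t^T v_t.
    Shapes: q, k, v are [batch, heads, seq_len, head_dim]."""
    B, H, T, D = q.shape
    S = torch.zeros(B, H, D, D, dtype=q.dtype, device=q.device)  # running k^T v state
    outputs = []
    for t in range(T):
        kt = k[:, :, t]  # [B, H, D]
        vt = v[:, :, t]  # [B, H, D]
        S = S + kt.unsqueeze(-1) * vt.unsqueeze(-2)  # rank-1 state update
        outputs.append(torch.einsum('bhd,bhde->bhe', q[:, :, t], S))
    return torch.stack(outputs, dim=2)  # [B, H, T, D]

# tiny smoke test
q = torch.randn(1, 2, 8, 16); k = torch.randn(1, 2, 8, 16); v = torch.randn(1, 2, 8, 16)
print(causal_linear_attention(q, k, v).shape)  # torch.Size([1, 2, 8, 16])
```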

Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.



Yu Zhang 🐈🐙 reposted

Sharing a preview of Dragon, a new hybrid GDN-Transformer that outperforms traditional architectures! We are an EU 🇪🇺🇫🇷-based company developing foundational models. What's new, what's different? 🧵👇


Yu Zhang 🐈🐙 reposted

💪🦾 Agentless Training as Skill Prior for SWE-Agents
We recently released the technical report for Kimi-Dev. Here is the story we’d love to share behind it: (1/7)

Yu Zhang 🐈🐙 reposted

⚡️500 TPS on DeepSeek-V3.1 🚀

What if your LLM inference automatically got faster the more you used it? Introducing ATLAS from the Together AI Turbo research team.
Read more: togetherai.link/atlas
Here’s Together AI Founder and Chief Scientist @tri_dao introducing ATLAS:



Yu Zhang 🐈🐙 reposted

🎉 Excited to share that our paper “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models” — the first comprehensive survey on hallucinations in LLMs — has been accepted to Computational Linguistics (CL) and will be presented at EMNLP 2025!…

🙋

The Canadian visa is a joke. So please don't take it seriously. It's their loss.



Yu Zhang 🐈🐙 reposted

Why does linear attention need Short Conv? kexue.fm/archives/11320
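The linked post is in Chinese; for readers who skip it, "Short Conv" refers to the small depthwise causal convolution (typically kernel size 3 or 4) applied to the q/k/v streams in linear-attention blocks to inject local token mixing. Below is a rough PyTorch sketch of that component; it is illustrative only, not the fla ShortConvolution implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortCausalConv(nn.Module):
    """Depthwise causal Conv1d with a small kernel, as commonly placed
    before q/k/v in linear-attention blocks. Illustrative sketch only."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)

    def forward(self, x):            # x: [batch, seq_len, dim]
        x = x.transpose(1, 2)        # -> [batch, dim, seq_len]
        x = F.pad(x, (self.kernel_size - 1, 0))  # left-pad so the conv is causal
        x = self.conv(x)
        return x.transpose(1, 2)     # -> [batch, seq_len, dim]

conv = ShortCausalConv(dim=64)
x = torch.randn(2, 16, 64)
print(conv(x).shape)  # torch.Size([2, 16, 64])
```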


Yu Zhang 🐈🐙 reposted

🚀 New preprint: ExGRPO — Learning to Reason from Experience
On-policy RLVR wastes valuable rollouts after one use. We ask: what makes a reasoning experience truly useful?
🔑 Key ideas:
1. Correctness & entropy as indicators of valuable experience
2. Medium-difficulty questions =…

Yu Zhang 🐈🐙 reposted

Big win for @Lei_Wang_1999 on TileLang: an endorsement from the whale!

DeepSeek V3.2 breakdown
1. Sparse attention via lightning indexer + top_k attention
2. Uses V3.1 Terminus + 1T continued pretraining tokens
3. 5 specialized models (coding, math etc) via RL then distillation for final ckpt
4. GRPO. Reward functions for length penalty, language…
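For intuition on point 1, here is a generic sketch of top-k sparse attention: a scorer ranks key positions per query and only the k highest-scoring positions enter the softmax. This is an illustrative toy (dense scoring, no lightning indexer), not DeepSeek's actual kernel.

```python
import math
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """Generic top-k sparse attention sketch.
    q, k, v: [batch, heads, seq_len, head_dim]; causal."""
    B, H, T, D = q.shape
    scores = q @ k.transpose(-1, -2) / math.sqrt(D)  # [B, H, T, T]
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float('-inf'))
    # keep only the top-k scores per query; drop the rest from the softmax
    kth = scores.topk(min(top_k, T), dim=-1).values[..., -1:]  # k-th largest score per row
    scores = scores.masked_fill(scores < kth, float('-inf'))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 128, 32)
print(topk_sparse_attention(q, k, v, top_k=16).shape)  # torch.Size([1, 4, 128, 32])
```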

Yu Zhang 🐈🐙 reposted

DeepSeek is using TileLang instead of Triton. TileLang is a really elegant language! Also reminds me of this surface-level blog I wrote when first learning about it. It takes fewer than 100 lines of code to reach 630 TFLOPS for the softmax attention forward pass in TileLang (1.3x of FA2).

🛠 Open Source Release
🔗 Model: huggingface.co/deepseek-ai/De…
🔗 Tech report: github.com/deepseek-ai/De…
🔗 Key GPU kernels in TileLang & CUDA (use TileLang for rapid research prototyping!)
4/n



Yu Zhang 🐈🐙 reposted

man how is Kimi so different
a model from a parallel universe

Yu Zhang 🐈🐙 reposted

Some H800 performance results of the Attention Sink in TileLang. Take a look if you want to make some variants :) github.com/tile-ai/tilela…

Yu Zhang 🐈🐙 reposted

Sharing our second Connectionism research post on Modular Manifolds, a mathematical approach to refining training at each layer of the neural network.

Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.…

Yu Zhang 🐈🐙 reposted

(1/6) triton kernels are a great way to understand ML models, but tutorials are scattered.
the learning method for me was just to read real, high-performance code,
so i wrote a blog that walks through the design and intuitions behind FLA's softmax attention kernel.
🧵also a thread

Yu Zhang 🐈🐙 reposted

Check out @nathancgy4's awesome DeltaFormer PR and stay tuned for a post on the architecture soon!

SEED's paper on associative memory and DeltaFormer is still one of my favorites.
🎉 So I'm happy to share that DeltaFormer is now supported in FLA (flash-linear-attention)! I learned an incredible amount from @yzhang_cs and Mingyu.

Yu Zhang 🐈🐙 reposted

I really don't want this guy winning..


Yu Zhang 🐈🐙 reposted

SEED's paper on associative memory and DeltaFormer is still one of my favorites.
🎉 So I'm happy to share that DeltaFormer is now supported in FLA (flash-linear-attention)! I learned an incredible amount from @yzhang_cs and Mingyu.

Why do FFNs use ReLU instead of more precise ones like Exp?
"We propose the following hypothesis: A kernel with lower retrieval precision encourages a more polysemantic key–value memory: multiple unrelated facts can be stored under the same key space."
Great and inspiring read!
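A toy rendering of the quoted hypothesis: view the FFN as a key-value memory whose activation is the retrieval kernel. With an exp/softmax kernel retrieval is sharp and nearly one-to-one; with ReLU many keys fire at once, so unrelated facts can share key space. The code below is an illustrative sketch, not the paper's setup, and all names are made up.

```python
import torch
import torch.nn.functional as F

def ffn_memory_readout(x, keys, values, kernel="relu"):
    """Toy 'FFN as key-value memory': scores = x @ keys^T, out = kernel(scores) @ values.
    x: [batch, d_model], keys: [n_mem, d_model], values: [n_mem, d_model]."""
    scores = x @ keys.t()                   # retrieval score against every stored key
    if kernel == "relu":
        weights = F.relu(scores)            # low precision: many keys fire at once
    elif kernel == "exp":
        weights = F.softmax(scores, dim=-1) # high precision: near one-hot retrieval
    else:
        raise ValueError(kernel)
    return weights @ values

x = torch.randn(4, 256)
keys, values = torch.randn(1024, 256), torch.randn(1024, 256)
print(ffn_memory_readout(x, keys, values, "relu").shape)  # torch.Size([4, 256])
```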

Yu Zhang 🐈🐙 reposted

Still one of the most insane lines I've read in any ML paper.

Yu Zhang 🐈🐙 reposted

SwiGLU-style gates working so well for attention (and not just in the FFN layers) is a beautiful thing. As it turns out, the "divine benevolence" might just be caused by better inductive biases for controlling where information goes.
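A minimal sketch of the mechanism being described, under my own reading: an elementwise SiLU gate computed from the token's representation multiplies the attention output, giving the model a per-channel switch over which retrieved information passes. Module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionOutput(nn.Module):
    """Attention block with a SwiGLU-style output gate (illustrative sketch)."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate_proj = nn.Linear(dim, dim)  # gate computed from the input itself
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        gate = F.silu(self.gate_proj(x))      # per-channel gate controls what passes
        return self.out_proj(attn_out * gate)

layer = GatedAttentionOutput(dim=64, num_heads=4)
x = torch.randn(2, 16, 64)
print(layer(x).shape)  # torch.Size([2, 16, 64])
```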

Yu Zhang 🐈🐙 reposted

Big day for AI agents! Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents. I’ll walk you through the highlights in this thread. (1/N)
