
Yu Zhang 🐈🐙

@yzhang_cs

@Kimi_Moonshot; PhD Student @ Soochow University; working on efficient methods for LLMs; disciple of parallel programming; INTP

Pinned

We’re also shipping fla-core in lock-step with flash-linear-attention: a minimal, forever-in-sync companion package that depends on nothing except triton + torch. Need only the fused Norm, CausalConv, and linear-attention kernels, without worrying about transformers? fla-core is enough. pypi.org/project/fla-co…
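
For anyone who just wants to try it, here is a rough usage sketch. It assumes fla-core exposes the usual flash-linear-attention import paths (fla.modules for the fused norm/conv layers, fla.ops for the attention kernels); exact module names and kernel signatures may differ between releases.

```python
# Rough sketch only: import paths and the kernel signature below are assumptions
# based on flash-linear-attention conventions and may differ across releases.
# pip install fla-core   # pulls in nothing beyond triton + torch
import torch
from fla.modules import RMSNorm                     # fused RMSNorm layer (assumed path)
from fla.ops.linear_attn import chunk_linear_attn   # chunked linear-attention kernel (assumed path)

B, T, H, D = 2, 1024, 4, 64
q, k, v = (torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Chunked linear attention; the (output, final_state) return convention is assumed.
o, _ = chunk_linear_attn(q, k, v, output_final_state=False)

# Fused RMSNorm applied over the flattened head dimension.
norm = RMSNorm(H * D).to(device="cuda", dtype=torch.bfloat16)
y = norm(o.reshape(B, T, H * D))
```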

Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.



Yu Zhang 🐈🐙 reposted

I just love Hugging Face. Their new 200+ page Training Playbook covers everything: training frameworks, model architecture, data curation, pre/mid/post-training, eval, how GPUs work, latest research, and ablations. Packed with practical wisdom. I read it like a novel.


Yu Zhang 🐈🐙 reposted

Kimi K2 Thinking benchmarks are here and it's competitive with (and in some cases beats!) GPT-5 🔥🔥


Yu Zhang 🐈🐙 reposted

Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗…
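
If you want to try one of these hybrid models, offline inference in vLLM looks roughly like the sketch below; the model id and parallelism setting are illustrative examples, not a recommendation, and any sufficiently recent vLLM build with hybrid-model support should work.

```python
# Minimal offline-inference sketch with vLLM; model id and tensor_parallel_size
# are example values only.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Why do hybrid attention models need less KV cache?"], params)
print(outputs[0].outputs[0].text)
```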

Yu Zhang 🐈🐙 reposted

love to see it: ongoing community effort makes it easier than ever to deploy recurrent models (Mamba, DeltaNet, and other linear-attention hybrids) and realize their inference throughput wins

Yu Zhang 🐈🐙 reposted

Hybrid Models as First-Class Citizens in vLLM 😍

Yu Zhang 🐈🐙 reposted

Collaborator and friend Dan Alistarh talks at ETH about using the new NvFP4 and MXFP4 block formats for inference. Some results go from "terrible" accuracy to acceptable by using micro-rotations to smooth outliers within blocks. arxiv.org/abs/2509.23202 Great collaboration and cool stuff
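
As a toy illustration of the idea (not the paper's method or the exact NVFP4/MXFP4 spec), the sketch below quantizes blocks of 32 values to an FP4 (E2M1) grid with a per-block scale, optionally applying a Hadamard rotation first so outliers are spread across the block before quantization.

```python
# Toy sketch of block FP4 quantization with an optional Hadamard "rotation";
# illustrative only, not the NVFP4/MXFP4 formats or the paper's exact method.
import torch

mags = torch.tensor([0., .5, 1., 1.5, 2., 3., 4., 6.])   # FP4 (E2M1) magnitudes
FP4_GRID = torch.cat([-mags.flip(0), mags])               # signed grid

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / n ** 0.5

def quantize_dequantize(x, block=32, rotate=True):
    """Per-block absmax scaling onto the FP4 grid, with an optional rotation to smooth outliers."""
    shape = x.shape
    xb = x.reshape(-1, block)
    if rotate:
        xb = xb @ hadamard(block)                          # spread outliers across the block
    scale = xb.abs().amax(dim=1, keepdim=True) / 6.0 + 1e-12
    q = FP4_GRID[(xb / scale).unsqueeze(-1).sub(FP4_GRID).abs().argmin(-1)]  # nearest grid point
    deq = q * scale
    if rotate:
        deq = deq @ hadamard(block).T                      # undo the rotation
    return deq.reshape(shape)

w = torch.randn(4096) * 0.02
w[::97] *= 50                                              # inject a few outliers
for rotate in (False, True):
    mse = (quantize_dequantize(w, rotate=rotate) - w).pow(2).mean()
    print(f"rotate={rotate}: reconstruction MSE {mse:.3e}")
```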

Yu Zhang 🐈🐙 reposted

Grateful to the vLLM community for their immediate support of Ouro! Its recurrent architecture makes it ideal for edge use cases: it delivers equivalent performance while using only 1/3 of the parameters.

Amazing work by @RidgerZhu and the ByteDance Seed team — Scaling Latent Reasoning via Looped LMs introduces looped reasoning as a new scaling dimension. 🔥 The Ouro model is now runnable on vLLM (nightly version) — bringing efficient inference to this new paradigm of latent…



Yu Zhang 🐈🐙 reposted

How powerful are Diffusion LLMs? Can they solve problems that Auto-Regressive (AR) LLMs can’t solve? Check out our new paper "On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond" 🔗 arxiv.org/pdf/2510.06190 In this work, we show that while Diffusion LLMs are indeed more…

Yu Zhang 🐈🐙 reposted

It’s fascinating that FP16 can reduce training–inference mismatch in RL fine-tuning. Out of curiosity, I tried the same precision swap on the Tiny Recursive Model (TRM) @jm_alexia, which iterates hidden states to reason over inputs. The outcome was different: under FP16,…

🚀 Excited to share our new work!
💊 Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training.
💡 Solution: Just switch to FP16.
🎯 That's it.
📰 Paper: arxiv.org/pdf/2510.26788
⭐️ Code: github.com/sail-sg/Precis…
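
A quick way to see why the swap can matter (a toy illustration, not the paper's RL experiment): bf16 keeps only 7 explicit mantissa bits while fp16 keeps 10, so fp16 rounds values much more finely as long as they stay inside its narrower dynamic range.

```python
# Toy illustration of the precision gap (not the paper's experiment):
# relative rounding error of casting the same fp32 values to bf16 vs fp16.
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000) * 3.0
for dtype in (torch.bfloat16, torch.float16):
    rel = (x.to(dtype).float() - x).abs() / x.abs().clamp_min(1e-12)
    print(f"{dtype}: median relative rounding error {rel.median():.2e}")
```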

Yu Zhang 🐈🐙 reposted

RoPE vs NoPE in hybrid linear attention models like Kimi Linear / Qwen3-Next is tricky. For example, when using NoPE, we found that slowly expanding the window size of the attention layers during training (64 -> 4k) greatly helps convergence:
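
The tweet doesn't say exactly how the window grows, so the sketch below is just one plausible way to implement the idea: a geometric ramp of the sliding-window size from 64 to 4096 over training.

```python
# Illustrative schedule only; the exact curve used in the experiment is not specified.
def window_size(step, total_steps, start=64, end=4096):
    """Geometric interpolation of the attention window size over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return int(round(start * (end / start) ** t))

for s in (0, 2_500, 5_000, 7_500, 10_000):
    print(s, window_size(s, 10_000))   # 64, 181, 512, 1448, 4096
```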

You see:
- a new arch that is better and faster than full attention, verified with Kimi-style solidness.
I see:
- Starting with inferior performance even on short contexts. Nothing works and nobody knows why.
- Tweaking every possible hyper-parameter to grasp what is wrong.
-…



Yu Zhang 🐈🐙 reposted

🔥 Inside Kimi Linear: First-Hand Insights @Kimi_Moonshot just dropped something impressive again. @yzhang_cs from Kimi AI Infra shared an insider's look at the making of Kimi Linear — an architecture designed around hybrid linear attention and optimized for efficiency ×…

Yu Zhang 🐈🐙 reposted

An interesting work. A new variant of ≥2-pass linear attention. IIUC, second-order HLA can be implemented as
Z = LA(query=K, key=Q, value=V)
O = LA(query=Q, key=K, value=Z)
O = (Q@K.T)@(Q@K.T).T@V = (Q@K.T)@(K@Q.T@V)
Fig. 2 in this post shows another variant from another work.
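
To make the algebra concrete, here is a tiny non-causal sanity check of that two-pass composition; real HLA kernels are causal and chunked, so this only verifies the identity, not the actual algorithm.

```python
# Non-causal sanity check of the second-order composition above.
import torch

def LA(q, k, v):                      # plain, un-normalized linear attention
    return (q @ k.transpose(-1, -2)) @ v

T, D = 128, 16
Q, K, V = (torch.randn(T, D, dtype=torch.double) for _ in range(3))

Z = LA(K, Q, V)                       # first pass:  Z = (K Q^T) V
O = LA(Q, K, Z)                       # second pass: O = (Q K^T) Z

O_ref = (Q @ K.T) @ (Q @ K.T).T @ V   # = (Q K^T)(K Q^T) V
print(torch.allclose(O, O_ref))       # True
```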

🚀 HLA: Higher Linear Attention = attention vibes + RNN speed:
Higher-order linear attention with parallelizable training!
Project Page: github.com/yifanzhang-pro…
WE ARE SO BACK! 🚀
#LLM #AI #DeepLearning #Transformers

Yu Zhang 🐈🐙 reposted

🚀 Really excited to see this amazing arch change (KDA) finally coming out! Replacing global attention with a hybrid linear arch: better pretraining ppls, better long-context evals, better downstream math/code/STEM evals after RL, and >6× throughput at 1M context to unblock more downstream potentials to…


Yu Zhang 🐈🐙 reposted

Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually…


Yu Zhang 🐈🐙 reposted

so excited to see more SOTA linear models, including some novel technical refinements that improve recurrent models 🚀

Yu Zhang 🐈🐙 reposted

🎉 Congrats to @Kimi_Moonshot! vLLM Day-0 model expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA):
- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction
💡…

Yu Zhang 🐈🐙 reposted

Thrilled to release new paper: “Scaling Latent Reasoning via Looped Language Models.” TLDR: We scale looped language models up to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3x its size.
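
A toy sketch of the looped idea is below: one weight-shared block applied several times to the hidden state, trading parameters for iterative compute. It is purely illustrative (no causal mask, no recurrence-specific tricks) and not the paper's actual architecture.

```python
# Toy sketch of a looped LM: the same layer applied n_loops times to the hidden
# state. Illustrative only; causal masking omitted for brevity.
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, vocab=32_000, d=512, heads=8, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.head = nn.Linear(d, vocab, bias=False)
        self.n_loops = n_loops

    def forward(self, ids):
        h = self.embed(ids)
        for _ in range(self.n_loops):          # same weights, looped over the hidden state
            h = self.block(h)
        return self.head(h)

logits = LoopedLM()(torch.randint(0, 32_000, (2, 16)))
print(logits.shape)                            # torch.Size([2, 16, 32000])
```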

Yu Zhang 🐈🐙 reposted

Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi…


Yu Zhang 🐈🐙 reposted

> beat Qwen3 Next
not really, it's smaller and trained with far fewer tokens, and no distillation. It's more of a research preview with fair ablation studies :)

