
Zhaoyue Cheng

@joey__cheng

Zhaoyue Cheng reposted

DeepSeek Sparse Attention from V3.2 is nice and "simple" (and compatible with MLA!), but at first it was surprising to me that it works well/better than other alternatives for both long context and short context. Some random thoughts I had: For short context both NSA and InfLLM…


Zhaoyue Cheng reposted

first set of thoughts after quickly reading: DSA feels like a small step between MLA and NSA's selection approach. While the DSA sparsity is interesting from an efficiency standpoint, I am more interested in its actual pure performance. Attention activation is something that…


Zhaoyue Cheng reposted

I quite enjoyed this and it covers a bunch of topics without good introductory resources! 1. A bunch of GPU hardware details in one place (warp schedulers, shared memory, etc.) 2. A breakdown/walkthrough of reading PTX and SASS. 3. Some details/walkthroughs of a number of other…

New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state-of-the-art matmul kernels in CUDA, read along. (Remember matmul is the single most important operation that transformers execute…
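The "matmul is the single most important operation" point is easy to make concrete with roofline arithmetic. Here is a back-of-envelope sketch; the peak numbers are assumed H100-class figures, not values from the blog post:

```python
# Back-of-envelope roofline for a bf16 matmul. Peak numbers below are assumed
# H100-class figures (not measurements from the post).

def matmul_intensity(M, N, K, bytes_per_elem=2):
    flops = 2 * M * N * K                                   # one mul + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # read A and B, write C
    return flops / bytes_moved                              # FLOPs per byte of HBM traffic

PEAK_FLOPS = 1.0e15   # ~1 PFLOP/s bf16 (assumption)
PEAK_BW = 3.35e12     # ~3.35 TB/s HBM3 (assumption)
ridge = PEAK_FLOPS / PEAK_BW   # ~300 FLOPs/byte; below this you are memory-bound

for n in (128, 1024, 8192):
    ai = matmul_intensity(n, n, n)
    print(f"{n}^3 matmul: {ai:6.0f} FLOPs/byte ->",
          "compute-bound" if ai > ridge else "memory-bound")
```

Small matmuls fall below the ridge point and are bandwidth-limited; at transformer-scale shapes the kernel is squarely compute-bound, which is why the tensor-core techniques the post walks through matter so much.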



Zhaoyue Cheng reposted

How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 dims per token (vs. 512 for MLA). It scores each incoming query against that cache, and the top-2048 tokens are passed to Sparse MLA.
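As a reading aid, here is a minimal PyTorch sketch of the selection step as that tweet describes it; the shapes and names are illustrative assumptions, not DeepSeek's actual implementation:

```python
import torch

# Hypothetical sketch of the DSA selection step described above: a small
# indexer key cache (128 dims per token) is scored against each query, and
# only the top-2048 tokens' entries go into sparse MLA. Names/shapes are mine.

def select_tokens(q_idx, k_idx_cache, top_k=2048):
    # q_idx:       [128]          indexer query for the current position
    # k_idx_cache: [seq_len, 128] lightweight per-token indexer key cache
    scores = k_idx_cache @ q_idx              # [seq_len] relevance scores
    k = min(top_k, k_idx_cache.shape[0])      # short contexts keep every token
    return scores.topk(k).indices             # token ids handed to sparse MLA

q = torch.randn(128)
cache = torch.randn(10_000, 128)
kept = select_tokens(q, cache)
print(kept.shape)                             # torch.Size([2048])
```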


🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n




Zhaoyue Cheng reposted

BTW, they released a deep dive on the FP8 KV cache of the main MLA. github.com/deepseek-ai/Fl… So the cost is actually ≈1/5 compared to FP8 dense MLA.


As expected, NSA is not compatible with MLA, so DeepSeek chose another method: use a smaller (d=128) attention (w/o value) as the indexer. Asymptotic cost ratio = 128/576. In addition, the indexer uses FP8 while the main MLA uses 16-bit, so the ratio = 64/576 = 1/9.
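The ratios in these two tweets are plain byte/FLOP bookkeeping. A worked version, assuming the standard MLA cache of 576 dims per token (512 compressed latent + 64 decoupled RoPE dims):

```python
# Bookkeeping behind the ratios quoted above. 576 = 512 latent + 64 RoPE dims
# cached per token in MLA (assumed); the indexer keys are 128 dims.
d_indexer, d_mla = 128, 576

flop_ratio = d_indexer / d_mla           # ~0.22: asymptotic cost, indexer vs MLA
fp8_vs_bf16 = (d_indexer / 2) / d_mla    # 64/576 = 1/9: FP8 indexer vs 16-bit MLA
fp8_vs_fp8 = d_indexer / d_mla           # ~1/5: once main MLA KV is FP8 too
print(flop_ratio, fp8_vs_bf16, fp8_vs_fp8)
```

The "actually ≈1/5" correction in the first tweet follows because the FlashMLA deep dive shows the main MLA KV cache is itself FP8, so the 2x precision advantage of the indexer disappears.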



Zhaoyue Cheng reposted

Apologies that I haven't written anything since joining Thinking Machines but I hope this blog post on a topic very near and dear to my heart (reproducible floating point numerics in LLM inference) will make up for it!

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to…
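For readers who haven't met the underlying issue, the core of the nondeterminism story is that floating-point reduction order changes results. A toy demonstration (my own, not code from the post):

```python
import torch

# Toy demonstration of the root issue behind nondeterministic LLM inference:
# floating-point addition is not associative, so two reduction orders over
# identical data give slightly different answers.

torch.manual_seed(0)
x = torch.randn(1_000_000)

s_whole = x.sum()                              # one reduction order
s_split = sum(c.sum() for c in x.chunk(97))    # a different split/accumulation order
print((s_whole - s_split).abs().item())        # tiny but nonzero, e.g. ~1e-4
```

Any kernel whose reduction order varies with batch size or parallel decomposition can therefore return different results for the "same" math, which is what the blog post sets out to defeat.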



Zhaoyue Cheng reposted

building pretraining infrastructure is an exercise in complexity management, abstraction design, operability/observability, and deep systems and ML understanding. reflects some of the trickiest and most rewarding problems in software engineering. which makes it really fun!


Zhaoyue Cheng reposted

New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in-depth explanation of how LLM inference engines, and vLLM in particular, work! Took me a while to get this level of understanding of the codebase and then to write up…


Zhaoyue Cheng reposted

This is the best public resource on scaling hardware for AI, and it's free. "How to Scale Your Model" is the bible from Google DeepMind that covers the math, systems, and scaling laws for LLM training and inference workloads. Approachable yet thorough. An absolute must-read.


Zhaoyue Cheng reposted

Excerpt: 1. Open-sourcing means the first party can no longer dress up results with hacks of various kinds; you must ship something general enough that any third party, given the same weights, can easily reproduce your results. 2. The vast majority of Agent products are nothing once you take Claude away.

some of my own thoughts and opinions behind k2 bigeagle.me/2025/07/kimi-k…



Zhaoyue Cheng reposted

Getting mem-bound kernels to speed-of-light isn't a dark art, it's just about getting a couple of details right. We wrote a tutorial on how to do this, with code you can directly use. Thanks to the new CuTe-DSL, we can hit speed-of-light without a single line of CUDA C++.

🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
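"Speed-of-light" here means the kernel runs at the HBM bandwidth limit. A sketch of how one might sanity-check any mem-bound kernel against that bound; the 3 TB/s peak is the H100 figure quoted in the tweet, and the helper is my own plain-PyTorch assumption, not part of QuACK:

```python
import torch, time

# Check a memory-bound kernel against "speed of light" = bytes moved / peak
# HBM bandwidth. Requires a CUDA GPU; 3 TB/s is an assumed H100-class peak.

PEAK_BW = 3.0e12  # bytes/s (assumed)

def achieved_bw(fn, x, iters=100):
    fn(x)                                            # warmup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    secs = (time.perf_counter() - t0) / iters
    bytes_moved = 2 * x.numel() * x.element_size()   # read input + write output
    return bytes_moved / secs

x = torch.randn(1 << 28, device="cuda", dtype=torch.bfloat16)  # 512 MB tensor
bw = achieved_bw(torch.nn.functional.gelu, x)
print(f"{bw/1e9:.0f} GB/s, {100 * bw / PEAK_BW:.0f}% of speed-of-light")
```

For an elementwise op like this, the only traffic is reading the input and writing the output, so the fraction of peak bandwidth is the whole performance story; that is exactly what makes mem-bound kernels a matter of "a couple of details" rather than algorithmic cleverness.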



Zhaoyue Cheng reposted

Very cool thread about the CS336 Language Models from Scratch course at Stanford taught by @percyliang et al. Makes me wish I was a student again!

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:



Zhaoyue Cheng reposted

Ilya Sutskever, U of T honorary degree recipient, June 6, 2025 youtu.be/zuZ2zaotrJs?si… via @YouTube This is a must-watch speech, the wisest words you could hear. Congratulations 🙌 @ilyasut @UofT was very fortunate to have you and so many other amazing students and…



Zhaoyue Cheng reposted

Dear PhD students now regretting taking offers at US schools: If you turned down PhD offers in Canada, but want to rethink that, email the professors who were trying to recruit you. They might be able to pull some strings. Your sane neighbor to the north, Canada


Zhaoyue Cheng reposted

Strong recommend for this book and the JAX/TPU docs, even if you are using Torch / GPUs. Clean notation and mental model for some challenging ideas. github.com/jax-ml/scaling… github.com/jax-ml/scaling… docs.jax.dev/en/latest/note…


Zhaoyue Cheng reposted

🚀 Breaking: SGLang provides the first open-source implementation to serve @deepseek_ai V3/R1 models with large-scale expert parallelism and prefill-decode disaggregation on 96 GPUs. It nearly matches the throughput reported by the official DeepSeek blog, achieving 52.3K input…


Zhaoyue Cheng reposted

The SGLang guys @lmsysorg are always doing such incredible work. This is also what the open-source community has made possible! 🚀


Zhaoyue Cheng reposted

4/ Our AI leadership was on display at @GoogleCloud Next, where we launched Ironwood, our most powerful TPU yet — 10X compute boost, optimized for inference at scale. And we’re first to bring NVIDIA’s next-gen Blackwell GPUs to customers. Now with tools for building multi-agent…


Zhaoyue Cheng reposted

Full talk of Ilya here 👇

.@ilyasut full talk at NeurIPS 2024: "pre-training as we know it will end" and what comes next is superintelligence: agentic, reasons, understands and is self-aware


