
Zihao Ye
@ye_combinator
Proud to be an engineer. I'm building flashinfer (https://github.com/flashinfer-ai/flashinfer). Opinions are my own.
How do you beat every traditional compressor using LLMs? ⚙️ Introducing LLMc — a lossless compressor built with LLMs. LLMc leverages the predictive power of LLMs to beat traditional compressors like Gzip and LZMA on natural-language text. (1/4) 🔗 Blog Post: syfi.cs.washington.edu/blog/2025-10-0… 💻 Code:…
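A minimal, self-contained sketch of the general principle behind model-based lossless compression (a toy character-bigram predictor stands in for the LLM; this illustrates the idea, not LLMc's actual algorithm): the better the model predicts the next symbol, the more the stream of "prediction ranks" concentrates near zero, and a generic entropy coder then shrinks it.

```python
# Toy model-based lossless compression: store each symbol's RANK under
# the model's prediction, then entropy-code the ranks. A stronger model
# means smaller ranks means better compression. Assumes <= 256 distinct
# characters so each rank fits in one byte.
import zlib
from collections import Counter, defaultdict

def build_bigram_model(text):
    """Count next-char frequencies per single-char context."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def ranked_candidates(model, ctx, alphabet):
    """Candidates sorted most-likely-first; ties broken deterministically."""
    freq = model.get(ctx, Counter())
    return sorted(alphabet, key=lambda c: (-freq[c], c))

def compress(text, model, alphabet):
    ranks = bytes(ranked_candidates(model, a, alphabet).index(b)
                  for a, b in zip(text, text[1:]))
    return zlib.compress(ranks, 9)

def decompress(blob, first_char, model, alphabet):
    out = [first_char]
    for r in zlib.decompress(blob):
        out.append(ranked_candidates(model, out[-1], alphabet)[r])
    return "".join(out)

text = open(__file__).read()       # compress this script itself
alphabet = sorted(set(text))
model = build_bigram_model(text)   # encoder and decoder share the model
blob = compress(text, model, alphabet)
assert decompress(blob, text[0], model, alphabet) == text
print(len(text), "->", len(blob), "bytes; plain zlib:",
      len(zlib.compress(text.encode(), 9)))
```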
🚀 Follow-up to our last breakthrough on DeepSeek V3/R1 inference! On NVIDIA GB200 NVL72, SGLang now achieves 26k input tokens/s and 13k output tokens/s per GPU with FP8 attention + NVFP4 MoE - that’s a 3.8× / 4.8× speedup vs H100 settings. See the details in the 🧵 (1/4)

🚀 Introducing Sparse VideoGen2 (SVG2) — Pareto-frontier video generation acceleration with semantic-aware sparse attention! 🏆Spotlight paper accepted by #NeurIPS2025 ✅ Training-free & plug-and-play ✅ Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1 ✅ SOTA quality…
The main branch of sglang now supports deterministic inference with user-specified per-request seeds! It uses kernels from @thinkymachines and introduces new optimizations & coverage. It runs out of the box on most hardware backends and PyTorch versions.
SGLang now supports deterministic LLM inference! Building on @thinkymachines batch-invariant kernels, we integrated deterministic attention & sampling ops into a high-throughput engine - fully compatible with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. ✅…
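For intuition on why "batch-invariant" kernels are needed at all, here is a minimal CPU-side sketch (the real issue lives in GPU kernels whose reduction strategy changes with batch size and tiling): floating-point addition is not associative, so two mathematically equal reduction orders can disagree bitwise.

```python
# Floating-point addition is non-associative, so the same sum computed
# in two different orders can differ bitwise. GPU kernels often change
# their reduction order with batch size, which is the root cause of
# nondeterministic LLM inference that batch-invariant kernels remove.
import numpy as np

x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)

s_chain = np.float32(0.0)
for v in x:                          # one long sequential chain
    s_chain += v

s_chunk = np.float32(x.reshape(100, 1000).sum(axis=1).sum())  # chunked order

print(s_chain, s_chunk, "bitwise equal:", s_chain == s_chunk)  # usually False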

Glad that you enjoyed it! To be precise, it's EP64 on the inference side, around 30 GB per inference GPU. So it's around 30 GB / 1.3 s ≈ 23 GB/s.

Excited to share what friends and I have been working on at @Standard_Kernel We've raised from General Catalyst (@generalcatalyst), Felicis (@felicis), and a group of exceptional angels. We have some great H100 BF16 kernels in pure CUDA+PTX, featuring: - Matmul 102%-105% perf…

At Thinking Machines, our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to advance open-source vLLM and serve frontier models. If you are interested, please DM me or @barret_zoph!…
Awesome work from @thinkymachines and @cHHillee! The importance of determinism might be underestimated. Like with LLM-based compression (bellard.org/ts_zip/) - you really need things to work the same way whether you're doing prefill/decode or different batching setups. Here's…
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to…
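To make the compression connection above concrete, here is a toy, context-free sketch (all numbers invented) of why LLM-based compression needs bitwise determinism: encoder and decoder must derive identical symbol rankings, and even a one-ulp score drift flips a ranking and corrupts the output. With a real context-dependent LLM, a single flip derails everything after it.

```python
# Toy sketch (invented numbers): both sides must rank symbols
# identically; a tiny score perturbation flips a ranking and the
# decoder reconstructs the wrong text.
def rank(scores):
    return sorted(scores, key=lambda s: -scores[s])

enc = {"a": 0.3500000, "b": 0.3499999, "c": 0.30}   # encoder's scores
dec = {"a": 0.3500000, "b": 0.3500001, "c": 0.30}   # decoder sees drift

msg = "abacab"
ranks = [rank(enc).index(ch) for ch in msg]          # what gets stored
decoded = "".join(rank(dec)[r] for r in ranks)       # what comes back
print(msg, "->", decoded)                            # 'abacab' -> 'babcba'
```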

This is the advantage of large NVLink domains or TPU topologies - the main reason to do PP is that you are bottlenecked on your DP comms and cannot scale TP further. But if you have high enough bandwidth across a large enough domain (like TPUs or NVL72), you don't need to do PP…
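A back-of-envelope sketch of that trade-off (every number below is an illustrative assumption, roughly DeepSeek-V3-shaped): Megatron-style tensor parallelism all-reduces per-token activations about twice per layer, and that per-token communication time is exactly what a big NVLink/TPU domain buys down.

```python
# Back-of-envelope only -- every number is an illustrative assumption.
# ~2 all-reduces per transformer layer on per-token activations, and a
# ring all-reduce moves ~2x the payload. Fast, large domains keep this
# cheap enough that TP scales before PP becomes necessary.
hidden, layers, bytes_per_elt = 7168, 61, 2     # DeepSeek-V3-ish, bf16
per_token_bytes = 2 * 2 * hidden * layers * bytes_per_elt   # ~3.5 MB

for name, bw in [("NVLink-class  ~900 GB/s", 900e9),
                 ("network-class  ~50 GB/s", 50e9)]:
    print(f"{name}: ~{per_token_bytes / bw * 1e6:5.1f} us/token (no overlap)")
```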
🚀 Presenting LiteASR: a method that cuts the compute cost of speech encoders by 2x by leveraging low-rank approximation of activations. LiteASR has been accepted to #EMNLP2025 (main) @emnlpmeeting
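A generic sketch of the underlying trick (truncated SVD of a linear map; this illustrates low-rank approximation in general, not LiteASR's exact activation-based procedure): replacing a dense d_out×d_in matrix with two rank-r factors cuts a matvec from d_out·d_in to r·(d_out+d_in) multiply-adds.

```python
# Low-rank approximation of a linear layer via truncated SVD:
# W (d_out x d_in) is replaced by U (d_out x r) @ V (r x d_in).
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 1024, 1024, 128
# Construct a W that is approximately low-rank, as real weights and
# activations often are, plus a little full-rank noise.
W = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in)) / r
W += 0.01 * rng.standard_normal((d_out, d_in))

U_, S, Vt = np.linalg.svd(W, full_matrices=False)
U = U_[:, :r] * S[:r]          # fold singular values into U
V = Vt[:r, :]

x = rng.standard_normal(d_in)
err = np.linalg.norm(W @ x - U @ (V @ x)) / np.linalg.norm(W @ x)
print(f"rel err {err:.4f}, FLOPs ratio "
      f"{r * (d_in + d_out) / (d_in * d_out):.2f}")   # 0.25 here
```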

Sub-10-microsecond Haskell Sudoku solver implemented in hardware. unsafeperform.io/papers/2025-hs…
Tilelang now supports SM120 — give it a try if you have RTX 5090 🚀😎
🎉 Excited to share: We’ve open-sourced Triton-distributed MegaKernel! A fresh, powerful take on MegaKernel for LLMs—built entirely on our Triton-distributed framework. github.com/ByteDance-Seed… Why is it awesome? 🧩 Super programmable ⚡ Blazing performance 📊 Rock-solid precision

One nice thing you can do with an interactive world model: look down and see your footwear ... and find out whether the model understands what puddles are. A Genie 3 creation.
🚀 Excited to announce day-0 support from @NVIDIAAIDev for @OpenAI's gpt-oss model in flashinfer v0.2.10! github.com/flashinfer-ai/… ✅ Speed-of-light Blackwell mxfp4/mxfp8 MoE kernels + attention-sink from trtllm-gen ✅ FA2/FA3 template-based attention-sink support for earlier…
Like SGLang? Want speed-of-light decode perf? Check out: github.com/sgl-project/sg…
Powered by TensorRT-LLM Gen kernels. Available via flashinfer and TRT-LLM. 🚀
🏎️ ️We accelerated @OpenAI’s new GPT open weight models -- gpt-oss-20b and gpt-oss-120b -- for leading inference performance on NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second on an NVIDIA GB200 NVL72 system.⏱️🏁 👀 Technical deep dive on how we…
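For readers unfamiliar with the mxfp4 format those kernels consume, here is a toy quantizer sketching the OCP Microscaling idea (block size and value grid per the MX spec; the scale-selection rule here is one simple choice, not necessarily what trtllm-gen does):

```python
# Toy mxfp4 quantizer: each block of 32 values shares a power-of-two
# scale, and each element rounds to the nearest FP4 E2M1 magnitude
# {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit. Real E8M0 scales are
# further range-limited; ignored in this sketch.
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x):
    """Quantize-dequantize one block of 32 floats."""
    amax = np.abs(x).max()
    # Pick the smallest power-of-two scale that fits the block's max.
    scale = 2.0 ** np.ceil(np.log2(amax / E2M1[-1])) if amax else 1.0
    scaled = x / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1).argmin(axis=1)
    return np.sign(scaled) * E2M1[idx] * scale

x = np.random.default_rng(0).standard_normal(32).astype(np.float32)
print("max abs err:", np.abs(x - quantize_block(x)).max())
```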

Great to see StepFun acknowledge the idea of Attention-FFN disaggregation from our MegaScale-Infer work and take it to the next level 🚀🚀🚀 arxiv.org/abs/2504.02263
This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA for their 321B-A38B MoE model Step3, served on H800! The implementation is based on vLLM, and we are working…
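A conceptual sketch of what Attention-FFN disaggregation means operationally (a toy schedule with invented names; see the MegaScale-Infer paper and StepFun's write-up for the real systems): memory-bound attention and compute-bound MoE FFN run on separate GPU pools, and micro-batches ping-pong between them so both pools stay busy and can be sized and batched independently.

```python
# Toy pipeline schedule: at each tick the attention pool works on one
# micro-batch while the FFN pool works on the previous one. In the real
# systems the hop between pools is an RDMA activation transfer.
mbs = ["mb0", "mb1", "mb2", "mb3"]          # micro-batches in flight
for t in range(len(mbs) + 1):
    attn = f"attn pool <- {mbs[t]}" if t < len(mbs) else "attn pool idle"
    ffn = f"ffn pool <- {mbs[t - 1]}" if t >= 1 else "ffn pool idle"
    print(f"tick {t}: {attn:22s} | {ffn}")
```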


I’ve started collaborating with the folks who are building FlashInfer: a nice project and a pretty amazing set of people! @ye_combinator @tqchenml and everyone.
🔍 Our Deep Dive Blog Covering our Winning MLSys Paper on FlashInfer Is now live ➡️ nvda.ws/3ZA1Hca Accelerate LLM inference with FlashInfer—NVIDIA’s high-performance, JIT-compiled library built for ultra-efficient transformer inference on GPUs. Go under the hood with…
