
Zihao Ye
@ye_combinator
Proud to be an engineer. I'm building flashinfer (https://github.com/flashinfer-ai/flashinfer). Opinions are my own.
How do you beat every traditional compressor using LLMs? ⚙️ Introducing LLMc — a lossless compressor built with LLMs. LLMc leverages the predictive power of LLMs to beat traditional compressors like Gzip and LZMA on natural-language text. (1/4) 🔗 Blog Post: syfi.cs.washington.edu/blog/2025-10-0… 💻 Code:…
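A minimal, self-contained sketch of the general principle behind model-based lossless compression (a toy character-bigram predictor stands in for the LLM; this illustrates the idea, not LLMc's actual algorithm): the better the model predicts the next symbol, the more the stream of "prediction ranks" concentrates near zero, and a generic entropy coder then shrinks it.

```python
# Toy model-based lossless compression: store each symbol's RANK under
# the model's prediction, then entropy-code the ranks. A stronger model
# means smaller ranks means better compression. Assumes <= 256 distinct
# characters so each rank fits in one byte.
import zlib
from collections import Counter, defaultdict

def build_bigram_model(text):
    """Count next-char frequencies per single-char context."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def ranked_candidates(model, ctx, alphabet):
    """Candidates sorted most-likely-first; ties broken deterministically."""
    freq = model.get(ctx, Counter())
    return sorted(alphabet, key=lambda c: (-freq[c], c))

def compress(text, model, alphabet):
    ranks = bytes(ranked_candidates(model, a, alphabet).index(b)
                  for a, b in zip(text, text[1:]))
    return zlib.compress(ranks, 9)

def decompress(blob, first_char, model, alphabet):
    out = [first_char]
    for r in zlib.decompress(blob):
        out.append(ranked_candidates(model, out[-1], alphabet)[r])
    return "".join(out)

text = open(__file__).read()       # compress this script itself
alphabet = sorted(set(text))
model = build_bigram_model(text)   # encoder and decoder share the model
blob = compress(text, model, alphabet)
assert decompress(blob, text[0], model, alphabet) == text
print(len(text), "->", len(blob), "bytes; plain zlib:",
      len(zlib.compress(text.encode(), 9)))
```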
🚀 Follow-up to our last breakthrough on DeepSeek V3/R1 inference! On NVIDIA GB200 NVL72, SGLang now achieves 26k input tokens/s and 13k output tokens/s per GPU with FP8 attention + NVFP4 MoE - that’s a 3.8× / 4.8× speedup vs H100 settings. See the details in the 🧵 (1/4)

🚀 Introducing Sparse VideoGen2 (SVG2) — Pareto-frontier video generation acceleration with semantic-aware sparse attention! 🏆Spotlight paper accepted by #NeurIPS2025 ✅ Training-free & plug-and-play ✅ Up to 2.5× faster on HunyuanVideo, 1.9× faster on Wan 2.1 ✅ SOTA quality…
The main branch of sglang now supports deterministic inference with user-specified per-request seeds! It uses kernels from @thinkymachines and introduces new optimizations & coverage. It runs out of the box on most hardware backends and PyTorch versions.
SGLang now supports deterministic LLM inference! Building on @thinkymachines batch-invariant kernels, we integrated deterministic attention & sampling ops into a high-throughput engine - fully compatible with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. ✅…
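For intuition on why "batch-invariant" kernels are needed at all, here is a minimal CPU-side sketch (the real issue lives in GPU kernels whose reduction strategy changes with batch size and tiling): floating-point addition is not associative, so two mathematically equal reduction orders can disagree bitwise.

```python
# Floating-point addition is non-associative, so the same sum computed
# in two different orders can differ bitwise. GPU kernels often change
# their reduction order with batch size, which is the root cause of
# nondeterministic LLM inference that batch-invariant kernels remove.
import numpy as np

x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)

s_chain = np.float32(0.0)
for v in x:                          # one long sequential chain
    s_chain += v

s_chunk = np.float32(x.reshape(100, 1000).sum(axis=1).sum())  # chunked order

print(s_chain, s_chunk, "bitwise equal:", s_chain == s_chunk)  # usually False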

Glad that you enjoyed it! To be precise, it's EP64 on the inference side, around 30 GB per inference GPU. So it's around 30 GB / 1.3 s ≈ 23 GB/s.

Excited to share what friends and I have been working on at @Standard_Kernel We've raised from General Catalyst (@generalcatalyst), Felicis (@felicis), and a group of exceptional angels. We have some great H100 BF16 kernels in pure CUDA+PTX, featuring: - Matmul 102%-105% perf…

At Thinking Machines, our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to advance open-source vLLM and serve frontier models. If you are interested, please DM me or @barret_zoph!…
Awesome work from @thinkymachines and @cHHillee! The importance of determinism might be underestimated. Like with LLM-based compression (bellard.org/ts_zip/) - you really need things to work the same way whether you're doing prefill/decode or different batching setups. Here's…
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to…
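To make the compression connection above concrete, here is a toy, context-free sketch (all numbers invented) of why LLM-based compression needs bitwise determinism: encoder and decoder must derive identical symbol rankings, and even a one-ulp score drift flips a ranking and corrupts the output. With a real context-dependent LLM, a single flip derails everything after it.

```python
# Toy sketch (invented numbers): both sides must rank symbols
# identically; a tiny score perturbation flips a ranking and the
# decoder reconstructs the wrong text.
def rank(scores):
    return sorted(scores, key=lambda s: -scores[s])

enc = {"a": 0.3500000, "b": 0.3499999, "c": 0.30}   # encoder's scores
dec = {"a": 0.3500000, "b": 0.3500001, "c": 0.30}   # decoder sees drift

msg = "abacab"
ranks = [rank(enc).index(ch) for ch in msg]          # what gets stored
decoded = "".join(rank(dec)[r] for r in ranks)       # what comes back
print(msg, "->", decoded)                            # 'abacab' -> 'babcba'
```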

This is the advantage of large NVLink domains or TPU topologies - the main reason to do PP is that you are bottlenecked on your DP comms and cannot scale TP further. But if you have high enough bandwidth across a large enough domain (like TPUs or NVL72), you don't need to do PP…
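A back-of-envelope sketch of that trade-off (every number below is an illustrative assumption, roughly DeepSeek-V3-shaped): Megatron-style tensor parallelism all-reduces per-token activations about twice per layer, and that per-token communication time is exactly what a big NVLink/TPU domain buys down.

```python
# Back-of-envelope only -- every number is an illustrative assumption.
# ~2 all-reduces per transformer layer on per-token activations, and a
# ring all-reduce moves ~2x the payload. Fast, large domains keep this
# cheap enough that TP scales before PP becomes necessary.
hidden, layers, bytes_per_elt = 7168, 61, 2     # DeepSeek-V3-ish, bf16
per_token_bytes = 2 * 2 * hidden * layers * bytes_per_elt   # ~3.5 MB

for name, bw in [("NVLink-class  ~900 GB/s", 900e9),
                 ("network-class  ~50 GB/s", 50e9)]:
    print(f"{name}: ~{per_token_bytes / bw * 1e6:5.1f} us/token (no overlap)")
```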
🚀 Presenting LiteASR: a method that cuts the compute cost of speech encoders by 2x by leveraging low-rank approximation of activations. LiteASR has been accepted to #EMNLP2025 (main) @emnlpmeeting
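A generic sketch of the underlying trick (truncated SVD of a linear map; this illustrates low-rank approximation in general, not LiteASR's exact activation-based procedure): replacing a dense d_out×d_in matrix with two rank-r factors cuts a matvec from d_out·d_in to r·(d_out+d_in) multiply-adds.

```python
# Low-rank approximation of a linear layer via truncated SVD:
# W (d_out x d_in) is replaced by U (d_out x r) @ V (r x d_in).
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 1024, 1024, 128
# Construct a W that is approximately low-rank, as real weights and
# activations often are, plus a little full-rank noise.
W = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in)) / r
W += 0.01 * rng.standard_normal((d_out, d_in))

U_, S, Vt = np.linalg.svd(W, full_matrices=False)
U = U_[:, :r] * S[:r]          # fold singular values into U
V = Vt[:r, :]

x = rng.standard_normal(d_in)
err = np.linalg.norm(W @ x - U @ (V @ x)) / np.linalg.norm(W @ x)
print(f"rel err {err:.4f}, FLOPs ratio "
      f"{r * (d_in + d_out) / (d_in * d_out):.2f}")   # 0.25 here
```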

Sub-10-microsecond Haskell Sudoku solver implemented in hardware. unsafeperform.io/papers/2025-hs…
Tilelang now supports SM120 — give it a try if you have RTX 5090 🚀😎
🎉 Excited to share: We’ve open-sourced Triton-distributed MegaKernel! A fresh, powerful take on MegaKernel for LLMs—built entirely on our Triton-distributed framework. github.com/ByteDance-Seed… Why is it awesome? 🧩 Super programmable ⚡ Blazing performance 📊 Rock-solid precision

One nice thing you can do with an interactive world model: look down and see your footwear ... and find out whether the model understands what puddles are. A Genie 3 creation.
🚀 Excited to announce day-0 support from @NVIDIAAIDev for @OpenAI's gpt-oss model in flashinfer v0.2.10! github.com/flashinfer-ai/… ✅ Speed-of-light Blackwell mxfp4/mxfp8 MoE kernels + attention-sink from trtllm-gen ✅ FA2/FA3 template-based attention-sink support for earlier…
Like SGLang? Want speed-of-light decode perf? Check out: github.com/sgl-project/sg…
Powered by TensorRT-LLM Gen kernels. Available via flashinfer and TRT-LLM. 🚀
🏎️ ️We accelerated @OpenAI’s new GPT open weight models -- gpt-oss-20b and gpt-oss-120b -- for leading inference performance on NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second on an NVIDIA GB200 NVL72 system.⏱️🏁 👀 Technical deep dive on how we…
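For readers unfamiliar with the mxfp4 format those kernels consume, here is a toy quantizer sketching the OCP Microscaling idea (block size and value grid per the MX spec; the scale-selection rule here is one simple choice, not necessarily what trtllm-gen does):

```python
# Toy mxfp4 quantizer: each block of 32 values shares a power-of-two
# scale, and each element rounds to the nearest FP4 E2M1 magnitude
# {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit. Real E8M0 scales are
# further range-limited; ignored in this sketch.
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x):
    """Quantize-dequantize one block of 32 floats."""
    amax = np.abs(x).max()
    # Pick the smallest power-of-two scale that fits the block's max.
    scale = 2.0 ** np.ceil(np.log2(amax / E2M1[-1])) if amax else 1.0
    scaled = x / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1).argmin(axis=1)
    return np.sign(scaled) * E2M1[idx] * scale

x = np.random.default_rng(0).standard_normal(32).astype(np.float32)
print("max abs err:", np.abs(x - quantize_block(x)).max())
```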

Great to see StepFun acknowledge the idea of Attention-FFN disaggregation from our MegaScale-Infer work and take it to the next level 🚀🚀🚀 arxiv.org/abs/2504.02263
This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA for their 321B-A38B MoE model Step3, served on H800! The implementation is based on vLLM, and we are working…
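A conceptual sketch of what Attention-FFN disaggregation means operationally (a toy schedule with invented names; see the MegaScale-Infer paper and StepFun's write-up for the real systems): memory-bound attention and compute-bound MoE FFN run on separate GPU pools, and micro-batches ping-pong between them so both pools stay busy and can be sized and batched independently.

```python
# Toy pipeline schedule: at each tick the attention pool works on one
# micro-batch while the FFN pool works on the previous one. In the real
# systems the hop between pools is an RDMA activation transfer.
mbs = ["mb0", "mb1", "mb2", "mb3"]          # micro-batches in flight
for t in range(len(mbs) + 1):
    attn = f"attn pool <- {mbs[t]}" if t < len(mbs) else "attn pool idle"
    ffn = f"ffn pool <- {mbs[t - 1]}" if t >= 1 else "ffn pool idle"
    print(f"tick {t}: {attn:22s} | {ffn}")
```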


I’ve started collaborating with the folks who are building FlashInfer: a nice project and a pretty amazing set of people! @ye_combinator @tqchenml and everyone.
🔍 Our Deep Dive Blog Covering our Winning MLSys Paper on FlashInfer Is now live ➡️ nvda.ws/3ZA1Hca Accelerate LLM inference with FlashInfer—NVIDIA’s high-performance, JIT-compiled library built for ultra-efficient transformer inference on GPUs. Go under the hood with…
