Pro-tip for vLLM power users: free ≈90% of your GPU VRAM in seconds, no restarts required 🚀 🚩 Why you'll want this: • Hot-swap new checkpoints on the same card • Rotate multiple LLMs on one GPU (batch jobs, micro-services, A/B tests) • Stage-based pipelines that call…
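For the curious, the feature behind this is vLLM's sleep mode. A minimal sketch, assuming a vLLM build that supports enable_sleep_mode (the model name is just an example):

```python
from vllm import LLM

# Sleep mode must be enabled at construction time.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
print(llm.generate(["Hello!"])[0].outputs[0].text)

llm.sleep(level=1)  # offload weights to CPU RAM and drop the KV cache,
                    # freeing most GPU VRAM without killing the process
# ... run another model or job on the same GPU here ...
llm.wake_up()       # reload weights and resume serving in seconds
```

Level 1 keeps the weights in CPU RAM for a fast wake-up; a deeper sleep level discards them entirely and reloads from disk.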

High performance. Scalable. Flexible. Open. @Oracle and AMD are scaling AI to new heights. Beginning in Q3 2026, @OracleCloud Infrastructure will be the first hyperscaler to offer a public AI supercluster, with an initial deployment of 50K AMD Instinct MI450 Series GPUs. 🔗 See…

🚀 vLLM just hit 60K GitHub stars! 🎉 From a small research idea to powering LLM inference everywhere — across NVIDIA, AMD, Intel, Apple, TPUs, and more — vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.…




🚀 vLLM x MinerU: Document Parsing at Lightning Speed! We’re excited to see MinerU fully powered by vLLM — bringing ultra-fast, accurate, and efficient document understanding to everyone. ⚡ Powered by vLLM’s high-throughput inference engine, MinerU 2.5 delivers: Instant…

Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to collaborate deeply with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here: blog.vllm.ai/2025/10/09/bla…
Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, PyTorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, and Dell. It runs every day on the latest software (vLLM, SGLang, etc.) across hundreds of GPUs, $10Ms of…
🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…
I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…
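The core idea is easy to see in miniature. Below is a runnable toy of in-flight weight updates, not PipelineRL's actual code (that lives in the linked repo): the decode loop checks a trainer-to-inference queue between steps and copies new weights in place, instead of draining all in-flight sequences first.

```python
# Toy of in-flight weight updates for async RL (illustrative only).
import queue
import torch

model = torch.nn.Linear(16, 16)           # stand-in for the policy network
weight_queue: queue.Queue = queue.Queue()  # trainer -> inference channel

@torch.no_grad()
def apply_update(new_state: dict) -> None:
    # Copy in place so live references to the parameter tensors stay valid.
    for name, param in model.named_parameters():
        param.copy_(new_state[name])

# Simulate the trainer pushing fresh weights while decoding is mid-sequence.
weight_queue.put({k: torch.randn_like(v) for k, v in model.state_dict().items()})

x = torch.randn(1, 16)
for step in range(8):                      # toy decode loop
    if not weight_queue.empty():           # new checkpoint available?
        apply_update(weight_queue.get())   # swap weights without draining
    x = model(x)                           # continue generation immediately
```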

Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai? vLLM is here to help! We have verified that it works on H200 machines, and on many other hardware platforms thanks to the hardware plugin mechanism. Check out the recipes at docs.vllm.ai/projects/recip… for more details 😍 Note: currently…
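For reference, a minimal offline-inference sketch; the recipes page has the verified, up-to-date flags per platform, and tensor_parallel_size=8 here simply assumes a single 8x H200 node:

```python
from vllm import LLM

# Illustrative setup only; consult the vLLM recipes for verified settings.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",
    tensor_parallel_size=8,   # assumption: one 8-GPU H200 node
    trust_remote_code=True,
)
print(llm.generate(["Who are you?"])[0].outputs[0].text)
```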


How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA) and scores incoming queries against it; the top-2048 tokens are then passed to Sparse MLA.
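A toy of the selection step, using the 128-dim indexer keys and top-2048 cutoff from above (illustrative only; the real model runs sparse MLA over the selected tokens):

```python
# Toy of the DSA/Lightning Indexer selection step.
import torch

T, d_idx, k = 8192, 128, 2048        # context length, indexer key dim, top-k

index_keys = torch.randn(T, d_idx)   # small per-token key cache
q_idx = torch.randn(d_idx)           # indexer query for the current token

scores = index_keys @ q_idx          # cheap relevance score per past token
top = torch.topk(scores, k).indices  # keep only the top-2048 tokens

# Full (sparse) attention now reads only the selected tokens' KV entries,
# so per-token cost scales with k rather than with the whole context T.
selected = index_keys[top]           # stand-in for the selected KV slice
```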


🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
🚀 New in vLLM: dots.ocr 🔥 A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM! 📝 Single end-to-end parser for text, tables (HTML), formulas (LaTeX), and layouts (Markdown) 🌍 Supports 100 languages with robust performance on…

we're all sleeping on this OCR model 🔥 dots.ocr is a new 3B model with SOTA performance, support for 100 languages & a license that allows commercial use! 🤯 a single e2e model to extract text from images, convert tables, formulas, and more into markdown 📝

Missed our latest vLLM office hours? We covered hybrid models as first-class citizens in @vllm_project. ✅ Hybrid model support in v1 ✅ Mamba, Mamba2, linear attention ✅ Performance from v0 → v1 ▶️ Recording: youtube.com/live/uWQ489ONv… 📑 Slides: docs.google.com/presentation/d…
We keep pushing the limits of speculative decoding (SD) in LLM inference -- check out our latest NeurIPS'25 paper: Lookahead Reasoning (LR). The high-level rationale is pretty simple: SD alone isn't enough anymore; as GPUs get stronger (H200 -> B200 -> Rubin CPX), we'll be able to…
[1/N] 🚀 New decoding paradigm drop! 🚀 Introducing Lookahead Reasoning (LR): step-level speculation that stacks with Speculative Decoding (SD). It has been accepted to #NeurIPS2025 🎉 📖 Blog: hao-ai-lab.github.io/blogs/lookahea… 💻 Code: github.com/hao-ai-lab/Loo… 📄 Paper: arxiv.org/abs/2506.19830
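For context, here is the draft-then-verify pattern that token-level SD uses and that LR lifts to whole reasoning steps, which is why the two speedups can multiply. A toy sketch, not the paper's code:

```python
# Toy of the token-level draft-then-verify loop behind speculative decoding.
def sd_round(draft_model, target_model, ctx: list, k: int = 4) -> list:
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft, d_ctx = [], list(ctx)
    for _ in range(k):
        tok = draft_model(d_ctx)
        draft.append(tok)
        d_ctx.append(tok)
    # 2) The strong model produces its own token at each drafted position.
    #    (A real engine scores all k positions in one batched forward pass;
    #    here we call the model per position for clarity.)
    targets = [target_model(ctx + draft[:i]) for i in range(k)]
    # 3) Accept the longest agreeing prefix; on the first mismatch, keep
    #    the verifier's token and stop.
    out = list(ctx)
    for d, t in zip(draft, targets):
        out.append(d if d == t else t)
        if d != t:
            break
    return out

# Usage with trivial stand-in models (both pick token = len(context) % 5):
print(sd_round(lambda c: len(c) % 5, lambda c: len(c) % 5, ctx=[1, 2, 3]))
```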
Day-0 support for one of the most anticipated model releases 🚀🚀🚀 Detailed deployment guide coming soon at vllm-recipes: docs.vllm.ai/projects/recip…

🚀 We're thrilled to unveil Qwen3-VL — the most powerful vision-language model in the Qwen series yet! 🔥 The flagship model Qwen3-VL-235B-A22B is now open-sourced and available in both Instruct and Thinking versions: ✅ Instruct outperforms Gemini 2.5 Pro on key vision…

Congrats to @deepseek_ai ! DeepSeek-R1 was published in Nature yesterday as the cover article, and vLLM is proud to have supported its RL training and inference🥰


Disaggregated Inference at Scale with #PyTorch & #vLLM: Meta’s vLLM disagg implementation improves inference efficiency in latency & throughput vs its internal stack, with optimizations now being upstreamed to the vLLM community. 🔗 hubs.la/Q03J87tS0

😂 Although AMD is now working pretty well for small to medium-sized models
Welcome Qwen3-Next! You can run it efficiently on vLLM with accelerated kernels and native memory management for hybrid models. blog.vllm.ai/2025/09/11/qwe…

🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here! 🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B (esp. @ 32K+ context!) 🔹 Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &…
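The "80B params, only 3B active" math comes from top-k expert routing: each token is dispatched to only a couple of experts out of many, so most parameters sit idle for any given token. A toy router sketch (illustrative, not Qwen3-Next's actual code; the dimensions are made up):

```python
# Toy top-k mixture-of-experts router.
import torch

n_experts, top_k, d = 64, 2, 512
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
router = torch.nn.Linear(d, n_experts)

def moe_forward(x: torch.Tensor) -> torch.Tensor:  # x: (d,)
    gate = torch.softmax(router(x), dim=-1)
    weights, idx = torch.topk(gate, top_k)         # pick 2 of 64 experts
    # Only top_k / n_experts of the expert parameters touch this token,
    # even though all experts contribute to total parameter count.
    return sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))

y = moe_forward(torch.randn(d))
```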

Deep dive into optimizing weight transfer step by step and improving it 60x!
1.5 seconds is enough to transfer model weights from training nodes to RL rollout nodes (down from ~100 seconds). Here's the full story of how I got there (not just the final solution): le.qun.ch/en/blog/2025/0…

⚡️ Efficient weight updates for RL at trillion-parameter scale 💡 Best practice from Kimi @Kimi_Moonshot. vLLM is proud to collaborate with checkpoint-engine: • Broadcast weight sync for 1T params in ~20s across 1000s of GPUs • Dynamic P2P updates for elastic clusters •…
Introducing checkpoint-engine: our open-source, lightweight middleware for efficient, in-place weight updates in LLM inference engines, especially effective for RL. ✅ Update a 1T model on thousands of GPUs in ~20s ✅ Supports both broadcast (sync) & P2P (dynamic) updates ✅…
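A sketch of the broadcast path, assuming an already-initialized torch.distributed process group in which rank 0 holds the fresh checkpoint (the real middleware is in the checkpoint-engine repo):

```python
# Illustrative broadcast-style weight sync; not checkpoint-engine's code.
import torch
import torch.distributed as dist

@torch.no_grad()
def broadcast_update(model: torch.nn.Module, src_rank: int = 0) -> None:
    # Assumes dist.init_process_group(...) has already run on every rank.
    for param in model.parameters():
        # One collective per tensor for clarity; production systems bucket
        # tensors, pipeline the transfers, and update in place to reach
        # ~20s for 1T parameters across thousands of GPUs.
        dist.broadcast(param.data, src=src_rank)
```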


vLLM Singapore Meetup — Highlights. Thanks to everyone who joined! Check out the slides by @vllm_project's DarkLight1337 with tjtanaa / @EmbeddedLLM. V1 is here: faster startup, stronger CI & perf checks. Scaling MoE: clear Expert Parallelism (EP) setup for single/multi-node +…




vLLM is proud to support the great Kimi update from @Kimi_Moonshot: better tool-calling, longer context, and more! Check the deployment guide at huggingface.co/moonshotai/Kim… 🔥

Kimi K2-0905 update 🚀 - Enhanced coding capabilities, esp. front-end & tool-calling - Context length extended to 256k tokens - Improved integration with various agent scaffolds (e.g., Claude Code, Roo Code, etc.) 🔗 Weights & code: huggingface.co/moonshotai/Kim… 💬 Chat with new Kimi…

Amazing blog post from @gordic_aleksa explaining the internals of vLLM 😍
New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work! Took me a while to get this level of understanding of the codebase and then to write up…
