
vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!
Qwen3-VL has already become one of the most popular multimodal models supported by vLLM - Try it out!
Introducing the compact, dense versions of Qwen3-VL — now available in 4B and 8B pairs, each with both Instruct and Thinking variants. ✅ Lower VRAM usage ✅ Full Qwen3-VL capabilities retained ✅ Strong performance across the board Despite their size, they outperform models…
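A minimal sketch of how one might try a compact Qwen3-VL checkpoint with vLLM's offline Python API; the model ID "Qwen/Qwen3-VL-8B-Instruct", the image URL, and the context length are illustrative assumptions, not an official recipe.

```python
# Hedged sketch: serving a compact Qwen3-VL checkpoint via vLLM's offline API.
# The model ID and the image URL below are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-8B-Instruct", max_model_len=8192)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

# llm.chat() applies the model's chat template and handles the multimodal input.
outputs = llm.chat(messages, SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```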

NVIDIA + Open Source AI Week 2025 – powered by partnerships, events, and community 👏 We’re excited to bring together technology, community, and collaboration at Open Source AI Week 2025, hosted by @linuxfoundation. From informal meetups to hackathons, panels to poster…

🚀🚀Just spun up gpt-oss-20B with vLLM on NVIDIA’s brand-new DGX Spark machine — a surprisingly powerful little box! This isn’t a rigorous benchmark, but early numbers look promising on a quick test (512×128 tokens) with stable serving and smooth setup!
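A rough sketch of the kind of quick 512-in / 128-out throughput check described above, using vLLM's offline API; this is not the actual benchmark script, and the Hugging Face ID "openai/gpt-oss-20b" plus the request count are assumptions.

```python
# Rough throughput check (512 prompt tokens, 128 output tokens per request);
# numbers and model ID are illustrative assumptions, not the original test.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")

prompt = "word " * 512                      # ~512-token dummy prompt
params = SamplingParams(max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate([prompt] * 32, params)   # 32 concurrent requests
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s over {elapsed:.1f}s")
```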




🚀 vLLM just hit 60K GitHub stars! 🎉 From a small research idea to powering LLM inference everywhere — across NVIDIA, AMD, Intel, Apple, TPUs, and more — vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.…




Trivia & Community Meetup: NVIDIA × @DeepInfra × @vllm_project is happening during @PyTorch Open Source AI Week. Relax after panels and (trivially) compete with engineers, researchers, and AI practitioners who want to connect outside the PyTorch conference setting. Expect AI,…

🚀 vLLM x MinerU: Document Parsing at Lightning Speed! We’re excited to see MinerU fully powered by vLLM — bringing ultra-fast, accurate, and efficient document understanding to everyone. ⚡ Powered by vLLM’s high-throughput inference engine, MinerU 2.5 delivers: Instant…

Big shoutout to the @vllm_project team for an exceptional showing in the SemiAnalysis InferenceMAX benchmark on NVIDIA Blackwell GPUs 👏 Built through close collaboration with our engineers, vLLM delivered consistently strong Blackwell performance gains across the Pareto…
Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do Read about the work we did here: blog.vllm.ai/2025/10/09/bla…
Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, Pytorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell It runs every day on the latest software (vLLM, SGLang, etc) across hundreds of GPUs, $10Ms of…
Going to be dropping something huge in 24 hours I think it'll reshape how everyone thinks about chips, inference, and infrastructure It's directly supported by NVIDIA, AMD, Microsoft, OpenAI, Together AI, CoreWeave, Nebius, PyTorch Foundation, Supermicro, Crusoe, HPE, Tensorwave,…
DoTS.ocr from @xiaohongshu just got native @vllm_project support! I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @huggingface Jobs Tested on 1800s library cards - works great ✨


Shoutout to @vllm_project team for developing a clean, infinitely hackable inference engine! vLLM V1 architecture is great for in-flight weight updates. Just don't stop vLLM, send POST /collective_rpc(method_name=update_weights), and watch inference continue with current KVs!
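A sketch of the in-flight weight-update flow the post describes, assuming a running vLLM server that exposes a collective_rpc endpoint and a worker-side "update_weights" method (for example via a custom worker extension); the endpoint, method name, and checkpoint argument follow the tweet rather than a documented, stable API.

```python
# Hedged sketch: trigger an in-flight weight update without stopping the engine.
# The /collective_rpc endpoint and "update_weights" method are assumptions taken
# from the post above; the checkpoint_path kwarg is hypothetical.
import requests

SERVER = "http://localhost:8000"

resp = requests.post(
    f"{SERVER}/collective_rpc",
    json={
        "method": "update_weights",
        # Hypothetical argument: where the trainer published the new weights.
        "kwargs": {"checkpoint_path": "/shared/ckpts/step_1234"},
    },
    timeout=300,
)
resp.raise_for_status()
# In-flight requests keep decoding against their existing KV caches while the
# weights swap underneath them.
print("weights updated:", resp.json())
```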
🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…
I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…

.@vllm_project has quickly become a go-to open source engine for efficient large language model inference, balancing performance with a strong developer experience. At NVIDIA, direct contributions to projects like vLLM reflect a commitment to advancing open source AI…

Keeping BERT alive in vLLM, via transformers!!
Today I can finally announce that the Transformers backend for vLLM will be the official way to use encoder-only (think BERT et al.) models in vLLM going forward! (1/2)
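A minimal sketch of loading a BERT-style encoder for embeddings through the Transformers backend; the model ID, the task="embed" mode, and model_impl="transformers" are assumptions based on vLLM's documented options, so check the docs for the exact flags in your version.

```python
# Hedged sketch: an encoder-only (BERT-style) model served for embeddings via
# the Transformers backend. Model ID and flags are assumptions for illustration.
from vllm import LLM

llm = LLM(
    model="BAAI/bge-base-en-v1.5",   # BERT-style encoder-only model
    task="embed",                     # pooling/embedding mode, not generation
    model_impl="transformers",        # route through the Transformers backend
)

outputs = llm.embed(["vLLM keeps BERT-style encoders alive via transformers."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```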

Our bi-weekly @vllm_project update is here, with @mgoin_ covering: ✨ v0.10.2 (740 commits, 266 contributors) 🧩 New models: Qwen3-Next/Omni/VL, InternVL 3.5, Whisper, and more ⚡ Decode Context Parallel & full cudagraph support 🌍 Community highlights Watch 👇
Congrats to @AmandineFlachs and the team for winning the Wildcard prize at the OpenAI Open Model Hackathon! Last week I had the privilege of judging with @simon_mo_ on behalf of @vllm_project. This was a fun surprise and brought me right back to the 2019 OpenAI Five match that…
Thank you @NVIDIADC for supporting vLLM for the @deepseek_ai model launch. Blackwell is now the go-to release platform for new MoEs, and we cannot do it without the amazing team from NVIDIA.
📣 We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses a lightning indexer to selectively attend to the most relevant 2K tokens, enabling higher performance for long-context use cases. vLLM, the open source…
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 dims per token (vs. 512 for MLA) and scores each incoming query against it; the top-2048 tokens are then passed to sparse MLA.
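A toy PyTorch illustration of the two-stage selection described above: the indexer scores the current query against compact per-token index keys, then only the top-2048 positions are handed to the sparse MLA step. Shapes and the dot-product scorer are simplified assumptions, not DeepSeek's implementation.

```python
# Toy sketch of DSA-style token selection: score all cached tokens cheaply with
# a 128-dim indexer, keep the top-2048, and attend only over those positions.
import torch

seq_len, d_index, top_k = 8192, 128, 2048

index_keys = torch.randn(seq_len, d_index)   # compact key cache: 128 dims/token
query_idx = torch.randn(d_index)             # indexer projection of current query

scores = index_keys @ query_idx                            # one score per token
keep = torch.topk(scores, k=min(top_k, seq_len)).indices   # tokens passed onward

# Downstream, sparse MLA would gather only these positions from its own 512-dim
# latent KV cache instead of attending over all seq_len tokens.
print(keep.shape)  # torch.Size([2048])
```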

Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai ? vLLM is here to help! We have verified that it works on H200 machines and on many other hardware platforms thanks to the hardware plugin mechanism. Check out the recipes docs.vllm.ai/projects/recip… for more details 😍 Note: currently…


