
vLLM

@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!

vLLM reposted

Qwen3-VL has already become one of the most popular multimodal models supported by vLLM - Try it out!

Introducing the compact, dense versions of Qwen3-VL — now available in 4B and 8B pairs, each with both Instruct and Thinking variants.
✅ Lower VRAM usage
✅ Full Qwen3-VL capabilities retained
✅ Strong performance across the board
Despite their size, they outperform models…
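A minimal sketch of trying one of the new dense checkpoints with vLLM's offline chat API. The model id, image URL, and message format below are illustrative assumptions, not confirmed by the announcement; check the model card and vLLM docs for exact usage.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-4B-Instruct")  # assumed Hugging Face id for the 4B Instruct checkpoint
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)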



vLLM reposted

NVIDIA + Open Source AI Week 2025 – powered by partnerships, events, and community 👏 We’re excited to bring together technology, community, and collaboration at Open Source AI Week 2025, hosted by @linuxfoundation. From informal meetups to hackathons, panels to poster…


vLLM reposted

🚀🚀Just spun up gpt-oss-20B with vLLM on NVIDIA’s brand-new DGX Spark machine — a surprisingly powerful little box! This isn’t a rigorous benchmark, but early numbers look promising on a quick test (512×128 tokens) with stable serving and smooth setup!
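For reference, a minimal way to reproduce this kind of smoke test with vLLM's offline Python API. The Hugging Face model id and sampling settings here are assumptions, not the author's exact DGX Spark setup.

from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")  # assumed model id; hardware-specific flags omitted
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one short paragraph."], params)
print(outputs[0].outputs[0].text)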


🚀 vLLM just hit 60K GitHub stars! 🎉 From a small research idea to powering LLM inference everywhere — across NVIDIA, AMD, Intel, Apple, TPUs, and more — vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.…


vLLM reposted

Trivia & Community Meetup: NVIDIA × @DeepInfra × @vllm_project is happening during @PyTorch Open Source AI Week. Relax after panels and (trivially) compete with engineers, researchers, and AI practitioners who want to connect outside the PyTorch conference setting. Expect AI,…


🚀 vLLM x MinerU: Document Parsing at Lightning Speed! We’re excited to see MinerU fully powered by vLLM — bringing ultra-fast, accurate, and efficient document understanding to everyone. ⚡ Powered by vLLM’s high-throughput inference engine, MinerU 2.5 delivers: Instant…


MinerU 2.5 has arrived with a demo on @huggingface



vLLM reposted

Big shoutout to the @vllm_project team for an exceptional showing in the SemiAnalysis InferenceMAX benchmark on NVIDIA Blackwell GPUs 👏 Built through close collaboration with our engineers, vLLM delivered consistently strong Blackwell performance gains across the Pareto…


vLLM reposted

Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here: blog.vllm.ai/2025/10/09/bla…

Today we are launching InferenceMAX! We have support from NVIDIA, AMD, OpenAI, Microsoft, PyTorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, and Dell. It runs every day on the latest software (vLLM, SGLang, etc.) across hundreds of GPUs, $10Ms of…



vLLM reposted

Today we are launching InferenceMAX! We have support from NVIDIA, AMD, OpenAI, Microsoft, PyTorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, and Dell. It runs every day on the latest software (vLLM, SGLang, etc.) across hundreds of GPUs, $10Ms of…

Going to be dropping something huge in 24 hours. I think it'll reshape how everyone thinks about chips, inference, and infrastructure. It's directly supported by NVIDIA, AMD, Microsoft, OpenAI, Together AI, CoreWeave, Nebius, PyTorch Foundation, Supermicro, Crusoe, HPE, Tensorwave,…



vLLM reposted

DoTS.ocr from @xiaohongshu just got native @vllm_project support! I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @huggingface Jobs Tested on 1800s library cards - works great ✨


vLLM reposted

Shoutout to the @vllm_project team for developing a clean, infinitely hackable inference engine! The vLLM V1 architecture is great for in-flight weight updates. Just don't stop vLLM: send POST /collective_rpc(method_name=update_weights) and watch inference continue with the current KVs!
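A minimal sketch of the pattern described above, using Python's requests library against a locally running vLLM server. The /collective_rpc path, the JSON payload shape, and the update_weights method name are assumptions taken from the tweet, not a documented stable API.

import requests

# Hypothetical payload: the endpoint, field names, and method name follow the tweet.
resp = requests.post(
    "http://localhost:8000/collective_rpc",
    json={"method": "update_weights", "args": [], "kwargs": {}},
    timeout=60,
)
resp.raise_for_status()
print("weight update RPC returned", resp.status_code)

Because the server never stops, in-flight requests keep decoding against their existing KV caches while the new weights take effect.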

🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…



🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…

I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…



vLLM reposted

.@vllm_project has quickly become a go-to open source engine for efficient large language model inference, balancing performance with a strong developer experience. At NVIDIA, direct contributions to projects like vLLM reflect a commitment to advancing open source AI…


Keeping BERT alive in vLLM, via transformers!!

Today I can finally announce that the Transformers backend for vLLM will be the official way to use encoder-only (think BERT et al.) models in vLLM going forward! (1/2)
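A hedged sketch of what running an encoder-only model through the Transformers backend could look like with vLLM's offline API. The model id and the task and model_impl arguments are assumptions about how this is wired up; the exact interface may differ.

from vllm import LLM

llm = LLM(
    model="BAAI/bge-base-en-v1.5",   # any BERT-style encoder, used here as an example
    task="embed",                    # pooling/embedding mode (assumed argument)
    model_impl="transformers",       # ask vLLM to load via the Transformers backend (assumed argument)
)
outputs = llm.embed(["Encoder-only models now route through the Transformers backend."])
print(len(outputs[0].outputs.embedding))  # embedding dimension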



vLLM reposted

Our bi-weekly @vllm_project update is here, with @mgoin_ covering:
✨ v0.10.2 (740 commits, 266 contributors)
🧩 New models: Qwen3-Next/Omni/VL, InternVL 3.5, Whisper, and more
⚡ Decode Context Parallel & full cudagraph support
🌍 Community highlights
Watch 👇


vLLM reposted

Congrats to @AmandineFlachs and the team for winning the Wildcard prize at the OpenAI Open Model Hackathon! Last week I had the privilege of judging with @simon_mo_ on behalf of @vllm_project. This was a fun surprise and brought me right back to the 2019 OpenAI Five match that…


Thank you @NVIDIADC for supporting vLLM for the @deepseek_ai model launch. Blackwell is now the go-to release platform for new MoEs, and we cannot do it without the amazing team from NVIDIA.

📣 We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses a lightning indexer to selectively attend to the most relevant 2K tokens, enabling higher performance for long-context use cases. vLLM, the open source…



vLLM reposted

📣 We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses a lightning indexer to selectively attend to the most relevant 2K tokens, enabling higher performance for long-context use cases. vLLM, the open source…

How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA) and scores each incoming query against it; the top-2048 tokens are passed to Sparse MLA.
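A toy sketch of the two-stage selection described above, not DeepSeek's or vLLM's implementation. The dot-product scoring, tensor shapes, and indexer key dimension are illustrative assumptions; only the "score everything cheaply, keep the top 2048, run sparse MLA over those" flow comes from the tweet.

import torch

def dsa_select(query, indexer_keys, top_k=2048):
    # indexer_keys: [seq_len, d] small per-token indexer keys (d is illustrative)
    # query: [d] the current query projected into the indexer space
    scores = indexer_keys @ query              # one relevance score per cached token
    k = min(top_k, indexer_keys.shape[0])
    return torch.topk(scores, k).indices       # sparse MLA attends only to these tokens

keys = torch.randn(8192, 128)                  # toy cache: 8192 tokens, 128-dim indexer keys
q = torch.randn(128)
print(dsa_select(q, keys).shape)               # torch.Size([2048])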



Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai? vLLM is here to help! We have verified that it works on H200 machines, and on many other hardware platforms thanks to the hardware plugin mechanism. Check out the recipes at docs.vllm.ai/projects/recip… for more details 😍 Note: currently…
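A minimal, hedged sketch of loading the model with vLLM's offline API on one multi-GPU node. The model id and parallelism settings are assumptions (e.g. a single 8-GPU H200 node); the linked recipes remain the verified reference configuration.

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed Hugging Face id
    tensor_parallel_size=8,                 # assumes one 8-GPU H200 node
    trust_remote_code=True,
)
out = llm.generate(
    ["Summarize DeepSeek Sparse Attention in two sentences."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)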


How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA) and scores each incoming query against it; the top-2048 tokens are passed to Sparse MLA.


