
vLLM

@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!

vLLM reposted

It's not as exciting as BERT support... but the @huggingface Transformers backend for @vllm_project now supports mixture-of-experts (MoE) models at full speed! 🚀 Install both packages from source and take it for a spin!

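If you want to take it for a spin, here is a minimal sketch of what that could look like; the MoE checkpoint, the source-install commands, and the sampling settings are illustrative assumptions, not part of the announcement:

```python
# Hedged sketch: run an MoE model through vLLM's Transformers backend.
# Assumes source installs of both packages (exact commands may differ):
#   pip install git+https://github.com/huggingface/transformers.git
#   pip install git+https://github.com/vllm-project/vllm.git
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-MoE-A2.7B-Chat",  # assumption: any HF mixture-of-experts checkpoint
    model_impl="transformers",            # force the Transformers backend instead of a native vLLM impl
)
params = SamplingParams(temperature=0.7, max_tokens=64)
out = llm.generate(["Explain mixture-of-experts in one sentence."], params)
print(out[0].outputs[0].text)
```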

vLLM reposted

Massive new update: Google TPU inference for open models like Gemma! `tpu-inference` is a new @vllm_project backend that delivers up to 5x performance gains over previous prototypes while supporting both frameworks via a single lowering path. - 🤝 Unifies PyTorch and JAX under…


Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. 🚀 What's New? - JAX + PyTorch: Run PyTorch models on…


vLLM reposted

Qwen3-VL has already become one of the most popular multimodal models supported by vLLM - Try it out!

Introducing the compact, dense versions of Qwen3-VL — now available in 4B and 8B pairs, each with both Instruct and Thinking variants. ✅ Lower VRAM usage ✅ Full Qwen3-VL capabilities retained ✅ Strong performance across the board Despite their size, they outperform models…

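To try one of the compact checkpoints with vLLM's offline API, a minimal sketch could look like the following; the exact model id, image URL, and context length are assumptions:

```python
# Hedged sketch: multimodal chat with a compact Qwen3-VL model in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-4B-Instruct", max_model_len=8192)  # assumption: 4B Instruct HF id

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```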


vLLM reposted

NVIDIA + Open Source AI Week 2025 – powered by partnerships, events, and community 👏 We’re excited to bring together technology, community, and collaboration at Open Source AI Week 2025, hosted by @linuxfoundation. From informal meetups to hackathons, panels to poster…


vLLM reposted

🚀🚀Just spun up gpt-oss-20B with vLLM on NVIDIA’s brand-new DGX Spark machine — a surprisingly powerful little box! This isn’t a rigorous benchmark, but early numbers look promising on a quick test (512×128 tokens) with stable serving and smooth setup!

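For reference, a minimal sketch of querying such a deployment through vLLM's OpenAI-compatible server; it assumes `vllm serve openai/gpt-oss-20b` is already running locally, and the model id, port, and prompt are assumptions that say nothing about DGX Spark performance:

```python
# Hedged sketch: chat with a locally served gpt-oss-20B via vLLM's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # default vLLM server port
resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # assumption: HF model id used when launching the server
    messages=[{"role": "user", "content": "Summarize why PagedAttention helps LLM serving."}],
    max_tokens=128,              # matches the 128-token outputs of the quick test
)
print(resp.choices[0].message.content)
```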

🚀 vLLM just hit 60K GitHub stars! 🎉 From a small research idea to powering LLM inference everywhere — across NVIDIA, AMD, Intel, Apple, TPUs, and more — vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.…


vLLM reposted

Trivia & Community Meetup: NVIDIA × @DeepInfra × @vllm_project is happening during @PyTorch Open Source AI Week. Relax after panels and (trivially) compete with engineers, researchers, and AI practitioners who want to connect outside the PyTorch conference setting. Expect AI,…


🚀 vLLM x MinerU: Document Parsing at Lightning Speed! We’re excited to see MinerU fully powered by vLLM — bringing ultra-fast, accurate, and efficient document understanding to everyone. ⚡ Powered by vLLM’s high-throughput inference engine, MinerU 2.5 delivers: Instant…


MinerU 2.5 has arrived with demo on @huggingface



vLLM reposted

Big shoutout to the @vllm_project team for an exceptional showing in the SemiAnalysis InferenceMAX benchmark on NVIDIA Blackwell GPUs 👏 Built through close collaboration with our engineers, vLLM delivered consistently strong Blackwell performance gains across the Pareto…


vLLM reposted

Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do Read about the work we did here: blog.vllm.ai/2025/10/09/bla…

Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, Pytorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell It runs every day on the latest software (vLLM, SGLang, etc) across hundreds of GPUs, $10Ms of…



vLLM reposted

Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, Pytorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell It runs every day on the latest software (vLLM, SGLang, etc) across hundreds of GPUs, $10Ms of…

Going to be dropping something huge in 24 hours I think it'll reshape how everyone thinks about chips, inference, and infrastructure It's directly supported by NVIDIA, AMD, Microsoft, OpenAI, Together AI, CoreWeave, Nebius, PyTorch Foundation, Supermicro, Crusoe, HPE, Tensorwave,…



vLLM reposted

DoTS.ocr from @xiaohongshu just got native @vllm_project support! I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @huggingface Jobs Tested on 1800s library cards - works great ✨


vLLM reposted

Shoutout to @vllm_project team for developing a clean, infinitely hackable inference engine! vLLM V1 architecture is great for in-flight weight updates. Just don't stop vLLM, send POST /collective_rpc(method_name=update_weights), and watch inference continue with current KVs!
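A hedged sketch of what that request could look like from a client; the endpoint path comes from the post itself, while the JSON payload shape and the worker-side `update_weights` method are assumptions rather than a documented vLLM API:

```python
# Hedged sketch: trigger an in-flight weight update on a running vLLM server.
import requests

resp = requests.post(
    "http://localhost:8000/collective_rpc",  # endpoint path as described in the post
    json={
        "method": "update_weights",                                # assumed worker-side method name
        "kwargs": {"checkpoint_path": "/tmp/policy_step_1234"},    # hypothetical argument
    },
    timeout=60,
)
resp.raise_for_status()
# In-flight requests keep decoding with their existing KV caches while
# subsequent tokens come from the updated weights -- no server restart needed.
```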

🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…



🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…

I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…



vLLM reposted

.@vllm_project has quickly become a go-to open source engine for efficient large language model inference, balancing performance with a strong developer experience. At NVIDIA, direct contributions to projects like vLLM reflect a commitment to advancing open source AI…


Keeping BERT alive in vLLM, via transformers!!

Today I can finally announce that the Transformers backend for vLLM will be the official way to use encoder-only (think BERT et al.) models in vLLM going forward! (1/2)

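A minimal sketch of what encoder-only usage through the Transformers backend could look like; the BERT checkpoint is illustrative, and the `task="embed"` and `model_impl="transformers"` arguments are assumed to be available in a recent vLLM:

```python
# Hedged sketch: embeddings from an encoder-only model via vLLM's Transformers backend.
from vllm import LLM

llm = LLM(
    model="google-bert/bert-base-uncased",  # assumption: any encoder-only checkpoint
    task="embed",                           # select the pooling/embedding path
    model_impl="transformers",              # route through the Transformers backend
)
outputs = llm.embed(["vLLM keeps BERT alive via the Transformers backend."])
print(len(outputs[0].outputs.embedding))    # embedding dimensionality
```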


vLLM reposted

Our bi-weekly @vllm_project update is here, with @mgoin_ covering: ✨ v0.10.2 (740 commits, 266 contributors) 🧩 New models: Qwen3-Next/Omni/VL, InternVL 3.5, Whisper, and more ⚡ Decode Context Parallel & full cudagraph support 🌍 Community highlights Watch 👇


vLLM reposted

Congrats to @AmandineFlachs and the team for winning the Wildcard prize at the OpenAI Open Model Hackathon! Last week I had the privilege of judging with @simon_mo_ on behalf of @vllm_project. This was a fun surprise and brought me right back to the 2019 OpenAI Five match that…


Thank you @NVIDIADC for supporting vLLM for the @deepseek_ai model launch. Blackwell is now the go-to release platform for new MoEs, and we cannot do it without the amazing team from NVIDIA.

📣 We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses lightning indexer to selectively attend to the most relevant 2K tokens enabling higher performance for long context use cases. vLLM, the open source…


