
vLLM

@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!

vLLM reposted

It's not as exciting as BERT support... but the @huggingface Transformers backend for @vllm_project now supports mixture-of-experts (MoE) models at full speed! 🚀 Install both packages from source and take it for a spin!

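If you want to take it for a spin, here is a minimal sketch of what that could look like; the MoE checkpoint, the source-install commands, and the sampling settings are illustrative assumptions, not part of the announcement:

```python
# Hedged sketch: run an MoE model through vLLM's Transformers backend.
# Assumes source installs of both packages (exact commands may differ):
#   pip install git+https://github.com/huggingface/transformers.git
#   pip install git+https://github.com/vllm-project/vllm.git
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-MoE-A2.7B-Chat",  # assumption: any HF mixture-of-experts checkpoint
    model_impl="transformers",            # force the Transformers backend instead of a native vLLM impl
)
params = SamplingParams(temperature=0.7, max_tokens=64)
out = llm.generate(["Explain mixture-of-experts in one sentence."], params)
print(out[0].outputs[0].text)
```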

vLLM reposted

Massive new update: Google TPU inference for open models like Gemma! `tpu-inference` is a new @vllm_project backend that delivers up to 5x performance gains over previous prototypes while supporting both frameworks via a single lowering path. - 🤝 Unifies PyTorch and JAX under…


Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. 🚀 What's New? - JAX + PyTorch: Run PyTorch models on…


vLLM reposted

Qwen3-VL has already become one of the most popular multimodal models supported by vLLM - Try it out!

Introducing the compact, dense versions of Qwen3-VL — now available in 4B and 8B pairs, each with both Instruct and Thinking variants. ✅ Lower VRAM usage ✅ Full Qwen3-VL capabilities retained ✅ Strong performance across the board Despite their size, they outperform models…

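To try one of the compact checkpoints with vLLM's offline API, a minimal sketch could look like the following; the exact model id, image URL, and context length are assumptions:

```python
# Hedged sketch: multimodal chat with a compact Qwen3-VL model in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-4B-Instruct", max_model_len=8192)  # assumption: 4B Instruct HF id

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```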


vLLM reposted

NVIDIA + Open Source AI Week 2025 – powered by partnerships, events, and community 👏 We’re excited to bring together technology, community, and collaboration at Open Source AI Week 2025, hosted by @linuxfoundation. From informal meetups to hackathons, panels to poster…


vLLM reposted

🚀🚀Just spun up gpt-oss-20B with vLLM on NVIDIA’s brand-new DGX Spark machine — a surprisingly powerful little box! This isn’t a rigorous benchmark, but early numbers look promising on a quick test (512×128 tokens) with stable serving and smooth setup!

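For reference, a minimal sketch of querying such a deployment through vLLM's OpenAI-compatible server; it assumes `vllm serve openai/gpt-oss-20b` is already running locally, and the model id, port, and prompt are assumptions that say nothing about DGX Spark performance:

```python
# Hedged sketch: chat with a locally served gpt-oss-20B via vLLM's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # default vLLM server port
resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # assumption: HF model id used when launching the server
    messages=[{"role": "user", "content": "Summarize why PagedAttention helps LLM serving."}],
    max_tokens=128,              # matches the 128-token outputs of the quick test
)
print(resp.choices[0].message.content)
```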

🚀 vLLM just hit 60K GitHub stars! 🎉 From a small research idea to powering LLM inference everywhere — across NVIDIA, AMD, Intel, Apple, TPUs, and more — vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.…


vLLM reposted

Trivia & Community Meetup: NVIDIA × @DeepInfra × @vllm_project is happening during @PyTorch Open Source AI Week. Relax after panels and (trivially) compete with engineers, researchers, and AI practitioners who want to connect outside the PyTorch conference setting. Expect AI,…


🚀 vLLM x MinerU: Document Parsing at Lightning Speed! We’re excited to see MinerU fully powered by vLLM — bringing ultra-fast, accurate, and efficient document understanding to everyone. ⚡ Powered by vLLM’s high-throughput inference engine, MinerU 2.5 delivers: Instant…


MinerU 2.5 has arrived with demo on @huggingface



vLLM reposted

Big shoutout to the @vllm_project team for an exceptional showing in the SemiAnalysis InferenceMAX benchmark on NVIDIA Blackwell GPUs 👏 Built through close collaboration with our engineers, vLLM delivered consistently strong Blackwell performance gains across the Pareto…


vLLM reposted

Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do Read about the work we did here: blog.vllm.ai/2025/10/09/bla…

Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, Pytorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell It runs every day on the latest software (vLLM, SGLang, etc) across hundreds of GPUs, $10Ms of…



vLLM reposted

Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, Pytorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell It runs every day on the latest software (vLLM, SGLang, etc) across hundreds of GPUs, $10Ms of…

Going to be dropping something huge in 24 hours I think it'll reshape how everyone thinks about chips, inference, and infrastructure It's directly supported by NVIDIA, AMD, Microsoft, OpenAI, Together AI, CoreWeave, Nebius, PyTorch Foundation, Supermicro, Crusoe, HPE, Tensorwave,…



vLLM reposted

DoTS.ocr from @xiaohongshu just got native @vllm_project support! I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @huggingface Jobs Tested on 1800s library cards - works great ✨


vLLM reposted

Shoutout to @vllm_project team for developing a clean, infinitely hackable inference engine! vLLM V1 architecture is great for in-flight weight updates. Just don't stop vLLM, send POST /collective_rpc(method_name=update_weights), and watch inference continue with current KVs!
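A hedged sketch of what that request could look like from a client; the endpoint path comes from the post itself, while the JSON payload shape and the worker-side `update_weights` method are assumptions rather than a documented vLLM API:

```python
# Hedged sketch: trigger an in-flight weight update on a running vLLM server.
import requests

resp = requests.post(
    "http://localhost:8000/collective_rpc",  # endpoint path as described in the post
    json={
        "method": "update_weights",                                # assumed worker-side method name
        "kwargs": {"checkpoint_path": "/tmp/policy_step_1234"},    # hypothetical argument
    },
    timeout=60,
)
resp.raise_for_status()
# In-flight requests keep decoding with their existing KV caches while
# subsequent tokens come from the updated weights -- no server restart needed.
```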

🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…



🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…

I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…



vLLM reposted

.@vllm_project has quickly become a go-to open source engine for efficient large language model inference, balancing performance with a strong developer experience. At NVIDIA, direct contributions to projects like vLLM reflect a commitment to advancing open source AI…


Keeping BERT alive in vLLM, via transformers!!

Today I can finally announce that the Transformers backend for vLLM will be the official way to use encoder-only (think BERT et al.) models in vLLM going forward! (1/2)

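A minimal sketch of what encoder-only usage through the Transformers backend could look like; the BERT checkpoint is illustrative, and the `task="embed"` and `model_impl="transformers"` arguments are assumed to be available in a recent vLLM:

```python
# Hedged sketch: embeddings from an encoder-only model via vLLM's Transformers backend.
from vllm import LLM

llm = LLM(
    model="google-bert/bert-base-uncased",  # assumption: any encoder-only checkpoint
    task="embed",                           # select the pooling/embedding path
    model_impl="transformers",              # route through the Transformers backend
)
outputs = llm.embed(["vLLM keeps BERT alive via the Transformers backend."])
print(len(outputs[0].outputs.embedding))    # embedding dimensionality
```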


vLLM reposted

Our bi-weekly @vllm_project update is here, with @mgoin_ covering: ✨ v0.10.2 (740 commits, 266 contributors) 🧩 New models: Qwen3-Next/Omni/VL, InternVL 3.5, Whisper, and more ⚡ Decode Context Parallel & full cudagraph support 🌍 Community highlights Watch 👇


vLLM reposted

Congrats to @AmandineFlachs and the team for winning the Wildcard prize at the OpenAI Open Model Hackathon! Last week I had the privilege of judging with @simon_mo_ on behalf of @vllm_project. This was a fun surprise and brought me right back to the 2019 OpenAI Five match that…


Thank you @NVIDIADC for supporting vLLM for the @deepseek_ai model launch. Blackwell is now the go-to release platform for new MoEs, and we cannot do it without the amazing team from NVIDIA.

📣 We partnered with @vllm_project to optimize DeepSeek-V3.2-Exp across our platform. @deepseek_ai's Sparse Attention uses lightning indexer to selectively attend to the most relevant 2K tokens enabling higher performance for long context use cases. vLLM, the open source…


