vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss with the community!
More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text, image, audio, and video. Today we’re releasing vLLM-Omni: an open-source framework that extends vLLM’s easy, fast, and cost-efficient…
Transformers v5's first release candidate is out 🔥 The biggest release of my life. It's been five years since the last major (v4). From 20 architectures to 400, 20k daily downloads to 3 million. The release is huge, w/ tokenization (no slow tokenizers!), modeling & processing.
Love this: a community contributor built vLLM Playground to make inferencing visible, interactive, and experiment-friendly. From visual config toggles to automatic command generation, from GPU/M-chip support to GuideLLM benchmarking + LLMCompressor integration — it brings the…
vLLM is proud to support @PrimeIntellect's post-training of the INTELLECT-3 model🥰
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack. Achieving state-of-the-art performance for its size across math, code and reasoning. Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more
Interested in how NVIDIA Nemotron-H is being optimized for high performance inference in @vllm_project? Join @RedHat and @NVIDIAAI next week as we cover the Nemotron-H architecture, vLLM support, optimized MoE kernels, async scheduling, and new nsys profiles. Join links below 👇
Wide-EP and prefill/decode disaggregation APIs for vLLM are now available in Ray 2.52 🚀🚀 Validated at 2.4k tokens/H200 on Anyscale Runtime, these patterns maximize sparse MoE model inference efficiency, but often require non-trivial orchestration logic. Here’s how they…
Running multi-node vLLM on Ray can be complicated: different roles, env vars, and SSH glue to keep things together. The new `ray symmetric-run` command lets you run the same entrypoint on every node while Ray handles cluster startup, coordination, and teardown for you. Deep…
In 2.52, we're introducing ray symmetric-run -- a CLI command to simplify and improve the large model interactive development experience on Ray and @vllm_project! Spinning up multi-node vLLM with Ray on interactive environments can be tedious, requiring users to juggle separate…
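To make the multi-node point concrete, here is a minimal sketch of running vLLM across nodes with Ray as the distributed executor once a cluster is up (e.g. via symmetric-run). The model name and parallel sizes are placeholders, not a recommended configuration:

```python
# Minimal sketch (not the symmetric-run CLI itself): with a Ray cluster running,
# vLLM can place and manage its workers across nodes through Ray.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,                      # shard each layer across 8 GPUs
    pipeline_parallel_size=2,                    # span 2 nodes via pipeline stages
    distributed_executor_backend="ray",          # let Ray schedule the workers
)

out = llm.generate(["Hello from a multi-node cluster"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```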
📸 Scenes from the Future of Inferencing! PyTorch ATX, the @vllm_project community, and @RedHat brought together 90+ AI builders at Capital Factory back in September to dive into the latest in LLM inference -> from quantization and PagedAttention to multi-node deployment…
An amazing blog post dropped on @huggingface explaining how today's LLM inference engines like @vllm_project work! The concept of "continuous batching" is explained, along with KV-caching, attention masking, chunked prefill, and decoding. Continuous batching is the idea of…
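For readers who want the intuition in code, here is a toy sketch of continuous batching: finished sequences leave the batch and queued requests join it every decode step, instead of waiting for the whole batch to drain. This is a conceptual illustration, not vLLM's actual scheduler:

```python
from collections import deque

def continuous_batching(waiting: deque, step_fn, max_batch: int = 8):
    """Admit new requests whenever a slot frees up, one decode step at a time."""
    running = []
    while waiting or running:
        # Fill free slots from the queue before every decode step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step over the live batch; keep only unfinished sequences.
        running = [seq for seq in step_fn(running) if not seq["done"]]

# Demo: each "request" just needs N more tokens; a step emits one token per request.
def fake_decode_step(batch):
    for seq in batch:
        seq["remaining"] -= 1
        seq["done"] = seq["remaining"] <= 0
    return batch

continuous_batching(
    deque({"remaining": n, "done": False} for n in (3, 7, 2, 9)),
    fake_decode_step,
    max_batch=2,
)
```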
I’ll be at NeurIPS next week, excited to see many friendly faces from past years :) And let me know if you need a spot in the @vllm_project party!! luma.com/alpuytvr?tk=gz…
🇲🇾 Malaysia vLLM Day is 5 days away. vLLM Malaysia Day — 2 Dec 2025 📍 ILHAM Tower, KL We are bringing the vLLM and @lmcache community together with @EmbeddedLLM, @AMD, @RedHat, @WekaIO to advance open, production-grade AI across ASEAN. The Lineup: 🚀 The State of vLLM &…
Last week, we shared how our experiment agent reproduced @thinkymachines's RL w/ LoRA results using only natural language conversation. How was this possible? The secret isn't just better reasoning; it's giving the agent a dedicated layer of…
FP8 RL on consumer GPUs just got a boost🔥 Thrilled to team up with @UnslothAI and TorchAO to bring FP8 GRPO to vLLM: ~1.4× faster RL inference, 60% less VRAM, 12× longer context, and Qwen3-1.7B fitting in 5GB VRAM!
You can now run FP8 reinforcement learning on consumer GPUs! Try DeepSeek-R1’s FP8 GRPO at home using only a 5GB GPU. Qwen3-1.7B fits in 5GB VRAM. We collabed with PyTorch to make FP8 RL inference 1.4× faster. Unsloth: 60% less VRAM, 12× longer context. docs.unsloth.ai/new/fp8-reinfo…
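The tweets above are about FP8 GRPO training with Unsloth; on the vLLM side, the memory savings come from serving the rollout model in FP8. A minimal sketch of that inference piece only, with the model ID assumed from the announcement:

```python
# Sketch of the inference side: load a small model in vLLM with online FP8
# weight quantization. The Unsloth GRPO training loop itself is not shown.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-1.7B",      # small model from the announcement; ID assumed
    quantization="fp8",           # on-the-fly FP8 weight quantization in vLLM
    gpu_memory_utilization=0.80,  # leave headroom on a small consumer GPU
)
out = llm.generate(["Solve: 12 * 13 ="], SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)
```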
Last time we shared Docker Model Runner + vLLM, many in the community showed strong interest — especially around how simple it makes local AI workflows. If you want a deeper look, don’t miss this upcoming session: 📅 Level Up Your Local AI Workflow with Model Runner 🔗…
🚀 Docker Model Runner now integrates vLLM! High-throughput LLM inference is now available with the same Docker workflow devs already use. ▸ Native safetensors support ▸ Automatic routing between llama.cpp (GGUF) and vLLM (safetensors) ▸ Works from laptops → clusters,…
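Since Docker Model Runner exposes an OpenAI-compatible endpoint, the standard client works regardless of whether it routed the model to llama.cpp or vLLM. A hedged sketch; the base URL and model tag below are assumptions, so check the Docker docs and `docker model` CLI for your setup:

```python
# Sketch: talk to Docker Model Runner through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",  # assumed host-side TCP endpoint
    api_key="not-needed-locally",                  # local endpoint, no real key required
)
resp = client.chat.completions.create(
    model="ai/qwen3",  # assumed model tag pulled via `docker model pull`
    messages=[{"role": "user", "content": "Summarize what vLLM does in one line."}],
)
print(resp.choices[0].message.content)
```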
Great to see more compact OCR models landing in open source lately — and HunyuanOCR is a standout: a strong, versatile 1B model with impressive real-world coverage. The vLLM community already provides Day-0 support, so you can try it right away: docs.vllm.ai/projects/recip…
We are thrilled to open-source HunyuanOCR, an expert, end-to-end OCR model built on Hunyuan's native multimodal architecture and training strategy. This model achieves SOTA performance with only 1 billion parameters, significantly reducing deployment costs. ⚡️Benchmark Leader:…
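A quick sketch of what trying it in vLLM can look like, using the multimodal chat API. The Hugging Face model ID and prompt format are assumptions here; the linked vLLM recipe is the authoritative setup:

```python
# Sketch: run an OCR prompt through vLLM's multimodal chat interface.
from vllm import LLM, SamplingParams

llm = LLM(model="tencent/HunyuanOCR", trust_remote_code=True)  # model ID assumed
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
        {"type": "text", "text": "Extract all text from this image."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=512, temperature=0.0))
print(out[0].outputs[0].text)
```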
🚀 vLLM Talent Pool is Open! As LLM adoption accelerates, vLLM has become the mainstream inference engine used across major cloud providers (AWS, Google Cloud, Azure, Alibaba Cloud, ByteDance, Tencent, Baidu…) and leading model labs (DeepSeek, Moonshot, Qwen…). To meet the…
🚀Congrats on the launch! Speculators + vLLM brings a clean, standardized path for speculative decoding — making it easier to move from draft models to real production workloads. Excited to see how the community uses this to build faster, more efficient inference systems.
We’re open-sourcing a set of high quality speculator models for Llamas, Qwens, and gpt-oss on Hugging Face. In real workloads, you can expect 1.5 to 2.5x speedups and sometimes more than 4x. Here’s how this fits into the bigger story for speculative decoding. A thread 🧵:
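For context on how a speculator plugs in, here is a sketch of enabling draft-model speculative decoding in vLLM. Both model IDs and the token count are placeholders, not the released speculator checkpoints, and the Speculators library's own loading path may differ:

```python
# Sketch: pair a target model with a hypothetical draft/speculator model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",      # target model (placeholder)
    speculative_config={
        "model": "your-org/llama-3.1-8b-draft",    # hypothetical draft model
        "num_speculative_tokens": 4,               # tokens proposed per step
    },
)
out = llm.generate(["Speculative decoding works by"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```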