vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!
🎉Congratulations on the launch @nebiustf. Nebius has been supporting @vllm_project for more than one year. They are not only experts in vLLM but also contribute back! Let's make open-source AI production-grade!
⚡️Today marks a big milestone for Nebius. We’re launching Nebius Token Factory, the evolution of Nebius AI Studio, built to make open-source AI production-grade. Token Factory transforms raw open models into governed, scalable systems with dedicated inference, sub-second…
Looking forward to integrating @perplexity_ai's fast communication kernels in vLLM!
Faster than DeepEP for decode on ConnectX-7. First viable kernel on EFA. SM-free RDMA transfer. Supports prefill. (Maybe portable to other hardware as well.)
😉
Kimi-K2 Reasoning is coming very soon: support just got merged into vLLM. LET'S GOOO, I'm so hyped! github.com/vllm-project/v…
🚀 Learn how to deploy vLLM on NVIDIA DGX Spark — the right way! NVIDIA just published a detailed best practices guide for running high-throughput inference with vLLM, including multi-node setups and optimized Docker builds. @NVIDIAAIDev 👉 Dive in: build.nvidia.com/spark/vllm/ove…
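For anyone following along at home, here's a minimal sketch of querying a vLLM deployment once it's up, via the OpenAI-compatible API that `vllm serve` exposes. The endpoint, model name, and prompt below are placeholders, not taken from the NVIDIA guide.

```python
# Minimal sketch: query a locally running vLLM server through its
# OpenAI-compatible API. Endpoint and model id are placeholders; use
# whatever you actually deployed on the DGX Spark box.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not require a real key by default

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: the model you served
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```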
Good news - we'll be live streaming the first official @vllm_project meetup in Europe from Zürich. Thu, Nov 6 at 11:30am ET / 8:30am PT / 5:30pm CET. Hear from vLLM maintainers and contributors at @RedHat, @IBM, and @MistralAI covering quantization, hybrid models, distributed…
YouTube: [vLLM Office Hours #36] LIVE from Zürich vLLM Meetup - November 6,...
🔥 New Blog: “Disaggregated Inference: 18 Months Later” 18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year: 💸 Serving cost ↓10–100x 🚀 Throughput ↑10x ⚡ Latency ↓5x A big reason? Disaggregated Inference. From DistServe, our…
Love the retrospective on disaggregated inference. If you're wondering where the "PD" (prefill-decode disaggregation) technique in vLLM comes from, read on! Thank you @haoailab for pushing the idea forward.
Amazing work by @RidgerZhu and the ByteDance Seed team — Scaling Latent Reasoning via Looped LMs introduces looped reasoning as a new scaling dimension. 🔥 The Ouro model is now runnable on vLLM (nightly version) — bringing efficient inference to this new paradigm of latent…
Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: We scale looped language models up to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2-3x its size.
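If you want to poke at the looped model yourself, a rough sketch with vLLM's offline API might look like the following. It assumes a nightly vLLM build with the Ouro support merged, and the Hugging Face repo id is a placeholder, so check the model card for the real name.

```python
# Rough sketch of running the Ouro looped LM with vLLM's offline API.
# Requires a nightly vLLM build with Ouro support; the repo id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="ByteDance/Ouro-2.6B", trust_remote_code=True)  # placeholder repo id
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Explain why looping a small model can stand in for a much deeper one."],
    params,
)
print(outputs[0].outputs[0].text)
```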
Wow, Quantization-enhanced Reinforcement Learning using vLLM! Great job by @yukangchen_ 😃
We open-sourced QeRL — Quantization-enhanced Reinforcement Learning! 🧠 4-bit quantized RL training 💪 Train a 32B LLM on a single H100 GPU ⚙️ 1.7× faster overall training 🎯 Accuracy on par with bfloat16 training 🔥 Supports NVFP4 quantization format Moreover, we show…
Wow, excited to see PewDiePie using vLLM to serve language models locally 😃 vLLM brings easy, fast, and cheap LLM serving for everyone 🥰
PewDiePie in 2025: – built a 10×4090 rig – runs Llama 70B, gpt-oss-120B & Qwen 235B locally via vLLM – built a custom web UI (chat, RAG, search, TTS) – ran protein-folding simulations for charity – created an AI “council”, a swarm of 64 models – now fine-tuning his own model…
🚀Excited to team up with @NVIDIAAIDev to bring Nemotron Nano 2 VL to vLLM - a multimodal model powered by a hybrid Transformer–Mamba language backbone, built for video understanding and document intelligence✨ Full post here👇blog.vllm.ai/2025/10/31/run…
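As a taste of what the integration enables, here's a hedged sketch of document understanding with vLLM's multimodal offline API. The repo id and prompt format are assumptions, so see the blog post and model card for the exact chat template the model expects.

```python
# Hedged sketch: image/document understanding via vLLM's multimodal offline API.
# The repo id is a placeholder and the prompt skips the model's chat template.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Nemotron-Nano-2-VL", trust_remote_code=True)  # placeholder repo id
image = Image.open("invoice_page.png")  # any local document image

outputs = llm.generate(
    {
        "prompt": "Describe the key fields in this document.",  # real usage needs the chat template
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=200),
)
print(outputs[0].outputs[0].text)
```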
🎉 Congrats to @Kimi_Moonshot! vLLM's Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA): - RULER 128k context: 84.3 perf + 3.98× speedup - Up to 6× faster decoding & 6.3× faster TPOT (1M tokens) - 75% KV cache reduction 💡…
The Kimi Linear Tech Report just dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi…
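A quick sketch of loading Kimi Linear for long-context inference with vLLM follows; the repo id is truncated in the link above, so the name below is a guess, and max_model_len simply mirrors the 128k RULER setting rather than any requirement.

```python
# Sketch: Kimi Linear long-context inference with vLLM.
# Repo id is a guess (the link above is truncated); 128k context mirrors the RULER numbers.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-Linear-48B-A3B-Instruct",  # placeholder: check the HF link above
    trust_remote_code=True,
    max_model_len=131072,  # 128k tokens
)
outputs = llm.generate(
    ["<very long document goes here> ... Summarize the document above."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```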
welcome, LMCache!
LMCache joins the #PyTorch Ecosystem, advancing scalable #LLM inference through integration with @vllm_project. Developed at the University of Chicago, LMCache reuses and shares KV caches across queries and engines, achieving up to 15× faster throughput. 🔗…
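Roughly, wiring LMCache into vLLM means pointing vLLM's KV-transfer config at the LMCache connector so KV blocks for shared prefixes can be stored and reloaded across queries and engines. The connector name and config class below are assumptions based on the LMCache integration docs as I recall them; treat this as a sketch and check the LMCache README for the exact settings.

```python
# Sketch: hook LMCache into vLLM as a KV-cache connector so shared prefixes
# are reused across requests. Connector name / config class are assumptions.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # assumed connector name
        kv_role="kv_both",                  # both store and load KV blocks
    ),
)

# Requests sharing a long prefix should now hit the external KV cache instead of recomputing it.
print(llm.generate(["<long shared system prompt> ... What changed in the last release?"])[0].outputs[0].text)
```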
🔥 Following our big announcement — here’s the full vLLM takeover at Ray Summit 2025! 📍 San Francisco • Nov 3–5 • Hosted by @anyscalecompute Get ready for deep dives into high-performance inference, unified backends, prefix caching, MoE serving, and large-scale…
🔥Ray Summit 2025 will be one of the biggest events for vLLM this year, with 10+ talks centered around @vllm_project! Looking forward to seeing you there. 🤩Use our discount code (limited time only!) RAYVLLM50 anyscale.com/ray-summit/2025