
vLLM

@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!

🎉 Congratulations on the launch, @nebiustf! Nebius has been supporting @vllm_project for more than a year. They are not only vLLM experts, they also contribute back! Let's make open-source AI production-grade!

⚡️Today marks a big milestone for Nebius. We’re launching Nebius Token Factory, the evolution of Nebius AI Studio, built to make open-source AI production-grade. Token Factory transforms raw open models into governed, scalable systems with dedicated inference, sub-second…



Looking forward to integrating @perplexity_ai's fast communication kernels in vLLM!

Faster than DeepEP for decode on ConnectX-7. First viable kernel on EFA. SM-free RDMA transfer. Supports prefill. (Maybe portable to other hardware as well.)



😉

Kimi-K2 Reasoning is coming very soon
just got merged into vLLM

LETS FUCKING GOOOO
im so hyped im so hyped im so hyped

github.com/vllm-project/v…



🚀 Learn how to deploy vLLM on NVIDIA DGX Spark — the right way! NVIDIA just published a detailed best practices guide for running high-throughput inference with vLLM, including multi-node setups and optimized Docker builds. @NVIDIAAIDev 👉 Dive in: build.nvidia.com/spark/vllm/ove…
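Once a vLLM server is up per a guide like the one above, it exposes an OpenAI-compatible API. A minimal, hedged client-side sketch follows; the base URL, port, and model ID are placeholders, not values from the NVIDIA guide.

```python
# Minimal sketch: query a running vLLM OpenAI-compatible server.
# Assumes `vllm serve <model>` is already running (e.g. per the DGX Spark guide);
# the base URL, port, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```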


vLLM reposted

Good news - we'll be live streaming the first official @vllm_project meetup in Europe from Zürich. Thu, Nov 6 at 11:30am ET / 8:30am PT / 5:30pm CET Hear from vLLM maintainers and contributors at @RedHat, @IBM, and @MistralAI covering quantization, hybrid models, distributed…

RedHat_AI's tweet card (youtube.com): [vLLM Office Hours #36] LIVE from Zürich vLLM Meetup - November 6,...


vLLM reposted

🔥 New Blog: “Disaggregated Inference: 18 Months Later”
18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year:
💸 Serving cost ↓10–100x
🚀 Throughput ↑10x
⚡ Latency ↓5x
A big reason? Disaggregated Inference. From DistServe, our…


Love the retrospective on disaggregated inference. If you've ever wondered where the technique named "PD" (prefill/decode disaggregation) in vLLM comes from, read on! Thank you @haoailab for pushing the idea forward.

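In vLLM, PD disaggregation is exercised through the KV-transfer config: one engine does prefill and ships KV caches to a separate decode engine. Below is a hedged sketch of the prefill (producer) side only; the connector name, config fields, and model ID are assumptions that vary across vLLM versions.

```python
# Hedged sketch of prefill/decode ("PD") disaggregation with vLLM's KV-transfer config.
# This is the prefill (KV producer) side; a matching process with kv_role="kv_consumer"
# would run decode. Connector name, fields, and model ID are assumptions -- check your
# vLLM version's disaggregated-prefill example for the exact setup.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

prefill_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",         # assumed connector name
        kv_role="kv_producer",                  # this process only prefills
        kv_rank=0,
        kv_parallel_size=2,                     # producer + consumer
    ),
)

# Prefill the prompt; the KV cache is shipped to the decode process
# instead of being decoded here (max_tokens=1 keeps this side to prefill).
prefill_llm.generate(
    ["Explain disaggregated inference in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=1),
)
```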



Amazing work by @RidgerZhu and the ByteDance Seed team — Scaling Latent Reasoning via Looped LMs introduces looped reasoning as a new scaling dimension. 🔥 The Ouro model is now runnable on vLLM (nightly version) — bringing efficient inference to this new paradigm of latent…

Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: We scale looped language models up to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models of 2 to 3x its size.

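For readers who want to try the looped model on vLLM nightly, here is a hedged sketch using the offline API. The Hugging Face model ID is a placeholder (check the official Ouro release), and trust_remote_code may be needed for a newly added architecture.

```python
# Hedged sketch: running a looped-LM checkpoint such as Ouro with vLLM's offline API.
# Requires a recent (nightly) vLLM build; the model ID below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="ByteDance/Ouro-2.6B", trust_remote_code=True)  # placeholder ID
outputs = llm.generate(
    ["Solve step by step: what is 17 * 24?"],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```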


Wow, Quantization-enhanced Reinforcement Learning using vLLM! Great job by @yukangchen_ 😃

We open-sourced QeRL — Quantization-enhanced Reinforcement Learning!
🧠 4-bit quantized RL training
💪 Train a 32B LLM on a single H100 GPU
⚙️ 1.7× faster overall training
🎯 Accuracy on par with bfloat16 training
🔥 Supports the NVFP4 quantization format
Moreover, we show…
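As a loose illustration of the rollout side only, here is a sketch of generating RL rollouts from a quantized policy with vLLM. This is not QeRL's pipeline: the fp8 quantization mode, the model ID, and the placeholder reward are stand-ins; QeRL's NVFP4 path and LoRA updates are not reproduced here.

```python
# Hedged sketch: a quantized policy serving RL rollouts via vLLM.
# fp8 here is a stand-in for QeRL's NVFP4; model ID and reward are placeholders.
from vllm import LLM, SamplingParams

policy = LLM(model="Qwen/Qwen2.5-32B-Instruct", quantization="fp8")

prompts = ["Prove that the sum of two even numbers is even."]
rollouts = policy.generate(prompts, SamplingParams(temperature=1.0, n=4, max_tokens=512))
for out in rollouts[0].outputs:
    reward = len(out.text)  # placeholder reward; a real RL loop would verify answers
    print(reward)
```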



Wow, excited to see PewDiePie using vLLM to serve language models locally 😃 vLLM brings easy, fast, and cheap LLM serving for everyone 🥰

PewDiePie in 2025:
– built a 10×4090 rig
– runs Llama 70B, gpt-oss-120B & Qwen 245B locally via vLLM
– built a custom web UI (chat, RAG, search, TTS)
– ran protein-folding simulations for charity
– created an AI “council”, a swarm of 64 models
– now fine-tuning his own model…

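A hedged sketch of what local multi-GPU serving like this looks like with vLLM: one open-weight model sharded across several GPUs via tensor parallelism. The model ID and tensor_parallel_size are placeholders for whatever fits a given rig's VRAM.

```python
# Hedged sketch: local multi-GPU serving with tensor parallelism.
# Model ID and GPU count are placeholders; pick what fits your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",   # placeholder; any open-weight model works
    tensor_parallel_size=8,        # shard the model across 8 local GPUs
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Hello from my local rig!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```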


🚀Excited to team up with @NVIDIAAIDev to bring Nemotron Nano 2 VL to vLLM - a multimodal model powered by a hybrid Transformer–Mamba language backbone, built for video understanding and document intelligence✨ Full post here👇blog.vllm.ai/2025/10/31/run…
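For a vision-language model like Nemotron Nano 2 VL, vLLM accepts OpenAI-style multimodal chat messages. The sketch below is hedged: the model ID is a placeholder (see the linked blog post for the exact checkpoint), and the image URL is illustrative.

```python
# Hedged sketch: multimodal (image + text) inference via vLLM's chat API.
# Model ID and image URL are placeholders; check the blog post above for specifics.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Nemotron-Nano-VL-12B-V2", trust_remote_code=True)  # placeholder ID

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
        {"type": "text", "text": "Extract the total amount from this document."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```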


🎉 Congrats to @Kimi_Moonshot! vLLM Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA):

- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction

💡…


The Kimi Linear Tech Report has dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance — ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi…
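A hedged sketch of serving a long-context hybrid-attention model such as Kimi Linear with vLLM. The model ID and context length are placeholders; check the Kimi Linear release for the exact checkpoint and requirements.

```python
# Hedged sketch: long-context serving of a hybrid linear-attention model.
# Model ID and max_model_len are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-Linear-48B-A3B-Instruct",  # placeholder ID
    trust_remote_code=True,
    max_model_len=131072,   # exercise the 128k long-context path
)
long_doc = "..."  # e.g. a book-length document
out = llm.generate(
    [f"{long_doc}\n\nSummarize the document above."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```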



welcome, LMCache!

LMCache joins the #PyTorch Ecosystem, advancing scalable #LLM inference through integration with @vllm_project. Developed at the University of Chicago, LMCache reuses and shares KV caches across queries and engines, achieving up to 15× faster throughput. 🔗…

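A hedged sketch of wiring LMCache into vLLM through the KV-transfer connector, so KV caches can be reused across queries and engines. The connector name and config fields are assumptions that vary by vLLM/LMCache version; consult the LMCache docs for the exact setup.

```python
# Hedged sketch: enabling an LMCache-backed KV store in vLLM.
# Connector name, kv_role, and model ID are assumptions -- verify against your versions.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",     # assumed connector name
        kv_role="kv_both",                     # this engine both stores and loads KV
    ),
)
# Repeated or shared prefixes can now hit the external KV store instead of being recomputed.
print(llm.generate(["What is KV cache reuse?"])[0].outputs[0].text)
```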


🔥 Following our big announcement — here’s the full vLLM takeover at Ray Summit 2025! 📍 San Francisco • Nov 3–5 • Hosted by @anyscalecompute Get ready for deep dives into high-performance inference, unified backends, prefix caching, MoE serving, and large-scale…

🔥 Ray Summit 2025 will be one of the biggest events for vLLM this year, with 10+ talks centered around @vllm_project! Looking forward to seeing you there. 🤩 Use our discount code (limited time only!) RAYVLLM50 anyscale.com/ray-summit/2025




