
vLLM

@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!

🎉 Congratulations on the launch, @nebiustf! Nebius has been supporting @vllm_project for more than a year. They are not only vLLM experts, they also contribute back! Let's make open-source AI production-grade!

⚡️Today marks a big milestone for Nebius. We’re launching Nebius Token Factory, the evolution of Nebius AI Studio, built to make open-source AI production-grade. Token Factory transforms raw open models into governed, scalable systems with dedicated inference, sub-second…



Looking forward to integrating @perplexity_ai's fast communication kernels in vLLM!

Faster than DeepEP for decode on ConnectX-7. First viable kernel on EFA. SM-free RDMA transfer. Supports prefill. (Maybe portable to other hardware as well.)



😉

Kimi-K2 Reasoning is coming very soon
just got merged into vLLM

LETS FUCKING GOOOO
im so hyped im so hyped im so hyped

github.com/vllm-project/v…



🚀 Learn how to deploy vLLM on NVIDIA DGX Spark — the right way! NVIDIA just published a detailed best practices guide for running high-throughput inference with vLLM, including multi-node setups and optimized Docker builds. @NVIDIAAIDev 👉 Dive in: build.nvidia.com/spark/vllm/ove…
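Once a vLLM server is up per a guide like the one above, it exposes an OpenAI-compatible API. A minimal, hedged client-side sketch follows; the base URL, port, and model ID are placeholders, not values from the NVIDIA guide.

```python
# Minimal sketch: query a running vLLM OpenAI-compatible server.
# Assumes `vllm serve <model>` is already running (e.g. per the DGX Spark guide);
# the base URL, port, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```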


vLLM reposted

Good news - we'll be live streaming the first official @vllm_project meetup in Europe from Zürich. Thu, Nov 6 at 11:30am ET / 8:30am PT / 5:30pm CET Hear from vLLM maintainers and contributors at @RedHat, @IBM, and @MistralAI covering quantization, hybrid models, distributed…

RedHat_AI's tweet card (youtube.com): [vLLM Office Hours #36] LIVE from Zürich vLLM Meetup - November 6,...


vLLM reposted

🔥 New Blog: “Disaggregated Inference: 18 Months Later”
18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year:
💸 Serving cost ↓10–100x
🚀 Throughput ↑10x
⚡ Latency ↓5x
A big reason? Disaggregated Inference. From DistServe, our…


Love the retrospective on disaggregated inference. If you've ever wondered where the technique named "PD" (prefill/decode disaggregation) in vLLM comes from, read on! Thank you @haoailab for pushing the idea forward.

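In vLLM, PD disaggregation is exercised through the KV-transfer config: one engine does prefill and ships KV caches to a separate decode engine. Below is a hedged sketch of the prefill (producer) side only; the connector name, config fields, and model ID are assumptions that vary across vLLM versions.

```python
# Hedged sketch of prefill/decode ("PD") disaggregation with vLLM's KV-transfer config.
# This is the prefill (KV producer) side; a matching process with kv_role="kv_consumer"
# would run decode. Connector name, fields, and model ID are assumptions -- check your
# vLLM version's disaggregated-prefill example for the exact setup.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

prefill_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",         # assumed connector name
        kv_role="kv_producer",                  # this process only prefills
        kv_rank=0,
        kv_parallel_size=2,                     # producer + consumer
    ),
)

# Prefill the prompt; the KV cache is shipped to the decode process
# instead of being decoded here (max_tokens=1 keeps this side to prefill).
prefill_llm.generate(
    ["Explain disaggregated inference in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=1),
)
```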



Amazing work by @RidgerZhu and the ByteDance Seed team — Scaling Latent Reasoning via Looped LMs introduces looped reasoning as a new scaling dimension. 🔥 The Ouro model is now runnable on vLLM (nightly version) — bringing efficient inference to this new paradigm of latent…

Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.” TL;DR: We scale looped language models up to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models of 2 to 3x its size.

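For readers who want to try the looped model on vLLM nightly, here is a hedged sketch using the offline API. The Hugging Face model ID is a placeholder (check the official Ouro release), and trust_remote_code may be needed for a newly added architecture.

```python
# Hedged sketch: running a looped-LM checkpoint such as Ouro with vLLM's offline API.
# Requires a recent (nightly) vLLM build; the model ID below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="ByteDance/Ouro-2.6B", trust_remote_code=True)  # placeholder ID
outputs = llm.generate(
    ["Solve step by step: what is 17 * 24?"],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```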


Wow, Quantization-enhanced Reinforcement Learning using vLLM! Great job by @yukangchen_ 😃

We open-sourced QeRL — Quantization-enhanced Reinforcement Learning!
🧠 4-bit quantized RL training
💪 Train a 32B LLM on a single H100 GPU
⚙️ 1.7× faster overall training
🎯 Accuracy on par with bfloat16 training
🔥 Supports the NVFP4 quantization format
Moreover, we show…
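As a loose illustration of the rollout side only, here is a sketch of generating RL rollouts from a quantized policy with vLLM. This is not QeRL's pipeline: the fp8 quantization mode, the model ID, and the placeholder reward are stand-ins; QeRL's NVFP4 path and LoRA updates are not reproduced here.

```python
# Hedged sketch: a quantized policy serving RL rollouts via vLLM.
# fp8 here is a stand-in for QeRL's NVFP4; model ID and reward are placeholders.
from vllm import LLM, SamplingParams

policy = LLM(model="Qwen/Qwen2.5-32B-Instruct", quantization="fp8")

prompts = ["Prove that the sum of two even numbers is even."]
rollouts = policy.generate(prompts, SamplingParams(temperature=1.0, n=4, max_tokens=512))
for out in rollouts[0].outputs:
    reward = len(out.text)  # placeholder reward; a real RL loop would verify answers
    print(reward)
```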



Wow, excited to see PewDiePie using vLLM to serve language models locally 😃 vLLM brings easy, fast, and cheap LLM serving for everyone 🥰

PewDiePie in 2025:
– built a 10×4090 rig
– runs Llama 70B, gpt-oss-120B & Qwen 245B locally via vLLM
– built a custom web UI (chat, RAG, search, TTS)
– ran protein-folding simulations for charity
– created an AI “council”, a swarm of 64 models
– now fine-tuning his own model…

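A hedged sketch of what local multi-GPU serving like this looks like with vLLM: one open-weight model sharded across several GPUs via tensor parallelism. The model ID and tensor_parallel_size are placeholders for whatever fits a given rig's VRAM.

```python
# Hedged sketch: local multi-GPU serving with tensor parallelism.
# Model ID and GPU count are placeholders; pick what fits your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",   # placeholder; any open-weight model works
    tensor_parallel_size=8,        # shard the model across 8 local GPUs
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Hello from my local rig!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```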


🚀Excited to team up with @NVIDIAAIDev to bring Nemotron Nano 2 VL to vLLM - a multimodal model powered by a hybrid Transformer–Mamba language backbone, built for video understanding and document intelligence✨ Full post here👇blog.vllm.ai/2025/10/31/run…
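For a vision-language model like Nemotron Nano 2 VL, vLLM accepts OpenAI-style multimodal chat messages. The sketch below is hedged: the model ID is a placeholder (see the linked blog post for the exact checkpoint), and the image URL is illustrative.

```python
# Hedged sketch: multimodal (image + text) inference via vLLM's chat API.
# Model ID and image URL are placeholders; check the blog post above for specifics.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Nemotron-Nano-VL-12B-V2", trust_remote_code=True)  # placeholder ID

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
        {"type": "text", "text": "Extract the total amount from this document."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```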


🎉 Congrats to @Kimi_Moonshot! vLLM Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA):

- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction

💡…


The Kimi Linear Tech Report has dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance — ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi…
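A hedged sketch of serving a long-context hybrid-attention model such as Kimi Linear with vLLM. The model ID and context length are placeholders; check the Kimi Linear release for the exact checkpoint and requirements.

```python
# Hedged sketch: long-context serving of a hybrid linear-attention model.
# Model ID and max_model_len are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-Linear-48B-A3B-Instruct",  # placeholder ID
    trust_remote_code=True,
    max_model_len=131072,   # exercise the 128k long-context path
)
long_doc = "..."  # e.g. a book-length document
out = llm.generate(
    [f"{long_doc}\n\nSummarize the document above."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```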



welcome, LMCache!

LMCache joins the #PyTorch Ecosystem, advancing scalable #LLM inference through integration with @vllm_project. Developed at the University of Chicago, LMCache reuses and shares KV caches across queries and engines, achieving up to 15× faster throughput. 🔗…

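A hedged sketch of wiring LMCache into vLLM through the KV-transfer connector, so KV caches can be reused across queries and engines. The connector name and config fields are assumptions that vary by vLLM/LMCache version; consult the LMCache docs for the exact setup.

```python
# Hedged sketch: enabling an LMCache-backed KV store in vLLM.
# Connector name, kv_role, and model ID are assumptions -- verify against your versions.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",     # assumed connector name
        kv_role="kv_both",                     # this engine both stores and loads KV
    ),
)
# Repeated or shared prefixes can now hit the external KV store instead of being recomputed.
print(llm.generate(["What is KV cache reuse?"])[0].outputs[0].text)
```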


🔥 Following our big announcement — here’s the full vLLM takeover at Ray Summit 2025! 📍 San Francisco • Nov 3–5 • Hosted by @anyscalecompute Get ready for deep dives into high-performance inference, unified backends, prefix caching, MoE serving, and large-scale…

🔥 Ray Summit 2025 will be one of the biggest events for vLLM this year, with 10+ talks centered around @vllm_project! Looking forward to seeing you there. 🤩 Use our discount code (limited time only!) RAYVLLM50 anyscale.com/ray-summit/2025




