
vLLM

@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!

🚀 vLLM now offers an optimized inference recipe for DeepSeek-V3.2.

⚙️ Startup details
Run vLLM with DeepSeek-specific components:
--tokenizer-mode deepseek_v32 \
--tool-call-parser deepseek_v32

🧰 Usage tips
Enable thinking mode in vLLM: –…

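A full launch command is not shown above, but here is a rough sketch built from the flags in the post (the model ID and the --tensor-parallel-size value are assumptions, and the truncated thinking-mode tip is deliberately left out):

# Hypothetical launch command; only the two deepseek_v32 flags come from the post above.
# --enable-auto-tool-choice is the companion flag vLLM normally requires with a tool-call parser.
vllm serve deepseek-ai/DeepSeek-V3.2 \
  --tokenizer-mode deepseek_v32 \
  --tool-call-parser deepseek_v32 \
  --enable-auto-tool-choice \
  --tensor-parallel-size 8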

🚀 Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents! 🔹 DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API. 🔹 DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now. 📄 Tech…



We’re taking CUDA debugging to the next level. 🚀 Building on our previous work with CUDA Core Dumps, we are releasing a new guide on tracing hanging and complicated kernels down to the source code. As kernels get more complex (deep inlining, async memory access), standard…


Have you ever been developing CUDA kernels, had your tests run into illegal memory access (IMA for short), and had no idea how to debug it? We collaborated with the @nvidia team to investigate how CUDA core dumps can help; check out the blog post to learn more!…
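The guide itself covers the details, but here is a minimal sketch of the standard CUDA core dump workflow (not vLLM-specific; the file names and test script are placeholders, while the environment variables and cuda-gdb commands are NVIDIA's standard tooling):

# Build kernels with line info so the dump can be mapped back to source lines.
nvcc -lineinfo -o my_kernel_test my_kernel_test.cu

# Ask the CUDA driver to write a GPU core dump when an exception (e.g. an IMA) occurs.
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_FILE=/tmp/cuda_core_%h_%p.nvcudmp
python run_failing_test.py

# Inspect the dump: which kernel faulted, on which thread, and where in the source.
cuda-gdb python
(cuda-gdb) target cudacore /tmp/cuda_core_myhost_12345.nvcudmp
(cuda-gdb) info cuda kernels
(cuda-gdb) bt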



🤝 Proud to share the first production-ready vLLM plugin for Gaudi, developed in close collaboration with the Intel team and fully aligned with upstream vLLM. 🔧 This release is validated and ready for deployment, with support for the latest vLLM version coming soon. 📘 The…


LLM agents are powerful but can be slow at scale. @Snowflake's model-free SuffixDecoding from Arctic Inference now runs natively in vLLM, beating tuned N-gram speculation across concurrency levels while keeping CPU and memory overhead in check. Quick Start in vLLM:…

Suffix Decoding is at #NeurIPS2025 as a 🏅spotlight! It accelerates LLM inference for coding, agents, and RL. We also optimized its speculation speed by 7.4x and merged it into vLLM (SGLang support incoming). Talk to @GabrieleOliaro or me at poster #816 Friday 11am! Links in🧵

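The quick start referenced above is truncated, so as a hedged sketch only: suffix decoding plugs into the same speculative-decoding config used for N-gram speculation (the model ID and the "suffix" method name below are assumptions; check the vLLM speculative decoding docs for the exact keys):

# Hypothetical quick start; the method name and model are assumptions, not confirmed values.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "suffix"}'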


🎉 Congratulations to the Mistral team on launching the Mistral 3 family! We’re proud to share that @MistralAI, @NVIDIAAIDev, @RedHat_AI, and vLLM worked closely together to deliver full Day-0 support for the entire Mistral 3 lineup. This collaboration enabled: • NVFP4…


Introducing the Mistral 3 family of models: Frontier intelligence at all sizes. Apache 2.0. Details in 🧵



More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text, image, audio, and video. Today we’re releasing vLLM-Omni: an open-source framework that extends vLLM’s easy, fast, and cost-efficient…


vLLM reposted

Transformers v5's first release candidate is out 🔥 The biggest release of my life. It's been five years since the last major (v4). From 20 architectures to 400, 20k daily downloads to 3 million. The release is huge, w/ tokenization (no slow tokenizers!), modeling & processing.


Love this: a community contributor built vLLM Playground to make inferencing visible, interactive, and experiment-friendly. From visual config toggles to automatic command generation, from GPU/M-chip support to GuideLLM benchmarking + LLMCompressor integration — it brings the…


vLLM is proud to support @PrimeIntellect 's post-training of the INTELLECT-3 model🥰

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more



vLLM reposted

Interested in how NVIDIA Nemotron-H is being optimized for high performance inference in @vllm_project? Join @RedHat and @NVIDIAAI next week as we cover the Nemotron-H architecture, vLLM support, optimized MoE kernels, async scheduling, and new nsys profiles. Join links below 👇


vLLM reposted

Wide-EP and prefill/decode disaggregation APIs for vLLM are now available in Ray 2.52 🚀🚀 Validated at 2.4k tokens/H200 on Anyscale Runtime, these patterns maximize sparse MoE model inference efficiency, but often require non-trivial orchestration logic. Here’s how they…


Running multi-node vLLM on Ray can be complicated: different roles, env vars, and SSH glue to keep things together. The new `ray symmetric-run` command lets you run the same entrypoint on every node while Ray handles cluster startup, coordination, and teardown for you. Deep…

In 2.52, we're introducing ray symmetric-run -- a CLI command to simplify and improve the large model interactive development experience on Ray and @vllm_project! Spinning up multi-node vLLM with Ray on interactive environments can be tedious, requiring users to juggle separate…

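As an illustrative sketch only (the exact symmetric-run syntax is not shown in these posts, so the command layout below is a guess; the vLLM flags are standard multi-node options):

# Hypothetical invocation; consult the Ray 2.52 docs for the real symmetric-run options.
# The idea: the same entrypoint runs on every node while Ray handles cluster startup and teardown.
ray symmetric-run -- \
  vllm serve meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray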


vLLM reposted

📸 Scenes from the Future of Inferencing! PyTorch ATX, the @vllm_project community, and @RedHat brought together 90+ AI builders at Capital Factory back in September to dive into the latest in LLM inference, from quantization and PagedAttention to multi-node deployment.…


vLLM reposted

An amazing blog post dropped on @huggingface explaining how today's LLM inference engines like @vllm_project work! The concept of "continuous batching" is explained, along with KV-caching, attention masking, chunked prefill, and decoding. Continuous batching is the idea of…

