
EmbeddedLLM

@EmbeddedLLM

Your open-source AI ally. We specialize in integrating LLM into your business.

Pinned

Pro-tip for vLLM power-users: free ≈90% of your GPU VRAM in seconds—no restarts required 🚀

🚩 Why you'll want this
• Hot-swap new checkpoints on the same card
• Rotate multiple LLMs on one GPU (batch jobs, micro-services, A/B tests)
• Stage-based pipelines that call…
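
A minimal sketch of how this works in practice, using vLLM's sleep-mode API (`enable_sleep_mode`, `llm.sleep()`, `llm.wake_up()`); the model name is only an example, and exact flags may shift between vLLM versions:

```python
# Minimal sketch of vLLM sleep mode (vLLM >= 0.7; behavior may vary by
# version). Level-1 sleep offloads weights to CPU RAM and drops the KV
# cache, freeing most GPU VRAM without restarting the process.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",  # example model
          enable_sleep_mode=True)            # must be set at construction

print(llm.generate(["Hello!"], SamplingParams(max_tokens=16)))

llm.sleep(level=1)   # VRAM is now mostly free for another job on this GPU
# ... run a different model / batch job / A-B variant here ...
llm.wake_up()        # weights stream back; no cold start, no restart

print(llm.generate(["Welcome back!"], SamplingParams(max_tokens=16)))
```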

EmbeddedLLM reposted

High performance. Scalable. Flexible. Open. @Oracle and AMD are scaling AI to new heights. Beginning in Q3 2026, @OracleCloud Infrastructure will be the first hyperscaler to offer a public AI supercluster, with an initial 50K AMD Instinct MI450 Series GPU deployment. 🔗 See…

EmbeddedLLM reposted

🚀 vLLM just hit 60K GitHub stars! 🎉 From a small research idea to powering LLM inference everywhere — across NVIDIA, AMD, Intel, Apple, TPUs, and more — vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.…


EmbeddedLLM reposted

🚀 vLLM x MinerU: Document Parsing at Lightning Speed! We’re excited to see MinerU fully powered by vLLM — bringing ultra-fast, accurate, and efficient document understanding to everyone. ⚡ Powered by vLLM’s high-throughput inference engine, MinerU 2.5 delivers: Instant…


MinerU 2.5 has arrived with demo on @huggingface

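Since MinerU 2.5 runs on vLLM's OpenAI-compatible server, the request shape is a standard multimodal chat call, roughly as sketched below; the model id, port, and prompt wording are assumptions for illustration, not verified values.

```python
# Illustrative only: querying a MinerU-style document parser through vLLM's
# OpenAI-compatible API. The serve command and model id are assumptions:
#   vllm serve opendatalab/MinerU2.5-2509-1.2B --trust-remote-code
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="opendatalab/MinerU2.5-2509-1.2B",  # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/page-scan.png"}},
            {"type": "text",
             "text": "Parse this page to Markdown; keep tables and formulas."},
        ],
    }],
)
print(resp.choices[0].message.content)
```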


EmbeddedLLM reposted

Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here: blog.vllm.ai/2025/10/09/bla…

Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, PyTorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, Dell. It runs every day on the latest software (vLLM, SGLang, etc.) across hundreds of GPUs, $10Ms of…



EmbeddedLLM reposted

🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what…

I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…

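The core idea lends itself to a tiny sketch (our toy illustration, not PipelineRL's actual API): decoding never drains; a trainer coroutine bumps the weight version mid-sequence, and the KV cache built under older weights is deliberately kept.

```python
# Toy illustration of in-flight weight updates (not PipelineRL's real API):
# the generator never stops to wait for the trainer; weights change between
# decode steps and the stale KV cache simply carries over.
import asyncio

weights_version = 0

async def trainer():
    global weights_version
    for _ in range(3):
        await asyncio.sleep(0.5)          # pretend to run an optimizer step
        weights_version += 1              # in-place weight update
        print(f"[trainer] pushed weights v{weights_version}")

async def generator(seq_id: int):
    kv_cache = []                         # survives weight updates (stale on purpose)
    for step in range(20):
        await asyncio.sleep(0.1)          # pretend to decode one token
        kv_cache.append(f"tok{step}@v{weights_version}")
    print(f"[gen {seq_id}] done, KV mixes versions: {kv_cache[0]} .. {kv_cache[-1]}")

async def main():
    await asyncio.gather(trainer(), *(generator(i) for i in range(2)))

asyncio.run(main())
```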


EmbeddedLLM reposted

Getting ready to try DeepSeek-V3.2-Exp from @deepseek_ai? vLLM is here to help! We have verified that it works on H200 machines, and on many other hardware platforms thanks to the hardware plugin mechanism. Check out the recipes docs.vllm.ai/projects/recip… for more details 😍 Note: currently…




EmbeddedLLM reposted

How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores past tokens against each incoming query, and the top-2048 tokens are passed to Sparse MLA.


🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
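
To make that selection step concrete, here is a toy PyTorch sketch; the 128-d indexer keys, 512-d MLA cache, and top-2048 cut follow the description above, while the random tensors and plain dot-product scoring are illustrative stand-ins for DeepSeek's actual kernels.

```python
# Toy sketch of the DSA selection path: a lightweight indexer scores all
# past tokens for each query, and full attention runs only over the top-k.
import torch
import torch.nn.functional as F

T, d_idx, d_full, k = 8192, 128, 512, 2048  # seq len, indexer dim, MLA dim, top-k

idx_keys  = torch.randn(T, d_idx)    # small per-token indexer cache (128-dim)
full_keys = torch.randn(T, d_full)   # full latent KV cache (512-dim)
values    = torch.randn(T, d_full)

query_idx  = torch.randn(d_idx)      # indexer view of the incoming query
query_full = torch.randn(d_full)     # full-attention view of the same query

# 1) Lightning Indexer: cheap scores over all T past tokens.
scores = idx_keys @ query_idx                      # (T,)

# 2) Keep only the top-k tokens.
top = torch.topk(scores, k).indices                # (k,)

# 3) Sparse attention: full attention restricted to those k tokens.
attn = F.softmax((full_keys[top] @ query_full) / d_full**0.5, dim=0)
out = attn @ values[top]                           # (d_full,)
print(out.shape)
```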



EmbeddedLLM reposted

🚀 New in vLLM: dots.ocr 🔥 A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM! 📝 Single end-to-end parser for text, tables (HTML), formulas (LaTeX), and layouts (Markdown) 🌍 Supports 100 languages with robust performance on…


we're all sleeping on this OCR model 🔥 dots.ocr is a new 3B model with SOTA performance, support for 100 languages & allowing commercial use! 🤯 a single e2e model to extract text from images and convert tables, formulas, and more into markdown 📝

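For offline use rather than a server, vLLM's multimodal generate API looks roughly like the sketch below; the HF id rednote-hilab/dots.ocr is taken from the model's release, while the prompt text is an assumption (many VLMs need image-placeholder tokens from their chat template; check the model card).

```python
# Sketch: offline dots.ocr inference with vLLM. Model id believed correct
# but unverified; the prompt format is an assumption.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="rednote-hilab/dots.ocr", trust_remote_code=True)

image = Image.open("invoice_page.png")
prompt = ("Parse this document page: output text as Markdown, "
          "tables as HTML, and formulas as LaTeX.")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```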


EmbeddedLLM reposted

Missed our latest vLLM office hours? We covered hybrid models as first-class citizens in @vllm_project. ✅ Hybrid model support in v1 ✅ Mamba, Mamba2, linear attention ✅ Performance from v0 → v1 ▶️ Recording: youtube.com/live/uWQ489ONv… 📑 Slides: docs.google.com/presentation/d…


EmbeddedLLM reposted

We keep pushing the limits of speculative decoding (SD) in LLM inference -- check out our latest NeurIPS'25 paper: Lookahead Reasoning (LR). The high-level rationale is pretty simple: SD alone isn't enough now; as GPUs get stronger (H200 -> B200 -> Rubin CPX), we'll be able to…

[1/N]🚀New decoding paradigm drop!🚀 Introducing Lookahead Reasoning(LR): step-level speculation that stacks with Speculative Decoding(SD). It has been accepted to #NeurIPS2025 🎉 📖 Blog: hao-ai-lab.github.io/blogs/lookahea… 💻 Code: github.com/hao-ai-lab/Loo… 📄 Paper: arxiv.org/abs/2506.19830
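
In toy form (our illustration, not the paper's code), step-level speculation mirrors token-level SD: a cheap drafter proposes several future reasoning steps, the target verifies those positions in parallel, and the longest verified prefix is accepted, with the target's own step substituted at the first mismatch. draft_step and target_step below are hypothetical stand-ins for real models.

```python
# Toy step-level speculation in the spirit of Lookahead Reasoning
# (illustrative, not the paper's implementation).
def draft_step(ctx):          # hypothetical cheap drafter
    return f"step{len(ctx)}"

def target_step(ctx):         # hypothetical target model (ground truth)
    return f"step{len(ctx)}" if len(ctx) % 3 != 2 else f"STEP{len(ctx)}"

def lookahead_reasoning(ctx, k=4):
    # 1) drafter proposes k future steps sequentially (cheap)
    proposals, tmp = [], list(ctx)
    for _ in range(k):
        s = draft_step(tmp); proposals.append(s); tmp.append(s)
    # 2) target verifies all k positions (one batched call in practice)
    verified = []
    for i, s in enumerate(proposals):
        expected = target_step(ctx + proposals[:i])
        if expected != s:              # first mismatch: keep target's own step
            verified.append(expected)
            break
        verified.append(s)
    return ctx + verified              # accept longest verified prefix

ctx = []
while len(ctx) < 10:
    ctx = lookahead_reasoning(ctx)
print(ctx)
```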



EmbeddedLLM reposted

Day-0 support on one of the most anticipated model releases🚀🚀🚀 Detailed deployment guide coming soon at vllm-recipes docs.vllm.ai/projects/recip…


🚀 We're thrilled to unveil Qwen3-VL — the most powerful vision-language model in the Qwen series yet! 🔥 The flagship model Qwen3-VL-235B-A22B is now open-sourced and available in both Instruct and Thinking versions: ✅ Instruct outperforms Gemini 2.5 Pro on key vision…



EmbeddedLLM reposted

Congrats to @deepseek_ai! DeepSeek-R1 was published in Nature yesterday as the cover article, and vLLM is proud to have supported its RL training and inference 🥰


EmbeddedLLM reposted

Disaggregated Inference at Scale with #PyTorch & #vLLM: Meta’s vLLM disagg implementation improves inference efficiency in latency & throughput vs its internal stack, with optimizations now being upstreamed to the vLLM community. 🔗 hubs.la/Q03J87tS0


EmbeddedLLM reposted

😂 Although AMD is now working pretty well for small to medium-sized models


EmbeddedLLM reposted

Welcome Qwen3-Next! You can run it efficiently on vLLM with accelerated kernels and native memory management for hybrid models. blog.vllm.ai/2025/09/11/qwe…


🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here! 🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B (esp. at 32K+ context!) 🔹 Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &…



EmbeddedLLM reposted

Deep dive into optimizing weight transfer step by step and improving it 60x!

1.5 seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): le.qun.ch/en/blog/2025/0…



EmbeddedLLM reposted

⚡️ Efficient weight updates for RL at trillion-parameter scale 💡 Best practice from Kimi @Kimi_Moonshot. vLLM is proud to collaborate with checkpoint-engine: • Broadcast weight sync for 1T params in ~20s across 1000s of GPUs • Dynamic P2P updates for elastic clusters •…

Introducing checkpoint-engine: our open-source, lightweight middleware for efficient, in-place weight updates in LLM inference engines, especially effective for RL. ✅ Update a 1T model on thousands of GPUs in ~20s ✅ Supports both broadcast (sync) & P2P (dynamic) updates ✅…

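For intuition, here is a toy broadcast-style sync (our illustration, not checkpoint-engine's implementation; the real project is at github.com/MoonshotAI/checkpoint-engine): rank 0 holds fresh trainer weights and every inference rank receives them in-place, in size-bounded buckets so peak extra memory stays flat. It assumes an initialized NCCL process group and parameters of a single dtype; the real middleware also supports P2P updates for elastic clusters.

```python
# Toy broadcast weight sync (illustrative). Assumes dist.init_process_group
# has run and all parameters share one dtype.
import torch
import torch.distributed as dist

def broadcast_weights(model: torch.nn.Module, src: int = 0, bucket_mb: int = 512):
    params, size = [], 0

    def flush(group):
        if not group:
            return
        flat = torch.cat([p.data.reshape(-1) for p in group])  # pack one bucket
        dist.broadcast(flat, src=src)                          # single collective
        off = 0
        for p in group:                                        # unpack in-place
            n = p.data.numel()
            p.data.copy_(flat[off:off + n].view_as(p.data))
            off += n

    for p in model.parameters():
        params.append(p)
        size += p.data.numel() * p.data.element_size()
        if size >= bucket_mb * 2**20:    # cap extra memory at ~bucket_mb
            flush(params)
            params, size = [], 0
    flush(params)

# Usage inside a torchrun job, after each optimizer step on rank 0:
#   broadcast_weights(inference_model, src=0)
```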


vLLM Singapore Meetup — Highlights
Thanks to everyone who joined!

Check out the slides by @vllm_project DarkLight1337 with tjtanaa / @EmbeddedLLM
V1 is here: faster startup, stronger CI & perf checks.
Scaling MoE: clear Expert Parallelism (EP) setup for single/multi-node +…

EmbeddedLLM reposted

vLLM is proud to support the great Kimi update from @Kimi_Moonshot: better tool-calling, longer context, and more! Check the deployment guide at huggingface.co/moonshotai/Kim… 🔥


Kimi K2-0905 update 🚀 - Enhanced coding capabilities, esp. front-end & tool-calling - Context length extended to 256k tokens - Improved integration with various agent scaffolds (e.g., Claude Code, Roo Code, etc) 🔗 Weights & code: huggingface.co/moonshotai/Kim… 💬 Chat with new Kimi…



EmbeddedLLM reposted

Amazing blog post from @gordic_aleksa explaining the internals of vLLM 😍

New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work! Took me a while to get this level of understanding of the codebase and then to write up…


