Lianmin Zheng

@lm_zheng

Member of technical staff @xAI | Prev: Ph.D. @UCBerkeley, Co-founder @lmsysorg

Lianmin Zheng reposted

(1/N) 🚀 We converted a high quality Wan2.2-MoE into an autoregressive model. Preview checkpoint: huggingface.co/FastVideo/Caus… - First autoregressive version of Wan2.2-A14B MoE - I2V compatible - 8-step distilled - Potential backbone for streaming generation and world modeling Try…


Lianmin Zheng reposted

🚀 Introducing Miles — an enterprise-facing RL framework for large-scale MoE training & production, forked from slime. Slime is a lightweight, customizable RL framework that already powers real post-training pipelines and large MoE runs. Miles builds on slime but focuses on new…


Lianmin Zheng reposted

We have implemented unified FP8 RL. FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.
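As a toy illustration of why unified precision removes the mismatch, here is a fake-quantization sketch. The int-grid rounding below is only a stand-in for real FP8 (e4m3) kernels, and none of this is the actual implementation:

```python
def fake_quant(x, bits=8, amax=1.0):
    """Symmetric fake quantization: snap x onto a low-precision grid,
    a stand-in for FP8 rounding, to illustrate quantization error."""
    levels = (2 ** (bits - 1)) - 1
    scale = amax / levels
    return round(x / scale) * scale

w = [0.1234567, -0.7654321, 0.3141592]   # toy weights
x = [1.0, 2.0, 3.0]                      # toy activations

# Mixed precision: train in high precision, roll out with quantized weights.
logit_train_hp = sum(wi * xi for wi, xi in zip(w, x))
logit_rollout = sum(fake_quant(wi) * xi for wi, xi in zip(w, x))
mismatch = abs(logit_train_hp - logit_rollout)   # nonzero: the two policies differ

# Unified precision: training sees the same quantized weights as rollout,
# so the train-inference gap caused by quantization error vanishes exactly.
logit_train_q = sum(fake_quant(wi) * xi for wi, xi in zip(w, x))
print(mismatch > 0, logit_train_q == logit_rollout)  # True True
```

The point of the sketch: the inconsistency is not noise to be tolerated but a deterministic artifact of running two different numeric programs, so making both sides run the same quantized program eliminates it.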


Lianmin Zheng reposted

We conducted additional experiments to compare speculative training with frozen MTP layers and obtained solid results. Further experiments are underway on larger-scale (300B+) MoEs; looking forward!


We introduce speculative decoding into the RL sampling process, achieving a significant improvement in sampling speed at appropriate batch sizes. Compared to freezing the draft model, the accepted length remains high, yielding stable long-term positive gains.
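The propose-and-verify loop behind speculative decoding can be sketched with toy integer-token models. This is greedy verification only; a real system verifies against the target distribution, and the draft/target lambdas below are invented for illustration:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    # Draft model proposes k tokens autoregressively (greedy toy version).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # Target verifies the proposal; keep the longest agreeing prefix plus one
    # corrected token from the target. len(accepted) is the "accepted length".
    accepted, ctx = [], list(prefix)
    for t in proposal:
        tgt = target_next(ctx)
        if tgt == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(tgt)   # target's correction ends the step
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all drafts accepted
    return accepted

# Toy models over integer tokens: target emits last+1; draft agrees except on
# every third position, where it emits last+2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) % 3 else ctx[-1] + 2
print(speculative_step(draft, target, [0], k=4))  # [1, 2, 3]
```

Each step costs one target forward pass but yields multiple tokens, which is why keeping the accepted length high (rather than letting a frozen draft drift away from the policy) determines whether the speedup persists through training.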



Lianmin Zheng reposted

Grok 4.1 just released. You should notice a significant increase in speed and quality.

Introducing Grok 4.1, a frontier model that sets a new standard for conversational intelligence, emotional understanding, and real-world helpfulness. Grok 4.1 is available for free on grok.com, grok.x.com and our mobile apps. x.ai/news/grok-4-1



Grok 4.1 is out! Amazing improvements.

Introducing Grok 4.1, a frontier model that sets a new standard for conversational intelligence, emotional understanding, and real-world helpfulness. Grok 4.1 is available for free on grok.com, grok.x.com and our mobile apps. x.ai/news/grok-4-1



Numeric debugging is extremely challenging given the huge number of kernels in today’s training and inference systems. You can’t get a single kernel wrong. This is a great engineering achievement and will hopefully make RL training and debugging much easier.

💥 We've achieved perfect training-inference alignment for SGLang & FSDP in slime! (Flash Attn 3, DeepGEMM, etc.) The result? A strict KL divergence of 0. But here's the twist: We spent a month trying to find a baseline that crashes from mismatch... and couldn't. 🤷‍♂️ We haven't…
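The alignment check itself is simple to state: compute the KL divergence between the logprobs the training engine and the inference engine assign to the same tokens, and require it to be exactly zero, not merely small. A minimal sketch with hypothetical logprob vectors (not the actual test harness):

```python
import math

def kl_divergence(p_logprobs, q_logprobs):
    # KL(P || Q) over a token distribution, given log-probabilities from
    # the training engine (P) and the inference engine (Q).
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(p_logprobs, q_logprobs))

train_lp = [math.log(0.7), math.log(0.2), math.log(0.1)]

# Bitwise-identical kernels give bitwise-identical logprobs: KL is exactly 0.0.
infer_lp = list(train_lp)
print(kl_divergence(train_lp, infer_lp))  # 0.0

# Any kernel-level numeric drift, however tiny, makes the check strictly positive.
drift_lp = [train_lp[0] - 1e-6] + train_lp[1:]
print(kl_divergence(train_lp, drift_lp) > 0)  # True
```

A strict zero is a far stronger claim than a small KL: it means every kernel on both paths (attention, GEMMs, softmax, sampling) produces bit-identical numerics, which is exactly what makes it hard to achieve across hundreds of kernels.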



Lianmin Zheng reposted

💥 We've achieved perfect training-inference alignment for SGLang & FSDP in slime! (Flash Attn 3, DeepGEMM, etc.) The result? A strict KL divergence of 0. But here's the twist: We spent a month trying to find a baseline that crashes from mismatch... and couldn't. 🤷‍♂️ We haven't…


Great progress!

Insane Blackwell progress in v0.5.5 by the SGLang team. With new optimizations, it's stable like Hopper, and the performance is great even for multimodal models: 181 tokens/s on Qwen3-VL-30B-A3B-Thinking on 1x B200.



Hao has been pioneering efficient architecture research for many years. Always eager to see the innovations from him and his group!

Excited to partner with SGLang: FastVideo + SGLang = the future open ecosystem for diffusion. 🥳🫡 ----------- A few extra cents: since I started as faculty at UCSD, our lab has been investing in diffusion for both video and text, in both algorithms and systems. - Text-side, we…



Great progress!

Ant AQ-Team @AQ_MedAI @TheInclusionAI and SGLang RL Team @sgl_project just helped land Kimi-K2-Instruct RL on slime — fully wired up and running on 256× H20 141GB 🚀 Huge shout-out to @yngao016, @menlzy, @Yonah_x from AQ Team and @Ji_Li_233, @Yefei_RL from the SGLang RL Team for…



Lianmin Zheng reposted

🚀 Introducing SGLang Diffusion — bringing SGLang’s high-performance serving to diffusion models. ⚡️ Up to 5.9× faster inference 🧩 Supports major open-source models: Wan, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux 🧰 Easy to use via OpenAI-compatible API, CLI & Python API…


Future models will be multimodal-in, multimodal-out, potentially combining autoregressive and diffusion architectures. The SGLang project takes the first step toward building a unified inference stack for all.

🚀 Introducing SGLang Diffusion — bringing SGLang’s high-performance serving to diffusion models. ⚡️ Up to 5.9× faster inference 🧩 Supports major open-source models: Wan, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux 🧰 Easy to use via OpenAI-compatible API, CLI & Python API…



Lianmin Zheng reposted

Day-0 support for Kimi K2 Thinking on SGLang ⚡ The new open-source thinking-agent model pushes reasoning, coding, and multi-step tool use to new heights. Proud to collaborate with @Kimi_Moonshot to make it run seamlessly: python -m sglang.launch_server \ --model-path…


🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built…



Lianmin Zheng reposted

An awesome step toward native JAX inference in SGLang! Now we don't need any PyTorch layer to land code directly on TPU. @SingularMattrix

SGLang now has a pure JAX backend, and it runs natively on TPU!



SGLang now has a pure JAX backend, and it runs natively on TPU!

SGLang now runs natively on TPU with a new pure JAX backend! SGLang-Jax leverages SGLang's high-performance server architecture and uses JAX to compile the model's forward pass. By combining SGLang and JAX, it delivers fast, native TPU inference while maintaining support for…



Lianmin Zheng reposted

Exciting to see Glyph open-sourced -- exploring a direction similar to DeepSeek-OCR, using visual-text compression to scale context windows! Instead of reading text token by token, Glyph lets models see the text, achieving 3–4× compression while preserving strong performance,…

Glyph: Scaling Context Windows via Visual-Text Compression Paper: arxiv.org/abs/2510.17800 Weights: huggingface.co/zai-org/Glyph Repo: github.com/thu-coai/Glyph Glyph is a framework for scaling the context length through visual-text compression. It renders long textual sequences into…
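A back-of-envelope calculation shows how rendering text can compress the context; every number below is an illustrative assumption, not Glyph's measured values:

```python
# Visual-text compression, back of the envelope.
# Assumptions (illustrative only): ~4 characters per BPE text token, a densely
# rendered page holding ~3500 characters, and a vision encoder that spends a
# fixed budget of ~256 tokens per page.
chars_per_page = 3500
chars_per_text_token = 4
vision_tokens_per_page = 256

text_tokens = chars_per_page / chars_per_text_token   # ~875 text tokens per page
compression = text_tokens / vision_tokens_per_page    # ~3.4x fewer tokens
print(round(compression, 1))  # 3.4
```

Under these assumptions the ratio lands in the 3-4x range the announcement quotes: the vision path pays a fixed per-page token budget regardless of how much text is packed onto the page.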



Lianmin Zheng reposted

FYI, our SGLang fork is public here: github.com/chutesai/sglang (branch chutes) to boost your SGLang perf from 73% to 97% 👀 When I tested a few manually there were a few discrepancies where instead of generating a string it generated an array of strings, etc.; curious if switching…

Kimi K2vv updated! We've added case-by-case statistics for ToolCall-Trigger Similarity and ToolCall-Schema Accuracy. Feedback is welcome! github.com/MoonshotAI/K2-…



Lianmin Zheng reposted

This huge contribution from the @Baidu_Inc team enabled multi-token prediction for sparse attention, achieving more than 2x decoding throughput improvements for the latest DeepSeek V3.2 models. The new architecture makes inference more interesting: we need to carefully handle the…

