
Junrong Lin

@OcssLin

MTS @Alibaba_Qwen on MLsys, building SGLang @lmsysorg | Prev. @DukeU

Junrong Lin reposted

Airbnb CEO Brian Chesky: “We’re relying a lot on Alibaba’s Qwen model. It’s very good. It’s also fast and cheap... We use OpenAI’s latest models, but we typically don’t use them that much in production because there are faster and cheaper models.” The Valley is built on Qwen?


Junrong Lin reposted

🚀 SGLang In-Depth Review of the NVIDIA DGX Spark is LIVE! Thanks to @NVIDIA’s early access program, SGLang makes its first-ever appearance in a consumer product, the brand-new DGX Spark. The DGX Spark’s 128GB Unified Memory and Blackwell architecture set a new standard for…
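As a rough illustration of why 128GB of unified memory matters for local serving, here is a back-of-envelope footprint calculator. This is only a sketch; the model size and dtypes are illustrative, not taken from the review:

```python
# Approximate weight memory for a dense model: params * bytes per param.
# KV cache comes on top and grows with batch size and context length.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1, "int4": 0.5}

def weight_footprint_gb(n_params_billion: float, dtype: str) -> float:
    """Rough weight memory in GB (ignores activations and KV cache)."""
    return n_params_billion * BYTES_PER_PARAM[dtype]

# A 70B model needs ~140 GB in fp16, ~70 GB in fp8, ~35 GB in int4,
# so 128 GB of unified memory fits fp8 weights with room for KV cache.
for dtype in ("fp16", "fp8", "int4"):
    print(f"70B @ {dtype}: ~{weight_footprint_gb(70, dtype):.0f} GB")
```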


Junrong Lin reposted

🧠For Qwen3-Next’s Day 0 support in SGLang, one tricky part was enabling spec decoding with the Hybrid Linear Model—since SSM & conv caches only store the last position (unlike KV cache). 🚀After tons of effort with @qingquan_song, we achieved >2× speedup! Benchmarks below
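To make the problem concrete, here is a toy sketch (hypothetical names, not SGLang's actual implementation) of why a recurrent cache needs checkpoint-and-replay during draft verification, while a KV cache can simply truncate rejected positions:

```python
import copy

class ToyRecurrentCache:
    """Stand-in for an SSM/conv cache: only the latest state is kept,
    so rejected positions can't be dropped by truncating the cache."""
    def __init__(self):
        self.state = 0  # placeholder for the recurrent hidden state

    def step(self, token: int):
        self.state = self.state * 31 + token  # toy state update

def verify_draft(cache, draft, accept):
    """Verify all draft tokens in one pass (as spec decoding does),
    then restore the cache to the accepted prefix."""
    checkpoint = copy.deepcopy(cache.state)  # snapshot before verifying
    for tok in draft:            # one batched forward in a real engine
        cache.step(tok)
    n_accepted = 0
    for tok in draft:            # find the accepted prefix
        if not accept(tok):
            break
        n_accepted += 1
    if n_accepted < len(draft):
        # A KV cache could simply truncate the rejected tail; a recurrent
        # state holds only the last position, so roll back and replay.
        cache.state = checkpoint
        for tok in draft[:n_accepted]:
            cache.step(tok)
    return n_accepted
```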


🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here!

🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B (esp. at 32K+ context!)
🔹 Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &…
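For intuition about the hybrid layout, here is a toy PyTorch sketch that interleaves linear-attention blocks (DeltaNet-style, fixed-size state) with full-attention blocks. The block internals and the interleaving period are placeholders, not the real Qwen3-Next modules or ratio:

```python
import torch.nn as nn

class LinearAttentionBlock(nn.Module):
    # Placeholder for a Gated DeltaNet-style block: O(n) in sequence
    # length, carrying a fixed-size recurrent state instead of a KV cache.
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)
    def forward(self, x):
        return x + self.proj(x)

class FullAttentionBlock(nn.Module):
    # Placeholder for a gated softmax-attention block: O(n^2) but with
    # precise access to every past position.
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out

class HybridStack(nn.Module):
    """Toy hybrid stack: one full-attention block every `period` layers
    (the ratio here is illustrative only)."""
    def __init__(self, n_layers=8, d=64, period=4):
        super().__init__()
        self.layers = nn.ModuleList(
            FullAttentionBlock(d) if (i + 1) % period == 0
            else LinearAttentionBlock(d)
            for i in range(n_layers)
        )
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```

The trade the tweet points at: the linear blocks keep decode cost and cache size constant in sequence length, while the occasional full-attention block retains precise long-range recall.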



Special thanks to my old friends from the SGLang community, especially @hebiao064, @qingquan_song, and more (sorry, I don’t know their X accounts 🥹), who helped support the hybrid-model MTP. For linear attention, the eviction during the EAGLE verification phase is different from the regular…

Qwen3-Next is out! SGLang has supported it on day 0 with speculative decoding. Try it out 👇




Junrong Lin reposted

We’re live! 🎉 This is the official account for slime — an open-source, SGLang-native post-training framework for RL scaling. Kicking things off with our first milestone → v0.1.0 release 🧪 Blog: thudm.github.io/slime/blogs/re… Follow us to run RL faster ⚡️


cool

Grok videos can now talk. Major upgrade to image & video generation in a few weeks. This is still early beta.



😎 Glad to see the first project I participated in at Qwen finally out. More awesome work is coming

Big news: Introducing Qwen3-Max-Preview (Instruct) — our biggest model yet, with over 1 trillion parameters! 🚀 Now available via Qwen Chat & Alibaba Cloud API. Benchmarks show it beats our previous best, Qwen3-235B-A22B-2507. Internal tests + early user feedback confirm:…



Junrong Lin reposted

We are using SGLang at really large scale for RL, and it’s been working great :)

xAI may be one of the single biggest contributors to open-source inference just by serving everything with SGLang



Junrong Lin reposted

More on QAT:
1. QAT explanation: pytorch.org/blog/quantizat…
2. MXFP4 QAT is supported in NVIDIA ModelOpt: github.com/NVIDIA/TensorR…
3. A quick drawing of how gpt-oss is trained, in my understanding:


Junrong Lin reposted

🚀 Introducing the first OSS example of fine-tuning gpt-oss with MXFP4 QAT! Powered by NVIDIA ModelOpt + SGLang.

Highlights:
1. Fine-tune gpt-oss while keeping the original MXFP4 format
2. Preserve FP4 efficiency and recover accuracy
3. Deploy seamlessly with SGLang!

Full Blog👇
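For a sense of what MXFP4 QAT means mechanically, here is a simplified fake-quantization sketch with a straight-through estimator. It is a toy, not ModelOpt's implementation, and it assumes tensors whose size divides the 32-element block:

```python
import torch

# FP4 (E2M1) representable magnitudes, the MXFP4 element grid.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_mxfp4(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Fake-quantize to MXFP4: each 32-element block shares a power-of-two
    scale and elements snap to the FP4 grid; gradients pass straight through."""
    wb = w.reshape(-1, block)  # assumes w.numel() % block == 0
    amax = wb.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 6.0)))  # pow-2 scale
    scaled = wb / scale
    # Snap magnitudes to the nearest grid point, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    q = FP4_GRID[idx] * scaled.sign() * scale
    # Straight-through estimator: forward sees q, backward sees identity.
    return (wb + (q - wb).detach()).reshape(w.shape)
```

Training with this applied to the weights lets the checkpoint stay in the deployment format, which is the "keep the original MXFP4 format" point above.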


chad

This false nomenclature of “researcher” and “engineer”, which is a thinly-masked way of describing a two-tier engineering system, is being deleted from @xAI today. There are only engineers. Researcher is a relic term from academia.



👀

something beyond your expectations



Junrong Lin reposted

✅ We’re excited to support @Alibaba_Qwen’s Qwen3-Coder on SGLang! With the tool-call parser and expert parallelism enabled, it runs smoothly with flexible configurations. Just give it a try! 🔗 github.com/zhaochenyang20…

>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
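As an illustration of the tool-call setup mentioned above, here is a hypothetical client sketch against SGLang's OpenAI-compatible endpoint. The port, model name, and tool schema are assumptions for illustration, not taken from the tweets:

```python
# Hypothetical client-side sketch: calling a locally served Qwen3-Coder
# through an OpenAI-compatible API. Assumes a server is already running
# (the port and model name may differ on your setup).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # made-up tool, for illustration only
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
    messages=[{"role": "user", "content": "Fix the failing test in src/."}],
    tools=tools,
)
# With a tool-call parser enabled server-side, structured calls come back
# in tool_calls rather than as raw text.
print(resp.choices[0].message.tool_calls)
```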



Junrong Lin reposted

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507! After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing…


Junrong Lin reposted

salute

🤯A lot of folks have asked me: how does SGLang iterate so fast? PD disaggregation: ~3 weeks. Large-scale EP: ~1 month. GB200 NVL72 support? Also ~3 weeks. And usually with fewer than 5 core devs involved. SGLang moves fast because of talent like this: ~70k commits in a year!🔥



Junrong Lin reposted

Meet Qwen-VLo, your AI creative engine:
• Concept-to-Polish: Turn rough sketches or text prompts into high-res visuals
• On-the-Fly Edits: Refine product shots, adjust layouts or styles with simple commands
• Global-Ready: Generate images in multiple languages
• Progressive…


Junrong Lin reposted

We're excited to release OME, which is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs). It optimizes the deployment and operation of LLMs by automating model management, intelligent runtime selection, efficient resource…

