DeepSpeed

@DeepSpeedAI

Official account for DeepSpeed, a library that enables unprecedented scale and speed for deep learning training + inference. Japanese: @DeepSpeedAI_JP

Pinned

UIUC, Anyscale, and Snowflake significantly enhanced LLM offloading for the Superchip era!

🚀 SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips Superchips like the NVIDIA GH200 offer tightly coupled GPU-CPU architectures for AI workloads. But most existing offloading techniques were designed for traditional PCIe-based systems. Are we truly…



DeepSpeed reposted

🚨Meetup Alert🚨 Join us for @raydistributed × @DeepSpeedAI Meetup: AI at Scale, including talks from researchers and engineers at @LinkedIn, @anyscalecompute and @Snowflake. Learn how leading AI teams are scaling efficiently with Ray’s distributed framework and DeepSpeed’s…


Step into the future of AI at #PyTorchCon 2025, Oct 22–23 in San Francisco 🔥 Join the DeepSpeed keynote and technical talks. Register: events.linuxfoundation.org/pytorch-confer… + Oct 21 co-located events: Measuring Intelligence, Open Agent & AI Infra Summits / Startup Showcase & PyTorch Training


DeepSpeed reposted

The @DeepSpeedAI team would like to thank @modal for sponsoring our GPUs for CI. This is an amazing contribution to our AI-democratizing open source project. github.com/deepspeedai/De… The Modal team is outstanding in their support - speed, expertise, and a human experience!


ZenFlow is a massive improvement to DeepSpeed Offloading. Courtesy of an excellent collaboration among University of Virginia, UC Merced, Argonne National Laboratory, Microsoft, and Snowflake.

Introducing #ZenFlow: No Compromising Speed for #LLM Training w/ Offloading
5× faster LLM training with offloading
85% fewer GPU stalls
2× lower I/O overhead
🚀 Blog: hubs.la/Q03DJ6GJ0
🚀 Try ZenFlow and experience 5× faster training with offloading: hubs.la/Q03DJ6Vb0



Kudos to Xinyu for giving an excellent presentation of the DeepSpeed Universal Checkpointing (UCP) paper at USENIX ATC 2025.

📢 Yesterday at USENIX ATC 2025, Xinyu Lian from UIUC SSAIL Lab presented our paper on Universal Checkpointing (UCP). UCP is a new distributed checkpointing system designed for today's large-scale DNN training, where models often use complex forms of parallelism, including data,…



DeepSpeed reposted

My first project at @Snowflake AI Research is complete!
I present to you Arctic Long Sequence Training (ALST)
Paper: arxiv.org/abs/2506.13996
Blog: snowflake.com/en/engineering…
ALST is a set of modular, open-source techniques that enable training on sequences up to 15 million…


Improved DeepNVMe: Affordable I/O Scaling for AI
- Faster I/O with PCIe Gen5
- 20x faster model checkpointing
- Low-budget SGLang inference via NVMe offloading
- Pinned memory for CPU-only workloads
- Zero-copy tensor type casting
Blog: tinyurl.com/yanbrjy9


DeepSpeed reposted

PyTorch Foundation has expanded into an umbrella foundation. @vllm_project and @DeepSpeedAI have been accepted as hosted projects, advancing community-driven AI across the full lifecycle. Supporting quotes provided by the following members: @AMD, @Arm, @AWS, @Google, @Huawei,…


Come hear all the exciting DeepSpeed updates at the upcoming PyTorch Day France 2025
DeepSpeed – Efficient Training Scalability for Deep Learning Models - sched.co/21nyy @sched


Introducing 🚀DeepCompile🚀: compiler-based distributed training optimizations.
- Automatic parallelization & profile-guided optimizations
- Enable ZeRO1, ZeRO3, Offloading, etc. via compiler passes
- 1.2X-7X speedups over manual ZeRO1/ZeRO3/Offloading
tinyurl.com/8cys28xk
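For illustration, a minimal sketch of how DeepCompile is meant to be switched on, assuming the "compile"/"deepcompile" config key and the engine.compile() call described in the linked blog; the model and batch settings are placeholders.

# Sketch: enabling DeepCompile on a ZeRO-3 run (config key and compile() call assumed from the blog).
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "compile": {"deepcompile": True},  # assumed key: apply ZeRO/offloading as compiler passes
}

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
engine.compile()  # assumed API: capture the graph and run profile-guided optimization passes
# The training loop itself stays the same: engine(...), engine.backward(loss), engine.step().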


AutoTP + ZeRO Training for HF Models
- Enhance HF post-training with larger models, batches, & contexts
- 4x faster LLAMA3 fine-tuning with TP=2 vs TP=1
- No code changes needed
Blog: tinyurl.com/5n8nfs2w
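For illustration, a rough sketch of an AutoTP + ZeRO setup driven purely by the DeepSpeed config; the "tensor_parallel"/"autotp_size" keys are taken as assumptions from the linked blog, and the model id and hyperparameters are placeholders.

# Sketch: AutoTP + ZeRO for an HF model, configured rather than coded.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 1},      # ZeRO handles the data-parallel dimension
    "tensor_parallel": {"autotp_size": 2},  # assumed key: shard each transformer layer across 2 GPUs
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder model id
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# The training loop is unchanged (engine(...), engine.backward(loss), engine.step());
# "no code changes" refers to the model code - the parallelization is chosen in the config.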


DeepSpeed reposted

1/4⚡️nanotron now supports DoMiNo with intra-layer communication overlapping, achieving 60% communication hiding for tensor parallelism (TP) in both the forward and backward passes while maintaining the same training loss.


DeepSpeed reposted

🚀 Excited to introduce DeepSpeed, a deep learning optimization library from @Microsoft! It simplifies distributed training and inference, making AI scaling more efficient and cost-effective. Learn more 👉 hubs.la/Q0351DJC0 #DeepSpeed #AI #OpenSource #LFAIData


🚀Introducing Ulysses-Offload🚀
- Unlock the power of long context LLM training and finetuning with our latest system optimizations
- Train LLaMA3-8B on 2M tokens context using 4xA100-80GB
- Achieve over 55% MFU
Blog: shorturl.at/Spx6Y
Tutorial: shorturl.at/bAWu5


Introducing Domino: a novel zero-cost communication tensor parallelism (TP) training engine for both single node and multi-node settings.
- Near-complete communication hiding
- Novel multi-node scalable TP solution
Blog: github.com/microsoft/Deep…


Great to see the amazing DeepSpeed optimizations from @Guanhua_Wang_, Heyang Qin, @toh_tana, @QuentinAnthon15, and @samadejacobs presented by @ammar_awan at MUG '24.

Dr. Ammar Ahmad Awan from Microsoft DeepSpeed giving a presentation at MUG '24 on Trillion-parameter LLMs and optimization with MVAPICH. @OSUengineering @Microsoft @OhTechCo @mvapich @MSFTDeepSpeed @MSFTDeepSpeedJP #MUG24 #MPI #AI #LLM #DeepSpeed



Announcing that DeepSpeed now runs natively on Windows. This exciting combination unlocks DeepSpeed optimizations for Windows users and empowers more people and organizations with AI innovations.
- HF Inference & Finetuning
- LoRA
- CPU Offload
Blog: shorturl.at/a7TF8
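For readers trying this out, a minimal sketch of fine-tuning with ZeRO-2 CPU offload; the ZeRO keys are the standard documented config options, while the model name and batch sizes are placeholders.

# Minimal sketch: fine-tuning with ZeRO-2 and CPU optimizer offload.
import deepspeed
import torch
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # CPU Offload
    },
}

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Standard training step through the engine: forward, backward, step.
batch = torch.randint(0, model.config.vocab_size, (1, 128), device=engine.device)
loss = engine(input_ids=batch, labels=batch).loss
engine.backward(loss)
engine.step()

Launched with the deepspeed CLI (e.g. deepspeed train.py), this keeps optimizer state and updates in pinned CPU memory while the forward and backward passes stay on the GPU.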


DeepSpeed reposted

💡Check out Comet’s latest integration with DeepSpeed, a deep learning optimization library! 🤝With the @MSFTDeepSpeed + @Cometml integration, automatically start logging training metrics generated by DeepSpeed. Try the quick-start Colab to get started: colab.research.google.com/github/comet-m…
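For illustration, a small sketch of wiring Comet into the DeepSpeed monitoring config; the "comet" keys shown mirror DeepSpeed's other monitors (tensorboard/wandb), and the project and experiment names are placeholders, so check the Colab for the exact schema.

# Sketch: streaming DeepSpeed engine metrics to Comet via the monitoring section of the config.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},
    "comet": {
        "enabled": True,
        "project": "deepspeed-demo",     # placeholder Comet project
        "experiment_name": "zero1-run",  # placeholder experiment name
    },
}
# Pass ds_config to deepspeed.initialize(...) as usual; throughput and loss metrics emitted by
# the engine are then logged to Comet automatically during training.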


Introducing DeepNVMe, a suite of optimizations for fast and efficient I/O operations in DL applications.
- POSIX-style APIs
- Direct HBM/NVMe xfers via NVIDIA GDS
- Cheap Inference scaling via NVMe-Offload
Blog: shorturl.at/l7Oue
@Azure @NVIDIADC #FMS24 #GPUDirect
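For illustration, a small sketch of the POSIX-style handle API, assuming the aio_handle constructor defaults and the async_pread/async_pwrite/wait methods from the DeepNVMe tutorial; the NVMe file path is a placeholder.

# Sketch: async write/read of a pinned CPU tensor to a local NVMe file with a DeepNVMe aio handle.
import torch
from deepspeed.ops.op_builder import AsyncIOBuilder

aio_ops = AsyncIOBuilder().load()
h = aio_ops.aio_handle()  # assumed defaults for block size, queue depth, submission mode, threads

# Pinned (page-locked) CPU buffers so the I/O engine can transfer directly from/to them.
src = torch.empty(1024 * 1024, dtype=torch.uint8, pin_memory=True).random_()
dst = torch.empty(1024 * 1024, dtype=torch.uint8, pin_memory=True)

h.async_pwrite(src, "/local_nvme/example.bin")  # placeholder path on an NVMe mount
h.wait()                                        # block until the write completes

h.async_pread(dst, "/local_nvme/example.bin")
h.wait()
assert torch.equal(src, dst)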

