Vijay
@__tensorcore__
MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights is for none other than myself.
🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python docs.nvidia.com/cutlass/media/…
Today we are releasing our first public beta of Nsight Python! The goal is to simplify the life of a Python developer by providing a Pythonic way to analyze your kernel code! Check it out and provide feedback! docs.nvidia.com/nsight-python/
BREAKING: The CUDA moat has just expanded again! PyTorch Compile/Inductor can now target NVIDIA Python CuTeDSL in addition to Triton. This enables 2x faster FlexAttention compared to Triton implementations. We explain below 👇 As we explained in our April 2025 AMD 2.0 piece,…
Quick life update. Moved to California to work at NVIDIA. Oh I have so much to learn.
Excited to share what friends and I have been working on at @Standard_Kernel We've raised from General Catalyst (@generalcatalyst), Felicis (@felicis), and a group of exceptional angels. We have some great H100 BF16 kernels in pure CUDA+PTX, featuring: - Matmul 102%-105% perf…
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to…
“TogetherAI’s chief scientist @tri_dao announced Flash Attention v4 … uses CUTLASS CuTe Python DSL” As always, thanks for being the tip of the spear and pushing us along too 💚
TogetherAI's Chief Scientist @tri_dao announced Flash Attention v4 at the Hot Chips conference, which is up to 22% faster than the attention kernel implementation from NVIDIA's cuDNN library. Tri Dao was able to achieve this through 2 key algorithmic changes. Firstly, it uses a new online…
TogetherAI's Chief Scientist @tri_dao announced at today's Hot Chips conference that, using CUTLASS CuTe-DSL, he has written kernels that are 50% faster than NVIDIA's latest cuBLAS 13.0 library for small K reduction dim shapes on Blackwell. His kernels beat cuBLAS by using 2…
Cute-DSL is basically perfect (for me). thank you nvidia and cutlass team. i no longer need to wait for long compile times because i underspecified a template param. i hope everyone involved gets an extra chicken nugget in their happy meal
On Sep 6 in NYC, this won't be your typical hackathon where you do your own thing in a corner and then present at the end of the day. You'll deploy real models to the market, trades will happen, chaos should be expected. The fastest model is great but time to market matters more.
arXiv gpu kernel researcher be like: • liquid nitrogen cooling their benchmark GPU • overclock their H200 to 1000W "Custom Thermal Solution CTS" • nvidia-smi boost-slider --vboost 1 • nvidia-smi -i 0 --lock-gpu-clocks=1830,1830 • use specially binned GPUs where the number…
Part 2: developer.nvidia.com/blog/cutlass-3… Covers the design of CUTLASS 3.x itself and how it builds a 2 layer GPU microkernel abstraction using CuTe as the foundation.
CUTLASS 4.1 is now available, which adds support for ARM systems (GB200) and block scaled MMAs
Hierarchical layout is super elegant. Feels like the right abstraction for high performance GPU kernels. FlashAttention 2 actually started bc we wanted to rewrite FA1 in CuTe
developer.nvidia.com/blog/cutlass-p… marks the start of a short series of blogposts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are a few more parts to come still, hope you enjoy!
developer.nvidia.com
CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial...
In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models have layers that cannot be expressed as…
CuTe is such an elegant library that we stopped working on our own system and wholeheartedly adopted CUTLASS for vLLM in the beginning of 2024. I can happily report that was a very wise investment! Vijay and co should be so proud of the many strong OSS projects built on top 🥳
This is what the internet was made for 🥹
Cosmos-Predict2 meets NATTEN. We just released variants of Cosmos-Predict2 where we replace most self attentions with neighborhood attention, bringing up to 2.6X end-to-end speedup, with minimal effect on quality! github.com/nvidia-cosmos/… (1/5)
Getting mem-bound kernels to speed-of-light isn't a dark art, it's just about getting a couple of details right. We wrote a tutorial on how to do this, with code you can directly use. Thanks to the new CuTe-DSL, we can hit speed-of-light without a single line of CUDA C++.
🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao