Vijay

@__tensorcore__

MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights are for none other than myself.

thakkarv.dev

Joined July 2015

1K Posts 2K Followers 527 Following

Pinned

Vijay

@__tensorcore__

13 may

🚨🔥 CUTLASS 4.0 is released 🔥🚨

pip install nvidia-cutlass-dsl

4.0 marks a major shift for CUTLASS: towards native GPU programming in Python

slidehelloworld.png

docs.nvidia.com/cutlass/media/…

Vijay reposted

@anneouyang

Excited to share what friends and I have been working on at @Standard_Kernel

We've raised from General Catalyst (@generalcatalyst), Felicis (@felicis), and a group of exceptional angels.

We have some great H100 BF16 kernels in pure CUDA+PTX, featuring:
- Matmul 102%-105% perf…

Vijay reposted

Thinking Machines

@thinkymachines

10 sep

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to…

Vijay

@__tensorcore__

25 aug

“TogetherAI’s chief scientist @tri_dao announced Flash Attention v4 … uses CUTLASS CuTe Python DSL” As always, thanks for being the tip of the spear and pushing us along too 💚

SemiAnalysis

@SemiAnalysis_

25 aug

TogetherAI's Chief Scientist @tri_dao announced Flash Attention v4 at the Hot Chips conference, which is up to 22% faster than the attention kernel implementation from NVIDIA's cuDNN library. Tri Dao was able to achieve this with 2 key algorithmic changes. Firstly, it uses a new online…

Vijay reposted

SemiAnalysis

@SemiAnalysis_

25 aug

Using CUTLASS CuTe-DSL, TogetherAI's Chief Scientist @tri_dao announced that he has written kernels that are 50% faster than NVIDIA's latest cuBLAS 13.0 library for small K reduction dim shapes on Blackwell during today's Hot Chips conference.

His kernels beat cuBLAS by using 2…

Vijay reposted

xjdr

@_xjdr

19 aug

Cute-DSL is basically perfect (for me). thank you nvidia and cutlass team. i no longer need to wait for long compile times because i underspecified a template param. i hope everyone involved gets an extra chicken nugget in their happy meal

Vijay reposted

Mark Saroufim

@marksaroufim

25 jul

On Sep 6 in NYC, this won't be your typical hackathon where you do your own thing in a corner and then present at the end of the day. You'll deploy real models to the market, trades will happen, chaos should be expected. The fastest model is great but time to market matters more.

Vijay reposted

SemiAnalysis

@SemiAnalysis_

22 jul

arXiv gpu kernel researcher be like:
• liquid nitrogen cooling their benchmark GPU
• overclock their H200 to 1000W "Custom Thermal Solution CTS"
• nvidia-smi boost-slider --vboost 1
• nvidia-smi -i 0 --lock-gpu-clocks=1830,1830
• use specially binned GPUs where the number…

Vijay reposted

Vijay

@__tensorcore__

21 jul

Part 2: developer.nvidia.com/blog/cutlass-3… Covers the design of CUTLASS 3.x itself and how it builds a 2 layer GPU microkernel abstraction using CuTe as the foundation.

GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions, and warp-specialization schemes.

CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design | NVIDIA...

Source: developer.nvidia.com

Vijay

@__tensorcore__

21 jul

CUTLASS 4.1 is now available, which adds support for ARM systems (GB200) and block scaled MMAs

Vijay

@__tensorcore__

13 may

Vijay reposted

Tri Dao

@tri_dao

21 jul

Hierarchical layout is super elegant. Feels like the right abstraction for high performance GPU kernels. FlashAttention 2 actually started bc we wanted to rewrite FA1 in CuTe

Vijay

@__tensorcore__

20 jul

developer.nvidia.com/blog/cutlass-p… marks the start of a short series of blogposts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are a few more parts to come still, hope you enjoy!

In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models have layers that cannot be expressed as…

CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial...

Source: developer.nvidia.com

Vijay reposted

Michael Goin

@mgoin_

20 jul

CuTe is such an elegant library that we stopped working on our own system and wholeheartedly adopted CUTLASS for vLLM in the beginning of 2024. I can happily report that was a very wise investment! Vijay and co should be so proud of the many strong OSS projects built on top 🥳

Vijay

@__tensorcore__

20 jul

CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial...

Source: developer.nvidia.com

Vijay

@__tensorcore__

17 jul

This is what the internet was made for 🥹

typedfemale

@typedfemale

17 jul

presenting: big jeff's trainium hell

Vijay reposted

Ali Hassani

@AliHassaniJr

11 jul

Cosmos-Predict2 meets NATTEN. We just released variants of Cosmos-Predict2 where we replace most self attentions with neighborhood attention, bringing up to 2.6X end-to-end speedup, with minimal effect on quality! github.com/nvidia-cosmos/… (1/5)

Vijay reposted

Tri Dao

@tri_dao

10 jul

Getting mem-bound kernels to speed-of-light isn't a dark art, it's just about getting a couple of details right. We wrote a tutorial on how to do this, with code you can directly use. Thanks to the new CuTe-DSL, we can hit speed-of-light without a single line of CUDA C++.

Wentao Guo

@WentaoGuo7

10 jul

🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao

Vijay reposted

Wentao Guo

@WentaoGuo7

10 jul

Vijay

@__tensorcore__

8 jun

Another 🔥 blog about CUTLASS from @colfaxintl, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them. research.colfax-intl.com/cutlass-tutori…

Welcome to part 3 of our series investigating GEMM on the NVIDIA Blackwell architecture. In parts 1 and 2, we looked at the Tensor Memory and 2 SM capabilities of the new Blackwell Tensor Core UMM…

CUTLASS Tutorial: Sub-byte GEMM on NVIDIA® Blackwell GPUs

Source: research.colfax-intl.com

Vijay reposted

Moon

@MoonL88537

29 may

did i mention that this is totally nuts?

Vijay reposted

Tri Dao

@tri_dao

29 may

We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GTA & GLA are steps in this direction: attention variants tailored for inference: high arithmetic intensity (make GPUs go brr even during decoding), easy to…

Ted Zadouri

@tedzadouri

29 may

"Pre-training was hard, inference easy; now everything is hard."-Jensen Huang. Inference drives AI progress b/c of test-time compute. Introducing inference aware attn: parallel-friendly, high arithmetic intensity – Grouped-Tied Attn & Grouped Latent Attn