Aman Swar

@Compile_Conquer

Building fast & efficient AI systems, from low-level CUDA kernels to distributed training frameworks. AI Systems Engineer | 3rd Year undergraduate

Started learning inline PTX assembly in CUDA and built a small header that implements:
- Guarded global loads/stores using predicate registers
- cp.async asynchronous copy from global → shared memory
- Vectorized 128-bit loads/stores to improve bandwidth
#CUDA #PyTorch
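For readers who have not seen inline PTX, here is a minimal sketch of what those three pieces could look like. This is not the header from the post: every function name below is hypothetical, the guarded load/store pattern follows the style commonly seen in open-source CUDA libraries, and `cp.async` is assumed to run on sm_80 or newer. All pointers passed to the 128-bit helpers are assumed to be 16-byte aligned.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: guarded 128-bit global load.
// The ld only issues when `guard` is true; otherwise `v` stays zero.
__device__ __forceinline__ uint4 ld_global_v4_guarded(const void* ptr, bool guard) {
    uint4 v = make_uint4(0u, 0u, 0u, 0u);
    asm volatile(
        "{\n"
        "  .reg .pred p;\n"
        "  setp.ne.b32 p, %5, 0;\n"                          // p = (guard != 0)
        "  @p ld.global.v4.u32 {%0, %1, %2, %3}, [%4];\n"    // predicated load
        "}\n"
        : "+r"(v.x), "+r"(v.y), "+r"(v.z), "+r"(v.w)
        : "l"(ptr), "r"((int)guard));
    return v;
}

// Hypothetical helper: guarded 128-bit global store.
__device__ __forceinline__ void st_global_v4_guarded(void* ptr, uint4 v, bool guard) {
    asm volatile(
        "{\n"
        "  .reg .pred p;\n"
        "  setp.ne.b32 p, %5, 0;\n"
        "  @p st.global.v4.u32 [%0], {%1, %2, %3, %4};\n"    // predicated store
        "}\n"
        :
        : "l"(ptr), "r"(v.x), "r"(v.y), "r"(v.z), "r"(v.w), "r"((int)guard)
        : "memory");
}

// Hypothetical helper: 16-byte asynchronous copy global -> shared (sm_80+).
__device__ __forceinline__ void cp_async_16(void* smem_dst, const void* gmem_src) {
    // cp.async takes a shared-state-space address, not a generic pointer.
    unsigned smem_addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :
                 : "r"(smem_addr), "l"(gmem_src)
                 : "memory");
}

// Commit all outstanding cp.async operations and wait for them to finish.
__device__ __forceinline__ void cp_async_commit_and_wait_all() {
    asm volatile("cp.async.commit_group;\n" ::: "memory");
    asm volatile("cp.async.wait_group 0;\n" ::: "memory");
}

// Hypothetical demo kernel: bounds-checked vectorized copy with no branches
// around the memory instructions themselves.
__global__ void copy_vec4(const uint4* in, uint4* out, int n_vec) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool in_bounds = i < n_vec;
    uint4 v = ld_global_v4_guarded(&in[i], in_bounds);
    st_global_v4_guarded(&out[i], v, in_bounds);
}
```

The predicate keeps out-of-range threads from touching memory without a divergent branch, and the `v4` forms move 16 bytes per instruction. `cp_async_16` is meant to be followed by `cp_async_commit_and_wait_all()` and, if other threads read the staged data, a `__syncthreads()`.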


Aman Swar reposted

ASI will happen. Because of tech? No. Because the average IQ will experience such a steep drop that exceeding human intelligence becomes trivial.


Going to share some of my old LinkedIn posts here ...


After 3 freakin' hours of debugging a fused Grouped Query Attention kernel, I finally see this plot. Feeling happy.
blue = PyTorch attention (dies as seq_len ↑)
green = my Triton kernel (🚀)
Check it out: github.com/AmanSwar/Model…
#PyTorch #CUDA


Eating CuTe today for lunch and dinner #CUDA #CUTLASS #Nvidia

