Aman Swar
@Compile_Conquer
Building fast & efficient AI systems, from low-level CUDA kernels to distributed training frameworks. AI Systems Engineer | 3rd-year undergraduate
Started learning inline PTX assembly in CUDA and built a small header that implements:
- Guarded global loads/stores using predicate registers
- cp.async async copy from global → shared memory
- Vectorized 128-bit loads/stores to improve bandwidth
#CUDA #PyTorch
ASI will happen. Because of tech? No. Because the average IQ will drop so steeply that exceeding human intelligence becomes trivial.
After 3 freakin hours of debugging a fused Grouped Query Attention kernel, I finally see this plot. Feeling happy.
blue = PyTorch attention (dies as seq_len ↑)
green = my Triton kernel (🚀)
Check out: github.com/AmanSwar/Model… #PyTorch #CUDA
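For readers unfamiliar with the term, here is a minimal, unfused sketch of what grouped-query attention computes. It is not the Triton kernel from the post, and the repository code is not reproduced here. The function names `softmax` and `gqa_reference`, the nested-list layout, and the absence of masking are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_reference(q, k, v):
    """Unfused grouped-query attention for one sequence (illustrative only).

    q:    [num_q_heads][seq_len][head_dim]
    k, v: [num_kv_heads][seq_len][head_dim]
    Returns a list shaped like q. No causal mask is applied (assumption).
    """
    num_q_heads, num_kv_heads = len(q), len(k)
    assert num_q_heads % num_kv_heads == 0, "query heads must split evenly into groups"
    group_size = num_q_heads // num_kv_heads
    head_dim = len(q[0][0])
    scale = 1.0 / math.sqrt(head_dim)

    out = []
    for h in range(num_q_heads):
        kv = h // group_size  # query heads in the same group share this K/V head
        head_out = []
        for q_vec in q[h]:
            # Scaled dot product of this query against every key of the shared head.
            scores = [scale * sum(a * b for a, b in zip(q_vec, k_vec)) for k_vec in k[kv]]
            weights = softmax(scores)
            # Weighted sum of the shared head's value vectors.
            head_out.append([
                sum(w * v_vec[d] for w, v_vec in zip(weights, v[kv]))
                for d in range(head_dim)
            ])
        out.append(head_out)
    return out
```

This reference builds a full row of scores for every query, so its memory use grows with seq_len squared. Fused attention kernels typically avoid materializing that score matrix, which is one plausible reason a fused kernel scales better at long sequence lengths.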