#100daysofcuda search results

jino_rohit: day17 #100daysofcuda

Today I implemented a naive outer vector add and a block-wise outer vector add.

Also spent some time reading about PyTorch graph computation.

Code - github.com/JINO-ROHIT/adv…
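
Not the repo's code, but for anyone following along, here is a minimal sketch of what a block-wise outer vector add (out[i, j] = x[i] + y[j]) can look like in Triton. The kernel name, block sizes, and wrapper are illustrative assumptions.

import torch
import triton
import triton.language as tl

@triton.jit
def outer_add_kernel(x_ptr, y_ptr, out_ptr, N, M,
                     BLOCK_N: tl.constexpr, BLOCK_M: tl.constexpr):
    pid_n = tl.program_id(axis=0)  # which block of x elements (rows of out)
    pid_m = tl.program_id(axis=1)  # which block of y elements (cols of out)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    x = tl.load(x_ptr + offs_n, mask=offs_n < N, other=0.0)
    y = tl.load(y_ptr + offs_m, mask=offs_m < M, other=0.0)
    tile = x[:, None] + y[None, :]  # (BLOCK_N, BLOCK_M) tile of the outer sum
    out_offs = offs_n[:, None] * M + offs_m[None, :]
    mask = (offs_n[:, None] < N) & (offs_m[None, :] < M)
    tl.store(out_ptr + out_offs, tile, mask=mask)

def outer_add(x, y, BLOCK_N=64, BLOCK_M=64):
    # Assumes contiguous 1D inputs on the GPU.
    N, M = x.numel(), y.numel()
    out = torch.empty((N, M), device=x.device, dtype=x.dtype)
    grid = (triton.cdiv(N, BLOCK_N), triton.cdiv(M, BLOCK_M))
    outer_add_kernel[grid](x, y, out, N, M, BLOCK_N=BLOCK_N, BLOCK_M=BLOCK_M)
    return out

# x = torch.randn(1000, device="cuda"); y = torch.randn(500, device="cuda")
# torch.allclose(outer_add(x, y), x[:, None] + y[None, :])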

xkxxhk: I did it!!! I did it guys. It was a nice run. All thanks to @hkproj for starting #100DaysofCuda

CC: @salykova_

salykova_: It's your turn now @mobicham @ajhinh. Towards a SOTA grayscale kernel 🤣💀 Jokes aside, the GPU Kernel Leaderboard is a great place to sharpen your CUDA skills and learn how to build the fastest kernels. Maybe @__tinygrad__ will make a comeback. Let's see. Link below.


jino_rohit: day23 #100daysofcuda

Implemented matrix multiplication over batches.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: day19 #100daysofcuda

Implemented a sum reduction across a batch of numbers.

Code - github.com/JINO-ROHIT/adv…
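
A minimal sketch of one way to sum each row of a batch in Triton, assuming every row fits in a single block; the kernel and helper names below are illustrative, not taken from the linked repo.

import torch
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(axis=0)                 # one program per row of the batch
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=0.0)
    tl.store(out_ptr + row, tl.sum(x, axis=0))  # reduce the block to a single scalar

def batch_sum(x):
    n_rows, n_cols = x.shape
    out = torch.empty(n_rows, device=x.device, dtype=x.dtype)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # assumes one block covers a row
    row_sum_kernel[(n_rows,)](x, out, n_cols, x.stride(0), BLOCK_SIZE=BLOCK_SIZE)
    return out

# x = torch.randn(32, 1024, device="cuda")
# torch.allclose(batch_sum(x), x.sum(dim=1), atol=1e-5)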

jino_rohit: day22 #100daysofcuda

Implemented a convolution2d kernel in Triton :)

Code - github.com/JINO-ROHIT/adv…

jino_rohit: Day 12 #100daysofcuda

Implemented a ReLU kernel in Triton.

Code - github.com/JINO-ROHIT/adv…
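
For reference, an elementwise ReLU kernel in Triton tends to look roughly like the sketch below; the names and block size are illustrative, not the repo's exact code.

import torch
import triton
import triton.language as tl

@triton.jit
def relu_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, tl.maximum(x, 0.0), mask=mask)  # relu(x) = max(x, 0)

def relu(x, BLOCK_SIZE=1024):
    # Assumes a contiguous tensor on the GPU.
    out = torch.empty_like(x)
    n = x.numel()
    relu_kernel[(triton.cdiv(n, BLOCK_SIZE),)](x, out, n, BLOCK_SIZE=BLOCK_SIZE)
    return out

# x = torch.randn(4096, device="cuda")
# torch.equal(relu(x), torch.relu(x))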

"10 days into CUDA, and I’ve earned my first badge of honor! 🚀 From simple kernels to profiling, every day is a step closer to mastering GPU computing. Onward to 100! #CUDA #GPUProgramming #100DaysOfCUDA"

limbizzz11's tweet image. "10 days into CUDA, and I’ve earned my first badge of honor! 🚀 From simple kernels to profiling, every day is a step closer to mastering GPU computing. Onward to 100! #CUDA #GPUProgramming #100DaysOfCUDA"

jino_rohit: day10 #100daysofcuda

This marks 1/10th of the journey of consistently learning CUDA and Triton kernels. The goal is to write for Unsloth and learn to build an optimal inference engine.

Wrote a Triton kernel for sigmoid.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: day15 #100daysofcuda

Implemented leaky ReLU and ELU kernels in Triton.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: day20 #100daysofcuda

Today, I implemented softmax, actually a numerically stable version of softmax.

Code - github.com/JINO-ROHIT/adv…
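
The "numerically stable" part usually means subtracting the row maximum before exponentiating, so exp() cannot overflow for large logits. Below is a minimal row-wise sketch in Triton, assuming each row fits in one block; the names are illustrative rather than copied from the repo.

import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)          # subtract the row max so exp() stays bounded
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * row_stride + cols, out, mask=mask)

def softmax(x):
    # Assumes a contiguous 2D tensor with rows short enough to fit in one block.
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](x, out, n_cols, x.stride(0), BLOCK_SIZE=BLOCK_SIZE)
    return out

# x = torch.randn(16, 1000, device="cuda") * 100   # large logits, still stable
# torch.allclose(softmax(x), torch.softmax(x, dim=1), atol=1e-6)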

jino_rohit: day21 #100daysofcuda

Implementing a scalar version of flash attention. Should be quite simple to read and understand.
Code - github.com/JINO-ROHIT/adv…
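
The core idea of a scalar flash attention is the online-softmax recurrence: walk over the keys and values one at a time while keeping a running max, a running normalizer, and a running output, so the full attention matrix is never materialized. A tiny pure-Python sketch for a single query (illustrative only, not the repo's implementation):

import math

def scalar_flash_attention(q, keys, values):
    """Attention for one query, consuming one key/value pair at a time."""
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    m = float("-inf")   # running max of the scores
    l = 0.0             # running sum of exp(score - m)
    acc = [0.0] * d     # running weighted sum of values
    for k, v in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, score)
        alpha = math.exp(m - m_new)          # rescale old state (exp(-inf) == 0 on step 1)
        p = math.exp(score - m_new)
        l = l * alpha + p
        acc = [a * alpha + p * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

# q = [1.0, 0.0]; keys = [[1.0, 0.0], [0.0, 1.0]]; values = [[1.0, 2.0], [3.0, 4.0]]
# Matches softmax(q @ K.T / sqrt(d)) @ V computed the ordinary way.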

jino_rohit: day9 #100daysofcuda

Wrote a Triton kernel for the tanh activation. Going to focus on getting more submissions on Tensara and optimizing it further.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: day18 #100daysofcuda

1. Implemented fused outer multiplication along with its backward pass.
2. Going through some Liger kernels.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: Day16 #100daysofcuda

1. Kept it simple and implemented a GELU kernel.
2. Spent a lot more time understanding the intuition for multiple programs/blocks/threads.

Code - github.com/JINO-ROHIT/adv…
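
For reference, GELU kernels commonly use the tanh approximation gelu(x) ~ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). A minimal Triton sketch with illustrative names; tanh is computed from exp here to stay within basic tl ops.

import torch
import triton
import triton.language as tl

@triton.jit
def gelu_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    c = 0.7978845608028654                       # sqrt(2 / pi)
    inner = c * (x + 0.044715 * x * x * x)
    e = tl.exp(-2.0 * tl.abs(inner))             # tanh(|z|) = (1 - e^-2|z|) / (1 + e^-2|z|)
    t = tl.where(inner >= 0, (1.0 - e) / (1.0 + e), -(1.0 - e) / (1.0 + e))
    tl.store(out_ptr + offs, 0.5 * x * (1.0 + t), mask=mask)

def gelu(x, BLOCK_SIZE=1024):
    out = torch.empty_like(x)
    n = x.numel()
    gelu_kernel[(triton.cdiv(n, BLOCK_SIZE),)](x, out, n, BLOCK_SIZE=BLOCK_SIZE)
    return out

# x = torch.randn(4096, device="cuda")
# torch.allclose(gelu(x), torch.nn.functional.gelu(x, approximate="tanh"), atol=1e-6)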

jino_rohit: day 28 #100daysofcuda

Implemented the backward pass for the mish kernel and compared it with torch's backward gradients!
Almost fully equipped to use custom Triton kernels inside PyTorch modules.

Code - github.com/JINO-ROHIT/adv…
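
The usual way to plug a hand-written forward/backward pair into PyTorch is torch.autograd.Function, and gradcheck makes it easy to compare against torch's own gradients. A minimal sketch of that wiring; the math below uses plain PyTorch ops as stand-ins for the Triton kernels, so only the structure is representative, not the repo's kernels.

import torch
import torch.nn.functional as F

class Mish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.tanh(F.softplus(x))   # mish(x) = x * tanh(softplus(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        t = torch.tanh(F.softplus(x))
        # d/dx [x * tanh(softplus(x))] = tanh(softplus(x)) + x * sigmoid(x) * (1 - tanh(softplus(x))^2)
        return grad_out * (t + x * torch.sigmoid(x) * (1.0 - t * t))

# x = torch.randn(64, dtype=torch.float64, requires_grad=True)
# torch.autograd.gradcheck(Mish.apply, (x,))   # compares against numerical gradients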

raghu_13B: Day 2/100 of my #100DaysOfCUDA challenge: Optimizing Matrix Multiplication!
Yesterday's "naive" kernel was slow. Why? Global memory latency.
Today, I implemented a tiled MatMul using shared memory. The strategy: fetch data from the slow "warehouse" (VRAM) in chunks (tiles)
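
For readers working in Triton rather than raw CUDA, the same tiling idea looks roughly like the sketch below: each program owns one output tile and marches along K in chunks, and the compiler stages those chunks in fast on-chip (shared) memory. The names, block sizes, and the assumption that M, N, K divide evenly by the block sizes are illustrative; this is not the tweet author's kernel.

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C = A @ B (row-major inputs).
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Fetch one tile of A and one tile of B from global memory per iteration.
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * N + offs_n[None, :])
        acc += tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)

def matmul(a, b, BLOCK=32):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, BLOCK), triton.cdiv(N, BLOCK))
    matmul_kernel[grid](a, b, c, M, N, K, BLOCK_M=BLOCK, BLOCK_N=BLOCK, BLOCK_K=BLOCK)
    return c

# a = torch.randn(256, 256, device="cuda"); b = torch.randn(256, 256, device="cuda")
# torch.allclose(matmul(a, b), a @ b, atol=1e-3)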

jino_rohit: day24 #100daysofcuda

Implemented quantized matrix multiplication in Triton.

When doing matrix multiplication with quantized neural networks, a common strategy is to store the weight matrix in lower precision, together with a shift and a scale term.

Code - github.com/JINO-ROHIT/adv…
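
One common flavor of the scale-and-shift scheme is asymmetric per-column quantization, where w ~ (w_q - shift) * scale. A small PyTorch sketch of that idea (dequantize, then multiply); a fused Triton kernel would instead do the dequantization per tile inside the matmul loop. The function names and the exact scheme are assumptions, not necessarily the repo's layout.

import torch

def quantize_per_column(w, bits=8):
    # Asymmetric per-column quantization: w ~ (w_q - shift) * scale.
    qmax = 2 ** bits - 1
    w_min = w.min(dim=0, keepdim=True).values
    w_max = w.max(dim=0, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    shift = torch.round(-w_min / scale)            # the zero point
    w_q = torch.clamp(torch.round(w / scale + shift), 0, qmax).to(torch.uint8)
    return w_q, scale, shift

def quantized_matmul(x, w_q, scale, shift):
    w_hat = (w_q.float() - shift) * scale          # dequantize on the fly
    return x @ w_hat

# w = torch.randn(512, 512); x = torch.randn(4, 512)
# w_q, scale, shift = quantize_per_column(w)
# (quantized_matmul(x, w_q, scale, shift) - x @ w).abs().max()   # small quantization error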

jino_rohit: day11 #100daysofcuda

Today I had some help from the Tensara team with debugging my previous kernels and realized they had a critical flaw. Fixed those two today.

For today's kernel, I wrote a SELU kernel and topped the Tensara leaderboard.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: day 29 #100daysofcuda

1. Implemented the mish activation forward and backward pass using a 2D launch grid in Triton.
2. Looking into PyTorch graph compilation.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: day27 #100daysofcuda

Today I implemented the mish activation function in Triton and benchmarked it against the PyTorch implementation. It was consistently ~5-8% faster than the torch implementation.

Code - github.com/JINO-ROHIT/adv…

Should I write beginner-friendly blogs for Triton?

day25 #100daysofcuda Today I put together the solutions for the Triton puzzles from professor @srush_nlp. Amazing puzzles: they start from basic principles and work you up to flash attention and complex matrix multiplications. Solutions here - github.com/JINO-ROHIT/adv…


I am going to start #100DaysofCUDA. I know some of this already, but let's become consistent again, so let's do it. Got motivation from @jino_rohit. Amazing repo, sir!!!!!! Let's go.


day14 #100daysofcuda We made it to two weeks of writing a CUDA kernel every day! Wrote a Triton kernel for softplus. Code - github.com/JINO-ROHIT/adv…
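
A naive softplus, log(1 + exp(x)), overflows for large x; the usual fix is the identity softplus(x) = max(x, 0) + log(1 + exp(-|x|)). A minimal Triton sketch with illustrative names, not the repo's kernel:

import torch
import triton
import triton.language as tl

@triton.jit
def softplus_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    # Stable form: exp() only ever sees non-positive arguments.
    out = tl.maximum(x, 0.0) + tl.log(1.0 + tl.exp(-tl.abs(x)))
    tl.store(out_ptr + offs, out, mask=mask)

def softplus(x, BLOCK_SIZE=1024):
    out = torch.empty_like(x)
    n = x.numel()
    softplus_kernel[(triton.cdiv(n, BLOCK_SIZE),)](x, out, n, BLOCK_SIZE=BLOCK_SIZE)
    return out

# x = torch.randn(4096, device="cuda") * 50
# torch.allclose(softplus(x), torch.nn.functional.softplus(x), atol=1e-5)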


jino_rohit: day8 #100daysofcuda

Implemented a memory-efficient version of dropout in Triton.

Code - github.com/JINO-ROHIT/adv…
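
The memory saving in this style of dropout comes from seeding the random mask per element offset, so the backward pass can regenerate the mask from (seed, offset) instead of storing a full mask tensor. A minimal sketch along the lines of the Triton low-memory dropout tutorial; names and defaults are illustrative.

import torch
import triton
import triton.language as tl

@triton.jit
def seeded_dropout_kernel(x_ptr, out_ptr, n_elements, p, seed, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    r = tl.rand(seed, offs)                     # same (seed, offset) -> same random number
    out = tl.where(r > p, x / (1.0 - p), 0.0)   # inverted dropout scaling
    tl.store(out_ptr + offs, out, mask=mask)

def seeded_dropout(x, p=0.5, seed=1234, BLOCK_SIZE=1024):
    out = torch.empty_like(x)
    n = x.numel()
    seeded_dropout_kernel[(triton.cdiv(n, BLOCK_SIZE),)](x, out, n, p, seed, BLOCK_SIZE=BLOCK_SIZE)
    return out

# Calling seeded_dropout(x, p, seed) twice with the same seed gives the same mask,
# which is exactly what the backward pass needs.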

"10 days into CUDA, and I’ve earned my first badge of honor! 🚀 From simple kernels to profiling, every day is a step closer to mastering GPU computing. Onward to 100! #CUDA #GPUProgramming #100DaysOfCUDA"

limbizzz11's tweet image. "10 days into CUDA, and I’ve earned my first badge of honor! 🚀 From simple kernels to profiling, every day is a step closer to mastering GPU computing. Onward to 100! #CUDA #GPUProgramming #100DaysOfCUDA"

Day16 #100daysofcuda 1. Kept it simple and implemented a gelu kernel. 2. Spent a lot more time understanding the intuition for multiple programs/blocks/threads. Code - github.com/JINO-ROHIT/adv…

jino_rohit's tweet image. Day16 #100daysofcuda

1. Kept it simple and implemented a gelu kernel.
2. Spent a lot more time understanding the intuition for multiple programs/blocks/threads.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: Day6 #100daysofcuda

Implemented a naive softmax kernel and spent some time learning the quirks of Triton kernels. Planning to optimize this kernel over the next few days.

Code - github.com/JINO-ROHIT/adv…

jino_rohit: day7 #100daysofcuda

Made it to day 7. Today certain concepts clicked: how Triton passes data in via meta-parameters, kernel warmup, striding, etc.

Implemented the fused softmax from the Triton guide.

Code - github.com/JINO-ROHIT/adv…
