
mohit

@mohitwt_

19, building custom dl framework from scratch

Pinned

I'm building my own PyTorch from scratch, implementing multiple core features like tensors, autograd, NN layers, optimizers, and more.

The core engine will be written in C++/CUDA for performance, with a Pythonic, PyTorch-like API.

starting C++ today.

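To give a sense of the direction, here is a rough, illustrative sketch of what a minimal strided tensor core could look like in C++ (the names Tensor, shape, strides, zeros, at are assumptions for the sketch, not the framework's actual API):

// Illustrative sketch only: a minimal strided Tensor core in C++.
#include <cstddef>
#include <memory>
#include <vector>

struct Tensor {
    std::shared_ptr<float[]> data;   // underlying buffer, shared between views
    std::vector<size_t> shape;       // e.g. {2, 3}
    std::vector<size_t> strides;     // elements to skip per dimension
    size_t offset = 0;               // where this view starts inside the buffer

    static Tensor zeros(const std::vector<size_t>& shape) {
        size_t n = 1;
        for (size_t d : shape) n *= d;
        Tensor t;
        t.data = std::shared_ptr<float[]>(new float[n]());  // zero-initialised
        t.shape = shape;
        t.strides.assign(shape.size(), 1);                   // row-major strides
        for (int i = (int)shape.size() - 2; i >= 0; --i)
            t.strides[i] = t.strides[i + 1] * shape[i + 1];
        return t;
    }

    // element access goes through shape/strides, so views never copy data
    float& at(const std::vector<size_t>& idx) {
        size_t flat = offset;
        for (size_t i = 0; i < idx.size(); ++i) flat += idx[i] * strides[i];
        return data[flat];
    }
};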

1 away from 7 million


helpful diagram for memory hierarchy explanation:

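The diagram itself isn't captured in this text, but assuming it shows the GPU memory hierarchy (which fits the CUDA posts around it): each thread has private registers, each block shares fast on-chip shared memory, and all blocks read and write the large but slow global memory. A generic kernel sketch touching all three levels (illustrative, not taken from the diagram):

// Assumes a launch with 256-thread blocks.
__global__ void scaled_copy(const float* in, float* out, int n, float s) {
    __shared__ float tile[256];                        // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // index lives in a register (per thread)
    float v = (i < n) ? in[i] : 0.0f;                  // global memory read, visible to all blocks
    tile[threadIdx.x] = v;                             // stage in shared memory
    __syncthreads();                                   // wait for the whole block
    if (i < n) out[i] = tile[threadIdx.x] * s;         // global memory write
}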

A Detailed Explanation of CUDA Thread Hierarchy (Threads, Blocks, and Grids):

A CUDA thread is the smallest unit of execution on the GPU, similar to a CPU thread but designed for massive parallelism. Each thread has its own registers and runs the same kernel code independently,…

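To make the hierarchy concrete, here is the usual indexing pattern (standard CUDA, sketched here rather than taken from the truncated post): each thread combines its position inside the block with the block's position inside the grid to get a unique global index.

// Each thread handles one element: global index = block offset + thread offset.
__global__ void square(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // grid -> block -> thread
    if (i < n)                                       // the grid may overshoot n
        out[i] = in[i] * in[i];
}

// launch: enough 256-thread blocks to cover n elements
// int blocks = (n + 255) / 256;
// square<<<blocks, 256>>>(d_in, d_out, n);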


update on my framework:

> started implementation of the Autograd Engine:
the autograd engine has a few key tasks: track tensor dependencies, compute gradients automatically, and backpropagate through operations.

> added a computation graph to the framework:
forward pass works for…

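For intuition, a scalar-sized sketch of the same idea in C++ (illustrative only; the real engine works on tensors, and the names Node, mul, backward here are assumptions): every op records its inputs as parents plus a closure that knows how to push gradients back, and backward() replays those closures in reverse topological order.

// Illustrative scalar autograd sketch: track dependencies, then backpropagate.
#include <functional>
#include <memory>
#include <set>
#include <vector>

struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node {
    float value = 0.0f;
    float grad  = 0.0f;
    std::vector<NodePtr> parents;        // dependency tracking
    std::function<void()> backward_fn;   // pushes this node's grad to its parents
};

NodePtr mul(NodePtr a, NodePtr b) {
    auto out = std::make_shared<Node>();
    out->value = a->value * b->value;
    out->parents = {a, b};
    out->backward_fn = [a, b, out]() {   // chain rule for d(a*b)
        a->grad += b->value * out->grad;
        b->grad += a->value * out->grad;
    };
    return out;
}

// Reverse topological order so a node's grad is complete before it propagates.
static void build_topo(const NodePtr& n, std::vector<NodePtr>& topo, std::set<Node*>& seen) {
    if (!seen.insert(n.get()).second) return;
    for (auto& p : n->parents) build_topo(p, topo, seen);
    topo.push_back(n);
}

void backward(const NodePtr& out) {
    std::vector<NodePtr> topo;
    std::set<Node*> seen;
    build_topo(out, topo, seen);
    out->grad = 1.0f;
    for (auto it = topo.rbegin(); it != topo.rend(); ++it)
        if ((*it)->backward_fn) (*it)->backward_fn();
}

With x = 2 and y = 3, z = mul(x, y) followed by backward(z) leaves dz/dx = 3 in x->grad and dz/dy = 2 in y->grad.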

> added sum() and mean() in math ops
> added more view operations like:
.zeros()
.ones()
.randn()
.arange()
.permute()
.contiguous()
.expand()
.unsqueeze()
.squeeze()

so far my framework has:
> Tensor ops (core data structure, indexing, slicing, etc)
> CPU ops (math, matmul,…

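Most of these view ops can be pure shape/stride rewrites over a strided tensor, with no data copied until .contiguous() materialises the result. A rough sketch (assuming a Tensor with shape/strides vectors and a shared data buffer, like the sketch under the pinned post; names are illustrative, not the framework's actual code):

// Illustrative only: view ops as shape/stride rewrites, sharing the same buffer.
Tensor permute(const Tensor& t, const std::vector<size_t>& dims) {
    Tensor out = t;                              // shares t.data
    for (size_t i = 0; i < dims.size(); ++i) {
        out.shape[i]   = t.shape[dims[i]];
        out.strides[i] = t.strides[dims[i]];
    }
    return out;
}

Tensor unsqueeze(const Tensor& t, size_t dim) {
    Tensor out = t;
    out.shape.insert(out.shape.begin() + dim, 1);
    out.strides.insert(out.strides.begin() + dim, 0);   // size-1 dim, stride unused
    return out;
}

Tensor expand(const Tensor& t, size_t dim, size_t n) {
    Tensor out = t;            // assumes t.shape[dim] == 1
    out.shape[dim]   = n;      // broadcast: every index along dim
    out.strides[dim] = 0;      // maps to the same underlying element
    return out;
}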


only reason to use LinkedIn


finally i can sleep, updates on my framework tmr.




i didn't back up my framework's obsidian vault before switching to fedora, fml 😭😭



matrix addition in CUDA, the first picture is launching 1 block of NxN threads.
~ each thread computes one element of the matrix

> dim3 threadsPerBlock(N,N);
> MatAdd<<<1, threadsPerBlock>>>(d_A, d_B, d_C, N);

this works fine with small matrices as i did N=4, which is only 16…

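A single block caps out at 1024 threads, so the whole-matrix-in-one-block launch stops working as N grows. The usual next step (standard CUDA pattern, sketched here rather than taken from the screenshots) is a 2D grid of fixed-size blocks tiling the matrix:

// Multi-block version: a grid of 16x16 blocks covers the NxN matrix, so the
// matrix size is no longer limited by the 1024-threads-per-block cap.
__global__ void MatAdd(const float* A, const float* B, float* C, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

// launch:
// dim3 threadsPerBlock(16, 16);
// dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
// MatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);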

wrote my first CUDA, a simple c = a + b:

what it's doing:
> allocate CPU mem and fill with input data
> allocate GPU mem
> copy input data CPU to GPU
> gpu kernel computes
> copy result back gpu to cpu
> print results from CPU
> free GPU mem
😭

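Those steps line up with the standard CUDA boilerplate; a minimal, self-contained version of the same c = a + b flow (a generic sketch, not necessarily the exact code in the screenshot):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 10;
    size_t bytes = n * sizeof(float);

    // allocate CPU mem and fill with input data
    float *a = (float*)malloc(bytes), *b = (float*)malloc(bytes), *c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    // allocate GPU mem
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // copy input data CPU -> GPU
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // GPU kernel computes
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // copy result back GPU -> CPU
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    // print a few results from the CPU
    for (int i = 0; i < 4; ++i) printf("%.0f + %.0f = %.0f\n", a[i], b[i], c[i]);

    // free GPU mem (and host mem)
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}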



got a new desk and monitor, so insane. loving fedora as well


mohit reposted

Building LearnFlow
- learnt about background jobs and queues
- used Celery as the task queue system and Redis as the message broker
- offloaded the goal-inactivity reminder email sending task to the background through this
more details below


fedora is amazing. no ricing yet, just setting up stuff atm

Had my fun with Mint, it's amazing and caused 0 problems since install, but got bored of it. Installing Fedora.




all setup done for Linux Mint. customisation and all i'll do later, will try it for a few days first.



guy is building crazy stuff

(1/n)
Building Learnflow
- implemented throttling for Free/Premium users: 10/day for free and 1000 for premium (for the sake of testing)
- cached monthly and weekly progress summaries
- benchmark tests for cached and non-cached responses, insane diff
more info below




cool stuff, give it a read

I kinda went deeper into tokenization in NLP, partly because I found it super interesting and partly because a friend asked me to explain it :p

Turns out it’s way more than just “splitting text on spaces.” It’s the foundation of how LLMs actually see language, and the choice of…




booted windows after months for some college work, bro it's just 2 tabs and spotify, wdym 11gb 😭

for comparison, on linux using dc, spotify, and brave with multiple tabs it's <5gb

