
Vector

@Vector434852

Built a 1-layer O(n) model that reached 5.5 Val PPL on TinyStories / 8.5M params · trained on a 4060 laptop.

Optimizing my custom O(n) model.. val PPL 16.7 after the 1st epoch is fair.. 37 min/epoch on an RTX 4060 laptop #LocalLLM #SmallModels #AI #MachineLearning #RTX4060 #FromScratch


8.5M params. 1 layer of custom O(n) attention on an RTX 4060 laptop. Val PPL 5.5 (still decreasing) on TinyStories.


Added a second layer.. 8.5M params.. how's it gonna end?


I'm working on a new breed of linear attention. A single layer reached PPL 6.5 in 30 epochs on an RTX 4060 laptop.. more to come :)

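For context on the terminology: "linear" or "O(n) attention" generally means avoiding the full n×n attention matrix. The mechanism behind these posts is not described, so the following is only a generic PyTorch sketch of one well-known way to get causal attention in O(n) (positive feature map plus prefix sums), not the model in the tweets; the elu+1 feature map and tensor shapes are assumptions for the example.

import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    # Generic kernelized linear-attention sketch -- NOT the mechanism in these tweets.
    # q, k, v: (batch, seq_len, dim). Cost is linear in seq_len because the
    # (seq_len x seq_len) attention matrix is never materialized.
    q = F.elu(q) + 1.0                                       # positive feature map (assumption)
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bnde", k, v).cumsum(dim=1)   # running sum of key-value outer products
    z = k.cumsum(dim=1)                                      # running normalizer
    num = torch.einsum("bnd,bnde->bne", q, kv)
    den = torch.einsum("bnd,bnd->bn", q, z).clamp(min=1e-6)
    return num / den.unsqueeze(-1)

x = torch.randn(1, 2048, 64)            # e.g. context 2048, as in the tweets
out = causal_linear_attention(x, x, x)  # (1, 2048, 64)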

Built a new O(n) attention from scratch. Not Transformer / LSTM / Mamba.

Single layer, 8.5M params, RTX 4060 laptop.

TinyStories: Val PPL 18.6 | Test PPL 18.9 | Context 2048

GPT-2 small needs ~124M params for similar PPL. ~15× parameter efficiency.

@ylecun @karpathy
Logs ↓

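As a sanity check on the ~15× figure, taking the tweet's own numbers (124M vs 8.5M parameters) at face value:

# Quick arithmetic behind the parameter-efficiency claim in the tweet above.
gpt2_small_params = 124e6   # GPT-2 small size, as quoted in the tweet
this_model_params = 8.5e6   # the 1-layer O(n) model
print(gpt2_small_params / this_model_params)   # ~14.6, i.e. the quoted ~15x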

8.5M params
1 layer
O(n) attention
chunk=2048
RTX 4060 laptop
~72h training

TinyStories val PPL: 18.9 (still dropping)

No transformers were harmed.

log ↓

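All of these posts report validation perplexity, which is simply exp of the mean per-token cross-entropy over the validation set. A minimal, generic sketch of how that number is computed (the model, data loader, and device here are placeholders, not the actual training code behind these tweets):

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_perplexity(model, loader, device="cuda"):
    # Standard definition: exp(mean per-token cross-entropy) over the validation set.
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for input_ids, targets in loader:            # (batch, seq_len) token-id tensors
        input_ids, targets = input_ids.to(device), targets.to(device)
        logits = model(input_ids)                # (batch, seq_len, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), reduction="sum")
        total_loss += loss.item()
        total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)   # e.g. 18.9 as in the tweet above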
