
Cody Blakeney

@code_star

Data Dawg @datologyai | Formerly Data Research Lead @DbrxMosaicAI | Visiting Researcher @ Facebook | Ph.D | #TXSTFOOTBALL fan | http://linktr.ee/code_star

Pinned

I've got something new for everyone. My first substack article! Not the one I planned to do first, but a fun one! I have made a handy calculator based on the DeepSeek v1 coefficients for finding optimal LR and batch sizes for dense LLMs.

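For anyone curious what the calculator is doing under the hood: the DeepSeek LLM (v1) paper fits the optimal peak learning rate and batch size as power laws of the training compute budget C. Below is a minimal sketch of that kind of calculator, assuming the usual C ≈ 6·N·D estimate for compute; the coefficient values are the ones I recall from the paper's fits, so double-check them against the paper (or just use the calculator) before relying on them.

```python
# Minimal sketch of an LR / batch-size calculator in the spirit of the
# DeepSeek LLM (v1) hyperparameter scaling laws. Coefficients below are
# the values I recall from the paper's power-law fits; verify against the
# paper before using them for a real run.

def compute_budget(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the standard C ~= 6 * N * D rule."""
    return 6.0 * n_params * n_tokens


def optimal_lr(c: float) -> float:
    """Fitted optimal peak learning rate as a power law of compute C (FLOPs)."""
    return 0.3118 * c ** -0.1250


def optimal_batch_tokens(c: float) -> float:
    """Fitted optimal batch size (in tokens) as a power law of compute C (FLOPs)."""
    return 0.2920 * c ** 0.3271


if __name__ == "__main__":
    # Example: a 7B-parameter dense model trained on 2T tokens.
    C = compute_budget(n_params=7e9, n_tokens=2e12)
    print(f"compute budget C ~= {C:.3e} FLOPs")
    print(f"optimal peak LR  ~= {optimal_lr(C):.2e}")
    print(f"optimal batch    ~= {optimal_batch_tokens(C):,.0f} tokens")
```

With these coefficients, the 7B/2T example works out to a peak LR around 4e-4 and a batch size on the order of 9M tokens.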

Cody Blakeney reposted

torch main source build


TORCH NIGHTLY



Cody Blakeney reposted

binary searching nightlies is a hell of a feeling

TORCH NIGHTLY



Cody Blakeney reposted

TORCH NIGHTLY


Cody Blakeney reposted

lol dumb LLM data podcast idea “Talking Tokens”


gotta give the people what they want

Incredibly interested. The world deserves this



Cody Blakeney reposted

Nothing mid about this training 😤

Since mid-training is eating them both, we might as well call it training.



Cody Blakeney reposted

I guess what I’m getting at here is that sparsity performance is an engineering problem, and the science is pretty clear that you can make the models big without changing the theoretical inference performance. Google has really good engineers. It doesn’t really seem like scaling…

Can I ask a dumb question? Let’s say it is 7.5T total parameters. Now that super sparse MoEs are the norm … who cares how big the parameters get? 8x more total params than Kimi shouldn’t be hard or surprising for one of the world’s best-capitalized companies. 15T next year…
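To make the intuition in these two posts concrete: per-token inference compute in a sparse MoE tracks the active (routed) parameters, not the total, so total parameter count can grow by adding experts without changing per-token FLOPs. A rough sketch is below; the layer shapes and expert counts are made up purely for illustration.

```python
# Rough illustration of why total MoE parameters can grow without changing
# per-token inference compute: only the experts routed to each token run.
# All shapes/counts here are made-up for illustration.

def moe_ffn_param_counts(n_layers: int, d_model: int, d_ff: int,
                         n_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total, active_per_token) expert-FFN parameter counts."""
    ffn_params = 2 * d_model * d_ff            # one expert's up + down projections
    total = n_layers * n_experts * ffn_params  # every expert exists in memory
    active = n_layers * top_k * ffn_params     # but only top_k run per token
    return total, active


# Adding experts grows total params 4x while active (per-token) params stay fixed.
for n_experts in (64, 256):
    total, active = moe_ffn_param_counts(n_layers=60, d_model=7168, d_ff=2048,
                                         n_experts=n_experts, top_k=8)
    print(f"{n_experts:>3} experts: ~{total / 1e12:.2f}T total FFN params, "
          f"~{active / 1e9:.1f}B active per token")
```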



Cody Blakeney reposted

Canonically I believe this is Olmo: Tokyo Drift

This release has SO MUCH
• New pretrain corpus, new midtrain data, 380B+ long context tokens
• 7B & 32B, Base, Instruct, Think, RL Zero
• Close to Qwen 3 performance, but fully open!!


Cody Blakeney reposted

Olmo 3: Rawr XD


team Rawr 🫡



Not for nothing, the Nemotron Nano 2 paper also had one of these cool untalked-about facts, which led me to make this awesome Claude Shannon meme for a slide once. You need to train at least 2x your desired effective sequence length to get good performance.
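As a toy illustration of that rule of thumb (not the actual Nemotron Nano 2 recipe, just the arithmetic): pick the long-context training sequence length to be at least twice the effective context you want the model to handle well.

```python
# Toy illustration of the "train at >= 2x your desired effective context"
# rule of thumb. The 2x factor is the heuristic quoted above, not a law.

def min_training_seq_len(target_effective_ctx: int, factor: float = 2.0) -> int:
    """Minimum training sequence length for a desired effective context length."""
    return int(factor * target_effective_ctx)


for target in (8_192, 32_768, 131_072):
    print(f"want ~{target:>7} usable context -> train on >= "
          f"{min_training_seq_len(target):>7}-token sequences")
```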


There is so much cool science to do on long context data research. It's basically never covered in tech reports other than the engineering efforts to solve sequence parallelism. Once again @allen_ai doing the lord's work, telling us the juicy details of what works and what doesn't




this is so good, taking my time to read this tech report like a good movie



Cody Blakeney reposted

this is so good, taking my time to read this tech report like a good movie


happy olmo day for those who celebrate!!!



Be sure to subscribe so you don’t miss it. I’m hoping to actually get some quotes from dataset creators as well. It should be a lot of fun. open.substack.com/pub/cod3star

I'm thinking about doing a fun history of LLM datasets series on my substack with my partner in crime @_BrettLarsen. Would anyone be interested in that? Part reading list, part oral history, and part recounting the bad old days when we counted tokens up hills both ways.



People who I know have trained big models (maybe bigger) have liked this tweet, and in the replies I have people telling me it's impractical and can't be done. smh.

Can I ask a dumb question? Let’s say it is 7.5T total parameters. Now that super sparse MoEs are the norm … who cares how big the parameters get? 8x more total params than Kimi shouldn’t be hard or surprising for one of the world’s best-capitalized companies. 15T next year…



Cody Blakeney reposted

honestly getting carried by the impressive students @hamishivi @scottgeng00 @VictoriaWGraf @heinemandavidj @abertsch72 @MayeeChen @saumyamalik44 @mnoukhov @jacobcares and others 🙏🏻


Cody Blakeney reposted

yeah, don't forget all the other goats. It's a goat farm!



I think we should have 100T parameter MoEs



Cody Blakeney reposted

TECH REPORTS WITH INFORMATION AND STUFF

104 PAGES

WE ARE SO, SO, SO BACK!!!!

Omg I just realized @pjreddie joined AI2 and now they are doing unhinged off-axis plots. Total yolo victory.

This release has SO MUCH
• New pretrain corpus, new midtrain data, 380B+ long context tokens
• 7B & 32B, Base, Instruct, Think, RL Zero
• Close to Qwen 3 performance, but fully open!!


Cody Blakeney reposted

Releases like this are, to me, more exciting than (very impressive) new SoTA models... Because when the OLMo team 🔥COOKS🔥 like this, we all get to read about it and learn from them!

This release has SO MUCH
• New pretrain corpus, new midtrain data, 380B+ long context tokens
• 7B & 32B, Base, Instruct, Think, RL Zero
• Close to Qwen 3 performance, but fully open!!

