stochasm
@stochasticchasm
pretraining @arcee_ai • 25 • opinions my own
Beyond excited about this, it's been so fun to work on our next model
I would really love to see the same prompt given to every incremental version of 4o just to see how the model changed over its lifetime
tfw you start recognizing hostnames of specific nodes on the cluster
I wonder how many times ai2 employees have joked about things being olmost done
They call it fp32-fp19active
"ah, it's 19 bits then, right" "well not the storage, that's still 32 bit" "oh, so then the accumulation is done in 19 bits then, right" "no, the accumulation is still done in full fp32" "oh, so then there's basically no precision loss from fp32 then" "well, also no"
Insane result and while the smaller training-inference gap makes sense, I cannot believe it has such a huge effect
FP16 can have a smaller training-inference gap than BFloat16, and thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
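The intuition, sketched numerically (a toy illustration of mine, not the paper's experiment): for values in a sane range, bf16's 7-bit mantissa rounds far more coarsely than fp16's 10-bit mantissa, so what the inference engine computes sits further from what the trainer computed.

```python
# Toy illustration of the training-inference gap argument, not the paper's setup:
# bf16 (7-bit mantissa) rounds values much more coarsely than fp16 (10-bit mantissa),
# so bf16 activations/logits drift further from their fp32 reference.
import torch

x = torch.randn(100_000)  # stand-in for activations/logits in a well-scaled range

err_fp16 = (x - x.to(torch.float16).float()).abs().mean()
err_bf16 = (x - x.to(torch.bfloat16).float()).abs().mean()

print(f"fp16 mean rounding error: {err_fp16:.3e}")
print(f"bf16 mean rounding error: {err_bf16:.3e}")  # roughly 8x larger in this toy case
```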
We have a busy month ahead of us. A lot of releases, announcements and information to absorb. We also need feedback. Join our discord to be the first to know about and use our upcoming family of models and toolkits!
Yes we do
if i asked the people what they wanted, they would’ve said bigger models
I'm giving myself two weeks to get this good at claude/codex kernel-wrangling
Seems like m2 is actually full attention, very interesting
A small correction: M2 is a full-attention model. During our pre-training phase, we actually tried to transform the full-attention model into an OSS-like structure using SWA, but we found that it hurt multi-hop reasoning performance, so we ultimately did not use this setting.
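A tiny sketch of what that choice means at the mask level (my own toy, not MiniMax's code; the window size is made up):

```python
# Toy comparison of a full causal mask vs a sliding-window (SWA) mask.
# T=8 and WINDOW=3 are arbitrary; real models use windows in the thousands.
import torch

T, WINDOW = 8, 3
i = torch.arange(T).unsqueeze(1)  # query positions
j = torch.arange(T).unsqueeze(0)  # key positions

full_causal = (j <= i)                     # attend to every earlier token
sliding     = (j <= i) & (j > i - WINDOW)  # attend only to the last WINDOW tokens

print(full_causal.int())
print(sliding.int())  # the dropped long-range edges are where multi-hop lookups can break
```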
We are not back
Quadraticheads we are so back
Looks like they've abandoned naive linear attention for SWA
very interesting post
here's an example of how improvements to the best case ("pass@k") can sometimes lead to regressions for the worst case ("worst@k"). Sure, your deepfried LR might help your average reward go up, but "it improved on average" doesn't mean "the improvement is evenly distributed"!
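A made-up-numbers version of the same point (nothing here is from the post; the distributions are invented): cranking up variance lifts the best-of-k score while dragging down worst@k, even with a nearly flat mean.

```python
# Invented toy example: a higher-variance policy ("deepfried LR") improves the
# best-of-k reward while making the worst-of-k reward worse, at a similar mean.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, k = 10_000, 8

calm  = rng.normal(loc=0.50, scale=0.05, size=(n_prompts, k))
fried = rng.normal(loc=0.51, scale=0.30, size=(n_prompts, k))

for name, r in (("calm", calm), ("deepfried", fried)):
    print(f"{name:9s} mean={r.mean():.2f}  "
          f"best@k={r.max(axis=1).mean():.2f}  "
          f"worst@k={r.min(axis=1).mean():.2f}")
```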
If you were recently laid off at Meta Gen AI, my dms are open. Help us build the next frontier of Apache-2.0 models.
UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'.
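For anyone else hitting this, the migration looks roughly like the sketch below (the flag names come straight from the warning text; check that your PyTorch version actually ships them before relying on it):

```python
# Sketch of the migration the warning asks for; the fp32_precision names are
# quoted from the warning itself, so confirm your PyTorch version supports them.
import torch

# old, deprecated style
torch.backends.cuda.matmul.allow_tf32 = True

# new style: set fp32 precision for matmuls and cuDNN convolutions explicitly
torch.backends.cuda.matmul.fp32_precision = "tf32"
torch.backends.cudnn.conv.fp32_precision = "tf32"
# or insist on strict IEEE fp32:
# torch.backends.cuda.matmul.fp32_precision = "ieee"
```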
You can just train things