
stochasm

@stochasticchasm

pretraining @arcee_ai • 25 • opinions my own

Pinned

Beyond excited about this, it's been so fun to work on our next model

Coming Soon...



I would really love to see the same prompt given to every incremental version of 4o just to see how the model changed over its lifetime


Join the dark side, Luke. Give up parallelization. If you only knew the power of host-device streaming...
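Not the actual setup being teased here, but a minimal sketch of what host-device weight streaming can look like in PyTorch: keep each layer's weights pinned in host RAM and prefetch the next layer on a side CUDA stream while the current layer computes, instead of sharding across devices. Layer sizes and the prefetch helper below are made up for illustration.

```python
import torch

# Hypothetical illustration: stream each layer's weights host->device on a
# side stream while the previous layer computes.
layers = [torch.nn.Linear(4096, 4096) for _ in range(8)]
for layer in layers:
    layer.weight.data = layer.weight.data.pin_memory()  # pinned for async copies
    layer.bias.data = layer.bias.data.pin_memory()

copy_stream = torch.cuda.Stream()
x = torch.randn(16, 4096, device="cuda")

def prefetch(layer):
    # Enqueue async H2D copies on the side stream; returns GPU tensors.
    with torch.cuda.stream(copy_stream):
        w = layer.weight.to("cuda", non_blocking=True)
        b = layer.bias.to("cuda", non_blocking=True)
    return w, b

next_w, next_b = prefetch(layers[0])
for i, layer in enumerate(layers):
    torch.cuda.current_stream().wait_stream(copy_stream)  # this layer's weights arrived
    w, b = next_w, next_b
    if i + 1 < len(layers):
        next_w, next_b = prefetch(layers[i + 1])  # overlap next copy with compute
    x = torch.nn.functional.linear(x, w, b)
```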



tfw you start recognizing hostnames of specific nodes on the cluster


I wonder how many times ai2 employees have joked about things being olmost done


They call it fp32-fp19active

"ah, it's 19 bits then, right" "well not the storage, that's still 32 bit" "oh, so then the accumulation is done in 19 bits then, right" "no, the accumulation is still done in full fp32" "oh, so then there's basically no precision loss from fp32 then" "well, also no"



Insane result, and while the smaller training-inference gap makes sense, I cannot believe it has such a huge effect

FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
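A toy way to see where that gap comes from (not the quoted work's setup): FP16 carries 10 mantissa bits to BF16's 7, so round-tripping the same values through each format gives roughly 8x less rounding error in FP16.

```python
import torch

# Rounding error from casting fp32 values down and back up. Illustration only.
x = torch.randn(1_000_000)

fp16_err = (x - x.to(torch.float16).float()).abs().mean()
bf16_err = (x - x.to(torch.bfloat16).float()).abs().mean()

print(f"mean rounding error fp16: {fp16_err.item():.2e}")
print(f"mean rounding error bf16: {bf16_err.item():.2e}")   # roughly 8x larger
```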



stochasm reposted

We have a busy month ahead of us. A lot of releases, announcements and information to absorb. We also need feedback. Join our discord to be the first to know about and use our upcoming family of models and toolkits!

ArceeAI is on Discord! Join for early access to some exciting drops!



Yes we do

if i asked the people what they wanted, they would’ve said bigger models



I'm giving myself two weeks to get this good at claude/codex kernel-wrangling

it may not be interesting to anyone but this is where i left off working tonight. i will start back again in a few hours from this same tmux session



Seems like M2 is actually full attention, very interesting

A small correction: M2 is a full-attention model. During our pre-training phase, we actually tried to transform the full-attention model into an OSS-like structure using SWA, but we found that it hurt multi-hop reasoning performance, so we ultimately did not use this setting.
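For context, the difference being discussed comes down to the attention mask. A toy sketch (not MiniMax's code): full causal attention lets every token see all previous tokens, while sliding-window attention (SWA) only lets it see the last `window` tokens, which is what can hurt multi-hop reasoning over long contexts.

```python
import torch

def attention_mask(seq_len, window=None):
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = j <= i                        # causal: attend to self and the past
    if window is not None:
        mask &= (i - j) < window         # SWA: only the most recent tokens
    return mask

print(attention_mask(6))                 # full causal attention
print(attention_mask(6, window=3))       # sliding window of 3
```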



We are not back

Looks like nvm no SWA?? Very weird



Quadraticheads we are so back

Looks like they've abandoned naive linear attention for SWA
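For reference, "naive linear attention" here usually means dropping the softmax for a simple feature map so keys and values can be summarized into a fixed-size state, avoiding the quadratic score matrix. A rough, non-causal sketch under that assumption (not the model's actual layers):

```python
import torch

def softmax_attention(q, k, v):
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5   # (T, T): quadratic
    return torch.softmax(scores, dim=-1) @ v

def naive_linear_attention(q, k, v, eps=1e-6):
    q, k = torch.relu(q) + eps, torch.relu(k) + eps           # feature map phi(.)
    kv = k.transpose(-2, -1) @ v                               # (d, d_v) summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)      # normalizer, (T, 1)
    return (q @ kv) / z

q = k = v = torch.randn(128, 64)
print(softmax_attention(q, k, v).shape, naive_linear_attention(q, k, v).shape)
```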



very interesting post

here's an example of how improvements to the best case ("pass@k") can sometimes lead to regressions for the worst case ("worst@k")
sure, your deepfried LR might help your average reward go up, but "it improved on average" doesn't mean "the improvement is evenly distributed"!
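A toy sketch of that point with entirely made-up numbers: the "after" run has a higher mean reward and a better best sample, but its worst sample regresses, i.e. the improvement isn't evenly distributed.

```python
before = [0.6] * 8                      # steady policy
after = [0.9] * 6 + [0.1] * 2           # spikier policy after cranking the LR

for name, rewards in [("before", before), ("after", after)]:
    mean = sum(rewards) / len(rewards)
    print(f"{name}: mean={mean:.2f}  best@8={max(rewards):.2f}  worst@8={min(rewards):.2f}")
```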



stochasm reposted

If you were recently laid off at Meta Gen AI, my dms are open. Help us build the next frontier of Apache-2.0 models.


UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'.
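For anyone hitting the same warning, a quick before/after sketch, assuming a PyTorch build new enough to emit it; the new per-backend settings are the ones quoted in the warning text itself.

```python
import torch

# Old-style global TF32 flags that recent PyTorch versions now warn about:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# New-style per-backend settings: 'tf32' opts into the reduced-precision
# fast path, 'ieee' keeps strict fp32.
torch.backends.cuda.matmul.fp32_precision = "tf32"
torch.backends.cudnn.conv.fp32_precision = "ieee"
```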


You can just train things


Pack it in boys


