stochasm
@stochasticchasm
pretraining @arcee_ai • 25 • opinions my own
Beyond excited about this, it's been so fun to work on our next model
I would really love to see the same prompt given to every incremental version of 4o just to see how the model changed over its lifetime
tfw you start recognizing hostnames of specific nodes on the cluster
I wonder how many times ai2 employees have joked about things being olmost done
They call it fp32-fp19active
"ah, it's 19 bits then, right" "well not the storage, that's still 32 bit" "oh, so then the accumulation is done in 19 bits then, right" "no, the accumulation is still done in full fp32" "oh, so then there's basically no precision loss from fp32 then" "well, also no"
Insane result and while the smaller training-inference gap makes sense, I cannot believe it has such a huge effect
FP16 can have a smaller training-inference gap than BFloat16, and thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
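The intuition, sketched numerically (a toy illustration of mine, not the paper's experiment): for values in a sane range, bf16's 7-bit mantissa rounds far more coarsely than fp16's 10-bit mantissa, so what the inference engine computes sits further from what the trainer computed.

```python
# Toy illustration of the training-inference gap argument, not the paper's setup:
# bf16 (7-bit mantissa) rounds values much more coarsely than fp16 (10-bit mantissa),
# so bf16 activations/logits drift further from their fp32 reference.
import torch

x = torch.randn(100_000)  # stand-in for activations/logits in a well-scaled range

err_fp16 = (x - x.to(torch.float16).float()).abs().mean()
err_bf16 = (x - x.to(torch.bfloat16).float()).abs().mean()

print(f"fp16 mean rounding error: {err_fp16:.3e}")
print(f"bf16 mean rounding error: {err_bf16:.3e}")  # roughly 8x larger in this toy case
```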
We have a busy month ahead of us. A lot of releases, announcements and information to absorb. We also need feedback. Join our discord to be the first to know about and use our upcoming family of models and toolkits!
Yes we do
if i asked the people what they wanted, they would’ve said bigger models
I'm giving myself two weeks to get this good at claude/codex kernel-wrangling
Seems like m2 is actually full attention, very interesting
A small correction: M2 is a full-attention model. During our pre-training phase, we actually tried to transform the full-attention model into an OSS-like structure using SWA, but we found that it hurt multi-hop reasoning performance, so we ultimately did not use this setting.
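A tiny sketch of what that choice means at the mask level (my own toy, not MiniMax's code; the window size is made up):

```python
# Toy comparison of a full causal mask vs a sliding-window (SWA) mask.
# T=8 and WINDOW=3 are arbitrary; real models use windows in the thousands.
import torch

T, WINDOW = 8, 3
i = torch.arange(T).unsqueeze(1)  # query positions
j = torch.arange(T).unsqueeze(0)  # key positions

full_causal = (j <= i)                     # attend to every earlier token
sliding     = (j <= i) & (j > i - WINDOW)  # attend only to the last WINDOW tokens

print(full_causal.int())
print(sliding.int())  # the dropped long-range edges are where multi-hop lookups can break
```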
We are not back
Quadraticheads we are so back
Looks like they've abandoned naive linear attention for SWA
very interesting post
here's an example of how improvements to the best case ("pass@k") can sometimes lead to regressions for the worst case ("worst@k"). Sure, your deepfried LR might help your average reward go up, but "it improved on average" doesn't mean "the improvement is evenly distributed"!
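A made-up-numbers version of the same point (nothing here is from the post; the distributions are invented): cranking up variance lifts the best-of-k score while dragging down worst@k, even with a nearly flat mean.

```python
# Invented toy example: a higher-variance policy ("deepfried LR") improves the
# best-of-k reward while making the worst-of-k reward worse, at a similar mean.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, k = 10_000, 8

calm  = rng.normal(loc=0.50, scale=0.05, size=(n_prompts, k))
fried = rng.normal(loc=0.51, scale=0.30, size=(n_prompts, k))

for name, r in (("calm", calm), ("deepfried", fried)):
    print(f"{name:9s} mean={r.mean():.2f}  "
          f"best@k={r.max(axis=1).mean():.2f}  "
          f"worst@k={r.min(axis=1).mean():.2f}")
```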
If you were recently laid off at Meta Gen AI, my dms are open. Help us build the next frontier of Apache-2.0 models.
UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'.
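For anyone else hitting this, the migration looks roughly like the sketch below (the flag names come straight from the warning text; check that your PyTorch version actually ships them before relying on it):

```python
# Sketch of the migration the warning asks for; the fp32_precision names are
# quoted from the warning itself, so confirm your PyTorch version supports them.
import torch

# old, deprecated style
torch.backends.cuda.matmul.allow_tf32 = True

# new style: set fp32 precision for matmuls and cuDNN convolutions explicitly
torch.backends.cuda.matmul.fp32_precision = "tf32"
torch.backends.cudnn.conv.fp32_precision = "tf32"
# or insist on strict IEEE fp32:
# torch.backends.cuda.matmul.fp32_precision = "ieee"
```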
You can just train things