
elie
@eliebakouch
Training LLMs (now: @huggingface)
Super excited to share SmolLM3, a new strong 3B model. SmolLM3 is fully open: we share the recipe, the dataset, the training codebase and much more! > Trained on 11T tokens on 384 H100s for 220k GPU hours > Supports long context up to 128k thanks to NoPE and intra-document masking >…
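The intra-document masking mentioned above is the part most easily shown in code: when several documents are packed into one long training sequence, each token is only allowed to attend to earlier tokens of its own document. A minimal sketch, not the SmolLM3 training code; names and shapes are illustrative:

```python
import torch

def intra_document_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Build a causal, block-diagonal attention mask for a packed sequence.

    doc_ids: (seq_len,) tensor where each position holds the id of the
    document it came from. Tokens may only attend to earlier tokens of
    the *same* document, so documents packed together don't leak into
    each other.
    """
    seq_len = doc_ids.shape[0]
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)            # (seq, seq)
    positions = torch.arange(seq_len)
    causal = positions.unsqueeze(1) >= positions.unsqueeze(0)          # j <= i
    return same_doc & causal  # True = attention allowed


# Example: three documents packed into one 8-token sequence.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
print(intra_document_mask(doc_ids).int())
```

In practice this mask is usually realized through variable-length / block-diagonal attention kernels rather than a materialized seq × seq matrix, but the allowed-attention pattern is the same.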

<smol> is the way to go
You can go far without it: ask for mentorship, devour video lectures, find a place with compute, practice your research skills.
The team at huggingface made the absolutely insane discovery that filtering garbage data out of pretraining made small models think good. Incredible. Very bullish on the whole filtering-out-garbage thing. Asked stupid questions to elie over there and he handled it well.

We’re presenting our work on SmolLM2 tomorrow at COLM, and I guess it’s a good time to say that I’m really proud of what our science team at @huggingface has accomplished with the smollm ecosystem 🤗 While going through the poster sessions this week and talking with researchers,…

Just arrived in Montreal for COLM! Who's around this week and down to chat? (we also brought some special merch)
Revolutions need first-principles thinking. That's what we did with Paris. Instead of incremental improvements, we built an entirely new distributed learning stack from scratch that removes the communication bottleneck entirely. This is the SpaceX moment for decentralized AI.
Introducing Paris, the world's first decentrally trained open-weight diffusion model. We named it Paris after the city that has always been a refuge for those creating without permission. Paris is open for research and commercial use.
Want to learn how to train models across the world, with 400x fewer bits exchanged and a huge latency tolerance? 🌎 I’ll be presenting our work on how to efficiently scale distributed training at @COLM_conf. 🗓️ TODAY: Tuesday, 11:00 - 13:00 📍 Room 710 #COLM2025
Come say hi and get some copies of the SmolLM3 Blueprint! 🤗
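For readers wondering what "exchanging 400x fewer bits" can look like mechanically, here is a generic sketch of one common ingredient in low-communication training: quantizing the tensors that workers exchange and keeping the quantization error locally (error feedback). This is only an illustration of the general idea, not the method presented in the talk:

```python
import numpy as np

class Int8Compressor:
    """Toy communication compressor: int8 quantization + error feedback.

    Each worker sends 1 byte per element (plus one fp32 scale) instead of 4,
    and keeps the quantization error locally so it gets re-injected into the
    next round rather than lost.
    """
    def __init__(self, shape):
        self.residual = np.zeros(shape, dtype=np.float32)

    def compress(self, tensor: np.ndarray):
        corrected = tensor + self.residual
        scale = np.abs(corrected).max() / 127.0 + 1e-12
        q = np.clip(np.round(corrected / scale), -127, 127).astype(np.int8)
        self.residual = corrected - q.astype(np.float32) * scale
        return q, scale

    @staticmethod
    def decompress(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale


grad = np.random.randn(4096).astype(np.float32)
comp = Int8Compressor(grad.shape)
q, scale = comp.compress(grad)
print(f"bytes sent: {q.nbytes + 4} vs {grad.nbytes} uncompressed")
```

Methods in this space typically also communicate only every N local steps rather than every step, which is where most of the latency tolerance tends to come from.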

We're in 🇨🇦 for @COLM_conf. Come talk to me and @HKydlicek on Wednesday at the FineWeb2 poster session (Session 3, Poster #58). @LoubnaBenAllal1 will be at Session 5, Poster #23 (SmolLM2).

I haven’t trained a model with a cosine LR schedule + Muon, but it kinda looks weird to have that much difference between Muon and AdamW early on, no? It’s also very different from the curve in the Moonlight paper.

2/n MARS-M enhances Muon by incorporating scaled gradient correction into momentum, which significantly reduces variance during model training. In GPT-2 experiments, MARS-M beats Muon at every step with the same hyper-params, and outperforms AdamW on training & validation loss.
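Based only on the description in the post above, here is a rough sketch of what "scaled gradient correction folded into momentum, then a Muon-style orthogonalized update" can look like. The hyperparameters, the clipping rule, and the use of the previous step's gradient as the correction reference are illustrative assumptions, not the MARS-M paper's exact update:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Muon-style Newton-Schulz iteration: approximately orthogonalize G."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def mars_like_step(param, grad, prev_grad, momentum, lr=0.02, beta=0.95, gamma=0.025):
    """One illustrative variance-reduced Muon-style step on a 2-D weight.

    The stochastic gradient is corrected with a scaled difference of the
    current and previous gradients, folded into the momentum buffer, then
    orthogonalized before the update. (In the real method the two gradients
    are evaluated on the same minibatch at consecutive iterates; here the
    previous step's gradient stands in as a placeholder.)
    """
    c = grad + gamma * (beta / (1.0 - beta)) * (grad - prev_grad)
    if c.norm() > 1.0:          # keep the correction from exploding (toy clip)
        c = c / c.norm()
    momentum.mul_(beta).add_(c, alpha=1.0 - beta)
    param.add_(newton_schulz(momentum), alpha=-lr)
    return momentum


# Toy usage on a single 2-D weight.
W = torch.randn(256, 512)
m, g_prev = torch.zeros_like(W), torch.zeros_like(W)
for step in range(3):
    g = torch.randn_like(W)     # stand-in for a minibatch gradient
    m = mars_like_step(W, g, g_prev, m)
    g_prev = g
```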

Very cool paper
LLMs are becoming creative partners in scientific discovery. We've seen Scott Aaronson, @terrence_tao, and many more using them to unlock pathways to proofs: not just for IMO-style competition problems, but for actual research questions! While LLMs can't yet solve…

Okay let's do this thing. Friday, 1pm EST. How to train your tiny MoE. Don't miss it.

Heading to COLM (Montreal) next week, happy to chat about training LLMs and new research directions. Also feel free to share in the thread any cool accepted papers I should check out :)
Many people are curious about the performance of Whale's new model on long texts. I ran a comparative test between DeepSeek-V3.1-Terminus and DeepSeek-V3.2-Exp on the OpenAI MRCR 2-needle setup. The detailed logs of both models' responses will be open-sourced tomorrow.

Prime-rl now has extensive support for MoE, both for RL and SFT; we have been training 100B+ models with it. We have support for:
* Qwen3-30B-A3B
* GLM series and Moonlight
* gpt-oss series (being added as we speak)
We ended up rewriting most of the modelling code to make it work with…
A thought experiment regarding DSA comes to mind: in the original attention-sink post, PPL skyrockets for SWA the moment the initial tokens are evicted, which likely suggests that the tokens inside the window were deemed unimportant. If similar synthetics were designed for…

One possible reason is that DSA is inherently a better long-context algorithm than other quadratic attention variants because of softmax: since softmax is only done over 2048 tokens, QK attention weight magnitudes are preserved and no "attention budget" is given to useless tokens.
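The "attention budget" point is easy to see numerically: softmax over a full long context hands a small slice of probability mass to every key, while softmax over only a selected top-k subset keeps most of the mass on the relevant ones. A toy illustration with synthetic logits, not DSA itself:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, top_k = 65536, 2048

# Toy logits: a handful of genuinely relevant keys plus a sea of noise.
logits = rng.normal(0.0, 1.0, seq_len)
relevant = rng.choice(seq_len, size=4, replace=False)
logits[relevant] += 6.0

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Full softmax: every one of the 65k keys gets a slice of the budget.
full = softmax(logits)

# Softmax over only the top-k scoring keys (the sparse regime).
idx = np.argsort(logits)[-top_k:]
sparse = np.zeros_like(logits)
sparse[idx] = softmax(logits[idx])

print("weight on relevant keys, full softmax :", full[relevant].sum())
print("weight on relevant keys, top-k softmax:", sparse[relevant].sum())
```

The gap between the two numbers grows with context length, since the full softmax keeps diluting the relevant keys while the top-k denominator stays fixed.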

Just asked something else to gpt5-thinking and the citation list is insanely relevant and recent; first time I'm experiencing this.
