elie
@eliebakouch
Training LLMs (now: @huggingface). Anon feedback: https://www.admonymous.co/eliebakouch

Pinned

Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training, and infra. 200+ pages of what worked, what didn’t, and how to make it run reliably huggingface.co/spaces/Hugging…

elie reposted

Amazing pairing to learn information theory

Blog from Olah which gives great visual intuition:
colah.github.io/posts/2015-09-…

Video from 3b1b where you see the power by solving a real world example, Wordle: youtube.com/watch?v=v68zYy…
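The idea both resources build on: a Wordle guess is good when the entropy of its feedback distribution is high. A toy, purely illustrative sketch (not from either resource, and with simplified feedback rules):

```python
import math
from collections import Counter

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def feedback_pattern(guess, answer):
    """Simplified Wordle feedback: 'g' = right letter/right spot, 'y' = in word, '.' = absent."""
    return "".join(
        "g" if g == a else ("y" if g in answer else ".")
        for g, a in zip(guess, answer)
    )

def expected_information(guess, candidate_answers):
    """Expected bits gained by playing `guess` = entropy of its feedback distribution."""
    counts = Counter(feedback_pattern(guess, answer) for answer in candidate_answers)
    total = len(candidate_answers)
    return entropy(count / total for count in counts.values())

# Toy example: "crane" splits this tiny pool into several patterns, "zzzzz" tells you nothing.
candidates = ["crane", "crate", "slate", "stale", "plate"]
for guess in ["crane", "zzzzz"]:
    print(guess, round(expected_information(guess, candidates), 3))
```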

elie reposted

Our infra engineer shared a great article “Why Kimi Chose INT4.” I asked if he could be on Twitter, but he’s shy and prefers to be the man behind Kimi. :)

🚀 "Quantization is not a compromise — it's the next paradigm." After K2-Thinking's release, many developers have been curious about its native INT4 quantization format. 刘少伟, infra engineer at @Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice…

ZhihuFrontier's tweet image. 🚀 "Quantization is not a compromise — it's the next paradigm."
After K2-Thinking's release, many developers have been curious about its native INT4 quantization format.
刘少伟, infra engineer at @Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice…
ZhihuFrontier's tweet image. 🚀 "Quantization is not a compromise — it's the next paradigm."
After K2-Thinking's release, many developers have been curious about its native INT4 quantization format.
刘少伟, infra engineer at @Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice…
ZhihuFrontier's tweet image. 🚀 "Quantization is not a compromise — it's the next paradigm."
After K2-Thinking's release, many developers have been curious about its native INT4 quantization format.
刘少伟, infra engineer at @Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice…
ZhihuFrontier's tweet image. 🚀 "Quantization is not a compromise — it's the next paradigm."
After K2-Thinking's release, many developers have been curious about its native INT4 quantization format.
刘少伟, infra engineer at @Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice…
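For intuition on what INT4 weight-only quantization means mechanically, here is a minimal sketch of the standard symmetric per-group recipe (not Kimi's actual implementation; real INT4 kernels also pack two 4-bit values per byte, which is skipped here for clarity):

```python
import torch

def quantize_int4_weight_only(w: torch.Tensor, group_size: int = 128):
    """Symmetric per-group INT4 quantization of a 2D weight matrix.
    Values live in [-8, 7]; int8 storage is used here instead of bit-packing for simplicity."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, chosen so the max magnitude maps to the INT4 limit (7).
    scales = (w_groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(w_groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).reshape(q.shape[0], -1)

# Round-trip error on a random "expert" weight stays small relative to the weights themselves.
w = torch.randn(256, 1024)
q, scales = quantize_int4_weight_only(w)
print("mean abs error:", (w - dequantize(q, scales)).abs().mean().item())
```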


making memes like this should be a full-time job

[ENG SUB] how it feels to use eager pytorch in 2025




same model, same settings, just diff providers



nevermind.. sorry openai i wasn't familiar with your game


imagine openai official account answering "awesome!" on claude sonnet 4.5 release



elie reposted

you must feel the gradients flowing through you! (edit: accidentally added shampoo before, sorry @kellerjordan0)


I don't think other tech reports mention bf16 training / fp8 inference for RL training, right?

before int4, it was bf16 training + fp8 inference, so the discrepancy is not greater



elie reposted

Interesting that the quantization is applied to the routed experts but not to the shared one. My understanding is that the shared expert has plenty of time to compute (Meituan fit in two whole shared expert layers during one MoE step with ScMoE) which is probably why.

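To make the routed-vs-shared distinction concrete, here is a rough, purely illustrative MoE block where one shared expert always runs (and could stay high precision) while tokens are dispatched to top-k routed experts (the weights INT4 targets). This is not Moonshot's or Meituan's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Minimal MoE block: one always-on shared expert + top-k routed experts."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: [tokens, d_model]
        # Shared expert sees every token; its compute is dense and easy to overlap.
        out = self.shared(x)
        # Routed experts: each token only visits its top-k experts.
        scores = F.softmax(self.router(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return out

print(ToyMoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```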

elie reposted

Insane how far the open source frontier has come

🚀 Hello, Kimi K2 Thinking!
The Open-Source Thinking Agent Model is here.

🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200–300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window

Built…


> "200-300 sequential tool calls" this is really the impressive part of this release imo, can't wait to see how they did it

eliebakouch's tweet image. > "200-300 sequential tool calls"

this is really the impressive part of this release imo, can't wait to see how they did it



imagine openai official account answering "awesome!" on claude sonnet 4.5 release

Awesome!



the scores are insane, very cool to see native int4 quantization for the MoE layers

> To overcome this challenge, we adopt Quantization-Aware Training (QAT) during the post-training phase, applying INT4 weight-only quantization to the MoE components. It allows K2 Thinking to…

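QAT here means the forward pass sees INT4-rounded expert weights while gradients flow as if the rounding weren't there (a straight-through estimator). A hypothetical minimal sketch of that fake-quant trick, not the actual K2 training code:

```python
import torch

class FakeQuantInt4(torch.autograd.Function):
    """Forward: snap weights to a symmetric INT4 grid. Backward: straight-through (identity)."""

    @staticmethod
    def forward(ctx, w, group_size):
        shape = w.shape
        g = w.reshape(-1, group_size)
        # One scale per group so the largest magnitude maps to the INT4 limit (7).
        scale = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
        q = torch.clamp(torch.round(g / scale), -8, 7)
        return (q * scale).reshape(shape)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend quantization was the identity.
        return grad_output, None

# One training step, sketched: the loss is computed through INT4-rounded weights,
# while updates land on the full-precision master weights.
w = torch.randn(16, 64, requires_grad=True)   # stands in for an MoE expert weight
x = torch.randn(8, 64)
w_q = FakeQuantInt4.apply(w, 32)
loss = (x @ w_q.t()).pow(2).mean()
loss.backward()
print(w.grad.shape)  # gradients reach the full-precision master weights
```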



elie reposted

This feels like a big moment

we’re very close to 50% on HLE, and bonus point: it’s with an open model :)



ok we're at 51% with "heavy" mode

> Heavy Mode: K2 Thinking Heavy Mode employs an efficient parallel strategy: it first rolls out eight trajectories simultaneously, then reflectively aggregates all outputs to generate the final result.

we’re very close to 50% on HLE, and bonus point: it’s with an open model :)
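Mechanically, that heavy mode is sample-n-then-aggregate: generate several independent trajectories in parallel, then ask the model once more to reconcile them into a final answer. A rough, hypothetical sketch against a generic OpenAI-compatible chat endpoint (the client and model id are placeholders, not Moonshot's actual setup):

```python
import asyncio
from openai import AsyncOpenAI  # assumes an OpenAI-compatible provider

client = AsyncOpenAI()          # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment
MODEL = "kimi-k2-thinking"      # placeholder model id

async def rollout(question: str) -> str:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        temperature=1.0,  # keep trajectories diverse and independent
    )
    return resp.choices[0].message.content

async def heavy_mode(question: str, n: int = 8) -> str:
    # 1) Roll out n trajectories simultaneously.
    drafts = await asyncio.gather(*(rollout(question) for _ in range(n)))
    # 2) Reflectively aggregate: reconcile all drafts into one final answer.
    joined = "\n\n".join(f"[Trajectory {i + 1}]\n{d}" for i, d in enumerate(drafts))
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Compare the candidate solutions, resolve disagreements, and give one final answer."},
            {"role": "user", "content": f"Question: {question}\n\nCandidates:\n{joined}"},
        ],
    )
    return resp.choices[0].message.content

# print(asyncio.run(heavy_mode("What is 17 * 24?")))  # needs an API key and endpoint
```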


we’re very close to 50% on HLE, and bonus point: it’s with an open model :)


From the original HLE paper:


