
Bala

@nbk_code

ML Scientist

Bala reposted

Hello Thermo World.


Bala reposted

wrote a blog on KV caching. link in comments.

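For readers who want the gist before the blog: in autoregressive decoding, the keys and values of already-processed tokens are cached so each new token only computes its own K/V and attends over the stored ones. A toy sketch of that idea (mine, not taken from the linked blog):

```python
import torch

# Toy single-head attention decode loop with a KV cache.
# Shapes, names, and the lack of batching are simplifications for illustration.
d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x_t):
    """x_t: (1, d) hidden state of the newest token."""
    q = x_t @ Wq
    # Only the new token's key/value are computed; earlier ones come from the cache.
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    K = torch.cat(k_cache, dim=0)    # (t, d)
    V = torch.cat(v_cache, dim=0)    # (t, d)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                  # (1, d)

for _ in range(5):
    out = decode_step(torch.randn(1, d))
```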

Bala reposted

Sebastien Bubeck just cleared the air and honestly, it makes GPT-5’s achievement even more astonishing. No, GPT-5 didn’t magically solve a new Erdős problem. What it did was arguably harder to appreciate: it rediscovered a long-forgotten mathematical link, buried in obscure…

My posts last week created a lot of unnecessary confusion*, so today I would like to do a deep dive on one example to explain why I was so excited. In short, it’s not about AIs discovering new results on their own, but rather how tools like GPT-5 can help researchers navigate,…



Bala reposted

I used ChatGPT to solve an open problem in convex optimization. *Part I* (1/N)


Bala reposted

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes, data collection etc., but anyway it doesn't matter. The more interesting part for me (esp. as a computer vision person at heart who is temporarily masquerading as a natural language…

🚀 DeepSeek-OCR — the new frontier of OCR from @deepseek_ai , exploring optical context compression for LLMs, is running blazingly fast on vLLM ⚡ (~2500 tokens/s on A100-40G) — powered by vllm==0.8.5 for day-0 model support. 🧠 Compresses visual contexts up to 20× while keeping…

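For reference, offline inference with vLLM generally looks like the sketch below. The model id, prompt template, and image handling here are my assumptions, not the vllm_project recipe, so treat it as a starting point rather than the official usage.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Hedged sketch of vLLM offline inference for an OCR-style vision-language model.
# Model id and prompt format are assumptions, not verified against the official
# DeepSeek-OCR + vLLM example.
llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=2048)

image = Image.open("page.png")
outputs = llm.generate(
    {"prompt": "<image>\nFree OCR.", "multi_modal_data": {"image": image}},
    sampling,
)
print(outputs[0].outputs[0].text)
```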


Bala reposted

Replicate IMO-Gold in less than 500 lines: gist.github.com/faabian/39d057… The prover-verifier workflow from Huang & Yang: Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline (arxiv.org/abs/2507.15855), original code at github.com/lyang36/IMO25/

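The pipeline's core loop is essentially generate, verify, refine. A hedged skeleton of that loop (paraphrased prompts and a hypothetical call_model helper standing in for an LLM client; this is not the gist's actual code):

```python
# Skeleton of a verification-and-refinement loop in the spirit of
# Huang & Yang (arxiv.org/abs/2507.15855). Prompts are paraphrased and
# `call_model` is a hypothetical stand-in for whatever LLM API you use.
def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def solve(problem: str, max_rounds: int = 10) -> str | None:
    solution = call_model(f"Solve rigorously:\n{problem}")
    for _ in range(max_rounds):
        report = call_model(
            "Act as a strict verifier. List every gap or error.\n"
            f"Problem:\n{problem}\nSolution:\n{solution}"
        )
        # Acceptance criterion below is an assumption for the sketch.
        if "no issues found" in report.lower():
            return solution
        solution = call_model(
            "Revise the solution to fix these issues.\n"
            f"Problem:\n{problem}\nSolution:\n{solution}\nIssues:\n{report}"
        )
    return None
```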

Bala reposted

Excited to release new repo: nanochat! (it's among the most unhinged I've written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single,…


Bala reposted

When should an LLM learn to reason? 🤔 Early in pretraining or late in fine-tuning? Our new work, "Front-Loading Reasoning", challenges the "save it for later" approach. We show that injecting reasoning data into pretraining is critical for building models that reach the…


Bala reposted

New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M-parameter neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: alexiajm.github.io/2025/09/29/tin… Code: github.com/SamsungSAILMon… Paper: arxiv.org/abs/2510.04871
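As a rough illustration of the recursion idea only (not the paper's exact architecture or training scheme): a tiny network reuses the same weights to repeatedly refine a latent scratchpad and a candidate answer.

```python
import torch
import torch.nn as nn

# Toy sketch of recursive refinement with a small network. Dimensions, number
# of steps, and the update rules are placeholders, not the TRM design.
class TinyRecursive(nn.Module):
    def __init__(self, d=128, n_steps=6):
        super().__init__()
        self.n_steps = n_steps
        self.update_z = nn.Sequential(nn.Linear(3 * d, d), nn.GELU(), nn.Linear(d, d))
        self.update_y = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x):                     # x: (batch, d) embedded puzzle
        z = torch.zeros_like(x)               # latent "scratchpad"
        y = torch.zeros_like(x)               # current answer embedding
        for _ in range(self.n_steps):         # same tiny net, applied recursively
            z = self.update_z(torch.cat([x, y, z], dim=-1))
            y = self.update_y(torch.cat([y, z], dim=-1))
        return y

model = TinyRecursive()
out = model(torch.randn(2, 128))
print(sum(p.numel() for p in model.parameters()))  # parameter count stays small
```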


Bala reposted

My brain broke when I read this paper. A tiny 7 million parameter model just beat DeepSeek-R1, Gemini 2.5 Pro, and o3-mini at reasoning on both ARC-AGI 1 and ARC-AGI 2. It's called the Tiny Recursion Model (TRM), from Samsung. How can a model 10,000x smaller be smarter? Here's how…


Bala reposted

A beautiful paper from MIT+Harvard+ @GoogleDeepMind 👏 Explains why Transformers miss multi-digit multiplication and shows a simple bias that fixes it. The researchers trained two small Transformer models on 4-digit-by-4-digit multiplication. One used a special training method…

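To see why the task is structurally hard (my illustration, not the paper's setup): in schoolbook multiplication every output digit mixes several partial products plus a carry chain that can ripple across the whole number, exactly the kind of long-range dependency a plain next-token predictor tends to miss.

```python
# Schoolbook multiplication of two multi-digit numbers, written out to expose
# the partial-product sums and the carry chain.
def schoolbook_multiply(a: int, b: int) -> int:
    da = [int(d) for d in str(a)[::-1]]        # least-significant digit first
    db = [int(d) for d in str(b)[::-1]]
    out = [0] * (len(da) + len(db))
    for i, x in enumerate(da):
        for j, y in enumerate(db):
            out[i + j] += x * y                # accumulate partial products
    carry = 0
    for k in range(len(out)):                  # carries ripple toward high digits
        total = out[k] + carry
        out[k], carry = total % 10, total // 10
    return int("".join(map(str, out[::-1])))

assert schoolbook_multiply(1234, 5678) == 1234 * 5678
```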

Bala reposted

Very interesting result! Though perhaps not too surprising if you recall that RL-tuned models often show much smaller feature shifts than SFT (see e.g. Xiang’s post below). When the model doesn’t have to move very far, LoRA—or even something lighter—can already do the job. As…

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…

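For readers new to LoRA, the construction itself is just a frozen weight plus a learned low-rank update. A generic sketch (not the blog's code):

```python
import torch
import torch.nn as nn

# Minimal LoRA layer: the base weight W stays frozen and only the low-rank
# factors A and B are trained, so the effective weight is W + (alpha/r) * B @ A.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # full weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```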


Bala reposted

Meta AI’s new paper shows that strong chain-of-thought reasoning can emerge in models under 1 billion parameters if you feed them the right data. Key takeaways 🔹Smaller, smarter. MobileLLM-R1 (140 M – 950 M params) beats or matches prior open-source peers on reasoning suites…


Bala reposted

LoRA in reinforcement learning (RL) can match full-finetuning performance when done right! 💡 A new @thinkymachines post shows that with 10x larger learning rates, LoRA applied to all layers, and more, LoRA works even at rank=1. We're excited to have collaborated on this blog!


LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…

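In Hugging Face peft terms the recipe reads roughly like the sketch below; the tooling, model id, and exact numbers are my assumptions, not the blog's or Unsloth's configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hedged sketch of "LoRA on all layers, high learning rate, tiny rank".
# The model id and hyperparameter values are placeholders.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
config = LoraConfig(
    r=1,                          # rank 1 can already match full FT in this regime
    lora_alpha=32,
    target_modules="all-linear",  # adapt every linear layer, incl. MLP projections
)
model = get_peft_model(model, config)
model.print_trainable_parameters()

# Roughly 10x the learning rate you would use for full fine-tuning.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```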


Bala reposted

Great work showing prompt synthesis as a new scaling axis for reasoning. Good training data is scarce. This work showcases a framework that might make it possible to construct high-quality training problems for reasoning-focused LLMs. Technical details below:


Bala reposted

Yet more evidence that a pretty major shift is happening, this time by Scott Aaronson scottaaronson.blog/?p=9183&fbclid…


Bala reposted

I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!

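The usual form of the argument (my illustration, not the post's content): byte-level models still map text to integer IDs, they just fix the vocabulary to the 256 possible byte values and pay for it with longer sequences.

```python
# "Tokenizer-free" byte-level modeling still tokenizes: UTF-8 bytes are the tokens.
text = "tokenizer-free?"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # [116, 111, 107, ...]: still a mapping from text to integer IDs
print(len(byte_ids))  # byte sequences run several times longer than typical BPE tokenizations
```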

Bala reposted

Damn, very interesting paper. After rapid loss reduction, we see deceleration and the loss follows a "scaling law": this is because at these steps, gradients start to conflict with each other. Updates are 'fighting for model capacity' in some sense, and the larger the model, the less fighting there…

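One way to make the gradient-conflict claim concrete (a toy measurement, not the paper's methodology) is the cosine similarity between gradients computed on different batches; values near or below zero mean the updates partially cancel each other.

```python
import torch
import torch.nn as nn

# Toy measurement of gradient conflict: cosine similarity between the gradients
# produced by two different (here random) batches on the same model.
model = nn.Linear(32, 1)
loss_fn = nn.MSELoss()

def flat_grad(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

g1 = flat_grad(torch.randn(64, 32), torch.randn(64, 1))
g2 = flat_grad(torch.randn(64, 32), torch.randn(64, 1))
print(torch.nn.functional.cosine_similarity(g1, g2, dim=0))  # < 0 means conflicting updates
```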

Bala reposted

Correct! Just as a reminder: this is what a Transformer found after looking at 10M solar systems


A student who truly understands F=ma can solve more novel problems than a Transformer that has memorized every physics textbook ever written.


