
Bala

@nbk_code

ML Scientist

Bala reposted

Excited to release new repo: nanochat! (it's among the most unhinged I've written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single,…


Bala reposted

When should an LLM learn to reason? 🤔 Early in pretraining or late in fine-tuning? Our new work, "Front-Loading Reasoning", challenges the "save it for later" approach. We show that injecting reasoning data into pretraining is critical for building models that reach the…

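A minimal sketch of what "front-loading" could mean in practice: blending reasoning-style documents into the pretraining data stream instead of saving them all for fine-tuning. The corpus names and mixing fraction below are illustrative assumptions, not the paper's setup.

import random

# Hypothetical corpora; in practice these would be real pretraining shards.
web_docs = [f"web document {i}" for i in range(1000)]
reasoning_docs = [f"worked solution {i}" for i in range(1000)]

def sample_pretraining_batch(batch_size, reasoning_fraction=0.2, seed=0):
    # Mix reasoning data into the pretraining stream at an assumed fixed ratio;
    # the paper studies where in the pipeline this data is most useful.
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = reasoning_docs if rng.random() < reasoning_fraction else web_docs
        batch.append(rng.choice(source))
    return batch

print(sample_pretraining_batch(8))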

Bala reposted

New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: alexiajm.github.io/2025/09/29/tin… Code: github.com/SamsungSAILMon… Paper: arxiv.org/abs/2510.04871


Bala reposted

My brain broke when I read this paper. A tiny 7 million parameter model just beat DeepSeek-R1, Gemini 2.5 Pro, and o3-mini at reasoning on both ARC-AGI-1 and ARC-AGI-2. It's called Tiny Recursive Model (TRM) from Samsung. How can a model 10,000x smaller be smarter? Here's how…

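A rough, runnable sketch of the recursive idea behind TRM: one small network is reused over several recursion steps, alternately updating a latent "scratchpad" and the current answer, instead of generating a long chain of thought. The dimensions, names, and step count are assumptions for illustration, not the actual TRM architecture.

import torch
import torch.nn as nn

class TinyRecursiveBlock(nn.Module):
    # The same small MLPs are reused at every recursion step, so parameters stay tiny.
    def __init__(self, dim=128):
        super().__init__()
        self.update_latent = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.update_answer = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, n_steps=6):
        # x: encoded puzzle, shape (batch, dim)
        z = torch.zeros_like(x)   # latent scratchpad
        y = torch.zeros_like(x)   # current answer embedding
        for _ in range(n_steps):
            z = self.update_latent(torch.cat([x, y, z], dim=-1))  # think
            y = self.update_answer(torch.cat([y, z], dim=-1))     # revise the answer
        return y

model = TinyRecursiveBlock()
print(model(torch.randn(4, 128)).shape)  # torch.Size([4, 128])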

Bala reposted

A beautiful paper from MIT + Harvard + @GoogleDeepMind 👏 Explains why Transformers miss multi-digit multiplication and shows a simple bias that fixes it. The researchers trained two small Transformer models on 4-digit-by-4-digit multiplication. One used a special training method…

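For context, the training data the tweet describes is trivial to generate; the paper's contribution is the inductive bias that lets a small Transformer generalize on it. A tiny sketch of 4-digit-by-4-digit multiplication examples (the formatting is an assumption, not the paper's setup):

import random

def make_mult_example(rng):
    # Sample two 4-digit operands and format a prompt/target pair.
    a, b = rng.randint(1000, 9999), rng.randint(1000, 9999)
    return f"{a}*{b}=", str(a * b)

rng = random.Random(0)
for prompt, target in (make_mult_example(rng) for _ in range(3)):
    print(prompt + target)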

Bala reposted

Very interesting result! Though perhaps not too surprising if you recall that RL-tuned models often show much smaller feature shifts than SFT (see e.g. Xiang’s post below). When the model doesn’t have to move very far, LoRA—or even something lighter—can already do the job. As…

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…



Bala reposted

Meta AI’s new paper shows that strong chain-of-thought reasoning can emerge in models under 1 billion parameters if you feed them the right data. Key takeaways 🔹Smaller, smarter. MobileLLM-R1 (140M–950M params) beats or matches prior open-source peers on reasoning suites…


Bala reposted

LoRA in reinforcement learning (RL) can match full fine-tuning performance when done right! 💡 A new @thinkymachines post shows how, using 10x larger learning rates, applying LoRA on all layers & more, LoRA works even at rank=1. We're excited to have collaborated on this blog!


LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…

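A minimal sketch of what "LoRA on all layers at rank 1" means mechanically: every weight matrix is frozen and only a rank-1 update B @ A is trained. The rank, scaling, and initialization below just echo the headline settings and are not a drop-in reproduction of the blog.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base weight plus a trainable low-rank update (rank=1 here).
    def __init__(self, base: nn.Linear, rank=1, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # only A and B receive gradients

The post's other headline recommendation, a roughly 10x larger learning rate than full fine-tuning, would then apply only to these A and B parameters.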


Bala reposted

Great work showing prompt synthesis as a new scaling axis for reasoning. Good training data is scarce. This work showcases a framework that might make it possible to construct high-quality training problems for reasoning-focused LLMs. Technical details below:


Bala reposted

Yet more evidence that a pretty major shift is happening, this time by Scott Aaronson scottaaronson.blog/?p=9183&fbclid…


Bala reposted

I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!


Bala reposted

Damn, very interesting paper. After rapid loss reduction, we see deceleration and follow a "scaling law": this is because at these steps, gradients start to conflict with each other. Updates are 'fighting for model capacity' in some sense, and the larger the model, the less fighting there…

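"Gradients conflicting" can be made concrete by measuring the cosine similarity between gradients computed on different batches; values near or below zero mean the updates pull the weights in opposing directions. A toy sketch, with a placeholder model and random data standing in for a real training run:

import torch
import torch.nn as nn

model = nn.Linear(16, 1)
loss_fn = nn.MSELoss()

def flat_grad(x, y):
    # Gradient of the loss on one batch, flattened into a single vector.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

g1 = flat_grad(torch.randn(32, 16), torch.randn(32, 1))
g2 = flat_grad(torch.randn(32, 16), torch.randn(32, 1))
print(torch.cosine_similarity(g1, g2, dim=0))  # below 0 means the two batches conflict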

Bala reposted

Correct! Just as a reminder: this is what a Transformer found after looking at 10M solar systems


A student who truly understands F=ma can solve more novel problems than a Transformer that has memorized every physics textbook ever written.



Bala reposted

This is interesting as a first large diffusion-based LLM. Most of the LLMs you've been seeing are ~clones as far as the core modeling approach goes. They're all trained "autoregressively", i.e. predicting tokens from left to right. Diffusion is different - it doesn't go left to…

We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation.
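A toy illustration of the difference in generation order (not Mercury's actual algorithm): autoregressive decoding commits to one token at a time from left to right, while a diffusion-style decoder starts from a fully masked sequence and fills in positions in parallel over a few coarse-to-fine refinement rounds.

import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
rng = random.Random(0)

def autoregressive_decode(length=5):
    # Commit to one token at a time, left to right.
    tokens = []
    for _ in range(length):
        tokens.append(rng.choice(VOCAB))  # stand-in for sampling from p(x_t | x_<t)
    return tokens

def diffusion_style_decode(length=5, rounds=3):
    # Start fully masked; each round, unmask a subset of positions in parallel.
    tokens = ["[MASK]"] * length
    for _ in range(rounds):
        for i in range(length):
            if tokens[i] == "[MASK]" and rng.random() < 0.5:
                tokens[i] = rng.choice(VOCAB)  # stand-in for the denoiser's prediction
    return tokens

print(autoregressive_decode())
print(diffusion_style_decode())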



Bala reposted

Why do we treat train and test times so differently? Why is one “training” and the other “in-context learning”? Just take a few gradient steps at test time — a simple way to increase test-time compute — and get SoTA on the ARC public validation set: 61%, the average human score! @arcprize

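A minimal sketch of the "few gradients at test time" idea: before answering a new task, briefly fine-tune a copy of the model on that task's demonstration pairs, then predict. The model, optimizer settings, and step count here are illustrative assumptions, not the paper's recipe.

import copy
import torch
import torch.nn as nn

base_model = nn.Linear(8, 8)  # placeholder for a pretrained model

def predict_with_test_time_training(demos, query, n_steps=5, lr=1e-2):
    # Adapt a copy of the model on the task's demonstrations, then predict on the query.
    model = copy.deepcopy(base_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(n_steps):
        for x, y in demos:
            opt.zero_grad()
            nn.functional.mse_loss(model(x), y).backward()
            opt.step()
    with torch.no_grad():
        return model(query)

demos = [(torch.randn(8), torch.randn(8)) for _ in range(3)]
print(predict_with_test_time_training(demos, torch.randn(8)).shape)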

Bala reposted

Diffusion LLMs are promising ways to overcome the limitations of autoregressive LLMs. Less error propagation, easier to control, and faster to sample! But how do Diffusion LLMs actually work? 🤔 Let's explore some ideas on this fascinating topic! youtu.be/8BTOoc0yDVA


Bala reposted

Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic.

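The setup can be pictured as two roles played by the same model: a proposer writes questions about the topic prompt, a solver attempts answers, and RL rewards flow to both. A schematic loop with stub functions standing in for the LLM calls and reward (all names are placeholders, not the paper's API):

def propose_question(topic):
    # Placeholder for the proposer role of the LLM.
    return f"A question about {topic}"

def solve(question):
    # Placeholder for the solver role of the LLM.
    return f"An attempted answer to: {question}"

def reward(question, answer):
    # Placeholder reward, e.g. self-consistency or a verifiable check.
    return 1.0 if answer else 0.0

topic = "number theory"  # the single topic prompt is the only external input
trajectories = []
for _ in range(4):
    q = propose_question(topic)
    a = solve(q)
    r = reward(q, a)
    trajectories.append((q, a, r))  # both roles would then be updated with RL on these

print(trajectories[0])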

Bala reposted

Current AI benchmarks test textbook knowledge, but scientific discovery is iterative and creative, and that's where evaluation gets messy. @chenru_duan (MIT→Microsoft→startup founder) and I discuss his journey and the open source project Scientific Discovery Evaluation…


Bala reposted

How does prompt optimization compare to RL algos like GRPO? GRPO needs 1000s of rollouts, but humans can learn from a few trials—by reflecting on what worked & what didn't. Meet GEPA: a reflective prompt optimizer that can outperform GRPO by up to 20% with 35x fewer rollouts!🧵

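The contrast with GRPO: the optimization variable is the prompt text itself, updated by reflecting in natural language on a handful of failed rollouts rather than by thousands of gradient-estimating rollouts. A schematic loop with stub functions (names and logic are placeholders, not GEPA's actual implementation):

def run_task(prompt, example):
    # Placeholder: run the LLM with `prompt` on `example`, return (output, score).
    return "output", 0.0

def reflect_and_rewrite(prompt, failures):
    # Placeholder: ask the LLM to read the failures and propose an improved prompt.
    return prompt + " (revised after reflection)"

prompt = "Solve the task step by step."
dataset = ["example 1", "example 2", "example 3"]

for _ in range(3):  # a few reflective rounds instead of thousands of rollouts
    results = [(ex, *run_task(prompt, ex)) for ex in dataset]
    failures = [r for r in results if r[2] < 1.0]
    if not failures:
        break
    prompt = reflect_and_rewrite(prompt, failures)

print(prompt)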
