
Samsung's Tiny Recursive Model (TRM) masters complex reasoning. With just 7M parameters, TRM outperforms large LLMs on hard puzzles like Sudoku & ARC-AGI. This "Less is More" approach redefines efficiency in AI, using less than 0.01% of competitors' parameters!

Transfusion combines autoregressive with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 the first non-autoregressive model to generate text and images concurrently using a single transformer—unifying Edit Flow (text) with Flow…
Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for…
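The tweet doesn't spell the framework out, so as a rough illustration of what "exploring in parameter space" means, here is an evolution-strategies-style update (my own sketch, not the paper's algorithm): perturb the policy weights with noise, score each perturbed copy, and step toward the high-reward perturbations.

```python
# Hypothetical sketch of parameter-space exploration (evolution-strategies style),
# not the proposed method: perturb the policy weights with Gaussian noise, score each
# perturbed policy, and move the weights toward high-reward perturbations.
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, population=16, lr=0.01):
    """One update step that explores in parameter space instead of action space."""
    noises = [np.random.randn(*theta.shape) for _ in range(population)]
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noises])
    # Standardize rewards so the update is invariant to reward scale.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = sum(a * eps for a, eps in zip(adv, noises)) / (population * sigma)
    return theta + lr * grad

# Toy usage: maximize -||theta - target||^2 with a 4-dim "policy".
target = np.ones(4)
theta = np.zeros(4)
for _ in range(200):
    theta = es_step(theta, lambda p: -np.sum((p - target) ** 2))
```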
🤔Can we train RL on LLMs with extremely stale data? 🚀Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and…

In 1951, the «Brief Explanation of the Code of Conduct for Chinese Nationalist Party Members» (中国国民党党员守则浅释), printed by China's Ministry of National Defense for the rectification of the army and the party: Soviet-Russian imperialism has exploited the Communist bandit Zhu-Mao traitor clique, seeking to destroy our Republic of China and enslave our Chinese nation……

Introducing Pretraining with Hierarchical Memories: Separating Knowledge & Reasoning for On-Device LLM Deployment 💡We propose dividing LLM parameters into 1) anchor (always used, capturing commonsense) and 2) memory bank (selected per query, capturing world knowledge). [1/X]🧵
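The thread only gives the high-level split, so here is a minimal sketch of how such a forward pass might look: an always-on anchor block plus a per-query top-k lookup into a memory bank. The class and layer choices below are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch (assumed interface, not the paper's code): a small "anchor" model is
# always applied, while a handful of memory blocks are fetched per query and added in.
import torch, torch.nn as nn

class HierarchicalMemoryLM(nn.Module):
    def __init__(self, d_model=256, n_memories=1024, top_k=4):
        super().__init__()
        self.anchor = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Memory bank: one key and one "knowledge" vector per entry.
        self.mem_keys = nn.Parameter(torch.randn(n_memories, d_model))
        self.mem_values = nn.Parameter(torch.randn(n_memories, d_model))
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.anchor(x)                     # always-on commonsense/reasoning params
        query = h.mean(dim=1)                  # crude per-query summary
        scores = query @ self.mem_keys.T       # (batch, n_memories)
        top = scores.topk(self.top_k, dim=-1)  # select a few memories per query
        retrieved = self.mem_values[top.indices].mean(dim=1)   # (batch, d_model)
        return h + retrieved.unsqueeze(1)      # inject retrieved world knowledge

out = HierarchicalMemoryLM()(torch.randn(2, 16, 256))
```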
My brain broke when I read this paper. A tiny 7 Million parameter model just beat DeepSeek-R1, Gemini 2.5 pro, and o3-mini at reasoning on both ARC-AGI 1 and ARC-AGI 2. It's called Tiny Recursive Model (TRM) from Samsung. How can a model 10,000x smaller be smarter? Here's how…
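The tweet cuts off before the explanation, but the core recursive idea can be sketched: effective depth comes from iterating a tiny network that alternately refines a latent scratchpad and a working answer. The module below is a toy rendering of that loop, not Samsung's released model.

```python
# Hedged sketch of recursive refinement (my reading of the TRM idea, not the actual
# code): a tiny network repeatedly updates a latent scratchpad z and a working answer y,
# so depth comes from iteration rather than parameter count.
import torch, torch.nn as nn

class TinyRecursiveReasoner(nn.Module):
    def __init__(self, d=128, inner_steps=6, outer_steps=3):
        super().__init__()
        self.update_z = nn.GRUCell(2 * d, d)   # refine latent from (question, answer)
        self.update_y = nn.GRUCell(d, d)       # refine answer from latent
        self.inner_steps, self.outer_steps = inner_steps, outer_steps

    def forward(self, x):                      # x: (batch, d) encoded question
        y = torch.zeros_like(x)                # initial answer guess
        z = torch.zeros_like(x)                # latent scratchpad
        for _ in range(self.outer_steps):
            for _ in range(self.inner_steps):  # think: iterate the latent
                z = self.update_z(torch.cat([x, y], dim=-1), z)
            y = self.update_y(z, y)            # act: revise the answer
        return y

ans = TinyRecursiveReasoner()(torch.randn(4, 128))
```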

Absolutely classic @GoogleResearch paper on in-context learning by LLMs. It shows how LLMs learn in context from examples in the prompt: they can pick up new patterns while answering, yet their stored weights never change. 💡The mechanism they reveal for…

No Prompt Left Behind: A New Era for LLM Reinforcement Learning This paper introduces RL-ZVP, a novel algorithm that unlocks learning signals from previously ignored "zero-variance prompts" in LLM training. It achieves significant accuracy improvements on math reasoning…
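For context on why "zero-variance prompts" are normally ignored: with group-normalized (GRPO-style) advantages, a prompt whose sampled responses all receive the same reward yields exactly zero advantage and therefore no gradient. The snippet below only illustrates that problem; RL-ZVP's actual remedy is in the paper.

```python
# Illustration of the "zero-variance prompt" problem that RL-ZVP targets (the paper's
# fix is not reproduced here): with group-normalized advantages, a prompt whose sampled
# responses all score the same contributes no learning signal at all.
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's samples."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed outcomes -> nonzero signal
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # all correct  -> all-zero advantages
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # all wrong    -> all-zero advantages
```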

This is a super cool topic! I did a class project with a friend for a course on Kernels during senior year in college 🔗: github.com/rish-16/ma4270… Lots of fun connections between kernels and self-attention, especially when learning periodic functions The attention patterns…



How Kernel Regression is related to Attention Mechanism - a summary in 10 slides. 0/1
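The slides aren't reproduced here, but the usual bridge is that softmax attention is exactly Nadaraya-Watson kernel regression with an exponential kernel on scaled query-key dot products. A small numerical check:

```python
# Softmax attention as Nadaraya-Watson kernel regression with kernel exp(q.k / sqrt(d)):
# both routes produce the same weighted average of the values.
import numpy as np

def nadaraya_watson(q, keys, values, kernel):
    w = np.array([kernel(q, k) for k in keys])
    w = w / w.sum()                    # normalized kernel weights
    return w @ values                  # weighted average of values

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

d = 8
Q, K, V = np.random.randn(1, d), np.random.randn(5, d), np.random.randn(5, d)
exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(d))
# Both give the same prediction for a single query.
print(np.allclose(nadaraya_watson(Q[0], K, V, exp_kernel), softmax_attention(Q, K, V)[0]))
```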

The paper shows that Group Relative Policy Optimization (GRPO) behaves like Direct Preference Optimization (DPO), so training on simple answer pairs works. Turns a complex GRPO setup into a simple pairwise recipe without losing quality. This cuts tokens, compute, and wall time,…
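The pairwise recipe the tweet alludes to is presumably the standard DPO objective on (chosen, rejected) answer pairs; the paper's exact GRPO-to-DPO argument isn't reproduced here, but the loss itself looks like this:

```python
# Sketch of the pairwise recipe: the standard DPO loss on (chosen, rejected) answer
# pairs, using summed per-answer log-probs from the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Each argument is the summed token log-prob of a full answer, shape (batch,)."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```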

🚀 Want high-quality, realistic, and truly challenging post-training data for the agentic era? Introducing Toucan-1.5M (huggingface.co/papers/2510.01…) — the largest open tool-agentic dataset yet: ✨ 1.53M real agentic trajectories synthesized by 3 models ✨ Diverse, challenging tasks…

This is a solid blog breaking down how mixture-of-experts (MoE) language models can actually be served cheaply if their design is matched to hardware limits. MoE models only activate a few experts per token, which saves compute but causes heavy memory and communication use.…
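To make that compute-versus-memory tension concrete, here is a minimal top-2 router (a generic sketch, not the blog's serving stack): only k experts run per token, yet every expert's weights must stay resident, and reachable over the interconnect once experts are sharded.

```python
# Why MoE is cheap on FLOPs but heavy on memory: only top_k experts run per token,
# but all n_experts weight matrices must stay loaded (and be reachable across devices
# when experts are sharded).
import torch, torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d)
        gates = self.router(x).softmax(dim=-1)              # (tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)       # pick k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                      # compute: only k experts/token
            for e, expert in enumerate(self.experts):       # memory: every expert resident
                mask = idx[:, slot] == e
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    out[mask] = out[mask] + w * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(32, 64))
```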




Analysis of the KL estimators k1, k2, k3 and their use as a reward or as a loss. This is partly a continuation of earlier discussions: x.com/QuanquanGu/sta…. By the way, "RLHF" is a bit misleading here; they did RLVR.
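For reference, the three single-sample estimators under discussion (following Schulman's "Approximating KL Divergence" note), for KL(q‖p) with samples x ~ q and ratio r = p(x)/q(x):

\[
k_1 = -\log r, \qquad k_2 = \tfrac{1}{2}(\log r)^2, \qquad k_3 = (r - 1) - \log r .
\]

k1 is unbiased but high-variance, k2 is biased, and k3 is unbiased, lower-variance, and always non-negative, which is why it is the common choice for a KL penalty.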

The original GRPO is an off-policy RL algorithm, but its KL regularization isn't done right. Specifically, the k3 estimator for the unnormalized reverse KL is missing the importance weight. The correct formulation should be:
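The tweet is truncated before the formula, so the following is only a reconstruction of what an importance-weighted k3 estimator would look like: responses are drawn from π_old, but the regularizer targets KL(π_θ‖π_ref), so an unbiased estimate needs the ratio π_θ/π_old in front of the k3 term:

\[
\mathbb{E}_{a \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\mathrm{old}}(a\mid s)}\left(\frac{\pi_{\mathrm{ref}}(a\mid s)}{\pi_\theta(a\mid s)} - 1 - \log\frac{\pi_{\mathrm{ref}}(a\mid s)}{\pi_\theta(a\mid s)}\right)\right] = \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\]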

Meta just ran one of the largest synthetic-data studies (over 1000 LLMs, more than 100k GPU hours). Result: mixing synthetic and natural data only helps once you cross the right scale and ratio (~30%). Small models learn nothing; larger ones improve sharply once past that threshold…

RL fine-tuning often prematurely collapses policy entropy. We consider a general framework, called set RL, i.e. RL over a set of trajectories from a policy. We use it to incentivize diverse solutions & optimize for inference time performance. Paper: arxiv.org/abs/2509.25424
Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks…
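As a toy rendering of the set-level idea (not the paper's actual objective), one could score a whole set of k sampled solutions jointly, e.g. best-of-k correctness plus a diversity bonus, so the policy is rewarded for covering distinct solutions instead of collapsing onto one:

```python
# Toy set-level reward (illustrative only, not the paper's objective): score k sampled
# solutions as a group, combining best-of-k correctness with a crude diversity bonus.
def set_reward(solutions, correct, diversity_weight=0.1):
    """solutions: list of answer strings; correct: per-solution 0/1 rewards."""
    best_of_k = float(max(correct))                    # inference-time (pass@k) objective
    distinct = len(set(solutions)) / len(solutions)    # crude diversity proxy
    return best_of_k + diversity_weight * distinct

print(set_reward(["a+b", "a+b", "a+b", "a+b"], [1, 1, 1, 1]))  # correct but collapsed
print(set_reward(["a+b", "b+a", "2a", "a*b"], [1, 1, 0, 0]))   # correct and diverse
```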

🚨🚨New Paper: Training LLMs to Discover Abstractions for Solving Reasoning Problems Introducing RLAD, a two-player RL framework for LLMs to discover 'reasoning abstractions'—natural language hints that encode procedural knowledge for structured exploration in reasoning.🧵⬇️

A look at the history of Reinforcement Learning. What is Temporal-Difference (TD) learning? @RichardSSutton introduced TD learning in 1988, and today the most widely used RL algorithms, like deep actor-critic, rely on TD error as the learning signal. So, TD learning: ▪️ Allows…
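The TD(0) update the tweet is describing, in a few lines: the learning signal is the TD error, which bootstraps the target from the current estimate of the next state rather than waiting for the full return.

```python
# Classic TD(0) value update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """V is a dict state -> value estimate; returns the TD error used as the signal."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}
td0_update(V, s="s0", r=1.0, s_next="s1")   # V["s0"] moves toward r + gamma * V["s1"]
```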

Shirokuro, a Japanese restaurant in NYC's East Village, where the interior mimics a hand-drawn black-and-white sketchbook,