
Samsung's Tiny Recursive Model (TRM) masters complex reasoning. With just 7M parameters, TRM outperforms large LLMs on hard puzzles like Sudoku & ARC-AGI. This "Less is More" approach redefines efficiency in AI, using less than 0.01% of competitors' parameters!

Transfusion combines autoregressive with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 the first non-autoregressive model to generate text and images concurrently using a single transformer—unifying Edit Flow (text) with Flow…
Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for…
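The tweet doesn't spell the framework out, so as a rough illustration of what "exploring in parameter space" means, here is an evolution-strategies-style update (my own sketch, not the paper's algorithm): perturb the policy weights with noise, score each perturbed copy, and step toward the high-reward perturbations.

```python
# Hypothetical sketch of parameter-space exploration (evolution-strategies style),
# not the proposed method: perturb the policy weights with Gaussian noise, score each
# perturbed policy, and move the weights toward high-reward perturbations.
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, population=16, lr=0.01):
    """One update step that explores in parameter space instead of action space."""
    noises = [np.random.randn(*theta.shape) for _ in range(population)]
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noises])
    # Standardize rewards so the update is invariant to reward scale.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = sum(a * eps for a, eps in zip(adv, noises)) / (population * sigma)
    return theta + lr * grad

# Toy usage: maximize -||theta - target||^2 with a 4-dim "policy".
target = np.ones(4)
theta = np.zeros(4)
for _ in range(200):
    theta = es_step(theta, lambda p: -np.sum((p - target) ** 2))
```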
🤔Can we train RL on LLMs with extremely stale data? 🚀Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and…

In 1951, the «Brief Explanation of the Code of Conduct for Chinese Nationalist Party Members» (中国国民党党员守则浅释), printed by China's Ministry of National Defense for the rectification of the army and the party: Soviet-Russian imperialism has exploited the Communist bandit Zhu-Mao traitor clique, seeking to destroy our Republic of China and enslave our Chinese nation……

Introducing Pretraining with Hierarchical Memories: Separating Knowledge & Reasoning for On-Device LLM Deployment 💡We propose dividing LLM parameters into 1) anchor (always used, capturing commonsense) and 2) memory bank (selected per query, capturing world knowledge). [1/X]🧵
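The thread only gives the high-level split, so here is a minimal sketch of how such a forward pass might look: an always-on anchor block plus a per-query top-k lookup into a memory bank. The class and layer choices below are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch (assumed interface, not the paper's code): a small "anchor" model is
# always applied, while a handful of memory blocks are fetched per query and added in.
import torch, torch.nn as nn

class HierarchicalMemoryLM(nn.Module):
    def __init__(self, d_model=256, n_memories=1024, top_k=4):
        super().__init__()
        self.anchor = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Memory bank: one key and one "knowledge" vector per entry.
        self.mem_keys = nn.Parameter(torch.randn(n_memories, d_model))
        self.mem_values = nn.Parameter(torch.randn(n_memories, d_model))
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.anchor(x)                     # always-on commonsense/reasoning params
        query = h.mean(dim=1)                  # crude per-query summary
        scores = query @ self.mem_keys.T       # (batch, n_memories)
        top = scores.topk(self.top_k, dim=-1)  # select a few memories per query
        retrieved = self.mem_values[top.indices].mean(dim=1)   # (batch, d_model)
        return h + retrieved.unsqueeze(1)      # inject retrieved world knowledge

out = HierarchicalMemoryLM()(torch.randn(2, 16, 256))
```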
My brain broke when I read this paper. A tiny 7 Million parameter model just beat DeepSeek-R1, Gemini 2.5 pro, and o3-mini at reasoning on both ARC-AGI 1 and ARC-AGI 2. It's called Tiny Recursive Model (TRM) from Samsung. How can a model 10,000x smaller be smarter? Here's how…
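The tweet cuts off before the explanation, but the core recursive idea can be sketched: effective depth comes from iterating a tiny network that alternately refines a latent scratchpad and a working answer. The module below is a toy rendering of that loop, not Samsung's released model.

```python
# Hedged sketch of recursive refinement (my reading of the TRM idea, not the actual
# code): a tiny network repeatedly updates a latent scratchpad z and a working answer y,
# so depth comes from iteration rather than parameter count.
import torch, torch.nn as nn

class TinyRecursiveReasoner(nn.Module):
    def __init__(self, d=128, inner_steps=6, outer_steps=3):
        super().__init__()
        self.update_z = nn.GRUCell(2 * d, d)   # refine latent from (question, answer)
        self.update_y = nn.GRUCell(d, d)       # refine answer from latent
        self.inner_steps, self.outer_steps = inner_steps, outer_steps

    def forward(self, x):                      # x: (batch, d) encoded question
        y = torch.zeros_like(x)                # initial answer guess
        z = torch.zeros_like(x)                # latent scratchpad
        for _ in range(self.outer_steps):
            for _ in range(self.inner_steps):  # think: iterate the latent
                z = self.update_z(torch.cat([x, y], dim=-1), z)
            y = self.update_y(z, y)            # act: revise the answer
        return y

ans = TinyRecursiveReasoner()(torch.randn(4, 128))
```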

Absolutely classic @GoogleResearch paper on in-context learning by LLMs. It shows how LLMs learn in context from examples in the prompt: they can pick up new patterns while answering, yet their stored weights never change. 💡The mechanism they reveal for…

No Prompt Left Behind: A New Era for LLM Reinforcement Learning This paper introduces RL-ZVP, a novel algorithm that unlocks learning signals from previously ignored "zero-variance prompts" in LLM training. It achieves significant accuracy improvements on math reasoning…
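For context on why "zero-variance prompts" are normally ignored: with group-normalized (GRPO-style) advantages, a prompt whose sampled responses all receive the same reward yields exactly zero advantage and therefore no gradient. The snippet below only illustrates that problem; RL-ZVP's actual remedy is in the paper.

```python
# Illustration of the "zero-variance prompt" problem that RL-ZVP targets (the paper's
# fix is not reproduced here): with group-normalized advantages, a prompt whose sampled
# responses all score the same contributes no learning signal at all.
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's samples."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed outcomes -> nonzero signal
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # all correct  -> all-zero advantages
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # all wrong    -> all-zero advantages
```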

This is a super cool topic! I did a class project with a friend for a course on Kernels during senior year in college 🔗: github.com/rish-16/ma4270… Lots of fun connections between kernels and self-attention, especially when learning periodic functions The attention patterns…



How Kernel Regression is related to Attention Mechanism - a summary in 10 slides. 0/1
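The slides aren't reproduced here, but the usual bridge is that softmax attention is exactly Nadaraya-Watson kernel regression with an exponential kernel on scaled query-key dot products. A small numerical check:

```python
# Softmax attention as Nadaraya-Watson kernel regression with kernel exp(q.k / sqrt(d)):
# both routes produce the same weighted average of the values.
import numpy as np

def nadaraya_watson(q, keys, values, kernel):
    w = np.array([kernel(q, k) for k in keys])
    w = w / w.sum()                    # normalized kernel weights
    return w @ values                  # weighted average of values

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

d = 8
Q, K, V = np.random.randn(1, d), np.random.randn(5, d), np.random.randn(5, d)
exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(d))
# Both give the same prediction for a single query.
print(np.allclose(nadaraya_watson(Q[0], K, V, exp_kernel), softmax_attention(Q, K, V)[0]))
```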

The paper shows that Group Relative Policy Optimization (GRPO) behaves like Direct Preference Optimization (DPO), so training on simple answer pairs works. Turns a complex GRPO setup into a simple pairwise recipe without losing quality. This cuts tokens, compute, and wall time,…
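The pairwise recipe the tweet alludes to is presumably the standard DPO objective on (chosen, rejected) answer pairs; the paper's exact GRPO-to-DPO argument isn't reproduced here, but the loss itself looks like this:

```python
# Sketch of the pairwise recipe: the standard DPO loss on (chosen, rejected) answer
# pairs, using summed per-answer log-probs from the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Each argument is the summed token log-prob of a full answer, shape (batch,)."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```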

🚀 Want high-quality, realistic, and truly challenging post-training data for the agentic era? Introducing Toucan-1.5M (huggingface.co/papers/2510.01…) — the largest open tool-agentic dataset yet: ✨ 1.53M real agentic trajectories synthesized by 3 models ✨ Diverse, challenging tasks…

This is a solid blog breaking down how mixture-of-experts (MoE) language models can actually be served cheaply if their design is matched to hardware limits. MoE models only activate a few experts per token, which saves compute but causes heavy memory and communication use.…
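To make that compute-versus-memory tension concrete, here is a minimal top-2 router (a generic sketch, not the blog's serving stack): only k experts run per token, yet every expert's weights must stay resident, and reachable over the interconnect once experts are sharded.

```python
# Why MoE is cheap on FLOPs but heavy on memory: only top_k experts run per token,
# but all n_experts weight matrices must stay loaded (and be reachable across devices
# when experts are sharded).
import torch, torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d)
        gates = self.router(x).softmax(dim=-1)              # (tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)       # pick k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                      # compute: only k experts/token
            for e, expert in enumerate(self.experts):       # memory: every expert resident
                mask = idx[:, slot] == e
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    out[mask] = out[mask] + w * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(32, 64))
```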




Analysis of the KL estimators k1, k2, k3 and their use as a reward or as a loss. This is partly a continuation of earlier discussions: x.com/QuanquanGu/sta…. By the way, "RLHF" is a bit misleading here; they did RLVR.
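For reference, the three single-sample estimators under discussion (following Schulman's "Approximating KL Divergence" note), for KL(q‖p) with samples x ~ q and ratio r = p(x)/q(x):

\[
k_1 = -\log r, \qquad k_2 = \tfrac{1}{2}(\log r)^2, \qquad k_3 = (r - 1) - \log r .
\]

k1 is unbiased but high-variance, k2 is biased, and k3 is unbiased, lower-variance, and always non-negative, which is why it is the common choice for a KL penalty.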

The original GRPO is an off-policy RL algorithm, but its KL regularization isn't done right. Specifically, the k3 estimator for the unnormalized reverse KL is missing the importance weight. The correct formulation should be:
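The tweet is truncated before the formula, so the following is only a reconstruction of what an importance-weighted k3 estimator would look like: responses are drawn from π_old, but the regularizer targets KL(π_θ‖π_ref), so an unbiased estimate needs the ratio π_θ/π_old in front of the k3 term:

\[
\mathbb{E}_{a \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\mathrm{old}}(a\mid s)}\left(\frac{\pi_{\mathrm{ref}}(a\mid s)}{\pi_\theta(a\mid s)} - 1 - \log\frac{\pi_{\mathrm{ref}}(a\mid s)}{\pi_\theta(a\mid s)}\right)\right] = \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\]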

Meta just ran one of the largest synthetic-data studies (over 1000 LLMs, more than 100k GPU hours). Result: mixing synthetic and natural data only helps once you cross the right scale and ratio (~30%). Small models learn nothing; larger ones improve sharply once past that threshold…

RL fine-tuning often prematurely collapses policy entropy. We consider a general framework, called set RL, i.e. RL over a set of trajectories from a policy. We use it to incentivize diverse solutions & optimize for inference time performance. Paper: arxiv.org/abs/2509.25424
Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks…
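As a toy rendering of the set-level idea (not the paper's actual objective), one could score a whole set of k sampled solutions jointly, e.g. best-of-k correctness plus a diversity bonus, so the policy is rewarded for covering distinct solutions instead of collapsing onto one:

```python
# Toy set-level reward (illustrative only, not the paper's objective): score k sampled
# solutions as a group, combining best-of-k correctness with a crude diversity bonus.
def set_reward(solutions, correct, diversity_weight=0.1):
    """solutions: list of answer strings; correct: per-solution 0/1 rewards."""
    best_of_k = float(max(correct))                    # inference-time (pass@k) objective
    distinct = len(set(solutions)) / len(solutions)    # crude diversity proxy
    return best_of_k + diversity_weight * distinct

print(set_reward(["a+b", "a+b", "a+b", "a+b"], [1, 1, 1, 1]))  # correct but collapsed
print(set_reward(["a+b", "b+a", "2a", "a*b"], [1, 1, 0, 0]))   # correct and diverse
```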

🚨🚨New Paper: Training LLMs to Discover Abstractions for Solving Reasoning Problems Introducing RLAD, a two-player RL framework for LLMs to discover 'reasoning abstractions'—natural language hints that encode procedural knowledge for structured exploration in reasoning.🧵⬇️

A look at the history of Reinforcement Learning. What is Temporal-Difference (TD) learning? @RichardSSutton introduced TD learning in 1988, and today the most widely used RL algorithms, like deep actor-critic, rely on TD error as the learning signal. So, TD learning: ▪️ Allows…
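The TD(0) update the tweet is describing, in a few lines: the learning signal is the TD error, which bootstraps the target from the current estimate of the next state rather than waiting for the full return.

```python
# Classic TD(0) value update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """V is a dict state -> value estimate; returns the TD error used as the signal."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}
td0_update(V, s="s0", r=1.0, s_next="s1")   # V["s0"] moves toward r + gamma * V["s1"]
```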

Shirokuro, a Japanese restaurant in NYC's East Village, where the interior mimics a hand-drawn black-and-white sketchbook,