You might like
Recently, Japanese House of Councillors member 石平 spent 29 minutes on his personal YouTube channel lambasting Chinese President Xi Jinping! Including, but not limited to, gems like "elementary schooler," "dumbass," "stupid pig," and "not even a decent human being"! Firepower this fierce is genuinely rare among sitting Japanese councillors! The video link is in the first comment below 👇👇👇
Laughed so hard I snorted. Recently, Japanese House of Councillors member 石平 went all-out on 习包子 (Xi Jinping) on his personal YouTube channel for a full 29 minutes! Including, but not limited to, gems like "elementary schooler," "dumbass," and "stupid pig"! Well said. Posting it so the little pinks can have their meltdown 🤪
Samsung's Tiny Recursive Model (TRM) masters complex reasoning. With just 7M parameters, TRM outperforms large LLMs on hard puzzles like Sudoku & ARC-AGI. This "Less is More" approach redefines efficiency in AI, using less than 0.01% of competitors' parameters!
Transfusion combines autoregressive with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 the first non-autoregressive model to generate text and images concurrently using a single transformer—unifying Edit Flow (text) with Flow…
Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for…
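The post is cut off before the method is described, so the sketch below is only a generic illustration of parameter-space exploration in the spirit of evolution strategies; the function names, population size, noise scale, and learning rate are assumptions, not the paper's.

import numpy as np

def es_step(theta, reward_fn, sigma=0.02, lr=0.05, pop=32, rng=np.random.default_rng(0)):
    # Explore in parameter space: perturb theta (not the sampled actions),
    # score each perturbed policy, and move along reward-weighted noise directions.
    eps = rng.standard_normal((pop, theta.size))
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    centered = rewards - rewards.mean()                      # baseline-subtracted returns
    grad = (centered[:, None] * eps).mean(axis=0) / sigma    # ES gradient estimate
    return theta + lr * grad

# Toy usage: climb toward a fixed target vector.
target = np.ones(8)
theta = np.zeros(8)
for _ in range(300):
    theta = es_step(theta, lambda th: -np.sum((th - target) ** 2))
print(np.round(theta, 2))   # close to the target, with no action-space sampling at all

The contrast with PPO/GRPO-style methods is that the randomness lives in θ rather than in sampled tokens, and return differences across the perturbed copies estimate the gradient.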
🤔Can we train RL on LLMs with extremely stale data? 🚀Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and…
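M2PO itself isn't specified in the post; the staleness problem it targets shows up in any importance-weighted surrogate over data from an old behavior policy. A minimal PPO-style version is sketched below; the names and the clipping choice are illustrative, not M2PO's actual objective.

import torch

def stale_data_surrogate(logp_new, logp_stale, advantages, clip=0.2):
    # Importance weights correct for the gap between the current policy and the
    # stale behavior policy; clipping keeps the update stable when that gap grows.
    ratio = torch.exp(logp_new - logp_stale)             # pi_theta / pi_behavior per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()         # negative -> minimize

# Toy usage with made-up log-probs and advantages:
logp_new   = torch.tensor([-1.2, -0.7, -2.1])
logp_stale = torch.tensor([-1.0, -1.5, -2.0])
adv        = torch.tensor([ 0.5,  1.0, -0.3])
print(stale_data_surrogate(logp_new, logp_stale, adv))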
Introducing Pretraining with Hierarchical Memories: Separating Knowledge & Reasoning for On-Device LLM Deployment 💡We propose dividing LLM parameters into 1) anchor (always used, capturing commonsense) and 2) memory bank (selected per query, capturing world knowledge). [1/X]🧵
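A minimal sketch of the anchor-plus-memory-bank split as I read the post; the retrieval rule, sizes, and the way fetched memories are folded back in are assumptions, not the paper's design.

import torch
import torch.nn as nn

class HierarchicalMemoryLM(nn.Module):
    # Illustrative split: a small always-on "anchor" path plus a large memory bank
    # from which only a few slots are fetched per query.
    def __init__(self, d=256, n_mem=1024, k=4):
        super().__init__()
        self.anchor = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.mem_keys = nn.Parameter(torch.randn(n_mem, d))   # index into the bank
        self.mem_vals = nn.Parameter(torch.randn(n_mem, d))   # stored "world knowledge"
        self.k = k

    def forward(self, x):                              # x: (batch, seq, d)
        h = self.anchor(x)                             # always-on commonsense/reasoning path
        q = h.mean(dim=1)                              # one retrieval query per sequence
        scores = q @ self.mem_keys.T                   # (batch, n_mem)
        top = scores.topk(self.k, dim=-1)              # select k memory slots for this query
        fetched = self.mem_vals[top.indices]           # (batch, k, d)
        return h + fetched.mean(dim=1, keepdim=True)   # fold retrieved knowledge back in

model = HierarchicalMemoryLM()
print(model(torch.randn(2, 16, 256)).shape)            # torch.Size([2, 16, 256])

Only the anchor and the k fetched slots need to be resident per query, which is the on-device angle the thread is pointing at.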
My brain broke when I read this paper. A tiny 7 Million parameter model just beat DeepSeek-R1, Gemini 2.5 pro, and o3-mini at reasoning on both ARC-AGI 1 and ARC-AGI 2. It's called Tiny Recursive Model (TRM) from Samsung. How can a model 10,000x smaller be smarter? Here's how…
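The thread is truncated, so here is only a rough sketch of the recursive-refinement idea as publicly described for TRM: one tiny network reused many times, alternating between refining a latent scratchpad and revising the current answer. Layer types, sizes, and step counts below are assumptions, not Samsung's architecture.

import torch
import torch.nn as nn

class TinyRecursiveReasoner(nn.Module):
    # An inner loop refines a latent scratchpad z from the question and current
    # answer; an outer loop revises the answer from z. Depth comes from reuse,
    # not from parameter count.
    def __init__(self, d=128):
        super().__init__()
        self.refine_z = nn.GRUCell(2 * d, d)    # reads (question, current answer)
        self.revise_y = nn.GRUCell(d, d)        # reads the scratchpad

    def forward(self, x, steps=8, inner=4):     # x: (batch, d) encoded puzzle
        y = torch.zeros_like(x)                 # initial answer guess
        z = torch.zeros_like(x)                 # latent reasoning state
        for _ in range(steps):                  # outer improvement loop
            for _ in range(inner):              # "think": refine the scratchpad
                z = self.refine_z(torch.cat([x, y], dim=-1), z)
            y = self.revise_y(z, y)             # "act": revise the answer
        return y

model = TinyRecursiveReasoner()
print(model(torch.randn(2, 128)).shape)         # torch.Size([2, 128])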
Absolutely classic @GoogleResearch paper on In-Context-Learning by LLMs. Shows the mechanisms of how LLMs learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change. 💡The mechanism they reveal for…
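Not the paper's derivation, but a simplified way to see how frozen weights can still produce context-dependent behavior: in linear self-attention, the prompt enters as a context-dependent effective weight matrix, even though the stored parameters never change.

import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
C = rng.standard_normal((5, d))     # the in-context examples
x = rng.standard_normal(d)          # the query token

# Linear attention of x over the context C, using only frozen weights:
out = (x @ Wq) @ (C @ Wk).T @ (C @ Wv)

# The same computation, rewritten as x times a context-dependent effective weight:
W_eff = Wq @ Wk.T @ C.T @ C @ Wv
assert np.allclose(out, x @ W_eff)
# Wq, Wk, Wv never change; the prompt changes W_eff, which is one simplified
# picture of "learning in context" without any weight update.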
No Prompt Left Behind: A New Era for LLM Reinforcement Learning This paper introduces RL-ZVP, a novel algorithm that unlocks learning signals from previously ignored "zero-variance prompts" in LLM training. It achieves significant accuracy improvements on math reasoning…
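The post doesn't detail RL-ZVP itself; the problem it targets is easy to see in GRPO's group-normalized advantage, which is identically zero whenever every sampled answer to a prompt receives the same reward.

import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-normalized advantages as used in GRPO-style training.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1, 0, 1, 0]))   # mixed outcomes -> informative +/- advantages
print(grpo_advantages([1, 1, 1, 1]))   # zero-variance prompt -> all zeros, no gradient

RL-ZVP's contribution, per the post, is to recover a learning signal from that second case instead of silently dropping those prompts.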
This is a super cool topic! I did a class project with a friend for a course on Kernels during senior year in college 🔗: github.com/rish-16/ma4270… Lots of fun connections between kernels and self-attention, especially when learning periodic functions The attention patterns…
How Kernel Regression is related to Attention Mechanism - a summary in 10 slides. 0/1
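The correspondence in the slides can be stated compactly: Nadaraya-Watson kernel regression is softmax attention with training inputs as keys and training targets as values. A small 1-D sketch with a Gaussian kernel, where the bandwidth plays the role of the softmax temperature:

import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    # Kernel weights = softmax over "keys" (training inputs);
    # prediction = attention-weighted sum of "values" (training targets).
    scores = -((x_query[:, None] - x_train[None, :]) ** 2) / (2 * bandwidth ** 2)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ y_train

x_train = np.linspace(0, 2 * np.pi, 50)
y_train = np.sin(x_train)
x_query = np.array([0.5, 1.5, 3.0])
print(nadaraya_watson(x_query, x_train, y_train, bandwidth=0.3))   # ~ sin(x_query)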
The paper shows that Group Relative Policy Optimization (GRPO) behaves like Direct Preference Optimization (DPO), so training on simple answer pairs works. Turns a complex GRPO setup into a simple pairwise recipe without losing quality. This cuts tokens, compute, and wall time,…
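I haven't reproduced the paper's derivation, but the basic reduction is easy to check: with groups of two and binary rewards, GRPO's group-normalized advantages collapse to a fixed ±1 on the (better, worse) answer pair, which is already a pairwise preference signal.

import numpy as np

r = np.array([1.0, 0.0])                   # one preferred, one rejected answer
adv = (r - r.mean()) / (r.std() + 1e-8)    # GRPO group normalization
print(adv)                                 # ~[ 1., -1.]: push up the winner, push down the loser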
🚀 Want high-quality, realistic, and truly challenging post-training data for the agentic era? Introducing Toucan-1.5M (huggingface.co/papers/2510.01…) — the largest open tool-agentic dataset yet: ✨ 1.53M real agentic trajectories synthesized by 3 models ✨ Diverse, challenging tasks…
This is a solid blog breaking down how mixture-of-experts (MoE) language models can actually be served cheaply if their design is matched to hardware limits. MoE models only activate a few experts per token, which saves compute but causes heavy memory and communication use.…
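The compute/memory mismatch can be made concrete with back-of-envelope numbers; the model shape below is purely illustrative and not taken from the blog.

# Illustrative MoE sizing: FLOPs follow the active experts, memory follows all of them.
n_layers  = 32
d_model   = 4096
d_ff      = 14336
n_experts = 64          # experts per MoE layer
top_k     = 2           # experts activated per token

params_per_expert    = 2 * d_model * d_ff                  # up- and down-projection
total_expert_params  = n_layers * n_experts * params_per_expert
active_expert_params = n_layers * top_k * params_per_expert

print(f"expert params in memory : {total_expert_params / 1e9:.1f}B")
print(f"expert params per token : {active_expert_params / 1e9:.1f}B")
# Memory (and cross-device communication for routed tokens) scales with the first
# number, compute with the second -- the serving-cost mismatch the blog is about.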
An analysis of the KL estimators k1, k2, and k3 and of their use as a reward versus as a loss. This partially continues earlier discussions at x.com/QuanquanGu/sta…. By the way, "RLHF" is a bit misleading here; what they did is RLVR.
The original GRPO is an off-policy RL algorithm, but its KL regularization isn't done right. Specifically, the k3 estimator for the unnormalized reverse KL is missing the importance weight. The correct formulation should be:
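The formulation itself looks to be in an attached image rather than in the text; a plausible reconstruction of the point, assuming responses are sampled from the old policy $\pi_{\theta_{\text{old}}}$, is

$$
\mathbb{E}_{x \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}\left(\frac{\pi_{\text{ref}}(x)}{\pi_{\theta}(x)} - \log\frac{\pi_{\text{ref}}(x)}{\pi_{\theta}(x)} - 1\right)\right] \;\approx\; \mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\text{ref}}\right),
$$

i.e. the k3 term $\frac{\pi_{\text{ref}}}{\pi_{\theta}} - \log\frac{\pi_{\text{ref}}}{\pi_{\theta}} - 1$ needs the importance weight $\frac{\pi_{\theta}}{\pi_{\theta_{\text{old}}}}$ once samples come from $\pi_{\theta_{\text{old}}}$ rather than from $\pi_{\theta}$ itself.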
Meta just ran one of the largest synthetic-data studies (over 1,000 LLMs, more than 100k GPU hours). Result: mixing synthetic and natural data only helps once you cross the right scale and mixing ratio (~30%). Small models learn nothing; larger ones suddenly show a sharp threshold…
RL fine-tuning often prematurely collapses policy entropy. We consider a general framework, called set RL, i.e. RL over a set of trajectories from a policy. We use it to incentivize diverse solutions & optimize for inference time performance. Paper: arxiv.org/abs/2509.25424
Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks…
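The thread doesn't state the set-level objective; one minimal instantiation of "RL over a set of trajectories" is to score a sampled set by its best member (a pass@k-style objective) rather than scoring trajectories independently, which pays the policy for keeping distinct solution modes alive. The max-over-set choice and the names below are assumptions, not the paper's formulation.

import numpy as np

def per_trajectory_signal(rewards):
    # Standard RL: each sampled trajectory is scored on its own, then averaged.
    return float(np.mean(rewards))

def set_signal(rewards, k=4, rng=np.random.default_rng(0)):
    # Set-level scoring (illustrative): a random set of k trajectories is scored by
    # its best member, so the policy is rewarded for covering diverse solutions.
    rewards = np.asarray(rewards, dtype=float)
    idx = rng.choice(len(rewards), size=k, replace=False)
    return float(rewards[idx].max())

samples = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]   # a few rare correct solutions
print(per_trajectory_signal(samples))   # 0.25: averaging dilutes the rare successes
print(set_signal(samples, k=4))         # 1.0 whenever the sampled set contains a success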
🚨🚨New Paper: Training LLMs to Discover Abstractions for Solving Reasoning Problems Introducing RLAD, a two-player RL framework for LLMs to discover 'reasoning abstractions'—natural language hints that encode procedural knowledge for structured exploration in reasoning.🧵⬇️
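The thread is cut off; below is only a structural sketch of the two-player setup as described, with every interface name being illustrative rather than the paper's code: an abstraction proposer is rewarded when its hint raises the solver's success rate.

def rlad_step(problem, proposer, solver, judge):
    # Structural sketch; 'generate', 'reinforce', and 'judge' are assumed interfaces.
    hint = proposer.generate(problem)                  # natural-language reasoning abstraction
    answer_with_hint = solver.generate(problem, hint=hint)
    answer_plain     = solver.generate(problem, hint=None)
    r_hint  = judge(problem, answer_with_hint)         # 1.0 if correct else 0.0
    r_plain = judge(problem, answer_plain)
    solver.reinforce(reward=r_hint)                    # solver learns to exploit the hint
    proposer.reinforce(reward=r_hint - r_plain)        # proposer is paid for hints that help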