
Yifan Zhang

@yifan_zhang_

PhD student at @Princeton University, focusing on LLMs: language modeling and pretraining, LLM reasoning and RL. Open to 2026 internships. Previously @Tsinghua_IIIS.

Pinned

Excellent to see the new Tinker docs from @thinkymachines, which confirm an inconsistency in the GRPO loss. We explored this issue in our prior work (complex-reasoning.github.io/RPG) and developed a more robust method with substantial performance improvements: • +12 absolute points…

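For readers unfamiliar with the objective being discussed: below is a minimal sketch of the two GRPO ingredients the thread refers to, group-normalized advantages and the k3-style KL penalty. The function names are illustrative only; this is neither the RPG method nor Tinker's implementation.

import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize rewards within one prompt's sample group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def kl_k3(logp_new, logp_ref):
    """The 'k3' per-token KL estimate often added as a penalty to the GRPO loss:
    k3 = exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1."""
    log_ratio = logp_ref - logp_new
    return np.exp(log_ratio) - log_ratio - 1.0

# Toy usage: one group of 4 sampled completions for the same prompt.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
print(kl_k3(np.array([-1.2, -0.7]), np.array([-1.0, -0.9])))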

Yifan Zhang reposted

Mamba-3 is forthcoming! Better performance than Transformers and Fast Weight Programmers (FWP) (people.idsia.ch/~juergen/fast-…)


Yifan Zhang reposted

The Mamba-3 paper is now available at openreview.net/pdf?id=HwCvaJO…!


Yifan Zhang reposted

Mamba-3 just silently dropped on ICLR 🤯

A faster, longer-context, and more scalable LLM architecture than Transformers.

A few years ago, some researchers started rethinking sequence modeling from a different angle. Instead of stacking more attention layers, they went back to an…

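The thread above describes Mamba-family models as going back to recurrent state-space modeling instead of stacking attention. As background, here is a minimal sketch of a generic (non-selective) linear state-space recurrence; the function name and parameterization are illustrative, not Mamba-3's actual design.

import numpy as np

def linear_ssm(x, A, B, C):
    """Generic linear state-space recurrence:
    h_t = A @ h_{t-1} + B @ x_t,   y_t = C @ h_t.
    Runs in O(T) time with a fixed-size recurrent state, unlike quadratic attention."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# Toy usage: 8 steps of a 2-dim input with a 4-dim hidden state and 3-dim output.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))
A = 0.9 * np.eye(4)
B = rng.normal(size=(4, 2))
C = rng.normal(size=(3, 4))
print(linear_ssm(x, A, B, C).shape)  # (8, 3)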

Yifan Zhang reposted

For those who are not familiar with RWKV-7 and Fast Weight Programmers (Delta Networks), RWKV-7 can solve a problem that is NC1-complete under AC0 reductions. Note that this theorem does not hold for Fast Weight Programmers (Delta Networks). arxiv.org/abs/2503.14456


Yifan Zhang reposted

Yes, and I cited FWP in the RWKV-7 introduction (Nov 2024): x.com/BlinkDL_AI/sta…

RWKV-7 "Goose" 🪿 is a meta-in-context learner, test-time-training its state on the context via in-context gradient descent at every token. It is like a world model ever adapting to external environment: github.com/BlinkDL/RWKV-L…🙂#RWKV

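The "in-context gradient descent at every token" framing above is essentially the fast-weight / delta-rule view: the recurrent state is a matrix nudged toward each new key-value pair by one gradient step. Below is a minimal sketch of that generic delta rule; RWKV-7 uses a generalized variant, so this only illustrates the underlying idea, and the helper names are made up.

import numpy as np

def delta_rule_step(S, k, v, beta):
    """One delta-rule fast-weight update:
    S <- S - beta * (S @ k - v) k^T,
    i.e. a single gradient step on 0.5 * ||S @ k - v||^2 with respect to the state S."""
    return S - beta * np.outer(S @ k - v, k)

def fast_weight_read(S, q):
    """Read the fast-weight memory with a query vector."""
    return S @ q

# Toy usage: store two key-value pairs with orthonormal keys, then recall the first.
d = 4
S = np.zeros((d, d))
k1, v1 = np.eye(d)[0], np.ones(d)
k2, v2 = np.eye(d)[1], -np.ones(d)
for k, v in [(k1, v1), (k2, v2)]:
    S = delta_rule_step(S, k, v, beta=1.0)
print(fast_weight_read(S, k1))  # recovers v1 exactly for orthonormal keys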


Yifan Zhang reposted

[1/N] How can we make attention more powerful, not just more efficient? How do different attention mechanisms handle associative memory, and can we design a better one from first principles? 🤔

Our new work explores these questions by introducing Local Linear Attention (LLA).…

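On the associative-memory framing in the LLA thread: plain causal linear attention can be read as a running outer-product memory that is written with key-value pairs and read with a query. A minimal sketch of that standard recurrence is below (feature maps and normalization omitted); it is not the paper's Local Linear Attention.

import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention read as an associative memory:
    S_t = S_{t-1} + v_t k_t^T,   y_t = S_t q_t."""
    S = np.zeros((V.shape[1], K.shape[1]))
    ys = []
    for q, k, v in zip(Q, K, V):
        S = S + np.outer(v, k)   # write the current key-value pair
        ys.append(S @ q)         # read with the current query
    return np.stack(ys)

# Toy usage.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
print(linear_attention(Q, K, V).shape)  # (6, 3)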

Yifan Zhang reposted

RWKV-8 ROSA 🌹 mechanism: neurosymbolic infinite-range lossless information propagator beyond attention, enabling LLMs to invent their own inner monologue languages. First step towards scalable post-neural methods, for a new era in AI 🌌


The new mechanism in RWKV-8 "Heron" 🪶 is named ROSA (an acronym; note SA ≠ Self-Attention here) 🌹 ROSA is compromise-free: we get efficient, scalable, genuinely infinite context by applying some beautiful algorithms.



Yifan Zhang reposted

My thesis, "A theory of the computational power and limitations of language modeling architectures", is now online:


Yifan Zhang reposted

The most transformative AI paper of the decade. No competition.


Yifan Zhang reposted

Sharing a new paper with Peter Bartlett, @jasondeanlee, @ShamKakade6, and Bin Yu. People talk about implicit regularization, but how good is it? We show it is surprisingly effective: GD dominates ridge for all linear regression, with more cool stuff on GD vs. SGD. arxiv.org/abs/2509.17251

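A toy numerical illustration of the comparison in the tweet above, not the paper's analysis: compare test error along the gradient-descent path on unregularized least squares against ridge regression over a grid of regularization strengths. The setup and variable names are made up for this sketch.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100                                   # toy overparameterized linear regression
w_star = rng.normal(size=d) / np.sqrt(d)
X, Xte = rng.normal(size=(n, d)), rng.normal(size=(1000, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
yte = Xte @ w_star

def test_err(w):
    return float(np.mean((Xte @ w - yte) ** 2))

# Test error along the gradient-descent path on the unregularized least-squares loss.
w = np.zeros(d)
eta = n / np.linalg.norm(X, 2) ** 2              # step size 1/L for loss (1/2n)||Xw - y||^2
gd_errs = []
for _ in range(2000):
    w = w - eta * X.T @ (X @ w - y) / n
    gd_errs.append(test_err(w))

# Test error of ridge solutions over a grid of regularization strengths.
ridge_errs = [test_err(np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n))
              for lam in np.logspace(-4, 2, 50)]

print(f"best GD iterate:     {min(gd_errs):.4f}")
print(f"best ridge solution: {min(ridge_errs):.4f}")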

Yifan Zhang reposted

My sincere thanks to @thinkymachines and @rm_rafailov for their swift and constructive feedback on addressing the KL inconsistency within the GRPO loss function. This type of engagement truly exemplifies the collaborative ethos of open-source research! Docs:…


Yifan Zhang reposted

Tinker advances our mission of enabling more people to do research on cutting-edge models and customize them to their needs. thinkingmachines.ai/blog/announcin…


Yifan Zhang reposted

Check Thinky's Tinker codebase.

GRPO is out. REINFORCE with Adv = Reward − mean(Reward) is in.

NO CLIPPING:
model ← model + η · advantage · ∇ logprob
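Spelled out as a minimal sketch, the update in the tweet above is centered-reward REINFORCE with no clipping. The toy softmax policy below is hypothetical and not Tinker's actual code.

import numpy as np

def reinforce_step(theta, actions, rewards, eta=0.1):
    """One centered-reward REINFORCE update, no clipping:
    theta <- theta + eta * advantage * grad log pi(action),
    where theta parameterizes a softmax policy over a small discrete action set."""
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    adv = np.asarray(rewards, dtype=np.float64) - np.mean(rewards)
    grad = np.zeros_like(theta)
    for a, adv_i in zip(actions, adv):
        g = -probs.copy()          # grad of log softmax: one_hot(a) - probs
        g[a] += 1.0
        grad += adv_i * g
    return theta + eta * grad / len(actions)

# Toy usage: 3 actions, a batch of 4 sampled actions with their rewards.
theta = np.zeros(3)
theta = reinforce_step(theta, actions=[0, 2, 0, 1], rewards=[1.0, 0.0, 1.0, 0.0])
print(theta)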

