Matthew Siegel

@LargerLanguage

AI technical writer and poet. Not simultaneously. Research, warm takes, and cool stuff we're doing at @Scale_AI. For poems: @MatthewSiegel_

Pinned

thoroughly excited to launch this video series where I break down new research from @Scale_AI. this first one is how we knock out 90% of errors in data that require human review using an autorater. check it out 👀


Lots of people are talking about an AI bubble. Sure. I don't think these people are reading the same research I'm reading.


Matthew Siegel reposted

New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: Less hacking, stronger post-training!  arxiv.org/pdf/2509.21500


This was a HUGE lift from our research team, huge thanks to everyone who contributed to this benchmark. ...and the blog doesn't look so bad either: scale.com/blog/swe-bench…

🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.



Working on this leaderboard page and blog was a LIFT! Huge thank you to every cook in the kitchen!! 🧑‍🍳👩‍🍳👨‍🍳

Introducing our Agentic Leaderboards. These new leaderboards test AI agents in real-world, high-complexity environments, setting a new standard for completing end-to-end digital tasks.


