Scale AI

@scale_AI

making AI work

👀

The real metrics banger is hidden in the system card. Yes, you can overfit on Django and nail SWE-bench Verified. But there's this recent SWE-bench Pro from @scale_AI, and Opus gets 52%. The next best, Sonnet 4.5, is only 43.6%, and a non-Anthropic model, GPT-5, is at 36%. This is HUGE…


Scale AI reposted

Our new research from @Scale_AI reveals that harmful biological knowledge can persist inside bio-foundation models even after filtering. We introduce BioRiskEval, the first comprehensive framework built to assess dual-use risk in these models using a realistic adversarial threat…

Scale AI reposted

1/ New Chain of Thought podcast episode from @Scale_AI, where we dive into SWE-Bench Pro, a benchmark designed to rigorously evaluate LLM coding agents on professional software engineering tasks.


Scale AI reposted

🚀New @scale_AI paper: 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵𝗥𝘂𝗯𝗿𝗶𝗰𝘀, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <𝟲𝟴% 𝗿𝘂𝗯𝗿𝗶𝗰 𝗰𝗼𝗺𝗽𝗹𝗶𝗮𝗻𝗰𝗲. We built 𝟮.𝟱𝗞+ expert rubrics with 𝟮.𝟴𝗞+ hrs of human labor to measure why.

Today, we honor the courage, dedication, and sacrifice of all who have served. Thank you to our veterans. 🇺🇸


Scale 🤝 TIME: Today, @TIME rolled out a Scale-powered, site-wide AI reading and discovery experience. The AI agent is just the latest activation in our ongoing partnership to help enhance access to journalism worldwide. Learn more via @axios: scl.ai/47BPIzm
