
basvanopheusden

@basvanopheusden

Research at OpenAI, previously at @imbue_ai and @cocosci_lab at Princeton. All opinions my own.

I love these results that simultaneously show the upside of great models like GPT-5 and Gemini, but also identify their flaws. There's work to be done ;) Which GPT-5 is this, btw? It would be interesting to run Pro.

GPT-5 and Gemini 2.5 Pro just achieved gold medal performance in the International Olympiad on Astronomy and Astrophysics (IOAA). AI is now world class at cutting edge physics.



Cool work, excited to see which games end up being easier or harder and which models are good at which games. Presumably there is some game out there that Gemini or Claude just absolutely crush?

(9/9) We're releasing a ton of new and harder games on Game Arena every day. Think of 2 page -> 20+ page rulebook length! So stay tuned, and my DMs are open for those interested in partnering and early access!



Love Mengu's Workshop!

Thank you for the visibility. If you like old Magic cards and want to see them played in paper at their best, with highly edited content, make sure to check out Mengu's Workshop on YouTube! 😍 (No link because I still haven't figured out whether Twitter punishes these or not.)



My phone "autocorrects" this every single time, glad to see I'm not the only one :p


Impressive work!

We evaluated Gemini 2.5 Deep Think on FrontierMath. There is no API, so we ran it manually. The results: a new record! We also conducted a more holistic evaluation of its math capabilities. 🧵



basvanopheusden reposted

New ARC-AGI SOTA: GPT-5 Pro

 - ARC-AGI-1: 70.2%, $4.78/task
 - ARC-AGI-2: 18.3%, $7.41/task

@OpenAI’s GPT-5 Pro now holds the highest verified frontier LLM score on ARC-AGI’s Semi-Private benchmark


basvanopheusden reposted

For those of you on Sora, my Cameos are open. Have at it


basvanopheusden reposted

⬆️ LLMs’ forecasting abilities are steadily improving. GPT-4 (released March 2023) achieved a difficulty-adjusted Brier score of 0.131. Nearly two years later, GPT-4.5 (released Feb 2025) scored 0.101—a substantial improvement. A linear extrapolation of state-of-the-art LLM…
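For readers unfamiliar with the metric, here is a minimal sketch of the plain (unadjusted) Brier score for binary questions. The difficulty adjustment mentioned in the post is not described there and is not reproduced here; the function name `brier_score` and the sample forecasts are illustrative assumptions only.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and binary outcomes.

    forecasts: probabilities in [0, 1] that each event happens.
    outcomes:  1 if the event happened, 0 otherwise.
    Lower is better; always answering 0.5 scores 0.25.
    NOTE: this is the plain Brier score, not the difficulty-adjusted
    variant quoted in the post (that adjustment is not specified there).
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts, purely for illustration.
uninformed = brier_score([0.5, 0.5], [1, 0])
perfect = brier_score([1.0, 0.0], [1, 0])
```

On this scale a perfect forecaster scores 0 and a forecaster who always says 50% scores 0.25, which gives some context for the 0.131 and 0.101 figures.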


basvanopheusden reposted

@NotionHQ has processed (well over) a trillion tokens on @OpenAI, and now we have this cool token to show for it.


basvanopheusden reposted

GPT-5 Pro found a counterexample to the NICD-with-erasures majority optimality (Simons list, p.25). simons.berkeley.edu/sites/default/… At p=0.4, n=5, f(x) = sign(x_1-3x_2+x_3-x_4+3x_5) gives E|f(x)|=0.43024 vs best majority 0.42904.
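The two numbers in this post can be checked by brute force. The sketch below assumes one particular reading of the shorthand: x is uniform on {-1, 1}^5, each coordinate is revealed independently with probability p = 0.4 (and erased otherwise), and "E|f(x)|" means the expected absolute value of the conditional mean of f given the revealed coordinates. That reading, and the helper names `erasure_score`, `majority5`, and `weighted`, are assumptions, not taken from the post or the Simons list.

```python
from itertools import product


def erasure_score(f, n, p):
    """E_z |E[f(x) | z]| for uniform x in {-1, 1}^n, where each coordinate
    is revealed independently with probability p (assumed reading)."""
    points = list(product((-1, 1), repeat=n))
    total = 0.0
    for mask in product((False, True), repeat=n):  # which coordinates are revealed
        k = sum(mask)
        weight = p ** k * (1 - p) ** (n - k)
        # Sum f over all completions of each revealed pattern.
        sums = {}
        for x in points:
            key = tuple(xi for xi, shown in zip(x, mask) if shown)
            sums[key] = sums.get(key, 0) + f(x)
        # Each pattern has probability 2^-k and 2^(n-k) completions, so the
        # average absolute conditional mean is sum(|s|) / 2^n.
        total += weight * sum(abs(s) for s in sums.values()) / len(points)
    return total


def sign(v):
    return 1 if v > 0 else -1


def majority5(x):
    return sign(sum(x))


def weighted(x):
    # The function quoted in the post.
    return sign(x[0] - 3 * x[1] + x[2] - x[3] + 3 * x[4])


majority_value = erasure_score(majority5, 5, 0.4)
weighted_value = erasure_score(weighted, 5, 0.4)
```

Under this reading the two calls come out at roughly 0.42904 and 0.43024, matching the figures quoted in the post.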


basvanopheusden reposted

Terence Tao + AI for solving hard math problems🤯 In this example the insight comes from Terence, and the muscle—via an hourlong conversation with GPT-5 and some python code it wrote—comes from AI. "Here, the AI tool use was a significant time saver... Indeed I would have been…


Really cool stuff and cool to see a unique approach work out so well 😁

Our first-ever technical report is here! We’re providing an unprecedented look into the architecture and methodology behind our AI system, Aristotle, and shining light on the “how” behind our IMO gold-level performance. Read the full report below⬇️



Important work!

🚨 Just published! All frontier AI models have failed “Radiology’s Last Exam” - the toughest benchmark in radiology launched today! ✅ Board-certified radiologists scored 83%, trainees 45%, but the best performing AI from frontier labs, GPT-5, managed only 30%. ❌ These results…



What a champion

a fan asked for a pic



basvanopheusden reposted

First results are out for the @RepoPrompt benchmark! Repo Bench is a test set designed to push models on instruction following, large context reasoning, and precision file editing. Gearing up to release this shortly in the next update so you can run the bench yourself


basvanopheusden reposted

Gab being poached


True 🤩🤩🤣😂🤣😂☺️😀😍😍

This is the exact kind of post that an Android user would find funny



This is genuinely something I could see myself putting on my wall. Needs a bit more color though

Next up is Gemini 2.5 Pro: a very nice organic shape, a phyllotaxis pattern. Very fitting! Gemini strikes me as more organic in nature.



This is cool and the harmonic keyboard might be even cooler 😎

It's a battle of the agents! ⚔️ Run 5 agents with different approaches to the same task. Toggle between them with one click, watch your app change instantly, then pick the winner and commit. That's Sculptor's Pairing Mode.


