basvanopheusden

@basvanopheusden

Research at OpenAI, previously @imbue_ai and @cocosci_lab at Princeton. All opinions my own

Science & Technology

San Francisco, USA

Joined October 2010

2K posts 2K followers 254 following

basvanopheusden

@basvanopheusden

15h

I love these results that simultaneously show the upside of great models like gpt 5 and Gemini, but also identify their flaws. There's work to be done ;) Which gpt 5 is this btw? Would be interesting to run pro

Deedy

@deedydas

Oct 11

GPT-5 and Gemini 2.5 Pro just achieved gold medal performance in the International Olympiad of Astronomy and Astrophysics (IOAA). AI is now world class at cutting edge physics.

basvanopheusden

@basvanopheusden

16h

Cool work, excited to see which games end up being easier or harder and which models are good at which games. Presumably there is some game out there that Gemini or Claude just absolutely crush?

Saner Cakir

@sanerc110

Oct 10

(9/9) We're releasing a ton of new and harder games on Game Arena every day. Think of 2 page -> 20+ page rulebook length! So stay tuned, and my DMs are open for those interested in partnering and early access!

basvanopheusden

@basvanopheusden

18h

Love Mengu's Workshop!

Andrea Mengucci

@Mengu09

Oct 11

Thank you for the visibility. If you like old Magic cards and want to see them being played in the best in paper with highly edited content make sure to check out Mengu's Workshop on Youtube!!! 😍(No link because I still didn't understand if Twitter punishes these or not).

basvanopheusden

@basvanopheusden

19h

My phone "autocorrects" this every single time, glad to see I'm not the only one :p

Ohqay

@itsohqay

22h

basvanopheusden

@basvanopheusden

Oct 10

Impressive work!

Epoch AI

@EpochAIResearch

Oct 9

We evaluated Gemini 2.5 Deep Think on FrontierMath. There is no API, so we ran it manually. The results: a new record! We also conducted a more holistic evaluation of its math capabilities. 🧵

basvanopheusden reposted

ARC Prize

@arcprize

Oct 9

New ARC-AGI SOTA: GPT-5 Pro

- ARC-AGI-1: 70.2%, $4.78/task
- ARC-AGI-2: 18.3%, $7.41/task

@OpenAI’s GPT-5 Pro now holds the highest verified frontier LLM score on ARC-AGI’s Semi-Private benchmark

basvanopheusden reposted

Mark Cuban

@mcuban

Oct 9

For those of you on Sora, my Cameos are open. Have at it

basvanopheusden reposted

Forecasting Research Institute

@Research_FRI

Oct 8

⬆️ LLMs’ forecasting abilities are steadily improving. GPT-4 (released March 2023) achieved a difficulty-adjusted Brier score of 0.131. Nearly two years later, GPT-4.5 (released Feb 2025) scored 0.101—a substantial improvement. A linear extrapolation of state-of-the-art LLM…

basvanopheusden reposted

Sarah Sachs

@sarahmsachs

Oct 7

@NotionHQ has processed (well over) a trillion tokens on @OpenAI, and now we have this cool token to show for it.

basvanopheusden reposted

Paata Ivanisvili

@PI010101

Oct 5

GPT-5 Pro found a counterexample to the NICD-with-erasures majority optimality (Simons list, p.25). simons.berkeley.edu/sites/default/… At p=0.4, n=5, f(x) = sign(x_1-3x_2+x_3-x_4+3x_5) gives E|f(x)|=0.43024 vs best majority 0.42904.

basvanopheusden reposted

Kevin Weil 🇺🇸

@kevinweil

Oct 3

Terence Tao + AI for solving hard math problems🤯 In this example the insight comes from Terence, and the muscle—via an hourlong conversation with GPT-5 and some python code it wrote—comes from AI. "Here, the AI tool use was a significant time saver... Indeed I would have been…

basvanopheusden

@basvanopheusden

Oct 4

Really cool stuff and cool to see a unique approach work out so well 😁

Harmonic

@HarmonicMath

Oct 3

Our first-ever technical report is here! We’re providing an unprecedented look into the architecture and methodology behind our AI system, Aristotle, and shining light on the “how” behind our IMO gold-level performance. Read the full report below⬇️

basvanopheusden

@basvanopheusden

Oct 4

Important work!

Dr. Datta M.D. (AIIMS Delhi)

@DrDatta_AIIMS

Oct 1

🚨 Just published! All frontier AI models have failed “Radiology’s Last Exam” - the toughest benchmark in radiology launched today! ✅ Board-certified radiologists scored 83%, trainees 45%, but the best performing AI from frontier labs, GPT-5, managed only 30%. ❌ These results…

basvanopheusden

@basvanopheusden

Oct 4

What a champion

Carissa Yip

@carissayipchess

Oct 4

a fan asked for a pic

basvanopheusden reposted

eric provencher

@pvncher

Oct 3

First results are out for the @RepoPrompt benchmark! Repo Bench is a test set designed to push models on instruction following, large context reasoning, and precision file editing. Gearing up to release this shortly in the next update so you can run the bench yourself

basvanopheusden

@basvanopheusden

Oct 3

😭

Ashley

@ashleydzhang

Oct 3

I ❤️ my colleagues

basvanopheusden reposted

Peter Gostev

@petergostev

Oct 2

Gab being poached

basvanopheusden

@basvanopheusden

Oct 3

True 🤩🤩🤣😂🤣😂☺️😀😍😍

Theo - t3.gg

@theo

Oct 2

This is the exact kind of post that an Android user would find funny

basvanopheusden

@basvanopheusden

Oct 3

This is genuinely something I could see myself putting on my wall. Needs a bit more color though

Martin_DeVido

@d33v33d0

Oct 2

Next up is Gemini 2.5 Pro: Very nice organic shape, a phyllotaxis shape. Very fitting! Gemini strikes me as more organic in nature.

basvanopheusden

@basvanopheusden

Oct 3

This is cool and the harmonic keyboard might be even cooler 😎

Imbue

@imbue_ai

Oct 3

It's a battle of the agents! ⚔️ Run 5 agents with different approaches to the same task. Toggle between them with one click, watch your app change instantly, then pick the winner and commit. That's Sculptor's Pairing Mode.