#imobench search results

Garrett Bingham

Nov 4

We just released #IMOBench, a suite of advanced math reasoning benchmarks that were instrumental in our IMO Gold effort. It tests models for their abilities to write long-form rigorous proofs, solve complex problems, and accurately grade candidate solutions.

Thang Luong

Nov 4

Continuing our IMO-gold journey, I’m delighted to share our #EMNLP2025 paper “Towards Robust Mathematical Reasoning”, which tells some of the key stories behind the success of our advanced Gemini #DeepThink at this year's IMO. Finding the right north-star metrics was highly

Thang Luong

Nov 18

Cool figure of #Gemini3 with #DeepThink on ARC-AGI-2. The big jump looks quite like the results on #IMOBench that we obtained previously with Gemini 2.5 x.com/lmthang/status…

Thang Luong

Nov 4

IMO-Bench consists of three benchmarks that judge models on diverse capabilities: IMO-AnswerBench, a large-scale test on getting the right answer; IMO-ProofBench, a next-level evaluation for proof writing; and IMO-GradingBench, a new benchmark to enable further progress in

Imo Benchie

@ImoBenchiezk2
