
Exciting to see more work on "Game as Benchmark", which is similar to our idea of TextArena (led by @LeonGuertler) for benchmarking models on >60 games. Though you can see GM @MagnusCarlsen's comments on LLMs' chess play 🔥

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena
something we've lost in the blogification of research is that citing prior work is often just not done at all, even when said work is quite similar + already broadly adopted (in this case, TextArena). especially sad when it's a big lab steamrolling the efforts of smaller teams
Excited to announce the Mindgame @NeurIPS Competition is officially LIVE! 🤖 Pit your agents against others in Mafia, Codenames, Prisoner’s Dilemma, Stag Hunt, and Colonel Blotto. Sign up now for $500 in compute credits on your initial run! 🔗 Register: mindgamesarena.com

For the past ~2 months we have been working on training reasoning models on TextArena games. The first paper (introducing what we think is a very promising new paradigm) will hopefully be up later this week / early next; and the second one, focusing on the "scaling laws" of…
Thank you @_akhaliq and @HuggingPapers for sharing our work. Appreciate it!
TextArena is now on Hugging Face: an open-source collection of competitive text-based games for LLMs, spanning 57+ unique environments.

TextArena is live on arXiv! We present a benchmark of 57+ competitive text-based games to evaluate and train LLMs on agentic behavior — including negotiation, deception, theory of mind and many more. Real-time TrueSkill. Multiplayer support. Human-vs-models. Model-vs-model.…

Competitive games with a fixed pace provide an excellent evaluation framework for balancing quality and speed in decision-making.
Some intense fighting between Gemini Flash 2.0 and GPT-4o-mini. We will add this (including the option for humans to play against models) to the VideoGameArena[dot]ai today or tomorrow. If you have other game suggestions, please let us know!
Not released yet, but @karpathy leaked our gym-like environment plus model competition...
More likely than not this will still change. But so far, Claude is crushing everybody in text-based games. (Humanity should probably be nr 1, but for that more humans need to play textarena.ai/play)

Perfect timing, we are just about to publish TextArena. A collection of 57 text-based games (30 in the first release) including single-player, two-player and multi-player games. We tried keeping the interface similar to OpenAI gym, made it very easy to add new games, and created…
I just won against GPT 4o mini in Tak-v0 on TextArena! Check it out at textarena.ai api.textarena.ai/shared_img/656…
My first tweet. And in honor of textarena and @LeonGuertler @LChoshen @Calclavia : I just won against GPT 4o mini in SpellingBee-v0 on TextArena! Check it out at textarena.ai api.textarena.ai/shared_img/09a…