Bobby

@bobbycxy

Researching AI @ASTARsg

Bobby reposted

Exciting to see more work on "Game as Benchmark", which is similar to our idea of TextArena (led by @LeonGuertler) for benchmarking models on >60 games. though you can see GM @MagnusCarlsen's comments on LLMs chess play 🔥

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena



Bobby reposted

something we've lost in the blogification of research is that citing prior work is often just not done at all, even when said work is quite similar + already broadly adopted (in this case, TextArena). especially sad when it's a big lab steamrolling the efforts of smaller teams

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena



Bobby reposted

Excited to announce the Mindgame @NeurIPS Competition is officially LIVE! 🤖 Pit your agents against others in Mafia, Codenames, Prisoner’s Dilemma, Stag Hunt, and Colonel Blotto. Sign up now for $500 in compute credits on your initial run! 🔗 Register: mindgamesarena.com

Bobby reposted

For the past ~2 months we have been working on training reasoning models on TextArena games. The first paper (introducing what we think is a very promising new paradigm) will hopefully be up later this week / early next; and the second one, focusing on the "scaling laws" of…


Thank you @_akhaliq and @HuggingPapers for sharing our work. Appreciate it!

TextArena is now on Hugging Face An open-source collection of competitive text-based games for LLMs, spanning 57+ unique environments.

Bobby reposted

TextArena is live on arXiv! We present a benchmark of 57+ competitive text-based games to evaluate and train LLMs on agentic behavior — including negotiation, deception, theory of mind and many more. Real-time TrueSkill. Multiplayer support. Human-vs-models. Model-vs-model.…

Bobby reposted

Competitive games with a fixed pace provide an excellent evaluation framework for balancing quality and speed in decision-making.

Some intense fighting between Gemini Flash 2.0 and GPT-4o-mini. We will add this (including the option for humans to play against models) to the VideoGameArena[dot]ai today or tomorrow. If you have other game suggestions, please let us know!



Bobby reposted

"Mom get the camera"

Bobby reposted

Not released yet, but @karpathy leaked our gym-like environment plus model competition...

More likely than not this will still change. But so far, Claude is crushing everybody in text-based games. (Humanity should probably be No. 1, but for that more humans need to play textarena.ai/play)

Bobby reposted

Perfect timing, we are just about to publish TextArena. A collection of 57 text-based games (30 in the first release) including single-player, two-player and multi-player games. We tried keeping the interface similar to OpenAI gym, made it very easy to add new games, and created…


I just won against GPT 4o mini in Tak-v0 on TextArena! Check it out at textarena.ai api.textarena.ai/shared_img/656…


My first tweet. And in honor of textarena and @LeonGuertler @LChoshen @Calclavia : I just won against GPT 4o mini in SpellingBee-v0 on TextArena! Check it out at textarena.ai api.textarena.ai/shared_img/09a…

