shashank_bits's profile picture. Researcher at Ai2 || Work on NLP, LLMs, Reasoning, Agents, AI4Code || Prev: Microsoft AI, Univ. of Illinois (UIUC), Max Planck, IIT-Bombay || @shashanknlp 🟦sky

Shashank Gupta

@shashank_bits

Researcher at Ai2 || Work on NLP, LLMs, Reasoning, Agents, AI4Code || Prev: Microsoft AI, Univ. of Illinois (UIUC), Max Planck, IIT-Bombay || @shashanknlp 🟦sky

Shashank Gupta รีโพสต์แล้ว

🚨 DeepMind finally dropped the Veo3 paper which shows what we all realize from playing with video-gen models. Just like LLMs, visual reasoning on is an emergent property of training on tons of video. It can solve tasks not explicitly in training data. "Veo 3 is the GPT-3…

deedydas's tweet image. 🚨 DeepMind finally dropped the Veo3 paper which shows what we all realize from playing with video-gen models.

Just like LLMs, visual reasoning on is an emergent property of training on tons of video. It can solve tasks not explicitly in training data.

"Veo 3 is the GPT-3…

Shashank Gupta รีโพสต์แล้ว

I spent the past month reimplementing DeepMind’s Genie 3 world model from scratch Ended up making TinyWorlds, a 3M parameter world model capable of generating playable game environments demo below + everything I learned in thread (full repo at the end)👇🏼


Shashank Gupta รีโพสต์แล้ว

As part of Asta, our initiative to accelerate science with trustworthy AI agents, we built AstaBench—the first comprehensive benchmark to compare them. ⚖️

allen_ai's tweet image. As part of Asta, our initiative to accelerate science with trustworthy AI agents, we built AstaBench—the first comprehensive benchmark to compare them. ⚖️

Shashank Gupta รีโพสต์แล้ว

Introducing Asta—our bold initiative to accelerate science with trustworthy, capable agents, benchmarks, & developer resources that bring clarity to the landscape of scientific AI + agents. 🧵


Shashank Gupta รีโพสต์แล้ว

Test-time scaling w/ GRPO boosts accuracy, but also adds “filler tokens” increasing length w/o real progress. We present Group Filtered Policy Optimization (GFPO):🧵 1️⃣ Sample more per prompt 2️⃣ Rank by token efficiency (reward ÷ length) 3️⃣ Train on top-k 4️⃣ 🚀 Cut 80% of…

VaishShrivas's tweet image. Test-time scaling w/ GRPO boosts accuracy, but also adds “filler tokens” increasing length w/o real progress.
We present Group Filtered Policy Optimization (GFPO):🧵
 1️⃣ Sample more per prompt
 2️⃣ Rank by token efficiency (reward ÷ length)
 3️⃣ Train on top-k
 4️⃣ 🚀 Cut 80% of…

Shashank Gupta รีโพสต์แล้ว

With fresh support of $75M from @NSF and $77M from @NVIDIA, we’re set to scale our open model ecosystem, bolster the infrastructure behind it, and fast‑track reproducible AI research to unlock the next wave of scientific discovery. 💡

allen_ai's tweet image. With fresh support of $75M from @NSF and $77M from @NVIDIA, we’re set to scale our open model ecosystem, bolster the infrastructure behind it, and fast‑track reproducible AI research to unlock the next wave of scientific discovery. 💡

Shashank Gupta รีโพสต์แล้ว

Thinking Less at test-time requires Sampling More at training-time! GFPO is a new, cool, and simple Policy Opt algorithm is coming to your RL Gym tonite, led by @VaishShrivas and our MSR group: Group Filtered PO (GFPO) trades off training-time with test-time compute, in order…

DimitrisPapail's tweet image. Thinking Less at test-time requires Sampling More at training-time!

GFPO is a new, cool, and simple Policy Opt algorithm is coming to your RL Gym tonite, led by @VaishShrivas and our MSR group:

Group Filtered PO (GFPO) trades off training-time with test-time compute, in order…
DimitrisPapail's tweet image. Thinking Less at test-time requires Sampling More at training-time!

GFPO is a new, cool, and simple Policy Opt algorithm is coming to your RL Gym tonite, led by @VaishShrivas and our MSR group:

Group Filtered PO (GFPO) trades off training-time with test-time compute, in order…
DimitrisPapail's tweet image. Thinking Less at test-time requires Sampling More at training-time!

GFPO is a new, cool, and simple Policy Opt algorithm is coming to your RL Gym tonite, led by @VaishShrivas and our MSR group:

Group Filtered PO (GFPO) trades off training-time with test-time compute, in order…
DimitrisPapail's tweet image. Thinking Less at test-time requires Sampling More at training-time!

GFPO is a new, cool, and simple Policy Opt algorithm is coming to your RL Gym tonite, led by @VaishShrivas and our MSR group:

Group Filtered PO (GFPO) trades off training-time with test-time compute, in order…

Shashank Gupta รีโพสต์แล้ว

🚨 We're hiring a #ResearchScientist in #AI for Scientific Discovery at Ai2! Are you passionate about intelligent agents, data-driven discovery, and AI systems that accelerate science? Join us in shaping the future of research. 🧬🧠 Apply now: job-boards.greenhouse.io/thealleninstit…

allen_ai's tweet image. 🚨 We're hiring a #ResearchScientist in #AI for Scientific Discovery at Ai2!

Are you passionate about intelligent agents, data-driven discovery, and AI systems that accelerate science? Join us in shaping the future of research. 🧬🧠

Apply now: job-boards.greenhouse.io/thealleninstit…

Shashank Gupta รีโพสต์แล้ว

Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵

allen_ai's tweet image. Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵

Shashank Gupta รีโพสต์แล้ว

Today we’re releasing a prototype of Genesys, an autonomous multi-agent LLM discovery system that aims to discover new types of language model architectures. We found Genesys can discover novel architectures competitive with the industry-standard transformer. 🧵

allen_ai's tweet image. Today we’re releasing a prototype of Genesys, an autonomous multi-agent LLM discovery system that aims to discover new types of language model architectures. We found Genesys can discover novel architectures competitive with the industry-standard transformer. 🧵

Shashank Gupta รีโพสต์แล้ว

✨New edition of our community-building workshop series!✨ Tomorrow at @CVPR, we invite speakers to share their stories, values, and approaches for navigating a crowded and evolving field, especially for early-career researchers. Cheeky title🤭: How to Stand Out in the…

In this #CVPR2025 edition of our community-building workshop series, we focus on supporting the growth of early-career researchers. Join us tomorrow (Jun 11) at 12:45 PM in Room 209 Schedule: sites.google.com/view/standoutc… We have an exciting lineup of invited talks and candid…

anand_bhattad's tweet image. In this #CVPR2025 edition of our community-building workshop series, we focus on supporting the growth of early-career researchers.

Join us tomorrow (Jun 11) at 12:45 PM in Room 209

Schedule: sites.google.com/view/standoutc…

We have an exciting lineup of invited talks and candid…


Shashank Gupta รีโพสต์แล้ว

Excited to announce AlphaEvolve A powerful AI coding agent developed by our team in @GoogleDeepMind that is able to discover impactful new algorithms for important problems in Maths and Computing by combining the creativity of large language models with automated evaluators.


Shashank Gupta รีโพสต์แล้ว

We've just released HealthBench — a new eval for AI systems for health. Developed with 262 physicians who have practiced in 60 countries.

Evaluations are essential to understanding how models perform in health settings. HealthBench is a new evaluation benchmark, developed with input from 250+ physicians from around the world, now available in our GitHub repository. openai.com/index/healthbe…



Shashank Gupta รีโพสต์แล้ว

Scientific discovery with LLMs has so much potential yet is underexplored. Our new benchmark **LLM-SRBench** enable rigorous evaluations of equation discovery with LLMs! 🧠Key takeaway: Even SOTA discovery models with strong LLM backbones still fail to discover mathematical…

ParshinShojaee's tweet image. Scientific discovery with LLMs has so much potential yet is underexplored. Our new benchmark **LLM-SRBench** enable rigorous evaluations of equation discovery with LLMs!  

🧠Key takeaway: Even SOTA discovery models with strong LLM backbones still fail to discover mathematical…

Shashank Gupta รีโพสต์แล้ว

Excited to release R2E-Gym - 🔥 8.1K executable environments using synthetic data - 🧠 Hybrid verifiers for enhanced inference-time scaling - 📈 51% success-rate on the SWE-Bench Verified - 🤗 Open Source Data + Models + Trajectories 1/

StringChaos's tweet image. Excited to release R2E-Gym
  - 🔥 8.1K executable environments using synthetic data
  - 🧠 Hybrid verifiers for enhanced inference-time scaling
  - 📈 51% success-rate on the SWE-Bench Verified
  - 🤗 Open Source Data + Models + Trajectories

1/

Shashank Gupta รีโพสต์แล้ว

Announcing OLMo 2 32B: the first fully open model to beat GPT 3.5 & GPT-4o mini on a suite of popular, multi-skill benchmarks. Comparable to best open-weight models, but a fraction of training compute. When you have a good recipe, ✨ magical things happen when you scale it up!

allen_ai's tweet image. Announcing OLMo 2 32B: the first fully open model to beat GPT 3.5 & GPT-4o mini on a suite of popular, multi-skill benchmarks.

Comparable to best open-weight models, but a fraction of training compute. When you have a good recipe, ✨ magical things happen when you scale it up!

Shashank Gupta รีโพสต์แล้ว

Here is Tülu 3 405B 🐫 our open-source post-training model that surpasses the performance of DeepSeek-V3! The last member of the Tülu 3 family demonstrates that our recipe, which includes Reinforcement Learning from Verifiable Rewards (RVLR) scales to 405B - with performance on…

allen_ai's tweet image. Here is Tülu 3 405B 🐫 our open-source post-training model that surpasses the performance of DeepSeek-V3! The last member of the Tülu 3 family demonstrates that our recipe, which includes Reinforcement Learning from Verifiable Rewards (RVLR) scales to 405B - with performance on…

Shashank Gupta รีโพสต์แล้ว

Excited to share a sneak peek into what we've been building at Yutori! What you see below is our trained model and internal prototype — multiple agents running in parallel in the background, completing tasks of varying complexity, relevant information and cues to step in being…


Shashank Gupta รีโพสต์แล้ว

Interested in knowing more about LLMs agents and in contributing to this topic?🚀 📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻 We have an exciting lineup of speakers 🗓️ Submit your work by *March 1st*

nouhadziri's tweet image. Interested in knowing more about LLMs agents and in contributing to this topic?🚀

📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻
We have an exciting lineup of speakers 

🗓️ Submit your work by *March 1st*

Shashank Gupta รีโพสต์แล้ว

Can AI really help with literature reviews? 🧐 Meet Ai2 ScholarQA, an experimental solution that allows you to ask questions that require multiple scientific papers to answer. It gives more in-depth, detailed, and contextual answers with table comparisons, expandable sections…

allen_ai's tweet image. Can AI really help with literature reviews? 🧐

Meet Ai2 ScholarQA, an experimental solution that allows you to ask questions that require multiple scientific papers to answer. It gives more in-depth, detailed, and contextual answers with table comparisons, expandable sections…

Loading...

Something went wrong.


Something went wrong.