Karthik Narasimhan

@karthik_r_n

Professor @PrincetonCS, Research @SierraPlatform. Previously @OpenAI, @MIT_CSAIL, @iitmadras

Princeton, NJ

karthiknarasimhan.com

Joined July 2015

283 Posts, 4K Followers, 457 Following

Karthik Narasimhan reposted

Sijia Liu

@letti_liu

Oct 1

Is online alignment the only path to go despite being slow and computationally expensive? Inspired by prospect theory, we provide a human-centric explanation for why online alignment (e.g. GRPO) outperforms offline alignment (e.g. DPO, KTO) and empirically show how to close the…

Karthik Narasimhan reposted

Ben Shi

@BenShi34

Jun 10

As we optimize model reasoning over verifiable objectives, how does this affect human understanding of said reasoning to achieve superior collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:

Karthik Narasimhan reposted

Clay Bavor

@claybavor

Jun 11

Today we announced a set of major advances to our agent benchmark, 𝜏-bench. This new benchmark, 𝜏², introduces the notion of "dual control", where AI agents are challenged not just to reason and act, but to coordinate, guide, and assist a user in achieving a shared objective.…

Karthik Narasimhan reposted

Sierra

@SierraPlatform

Jun 10

Learn more: sierra.ai/blog/benchmark…

Link card: Benchmarking agents in collaborative real-world scenarios. 𝜏²-bench. Source: sierra.ai

Karthik Narasimhan reposted

Sierra

@SierraPlatform

Jun 10

Last year, we introduced 𝜏-bench, a benchmark for evaluating AI agents on realistic, multi-step tasks involving tool use and domain-specific constraints. It surfaced a critical limitation in LLM-based agents: low repeatability, even under identical conditions. Now, we’re…

Karthik Narasimhan reposted

Alex Zhang

@a1zhang

May 28

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

Karthik Narasimhan reposted

Sierra

@SierraPlatform

May 21

Successful agents are the result of collaboration between teams: engineering, operations, customer experience, and marketing. Yet every platform available today except Sierra forces businesses to optimize for one group over another. Our Agent OS enables both no code and…

Karthik Narasimhan reposted

Clay Bavor

@claybavor

May 21

Like all great products, the best agents are the product of many teams working together — some technical, some non-technical. Sierra’s Agent OS uniquely supports both no code and programmatic agent development, enabling customer experience and engineering teams alike to build…

Karthik Narasimhan

@karthik_r_n

May 21

Humans evolved to communicate so we could coordinate better. But these days, it feels like we communicate so much, yet coordinate so little.

Karthik Narasimhan reposted

Shunyu Yao

@ShunyuYao12

Apr 24

I’m at ICLR to present a poster and give a talk, both related to the second half blogpost. See you there if you wanna chat about it :)

Shunyu Yao

@ShunyuYao12

Apr 14

I finally wrote another blogpost: ysymyth.github.io/The-Second-Hal… AI just keeps getting better over time, but NOW is a special moment that i call “the halftime”. Before it, training > eval. After it, eval > training. The reason: RL finally works. Lmk ur feedback so I’ll polish it.

Karthik Narasimhan

@karthik_r_n

Mar 21

Interesting tidbits on using dedicated "thinking" steps in agents from @AnthropicAI. Also loved seeing full pass^k curves for τ-bench - measuring this was the primary motivation of the benchmark, not just avg scores!

Anthropic

@AnthropicAI

Mar 21

We’re launching a new blog: Engineering at Anthropic. A hub where developers can find practical advice and our latest discoveries on how to get the most from Claude.

Karthik Narasimhan reposted

Sierra

@SierraPlatform

Mar 18

In the AI age, agent reliability is key, and Sierra’s 𝜏-bench is setting the standard—shaping academic research, industry applications and next-generation development. Read more: sierra.ai/blog/tau-bench….

Karthik Narasimhan

@karthik_r_n

Feb 28

The best thing about SWE-agents and tools like cursor is the amount of additional agency they provide us

Karthik Narasimhan reposted

Kilian Lieret

@KLieret

Feb 13

SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.

Karthik Narasimhan

@karthik_r_n

Jan 12

The biggest mistake we can make right now is not dreaming big enough, especially w.r.t. AI

Karthik Narasimhan reposted

Common Sense Machines

@CSM_ai

Nov 25

Today we're releasing Common Sense Agents, a new backbone for agentic creative computing:
💻 Windows VMs for safe and repeatable workflows
🔧 Long workflows broken down into reusable tasks
🦾 Support for off the shelf agents like Claude
⌛️ Data recording + finetuning infra

Karthik Narasimhan reposted

Sierra

@SierraPlatform

Oct 9, 2024

Today we're excited to announce a new way to interact with Sierra agents: voice. Learn more about how this new capability is transforming customer interactions in our latest blog post: sierra.ai/blog/sierra-sp…

Karthik Narasimhan reposted

John Yang

@jyangballin

Oct 7, 2024

We're launching SWE-bench Multimodal to eval agents' ability to solve visual GitHub issues.
- 617 *brand new* tasks from 17 JavaScript repos
- Each task has an image!
Existing agents struggle here! We present SWE-agent Multimodal to remedy some issues
Led w/ @_carlosejimenez 🧵

Karthik Narasimhan reposted

Sierra

@SierraPlatform

Oct 4, 2024

Sierra partnered with @Casper to launch Luna 2.0, their AI agent delivering 24/7 personalized customer support. From helping with mattress purchases to driving lifelong loyalty, Luna 2.0 is transforming the shopping experience!💤✨️ Learn more: sierra.ai/customers/casp…

Karthik Narasimhan

@karthik_r_n

Oct 3, 2024

In a year or two from now, 'fine-tuning' will become synonymous with 'training' (as used in the good old ML days). LLMs will be seen more widely as starting points, just like weight initialization or choosing the number of layers for a Transformer. Pick a starting point, curate…