Greg Durrett

@gregd_nlp

Associate professor at NYU (Courant CS + Center for Data Science) | advisor for @bespokelabsai | large language models and NLP | he/him

Joined December 2017

1K Posts, 8K Followers, 897 Following

Pinned

Greg Durrett

@gregd_nlp

Aug 21, 2023

📣 Today we launched an overhauled NLP course to 600 students in the online MS programs at UT Austin. 98 YouTube videos 🎥 + readings 📖 open to all! cs.utexas.edu/~gdurrett/cour… w/5 hours of new 🎥 on LLMs, RLHF, chain-of-thought, etc! Meme trailer 🎬 youtu.be/DcB6ZPReeuU 🧵

Linked video (youtube.com): CS371N/388 NLP "trailer" (fall 2023)

Greg Durrett reposted

Tuhina Tripathi

@tuhina_tripathi

Oct 8

We have been overlooking a key factor in LLM-as-a-judge evaluation: the feedback collection protocol. Our #COLM2025 paper presents a comprehensive study on how feedback protocols shape reliability and bias in LLM evaluations.

Greg Durrett reposted

Ziyu Yao (Hiring Fall'26 PhDs)

@ZiyuYao

Oct 7

📢Schedule is now finalized! Join the brilliant invited talks with us (…reasoning-planning-workshop.github.io):
Greg Durrett @gregd_nlp: LLM Reasoning Beyond Scaling
Huan Sun @hhsun1: How Explanations Can Advance Capability and Safety: World Modeling and Circuit Discovery
Ana Marasović…

XLLM-Reason-Plan

@XllmReasonPlan

Oct 2

⏰ Only 9 days away! Join us at @COLM_conf on October 10 for the first workshop on the application of LLM explainability to reasoning and planning. Featuring: 📑 20 poster presentations 🎤 9 distinguished speakers View our schedule at tinyurl.com/xllm-workshop.

Greg Durrett

@gregd_nlp

Oct 7

Find my students and collaborators at COLM this week!
Tuesday morning: @juand_r_nlp and @RamyaNamuduri's papers (find them if you missed it!)
Wednesday pm: @ManyaWadhwa1's EvalAgent
Thursday am: @AnirudhKhatry's CRUST-Bench oral spotlight + poster

Greg Durrett reposted

Tanya Goyal

@tanyaagoyal

Oct 2

🚨Modeling Abstention via Selective Help-seeking
LLMs learn to use search tools to answer questions they would otherwise hallucinate on. But can this also teach them what they know vs not?
@momergul_ introduces MASH that trains LLMs for search and gets abstentions for free!…

Greg Durrett reposted

Ari Holtzman

@universeinanegg

Sep 29

For those who missed it, we just released a little LLM-backed game called HR Simulator™. You play an intern ghostwriting emails for your boss. It’s like you’re stuck in corporate email hell…and you’re the devil 😈 Link and an initial answer to “WHY WOULD YOU DO THIS?” below.

Greg Durrett

@gregd_nlp

Sep 25

Check out this feature about AstroVisBench, our upcoming NeurIPS D&B paper about code workflows and visualization in the astronomy domain! Great testbed for the interaction of code + VLM reasoning models.

CosmicAI

@CosmicAI_Inst

Sep 25

Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy! A new benchmark developed by researchers at CosmicAI is testing how well LLMs implement scientific workflows in astronomy and visualize results.

Greg Durrett reposted

Chantal

@ChantalShaib

Sep 24

"AI slop" seems to be everywhere, but what exactly makes text feel like slop? In our new work (w/ @TuhinChakr, @dgolano, @byron_c_wallace) we provide a systematic attempt at measuring AI slop in text! arxiv.org/abs/2509.19163 🧵 (1/7)

Aidan McLaughlin

@aidan_mclau

Jan 30

help me fix gpt-4o slop
reply with examples of slop behavior
just a single sentence
nothing crazy
what annoys you
what makes you wanna frisbee your laptop into a river
i'll respond to every comment
rt so we can maximize slop feedback
help me de-sloptimize our models
go

Greg Durrett reposted

Liyan Tang

@LiyanTang4

Sep 19

Our paper "ChartMuseum 🖼️" is now accepted to the #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉chart understanding questions, especially those that involve visual reasoning 👀!

Liyan Tang

@LiyanTang4

May 20

Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻Entirely human-written questions by 13 CS researchers
👀Emphasis on visual reasoning – hard to be verbalized via text CoTs
📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B

Greg Durrett reposted

Jessy Li

@jessyjli

Sep 18

To appear #NeurIPS2025: Can AI aid scientists amidst their own workflows, when they do not know step-by-step workflows and may not know, in advance, the kinds of scientific utility a visualization would bring? The @CosmicAI_Inst presents ✨AstroVisBench:

Sebastian Joseph

@sebajoed

Jun 2

How good are LLMs at 🔭 scientific computing and visualization 🔭? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵

Greg Durrett reposted

Li S. Yifei

@realliyifei

Sep 4

How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using…

Greg Durrett reposted

Mahesh Sathiamoorthy ✈️ COLM2025

@madiator

Aug 29

We are looking to hire an intern who can work on our RL environment curation and data curation stack. Prior experience with post-training is a must, as well as great software engineering skills. DM me or email your resume to hiring at bespokelabs dot ai. In addition, we are…

Greg Durrett reposted

Megan Richards

@megan_richards_

Aug 29

I'm shocked at how poorly this is advertised, so here's a PSA: NSF has a GRFP-like program specifically for computing disciplines called CISE. The program provides the same 3 years of PhD funding PLUS a year-long mentorship program for the application cycle.

Greg Durrett reposted

Yuhan Liu

@YuhanLiu_nlp

Aug 28

👀Have you asked an LLM to provide a more detailed answer after inspecting its initial output? Users often provide such implicit feedback during interaction. ✨We study implicit user feedback found in LMSYS and WildChat. #EMNLP2025

Greg Durrett reposted

Amy Pavel

@amypavel

Aug 25

📣I've joined @BerkeleyEECS as an Assistant Professor! My lab will join me soon to continue our research in accessibility, HCI, and supporting communication! I'm so excited to make new connections at @UCBerkeley and in the Bay Area more broadly, so please reach out to chat!

Greg Durrett reposted

Dan Jurafsky

@jurafsky

Aug 24

Now that school is starting for lots of folks, it's time for a new release of Speech and Language Processing! Jim and I added all sorts of material for the August 2025 release! With slides to match! Check it out here: web.stanford.edu/~jurafsky/slp3/

Greg Durrett

@gregd_nlp

Aug 24

Come work with us at Bespoke! It's a fantastic team and a great opportunity to work on the cutting edge of data curation for reasoning!

Alex Dimakis

@AlexGDimakis

Aug 24

We are hiring in Bespoke Labs for a new role: Member of Technical Staff: AI Data and RL Environments. Work on data curation strategies with the team that created OpenThoughts. Invent novel data recipes, strategies of curating datasets, environments, tasks and verifiers. (My…

Greg Durrett reposted

Tomer Wolfson

@TomerWolfson

Aug 18

Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: allenai.org/blog/monaco 🧵(1/5)

Greg Durrett reposted

Shangbin Feng

@shangbinfeng

Aug 19

Two caveats with self-alignment: ⚠️ A single model struggles to reliably judge its own generation. ⚠️ A single model struggles to reliably generate diverse responses to learn from. 👉 Introducing Sparta Alignment, where multiple LMs collectively align through ⚔️ combat.

Greg Durrett reposted

Mosh Levy

@mosh_levy

Aug 15

Producing reasoning texts boosts the capabilities of AI models, but do we humans correctly understand these texts? Our latest research suggests that we do not. This highlights a new angle on the "Are they transparent?" debate: they might be, but we misinterpret them. 🧵