
Greg Durrett

@gregd_nlp

Associate professor at NYU (Courant CS + Center for Data Science) | advisor for @bespokelabsai | large language models and NLP | he/him

Pinned Tweet

📣 Today we launched an overhauled NLP course to 600 students in the online MS programs at UT Austin. 98 YouTube videos 🎥 + readings 📖 open to all! cs.utexas.edu/~gdurrett/cour… w/5 hours of new 🎥 on LLMs, RLHF, chain-of-thought, etc! Meme trailer 🎬 youtu.be/DcB6ZPReeuU 🧵

[YouTube card: CS371N/388 NLP "trailer" (fall 2023)]


Greg Durrett reposted

We have been overlooking a key factor in LLM-as-a-judge evaluation: the feedback collection protocol. Our #COLM2025 paper presents a comprehensive study on how feedback protocols shape reliability and bias in LLM evaluations.

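For concreteness, here is a minimal sketch (not the paper's code) of two common feedback collection protocols in this space: absolute rating of a single response versus pairwise preference between two responses. `call_judge` is a stand-in for any chat-model API; the prompts and the dummy judge are illustrative assumptions.

```python
# Minimal sketch of two feedback collection protocols for LLM-as-a-judge.
# `call_judge` is a stand-in for any chat-model API; prompts are illustrative.
from typing import Callable

def absolute_protocol(call_judge: Callable[[str], str],
                      question: str, answer: str) -> str:
    # Protocol 1: rate a single answer on a fixed scale.
    prompt = (f"Question: {question}\nAnswer: {answer}\n"
              "Rate the answer from 1 (poor) to 5 (excellent). "
              "Reply with a single digit.")
    return call_judge(prompt)

def pairwise_protocol(call_judge: Callable[[str], str],
                      question: str, answer_a: str, answer_b: str) -> str:
    # Protocol 2: choose between two candidates. Position bias can be
    # probed by re-asking with A and B swapped.
    prompt = (f"Question: {question}\n(A) {answer_a}\n(B) {answer_b}\n"
              "Which answer is better? Reply with 'A' or 'B'.")
    return call_judge(prompt)

if __name__ == "__main__":
    dummy_judge = lambda p: "A" if "(A)" in p else "4"  # placeholder model
    print(absolute_protocol(dummy_judge, "2+2?", "4"))       # -> 4
    print(pairwise_protocol(dummy_judge, "2+2?", "4", "5"))  # -> A
```

Comparing verdicts across swapped orderings is one cheap way to surface the position bias that protocol choice can introduce.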

Greg Durrett reposted

📢 Schedule is now finalized! Join the brilliant invited talks with us (…reasoning-planning-workshop.github.io):
Greg Durrett @gregd_nlp: LLM Reasoning Beyond Scaling
Huan Sun @hhsun1: How Explanations Can Advance Capability and Safety: World Modeling and Circuit Discovery
Ana Marasović…

⏰ Only 9 days away! Join us at @COLM_conf on October 10 for the first workshop on the application of LLM explainability to reasoning and planning.
Featuring:
📑 20 poster presentations
🎤 9 distinguished speakers
View our schedule at tinyurl.com/xllm-workshop.



Find my students and collaborators at COLM this week!
Tuesday morning: @juand_r_nlp's and @RamyaNamuduri's papers (find them if you missed them!)
Wednesday pm: @ManyaWadhwa1's EvalAgent
Thursday am: @AnirudhKhatry's CRUST-Bench oral spotlight + poster


Greg Durrett reposted

🚨Modeling Abstention via Selective Help-seeking

LLMs learn to use search tools to answer questions they would otherwise hallucinate on. But can this also teach them what they know vs not?

@momergul_ introduces MASH that trains LLMs for search and gets abstentions for free!…

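As a toy illustration (an assumption-laden sketch, not the MASH recipe): abstention can fall out of a selective help-seeking policy that answers directly when confident, searches when unsure, and abstains when retrieval still doesn't support an answer. All function names here are hypothetical stand-ins.

```python
# Toy sketch of abstention via selective help-seeking (not the MASH code).
from typing import Callable, Optional

def answer_or_abstain(question: str,
                      confidence: Callable[[str], float],
                      search: Callable[[str], str],
                      answer_with: Callable[[str, Optional[str]], Optional[str]],
                      threshold: float = 0.8) -> str:
    if confidence(question) >= threshold:
        ans = answer_with(question, None)        # parametric knowledge suffices
    else:
        evidence = search(question)              # selective help-seeking
        ans = answer_with(question, evidence)    # answer grounded in evidence
    return ans if ans is not None else "I don't know."  # abstention "for free"

if __name__ == "__main__":
    conf = lambda q: 0.3                         # pretend the model is unsure
    retriever = lambda q: "retrieved passage"    # stand-in search tool
    answerer = lambda q, ev: "Paris" if ev else None
    print(answer_or_abstain("Capital of France?", conf, retriever, answerer))
```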

Greg Durrett reposted

For those who missed it, we just released a little LLM-backed game called HR Simulator™

You play an intern ghostwriting emails for your boss. It’s like you’re stuck in corporate email hell…and you’re the devil 😈

link and an initial answer to “WHY WOULD YOU DO THIS?” below


Check out this feature about AstroVisBench, our upcoming NeurIPS D&B paper about code workflows and visualization in the astronomy domain! Great testbed for the interaction of code + VLM reasoning models.

Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy! A new benchmark developed by researchers at CosmicAI is testing how well LLMs implement scientific workflows in astronomy and visualize results.





Greg Durrett reposted

"AI slop" seems to be everywhere, but what exactly makes text feel like slop? In our new work (w/ @TuhinChakr, @dgolano, @byron_c_wallace) we provide a systematic attempt at measuring AI slop in text! arxiv.org/abs/2509.19163 🧵 (1/7)


help me fix gpt-4o slop. reply with examples of slop behavior, just a single sentence, nothing crazy. what annoys you, what makes you wanna frisbee your laptop into a river. i'll respond to every comment. rt so we can maximize slop feedback. help me de-sloptimize our models. go
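As a toy illustration of the kind of surface signal slop measurement might start from (illustrative heuristics only, not the paper's actual metrics): repeated n-grams and stock phrases are easy to count. The phrase list below is made up.

```python
# Illustrative heuristics only (not the paper's metrics): count repeated
# trigrams and occurrences of stock phrases as crude slop signals.
from collections import Counter

STOCK_PHRASES = ["delve into", "in today's fast-paced world",
                 "it's important to note"]  # made-up example list

def repeated_trigram_rate(text: str) -> float:
    words = text.lower().split()
    trigrams = list(zip(words, words[1:], words[2:]))
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(trigrams)

def stock_phrase_hits(text: str) -> int:
    t = text.lower()
    return sum(t.count(p) for p in STOCK_PHRASES)

if __name__ == "__main__":
    sample = ("It's important to note that we delve into the data. "
              "We delve into the data.")
    print(repeated_trigram_rate(sample), stock_phrase_hits(sample))
```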



Greg Durrett reposted

Our paper "ChartMuseum 🖼️" is now accepted to #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉chart understanding questions , especially on those that involve visual reasoning 👀!


Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻 Entirely human-written questions by 13 CS researchers
👀 Emphasis on visual reasoning – hard to be verbalized via text CoTs
📉 Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B



Greg Durrett reposted

To appear #NeurIPS2025: Can AI aid scientists amidst their own workflows, when they do not know step-by-step workflows and may not know, in advance, the kinds of scientific utility a visualization would bring? The @CosmicAI_Inst presents ✨AstroVisBench:

How good are LLMs at 🔭 scientific computing and visualization 🔭? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵



Greg Durrett reposted

How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using…


Greg Durrett reposted

We are looking to hire an intern who can work on our RL environment curation and data curation stack. Prior experience with post-training is a must, as well as great software engineering skills. DM me or email your resume to hiring at bespokelabs dot ai. In addition, we are…


Greg Durrett reposted

I'm shocked at how poorly this is advertised, so here's a PSA: NSF has a GRFP-like program specifically for computing disciplines called CISE. The program provides the same 3 years of PhD funding PLUS a year-long mentorship program for the application cycle.


Greg Durrett reposted

👀Have you asked an LLM to provide a more detailed answer after inspecting its initial output? Users often provide such implicit feedback during interaction. ✨We study implicit user feedback found in LMSYS and WildChat. #EMNLP2025


Greg Durrett reposted

📣I've joined @BerkeleyEECS as an Assistant Professor! My lab will join me soon to continue our research in accessibility, HCI, and supporting communication! I'm so excited to make new connections at @UCBerkeley and in the Bay Area more broadly, so please reach out to chat!


Greg Durrett reposted

Now that school is starting for lots of folks, it's time for a new release of Speech and Language Processing! Jim and I added all sorts of material for the August 2025 release! With slides to match! Check it out here: web.stanford.edu/~jurafsky/slp3/


Come work with us at Bespoke! It's a fantastic team and a great opportunity to work on the cutting edge of data curation for reasoning!

We are hiring in Bespoke Labs for a new role: Member of Technical Staff: AI Data and RL Environments. Work on data curation strategies with the team that created OpenThoughts. Invent novel data recipes, strategies of curating datasets, environments, tasks and verifiers. (My…



Greg Durrett reposted

Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: allenai.org/blog/monaco 🧵(1/5)


Greg Durrett reposted

Two caveats with self-alignment: ⚠️ A single model struggles to reliably judge its own generation. ⚠️ A single model struggles to reliably generate diverse responses to learn from. 👉 Introducing Sparta Alignment, where multiple LMs collectively align through ⚔️ combat.

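A toy sketch of the general idea (not the Sparta Alignment algorithm): several models answer the same prompt, peers vote on pairwise "combat" outcomes, and Elo-style ratings track whose outputs to prefer as learning targets. `generate` and `vote` are hypothetical stand-ins for model calls.

```python
# Toy sketch of multi-model "combat" alignment with Elo-style ratings
# (not the Sparta Alignment code). `generate` and `vote` are stand-ins.
import itertools, random

def elo_update(r_win: float, r_lose: float, k: float = 16.0):
    # Standard Elo: winner gains, loser drops, by the same amount.
    expected = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400))
    return r_win + k * (1 - expected), r_lose - k * (1 - expected)

def run_round(ratings: dict, prompts, generate, vote):
    # ratings: model name -> Elo score. For each prompt, every pair of
    # models duels; the remaining models judge, and ratings are updated.
    for prompt in prompts:
        for a, b in itertools.combinations(list(ratings), 2):
            ans_a, ans_b = generate(a, prompt), generate(b, prompt)
            judges = [m for m in ratings if m not in (a, b)]
            votes_a = sum(vote(j, prompt, ans_a, ans_b) == "a" for j in judges)
            winner, loser = (a, b) if 2 * votes_a > len(judges) else (b, a)
            ratings[winner], ratings[loser] = elo_update(ratings[winner],
                                                         ratings[loser])
    return ratings

if __name__ == "__main__":
    gen = lambda name, p: f"{name}'s answer to {p}"       # stand-in model
    judge = lambda j, p, a, b: random.choice(["a", "b"])  # stand-in judge
    print(run_round({"m1": 1200, "m2": 1200, "m3": 1200}, ["Q1"], gen, judge))
```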

Greg Durrett reposted

Producing reasoning texts boosts the capabilities of AI models, but do we humans correctly understand these texts? Our latest research suggests that we do not. This highlights a new angle on the "Are they transparent?" debate: they might be, but we misinterpret them. 🧵

