
Greg Durrett
@gregd_nlp
Associate professor at NYU (Courant CS + Center for Data Science) | advisor for @bespokelabsai | large language models and NLP | he/him
Tal vez te guste
📣 Today we launched an overhauled NLP course to 600 students in the online MS programs at UT Austin. 98 YouTube videos 🎥 + readings 📖 open to all! cs.utexas.edu/~gdurrett/cour… w/5 hours of new 🎥 on LLMs, RLHF, chain-of-thought, etc! Meme trailer 🎬 youtu.be/DcB6ZPReeuU 🧵
youtube.com
YouTube
CS371N/388 NLP "trailer" (fall 2023)
We have been overlooking a key factor in LLM-as-a-judge evaluation: the feedback collection protocol. Our #COLM2025 paper presents a comprehensive study on how feedback protocols shape reliability and bias in LLM evaluations.

📢Schedule is now finalized! Join the brilliant invited talks with us (…reasoning-planning-workshop.github.io): Greg Durrett @gregd_nlp : LLM Reasoning Beyond Scaling Huan Sun @hhsun1 : How Explanations Can Advance Capability and Safety: World Modeling and Circuit Discovery Ana Marasović…
⏰ Only 9 days away! Join us at @COLM_conf on October 10 for the first workshop on the application of LLM explainability to reasoning and planning. Featuring: 📑 20 poster presentations 🎤 9 distinguished speakers View our schedule at tinyurl.com/xllm-workshop.

Find my students and collaborators at COLM this week! Tuesday morning: @juand_r_nlp and @RamyaNamuduri 's papers (find them if you missed it!) Wednesday pm: @ManyaWadhwa1 's EvalAgent Thursday am: @AnirudhKhatry 's CRUST-Bench oral spotlight + poster

🚨Modeling Abstention via Selective Help-seeking LLMs learn to use search tools to answer questions they would otherwise hallucinate on. But can this also teach them what they know vs not? @momergul_ introduces MASH that trains LLMs for search and gets abstentions for free!…

For those who missed it, we just releaaed a little LLM-backed game called HR Simulator™ You play an intern ghostwriting emails for your boss. It’s like you’re stuck in corporate email hell…and you’re the devil 😈 link and an initial answer to “WHY WOULD YOU DO THIS?” below

Check out this feature about AstroVisBench, our upcoming NeurIPS D&B paper about code workflows and visualization in the astronomy domain! Great testbed for the interaction of code + VLM reasoning models.
Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy! A new benchmark developed by researchers CosmicAI is testing how well LLMs implement scientific workflows in astronomy and visualize results.
Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy! A new benchmark developed by researchers CosmicAI is testing how well LLMs implement scientific workflows in astronomy and visualize results.
"AI slop" seems to be everywhere, but what exactly makes text feel like slop? In our new work (w/ @TuhinChakr, @dgolano, @byron_c_wallace) we provide a systematic attempt at measuring AI slop in text! arxiv.org/abs/2509.19163 🧵 (1/7)


help me fix get-4o slop reply with examples of slop behavior just a single sentence nothing crazy what annoys you what makes you wanna frisbee your laptop into a river i'll respond to every comment rt so we can maximize slop feedback help me de-sloptimize our models go
Our paper "ChartMuseum 🖼️" is now accepted to #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉chart understanding questions , especially on those that involve visual reasoning 👀!

Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to be verbalized via text CoTs 📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B


To appear #NeurIPS2025: Can AI aid scientists amidst their own workflows, when they do not know step-by-step workflows and may not know, in advance, the kinds of scientific utility a visualization would bring? The @CosmicAI_Inst presents ✨AstroVisBench:
How good are LLMs at 🔭 scientific computing and visualization 🔭? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵


How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using…

We are looking to hire an intern who can work on our RL environment curation and data curation stack. Prior experience with post-training is a must, as well as great software engineering skills. DM me or email your resume to hiring at bespokelabs dot ai. In addition, we are…

I'm shocked at how poorly this is advertised, so here's a PSA: NSF has a GRFP-like program specifically for computing disciplines called CISE. The program provides the same 3 years of PhD funding PLUS a year-long mentorship program for the application cycle.
👀Have you asked LLM to provide a more detailed answer after inspecting its initial output? Users often provide such implicit feedback during interaction. ✨We study implicit user feedback found in LMSYS and WildChat. #EMNLP2025

📣I've joined @BerkeleyEECS as an Assistant Professor! My lab will join me soon to continue our research in accessibility, HCI, and supporting communication! I'm so excited to make new connections at @UCBerkeley and in the Bay Area more broadly, so please reach out to chat!


Now that school is starting for lots of folks, it's time for a new release of Speech and Language Processing! Jim and I added all sorts of material for the August 2025 release! With slides to match! Check it out here: web.stanford.edu/~jurafsky/slp3/
Come work with us at Bespoke! It's a fantastic team and a great opportunity to work on the cutting edge of data curation for reasoning!
We are hiring in Bespoke Labs for a new role: Member of Technical Staff: AI Data and RL Environments. Work on data curation strategies with the team that created OpenThoughts. Invent novel data recipes, strategies of curating datasets, environments, tasks and verifiers. (My…
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: allenai.org/blog/monaco 🧵(1/5)

Two caveats with self-alignment: ⚠️ A single model struggles to reliably judge its own generation. ⚠️ A single model struggles to reliably generate diverse responses to learn from. 👉 Introducing Sparta Alignment, where multiple LMs collectively align through ⚔️ combat.

Producing reasoning texts boosts the capabilities of AI models, but do we humans correctly understand these texts? Our latest research suggests that we do not. This highlights a new angle on the "Are they transparent?" debate: they might be, but we misinterpret them. 🧵

United States Tendencias
- 1. #SwiftDay 3,386 posts
- 2. Columbus 50.2K posts
- 3. #TSTheErasTour N/A
- 4. Knesset 84.3K posts
- 5. $ZOOZ 1,016 posts
- 6. Good Monday 33.1K posts
- 7. #MondayMotivation 10.6K posts
- 8. #IndigenousPeoplesDay 1,549 posts
- 9. #MondayVibes 2,649 posts
- 10. Marc 33.6K posts
- 11. Victory Monday N/A
- 12. Branch 44.6K posts
- 13. Israeli Parliament 8,950 posts
- 14. Rod Wave 2,460 posts
- 15. All 20 74.5K posts
- 16. Cryptocurrencies 4,239 posts
- 17. StandX 2,288 posts
- 18. CONGRATS LINGORM PFW EMV 103K posts
- 19. United States Navy N/A
- 20. Thanksgiving 39.9K posts
Tal vez te guste
-
Yejin Choi
@YejinChoinka -
Jacob Andreas
@jacobandreas -
UW NLP
@uwnlp -
EdinburghNLP
@EdinburghNLP -
Hanna Hajishirzi
@HannaHajishirzi -
Luke Zettlemoyer
@LukeZettlemoyer -
Mohit Bansal
@mohitban47 -
Sewon Min
@sewon__min -
Mohit Iyyer
@MohitIyyer -
Maarten Sap (he/him)
@MaartenSap -
Kai-Wei Chang
@kaiwei_chang -
Sean Ren
@xiangrenNLP -
Wei Xu
@cocoweixu -
Sachin Gururangan
@ssgrn -
Diyi Yang
@Diyi_Yang
Something went wrong.
Something went wrong.