
Chris Glaze

@chris_m_glaze

Principal Research Scientist at @SnorkelAI. PhD in computational neuroscience. Previously: @penn @UofMaryland

Chris Glaze reposted

SnorkelAI is headed to #NeurIPS2025 this December with @fredsala and the team. Come talk benchmarks, rubrics, RL envs, & more. 🔬✨


Chris Glaze reposted

Chain-of-thought just became the newest safety nightmare in AI, and nobody was ready for this. A team from Anthropic, Stanford, and Oxford found something brutal: if you wrap a harmful request inside a long, harmless reasoning chain, the model’s guardrails weaken until it stops…


What’s the future state of AI agents in real-world scenarios? How often will they just solve problems as coders vs. interacting with more constrained but complex tool sets? Models like Claude Sonnet 4.5 are indeed impressive at coding and can in theory use these same skills to…


Chris Glaze reposted

New from @ArminPCM: how Snorkel builds reinforcement learning (RL) environments that train and evaluate agents in realistic, enterprise-grade settings.


Chris Glaze reposted

🚨 New research from @SnorkelAI tackles a critical problem: LLMs are evolving faster than our ability to evaluate them 📊 We develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that automates benchmark design using reasoning models as optimizers. BeTaL produces…


Chris Glaze reposted

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:


Just how good are AI agents at exploring their environments in novel ways to solve real-world enterprise problems? As part of our ongoing experiments around agentic autonomy at @SnorkelAI we’re making “code-only” versions of environments in which we challenge agents to solve…


Chris Glaze reposted

Super excited to present our new work on hybrid architecture models—getting the best of Transformers and SSMs like Mamba—at #COLM2025! Come chat with @nick11roberts at poster session 2 on Tuesday. Thread below! (1)


Black holes just proved Stephen Hawking right with the clearest signal yet sciencedaily.com/releases/2025/…


AI eval:basic software testing :: AI capability:basic software capability. AI evals are difficult to get right (which is probably why we so often see agreement rates well below 80%). They require critical thinking about the problem domain, usability, and objectives. Rubrics are vital and their…

Trend I'm following: evals becoming a must-have skill for product builders and AI companies. It's the first new hard skill in a long time that PMs/engineers/founders have had to learn to be successful. The last one was maybe SQL, and Excel? A few examples: @garrytan: "Evals are…



The term “horizon” is used inconsistently in the LM benchmarking lit, and not always super aligned with how the term is used in the RL lit. TL;DR: long horizon ≠ complex. Frontier models may do well on complex tasks but can still fail on basics of long horizon planning. METR…


Definitely tracks with our observations. Point (3) especially raises the question of how to most effectively involve experts in the dev process.

Most takes on RL environments are bad. 1. There are hardly any high-quality RL environments and evals available. Most agentic environments and evals are flawed when you look at the details. It’s a crisis, and no one is talking about it because they’re being hoodwinked by labs…



From ongoing idealized experiments I've been running on AI agents at @SnorkelAI: in this one, most frontier models either (1) are slow to learn about tool-use failures in-context or (2) have a very high tolerance for these failures, showing almost no in-context learning at all.…


Chris Glaze reposted

Tested gpt-oss-120b on @SnorkelAI's agent benchmarks - performance seems split on real-world tasks. It did well on multi-step Finance, but poorly on multi-turn Insurance. Think this could be a limitation of either domain knowledge or handling multi-turn scenarios.


Chris Glaze reposted

Agentic AI will transform every enterprise – but only if agents are trusted experts. The key: evaluation & tuning on specialized, expert data. I’m excited to announce two new products to support this – @SnorkelAI Evaluate & Expert Data-as-a-Service – along w/ our $100M Series D!…


Chris Glaze reposted

There is an explosion of total nonsense around LLM auto-eval in the space these days... be wary of approaches that claim to kick the human out of the loop! E.g. 'Using GPT-4 to auto-eval' makes sense only if you are trying to *distill* GPT-4 functionality. I.e. you are…


Chris Glaze reposted

4/ Either way, the key step remains the same: tuning LLM systems on good data! Whether tuning/aligning an LLM or an LLM + RAG system, the key is the data you use and how you develop it! Check out our work here @SnorkelAI!


Chris Glaze reposted

🎉We are making big strides towards programmatic alignment! Using a reward model w/ DPO, Snorkel AI ranked 2nd on the AlpacaEval 2.0 LLM leaderboard using only a 7B model! Hear from the researcher behind this achievement at our Enterprise LLM Summit tomorrow! #SnorkelAI #AI #LLMs

