Sebfox

@Sebfox1

Building AI evals | Previously AI at McKinsey & QuantumBlack

London, England

composo.ai

Joined October 2009

267 Posts, 299 Followers, 1K Following

Sebfox

@Sebfox1

15 h

The vibe shifts hall of fame in State of AI 2025 is great

Team A: "We need another week to validate this change. Sarah from the medical team is on vacation and she's the only one who can check the healthcare responses." Team B: "Medical accuracy is at 0.91, up from 0.87 last week. The failing cases are all around dosage…

Sebfox

@Sebfox1

Oct 9

Evals don't have to feel like a complex chore. I've seen many teams have huge success with a few simple evals. The real danger comes from overthinking, not shipping, and spending more time on evals than on the feature itself.

Sebfox

@Sebfox1

Oct 9

Sat in on a deployment review. The phrase "seems fine" came up twelve times in thirty minutes. "The outputs seem fine" "Performance seems fine" "Customers seem fine with it" This is a billion-dollar company's evaluation strategy. Seems. Fine. The engineer presenting looked…

Sebfox

@Sebfox1

Oct 8

Beyond Real-Time Evaluation: Visualize Your AI Performance at Scale Once you've traced your agents with Composo, the real power comes from visualization. While you get instant evaluation results directly in your code, the Composo platform takes it further with comprehensive…

Sebfox

@Sebfox1

Oct 7

3 lines. That's all it takes to go from opaque AI agents to full observability and evaluation. Most agent frameworks abstract away what's actually happening under the hood. You deploy to production, something breaks, and you're left debugging in the dark with no idea which agent…

Sebfox

@Sebfox1

Oct 6

Shipping AI agents to production just got a lot easier. Most AI teams are stuck in one of two places: - Flying blind—shipping and praying, finding out what broke from customer complaints. - Or stuck in beta—spending weeks on evals that still don't give them confidence to ship.…

Sebfox reposted

kitze

@thekitze

Oct 4

having a small meeting with all the ppl who have done something useful with mcp

Sebfox

@Sebfox1

Oct 4

Watched an interesting standup yesterday at a customer's office. Engineer: "Pushed the new retrieval logic last night." PM: "How's it performing?" Engineer: pulls up dashboard "Hallucination rate down 15%, relevance holding steady at 0.84." PM: "Ship it to the next tier."…

Sebfox reposted

signüll

@signulll

Oct 2

anthropic’s thinking campaign is just so damn tasteful… it feels like a warm room. the aesthetic is comforting as hell, i kinda get a sense of claude as ai helping you be you from it. that’s why we’re weaving claude into everything we build. i’ll post more about sonnet 4.5…

Sebfox reposted

Theo - t3.gg

@theo

Oct 2

The best decision OpenAI made with Sora was making it iOS only. Android users lack taste and would ruin the whole thing

Sebfox reposted

Amjad Masad

@amasad

Oct 3

It’s unacceptable and unprofessional for a VC like Josh Wolfe to attack a founder by altering screenshots to put words in my mouth and falsely tying my company and partners to them. If this is what he does publicly, imagine the distortions spread behind closed doors. I remain…

Josh Wolfe

@wolfejosh

Oct 2

I appreciate @amasad public comments fwd to me denouncing phrase "Globalise the Intifada" + false libel claims of "genocide" after seeing how hateful rhetoric incites antisemitic violence + kills jews like in UK today on holiest day 🙏Thank u for msg of solidarity + peace🇮🇱🕊

Sebfox

@Sebfox1

Oct 2

Watched a team present their latest model improvements last week. Two months of work. The result: "We think it's better but we're not confident enough to ship to production." Their release process: - Week 1: Build - Week 2: Manual validation with domain experts - Week 3: Argue…

Sebfox

@Sebfox1

Sep 29

Every Monday, same scene in our customer Slack channels: "Did Friday's model update actually make things better?" Nobody really knows. The metrics look... different. Not clearly better, not obviously worse. Just different. So begins another week of manual spot checks, trying to…

Sebfox reposted

Luke Markham

@LukeEMarkham

Sep 18, 2023

5 years ago I might've dismissed a new framework as simply a UX improvement and not something that'll advance SOTA Today I'd probably describe abstractions as one of human's greatest inventions Better abstractions are certainly needed in the prompt engineering field #ChatGPT

Composo

@composo_ai

Sep 18, 2023

🧵it feels like today's LLM applications only scratch the surface of what's possible with today's LLMs we're starting to see empirical results that suggest the more tokens spent working through a problem the more likely the output is to be intelligent #ChatGPT #programming

Sebfox reposted

Luke Markham

@LukeEMarkham

Sep 19, 2023

AI evolutionary tree 🌴🤖 🎯The emergence of LLM prompting was surely a milestone in the democratization of AI. LLMs could have easily been powerful, but hard to operate; indeed many of the early models were designed to perform a specific task that was unconfigurable at…

Sebfox

@Sebfox1

Sep 14, 2023

Arcus's 'RAG at planet scale' approach - using a multi-tiered retrieval at increasing levels of granularity enables LLMs to better incorporate accurate context, and to do so at a huge scale!

Sebfox reposted

Luke Markham

@LukeEMarkham

Sep 12, 2023

Pro-tip: Tired of multi-hop prompts not returning the right result? Split your multi-hop LLM prompt into multiple single-hop ones.💡 Better performance and much easier to programmatically intervene. What do you think? #AI #ChatGPT #LLM #promptengineering [Sakarvadia 2023…
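A minimal sketch of the single-hop split described in the post above, in Python. Everything named here is an assumption for illustration: `answer_two_hop`, the `ask` callable, and the `fake_llm` lookup table stand in for whatever model client is actually used, and are not from the post or the cited paper.

```python
from typing import Callable


def answer_two_hop(
    ask: Callable[[str], str],
    first_question: str,
    second_template: str,
) -> tuple[str, str]:
    """Answer a two-hop question as two separate single-hop prompts.

    `ask` is a hypothetical function that sends one prompt to a model
    and returns its text reply.
    """
    # Hop 1: resolve the intermediate entity on its own.
    intermediate = ask(first_question).strip()

    # The intermediate answer is a plain string at this point, so it can be
    # logged, validated, or corrected before the second prompt is sent.
    second_question = second_template.format(answer=intermediate)

    # Hop 2: ask the follow-up using the resolved entity.
    final = ask(second_question).strip()
    return intermediate, final


# Stand-in for a real model call (assumption: a fixed lookup table).
FAKE_ANSWERS = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}


def fake_llm(prompt: str) -> str:
    return FAKE_ANSWERS[prompt]


hop1, hop2 = answer_two_hop(
    fake_llm,
    "Who directed Inception?",
    "Where was {answer} born?",
)
print(hop1, "->", hop2)
```

Because each hop is its own call, the intermediate answer can be inspected or overridden in code, which is the "programmatically intervene" point the post makes.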

Sebfox reposted

Luke Markham

@LukeEMarkham

Sep 6, 2023

It's no coincidence #LLMs are the first #AI developed that matches human level performance. Text is an efficient, information-dense format. More data-intense formats like image and video will guarantee AI labs will be #GPU constrained for at least the next decade.

Sebfox reposted

Luke Markham

@LukeEMarkham

Sep 5, 2023

LLM output validation slowing your response times too much? Consider using unvalidated output and validating it post-hoc. If you control the UI you can easily recall it once validation finishes and in that time your user probably won’t have had time to screenshot and tweet it #ai
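A rough sketch of the post-hoc validation idea in the post above, in Python with `asyncio`. All names are assumptions for illustration: `validate` is a placeholder for a slow checker, and the `shown` list stands in for whatever the UI is currently displaying.

```python
import asyncio


async def validate(text: str) -> bool:
    """Placeholder validator (assumption): slow, and rejects one keyword."""
    await asyncio.sleep(0.01)  # simulate validation latency
    return "forbidden" not in text.lower()


async def respond(text: str, shown: list[str]) -> asyncio.Task:
    """Show the output immediately, then validate it in the background."""
    shown.append(text)  # the user sees the reply without waiting

    async def check_and_recall() -> None:
        if not await validate(text):
            shown.remove(text)  # recall the reply once validation fails

    return asyncio.create_task(check_and_recall())


async def demo() -> list[str]:
    shown: list[str] = []
    tasks = [
        await respond("Hello there", shown),
        await respond("A forbidden claim", shown),
    ]
    # Both replies are visible before any validation has finished.
    assert len(shown) == 2
    await asyncio.gather(*tasks)
    return shown


result = asyncio.run(demo())
print(result)
```

The trade-off is the one the post names: the reply is visible for the length of the validation delay, so this only fits when the UI can retract content and a brief exposure is acceptable.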
