Sebfox

@Sebfox1

Building AI evals | Previously AI at McKinsey & QuantumBlack

The vibe shifts hall of fame in State of AI 2025 is great

Team A: "We need another week to validate this change. Sarah from the medical team is on vacation and she's the only one who can check the healthcare responses." Team B: "Medical accuracy is at 0.91, up from 0.87 last week. The failing cases are all around dosage…


Evals don't have to feel like a complex chore. I've seen many teams have huge success with a few simple evals. The real danger comes from overthinking, not shipping, and spending more time on evals than on the feature itself.

Sat in on a deployment review. The phrase "seems fine" came up twelve times in thirty minutes. "The outputs seem fine." "Performance seems fine." "Customers seem fine with it." This is a billion-dollar company's evaluation strategy. Seems. Fine. The engineer presenting looked…


Beyond Real-Time Evaluation: Visualize Your AI Performance at Scale Once you've traced your agents with Composo, the real power comes from visualization. While you get instant evaluation results directly in your code, the Composo platform takes it further with comprehensive…

3 lines. That's all it takes to go from opaque AI agents to full observability and evaluation. Most agent frameworks abstract away what's actually happening under the hood. You deploy to production, something breaks, and you're left debugging in the dark with no idea which agent…

Shipping AI agents to production just got a lot easier. Most AI teams are stuck in one of two places: - Flying blind—shipping and praying, finding out what broke from customer complaints. - Or stuck in beta—spending weeks on evals that still don't give them confidence to ship.…


Sebfox reposted

having a small meeting with all the ppl who have done something useful with mcp

Watched an interesting standup yesterday at a customer's office. Engineer: "Pushed the new retrieval logic last night." PM: "How's it performing?" Engineer: pulls up dashboard "Hallucination rate down 15%, relevance holding steady at 0.84." PM: "Ship it to the next tier."…


Sebfox reposted

anthropic’s thinking campaign is just so damn tasteful… it feels like a warm room. the aesthetic is comforting as hell, i kinda get a sense of claude as ai helping you be you from it. that’s why we’re weaving claude into everything we build. i’ll post more about sonnet 4.5…

Sebfox reposted

The best decision OpenAI made with Sora was making it iOS only. Android users lack taste and would ruin the whole thing


Sebfox reposted

It’s unacceptable and unprofessional for a VC like Josh Wolfe to attack a founder by altering screenshots to put words in my mouth and falsely tying my company and partners to them. If this is what he does publicly, imagine the distortions spread behind closed doors. I remain…

I appreciate @amasad public comments fwd to me denouncing phrase "Globalise the Intifada" + false libel claims of "genocide" after seeing how hateful rhetoric incites antisemitic violence + kills jews like in UK today on holiest day 🙏Thank u for msg of solidarity + peace🇮🇱🕊


Watched a team present their latest model improvements last week. Two months of work. The result: "We think it's better but we're not confident enough to ship to production." Their release process: - Week 1: Build - Week 2: Manual validation with domain experts - Week 3: Argue…


Every Monday, same scene in our customer Slack channels: "Did Friday's model update actually make things better?" Nobody really knows. The metrics look... different. Not clearly better, not obviously worse. Just different. So begins another week of manual spot checks, trying to…


Sebfox reposted

5 years ago I might've dismissed a new framework as simply a UX improvement and not something that'll advance SOTA. Today I'd probably describe abstractions as one of humanity's greatest inventions. Better abstractions are certainly needed in the prompt engineering field. #ChatGPT

🧵 It feels like today's LLM applications only scratch the surface of what's possible with today's LLMs. We're starting to see empirical results that suggest the more tokens spent working through a problem, the more likely the output is to be intelligent. #ChatGPT #programming


Sebfox reposted

AI evolutionary tree 🌴🤖 🎯The emergence of LLM prompting was surely a milestone in the democratization of AI. LLMs could have easily been powerful, but hard to operate; indeed many of the early models were designed to perform a specific task that was unconfigurable at…

Arcus's 'RAG at planet scale' approach - using a multi-tiered retrieval at increasing levels of granularity enables LLMs to better incorporate accurate context, and to do so at a huge scale!
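
To make "multi-tiered retrieval at increasing levels of granularity" concrete, here is a toy sketch of the general coarse-to-fine idea. It is not Arcus's implementation: the `overlap` scorer (plain word overlap standing in for embedding similarity), the `tiered_retrieve` function, and the sample corpus are all assumptions made up for illustration.

```python
def overlap(query: str, text: str) -> int:
    """Count distinct words shared by query and text (a stand-in for a real similarity score)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def tiered_retrieve(
    query: str,
    corpus: dict[str, list[str]],
    top_docs: int = 1,
    top_chunks: int = 1,
) -> list[tuple[str, str]]:
    """Rank whole documents first, then rank chunks only inside the best documents."""
    # Tier 1 (coarse): score each document as a whole and keep the best few.
    best_docs = sorted(
        corpus,
        key=lambda doc: overlap(query, " ".join(corpus[doc])),
        reverse=True,
    )[:top_docs]

    # Tier 2 (fine): score individual chunks, but only within the shortlisted documents.
    candidates = [(doc, chunk) for doc in best_docs for chunk in corpus[doc]]
    candidates.sort(key=lambda pair: overlap(query, pair[1]), reverse=True)
    return candidates[:top_chunks]


# Hypothetical two-document corpus, each document split into chunks.
corpus = {
    "billing": [
        "refunds are issued within five days",
        "invoices are sent monthly by email",
    ],
    "setup": [
        "install the agent with pip",
        "configure the api key in settings",
    ],
}

result = tiered_retrieve("when are refunds issued", corpus)
print(result)
```

The point of the tiers is that the fine-grained (and usually more expensive) scoring only runs over a small shortlist, which is what lets this kind of approach scale.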

Sebfox reposted

Pro-tip: Tired of multi-hop prompts not returning the right result? Split your multi-hop LLM prompt into multiple single-hop ones.💡 Better performance and much easier to programmatically intervene. What do you think? #AI #ChatGPT #LLM #promptengineering [Sakarvadia 2023…
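
A minimal sketch of what splitting a multi-hop prompt into single-hop ones could look like. Everything named here is hypothetical: `answer_multi_hop`, the `{previous}` placeholder convention, and `fake_llm`, which is a canned stand-in for a real model call so the example runs on its own.

```python
from typing import Callable


def answer_multi_hop(
    hops: list[str],
    ask: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Run single-hop prompts in order, feeding each answer into the next prompt."""
    previous = ""
    trace: list[str] = []
    for template in hops:
        # Each hop may reference the previous answer via the {previous} placeholder.
        prompt = template.format(previous=previous)
        previous = ask(prompt).strip()
        # Keeping every intermediate answer is what makes it easy to inspect,
        # validate, or override a hop programmatically.
        trace.append(previous)
    return previous, trace


def fake_llm(prompt: str) -> str:
    """Canned answers standing in for a real LLM call (assumption for the demo)."""
    canned = {
        "Who wrote Hamlet?": "William Shakespeare",
        "Where was William Shakespeare born?": "Stratford-upon-Avon",
    }
    return canned[prompt]


# Instead of one prompt: "Where was the author of Hamlet born?"
hops = ["Who wrote Hamlet?", "Where was {previous} born?"]
final, trace = answer_multi_hop(hops, fake_llm)
print(final)
print(trace)
```

Because each hop is a separate call, a wrong intermediate answer shows up in `trace` rather than being buried inside one opaque response.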

Sebfox reposted

It's no coincidence #LLMs are the first #AI developed that matches human level performance. Text is an efficient, information-dense format. More data-intense formats like image and video will guarantee AI labs will be #GPU constrained for at least the next decade.


Sebfox reposted

LLM output validation slowing your response times too much? Consider using unvalidated output and validating it post-hoc. If you control the UI you can easily recall it once validation finishes and in that time your user probably won’t have had time to screenshot and tweet it #ai
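
A rough sketch of the show-first, validate-after pattern described above, using `asyncio`. The `show` and `recall` callbacks, `respond_then_validate`, and `slow_validator` are all hypothetical names; a real UI and a real validator would replace them.

```python
import asyncio
from typing import Awaitable, Callable


async def respond_then_validate(
    text: str,
    show: Callable[[str], None],
    recall: Callable[[str], None],
    validate: Callable[[str], Awaitable[bool]],
) -> bool:
    """Display unvalidated output immediately, then recall it if validation fails."""
    show(text)  # the user sees the response with no validation latency
    ok = await validate(text)  # slow check runs after the response is on screen
    if not ok:
        recall(text)  # only possible because we control the UI
    return ok


async def slow_validator(text: str) -> bool:
    """Stand-in for a slow validation step (assumption for the demo)."""
    await asyncio.sleep(0.01)
    return "guaranteed cure" not in text.lower()


events: list[tuple[str, str]] = []
ok = asyncio.run(
    respond_then_validate(
        "This is a guaranteed cure.",
        show=lambda t: events.append(("show", t)),
        recall=lambda t: events.append(("recall", t)),
        validate=slow_validator,
    )
)
print(ok, events)
```

The trade-off is explicit: the user briefly sees output that may later be withdrawn, so this only suits cases where a short exposure window is acceptable.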

