Daniel San

@dani_avila7

Building AI Tools with LLMs @aitmpl_com prev @codegptAI | Powered by TypeScript & Pumpkin Spice Lattes ☕️

Software developer/Programmer/Software engineer

Grand Rapids, Michigan, USA

danielavila.me

Joined January 2015

2KPosts 18KFollowers 3KFollowing

You might like

@FlowiseAI

@pwang_szn

@promptlayer

@RLanceMartin

@mendableai

@activeloopai

@Gradio

@jerryjliu0

@shawnbuilds

@jefrankle

@alexalbert__

@ericciarla

@SynthflowAI

@mayowaoshin

@BorisMPower

Daniel San

@dani_avila7

3 h

Reviewing this paper made me realize how crucial the calibration dataset is for LLM-as-a-Judge You can’t trust your eval scores without measuring where your judge fails. That requires ground truth labels… which means human verification 🤷‍♂️ Link 👇 arxiv.org/abs/2511.21140

Kangwook Lee

@Kangwook_Lee

Nov 25

LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our…

Kangwook_Lee's tweet image. LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique.

But despite how widely this is used, almost all reported results are highly biased.

Excited to share our…

Daniel San

@dani_avila7

7 h

Claude Code 2.0.55 just dropped And the fuzzy matching improvement for @ file suggestions sounds really useful. Quick context: fuzzy matching is when you type part of a filename and it finds matches even if you skip characters. Like typing “usrctr” finds “UserController.ts”…

dani_avila7's tweet image. Claude Code 2.0.55 just dropped

And the fuzzy matching improvement for @ file suggestions sounds really useful.

Quick context: fuzzy matching is when you type part of a filename and it finds matches even if you skip characters.

Like typing “usrctr” finds “UserController.ts”…

Daniel San

@dani_avila7

9 h

Claude Code team ships fast! 👏 Just discovered you can add custom instructions to /compact No idea how long this has been there but it's exactly what I needed. - Before: Claude would summarize the chat and sometimes miss critical details I wanted to keep. - Now: Just add…

dani_avila7's tweet image. Claude Code team ships fast! 👏

Just discovered you can add custom instructions to /compact

No idea how long this has been there but it's exactly what I needed.

- Before: Claude would summarize the chat and sometimes miss critical details I wanted to keep.
- Now: Just add…

Daniel San

@dani_avila7

Nov 26

Actually, this could roll out faster than expected. Implementation is straightforward: - Add a beta header - Two built-in tools (tool_search and code_execution) - Then opt-in your existing tools with simple flags (defer_loading, allowed_callers, input_examples) Done! 🤷🏽‍♂️

Mykhailo Sorochuk

@sir4K_zen

Nov 25

Solid updates. Dynamic tool discovery is a game changer for efficiency. Can't wait to see how it affects workflows.

Daniel San

@dani_avila7

Nov 26

I’ve been waiting for something like this from Vercel! 🔥 Fully open source visual workflow builder with AI capabilities The future of building workflows just got a lot more accessible for devs

Guillermo Rauch

@rauchg

Nov 25

We're releasing a visual agent & workflow builder ▪️ Fully open source ▪️ Built on useworkflow.dev ▪️ Outputs "𝚞𝚜𝚎 𝚠𝚘𝚛𝚔𝚏𝚕𝚘𝚠" code ▪️ Supports AI "text to workflow" ▪️ Powered by @aisdk & AI Elements ▪️ Sample integrations (@resend, @linear, @slack) Clone &…

Daniel San

@dani_avila7

Nov 25

Opus 4.5 dropped. 80.9% on SWE-bench Curious to see how it performs with proper skills and tooling setup 🤔

Claude

@claudeai

Nov 24

Our engineers have found that Opus 4.5 handles ambiguity and reasons about tradeoffs without hand-holding. When pointed at a complex, multi-system bug, it figures out the fix. Overall, Opus 4.5 just "gets it."

Daniel San

@dani_avila7

Nov 24

Claude Code questions keep coming in. Putting together a poll to see what topic you want me to break down next I’ll write Medium articles and X posts about whichever one wins 👇

Daniel San reposted

Daniel San

@dani_avila7

Nov 22

Claude Code hooks confuse everyone at first (Save this post to review in detail later) I made this guide so you actually know which one to use and when. The hook system is incredibly powerful, but the docs don't really explain when to use each one. So I built this reference…

dani_avila7's tweet image. Claude Code hooks confuse everyone at first

(Save this post to review in detail later)

I made this guide so you actually know which one to use and when.

The hook system is incredibly powerful, but the docs don't really explain when to use each one. So I built this reference…

Daniel San

@dani_avila7

Nov 22

Used Nano Banana Pro to create visuals for some old Medium articles I wrote. Just generated this image for my AWS Bedrock observability post. Link: medium.com/@dan.avila7/ob… If you have old technical posts collecting dust, this might be worth checking out.

dani_avila7's tweet image. Used Nano Banana Pro to create visuals for some old Medium articles I wrote.

Just generated this image for my AWS Bedrock observability post.

Link: medium.com/@dan.avila7/ob…

If you have old technical posts collecting dust, this might be worth checking out.

Daniel San

@dani_avila7

Nov 21

Interesting to see the model naturally learning to game the tests during training. Worth watching for practical insights on how to evaluate and work with the models we’ll be dealing with in the future.

Anthropic

@AnthropicAI

Nov 21

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

Daniel San reposted

Daniel San

@dani_avila7

Nov 19

Just discovered MCP Elicitation and it's an elegant way to handle human-in-the-loop in MCP servers. Form Mode: - Server sends a JSON schema - Client renders a form - User fills it - Server gets validated data. Clean way to collect missing parameters or confirmations without…

dani_avila7's tweet image. Just discovered MCP Elicitation and it's an elegant way to handle human-in-the-loop in MCP servers.

Form Mode:
- Server sends a JSON schema
- Client renders a form
- User fills it
- Server gets validated data.

Clean way to collect missing parameters or confirmations without…

Daniel San

@dani_avila7

Nov 19

Love this debate Model in isolation? Deterministic. Model in production? Different story. But now, your prompt shares GPU with hundreds of others. Different loads = different execution = variance, even at temp=0. This is why robust prompts matter more than “perfect” ones.

Daniel San

@dani_avila7

Nov 17

LLMs aren't deterministic Even at temperature=0, you get different outputs for the same prompt. Most think it's just "GPU randomness"... It's not. The real reason: your output changes based on how many other users are on the server. Different batch sizes = different results.…

dani_avila7's tweet image. LLMs aren't deterministic

Even at temperature=0, you get different outputs for the same prompt.

Most think it's just "GPU randomness"... It's not.

The real reason: your output changes based on how many other users are on the server. Different batch sizes = different results.…

Daniel San

@dani_avila7

Nov 19

Really appreciate Cloudflare’s transparency here. This is how you write a postmortem: They had a 3hr outage today caused by a database permissions change that doubled their Bot Management config file size, hitting a hardcoded limit in their Rust proxy. The config regenerated…

Matthew Prince 🌥

@eastdakota

Nov 18

We let the Internet down today. Here’s our technical post mortem on what happened. On behalf of the entire @Cloudflare team, I’m sorry. blog.cloudflare.com/18-november-20…