dani_avila7's profile picture. Building AI Tools with LLMs  @aitmpl_com prev @codegptAI | Powered by TypeScript & Pumpkin Spice Lattes ☕️

Daniel San

@dani_avila7

Building AI Tools with LLMs @aitmpl_com prev @codegptAI | Powered by TypeScript & Pumpkin Spice Lattes ☕️

Reviewing this paper made me realize how crucial the calibration dataset is for LLM-as-a-Judge You can’t trust your eval scores without measuring where your judge fails. That requires ground truth labels… which means human verification 🤷‍♂️ Link 👇 arxiv.org/abs/2511.21140

LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our…

Kangwook_Lee's tweet image. LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique.

But despite how widely this is used, almost all reported results are highly biased.

Excited to share our…
Kangwook_Lee's tweet image. LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique.

But despite how widely this is used, almost all reported results are highly biased.

Excited to share our…
Kangwook_Lee's tweet image. LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique.

But despite how widely this is used, almost all reported results are highly biased.

Excited to share our…


Claude Code 2.0.55 just dropped And the fuzzy matching improvement for @ file suggestions sounds really useful. Quick context: fuzzy matching is when you type part of a filename and it finds matches even if you skip characters. Like typing “usrctr” finds “UserController.ts”…

dani_avila7's tweet image. Claude Code 2.0.55 just dropped 

And the fuzzy matching improvement for @ file suggestions sounds really useful.

Quick context: fuzzy matching is when you type part of a filename and it finds matches even if you skip characters. 

Like typing “usrctr” finds “UserController.ts”…

Claude Code team ships fast! 👏 Just discovered you can add custom instructions to /compact No idea how long this has been there but it's exactly what I needed. - Before: Claude would summarize the chat and sometimes miss critical details I wanted to keep. - Now: Just add…

dani_avila7's tweet image. Claude Code team ships fast! 👏

Just discovered you can add custom instructions to /compact 

No idea how long this has been there but it's exactly what I needed.

- Before: Claude would summarize the chat and sometimes miss critical details I wanted to keep.
- Now: Just add…

Actually, this could roll out faster than expected. Implementation is straightforward: - Add a beta header - Two built-in tools (tool_search and code_execution) - Then opt-in your existing tools with simple flags (defer_loading, allowed_callers, input_examples) Done! 🤷🏽‍♂️

dani_avila7's tweet image. Actually, this could roll out faster than expected. Implementation is straightforward: 

- Add a beta header 
- Two built-in tools (tool_search and code_execution)
- Then opt-in your existing tools with simple flags (defer_loading, allowed_callers, input_examples)

Done! 🤷🏽‍♂️

Solid updates. Dynamic tool discovery is a game changer for efficiency. Can't wait to see how it affects workflows.



I’ve been waiting for something like this from Vercel! 🔥 Fully open source visual workflow builder with AI capabilities The future of building workflows just got a lot more accessible for devs

We're releasing a visual agent & workflow builder ▪️ Fully open source ▪️ Built on useworkflow.dev ▪️ Outputs "𝚞𝚜𝚎 𝚠𝚘𝚛𝚔𝚏𝚕𝚘𝚠" code ▪️ Supports AI "text to workflow" ▪️ Powered by @aisdk & AI Elements ▪️ Sample integrations (@resend, @linear, @slack) Clone &…



Opus 4.5 dropped. 80.9% on SWE-bench Curious to see how it performs with proper skills and tooling setup 🤔

Our engineers have found that Opus 4.5 handles ambiguity and reasons about tradeoffs without hand-holding. When pointed at a complex, multi-system bug, it figures out the fix. Overall, Opus 4.5 just "gets it."



Claude Code questions keep coming in. Putting together a poll to see what topic you want me to break down next I’ll write Medium articles and X posts about whichever one wins 👇


Daniel San reposted

Claude Code hooks confuse everyone at first (Save this post to review in detail later) I made this guide so you actually know which one to use and when. The hook system is incredibly powerful, but the docs don't really explain when to use each one. So I built this reference…

dani_avila7's tweet image. Claude Code hooks confuse everyone at first 

(Save this post to review in detail later)

I made this guide so you actually know which one to use and when.

The hook system is incredibly powerful, but the docs don't really explain when to use each one. So I built this reference…

Used Nano Banana Pro to create visuals for some old Medium articles I wrote. Just generated this image for my AWS Bedrock observability post. Link: medium.com/@dan.avila7/ob… If you have old technical posts collecting dust, this might be worth checking out.

dani_avila7's tweet image. Used Nano Banana Pro to create visuals for some old Medium articles I wrote.

Just generated this image for my AWS Bedrock observability post.

Link: medium.com/@dan.avila7/ob…

If you have old technical posts collecting dust, this might be worth checking out.

Interesting to see the model naturally learning to game the tests during training. Worth watching for practical insights on how to evaluate and work with the models we’ll be dealing with in the future.

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.



Daniel San reposted

Just discovered MCP Elicitation and it's an elegant way to handle human-in-the-loop in MCP servers. Form Mode: - Server sends a JSON schema - Client renders a form - User fills it - Server gets validated data. Clean way to collect missing parameters or confirmations without…

dani_avila7's tweet image. Just discovered MCP Elicitation and it's an elegant way to handle human-in-the-loop in MCP servers.

Form Mode: 
- Server sends a JSON schema 
- Client renders a form
- User fills it
- Server gets validated data.

Clean way to collect missing parameters or confirmations without…

Love this debate Model in isolation? Deterministic. Model in production? Different story. But now, your prompt shares GPU with hundreds of others. Different loads = different execution = variance, even at temp=0. This is why robust prompts matter more than “perfect” ones.

LLMs aren't deterministic Even at temperature=0, you get different outputs for the same prompt. Most think it's just "GPU randomness"... It's not. The real reason: your output changes based on how many other users are on the server. Different batch sizes = different results.…

dani_avila7's tweet image. LLMs aren't deterministic

Even at temperature=0, you get different outputs for the same prompt.

Most think it's just "GPU randomness"... It's not.

The real reason: your output changes based on how many other users are on the server. Different batch sizes = different results.…


Really appreciate Cloudflare’s transparency here. This is how you write a postmortem: They had a 3hr outage today caused by a database permissions change that doubled their Bot Management config file size, hitting a hardcoded limit in their Rust proxy. The config regenerated…

We let the Internet down today. Here’s our technical post mortem on what happened. On behalf of the entire @Cloudflare team, I’m sorry. blog.cloudflare.com/18-november-20…



Good news: Claude Code Templates (aitmpl.com) is working 👍 Bad news: Claude is down 👎 Cloudflare is down and we just have to wait


Loading...

Something went wrong.


Something went wrong.