Nimit Kalra @ ICML 2025
@qw3rtman
research @haizelabs @columbia, prev @citadel @utaustin; currently feynman technique-ing my way through life
Customers building AI agents often lament the difficulty of using off-the-shelf LLM eval tools for their specific apps. While there's no doubt that human supervision is required, not all supervision is the same. Why not transform the supervision problem to make it easier?
Our paper "ChartMuseum 🖼️" is now accepted to #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉chart understanding questions, especially those that involve visual reasoning 👀!
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning, which is hard to verbalize via text CoTs 📉Humans reach 93%, vs. 63% for Gemini-2.5-Pro and 38% for Qwen2.5-72B
New Anthropic Research: Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks. Fine-tuning LLMs through APIs can be harmful even if the data used for fine-tuning does not appear to be, often because the data encodes a hidden message.
born to do research forced to build b2b saas.... luckily at haize, you can do both (we are hiring)
JetBrains is no longer behind. @firebender_com just launched the first-ever background coding agents for all JetBrains IDEs. These coding agents are incredibly intelligent, have isolated workspaces, and don’t require any cloud setup.
.@RoundtableHQ_'s Proof of Human uses behavioral biometrics to stop bots and AI spam. With 87% accuracy (vs. Google’s 69% and Cloudflare’s 33%), it gives you frictionless, real-time authentication with a one-line API. ycombinator.com/launches/OEh-r… Congrats on the launch @_magrawal &…
Are we really running out of data??? No. We're just not using it correctly. The solution: let the model learn which data it needs to learn!!! 1/n
GPT-5 is now live in Firebender for Android Studio, free for a limited time 🚀 It’s definitely the best coding model I’ve ever used. Try it today and tell us what you think.
Android engineers have access to GPT-5 through Firebender. Enjoy
Flying out to #ICML2025 tonight! Always down to chat about unverifiable domains, evals, red-teaming, safeguards, or just meet cool people. I’ll be a panelist at the Methods and Opportunities at Small Scale workshop, sharing our work on tiny generalist reward models…
The current state of the ecosystem for post-training using GRPO w/ vllm + flash attention is frustratingly brittle. - The most recent vllm only supports PyTorch==2.7.0 - vllm requires xformers, but specifically only v0.0.30 is supported for torch 2.7.0. Any prior version of…
can’t even escape the arxiv speak in the group chat
Vogent has a fantastic battle-tested inference stack, glad to see they opened it up + already have a finetuning product. From what I've seen, open-source voice models solve the 0 → 1 quite well but require a lot of post-hoc tuning to get right
chart crime so bad you gotta transcribe the values by hand and plot it yourself
evals evals evals
Mercor (@mercor_ai) is now working with 6 out of the Magnificent 7, all of the top 5 AI labs, and most of the top application layer companies. One trend is common across every customer: we are entering The Era of Evals. RL is becoming so effective that models will be able to…