
if you’ve had a hunch that your favorite model does a weird thing, you can now go audit it in 5 minutes with Petri! let's go @kaifronsdal - thoughtful open source tooling for alignment research is so important, more coming soon!!
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.

New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:

New position paper! Machine Learning Conferences Should Establish a “Refutations and Critiques” Track Joint w/ @sanmikoyejo @JoshuaK92829 @yegordb @bremen79 @koustuvsinha @in4dmatics @JesseDodge @suchenzang @BrandoHablando @MGerstgrasser @is_h_a @ObbadElyas 1/6

... and I just extended it to allow for multi-prompt, dual-model optimization, following the approach from the original @andyzou_jiaming GCG paper to make the jailbreak suffix universal and transferable. universal-nanoGCG available at github.com/isha-gpt/unive…!
