
Hon. Elizabeth Edwards

@RepEEdwards

Former 2-term NH State Representative, 2014-2018

RTs of content are typically endorsements of most concepts contained in that content, not the content creators.

Pinned

Thread: A common mistake I see, when people first get involved in politics, is that they first pick the party they think better aligns with their views (or the party chosen by most people around them), and then they follow that party's lead on everything.


Hon. Elizabeth Edwards reposted

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.


Hon. Elizabeth Edwards reposted

Part of my job on Anthropic’s Alignment Stress-Testing Team is to write internal reviews of our RSP activities, acting as a “second line of defense” for safety. Today, we’re publishing one of our reviews for the first time alongside the pilot Sabotage Risk Report.

🌱⚠️ weeds-ey but important milestone ⚠️🌱 This is a first concrete example of the kind of analysis, reporting, and accountability that we’re aiming for as part of our Responsible Scaling Policy commitments on misalignment.



Hon. Elizabeth Edwards reposted

Once we had a stable draft, we brought in our internal stress-testing team as well as an independent external team at @METR, for both feedback and formal reviews of the final product. Both teams were great to work with, and their feedback very materially improved our work here!


Hon. Elizabeth Edwards reposted

If you’re at another developer and you’re considering doing something similar, leave yourself plenty of time for your first try!


Hon. Elizabeth Edwards reposted

🌱⚠️ weeds-ey but important milestone ⚠️🌱 This is a first concrete example of the kind of analysis, reporting, and accountability that we’re aiming for as part of our Responsible Scaling Policy commitments on misalignment.


Hon. Elizabeth Edwards reposted

Technological Optimism and Appropriate Fear - an essay where I grapple with how I feel about the continued steady march towards powerful AI systems. The world will bend around AI akin to how a black hole pulls and bends everything around itself.


Hon. Elizabeth Edwards reposted

🧵 Haiku 4.5 🧵 Looking at the alignment evidence, Haiku is similar to Sonnet: Very safe, though often eval-aware. I think the most interesting alignment content in the system card is about reasoning faithfulness…


Hon. Elizabeth Edwards reposted

This is not normal. OpenAI used an unrelated lawsuit to intimidate advocates of a bill trying to regulate them. While the bill was still being debated. 7/15


Hon. Elizabeth Edwards reposted

One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵


Hon. Elizabeth Edwards reposted

🎉🎉 Today we launched Claude Sonnet 4.5, which is not only highly capable but also a major improvement on safety and alignment x.com/claudeai/statu…


Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.



Hon. Elizabeth Edwards reposted

Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)


Hon. Elizabeth Edwards reposted

[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...


Hon. Elizabeth Edwards reposted

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.


Hon. Elizabeth Edwards reposted

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.


Hon. Elizabeth Edwards reposted

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.


Hon. Elizabeth Edwards reposted

We're recruiting researchers to work with us on AI interpretability. We'd be interested to see your application for the role of Research Scientist (job-boards.greenhouse.io/anthropic/jobs…) or Research Engineer (job-boards.greenhouse.io/anthropic/jobs…).


Hon. Elizabeth Edwards reposted

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.


Hon. Elizabeth Edwards reposted

PEPFAR has saved between 7.5 and 30 million lives. This administration likes to do all kinds of things, but they also change course sometimes when there's pushback. People are gathering in Washington DC this Friday to protect PEPFAR; you can join them facebook.com/events/s/rally…


Hon. Elizabeth Edwards reposted

My team is hiring researchers! I’m primarily interested in candidates who have (i) several years of experience doing excellent work as a SWE or RE, (ii) who have substantial research experience of some form, and (iii) who are familiar with modern ML and the AGI alignment…


Hon. Elizabeth Edwards reposted

We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.

