Jordan Taylor

@JordanTensor

Working on new methods for understanding machine learning systems and entangled quantum systems.

Brisbane

sites.google.com/view/jordanten…

Присоединился в Декабрь 2009

438Посты 370читателей 1Kв читаемых

Вам может понравиться

@morphillogical

@AlhambraAlvaro

@BogdanIonutCir2

@ThomasLemoine66

@EzraJNewman

@FrancescoMeleAn

@samsmisaligned

@crash23001

@casebash

@jacob_pfau

@erinbraid

@shakaz_

@CambrianImplos1

@Rudy111560061

@HenriLemoine13

Закреплено

Jordan Taylor

@JordanTensor

17 мая 2024 г.

I'm keen to share our new library for explaining more of a machine learning model's performance more interpretably than existing methods. This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill which I helped out with during @MATSprogram: 🧵1/8

JordanTensor's tweet image. I'm keen to share our new library for explaining more of a machine learning model's performance more interpretably than existing methods.

This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill which I helped out with during @MATSprogram:
🧵1/8

Lee Sharkey

@leedsharkey

17 мая 2024 г.

Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor! ⤵️ publications.apolloresearch.ai/end_to_end_spa… Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning Our SAEs explain significantly more performance than before! 1/

Jordan Taylor сделал(а) репост

Xander Davies

@alxndrdavies

12 сент.г.

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

alxndrdavies's tweet image. Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

Jordan Taylor сделал(а) репост

Amanda Askell

@AmandaAskell

29 авг.г.

I'm learning the true Hanlon's razor is: never attribute to malice or incompetence that which is best explained by someone being a bit overstretched but intending to get around to it as soon as they possibly can.

Jordan Taylor сделал(а) репост

Neel Nanda

@NeelNanda5

15 июл.г.

It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible

Mikita Balesni 🇺🇦

@balesni

15 июл.г.

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

balesni's tweet image. A simple AGI safety technique: AI’s thoughts are in plain English, just read them

We know it works, with OK (not perfect) transparency!

The risk is fragility: RL training, new architectures, etc threaten transparency

Experts from many orgs agree we should try to preserve it:…

Jordan Taylor сделал(а) репост

Mikita Balesni 🇺🇦

@balesni

15 июл.г.

Jordan Taylor сделал(а) репост

AI Security Institute

@AISecurityInst

10 июл.г.

Can we leverage an understanding of what’s happening inside AI models to stop them from causing harm? At AISI, our dedicated White Box Control Team has been working on just this🧵

Jordan Taylor сделал(а) репост

Joseph Bloom

@JBloomAus

10 июл.г.

🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing. Key findings below 👇

AI Security Institute

@AISecurityInst

10 июл.г.

We’ve released a detailed progress update on our white box control work so far! Read it here: alignmentforum.org/posts/pPEeMdgj…

Jordan Taylor сделал(а) репост

Owain Evans

@OwainEvans_UK

21 янв.г.

New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness 🧵

OwainEvans_UK's tweet image. New paper:
We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions.
They can *describe* their new behavior, despite no explicit mentions in the training data.
So LLMs have a form of intuitive self-awareness 🧵

Jordan Taylor сделал(а) репост

Stephen McAleer

@McaleerStephen

15 янв.г.

Ok sounds like nobody knows. Blocked off some time on my calendar Monday.

Stephen McAleer

@McaleerStephen

15 янв.г.

Honest question: how are we supposed to control a scheming superintelligence? Even with a perfect monitor won't it just convince us to let it out of the sandbox?

Jordan Taylor сделал(а) репост

Yo Shavit

@yonashav

22 дек.г.

Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our interests even as more and more of the economy begins to operate outside direct human oversight. Without it, it is plausible that we fail to notice as the…