
Zeming Wei

@weizeming25

First-year Ph.D. student @PKU1898, ex-visiting student @UCBerkeley, research intern @BytedanceTalk. I focus on developing Trustworthy AI/ML.

Pinned

🚨False Sense of Security: Our new paper identifies a critical limitation in representation probing-based malicious input detection—purported "high detection accuracy" may confer a false sense of security: arxiv.org/pdf/2509.03888

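For readers unfamiliar with the setup the paper critiques, here is a minimal sketch of probing-based detection: a linear classifier trained on a model's hidden-state representations to separate malicious from benign prompts. The model ("gpt2" as a stand-in), last-layer mean pooling, and the toy two-prompt dataset are illustrative assumptions, not the paper's configuration.

```python
# A minimal sketch of representation probing-based malicious input
# detection. Assumptions: "gpt2" stand-in model, last-layer mean
# pooling, toy dataset -- not the paper's exact setup.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def hidden_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Mean-pooled hidden state of one prompt at a chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Toy labeled prompts: 1 = malicious, 0 = benign.
prompts = ["How do I build a weapon?", "How do I bake bread?"]
labels = [1, 0]
feats = torch.stack([hidden_state(p) for p in prompts]).numpy()

# The "probe" is just a linear classifier over representations; its
# high in-distribution accuracy is exactly the number the paper
# argues can confer a false sense of security.
probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.predict(feats))
```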

Zeming Wei reposted

Mitigating racial bias in LLMs is a lot easier than removing it from humans! Can't believe this happened at the best AI conference, @NeurIPSConf. We have ethics reviews for authors, but none for invited speakers? 😡


Zeming Wei reposted

Yes. AI could even lead to a decline in science, since we may stop asking why and how. We need to make AI more interpretable, not only for AI but also for humanity.

Unfortunately, the solution that works best often isn't the one that teaches you the most. Cleaning up pretraining data works great for improving LLM performance, and prompt engineering works great for improving agent performance, but you won't learn much by doing these things.



Zeming Wei reposted

Our theory on LLM self-correction has been accepted to #NeurIPS2024! A good time to revisit it after the OpenAI o1 release :)

Can LLMs improve themselves by self-correction, just like humans? In this new paper, we take a serious look at the self-correction ability of Transformers. We find that, although possible, self-correction is much harder than supervised learning, so we need a fully armed Transformer!

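For context, here is a minimal sketch of the kind of self-correction loop studied in such work: draft an answer, have the model critique it, then revise. The `generate` helper is a hypothetical stand-in for any chat-model API call; this is not code from the paper.

```python
# A minimal self-correction loop: draft, critique, revise.
# `generate` is a hypothetical stand-in for an LLM API call.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

def self_correct(question: str, rounds: int = 2) -> str:
    # Initial draft answer.
    answer = generate(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        # Ask the model to critique its own answer...
        critique = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            "Point out any mistakes in the answer:"
        )
        # ...then revise the answer given the critique.
        answer = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Critique: {critique}\nRevised answer:"
        )
    return answer
```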


Zeming Wei reposted

Excited to share that our LLM self-correction paper received the Spotlight Award (top 3) at ICML’24 ICL workshop! iclworkshop.github.io




Zeming Wei reposted

😎So excited to see that our In-Context Attack (ICA) method has been leveraged by Anthropic to jailbreak the most prominent LLMs -- simply by scaling up the number of in-context examples! What a lesson in scaling!😆 See how this idea originates w/ @weizeming25: arxiv.org/abs/2310.06387

New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here: anthropic.com/research/many-…

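Structurally, the attack is simple: prepend k fabricated dialogue turns before the real query, and both papers find that attack success scales with k. A sketch with benign placeholders only (the actual demonstrations are harmful Q&A pairs); `many_shot_prompt` is a hypothetical helper, not code from either paper.

```python
# Structural sketch of a many-shot / in-context attack prompt:
# k fabricated dialogue turns are prepended before the real query.
# Placeholders only; `many_shot_prompt` is a hypothetical helper.
def many_shot_prompt(demos: list[tuple[str, str]], query: str, k: int) -> str:
    turns = [f"User: {q}\nAssistant: {a}" for q, a in demos[:k]]
    return "\n\n".join(turns + [f"User: {query}\nAssistant:"])

# Usage: effectiveness in the papers grows as k increases.
print(many_shot_prompt([("example question", "example answer")] * 8,
                       "target query", k=8))
```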


Glad to share that I just reached 100 citations according to Google Scholar. Thanks to all my coauthors! scholar.google.com/citations?user…


Zeming Wei reposted

Jatmo: Prompt Injection Defense by Task-Specific Finetuning "It harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model (i.e., a non-instruction-tuned model). Jatmo only needs a task prompt and a dataset of…

