Researchers from LMU Munich and Bosch BCAI found that the internal 'refusal direction' governing safety behavior in Large Language Models is largely universal across languages in safety-aligned models. Cross-lingual jailbreaks succeed not because this direction differs, but because harmful non-English content is less clearly separated from harmless content in the model's internal representation space, which weakens the shared refusal signal.