Researchers from LMU Munich and Bosch BCAI found that the internal 'refusal direction' governing safety behavior in Large Language Models is largely universal across languages in safety-aligned models. Cross-lingual jailbreaks succeed not because this direction differs, but because harmful non-English content is less clearly separated from harmless content in the model's internal representation space, which weakens the shared refusal signal.