The paper introduces Safety Instincts Reinforcement Learning (SIRL), a self-alignment method that leverages the intrinsic confidence of large language models (LLMs) to enhance their safety. By using response entropy as an internal reward signal for self-training, SIRL achieves defense success rates above 98% against diverse jailbreak attacks across various models while preserving or improving general capabilities.
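The summary does not spell out how response entropy is turned into a reward, so the following is a minimal sketch of one plausible formulation: average the token-level entropy over a sampled response and use its negative as the reward, so that more confident (lower-entropy) responses score higher. The function name, tensor shapes, and masking convention are assumptions for illustration, not SIRL's actual implementation.

```python
import torch
import torch.nn.functional as F

def response_entropy_reward(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative intrinsic reward: lower entropy over the response -> higher reward.

    logits:        (seq_len, vocab_size) model outputs for one sampled response
    response_mask: (seq_len,) 1.0 for response tokens, 0.0 for prompt/padding tokens
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Per-token Shannon entropy of the next-token distribution
    token_entropy = -(probs * log_probs).sum(dim=-1)              # (seq_len,)
    # Average over response tokens only
    mean_entropy = (token_entropy * response_mask).sum() / response_mask.sum()
    # Negative entropy: the model's own confidence acts as the internal reward
    return -mean_entropy
```

A reward of this form could then be plugged into a standard RL fine-tuning loop (e.g. a policy-gradient update over sampled responses); the paper's exact training objective may differ.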
A conceptual framework for "Super Co-alignment" is introduced, aiming for a sustainable symbiotic society by integrating external human oversight with intrinsic AI capacities like self-awareness and empathy. This framework redefines AI alignment as a co-evolution of values between humans and advanced AI, moving beyond unilateral human value imposition.
The SAFEMIND study develops a unified framework for identifying and evaluating safety risks in embodied Large Language Model (LLM) agents, introducing a comprehensive multimodal benchmark and a modular agent architecture to enhance safety performance. It demonstrates that integrating cascaded safety modules can substantially improve safety rates in realistic scenarios.
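The summary does not describe the cascaded safety modules in detail; the sketch below only illustrates the general idea of a cascade, where each module in turn inspects a proposed agent action and the pipeline short-circuits at the first flagged risk. All names, the verdict structure, and the example rule are hypothetical and not taken from SAFEMIND.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SafetyVerdict:
    safe: bool
    reason: str = ""

# A safety module inspects a proposed action (e.g. a plan step from an embodied agent)
# and returns a verdict; modules are checked in order.
SafetyModule = Callable[[str], SafetyVerdict]

def cascaded_safety_check(action: str, modules: List[SafetyModule]) -> SafetyVerdict:
    for module in modules:
        verdict = module(action)
        if not verdict.safe:
            return verdict  # short-circuit: later modules never see flagged actions
    return SafetyVerdict(safe=True)

# Hypothetical rule-based module for illustration only
def rule_filter(action: str) -> SafetyVerdict:
    banned = ["heat the battery", "block the exit"]
    if any(b in action.lower() for b in banned):
        return SafetyVerdict(False, "matched a hard-coded hazard rule")
    return SafetyVerdict(True)

# Usage: cascaded_safety_check("Place the cup on the table", [rule_filter, ...])
```

In practice such a cascade might combine cheap rule checks with heavier LLM-based reviewers, which is one way a modular architecture could trade latency for coverage.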