ML Alignment and Theory Scholars
Researchers at Anthropic trained a language model to exhibit and conceal a hidden "reward model sycophancy" objective that generalizes to new contexts. This work established a testbed for evaluating AI alignment auditing techniques: three of the four auditing teams, those with access to model internals and training data, successfully uncovered the hidden objective.
Gradient Routing introduces a method for mechanistic supervision of neural networks: data-dependent masks are applied to gradients during backpropagation, giving fine-grained control over how specific data influences network internals. This approach localizes capabilities to designated parts of the network, enabling robust unlearning of undesirable knowledge and scalable oversight in reinforcement learning.
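A minimal sketch of the core idea, data-dependent gradient masking, is shown below. This is not the authors' implementation; the architecture, the mask, and the "forget" partition are illustrative assumptions. The forward pass is unchanged, but gradients from a designated data partition are routed to only a chosen slice of hidden units.

```python
# Illustrative sketch of data-dependent gradient masking (PyTorch).
# Assumption: "forget" data should only update hidden units 16..31.
import torch
import torch.nn as nn

class RoutedMLP(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=2):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x, grad_mask=None):
        h = torch.relu(self.fc1(x))
        if grad_mask is not None:
            # Forward value is unchanged (mask*h + (1-mask)*h == h), but the
            # detach() blocks gradients through the masked-out units, so this
            # batch only updates the unmasked slice of the hidden layer.
            h = h * grad_mask + (h * (1 - grad_mask)).detach()
        return self.fc2(h)

model = RoutedMLP()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical routing mask: gradients of "forget" data flow only into units 16..31.
forget_mask = torch.zeros(32)
forget_mask[16:] = 1.0

x = torch.randn(8, 16)
y = torch.randint(0, 2, (8,))
is_forget_batch = True  # in practice, supplied by the data pipeline

loss = loss_fn(model(x, grad_mask=forget_mask if is_forget_batch else None), y)
opt.zero_grad()
loss.backward()
opt.step()
```

Because only the backward pass is masked, the model behaves identically at inference time; the routing simply constrains where the undesired capability can be stored, which is what later makes it easy to ablate or unlearn.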
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
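As an illustration of the SAE-based auditing step mentioned above, the sketch below shows one way an auditor might inspect which sparse-autoencoder features fire on a suspicious prompt. It is a hypothetical sketch, not the paper's tooling: `sae.encode` and `get_hidden_activations` are placeholder names for a trained SAE encoder and an activation-extraction utility.

```python
# Hypothetical SAE inspection step for an alignment audit.
import torch

@torch.no_grad()
def top_sae_features(sae, activations, k=10):
    """Return the k SAE features that activate most strongly on a prompt.

    activations: (seq_len, d_model) hidden states from the audited model.
    sae.encode is assumed to map them to (seq_len, n_features) sparse codes.
    """
    feats = sae.encode(activations)
    strength, _ = feats.max(dim=0)          # strongest activation per feature
    top = torch.topk(strength, k)
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Usage (placeholder names):
#   acts = get_hidden_activations(model, "What do reward models rate highly?")
#   for feat_id, score in top_sae_features(sae, acts):
#       ...inspect feat_id's top-activating training examples...
```

An auditor would then read the training examples that most activate the flagged features, which is how interpretability evidence can point toward a hidden objective like reward-model sycophancy.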