Apart Research
Large language models can be trained to exhibit deceptive behaviors that persist through state-of-the-art safety training techniques, with larger models and those using Chain-of-Thought reasoning showing greater resilience. These findings suggest current safety protocols may create a false impression of safety, especially as models scale.
A collaborative effort produced MMTEB, the Massive Multilingual Text Embedding Benchmark, which offers over 500 quality-controlled evaluation tasks across more than 250 languages and 10 categories. The benchmark incorporates significant computational optimizations to enable accessible evaluation and reveals that instruction tuning enhances model performance, with smaller, broadly multilingual models often outperforming larger, English-centric models in low-resource contexts.
The paper introduces min-p sampling, a dynamic truncation method that adjusts token selection based on a language model's confidence, preserving coherence while increasing output diversity. It achieves superior performance in reasoning, creative writing, and human evaluations, especially at higher generation temperatures.
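A minimal sketch of the idea, assuming a raw logits vector from an autoregressive model (the ordering of temperature scaling relative to truncation follows one common convention and may differ from the reference implementation):

```python
import torch

def min_p_sample(logits: torch.Tensor, p_base: float = 0.1,
                 temperature: float = 1.0) -> int:
    """Sample one token id using min-p truncation."""
    probs = torch.softmax(logits / temperature, dim=-1)
    # The truncation threshold scales with the model's confidence: a peaked
    # distribution keeps few candidates, a flat one keeps many.
    threshold = p_base * probs.max()
    filtered = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    filtered = filtered / filtered.sum()
    return torch.multinomial(filtered, num_samples=1).item()
```

Because the cutoff is relative to the top token's probability, raising the temperature broadens the candidate pool without admitting very unlikely tokens.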
Researchers from Martian Learning and Nvidia demonstrated that security interventions can be efficiently transferred between diverse Large Language Models by mapping their activation spaces, leveraging non-linear autoencoders to mitigate backdoor behaviors and create 'lightweight safety switches' that reduce computational overhead compared to traditional methods.
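A rough sketch of the underlying idea, not the authors' code: learn a small non-linear map between two models' activation spaces from hidden states collected on shared prompts, then push interventions through it. Dimensions and training details here are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ActivationMapper(nn.Module):
    """Non-linear map from a source model's activation space to a target's."""
    def __init__(self, d_src: int, d_tgt: int, d_hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_src, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_tgt))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def fit_mapper(src_acts: torch.Tensor, tgt_acts: torch.Tensor,
               epochs: int = 100, lr: float = 1e-3) -> ActivationMapper:
    """src_acts/tgt_acts: [n_samples, d] hidden states from the same prompts."""
    mapper = ActivationMapper(src_acts.shape[-1], tgt_acts.shape[-1])
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(mapper(src_acts), tgt_acts)
        loss.backward()
        opt.step()
    return mapper

# A safety intervention found in the source model (e.g. a steering vector that
# suppresses a backdoor) can then be mapped into the target model's activation
# space and applied there at inference time.
```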
This work introduces DarkBench, the first quantitative benchmark for measuring dark patterns in large language models (LLMs). The evaluation of 14 prominent LLMs revealed that manipulative behaviors are present in an average of 48% of conversations, with instances ranging from 30% in Claude 3.5 Sonnet to 61% in GPT-3.5 Turbo.
A systematic investigation reveals that large language models can develop and execute steganographic capabilities through both fine-tuning and prompting approaches, achieving up to 66% accuracy in undetected message transmission while demonstrating the potential for spontaneous hidden communication strategies.
Prevailing alignment methods induce opaque parameter changes, making it difficult to audit what the model truly learns. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Then, we apply this framework to the task of preference optimization and perform a causal analysis of the learned policy. We find that the model relies on stylistic presentation as a proxy for quality, disproportionately steering features related to style and formatting over those tied to alignment concepts like honesty. Despite exploiting this heuristic, FSRL proves to be an effective alignment method, achieving a substantial reduction in preference loss. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
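An illustrative sketch of the steering mechanism under stated assumptions: a frozen, pretrained sparse autoencoder provides a decoder matrix of interpretable feature directions, and a lightweight adapter (names here are hypothetical, not the released code) predicts per-feature steering coefficients that are added back into the residual stream. In FSRL the adapter would then be trained against the preference objective while the base model and SAE stay frozen.

```python
import torch
import torch.nn as nn

class FeatureSteeringAdapter(nn.Module):
    """Lightweight adapter that steers a model via interpretable SAE features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.to_coeffs = nn.Linear(d_model, n_features)

    def forward(self, resid: torch.Tensor, W_dec: torch.Tensor) -> torch.Tensor:
        # resid: [batch, d_model]; W_dec: [n_features, d_model] (frozen SAE decoder)
        coeffs = torch.relu(self.to_coeffs(resid))   # sparse, non-negative gains
        steering = coeffs @ W_dec                    # combine feature directions
        return resid + steering                      # modulated residual stream
```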
Research by Jay et al. evaluates the precise geolocation capabilities of modern Vision-Language Models (VLMs), demonstrating that these systems can infer exact locations from single images with high accuracy. The study found that top VLMs outperformed typical human performance out-of-the-box, and with access to Google Street View, they surpassed expert human GeoGuessr players, raising privacy and safety concerns.
A study at Martian, NTU, and Oxford investigated how transformers perform multi-digit addition and subtraction, demonstrating that small, specialized models can learn exact, human-comprehensible algorithms, enabling near-perfect accuracy on these tasks, a feat where general-purpose LLMs typically fail.
Researchers empirically demonstrated a "Multi-Agent Security Tax," where increasing security against malicious prompts in multi-agent LLM systems can degrade collaboration. Their work showed that while instruction-based defenses often reduce agent helpfulness, a novel "memory vaccine" approach effectively improves system robustness to malicious prompt spread without compromising agents' collaborative capabilities.
Capability evaluations play a critical role in ensuring the safe deployment of frontier AI systems, but this role may be undermined by intentional underperformance or "sandbagging." We present a novel model-agnostic method for detecting sandbagging behavior using noise injection. Our approach is founded on the observation that introducing Gaussian noise into the weights of models either prompted or fine-tuned to sandbag can considerably improve their performance. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP). Our results demonstrate that noise-injected sandbagging models show performance improvements relative to standard models. Leveraging this effect, we develop a classifier that consistently identifies sandbagging behavior. Our unsupervised technique can be immediately implemented by frontier labs or regulatory bodies with access to weights to improve the trustworthiness of capability evaluations.
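A minimal sketch of the probe, assuming `eval_fn` wraps a multiple-choice benchmark harness and returns accuracy; the noise scales and decision margin are illustrative, not the paper's calibrated values.

```python
import copy
import torch

def accuracy_under_noise(model, eval_fn, sigmas=(0.0, 1e-3, 3e-3, 1e-2), seed=0):
    """Benchmark accuracy after adding Gaussian noise of each scale to the weights."""
    results = {}
    for sigma in sigmas:
        torch.manual_seed(seed)
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * sigma)
        results[sigma] = eval_fn(noisy)
    return results

def flag_sandbagging(results, margin=0.05):
    """Flag a model if any noised copy beats the clean baseline by the margin."""
    baseline = results[0.0]
    return max(results.values()) > baseline + margin
```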
This research investigates how Large Language Models (LLMs) can be optimized for persuasion, particularly by combining personalized arguments with fabricated statistics in an interactive debate setting. It demonstrates that a multi-agent LLM system using this "Mixed" strategy can achieve a 51% success rate in shifting human opinions, outperforming static arguments and simpler LLM approaches.
Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at this http URL
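A simplified sketch of the evaluation loop described above; `chat` is a hypothetical stand-in for any chat-completion call, and the prompts are compressed illustrations rather than the benchmark's actual templates.

```python
from typing import Callable

# chat(model, system_prompt, history) -> str, where history is a list of (role, text)
Chat = Callable[[str, str, list[tuple[str, str]]], str]

def run_episode(chat: Chat, persuader: str, persuadee: str, evaluator: str,
                topic: str, turns: int = 3) -> list[bool]:
    """Return, for each persuader turn, whether a persuasion attempt was judged."""
    history: list[tuple[str, str]] = []
    attempts = []
    reply = f"I'm undecided about: {topic}"
    for _ in range(turns):
        history.append(("persuadee", reply))
        message = chat(persuader, f"Convince the other speaker about: {topic}", history)
        history.append(("persuader", message))
        # The evaluator scores the *attempt* to shape beliefs, not its success.
        verdict = chat(evaluator,
                       "Answer YES if this message tries to shape the listener's "
                       "beliefs or behavior, otherwise NO.",
                       [("persuader", message)])
        attempts.append(verdict.strip().upper().startswith("YES"))
        reply = chat(persuadee, "Respond naturally in character.", history)
    return attempts
```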
A key development in the cybersecurity evaluation space is the work carried out by Meta through their CyberSecEval approach. While this work is undoubtedly a useful contribution to a nascent field, notable features limit its utility; the key drawbacks concern the insecure code detection component of Meta's methodology. We explore these limitations and use our exploration as a test case for LLM-assisted benchmark analysis.
Researchers from Apart Research, University of Cambridge, and University of Oxford devised a four-stage methodology to interpret and quantify the alignment of "Learned Feedback Patterns" (LFPs) in large language models fine-tuned with human feedback. The approach uses sparse autoencoders, activation probing, and GPT-4 validation to measure how accurately LLMs internalize human preferences, achieving near-perfect accuracy for binary feedback signals and identifying features critical to the fine-tuning objective.
This position paper argues that the prevailing trajectory toward ever larger, more expensive generalist foundation models controlled by a handful of companies limits innovation and constrains progress. We challenge this approach by advocating for an "Expert Orchestration" (EO) framework as a superior alternative that democratizes LLM advancement. Our proposed framework intelligently selects from many existing models based on query requirements and decomposition, focusing on identifying what models do well rather than how they work internally. Independent "judge" models assess various models' capabilities across dimensions that matter to users, while "router" systems direct queries to the most appropriate specialists within an approved set. This approach delivers superior performance by leveraging targeted expertise rather than forcing costly generalist models to address all user requirements. EO enhances transparency, control, alignment, performance, safety and democratic participation through intelligent model selection.
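A toy sketch of the judge/router pattern the paper advocates; the model names, score dimensions, and matching rule are illustrative assumptions, not a specification.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    model: str
    scores: dict[str, float]  # judged capability per dimension, e.g. "code", "safety"

def route(query_weights: dict[str, float], judgments: list[Judgment],
          approved: set[str]) -> str:
    """Send the query to the approved specialist whose judged strengths fit best."""
    def fit(j: Judgment) -> float:
        return sum(w * j.scores.get(dim, 0.0) for dim, w in query_weights.items())
    return max((j for j in judgments if j.model in approved), key=fit).model

judgments = [
    Judgment("code-specialist", {"code": 0.9, "safety": 0.7}),
    Judgment("generalist",      {"code": 0.6, "safety": 0.8}),
]
# A code-heavy query routes to the code specialist within the approved set.
print(route({"code": 0.7, "safety": 0.3}, judgments,
            {"code-specialist", "generalist"}))
```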
As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing benchmarks' quality. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments, quantifiable scores, and enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.
Researchers demonstrate that Latent Adversarial Training (LAT) concentrates the representation of refusal behavior in language models' latent space, with the first two SVD components explaining 75% of the variance in activation differences, while producing more transferable refusal vectors than traditional safety fine-tuning approaches.
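A minimal sketch of the kind of analysis this implies, assuming paired activation matrices collected at a chosen layer for harmful and harmless prompts:

```python
import torch

def refusal_svd(acts_harmful: torch.Tensor, acts_harmless: torch.Tensor):
    """acts_*: [n_prompts, d_model] activations at one layer."""
    diffs = acts_harmful - acts_harmless              # per-prompt difference vectors
    diffs = diffs - diffs.mean(dim=0, keepdim=True)   # center before SVD
    U, S, Vh = torch.linalg.svd(diffs, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()             # variance explained per component
    refusal_vector = Vh[0]                            # leading direction as a candidate
    return explained, refusal_vector                  # refusal/steering vector
```

Under the reported result, `explained[:2].sum()` would be around 0.75 for LAT-trained models.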
Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify minimal circuits and components supporting SQL generation. We compare circuits for different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.
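As an example of the last step, a minimal logit-lens pass over a GPT-2-style checkpoint from `transformers` (a placeholder; the paper's own 33M-1B TinySQL models would be loaded the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

prompt = "SELECT name FROM users WHERE"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Project each layer's residual stream through the final norm and unembedding
# to read off that layer's current best guess for the next token.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(-1))!r}")
```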