Researchers at Westlake University, Emory University, Dalian University of Technology, University of Surrey, and University of Oxford investigated the 'Curse of Depth' in large language models, demonstrating that Pre-Layer Normalization leads to exponential output variance growth, rendering deep layers ineffective. They propose LayerNorm Scaling (LNS), a hyperparameter-free method that reduces variance growth to a polynomial rate, leading to improved pre-training perplexity and an average 1.8% gain on downstream tasks across various LLM scales.
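LNS itself is a one-line change: the output of each layer's LayerNorm is scaled by the inverse square root of that layer's depth, which is what makes the method hyperparameter-free and tames the variance growth. A minimal PyTorch sketch (module wiring is illustrative, not the authors' code):

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """Pre-LN LayerNorm whose output is scaled by 1/sqrt(layer_index).

    Scaling by depth damps the exponential variance growth of Pre-LN
    into a polynomial rate, per LayerNorm Scaling (LNS).
    """
    def __init__(self, hidden_size: int, layer_index: int):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / math.sqrt(layer_index)  # layer_index is 1-based

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale

# Usage inside transformer block l (1-indexed), Pre-LN style:
#   h = x + attention(ScaledLayerNorm(d_model, l)(x))
```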
The paper scrutinizes the long-standing belief in unbounded Large Language Model (LLM) scaling, establishing a proof-informed framework that identifies intrinsic theoretical limits on their capabilities. It synthesizes empirical failures like hallucination and reasoning degradation with foundational concepts from computability theory, information theory, and statistical learning, showing that these issues are inherent rather than transient engineering challenges.
A scalable framework was established for generating high-quality synthetic rubrics, enabling the development of a rubric-guided reward model (Rubric-RM) to improve large language model alignment. Rubric-RM consistently outperforms comparable 7B-scale reward models by an average of 6.8% and achieves state-of-the-art performance in instruction-following tasks.
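One plausible reading of rubric-guided reward computation, sketched with an abstract judge callable; the rubric schema and weighted aggregation below are assumptions, not the paper's exact design:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str   # e.g. "The answer follows the requested format."
    weight: float    # relative importance of this criterion

def rubric_reward(response: str,
                  rubric: List[RubricItem],
                  judge: Callable[[str, str], bool]) -> float:
    """Aggregate a scalar reward from per-criterion judgments.

    `judge(criterion, response)` returns True if the response satisfies
    the criterion (e.g. a small LLM classifier or a rule check).
    """
    total = sum(item.weight for item in rubric)
    score = sum(item.weight for item in rubric
                if judge(item.criterion, response))
    return score / total  # normalized reward in [0, 1]
```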
G-Designer introduces a framework using Graph Neural Networks to dynamically generate task-aware communication topologies for LLM-based multi-agent systems. This approach achieved superior performance on various benchmarks while significantly reducing token consumption by up to 95.33% and demonstrating high adversarial robustness.
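A rough sketch of the task-aware topology idea: score directed agent-to-agent edges conditioned on the query embedding and keep only confident edges. G-Designer's actual generator differs in detail, so this simplified scorer is illustrative only:

```python
import torch
import torch.nn as nn

class TopologyDesigner(nn.Module):
    """Sketch: score directed agent-to-agent edges from agent profiles
    conditioned on a task embedding, then keep only confident edges."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, agents: torch.Tensor, task: torch.Tensor,
                threshold: float = 0.5) -> torch.Tensor:
        # agents: (n, dim) profile embeddings; task: (dim,) query embedding
        n = agents.size(0)
        ctx = self.proj(torch.cat([agents, task.expand(n, -1)], dim=-1))
        # Score all ordered pairs (i, j) of agents.
        logits = self.score(ctx.repeat_interleave(n, 0), ctx.repeat(n, 1))
        probs = torch.sigmoid(logits).view(n, n)
        return (probs > threshold).float()  # sparse communication mask
```

Sparsifying the mask is what drives the reported token savings: agents only exchange messages along retained edges.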
This Tsinghua University-led survey synthesizes advancements in reinforced reasoning with Large Language Models, identifying the emerging paradigm of "Large Reasoning Models (LRMs)." It outlines how these models integrate train-time improvements through reinforcement learning and automated data construction, particularly with Process Reward Models, alongside test-time computational scaling for enhanced reasoning.
Med-R1, a Vision-Language Model leveraging reinforcement learning, achieves superior generalizability in medical reasoning across eight imaging modalities and five clinical tasks. It demonstrates a 29.94% accuracy improvement over its base model and outperforms larger general-purpose VLMs, while its 'Think-After' strategy provides clinically sound, post-hoc rationales.
A study of thinking in rule-based visual reinforcement fine-tuning for multi-modal large language models reveals that explicit reasoning processes are not always necessary and can even be detrimental, especially for visual perception tasks. For 2B models, a "no-thinking" strategy achieved a 3.14% higher average accuracy in image classification and offered significant computational efficiency over thinking-based methods.
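The two regimes differ only in the rollout prompt: the thinking variant asks for reasoning inside think tags before the answer, while the no-thinking variant requests the answer directly. A sketch with illustrative wording (the exact templates are assumptions):

```python
THINKING_TEMPLATE = (
    "{question}\n"
    "First output your reasoning inside <think> </think> tags, "
    "then the final answer inside <answer> </answer> tags."
)

NO_THINKING_TEMPLATE = (
    "{question}\n"
    "Output only the final answer inside <answer> </answer> tags."
)

def build_prompt(question: str, use_thinking: bool) -> str:
    """Select the rollout template for rule-based visual RFT."""
    tpl = THINKING_TEMPLATE if use_thinking else NO_THINKING_TEMPLATE
    return tpl.format(question=question)
```

Dropping the reasoning span also shortens every rollout, which is where the computational savings come from.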
MedDINOv3 adapts Vision Transformer foundation models for medical image segmentation by integrating architectural refinements and large-scale, domain-adaptive pretraining on nearly 4 million CT slices. This framework achieves superior performance over nnU-Net on several organ segmentation tasks and competitive results for tumor segmentation across diverse medical imaging benchmarks.
Researchers from Arizona State University and collaborating institutions present a comprehensive survey of the "LLM-as-a-judge" paradigm, defining its operational framework, taxonomizing methodologies, and exploring its opportunities and challenges across the LLM lifecycle. The work identifies six key judging attributes and categorizes various tuning and prompting strategies employed to enhance LLM judging capabilities.
Researchers developed TAAROFBENCH, the first open-ended benchmark for Persian *taarof*, revealing that current LLMs exhibit a strong bias towards directness and perform poorly on culturally nuanced interactions. Targeted fine-tuning, especially using Direct Preference Optimization, significantly improved LLM accuracy on *taarof*-expected scenarios, bringing performance close to native human levels.
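For reference, the DPO objective used in such fine-tuning is the standard one from Rafailov et al. (2023); a sketch, where 'chosen' would be the taarof-appropriate response and 'rejected' the overly direct one:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard Direct Preference Optimization loss.

    Inputs are summed log-probabilities of whole responses (torch
    tensors) under the policy and a frozen reference model; beta
    controls how far the policy may drift from the reference.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```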
GuardAgent introduces a framework for safeguarding Large Language Model (LLM) agents against safety and privacy violations by leveraging knowledge-enabled reasoning and code generation for deterministic policy enforcement. The system significantly outperforms existing text-based guardrails and hardcoded approaches in accuracy while maintaining the target agent's original task performance, validated on newly created healthcare access control and web safety benchmarks.
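The key design choice is that the final check is executed as code rather than judged in natural language, which is what makes enforcement deterministic. A minimal hand-written stand-in for the kind of check GuardAgent would generate (the action schema and role map are illustrative assumptions):

```python
from typing import Callable, Dict

def make_access_guard(allowed: Dict[str, set]) -> Callable[[dict], bool]:
    """Materialize an access-control policy as an executable predicate.

    GuardAgent generates such checks from textual guard requests; this
    hand-written example merely stands in for generated code.
    """
    def guard(action: dict) -> bool:
        # action: {"user_role": ..., "resource": ...}
        return action["resource"] in allowed.get(action["user_role"], set())
    return guard

guard = make_access_guard({"physician": {"patient_records"},
                           "patient": {"own_records"}})
assert guard({"user_role": "physician", "resource": "patient_records"})
assert not guard({"user_role": "patient", "resource": "patient_records"})
```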
A memory management framework for multi-agent systems, SEDM, implements verifiable write admission, self-scheduling, and cross-domain knowledge diffusion to address noise accumulation and uncontrolled memory expansion. It enhances reasoning accuracy on benchmarks like LoCoMo, FEVER, and HotpotQA while reducing token consumption by up to 50% compared to previous memory systems.
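A minimal sketch of the write-admission idea, assuming a scalar utility function (e.g. accuracy on held-out replay tasks); SEDM's actual admission and scheduling logic is richer than this gate:

```python
from typing import Callable, List

def admit_write(memory: List[str], candidate: str,
                utility: Callable[[List[str]], float],
                min_gain: float = 0.0) -> bool:
    """Commit a candidate memory entry only if it verifiably helps.

    The entry is admitted when adding it improves the utility score by
    more than `min_gain`, which blocks noise accumulation and keeps the
    store from growing without bound.
    """
    gain = utility(memory + [candidate]) - utility(memory)
    if gain > min_gain:
        memory.append(candidate)
        return True
    return False
```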
This practitioner's guide clarifies the complexities of Difference-in-Differences (DiD) designs, particularly in multi-period and staggered treatment adoption settings, by advocating for a 'forward-engineering' approach to explicitly define causal parameters and their identification assumptions. The authors demonstrate how commonly used Two-Way Fixed Effects (TWFE) estimators can produce biased results in these complex scenarios and provide robust, transparent alternatives for applied researchers.
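For concreteness, the estimator the guide warns about is the canonical TWFE regression, with unit fixed effects \(\alpha_i\), time fixed effects \(\lambda_t\), and a treatment dummy \(D_{it}\):

```latex
Y_{it} = \alpha_i + \lambda_t + \beta^{\mathrm{TWFE}} D_{it} + \varepsilon_{it}
```

Under staggered adoption with treatment effects that vary across cohorts or over time, \(\beta^{\mathrm{TWFE}}\) resolves to a weighted average of two-by-two comparisons that uses already-treated units as controls, and some of those weights can be negative; this is the mechanical source of the bias the authors document.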
Artificial intelligence (AI) has demonstrated significant potential in ECG analysis and cardiovascular disease assessment. Recently, foundation models have played a remarkable role in advancing medical AI. The development of an ECG foundation model holds the promise of elevating AI-ECG research to new heights. However, building such a model faces several challenges, including insufficient database sample sizes and inadequate generalization across multiple domains. Additionally, there is a notable performance gap between single-lead and multi-lead ECG analyses. We introduce an ECG Foundation Model (ECGFounder), a general-purpose model that leverages real-world ECG annotations from cardiology experts to broaden the diagnostic capabilities of ECG analysis. ECGFounder was trained on over 10 million ECGs with 150 label categories from the Harvard-Emory ECG Database, enabling comprehensive cardiovascular disease diagnosis through ECG analysis. The model is designed to be both an effective out-of-the-box solution and fine-tunable for downstream tasks, maximizing usability. Importantly, we extended its application to lower-rank ECGs, in particular arbitrary single-lead ECGs, making ECGFounder applicable to various downstream tasks in mobile monitoring scenarios. Experimental results demonstrate that ECGFounder achieves expert-level performance on internal validation sets, with AUROC exceeding 0.95 for eighty diagnoses. It also shows strong classification performance and generalization across various diagnoses on external validation sets. When fine-tuned, ECGFounder outperforms baseline models in demographic analysis, clinical event detection, and cross-modality cardiac rhythm diagnosis. The trained model and data will be publicly released upon publication through this http URL. Our code is available at this https URL.
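As a rough illustration of the advertised fine-tune-for-downstream-tasks workflow, one might wrap the pretrained encoder and swap in a task head; the class names and backbone interface below are hypothetical, not the released API:

```python
import torch
import torch.nn as nn

class ECGClassifier(nn.Module):
    """Hypothetical downstream wrapper around a pretrained ECG encoder."""
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone                      # pretrained encoder
        self.head = nn.Linear(feat_dim, n_classes)    # new task head

    def forward(self, ecg: torch.Tensor) -> torch.Tensor:
        # ecg: (batch, leads, samples); single-lead inputs use leads == 1.
        # Assumes the backbone returns (batch, feat_dim) features.
        return self.head(self.backbone(ecg))

# Typical light fine-tuning: freeze the encoder, train only the head.
# for p in model.backbone.parameters():
#     p.requires_grad = False
```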
Researchers at Shanghai AI Laboratory and collaborating institutions developed TIR-BENCH, a comprehensive benchmark designed to evaluate multimodal large language models' ability to 'think with images' by dynamically manipulating visual inputs with tools. The benchmark revealed that models with explicit agentic tool-use capabilities achieved up to 46% accuracy, significantly outperforming non-agentic models on tasks requiring active visual interaction.
The 'PlanGenLLMs' survey provides a structured overview of Large Language Model planning capabilities, proposing six consistent evaluation criteria (completeness, executability, optimality, representation, generalization, and efficiency) and analyzing current research against these standards. It synthesizes existing findings to highlight the strengths and limitations of current LLM planners and identifies key directions for future research in the field.
GRAG introduces a framework that enables Large Language Models to effectively utilize textual graphs for Retrieval-Augmented Generation by efficiently retrieving relevant subgraphs and integrating their structural and textual information. This method significantly enhances multi-hop reasoning and factual accuracy, while also reducing the computational costs typically associated with LLM fine-tuning.
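A sketch of the retrieval step under one plausible setup: k-hop ego-graphs around each node are pre-embedded (e.g. by pooling their node and edge text embeddings), and the closest subgraphs to the query are returned. The embedding scheme here is an assumption, and GRAG additionally integrates both structural and textual views of the retrieved subgraphs into generation:

```python
import numpy as np

def retrieve_subgraphs(query_emb: np.ndarray,
                       subgraph_embs: np.ndarray,
                       k: int = 5) -> np.ndarray:
    """Return indices of the top-k subgraphs by cosine similarity.

    query_emb: (d,) embedding of the question.
    subgraph_embs: (m, d) precomputed ego-graph embeddings.
    """
    sims = subgraph_embs @ query_emb
    sims /= (np.linalg.norm(subgraph_embs, axis=1)
             * np.linalg.norm(query_emb) + 1e-9)
    return np.argsort(-sims)[:k]  # most relevant subgraphs first
```

Retrieving subgraphs rather than isolated text chunks is what preserves the multi-hop structure the LLM needs at generation time.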
Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose ACNCHOR, a reinforcement learning method that injects ground-truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
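A minimal sketch of the anchor-injection idea for group-based RL such as GRPO; the group size, injection slot, and function interfaces are assumptions drawn from the description above:

```python
import random
from typing import Callable, List

def rollouts_with_anchor(prompt: str,
                         sample: Callable[[str], str],
                         ground_truth: str,
                         group_size: int = 8) -> List[str]:
    """Build a GRPO-style rollout group with one ground-truth anchor.

    Replacing one sampled trajectory with the ground-truth trajectory
    guarantees the group contains a positive-reward sample, so the
    group-relative advantage cannot be dominated by negative rewards
    early in training.
    """
    group = [sample(prompt) for _ in range(group_size)]
    group[random.randrange(group_size)] = ground_truth
    return group
```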
Researchers from Emory University and partner institutions develop Collab-RAG, a framework that enables efficient collaboration between small and large language models for complex question answering, achieving 1.8-14.2% performance improvements across five multi-hop QA datasets through automated query decomposition and iterative preference optimization.
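A sketch of the decompose-retrieve-answer loop, with the model and retriever interfaces left abstract (all names here are assumptions; the paper additionally trains the small model with iterative preference optimization on feedback from the large one):

```python
from typing import Callable, List

def collab_rag_answer(question: str,
                      small_lm_decompose: Callable[[str], List[str]],
                      retrieve: Callable[[str], str],
                      large_lm_answer: Callable[[str], str]) -> str:
    """Small LM decomposes; large LM answers over gathered evidence."""
    context = []
    for sub_q in small_lm_decompose(question):
        evidence = retrieve(sub_q)
        context.append(f"Sub-question: {sub_q}\nEvidence: {evidence}")
    final_prompt = ("\n\n".join(context)
                    + f"\n\nOriginal question: {question}")
    return large_lm_answer(final_prompt)
```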
AgentPrune presents a framework that addresses the high inference costs of LLM-based multi-agent systems by identifying and eliminating redundant communication paths. The approach reduces token consumption by 28.1% to 72.8% while maintaining comparable or improved task performance and enhancing robustness against adversarial attacks.
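A sketch of the pruning step, assuming per-edge importance scores over inter-agent message edges have already been learned; AgentPrune obtains such scores via a trainable graph mask, which is omitted here:

```python
import numpy as np

def prune_topology(edge_importance: np.ndarray,
                   keep_ratio: float = 0.5) -> np.ndarray:
    """One-shot pruning: keep only the top fraction of communication edges.

    edge_importance: (n, n) learned scores for directed agent pairs.
    Returns a binary mask; pruned edges carry no messages in later rounds,
    which is the source of the token savings.
    """
    n_keep = max(1, int(keep_ratio * edge_importance.size))
    flat = edge_importance.ravel()
    threshold = np.partition(flat, -n_keep)[-n_keep]
    return (edge_importance >= threshold).astype(float)
```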