Dartmouth College
Researchers developed a three-stage pipeline to assess how well foundation models transfer to precision medicine applications involving physiological signals, using BioGears for synthetic data generation and evaluating embedding quality. An initial application to the Moirai time-series foundation model revealed limitations in zero-shot transfer, including spurious correlations, poor signal reconstruction, and distorted temporal dynamics in the physiological embeddings.
Researchers from The Hong Kong Polytechnic University, Dartmouth College, Max Planck Institute, Google DeepMind, and others developed Prophet, a training-free adaptive decoding paradigm for Diffusion Language Models (DLMs) that leverages early answer convergence. The method achieves up to 3.4 times faster inference by dynamically committing to answers when model confidence is high, often improving output quality compared to full-step decoding.
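The early-commit mechanism is simple to illustrate. Below is a minimal sketch of confidence-gated decoding for a masked diffusion LM, not Prophet's actual implementation: the mask-token id, the confidence threshold, and the one-token-per-step unmasking schedule are all illustrative assumptions.

```python
import torch

MASK_ID = -1  # placeholder mask-token id, chosen outside the vocab range (assumption)

def early_commit_decode(model, x, num_steps=64, threshold=0.9):
    """Iterative unmasking with an early-exit check: if every still-masked
    position is already predicted with high confidence, commit to the argmax
    answer at once instead of running the remaining refinement steps."""
    for step in range(num_steps):
        masked = x == MASK_ID
        if not masked.any():
            break
        logits = model(x, step)                      # (seq_len, vocab_size)
        top_p, top_tok = logits.softmax(-1).max(-1)  # per-position confidence
        if top_p[masked].min() >= threshold:         # all masked positions confident:
            return torch.where(masked, top_tok, x)   # commit early, skip remaining steps
        # otherwise unmask only the single most confident masked position
        scores = torch.where(masked, top_p, torch.full_like(top_p, -1.0))
        idx = scores.argmax()
        x[idx] = top_tok[idx]
    return x

def toy_model(x, step):                              # random logits standing in for a DLM
    torch.manual_seed(step)
    return torch.randn(x.numel(), 100)

print(early_commit_decode(toy_model, torch.full((8,), MASK_ID)))
```

The speedup comes from the early return: when the model's answer has already converged, the remaining refinement steps are skipped entirely.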
The paper introduces "Trajectory Fields" as a novel 4D video representation, mapping each pixel's continuous 3D path over time. The "Trace Anything" neural network predicts these fields in a single pass, achieving state-of-the-art performance in dynamic scene understanding and dense 3D tracking while being orders of magnitude faster than prior methods.
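Concretely, a trajectory field assigns each pixel of a reference frame a 3D position at every time. A minimal sketch of storing and querying such a field follows; the dense (H, W, T, 3) layout and linear time interpolation are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

def query_trajectory_field(field, u, v, t):
    """field: (H, W, T, 3) array giving the 3D point of reference pixel
    (u, v) at each of T discrete timestamps. Continuous-time queries are
    answered here by linear interpolation between neighboring timestamps."""
    T = field.shape[2]
    t0 = int(np.clip(np.floor(t), 0, T - 1))
    t1 = min(t0 + 1, T - 1)
    w = t - t0
    return (1 - w) * field[v, u, t0] + w * field[v, u, t1]

# toy field: every pixel translates along +x at unit speed
H, W, T = 4, 4, 10
field = np.zeros((H, W, T, 3), dtype=np.float32)
field[..., 0] = np.arange(T)                           # x grows linearly with time
print(query_trajectory_field(field, u=1, v=2, t=3.5))  # -> [3.5, 0., 0.]
```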
Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions (problem difficulty, generator capability, and verifier generation capability), with empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than strong generators; (3) Verification ability is generally correlated with the verifier's own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities to optimize basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.5%). Second, we identify cases where strong verifiers offer limited advantage over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
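The generate-then-verify loop under study is straightforward to sketch. In the snippet below, stub functions stand in for the generator and verifier LLM calls (both stubs are assumptions for illustration); the selection logic mirrors the reference-free verification setup described in the abstract.

```python
import random

def generate(problem: str, n: int) -> list[str]:
    """Stub for an LLM generator sampling n candidate solutions."""
    return [f"candidate {i} for: {problem}" for i in range(n)]

def verify(problem: str, candidate: str) -> tuple[str, bool]:
    """Stub for a generative verifier: a real one would emit chain-of-thought
    reasoning and end with a binary verdict; here the verdict is random."""
    cot = f"Checking {candidate!r} step by step..."
    return cot, random.random() < 0.5

def verified_best_of_n(problem: str, n: int = 8) -> str:
    """Reference-free test-time scaling: sample n candidates, return the
    first one the verifier certifies, falling back to candidate 0."""
    candidates = generate(problem, n)
    for cand in candidates:
        _cot, verdict = verify(problem, cand)
        if verdict:
            return cand
    return candidates[0]

print(verified_best_of_n("What is 17 * 24?"))
```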
TimeSformer introduces a convolution-free video understanding architecture that leverages self-attention mechanisms to achieve state-of-the-art action recognition performance on multiple benchmarks with significantly improved training and inference efficiency.
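Its best-known variant factorizes self-attention into separate temporal and spatial steps ("divided space-time attention"). The block below is a minimal PyTorch sketch of that factorization; the dimensions, pre-norm layout, and use of nn.MultiheadAttention are illustrative choices rather than the paper's exact code.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention across frames per patch position, then spatial
    attention across patches per frame, each with a residual connection."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                            # x: (B, T, N, D)
        B, T, N, D = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)   # group by patch position
        h = self.norm1(xt)
        xt = xt + self.time_attn(h, h, h)[0]              # attend over time
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        xs = x.reshape(B * T, N, D)                       # group by frame
        h = self.norm2(xs)
        xs = xs + self.space_attn(h, h, h)[0]             # attend over space
        return xs.reshape(B, T, N, D)

blk = DividedSpaceTimeBlock()
print(blk(torch.randn(2, 8, 16, 64)).shape)          # 2 clips, 8 frames, 16 patches
```

Factorizing over time and space separately reduces attention cost relative to joint space-time attention, which is where the efficiency gains come from.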
Dartmouth College researchers developed MA-RAG, a training-free multi-agent framework that uses collaborative Chain-of-Thought reasoning to improve Retrieval-Augmented Generation for complex information-seeking tasks. The system achieves state-of-the-art performance, with a Llama3-70B version scoring 59.5 EM on NQ and 52.1 EM on HotpotQA, while demonstrating strong generalization to medical and web search benchmarks.
This survey from Dartmouth College, Adobe Research, and other institutions offers a comprehensive overview of personalized Large Language Models, proposing a unified framework that bridges personalized text generation and downstream applications. It introduces multi-dimensional taxonomies for usage, granularity, techniques, evaluation, and datasets, providing a structured understanding of the field.
Large language models, particularly PaLM-540B, demonstrate the ability to perform complex, multi-step reasoning across diverse languages, including those less represented in pre-training data, when prompted with Chain-of-Thought. The research shows that English Chain-of-Thought can effectively elicit these reasoning capabilities in multilingual contexts, achieving an average solve rate of 55% on a new multilingual arithmetic reasoning benchmark.
A survey by researchers from the University of Oregon, Carnegie Mellon University, Adobe Research, and Meta AI provides the first dedicated examination of Small Language Models (SLMs). The work introduces a structured taxonomy and outlines key techniques, applications, and challenges in balancing model performance with practical deployment considerations.
Researchers from Dartmouth College and the Allen Institute for AI developed Genesys, an LLM-driven system that automates the discovery of novel language model architectures by simulating the scientific research process. The system successfully proposed and verified over 1,000 unique designs, with top-performing discoveries demonstrating competitive performance against human-designed state-of-the-art models like GPT-2 and Mamba-2 on multiple benchmarks.
This survey paper by Junda Wu and colleagues systematically categorizes and examines visual prompting techniques for Multimodal Large Language Models (MLLMs). It outlines how visual cues, such as bounding boxes and pixel-level masks, enhance MLLMs' abilities in visual grounding, object referring, and compositional reasoning by providing direct control over model attention.
Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data can be found at our repo (this https URL).
Researchers from the University of California, Berkeley, Oxford University, and Dartmouth develop LIFT (Low-rank Informed Sparse Fine-Tuning), which identifies "Principal Weights" by computing a rank-r SVD approximation of each weight matrix and selecting the largest-magnitude parameters from the low-rank reconstruction for sparse fine-tuning. LIFT outperforms full fine-tuning on reasoning tasks (2.02% higher on GPQA Diamond with Qwen-2.5; 1.14-1.60% higher on MATH-10K with LLaMA models) while reducing optimizer memory overhead from 27GB to 1.3GB for LLaMA-2-7B, via dynamic mask updates that store optimizer states only for the selected sparse parameters. Empirical validation shows that perturbing LIFT-identified weights drastically degrades performance, while random parameter perturbation has minimal impact.
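The selection rule summarized above is compact enough to sketch directly: take a rank-r SVD approximation of a weight matrix, then keep the largest-magnitude entries of the low-rank reconstruction as the trainable set. The rank and density values below are illustrative assumptions, not the paper's hyperparameters.

```python
import torch

def principal_weight_mask(W: torch.Tensor, rank: int = 16, density: float = 0.05):
    """Rank-r approximate W via SVD, then mark the `density` fraction of
    positions with the largest magnitudes in the low-rank reconstruction:
    these are the "Principal Weights" selected for sparse fine-tuning."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_r = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    k = max(1, int(density * W.numel()))
    thresh = W_r.abs().flatten().topk(k).values[-1]
    return W_r.abs() >= thresh                        # boolean mask of trainable entries

mask = principal_weight_mask(torch.randn(512, 512))
print(mask.float().mean())                            # ~0.05 of entries selected
```

During training, gradients are restricted to the masked entries, so optimizer states exist only for them; that restriction is the source of the 27GB-to-1.3GB memory reduction cited above.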
Researchers from leading institutions and industrial labs provided the first comprehensive survey on LLM-based Active Learning, introducing a unifying taxonomy that organizes current techniques by their data querying and annotation strategies. The work highlights LLMs' expanded roles from traditional data selection to data generation and cost-effective annotation, while also identifying critical open challenges for future research.
Researchers from Shanghai AI Laboratory and collaborators introduce DevEval, a comprehensive framework to evaluate large language models (LLMs) across the full software development lifecycle. Their case study reveals that current LLMs, including GPT-4-Turbo, struggle significantly with repository-level implementation tasks, achieving less than a 10% pass rate, while demonstrating the critical role of execution feedback for performance improvement.
We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and more than 30 of their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.
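The core capability, extracting intermediate outputs from arbitrary layers during a forward pass, rests on a standard PyTorch mechanism. The sketch below shows that underlying forward-hook pattern; it is not VLM-Lens's actual API (which wraps this behind a YAML configuration), only the generic mechanism such a toolkit builds on.

```python
import torch

def capture_outputs(model: torch.nn.Module, layer_names: set[str]):
    """Register forward hooks on the named submodules so their outputs are
    stashed in a dict during the next forward pass. Returns the dict and
    the hook handles (remove them when done)."""
    captured, handles = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            def hook(mod, inp, out, name=name):
                captured[name] = out.detach() if torch.is_tensor(out) else out
            handles.append(module.register_forward_hook(hook))
    return captured, handles

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
captured, handles = capture_outputs(model, {"0", "1"})
model(torch.randn(4, 8))
print({k: v.shape for k, v in captured.items()})
for h in handles:
    h.remove()
```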
Researchers at Dartmouth College systematically investigated serial position effects (SPE) in large language models, demonstrating their widespread occurrence across both encoder-decoder and decoder-only architectures, with a predominant primacy effect in classification and a recency effect in summarization tasks. The study found that while basic prompt engineering was inconsistent, Chain-of-Thought reasoning offered a more reliable method for mitigating these biases.
AlphaPruning introduces a novel layer-wise pruning method for Large Language Models that leverages Heavy-Tailed Self-Regularization theory to allocate sparsity based on the spectral properties of weight matrices. This approach successfully prunes LLaMA-7B to 80% sparsity while maintaining perplexity, achieving a 3.06x speedup on CPUs and an average 4.6% gain on zero-shot tasks compared to uniform pruning baselines.
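The allocation idea can be sketched as follows: estimate a heavy-tailed exponent alpha for each layer from the eigenvalue spectrum of WᵀW, then vary each layer's sparsity around the global target so that layers with heavier spectral tails (smaller alpha, the HT-SR signature of better-trained layers) are pruned less. The Hill estimator and the linear mapping below are illustrative stand-ins, not the paper's exact recipe.

```python
import numpy as np

def hill_alpha(W: np.ndarray, tail_frac: float = 0.1) -> float:
    """Hill estimate of the power-law exponent of the eigenvalue spectrum
    of W.T @ W, a standard heavy-tailed self-regularization diagnostic."""
    eig = np.linalg.eigvalsh(W.T @ W)                 # ascending eigenvalues
    eig = eig[eig > 1e-12]
    k = max(2, int(tail_frac * len(eig)))
    tail = eig[-k:]
    return 1.0 + k / np.sum(np.log(tail / tail[0]))

def allocate_sparsity(layers, target=0.8, spread=0.2):
    """Per-layer sparsities roughly centered on `target`: lighter-tailed
    layers (larger alpha) are pruned more, heavier-tailed layers less."""
    alphas = np.array([hill_alpha(W) for W in layers])
    norm = (alphas - alphas.min()) / (np.ptp(alphas) + 1e-8)
    return target - spread / 2 + spread * norm

layers = [np.random.randn(256, 256) / 16.0 for _ in range(4)]
print(allocate_sparsity(layers))
```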
Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks, such as counting, localization, and simple forms of visual analogy, that humans perform with near-perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience, a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., to represent multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising due to the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.