MeanFlow, developed by researchers at CMU and MIT, introduces a generative modeling framework based on learning "average velocity" to enable efficient one-step image generation. The model achieves an FID of 3.43 with a single function evaluation on ImageNet 256x256, demonstrating improved performance over prior one-step methods and approaching the quality of multi-step diffusion models.
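At its core is the MeanFlow identity, which relates the average velocity over an interval [r, t] to the instantaneous velocity: u(z, r, t) = v(z, t) - (t - r) du/dt. Below is a minimal PyTorch sketch of the resulting loss, assuming the linear interpolation path z_t = (1 - t)x + t*eps; the `u_net` wrapper, tensor shapes, and loss weighting are placeholder assumptions, not the paper's exact implementation.

```python
import torch
from torch.func import jvp

def meanflow_loss(u_net, x, eps, r, t):
    # Sketch of the MeanFlow objective, assuming z_t = (1 - t) x + t eps.
    # r and t are broadcastable against x (e.g. shape [B, 1, 1, 1]).
    z = (1 - t) * x + t * eps      # point on the interpolation path
    v = eps - x                    # instantaneous (conditional) velocity
    # Total derivative du/dt along the flow via one JVP, with tangent
    # (dz/dt, dr/dt, dt/dt) = (v, 0, 1):
    u, dudt = jvp(u_net, (z, r, t),
                  (v, torch.zeros_like(r), torch.ones_like(t)))
    # MeanFlow identity: u(z, r, t) = v(z, t) - (t - r) * d/dt u(z, r, t)
    target = (v - (t - r) * dudt).detach()   # stop-gradient on the target
    return ((u - target) ** 2).mean()
```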
Improved Mean Flows (iMF) enhances one-step generative image models by stabilizing training with an improved objective and enabling flexible classifier-free guidance. The framework achieves a 1-NFE FID of 1.72 on ImageNet 256x256 without distillation, outperforming prior one-step methods while reducing model size by one-third.
Amazon FAR researchers developed OmniRetarget, a framework that generates high-fidelity, physically plausible, and interaction-preserving kinematic trajectories from human demonstrations for humanoid robots. This approach enables the training of complex loco-manipulation and scene interaction skills with zero-shot sim-to-real transfer and a minimal reinforcement learning formulation.
OSWorld, developed by The University of Hong Kong and collaborators including Salesforce Research, introduces the first scalable benchmark for evaluating multimodal AI agents in real computer environments. It reveals a substantial performance gap between state-of-the-art models and humans on open-ended tasks: agents achieve only 12.24% success, compared with 72.36% for humans.
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: this https URL
OpenHands is an open-source platform facilitating the development, evaluation, and deployment of generalist AI agents that interact with digital environments by writing code, using command lines, and browsing the web. Its CodeAct agent achieved competitive performance across 15 diverse benchmarks, including software engineering, web browsing, and general assistance tasks, without task-specific modifications.
DataComp-LM introduces a standardized, large-scale benchmark for evaluating language model training data curation strategies, complete with an openly released corpus, framework, and models. Its DCLM-BASELINE 7B model, trained on carefully filtered Common Crawl data, achieves 64% MMLU 5-shot accuracy, outperforming previous open-data state-of-the-art models while requiring substantially less compute.
Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.
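A sketch of the Stage 2 hybrid rollout described above, with hypothetical `env`, `vla_policy`, and `residual_actor` callables; the switching rule and probability are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def hybrid_rollout(env, vla_policy, residual_actor, probe_prob=0.5):
    # PLD Stage 2 (sketch): the generalist proposes actions; with some
    # probability a lightweight residual correction (trained with RL in
    # Stage 1 to probe failure regions) is added, so trajectories stay
    # close to the generalist's deployment distribution while still
    # capturing recovery behaviors.
    obs = env.reset()
    traj, done = [], False
    while not done:
        act = vla_policy(obs)                    # generalist action
        if np.random.rand() < probe_prob:
            act = act + residual_actor(obs, act) # residual correction
        traj.append((obs, act))
        obs, _, done, _ = env.step(act)
    return traj  # Stage 3 distills successful trajectories via SFT
```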
D-REX introduces a new benchmark for detecting deceptive reasoning in large language models, where models generate benign outputs while their internal thought processes execute malicious instructions. This benchmark, built from competitive red-teaming, reveals that current frontier models are highly susceptible to such attacks, highlighting the limitations of output-centric safety evaluations.
SVDQuant introduces a 4-bit post-training quantization method for diffusion models that absorbs outliers using a high-precision low-rank component, coupled with the Nunchaku inference engine. This approach enables state-of-the-art image quality while achieving up to 3.6x memory reduction and 10.1x end-to-end inference speedup across various models and hardware.
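The central idea is the decomposition W ≈ L1 L2 + Q(R): a small high-precision low-rank branch absorbs the outliers, leaving a flatter residual that quantizes well to 4 bits. A minimal sketch with a single per-tensor scale; the paper's outlier smoothing and Nunchaku kernels are omitted, and all names here are illustrative.

```python
import torch

def lowrank_plus_4bit(W, rank=32):
    # Sketch of the decomposition W ~= L1 @ L2 + dequant(Rq): keep the
    # top singular directions in 16-bit so they absorb outliers, then
    # quantize the (much flatter) residual to signed 4-bit values.
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    L1 = (U[:, :rank] * S[:rank]).half()   # high-precision branch
    L2 = Vh[:rank, :].half()
    R = W.float() - (L1 @ L2).float()      # residual weight
    qmax = 7                               # signed 4-bit range [-8, 7]
    scale = R.abs().max() / qmax
    Rq = torch.clamp((R / scale).round(), -8, 7).to(torch.int8)
    return L1, L2, Rq, scale

# Inference (sketch): y = x @ L2.T @ L1.T + x @ (Rq.float() * scale).T
```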
SWE-Gym is introduced as the first publicly available training environment combining real-world software engineering tasks with executable test verification. Fine-tuning open-weight language models on trajectories collected within SWE-Gym enables substantial performance gains for software engineering agents and allows for effective inference-time scaling via learned verifiers.
Amazon FAR researchers developed ResMimic, a two-stage residual learning framework that transforms general motion tracking into precise humanoid whole-body loco-manipulation. The system achieved a 92.5% average task success rate in simulation, significantly outperforming baselines, and enabled a Unitree G1 robot to carry heavy and irregularly shaped objects using whole-body contact in real-world scenarios.
BridgeData V2 is a large-scale, diverse dataset for robot learning, comprising over 60,000 real-world robot trajectories collected across 24 environments and 13 skills using a low-cost robot. Policies trained on this dataset generalize to unseen objects and environments and can transfer to independent institutions with varied setups, demonstrating the benefits of data scale and diversity.
A systematic analysis of multi-stage large language model training dynamics investigates how design choices across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning shape model capabilities. It provides a transparent framework and introduces outcome reward model scores as a reliable proxy for evaluating generative tasks.
This research introduces a cross-embodiment learning framework that leverages a large dataset of egocentric human demonstrations (PH²D), collected with consumer-grade VR devices, to train robust humanoid robot manipulation policies. A unified Human Action Transformer (HAT) policy co-trained on human and robot data improves generalization to novel objects, backgrounds, and placements by nearly 100% in out-of-distribution settings, and enables efficient few-shot transfer across different robot platforms.
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy that progressively teaches the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns. Notably, it achieves an average performance gain of 20.7% over supervised fine-tuning baselines.
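One plausible form of the curriculum-guided balance described above, written as a reward sketch; this is an assumed illustration, not the paper's actual reward formula.

```python
def attending_reward(answer, specialist_answers, gold, step, total_steps):
    # Curriculum-guided reward (assumed sketch): an imitation weight
    # alpha decays over training, shifting the attending physician from
    # copying specialist judgments toward pure answer correctness, so
    # it learns when to override specialist mistakes.
    alpha = max(0.0, 1.0 - step / total_steps)
    agree = float(answer in specialist_answers)
    correct = float(answer == gold)
    return alpha * agree + (1.0 - alpha) * correct
```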
The DoorMan framework, developed by researchers from NVIDIA, UC Berkeley, CMU, and CUHK, enables a humanoid robot to autonomously open diverse real-world doors using only egocentric RGB vision. This achievement results from policies trained entirely in a massively randomized, photorealistic simulation, demonstrating superior performance to human teleoperation.
Speculative Knowledge Distillation (SKD) introduces an interleaved sampling method for LLM compression, dynamically blending teacher-guided corrections with student-generated tokens. This approach consistently outperforms existing knowledge distillation techniques, achieving substantial gains across diverse tasks and data regimes while providing more stable training.
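A sketch of the interleaved sampling loop, assuming HuggingFace-style causal LMs that return `.logits`; the top-k acceptance rule follows the method's description, while the hyperparameters and names are invented for illustration.

```python
import torch

@torch.no_grad()
def skd_generate(student, teacher, ids, max_new_tokens=64, top_k=25):
    # Speculative KD sampling (sketch): the student proposes each token;
    # proposals outside the teacher's top-k are replaced by a token
    # sampled from the teacher, yielding on-policy-like data with
    # teacher-guided corrections.
    for _ in range(max_new_tokens):
        s_probs = student(ids).logits[:, -1].softmax(-1)
        t_logits = teacher(ids).logits[:, -1]
        tok = torch.multinomial(s_probs, 1)                 # student proposal
        in_topk = (t_logits.topk(top_k).indices == tok).any(-1, keepdim=True)
        t_tok = torch.multinomial(t_logits.softmax(-1), 1)  # teacher fallback
        tok = torch.where(in_topk, tok, t_tok)
        ids = torch.cat([ids, tok], dim=-1)
    return ids  # distill the student on teacher logits at these positions
```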
This work reinterprets contrastive learning as a goal-conditioned reinforcement learning algorithm, demonstrating that the inner product of learned representations can directly serve as a Q-function. The proposed Contrastive RL (CR) methods achieve superior performance on image-based and offline goal-conditioned tasks, often without requiring auxiliary representation learning losses or explicit data augmentation.
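The idea reduces to treating the inner product of a state-action embedding and a goal embedding as the Q-value, trained contrastively with future states from the same trajectory as positives and other batch elements as negatives. A minimal sketch (encoder names and batching assumed):

```python
import torch
import torch.nn.functional as F

def contrastive_rl_loss(phi, psi, s, a, g):
    # Critic as inner product: Q(s, a, g) = phi(s, a) . psi(g).
    # Each row's positive goal g[i] is a future state from its own
    # trajectory; the other rows in the batch serve as negatives.
    f = phi(s, a)                       # [B, d] state-action features
    h = psi(g)                          # [B, d] goal features
    logits = f @ h.T                    # [B, B] pairwise Q-values
    labels = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, labels)
```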
A new approach leverages large language models (LLMs) for zero-shot time series forecasting by encoding numerical data as strings of digits, treating it as a next-token prediction task. This method, LLMTIME, demonstrated competitive performance against specialized models on various benchmarks, particularly excelling in probabilistic forecasting.
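The encoding is the crux: values are rendered at fixed precision with the decimal point dropped and digits space-separated so GPT-style tokenizers split them consistently, with timesteps joined by a delimiter. A small sketch; the paper's rescaling and model-specific tokenizer handling are omitted.

```python
def encode_series(values, prec=2):
    # LLMTime-style string encoding (sketch): fixed precision, decimal
    # point dropped (precision is implied), digits space-separated so
    # each digit becomes its own token, timesteps joined with " , ".
    tokens = []
    for v in values:
        digits = f"{abs(v):.{prec}f}".replace(".", "")
        tokens.append(("-" if v < 0 else "") + " ".join(digits))
    return " , ".join(tokens)

print(encode_series([0.64, 1.23, -0.5]))
# -> "0 6 4 , 1 2 3 , -0 5 0"
```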