alphaXiv

Tsinghua University Harbin Institute of Technology

Huazhong University of Science and Technology

This survey paper systematically synthesizes advancements in Reinforcement Learning (RL) for Large Reasoning Models (LRMs), moving beyond human alignment to focus on enhancing intrinsic reasoning capabilities through verifiable rewards. It identifies key components, challenges, and future directions for scaling RL towards Artificial SuperIntelligence (ASI).

1,595

6,198

08 Nov 2025

agentic-frameworks agents computer-science

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

University of Illinois at Urbana-Champaign

University of California, Santa Barbara

Chinese Academy of Sciences

Imperial College London Shanghai AI Laboratory

National University of Singapore

University College London

University of Oxford

University of Science and Technology of China

University of Bristol

The Chinese University of Hong Kong

University of California, San Diego Dalian University of Technology

University of Georgia

Brown University

A comprehensive survey formally defines Agentic Reinforcement Learning (RL) for Large Language Models (LLMs) as a Partially Observable Markov Decision Process (POMDP), distinct from conventional LLM-RL, and provides a two-tiered taxonomy of capabilities and task domains. The work consolidates open-source resources and outlines critical open challenges for the field.

3,812

04 Nov 2025

computer-science artificial-intelligence computation-and-language

FlowRL: Matching Reward Distributions for LLM Reasoning

Renmin University of China

Stanford University

Toyota Technological Institute at Chicago

Microsoft

daixuan cheng

FlowRL presents a policy optimization algorithm for large language models that leverages GFlowNet principles to match reward distributions rather than merely maximizing expected reward. This approach yielded superior performance on math and code reasoning benchmarks and notably increased the diversity of generated solutions.

5,120

27 Aug 2025

agents computer-science computer-vision-and-pattern-recognition

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

computer-science artificial-intelligence computer-vision-and-pattern-recognition

InternVL3.5, developed by the InternVL Team at Shanghai AI Laboratory, introduces a new family of open-source multimodal models that achieves state-of-the-art results across 35 benchmarks, narrowing the performance gap with commercial systems to 3.9%, while delivering a 4.05x inference speedup through novel architectural and training strategies.

9,272

20,038

27 Oct 2025

Flow-GRPO: Training Flow Matching Models via Online RL

Tsinghua University CUHK MMLab Kuaishou Technology

Researchers from MMLab, Tsinghua University, Kuaishou Technology, and Shanghai AI Lab developed Flow-GRPO, a framework that integrates online policy gradient reinforcement learning into flow matching models. This method significantly enhances capabilities in compositional image generation, visual text rendering, and human preference alignment, achieving up to 95% GenEval accuracy on SD3.5-M by addressing challenges of determinism and sampling efficiency through an ODE-to-SDE conversion and denoising reduction.

2,204

05 Dec 2024

computer-science computer-vision-and-pattern-recognition efficient-transformers

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Shanghai AI Laboratory Xiamen University

Tencent

FlashSloth, developed by researchers from Xiamen University, Tencent Youtu Lab, and Shanghai AI Laboratory, introduces a Multimodal Large Language Model (MLLM) architecture that significantly improves efficiency through embedded visual compression. The approach reduces visual tokens by 80-89% and achieves 2-5 times faster response times, while maintaining highly competitive performance across various vision-language benchmarks.

2,143

23 May 2024

computer-science computer-vision-security artificial-intelligence

Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials

The Chinese University of Hong Kong

Stanford University

Nanyang Technological University Fudan Unversity

Researchers from Fudan University and Shanghai AI Laboratory introduce "Make-it-Real," a framework that leverages GPT-4V to automatically paint 3D objects with realistic materials from albedo-only inputs. It generates a full suite of SVBRDF maps, achieving up to 77.8% human user preference and 84.8% GPT evaluation preference for refined objects over unrefined ones, significantly enhancing visual authenticity.

176

11,130

28 May 2025

computer-science artificial-intelligence computation-and-language

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

University of Illinois at Urbana-Champaign

CUHK Shanghai AI Laboratory

computer-science artificial-intelligence computation-and-language

This paper identifies and characterizes a universal policy entropy collapse in reinforcement learning for large language models (LLMs), revealing an empirical law that links performance to entropy. It further provides a mechanistic understanding of this phenomenon through covariance analysis and proposes two covariance-aware regularization methods, Clip-Cov and KL-Cov, which successfully maintain higher entropy and improve LLM reasoning performance on math and coding tasks.

312

2,295

04 Sep 2025

Towards a Unified View of Large Language Model Post-Training

Tsinghua University WeChat AI

Researchers from Tsinghua University and Shanghai AI Lab introduce a unified theoretical framework that views various Large Language Model (LLM) post-training algorithms as a single optimization process. Based on this framework, they propose Hybrid Post-Training (HPT), a dynamic algorithm that adaptively switches between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) signals, achieving state-of-the-art performance on mathematical reasoning benchmarks.

145

74,104

22 Jun 2025

computer-science artificial-intelligence computation-and-language

Learning to Reason under Off-Policy Guidance

The Chinese University of Hong Kong

Westlake University

Yafu Li

Jianhao Yan

LUFFY introduces a framework that enhances Large Reasoning Models (LRMs) by integrating off-policy guidance into Reinforcement Learning with Verifiable Rewards (RLVR). This approach enables LRMs to acquire new reasoning capabilities from stronger external policies, achieving state-of-the-art performance on math benchmarks, superior generalization on out-of-distribution tasks, and successfully training weaker foundation models where on-policy methods fail.

333

912

09 Sep 2025

computer-science computer-vision-and-pattern-recognition robotics

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Shanghai AI Laboratory Harbin Institute of Technology (Shenzhen)

F₁, a Vision-Language-Action (VLA) model, integrates explicit visual foresight into its decision-making process, moving beyond purely reactive control. This approach yields enhanced robustness in dynamic environments and improved generalization across a range of real-world and simulated robotic manipulation tasks.

103

15,517

19 Apr 2025

computer-science computer-vision-and-pattern-recognition few-shot-learning

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

The Chinese University of Hong Kong SenseTime Research

Huiser Wang

jie shao

InternVL3 establishes a new native multimodal pre-training paradigm for MLLMs, allowing the model to jointly acquire visual and linguistic capabilities from the outset. This approach achieves state-of-the-art performance among open-source models, reaching 72.2 on the MMMU benchmark, and demonstrates strong competitiveness with leading proprietary models across a wide range of multimodal tasks.

7,602

782

07 Oct 2025

computer-science computer-vision-and-pattern-recognition

Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

chain-of-thought computer-science artificial-intelligence

Lumina-DiMOO is an omni diffusion large language model employing fully discrete diffusion for unified multi-modal generation and understanding, achieving state-of-the-art performance across text-to-image, image-to-image, and visual understanding benchmarks, while also demonstrating a 32x speed improvement in T2I generation compared to leading autoregressive models.

806

717

25 Sep 2025

SIM-CoT: Supervised Implicit Chain-of-Thought

The Chinese University of Hong Kong Shanghai Innovation Institute

SIM-CoT stabilizes and enhances implicit Chain-of-Thought reasoning in large language models by integrating fine-grained, step-level supervision for latent tokens during training. It addresses latent instability, achieves higher accuracy than explicit CoT in some settings while preserving inference efficiency, and offers unprecedented interpretability into the model's internal thought processes.

621

13 Oct 2025

computer-science artificial-intelligence computation-and-language

From $f(x)$ and $g(x)$ to $f(g(x))$ : LLMs Learn New Skills in RL by Composing Old Ones

University of Illinois at Urbana-Champaign Shanghai AI Laboratory

computer-science artificial-intelligence robotics

Researchers at University of Illinois Urbana-Champaign, Tsinghua University, Peking University, and Shanghai AI Laboratory provide evidence that large language models (LLMs) can acquire new, generalizable compositional skills through reinforcement learning (RL) post-training. Their controlled synthetic experiments show that RL enables LLMs to compose atomic skills for complex tasks, demonstrating significant generalization to unseen difficulties and cross-task transfer, a capability not achieved by supervised fine-tuning.

4,033

19 May 2025

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

autonomous-vehicles computer-science computer-vision-and-pattern-recognition

Yan Ding

Shanghai AI Laboratory's SpatialVLA introduces novel spatial representations for Vision-Language-Action models, equipping them with profound 3D spatial understanding via Ego3D Position Encoding for observations and unifying actions via Adaptive Action Grids. The model achieves state-of-the-art zero-shot performance on diverse real-world robot tasks and demonstrates efficient adaptation to new robot setups, surpassing larger models like RT-2-X with a smaller parameter count.

297

4,332

23 Mar 2023

Planning-oriented Autonomous Driving

Wuhan University Shanghai AI Laboratory SenseTime Research

Chonghao Sima

Lin Tianwei

This paper introduces UniAD, a comprehensive, planning-oriented end-to-end framework for autonomous driving that integrates perception, prediction, and planning tasks into a single neural network. It achieves state-of-the-art performance across all integrated tasks on the nuScenes benchmark, demonstrating improved accuracy in motion forecasting and planning safety over previous methods.

3,769

578

16 Sep 2025

computer-science continual-learning artificial-intelligence

Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning

Shanghai AI Laboratory East China Normal University The Chinese University Of Hongkong

The PeCL framework enables privacy-enhanced continual learning for Large Language Models by introducing a novel token-level dynamic differential privacy mechanism and a privacy-guided memory sculpting module. This approach allows models to selectively forget sensitive information while efficiently retaining crucial task-invariant knowledge, outperforming existing baselines in balancing utility and privacy.

1,148

15 Oct 2025

agents computer-science artificial-intelligence

EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control

Northwestern Polytechnical University Shanghai AI Laboratory

Fudan University AgiBot EO-Robotics Team

The EO-Robotics Team developed EO-1, a 3B parameter embodied foundation model, employing a unified architecture and interleaved vision-text-action pretraining for general robot control. The model achieved state-of-the-art performance, surpassing GPT-4o and Gemini 1.5 Flash in overall embodied reasoning, and demonstrated an 86.0% completion rate across 28 diverse real-world manipulation tasks.

3,666

26 Sep 2025

computer-science computer-vision-and-pattern-recognition multi-modal-learning

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling