alphaXiv

Fudan University

08 Nov 2025

agentic-frameworks agents computer-science

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

University of Illinois at Urbana-Champaign

University of California, Santa Barbara

Chinese Academy of Sciences

Imperial College London

Shanghai AI Laboratory

National University of Singapore

University College London

University of Oxford

Fudan University

University of Science and Technology of China

University of Bristol

The Chinese University of Hong Kong

University of California, San Diego

Dalian University of Technology

University of Georgia

Brown University

A comprehensive survey formally defines Agentic Reinforcement Learning (RL) for Large Language Models (LLMs) as a Partially Observable Markov Decision Process (POMDP), distinct from conventional LLM-RL, and provides a two-tiered taxonomy of capabilities and task domains. The work consolidates open-source resources and outlines critical open challenges for the field.
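
For reference, a POMDP is conventionally written as the tuple below, with the agent acting on its observation history because the true state is hidden. This is the textbook form, not the survey's own notation; its symbols and horizon conventions may differ.

```latex
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle,
\qquad T(s_{t+1} \mid s_t, a_t), \qquad O(o_{t+1} \mid s_{t+1}, a_t),

\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{H-1} \gamma^{t} R(s_t, a_t) \right],
\qquad a_t \sim \pi(\cdot \mid o_{0}, a_{0}, \ldots, a_{t-1}, o_{t})
```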

28 Apr 2025

computer-science contrastive-learning computer-vision-and-pattern-recognition

Perception Encoder: The best visual embeddings are not at the output of the network

UT Austin

Fudan University

Meta

MBZUAI

Meta Reality Labs

Daniel Bolya

Andrea Madotto

Perception Encoder introduces a family of vision models that achieve state-of-the-art performance across diverse vision and vision-language tasks, demonstrating that general, high-quality visual features can be extracted from the intermediate layers of a single, contrastively-trained network. It provides specific alignment tuning methods to make these features accessible for tasks ranging from zero-shot classification to dense spatial prediction and multimodal language understanding.

10 Sep 2025

computer-science artificial-intelligence computation-and-language

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Fudan University

ByteDance

Shanghai Innovation Institute

Rui Zheng

Honglin Guo

AgentGym-RL is an open-source framework for training large language model agents on long-horizon decision-making through multi-turn reinforcement learning, paired with a progressive interaction scaling strategy called ScalingInter-RL. With this approach, a 7B-parameter model reaches an average success rate comparable to or exceeding that of larger proprietary models across diverse environments, highlighting the impact of RL training on agentic intelligence.
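
ScalingInter-RL is described here only as a progressive interaction scaling strategy. A minimal sketch of that idea, assuming a stepwise curriculum over the maximum number of agent-environment turns per episode; `max_turns_for_step`, `SCHEDULE`, and the phase boundaries are hypothetical, not taken from AgentGym-RL:

```python
def max_turns_for_step(step: int, schedule: list[tuple[int, int]]) -> int:
    """Return the interaction-turn budget for a given training step.

    `schedule` is a list of (start_step, max_turns) phases sorted by start_step.
    The budget is the max_turns of the last phase that has already started.
    """
    budget = schedule[0][1]
    for start_step, max_turns in schedule:
        if step < start_step:
            break
        budget = max_turns
    return budget


# Hypothetical curriculum: short horizons first, then progressively longer ones.
SCHEDULE = [(0, 5), (200, 10), (400, 20)]

for step in (0, 150, 200, 399, 500):
    print(step, max_turns_for_step(step, SCHEDULE))  # budgets: 5, 5, 10, 10, 20
```

The intuition behind such a curriculum is that short episodes are cheaper and easier to learn from early in training; the phase lengths and budgets AgentGym-RL actually uses are not given in this summary.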

01 Aug 2025

agentic-frameworks agents ai-for-health

A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence

Princeton AI Lab

University of Illinois at Urbana-Champaign

University of California, Santa Barbara

Carnegie Mellon University

Fudan University

Shanghai Jiao Tong University

Tsinghua University

University of Michigan

The Chinese University of Hong Kong

The Hong Kong University of Science and Technology (Guangzhou)

University of California, San Diego

Pennsylvania State University

The University of Hong Kong

Princeton University

University of Sydney

Oregon State University

An extensive international collaboration offers the first systematic review of self-evolving agents, establishing a unified theoretical framework categorized by 'what to evolve,' 'when to evolve,' and 'how to evolve'. The work consolidates diverse research, highlights key challenges, and maps applications, aiming to guide the development of AI systems capable of continuous autonomous improvement.

20 Aug 2025

autonomous-vehicles computer-science computer-vision-security

Translating Images to Road Network: A Sequence-to-Sequence Perspective

Huawei Noah’s Ark Lab

Fudan University

Shanghai AI Lab

bzhou zhang

Researchers developed the RoadNet Sequence, a unified representation for encoding both geometric and topological road network information from multi-camera images. Their Transformer-based models, particularly the Non-Autoregressive RoadNetTransformer (NAR-RNTR), achieve real-time inference speeds, operating 47 times faster than the autoregressive baseline while demonstrating superior performance over existing road network extraction methods.

17 Oct 2024

ai-for-health computer-science computation-and-language

MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation

Shanghai Artificial Intelligence Laboratory

Fudan University

Shanghai Jiao Tong University

The MEDCARE framework introduces a two-stage fine-tuning pipeline to decouple clinical alignment from knowledge aggregation in medical large language models (LLMs). This approach leads to superior performance across over 20 diverse medical benchmarks, including both knowledge-intensive and alignment-required tasks, demonstrating robust efficiency and cross-lingual generalization.

06 Nov 2025

computer-science computation-and-language computer-vision-and-pattern-recognition

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Fudan University

The Chinese University of Hong Kong

Harbin Institute of Technology

Shanghai Innovation Institute

This research introduces "Thinking with Video," a new paradigm that leverages video generation for multimodal reasoning by enabling dynamic visualization and human-like imagination in problem-solving. It evaluates frontier video models like Sora-2 on a new, comprehensive benchmark, VideoThinkBench, showcasing their unexpected capabilities across vision and text-centric tasks.

27 Mar 2024

computer-science artificial-intelligence computation-and-language

Retrieval-Augmented Generation for Large Language Models: A Survey

Tongji University

Fudan University

This survey paper from researchers at Tongji University and Fudan University offers a comprehensive, systematic synthesis of Retrieval-Augmented Generation (RAG) for Large Language Models. It structures the field by delineating three evolutionary paradigms—Naive, Advanced, and Modular RAG—and details advancements across retrieval, generation, and augmentation components, while also providing a thorough framework for evaluating RAG systems.
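
The Naive RAG paradigm the survey describes follows an index, retrieve, then generate pipeline. A minimal, self-contained sketch of the retrieve and augment steps, assuming a toy word-overlap scorer in place of a real embedding retriever and stopping at prompt construction instead of calling an actual LLM; all function names are hypothetical:

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercased alphanumeric tokens; a toy stand-in for an embedding model.
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, doc: str) -> int:
    # Number of shared tokens, counted with multiplicity (multiset intersection).
    overlap = Counter(tokenize(query)) & Counter(tokenize(doc))
    return sum(overlap.values())


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank by overlap score; Python's sort is stable, so ties keep corpus order.
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    # Augmentation step: number the retrieved passages and prepend them as context.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


corpus = [
    "Fudan University is located in Shanghai.",
    "Retrieval-augmented generation conditions a language model on retrieved text.",
    "Coronary MR angiography images the arteries of the heart.",
]
query = "What does retrieval-augmented generation condition on?"
passages = retrieve(query, corpus, k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

A real system would replace the toy `score` with embedding similarity and pass `prompt` to a language model for the generation step.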

10 Jul 2025

chain-of-thought computer-science computation-and-language

A Survey on Latent Reasoning

University of Manchester

Fudan University

Nanjing University

Renmin University of China

Peking University

University of Wisconsin-Madison

Hong Kong Polytechnic University

University of California, Santa Cruz

M-A-P

This comprehensive survey from a large multi-institutional collaboration examines "Latent Reasoning" in Large Language Models, an emerging paradigm that performs multi-step inference entirely within the model's high-bandwidth continuous hidden states to overcome the limitations of natural language-based explicit reasoning. It highlights the significant bandwidth advantage of latent representations (approximately 2700x higher) and provides a unified taxonomy of current methodologies.
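
The bandwidth comparison is a back-of-the-envelope ratio between the bits one hidden-state vector can hold and the bits one sampled token can carry. Under assumed figures of a 32,768-token vocabulary and a 2,560-dimensional FP16 hidden state (these numbers are not stated in the summary above), the arithmetic lands in the same range:

```python
import math

# Assumed figures for illustration only.
VOCAB_SIZE = 32_768   # tokens in the vocabulary
HIDDEN_DIM = 2_560    # width of one hidden-state vector
BITS_PER_VALUE = 16   # FP16 storage per hidden-state component

# At most log2(V) bits of information in one token sampled from V choices.
bits_per_token = math.log2(VOCAB_SIZE)
# Raw storage capacity of one hidden-state vector.
bits_per_hidden_state = HIDDEN_DIM * BITS_PER_VALUE

ratio = bits_per_hidden_state / bits_per_token
print(f"{bits_per_token:.0f} bits/token, {bits_per_hidden_state} bits/state, ratio ~{ratio:.0f}x")
# 15 bits/token, 40960 bits/state, ratio ~2731x
```

This compares raw capacity, not information the model actually uses, so it is an upper-bound style comparison rather than a measured gain.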

16 Aug 2024

ai-for-health computer-science artificial-intelligence

A Medical Data-Effective Learning Benchmark for Highly Efficient Pre-training of Foundation Models

Fudan University

Foundation models, pre-trained on massive datasets, have achieved unprecedented generalizability. However, is it truly necessary to involve such vast amounts of data in pre-training, consuming extensive computational resources? This paper introduces data-effective learning, which aims to use data in the most impactful way to pre-train foundation models. This involves strategies that focus on data quality rather than quantity, ensuring the data used for training has high informational value. Data-effective learning plays an important role in accelerating foundation model training, reducing computational costs, and saving data storage, which matters all the more because the volume of medical data has grown beyond many expectations in recent years. However, due to the lack of standards and comprehensive benchmarks, medical data-effective learning remains poorly studied. To address this gap, our paper introduces a comprehensive benchmark specifically for evaluating data-effective learning in the medical field. This benchmark includes a dataset with millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively measure data-effective learning performance. Our extensive experimental results show that the baseline MedDEL can achieve performance comparable to training on the original large dataset while using only 5% of the data. Establishing such an open data-effective learning benchmark is crucial for the medical foundation model research community because it facilitates efficient data use, promotes collaborative breakthroughs, and fosters the development of cost-effective, scalable, and impactful healthcare solutions.

19 Apr 2025

computer-science computer-vision-and-pattern-recognition few-shot-learning

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Shanghai AI Laboratory

Fudan University

Shanghai Jiao Tong University

Nanjing University

Tsinghua University

The Chinese University of Hong Kong

SenseTime Research

Huiser Wang

jie shao

InternVL3 establishes a new native multimodal pre-training paradigm for MLLMs, allowing the model to jointly acquire visual and linguistic capabilities from the outset. This approach achieves state-of-the-art performance among open-source models, reaching 72.2 on the MMMU benchmark, and demonstrates strong competitiveness with leading proprietary models across a wide range of multimodal tasks.

02 Feb 2024

image-and-video-processing electrical-engineering

DARCS: Memory-Efficient Deep Compressed Sensing Reconstruction for Acceleration of 3D Whole-Heart Coronary MR Angiography

Chinese Academy of Sciences

Fudan University

Shanghai Jiao Tong University

Shenzhen Institutes of Advanced Technology

Zhongshan Hospital

Shanghai Medical Imaging Institute

Three-dimensional coronary magnetic resonance angiography (CMRA) demands reconstruction algorithms that can significantly suppress the artifacts from a heavily undersampled acquisition. While unrolling-based deep reconstruction methods have achieved state-of-the-art performance on 2D image reconstruction, their application to 3D reconstruction is hindered by the large amount of memory needed to train an unrolled network. In this study, we propose a memory-efficient deep compressed sensing method that employs a sparsifying transform based on a pre-trained artifact-estimation network. The motivation is that the artifact image estimated by a well-trained network is sparse when the input image is artifact-free, and less sparse when the input image is artifact-affected. Thus, the artifact-estimation network can be used as an inherent sparsifying transform. The proposed method, named De-Aliasing Regularization based Compressed Sensing (DARCS), was compared with a traditional compressed sensing method, the de-aliasing generative adversarial network (DAGAN), model-based deep learning (MoDL), and a plug-and-play method for accelerated 3D CMRA. The results demonstrate that the proposed method improved reconstruction quality relative to the compared methods by a large margin. Furthermore, the proposed method generalized well to different undersampling rates and noise levels. The memory usage of the proposed method was only 63% of that needed by MoDL. In conclusion, the proposed method achieves improved reconstruction quality for 3D CMRA with a reduced memory burden.
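
Using the artifact-estimation network as a sparsifying transform can be written as a standard compressed-sensing objective with the network in place of a fixed transform such as a wavelet. A minimal sketch, assuming an l1 penalty and the symbols defined below; the paper's exact formulation, weighting, and solver may differ.

```latex
\hat{x} \;=\; \arg\min_{x} \; \tfrac{1}{2} \left\lVert \mathcal{A} x - y \right\rVert_2^2 \;+\; \lambda \left\lVert \mathcal{N}_{\theta}(x) \right\rVert_1
```

Here y is the undersampled k-space data, \mathcal{A} the undersampled encoding operator (Fourier sampling, plus coil sensitivities for multi-coil data), \mathcal{N}_{\theta} the pre-trained artifact-estimation network, and \lambda > 0 a regularization weight. Classic compressed sensing would use \lVert \Psi x \rVert_1 with a hand-designed \Psi; this penalty is small exactly when the network finds little artifact in x, which is the motivation stated in the abstract.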

15 Apr 2025

agent-based-systems computer-science artificial-intelligence

AFlow: Automating Agentic Workflow Generation

Fudan University

Nanjing University

Renmin University of China

The Hong Kong University of Science and Technology (Guangzhou)

HKUST

King Abdullah University of Science and Technology

DeepWisdom

Université de Montréal & Mila

Xiong-Hui Chen

Bang Liu

AFlow introduces an automated framework for generating and optimizing agentic workflows for Large Language Models, reformulating workflow optimization as a search problem over code-represented workflows. The system leverages Monte Carlo Tree Search with LLM-based optimization to iteratively refine workflows, yielding a 19.5% average performance improvement over existing automated methods while enabling smaller, more cost-effective LLMs to achieve performance parity with larger models.

03 Dec 2025

chain-of-thought computer-science computation-and-language

Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning

Fudan University

Southern University of Science and Technology

Douyin Co., Ltd.

君赵

Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g., geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find that video games inherently provide rich visual elements and mechanics that are easy to verify. To fully exploit the multimodal and verifiable rewards in video games, we propose Game-RL, which constructs diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, yielding the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, RL training solely on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse vision-language benchmarks, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, this suggests that video games may serve as valuable scenarios and resources to boost general reasoning abilities. Our code, dataset and models are available at the GitHub repository.
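
Game-RL's premise is that game mechanics make answers checkable by code. A toy sketch of that idea, assuming hand-written tic-tac-toe rules in place of the real Code2Logic pipeline (which adapts existing game code to synthesize tasks); `make_task`, `winning_moves`, and `reward` are hypothetical names, not the released API:

```python
import random

# The eight winning lines of a 3x3 board, as cell indices in row-major order.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def has_line(board: list[str], player: str) -> bool:
    # True if `player` already occupies a complete line.
    return any(board[a] == board[b] == board[c] == player for a, b, c in LINES)


def winning_moves(board: list[str], player: str) -> set[int]:
    # Game logic reused as the answer oracle: empty cells that complete a line.
    moves = set()
    for line in LINES:
        cells = [board[i] for i in line]
        if cells.count(player) == 2 and cells.count(".") == 1:
            moves.add(line[cells.index(".")])
    return moves


def make_task(rng: random.Random) -> tuple[str, set[int]]:
    # Place 3 X and 2 O at random; keep boards where X has not yet won
    # but has at least one immediate winning move.
    while True:
        board = ["."] * 9
        for i, cell in enumerate(rng.sample(range(9), 5)):
            board[cell] = "X" if i % 2 == 0 else "O"
        answers = winning_moves(board, "X")
        if answers and not has_line(board, "X"):
            rows = ["".join(board[r * 3:r * 3 + 3]) for r in range(3)]
            question = ("Board (cells 0-8, row-major):\n" + "\n".join(rows)
                        + "\nWhich empty cell wins immediately for X?")
            return question, answers


def reward(model_answer: str, answers: set[int]) -> float:
    # Binary verifiable reward: 1.0 iff the answer parses to a correct cell index.
    try:
        return 1.0 if int(model_answer.strip()) in answers else 0.0
    except ValueError:
        return 0.0


rng = random.Random(0)
question, answers = make_task(rng)
print(question)
print(reward(str(min(answers)), answers))  # a correct answer earns 1.0
print(reward("not a cell", answers))       # an unparseable answer earns 0.0
```

Because the answer comes from the same rules that define the game, difficulty can be controlled by changing the generator rather than by labeling data by hand.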

25 Sep 2025

chain-of-thought computer-science artificial-intelligence

SIM-CoT: Supervised Implicit Chain-of-Thought

Shanghai AI Laboratory

Fudan University

The Chinese University of Hong Kong

Shanghai Innovation Institute

SIM-CoT stabilizes and enhances implicit Chain-of-Thought reasoning in large language models by integrating fine-grained, step-level supervision for latent tokens during training. It addresses latent instability, achieves higher accuracy than explicit CoT in some settings while preserving inference efficiency, and offers unprecedented interpretability into the model's internal thought processes.

24 Sep 2025

agents computer-science artificial-intelligence

Embodied AI: From LLMs to World Models

Fudan University

Tsinghua University

Beijing National Research Center for Information Science and Technology

The paper outlines a joint Multimodal Large Language Model (MLLM) and World Model (WM) driven architecture to advance Embodied AI towards Artificial General Intelligence. This approach integrates MLLMs' semantic reasoning with WMs' physics-aware predictive capabilities, overcoming limitations in real-time adaptation and physical grounding to enable more robust and adaptable agents in dynamic environments.

18 Oct 2025

agentic-frameworks agents computer-science

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Zhongxing Xu

A comprehensive survey by researchers from Shanghai AI Lab and various global institutions outlines the intricate relationship between scientific large language models (Sci-LLMs) and their data foundations, tracing their evolution towards autonomous agents for scientific discovery. The paper establishes a taxonomy for scientific data and knowledge, meticulously reviews over 270 datasets and 190 benchmarks, and identifies critical data challenges alongside future paradigms.

18 Jul 2025

agents chain-of-thought computer-science

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Fudan University

Central South University

Harbin Institute of Technology

The University of Hong Kong

Researchers from Harbin Institute of Technology and collaborating institutions provide a systematic survey of Long Chain-of-Thought (Long CoT) in Large Language Models, establishing a formal distinction from Short CoT. The survey proposes a novel taxonomy based on deep reasoning, extensive exploration, and feasible reflection, and analyzes key phenomena observed in advanced reasoning models.

27 Aug 2025

computer-science artificial-intelligence computation-and-language

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Northeastern University

Sun Yat-Sen University

Fudan University

University of Science and Technology of China

Nanjing University

Tsinghua University

Shanghai AI Lab

Central South University

Shenzhen University

Southern University of Science and Technology

TeleAI

D-robotics

HKU MMLab

Lumina EAI

HKU-SH ICRC

SJTU ScaleLab

RoboTwin 2.0 introduces a scalable simulation framework and benchmark designed to generate high-quality, domain-randomized data for robust bimanual robotic manipulation, addressing limitations in existing synthetic datasets. Policies trained with RoboTwin 2.0 data achieved a 24.4% improvement in real-world success rates for few-shot learning and 21.0% for zero-shot generalization on unseen backgrounds.

21 Oct 2025

agents computer-science artificial-intelligence

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Fudan University

Shanghai Innovation Institute

Shanghai Qiji Zhifeng Co., Ltd.

Rui Zheng

Honglin Guo

BAPO introduces an adaptive clipping mechanism for off-policy Reinforcement Learning in Large Language Models, which dynamically re-balances optimization signals and preserves policy entropy. This method achieves state-of-the-art performance on AIME reasoning benchmarks, outperforming comparable open-source models and demonstrating competitiveness with proprietary systems.
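
For context, the clipping that BAPO adapts is the PPO-style clipped surrogate with separate lower and upper bounds. A minimal sketch of that standard objective with fixed, illustrative bounds (c_low = 0.2 and c_high = 0.28 are assumptions); BAPO's own rule for adjusting the bounds during training is not reproduced here:

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, c_low=0.2, c_high=0.28):
    """Mean PPO-style clipped surrogate objective (to be maximized).

    Each token's importance ratio exp(logp_new - logp_old) is clipped to
    [1 - c_low, 1 + c_high], and the pessimistic minimum of the unclipped and
    clipped terms is kept. c_low and c_high are illustrative fixed values.
    """
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)
        clipped = min(max(ratio, 1.0 - c_low), 1.0 + c_high)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A token whose probability doubled under the new policy (ratio = 2):
up = clipped_surrogate([math.log(2.0)], [0.0], [1.0])     # positive advantage: capped at 1 + c_high
down = clipped_surrogate([math.log(2.0)], [0.0], [-1.0])  # negative advantage: not capped, stays at -2
print(round(up, 4), round(down, 4))  # 1.28 -2.0
```

The example shows why the bounds matter: here clipping caps how far a positive-advantage token can push the policy while the negative-advantage term passes through unclipped, so the choice of c_low and c_high shifts the balance between positive and negative optimization signals that the summary refers to.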
