An open-source framework, AgentGym-RL, facilitates the training of large language model agents for long-horizon decision-making through multi-turn reinforcement learning and a progressive interaction scaling strategy called ScalingInter-RL. This approach enables a 7B-parameter model to achieve an average success rate comparable to or exceeding that of larger proprietary models across diverse environments, highlighting the impact of RL training on agentic intelligence.
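A minimal sketch of what such a progressive interaction-scaling schedule could look like in a multi-turn RL loop; the phase boundaries, turn caps, and the `env`/`policy` interfaces below are illustrative assumptions, not AgentGym-RL's actual API.

```python
# Sketch of a ScalingInter-RL-style schedule: cap the number of agent-environment
# turns per episode and raise the cap in stages as training progresses.
# Boundaries, caps, and interfaces are assumed for illustration.
from dataclasses import dataclass


@dataclass
class InteractionSchedule:
    """Maps the current training step to a maximum number of interaction turns."""
    boundaries: tuple = (200, 500)   # training steps where the cap increases (assumed)
    turn_caps: tuple = (5, 15, 30)   # horizon cap for each phase (assumed)

    def max_turns(self, step: int) -> int:
        for boundary, cap in zip(self.boundaries, self.turn_caps):
            if step < boundary:
                return cap
        return self.turn_caps[-1]


def collect_episode(env, policy, max_turns: int):
    """Roll out one multi-turn episode, truncated at the current horizon cap."""
    obs, trajectory = env.reset(), []
    for _ in range(max_turns):
        action = policy.act(obs)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory


if __name__ == "__main__":
    schedule = InteractionSchedule()
    # Early training uses short horizons; later phases allow longer interactions.
    print([schedule.max_turns(s) for s in (0, 250, 800)])  # -> [5, 15, 30]
```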
This research introduces "Thinking with Video," a new paradigm that leverages video generation for multimodal reasoning by enabling dynamic visualization and human-like imagination in problem-solving. It evaluates frontier video models like Sora-2 on a new, comprehensive benchmark, VideoThinkBench, showcasing their unexpected capabilities across vision and text-centric tasks.
SIM-CoT stabilizes and enhances implicit Chain-of-Thought reasoning in large language models by integrating fine-grained, step-level supervision for latent tokens during training. It addresses latent instability, achieves higher accuracy than explicit CoT in some settings while preserving inference efficiency, and makes the model's internal latent reasoning steps interpretable.
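A rough PyTorch sketch of step-level supervision on latent tokens in this spirit: an auxiliary decoder, used only at training time, reconstructs each explicit reasoning step from its latent token. The dimensions, decoder design, and loss weighting are assumptions rather than the paper's exact configuration.

```python
# Step-level supervision for latent reasoning tokens (illustrative sketch).
import torch
import torch.nn as nn

hidden_dim, vocab_size, num_steps, step_len = 256, 1000, 4, 8

# One latent token per reasoning step, as produced by the base model (random stand-in).
latent_tokens = torch.randn(2, num_steps, hidden_dim)           # (batch, steps, hidden)
# Tokenized explicit CoT steps, used only as training-time supervision.
gold_step_tokens = torch.randint(0, vocab_size, (2, num_steps, step_len))

# Auxiliary decoder: decodes each latent token into the tokens of its explicit step.
# It is discarded at inference, so latent (implicit) CoT efficiency is preserved.
aux_decoder = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.GELU(),
    nn.Linear(hidden_dim, step_len * vocab_size),
)

logits = aux_decoder(latent_tokens).view(2, num_steps, step_len, vocab_size)
# Step-level cross-entropy: every latent token must reconstruct its own step,
# anchoring the latent space and countering latent instability.
step_loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), gold_step_tokens.reshape(-1)
)
answer_loss = torch.tensor(0.0)          # stand-in for the usual answer/LM objective
total_loss = answer_loss + 0.5 * step_loss   # 0.5 weighting is an assumption
print(float(total_loss))
```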
AgiBot World Colosseo presents a large-scale platform and dataset, AgiBot World, comprising over 1 million real-world robot manipulation trajectories spanning diverse tasks and environments. The work introduces GO-1, a generalist policy leveraging vision-language models and latent action representations, achieving an average 32% performance improvement over prior generalist policies on complex manipulation tasks.
BAPO introduces an adaptive clipping mechanism for off-policy Reinforcement Learning in Large Language Models, which dynamically re-balances optimization signals and preserves policy entropy. This method achieves state-of-the-art performance on AIME reasoning benchmarks, outperforming comparable open-source models and demonstrating competitiveness with proprietary systems.
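A loose sketch of an asymmetrically clipped surrogate whose bounds adapt to the balance of positive versus negative advantage mass; the adaptation rule, target ratio, and initial bounds here are simplified stand-ins rather than BAPO's published formulation.

```python
# Adaptive, asymmetric clipping for an off-policy PPO-style token loss (illustrative).
import torch

def bapo_style_loss(logp_new, logp_old, adv, clip_low, clip_high):
    """PPO-style surrogate with separate lower/upper clip bounds."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.mean(torch.minimum(ratio * adv, clipped * adv))

def adapt_bounds(adv, clip_low, clip_high, target_pos_frac=0.5, lr=0.01):
    """Nudge the bounds so positive-advantage tokens keep a target share of the
    optimization signal, re-balancing gradients and helping preserve entropy.
    The update direction and step size are assumptions for illustration."""
    pos_frac = (adv > 0).float().mean().item()
    clip_low = max(0.05, clip_low + lr * (pos_frac - target_pos_frac))
    clip_high = max(0.05, clip_high - lr * (pos_frac - target_pos_frac))
    return clip_low, clip_high

# Toy usage with random per-token log-probabilities and advantages.
logp_old, logp_new, adv = torch.randn(16), torch.randn(16), torch.randn(16)
clip_low, clip_high = adapt_bounds(adv, clip_low=0.2, clip_high=0.28)
loss = bapo_style_loss(logp_new, logp_old, adv, clip_low, clip_high)
print(round(clip_low, 3), round(clip_high, 3), float(loss))
```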
Researchers introduced LIBERO-Plus, a diagnostic benchmark for vision-language-action (VLA) models, revealing that current models exhibit substantial fragility to environmental perturbations and frequently ignore linguistic instructions. Fine-tuning with a generalized dataset significantly enhances their robustness.
Proximal Supervised Fine-Tuning (PSFT) introduces a PPO-inspired clipped objective to stabilize large language model fine-tuning, preventing entropy collapse and catastrophic forgetting. This approach yields a more robust and generalized base model that serves as a superior "cold start" for subsequent reinforcement learning from human feedback or direct preference optimization, ultimately improving performance across a range of tasks.
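A small PyTorch sketch of a PPO-style clipped fine-tuning surrogate in this spirit, with the importance ratio taken against a frozen copy of the pre-fine-tuning model; the epsilon value and the unit-advantage treatment of expert tokens are assumptions, not PSFT's exact objective.

```python
# Clipped "proximal" supervised fine-tuning surrogate (illustrative sketch).
import torch

def psft_style_loss(logp_policy, logp_ref, epsilon=0.2):
    """Clipped SFT surrogate over expert tokens.

    logp_policy: per-token log-probs of the expert tokens under the current model.
    logp_ref:    per-token log-probs under a frozen copy of the initial model.
    """
    ratio = torch.exp(logp_policy - logp_ref.detach())
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Treating expert tokens as advantage +1, the PPO-style min() zeroes the gradient
    # once the policy has drifted past the clip range, keeping updates proximal.
    return -torch.mean(torch.minimum(ratio, clipped))

# Toy usage with random per-token log-probabilities.
logp_policy = torch.log(torch.rand(32).clamp_min(1e-6))
logp_ref = torch.log(torch.rand(32).clamp_min(1e-6))
print(float(psft_style_loss(logp_policy, logp_ref)))
```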
MUSE, an agent framework from the Shanghai Artificial Intelligence Laboratory and collaborators, enables Large Language Models to learn continuously from experience and self-evolve for complex, long-horizon real-world tasks. It achieved a new state-of-the-art partial completion score of 51.78% on the challenging TheAgentCompany (TAC) benchmark, surpassing previous methods by nearly 20%.
PREF-GRPO introduces a novel training method for text-to-image (T2I) models that stabilizes reinforcement learning against reward hacking by utilizing pairwise preference rewards. The accompanying UNIGENBENCH offers a fine-grained, MLLM-powered framework for comprehensive and diagnostic evaluation of T2I models.
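A minimal sketch of pairwise win-rate rewards feeding GRPO-style group normalization; the `prefers` callable stands in for the MLLM preference judge, and the group size and interfaces are assumptions.

```python
# Pairwise-preference (win-rate) rewards for GRPO-style training (illustrative sketch).
import itertools
import random
import statistics

def win_rate_rewards(group, prefers):
    """Reward each sample by its fraction of pairwise wins within the group,
    replacing an absolute pointwise score that is easier to reward-hack."""
    wins = [0] * len(group)
    for i, j in itertools.combinations(range(len(group)), 2):
        if prefers(group[i], group[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (len(group) - 1) for w in wins]

def grpo_advantages(rewards):
    """Standard GRPO normalization: advantage = (r - mean) / std within the group."""
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Toy usage: "images" are floats and the stand-in judge prefers the larger value.
group = [random.random() for _ in range(6)]
rewards = win_rate_rewards(group, prefers=lambda a, b: a > b)
print(grpo_advantages(rewards))
```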
SRPO (Self-Referential Policy Optimization) enhances Vision-Language-Action (VLA) models for robotic manipulation by addressing reward sparsity: it generates dense, progress-wise rewards from the model's own successful trajectories and latent world representations from V-JEPA 2. The method achieved a 99.2% success rate on the LIBERO benchmark, a 103% relative improvement over its one-shot SFT baseline, and demonstrated strong generalization on the LIBERO-Plus benchmark.
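A toy sketch of how dense progress rewards could be derived from latent similarity to stored successful rollouts; the dummy `encode` function stands in for a frozen V-JEPA 2 encoder, and the time-alignment rule and cosine metric are assumptions rather than SRPO's exact reward definition.

```python
# Dense progress rewards from latent similarity to a successful trajectory (sketch).
import numpy as np

def encode(frames: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen latent world encoder (e.g. V-JEPA 2): frames -> latents."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((frames.shape[-1], 64))
    return frames @ proj

def progress_rewards(rollout_frames, success_frames):
    """Reward each rollout step by cosine similarity to the time-aligned latent of a
    successful reference trajectory, yielding a dense signal instead of a sparse
    end-of-episode success bit."""
    z_roll, z_ref = encode(rollout_frames), encode(success_frames)
    # Align by relative progress: map each rollout step to the nearest reference index.
    idx = np.linspace(0, len(z_ref) - 1, len(z_roll)).round().astype(int)
    z_ref = z_ref[idx]
    cos = np.sum(z_roll * z_ref, axis=1) / (
        np.linalg.norm(z_roll, axis=1) * np.linalg.norm(z_ref, axis=1) + 1e-8
    )
    return cos

# Toy usage: flattened "frames" as feature vectors.
rewards = progress_rewards(np.random.rand(20, 128), np.random.rand(30, 128))
print(rewards.shape)  # (20,) -> one reward per rollout step
```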
Co-rewarding establishes a stable self-supervised reinforcement learning framework for large language models, leveraging complementary supervision via data-side analogy-invariance and model-side temporal invariance. This effectively prevents training collapse and reward hacking, leading to enhanced reasoning capabilities that often surpass prior self-rewarding methods and sometimes rival ground-truth supervised approaches.
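A schematic sketch of the two complementary self-supervision signals: majority-vote pseudo-labels from analogous rewrites of a question (data side) and from an earlier policy snapshot (model side). The voting rule, answer extraction, and combination weight are assumptions for illustration.

```python
# Complementary self-rewards from analogous questions and an earlier checkpoint (sketch).
from collections import Counter

def data_side_reward(answers_original, answers_analogous):
    """Reward answers to the original question that agree with the majority vote
    obtained on a semantically analogous rewrite of the same problem."""
    pseudo_label, _ = Counter(answers_analogous).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers_original]

def model_side_reward(answers_current, answers_snapshot):
    """Reward agreement with answers from a frozen earlier checkpoint,
    discouraging sudden drift (temporal invariance)."""
    pseudo_label, _ = Counter(answers_snapshot).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers_current]

def co_reward(r_data, r_model, alpha=0.5):
    """Combine the two signals; the equal weighting is an assumption."""
    return [alpha * d + (1 - alpha) * m for d, m in zip(r_data, r_model)]

# Toy usage with extracted final answers from sampled rollouts.
current = ["42", "41", "42", "42"]
print(co_reward(data_side_reward(current, ["42", "42", "40"]),
                model_side_reward(current, ["42", "42", "42"])))
```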
Researchers from Fudan University and Shanghai AI Lab introduce UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation, trained by jointly learning across multiple visual domains. It achieves superior performance on both image and video tasks, delivers significant improvements in preference alignment, and enables efficient Direct Preference Optimization (DPO) through a combination of pairwise ranking and pointwise scoring.
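A small sketch of how the pairwise-ranking and pointwise-scoring modes could be combined to curate DPO pairs; `rank_pair`, `score`, and the margin threshold are hypothetical stand-ins for the reward model's outputs rather than the released pipeline.

```python
# Two-stage DPO pair curation: pairwise ranking, then pointwise filtering (sketch).
import functools
import random

def rank_pair(a, b):
    """Stand-in for the reward model's pairwise judgment: -1 if a is preferred, else 1."""
    return -1 if a["quality"] > b["quality"] else 1

def score(sample):
    """Stand-in for the reward model's pointwise score in [0, 5]."""
    return round(5 * sample["quality"], 2)

def build_dpo_pair(candidates, min_margin=1.0):
    """Pairwise-rank candidates, then keep (chosen, rejected) only if the pointwise
    score gap is large enough to form a reliable DPO training pair."""
    ranked = sorted(candidates, key=functools.cmp_to_key(rank_pair))
    chosen, rejected = ranked[0], ranked[-1]
    if score(chosen) - score(rejected) >= min_margin:
        return chosen, rejected
    return None  # margin too small: skip this prompt

# Toy usage with synthetic candidates carrying a hidden "quality" attribute.
candidates = [{"id": i, "quality": random.random()} for i in range(4)]
print(build_dpo_pair(candidates))
```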
Researchers developed UniMedVL, a unified medical foundation model capable of simultaneously performing both understanding and generation tasks within a single architecture, leveraging the UniMed-5M multimodal dataset and a progressive curriculum learning strategy. The model achieves superior performance across diverse medical visual understanding benchmarks and demonstrates high-fidelity generation and seamless execution of complex interleaved multimodal tasks.
Researchers from Zhejiang University, Ant Group, and others introduced ENVIRONMENT TUNING, a training paradigm for Large Language Model (LLM) agents that focuses on modifying the learning environment itself. This method enables agents to achieve robust generalization and stability in complex, multi-turn tool-use tasks despite extreme data scarcity, significantly boosting performance on benchmarks like BFCL V3 by up to 18.50% and improving out-of-distribution generalization where supervised fine-tuning models collapse.
A new comprehensive video dataset named Sekai, curated by Shanghai AI Laboratory, supports advanced world exploration models with over 5,000 hours of videos from real-world and game sources, annotated with explicit camera trajectories and extensive metadata. Training on this data improved model performance on text-to-video, image-to-video, and camera-controlled video generation tasks.
Researchers from Shanghai AI Laboratory, Shanghai Jiao Tong University, and The University of Hong Kong developed MM-Eureka, an open-source multimodal model utilizing rule-based reinforcement learning with novel online filtering and two-stage training strategies. This model achieves leading open-source performance in multimodal mathematical and multidisciplinary reasoning, scoring 74.8 on MathVista and 73.4 on WeMath, and is accompanied by the new MMK12 high-quality multimodal mathematical reasoning dataset.
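A minimal sketch of rule-based rewards with online filtering, where prompt groups that are uniformly correct or uniformly wrong (and thus carry no advantage signal) are dropped from the batch; the answer-extraction regex and group construction are assumptions.

```python
# Rule-based accuracy rewards with online difficulty filtering (illustrative sketch).
import re

def rule_reward(response: str, gold: str) -> float:
    """Binary accuracy reward: 1 if the boxed final answer matches the ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def online_filter(prompt_groups):
    """Keep only prompt groups with non-zero reward variance (mixed correct/incorrect),
    since all-correct or all-wrong groups contribute no learning signal."""
    kept = []
    for prompt, responses, gold in prompt_groups:
        rewards = [rule_reward(r, gold) for r in responses]
        if 0.0 < sum(rewards) < len(rewards):
            kept.append((prompt, responses, rewards))
    return kept

# Toy usage: the first prompt is kept, the second (all wrong) is filtered out.
groups = [
    ("Q1", [r"... \boxed{12}", r"... \boxed{13}"], "12"),
    ("Q2", [r"... \boxed{7}", r"... \boxed{8}"], "9"),
]
print([p for p, *_ in online_filter(groups)])  # -> ['Q1']
```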
Yume introduces an interactive world generation model that synthesizes dynamic, explorable virtual environments from images or text, featuring intuitive keyboard-driven camera control. The model produces high-fidelity real-world urban scenes, outperforming existing methods in visual quality and instruction following.
MOSS-Speech, developed by the SII OpenMOSS Team, Fudan University, and MOSI, introduces a model for direct speech-to-speech interaction that eliminates reliance on intermediate text. It achieves state-of-the-art results in speech modeling and spoken question answering, for instance scoring 67.19 on MMLU and 69.53 on CMMLU, while preserving the linguistic capabilities of its underlying text LLM better than previous multimodal systems.
InternVideo-Next, a self-supervised framework developed by researchers including those from Shanghai AI Lab and Shanghai Jiao Tong University, establishes a new benchmark for general video understanding by achieving state-of-the-art performance across diverse tasks without requiring video-text supervision. It unifies pixel-level fidelity with semantic abstraction, outperforming text-supervised models on Kinetics-400 and Something-Something V2 while excelling in tasks like 3D depth estimation and object tracking.
RoboOmni, developed by Fudan University and NUS, introduces an end-to-end omni-modal framework for robots to proactively infer and verify user intentions from human speech, environmental sounds, and visual cues. The system achieved an 85.6% success rate on cross-modal contextual instructions, outperforming baselines like NORA (25.9%), and cut inference latency to 0.49x that of ASR-based approaches.