A comprehensive survey by Chen et al. (2025) introduces the first unified taxonomy for latent Chain-of-Thought (CoT) reasoning, organizing a rapidly growing field into token-wise horizontal and layer-wise vertical approaches. It synthesizes current research and practical applications, and outlines critical challenges for future advances in LLM efficiency and cognitive capabilities.
LS-Imagine allows visual reinforcement learning agents to plan and explore over long horizons in complex open-world environments by integrating a long short-term world model and affordance-driven guidance. It demonstrates improved success rates and reduced steps to completion across various MineDojo tasks, including "Harvest log in plains" and "Mine iron ore."
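To make the "long short-term" idea concrete, here is a minimal, hypothetical sketch of such a rollout: the world model usually advances the latent one step at a time, but takes a long-horizon "jump" toward a goal latent when an affordance signal flags a distant target. The class, method names, and gating rule are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of an LS-Imagine-style rollout; the paper's actual
# architecture, losses, and MineDojo integration differ.
import torch
import torch.nn as nn

class LongShortWorldModel(nn.Module):
    """Toy latent dynamics with a short-term step and a long-term 'jump'."""
    def __init__(self, latent_dim=64, action_dim=8):
        super().__init__()
        self.short_step = nn.GRUCell(action_dim, latent_dim)    # one-step dynamics
        self.long_jump = nn.Linear(latent_dim * 2, latent_dim)  # jump toward a goal latent

    def imagine(self, z, actions, goal_latent, affordance_score):
        """Roll out latents; jump when affordance (a scalar score) is high."""
        trajectory = [z]
        for a in actions:
            if affordance_score(z) > 0.5:  # affordance map flags a distant target
                z = self.long_jump(torch.cat([z, goal_latent], dim=-1))
            else:
                z = self.short_step(a, z)
            trajectory.append(z)
        return trajectory
```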
BridgeAD, developed by researchers at Fudan University and Eastern Institute of Technology, introduces a framework for end-to-end autonomous driving that reformulates how historical temporal information is leveraged. It achieves state-of-the-art planning results on the nuScenes dataset by enabling fine-grained, step-level interactions with past data.
An autonomous driving framework, ImagiDrive, integrates Vision-Language Models with Driving World Models in a unified imagination-and-planning loop. This iterative process allows the system to generate future driving scenarios and refine its trajectory predictions, leading to improved collision avoidance and more robust navigation across multiple datasets.
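The loop structure can be sketched as follows. This is a schematic reading of the summary above, not the paper's API: `vlm_propose`, `world_model_imagine`, and `score_safety` are hypothetical stand-ins for the VLM planner, the driving world model, and a rollout evaluator.

```python
# Illustrative imagination-and-planning loop in the spirit of ImagiDrive.
def plan_with_imagination(observation, vlm_propose, world_model_imagine,
                          score_safety, num_iterations=3):
    trajectory = vlm_propose(observation, prior_trajectory=None)
    for _ in range(num_iterations):
        # Imagine future scenes conditioned on the candidate trajectory.
        imagined_future = world_model_imagine(observation, trajectory)
        # If the imagined rollout looks safe enough, stop refining;
        # otherwise ask the VLM to revise against the imagined evidence.
        if score_safety(imagined_future) >= 0.9:
            break
        trajectory = vlm_propose(imagined_future, prior_trajectory=trajectory)
    return trajectory
```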
The paper "Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model" provides a systematic review and unified benchmark for tuning MLLMs, classifying methods into Selective, Additive, and Reparameterization paradigms. It empirically analyzes the trade-offs between task-expert specialization and open-world stabilization, offering practical guidelines for MLLM deployment.
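As a concrete anchor for one of the three paradigms, below is a minimal LoRA-style adapter, a standard instance of the Reparameterization family (a frozen pre-trained weight plus a trainable low-rank update). Selective tuning would instead unfreeze a subset of existing weights, and Additive tuning would insert new modules; this is generic illustration, not the paper's code.

```python
# Minimal LoRA-style adapter illustrating the "Reparameterization" paradigm.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # frozen pre-trained weight
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank update: W x + scale * (B A) x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```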
The VaCo framework enhances Multimodal Large Language Models (MLLMs) by intrinsically activating and coordinating vision-centric information. It utilizes query-based discriminative alignment to harness visual priors from multiple Vision Foundation Models (VFMs) and employs a Token Gateway Mask to resolve representational conflicts, leading to superior visual comprehension and efficiency. VaCo outperforms state-of-the-art MLLMs on benchmarks such as MMBench-English (78.6, +2.3 points over LLaVA-1.5) and MMMU (46.7, +4.9 points over LLaVA-1.5) with a 7B model, while using only a single visual encoder during inference.
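One plausible reading of the Token Gateway Mask is an attention mask that keeps each VFM's group of alignment queries from attending to another VFM's group, so the different visual priors do not interfere. The sketch below builds such a mask; the shapes and gating rule are assumptions for illustration, not VaCo's actual formulation.

```python
# Hypothetical Token Gateway Mask sketch: block cross-VFM query interaction.
import torch

def token_gateway_mask(num_text_tokens, queries_per_vfm, num_vfms):
    """Boolean attention mask: True = attention allowed."""
    total = num_text_tokens + queries_per_vfm * num_vfms
    mask = torch.ones(total, total, dtype=torch.bool)
    for i in range(num_vfms):
        for j in range(num_vfms):
            if i == j:
                continue  # queries may attend within their own VFM group
            rows = slice(num_text_tokens + i * queries_per_vfm,
                         num_text_tokens + (i + 1) * queries_per_vfm)
            cols = slice(num_text_tokens + j * queries_per_vfm,
                         num_text_tokens + (j + 1) * queries_per_vfm)
            mask[rows, cols] = False  # gate off cross-VFM attention
    return mask
```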
Researchers from Eastern Institute of Technology, Tencent, and collaborators develop SkipGPT, a dynamic layer pruning framework that introduces token-aware global routing and decoupled pruning policies for MLP versus self-attention modules. Its two-stage training paradigm, which first tunes lightweight routers (0.01% of parameters) and then applies LoRA fine-tuning, achieves over 40% parameter reduction while maintaining or exceeding original model performance: SkipGPT-RT retains over 90% performance on LLaMA2-7B/13B at 25% pruning and over 95% on LLaMA3.1-8B, outperforming static methods like ShortGPT and dynamic approaches like MoD-D across commonsense reasoning benchmarks. Analysis of routing behavior reveals that attention modules exhibit higher redundancy than MLPs and that computational needs shift contextually, with later tokens requiring more attention but less MLP processing, challenging the fixed 1:1 attention-MLP architecture design. The authors also show that jointly training routers with pre-trained parameters causes instability, in contrast to their stable disentangled optimization approach.
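The decoupled, token-aware routing can be sketched as two tiny per-module gates inside a transformer block, as below. This is a loose illustration of the idea described in the summary; the module names, gate parameterization, and thresholding are assumptions rather than SkipGPT's actual implementation.

```python
# Illustrative token-aware routing with decoupled attention/MLP policies.
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, hidden_dim, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        # Decoupled pruning policies: one lightweight router per module type.
        self.attn_router = nn.Linear(hidden_dim, 1)
        self.mlp_router = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # Per-token gates in [0, 1]: soft (differentiable) during router
        # training, hard-thresholded at inference to actually skip compute.
        attn_gate = torch.sigmoid(self.attn_router(x))
        mlp_gate = torch.sigmoid(self.mlp_router(x))
        x = x + attn_gate * self.attn(x)  # module is skipped when gate ~ 0
        x = x + mlp_gate * self.mlp(x)
        return x
```

Training only `attn_router` and `mlp_router` first, with the backbone frozen, mirrors the stability argument above: the routers settle before any LoRA fine-tuning touches the rest of the model.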
Researchers systematically investigated factors influencing the distillation of Chain-of-Thought (CoT) reasoning into Small Language Models (SLMs), identifying that optimal CoT granularity is non-monotonic and student-dependent, format impact is minimal, and teacher choice effectiveness varies by task. The study revealed a 'Matthew Effect,' where stronger SLMs gained more from CoT distillation, challenging assumptions about knowledge transfer.
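As a rough picture of what "varying CoT granularity" means in a distillation pipeline, the sketch below formats teacher rationales at a chosen granularity before standard supervised fine-tuning of the student. The data schema and truncation rule are assumptions, not the paper's pipeline.

```python
# Hedged sketch: build student fine-tuning data from teacher CoT traces.
def build_distillation_examples(teacher_traces, granularity):
    """granularity: fraction of reasoning steps kept; the study finds the
    optimum is non-monotonic and depends on the student model."""
    examples = []
    for question, steps, answer in teacher_traces:
        kept = steps[: max(1, int(len(steps) * granularity))]
        rationale = " ".join(kept)
        examples.append({
            "prompt": f"Q: {question}\nLet's think step by step.",
            "target": f"{rationale}\nAnswer: {answer}",
        })
    return examples
```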
A survey by researchers from LimX Dynamics and various universities outlines Behavior Foundation Models (BFMs) as the next-generation paradigm for humanoid whole-body control. It synthesizes current approaches, categorizing pre-training and adaptation strategies, and discusses BFMs' potential to enable general-purpose physical intelligence through broad behavioral priors and rapid adaptation, while also identifying key challenges.
Researchers from National University of Singapore and collaborators introduced the concept of LLM-empowered personalized Web agents, aiming to automate online tasks by incorporating user-specific data. They developed the PersonalWAB benchmark and proposed the PUMA framework, which notably improved task accuracy and efficiency by leveraging personalized user memory and preference optimization, outperforming larger general-purpose LLMs.
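A minimal sketch of memory-conditioned prompting in this spirit is shown below: retrieve the user's most relevant stored interactions and prepend them to the task instruction. The scoring function, memory schema, and prompt template are assumptions, not PUMA's actual design.

```python
# Hypothetical personalized-memory prompting for a Web agent.
def build_personalized_prompt(task, user_memory, relevance, k=3):
    """user_memory: past interaction strings; relevance: (task, m) -> float."""
    top_memories = sorted(user_memory, key=lambda m: relevance(task, m),
                          reverse=True)[:k]
    context = "\n".join(f"- {m}" for m in top_memories)
    return (f"Known user preferences and history:\n{context}\n\n"
            f"Task: {task}\nAct on the user's behalf.")
```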
Researchers from the Ningbo Institute of Digital Twin, Eastern Institute of Technology, developed a hybrid OCR-LLM framework to efficiently extract information from enterprise-scale copy-heavy documents. This framework achieved sub-second latency and near-perfect F1 scores across various document types, demonstrating up to a 54x speedup over multimodal approaches by leveraging document structure.
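The "leveraging document structure" idea suggests a routing scheme like the sketch below: template-like, copy-heavy documents are handled by a fast OCR pass plus pattern matching, with an LLM fallback for irregular layouts. The helpers (`ocr`, `llm_extract`) and the fallback threshold are illustrative assumptions, not the paper's framework.

```python
# Illustrative routing for a hybrid OCR-LLM extractor.
import re

def extract_fields(document_image, ocr, llm_extract, template_patterns,
                   min_matches=2):
    text = ocr(document_image)  # fast OCR pass
    fields = {}
    for name, pattern in template_patterns.items():
        match = re.search(pattern, text)
        if match:
            fields[name] = match.group(1)
    if len(fields) >= min_matches:
        return fields            # document structure recovered fields cheaply
    return llm_extract(text)     # fallback: slower but robust to odd layouts
```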