RepoMaster enables LLM-based agents to autonomously explore and understand complex GitHub repositories for task solving by intelligently reusing and adapting existing codebases. It significantly improves task success rates by up to 110% and reduces token consumption by approximately 95% compared to state-of-the-art baselines.
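To make the idea concrete, here is a minimal sketch of repository-aware exploration in the spirit described above, not RepoMaster's actual implementation: instead of dumping an entire repository into the context, the agent builds a cheap structural index and then asks the model to open only the files it needs. The `call_llm` callable and the `OPEN`/`ANSWER` action protocol are hypothetical placeholders.

```python
import os

def build_repo_index(repo_root: str, max_entries: int = 200) -> str:
    """Return a compact listing of the repository's file tree."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]  # skip hidden dirs
        for name in filenames:
            entries.append(os.path.relpath(os.path.join(dirpath, name), repo_root))
            if len(entries) >= max_entries:
                return "\n".join(entries)
    return "\n".join(entries)

def explore_and_solve(repo_root: str, task: str, call_llm, max_steps: int = 8):
    """Let the model request specific files before producing a final answer."""
    context = [f"Task: {task}", "Repository layout:", build_repo_index(repo_root)]
    for _ in range(max_steps):
        action = call_llm("\n".join(context))        # e.g. 'OPEN src/train.py' or 'ANSWER ...'
        if action.startswith("ANSWER"):
            return action[len("ANSWER"):].strip()    # final adapted code or plan
        if action.startswith("OPEN "):
            path = os.path.join(repo_root, action[len("OPEN "):].strip())
            with open(path, encoding="utf-8", errors="ignore") as f:
                context.append(f"--- {path} ---\n{f.read()[:4000]}")  # truncate to save tokens
    return None
```

Reading files on demand rather than up front is what keeps token consumption low in this style of agent.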
This survey from the Chinese Academy of Sciences provides a systematic review of Large Language Model (LLM)-based scientific agents, detailing their specialized architectures, evaluation benchmarks, diverse applications, and critical ethical considerations. It defines the characteristics that differentiate scientific agents from general-purpose LLMs, such as their integration with scientific tools and handling of complex data, and outlines current capabilities and challenges in accelerating scientific discovery.
SecureWebArena is introduced as the first holistic security evaluation benchmark for LVLM-based web agents, integrating diverse web environments, a broad attack taxonomy, and a multi-layered evaluation protocol. The benchmark reveals consistent vulnerabilities across state-of-the-art models, with pop-up attacks being particularly effective and achieving Payload Delivery Rates (PDR) from 76.67% to 100%.
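As a small illustration of the reported metric, the sketch below computes a Payload Delivery Rate per attack type as the fraction of trials in which the payload was delivered. The trial fields (`attack_type`, `payload_delivered`) are illustrative and not SecureWebArena's actual schema.

```python
from collections import defaultdict

def payload_delivery_rate(trials):
    """trials: iterable of dicts like {'attack_type': 'pop_up', 'payload_delivered': True}."""
    delivered, total = defaultdict(int), defaultdict(int)
    for t in trials:
        total[t["attack_type"]] += 1
        delivered[t["attack_type"]] += int(t["payload_delivered"])
    return {attack: delivered[attack] / total[attack] for attack in total}

print(payload_delivery_rate([
    {"attack_type": "pop_up", "payload_delivered": True},
    {"attack_type": "pop_up", "payload_delivered": True},
    {"attack_type": "banner", "payload_delivered": False},
]))  # {'pop_up': 1.0, 'banner': 0.0}
```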
TrajVLM-Gen, a two-stage framework developed by researchers from the Chinese Academy of Sciences and Westlake University, enables the generation of physically consistent videos by first employing a Vision-Language Model to predict physics-aware trajectories and then using them to guide a video diffusion model. The system achieves 89.6% accuracy in trajectory generation, significantly outperforming baselines, and produces competitive FVD scores on standard video generation benchmarks.
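The following is a minimal sketch of the two-stage data flow summarized above, not TrajVLM-Gen's code: a VLM stub proposes one (x, y) point per frame, and the trajectory is rendered into per-frame Gaussian heatmaps, a common way to condition a video diffusion model on motion. `vlm_predict_trajectory` and `diffusion_generate` are hypothetical placeholders for the two models.

```python
import numpy as np

def trajectory_to_heatmaps(trajectory, height, width, sigma=5.0):
    """Render each (x, y) point as a Gaussian heatmap, one map per frame."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in trajectory]
    return np.stack(maps)  # shape: (num_frames, H, W)

def generate_video(first_frame, prompt, vlm_predict_trajectory, diffusion_generate,
                   num_frames=16, height=256, width=256):
    # Stage 1: the VLM reasons about physics and proposes a motion path.
    trajectory = vlm_predict_trajectory(first_frame, prompt, num_frames)
    # Stage 2: the path conditions the diffusion model so generated motion follows it.
    condition = trajectory_to_heatmaps(trajectory, height, width)
    return diffusion_generate(first_frame, prompt, condition)
```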
Tree-of-Code (ToC) introduces a self-growing tree framework that enables large language models to generate and execute end-to-end code programs for complex, multi-tool tasks without relying on intermediate ground truth. This approach yields accuracy improvements of nearly 20% on the M3ToolEval and API-Bank level-3 datasets, while substantially reducing interaction turns and token usage compared to prior methods.
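A minimal sketch of a self-growing code tree in this spirit (not the authors' implementation): each node holds a complete candidate program, executing it yields feedback, and that feedback seeds child nodes until a program runs successfully, so no intermediate ground truth is needed. The `generate_code` and `run_program` callables are hypothetical stubs supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    code: str
    feedback: str = ""
    children: List["Node"] = field(default_factory=list)

def grow_tree(task: str, generate_code, run_program,
              branching: int = 2, max_depth: int = 3) -> Optional[str]:
    root = Node(code=generate_code(task, feedback=""))
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            ok, output = run_program(node.code)          # execute the whole program end to end
            node.feedback = output
            if ok:
                return node.code                         # success signal replaces ground truth
            for _ in range(branching):                   # grow children from execution feedback
                child = Node(code=generate_code(task, feedback=output))
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return None
```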
MiniVLN presents a framework that uses progressive knowledge distillation to create lightweight Vision-and-Language Navigation (VLN) models. It achieves performance comparable to or better than teacher models (e.g., 77.59% SR on R2R test unseen vs. ScaleVLN's 77.00%) with only 12% of their parameters and over three times faster inference speed.
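For readers unfamiliar with the building block, here is a minimal sketch of a temperature-scaled distillation loss, the standard component behind knowledge distillation; it is not MiniVLN's exact objective. "Progressive" distillation applies such a loss stage by stage (e.g., during pretraining and again during fine-tuning) so the small student tracks the teacher throughout training rather than only at the end. Requires PyTorch.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-label KL against the teacher with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                       # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```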
A new large-scale, real-world, and sequential dataset called V2X-Seq, developed by Tsinghua University's AIR and Baidu Inc., enables research in vehicle-infrastructure cooperative perception and forecasting. It provides comprehensive data, including real-time traffic light signals and trajectories, which improves 3D object tracking and multi-agent trajectory prediction in autonomous driving systems.
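To illustrate the kind of vehicle-infrastructure cooperation the dataset is built to study, the sketch below performs a simple late fusion of vehicle-side and infrastructure-side detections in a shared world frame. This is illustrative only; it is neither V2X-Seq's API nor a method from the paper, and the `(x, y, score)` detection format is an assumption.

```python
import math

def fuse_detections(vehicle_dets, infra_dets, match_radius: float = 2.0):
    """Merge two detection lists, keeping the higher-scoring box when they overlap."""
    fused = list(vehicle_dets)
    for ix, iy, iscore in infra_dets:
        matched = False
        for j, (vx, vy, vscore) in enumerate(fused):
            if math.hypot(ix - vx, iy - vy) < match_radius:
                matched = True
                if iscore > vscore:
                    fused[j] = (ix, iy, iscore)
                break
        if not matched:
            fused.append((ix, iy, iscore))  # infrastructure sees objects the vehicle misses
    return fused
```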
TRACT introduces a novel approach for Open-Vocabulary Multi-Object Tracking by explicitly leveraging trajectory information to improve both object association and classification. This method achieves improved tracking performance, particularly boosting classification accuracy for novel object categories on the OV-TAO dataset by 2.5 times compared to prior methods.
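As a small illustration of why trajectory information helps classification (not the authors' implementation), the sketch below averages per-frame open-vocabulary class scores along a track, so a few noisy frames no longer flip the label of a novel-category object.

```python
import numpy as np

def classify_trajectory(per_frame_scores: np.ndarray, class_names: list) -> str:
    """per_frame_scores: (num_frames, num_classes) array of per-frame class scores."""
    traj_scores = per_frame_scores.mean(axis=0)    # aggregate evidence over the whole track
    return class_names[int(traj_scores.argmax())]

scores = np.array([[0.2, 0.7, 0.1],    # frame 1 prefers 'skateboard'
                   [0.6, 0.3, 0.1],    # frame 2 alone would say 'bicycle'
                   [0.1, 0.8, 0.1]])   # frame 3 prefers 'skateboard'
print(classify_trajectory(scores, ["bicycle", "skateboard", "scooter"]))  # skateboard
```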