Beijing University of Chemical Technology

01 Dec 2025

computer-science robotics

PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models

University of Illinois at Urbana-Champaign Sungkyunkwan University

Purdue University Indiana University Bloomington Beijing University of Chemical Technology

Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm for teaching robots complex behaviors without reward engineering. However, its effectiveness is often limited by two critical challenges: the reliance on extensive human input and the inherent difficulties in resolving query ambiguity and credit assignment during reward learning. In this paper, we introduce PRIMT, a PbRL framework designed to overcome these challenges by leveraging foundation models (FMs) for multimodal synthetic feedback and trajectory synthesis. Unlike prior approaches that rely on single-modality FM evaluations, PRIMT employs a hierarchical neuro-symbolic fusion strategy, integrating the complementary strengths of large language models and vision-language models in evaluating robot behaviors for more reliable and comprehensive feedback. PRIMT also incorporates foresight trajectory generation, which reduces early-stage query ambiguity by warm-starting the trajectory buffer with bootstrapped samples, and hindsight trajectory augmentation, which enables counterfactual reasoning with a causal auxiliary loss to improve credit assignment. We evaluate PRIMT on 2 locomotion and 6 manipulation tasks on various benchmarks, demonstrating superior performance over FM-based and scripted baselines.

09 Oct 2025

attention-mechanisms computer-science computer-vision-and-pattern-recognition

RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning

Sun Yat-Sen University Hunan University Beijing University of Chemical Technology JD.COM

RePainter introduces a reinforcement learning framework for e-commerce object removal, leveraging spatial-matting trajectory refinement and a local-global composite reward mechanism. The method generates visually seamless and semantically coherent images, outperforming existing state-of-the-art inpainting techniques across multiple quantitative and qualitative evaluations.

07 Mar 2025

agents ai-for-cybersecurity computer-science

NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM

Beijing Normal University

National University of Singapore Beijing University of Chemical Technology

Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions. High-performance navigation models require a large amount of training data, but the high cost of manually annotating data has seriously hindered the field. To expand the data, some previous methods translate trajectory videos into step-by-step instructions, but such instructions do not match well with users' communication styles, which briefly describe destinations or state specific needs. Moreover, local navigation trajectories overlook global context and high-level task planning. To address these issues, we propose NavRAG, a retrieval-augmented generation (RAG) framework that generates user demand instructions for VLN. NavRAG leverages an LLM to build a hierarchical scene description tree for 3D scene understanding from global layout to local details, then simulates various user roles with specific demands to retrieve from the scene tree, generating diverse instructions with the LLM. We annotate over 2 million navigation instructions across 861 scenes and evaluate the data quality and navigation performance of trained models.

06 Jun 2025

computer-science computer-vision-and-pattern-recognition image-segmentation

Dy3DGS-SLAM: Monocular 3D Gaussian Splatting SLAM for Dynamic Environments

Dalian University of Technology Beijing University of Chemical Technology Laval University Fraunhofer Institute for Nondestructive Testing Saarland University of Applied Sciences

Dy3DGS-SLAM presents the first monocular RGB-only 3D Gaussian Splatting SLAM system capable of operating in dynamic environments. The system achieves robust camera pose estimation and high-fidelity scene reconstruction by employing a sophisticated multi-modal mask fusion strategy and dynamic-aware loss functions, resulting in superior tracking accuracy and the elimination of dynamic object artifacts compared to existing methods.

14 Mar 2025

computer-science software-engineering

Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation

Chinese Academy of Sciences

Peking University Beijing University of Chemical Technology National Key Laboratory of Space Integrated Information System

Prochemy is an execution-driven framework for automatic prompt refinement, enhancing Large Language Model performance in code generation and translation tasks. It consistently improves code quality and translation accuracy across various models and datasets, yielding average gains of +4.04% for zero-shot code generation and up to +14.15% on challenging benchmarks like LiveCodeBench.

15 Sep 2025

chain-of-thought computer-science computer-vision-and-pattern-recognition

RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning

Beijing University of Chemical Technology

The RISE framework enhances Vision-Language Model image annotation by autonomously generating and verifying high-quality Chains of Thought (CoTs) through a self-supervised, closed-loop process. This approach achieves superior performance on complex reasoning tasks, such as reducing Jensen-Shannon Divergence to 0.071 on Emotion6 and improving mAP@0.5 to 0.404 on LISA, while mitigating the need for extensive manual CoT annotations.
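The Jensen-Shannon Divergence quoted above measures how far a predicted label distribution sits from a reference distribution (lower is better). A minimal sketch of the standard definition follows; the base-2 logarithm and the function names are assumptions made for illustration, and the summary does not state which convention the RISE evaluation uses.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Uses base-2 logs (an assumption), so the value lies in [0, 1].
    """
    # The mixture m is positive wherever p or q is, so the KL terms stay finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions give 0 and distributions with disjoint support give 1 under this convention, so a score of 0.071 would indicate predictions close to the reference distribution.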

21 Aug 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

University of Minnesota Shanghai University Beijing University of Chemical Technology

In the zero-shot setting, test-time adaptation (TTA) adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further develop MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance. Project Page available at: this https URL
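The low-entropy selection criterion described above can be sketched as follows. This is only an illustration of the generic idea (keep the most confident test samples per pseudo-label), not the MCP implementation; `build_entropy_cache`, its `capacity` parameter, and the input layout are hypothetical.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def build_entropy_cache(predictions, capacity=2):
    """Keep the `capacity` lowest-entropy samples for each pseudo-label.

    predictions: iterable of (sample_id, probs) pairs, where probs is the
    model's class-probability vector for that test sample (assumed layout).
    Returns {pseudo_label: [(entropy, sample_id), ...]} sorted by entropy.
    """
    grouped = defaultdict(list)
    for sample_id, probs in predictions:
        # The pseudo-label is the arg-max class of the prediction.
        pseudo_label = max(range(len(probs)), key=probs.__getitem__)
        grouped[pseudo_label].append((entropy(probs), sample_id))
    return {label: sorted(items)[:capacity] for label, items in grouped.items()}
```

A negative cache of the kind mentioned in the abstract would instead draw on high-entropy samples; how MCP selects and uses them is not specified in this summary.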

23 May 2025

computer-science multiagent-systems

HYGMA: Hypergraph Coordination Networks with Dynamic Grouping for Multi-Agent Reinforcement Learning

Beijing University of Chemical Technology

Cooperative multi-agent reinforcement learning faces significant challenges in effectively organizing agent relationships and facilitating information exchange, particularly when agents need to adapt their coordination patterns dynamically. This paper presents a novel framework that integrates dynamic spectral clustering with hypergraph neural networks to enable adaptive group formation and efficient information processing in multi-agent systems. The proposed framework dynamically constructs and updates hypergraph structures through spectral clustering on agents' state histories, enabling higher-order relationships to emerge naturally from agent interactions. The hypergraph structure is enhanced with attention mechanisms for selective information processing, providing an expressive and efficient way to model complex agent relationships. This architecture can be implemented in both value-based and policy-based paradigms through a unified objective combining task performance with structural regularization. Extensive experiments on challenging cooperative tasks demonstrate that our method significantly outperforms state-of-the-art approaches in both sample efficiency and final performance.

13 Feb 2025

computer-science robotics

OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics

Chinese Academy of Sciences

Tsinghua University Harbin Institute of Technology Macau University of Science and Technology Beijing University of Chemical Technology Beijing Institute of Technology

This research introduces OpenBench, a new benchmark for outdoor semantic navigation in last-mile delivery, and the Openstreetmap-enhanced oPen-air sEmantic Navigation (OPEN) system. The OPEN system combines OpenStreetMap data with Large Language Models and Vision-Language Models to achieve robust, scalable navigation, outperforming learning-based baselines in both simulated and real-world environments while significantly reducing map storage requirements.

31 Mar 2025

computer-science contrastive-learning computer-vision-and-pattern-recognition

Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models

Sun Yat-Sen University University of Minnesota Twin Cities Beijing University of Chemical Technology

Haotian Zhai

Researchers from the University of Minnesota, Beijing University of Chemical Technology, and Sun Yat-sen University developed CRG, a zero-shot Test-Time Adaptation framework for large Vision-Language Models. This method effectively mitigates cache noise and enhances robustness by integrating learnable residual parameters and Gaussian Discriminant Analysis, achieving state-of-the-art accuracy on ImageNet benchmarks and diverse recognition datasets.

25 Aug 2025

agents computer-science artificial-intelligence

DSADF: Thinking Fast and Slow for Decision Making

Shanghai Artificial Intelligence Laboratory Tongji University

Duke University

HKUST East China Normal University Northeast Electric Power University Beijing University of Chemical Technology

David Wang

Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions. Recent work has explored applying Large Language Models (LLMs) or Vision Language Models (VLMs) to boost the generalization of RL agents through policy optimization guidance or prior knowledge. However, these approaches often lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decision-making in unfamiliar environments and efficiency bottlenecks. How to make full use of the inferential capabilities of foundation models and the rapid response capabilities of RL agents, and how to enhance the interaction between the two to form a dual system, remains an open scientific question. To address this problem, we draw inspiration from Kahneman's theory of fast thinking (System 1) and slow thinking (System 2), which holds that balancing intuition and deep reasoning can achieve nimble decision-making in a complex world. In this study, we propose a Dual-System Adaptive Decision Framework (DSADF), integrating two complementary modules: System 1, comprising an RL agent and a memory space for fast and intuitive decision making, and System 2, driven by a VLM for deep and analytical reasoning. DSADF facilitates efficient and adaptive decision-making by combining the strengths of both systems. An empirical study in the video game environments Crafter and Housekeep demonstrates the effectiveness of the proposed method, showing significant improvements in decision-making ability on both unseen and known tasks.

26 May 2024

computer-science computer-vision-and-pattern-recognition few-shot-learning

LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content

Northwestern Polytechnical University

Nanyang Technological University Singapore University of Technology and Design Beijing University of Chemical Technology

昊李

Long-tail recognition is challenging because it requires the model to learn good representations from tail categories and address imbalances across all categories. In this paper, we propose a novel generative and fine-tuning framework, LTGC, to handle long-tail recognition via leveraging generated content. Firstly, inspired by the rich implicit knowledge in large-scale models (e.g., large language models, LLMs), LTGC leverages the power of these models to parse and reason over the original tail data to produce diverse tail-class content. We then propose several novel designs for LTGC to ensure the quality of the generated data and to efficiently fine-tune the model using both the generated and original data. The visualization demonstrates the effectiveness of the generation module in LTGC, which produces accurate and diverse tail data. Additionally, the experimental results demonstrate that our LTGC outperforms existing state-of-the-art methods on popular long-tailed benchmarks.

03 Apr 2025

attention-mechanisms autonomous-vehicles computer-science

MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception

Beihang University

Tsinghua University HKUST(GZ) Beijing University of Chemical Technology Beijing Institute of Technology

MMTL-UniAD introduces a unified framework for assistive driving perception, simultaneously tackling driver state and traffic context recognition using multimodal data. The system achieved higher mean accuracy across four distinct tasks on the AIDE dataset compared to previous methods, demonstrating the benefits of its integrated approach.

10 Sep 2025

materials-science physics

Anomalous-Hall Neel textures in altermagnetic materials

Chinese Academy of Sciences

University of Science and Technology of China Anhui University Beijing University of Chemical Technology

Recently, the altermagnets, a new kind of collinear antiferromagnet with nearly zero net magnetization and momentum-dependent spin-splitting of bands, have sparked great interest. Despite simple magnetic structures, these altermagnets exhibit intriguing and intricate dependence of anomalous Hall effect (AHE) on the Néel vector, in contrast to the conventional perpendicular configuration of Hall current with magnetization in ferromagnets. However, the fundamental relationship between the AHE and the Néel vector remains largely elusive. Here, we reveal all the unconventional anomalous Hall textures in the Néel vector space, dubbed anomalous-Hall Néel textures (AHNTs) for altermagnets. Specifically, we identify 10 types across four categories of AHNTs for all altermagnets. Notably, we find that AHNTs resemble the known spin textures in momentum space, and further reveal their symmetry origin. Meanwhile, we examine our key discoveries in prototypical altermagnets. Our work offers a thorough understanding of AHE in altermagnets and a complete and pictorial classification of altermagnets based on the geometry of response functions.

09 Jun 2025

causal-inference computer-science information-retrieval

Addressing Correlated Latent Exogenous Variables in Debiased Recommender Systems

University of Michigan Dalhousie University Beijing University of Chemical Technology

Recommendation systems (RS) aim to provide personalized content, but they face a challenge in unbiased learning due to selection bias, where users only interact with items they prefer. This bias leads to a distorted representation of user preferences, which hinders the accuracy and fairness of recommendations. To address the issue, various methods such as error-imputation-based, inverse propensity scoring, and doubly robust techniques have been developed. Despite this progress, from the structural causal model perspective, previous debiasing methods in RS assume the independence of the exogenous variables. In this paper, we relax this assumption and propose a learning algorithm based on likelihood maximization to learn a prediction model. We first discuss the correlation and difference between unmeasured confounding and our scenario, then we propose a unified method that effectively handles latent exogenous variables. Specifically, our method models the data generation process with latent exogenous variables under mild normality assumptions. We then develop a Monte Carlo algorithm to numerically estimate the likelihood function. Extensive experiments on synthetic datasets and three real-world datasets demonstrate the effectiveness of our proposed method. The code is at this https URL
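For context, inverse propensity scoring, one of the prior debiasing techniques named above, reweights each observed interaction by the inverse of its probability of being observed. The sketch below shows that standard baseline estimator only; it is not the likelihood-based method this paper proposes, and the function name and argument layout are assumptions for illustration.

```python
def ips_loss(errors, observed, propensities):
    """Inverse-propensity-scored estimate of the mean error over ALL user-item pairs.

    errors:       per-pair prediction error (only meaningful where observed)
    observed:     1 if the pair's feedback was observed, else 0
    propensities: probability that each pair is observed, in (0, 1]
    """
    total = 0.0
    for err, obs, prop in zip(errors, observed, propensities):
        if obs:
            # Rarely observed pairs are up-weighted to offset selection bias.
            total += err / prop
    return total / len(errors)
```

With correct propensities this estimator is unbiased for the error over all pairs, whereas a plain average over observed pairs over-represents the items users chose to interact with.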

11 Mar 2025

computer-science robotics deep-reinforcement-learning

Adaptive Task Allocation in Multi-Human Multi-Robot Teams under Team Heterogeneity and Dynamic Information Uncertainty

Purdue University Beijing University of Chemical Technology

Task allocation in multi-human multi-robot (MH-MR) teams presents significant challenges due to the inherent heterogeneity of team members, the dynamics of task execution, and the information uncertainty of operational states. Existing approaches often fail to address these challenges simultaneously, resulting in suboptimal performance. To tackle this, we propose ATA-HRL, an adaptive task allocation framework using hierarchical reinforcement learning (HRL), which incorporates initial task allocation (ITA) that leverages team heterogeneity and conditional task reallocation in response to dynamic operational states. Additionally, we introduce an auxiliary state representation learning task to manage information uncertainty and enhance task execution. Through an extensive case study in large-scale environmental monitoring tasks, we demonstrate the benefits of our approach.

03 Oct 2025

clustering-algorithms computer-science contrastive-learning

Hybrid-Collaborative Augmentation and Contrastive Sample Adaptive-Differential Awareness for Robust Attributed Graph Clustering

Beijing University of Technology The University of Sydney Beijing University of Chemical Technology

Due to its powerful capability for self-supervised representation learning and clustering, contrastive attributed graph clustering (CAGC) has achieved great success, which mainly depends on effective data augmentation and contrastive objective setting. However, most CAGC methods utilize edges as auxiliary information to obtain node-level embedding representations and focus only on node-level embedding augmentation. This approach overlooks edge-level embedding augmentation and the interactions between node-level and edge-level embedding augmentations across different granularities. Moreover, these methods often treat all contrastive sample pairs equally, neglecting the significant differences between hard and easy positive-negative sample pairs, which ultimately limits their discriminative capability. To tackle these issues, a novel robust attributed graph clustering (RAGC) method, incorporating hybrid-collaborative augmentation (HCA) and contrastive sample adaptive-differential awareness (CSADA), is proposed. First, node-level and edge-level embedding representations and augmentations are executed simultaneously to establish a more comprehensive similarity measurement criterion for subsequent contrastive learning. In turn, the discriminative similarity guides edge augmentation. Second, by leveraging high-confidence pseudo-label information, a CSADA strategy is designed that adaptively identifies all contrastive sample pairs and treats them differentially through a weight modulation function. The HCA and CSADA modules reinforce each other in a virtuous cycle, thereby enhancing discriminability in representation learning. Comprehensive graph clustering evaluations over six benchmark datasets demonstrate the effectiveness of the proposed RAGC against several state-of-the-art CAGC methods.

18 Aug 2025

autonomous-vehicles computer-science computer-vision-security

Embodied Image Quality Assessment for Robotic Intelligence

Shanghai Jiao Tong University Beijing University of Chemical Technology

Researchers at Shanghai Jiao Tong University developed the first robot-centric image quality assessment (IQA) framework for embodied AI, establishing the Embodied Preference Database (EPD) with robot-generated labels and the lightweight Multi-scale Attention Embodied IQA (MA-EIQA) model. The study empirically confirms that robots evaluate image quality differently from humans, with MA-EIQA achieving state-of-the-art performance on the EPD dataset by linking image quality directly to robotic task success.

10 Sep 2025

materials-science physics

Extracting Phonon Quasiparticles from Molecular Dynamics Simulations

Beijing University of Chemical Technology Beijing Institute of Technology

Phonon anharmonicity is ubiquitous in real materials and is crucial for understanding thermal properties and phase stability. In this work, we show that phonon quasiparticles are optimally described by modes with maximum lifetimes, and prove that information about these quasiparticles is contained in two small matrices, $\mathcal{S}$ and $\mathcal{Q}$, which can be constructed directly from molecular dynamics simulations. Based on this knowledge, we propose an optimization scheme that allows us to efficiently determine temperature-dependent phonon modes along with their frequencies and lifetimes. We verified this method by applying it to silicon and cubic CaSiO$_3$, where it successfully captured their temperature-dependent phonon behaviors and the well-known phonon softening in cubic CaSiO$_3$. This theory provides a convenient tool for investigating phonon quasiparticles and can be extended to study other quasiparticles, such as electrons, holes, and magnons.

18 Feb 2025

computer-science software-engineering

An Empirical Study on Challenges for LLM Application Developers

Technical University of Munich Nantong University Beijing University of Chemical Technology

In recent years, large language models (LLMs) have seen rapid advancements, significantly impacting various fields such as computer vision, natural language processing, and software engineering. These LLMs, exemplified by OpenAI's ChatGPT, have revolutionized the way we approach language understanding and generation tasks. However, in contrast to traditional software development practices, LLM development introduces new challenges for AI developers in design, implementation, and deployment. These challenges span different areas (such as prompts, APIs, and plugins), requiring developers to navigate unique methodologies and considerations specific to LLM application development. Despite the profound influence of LLMs, to the best of our knowledge, these challenges have not been thoroughly investigated in previous empirical studies. To fill this gap, we present the first comprehensive study on understanding the challenges faced by LLM developers. Specifically, we crawl and analyze 29,057 relevant questions from a popular OpenAI developer forum. We first examine their popularity and difficulty. After manually analyzing 2,364 sampled questions, we construct a taxonomy of challenges faced by LLM developers. Based on this taxonomy, we summarize a set of findings and actionable implications for LLM-related stakeholders, including developers and providers (especially the OpenAI organization).
