alphaXiv

China Agricultural University

201

26 Sep 2025

computer-science artificial-intelligence computation-and-language

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

Tianjin University

Sun Yat-Sen University

Tsinghua University China Agricultural University

Nanyang Technological University

HKUST Huazhong Agricultural University Didichuxing Co. Ltd

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. Our code will be available here: this https URL.

179

12 Aug 2025

autonomous-vehicles computer-science computer-vision-and-pattern-recognition

VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception

Tsinghua University China Agricultural University Meituan Southwestern University of Finance and Economics

Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint semantic-geometric loss design: token-level semantic loss is applied during early training to ensure stable convergence, while 3D IoU loss is introduced in later stages to refine the accuracy of 3D bounding box predictions. Evaluations on the nuScenes dataset demonstrate that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy, fully validating the effectiveness and advancement of our method.
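The staged loss described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the box layout `(cx, cy, cz, w, l, h)`, the axis-aligned simplification (real driving boxes are usually rotated), the `switch_step` and `iou_weight` parameters, and the choice to keep the token loss after the switch are all assumptions made for illustration.

```python
from typing import Sequence

# Hypothetical box layout: centre (cx, cy, cz) followed by size (w, l, h).
Box = Sequence[float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """IoU of two axis-aligned 3D boxes (a simplification: no yaw angle)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        inter *= max(0.0, hi - lo)  # zero overlap on any axis gives zero volume
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union if union > 0 else 0.0


def joint_loss(token_loss: float, pred: Box, target: Box, step: int,
               switch_step: int = 1000, iou_weight: float = 1.0) -> float:
    """Token-level loss only in early training; add a (1 - IoU) term later.

    Keeping the token loss after `switch_step` is an assumption of this sketch.
    """
    if step < switch_step:
        return token_loss
    return token_loss + iou_weight * (1.0 - iou_3d_axis_aligned(pred, target))


box_a = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
box_b = (1.0, 0.0, 0.0, 2.0, 2.0, 2.0)
early = joint_loss(0.5, box_a, box_b, step=0)
late = joint_loss(0.5, box_a, box_b, step=2000)
```

Here `early` uses only the token-level term, while `late` also penalizes the geometric mismatch between the two boxes.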

1,220

21 May 2025

chain-of-thought computer-science artificial-intelligence

Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

Sun Yat-Sen University

Tsinghua University

Zhejiang University China Agricultural University

Nanyang Technological University Didichuxing Co. Ltd

Ada-R1 enables large language models to adaptively select between long and short Chain-of-Thought reasoning paths, achieving over 50% reduction in inference length with less than 2% average accuracy degradation on mathematical benchmarks. This framework leverages a bi-level optimization approach to intelligently balance efficiency and performance.

268

09 Apr 2025

computer-science computer-vision-and-pattern-recognition few-shot-learning

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

South China University of Technology

Westlake University

Renmin University of China China Agricultural University

Researchers from multiple Chinese institutions develop a training-free method for enhancing MLLM-generated image captions through local patch analysis and semantic filtering, achieving +4.7 CIDEr score improvement on DID-Bench while reducing hallucinations through a divide-then-aggregate strategy that mimics human visual perception.

386

27 Feb 2025

computer-science continual-learning artificial-intelligence

When Continue Learning Meets Multimodal Large Language Model: A Survey

China Agricultural University

Peking University

A comprehensive survey by researchers from China Agricultural University and Peking University maps the rapidly evolving field of continual learning in multimodal large language models. The work reviews 440 research papers, providing the first dedicated overview of progress, challenges, and future directions for enabling MLLMs to adapt dynamically while mitigating catastrophic forgetting.

16 Sep 2025

computer-science computer-vision-and-pattern-recognition domain-adaptation

WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory

Wuhan University China Agricultural University

Technical University of Munich

HKUST Ministry of Agriculture and Rural Affairs

Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scenes, limited annotation, or a single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging these unique characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks: tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline potential future work for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Models for street tree asset management. The WHU-STree dataset is accessible at: this https URL.

13 Aug 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

Sun Yat-Sen University

Tsinghua University China Agricultural University Southwest Jiaotong University National Supercomputing Center in Shenzhen

Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 27,247 QA pairs and 19,615 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition; notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at this https URL.

27 Aug 2025

attention-mechanisms computer-science computer-vision-and-pattern-recognition

Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

University of Toronto China Agricultural University

Beijing Jiaotong University Concordia University McMaster University

CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations, and this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.

265

08 Oct 2025

ai-for-health computer-science artificial-intelligence

Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

Chinese Academy of Sciences China Agricultural University

Peking University Baidu Inc

In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at this https URL.

14 Oct 2025

computer-science computer-vision-and-pattern-recognition domain-adaptation

GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset

Sun Yat-Sen University

Tsinghua University China Agricultural University

The University of Hong Kong Southwest Jiaotong University National Supercomputing Center in Shenzhen

GTPBD introduces the first global, fine-grained dataset for terraced agricultural parcels, featuring high-resolution imagery and a multi-level annotation system that includes masks, boundaries, and parcel labels. This new benchmark reveals that current deep learning models struggle with the dataset's inherent complexity and domain shifts in tasks like semantic segmentation, edge detection, parcel extraction, and unsupervised domain adaptation.

105

06 Aug 2025

agents computer-science artificial-intelligence

SEA: Self-Evolution Agent with Step-wise Reward for Computer Use

China Agricultural University Lenovo

Computer-use agents are an emerging area of artificial intelligence that aims to operate computers to accomplish users' tasks, and the area has attracted considerable attention from both industry and academia. However, the performance of present agents is still far from practical use. In this paper, we propose the Self-Evolution Agent (SEA) for computer use, and to develop this agent, we propose creative methods in data generation, reinforcement learning, and model enhancement. Specifically, we first propose an automatic pipeline to generate verifiable trajectories for training. We then propose efficient step-wise reinforcement learning to alleviate the significant computational requirements of long-horizon training. Finally, we propose an enhancement method that merges the grounding and planning abilities into one model without any extra training. Based on these innovations in data generation, training strategy, and enhancement, we obtain the Self-Evolution Agent (SEA) for computer use with only 7B parameters, which outperforms models with the same number of parameters and has comparable performance to larger ones. We will make the model weights and related code open-source in the future.

110

29 May 2024

computer-science computer-vision-and-pattern-recognition

Efficient Image Deblurring Networks based on Diffusion Models

China Agricultural University

A new deblurring model, Swintormer, combines latent prior features from diffusion models with an efficient Transformer architecture and a sliding window inference strategy, achieving state-of-the-art performance on defocus and motion deblurring benchmarks with significantly reduced computational cost.

121

02 Mar 2025

computer-science computer-vision-and-pattern-recognition machine-learning

Audio-Visual Instance Segmentation

China Agricultural University

Peking University University of Surrey

liao qu

In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes the sound source within each frame and condenses object-specific contexts into concise tokens. Then it builds long-range audio-visual dependencies between these tokens using window-based attention, and tracks sounding objects across entire video sequences. Extensive experiments reveal that our method performs best on AVISeg, surpassing the existing methods from related tasks. We further conduct the evaluation on several multi-modal large models. Unfortunately, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding. Dataset and code are available at this https URL

100

05 Nov 2024

computer-science computation-and-language generative-models

Generative Text Steganography with Large Language Model

Fudan University China Agricultural University

LLM-Stega introduces the first black-box generative text steganography method, operating through large language model user interfaces via an external keyword set and encrypted mapping. It demonstrates superior embedding capacity, near-random guessing resistance to steganalysis, and high text quality and imperceptibility.

07 Jul 2024

ai-for-health computer-science computer-vision-and-pattern-recognition

Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation

China Agricultural University

Peking University

Yale University

Mohamed bin Zayed University of Artificial Intelligence

HKUST

Chenyu You

Researchers from China Agricultural University, Peking University, MBZUAI, HKUST, and Yale University developed RecLMIS, a language-guided medical image segmentation model featuring a cross-modal conditioned reconstruction mechanism. This approach explicitly aligns image and text features, leading to segmentation improvements of up to 3.74% mIoU over prior methods while simultaneously achieving 46.3% faster inference speeds by removing reconstruction modules after training.

20 Aug 2025

computer-science computer-vision-and-pattern-recognition image-and-video-processing

A Comprehensive Review of Agricultural Parcel and Boundary Delineation from Remote Sensing Images: Recent Progress and Future Perspectives

Sun Yat-Sen University

Tsinghua University China Agricultural University Central China Normal University Huazhong Agricultural University HuanTian Wisdom Technology Co., Ltd.

Powered by advances in multiple remote sensing sensors, the production of high spatial resolution images provides great potential to achieve cost-efficient and high-accuracy agricultural inventory and analysis in an automated way. Many studies aiming to provide an inventory at the level of individual agricultural parcels have produced numerous methods for Agricultural Parcel and Boundary Delineation (APBD). This review covers APBD methods for detecting and delineating agricultural parcels and systematically reviews the past and present of APBD-related research applied to remote sensing images. To provide a clear knowledge map of existing APBD efforts, we conduct a comprehensive review of recent APBD papers to build a meta-data analysis, including the algorithm, the study site, the crop type, the sensor type, the evaluation method, etc. We categorize the methods into three classes: (1) traditional image processing methods (including pixel-based, edge-based and region-based); (2) traditional machine learning methods (such as random forest, decision tree); and (3) deep learning-based methods. With deep learning-oriented approaches accounting for the majority, we further discuss deep learning-based methods such as semantic segmentation-based, object detection-based and Transformer-based methods. In addition, we discuss five APBD-related issues to further comprehend the APBD domain using remote sensing data, such as multi-sensor data in APBD tasks, comparisons between single-task learning and multi-task learning in the APBD domain, comparisons among different algorithms and different APBD tasks, etc. Finally, this review proposes some APBD-related applications and a few exciting prospects and potential hot topics in future APBD research. We hope this review helps researchers involved in the APBD domain keep track of its development and trends.

19 Sep 2025

computer-science computer-vision-and-pattern-recognition image-generation

USCTNet: A deep unfolding nuclear-norm optimization solver for physically consistent HSI reconstruction

China Agricultural University

Reconstructing hyperspectral images (HSIs) from a single RGB image is ill-posed and can become physically inconsistent when the camera spectral sensitivity (CSS) and scene illumination are misspecified. We formulate RGB-to-HSI reconstruction as a physics-grounded inverse problem regularized by a nuclear norm in a learnable transform domain, and we explicitly estimate CSS and illumination to define the forward operator embedded in each iteration, ensuring colorimetric consistency. To avoid the cost and instability of full singular-value decompositions (SVDs) required by singular-value thresholding (SVT), we introduce a data-adaptive low-rank subspace SVT operator. Building on these components, we develop USCTNet, a deep unfolding solver tailored to HSI that couples a parameter estimation module with learnable proximal updates. Extensive experiments on standard benchmarks show consistent improvements over state-of-the-art RGB-based methods in reconstruction accuracy. Code: this https URL

23 Nov 2024

computer-science artificial-intelligence computation-and-language

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

Sichuan University

Zhejiang University China Agricultural University

Peking University

Yale University Meta GenAI

The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasingly deeper chemical knowledge. Our dataset has more than 30K samples across various chemical materials. We incorporate handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. Our automated evaluation framework thoroughly assesses the safety, accuracy, and appropriateness of LLM responses. Extensive experiments with state-of-the-art LLMs reveal notable strengths and critical vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. Our code and dataset are available at this https URL. Warning: this paper contains discussions on the synthesis of controlled chemicals using AI models.

18 Sep 2025

computer-science robotics

T-araVLN: Translator for Agricultural Robotic Agents on Vision-and-Language Navigation

China Agricultural University

Agricultural robotic agents have become powerful helpers in a wide range of agricultural tasks, yet they still rely heavily on manual operation or fixed railways for movement. To address this limitation, the AgriVLN method and the A2A benchmark pioneered the extension of Vision-and-Language Navigation (VLN) to the agricultural domain, enabling agents to navigate to target positions by following natural language instructions. AgriVLN understands simple instructions effectively, but often misunderstands complex ones. To bridge this gap, we propose the Translator for Agricultural Robotic Agents on Vision-and-Language Navigation (T-araVLN), in which the Instruction Translator module translates the original instruction into a more refined and precise one. When evaluated on the A2A benchmark, our T-araVLN improves Success Rate from 0.47 to 0.63 and reduces Navigation Error from 2.91m to 2.28m, demonstrating state-of-the-art performance in the agricultural domain. Code: this https URL.

10 Nov 2025

computer-science computer-vision-and-pattern-recognition generative-models

Image Restoration via Primal Dual Hybrid Gradient and Flow Generative Model

China Agricultural University Capital Normal University

Regularized optimization has been a classical approach to solving imaging inverse problems, where the regularization term enforces desirable properties of the unknown image. Recently, the integration of flow matching generative models into image restoration has garnered significant attention, owing to their powerful prior modeling capabilities. In this work, we incorporate such generative priors into a Plug-and-Play (PnP) framework based on proximal splitting, where the proximal operator associated with the regularizer is replaced by a time-dependent denoiser derived from the generative model. While existing PnP methods have achieved notable success in inverse problems with smooth squared $\ell_2$ data fidelity (typically associated with Gaussian noise), their applicability to more general data fidelity terms remains underexplored. To address this, we propose a general and efficient PnP algorithm inspired by the primal-dual hybrid gradient (PDHG) method. Our approach is computationally efficient, memory-friendly, and accommodates a wide range of fidelity terms. In particular, it supports both $\ell_1$ and $\ell_2$ norm-based losses, enabling robustness to non-Gaussian noise types such as Poisson and impulse noise. We validate our method on several image restoration tasks, including denoising, super-resolution, deblurring, and inpainting, and demonstrate that $\ell_1$ and $\ell_2$ fidelity terms outperform the conventional squared $\ell_2$ loss in the presence of non-Gaussian noise.
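As a rough illustration of the idea in this abstract, the following sketch runs a textbook PDHG (Chambolle-Pock) iteration for an $\ell_1$ data-fidelity term with the regularizer's proximal step replaced by a plug-in denoiser. It is not the paper's algorithm: the identity forward operator, the 1-D list representation, the step sizes, and the `denoiser(v, k)` callback (with the iteration index `k` standing in for the time dependence) are assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def pnp_pdhg_l1(b: Vector,
                denoiser: Callable[[Vector, int], Vector],
                tau: float = 0.5, sigma: float = 0.5,
                theta: float = 1.0, iters: int = 200) -> Vector:
    """PDHG for min_x ||x - b||_1 + R(x), with prox of R swapped for `denoiser`.

    The forward operator is the identity here (plain denoising), so the
    standard step-size condition reduces to tau * sigma < 1.
    """
    x = [0.0] * len(b)
    x_bar = list(x)
    y = [0.0] * len(b)
    for k in range(iters):
        # Dual step: the prox of sigma * f^* for f = ||. - b||_1 is a shift
        # by sigma * b followed by clipping to [-1, 1].
        y = [min(1.0, max(-1.0, yi + sigma * (xb - bi)))
             for yi, xb, bi in zip(y, x_bar, b)]
        # Primal step: descent along the dual variable, then the denoiser
        # in place of the regularizer's proximal operator.
        x_new = denoiser([xi - tau * yi for xi, yi in zip(x, y)], k)
        # Extrapolation step.
        x_bar = [xn + theta * (xn - xi) for xn, xi in zip(x_new, x)]
        x = x_new
    return x


def identity_denoiser(v: Vector, k: int) -> Vector:
    """Placeholder denoiser (no prior); a learned model would go here."""
    return list(v)


observed = [1.0, -2.0, 3.0]
restored = pnp_pdhg_l1(observed, identity_denoiser)
```

With the placeholder identity denoiser there is no prior, so the iteration simply converges to the observation itself; substituting a real denoiser is what turns this into a plug-and-play restoration loop.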
