alphaXiv

History

Papers Benchmarks

Wuhan AI Research

706

28 Dec 2023

computer-science computer-vision-and-pattern-recognition

AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models

Chinese Academy of Sciences Objecteye Inc.Wuhan AI Research

Guibo Zhu

AnomalyGPT adapts large vision-language models for industrial anomaly detection, enabling direct anomaly judgment, precise pixel-level localization, and multi-turn dialogue. It achieves state-of-the-art performance in few-shot and unsupervised settings on datasets like MVTec-AD, with a 1-shot image-level AUC of 94.1%.

868

2,614

17 Apr 2025

agents ai-for-health chain-of-thought

Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents

University of Chinese Academy of Science Institute of Automation, CAS Wuhan AI Research

Shuo Ren

This survey from the Chinese Academy of Sciences provides a systematic review of Large Language Model (LLM)-based scientific agents, detailing their specialized architectures, evaluation benchmarks, diverse applications, and critical ethical considerations. It defines the unique characteristics that differentiate scientific agents from general-purpose LLMs, such as their integration with scientific tools and handling of complex data, outlining current capabilities and challenges in accelerating scientific discovery.

7,144

23 Mar 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Chinese Academy of Sciences Peng Cheng Laboratory Wuhan AI Research Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences

Researchers develop Vision-R1, a human-free alignment framework for large vision-language models that uses vision-guided reinforcement learning and progressive rule refinement to improve object localization capabilities, achieving 6% better generalization on unseen scenarios while eliminating the need for human preference data and reward model training.

238

749

17 Nov 2025

agents computer-science artificial-intelligence

KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

Chinese Academy of Sciences Wuhan AI Research

Winter James

Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO, suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE) - a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.

864

10 Jul 2024

computer-science artificial-intelligence computation-and-language

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

University of Waterloo 01.ai Wuhan AI Research

tianyu Zheng

Yizhi Li

Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model's weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.

964

131

16 Sep 2025

computer-science conversational-ai computation-and-language

Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

Chinese Academy of Sciences Wuhan AI Research

Winter James

In visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) Benchmarks are absent to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce \textbf{ClearVQA} benchmark, which targets three common categories of ambiguity in VQA context, and encompasses various VQA scenarios.

136

21 Sep 2025

computer-science computer-vision-and-pattern-recognition generative-models

FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

Chinese Academy of Sciences Peng Cheng Laboratory University of Chinese Academy of Science Wuhan AI Research

Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.

3,147

11 Dec 2024

autonomous-vehicles computer-science computer-vision-security

Monocular Lane Detection Based on Deep Learning: A Survey

Chinese Academy of Sciences Peng Cheng Laboratory Xian Jiaotong University Wuhan AI Research

This survey provides the first comprehensive overview of both 2D and 3D deep learning-based monocular lane detection, presenting a novel analytical framework based on four 'core designs' and a unified empirical evaluation of method efficiency. It identifies key advancements in perspective effect elimination and lane modeling, highlighting the shift towards learnable view transformations and direct 3D representations for enhanced robustness.

01 Oct 2025

attention-mechanisms computer-science computer-vision-and-pattern-recognition

From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

Chinese Academy of Sciences

Westlake University Peng Cheng Laboratory University of Chinese Academy of Science Wuhan AI Research

TrajVLM-Gen, a two-stage framework developed by researchers from the Chinese Academy of Sciences and Westlake University, enables the generation of physically consistent videos by first employing a Vision-Language Model to predict physics-aware trajectories and then guiding a video diffusion model. The system achieves an 89.6% accuracy in trajectory generation, significantly outperforming baselines, and produces competitive FVD scores on standard video generation benchmarks.

1,171

06 Mar 2025

computer-science computer-vision-and-pattern-recognition machine-learning

Synthetic Data is an Elegant GIFT for Continual Vision-Language Models

Wuhan University

Chinese Academy of Sciences Wuhan AI Research

The GIFT (Generated data to Improve continual Fine-Tuning) framework enables Vision-Language Models to continually update their knowledge without catastrophic forgetting, particularly preserving their powerful pre-training generalization. This is achieved by generating synthetic image-text pairs via a diffusion model and applying a dual-objective distillation strategy alongside an adaptive weight consolidation method, leading to state-of-the-art performance on multi-domain continual learning benchmarks using minimal synthetic data.

349

26 Feb 2025

attention-mechanisms computer-science artificial-intelligence

Systematic Outliers in Large Language Models

Chinese Academy of Sciences Objecteye Inc.Wuhan AI Research

永琪安

Outliers have been widely observed in Large Language Models (LLMs), significantly impacting model performance and posing challenges for model compression. Understanding the functionality and formation mechanisms of these outliers is critically important. Existing works, however, largely focus on reducing the impact of outliers from an algorithmic perspective, lacking an in-depth investigation into their causes and roles. In this work, we provide a detailed analysis of the formation process, underlying causes, and functions of outliers in LLMs. We define and categorize three types of outliers-activation outliers, weight outliers, and attention outliers-and analyze their distributions across different dimensions, uncovering inherent connections between their occurrences and their ultimate influence on the attention mechanism. Based on these observations, we hypothesize and explore the mechanisms by which these outliers arise and function, demonstrating through theoretical derivations and experiments that they emerge due to the self-attention mechanism's softmax operation. These outliers act as implicit context-aware scaling factors within the attention mechanism. As these outliers stem from systematic influences, we term them systematic outliers. Our study not only enhances the understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression. The code is avilable at this https URL

277

21 Jun 2023

computer-science computer-vision-security artificial-intelligence

Fast Segment Anything

Chinese Academy of Sciences Objecteye Inc.Wuhan AI Research

永琪安

Fast Segment Anything (FastSAM) proposes a CNN-based alternative to the Transformer-based Segment Anything Model (SAM), achieving 50 times faster inference (40ms vs. 2099ms) while maintaining comparable performance for prompt-guided universal image segmentation. This model re-frames the task into all-instance segmentation followed by prompt-guided selection, making real-time applications feasible.

7,694

115

27 May 2025

chain-of-thought computer-science artificial-intelligence

Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

Chinese Academy of Sciences Peng Cheng Laboratory Wuhan AI Research Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences

Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g. grounding and visual understanding capabilities). Different from the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single pass forwarding without the need for multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general scenes and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, has the ability of end-to-end automatic understanding, self-thinking, and reasoning answers. Comprehensive experiments show that Griffon-R not only achieves advancing performance on complex visual reasoning benchmarks including VSR and CLEVR, but also enhances multimodal capabilities across various benchmarks like MMBench and ScienceQA. Data, models, and codes will be release at this https URL soon.

240

224

08 Jul 2024

computer-science computation-and-language ensemble-methods

Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models

Chinese Academy of Sciences Nanjing University of Science and Technology Wuhan AI Research

This survey paper provides a comprehensive review and structured taxonomy of collaborative strategies for Large Language Models, classifying methods into Merging, Ensemble, and Cooperation to enhance their performance, efficiency, and robustness. It clarifies relationships between diverse approaches and identifies future research directions for multi-LLM systems.

143

17 Jan 2025

computer-science computer-vision-and-pattern-recognition few-shot-learning

FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization

Shanghai Artificial Intelligence Laboratory

Chinese Academy of Sciences Peng Cheng Laboratory Objecteye Inc.Wuhan AI Research

Guibo Zhu

Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at this https URL.

11 Aug 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Chinese Academy of Sciences Peng Cheng Laboratory Wuhan AI Research

Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, \textit{etc}. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG. Data and codes are released at this https URL.

128

27 Oct 2025

computer-science conversational-ai artificial-intelligence

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Chinese Academy of Sciences Wuhan AI Research GWM AI Lab

Winter James

Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at this https URL

220

28 May 2024

computer-science computation-and-language sound

BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

Chinese Academy of Sciences Alibaba DAMO Academy Wuhan AI Research

Macro W

The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text still remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.

305

15 Apr 2024

computer-science computation-and-language

Bridging the Gap between Different Vocabularies for LLM Ensemble

Shanghai Artificial Intelligence Laboratory

Chinese Academy of Sciences Wuhan AI Research

EVA (Ensemble via Vocabulary Alignment) introduces a method to bridge vocabulary differences between large language models, enabling fine-grained token-level ensembling by learning explicit mappings. This approach achieved a 10.61% improvement on the GSM8K arithmetic reasoning task over the best individual model and outperformed an ensemble baseline with an additional fusion model on most tasks.

549

05 Jun 2025

computer-science artificial-intelligence computation-and-language

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Shanghai Artificial Intelligence Laboratory

Chinese Academy of Sciences Wuhan AI Research

Macro W

Winter James

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that

\textit{captures}

learned preferences from well-aligned English models by implicit rewards and

\textit{transfers}

them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at this https URL

There are no more papers matching your filters at the moment.

Events

Personalize Your Feed

Install Browser Extension

We're hiring

alphaXiv

Explore

State of the Art

Sign In

Labs

Feedback

Dark mode

AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models

Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

Monocular Lane Detection Based on Deep Learning: A Survey

From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

Synthetic Data is an Elegant GIFT for Continual Vision-Language Models

Systematic Outliers in Large Language Models

Fast Segment Anything

Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models

FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

Bridging the Gap between Different Vocabularies for LLM Ensemble

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Events

AI for Law

Personalize Your Feed