The Hong Kong Polytechnic University
Researchers at Sun Yat-sen University and collaborators introduce Continuous Scaling Attention (CSAttn), an attention-only Transformer block that achieves state-of-the-art performance across multiple image restoration tasks without relying on Feed-Forward Networks. The architecture demonstrates substantial improvements, including a 0.41 dB PSNR increase in image deraining and a 4.22 dB PSNR gain in low-light image enhancement, while maintaining competitive model efficiency.
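To make the attention-only idea concrete, here is a minimal PyTorch sketch of a Transformer block with no feed-forward sub-layer; the learnable per-channel scale stands in for the paper's continuous scaling, and the class name and all details are illustrative assumptions rather than the published CSAttn design.

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """Illustrative attention-only block: the usual FFN sub-layer is
    gone, and a learnable per-channel scale (an assumed stand-in for
    the paper's continuous scaling) modulates the attention output."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scale = nn.Parameter(torch.ones(dim))  # per-channel scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + self.scale * h  # residual; no FFN follows
```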
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike-coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms: training remains stable for weeks on hundreds of MetaX GPUs, with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains a more than 100× speedup in Time to First Token on 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15% sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
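The "(partially) constant memory" claim follows from the recurrent form of linear attention: the model carries a fixed-size state instead of a growing key-value cache. Below is a toy single-head sketch of that recurrence; feature maps, normalization, and the spiking neurons themselves are omitted, so this illustrates only the memory argument, not the SpikingBrain architecture.

```python
import torch

def linear_attention_step(state, q, k, v):
    """One recurrent step of (unnormalized) linear attention.

    `state` is a fixed (d_k, d_v) running sum of outer products
    k_t v_t^T, so inference memory stays constant however long the
    context grows -- unlike a softmax-attention KV cache.
    """
    state = state + torch.outer(k, v)  # fold the new token into the state
    out = q @ state                    # read out with the current query
    return state, out

d_k, d_v = 16, 16
state = torch.zeros(d_k, d_v)
for _ in range(4):  # stream tokens one at a time
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    state, out = linear_attention_step(state, q, k, v)
```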
A comprehensive, brain-inspired framework integrates diverse research areas of LLM-based intelligent agents, encompassing individual architecture, collaborative systems, and safety. The framework formally conceptualizes agent components, maps AI capabilities to human cognition to identify research gaps, and outlines a roadmap for developing autonomous, adaptive, and safe AI.
Researchers from The Hong Kong Polytechnic University, Dartmouth College, Max Planck Institute, Google DeepMind, and others developed Prophet, a training-free adaptive decoding paradigm for Diffusion Language Models (DLMs) that leverages early answer convergence. The method achieves up to 3.4 times faster inference by dynamically committing to answers when model confidence is high, often improving output quality compared to full-step decoding.
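A hedged sketch of the early-commit idea: run the iterative refinement loop, but stop as soon as a confidence signal clears a threshold. The `model` callable, the argmax update rule, and the top-1/top-2 gap criterion are all assumptions for illustration, not Prophet's exact formulation.

```python
import torch

def early_commit_decode(model, tokens, num_steps, gap_threshold=0.9):
    """Decode with a diffusion-LM-style refinement loop, committing
    early once every position's top-1/top-2 probability gap is wide
    enough (i.e., the answer has effectively converged)."""
    for step in range(num_steps):
        probs = model(tokens, step).softmax(dim=-1)  # (seq_len, vocab)
        tokens = probs.argmax(dim=-1)                # current best answer
        top2 = probs.topk(2, dim=-1).values
        gap = (top2[:, 0] - top2[:, 1]).min()        # least-confident position
        if gap > gap_threshold:
            break                                    # commit; skip remaining steps
    return tokens
```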
This survey systematically analyzes Graph Retrieval-Augmented Generation (GraphRAG), a paradigm that leverages graph structures for organizing, retrieving, and integrating knowledge to customize Large Language Models for specialized domains. It presents a comprehensive taxonomy and identifies technical foundations, achieving enhanced accuracy, contextual awareness, and reliability compared to traditional retrieval methods.
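For a flavor of graph-structured retrieval, the toy sketch below expands seed entities to a k-hop neighborhood and serializes the induced subgraph's edges as textual context for an LLM; the entity matching and serialization schemes are assumptions for illustration, not a method prescribed by the survey.

```python
import networkx as nx

def retrieve_khop_context(graph: nx.Graph, seeds, hops: int = 1) -> str:
    """Expand seed entities to their k-hop neighborhood and render
    the induced subgraph's edges as lines of LLM-ready context."""
    nodes = set(seeds)
    for _ in range(hops):
        nodes |= {n for s in list(nodes) for n in graph.neighbors(s)}
    sub = graph.subgraph(nodes)
    return "\n".join(
        f"{u} -[{d.get('rel', 'related_to')}]-> {v}"
        for u, v, d in sub.edges(data=True)
    )
```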
UniPixel introduces a unified large multi-modal model capable of concurrently performing object referring and segmentation across images and videos, achieving state-of-the-art results on 10 benchmarks and establishing a strong baseline for the new PixelQA task. This model integrates pixel-level understanding with general visual reasoning, demonstrating notable performance improvements over existing LMMs.
A comprehensive survey by Chen et al. (2025) introduces the first unified taxonomy for latent Chain-of-Thought (CoT) reasoning, organizing a rapidly growing field into token-wise horizontal and layer-wise vertical approaches. It synthesizes current research, practical applications, and outlines critical challenges for future advancements in LLM efficiency and cognitive capabilities.
Researchers from Xiamen University and The Hong Kong Polytechnic University developed GraphRAG-Bench, a new benchmark to systematically evaluate graph-based Retrieval-Augmented Generation (GraphRAG). Their analysis reveals that GraphRAG excels in complex reasoning and creative generation tasks but faces efficiency challenges and can underperform vanilla RAG on simpler fact retrieval, underscoring the importance of task complexity and graph quality.
Researchers from The Hong Kong Polytechnic University developed TokenSkip, a method that enables controllable Chain-of-Thought (CoT) compression in large language models by selectively pruning less semantically important tokens. This approach reduces inference latency and token usage by up to 40% with minimal accuracy loss, making CoT reasoning more efficient for deployment.
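The pruning step itself is simple once importance scores exist; the sketch below keeps the top fraction of tokens in their original order. The `importances` input stands in for TokenSkip's semantic-importance scores, whose computation is not reproduced here.

```python
def prune_cot_tokens(tokens, importances, keep_ratio=0.6):
    """Keep the `keep_ratio` most important CoT tokens, preserving
    their original order (keep_ratio=0.6 mirrors the up-to-40%
    token reduction mentioned above)."""
    k = max(1, int(len(tokens) * keep_ratio))
    top_idx = sorted(range(len(tokens)),
                     key=lambda i: importances[i],
                     reverse=True)[:k]
    return [tokens[i] for i in sorted(top_idx)]  # restore original order
```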
This survey paper defines and applies a 'full-stack' safety concept for Large Language Models (LLMs), systematically analyzing safety concerns across the entire model lifecycle, from data to deployment and commercialization. Synthesizing findings from over 900 papers, it provides a unified taxonomy of attacks and defenses and identifies key insights and future research directions for LLM and LLM-agent safety.
This research investigates how fine-tuning methods for math reasoning affect the broader capabilities of Large Language Models (LLMs), revealing that Reinforcement Learning (RL) approaches foster superior generalization across diverse reasoning and non-reasoning tasks compared to Supervised Fine-Tuning (SFT). RL-tuned models successfully transfer math gains and preserve performance on general tasks, while SFT models often exhibit narrow specialization and degradation in other domains.
The General Multimodal Embedder (GME) is introduced, an MLLM-based model designed for universal multimodal retrieval that can process text, images, visual documents, and fused-modal content. By training on a large, diversified dataset including newly synthesized fused-modal data, GME achieves state-of-the-art performance across various multimodal retrieval tasks, demonstrating the importance of comprehensive data for MLLM adaptation in this domain.
Youtu-GraphRAG introduces a vertically unified agentic paradigm that jointly optimizes graph construction and retrieval for large language models, significantly enhancing complex-reasoning accuracy and reducing token consumption by up to 90.71% across various benchmarks, while mitigating knowledge leakage through novel evaluation datasets.
A collaborative effort from City University of Hong Kong, National University of Singapore, AI2, and other institutions presents SPA, a reinforcement learning framework that instills an internal world model in Large Language Model (LLM) agents via self-play supervised fine-tuning. This approach substantially boosts agent performance and generalization in out-of-distribution environments, for instance raising Qwen2.5-1.5B-Instruct's Sokoban success rate (Pass@1) from 25.6% to 59.8%.
Researchers from the National University of Singapore and collaborators developed the Vision Bridge Transformer (ViBT), scaling Brownian Bridge Models with a novel stabilized velocity-matching objective for efficient conditional generation. ViBT achieves state-of-the-art results across instruction-based image editing, video stylization, and depth-to-video synthesis, while demonstrating inference speedups of 2.28x to 4.03x compared to existing conditional Diffusion Transformers.
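For intuition, here is a hedged sketch of a Brownian-bridge velocity-matching training step: sample a point on the bridge between condition x0 and target x1, then regress the network onto the mean bridge velocity. ViBT's stabilized objective likely differs in detail; this only illustrates the general recipe, and `model` and `sigma` are assumed.

```python
import torch

def bridge_velocity_loss(model, x0, x1, sigma=0.1):
    """One velocity-matching step on a Brownian bridge from x0 to x1.

    x_t = (1-t) x0 + t x1 + sigma * sqrt(t(1-t)) * eps, and the
    network is trained to predict the mean velocity (x1 - x0).
    """
    t = torch.rand(x0.shape[0], 1)           # per-sample bridge time in (0, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * x1 + sigma * (t * (1 - t)).sqrt() * eps
    v_pred = model(x_t, t)
    v_target = x1 - x0                        # mean velocity of the bridge
    return ((v_pred - v_target) ** 2).mean()
```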
Hong Kong Polytechnic University researchers develop SPA-RL (Stepwise Progress Attribution in Reinforcement Learning), a framework that addresses sparse, delayed rewards in LLM-agent training by decomposing the final task reward into stepwise contributions via a lightweight progress estimator. The three-stage approach combines behavior cloning, reward redistribution through progress attribution, and PPO training with intermediate rewards, yielding a 2.5% success-rate improvement and a 1.9% grounding-accuracy gain on ALFWorld, with consistent advantages on the WebShop and VirtualHome benchmarks.
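The reward-redistribution stage can be pictured with a small sketch: per-step progress estimates are normalized so that the resulting intermediate rewards sum to the final task reward, giving PPO a dense signal. The estimator's outputs are taken as given here, and the ReLU-plus-normalization scheme is an assumption for illustration.

```python
import torch

def redistribute_final_reward(progress: torch.Tensor, final_reward: float) -> torch.Tensor:
    """Convert per-step progress estimates into dense intermediate
    rewards that sum exactly to the final task reward."""
    progress = torch.relu(progress)                       # assume non-negative progress
    weights = progress / progress.sum().clamp_min(1e-8)   # normalize to sum to 1
    return weights * final_reward                         # stepwise reward shares
```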
This paper comprehensively surveys Video Temporal Grounding with Multimodal Large Language Models (VTG-MLLMs), presenting a novel three-dimensional taxonomy to classify methodologies and analyzing performance across diverse tasks and benchmarks. It provides a structured overview of architectural integrations, training strategies, and video feature processing techniques, consolidating advancements in the field.
VideoMind introduces a Chain-of-LoRA agent for long video reasoning, mimicking human comprehension by decomposing complex queries into sub-tasks with specialized roles. The system achieves state-of-the-art performance across various grounded video question-answering and temporal grounding benchmarks, particularly excelling on long-form video content.
R2ec introduces a unified large recommender model that intrinsically integrates reasoning and recommendation capabilities within a single architecture, optimizing performance and interpretability without relying on human-annotated reasoning data. The model consistently surpasses existing baselines in recommendation quality across multiple datasets while maintaining competitive inference efficiency.