alphaXiv

Hong Kong Baptist University

1,177

03 Dec 2025

computer-science artificial-intelligence machine-learning

A Definition of AGI

The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains-including reasoning, memory, and perception-and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly "jagged" cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 57%) concretely quantify both rapid progress and the substantial gap remaining before AGI.

588

03 Oct 2025

computer-science contrastive-learning machine-learning

Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

Shanghai Jiao Tong University Hong Kong Baptist University Shanghai Innovation Institute

Jiangchao Yao

Co-rewarding establishes a stable self-supervised reinforcement learning framework for large language models, leveraging complementary supervision via data-side analogy-invariance and model-side temporal invariance. This effectively prevents training collapse and reward hacking, leading to enhanced reasoning capabilities that often surpass prior self-rewarding methods and sometimes rival ground-truth supervised approaches.

1,093

04 Apr 2025

computer-science computer-vision-security computer-vision-and-pattern-recognition

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

National University of Singapore East China Normal University Hong Kong Baptist University

ScreenSpot-Pro introduces a benchmark for GUI grounding in professional, high-resolution computing environments, where existing MLLMs struggle to precisely locate UI elements. The paper also presents ScreenSeekeR, an agentic framework that guides a grounding model through hierarchical visual search, improving accuracy from 18.9% to 48.1% on the new benchmark without retraining.

268

3,194

22 Feb 2025

computer-science artificial-intelligence computation-and-language

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Northwestern Polytechnical University University of Surrey University of Rochester University of Science and Technology Beijing

HKUST Hong Kong Baptist University Chinese University of Hong Kong Shanghai Mobvoi Information Technology Co., Ltd.

Llasa introduces a unified, Llama-based architecture for speech synthesis that simplifies the text-to-speech pipeline to a single Transformer and a novel speech tokenizer. The work systematically investigates train-time and inference-time scaling, showing consistent quality improvements and strong performance on both speech generation and understanding tasks.

340

1,004

13 Oct 2025

causal-inference computer-science artificial-intelligence

Discovering and Reasoning of Causality in the Hidden World with Large Language Models

Carnegie Mellon University

The Chinese University of Hong Kong The University of Melbourne MBZUAI The University of Sydney Hong Kong Baptist University

Chenxi Liu

Revealing hidden causal variables alongside the underlying causal mechanisms is essential to the development of science. Despite the progress in the past decades, existing practice in causal discovery (CD) heavily relies on high-quality measured variables, which are usually given by human experts. In fact, the lack of well-defined high-level variables behind unstructured data has been a longstanding roadblock to a broader real-world application of CD. This procedure can naturally benefit from an automated process that can suggest potential hidden variables in the system. Interestingly, Large language models (LLMs) are trained on massive observations of the world and have demonstrated great capability in processing unstructured data. To leverage the power of LLMs, we develop a new framework termed Causal representatiOn AssistanT (COAT) that incorporates the rich world knowledge of LLMs to propose useful measured variables for CD with respect to high-value target variables on their paired unstructured data. Instead of directly inferring causality with LLMs, COAT constructs feedback from intermediate CD results to LLMs to refine the proposed variables. Given the target variable and the paired unstructured data, we first develop COAT-MB that leverages the predictivity of the proposed variables to iteratively uncover the Markov Blanket of the target variable. Built upon COAT-MB, COAT-PAG further extends to uncover a more complete causal graph, i.e., Partial Ancestral Graph, by iterating over the target variables and actively seeking new high-level variables. Moreover, the reliable CD capabilities of COAT also extend the debiased causal inference to unstructured data by discovering an adjustment set. We establish theoretical guarantees for the CD results and verify their efficiency and reliability across realistic benchmarks and real-world case studies.

196

09 Oct 2025

computer-science computer-vision-security computer-vision-and-pattern-recognition

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

South China University of Technology

University of Science and Technology of China University of Melbourne Pazhou Lab Hong Kong Baptist University Hunan University Key Laboratory of Big Data and Intelligent Robot, Ministry of Education

Researchers at South China University of Technology and collaborators introduced NSG-VD, a physics-driven method utilizing a Normalized Spatiotemporal Gradient (NSG) and Maximum Mean Discrepancy, to detect AI-generated videos by identifying violations of physical continuity. The approach achieves superior detection performance on advanced generative models like Sora and demonstrates strong robustness in data-imbalanced settings.

691

26 Oct 2025

computer-science computer-vision-and-pattern-recognition deep-reinforcement-learning

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Shanghai AI Lab

University of Wisconsin-Madison Hong Kong Baptist University

The VISIONARY-R1 framework trains visual language models for complex reasoning using reinforcement learning and CoT-free question-answer pairs, outperforming leading proprietary models like GPT-4o on benchmarks such as MathVista. It effectively mitigates shortcut learning in VLMs by enforcing a structured caption-reason-answer output and employing an AI-feedback-based caption reward.

178

20 Nov 2025

agents chain-of-thought computer-science

Learning to Think Fast and Slow for Visual Language Models

Beijing Academy of Artificial Intelligence

University of Wisconsin-Madison Hong Kong Baptist University Institute of Automation, CAS Institude of Automation, CAS

Researchers introduce DualMindVLM, a Visual Language Model capable of automatically switching between fast and slow thinking modes based on task complexity. This approach achieves competitive accuracy with leading reasoning VLMs while reducing token usage by approximately 40% and exhibiting improved hallucination mitigation.

738

27 May 2025

computer-science artificial-intelligence computation-and-language

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Microsoft Hong Kong Baptist University

WizardCoder, a Code Large Language Model, leverages a tailored instruction-tuning method called Code Evol-Instruct to generate complex programming tasks. This approach enables open-source models to achieve state-of-the-art performance, surpassing GPT-3.5 on HumanEval+ and leading all open-source baselines across multiple code generation benchmarks.

1,355

27 Nov 2024

computer-science artificial-intelligence computation-and-language

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Microsoft University of Science and Technology Beijing

HKUST Hong Kong Baptist University Chinese University of Hong Kong

Wei Xue

Researchers from HKUST and collaborators developed X-Codec, an audio codec that embeds semantic information directly into its unified tokens. This design improves the performance of audio Large Language Models across various tasks, including reducing Word Error Rate in Text-to-Speech by up to 58% and enhancing music and general sound generation fidelity.

150

119

24 Sep 2025

adversarial-attacks computer-science computer-vision-security

Generative Model Inversion Through the Lens of the Manifold Hypothesis

The University of Melbourne

The University of Texas at Austin The University of Sydney Hong Kong Baptist University

Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process, yielding reconstructions with high visual quality and strong fidelity to the private training data. To explore the reason behind their effectiveness, we begin by examining the gradients of inversion loss with respect to synthetic inputs, and find that these gradients are surprisingly noisy. Further analysis reveals that generative inversion implicitly denoises these gradients by projecting them onto the tangent space of the generator manifold, filtering out off-manifold components while preserving informative directions aligned with the manifold. Our empirical measurements show that, in models trained with standard supervision, loss gradients often exhibit large angular deviations from the data manifold, indicating poor alignment with class-relevant directions. This observation motivates our central hypothesis: models become more vulnerable to MIAs when their loss gradients align more closely with the generator manifold. We validate this hypothesis by designing a novel training objective that explicitly promotes such alignment. Building on this insight, we further introduce a training-free approach to enhance gradient-manifold alignment during inversion, leading to consistent improvements over state-of-the-art generative MIAs.

05 Nov 2025

agents chain-of-thought computer-science

AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

City University of Hong Kong

ByteDance The Hong Kong University of Science and Technology (Guangzhou)

King’s College London Hong Kong Baptist University

The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step where rewrite answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that, fine-tuning 3B-8B parameter LLMs on AgenticMath generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.

07 Oct 2025

agents computer-science computation-and-language

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

LMU Munich

The Chinese University of Hong Kong

Microsoft Hong Kong Baptist University

EEPO introduces a "sample-then-forget" mechanism for Reinforcement Learning with Verifiable Rewards (RLVR) to combat entropy collapse by temporarily suppressing previously sampled responses during rollouts. It achieves average relative gains of 24.3% on Qwen2.5-3B and 33.0% on Llama3.2-3B-Instruct across mathematical reasoning benchmarks, maintaining higher policy entropy and improved generalization.

289

09 Jun 2025

agents computer-science artificial-intelligence

From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

Université de Montréal

Shanghai Jiao Tong University

Stanford University Hong Kong Baptist University Mila - Québec AI Institute

Jiangchao Yao

This research introduces AR-Bench, a new benchmark to assess Large Language Models' (LLMs) active reasoning abilities under incomplete information, revealing that current LLMs, including GPT-4o, struggle with interactive questioning and iterative information acquisition.

228

26 Sep 2024

computer-science computation-and-language computer-vision-and-pattern-recognition

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems

National University of Singapore

The University of Hong Kong Hong Kong Baptist University

Brandon QSLV

MMCode introduces the first multimodal benchmark to evaluate large multimodal models' (LMMs) ability to generate code from programming problems featuring visually rich information, demonstrating that even state-of-the-art LMMs like GPT-4V achieve only a 19.4% Pass@1 rate, highlighting current models' significant limitations in this domain.

04 Dec 2025

computer-science computer-vision-and-pattern-recognition geometric-deep-learning

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Harvard University

The Hong Kong Polytechnic University

Huazhong University of Science and Technology Hong Kong Baptist University Jianghan University Hubei University of Education

A new Transformer-based architecture introduces the first feed-forward unified framework for 4D language grounding, enabling open-vocabulary querying in dynamic environments without per-scene optimization. This approach achieves state-of-the-art performance on HyperNeRF and Neu3D datasets, demonstrating up to a 3% mIoU improvement and strong generalization capabilities for both time-agnostic and time-sensitive queries.

339

19 May 2025

chain-of-thought computer-science computation-and-language

On the Thinking-Language Modeling Gap in Large Language Models

Carnegie Mellon University

The Chinese University of Hong Kong MBZUAI The University of Sydney Hong Kong Baptist University

Chenxi Liu

Researchers identified a 'thinking-language modeling gap' in LLMs, where models learn from language expressions rather than underlying thought processes, often leading to reasoning biases when implicit information is present. They developed Language-of-Thoughts (LoT) prompting, a three-component approach (Observe, Expand, Echo) that explicitly addresses this implicitness, consistently outperforming Chain-of-Thought prompting on various reasoning benchmarks like GPQA, LSAT, and FOLIO across multiple LLM models.

18 Sep 2025

computer-science computers-and-society

AI and the Future of Academic Peer Review

University of Cambridge

National University of Singapore

University of Oxford

University of Copenhagen Yonsei University Hong Kong Baptist University London South Bank University Chinese EQUATOR Centre

A comprehensive, ethics-informed framework outlines the integration of AI, especially Large Language Models, into academic peer review, systematically addressing challenges and opportunities. It demonstrates how AI can mitigate persistent issues like lengthy publication delays and reviewer burden, offering specific safeguards against concerns such as hallucination and bias, and suggesting advanced AI architectures for improved performance.

13 Oct 2025

ai-for-health computer-science computer-vision-and-pattern-recognition

Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment

The Hong Kong Polytechnic University Hong Kong Baptist University

Researchers at The Hong Kong Polytechnic University and Hong Kong Baptist University developed DiPro, a framework that models disease progression by disentangling dynamic pathological changes from static anatomical features in sequential chest X-rays and aligning these with electronic health records across multiple timescales. DiPro achieved improved performance in disease progression identification and general ICU prediction tasks while offering clinically aligned interpretations.

140

26 Jun 2025

agents computer-science artificial-intelligence

Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

Georgia Institute of Technology

The Chinese University of Hong Kong Hong Kong Baptist University

Chenxi Liu

The ReG framework improves graph-based Retrieval-Augmented Generation by aligning weak retrievers with Large Language Models, using LLM-refined supervision signals and structure-aware knowledge reorganization. This approach achieves state-of-the-art performance on WebQSP and CWQ datasets, notably outperforming baselines even with 5% of the training data and reducing LLM reasoning token usage by up to 30%.

There are no more papers matching your filters at the moment.

Events

Personalize Your Feed

Install Browser Extension

We're hiring

alphaXiv

Explore

State of the Art

Sign In

Labs

Feedback

Dark mode

A Definition of AGI

Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Discovering and Reasoning of Causality in the Hidden World with Large Language Models

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Learning to Think Fast and Slow for Visual Language Models

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Generative Model Inversion Through the Lens of the Manifold Hypothesis

AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

On the Thinking-Language Modeling Gap in Large Language Models

AI and the Future of Academic Peer Review

Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment

Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

Events

AI for Law

Personalize Your Feed