QuickLLaMA (Query-aware Inference for LLMs) introduces a training-free inference acceleration method that lets Large Language Models efficiently process and accurately reason over arbitrarily long contexts, with demonstrated capability up to 1 million tokens. The approach significantly improves performance on long-context benchmarks, outperforming prior state-of-the-art methods such as InfLLM by over 7% on LLaMA3 while maintaining linear scaling of time and memory.
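A minimal sketch of the query-aware retrieval idea, assuming a simple scheme in which cached keys are grouped into fixed-size blocks and scored against the current query; the function name `select_blocks`, the mean-pooled block representative, and all sizes are illustrative rather than the paper's implementation.

```python
import torch

def select_blocks(query: torch.Tensor, key_cache: torch.Tensor,
                  block_size: int = 128, top_k: int = 16) -> torch.Tensor:
    """Query-aware KV selection sketch: score each cached key block by its
    similarity to the current query and keep only the top-k blocks.

    query:     (d,)          current query vector
    key_cache: (n_tokens, d) keys of the full long context
    returns:   (top_k * block_size, d) keys of the retained blocks
    """
    n_tokens, d = key_cache.shape
    n_blocks = n_tokens // block_size
    blocks = key_cache[: n_blocks * block_size].view(n_blocks, block_size, d)
    # Represent each block by its mean key; score by dot product with the query.
    scores = blocks.mean(dim=1) @ query
    keep = scores.topk(min(top_k, n_blocks)).indices
    return blocks[keep].reshape(-1, d)

# Usage: attention then runs over ~top_k * block_size keys instead of the full cache.
keys = torch.randn(10_000, 64)
q = torch.randn(64)
subset = select_blocks(q, keys)
print(subset.shape)  # torch.Size([2048, 64])
```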
This paper identifies and characterizes a universal policy entropy collapse in reinforcement learning for large language models (LLMs), revealing an empirical law that links performance to entropy. It further provides a mechanistic understanding of this phenomenon through covariance analysis and proposes two covariance-aware regularization methods, Clip-Cov and KL-Cov, which successfully maintain higher entropy and improve LLM reasoning performance on math and coding tasks.
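A minimal sketch of the covariance-aware idea behind KL-Cov, under the assumption that a per-token covariance proxy between log-probability and advantage is used to pick which tokens receive a KL-style penalty; the function name, the log-ratio penalty, and the fraction/scale defaults are illustrative, not the authors' exact formulation.

```python
import torch

def kl_cov_penalty(logp: torch.Tensor, logp_ref: torch.Tensor,
                   adv: torch.Tensor, frac: float = 0.2, beta: float = 1.0) -> torch.Tensor:
    """Covariance-aware KL regularizer sketch.

    logp:     (T,) log-probs of sampled tokens under the current policy
    logp_ref: (T,) log-probs of the same tokens under a reference policy
    adv:      (T,) per-token advantages
    Tokens with the largest (log-prob, advantage) covariance contribution are the
    ones that drive entropy collapse, so only that top fraction is penalized.
    """
    # Per-token contribution to Cov(log pi, A) over the sequence.
    cov = (logp - logp.mean()) * (adv - adv.mean())
    k = max(1, int(frac * logp.numel()))
    idx = cov.topk(k).indices
    # Simple KL-style penalty (log-ratio) on the selected high-covariance tokens.
    return beta * (logp[idx] - logp_ref[idx]).mean()

# Added on top of the usual policy-gradient loss for a rollout of T tokens.
T = 64
loss_extra = kl_cov_penalty(torch.randn(T), torch.randn(T), torch.randn(T))
```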
The PRIME framework enhances Large Language Model reasoning by efficiently integrating dense, token-level implicit rewards through online reinforcement learning. It achieves a 15.1% average improvement across key reasoning benchmarks and demonstrates 2.5x sample efficiency, outperforming larger models like Qwen2.5-Math-7B-Instruct with significantly less training data.
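As a rough illustration of how dense implicit rewards can be derived without token-level labels, the sketch below computes a per-token reward as the scaled log-ratio between an outcome-trained reward model and a frozen reference model; the function name and the `beta` default are assumptions.

```python
import torch

def implicit_token_rewards(logp_prm: torch.Tensor, logp_ref: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    """Dense token-level rewards from an implicit process reward model.

    logp_prm: (T,) log-probs of the sampled response tokens under the implicit
              PRM (a language model fine-tuned only on outcome labels)
    logp_ref: (T,) log-probs of the same tokens under a frozen reference model
    Each token's reward is the scaled log-ratio between the two models.
    """
    return beta * (logp_prm - logp_ref)

rewards = implicit_token_rewards(torch.randn(128), torch.randn(128))
print(rewards.shape)  # torch.Size([128])
```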
LLaVA-OneVision is an open Large Multimodal Model that achieves state-of-the-art performance among open LMMs across single-image, multi-image, and video tasks. This model narrows the performance gap with proprietary models like GPT-4V/o and demonstrates robust cross-modal task transfer, enabling diverse emergent visual understanding capabilities.
ReCamMaster introduces a framework for camera-controlled generative video re-rendering, enabling the synthesis of new videos from a single input video with novel camera trajectories. The method leverages a novel "Frame Dimension Conditioning" mechanism and a large-scale synthetic dataset, achieving improved visual quality, camera accuracy, and view synchronization over prior approaches.
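A toy sketch of what frame-dimension conditioning could look like at the tensor level, assuming latent video tensors of shape (batch, frames, channels, height, width); the shapes are illustrative and the surrounding video DiT is omitted.

```python
import torch

# Frame-dimension conditioning sketch: instead of channel-concatenating the
# source video with the target noise, the two clips are concatenated along the
# frame axis, so self-attention in the video backbone sees source and target
# frames as one longer token sequence.
b, f, c, h, w = 1, 16, 4, 32, 32
source_latents = torch.randn(b, f, c, h, w)   # tokens of the input video
target_latents = torch.randn(b, f, c, h, w)   # noisy latents to be denoised
conditioned = torch.cat([source_latents, target_latents], dim=1)
print(conditioned.shape)  # torch.Size([1, 32, 4, 32, 32])
```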
Researchers at Nanyang Technological University, Stanford University, and ByteDance Seed developed the Mixture of Contexts (MoC) framework, enabling Diffusion Transformers to generate minute-long multi-shot videos with high coherence. This method achieved an 85% sparsity, a 7x reduction in FLOPs, and a 2.2x speedup compared to dense attention baselines, while maintaining or improving perceptual quality and consistency.
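A simplified sketch of the sparse-attention idea, assuming fixed-size chunks whose mean-pooled keys are scored per query so that only the top-k chunks are attended to; the real method's content-aligned chunking, mandatory anchors, and efficient kernels are omitted, and the per-query Python loop is for clarity only.

```python
import torch
import torch.nn.functional as F

def mixture_of_contexts_attn(q, k, v, chunk: int = 64, top_k: int = 4):
    """Chunk-sparse attention sketch.

    q, k, v: (n, d) token queries/keys/values of a long video sequence.
    Each query scores mean-pooled key chunks, attends only to its top-k chunks,
    and skips the rest, which is where the FLOP savings come from.
    """
    n, d = k.shape
    n_chunks = n // chunk
    k_chunks = k[: n_chunks * chunk].view(n_chunks, chunk, d)
    v_chunks = v[: n_chunks * chunk].view(n_chunks, chunk, d)
    chunk_scores = q @ k_chunks.mean(dim=1).T               # (n, n_chunks)
    sel = chunk_scores.topk(top_k, dim=-1).indices          # (n, top_k)
    out = torch.empty_like(q)
    for i in range(n):                                      # per-query for clarity
        ks = k_chunks[sel[i]].reshape(-1, d)                # (top_k * chunk, d)
        vs = v_chunks[sel[i]].reshape(-1, d)
        attn = F.softmax(q[i] @ ks.T / d ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out

out = mixture_of_contexts_attn(torch.randn(256, 32), torch.randn(256, 32),
                               torch.randn(256, 32), chunk=32)
```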
SimKO is a method that improves Large Language Models trained with Reinforcement Learning with Verifiable Rewards by mitigating a phenomenon called "probability over-concentration" during token generation. The approach employs asymmetric gradient redistribution to enhance `pass@K` performance while also improving `pass@1` on various math and logical reasoning tasks, consistently outperforming existing RLVR techniques.
This paper offers the first comprehensive survey of Multimodal Chain-of-Thought (MCoT) reasoning, analyzing its evolution, diverse methodologies, and applications across various modalities. It consolidates scattered research, providing a systematic taxonomy, elucidating foundational concepts, and identifying future research directions to foster innovation in multimodal AI.
Seg-Zero introduces a pure reinforcement learning framework for reasoning segmentation, enabling emergent chain-of-thought capabilities without explicit reasoning data. The approach demonstrates superior zero-shot generalization and preserves general Visual Question Answering abilities, achieving 57.5 gIoU on ReasonSeg, an 18% improvement over the prior state-of-the-art LISA-7B.
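A toy reward in the spirit of a pure-RL segmentation setup: one term for emitting well-formed structured output and one IoU term for the predicted region prompt; the specific terms, weights, and box-only formulation are illustrative assumptions, not the paper's exact reward.

```python
def seg_reward(pred_box, gt_box, format_ok: bool,
               iou_weight: float = 1.0, format_weight: float = 0.5) -> float:
    """Illustrative RL reward: IoU of the predicted box against ground truth,
    plus a bonus for well-structured <think>/<answer> output.
    Boxes are (x1, y1, x2, y2)."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    iou = inter / union if union > 0 else 0.0
    return iou_weight * iou + format_weight * float(format_ok)

print(seg_reward((10, 10, 50, 50), (20, 20, 60, 60), True))  # ~0.89
```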
DreamOmni2 introduces a multimodal instruction-based editing and generation framework that enables image manipulation using both text and multiple reference images for concrete objects and abstract attributes. The system, supported by a novel synthetic data pipeline and an enhanced Diffusion Transformer, achieves leading performance in human evaluations for image editing (e.g., 68.29% for abstract attribute editing) and competitive results in generation compared to commercial models.
Researchers from Nanjing University, CASIA, HKU, and other institutions introduced Video-MME, a comprehensive benchmark for evaluating multi-modal large language models (MLLMs) in video analysis, covering diverse domains, temporal lengths, and modalities. Evaluations on Video-MME revealed that commercial MLLMs, particularly Gemini 1.5 Pro, achieved higher accuracy than open-source models, and performance across all models declined with increasing video duration, despite gains from integrating subtitles and audio.
G-Designer introduces a framework using Graph Neural Networks to dynamically generate task-aware communication topologies for LLM-based multi-agent systems. This approach achieved superior performance on various benchmarks while significantly reducing token consumption by up to 95.33% and demonstrating high adversarial robustness.
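A heavily simplified sketch of a task-aware topology designer: agent profiles are fused with a task embedding, passed through one learned layer, and decoded into pairwise edge probabilities that define the communication graph; the module name, sizes, and inner-product decoder are assumptions, not G-Designer's actual architecture.

```python
import torch
import torch.nn as nn

class TopologyDesigner(nn.Module):
    """Toy task-aware communication-topology designer."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.encode = nn.Linear(2 * dim, dim)

    def forward(self, agent_feats: torch.Tensor, task_feat: torch.Tensor,
                threshold: float = 0.5) -> torch.Tensor:
        # agent_feats: (n_agents, dim), task_feat: (dim,)
        n = agent_feats.size(0)
        x = torch.cat([agent_feats, task_feat.expand(n, -1)], dim=-1)
        z = torch.tanh(self.encode(x))                # task-aware node embeddings
        edge_logits = z @ z.T                         # inner-product decoder
        adj = (torch.sigmoid(edge_logits) > threshold).float()
        adj.fill_diagonal_(0)                         # no self-messages
        return adj                                    # (n_agents, n_agents)

designer = TopologyDesigner()
adj = designer(torch.randn(5, 64), torch.randn(64))   # who talks to whom for this task
print(adj)
```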
This work comprehensively investigates applying Chain-of-Thought reasoning strategies to autoregressive image generation, demonstrating how verification and reinforcement mechanisms can significantly enhance image quality and text-to-image alignment. The proposed approach achieves a 24% improvement over the Show-o baseline on the GenEval benchmark, surpassing Stable Diffusion 3 by 15%.
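A minimal sketch of the test-time verification idea (best-of-N reranking with a reward model); `generate` and `reward_model` are stand-in callables rather than any specific model API.

```python
import torch

def best_of_n(prompt: str, generate, reward_model, n: int = 8):
    """Sample N candidate images for a prompt and keep the one the reward
    model scores highest (best-of-N reranking)."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = torch.tensor([reward_model(prompt, img) for img in candidates])
    return candidates[scores.argmax().item()]

# Toy usage with dummy callables standing in for the generator and the verifier.
best = best_of_n("a red cube on a blue sphere",
                 generate=lambda p: torch.rand(3, 64, 64),
                 reward_model=lambda p, img: img.mean().item())
```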
Researchers from CUHK, HKU, Beihang University, and Alibaba introduced FLUX-Reason-6M, a 6-million-image, reasoning-focused text-to-image dataset, and PRISM-Bench, a comprehensive benchmark for evaluating T2I models. This work provides an open-source resource with 20 million bilingual captions, including Generation Chain-of-Thought prompts, aiming to advance T2I reasoning capabilities and offering a robust evaluation of 19 leading models, highlighting persistent challenges in text rendering and long instruction following.
Researchers from the Chinese University of Hong Kong and SmartMore developed VisionReasoner, a unified framework for visual perception that integrates a large vision-language model with reinforcement learning. This system handles diverse tasks including detection, segmentation, and counting through a shared reasoning process, demonstrating improved performance across these benchmarks and generating interpretable thought traces without explicit reasoning training.
VisionZip introduces a method to reduce visual token redundancy in Vision Language Models (VLMs) by intelligently selecting dominant tokens and merging contextual ones. The approach achieves up to an 8x reduction in prefilling time while maintaining high model performance across image and video understanding tasks.
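A rough sketch of the select-then-merge idea, assuming token importance is given by how much attention each visual token receives; the grouping-by-average merge is a simplification of similarity-based merging, and the token budgets are illustrative.

```python
import torch

def visionzip_tokens(tokens: torch.Tensor, attn_to_cls: torch.Tensor,
                     n_dominant: int = 54, n_contextual: int = 10) -> torch.Tensor:
    """Keep the most-attended visual tokens and compress the rest.

    tokens:      (n, d) visual tokens from the encoder
    attn_to_cls: (n,)   attention each token receives (importance proxy)
    returns:     (n_dominant + n_contextual, d) reduced token set
    """
    n, d = tokens.shape
    order = attn_to_cls.argsort(descending=True)
    dominant = tokens[order[:n_dominant]]              # keep the informative tokens
    rest = tokens[order[n_dominant:]]                  # merge the remainder
    # Simple merge: average the leftover tokens in n_contextual equal groups.
    groups = rest[: (rest.size(0) // n_contextual) * n_contextual]
    contextual = groups.view(n_contextual, -1, d).mean(dim=1)
    return torch.cat([dominant, contextual], dim=0)

out = visionzip_tokens(torch.randn(576, 1024), torch.rand(576))
print(out.shape)  # torch.Size([64, 1024])
```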
G²VLM integrates 3D reconstruction and spatial reasoning within a single Vision-Language Model, addressing the spatial intelligence limitations of current VLMs. It learns explicit visual geometry from 2D data using a Mixture-of-Transformer-Experts architecture, leading to robust spatial understanding and strong performance on both 3D reconstruction and complex spatial reasoning benchmarks.
Geometric-Mean Policy Optimization (GMPO) stabilizes reinforcement learning for large language models (LLMs) by employing a geometric mean for token-level reward aggregation. This approach yielded an average 4.1% improvement in Pass@1 accuracy over GRPO on mathematical reasoning benchmarks and showed gains in multimodal and Mixture-of-Experts settings, demonstrating more stable training and enhanced policy exploration.
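A minimal sketch of geometric-mean token aggregation for a single response, assuming a shared sequence-level advantage and clipping in log-ratio space; the clip value and function shape are illustrative, not the paper's exact objective.

```python
import torch

def gmpo_objective(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   advantage: float, clip: float = 0.4) -> torch.Tensor:
    """Geometric-mean aggregation of token-level importance ratios.

    logp_new/logp_old: (T,) token log-probs under the current and rollout policy.
    The geometric mean of ratios equals exp(mean of log-ratios), which damps the
    outlier tokens that destabilize an arithmetic mean; clipping is applied in
    log-ratio space rather than on the raw ratios.
    """
    log_ratio = (logp_new - logp_old).clamp(-clip, clip)  # token-level clip in log space
    geo_mean_ratio = log_ratio.mean().exp()               # geometric mean of ratios
    return geo_mean_ratio * advantage                     # maximize this (negate for a loss)

obj = gmpo_objective(torch.randn(32), torch.randn(32), advantage=1.0)
```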
CodePlot-CoT introduces a code-driven Chain-of-Thought paradigm, enabling Vision Language Models (VLMs) to generate precise visual aids by producing executable plotting code that is then rendered and re-integrated into the reasoning process. This method, along with the new Math-VR dataset, allowed CodePlot-CoT to achieve up to a 21% performance increase on mathematical visual reasoning tasks, surpassing larger models and those using direct image generation.
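A small sketch of the execute-and-render step such a code-driven visual chain of thought needs: run the model's plotting code headlessly, capture the figure as PNG bytes, and hand those bytes back to the VLM as a new image turn (the VLM call itself and sandboxing are omitted).

```python
import io
import matplotlib
matplotlib.use("Agg")                      # headless rendering
import matplotlib.pyplot as plt

def render_plot_code(code: str) -> bytes:
    """Execute model-generated plotting code and return the rendered PNG bytes,
    which would then be re-inserted into the model's reasoning as an image."""
    namespace = {"plt": plt}
    exec(code, namespace)                  # run the generated plotting code
    buf = io.BytesIO()
    plt.gcf().savefig(buf, format="png")
    plt.close("all")
    return buf.getvalue()

# Toy "generated" code: plot two lines whose intersection is being reasoned about.
png = render_plot_code("plt.plot([0, 4], [1, 5]); plt.plot([0, 4], [5, 1])")
print(len(png) > 0)  # True
```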
HybridVLA introduces a unified Vision-Language-Action model that integrates both diffusion-based continuous action prediction and autoregressive reasoning within a shared Large Language Model backbone. This approach achieves state-of-the-art performance in robotic manipulation, demonstrating up to 19% higher success rates over prior methods in real-world tasks and strong generalization to unseen conditions.