alphaXiv


Shenzhen Loop Area Institute

356

03 Oct 2025

computer-science computer-vision-and-pattern-recognition deep-reinforcement-learning

Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

The Chinese University of Hong Kong, Shenzhen; Microsoft; The University of Hong Kong; Voyager Research, Didi Chuxing; Shenzhen Loop Area Institute

The Memory Forcing framework enables autoregressive video diffusion models to achieve both natural content generation and long-term spatial consistency in open-world environments like Minecraft. It accomplishes this by adaptively utilizing temporal and geometry-indexed spatial memory, demonstrating superior generative quality and significantly faster, more memory-efficient retrieval compared to prior approaches.

152

21 Nov 2025

computer-science contrastive-learning computer-vision-and-pattern-recognition

Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

The Chinese University of Hong Kong; CUHK MMLab; HKUST; Centre for Perceptual and Interactive Intelligence; CPII under InnoHK; Shenzhen Loop Area Institute; Vivix.AI

Neighbor GRPO introduces an SDE-free method for aligning flow matching models with human preferences, enabling up to 5x faster training while improving generation quality and robustness against reward hacking. The approach fully leverages efficient high-order ODE solvers for visual generative models.

146

22 Oct 2025

computer-science computation-and-language

Lookahead Routing for Large Language Models

Sun Yat-sen University; Shenzhen Loop Area Institute

Researchers at Sun Yat-sen University developed "Lookahead Routing," a framework for multi-LLM systems that predicts latent representations of potential model responses to inform routing decisions. This approach consistently outperformed existing routing baselines, achieving an average normalized score 7.7% higher than the strongest competitor while reducing activated parameters by approximately 79% compared to ensembling.

138

15 Oct 2025

computer-science computation-and-language sound

UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

Harbin Institute of Technology; Shenzhen Loop Area Institute

UniMoE-Audio achieves unified speech and music generation by employing a Dynamic-Capacity Mixture-of-Experts architecture and a three-stage training curriculum. This approach addresses task conflict and data imbalance, yielding state-of-the-art perceptual quality in speech synthesis (UTMOS 4.36 on SeedTTS-EN) and superior aesthetic scores in music generation.

773

07 Dec 2025

attention-mechanisms computer-science artificial-intelligence

Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior

Chinese Academy of Sciences; Harbin Institute of Technology (Shenzhen); HKUST; Shenzhen Loop Area Institute

DyToK is a training-free framework that dynamically compresses visual tokens for Video Large Language Models, using an LLM-guided keyframe prior to adaptively allocate per-frame token budgets. This approach significantly boosts inference speed and reduces memory usage while enhancing video understanding performance, especially under high compression.

24 Oct 2025

computer-science neural-and-evolutionary-computing

Unveiling the Spatial-temporal Effective Receptive Fields of Spiking Neural Networks

University of Electronic Science and Technology of China; The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute

Spiking Neural Networks (SNNs) demonstrate significant potential for energy-efficient neuromorphic computing through an event-driven paradigm. While training methods and computational models have greatly advanced, SNNs struggle to achieve competitive performance in visual long-sequence modeling tasks. In artificial neural networks, the effective receptive field (ERF) serves as a valuable tool for analyzing feature extraction capabilities in visual long-sequence modeling. Inspired by this, we introduce the Spatio-Temporal Effective Receptive Field (ST-ERF) to analyze the ERF distributions across various Transformer-based SNNs. Based on the proposed ST-ERF, we reveal that these models struggle to establish a robust global ST-ERF, which limits their visual feature modeling capabilities. To overcome this issue, we propose two novel channel-mixer architectures: the multi-layer-perceptron-based mixer (MLPixer) and the splash-and-reconstruct block (SRB). These architectures enhance the global spatial ERF across all timesteps in the early network stages of Transformer-based SNNs, improving performance on challenging visual long-sequence modeling tasks. Extensive experiments conducted on the Meta-SDT variants and across object detection and semantic segmentation tasks further validate the effectiveness of our proposed method. Beyond these specific applications, we believe the proposed ST-ERF framework can provide valuable insights for designing and optimizing SNN architectures across a broader range of tasks. The code is available on GitHub at EricZhang1412/Spatial-temporal-ERF.

09 Dec 2025

attention-mechanisms computer-science artificial-intelligence

ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention

South China University of Technology; The University of Hong Kong; Kuaishou Technology; Shenzhen Loop Area Institute

ContextDrag introduces a tuning-free framework for precise and high-fidelity drag-based image editing by injecting noise-free, VAE-encoded reference features into Diffusion Transformer models, coupled with position-consistent attention. The method achieves superior drag precision, improving Mean Distance (MD) by 7.3% over prior art, while maintaining image fidelity and semantic coherence in edited results.

