The Hong Kong Polytechnic University
Researchers at Sun Yat-sen University and collaborators introduce Continuous Scaling Attention (CSAttn), an attention-only Transformer block that achieves state-of-the-art performance across multiple image restoration tasks without relying on Feed-Forward Networks. The architecture demonstrates substantial improvements, including a 0.41 dB PSNR increase in image deraining and a 4.22 dB PSNR gain in low-light image enhancement, while maintaining competitive model efficiency.
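To make the attention-only idea concrete, here is a minimal PyTorch sketch of a Transformer block with no feed-forward sub-layer; the learnable per-channel scale stands in for the paper's continuous scaling, and the class name and all details are illustrative assumptions rather than the published CSAttn design.

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """Illustrative attention-only block: the usual FFN sub-layer is
    gone, and a learnable per-channel scale (an assumed stand-in for
    the paper's continuous scaling) modulates the attention output."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scale = nn.Parameter(torch.ones(dim))  # per-channel scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + self.scale * h  # residual; no FFN follows
```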
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike-coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms: training remains stable for weeks on hundreds of MetaX GPUs, with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains a more than 100× speedup in Time to First Token on 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15% sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
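The "(partially) constant memory" claim follows from the recurrent form of linear attention: the model carries a fixed-size state instead of a growing key-value cache. Below is a toy single-head sketch of that recurrence; feature maps, normalization, and the spiking neurons themselves are omitted, so this illustrates only the memory argument, not the SpikingBrain architecture.

```python
import torch

def linear_attention_step(state, q, k, v):
    """One recurrent step of (unnormalized) linear attention.

    `state` is a fixed (d_k, d_v) running sum of outer products
    k_t v_t^T, so inference memory stays constant however long the
    context grows -- unlike a softmax-attention KV cache.
    """
    state = state + torch.outer(k, v)  # fold the new token into the state
    out = q @ state                    # read out with the current query
    return state, out

d_k, d_v = 16, 16
state = torch.zeros(d_k, d_v)
for _ in range(4):  # stream tokens one at a time
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    state, out = linear_attention_step(state, q, k, v)
```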
A comprehensive, brain-inspired framework integrates diverse research areas of LLM-based intelligent agents, encompassing individual architecture, collaborative systems, and safety. The framework formally conceptualizes agent components, maps AI capabilities to human cognition to identify research gaps, and outlines a roadmap for developing autonomous, adaptive, and safe AI.
Researchers from The Hong Kong Polytechnic University, Dartmouth College, Max Planck Institute, Google DeepMind, and others developed Prophet, a training-free adaptive decoding paradigm for Diffusion Language Models (DLMs) that leverages early answer convergence. The method achieves up to 3.4 times faster inference by dynamically committing to answers when model confidence is high, often improving output quality compared to full-step decoding.
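A hedged sketch of the early-commit idea: run the iterative refinement loop, but stop as soon as a confidence signal clears a threshold. The `model` callable, the argmax update rule, and the top-1/top-2 gap criterion are all assumptions for illustration, not Prophet's exact formulation.

```python
import torch

def early_commit_decode(model, tokens, num_steps, gap_threshold=0.9):
    """Decode with a diffusion-LM-style refinement loop, committing
    early once every position's top-1/top-2 probability gap is wide
    enough (i.e., the answer has effectively converged)."""
    for step in range(num_steps):
        probs = model(tokens, step).softmax(dim=-1)  # (seq_len, vocab)
        tokens = probs.argmax(dim=-1)                # current best answer
        top2 = probs.topk(2, dim=-1).values
        gap = (top2[:, 0] - top2[:, 1]).min()        # least-confident position
        if gap > gap_threshold:
            break                                    # commit; skip remaining steps
    return tokens
```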
This survey systematically analyzes Graph Retrieval-Augmented Generation (GraphRAG), a paradigm that leverages graph structures for organizing, retrieving, and integrating knowledge to customize Large Language Models for specialized domains. It presents a comprehensive taxonomy and identifies technical foundations, achieving enhanced accuracy, contextual awareness, and reliability compared to traditional retrieval methods.
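For a flavor of graph-structured retrieval, the toy sketch below expands seed entities to a k-hop neighborhood and serializes the induced subgraph's edges as textual context for an LLM; the entity matching and serialization schemes are assumptions for illustration, not a method prescribed by the survey.

```python
import networkx as nx

def retrieve_khop_context(graph: nx.Graph, seeds, hops: int = 1) -> str:
    """Expand seed entities to their k-hop neighborhood and render
    the induced subgraph's edges as lines of LLM-ready context."""
    nodes = set(seeds)
    for _ in range(hops):
        nodes |= {n for s in list(nodes) for n in graph.neighbors(s)}
    sub = graph.subgraph(nodes)
    return "\n".join(
        f"{u} -[{d.get('rel', 'related_to')}]-> {v}"
        for u, v, d in sub.edges(data=True)
    )
```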
UniPixel introduces a unified large multi-modal model capable of concurrently performing object referring and segmentation across images and videos, achieving state-of-the-art results on 10 benchmarks and establishing a strong baseline for the new PixelQA task. This model integrates pixel-level understanding with general visual reasoning, demonstrating notable performance improvements over existing LMMs.
A comprehensive survey by Chen et al. (2025) introduces the first unified taxonomy for latent Chain-of-Thought (CoT) reasoning, organizing a rapidly growing field into token-wise horizontal and layer-wise vertical approaches. It synthesizes current research, practical applications, and outlines critical challenges for future advancements in LLM efficiency and cognitive capabilities.
Researchers from Xiamen University and The Hong Kong Polytechnic University developed GraphRAG-Bench, a new benchmark to systematically evaluate graph-based Retrieval-Augmented Generation (GraphRAG). Their analysis reveals that GraphRAG excels in complex reasoning and creative generation tasks but faces efficiency challenges and can underperform vanilla RAG on simpler fact retrieval, underscoring the importance of task complexity and graph quality.
Researchers from The Hong Kong Polytechnic University developed TokenSkip, a method that enables controllable Chain-of-Thought (CoT) compression in large language models by selectively pruning less semantically important tokens. This approach reduces inference latency and token usage by up to 40% with minimal accuracy loss, making CoT reasoning more efficient for deployment.
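The pruning step itself is simple once importance scores exist; the sketch below keeps the top fraction of tokens in their original order. The `importances` input stands in for TokenSkip's semantic-importance scores, whose computation is not reproduced here.

```python
def prune_cot_tokens(tokens, importances, keep_ratio=0.6):
    """Keep the `keep_ratio` most important CoT tokens, preserving
    their original order (keep_ratio=0.6 mirrors the up-to-40%
    token reduction mentioned above)."""
    k = max(1, int(len(tokens) * keep_ratio))
    top_idx = sorted(range(len(tokens)),
                     key=lambda i: importances[i],
                     reverse=True)[:k]
    return [tokens[i] for i in sorted(top_idx)]  # restore original order
```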
This survey paper defines and applies a 'full-stack' safety concept for Large Language Models (LLMs), systematically analyzing safety concerns across the entire model lifecycle, from data to deployment and commercialization. Synthesizing findings from over 900 papers, it provides a unified taxonomy of attacks and defenses and identifies key insights and future research directions for LLM and LLM-agent safety.
This research investigates how fine-tuning methods for math reasoning affect the broader capabilities of Large Language Models (LLMs), revealing that Reinforcement Learning (RL) approaches foster superior generalization across diverse reasoning and non-reasoning tasks compared to Supervised Fine-Tuning (SFT). RL-tuned models successfully transfer math gains and preserve performance on general tasks, while SFT models often exhibit narrow specialization and degradation in other domains.
The General Multimodal Embedder (GME) is introduced, an MLLM-based model designed for universal multimodal retrieval that can process text, images, visual documents, and fused-modal content. By training on a large, diversified dataset including newly synthesized fused-modal data, GME achieves state-of-the-art performance across various multimodal retrieval tasks, demonstrating the importance of comprehensive data for MLLM adaptation in this domain.
Youtu-GraphRAG introduces a vertically unified agentic paradigm that jointly optimizes graph construction and retrieval for large language models, significantly enhancing complex-reasoning accuracy and reducing token consumption by up to 90.71% across various benchmarks, while mitigating knowledge leakage through novel evaluation datasets.
A collaborative effort from City University of Hong Kong, National University of Singapore, AI2, and other institutions presents SPA, a reinforcement learning framework that instills an internal world model in Large Language Model (LLM) agents via self-play supervised fine-tuning. This approach substantially boosts agent performance and generalization in out-of-distribution environments, for instance raising Qwen2.5-1.5B-Instruct's Sokoban success rate (Pass@1) from 25.6% to 59.8%.
Researchers from the National University of Singapore and collaborators developed the Vision Bridge Transformer (ViBT), scaling Brownian Bridge Models with a novel stabilized velocity-matching objective for efficient conditional generation. ViBT achieves state-of-the-art results across instruction-based image editing, video stylization, and depth-to-video synthesis, while demonstrating inference speedups of 2.28x to 4.03x compared to existing conditional Diffusion Transformers.
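For intuition, here is a hedged sketch of a Brownian-bridge velocity-matching training step: sample a point on the bridge between condition x0 and target x1, then regress the network onto the mean bridge velocity. ViBT's stabilized objective likely differs in detail; this only illustrates the general recipe, and `model` and `sigma` are assumed.

```python
import torch

def bridge_velocity_loss(model, x0, x1, sigma=0.1):
    """One velocity-matching step on a Brownian bridge from x0 to x1.

    x_t = (1-t) x0 + t x1 + sigma * sqrt(t(1-t)) * eps, and the
    network is trained to predict the mean velocity (x1 - x0).
    """
    t = torch.rand(x0.shape[0], 1)           # per-sample bridge time in (0, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * x1 + sigma * (t * (1 - t)).sqrt() * eps
    v_pred = model(x_t, t)
    v_target = x1 - x0                        # mean velocity of the bridge
    return ((v_pred - v_target) ** 2).mean()
```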
Hong Kong Polytechnic University researchers develop SPA-RL (Stepwise Progress Attribution in Reinforcement Learning), a framework that addresses sparse, delayed rewards in LLM-agent training by decomposing the final task reward into stepwise contributions via a lightweight progress estimator. The three-stage approach combines behavior cloning, reward redistribution through progress attribution, and PPO training with intermediate rewards, yielding a 2.5% success-rate improvement and a 1.9% grounding-accuracy gain on ALFWorld, with consistent advantages on the WebShop and VirtualHome benchmarks.
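The reward-redistribution stage can be pictured with a small sketch: per-step progress estimates are normalized so that the resulting intermediate rewards sum to the final task reward, giving PPO a dense signal. The estimator's outputs are taken as given here, and the ReLU-plus-normalization scheme is an assumption for illustration.

```python
import torch

def redistribute_final_reward(progress: torch.Tensor, final_reward: float) -> torch.Tensor:
    """Convert per-step progress estimates into dense intermediate
    rewards that sum exactly to the final task reward."""
    progress = torch.relu(progress)                       # assume non-negative progress
    weights = progress / progress.sum().clamp_min(1e-8)   # normalize to sum to 1
    return weights * final_reward                         # stepwise reward shares
```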
This paper comprehensively surveys Video Temporal Grounding with Multimodal Large Language Models (VTG-MLLMs), presenting a novel three-dimensional taxonomy to classify methodologies and analyzing performance across diverse tasks and benchmarks. It provides a structured overview of architectural integrations, training strategies, and video feature processing techniques, consolidating advancements in the field.
VideoMind introduces a Chain-of-LoRA agent for long video reasoning, mimicking human comprehension by decomposing complex queries into sub-tasks with specialized roles. The system achieves state-of-the-art performance across various grounded video question-answering and temporal grounding benchmarks, particularly excelling on long-form video content.
R2ec introduces a unified large recommender model that intrinsically integrates reasoning and recommendation capabilities within a single architecture, optimizing performance and interpretability without relying on human-annotated reasoning data. The model consistently surpasses existing baselines in recommendation quality across multiple datasets while maintaining competitive inference efficiency.