Institute of Artificial Intelligence, Xiamen University
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

FlashSloth, developed by researchers from Xiamen University, Tencent Youtu Lab, and Shanghai AI Laboratory, introduces a Multimodal Large Language Model (MLLM) architecture that significantly improves efficiency through embedded visual compression. The approach reduces visual tokens by 80-89% and achieves 2-5 times faster response times, while maintaining highly competitive performance across various vision-language benchmarks.

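The central idea is to shrink the visual token stream before it ever reaches the language model. Below is a minimal PyTorch sketch of one common way to do this, a small set of learnable queries cross-attending to the full patch grid; the module name, dimensions, and query count are illustrative assumptions, not FlashSloth's actual implementation.

```python
import torch
import torch.nn as nn

class VisualTokenCompressor(nn.Module):
    """Compress a long grid of visual tokens into a few summary tokens
    via learnable queries and cross-attention (illustrative only)."""

    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens):           # (B, N, dim), e.g. N = 576 patch tokens
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return self.norm(compressed)            # (B, num_queries, dim)

# 576 patch tokens -> 64 tokens (~89% reduction) before they reach the LLM.
feats = torch.randn(2, 576, 1024)
print(VisualTokenCompressor()(feats).shape)     # torch.Size([2, 64, 1024])
```
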
LEAP: Optimization Hierarchical Federated Learning on Non-IID Data with Coalition Formation Game

The paper introduces LEAP, a framework for Hierarchical Federated Learning (HFL) that addresses non-IID data challenges and communication resource allocation in IoT environments. LEAP improves model accuracy by up to 20.62% over clustering baselines and reduces transmission energy consumption by at least 2.24 times while meeting latency requirements.

Tree Search for LLM Agent Reinforcement Learning

Researchers from Xiamen University, Southern University of Science and Technology, and Alibaba Group developed Tree-GRPO, an online reinforcement learning method that uses tree search to efficiently train large language model agents. This approach provides fine-grained process supervision from sparse outcome rewards and achieves superior performance with a quarter of the rollout budget compared to chain-based methods.

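One way to picture how a tree of rollouts turns a single outcome reward into step-level signal: branch the agent's trajectory at intermediate steps, propagate leaf rewards up the tree, and score each step against its siblings. The sketch below illustrates that intuition only; it is a simplified reading, not the Tree-GRPO algorithm, and the Node interface is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Node:
    action: str                        # the agent step (thought/tool call) taken at this node
    reward: float | None = None        # outcome reward, set only on leaf nodes
    children: list[Node] = field(default_factory=list)

def subtree_mean(node: Node) -> float:
    """Mean outcome reward over all leaves below this node."""
    if not node.children:
        return node.reward
    return mean(subtree_mean(c) for c in node.children)

def step_advantages(root: Node) -> dict[str, float]:
    """Give each intermediate step an advantage relative to its siblings,
    turning one sparse outcome reward into process-level supervision."""
    advantages = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            sibling_means = [subtree_mean(c) for c in node.children]
            baseline = mean(sibling_means)
            for child, m in zip(node.children, sibling_means):
                advantages[child.action] = m - baseline
            stack.extend(node.children)
    return advantages
```
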
Real-Time Object Detection Meets DINOv3

DEIMv2 introduces a real-time object detection framework that effectively integrates DINOv3 features, establishing new state-of-the-art accuracy-efficiency trade-offs across eight model scales, from ultra-lightweight (0.49M parameters) to high-performance (57.8 AP). The approach adeptly adapts single-scale Vision Transformer outputs for multi-scale detection while optimizing the decoder and training process.

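A plain ViT backbone such as DINOv3 emits features at a single stride, so a detector needs an adaptation step to recover a multi-scale pyramid. The sketch below shows one standard way to do that (upsample, keep, and downsample the stride-16 map); the specific layers and dimensions are illustrative assumptions rather than DEIMv2's actual design.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Turn the single-scale output of a plain ViT backbone into
    multi-scale detector features (illustrative layer choices)."""

    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        self.up = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)      # stride 16 -> 8
        self.same = nn.Conv2d(dim, out_dim, kernel_size=1)                        # stride 16
        self.down = nn.Conv2d(dim, out_dim, kernel_size=3, stride=2, padding=1)   # stride 16 -> 32

    def forward(self, x):                     # x: (B, dim, H/16, W/16) from the ViT
        return [self.up(x), self.same(x), self.down(x)]

feats = torch.randn(1, 768, 40, 40)
p8, p16, p32 = SimpleFeaturePyramid()(feats)
print(p8.shape, p16.shape, p32.shape)         # (1,256,80,80) (1,256,40,40) (1,256,20,20)
```
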
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

FastVGGT introduces a training-free token merging approach to accelerate the Visual Geometry Grounded Transformer (VGGT) for long-sequence 3D reconstruction. It achieves up to a 4x speedup in inference time while maintaining or improving reconstruction accuracy and reducing camera pose estimation errors for sequences of up to 1000 images.

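Training-free token merging typically works by finding highly similar tokens and averaging them, so the transformer attends over fewer tokens per frame. The function below is a simplified bipartite-matching sketch of that idea (duplicate merge targets are handled naively); it is not the FastVGGT implementation.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Training-free token merging: average the r most similar token pairs.
    x: (N, D) tokens from one frame; returns (N - r, D)."""
    a, b = x[::2], x[1::2]                                      # split tokens into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T     # cosine similarity (|a|, |b|)
    best_sim, best_b = sim.max(dim=-1)                          # best partner in b for each a
    merge_a = best_sim.topk(r).indices                          # the r most redundant tokens in a
    keep_mask = torch.ones(a.size(0), dtype=torch.bool)
    keep_mask[merge_a] = False
    merged = b.clone()
    merged[best_b[merge_a]] = (merged[best_b[merge_a]] + a[merge_a]) / 2
    return torch.cat([a[keep_mask], merged], dim=0)

tokens = torch.randn(1024, 768)
print(merge_tokens(tokens, r=256).shape)                        # torch.Size([768, 768])
```
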
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

MetaGPT introduces a meta-programming framework that simulates a software company with specialized LLM agents following Standardized Operating Procedures (SOPs) and an assembly line paradigm. The system significantly improves the coherence, accuracy, and executability of generated code for complex software development tasks, achieving state-of-the-art results on benchmarks like HumanEval and MBPP, and outperforming other multi-agent systems on a comprehensive software development dataset.

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation

Researchers from Xiamen University and The Hong Kong Polytechnic University developed GraphRAG-Bench, a new benchmark to systematically evaluate graph-based Retrieval-Augmented Generation (GraphRAG). Their analysis reveals that GraphRAG excels in complex reasoning and creative generation tasks but faces efficiency challenges and can underperform vanilla RAG on simpler fact retrieval, underscoring the importance of task complexity and graph quality.

MCP-Zero: Active Tool Discovery for Autonomous LLM Agents

MCP-Zero introduces an active tool discovery framework that enables large language model (LLM) agents to dynamically identify and request external tools on demand. This approach reduces token consumption by up to 98% and maintains high tool selection accuracy even when presented with thousands of potential tools, thereby enhancing the scalability and efficiency of LLM agents.

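The key shift is that the agent no longer receives every tool schema up front; it emits a short capability request and only the best-matching tool descriptions are injected into the context. A minimal retrieval sketch of that step follows, assuming a generic sentence-embedding function; none of the names come from the MCP-Zero codebase.

```python
import numpy as np

def discover_tools(tool_request, tool_descriptions, embed, top_k=3):
    """Active tool discovery sketch: the agent asks for a capability
    (e.g. "I need a tool that converts PDF to text") and only the top-k
    matching tool schemas are added to the prompt. `embed` is a hypothetical
    sentence-embedding callable returning numpy vectors."""
    query = embed(tool_request)
    scores = []
    for name, desc in tool_descriptions.items():
        vec = embed(desc)
        cos = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scores.append((cos, name))
    return [name for _, name in sorted(scores, reverse=True)[:top_k]]

# An agent loop would call discover_tools() whenever the model requests a
# capability, injecting only those few schemas instead of thousands, which is
# where the large token savings come from.
```
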
FlashWorld: High-quality 3D Scene Generation within Seconds

FlashWorld enables high-quality 3D scene generation from a single image or text prompt within seconds, achieving a 10-100x speedup over previous methods while delivering superior visual fidelity and consistent 3D structures. The model recovers intricate details and produces realistic backgrounds even for complex scenes, demonstrating strong performance across image-to-3D and text-to-3D tasks.

Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph

Researchers from IDEA Research, Xiamen University, and other institutions developed Think-on-Graph (ToG), a training-free framework that tightly couples Large Language Models (LLMs) with Knowledge Graphs (KGs). ToG enables LLMs to perform iterative, explainable deep reasoning by actively exploring KG paths through a beam search process, achieving state-of-the-art performance on multiple knowledge-intensive QA datasets and reducing hallucination.

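The reasoning loop can be pictured as a beam search over knowledge-graph paths in which the LLM acts as the scoring and stopping critic. The sketch below conveys that loop at a high level; kg.neighbors, llm_score, and llm_answer are hypothetical interfaces, and the real ToG prompting details differ.

```python
def think_on_graph(question, topic_entities, kg, llm_score, llm_answer,
                   width=3, depth=3):
    """Beam-search sketch of the ToG idea: expand KG paths from the question's
    topic entities, let the LLM score candidates, keep the top-`width` paths,
    and stop once the LLM judges it can answer."""
    beams = [[e] for e in topic_entities]          # each beam is a path of entities/relations
    for _ in range(depth):
        candidates = []
        for path in beams:
            for relation, entity in kg.neighbors(path[-1]):
                candidates.append(path + [relation, entity])
        # The LLM rates how promising each extended path is for the question.
        scored = sorted(candidates, key=lambda p: llm_score(question, p), reverse=True)
        beams = scored[:width]
        answer = llm_answer(question, beams)       # returns None if not yet answerable
        if answer is not None:
            return answer, beams                   # the retained paths serve as evidence
    return llm_answer(question, beams, force=True), beams
```
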
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

This survey provides the first systematic review of multimodal long-context token compression, categorizing techniques across images, videos, and audio by both modality and algorithmic mechanism. It reveals how diverse compression strategies address the quadratic complexity of self-attention in Multimodal Large Language Models (MLLMs), improving efficiency and enabling new applications like real-time robotic perception and high-resolution medical image analysis.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MME introduces a new comprehensive benchmark to quantitatively evaluate Multimodal Large Language Models (MLLMs), featuring manually constructed, leakage-free instruction-answer pairs across 14 perception and cognition subtasks. The benchmark assesses 30 MLLMs, revealing significant performance gaps and identifying prevalent issues such as instruction non-compliance, perceptual failures, reasoning breakdowns, and object hallucination.

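MME-style evaluation pairs each image with two manually written yes/no instructions and reports both a per-answer accuracy and a stricter per-image "accuracy+" that requires both answers to be correct. A small scoring sketch under an assumed input format:

```python
def mme_scores(results):
    """Scoring sketch for an MME-style subtask: `results` maps image_id to a
    list of two booleans (whether each yes/no answer was correct; assumed
    input format). Returns (accuracy, accuracy+) in percent."""
    answers = [ok for pair in results.values() for ok in pair]
    acc = sum(answers) / len(answers)
    acc_plus = sum(all(pair) for pair in results.values()) / len(results)
    return 100 * acc, 100 * acc_plus

print(mme_scores({"img1": [True, True], "img2": [True, False]}))  # (75.0, 50.0)
```
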
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

VR-Bench, a new benchmark, is introduced to evaluate the spatial reasoning capabilities of video generation models through diverse maze-solving tasks. The paper demonstrates that fine-tuned video models can perform robust spatial reasoning, often outperforming Vision-Language Models, and exhibit strong generalization and a notable test-time scaling effect.

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

DynamicVerse, developed by researchers from Xiamen University, Meta, and other institutions, introduces a physically-aware multimodal framework for 4D world modeling. It establishes DynamicGen, an automated pipeline that generates a large-scale 4D dataset comprising over 100K scenes from internet videos, annotated with metric-scale 3D geometry, precise camera parameters, object masks, and hierarchical captions. The framework achieves state-of-the-art results in video depth, camera pose, and camera intrinsics estimation, while also producing high-quality semantic descriptions.

Rho-1: Not All Tokens Are What You Need
08 Jan 2025

RHO-1 introduces Selective Language Modeling (SLM), a pre-training approach that selectively applies loss to high-value tokens, achieving significant data and compute efficiency while improving performance in large language models, particularly in mathematical reasoning. It demonstrated a 97% reduction in effective pre-training tokens to reach similar state-of-the-art math performance compared to baselines.

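Selective Language Modeling can be summarized as: score every token by how much worse the training model does than a small reference model, then back-propagate the cross-entropy loss only through the highest-scoring tokens. The sketch below is a simplified reading of that recipe, not RHO-1's exact scoring or scheduling.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Selective Language Modeling sketch.
    logits / ref_logits: (B, T, V) from the training and reference models;
    labels: (B, T). Only the tokens with the largest excess loss contribute."""
    per_tok = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), reduction="none")
    ref_per_tok = F.cross_entropy(
        ref_logits.flatten(0, 1), labels.flatten(), reduction="none")

    excess = per_tok - ref_per_tok                  # high = token still worth learning
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = excess.topk(k).values.min()
    mask = (excess >= threshold).float()

    return (per_tok * mask).sum() / mask.sum()
```
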
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.
Data Interpreter: An LLM Agent For Data Science

Data Interpreter is an LLM agent framework developed by DeepWisdom and Mila, designed to automate end-to-end data science workflows through hierarchical planning and dynamic tool integration. It achieved 94.93% accuracy on InfiAgent-DABench with `gpt-4o`, representing a 19.01% absolute improvement over direct `gpt-4o` inference, and scored 0.95 on ML-Benchmark, outperforming AutoGen and OpenDevin while being more cost-efficient.

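At a high level the agent plans a set of subtasks, executes each one (usually by generating and running code), and revises steps whose outputs fail verification. The loop below is only a schematic of that workflow; all callables are hypothetical stand-ins for the framework's planner, executor, and verifier.

```python
def data_interpreter(task, plan_fn, execute_fn, verify_fn, revise_fn, max_attempts=3):
    """Hierarchical plan-then-execute sketch: decompose the task, run each
    subtask (typically code generation + execution), and revise a step when
    its output fails verification."""
    results = []
    for step in plan_fn(task):                        # e.g. ["load", "clean", "train", "report"]
        output = None
        for _ in range(max_attempts):
            output = execute_fn(step, context=results)
            if verify_fn(step, output):
                break
            step = revise_fn(step, feedback=output)   # dynamic adjustment of the failing step
        results.append((step, output))
    return results
```
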
LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS

LightGaussian introduces a multi-stage pipeline to compress 3D Gaussian Splatting models, achieving an average 15x storage reduction and boosting rendering speeds to over 200 FPS while largely maintaining visual quality. This method addresses the storage overhead and rendering efficiency of large-scale 3D scene representations.

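The first stage of such a pipeline is usually pruning: rank every Gaussian by a global significance score and discard the least important ones before any distillation or quantization. The sketch below uses opacity times volume times view hit-count as a simplified proxy score; the exact formulation in LightGaussian differs.

```python
import numpy as np

def prune_gaussians(opacity, scales, hit_count, prune_ratio=0.6):
    """Pruning-stage sketch: rank Gaussians by a global significance proxy
    (opacity x volume x hit-count) and keep only the most significant ones.
    opacity: (N,), scales: (N, 3), hit_count: (N,). Returns kept indices."""
    volume = np.prod(scales, axis=1)                       # per-Gaussian volume proxy
    significance = opacity * volume * hit_count            # (N,)
    keep = int(len(significance) * (1.0 - prune_ratio))
    keep_idx = np.argsort(significance)[-keep:]            # indices of the most significant
    return np.sort(keep_idx)

# Example: keep the top 40% of 1M Gaussians by significance.
n = 1_000_000
idx = prune_gaussians(np.random.rand(n), np.random.rand(n, 3), np.random.randint(1, 100, n))
print(len(idx))                                            # 400000
```
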
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

AIGI-Holmes introduces a method for detecting AI-generated images that provides both accurate identification and human-aligned explanations. The approach leverages a novel dataset (Holmes-Set) and a multi-stage training pipeline to enhance Multimodal Large Language Models (MLLMs), achieving 99.2% accuracy on unseen AI-generated images and producing verifiable explanations that surpass existing MLLMs in quality and human alignment.
