Institute of Artificial Intelligence, Xiamen University
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

FlashSloth, developed by researchers from Xiamen University, Tencent Youtu Lab, and Shanghai AI Laboratory, introduces a Multimodal Large Language Model (MLLM) architecture that significantly improves efficiency through embedded visual compression. The approach reduces visual tokens by 80-89% and achieves 2-5 times faster response times, while maintaining highly competitive performance across various vision-language benchmarks.

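The central idea is to shrink the visual token stream before it ever reaches the language model. Below is a minimal PyTorch sketch of one common way to do this, a small set of learnable queries cross-attending to the full patch grid; the module name, dimensions, and query count are illustrative assumptions, not FlashSloth's actual implementation.

```python
import torch
import torch.nn as nn

class VisualTokenCompressor(nn.Module):
    """Compress a long grid of visual tokens into a few summary tokens
    via learnable queries and cross-attention (illustrative only)."""

    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens):           # (B, N, dim), e.g. N = 576 patch tokens
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return self.norm(compressed)            # (B, num_queries, dim)

# 576 patch tokens -> 64 tokens (~89% reduction) before they reach the LLM.
feats = torch.randn(2, 576, 1024)
print(VisualTokenCompressor()(feats).shape)     # torch.Size([2, 64, 1024])
```
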
LEAP: Optimization Hierarchical Federated Learning on Non-IID Data with Coalition Formation Game

The paper introduces LEAP, a framework for Hierarchical Federated Learning (HFL) that addresses non-IID data challenges and communication resource allocation in IoT environments. LEAP improves model accuracy by up to 20.62% over clustering baselines and reduces transmission energy consumption by at least 2.24 times while meeting latency requirements.

Tree Search for LLM Agent Reinforcement Learning

Researchers from Xiamen University, Southern University of Science and Technology, and Alibaba Group developed Tree-GRPO, an online reinforcement learning method that uses tree search to efficiently train large language model agents. This approach provides fine-grained process supervision from sparse outcome rewards and achieves superior performance with a quarter of the rollout budget compared to chain-based methods.

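One way to picture how a tree of rollouts turns a single outcome reward into step-level signal: branch the agent's trajectory at intermediate steps, propagate leaf rewards up the tree, and score each step against its siblings. The sketch below illustrates that intuition only; it is a simplified reading, not the Tree-GRPO algorithm, and the Node interface is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Node:
    action: str                        # the agent step (thought/tool call) taken at this node
    reward: float | None = None        # outcome reward, set only on leaf nodes
    children: list[Node] = field(default_factory=list)

def subtree_mean(node: Node) -> float:
    """Mean outcome reward over all leaves below this node."""
    if not node.children:
        return node.reward
    return mean(subtree_mean(c) for c in node.children)

def step_advantages(root: Node) -> dict[str, float]:
    """Give each intermediate step an advantage relative to its siblings,
    turning one sparse outcome reward into process-level supervision."""
    advantages = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            sibling_means = [subtree_mean(c) for c in node.children]
            baseline = mean(sibling_means)
            for child, m in zip(node.children, sibling_means):
                advantages[child.action] = m - baseline
            stack.extend(node.children)
    return advantages
```
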
Real-Time Object Detection Meets DINOv3

DEIMv2 introduces a real-time object detection framework that effectively integrates DINOv3 features, establishing new state-of-the-art accuracy-efficiency trade-offs across eight model scales, from ultra-lightweight (0.49M parameters) to high-performance (57.8 AP). The approach adeptly adapts single-scale Vision Transformer outputs for multi-scale detection while optimizing the decoder and training process.

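A plain ViT backbone such as DINOv3 emits features at a single stride, so a detector needs an adaptation step to recover a multi-scale pyramid. The sketch below shows one standard way to do that (upsample, keep, and downsample the stride-16 map); the specific layers and dimensions are illustrative assumptions rather than DEIMv2's actual design.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Turn the single-scale output of a plain ViT backbone into
    multi-scale detector features (illustrative layer choices)."""

    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        self.up = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)      # stride 16 -> 8
        self.same = nn.Conv2d(dim, out_dim, kernel_size=1)                        # stride 16
        self.down = nn.Conv2d(dim, out_dim, kernel_size=3, stride=2, padding=1)   # stride 16 -> 32

    def forward(self, x):                     # x: (B, dim, H/16, W/16) from the ViT
        return [self.up(x), self.same(x), self.down(x)]

feats = torch.randn(1, 768, 40, 40)
p8, p16, p32 = SimpleFeaturePyramid()(feats)
print(p8.shape, p16.shape, p32.shape)         # (1,256,80,80) (1,256,40,40) (1,256,20,20)
```
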
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

FastVGGT introduces a training-free token merging approach to accelerate the Visual Geometry Grounded Transformer (VGGT) for long-sequence 3D reconstruction. It achieves up to a 4x speedup in inference time while maintaining or improving reconstruction accuracy and reducing camera pose estimation errors for sequences of up to 1000 images.

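Training-free token merging typically works by finding highly similar tokens and averaging them, so the transformer attends over fewer tokens per frame. The function below is a simplified bipartite-matching sketch of that idea (duplicate merge targets are handled naively); it is not the FastVGGT implementation.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Training-free token merging: average the r most similar token pairs.
    x: (N, D) tokens from one frame; returns (N - r, D)."""
    a, b = x[::2], x[1::2]                                      # split tokens into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T     # cosine similarity (|a|, |b|)
    best_sim, best_b = sim.max(dim=-1)                          # best partner in b for each a
    merge_a = best_sim.topk(r).indices                          # the r most redundant tokens in a
    keep_mask = torch.ones(a.size(0), dtype=torch.bool)
    keep_mask[merge_a] = False
    merged = b.clone()
    merged[best_b[merge_a]] = (merged[best_b[merge_a]] + a[merge_a]) / 2
    return torch.cat([a[keep_mask], merged], dim=0)

tokens = torch.randn(1024, 768)
print(merge_tokens(tokens, r=256).shape)                        # torch.Size([768, 768])
```
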
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

MetaGPT introduces a meta-programming framework that simulates a software company with specialized LLM agents following Standardized Operating Procedures (SOPs) and an assembly line paradigm. The system significantly improves the coherence, accuracy, and executability of generated code for complex software development tasks, achieving state-of-the-art results on benchmarks like HumanEval and MBPP, and outperforming other multi-agent systems on a comprehensive software development dataset.

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation

Researchers from Xiamen University and The Hong Kong Polytechnic University developed GraphRAG-Bench, a new benchmark to systematically evaluate graph-based Retrieval-Augmented Generation (GraphRAG). Their analysis reveals that GraphRAG excels in complex reasoning and creative generation tasks but faces efficiency challenges and can underperform vanilla RAG on simpler fact retrieval, underscoring the importance of task complexity and graph quality.

MCP-Zero: Active Tool Discovery for Autonomous LLM Agents

MCP-Zero introduces an active tool discovery framework that enables large language model (LLM) agents to dynamically identify and request external tools on demand. This approach reduces token consumption by up to 98% and maintains high tool selection accuracy even when presented with thousands of potential tools, thereby enhancing the scalability and efficiency of LLM agents.

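The key shift is that the agent no longer receives every tool schema up front; it emits a short capability request and only the best-matching tool descriptions are injected into the context. A minimal retrieval sketch of that step follows, assuming a generic sentence-embedding function; none of the names come from the MCP-Zero codebase.

```python
import numpy as np

def discover_tools(tool_request, tool_descriptions, embed, top_k=3):
    """Active tool discovery sketch: the agent asks for a capability
    (e.g. "I need a tool that converts PDF to text") and only the top-k
    matching tool schemas are added to the prompt. `embed` is a hypothetical
    sentence-embedding callable returning numpy vectors."""
    query = embed(tool_request)
    scores = []
    for name, desc in tool_descriptions.items():
        vec = embed(desc)
        cos = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scores.append((cos, name))
    return [name for _, name in sorted(scores, reverse=True)[:top_k]]

# An agent loop would call discover_tools() whenever the model requests a
# capability, injecting only those few schemas instead of thousands, which is
# where the large token savings come from.
```
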
FlashWorld: High-quality 3D Scene Generation within Seconds

FlashWorld enables high-quality 3D scene generation from a single image or text prompt within seconds, achieving a 10-100x speedup over previous methods while delivering superior visual fidelity and consistent 3D structures. The model recovers intricate details and produces realistic backgrounds even for complex scenes, demonstrating strong performance across image-to-3D and text-to-3D tasks.

Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph

Researchers from IDEA Research, Xiamen University, and other institutions developed Think-on-Graph (ToG), a training-free framework that tightly couples Large Language Models (LLMs) with Knowledge Graphs (KGs). ToG enables LLMs to perform iterative, explainable deep reasoning by actively exploring KG paths through a beam search process, achieving state-of-the-art performance on multiple knowledge-intensive QA datasets and reducing hallucination.

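The reasoning loop can be pictured as a beam search over knowledge-graph paths in which the LLM acts as the scoring and stopping critic. The sketch below conveys that loop at a high level; kg.neighbors, llm_score, and llm_answer are hypothetical interfaces, and the real ToG prompting details differ.

```python
def think_on_graph(question, topic_entities, kg, llm_score, llm_answer,
                   width=3, depth=3):
    """Beam-search sketch of the ToG idea: expand KG paths from the question's
    topic entities, let the LLM score candidates, keep the top-`width` paths,
    and stop once the LLM judges it can answer."""
    beams = [[e] for e in topic_entities]          # each beam is a path of entities/relations
    for _ in range(depth):
        candidates = []
        for path in beams:
            for relation, entity in kg.neighbors(path[-1]):
                candidates.append(path + [relation, entity])
        # The LLM rates how promising each extended path is for the question.
        scored = sorted(candidates, key=lambda p: llm_score(question, p), reverse=True)
        beams = scored[:width]
        answer = llm_answer(question, beams)       # returns None if not yet answerable
        if answer is not None:
            return answer, beams                   # the retained paths serve as evidence
    return llm_answer(question, beams, force=True), beams
```
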
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

This survey provides the first systematic review of multimodal long-context token compression, categorizing techniques across images, videos, and audio by both modality and algorithmic mechanism. It reveals how diverse compression strategies address the quadratic complexity of self-attention in Multimodal Large Language Models (MLLMs), improving efficiency and enabling new applications like real-time robotic perception and high-resolution medical image analysis.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MME introduces a new comprehensive benchmark to quantitatively evaluate Multimodal Large Language Models (MLLMs), featuring manually constructed, leakage-free instruction-answer pairs across 14 perception and cognition subtasks. The benchmark assesses 30 MLLMs, revealing significant performance gaps and identifying prevalent issues such as instruction non-compliance, perceptual failures, reasoning breakdowns, and object hallucination.

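MME-style evaluation pairs each image with two manually written yes/no instructions and reports both a per-answer accuracy and a stricter per-image "accuracy+" that requires both answers to be correct. A small scoring sketch under an assumed input format:

```python
def mme_scores(results):
    """Scoring sketch for an MME-style subtask: `results` maps image_id to a
    list of two booleans (whether each yes/no answer was correct; assumed
    input format). Returns (accuracy, accuracy+) in percent."""
    answers = [ok for pair in results.values() for ok in pair]
    acc = sum(answers) / len(answers)
    acc_plus = sum(all(pair) for pair in results.values()) / len(results)
    return 100 * acc, 100 * acc_plus

print(mme_scores({"img1": [True, True], "img2": [True, False]}))  # (75.0, 50.0)
```
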
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

VR-Bench, a new benchmark, is introduced to evaluate the spatial reasoning capabilities of video generation models through diverse maze-solving tasks. The paper demonstrates that fine-tuned video models can perform robust spatial reasoning, often outperforming Vision-Language Models, and exhibit strong generalization and a notable test-time scaling effect.

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

DynamicVerse, developed by researchers from Xiamen University, Meta, and other institutions, introduces a physically-aware multimodal framework for 4D world modeling. It establishes DynamicGen, an automated pipeline that generates a large-scale 4D dataset comprising over 100K scenes from internet videos, annotated with metric-scale 3D geometry, precise camera parameters, object masks, and hierarchical captions. The framework achieves state-of-the-art results in video depth, camera pose, and camera intrinsics estimation, while also producing high-quality semantic descriptions.

Rho-1: Not All Tokens Are What You Need
08 Jan 2025

RHO-1 introduces Selective Language Modeling (SLM), a pre-training approach that selectively applies loss to high-value tokens, achieving significant data and compute efficiency while improving performance in large language models, particularly in mathematical reasoning. It demonstrated a 97% reduction in effective pre-training tokens to reach similar state-of-the-art math performance compared to baselines.

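Selective Language Modeling can be summarized as: score every token by how much worse the training model does than a small reference model, then back-propagate the cross-entropy loss only through the highest-scoring tokens. The sketch below is a simplified reading of that recipe, not RHO-1's exact scoring or scheduling.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Selective Language Modeling sketch.
    logits / ref_logits: (B, T, V) from the training and reference models;
    labels: (B, T). Only the tokens with the largest excess loss contribute."""
    per_tok = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), reduction="none")
    ref_per_tok = F.cross_entropy(
        ref_logits.flatten(0, 1), labels.flatten(), reduction="none")

    excess = per_tok - ref_per_tok                  # high = token still worth learning
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = excess.topk(k).values.min()
    mask = (excess >= threshold).float()

    return (per_tok * mask).sum() / mask.sum()
```
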
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.
Data Interpreter: An LLM Agent For Data Science

Data Interpreter is an LLM agent framework developed by DeepWisdom and Mila, designed to automate end-to-end data science workflows through hierarchical planning and dynamic tool integration. It achieved 94.93% accuracy on InfiAgent-DABench with `gpt-4o`, representing a 19.01% absolute improvement over direct `gpt-4o` inference, and scored 0.95 on ML-Benchmark, outperforming AutoGen and OpenDevin while being more cost-efficient.

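At a high level the agent plans a set of subtasks, executes each one (usually by generating and running code), and revises steps whose outputs fail verification. The loop below is only a schematic of that workflow; all callables are hypothetical stand-ins for the framework's planner, executor, and verifier.

```python
def data_interpreter(task, plan_fn, execute_fn, verify_fn, revise_fn, max_attempts=3):
    """Hierarchical plan-then-execute sketch: decompose the task, run each
    subtask (typically code generation + execution), and revise a step when
    its output fails verification."""
    results = []
    for step in plan_fn(task):                        # e.g. ["load", "clean", "train", "report"]
        output = None
        for _ in range(max_attempts):
            output = execute_fn(step, context=results)
            if verify_fn(step, output):
                break
            step = revise_fn(step, feedback=output)   # dynamic adjustment of the failing step
        results.append((step, output))
    return results
```
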
LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS

LightGaussian introduces a multi-stage pipeline to compress 3D Gaussian Splatting models, achieving an average 15x storage reduction and boosting rendering speeds to over 200 FPS while largely maintaining visual quality. This method addresses the storage overhead and rendering efficiency of large-scale 3D scene representations.

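The first stage of such a pipeline is usually pruning: rank every Gaussian by a global significance score and discard the least important ones before any distillation or quantization. The sketch below uses opacity times volume times view hit-count as a simplified proxy score; the exact formulation in LightGaussian differs.

```python
import numpy as np

def prune_gaussians(opacity, scales, hit_count, prune_ratio=0.6):
    """Pruning-stage sketch: rank Gaussians by a global significance proxy
    (opacity x volume x hit-count) and keep only the most significant ones.
    opacity: (N,), scales: (N, 3), hit_count: (N,). Returns kept indices."""
    volume = np.prod(scales, axis=1)                       # per-Gaussian volume proxy
    significance = opacity * volume * hit_count            # (N,)
    keep = int(len(significance) * (1.0 - prune_ratio))
    keep_idx = np.argsort(significance)[-keep:]            # indices of the most significant
    return np.sort(keep_idx)

# Example: keep the top 40% of 1M Gaussians by significance.
n = 1_000_000
idx = prune_gaussians(np.random.rand(n), np.random.rand(n, 3), np.random.randint(1, 100, n))
print(len(idx))                                            # 400000
```
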
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

AIGI-Holmes introduces a method for detecting AI-generated images that provides both accurate identification and human-aligned explanations. The approach leverages a novel dataset (Holmes-Set) and a multi-stage training pipeline to enhance Multimodal Large Language Models (MLLMs), achieving 99.2% accuracy on unseen AI-generated images and producing verifiable explanations that surpass existing MLLMs in quality and human alignment.
