EventGPT introduces the first multimodal large language model specifically for understanding event streams, enabling comprehensive scene summarization, reasoning, and question answering from asynchronous event camera data. The model consistently outperforms existing MLLMs in challenging high dynamic range and high-speed motion conditions through a novel architecture and progressive three-stage training.
Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, and Tsinghua University introduced AUGlasses, a low-power system integrated into smart glasses that uses Inertial Measurement Units (IMUs) for continuous, privacy-preserving facial reconstruction. The system achieves real-time 3D facial animation with an average NME of 2.75% while consuming only 49.95 mW.
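For reference, the NME metric behind the 2.75% figure is typically computed as in the sketch below; this is a generic formulation, and the specific normalizing length (e.g., interocular distance) used by AUGlasses is an assumption, not something stated in the summary.

```python
import numpy as np

def normalized_mean_error(pred: np.ndarray, gt: np.ndarray, norm: float) -> float:
    """Normalized Mean Error (NME) over facial landmarks/vertices, in percent.

    pred, gt: (N, 3) predicted and ground-truth points; norm: a normalizing
    length such as the interocular distance. The normalizer used for the
    reported 2.75% is an assumption here, not taken from the paper.
    """
    errors = np.linalg.norm(pred - gt, axis=-1)   # per-point Euclidean error
    return float(errors.mean() / norm * 100.0)
```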
Researchers introduced Video-R1, the first framework to apply a rule-based reinforcement learning paradigm to enhance temporal reasoning capabilities in Multimodal Large Language Models (MLLMs) for video. Video-R1, leveraging a new temporal-aware RL algorithm and dedicated video reasoning datasets, consistently outperforms previous state-of-the-art models, achieving 37.1% accuracy on VSI-Bench, surpassing GPT-4o's 34.0%.
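A minimal sketch of what a rule-based RL reward of this kind looks like is given below; the tag format, the exact-match rule, and the 0.5 format weight are illustrative assumptions, not Video-R1's published reward definition.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy R1-style rule-based reward: a format term plus an accuracy term.

    The specific rules and weights are illustrative assumptions, not the
    exact reward used by Video-R1.
    """
    # Format rule: reasoning and final answer must be wrapped in tags.
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                               response, flags=re.DOTALL))
    format_reward = 0.5 if format_ok else 0.0

    # Accuracy rule: the extracted answer must match the reference exactly.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip().lower() if match else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip().lower() else 0.0

    return format_reward + accuracy_reward
```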
VMamba adapts the efficient Mamba State Space Model to computer vision, introducing a 2D-Selective-Scan (SS2D) module that achieves linear computational complexity and memory consumption with image resolution. The architecture demonstrates competitive or superior performance across image classification, object detection, and semantic segmentation benchmarks compared to CNNs and Vision Transformers, while offering significantly higher throughput.
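The cross-scan pattern behind SS2D can be sketched as follows, assuming the four traversal orders are row-major and column-major, each forward and reversed; the selective-scan kernel itself is replaced by a cumulative-sum stand-in, so this illustrates only the scan-and-merge pattern, not the actual S6 computation.

```python
import torch

def ss2d_cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the SS2D cross-scan/merge pattern on a (B, C, H, W) feature map.

    The 1D selective-scan kernel is replaced by a cumulative-sum stand-in
    purely for illustration.
    """
    B, C, H, W = x.shape
    seq_hw = x.flatten(2)                       # row-major order, (B, C, H*W)
    seq_wh = x.transpose(2, 3).flatten(2)       # column-major order

    # Four directed sequences: two orders, each scanned forward and backward.
    sequences = [seq_hw, seq_hw.flip(-1), seq_wh, seq_wh.flip(-1)]

    def selective_scan_1d(seq):                 # stand-in for the real S6 kernel
        return seq.cumsum(dim=-1) / seq.size(-1)

    outs = [selective_scan_1d(s) for s in sequences]

    # Undo the flips, map column-major outputs back to row-major, merge by summation.
    out = outs[0] + outs[1].flip(-1)
    out_wh = outs[2] + outs[3].flip(-1)
    out = out + out_wh.view(B, C, W, H).transpose(2, 3).flatten(2)
    return out.view(B, C, H, W)
```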
The Grasp Any Region (GAR) framework introduces a multimodal large language model that achieves precise, context-aware pixel understanding by integrating global context with local details. It enables sophisticated interactions among multiple visual prompts and advanced compositional reasoning, establishing new state-of-the-art performance across diverse region-level understanding benchmarks, including a new GAR-Bench dataset.
Geometric-Mean Policy Optimization (GMPO) stabilizes reinforcement learning for large language models (LLMs) by employing a geometric mean for token-level reward aggregation. This approach yielded an average 4.1% improvement in Pass@1 accuracy over GRPO on mathematical reasoning benchmarks and showed gains in multimodal and Mixture-of-Experts settings, demonstrating more stable training and enhanced policy exploration.
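The aggregation change at the heart of GMPO can be sketched as follows; clipping, advantages, and the full policy-gradient objective are omitted, so this only contrasts geometric-mean aggregation with a GRPO-style arithmetic mean over per-token importance ratios, rather than reproducing the paper's objective.

```python
from typing import Tuple
import torch

def ratio_aggregation(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      mask: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Contrast arithmetic-mean (GRPO-style) and geometric-mean (GMPO-style)
    aggregation of per-token importance ratios.

    logp_new, logp_old: (B, T) token log-probabilities under the current and
    old policies; mask: (B, T) with 1 on response tokens.
    """
    log_ratio = (logp_new - logp_old) * mask
    lengths = mask.sum(dim=-1).clamp(min=1)

    # GRPO-style: arithmetic mean of per-token ratios (sensitive to outlier tokens).
    arithmetic = (log_ratio.exp() * mask).sum(dim=-1) / lengths

    # GMPO-style: geometric mean, i.e. exp of the mean log-ratio (more outlier-robust).
    geometric = (log_ratio.sum(dim=-1) / lengths).exp()
    return arithmetic, geometric
```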
This survey by multiple institutions defines Deep Research (DR) as a paradigm enabling Large Language Models to perform complex, open-ended tasks through autonomous workflows and verifiable outputs. It systematically categorizes DR into a three-stage roadmap, dissects its four core components, and outlines optimization techniques and evaluation benchmarks.
Researchers at Microsoft and collaborating universities developed DOCREWARD, a Document Reward Model designed to evaluate the visual structure and style professionalism of multi-page documents, independently of their textual content. The model achieved 89.22% human preference accuracy, a 19.45 percentage point improvement over GPT-5, and increased the win rate of AI-generated documents in human comparisons to 60.8% when used as a reward signal.
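The reported preference accuracy corresponds to the standard pairwise metric sketched below; the `score` wrapper is a placeholder for any document reward model, not Microsoft's evaluation harness.

```python
from typing import Callable, Sequence, Tuple

def preference_accuracy(score: Callable[[str], float],
                        pairs: Sequence[Tuple[str, str]]) -> float:
    """Pairwise preference accuracy of a reward model.

    `pairs` holds (human_preferred, rejected) documents; the accuracy is the
    fraction of pairs where the model scores the preferred document higher.
    """
    wins = sum(1 for preferred, rejected in pairs
               if score(preferred) > score(rejected))
    return wins / len(pairs)
```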
ImagerySearch, an adaptive test-time search strategy, and LDT-Bench, a new evaluation benchmark, are presented to enhance text-to-video generation for imaginative scenarios characterized by long-distance semantic prompts. This approach achieves an 8.83% improvement in overall ImageryQA score on LDT-Bench and demonstrates superior robustness to increasing semantic distance when compared to existing methods.
Researchers at Peking University, UCAS, and Moonshot AI developed a novel reinforcement learning environment, VLM-Gym, to train Vision-Language Models (VLMs) for improved performance in interactive visual games. Their G1-7B models, trained with a perception-enhanced cold start and reinforcement learning, consistently achieve higher scores than leading proprietary VLMs, demonstrating how VLM perception and reasoning abilities can mutually reinforce each other through reward-driven interaction.
EVEv2.0 introduces a family of encoder-free Vision-Language Models that efficiently learn visual perception from scratch, leveraging a "Divide-and-Conquer" architecture and a high-quality captioning engine. The model achieves competitive performance with mainstream encoder-based VLMs of similar capacity on various benchmarks while utilizing significantly less training data.
Researchers from CASIA, UCAS, and ByteDance introduced TreeBench, a diagnostic benchmark that evaluates Large Multimodal Models' visual grounded reasoning by assessing fine-grained perception and traceable evidence through bounding box annotations. They also developed TreeVGR, a training paradigm that uses dual IoU rewards to enhance models' ability to generate explicit visual reasoning pathways, demonstrating substantial performance gains on VGR tasks, including a 13.4-point accuracy improvement on TreeBench over its base model.
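A hedged sketch of an IoU-based localization reward is shown below; interpreting "dual" as a symmetric prediction-to-ground-truth and ground-truth-to-prediction matching is an assumption made for illustration, not TreeVGR's exact reward definition.

```python
from typing import Sequence
import numpy as np

Box = Sequence[float]  # (x1, y1, x2, y2)

def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def dual_iou_reward(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Symmetric IoU reward: best-match IoU averaged from predictions to ground
    truth (precision-like) and from ground truth to predictions (recall-like)."""
    if not pred or not gt:
        return 0.0
    p2g = np.mean([max(box_iou(p, g) for g in gt) for p in pred])
    g2p = np.mean([max(box_iou(g, p) for p in pred) for g in gt])
    return float(0.5 * (p2g + g2p))
```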
Researchers from CASIA, Meituan, GigaAI, and other institutions developed FullVQ (FVQ), a scalable training method for vector-quantized networks that consistently achieves 100% codebook utilization by introducing a novel VQBridge projector. FVQ sets a new state-of-the-art for discrete tokenizers with an rFID of 0.88 and enables autoregressive models to surpass advanced diffusion models in image generation quality without incurring inference overhead.
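The codebook-utilization statistic that FVQ drives to 100% can be measured as in the sketch below; this covers only the measurement, not the VQBridge projector or FVQ's training procedure.

```python
import torch

def codebook_utilization(latents: torch.Tensor, codebook: torch.Tensor) -> float:
    """Fraction of codebook entries hit by nearest-neighbor assignment.

    latents: (N, D) continuous features to be quantized; codebook: (K, D) code
    vectors. Returns the share of the K codes that receive at least one latent.
    """
    distances = torch.cdist(latents, codebook)    # (N, K) pairwise L2 distances
    assignments = distances.argmin(dim=-1)        # nearest code index per latent
    used = torch.unique(assignments).numel()
    return used / codebook.size(0)
```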
A new framework, Distribution Matching Variational AutoEncoder (DMVAE), explicitly aligns a VAE's aggregate latent distribution with a pre-defined reference distribution using score-based matching. The approach achieves a state-of-the-art gFID of 1.82 on ImageNet 256x256, demonstrating superior training efficiency for downstream generative models, particularly when utilizing Self-Supervised Learning features as the reference.
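In generic terms, score-based matching of a VAE's aggregate latent distribution $q_\phi(z)$ to a reference $p_{\mathrm{ref}}(z)$ can be written as the objective below; this formulation, and the symbols $s_q$ and $p_{\mathrm{ref}}$, are a hedged reading of the summary rather than DMVAE's exact loss.

$$\mathcal{L}_{\text{align}} \;=\; \mathbb{E}_{z \sim q_\phi(z)}\Big[\big\lVert s_q(z) - \nabla_z \log p_{\mathrm{ref}}(z)\big\rVert_2^2\Big],$$

where $s_q$ denotes an estimate of the aggregate-posterior score and $p_{\mathrm{ref}}$ is the pre-defined reference distribution, e.g., one derived from Self-Supervised Learning features.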