ZJUChina
π³: Permutation-Equivariant Visual Geometry Learning

π³ presents a permutation-equivariant architecture for visual geometry learning that entirely removes the reliance on a fixed reference view. The model achieves state-of-the-art performance in camera pose, depth, and point-map estimation, demonstrating superior robustness to input image ordering as well as improved efficiency.
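As a toy illustration of the permutation-equivariance property (not the paper's model), the sketch below uses a stand-in per-view predictor and checks that permuting the input views simply permutes the outputs:

```python
# Toy check of permutation equivariance with a stand-in per-view predictor
# (illustrative only; not the π³ network).
import torch

torch.manual_seed(0)
W = torch.randn(16, 16)

def predictor(views: torch.Tensor) -> torch.Tensor:
    """[N, 16] view features -> [N, 16] per-view predictions."""
    return torch.tanh(views @ W)

views = torch.randn(5, 16)            # 5 input views, no privileged reference view
perm = torch.randperm(5)

a = predictor(views)[perm]            # predict, then permute the outputs
b = predictor(views[perm])            # permute the inputs, then predict
print(torch.allclose(a, b))           # True: outputs follow the input ordering
```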

P³-SAM: Native 3D Part Segmentation

P³-SAM, developed by Tencent Hunyuan and academic collaborators, introduces a native 3D point-promptable part segmentation model trained on a new dataset of 3.7 million 3D models, achieving fully automatic and precise segmentation of complex 3D objects. This approach bypasses the limitations of 2D-dependent methods, leading to superior quantitative performance and robust generalization.

G²VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
27 Nov 2025

G²VLM integrates 3D reconstruction and spatial reasoning within a single Vision-Language Model, addressing the spatial intelligence limitations of current VLMs. It learns explicit visual geometry from 2D data using a Mixture-of-Transformer-Experts architecture, leading to robust spatial understanding and strong performance on both 3D reconstruction and complex spatial reasoning benchmarks.

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
23 Oct 2025

HoloCine, developed by Ant Group and HKUST, generates coherent, cinematic multi-shot long video narratives from hierarchical text prompts. The framework introduces architectural innovations to maintain global consistency and directorial control while achieving computational efficiency, outperforming prior text-to-video approaches in narrative fidelity and consistency for minute-scale videos.

X-Part: high fidelity and structure coherent shape decomposition
24 Sep 2025
Generating 3D shapes at the part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and fail to produce semantically meaningful decompositions. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part uses bounding boxes as prompts for part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Code will be released for public research.
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Reward Forcing introduces EMA-Sink and Rewarded Distribution Matching Distillation (Re-DMD) to enable efficient, real-time streaming video generation. This framework achieves an overall VBench score of 84.13 and a generation speed of 23.1 FPS, while significantly enhancing motion dynamics and maintaining long-horizon consistency.

Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Researchers from HKUST(GZ), Kuaishou, and Ant Group introduced Minimal Test-Time Intervention (MTI), a training-free framework that enhances large language model reasoning accuracy and stability. MTI achieves this by selectively applying Classifier-Free Guidance (CFG) only at highly uncertain tokens, utilizing a lightweight negative-prompt mechanism for efficiency.
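A minimal sketch of the selective-guidance idea, assuming hypothetical conditional and negative-prompt logits for the next token; the entropy threshold and guidance scale below are illustrative values, not the paper's:

```python
import torch

def selective_cfg_logits(cond_logits, uncond_logits, tau=2.5, scale=1.5):
    """Apply classifier-free guidance only when the next-token distribution is uncertain.

    cond_logits / uncond_logits: [vocab] logits from the conditional pass and a
    lightweight negative-prompt pass (hypothetical tensors for illustration).
    tau: entropy threshold (nats) above which the token counts as uncertain.
    scale: guidance strength.
    """
    probs = torch.softmax(cond_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum()
    if entropy > tau:                                   # uncertain token: steer with CFG
        return uncond_logits + scale * (cond_logits - uncond_logits)
    return cond_logits                                  # confident token: no intervention

# toy usage with random logits over a 32k vocabulary
cond, uncond = torch.randn(32000), torch.randn(32000)
next_token = torch.argmax(selective_cfg_logits(cond, uncond))
```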

DeepVerse: 4D Autoregressive Video Generation as a World Model

SJTU, USTC, Tsinghua, and Shanghai AI Lab researchers develop DeepVerse, a 4D autoregressive world model that generates video sequences by explicitly incorporating depth maps and camera poses alongside visual observations, using a composite state representation and a geometry-aware memory mechanism. This addresses core limitations of visual-only approaches, including scale ambiguity and temporal drift. Trained on 10 million synthetic game frames with precise geometric annotations, DeepVerse achieves superior performance on VBench consistency metrics and enables long-horizon predictions through a sliding-window approach that maintains global coordinate alignment across extended sequences.
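A schematic of such a sliding-window rollout over a composite state; `world_model` and the State fields below are placeholders, not DeepVerse's actual interface:

```python
# Schematic sliding-window rollout with a composite (frame, depth, pose) state;
# the components are placeholders, not DeepVerse's implementation.
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List

@dataclass
class State:
    frame: Any   # RGB frame
    depth: Any   # depth map
    pose: Any    # camera pose in a shared global coordinate frame

def rollout(world_model, init_states: List[State], action_text: str,
            horizon: int = 32, window: int = 8) -> List[State]:
    """Autoregressively predict future composite states from a fixed-size context window."""
    memory: Deque[State] = deque(init_states, maxlen=window)   # geometry-aware context
    trajectory = list(init_states)
    for _ in range(horizon):
        nxt = world_model(list(memory), action_text)           # predicts frame + depth + pose
        trajectory.append(nxt)
        memory.append(nxt)                                     # slide the window forward
    return trajectory
```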

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
28 Aug 2025
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement more than 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at this https URL and is actively maintained.
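The single-interface idea can be pictured roughly as follows; the class and method names here are hypothetical stand-ins, not VLMEvalKit's actual API:

```python
# Hypothetical shape of a "single interface" for adding a new model to an evaluation
# toolkit; class and method names are illustrative, not VLMEvalKit's real API.
from abc import ABC, abstractmethod
from typing import Dict, List

class EvaluatableVLM(ABC):
    @abstractmethod
    def generate(self, messages: List[Dict[str, str]]) -> str:
        """messages: interleaved items such as {"type": "image", "value": path}
        or {"type": "text", "value": prompt}; returns the model's answer."""

class MyModel(EvaluatableVLM):
    def generate(self, messages):
        prompt = " ".join(m["value"] for m in messages if m["type"] == "text")
        return f"dummy answer to: {prompt}"

# the toolkit would then handle data preparation, distributed inference, and scoring
print(MyModel().generate([{"type": "text", "value": "What is shown in the image?"}]))
```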
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
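A schematic of the closed-loop use of short-horizon primitives described above; the planner, primitive world model, controller, and camera are stubs, not components from the paper:

```python
# Schematic closed-loop control over primitive skills (illustrative stubs only).
def run_task(task, planner, primitive_world_model, controller, camera, max_steps=10):
    """Decompose a task into short-horizon primitives and execute them one by one."""
    for _ in range(max_steps):
        obs = camera()                                         # current observation
        primitive, heatmap = planner(task, obs)                # VLM picks a primitive + start/goal heatmap
        if primitive == "done":
            break
        clip = primitive_world_model(obs, primitive, heatmap)  # short-horizon video generation only
        controller(clip)                                       # track the imagined motion on the robot
```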
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to sample multiple completions for each question. Our experiments and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training: their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to a 7.98× speedup on GSM8K and 3.48× on Math while preserving or even enhancing accuracy compared to the original GRPO. We release our code at this https URL.
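As a rough illustration of the pruning step (not the authors' implementation), the sketch below computes group-relative advantages for one question's sampled completions and keeps only the top-k by absolute advantage before the policy update; the keep ratio is a made-up parameter:

```python
import torch

def prune_completions(rewards: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the completions with the largest |group-relative advantage|.

    rewards: [G] scalar rewards for G sampled completions of one question.
    keep_ratio: illustrative fraction of completions retained for the update.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # GRPO-style advantage
    k = max(1, int(keep_ratio * rewards.numel()))
    keep = torch.topk(adv.abs(), k).indices                     # low-|advantage| samples are pruned
    return keep, adv[keep]

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
keep, adv = prune_completions(rewards)
print(keep, adv)   # only these completions enter the gradient computation
```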
On Path to Multimodal Generalist: General-Level and General-Bench
07 May 2025

A comprehensive framework called General-Level introduces a 5-level taxonomy for evaluating multimodal large language models (MLLMs), accompanied by General-Bench, a large-scale benchmark testing diverse modalities and tasks, revealing that even leading models like GPT-4V achieve only Level-3 capabilities while demonstrating limited cross-modal synergy.

Motion Anything: Any to Motion Generation
12 Mar 2025

Researchers from Australian National University, University of Sydney, Tencent, and other institutions developed "Motion Anything," a framework for human motion generation that adaptively integrates multimodal conditions like text and music. It employs an attention-based masking strategy to dynamically prioritize motion segments, outperforming prior models in text-to-motion (e.g., 15% lower FID on HumanML3D) and music-to-dance tasks, and introduces a new Text-Music-Dance dataset.

Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model

The paper "Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model" provides a systematic review and unified benchmark for tuning MLLMs, classifying methods into Selective, Additive, and Reparameterization paradigms. It empirically analyzes the trade-offs between task-expert specialization and open-world stabilization, offering practical guidelines for MLLM deployment.

MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
02 Dec 2025

MagicQuill V2 introduces a layered composition paradigm for precise, interactive image editing, integrating state-of-the-art diffusion transformers with granular control mechanisms. The system, developed by researchers at HKUST and Ant Group, composites user-provided foreground elements with context awareness, precisely adheres to visual cues such as edges and colors, and achieves superior local editing and object removal, demonstrating a 68.5% user preference rate over baselines.

BlendedMVS: A Large-scale Dataset for Generalized Multi-view Stereo Networks
13 Apr 2020

BlendedMVS introduces a large-scale dataset for Multi-view Stereo (MVS) that leverages real-world 3D meshes and a novel frequency-domain image blending technique to provide geometrically consistent and visually realistic training data. Training MVS networks on BlendedMVS significantly improves their generalization ability to diverse, real-world scenes like Tanks and Temples, outperforming models trained on existing datasets.
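A rough sketch of a frequency-domain blend of this kind, assuming the input photo supplies the low-frequency appearance and the rendered mesh view supplies the high-frequency detail; the cutoff radius is arbitrary and this is not the paper's exact filter:

```python
import numpy as np

def blend_low_high(photo: np.ndarray, rendered: np.ndarray, radius: int = 16) -> np.ndarray:
    """Combine the photo's low frequencies with the rendered view's high frequencies.

    photo, rendered: grayscale float images in [0, 1] with the same [H, W] shape.
    radius: cutoff (in frequency bins) of the illustrative low-pass mask.
    """
    h, w = photo.shape
    yy, xx = np.mgrid[:h, :w]
    low_pass = (np.hypot(yy - h / 2, xx - w / 2) <= radius).astype(np.float64)

    f_photo = np.fft.fftshift(np.fft.fft2(photo))
    f_render = np.fft.fftshift(np.fft.fft2(rendered))
    blended = f_photo * low_pass + f_render * (1.0 - low_pass)   # low freq: photo, high freq: render
    return np.real(np.fft.ifft2(np.fft.ifftshift(blended)))

blended = blend_low_high(np.random.rand(64, 64), np.random.rand(64, 64))
```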

PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
27 May 2025
Self-supervised representation learning for point clouds has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase the number of tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the number of tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. Experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: this https URL
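A minimal LoRA layer of the kind such a method might inject into the attention projections of a point-cloud transformer; the dimensions and rank are placeholders, and the multi-scale token-selection prompts are omitted:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze the pre-trained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# e.g. wrap the qkv projection of one point-cloud transformer block (illustrative sizes)
qkv = LoRALinear(nn.Linear(384, 3 * 384), r=8)
tokens = torch.randn(2, 128, 384)                         # [batch, point tokens, dim]
out = qkv(tokens)                                         # [2, 128, 1152]
```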
AffordanceSAM: Segment Anything Once More in Affordance Grounding
25 Aug 2025

AffordanceSAM adapts the Segment Anything Model (SAM) for affordance grounding, integrating a specialized adaptation module and a coarse-to-fine training scheme. The framework achieves state-of-the-art performance on the AGD20K benchmark and demonstrates robust generalization to novel objects and actions.

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
23 Nov 2025
Multimodal Large Language Models (MLLMs) perform well in video understanding but degrade on long videos due to fixed-length context and weak long-term dependency modeling. Retrieval-Augmented Generation (RAG) can expand knowledge dynamically, yet existing video RAG schemes adopt fixed retrieval paradigms that ignore query difficulty. This uniform design causes redundant computation and latency for simple queries, while coarse retrieval for complex, multi-hop reasoning can miss key information. Such single-step retrieval severely limits the trade-off between efficiency and cognitive depth. We propose AdaVideoRAG, an adaptive RAG framework for long-video understanding. A lightweight intent classifier dynamically selects suitable retrieval schemes according to query complexity from the simplest to the most sophisticated. We design an Omni-Knowledge Indexing module that extracts and organizes multi-modal information into three databases: (1) a text base built from clip captions, ASR, and OCR; (2) a visual base; and (3) a knowledge graph for deep semantic understanding. This supports hierarchical knowledge access, from naive retrieval to graph-based retrieval, balancing resource cost and reasoning ability. To evaluate deep understanding, we further construct the HiVU benchmark. Experiments show that AdaVideoRAG significantly improves both efficiency and accuracy on long-video QA tasks and can be seamlessly plugged into existing MLLMs through lightweight APIs, establishing a new paradigm for adaptive retrieval-augmented video analysis.
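An illustrative routing sketch for the adaptive retrieval idea; the intent classifier, tier names, and retrieval back-ends below are placeholders, not the paper's components:

```python
from typing import Callable, Dict

def answer_query(query: str,
                 classify_intent: Callable[[str], str],
                 retrievers: Dict[str, Callable[[str], str]],
                 llm: Callable[[str, str], str]) -> str:
    """Pick a retrieval tier from the predicted query complexity, then answer."""
    tier = classify_intent(query)             # e.g. "naive", "text+visual", "graph"
    context = retrievers[tier](query)         # cheap retrieval for simple queries,
                                              # graph-based retrieval for multi-hop ones
    return llm(query, context)

# toy wiring with stub components
retrievers = {
    "naive": lambda q: "top caption / ASR / OCR snippets",
    "graph": lambda q: "multi-hop subgraph evidence",
}
classify = lambda q: "graph" if ("why" in q.lower() or "before" in q.lower()) else "naive"
llm = lambda q, ctx: f"answer({q!r}, using {ctx})"
print(answer_query("Why did the speaker pause before the demo?", classify, retrievers, llm))
```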
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
01 Mar 2025

Researchers at Shanghai AI Lab developed SPA, a framework that imbues 2D Vision Transformers with 3D spatial awareness using differentiable neural rendering from multi-view images. It achieves superior performance across 268 embodied AI tasks and generalizes to real-world robot manipulation by effectively capturing 3D spatial relationships.
