ZJUChina
π³: Permutation-Equivariant Visual Geometry Learning

π³ presents a permutation-equivariant architecture for visual geometry learning that entirely removes the reliance on a fixed reference view. The model achieves state-of-the-art performance in camera pose, depth, and point-map estimation, demonstrating superior robustness to input image ordering as well as improved efficiency.
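As a toy illustration of the permutation-equivariance property (not the paper's model), the sketch below uses a stand-in per-view predictor and checks that permuting the input views simply permutes the outputs:

```python
# Toy check of permutation equivariance with a stand-in per-view predictor
# (illustrative only; not the π³ network).
import torch

torch.manual_seed(0)
W = torch.randn(16, 16)

def predictor(views: torch.Tensor) -> torch.Tensor:
    """[N, 16] view features -> [N, 16] per-view predictions."""
    return torch.tanh(views @ W)

views = torch.randn(5, 16)            # 5 input views, no privileged reference view
perm = torch.randperm(5)

a = predictor(views)[perm]            # predict, then permute the outputs
b = predictor(views[perm])            # permute the inputs, then predict
print(torch.allclose(a, b))           # True: outputs follow the input ordering
```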

P³-SAM: Native 3D Part Segmentation

P³-SAM, developed by Tencent Hunyuan and academic collaborators, introduces a native 3D point-promptable part segmentation model trained on a new dataset of 3.7 million 3D models, achieving fully automatic and precise segmentation of complex 3D objects. This approach bypasses the limitations of 2D-dependent methods, leading to superior quantitative performance and robust generalization.

G²VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
27 Nov 2025

G²VLM integrates 3D reconstruction and spatial reasoning within a single Vision-Language Model, addressing the spatial intelligence limitations of current VLMs. It learns explicit visual geometry from 2D data using a Mixture-of-Transformer-Experts architecture, leading to robust spatial understanding and strong performance on both 3D reconstruction and complex spatial reasoning benchmarks.

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
23 Oct 2025

HoloCine, developed by Ant Group and HKUST, generates coherent, cinematic multi-shot long video narratives from hierarchical text prompts. The framework introduces architectural innovations to maintain global consistency and directorial control while achieving computational efficiency, outperforming prior text-to-video approaches in narrative fidelity and consistency for minute-scale videos.

X-Part: high fidelity and structure coherent shape decomposition
24 Sep 2025
Generating 3D shapes at the part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and fail to produce semantically meaningful decompositions. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part uses bounding boxes as prompts for part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Code will be released for public research.
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Reward Forcing introduces EMA-Sink and Rewarded Distribution Matching Distillation (Re-DMD) to enable efficient, real-time streaming video generation. This framework achieves an overall VBench score of 84.13 and a generation speed of 23.1 FPS, while significantly enhancing motion dynamics and maintaining long-horizon consistency.

Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Researchers from HKUST(GZ), Kuaishou, and Ant Group introduced Minimal Test-Time Intervention (MTI), a training-free framework that enhances large language model reasoning accuracy and stability. MTI achieves this by selectively applying Classifier-Free Guidance (CFG) only at highly uncertain tokens, utilizing a lightweight negative-prompt mechanism for efficiency.
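A minimal sketch of the selective-guidance idea, assuming hypothetical conditional and negative-prompt logits for the next token; the entropy threshold and guidance scale below are illustrative values, not the paper's:

```python
import torch

def selective_cfg_logits(cond_logits, uncond_logits, tau=2.5, scale=1.5):
    """Apply classifier-free guidance only when the next-token distribution is uncertain.

    cond_logits / uncond_logits: [vocab] logits from the conditional pass and a
    lightweight negative-prompt pass (hypothetical tensors for illustration).
    tau: entropy threshold (nats) above which the token counts as uncertain.
    scale: guidance strength.
    """
    probs = torch.softmax(cond_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum()
    if entropy > tau:                                   # uncertain token: steer with CFG
        return uncond_logits + scale * (cond_logits - uncond_logits)
    return cond_logits                                  # confident token: no intervention

# toy usage with random logits over a 32k vocabulary
cond, uncond = torch.randn(32000), torch.randn(32000)
next_token = torch.argmax(selective_cfg_logits(cond, uncond))
```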

DeepVerse: 4D Autoregressive Video Generation as a World Model

SJTU, USTC, Tsinghua, and Shanghai AI Lab researchers develop DeepVerse, a 4D autoregressive world model that generates video sequences by explicitly incorporating depth maps and camera poses alongside visual observations, using a composite state representation and a geometry-aware memory mechanism. This addresses core limitations of visual-only approaches, including scale ambiguity and temporal drift. Trained on 10 million synthetic game frames with precise geometric annotations, DeepVerse achieves superior performance on VBench consistency metrics and enables long-horizon predictions through a sliding-window approach that maintains global coordinate alignment across extended sequences.
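A schematic of such a sliding-window rollout over a composite state; `world_model` and the State fields below are placeholders, not DeepVerse's actual interface:

```python
# Schematic sliding-window rollout with a composite (frame, depth, pose) state;
# the components are placeholders, not DeepVerse's implementation.
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List

@dataclass
class State:
    frame: Any   # RGB frame
    depth: Any   # depth map
    pose: Any    # camera pose in a shared global coordinate frame

def rollout(world_model, init_states: List[State], action_text: str,
            horizon: int = 32, window: int = 8) -> List[State]:
    """Autoregressively predict future composite states from a fixed-size context window."""
    memory: Deque[State] = deque(init_states, maxlen=window)   # geometry-aware context
    trajectory = list(init_states)
    for _ in range(horizon):
        nxt = world_model(list(memory), action_text)           # predicts frame + depth + pose
        trajectory.append(nxt)
        memory.append(nxt)                                     # slide the window forward
    return trajectory
```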

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
28 Aug 2025
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement more than 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at this https URL and is actively maintained.
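The single-interface idea can be pictured roughly as follows; the class and method names here are hypothetical stand-ins, not VLMEvalKit's actual API:

```python
# Hypothetical shape of a "single interface" for adding a new model to an evaluation
# toolkit; class and method names are illustrative, not VLMEvalKit's real API.
from abc import ABC, abstractmethod
from typing import Dict, List

class EvaluatableVLM(ABC):
    @abstractmethod
    def generate(self, messages: List[Dict[str, str]]) -> str:
        """messages: interleaved items such as {"type": "image", "value": path}
        or {"type": "text", "value": prompt}; returns the model's answer."""

class MyModel(EvaluatableVLM):
    def generate(self, messages):
        prompt = " ".join(m["value"] for m in messages if m["type"] == "text")
        return f"dummy answer to: {prompt}"

# the toolkit would then handle data preparation, distributed inference, and scoring
print(MyModel().generate([{"type": "text", "value": "What is shown in the image?"}]))
```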
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
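A schematic of the closed-loop use of short-horizon primitives described above; the planner, primitive world model, controller, and camera are stubs, not components from the paper:

```python
# Schematic closed-loop control over primitive skills (illustrative stubs only).
def run_task(task, planner, primitive_world_model, controller, camera, max_steps=10):
    """Decompose a task into short-horizon primitives and execute them one by one."""
    for _ in range(max_steps):
        obs = camera()                                         # current observation
        primitive, heatmap = planner(task, obs)                # VLM picks a primitive + start/goal heatmap
        if primitive == "done":
            break
        clip = primitive_world_model(obs, primitive, heatmap)  # short-horizon video generation only
        controller(clip)                                       # track the imagined motion on the robot
```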
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to sample multiple completions for each question. Our experiments and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training: their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to a 7.98× speedup on GSM8K and 3.48× on Math while preserving or even enhancing accuracy compared to the original GRPO. We release our code at this https URL.
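As a rough illustration of the pruning step (not the authors' implementation), the sketch below computes group-relative advantages for one question's sampled completions and keeps only the top-k by absolute advantage before the policy update; the keep ratio is a made-up parameter:

```python
import torch

def prune_completions(rewards: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the completions with the largest |group-relative advantage|.

    rewards: [G] scalar rewards for G sampled completions of one question.
    keep_ratio: illustrative fraction of completions retained for the update.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # GRPO-style advantage
    k = max(1, int(keep_ratio * rewards.numel()))
    keep = torch.topk(adv.abs(), k).indices                     # low-|advantage| samples are pruned
    return keep, adv[keep]

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
keep, adv = prune_completions(rewards)
print(keep, adv)   # only these completions enter the gradient computation
```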
On Path to Multimodal Generalist: General-Level and General-Bench
07 May 2025

A comprehensive framework called General-Level introduces a 5-level taxonomy for evaluating multimodal large language models (MLLMs), accompanied by General-Bench, a large-scale benchmark testing diverse modalities and tasks, revealing that even leading models like GPT-4V achieve only Level-3 capabilities while demonstrating limited cross-modal synergy.

Motion Anything: Any to Motion Generation
12 Mar 2025

Researchers from Australian National University, University of Sydney, Tencent, and other institutions developed "Motion Anything," a framework for human motion generation that adaptively integrates multimodal conditions like text and music. It employs an attention-based masking strategy to dynamically prioritize motion segments, outperforming prior models in text-to-motion (e.g., 15% lower FID on HumanML3D) and music-to-dance tasks, and introduces a new Text-Music-Dance dataset.

Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model

The paper "Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model" provides a systematic review and unified benchmark for tuning MLLMs, classifying methods into Selective, Additive, and Reparameterization paradigms. It empirically analyzes the trade-offs between task-expert specialization and open-world stabilization, offering practical guidelines for MLLM deployment.

MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
02 Dec 2025

MagicQuill V2 introduces a layered composition paradigm for precise, interactive image editing, integrating state-of-the-art diffusion transformers with granular control mechanisms. The system, developed by researchers at HKUST and Ant Group, composites user-provided foreground elements with context awareness, precisely adheres to visual cues such as edges and colors, and achieves superior local editing and object removal, demonstrating a 68.5% user preference rate over baselines.

BlendedMVS: A Large-scale Dataset for Generalized Multi-view Stereo Networks
13 Apr 2020

BlendedMVS introduces a large-scale dataset for Multi-view Stereo (MVS) that leverages real-world 3D meshes and a novel frequency-domain image blending technique to provide geometrically consistent and visually realistic training data. Training MVS networks on BlendedMVS significantly improves their generalization ability to diverse, real-world scenes like Tanks and Temples, outperforming models trained on existing datasets.
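A rough sketch of a frequency-domain blend of this kind, assuming the input photo supplies the low-frequency appearance and the rendered mesh view supplies the high-frequency detail; the cutoff radius is arbitrary and this is not the paper's exact filter:

```python
import numpy as np

def blend_low_high(photo: np.ndarray, rendered: np.ndarray, radius: int = 16) -> np.ndarray:
    """Combine the photo's low frequencies with the rendered view's high frequencies.

    photo, rendered: grayscale float images in [0, 1] with the same [H, W] shape.
    radius: cutoff (in frequency bins) of the illustrative low-pass mask.
    """
    h, w = photo.shape
    yy, xx = np.mgrid[:h, :w]
    low_pass = (np.hypot(yy - h / 2, xx - w / 2) <= radius).astype(np.float64)

    f_photo = np.fft.fftshift(np.fft.fft2(photo))
    f_render = np.fft.fftshift(np.fft.fft2(rendered))
    blended = f_photo * low_pass + f_render * (1.0 - low_pass)   # low freq: photo, high freq: render
    return np.real(np.fft.ifft2(np.fft.ifftshift(blended)))

blended = blend_low_high(np.random.rand(64, 64), np.random.rand(64, 64))
```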

PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
27 May 2025
Self-supervised representation learning for point clouds has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase the number of tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the number of tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. Experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: this https URL
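A minimal LoRA layer of the kind such a method might inject into the attention projections of a point-cloud transformer; the dimensions and rank are placeholders, and the multi-scale token-selection prompts are omitted:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze the pre-trained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# e.g. wrap the qkv projection of one point-cloud transformer block (illustrative sizes)
qkv = LoRALinear(nn.Linear(384, 3 * 384), r=8)
tokens = torch.randn(2, 128, 384)                         # [batch, point tokens, dim]
out = qkv(tokens)                                         # [2, 128, 1152]
```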
AffordanceSAM: Segment Anything Once More in Affordance Grounding
25 Aug 2025

AffordanceSAM adapts the Segment Anything Model (SAM) for affordance grounding, integrating a specialized adaptation module and a coarse-to-fine training scheme. The framework achieves state-of-the-art performance on the AGD20K benchmark and demonstrates robust generalization to novel objects and actions.

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
23 Nov 2025
Multimodal Large Language Models (MLLMs) perform well in video understanding but degrade on long videos due to fixed-length context and weak long-term dependency modeling. Retrieval-Augmented Generation (RAG) can expand knowledge dynamically, yet existing video RAG schemes adopt fixed retrieval paradigms that ignore query difficulty. This uniform design causes redundant computation and latency for simple queries, while coarse retrieval for complex, multi-hop reasoning can miss key information. Such single-step retrieval severely limits the trade-off between efficiency and cognitive depth. We propose AdaVideoRAG, an adaptive RAG framework for long-video understanding. A lightweight intent classifier dynamically selects suitable retrieval schemes according to query complexity from the simplest to the most sophisticated. We design an Omni-Knowledge Indexing module that extracts and organizes multi-modal information into three databases: (1) a text base built from clip captions, ASR, and OCR; (2) a visual base; and (3) a knowledge graph for deep semantic understanding. This supports hierarchical knowledge access, from naive retrieval to graph-based retrieval, balancing resource cost and reasoning ability. To evaluate deep understanding, we further construct the HiVU benchmark. Experiments show that AdaVideoRAG significantly improves both efficiency and accuracy on long-video QA tasks and can be seamlessly plugged into existing MLLMs through lightweight APIs, establishing a new paradigm for adaptive retrieval-augmented video analysis.
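An illustrative routing sketch for the adaptive retrieval idea; the intent classifier, tier names, and retrieval back-ends below are placeholders, not the paper's components:

```python
from typing import Callable, Dict

def answer_query(query: str,
                 classify_intent: Callable[[str], str],
                 retrievers: Dict[str, Callable[[str], str]],
                 llm: Callable[[str, str], str]) -> str:
    """Pick a retrieval tier from the predicted query complexity, then answer."""
    tier = classify_intent(query)             # e.g. "naive", "text+visual", "graph"
    context = retrievers[tier](query)         # cheap retrieval for simple queries,
                                              # graph-based retrieval for multi-hop ones
    return llm(query, context)

# toy wiring with stub components
retrievers = {
    "naive": lambda q: "top caption / ASR / OCR snippets",
    "graph": lambda q: "multi-hop subgraph evidence",
}
classify = lambda q: "graph" if ("why" in q.lower() or "before" in q.lower()) else "naive"
llm = lambda q, ctx: f"answer({q!r}, using {ctx})"
print(answer_query("Why did the speaker pause before the demo?", classify, retrievers, llm))
```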
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
01 Mar 2025

Researchers at Shanghai AI Lab developed SPA, a framework that imbues 2D Vision Transformers with 3D spatial awareness using differentiable neural rendering from multi-view images. It achieves superior performance across 268 embodied AI tasks and generalizes to real-world robot manipulation by effectively capturing 3D spatial relationships.
