GSFixer enhances 3D Gaussian Splatting reconstruction quality from sparse input views by introducing a reference-guided video diffusion model. The method leverages both 2D semantic and 3D geometric features to achieve superior visual fidelity and 3D consistency, improving PSNR by 2.16 dB on artifact restoration and by 3.55 dB on 3-view inputs over prior generative approaches.
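For intuition, here is a minimal sketch, assuming a torch-style codebase, of how 2D semantic and 3D geometric feature maps might be fused into a single conditioning stream for a diffusion denoiser. The module, dimensions, and fusion scheme are hypothetical illustrations, not GSFixer's actual architecture.

```python
# Hypothetical sketch (not GSFixer's architecture): fuse 2D semantic and
# 3D geometric feature maps into one conditioning stream for a denoiser.
import torch
import torch.nn as nn

class ReferenceConditioner(nn.Module):
    def __init__(self, sem_dim=768, geo_dim=128, cond_dim=320):
        super().__init__()
        # Project each stream to a shared width, then fuse by concatenation.
        self.sem_proj = nn.Conv2d(sem_dim, cond_dim, kernel_size=1)
        self.geo_proj = nn.Conv2d(geo_dim, cond_dim, kernel_size=1)
        self.fuse = nn.Conv2d(2 * cond_dim, cond_dim, kernel_size=3, padding=1)

    def forward(self, sem_feats, geo_feats):
        # sem_feats: (B, sem_dim, H, W) features from a 2D semantic encoder
        # geo_feats: (B, geo_dim, H, W) features from depth/point maps
        fused = torch.cat([self.sem_proj(sem_feats),
                           self.geo_proj(geo_feats)], dim=1)
        return self.fuse(fused)  # (B, cond_dim, H, W), fed to the denoiser

cond = ReferenceConditioner()(torch.randn(1, 768, 32, 32),
                              torch.randn(1, 128, 32, 32))
print(cond.shape)  # torch.Size([1, 320, 32, 32])
```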
Researchers at The Chinese University of Hong Kong, Shenzhen, developed TASTE-Rob, a large-scale dataset, and a three-stage pose-refinement pipeline to generate high-fidelity, task-oriented hand-object interaction videos. This approach achieved state-of-the-art video generation quality with an FVD of 9.43, reduced grasping pose classification error from 67.8% to 9.7%, and improved robotic manipulation success rates from 84% to 96%.
A new benchmark, MME-VideoOCR, evaluates Multimodal Large Language Models (MLLMs) on OCR capabilities in dynamic video scenarios. The benchmark, comprising 1,464 videos and 2,000 human-annotated QA pairs, reveals that state-of-the-art MLLMs achieve only up to 73.7% accuracy, struggling in particular with cross-frame information integration and exhibiting a strong language prior bias.
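As a rough illustration of how such an accuracy figure is computed, the sketch below scores a model over annotated QA pairs; `query_mllm` is a stand-in for any model's inference call, and the exact-match scoring is a simplification rather than the benchmark's actual protocol.

```python
# Sketch of benchmark-style accuracy over video QA pairs. The model callable
# and exact-match scoring are assumptions, not MME-VideoOCR's real pipeline.
def evaluate(qa_pairs, query_mllm):
    """qa_pairs: iterable of dicts with 'video', 'question', 'answer' keys."""
    correct = 0
    for item in qa_pairs:
        pred = query_mllm(item["video"], item["question"])
        correct += pred.strip().lower() == item["answer"].strip().lower()
    return correct / len(qa_pairs)

# Toy run with a model that always answers "42":
acc = evaluate([{"video": "v1.mp4", "question": "What text is shown?",
                 "answer": "42"}], lambda v, q: "42")
print(acc)  # 1.0
```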
Scene-R1 introduces a video-grounded large language model capable of 3D scene reasoning by leveraging 2D foundation models and reinforcement learning, thereby eliminating the need for costly 3D annotations. The model generates explicit chain-of-thought rationales that enhance interpretability, and it outperforms label-free baselines on 3D visual grounding and Visual Question Answering tasks.
VideoMV, developed by researchers from Alibaba Group's Institute for Intelligent Computing and collaborators, presents a framework for consistent multi-view image generation by leveraging pre-trained video generative models. The method fine-tunes these models and integrates a novel 3D-aware denoising sampling strategy, drastically reducing training time to 4 GPU hours while producing high-quality, consistent 24-view images that surpass prior techniques.
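A conceptual sketch of 3D-aware denoising, not VideoMV's exact algorithm: after each DDIM step, per-view clean estimates are fused through a coarse 3D reconstruction and re-rendered, so the views stay mutually consistent as noise is removed. `denoise_eps`, `reconstruct_3d`, and `render_views` are hypothetical stand-ins.

```python
import torch

def sample_multiview(x, cameras, alphas_cumprod, denoise_eps,
                     reconstruct_3d, render_views, blend=0.5):
    """Deterministic DDIM sampler with a 3D consistency step per iteration.

    x: (V, C, H, W) noisy latents for V views.
    alphas_cumprod: (T,) cumulative noise schedule as a torch tensor.
    denoise_eps, reconstruct_3d, render_views: hypothetical callables.
    """
    T = len(alphas_cumprod)
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoise_eps(x, t)                             # per-view noise estimate
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # per-view clean estimate
        scene = reconstruct_3d(x0, cameras)                 # fuse views into coarse 3D
        x0 = (1 - blend) * x0 + blend * render_views(scene, cameras)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update (eta = 0)
    return x
```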
This work introduces MVImgNet, a large-scale dataset of 6.5 million multi-view images extracted from 219,188 videos covering 238 object classes, and MVPNet, a derived 3D point cloud dataset. It provides rich annotations including camera parameters and 3D information, demonstrating improvements in novel view synthesis, multi-view stereo, and view-consistent image understanding through pretraining.
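The camera annotations make each frame usable for standard multi-view geometry. The sketch below shows the generic pinhole projection such parameters enable; this is textbook math, not MVImgNet-specific loading code, and the field layout of the dataset itself is not assumed here.

```python
# Generic pinhole projection: intrinsics K plus world-to-camera rotation R
# and translation t map 3D world points to pixel coordinates.
import numpy as np

def project(points_w, K, R, t):
    # points_w: (N, 3) world coordinates
    p_cam = points_w @ R.T + t           # world -> camera frame
    p_img = p_cam @ K.T                  # camera frame -> image plane
    return p_img[:, :2] / p_img[:, 2:3]  # perspective divide -> (N, 2) pixels

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
uv = project(np.array([[0.0, 0.0, 2.0]]), K, np.eye(3), np.zeros(3))
print(uv)  # [[320. 240.]] -- a point on the optical axis hits the principal point
```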
RichDreamer introduces a generalizable Normal-Depth diffusion model and a depth-conditioned albedo diffusion model to produce detail-rich 3D content from text prompts. This approach achieves superior geometric fidelity and disentangled appearance properties for accurate relighting, establishing a new state-of-the-art in text-to-3D generation.
This research introduces the Heterogeneous Value Alignment Evaluation (HVAE) system, an automated framework to assess how large language models (LLMs) align with diverse human values using the Social Value Orientation (SVO) framework. The evaluation of mainstream LLMs revealed a general propensity towards neutral values and varying performance based on model architecture and fine-tuning methods.
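For reference, the standard SVO slider measure (Murphy et al., 2011) maps a subject's mean allocations to self and other onto an angle, with fixed category cutoffs; whether HVAE uses exactly this computation and these boundaries is an assumption drawn from the SVO literature.

```python
# Standard SVO slider-measure angle: allocations are centered at 50, and the
# angle of the mean allocation vector classifies the value orientation.
# Using these exact cutoffs for HVAE is an assumption.
import math

def svo_angle(mean_self: float, mean_other: float) -> float:
    return math.degrees(math.atan2(mean_other - 50.0, mean_self - 50.0))

def svo_category(angle: float) -> str:
    if angle > 57.15:
        return "altruistic"
    if angle > 22.45:
        return "prosocial"
    if angle > -12.04:
        return "individualistic"
    return "competitive"

a = svo_angle(mean_self=85.0, mean_other=85.0)
print(a, svo_category(a))  # 45.0 prosocial -- equal concern for self and other
```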
This paper presents "Aerial Lifting," a method for achieving urban-scale semantic and building-level instance segmentation directly from multi-view aerial images by lifting noisy 2D labels to a 3D neural radiance field. The approach employs scale-adaptive semantic label fusion and a cross-view instance label grouping strategy to address the challenges of scale variation and multi-view inconsistencies in 2D labels, demonstrating improved 3D semantic and instance segmentation accuracy on urban datasets.
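The core lifting idea can be illustrated with a simple majority vote over projected labels; the paper's actual scale-adaptive fusion and cross-view instance grouping are more involved than this sketch.

```python
# Minimal illustration of lifting noisy 2D labels to 3D: each 3D point
# collects the semantic labels of the pixels it projects to across views,
# and a majority vote resolves cross-view conflicts.
import numpy as np

def fuse_labels(labels_per_view, num_classes):
    # labels_per_view: (V, N) int array, the 2D label each of N 3D points
    # receives in each of V views (-1 where the point is occluded/off-frame)
    fused = np.full(labels_per_view.shape[1], -1)
    for i, votes in enumerate(labels_per_view.T):
        votes = votes[votes >= 0]
        if votes.size:
            fused[i] = np.bincount(votes, minlength=num_classes).argmax()
    return fused

print(fuse_labels(np.array([[2, 1], [2, -1], [0, 1]]), num_classes=3))  # [2 1]
```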