SLAM3R introduces a two-hierarchy neural network framework for real-time dense 3D scene reconstruction from monocular RGB videos, directly regressing 3D pointmaps and progressively aligning local reconstructions. The system achieves an average accuracy of 2.13cm and completeness of 2.34cm on the 7 Scenes dataset, operating at 20+ FPS on an NVIDIA 4090D GPU.
921
Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies." 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found in this https URL
41
GSFixer enhances 3D Gaussian Splatting reconstruction quality from sparse input views by introducing a reference-guided video diffusion model. This method leverages both 2D semantic and 3D geometric features to achieve superior visual fidelity and 3D consistency, quantitatively improving PSNR by 2.16 dB on artifact restoration and 3.55 dB for 3-view input over prior generative approaches.
47
Reloc3r presents a visual localization framework leveraging a large-scale trained relative camera pose regression network built on a Vision Transformer backbone and a minimalist motion averaging module. This approach achieves state-of-the-art accuracy, exhibits robust generalization across diverse unseen scenes, and maintains real-time inference speeds.
245
Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the results have not yet achieved satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed for high-precision and high-fidelity face editing for pre-trained diffusion models. We observe that both face reenactment/swapping tasks essentially involve combinations of target structure, ID and attribute. We aim to sufficiently decouple the control of these factors to achieve both tasks in one model. Specifically, our method contains: 1) A Spatial Condition Generator that provides precise landmarks and background; 2) A Plug-and-play Identity Encoder that transfers face embeddings to the text space by a transformer decoder. 3) An Attribute Controller that integrates spatial conditions and detailed attributes. Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality compared to fully fine-tuned face reenactment/swapping models. Additionally, Face-Adapter seamlessly integrates with various StableDiffusion models.
395
·
The RoboTwin Dual-Arm Collaboration Challenge established a comprehensive benchmark for generalizable bimanual manipulation, bridging simulation and real-world environments through progressively difficult tasks. It revealed that while advanced simulation techniques can lead to high performance, significant challenges remain in real-world deformable object handling and sim-to-real transfer, with top teams achieving only 26.4% success in the final complex real-world tasks.
1,560
DescriptiveEdit, a framework developed by researchers from Nanjing University and vivo, re-frames semantic image editing as a description-driven generation task, leveraging pre-trained Text-to-Image models. It achieves superior image consistency and quality, demonstrated by the lowest L1 (0.065), L2 (0.011), and LPIPS (0.139) scores, and highest PSNR (20.99) on the Emu Edit benchmark, while maintaining strong instruction adherence and broad compatibility with existing T2I ecosystem tools.
11
Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, We introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) in video face swapping, both qualitatively and quantitatively.
FeRA, developed by researchers at the National University of Singapore and collaborating institutions, introduces a frequency-energy constrained routing framework for parameter-efficient fine-tuning of diffusion models. This approach consistently achieves state-of-the-art performance in tasks like style adaptation and image customization, frequently outperforming full fine-tuning while utilizing significantly fewer trainable parameters.
Implicit Neural Representation (INR), which utilizes a neural network to map coordinate inputs to corresponding attributes, is causing a revolution in the field of signal processing. However, current INR techniques suffer from the "frequency"-specified spectral bias and capacity-convergence gap, resulting in imperfect performance when representing complex signals with multiple "frequencies". We have identified that both of these two characteristics could be handled by increasing the utilization of definition domain in current activation functions, for which we propose the FINER++ framework by extending existing periodic/non-periodic activation functions to variable-periodic ones. By initializing the bias of the neural network with different ranges, sub-functions with various frequencies in the variable-periodic function are selected for activation. Consequently, the supported frequency set can be flexibly tuned, leading to improved performance in signal representation. We demonstrate the generalization and capabilities of FINER++ with different activation function backbones (Sine, Gauss. and Wavelet) and various tasks (2D image fitting, 3D signed distance field representation, 5D neural radiance fields optimization and streamable INR transmission), and we show that it improves existing INRs. Project page: {https://liuzhen0212.github.io/finerpp/}
Recovering the intrinsic physical attributes of a scene from images, generally termed as the inverse rendering problem, has been a central and challenging task in computer vision and computer graphics. In this paper, we present GUS-IR, a novel framework designed to address the inverse rendering problem for complicated scenes featuring rough and glossy surfaces. This paper starts by analyzing and comparing two prominent shading techniques popularly used for inverse rendering, forward shading and deferred shading, effectiveness in handling complex materials. More importantly, we propose a unified shading solution that combines the advantages of both techniques for better decomposition. In addition, we analyze the normal modeling in 3D Gaussian Splatting (3DGS) and utilize the shortest axis as normal for each particle in GUS-IR, along with a depth-related regularization, resulting in improved geometric representation and better shape reconstruction. Furthermore, we enhance the probe-based baking scheme proposed by GS-IR to achieve more accurate ambient occlusion modeling to better handle indirect illumination. Extensive experiments have demonstrated the superior performance of GUS-IR in achieving precise intrinsic decomposition and geometric representation, supporting many downstream tasks (such as relighting, retouching) in computer vision, graphics, and extended reality.
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.
Blind face restoration (BFR) is a fundamental and challenging problem in computer vision. To faithfully restore high-quality (HQ) photos from poor-quality ones, recent research endeavors predominantly rely on facial image priors from the powerful pretrained text-to-image (T2I) diffusion models. However, such priors often lead to the incorrect generation of non-facial features and insufficient facial details, thus rendering them less practical for real-world applications. In this paper, we propose a novel framework, namely AuthFace that achieves highly authentic face restoration results by exploring a face-oriented generative diffusion prior. To learn such a prior, we first collect a dataset of 1.5K high-quality images, with resolutions exceeding 8K, captured by professional photographers. Based on the dataset, we then introduce a novel face-oriented restoration-tuning pipeline that fine-tunes a pretrained T2I model. Identifying key criteria of quality-first and photography-guided annotation, we involve the retouching and reviewing process under the guidance of photographers for high-quality images that show rich facial features. The photography-guided annotation system fully explores the potential of these high-quality photographic images. In this way, the potent natural image priors from pretrained T2I diffusion models can be subtly harnessed, specifically enhancing their capability in facial detail restoration. Moreover, to minimize artifacts in critical facial areas, such as eyes and mouth, we propose a time-aware latent facial feature loss to learn the authentic face restoration process. Extensive experiments on the synthetic and real-world BFR datasets demonstrate the superiority of our approach.
Visual prompt, a pair of before-and-after edited images, can convey indescribable imagery transformations and prosper in image editing. However, current visual prompt methods rely on a pretrained text-guided image-to-image generative model that requires a triplet of text, before, and after images for retraining over a text-to-image model. Such crafting triplets and retraining processes limit the scalability and generalization of editing. In this paper, we present a framework based on any single text-to-image model without reliance on the explicit image-to-image model thus enhancing the generalizability and scalability. Specifically, by leveraging the probability-flow ordinary equation, we construct a diffusion bridge to transfer the distribution between before-and-after images under the text guidance. By optimizing the text via the bridge, the framework adaptively textualizes the editing transformation conveyed by visual prompts into text embeddings without other models. Meanwhile, we introduce differential attention control during text optimization, which disentangles the text embedding from the invariance of the before-and-after images and makes it solely capture the delicate transformation and generalize to edit various images. Experiments on real images validate competitive results on the generalization, contextual coherence, and high fidelity for delicate editing with just one image pair as the visual prompt.
2
·
In recent years, instruction-based image editing methods have garnered significant attention in image editing. However, despite encompassing a wide range of editing priors, these methods are helpless when handling editing tasks that are challenging to accurately describe through language. We propose InstructBrush, an inversion method for instruction-based image editing methods to bridge this gap. It extracts editing effects from exemplar image pairs as editing instructions, which are further applied for image editing. Two key techniques are introduced into InstructBrush, Attention-based Instruction Optimization and Transformation-oriented Instruction Initialization, to address the limitations of the previous method in terms of inversion effects and instruction generalization. To explore the ability of instruction inversion methods to guide image editing in open scenarios, we establish a TransformationOriented Paired Benchmark (TOP-Bench), which contains a rich set of scenes and editing types. The creation of this benchmark paves the way for further exploration of instruction inversion. Quantitatively and qualitatively, our approach achieves superior performance in editing and is more semantically consistent with the target editing effects.
13
In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely, Adaptive Style Incorporation~(ASI), to achieve fine-grained feature-level style incorporation. It consists of the Siamese Cross-Attention~(SiCA) to decouple the single-track cross-attention to a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module to couple the content and style information from a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
Integrated Sensing and Communication (ISAC) has been identified as a key 6G application by ITU and 3GPP. Channel measurement and modeling is a prerequisite for ISAC system design and has attracted widespread attention from both academia and industry. 3GPP Release 19 initiated the ISAC channel study item in December 2023 and finalized its modeling specification in May 2025 after extensive technical discussions. However, a comprehensive survey that provides a systematic overview,from empirical channel features to modeling methodologies and standardized simulators,remains unavailable. In this paper, the key requirements and challenges in ISAC channel research are first analyzed, followed by a structured overview of the standardization workflow throughout the 3GPP Release 19 process. Then, critical aspects of ISAC channels, including physical objects, target channels, and background channels, are examined in depth, together with additional features such as spatial consistency, environment objects, Doppler characteristics, and shared clusters, supported by measurement-based analysis. To establish a unified ISAC channel modeling framework, an Extended Geometry-based Stochastic Model (E-GBSM) is proposed, incorporating all the aforementioned ISAC channel characteristics. Finally, a standardized simulator is developed based on E-GBSM, and a two-phase calibration procedure aligned with 3GPP Release 19 is conducted to validate both the model and the simulator, demonstrating close agreement with industrial reference results. Overall, this paper provides a systematic survey of 3GPP Release 19 ISAC channel standardization and offers insights into best practices for new feature characterization, unified modeling methodology, and standardized simulator implementation, which can effectively supporting ISAC technology evaluation and future 6G standardization.
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. In this paper, we present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion (SD) to address challenges such as misinterpreted styles and inconsistent semantics. Our approach introduces two innovative modules: the mixed style descriptor and the dynamic attention adapter. The mixed style descriptor enhances SD by combining content-aware and frequency-disentangled embeddings from CLIP with additional sources that capture global statistics and textual information, thus providing a richer blend of style-related and semantic-related knowledge. To achieve a better balance between adapter capacity and semantic control, the dynamic attention adapter is integrated into the diffusion UNet, dynamically calculating adaptation weights based on the style descriptors. Additionally, we introduce two objective functions to optimize the model alongside the denoising loss, further enhancing semantic and style consistency. Extensive experiments demonstrate the superiority of ArtWeaver over existing methods, producing images with diverse target styles while maintaining the semantic integrity of the text prompts.
Co-speech gesture generation is crucial for producing synchronized and realistic human gestures that accompany speech, enhancing the animation of lifelike avatars in virtual environments. While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To address these challenges, we present MambaGesture, a novel framework integrating a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD. The MambaAttn block combines the sequential data processing strengths of the Mamba model with the contextual richness of attention mechanisms, enhancing the temporal coherence of generated gestures. SEAD adeptly fuses audio, text, style, and emotion modalities, employing disentanglement to deepen the fusion process and yield gestures with greater realism and diversity. Our approach, rigorously evaluated on the multi-modal BEAT dataset, demonstrates significant improvements in Fr\'echet Gesture Distance (FGD), diversity scores, and beat alignment, achieving state-of-the-art performance in co-speech gesture generation. Project website: \href\href{https://fcchit.github.io/mambagesture/}{\textit{this https URL}}.
Existing view-based methods excel at recognizing 3D objects from predefined viewpoints, but their exploration of recognition under arbitrary views is limited. This is a challenging and realistic setting because each object has different viewpoint positions and quantities, and their poses are not aligned. However, most view-based methods, which aggregate multiple view features to obtain a global feature representation, hard to address 3D object recognition under arbitrary views. Due to the unaligned inputs from arbitrary views, it is challenging to robustly aggregate features, leading to performance degradation. In this paper, we introduce a novel Part-aware Network (PANet), which is a part-based representation, to address these issues. This part-based representation aims to localize and understand different parts of 3D objects, such as airplane wings and tails. It has properties such as viewpoint invariance and rotation robustness, which give it an advantage in addressing the 3D object recognition problem under arbitrary views. Our results on benchmark datasets clearly demonstrate that our proposed method outperforms existing view-based aggregation baselines for the task of 3D object recognition under arbitrary views, even surpassing most fixed viewpoint methods.
There are no more papers matching your filters at the moment.