MPI for Informatics · Saarland Informatics Campus
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
24 Mar 2025

ETH Zürich and Tsinghua University researchers introduce OpenFunGraph, a framework that constructs functional 3D scene graphs by identifying interactive elements and their relationships in indoor spaces through foundation models, enabling richer scene understanding without requiring task-specific training data.
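As a rough illustration of what such a functional scene graph can look like as a data structure, here is a minimal sketch; the class and field names are assumptions for illustration, not the paper's actual representation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneNode:
    """An object or interactive element (e.g. a cabinet, a handle, a light switch)."""
    node_id: int
    label: str                               # open-vocabulary category name
    center: Tuple[float, float, float]       # 3D position in the scene

@dataclass
class FunctionalEdge:
    """A functional relationship, e.g. 'handle opens drawer' or 'switch controls lamp'."""
    source_id: int
    target_id: int
    relation: str

@dataclass
class FunctionalSceneGraph:
    nodes: List[SceneNode] = field(default_factory=list)
    edges: List[FunctionalEdge] = field(default_factory=list)

    def relations_of(self, node_id: int) -> List[FunctionalEdge]:
        return [e for e in self.edges
                if e.source_id == node_id or e.target_id == node_id]
```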

Quantum-enhanced Computer Vision: Going Beyond Classical Algorithms

This survey comprehensively introduces and reviews Quantum-enhanced Computer Vision (QeCV), an emerging interdisciplinary field that applies quantum computational paradigms to challenges in classical computer vision. It categorizes existing methods into quantum-annealing and gate-based quantum computing approaches, detailing their application to a wide array of vision tasks.

DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling

Researchers from DFKI, MPI for Informatics, and Snap Inc. developed DuetGen, a framework that generates synchronized and interactive two-person dance choreography directly from music. It employs a unified two-person motion representation, a hierarchical VQ-VAE for motion quantization, and two-stage masked transformers, achieving state-of-the-art realism and partner coordination on the DD100 dataset, with a 22% improvement in user study ratings over prior approaches.
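To make the motion-quantization step concrete, here is a generic sketch of the nearest-neighbor codebook lookup used in VQ-VAEs; it is not DuetGen's exact hierarchical formulation, and all names are illustrative.

```python
import torch

def vector_quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """Map each continuous latent to its nearest codebook entry (generic VQ-VAE step).

    latents:  (N, D) continuous motion features from the encoder
    codebook: (K, D) learnable code vectors
    Returns the quantized latents and the chosen code indices.
    """
    # Euclidean distance between every latent and every code vector.
    distances = torch.cdist(latents, codebook, p=2)
    indices = distances.argmin(dim=1)
    quantized = codebook[indices]
    # Straight-through estimator: gradients flow to the encoder as if the
    # quantization step were the identity.
    quantized = latents + (quantized - latents).detach()
    return quantized, indices
```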

InfiniHuman: Infinite 3D Human Creation with Precise Control
Generating realistic and controllable 3D human avatars is a long-standing challenge, particularly when covering broad attribute ranges such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in scale and diversity. The central question we address in this paper is: Can existing foundation models be distilled to generate theoretically unbounded, richly annotated 3D human data? We introduce InfiniHuman, a framework that synergistically distills these models to produce richly annotated human data at minimal cost and with theoretically unlimited scalability. We propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. A user study shows that our automatically generated identities are indistinguishable from scan renderings. InfiniHumanData contains 111K identities spanning unprecedented diversity. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body-shape parameters. Building on this dataset, we propose InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate significant improvements over state-of-the-art methods in visual quality, generation speed, and controllability. Our approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution. We will publicly release the automatic data generation pipeline, the comprehensive InfiniHumanData dataset, and the InfiniHumanGen models at this https URL.
Continual Test-Time Domain Adaptation
25 Mar 2022

Researchers developed Continual Test-Time Adaptation (CoTTA), a method that enables deep learning models to continually adapt to evolving data distributions at inference time without requiring access to source data. CoTTA achieved a 16.2% error rate on CIFAR10C (standard setting) and improved semantic segmentation performance to 58.6% mIoU on Cityscapes-to-ACDC, outperforming prior test-time adaptation methods by mitigating error accumulation and catastrophic forgetting.
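For intuition, a minimal sketch of a CoTTA-style test-time adaptation step is shown below, assuming a student model, a frozen copy of the source weights, and an EMA teacher; the loss, hyperparameters, and function names are illustrative simplifications (the paper additionally averages teacher predictions over augmentations).

```python
import torch
import torch.nn.functional as F

def cotta_step(student, teacher, source_state, x, optimizer,
               ema_momentum=0.999, restore_prob=0.01):
    # Teacher pseudo-labels on the incoming test batch.
    with torch.no_grad():
        pseudo = teacher(x).softmax(dim=1)

    # Update the student toward the teacher's prediction.
    loss = F.cross_entropy(student(x), pseudo.argmax(dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        # Weight-averaged (EMA) teacher update.
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_momentum).add_(s_p, alpha=1.0 - ema_momentum)
        # Stochastic restoration: a small fraction of student weights is reset
        # to the frozen source model to limit error accumulation and forgetting.
        for name, p in student.named_parameters():
            mask = (torch.rand_like(p) < restore_prob).float()
            p.copy_(mask * source_state[name].to(p.device) + (1.0 - mask) * p)
    return loss.item()
```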

PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image

Researchers from the University of Tübingen developed PhySIC, a framework reconstructing metric-scale 3D human and scene geometries, alongside dense vertex-level contact maps, from a single monocular image. The method reduces human pose error (PA-MPJPE) to 41.99mm on PROX and 46.50mm on RICH-100, while achieving contact F1-scores of 0.511 and 0.428 respectively, demonstrating robust physical plausibility and efficient reconstruction.

FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications but presents significant challenges such as heavy occlusions and limited annotated real-world data. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings, particularly for lower limbs. Our work addresses these limitations by introducing a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking. Using this setup, we collected the most extensive real-world dataset to date for ego-facing, ego-mounted cameras in terms of size and motion variability. Effectively integrating this multimodal input -- device pose and camera feeds -- is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration and can run at 300 FPS on modern hardware. Lastly, we showcase a novel training strategy to enhance the model's generalization capabilities. Our approach exploits the problem's geometric properties, yielding high-quality motion capture free from artifacts common in prior work. Qualitative and quantitative evaluations, along with extensive comparisons, demonstrate the effectiveness of our method. Data, code, and CAD designs will be available at this https URL.
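Purely as an illustration of fusing the two input modalities mentioned above, the following generic sketch combines image features with the device's 6D pose to regress body joints; the architecture, dimensions, and names are assumptions and do not reflect FRAME's actual design.

```python
import torch
import torch.nn as nn

class PoseFusionHead(nn.Module):
    """Generic fusion of per-frame image features with the headset's 6D pose."""
    def __init__(self, img_dim=512, pose_dim=6, hidden=256, num_joints=22):
        super().__init__()
        self.num_joints = num_joints
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, 64), nn.ReLU())
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + 64, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),        # 3D joint positions
        )

    def forward(self, img_feat: torch.Tensor, device_pose: torch.Tensor):
        # img_feat:    (B, img_dim) features from the body-facing stereo cameras
        # device_pose: (B, pose_dim) on-board 6D pose of the headset
        fused = torch.cat([img_feat, self.pose_mlp(device_pose)], dim=-1)
        return self.fuse(fused).view(-1, self.num_joints, 3)
```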
End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

Researchers at ETH Zürich developed "Roach," a high-performance reinforcement learning expert, to act as a coach for end-to-end imitation learning agents in urban driving simulations. This approach enables a camera-based agent to achieve state-of-the-art performance, with up to an 88% driving score in new towns and weather conditions, by learning from Roach's diverse supervision signals including action distributions, latent features, and value estimations.
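A hedged sketch of how the three supervision signals could be combined into a single distillation loss is shown below; the loss weights and the interface are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def coach_distillation_loss(student_out, coach_out,
                            w_action=1.0, w_feat=0.05, w_value=0.001):
    """student_out / coach_out are dicts with (illustrative keys):
      'action_dist' : torch.distributions.Distribution over driving actions
      'latent'      : feature embedding tensor
      'value'       : value estimate tensor
    """
    # Match the coach's full action distribution, not just its mean action.
    action_loss = torch.distributions.kl_divergence(
        coach_out["action_dist"], student_out["action_dist"]).mean()
    # Imitate the coach's latent features and value estimates as auxiliary targets.
    feat_loss = F.mse_loss(student_out["latent"], coach_out["latent"])
    value_loss = F.mse_loss(student_out["value"], coach_out["value"])
    return w_action * action_loss + w_feat * feat_loss + w_value * value_loss
```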

Advances in Neural Rendering

This state-of-the-art report surveys advanced neural rendering methods, focusing on the paradigm shift towards learning 3D-consistent neural scene representations. It details how these methods, especially those influenced by Neural Radiance Fields, enable high-quality novel-view synthesis and controllable scene manipulation.

DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and newly reveal the potential of Transformers for UDA semantic segmentation. Based on the findings, we propose a novel UDA method, DAFormer. The network architecture of DAFormer consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting to the source domain: While (1) Rare Class Sampling on the source domain improves the quality of the pseudo-labels by mitigating the confirmation bias of self-training toward common classes, (2) a Thing-Class ImageNet Feature Distance and (3) a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer represents a major advance in UDA. It improves the state of the art by 10.8 mIoU for GTA-to-Cityscapes and 5.4 mIoU for Synthia-to-Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/lhoyer/DAFormer.
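To illustrate the Rare Class Sampling idea, here is a minimal sketch that draws rare classes with higher probability and then picks a source image containing the drawn class; the helper names and the temperature value are assumptions.

```python
import numpy as np

def make_rare_class_sampler(image_class_lists, class_freq, temperature=0.01):
    """Sketch of Rare Class Sampling: classes are drawn with probability
    proportional to exp((1 - frequency) / T), so rare classes are sampled more
    often; then a source image containing the drawn class is picked uniformly.

    image_class_lists: per-image collections of classes present in that image
    class_freq:        dict mapping class -> pixel frequency in [0, 1]
    """
    classes = sorted(class_freq)
    freqs = np.array([class_freq[c] for c in classes], dtype=np.float64)
    class_p = np.exp((1.0 - freqs) / temperature)
    class_p /= class_p.sum()

    rng = np.random.default_rng()

    def sample_image_index():
        c = rng.choice(classes, p=class_p)
        candidates = [i for i, present in enumerate(image_class_lists) if c in present]
        return int(rng.choice(candidates))

    return sample_image_index
```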
Audio-Driven Universal Gaussian Head Avatars
We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes both geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject's global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth-interior appearance and motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that accounts for detailed appearance modeling and rendering, but also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.
Language-guided 3D scene synthesis for fine-grained functionality understanding

SynthFun3D is a training-free framework for language-guided 3D scene synthesis that generates functionally rich indoor environments with automatic, fine-grained part-level annotations. The framework addresses data scarcity for embodied AI, enabling a 2.12 mIoU improvement on the SceneFun3D benchmark when its synthetic data is used for pre-training.

Noise & pattern: identity-anchored Tikhonov regularization for robust structural anomaly detection
Anomaly detection plays a pivotal role in automated industrial inspection, aiming to identify subtle or rare defects in otherwise uniform visual patterns. As collecting representative examples of all possible anomalies is infeasible, we tackle structural anomaly detection using a self-supervised autoencoder that learns to repair corrupted inputs. To this end, we introduce a corruption model that injects artificial disruptions into training images to mimic structural defects. While reminiscent of denoising autoencoders, our approach differs in two key aspects. First, instead of unstructured i.i.d. noise, we apply structured, spatially coherent perturbations that make the task a hybrid of segmentation and inpainting. Second, and counterintuitively, we add and preserve Gaussian noise on top of the occlusions, which acts as a Tikhonov regularizer anchoring the Jacobian of the reconstruction function toward identity. This identity-anchored regularization stabilizes reconstruction and further improves both detection and segmentation accuracy. On the MVTec AD benchmark, our method achieves state-of-the-art results (I/P-AUROC: 99.9/99.4), supporting our theoretical framework and demonstrating its practical relevance for automatic inspection.
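A minimal sketch of the corruption-and-repair training step described above is shown below, assuming the "noise is added and preserved" reading, i.e. the target keeps the Gaussian noise while the occlusion is repaired; names, shapes, and the noise level are illustrative.

```python
import torch
import torch.nn.functional as F

def repair_training_step(autoencoder, x_clean, occluder, optimizer, noise_std=0.1):
    """One training step of the repair autoencoder: a structured occlusion is
    pasted into the image, Gaussian noise is added on top and kept in the
    target, pushing the reconstruction map toward the identity on unperturbed
    content (the Tikhonov-style anchoring)."""
    noise = noise_std * torch.randn_like(x_clean)

    x_occluded = x_clean.clone()
    mask = occluder > 0                     # spatially coherent corruption region
    x_occluded[mask] = occluder[mask]

    x_input = x_occluded + noise            # corrupted, noisy input
    target = x_clean + noise                # occlusion repaired, noise preserved

    loss = F.mse_loss(autoencoder(x_input), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```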
DODA: Data-oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation
Deep learning approaches achieve prominent success in 3D semantic segmentation. However, collecting densely annotated real-world 3D datasets is extremely time-consuming and expensive. Training models on synthetic data and generalizing on real-world scenarios becomes an appealing alternative, but unfortunately suffers from notorious domain shifts. In this work, we propose a Data-Oriented Domain Adaptation (DODA) framework to mitigate pattern and context gaps caused by different sensing mechanisms and layout placements across domains. Our DODA encompasses virtual scan simulation to imitate real-world point cloud patterns and tail-aware cuboid mixing to alleviate the interior context gap with a cuboid-based intermediate domain. The first unsupervised sim-to-real adaptation benchmark on 3D indoor semantic segmentation is also built on 3D-FRONT, ScanNet and S3DIS along with 7 popular Unsupervised Domain Adaptation (UDA) methods. Our DODA surpasses existing UDA approaches by over 13% on both 3D-FRONT -> ScanNet and 3D-FRONT -> S3DIS. Code is available at this https URL.
Decoupling Zero-Shot Semantic Segmentation

ZegFormer introduces a decoupled approach to zero-shot semantic segmentation by separating class-agnostic grouping from segment-level classification, effectively integrating pre-trained vision-language models (VLMs) like CLIP. This method achieves state-of-the-art performance with a 73.3% harmonic mean mIoU on PASCAL VOC and 34.8% on COCO-Stuff, while also demonstrating robustness on the challenging ADE20k-Full benchmark with 275 unseen classes.
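As a rough sketch of the decoupled pipeline, the snippet below classifies class-agnostic segments with a pre-trained CLIP model; the mask-proposal stage is assumed to exist separately, and the prompt wording and model choice are illustrative.

```python
import numpy as np
import torch
import clip                                  # OpenAI CLIP, used here for illustration
from PIL import Image

@torch.no_grad()
def classify_segments(image: Image.Image, masks, class_names, device="cuda"):
    """Zero-shot, segment-level classification of class-agnostic masks."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    labels = []
    for mask in masks:                       # mask: boolean numpy array of shape (H, W)
        pil_mask = Image.fromarray(mask.astype(np.uint8) * 255)
        crop = Image.composite(image, Image.new("RGB", image.size), pil_mask)
        img = preprocess(crop).unsqueeze(0).to(device)
        img_feat = model.encode_image(img)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        labels.append(class_names[int((img_feat @ text_feat.T).argmax())])
    return labels
```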

PoseTrack: A Benchmark for Human Pose Estimation and Tracking

PoseTrack introduces a large-scale benchmark for multi-person articulated pose estimation and tracking in unconstrained videos, providing 66,374 frames and 153,615 dense pose annotations. The benchmark reveals that top-performing methods achieve approximately 50% MOTA for tracking and 70% mAP for pose estimation, while highlighting challenges in dynamic, crowded scenes.

Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
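The following sketch illustrates how the two signals could be computed and aggregated; it approximates the feature-based sensitivity with output-distribution changes rather than internal features, and the model interface, shapes, and weighting are assumptions.

```python
import torch

@torch.no_grad()
def sentence_reliability(model, video, video_masked, video_altered, tokens):
    """Token-level reliability, aggregated to a sentence score (illustrative).
    `model(video, tokens)` is assumed to return per-token logits of shape (T, V)."""
    p_clean = model(video, tokens).log_softmax(-1)
    p_masked = model(video_masked, tokens).log_softmax(-1)

    # Sensitivity: how much the token distributions shift when video is masked
    # (the paper measures internal feature changes; output KL is a stand-in).
    sensitivity = (p_clean.exp() * (p_clean - p_masked)).sum(-1)

    # Counterfactual signal: probability drop of the generated tokens when the
    # clean video is replaced by an altered one.
    p_alt = model(video_altered, tokens).log_softmax(-1)
    tok_clean = p_clean.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).exp()
    tok_alt = p_alt.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).exp()
    counterfactual = tok_clean - tok_alt

    # Simple equal-weight aggregation into one sentence-level score.
    token_reliability = 0.5 * sensitivity + 0.5 * counterfactual
    return token_reliability.mean().item()
```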
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Providing explanations in the context of Visual Question Answering (VQA) presents a fundamental problem in machine learning. To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations. For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple structured textual explanations which are derived from the original scene graphs. By construction, the CLEVR-X explanations are correct and describe the reasoning and visual information that is necessary to answer a given question. We conducted a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant. We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation generation quality for different question and answer types. Additionally, we study the influence of using different numbers of ground-truth explanations on the convergence of natural language generation (NLG) metrics. The CLEVR-X dataset is publicly available at this https URL.
Logical Fallacy Detection

Researchers from Max Planck Institute and ETH Zürich introduce the task of logical fallacy detection in natural language, constructing two novel datasets and proposing a structure-aware classification model. Their model achieved a 5.46% F1 score improvement over the best baseline on the LOGIC dataset and maintained a 5.66% F1 improvement on the challenging LOGICCLIMATE dataset for climate change claims, indicating enhanced logical reasoning.

ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models
14 Dec 2018
Machine learning (ML) has become a core component of many real-world applications and training data is a key factor that drives current progress. This huge success has led Internet companies to deploy machine learning as a service (MLaaS). Recently, the first membership inference attack has shown that extraction of information on the training set is possible in such MLaaS settings, which has severe security and privacy implications. However, the early demonstrations of the feasibility of such attacks rely on many assumptions about the adversary, such as using multiple so-called shadow models, knowledge of the target model structure, and having a dataset from the same distribution as the target model's training data. We relax all these key assumptions, showing that such attacks are very broadly applicable at low cost and thereby pose a more severe risk than previously thought. We present the most comprehensive study so far on this emerging and developing threat, using eight diverse datasets that show the viability of the proposed attacks across domains. In addition, we propose the first effective defense mechanisms against this broader class of membership inference attacks that maintain a high level of utility of the ML model.
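As a minimal illustration of the most relaxed attack setting, the sketch below predicts membership by thresholding the target model's confidence on each record; the threshold value is an assumption, not taken from the paper.

```python
import numpy as np

def membership_inference_threshold(posteriors, threshold=0.9):
    """Simple, model- and data-independent membership inference sketch:
    a record is predicted to be a training-set member if the target model
    is very confident on it.

    posteriors: array of shape (n_records, n_classes) returned by the target model
    Returns a boolean array, True = predicted member.
    """
    posteriors = np.asarray(posteriors)
    max_confidence = posteriors.max(axis=1)
    return max_confidence >= threshold
```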