Researchers at the University of Bonn present a comprehensive survey evaluating geometric, neural (NeRF, 3DGS), and foundation-model-based 3D scene representations for robotics, assessing their performance across perception, mapping, localization, manipulation, and navigation modules. The analysis outlines the strengths and limitations of each representation type and points to 3D Foundation Models as a promising future direction for unified robotic intelligence.
Register Any Point (RAP) presents a scalable, single-stage framework for multi-view 3D point cloud registration by reframing the task as conditional flow matching with explicit rigidity enforcement. The model achieves state-of-the-art performance, demonstrating superior robustness in low-overlap scenarios and strong zero-shot generalization across 17 diverse datasets, including previously unseen outdoor LiDAR and Terrestrial Laser Scanner data.
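RAP is summarized as reframing registration as conditional flow matching. The exact conditioning and rigidity enforcement are specific to the paper, but the underlying flow-matching objective is standard; below is a minimal, hedged PyTorch sketch of a conditional flow-matching training step on pose-like targets. The `PoseVelocityNet` module, its dimensions, and the conditioning on point-cloud features are illustrative assumptions, not RAP's actual architecture.

```python
# Minimal sketch of a conditional flow-matching training step (illustrative only;
# PoseVelocityNet and the feature conditioning are assumptions, not RAP's design).
import torch
import torch.nn as nn

class PoseVelocityNet(nn.Module):
    """Predicts a velocity field v(x_t, t | cond) for pose-like targets."""
    def __init__(self, pose_dim=6, cond_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: target pose parameters (B, 6); cond: per-view features (B, 128)."""
    x0 = torch.randn_like(x1)                    # base sample (noise)
    t = torch.rand(x1.shape[0], 1)               # time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                # linear interpolation path
    target_v = x1 - x0                           # constant target velocity
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

model = PoseVelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x1 = torch.randn(32, 6)        # dummy ground-truth pose parameters
cond = torch.randn(32, 128)    # dummy point-cloud features
loss = flow_matching_loss(model, x1, cond)
loss.backward()
opt.step()
```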
Mercedes-Benz AG and University of Tübingen researchers developed AGO, a framework for adaptive grounding in open-world 3D occupancy prediction, enabling recognition of both known and novel objects from multi-camera inputs. The model achieved a state-of-the-art 19.23 mIoU on the Occ3D-nuScenes benchmark, showing robust generalization to unknown categories and outperforming previous methods by 4.09 mIoU.
This collaborative tutorial from 18 leading researchers provides a comprehensive, unifying summary of knowledge graphs, consolidating fragmented knowledge from diverse fields. It defines core concepts, surveys techniques across data models, knowledge representation, and AI, and outlines lifecycle management and governance for knowledge graphs.
The NAVIS model, developed at Ben-Gurion University of the Negev and the University of Bonn, resolves the issue where simple heuristics often surpass sophisticated Temporal Graph Neural Networks in node affinity prediction. It consistently outperforms state-of-the-art TGNNs and heuristics across various temporal graph datasets, including notable improvements such as +12.8% in NDCG@10 over TGNv2 on the `tgbn-trade` dataset.
In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Moreover, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired work that leverages these models to enhance adaptation and generalization performance or adapts them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at this https URL.
EgoControl introduces a method for generating future egocentric video frames, utilizing 3D full-body pose sequences as explicit control. The model significantly improves visual fidelity, achieving an FVD of 20.18, and demonstrates enhanced accuracy in reproducing articulated body movements with a mean Intersection-over-Union (mIoU) of 52.13 for arm masks.
Dynamic scene reconstruction from multi-view videos remains a fundamental challenge in computer vision. While recent neural surface reconstruction methods have achieved remarkable results in static 3D reconstruction, extending these approaches to dynamic scenes with comparable quality introduces significant computational and representational challenges. Existing dynamic methods focus on novel-view synthesis; their extracted meshes therefore tend to be noisy. Even approaches aiming for geometric fidelity often produce overly smooth meshes due to the ill-posedness of the problem. We present a novel framework for highly detailed dynamic reconstruction that extends the static 3D reconstruction method NeuralAngelo to dynamic settings. To that end, we start with a high-quality template scene reconstruction from the initial frame using NeuralAngelo, and then jointly optimize deformation fields that track the template and refine it based on the temporal sequence. This flexible template allows updating the geometry to include changes that cannot be modeled with the deformation field, for instance occluded parts or changes in topology. We show superior reconstruction accuracy in comparison to previous state-of-the-art methods on the ActorsHQ dataset.
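The abstract describes jointly optimizing a deformation field that warps observations into a canonical template while also refining the template itself. The hedged PyTorch sketch below illustrates only that optimization pattern on surface point samples; the MLP sizes, the surface loss, and the per-frame sampling are illustrative assumptions, not the paper's actual NeuralAngelo-based pipeline or its rendering losses.

```python
# Hedged sketch: jointly optimize a canonical SDF template and a time-conditioned
# deformation field so that observed surface points map onto the zero level set.
# (Illustrative only; the real method builds on NeuralAngelo with rendering losses.)
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Softplus(),
                         nn.Linear(hidden, hidden), nn.Softplus(),
                         nn.Linear(hidden, out_dim))

template_sdf = mlp(3, 1)          # canonical signed distance field
deform_field = mlp(3 + 1, 3)      # (x, t) -> offset into canonical space

params = list(template_sdf.parameters()) + list(deform_field.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def surface_loss(points_t, t):
    """points_t: (N, 3) surface samples at time t; drive their SDF toward zero."""
    t_col = torch.full((points_t.shape[0], 1), t)
    canonical = points_t + deform_field(torch.cat([points_t, t_col], dim=-1))
    return template_sdf(canonical).abs().mean()

# Dummy per-frame surface observations (e.g., from depth or multi-view stereo).
frames = [torch.randn(1024, 3) for _ in range(4)]
for step in range(100):
    opt.zero_grad()
    loss = sum(surface_loss(p, t / len(frames)) for t, p in enumerate(frames))
    loss.backward()
    opt.step()
```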
The paper introduces the Grounding Everything Module (GEM), a training-free method that leverages inherent localization capabilities within pre-trained Vision-Language Models for zero-shot open-vocabulary object localization and semantic segmentation. GEM achieves state-of-the-art results among training-free approaches and is competitive with many fine-tuned methods across multiple datasets, including PascalVOC, PascalContext, ADE20K, and OpenImagesV7.
Robotics applications often rely on scene reconstructions to enable downstream tasks. In this work, we tackle the challenge of actively building an accurate map of an unknown scene using an RGB-D camera on a mobile platform. We propose a hybrid map representation that combines a Gaussian splatting map with a coarse voxel map, leveraging the strengths of both representations: the high-fidelity scene reconstruction capabilities of Gaussian splatting and the spatial modelling strengths of the voxel map. At the core of our framework is an effective confidence modelling technique for the Gaussian splatting map to identify under-reconstructed areas, while utilising spatial information from the voxel map to target unexplored areas and assist in collision-free path planning. By actively collecting scene information in under-reconstructed and unexplored areas for map updates, our approach achieves superior Gaussian splatting reconstruction results compared to state-of-the-art approaches. Additionally, we demonstrate the real-world applicability of our framework using an unmanned aerial vehicle.
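The core of this framework is choosing where to move next using both under-reconstructed Gaussians and unexplored voxels. The snippet below is a hedged NumPy sketch of that view-scoring idea only; the confidence model, the crude range-based visibility test, and the weighting between reconstruction and exploration terms are illustrative assumptions rather than the paper's actual planner.

```python
# Hedged sketch of the view-selection idea: score candidate viewpoints by how many
# low-confidence Gaussians and unknown voxels they would observe. The confidence
# model, the visibility test, and the weighting are illustrative assumptions.
import numpy as np

def visible(points, view_pos, max_range=5.0):
    """Crude visibility proxy: points within sensor range of the viewpoint."""
    return np.linalg.norm(points - view_pos, axis=1) < max_range

def view_utility(view_pos, gaussian_centers, gaussian_conf,
                 voxel_centers, voxel_unknown, w_recon=1.0, w_explore=0.5):
    g_vis = visible(gaussian_centers, view_pos)
    v_vis = visible(voxel_centers, view_pos)
    under_reconstructed = ((1.0 - gaussian_conf) * g_vis).sum()
    unexplored = (voxel_unknown & v_vis).sum()
    return w_recon * under_reconstructed + w_explore * unexplored

# Dummy map state.
gaussian_centers = np.random.rand(5000, 3) * 10
gaussian_conf = np.random.rand(5000)            # 1.0 = well reconstructed
voxel_centers = np.random.rand(2000, 3) * 10
voxel_unknown = np.random.rand(2000) > 0.7      # True = unobserved voxel

candidates = np.random.rand(50, 3) * 10
best_view = max(candidates, key=lambda v: view_utility(
    v, gaussian_centers, gaussian_conf, voxel_centers, voxel_unknown))
```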
Robust and accurate proprioceptive state estimation of the main body is crucial for legged robots to execute tasks in extreme environments where exteroceptive sensors, such as LiDARs and cameras, may become unreliable. In this paper, we propose DogLegs, a state estimation system for legged robots that fuses the measurements from a body-mounted inertial measurement unit (Body-IMU), joint encoders, and multiple leg-mounted IMUs (Leg-IMUs) using an extended Kalman filter (EKF). The filter system contains the error states of all IMU frames. The Leg-IMUs are used to detect foot contact, thereby providing zero-velocity measurements to update the state of the Leg-IMU frames. Additionally, we compute the relative position constraints between the Body-IMU and Leg-IMUs from the leg kinematics and use them to update the main body state and reduce the error drift of the individual IMU frames. Field experiments show that our proposed DogLegs system achieves better state estimation accuracy than the traditional leg odometry method (using only the Body-IMU and joint encoders) across various terrains. We make our datasets publicly available to benefit the research community (this https URL).
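The abstract describes applying zero-velocity updates in an EKF whenever a Leg-IMU detects foot contact. The snippet below is a generic, hedged sketch of such a measurement update on a small velocity-containing state; the state layout, noise values, and the toy contact test are illustrative assumptions, not the paper's actual filter design.

```python
# Hedged sketch of a zero-velocity EKF update for one leg IMU frame.
# State layout, noise levels, and the contact detector are illustrative assumptions.
import numpy as np

def zero_velocity_update(x, P, vel_idx, meas_noise=1e-3):
    """Apply z = 0 on the 3D velocity sub-state x[vel_idx] of state x."""
    n = x.shape[0]
    H = np.zeros((3, n))
    H[np.arange(3), vel_idx] = 1.0             # measurement selects the velocity
    R = np.eye(3) * meas_noise
    z = np.zeros(3)                            # zero-velocity pseudo-measurement
    y = z - H @ x                              # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(n) - K @ H) @ P
    return x_new, P_new

def foot_in_contact(leg_imu_accel, gravity=9.81, tol=0.5):
    """Toy contact test: specific-force magnitude close to gravity."""
    return abs(np.linalg.norm(leg_imu_accel) - gravity) < tol

# Example: 9-D state [position, velocity, orientation error] for one leg IMU.
x = np.random.randn(9) * 0.1
P = np.eye(9) * 0.01
if foot_in_contact(np.array([0.1, 0.0, 9.7])):
    x, P = zero_velocity_update(x, P, vel_idx=np.array([3, 4, 5]))
```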
Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries on a geometric prior to generalize 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at this http URL.
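To make the lifting step concrete, here is a hedged PyTorch sketch of a small head that regresses depth, size, and yaw from object query embeddings conditioned on an embedding of the camera intrinsics (the geometric prior). The dimensions, parameterization, and conditioning scheme are illustrative assumptions, not 3D-MOOD's actual architecture.

```python
# Hedged sketch of lifting 2D detection queries to 3D: a small head regresses depth,
# size, and yaw from each object query, conditioned on a camera-intrinsics embedding.
import torch
import torch.nn as nn

class Box3DHead(nn.Module):
    def __init__(self, query_dim=256, intr_dim=4, hidden=256):
        super().__init__()
        self.intr_embed = nn.Linear(intr_dim, query_dim)   # geometry prior (fx, fy, cx, cy)
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3 + 1),                  # depth, (w, h, l), yaw
        )

    def forward(self, queries, intrinsics):
        # queries: (B, Q, query_dim); intrinsics: (B, intr_dim)
        cond = self.intr_embed(intrinsics).unsqueeze(1)    # (B, 1, query_dim)
        out = self.mlp(queries + cond)
        depth = out[..., :1].exp()                         # keep depth positive
        size = out[..., 1:4].exp()
        yaw = out[..., 4:5]
        return depth, size, yaw

head = Box3DHead()
queries = torch.randn(2, 100, 256)
intrinsics = torch.tensor([[720.0, 720.0, 640.0, 360.0]] * 2) / 1000.0  # normalized
depth, size, yaw = head(queries, intrinsics)
```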
The paper introduces VELM, a two-stage pipeline that combines a fast vision-based anomaly detector with multi-modal large language models (MLLMs) to classify industrial anomalies into specific types. VELM achieved 81.4% accuracy on the MVTec-AD dataset, surpassing previous methods, and enabled nuanced distinction between critical defects and negligible anomalies, showing 89.8% accuracy in a simulated scenario. The work also provides new, refined benchmarks, MVTec-AC and VisA-AC, for multi-class anomaly classification.
SAVi-DNO enables pre-trained diffusion models for video prediction to continuously adapt to live video streams by optimizing diffusion noise during inference. This method improves prediction accuracy and video quality across diverse datasets and diffusion models, offering an efficient and privacy-preserving solution for real-world continuous adaptation.
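The summary's key idea is adapting to the stream by optimizing the diffusion noise rather than the model weights. Below is a hedged, toy sketch of that pattern: a frozen stand-in denoiser, with only the initial noise optimized to match the latest observed frame before reuse for prediction. The toy network and single-step "generation" are illustrative stand-ins, not SAVi-DNO's actual diffusion model or sampler.

```python
# Hedged sketch of test-time noise optimization: keep the pre-trained denoiser frozen
# and optimize only the initial noise so the generated output matches new observations.
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1))   # frozen stand-in model
for p in denoiser.parameters():
    p.requires_grad_(False)

observed = torch.rand(1, 3, 64, 64)                 # latest frame from the stream
noise = torch.randn(1, 3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([noise], lr=1e-2)

for step in range(50):                              # adapt online, weights untouched
    opt.zero_grad()
    generated = denoiser(noise)                     # toy one-step "generation"
    loss = ((generated - observed) ** 2).mean()
    loss.backward()
    opt.step()

prediction = denoiser(noise.detach())               # reuse adapted noise for prediction
```

Because only the noise tensor is updated, the pre-trained weights (and therefore any private training data they encode) stay untouched, which is the efficiency and privacy angle the summary highlights.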
STING-BEE is a vision-language model designed for real-world X-ray baggage security inspection, trained on STCray, the first multimodal X-ray dataset featuring detailed image-caption pairs and diverse threat concealment scenarios. The model achieves state-of-the-art performance across scene comprehension, visual question answering, and threat localization, demonstrating robust cross-domain generalization to varied X-ray scanners and complex, unseen threats.
The "Hoi!" dataset provides synchronized multimodal data, including RGB-D, force, and tactile feedback, for human and robotic manipulation of articulated objects across various viewpoints and real-world environments. This resource reveals limitations of current state-of-the-art methods in articulation, tactile, and visual force estimation, facilitating research into robust embodied AI.
Robots benefit from high-fidelity reconstructions of their environment, which should be geometrically accurate and photorealistic to support downstream tasks. While this can be achieved by building distance fields from range sensors and radiance fields from cameras, realising scalable incremental mapping of both fields consistently and at the same time with high quality is challenging. In this paper, we propose a novel map representation that unifies a continuous signed distance field and a Gaussian splatting radiance field within an elastic and compact point-based implicit neural map. By enforcing geometric consistency between these fields, we achieve mutual improvements by exploiting both modalities. We present a novel LiDAR-visual SLAM system called PINGS using the proposed map representation and evaluate it on several challenging large-scale datasets. Experimental results demonstrate that PINGS can incrementally build globally consistent distance and radiance fields encoded with a compact set of neural points. Compared to state-of-the-art methods, PINGS achieves superior photometric and geometric rendering at novel views by constraining the radiance field with the distance field. Furthermore, by utilizing dense photometric cues and multi-view consistency from the radiance field, PINGS produces more accurate distance fields, leading to improved odometry estimation and mesh reconstruction. We also provide an open-source implementation of PINGS at: this https URL.
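The abstract's central idea is one point-based neural map decoded into both a signed distance field and Gaussian splats, tied together by a geometric consistency term. The hedged PyTorch sketch below illustrates that shared-representation pattern only; the nearest-neighbour feature lookup, the two decoder heads, and the consistency loss that pulls Gaussian centers onto the zero level set are illustrative assumptions, not PINGS' actual design.

```python
# Hedged sketch: neural point features decoded by two heads (signed distance and
# Gaussian parameters), with a consistency term keeping Gaussian centers near the
# SDF zero level set. Illustrative only.
import torch
import torch.nn as nn

class NeuralPointMap(nn.Module):
    def __init__(self, n_points=4096, feat_dim=32):
        super().__init__()
        self.positions = nn.Parameter(torch.rand(n_points, 3) * 10)
        self.features = nn.Parameter(torch.randn(n_points, feat_dim) * 0.01)
        self.sdf_head = nn.Sequential(nn.Linear(feat_dim + 3, 64), nn.ReLU(),
                                      nn.Linear(64, 1))
        self.gauss_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 3 + 3 + 4))  # offset, scale, rot

    def interpolate(self, query):
        # Nearest-neighbour lookup (a crude stand-in for proper feature interpolation).
        idx = torch.cdist(query, self.positions).argmin(dim=1)
        return self.features[idx], self.positions[idx]

    def sdf(self, query):
        feat, anchor = self.interpolate(query)
        return self.sdf_head(torch.cat([feat, query - anchor], dim=-1))

    def gaussian_centers(self):
        out = self.gauss_head(self.features)
        return self.positions + out[:, :3]       # per-point Gaussian center offsets

pmap = NeuralPointMap()
centers = pmap.gaussian_centers()
consistency = pmap.sdf(centers).abs().mean()     # pull Gaussian centers onto the surface
consistency.backward()
```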
Researchers at the University of Bonn developed SOLD, the first object-centric model-based reinforcement learning algorithm that learns complex manipulation skills directly from raw pixel data. This approach integrates Slot Attention for structured scene representation with a Dreamer-like control framework, leading to superior performance, enhanced sample efficiency, and interpretable dynamics on relational manipulation tasks.
Researchers from Google Research and Technical University of Dortmund introduce Deep Modularity Networks (DMoN), an unsupervised Graph Neural Network method for graph clustering. DMoN consistently outperforms existing GNN-based clustering and pooling methods, achieving over 40% average improvement in conductance, modularity, and NMI across various real-world datasets, while also scaling effectively to large graphs.
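DMoN's published objective combines a soft modularity term with a collapse regularizer over soft cluster assignments. The hedged PyTorch sketch below computes that objective on a dense toy graph; the simple linear encoder, the dense adjacency, and the hyperparameters are illustrative assumptions (the actual model uses a GCN over node features and sparse operations).

```python
# Hedged sketch of a DMoN-style clustering objective: maximize soft modularity of
# cluster assignments while a collapse regularizer discourages trivial solutions.
import torch
import torch.nn as nn

def dmon_loss(assign_logits, adj):
    """assign_logits: (n, k) soft cluster logits; adj: (n, n) dense adjacency."""
    C = torch.softmax(assign_logits, dim=-1)              # soft assignments
    degrees = adj.sum(dim=1, keepdim=True)                # (n, 1)
    two_m = adj.sum()
    modularity = (torch.trace(C.T @ adj @ C)
                  - torch.trace(C.T @ degrees @ degrees.T @ C) / two_m) / two_m
    n, k = C.shape
    collapse = torch.norm(C.sum(dim=0)) / n * (k ** 0.5) - 1.0
    return -modularity + collapse

n, k, f = 200, 8, 16
adj = (torch.rand(n, n) < 0.05).float()
adj = ((adj + adj.T) > 0).float()                         # symmetrize the toy graph
features = torch.randn(n, f)
encoder = nn.Linear(f, k)                                  # stand-in for a GCN layer
opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)

for step in range(100):
    opt.zero_grad()
    loss = dmon_loss(encoder(features), adj)
    loss.backward()
    opt.step()
```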