Graz University of Technology
DATACOMP introduces a benchmark and the 12.8 billion image-text pair COMMONPOOL dataset to systematically evaluate multimodal dataset design. A CLIP model trained on the resulting DATACOMP-1B dataset achieved 79.2% zero-shot ImageNet accuracy, outperforming models trained on larger, unfiltered datasets.
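DATACOMP's strongest filtering baselines rank COMMONPOOL candidates by CLIP image-text alignment and keep only the best-aligned pairs. A minimal sketch of that idea, assuming precomputed image and caption embeddings from any pretrained CLIP model (the `keep_fraction` value is illustrative):

```python
import torch

def clip_score_filter(image_embs: torch.Tensor,
                      text_embs: torch.Tensor,
                      keep_fraction: float = 0.3) -> torch.Tensor:
    """Return indices of the image-text pairs whose alignment score
    lands in the top `keep_fraction` of the candidate pool."""
    # Cosine similarity between each image and its paired caption.
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    text_embs = torch.nn.functional.normalize(text_embs, dim=-1)
    scores = (image_embs * text_embs).sum(dim=-1)
    # Keep only the best-aligned fraction of the pool.
    k = max(1, int(len(scores) * keep_fraction))
    return scores.topk(k).indices
```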
TAMING3DGS introduces a budget-constrained optimization for 3D Gaussian Splatting, providing strict control over resource consumption and accelerating the training process. The method achieves 4-5x reductions in both model size and training time while maintaining or improving visual quality, making high-quality 3D scene reconstruction feasible on resource-constrained devices.
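The budget constraint can be pictured as score-ranked densification that never grows the model past a fixed primitive count. A minimal sketch under that reading; the score function and the clone-only handling are illustrative assumptions, not the paper's exact procedure:

```python
import torch

def budgeted_densify(positions: torch.Tensor, grads: torch.Tensor,
                     opacities: torch.Tensor, budget: int) -> torch.Tensor:
    """Densify at most (budget - current count) Gaussians, chosen by score."""
    room = budget - positions.shape[0]
    if room <= 0:
        return positions                  # budget reached: stop growing
    # Assumed score: view-space gradient magnitude weighted by opacity.
    scores = grads.norm(dim=-1) * opacities.squeeze(-1)
    top = scores.topk(min(room, positions.shape[0])).indices
    # Clone the selected Gaussians (splitting would be handled analogously,
    # as would the remaining per-Gaussian attributes).
    return torch.cat([positions, positions[top]], dim=0)
```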
STSBench, a new spatio-temporal scenario benchmark, assesses the holistic understanding of multi-modal large language models in autonomous driving. Evaluation on STSnu, an instantiation of the benchmark on the nuScenes dataset, reveals that current models significantly lack the spatio-temporal reasoning required for complex traffic dynamics.
Researchers at Graz University of Technology developed a novel dependence coefficient, "ψ", for categorical response variables and general covariates, ensuring invariance to category permutations and fully characterizing independence and functional dependence. Their method includes a statistically consistent estimator and an independence test with a pivotal chi-squared asymptotic distribution, applicable to high-dimensional data without resampling.
Researchers from Graz University of Technology, Complexity Science Hub Vienna, and ETH Zurich developed "model folding," a data-free and fine-tuning-free compression technique that reduces neural network size by merging structurally similar neurons across layers while preserving internal data statistics. This method consistently outperforms other data-free baselines and traditional pruning at high sparsity levels across CNNs and LLaMA-7B, achieving significant compression and efficiency without compromising performance.
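As a toy illustration of the merging step for two consecutive fully connected layers y = W2·act(W1·x): when two rows of W1 are structurally similar, their activations nearly coincide, so the rows can be averaged and the matching columns of W2 summed with little change to the output. This sketch deliberately omits model folding's data-free repair of internal statistics:

```python
import torch

def fold_neuron_pair(w1: torch.Tensor, w2: torch.Tensor, i: int, j: int):
    """Merge hidden neurons i and j (rows of w1, columns of w2).
    Assumes rows w1[i] and w1[j] are similar, e.g. found by clustering."""
    w1, w2 = w1.clone(), w2.clone()
    w1[i] = 0.5 * (w1[i] + w1[j])     # average the incoming weights
    w2[:, i] = w2[:, i] + w2[:, j]    # route both contributions through i
    keep = [k for k in range(w1.shape[0]) if k != j]
    return w1[keep], w2[:, keep]      # neuron j is removed entirely
```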
This paper introduces "StopThePop," a refined rendering pipeline for 3D Gaussian Splatting that eliminates the visual popping artifacts and view inconsistencies caused by approximate sorting. It achieves this with a novel hierarchical rasterization approach that maintains comparable image quality and near real-time performance: on average it is only 4% slower than the original 3DGS, and it runs up to 1.6x faster with opacity decay.
Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis between Gemini 1.5 and 2.0, Gemma 3 and Pixtral models in accident identification and descriptive capabilities without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi-object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 0.71 and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
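The "enhanced prompt" idea can be sketched by serializing detector output into the text fed to the MLLM. A minimal version using Ultralytics YOLO; the prompt template is an illustrative assumption, and the Deep SORT and SAM cues from the paper would be appended analogously:

```python
from ultralytics import YOLO  # pip install ultralytics

def build_enhanced_prompt(image_path: str) -> str:
    """Prepend structured YOLO detections to the accident question."""
    result = YOLO("yolov8n.pt")(image_path)[0]   # small pretrained detector
    lines = []
    for box in result.boxes:
        name = result.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        lines.append(f"- {name} (conf {float(box.conf):.2f}) "
                     f"at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
    detections = "\n".join(lines) if lines else "- none"
    # Illustrative template; the paper's exact wording may differ.
    return (f"Detected objects:\n{detections}\n\n"
            "Given the image and the detections above, does the scene show "
            "a traffic accident? Describe the vehicles involved and severity.")
```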
While most state-of-the-art instance segmentation methods produce binary segmentation masks, geographic and cartographic applications typically require precise vector polygons of extracted objects instead of rasterized output. This paper introduces PolyWorld, a neural network that directly extracts building vertices from an image and connects them correctly to create precise polygons. The model predicts the connection strength between each pair of vertices using a graph neural network and estimates the assignments by solving a differentiable optimal transport problem. Moreover, the vertex positions are optimized by minimizing a combined segmentation and polygonal angle difference loss. PolyWorld significantly outperforms the state of the art in building polygonization and achieves not only notable quantitative results, but also produces visually pleasing building polygons. Code and trained weights are publicly available at this https URL.
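The assignment step is commonly relaxed with Sinkhorn iterations, which make entropy-regularized optimal transport differentiable; a minimal log-domain sketch over a matrix of predicted vertex-connection scores:

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 50,
             eps: float = 0.1) -> torch.Tensor:
    """Relax an assignment over an (N, N) score matrix into an
    approximately doubly stochastic matrix, differentiably."""
    log_p = scores / eps                     # entropy regularization strength
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols
    return log_p.exp()
```

Gradients flow through the normalizations, so the connection scores can be trained end-to-end against the final polygon loss.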
A LoD of Gaussians introduces a unified training and rendering framework for ultra-large-scale 3D Gaussian Splatting scenes, leveraging external memory and a dynamic Level-of-Detail system. The method enables artifact-free reconstruction of city-scale environments on a single consumer-grade GPU while achieving higher quality and faster convergence than prior chunk-based approaches.
Few-shot semantic segmentation is vital for deep learning-based infrastructure inspection applications, where labeled training examples are scarce and expensive. Although existing deep learning frameworks perform well, their need for extensive labeled datasets and their inability to learn new defect categories from little data are problematic. We present our Enhanced Feature Pyramid Network (E-FPN) framework for few-shot semantic segmentation of culvert and sewer defect categories using a prototypical learning framework. Our approach has three main contributions: (1) an adaptive E-FPN encoder using InceptionSepConv blocks and depth-wise separable convolutions for efficient multi-scale feature extraction; (2) prototypical learning with masked average pooling for powerful prototype generation from small support examples; and (3) attention-based feature representation through global self-attention, local self-attention, and cross-attention. Comprehensive experimentation on challenging infrastructure inspection datasets shows that the method achieves excellent few-shot performance, with the best configuration, 8-way 5-shot training, reaching an 82.55% F1-score and 72.26% mIoU in 2-way classification testing. The self-attention method yielded the most significant performance improvements, providing gains of 2.57% F1-score and 2.9% mIoU over baselines. Our framework addresses the critical need to respond rapidly to new defect types in infrastructure inspection systems with limited new training data, leading to more efficient and economical maintenance plans for critical infrastructure systems.
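The prototype step has a compact generic form regardless of the encoder; a minimal sketch of masked average pooling over support features plus cosine-similarity scoring of query pixels:

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feats: torch.Tensor, mask: torch.Tensor):
    """feats: (B, C, H, W) support features; mask: (B, 1, h, w) binary
    support mask. Returns one (C,) prototype for the masked class."""
    mask = F.interpolate(mask, size=feats.shape[-2:], mode="nearest")
    # Average features over foreground pixels only.
    return (feats * mask).sum(dim=(0, 2, 3)) / mask.sum().clamp(min=1e-6)

def score_query(feats: torch.Tensor, proto: torch.Tensor):
    """Per-pixel cosine similarity of query features to the prototype."""
    return F.cosine_similarity(feats, proto[None, :, None, None], dim=1)
```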
This research introduces a framework that integrates formal methods with reinforcement learning through a reactive “shield” to ensure provable safety while optimizing performance. The approach successfully enforces temporal logic safety specifications across various environments and often accelerates learning convergence by preventing unsafe exploration.
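Conceptually, the shield sits between the agent and the environment and overrides any proposed action the safety monitor rejects. A minimal sketch, where `is_safe` is a hypothetical stand-in for the temporal-logic monitor synthesized by the formal-methods side:

```python
import random

class ShieldedAgent:
    """Execute only actions the shield certifies as safe."""

    def __init__(self, agent, is_safe, actions):
        self.agent = agent        # any RL policy with an .act(state) method
        self.is_safe = is_safe    # hypothetical monitor: (state, action) -> bool
        self.actions = actions    # the discrete action set

    def act(self, state):
        proposed = self.agent.act(state)
        if self.is_safe(state, proposed):
            return proposed
        # Override with an arbitrary certified-safe alternative.
        safe = [a for a in self.actions if self.is_safe(state, a)]
        return random.choice(safe) if safe else proposed
```

Because unsafe actions are never executed, exploration is confined to the safe region, which is also what often speeds up convergence.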
Medical image segmentation plays an important role in accurately identifying and isolating regions of interest within medical images. Generative approaches are particularly effective at modeling the statistical properties of segmentation masks that are closely related to the respective structures. In this work, we introduce FlowSDF, an image-guided conditional flow matching framework designed to represent the signed distance function (SDF) and, in turn, an implicit distribution of segmentation masks. The advantage of leveraging the SDF is that it distorts more naturally than binary masks do. By learning a vector field associated with the probability path of conditional SDF distributions, our framework enables accurate sampling of segmentation masks and the computation of relevant statistical measures. This probabilistic approach also facilitates the generation of uncertainty maps represented by the variance, thereby supporting enhanced robustness in prediction and further analysis. We qualitatively and quantitatively illustrate competitive performance of the proposed method on a public nuclei and gland segmentation data set, highlighting its utility in medical image segmentation applications.
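The training objective of conditional flow matching takes a simple generic form; a sketch with a straight-line probability path from noise to the target SDF, where the image-conditioned `v_net` interface is an assumption:

```python
import torch

def flow_matching_loss(v_net, sdf_target: torch.Tensor, image: torch.Tensor):
    """Regress v_net onto the velocity of a linear path from Gaussian
    noise to the ground-truth signed distance function."""
    x0 = torch.randn_like(sdf_target)              # noise endpoint
    t = torch.rand(sdf_target.shape[0], 1, 1, 1,
                   device=sdf_target.device)       # random time in [0, 1]
    x_t = (1 - t) * x0 + t * sdf_target            # point along the path
    target_v = sdf_target - x0                     # constant path velocity
    return ((v_net(x_t, t, image) - target_v) ** 2).mean()
```

Sampling then integrates the learned vector field from noise to a mask-encoding SDF; repeated sampling yields the variance-based uncertainty maps.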
We investigate the genus $g(n,m)$ of the Erdős-Rényi random graph $G(n,m)$, providing a thorough description of how this relates to the function $m=m(n)$, and finding that there is different behaviour depending on which `region' $m$ falls into. Results already exist for $m \le \frac{n}{2} + O(n^{2/3})$ and $m = \omega\left(n^{1+\frac{1}{j}}\right)$ for $j \in \mathbb{N}$, and so we focus on the intermediate cases. We establish that $g(n,m) = (1+o(1))\frac{m}{2}$ whp (with high probability) when $n \ll m = n^{1+o(1)}$, that $g(n,m) = (1+o(1))\mu(\lambda)m$ whp for a given function $\mu(\lambda)$ when $m \sim \lambda n$ for $\lambda > \frac{1}{2}$, and that $g(n,m) = (1+o(1))\frac{8s^{3}}{3n^{2}}$ whp when $m = \frac{n}{2} + s$ for $n^{2/3} \ll s \ll n$. We then also show that the genus of a fixed graph can increase dramatically if a small number of random edges are added. Given any connected graph with bounded maximum degree, we find that the addition of $\epsilon n$ edges will whp result in a graph with genus $\Omega(n)$, even when $\epsilon$ is an arbitrarily small constant! We thus call this the `fragile genus' property.
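For reference, the three new estimates can be collected into a single display (restating the results above, with the $(1+o(1))$ factor in every regime):

\[
g(n,m) = (1+o(1)) \cdot
\begin{cases}
\dfrac{8s^{3}}{3n^{2}} & \text{if } m = \frac{n}{2} + s,\; n^{2/3} \ll s \ll n,\\[6pt]
\mu(\lambda)\, m & \text{if } m \sim \lambda n,\; \lambda > \frac{1}{2},\\[6pt]
\dfrac{m}{2} & \text{if } n \ll m = n^{1+o(1)},
\end{cases}
\qquad \text{whp.}
\]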
The Efficient Motion Prediction (EMP) model by Graz University of Technology achieves accuracy on par with state-of-the-art transformer-based models on Argoverse 2 while training in under 13 hours on a single GPU and demonstrating significantly lower inference latency. This work provides a highly efficient alternative for motion forecasting, reducing computational requirements for both development and deployment.
It was recently shown that the loss function used for training physics-informed neural networks (PINNs) exhibits local minima at solutions corresponding to fixed points of dynamical systems. In the forward setting, where the PINN is trained to solve initial value problems, these local minima can interfere with training and potentially lead to physically incorrect solutions. Building on stability theory, this paper proposes a regularization scheme that penalizes solutions corresponding to unstable fixed points. Experimental results on four dynamical systems, including the Lotka-Volterra model and the van der Pol oscillator, show that our scheme helps avoid physically incorrect solutions and substantially improves the training success rate of PINNs.
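One plausible instantiation of such a penalty (an illustrative assumption, not necessarily the paper's exact scheme) adds loss mass whenever the predicted trajectory lingers near a known unstable fixed point, discouraging collapse onto spurious constant solutions:

```python
import torch

def unstable_fixed_point_penalty(u_pred: torch.Tensor,
                                 unstable_points: torch.Tensor,
                                 sigma: float = 0.1) -> torch.Tensor:
    """u_pred: (T, D) trajectory predicted by the PINN at collocation
    times; unstable_points: (K, D) unstable fixed points of the ODE."""
    d2 = torch.cdist(u_pred, unstable_points) ** 2   # (T, K) squared distances
    return torch.exp(-d2 / sigma**2).mean()          # peaks at the fixed points

# Illustrative use: total = residual_loss + lam * unstable_fixed_point_penalty(u, fps)
```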
MedShapeNet introduces a large-scale, community-formed dataset of over 100,000 3D medical shapes, derived from real patient imaging data, to bridge the gap between general 3D computer vision advancements and medical applications. The dataset provides standardized 3D anatomical models and surgical instruments, enabling the development and application of deep learning algorithms for tasks such as tumor classification, shape reconstruction, and extended reality medical applications.
JWST has identified a large population of faint, broad-line active galactic nuclei (AGN) in the early universe that are powered by black holes (BHs) that often appear overmassive relative to their host galaxies. In this study, we examine the relationship between BH mass and galaxy stellar mass, finding that it lies $3\sigma$ above the relationship measured for local broad-line AGN. We derive an intrinsic scatter in this relationship of $0.9$ dex, which does not vary over the redshift range of our sample. We also find that the $M_{\rm BH}/M_{\star}$ ratio increases by $2.3$ dex from $z = 3.5$ to $z = 6.5$ with a confidence level of $>3\sigma$. We attribute this trend to the increasing fraction of LRDs in our sample at $z > 4$, as their host masses are $\sim 1$ dex lower than those of the non-LRD AGN in our sample. These results support a picture in which the BHs powering JWST's broad-line AGN are genuinely overmassive and become increasingly so with redshift. We discuss the implications of our findings for early BH growth relative to that of the host galaxies and the constraints they place on BH seeding models.
Real-time visibility determination in expansive or dynamically changing environments has long posed a significant challenge in computer graphics. Existing techniques are computationally expensive and often applied as a precomputation step on a static scene. We present NeuralPVS, the first deep-learning approach to visibility computation that efficiently determines from-region visibility in a large scene, processing at approximately 100 Hz with less than 1% missing geometry. This is made possible by a neural network operating on a voxelized representation of the scene. The network's performance is achieved by combining sparse convolution with a 3D volume-preserving interleaving for data compression. Moreover, we introduce a novel repulsive visibility loss that effectively guides the network to converge to the correct data distribution, providing enhanced robustness and generalization to unseen scenes. Our results demonstrate that NeuralPVS outperforms existing methods in terms of both accuracy and efficiency, making it a promising solution for real-time visibility computation.
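The volume-preserving interleaving is essentially a 3D space-to-depth reshuffle: resolution drops by a factor r along each axis while every voxel value moves into the channel dimension, so nothing is discarded. A sketch:

```python
import torch

def interleave_3d(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """(B, C, D, H, W) -> (B, C*r**3, D//r, H//r, W//r), volume-preserving."""
    b, c, d, h, w = x.shape
    x = x.reshape(b, c, d // r, r, h // r, r, w // r, r)
    x = x.permute(0, 1, 3, 5, 7, 2, 4, 6)   # collect the r**3 sub-offsets
    return x.reshape(b, c * r**3, d // r, h // r, w // r)
```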
In this article we propose a novel method for sampling from Gibbs distributions of the form $\pi(x) \propto \exp(-U(x))$ with a potential $U(x)$. In particular, inspired by diffusion models, we propose to consider a sequence $(\pi^{t_k})_k$ of approximations of the target density, for which $\pi^{t_k} \approx \pi$ for $k$ small and, on the other hand, $\pi^{t_k}$ exhibits favorable properties for sampling for $k$ large. This sequence is obtained by replacing parts of the potential $U$ by their Moreau envelopes. Sampling is performed in an annealed Langevin type procedure, that is, by sequentially sampling from $\pi^{t_k}$ for decreasing $k$, effectively guiding the samples from a simple starting density to the more complex target. In addition to a theoretical analysis, we show experimental results supporting the efficacy of the method in terms of increased convergence speed and applicability to multi-modal densities $\pi$.
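The Moreau envelope $U^t(x) = \min_y U(y) + \frac{1}{2t}\|x-y\|^2$ has gradient $\nabla U^t(x) = (x - \mathrm{prox}_{tU}(x))/t$, which is all the Langevin sampler needs. A minimal sketch for the non-smooth example $U(x) = |x|$, whose prox is soft-thresholding; the schedule and step size here are illustrative:

```python
import numpy as np

def prox_abs(x, t):
    """Proximal map of U(x) = |x| with parameter t: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def annealed_moreau_langevin(n=1000, schedule=(5.0, 1.0, 0.2, 0.05),
                             steps=200, eta=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    x = 5.0 * rng.normal(size=n)              # simple wide starting density
    for t in schedule:                        # decreasing smoothing level
        for _ in range(steps):
            grad = (x - prox_abs(x, t)) / t   # gradient of the Moreau envelope
            x = x - eta * grad + np.sqrt(2 * eta) * rng.normal(size=n)
    return x                                  # approx. samples from exp(-|x|)/Z
```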
Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.
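Generically, the two-step pipeline clusters embedded frames across all videos and then enforces temporal consistency within each video; in this sketch the consistency step is a simple median filter over the label sequence, an illustrative stand-in for the paper's decoding:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.ndimage import median_filter

def cluster_and_decode(video_embeddings, n_actions=5, window=15):
    """video_embeddings: list of (T_i, D) frame-embedding arrays.
    Returns one smoothed label sequence per video."""
    all_frames = np.concatenate(video_embeddings)    # cluster across videos
    labels = KMeans(n_clusters=n_actions, n_init=10).fit_predict(all_frames)
    segments, start = [], 0
    for emb in video_embeddings:
        seq = labels[start:start + len(emb)]
        start += len(emb)
        segments.append(median_filter(seq, size=window))  # within-video smoothing
    return segments
```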