alphaXiv

History

Papers Benchmarks

Technical University Munich

484

20 Oct 2025

agents computer-science artificial-intelligence

FineVision: Open Data Is All You Need

Stanford University

Hugging Face Technical University Munich

FineVision introduces a meticulously curated open multimodal dataset comprising 24 million samples derived from over 200 public sources. Models trained on FineVision achieve an average performance improvement of 5.1 to 14.3 percentage points across 11 diverse benchmarks compared to previous open-source datasets, while exhibiting a low test-set contamination rate of 1.02%.

2,529

11 Jun 2024

computer-science computer-vision-and-pattern-recognition efficient-transformers

An Image is Worth 32 Tokens for Reconstruction and Generation

ByteDance Technical University Munich

TiTok, a Transformer-based 1-Dimensional Tokenizer, introduces a new method for representing images with an extremely compact sequence of tokens. It enables high-quality image reconstruction and generation using as few as 32-64 tokens, leading to orders of magnitude faster inference speeds compared to traditional diffusion models.

167

1,255

08 Dec 2024

computer-science computer-vision-and-pattern-recognition machine-learning

MaskBit: Embedding-free Image Generation via Bit Tokens

Carnegie Mellon University

ByteDance Technical University Munich

MaskBit introduces an embedding-free image generation approach leveraging structured "bit tokens" and provides a modernized, open-source VQGAN+ tokenizer. This work achieved a state-of-the-art FID score of 1.52 on ImageNet 256x256, surpassing previous methods, while the VQGAN+ tokenizer reduced reconstruction FID to 1.66.

27 Aug 2025

ai-for-genomics computer-science contrastive-learning

A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics

Microsoft Helmholtz Munich Technical University Munich

Spatial transcriptomics enables simultaneous measurement of gene expression and tissue morphology, offering unprecedented insights into cellular organization and disease mechanisms. However, the field lacks comprehensive benchmarks for evaluating multimodal learning methods that leverage both histology images and gene expression data. Here, we present HESCAPE, a large-scale benchmark for cross-modal contrastive pretraining in spatial transcriptomics, built on a curated pan-organ dataset spanning 6 different gene panels and 54 donors. We systematically evaluated state-of-the-art image and gene expression encoders across multiple pretraining strategies and assessed their effectiveness on two downstream tasks: gene mutation classification and gene expression prediction. Our benchmark demonstrates that gene expression encoders are the primary determinant of strong representational alignment, and that gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches. However, downstream task evaluation reveals a striking contradiction: while contrastive pretraining consistently improves gene mutation classification performance, it degrades direct gene expression prediction compared to baseline encoders trained without cross-modal objectives. We identify batch effects as a key factor that interferes with effective cross-modal alignment. Our findings highlight the critical need for batch-robust multimodal learning approaches in spatial transcriptomics. To accelerate progress in this direction, we release HESCAPE, providing standardized datasets, evaluation protocols, and benchmarking tools for the community

24 Dec 2024

optimization-methods physics quantum-physics

Quantum and classical correlations in shrinking algorithms for optimization

Infineon Technologies AG BMW Group University of Erlangen-Nuremberg Ludwig Maximilian University of Munich Technical University Munich Fraunhofer Institute for Integrated Circuits

Understanding the benefits of quantum computing for solving combinatorial optimization problems (COPs) remains an open research question. In this work, we extend and analyze algorithms that solve COPs by recursively shrinking them. The algorithms leverage correlations between variables extracted from quantum or classical subroutines to recursively simplify the problem. We compare the performance of the algorithms equipped with correlations from the quantum approximate optimization algorithm (QAOA) as well as the classical linear programming (LP) and semi-definite programming (SDP) relaxations. This allows us to benchmark the utility of QAOA correlations against established classical relaxation algorithms. We apply the recursive algorithm to MaxCut problem instances with up to a hundred vertices at different graph densities. Our results indicate that LP outperforms all other approaches for low-density instances, while SDP excels for high-density problems. Moreover, the shrinking algorithm proves to be a viable alternative to established methods of rounding LP and SDP relaxations. In addition, the recursive shrinking algorithm outperforms its bare counterparts for all three types of correlations, i.e., LP with spanning tree rounding, the Goemans-Williamson algorithm, and conventional QAOA. While the lowest depth QAOA consistently yields worse results than the SDP, our tensor network experiments show that the performance increases significantly for deeper QAOA circuits.

16 Jul 2025

ai-for-health computer-science computer-vision-and-pattern-recognition

CytoSAE: Interpretable Cell Embeddings for Hematology

Helmholtz Munich Technical University Munich Munich Center for Machine Learning Korea Advanced Institute of Science & Technology

Helmholtz Munich researchers developed CytoSAE, an interpretable AI tool that uses Sparse Autoencoders to disentangle complex cell embeddings from a hematology foundation model (DinoBloom-B) into discrete, human-interpretable morphological concepts. The system successfully discovers and validates a variety of relevant cellular features, demonstrating generalizability across diverse datasets and enabling explainable AML subtype classification with performance comparable to black-box models.

570

27 Oct 2025

computer-science artificial-intelligence computation-and-language

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

IBM Research Technical University Munich

Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but often introduce custom algorithms that are difficult to implement or compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that preserves safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.

01 Dec 2025

computer-science artificial-intelligence machine-learning

A Diffusion Model Framework for Maximum Entropy Reinforcement Learning

Technical University Munich MIT World Peace University

Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.

27 Oct 2025

computer-science artificial-intelligence computation-and-language

Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

IBM Research Technical University Munich

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.

208

21 Sep 2025

computer-science continual-learning computation-and-language

WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs

Helmholtz Munich University of Tübingen

University of Virginia Technical University Munich

Keeping large language models factually up-to-date is crucial for deployment, yet costly retraining remains a challenge. Knowledge editing offers a promising alternative, but methods are only tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge research into lifelong knowledge editing to real-world edits at a practically relevant scale. We first introduce WikiBigEdit; a large-scale benchmark of real-world Wikidata edits, built to automatically extend lifelong for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use WikiBigEdit to study existing knowledge editing techniques' ability to incorporate large volumes of real-world facts and contrast their capabilities to generic modification techniques such as retrieval augmentation and continual finetuning to acquire a complete picture of the practical extent of current lifelong knowledge editing.

19 Nov 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI

Arizona State University MBZUAI Clark University Technical University Munich IBM

ServiceNow IBM Research Europe ESA NASA

Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce ''capability'' groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.

182

29 Sep 2020

computer-science computer-vision-security computer-vision-and-pattern-recognition

HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking

Max Planck Institute for Intelligent Systems

University of Oxford University of Tübingen Technical University Munich RWTH Aachen University

HOTA (Higher Order Tracking Accuracy) is proposed as a new evaluation metric for Multi-Object Tracking, designed to provide a balanced and interpretable score by integrating localization accuracy and assessing long-term trajectory consistency. The metric demonstrated stronger alignment with human judgment, particularly among MOT researchers, compared to existing evaluation methods.

1,148

22 Sep 2025

cosmology-and-nongalactic-astrophysics astrophysics-of-galaxies physics

DESI Strong Lens Foundry III: Keck Spectroscopy for Strong Lenses Discovered Using Residual Neural Networks

University of Pittsburgh

University of California, Santa Barbara

UCLA

Chinese Academy of Sciences

Université de Montréal

University of Chicago

UC Berkeley

University College London University of Edinburgh

Boston University Southern Methodist University

Johns Hopkins University The University of Texas at Dallas

Lawrence Berkeley National Laboratory University of Rochester Fermi National Accelerator Laboratory University of Portsmouth

The Ohio State University Aix-Marseille Univ Technical University Munich University of Hawai’i Universidad de Los Andes Institut d’Estudis Espacials de Catalunya (IEEC)NSF’s National Optical-Infrared Astronomy Research Laboratory University of San Francisco Max Planck Institute for Astrophysics (MPA)Gemini Observatory Kavli Institute for the Physics and Mathematics of the Universe, University of Tokyo Ciela – Montreal Institute for Astrophysical Data Analysis and Machine Learning Institute of Space Sciences (ICE–CSIC)

We present spectroscopic data of strong lenses and their source galaxies using the Keck Near-Infrared Echellette Spectrometer (NIRES) and the Dark Energy Spectroscopic Instrument (DESI), providing redshifts necessary for nearly all strong-lensing applications with these systems, especially the extraction of physical parameters from lensing modeling. These strong lenses were found in the DESI Legacy Imaging Surveys using Residual Neural Networks (ResNet) and followed up by our Hubble Space Telescope program, with all systems displaying unambiguous lensed arcs. With NIRES, we target eight lensed sources at redshifts difficult to measure in the optical range and determine the source redshifts for six, between

z_s

= 1.675 and 3.332. DESI observed one of the remaining source redshifts, as well as an additional source redshift within the six systems. The two systems with non-detections by NIRES were observed for a considerably shorter 600s at high airmass. Combining NIRES infrared spectroscopy with optical spectroscopy from our DESI Strong Lensing Secondary Target Program, these results provide the complete lens and source redshifts for six systems, a resource for refining automated strong lens searches in future deep- and wide-field imaging surveys and addressing a range of questions in astrophysics and cosmology.

110

23 May 2024

ai-for-health computer-science computer-vision-security

The Brain Tumor Segmentation (BraTS) Challenge 2023: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs)

Pediatric tumors of the central nervous system are the most common cause of cancer-related death in children. The five-year survival rate for high-grade gliomas in children is less than 20\%. Due to their rarity, the diagnosis of these entities is often delayed, their treatment is mainly based on historic treatment concepts, and clinical trials require multi-institutional collaborations. The MICCAI Brain Tumor Segmentation (BraTS) Challenge is a landmark community benchmark event with a successful history of 12 years of resource creation for the segmentation and analysis of adult glioma. Here we present the CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs 2023 challenge, which represents the first BraTS challenge focused on pediatric brain tumors with data acquired across multiple international consortia dedicated to pediatric neuro-oncology and clinical trials. The BraTS-PEDs 2023 challenge focuses on benchmarking the development of volumentric segmentation algorithms for pediatric brain glioma through standardized quantitative performance evaluation metrics utilized across the BraTS 2023 cluster of challenges. Models gaining knowledge from the BraTS-PEDs multi-parametric structural MRI (mpMRI) training data will be evaluated on separate validation and unseen test mpMRI dataof high-grade pediatric glioma. The CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs 2023 challenge brings together clinicians and AI/imaging scientists to lead to faster development of automated segmentation techniques that could benefit clinical trials, and ultimately the care of children with brain tumors.

335

07 May 2025

computer-science computation-and-language computer-vision-and-pattern-recognition

RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance

Technical University Munich

Conversational AI tools that can generate and discuss clinically correct radiology reports for a given medical image have the potential to transform radiology. Such a human-in-the-loop radiology assistant could facilitate a collaborative diagnostic process, thus saving time and improving the quality of reports. Towards this goal, we introduce RaDialog, the first thoroughly evaluated and publicly available large vision-language model for radiology report generation and interactive dialog. RaDialog effectively integrates visual image features and structured pathology findings with a large language model (LLM) while simultaneously adapting it to a specialized domain using parameter-efficient fine-tuning. To keep the conversational abilities of the underlying LLM, we propose a comprehensive, semi-automatically labeled, image-grounded instruct dataset for chest X-ray radiology tasks. By training with this dataset, our method achieves state-of-the-art clinical correctness in report generation and shows impressive abilities in interactive tasks such as correcting reports and answering questions, serving as a foundational step toward clinical dialog systems. Our code is available on github: this https URL

11 Dec 2024

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning

Technical University Berlin Technical University Munich CARIAD SE

This paper demonstrates that simple fine-tuning of Vision-Language Model (VLM) pre-trained encoders is highly effective for domain generalization in dense perception tasks, achieving competitive or superior performance on synthetic-to-real and real-to-real benchmarks without relying on complex domain adaptation techniques. It shows that VLMs, especially EVA-CLIP, provide a powerful foundation for building robust perception systems by inherently generalizing across various data distributions.

31 Oct 2025

statistical-mechanics physics quantum-physics

Counterdiabatic Driving with Performance Guarantees

Harvard University

New York University

Flatiron Institute Technical University Munich BMW AG Istituto Nazionale di Fisica Nucleare INFN Universit degli Studi di Padova

Counterdiabatic (CD) driving has the potential to speed up adiabatic quantum state preparation by suppressing unwanted excitations. However, existing approaches either require intractable classical computations or are based on approximations which do not have performance guarantees. We propose and analyze a non-variational, system-agnostic CD expansion method and analytically show that it converges exponentially quickly in the expansion order. In finite systems, the required resources scale inversely with the spectral gap, which we argue is asymptotically optimal. To extend our method to the thermodynamic limit and suppress errors stemming from high-frequency transitions, we leverage finite-time adiabatic protocols. In particular, we show that a time determined by the quantum speed limit is sufficient to prepare the desired ground state, without the need to optimize the adiabatic trajectory. Numerical tests of our method on the quantum Ising chain show that our method can outperform state-of-the-art variational CD approaches.

364

27 Aug 2024

adversarial-robustness computer-science machine-learning

Does Audio Deepfake Detection Generalize?

Technical University Munich Fraunhofer AISEC why do birds GmbH

Current text-to-speech algorithms produce realistic fakes of human voices, making deepfake detection a much-needed area of research. While researchers have presented various techniques for detecting audio spoofs, it is often unclear exactly why these architectures are successful: Preprocessing steps, hyperparameter settings, and the degree of fine-tuning are not consistent across related work. Which factors contribute to success, and which are accidental? In this work, we address this problem: We systematize audio spoofing detection by re-implementing and uniformly evaluating architectures from related work. We identify overarching features for successful audio deepfake detection, such as using cqtspec or logspec features instead of melspec features, which improves performance by 37% EER on average, all other factors constant. Additionally, we evaluate generalization capabilities: We collect and publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes. We find that related work performs poorly on such real-world data (performance degradation of up to one thousand percent). This may suggest that the community has tailored its solutions too closely to the prevailing ASVSpoof benchmark and that deepfakes are much harder to detect outside the lab than previously thought.

01 Oct 2025

attention-mechanisms clustering-algorithms computer-science

Feature Identification for Hierarchical Contrastive Learning

Infineon Technologies AG Technical University Munich Friedrich-Alexander Universität, Erlangen-Nürnberg

Hierarchical classification is a crucial task in many applications, where objects are organized into multiple levels of categories. However, conventional classification approaches often neglect inherent inter-class relationships at different hierarchy levels, thus missing important supervisory signals. Thus, we propose two novel hierarchical contrastive learning (HMLC) methods. The first, leverages a Gaussian Mixture Model (G-HMLC) and the second uses an attention mechanism to capture hierarchy-specific features (A-HMLC), imitating human processing. Our approach explicitly models inter-class relationships and imbalanced class distribution at higher hierarchy levels, enabling fine-grained clustering across all hierarchy levels. On the competitive CIFAR100 and ModelNet40 datasets, our method achieves state-of-the-art performance in linear evaluation, outperforming existing hierarchical contrastive learning methods by 2 percentage points in terms of accuracy. The effectiveness of our approach is backed by both quantitative and qualitative results, highlighting its potential for applications in computer vision and beyond.

11 Jul 2024

ai-for-health computer-science computer-vision-security

The Brain Tumor Segmentation in Pediatrics (BraTS-PEDs) Challenge: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs)

There are no more papers matching your filters at the moment.

Events

Personalize Your Feed

Install Browser Extension

We're hiring