SIB Swiss Institute of Bioinformatics
Benchmarking, which involves collecting reference datasets and demonstrating the performance of methods, is a requirement for the development of new computational tools, but has also become a domain of its own in the pursuit of neutral comparisons of methods. Although a lot has been written about how to design and conduct benchmark studies, this Perspective sheds light on a wish list for a computational platform to orchestrate benchmark studies. We discuss various ideas for organizing reproducible software environments, formally defining benchmarks, orchestrating standardized workflows, and interfacing with computing infrastructure.
Researchers at the SIB Swiss Institute of Bioinformatics developed an LLM-based Retrieval-Augmented Generation (RAG) system to translate natural language questions into accurate federated SPARQL queries for complex bioinformatics knowledge graphs. By incorporating schema-based validation and iterative correction, the system significantly improves query generation accuracy and reduces hallucinations, raising the F1 score of models such as `gpt-4o` to 0.91.
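A minimal sketch of such a generate-validate-correct loop is below. The helpers `llm_generate` and `validate_against_schema` are hypothetical stand-ins supplied by the caller, not the authors' implementation; only the loop structure reflects the idea described above.

```python
# Minimal sketch of a generate-validate-retry loop for NL-to-SPARQL, in the
# spirit of the approach above. `llm_generate` and `validate_against_schema`
# are hypothetical caller-supplied stand-ins, not the actual system.
from typing import Callable

def nl_to_sparql(question: str,
                 llm_generate: Callable[[str], str],
                 validate_against_schema: Callable[[str], list[str]],
                 max_rounds: int = 3) -> str:
    """Ask the LLM for a query, then feed validation errors back until clean."""
    prompt = f"Translate to SPARQL:\n{question}"
    query = llm_generate(prompt)
    for _ in range(max_rounds):
        errors = validate_against_schema(query)  # e.g. unknown classes/properties
        if not errors:
            return query
        # Iterative correction: show the model its own query plus the errors.
        prompt = (f"Question: {question}\nQuery:\n{query}\n"
                  f"Schema errors: {'; '.join(errors)}\nReturn a corrected query.")
        query = llm_generate(prompt)
    return query  # best effort after max_rounds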
Researchers from ETH Zurich, KU Leuven, and Helmholtz Munich developed the Topological Graph Layer (TOGL), a generic GNN layer that integrates learnable multi-scale topological information into graph representations. The layer enhances GNN expressivity beyond the 1-dimensional Weisfeiler–Lehman test and consistently improves performance on graph and node classification tasks, particularly those dependent on structural and topological properties.
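TOGL's contribution is making persistent homology differentiable inside a GNN layer; the non-learned core it builds on is the computation below, which extracts the 0-dimensional persistence pairs of a node-valued filtration using union-find and the elder rule. The filtration values and function names are illustrative, not TOGL's learned filtrations.

```python
# Non-learned sketch: 0-dimensional persistence pairs of a sublevel-set
# filtration on a graph, via union-find and the elder rule. TOGL makes a
# computation of this kind differentiable and learnable; this is only the
# classical, fixed-filtration version for intuition.
def zero_dim_persistence(n_nodes, edges, f):
    """f[i] is the filtration value of node i; an edge enters at max endpoint."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = []
    # Process edges by the time they appear in the filtration.
    for u, v in sorted(edges, key=lambda e: max(f[e[0]], f[e[1]])):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # no component merge, no 0-dim event
        # Elder rule: the younger component (larger birth value) dies here.
        if f[ru] > f[rv]:
            ru, rv = rv, ru
        pairs.append((f[rv], max(f[u], f[v])))  # (birth, death)
        parent[rv] = ru
    # Surviving components give essential (infinite) bars.
    roots = {find(i) for i in range(n_nodes)}
    pairs.extend((f[r], float("inf")) for r in roots)
    return pairs

# Example: path 0-1-2 with a dip at node 2 (f = [0.0, 1.0, 0.2]) yields the
# finite bar (0.2, 1.0) for the dip, a zero-persistence pair, and one
# essential bar for the surviving component.
print(zero_dim_persistence(3, [(0, 1), (1, 2)], f=[0.0, 1.0, 0.2]))
```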
Spatial omics assays allow for the molecular characterisation of cells in their spatial context. Notably, the two main technological streams, imaging-based and high-throughput sequencing-based, can give rise to very different data modalities. The characteristics of the two data types are well known in adjacent fields such as spatial statistics, where they correspond to point patterns and lattice data, and a wide range of tools is available. This paper discusses the application of spatial statistics to spatially-resolved omics data, highlighting the advantages, challenges, and nuances this entails. This work is accompanied by a vignette, pasta, that showcases the usefulness of spatial statistics in biology using several R packages.
The Structure-Aware Transformer (SAT) introduces a novel graph neural network architecture that explicitly incorporates local structural information into its self-attention mechanism via flexible subgraph extractors. This approach enables SAT to achieve state-of-the-art performance across five diverse graph prediction benchmarks, including a significant gain on OGBG-CODE2, and offers improved model interpretability by focusing attention on relevant structural motifs.
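A rough, non-learned illustration of the subgraph-extractor idea follows (SAT instead encodes each subgraph with a GNN before attention): compute a structural descriptor of every node's k-hop neighbourhood that could augment attention queries and keys.

```python
# Rough illustration of SAT's subgraph-extractor idea: before attention,
# summarize each node's k-hop neighbourhood. SAT learns this summary with a
# GNN; here we use fixed descriptors (size, edge count, triangles) for clarity.
import networkx as nx

def khop_descriptors(G: nx.Graph, k: int = 2):
    desc = {}
    for v in G.nodes:
        sub = nx.ego_graph(G, v, radius=k)  # induced k-hop subgraph around v
        tri = sum(nx.triangles(sub).values()) // 3
        desc[v] = (sub.number_of_nodes(), sub.number_of_edges(), tri)
    return desc  # would be concatenated to node features before self-attention

G = nx.karate_club_graph()
print(khop_descriptors(G, k=1)[0])  # descriptor of node 0's 1-hop neighbourhood
```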
A differentiable framework, DiffPALM, infers interacting protein sequence pairings by adapting the Masked Language Modeling objective from MSA Transformer. This method demonstrates superior performance over traditional coevolutionary techniques on shallow multiple sequence alignments and enhances the accuracy of AlphaFold-Multimer's complex structure predictions for eukaryotic proteins.
Mutual information is a general statistical dependency measure which has found applications in representation learning, causality, domain generalization and computational biology. However, mutual information estimators are typically evaluated on simple families of probability distributions, namely the multivariate normal distribution and selected distributions with one-dimensional random variables. In this paper, we show how to construct a diverse family of distributions with known ground-truth mutual information and propose a language-independent benchmarking platform for mutual information estimators. We discuss the general applicability and limitations of classical and neural estimators in settings involving high dimensions, sparse interactions, long-tailed distributions, and high mutual information. Finally, we provide guidelines for practitioners on how to select an estimator appropriate to the difficulty of the problem at hand, and on the issues to consider when applying an estimator to a new data set.
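As a toy version of the benchmark's premise, one can check an off-the-shelf kNN estimator against the closed-form mutual information of a bivariate normal, I(X;Y) = -0.5 ln(1 - rho^2); the paper's distribution family is deliberately much richer than this.

```python
# Toy version of the benchmark idea: compare an estimator against the known
# ground-truth MI of a bivariate normal, I(X;Y) = -0.5 * ln(1 - rho^2).
# The paper's benchmark uses far richer distribution families than this.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
rho = 0.8
true_mi = -0.5 * np.log(1 - rho**2)  # in nats

cov = [[1.0, rho], [rho, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=5_000)
est_mi = mutual_info_regression(xy[:, [0]], xy[:, 1], random_state=0)[0]

print(f"ground truth: {true_mi:.3f} nats, kNN estimate: {est_mi:.3f} nats")
```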
Summary: denet is a lightweight process monitoring utility providing real-time resource profiling of running processes. denet reports CPU, memory, disk I/O, network activity, and thread usage, including recursive child monitoring, with adaptive sampling rates. It offers both a command-line interface (CLI) with colorized output and a Python API for inclusion in other software. Output is structured as JSON, JSONL, or CSV and includes performance metrics as well as process metadata such as the PID and the executed command. The easy-to-parse profiling results make denet suitable for benchmarking, debugging, monitoring, and optimizing data-intensive pipelines in bioinformatics and other fields. Availability and implementation: denet is open-source software released under the GPLv3 license and maintained at this https URL. It is implemented in Rust, with Python bindings provided via maturin, and can be installed from Cargo (cargo install denet) or PyPI (pip install denet). Most functionality does not require administrative privileges, enabling use on cloud platforms, HPC clusters, and standard Linux workstations. Certain advanced features, such as eBPF support, may require elevated permissions. Documentation, including usage examples and API references, is provided.
Despite decades of clinical research, sepsis remains a global public health crisis with high mortality and morbidity. Currently, when sepsis is detected and the underlying pathogen is identified, organ damage may have already progressed to irreversible stages. Effective sepsis management is therefore highly time-sensitive. By systematically analysing trends in the plethora of clinical data available in the intensive care unit (ICU), an early prediction of sepsis could lead to earlier pathogen identification, resistance testing, and effective antibiotic and supportive treatment, and thereby become a life-saving measure. Here, we developed and validated a machine learning (ML) system for the prediction of sepsis in the ICU. Our analysis represents the largest multi-national, multi-centre in-ICU study for sepsis prediction using ML to date. Our dataset contains 156,309 unique ICU admissions, which represent a refined and harmonised subset of five large ICU databases originating from three countries. Using the international consensus definition Sepsis-3, we derived hourly-resolved sepsis label annotations, amounting to 26,734 (17.1%) septic stays. We compared our approach, a deep self-attention model, to several clinical baselines as well as ML baselines and performed an extensive internal and external validation within and across databases. On average, our model was able to predict sepsis with an AUROC of 0.847 ± 0.050 (internal out-of-sample validation) and 0.761 ± 0.052 (external validation). For a harmonised prevalence of 17%, at 80% recall our model detects septic patients with 39% precision 3.7 hours in advance.
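To make the headline numbers concrete, here is how a figure like "39% precision at 80% recall" is read off a precision-recall curve with scikit-learn. Labels and scores below are synthetic placeholders, not the study's data.

```python
# Reading precision at a fixed recall, and AUROC, from synthetic scores.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.17, size=10_000)   # ~17% prevalence, as in the study
y_score = rng.normal(y_true * 1.5, 1.0)       # positives score higher on average

print("AUROC:", round(roc_auc_score(y_true, y_score), 3))

precision, recall, _ = precision_recall_curve(y_true, y_score)
# Among operating points that still achieve >= 80% recall, take the best
# precision: that is the threshold a deployment would pick.
mask = recall >= 0.80
print("precision at 80% recall:", round(precision[mask].max(), 3))
```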
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences on homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data more accurately than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
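A skeleton of such an iterative masked-sampling loop is shown below; `mlm_logits` is a hypothetical stand-in (random logits, so the example runs), not MSA Transformer itself, and only the loop structure mirrors the method described above.

```python
# Skeleton of an iterative masked-sampling loop: repeatedly mask positions in
# an MSA and resample them from an MLM's conditional distribution.
# `mlm_logits` is a hypothetical stand-in for a model such as MSA Transformer.
import numpy as np

VOCAB = 21  # 20 amino acids + gap
rng = np.random.default_rng(2)

def mlm_logits(msa: np.ndarray, mask: np.ndarray) -> np.ndarray:
    return rng.normal(size=(*msa.shape, VOCAB))  # stand-in, not a real model

def iterate_generation(msa: np.ndarray, rounds: int = 10, p_mask: float = 0.1):
    msa = msa.copy()
    for _ in range(rounds):
        mask = rng.random(msa.shape) < p_mask     # positions to resample
        logits = mlm_logits(msa, mask)[mask]       # (n_masked, VOCAB)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # Sample each masked position from the model's conditional.
        msa[mask] = np.array([rng.choice(VOCAB, p=p) for p in probs])
    return msa

seed_msa = rng.integers(0, VOCAB, size=(8, 50))  # 8 sequences, length 50
print(iterate_generation(seed_msa).shape)
```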
Benchmarking in bioinformatics is a process of designing, running and disseminating rigorous performance evaluations of methods (software). Benchmarking systems facilitate the benchmarking process by providing an entry point for storing, coordinating and executing concrete benchmarks. We describe an alpha version of a new benchmarking system, Omnibenchmark, to facilitate benchmark formalization and execution in solo and community efforts. Omnibenchmark provides a benchmark definition syntax (in a configuration YAML file), dynamic workflow generation based on Snakemake, S3-compatible storage handling, and reproducible software environments using EasyBuild, lmod, Apptainer or conda. Tutorials and installation instructions are available from this https URL.
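To give a flavour of what a declarative benchmark definition might look like, here is a hypothetical YAML sketch. The field names are illustrative guesses, not Omnibenchmark's actual schema, for which the linked tutorials are authoritative.

```yaml
# Hypothetical sketch of a declarative benchmark definition of the kind the
# abstract describes (YAML from which a Snakemake workflow is generated).
# Field names are illustrative only; consult the Omnibenchmark docs.
id: clustering_benchmark
version: "0.1"
storage: s3://my-bucket/omnibenchmark      # S3-compatible result store
software_backend: conda                    # or easybuild / lmod / apptainer
stages:
  - id: data
    modules:
      - id: dataset_a
        repository: https://example.org/datasets/a.git   # placeholder URL
  - id: methods
    modules:
      - id: method_x
        repository: https://example.org/methods/x.git    # placeholder URL
  - id: metrics
    modules:
      - id: ari
        repository: https://example.org/metrics/ari.git  # placeholder URL
```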
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with an average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best-performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at this https URL.
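As a sketch of the NER step in such a pipeline, the snippet below runs a token-classification model with Hugging Face transformers. The checkpoint name is a placeholder, not a published model from the paper.

```python
# Sketch of the NER step: token classification with a fine-tuned transformer.
# The checkpoint name is a placeholder; the paper fine-tunes pre-trained
# biomedical language models on EnzChemRED.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="my-org/enzchemred-ner",    # placeholder, not a published checkpoint
    aggregation_strategy="simple",    # merge word pieces into entity spans
)

text = "Hexokinase catalyzes the phosphorylation of glucose to glucose 6-phosphate."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(ent["score"], 2))
```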
In biomedical research, validation of a new scientific discovery is tied to the reproducibility of its experimental results. However, in genomics, the definition and implementation of reproducibility still remain imprecise. Here, we argue that genomic reproducibility, defined as the ability of bioinformatics tools to maintain consistent genomics results across technical replicates, is key to generating scientific knowledge and enabling medical applications. We first discuss different concepts of reproducibility and then focus on reproducibility in the context of genomics, aiming to establish clear definitions of relevant terms. We then focus on the role of bioinformatics tools and their impact on genomic reproducibility and assess methods of evaluating bioinformatics tools in terms of genomic reproducibility. Lastly, we suggest best practices for enhancing genomic reproducibility, with an emphasis on assessing the performance of bioinformatics tools through rigorous testing across multiple technical replicates.
Most graph kernels are instances of the class of R-convolution kernels, which measure the similarity of objects by comparing their substructures. Despite their empirical success, most graph kernels use a naive aggregation of the final set of substructures, usually a sum or average, thereby potentially discarding valuable information about the distribution of individual components. Furthermore, only a limited subset of these approaches can be extended to continuously attributed graphs. We propose a novel method that relies on the Wasserstein distance between the node feature vector distributions of two graphs, which makes it possible to find subtler differences in data sets by considering graphs as high-dimensional objects rather than simple means. We further propose a Weisfeiler-Lehman-inspired embedding scheme for graphs with continuous node attributes and weighted edges, enhance it with the computed Wasserstein distance, and thus improve the state-of-the-art prediction performance on several graph classification tasks.
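The heart of the method is an optimal-transport comparison of node feature distributions. Below is a minimal sketch using the POT library; unlike the paper, it skips the Weisfeiler-Lehman refinement and works on raw continuous features.

```python
# Core computation: Wasserstein distance between the node feature
# distributions of two graphs, via the POT library. The paper first refines
# features with a Weisfeiler-Lehman-style scheme; this sketch uses raw
# continuous node features for brevity.
import numpy as np
import ot  # POT: Python Optimal Transport

def graph_wasserstein(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d) node features of graph 1; Y: (m, d) of graph 2."""
    M = ot.dist(X, Y)                      # pairwise squared Euclidean costs
    a = np.full(len(X), 1.0 / len(X))      # uniform weight on each node
    b = np.full(len(Y), 1.0 / len(Y))
    return ot.emd2(a, b, M)                # exact optimal transport cost

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(10, 4)), rng.normal(loc=0.5, size=(12, 4))
d = graph_wasserstein(X, Y)
# A kernel between the graphs can then be built as, e.g., exp(-lam * d).
print(round(d, 3))
```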
The pointwise mutual information profile, or simply profile, is the distribution of pointwise mutual information for a given pair of random variables. One of its important properties is that its expected value is precisely the mutual information between these random variables. In this paper, we analytically describe the profiles of multivariate normal distributions and introduce a novel family of distributions, Bend and Mix Models, for which the profile can be accurately estimated using Monte Carlo methods. We then show how Bend and Mix Models can be used to study the limitations of existing mutual information estimators, investigate the behavior of neural critics used in variational estimators, and understand the effect of experimental outliers on mutual information estimation. Finally, we show how Bend and Mix Models can be used to obtain model-based Bayesian estimates of mutual information, suitable for problems where domain expertise is available and uncertainty quantification is necessary.
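The definition lends itself to a direct Monte Carlo illustration in the simplest setting the paper treats analytically, a bivariate normal: the sample mean of the pointwise mutual information converges to the closed-form mutual information -0.5 ln(1 - rho^2).

```python
# Monte Carlo estimate of the PMI profile for a bivariate normal.
# PMI(x, y) = log p(x, y) - log p(x) - log p(y); its sample mean converges
# to the mutual information -0.5 * ln(1 - rho^2).
import numpy as np
from scipy.stats import multivariate_normal, norm

rho = 0.7
rng = np.random.default_rng(4)
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

pmi = (multivariate_normal([0.0, 0.0], cov).logpdf(xy)
       - norm.logpdf(xy[:, 0]) - norm.logpdf(xy[:, 1]))

print("mean PMI:", round(pmi.mean(), 4))                 # ~ MI
print("true MI:", round(-0.5 * np.log(1 - rho**2), 4))
# The full distribution of `pmi` (e.g. via np.histogram) is the profile.
```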
Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either method alone. The improvement is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.
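For intuition on the first step, here is a compact simulated-annealing alignment of two adjacency structures. The objective (edge overlap between permuted graphs) and the cooling schedule are simplified stand-ins, not the paper's exact procedure.

```python
# Compact simulated-annealing graph alignment: search for a permutation p
# maximizing the overlap sum_ij A[i, j] * B[p(i), p(j)]. A simplified
# stand-in for the paper's alignment of sequence-similarity graphs.
import numpy as np

def sa_align(A: np.ndarray, B: np.ndarray, steps=20_000, t0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(A)
    p = rng.permutation(n)
    score = (A * B[np.ix_(p, p)]).sum()
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9           # linear cooling
        i, j = rng.choice(n, size=2, replace=False)
        q = p.copy()
        q[i], q[j] = q[j], q[i]                        # propose a swap
        new = (A * B[np.ix_(q, q)]).sum()
        if new >= score or rng.random() < np.exp((new - score) / t):
            p, score = q, new                          # Metropolis acceptance
    return p, score

# Toy check: B is a scrambled copy of A, so a perfect alignment exists.
rng = np.random.default_rng(5)
A = (rng.random((15, 15)) < 0.3).astype(float)
A = np.triu(A, 1); A += A.T                            # symmetric, no self-loops
B = A[np.ix_(rng.permutation(15), rng.permutation(15).argsort().argsort())] if False else A[np.ix_(*(lambda t: (t, t))(rng.permutation(15)))]
p, score = sa_align(A, B)
print("recovered score:", score, "optimum:", A.sum())
```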
The recent success of Large Language Models (LLM) in a wide range of Natural Language Processing applications opens the path towards novel Question Answering Systems over Knowledge Graphs leveraging LLMs. However, one of the main obstacles preventing their implementation is the scarcity of training data for the task of translating questions into corresponding SPARQL queries, particularly in the case of domain-specific KGs. To overcome this challenge, in this study, we evaluate several strategies for fine-tuning the OpenLlama LLM for question answering over life science knowledge graphs. In particular, we propose an end-to-end data augmentation approach for extending a set of existing queries over a given knowledge graph towards a larger dataset of semantically enriched question-to-SPARQL query pairs, enabling fine-tuning even for datasets where these pairs are scarce. In this context, we also investigate the role of semantic "clues" in the queries, such as meaningful variable names and inline comments. Finally, we evaluate our approach over the real-world Bgee gene expression knowledge graph and we show that semantic clues can improve model performance by up to 33% compared to a baseline with random variable names and no comments included.
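To illustrate what such clues look like in practice, here is a hypothetical SPARQL pair over a gene expression knowledge graph. The prefix and predicate names are placeholders, not Bgee's actual vocabulary.

```sparql
# Hypothetical illustration of the "semantic clues" studied above: the same
# query written with meaningful variable names and inline comments.
PREFIX ex: <http://example.org/genex#>

# Which anatomical entities express the gene APOC1?
SELECT ?gene ?anatomicalEntity WHERE {
  ?expressionCall ex:hasGene ?gene .                  # expression call -> gene
  ?expressionCall ex:expressedIn ?anatomicalEntity .  # site of expression
  ?gene ex:geneSymbol "APOC1" .
}

# The baseline variant strips the clues, e.g.
#   SELECT ?v0 ?v1 WHERE { ?v2 ex:hasGene ?v0 . ?v2 ex:expressedIn ?v1 . ... }
# with random variable names and no comments; the paper reports that clues
# can improve model performance by up to 33% over that baseline.
```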
Bacterial populations often have complex spatial structures, which can impact their evolution. Here, we study how spatial structure affects the evolution of antibiotic resistance in a bacterial population. We consider a minimal model of spatially structured populations where all demes (i.e., subpopulations) are identical and connected to each other by identical migration rates. We show that spatial structure can facilitate the survival of a bacterial population to antibiotic treatment, starting from a sensitive inoculum. Specifically, the bacterial population can be rescued if antibiotic-resistant mutants appear and are present when the drug is added, and spatial structure can impact the fate of these mutants and the probability that they are present. Indeed, the probability of fixation of neutral or deleterious mutations providing drug resistance is increased in smaller populations. This promotes local fixation of resistant mutants in the structured population, which facilitates evolutionary rescue by drug resistance in the rare mutation regime. Once the population is rescued by resistance, migrations allow resistant mutants to spread in all demes. Our main result, that spatial structure facilitates evolutionary rescue by antibiotic resistance, extends to more complex spatial structures and to the case where there are resistant mutants in the inoculum.
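The key population-genetics fact the argument rests on, that a neutral mutant fixes with probability 1/N and fixation is therefore more likely in smaller demes, can be checked with a minimal Wright-Fisher simulation (a simplification, not the paper's deme model).

```python
# Numerical check: a neutral mutant fixes with probability 1/N under
# Wright-Fisher resampling, so fixation is more likely in smaller demes.
import numpy as np

def fixation_prob(N: int, trials: int = 20_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    fixed = 0
    for _ in range(trials):
        k = 1                                  # one mutant copy initially
        while 0 < k < N:
            k = rng.binomial(N, k / N)         # neutral WF resampling
        fixed += (k == N)
    return fixed / trials

for N in (10, 50, 250):
    print(f"N={N:4d}: simulated {fixation_prob(N):.4f}  theory {1 / N:.4f}")
```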
Photoactivatable ribonucleoside-enhanced cross-linking and immunoprecipitation (PAR-CLIP) is an experimental method based on next-generation sequencing for identifying the RNA interaction sites of a given protein. The method deliberately inserts T-to-C substitutions at the RNA-protein interaction sites, which provides a second layer of evidence compared to other CLIP methods. However, the experiment includes several sources of noise which cause both low-frequency errors and spurious high-frequency alterations. Therefore, rigorous statistical analysis is required in order to separate true T-to-C base changes, following cross-linking, from noise. So far, most of the existing PAR-CLIP data analysis methods focus on discarding the low-frequency errors and rely on high-frequency substitutions to report binding sites, not taking into account the possibility of high-frequency false positive substitutions. Here, we introduce BMix, a new probabilistic method which explicitly accounts for the sources of noise in PAR-CLIP data and distinguishes cross-link induced T-to-C substitutions from low and high-frequency erroneous alterations. We demonstrate the superior speed and accuracy of our method compared to existing approaches on both simulated and real, publicly available human datasets. The model is implemented in the Matlab toolbox BMix, freely available at www.cbg.bsse.ethz.ch/software/BMix.
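The abstract does not spell out BMix's model, so the following is only a loosely related sketch: EM for a two-component binomial mixture over per-site T-to-C counts, separating a low-frequency error component from a higher-frequency cross-link component. The real BMix model is richer (it also handles high-frequency false positives).

```python
# Much-simplified cousin of the model class BMix belongs to: EM for a
# two-component binomial mixture over per-site T-to-C counts. Conveys the
# flavor of the inference only; not the actual BMix model.
import numpy as np
from scipy.stats import binom

def em_binomial_mixture(k, n, iters=200):
    """k: T-to-C counts per site, n: coverage per site."""
    p = np.array([0.01, 0.30])         # initial per-component rates
    w = np.array([0.5, 0.5])           # mixture weights
    for _ in range(iters):
        # E-step: posterior responsibility of each component per site.
        like = np.stack([w[c] * binom.pmf(k, n, p[c]) for c in range(2)])
        resp = like / like.sum(axis=0, keepdims=True)
        # M-step: reestimate weights and rates from weighted counts.
        w = resp.mean(axis=1)
        p = (resp * k).sum(axis=1) / (resp * n).sum(axis=1)
    return w, p, resp

rng = np.random.default_rng(6)
n = rng.integers(20, 200, size=1_000)
truth = rng.random(1_000) < 0.3                    # 30% cross-linked sites
k = np.where(truth, rng.binomial(n, 0.25), rng.binomial(n, 0.01))
w, p, resp = em_binomial_mixture(k, n)
print("weights:", w.round(2), "rates:", p.round(3))
```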