European Bioinformatics Institute / European Molecular Biology Laboratory
The shape of objects is an important source of visual information in a wide range of applications. One of the core challenges of shape quantification is to ensure that the extracted measurements remain invariant to transformations that preserve an object's intrinsic geometry, such as changing its size, orientation, and position in the image. In this work, we introduce ShapeEmbed, a self-supervised representation learning framework designed to encode the contour of objects in 2D images, represented as a Euclidean distance matrix, into a shape descriptor that is invariant to translation, scaling, rotation, reflection, and point indexing. Our approach overcomes the limitations of traditional shape descriptors while improving upon existing state-of-the-art autoencoder-based approaches. We demonstrate that the descriptors learned by our framework outperform their competitors in shape classification tasks on natural and biological images. We envision our approach to be of particular relevance to biological imaging applications.
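As a minimal illustration of the input representation (a sketch under our own naming, not the ShapeEmbed implementation), the pairwise Euclidean distance matrix of a sampled contour is already invariant to translation, rotation, and reflection; a simple size normalization adds scale invariance, while re-indexing the contour amounts to a joint permutation of rows and columns that the learned encoder must absorb:

```python
# Illustrative sketch: build a Euclidean distance matrix (EDM) from a 2D contour.
import numpy as np

def contour_to_edm(points: np.ndarray, normalize: bool = True) -> np.ndarray:
    """points: (N, 2) array of ordered contour coordinates."""
    diff = points[:, None, :] - points[None, :, :]
    edm = np.linalg.norm(diff, axis=-1)          # pairwise distances
    if normalize:
        edm = edm / edm.max()                    # crude scale normalization
    return edm

# A rotated and translated copy of the same contour yields the same EDM.
theta = np.pi / 5
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
contour = np.random.rand(64, 2)
assert np.allclose(contour_to_edm(contour), contour_to_edm(contour @ R.T + 3.0))
```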
A fundamental task in the analysis of datasets with many variables is screening for associations. This can be cast as a multiple testing task, where the objective is achieving high detection power while controlling type I error. We consider $m$ hypothesis tests represented by pairs $((P_i, X_i))_{1\leq i \leq m}$ of p-values $P_i$ and covariates $X_i$, such that $P_i \perp X_i$ if $H_i$ is null. Here, we show how to use information potentially available in the covariates about heterogeneities among hypotheses to increase power compared to conventional procedures that only use the $P_i$. To this end, we upgrade existing weighted multiple testing procedures through the Independent Hypothesis Weighting (IHW) framework to use data-driven weights that are calculated as a function of the covariates. Finite sample guarantees, e.g., false discovery rate (FDR) control, are derived from cross-weighting, a data-splitting approach that enables learning the weight-covariate function without overfitting as long as the hypotheses can be partitioned into independent folds, with arbitrary within-fold dependence. IHW has increased power compared to methods that do not use covariate information. A key implication of IHW is that hypothesis rejection in common multiple testing setups should not proceed according to the ranking of the p-values, but by an alternative ranking implied by the covariate-weighted p-values.
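As a concrete reference point, the covariate-derived weights enter a weighted Benjamini-Hochberg step; the sketch below (illustrative code, not the IHW package) applies BH to $P_i / w_i$ with weights rescaled to mean one, with the understanding that in IHW the $w_i$ are learned from the covariates on independent folds (cross-weighting), never from the p-values being tested:

```python
# Illustrative weighted Benjamini-Hochberg step with covariate-derived weights.
import numpy as np

def weighted_bh(pvals: np.ndarray, weights: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Return a boolean rejection mask; weights are rescaled to mean one."""
    m = len(pvals)
    w = weights * m / weights.sum()
    q = np.where(w > 0, pvals / w, np.inf)        # covariate-weighted p-values
    order = np.argsort(q)
    thresh = alpha * np.arange(1, m + 1) / m      # BH step-up thresholds
    below = q[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```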
The Cox model is an indispensable tool for time-to-event analysis, particularly in biomedical research. However, medicine is undergoing a profound transformation, generating data at an unprecedented scale, which opens new frontiers to study and understand diseases. With the wealth of data collected, new challenges for statistical inference arise, as datasets are often high dimensional, exhibit an increasing number of measurements at irregularly spaced time points, and are simply too large to fit in memory. Many current implementations for time-to-event analysis are ill-suited for these problems as inference is computationally demanding and requires access to the full data at once. Here we propose a Bayesian version for the counting process representation of Cox's partial likelihood for efficient inference on large-scale datasets with millions of data points and thousands of time-dependent covariates. Through the combination of stochastic variational inference and a reweighting of the log-likelihood, we obtain an approximation for the posterior distribution that factorizes over subsamples of the data, enabling the analysis in big data settings. Crucially, the method produces viable uncertainty estimates for large-scale and high-dimensional datasets. We show the utility of our method through a simulation study and an application to myocardial infarction in the UK Biobank.
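For orientation, and as a generic sketch rather than the paper's exact derivation, the classical Cox partial log-likelihood for coefficients $\beta$, covariates $x_i$, event indicators $\delta_i$ and risk sets $R(t_i)$ reads

$\ell(\beta) = \sum_{i:\,\delta_i=1} \big[ x_i^\top \beta - \log \sum_{j \in R(t_i)} \exp(x_j^\top \beta) \big]$,

and the stochastic variational inference step described above replaces the full-data likelihood term in the evidence lower bound with a reweighted subsample $S$ of the $N$ observations,

$\mathcal{L}(q) \approx \frac{N}{|S|} \sum_{i \in S} \mathbb{E}_{q(\beta)}\!\left[\log p(\mathrm{data}_i \mid \beta)\right] - \mathrm{KL}\!\left(q(\beta) \,\|\, p(\beta)\right)$,

so that each minibatch contributes an unbiased estimate of the full-data term; the counting-process representation used in the paper additionally accommodates time-dependent covariates.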
Segmentation of very large images is a common problem in microscopy, medical imaging or remote sensing. The problem is usually addressed by sliding window inference, which can theoretically lead to seamlessly stitched predictions. However, in practice many of the popular pipelines still suffer from tiling artifacts. We investigate the root cause of these issues and show that they stem from the normalization layers within the neural networks. We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. Finally, we propose to use BatchRenorm as the most suitable normalization strategy, which effectively removes tiling artifacts and enhances transfer performance, thereby improving the reusability of trained networks for new datasets.
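The sketch below (illustrative names, not code from the paper) shows plain sliding-window inference with averaging of overlaps; the point made above is that even with such stitching, normalization layers that recompute statistics per tile normalize neighbouring tiles differently and create seams, whereas inference-time use of fixed global statistics, as with the recommended BatchRenorm, does not:

```python
# Illustrative sliding-window (tiled) inference with averaging of overlaps.
import numpy as np

def tiled_predict(image: np.ndarray, model, tile: int = 256, stride: int = 192) -> np.ndarray:
    """image: (H, W) array; model: callable mapping a tile to a same-sized prediction."""
    H, W = image.shape
    out = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in range(0, max(H - tile, 0) + 1, stride):
        for x in range(0, max(W - tile, 0) + 1, stride):
            patch = image[y:y + tile, x:x + tile]
            # Per-tile forward pass: if the network's normalization layers
            # recompute statistics here, adjacent tiles are normalized
            # inconsistently and seams appear despite the averaging below.
            out[y:y + tile, x:x + tile] += model(patch)
            count[y:y + tile, x:x + tile] += 1.0
    return out / np.maximum(count, 1.0)
```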
Soft targets combined with the cross-entropy loss have been shown to improve the generalization performance of deep neural networks on supervised classification tasks. The standard cross-entropy loss, however, assumes the data to be categorically distributed, which may often not be the case in practice. In contrast, InfoNCE does not rely on such an explicit assumption but instead implicitly estimates the true conditional through negative sampling. Unfortunately, it cannot be combined with soft targets in its standard formulation, hindering its use in combination with sophisticated training strategies. In this paper, we address this limitation by proposing a loss function that is compatible with probabilistic targets. Our new soft target InfoNCE loss is conceptually simple, efficient to compute, and can be motivated through the framework of noise contrastive estimation. Using a toy example, we demonstrate shortcomings of the categorical distribution assumption of cross-entropy, and discuss implications of sampling from soft distributions. We observe that soft target InfoNCE performs on par with strong soft target cross-entropy baselines and outperforms hard target NLL and InfoNCE losses on popular benchmarks, including ImageNet. Finally, we provide a simple implementation of our loss, geared towards supervised classification and fully compatible with deep classification models trained with cross-entropy.
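A schematic variant of such a loss (a sketch, not necessarily the paper's exact formulation) scores each query against a pool of candidates, positives and negatives alike, and matches the resulting softmax over similarities to a soft target distribution instead of a one-hot label:

```python
# Schematic soft-target InfoNCE-style loss.
import torch
import torch.nn.functional as F

def soft_infonce(query: torch.Tensor, candidates: torch.Tensor,
                 soft_targets: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """query: (B, D); candidates: (K, D); soft_targets: (B, K), rows summing to 1."""
    q = F.normalize(query, dim=-1)
    c = F.normalize(candidates, dim=-1)
    logits = q @ c.t() / temperature           # similarity of each query to all candidates
    log_probs = F.log_softmax(logits, dim=-1)  # contrast against the full candidate pool
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```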
Summary: Linear mixed models are a commonly used statistical approach in genome-wide association studies when population structure is present. However, naive permutations to empirically estimate the null distribution of a statistic of interest are not appropriate in the presence of population structure, because the samples are not exchangeable with each other. For this reason we developed FlexLMM, a Nextflow pipeline that runs linear mixed models while allowing for flexibility in the definition of the exact statistical model to be used. FlexLMM can also be used to set a significance threshold via permutations, thanks to a two-step process where the population structure is first regressed out, and only then are the permutations performed. We envision this pipeline will be particularly useful for researchers working on multi-parental crosses among inbred lines of model organisms or farm animals and plants. Availability and implementation: The source code and documentation for FlexLMM are available at this https URL.
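The two-step permutation idea can be sketched as follows (illustrative code, not the FlexLMM pipeline itself): decorrelate the phenotype and genotype using the covariance implied by the fitted mixed model so that the residuals become exchangeable, and only then permute to build an empirical null for the test statistic:

```python
# Illustrative two-step permutation scheme for a mixed-model association test.
import numpy as np

def two_step_permutation_stats(y, genotype, V, n_perm=1000, seed=0):
    """y: (n,) phenotype; genotype: (n,) variant dosages; V: (n, n) covariance
    implied by the fitted mixed model (e.g., sigma_g^2 * K + sigma_e^2 * I)."""
    rng = np.random.default_rng(seed)
    L_inv = np.linalg.inv(np.linalg.cholesky(V))
    y_w, g_w = L_inv @ y, L_inv @ genotype     # step 1: regress out / decorrelate structure
    resid = y_w - y_w.mean()                   # exchangeable residuals (intercept only here)
    stats = []
    for _ in range(n_perm):                    # step 2: permute only the residuals
        perm = rng.permutation(resid)
        stats.append(abs(np.corrcoef(perm, g_w)[0, 1]))
    return np.array(stats)                     # empirical null distribution
```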
Living materials adapt their shape to signals from the environment, yet the impact of shape changes on signal processing and associated feedback dynamics remain unclear. We find that droplets with signal-responsive interfacial tensions exhibit shape bistability, excitable dynamics, and oscillations. The underlying critical points reveal novel mechanisms for physical signal processing through shape adaptation in soft active materials. We recover signatures of one such critical point in experimental data from zebrafish embryos, where it supports boundary formation.
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
Research software has become a central asset in academic research. It optimizes existing and enables new research methods, implements and embeds research knowledge, and constitutes an essential research product in itself. Research software must be sustainable so that existing research can be understood, replicated, reproduced, and built upon, and new research conducted effectively. In other words, software must be available, discoverable, usable, and adaptable to new needs, both now and in the future. Research software therefore requires an environment that supports sustainability. Hence, a change is needed in the way research software development and maintenance are currently motivated, incentivized, funded, structurally and infrastructurally supported, and legally treated. Failing to do so will threaten the quality and validity of research. In this paper, we identify challenges for research software sustainability in Germany and beyond, in terms of motivation, selection, research software engineering personnel, funding, infrastructure, and legal aspects. Besides researchers, we specifically address political and academic decision-makers to increase awareness of the importance and needs of sustainable research software practices. In particular, we recommend strategies and measures to create an environment for sustainable research software, with the ultimate goal of ensuring that software-driven research is valid, reproducible and sustainable, and that software is recognized as a first-class citizen in research. This paper is the outcome of two workshops run in Germany in 2019, at deRSE19 - the first International Conference of Research Software Engineers in Germany - and a dedicated DFG-supported follow-up workshop in Berlin.
The macroscopic behaviour of active matter arises from nonequilibrium microscopic processes. In soft materials, active stresses typically drive macroscopic shape changes, which in turn alter the geometry constraining the microscopic dynamics, leading to complex feedback effects. Although such mechanochemical coupling is common in living matter and associated with biological functions such as cell migration, division, and differentiation, the underlying principles are not well understood due to a lack of minimal models that bridge the scales from the microscopic biochemical processes to the macroscopic shape dynamics. To address this gap, we derive tractable coarse-grained equations from microscopic dynamics for a class of mechanochemical systems, in which biochemical signal processing is coupled to shape dynamics. Specifically, we consider molecular interactions at the surface of biological cells that commonly drive cell-cell signaling and adhesion, and obtain a macroscopic description of cells as signal-processing droplets that adaptively change their interfacial tensions. We find a rich phenomenology, including multistability, symmetry-breaking, excitability, and self-sustained shape oscillations, with the underlying critical points revealing universal characteristics of such systems. Our tractable framework provides a paradigm for how soft active materials respond to shape-dependent signals, and suggests novel modes of self-organisation at the collective scale. These are explored further in our companion paper [arXiv:2402.08664v3].
Estimating the reliability of individual predictions is key to increase the adoption of computational models and artificial intelligence in preclinical drug discovery, as well as to foster its application to guide decision making in clinical settings. Among the large number of algorithms developed over the last decades to compute prediction errors, Conformal Prediction (CP) has gained increasing attention in the computational drug discovery community. A major reason for its recent popularity is the ease of interpretation of the computed prediction errors in both classification and regression tasks. For instance, at a confidence level of 90%, the true value will fall within the predicted confidence intervals in at least 90% of the cases. This so-called validity of conformal predictors is guaranteed by the robust mathematical foundation underlying CP. The versatility of CP relies on its minimal computational footprint, as it can be easily coupled to any machine learning algorithm at little computational cost. In this review, we summarize underlying concepts and practical applications of CP with a particular focus on virtual screening and activity modelling, and list open source implementations of relevant software. Finally, we describe the current limitations in the field, and provide a perspective on future opportunities for CP in preclinical and clinical drug discovery.
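For concreteness, the split (inductive) conformal recipe behind such validity guarantees can be written in a few lines; this is a generic sketch assuming a scikit-learn-style regressor with a .predict method, not any specific drug-discovery model:

```python
# Generic split conformal prediction intervals for regression.
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_new, alpha=0.1):
    """model: fitted regressor with .predict; returns (lower, upper) arrays."""
    residuals = np.abs(y_cal - model.predict(X_cal))     # calibration nonconformity scores
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))              # conformal quantile index
    q = np.sort(residuals)[min(k, n) - 1]                # clamped for simplicity
    pred = model.predict(X_new)
    return pred - q, pred + q                            # >= 1 - alpha coverage under exchangeability
```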
The structure and dynamics of important biological quasi-two-dimensional systems, ranging from cytoskeletal gels to tissues, are controlled by nematic order, flow, defects and activity. Continuum hydrodynamic descriptions combined with numerical simulations have been used to understand such complex systems. The development of thermodynamically consistent theories and numerical methods to model active nemato-hydrodynamics is eased by mathematical formalisms enabling systematic derivations and structure-preserving algorithms. Alternative to classical nonequilibrium thermodynamics and bracket formalisms, here we develop a theoretical and computational framework for active nematics based on Onsager's variational formalism for irreversible thermodynamics, according to which the dynamics result from the minimization of a Rayleighian functional capturing the competition between free-energy release, dissipation and activity. We show that two standard incompressible models of active nemato-hydrodynamics can be framed in the variational formalism, and develop a new compressible model for density-dependent active nemato-hydrodynamics relevant to model actomyosin gels. We show that the variational principle enables a direct and transparent derivation not only of the governing equations, but also of the finite element numerical scheme. We exercise this model in two representative examples of active nemato-hydrodynamics relevant to the actin cytoskeleton during wound healing and to the dynamics of confined colonies of elongated cells.
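As a schematic of the variational structure invoked here (generic, not the paper's specific functionals): for generalized rates $\dot{q}$, the Rayleighian balances the rate of free-energy release, dissipation, and active power,

$\mathcal{R}[\dot{q}] = \frac{d\mathcal{F}}{dt}[\dot{q}] + \mathcal{D}[\dot{q},\dot{q}] + \mathcal{P}_{\mathrm{act}}[\dot{q}]$,

and the dynamics follow from its constrained minimization with respect to the rates, $\delta\mathcal{R}/\delta\dot{q} = 0$, with constraints such as incompressibility enforced through Lagrange multipliers. Because the same functional generates the governing equations, discretizing $\mathcal{R}$ with finite elements yields the numerical scheme directly, which is the route taken in this work.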
Background: Systems biology projects and omics technologies have led to a growing number of biochemical pathway reconstructions. However, mathematical models are still most often created de novo, based on reading the literature and processing pathway data manually. Results: To increase the efficiency with which such models can be created, we automatically generated mathematical models from pathway representations using a suite of freely available software. We produced models that combine data from KEGG PATHWAY, BioCarta, MetaCyc and SABIO-RK. According to the source data, three types of models are provided: kinetic, logical and constraint-based. All models are encoded using the SBML Core and Qual packages, and are available through BioModels Database. Each model contains the list of participants, the interactions, and the relevant mathematical constructs, but, in most cases, no meaningful parameter values. Most models are also available as easy-to-understand graphical SBGN maps. Conclusions: To date, the project has resulted in more than 140,000 freely available models. We believe this resource can tremendously accelerate the development of mathematical models by providing starting points ready for parametrization.
Brillouin Light Scattering (BLS) spectroscopy is a non-invasive, non-contact, label-free optical technique that can provide information on the mechanical properties of a material on the sub-micron scale. Over the last decade it has seen increased applications in the life sciences, driven by the observed significance of mechanical properties in biological processes, the realization of more sensitive BLS spectrometers and its extension to an imaging modality. As with other spectroscopic techniques, BLS measurements not only detect signals characteristic of the investigated sample, but also of the experimental apparatus, and can be significantly affected by measurement conditions. The aim of this consensus statement is to improve the comparability of BLS studies by providing reporting recommendations for the measured parameters and detailing common artifacts. Given that most BLS studies of biological matter are still at proof-of-concept stages and use different--often self-built--spectrometers, a consensus statement is particularly timely to assure unified advancement.
Optimization is key to solving many problems in computational biology. Global optimization methods provide a robust methodology, and metaheuristics in particular have proven to be the most efficient methods for many applications. Despite their utility, there is limited availability of metaheuristic tools. We present MEIGO, an R and Matlab optimization toolbox (also available in Python via a wrapper of the R version) that implements metaheuristics capable of solving diverse problems arising in systems biology and bioinformatics: the enhanced scatter search method (eSS) for continuous nonlinear programming (cNLP) and mixed-integer programming (MINLP) problems, and variable neighborhood search (VNS) for Integer Programming (IP) problems. Both methods can be run on a single thread or in parallel using a cooperative strategy. The code is supplied under GPLv3 and is available at \url{this http URL}. Documentation and examples are included. The R package has been submitted to Bioconductor. We evaluate MEIGO against optimization benchmarks, and illustrate its applicability to a series of case studies in bioinformatics and systems biology, outperforming other state-of-the-art methods. MEIGO provides a free, open-source platform for optimization that can be applied to multiple domains of systems biology and bioinformatics. It includes efficient state-of-the-art metaheuristics, and its open and modular structure allows the addition of further methods.
Viruses and their hosts are involved in an 'arms race' where they continually evolve mechanisms to overcome each other. It has long been proposed that intrinsic disorder provides a substrate for the evolution of viral hijack functions and that short linear motifs (SLiMs) are important players in this process. Here, we review evidence in support of this tenet from two model systems: the papillomavirus E7 protein and the adenovirus E1A protein. Phylogenetic reconstructions reveal that SLiMs appear and disappear multiple times across evolution, providing evidence of convergent evolution within individual viral phylogenies. Multiple functionally related SLiMs show strong co-evolution signals that persist across long distances in the primary sequence and occur in unrelated viral proteins. Moreover, changes in SLiMs are associated with changes in phenotypic traits such as host range and tropism. Tracking viral evolutionary events reveals that host switch events are associated with the loss of several SLiMs, suggesting that SLiMs are under functional selection and that changes in SLiMs support viral adaptation. Fine-tuning of viral SLiM sequences can improve affinity, allowing them to outcompete host counterparts. However, viral SLiMs are not always competitive by themselves, and tethering of two suboptimal SLiMs by a disordered linker may instead enable viral hijack. Coevolution between the SLiMs and the linker indicates that the evolution of disordered regions may be more constrained than previously thought. In summary, experimental and computational studies support a role for SLiMs and intrinsic disorder in viral hijack functions and in viral adaptive evolution.
We report a droplet microfluidic method to target and sort individual cells directly from complex microbiome samples, and to prepare these cells for bulk whole genome sequencing without cultivation. We characterize this approach by recovering bacteria spiked into human stool samples at a ratio as low as 1:250 and by successfully enriching endogenous Bacteroides vulgatus to the level required for de novo assembly of high-quality genomes. While microbiome strains are in increasing demand for biomedical applications, the vast majority of species and strains remain uncultivated and lack reference genomes. We address this shortcoming by encapsulating complex microbiome samples directly into microfluidic droplets and amplifying a target-specific genomic fragment using a custom TaqMan probe. We separate the positive droplets by droplet sorting, selectively enriching single cells of the target strain. Finally, we present a protocol to purify the genomic DNA while specifically removing amplicons and cell debris for high-quality genome sequencing.
In a prototypical mode of single-cell migration, retrograde cytoskeletal flow is mechanically coupled to the environment, propels the cell, and is sustained by an anterograde cytosolic flow of disassembled cytoskeletal components. Supracellular collectives also develop fountain-flows to migrate, but the opposing cellular streams interact with the environment producing conflicting forces. To understand the biophysical constraints of fountain-flow supracellular migration, we develop an active gel model of a cell cluster driven by a polarized peripheral contractile cable. While the model develops fountain-flows and directed migration, efficiency and cluster velocity are extremely small compared to observations. We find that patterned friction or cluster-polarized single-cell directed migration, both suggested by contact inhibition of locomotion, rescue robust and efficient supracellular migration.
Short-range interactions and long-range contacts drive the 3D folding of structured proteins. The proteins' structure has a direct impact on their biological function. However, nearly 40% of the eukaryotic proteome is composed of intrinsically disordered proteins (IDPs) and protein regions that fluctuate between ensembles of numerous conformations. Therefore, to understand their biological function, it is critical to depict how the structural ensemble statistics correlate with the IDPs' amino acid sequence. Here, using small-angle x-ray scattering (SAXS) and time-resolved Förster resonance energy transfer (trFRET), we study the intra-molecular structural heterogeneity of the neurofilament low intrinsically disordered tail domain (NFLt). Using theoretical results from polymer physics, we find that the Flory scaling exponent of NFLt sub-segments correlates linearly with their net charge, ranging from statistics of ideal to self-avoiding chains. Surprisingly, when measuring the same segments in the context of the whole NFLt protein, we find that regardless of the peptide sequence, the segments' structural statistics are more expanded than when measured independently. Our findings show that while polymer physics can, to some level, relate the IDP's sequence to its ensemble conformations, long-range contacts between distant amino acids play a crucial role in determining intra-molecular structures. This emphasizes the necessity of advanced polymer theories to fully describe IDP ensembles, in the hope that this will allow us to model their biological function.
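The polymer-physics relation referenced above is the Flory scaling of a chain segment of $N$ monomers with effective monomer size $b$,

$R \sim b N^{\nu}$,

where $\nu = 1/2$ corresponds to ideal (Gaussian) chain statistics and $\nu \approx 0.588$ to a self-avoiding chain; the fitted exponent of the NFLt sub-segments interpolates between these limits and, as reported here, varies approximately linearly with the segment's net charge.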