European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI)
A fundamental task in the analysis of datasets with many variables is screening for associations. This can be cast as a multiple testing task, where the objective is achieving high detection power while controlling type I error. We consider $m$ hypothesis tests represented by pairs $((P_i, X_i))_{1\leq i \leq m}$ of p-values $P_i$ and covariates $X_i$, such that $P_i \perp X_i$ if $H_i$ is null. Here, we show how to use information potentially available in the covariates about heterogeneities among hypotheses to increase power compared to conventional procedures that only use the $P_i$. To this end, we upgrade existing weighted multiple testing procedures through the Independent Hypothesis Weighting (IHW) framework to use data-driven weights that are calculated as a function of the covariates. Finite sample guarantees, e.g., false discovery rate (FDR) control, are derived from cross-weighting, a data-splitting approach that enables learning the weight-covariate function without overfitting as long as the hypotheses can be partitioned into independent folds, with arbitrary within-fold dependence. IHW has increased power compared to methods that do not use covariate information. A key implication of IHW is that hypothesis rejection in common multiple testing setups should not proceed according to the ranking of the p-values, but by an alternative ranking implied by the covariate-weighted p-values.
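To make the weighting concrete, here is a minimal numpy sketch of a covariate-weighted Benjamini-Hochberg procedure with a toy cross-weighting scheme, in the spirit of the abstract. It is not the IHW Bioconductor implementation; the fold assignment, covariate binning, and the enrichment score used as a weight are illustrative choices only.

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.1):
    """Weighted Benjamini-Hochberg: apply BH to p_i / w_i with weights of mean one."""
    m = len(pvals)
    w = weights * m / weights.sum()                  # normalise so sum(w) == m
    q = pvals / np.clip(w, 1e-12, None)              # covariate-weighted p-values
    order = np.argsort(q)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = q[order] <= thresholds
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

def cross_weighted_bh(pvals, covariate, alpha=0.1, n_folds=5, n_bins=5, seed=0):
    """Toy cross-weighting: the weight of each hypothesis is learned only from the
    other folds, by scoring covariate bins by their enrichment of small p-values."""
    m = len(pvals)
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=m)
    edges = np.quantile(covariate, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(covariate, edges)
    weights = np.ones(m)
    for f in range(n_folds):
        held_out = folds == f
        for b in range(n_bins):
            train = ~held_out & (bins == b)
            if train.any():
                # enrichment of small p-values in this bin -> larger weight (illustrative score)
                weights[held_out & (bins == b)] = 1e-3 + np.mean(pvals[train] < 0.05)
    return weighted_bh(pvals, weights, alpha)
```

The key invariants are that the weights average to one, so the weighted procedure spends the same overall budget as plain BH, and that each hypothesis's weight is learned without looking at its own fold.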
Segmentation of very large images is a common problem in microscopy, medical imaging or remote sensing. The problem is usually addressed by sliding window inference, which can theoretically lead to seamlessly stitched predictions. However, in practice many of the popular pipelines still suffer from tiling artifacts. We investigate the root cause of these issues and show that they stem from the normalization layers within the neural networks. We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. Finally, we propose to use BatchRenorm as the most suitable normalization strategy, which effectively removes tiling artifacts and enhances transfer performance, thereby improving the reusability of trained networks for new datasets.
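For context, this is a minimal numpy sketch of the sliding-window inference setting in which tiling artifacts arise: tiles are predicted independently and averaged in their overlaps. `predict_tile` stands in for any trained segmentation network; this is not the paper's pipeline and does not include the proposed BatchRenorm fix.

```python
import numpy as np

def _tile_starts(size, tile, stride):
    """Start offsets covering [0, size), with a final tile flush with the border."""
    starts = list(range(0, max(size - tile, 0) + 1, stride))
    if starts[-1] + tile < size:
        starts.append(max(size - tile, 0))
    return starts

def sliding_window_predict(image, predict_tile, tile=256, overlap=32):
    """Run a tile-wise predictor over a large 2-D image and average
    the predictions in the overlap regions."""
    h, w = image.shape
    stride = tile - overlap
    out = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    for y in _tile_starts(h, tile, stride):
        for x in _tile_starts(w, tile, stride):
            patch = image[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] += predict_tile(patch)   # e.g. a segmentation network
            counts[y:y + tile, x:x + tile] += 1.0
    return out / counts
```

If the network's normalization layers make predictions depend on tile-level statistics, the same pixel can receive different values from different tiles, and the averaged output shows seams at tile boundaries.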
Soft targets combined with the cross-entropy loss have been shown to improve generalization performance of deep neural networks on supervised classification tasks. The standard cross-entropy loss, however, assumes data to be categorically distributed, which may often not be the case in practice. In contrast, InfoNCE does not rely on such an explicit assumption but instead implicitly estimates the true conditional through negative sampling. Unfortunately, it cannot be combined with soft targets in its standard formulation, hindering its use in combination with sophisticated training strategies. In this paper, we address this limitation by proposing a loss function that is compatible with probabilistic targets. Our new soft target InfoNCE loss is conceptually simple, efficient to compute, and can be motivated through the framework of noise contrastive estimation. Using a toy example, we demonstrate shortcomings of the categorical distribution assumption of cross-entropy, and discuss implications of sampling from soft distributions. We observe that soft target InfoNCE performs on par with strong soft target cross-entropy baselines and outperforms hard target NLL and InfoNCE losses on popular benchmarks, including ImageNet. Finally, we provide a simple implementation of our loss, geared towards supervised classification and fully compatible with deep classification models trained with cross-entropy.
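For reference, a minimal numpy sketch of the standard (hard-target) InfoNCE loss the abstract refers to: the positive enters as a one-hot target over the similarity logits, which is exactly what prevents plugging in probabilistic (soft) targets. This is not the paper's proposed soft target InfoNCE loss.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.1):
    """Standard (hard-target) InfoNCE: cross-entropy of a one-hot target over the
    scaled similarities of one positive and K negatives.

    query, positive: (d,) vectors; negatives: (K, d); all assumed L2-normalised.
    """
    sims = np.concatenate(([positive @ query], negatives @ query)) / temperature
    m = sims.max()                                   # stabilised log-sum-exp
    log_z = m + np.log(np.exp(sims - m).sum())
    return log_z - sims[0]                           # -log p(positive | query)
```

Replacing index 0 with a probability vector over the candidates is not meaningful in this form, which is the limitation the paper's soft-target formulation addresses.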
Summary: Linear mixed models are a commonly used statistical approach in genome-wide association studies when population structure is present. However, naive permutations to empirically estimate the null distribution of a statistic of interest are not appropriate in the presence of population structure, because the samples are not exchangeable with each other. For this reason, we developed FlexLMM, a Nextflow pipeline that runs linear mixed models while allowing for flexibility in the definition of the exact statistical model to be used. FlexLMM can also be used to set a significance threshold via permutations, thanks to a two-step process where the population structure is first regressed out, and only then are the permutations performed. We envision this pipeline will be particularly useful for researchers working on multi-parental crosses among inbred lines of model organisms or farm animals and plants. Availability and implementation: The source code and documentation for FlexLMM are available at this https URL.
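A conceptual numpy sketch of the two-step permutation idea described above, under simplifying assumptions: the kinship matrix K is taken as known and used to whiten phenotype and genotype before covariates are regressed out, and only then are permutations applied to the residuals. Function and variable names are illustrative; the actual FlexLMM pipeline is implemented in Nextflow and fits a full linear mixed model.

```python
import numpy as np

def two_step_permutation_pvalue(y, genotype, K, covariates, n_perm=1000, seed=0):
    """Conceptual two-step permutation test:
    1) whiten phenotype and genotype with the kinship matrix K and regress out
       covariates, so the residuals are (approximately) exchangeable;
    2) permute the genotype residuals and recompute the association statistic."""
    rng = np.random.default_rng(seed)
    # Whiten with the inverse Cholesky factor of the (regularised) kinship matrix
    L_inv = np.linalg.inv(np.linalg.cholesky(K + 1e-6 * np.eye(len(K))))
    y_w, g_w, X_w = L_inv @ y, L_inv @ genotype, L_inv @ covariates
    # Regress covariates out of both phenotype and genotype
    beta_y, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
    beta_g, *_ = np.linalg.lstsq(X_w, g_w, rcond=None)
    ry, rg = y_w - X_w @ beta_y, g_w - X_w @ beta_g
    stat = abs(np.corrcoef(ry, rg)[0, 1])
    null = np.array([abs(np.corrcoef(ry, rng.permutation(rg))[0, 1])
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)
```

Because the population structure is removed before permuting, the permutation null respects the exchangeability assumption that naive phenotype permutations violate.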
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
Living materials adapt their shape to signals from the environment, yet the impact of shape changes on signal processing and associated feedback dynamics remain unclear. We find that droplets with signal-responsive interfacial tensions exhibit shape bistability, excitable dynamics, and oscillations. The underlying critical points reveal novel mechanisms for physical signal processing through shape adaptation in soft active materials. We recover signatures of one such critical point in experimental data from zebrafish embryos, where it supports boundary formation.
Research software has become a central asset in academic research. It optimizes existing and enables new research methods, implements and embeds research knowledge, and constitutes an essential research product in itself. Research software must be sustainable in order to understand, replicate, reproduce, and build upon existing research or conduct new research effectively. In other words, software must be available, discoverable, usable, and adaptable to new needs, both now and in the future. Research software therefore requires an environment that supports sustainability. Hence, a change is needed in the way research software development and maintenance are currently motivated, incentivized, funded, structurally and infrastructurally supported, and legally treated. Failing to do so will threaten the quality and validity of research. In this paper, we identify challenges for research software sustainability in Germany and beyond, in terms of motivation, selection, research software engineering personnel, funding, infrastructure, and legal aspects. Besides researchers, we specifically address political and academic decision-makers to increase awareness of the importance and needs of sustainable research software practices. In particular, we recommend strategies and measures to create an environment for sustainable research software, with the ultimate goal of ensuring that software-driven research is valid, reproducible and sustainable, and that software is recognized as a first-class citizen in research. This paper is the outcome of two workshops run in Germany in 2019: deRSE19, the first International Conference of Research Software Engineers in Germany, and a dedicated DFG-supported follow-up workshop in Berlin.
Estimating the reliability of individual predictions is key to increasing the adoption of computational models and artificial intelligence in preclinical drug discovery, as well as to fostering their application to guide decision making in clinical settings. Among the large number of algorithms developed over the last decades to compute prediction errors, Conformal Prediction (CP) has gained increasing attention in the computational drug discovery community. A major reason for its recent popularity is the ease of interpretation of the computed prediction errors in both classification and regression tasks. For instance, at a confidence level of 90%, the true value will be within the predicted confidence intervals in at least 90% of the cases. This so-called validity of conformal predictors is guaranteed by the robust mathematical foundation underlying CP. The versatility of CP relies on its minimal computational footprint, as it can be easily coupled to any machine learning algorithm at little computational cost. In this review, we summarize underlying concepts and practical applications of CP with a particular focus on virtual screening and activity modelling, and list open source implementations of relevant software. Finally, we describe the current limitations in the field, and provide a perspective on future opportunities for CP in preclinical and clinical drug discovery.
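As an illustration of the validity property described above, here is a minimal numpy sketch of split (inductive) conformal regression; the least-squares fit is a stand-in for any machine learning model, which is the point about the minimal computational footprint: CP only needs point predictions and a held-out calibration set.

```python
import numpy as np

def split_conformal_intervals(X_train, y_train, X_cal, y_cal, X_new, alpha=0.1):
    """Split conformal regression: prediction intervals with coverage >= 1 - alpha.

    A plain least-squares fit stands in for an arbitrary regression model;
    the calibration residuals set the interval half-width."""
    # 1) fit any point predictor on the proper training set
    A = np.c_[np.ones(len(X_train)), X_train]
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    predict = lambda X: np.c_[np.ones(len(X)), X] @ beta
    # 2) nonconformity scores on the held-out calibration set
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # finite-sample corrected quantile (interpolated here, so coverage is approximate)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    # 3) symmetric intervals around the new predictions
    mu = predict(X_new)
    return mu - q, mu + q
```

With alpha = 0.1, the returned intervals cover the true values in roughly 90% of new cases, which is the validity guarantee the abstract refers to.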
The macroscopic behaviour of active matter arises from nonequilibrium microscopic processes. In soft materials, active stresses typically drive macroscopic shape changes, which in turn alter the geometry constraining the microscopic dynamics, leading to complex feedback effects. Although such mechanochemical coupling is common in living matter and associated with biological functions such as cell migration, division, and differentiation, the underlying principles are not well understood due to a lack of minimal models that bridge the scales from the microscopic biochemical processes to the macroscopic shape dynamics. To address this gap, we derive tractable coarse-grained equations from microscopic dynamics for a class of mechanochemical systems, in which biochemical signal processing is coupled to shape dynamics. Specifically, we consider molecular interactions at the surface of biological cells that commonly drive cell-cell signaling and adhesion, and obtain a macroscopic description of cells as signal-processing droplets that adaptively change their interfacial tensions. We find a rich phenomenology, including multistability, symmetry-breaking, excitability, and self-sustained shape oscillations, with the underlying critical points revealing universal characteristics of such systems. Our tractable framework provides a paradigm for how soft active materials respond to shape-dependent signals, and suggests novel modes of self-organisation at the collective scale. These are explored further in our companion paper [arxiv 2402.08664v3].
The structure and dynamics of important biological quasi-two-dimensional systems, ranging from cytoskeletal gels to tissues, are controlled by nematic order, flow, defects and activity. Continuum hydrodynamic descriptions combined with numerical simulations have been used to understand such complex systems. The development of thermodynamically consistent theories and numerical methods to model active nemato-hydrodynamics is eased by mathematical formalisms enabling systematic derivations and structure-preserving algorithms. As an alternative to classical nonequilibrium thermodynamics and bracket formalisms, here we develop a theoretical and computational framework for active nematics based on Onsager's variational formalism of irreversible thermodynamics, according to which the dynamics result from the minimization of a Rayleighian functional capturing the competition between free-energy release, dissipation and activity. We show that two standard incompressible models of active nemato-hydrodynamics can be framed in the variational formalism, and develop a new compressible model for density-dependent active nemato-hydrodynamics relevant to modeling actomyosin gels. We show that the variational principle enables a direct and transparent derivation not only of the governing equations, but also of the finite element numerical scheme. We exercise this model in two representative examples of active nemato-hydrodynamics relevant to the actin cytoskeleton during wound healing and to the dynamics of confined colonies of elongated cells.
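For orientation, the generic structure of Onsager's variational principle as it is commonly stated is sketched below; the specific Rayleighian for (in)compressible active nemato-hydrodynamics derived in the paper contains concrete free-energy, dissipation and activity functionals that are not reproduced here.

```latex
% Schematic Onsager variational principle (not the paper's specific functional):
% the generalized rates V follow from minimising a Rayleighian
\mathcal{R}[V] \;=\; \dot{\mathcal{F}}[V] \;+\; \mathcal{D}[V] \;+\; \mathcal{P}_{\mathrm{act}}[V],
\qquad \frac{\delta \mathcal{R}}{\delta V} \;=\; 0,
```
where the first term is the rate of change of the free energy, the second a dissipation potential quadratic and positive in the rates, and the third the power supplied by active processes; constraints such as incompressibility enter through Lagrange multipliers.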
We reanalyze the hydrodynamic theory of "flocks", that is, polar-ordered "dry" active fluids, in two dimensions. For "Malthusian" flocks, in which birth and death cause the density to relax quickly, thereby eliminating density as a hydrodynamic variable, we are able to obtain two exact scaling laws relating the three scaling exponents characterizing the long-distance properties of these systems. We also show that it is highly plausible that such flocks display long-range order in two dimensions. In addition, we demonstrate that for "immortal" flocks, in which the number of flockers is conserved, the extra non-linearities allowed by the presence of an extra slow variable (number density) make it impossible to obtain any exact scaling relations between the exponents. We thereby demonstrate that several past published claims of exact exponents for Malthusian and immortal flocks are all incorrect.
Brillouin Light Scattering (BLS) spectroscopy is a non-invasive, non-contact, label-free optical technique that can provide information on the mechanical properties of a material on the sub-micron scale. Over the last decade it has seen increased applications in the life sciences, driven by the observed significance of mechanical properties in biological processes, the realization of more sensitive BLS spectrometers and its extension to an imaging modality. As with other spectroscopic techniques, BLS measurements detect signals characteristic not only of the investigated sample but also of the experimental apparatus, and can be significantly affected by measurement conditions. The aim of this consensus statement is to improve the comparability of BLS studies by providing reporting recommendations for the measured parameters and detailing common artifacts. Given that most BLS studies of biological matter are still at proof-of-concept stages and use different, often self-built, spectrometers, a consensus statement is particularly timely to ensure unified advancement.
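As context for the reported parameters, these are the standard relations linking the measured Brillouin frequency shift to the hypersound velocity and the longitudinal storage modulus; the symbols are the conventional ones and the relations are textbook results, not recommendations taken from the consensus statement itself.

```latex
% Standard Brillouin scattering relations (textbook results, not from the statement):
\nu_B \;=\; \frac{2 n v}{\lambda_0}\,\sin\!\frac{\theta}{2}
\qquad\Longrightarrow\qquad
v \;=\; \frac{\nu_B\,\lambda_0}{2 n \sin(\theta/2)},
\qquad
M' \;=\; \rho\, v^{2},
```
where n is the refractive index, v the hypersound velocity, λ0 the incident (vacuum) wavelength, θ the scattering angle, ρ the mass density, and M' the real part of the longitudinal modulus; in the common backscattering geometry θ = 180°, so sin(θ/2) = 1. This dependence on n, ρ and geometry is one reason the measurement conditions and apparatus details matter for comparability.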
Short-range interactions and long-range contacts drive the 3D folding of structured proteins. The proteins' structure has a direct impact on their biological function. However, nearly 40% of the eukaryotic proteome is composed of intrinsically disordered proteins (IDPs) and protein regions that fluctuate between ensembles of numerous conformations. Therefore, to understand their biological function, it is critical to depict how the structural ensemble statistics correlate with the IDPs' amino acid sequence. Here, using small-angle X-ray scattering (SAXS) and time-resolved Förster resonance energy transfer (trFRET), we study the intra-molecular structural heterogeneity of the neurofilament low intrinsically disordered tail domain (NFLt). Using theoretical results of polymer physics, we find that the Flory scaling exponent of NFLt sub-segments correlates linearly with their net charge, ranging from statistics of ideal to self-avoiding chains. Surprisingly, measuring the same segments in the context of the whole NFLt protein, we find that regardless of the peptide sequence, the segments' structural statistics are more expanded than when measured independently. Our findings show that while polymer physics can, to some level, relate the IDP's sequence to its ensemble conformations, long-range contacts between distant amino acids play a crucial role in determining intra-molecular structures. This emphasizes the necessity of advanced polymer theories to fully describe IDP ensembles, with the hope that this will allow us to model their biological function.
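The analysis rests on the textbook Flory scaling relation, sketched below with its limiting exponents; the NFLt-specific exponents and their linear dependence on net charge are results of the paper and are not reproduced here.

```latex
% Flory scaling of a chain segment of N monomers (textbook relation, not a result of the paper):
R \;\sim\; b\,N^{\nu},
\qquad \nu = \tfrac{1}{2}\ \text{(ideal chain)},
\qquad \nu \approx 0.588\ \text{(self-avoiding chain; Flory estimate } \nu = \tfrac{3}{5}\text{)},
```
so fitting each sub-segment's measured size statistics to this power law yields the scaling exponent that the abstract correlates with net charge.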
Viruses and their hosts are involved in an 'arms race' where they continually evolve mechanisms to overcome each other. It has long been proposed that intrinsic disorder provides a substrate for the evolution of viral hijack functions and that short linear motifs (SLiMs) are important players in this process. Here, we review evidence in support of this tenet from two model systems: the papillomavirus E7 protein and the adenovirus E1A protein. Phylogenetic reconstructions reveal that SLiMs appear and disappear multiple times across evolution, providing evidence of convergent evolution within individual viral phylogenies. Multiple functionally related SLiMs show strong co-evolution signals that persist across long distances in the primary sequence and occur in unrelated viral proteins. Moreover, changes in SLiMs are associated with changes in phenotypic traits such as host range and tropism. Tracking viral evolutionary events reveals that host switch events are associated with the loss of several SLiMs, suggesting that SLiMs are under functional selection and that changes in SLiMs support viral adaptation. Fine-tuning of viral SLiM sequences can improve affinity, allowing them to outcompete host counterparts. However, viral SLiMs are not always competitive by themselves, and tethering of two suboptimal SLiMs by a disordered linker may instead enable viral hijack. Coevolution between the SLiMs and the linker indicates that the evolution of disordered regions may be more constrained than previously thought. In summary, experimental and computational studies support a role for SLiMs and intrinsic disorder in viral hijack functions and in viral adaptive evolution.
We report a droplet microfluidic method to target and sort individual cells directly from complex microbiome samples, and to prepare these cells for bulk whole genome sequencing without cultivation. We characterize this approach by recovering bacteria spiked into human stool samples at a ratio as low as 1:250 and by successfully enriching endogenous Bacteroides vulgatus to the level required for de novo assembly of high-quality genomes. While microbiome strains are increasingly in demand for biomedical applications, the vast majority of species and strains are uncultivated and lack reference genomes. We address this shortcoming by encapsulating complex microbiome samples directly into microfluidic droplets and amplifying a target-specific genomic fragment using a custom molecular TaqMan probe. We then separate the positive droplets by droplet sorting, selectively enriching single cells of the target strain. Finally, we present a protocol to purify the genomic DNA while specifically removing amplicons and cell debris for high-quality genome sequencing.
In a prototypical mode of single-cell migration, retrograde cytoskeletal flow is mechanically coupled to the environment, propels the cell, and is sustained by an anterograde cytosolic flow of disassembled cytoskeletal components. Supracellular collectives also develop fountain-flows to migrate, but the opposing cellular streams interact with the environment producing conflicting forces. To understand the biophysical constraints of fountain-flow supracellular migration, we develop an active gel model of a cell cluster driven by a polarized peripheral contractile cable. While the model develops fountain-flows and directed migration, efficiency and cluster velocity are extremely small compared to observations. We find that patterned friction or cluster-polarized single-cell directed migration, both suggested by contact inhibition of locomotion, rescue robust and efficient supracellular migration.
The term Research Software Engineer, or RSE, emerged a little over 10 years ago as a way to represent individuals working in the research community but focusing on software development. The term has been widely adopted and there are a number of high-level definitions of what an RSE is. However, the roles of RSEs vary depending on the institutional context they work in. At one end of the spectrum, RSE roles may look similar to a traditional research role. At the other extreme, they resemble those of software engineers in industry. Most RSE roles inhabit the space between these two extremes. Therefore, providing a straightforward, comprehensive definition of what an RSE does and what experience, skills and competencies are required to become one is challenging. In this community paper we define the broad notion of what an RSE is, explore the different types of work they undertake, and define a list of fundamental competencies as well as values that define the general profile of an RSE. On this basis, we elaborate on the progression of these skills along different dimensions, looking at specific types of RSE roles, proposing recommendations for organisations, and giving examples of future specialisations. An appendix details how existing curricula fit into this framework.
Computational methods for assessing the likely impacts of mutations, known as variant effect predictors (VEPs), are widely used in the assessment and interpretation of human genetic variation, as well as in other applications like protein engineering. Many different VEPs have been released to date, and there is tremendous variability in their underlying algorithms and outputs, and in the ways in which the methodologies and predictions are shared. This leads to considerable challenges for end users in knowing which VEPs to use and how to use them. Here, to address these issues, we provide guidelines and recommendations for the release of novel VEPs. Emphasising open-source availability, transparent methodologies, clear variant effect score interpretations, standardised scales, accessible predictions, and rigorous training data disclosure, we aim to improve the usability and interpretability of VEPs, and promote their integration into analysis and evaluation pipelines. We also provide a large, categorised list of currently available VEPs, aiming to facilitate the discovery and encourage the usage of novel methods within the scientific community.
Introduced by Boltzmann under the name "monode," the microcanonical ensemble serves as the fundamental representation of equilibrium thermodynamics in statistical mechanics by counting all possible realizations of a system's states. Ensemble theory connects this idea with probability and information theory, leading to the notion of Shannon-Gibbs entropy and, ultimately, to the principle of maximum caliber describing trajectories of systems, in and out of equilibrium. While the latter phenomenological generalization reproduces many results of nonequilibrium thermodynamics, given a proper choice of observables, its physical justification remains an open area of research. What is the microscopic origin and physical interpretation of this variational approach? What guides the choice of relevant observables? We address these questions by extending Boltzmann's method to a microcanonical caliber principle and counting realizations of a system's trajectories, all assumed equally probable. Maximizing the microcanonical caliber under the imposed constraints, we systematically develop generalized detailed-balance relations, clarify the statistical origins of inhomogeneous transport, and provide an independent derivation of key equations from stochastic thermodynamics. This approach introduces a dynamical ensemble theory for nonequilibrium steady states in spatially extended and active systems. While verifying the equivalence of ensembles, e.g., those of Norton and Thévenin, our framework contests other common assumptions about nonequilibrium regimes, with supporting evidence provided by stochastic simulations. Our theory suggests further connections to the first principles of microscopic dynamics in classical statistical mechanics, which are essential for investigating systems where the necessary conditions for thermodynamic behavior are not satisfied.
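For reference, the standard (canonical) maximum-caliber construction that the abstract takes as its phenomenological starting point is sketched below; the paper's microcanonical counting of equally probable trajectories is a generalization of this and is not reproduced here.

```latex
% Canonical maximum caliber (the phenomenological principle referred to above):
% maximise path entropy over trajectory probabilities p_\Gamma subject to path-observable constraints
\mathcal{C}[p] \;=\; -\sum_{\Gamma} p_{\Gamma}\ln p_{\Gamma}
\;-\;\sum_i \lambda_i\Big(\sum_{\Gamma} p_{\Gamma} A_i(\Gamma)-\langle A_i\rangle\Big)
\quad\Longrightarrow\quad
p_{\Gamma}\;\propto\;\exp\!\Big(-\sum_i \lambda_i A_i(\Gamma)\Big),
```
where Γ labels trajectories, the A_i(Γ) are path observables and the λ_i their Lagrange multipliers; in the microcanonical variant, all trajectories compatible with the imposed constraints are instead counted with equal weight.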