Large Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead and require additional domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework that leverages a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4 to generate meaningful biomedical text rooted in established knowledge. Compared to existing KG-based RAG techniques, the proposed method uses a minimal graph schema for context extraction and embedding methods for context pruning. This optimization in context extraction results in more than a 50% reduction in token consumption without compromising accuracy, yielding a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (where available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4. In summary, the proposed framework combines the explicit knowledge of the KG with the implicit knowledge of the LLM in a token-optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions cost-effectively.
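As an illustration of the context-pruning idea, here is a minimal sketch, not the authors' KG-RAG implementation: candidate KG triples are scored against the user question in an embedding space and only the most similar ones are kept for the prompt. The `toy_embed` function is a hypothetical stand-in for a real sentence-embedding model.

```python
# Minimal sketch of embedding-based context pruning (not the authors' KG-RAG
# code): score candidate KG triples against the question and keep the top-k.
import hashlib
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a real sentence-embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def prune_context(question: str, triples: list[str], top_k: int = 3) -> list[str]:
    """Keep the triples whose embeddings are most similar to the question."""
    q = toy_embed(question)
    scores = [float(q @ toy_embed(t)) for t in triples]
    order = np.argsort(scores)[::-1][:top_k]
    return [triples[i] for i in order]

triples = [
    "Gene BRCA1 ASSOCIATES_WITH Disease breast carcinoma",
    "Compound metformin TREATS Disease type 2 diabetes mellitus",
    "Gene TP53 PARTICIPATES_IN Pathway apoptosis",
]
# The pruned triples would then be placed into the LLM prompt as context.
print(prune_context("Which gene is linked to breast carcinoma?", triples, top_k=1))
```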
Background and Objectives: Reproducibility is a major challenge in developing machine learning (ML)-based solutions in computational pathology (CompPath). The NCI Imaging Data Commons (IDC) provides >120 cancer image collections according to the FAIR principles and is designed to be used with cloud ML services. Here, we explore its potential to facilitate reproducibility in CompPath research. Methods: Using the IDC, we implemented two experiments in which a representative ML-based method for classifying lung tumor tissue was trained and/or evaluated on different datasets. To assess reproducibility, the experiments were run multiple times with separate but identically configured instances of common ML services. Results: The AUC values of different runs of the same experiment were generally consistent. However, we observed small variations in AUC values of up to 0.045, indicating a practical limit to reproducibility. Conclusions: We conclude that the IDC facilitates approaching the reproducibility limit of CompPath research (i) by enabling researchers to reuse exactly the same datasets and (ii) by integrating with cloud ML services so that experiments can be run in identically configured computing environments.
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are opening up new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
There has been a long debate about how new levels of organization evolve. Such transitions might seem unlikely, as cooperation must prevail over competition. One well-studied example is the emergence of autocatalytic sets, which seem to be a prerequisite for the evolution of life. Using a simple model, we investigate how varying the bias toward cooperation versus antagonism shapes network dynamics, revealing that higher-order organization emerges even amid pervasive antagonistic interactions. In general, we observe that a quantitative increase in the number of elements in a system leads to a qualitative transition. We present a random threshold-directed network model that integrates node-specific traits with dynamic edge formation and node removal, simulating arbitrary levels of cooperation and competition. In our framework, intrinsic node values determine directed links through various threshold rules. Our model generates a multi-digraph with signed edges (reflecting support/antagonism, labeled ``help''/``harm''), which ultimately yields two parallel yet interdependent threshold graphs. Incorporating temporal growth and node turnover in our approach allows exploration of the evolution, adaptation, and potential collapse of communities and reveals phase transitions in both connectivity and resilience. Our findings extend classical random threshold and Erdős-Rényi models, offering new insights into adaptive systems in biological and economic contexts, with emphasis on the application to Collective Affordance Sets. This framework should also be useful for making predictions that will be tested by ongoing experiments on microbial communities in soil.
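A minimal sketch of the kind of construction described, with an assumed threshold rule and cooperation bias (not the paper's exact rules): intrinsic node values determine directed links, and each link is labeled ``help'' or ``harm''.

```python
# Illustrative sketch, not the paper's exact rules: intrinsic node values
# determine directed links via a threshold rule, and each link is signed
# "help" (+1) or "harm" (-1) according to a cooperation bias.
import random

def build_signed_threshold_digraph(n=50, theta=1.0, coop_bias=0.6, seed=0):
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n)]          # node-specific traits
    edges = []                                         # (source, target, sign)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # assumed threshold rule: link when the combined value exceeds theta
            if values[i] + values[j] > theta:
                sign = +1 if rng.random() < coop_bias else -1
                edges.append((i, j, sign))
    return values, edges

values, edges = build_signed_threshold_digraph()
helps = sum(1 for _, _, s in edges if s > 0)
print(f"{len(edges)} directed edges: {helps} help, {len(edges) - helps} harm")
```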
A data matrix may be seen simply as a means of organizing observations into rows (e.g., by measured object) and into columns (e.g., by measured variable) so that the observations can be analyzed with mathematical tools. As a mathematical object, a matrix defines a linear mapping between points representing weighted combinations of its rows (the row vector space) and points representing weighted combinations of its columns (the column vector space). From this perspective, a data matrix defines a relationship between the information that labels its rows and the information that labels its columns, and numerical methods are used to analyze this relationship. A first step is to normalize the data, transforming each observation from scales convenient for measurement to a common scale, on which addition and multiplication can meaningfully combine the different observations. For example, z-transformation rescales every variable to the same scale, standardized variation from an expected value, but ignores scale differences between measured objects. Here we develop the concepts and properties of projective decomposition, which applies the same normalization strategy to both rows and columns by separating the matrix into row- and column-scaling factors and a scale-normalized matrix. We show that the different scalings of a given scale-normalized matrix form an equivalence class, and we call the scale-normalized, canonical member of the class its scale-invariant form, which preserves all pairwise relative ratios. Projective decomposition therefore provides a means of normalizing the broad class of ratio-scale data, in which relative ratios are of primary interest, onto a common scale without altering the ratios of interest, while simultaneously accounting for scale effects in both organizations of the matrix values. Both of these properties distinguish it from z-transformation.
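One plausible way to compute such a factorization is by alternately rescaling rows and columns and accumulating the scaling factors; the sketch below is an illustrative scheme under an assumed notion of scale normalization (unit root-mean-square rows and columns), not necessarily the authors' algorithm.

```python
# Illustrative alternating-scaling scheme (not necessarily the authors' exact
# algorithm): factor W = diag(r) @ S @ diag(c) with S "scale-normalized",
# here meaning every row and column of S has unit root-mean-square.
import numpy as np

def projective_decomposition(W, iters=200, eps=1e-12):
    W = np.asarray(W, dtype=float)
    m, n = W.shape
    r, c = np.ones(m), np.ones(n)
    S = W.copy()
    for _ in range(iters):
        row_rms = np.sqrt((S**2).mean(axis=1)) + eps
        S /= row_rms[:, None]
        r *= row_rms
        col_rms = np.sqrt((S**2).mean(axis=0)) + eps
        S /= col_rms[None, :]
        c *= col_rms
    return r, S, c

W = np.random.default_rng(0).lognormal(size=(5, 4))    # ratio-scale test data
r, S, c = projective_decomposition(W)
print(np.allclose(np.diag(r) @ S @ np.diag(c), W))     # exact reconstruction
```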
Proteomics is the large-scale study of protein structure and function in biological systems through protein identification and quantification. "Shotgun proteomics" or "bottom-up proteomics" is the prevailing strategy, in which proteins are hydrolyzed into peptides that are analyzed by mass spectrometry. Proteomics can be applied to diverse questions ranging from simple protein identification to studies of proteoforms, protein-protein interactions, protein structural alterations, absolute and relative protein quantification, post-translational modifications, and protein stability. To enable this range of different experiments, there are diverse strategies for proteome analysis. The nuances of how proteomic workflows differ may be challenging to understand for new practitioners. Here, we provide a comprehensive overview of different proteomics methods to aid the novice and experienced researcher. We cover topics from biochemistry basics and protein extraction to biological interpretation and orthogonal validation. We expect this work to serve as a basic resource for new practitioners in the field of shotgun or bottom-up proteomics.
The emergence of self-sustaining autocatalytic networks in chemical reaction systems has been studied as a possible mechanism for modelling how living systems first arose. It has been known for several decades that such networks will form within systems of polymers (under cleavage and ligation reactions) under a simple process of random catalysis, and this process has since been mathematically analysed. In this paper, we provide an exact expression for the expected number of self-sustaining autocatalytic networks that will form in a general chemical reaction system, and the expected number of these networks that will also be uninhibited (by some molecule produced by the system). Using these equations, we are able to describe the patterns of catalysis and inhibition that maximise or minimise the expected number of such networks. We apply our results to derive a general theorem concerning the trade-off between catalysis and inhibition, and to provide some insight into the extent to which the expected number of self-sustaining autocatalytic networks coincides with the probability that at least one such system is present.
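The abstract does not reproduce the expression itself; schematically (and not as the paper's formula), such an expectation can be organized by linearity of expectation over candidate subnetworks:

```latex
% Schematic only, not the paper's exact expression: with R the set of reactions
% and E_{R'} the event that a non-empty subset R' forms a self-sustaining
% autocatalytic (RAF) subnetwork, linearity of expectation gives
\[
  \mathbb{E}\big[\#\,\text{RAF subnetworks}\big]
    \;=\; \sum_{\emptyset \neq R' \subseteq R} \Pr\!\big(E_{R'}\big),
\]
% and replacing E_{R'} by ``R' is a RAF and uninhibited'' gives the analogous
% expression for uninhibited networks.
```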
Recent studies at individual cell resolution have revealed phenotypic heterogeneity in nominally clonal tumor cell populations. The heterogeneity affects cell growth behaviors, which can result in departure from the idealized uniform exponential growth of the cell population. Here we measured the stochastic time courses of growth of an ensemble of populations of HL60 leukemia cells in cultures, starting with distinct initial cell numbers to capture departures from the uniform exponential growth model during the initial growth (``take-off''). Despite being derived from the same cell clone, individual cultures showed statistically significant differences in their early growth dynamics, which could be explained by the presence of inter-converting subpopulations with different growth rates that can persist for many generations. Based on the hypothesized existence of multiple subpopulations, we developed a branching process model that was consistent with the experimental observations.
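A toy two-subpopulation branching sketch of the kind of model described (the division and switching probabilities below are illustrative assumptions, not fitted values): replicate cultures started from the same small cell number diverge in their early growth.

```python
# Toy two-type branching sketch (division and switching probabilities are
# illustrative assumptions, not fitted values): fast- and slow-growing cells
# divide at different rates and can interconvert, so replicate cultures
# started from the same small cell number diverge during early "take-off".
import random

def simulate_culture(n_fast, n_slow, steps=20, p_div=(0.6, 0.2),
                     p_switch=0.05, seed=None):
    rng = random.Random(seed)
    counts = [n_fast, n_slow]
    history = [sum(counts)]
    for _ in range(steps):
        new_counts = [0, 0]
        for kind in (0, 1):
            for _ in range(counts[kind]):
                k = 1 - kind if rng.random() < p_switch else kind     # interconversion
                new_counts[k] += 2 if rng.random() < p_div[k] else 1  # division
        counts = new_counts
        history.append(sum(counts))
    return history

# replicate cultures with identical initial composition take off differently
for seed in range(3):
    print(simulate_culture(n_fast=1, n_slow=9, seed=seed)[:8])
```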
Evolution depends on the possibility of successfully exploring fitness landscapes via mutation and recombination. With these search procedures, exploration is difficult in "rugged" fitness landscapes, where small mutations can drastically change functionalities in an organism. Random Boolean networks (RBNs), being general models, can be used to explore theories of how evolution can take place in rugged landscapes, or even change the landscapes themselves. In this paper, we study the effect that redundant nodes have on the robustness of RBNs. Using computer simulations, we have found that the addition of redundant nodes to RBNs increases their robustness. We conjecture that redundancy is a way of "smoothing" fitness landscapes. Therefore, redundancy can facilitate evolutionary searches. However, too much redundancy could reduce the rate of adaptation of an evolutionary process. Our results also provide supporting evidence in favour of Kauffman's conjecture (Kauffman, 2000, p. 195).
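For concreteness, the sketch below builds a minimal RBN and applies a crude attractor-based robustness probe; the measure and parameters are assumptions for illustration rather than the paper's exact protocol, and the redundant-node manipulation itself is not shown.

```python
# Minimal random Boolean network (RBN) with a crude attractor-based robustness
# probe; the measure and parameters are assumptions for illustration, not the
# paper's exact protocol, and the redundant-node manipulation is not shown.
import random

def make_rbn(n, k, rng):
    inputs = [rng.sample(range(n), k) for _ in range(n)]              # wiring
    tables = [[rng.randint(0, 1) for _ in range(2**k)] for _ in range(n)]
    return inputs, tables

def step(state, inputs, tables):
    return tuple(tables[i][int("".join(str(state[j]) for j in inputs[i]), 2)]
                 for i in range(len(state)))

def attractor(state, inputs, tables):
    seen, order = {}, []
    while state not in seen:
        seen[state] = len(order)
        order.append(state)
        state = step(state, inputs, tables)
    return frozenset(order[seen[state]:])        # states on the attractor cycle

def robustness(n=10, k=2, trials=200, seed=0):
    """Fraction of single-bit perturbations that do not change the attractor."""
    rng = random.Random(seed)
    inputs, tables = make_rbn(n, k, rng)
    same = 0
    for _ in range(trials):
        s = tuple(rng.randint(0, 1) for _ in range(n))
        flip = rng.randrange(n)
        p = tuple(b ^ (i == flip) for i, b in enumerate(s))
        same += attractor(s, inputs, tables) == attractor(p, inputs, tables)
    return same / trials

print(robustness())
```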
We investigate solutions to the TAP equation, a phenomenological implementation of the Theory of the Adjacent Possible. Several implementations of TAP are studied, with potential applications in a range of topics including economics, social sciences, environmental change, evolutionary biological systems, and the nature of physical laws. The generic behaviour is an extended plateau followed by a sharp explosive divergence. We find accurate analytic approximations for the blow-up time that we validate against numerical simulations, and explore the properties of the equation in the vicinity of equilibrium between innovation and extinction. A particular variant, the two-scale TAP model, replaces the initial plateau with a phase of exponential growth, an extension of the TAP equation phenomenology that may enable it to be applied in a wider range of contexts.
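One commonly studied discrete variant takes the form M_{t+1} = (1 − μ)M_t + Σ_{i=2}^{M_t} α^i C(M_t, i); the sketch below iterates it with illustrative parameters (not values from the paper) and shows the plateau-then-blow-up behaviour by reporting when a numerical cap is exceeded.

```python
# One commonly studied TAP variant with alpha_i = alpha**i; parameters are
# illustrative, not taken from the paper. The combinatorial sum is evaluated
# with the binomial identity sum_{i=2..n} alpha**i * C(n, i)
#   = (1 + alpha)**n - 1 - n*alpha.
def tap_trajectory(m0=3, alpha=0.1, mu=0.0, steps=200, cap=1_000.0):
    m = float(m0)
    traj = [m]
    for _ in range(steps):
        if m > cap:                 # treat exceeding the cap as the blow-up time
            break
        n = int(m)
        growth = (1.0 + alpha) ** n - 1.0 - n * alpha
        m = (1.0 - mu) * m + growth
        traj.append(m)
    return traj

traj = tap_trajectory()
print(f"plateau then blow-up: cap exceeded after {len(traj) - 1} steps, "
      f"final M = {traj[-1]:.3g}")
```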
The Anthropic Principle has been with us since the 1970s. This Principle is advanced to account for the "fine tuning" of the 25 constants of the Standard Model of Particle Physics. Were these constants very different, life could not exist. The Anthropic Principle conditions on the existence of life and concludes that the value of the 25 constants must be within a range that allows life. The most common further step is to postulate the existence of a vast multiverse with vastly many combinations of the values of the 25 constants. By conditioning on our own life, we must be in a universe whose values allow life. The Anthropic Principle is commonly held to be untestable because we cannot be in contact with other universes. I aim here to show the Anthropic Principle is testable and that its explanatory power is weak: The Principle seems to make testably false predictions about planet Earth and the life on it. The Anthropic Principle seems unable to predict the existence of 98 stable atoms, when only 19 small atoms are needed for life.
We present a scenario for the origin of biological coding. In this context, coding is a semiotic relationship in which chemical information stored in one location links to chemical information stored in a separate location. Coding originated by the cooperative interaction of two originally separate collectively autocatalytic sets, one for nucleic acids and one for peptides. When these two sets interacted, a series of RNA-folding-directed processes led to their joint cooperativity. The aminoacyl adenylate, today amino acid-AMP, was the first covalent association made by these two collectively autocatalytic sets and solidified their interdependence. This molecule is a palimpsest of this era, and is a relic of the original semiotic, and thus coding, relationship between RNA and proteins. More defined coding was driven by selection pressure to eliminate waste in the collective autocatalytic sets. Eventually a 1:1 relationship between single amino acids and short RNA pieces (e.g., three nucleotides) was established, leading to what is today known as the genetic code. Transfer RNA aminoacylating enzymes, or aaRSs, arose concomitantly with the advent of specific coding. The two classes of aaRS enzymes are remnants of the duality of complementary information in two nucleic acid strands, as originally postulated by Rodin and Ohno. Every stage in the evolution of coding was driven by the downward selection on the components of a system to satisfy the Kantian whole. Coding was ultimately forced because there were at least two chemically distinct classes of polymers needed for open-ended evolution; systems with only one polymer cannot exhibit this characteristic. Coding is thus synonymous with life as we know it, and can be thought of as a phase transition in the history of the universe.
The transition from the quantum to the classical world is not yet understood. Here we take a new approach. Central to this is the understanding that measurement and actualization cannot occur except in some specific basis. But we have no established theory for the emergence of a specific basis.
In recent years, methods from network science have been rapidly gaining interest in economics and finance. One reason is that, in a globalized world, the interconnectedness among economic and financial entities is crucial to understand, and networks provide a natural framework for representing and studying such systems. In this paper, we survey the use of networks and network-based methods for studying economy-related questions. We start with a brief overview of graph theory and basic definitions. Then we discuss descriptive network measures and network complexity measures for quantifying structural properties of economic networks. Finally, we review different network and tree structures relevant for applications.
Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST ``digital northern'' are important high-throughput techniques for digital gene expression measurement. As with other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data are often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space. Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: this http URL. Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
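The sketch below is not Simcluster itself; it only illustrates the compositional-data idea of mapping counts from the simplex into a Euclidean representation (here via the centered log-ratio transform) before clustering. The pseudocount and the use of k-means are assumptions for this example.

```python
# Not Simcluster itself: a minimal illustration of treating enumeration counts
# as compositions, mapping them off the simplex with the centered log-ratio
# (clr) transform, and clustering in the resulting Euclidean space.
import numpy as np

def clr(counts, pseudocount=0.5):
    comp = np.asarray(counts, dtype=float) + pseudocount   # avoid log(0)
    comp /= comp.sum(axis=1, keepdims=True)                # close to the simplex
    logc = np.log(comp)
    return logc - logc.mean(axis=1, keepdims=True)

def kmeans(X, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

counts = [[90, 5, 5], [80, 10, 10], [10, 45, 45], [5, 50, 45]]   # toy libraries
print(kmeans(clr(counts), k=2))
```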
Dimensional reduction techniques have long been used to visualize the structure and geometry of high-dimensional data. However, most widely used techniques are difficult to interpret due to nonlinearities and opaque optimization processes. Here we present a specific graph-based construction for dimensionally reducing continuous stochastic systems with multiplicative noise moving under the influence of a potential. The graph is constructed so that it generates the Fokker-Planck equation of the stochastic system in the continuum limit. The eigenvectors and eigenvalues of the normalized graph Laplacian are used as a basis for the dimensional reduction and yield a low-dimensional representation of the dynamics which can be used for downstream analysis such as spectral clustering. We focus on the use case of single cell RNA sequencing data and show how current diffusion map implementations popular in the single cell literature fit into this framework.
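A generic Laplacian-eigenmap style sketch follows; it is not the paper's specific construction (which is designed so that the graph Laplacian converges to the Fokker-Planck operator in the continuum limit), only the common pattern of building an affinity graph, forming the normalized Laplacian, and embedding with its leading non-trivial eigenvectors.

```python
# Generic Laplacian-eigenmap style sketch, not the paper's specific graph
# construction (which is designed so the Laplacian converges to the
# Fokker-Planck operator in the continuum limit).
import numpy as np

def laplacian_embedding(X, sigma=1.0, n_components=2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # squared distances
    W = np.exp(-d2 / (2.0 * sigma**2))                       # Gaussian affinities
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt     # normalized Laplacian
    vals, vecs = np.linalg.eigh(L_sym)                       # ascending eigenvalues
    return vecs[:, 1:1 + n_components], vals                 # skip trivial eigenvector

X = np.random.default_rng(0).normal(size=(100, 5))           # stand-in for cell data
embedding, spectrum = laplacian_embedding(X)
print(embedding.shape, spectrum[:4].round(3))
```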
The evolution of the biosphere unfolds as a luxuriant generative process of new living forms and functions. Organisms adapt to their environment and exploit novel opportunities that are created in this continuous blooming dynamic. Affordances play a fundamental role in the evolution of the biosphere, for organisms can exploit them for new morphological and behavioral adaptations achieved by heritable variations and selection. In this way, the opportunities offered by affordances are actualized as ever-novel adaptations. In this paper we maintain that affordances elude a formalization that relies on set theory: we argue that it is not possible to apply set theory to affordances; therefore, we cannot devise a set-based mathematical theory of the diachronic evolution of the biosphere.
The information processing capacity of a complex dynamical system is reflected in the partitioning of its state space into disjoint basins of attraction, with state trajectories in each basin flowing towards their corresponding attractor. We introduce a novel network parameter, the basin entropy, as a measure of the complexity of information that such a system is capable of storing. By studying ensembles of random Boolean networks, we find that the basin entropy scales with system size only in critical regimes, suggesting that the informationally optimal partition of the state space is achieved when the system is operating at the critical boundary between the ordered and disordered phases.
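With basin weights defined as the fraction of the network's states lying in each basin, the basin entropy takes the standard Shannon form written out below; the choice of logarithm base is a convention, not something specified in the abstract.

```latex
% With w_rho the fraction of the 2^N states of an N-node network lying in
% basin rho (so the weights sum to one), the basin entropy is the Shannon
% entropy of the basin-size distribution; the logarithm base is a convention.
\[
  h \;=\; -\sum_{\rho=1}^{B} w_\rho \log w_\rho ,
  \qquad \sum_{\rho=1}^{B} w_\rho = 1 .
\]
```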
Cosmologists wish to explain how our Universe, in all its complexity, could ever have come about. For that, we assess the number of states in our Universe now. This plays the role of entropy in the thermodynamics of the Universe, and reveals the magnitude of the problem of initial conditions to be solved. The usual budgeting accounts for gravity, thermal motions, and finally the vacuum energy, whose entropy, given by the Bekenstein bound, dominates the entropy budget today. There is however one number which we have not accounted for: the number of states in our complex biosphere. What is the entropy of life, and is it sizeable enough to need to be accounted for at the Big Bang? Building on emerging ideas within theoretical biology, we show that the configuration space of living systems, unlike that of their fundamental physics counterparts, can grow rapidly in response to emerging biological complexity. A model for this expansion is provided through combinatorial innovation by the Theory of the Adjacent Possible (TAP) and its corresponding TAP equation, whose solutions we investigate, confirming the possibility of rapid state-space growth. While the results of this work remain far from being firmly established, the evidence we provide is many-fold and strong. The implications are far-reaching, and open a variety of lines for future investigation in a new scientific field we term biocosmology. In particular, the relationship between the information content in life and the information content in the Universe may need to be rebuilt from scratch.
A feature of human creativity is the ability to take a subset of existing items (e.g. objects, ideas, or techniques) and combine them in various ways to give rise to new items, which, in turn, fuel further growth. Occasionally, some of these items may also disappear (extinction). We model this process by a simple stochastic birth--death model, with non-linear combinatorial terms in the growth coefficients to capture the propensity of subsets of items to give rise to new items. In its simplest form, this model involves just two parameters $(P, \alpha)$. This process exhibits a characteristic 'hockey-stick' behaviour: a long period of relatively little growth followed by a relatively sudden 'explosive' increase. We provide exact expressions for the mean and variance of this time to explosion and compare the results with simulations. We then generalise our results to allow for more general parameter assignments, and consider possible applications to data involving human productivity and creativity.
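A Gillespie-style sketch of a combinatorial birth process with the described hockey-stick behaviour: the birth rate used below, λ_n = P((1+α)^n − 1 − nα), i.e. P Σ_{i≥2} α^i C(n, i), is an illustrative assumption rather than the paper's exact rate, and extinction events are omitted.

```python
# Gillespie-style sketch of a combinatorial birth process with 'hockey-stick'
# behaviour. The birth rate used here,
#   lambda_n = P * ((1 + alpha)**n - 1 - n*alpha)   (= P * sum_{i>=2} alpha**i * C(n, i)),
# is an illustrative assumption, not necessarily the paper's exact rate, and
# extinction events are omitted.
import math
import random

def time_to_explosion(n0=2, P=1.0, alpha=0.02, cap=500, seed=0):
    rng = random.Random(seed)
    n, t = n0, 0.0
    while n < cap:
        rate = P * ((1.0 + alpha) ** n - 1.0 - n * alpha)
        if rate <= 0.0:
            return math.inf                       # no further combinations possible
        t += rng.expovariate(rate)                # waiting time to the next birth
        n += 1
    return t

times = [time_to_explosion(seed=s) for s in range(5)]
print([round(t, 1) for t in times])               # long waits, then a rapid finish
```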