quantitative-methods
Virtually every biological rate changes with temperature, but the mechanisms underlying these responses differ among processes. Here, we bring together the main theoretical approaches used to describe temperature-rate relationships, ranging from empirical curve shapes to reaction-level kinetics and network-based dynamical frameworks. These models highlight how temperature influences not only the speed of elementary reactions, but also the behavior that emerges when many reactions interact through regulation, feedback, or stochastic transitions. By outlining the assumptions and implications of each perspective, we aim to clarify how different modeling strategies connect molecular processes to physiological temperature response curves and to point toward integrative frameworks that can better explain the diversity of biological thermal responses.
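As an illustration of the reaction-level end of that spectrum, the sketch below uses the textbook Arrhenius law and the empirical Q10 coefficient. The abstract does not name specific models, so the choice of Arrhenius, the function names, and the example activation energy are all assumptions made for illustration.

```python
import math

GAS_CONSTANT = 8.314  # J/(mol*K)

def arrhenius_rate(temp_k, activation_energy, prefactor=1.0):
    """Rate k = A * exp(-Ea / (R * T)) for an elementary reaction (illustrative)."""
    return prefactor * math.exp(-activation_energy / (GAS_CONSTANT * temp_k))

def q10(rate_fn, temp_k):
    """Empirical Q10: factor by which a rate changes over a 10 K increase."""
    return rate_fn(temp_k + 10.0) / rate_fn(temp_k)

# Hypothetical activation energy of 50 kJ/mol, evaluated near room temperature.
example_q10 = q10(lambda t: arrhenius_rate(t, 50_000.0), 298.15)
```

An activation energy in this range gives a Q10 close to 2, the rule-of-thumb value often quoted for biological rates. Network-level models can depart from this simple monotonic shape.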
Accurately predicting individual neurons' responses and spatial functional properties in complex visual tasks remains a key challenge in understanding neural computation. Existing whole-brain connectome models of Drosophila often rely on parameter assumptions or deep learning approaches, yet remain limited in their ability to reliably predict dynamic neuronal responses. We introduce a Multi-Path Aggregation (MPA) framework, based on neural network steady-state theory, to build whole-brain Visual Function Profiles (VFP) of Drosophila neurons and predict their responses under diverse visual tasks. Unlike conventional methods relying on redundant parameters, MPA combines visual input features with the whole-brain connectome topology. It uses adjacency matrix powers and finite-path optimization to efficiently predict neuronal function, including ON/OFF polarity, direction selectivity, and responses to complex visual stimuli. Our model achieves a Pearson correlation of 0.84 ± 0.12 for ON/OFF responses, outperforming existing methods (0.33 ± 0.59), and accurately captures neuron functional properties, including luminance and direction preferences, while allowing single-neuron or population-level blockade simulations. Replacing CNN modules with VFP-derived Lobula Columnar (LC) population responses in a Drosophila simulation enables successful navigation and obstacle avoidance, demonstrating the model's effectiveness in guiding embodied behavior. This study establishes a "connectome-functional profile-behavior" framework, offering a whole-brain quantitative tool to study Drosophila visual computation and a neuron-level guide for brain-inspired intelligence.
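The general idea of summing contributions over connectome paths of bounded length can be sketched with repeated matrix-vector products, which is equivalent to applying successive powers of the adjacency matrix. This is a toy illustration, not the MPA method itself: the decay factor, the path-length cutoff, and the function names are assumptions.

```python
def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def aggregate_paths(weights, stimulus, max_len=3, decay=0.5):
    """Sum decay**k * (W**k @ stimulus) for k = 0..max_len.

    weights[i][j] is the signed synaptic weight from neuron j onto neuron i
    (negative for inhibitory connections). Both the geometric decay and the
    cutoff max_len are illustrative assumptions.
    """
    total = list(stimulus)
    signal = list(stimulus)
    for k in range(1, max_len + 1):
        signal = mat_vec(weights, signal)  # contribution of paths of length k
        total = [t + (decay ** k) * s for t, s in zip(total, signal)]
    return total

# Toy chain: neuron 0 excites neuron 1, which inhibits neuron 2.
chain = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
]
response = aggregate_paths(chain, [1.0, 0.0, 0.0])
```

In this toy chain, the sign of the aggregated response flips at neuron 2, which is the kind of polarity information a path-based profile could carry.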
Background and objective: Spatial transcriptomics provides rich spatial context but lacks sufficient resolution for large-scale causal inference. We developed SpeF-Phixer, a spatially extended phi-mixing framework integrating whole-slide image (WSI)-derived spatial cell distributions with mapped scRNA-seq expression fields to infer directed gene regulatory triplets with spatial coherence. Methods: Using CD103/CD8-immunostained colorectal cancer WSIs and publicly available scRNA-seq datasets, spatial gene fields were constructed around mapped cells and discretized for signed phi-mixing computation. Pairwise dependencies, directional signs, and triplet structures were evaluated through kNN-based neighborhood screening and bootstrap consensus inference. Mediation and convergence were distinguished using generalized additive models (GAMs), with spatial validity assessed by real-null comparisons and database-backed direction checks. Results: Across tissue patches, the pipeline reduced approximately 3.6x10^4 triplet candidates to a reproducible consensus set (approximately 3x10^2 per patch). The downstream edge (Y to Z) showed significant directional bias consistent with curated regulatory databases. Spatial path tracing demonstrated markedly higher coherence for real triplets than for null controls, indicating that inferred chains represent biologically instantiated regulatory flows. Conclusion: SpeF-Phixer extracts spatially coherent, directionally consistent gene regulatory triplets from histological images. This framework bridges single-cell molecular profiles with microenvironmental organization and provides a scalable foundation for constructing spatially informed causal gene networks.
Lung cancer is a primary contributor to cancer-related mortality globally, highlighting the necessity for precise early detection of pulmonary nodules through low-dose CT (LDCT) imaging. Deep learning methods have improved nodule detection and classification; however, their performance is frequently limited by the availability of annotated data and variability among imaging centers. This research presents a CT-driven, semi-supervised framework utilizing the Inf-Net architecture to enhance lung nodule analysis with minimal annotation. The model incorporates multi-scale feature aggregation, Reverse Attention refinement, and pseudo-labeling to efficiently utilize unlabeled CT slices. Experiments conducted on subsets of the LUNA16 dataset indicate that the supervised Inf-Net attains a score of 0.825 on 10,000 labeled slices. In contrast, the semi-supervised variant achieves a score of 0.784 on 20,000 slices that include both labeled and pseudo-labeled data, thus surpassing its supervised baseline of 0.755. This study presents a conceptual framework for the integration of genomic biomarkers with CT-derived features, facilitating the development of future multimodal, biologically informed CAD systems. The proposed semi-supervised Inf-Net framework improves CT-based lung nodule assessment and lays the groundwork for flexible multi-omics diagnostic models.
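A common form of pseudo-labeling keeps a model's prediction on an unlabeled sample only when the model is confident. The sketch below applies that idea to one CT slice represented as a flat list of per-pixel foreground probabilities. The confidence measure, the threshold, and the function name are assumptions and are not taken from the paper.

```python
def pseudo_label_slice(pixel_probs, min_confidence=0.9):
    """Return (mask, confidence) for a confidently predicted slice, else None.

    pixel_probs: predicted foreground probability per pixel (flattened slice).
    Confidence is the mean of max(p, 1 - p), an illustrative choice.
    """
    if not pixel_probs:
        raise ValueError("slice has no pixels")
    mask = [1 if p >= 0.5 else 0 for p in pixel_probs]
    confidence = sum(max(p, 1.0 - p) for p in pixel_probs) / len(pixel_probs)
    if confidence < min_confidence:
        return None  # too uncertain to use as a training target
    return mask, confidence

confident = pseudo_label_slice([0.99, 0.01, 0.98])
uncertain = pseudo_label_slice([0.60, 0.40, 0.55])
```

Slices that return `None` would simply stay out of the pseudo-labeled training pool for that round.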
Intracellular compartmentalization of proteins underpins their function and the metabolic processes they sustain. Various mass spectrometry-based proteomics methods (subcellular spatial proteomics) now allow high throughput subcellular protein localization. Yet, the curation, analysis and interpretation of these data remain challenging, particularly in non-model organisms where establishing reliable marker proteins is difficult, and in contexts where experimental replication and subcellular fractionation are constrained. Here, we develop FSPmix, a semi-supervised functional clustering method implemented as an open-source R package, which leverages partial annotations from a subset of marker proteins to predict protein subcellular localization from subcellular spatial proteomics data. This method explicitly assumes that protein signatures vary smoothly across subcellular fractions, enabling more robust inference under low signal-to-noise data regimes. We applied FSPmix to a subcellular proteomics dataset from a marine diatom, allowing us to assign probabilistic localizations to proteins and uncover potentially new protein functions. Altogether, this work lays the foundation for more robust statistical analysis and interpretation of subcellular proteomics datasets, particularly in understudied organisms.
This research provides a systematic investigation into masking designs for self-supervised learning on molecular graphs, formalizing the pretraining pipeline and employing information-theoretic measures. The study reveals that the semantic richness of the prediction target is crucial for downstream performance, particularly when paired with expressive encoder architectures.
Generative models of complex systems often require post-hoc parameter adjustments to produce useful outputs. For example, energy-based models for protein design are sampled at an artificially low "temperature" to generate novel, functional sequences. This temperature tuning is a common yet poorly understood heuristic used across machine learning contexts to control the trade-off between generative fidelity and diversity. Here, we develop an interpretable, physically motivated framework to explain this phenomenon. We demonstrate that in systems with a large "energy gap" - separating a small fraction of meaningful states from a vast space of unrealistic states - learning from sparse data causes models to systematically overestimate high-energy state probabilities, a bias that lowering the sampling temperature corrects. More generally, we characterize how the optimal sampling temperature depends on the interplay between data size and the system's underlying energy landscape. Crucially, our results show that lowering the sampling temperature is not always desirable; we identify the conditions where raising it results in better generative performance. Our framework thus casts post-hoc temperature tuning as a diagnostic tool that reveals properties of the true data distribution and the limits of the learned model.
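The effect of sampling temperature on an energy-based model can be seen in a small Boltzmann distribution, where each state has probability proportional to exp(-E/T). The energies below are a made-up example with a gap between two low-energy states and many high-energy ones. They are not taken from the paper.

```python
import math

def boltzmann(energies, temperature=1.0):
    """Probabilities p_i proportional to exp(-E_i / T), computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    e_min = min(energies)  # shift energies to avoid overflow in exp
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical landscape: 2 meaningful states (E = 0), 50 unrealistic ones (E = 4).
energies = [0.0] * 2 + [4.0] * 50
mass_at_t1 = sum(boltzmann(energies, 1.0)[:2])
mass_at_t_half = sum(boltzmann(energies, 0.5)[:2])
```

Lowering the temperature from 1.0 to 0.5 shifts probability mass toward the two low-energy states. That shift is the corrective effect the abstract describes for models that overestimate high-energy states.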
Spatial transcriptomics (ST) enables simultaneous mapping of tissue morphology and spatially resolved gene expression, offering unique opportunities to study tumor microenvironment heterogeneity. Here, we introduce a computational framework that predicts spatial pathway activity directly from hematoxylin-and-eosin-stained histology images at microscale resolutions of 55 and 100 µm. Using image features derived from a computational pathology foundation model, we found that TGF-β signaling was the most accurately predicted pathway across three independent breast and lung cancer ST datasets. In 87-88% of reliably predicted cases, the resulting spatial TGF-β activity maps reflected the expected contrast between tumor and adjacent non-tumor regions, consistent with the known role of TGF-β in regulating interactions within the tumor microenvironment. Notably, linear and nonlinear predictive models performed similarly, suggesting that image features may relate to pathway activity in a predominantly linear fashion or that nonlinear structure is small relative to measurement noise. These findings demonstrate that features extracted from routine histopathology may recover spatially coherent and biologically interpretable pathway patterns, offering a scalable strategy for integrating image-based inference with ST information in tumor microenvironment studies.
Accurate quantification of complex human movements, such as gait, is essential for clinical diagnosis and rehabilitation but is often limited by traditional linear models rooted in Euclidean geometry. These frameworks frequently fail to capture the intrinsic non-linear dynamics and posture-dependent dependencies of biological systems. To address this, we present a computational framework that maps kinematic data onto a Riemannian manifold of Symmetric Positive Definite (SPD) matrices. Using the Log-Euclidean metric, we transformed raw skeletal pose sequences into geometric feature vectors to quantify gait variability and smoothness across three velocity profiles: slow, medium, and fast. Our comparative analysis reveals a critical divergence between geometric approaches. While Euclidean metrics exhibit a strictly linear increase in variability with speed (Slow < Medium < Fast), implying instability, the proposed Riemannian metrics reveal a non-linear "inverted-U" pattern across speeds. Specifically, we observed a stabilization of variance at high speeds (sprinting), suggesting that the motor system optimizes efficiency by adhering to geodesic trajectories of minimum effort. These findings demonstrate that manifold-based representations offer superior sensitivity to biomechanical efficiency compared to standard linear methods, providing a robust foundation for future diagnostic algorithms and explainable machine learning models in clinical biomechanics.
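The Log-Euclidean distance between two SPD matrices is the Frobenius norm of the difference of their matrix logarithms. The sketch below restricts itself to 2x2 matrices so the logarithm can be written in closed form without a linear algebra library. That restriction and the function names are assumptions for illustration only, since real pose covariance matrices would be larger.

```python
import math

def spd_log_2x2(s):
    """Matrix logarithm of a 2x2 SPD matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = s
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    lam_hi, lam_lo = mean + radius, mean - radius  # eigenvalues
    if lam_lo <= 0.0:
        raise ValueError("matrix is not positive definite")
    if radius == 0.0:  # S is a multiple of the identity
        value = math.log(mean)
        return [[value, 0.0], [0.0, value]]
    # log(S) = slope * (S - lam_lo * I) + log(lam_lo) * I
    slope = (math.log(lam_hi) - math.log(lam_lo)) / (lam_hi - lam_lo)
    offset = math.log(lam_lo)
    return [
        [slope * (a - lam_lo) + offset, slope * b],
        [slope * b, slope * (c - lam_lo) + offset],
    ]

def log_euclidean_distance(s1, s2):
    """Frobenius norm of log(s1) - log(s2)."""
    l1, l2 = spd_log_2x2(s1), spd_log_2x2(s2)
    return math.sqrt(sum((l1[i][j] - l2[i][j]) ** 2
                         for i in range(2) for j in range(2)))

identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[1.0, 0.0], [0.0, 4.0]]
distance = log_euclidean_distance(identity, stretched)
```

For the two diagonal matrices above, the distance reduces to log 4, since only one eigenvalue differs.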
Healthcare AI systems have historically faced challenges in merging contextual reasoning, long-term state management, and human-verifiable workflows into a cohesive framework. This paper introduces a new architecture and concept: combining the Model Context Protocol (MCP) with a specific clinical application, known as MCP-AI. This integration allows intelligent agents to reason over extended periods, collaborate securely, and adhere to authentic clinical logic, representing a significant shift away from traditional Clinical Decision Support Systems (CDSS) and prompt-based Large Language Models (LLMs). As healthcare systems become more complex, the need for autonomous, context-aware clinical reasoning frameworks has become urgent. We present MCP-AI, a novel architecture for explainable medical decision-making built upon the Model Context Protocol (MCP), a modular, executable specification for orchestrating generative and descriptive AI agents in real-time workflows. Each MCP file captures clinical objectives, patient context, reasoning state, and task logic, forming a reusable and auditable memory object. Unlike conventional CDSS or stateless prompt-based AI systems, MCP-AI supports adaptive, longitudinal, and collaborative reasoning across care settings. MCP-AI is validated through two use cases: (1) diagnostic modeling of Fragile X Syndrome with comorbid depression, and (2) remote coordination for Type 2 Diabetes and hypertension. In both scenarios, the protocol facilitates physician-in-the-loop validation, streamlines clinical processes, and guarantees secure transitions of AI responsibilities between healthcare providers. The system connects with HL7/FHIR interfaces and adheres to regulatory standards, such as HIPAA and FDA SaMD guidelines. MCP-AI provides a scalable basis for interpretable, composable, and safety-oriented AI within upcoming clinical environments.
Understanding how protein mutations affect protein structure is essential for advancements in computational biology and bioinformatics. We introduce PRIMRose, a novel approach that predicts energy values for each residue given a mutated protein sequence. Unlike previous models that assess global energy shifts, our method analyzes the localized energetic impact of double amino acid insertions or deletions (InDels) at the individual residue level, enabling residue-specific insights into structural and functional disruption. We implement a Convolutional Neural Network architecture to predict the energy changes of each residue in a mutated protein. We train our model on datasets constructed from nine proteins, grouped into three categories: one set with exhaustive double InDel mutations, another with approximately 145k randomly sampled double InDel mutations, and a third with approximately 80k randomly sampled double InDel mutations. Our model achieves high predictive accuracy across a range of energy metrics as calculated by the Rosetta molecular modeling suite and reveals localized patterns that influence model performance, such as solvent accessibility and secondary structure context. This per-residue analysis offers new insights into the mutational tolerance of specific regions within proteins and provides more interpretable and biologically meaningful predictions of InDels' effects.
Researchers at ETH Zurich and the Chinese Academy of Sciences developed DeepSKA, a neural framework that provides interpretable and reliable estimation of expected outputs for Stochastic Reaction Networks (SRNs). This method combines spectral decomposition-based neural networks with hybrid Deep Learning/Monte Carlo estimators, achieving unbiased and provably convergent results while reducing variance up to 10,000-fold compared to standard simulations.
Protein inverse folding, the design of an amino acid sequence based on a target 3D structure, is a fundamental problem of computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or rely on protein language models (PLMs). The former omits the evolutionary information stored in protein databases, while the latter is parameter-inefficient and hard to adapt to ever-growing protein data. To overcome these drawbacks, we propose a novel method, called retrieval-augmented denoising diffusion (RadDiff), for protein inverse folding. Given the target protein backbone, RadDiff uses a hierarchical search strategy to efficiently retrieve structurally similar proteins from large protein databases. The retrieved structures are then aligned residue-by-residue to the target to construct a position-specific amino acid profile, which serves as an evolutionary-informed prior that conditions the denoising process. A lightweight integration module is further designed to incorporate this prior effectively. Experimental results on the CATH, PDB, and TS50 datasets show that RadDiff consistently outperforms existing methods, improving sequence recovery rate by up to 19%. Experimental results also demonstrate that RadDiff generates highly foldable sequences and scales effectively with database size.
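A position-specific amino acid profile can be built by counting residues column by column once the retrieved sequences are aligned to the target. The sketch below assumes the alignment has already been done and is given as equal-length strings with `-` for gaps. The pseudocount smoothing and the function name are illustrative choices, not details from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_profile(aligned, pseudocount=1.0):
    """Per-position amino acid frequencies from pre-aligned sequences.

    aligned: equal-length strings, one per retrieved protein; characters
    outside the 20 standard amino acids (such as '-') are ignored.
    """
    if not aligned or len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    profile = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile

profile = position_profile(["AC-", "AG-", "AC-"])
```

With the pseudocount, a column that contains only gaps falls back to a uniform distribution, so the prior stays well defined at every position.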
Snarls and superbubbles are fundamental pangenome decompositions capturing variant sites. These bubble-like structures underpin key tasks in computational pangenomics, including structural-variant genotyping, distance indexing, haplotype sampling, and variant annotation. Snarls can be quadratically many in the size of the graph, and since their introduction in 2018 with the vg toolkit, there has been no work on identifying all snarls in linear time. Moreover, while it is known how to find superbubbles in linear time, this result is a highly specialized solution only achieved after a long series of papers. We present the first algorithm identifying all snarls in linear time. This is based on a new representation of all snarls, of size linear in the input graph size, which can be computed in linear time. Our algorithm is based on a unified framework that also provides a new linear-time algorithm for finding superbubbles. An observation behind our results is that all such structures are separated from the rest of the graph by two vertices (except for cases which are trivially computable), i.e. their endpoints are a 2-separator of the underlying undirected graph. Based on this, we employ the well-known SPQR tree decomposition, which encodes all 2-separators, to guide a traversal that finds the bubble-like structures efficiently. We implemented our algorithms in C++ (available at this https URL) and evaluated them on various pangenomic datasets. Our algorithms outperform or match existing methods. For snarls, we are up to two times faster than vg, while identifying all snarls. When computing superbubbles, we are up to 50 times faster than BubbleGun. Our SPQR tree framework provides a unifying perspective on bubble-like structures in pangenomics, together with a template for finding other bubble-like structures efficiently.
Accurate and scalable cell type annotation remains a challenge in single-cell transcriptomics, especially when datasets exhibit strong batch effects or contain previously unseen cell populations. Here we introduce SpikGPT, a hybrid deep learning framework that integrates scGPT-derived cell embeddings with a spiking Transformer architecture to achieve efficient and robust annotation. scGPT provides biologically informed dense representations of each cell, which are further processed by a multi-head Spiking Self-Attention mechanism for energy-efficient feature extraction. Across multiple benchmark datasets, SpikGPT consistently matches or exceeds the performance of leading annotation tools. Notably, SpikGPT uniquely identifies unseen cell types by assigning low-confidence predictions to an "Unknown" category, allowing accurate rejection of cell states absent from the training reference. Together, these results demonstrate that SpikGPT is a versatile and reliable annotation tool capable of generalizing across datasets, resolving complex cellular heterogeneity, and facilitating discovery of novel or disease-associated cell populations.
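Routing low-confidence predictions to an "Unknown" category can be expressed as a threshold on the top class probability. The sketch below shows that rejection rule in isolation. The threshold value, the labels, and the function name are assumptions, and the abstract does not specify how SpikGPT computes or calibrates its confidence.

```python
def annotate_cell(class_probs, labels, min_confidence=0.5):
    """Return the most probable label, or "Unknown" if confidence is too low."""
    if len(class_probs) != len(labels) or not labels:
        raise ValueError("class_probs and labels must be non-empty and aligned")
    best = max(range(len(labels)), key=lambda i: class_probs[i])
    if class_probs[best] < min_confidence:
        return "Unknown"  # likely a cell state absent from the reference
    return labels[best]

labels = ["T cell", "B cell", "Monocyte"]
known = annotate_cell([0.85, 0.10, 0.05], labels)
rejected = annotate_cell([0.40, 0.35, 0.25], labels)
```

Raising `min_confidence` rejects more cells as "Unknown" at the cost of leaving some known cell types unlabeled.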
Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying "grammars" inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.
Predicting the binding affinity of protein-protein complexes directly from sequence remains a challenging problem, particularly in the absence of reliable structural information. Here I present ProtT Affinity, a sequence-only model that combines ProtT5 embeddings with a lightweight Transformer architecture. The model is trained and evaluated on homology-filtered subsets of the PDBBind database following a curation protocol consistent with prior structure-based work. Across two independent test sets, ProtT Affinity reaches Pearson correlation coefficients of 0.628 and 0.459, respectively. Although its performance does not match the strongest structure-based methods, it is competitive with several widely used approaches and provides a practical alternative when structural data are missing or uncertain. The results suggest that large protein language models capture features relevant to binding energetics, and that these features can be exploited to approximate affinity trends at scale.
Understanding the evolution of cellular microenvironments in spatiotemporal data is essential for deciphering tissue development and disease progression. While experimental techniques like spatial transcriptomics now enable high-resolution mapping of tissue organization across space and time, current methods that model cellular evolution operate at the single-cell level, overlooking the coordinated development of cellular states in a tissue. We introduce NicheFlow, a flow-based generative model that infers the temporal trajectory of cellular microenvironments across sequential spatial slides. By representing local cell neighborhoods as point clouds, NicheFlow jointly models the evolution of cell states and spatial coordinates using optimal transport and Variational Flow Matching. Our approach successfully recovers both global spatial architecture and local microenvironment composition across diverse spatiotemporal datasets, from embryonic to brain development.
Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency; and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers a 3.6× improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
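The RMSD thresholds quoted above compare predicted and reference ligand atom positions. A minimal sketch of that calculation follows, assuming both poses are already in the same coordinate frame with atoms listed in the same order. Benchmark implementations typically also handle symmetry-equivalent atoms, which this toy version does not, and the function names are hypothetical.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between two lists of (x, y, z) coordinates."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum((p - r) ** 2
                  for atom_p, atom_r in zip(predicted, reference)
                  for p, r in zip(atom_p, atom_r))
    return math.sqrt(squared / len(predicted))

def is_accurate(predicted, reference, threshold=2.0):
    """True if the pose is within the RMSD threshold (in angstroms)."""
    return rmsd(predicted, reference) < threshold

reference_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted_pose = [(x + 1.0, y, z) for x, y, z in reference_pose]
```

Here `shifted_pose` is displaced by 1 Å along x, so it passes the 2 Å criterion but not the stricter 1 Å one.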
AlphaFold has transformed protein structure prediction, but emerging applications such as virtual ligand screening, proteome-wide folding, and de novo binder design demand predictions at a massive scale, where runtime and memory costs become prohibitive. A major bottleneck lies in the Pairformer backbone of AlphaFold3-style models, which relies on computationally expensive triangular primitives, especially triangle attention, for pairwise reasoning. We introduce Pairmixer, a streamlined alternative that eliminates triangle attention while preserving higher-order geometric reasoning capabilities that are critical for structure prediction. Pairmixer substantially improves computational efficiency, matching state-of-the-art structure predictors across folding and docking benchmarks, delivering up to 4x faster inference on long sequences while reducing training cost by 34%. Its efficiency alleviates the computational burden of downstream applications such as modeling large protein complexes, high-throughput ligand and binder screening, and hallucination-based design. Within BoltzDesign, for example, Pairmixer delivers over 2x faster sampling and scales to sequences ~30% longer than the memory limits of Pairformer.