Fred Hutchinson Cancer Research Center
Learning informative representations of phylogenetic tree structures is essential for analyzing evolutionary relationships. Classical distance-based methods have been widely used to project phylogenetic trees into Euclidean space, but they are often sensitive to the choice of distance metric and may lack sufficient resolution. In this paper, we introduce phylogenetic variational autoencoders (PhyloVAEs), an unsupervised learning framework designed for representation learning and generative modeling of tree topologies. Leveraging an efficient encoding mechanism inspired by autoregressive tree topology generation, we develop a deep latent-variable generative model that facilitates fast, parallelized topology generation. PhyloVAE combines this generative model with a collaborative inference model based on learnable topological features, allowing for high-resolution representations of phylogenetic tree samples. Extensive experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.
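A minimal sketch of the kind of deep latent-variable model described above, assuming each tree topology has already been converted to a fixed-length vector of encoding decisions; the encoding scheme, network sizes, and Bernoulli reconstruction term are illustrative placeholders rather than PhyloVAE's actual architecture.

```python
# Illustrative VAE skeleton for tree-topology representation learning.
# Assumes each topology is pre-encoded as a fixed-length vector of
# construction decisions (the encoding, sizes, and priors are placeholders).
import torch
import torch.nn as nn

class TopologyVAE(nn.Module):
    def __init__(self, input_dim=64, latent_dim=2, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def elbo(recon_logits, x, mu, logvar):
    # Bernoulli reconstruction term plus the analytic Gaussian KL term.
    rec = -nn.functional.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec - kl
```

Training maximizes the ELBO with stochastic gradients; the low-dimensional latent means then serve as the representations of tree samples.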
Reconstructing the evolutionary history relating a collection of molecular sequences is the main subject of modern Bayesian phylogenetic inference. However, the commonly used Markov chain Monte Carlo methods can be inefficient due to the complicated space of phylogenetic trees, especially when the number of sequences is large. An alternative approach is variational Bayesian phylogenetic inference (VBPI), which transforms the inference problem into an optimization problem. While effective, the default diagonal lognormal approximation for the branch lengths of the tree used in VBPI is often insufficient to capture the complexity of the exact posterior. In this work, we propose a more flexible family of branch length variational posteriors based on semi-implicit hierarchical distributions using graph neural networks. We show that this semi-implicit construction readily yields permutation equivariant distributions, and therefore can handle the non-Euclidean branch length space across different tree topologies with ease. To deal with the intractable marginal probability of semi-implicit variational distributions, we develop several alternative lower bounds for stochastic optimization. We demonstrate the effectiveness of our proposed method over baseline methods on benchmark data examples, in terms of both marginal likelihood estimation and branch length posterior approximation.
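For orientation, one widely used multi-sample lower bound from the semi-implicit variational inference literature is shown below; the paper derives its own alternative bounds tailored to branch lengths, so treat this only as an indication of the general form. Here $q_\phi(\psi)$ is the implicit mixing distribution and $q(z\mid\psi)$ the explicit conditional layer.

```latex
% A standard multi-sample surrogate used in semi-implicit variational
% inference (shown for orientation; the paper develops its own variants).
\mathcal{L}_K \;=\;
\mathbb{E}_{\psi_0,\ldots,\psi_K \sim q_\phi(\psi)}\,
\mathbb{E}_{z \sim q(z\mid\psi_0)}
\left[\log p(x, z) \;-\; \log \frac{1}{K+1}\sum_{k=0}^{K} q(z\mid\psi_k)\right]
```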
This research introduces Iteration by Regimenting Self-Attention (IRSA), a prompting methodology that enables large language models to perform iterative program execution. By carefully structuring prompts, GPT-3 models achieved 100% accuracy on tasks like Bubble Sort and Longest Substring Without Repeating Characters, and 76% on Logical Deduction puzzles, indicating a capacity for algorithmic simulation.
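An illustrative fragment, not the paper's actual prompt, of what a regimented execution-trace prompt for Bubble Sort might look like: each state and comparison is written in a rigid format that the model is then asked to continue for a new input.

```python
# Illustrative (not the paper's exact prompt): a regimented execution trace
# that the model is asked to continue, one rigidly formatted step at a time.
PROMPT = """Problem: 2, 3, 1, 5
State: 2, 3, 1, 5  swaps=0
Compare positions 0,1: 2 <= 3, no swap.
Compare positions 1,2: 3 > 1, swap.
State: 2, 1, 3, 5  swaps=1
Compare positions 2,3: 3 <= 5, no swap.
End of pass. swaps=1 > 0, start new pass.
...
Problem: 4, 2, 9, 7
State:"""
```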
The advent and subsequent widespread availability of preventive vaccines have altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including HIV, have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response marker activity is often costly, which has motivated the use of two-phase sampling for immune response evaluation in clinical trials of preventive vaccines. In such trials, the measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant-level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in theoretical gaps pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two-phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.
Predicting the stability and fitness effects of amino acid mutations in proteins is a cornerstone of biological discovery and engineering. Various experimental techniques have been developed to measure mutational effects, providing us with extensive datasets across a diverse range of proteins. By training on these data, traditional computational modeling and more recent machine learning approaches have advanced significantly in predicting mutational effects. Here, we introduce HERMES, a 3D rotationally equivariant structure-based neural network model for mutational effect and stability prediction. Pre-trained to predict amino acid propensity from its surrounding 3D structure, HERMES can be fine-tuned for mutational effects using our open-source code. We present a suite of HERMES models, pre-trained with different strategies, and fine-tuned to predict the stability effect of mutations. Benchmarking against other models shows that HERMES often outperforms or matches their performance in predicting mutational effect on stability, binding, and fitness. HERMES offers versatile tools for evaluating mutational effects and can be fine-tuned for specific predictive objectives.
Measurement error is a major issue in self-reported diet that can distort diet-disease relationships. Use of blood concentration biomarkers has the potential to mitigate the subjective bias inherent in self-report. As part of the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) baseline visit (2008-2011), self-reported diet was collected on all participants (N=16,415). Blood concentration biomarkers for carotenoids, tocopherols, retinol, vitamin B12 and folate were collected on a subset (N=476), as part of the Study of Latinos: Nutrition and Physical Activity Assessment Study (SOLNAS). We examine the relationship between biomarker levels, self-reported intake, Hispanic/Latino background, and other participant characteristics in this diverse cohort. We build regression calibration-based prediction equations for ten nutritional biomarkers and use a simulation to study the power of detecting a diet-disease association in a multivariable Cox model using a predicted concentration level. Good power was observed for some nutrients with high prediction model R2 values, but further research is needed to understand how best to realize the potential of these dietary biomarkers. This study provides a comprehensive examination of several nutritional biomarkers within the HCHS/SOL, characterizing their associations with subject characteristics and the influence of the measurement characteristics on the power to detect associations with health outcomes.
We introduce in this paper an extension of the meta-analytic (MA) framework for evaluating surrogate endpoints. While the MA framework is regarded as the gold standard for surrogate endpoint evaluation, it is limited in its ability to handle complex surrogates and does not take into account possible differences in the distribution of baseline covariates across trials. By contrast, in the context of data fusion, the surrogate-index (SI) framework accommodates complex surrogates and allows for complex relationships between baseline covariates, surrogates, and clinical endpoints. However, the SI framework is not a surrogate evaluation framework and relies on strong identifying assumptions. To address the MA framework's limitations, we propose an extension that incorporates ideas from the SI framework. We first formalize the data-generating mechanism underlying the MA framework, providing a transparent description of the untestable assumptions required for valid inferences in any evaluation of trial-level surrogacy -- assumptions often left implicit in the MA framework. While this formalization is meaningful in its own right, it is also required for our main contribution: We propose to estimate a specific transformation of the baseline covariates and the surrogate, the so-called surrogate index. This estimated transformation serves as a new potential univariate surrogate and is optimal in a trial-level surrogacy sense under certain conditions. We show that, under weak additional conditions, this new univariate surrogate can be evaluated as a trial-level surrogate as if the transformation were known a priori. This approach enables the evaluation of the trial-level surrogacy of complex surrogates and can be implemented using standard software. We illustrate this approach with a set of COVID-19 vaccine trials where antibody markers are assessed as potential trial-level surrogate endpoints.
The accurate classification of lymphoma subtypes using hematoxylin and eosin (H&E)-stained tissue is complicated by the wide range of morphological features these cancers can exhibit. We present LymphoML, an interpretable machine learning method that identifies morphologic features correlated with lymphoma subtypes. Our method processes H&E-stained tissue microarray cores, segments nuclei and cells, computes features encompassing morphology, texture, and architecture, and trains gradient-boosted models to make diagnostic predictions. LymphoML's interpretable models, developed on a limited volume of H&E-stained tissue, achieve non-inferior diagnostic accuracy to pathologists using whole-slide images and outperform black-box deep learning models on a dataset of 670 cases from Guatemala spanning 8 lymphoma subtypes. Using SHapley Additive exPlanations (SHAP) analysis, we assess the impact of each feature on model predictions and find that nuclear shape features are most discriminative for DLBCL (F1-score: 78.7%) and classical Hodgkin lymphoma (F1-score: 74.5%). Finally, we provide the first demonstration that a model combining features from H&E-stained tissue with features from a standardized panel of 6 immunostains results in similar diagnostic accuracy (85.3%) to a 46-stain panel (86.1%).
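A minimal sketch of the final modeling step described above, using synthetic placeholder features; the actual LymphoML pipeline, feature definitions, and model settings are not reproduced here.

```python
# Gradient-boosted classification on per-case morphology/texture features,
# followed by SHAP attributions. Features and labels are synthetic placeholders.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # e.g., nuclear area, eccentricity, texture, ...
y = rng.integers(0, 2, size=200)     # e.g., DLBCL vs. other

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-feature contribution for each case
print(np.abs(shap_values).mean(axis=0))  # global feature importance
```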
Predicting risks of chronic diseases has become increasingly important in clinical practice. When a prediction model is developed in a given source cohort, there is often great interest in applying the model to other cohorts. However, due to potential discrepancies in baseline disease incidences between different cohorts and shifts in patient composition, the risk predicted by the original model often under- or over-estimates the risk in the new cohort. Remedying such poorly calibrated predictions is needed for proper medical decision-making. In this article, we assume the relative risks of predictors are the same between the two cohorts, and propose a novel weighted estimating equation approach to re-calibrating the projected risk for the targeted population by updating the baseline risk. The recalibration leverages knowledge of the overall survival probabilities for the disease of interest and competing events, and summary information on risk factors from the targeted population. The proposed re-calibrated risk estimators gain efficiency if the risk factor distributions are the same for both the source and target cohorts, and are robust with little bias if they differ. We establish the consistency and asymptotic normality of the proposed estimators. Extensive simulation studies demonstrate that the proposed estimators perform very well in terms of robustness and efficiency in finite samples. A real data application to colorectal cancer risk prediction also illustrates that the proposed method can be used in practice for model recalibration.
In this paper, we propose a computationally efficient approach, space (Sparse PArtial Correlation Estimation), for selecting non-zero partial correlations under the high-dimension-low-sample-size setting. This method assumes the overall sparsity of the partial correlation matrix and employs sparse regression techniques for model fitting. We illustrate the performance of space by extensive simulation studies. It is shown that space performs well in both non-zero partial correlation selection and the identification of hub variables, and also outperforms two existing methods. We then apply space to a microarray breast cancer data set and identify a set of hub genes which may provide important insights on genetic regulatory networks. Finally, we prove that, under a set of suitable assumptions, the proposed procedure is asymptotically consistent in terms of model selection and parameter estimation.
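A simplified, neighborhood-selection-style stand-in for the idea of using sparse regression to select non-zero partial correlations; space itself fits a joint model with symmetric partial correlation parameters, so the sketch below is orientation only.

```python
# Simplified illustration of sparse-regression-based partial correlation
# selection: regress each variable on all others with the lasso and keep
# edges selected in both directions (an "AND" rule). Not the space estimator.
import numpy as np
from sklearn.linear_model import Lasso

def select_partial_correlations(X, alpha=0.1):
    n, p = X.shape
    coefs = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=alpha).fit(X[:, others], X[:, j])
        coefs[j, others] = fit.coef_
    # Keep an edge only if both regressions select it.
    return (coefs != 0) & (coefs.T != 0)
```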
Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo (MCMC) with simple proposal mechanisms. This hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper, we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We propose combining subsplit Bayesian networks, an expressive graphical model for tree topology distributions, and a structured amortization of the branch lengths over tree topologies for a suitable variational family of distributions. We train the variational approximation via stochastic gradient ascent and adopt gradient estimators for continuous and discrete variational parameters separately to deal with the composite latent space of phylogenetic models. We show that our variational approach provides competitive performance to MCMC, while requiring far fewer (though more costly) iterations due to a more efficient exploration mechanism enabled by variational inference. Experiments on a benchmark of challenging real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.
The UK Biobank is a large-scale health resource comprising genetic, environmental and medical information on approximately 500,000 volunteer participants in the UK, recruited at ages 40-69 during the years 2006-2010. The project monitors the health and well-being of its participants. This work demonstrates how these data can be used to estimate in a semi-parametric fashion the effects of genetic and environmental risk factors on the hazard functions of various diseases, such as colorectal cancer. An illness-death model is adopted, which inherently is a semi-competing risks model, since death can censor the disease, but not vice versa. Using a shared-frailty approach to account for the dependence between time to disease diagnosis and time to death, we provide a new illness-death model that assumes Cox models for the marginal hazard functions. The recruitment procedure used in this study introduces delayed entry to the data. An additional challenge arising from the recruitment procedure is that information coming from both prevalent and incident cases must be aggregated. Lastly, we do not observe any deaths prior to the minimal recruitment age, 40. In this work we provide an estimation procedure for our new illness-death model that overcomes all the above challenges.
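A schematic of the transition structure described above, with illustrative notation: three Cox-type transition hazards (healthy to disease, healthy to death, disease to death), with a shared frailty accounting for the dependence between time to disease diagnosis and time to death. The paper's specific formulation, in particular exactly how the frailty enters and which hazards are modeled marginally, should be taken from the paper itself.

```latex
% Illness-death transitions: 0 = healthy, 1 = diseased, 2 = dead.
% Cox-type covariate effects on each transition hazard; a shared frailty
% (not shown) induces the dependence between disease and death times.
\lambda_{01}(t \mid Z) = \lambda_{01}^{0}(t)\, e^{\beta_{01}^{\top} Z}, \qquad
\lambda_{02}(t \mid Z) = \lambda_{02}^{0}(t)\, e^{\beta_{02}^{\top} Z}, \qquad
\lambda_{12}(t \mid Z) = \lambda_{12}^{0}(t)\, e^{\beta_{12}^{\top} Z}
```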
Fundamental to quantitative characterization of the B cell receptor repertoire is clonal diversity - the number of distinct somatically recombined receptors present in the repertoire and their relative abundances, defining the search space available for immune response. This study synthesizes flow cytometry and immunosequencing to study memory and naive B cells from the peripheral blood of three adults. A combinatorial experimental design was employed, providing a probe of sample abundance robust to amplification stochasticity, a crucial quantitative advance over previous sequencing studies of diversity. These data are leveraged to interrogate repertoire diversity, motivating an extension of a canonical diversity model in ecology and corpus linguistics. Maximum likelihood diversity estimates are provided for memory and naive B cell repertoires. Both evince domination by rare clones and regimes of power law scaling in abundance. Memory clones have more disparate repertoire abundances than naive clones, and most naive clones undergo no proliferation prior to antigen recognition.
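As a point of reference for the power-law scaling mentioned above, a textbook maximum-likelihood estimator of a power-law exponent for clone abundances is sketched below; the paper fits a richer diversity model, so this is orientation only.

```python
# Textbook (continuous-approximation) MLE of a power-law exponent for
# clone abundances; not the paper's full diversity model.
import numpy as np

def powerlaw_alpha_mle(abundances, xmin=1.0):
    x = np.asarray(abundances, dtype=float)
    x = x[x >= xmin]
    # alpha_hat = 1 + n / sum(log(x / xmin))
    return 1.0 + len(x) / np.sum(np.log(x / xmin))

print(powerlaw_alpha_mle([1, 1, 2, 1, 3, 10, 1, 2, 45, 1]))
```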
Cell populations are never truly homogeneous; individual cells exist in biochemical states that define functional differences between them. New technology based on microfluidic arrays combined with multiplexed quantitative polymerase chain reactions (qPCR) now enables high-throughput single-cell gene expression measurement, allowing assessment of cellular heterogeneity. However, few analytic tools have been developed specifically for the statistical and analytical challenges of single-cell qPCR data. We present a statistical framework for the exploration, quality control, and analysis of single-cell gene expression data from microfluidic arrays. We assess accuracy and within-sample heterogeneity of single-cell expression and develop quality control criteria to filter unreliable cell measurements. We propose a statistical model accounting for the fact that genes at the single-cell level can be on (in which case a continuous expression measure is recorded) or dichotomously off (in which case the recorded expression is zero). Based on this model, we derive a combined likelihood-ratio test for differential expression that incorporates both the discrete and continuous components. Using an experiment that examines treatment-specific changes in expression, we show that this combined test is more powerful than either the continuous or dichotomous component in isolation, or a t-test on the zero-inflated data. While developed for measurements from a specific platform (Fluidigm), these tools are generalizable to other multi-parametric measures over large numbers of events.
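A rough illustration of a two-part ("hurdle") likelihood-ratio test of the kind described: a binomial component for whether a gene is detected in a cell and a normal component for expression among detected cells, with the component statistics added and referred to a chi-squared distribution. The degrees of freedom and distributional choices below are illustrative, not the paper's exact test.

```python
# Two-part LRT sketch: detection-rate component + continuous component.
import numpy as np
from scipy import stats

def binomial_lrt(k_a, n_a, k_b, n_b):
    # LRT for equal detection probabilities in the two groups (1 df).
    def ll(k, n, p):
        p = min(max(p, 1e-12), 1 - 1e-12)
        return k * np.log(p) + (n - k) * np.log(1 - p)
    p_pool = (k_a + k_b) / (n_a + n_b)
    return 2 * (ll(k_a, n_a, k_a / n_a) + ll(k_b, n_b, k_b / n_b)
                - ll(k_a, n_a, p_pool) - ll(k_b, n_b, p_pool))

def normal_lrt(x_a, x_b):
    # LRT for equal normal distributions among detected cells (2 df here).
    pooled = np.concatenate([x_a, x_b])
    def ll(x):
        return np.sum(stats.norm.logpdf(x, loc=x.mean(), scale=x.std() + 1e-12))
    return 2 * (ll(x_a) + ll(x_b) - ll(pooled))

def combined_test(expr_a, expr_b):
    # Zero means "off"; positive values are continuous expression when "on".
    on_a, on_b = expr_a > 0, expr_b > 0
    stat = binomial_lrt(on_a.sum(), expr_a.size, on_b.sum(), expr_b.size) \
         + normal_lrt(expr_a[on_a], expr_b[on_b])
    return stat, stats.chi2.sf(stat, df=3)  # 1 df (detection) + 2 df (normal part)
```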
In two harmonized efficacy studies to prevent HIV infection through multiple infusions of the monoclonal antibody VRC01, a key objective is to evaluate whether the serum concentration of VRC01, which changes cyclically over time along with the infusion schedule, is associated with the rate of HIV infection. Simulation studies are needed in the development of such survival models. In this paper, we consider simulating event time data with a continuous time-varying covariate whose values vary across multiple drug administration cycles, and whose effect on survival changes differently before and after a threshold within each cycle. The latter accommodates settings with a zero-protection biomarker threshold above which the drug provides a varying level of protection depending on the biomarker level, but below which the drug provides no protection. We propose two simulation approaches: one based on simulating survival data under a single-dose regimen first before data are aggregated over multiple doses, and another based on simulating survival data directly under a multiple-dose regimen. We generate time-to-event data following a Cox proportional hazards model based on inverting the cumulative hazard function and a log link function relating the hazard function to the covariates. The validity of both approaches is assessed in two sets of simulation experiments. The results indicate that the proposed procedures perform well in producing data that conform to their cyclic nature and the assumptions of the Cox proportional hazards model.
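A simplified sketch of the inverse-cumulative-hazard simulation idea with a cycle-varying covariate and a zero-protection threshold; all constants and the concentration trajectory below are placeholders rather than the trial's actual pharmacokinetics or the paper's full procedure.

```python
# Simulate one event time by inverting the cumulative hazard on a fine grid,
# with a decaying, periodically re-dosed concentration and a threshold effect.
import numpy as np

rng = np.random.default_rng(1)
lambda0, beta, threshold = 0.05, -0.8, 1.0   # baseline hazard, log-HR, protection threshold
grid = np.arange(0, 80.0, 0.1)               # fine time grid (e.g., weeks)
conc = 5.0 * np.exp(-0.3 * (grid % 8))       # decaying concentration, re-dosed every 8 weeks

# Hazard: the covariate acts only above the threshold (zero protection below it).
effect = np.where(conc > threshold, beta * (conc - threshold), 0.0)
hazard = lambda0 * np.exp(effect)

cumhaz = np.cumsum(hazard) * 0.1             # cumulative hazard on the grid
u = rng.exponential()                        # -log(U) with U ~ Uniform(0, 1)
idx = np.searchsorted(cumhaz, u)
event_time = grid[idx] if idx < len(grid) else np.inf   # beyond follow-up -> censored
print(event_time)
```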
Modern biological techniques enable very dense genetic sampling of unfolding evolutionary histories, and thus frequently sample some genotypes multiple times. This motivates strategies to incorporate genotype abundance information in phylogenetic inference. In this paper, we synthesize a stochastic process model with standard sequence-based phylogenetic optimality, and show that tree estimation is substantially improved by doing so. Our method is validated with extensive simulations and an experimental single-cell lineage tracing study of germinal center B cell receptor affinity maturation.
In this paper we continue the study of a question proposed by Babadi and Tarokh on the mysterious randomness of Gold sequences. Improving upon their result, we establish the randomness, with respect to the empirical spectral distribution, of the product of pseudorandom matrices formed from two linear block codes whenever the dual distance of both codes is at least 5, hence providing an affirmative answer to the question.
The conditional survival function of a time-to-event outcome subject to censoring and truncation is a common target of estimation in survival analysis. This parameter may be of scientific interest and also often appears as a nuisance in nonparametric and semiparametric problems. In addition to classical parametric and semiparametric methods (e.g., based on the Cox proportional hazards model), flexible machine learning approaches have been developed to estimate the conditional survival function. However, many of these methods are either implicitly or explicitly targeted toward risk stratification rather than overall survival function estimation. Others apply only to discrete-time settings or require inverse probability of censoring weights, which can be as difficult to estimate as the outcome survival function itself. Here, we employ a decomposition of the conditional survival function in terms of observable regression models in which censoring and truncation play no role. This allows application of an array of flexible regression and classification methods rather than only approaches that explicitly handle the complexities inherent to survival data. We outline estimation procedures based on this decomposition, empirically assess their performance, and demonstrate their use on data from an HIV vaccine trial.
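A discrete-time illustration of the decomposition idea (the paper works in continuous time and also handles left truncation): conditional survival factors into one minus conditional hazards, each of which, under coarsening-at-random-type conditions, is an ordinary regression or classification target among subjects still at risk, so that inverse censoring weights are not needed.

```latex
% Discrete-time illustration of the decomposition (notation illustrative):
S(t \mid x) \;=\; \prod_{k:\, t_k \le t} \bigl\{ 1 - h(t_k \mid x) \bigr\},
\qquad
h(t_k \mid x) \;=\; P\bigl(T = t_k \,\big|\, T \ge t_k,\, X = x\bigr)
```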
In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
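A schematic of the general workflow described above, with stand-ins for each component; the paper's actual variable-importance criterion and error-rate control procedures differ, and the imputer, learner, and voting rule below are illustrative choices only.

```python
# Multiply impute, rank variables with a nonparametric learner in each
# completed dataset, and keep variables that are consistently selected.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def select_panel(X_missing, y, n_imputations=5, top_k=5, min_votes=3, seed=0):
    votes = np.zeros(X_missing.shape[1])
    for m in range(n_imputations):
        imputer = IterativeImputer(random_state=seed + m, sample_posterior=True)
        X_m = imputer.fit_transform(X_missing)
        imp = RandomForestClassifier(random_state=seed + m).fit(X_m, y).feature_importances_
        votes[np.argsort(imp)[::-1][:top_k]] += 1   # vote for the top-ranked features
    return np.where(votes >= min_votes)[0]          # features selected in most imputations
```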
Change-point analysis plays a significant role in many fields, revealing shifts in distribution within a sequence of observations. While a number of algorithms have been proposed for high-dimensional data, kernel-based methods have not been well explored due to difficulties in controlling false discoveries and mediocre performance. In this paper, we propose a new kernel-based framework that makes use of an important pattern of data in high dimensions to boost power. Analytic approximations to the significance of the new statistics are derived, and fast tests based on the asymptotic results are proposed, offering easy off-the-shelf tools for large datasets. The new tests show superior performance for a wide range of alternatives when compared with other state-of-the-art methods. We illustrate these new approaches through an analysis of phone-call network data. All proposed methods are implemented in the R package KerSeg.
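For orientation, a brute-force kernel scan for a single change point is sketched below using a standard MMD-style contrast between the segments before and after each candidate point; the paper's statistics, the high-dimensional pattern they exploit, their analytic p-value approximations, and the KerSeg implementation all differ from this naive sketch.

```python
# Naive kernel change-point scan: maximize a biased MMD^2 contrast over t.
import numpy as np

def rbf_gram(X, bandwidth):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def kernel_scan(X, bandwidth=1.0, min_seg=10):
    n = len(X)
    K = rbf_gram(X, bandwidth)
    stats_ = {}
    for t in range(min_seg, n - min_seg):
        a, b, cross = K[:t, :t], K[t:, t:], K[:t, t:]
        stats_[t] = a.mean() + b.mean() - 2 * cross.mean()  # biased MMD^2
    return max(stats_, key=stats_.get), stats_
```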