Fred Hutchinson Cancer Center
As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort relationships among variables, and (ii) variance, when uncertainty from the prediction model and the intrinsic variability of the true data are ignored. We then review recent methods for conducting IPD and discuss how this framework is deeply rooted in classical statistical theory. We comment on open questions and promising avenues for future work in this area, and close with guidance on using predicted data in scientific studies in a manner that is both transparent and statistically principled.
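As a concrete illustration of both failure modes, the toy simulation below (not from the paper; the attenuated "predictor" and all constants are invented for illustration) regresses machine predictions on a covariate in place of the true outcome. The slope estimate inherits the prediction model's distortion, and naive confidence intervals almost never cover the truth.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, sims = 500, 1000
cover = 0
for _ in range(sims):
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=n)   # true slope = 2; y is "unobserved" downstream
    # stand-in pre-trained predictor: accurate-looking but attenuated surrogate for y
    y_hat = 0.8 * (1.0 + 2.0 * x) + rng.normal(scale=0.5, size=n)
    fit = sm.OLS(y_hat, sm.add_constant(x)).fit()       # naive analysis: y_hat treated as y
    lo, hi = fit.conf_int()[1]                          # 95% CI for the slope
    cover += (lo <= 2.0 <= hi)
print(f"naive 95% CI coverage of the true slope: {cover / sims:.2f}")  # ~0, not 0.95
```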
Active subspace analysis uses the leading eigenspace of the gradient's second moment to conduct supervised dimension reduction. In this article, we extend this methodology to real-valued functionals on Hilbert space. We define an operator which coincides with the active subspace matrix when applied to a Euclidean space. We show that many of the desirable properties of active subspace analysis extend directly to the infinite-dimensional setting. We also propose a Monte Carlo procedure and discuss its convergence properties. Finally, we deploy this methodology to create visualizations and improve modeling and optimization on complex test problems.
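In the Euclidean special case, the Monte Carlo procedure amounts to averaging gradient outer products and taking the leading eigenvectors of the resulting matrix. A minimal sketch (my own toy function and sampler, not a test problem from the article):

```python
import numpy as np

def active_subspace(grad_f, sampler, n_samples=2000, k=2, seed=0):
    """Monte Carlo estimate of C = E[grad f(X) grad f(X)^T] and its top-k eigenspace."""
    rng = np.random.default_rng(seed)
    X = sampler(rng, n_samples)                   # (n_samples, d) draws from the input measure
    G = np.array([grad_f(x) for x in X])          # (n_samples, d) gradients
    C = G.T @ G / n_samples                       # second-moment matrix estimate
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns ascending order
    idx = np.argsort(eigvals)[::-1]
    return eigvals[idx], eigvecs[:, idx[:k]]      # spectrum and active directions

# toy example: f(x) = sin(w.x) varies along a single linear combination of inputs
w = np.array([1.0, 0.5, 0.0, 0.0])
grad_f = lambda x: np.cos(w @ x) * w
sampler = lambda rng, n: rng.normal(size=(n, 4))
vals, W = active_subspace(grad_f, sampler)
print(vals)   # one dominant eigenvalue; leading eigenvector aligns with w
```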
Recent work has focused on nonparametric estimation of conditional treatment effects, but inference has remained relatively unexplored. We propose a class of nonparametric tests for both quantitative and qualitative treatment effect heterogeneity. The tests can incorporate a variety of structured assumptions on the conditional average treatment effect, allow for both continuous and discrete covariates, and do not require sample splitting. Furthermore, we show how the tests are tailored to detect alternatives where the population impact of adopting a personalized decision rule differs from using a rule that discards covariates. The tests are thus relevant for guiding treatment policies. Their utility is borne out in simulation studies and a re-analysis of an AIDS clinical trial.
Wearable devices enable continuous multi-modal physiological and behavioral monitoring, yet analysis of these data streams faces fundamental challenges including the lack of gold-standard labels and incomplete sensor data. While self-supervised learning approaches have shown promise for addressing these issues, existing multi-modal extensions present opportunities to better leverage the rich temporal and cross-modal correlations inherent in simultaneously recorded wearable sensor data. We propose the Multi-modal Cross-masked Autoencoder (MoCA), a self-supervised learning framework that combines transformer architecture with masked autoencoder (MAE) methodology, using a principled cross-modality masking scheme that explicitly leverages correlation structures between sensor modalities. MoCA demonstrates strong performance gains across reconstruction and downstream classification tasks on diverse benchmark datasets. We further provide theoretical guarantees by establishing a fundamental connection between the multi-modal MAE loss and kernelized canonical correlation analysis through a Reproducing Kernel Hilbert Space framework, providing principled guidance for correlation-aware masking strategy design. Our approach offers a novel solution for leveraging unlabeled multi-modal wearable data while handling missing modalities, with broad applications across digital health domains.
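To make the cross-masking idea concrete, here is a toy generator for paired masks (an invented simplification at a 50% mask ratio; MoCA's actual scheme is correlation-aware and operates on transformer patch embeddings): with probability `rho` the two modalities receive complementary masks, so each hidden patch is visible in the other modality and must be reconstructed cross-modally.

```python
import numpy as np

def cross_modal_masks(n_patches, rho=0.8, rng=None):
    """Toy paired masking at a 50% mask ratio (illustrative, not MoCA's scheme):
    with probability rho the two modalities get complementary masks, forcing the
    autoencoder to reconstruct each hidden patch from the other modality."""
    rng = rng or np.random.default_rng()
    half = n_patches // 2
    mask_a = np.zeros(n_patches, dtype=bool)
    mask_a[rng.permutation(n_patches)[:half]] = True
    if rng.random() < rho:
        mask_b = ~mask_a                          # complementary masking
    else:
        mask_b = np.zeros(n_patches, dtype=bool)  # independent masking
        mask_b[rng.permutation(n_patches)[:half]] = True
    return mask_a, mask_b

a, b = cross_modal_masks(8, rng=np.random.default_rng(0))
print(a.astype(int), b.astype(int))
```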
We provide an inferential framework to assess variable importance for heterogeneous treatment effects. This assessment is especially useful in high-risk domains such as medicine, where decision makers hesitate to rely on black-box treatment recommendation algorithms. The variable importance measures we consider are local in that they may differ across individuals, while the inference is global in that it tests whether a given variable is important for any individual. Our approach builds on recent developments in semiparametric theory for function-valued parameters, and is valid even when statistical machine learning algorithms are employed to quantify treatment effect heterogeneity. We demonstrate the applicability of our method to infectious disease prevention strategies.
We present a machine learning method capable of accurately detecting chromosome abnormalities that cause blood cancers directly from microscope images of the metaphase stage of cell division. The pipeline is built on a series of fine-tuned Vision Transformers. Current state of the art (and standard clinical practice) requires expensive, manual expert analysis, whereas our pipeline takes only 15 seconds per metaphase image. Using a novel pretraining-finetuning strategy to mitigate the challenge of data scarcity, we achieve 94% area under the precision-recall curve (AUPRC) for the clinically significant del(5q) and t(9;22) anomalies. Our method also unlocks zero-shot detection of rare aberrations based on model latent embeddings. The ability to quickly, accurately, and scalably diagnose genetic abnormalities directly from metaphase images could transform karyotyping practice and improve patient outcomes. We will make code publicly available.
In many clinical settings, an active-controlled trial design (e.g., a non-inferiority or superiority design) is often used to compare an experimental medicine to an active control (e.g., an FDA-approved, standard therapy). One prominent example is a recent phase 3 efficacy trial, HIV Prevention Trials Network Study 084 (HPTN 084), comparing long-acting cabotegravir, a new HIV pre-exposure prophylaxis (PrEP) agent, to the FDA-approved daily oral tenofovir disoproxil fumarate plus emtricitabine (TDF/FTC) in a population of heterosexual women in 7 African countries. One key complication of interpreting study results in an active-controlled trial like HPTN 084 is that the placebo arm is not present and the efficacy of the active control (and hence the experimental drug) compared to the placebo can only be inferred by leveraging other data sources. In this article, we study statistical inference for the intention-to-treat (ITT) effect of the active control using data from relevant historical placebo-controlled trials under the potential outcomes (PO) framework. We highlight the role of adherence and unmeasured confounding, discuss in detail identification assumptions and two modes of inference (point versus partial identification), propose estimators under identification assumptions permitting point identification, and lay out sensitivity analyses needed to relax identification assumptions. We apply our framework to estimate the intention-to-treat effect of daily oral TDF/FTC versus placebo in HPTN 084 using data from an earlier Phase 3, placebo-controlled trial of daily oral TDF/FTC (Partners PrEP).
Laboratory research is a complex, collaborative process that involves several stages, including hypothesis formulation, experimental design, data generation and analysis, and manuscript writing. Although reproducibility and data sharing are increasingly prioritized at the publication stage, integrating these principles at earlier stages of laboratory research has been hampered by the lack of broadly applicable solutions. Here, we propose that the workflow used in modern software development offers a robust framework for enhancing reproducibility and collaboration in laboratory research. In particular, we show that GitHub, a platform widely used for collaborative software projects, can be effectively adapted to organize and document all aspects of a research project's lifecycle in a molecular biology laboratory. We outline a three-step approach for incorporating the GitHub ecosystem into laboratory research workflows: 1. designing and organizing experiments using issues and project boards, 2. documenting experiments and data analyses with a version control system, and 3. ensuring reproducible software environments for data analyses and writing tasks with containerized packages. The versatility, scalability, and affordability of this approach make it suitable for various scenarios, ranging from small research groups to large, cross-institutional collaborations. Adopting this framework from a project's outset can increase the efficiency and fidelity of knowledge transfer within and across research laboratories. An example GitHub repository based on this approach is available at this https URL, and a template repository that can be copied is available at this https URL.
Understanding the viral dynamics and immunizing antibodies of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is crucial for devising better therapeutic and prevention strategies for COVID-19. Here, we present a Bayesian hierarchical model that jointly estimates the diagnostic RNA viral load reflecting genomic materials of SARS-CoV-2, the subgenomic RNAs (sgRNA) viral load reflecting active viral replication, and the rate and timing of seroconversion reflecting presence of antibodies. Our proposed method accounts for the dynamical relationship and correlation structure between the two types of viral load, allows for borrowing of information between viral load and antibody data, and identifies potential correlates of viral load characteristics and propensity for seroconversion. We demonstrate the features of the joint model through application to the COVID-19 PEP study and conduct a cross-validation exercise to illustrate the model's ability to impute the sgRNA viral trajectories for people who only had diagnostic viral load data.
Differential expression (DE) analysis is a key task in RNA-seq studies, aiming to identify genes with expression differences across conditions. A central challenge is balancing false discovery rate (FDR) control with statistical power. Parametric methods such as DESeq2 and edgeR achieve high power by modeling gene-level counts using negative binomial distributions and applying empirical Bayes shrinkage. However, these methods may suffer from FDR inflation when model assumptions are mildly violated, especially in large-sample settings. In contrast, non-parametric tests like Wilcoxon offer more robust FDR control but often lack power and do not support covariate adjustment. We propose Nullstrap-DE, a general add-on framework that combines the strengths of both approaches. Designed to augment tools like DESeq2 and edgeR, Nullstrap-DE calibrates FDR while preserving power, without modifying the original method's implementation. It generates synthetic null data from a model fitted under the gene-specific null (no DE), applies the same test statistic to both observed and synthetic data, and derives a threshold that satisfies the target FDR level. We show theoretically that Nullstrap-DE asymptotically controls FDR while maintaining power consistency. Simulations confirm that it achieves reliable FDR control and high power across diverse settings, where DESeq2, edgeR, or Wilcoxon often show inflated FDR or low power. Applications to real datasets show that Nullstrap-DE enhances statistical rigor and identifies biologically meaningful genes.
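A schematic of the thresholding step (not the packaged Nullstrap-DE; the statistics below are simulated stand-ins for, e.g., absolute Wald statistics computed by the same pipeline on observed counts and on synthetic counts generated under each gene's fitted null):

```python
import numpy as np

def nullstrap_threshold(stat_obs, stat_null, alpha=0.05):
    """Smallest cutoff t whose estimated FDR is <= alpha: the numerator counts
    synthetic-null statistics exceeding t (a proxy for expected false
    discoveries) and the denominator counts discoveries on the real data."""
    for t in np.sort(np.unique(stat_obs)):
        n_false = np.sum(stat_null >= t)
        n_disc = max(np.sum(stat_obs >= t), 1)
        if n_false / n_disc <= alpha:
            return t
    return np.inf

# toy usage: 1800 null genes plus 200 genes with genuine signal
rng = np.random.default_rng(1)
stat_null = np.abs(rng.normal(size=2000))
stat_obs = np.concatenate([np.abs(rng.normal(size=1800)),
                           np.abs(rng.normal(loc=4, size=200))])
t = nullstrap_threshold(stat_obs, stat_null)
print(t, np.sum(stat_obs >= t))   # cutoff and number of discoveries
```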
Instrumental variables (IV) are a commonly used tool to estimate causal effects from non-randomized data. An archetype of an IV is a randomized trial with non-compliance where the randomized treatment assignment serves as an IV for the non-ignorable treatment received. Under a monotonicity assumption, a valid IV non-parametrically identifies the average treatment effect among a non-identified, latent complier subgroup, whose generalizability is often under debate. In many studies, there could exist multiple versions of an IV, for instance, different nudges to take the same treatment in different study sites in a multicenter clinical trial. These different versions of an IV may result in different compliance rates and offer a unique opportunity to study IV estimates' generalizability. In this article, we introduce a novel nested IV assumption and study identification of the average treatment effect among two latent subgroups: always-compliers and switchers, who are defined based on the joint potential treatment received under two versions of a binary IV. We derive the efficient influence function for the SWitcher Average Treatment Effect (SWATE) under a non-parametric model and propose efficient estimators. We then propose formal statistical tests of the generalizability of IV estimates under the nested IV framework. We apply the proposed method to the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial and study the causal effect of colorectal cancer screening and its generalizability.
Random survival forests are widely used for estimating covariate-conditional survival functions under right-censoring. Their standard log-rank splitting criterion is typically recomputed at each candidate split. This O(M) cost per split, with M the number of distinct event times in a node, creates a bottleneck for large cohort datasets with long follow-up. We revisit approximations proposed by LeBlanc and Crowley (1995) and develop simple constant-time updates for the log-rank criterion. The method is implemented in grf and substantially reduces training time on large datasets while preserving predictive performance.
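A sketch of the idea in the spirit of the LeBlanc-Crowley survival-score approximation (a simplification, not the grf implementation; ties are handled crudely): compute per-observation scores u_i = delta_i - Lambda_hat(T_i) once per node in O(M), after which scanning candidate splits updates a standardized score-sum statistic in O(1) per split.

```python
import numpy as np

def nelson_aalen_scores(time, event):
    """Per-observation survival scores u_i = delta_i - NA-estimate Lambda(T_i),
    computed once per node in O(M) (ties handled crudely for this sketch)."""
    order = np.argsort(time)
    d = event[order].astype(float)
    n = len(time)
    at_risk = n - np.arange(n)            # risk-set size at each sorted time
    cumhaz = np.cumsum(d / at_risk)       # Nelson-Aalen estimate at each time
    u = np.empty(n)
    u[order] = d - cumhaz
    return u

def best_split(x, u):
    """Scan splits on covariate x; each step moves one score to the left child,
    so the approximate log-rank statistic updates in constant time."""
    order = np.argsort(x)
    u_sorted = u[order]
    total, n = u.sum(), len(u)
    s2 = np.sum((u - u.mean()) ** 2)      # for the without-replacement variance
    left_sum, best = 0.0, (None, -np.inf)
    for k in range(n - 1):
        left_sum += u_sorted[k]           # O(1) update
        n_l = k + 1
        var = s2 * n_l * (n - n_l) / (n * (n - 1))
        stat = (left_sum - n_l / n * total) ** 2 / max(var, 1e-12)
        if stat > best[1]:
            best = (x[order[k]], stat)
    return best

# toy check: survival is shorter when x > 0.5, so the split lands near 0.5
rng = np.random.default_rng(3)
x = rng.uniform(size=400)
t = rng.exponential(np.where(x > 0.5, 1.0, 3.0))
c = rng.exponential(4.0, size=400)
time, event = np.minimum(t, c), (t <= c).astype(int)
print(best_split(x, nelson_aalen_scores(time, event)))
```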
Artificial intelligence (AI) and machine learning (ML) are increasingly used to generate data for downstream analyses, yet naively treating these predictions as true observations can lead to biased results and incorrect inference. Wang et al. (2020) proposed a method, post-prediction inference, which calibrates inference by modeling the relationship between AI/ML-predicted and observed outcomes in a small, gold-standard sample. Since then, several methods have been developed for inference with predicted data. We revisit Wang et al. in light of these recent developments. We reflect on their assumptions and offer a simple extension of their method which relaxes these assumptions. Our extension (1) yields unbiased point estimates under standard conditions and (2) incorporates a simple scaling factor to preserve calibration variability. In extensive simulations, we show that our method maintains nominal Type I error rates, reduces bias, and achieves proper coverage.
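A minimal bootstrap sketch of the post-prediction inference recipe (illustrative only; the paper's specific scaling-factor extension is not reproduced here, and the toy data are constructed so the relationship-model assumption holds by design): calibrate predictions against gold-standard outcomes, then propagate both calibration and sampling variability into the downstream slope.

```python
import numpy as np
import statsmodels.api as sm

def postpi_boot(x_unlab, yhat_unlab, yhat_val, y_val, n_boot=500, seed=0):
    """Calibrate predictions on the gold-standard sample, then bootstrap both
    the calibration fit and the downstream regression so uncertainty from the
    prediction model propagates into the interval."""
    rng = np.random.default_rng(seed)
    betas = []
    for _ in range(n_boot):
        i = rng.integers(len(y_val), size=len(y_val))        # resample validation set
        cal = sm.OLS(y_val[i], sm.add_constant(yhat_val[i])).fit()
        j = rng.integers(len(x_unlab), size=len(x_unlab))    # resample unlabeled set
        y_tilde = cal.predict(sm.add_constant(yhat_unlab[j]))
        fit = sm.OLS(y_tilde, sm.add_constant(x_unlab[j])).fit()
        betas.append(fit.params[1])
    betas = np.asarray(betas)
    return betas.mean(), np.percentile(betas, [2.5, 97.5])

# toy data: E[y | y_hat] = y_hat by construction, true downstream slope = 0.5
rng = np.random.default_rng(2)
x = rng.normal(size=1200)
y_hat = 1 + 0.5 * x + rng.normal(scale=0.4, size=1200)   # pre-trained predictions
y = y_hat + rng.normal(scale=0.6, size=1200)             # gold-standard outcomes
est, ci = postpi_boot(x[100:], y_hat[100:], y_hat[:100], y[:100])
print(est, ci)   # recovers ~0.5 with honest interval width
```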
Evaluating the performance of a prediction model is a common task in medical statistics. Standard accuracy metrics require the observation of the true outcomes. This is typically not possible in the setting with time-to-event outcomes due to censoring. Interval censoring, the presence of time-varying covariates, and competing risks present additional challenges in obtaining those accuracy metrics. In this study, we propose two methods to deal with interval censoring in a time-varying competing risk setting: a model-based approach and the inverse probability of censoring weighting (IPCW) approach, focusing on three key time-dependent metrics: area under the receiver-operating characteristic curve (AUC), Brier score, and expected predictive cross-entropy (EPCE). The evaluation is conducted over a medically relevant time interval of interest, $[t, \Delta t)$. The model-based approach includes all subjects in the risk set, using their predicted risks to contribute to the accuracy metrics. In contrast, the IPCW approach only considers the subset of subjects who are known to be event-free or to experience the event within the interval of interest. We perform a simulation study to compare the performance of the two approaches with regard to the three metrics.
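For intuition, here is the IPCW construction in the simpler right-censored, single-event special case (the interval-censored, competing-risk setting of the study requires more machinery; `lifelines` is used only to estimate the censoring distribution):

```python
import numpy as np
from lifelines import KaplanMeierFitter

def ipcw_brier(time, event, risk_at_t, t):
    """IPCW Brier score at horizon t under right-censoring. risk_at_t is the
    model-predicted P(event by t) per subject; subjects censored before t get
    weight zero, others are reweighted by the censoring survival function G."""
    km_c = KaplanMeierFitter().fit(time, 1 - event)   # Kaplan-Meier for censoring
    G = lambda s: km_c.survival_function_at_times(s).to_numpy()
    died = (time <= t) & (event == 1)
    alive = time > t
    w = np.zeros(len(time))
    w[died] = 1.0 / np.clip(G(time[died]), 1e-8, None)
    w[alive] = 1.0 / max(G(np.array([t]))[0], 1e-8)
    sq_err = np.where(died, (1 - risk_at_t) ** 2, risk_at_t ** 2)
    return np.sum(w * sq_err) / len(time)

# toy usage with exponential event/censoring times and oracle risk predictions
rng = np.random.default_rng(0)
T, C = rng.exponential(5, 300), rng.exponential(7, 300)
time, event = np.minimum(T, C), (T <= C).astype(int)
risk = np.full(300, 1 - np.exp(-10 / 5))   # true P(event by t=10) for this model
print(ipcw_brier(time, event, risk, t=10))
```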
T-cells play a key role in adaptive immunity by mounting specific responses against diverse pathogens. An effective binding between T-cell receptors (TCRs) and pathogen-derived peptides presented on Major Histocompatibility Complexes (MHCs) mediates an immune response. However, predicting these interactions remains challenging due to limited functional data on T-cell reactivities. Here, we introduce a computational approach to predict TCR interactions with peptides presented on MHC class I alleles, and to design novel immunogenic peptides for specified TCR-MHC complexes. Our method leverages HERMES, a structure-based, physics-guided machine learning model trained on the protein universe to predict amino acid preferences based on local structural environments. Despite no direct training on TCR-pMHC data, the implicit physical reasoning in HERMES enables us to make accurate predictions of both TCR-pMHC binding affinities and T-cell activities across diverse viral epitopes and cancer neoantigens, achieving up to 0.72 correlation with experimental data. Leveraging our TCR recognition model, we develop a computational protocol for de novo design of immunogenic peptides. Through experimental validation in three TCR-MHC systems targeting viral and cancer peptides, we demonstrate that our designs -- with up to five substitutions from the native sequence -- activate T-cells at success rates of up to 50%. Lastly, we use our generative framework to quantify the diversity of the peptide recognition landscape for various TCR-MHC complexes, offering key insights into T-cell specificity in both humans and mice. Our approach provides a platform for immunogenic peptide and neoantigen design, as well as for evaluating TCR specificity, offering a computational framework to inform design of engineered T-cell therapies and vaccines.
Organisms have evolved immune systems that can counter pathogenic threats. The adaptive immune system in vertebrates consists of a diverse repertoire of immune receptors that can dynamically reorganize to specifically target the ever-changing pathogenic landscape. Pathogens in turn evolve to escape the immune challenge, forming a co-evolutionary arms race. We introduce a formalism to characterize out-of-equilibrium interactions in co-evolutionary processes. We show that the rates of information exchange and entropy production can distinguish the leader from the follower in an evolutionary arms race. Lastly, we introduce co-evolutionary efficiency as a metric to quantify each population's ability to exploit information in response to the other. Our formalism provides insights into the conditions necessary for stable co-evolution and establishes bounds on the limits of information exchange and adaptation in co-evolving systems.
We introduce a new data fusion method that utilizes multiple data sources to estimate a smooth, finite-dimensional parameter. Most existing methods only make use of fully aligned data sources that share common conditional distributions of one or more variables of interest. However, in many settings, the scarcity of fully aligned sources can make existing methods require unduly large sample sizes to be useful. Our approach enables the incorporation of weakly aligned data sources, provided their degree of misalignment is known up to finite-dimensional parameters. We quantify the additional efficiency gains achieved through the integration of these weakly aligned sources. We characterize the semiparametric efficiency bound and provide a general means to construct estimators achieving these efficiency gains. We illustrate our results by fusing data from two harmonized HIV monoclonal antibody prevention efficacy trials to study how a neutralizing antibody biomarker associates with HIV genotype.
Cytotoxic T lymphocytes eliminate infected or malignant cells, safeguarding surrounding tissues. Although experimental and systems-immunology studies have cataloged many molecular and cellular actors involved in an immune response, the design principles governing how the speed and magnitude of T-cell responses emerge from cellular decision-making remain elusive. Here, we recast a T-cell response as a feedback-controlled program, wherein the rates of activation, proliferation, differentiation and death are regulated through antigenic, pro- and anti-inflammatory cues. We demonstrate how the speed and magnitude of T-cell responses emerge from optimizing signal-feedback to protect against diverse infection settings. We recover an inherent trade-off: infection clearance at the cost of immunopathology. We show how this trade-off is encoded into the logic of T-cell responses by hierarchical sensitivity to different immune signals. This model explains experimentally observed patterns of T-cell effector expansion regulation in mice. Extending our model to T-cell-based immunotherapies for cancer, we identify a trade-off between the affinity for tumor antigens ("quality") and the abundance ("quantity") of infused T-cells necessary for effective treatment. Finally, we show how therapeutic efficacy can be improved by targeted genetic perturbations to T-cells. Our findings offer a unified control-logic for cytotoxic T-cell responses and point to specific regulatory programs that can be engineered for more robust T-cell therapies.
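A caricature of the feedback-control view (my own toy ODE with invented rate constants, not the paper's fitted model): antigen load drives proliferation, killing clears the pathogen, and the accumulated response feeds back as suppression, generating the clearance-versus-immunopathology tension.

```python
import numpy as np
from scipy.integrate import solve_ivp

def tcell_program(t, y, k_prolif=1.0, k_death=0.3, k_clear=2.0, k_supp=0.5):
    """Toy feedback-controlled T-cell response (illustrative assumptions only):
    T = effector T-cells, P = pathogen load."""
    T, P = y
    antigen = P / (P + 1.0)                        # saturating antigenic signal
    dT = (k_prolif * antigen - k_death
          - k_supp * antigen * T / (T + 10.0)) * T # suppression grows with response
    dP = P * (1.0 - k_clear * T / (T + 1.0))       # pathogen growth vs. killing
    return [dT, dP]

sol = solve_ivp(tcell_program, (0, 30), [0.01, 1.0])
print(sol.y[:, -1])   # final T-cell and pathogen levels
```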
As part of work to connect phylogenetics with machine learning, there has been considerable recent interest in vector encodings of phylogenetic trees. We present a simple new "ordered leaf attachment" (OLA) method for uniquely encoding a binary, rooted phylogenetic tree topology as an integer vector. OLA encoding and decoding take linear time in the number of leaf nodes, and the set of vectors corresponding to trees is a simply-described subset of integer sequences. Among existing encodings, OLA is unique in having these properties. The integer-vector encoding induces a distance on the set of trees, and we investigate this distance in relation to the NNI and SPR distances.
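To illustrate the flavor of such an encoding, here is a decoder under one natural attachment convention (hedged: the paper's exact indexing may differ). Leaf i+2 is attached by subdividing the edge selected by v[i], with -1 meaning "attach above the current root"; at that step there are 2(i+2)-3 valid codes, matching the (2n-3)!! count of rooted binary topologies.

```python
def ola_decode(v):
    """Decode an integer vector into a rooted binary tree in parent-pointer form.
    Hypothetical convention: start from the two-leaf tree (leaves 0, 1 under
    root n); each edge is named by its child endpoint; v[i] picks the edge to
    subdivide for leaf i+2, and v[i] == -1 grafts the leaf above the root."""
    n = len(v) + 2
    parent = {0: n, 1: n, n: None}     # node -> parent (None marks the root)
    edges = [0, 1]                     # edge list, each edge named by its child
    root, nxt = n, n + 1               # nxt: next internal node id
    for i, code in enumerate(v):
        leaf, mid = i + 2, nxt         # mid: internal node created this step
        nxt += 1
        if code == -1:                 # new root above the old one
            parent[root], parent[mid] = mid, None
            edges.append(root)
            root = mid
        else:                          # subdivide the edge above node edges[code]
            child = edges[code]
            parent[mid] = parent[child]
            parent[child] = mid
            edges.append(mid)
        parent[leaf] = mid
        edges.append(leaf)
    return parent, root

# the two-leaf tree has 2 edges, so leaf 2 has codes {-1, 0, 1}: 3 options,
# and in general leaf k has 2k-3 options, recovering (2n-3)!! topologies
print(ola_decode([0, -1]))
```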