Harvard School of Public Health
Causal inference is a critical task across fields such as healthcare, economics, and the social sciences. While recent advances in machine learning, especially those based on deep-learning architectures, have shown potential in estimating causal effects, existing approaches often fall short in handling complex causal structures and lack adaptability across various causal scenarios. In this paper, we present a novel transformer-based method for causal inference that overcomes these challenges. The core innovation of our model lies in its integration of causal Directed Acyclic Graphs (DAGs) directly into the attention mechanism, enabling it to accurately model the underlying causal structure. This allows for flexible estimation of both average treatment effects (ATE) and conditional average treatment effects (CATE). Extensive experiments on both synthetic and real-world datasets demonstrate that our approach surpasses existing methods in estimating causal effects across a wide range of scenarios. The flexibility and robustness of our model make it a valuable tool for researchers and practitioners tackling complex causal inference problems.
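To make the core idea concrete, here is a minimal sketch of DAG-masked attention in PyTorch, assuming the causal structure enters simply by restricting each variable to attend to its parents (and itself); this is an illustration of the general mechanism, not the authors' architecture.

```python
# Minimal sketch: attention restricted by a causal DAG's adjacency matrix.
import torch
import torch.nn.functional as F

def dag_masked_attention(q, k, v, dag_adj):
    """
    q, k, v: (n_vars, d) per-variable embeddings.
    dag_adj: (n_vars, n_vars) 0/1 matrix; dag_adj[i, j] = 1 if j is a
             parent of i in the causal DAG (ones on the diagonal so each
             variable can always attend to itself).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (n_vars, n_vars)
    scores = scores.masked_fill(dag_adj == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attend only to parents
    return weights @ v

# Usage: a 3-variable chain X -> M -> Y.
adj = torch.tensor([[1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 1]])
x = torch.randn(3, 16)
out = dag_masked_attention(x, x, x, adj)
```

The mask guarantees that information flows only along edges of the DAG, which is one simple way an attention layer can be made to respect a causal ordering.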
A semi-supervised learning framework classifies thirteen diverse dental conditions in panoramic radiographs using a large dataset. The approach leverages Large Language Models for automated label extraction from textual reports and Masked Autoencoders for self-supervised pretraining, achieving diagnostic accuracy comparable to junior dental professionals.
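As a rough picture of the two-stage pipeline, here is a minimal PyTorch sketch, assuming an MAE-pretrained backbone (stood in for by a toy encoder) and a 13-way multi-label head trained on labels an LLM extracted from free-text reports; all names and shapes are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

N_CONDITIONS = 13  # thirteen dental conditions

# Placeholder encoder standing in for an MAE-pretrained ViT backbone.
encoder = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=7, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
model = nn.Sequential(encoder, nn.Linear(32, N_CONDITIONS))

# Stage 2: multi-label fine-tuning on LLM-extracted report labels
# (weak supervision), one sigmoid logit per condition.
criterion = nn.BCEWithLogitsLoss()
x = torch.randn(4, 1, 224, 224)                    # panoramic radiograph batch
y = torch.randint(0, 2, (4, N_CONDITIONS)).float() # weak multi-hot labels
loss = criterion(model(x), y)
loss.backward()
```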
In health and social sciences, it is critically important to identify subgroups of the study population where there is notable heterogeneity of treatment effects (HTE) with respect to the population average. Decision trees have been proposed and commonly adopted for the data-driven discovery of HTE due to their high level of interpretability. However, single-tree discovery of HTE can be unstable and oversimplified. This paper introduces the Causal Rule Ensemble (CRE), a new method for HTE discovery and estimation using an ensemble-of-trees approach. CRE offers several key features, including 1) an interpretable representation of the HTE; 2) the ability to explore complex heterogeneity patterns; and 3) high stability in subgroup discovery. The discovered subgroups are defined in terms of interpretable decision rules. Estimation of subgroup-specific causal effects is performed via a two-stage approach, for which we provide theoretical guarantees. Through simulations, we show that the CRE method is highly competitive compared to state-of-the-art techniques. Finally, we apply CRE to discover the heterogeneous health effects of exposure to air pollution on mortality for 35.3 million Medicare beneficiaries across the contiguous U.S.
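For readers who want a concrete picture of the two-stage, ensemble-of-trees idea, here is a heavily simplified Python sketch: candidate rules are harvested from shallow trees fit to a crude inverse-probability-weighted pseudo-outcome on one sample split, and subgroup effects are then estimated by sparse regression on rule indicators in the other split. The pseudo-outcome, the tree ensemble, and the selection step are all stand-ins for the paper's more careful choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
A = rng.binomial(1, 0.5, n)              # randomized treatment, e(X) = 0.5
tau = 2.0 * (X[:, 0] > 0)                # true subgroup effect
Y = X[:, 1] + tau * A + rng.normal(size=n)

# Sample splitting: rule discovery vs. effect estimation (for stability).
idx_d, idx_e = train_test_split(np.arange(n), test_size=0.5, random_state=0)

# Stage 1: shallow trees fit to an IPW pseudo-outcome; harvest split rules.
proxy = (Y[idx_d] - Y[idx_d].mean()) * (2 * A[idx_d] - 1) * 2
forest = GradientBoostingRegressor(max_depth=2, n_estimators=50, random_state=0)
forest.fit(X[idx_d], proxy)

def rules_from_tree(tree):
    """Extract axis-aligned rules (feature, threshold) from internal nodes."""
    t = tree.tree_
    return [(t.feature[i], t.threshold[i]) for i in range(t.node_count)
            if t.children_left[i] != -1]

rules = {r for est in forest.estimators_.ravel() for r in rules_from_tree(est)}
rules = list(rules)[:20]                 # keep a manageable candidate set

# Stage 2: sparse regression of a pseudo-outcome on rule indicators.
R = np.column_stack([(X[idx_e][:, f] > thr).astype(float) for f, thr in rules])
pseudo = (Y[idx_e] - Y[idx_e].mean()) * (2 * A[idx_e] - 1) * 2
fit = LassoCV(cv=5).fit(R, pseudo)
print([rules[j] for j in np.flatnonzero(fit.coef_)])   # discovered subgroups
```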
In this note we give proofs for results relating to the Instrumental Variable (IV) model with binary response $Y$ and binary treatment $X$, but with an instrument $Z$ with $K$ states. These results were originally stated in Richardson & Robins (2014), "ACE Bounds; SEMs with Equilibrium Conditions," arXiv:1410.0470.
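For context, the kind of quantity involved: under the usual binary-IV assumptions (instrument independence and the exclusion restriction), each instrument state yields Manski-type bounds on the potential-outcome probabilities, and intersecting them over the $K$ states bounds the average causal effect. A sketch (the note's actual bounds are sharper):

```latex
% Illustrative Manski-type bounds; the note's actual results are sharper.
\[
\max_{z}\, P(Y{=}1, X{=}1 \mid Z{=}z)
 \;\le\; P\{Y(1){=}1\} \;\le\;
\min_{z}\, \bigl[\, P(Y{=}1, X{=}1 \mid Z{=}z) + P(X{=}0 \mid Z{=}z) \,\bigr]
\]
```

and analogously for $P\{Y(0){=}1\}$ with the roles of $X{=}1$ and $X{=}0$ exchanged; differencing the two gives bounds on $\mathrm{ACE} = P\{Y(1){=}1\} - P\{Y(0){=}1\}$.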
Manifold learning builds on the "manifold hypothesis," which posits that data in high-dimensional datasets are drawn from lower-dimensional manifolds. Current tools generate global embeddings of data, rather than the local maps used to define manifolds mathematically. These tools also cannot assess whether the manifold hypothesis holds true for a dataset. Here, we describe DeepAtlas, an algorithm that generates lower-dimensional representations of the data's local neighborhoods, then trains deep neural networks that map between these local embeddings and the original data. Topological distortion is used to determine whether a dataset is drawn from a manifold and, if so, its dimensionality. Application to test datasets indicates that DeepAtlas can successfully learn manifold structures. Interestingly, many real datasets, including single-cell RNA-sequencing, do not conform to the manifold hypothesis. In cases where data is drawn from a manifold, DeepAtlas builds a model that can be used generatively and promises to allow the application of powerful tools from differential geometry to a variety of datasets.
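A minimal sketch of the chart-based idea in Python, assuming kNN neighborhoods, a local PCA embedding, and a small decoder network mapping local coordinates back to data space; DeepAtlas's actual embedding method, losses, and distortion diagnostics are more involved.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 10)).astype(np.float32)
d_local = 2                                   # candidate manifold dimension

# 1) Pick a landmark point and form its local neighborhood (one "chart").
nbrs = NearestNeighbors(n_neighbors=50).fit(X)
_, idx = nbrs.kneighbors(X[:1])
patch = X[idx[0]]

# 2) Low-dimensional embedding of the neighborhood.
z = PCA(n_components=d_local).fit_transform(patch).astype(np.float32)

# 3) Train a network mapping local coordinates back to data space.
decoder = nn.Sequential(nn.Linear(d_local, 64), nn.ReLU(),
                        nn.Linear(64, X.shape[1]))
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
zt, pt = torch.from_numpy(z), torch.from_numpy(patch)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(zt), pt)   # reconstruction error
    loss.backward()
    opt.step()
# Persistently high reconstruction error across candidate dimensions would
# signal distortion, i.e. evidence against the manifold hypothesis at d_local.
```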
The primary tool for predicting infectious disease spread and intervention effectiveness is the mass action Susceptible-Infected-Recovered model of Kermack and McKendrick. Its usefulness derives largely from its conceptual and mathematical simplicity; however, it incorrectly assumes all individuals have the same contact rate and contacts are fleeting. This paper is the first of three investigating edge-based compartmental modeling, a technique eliminating these assumptions. In this paper, we derive simple ordinary differential equation models capturing social heterogeneity (heterogeneous contact rates) while explicitly considering the impact of contact duration. We introduce a graphical interpretation allowing for easy derivation and communication of the model. This paper focuses on the technique and how to apply it in different contexts. The companion papers investigate choosing the appropriate level of complexity for a model and how to apply edge-based compartmental modeling to populations with various sub-structures.
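As a pointer to the flavor of the resulting models: for the configuration-model SIR case, the edge-based approach collapses the dynamics to essentially a single ODE. A sketch in standard notation, with $\psi$ the probability generating function of the degree distribution, $\theta(t)$ the probability that a random partner has not yet transmitted, $\beta$ the transmission rate, and $\gamma$ the recovery rate:

```latex
\[
\dot{\theta} = -\beta\,\theta + \beta\,\frac{\psi'(\theta)}{\psi'(1)} + \gamma\,(1 - \theta),
\qquad
S(t) = \psi\bigl(\theta(t)\bigr), \quad \dot{R} = \gamma I, \quad I = 1 - S - R.
\]
```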
Conditional independence models associated with directed acyclic graphs (DAGs) may be characterized in at least three different ways: via a factorization, the global Markov property (given by the d-separation criterion), and the local Markov property. Marginals of DAG models also imply equality constraints that are not conditional independences; the well-known "Verma constraint" is an example. Constraints of this type are used for testing edges, and in a computationally efficient marginalization scheme via variable elimination. We show that equality constraints like the "Verma constraint" can be viewed as conditional independences in kernel objects obtained from joint distributions via a fixing operation that generalizes conditioning and marginalization. We use these constraints to define, via ordered local and global Markov properties, and a factorization, a graphical model associated with acyclic directed mixed graphs (ADMGs). We prove that marginal distributions of DAG models lie in this model, and that a set of these constraints given by Tian provides an alternative definition of the model. Finally, we show that the fixing operation used to define the model leads to a particularly simple characterization of identifiable causal effects in hidden variable causal DAG models.
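To make the fixing operation concrete, here is a minimal Python sketch over a toy edge-set representation of an ADMG; the names (district, fixable, fix) mirror the paper's terminology, but the data structures are illustrative, and a full treatment would also record which vertices have been fixed (yielding a CADMG).

```python
from dataclasses import dataclass, field

@dataclass
class ADMG:
    vertices: set
    directed: set = field(default_factory=set)    # (a, b) encodes a -> b
    bidirected: set = field(default_factory=set)  # frozenset({a, b}) encodes a <-> b

def descendants(g, v):
    """Vertices reachable from v along directed edges (v included)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in g.directed:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def district(g, v):
    """Connected component of v in the bidirected part of the graph."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for e in g.bidirected:
            if u in e:
                (w,) = e - {u}
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return seen

def fixable(g, v):
    """v is fixable iff its district and descendant set meet only at v."""
    return district(g, v) & descendants(g, v) == {v}

def fix(g, v):
    """Fixing v: delete every edge pointing into v (directed or bidirected).
    A full implementation would also move v to the set of fixed vertices."""
    return ADMG(
        vertices=g.vertices,
        directed={(a, b) for a, b in g.directed if b != v},
        bidirected={e for e in g.bidirected if v not in e},
    )

# Example: A -> B -> C -> D with B <-> D.
g = ADMG({"A", "B", "C", "D"},
         directed={("A", "B"), ("B", "C"), ("C", "D")},
         bidirected={frozenset({"B", "D"})})
print(fixable(g, "B"), fixable(g, "C"))  # False True
```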
Consider the causal effect that one individual's treatment may have on another individual's outcome when the outcome is contagious, with specific application to the effect of vaccination on an infectious disease outcome. The effect of one individual's vaccination on another's outcome can be decomposed into two different causal effects, called the "infectiousness" and "contagion" effects. We present identifying assumptions and estimation or testing procedures for infectiousness and contagion effects in two different settings: (1) using data sampled from independent groups of observations, and (2) using data collected from a single interdependent social network. The methods that we propose for social network data require fitting generalized linear models (GLMs). GLMs and other statistical models that require independence across subjects have been used widely to estimate causal effects in social network data, but, because the subjects in networks are presumably not independent, the use of such models is generally invalid, resulting in inference that is expected to be anticonservative. We introduce a way to ensure that GLM residuals are uncorrelated across subjects despite the fact that outcomes are non-independent. This simultaneously demonstrates the possibility of using GLMs and related statistical models for network data and highlights their limitations.
Satellite DNAs are long, tandemly repeating sequences in a genome and may be organized as higher-order repeats (HORs). They are enriched in centromeres and are challenging to assemble. Existing algorithms for identifying satellite repeats either require the complete assembly of satellites or work only for simple repeat structures without HORs. Here we describe Satellite Repeat Finder (SRF), a new algorithm for reconstructing satellite repeat units and HORs from accurate reads or assemblies without prior knowledge of repeat structures. Applying SRF to real sequence data, we showed that SRF could reconstruct known satellites in human and well-studied model organisms. We also found satellite repeats are pervasive in various other species, accounting for up to 12% of their genome content but often underrepresented in assemblies. With the rapid progress in genome sequencing, SRF will aid the annotation of new genomes and the study of satellite DNA evolution even if such repeats are not fully assembled.
Gun violence is a major source of injury and death in the United States. However, relatively little is known about the effects of firearm injuries on survivors and their family members and how these effects vary across subpopulations. To study these questions and, more generally, to address a gap in the causal inference literature, we present a framework for the study of effect modification or heterogeneous treatment effects in difference-in-differences designs. We implement a new matching technique, which combines profile matching and risk set matching, to (i) preserve the time alignment of covariates, exposure, and outcomes, avoiding pitfalls of other common approaches for difference-in-differences, and (ii) explicitly control biases due to imbalances in observed covariates in subgroups discovered from the data. Our case study shows significant and persistent effects of nonfatal firearm injuries on several health outcomes for those injured and on the mental health of their family members. Sensitivity analyses reveal that these results are moderately robust to unmeasured confounding bias. Finally, while the effects for those injured vary largely by the severity of the injury and its documented intent, for families, effects are strongest for those whose relative's injury is documented as resulting from an assault, self-harm, or law enforcement intervention.
The standard way to parameterize the distributions represented by a directed acyclic graph is to insert a parametric family for the conditional distribution of each random variable given its parents. We show that when one's goal is to test for or estimate an effect of a sequentially applied treatment, this natural parameterization has serious deficiencies. By reparameterizing the graph using structural nested models, these deficiencies can be avoided.
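For context, one common structural nested mean model parameterizes, at each time $m$, the "blip" effect of a final dose of treatment followed by no further treatment; a sketch in standard notation (the paper's exact parameterization may differ):

```latex
% Sketch of an SNMM blip function; bar-A_m and bar-L_m are treatment and
% covariate histories, and (bar-a_m, 0) denotes following history bar-a_m
% through time m and withholding treatment thereafter.
\[
\gamma_m(\bar{a}_m, \bar{l}_m)
 = E\!\left[ Y_{(\bar{a}_m,\,\underline{0})} - Y_{(\bar{a}_{m-1},\,\underline{0})}
   \;\middle|\; \bar{L}_m = \bar{l}_m,\ \bar{A}_m = \bar{a}_m \right],
\]
```

with the model positing $\gamma_m(\bar{a}_m, \bar{l}_m; \psi)$ for a finite-dimensional parameter $\psi$, so that $\gamma_m \equiv 0$ whenever $a_m = 0$.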
Most of the literature on direct and indirect effects assumes that there are no post-treatment common causes of the mediator and the outcome. In contrast to natural direct and indirect effects, organic direct and indirect effects, which were introduced in Lok (2016, 2020), can be extended to provide an identification result for settings with post-treatment common causes of the mediator and the outcome. This article provides a definition and an identification result for organic direct and indirect effects in the presence of post-treatment common causes of mediator and outcome. These new organic indirect and direct effects have interpretations in terms of intervention effects. Organic indirect effects in the presence of post-treatment common causes are an addition to indirect effects through multivariate mediators. Organic indirect effects in the presence of post-treatment common causes can be used, e.g., 1. to predict the effect of the initial treatment if its side effects are suppressed through additional interventions, or 2. to predict the effect of a treatment that does not affect the post-treatment common cause and affects the mediator the same way as the initial treatment.
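For orientation, here is the standard decomposition into natural direct and indirect effects that organic effects refine; a sketch, with $Y_{a,m}$ the outcome under treatment $a$ and mediator value $m$, and $M_a$ the mediator under treatment $a$:

```latex
\[
\underbrace{E\!\left[Y_{1,M_1} - Y_{0,M_0}\right]}_{\text{total effect}}
= \underbrace{E\!\left[Y_{1,M_1} - Y_{1,M_0}\right]}_{\text{(natural) indirect effect}}
+ \underbrace{E\!\left[Y_{1,M_0} - Y_{0,M_0}\right]}_{\text{(natural) direct effect}}.
\]
```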
The use of instrumental variables for estimating the effect of an exposure on an outcome is popular in econometrics, and increasingly so in epidemiology. This increasing popularity may be attributed to the natural occurrence of instrumental variables in observational studies that incorporate elements of randomization, either by design or by nature (e.g., random inheritance of genes). Instrumental variables estimation of exposure effects is well established for continuous outcomes and to some extent for binary outcomes. It is, however, largely lacking for time-to-event outcomes because of complications due to censoring and survivorship bias. In this paper, we make a novel proposal under a class of structural cumulative survival models which parameterize time-varying effects of a point exposure directly on the scale of the survival function; these models are essentially equivalent to a semi-parametric variant of the instrumental variables additive hazards model. We propose a class of recursive instrumental variable estimators for these exposure effects, and derive their large sample properties along with inferential tools. We examine the performance of the proposed method in simulation studies and illustrate it in a Mendelian randomization study to evaluate the effect of diabetes on mortality using data from the Health and Retirement Study. We further use the proposed method to investigate potential benefit from breast cancer screening on subsequent breast cancer mortality based on the HIP study.
Causal variable selection in time-varying treatment settings is challenging due to evolving confounding effects. Existing methods mainly focus on time-fixed exposures and are not directly applicable to time-varying scenarios. We propose a novel two-step procedure for variable selection when modeling the treatment probability at each time point. We first introduce a novel approach to longitudinal confounder selection using a Longitudinal Outcome Adaptive LASSO (LOAL) that data-adaptively selects covariates, with theoretical justification in terms of variance reduction for the estimator of the causal effect. We then propose an Adaptive Fused LASSO that can collapse treatment model parameters over time points, with the goal of simplifying the models in order to improve the efficiency of the estimator while minimizing model misspecification bias compared with naive pooled logistic regression models. Our simulation studies highlight the need for and usefulness of the proposed approach in practice. We implemented our method on data from the Nicotine Dependence in Teens study to estimate the effect of the timing of alcohol initiation during adolescence on depressive symptoms in early adulthood.
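A minimal sketch of the outcome-adaptive LASSO step for a single time point, using the standard feature-rescaling trick to implement a weighted L1 penalty; the longitudinal pooling and the Adaptive Fused LASSO across time points are not shown, and the tuning constants are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.normal(size=(n, p))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment depends on X0
Y = X[:, 0] + 0.5 * X[:, 1] + A + rng.normal(size=n)

# Step 1: outcome regression gives adaptive weights |beta_j|^(-gamma), so
# covariates unrelated to the outcome are penalized heavily.
beta = LinearRegression().fit(np.column_stack([X, A]), Y).coef_[:p]
gamma = 2.0
w = np.abs(beta) ** (-gamma) + 1e-8

# Step 2: weighted-L1 logistic treatment model via feature rescaling;
# lasso on X_j / w_j is equivalent to the penalty sum_j w_j |beta_j|.
Xs = X / w
fit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(Xs, A)
selected = np.flatnonzero(fit.coef_.ravel())       # covariates kept
print(selected)
```

Covariates with small outcome-model coefficients receive large penalties and are dropped from the treatment model, which is the variance-reduction rationale the abstract refers to.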
Demand for data science education is surging, and traditional courses offered by statistics departments are not meeting the needs of those seeking training. This has led to a number of opinion pieces advocating for an update to the statistics curriculum. The unifying recommendation is that computing should play a more prominent role. We strongly agree with this recommendation, but advocate that the main priority is to bring applications to the forefront, as proposed by Nolan and Speed (1999). We also argue that the individuals tasked with developing data science courses should not only have statistical training, but also experience analyzing data with the main objective of solving real-world problems. Here, we share a set of general principles and offer a detailed guide derived from our successful experience developing and teaching a graduate-level, introductory data science course centered entirely on case studies. We argue for the importance of statistical thinking, as defined by Wild and Pfannkuch (1999), and describe how our approach teaches students three key skills needed to succeed in data science, which we refer to as creating, connecting, and computing. This guide can also be used by statisticians wanting to gain more practical knowledge about data science before embarking on teaching an introductory course.
Although propensity scores have been central to the estimation of causal effects for over 30 years, only recently has the statistical literature begun to consider in detail methods for Bayesian estimation of propensity scores and causal effects. Underlying this recent body of literature on Bayesian propensity score estimation is an implicit discordance between the goal of the propensity score and the use of Bayes theorem. The propensity score condenses multivariate covariate information into a scalar to allow estimation of causal effects without specifying a model for how each covariate relates to the outcome. Avoiding specification of a detailed model for the outcome response surface is valuable for robust estimation of causal effects, but this strategy is at odds with the use of Bayes theorem, which presupposes a full probability model for the observed data. The goal of this paper is to explicate this fundamental feature of Bayesian estimation of causal effects with propensity scores in order to provide context for the existing literature and for future work on this important topic.
We consider the problem of estimating the effects of a binary treatment on a continuous outcome of interest from observational data in the absence of confounding by unmeasured factors. We provide a new estimator of the population average treatment effect (ATE) based on the difference of novel double-robust (DR) estimators of the treatment-specific outcome means. We compare our new estimator with previously proposed estimators both theoretically and via simulation. DR-difference estimators may have poor finite-sample behavior when the estimated propensity scores in the treated and untreated do not overlap. We therefore propose an alternative approach, which can be used even in this unfavorable setting, based on locally efficient double-robust estimation of a semiparametric regression model for the modification, on an additive scale, of the magnitude of the treatment effect by the baseline covariates $X$. In contrast with existing methods, our approach simultaneously provides estimates of: i) the average treatment effect in the total study population, ii) the average treatment effect in the random subset of the population with overlapping estimated propensity scores, and iii) the treatment effect at each level of the baseline covariates $X$. When the covariate vector $X$ is high dimensional, one cannot be certain, owing to lack of power, that the models for the propensity score and for the regression of the outcome on treatment and $X$ used in constructing our DR estimators are nearly correct, even if they pass standard goodness-of-fit tests. Therefore, to select among candidate models, we propose a novel approach to model selection that leverages the DR nature of our treatment-effect estimator and that outperforms cross-validation in a small simulation study.
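For reference, the building block such DR-difference estimators combine is the standard augmented inverse-probability-weighted (AIPW) estimator of a treatment-specific mean, with $\hat{e}$ the estimated propensity score and $\hat{m}_a$ the estimated outcome regression:

```latex
\[
\hat{\mu}_a = \frac{1}{n}\sum_{i=1}^{n}\left[
 \frac{\mathbb{1}\{A_i = a\}\,Y_i}{\hat{p}_a(X_i)}
 - \frac{\mathbb{1}\{A_i = a\} - \hat{p}_a(X_i)}{\hat{p}_a(X_i)}\,\hat{m}_a(X_i)
\right],
\qquad
\widehat{\mathrm{ATE}} = \hat{\mu}_1 - \hat{\mu}_0,
\]
```

where $\hat{p}_1(X) = \hat{e}(X)$ and $\hat{p}_0(X) = 1 - \hat{e}(X)$; the estimator is consistent if either the propensity score model or the outcome model is correctly specified, which is the double robustness the abstract builds on.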
In this paper, we propose an inferential framework for testing general combinatorial community properties of the stochastic block model. Instead of estimating the community assignments, we aim to test hypotheses about whether a certain community property is satisfied. For instance, we propose to test whether a given set of nodes belong to the same community or whether different network communities have the same size. We propose a general inference framework that can be applied to all symmetric community properties. To ease the challenges caused by the combinatorial nature of community properties, we develop a novel shadowing bootstrap testing method. By exploiting the symmetry, our method finds a shadowing representative of the true assignment, so that the number of assignments to be tested under the alternative can be greatly reduced. In theory, we introduce a combinatorial distance between two community classes and show a combinatorial-probabilistic trade-off phenomenon in community property testing. Our test is honest as long as the product of the combinatorial distance between two community classes and the probabilistic distance between two assignment probabilities is sufficiently large. We also show that this trade-off appears in the information-theoretic lower bound for the community property test. Finally, we conduct numerical experiments on both synthetic data and a protein interaction application to demonstrate the validity of our method.
This chapter introduces statistical methods used in the analysis of social networks and in the rapidly evolving parallel field of network science. Although several instances of social network analysis in health services research have appeared recently, the majority involve only the most basic methods and thus only scratch the surface of what might be accomplished. Cutting-edge methods are presented using relevant examples and illustrations from health services research.
Causal inference with interference is a rapidly growing area. The literature has begun to relax the "no-interference" assumption that the treatment received by one individual does not affect the outcomes of other individuals. In this paper we briefly review the literature on causal inference in the presence of interference when treatments have been randomized. We then consider settings in which causal effects in the presence of interference are not identified, either because randomization alone does not suffice for identification or because treatment is not randomized and there may be unmeasured confounders of the treatment-outcome relationship. We develop sensitivity analysis techniques for these settings. We describe several sensitivity analysis techniques for the infectiousness effect which, in a vaccine trial, captures the effect of one person's vaccination in protecting a second person from infection even if the first is infected. We also develop two sensitivity analysis techniques for causal effects under interference in the presence of unmeasured confounding which generalize analogous techniques when interference is absent. These two techniques for unmeasured confounding are compared and contrasted.