Murdoch Children’s Research Institute
Observational epidemiological studies commonly seek to estimate the causal effect of an exposure on an outcome. Adjustment for potential confounding bias in modern studies is challenging due to the presence of high-dimensional confounding, which occurs when there are many confounders relative to the sample size, or when the relationships between continuous confounders and the exposure and outcome are complex. Despite recent advances, limited evaluation and guidance are available on the implementation of the doubly robust methods Augmented Inverse Probability Weighting (AIPW) and Targeted Maximum Likelihood Estimation (TMLE) with data-adaptive approaches and cross-fitting in realistic settings where high-dimensional confounding is present. Motivated by an early-life cohort study, we conducted an extensive simulation study to compare the relative performance of AIPW and TMLE using data-adaptive approaches in estimating the average causal effect (ACE). We evaluated the benefits of using cross-fitting with a varying number of folds, as well as the impact of using a reduced versus a full (larger, more diverse) library in the Super Learner ensemble learning approach used for implementation. We found that AIPW and TMLE performed similarly in most cases for estimating the ACE, but TMLE was more stable. Cross-fitting improved the performance of both methods, but was more important for estimation of standard errors and coverage than for point estimates, with the number of folds a less important consideration. Using a full Super Learner library was important to reduce bias and variance in the complex scenarios typical of modern health research studies.
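To make the estimator concrete, here is a minimal sketch of cross-fitted AIPW. It is not the implementation evaluated in the study: a single scikit-learn gradient-boosting learner stands in for the Super Learner library, the data are simulated, and all names are illustrative.

```python
# Minimal cross-fitted AIPW sketch (illustrative only; a gradient-boosting learner
# stands in for the Super Learner ensemble discussed above).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 10))                                                # confounders
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] - 0.5 * X[:, 1] * X[:, 2])))   # exposure
Y = 1.0 * A + X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)                   # outcome (true ACE = 1)

def aipw_ace(X, A, Y, n_folds=5):
    """Cross-fitted AIPW estimate of the average causal effect (ACE)."""
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        # Nuisance models are fitted on the training folds only (cross-fitting).
        g = GradientBoostingClassifier().fit(X[train], A[train])
        q1 = GradientBoostingRegressor().fit(X[train][A[train] == 1], Y[train][A[train] == 1])
        q0 = GradientBoostingRegressor().fit(X[train][A[train] == 0], Y[train][A[train] == 0])
        ps = np.clip(g.predict_proba(X[test])[:, 1], 0.01, 0.99)   # truncated propensity score
        m1, m0 = q1.predict(X[test]), q0.predict(X[test])          # outcome predictions
        psi[test] = (m1 - m0
                     + A[test] * (Y[test] - m1) / ps
                     - (1 - A[test]) * (Y[test] - m0) / (1 - ps))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))         # estimate and its SE

print(aipw_ace(X, A, Y))
```

A TMLE implementation would instead use the propensity scores to update the initial outcome predictions in a targeting step before averaging, but the cross-fitting structure is the same.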
White matter alterations are increasingly implicated in neurological diseases and their progression. International-scale studies use diffusion-weighted magnetic resonance imaging (DW-MRI) to qualitatively identify changes in white matter microstructure and connectivity. Yet, quantitative analysis of DW-MRI data is hindered by inconsistencies stemming from varying acquisition protocols. There is a pressing need to harmonize the preprocessing of DW-MRI datasets to ensure the derivation of robust quantitative diffusion metrics across acquisitions. In the MICCAI-CDMRI 2023 QuantConn challenge, participants were provided raw data from the same individuals collected on the same scanner but with two different acquisitions, and were tasked with preprocessing the DW-MRI to minimize acquisition differences while retaining biological variation. Submissions were evaluated on the reproducibility and comparability of cross-acquisition bundle-wise microstructure measures, bundle shape features, and connectomics. The key innovations of the QuantConn challenge are that (1) we assess bundles and tractography in the context of harmonization for the first time, (2) we assess connectomics in the context of harmonization for the first time, and (3) we include 10 times as many subjects as the prior harmonization challenge, MUSHAC, and 100 times as many as SuperMUDI. We find that bundle surface area, fractional anisotropy, connectome assortativity, betweenness centrality, edge count, modularity, nodal strength, and participation coefficient are the measures most biased by acquisition, and that machine-learning voxel-wise correction, RISH mapping, and NeSH methods effectively reduce these biases. In contrast, the microstructure measures AD, MD, and RD, along with bundle length, connectome density, efficiency, and path length, are least biased by these acquisition differences.
Missing data are ubiquitous in medical research. Although there is increasing guidance on how to handle missing data, practice is changing slowly and misapprehensions abound, particularly in observational research. We present a practical framework for handling and reporting the analysis of incomplete data in observational studies, which we illustrate using a case study from the Avon Longitudinal Study of Parents and Children. The framework consists of three steps: 1) Develop an analysis plan specifying the analysis model and how missing data are going to be addressed. Important considerations are whether a complete records analysis is likely to be valid, whether multiple imputation or an alternative approach is likely to offer benefits, and whether a sensitivity analysis regarding the missingness mechanism is required. 2) Explore the data, check that the methods outlined in the analysis plan are appropriate, and conduct the pre-planned analysis. 3) Report the results, including a description of the missing data, details on how the missing data were addressed, and the results from all analyses, interpreted in light of the missing data and the clinical relevance. This framework seeks to support researchers in thinking systematically about missing data, and in transparently reporting the potential effect of missing data on the study results.
Regression methods dominate the practice of biostatistical analysis, but biostatistical training emphasises the details of regression models and methods ahead of the purposes for which such modelling might be useful. More broadly, statistics is widely understood to provide a body of techniques for "modelling data", underpinned by what we describe as the "true model myth": that the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective has led to a range of problems in the application of regression methods, including misguided "adjustment" for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline a new approach to the teaching and application of biostatistical methods, which situates them within a framework that first requires clear definition of the substantive research question at hand within one of three categories: descriptive, predictive, or causal. Within this approach, the development and application of (multivariable) regression models, as well as other advanced biostatistical methods, should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of "input" variables, but their conceptualisation and usage should follow from the purpose at hand.
Mediation analysis is commonly used in epidemiological research, but guidance is lacking on how multivariable missing data should be dealt with in these analyses. Multiple imputation (MI) is a widely used approach, but questions remain regarding the impact of the missingness mechanism, how to ensure compatibility of the imputation model with the mediation analysis, and the approach to variance estimation. To address these gaps, we conducted a simulation study based on the Victorian Adolescent Health Cohort Study. We considered six missingness mechanisms, involving varying assumptions regarding the influence of the outcome and/or mediator on missingness in key variables. We compared the performance of complete-case analysis, seven MI approaches differing in how the imputation model was tailored, and a "substantive model compatible" MI approach. We evaluated both the MI-Boot (MI, then bootstrap) and Boot-MI (bootstrap, then MI) approaches to variance estimation. Results showed that when the mediator and/or outcome influenced their own missingness, there was large bias in effect estimates, while for other mechanisms appropriate MI approaches yielded approximately unbiased estimates. Beyond incorporating all analysis variables in the imputation model, how MI was tailored for compatibility with the mediation analysis did not greatly impact point estimation bias. Boot-MI returned variance estimates with smaller bias than MI-Boot, especially in the presence of incompatibility.
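As a rough illustration of the "bootstrap, then MI" (Boot-MI) ordering examined above, the sketch below pairs a percentile interval over bootstrap replicates with a small number of imputations per replicate. It is not the study's implementation: a linear regression coefficient stands in for the mediation effect estimate, and scikit-learn's IterativeImputer stands in for the study-specific imputation model.

```python
# Boot-MI sketch: resample the incomplete data, impute within each bootstrap
# replicate, and take a percentile interval over the replicate estimates.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=n)                         # exposure
M = 0.5 * X + rng.normal(size=n)               # mediator
Y = 0.3 * X + 0.4 * M + rng.normal(size=n)     # outcome
M[rng.random(n) < 0.3] = np.nan                # ~30% of mediator values missing
data = np.column_stack([X, M, Y])

def estimate(completed):
    """Stand-in analysis: coefficient of the mediator in the outcome regression."""
    return LinearRegression().fit(completed[:, :2], completed[:, 2]).coef_[1]

def boot_mi(data, n_boot=100, m=5):
    ests = []
    for b in range(n_boot):
        boot = data[rng.integers(0, len(data), len(data))]   # bootstrap first...
        reps = [estimate(IterativeImputer(sample_posterior=True,
                                          random_state=b * m + j).fit_transform(boot))
                for j in range(m)]                            # ...then impute m times
        ests.append(np.mean(reps))
    return np.percentile(ests, [2.5, 97.5])

print("Boot-MI 95% percentile interval:", boot_mi(data))
```

MI-Boot reverses the ordering, imputing the original data first and then bootstrapping within each completed dataset before pooling.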
In the context of missing data, the identifiability or "recoverability" of the average causal effect (ACE) depends on causal and missingness assumptions. The latter can be depicted by adding variable-specific missingness indicators to causal diagrams, creating "missingness-directed acyclic graphs" (m-DAGs). Previous research described ten canonical m-DAGs, representing typical multivariable missingness mechanisms in epidemiological studies, and determined the recoverability of the ACE in the absence of effect modification. We extend this research by determining the recoverability of the ACE in settings with effect modification and by conducting a simulation study evaluating the performance of widely used missing data methods when estimating the ACE using correctly specified g-computation, which has not been previously studied. The methods assessed were complete case analysis (CCA) and various multiple imputation (MI) implementations differing in their degree of compatibility with the outcome model used in g-computation. Simulations were based on an example from the Victorian Adolescent Health Cohort Study (VAHCS), where interest was in estimating the ACE of adolescent cannabis use on mental health in young adulthood. In the canonical m-DAGs that exclude unmeasured common causes of missingness indicators, we showed that the ACE is recoverable if no incomplete variable causes its own missingness, and non-recoverable otherwise. Furthermore, the simulation showed that compatible MI approaches may enable approximately unbiased ACE estimation, unless the outcome causes its own missingness or causes the missingness of a variable that in turn causes missingness in the outcome. Researchers must consider sensitivity analysis methods incorporating external information in the latter settings. The VAHCS case study illustrates the practical implications of these findings.
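The g-computation estimator used as the analysis method in the simulation can be sketched as follows; the simulated data, variable names, and simple outcome model below are illustrative stand-ins rather than the VAHCS analysis.

```python
# Minimal g-computation sketch for the ACE of a binary exposure on a binary outcome.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
c = rng.normal(size=n)                                              # confounder
a = rng.binomial(1, 1 / (1 + np.exp(-c)))                           # exposure
y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 0.7 * a + 0.5 * c))))    # outcome
df = pd.DataFrame({"c": c, "a": a, "y": y})

# Fit the outcome model, then standardise over the confounder distribution
# by predicting the outcome for everyone under a=1 and under a=0.
fit = smf.logit("y ~ a + c", data=df).fit(disp=0)
ace = (fit.predict(df.assign(a=1)) - fit.predict(df.assign(a=0))).mean()
print(f"g-computation ACE estimate (risk difference scale): {ace:.3f}")
```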
The concept of missing at random is central in the literature on statistical analysis with missing data. In general, inference using incomplete data should be based not only on observed data values but should also take account of the pattern of missing values. However, it is often said that if data are missing at random, valid inference using likelihood approaches (including Bayesian) can be obtained ignoring the missingness mechanism. Unfortunately, the term "missing at random" has been used inconsistently and not always clearly; there has also been a lack of clarity around the meaning of "valid inference using likelihood". These issues have created potential for confusion about the exact conditions under which the missingness mechanism can be ignored, and perhaps fed confusion around the meaning of "analysis ignoring the missingness mechanism". Here we provide standardised precise definitions of "missing at random" and "missing completely at random", in order to promote unification of the theory. Using these definitions we clarify the conditions that suffice for "valid inference" to be obtained under a variety of inferential paradigms.
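For orientation, one common way of writing the two definitions is sketched below in generic notation; this is a standard formulation and not necessarily the precise one adopted in the paper.

```latex
% Sketch of the standard definitions (generic notation).
\[
\text{MCAR:}\quad f(m \mid y, \psi) = f(m \mid \psi) \quad \text{for all } y, m,
\]
\[
\text{MAR:}\quad f(m \mid y, \psi) = f(m \mid y_{\mathrm{obs}(m)}, \psi)
\quad \text{for all values of } y_{\mathrm{mis}(m)},
\]
where $y = (y_{\mathrm{obs}(m)}, y_{\mathrm{mis}(m)})$ partitions the full data
according to the missingness pattern $m$, and $\psi$ parameterises the missingness
mechanism. Under MAR (together with distinctness of $\psi$ and the data-model
parameter $\theta$), the observed-data likelihood factorises,
\[
f(y_{\mathrm{obs}}, m \mid \theta, \psi)
  = f(m \mid y_{\mathrm{obs}}, \psi)\, f(y_{\mathrm{obs}} \mid \theta),
\]
so that likelihood-based (including Bayesian) inference about $\theta$ can ignore
the model for $m$.
```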
Longitudinal cohort studies, which follow a group of individuals over time, provide the opportunity to examine causal effects of complex exposures on long-term health outcomes. Utilizing data from multiple cohorts has the potential to add further benefit by improving the precision of estimates through data pooling and by allowing examination of effect heterogeneity through replication of analyses across cohorts. However, the interpretation of findings can be complicated by biases that may be compounded when pooling data, or that may contribute to discrepant findings when analyses are replicated. The "target trial" is a powerful tool for guiding causal inference in single-cohort studies. Here we extend this conceptual framework to address the specific challenges that can arise in the multi-cohort setting. By providing a clear definition of the target estimand, the target trial offers a central point of reference against which biases arising in each cohort and from data pooling can be systematically assessed. Consequently, analyses can be designed to reduce these biases and the resulting findings appropriately interpreted in light of potential remaining biases. We use a case study to demonstrate the framework and its potential to strengthen causal inference in multi-cohort studies through improved analysis design and clarity in the interpretation of findings.
Joint modelling of longitudinal and time-to-event data has received much attention recently. Increasingly, extensions to standard joint modelling approaches are being proposed to handle complex data structures commonly encountered in applied research. In this paper we propose a joint model for hierarchical longitudinal and time-to-event data. Our motivating application explores the association between tumor burden and progression-free survival in non-small cell lung cancer patients. We define tumor burden as a function of the sizes of target lesions clustered within a patient. Since a patient may have more than one lesion, and each lesion is tracked over time, the data have a three-level hierarchical structure: repeated measurements taken at time points (level 1) clustered within lesions (level 2) within patients (level 3). We jointly model the lesion-specific longitudinal trajectories and patient-specific risk of death or disease progression by specifying novel association structures that combine information across lower level clusters (e.g. lesions) into patient-level summaries (e.g. tumor burden). We provide user-friendly software for fitting the model under a Bayesian framework. Lastly, we discuss alternative situations in which additional clustering factor(s) occur at a level higher in the hierarchy than the patient-level, since this has implications for the model formulation.
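A simplified sketch of the kind of model structure described above is given below; the notation is ours, and the paper's association structures are more general than the single tumor-burden summation shown.

```latex
% Simplified three-level joint model sketch (illustrative notation only).
\begin{align*}
\text{Longitudinal:}\quad
  y_{ijk} &= \mu_{ij}(t_{ijk}) + \varepsilon_{ijk},\qquad
  \mu_{ij}(t) = x_{ij}(t)^{\top}\beta + z_{ij}(t)^{\top} b_{ij} + u_i,\\[4pt]
\text{Time-to-event:}\quad
  h_i(t) &= h_0(t)\exp\!\Big\{ w_i^{\top}\gamma
            + \alpha \sum_{j=1}^{J_i} \mu_{ij}(t) \Big\},
\end{align*}
% y_{ijk}: k-th measurement (level 1) of lesion j (level 2) in patient i (level 3);
% b_{ij}, u_i: lesion- and patient-level random effects;
% the sum over lesions is a patient-level summary (tumor burden) entering the hazard.
```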
Scientific knowledge and advances are a cornerstone of modern society. They improve our understanding of the world we live in and help us navigate global challenges including emerging infectious diseases, climate change and the biodiversity crisis. For any scientist, whether they work primarily in fundamental knowledge generation or in the applied sciences, it is important to understand how science fits into a decision-making framework. Decision science is a field that aims to pinpoint evidence-based management strategies. It provides a framework for scientists to directly impact decisions or to understand how their work will fit into a decision process. Decision science is more than undertaking targeted and relevant scientific research or providing tools to assist policy makers; it is an approach to problem formulation, bringing together mathematical modelling, stakeholder values and logistical constraints to support decision making. In this paper we describe decision science, its use in different contexts, and highlight current gaps in methodology and application. The COVID-19 pandemic has thrust mathematical models into the public spotlight, but it is one of innumerable examples in which modelling informs decision making. Other examples include models of storm systems (e.g., cyclones, hurricanes) and climate change. Although the decision timescale in these examples differs enormously (from hours to decades), the underlying decision science approach is common across all problems. Bridging communication gaps between different groups is one of the greatest challenges for scientists. However, by better understanding and engaging with the decision-making processes, scientists will have greater impact and make stronger contributions to important societal problems.
Incidence of whooping cough, an infection caused by Bordetella pertussis and Bordetella parapertussis, has been on the rise since the 1980s in many countries. Immunological interactions, such as immune boosting and cross-immunity between pathogens, have been hypothesised to be important drivers of epidemiological dynamics. We present a two-pathogen model of transmission which examines how immune boosting and cross-immunity can influence the timing and severity of epidemics. We use a combination of numerical simulations and bifurcation techniques to study the dynamical properties of the system, particularly the conditions under which stable periodic solutions are present. We derive analytic expressions for the steady state of the single-pathogen model, and give a condition for the presence of periodic solutions. A key result from our two-pathogen model is that, while studies have shown that immune boosting at relatively strong levels can independently generate periodic solutions, cross-immunity allows for the presence of periodic solutions even when the level of immune boosting is weak. Asymmetric cross-immunity can produce striking increases in the incidence and period. Our study underscores the importance of developing a better understanding of the immunological interactions between pathogens in order to improve model-based interpretations of epidemiological data.
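As a rough illustration of the immune-boosting mechanism at the core of the model, the sketch below simulates a single-pathogen SIRWS-type system (susceptible, infectious, recovered, waning) in which re-exposure boosts waning immunity back to full protection; the paper's two-pathogen model adds cross-immunity between pathogens. Parameter values are illustrative only.

```python
# Single-pathogen SIRWS sketch with immune boosting (illustrative parameters).
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 260.0, 17.0   # transmission and recovery rates (per year)
mu = 1 / 80.0               # birth and death rate (per year)
kappa = 0.2                 # waning rates R -> W and W -> S (per year)
nu = 3.0                    # strength of immune boosting (W -> R upon re-exposure)

def sirws(t, state):
    s, i, r, w = state
    ds = mu - beta * s * i - mu * s + kappa * w
    di = beta * s * i - gamma * i - mu * i
    dr = gamma * i + nu * beta * w * i - kappa * r - mu * r
    dw = kappa * r - nu * beta * w * i - kappa * w - mu * w
    return [ds, di, dr, dw]

y0 = [0.1, 1e-4, 0.6, 0.2999]                      # population fractions summing to 1
sol = solve_ivp(sirws, (0, 200), y0, dense_output=True, rtol=1e-8, atol=1e-10)
t = np.linspace(100, 200, 2000)                    # discard the transient
i_traj = sol.sol(t)[1]
print("infectious fraction, years 100-200: min %.2e, max %.2e"
      % (i_traj.min(), i_traj.max()))
# A persistent gap between min and max would indicate sustained oscillations
# (a stable periodic solution) rather than convergence to a steady state.
```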
Targeted Maximum Likelihood Estimation (TMLE) is increasingly used for doubly robust causal inference, but how missing data should be handled when using TMLE with data-adaptive approaches is unclear. Based on the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate eight missing data methods in this context: complete-case analysis, an extended TMLE incorporating an outcome-missingness model, the missing covariate missing-indicator method, and five multiple imputation (MI) approaches using parametric or machine-learning models. Six scenarios were considered, varying in the exposure/outcome generation models (presence of confounder-confounder interactions) and missingness mechanisms (whether the outcome influenced missingness in other variables, and the presence of interaction/non-linear terms in the missingness models). Complete-case analysis and the extended TMLE had small biases when the outcome did not influence missingness in other variables. Parametric MI without interactions had large bias when the exposure/outcome generation models included interactions. Parametric MI including interactions performed best in bias and variance reduction across all settings, except when the missingness models included a non-linear term. When choosing a method to handle missing data in the context of TMLE, researchers must consider the missingness mechanism and, for MI, compatibility with the analysis method. In many settings, a parametric MI approach that incorporates interactions and non-linearities is expected to perform well.
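Of the methods listed above, the missing covariate missing-indicator method is the simplest to illustrate. The sketch below uses scikit-learn's add_indicator option on simulated covariates; it is illustrative only and not the study's implementation.

```python
# Missing-indicator sketch: fill missing covariate values with a constant and append
# an indicator of missingness, before the downstream TMLE/outcome analysis step.
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(3)
W = rng.normal(size=(200, 3))             # covariates/confounders
W[rng.random(W.shape) < 0.2] = np.nan     # ~20% of values set to missing

imp = SimpleImputer(strategy="mean", add_indicator=True)
W_aug = imp.fit_transform(W)              # imputed covariates + missingness indicators
print(W_aug.shape)                        # (200, 3 + number of covariates with missingness)
```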
Causal mediation analysis examines causal pathways linking exposures to disease. The estimation of interventional effects, which are mediation estimands that overcome certain identifiability problems of natural effects, has been advanced through causal machine learning methods, particularly for high-dimensional mediators. Recently, it has been proposed that interventional effects can be defined in each study by mapping to a target trial assessing specific hypothetical mediator interventions. This provides an appealing framework for directly addressing real-world research questions about the extent to which such interventions might mitigate an increased disease risk in the exposed. However, existing estimators for interventional effects mapped to a target trial rely on singly robust parametric approaches, limiting their applicability in high-dimensional settings. Building upon recent developments in causal machine learning for interventional effects, we address this gap by developing causal machine learning estimators for three interventional effect estimands, defined by target trials assessing hypothetical interventions inducing distinct shifts in joint mediator distributions. These estimands are motivated by a case study within the Longitudinal Study of Australian Children, used for illustration, which assessed how intervening on high inflammatory burden and other adverse non-inflammatory metabolomic markers might mitigate the adverse causal effect of overweight or obesity on high blood pressure in adolescence. We develop one-step and (partial) targeted minimum loss-based estimators based on the efficient influence functions of these estimands, demonstrating that they are root-n consistent, efficient, and multiply robust under certain conditions.
Longitudinal studies are frequently used in medical research and involve collecting repeated measures on individuals over time. Observations from the same individual are invariably correlated and thus an analytic approach that accounts for this clustering by individual is required. While almost all research suffers from missing data, this can be particularly problematic in longitudinal studies as participation often becomes harder to maintain over time. Multiple imputation (MI) is widely used to handle missing data in such studies. When using MI, it is important that the imputation model is compatible with the proposed analysis model. In a longitudinal analysis, this implies that the clustering considered in the analysis model should be reflected in the imputation process. Several MI approaches have been proposed to impute incomplete longitudinal data, such as treating repeated measurements of the same variable as distinct variables or using generalized linear mixed imputation models. However, the uptake of these methods has been limited, as they require additional data manipulation and use of advanced imputation procedures. In this tutorial, we review the available MI approaches that can be used for handling incomplete longitudinal data, including where individuals are clustered within higher-level clusters. We illustrate implementation with replicable R and Stata code using a case study from the Childhood to Adolescence Transition Study.
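As a rough illustration of the "repeated measurements as distinct variables" approach mentioned above, the sketch below reshapes simulated long-format data to wide format before imputing. It uses Python stand-ins rather than the R and Stata code provided in the tutorial, and a full MI analysis would generate multiple imputed datasets and pool the results.

```python
# Reshape long repeated-measures data to wide so each wave is its own column,
# impute using all waves jointly, then reshape back for the longitudinal analysis.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
n, waves = 300, 4
long = pd.DataFrame({
    "id": np.repeat(np.arange(n), waves),
    "wave": np.tile(np.arange(1, waves + 1), n),
    "score": rng.normal(size=n * waves),
})
long.loc[rng.random(len(long)) < 0.25, "score"] = np.nan   # ~25% missing scores

# Long -> wide: the imputation model then reflects the within-individual correlation
# that the analysis model accounts for.
wide = long.pivot(index="id", columns="wave", values="score")
completed_wide = pd.DataFrame(
    IterativeImputer(sample_posterior=True, random_state=0).fit_transform(wide),
    index=wide.index, columns=wide.columns,
)
# Wide -> long again for the mixed-model analysis of the repeated measures.
completed_long = completed_wide.reset_index().melt(
    id_vars="id", var_name="wave", value_name="score")
print(completed_long.head())
```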
In the human brain, white matter development is a complex and long-lasting process involving intermingled micro- and macrostructural mechanisms, such as fiber growth, pruning, and myelination. All of these neurodevelopmental changes strongly affect MRI signals, with consequences for tractography performance and reliability. This communication elaborates on these aspects, highlighting the importance of tracking and studying developing connections with dedicated approaches.
Vaccine infodemics, driven by misinformation, disinformation, and inauthentic online behaviours, pose significant threats to global public health. This paper presents our response to this challenge, describing how we developed the VaxPulse Vaccine Infodemic Risk Assessment Lifecycle (VIRAL), an AI-powered social listening platform designed to monitor and assess vaccine-related infodemic risks. Leveraging interdisciplinary expertise and international collaborations, VaxPulse VIRAL integrates machine learning methods, including deep learning, active learning, and data augmentation, to provide real-time insights into public sentiment, misinformation trends, and social bot activity. Iterative feedback from domain experts and stakeholders has guided the development of dynamic dashboards that offer tailored, actionable insights to support immunisation programs and address information disorder. Ongoing improvements to VaxPulse will continue through collaboration with our international network and community leaders.
The recent vaccine-related infodemic has amplified public concerns, highlighting the need for proactive misinformation management. We describe how we enhanced the reporting surveillance system of Victoria's vaccine safety service, SAEFVIC, through the incorporation of new information sources for analysing public sentiment, topics of discussion, and vaccine hesitancy expressed online. Using VaxPulse, a multi-step framework, we integrate adverse events following immunisation (AEFI) reports with sentiment analysis, demonstrating the importance of contextualising public concerns. Additionally, we emphasise the need to address non-English languages to stratify concerns across ethno-lingual communities, providing valuable insights for vaccine uptake strategies and for combating mis/disinformation. The framework is applied to real-world examples and to a case study on women's vaccine hesitancy, showcasing its benefits and adaptability in identifying public opinion from online media.
Lack of standardization and variation in intrinsic acquisition parameters for magnetic resonance (MR) imaging result in heterogeneous images across sites and devices, which adversely affects the generalization of deep neural networks. To alleviate this issue, this work proposes a novel unsupervised harmonization framework that leverages normalizing flows to align MR images, thereby emulating the distribution of a source domain. The proposed strategy comprises three key steps. Initially, a normalizing flow network is trained to capture the distribution characteristics of the source domain. Then, we train a shallow harmonizer network to reconstruct images from the source domain from their augmented counterparts. Finally, during inference, the harmonizer network is updated to ensure that the output images conform to the learned source domain distribution, as modeled by the normalizing flow network. Our approach, which is unsupervised, source-free, and task-agnostic, is assessed in the context of both adult and neonatal cross-domain brain MRI segmentation, as well as neonatal brain age estimation, demonstrating its generalizability across tasks and population demographics. The results underscore its superior performance compared to existing methodologies. The code is available at this https URL
Vaccine hesitancy threatens public health, leading to delayed or rejected vaccines. Social media is a vital source for understanding public concerns, but traditional methods like topic modelling often struggle to capture nuanced opinions. Though trained for query answering, large language models (LLMs) often miss current events and community concerns. Additionally, hallucinations in LLMs can compromise public health communication. To address these limitations, we developed a tool (VaxPulse Query Corner) using the Retrieval Augmented Generation technique. It addresses complex queries about public vaccine concerns on various online platforms, aiding public health administrators and stakeholders in understanding public concerns and implementing targeted interventions to boost vaccine confidence. Analysing 35,103 Shingrix social media posts, it achieved an answer faithfulness score of 0.96 and an answer relevance score of 0.94.
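A minimal sketch of the retrieval-augmented generation pattern underlying such a tool is shown below. The embedding model, vector index, and LLM used by VaxPulse Query Corner are not described here, so TF-IDF retrieval and a printed grounded prompt stand in for those components; the example posts are invented.

```python
# Minimal RAG sketch: retrieve the posts most relevant to a query and build a
# grounded prompt for the LLM (the LLM call itself is not shown).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "Shingrix gave me a sore arm for two days but it settled quickly.",
    "Worried about the cost of the shingles vaccine for older relatives.",
    "Is it safe to get Shingrix at the same time as the flu shot?",
]
query = "What side effects are people reporting after Shingrix?"

vec = TfidfVectorizer().fit(posts + [query])
scores = cosine_similarity(vec.transform([query]), vec.transform(posts))[0]
top = [posts[i] for i in scores.argsort()[::-1][:2]]        # top-2 retrieved posts

prompt = ("Answer the question using ONLY the posts below; "
          "say so if they are insufficient.\n"
          + "\n".join(f"- {p}" for p in top)
          + f"\n\nQuestion: {query}")
print(prompt)
```

Retrieval grounding is what answer-faithfulness metrics typically assess: whether the generated answer is supported by the retrieved context rather than by the model's parametric memory.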
Background: The proportional odds (PO) model is the most common analytic method for ordinal outcomes in randomised controlled trials. While parameter estimates obtained under departures from PO can be interpreted as an average odds ratio, they can obscure differing treatment effects across the distribution of the ordinal categories. Extensions to the PO model exist, and this work evaluates their performance under deviations from the PO assumption. Methods: We evaluated the bias, coverage, and mean squared error of four modelling approaches for ordinal outcomes via Monte Carlo simulation. Specifically, independent logistic regression models, the PO model, and constrained and unconstrained partial proportional odds (PPO) models were fit to simulated ordinal outcome data. The simulated data were designed to represent a hypothetical two-arm randomised trial under a range of scenarios. Additionally, we report on a case study, an Australasian COVID-19 Trial that adopted multiple secondary ordinal endpoints. Results: The PO model performed best when the data were generated under PO, as expected, but resulted in bias and poor coverage in the presence of non-PO, particularly with increasing effect size and number of categories. The odds ratios (ORs) estimated using the unconstrained PPO and separate logistic regression models in the presence of non-PO had negligible bias and good coverage across most scenarios. The unconstrained PPO model under-performed when data were sparse within some categories. Conclusions: While the PO model is effective when PO holds, the unconstrained and constrained PPO models and separate logistic regression models provide unbiased and efficient estimates under non-PO conditions.
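As a small illustration of the issue the simulation targets, the sketch below generates a two-arm trial in which the treatment odds ratio differs across cutpoints (non-PO), fits a PO model with statsmodels, and compares it with cutpoint-specific logistic regressions; the paper's simulation design and PPO models are more extensive than this.

```python
# Simulate an ordinal outcome under non-proportional odds, then compare the single
# odds ratio from a PO model with cutpoint-specific logistic regression estimates.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(5)
n = 4000
trt = rng.binomial(1, 0.5, n)
# Treatment shifts patients out of the worst category only, so the cumulative odds
# ratio differs markedly across cutpoints (a clear departure from PO).
p_ctrl = np.array([0.40, 0.30, 0.20, 0.10])
p_trt = np.array([0.20, 0.50, 0.20, 0.10])
y = np.where(trt == 1, rng.choice(4, n, p=p_trt), rng.choice(4, n, p=p_ctrl))

# Proportional odds model: a single common (average) odds ratio for treatment.
y_cat = pd.Series(pd.Categorical(y, categories=[0, 1, 2, 3], ordered=True))
po_fit = OrderedModel(y_cat, pd.DataFrame({"trt": trt}), distr="logit").fit(
    method="bfgs", disp=0)
print("PO model common log-OR for treatment:", po_fit.params["trt"])

# Cutpoint-specific logistic regressions (analogous to the unconstrained analyses).
for j in (1, 2, 3):
    fit_j = sm.Logit((y >= j).astype(int), sm.add_constant(trt)).fit(disp=0)
    print(f"log-OR for P(Y >= {j}):", fit_j.params[1])
```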