Johns Hopkins Bloomberg School of Public Health
Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues -- for instance, models frequently misinterpreted keywords like "longitudinal" or "sensitivity" as automatic evidence of rigorous methodological approaches, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
We introduce OrigamiPlot, an open-source R package and Shiny web application designed to enhance the visualization of multivariate data. This package implements the origami plot, a novel visualization technique proposed by Duan et al. in 2023, which improves upon traditional radar charts by ensuring that the area of the connected region is invariant to the ordering of attributes, addressing a key limitation of radar charts. The software facilitates multivariate decision-making by supporting comparisons across multiple objects and attributes, offering customizable features such as auxiliary axes and weighted attributes for enhanced clarity. Through the R package and user-friendly Shiny interface, researchers can efficiently create and customize plots without requiring extensive programming knowledge. Demonstrated using network meta-analysis as a real-world example, OrigamiPlot proves to be a versatile tool for visualizing multivariate data across various fields. This package opens new opportunities for simplifying decision-making processes with complex data.
Researchers at Stanford University and Google Research developed a framework utilizing continuous glucose monitoring (CGM) and machine learning to accurately predict individual metabolic subphenotypes from at-home tests. This approach enables precise identification of underlying metabolic defects, outperforming traditional markers and informing personalized lifestyle interventions.
Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to efficiently apply the methods to their research. Training resources often become outdated because their maintenance is not prioritized by funding, giving teams little time to devote to such endeavors. Our group has developed Open-source Tools for Training Resources (OTTR) to offer greater efficiency and flexibility for creating and maintaining online course content. OTTR empowers creators to customize their work and allows for a simple workflow to publish using multiple platforms. OTTR allows content creators to publish material to multiple massive online learner communities using familiar rendering mechanics. OTTR allows the incorporation of pedagogical practices like formative and summative assessments in the form of multiple choice questions and fill in the blank problems that are automatically graded. No local installation of any software is required to begin creating content with OTTR. Thus far, 15 courses have been created with the OTTR repository template. By using the OTTR system, the maintenance workload for updating these courses across platforms has been drastically reduced.
When estimating causal effects using observational data, it is desirable to replicate a randomized experiment as closely as possible by obtaining treated and control groups with similar covariate distributions. This goal can often be achieved by choosing well-matched samples of the original treated and control groups, thereby reducing bias due to the covariates. Since the 1970s, work on matching methods has examined how to best choose treated and control subjects for comparison. Matching methods are gaining popularity in fields such as economics, epidemiology, medicine and political science. However, until now the literature and related advice has been scattered across disciplines. Researchers who are interested in using matching methods---or developing methods related to matching---do not have a single place to turn to learn about past and current research. This paper provides a structure for thinking about matching methods and guidance on their use, coalescing the existing research (both old and new) and providing a summary of where the literature on matching methods is now and where it should be headed.
Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews. Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for binary outcome and a survival cure model that predicts both binary outcome and publication risk over time. Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667 with semantic embeddings. Conclusion. Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidence enhances predictive performance while reducing manual annotation burden. The framework has the potential to facilitate incorporation of preprint articles during the appraisal phase of systematic reviews, supporting researchers in more effective utilization of preprint resources.
Men experiencing infertility face unique challenges navigating Traditional Masculinity Ideologies that discourage emotional expression and help-seeking. This study examines how Reddit's r/maleinfertility community helps overcome these barriers through digital support networks. Using topic modeling (115 topics), network analysis (11 micro-communities), and time-lagged regression on 11,095 posts and 79,503 comments from 8,644 users, we found the community functions as a hybrid space: informal diagnostic hub, therapeutic commons, and governed institution. Medical advice dominates discourse (63.3%), while emotional support (7.4%) and moderation (29.2%) create essential infrastructure. Sustained engagement correlates with actionable guidance and affiliation language, not emotional processing. Network analysis revealed structurally cohesive but topically diverse clusters without echo chamber characteristics. Cross-posters (20% of users) who bridge r/maleinfertility and the gender-mixed r/infertility community serve as navigators and mentors, transferring knowledge between spaces. These findings inform trauma-informed design for stigmatized health communities, highlighting role-aware systems and navigation support.
Matching and weighting methods for observational studies involve the choice of an estimand, the causal effect with reference to a specific target population. Commonly used estimands include the average treatment effect in the treated (ATT), the average treatment effect in the untreated (ATU), the average treatment effect in the population (ATE), and the average treatment effect in the overlap (i.e., equipoise population; ATO). Each estimand has its own assumptions, interpretation, and statistical methods that can be used to estimate it. This article provides guidance on selecting and interpreting an estimand to help medical researchers correctly implement statistical methods used to estimate causal effects in observational studies and to help audiences correctly interpret the results and limitations of these studies. The interpretations of the estimands resulting from regression and instrumental variable analyses are also discussed. Choosing an estimand carefully is essential for making valid inferences from the analysis of observational data and ensuring results are replicable and useful for practitioners.
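As a rough illustration of how the choice of estimand changes a weighting analysis, the sketch below computes the balancing weight usually associated with each of the four estimands from a unit's propensity score. The function name `estimand_weight` and the example values are hypothetical and do not come from the article; the formulas are the standard inverse-probability and overlap weights, and the sketch assumes the propensity score has already been estimated.

```python
def estimand_weight(treated: bool, ps: float, estimand: str) -> float:
    """Weight for one unit, given ps = P(treated | covariates).

    Hypothetical helper for illustration; not code from the article.
    """
    if not 0.0 < ps < 1.0:
        raise ValueError("propensity score must be strictly between 0 and 1")
    if estimand == "ATE":   # target: the whole population
        return 1.0 / ps if treated else 1.0 / (1.0 - ps)
    if estimand == "ATT":   # target: the treated; controls are reweighted
        return 1.0 if treated else ps / (1.0 - ps)
    if estimand == "ATU":   # target: the untreated; treated are reweighted
        return (1.0 - ps) / ps if treated else 1.0
    if estimand == "ATO":   # target: the overlap (equipoise) population
        return 1.0 - ps if treated else ps
    raise ValueError(f"unknown estimand: {estimand}")


# Example: a treated unit with a low propensity score (0.25).
for name in ("ATE", "ATT", "ATU", "ATO"):
    print(name, estimand_weight(True, 0.25, name))
```

The same unit receives a very different weight under each estimand, which is one way to see that the estimands describe effects in different target populations.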
State-level policy evaluations commonly employ a difference-in-differences (DID) study design; yet within this framework, statistical model specification varies notably across studies. Motivated by applied state-level opioid policy evaluations, this simulation study compares statistical performance of multiple variations of two-way fixed effect models traditionally used for DID under a range of simulation conditions. While most linear models resulted in minimal bias, non-linear models and population-weighted versions of classic linear two-way fixed effect and linear GEE models yielded considerable bias (60 to 160%). Further, root mean square error is minimized by linear AR models when examining crude mortality rates and by negative binomial models when examining raw death counts. In the context of frequentist hypothesis testing, many models yielded high Type I error rates and very low rates of correctly rejecting the null hypothesis (< 10%), raising concerns of spurious conclusions about policy effectiveness. When considering performance across models, the linear autoregressive models were optimal in terms of directional bias, root mean squared error, Type I error, and correct rejection rates. These findings highlight notable limitations of traditional statistical models commonly used for DID designs, designs widely used in opioid policy studies and in state policy evaluations more broadly.
Purpose: To quantify the relative performance of step counting algorithms in studies that collect free-living high-resolution wrist accelerometry data and to highlight the implications of using these algorithms in translational research. Methods: Five step counting algorithms (four open source and one proprietary) were applied to the publicly available, free-living, high-resolution wrist accelerometry data collected by the National Health and Nutrition Examination Survey (NHANES) in 2011-2014. The mean daily total step counts were compared in terms of correlation, predictive performance, and estimated hazard ratios of mortality. Results: The estimated step counts were highly correlated across algorithms (median=0.91, range 0.77 to 0.98) and had high and comparable predictive performance for mortality (median concordance=0.72, range 0.70 to 0.73). However, the distributions of step counts in the population varied widely (mean step counts ranged from 2,453 to 12,169). Hazard ratios of mortality associated with a 500-step increase per day varied among step counting algorithms between HR=0.88 and 0.96, corresponding to a 300% difference in mortality risk reduction ([1-0.88]/[1-0.96]=3). Conclusion: Different step counting algorithms provide correlated step estimates and have similar predictive performance that is better than traditional predictors of mortality. However, they provide widely different distributions of step counts and estimated reductions in mortality risk for a 500-step increase.
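The ratio quoted in the Results can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `risk_reduction` is hypothetical, and it treats 1 - HR as the estimated relative reduction in mortality hazard per 500-step increase, as the bracketed expression in the abstract does.

```python
import math


def risk_reduction(hazard_ratio: float) -> float:
    """Relative reduction in hazard implied by a hazard ratio below 1."""
    return 1.0 - hazard_ratio


# The two extreme hazard ratios reported across the five algorithms.
low_hr, high_hr = 0.88, 0.96

# Ratio of the implied risk reductions: (1 - 0.88) / (1 - 0.96).
ratio = risk_reduction(low_hr) / risk_reduction(high_hr)

print(f"reduction at HR={low_hr}: {risk_reduction(low_hr):.2f}")
print(f"reduction at HR={high_hr}: {risk_reduction(high_hr):.2f}")
print(f"ratio of reductions: {ratio:.1f}")
```

Under this reading, the algorithm with the strongest association implies a 12% reduction per 500 steps and the one with the weakest a 4% reduction, so the implied benefit differs by a factor of about three depending on which algorithm produced the step counts.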
Machine learning has been an emerging tool for various aspects of infectious diseases including tuberculosis surveillance and detection. However, WHO provided no recommendations on using computer-aided tuberculosis detection software because of the small number of studies, methodological limitations, and limited generalizability of the findings. To quantify the generalizability of the machine-learning model, we developed a Deep Convolutional Neural Network (DCNN) model using a TB-specific CXR dataset of one population (National Library of Medicine Shenzhen No.3 Hospital) and tested it with a non-TB-specific CXR dataset of another population (National Institutes of Health Clinical Center). The findings suggested that a supervised deep learning model developed by using the training dataset from one population may not have the same diagnostic performance in another population. Technical specification of CXR images, disease severity distribution, overfitting, and overdiagnosis should be examined before implementation in other settings.
Large health surveys increasingly collect high-dimensional functional data from wearable devices, and function-on-scalar regression (FoSR) is often used to quantify the relationship between these functional outcomes and scalar covariates such as age and sex. However, existing methods for FoSR fail to account for complex survey design. We introduce inferential methods for FoSR for studies with complex survey designs. The method combines fast univariate inference (FUI) developed for functional data outcomes and survey sampling inferential methods developed for scalar outcomes. Our approach consists of three steps: (1) fit survey weighted GLMs at each point along the functional domain, (2) smooth coefficients along the functional domain, and (3) use balanced repeated replication (BRR) or the Rao-Wu-Yue-Beaumont (RWYB) bootstrap to obtain pointwise and joint confidence bands for the functional coefficients. The method is motivated by association studies between continuous physical activity data and covariates collected in the National Health and Nutrition Examination Survey (NHANES). A first-of-its-kind analytical simulation study and an empirical simulation using the NHANES data demonstrate that our method performs better than existing methods that do not account for the survey structure. Finally, application of the method in NHANES shows the practical implications of accounting for survey structure. The method is implemented in the R package svyfosr.
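The three steps above can be sketched in a few functions. This is a simplified illustration under stated assumptions, not the svyfosr implementation: it uses a single scalar covariate with an identity link in place of a general GLM, a centered moving average as a stand-in for the smoother, and a plain BRR-style variance formula over user-supplied replicate weights. It produces pointwise standard errors only; joint bands and the RWYB bootstrap are not shown. All function names and the toy data are hypothetical.

```python
from statistics import fmean


def weighted_slope(x, y, w):
    """Weighted least-squares slope of y on x (model includes an intercept)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


def pointwise_slopes(x, curves, w):
    """Step 1: one survey-weighted fit at each point of the functional domain."""
    n_points = len(curves[0])
    return [weighted_slope(x, [c[s] for c in curves], w) for s in range(n_points)]


def smooth(beta, half_window=1):
    """Step 2 (stand-in): centered moving average, truncated at the edges."""
    out = []
    for s in range(len(beta)):
        lo, hi = max(0, s - half_window), min(len(beta), s + half_window + 1)
        out.append(fmean(beta[lo:hi]))
    return out


def replicate_se(x, curves, w, replicate_weights, half_window=1):
    """Step 3 (simplified): pointwise SEs from replicate-weight refits."""
    full = smooth(pointwise_slopes(x, curves, w), half_window)
    reps = [smooth(pointwise_slopes(x, curves, rw), half_window)
            for rw in replicate_weights]
    n_reps = len(reps)
    return [(sum((rep[s] - full[s]) ** 2 for rep in reps) / n_reps) ** 0.5
            for s in range(len(full))]


# Toy data: four subjects, a binary covariate, curves observed at three points.
x = [0.0, 1.0, 0.0, 1.0]
curves = [[0.0, 0.1, 0.0], [1.0, 2.1, 3.0], [0.1, 0.0, 0.0], [1.1, 1.9, 3.2]]
weights = [1.0, 2.0, 1.0, 2.0]
replicates = [[2.0, 4.0, 0.5, 1.0], [0.5, 1.0, 2.0, 4.0]]  # hypothetical

beta_hat = smooth(pointwise_slopes(x, curves, weights))
se_hat = replicate_se(x, curves, weights, replicates)
print(beta_hat)
print(se_hat)
```

Smoothing is applied inside each replicate refit so that the variability of the smoothed coefficient, rather than the raw pointwise one, is what the replicate spread measures.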
We address the challenge of estimation in the context of constant linear effect models with dense functional responses. In this framework, the conditional expectation of the response curve is represented by a linear combination of functional covariates with constant regression parameters. In this paper, we present an alternative solution by employing the quadratic inference approach, a well-established method for analyzing correlated data, to estimate the regression coefficients. Our approach leverages non-parametrically estimated basis functions, eliminating the need for choosing working correlation structures. Furthermore, we demonstrate that our method achieves a parametric $\sqrt{n}$-convergence rate, contingent on an appropriate choice of bandwidth. This convergence is observed when the number of repeated measurements per trajectory exceeds a certain threshold, specifically, when it surpasses $n^{a_{0}}$, with $n$ representing the number of trajectories. Additionally, we establish the asymptotic normality of the resulting estimator. The performance of the proposed method is compared with that of existing methods through extensive simulation studies, where our proposed method outperforms. Real data analysis is also conducted to demonstrate the proposed method.
Real-world data, such as administrative claims and electronic health records, are increasingly used for safety monitoring and to help guide regulatory decision-making. In these settings, it is important to document analytic decisions transparently and objectively to ensure that analyses meet their intended goals. The Causal Roadmap is an established framework that can guide and document analytic decisions through each step of the analytic pipeline, which will help investigators generate high-quality real-world evidence. In this paper, we illustrate the utility of the Causal Roadmap using two case studies previously led by workgroups sponsored by the Sentinel Initiative -- a program for actively monitoring the safety of regulated medical products. Each case example focuses on different aspects of the analytic pipeline for drug safety monitoring. The first case study shows how the Causal Roadmap encourages transparency, reproducibility, and objective decision-making for causal analyses. The second case study highlights how this framework can guide analytic decisions beyond inference on causal parameters, improving outcome ascertainment in clinical phenotyping. These examples provide a structured framework for implementing the Causal Roadmap in safety surveillance and guide transparent, reproducible, and objective analysis.
The data science revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking -- the problem-solving process to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle differences in how a data analyst (or producer of a data analysis) constructs, creates, or designs a data analysis, including differences in the choice of methods, tooling, and workflow. These choices can affect the data analysis products themselves and the experience of the consumer of the data analysis. Therefore, the role of a producer can be thought of as designing the data analysis with a set of design principles. Here, we introduce design principles for data analysis and describe how they can be mapped to data analyses in a quantitative, objective and informative manner. We also provide empirical evidence of variation of principles within and between both producers and consumers of data analyses. Our work leads to two insights: it suggests a formal mechanism to describe data analyses based on the design principles for data analysis, and it provides a framework to teach students how to build data analyses using formal design principles.
Traditionally, statistical and causal inference on human subjects rely on the assumption that individuals are independently affected by treatments or exposures. However, recently there has been increasing interest in settings, such as social networks, where individuals may interact with one another such that treatments may spill over from the treated individual to their social contacts and outcomes may be contagious. Existing models proposed for causal inference using observational data from networks of interacting individuals have two major shortcomings. First, they often require a level of granularity in the data that is practically infeasible to collect in most settings, and second, the models are high-dimensional and often too big to fit to the available data. In this paper we illustrate and justify a parsimonious parameterization for network data with interference and contagion. Our parameterization corresponds to a particular family of graphical models known as chain graphs. We argue that, in some settings, chain graph models approximate the marginal distribution of a snapshot of a longitudinal data generating process on interacting units. We illustrate the use of chain graphs for causal inference about collective decision making in social networks using data from U.S. Supreme Court decisions between 1994 and 2004 and in simulations.
Advances in spatially-resolved transcriptomics (SRT) technologies have propelled the development of new computational analysis methods to unlock biological insights. As the cost of generating these data decreases, these technologies provide an exciting opportunity to create large-scale atlases that integrate SRT data across multiple tissues, individuals, species, or phenotypes to perform population-level analyses. Here, we describe unique challenges of varying spatial resolutions in SRT data, as well as highlight the opportunities for standardized preprocessing methods along with computational algorithms amenable to atlas-scale datasets leading to improved sensitivity and reproducibility in the future.
The study of treatment effects is often complicated by noncompliance and missing data. In the one-sided noncompliance setting where of interest are the complier and noncomplier average causal effects (CACE and NACE), we address outcome missingness of the latent missing at random type (LMAR, also known as latent ignorability). That is, conditional on covariates and treatment assigned, the missingness may depend on compliance type. Within the instrumental variable (IV) approach to noncompliance, methods have been proposed for handling LMAR outcome that additionally invoke an exclusion restriction type assumption on missingness, but no solution has been proposed for when a non-IV approach is used. This paper focuses on effect identification in the presence of LMAR outcome, with a view to flexibly accommodate different principal identification approaches. We show that under treatment assignment ignorability and LMAR only, effect nonidentifiability boils down to a set of two connected mixture equations involving unidentified stratum-specific response probabilities and outcome means. This clarifies that (except for a special case) effect identification generally requires two additional assumptions: a specific missingness mechanism assumption and a principal identification assumption. This provides a template for identifying effects based on separate choices of these assumptions. We consider a range of specific missingness assumptions, including those that have appeared in the literature and some new ones. Incidentally, we find an issue in the existing assumptions, and propose a modification of the assumptions to avoid the issue. Results under different assumptions are illustrated using data from the Baltimore Experience Corps Trial.
In this paper, we develop a semiparametric sensitivity analysis approach designed to address unmeasured confounding in observational studies with time-to-event outcomes. We target estimation of the marginal distributions of potential outcomes under competing exposures using influence function-based techniques. We derived the non-parametric influence function for uncensored data and mapped the uncensored data influence function to the observed data influence function. Our methodology is motivated by and applied to an observational study evaluating the effectiveness of radical prostatectomy (RP) versus external beam radiotherapy with androgen deprivation (EBRT+AD) for the treatment of prostate cancer. We also present a simulation study to evaluate the statistical properties of our methodology.
Extending (generalizing or transporting) causal inferences from a randomized trial to a target population requires "generalizability" or "transportability" assumptions, which state that randomized and non-randomized individuals are exchangeable conditional on baseline covariates. These assumptions are made on the basis of background knowledge, which is often uncertain or controversial, and need to be subjected to sensitivity analysis. We present simple methods for sensitivity analyses that do not require detailed background knowledge about specific unknown or unmeasured determinants of the outcome or modifiers of the treatment effect. Instead, our methods directly parameterize violations of the assumptions using bias functions. We show how the methods can be applied to non-nested trial designs, where the trial data are combined with a separately obtained sample of non-randomized individuals, as well as to nested trial designs, where a clinical trial is embedded within a cohort sampled from the target population. We illustrate the methods using data from a clinical trial comparing treatments for chronic hepatitis C infection.