Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues -- for instance, models frequently misinterpreted keywords like "longitudinal" or "sensitivity" as automatic evidence of rigorous methodological approaches, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
We introduce OrigamiPlot, an open-source R package and Shiny web application designed to enhance the visualization of multivariate data. This package implements the origami plot, a novel visualization technique proposed by Duan et al. in 2023, which improves upon traditional radar charts by ensuring that the area of the connected region is invariant to the ordering of attributes, addressing a key limitation of radar charts. The software facilitates multivariate decision-making by supporting comparisons across multiple objects and attributes, offering customizable features such as auxiliary axes and weighted attributes for enhanced clarity. Through the R package and user-friendly Shiny interface, researchers can efficiently create and customize plots without requiring extensive programming knowledge. Demonstrated using network meta-analysis as a real-world example, OrigamiPlot proves to be a versatile tool for visualizing multivariate data across various fields. This package opens new opportunities for simplifying decision-making processes with complex data.
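The order-invariance property described above can be checked numerically. Below is a minimal sketch (not the OrigamiPlot API; the function name and auxiliary radius are illustrative): main-axis vertices at the attribute radii alternate with auxiliary vertices at a fixed radius, so the shoelace area of the resulting polygon depends only on the sum of the attribute values, not their order.

```python
import math

def origami_area(values, aux_radius=0.1):
    """Shoelace area of an origami-style polygon: vertices at the attribute
    radii alternate with auxiliary vertices at a fixed radius, so the area
    reduces to aux_radius * sin(pi/n) * sum(values) -- order-invariant."""
    n = len(values)
    radii = []
    for r in values:
        radii.extend([r, aux_radius])   # main vertex, then auxiliary vertex
    step = math.pi / n                  # angle between consecutive vertices
    pts = [(rad * math.cos(k * step), rad * math.sin(k * step))
           for k, rad in enumerate(radii)]
    area = 0.0
    for k, (x1, y1) in enumerate(pts):
        x2, y2 = pts[(k + 1) % len(pts)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2

vals = [0.9, 0.2, 0.6, 0.4, 0.8]
shuffled = [0.2, 0.8, 0.4, 0.9, 0.6]
print(round(origami_area(vals), 9) == round(origami_area(shuffled), 9))  # True
```

By contrast, an ordinary radar polygon through the main vertices alone has area 0.5·sin(2π/n)·Σᵢ rᵢrᵢ₊₁, which depends on which attributes are adjacent -- the ordering limitation the origami construction removes.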
Researchers at Stanford University and Google Research developed a framework utilizing continuous glucose monitoring (CGM) and machine learning to accurately predict individual metabolic subphenotypes from at-home tests. This approach enables precise identification of underlying metabolic defects, outperforming traditional markers and informing personalized lifestyle interventions.
Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to apply these methods efficiently to their research. Training resources often fall out of date because their maintenance is not prioritized by funding, leaving teams little time to devote to upkeep. Our group has developed Open-source Tools for Training Resources (OTTR) to offer greater efficiency and flexibility for creating and maintaining online course content. OTTR empowers creators to customize their work, supports a simple workflow for publishing to multiple platforms, and lets content creators reach multiple massive online learner communities using familiar rendering mechanics. OTTR also supports pedagogical practices such as formative and summative assessments, in the form of automatically graded multiple-choice and fill-in-the-blank problems. No local installation of any software is required to begin creating content with OTTR. Thus far, 15 courses have been created with the OTTR repository template, and by using the OTTR system, the maintenance workload for updating these courses across platforms has been drastically reduced.
When estimating causal effects using observational data, it is desirable to
replicate a randomized experiment as closely as possible by obtaining treated
and control groups with similar covariate distributions. This goal can often be
achieved by choosing well-matched samples of the original treated and control
groups, thereby reducing bias due to the covariates. Since the 1970s, work on
matching methods has examined how to best choose treated and control subjects
for comparison. Matching methods are gaining popularity in fields such as
economics, epidemiology, medicine and political science. However, until now the
literature and related advice has been scattered across disciplines.
Researchers who are interested in using matching methods---or developing
methods related to matching---do not have a single place to turn to learn about
past and current research. This paper provides a structure for thinking about
matching methods and guidance on their use, coalescing the existing research
(both old and new) and providing a summary of where the literature on matching
methods is now and where it should be headed.
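As a concrete illustration of the basic idea the paper surveys (a toy sketch, not any specific method from that literature; in practice one matches on estimated propensity scores over many covariates), here is greedy 1:1 nearest-neighbor matching on a single covariate, with balance measured simply as the absolute difference in covariate means:

```python
from statistics import mean

def greedy_match(treated, control):
    """For each treated unit, take the closest not-yet-used control unit."""
    pool = list(control)
    matched = []
    for t in treated:
        best = min(pool, key=lambda c: abs(c - t))
        pool.remove(best)
        matched.append(best)
    return matched

treated = [2.1, 2.5, 3.0, 3.4]                      # covariate values, treated
control = [0.5, 0.9, 1.8, 2.2, 2.6, 3.1, 3.9, 4.5]  # covariate values, control

matched = greedy_match(treated, control)
balance_before = abs(mean(treated) - mean(control))  # 0.3125
balance_after = abs(mean(treated) - mean(matched))   # 0.2
print(matched, balance_after < balance_before)       # [2.2, 2.6, 3.1, 3.9] True
```

Real applications add propensity score models, calipers, and optimal (rather than greedy) matching -- precisely the design choices this literature organizes.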
Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews.
Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors through three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for the binary publication outcome and a survival cure model that predicts both the binary outcome and publication risk over time.
Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667 with semantic embeddings.
Conclusion. Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidence enhances predictive performance while reducing manual annotation burden. The framework has the potential to facilitate incorporation of preprint articles during the appraisal phase of systematic reviews, supporting researchers in more effective utilization of preprint resources.
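The survival cure model in the Methods combines a binary never-published component with a time-to-publication component. A standard mixture cure formulation conveys the idea (a generic sketch; the exact specification used in AutoConfidence may differ):

```latex
% S(t | x): probability a preprint remains unpublished beyond time t
S(t \mid x) \;=\; \pi(x) \;+\; \bigl(1 - \pi(x)\bigr)\, S_u(t \mid x)
```

Here $\pi(x)$ is the probability that a preprint with features $x$ is never published (the "cured" fraction, matching the binary outcome), and $S_u(t \mid x)$ is the survival function of time to publication among preprints that are eventually published (the publication-risk component).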
Men experiencing infertility face unique challenges navigating Traditional Masculinity Ideologies that discourage emotional expression and help-seeking. This study examines how Reddit's r/maleinfertility community helps overcome these barriers through digital support networks. Using topic modeling (115 topics), network analysis (11 micro-communities), and time-lagged regression on 11,095 posts and 79,503 comments from 8,644 users, we found the community functions as a hybrid space: informal diagnostic hub, therapeutic commons, and governed institution. Medical advice dominates discourse (63.3%), while emotional support (7.4%) and moderation (29.2%) create essential infrastructure. Sustained engagement correlates with actionable guidance and affiliation language, not emotional processing. Network analysis revealed structurally cohesive but topically diverse clusters without echo chamber characteristics. Cross-posters (20% of users) who bridge r/maleinfertility and the gender-mixed r/infertility community serve as navigators and mentors, transferring knowledge between spaces. These findings inform trauma-informed design for stigmatized health communities, highlighting role-aware systems and navigation support.
Matching and weighting methods for observational studies involve the choice
of an estimand, the causal effect with reference to a specific target
population. Commonly used estimands include the average treatment effect in the
treated (ATT), the average treatment effect in the untreated (ATU), the average
treatment effect in the population (ATE), and the average treatment effect in
the overlap (i.e., equipoise population; ATO). Each estimand has its own
assumptions, interpretation, and statistical methods that can be used to
estimate it. This article provides guidance on selecting and interpreting an
estimand to help medical researchers correctly implement statistical methods
used to estimate causal effects in observational studies and to help audiences
correctly interpret the results and limitations of these studies. The
interpretations of the estimands resulting from regression and instrumental
variable analyses are also discussed. Choosing an estimand carefully is
essential for making valid inferences from the analysis of observational data
and ensuring results are replicable and useful for practitioners.
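One compact way to contrast these estimands, following the balancing-weights literature (a sketch in assumed notation, not necessarily the article's): each estimand averages the conditional treatment effect over a target population defined by a tilting function $h(x)$ of the propensity score $e(x) = P(A = 1 \mid X = x)$:

```latex
\tau_h = \frac{E\left[\, h(X)\,\{\mu_1(X) - \mu_0(X)\} \,\right]}{E\left[\, h(X) \,\right]},
\qquad
h(x) =
\begin{cases}
1 & \text{(ATE)} \\
e(x) & \text{(ATT)} \\
1 - e(x) & \text{(ATU)} \\
e(x)\{1 - e(x)\} & \text{(ATO)}
\end{cases}
```

with $\mu_a(x) = E[Y \mid A = a, X = x]$. The ATO tilting function downweights units with propensity scores near 0 or 1, which is why it targets the equipoise population.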
State-level policy evaluations commonly employ a difference-in-differences (DID) study design; yet within this framework, statistical model specification varies notably across studies. Motivated by applied state-level opioid policy evaluations, this simulation study compares statistical performance of multiple variations of two-way fixed effect models traditionally used for DID under a range of simulation conditions. While most linear models resulted in minimal bias, non-linear models and population-weighted versions of classic linear two-way fixed effect and linear GEE models yielded considerable bias (60 to 160%). Further, root mean square error is minimized by linear AR models when examining crude mortality rates and by negative binomial models when examining raw death counts. In the context of frequentist hypothesis testing, many models yielded high Type I error rates and very low rates of correctly rejecting the null hypothesis (< 10%), raising concerns of spurious conclusions about policy effectiveness. When considering performance across models, the linear autoregressive models were optimal in terms of directional bias, root mean squared error, Type I error, and correct rejection rates. These findings highlight notable limitations of traditional statistical models commonly used for DID designs, designs widely used in opioid policy studies and in state policy evaluations more broadly.
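The baseline specification behind the model variants compared above can be sketched as the canonical linear two-way fixed-effects DID model (notation assumed, not the study's own):

```latex
Y_{st} = \beta\, \mathrm{Policy}_{st} + \gamma_s + \delta_t + \varepsilon_{st}
```

where $Y_{st}$ is the outcome (e.g., crude mortality rate or raw death count) in state $s$ at time $t$, $\mathrm{Policy}_{st}$ indicates an enacted policy, $\gamma_s$ and $\delta_t$ are state and time fixed effects, and $\beta$ is the DID estimate. The simulated variants differ in link function (linear vs. non-linear), population weighting, and error structure (e.g., autoregressive terms or GEE working correlation).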
Purpose: To quantify the relative performance of step counting algorithms in
studies that collect free-living high-resolution wrist accelerometry data and
to highlight the implications of using these algorithms in translational
research. Methods: Five step counting algorithms (four open source and one
proprietary) were applied to the publicly available, free-living,
high-resolution wrist accelerometry data collected by the National Health and
Nutrition Examination Survey (NHANES) in 2011-2014. The mean daily total step
counts were compared in terms of correlation, predictive performance, and
estimated hazard ratios of mortality. Results: The estimated numbers of steps
were highly correlated (median=0.91, range 0.77 to 0.98) and had high,
comparable predictive performance for mortality (median concordance=0.72, range
0.70 to 0.73). The distributions of the number of steps in the population
varied widely (mean step counts range from 2,453 to 12,169). Hazard ratios of
mortality associated with a 500-step increase per day varied among step
counting algorithms between HR=0.88 and HR=0.96, corresponding to a 3-fold
difference in implied mortality risk reduction ([1-0.88]/[1-0.96]=3). Conclusion:
Different step counting algorithms provide correlated step estimates and have
similar predictive performance that is better than traditional predictors of
mortality. However, they provide widely different distributions of step counts
and estimated reductions in mortality risk for a 500-step increase.
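The arithmetic behind that comparison, using the abstract's own numbers:

```python
# Fold difference in implied mortality risk reduction for a 500-step/day
# increase, comparing the extreme hazard-ratio estimates across algorithms.
hr_best, hr_worst = 0.88, 0.96
reduction_best = 1 - hr_best    # 12% risk reduction
reduction_worst = 1 - hr_worst  # 4% risk reduction
print(round(reduction_best / reduction_worst))  # 3
```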
Machine learning has been an emerging tool for various aspects of infectious
diseases including tuberculosis surveillance and detection. However, WHO
provided no recommendations on using computer-aided tuberculosis detection
software because of the small number of studies, methodological limitations,
and limited generalizability of the findings. To quantify the generalizability
of the machine-learning model, we developed a Deep Convolutional Neural Network
(DCNN) model using a TB-specific CXR dataset of one population (National
Library of Medicine Shenzhen No.3 Hospital) and tested it with a
non-TB-specific CXR dataset from another population (National Institutes of
Health Clinical Center). The findings suggested that a supervised deep learning model
developed by using the training dataset from one population may not have the
same diagnostic performance in another population. Technical specification of
CXR images, disease severity distribution, overfitting, and overdiagnosis
should be examined before implementation in other settings.
Large health surveys increasingly collect high-dimensional functional data from wearable devices, and function-on-scalar regression (FoSR) is often used to quantify the relationship between these functional outcomes and scalar covariates such as age and sex. However, existing methods for FoSR fail to account for complex survey design. We introduce inferential methods for FoSR for studies with complex survey designs. The method combines fast univariate inference (FUI) developed for functional data outcomes and survey sampling inferential methods developed for scalar outcomes. Our approach consists of three steps: (1) fit survey-weighted GLMs at each point along the functional domain, (2) smooth coefficients along the functional domain, and (3) use balanced repeated replication (BRR) or the Rao-Wu-Yue-Beaumont (RWYB) bootstrap to obtain pointwise and joint confidence bands for the functional coefficients. The method is motivated by association studies between continuous physical activity data and covariates collected in the National Health and Nutrition Examination Survey (NHANES). A first-of-its-kind analytical simulation study and an empirical simulation using the NHANES data demonstrate that our method performs better than existing methods that do not account for the survey structure. Finally, application of the method in NHANES shows the practical implications of accounting for survey structure. The method is implemented in the R package svyfosr.
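The three steps can be sketched in simplified form (a toy illustration with a single scalar covariate, plain weighted least squares in place of survey-weighted GLMs, a moving-average smoother in place of splines, and a naive unit-level bootstrap in place of BRR/RWYB replicate weights; none of this is the svyfosr implementation):

```python
import random

def pointwise_wls(Y, x, w):
    """Step 1: weighted simple linear regression at each point of the
    functional domain. Y: per-unit curves (equal-length lists); x: a scalar
    covariate; w: unit-level (survey) weights. Returns pointwise slopes."""
    n, T = len(Y), len(Y[0])
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    betas = []
    for t in range(T):
        ybar = sum(w[i] * Y[i][t] for i in range(n)) / sw
        num = sum(w[i] * (x[i] - xbar) * (Y[i][t] - ybar) for i in range(n))
        den = sum(w[i] * (x[i] - xbar) ** 2 for i in range(n))
        betas.append(num / den)
    return betas

def smooth(beta, k=3):
    """Step 2: moving-average smoother along the domain (a stand-in for the
    spline smoothing used in practice)."""
    out = []
    for t in range(len(beta)):
        window = beta[max(0, t - k):t + k + 1]
        out.append(sum(window) / len(window))
    return out

def bootstrap_band(Y, x, w, B=200, seed=1):
    """Step 3: resample units with replacement (a stand-in for BRR/RWYB
    replicate weights) to get pointwise 95% confidence bands."""
    rng = random.Random(seed)
    n, T = len(Y), len(Y[0])
    draws = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(smooth(pointwise_wls([Y[i] for i in idx],
                                          [x[i] for i in idx],
                                          [w[i] for i in idx])))
    lo = [sorted(d[t] for d in draws)[int(0.025 * B)] for t in range(T)]
    hi = [sorted(d[t] for d in draws)[int(0.975 * B)] for t in range(T)]
    return lo, hi

# Simulated example: 60 units, 20 domain points, slope varying over t.
rng = random.Random(0)
n, T = 60, 20
x = [rng.random() for _ in range(n)]
w = [1.0 + rng.random() for _ in range(n)]          # illustrative weights
true_beta = [0.5 + 0.05 * t for t in range(T)]
Y = [[x[i] * true_beta[t] + 0.1 * rng.gauss(0, 1) for t in range(T)]
     for i in range(n)]
beta_hat = smooth(pointwise_wls(Y, x, w))
lo, hi = bootstrap_band(Y, x, w)
```

The appeal of the FUI-style decomposition is that each step reuses standard scalar-outcome machinery, which is what makes combining it with survey-sampling inference tractable.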
We address the challenge of estimation in the context of constant linear effect models with dense functional responses. In this framework, the conditional expectation of the response curve is represented by a linear combination of functional covariates with constant regression parameters. In this paper, we present an alternative solution by employing the quadratic inference approach, a well-established method for analyzing correlated data, to estimate the regression coefficients. Our approach leverages non-parametrically estimated basis functions, eliminating the need to choose a working correlation structure. Furthermore, we demonstrate that our method achieves the parametric root-n convergence rate, contingent on an appropriate choice of bandwidth. This convergence holds when the number of repeated measurements per trajectory exceeds a certain threshold, specifically when it surpasses n^{a_0}, with n representing the number of trajectories. Additionally, we establish the asymptotic normality of the resulting estimator. The performance of the proposed method is compared with that of existing methods through extensive simulation studies, in which our proposed method outperforms the alternatives. Real data analysis is also conducted to demonstrate the proposed method.
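The model class can be written as (a sketch in assumed notation):

```latex
E\left[\, Y_i(t) \mid X_i \,\right] = \sum_{j=1}^{p} \beta_j\, X_{ij}(t),
\qquad t \in \mathcal{T}
```

with scalar coefficients $\beta_j$ held constant across the functional domain $\mathcal{T}$. The stated result is that the quadratic-inference estimator of $(\beta_1, \dots, \beta_p)$ attains the parametric $\sqrt{n}$ rate once the number of repeated measurements per trajectory grows faster than $n^{a_0}$, for $n$ trajectories.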
Real-world data, such as administrative claims and electronic health records, are increasingly used for safety monitoring and to help guide regulatory decision-making. In these settings, it is important to document analytic decisions transparently and objectively to ensure that analyses meet their intended goals. The Causal Roadmap is an established framework that can guide and document analytic decisions through each step of the analytic pipeline, which will help investigators generate high-quality real-world evidence. In this paper, we illustrate the utility of the Causal Roadmap using two case studies previously led by workgroups sponsored by the Sentinel Initiative -- a program for actively monitoring the safety of regulated medical products. Each case example focuses on different aspects of the analytic pipeline for drug safety monitoring. The first case study shows how the Causal Roadmap encourages transparency, reproducibility, and objective decision-making for causal analyses. The second case study highlights how this framework can guide analytic decisions beyond inference on causal parameters, improving outcome ascertainment in clinical phenotyping. These examples provide a structured framework for implementing the Causal Roadmap in safety surveillance and guide transparent, reproducible, and objective analysis.
The data science revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking -- the problem-solving process to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle differences in how a data analyst (or producer of a data analysis) constructs, creates, or designs a data analysis, including differences in the choice of methods, tooling, and workflow. These choices can affect the data analysis products themselves and the experience of the consumer of the data analysis. Therefore, the role of a producer can be thought of as designing the data analysis with a set of design principles. Here, we introduce design principles for data analysis and describe how they can be mapped to data analyses in a quantitative, objective and informative manner. We also provide empirical evidence of variation of principles within and between both producers and consumers of data analyses. Our work leads to two insights: it suggests a formal mechanism to describe data analyses based on the design principles for data analysis, and it provides a framework to teach students how to build data analyses using formal design principles.
Traditionally, statistical and causal inference on human subjects rely on the
assumption that individuals are independently affected by treatments or
exposures. However, recently there has been increasing interest in settings,
such as social networks, where individuals may interact with one another such
that treatments may spill over from the treated individual to their social
contacts and outcomes may be contagious. Existing models proposed for causal
inference using observational data from networks of interacting individuals
have two major shortcomings. First, they often require a level of granularity
in the data that is practically infeasible to collect in most settings, and
second, the models are high-dimensional and often too big to fit to the
available data. In this paper we illustrate and justify a parsimonious
parameterization for network data with interference and contagion. Our
parameterization corresponds to a particular family of graphical models known
as chain graphs. We argue that, in some settings, chain graph models
approximate the marginal distribution of a snapshot of a longitudinal data
generating process on interacting units. We illustrate the use of chain graphs
for causal inference about collective decision making in social networks using
data from U.S. Supreme Court decisions between 1994 and 2004 and in
simulations.
Advances in spatially-resolved transcriptomics (SRT) technologies have
propelled the development of new computational analysis methods to unlock
biological insights. As the cost of generating these data decreases, these
technologies provide an exciting opportunity to create large-scale atlases that
integrate SRT data across multiple tissues, individuals, species, or phenotypes
to perform population-level analyses. Here, we describe unique challenges of
varying spatial resolutions in SRT data, as well as highlight the opportunities
for standardized preprocessing methods along with computational algorithms
amenable to atlas-scale datasets leading to improved sensitivity and
reproducibility in the future.
The study of treatment effects is often complicated by noncompliance and
missing data. In the one-sided noncompliance setting, where the complier and
noncomplier average causal effects (CACE and NACE) are of interest, we address
outcome missingness of the \textit{latent missing at random} type (LMAR, also
known as \textit{latent ignorability}). That is, conditional on covariates and
treatment assigned, the missingness may depend on compliance type. Within the
instrumental variable (IV) approach to noncompliance, methods have been
proposed for handling LMAR outcome that additionally invoke an exclusion
restriction type assumption on missingness, but no solution has been proposed
for when a non-IV approach is used. This paper focuses on effect identification
in the presence of LMAR outcome, with a view to flexibly accommodate different
principal identification approaches. We show that under treatment assignment
ignorability and LMAR only, effect nonidentifiability boils down to a set of
two connected mixture equations involving unidentified stratum-specific
response probabilities and outcome means. This clarifies that (except for a
special case) effect identification generally requires two additional
assumptions: a \textit{specific missingness mechanism} assumption and a
\textit{principal identification} assumption. This provides a template for
identifying effects based on separate choices of these assumptions. We consider
a range of specific missingness assumptions, including those that have appeared
in the literature and some new ones. Incidentally, we find an issue in the
existing assumptions, and propose a modification of the assumptions to avoid
the issue. Results under different assumptions are illustrated using data from
the Baltimore Experience Corps Trial.
In this paper, we develop a semiparametric sensitivity analysis approach
designed to address unmeasured confounding in observational studies with
time-to-event outcomes. We target estimation of the marginal distributions of
potential outcomes under competing exposures using influence function-based
techniques. We derive the non-parametric influence function for uncensored
data and map it to the observed-data influence function. Our methodology is
motivated by and applied to an
observational study evaluating the effectiveness of radical prostatectomy (RP)
versus external beam radiotherapy with androgen deprivation (EBRT+AD) for the
treatment of prostate cancer. We also present a simulation study to evaluate
the statistical properties of our methodology.
Extending (generalizing or transporting) causal inferences from a randomized
trial to a target population requires ``generalizability'' or
``transportability'' assumptions, which state that randomized and
non-randomized individuals are exchangeable conditional on baseline covariates.
These assumptions are made on the basis of background knowledge, which is often
uncertain or controversial, and need to be subjected to sensitivity analysis.
We present simple methods for sensitivity analyses that do not require detailed
background knowledge about specific unknown or unmeasured determinants of the
outcome or modifiers of the treatment effect. Instead, our methods directly
parameterize violations of the assumptions using bias functions. We show how
the methods can be applied to non-nested trial designs, where the trial data
are combined with a separately obtained sample of non-randomized individuals,
as well as to nested trial designs, where a clinical trial is embedded within a
cohort sampled from the target population. We illustrate the methods using data
from a clinical trial comparing treatments for chronic hepatitis C infection.
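The bias-function idea can be made concrete (a generic sketch, hedged; the paper's exact parameterization may differ). With $S$ indicating trial participation and $Y^a$ the potential outcome under treatment $a$, one such bias function is

```latex
u(a, x) = E\left[\, Y^{a} \mid X = x,\, S = 1 \,\right]
        - E\left[\, Y^{a} \mid X = x,\, S = 0 \,\right]
```

which is identically zero when the exchangeability (transportability) assumption holds. Sensitivity analysis then proceeds by positing plausible values or ranges for $u(a, x)$ and propagating them into bias-corrected estimates of the target-population mean, without having to model specific unmeasured determinants or effect modifiers.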