MRC Biostatistics Unit
A dynamic treatment regime is a sequence of treatment decision rules tailored to an individual's evolving status over time. In precision medicine, much focus has been placed on finding an optimal dynamic treatment regime which, if followed by everyone in the population, would yield the best outcome on average; and extensive investigation has been conducted from both methodological and applications standpoints. The aim of this tutorial is to provide readers who are interested in optimal dynamic treatment regimes with a systematic, detailed but accessible introduction, including the formal definition and formulation of this topic within the framework of causal inference, identification assumptions required to link the causal quantity of interest to the observed data, existing statistical models and estimation methods to learn the optimal regime from data, and application of these methods to both simulated and real data.
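Q-learning is one standard regression-based estimator of an optimal dynamic treatment regime of the kind covered in such tutorials. The sketch below is a minimal two-stage illustration on simulated data; the toy generative model, the variable names and the linear working models are assumptions for illustration, not the tutorial's own example.

```python
# Minimal two-stage Q-learning sketch on simulated data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 2000

# Stage-1 state, treatment and stage-2 state (toy generative model)
x1 = rng.normal(size=n)
a1 = rng.integers(0, 2, size=n)
x2 = 0.5 * x1 + 0.3 * a1 + rng.normal(size=n)
a2 = rng.integers(0, 2, size=n)
# Final outcome: higher is better; treatment effects depend on the evolving state
y = x1 + x2 + a2 * (1.0 - x2) + a1 * (0.5 + x1) + rng.normal(size=n)

def design(state, a):
    """Working-model features: main effects plus treatment-by-state interaction."""
    return np.column_stack([state, a, a * state])

# Stage 2: regress Y on (X2, A2), then form the pseudo-outcome max_a Q2
q2 = LinearRegression().fit(design(x2, a2), y)
q2_max = np.maximum(q2.predict(design(x2, np.zeros(n))),
                    q2.predict(design(x2, np.ones(n))))

# Stage 1: regress the pseudo-outcome on (X1, A1)
q1 = LinearRegression().fit(design(x1, a1), q2_max)

# Estimated optimal rules: treat iff the predicted Q-value is larger under a = 1
d2 = (q2.predict(design(x2, np.ones(n))) > q2.predict(design(x2, np.zeros(n)))).astype(int)
d1 = (q1.predict(design(x1, np.ones(n))) > q1.predict(design(x1, np.zeros(n)))).astype(int)
print("Proportion recommended treatment at stage 1:", d1.mean())
```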
In heterogeneous disease settings, accounting for intrinsic sample variability is crucial for obtaining reliable and interpretable omic network estimates. However, most graphical model analyses of biomedical data assume homogeneous conditional dependence structures, potentially leading to misleading conclusions. To address this, we propose a joint Gaussian graphical model that leverages sample-level ordinal covariates (e.g., disease stage) to account for heterogeneity and improve the estimation of partial correlation structures. Our modelling framework, called NExON-Bayes, extends the graphical spike-and-slab framework to account for ordinal covariates, jointly estimating their relevance to the graph structure and leveraging them to improve the accuracy of network estimation. To scale to high-dimensional omic settings, we develop an efficient variational inference algorithm tailored to our model. Through simulations, we demonstrate that our method outperforms the vanilla graphical spike-and-slab (with no covariate information), as well as other state-of-the-art network approaches which exploit covariate information. Applying our method to reverse phase protein array data from patients diagnosed with stage I, II or III breast carcinoma, we estimate the behaviour of proteomic networks as breast carcinoma progresses. Our model provides insights not only through inspection of the estimated proteomic networks, but also of the estimated ordinal covariate dependencies of key groups of proteins within those networks, offering a comprehensive understanding of how biological pathways shift across disease stages. Availability and Implementation: A user-friendly R package for NExON-Bayes with tutorials is available on Github at this http URL.
Motivation: Modern biobanks, with unprecedented sample sizes and phenotypic diversity, have become foundational resources for genomic studies, enabling powerful cross-phenotype and population-scale analyses. As studies grow in complexity, Bayesian hierarchical models offer a principled framework for jointly modeling multiple units such as cells, traits, and experimental conditions, increasing statistical power through information sharing. However, adoption of Bayesian hierarchical models in biobank-scale studies remains limited due to computational inefficiencies, particularly in posterior inference over high-dimensional parameter spaces. Deterministic approximations such as variational inference provide scalable alternatives to Markov chain Monte Carlo, yet current implementations do not fully exploit the structure of genome-wide multi-unit modeling, especially when biological effects of interest are concentrated in a few units. Results: We propose an adaptive focus (AF) strategy within a block coordinate ascent variational inference (CAVI) framework that selectively updates subsets of parameters at each iteration, corresponding to units deemed relevant based on current estimates. We illustrate this approach in protein quantitative trait locus (pQTL) mapping using a joint model of hierarchically linked regressions with shared parameters across traits. In both simulated data and real proteomic data from the UK Biobank, AF-CAVI achieves up to a 50% reduction in runtime while maintaining statistical performance. We also provide a genome-wide pipeline for multi-trait pQTL mapping across thousands of traits, demonstrating AF-CAVI as an efficient scheme for large-scale, multi-unit Bayesian analysis in biobanks.
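A toy illustration of the adaptive-focus idea inside block CAVI is sketched below: only a subset of units is updated at each iteration, with a periodic full sweep. The model (independent per-unit linear regressions with a shared Gaussian prior precision), the relevance rule (units whose estimates are still moving) and all names are assumptions for illustration; they are not the AF-CAVI model, relevance criterion or implementation used in the paper.

```python
# Toy adaptive-focus block CAVI: per-unit regressions with a shared prior precision.
import numpy as np

rng = np.random.default_rng(0)
T, n, p, sigma2 = 200, 100, 5, 1.0          # units, samples, predictors, noise variance
X = [rng.normal(size=(n, p)) for _ in range(T)]
beta_true = [rng.normal(scale=0.5, size=p) * (t < 10) for t in range(T)]  # few active units
y = [X[t] @ beta_true[t] + rng.normal(scale=np.sqrt(sigma2), size=n) for t in range(T)]

a0, b0 = 1.0, 1.0                            # Gamma prior on the shared precision tau
m = [np.zeros(p) for _ in range(T)]          # variational means of q(beta_t)
S = [np.eye(p) for _ in range(T)]            # variational covariances of q(beta_t)
E_tau = a0 / b0
focus = set(range(T))                        # start with a full sweep over all units

for it in range(50):
    moved = set()
    for t in focus:
        # Closed-form CAVI update of q(beta_t) given the current E[tau]
        S[t] = np.linalg.inv(X[t].T @ X[t] / sigma2 + E_tau * np.eye(p))
        m_new = S[t] @ X[t].T @ y[t] / sigma2
        if np.max(np.abs(m_new - m[t])) > 1e-4:
            moved.add(t)
        m[t] = m_new
    # CAVI update of the shared precision q(tau) uses all units
    a_tau = a0 + 0.5 * T * p
    b_tau = b0 + 0.5 * sum(m[t] @ m[t] + np.trace(S[t]) for t in range(T))
    E_tau = a_tau / b_tau
    # Adaptive focus: keep updating only units that are still moving,
    # with an occasional full refresh so stale units are not frozen forever
    focus = set(range(T)) if (it + 1) % 10 == 0 else moved

print("Units with largest estimated effects:",
      np.argsort([-np.linalg.norm(mt) for mt in m])[:10])
```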
Inference for causal effects can benefit from the availability of an instrumental variable (IV) which, by definition, is associated with the given exposure, but not with the outcome of interest other than through a causal exposure effect. Estimation methods for instrumental variables are now well established for continuous outcomes, but much less so for dichotomous outcomes. In this article we review IV estimation of so-called conditional causal odds ratios which express the effect of an arbitrary exposure on a dichotomous outcome conditional on the exposure level, instrumental variable and measured covariates. In addition, we propose IV estimators of so-called marginal causal odds ratios which express the effect of an arbitrary exposure on a dichotomous outcome at the population level, and are therefore of greater public health relevance. We explore interconnections between the different estimators and support the results with extensive simulation studies and three applications.
Recent years have seen increased interest in combining drug agents and/or schedules. Several methods for phase I combination-escalation trials have been proposed, among which the partial ordering continual reassessment method (POCRM) has gained particular attention for its simplicity and good operating characteristics. However, the one-parameter nature of the POCRM makes it restrictive in more complicated settings, such as the inclusion of a control group. This paper proposes a Bayesian partial ordering logistic model (POBLRM), which combines partial ordering with the more flexible (than the CRM) two-parameter logistic model. Simulation studies show that the POBLRM performs similarly to the POCRM in non-randomised settings; when patients are randomised between the experimental dose combinations and a control, its performance is substantially better. Most designs require specifying hyper-parameters, often chosen on statistical grounds (an operational prior). The conventional "grid search" calibration approach requires large simulations, which are computationally costly. A novel "cyclic calibration" is proposed, which reduces the computation from multiplicative to additive in the number of hyper-parameters. Furthermore, calibration should consider a wide range of scenarios for the true toxicity probabilities to avoid bias. A method to reduce the number of scenarios based on their complexity is suggested; this can reduce the computation more than 500-fold while retaining operating characteristics similar to those of the grid search.
Modern data analysis frequently involves large-scale hypothesis testing, which naturally gives rise to the problem of maintaining control of a suitable type I error rate, such as the false discovery rate (FDR). In many biomedical and technological applications, an additional complexity is that hypotheses are tested in an online manner, one-by-one over time. However, traditional procedures that control the FDR, such as the Benjamini-Hochberg procedure, assume that all p-values are available to be tested at a single time point. To address these challenges, a new field of methodology has developed over the past 15 years showing how to control error rates for online multiple hypothesis testing. In this framework, hypotheses arrive in a stream, and at each time point the analyst decides whether to reject the current hypothesis based both on the evidence against it, and on the previous rejection decisions. In this paper, we present a comprehensive exposition of the literature on online error rate control, with a review of key theory as well as a focus on applied examples. We also provide simulation results comparing different online testing algorithms and an up-to-date overview of the many methodological extensions that have been proposed.
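The structure of these procedures can be illustrated with one simple online algorithm, LOND, in which the level spent on hypothesis i grows with the number of discoveries made so far and FDR control holds under independence of the p-values. The sketch below is illustrative; the gamma sequence is one admissible choice (nonnegative, summing to one), and other algorithms in this literature follow the same pattern of spending a level that depends on past decisions.

```python
# Minimal sketch of the LOND online FDR procedure on a stream of p-values.
import numpy as np

def lond(pvals, alpha=0.05):
    pvals = np.asarray(pvals, dtype=float)
    i = np.arange(1, len(pvals) + 1)
    gamma = 6.0 / (np.pi ** 2 * i ** 2)        # nonnegative sequence summing to 1
    rejections = np.zeros(len(pvals), dtype=bool)
    discoveries = 0
    for t, p in enumerate(pvals):
        # Level for hypothesis t depends on the number of discoveries so far
        alpha_t = alpha * gamma[t] * (discoveries + 1)
        rejections[t] = p <= alpha_t
        discoveries += int(rejections[t])
    return rejections

# Example: a stream with a few strong signals among nulls
rng = np.random.default_rng(42)
p_stream = np.concatenate([rng.uniform(size=50),
                           rng.uniform(0, 1e-4, size=5),
                           rng.uniform(size=50)])
print("Rejected hypotheses:", np.flatnonzero(lond(p_stream)))
```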
Dose-finding trials for oncology studies are traditionally designed to assess safety in the early stages of drug development. With the rise of molecularly targeted therapies and immuno-oncology compounds, biomarker-driven approaches have gained significant importance. In this paper, we propose a novel approach that incorporates multiple values of a predictive biomarker to assist in evaluating binary toxicity outcomes, using the factorization of a joint model, in phase I dose-finding oncology trials. The proposed joint model framework, which utilizes additional repeated biomarker values as an early predictive marker for potential toxicity, is compared to the likelihood-based continual reassessment method (CRM) using only binary toxicity data, across various dose-toxicity relationship scenarios. Our findings highlight a critical limitation of likelihood-based approaches in early-phase dose-finding studies with small sample sizes: estimation challenges that have previously been overlooked in the phase I dose-escalation setting. We explore potential remedies to address these challenges and emphasize the appropriate use of likelihood-based methods. Simulation results demonstrate that the proposed joint model framework, by integrating biomarker information, can alleviate estimation problems in the likelihood-based CRM and improve the proportion of correct selections. However, we highlight that the inherent data limitations in early-phase dose-finding studies remain a significant challenge that cannot be fully overcome in the frequentist framework.
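The likelihood-based CRM comparator referred to above can be sketched with the common one-parameter power ("empiric") model, P(toxicity at dose i) = skeleton_i^exp(a). The skeleton, target and data below are illustrative, and the proposed joint biomarker model is not shown.

```python
# Minimal sketch of a likelihood-based CRM update with the one-parameter power model.
import numpy as np
from scipy.optimize import minimize_scalar

skeleton = np.array([0.05, 0.10, 0.20, 0.30, 0.45])   # prior guesses of toxicity
target = 0.25

# Observed (dose index, toxicity indicator) pairs so far (illustrative)
doses = np.array([0, 0, 1, 1, 2, 2])
tox   = np.array([0, 0, 0, 0, 1, 0])

def neg_log_lik(a):
    p = skeleton[doses] ** np.exp(a)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(tox * np.log(p) + (1 - tox) * np.log(1 - p))

# Note: the MLE of a only exists once both a toxicity and a non-toxicity have
# been observed, a well-known small-sample feature of likelihood-based CRM.
fit = minimize_scalar(neg_log_lik, bounds=(-5, 5), method="bounded")
p_hat = skeleton ** np.exp(fit.x)
next_dose = int(np.argmin(np.abs(p_hat - target)))
print("Estimated toxicity curve:", np.round(p_hat, 3), "-> next dose index:", next_dose)
```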
The concept of missing at random is central in the literature on statistical analysis with missing data. In general, inference using incomplete data should be based not only on observed data values but should also take account of the pattern of missing values. However, it is often said that if data are missing at random, valid inference using likelihood approaches (including Bayesian) can be obtained ignoring the missingness mechanism. Unfortunately, the term "missing at random" has been used inconsistently and not always clearly; there has also been a lack of clarity around the meaning of "valid inference using likelihood". These issues have created potential for confusion about the exact conditions under which the missingness mechanism can be ignored, and perhaps fed confusion around the meaning of "analysis ignoring the missingness mechanism". Here we provide standardised precise definitions of "missing at random" and "missing completely at random", in order to promote unification of the theory. Using these definitions we clarify the conditions that suffice for "valid inference" to be obtained under a variety of inferential paradigms.
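As a point of reference for the discussion above, one commonly used informal statement of these conditions, and of the likelihood factorisation that makes the mechanism ignorable, is sketched below. The paper's standardised definitions are more careful about precisely which realisations of the data and the missingness pattern the conditions must hold for, so this display is illustrative rather than the paper's own statement.

```latex
% Write Y = (Y_obs, Y_mis) for the full data partitioned by the missingness
% pattern R, and f(r | y; psi) for the missingness mechanism.
\begin{align*}
\text{MCAR:}\quad & f(r \mid y; \psi) = f(r; \psi) \quad \text{for all } y;\\
\text{MAR:}\quad  & f(r \mid y; \psi) = f(r \mid y_{\mathrm{obs}}; \psi),
  \quad \text{i.e. no dependence on } y_{\mathrm{mis}}.
\end{align*}
% Under MAR and distinctness of the data parameters \theta and the mechanism
% parameters \psi, the observed-data likelihood factorises as
% f(y_{\mathrm{obs}}, r; \theta, \psi)
%   = f(r \mid y_{\mathrm{obs}}; \psi)\, f(y_{\mathrm{obs}}; \theta),
% so likelihood and Bayesian inference about \theta may ignore the mechanism.
```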
Maximizing statistical power in experimental design often involves imbalanced treatment allocation, but several challenges hinder its practical adoption: (1) the misconception that equal allocation always maximizes power, (2) when only targeting maximum power, more than half the participants may be expected to obtain inferior treatment, and (3) response-adaptive randomization (RAR) targeting maximum statistical power may inflate type I error rates substantially. Recent work identified issue (3) and proposed a novel allocation procedure combined with the asymptotic score test. Instead, the current research focuses on finite-sample guarantees. First, we analyze the power for traditional power-maximizing RAR procedures under exact tests, including a novel generalization of Boschloo's test. Second, we evaluate constrained Markov decision process (CMDP) RAR procedures under exact tests. These procedures target maximum average power under constraints on pointwise and average type I error rates, with averages taken across the parametric space. A combination of the unconditional exact test and the CMDP procedure protecting allocations to the superior arm gives the best performance, providing substantial power gains over equal allocation while allocating more participants in expectation to the superior treatment. Future research could focus on the randomization test, in which CMDP procedures exhibited lower power compared to other examined RAR procedures.
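A small numerical illustration of points (1) and (2): for the two-sample Wald Z-test of a difference in proportions, the asymptotic power-maximising (Neyman) allocation is proportional to sqrt(p_k(1 - p_k)), which is not 1:1 in general and can favour the inferior arm. The response probabilities below are illustrative, and the paper's exact-test and CMDP analyses are not reproduced here.

```python
# Neyman allocation for two binary arms (illustrative values).
import numpy as np

p_superior, p_inferior = 0.9, 0.5
sd = np.sqrt(np.array([p_superior * (1 - p_superior), p_inferior * (1 - p_inferior)]))
neyman = sd / sd.sum()                         # allocation proportions
print("Share allocated to the superior arm:", round(neyman[0], 3))   # approx. 0.375
# More than 60% of participants would be assigned to the inferior treatment
# when maximising power alone, which is the ethical tension noted above.
```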
To implement a Bayesian response-adaptive trial it is necessary to evaluate a sequence of posterior probabilities. This sequence is often approximated by simulation due to the unavailability of closed-form formulae to compute it exactly. Approximating these probabilities by simulation can be computationally expensive and impact the accuracy or the range of scenarios that may be explored. An alternative approximation method based on Gaussian distributions can be faster but its accuracy is not guaranteed. The literature lacks practical recommendations for selecting approximation methods and comparing their properties, particularly considering trade-offs between computational speed and accuracy. In this paper, we focus on the case where the trial has a binary endpoint with Beta priors. We first outline an efficient way to compute the posterior probabilities exactly for any number of treatment arms. Then, using exact probability computations, we show how to benchmark calculation methods based on considerations of computational speed, patient benefit, and inferential accuracy. This is done through a range of simulations in the two-armed case, as well as an analysis of the three-armed Established Status Epilepticus Treatment Trial. Finally, we provide practical guidance for which calculation method is most appropriate in different settings, and how to choose the number of simulations if the simulation-based approximation method is used.
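For two arms, the posterior probability in question can be computed to numerical tolerance via the identity P(p_A > p_B) = E[F_B(p_A)], as sketched below; this quadrature approach is shown for illustration and is not necessarily the exact algorithm proposed in the paper. A simulation-based approximation is included for comparison.

```python
# Exact (quadrature) versus simulated P(p_A > p_B) for two Beta posteriors.
import numpy as np
from scipy import integrate, stats

def prob_A_beats_B(a1, b1, a2, b2):
    """P(p_A > p_B) with p_A ~ Beta(a1, b1) and p_B ~ Beta(a2, b2), independent."""
    integrand = lambda x: stats.beta.cdf(x, a2, b2) * stats.beta.pdf(x, a1, b1)
    value, _ = integrate.quad(integrand, 0.0, 1.0)
    return value

# Posteriors after 12/20 successes on A and 7/20 on B under Beta(1, 1) priors
exact = prob_A_beats_B(1 + 12, 1 + 8, 1 + 7, 1 + 13)

# Simulation-based approximation for comparison
rng = np.random.default_rng(0)
sim = np.mean(rng.beta(13, 9, 100_000) > rng.beta(8, 14, 100_000))
print(f"quadrature: {exact:.4f}  simulation: {sim:.4f}")
```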
This paper evaluates the performance of the Bayesian flexible parametric survival model implemented in the R package `survextrap` using extensive simulation studies based on realistic oncology clinical trial data. It identifies optimal model specifications and computational settings for accurately modeling complex survival patterns and time-varying effects, while comparing its performance to established frequentist methods.
In large-scale genomic applications vast numbers of molecular features are scanned in order to find a small number of candidates which are linked to a particular disease or phenotype. This is a variable selection problem in the "large p, small n" paradigm where many more variables than samples are available. Additionally, a complex dependence structure is often observed among the markers/genes due to their joint involvement in biological processes and pathways. Bayesian variable selection methods that introduce sparseness through additional priors on the model size are well suited to the problem. However, the model space is very large and standard Markov chain Monte Carlo (MCMC) algorithms such as a Gibbs sampler sweeping over all p variables in each iteration are often computationally infeasible. We propose to employ the dependence structure in the data to decide which variables should always be updated together and which are nearly conditionally independent and hence do not need to be considered together. Here, we focus on binary classification applications. We follow the implementation of the Bayesian probit regression model by Albert and Chib (1993) and the Bayesian logistic regression model by Holmes and Held (2006) which both lead to marginal Gaussian distributions. We investigate several MCMC samplers using the dependence structure in different ways. The mixing and convergence performances of the resulting Markov chains are evaluated and compared to standard samplers in two simulation studies and in an application to a real gene expression data set.
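The Albert and Chib (1993) data-augmentation sampler referenced above is the building block of the probit samplers studied; a minimal sketch is given below, without the variable-selection prior or the dependence-aware blocking that the paper investigates. The prior beta ~ N(0, v I) and all settings are illustrative.

```python
# Minimal Albert-Chib Gibbs sampler for Bayesian probit regression.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)
n, p, v = 200, 4, 10.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.0, 0.5])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

B = np.linalg.inv(X.T @ X + np.eye(p) / v)       # posterior covariance of beta | z
L = np.linalg.cholesky(B)
beta = np.zeros(p)
draws = []
for it in range(2000):
    mu = X @ beta
    # z_i | beta, y_i is N(mu_i, 1) truncated to (0, inf) if y_i = 1, (-inf, 0] otherwise
    lo = np.where(y == 1, -mu, -np.inf)          # standardised lower bounds
    hi = np.where(y == 1, np.inf, -mu)           # standardised upper bounds
    z = truncnorm.rvs(lo, hi, loc=mu, scale=1.0, random_state=rng)
    # beta | z ~ N(B X'z, B)
    beta = B @ X.T @ z + L @ rng.normal(size=p)
    if it >= 500:
        draws.append(beta)
print("Posterior means:", np.round(np.mean(draws, axis=0), 2))
```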
This paper considers the problem of estimating the structure of multiple related directed acyclic graph (DAG) models. Building on recent developments in exact estimation of DAGs using integer linear programming (ILP), we present an ILP approach for joint estimation over multiple DAGs, that does not require that the vertices in each DAG share a common ordering. Furthermore, we allow also for (potentially unknown) dependency structure between the DAGs. Results are presented on both simulated data and fMRI data obtained from multiple subjects.
Multi-armed bandit problems (MABPs) are a special type of optimal control problem well suited to model resource allocation under uncertainty in a wide variety of contexts. Since the first publication of the optimal solution of the classic MABP by a dynamic index rule, the bandit literature has quickly diversified and emerged as an active research topic. Across this literature, the use of bandit models to optimally design clinical trials became a typical motivating application, yet little of the resulting theory has ever been used in the actual design and analysis of clinical trials. Motivated by this, we review two MABP decision-theoretic approaches to the optimal allocation of treatments in a clinical trial: the infinite-horizon Bayesian Bernoulli MABP and the finite-horizon variant. These models possess distinct theoretical properties and lead to separate allocation rules in a clinical trial design context. We evaluate their performance compared to other allocation rules, including fixed randomization. Our results indicate that bandit approaches offer significant advantages, in terms of assigning more patients to better treatments, and severe limitations, in terms of their resulting statistical power. We propose a novel bandit-based patient allocation rule that overcomes the issue of low power, thus removing a potential barrier for their use in practice.
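The finite-horizon variant can be illustrated by backward induction (dynamic programming) for a two-armed Bernoulli bandit with Beta(1, 1) priors, where the reward is the expected number of in-trial successes. The horizon and priors below are illustrative, and this is the textbook Bayes-optimal rule rather than the paper's proposed allocation rule.

```python
# Finite-horizon Bayesian Bernoulli bandit solved by backward induction.
from functools import lru_cache

T = 20   # total number of patients (horizon)

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2):
    """Maximum expected number of future successes from this posterior state."""
    if s1 + f1 + s2 + f2 == T:
        return 0.0
    p1 = (s1 + 1) / (s1 + f1 + 2)        # posterior mean of arm 1
    p2 = (s2 + 1) / (s2 + f2 + 2)        # posterior mean of arm 2
    arm1 = p1 * (1 + value(s1 + 1, f1, s2, f2)) + (1 - p1) * value(s1, f1 + 1, s2, f2)
    arm2 = p2 * (1 + value(s1, f1, s2 + 1, f2)) + (1 - p2) * value(s1, f1, s2, f2 + 1)
    return max(arm1, arm2)

def best_arm(s1, f1, s2, f2):
    """Arm chosen by the Bayes-optimal rule in the given posterior state."""
    p1 = (s1 + 1) / (s1 + f1 + 2)
    p2 = (s2 + 1) / (s2 + f2 + 2)
    arm1 = p1 * (1 + value(s1 + 1, f1, s2, f2)) + (1 - p1) * value(s1, f1 + 1, s2, f2)
    arm2 = p2 * (1 + value(s1, f1, s2 + 1, f2)) + (1 - p2) * value(s1, f1, s2, f2 + 1)
    return 1 if arm1 >= arm2 else 2

print("Expected successes under the optimal rule:", round(value(0, 0, 0, 0), 3))
print("First allocation goes to arm:", best_arm(0, 0, 0, 0))
```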
Observational data in medicine arise as a result of the complex interaction between patients and the healthcare system. The sampling process is often highly irregular and itself constitutes an informative process. When using such data to develop prediction models, this phenomenon is often ignored, leading to sub-optimal performance and generalisability of models when practices evolve. We propose a multi-task recurrent neural network which models three clinical presence dimensions -- namely the longitudinal, the inter-observation and the missingness processes -- in parallel to the survival outcome. On a prediction task using MIMIC-III laboratory tests, explicit modelling of these three processes showed improved performance in comparison to state-of-the-art predictive models (C-index at a 1-day horizon: 0.878). More importantly, the proposed approach was more robust to changes in the clinical presence setting, as demonstrated by a performance comparison between patients admitted on weekdays and those admitted at weekends. This analysis demonstrates the importance of studying and leveraging clinical presence to improve performance and create more transportable clinical models.
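A schematic, untrained architecture in the spirit of the model described above is sketched below: a shared recurrent encoder with separate heads for the longitudinal values, the inter-observation gap, the missingness mask and the survival outcome. Layer sizes, head forms and names are assumptions for illustration; this is not the authors' implementation.

```python
# Schematic multi-task recurrent model with clinical-presence heads (illustrative).
import torch
import torch.nn as nn

class ClinicalPresenceRNN(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.head_values = nn.Linear(hidden, n_features)    # next observed values
        self.head_gap = nn.Linear(hidden, 1)                 # time to next observation
        self.head_mask = nn.Linear(hidden, n_features)       # which tests will be measured
        self.head_survival = nn.Linear(hidden, 1)             # risk score for the outcome

    def forward(self, x):
        h, _ = self.encoder(x)                # (batch, time, hidden)
        last = h[:, -1]                       # summary of the observed history
        return {
            "values": self.head_values(h),
            "gap": torch.relu(self.head_gap(h)),
            "mask": torch.sigmoid(self.head_mask(h)),
            "risk": self.head_survival(last),
        }

# Example forward pass on a dummy batch of 8 patients, 24 time steps, 10 tests
model = ClinicalPresenceRNN(n_features=10)
out = model(torch.randn(8, 24, 10))
print({k: tuple(v.shape) for k, v in out.items()})
```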
Working memory (WM) was one of the first cognitive processes studied with functional magnetic resonance imaging. With over 20 years of studies on WM, each typically with a small sample size, there is a need for meta-analysis to identify the brain regions that are consistently activated by WM tasks, and to understand the inter-study variation in those activations. However, current methods in the field cannot fully account for the spatial nature of neuroimaging meta-analysis data or the heterogeneity observed among WM studies. In this work, we propose a fully Bayesian random-effects meta-regression model based on log-Gaussian Cox processes, which can be used for meta-analysis of neuroimaging studies. An efficient Markov chain Monte Carlo scheme for posterior simulation is presented, which makes use of recent advances in parallel computing using graphics processing units. Application of the proposed model to a real data set provides valuable insights regarding the function of WM.
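The log-Gaussian Cox process building block can be illustrated by a small one-dimensional simulation, discretised on a grid: the log intensity is a Gaussian process, and counts are Poisson given that intensity. The kernel, mean level and grid are illustrative, and this is not the full random-effects meta-regression model.

```python
# Discretised 1D log-Gaussian Cox process simulation (illustrative).
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(0, 10, 200)
delta = grid[1] - grid[0]

# Gaussian process for the log intensity with a squared-exponential kernel
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 1.0 ** 2)
log_lam = 1.0 + np.linalg.cholesky(K + 1e-6 * np.eye(len(grid))) @ rng.normal(size=len(grid))
intensity = np.exp(log_lam)

# Given the intensity, counts in each grid cell are independent Poisson
counts = rng.poisson(intensity * delta)
print("Total points simulated:", counts.sum())
```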
Reconstruction of a high-dimensional network may benefit substantially from the inclusion of prior knowledge on the network topology. In the case of gene interaction networks such knowledge may come for instance from pathway repositories like KEGG, or be inferred from data of a pilot study. The Bayesian framework provides a natural means of including such prior knowledge. Based on a Bayesian Simultaneous Equation Model, we develop an appealing empirical Bayes procedure which automatically assesses the relevance of the prior knowledge used. We use a variational Bayes method to approximate the posterior densities and compare its accuracy with that of a Gibbs sampling strategy. Our method is computationally fast and can outperform known competitors. In a simulation study we show that prior data can greatly improve the reconstruction of the network when accurate, and need not harm it when wrong. We demonstrate the benefits of the method in an analysis of gene expression data from GEO. In particular, the edges of the recovered network show superior reproducibility over resampled versions of the data compared to those recovered by competitors.
We consider the problem of estimating the finite population mean $\bar{Y}$ of an outcome variable $Y$ using data from a nonprobability sample and auxiliary information from a probability sample. Existing double robust (DR) estimators of this mean $\bar{Y}$ require the estimation of two nuisance functions: the conditional probability of selection into the nonprobability sample given covariates $X$ that are observed in both samples, and the conditional expectation of $Y$ given $X$. These nuisance functions can be estimated using parametric models, but the resulting estimator of $\bar{Y}$ will typically be biased if both parametric models are misspecified. It would therefore be advantageous to be able to use more flexible data-adaptive / machine-learning estimators of the nuisance functions. Here, we develop a general framework for the valid use of DR estimators of $\bar{Y}$ when the design of the probability sample uses sampling without replacement at the first stage and data-adaptive / machine-learning estimators are used for the nuisance functions. We prove that several DR estimators of $\bar{Y}$, including targeted maximum likelihood estimators, are asymptotically normally distributed when the estimators of the nuisance functions converge faster than the $n^{1/4}$ rate and cross-fitting is used. We present a simulation study that demonstrates good performance of these DR estimators compared to the corresponding DR estimators that rely on at least one correctly specified parametric model.
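A generic cross-fitted doubly robust (AIPW) mean estimator with machine-learning nuisances is sketched below. The set-up (a single sample with a missing-at-random outcome) is a simplification for illustration: the paper's estimators combine a nonprobability sample with a probability sample drawn without replacement at the first stage, which is not reproduced here.

```python
# Cross-fitted AIPW estimator of a population mean with ML nuisance estimators.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n = 4000
X = rng.normal(size=(n, 3))
pi = 1 / (1 + np.exp(-(0.5 + X[:, 0] - 0.5 * X[:, 1])))   # selection probability
R = rng.binomial(1, pi)                                    # 1 = outcome observed
Y = 1 + X[:, 0] ** 2 + X[:, 1] + rng.normal(size=n)       # outcome (used only when R = 1)

psi_terms = np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Nuisance 1: selection model pi(X) = P(R = 1 | X), fitted on the training fold
    pi_hat = RandomForestClassifier(n_estimators=200, random_state=0)\
        .fit(X[train], R[train]).predict_proba(X[test])[:, 1]
    # Nuisance 2: outcome model m(X) = E[Y | X, R = 1]
    obs = train[R[train] == 1]
    m_hat = RandomForestRegressor(n_estimators=200, random_state=0)\
        .fit(X[obs], Y[obs]).predict(X[test])
    pi_hat = np.clip(pi_hat, 0.01, 1.0)
    # AIPW contribution evaluated on the held-out fold (cross-fitting)
    psi_terms[test] = m_hat + R[test] / pi_hat * (Y[test] - m_hat)

psi = psi_terms.mean()
se = psi_terms.std(ddof=1) / np.sqrt(n)
print(f"DR estimate: {psi:.3f} (true mean = 2.000), SE: {se:.3f}")
```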
Numerous publications have now addressed the principles of designing, analyzing, and reporting the results of stepped-wedge cluster randomized trials. In contrast, there is little research available pertaining to the design and analysis of multi-arm stepped-wedge cluster randomized trials, which are used to evaluate the effectiveness of multiple experimental interventions. In this paper, we address this by explaining how the required sample size in these multi-arm trials can be ascertained when data are to be analyzed using a linear mixed model. We then describe how the design of such trials can be optimized to balance minimizing the cost of the trial against minimizing some function of the covariance matrix of the treatment effect estimates. Using as an example a recently commenced trial that will evaluate the effectiveness of sensor monitoring in an occupational therapy rehabilitation program for older persons after hip fracture, we demonstrate that our designs could reduce the number of observations required for a fixed power level by up to 58%. Consequently, when logistical constraints permit the utilization of any one of a range of possible multi-arm stepped-wedge cluster randomized trial designs, researchers should consider employing our approach to optimize their trial's efficiency.
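The sample-size logic above can be illustrated with a generalised least squares calculation: given a linear mixed model with known variance components, the covariance matrix of the treatment effect estimators follows from the cluster-period design, and power follows from a normal approximation. The small two-intervention stepped-wedge layout, variance components and effect sizes below are illustrative assumptions, not the paper's trial or optimized designs.

```python
# GLS covariance and approximate power for a toy multi-arm stepped-wedge design.
import numpy as np
from scipy.stats import norm

J, m = 4, 20                 # periods, individuals per cluster-period
tau2, sigma2 = 0.02, 1.0     # cluster-level variance, individual-level variance
# Period (row) by cluster (column) treatment labels: 0 control, 1 arm A, 2 arm B
layout = np.array([[0, 0, 0, 0, 0, 0],
                   [1, 0, 0, 2, 0, 0],
                   [1, 1, 0, 2, 2, 0],
                   [1, 1, 1, 2, 2, 2]])

V = sigma2 / m * np.eye(J) + tau2 * np.ones((J, J))   # cluster-period mean covariance
V_inv = np.linalg.inv(V)

info = np.zeros((J + 2, J + 2))
for c in range(layout.shape[1]):
    Z = np.zeros((J, J + 2))
    Z[:, :J] = np.eye(J)                               # period effects
    Z[:, J] = (layout[:, c] == 1).astype(float)        # arm A indicator
    Z[:, J + 1] = (layout[:, c] == 2).astype(float)    # arm B indicator
    info += Z.T @ V_inv @ Z                            # accumulate GLS information

cov_effects = np.linalg.inv(info)[J:, J:]              # covariance of (theta_A, theta_B)
effect, alpha = 0.2, 0.05
se = np.sqrt(np.diag(cov_effects))
power = norm.cdf(effect / se - norm.ppf(1 - alpha / 2))
print("SEs:", np.round(se, 3), "power:", np.round(power, 3))
```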
Biclustering has gained interest in gene expression data analysis due to its ability to identify groups of samples that exhibit similar behaviour in specific subsets of genes (or vice versa), in contrast to traditional clustering methods that classify samples based on all genes. Despite advances, biclustering remains a challenging problem, even with cutting-edge methodologies. This paper introduces an extension of the recently proposed Spike-and-Slab Lasso Biclustering (SSLB) algorithm, termed Outcome-Guided SSLB (OG-SSLB), aimed at enhancing the identification of biclusters in gene expression analysis. Our proposed approach integrates disease outcomes into the biclustering framework through Bayesian profile regression. By leveraging additional clinical information, OG-SSLB improves the interpretability and relevance of the resulting biclusters. Comprehensive simulations and numerical experiments demonstrate that OG-SSLB achieves superior performance, with improved accuracy in estimating the number of clusters and higher consensus scores compared to the original SSLB method. Furthermore, OG-SSLB effectively identifies meaningful patterns and associations between gene expression profiles and disease states. These promising results demonstrate the effectiveness of OG-SSLB in advancing biclustering techniques, providing a powerful tool for uncovering biologically relevant insights. The OG-SSLB software is available as an R/C++ package at this https URL.