Methodology
We study estimation of the conditional law $P(Y \mid X = \mathbf{x})$ and continuous functionals $\Psi(P(Y \mid X = \mathbf{x}))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs and establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design assess finite-sample performance and quantify the statistical price of ignoring the survey design. The broad applicability of SDRF is demonstrated using NHANES: we estimate tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.
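As a rough, hypothetical illustration of the split criterion described above (a sketch, not the authors' implementation), the design-weighted MMD between two candidate child nodes can be computed from Hájek-normalized survey weights and a Gaussian kernel; the array names `y_left`, `w_left`, `y_right`, `w_right` and the bandwidth choice are assumptions for illustration.

```python
import numpy as np

def _as_matrix(y):
    y = np.asarray(y, dtype=float)
    return y.reshape(-1, 1) if y.ndim == 1 else y

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and b."""
    a, b = _as_matrix(a), _as_matrix(b)
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def weighted_mmd2(y_left, w_left, y_right, w_right, bandwidth=1.0):
    """Squared MMD between the Hajek-weighted (weight-normalized) empirical
    distributions of two candidate child nodes."""
    wl = np.asarray(w_left, float); wl = wl / wl.sum()
    wr = np.asarray(w_right, float); wr = wr / wr.sum()
    k_ll = gaussian_kernel(y_left, y_left, bandwidth)
    k_rr = gaussian_kernel(y_right, y_right, bandwidth)
    k_lr = gaussian_kernel(y_left, y_right, bandwidth)
    return wl @ k_ll @ wl + wr @ k_rr @ wr - 2.0 * wl @ k_lr @ wr
```

A candidate split would then be scored by this quantity (possibly rescaled by the child node sizes), and the split with the largest criterion retained.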
We seek to conduct statistical inference for a large collection of primary parameters, each with its own nuisance parameters. Our approach is partially Bayesian, in that we treat the primary parameters as fixed while we model the nuisance parameters as random draws from an unknown distribution, which we endow with a nonparametric prior. We compute partially Bayes p-values by conditioning on nuisance parameter statistics, that is, statistics that are ancillary for the primary parameters and informative about the nuisance parameters. The proposed p-values have a Bayesian interpretation as tail areas computed with respect to the posterior distribution of the nuisance parameters. Like the conditional predictive p-values of Bayarri and Berger, the partially Bayes p-values avoid double use of the data (unlike posterior predictive p-values). A key ingredient of our approach is that we model nuisance parameters hierarchically across problems; the sharing of information across problems leads to improved calibration. We illustrate the proposed partially Bayes p-values in two applications: the normal means problem with unknown variances and a location-scale model with unknown distribution shape. We model the scales via Dirichlet processes in both examples and the distribution shape via Pólya trees in the second. Our proposed partially Bayes p-values improve power and calibration relative to purely frequentist alternatives.
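A minimal sketch of the flavor of these p-values in the first application (normal means with unknown variances), with a grid-based nonparametric prior standing in for the Dirichlet-process model of the paper; the helper names and the EM-style fitting step are illustrative assumptions, not the authors' algorithm.

```python
# Setup: z_i ~ N(theta_i, sigma_i^2) and an independent variance statistic
# s2_i with nu * s2_i / sigma_i^2 ~ chi^2_nu.  The scale prior is approximated
# on a grid and fitted by sharing information across all problems.
import numpy as np
from scipy import stats

def fit_grid_prior(s2, nu, grid, iters=200):
    """Crude nonparametric MLE for the prior on sigma^2 over a grid, via EM."""
    s2, grid = np.asarray(s2, float), np.asarray(grid, float)
    lik = np.array([stats.chi2.pdf(nu * s2 / g, df=nu) * nu / g for g in grid]).T  # (m, K)
    pi = np.full(len(grid), 1.0 / len(grid))
    for _ in range(iters):
        post = lik * pi
        post /= post.sum(axis=1, keepdims=True)
        pi = post.mean(axis=0)
    return pi

def partially_bayes_pvalues(z, s2, nu, grid):
    """p_i = E[ 2 * Phi(-|z_i| / sigma) | s2_i ]: a posterior-averaged two-sided
    tail area for testing theta_i = 0, conditioning on the variance statistic only."""
    z, s2, grid = np.asarray(z, float), np.asarray(s2, float), np.asarray(grid, float)
    pi = fit_grid_prior(s2, nu, grid)
    lik = np.array([stats.chi2.pdf(nu * s2 / g, df=nu) * nu / g for g in grid]).T
    post = lik * pi
    post /= post.sum(axis=1, keepdims=True)              # posterior of sigma^2 given s2_i
    tails = 2.0 * stats.norm.sf(np.abs(z)[:, None] / np.sqrt(grid)[None, :])
    return (post * tails).sum(axis=1)
```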
To address the computational issues of empirical likelihood methods with massive data, this paper proposes a grouped empirical likelihood (GEL) method. It divides the $N$ observations into $n$ groups and assigns the same probability weight to all observations within the same group. GEL estimates the $n\ (\ll N)$ weights by maximizing the empirical likelihood ratio. The dimensionality of the optimization problem is thus reduced from $N$ to $n$, thereby lowering the computational complexity. We prove that GEL possesses the same first-order asymptotic properties as the conventional empirical likelihood method in estimating equation settings and the classical two-sample mean problem. A distributed GEL method is also proposed for settings with several servers. Numerical simulations and a real data analysis demonstrate that GEL retains the inferential accuracy of the conventional empirical likelihood method and achieves substantial computational acceleration compared to the divide-and-conquer empirical likelihood method. With GEL, a billion observations can be analyzed in tens of seconds on a single PC.
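The grouping idea can be sketched for a scalar mean as follows; this is one plausible formulation for illustration, not necessarily the exact GEL estimator of the paper, and the grouping rule and Newton solver are assumptions.

```python
import numpy as np

def grouped_el_logratio(x, theta, n_groups):
    """-2 log empirical likelihood ratio for H0: E[X] = theta, with one shared
    weight per group, profiled over a single Lagrange multiplier."""
    x = np.asarray(x, float)
    groups = np.array_split(x, n_groups)                 # grouping rule is a modeling choice
    m = np.array([len(g) for g in groups], float)        # group sizes
    gbar = np.array([g.mean() - theta for g in groups])  # group-averaged estimating function

    lam = 0.0
    for _ in range(50):                                  # Newton steps for the dual problem
        denom = 1.0 + lam * gbar
        score = np.sum(m * gbar / denom)
        hess = -np.sum(m * gbar ** 2 / denom ** 2)
        step = score / hess
        lam -= step
        if abs(step) < 1e-10:
            break
    return 2.0 * np.sum(m * np.log1p(lam * gbar))

# Only n group-level quantities enter the optimization, so each evaluation
# costs O(n) rather than O(N) once the group summaries are formed.
```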
Accurately quantifying uncertainty in individual treatment effects (ITEs) across multiple decision points is crucial for personalized decision-making in fields such as healthcare, finance, education, and online marketplaces. Previous work has focused on predicting non-causal longitudinal estimands or on constructing prediction bands for ITEs from cross-sectional data under exchangeability assumptions. We propose a novel method for constructing prediction intervals for time-varying ITEs using conformal inference techniques under weaker assumptions than the prior literature. We establish a coverage lower bound that depends on the degree of non-exchangeability in the data. Although our method is broadly applicable across decision-making contexts, we support our theoretical claims with simulations emulating micro-randomized trials (MRTs) -- a sequential experimental design for mobile health (mHealth) studies. We demonstrate the practical utility of our method by applying it to a real-world MRT, the Intern Health Study (IHS).
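A hedged sketch of the general flavor, in the spirit of weighted (non-exchangeable) split conformal prediction; the geometric decay weights and the function name are illustrative assumptions rather than the authors' construction for time-varying ITEs.

```python
import numpy as np

def weighted_conformal_interval(pred_cal, y_cal, pred_new, alpha=0.1, decay=0.99):
    """Interval pred_new +/- q, where q is a weighted (1 - alpha) quantile of the
    absolute calibration residuals; more recent calibration points get larger weight."""
    resid = np.abs(np.asarray(y_cal, float) - np.asarray(pred_cal, float))
    T = len(resid)
    w = decay ** np.arange(T - 1, -1, -1)        # most recent point weighted most
    w_full = np.append(w, 1.0)                   # nominal weight for the test point
    w_full = w_full / w_full.sum()
    order = np.argsort(resid)
    cum = np.cumsum(w_full[:-1][order])
    idx = np.searchsorted(cum, 1.0 - alpha)      # first residual reaching mass 1 - alpha
    q = resid[order][idx] if idx < T else np.inf
    return pred_new - q, pred_new + q
```

In analyses of this style, realized coverage is bounded below by 1 - alpha minus a term that grows with the weighted distributional drift between calibration and target points, which mirrors the abstract's dependence on the degree of non-exchangeability.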
We propose $\phi$-test, a global feature-selection and significance procedure for black-box predictors that combines Shapley attributions with selective inference. Given a trained model and an evaluation dataset, $\phi$-test performs SHAP-guided screening and fits a linear surrogate on the screened features via a selection rule with a tractable selective-inference form. For each retained feature, it outputs a Shapley-based global score, a surrogate coefficient, and post-selection $p$-values and confidence intervals in a global feature-importance table. Experiments on real tabular regression tasks with tree-based and neural backbones suggest that $\phi$-test can retain much of the predictive ability of the original model while using only a few features and producing feature sets that remain fairly stable across resamples and backbone classes. In these settings, $\phi$-test acts as a practical global explanation layer linking Shapley-based importance summaries with classical statistical inference.
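A hedged sketch of the pipeline's shape using the `shap` and `statsmodels` packages; here plain sample splitting stands in for the paper's tractable selective-inference form (so the p-values are valid simply because screening and inference use disjoint data), and `model`, `X`, `y`, and `k` are placeholders.

```python
import numpy as np
import shap
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

def phi_test_style_table(model, X, y, k=5, seed=0):
    """SHAP-guided screening on one half, linear surrogate + classical inference on the other."""
    X_scr, X_inf, y_scr, y_inf = train_test_split(X, y, test_size=0.5, random_state=seed)
    # 1) global Shapley importance on the screening half
    #    (tree models work directly; other backbones may need a masker/background data)
    phi = shap.Explainer(model)(X_scr).values             # (n, p) attributions
    importance = np.abs(phi).mean(axis=0)
    keep = np.argsort(importance)[::-1][:k]               # top-k screened features
    # 2) linear surrogate with ordinary post-split inference on the held-out half
    design = sm.add_constant(np.asarray(X_inf)[:, keep])
    fit = sm.OLS(np.asarray(y_inf), design).fit()
    return {
        "features": keep,
        "shap_score": importance[keep],
        "coef": fit.params[1:],
        "p_value": fit.pvalues[1:],
        "conf_int": fit.conf_int()[1:],
    }
```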
Nonstationarity in spatial and spatio-temporal processes is ubiquitous in environmental datasets, but is not often addressed in practice, due to a scarcity of statistical software packages that implement nonstationary models. In this article, we introduce the R software package deepspat, which allows for modeling, fitting and prediction with nonstationary spatial and spatio-temporal models applied to Gaussian and extremes data. The nonstationary models in our package are constructed using a deep multi-layered deformation of the original spatial or spatio-temporal domain, and are straightforward to implement. Model parameters are estimated using gradient-based optimization of customized loss functions with tensorflow, which implements automatic differentiation. The functionalities of the package are illustrated through simulation studies and an application to Nepal temperature data.
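The modeling idea behind such deformations can be sketched in a few lines (deliberately not the deepspat API): warp the coordinates through one or more simple layers and evaluate a stationary covariance on the warped coordinates, which induces nonstationarity on the original domain. The layer form and parameter names below are illustrative assumptions.

```python
import numpy as np

def radial_warp(s, center, amplitude, scale=1.0):
    """One elemental warping layer: pull 2-D locations toward/away from a center."""
    d = s - center
    r2 = np.sum(d ** 2, axis=1, keepdims=True)
    return s + amplitude * d * np.exp(-scale * r2)

def warped_covariance(s1, s2, layers, var=1.0, length=1.0):
    """Stationary exponential covariance evaluated on the deformed coordinates."""
    for layer in layers:
        s1, s2 = layer(s1), layer(s2)
    d = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)
    return var * np.exp(-d / length)

# Example: one warping layer centred at (0.5, 0.5) applied to a small grid.
grid = np.array([[x, y] for x in np.linspace(0, 1, 5) for y in np.linspace(0, 1, 5)])
layers = [lambda s: radial_warp(s, np.array([0.5, 0.5]), amplitude=0.3)]
K = warped_covariance(grid, grid, layers)
```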
In this article, we introduce a Topological Data Analysis (TDA) pipeline for neural spike train data. Understanding how the brain transforms sensory information into perception and behavior requires analyzing coordinated neural population activity. Modern electrophysiology enables simultaneous recording of spike train ensembles, but extracting meaningful information from these datasets remains a central challenge in neuroscience. A fundamental question is how ensembles of neurons discriminate between different stimuli or behavioral states, particularly when individual neurons exhibit weak or no stimulus selectivity, yet their coordinated activity may still contribute to network-level encoding. We describe a TDA framework that identifies stimulus-discriminative structure in spike train ensembles recorded from the mouse insular cortex during presentation of deionized water stimuli at distinct non-nociceptive temperatures. We show that population-level topological signatures effectively differentiate oral thermal stimuli even when individual neurons provide little or no discrimination. These findings demonstrate that ensemble organization can carry perceptually relevant information that standard single-unit analysis may miss. The framework builds on a mathematical representation of spike train ensembles that enables persistent homology to be applied to collections of point processes. At its core is the widely used Victor-Purpura (VP) distance. Using this metric, we construct persistence-based descriptors that capture multiscale topological features of ensemble geometry. Two key theoretical results support the method: a stability theorem establishing robustness of persistent homology to perturbations in the VP metric parameter, and a probabilistic stability theorem ensuring robustness of topological signatures.
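At the core of the pipeline is the Victor-Purpura distance, which is simple to state: deleting or inserting a spike costs 1 and moving a spike by dt costs q|dt|. The dynamic-programming sketch below computes it for two spike-time arrays; downstream, pairwise VP distances would be supplied to a persistent-homology routine as a precomputed metric (the variable names and the final example are illustrative).

```python
import numpy as np

def victor_purpura(s, t, q=1.0):
    """VP distance between spike-time arrays s and t at temporal cost q."""
    s, t = np.sort(s), np.sort(t)
    n, m = len(s), len(t)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1)        # delete all spikes of s
    D[0, :] = np.arange(m + 1)        # insert all spikes of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + 1.0,                         # delete s[i-1]
                          D[i, j - 1] + 1.0,                         # insert t[j-1]
                          D[i - 1, j - 1] + q * abs(s[i - 1] - t[j - 1]))  # shift
    return D[n, m]

# At q = 1, shifting a spike by 0.3 s is cheaper than a delete plus an insert.
print(victor_purpura([0.1, 0.5], [0.1, 0.8], q=1.0))   # 0.3
```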
Traffic accidents result in millions of injuries and fatalities globally, with a significant number occurring at intersections each year. Traffic Signal Control (TSC) is an effective strategy for enhancing safety at these urban junctures. Despite the growing popularity of Reinforcement Learning (RL) methods for optimizing TSC, these methods often prioritize driving efficiency over safety, failing to address the critical balance between the two; in addition, they typically lack interpretability. CounterFactual (CF) learning is a promising approach in many areas of causal analysis. In this study, we introduce a novel framework to improve the safety aspects of RL for TSC. This framework introduces a novel method based on CF learning to address the question: ``When an unsafe event occurs, if we backtrack and perform alternative actions, will the unsafe event still occur in the subsequent period?'' To answer this question, we propose a new structural causal model to predict the result of executing different actions, and a new CF module that integrates with additional ``X'' modules to promote safe RL practices. Our new algorithm, CFLight, which is derived from this framework, effectively handles challenging safety events and significantly improves safety at intersections through a near-zero collision control strategy. Through extensive numerical experiments on both real-world and synthetic datasets, we demonstrate that CFLight reduces collisions and improves overall traffic performance compared to conventional RL methods and a recent safe RL model. Moreover, our method provides a general and safe framework for RL, opening possibilities for applications in other domains. The data and code are available on GitHub: this https URL.
There has been significant progress in Bayesian inference based on sparsity-inducing (e.g., spike-and-slab and horseshoe-type) priors for high-dimensional regression models. The resulting posteriors, however, in general do not possess desirable frequentist properties, and the credible sets thus cannot serve as valid confidence sets even asymptotically. We introduce a novel debiasing approach that corrects the bias for the entire Bayesian posterior distribution. We establish a new Bernstein-von Mises theorem that guarantees the frequentist validity of the debiased posterior. We demonstrate the practical performance of our proposal through Monte Carlo simulations and two empirical applications in economics.
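One way to picture what debiasing an entire posterior can look like is to apply a debiased-lasso-style correction to every posterior draw, as sketched below; this is a heavily hedged illustration with a crude regularized precision estimate, not necessarily the paper's construction.

```python
import numpy as np

def debias_posterior_draws(draws, X, y, ridge=0.1):
    """draws: (S, p) posterior samples of the regression coefficients.
    Returns corrected draws beta + Theta_hat X'(y - X beta) / n for each draw."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    Theta_hat = np.linalg.inv(Sigma_hat + ridge * np.eye(p))   # crude precision estimate
    resid = y[None, :] - draws @ X.T                            # (S, n) residuals per draw
    return draws + (resid @ X / n) @ Theta_hat.T

# Credible intervals for coordinate j are then read off the corrected draws,
# e.g. np.percentile(debiased[:, j], [2.5, 97.5]).
```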
The Weibull distribution is a commonly adopted choice for modeling the survival of systems subject to maintenance over time. When only proxy indicators and censored observations are available, it becomes necessary to express the distribution's parameters as functions of time-dependent covariates. Deep neural networks provide the flexibility needed to learn complex relationships between these covariates and operational lifetime, thereby extending the capabilities of traditional regression-based models. Motivated by the analysis of a fleet of military vehicles operating in highly variable and demanding environments, as well as by the limitations observed in existing methodologies, this paper introduces WTNN, a new neural network-based modeling framework specifically designed for Weibull survival studies. The architecture incorporates qualitative prior knowledge about the most influential covariates in a manner consistent with the shape and structure of the Weibull distribution. Through numerical experiments, we show that this approach can be reliably trained on proxy and right-censored data and produces robust, interpretable survival predictions that improve upon existing approaches.
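A minimal sketch of the core ingredients of such a model (not the exact WTNN architecture): a small network maps covariates to a Weibull shape and scale, and training minimizes the right-censored negative log-likelihood. The layer sizes and clamping constants are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeibullNet(nn.Module):
    def __init__(self, p, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(p, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, x):
        out = nn.functional.softplus(self.body(x)) + 1e-4   # positive shape and scale
        return out[:, 0], out[:, 1]                          # k(x), lam(x)

def censored_weibull_nll(k, lam, t, event):
    """Negative log-likelihood; event = 1 for observed failure, 0 for right-censored."""
    z = (t / lam).clamp(min=1e-12)
    log_f = torch.log(k) - torch.log(lam) + (k - 1.0) * torch.log(z) - z ** k  # log density
    log_s = -(z ** k)                                                           # log survival
    return -(event * log_f + (1.0 - event) * log_s).mean()

# Training sketch: minimize censored_weibull_nll(*model(x), t, event) with any
# torch optimizer over mini-batches of (covariates, time, event indicator).
```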
This research introduces Empirical Decision Theory, a framework that enables principled decision-making based solely on observed act-consequence pairs, bypassing the need for explicit "states of the world." The approach provides statistical inferential guarantees, including robustness against data contamination, and is demonstrated in evaluating generative AI prompting strategies.
Extreme weather events are widely studied in fields such as agriculture, ecology, and meteorology. The spatio-temporal co-occurrence of extreme events can strengthen or weaken under changing climate conditions. In this paper, we propose a novel approach to modeling spatio-temporal extremes by integrating climate indices via a conditional variational autoencoder (cXVAE). A convolutional neural network (CNN) is embedded in the decoder to convolve climatological indices with the spatial dependence in the latent space, thereby allowing the decoder to depend on the climate variables. This work makes three main contributions. First, we demonstrate through extensive simulations that the proposed conditional XVAE accurately emulates spatial fields and recovers spatially and temporally varying extremal dependence at very low computational cost after training. Second, we provide a simple, scalable approach to detecting condition-driven shifts and to assessing whether the dependence structure is invariant to the conditioning variable. Third, when dependence is found to be condition-sensitive, the conditional XVAE supports counterfactual experiments that intervene on the climate covariate and propagate the associated change through the learned decoder to quantify differences in joint tail risk, co-occurrence ranges, and return metrics. To demonstrate the practical utility and performance of the model in real-world scenarios, we apply our method to the monthly maximum Fire Weather Index (FWI) over eastern Australia from 2014 to 2024, conditioned on the El Niño/Southern Oscillation (ENSO) index.
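The conditioning mechanism can be sketched minimally (a hypothetical illustration, not the cXVAE itself): the scalar climate index is broadcast to an extra channel and convolved jointly with the latent field, so the decoded spatial dependence can vary with the index. The channel sizes and layer choices below are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    def __init__(self, latent_channels=8, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels + 1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, z, climate_index):
        # z: (batch, latent_channels, H, W); climate_index: (batch,)
        b, _, h, w = z.shape
        c = climate_index.view(b, 1, 1, 1).expand(b, 1, h, w)   # broadcast index over space
        return self.net(torch.cat([z, c], dim=1))                # decoded spatial field
```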
Watermarking has recently emerged as a crucial tool for protecting the intellectual property of generative models and for distinguishing AI-generated content from human-generated data. Despite its practical success, most existing watermarking schemes are empirically driven and lack a theoretical understanding of the fundamental trade-off between detection power and generation fidelity. To address this gap, we formulate watermarking as a statistical hypothesis testing problem between a null distribution and its watermarked counterpart. Under explicit constraints on false-positive and false-negative rates, we derive a tight lower bound on the achievable fidelity loss, measured by a general $f$-divergence, and characterize the optimal watermarked distribution that attains this bound. We further develop a corresponding sampling rule that provides an optimal mechanism for inserting watermarks with minimal fidelity distortion. Our result establishes a simple yet broadly applicable principle linking hypothesis testing, information divergence, and watermark generation.
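For intuition, the simplest special case of such a bound is worth spelling out: when fidelity loss is measured by total variation (itself an $f$-divergence), the classical testing inequality already forces a minimum distortion. The display below is this standard special case, not the paper's general result.

```latex
% Total-variation special case: any watermark detector, viewed as a test of the
% unwatermarked distribution P against the watermarked distribution Q with
% false-positive rate \alpha and false-negative rate \beta, satisfies
\[
  \alpha + \beta \;\ge\; 1 - \mathrm{TV}(P, Q)
  \qquad\Longrightarrow\qquad
  \mathrm{TV}(P, Q) \;\ge\; 1 - \alpha - \beta ,
\]
% so reliable detection (small \alpha and \beta) requires the watermarked
% distribution to move away from P by at least 1 - \alpha - \beta in total
% variation.  The paper's lower bound generalizes this trade-off to arbitrary
% f-divergences and identifies the watermarked distribution attaining it.
```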
We address the challenge of causal inference on exposure status and dose-response effects with a semi-continuous exposure. A two-stage approach is proposed using estimating equations for multiple outcomes, with large-sample properties derived for the resulting estimators. Homogeneity tests are developed to assess whether the causal effects of exposure status and the dose-response effects are the same across multiple outcomes. A global homogeneity test is also developed to assess whether the effect of exposure status (exposed/not exposed) and the dose-response effect of the continuous exposure level are each equal across all outcomes. The methods of estimation and testing are rigorously evaluated in simulation studies and applied to a motivating study of the effects of prenatal alcohol exposure on childhood cognition defined by executive function (EF), academic achievement in math, and learning and memory (LM).
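The homogeneity-testing step can be illustrated with a generic Wald-type test; the contrast construction below is standard, while the joint covariance would, in the paper's setting, come from the two-stage estimating-equation procedure.

```python
import numpy as np
from scipy import stats

def homogeneity_test(beta_hat, cov):
    """Test H0: beta_1 = ... = beta_K given effect estimates for K outcomes and
    their joint covariance.  Returns the chi-square statistic and p-value."""
    beta_hat = np.asarray(beta_hat, float)
    K = len(beta_hat)
    C = np.hstack([np.ones((K - 1, 1)), -np.eye(K - 1)])   # contrasts beta_1 - beta_k
    diff = C @ beta_hat
    stat = float(diff @ np.linalg.solve(C @ cov @ C.T, diff))
    return stat, stats.chi2.sf(stat, df=K - 1)
```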
Decisions based upon pairwise comparisons of multiple treatments are naturally made in terms of the mean survival of the selected study arms or functions thereof. However, synthesis of treatment comparisons is usually performed on surrogates of the mean survival, such as hazard ratios or restricted mean survival times. Network meta-analysis techniques may therefore suffer from the limitations of these approaches, such as an incorrect proportional hazards assumption or short follow-up periods. We propose a Bayesian framework for network meta-analysis of the main outcome informing the decision, the mean survival of a treatment. Its derivation involves extrapolation of the observed survival curves. We use methods for stable extrapolation that integrate long-term evidence based upon mortality projections. Extrapolations are performed using flexible poly-hazard parametric models and M-spline-based methods. We assess the computational and statistical efficiency of different techniques in a simulation study and apply the developed methods to two real data sets. The proposed method is formulated within a decision-theoretic framework for cost-effectiveness analyses, where the `best' treatment is to be selected and the associated cost information is straightforward to incorporate.
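The decision-relevant quantity, mean survival under an extrapolated poly-hazard model, can be sketched numerically; the horizon, step size, and example hazard components below are hypothetical stand-ins for fitted disease-specific and projected background hazards.

```python
import numpy as np

def mean_survival(hazard_components, horizon=100.0, step=0.01):
    """Approximate E[T] = integral of S(t) dt when the total hazard is the sum
    of the supplied component functions of t (a poly-hazard model)."""
    t = np.arange(0.0, horizon, step)
    h = sum(f(t) for f in hazard_components)       # poly-hazard: additive components
    H = np.cumsum(h) * step                         # cumulative hazard (Riemann sum)
    S = np.exp(-H)                                  # implied survival curve
    return float(np.sum(S) * step)                  # mean survival by numerical integration

# Hypothetical example: a Weibull excess hazard plus a constant background
# hazard standing in for projected general-population mortality.
excess = lambda t: 1.3 * 0.08 * (0.08 * t) ** 0.3
background = lambda t: np.full_like(t, 0.01)
print(mean_survival([excess, background]))
```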
Regression analysis for responses taking values in general metric spaces has received increasing attention, particularly for settings with Euclidean predictors $X \in \mathbb{R}^p$ and non-Euclidean responses $Y \in (\mathcal{M}, d)$. While additive regression is a powerful tool for enhancing interpretability and mitigating the curse of dimensionality in the presence of multivariate predictors, its direct extension is hindered by the absence of vector space operations in general metric spaces. We propose a novel framework for additive optimal transport regression, which incorporates additive structure through optimal geodesic transports. A key idea is to extend the notion of optimal transports in Wasserstein spaces to general geodesic metric spaces. This unified approach accommodates a wide range of responses, including probability distributions, symmetric positive definite (SPD) matrices with various metrics and spherical data. The practical utility of the method is illustrated with correlation matrices derived from resting state fMRI brain imaging data.
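The Wasserstein special case that motivates the construction is easy to illustrate for one-dimensional distributions, where the optimal transport map is the quantile-function composition and the geodesic is traced by moving sorted samples part of the way toward their targets; the sketch below covers only this base case, not the general metric-space machinery.

```python
import numpy as np

def geodesic_point(sample_from, sample_to, s):
    """Empirical Wasserstein-2 geodesic at time s in [0, 1] between two 1-D
    samples of equal size: after sorting, x_(i) -> (1 - s) * x_(i) + s * y_(i)."""
    x = np.sort(np.asarray(sample_from, float))
    y = np.sort(np.asarray(sample_to, float))
    return (1.0 - s) * x + s * y        # a sample from the geodesic at time s

# s = 0 recovers the source sample, s = 1 the target; intermediate s traces the
# constant-speed geodesic between their empirical distributions.
```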
In this project we are interested in performing clustering of observations such that the cluster membership is influenced by a set of predictors. To that end, we employ the Bayesian nonparametric Common Atoms Model (CAM), a nested clustering approach that utilizes a (fixed) group membership for each observation to encourage more similar clustering of members of the same group. CAM operates by assuming each group has its own vector of cluster probabilities, which are themselves clustered to allow similar clustering for some groups. We extend this approach by treating the group membership as an unknown latent variable determined by a flexible nonparametric function of the covariate vector. Consequently, observations with similar predictor values will be in the same latent group and are more likely to be clustered together than observations with disparate predictors. We propose a pyramid group model that flexibly partitions the predictor space into these latent group memberships. This pyramid model operates similarly to a Bayesian regression tree process, except that it uses the same splitting rule at all nodes of the same tree depth, which facilitates improved mixing. We outline a block Gibbs sampler to perform posterior inference under our model. Our methodology is demonstrated in simulation and real data examples. In the real data application, we use the RAND Health and Retirement Study to cluster and predict patient outcomes in terms of the number of overnight hospital stays.
Understanding how vaccines perform against different pathogen genotypes is crucial for developing effective prevention strategies, particularly for highly genetically diverse pathogens like HIV. Sieve analysis is a statistical framework used to determine whether a vaccine selectively prevents acquisition of certain genotypes while allowing breakthrough of other genotypes that evade immune responses. Traditionally, these analyses are conducted with a single sequence available per individual acquiring the pathogen. However, modern sequencing technology can provide detailed characterization of intra-individual viral diversity by capturing up to hundreds of pathogen sequences per person. In this work, we introduce methodology that extends sieve analysis to account for intra-individual viral diversity. Our approach estimates vaccine efficacy against viral populations with varying true (unobservable) frequencies of vaccine-mismatched mutations. To account for differential resolution of information from differing sequence counts per person, we use competing risks Cox regression with modeled causes of failure and propose an empirical Bayes approach for the classification model. Simulation studies demonstrate that our approach reduces bias, provides nominal confidence interval coverage, and improves statistical power compared to conventional methods. We apply our method to the HVTN 705 Imbokodo trial, which assessed the efficacy of a heterologous vaccine regimen in preventing HIV-1 acquisition.
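A hypothetical sketch of the empirical Bayes flavor of the classification step: participant-level mismatch frequencies estimated from few sequences are shrunk toward a population-level beta prior fitted across participants. The moment-matching fit and the variable names are illustrative assumptions; the paper's model is richer.

```python
import numpy as np

def eb_mismatch_frequencies(k, m):
    """k, m: arrays of mismatched and total sequence counts per participant.
    Returns shrunken (posterior mean) mismatch frequencies under a beta prior
    fitted crudely by moment matching across participants."""
    k, m = np.asarray(k, float), np.asarray(m, float)
    p_hat = k / m
    mu, var = p_hat.mean(), p_hat.var(ddof=1)
    common = mu * (1.0 - mu) / max(var, 1e-8) - 1.0      # a + b for Beta(a, b)
    a, b = max(mu * common, 1e-3), max((1.0 - mu) * common, 1e-3)
    return (k + a) / (m + a + b)                          # posterior mean frequencies
```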
We consider adaptive experiments for treatment choice and design an experiment that is minimax and Bayes optimal with respect to regret. Given binary treatments, the experimenter's goal is to choose the treatment with the highest expected outcome through an adaptive experiment, in order to maximize welfare. We consider adaptive experiments that consist of two phases, the treatment allocation phase and the treatment choice phase. The experiment starts with the treatment allocation phase, where the experimenter allocates treatments to experimental subjects to gather observations. During this phase, the experimenter can adaptively update the allocation probabilities using the observations obtained so far. After the allocation phase, the experimenter proceeds to the treatment choice phase, where one of the treatments is selected as the best. Within this procedure, we propose splitting the treatment allocation phase into two stages: we first estimate the standard deviations and then allocate each treatment proportionally to its estimated standard deviation. We show that this experiment, often referred to as Neyman allocation, is minimax and Bayes optimal in the sense that its regret upper bounds exactly match the lower bounds that we derive. To show this optimality, we derive minimax and Bayes lower bounds for the regret using change-of-measure arguments. We then evaluate the corresponding upper bounds using the central limit theorem and large deviation bounds.
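The proposed two-stage design is easy to sketch for two arms; the pilot fraction, the sample-mean choice rule, and the function names below are illustrative assumptions rather than the exact procedure analyzed in the paper.

```python
import numpy as np

def neyman_allocation_experiment(draw_outcome, budget, pilot_frac=0.2, rng=None):
    """draw_outcome(arm, size, rng) -> array of outcomes; returns the chosen arm."""
    rng = np.random.default_rng(rng)
    n_pilot = int(pilot_frac * budget) // 2
    data = [list(draw_outcome(a, n_pilot, rng)) for a in (0, 1)]       # stage 1: equal split
    sd = np.array([np.std(d, ddof=1) for d in data])                    # estimated std devs
    remaining = budget - 2 * n_pilot
    n_extra = np.round(remaining * sd / sd.sum()).astype(int)           # stage 2: Neyman split
    for a in (0, 1):
        data[a].extend(draw_outcome(a, n_extra[a], rng))
    return int(np.argmax([np.mean(d) for d in data]))                   # treatment choice

# Example with hypothetical Gaussian arms (arm 1 better but noisier):
chosen = neyman_allocation_experiment(
    lambda a, size, rng: rng.normal([0.0, 0.2][a], [1.0, 2.0][a], size), budget=1000)
```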
The Azadkia-Chatterjee coefficient is a rank-based measure of dependence between a random variable $Y \in \mathbb{R}$ and a random vector ${\boldsymbol Z} \in \mathbb{R}^{d_Z}$. This paper proposes a multivariate extension that measures dependence between random vectors ${\boldsymbol Y} \in \mathbb{R}^{d_Y}$ and ${\boldsymbol Z} \in \mathbb{R}^{d_Z}$, based on $n$ i.i.d. samples. The proposed coefficient converges almost surely to a limit with the following properties: i) it lies in $[0, 1]$; ii) it equals zero if and only if ${\boldsymbol Y}$ and ${\boldsymbol Z}$ are independent; and iii) it equals one if and only if ${\boldsymbol Y}$ is almost surely a function of ${\boldsymbol Z}$. Remarkably, the only assumption required for this convergence is that ${\boldsymbol Y}$ is not almost surely a constant. We further prove that, under the same mild condition, the coefficient is asymptotically normal when ${\boldsymbol Y}$ and ${\boldsymbol Z}$ are independent, and we propose a merge-sort-based algorithm that computes the coefficient in time $O(n (\log n)^{d_Y})$. Finally, we show that it can be used to measure conditional dependence between ${\boldsymbol Y}$ and ${\boldsymbol Z}$ given a third random vector ${\boldsymbol X}$, and we prove that the measure is monotonic with respect to the deviation from an independence distribution under certain model restrictions.
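For reference, the univariate-response base case being extended can be sketched directly from its rank formula (as commonly stated, ignoring ties and duplicate values of $\boldsymbol Z$); the nearest-neighbor step below uses a k-d tree rather than the paper's merge-sort-based algorithm.

```python
import numpy as np
from scipy.spatial import cKDTree

def azadkia_chatterjee(Y, Z):
    """T_n = sum_i (n * min(R_i, R_N(i)) - L_i^2) / sum_i L_i * (n - L_i),
    where R_i = #{j: Y_j <= Y_i}, L_i = #{j: Y_j >= Y_i}, and N(i) is the
    index of the nearest neighbor of Z_i."""
    Y = np.asarray(Y, float)
    Z = np.atleast_2d(np.asarray(Z, float))
    if Z.shape[0] != len(Y):
        Z = Z.T
    n = len(Y)
    R = np.array([np.sum(Y <= y) for y in Y])
    L = np.array([np.sum(Y >= y) for y in Y])
    _, nn = cKDTree(Z).query(Z, k=2)      # column 1 is the nearest neighbor other than self
    N = nn[:, 1]
    num = np.sum(n * np.minimum(R, R[N]) - L ** 2)
    den = np.sum(L * (n - L))
    return num / den
```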