These notes describe our experience running a student seminar on average-case complexity in statistical inference using the jigsaw learning format at ETH Zurich in Fall 2024. Jigsaw learning is an active learning technique in which students work in groups on independent parts of a task and then regroup to combine the parts. We implemented this technique for the proofs of various recent research developments, combined with a presentation by one of the students at the beginning of each session. We describe our experience with and thoughts on this format applied in a student research seminar, including, but not limited to, higher engagement, more accessible student talks, and increased student participation in discussions. In the Appendix, we include all the exercise sheets for the topic, which may be of independent interest for courses on statistical-to-computational gaps and average-case complexity.
This research quantifies how linear preconditioning impacts the condition number in Markov Chain Monte Carlo, demonstrating its effectiveness for target distributions with additive or multiplicative Hessian structures and showing it can provably accelerate Random Walk Metropolis. Crucially, it finds that diagonal preconditioning can sometimes increase the condition number and degrade sampler performance, while optimal preconditioning reduces HMC computational cost by a factor of approximately \sqrt{\kappa}.
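To make the comparison above concrete, here is a minimal numerical sketch contrasting no preconditioning, diagonal (Jacobi) preconditioning, and full Cholesky preconditioning of a Gaussian target's Hessian; the matrix is an arbitrary illustrative choice, not taken from the paper.

```python
import numpy as np

# Hessian of a Gaussian target with unequal scales and correlation
# (an arbitrary illustrative choice, not from the paper).
H = np.array([[100.0, 9.0, 0.0],
              [  9.0, 1.0, 0.5],
              [  0.0, 0.5, 2.0]])

def kappa(A):
    """Condition number of a symmetric positive definite matrix."""
    w = np.linalg.eigvalsh(A)
    return w[-1] / w[0]

k_none = kappa(H)

# Diagonal (Jacobi) preconditioning: D^{-1/2} H D^{-1/2} with D = diag(H).
d = np.sqrt(np.diag(H))
k_diag = kappa(H / np.outer(d, d))

# Full linear preconditioning with the Cholesky factor L of H:
# L^{-1} H L^{-T} = I, so the condition number of a Gaussian target becomes 1.
L = np.linalg.cholesky(H)
k_full = kappa(np.linalg.solve(L, np.linalg.solve(L, H).T).T)

print(k_none, k_diag, k_full)
```

Diagonal scaling can move the condition number in either direction depending on the off-diagonal structure, which is precisely the caveat highlighted above.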
Researchers at the University of Toronto and UHN developed the Smith-Pittman algorithm to detect interpretable collaboration networks among oncologists by analyzing patient movement between oncology clinical trials. Applied to simulated data, the algorithm identified eight communities, demonstrating a clear "social partitioning gradient" based on intervention popularity, offering more intuitive insights compared to traditional community detection methods.
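The sketch below is only a generic stand-in for intuition: it runs networkx's greedy modularity community detection on a simulated collaboration network, and is not an implementation of the Smith-Pittman algorithm.

```python
# Generic illustration of community detection on a simulated network;
# NOT the Smith-Pittman algorithm.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Planted-partition graph standing in for a patient-movement network:
# three artificial "collaboration clusters" of trials/oncologists.
G = nx.planted_partition_graph(3, 10, p_in=0.4, p_out=0.02, seed=0)

communities = greedy_modularity_communities(G)
for i, members in enumerate(communities):
    print(f"community {i}: {sorted(members)}")
```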
What is the number of rolls of a fair 6-sided die until the first time the total sum of all rolls is a prime? We compute the expectation and the variance of this random variable up to an additive error of less than 10^{-4}. This is a solution to a puzzle posed by DasGupta (2017) in the Bulletin of the Institute of Mathematical Statistics, for which the published solution is incomplete. The proof is simple, combining a basic dynamic programming algorithm with a quick Matlab computation and basic facts about the distribution of primes.
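A minimal sketch of the dynamic-programming computation, reimplemented in plain Python rather than Matlab: the state is the probability of still running with a given (non-prime) total after k rolls, and the reported total stopping probability makes the truncation error visible.

```python
# Dynamic program for the dice/prime puzzle (a stdlib reimplementation,
# not the authors' Matlab code).
MAX_ROLLS = 200                    # survival beyond this is astronomically unlikely
MAX_SUM = 6 * MAX_ROLLS

# Sieve of Eratosthenes up to the largest reachable total.
is_prime = [True] * (MAX_SUM + 1)
is_prime[0] = is_prime[1] = False
for p in range(2, int(MAX_SUM ** 0.5) + 1):
    if is_prime[p]:
        is_prime[p * p :: p] = [False] * len(is_prime[p * p :: p])

alive = {0: 1.0}                   # running total -> probability, no prime hit yet
mean, second_moment, stopped = 0.0, 0.0, 0.0
for k in range(1, MAX_ROLLS + 1):
    new_alive, p_stop = {}, 0.0
    for s, p in alive.items():
        for d in range(1, 7):
            t = s + d
            if is_prime[t]:
                p_stop += p / 6
            else:
                new_alive[t] = new_alive.get(t, 0.0) + p / 6
    mean += k * p_stop
    second_moment += k * k * p_stop
    stopped += p_stop
    alive = new_alive

variance = second_moment - mean ** 2
# 1 - stopped shows how much probability mass the truncation leaves out.
print(f"captured mass {stopped:.12f}, E[N] ~ {mean:.6f}, Var[N] ~ {variance:.6f}")
```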
Uncertainty Quantification (UQ) is an essential step in computational model validation because assessment of model accuracy requires a concrete, quantifiable measure of uncertainty in the model predictions. In the nuclear community, UQ generally means forward UQ (FUQ), in which information flows from the inputs to the outputs. Inverse UQ (IUQ), in which information flows from the model outputs and experimental data back to the inputs, is an equally important component of UQ but has been significantly underrated until recently. FUQ requires knowledge of the input uncertainties, which have traditionally been specified by expert opinion or user self-evaluation; IUQ is the process of inversely quantifying the input uncertainties based on experimental data. This review paper aims to provide a comprehensive and comparative discussion of the major aspects of the IUQ methodologies that have been applied to the physical models in system thermal-hydraulics codes. IUQ methods can be categorized into three main groups: frequentist (deterministic), Bayesian (probabilistic), and empirical (design-of-experiments). We use eight metrics to evaluate an IUQ method: solidity, complexity, accessibility, independence, flexibility, comprehensiveness, transparency, and tractability. Twelve IUQ methods are reviewed, compared, and evaluated based on these eight metrics. This comparative evaluation provides guidance for users to select a proper IUQ method for the IUQ problem under investigation.
We present a step-by-step mathematical derivation of the Kalman filter using two different approaches. First, we consider the orthogonal projection method by means of vector-space optimization. Second, we derive the Kalman filter using Bayesian optimal filtering. We provide detailed proofs for both methods, expanding each equation in full.
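For reference, both derivations lead to the same standard predict/update recursions; the sketch below is a generic NumPy implementation with an assumed toy constant-velocity model, not code from the paper.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of the standard Kalman filter recursions."""
    # Predict: propagate the state estimate and covariance through the linear model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: fold in the measurement z via the Kalman gain.
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy 1-D constant-velocity model with assumed noise levels.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
H = np.array([[1.0, 0.0]])              # we observe position only
Q = 1e-3 * np.eye(2)                    # process noise covariance
R = np.array([[0.5]])                   # measurement noise covariance
x, P = np.zeros(2), np.eye(2)
for z in [1.1, 2.0, 2.9, 4.2]:
    x, P = kalman_step(x, P, np.array([z]), F, H, Q, R)
print("state estimate:", x)
```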
This is a contribution to the discussion of "A Gibbs sampler for a class of random convex polytopes" by Pierre E. Jacob, Ruobin Gong, Paul T. Edlefsen, and Arthur P. Dempster, to appear in the Journal of the American Statistical Association.
We demonstrate Castor, a cloud-based system for contextual IoT time series data and model management at scale. Castor is designed to assist data scientists in (a) exploring and retrieving all relevant time series and contextual information required for their predictive modelling tasks; (b) seamlessly storing and deploying their predictive models in a cloud production environment; and (c) monitoring the performance of all predictive models in production and (semi-)automatically retraining them in case of performance deterioration. The main features of Castor are: (1) an efficient pipeline for ingesting IoT time series data in real time; (2) a scalable, hybrid data management service for both time series and contextual data; (3) a versatile semantic model for contextual information which can easily be adapted to different application domains; (4) an abstract framework for developing and storing predictive models in R or Python; and (5) deployment services which automatically train and/or score predictive models upon user-defined conditions. We demonstrate Castor for a real-world Smart Grid use case and discuss how it can be adapted to other application domains such as Smart Buildings, Telecommunication, Retail, or Manufacturing.
Nassim Nicholas Taleb presents a comprehensive synthesis on fat-tailed distributions, illustrating their profound implications for real-world phenomena and challenging the applicability of standard statistical methods. The work advocates for a revised framework to address extreme events in various domains, from finance to social sciences.
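One standard way to see why fat tails undermine routine statistical practice is to watch the sample mean; the sketch below (an illustration, not taken from the book) compares running means of Gaussian draws with draws from a Pareto distribution with tail index 1.5, whose variance is infinite.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Thin-tailed baseline: the sample mean of standard normal draws stabilizes quickly.
gauss = rng.standard_normal(n)

# Fat-tailed case: Pareto with minimum 1 and tail index 1.5 (finite mean 3,
# infinite variance), so the sample mean is dominated by rare extreme draws.
alpha = 1.5
pareto = 1 + rng.pareto(alpha, n)

def running_mean(x):
    return np.cumsum(x) / np.arange(1, len(x) + 1)

print("normal running mean at n=1e3 and n=2e5:", running_mean(gauss)[[999, -1]])
print("pareto running mean at n=1e3 and n=2e5:", running_mean(pareto)[[999, -1]])
```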
Conventional wisdom holds that the indefinite integral of the probability density function of the standard normal distribution cannot be expressed in finitely many elementary terms. While this is true, there is an expression for this antiderivative in infinitely many elementary terms that, when differentiated, directly yields the standard normal density function. We derive this function using repeated (infinite) partial integration and review its relation to the cumulative distribution function of the standard normal distribution and the error function.
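One well-known series of this kind writes the antiderivative as 1/2 + phi(x) (x + x^3/3 + x^5/(3*5) + x^7/(3*5*7) + ...); the sketch below, assuming this matches the form derived in the paper, checks partial sums of the series against the erf-based cumulative distribution function.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi_series(x, terms=80):
    """Antiderivative via Phi(x) = 1/2 + phi(x) * sum_{n>=0} x^(2n+1) / (2n+1)!!."""
    total, term = 0.0, x
    for n in range(terms):
        total += term
        term *= x * x / (2 * n + 3)   # next term: multiply by x^2 / (2n+3)
    return 0.5 + phi(x) * total

def Phi_erf(x):
    """Reference value via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(x, Phi_series(x), Phi_erf(x))
```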
An international collaboration led by Yushmanov et al. developed two new empirical scaling laws, ITER89-P and ITER89-OL, for L-mode energy confinement in tokamaks based on a comprehensive global database. These scalings provided critical, internationally agreed-upon predictions for the International Thermonuclear Experimental Reactor (ITER) design, estimating its L-mode confinement time at approximately 2.0 seconds.
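For reference, the ITER89-P scaling is commonly quoted in the fusion literature in the engineering-parameter form \tau_E^{ITER89-P} = 0.048\, I_p^{0.85} R^{1.2} a^{0.3} \kappa^{0.5} \bar{n}_{20}^{0.1} B_t^{0.2} M^{0.5} P^{-0.5} seconds, with plasma current I_p in MA, major and minor radii R and a in m, elongation \kappa, line-averaged density \bar{n}_{20} in units of 10^{20} m^{-3}, toroidal field B_t in T, ion mass number M, and total heating power P in MW. These exponents are reproduced from standard secondary sources rather than from the paper itself, so they should be checked against the original.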
Machine learning and statistical modeling methods were used to analyze the impact of climate change on the financial wellbeing of fruit farmers in Tunisia and Chile. The analysis was based on face-to-face interviews with 801 farmers. Three research questions were investigated: first, whether climate change impacts had an effect on how well the farm was doing financially; second, if climate change was not influential, what factors were important for predicting the financial wellbeing of the farm; and third, whether observed effects on the financial wellbeing of the farm were a result of interactions between predictor variables. This is the first report directly comparing climate change with other factors potentially impacting the financial wellbeing of farms. Certain climate change factors, namely increases in temperature and reductions in precipitation, can regionally impact the self-perceived financial wellbeing of fruit farmers. Specifically, increases in temperature and reductions in precipitation can have a measurable negative impact on the financial wellbeing of farms in Chile; this effect is less pronounced in Tunisia. Climate impact differences were observed within Chile but not within Tunisia. However, climate change is only of minor importance for predicting farm financial wellbeing, especially for farms already doing well financially. Factors that were more important, mainly in Tunisia, included trust in information sources and prior farm ownership. Other important factors include farm size, the water management systems used, and the diversity of fruit crops grown. Moreover, some of the important factors identified differed between farms doing and not doing well financially. Interactions between factors may improve or worsen farm financial wellbeing.
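The abstract does not specify the exact models used, so the sketch below is only a generic illustration of this kind of variable-importance analysis, with hypothetical survey features and a synthetic outcome.

```python
# Generic sketch: rank survey predictors of self-reported financial wellbeing
# with a random forest. Feature names and data are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 801
X = pd.DataFrame({
    "temperature_increase": rng.normal(size=n),
    "precipitation_change": rng.normal(size=n),
    "farm_size_ha": rng.lognormal(2.0, 1.0, size=n),
    "trust_in_info_sources": rng.integers(1, 6, size=n),
    "prior_farm_ownership": rng.integers(0, 2, size=n),
})
# Synthetic binary "doing well financially" outcome, just to make the sketch runnable.
y = (X["trust_in_info_sources"] + 0.5 * X["prior_farm_ownership"]
     - 0.3 * X["temperature_increase"] + rng.normal(size=n) > 3).astype(int)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```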
Data science is a discipline that provides principles, methodology, and guidelines for the analysis of data to obtain tools, value, or insights. Driven by a huge workforce demand, many academic institutions have started to offer degrees in data science, mostly at the graduate level and a few at the undergraduate level. Curricula may differ across institutions because of varying levels of faculty expertise and the different disciplines (such as mathematics, computer science, and business) involved in developing the curriculum. The University of Massachusetts Dartmouth started offering degree programs in data science in Fall 2015, at both the undergraduate and graduate levels. Quite a few articles have been published that deal with graduate data science courses, far fewer with undergraduate ones. Our discussion focuses on undergraduate course structure and function, and specifically on a first course in data science. Our design of this course centers around a concept called the data science life cycle: we view the tasks or steps in the practice of data science as forming a process, consisting of stages that indicate how a project comes into being and how different tasks in data science depend on or interact with one another until the birth of a data product or the reaching of a conclusion. Naturally, the different pieces of the data science life cycle then form individual parts of the course, and the details of each piece are filled in with concepts, techniques, and skills that are popular in industry. Consequently, the design of our course is both "principled" and practical. A significant feature of our course philosophy is that, in line with activity theory, the course is based on the use of tools to transform real data in order to answer strongly motivated questions related to the data.
Existing studies analyzing electromagnetic field exposure (EMFE) in wireless networks have primarily considered downlink communications. In the uplink, the EMFE caused by the user's smartphone is usually the only source of radiation considered, thereby ignoring the contributions of other active neighboring devices. In addition, network coverage and EMFE are typically analyzed independently for both the uplink and the downlink, whereas a joint analysis is necessary to fully understand network performance and answer various questions related to optimal network deployment. This paper bridges these gaps by presenting an enhanced stochastic geometry framework that includes the above aspects. The proposed topology features base stations (BSs) modeled via a homogeneous Poisson point process. The users active during the same time slot are distributed according to a mixture of a Matérn cluster process and a Gauss-Poisson process, featuring groups of users possibly carrying several devices. We derive the marginal and meta distributions of the downlink and uplink EMFE and characterize the uplink-to-downlink EMFE ratio. Moreover, we derive joint probability metrics considering the uplink and downlink coverage and EMFE. These metrics are evaluated in four scenarios considering BS, cluster, and/or intra-cluster densification. Our numerical results highlight the existence of optimal node densities maximizing these joint probabilities.
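A minimal Monte Carlo sketch of the downlink side of such a model is given below; it uses a homogeneous Poisson point process of base stations with illustrative density, power, and path-loss values, and does not reproduce the paper's Matérn-cluster/Gauss-Poisson user model or its analytical meta distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 5e-6          # BS density per m^2 (about 5 BSs per km^2, illustrative)
radius = 5_000.0    # simulation disc radius in m
p_tx = 1.0          # transmit power (arbitrary units)
alpha = 3.5         # path-loss exponent (illustrative)
trials = 2_000

exposures = np.empty(trials)
for t in range(trials):
    n_bs = rng.poisson(lam * np.pi * radius ** 2)
    r = radius * np.sqrt(rng.random(n_bs))     # radii of points uniform in the disc
    r = np.maximum(r, 1.0)                     # avoid the path-loss singularity at 0
    exposures[t] = np.sum(p_tx * r ** -alpha)  # aggregate downlink power at the origin

print("mean exposure:", exposures.mean())
print("95th percentile:", np.quantile(exposures, 0.95))
```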
It is often asserted that to control for the effects of confounders, one should include the confounding variables of concern in a statistical model as covariates. Conversely, it is also asserted that control can only be achieved by design, where the results of an analysis can be interpreted as evidence of an effect only because the design controlled for the cause; to suggest otherwise is said to be a fallacy of cum hoc ergo propter hoc. Obviously, these two assertions create a conundrum: how can the effect of a confounder be controlled for with analysis instead of by design without committing cum hoc ergo propter hoc? The present manuscript resolves this conundrum.
We develop approximate estimation methods for exponential random graph models (ERGMs), whose likelihood is proportional to an intractable normalizing constant. The usual approach approximates this constant with Monte Carlo simulations; however, convergence may be exponentially slow. We propose a deterministic method based on a variational mean-field approximation of the ERGM's normalizing constant. Adapting nonlinear large deviations results, we compute lower and upper bounds on the approximation error for any network size, which translate into bounds on the distance between the true likelihood and the mean-field likelihood. Monte Carlo simulations suggest that in practice our deterministic method performs better than our conservative theoretical approximation bounds imply, for a large class of models.
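As a concrete but simplified illustration of the idea, the sketch below computes the naive homogeneous mean-field lower bound on the log normalizing constant of an edge-triangle ERGM, using the Gibbs variational principle with an i.i.d.-edge trial measure; this is not necessarily the exact construction or the error bounds developed in the paper.

```python
# Naive mean-field lower bound on log Z for an ERGM with edge and triangle terms.
from math import comb, log
from scipy.optimize import minimize_scalar

def mean_field_log_Z(n, beta_edge, beta_tri):
    m2, m3 = comb(n, 2), comb(n, 3)

    def negative_free_energy(mu):
        # Under i.i.d. edges with probability mu: E[#edges] = m2 * mu,
        # E[#triangles] = m3 * mu^3, entropy = m2 * H(mu).
        entropy = -(mu * log(mu) + (1 - mu) * log(1 - mu))
        value = beta_edge * m2 * mu + beta_tri * m3 * mu ** 3 + m2 * entropy
        return -value

    res = minimize_scalar(negative_free_energy, bounds=(1e-9, 1 - 1e-9), method="bounded")
    return -res.fun, res.x

bound, mu_star = mean_field_log_Z(n=50, beta_edge=-1.0, beta_tri=0.02)
print(f"mean-field lower bound on log Z: {bound:.2f} at edge probability {mu_star:.3f}")
```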
The research literature on cybersecurity incident detection and response is very rich in automatic detection methodologies, in particular those based on the anomaly detection paradigm. However, very little attention has been devoted to the diagnostic ability of these methods, namely providing useful information on the causes of a given detected anomaly. This information is of utmost importance for the security team in order to reduce the time from detection to response. In this paper, we present Multivariate Big Data Analysis (MBDA), a complete intrusion detection approach based on five steps to effectively handle massive amounts of disparate data sources. The approach has been designed to deal with the main characteristics of Big Data: high volume, velocity, and variety. The core of the approach is the Multivariate Statistical Network Monitoring (MSNM) technique proposed in a recent paper. Unlike state-of-the-art machine learning methodologies applied to the intrusion detection problem, when MBDA identifies an anomaly the output of the system includes the raw log entries associated with that anomaly, so that the security team can use this information to elucidate its root causes. MBDA builds on two open-source software packages available on GitHub: the MEDA Toolbox and the FCParser. We illustrate our approach with two case studies. The first demonstrates the application of MBDA to semi-structured sources of information, using the data from the VAST 2012 mini challenge 2; this complete case study is supplied in a virtual machine available for download. In the second case study we show the Big Data capabilities of the approach on data collected from a real network with labeled attacks.
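The sketch below illustrates the kind of PCA-based multivariate monitoring (Hotelling's T^2 together with the Q/SPE residual statistic) that underlies MSNM-style detection; it is a generic scikit-learn illustration, not the MEDA Toolbox or FCParser API.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20))   # feature counts per time window (placeholder data)
X_test = rng.normal(size=(50, 20))
X_test[-1] += 8                        # inject one obvious anomaly

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=5).fit(scaler.transform(X_train))

def t2_and_q(X):
    """Hotelling's T^2 and squared prediction error (Q) per observation."""
    Z = scaler.transform(X)
    scores = pca.transform(Z)
    t2 = np.sum(scores ** 2 / pca.explained_variance_, axis=1)
    q = np.sum((Z - pca.inverse_transform(scores)) ** 2, axis=1)
    return t2, q

t2_train, q_train = t2_and_q(X_train)
t2_lim, q_lim = np.percentile(t2_train, 99), np.percentile(q_train, 99)

t2_test, q_test = t2_and_q(X_test)
flagged = np.where((t2_test > t2_lim) | (q_test > q_lim))[0]
print("flagged windows:", flagged)   # these windows point back to their raw log entries
```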
Statistics has moved beyond the frequentist-Bayesian controversies of the past. Where does this leave our ability to interpret results? I suggest that a philosophy compatible with statistical practice, labeled here statistical pragmatism, serves as a foundation for inference. Statistical pragmatism is inclusive and emphasizes the assumptions that connect statistical models with observed data. I argue that introductory courses often mischaracterize the process of statistical inference and I propose an alternative "big picture" depiction.
We develop the framework of transitional conditional independence. To this end, we introduce transition probability spaces and transitional random variables. These constructions generalize, strengthen, and unify previous notions of (conditional) random variables and non-stochastic variables, (extended) stochastic conditional independence, and some forms of functional conditional independence. Transitional conditional independence is asymmetric in general, and we show that it satisfies all desired relevance relations in terms of left and right versions of the separoid rules, except symmetry, on standard, analytic, and universal measurable spaces. As a preparation, we prove a disintegration theorem for transition probabilities, i.e. the existence and essential uniqueness of (regular) conditional Markov kernels, on those spaces. Transitional conditional independence can express classical statistical concepts like sufficiency, adequacy, and ancillarity. As an application, we then show how transitional conditional independence can be used to prove a directed global Markov property for causal graphical models that allow for non-stochastic input variables in strong generality. This also allows us to establish the main rules of causal/do-calculus, relating observational and interventional distributions, in such measure-theoretic generality.
Background: The stochastic behavior of patient arrivals at an emergency department (ED) complicates the management of an ED. More than 50% of hospital EDs tend to operate beyond their normal capacity and eventually fail to deliver high-quality care. To address the concern of stochastic ED arrivals, much research has been done using yearly, monthly, and weekly time-series forecasting. Aim: Our research team believes that hourly time-series forecasting of the load can improve ED management by predicting the arrivals of future patients, and thus can support strategic decisions in terms of quality enhancement. Methods: Our research does not involve any human subjects, only ED admission data from January 2014 to August 2017 retrieved from the UnityPoint Health database. Autoregressive integrated moving average (ARIMA), Holt-Winters, TBATS, and neural network methods were implemented to forecast hourly ED patient arrivals. Findings: ARIMA (3,0,0)(2,1,0) was selected as the best-fit model with the minimum Akaike information criterion and Schwarz Bayesian criterion. The model was stationary and passed the Box-Ljung correlation test and the Jarque-Bera test for normality. The mean error (ME) and root mean square error (RMSE) were selected as performance measures; an ME of 1.001 and an RMSE of 1.55 were obtained. Conclusions: ARIMA can be used to provide hourly forecasts for ED arrivals and can be utilized as a decision support system in the healthcare industry. Application: This technique can be implemented in hospitals worldwide to predict ED patient arrivals.
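An illustrative fit of a seasonal ARIMA(3,0,0)(2,1,0) model to hourly counts with statsmodels is sketched below; the 24-hour seasonal period and the synthetic Poisson data are assumptions, since the abstract does not state them.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
hours = pd.date_range("2014-01-01", periods=24 * 120, freq="h")
# Synthetic hourly arrivals with a daily cycle, standing in for the ED data.
lam = 6 + 3 * np.sin(2 * np.pi * hours.hour / 24)
y = pd.Series(rng.poisson(lam), index=hours)

# Seasonal ARIMA(3,0,0)(2,1,0) with an assumed 24-hour season.
model = SARIMAX(y, order=(3, 0, 0), seasonal_order=(2, 1, 0, 24))
fit = model.fit(disp=False)

forecast = fit.forecast(steps=24)            # next 24 hourly arrival forecasts
rmse = np.sqrt(np.mean(fit.resid ** 2))      # in-sample RMSE as a rough check
print(forecast.head())
print("in-sample RMSE:", rmse)
```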