# DEN

Denny Hall

# BET and BELIEF

We study the problem of distribution-free dependence detection and modeling through the new framework of binary expansion statistics (BEStat). Binary expansion testing (BET) avoids the problem of non-uniform consistency and improves upon a wide class of commonly used methods (a) by achieving the minimax rate in sample size requirement for reliable power and (b) by providing clear interpretations of global relationships upon rejection of independence.

# Tilted-CCA: Quantifying common and distinct information in multi-modal single-cell data via matrix factorization

Recently, multi-modal single-cell data has been growing in popularity in many areas of biomedical research and provides new opportunities to learn how different modalities coordinate within each cell. Many existing dimension reduction methods for such data estimate a low-dimensional embedding that captures all the axes of variation from either modality. While these current methods are useful, we develop the Tilted-CCA in this talk to perform a fundamentally different task.

# Selective Inference: Approaches & Recent Developments

There is growing appreciation of the perils of naively using the same data for model selection and subsequent inference; such “double-dipping” is now frowned upon in many disciplines. Sample splitting has become the de facto solution, but it reflects only one possible solution to the challenge of choosing data-driven hypotheses for subsequent inferential investigation. Indeed, there are some cases, e.g., with dependent data or when using unsupervised methods like clustering, where it is not clear how to appropriately conduct sample splitting.

# Covariate-Adjusted Generalized Factor Analysis with Application to Testing Fairness

In the era of data explosion, psychometricians and statisticians have been developing interpretable and computationally efficient statistical methods to measure latent factors (e.g. skills, abilities, and personalities) using large-scale assessment data.

# Sparse topic modeling via spectral decomposition and thresholding

By modeling documents as mixtures of topics, topic modeling allows the discovery of latent thematic structures within large text corpora, and has played an important role in natural language processing over the past decades. Beyond text data, topic modeling has proven itself central to the analysis of microbiome data, population genetics, or, more recently, single-cell spatial transcriptomics.

# Clip-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments

From clinical trials and public health to development economics and political science, randomized experiments stand out as one of the most reliable methodological tools, as they require the fewest assumptions to estimate causal effects. Adaptive experiment designs – where experimental subjects arrive sequentially and the probability of treatment assignment can depend on previously observed outcomes – are becoming an increasingly popular method for causal inference, as they offer the possibility of improved precision over their non-adaptive counterparts.

# Downscaled Probabilistic Climate Change Projections, with Application to Hot Days

The climate change projections of the Intergovernmental Panel on Climate Change are based on scenarios for future emissions, but these are not statistically based and do not have a full probabilistic interpretation. Instead, Raftery et al. (2017) and Liu and Raftery (2021) developed probabilistic forecasts for global average temperature change to 2100.

# Identification and Estimation of Graphical Continuous Lyapunov Models

Graphical continuous Lyapunov models offer a new perspective on modeling causally interpretable dependence structure in multivariate data by treating each independent observation as a one-time cross-sectional snapshot of a temporal process. Specifically, the models consider multivariate Ornstein-Uhlenbeck processes in equilibrium. This leads to Gaussian models in which the covariance matrix is determined by the continuous Lyapunov equation.
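The last step can be made concrete with a small numerical sketch. It assumes notation that is not in the abstract: a drift matrix `M` with all eigenvalues in the left half-plane, a positive definite volatility matrix `C`, and the equilibrium covariance `Sigma` solving $M\Sigma + \Sigma M^\top + C = 0$. The helper names `solve_linear` and `solve_lyapunov` are hypothetical, and the example matrices are illustrative only.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= f * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def solve_lyapunov(M, C):
    """Return Sigma with M Sigma + Sigma M^T + C = 0 (assumed convention)."""
    n = len(M)
    A = [[0.0] * (n * n) for _ in range(n * n)]
    b = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            row = i * n + j  # equation for entry (i, j)
            for k in range(n):
                A[row][k * n + j] += M[i][k]  # term (M Sigma)[i][j]
                A[row][i * n + k] += M[j][k]  # term (Sigma M^T)[i][j]
            b[row] = -C[i][j]
    flat = solve_linear(A, b)
    return [flat[i * n:(i + 1) * n] for i in range(n)]


# Illustrative (hypothetical) two-variable example with a stable drift matrix.
M = [[-1.0, 0.0], [0.5, -2.0]]
C = [[2.0, 0.0], [0.0, 2.0]]
Sigma = solve_lyapunov(M, C)
```

The vectorized linear system is only practical for a handful of variables; it is meant to show how the drift matrix determines the covariance, not to suggest an estimation procedure.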

# Axiomatization of Interventional Probability Distributions

Causal intervention is an essential tool in causal inference.

# COVID-19 transmission models in the real world: models, data, and policy

Simple mathematical models of COVID-19 transmission gained prominence in the early days of the pandemic. These models provided researchers and policymakers with qualitative insight into the dynamics of transmission and quantitative predictions of disease incidence. More sophisticated models incorporated new information about the natural history of COVID-19 disease and the interaction of infected individuals with the healthcare system, to predict diagnosed cases, hospitalization, ventilator usage, and death.

# Bayesian demography: a brief history, recent applications, and future directions

The use of Bayesian methods in the social sciences has increased rapidly over the past decade, including in the field of demography, where Bayesian methods are used to produce estimates and forecasts of demographic and health indicators across a wide range of populations. In this talk, I will briefly describe the history of use of Bayesian methods in demography, and highlight the strengths of such methods in the context of forecasting, small area estimation, and using non-representative data.

# Coverage of credible intervals under multivariate monotonicity

Shape restrictions such as monotonicity in one or more dimensions sometimes naturally arise. The restriction can be effectively used for function estimation without smoothing. Several exciting results on function estimation under monotonicity, and to a lesser extent, under multivariate monotonicity have been obtained in the frequentist setting. But only a little is known about how Bayesian methods work when there are restrictions on the shape. Chakraborty and Ghosal recently studied the convergence properties of a "projection-posterior" distribution.

# Random fields beyond the null: building models from critical points

Random field theory (RFT) has been used in signal detection in the "massively univariate" linear models of neuroimaging. Such analyses preclude building multivariate models of activity, comparing

# Bootstrap-Assisted Inference for Generalized Grenander-type Estimators

Coauthors: Michael Jansson and Kenichi Nagasawa

# Flexible Hawkes Process Models and Applications

The Hawkes process is a popular type of self-exciting point process that has found application in the modeling of financial stock markets, earthquakes, and social media cascades. Its continuous-time framework, however, requires that the data collected for inference be accurate. For real-time monitors of data, for example in remote sensing or cybersecurity, accurate detection of events is challenging.
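To illustrate what "self-exciting" means, the sketch below evaluates a conditional intensity in which every past event temporarily raises the rate of future events. The exponential kernel and the parameter names `mu`, `alpha`, and `beta` are assumptions made for illustration; the abstract does not specify a kernel.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Intensity at time t: baseline mu plus decaying kicks from past events.

    Assumes an exponential kernel alpha * exp(-beta * (t - s)) for each
    earlier event time s < t.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - s)) for s in event_times if s < t
    )
    return mu + excitation


# Hypothetical event times; the intensity is evaluated shortly after the last one.
events = [1.0, 1.5, 4.0]
rate = hawkes_intensity(4.1, events, mu=0.5, alpha=0.8, beta=1.2)
```

Because the intensity depends on exact event times, a recorded time that is shifted or missed changes every later value of the sum, which is one way to see why inaccurate detection complicates inference.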

# Confounding and dependence in spatial statistics (joint work with Brian Gilbert, Abhi Datta, and Joan Casey)

Recently, addressing “spatial confounding” has become a major topic in spatial statistics. However, the literature has provided conflicting definitions, and many proposed definitions do not address the issue of confounding as it is understood in causal inference.

# Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy - Joint work with Anish Agarwal (Amazon Core AI)

The US Census Bureau will deliberately corrupt data sets derived from the 2020 US Census in an effort to maintain privacy, suggesting a painful trade-off between the privacy of respondents and the precision of economic analysis. To investigate whether this trade-off is inevitable, we formulate a semiparametric model of causal inference with high dimensional corrupted data. We propose a procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals.

# To Adjust or not to Adjust? Estimating the Average Treatment Effect in Randomized Experiments with Missing Covariates

Randomized experiments allow for consistent estimation of the average treatment effect based on the difference in mean outcomes without strong modeling assumptions. Appropriate use of pretreatment covariates can further improve the estimation efficiency. Missingness in covariates is nevertheless common in practice and raises an important question: should we adjust for covariates subject to missingness, and if so, how? The unadjusted difference in means is always unbiased.

# Optimal Subgroup Identification

Quantifying treatment effect heterogeneity is a crucial task in many areas of causal inference, e.g. optimal treatment allocation and estimation of subgroup effects. We study the problem of estimating the level sets of the conditional average treatment effect (CATE), identified under the no-unmeasured-confounders assumption. Given a user-specified threshold, the goal is to estimate the set of all units for whom the treatment effect exceeds that threshold.

# Statistical Methods for Observational Data on Infectious Diseases

Emerging modern datasets in public health call for development of innovative statistical methods that can leverage complex real-world data settings. We first discuss a stochastic epidemic model that incorporates contact tracing data to make inference about transmission dynamics on an adaptive contact network. An efficient data-augmented inference scheme is designed to accommodate partial epidemic observations.

# On statistical inference for sequential decision making

Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. This talk includes a few topics about constructing statistical inference for a policy's value in infinite horizon settings where the number of decision points diverges to infinity. Applications in real world examples will also be discussed.

# Localization schemes and the mixing of hit-and-run

We introduce the localization schemes framework for analyzing the mixing time of Markov chains. Our framework unifies and extends previous proof techniques based on the spectral independence framework of Anari, Liu and Oveis Gharan and on the stochastic localization process used for proving high-dimensional properties of log-concave measures.

# A Robust, Differentially Private Randomized Experiment for Evaluating Online Educational Programs With Sensitive Student Data

Randomized controlled trials (RCTs) have been the gold standard to evaluate the effectiveness of a program, policy, or treatment on an outcome of interest. However, many RCTs assume that study participants are willing to share their (potentially sensitive) data, specifically their response to treatment. This assumption, while seemingly trivial at first, is becoming difficult to satisfy in the modern era, especially in online settings where there are more regulations to protect individuals' data.

# A Negative Correlation Strategy for Bracketing in Difference-in-Differences

The method of difference-in-differences (DID) is widely used to study the causal effect of policy interventions in observational studies. DID employs a before and after comparison of the treated and control units to remove bias due to time-invariant unmeasured confounders under the parallel trends assumption. Estimates from DID, however, will be biased if the outcomes for the treated and control units evolve differently in the absence of treatment, namely if the parallel trends assumption is violated.
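The before-and-after comparison can be written out in a few lines. This is a minimal sketch of the basic two-group, two-period estimator only, not the bracketing strategy of the talk; the function name `did_estimate` and the outcome values are hypothetical.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: treated change minus control change."""
    def mean(values):
        return sum(values) / len(values)

    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # Under parallel trends, the control change stands in for the change the
    # treated group would have experienced without treatment.
    return treated_change - control_change


# Hypothetical outcomes for three treated and three control units.
effect = did_estimate(
    treated_pre=[10, 12, 11],
    treated_post=[15, 17, 16],
    control_pre=[9, 10, 11],
    control_post=[11, 12, 13],
)
print(effect)  # 3.0
```

If the treated group would have changed by something other than the control change of 2 in the absence of treatment, the estimate of 3 absorbs that difference as bias, which is the failure mode the abstract describes.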

# Causal learning with unknown interventions: algorithms, guarantees, and connections to distributional robustness

With observational data alone, causal inference is a challenging problem. The task becomes easier when having access to data collected from perturbations of the underlying system, even when the nature of these is unknown. In this talk, we will describe methods that use such perturbation data to identify plausible causal mechanisms and to obtain robust predictions. Specifically, in the context of Gaussian linear structural equation models, we first characterize the interventional equivalence class of DAGs.

# Double dipping: problems and solutions, with application to single-cell RNA-sequencing data

In contemporary applications, it is common to collect very large data sets with the vaguely-defined goal of *hypothesis generation*. Once a dataset is used to generate a hypothesis, we might wish to *test* that hypothesis on the same set of data. However, this type of "double dipping" violates a cardinal rule of statistical hypothesis testing: namely, that we must decide what hypothesis to test before looking at the data.

# Reliability, Equity, and Reproducibility in Modern Machine Learning

Modern machine learning algorithms have achieved remarkable performance in a myriad of applications, and are increasingly used to make impactful decisions in the hiring process, criminal sentencing, healthcare diagnostics and even to make new scientific discoveries. The use of data-driven algorithms in high-stakes applications is exciting yet alarming: these methods are extremely complex, often brittle, notoriously hard to analyze and interpret.

# Identifying Causal Effects from Observational Data

Scientific research is often concerned with questions of cause and effect. For example, does eating processed meat cause certain types of cancer? Ideally, such questions are answered by randomized controlled experiments. However, these experiments can be costly, time-consuming, unethical or impossible to conduct. Hence, often the only available data to answer causal questions is observational.

# Fréchet Change Point Detection

Change point detection is a popular tool for identifying locations in a data sequence where an abrupt change occurs in the data distribution, and it has been widely studied for Euclidean data. Modern data are very often non-Euclidean, for example distribution-valued data or network data. Change point detection is a challenging problem when the underlying data space is a metric space where one does not have basic algebraic operations like addition of data points and scalar multiplication.

# Nonparametric Mode Estimation via the Log-Concave Shape Constraint

Advisor: Jon Wellner

We consider the problem of forming confidence intervals and tests for the location of the mode in the setting of nonparametric estimation of a log-concave density. We thus study the class of log-concave densities with fixed and known mode. We find the maximum likelihood estimator for this class, give a characterization of it, and, under the null hypothesis, show our estimator is uniformly consistent and is $n^{2/5}$-tight at the mode. We also show uniqueness of the analogous limiting "estimator" of a quadratic function with white noise.

# Importance Sampling Approaches to Missing Data Problems

Advisor: Adrian E. Raftery