
Missing data is a fundamental issue in statistics, arising in nearly every field of empirical research. It can occur for various reasons, such as nonresponse in surveys and dropout in longitudinal studies. Failing to account for missingness can bias estimates, reduce efficiency, and complicate interpretation. As datasets grow larger and more complex in structure, the challenge of handling missing data appropriately has only grown more urgent. This dissertation addresses estimation and inference in the presence of missing data.

In the first project, we study problems with multiple missing covariates and partially observed responses. We develop a new framework to handle complex missing covariate scenarios via inverse probability weighting, regression adjustment, and a multiply-robust procedure. We apply our framework to three classical problems: the Cox model from survival analysis, missing response, and binary treatment from causal inference. In these scenarios, we outline a multistage estimation procedure and develop associated identification, asymptotic, and efficiency theories. Our approach is studied via simulations and applied to an Alzheimer's disease data set.
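As a point of reference for the three estimation strategies named above, the following is a minimal sketch for the simplest case only: estimating the mean of a response that is missing at random given one fully observed covariate. It is not the multistage procedure developed in the dissertation, and it uses the textbook doubly robust (augmented IPW) estimator as a stand-in for the multiply-robust idea. All names (`fit_logistic`, `estimate_mean`), the simulated data-generating process, and the model choices are assumptions made purely for illustration.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (r - p)                         # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

def estimate_mean(x, y, r):
    """Estimate E[Y] when y is NaN wherever r == 0 (missing at random given x)."""
    X = np.column_stack([np.ones_like(x), x])
    obs = r == 1
    # Propensity model: P(R = 1 | x), fitted on all units.
    pi = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, r)))
    # Outcome model: linear regression of y on x, fitted on complete cases.
    gamma, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    m = X @ gamma
    y0 = np.where(obs, y, 0.0)                # placeholder; zeroed out by r below
    ipw = np.mean(r * y0 / pi)                # inverse probability weighting
    ra = np.mean(m)                           # regression adjustment
    aipw = np.mean(m + r * (y0 - m) / pi)     # doubly robust combination
    return ipw, ra, aipw

# Hypothetical simulated data: the true mean of Y is 1.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
r = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + x))))
y_obs = np.where(r == 1, y, np.nan)

cc = np.nanmean(y_obs)                        # naive complete-case mean
ipw, ra, aipw = estimate_mean(x, y_obs, r)
print(f"complete-case={cc:.3f}  IPW={ipw:.3f}  RA={ra:.3f}  AIPW={aipw:.3f}")
```

In this simulated setting the response is more likely to be observed when the covariate is large, so the complete-case mean is biased upward, while the three adjusted estimators target the true mean. The augmented estimator remains consistent if either the propensity model or the outcome model is correctly specified, which is the basic robustness property that multiply-robust procedures generalize.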

In the second project, we focus on the problem of modeling multivariate bounded discrete data. In dementia studies, such data is collected when individuals complete neuropsychological tests. We outline a modeling and inference procedure for the joint distribution of the outcomes conditional on baseline covariates, leveraging previous work on mixtures of experts and latent class models. Furthermore, we illustrate how this approach can be extended to outcome data that is missing at random by using a nested EM algorithm. The proposed model can incorporate covariate information and perform imputation and clustering. We apply our model to simulated data and an Alzheimer’s disease data set.

In the third project, we explore the problem of modeling nonmonotone missing data when the data may be missing not at random. Modeling the full data distribution and developing imputation models that are convenient for sampling can be especially challenging when the data is multivariate. In this project, we analyze a specific class of missing not at random (MNAR) nonparametric identifying assumptions called tree graphs, extending the work of Chen (2022). We introduce the idea of a conjugate odds family, in which certain parametric models on the selection odds preserve the data distribution family across all missing data patterns. Under a conjugate odds family and a tree graph assumption, the full data distribution can be modeled in a tractable form. We illustrate our approach using simulated and real datasets, encompassing both multivariate continuous and multivariate discrete data.