This dissertation is motivated by missing data problems arising in two observational health datasets. The first dataset was created by the SWOG study that linked Medicare claims to a prostate cancer prevention trial dataset. The second is a diabetes electronic health record (EHR) dataset that contains longitudinal measurements of diabetes patients over 11 years.
For the first dataset, we are interested in estimating the long-term effect of a treatment. In a time-to-event setting, Medicare claims are linked to clinical trial data to extend the follow-up period for trial participants. This allows estimation of long-term effects that cannot be estimated from clinical trial data alone. However, such data linkages are often incomplete for various reasons. We formulate incomplete linkage as a missing data problem, with careful consideration of the relationship between the linkage status and the missing data mechanism. We propose a conditional linking at random (CLAR) assumption and an inverse probability of linkage weighting (IPLW) partial likelihood estimator. We show that the IPLW partial likelihood estimator is consistent and asymptotically normal.
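As a rough illustration of the weighting idea, the sketch below evaluates a weighted Cox partial log-likelihood in which each linked subject is weighted by the inverse of its linkage probability and unlinked subjects receive weight zero. This is a generic inverse-probability-weighted partial likelihood under simplifying assumptions (a single covariate, no tied event times, linkage probabilities treated as known); the function name `iplw_partial_loglik` and all argument names are hypothetical, and the estimator developed in this dissertation may differ in form.

```python
import math

def iplw_partial_loglik(beta, times, events, x, linked, link_prob):
    """Weighted Cox partial log-likelihood for one covariate (illustrative only).

    linked[i] is 1 if subject i's records were linked, else 0.
    link_prob[i] is an assumed-known (or previously estimated) linkage probability.
    """
    # Inverse probability of linkage weights; unlinked subjects get weight 0.
    weights = [l / p for l, p in zip(linked, link_prob)]
    n = len(times)
    loglik = 0.0
    for i in range(n):
        if events[i] == 1 and weights[i] > 0:
            # Weighted risk set: subjects whose follow-up time is at least times[i].
            risk = sum(
                weights[j] * math.exp(beta * x[j])
                for j in range(n)
                if times[j] >= times[i]
            )
            loglik += weights[i] * (beta * x[i] - math.log(risk))
    return loglik

# Toy data (made up for illustration): the second subject is unlinked.
value = iplw_partial_loglik(
    beta=0.3,
    times=[1.0, 2.0, 3.0],
    events=[1, 1, 1],
    x=[0.0, 1.0, 0.0],
    linked=[1, 0, 1],
    link_prob=[0.5, 0.5, 0.5],
)
```

In practice the log-likelihood would be maximized over `beta`, and the linkage probabilities would themselves be estimated from covariates, which this sketch does not attempt.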
For the second dataset, the longitudinal measurements for diabetes patients are subject to nonmonotone missingness. The conventional ignorability and missing-at-random (MAR) conditions are unlikely to hold for nonmonotone missing data, and data analysis can be very challenging when few complete cases are available. We introduce the available complete-case missing value (ACCMV) assumption for handling nonmonotone, missing-not-at-random (MNAR) data. The ACCMV assumption is applicable to datasets with a small set of complete observations, and we show that it leads to nonparametric identification of the distribution of the variables of interest. We further propose an inverse probability weighting estimator, a regression adjustment estimator, and a multiply-robust estimator for estimating a parameter of interest. We study the asymptotic and efficiency theory of the proposed estimators. Finally, we illustrate the applicability of our method by applying it to the diabetes EHR dataset.
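To make the inverse probability weighting idea concrete, the following minimal sketch computes a normalized (Hajek-type) weighted mean from the observed values only. It is a generic textbook estimator, not the ACCMV-based estimator proposed here: the function name `ipw_mean` is hypothetical, and the observation probabilities are simply assumed to be given, whereas under ACCMV they would have to be derived from the identifying assumption.

```python
def ipw_mean(y, observed, prob_observed):
    """Normalized inverse probability weighted mean (illustrative only).

    y[i] may be None when observed[i] is 0.
    prob_observed[i] is an assumed-known probability that y[i] is observed.
    """
    numerator = 0.0
    denominator = 0.0
    for yi, ri, pi in zip(y, observed, prob_observed):
        if ri:
            # Each observed value stands in for 1 / pi similar units.
            numerator += yi / pi
            denominator += 1.0 / pi
    return numerator / denominator

# Toy data (made up for illustration): two of four values are missing.
estimate = ipw_mean(
    y=[1.0, None, 3.0, None],
    observed=[1, 0, 1, 0],
    prob_observed=[0.5, 0.5, 0.25, 0.25],
)
```

Normalizing by the sum of the weights rather than by the sample size keeps the estimate within the range of the observed values, at the cost of a small finite-sample bias.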
Finally, we consider the problem of trajectory recovery. Repeated measurements collected from individuals naturally form a long trajectory, and the length of the trajectory creates additional difficulties for modeling and computation. We introduce a block-Markov type assumption to handle such missing data problems. We prove that this assumption leads to nonparametric identification of the joint distribution of the trajectory. Based on this assumption, we are able to decompose trajectories into multiple missing blocks and thus greatly reduce both the computational and the modeling complexity. For modeling purposes, we further propose a model-based assumption, which allows us to use both linear models and flexible machine learning models to impute missing values. We illustrate the applicability of our method by applying it to the diabetes EHR dataset.

For the final exam, we will be focusing on the second dataset.