
This dissertation explores several statistical challenges in cause-of-death (COD) assessment from verbal autopsy (VA) surveys, which are structured interviews with caregivers of the deceased conducted in regions where traditional medical certification is unavailable. Despite the crucial role of VA surveys in mortality surveillance, analysis of VA data is complicated by inconsistent age categorization, respondent burden from lengthy questionnaires, and potential biases in automated classification systems.


The first project develops a Bayesian framework for reconciling inconsistent age categories across multiple VA data sources. We formulate age-disaggregated death counts as fully-classified multinomial data and show that incorporating partially-classified aggregated data can produce an improved Bayes estimator under Kullback-Leibler loss. Under specific theoretical conditions, this approach calibrates data with different age structures to generate unified estimates of standardized age distributions. Through numerical studies and applications to real-world mortality data, we demonstrate the method's effectiveness in imputing incomplete classification and provide guidance on appropriate levels of age disaggregation.
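
As a rough illustration of how partially-classified counts can inform a multinomial age distribution, the following sketch uses Gibbs data augmentation under a symmetric Dirichlet prior and returns the posterior mean, which is the Bayes estimator under one direction of Kullback-Leibler loss. The function name, the prior, the sampler, and the example counts are all assumptions made for illustration; this is not the dissertation's estimator.

```python
import numpy as np

def gibbs_partially_classified(full_counts, agg_counts, groups, alpha=1.0,
                               n_iter=2000, burn_in=500, seed=0):
    """Posterior mean of multinomial cell probabilities (hypothetical helper).

    full_counts : deaths observed in each fine age cell
    agg_counts  : deaths observed only at a coarser level
    groups      : for each aggregated count, the fine cells it spans
    alpha       : symmetric Dirichlet prior parameter (an assumption)
    """
    rng = np.random.default_rng(seed)
    full_counts = np.asarray(full_counts, dtype=float)
    k = len(full_counts)
    theta = np.full(k, 1.0 / k)
    draws = []
    for it in range(n_iter):
        augmented = full_counts.copy()
        # Step 1: allocate each aggregated count to its fine cells,
        # proportionally to the current cell probabilities.
        for count, cells in zip(agg_counts, groups):
            cells = list(cells)
            weights = theta[cells] / theta[cells].sum()
            augmented[cells] += rng.multinomial(count, weights)
        # Step 2: draw cell probabilities given the completed counts.
        theta = rng.dirichlet(alpha + augmented)
        if it >= burn_in:
            draws.append(theta)
    return np.mean(draws, axis=0)

# Made-up example: four fine age cells, plus 50 deaths known only to fall
# in the first two cells combined.
posterior_mean = gibbs_partially_classified(
    full_counts=[10, 20, 30, 40], agg_counts=[50], groups=[[0, 1]]
)
```

In this toy setting the aggregated deaths shift probability mass toward the first two cells while leaving their relative split to be determined by the fully-classified counts and the prior.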


The second project proposes a statistical model for estimating cause-specific mortality with incomplete age information. Using age-mixing proportions within a Bayesian framework, this approach shows that incorporating partially observed age data improves estimation compared to discarding incomplete records. Analysis of demographic survey data from multiple countries reveals that the proposed approach generally yields more accurate cause-specific mortality estimates, with performance advantages varying by the true age distribution of deaths.

The third project adopts Bayesian active questionnaire design to optimize VA data collection processes. Using posterior-weighted Kullback-Leibler information criteria and uncertainty-aware stopping rules, this approach sequentially selects questions to maximize information while minimizing respondent burden. Validation with gold-standard VA data shows comparable classification accuracy using substantially fewer questions, with implications for improved data collection efficiency.
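
The sequential selection idea can be sketched as follows, using one common form of the posterior-weighted Kullback-Leibler (PWKL) index from the adaptive testing literature and a simple posterior-probability threshold as the stopping rule. Whether the dissertation uses exactly this index or this rule is an assumption; the function names, the binary-answer model, and the threshold value are hypothetical.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), elementwise; p and q lie in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def pwkl_scores(posterior, p_yes):
    """PWKL score for every question.

    p_yes[k, j] = P(answer to question j is "yes" | cause k).
    """
    k_hat = int(np.argmax(posterior))            # current best-guess cause
    kl = bernoulli_kl(p_yes[k_hat][None, :], p_yes)  # shape (causes, questions)
    return posterior @ kl                        # weight by posterior over causes

def update(posterior, p_yes, j, answer):
    """Bayes update of the cause posterior after observing answer j."""
    likelihood = p_yes[:, j] if answer else 1 - p_yes[:, j]
    post = posterior * likelihood
    return post / post.sum()

def run_interview(prior, p_yes, answers, threshold=0.9):
    """Ask questions until the top cause exceeds `threshold` (assumed rule)."""
    posterior = np.asarray(prior, dtype=float)
    remaining = set(range(p_yes.shape[1]))
    asked = []
    while remaining and posterior.max() < threshold:
        scores = pwkl_scores(posterior, p_yes)
        j = max(remaining, key=lambda q: scores[q])  # most informative question
        posterior = update(posterior, p_yes, j, answers[j])
        asked.append(j)
        remaining.discard(j)
    return posterior, asked

# Made-up example: two causes, three yes/no questions.
p_yes = np.array([[0.9, 0.5, 0.2],
                  [0.1, 0.5, 0.8]])
posterior, asked = run_interview([0.5, 0.5], p_yes, answers=[1, 0, 0],
                                 threshold=0.85)
```

A question whose answer distribution is identical across causes receives a score of zero and is asked last, if at all, which is the mechanism by which such a design can shorten the interview.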

The final project presents a statistical framework for valid inference using predicted causes from VA narratives. By extending prediction-powered inference to multinomial classification (multiPPI++), we enable unbiased parameter estimation when using natural language processing models for COD classification. Cross-site validation demonstrates effective correction for transportability errors and highlights the distinction between predictive accuracy and inferential validity.
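
To make the rectification idea concrete, the sketch below applies the basic prediction-powered point estimator to cause-specific mortality fractions, with a tuning weight `lam` in the spirit of PPI++. It is a minimal illustration under assumed names and made-up data, not the multiPPI++ procedure itself, and it omits the variance estimation needed for confidence intervals.

```python
import numpy as np

def one_hot(labels, k):
    """Encode integer class labels 0..k-1 as an (n, k) indicator matrix."""
    labels = np.asarray(labels)
    out = np.zeros((len(labels), k))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def ppi_proportions(y_lab, pred_lab, pred_unlab, k, lam=1.0):
    """Prediction-powered estimate of class proportions (hypothetical helper).

    y_lab      : reference causes on the small labeled sample
    pred_lab   : model-predicted causes on the same labeled sample
    pred_unlab : model-predicted causes on the large unlabeled sample
    lam        : weight on the predictions; lam=0 ignores them and
                 lam=1 gives the classical prediction-powered estimator
    """
    y_mean = one_hot(y_lab, k).mean(axis=0)
    f_lab = one_hot(pred_lab, k).mean(axis=0)
    f_unlab = one_hot(pred_unlab, k).mean(axis=0)
    # Labeled-sample mean, adjusted by how the predictions differ between
    # the unlabeled and labeled samples.
    return y_mean + lam * (f_unlab - f_lab)

# Made-up example with three causes.
y_lab = [0, 0, 1, 2]
pred_lab = [0, 1, 1, 2]
pred_unlab = [0, 1, 1, 1, 2, 2, 0, 1]
estimate = ppi_proportions(y_lab, pred_lab, pred_unlab, k=3)
labeled_only = ppi_proportions(y_lab, pred_lab, pred_unlab, k=3, lam=0.0)
```

The estimate always sums to one, although individual components can fall slightly outside the unit interval in small samples. Its validity rests on the labeled sample being representative of the target population, not on the classifier being accurate, which is one way to read the distinction between predictive accuracy and inferential validity.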

Together, these methodological innovations address fundamental challenges in survey-based mortality surveillance, with applications extending beyond cause-of-death assessment to broader problems of inference with incomplete or predicted data.