Scientific interest in microbiomes (communities of microscopic organisms in a given environment) has expanded recently, driven by a growing understanding of the role of microbiomes in human and environmental health and by the decreasing cost of high-throughput sequencing. This expansion has created a need for statistical methods tailored to microbiome datasets for both exploration and inference.
In the first project, we present a visualization method to investigate the similarities and differences between the evolutionary histories of genes within microbial genomes. Evolutionary histories are represented by phylogenetic trees, which are complex graph objects, and comparing them is made harder by the large number of genes and genomes in typical comparisons of microbial genomes. We use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in low-dimensional Euclidean space, and we address important practical limitations of existing related approaches. This allows us to identify genes with typical and with outlying evolutionary histories, which facilitates hypothesis generation about which genes may have evolved differently.
In the second project, we consider a model for differential abundance of microbial categories (taxa, genes, etc.) in the presence of unknown sample-specific and category-specific detection effects. Many differential abundance methods address detection effects by transforming the observed data, often making ad hoc decisions that result in parameters that are difficult to interpret. In contrast, we include detection effects in our model as nuisance parameters, removing the need for data transformation and yielding clearly interpretable parameters. Specifically, we use estimating equations from a partially identified penalized Poisson likelihood to estimate fold differences in abundance across covariate levels, relative to the typical fold difference for the observed biological categories. We maximize our likelihood via coordinate descent and show that estimation scales to the size of modern microbiome datasets. We conclude with a discussion of ongoing work in statistical inference for this model and related open problems.
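To make the idea of treating detection effects as nuisance parameters concrete, the following is a minimal sketch under simplifying assumptions, not the estimator developed in this work. It assumes a single binary covariate (two groups), omits the penalty, and takes the median as the "typical" fold difference. The function name `fit_two_group_poisson` and the symbols `W`, `group`, `z`, and `theta` are hypothetical. The assumed model is log E[W_ij] = z_i + theta[g_i, j], where z_i absorbs the sample-specific detection effect and the group-0 row of theta absorbs the category-specific one. Because adding a constant to every z_i in a group and subtracting it from that group's row of theta leaves the likelihood unchanged, the log fold differences are identified only up to a common shift, which the sketch resolves by centering at the median.

```python
import numpy as np

def fit_two_group_poisson(W, group, n_iter=50):
    """Block coordinate ascent for log E[W_ij] = z_i + theta[group_i, j].

    W: (n, J) array of counts; group: length-n array of 0/1 labels.
    Assumes every sample total and every group-by-category total is positive
    (the unpenalized updates below are undefined otherwise).
    Returns log fold differences (group 1 vs. group 0), centered at their median.
    """
    W = np.asarray(W, dtype=float)
    group = np.asarray(group)
    n, J = W.shape
    theta = np.zeros((2, J))
    row_tot = W.sum(axis=1)
    for _ in range(n_iter):
        # Sample effects: closed-form maximizer of the Poisson likelihood given theta.
        z = np.log(row_tot) - np.log(np.exp(theta[group]).sum(axis=1))
        # Category effects: closed-form maximizer given z, one group at a time.
        for g in (0, 1):
            in_g = group == g
            theta[g] = np.log(W[in_g].sum(axis=0)) - np.log(np.exp(z[in_g]).sum())
    beta = theta[1] - theta[0]
    # Resolve the shift non-identifiability: report relative to the median.
    return beta - np.median(beta)

# Noise-free illustration with made-up values (not data from this work).
group = np.array([0, 0, 0, 1, 1, 1])
s = np.array([1.0, 5.0, 0.2, 3.0, 0.5, 2.0])      # sample-specific detection
base = np.array([10.0, 200.0, 3.0, 50.0, 8.0])    # abundance times category detection
f = np.array([1.0, 1.0, 1.0, 4.0, 0.5])           # true fold differences
W = s[:, None] * base[None, :] * f[None, :] ** group[:, None]
beta_hat = fit_two_group_poisson(W, group)
```

In this noise-free example the detection effects `s` and `base` never need to be estimated separately or removed by transforming `W`; they cancel out of the centered estimates, which depend only on `f`. With a binary covariate each block update has a closed form, so the iteration reaches its fixed point almost immediately; a general design matrix would instead require iterative updates for each category's coefficients.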