Statistical methods for genomic sequencing data
Genomic sequencing data has revolutionized our understanding of the genetic basis of biological processes. The cost of sequencing the first human genome was estimated to be greater than 50 million dollars. However, with the advent of next-generation sequencing, that cost has decreased to a few hundred dollars. It is thus now possible to use sequencing technology to understand nuanced aspects of the cell, at both the population level and the single-cell level. In this dissertation, we present three projects that develop statistical methods for analyzing genomic data.
In the first project, we discuss how heritability estimators based on single nucleotide polymorphisms are affected by alternative structures of linkage disequilibrium. We demonstrate that linkage disequilibrium has the potential to bias modern estimators of heritability. In the second project, we investigate a sequencing-based assay that measures local chromatin structure. In this context, we propose a prior that allows a latent Dirichlet allocation model of chromatin accessibility data to leverage auxiliary data.
For this talk, I will focus on the third project, which considers the connection between sequence data and epigenomic or expression data in the context of multitask learning models. A grand challenge in computational biology is to build computational models that can predict various types of genomic activity (such as mRNA expression levels, patterns of histone modifications, and regions of chromatin accessibility) solely on the basis of the genomic DNA sequence. Methods such as DeepSEA, Basset, Basenji, Enformer, and BPNet frame this as a multitask learning problem. In this setting, each task involves predicting, from a common DNA sequence, one type of genomic activity in a particular cell type or tissue type. In this work, we demonstrate that this multitask learning setup can lead to inaccurate models when genomic features that are irrelevant for one task are erroneously assigned significance because they matter in a related task. We illustrate the problem with a simple example, with a more sophisticated simulation, and with empirical results from several published models. Unfortunately, there is no silver bullet to solve this problem: training in a single-task setting leads to much worse generalization performance, whereas training in the multitask setup risks allowing leakage of irrelevant features between tasks.