# Student Seminar Schedule

*Speaker: Zhaoqi Li*

This work proposes a computationally efficient algorithm for pure exploration in linear bandits by leveraging sampling and argmax oracles. Given a set of arms, the pure exploration linear bandit problem aims to return the best arm with high probability through noisy measurements.

Existing (asymptotically) optimal methods scale with the size of the arm set, requiring either (a) potentially costly projections for each arm or (b) explicitly maintaining a subset of the arm set under consideration at each time step. In general, computing projections may be computationally expensive, and maintaining a subset may be infeasible in many combinatorial settings. Our approach overcomes both of these limitations by sampling from an appropriate distribution over possible parameters, combined with access to an argmax oracle. In this vein, it enjoys computational efficiency similar to that of Thompson Sampling. However, unlike Thompson Sampling, which is known to be sub-optimal for pure exploration, our algorithm provably attains the optimal instance-dependent rate asymptotically.
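The sample-then-query loop the abstract describes can be sketched generically. The code below is a hypothetical illustration, not the speaker's algorithm: it assumes a Gaussian posterior over the unknown parameter and a linear-reward argmax oracle, and it recommends the empirically best arm at the end.

```python
import numpy as np

def argmax_oracle(arms, theta):
    """The argmax oracle: return the arm maximizing <arm, theta>."""
    return max(arms, key=lambda a: float(np.dot(a, theta)))

def sampling_based_exploration(arms, true_theta, n_rounds=2000, noise=0.1, seed=0):
    """Illustrative sketch: sample a parameter from a Gaussian posterior,
    query the argmax oracle, pull the returned arm, and update a
    regularized least-squares estimate. (Not the paper's sampling rule.)"""
    rng = np.random.default_rng(seed)
    d = len(true_theta)
    V = np.eye(d)            # regularized design matrix
    b = np.zeros(d)          # accumulated reward-weighted arms
    for _ in range(n_rounds):
        theta_hat = np.linalg.solve(V, b)
        theta_sample = rng.multivariate_normal(theta_hat, np.linalg.inv(V))
        arm = np.asarray(argmax_oracle(arms, theta_sample))
        reward = float(arm @ true_theta) + noise * rng.standard_normal()
        V += np.outer(arm, arm)
        b += reward * arm
    # Recommend the best arm under the final estimate.
    return argmax_oracle(arms, np.linalg.solve(V, b))
```

Note that the per-round cost is one posterior sample plus one oracle call, independent of how the argmax over the arm set is implemented; this is the efficiency argument the abstract makes.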

*Speaker: Seth Temple*

If an allele has a large advantage over other alleles, it may increase in frequency over time. This scenario is referred to as a selective sweep, and it can result in an excess of long identical-by-descent (IBD) haplotype segments. I extend a population genetics model and present a novel statistical framework that relates selective sweeps to the count of, and network relationships between, IBD segments. First, I propose a new community detection method for haplotypes sharing IBD segments and apply it to estimate the allele frequency of the causal sweep variant without prior knowledge of the allele. Second, I derive a method-of-moments estimator for the selection coefficient conditional on the causal allele frequency, population demography, and a genetic model. Third, I demonstrate how to efficiently simulate IBD segments from the population genetics model and assess uncertainty via a parametric bootstrap. In extensive simulation studies, I show that my point estimator is empirically unbiased, that my interval estimator attains nominal coverage, and that these inferences are robust to a variety of model misspecifications. This effort is the first of its kind to provide interval estimation for the selection coefficient and demonstrate desirable statistical properties. Additionally, I highlight that my method outperforms state-of-the-art methods for ranking and localizing the causal allele and for estimating the allele frequency and the selection coefficient. I apply my method, the incomplete selective sweep with extended haplotypes estimation procedure (isweep), to characterize positive selection in two European-American cohorts from the Trans-Omics for Precision Medicine project.
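The combination of a method-of-moments estimator with a parametric bootstrap interval can be illustrated on a toy model. The sketch below is generic and hypothetical (an exponential rate, not the selection coefficient or the isweep procedure): fit by matching the first moment, simulate new datasets from the fitted model, refit on each, and take percentile limits.

```python
import numpy as np

def mom_rate(x):
    """Method-of-moments estimator for an exponential rate: 1 / sample mean."""
    return 1.0 / np.mean(x)

def parametric_bootstrap_ci(x, n_boot=2000, alpha=0.05, seed=0):
    """Parametric bootstrap: re-estimate on data simulated from the fitted
    model, then report percentile interval limits."""
    rng = np.random.default_rng(seed)
    rate_hat = mom_rate(x)
    boots = [mom_rate(rng.exponential(1.0 / rate_hat, size=len(x)))
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return rate_hat, (lo, hi)
```

The same recipe applies whenever the fitted model can be simulated from cheaply, which is the role the efficient IBD-segment simulator plays in the abstract.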

*Speaker: Alan Min*

A grand challenge in computational biology involves building computational models capable of predicting various types of genomic activity (such as mRNA expression levels, patterns of histone modifications, and regions of chromatin accessibility) solely on the basis of the genomic DNA sequence. Methods such as DeepSEA, Basset, Basenji, Enformer, and BPNet approach this problem by framing it as a multitask learning problem. In this setting, each task involves predicting one type of genomic activity in one particular cell or tissue type, and all of the tasks start from a common input, the DNA sequence. In this work, we demonstrate that this multitask learning setup can lead to inaccurate models when genomic features that are irrelevant for one task are erroneously assigned significance in a related task. We illustrate the problem with a simple example, with a more sophisticated simulation, and with empirical results from several published models. We hypothesize that the problem arises because the model has direct access to DNA features but only indirect access to information about, for example, expression of DNA-binding factors. Hence, the model has difficulty separating cis-regulatory from trans-regulatory effects: for example, learning that a particular DNA motif has no effect in a particular cell type because the transcription factor that binds to the motif is not expressed in that cell type. This issue is particularly acute in the setting of *in silico* design, where such models are used to design sequence elements to achieve a particular pattern of activity. Unfortunately, there is no silver bullet to solve this problem: training in a single-task setting leads to much worse generalization performance, whereas training in the multitask setup risks allowing leakage of irrelevant features between tasks.

*Speaker: Michael Pearce*

Preference data is used for various statistical tasks, such as summarization, estimation, inference, and prediction. Although many deterministic and statistical models exist for analyzing preferences, there are many realistic settings in which few or no models exist for performing a principled analysis of preferences.

We first propose and compare two of the first joint statistical models for rankings and ratings. Our models are developed in different statistical frameworks, rely on distinct assumptions, and have differing statistical properties, which we explore through theory, simulation, and applications.

Second, we propose a methodology for estimating rank-clusters from rankings. Rank-clusters denote cases when objects in a collection are equal in quality and thus should be clustered in their overall rank. Our model extends previous frequentist work to permit additional ordinal data types and is proposed in a Bayesian framework for a unified approach to uncertainty quantification. We apply our model to real ranked-choice election data to explore voters' perceptions of candidates.
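The idea of a rank-cluster can be made concrete with a deliberately naive sketch. The code below is not the Bayesian model described in the abstract; it simply groups objects whose mean ranks across voters are close, to show what "clustered in their overall rank" means.

```python
import numpy as np

def naive_rank_clusters(rankings, tol=0.5):
    """Toy illustration (not the talk's Bayesian model): cluster objects
    whose mean ranks across voters lie within `tol` of each other.
    `rankings` has one row per voter; entry (v, j) is voter v's rank of
    object j (1 = best)."""
    R = np.asarray(rankings, dtype=float)
    means = R.mean(axis=0)                  # average rank per object
    order = np.argsort(means)               # objects from best to worst
    clusters, current = [], [order[0]]
    for a, b in zip(order, order[1:]):
        if means[b] - means[a] <= tol:      # close enough: same rank-cluster
            current.append(b)
        else:
            clusters.append(current)
            current = [b]
    clusters.append(current)
    return [[int(i) for i in c] for c in clusters]
```

A principled version would quantify uncertainty about which objects are truly tied, which is what the Bayesian framework in the abstract provides.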

*Speaker: Apara Venkat*

How can we leverage domain knowledge during statistical tasks such as learning or decision-making? In this talk, we will discuss two instances of this question that arise in multi-armed bandits and causal discovery.

First, we will map contact tracing of infectious diseases to mortal multi-armed bandits, where "arms" are infected people, "playing an arm" corresponds to testing a contact of that person, and "rewards" are new infections. The goal is to uncover as many infections as possible. This is a mortal bandit because each person only has a finite number of contacts. Adaptive Greedy and Pilot Sampling algorithms have previously been proposed for mortal bandits. Through a Bayesian regret analysis, we will show that both algorithms are optimal, up to logarithmic factors, in their dependence on the number of arms and the time horizon. Using empirical simulations, we will see that the optimal choice of algorithm depends on knowledge of the underlying infectivity (reward) distribution. We will then demonstrate these insights using three administrative COVID-19 contact tracing datasets.
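The mortal-bandit mechanics above can be sketched with a simple greedy rule. This is a hypothetical illustration in the spirit of Adaptive Greedy, not the talk's exact algorithm: stay with the current infected person while the empirical success rate of testing their contacts beats a threshold, and move on when it drops or their contacts are exhausted.

```python
import numpy as np

def adaptive_greedy(arm_means, lifetime, horizon, threshold=0.5, seed=0):
    """Sketch of a greedy rule for mortal bandits: pull the current arm
    while its empirical success rate is at least `threshold`; otherwise
    abandon it for a fresh arm. Each arm dies after `lifetime` pulls
    (each contact can only be tested once). Returns total reward."""
    rng = np.random.default_rng(seed)
    rewards = 0
    i = -1                          # index of the current arm
    pulls = successes = 0
    for _ in range(horizon):
        if i < 0 or pulls >= lifetime or (pulls > 0 and successes / pulls < threshold):
            i += 1                  # abandon current arm, take a fresh one
            if i >= len(arm_means):
                break               # no arms left
            pulls = successes = 0
        r = rng.random() < arm_means[i]   # Bernoulli reward: infection found?
        pulls += 1
        successes += r
        rewards += r
    return rewards
```

The design choice worth noting is that the rule never revisits an abandoned arm; with finite lifetimes, exploration budget spent reviving a weak arm is budget not spent on a fresh one.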

Next, we will turn to causal discovery in the presence of latent confounding. Here, the directed acyclic graph (DAG) representing the underlying causal structure cannot be learned directly from observational data. Instead, by relying on conditional independence constraints present in the data, causal relationships between the observed variables may be represented by a graphical model called a maximal ancestral graph (MAG). However, several MAGs may encode the same constraints present in the observational distribution. So, we can only recover an equivalence class of MAGs, containing the true causal MAG, from observational data. We undertake the task of refining this equivalence class using expert knowledge in the form of edge orientations. To accomplish this, we revise previously established graphical orientation rules and construct new ones. We show that these rules are sound and conjecture that they are also complete for incorporating background knowledge.