Nonparametric Identified Methods to Handle Nonignorable Missing Data
Update 4/25/2019: Location of this seminar has been moved to SMI 211.
Bayesian Hierarchical Modeling of Demographic and Climate Change Indicators
Bayesian hierarchical modeling is a powerful tool for demography and climate science. In this talk we will focus on its use for accounting for uncertainty about past demographic quantities in population projections. Since the 1940s, population projections have in most cases been produced using the deterministic cohort component method. However, in 2015, for the first time, in a major advance, the United Nations issued official probabilistic population projections for all countries based on Bayesian hierarchical models for total fertility and life expectancy.
Generalized Score Matching for Non-Negative Data
A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation (MLE) may be implemented using numerical integration, the approach becomes computationally intensive. In contrast, the score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions on the m-dimensional Euclidean space R^m.
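The idea is easiest to see in one dimension. As a toy illustration (not the generalized method of the talk), take a zero-mean Gaussian written as an unnormalized density p(x) ∝ exp(−λx²/2) with unknown precision λ. Hyvärinen's objective depends only on the score ∂ log p/∂x, so λ has a closed-form estimate and the normalizing constant is never computed; the constants below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_precision = 4.0  # i.e. variance 0.25
x = rng.normal(0.0, scale=true_precision ** -0.5, size=100_000)

# Model: p(x; lam) proportional to exp(-lam * x^2 / 2),
# so the score is psi(x) = -lam * x and psi'(x) = -lam.
# Score matching objective: J(lam) = E[psi'(x) + psi(x)^2 / 2]
#                                  = -lam + lam^2 * E[x^2] / 2.
# Setting dJ/dlam = 0 gives a closed-form estimate -- no
# normalizing constant is ever evaluated.
lam_hat = 1.0 / np.mean(x ** 2)   # close to the true precision 4.0
```

In this particular model the score matching estimate coincides with the MLE; the point of the method is that the same recipe still works when the normalizing constant is intractable.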
Green Dot Bystander Intervention Training
Green Dot is a movement, a program, and an action. The aim of Green Dot is to prevent and reduce sexual assault & relationship violence at UW by engaging students as leaders and active bystanders who step in, speak up, and interrupt potential acts of violence. The Green Dot movement is about gaining a critical mass of students, staff and faculty who are willing to do their small part to actively and visibly reduce power-based personal violence at UW.
Sequential change-point detection for a network of Hawkes processes
Hawkes processes have been a popular point process model for capturing mutual excitation of discrete events. In the network setting, this can capture the mutual influence between nodes, which has a wide range of applications in neuroscience, social networks, and crime data analysis. In this talk, I will present a statistical change-point detection framework to detect, in real time, a change in the influence using streaming discrete events.
A likelihood ratio test for shape-constrained density functions
The celebrated Grenander (1956) estimator is the maximum likelihood estimator of a decreasing density function. In contrast to alternative nonparametric density estimators, the Grenander estimator does not require any smoothing parameters and is often viewed as a fully automatic procedure. However, the monotone density assumption might be questionable. While testing qualitative constraints such as monotonicity is difficult in general, we show that a likelihood ratio test statistic Kₙ has an incredibly simple asymptotic null distribution.
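The Grenander estimator is indeed automatic to compute: it is the left derivative of the least concave majorant of the empirical CDF. A minimal NumPy sketch (function name and hull construction are mine):

```python
import numpy as np

def grenander(x):
    """Left derivative of the least concave majorant of the ECDF:
    the MLE of a decreasing density on [0, inf) (Grenander 1956).
    Returns the sorted data and the estimated density at each point."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # ECDF knots, prepended with the origin.
    xs = np.concatenate([[0.0], x])
    ys = np.arange(n + 1) / n
    # Build the concave majorant with a monotone stack of knot indices.
    hull = [0]
    for i in range(1, n + 1):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # Pop b if it lies on or below the chord from a to i.
            if (ys[b] - ys[a]) * (xs[i] - xs[b]) <= (ys[i] - ys[b]) * (xs[b] - xs[a]):
                hull.pop()
            else:
                break
        hull.append(i)
    # Density estimate at x[j] is the slope of the hull segment covering it.
    slopes = np.empty(n)
    for a, b in zip(hull[:-1], hull[1:]):
        slopes[a:b] = (ys[b] - ys[a]) / (xs[b] - xs[a])
    return x, slopes

rng = np.random.default_rng(1)
pts, dens = grenander(rng.exponential(size=2000))
# dens is a non-increasing step function evaluated at the sorted data
```

No bandwidth or other tuning parameter appears anywhere, which is the "fully automatic" property the abstract refers to.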
Rerandomization and ANCOVA
Randomization is a basis for inferring treatment effects with minimal additional assumptions. Appropriately using covariates in randomized experiments will further yield more precise estimators. In his seminal work Design of Experiments, R. A. Fisher suggested blocking on discrete covariates in the design stage and conducting the analysis of covariance (ANCOVA) in the analysis stage. In fact, blocking can be embedded into a wider class of experimental design called rerandomization, and the classical ANCOVA can be extended to more general regression-adjusted estimators.
Randomized Experiments on Amazon’s Supply Chain
At Amazon’s Inventory Planning and Control Laboratory (IPC Lab) we run randomized controlled trials (RCTs) that evaluate the efficacy of in-production buying and supply chain policies on important business metrics. Our customers are leading supply chain researchers and business managers within Amazon, and our mission is to help them best answer the question, ‘Should I roll out my policy?’ In this talk we discuss how we navigate multiple obstacles to fulfilling our mission.
Model compression as constrained optimization, with application to neural nets
Deep neural nets have become in recent years a widespread practical technology, with impressive performance in computer vision, speech recognition, natural language processing and many other applications. Deploying deep nets in mobile phones, robots, sensors and IoT devices is of great interest. However, state-of-the-art deep nets for tasks such as object recognition are too large to be deployed in these devices because of the limits these devices impose on CPU speed, memory, bandwidth, battery life and energy consumption.
Causal Inference with Unmeasured Confounding: an Instrumental Variable Approach
Causal inference is a challenging problem because causation cannot be established from the observational data alone. Researchers typically rely on additional sources of information to infer causation from association. Such information may come from powerful designs such as randomization, or background knowledge such as information on all confounders. However, perfect designs or background knowledge required for establishing causality may not always be available in practice.
Testing One Hypothesis Multiple Times: a simple tool for generalized inference
The identification of new rare signals in data, the detection of a sudden change in a trend, and the selection of competing models are among the most challenging problems in statistical practice.
Manifold Data Analysis with Applications to High-Resolution 3D Imaging
Many scientific areas are faced with the challenge of extracting information from large, complex, and highly structured data sets. A great deal of modern statistical work focuses on developing tools for handling such data. In this work we present a new subfield of functional data analysis (FDA), which we call Manifold Data Analysis, or MDA. MDA is concerned with the statistical analysis of samples where one or more variables measured on each unit is a manifold, thus resulting in as many manifolds as we have units.
Survival Analysis and Length-Biased Sampling: An Application to Survival with Dementia
When survival data are collected as part of a prevalent cohort study, the recruited cases have already experienced their initiating event. These prevalent cases are then followed for a fixed period of time, at the end of which the subjects will either have failed or have been censored. When interest lies in estimating the survival distribution, from onset, of subjects with the disease, one must take into account that the survival times of the cases in a prevalent cohort study are left truncated.
The Walking Dog Model, Tetrad Differences, and Sibling Resemblance
In this talk, I will try to trace some of the ideas that led from Herbert Costner's early work with multiple indicator models to simple models of sibling resemblance in social and economic standing, and to more elaborate models that combine direct and indirect measurement of family influence.
Conquering the Complexity of Time: Mining from Big Time Series Data
Many emerging applications of big data involve time series data. In this talk, I will discuss a collection of machine learning and data mining approaches to effectively analyze and model large-scale time series and spatio-temporal data. Experiment results will be shown to demonstrate the effectiveness of our models in healthcare and climate applications.
Regularized Covariance Matrix Estimation
I will review and discuss some of the different themes of regularized estimation of the population covariance matrix:
1. Why estimate it and in what norm?
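Whatever the norm, one of the simplest regularized estimators in this area is linear shrinkage of the sample covariance toward a scaled identity, in the spirit of Ledoit and Wolf. The sketch below uses a fixed, hand-picked shrinkage weight rather than their data-driven choice:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 40                      # more variables than observations
true_cov = np.diag(np.linspace(1.0, 3.0, p))
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

S = np.cov(X, rowvar=False)        # sample covariance: singular when n <= p
target = np.trace(S) / p * np.eye(p)

alpha = 0.3                        # shrinkage weight, chosen by hand here;
                                   # Ledoit-Wolf estimate it from the data
S_shrunk = (1 - alpha) * S + alpha * target

# Shrinkage restores invertibility and typically reduces spectral-norm error.
err_sample = np.linalg.norm(S - true_cov, 2)
err_shrunk = np.linalg.norm(S_shrunk - true_cov, 2)
```

The choice of norm in which to measure `err_sample` versus `err_shrunk` is exactly the question raised in point 1 above.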
Profile Likelihood Estimation in Semi-Parametric Models
This talk presents an alternative profile likelihood estimation theory. By introducing a new parametrization, we improve on the seminal work of Murphy and van der Vaart (2000) in two ways: we prove the no-bias condition in a general semi-parametric model context, and we deal with the direct quadratic expansion of the profile likelihood rather than an approximate one. In addition, we discuss a difficulty which we encounter in profile likelihood estimation.
Model-Based Clustering of Magnetic Resonance Data
In radiology, magnetic resonance imaging (MRI) and magnetic resonance spectroscopic imaging (MRSI) play an increasingly important role. However, the wealth of data available to the radiologist makes it more difficult to extract the relevant information. One way to summarise information from several congruent images is to show a segmented image, i.e. an image where pixels are clustered.
Partial Identification and Confidence Sets for Functionals of the Joint Distribution of "Potential Outcomes"
Authors: Yanqin Fan, Emmanuel Guerre, and Dongming Zhu
Identifiability of Linear Structural Equation Models
Structural equation models are multivariate statistical models that are defined by specifying noisy functional relationships among random variables. This talk treats the classical case of linear relationships and additive Gaussian noise terms. Each linear structural equation model is associated with a graph and corresponds to a polynomially parametrized set of positive definite covariance matrices.
Model Selection Procedures in Non-parametric Regression
Consider the regression model Y=g0(X)+E, where E is the error term, and g0:R^k -> R is the unknown regression function to be estimated from independent observations of (X,Y). Furthermore we have a countable collection of models (classes of candidate regression functions of finite VC dimension) of growing complexity. The larger the model, the better the approximation error, but the worse the estimation error. In order to balance both errors, we propose to estimate g0 by means of penalised least squares, where the penalty is proportional to the VC-dimension of the model.
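A minimal sketch of this trade-off, using nested polynomial classes with the number of coefficients standing in for the VC dimension, and a hand-picked penalty constant that assumes the noise level is known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)   # true g0(x) = sin(3x)

def fit_rss(deg):
    """Least-squares fit over polynomials of the given degree; returns RSS."""
    coefs = np.polyfit(x, y, deg)
    return np.sum((y - np.polyval(coefs, x)) ** 2)

# Penalised least squares: residual sum of squares plus a penalty
# proportional to the model dimension (deg + 1 coefficients, standing
# in for the VC dimension of the class).
C = 2 * 0.3 ** 2                   # penalty constant; assumes known noise level
scores = {d: fit_rss(d) + C * (d + 1) for d in range(0, 11)}
best_degree = min(scores, key=scores.get)
```

Small degrees lose on approximation error, large degrees lose on estimation error (they fit noise), and the penalized criterion picks an intermediate degree.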
Point Process Models for Astronomy: Quasars, Coronal Mass Ejections, and Solar Flares
I will be presenting a talk on my dissertation research which consisted of the statistical analysis of two interesting astronomical applications involving point process data.
Local Discriminant Bases and Their Applications
For signal and image classification problems, such as the ones in medical or geophysical diagnostics and military applications, extracting relevant features is one of the most important tasks. As an attempt to automate the feature extraction procedure and to understand what the critical features for classification are, we developed the so-called local discriminant basis (LDB) method which rapidly selects an orthonormal basis suitable for signal/image classification problems from a large collection of orthonormal bases (e.g., wavelet packets and local trigonometric bases).
Nonparametric Estimation and Comparison for Networks
Scientific questions about networks are often comparative: we want to know whether the difference between two networks is just noise, and, if not, how their structures differ. I'll describe a general framework for network comparison, based on testing whether the distance between models estimated from separate networks exceeds what we'd expect based on a pooled estimate.
De Finetti's Ultimate Failure
The most scientific and least controversial claim of de Finetti's subjective philosophy of probability is that the rules of Bayesian inference can be derived from a system of axioms for rational decision making that does not presuppose existence of probability. In fact, de Finetti's argument is fatally flawed. The error is irreparable. The slides in PowerPoint and PDF are available at http://www.math.washington.edu/~burdzy/Philosophy/.
Regular Variation and Extremes in Atmospheric Science
Dependence in the tail of the distribution can differ from that in the bulk of the distribution. A basic tenet of a univariate extreme value analysis is to discard the bulk of the data and only analyze the data considered to be extreme. This is true for multivariate problems as well. We will first introduce a framework for describing tail dependence. The probabilistic framework of regular variation has strong ties to classical extreme value theory and provides a framework for describing tail dependence.
Impacts of Climate Change on Species Distributions: Empirical and Statistical Challenges
One of the greatest challenges ecologists face is predicting how climate change will affect the organisms with which we share our planet. Ecological theory predicts that species' current distributions are determined by their climatic niches (i.e. fitness as a function of climate). Statistical models relating species' geographic distributions to climate (SDMs, or species distribution models) are therefore used to predict shifts in species distributions with climate change.
Statistical Factor Models and Predictive Approaches for Problems of Molecular Characterisation
I will discuss aspects of data analysis and modelling arising from a number of clinical studies that aim to integrate gene expression, and other forms of molecular data, into predictive modelling of clinical outcomes and disease states. Some of our work on empirical and model based approaches to defining underlying factor structure in large-scale expression data, and the use of estimated factors in predictive regression and classification tree models, will be reviewed.
Nonstationary Time Series Modeling and Estimation with Applications in Oceanography
This talk will focus on nonstationary time series, from both a methodological and applied perspective. On the methodology side, I will discuss new stochastic models for capturing structure in bivariate data, by representing the series as complex-valued. This representation allows for novel ways of capturing features that are multiscale, anisotropic and/or nonstationary. I will also present new methodology and theory for maximum likelihood inference in the frequency-domain, specifically by providing a method for removing estimation error from the Whittle likelihood.
Querying Probabilistic Data
A major challenge in data management is how to manage uncertain data. Many reasons for the uncertainty exist: the data may be extracted automatically from text, it may be derived from the physical world such as RFID data, it may be integrated using fuzzy matches, or may be the result of complex stochastic models. Whatever the reason for the uncertainty, a data management system needs to offer predictable performance to queries over large instances of uncertain data.
Gini Association and the Pseudo-Lorenz Curve
We were motivated by the problem of assessing how inequality in income is influenced by the corresponding inequality in some other related variable (say, the number of years of formal education completed). More generally, consider the pseudo-Lorenz curve of a nonnegative r.v. Y relative to (i.e., with respect to the ordering of) another related nonnegative r.v. X. It is shown that this pseudo-Lorenz curve L(Y/X) always lies above the Lorenz curve L(Y) of Y.
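The ordering argument behind this inequality is easy to check numerically: accumulating Y in any order other than its own ascending order can only give larger partial sums. A sketch with simulated data (variable names are illustrative):

```python
import numpy as np

def lorenz(values, order_by=None):
    """Cumulative share of the total of `values`, accumulated in the
    ascending order of `order_by` (a pseudo-Lorenz, or concentration,
    curve). With order_by=None this is the usual Lorenz curve."""
    v = np.asarray(values, dtype=float)
    key = v if order_by is None else np.asarray(order_by, dtype=float)
    shares = np.cumsum(v[np.argsort(key)]) / v.sum()
    return np.concatenate([[0.0], shares])

rng = np.random.default_rng(0)
education = rng.exponential(size=1000)
income = education + rng.exponential(size=1000)   # related but not identical

L_Y = lorenz(income)                        # Lorenz curve of income
L_YX = lorenz(income, order_by=education)   # pseudo-Lorenz of income w.r.t. education
# Sorting income by its own values minimises every partial sum,
# so L_YX lies above L_Y pointwise, as the abstract states.
```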
Statistics at Google
This presentation will describe some of the problems faced and methods used by statisticians at Google:
• A primary dimension of search quality is the relevance of search results to the search query. Preference rank allows us to convert pairwise comparisons into a ranking of search results.
• Through the AdSense program, Google delivers targeted advertising on third-party web sites, which we refer to as publishers. Publisher scores are a method of ranking publishers by their effectiveness as an ad delivery platform.
MS Thesis Presentation - Simple Transformation Techniques for Improved Non-parametric Regression
In this paper, the authors propose and investigate two new methods for achieving less bias in non-parametric regression, using simulations to compare the bias, variance, and mean squared error of the second and preferred of these methods to those of the local constant, local linear, and local cubic non-parametric regression estimators. The two new methods proposed by the authors have bias of order h^4, where h is the estimator's smoothing parameter, in contrast to the basic kernel estimator's bias of order h^2.
Using Radical Environmentalist Texts to Uncover Network Structure and Network Features
In their efforts to call attention to environmental problems, communicate with like-minded groups, and mobilize support for their activities, radical environmentalist organizations produce an enormous amount of text. These texts, like radical environmental groups themselves, are often (i) densely connected and (ii) highly variable in advocated protest activities. Given a corpus of radical environmentalist texts, can one uncover the underlying network structure of environmental (and related leftist) groups?
Identification of Minimal Sets of Covariates for Matching Estimators
The availability of large observational databases allows empirical scientists to consider estimating treatment effects without conducting costly and/or unethical experiments in which the treatment would be randomized. The Neyman-Rubin model (potential outcome framework) and the associated matching estimators have become increasingly popular because they allow for the non-parametric estimation of average treatment effects.
Bayesian Models for Integrative Genomics
Novel methodological questions are being generated in the biological sciences, requiring the integration of different concepts, methods, tools and data types. Bayesian methods that employ variable selection have been particularly successful for genomic applications, as they make it possible to handle situations where the number of measured variables can be much greater than the number of observations. In this talk I will focus on models that integrate experimental data from different platforms together with prior knowledge.
From Big Data to Precision Oncology using Machine Learning
While targeting key drivers of tumor progression (e.g., BCR/ABL, HER2, and BRAFV600E) has had a major impact in oncology, most patients with advanced cancer continue to receive drugs that do not work in concert with their specific biology. This is exemplified by acute myeloid leukemia (AML), a disease for which treatments and cure rates (in the range of 20%) have remained stagnant. Effectively deploying an ever-expanding array of cancer therapeutics holds great promise for improving these rates but requires methods to identify how drugs will affect specific patients.
Markov Random Fields and Issues of Computation
Markov Random Fields are extremely useful and generally applicable for probabilistic modelling of a wide range of systems. We'll review methods for performing inference calculations (most likely configuration and marginal probabilities) on MRFs. Unfortunately, for many tasks, these basic calculations are computationally infeasible. We'll discuss the limitations of standard computation methods and the graph-theoretic properties related to computational complexity.
Prior Adjusted Default Bayes Factors for Testing (In)Equality Constrained Hypotheses
Bayes factors have been proven to be very useful when testing statistical hypotheses with inequality (or order) constraints and/or equality constraints between the parameters of interest. Two useful properties of the Bayes factor are its intuitive interpretation as the relative evidence in the data between two hypotheses and the fact that it can straightforwardly be used for testing multiple hypotheses. The choice of the prior, which reflects one's knowledge about the unknown parameters before observing the data, has a substantial effect on the Bayes factor.
UPS Delivers Optimal Phase Diagram for High Dimensional Variable Selection
Consider a linear regression model
Y = Xβ + z;  z ~ N(0, I_n);  X = X_{n,p};
where both p and n are large but p > n. The vector β is unknown but is sparse in the sense that only a small proportion of its coordinates is nonzero, and we are interested in identifying these nonzero ones. We model the coordinates of β as samples from a two-component mixture (1 − ε)ν₀ + επ, and the rows of X as samples from N(0, (1/n)Ω), where ν₀ is the point mass at 0, π is a distribution, and Ω is a p × p correlation matrix which is unknown but is presumably sparse.
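This data-generating model is easy to simulate. In the sketch below the mixing distribution is taken to be a point mass at a value τ and the correlation matrix is the identity; simple marginal thresholding (not the talk's UPS procedure) already recovers most of the signal at this strength:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps, tau = 500, 1000, 0.01, 6.0   # illustrative sparsity and signal strength

# beta_j drawn from the two-component mixture (1 - eps)*nu_0 + eps*pi,
# with pi taken to be a point mass at tau for this sketch
signal = rng.random(p) < eps
beta = np.where(signal, tau, 0.0)

# Rows of X ~ N(0, (1/n) * Omega); take Omega = I_p here
X = rng.normal(scale=1 / np.sqrt(n), size=(n, p))
y = X @ beta + rng.normal(size=n)

# Columns of X have roughly unit norm, so X.T @ y is approximately
# N(beta_j, 1) componentwise; threshold at the universal level
z = X.T @ y
selected = np.abs(z) > np.sqrt(2 * np.log(p))
```

When Ω has nontrivial off-diagonal structure this marginal screening degrades, which is where the more refined procedure of the talk comes in.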
Nonhomogeneous Hidden Markov Models for Downscaling Synoptic Atmospheric Patterns to Precipitation Amounts
Advisors: Peter Guttorp & Jim Hughes
Nonparametric Estimation of a Convex Bathtub-Shaped Hazard Function
In the analysis of lifetime data, a key object of interest is the hazard function, or instantaneous failure rate. One natural assumption is that the hazard is bathtub, or U-shaped (i.e. first decreasing, then increasing). In particular, this is often the case in reliability engineering or human mortality.
MS Thesis Presentation - Hierarchical Mixture of Experts and Applications
HME (Hierarchical Mixture of Experts) is a tree-structured architecture for supervised learning. It is characterized by soft multi-way probabilistic splits, generally based on linear functions of the input values, and by linear or logistic fits at the terminal nodes (called Experts in the HME literature) rather than the constant functions used in CART. The statistical model underlying HME is a hierarchical mixture model, which allows for maximum likelihood estimation of the parameters using EM methods.
Latent-variable graphical modeling via convex optimization
Suppose we have a graphical model with sample observations of only a subset of the variables. Can we separate the extra correlations induced due to marginalization over the unobserved, hidden variables from the structure among the observed variables? In other words is it still possible to consistently perform model selection despite the unobserved, latent variables?
Bootstrap and Subsampling for Non-Stationary Spatial Data
Subsampling and bootstrap methods have been suggested in the literature to nonparametrically estimate the variance and distribution of statistics computed from spatial data. Usually stationary data are required to ensure that the methods work. However, in empirical applications the assumption of stationarity often must be rejected. This talk presents consistent bootstrap and subsampling methods to estimate the variance and distributions of statistics based on non-stationary spatial lattice data. Applications to forestry are also discussed.
Controlling False Discovery Rate Via Knockoffs
In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are associated with the response, while controlling the false discovery rate (FDR) to ensure that our results are reliable and replicable. The knockoff filter is a variable selection procedure for linear regression, proven to control FDR exactly under any type of correlation structure in the regime where n>p (sample size > number of variables).
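The selection step of the filter is simple once the knockoff statistics W_j are in hand (large positive W_j suggests a true signal, and null statistics are symmetric about zero); constructing the knockoff variables themselves is the hard part and is omitted here. A sketch of the data-dependent threshold, with toy statistics in place of a real knockoff construction:

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Knockoff selection threshold for sign-symmetric statistics W:
    T = min{ t : (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q }.
    offset=1 gives the knockoff+ variant."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

# Toy statistics: large positive W for 20 signals, symmetric noise for 200 nulls
rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(5, 1, 20), rng.normal(0, 1, 200)])
T = knockoff_threshold(W, q=0.1)
selected = W >= T
```

The count of W_j below −t acts as an estimate of the number of false discoveries above t, which is what gives the exact finite-sample FDR control mentioned above.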
Estimation of a Two-component Mixture Model with Applications to Multiple Testing
We consider estimation and inference in a two component mixture model where the distribution of one component is completely unknown. We develop methods for estimating the mixing proportion and the unknown distribution nonparametrically, given i.i.d. data from the mixture model. We use ideas from shape restricted function estimation and develop "tuning parameter free" estimators that are easily implementable and have good finite sample performance. We establish the consistency of our procedures.
Point Process Transformations and Applications to Wildfire Data
This talk will review some ways of transforming point processes, including smoothing, thinning, superposition, rescaling, and tessellation. Ways in which each of these may be used in the analysis of point process data will be examined, especially in relation to the problem of estimating wildfire hazard. We will explore in particular an important computational geometry problem involving tessellations, namely the estimation of point locations from piecewise constant image data via Dirichlet tessellation inversion.
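Thinning, for instance, is the standard way to simulate an inhomogeneous point process from a homogeneous one (the Lewis–Shedler construction); the intensity function below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an inhomogeneous Poisson process on [0, T] by thinning:
# generate candidates at a dominating constant rate lam_max, then keep
# each candidate point with probability lambda(t) / lam_max.
T = 100.0
lam_max = 3.5

def lam(t):
    """Target intensity, bounded above by lam_max."""
    return 2.0 + 1.5 * np.sin(2 * np.pi * t / 25)

n_candidate = rng.poisson(lam_max * T)
candidates = np.sort(rng.uniform(0, T, n_candidate))
keep = rng.uniform(size=n_candidate) < lam(candidates) / lam_max
points = candidates[keep]
```

Superposition is the reverse direction (merging independent processes adds their intensities), and the same candidate-and-accept logic underlies thinning-based residual analysis for fitted wildfire models.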
Flexible, Reliable, and Scalable Nonparametric Learning
Applications of statistical machine learning increasingly involve datasets with rich hierarchical, temporal, spatial, or relational structure. Bayesian nonparametric models offer the promise of effective learning from big datasets, but standard inference algorithms often fail in subtle and hard-to-diagnose ways. We explore this issue via variants of a popular and general model family, the hierarchical Dirichlet process.
Probabilistic Weather Forecasting Using Bayesian Model Averaging
Probabilistic forecasts of wind vectors are becoming critical as interest grows in wind as a clean and renewable source of energy, in addition to a wide range of other uses, from aviation to recreational boating. Unlike other common forecasting problems, which deal with univariate quantities, statistical approaches to wind vector forecasting must be based on bivariate distributions. The prevailing paradigm in weather forecasting is to issue deterministic forecasts based on numerical weather prediction models.
Ergodic Limit Laws for Stochastic Optimization Problems
Department of Mathematics Optimization Seminar
Solution procedures for stochastic programming problems, statistical estimation problems (constrained or not), stochastic optimal control problems and other stochastic optimization problems often rely on sampling. The justification for such an approach passes through 'consistency.' A comprehensive, satisfying and powerful technique is to obtain the consistency of the optimal solutions, statistical estimators, controls, etc., as a consequence of the consistency of the stochastic optimization problems themselves.
Survey of Generalized Inverses and Their Use in Stochastic Modelling
In many stochastic models, in particular Markov chains in discrete or continuous time and Markov renewal processes, a Markov chain is present either directly or indirectly through some form of embedding. The analysis of many problems of interest associated with these models, eg. stationary distributions, moments of first passage time distributions and moments of occupation time random variables, often concerns the solution of a system of linear equations involving I - P, where P is the transition matrix of a finite, irreducible, discrete time Markov chain.
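A concrete instance of such a generalized-inverse device is the Kemeny–Snell fundamental matrix Z = (I − P + 1πᵀ)⁻¹, from which stationary and first-passage quantities follow directly; a small numerical sketch with an illustrative chain:

```python
import numpy as np

# Transition matrix of a small irreducible discrete-time Markov chain
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
n = P.shape[0]

# Stationary distribution: pi (I - P) = 0 with pi summing to 1.
# I - P is singular, so append the normalising constraint as an extra row.
A = np.vstack([(np.eye(n) - P).T, np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Kemeny-Snell fundamental matrix, a generalized inverse device for I - P
Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))

# Mean first passage times: m_ij = (z_jj - z_ij) / pi_j  (zero on the diagonal);
# mean recurrence times are 1 / pi_j.
M = (np.diag(Z)[None, :] - Z) / pi[None, :]
```

The same matrix Z also yields occupation-time moments, which is why it recurs throughout the Markov chain problems listed in the abstract.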
MS Thesis Presentation - A Non-Parametric Approach for Handling Repeated Measures in Cancer Experiments
In longitudinal studies, the usual modeling assumptions for multivariate analyses do not always hold. One way to address this is to use non-parametric approaches. In the paper I will be presenting, the authors analyzed tumor volume in rats as a function of the lipids in their diet. The data were highly heteroscedastic and strongly correlated over time. To compare lipid diets, randomization F-tests were used. Then, local polynomial smoothing was used to create tumor growth curves for each diet, as well as confidence intervals that account for the serially correlated data.
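A randomization F-test of this kind can be sketched as follows: the F statistic is recomputed under random relabelings of the diet groups, so its reference distribution comes from the randomization itself rather than from normality assumptions (the data and group labels below are simulated stand-ins):

```python
import numpy as np

def randomization_f_test(values, groups, n_perm=5000, seed=0):
    """Randomization test for equality of group means: compare the observed
    F statistic to its distribution under random relabelings."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)

    def f_stat(g):
        grand = values.mean()
        labels = np.unique(g)
        between = sum(np.sum(g == l) * (values[g == l].mean() - grand) ** 2
                      for l in labels) / (len(labels) - 1)
        within = sum(np.sum((values[g == l] - values[g == l].mean()) ** 2)
                     for l in labels) / (len(values) - len(labels))
        return between / within

    observed = f_stat(groups)
    exceed = sum(f_stat(rng.permutation(groups)) >= observed
                 for _ in range(n_perm))
    return observed, (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
diet = np.repeat(["A", "B", "C"], 20)
volume = np.concatenate([rng.normal(10, 1, 20),
                         rng.normal(10, 3, 20),    # heteroscedastic groups
                         rng.normal(13, 2, 20)])
F, p_value = randomization_f_test(volume, diet)
```

Note this sketch permutes independent observations; with repeated measures, as in the paper, the relabeling must respect the within-subject structure.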
Modeling hierarchical variance with Kronecker structure, with application to quality measures in Medicare Advantage
Studying covariance matrices in hierarchical models can reveal meaningful relationships among variables, but these become difficult to interpret as the number of variables grows. Conventional factor analysis reduces the dimension by mapping onto a set of one-dimensional factors, but does not accommodate variables with a cross-classified layout. For such applications, we develop hierarchical models with Kronecker-product (separable) covariance structure at the second level.
Non-Stationary Analysis and Radial Localisation in 2D
Image analysis has in the last decade experienced a revolution via the development of new tools for the representation and analysis of local image features. At the heart of these developments is the construction of suitable local representations of structure, via decompositions in a set of localized functions. The chosen decomposition then forms the setting for further analysis and/or estimation methods. In particular, compression of a given representation ensures that most decomposition coefficients are of negligible magnitude, and this often simplifies the analysis considerably.
Clustering Based on Non-Parametric Density Estimation: A Proposal
Cluster analysis based on non-parametric density estimation represents an approach to the clustering problem whose roots date back several decades, but it is only in recent times that this approach could actually be developed. The talk presents one proposal within this approach, among the few that have been brought to an operational stage.
Novel Approaches to Snowball / Respondent-Driven Sampling That Circumvent the Critical Threshold
Web crawling, snowball sampling, and respondent-driven sampling (RDS) are three types of network driven sampling techniques that are popular when it is difficult to contact individuals in the population of interest. This talk will first review previous research which has shown that if participants refer too many other participants, then under the standard Markov model in the RDS literature, the standard approaches do not provide "square root n" consistent estimators. In fact, there is a critical threshold where the design effect of network sampling grows with the sample size.
Overdetermined Estimating Equations with Applications to Panel Data
Panel data has important advantages over purely cross-sectional or time-series data in studying many economic problems, because it contains information about both the intertemporal dynamics and the individuality of the entities being investigated. A commonly used class of models for panel studies identifies the parameters of interest through an overdetermined system of estimating equations. Two important problems that arise in such models are the following: (1) It may not be clear a priori whether certain estimating equations are valid.
Optimal Design of Experiments in the Presence of Network Interference
Causal inference research in statistics has been largely concerned with estimating the effect of treatment (e.g. personalized tutoring) on outcomes (e.g., test scores) under the assumption of "lack of interference"; that is, the assumption that the outcome of an individual does not depend on the treatment assigned to others. Moreover, whenever its relevance is acknowledged (e.g., study groups), interference is typically dealt with as an uninteresting source of variation in the data.
Estimation of the Relative Risk and Risk Difference
I will first review well-known differences between odds ratios, relative risks and risk differences. These results motivate the development of methods, analogous to logistic regression, for estimating the latter two quantities. I will then describe simple parametrizations that facilitate maximum-likelihood estimation of the relative risk and risk-difference. Further, these parametrizations allow for doubly-robust g-estimation of both quantities. (Joint work with James Robins, Harvard School of Public Health)
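The basic contrast between the three quantities can be seen from a single hypothetical 2×2 table:

```python
# Hypothetical cohort-study counts (rows: exposure; columns: outcome yes/no)
a, b = 30, 70    # exposed:   30 with outcome, 70 without
c, d = 10, 90    # unexposed: 10 with outcome, 90 without

risk_exposed = a / (a + b)        # 0.30
risk_unexposed = c / (c + d)      # 0.10

odds_ratio = (a * d) / (b * c)                   # 2700 / 700, about 3.86
relative_risk = risk_exposed / risk_unexposed    # about 3.0
risk_diff = risk_exposed - risk_unexposed        # about 0.20

# With a common outcome like this, the odds ratio overstates the
# relative risk -- one of the well-known differences reviewed first.
```

The regression methods in the talk then model the RR and RD directly as functions of covariates, instead of the odds ratio that logistic regression delivers.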
Curve Fitting and Neuron Firing Patterns
Reversible-jump Markov chain Monte Carlo may be used to fit scatterplot data with cubic splines having unknown numbers of knots and knot locations. Key features of the implementation my colleagues and I have investigated are (i) a fully Bayesian formulation that puts priors on the spline coefficients and (ii) Metropolis-Hastings proposal densities that attempt to place knots close to one another. Simulation results indicate this methodology can produce fitted curves with substantially smaller mean squared-error than competing methods.
Assessment of Scaling in High Frequency Data: Convex Rearrangements in the Wavelet Domain
We give an overview of the notion of regular scaling in data, and of estimators of this regular scaling, on several examples involving high frequency measurements. Next we discuss the importance of wavelet domains and the ability of wavelets to precisely estimate regular scaling.
Statistical Problems in Large Networks
Natural modeling of large networks leads to exponential models with sufficient statistics being such things as the number of triangles or the degree sequence. These look like standard problems but some surprises have emerged. For some models, it is possible to estimate n parameters based on a sample of size one. For other models, with two parameters, maximum likelihood is inconsistent. Many of these models show phase transitions. The new tools required include the emerging theory of graph limits. This is joint work with Sourav Chatterjee and Allan Sly.
Low rank tensor completion
Many problems can be formulated as recovering a low-rank tensor. Although an increasingly common task, tensor recovery remains a challenging problem because of the delicacy associated with the decomposition of higher order tensors. We investigate several convex optimization approaches to low rank tensor completion.
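As a minimal sketch of the recovery problem in its simplest (order-2, rank-1) instance, the following alternates between a hard rank-1 SVD truncation and re-imposing the observed entries; the convex approaches discussed in the talk replace the nonconvex truncation step with a nuclear-norm relaxation. Dimensions, sampling rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-1 ground truth with roughly 60% of entries observed
u, v = rng.standard_normal(20), rng.standard_normal(15)
M = np.outer(u, v)
mask = rng.random(M.shape) < 0.6

X = np.where(mask, M, 0.0)  # zero-fill the missing entries
for _ in range(200):
    # Project onto rank-1 matrices via a truncated SVD
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = s[0] * np.outer(U[:, 0], Vt[0])
    # Re-impose the observed entries
    X = np.where(mask, M, X)

rel_err = np.linalg.norm(X - M) / np.linalg.norm(M)
```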
Random Effects Graphical Regression Models for Biological Monitoring Data
An emerging area of research in ecology is the analysis of functional species assemblages. In essence, the analysis of functional assemblages is concerned with determining and predicting the composition of individuals categorized using different life history traits instead of strict taxa names. We propose a state-space model for the analysis of multiple trait compositions along with site-specific covariate information. A site-specific random effects term allows for modeling extra variability including spatial variability in trait compositions.
Computational Considerations on Neuroengineering
Neuroengineering is an emerging interdisciplinary field with the goal of developing effective, robust devices that interact with the nervous system. These devices may act in closed loop with the nervous system to augment, repair, or even replace aspects of its basic function. Neuroengineering presents a set of interesting computational challenges that may require diverse solutions. For instance, how do we perform efficient computations on large quantities of neural data with severely limited computing resources?
A SMART Stochastic Algorithm for Nonconvex Optimization
We show how to transform any optimization problem that arises from fitting a machine learning model into one that (1) detects and removes contaminated data from the training set and (2) simultaneously fits the trimmed model on the remaining uncontaminated data. To solve the resulting nonconvex optimization problem, we introduce a fast stochastic proximal-gradient algorithm that incorporates prior knowledge through nonsmooth regularization.
The Covariance Structure of Circular Ranks
The linear representation of order statistics is a random permutation matrix which can be applied to obtain the usual covariance structure of ranks and other induced order statistics. In this talk, the algebraic structure of the standard case will be identified and extended to the ordering of observations indexed by circular, uniformly spaced, coordinates. These data are characteristic, for example, of corneal curvature maps used to assess regular astigmatism in the optics of the human eye.
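The standard covariance structure referred to here can be checked exactly by enumeration: for the ranks of n i.i.d. continuous observations, Var(R_i) = (n² − 1)/12 and Cov(R_i, R_j) = −(n + 1)/12. A small sketch (the choice n = 4 is arbitrary):

```python
import itertools
import numpy as np

n = 4
# All n! rank vectors are equally likely under i.i.d. continuous sampling
perms = np.array(list(itertools.permutations(range(1, n + 1))), dtype=float)
cov = np.cov(perms, rowvar=False, bias=True)  # exact population covariance

var_theory = (n**2 - 1) / 12   # 1.25
cov_theory = -(n + 1) / 12     # -5/12
```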
Causal Discovery with Confidence Using Invariance Principles
What is interesting about causal inference? One of the most compelling aspects is that any prediction under a causal model is valid in environments that are possibly very different to the environment used for inference. For example, variables can be actively changed and predictions will still be valid and useful. This invariance is very useful but still leaves open the difficult question of inference. We propose to turn this invariance principle around and exploit the invariance for inference.
Two Related Problems Involving Gaussian Markov Random Fields
Gaussian Markov Random Fields (GMRFs) have been around for a long time; however, it is only in recent years that their computational benefits in Bayesian inference have become clear. In this talk, I'll discuss two related problems which involve GMRFs. The first is the problem of constructing Gaussian fields on triangulated manifolds. By viewing this as finding the solution of a stochastic partial differential equation (SPDE), GMRFs appear as the solutions when solving the SPDE using the "finite element" approach.
Computationally-Intensive Inference in Molecular Population Genetics
Modern molecular genetics generates extensive data which document the genetic variation in natural populations. Such data give rise to challenging statistical inference problems both for the underlying evolutionary parameters and for the demographic history of the population. These problems are of considerable practical importance and have attracted recent attention, with the development of algorithms based on importance sampling (IS) and Markov chain Monte Carlo (MCMC).
Robust Inference Using Higher Order Influence Function
Suppose we obtain $n$ i.i.d. copies of a random vector $O$ with unknown distribution $F(\theta)$, $\theta \in \Theta$. Our goal is to construct honest $100(1-\alpha)\%$ asymptotic confidence intervals (CI) (whose width shrinks to zero with increasing $n$ at the fastest possible rate), through higher order influence functions, for a functional $\psi(\theta)$ in a model that places no restrictions on $F$ other than, perhaps, bounds on both the $L_p$ norms and the roughness (more generally, the complexity) of certain density and conditional expectation functions.
"Insurance" Against Incorrect Inference after Variable Selection
Among statisticians variable selection is a common and very dangerous activity. This talk will survey the dangers and then propose two forms of insurance to guarantee against the damages from this activity.
Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories
Single-cell transcriptome sequencing (scRNA-Seq), which combines high-throughput single-cell extraction and sequencing capabilities, enables the transcriptome of large numbers of individual cells to be assayed efficiently.
Robust Covariance Matrix Estimation with Applications in Finance
This talk provides an introduction to robust estimation of covariance matrices, covering both theoretical and computational aspects, and indicating what we believe to be the best choice of estimator at the present time. We begin with a brief introduction to the main concepts of robustness, focusing primarily on minimizing maximum bias for a class of standard multivariate mixture outlier generating models, while maintaining high efficiency at the nominal model.
Statistical Modeling in Disease Screening and Progression: Case Studies in Prostate Cancer
Many prognostic models for cancer use biomarkers that have utility in early detection. For example, in prostate cancer, models predicting disease-specific survival use serum prostate-specific antigen (PSA) levels. These models are typically interpreted as indicating that detecting disease at a lower threshold of the biomarker is likely to generate a survival benefit. However, lowering the threshold of the biomarker is tantamount to early detection. It is not known whether the existing prognostic models imply a survival benefit under early detection once lead time has been accounted for.
Graph Structured Signal Processing
Signal processing on graphs is a framework for non-parametric function estimation and hypothesis testing that generalizes spatial signal processing to heterogeneous domains. I will discuss the history of this line of research, highlighting common themes and major advances. I will introduce various graph wavelet algorithms, and highlight any known approximation theoretic guarantees. Recently, it has been determined that the fused lasso is theoretically competitive with wavelet thresholding under some conditions, meaning that the fused lasso is also a locally adaptive smoothing procedure.
This talk is a personalized account of John Tukey's contributions to robust statistics, as well as a summary of the maturation of robustness theory and practice to date. I begin by fondly recalling the way in which Tukey and I became acquainted, how he gave me my start in Statistics at Princeton and Bell Laboratories, and the very stimulating research environment of the Mathematics and Statistics Research Center at Bell Laboratories in the 1970s and 1980s.
Inference for Point and Partially Identified Semi-Nonparametric Conditional Moment Models
This paper considers semi-nonparametric conditional moment models where the parameters of interest include both finite-dimensional parameters and unknown functions. We mainly focus on two inferential problems in this framework. First, we provide new methods of uniform inference for the estimates of both finite- and infinite-dimensional components of the parameters and functionals of the parameters. Based on these results, we can, for instance, construct uniform confidence bands for the unknown functions and the partial derivatives of the unknown functions.
Estimating Common Functional Principal Components in a Linear Mixed Effects Model Framework
The emerging area of statistical science known as functional data analysis is concerned with evaluating information on curves or functions. In recent years much of the research emphasis has focused on extending statistical methods from classical settings into the functional domain. For example, functional principal component analysis (FPCA) is analogous to the traditional PCA, except that the observed data are entire functions rather than multivariate vectors.
Constrained Nonparametric Estimation via Mixtures, with an Application in Cancer Genetics
We discuss modeling probability measures constrained to a convex set. We represent measures in such sets as mixtures of simple, known extreme measures, and so the problem of estimating a constrained measure becomes one of estimating an unconstrained mixing measure. Such convex constraints arise in many modeling situations, such as empirical likelihood and modeling under stochastic ordering constraints.
Random Tomography and Structural Biology
Single particle electron microscopy is a powerful method that biophysicists employ to learn about the structure of biological macromolecules. In contrast to the more traditional crystallographic methods, this method images "unconstrained" particles, thus posing a variety of statistical problems. We formulate and study such a problem, one that is essentially of a random tomographic nature, where a structural model for a biological particle is to be constructed given random projections of its Coulomb potential density, observed through the electron microscope.
I will discuss three related topics: estimating manifolds, estimating ridges and estimating persistent homology. All three problems are aimed at the problem of extracting topological information from point clouds. This is joint work with many people.
On generalizations of the log-linear model
Relational models generalize log-linear models for multivariate categorical data in three aspects. The sample space does not have to be a Cartesian product of the ranges of the variables, the effects allowed in the model do not have to be associated with cylinder sets, and the existence of an overall effect present in every cell is not assumed. After discussing examples which motivate these generalizations, the talk will consider estimation and testing in relational models.
Bayesian Survival Modeling of the Time-Dependent Effect of a Time-Dependent Covariate
Patients undergoing organ transplantation are often administered drugs that suppress their immune system, to avoid rejection of the new organ. A consequence of this is that risk of a variety of conditions is elevated until the drugs are eliminated. In this research we seek to characterize risk of post-transplant lymphoma among kidney transplant recipients. Of key interest is the possibly time-varying effect of a time-dependent covariate: transplant status while on the waiting list.
On Standard Inference for GMM with Seeming Local Identification Failure
This paper studies the Generalized Method of Moments (GMM) estimation and inference problem that occurs when the Jacobian of the moment conditions is degenerate. Dovonon and Renault (2013, Econometrica) recently raised a local identification issue stemming from this degenerate Jacobian. The local identification issue leads to a slow rate of convergence of the GMM estimator and a non-standard asymptotic distribution of the over-identification tests. We show that the degenerate Jacobian matrix may contain non-trivial information about the economic model.
Yule's "Nonsense Correlation" Solved!
In this talk, I will discuss how I recently resolved a longstanding open statistical problem, formulated by the British statistician Udny Yule in 1926: to mathematically prove Yule's empirical finding of "nonsense correlation." We solve the problem by analytically determining the second moment of the empirical correlation coefficient of two independent Wiener processes. Using tools from Fredholm integral equation theory, we calculate the second moment of the empirical correlation to obtain a value for the standard deviation of the empirical correlation of nearly 0.5.
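The phenomenon is easy to reproduce by simulation (path length and replication count below are arbitrary): the empirical correlation of two independent random walks, unlike that of i.i.d. pairs, does not concentrate near zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n_steps, n_reps = 500, 2000
corrs = np.empty(n_reps)
for k in range(n_reps):
    # Two independent random walks (discrete approximations to Wiener processes)
    x = np.cumsum(rng.standard_normal(n_steps))
    y = np.cumsum(rng.standard_normal(n_steps))
    corrs[k] = np.corrcoef(x, y)[0, 1]

sd = corrs.std()  # spread close to the theoretical value of roughly 0.5
```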
State Space Mixed Models for Longitudinal Observations with Binary and Binomial Responses
We propose a new class of state space models for longitudinal discrete response data in which the observation equation is specified in an additive form involving both deterministic and dynamic components. These models allow us to explicitly address the effects of trend, seasonal or other time-varying covariates, while preserving the power of state space models in modeling the dynamic pattern of data. Different Markov chain Monte Carlo algorithms to carry out statistical inference for models with binary and binomial responses are developed.
Why Should We Perfect Simulate?
Perfect simulation, or exact sampling, refers to a recently developed set of techniques designed to produce a sequence of independent random quantities whose distribution is guaranteed to follow a given probability law. These techniques are particularly useful in the context of Markov Chain Monte Carlo iterations, but the range of their applicability is growing rapidly. Perfect simulation algorithms provide samples with the desired exact distribution and also explicitly determine how many steps are necessary in the Markov Chain to achieve the desired outcome.
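The best-known perfect-simulation technique is Propp and Wilson's coupling from the past. A minimal sketch for a two-state monotone chain (the transition probabilities a = 0.3, b = 0.2 are arbitrary): chains started from the smallest and largest states are run with shared randomness from further and further in the past until they coalesce, and the coalesced value is an exact draw from the stationary law.

```python
import numpy as np

def update(state, u, a=0.3, b=0.2):
    # Monotone update rule shared by both coupled chains:
    # from 0, move to 1 if u < a; from 1, stay at 1 if u < 1 - b.
    threshold = a if state == 0 else 1.0 - b
    return 1 if u < threshold else 0

def cftp(rng):
    T, us = 1, {}
    while True:
        # Draw (and cache, for reuse) randomness for times -T, ..., -1
        for t in range(-T, 0):
            if t not in us:
                us[t] = rng.random()
        lo, hi = 0, 1  # start from the minimal and maximal states
        for t in range(-T, 0):
            lo, hi = update(lo, us[t]), update(hi, us[t])
        if lo == hi:   # coalescence: an exact stationary draw
            return lo
        T *= 2         # otherwise, restart from twice as far in the past

rng = np.random.default_rng(1)
samples = [cftp(rng) for _ in range(5000)]
freq_one = sum(samples) / len(samples)  # stationary P(state 1) = a/(a+b) = 0.6
```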
Spatial Data Assimilation for Regional Environmental Exposure Studies
Characterizing variation in human exposure to toxic substances over large populations often requires an understanding of the geographic variation in environmental levels of toxicants. This knowledge is essential when the primary routes of exposure are through interactions with environmental media, as opposed to more individual-specific exposure routes (e.g., occupational exposure). In this study, we focus on modeling the spatial variation in the concentration of arsenic, a toxic heavy metal, in air, soil, and water across the state of Arizona.
The Relationship Between Count-Location and Stationary Renewal Models for the Chiasma Process
It is often convenient to define models for the process of chiasma formation at meiosis as stationary renewal models. However, count-location models are also useful, particularly to capture the biological requirement of at least one chiasma per chromosome. The Sturt model and truncated Poisson model are both count-location models with this feature. We show that the truncated Poisson model can also be expressed as a stationary renewal model, while the Sturt model cannot.
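As a quick numerical check of the count distribution in such a model (the rate λ = 2 is chosen arbitrarily), a zero-truncated Poisson can be sampled by rejection, and its mean shifts from λ to λ/(1 − e^{−λ}):

```python
import numpy as np

def r_trunc_pois(rng, lam, size):
    # Rejection sampler for a zero-truncated Poisson: at least one chiasma
    out = np.empty(size, dtype=int)
    i = 0
    while i < size:
        x = rng.poisson(lam)
        if x >= 1:
            out[i] = x
            i += 1
    return out

rng = np.random.default_rng(7)
lam = 2.0
counts = r_trunc_pois(rng, lam, 20000)
mean_theory = lam / (1.0 - np.exp(-lam))  # about 2.31
```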
Statistical Modeling in Setting Air Quality Standards
The earth's atmosphere is a stochastic complex system which includes, amongst other things, pollution fields, some of which derive from anthropogenic sources. Because of their negative health impacts, these fields are now subject to regulation. However, setting the air quality standards needed to regulate them is itself a complex business, one that leads to a need for good models for these fields and for predicting human exposures to them. This talk, drawing on my recent experience and research connected with ozone, will describe:
From Data to Decisions
I will present directions for harnessing predictive models to guide decision making. I will first discuss methods for using machine learning to ideally couple human and computational effort, focusing on several illustrative efforts, including spoken dialog systems and citizen science. Then I will turn to challenges with healthcare and describe work to field statistical models in real-world clinical settings, focusing on the opportunity to join predictions about outcomes with utility models to guide intervention.
A Bayesian information criterion for singular models
We consider approximate Bayesian model choice for model selection problems that involve models whose Fisher-information matrices may fail to be invertible along other competing submodels. Such singular models do not obey the regularity conditions underlying the derivation of Schwarz's Bayesian information criterion (BIC) and the penalty structure in BIC generally does not reflect the frequentist large-sample behavior of their marginal likelihood.
Spatial Statistical Models that Use Flow and Stream Distance
We develop spatial statistical models for stream networks that can estimate relationships between a response variable and other covariates, make predictions at unsampled locations, and predict an average or total for a stream or a stream segment. There have been very few attempts to develop valid spatial covariance models that incorporate flow, stream distance, or both. The application of typical spatial autocovariance functions based on Euclidean distance, such as the spherical covariance model, is not valid when using stream distance.
Statistical Methods for Ambulance Fleet Management
We introduce statistical methods to address two forecasting problems arising in the management of ambulance fleets: (1) predicting the time it takes an ambulance to drive to the scene of an emergency; and (2) space-time forecasting of ambulance demand. These predictions are used for deciding how many ambulances should be deployed at a given time and where they should be stationed, which ambulance should be dispatched to an emergency, and whether and how to schedule ambulances for non-urgent patient transfers.
From safe screening rules to working sets for faster Lasso-type solvers
Convex sparsity promoting regularizations are now ubiquitous to regularize inverse problems in statistics, in signal processing and in machine learning. By construction, they yield solutions with few non-zero coefficients. This point is particularly appealing for Working Set (WS) strategies, an optimization technique that solves simpler problems by handling small subsets of variables, whose indices form the WS. Such methods involve two nested iterations: the outer loop corresponds to the definition of the WS and the inner loop calls a solver for the subproblems.
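A minimal working-set sketch for the Lasso (the notation and tolerances are my own, not from the paper): the outer loop grows the working set with coordinates that violate the KKT optimality conditions, and the inner loop runs coordinate descent restricted to that set.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_working_set(X, y, lam, n_outer=20, n_inner=100):
    n, p = X.shape
    beta = np.zeros(p)
    ws = set()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_outer):
        # Outer loop: add coordinates violating the KKT conditions
        grad = X.T @ (X @ beta - y)
        violators = np.where(np.abs(grad) > lam + 1e-8)[0]
        ws.update(violators.tolist())
        if len(violators) == 0:
            break  # KKT conditions hold everywhere: done
        # Inner loop: coordinate descent on the working set only
        for _ in range(n_inner):
            for j in ws:
                resid_j = y - X @ beta + X[:, j] * beta[j]
                beta[j] = soft_threshold(X[:, j] @ resid_j, lam) / col_sq[j]
    return beta

# On an orthonormal design, the Lasso solution is soft-thresholded least squares
X = np.eye(5)
y = np.array([3.0, -2.0, 0.5, 0.0, 1.5])
beta = lasso_working_set(X, y, lam=1.0)  # -> [2, -1, 0, 0, 0.5]
```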
Nonparametric Estimation of the Time to the Discovery of a New Species
Species inventories that list all species present in a given area are an important tool for both the study of biodiversity and conservation biology. These lists are typically obtained from field studies in which biologists record all the species they can observe over a finite time period. Because of the possible presence of rare, and thus hard-to-observe, species, completeness of such lists can never be guaranteed, regardless of the amount of time and energy spent in compiling them.
Structured Probabilistic Topic Models
Advances in scalable machine learning have made it possible to learn highly structured models on large data sets. In this talk, I will discuss some of our recent work in this direction. I will first briefly review scalable probabilistic topic modeling with stochastic variational inference. I will then discuss two structured developments of the LDA model in the form of tree-structured topic models and graph-structured topic models. I will present our recent work in each of these areas.
Probabilistic Rainfall Forecasting
Rain is vital to life yet potentially extremely destructive, and forecasting is critical to water management. Rain is a difficult atmospheric variable to predict, and traditional deterministic "point" forecasts of rainfall misrepresent the uncertainty associated with the methods by which rainfall is measured, modelled and predicted. Meteorologists increasingly recognize that probabilistic forecasts, that is, issuing a probability density as a forecast rather than a deterministic point value, are desirable.
Semiparametric Methods for Missing Data Problems and Their Applications to Multi-Phase Designs
Advisors: Jon Wellner & Norman Breslow
Assessing Spatial Heterogeneity of Evolutionary Processes: Smoothing with Markov Fields and Jumping on Markov Chains
Signatures of spatial variation, left by evolutionary processes in genomic sequences, provide important information about the function and structure of genomic regions. I discuss statistical methods for detection of such signatures in a Bayesian framework. I start with phylogenetic analysis of recombination in the HIV genome. I present a recombination detection method that allows accurate estimation of recombination break-points from a molecular sequence alignment.
The Blessing of Transitivity in Sparse and Stochastic Networks
The interaction between transitivity and sparsity, two common features in empirical networks, implies that there are local regions of large sparse networks that are dense. We call this the blessing of transitivity and it has consequences for both modeling and inference. Extant research suggests that statistical inference for the Stochastic Blockmodel is more difficult when the edges are sparse. However, this conclusion is confounded by the fact that the asymptotic limit in all of the previous studies is not merely sparse, but also non-transitive.
Consistency and Rates of Convergence for Maximum Likelihood Estimators via Empirical Process Theory
Empirical process methods play an important role in the study of maximum likelihood and minimum contrast estimators in non-parametric and semi-parametric models. In this talk I will begin with a short review of modern Glivenko-Cantelli theorems and inequalities for empirical processes. I will then survey some of the basic inequalities for proving consistency of MLEs, illustrated by several examples from current or recent research projects.
Robust Covariance Functional Inference
Covariance functional inference plays a key role in high dimensional statistics. A wide variety of statistical methods, including principal component analysis, Gaussian graphical model estimation, and multiple linear regression, are intrinsically inferring covariance functionals. In this talk, I will present a unified framework for analysis of complex (non-Gaussian, heavy-tailed, dependent, ...) high dimensional data. It connects covariance functional inference to robust statistics.
Confidence Sets for Phylogenetic Trees