Smith Hall

### Smith Hall

Update 4/25/2019: Location of this seminar has been moved to SMI 211.

Bayesian hierarchical modeling is a powerful tool for demography and climate science. In this talk we will focus on its use for accounting for uncertainty about past demographic quantities in population projections. Since the 1940s, population projections have in most cases been produced using the deterministic cohort component method. However, in 2015, for the first time, in a major advance, the United Nations issued official probabilistic population projections for all countries based on Bayesian hierarchical models for total fertility and life expectancy.

A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation (MLE) may be implemented using numerical integration, the approach becomes computationally intensive. In contrast, the score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions on the m-dimensional Euclidean space R^m.
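As a rough illustration of the closed-form property, the sketch below works out the score matching estimate for a univariate Gaussian, written as an exponential family with log-density θ1·x + θ2·x² plus a constant. The function name and the sample data are hypothetical, and the Gaussian case is chosen only for simplicity; it is not taken from the talk.

```python
def gaussian_score_matching(xs):
    """Closed-form score matching estimate for log p(x) = t1*x + t2*x**2 + const.

    The model score is psi(x) = t1 + 2*t2*x, with derivative 2*t2, so the
    empirical objective is mean(0.5 * psi(x)**2) + 2*t2. It is quadratic in
    (t1, t2), and setting its gradient to zero gives the formulas below.
    No normalizing constant is evaluated at any point.
    """
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment
    m2 = sum(x * x for x in xs) / n  # second sample moment
    var = m2 - m1 * m1               # (biased) sample variance
    theta2 = -1.0 / (2.0 * var)
    theta1 = m1 / var
    return theta1, theta2


# Hypothetical data, used only to exercise the function.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
theta1, theta2 = gaussian_score_matching(xs)

# Convert natural parameters back to mean and variance.
mu = -theta1 / (2.0 * theta2)
sigma2 = -1.0 / (2.0 * theta2)
```

In this simple case the estimate coincides with the sample mean and the (biased) sample variance, so the example only shows the mechanics and not an advantage over MLE.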

Green Dot is a movement, a program, and an action. The aim of Green Dot is to prevent and reduce sexual assault & relationship violence at UW by engaging students as leaders and active bystanders who step in, speak up, and interrupt potential acts of violence. The Green Dot movement is about gaining a critical mass of students, staff and faculty who are willing to do their small part to actively and visibly reduce power-based personal violence at UW.

Hawkes processes have been a popular point process model for capturing mutual excitation of discrete events. In the network setting, they can capture the mutual influence between nodes, which has a wide range of applications in neural science, social networks, and crime data analysis. In this talk, I will present a statistical change-point detection framework to detect, in real time, a change in the influence using streaming discrete events.

The celebrated Grenander (1956) estimator is the maximum likelihood estimator of a decreasing density function. In contrast to alternative nonparametric density estimators, the Grenander estimator does not require any smoothing parameters and is often viewed as a fully automatic procedure. However, the monotonic density assumption might be questionable. While testing qualitative constraints such as monotonicity is difficult in general, we show that a likelihood ratio test statistic Kₙ has an incredibly simple asymptotic null distribution:

Randomization is a basis for inferring treatment effects with minimal additional assumptions. Appropriately using covariates in randomized experiments will further yield more precise estimators. In his seminal work Design of Experiments, R. A. Fisher suggested blocking on discrete covariates in the design stage and conducting the analysis of covariance (ANCOVA) in the analysis stage. In fact, blocking can be embedded into a wider class of experimental designs called rerandomization, and the classical ANCOVA can be extended to more general regression-adjusted estimators.

At Amazon’s Inventory Planning and Control Laboratory (IPC Lab) we run randomized controlled trials (RCTs) that evaluate the efficacy of in-production buying and supply chain policies on important business metrics. Our customers are leading supply chain researchers and business managers within Amazon, and our mission is to help them best answer the question, ‘Should I roll out my policy?’ In this talk we discuss how we navigate multiple obstacles to fulfilling our mission.

Deep neural nets have become in recent years a widespread practical technology, with impressive performance in computer vision, speech recognition, natural language processing and many other applications. Deploying deep nets in mobile phones, robots, sensors and IoT devices is of great interest. However, state-of-the-art deep nets for tasks such as object recognition are too large to be deployed in these devices because of the computational limits they impose in CPU speed, memory, bandwidth, battery life or energy consumption.

Causal inference is a challenging problem because causation cannot be established from the observational data alone. Researchers typically rely on additional sources of information to infer causation from association. Such information may come from powerful designs such as randomization, or background knowledge such as information on all confounders. However, perfect designs or background knowledge required for establishing causality may not always be available in practice.

The identification of new rare signals in data, the detection of a sudden change in a trend, and the selection of competing models are among the most challenging problems in statistical practice.

Many scientific areas are faced with the challenge of extracting information from large, complex, and highly structured data sets. A great deal of modern statistical work focuses on developing tools for handling such data. In this work we present a new subfield of functional data analysis (FDA), which we call Manifold Data Analysis, or MDA. MDA is concerned with the statistical analysis of samples where one or more variables measured on each unit is a manifold, thus resulting in as many manifolds as we have units.

Structural equation models are multivariate statistical models that are defined by specifying noisy functional relationships among random variables. This talk treats the classical case of linear relationships and additive Gaussian noise terms. Each linear structural equation model is associated with a graph and corresponds to a polynomially parametrized set of positive definite covariance matrices.

Consider the regression model Y=g0(X)+E, where E is the error term, and g0:R^k -> R is the unknown regression function to be estimated from independent observations of (X,Y). Furthermore we have a countable collection of models (classes of candidate regression functions of finite VC dimension) of growing complexity. The larger the model, the better the approximation error, but the worse the estimation error. In order to balance both errors, we propose to estimate g0 by means of penalised least squares, where the penalty is proportional to the VC-dimension of the model.

We model the stochastic process of trades in a limit order book market as a marked point process. We propose a semiparametric model for the conditional distribution given the past, attempting to capture the effect of the recent past in a nonparametric way and the effect of the more distant past using a parametric time series model. Our framework provides more flexibility than the most commonly used family of models.

When survival data are collected as part of a prevalent cohort study, the recruited cases have already experienced their initiating event. These prevalent cases are then followed for a fixed period of time at the end of which the subjects will either have failed or have been censored. When interest lies in estimating the survival distribution, from onset, of subjects with the disease, one must take into account that the survival times of the cases in a prevalent cohort study are left truncated.

In this talk, I will try to trace some of the ideas that led from Herbert Costner's early work with multiple indicator models to simple models of sibling resemblance in social and economic standing, and to more elaborate models that combine direct and indirect measurement of family influence.

### Abstract:

We consider minimization of stochastic functionals that are compositions of a (potentially) non-smooth convex function h and smooth function c. We develop two stochastic methods--a stochastic prox-linear algorithm and a stochastic (generalized) sub-gradient procedure--and prove that, under mild technical conditions, each converges to first-order stationary points of the stochastic objective.

I will review and discuss some of the different themes of regularized estimation of the population covariance matrix:

1. Why estimate it and in what norm?
2. Pathologies of the empirical covariance matrix
3. Notions of sparsity and methods of regularization
4. Results of B-Levina (2004, 2006)
5. Some future directions

This talk presents an alternative profile likelihood estimation theory. By introducing a new parametrization, we improve on the seminal work of Murphy and van der Vaart (2000) in 2 ways: we prove the no bias condition in a general semi-parametric model context, and deal with the direct quadratic expansion of the profile likelihood rather than an approximate one. In addition, we discuss a difficulty which we encounter in the profile likelihood estimation.

In radiology, magnetic resonance imaging (MRI) and magnetic resonance spectroscopic imaging (MRSI) play an increasingly important role. However, the wealth of data available to the radiologist makes it more difficult to extract the relevant information. One way to summarise information from several congruent images is to show a segmented image, i.e. an image where pixels are clustered.

Authors: Yanqin Fan, Emmanuel Guerre, and Dongming Zhu

For many ML problems, labeled data is readily available. The algorithm is the bottleneck. This is the ML researcher's paradise! Problems that have fairly stable distributions and can accumulate large quantities of human labels over time have this property: Vision, Speech, Autonomous driving. Problems that have shifting distribution and an infinite supply of labels through history are blessed in the same way: click prediction, data analytics, forecasting. We call these problems the "head" of ML.

A major challenge in data management is how to manage uncertain data. Many reasons for the uncertainty exist: the data may be extracted automatically from text, it may be derived from the physical world such as RFID data, it may be integrated using fuzzy matches, or it may be the result of complex stochastic models. Whatever the reason for the uncertainty, a data management system needs to offer predictable performance to queries over large instances of uncertain data.

We were motivated by the problem of assessing how inequality in income is influenced by the corresponding inequality in some other related variable (say, the number of years of formal education completed). More generally, consider the pseudo-Lorenz curve of a nonnegative r.v. Y relative to (i.e., with respect to the ordering of) another related nonnegative r.v. X. It is shown that this pseudo-Lorenz curve L(Y/X) always lies above the Lorenz curve L(Y) of Y.

Statistics is a field where the goal is to extract the most information out of the least amount of data. Traditionally, the data is fixed and small, and the goals are centered around efficient estimation and inference from the full dataset. Computation and storage issues are often an afterthought. With the rise of big data ranging from terabytes to petabytes, a new set of storage and computational issues arises, as simply reading the data can take hours to days of CPU time.

I will be presenting a talk on my dissertation research which consisted of the statistical analysis of two interesting astronomical applications involving point process data.

For signal and image classification problems, such as the ones in medical or geophysical diagnostics and military applications, extracting relevant features is one of the most important tasks. As an attempt to automate the feature extraction procedure and to understand what the critical features for classification are, we developed the so-called local discriminant basis (LDB) method which rapidly selects an orthonormal basis suitable for signal/image classification problems from a large collection of orthonormal bases (e.g., wavelet packets and local trigonometric bases).

### Abstract:

This work is motivated by the problem of ungauged basins, and the aim is to make inference about basins based on both point observations from precipitation gauges and areal measurements from other basins in the same area. As precipitation (and evaporation) are non-stationary spatial processes due to topology, we set up a spatial non-stationary model with elevation as an explanatory variable in the dependency structure.

The most scientific and least controversial claim of de Finetti's subjective philosophy of probability is that the rules of Bayesian inference can be derived from a system of axioms for rational decision making that does not presuppose existence of probability. In fact, de Finetti's argument is fatally flawed. The error is irreparable. The slides in PowerPoint and PDF are available at http://www.math.washington.edu/~burdzy/Philosophy/.

Dependence in the tail of the distribution can differ from that in the bulk of the distribution. A basic tenet of a univariate extreme value analysis is to discard the bulk of the data and only analyze the data considered to be extreme. This is true for multivariate problems as well. We will first introduce a framework for describing tail dependence. The probabilistic framework of regular variation has strong ties to classical extreme value theory and provides a framework for describing tail dependence.

I will discuss aspects of data analysis and modelling arising from a number of clinical studies that aim to integrate gene expression, and other forms of molecular data, into predictive modelling of clinical outcomes and disease states. Some of our work on empirical and model based approaches to defining underlying factor structure in large-scale expression data, and the use of estimated factors in predictive regression and classification tree models, will be reviewed.

This talk will focus on nonstationary time series, from both a methodological and applied perspective. On the methodology side, I will discuss new stochastic models for capturing structure in bivariate data, by representing the series as complex-valued. This representation allows for novel ways of capturing features that are multiscale, anisotropic and/or nonstationary. I will also present new methodology and theory for maximum likelihood inference in the frequency-domain, specifically by providing a method for removing estimation error from the Whittle likelihood.

I introduce a Bayesian nonparametric framework for modeling ordinal regression relationships which evolve in discrete time. The motivating application involves a key problem in fisheries research on estimating relationships between age, length and maturity, the latter recorded on an ordinal scale, across time. The methodology builds from nonparametric mixture modeling for the joint stochastic mechanism of covariates and latent continuous responses.

Consider a linear regression model

Y = Xβ + z; z ~ N(0, I_n); X = X_{n,p};

where both p and n are large but p > n. The vector β is unknown but is sparse in the sense that only a small proportion of its coordinates is nonzero, and we are interested in identifying these nonzero ones. We model the coordinates of β as samples from a two-component mixture (1-ϵ)υ_0 + ϵπ, and the rows of X as samples from N(0, (1/n)Ω), where υ_0 is the point mass at 0, π is a distribution, and Ω is a p by p correlation matrix which is unknown but is presumably sparse.

Advisors: Peter Guttorp & Jim Hughes

The Bonferroni adjustment, or the union bound, is commonly used to develop and study rate optimal statistical methods in high-dimensional problems. However, in practice, the Bonferroni adjustment is overly conservative. Extreme value theory has been shown to provide more accurate multiplicity adjustments in a number of settings, but only on an ad hoc basis.

This presentation will describe some of the problems faced and methods used by statisticians at Google:

- A primary dimension of search quality is the relevance of search results to the search query. Preference rank allows us to convert pairwise comparisons into a ranking of search results.
- Through the AdSense program, Google delivers targeted advertising on third-party web sites, which we refer to as publishers. Publisher scores are a method of ranking publishers by their effectiveness as an ad delivery platform.

In this paper, the authors propose and investigate two new methods for achieving less bias in non-parametric regression and use simulations to compare the bias, variance, and mean squared error from the second and preferred of these two methods to the biases, variances, and mean squared errors of the local constant, local linear, and local cubic non-parametric regression estimators. The two new methods proposed by the authors have bias of order h^4 where h is the estimator's smoothing parameter, in contrast to the basic kernel estimator's bias of order h^2.

### Abstract:

Dynamic treatment regimes (DTRs) are sequential decision regimes for individual patients that can adapt over time to an evolving illness. The goal is to find the DTRs tailored to individual characteristics that lead to the best long term outcome if implemented. In many clinical applications, it is desirable to provide a fixed decision rule over time for the patients.

The availability of large observational databases allows empirical scientists to consider estimating treatment effects without conducting costly and/or unethical experiments where the treatment would be randomized. The Neyman-Rubin model (potential outcome framework) and the associated matching estimators have become increasingly popular, because they allow for the non-parametric estimation of average treatment effects.

Novel methodological questions are being generated in the biological sciences, requiring the integration of different concepts, methods, tools and data types. Bayesian methods that employ variable selection have been particularly successful for genomic applications, as they can handle situations where the number of measured variables is much greater than the number of observations. In this talk I will focus on models that integrate experimental data from different platforms together with prior knowledge.

Markov Random Fields are extremely useful and generally applicable for probabilistic modelling of a wide range of systems. We'll review methods for performing inference calculations (most likely configuration and marginal probabilities) on MRFs. Unfortunately, for many tasks, these basic calculations are computationally infeasible. We'll discuss the limitations of standard computation methods and the graph-theoretic properties related to computational complexity.

Bayes factors have been proven to be very useful when testing statistical hypotheses with inequality (or order) constraints and/or equality constraints between the parameters of interest. Two useful properties of the Bayes factor are its intuitive interpretation as the relative evidence in the data between two hypotheses and the fact that it can straightforwardly be used for testing multiple hypotheses. The choice of the prior, which reflects one's knowledge about the unknown parameters before observing the data, has a substantial effect on the Bayes factor.

In many real-world statistical problems, we observe a large number of potentially explanatory variables of which a majority may be irrelevant. For this type of problem, controlling the false discovery rate (FDR) guarantees that most of the discoveries are truly explanatory and thus replicable. In this talk, we propose a new method named SLOPE to control the FDR in sparse high-dimensional linear regression. This computationally efficient procedure works by regularizing the fitted coefficients according to their ranks: the higher the rank, the larger the penalty.
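The rank-dependent penalty can be sketched as a sorted L1 norm: the largest coefficient magnitude is paired with the largest penalty weight, the second largest with the second largest weight, and so on. The code below is a minimal illustration, not the speaker's implementation. The function names and the coefficient vector are hypothetical, and the Benjamini-Hochberg-style weight sequence λ_i = Φ⁻¹(1 − i·q/(2p)) is an assumption drawn from the SLOPE literature rather than from this abstract.

```python
from statistics import NormalDist


def bh_lambdas(p, q):
    """Decreasing weights lambda_i = Phi^{-1}(1 - i*q/(2p)) for i = 1..p.

    Assumed BH-style choice; q plays the role of the target FDR level.
    """
    normal = NormalDist()
    return [normal.inv_cdf(1.0 - i * q / (2.0 * p)) for i in range(1, p + 1)]


def sorted_l1_penalty(beta, lambdas):
    """Sum over i of lambda_i times the i-th largest |beta_j|."""
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    return sum(lam * mag for lam, mag in zip(lambdas, magnitudes))


# Hypothetical coefficient vector, used only to exercise the functions.
beta = [0.0, -3.0, 0.5, 2.0]
lambdas = bh_lambdas(p=len(beta), q=0.1)
penalty = sorted_l1_penalty(beta, lambdas)
```

Because the magnitudes are sorted before weighting, the penalty depends only on the ranks of the coefficients, not on their order in the vector. With all weights equal it reduces to the ordinary lasso penalty.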

Antibodies must recognize a great diversity of antigens to protect us from infectious disease. The binding properties of antibodies are determined by the sequences of their corresponding B cell receptors (BCRs). These BCR sequences are created in "draft" form by VDJ recombination, which randomly selects and deletes from the ends of V, D, and J genes, then joins them together with additional random nucleotides.

Probabilistic forecasts of wind vectors are becoming critical as interest grows in wind as a clean and renewable source of energy, in addition to a wide range of other uses, from aviation to recreational boating. Unlike other common forecasting problems, which deal with univariate quantities, statistical approaches to wind vector forecasting must be based on bivariate distributions. The prevailing paradigm in weather forecasting is to issue deterministic forecasts based on numerical weather prediction models.

Department of Mathematics Optimization Seminar

Solution procedures for stochastic programming problems, statistical estimation problems (constrained or not), stochastic optimal control problems and other stochastic optimization problems often rely on sampling. The justification for such an approach passes through 'consistency.' A comprehensive, satisfying and powerful technique is to obtain the consistency of the optimal solutions, statistical estimators, controls, etc., as a consequence of the consistency of the stochastic optimization problems themselves.

I will give a brief overview of statistics at Google, covering topics like Experimentation, measuring long term effects of treatments in an online system, Google Consumer Surveys and, if time allows, Causal Impact. I will also touch on how we handle big data at Google, what it's like to work here, and some tips for statisticians on interviewing at companies like Google.

In the analysis of lifetime data, a key object of interest is the hazard function, or instantaneous failure rate. One natural assumption is that the hazard is bathtub-shaped, or U-shaped (i.e., first decreasing, then increasing). In particular, this is often the case in reliability engineering or human mortality.

HME (Hierarchical Mixture of Experts) is a tree-structured architecture for supervised learning. It is characterized by soft multi-way probabilistic splits, generally based on linear functions of input values, and by linear or logistic fits at the terminal nodes (called Experts in the HME literature) rather than constant functions as in CART. The statistical model underlying HME is a hierarchical mixture model, which allows for maximum likelihood estimation of the parameters using EM methods.

Subsampling and bootstrap methods have been suggested in the literature to nonparametrically estimate the variance and distribution of statistics computed from spatial data. Usually stationary data are required to ensure that the methods work. However, in empirical applications the assumption of stationarity often must be rejected. This talk presents consistent bootstrap and subsampling methods to estimate the variance and distributions of statistics based on non-stationary spatial lattice data. Applications to forestry are also discussed.

In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are associated with the response, while controlling the false discovery rate (FDR) to ensure that our results are reliable and replicable. The knockoff filter is a variable selection procedure for linear regression, proven to control FDR exactly under any type of correlation structure in the regime where n>p (sample size > number of variables).

This talk will review some ways of transforming point processes, including smoothing, thinning, superposition, rescaling, and tessellation. Ways in which each of these may be used in the analysis of point process data will be examined, especially in relation to the problem of estimating wildfire hazard. We will explore in particular an important computational geometry problem involving tessellations, namely the estimation of point locations from piecewise constant image data via Dirichlet tessellation inversion.

Applications of statistical machine learning increasingly involve datasets with rich hierarchical, temporal, spatial, or relational structure. Bayesian nonparametric models offer the promise of effective learning from big datasets, but standard inference algorithms often fail in subtle and hard-to-diagnose ways. We explore this issue via variants of a popular and general model family, the hierarchical Dirichlet process.

In ecology, the niche of a species is usually defined as a multidimensional hyper-volume in which a species maintains a viable population (Hutchinson 1957). The community structure may be shaped by resource partitioning between co-occurring species, so quantifying the degree of this partitioning (i.e., niche overlap) is very important when studying species co-existence (Geange et al. 2010). The niche space is often described by multiple axes or variables.

I will first review well-known differences between odds ratios, relative risks and risk differences. These results motivate the development of methods, analogous to logistic regression, for estimating the latter two quantities. I will then describe simple parametrizations that facilitate maximum-likelihood estimation of the relative risk and risk-difference. Further, these parametrizations allow for doubly-robust g-estimation of both quantities. (Joint work with James Robins, Harvard School of Public Health)

Reversible-jump Markov chain Monte Carlo may be used to fit scatterplot data with cubic splines having unknown numbers of knots and knot locations. Key features of the implementation my colleagues and I have investigated are (i) a fully Bayesian formulation that puts priors on the spline coefficients and (ii) Metropolis-Hastings proposal densities that attempt to place knots close to one another. Simulation results indicate this methodology can produce fitted curves with substantially smaller mean squared-error than competing methods.

The general theme of my research in recent years is spatio-temporal modeling and sparse recovery with high dimensional data under measurement error. In this talk, I will discuss several computational and statistical convergence results on graph and sparse vector recovery problems. Our methods are applicable to many application domains such as neuroscience, geoscience and spatio-temporal modeling, genomics, and network data analysis. I will present theory, simulation and data examples. Part of this talk is based on joint work with Mark Rudelson.

In many stochastic models, in particular Markov chains in discrete or continuous time and Markov renewal processes, a Markov chain is present either directly or indirectly through some form of embedding. The analysis of many problems of interest associated with these models, e.g., stationary distributions, moments of first passage time distributions and moments of occupation time random variables, often concerns the solution of a system of linear equations involving I - P, where P is the transition matrix of a finite, irreducible, discrete time Markov chain.
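To make the role of I - P concrete, the following sketch computes a stationary distribution by solving π(I - P) = 0 together with the normalization Σπ_i = 1. It is a minimal pure-Python illustration using plain Gaussian elimination, not the method discussed in the talk; the function names and the 3-state transition matrix are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def stationary_distribution(p):
    """Solve pi (I - P) = 0 with sum(pi) = 1 for an irreducible chain."""
    n = len(p)
    # Equation i reads pi_i - sum_j pi_j * p[j][i] = 0, i.e. (I - P)^T pi = 0.
    a = [[(1.0 if i == j else 0.0) - p[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # I - P is singular, so one redundant equation is replaced by normalization.
    a[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    return solve_linear(a, b)


# Hypothetical 3-state transition matrix (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
pi = stationary_distribution(P)
```

Replacing one equation with the normalization constraint is only one way to handle the singularity of I - P; it is used here because it keeps the example short.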

In longitudinal studies, the usual modeling assumptions for multivariate analyses don't always hold up so well. One way to treat this is to use non-parametric approaches. In the paper I will be presenting, the authors analyzed tumor volume in rats as a function of lipids in their diet. The data were highly heteroscedastic and strongly correlated with time. To compare lipid diets, randomization F-tests were used. Then, local polynomial smoothing was used to create tumor growth curves for each diet, as well as confidence intervals that account for the serially correlated data.

Image analysis has in the last decade experienced a revolution via the development of new tools for the representation and analysis of local image features. At the heart of these developments is the construction of suitable local representations of structure, via decompositions in a set of localized functions. The chosen decomposition then forms the setting for further analysis and/or estimation methods. In particular, compression of a given representation ensures that most decomposition coefficients are of negligible magnitude, and this often simplifies the analysis considerably.

Cluster analysis based on non-parametric density estimation represents an approach to the clustering problem whose roots date back several decades, but it is only in recent times that this approach could actually be developed. The talk presents one proposal within this approach, which is among the few that have been brought up to the operational stage.

Panel data has important advantages over purely cross-sectional or time-series data in studying many economic problems, because it contains information about both the intertemporal dynamics and the individuality of the entities being investigated. A commonly used class of models for panel studies identifies the parameters of interest through an overdetermined system of estimating equations. Two important problems that arise in such models are the following: (1) It may not be clear a priori whether certain estimating equations are valid.

Causal inference research in statistics has been largely concerned with estimating the effect of treatment (e.g. personalized tutoring) on outcomes (e.g., test scores) under the assumption of "lack of interference"; that is, the assumption that the outcome of an individual does not depend on the treatment assigned to others. Moreover, whenever its relevance is acknowledged (e.g., study groups), interference is typically dealt with as an uninteresting source of variation in the data.

We present a Bayesian approach for modeling multivariate, dependent functional data. To account for the three dominant structural features in the data--functional, time dependent, and multivariate components--we extend hierarchical dynamic linear models for multivariate time series to the functional data setting. We also develop Bayesian spline theory in a more general constrained optimization framework.

Gaussian Markov Random Fields (GMRFs) have been around for a long time; however, it is only in recent years that their computational benefits in Bayesian inference have become clear. In this talk, I'll discuss two related problems which involve GMRFs. The first is the problem of constructing Gaussian fields on triangulated manifolds. By viewing this as finding the solution of a stochastic partial differential equation (SPDE), the GMRFs appear as the solutions when solving the SPDE using the "finite element" approach.

Modern molecular genetics generates extensive data which document the genetic variation in natural populations. Such data give rise to challenging statistical inference problems both for the underlying evolutionary parameters and for the demographic history of the population. These problems are of considerable practical importance and have attracted recent attention, with the development of algorithms based on importance sampling (IS) and Markov chain Monte Carlo (MCMC).

Models of network data have witnessed a surge of interest in statistics and related areas. Such data arise in the study of insurgent and terrorist networks, contact networks facilitating the spread of infectious diseases, social networks, the World Wide Web, and other areas.

We give an overview of the notion of regular scaling in data and of estimators of this regular scaling, using several examples involving high frequency measurements. Next we discuss the importance of wavelet domains and the ability of wavelets to precisely estimate regular scaling (monofractality) and some deviations from regular scaling (time-dependent Hursts, multifractality, etc).

Natural modeling of large networks leads to exponential models with sufficient statistics being such things as the number of triangles or the degree sequence. These look like standard problems but some surprises have emerged. For some models, it is possible to estimate n parameters based on a sample of size one. For other models, with two parameters, maximum likelihood is inconsistent. Many of these models show phase transitions. The new tools required include the emerging theory of graph limits. This is joint work with Sourav Chatterjee and Allan Sly.

An emerging area of research in ecology is the analysis of functional species assemblages. In essence, the analysis of functional assemblages is concerned with determining and predicting the composition of individuals categorized using different life history traits instead of strict taxa names. We propose a state-space model for the analysis of multiple trait compositions along with site-specific covariate information. A site-specific random effects term allows for modeling extra variability including spatial variability in trait compositions.

Neuroengineering is an emerging interdisciplinary field with the goal of developing effective, robust devices that interact with the nervous system. These devices may act in closed loop with the nervous system to augment, repair, or even replace aspects of its basic function. Neuroengineering presents a set of interesting computational challenges that may require diverse solutions. For instance, how do we perform efficient computations on large quantities of neural data with severely limited computing resources?

The linear representation of order statistics is a random permutation matrix which can be applied to obtain the usual covariance structure of ranks and other induced order statistics. In this talk, the algebraic structure of the standard case will be identified and extended to the ordering of observations indexed by circular, uniformly spaced, coordinates. These data are characteristic, for example, of corneal curvature maps used to assess regular astigmatism in the optics of the human eye.

What is interesting about causal inference? One of the most compelling aspects is that any prediction under a causal model is valid in environments that are possibly very different to the environment used for inference. For example, variables can be actively changed and predictions will still be valid and useful. This invariance is very useful but still leaves open the difficult question of inference. We propose to turn this invariance principle around and exploit the invariance for inference.

Many emerging applications of big data involve time series data. In this talk, I will discuss a collection of machine learning and data mining approaches to effectively analyze and model large-scale time series and spatio-temporal data. Experimental results will be shown to demonstrate the effectiveness of our models in healthcare and climate applications.

The emerging area of statistical science known as functional data analysis is concerned with evaluating information on curves or functions. In recent years much of the research emphasis has focused on extending statistical methods from classical settings into the functional domain. For example, functional principal component analysis (FPCA) is analogous to the traditional PCA, except that the observed data are entire functions rather than multivariate vectors.

We discuss modeling probability measures constrained to a convex set. We represent measures in such sets as mixtures of simple, known extreme measures, and so the problem of estimating a constrained measure becomes one of estimating an unconstrained mixing measure. Such convex constraints arise in many modeling situations, such as empirical likelihood and modeling under stochastic ordering constraints.

No Seminar

Suppose we obtain $n$ i.i.d. copies of a random vector $O$ with unknown distribution $F(\theta)$, $\theta \in \Theta$. Our goal is to construct honest $100(1 - \alpha)\%$ asymptotic confidence intervals (CIs) (whose width shrinks to zero with increasing $n$ at the fastest possible rate), through higher order influence functions, for a functional $\psi(\theta)$ in a model that places no restrictions on $F$ other than, perhaps, bounds on both the $L_p$ norms and the roughness (more generally, the complexity) of certain density and conditional expectation functions.

Among statisticians variable selection is a common and very dangerous activity. This talk will survey the dangers and then propose two forms of insurance to guarantee against the damages from this activity.

This talk provides an introduction to robust estimation of covariance matrices, covering both theoretical and computational aspects, and indicating what we believe to be the best choice of estimator at the present time. We begin with a brief introduction to the main concepts of robustness, focusing primarily on minimizing maximum bias for a class of standard multivariate mixture outlier generating models, while maintaining high efficiency at the nominal model.

Many prognostic models for cancer use biomarkers that have utility in early detection. For example, in prostate cancer, models predicting disease-specific survival use serum prostate-specific antigen (PSA) levels. These models are typically interpreted as indicating that detecting disease at a lower threshold of the biomarker is likely to generate a survival benefit. However, lowering the threshold of the biomarker is tantamount to early detection. It is not known whether the existing prognostic models imply a survival benefit under early detection once lead time has been accounted for.

This talk is a personalized account of John Tukey's contributions to robust statistics, as well as a summary of the maturation of robustness theory and practice to date. I begin by fondly recalling the way in which Tukey and I became acquainted, how he gave me my start in Statistics at Princeton and Bell Laboratories, and the very stimulating research environment of the Mathematics and Statistics Research Center at Bell Laboratories in the 1970s and 1980s.

This paper considers semi-nonparametric conditional moment models where the parameters of interest include both finite-dimensional parameters and unknown functions. We mainly focus on two inferential problems in this framework. First, we provide new methods of uniform inference for the estimates of both finite- and infinite-dimensional components of the parameters and functionals of the parameters. Based on these results, we can, for instance, construct uniform confidence bands for the unknown functions and the partial derivatives of the unknown functions.

Scientific questions about networks are often comparative: we want to know whether the difference between two networks is just noise, and, if not, how their structures differ. I'll describe a general framework for network comparison, based on testing whether the distance between models estimated from separate networks exceeds what we'd expect based on a pooled estimate.

Characterizing variation in human exposure to toxic substances over large populations often requires an understanding of the geographic variation in environmental levels of toxicants. This knowledge is essential when the primary routes of exposure are through interactions with environmental media, as opposed to more individual-specific exposure routes (e.g., occupational exposure). In this study, we focus on modeling the spatial variation in the concentration of arsenic, a toxic heavy metal, in air, soil, and water across the state of Arizona.

It is often convenient to define models for the process of chiasma formation at meiosis as stationary renewal models. However, count-location models are also useful, particularly to capture the biological requirement of at least one chiasma per chromosome. The Sturt model and truncated Poisson model are both count-location models with this feature. We show that the truncated Poisson model can also be expressed as a stationary renewal model, while the Sturt model cannot.
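As a rough illustration of what a count-location model with at least one chiasma looks like, the sketch below draws a chiasma count from a Poisson distribution conditioned on being at least one, and then places that many chiasmata along the chromosome. The uniform placement on a unit-length chromosome, the rate value, and the function names are assumptions made for illustration only; they are not taken from the talk.

```python
import math
import random


def truncated_poisson_pmf(k, lam):
    """P(N = k) for a Poisson(lam) count conditioned on N >= 1."""
    if k < 1:
        return 0.0
    return math.exp(-lam) * lam**k / (math.factorial(k) * (1.0 - math.exp(-lam)))


def sample_chiasmata(lam, rng, max_count=100):
    """Draw a chiasma count (truncated Poisson), then place the chiasmata."""
    # Inverse-CDF sampling of the count over k = 1, 2, ...
    # max_count is a safety cap so rounding error can never cause an endless loop.
    u = rng.random()
    k = 1
    cdf = truncated_poisson_pmf(1, lam)
    while u > cdf and k < max_count:
        k += 1
        cdf += truncated_poisson_pmf(k, lam)
    # Assumption: locations are i.i.d. uniform on a chromosome of unit length.
    return sorted(rng.random() for _ in range(k))


rng = random.Random(0)
draws = [sample_chiasmata(1.5, rng) for _ in range(1000)]
mean_count = sum(len(d) for d in draws) / len(draws)
print(f"average number of chiasmata per chromosome: {mean_count:.2f}")
```

Every simulated chromosome carries at least one chiasma by construction, which is the biological requirement the abstract refers to.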

One of the greatest challenges ecologists face is predicting how climate change will affect the organisms with which we share our planet. Ecological theory predicts that species' current distributions are determined by their climatic niches (i.e., fitness as a function of climate). Statistical models relating species' geographic distributions to climate (SDMs, or species distribution models) are therefore used to predict shifts in species distributions with climate change.

Single particle electron microscopy is a powerful method that biophysicists employ to learn about the structure of biological macromolecules. In contrast to the more traditional crystallographic methods, this method images "unconstrained" particles, thus posing a variety of statistical problems. We formulate and study such a problem, one that is essentially of a random tomographic nature, where a structural model for a biological particle is to be constructed given random projections of its Coulomb potential density, observed through the electron microscope.

I will discuss three related topics: estimating manifolds, estimating ridges and estimating persistent homology. All three problems are aimed at the problem of extracting topological information from point clouds. This is joint work with many people.

Bio:
Larry Wasserman is Professor, Department of Statistics and Machine Learning Department, Carnegie Mellon University. He graduated from the University of Toronto in 1988. After a brief stint as an animal trainer he took a position at Carnegie Mellon and has been there ever since except for a brief sabbatical in Uzbekistan.

Patients undergoing organ transplantation are often administered drugs that suppress their immune system, to avoid rejection of the new organ. A consequence of this is that the risk of a variety of conditions is elevated until the drugs are eliminated. In this research we seek to characterize the risk of post-transplant lymphoma among kidney transplant recipients. Of key interest is the possibly time-varying effect of a time-dependent covariate: transplant status while on the waiting list.

This paper studies the Generalized Method of Moments (GMM) estimation and inference problem that occurs when the Jacobian of the moment conditions is degenerate. Dovonon and Renault (2013, Econometrica) recently raised a local identification issue stemming from this degenerate Jacobian. The local identification issue leads to a slow rate of convergence of the GMM estimator and a non-standard asymptotic distribution of the over-identification tests. We show that the degenerate Jacobian matrix may contain non-trivial information about the economic model.

A new class of state space models for longitudinal discrete response data is proposed, in which the observation equation is specified in an additive form involving both deterministic and dynamic components. These models allow us to explicitly address the effects of trend, seasonal or other time-varying covariates, while preserving the power of state space models in modeling the dynamic pattern of the data. Different Markov chain Monte Carlo algorithms to carry out statistical inference for models with binary and binomial responses are developed.

Perfect simulation, or exact sampling, refers to a recently developed set of techniques designed to produce a sequence of independent random quantities whose distribution is guaranteed to follow a given probability law. These techniques are particularly useful in the context of Markov Chain Monte Carlo iterations, but the range of their applicability is growing rapidly. Perfect simulation algorithms provide samples with the desired exact distribution and also explicitly determine how many steps are necessary in the Markov Chain to achieve the desired outcome.

In their efforts to call attention to environmental problems, communicate with like-minded groups, and mobilize support for their activities, radical environmentalist organizations produce an enormous amount of text. These texts, like radical environmental groups themselves, are often (i) densely connected and (ii) highly variable in advocated protest activities. Given a corpus of radical environmentalist texts, can one uncover the underlying network structure of environmental (and related leftist) groups?

Rain is vital to life yet potentially extremely destructive, and forecasting is critical to water management. Rain is a difficult atmospheric variable to predict, and traditional deterministic "point" forecasts of rainfall misrepresent the uncertainty associated with the methods by which rainfall is measured, modelled and predicted. Meteorologists recognize that probabilistic forecasting, that is, issuing a probability density as a forecast rather than a deterministic point value, is desirable.

Advisors: Jon Wellner & Norman Breslow

While targeting key drivers of tumor progression (e.g., BCR/ABL, HER2, and BRAFV600E) has had a major impact in oncology, most patients with advanced cancer continue to receive drugs that do not work in concert with their specific biology. This is exemplified by acute myeloid leukemia (AML), a disease for which treatments and cure rates (in the range of 20%) have remained stagnant. Effectively deploying an ever-expanding array of cancer therapeutics holds great promise for improving these rates but requires methods to identify how drugs will affect specific patients.

The earth's atmosphere is a stochastic complex system which includes, amongst other things, pollution fields, some of which derive from anthropogenic sources. Because of their negative health impacts, these fields are now subject to regulation. However, setting the air quality standards needed to regulate them is itself a complex business, and that leads to a need for good models for these fields and for predicting human exposures to them. This talk, drawing on my recent experience and research connected with ozone, will describe:

I will present directions in harnessing predictive models to guide decision making. I will first discuss methods for using machine learning to ideally couple human and computational effort, focusing on several illustrative efforts, including spoken dialog systems and citizen science. Then I will turn to challenges in healthcare and describe work to field statistical models in real-world clinical settings, focusing on the opportunity to join predictions about outcomes with utility models to guide intervention.

We develop spatial statistical models for stream networks that can estimate relationships between a response variable and other covariates, make predictions at unsampled locations, and predict an average or total for a stream or a stream segment. There have been very few attempts to develop valid spatial covariance models that incorporate flow, stream distance, or both. Typical spatial autocovariance functions based on Euclidean distance, such as the spherical covariance model, are not valid when using stream distance.

We introduce statistical methods to address two forecasting problems arising in the management of ambulance fleets: (1) predicting the time it takes an ambulance to drive to the scene of an emergency; and (2) space-time forecasting of ambulance demand. These predictions are used for deciding how many ambulances should be deployed at a given time and where they should be stationed, which ambulance should be dispatched to an emergency, and whether and how to schedule ambulances for non-urgent patient transfers.

Species inventories that list all species present in a given area are an important tool for both the study of biodiversity and conservation biology. These lists are typically obtained from field studies in which biologists record all the species they can observe over a finite time period. Because of the possible presence of rare, and thus hard-to-observe, species, completeness of such lists can never be guaranteed, regardless of the amount of time and energy spent in compiling them.

Advances in scalable machine learning have made it possible to learn highly structured models on large data sets. In this talk, I will discuss some of our recent work in this direction. I will first briefly review scalable probabilistic topic modeling with stochastic variational inference. I will then discuss two structured developments of the LDA model in the form of tree-structured topic models and graph-structured topic models. I will present our recent work in each of these areas.

Suppose we have a graphical model with sample observations of only a subset of the variables. Can we separate the extra correlations induced by marginalization over the unobserved, hidden variables from the structure among the observed variables? In other words, is it still possible to consistently perform model selection despite the unobserved, latent variables?

Many statistical models are defined in terms of polynomial constraints, or in terms of polynomial or rational parametrizations. Such algebraic models include, for instance, factor analysis and instrumental variable models, latent class models, and more generally, discrete and Gaussian graphical models with hidden variables. Statistical inference in hidden variable models is complicated by the fact that the models' parameter spaces are typically not smooth. This is the motivation for this talk that considers testing a null hypothesis with singularities in algebraic models.

After 25 years of improvement, opportunity through social mobility has levelled off in the United States. The association between occupational origins and destinations did not change between the first half of the 1980s and the first half of the 1990s. Detailed mobility tables from the General Social Survey show that the effect of socioeconomic origins on the socioeconomic status of women's and men's occupations in 1991-4 is at the same level found in the early 1980s.

We consider estimation and inference in a two-component mixture model where the distribution of one component is completely unknown. We develop methods for estimating the mixing proportion and the unknown distribution nonparametrically, given i.i.d. data from the mixture model. We use ideas from shape-restricted function estimation and develop "tuning parameter free" estimators that are easily implementable and have good finite-sample performance. We establish the consistency of our procedures.

Signatures of spatial variation, left by evolutionary processes in genomic sequences, provide important information about the function and structure of genomic regions. I discuss statistical methods for detection of such signatures in a Bayesian framework. I start with phylogenetic analysis of recombination in the HIV genome. I present a recombination detection method that allows accurate estimation of recombination break-points from a molecular sequence alignment.

The interaction between transitivity and sparsity, two common features in empirical networks, implies that there are local regions of large sparse networks that are dense. We call this the blessing of transitivity and it has consequences for both modeling and inference. Extant research suggests that statistical inference for the Stochastic Blockmodel is more difficult when the edges are sparse. However, this conclusion is confounded by the fact that the asymptotic limit in all of the previous studies is not merely sparse, but also non-transitive.

Empirical process methods play an important role in the study of maximum likelihood and minimum contrast estimators in non-parametric and semi-parametric models. In this talk I will begin with a short review of modern Glivenko-Cantelli theorems and inequalities for empirical processes. I will then survey some of the basic inequalities for proving consistency of MLEs, illustrated by several examples from current or recent research projects.

Covariance functional inference plays a key role in high dimensional statistics. A wide variety of statistical methods, including principal component analysis, Gaussian graphical model estimation, and multiple linear regression, are intrinsically inferring covariance functionals. In this talk, I will present a unified framework for analysis of complex (non-Gaussian, heavy-tailed, dependent, ...) high dimensional data. It connects covariance functional inference to robust statistics.

Estimation and testing problems for monotone functions in "Gaussian white noise" lead to several interesting functions of two-sided Brownian motion $W$ plus a parabola: the slope process of the greatest convex minorant is now well-understood, thanks to the work of Groeneboom (1983), (1989). In particular, the distribution of the slope process at $0$, say $Z_0$, has been computed analytically and numerically in Groeneboom (1985) and Groeneboom and Wellner (2001).

As we observe the dynamics of social networks over time, how can we tell if a significant change happens? We propose a new framework for the detection of change-points as data are generated. The approach utilizes nearest neighbor information and can be applied to ongoing sequences of multivariate data or object data. Different stopping times are compared, and one that relies on recent observations is recommended. An accurate analytic approximation is obtained for the average run length when there is no change, facilitating its application to real problems.

Studying covariance matrices in hierarchical models can reveal meaningful relationships among variables, but these become difficult to interpret as the number of variables grows. Conventional factor analysis reduces the dimension by mapping onto a set of one-dimensional factors, but does not accommodate variables with a cross-classified layout. For such applications, we develop hierarchical models with Kronecker-product (separable) covariance structure at the second level.

We consider a mathematical model for a financial market and consider a trader who wants to optimize, by suitable trading, the value of his or her portfolio. The constraint in this optimization is given by a convex functional known as a convex risk measure. We propose a Monte Carlo algorithm, whose inputs are the joint law of the stock prices and the parameters of the convex risk measure, and whose outputs are the numerical values of the optimal trading strategy. We also prove the optimality of the output.

Web crawling, snowball sampling, and respondent-driven sampling (RDS) are three types of network driven sampling techniques that are popular when it is difficult to contact individuals in the population of interest. This talk will first review previous research which has shown that if participants refer too many other participants, then under the standard Markov model in the RDS literature, the standard approaches do not provide "square root n" consistent estimators. In fact, there is a critical threshold where the design effect of network sampling grows with the sample size.

Both functional and longitudinal data are data recorded over a time period for each subject in the study. However, the approaches to analyze them are intrinsically different, partly due to the difference in the sampling plans. Functional data refer to situations where the entire trajectory is observed for each subject, or when measurements are recorded for each subject at a dense grid of time points. Longitudinal data, however, are often recorded intermittently, leading to varying measurement schedules and numbers of measurements across subjects.

With the development of new detectors, telescopes and computational facilities, astrophysics has entered an era of data intensive science. During the last decade, astronomers have surveyed the sky across many decades of the electromagnetic spectrum, collecting hundreds of terabytes of astronomical images for hundreds of millions of sources. Over the next decade, data volumes will reach tens of petabytes, and provide accurate measurements for billions of sources.

Environmental statistics is a rich field for statistical problems. I will sketch four different problem areas, all with very different approaches. The first one has to do with statistical assessment of air quality standards. Starting from a classical Neyman-Pearson approach, recent work has moved into analysis of maxima of Gaussian processes. The second problem deals with estimating trends in extreme climate events.

To perform inference after model selection, we propose controlling the selective type I error; i.e., the error rate of a test given that it was performed. By doing so, we recover long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context. Our proposal is closely related to data splitting and has a similar intuitive justification, but is more powerful.

Did Casanova practice risky sex? What did "Powerball" have to do with the Fall of the Bastille? Just how risk-averse was Robespierre? How did the sans-culottes lose their culottes? In the eighteenth century in France, citizens and royalty faced a multitude of risks, from sexually transmitted disease to decapitation. An unusual data source on the French Lottery provides a window on how financial risk was addressed in that tumultuous time, and how the emerging calculus of probabilities affected its perception.

We consider the problem of learning the structure of a non-Gaussian graphical model. We introduce two strategies for constructing tractable nonparametric graphical model families. One approach is through semiparametric extension of the Gaussian or exponential family graphical models that allows arbitrary graphs. Another approach is to restrict the family of allowed graphs to be acyclic, enabling the use of fully nonparametric density estimation in high dimensions.

Many problems can be formulated as recovering a low-rank tensor. Although an increasingly common task, tensor recovery remains a challenging problem because of the delicacy associated with the decomposition of higher order tensors. We investigate several convex optimization approaches to low rank tensor completion.

I will explain why the frequency statistics has absolutely nothing in common with the frequency philosophy of probability. If time permits, I will explain why the Bayesian statistics has absolutely nothing in common with the subjective philosophy of probability. My presentation will be an unbiased estimator of the truth, with subjective probability 90%.

This second lecture will focus on more sophisticated methods applicable when too few covariates are available to make it plausible that treatment assignment is ignorable (i.e., conditionally randomized given the covariates). The template setting involves randomized experiments with noncompliance where "use-effectiveness" (i.e., the effect of exposure to the treatment, not the effect of assignment to the treatment) is the estimand.

We show how to transform any optimization problem that arises from fitting a machine learning model into one that (1) detects and removes contaminated data from the training set and (2) simultaneously fits the trimmed model on the remaining uncontaminated data. To solve the resulting nonconvex optimization problem, we introduce a fast stochastic proximal-gradient algorithm that incorporates prior knowledge through nonsmooth regularization.

The sparse linear model, where latent parameters are endowed with a Laplace prior, has seen many successful applications in Statistics, Machine Learning, and Computational Biology, such as identification of gene regulatory networks from micro-array expression data, or sparse coding of images with overcomplete basis sets. Prior work has either approximated Bayesian inference by expensive Markov chain Monte Carlo, or replaced it by point estimation. We show how to obtain a good approximation to Bayesian inference efficiently, using the Expectation Propagation method.

I will introduce the four-parameter IBP compound Dirichlet process (ICDP), a stochastic process that generates sparse non-negative vectors with potentially an unbounded number of entries. If we repeatedly sample from the ICDP we can generate sparse matrices with an infinite number of columns and power-law characteristics. We apply the four-parameter ICDP to sparse nonparametric topic modelling to account for the very large number of topics present in large text corpora and the power-law distribution of the vocabulary of natural languages.

Social network data often have a special dependence structure, since they usually contain information about the strength of an individual's relation (e.g., friendship) with more than one other person. In most cases, one of the research questions concerns the effect of personal attributes on the occurrence or strength of a relation. Thus, a (cross-)nested data structure is obtained which is suitable for analysis with multilevel models or with related random effects models.

We often use predictive models to make a decision afterwards. For instance, we might estimate the number of patients at a medical clinic and then designate resources to serve those patients.

The United Kingdom Home Office holds approximately 1 million DNA profiles in its database of known offenders. Suppose that a partial DNA profile is recovered from the scene of a crime. The probability of drawing this profile from a randomly selected individual in England and Wales is estimated to be 1/1,000,000. The crime scene profile is compared with each profile in the offender database and is found to match the profile of one person, S. S was not in custody at the time the crime took place, but no other evidence linking S to the scene of the crime is found.
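The arithmetic behind this scenario can be sketched in a few lines. The block below takes the two figures quoted above (about 1 million database profiles and a random-match probability of 1/1,000,000) and computes the expected number of coincidental matches in a database search, along with the chance of at least one. It assumes, purely for illustration, that every database profile belongs to someone unrelated to the true source and matches independently; that assumption and the variable names are not from the talk.

```python
import math

database_size = 1_000_000          # profiles in the offender database (from the abstract)
match_probability = 1 / 1_000_000  # random-match probability (from the abstract)

# Assumption: each database profile matches independently with the same probability.
expected_matches = database_size * match_probability

# Probability that at least one profile matches purely by coincidence.
p_at_least_one = 1 - (1 - match_probability) ** database_size

print(f"expected coincidental matches: {expected_matches:.3f}")
print(f"P(at least one coincidental match): {p_at_least_one:.3f}")
```

Under these assumptions about one coincidental match is expected, and the chance of at least one is close to 1 - 1/e, or roughly 0.63, which is why a single database hit with no other evidence calls for careful interpretation.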

Single-cell transcriptome sequencing (scRNA-Seq), which combines high-throughput single-cell extraction and sequencing capabilities, enables the transcriptome of large numbers of individual cells to be assayed efficiently.

In many areas of economic analysis, economic theory restricts the shape as well as other characteristics of functions used to represent economic constructs. Obvious examples are the monotonicity and curvature conditions that apply to utility, profit, and cost functions. Commonly, these regularity conditions are imposed either locally or globally. Here we extend and improve upon currently available estimation methods for imposing regularity conditions by imposing regularity on a connected subset of the regressor space.

Signal processing on graphs is a framework for non-parametric function estimation and hypothesis testing that generalizes spatial signal processing to heterogeneous domains. I will discuss the history of this line of research, highlighting common themes and major advances. I will introduce various graph wavelet algorithms, and highlight any known approximation theoretic guarantees. Recently, it has been determined that the fused lasso is theoretically competitive with wavelet thresholding under some conditions, meaning that the fused lasso is also a locally adaptive smoothing procedure.

The problem of bandwidth estimation for smoothed least squares (SLS) image reconstruction - such as filtered backprojection (FBP) in Positron Emission Tomography (PET) - has been extensively studied in the statistics literature. Here, I extend the generalized cross-validation (GCV) strategy for ridge regression (Golub et al., 1979) and develop it to determine the optimal smoothing parameter in FBP reconstruction. Results on eigendecomposition of symmetric one- and two-dimensional circulant matrices are derived.
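For readers unfamiliar with the starting point, here is a minimal sketch of generalized cross-validation for ordinary ridge regression, not the FBP extension developed in the talk. It uses the score GCV(λ) = n · RSS(λ) / (n − tr H(λ))², with hat matrix H(λ) = X(XᵀX + λI)⁻¹Xᵀ computed through the singular value decomposition. The penalty parametrization, the simulated data, and the function name `gcv_scores` are assumptions chosen for illustration.

```python
import numpy as np


def gcv_scores(X, y, lambdas):
    """Return the GCV score for each ridge penalty in `lambdas`."""
    n = X.shape[0]
    # Thin SVD: X = U diag(s) Vt, so the hat matrix is U diag(s^2 / (s^2 + lam)) U^T.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    scores = []
    for lam in lambdas:
        shrink = s**2 / (s**2 + lam)      # eigenvalues of the hat matrix
        fitted = U @ (shrink * Uty)       # H(lam) @ y
        rss = np.sum((y - fitted) ** 2)   # residual sum of squares
        edf = np.sum(shrink)              # trace of the hat matrix
        scores.append(n * rss / (n - edf) ** 2)
    return np.array(scores)


# Simulated example (all values here are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta = np.concatenate([np.ones(3), np.zeros(7)])
y = X @ beta + rng.normal(size=50)

lambdas = np.logspace(-3, 3, 13)
scores = gcv_scores(X, y, lambdas)
best_lambda = lambdas[np.argmin(scores)]
print(f"penalty minimizing GCV on this grid: {best_lambda:.4g}")
```

Working through the SVD means the decomposition is computed once and reused for every candidate penalty, which is the same reason eigendecompositions are attractive in structured problems such as the circulant case mentioned above.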

The talk starts with an overview of multivariate M-functionals of location and scatter, including symmetrized M-functionals of scatter. Then we discuss general properties of the underlying log-likelihood function. After that we review the currently known algorithms, fixed-point or iteratively reweighted moments. It is explained why these algorithms are intrinsically suboptimal. Then an alternative strategy, based on a "partial Newton" approach, is developed. Numerical examples and, if time permits, applications of M-estimators to Independent Component Analysis are presented.

In this talk, we develop the theoretical properties of the propensity function, which is a generalization of the propensity score of Rosenbaum and Rubin (1983). Methods based on the propensity score have long been used for causal inference in observational studies; they are easy to use and can effectively reduce the bias caused by non-random treatment assignment. Although treatment regimes are often not binary in practice, the propensity score methods are generally confined to binary treatment scenarios.

We review the classical notion of Kalman filters for state estimation in dynamical systems. We then reformulate the estimation problem as an optimization problem and show how this perspective allows one to overcome many of the perceived barriers to extending the basic model to a wide range of novel settings. In particular, we show how to extend the model to nonlinear settings involving state constraints, non-Gaussian densities, outliers, sparsity, trend shifts, and state-dependent covariances.
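The link between filtering and optimization can be illustrated in the simplest possible case. The sketch below is not the speaker's formulation: it assumes a scalar random-walk state with Gaussian noise, and the function names `kalman_filter_1d` and `penalized_least_squares` are hypothetical. With zero process noise, the final Kalman estimate coincides with the minimizer of a penalized least-squares objective.

```python
def kalman_filter_1d(observations, prior_mean, prior_var, process_var, obs_var):
    """Scalar Kalman filter for x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    Assumed model (illustrative only): Var(w_t) = process_var,
    Var(v_t) = obs_var, Gaussian prior on the initial state.
    """
    mean, var = prior_mean, prior_var
    filtered_means = []
    for y in observations:
        # Predict step: the random-walk state keeps its mean, variance grows.
        var = var + process_var
        # Update step: blend the prediction and the new observation.
        gain = var / (var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered_means.append(mean)
    return filtered_means


def penalized_least_squares(observations, prior_mean, prior_var, obs_var):
    """Minimizer of (x - prior_mean)^2 / prior_var + sum_i (y_i - x)^2 / obs_var."""
    precision = 1.0 / prior_var + len(observations) / obs_var
    weighted_sum = prior_mean / prior_var + sum(observations) / obs_var
    return weighted_sum / precision
```

Calling `kalman_filter_1d(ys, 0.0, 10.0, 0.0, 0.5)[-1]` and `penalized_least_squares(ys, 0.0, 10.0, 0.5)` on the same list `ys` should give the same number up to rounding, which is the optimization view in miniature.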

Morphometric data sets have not only the usual parameter structures (mean shape, sample covariance) but also other geometric functions of the mean form that can structure prior knowledge. When information from data is absent or weak, these auxiliary formalisms can supply reasonable "expectations" in a context similar to the classic EM alternating algorithm. On odd-numbered steps, population parameters are estimated by least-squares or ML; on even-numbered steps, individual missing data are estimated.

Relational models generalize log-linear models for multivariate categorical data in three aspects. The sample space does not have to be a Cartesian product of the ranges of the variables, the effects allowed in the model do not have to be associated with cylinder sets, and the existence of an overall effect present in every cell is not assumed. After discussing examples which motivate these generalizations, the talk will consider estimation and testing in relational models.

In this paper we discuss the problem of Bayesian fully nonparametric regression. The paper is concerned with two issues: 1) a new construction of priors for nonparametric regression is discussed and a specific prior, the Dirichlet Process Regression Smoother, is proposed, and 2) we consider the problem of centring a dependent nonparametric prior over a class of regression models and propose fully nonparametric regression models with flexible location structures. Computational methods are developed for all models described. Results are presented for simulated and actual data examples.

In this talk, I will discuss how I recently resolved a longstanding open statistical problem. The problem, formulated by the British statistician Udny Yule in 1926, is to mathematically prove Yule's 1926 empirical finding of "nonsense correlation." We solve the problem by analytically determining the second moment of the empirical correlation coefficient of two independent Wiener processes. Using tools from Fredholm integral equation theory, we calculate the second moment of the empirical correlation to obtain a value for the standard deviation of the empirical correlation of nearly 0.5.
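The analytical result can be checked informally by simulation. The sketch below is a Monte Carlo approximation, not the Fredholm-equation argument of the talk: it assumes Gaussian random walks as discrete stand-ins for Wiener processes, and the function names and default sizes are hypothetical choices.

```python
import math
import random


def empirical_correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_walk(n_steps, rng):
    """Gaussian random walk, a discrete approximation to a Wiener path."""
    position = 0.0
    path = []
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path


def nonsense_correlation_sd(n_steps=200, n_reps=1000, seed=0):
    """Monte Carlo estimate of the standard deviation of the correlation.

    The correlation has mean zero by symmetry, so the standard deviation
    is estimated as the square root of the average squared correlation.
    """
    rng = random.Random(seed)
    squared = []
    for _ in range(n_reps):
        rho = empirical_correlation(random_walk(n_steps, rng),
                                    random_walk(n_steps, rng))
        squared.append(rho * rho)
    return math.sqrt(sum(squared) / n_reps)
```

For independent stationary sequences of this length the standard deviation would be far smaller, which is what makes the value near 0.5 for independent walks striking.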

Information technology advances are making data collection possible in most if not all fields of science and engineering and beyond. Statistics as a scientific discipline is challenged and enriched by the new opportunities resulting from these high-dimensional data sets. Often data reduction or feature selection is the first step towards solving these massive data problems. However, data reduction through model selection or l_0 constrained least squares optimization leads to a combinatorial search which is computationally infeasible for massive data problems.

Extracting knowledge and providing insights into the complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Networks are an example of simple, yet powerful tools for capturing relationships among entities over time. For example, in social media, networks represent connections between different individuals and the type of interaction that two individuals have. In systems biology, networks can represent the complex regulatory circuitry that controls cell behavior.

A statistical rule of thumb is defined as a widely applicable guide to statistical practice with sound theoretical basis. Characteristics include intuitive appeal, elegance, and transparency. A rule states not only what is important but, by implication of what is not included, makes an assertion about what is less important. This talk is based on the recently published book, Statistical Rules of Thumb, Wiley and Sons, March 2002.

Despite its popularity, the investigation of some theoretical aspects of clustering has been relatively sparse. One of the main reasons for this lack of theoretical results is surely the fact that, whereas for other statistical problems the theoretical population goal is clearly defined (as in regression or classification), for some of the clustering methodologies it is difficult to specify the population goal to which the data-based clustering algorithms should try to get close.

Georges Matheron has been an enormously influential figure in defining the principles and basic methodology of what is considered Geostatistics and for similarly (co-) defining the field of Mathematical Morphology. And yet, while his name is now well-known, wide-spread recognition of his work in Geostatistics, at least in the English-speaking world, was late in coming. Most people know Geostatistics through the work of his students. I will briefly review Matheron's career and name some of his major contributions in Geostatistics and Mathematical Morphology.

We consider approximate Bayesian model choice for model selection problems that involve models whose Fisher-information matrices may fail to be invertible along other competing submodels. Such singular models do not obey the regularity conditions underlying the derivation of Schwarz's Bayesian information criterion (BIC) and the penalty structure in BIC generally does not reflect the frequentist large-sample behavior of their marginal likelihood.

We will review some of the popular methods for distance-based phylogeny reconstruction with a focus on the statistical theory underlying the methods. In particular, we discuss least squares interpretations of the minimum evolution principle and neighbor-joining, and connections to Felsenstein's quantitative character models.

Convex sparsity-promoting regularizations are now ubiquitous for regularizing inverse problems in statistics, in signal processing and in machine learning. By construction, they yield solutions with few non-zero coefficients. This point is particularly appealing for Working Set (WS) strategies, an optimization technique that solves simpler problems by handling small subsets of variables, whose indices form the WS. Such methods involve two nested iterations: the outer loop corresponds to the definition of the WS and the inner loop calls a solver for the subproblems.

Recent advances in Algebraic Statistics have suggested a more general approach to the study of log-linear models that relies on the tools and language of algebraic and polyhedral geometry. In this talk, the problem of the existence of the Maximum Likelihood Estimate (MLE) of the cell mean vector of a contingency table, fundamental for assessment of fit, model selection and interpretation, is considered. Geometric and combinatorial conditions for the existence of the MLE are given, by combining tools from polyhedral geometry and the theory of linear exponential families.

Networks are all around us: social networks allow for information and influence flow through society, viruses become epidemics by spreading through networks, and networks of neurons allow us to think and function. With the recent technological advances and the development of online social media we can study networks that were once essentially invisible to us. In this talk we discuss how computational perspectives and machine learning models can be developed to abstract networked phenomena like: How will a community or a social network evolve in the future?

A major difficulty in investigating the nature of atmospheric circulation changes over the North Pacific is the shortness of historical time series. An approach to this problem is through comparison of models. In this talk we contrast two stochastic models and a 'signal plus noise' model for the winter-averaged sea level pressure time series for the Aleutian low (the North Pacific (NP) index) and for air temperatures from Sitka, Alaska. The two stochastic models are a first-order autoregressive (AR(1)) model and a fractionally differenced (FD) model.
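As a small illustration of the simpler of the two stochastic models, the sketch below simulates an AR(1) series and recovers its coefficient from the lag-1 sample autocorrelation. It uses synthetic data only, not the NP index or the Sitka temperatures, and the function names and parameter values are hypothetical; the FD model is not covered.

```python
import random


def simulate_ar1(phi, n, sigma=1.0, seed=0):
    """Simulate x_t = phi * x_{t-1} + e_t with Gaussian noise e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x = 0.0
    series = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series


def estimate_ar1_coefficient(series):
    """Estimate phi by the lag-1 sample autocorrelation of the series."""
    mean = sum(series) / len(series)
    numerator = sum((a - mean) * (b - mean)
                    for a, b in zip(series, series[1:]))
    denominator = sum((a - mean) ** 2 for a in series)
    return numerator / denominator
```

With a long simulated series the estimate lands close to the true coefficient; with a series as short as a typical historical climate record it is far less stable, which is the difficulty the abstract opens with.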

The primal-dual witness (PDW) technique is a now-standard proof strategy for establishing variable selection consistency for sparse high-dimensional estimation problems when the objective function and regularizer are convex. The method proceeds by optimizing the objective function over the parameter space restricted to the true support of the unknown vector, then using a dual witness to certify that the resulting solution is also a global optimum of the unrestricted problem.

Before-After-Control-Impact (BACI) designs are used to study ecological responses in large experimental units (e.g., lakes, forests and mesocosms) for which replication is difficult or impossible. Two units are monitored over time; one unit receives an intervention at some intermediate time, while the other is left as an undisturbed control. The pre-intervention differences in the response between units are compared to the post-intervention differences, with a large disparity interpreted as evidence of an effect of the intervention.
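The comparison described above amounts to a difference of differences. The sketch below is a minimal version assuming paired readings taken at the same times on both units; the function name `baci_effect` and the example numbers are hypothetical, and no uncertainty assessment is attempted.

```python
from statistics import mean


def baci_effect(control_before, impact_before, control_after, impact_after):
    """Point estimate of the intervention effect in a BACI design.

    Each argument is a list of readings; the control and impact lists for
    a given period are assumed to be paired by sampling time.
    """
    # Between-unit differences before the intervention.
    pre_differences = [i - c for i, c in zip(impact_before, control_before)]
    # Between-unit differences after the intervention.
    post_differences = [i - c for i, c in zip(impact_after, control_after)]
    # A large change in the typical difference suggests an effect.
    return mean(post_differences) - mean(pre_differences)
```

For example, if the impact unit reads 2 units above the control before the intervention and 5 units above it afterwards, the estimate is 3.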

We propose a new method for clustering time series.

A univariate time series can be represented by a fixed-length vector whose components are statistical features of the time series, capturing the global structure. These descriptive vectors are then clustered using a standard fast clustering algorithm. A further search mechanism is used to find the best selection from the features for some specific problem domain or data set. We demonstrate the effectiveness and simplicity of our proposed method by clustering some benchmark datasets with empirical results.
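The two-stage idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the three features (mean, standard deviation, lag-1 autocorrelation) are a hypothetical choice, plain k-means with a deterministic farthest-point start stands in for the "standard fast clustering algorithm", and the feature-selection search is omitted.

```python
import math


def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation; 0.0 for a constant series."""
    mean = sum(series) / len(series)
    denominator = sum((v - mean) ** 2 for v in series)
    if denominator == 0.0:
        return 0.0
    numerator = sum((a - mean) * (b - mean)
                    for a, b in zip(series, series[1:]))
    return numerator / denominator


def feature_vector(series):
    """Fixed-length summary of a series (assumed features, for illustration)."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    return [mean, std, lag1_autocorrelation(series)]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c)
                                                 for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_time_series(series_list, k):
    """Cluster series of possibly different lengths via their feature vectors."""
    return kmeans([feature_vector(s) for s in series_list], k)
```

Because every series is reduced to a vector of the same length, series of different lengths can be clustered together, and the cost of the clustering step does not depend on how long the series are.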

A systematic handling of causality requires a mathematical language in which causal relationships receive symbolic representation, clearly distinct from statistical associations. Two such languages have been proposed in the past: path analysis and structural equation models, used extensively in economics and the social sciences, and Lewis-Neyman-Rubin's counterfactual (or potential-response) model, used sporadically in philosophy and statistics.

### Abstract:

Phylogenetic trees represent evolutionary histories and have many important applications in biology, anthropology and criminology. The branching structure of the tree encodes the order of evolutionary divergence, and the branch lengths denote the time between divergence events.

I will describe various efforts that we at the Institute for Systems Biology have undertaken to model the pathways and dynamics of systems in organisms from yeast to human. I will focus on our system for network inference and modeling of the regulatory network of Halobacterium, an organism that thrives in hypersaline environments.

We will describe Bayesian population reconstruction, a recent method for estimating past populations by age for all countries, including developing countries where data on past populations are fragmentary and of variable quality. Such reconstructions are needed for the World Population Prospects, a comprehensive set of demographic statistics for all countries issued by the United Nations and updated every two years.

Although the subject of copulas has a history going back to the 1950s, it is now enjoying a period of fashionability and much of this can be explained by new applications for the theory in the modelling of multivariate financial time series. Copulas are a useful tool for building multivariate distributions with interesting "dependence structures" and, in particular, dependence structures that differ markedly from that of the multivariate normal distribution, which is still widely used in financial applications.

Advisors: Michael LeBlanc & Charles Kooperberg