Mary Gates Hall

Statistical divergences have been widely used in statistics and machine learning to measure the dissimilarity between probability distributions. This dissertation investigates their applications in statistical learning and inference. In the first part of this talk, I study minimum Kullback-Leibler (KL) divergence estimation, which is equivalent to maximum likelihood estimation.
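The equivalence mentioned above can be sketched in a few lines. The notation here is an assumption for illustration and does not come from the abstract: p is the data-generating density, q_θ is the model density, and x_1, ..., x_n are i.i.d. draws from p.

```latex
\begin{align*}
\mathrm{KL}(p \,\|\, q_\theta)
  &= \mathbb{E}_{p}\!\left[\log p(X)\right] - \mathbb{E}_{p}\!\left[\log q_\theta(X)\right], \\
\arg\min_{\theta} \mathrm{KL}(p \,\|\, q_\theta)
  &= \arg\max_{\theta} \mathbb{E}_{p}\!\left[\log q_\theta(X)\right]
  \quad \text{(the first term does not depend on } \theta\text{)}, \\
\hat{\theta}_n
  &= \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log q_\theta(x_i)
  \quad \text{(expectation replaced by the sample average)}.
\end{align*}
```

The last line is the maximum likelihood estimator, so minimizing the empirical KL divergence and maximizing the likelihood select the same θ.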

Tick-by-tick interbank foreign exchange (FX) price series exhibit statistically significant structures on various time scales. These include negative autocorrelations in tick-by-tick returns and positive autocorrelations (trends) on longer time scales. To account for the observed structures, we propose state space models for financial time series in which the observed price is a noisy version of an unobserved, less-noisy "True Price" process.
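A minimal simulation can illustrate why observation noise on top of a latent price produces negative autocorrelation in tick returns. This is a sketch under stated assumptions, not the models proposed in the talk: the latent "True Price" is taken to be a Gaussian random walk, the observation noise is i.i.d. Gaussian, and the function names and parameters are hypothetical. Under these assumptions the theoretical lag-1 autocorrelation of returns is -σ_noise² / (σ_true² + 2σ_noise²), which is -1/3 when both standard deviations equal 1.

```python
import random


def simulate_observed_returns(n_ticks, sigma_true=1.0, sigma_noise=1.0, seed=0):
    """Tick returns of a noisy observation of a random-walk latent price.

    Assumption for illustration only: latent price is a Gaussian random walk,
    observation noise is i.i.d. Gaussian.
    """
    rng = random.Random(seed)
    true_price = 0.0
    prev_observed = true_price + rng.gauss(0.0, sigma_noise)  # observation at tick 0
    returns = []
    for _ in range(n_ticks):
        true_price += rng.gauss(0.0, sigma_true)             # latent random-walk step
        observed = true_price + rng.gauss(0.0, sigma_noise)  # add observation noise
        returns.append(observed - prev_observed)
        prev_observed = observed
    return returns


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


returns = simulate_observed_returns(20000)
rho1 = lag1_autocorrelation(returns)
print(f"lag-1 autocorrelation of tick returns: {rho1:.3f}")
```

The noise term enters consecutive returns with opposite signs, which is what drives the estimate below zero; the positive autocorrelations on longer time scales described in the abstract would need additional structure in the latent process.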

Faced with overcrowded prisons, the courts have been increasingly passing probation sentences for adults convicted of felony crimes. Using a national sample, this paper identifies the risk factors for recidivism among Female, Male, Black, White and Hispanic felony probationers. The individual hazard function is assumed to depend on individual and neighborhood characteristics as well as social interactions among probationers.

This lecture focuses on problems of density estimation (both parametric and nonparametric) and, if there is time, time series estimation (no pun intended). When formulated as optimization problems, consistency of the estimators becomes a question of whether a sequence of optimization problems converges in an appropriate sense to the true problem. The tools of variational analysis are used to examine the question of consistency for these problems.

Computational Finance Seminar. As Corporate Vice President and Treasurer, George Zinn is responsible for overseeing Microsoft's corporate assets. He leads a group which manages the company's worldwide financial and corporate risk, investment portfolio, strategic portfolio, foreign exchange, corporate and structured project finance, dilution management, cash and liquidity, customer financing, and credit activities.

People often express their preferences for web pages, products, or candidates in an election as a ranked list. Ranked lists are also the standard output of search engines like Google or Sequest. The aim of this talk is to show how one can do "statistics as usual" with this kind of discrete, structured, high-dimensional data.

I will define statistical models over spaces of permutations and partial orderings, and present methods for estimating these models from data.

Gene expression is an important molecular phenotype, providing the initial step in bridging the divide between static genomic information and dynamic organismal phenotypes. Thus, variation in gene expression levels is thought to constitute a significant source of phenotypic diversity among individuals within populations and to contribute to the evolutionary divergence between species. I will discuss our work on identifying regulatory polymorphisms that contribute to heritable transcriptional variation in both yeast and humans.

I'll review the basics of peptide/protein chemistry pertinent to sequencing by MS, discuss how MS instruments produce spectra that are "converted" to sequence by software, and say a little about the software itself. So, 1/3 each of 1) protein chemistry (and the why of how we do proteomics with MS), 2) a description of fragmentation mechanisms (this is what people casually refer to as sequencing), and 3) the vagaries of final sequence assignment to the raw data.

Large-scale distributed computing systems can suffer from occasional severe violation of performance goals; due to the complexity of these systems, manual diagnosis of the cause of the crisis is too slow to inform interventions taken during the crisis. Rapid automatic recognition of the recurrence of a problem can lead to cause diagnosis and informed intervention. We frame this as an online clustering problem, where the labels (causes) of some of the previous crises may be known.

In October 2006, Netflix kicked off a $1M competition by releasing 100 million movie ratings as a training set to be used to build a better recommendation system for their on-line movie rental business. This landmark data set generated intense interest from the statistics and machine learning communities, and attracted entries from over 3000 teams from academia and industry.

The primary impediment to formulating a general theory for adaptive evolution has been the unknown distribution of fitness effects for new beneficial mutations. By applying extreme value theory, Gillespie (1984) circumvented this issue in his mutational landscape model for the adaptation of DNA sequences and Orr (2002) extended Gillespie's model, generating testable predictions regarding the course of adaptive evolution. Rokyta (2005) provided the first empirical examination of this model, using an ssDNA bacteriophage.

Scan statistics are a common tool to detect e.g. spatial disease clusters or to describe local differences between two distributions. Multivariate scan statistics pose both a statistical problem due to the multiple testing over many scan windows, as well as a computational problem because statistics have to be evaluated on many windows. I will describe methodology that leads to both statistically optimal inference and computationally efficient algorithms.

Risk budgeting is a methodology that has become increasingly popular over the last decade as a relatively transparent alternative to rebalancing portfolios via a black-box portfolio optimization method. We begin by briefly reviewing “classical” risk budgeting methodology based on volatility (standard deviation) of returns as the risk measure.

Data-generating stochastic processes arise naturally in many disciplines, for example biology, ecology or epidemiology. In many cases, because interesting models are highly complex, the likelihood f(xo | θ, M) of such implicit scientific models M is intractable. This hampers scientific progress in terms of iterative data acquisition, parameter inference, model checking and model refinement within a Bayesian framework. Nevertheless, given a value of θ, it is usually possible to simulate data from f(.|θ, M).
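One common way to exploit the ability to simulate from f(.|θ, M) without evaluating the likelihood is rejection sampling in the style of approximate Bayesian computation. The sketch below is an illustration under stated assumptions and is not necessarily the method of the talk: the toy model (Gaussian data with unknown mean θ), the uniform prior, the sample-mean summary statistic, the tolerance, and all function names are hypothetical choices.

```python
import random


def simulate(theta, n, rng):
    """Toy simulator standing in for f(.|theta, M): n Gaussian draws with mean theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(xs):
    """Summary statistic used to compare simulated and observed data (sample mean)."""
    return sum(xs) / len(xs)


def abc_rejection(observed, n_draws=20000, tolerance=0.1, seed=1):
    """Keep prior draws whose simulated summary lands within `tolerance` of the observed one."""
    rng = random.Random(seed)
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # assumed prior: Uniform(-5, 5)
        s_sim = summary(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) <= tolerance:
            accepted.append(theta)
    return accepted


# Generate "observed" data from the toy model with a known mean of 2.0.
observed = simulate(2.0, 20, random.Random(0))
posterior_draws = abc_rejection(observed)
posterior_mean = sum(posterior_draws) / len(posterior_draws)
print(f"accepted {len(posterior_draws)} draws, approximate posterior mean {posterior_mean:.2f}")
```

The accepted values of θ approximate draws from the posterior; shrinking the tolerance improves the approximation at the cost of fewer accepted draws.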

In this talk we discuss the application of Bayesian methods in the design of clinical trials. In the first part of the talk we discuss sample size determination. A broad range of frequentist and Bayesian methods for sample size determination can be described as choosing the smallest sample that is sufficient to achieve some set of goals. An example for the frequentist is seeking the smallest sample size that is sufficient to achieve a desired power at a specified significance level.
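The frequentist example above can be made concrete with a small search. This is a sketch under stated assumptions, not a procedure from the talk: a one-sided z-test of a mean with known standard deviation, and hypothetical function names and default values.

```python
from statistics import NormalDist


def power_one_sided_z(n, effect, sigma=1.0, alpha=0.05):
    """Power of a one-sided z-test of H0: mean = 0 against mean = effect > 0."""
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha)      # rejection threshold on the z scale
    shift = effect * (n ** 0.5) / sigma    # mean of the test statistic under the alternative
    return 1.0 - z.cdf(critical - shift)


def smallest_sample_size(effect, sigma=1.0, alpha=0.05, target_power=0.8, n_max=1_000_000):
    """Smallest n whose power reaches target_power at significance level alpha."""
    for n in range(1, n_max + 1):
        if power_one_sided_z(n, effect, sigma, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max observations")


n_required = smallest_sample_size(effect=0.5)
print(f"smallest sample size: {n_required}")
```

Because power increases with n in this setting, the first n that meets the target is the smallest sufficient sample; Bayesian criteria of the kind discussed in the talk would replace the power condition with a different goal.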

Fractal behavior and long-range dependence have been described in an astonishing number of physical, biological, geological, and socio-economic systems. Time series, profiles, and surfaces have been characterized by their fractal dimension, a measure of roughness, and by the Hurst coefficient, a measure of long-memory dependence. Either phenomenon has been modeled and explained by self-similar random functions, such as fractional Gaussian noise and fractional Brownian motion.

Different General Circulation Models (GCMs) produce different climate change projections, especially when evaluated at subcontinental (regional) scales. When combining their responses into a summary measure with corresponding uncertainty bounds, it makes sense to give more weight to the output of those GCMs that show better performance in reproducing present-day climate (i.e. have smaller bias) and that agree with the majority (i.e. do not seem like outliers).

Deconvolution of an unknown function of one variable from a finite set of measurements is an ill-posed problem. Placing a Bayesian prior on a function space is one way to extend the scientific model and obtain a well-posed problem. This problem can be well-posed even if the relationship between the unknown function and the measurements, as well as the function space prior, has unknown parameters. We present a method for estimating the unknown parameters by maximizing an approximation of the marginal likelihood where the unknown function has been integrated out.