PDL
Padelford
Statistical estimation and decision-making for the COVID-19 pandemic
We address questions pertinent to the monitoring and management of infectious disease transmission in the context of the COVID-19 pandemic. Specifically, how do we quantify the gross health, social, and economic impacts of epidemics and the associated policy response? How do we assess the efficacy of public health interventions, pharmaceutical or otherwise? And how can we cost-effectively implement these interventions to optimally mitigate outbreaks?
Statistical Inference Using Identity-by-Descent Segments
Positive selection is suggested to be the primary mechanism of phenotypic adaptation. Selective sweeps are one model of positive selection in which beneficial mutations rapidly increase in frequency. The selection coefficient is a model parameter that influences the expected rate of allele frequency change in a single generation. In this dissertation, we develop theory and methodology to study recent positive selection with genetic data from only the present-day.
Statistical Inference with Missing and Latent Data: Methods for Data Harmonization, Network Curvature Estimation and Experimentation Under Interference
In many statistical applications, the goal is to measure a property in a dataset that may contain missing or latent variables. The ideal data that researchers aim to collect often differs significantly from the data available due to limitations in data collection processes. These datasets often contain missing values, necessitating careful consideration of the assumptions used to address this missingness. Additionally, many studies focus on analyzing properties that are not directly observable but are latent.
New Insights into Individual Treatment Effects and Kolmogorov's Problem
The individual treatment effect (ITE) is often regarded as the ideal target of inference in causal analyses and has been the focus of several recent studies. In our first project, we describe the intrinsic limits regarding what can be learned concerning ITEs given data from large randomized experiments. We first consider when a valid prediction interval for the ITE is informative and when it can be bounded away from 0. The joint distribution over potential outcomes is only partially identified from randomized experiment data.
Density estimation for the spatiotemporal human mobility
Spatiotemporal GPS trajectory data, primarily recorded by mobile phones, contain valuable information about human mobility. These data are crucial for assessing individuals’ dynamic exposure to social and environmental risk factors across multiple spatial contexts. Density estimation is a pivotal step in analyzing human mobility over specified time periods under varying scenarios.
Identification and Estimation Algorithms for Pathogen, Ancestral, and Rashomon Analysis
Identifiability and estimability are fundamental to learning statistical models. In this dissertation, we will look at questions on identifiability and estimability that arise in policy-making and causal discovery. First, we improve the efficiency of contact tracing of infectious diseases using multi-armed bandits that leverage heterogeneity in infectiousness of infected people.
Estimation and Inference of Optimal Policies
Many fields conduct experiments to learn policies that map individual characteristics to actions, with those achieving the best outcomes referred to as optimal policies. As getting human feedback from experiments is expensive, we are often interested in learning the optimal policy as quickly as possible. However, there are several challenges in developing practical approaches for policy learning. First, traditional methods usually only guarantee minimax optimality, while practitioners care more about performance on their particular problem instance.
Statistical Learning and Modeling with Graphs and Networks
A graph, consisting of a set of vertices and a set of edges, is a geometric object that can not only visualize but also mathematically characterize the geometric structures in data. Graphs can also model relations or connections between different units and have applications in various fields such as epidemiology, econometrics, sociology, biology, and astronomy.
Faculty Meeting
- Call to Order
- Chair's Remarks
- Announcements
- Committee Reports
- New Business
- Executive Session
- Adjournment
Faculty Meeting
- Call to Order
- Chair's Remarks
- Announcements
- Committee Reports
- New Business
- Executive Session
- Adjournment
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting
- Call to Order
- Chair's Remarks
- Announcements
- Committee Reports
- New Business
- Executive Session
- Adjournment
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held in a hybrid format via Zoom and in Padelford Hall, room C-301 at 12:30pm, March 25, 2024. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
The meeting started with all participants sharing updates on personal news and achievements over the winter quarter or Spring break to allow the faculty, staff, and graduate student representatives to get acquainted/reacquainted.
Conditional Causal Effect Identification in MPDAGs
In our first project, we consider the problem of identifying a conditional causal effect through covariate adjustment. We focus on the setting where the causal graph is known up to one of two types of graphs: a maximally oriented partially directed acyclic graph (MPDAG) or a partial ancestral graph (PAG). Both MPDAGs and PAGs represent equivalence classes of possible underlying causal models.
Bayesian nonparametric methods for Complex Data
The modeling of complex data is an open challenge due to intricate spatio-temporal dynamics, covariate interactions and nonlinear effects, and heterogeneous grouping structure. Bayesian nonparametric models offer a compelling solution to these challenges due to their flexibility, minimal reliance on modeling assumptions, and adaptability to heterogeneity, while providing rigorous uncertainty estimates. In this presentation, I will talk about two Bayesian nonparametric models that handle two types of complex data.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Statistical Methods for the Analysis and Prediction of Hierarchical Time Series Data with Applications to Demography
In this talk, I discuss new methods for the analysis and prediction of hierarchical time series data with a focus on two applications to demography.
Using networks to address sampling bias in social and environmental population size estimation
In this talk, we consider two instances of sampling bias in estimating population size.
Faculty Meeting
The general format for these meetings is as follows:
- Call to Order
- Chair's Remarks
- Announcements
- Committee Reports
- New Business
- Executive Session
- Adjournment
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held in a hybrid format via Zoom and in Padelford Hall, room C-301 at 12:30pm, November 27, 2023. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
New Business
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held in a hybrid format via Zoom and in Padelford Hall, room C-301 at 12:30pm, October 2, 2023. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Bayesian methods for variable selection
Choosing a statistical model and accounting for uncertainty about this choice are important parts of the scientific process and are required for common statistical tasks such as parameter estimation, interval estimation, statistical inference, point prediction and interval prediction. A canonical example is the variable selection problem in a linear regression model. Many ways of doing this have been proposed, including Bayesian and penalized regression methods.
Estimating subnational health and demographic indicators using complex survey data
Subnational estimates of health and demographic indicators such as immunization coverage rates and child mortality rates are critical for identifying regional health disparities and guiding policy design. When population data on an outcome of interest are unavailable or incomplete, many countries gather information from a sample of the population using household surveys.
Methods for the Statistical Analysis of Preferences, with Applications to Social Science Data
Preference data, such as rankings and ratings, are prevalent in the social sciences for expressing and measuring attitudes or opinions. Oftentimes, deterministic algorithms or summary statistics are used to aggregate preferences, which lack the ability to measure uncertainty or identify preference heterogeneity in a population. This thesis proposes new methodologies for statistical preference analysis that aid accurate estimation, inference, and decision-making with preference data in social science applications.
Data thinning to overcome double dipping
We refer to the practice of using the same data to fit and validate a model as double dipping. Problems arise when standard statistical procedures for validating models are applied in settings that involve double dipping. To circumvent the challenges associated with double dipping, one approach is to fit a model on one dataset, and then validate the model on another independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting.
Inference and Estimation for Network Data
Networks play a key role in many scientific domains, yet collecting network data is often expensive and time-consuming. This thesis analyzes several estimation problems in network inference, especially in cases where only partial data about the network is available.
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held at 12:30pm, May 8, 2023. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
New Business
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held at 12:30pm, May 8, 2023. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Announcement
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held at 12:30pm, April 24, 2023. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Announcement
Faculty Meeting (Executive Session)
Agenda will be sent to statvotefac@uw.edu.
Learning from Expert Knowledge: Bandits and Graphs
How can we leverage domain knowledge during statistical tasks such as learning or decision-making? In this talk, we will discuss two instances of this question that arise in multi-armed bandits and causal discovery.
Interpretation and Validation for Unsupervised Learning
This thesis studies two major problems in unsupervised learning: manifold learning and clustering. The motivation of this research is to establish mathematically rigorous methods that enable practitioners to have a better understanding of what the algorithm is doing, even if there is no ground truth label for unsupervised learning problems. Specifically, we propose two criteria for a useful unsupervised learning paradigm: interpretability and stability. In this talk, we will mainly focus on the stability issue of clustering.
Faculty Meeting (Executive Session)
- Call to Order
- Chair's Remarks
- Announcements
- Committee Reports
- New Business
- Executive Session
- Adjournment
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held starting at 12:30pm, March 6, 2023. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
New Business
Pedagogy: This discussion was led by C. Marzban.
Faculty Meeting (Executive Session)
Meeting agenda TBA via statvotefac@ mailing list. Kristine Chan (kyunchan@uw.edu) will send the agenda at least 3 business days before this meeting.
Faculty Meeting (Executive Session)
Meeting agenda TBA via statvotefac@ mailing list. Kristine Chan (kyunchan@uw.edu) will send the agenda at least 3 business days before this meeting.
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held on Zoom at 12:30pm, February 6, 2023. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Announcements
Faculty Meeting (Executive Session)
Meeting agenda TBA via statvotefac@ mailing list. Kristine Chan (kyunchan@uw.edu) will send the agenda at least 3 business days before this meeting.
Statistical estimation and decision-making for the COVID-19 pandemic
There are multiple sources of data giving information about the number of SARS-CoV-2 infections in the population, but all have major drawbacks, including biases and delayed reporting. Representative random prevalence surveys, the only putatively unbiased source, are sparse in time and space, and the results can come with big delays. Reliable estimates of population prevalence are necessary for understanding the spread of the virus and the effectiveness of mitigation strategies. We
Likelihood-based haplotype frequency modeling using variable-order Markov chains
The localized haplotype-cluster model uses variable-order Markov chains (VOMCs) to create an empirical model for haplotype probabilities that adapts to the changing structure of linkage disequilibrium (LD) across the genome. By clustering partial haplotypes based on the Markov property as represented by a directed acyclic graph (DAG), the model is able to take advantage of context-sensitive conditional independencies to improve estimates of
Learning in Latent Variable Models
Latent variable models are ubiquitous in many areas of statistics. In this talk, we consider two problems which involve inferring properties of a latent distribution where the only observed data are binary outcomes. We first consider the deconvolution problem in the semiparametric Rasch model. Item response theory typically involves the noisy measurement of some underlying latent trait using discrete testing questions.
Graphs for Statistical Learning and Modeling
A graph, consisting of a set of vertices and a set of edges, is a geometric object that can not only visualize but also mathematically characterize the geometric structures in data. Graphs also model relations or connections between different units and have applications in various fields such as epidemiology, sociology, biology, and chemistry. We first take advantage of graphs from a geometric perspective.
Faculty Meeting (Executive Session)
This meeting will be for all eligible voting faculty.
Faculty Meeting (Executive Session)
This meeting will be for all eligible voting faculty.
Faculty Meeting (Executive Session)
This meeting will be for all eligible voting faculty.
Faculty Meeting (Executive Session)
This meeting will only be for Full Professors.
Faculty Meeting (Executive Session)
TBA
Faculty Meeting (Executive Session)
TBA
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held on Zoom at 12:30pm, October 17, 2022. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Announcements
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held on Zoom at 12:30pm, October 10, 2022. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
The meeting started with all participants sharing updates on personal news, achievements over the summer, and progress of fall quarter to allow the faculty, staff and graduate student representatives to get acquainted/reacquainted.
Announcements
Faculty Meeting (Executive Session)
This meeting will only be for Associate Professors and Full Professors.
Objective Bayesian methods for variable selection
Choosing a statistical model and accounting for uncertainty about this choice are important parts of the scientific process and are required for common statistical tasks such as parameter estimation, interval estimation, statistical inference, point prediction and interval prediction. A canonical example is the variable selection problem in a linear regression model. Many ways of doing this have been proposed, including Bayesian and penalized regression methods.
Areal models for estimation of subnational health and demographic indicators
In countries where the availability of census and vital registration data is limited, estimating subnational health and demographic indicators is challenging. Existing small area estimation approaches from the survey statistics literature often rely upon the availability of high-quality census information.
Faculty Meeting - Executive Session (Voting faculty only)
Research prelim results Faculty Meeting
Faculty Meeting - Executive Session (Voting faculty only)
MS Theory results Faculty Meeting, joint with Biostat
Statistical methods for preference learning
Preference learning is the task of aggregating individual preferences, such as rankings or ratings, in order to learn the overall preferences of a population. In most settings, preference aggregation is performed deterministically and fails to capture any uncertainty in the overall preferences. Furthermore, there are no statistical models for scenarios in which rankings and ratings arise simultaneously, which occur in a variety of real-world settings.
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held on Zoom at 12:30pm, June 6, 2022. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Chair Remarks
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held on Zoom at 12:30pm, May 16, 2022. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Chair Remarks
Important meeting dates:
- Research prelim results Faculty Meeting – Tuesday, June 21, 12:30 pm – 2:00 pm
- MS Theory exam results Faculty Meeting – Friday, June 24, 10:00 am – 12:00 pm
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held on Zoom at 12:30pm, May 9, 2022. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
New Business
Faculty Meeting
The regular meeting of the faculty of the Department of Statistics was held on Zoom at 12:30pm, May 2, 2022. Abel Rodriguez, Department Chair, presided at the meeting. Kristine Chan was recording secretary.
Announcements
Faculty Meeting - Executive Session (Voting faculty only)
TBD
Probabilistic Forecasts of International Bilateral Migration Flows
Accurate estimates of historic migration trends and forecasts of future trends are essential to crafting effective migration policies. Recent methodological advances made it possible to generate plausible estimates of international migration flows at a global scale; however, flow forecasting method development lags progress in estimation.
Statistical Methods for Estimating and Projecting the Effect of Policy Interventions on Demographic Outcomes
I consider the problem of estimating and projecting the effect of policy interventions on demographic outcomes and develop a conditional Bayesian hierarchical model for probabilistic projections of the outcome of interest given a set of interventions. Under specified assumptions, I show that the estimated effect is causal. The motivating question is that of identifying policy interventions to accelerate fertility decline in high-fertility countries.
Statistical methods for adaptive immune receptor repertoire analysis and comparison
B and T cell receptors, also known as adaptive immune receptors, perform key roles in adaptive immunity.
These proteins identify and deal with foreign invaders like viruses or bacteria, allowing for robust and long-lasting immunological protection.
The DNA sequences coding for these receptors arise by a complex recombination process followed by a series of productivity-based filters, as well as affinity maturation for B cells, giving considerable diversity to the circulating pool of these sequences.
Quantifying Uncertainty in Causal Discovery with Bayesian Causal Model Selection
Causal Discovery algorithms attack the challenging problem of learning the causal relationships among a set of variables from observational data, but are often partly ad-hoc and give the researcher no measure of confidence in the correctness of the learned causal structure. I introduce Bayesian Causal Model Selection (BCMS), a Bayesian framework for causal discovery that unifies existing methods by expressing identifiability assumptions through the model prior.
Laplace approximations and ordinal models for continuous spatial and spatio-temporal health mapping applications
With the increasing ability to collect myriad types of spatial data, we find ourselves regularly presented with new modeling problems that require novel solutions, but many of the available options for fitting spatial statistical models have limited applicability.
Faculty Meeting - Monday, March 2, 2020
A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, March 2nd, 2020. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
Faculty Meeting - Monday, February 24, 2020
A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, February 24th, 2020. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
Chair’s Remarks
Dan announced that Jon Wellner will be giving a Norman Breslow Endowed Lecture on Thursday, April 30, 2020, from 3:30pm to 5:00pm. Information about his talk will soon be posted on our website.
Faculty Meeting - Monday, February 10, 2020
A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, February 10th, 2020. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
Faculty Meeting - Monday, January 27, 2020
A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, January 27th, 2020. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
The meeting began with approval of previous meeting minutes from November 8, 2019, November 18, 2019, December 2, 2019, and December 9, 2019.
Functional Estimation in Nonparametric Regression
Consider the heteroscedastic nonparametric regression model with random design $Y_i = f(X_i) + V^{1/2}(X_i)\varepsilon_i, \quad i=1,2,\ldots,n$, with $f(\cdot)$ and $V(\cdot)$ $\alpha$- and $\beta$-Hölder smooth, respectively.
Faculty Meeting - Monday, November 18, 2019
A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 18th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
Chair’s Remarks
Daniel Pollack informed the faculty that there will be faculty lunches with the chair candidates on Tuesday and Thursday (Nov. 19 & 21). If any faculty would like to sign up, please reach out to Kristine Chan.
Flexible spatial models for household survey data in low and middle income countries
The need for rigorous and timely health and demographic summaries has led to an explosion in geographic studies, particularly in low and middle income countries. While household surveys are a major source of data in this context, they present challenges for statistical modeling. These challenges include biases due to oversampling certain population segments, nonlinear interactions between covariates, and multiple scales of prediction. However, many common statistical methods have never been tested rigorously in these settings.
Faculty Meeting - Monday, December 2, 2019
A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, December 2nd, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
Chair’s Remarks
Daniel Pollack reminded the faculty about the Holiday Party on Dec. 11; food and drinks will be provided. Vickie Graybeal has sent an email asking faculty, postdocs, and staff to RSVP by Dec. 4 at 12:00pm.
Faculty Meeting - Monday, November 4, 2019
The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 4th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
The meeting began with approval of previous meeting’s minutes from October 21, 2019.
Faculty Meeting - Friday, November 8, 2019
A special meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, November 8th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
The meeting began with approval of previous meeting’s minutes from November 4, 2019.
Adjournment
There being no chair remarks, announcements, committee reports, or new business, the meeting passed into the executive session at 12:37pm and was adjourned at 2:00pm.
Faculty Meeting - Monday, December 9, 2019
A regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, December 9th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
Announcements
Daniel Pollack announced that the Senior Lecturer position has been posted on the department website.
Committee Reports
The GSRs reported on students’ interactions with, and feedback on, each chair candidate.
Faculty Meeting - Monday, October 21, 2019
The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, October 21st, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
The meeting began with approval of previous meeting’s minutes from October 7, 2019.
Chair’s Remarks
Daniel Pollack announced Vickie Graybeal’s twenty years of service award.
Faculty Meeting - Monday, October 7, 2019
The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, October 7th, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
The meeting began with approval of previous meeting’s minutes from September 23, 2019.
Chair’s Remarks
Faculty Meeting - Monday, September 23, 2019
The regular meeting of the faculty of the Department of Statistics was held in C-301 Padelford Hall at 12:30pm, September 23rd, 2019. Daniel Pollack, Interim Chair, presided at the meeting. Kristine Chan was recording secretary.
Chair’s Remarks
Daniel Pollack reported there will be no faculty retreat this Autumn quarter. Planning for the retreat will be revisited in the Spring.
He also provided updates and timelines on the two ongoing searches: Full Professor & Chair and Assistant Professor.
Latent Variable Models for Indirectly or Imprecisely Measured Networks
In the social sciences, social networks are important structures which represent the relationships and interactions between actors in a population of study. The most common methods for measuring networks are to survey study participants about who their connections are and to collect interaction activity between pairs of actors. However, directly measuring the exact network of interest can be challenging.
Estimation and testing under shape constraints
Over the last few decades, shape-constrained methods have gained importance in statistical inference as attractive alternatives to traditional nonparametric methods, which often require tuning parameters and restrictive smoothness assumptions. This talk focuses on the application of shape constraints such as unimodality and log-concavity in comparing the outcomes of two HIV vaccine trials.
Realized genome sharing in random effects models for quantitative genetic traits
DNA copies inherited from the same ancestral copy by related individuals are said to be identical by descent (IBD). IBD gives rise to genetic similarities between related individuals. In quantitative genetics, two fundamental problems are heritability estimation and gene mapping for genetic traits. IBD plays a critical role in the study of both problems. When working with population-based samples where pedigree information is unavailable, it is essential to estimate IBD accurately from genetic marker data using pedigree-free methods.
Inferring Network Structure From Partially Observed Graphs
Collecting social network data is notoriously difficult, meaning that indirectly observed or missing observations are very common. In this talk, we address two such scenarios: inference on network measures without network observations and inference of regression coefficients when actors in the network have latent block memberships.
High-dimensional independence testing with maxima of rank correlations
Testing mutual independence for high-dimensional observations is a fundamental statistical challenge. Popular tests based on linear and simple rank correlations are known to be incapable of detecting non-linear, non-monotone relationships, calling for methods that can account for such dependences. To address this challenge, we propose a family of tests that are constructed using maxima of pairwise rank correlations that permit consistent assessment of pairwise independence.
Recursive Inversion Models for Partially Ranked Data
Can we do exact and tractable inferences in Mallows-like models for incomplete data? I will show that the answer is yes for the most general form Mallows-type model and a large class of partial orders known as partial rankings (including special cases like top-t rankings). I will also demonstrate that despite partial rankings lacking a sufficient statistic, exact inference is possible with overhead that is at most polynomial in O(nN) and that, in practice, the overhead per data point is negligible.
Fitting Stochastic Epidemic Models to Multiple Data Types
Traditional infectious disease epidemiology focuses on fitting deterministic and stochastic epidemic models to surveillance case count data. Recently, researchers began to make use of infectious disease agent genetic data to complement statistical analyses of case count data. Such genetic analyses rely on the field of phylodynamics, a set of population genetics tools that aim at reconstructing the demographic history of a population based on molecular sequences of individuals sampled from the population of interest.
Large-Scale B Cell Receptor Sequence Analysis Using Phylogenetics and Machine Learning
The adaptive immune system synthesizes antibodies, the soluble form of B cell receptors (BCRs), to bind to and neutralize pathogens that enter our body. B cells are able to generate a diverse set of high affinity antibodies through the affinity maturation process.
Gradient Group Lasso Identifies Sparse Functional Basis for Molecular Manifolds
We present a method for analyzing low-energy paths between molecular conformations by combining techniques in both manifold learning, which identifies such paths, and functional regression, which can parameterize them by explanatory non-linear functions. Unsupervised manifold learning approaches are useful for understanding molecular dynamics simulations since they disregard small-scale information such as peripheral hydrogen vibrations that can nevertheless drastically affect the observed energy.
Fast nonconvex changepoint detection
In recent years, new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons in behaving animals. For each neuron, a fluorescence trace is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an L1 penalty was proposed for this task.
Estimating Mortality at the Subnational Level in a Low and Medium Income Context
Child mortality, and in particular under-five mortality (U5MR), is an important indicator of the overall health of a population. Subnational estimation of U5MR is a relatively new endeavor
Statistical Methods for Manifold Recovery and C^{1, 1} Regression on Manifolds
High-dimensional data sets often have lower-dimensional structure taking the form of a submanifold of a Euclidean space. It is challenging but necessary to develop statistical methods for these data sets that respect the manifold structure. We present research from two different areas: manifold learning (i.e., support estimation) and smooth regression on manifolds.
Space-Time Contour Models for Sea Ice Forecasting
The amount of sea ice (frozen ocean water) found in the Arctic is declining rapidly as a result of climate change. This has increased the need for accurate forecasts of where sea ice will be located. Of particular interest is predicting the sea ice edge contour, or the boundary of the region where at least 15% of the area is ice-covered. Current sea ice forecasts are issued from deterministic numerical prediction systems.
Nonparametric inference on monotone functions, with applications to observational studies
In this dissertation, we study general strategies for constructing nonparametric monotone function estimators in two broad statistical settings. In the first setting, a sensible initial estimator of the monotone function of interest is available, but may fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain.
Bayesian Methods for Graphical Models with Limited Data
Scientific studies in many fields involve understanding and characterizing dependence relationships among large numbers of variables. This can be challenging in settings where data is limited and noisy. Take survey data as an example: understanding the associations between questions may help researchers better explain themes among related questions and impute missing values. Yet such data typically contains a combination of binary, continuous, and categorical variables; a high proportion of missing values; and complex data structures.
Preferential sampling and model checking in phylodynamic inference
Estimating population size fluctuations is one of the key tasks in ecology. However, traditional sampling-based approaches to this task have limitations when populations of interest are extinct or hard to reach, as is the case for individuals infected for a short time period by a pathogen.
Analysis of Incomplete Network Data
Collecting social network data is notoriously difficult, meaning that indirectly observed or missing observations are very common. In this talk, we address two such scenarios: inference on network measures without any direct network observations and inference of regression coefficients when important features are missing.
Parameter Identification and Assessment of Independence in Multivariate Statistical Modeling
In this talk we define a new class of multivariate nonparametric measures of dependence that we refer to as symmetric rank covariances. This new class generalizes many existing classical rank measures of dependence, such as Kendall's tau and Hoeffding's D, as well as the more recently discovered Bergsma–Dassios sign covariance.
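As a point of reference for the classical rank measures named above, Kendall's tau can be computed directly from concordant and discordant pairs. The following is a minimal sketch of the textbook tau-a statistic only, not of the symmetric rank covariances introduced in the talk; the function name `kendall_tau` is an illustrative choice.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        # The product is positive for a concordant pair, negative for a
        # discordant pair, and zero when either coordinate is tied.
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs
```

Because the statistic depends on the data only through the ordering of the observations, it is unchanged by strictly increasing transformations of either variable.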
Latent Variable Models for Imprecisely or Indirectly Measured Networks
In the social sciences, social networks are important structures which represent the relationships and interactions between actors in a population of study. In these fields, the most common method for measuring networks is to directly survey study participants about who their connections are. However, directly measuring the network of interest can be challenging. Participants do not always provide accurate accounts of their connections, which can result in mismeasurement of the network.
Causal Discovery with non-Gaussian Data
In this talk, we consider causal discovery when the underlying structure corresponds to a linear structural equation model with error terms which are non-Gaussian. Previous work by Shimizu et al. (2006) has shown that under this framework, a unique directed acyclic graph (not simply an equivalence class) can be identified from infinite data. We extend that result in two directions.
Faculty Meeting - Monday, January 29, 2018
Agenda:
- Faculty Search discussion
Composite Likelihood Estimation for Binary Network Models
We develop a scalable method to estimate the parameters in models of very large binary network datasets. Maximum likelihood estimates are generally impossible to obtain because the full likelihood involves an intractable high dimensional integral. Also, full-likelihood Bayesian estimation is impractical for very large datasets as the MCMC algorithm is very slow.
Quarterly Pedagogy Meeting - March 7, 2016
Time: 12.30-1.30pm March 7, 2016
Place: Padelford Hall, C-301
Agenda:
- 12:30 - Pedagogy Meeting
Faculty Meeting - January 9, 2017
Time: 12.30-1.30pm January 9, 2017
Place: Padelford Hall, C-301
Agenda:
- Welcome back & Updates (Thomas R.)
- Mentoring & Diversity (Jessica G.)
- Consulting / Paul Sampson Replacement (Thomas R.)
Faculty Meeting - February 13, 2017
Time: 12.30-1.30pm February 13, 2017
Place: Padelford Hall, C-301
Agenda:
- Updates (Thomas R.)
- 3-year Affiliate/Adjunct Renewals (Thomas R.)
- Affiliate/Adjunct Re-Appointments (Not up for periodic 3-year review associated with renewal) (Thomas R.)
- Case for Promotion to Affiliate Associate Professor (Thomas R.)
- Paul Sampson (Thomas R.)
Faculty Meeting - February 27, 2017
Time: 12.30-1.30pm February 27, 2017
Place: Padelford Hall, C-301
Agenda:
- Renew Policies (Thomas R.)
- Biostatistics Search (Thomas R.)
- Affiliate Appointment for Jon Azose (Thomas R., Adrian R.)
- Loyce Adams (Emeritus Professor for AMATH) for Senator (Thomas R.)
- Annual Student Review (Michael P.)
Faculty Meeting - March 6, 2017
Time: 12.30-1.30pm March 6, 2017
Place: Padelford Hall, C-301
Agenda:
- Annual Student Review (Michael P.)
Faculty Meeting - April 3, 2017
Time: 12.30-1.30pm April 3, 2017
Place: Padelford Hall, C-301
Agenda:
Faculty Meeting - April 10, 2017
Time: 12.30-1.30pm April 10, 2017
Place: Padelford Hall, C-301
Agenda:
- Meeting for Full + Assoc. Professors Only
Faculty Meeting - April 24, 2017
Time: 12.30-1.30pm April 24, 2017
Place: Padelford Hall, C-301
Agenda:
- MS Student Review
Faculty Meeting - May 1, 2017
Time: 12.30-1.30pm May 1, 2017
Place: Padelford Hall, C-301
Agenda:
- FTL Consulting Search
- Personnel Matter (Full Professors only)
Faculty Meeting - May 8, 2017
Time: 12.30-1.30pm May 8, 2017
Place: Padelford Hall, C-301
Agenda:
- TCC Meeting
Faculty Meeting - May 22, 2017
Time: 12.30-1.30pm May 22, 2017
Place: Padelford Hall, C-301
Agenda:
- FTL Consulting Search
Faculty Meeting - June 5, 2017
Time: 12.30-1.30pm June 5, 2017
Place: Padelford Hall, C-301
Agenda:
- Update and discussion on searches.
- PhD Admission Policy; TOEFL Scores; TA Requirement for PhDs.
- College Absence Policy; also Effective Personnel Vote rule.
- 10 Year Department Review.
Faculty Meeting - October 2, 2017
Time: 12:30-1:30pm October 2, 2017
Place: Padelford Hall, C-301
Agenda:
- Updates
- Discussion of upcoming Search
- Adjunct Appointment (Amy Willis)
- Research Prelim and 572
Estimating coancestry among multiple individuals in populations
Segments of genome inherited from a common ancestor by multiple individuals are said to be identical by descent (IBD). Dense genotyping platforms permit the detection of IBD segments less than 5 centiMorgans long, which arise due to coancestry on the order of dozens of generations ago. Generalizations of classical pedigree-based linkage methods use this inferred IBD and can be applied in situations where pedigree data is incomplete. We present a method for inferring IBD in groups of individuals without pedigrees.
A Bayesian Surveillance System for Detecting Clusters of Non-Infectious Diseases
Advisor: Jon Wakefield
We consider the problem of detecting clusters of non-infectious and rare diseases. Cluster detection is the routine surveillance over a large expanse of small administrative regions to identify individual 'hot-spots' of elevated residual spatial risk without any preconceptions about their locations. A class of cluster detection procedures known as moving-window methods superimposes a large number of circular regions onto the study area.
Probability and Inference for Random Fields
In recent decades, there has been much progress and interest in spatial statistics, with applications in agriculture, epidemiology, geology and other areas of environmental science and in image analysis. Two contrasting approaches have emerged, one based on Markov random fields, the other on geostatistics. The development of Markov Chain Monte Carlo as a computational tool has been phenomenal and has made Bayesian inference for spatial models relatively easy to perform, whereas frequentist inference still presents difficult problems.
Bayesian Spatial and Temporal Methods for Public Health Data
Advisors: Adrian Dobra and Jon Wakefield Understanding the relationships between disease incidence and risk factors such as demographic characteristics, life style factors, and environmental contaminants is a central goal in public health and epidemiology. Often outcomes and risk factors are measured at specific locations or at particular times. We present flexible Bayesian models for spatial and temporal data to address important public health questions in two examples.
Portfolio Optimization and Asset Pricing with Skewed Fat-Tailed Distributions
Estimation with Bivariate Interval Censored Data
Scalable Methods for Inference of Multiple IBD
Advisor: Elizabeth Thompson A major topic in statistical genetics is discovering the locations of genes contributing to complex traits through linkage analysis. The likelihood of a genetic marker controlling the expression of the trait is calculated using estimated identity-by-descent (IBD) graphs, which indicate whether copies of the marker shared among individuals are inherited from a common ancestor. Methods for estimating IBD graphs either use pedigree or population relationships between the individuals, and do not scale to a large number of individuals.
Probabilistic Projections of Fertility Using a Bayesian Hierarchical
The United Nations Population Division produces estimates and projections of the total fertility rate for all countries in the world every two years. For countries with fertility above replacement level, future levels are projected by choosing one out of three scenarios describing the pace of future fertility decline.
I will discuss a Bayesian hierarchical model for producing country-specific projections of the total fertility rate, and assessing the uncertainty in these predictions. Results for various countries will be presented.
A Survey of the Markov Properties of Directed, Undirected, and Mixed Graphs
Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop
Inference of Identity by Descent for Linkage Analysis
Advisor: Professor Elizabeth Thompson
Identity by descent (IBD) describes the pattern of shared inheritance of DNA among individuals. Two or more copies of DNA are identical by descent if they are inherited from the same common ancestor. IBD underlies the genetic similarity between individuals and thus similarity in observed genetic traits. In a family study of a genetic disease, estimated IBD among individuals in the family is used to identify potential locations of the gene that causes the disease.
Exploring Rates and Patterns of Variability in Gene Conversion and Crossover in the Human Genome
Meiotic recombination is a biological process that shuffles our genetic material before we pass it along to our offspring. There are two known outcomes of recombination: crossover and gene conversion. Recently, fine-scale human crossover rates have been inferred with some success using statistical methodology applied to population data (i.e. genetic data on random samples of individuals from a population). However, reliable estimation of gene conversion rates has proven more difficult to come by.
Influence Functions in Finance: Statistical Analysis of Portfolio Risk and Performance Measures
Testing for Differences between Least Squares and Robust Regression Estimates
At present there is no well-accepted test for comparing least squares and robust linear regression coefficient estimates. To fill this gap we propose two Wald-like statistical tests for this purpose and demonstrate their efficacy, using the class of MM-estimators for robust regression.
Seeing the Trees Through the Forest: A Competition Model for Growth and Mortality
Advisor: Peter Guttorp
Local competition between trees affects growth and mortality, from which emerge spatial patterns of surviving trees. Often, the patterns resulting from this unspecified process are treated as instances of spatial patterns and analyzed with point process methods. Alternatively, forest simulation models assume mechanistic processes and parameters to examine the effects of these assumptions on tree patterns over time, and assess sensitivity to changing conditions, such as climate.
Goodness of Fit Through Empirical Likelihood: Berk-Jones, Reversed Berk-Jones, and Generalizations
Improving on the Sandwich
Advisor: Peter Hoff
Modeling Competition in Forest Development
Analysis of the patterns of entities and their attributes in space is a common and useful endeavor in ecology. Often, the end of a statistical analysis is a general characterization of the observed pattern or series of patterns. However, a good description of the outcome may be somewhat dissatisfying to the practicing scientist or resources manager in that the mechanisms and processes that led to the outcomes remain unknown.
Is the Classical t-Test of the Slope Really Invalid in Linear Regression Models?
High Dimensional Inference of Graphical Models Using Regularized Score Matching
Advisor: Mathias Drton
Log-Linear Models for Heterogeneity in Bipartite Networks
Counterfactuals and Bayesian Graphical Models
Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop
Parameter Identification and Assessment of Independence in Multivariate Statistical Modeling
Linear (causal) relationships between random variables can be conveniently encoded using a mixed graph (a graph with both directed and bidirected edges) where a directed edge implies a direct linear effect and a bidirected edge captures the existence of unobserved confounding. Even when there is a known mixed graph that accurately reflects the data-generating mechanism, that is, all causal relationships are known and linear, confounding can make it impossible to infer parameters of interest.
Nonparametric Estimation for Current Status Data with Competing Risks
We study the nonparametric maximum likelihood estimator (MLE) for current status data with competing risks. These data arise naturally in cross-sectional survival studies with several failure causes, and generalizations arise in HIV vaccine clinical trials. Until now, the asymptotic properties of the MLE have been largely unknown. We resolve this issue by proving consistency, the rate of convergence, and the limiting distribution of the MLE.
Hierarchical modelling of spatial structure of epidermal nerve fibers
Epidermal nerve fiber (ENF) density and morphology are used to diagnose small fiber involvement in diabetic and other small fiber neuropathies. ENF density and summed length of ENFs per epidermal surface area are reduced in diabetic subjects. Furthermore, based on mainly visual inspection, it has been reported that ENFs of subjects with diabetic neuropathy seem to appear more clustered than ENFs of healthy subjects. Therefore, it is important to understand the spatial structure of ENFs in healthy and diseased subjects.
MS Thesis Presentation - Modeling the Game of Soccer Using Potential Functions
Advisor: Peter Guttorp
Nonparametric Estimation of a k-Monotone Density: New Asymptotic Distribution Theory
Covariance Estimation and Testing for the Array Normal Model
Advisor: Peter Hoff
Statistical Analysis of Portfolio Risk and Performance Measures - The Influence Function Approach
Advisor: R. Douglas Martin
Pairwise Clustering by Random Walks
In a similarity-based clustering task, one defines a "similarity function" between pairs of points and then formulates a criterion (e.g. maximum intracluster similarity) that the clustering must optimize. The optimality criterion quantifies the intuitive notion that points in the same clusters should be similar while points in different clusters should be dissimilar. Most sensible criteria are NP-hard to optimize.
Bayesian Methods for Inferring Gene Regulatory Networks
Advisor: Adrian Raftery Gene regulatory networks are an important piece in understanding the functioning of living cells. As more and more gene expression data is becoming available, researchers need fast, reliable techniques for inferring these networks. I have developed ScanBMA, a fast Bayesian model averaging algorithm, used to infer networks from time-series data. I have also developed Model-based Clustering with Data Correction (MCDC), a method for automatically detecting and correcting errors that systematically affect some but not all data.
Up-and-Down and the Percentile-Finding Problem
A problem encountered across many fields in science, engineering and medicine is finding a specific percentile of a binary-response threshold distribution (for example, finding the ED50 of a medication). Statisticians have designed two popular sequential solutions to this challenge: 'Up-and-Down' (U&D), a 1940s-vintage method; and Bayesian designs, most prominently the 'Continual Reassessment Method' (CRM, O'Quigley et al., 1990), a design tailored to Phase I clinical trials. U&D generates a random walk revolving around the target percentile.
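The random-walk behavior of U&D can be illustrated with a short simulation. This is a minimal sketch of the classical median-targeting rule only (move one level down after a response, one level up otherwise), assuming a finite grid of dose levels with reflecting boundaries; the function name `up_and_down` and the argument `prob_response` are hypothetical and not taken from the talk.

```python
import random


def up_and_down(n_trials, n_levels, prob_response, start=0, seed=0):
    """Simulate a classical up-and-down experiment on levels 0..n_levels-1.

    prob_response(level) is the assumed probability of a positive response
    at the given dose level. Returns the visited levels, including the start.
    """
    rng = random.Random(seed)
    level = start
    path = [level]
    for _ in range(n_trials):
        responded = rng.random() < prob_response(level)
        # Step down after a response, up after a non-response.
        step = -1 if responded else 1
        # Keep the walk on the grid.
        level = min(max(level + step, 0), n_levels - 1)
        path.append(level)
    return path
```

With a response curve that increases in the dose level, the visited levels concentrate around the level whose response probability is closest to one half, which is why this rule targets the median (ED50).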
Lattice Conditional Independence Models for Incomplete Multivariate Data and for Seemingly Unrelated Regressions
Advisor: Michael Perlman
A Sharp Multiplier Inequality with Applications to Heavy-Tailed Regression Problems
Advisor: Professor Jon A. Wellner
We develop a sharp multiplier inequality used to study the size of the multiplier empirical process $(\sum_{i=1}^n \xi_i f(X_i))_{f \in \mathcal{F}}$, where the $\xi_i$ are multipliers and $\mathcal{F}$ is an indexing function class. We show that in general the size of the suprema of the multiplier empirical process is determined jointly by the growth order of the corresponding empirical process, and the worst size of the maxima of the multipliers.
TBD
Bayesian Population Reconstruction: A Method for Estimating Age- and Sex-specific Vital Rates and Population Counts with Uncertainty from Fragmentary Data
Current methods for reconstructing human populations of the past by age and sex are deterministic or do not formally account for measurement error. I propose "Bayesian reconstruction", a method for simultaneously estimating age-specific population counts, fertility rates, mortality rates and net international migration flows from fragmentary data, that incorporates measurement error. Expert opinion is incorporated formally through informative priors.
Parametrizations of Discrete Graphical Models
Advisor: Thomas Richardson Graphical models provide an intuitive way of representing conditional independence relations over multivariate distributions. We work with a very general class of graphs we dub Mixed Euphonious Graphs (MEGs), which include DAGs, undirected graphs and ancestral graphs as special cases. Markov properties and parametrizations of discrete distributions obeying the global Markov property for MEGs were found by Richardson (2003, 2009). We discuss this parametrization, and a Maximum Likelihood fitting algorithm which uses it.
Assessing the Detrended Fluctuation Analysis Method of Estimating the Hurst Coefficient
Gravimetric Anomaly Detection Using Compressed Sensing
Advisor: Marina Meila We address the problem of identifying underground anomalies (e.g. holes) based on gravity measurements. This is a theoretically well-studied and difficult problem. In all except a few special cases, the inverse problem has multiple solutions, and additional constraints are needed to regularize it. Our approach makes general assumptions about the shape of the anomaly that can also be seen as sparsity assumptions. We can then adapt recently developed sparse reconstruction algorithms to bear on this problem.
Postulating Monotonicity in Bayesian Nonparametric Regression
It is often reasonable, by using earlier empirical evidence or theoretical understanding of the considered applied context, to assume that the regression surface corresponding to a response variable, as a function of the model covariates, is either monotonically increasing or monotonically decreasing, but then otherwise leave the form of such a function unspecified. In this talk we consider the practical implications of making such a postulate when applying variable dimensional Bayesian modeling, MCMC, and model averaging.
Markov Equivalence Classes for Bayesian Belief Networks
Acyclic digraphs are used to represent the underlying relationships of some Bayesian belief networks, which are in turn used in expert systems and other representations of statistically interdependent items. But the set of such digraphs turns out to be too big and, instead, a smaller number of equivalence classes truly represent the set of possible networks. Until now, little has been known about the combinatorial properties of these classes, such as their asymptotic growth with number of vertices or the average class size.
Phylogenetic Stochastic Mapping
Advisor: Vladimir Minin
Latent Class Transition Model Extensions with Covariates for the Chronically Disabled U.S. Elderly Population
Explicit Limit Results for Markov Chains and Other Markov Processes
The statistical literature abounds with limit results (central limit theorems, laws of large numbers and laws of iterated logarithm) for Markov chains, Markov renewal processes, and Markov additive processes. However, most of the general results are not applicable in practice because the limiting quantities are generally not available in an explicit form.
Applications of Robust Statistical Methods in Quantitative Finance
Advisor: Douglas Martin Financial asset returns and fundamental factor exposure data often contain outliers, observations that are inconsistent with the majority of the data. Both academic finance researchers and quantitative finance professionals are well aware of the occurrence of outliers in financial data, and seek to limit the influence of such observations in data analyses. Commonly used outlier mitigation techniques assume that it is sufficient to deal with outliers in each variable separately.
Likelihood Inference for Population Structure, Using Coalescent
Bayesian Nonparametric Inference of Effective Population Size Trajectories from Genomic Data
Phylodynamics is an area at the intersection of phylogenetics and population genetics that aims to reconstruct population size trajectories from genetic data.
Bayesian Nonparametric Inference of Population Trajectories with Gaussian Processes
Advisor: Vladimir Minin Changes in population size influence genetic diversity of the population and, as a result, leave imprints in genomes of individuals in the population. We are interested in an inverse problem of reconstructing past population dynamics from genomic data. We start with a standard framework based on the coalescent, a stochastic process that generates genealogies connecting randomly sampled individuals from the population of interest. These genealogies serve as a glue between the population demographic history and genomic sequences.
Robust Bayesian Analysis of Gene Expression Microarray Data
Predictive Modeling of Cholera Outbreaks in Bangladesh
Advisors: Vladimir Minin and Ira Longini Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we estimate the contribution of environmental variables, such as water depth and water temperature, to cholera outbreaks in the context of two different disease transmission models.
Learning Transcriptional Networks from the Integration of ChIP-chip and Expression Data in a Nonparametric Model
We have developed LeTICE, an algorithm for learning a transcriptional regulatory network from ChIP-chip location and expression data. The network is specified by a binary matrix of transcription factor – gene interactions which partitions the genes into a collection of modules (groups of genes regulated by the same TFs) and a background (a group of genes which do not belong to any module). We define a likelihood of a network given location and expression data and then search for the network optimizing the likelihood using numerical optimization.
MS Thesis Presentation - On Left-Stochastic Decomposition Clustering
Advisor: Maya Gupta
Testing high-dimensional covariance/correlation structures
Advisor: Mathias Drton
Abstract:
Two hypothesis testing problems related to high-dimensional covariance/correlation structures will be presented.
Bayesian Hierarchical Curve Registration
A number of different scientific fields, ranging from biomedicine to economics to molecular biology, generate functional data. One goal of the statistical analysis of a sample of curves, known as Functional Data Analysis (FDA), is to explain how variation in the functional outcome depends on a set of predictors. However, these curves tend to be misaligned, exhibiting variation not only in amplitude, but also in phase. Teasing apart these sources of variation is a central issue in FDA.
Statistical Approaches to Analyze Mass Spectrometry Data
Advisors: Vladimir Minin & David Goodlett
Proteomics attempts to understand biological functions of an organism through the lens of expressed proteins, basic building blocks of all living cells. Mass spectrometry is used in the field of shotgun proteomics to generate mass spectra that are in turn used to identify and quantify proteins in a given sample.
Scalable Manifold Learning and Related Topics
Advisor: Marina Meila
The Up-and-Down Percentile-Finding Method: Stochastic Properties, Estimation and Design
Models and Inference for Network and Attribute Data
Latent variable network models provide low-dimensional representations of relational patterns in terms of additive and multiplicative actor-specific effects. In this talk we discuss these models in two contexts. First, we extend this class of models to estimate and make inference on the dependencies between a set of network relations and actor-specific attributes. Approaches to this problem typically condition on either the relations or attributes and are unable to provide predictions simultaneously for missing attribute and network information.
Estimating Social Contact Networks to Improve Influenza Simulation Models
Advisor: Mark Handcock
Influenza pandemics pose a serious global health concern. The recent A (H1N1) influenza pandemic caused 18,500 lab-confirmed deaths, and mutation of the A (H5N1) "avian" influenza virus could also cause a pandemic with an estimated 60% case mortality rate in humans, requiring fast analysis of intervention and containment strategies. When a new influenza virus emerges with pandemic potential, stochastic simulation models are used to assess the effectiveness of different strategies.
Introduction to Graphical Models
I will give a brief introduction to graphical models that will be followed by an outline of a few topics that future students of Michael Perlman and Thomas Richardson could work on.
Classifying Immune Responses in Peptide Microarray Immunoassays
Advisor: Dr. Raphael Gottardo Peptide microarrays tiling immunogenic regions of pathogens (e.g. envelope proteins of a virus) have become an important high throughput tool for querying and mapping antibody binding. Antibodies play a key role in the immune system by preventing and controlling infection. Antibody binding locations provide crucial information for understanding natural infection and for deriving effective vaccines.
Semiparametric Copula Models for Diverse Types of Dependent Data
In multivariate analysis, we are often interested in studying the dependence structure among diverse types of data, including continuous, ordinal, and non-ordered categorical data. One approach to analyze these data is using copula models. In this talk, I will discuss a method extending copula models to mixed continuous and ordinal data and study its asymptotic properties. Then I will introduce a new model incorporating copula models and model-based clustering ideas to deal with mixed continuous, ordinal and categorical data.
Ergodic Limit Laws for Stochastic Optimization Problems
Propp and Wilson's coupling from the past (CFTP) algorithm provides exact samples and, thus, an elegant alternative to convergence diagnostics for standard MCMC samplers. I shall explain how this method works and discuss some practicalities regarding its use in MCMC sampling. Unfortunately the CFTP technique is only applicable when the distribution to be sampled possesses certain special properties. We propose a way to use the method's basic idea more generally and demonstrate that our algorithm works well in some quite challenging applications.
Model-based and model-free community recovery in graphs
Advisor: Marina Meila
Abstract:
Wavelet Variance Analysis for Time Series and Random Fields
Wavelets give rise to the concept of wavelet variance that decomposes the variance of a time series on a scale by scale basis and that has considerable appeal when physical phenomena are analyzed in terms of variations operating over a range of different scales. The wavelet variance has been applied to a variety of time series and is useful as an exploratory tool to identify important scales, to assess the exponent parameter of a power law process, to detect inhomogeneity and to estimate a time varying spectral density function.
Hammersley's Process with Sources and Sinks
Hammersley (1972) initiated a very interesting "hydrodynamical" approach to the study of the behavior of the lengths of longest increasing subsequences of random permutations. In the nineties, Aldous and Diaconis (1995) introduced a modified version of the interacting particle process studied in Hammersley (1972) and used this modification in a proof of the fact that the length of a longest increasing subsequence of a (uniform) random permutation of length $n$, divided by $\sqrt{n}$, converges in probability to 2.
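The quantity in this limit result is easy to compute numerically. The sketch below uses the standard patience-sorting method with binary search to find the length of a longest strictly increasing subsequence, then scales it by $\sqrt{n}$ for a simulated uniform permutation. It only illustrates the statistic; it is not the particle-process argument discussed in the talk, and the helper names `lis_length` and `scaled_lis` are illustrative.

```python
import bisect
import math
import random


def lis_length(seq):
    """Length of a longest strictly increasing subsequence (patience sorting)."""
    # tails[k] holds the smallest possible last element of an increasing
    # subsequence of length k + 1 found so far.
    tails = []
    for value in seq:
        pos = bisect.bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def scaled_lis(n, seed=0):
    """LIS length of a uniform random permutation of size n, divided by sqrt(n)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return lis_length(perm) / math.sqrt(n)
```

Calling `scaled_lis(n)` for increasing values of `n` gives an empirical view of the convergence stated above.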
Maximum-Likelihood Inference after Model Selection
Standard statistical techniques often fail in the presence of data-driven model selection, yielding inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and inference difficult. Recently, novel methodologies have been proposed for performing valid inference in selected models.
Robust Estimation of Factor Models in Finance
Statistical inference using Kronecker structured covariance
We consider the problem of testing and estimation of separable covariances for relational data sets in the context of the matrix-variate normal distribution. Relational data are often represented as a square matrix, the entries of which record the relationships between pairs of objects. Many statistical methods for the analysis of such data assume some degree of similarity or dependence between objects in terms of the way they relate to each other. However, formal tests for such dependence have not been developed.
Modeling Longitudinal Multivariate Data with Mixed Outcomes: Hierarchical Latent Trait and Individual-Level Mixture Models
Advisor: Elena Erosheva
I develop Bayesian hierarchical latent variable models for the study of longitudinal multivariate data. The latent variable models seek to represent multivariate data with a reduced number of dimensions while the hierarchical formulation enables the description of the latent structure evolution over time as well as factors associated with this evolution. Research on cognitive assessments and scientific interest in relating cognitive decline to neuroimaging results and biomarker information motivate these models.
TBD
There will be a riveting introduction to social networks and the latent space model used for modeling networks. I will discuss the difficulties in estimating the parameters of this model by traditional methods and explain the estimator we came up with to deal with these issues.
R-Squared Inference Under Non-Normal Error
Advisor: Professor Ross L. Prentice
Estimation of Convex-Transformed Densities
A convex-transformed density is a quasi-concave (or a quasi-convex) density which is a composition of monotone and convex functions. We consider a scale of such families of multivariate densities indexed by a parameter which is a monotone function. The exponential function corresponds to log-concave densities, while power functions correspond to heavier tailed densities or densities concentrated on the positive orthant.
Classification by Opinion-Changing Behavior: A Mixture Model Approach
Popular theories in political science regarding opinion-changing behavior postulate the existence of one or both of two broad categories of people: those who hold their opinions over time, and those who hold no solid opinion and, when asked to make a choice, do so seemingly at random. This study explores evidence for a third category: durable changers. This group of people will change their opinion in a rational, informed manner, after being exposed to new information.
Bayes and Empirical Bayes Methods for Social Network Analysis
Advisor: Peter Hoff
Robust Statistics and Heavy-Tailed Distributions in Portfolio Optimization
Postprocessing of Precipitation Forecasts with an SPDE Based Spatio-temporal Model for Large Data
We introduce a hierarchical Bayesian model (HBM) for precipitation monitoring data that incorporates numerical weather prediction (NWP) model output at high spatial and temporal resolution and a physics-based stochastic partial differential equation (SPDE). The SPDE explicitly models phenomena such as advection and diffusion that occur in many natural processes. We approximate the solution of the SPDE in the spectral space using the method of eigenfunctions to reduce the dimensionality of the problem.
Maximum-Likelihood Inference after Model Selection
Co-Advisors: Mathias Drton & Raphael Gottardo
Standard statistical techniques often fail in the presence of data-driven model selection, yielding inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and inference difficult.
Allele-Sharing Methods for Linkage Detection Using Extended Pedigrees
Allele-sharing methods provide a robust approach to linkage detection for complex traits using pedigree data. Affected related individuals have increased probability of sharing genes identical-by-descent (IBD) at trait loci and hence also at linked marker loci at which they therefore show increased similarity over that predicted under Mendelian segregation. Relatives of discordant phenotype have decreased probability of sharing genes IBD at trait loci and hence have decreased similarity at linked markers.
Estimating the Treatment Effect of Non-Randomized Educational Interventions: The Case of Special Education
A central goal of the education literature is to demonstrate that specific educational interventions have a treatment effect on student test performance. Researchers often have access to student test scores for students in the treatment and control groups both prior to and after the intervention, but usually must estimate the treatment effect from observational data in which the intervention has not been randomly assigned to units.
Portfolio Optimization with Tail Risk Measures and Non-Normal Returns
Advisor: R. Douglas Martin
TBD
The Capital Asset Pricing Model (CAPM) is today's most important financial model for estimating cost of capital and asset allocation. Its centerpieces are variables, commonly called betas and alphas, estimated using ordinary least squares (OLS) regression. Since financial returns typically have an asymmetric and heavy-tailed distribution, OLS estimates can be severely biased.
Bayesian Space-Time Smoothing Models for Small Area Estimation
Advisor: Jon Wakefield
Area and time-specific estimates of disease rates, cause-specific mortality rates and other key health indicators are of great interest for health care and policy purposes. Such estimates provide the information needed to identify areas with increased risk, effectively allocate resources, and target interventions. A wide variety of data, such as vital statistics, complex surveys, demographic surveillance sites, and disease registries, are used for these purposes.
Extensions of Latent Class Transition Models with Application to Chronic Disability Survey Data
Latent class transition models (LCTMs) are used to study the movement of individuals among homogeneous subgroups through time. Traditional LCTMs assume a complete set of observations for each individual. However, many longitudinal surveys have a rolling enrollment design, with late entry and early exit. Thus, methodology is needed to account for all the possible times at which individuals can be observed.
Logic Regression
Advisors: Charles Kooperberg & Michael LeBlanc
Improving Serfling's Inequality for the Hypergeometric Distribution
Advisor: Jon Wellner
We discuss a method for obtaining finite sample Gaussian bounds for the tail of the hypergeometric distribution. The method is based on Tusnády's approach (1975) to bounding the tail of symmetric binomial random variables. In this talk, we review Tusnády's result, and discuss how it can be adapted to and extended in the hypergeometric case.
Jump Estimation in Inverse Regression Models
We provide an asymptotic theory for penalized least squares estimators of locally constant functions with finitely many jumps which are blurred by an operator and random noise. Differences from the direct case are highlighted; in particular, it turns out that a sqrt(n) rate of convergence for estimation of the jump locations is generic in the inverse case. Moreover, the locations of jumps are jointly asymptotically normal, which allows the construction of confidence regions for the graph of a function with a finite number of jumps.
Nonstationary Modeling Through Dimension Expansion
If atmospheric, agricultural, and other environmental systems share one underlying theme it is complex spatial structures, being influenced by such features as topography and weather. Ideally we might model these effects directly; however, information on the underlying causes is often not routinely available. Hence, when modeling environmental systems there exists a need for a class of spatial models which does not rely on the assumption of stationarity. In this talk, we propose a novel approach to modeling nonstationary spatial fields.
A General Approach to Nonparametric Monotone Function Estimation
For several important monotone parameters, such as the distribution function, monotone density function, and monotone regression function, sensible nonparametric estimators can be obtained by minimizing the empirical risk based on an appropriate loss function. For more complex monotone parameters, such as a monotone covariate-adjusted dose-response curve, or in the context of more complex data structures, this approach may not be possible and alternative approaches are needed.
Using the Structure of d-Connecting Paths as a Qualitative Measure of the Strength of Dependence
Restricted Covariance Priors with Applications in Spatial Statistics
We present a Bayesian model for area-level count data that uses Gaussian random effects with a novel type of G-Wishart prior on the inverse variance-covariance matrix. The usual G-Wishart prior restricts off-diagonal elements of the precision matrix to 0 according to the neighborhood structure of the study region. This preserves conditional independence of non-neighboring regions but is more flexible than the traditional intrinsic autoregression prior.
Covariance Estimation in the Presence of Diverse Types of Data
Advisor: Peter Hoff
TBD
MURI week continues this Friday. I'll be talking about probabilistic weather forecasting using Bayesian Model Averaging, an altogether different approach than the probabilistic forecasting method described by Tilmann in seminar earlier this week. I'll be discussing my work on forecasting of wind and rain, and looking at a modification of the EM algorithm for mixed continuous/discrete distributions.
Finite Sampling Exponential Bounds with Applications to Two-Sample Kolmogorov-Smirnov Statistics
Advisor: Jon Wellner
In this talk, we discuss exponential tail inequalities for the sum in the context of sampling without replacement. Using an exponential inequality due to Serfling as the basis for investigation, we consider the special case of sampling from a finite population containing only 0s and 1s. This leads to considering exponential bounds for the Hypergeometric distribution.
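To make the 0/1 special case concrete, the sketch below compares an exact hypergeometric upper tail with a Serfling-type bound. The form of the bound used here, exp(-2 n t^2 / (1 - (n - 1)/N)) for the sample mean exceeding the population mean by t, is the commonly quoted version of Serfling's inequality and is stated as an assumption; it is not the improved bound of the talk. The function names and the numerical example are also illustrative.

```python
import math

def hypergeom_upper_tail(N, D, n, k):
    """P(X >= k) when X counts the 1s in a sample of size n drawn without
    replacement from a population of size N containing D ones."""
    total = math.comb(N, n)
    top = min(n, D)
    ways = sum(math.comb(D, j) * math.comb(N - D, n - j)
               for j in range(max(k, 0), top + 1))
    return ways / total

def serfling_bound(N, n, t):
    """Assumed Serfling-type bound on P(X/n - D/N >= t) for t > 0."""
    f = (n - 1) / N  # sampling fraction term that tightens the Hoeffding bound
    return math.exp(-2 * n * t * t / (1 - f))

N, D, n, k = 1000, 300, 100, 40
t = k / n - D / N  # deviation of the sample mean from the population mean
tail = hypergeom_upper_tail(N, D, n, k)
bound = serfling_bound(N, n, t)
print(tail, bound)
```

The gap between the exact tail and the bound in examples like this one is what motivates looking for sharper inequalities in the hypergeometric case.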
Estimates and Projections of the Total Fertility Rate
Bayesian Analysis of Deterministic Models
Advisor: Adrian Raftery
Likelihood-Based Inference for Partially Observed Multi-Type Markov Branching Processes
Advisor: Vladimir Minin
Markov branching processes are a class of continuous-time Markov chains (CTMCs) with ubiquitous modeling applications. Multi-type processes are necessary to model phenomena such as competition, predation, or infection, but often feature large or uncountable state spaces, rendering general CTMC techniques impractical. We present new methodology motivated by processes arising in molecular epidemiology, cellular differentiation, and infectious disease dynamics.
Probabilistic Weather Forecasting with Spatial Dependence
Bayesian Modeling For Multivariate Mixed Outcomes With Applications To Cognitive Testing Data
This talk describes new multivariate regression and model-based clustering methods for statistical inference with multivariate mixed outcomes. We use the term mixed outcomes to refer to binary, ordered categorical, count, continuous and other ordered outcomes in combination. Such data structures are common in social, behavioral, and medical sciences. We develop two regression approaches, the semiparametric Bayesian latent variable model and the semiparametric reduced rank multivariate regression model, for mixed outcome data.
Model-Based Penalized Inference
It is well known that many penalized regression problems can be interpreted as estimating unknown regression coefficients having assumed a specific statistical model. This includes the lasso when tuning parameters are estimated from the marginal likelihood of the data, the Bayesian lasso, Gaussian random effects models, ridge regression, etc. In the first part, we consider estimating a mean matrix from a single noisy realization. We assume possibly sparse elementwise effects and use a lasso penalty.
Peptide Sequencing Using Tandem Mass Spectrometry
Tandem mass spectrometry has become a leading technology for protein identification. Much research has been done to automate the task of matching spectra to peptides.
In this study, we propose a probabilistic sequencing algorithm. It includes a probabilistic network to model the chemistry in the generation of the theoretical spectrum, a pair hidden Markov model to match the theoretical spectrum to the observed spectrum, and a probabilistic score function to rank the candidate sequences.
Gravimetric Anomaly Detection Using Compressed Sensing
We address the problem of identifying underground anomalies (e.g. holes) based on gravity measurements. This is a theoretically well-studied and difficult problem. In all except a few special cases, the inverse problem has multiple solutions, and additional constraints are needed to regularize it. Our approach makes general assumptions about the shape of the anomaly that can be seen as sparsity assumptions. Then we adapt recently developed sparse reconstruction algorithms to bear on this problem.
Bayesian Inference for Exponential-family Random Graph Models for Social Networks
The exponential-family random graph model (ERGM) has been widely applied in the fields of social network analysis, genetics (e.g. protein interaction networks), information theory, etc. Because of the intractability of the likelihood function, Markov chain Monte Carlo (MCMC) approximation is typically applied to obtain maximum likelihood estimators (Geyer and Thompson 1992). However, ERGMs still suffer from inferential degeneracy and computational deficiency. In this talk, we present Bayesian inference for ERGMs.
A New Goodness of Fit Test: The Reversed Berk-Jones Statistic
In classical testing problems, we often use statistics based on the empirical distribution function to test whether or not the underlying distribution of the data is what we think it might be. Berk and Jones introduced such a statistic in 1979. I'll talk about a statistic which is related to theirs (called the reversed Berk-Jones statistic), and some of its properties. Along the way we'll chat about what exactly the empirical distribution function is, and why I think it's so cool. That is all.
Modeling Preferential Sampling Reduces Bias and Improves Precision When Estimating Effective Population Size Trajectories
Advisor: Vladimir Minin
The field of phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from the population of interest. One way to accomplish this task is to formulate an observed sequence data likelihood by using a coalescent model for the sampled individuals' genealogy and then integrating over all possible genealogies via Monte Carlo or, less efficiently, by conditioning on one genealogy estimated from sequence data.
Statistical Solutions to Some Problems in Medical Imaging
To Sample or Not to Sample? Why is That the Question for Census 2000?
Public Talk
Bayesian Methods for Inferring Gene Regulatory Networks
Advisors - Adrian Raftery and Ka Yee Yeung (UW Tacoma)
On Monotonicity Constraints in High-Dimensional Optimization: Convexity and Mixture Models
Whole-Genome Quantitative Trait Prediction and Heritability Mapping via an Infinite Allele Model
The paradox of missing heritability refers to the common finding that in complex genetic traits with high heritability as estimated by methods such as twin studies, only a small fraction of the population variance is explained by the few Single Nucleotide Polymorphism (SNP) markers which are found to be individually significantly associated with the trait. Human height, with heritability estimates as high as 80% largely unexplained by individual SNPs, is the canonical example of such a trait.
Topics in Graph Clustering
Advisor: Marina Meila
Recovery of Item Rankings Under Nonnormal Fitting Distributions in MML Parameter Estimation
In a simulation study, data are generated under a variety of conditions with respect to underlying ability distribution, test length, and sample size. Item parameter estimates are obtained under two conditions: in one, the assumed ability distribution matches the underlying ability distribution; in the other, it does not. The item parameter estimates from the matching condition are compared to those from the nonmatching condition to determine the effect on the recovery of parameter estimates and item rankings.
Geostatistical Model Averaging for Probabilistic Quantitative Precipitation Forecasting
Advisor: Tilmann Gneiting
Accurate weather forecasts benefit society in crucial functions, including agriculture, transportation, recreation, and basic human and infrastructural safety. Over the past two decades, ensembles of numerical weather prediction models have been developed, in which multiple estimates of the current state of the atmosphere are used to generate probabilistic forecasts for future weather events. However, ensemble systems are uncalibrated and biased, and thus need to be statistically postprocessed.
Introduction to Model-Based Clustering
I will talk briefly about how I got involved in research in Model-Based Clustering in my final year of undergrad (and subsequently here) and give a brief outline of research I did then. The main part of the talk will be about different extensions to the model-based clustering methodology that I'm working on. I'll mainly be focusing on research on variable selection with model-based clustering but I'll also talk, if I have time, about ideas I'll be working on for the next year.
Bayesian Modeling of International Migration
Advisor: Adrian Raftery
The future of international migration is a topic of great social and political importance, and yet international migration is hard to even estimate, let alone predict. The unreliability of point projections of migration indicates a need for better quantification of uncertainty in migration projections. We accomplish this quantification of uncertainty with a Bayesian hierarchical autoregressive model on net migration rates. In an initial model, we assume error terms are independent across countries.
A Comparison of Alternative Methodologies for Estimation of HIV Incidence
Trend Estimation Using Wavelets
Advisors: Peter Guttorp & Don Percival
Discovering Interactions In Multivariate Time Series
In large collections of multivariate time series it is of interest to determine interactions between each pair of time series. We study methods for inferring time series interactions in three domains: 1) conditional independencies between time series, 2) Granger and instantaneous causality estimation in subsampled and mixed frequency time series, and 3) Granger causality estimation in multivariate categorical data. First, we explore a Bayesian framework for inferring graphical models of time series.
Statistical Methods in Medical Imaging: Application to Mammography
Medical professionals and researchers use a variety of imaging techniques in their clinical practice and scientific investigations. In this talk I will focus on mammography, which is used for breast examinations and routine breast cancer screening. While mammographic images have proved to be a useful non-invasive tool for clinical monitoring, the images often lack detail and clarity. For example, in addition to having limited spatial resolution, the skin-air boundary of the imaged breast is often obscured.
MS Thesis Presentation: A resampling approach to clustering with confidence
We propose a method for estimating the number of groups in a data set. Our method is an extension of Generalized Single Linkage clustering (GSL) (Stuetzle and Nugent 2010), a nonparametric clustering method based on the premise that groups in the data correspond to modes of the underlying data density. GSL starts with a nonparametric density estimate. It recursively splits the data into high density regions separated by valleys. The leaves of the resulting cluster tree correspond to modes of the density estimate.
Large-Scale B-Cell Receptor Sequence Analysis Using Phylogenetics and Machine Learning
Co-chairs: Vladimir Minin & Erick Matsen
Analysis of Haplotype Structure: Application to the DARC Gene Region
Factor Models with Non-Normality: Robust, Skewed Distribution MLE and Bayes Estimation
Advisor: R. Douglas Martin
The literature on use of robust estimates, skewed distribution MLEs and non-normal distribution hierarchical Bayes models for multi-factor models in finance is surprisingly thin, and limited for the most part to single factor models (SFMs). The ultimate goal of our research is the study of the relative merits of robust versus non-normal MLE estimation of multi-factor models and the use of hierarchical Bayes modeling of multi-factor models using skewed fat-tailed distributions.
Learning the "Epitome" of an Image
I will describe a new model of image data that we call the "epitome". The epitome of an image is its miniature, condensed version containing the essence of the textural and shape properties of the image. As opposed to previously used simple image models, such as templates or basis functions, the size of the epitome is considerably smaller than the size of the image or object it represents, but the epitome still contains most constitutive elements needed to reconstruct the image.
Discrete-Time Threshold Regression for Survival Data with Time-Dependent Covariates
Advisor: Professor Gary Chan
Combining Probability Forecasts
We propose a method for combining probability forecasts from different sources. The commonly used method of linearly combining probability forecasts has limitations, in that a weighted combination of distinct calibrated forecasts is necessarily uncalibrated. In view of this, we propose a recalibration method. We illustrate our findings with simulation examples and a case study on operational probability of precipitation forecasts.
Algorithms and Software for the Automated Identification of Minerals Using Field Spectra or Hyperspectral Imagery
Over the last few years, the speaker (and collaborators Leanne Bischof and Jon Huntington) have been developing fast and sophisticated algorithms and software for identifying pure minerals and mixtures of minerals from shortwave infrared spectra. The software, called The Spectral Assistant (TSA), has been designed to be used with a particular FIELD-PORTABLE spectrometer, the PIMA-II, which is about the size of a shoe box and can be used by geologists collecting samples in the field.
Hamiltonian Monte Carlo in Bayesian Empirical Likelihood Computation
We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The proposed method uses hitherto unknown properties of the gradient of the underlying log-empirical likelihood function. It is seen that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood.
Statistical Methodology for Longitudinal Social Network Data
Social interaction data are data that are generated from the interaction or relationship between two or more actors; thus the observational units are pairs, trios, etc. of actors. This type of data is common in all fields of social science (e.g. political science, sociology, anthropology, and economics), for the interaction of actors is a key element in social science theory.
Modeling Heterogeneity Within and Between Arrays
Data that can be represented in the form of an array is present in many of the social and biological sciences. In this talk we address two statistical problems concerning these data. The first problem is modeling the heterogeneity along the dimensions of an array. Previously developed models are either non-stochastic and difficult to interpret, or require a large number of parameters prohibiting likelihood based inference for some arrays.
Methods for Estimation and Inference for High-Dimensional Models
Advisors: Mathias Drton & Ali Shojaie
Modern statistical problems are increasingly high-dimensional, with the number of covariables p potentially vastly exceeding sample size N. Fortunately, significant progress has been made in developing rigorous statistical tools for tackling such problems, but these methods have primarily targeted prediction, point estimation, and/or variable selection.
Robust Bayesian Analysis of Gene Expression Microarray Data
Microarrays are part of a new class of biotechnologies that can be used to measure expression levels (DNA or RNA abundance) for thousands of genes at a time. This new technology is being applied increasingly in biological and medical research to address a wide range of problems, such as the classification of tumors or the study of host responses to bacterial infections. DNA microarray experiments raise numerous statistical questions in fields as diverse as image analysis, experimental design, hypothesis testing, cluster analysis, etc.
The Career Leap from Academia to Data Science
The amount of data we generate as a global civilization is growing exponentially. What's more important however, is the fact that storing, accessing and analyzing data is getting cheaper and faster. Organizations all over the world have realized that data is a prized commodity, and many in the industry are scrambling to extract value from their complex data sets. For this endeavor, they need individuals with the right skills and experience, and the quantitative disciplines in Academia are a great source for such individuals.
Nonparametric Estimation of the Bivariate Survivor Function
Correlated failure time data arise often in many application areas. For example, in genetic epidemiology studies, the disease occurrence times of pairs of family members are often correlated, and the degree of correlation may provide important leads with respect to disease etiology. Univariate failure time data methods are well established, including the Kaplan-Meier method, the censored data rank test and the Cox regression method. However, standard tools for multivariate failure time data analysis are not yet available.
Logistic Regression with Covariate Measurement Error: Estimation and a New Measurement Model
Advisors: Ross Prentice and Ching-Yun Wang
Likelihood-Based Inference for Partially Observed Multi-type Branching Processes
Advisor: Vladimir Minin
Branching processes are a class of continuous-time Markov chains (CTMCs) frequently used in stochastic modeling with ubiquitous applications. One-dimensional cases such as birth-death processes are well studied, but it is often necessary to model systems with more than one species; bivariate or other multi-type processes are commonly used to model phenomena such as competition, predation, or infection.
Statistical Methods for Analyzing Incomplete Financial Data with Heavy Tails
A common problem with financial historical data is that they often have unequal lengths of histories. Examples include country market indices, currency rates and hedge fund returns histories. Practitioners often deal with such issues by truncating all the series so that the remaining data have the same length, which is apparently not an ideal solution. We discuss existing statistical methods that utilize the full data set, such as maximum likelihood estimation and multiple imputation.
Parameter Priors for Directed Acyclic Graphical Models and the Characterization of Several Probability Distributions
Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop
We develop simple methods for constructing parameter priors for model choice among Directed Acyclic Graphical (DAG) models. In particular, we introduce several assumptions that permit the construction of parameter priors for a large number of DAG models from a small set of assessments. We then present a method for directly computing the marginal likelihood of every DAG model given a random sample with no missing observations.
John's Walk
We present an affine-invariant random walk for drawing uniform random samples from a convex body for which the maximum volume inscribed ellipsoid, known as John's ellipsoid, may be computed. We consider a polytope where as a special case. Our algorithm makes steps using uniform sampling from the John's ellipsoid of the symmetrization of at the current point. We show that from a warm start, the random walk mixes in steps.
Bayesian Hierarchical Self-Modeling Warping Regression with Application to Network Inferences
Functional data often exhibit a common shape but also variations in amplitude and phase across curves. The analysis often proceeds by synchronization of the data through curve registration. We propose a Bayesian hierarchical model for curve registration. Our model provides a formal account of amplitude and phase variability while borrowing strength from the data across curves in the estimation of the model parameters.
Bayesian Modeling of Health Data in Space and Time
In recent years spatial-temporal modeling has become increasingly popular in the field of public health and epidemiology. Motivated by two datasets, we address three issues in the Bayesian modeling of health data in space and time.
Realized Genome Sharing in Random Effects Models for Quantitative Genetic Traits
Advisor: Elizabeth Thompson
Learning in Spectral Clustering
Spectral segmentation is a technique used to group data based on pairwise similarities. A similarity matrix is used as input into a spectral clustering algorithm and a clustering over the data is output. The clustering criterion is such that similar points are put in the same cluster and dissimilar points are put in different clusters. Generally, this similarity matrix is assumed known, while in reality this matrix is usually constructed by hand, a very time consuming process.
Survival Analysis by Threshold Regression with Time-Dependent Covariates
A natural approach to survival analysis in many settings is to model the subject's "health" status as a latent stochastic process, where the terminal event is represented by the first time that the process crosses a threshold. "Threshold regression" models the covariate effects on the latent process. Much of the literature on threshold regression assumes that the process is one-dimensional Wiener, where crossing times have a tractable inverse Gaussian distribution but where the process characteristics are fixed at baseline.
Directed Markov Point Processes
Spatial point processes are often modeled as Markov fields, and inference for such models is sometimes either inefficient or computationally intensive due to difficulties in evaluating the normalizing constant. Simulation studies for such processes are hard. We exploit the partial order in the plane and introduce a class of Markov point processes known as "Directed Markov Point Processes" and investigate their properties. This Markov structure enables us to study some of the well-known spatial processes in detail.
Adaptive Higher-order Spectral Estimators
Advisor: Peter Hoff
Many applications involve estimation of a signal matrix from a noisy data matrix. In such cases, it has been observed that estimators that shrink or truncate the singular values of the data matrix perform well when the signal matrix has approximately low rank. In this talk, we generalize this approach to the estimation of a tensor of parameters from noisy tensor data. We develop new classes of estimators that shrink or threshold the mode-specific singular values from the higher-order singular value decomposition.
Nonparametric Estimation of Multivariate Monotone Densities
I will discuss the most important results obtained on nonparametric estimation of two multivariate families of densities that exhibit monotonicity constraints and which can otherwise be characterized as certain mixture models. The discussion will emphasize characterizations of the estimators and their strong consistency, and will then turn to rates of convergence of these estimators, in both the global and the local sense.
On the Geometry of Graphical Models
Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop
We provide a classification of graphical models according to their representation as subfamilies of exponential families.
Manifold Learning Using Kernel Density Estimation and Local PCA
High-dimensional datasets often have lower-dimensional structure, which frequently takes the form of a manifold. There are many algorithms (e.g., Isomap) that are used in practice to fit manifolds and thus reduce the dimensionality of a given dataset. In our work, we consider the problem of recovering a d-dimensional submanifold M of R^n when provided with noiseless samples from M. Ideally, the estimate M_hat of M should be an actual manifold. Generally speaking, existing manifold learning algorithms do not meet these criteria.
Algorithms for Estimating the Cluster Tree of a Density
The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. To cast clustering as a statistical problem, we regard the data as an iid sample from some unknown probability density p. We adopt the premise that groups correspond to modes of the density. Our goal then is to find the modes and assign each observation to the "domain of attraction" of a mode. We do this by estimating the cluster tree of the density, a representation of the hierarchical structure of its level sets.
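To make the level-set idea concrete, the sketch below estimates a one-dimensional density with a Gaussian kernel and counts the connected components of the set where the estimate is at least a given level. Tracking how that count changes as the level rises is the information a cluster tree records. This is only a toy illustration under assumed choices (kernel, bandwidth, grid, and sample are all hypothetical) and is not the algorithm developed in this work.

```python
import math

def kde(x, sample, bandwidth):
    """Gaussian kernel density estimate at x."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

def count_level_set_components(sample, bandwidth, level, grid):
    """Count maximal runs of consecutive grid points where the estimate is >= level."""
    components = 0
    inside = False
    for x in grid:
        above = kde(x, sample, bandwidth) >= level
        if above and not inside:
            components += 1  # entering a new connected piece of the level set
        inside = above
    return components

# Hypothetical sample with two well-separated groups.
sample = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
grid = [-3.0 + 0.02 * k for k in range(551)]
counts = {level: count_level_set_components(sample, 0.5, level, grid)
          for level in (1e-6, 0.1, 1.0)}
print(counts)
```

At a very low level the whole sample sits in one component, at an intermediate level the component splits in two (one per mode), and above the maximum of the estimate the level set is empty.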
Analyzing Time Series Data for Endemic Cholera in Bangladesh with Mechanistic Models of Infectious Disease Dynamics
Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we must estimate the environmental parameters in the context of a disease transmission model. We develop a method to simultaneously estimate the transmission parameters and the environmental parameters in a
Bayesian Graphical Models with Limited Data and External Information
Advisor: Tyler McCormick
Population Genetic Variation: A Computationally Tractable Model for Large Samples Typed at Many Loci
Haplotypes are specific combinations of alleles on the same chromosome, and various methods exist for the analysis of haplotype data from unrelated individuals. However, humans are diploid, and studies of genetic variation might consist of unphased genotype data, where an unordered pair of alleles is observed at each locus. There is a coming need for less computationally intensive models that may be directly applied to unphased genotype data from thousands of individuals at thousands of loci. In this talk, we present such a model for genetic variation.
Degeneracy, Duration, and Co-evolution: Extensions of Exponential Random Graph Model (ERGM) for Social Network
We will address three aspects of statistical methodology for Exponential family Random Graph Models (ERGMs) in the context of applications to social network analysis. We start by addressing the topic of degeneracy in ERGMs. This is a problem often misunderstood to characterize the entire family of ERGMs, but is properly understood as a more limited issue of model misspecification.
Factor Model Monte Carlo Methods for General Fund-of-Funds Portfolio Management
The general Fund-of-Funds (GFoF) class of investment organizations includes fund-of-hedge funds (FoHF), family offices, endowments, pension plans and asset management companies. GFoF portfolios are characterized by two important types of returns problems among others. The first is that the returns histories of the portfolio assets are unequal, sometimes quite short and often contain multiple frequencies, resulting in structured missing data problems. The second is that the returns have fat-tailed and skewed distributions to varying degrees.
Estimation in Generalized Linear Mixed Models: Comparison of Maximum Likelihood with Iterative Bias Correction
Advisor: Brian Leroux
A Finite Population Likelihood Ratio Test of the Sharp Null Hypothesis for Compliers
Advisor: Thomas Richardson
Conditional Tests for Localizing Trait Genes
Graphical Models from Phylogenies, Coalescents, and Migration
Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop
Improved estimation of bilateral migration flows
I propose a method for estimating migration flows between all pairs of countries, including breakdowns by place of birth. My estimator is a pseudo-Bayes estimator which smooths a set of state-of-the-art estimates of migration flows towards a simpler estimate which contains fewer structural zeroes. The smoothing process provides a natural way to bypass the state-of-the-art estimator's unrealistic assumption that the number of global migrants is as small as possible.
TBD
Separable covariance testing and estimating for sociomatrices
We consider the problem of testing and estimating separable covariances for relational data sets. We propose to model these data as matrix normal distributions with separate row and column covariance matrices. The existing literature on testing and estimation in the context of a matrix normal distribution requires multiple observations of the matrix, which rarely occurs for relational data sets.
Bayesian Modeling of Survey Data in Space and Time
Advisor: Jon Wakefield
Public health data are frequently obtained from surveys, which often have complex design sampling frames. It is crucial that analyses account for the latter to give appropriate inference. We describe two scenarios, with both having important spatial components. The first example is motivated by Behavioral Risk Factor Surveillance System (BRFSS) data. Empirical Bayes and Bayes hierarchical models for small area estimation have been used extensively for surveys like BRFSS.
Clustering with Confidence
One of the fundamental goals of nonparametric cluster analysis is to estimate the cluster tree of a density. I will define and illustrate the cluster tree and describe a graph-based procedure for its estimation. The cluster tree will usually have spurious leaves due to variability in the density estimate. I will introduce a bootstrap-based method for eliminating spurious leaves and “clustering with confidence”.
Running Markov Chain without Markov Basis
The methodology of Markov bases initiated by Diaconis and Sturmfels (1998) stimulated active research on Markov bases for more than a decade. It also motivated improvements of algorithms for Gröbner basis computation for toric ideals, such as those implemented in 4ti2.
Geostatistical Model Averaging
Probabilistic weather forecasting is becoming an increasingly important and active area of research. Most current statistical post-processing techniques account for forecast bias and predictive variance without regard to forecast location. We will discuss a technique that adjusts bias and predictive variance locally, called geostatistical model averaging (GMA). In particular, GMA allows the parameters of the predictive distribution to vary over the model grid.
Identification of an Infinite AR Model
The Likelihood Pivot: Performing Inference with Confidence
Advisor: Peter Hoff
Maximum likelihood estimation is a popular method of statistical inference in part due to its efficiency. Unfortunately, much of the efficiency is lost when the model has been misspecified. To account for possible model misspecification, the sandwich estimate of variance can be used with MLE inference to generate asymptotically correct confidence intervals, but these intervals typically perform poorly at small sample sizes.
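As a small illustration of the sandwich estimate of variance mentioned above, the sketch below fits a Poisson mean by maximum likelihood and compares the model-based variance of the estimate with the sandwich variance. For this particular model the sandwich variance reduces to the empirical variance of the data divided by n, so it stays valid when the counts are overdispersed. The Poisson working model, the function name, and the toy data are assumptions for illustration, not the likelihood pivot method of this work.

```python
def poisson_mle_variances(counts):
    """Fit a Poisson mean by maximum likelihood and return
    (estimate, model-based variance, sandwich variance) of the estimate."""
    n = len(counts)
    rate = sum(counts) / n  # MLE of the Poisson mean
    # "Bread": average negative second derivative of the log-likelihood, x / rate**2.
    bread = sum(x / rate ** 2 for x in counts) / n
    # "Meat": average squared score, where the score is x / rate - 1.
    meat = sum((x / rate - 1.0) ** 2 for x in counts) / n
    model_var = 1.0 / (n * bread)
    sandwich_var = meat / (n * bread ** 2)
    return rate, model_var, sandwich_var

# Hypothetical overdispersed counts: the sample variance exceeds the mean.
rate, model_var, sandwich_var = poisson_mle_variances([0, 0, 1, 1, 2, 8])
print(rate, model_var, sandwich_var)
```

With these counts the sandwich variance is several times the model-based one, which is the kind of discrepancy the sandwich correction is meant to capture under misspecification.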
Probabilistic Wind Forecasting Using Bayesian Model Averaging
Bayesian model averaging has been shown to be a useful method for developing probabilistic weather forecasts for quantities (such as temperature) that can be represented by univariate normal distributions. This talk will discuss how these methods can be extended to other distributions, using wind forecasting as an example.
Graphical Markov Models for Partially Observed Data Generating Mechanisms
Multivariate Analysis & Graphical Models of Association (MAGMA 4) Workshop
Graphical Markov models represent statistical dependencies by combining two simple yet powerful mathematical concepts: graphs and conditional independence. A graphical Markov model is constructed by specifying local dependencies for each node of the graph in terms of its immediate neighbors, yet can represent a highly varied and complex system of multivariate dependencies by means of the global structure of the graph.
Likelihood-based haplotype frequency modeling using variable-order Markov chains
The localized haplotype-cluster model uses variable-order Markov chains to create an empirical model for haplotype probabilities that adapts to the changing structure of linkage disequilibrium (LD) across the genome. By clustering haplotypes based on the Markov property, the model is able to take advantage of conditional independencies to improve estimates of haplotype frequencies while still respecting the dependencies induced by LD.
Likelihood Inference for Population Structure, Using the Coalescent
Learning and Manifolds: Leveraging the Intrinsic Geometry
We explore and exploit the use of differential operators on manifolds, the Laplace-Beltrami operator in particular, in learning tasks. Specifically, we are interested in uncovering the geometric structure of data (unsupervised learning) and in exploiting information contained in unlabeled data for regression and classification tasks (semi-supervised learning).