Husky Union Building (HUB)

Algorithmic bias is often of greatest concern in contexts shaped by a long history of discrimination, marginalization, and procedural injustice. Taking the US child welfare system as the primary case study, I will discuss what we have learned about the development, deployment, evaluation, and impact of predictive risk assessment algorithms in inequitable systems. I will describe the role that “non-universal” data collection and problem formulation—specifically, the choice of prediction target—play as potential drivers of disparities in the resulting predictions.

While data science enables rapid societal advancement, deferring decisions to machines does not automatically avoid egregious equity or privacy violations. Without safeguards in the scientific process—from data collection to algorithm design to model deployment—machine learning models can easily inherit or amplify existing biases and vulnerabilities present in society.

A common problem in many modern statistical applications is to find a set of important variables—from a pool of many candidates—that explain the response of interest. For this task, model-X knockoffs offers a general framework that can leverage any feature importance measure to produce a variable selection algorithm: it discovers true effects while rigorously controlling the number or fraction of false positives, paving the way for reproducible scientific discoveries.
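
The selection step of the knockoff filter is simple enough to sketch. Assuming feature statistics W_j have already been computed (large positive values favour a true effect, while null statistics are sign-symmetric), the knockoff+ threshold at target FDR level q can be implemented as below; the function and toy statistics are illustrative, not part of the talk:

```python
import numpy as np

def knockoff_threshold(W, q=0.1):
    """Knockoff+ threshold for feature statistics W: select {j : W[j] >= tau}.

    Scans candidate thresholds in increasing order and returns the smallest
    one whose estimated false discovery proportion is at most q.
    """
    for t in np.sort(np.abs(W[W != 0])):          # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf                                 # nothing selectable at level q

rng = np.random.default_rng(0)
# toy statistics: 10 strong signals among 90 sign-symmetric nulls
W = np.concatenate([rng.normal(5.0, 1.0, 10), rng.normal(0.0, 1.0, 90)])
tau = knockoff_threshold(W, q=0.2)
selected = np.where(W >= tau)[0]
```

The "+1" in the numerator is what distinguishes knockoff+ (exact FDR control) from the plain knockoff procedure.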

Registers are increasingly important sources of data to be analyzed. Examples include registers of congenital abnormalities, supermarket purchases, or traffic violations. In such registers, records are created when a relevant event is observed, and they contain the features characterizing the event. Understanding the structure of associations among the features is of primary interest. However, the registers often do not contain cases in which no feature is present and therefore, standard multiplicative or log-linear models may not be applicable.
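
As a reminder of the standard setup (the notation here is generic): for two binary features A and B, a log-linear model for the expected cell counts takes the form

```latex
\log \mu_{ij} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{AB}_{ij},
\qquad i, j \in \{0, 1\},
```

and fitting it requires all four cells, including the cell (i, j) = (0, 0) in which neither feature is present; it is exactly this cell that event-triggered registers never record.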

We introduce the Essential Regression model, which provides an alternative to the ubiquitous K-sparse high dimensional linear regression on p variables. While K-sparse regression assumes that only K components of the observable X directly influence Y, Essential Regression allows for all components of X to influence Y, but mediated through a K-dimensional random vector Z.
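
Under the description above, the model can be written as (the loading matrix A and the noise terms are generic placeholders for whatever structure the talk assumes):

```latex
Y = Z^{\top}\beta + \varepsilon, \qquad
X = A Z + E, \qquad
Z \in \mathbb{R}^{K}, \quad A \in \mathbb{R}^{p \times K},
```

so that X influences Y only through the K-dimensional latent factor Z.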

In a seminal paper, Robins (1998) introduced marginal structural models (MSMs), a general class of counterfactual models for the joint effects of time-varying treatment regimes in complex longitudinal studies subject to time-varying confounding. He established identification of MSM parameters under a sequential randomization assumption (SRA), which rules out unmeasured confounding of treatment assignment over time.
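
To fix ideas with a minimal example (the linear form and notation are illustrative, not the talk's): for a treatment sequence \(\bar{a} = (a_1, \ldots, a_T)\) and counterfactual outcome \(Y^{\bar{a}}\), a simple MSM posits

```latex
\mathbb{E}\big[\, Y^{\bar{a}} \,\big] \;=\; \beta_0 + \beta_1 \sum_{t=1}^{T} a_t ,
```

and under sequential randomization its parameters can be estimated by weighting each subject by the inverse probability of the treatment sequence actually received, given past treatment and covariate history.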

Note 2/7/2018: We are canceling this seminar as a precaution in anticipation of the expected winter storm.


As the pace and scale of data collection continues to increase across all areas of biology, there is a growing need for effective and principled statistical methods for the analysis of the resulting data. In this talk, I'll describe two ongoing projects to help fill this gap. 

A new standard is proposed for the evidential assessment of replication studies. The approach combines a specific reverse-Bayes technique with prior-predictive tail probabilities to define replication success. The method gives rise to a quantitative measure for replication success, called the sceptical p-value. The sceptical p-value integrates traditional significance of both the original and replication study with a comparison of the respective effect sizes.

Did you know that your skills in statistics can be applied to ensure natural resources, such as fish, wildlife and even ecosystems, remain resilient into the future? That your love of algebra can take you to wild, remote, and amazing places? That there are careers where you get to collaborate with a wide variety of dedicated scientists working to better understand the world, how it is changing, and what it will be like in the future?

In many applications, investigators monitor processes that vary in space and time, with the goal of identifying temporally persistent and spatially localized departures from a baseline or "normal" behavior. In this talk, I will first discuss a principled Bayesian approach for estimating time-varying functional connectivity networks from brain fMRI data.

The asymptotics of the second-largest eigenvalue in random regular graphs (also referred to as the "Alon conjecture") have been computed by Joel Friedman in his celebrated 2004 paper. Recently, a new proof of this result has been given by Charles Bordenave, using the non-backtracking operator and the Ihara-Bass formula.
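
The statement in question, in the form usually attributed to Alon's conjecture: for fixed degree d ≥ 3 and adjacency eigenvalues \(\lambda_1 = d \ge \lambda_2 \ge \cdots \ge \lambda_n\) of a uniformly random d-regular graph on n vertices,

```latex
\Pr\!\big( \max(\lambda_2, |\lambda_n|) \le 2\sqrt{d-1} + \varepsilon \big)
\;\longrightarrow\; 1
\quad \text{as } n \to \infty, \ \text{for every } \varepsilon > 0,
```

matching the Alon–Boppana lower bound \(2\sqrt{d-1} - o(1)\) that holds for every d-regular graph.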

Non-Gaussian spatial data arise in a number of disciplines. Examples include spatial data on disease incidences (counts), and satellite images of ice sheets (presence-absence). Spatial generalized linear mixed models (SGLMMs), which build on latent Gaussian processes or Markov random fields, are convenient and flexible models for such data and are used widely in mainstream statistics and other disciplines. For high-dimensional data, SGLMMs present significant computational challenges due to the large number of dependent spatial random effects.
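
A minimal sketch of the data-generating process behind such a model, with a Poisson response and a one-dimensional latent Gaussian process (all parameter values and names are illustrative):

```python
import numpy as np

def simulate_sglmm_counts(n=50, ell=0.3, sigma2=1.0, beta0=1.0, seed=2):
    """Simulate from a spatial Poisson SGLMM on 1-D locations:
    latent w ~ GP(0, k) with squared-exponential kernel k,
    counts y_i ~ Poisson(exp(beta0 + w_i))."""
    rng = np.random.default_rng(seed)
    s = np.sort(rng.uniform(0, 1, n))                # spatial locations
    d = s[:, None] - s[None, :]
    K = sigma2 * np.exp(-0.5 * (d / ell) ** 2) + 1e-8 * np.eye(n)
    w = rng.multivariate_normal(np.zeros(n), K)      # latent spatial effects
    y = rng.poisson(np.exp(beta0 + w))               # count observations
    return s, w, y

s, w, y = simulate_sglmm_counts()
```

The computational burden mentioned in the abstract comes from the reverse direction: integrating out the n dependent latent effects w when fitting the model, which the forward simulation above sidesteps entirely.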

Interested in what our graduate students have been working on? Come join us for posters and presentations by the students themselves as they present their research.

Volunteer presenters include:

Data science is at a crossroads. Each year, thousands of new data scientists enter science and technology after broad training in a variety of fields. Modern data science is often exploratory in nature, with datasets being collected and dissected in an interactive manner.

Argo floats measure sea water temperature and salinity in the upper 2,000 m of the global ocean. The statistical analysis of the resulting spatio-temporal data set is challenging due to its nonstationary structure and large size. I propose mapping these data using locally stationary Gaussian process regression where covariance parameter estimation and spatio-temporal prediction are carried out in a moving-window fashion.
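
A one-dimensional caricature of the moving-window idea (the kernel, window width, and noise level are illustrative; the real analysis is spatio-temporal and locally stationary):

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma2=1.0, ell=1.0):
    """Squared-exponential covariance between two sets of 1-D locations."""
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * (d / ell) ** 2)

def local_gp_predict(x_train, y_train, x_star, half_width=2.0, noise=0.1):
    """Moving-window GP prediction: for each target location, compute the
    GP posterior mean using only the observations inside a local window."""
    preds = np.empty_like(x_star)
    for i, xs in enumerate(x_star):
        mask = np.abs(x_train - xs) <= half_width
        xw, yw = x_train[mask], y_train[mask]
        K = sq_exp_kernel(xw, xw) + noise * np.eye(len(xw))
        k_star = sq_exp_kernel(np.array([xs]), xw)
        preds[i] = k_star @ np.linalg.solve(K, yw)
    return preds

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.1, x.size)
x_star = np.linspace(1, 9, 50)
y_hat = local_gp_predict(x, y, x_star)
```

Working window by window keeps each linear solve small, which is the same motivation that makes the approach feasible on the full Argo data set.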

Many important causal questions concern interactions between units, also known as interference. Examples include interactions between individuals in households, students in schools, and firms in markets. Standard analyses that ignore interference can often break down in this setting: estimators can be badly biased, while classical randomization tests can be invalid. In this talk, I present recent results on estimation and testing for two-stage experiments, which are powerful designs for assessing interference.
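
The design itself is easy to sketch (the saturation levels, cluster sizes, and function names below are illustrative):

```python
import numpy as np

def two_stage_assign(n_clusters, cluster_size, saturations=(0.2, 0.8), seed=0):
    """Two-stage randomization: stage 1 assigns each cluster a treatment
    saturation; stage 2 randomizes individuals within each cluster at
    that cluster's rate."""
    rng = np.random.default_rng(seed)
    sat = rng.choice(saturations, size=n_clusters)        # stage 1
    assign = np.zeros((n_clusters, cluster_size), dtype=int)
    for c in range(n_clusters):                           # stage 2
        n_treat = int(round(sat[c] * cluster_size))
        treated = rng.choice(cluster_size, size=n_treat, replace=False)
        assign[c, treated] = 1
    return sat, assign

sat, assign = two_stage_assign(n_clusters=10, cluster_size=20)
```

Contrasting outcomes across clusters with different saturations is what lets this design separate direct treatment effects from spillovers.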

Paul Gustafson, Department of Statistics, University of British Columbia

Hierarchical Bayesian Modelling for Survival Data

Hierarchical Bayes models can be flexible tools for the analysis of failure time data. This will be illustrated by two examples. The first example is in a clinical trials context, when there are several response times for each patient, and many patients at each clinical centre. Frailties are used to model both across-patient variability and across-centre variability.

We study maximum likelihood estimation for exponential families that are multivariate totally positive of order two (MTP2). Such distributions appear in the context of ferromagnetism in the Ising model and in various latent variable models, such as the Brownian motion tree models used in phylogenetics. We show that maximum likelihood estimation for MTP2 exponential families is a convex optimization problem.
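
In the Gaussian case, for instance, MTP2 is equivalent to the precision matrix K being an M-matrix, so the MLE problem takes the form (S denotes the sample covariance matrix):

```latex
\max_{K \succ 0} \;\; \log \det K - \operatorname{tr}(S K)
\quad \text{subject to} \quad K_{ij} \le 0 \ \text{for all } i \neq j,
```

a convex program, since the objective is concave in K and the sign constraints are linear.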


Rotational post hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework.


We discuss two recent results concerning disease modeling on networks. The infection is assumed to spread via contagion (i.e., transmission over the edges of an underlying network). In the first scenario, we observe the infection status of individuals at a particular time instance and the goal is to identify a confidence set of nodes that contains the source of the infection with high probability.
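
As a toy version of the contagion dynamics (the graph, infection probability, and time horizon are all illustrative):

```python
import random

def si_spread(adj, source, p=0.3, steps=5, seed=3):
    """Simulate susceptible-infected contagion: at each step, every infected
    node infects each susceptible neighbour independently with probability p."""
    rng = random.Random(seed)
    infected = {source}
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < p:
                    newly.add(v)
        infected |= newly
    return infected

# toy graph: the path 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
status = si_spread(adj, source=0)
```

The source-detection problem described above runs this process in reverse: given only the final set of infected nodes, identify a small confidence set likely to contain the seed.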