Body

In large-scale imaging studies, a primary goal is to understand the relationship between distinct data views of study participants. For example, one data view could consist of patients' brain MRI scans, while a second view includes their lifestyle, demographic, or psychometric measures. A significant challenge is that these views are often subject to complex non-Euclidean constraints. Two settings arise: in some cases the geometric constraints are known a priori, as with brain functional connectivity data, which lie on the manifold of positive definite matrices; in other cases no explicit manifold representation is available, and the underlying geometry must be learned from the data. Additionally, the relationships between these views are often weak, further complicating the analysis.

Despite extensive work on data integration, most approaches fail to accommodate non-Euclidean constraints while providing interpretable embeddings. In this talk, we propose novel frameworks to identify interpretable relationships between heterogeneous data views, while accounting for their distinct underlying structures.

Specifically, we develop a canonical correlation analysis model that integrates time-varying, manifold-valued data with high-dimensional data. Our approach leverages tools from Riemannian geometry to handle the non-Euclidean constraints and introduces a group-sparsity penalty to select important variables. The proposed method shows improved empirical performance over existing approaches and is applied to dynamic functional connectivity data from the Human Connectome Project. Furthermore, we establish asymptotic consistency through both in-sample and out-of-sample error bounds for the estimated canonical directions and scores.
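As a stylized sketch (the notation here is illustrative, not taken from the talk): if each manifold-valued observation $Y_i \in \mathcal{M}$ is mapped to the tangent space at a reference point $\mu$ (e.g. the Fréchet mean) via the Riemannian logarithm, and $X_i \in \mathbb{R}^p$ denotes the high-dimensional view, then a group-sparse canonical direction pair $(u, v)$ may be sought through a criterion of the form
\[
\max_{u,\, v} \;\; \widehat{\mathrm{Cov}}\big( X u,\; \mathrm{Log}_{\mu}(Y)\, v \big) \;-\; \lambda \sum_{g \in \mathcal{G}} \| u_g \|_2 ,
\]
subject to unit-variance constraints on the canonical scores, where $\mathcal{G}$ is a prespecified partition of the covariates into groups and the group-lasso penalty zeroes out entire groups, yielding variable selection.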

We further extend the proposed model to automatically learn interpretable embeddings from the data, thereby estimating their underlying geometry. To achieve this, we formulate a Partially Linear interpretable Canonical Correlation Analysis model (PLiCCA) and prove the existence of population solutions. We establish formal connections between PLiCCA and conditional latent-variable models, specifically conditional variational autoencoders and conditional normalizing flows. We show that these latent-variable models can be interpreted as relaxations of the PLiCCA problem, in which difficult global constraints are replaced by tractable local ones. This perspective allows PLiCCA to be solved efficiently via `proxy' problems derived from contemporary conditional generative models, providing an alternative to the models proposed in the first project when the underlying structure of the data is unknown.
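To make the local relaxation concrete, one illustrative proxy is the standard conditional variational autoencoder objective (a well-known formulation; its precise role as a PLiCCA surrogate is the subject of the talk):
\[
\mathcal{L}(\theta, \phi) \;=\; \mathbb{E}_{q_\phi(z \mid x,\, c)}\big[ \log p_\theta(x \mid z,\, c) \big] \;-\; \mathrm{KL}\big( q_\phi(z \mid x,\, c) \,\|\, p(z \mid c) \big),
\]
where the Kullback--Leibler term imposes a per-observation (local) constraint on the approximate posterior in place of a global constraint on the embedding, so the objective can be optimized with stochastic gradients.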