Brief Bioinform 2018 11;19(6):1356-1369
Bioinformatics and Biostatistics Core Facility of the Brain and Spine Institute, La Pitié-Salpêtriére Hospital, Paris, France.
The growing number of modalities (e.g. multi-omics, imaging and clinical data) characterizing a given disease provides physicians and statisticians with complementary facets reflecting the disease process but emphasizes the need for novel statistical methods of data analysis able to unify these views. Such data sets are indeed intrinsically structured in blocks, where each block represents a set of variables observed on a group of individuals. Therefore, classical statistical tools cannot be applied without altering their organization, with the risk of information loss. Regularized generalized canonical correlation analysis (RGCCA) and its sparse generalized canonical correlation analysis (SGCCA) counterpart are component-based methods for exploratory analyses of data sets structured in blocks of variables. Rather than operating sequentially on parts of the measurements, the RGCCA/SGCCA-based integrative analysis method aims at summarizing the relevant information between and within the blocks. It processes a priori information defining which blocks are supposed to be linked to one another, thus reflecting hypotheses about the biology underlying the data blocks. It also requires the setting of extra parameters that need to be carefully adjusted.Here, we provide practical guidelines for the use of RGCCA/SGCCA. We also illustrate the flexibility and usefulness of RGCCA/SGCCA on a unique cohort of patients with four genetic subtypes of spinocerebellar ataxia, in which we obtained multiple data sets from brain volumetry and magnetic resonance spectroscopy, and metabolomic and lipidomic analyses. As a first step toward the extraction of multimodal biomarkers, and through the reduction to a few meaningful components and the visualization of relevant variables, we identified possible markers of disease progression.