Multivariate Analyses

1 March 2002

 

Canonical Correlation Analysis

In Principal Component Analysis and Discriminant Analysis, we had one group of data with multiple variables.  If we want to look at relationships between two groups of data, both of which contain multiple variables, we use canonical correlation analysis.

 

Canonical correlation analysis is concerned with the amount of linear relationship between two sets of variables (Rencher 1995).  Canonical correlation analysis provides a measure of overall correlation between two sets of variables. 

 

Nuts and bolts of CCA (See Rencher 1995 for more details):

(All the following math can be done with the covariance matrices.)

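We have two sets of variables, containing m1 and m2 variables respectively, measured on the same n observations.  Without loss of generality, let m1 ≤ m2.  Then the dimension of the data matrix is n x (m1 + m2).  We now partition the data matrix into the following: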

 

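$$X = \begin{bmatrix} X_1 & X_2 \end{bmatrix},$$

where X1: n x m1 and X2: n x m2.  We then compute the sample correlation matrix of this data matrix:

$$C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}$$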

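Notice that this correlation matrix is a single (m1 + m2) x (m1 + m2) matrix that has been partitioned into blocks; you need to know where to draw the lines.  In this example, C11 is m1 x m1 (correlations within the first set), C12 is m1 x m2 (correlations between the two sets), C21 is m2 x m1 (the transpose of C12), and C22 is m2 x m2 (correlations within the second set).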
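The partition described above can be sketched in a few lines of NumPy. This is only an illustration under assumed names: the helper `partition_correlation` and the random data are hypothetical and are not part of the original notes.

```python
import numpy as np

def partition_correlation(X1, X2):
    """Return the blocks C11, C12, C21, C22 of the correlation matrix of [X1 X2]."""
    m1 = X1.shape[1]
    X = np.hstack([X1, X2])            # n x (m1 + m2) data matrix
    C = np.corrcoef(X, rowvar=False)   # (m1 + m2) x (m1 + m2) correlation matrix
    C11, C12 = C[:m1, :m1], C[:m1, m1:]
    C21, C22 = C[m1:, :m1], C[m1:, m1:]
    return C11, C12, C21, C22

# Hypothetical example data: n = 50 observations, m1 = 2 and m2 = 3 variables.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(50, 2))
X2 = rng.normal(size=(50, 3))

C11, C12, C21, C22 = partition_correlation(X1, X2)
print(C11.shape, C12.shape, C21.shape, C22.shape)
# (2, 2) (2, 3) (3, 2) (3, 3)
```

The "lines" are drawn after row m1 and after column m1, which is why the slices all use the same index.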

 

We then calculate the measure of association between the two sets:

 

$$M_1 = C_{11}^{-1} C_{12} C_{22}^{-1} C_{21} \quad \text{and} \quad M_2 = C_{22}^{-1} C_{21} C_{11}^{-1} C_{12}.$$

The positive square roots of the eigenvalues (λ) of M1 and M2 are called canonical correlations.  Note: the nonzero eigenvalues of M1 and M2 are the same, but the eigenvectors are not.  The largest eigenvalue (the square of the first canonical correlation) is the best overall measure of association between a linear combination of the first set and a linear combination of the second set.  The eigenvectors of M1 and M2 provide the coefficients (or weightings) for these linear combinations: those of M1 apply to the first set and those of M2 to the second.  The remaining eigenvalues provide measures of supplemental dimensions of linear relationship between the two sets.
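As a rough numerical sketch of this step, the code below forms M1 and M2 from the blocks of the correlation matrix and takes the square roots of their eigenvalues. The function name `cca_from_correlation` and the simulated data are assumptions made for illustration. Because the correlation matrix is used, the coefficient vectors apply to the standardized variables.

```python
import numpy as np

def cca_from_correlation(X1, X2):
    """Canonical correlations and coefficient vectors from M1 and M2."""
    m1, m2 = X1.shape[1], X2.shape[1]
    k = min(m1, m2)                       # number of canonical correlations
    C = np.corrcoef(np.hstack([X1, X2]), rowvar=False)
    C11, C12 = C[:m1, :m1], C[:m1, m1:]
    C21, C22 = C[m1:, :m1], C[m1:, m1:]
    A12 = np.linalg.solve(C11, C12)       # C11^{-1} C12
    A21 = np.linalg.solve(C22, C21)       # C22^{-1} C21
    M1 = A12 @ A21                        # m1 x m1
    M2 = A21 @ A12                        # m2 x m2

    def top_eigen(M):
        # M is not symmetric, so use the general solver and keep real parts.
        vals, vecs = np.linalg.eig(M)
        order = np.argsort(vals.real)[::-1][:k]
        return vals.real[order], vecs.real[:, order]

    lam1, A = top_eigen(M1)               # coefficients for the first set
    lam2, B = top_eigen(M2)               # coefficients for the second set
    r = np.sqrt(np.clip(lam1, 0.0, 1.0))  # canonical correlations
    return r, lam1, lam2, A, B

# Hypothetical data in which the second set partly depends on the first.
rng = np.random.default_rng(42)
X1 = rng.normal(size=(200, 2))
X2 = np.column_stack([X1[:, 0] + 0.5 * rng.normal(size=200),
                      X1[:, 1] + 2.0 * rng.normal(size=200),
                      rng.normal(size=200)])
r, lam1, lam2, A, B = cca_from_correlation(X1, X2)

# Correlation between the first pair of canonical variates.
Z1 = (X1 - X1.mean(axis=0)) / X1.std(axis=0, ddof=1)
Z2 = (X2 - X2.mean(axis=0)) / X2.std(axis=0, ddof=1)
r_uv = abs(np.corrcoef(Z1 @ A[:, 0], Z2 @ B[:, 0])[0, 1])
print(r, r_uv)
```

The leading eigenvalues of M1 and M2 should agree, and the correlation between the first pair of canonical variates should match the first canonical correlation up to sign, since eigenvectors are only determined up to a scalar multiple.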

 

Two Properties of Canonical Correlations (Rencher 1996)

  1. Canonical correlations are invariant to changes of scale on either data set.  Even if the measurement scale is changed the canonical correlation will not change (the corresponding eigenvectors will change). 
  2. The first canonical correlation is the maximum correlation between linear functions of the two data sets.  Therefore, the first canonical correlation is at least as large as the absolute value of the simple correlation between any single y and any single x (variables taken from the two sets), and at least as large as the multiple correlation between any y and all the x’s or between any x and all the y’s. 
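Both properties can be checked numerically. The sketch below is an illustration only: the helper `cancorr`, the simulated data, and the particular rescaling constants are all assumptions, not part of the original notes.

```python
import numpy as np

def cancorr(X1, X2):
    """Canonical correlations (largest first) from the correlation matrix."""
    m1 = X1.shape[1]
    C = np.corrcoef(np.hstack([X1, X2]), rowvar=False)
    C11, C12 = C[:m1, :m1], C[:m1, m1:]
    C21, C22 = C[m1:, :m1], C[m1:, m1:]
    M1 = np.linalg.solve(C11, C12) @ np.linalg.solve(C22, C21)
    lam = np.sort(np.linalg.eigvals(M1).real)[::-1]
    return np.sqrt(np.clip(lam, 0.0, 1.0))

# Hypothetical data: the second set is a noisy linear function of the first.
rng = np.random.default_rng(1)
X1 = rng.normal(size=(100, 2))
X2 = X1 @ rng.normal(size=(2, 3)) + rng.normal(size=(100, 3))

# Property 1: rescaling (and shifting) the variables leaves r unchanged.
r = cancorr(X1, X2)
r_rescaled = cancorr(X1 * np.array([1000.0, 0.01]), X2 * 2.54 + 7.0)
scale_invariant = np.allclose(r, r_rescaled)

# Property 2: r[0] is at least as large as every simple |correlation|
# between one variable of the first set and one of the second.
C12 = np.corrcoef(np.hstack([X1, X2]), rowvar=False)[:2, 2:]
max_simple = np.abs(C12).max()
print(scale_invariant, r[0] >= max_simple)
```

Both checks are expected to come out true for any data set with non-degenerate variables, not just for this simulated one.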

 

To conduct hypothesis tests on the canonical correlations, we may use a non-parametric bootstrap.
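The notes do not say which bootstrap scheme is intended, so the following is only one possible sketch. It assumes the null hypothesis is "no association between the two sets" and mimics that null by resampling the rows of each set independently with replacement, then compares the observed first canonical correlation with the resampled values. The function names, the number of resamples, and the simulated data are all hypothetical.

```python
import numpy as np

def first_cancorr(X1, X2):
    """First (largest) canonical correlation from the correlation matrix."""
    m1 = X1.shape[1]
    C = np.corrcoef(np.hstack([X1, X2]), rowvar=False)
    C11, C12 = C[:m1, :m1], C[:m1, m1:]
    C21, C22 = C[m1:, :m1], C[m1:, m1:]
    M1 = np.linalg.solve(C11, C12) @ np.linalg.solve(C22, C21)
    lam_max = np.linalg.eigvals(M1).real.max()
    return float(np.sqrt(np.clip(lam_max, 0.0, 1.0)))

def bootstrap_null_test(X1, X2, n_boot=999, seed=0):
    """Bootstrap p-value for H0: no association (assumed resampling scheme)."""
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    r_obs = first_cancorr(X1, X2)
    r_null = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n, size=n)   # rows drawn for the first set
        j = rng.integers(0, n, size=n)   # independent rows for the second set
        r_null[b] = first_cancorr(X1[i], X2[j])
    # Share of null replicates at least as large as the observed value.
    p_value = (1 + np.sum(r_null >= r_obs)) / (n_boot + 1)
    return r_obs, p_value

# Hypothetical data with a strong relationship between the two sets.
rng = np.random.default_rng(7)
X1 = rng.normal(size=(80, 2))
X2 = X1 + 0.5 * rng.normal(size=(80, 2))
r_obs, p_value = bootstrap_null_test(X1, X2)
print(r_obs, p_value)
```

Resampling the two sets independently is a design choice: resampling whole rows jointly would instead preserve the association and is better suited to estimating the sampling variability of the canonical correlations than to testing for no association.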