Multivariate Analyses

18 January 2002 through 25 January 2002

 

Variance, covariance, correlation, and multiple regression.

 

I will skip definitions of variance, covariance, and correlation.  They can be found in any basic statistics textbook.

 

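Given a multiple linear regression model: $y_j = \sum_{i=1}^{k} a_i x_{ji} + \varepsilon_j$, where i = 1, 2, …, k and j = 1, 2, …, n.  k is the number of parameters and n is the number of observations.  The model can be written compactly in matrix notation: $\underline{y} = X\underline{a} + \underline{\varepsilon}$, where underlines indicate vectors and uppercase X indicates a matrix (here n × k).  We assume the following: $E(\underline{\varepsilon}) = \underline{0}$ and $Var(\underline{\varepsilon}) = \sigma^2 I$, where I is the identity matrix of size n (one row and column per observation).  In plain English, y follows some distribution with mean $X\underline{a}$ and variance $\sigma^2 I$.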

 

If we standardize each column of the X matrix, i.e., subtract the mean of the column and divide by the standard deviation of the column, we obtain a new matrix $Z_X$.  We also standardize the y vector and obtain $\underline{Z_y}$.  The original model can now be written as: $\underline{Z_y} = Z_X \underline{\beta} + \underline{\varepsilon}$.
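
The standardization step can be sketched numerically as follows. This is a minimal illustration, not part of the lecture: the data are simulated, the sizes and coefficients are arbitrary, the helper name `standardize` is made up for this example, and the sample standard deviation (divisor n − 1) is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed; simulated data for illustration only
n, k = 50, 3                     # assumed sizes: n observations, k predictors
X = rng.normal(loc=5.0, scale=2.0, size=(n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def standardize(a):
    """Subtract the column mean and divide by the sample standard deviation."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

Z_X = standardize(X)   # each column now has mean 0 and standard deviation 1
Z_y = standardize(y)

col_means = Z_X.mean(axis=0)
col_sds = Z_X.std(axis=0, ddof=1)
```

After this step every column of `Z_X`, and `Z_y` itself, has mean 0 and standard deviation 1 up to floating-point rounding.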

 

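Dr. Goodman showed us (without proof) that the best estimate of β, which minimizes the SSE ($\underline{\varepsilon}^T \underline{\varepsilon} = \sum_{j} \varepsilon_j^2$), is $\hat{\underline{\beta}} = C_X^{-1} \underline{c_{Xy}}$, where $C_X$ is the correlation matrix of the X matrix and $\underline{c_{Xy}}$ is the vector of correlations between each column of X and y.  This result can be shown by using the normal equation for ordinary least squares (OLS), i.e., $\hat{\underline{\beta}} = (Z_X^T Z_X)^{-1} Z_X^T \underline{Z_y}$.  For standardized data, $Z_X^T Z_X$ and $Z_X^T \underline{Z_y}$ are $C_X$ and $\underline{c_{Xy}}$ multiplied by the same constant (n − 1 when the sample standard deviation is used), so the constant cancels.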
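
A quick numerical check of this result, as a sketch under stated assumptions: the data are simulated, the variable names are chosen for this example, and the sample standard deviation (divisor n − 1) is used for standardization. It compares the correlation-based estimate with the OLS normal-equation solution on the standardized data.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed; simulated data for illustration only
n, k = 200, 3
X = rng.normal(size=(n, k))
X[:, 2] += 0.5 * X[:, 0]         # make two predictors correlated
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Standardize predictors and response
Z_X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Z_y = (y - y.mean()) / y.std(ddof=1)

# Correlation matrix of the columns [X | y]
R = np.corrcoef(np.column_stack([X, y]), rowvar=False)
C_X = R[:k, :k]    # correlations among the predictors
c_Xy = R[:k, k]    # correlations between each predictor and y

beta_corr = np.linalg.solve(C_X, c_Xy)                 # C_X^{-1} c_Xy
beta_ols = np.linalg.solve(Z_X.T @ Z_X, Z_X.T @ Z_y)   # normal equation
```

The two estimates agree up to rounding. `np.linalg.solve` is used rather than an explicit matrix inverse, which is the usual numerically safer choice.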

 

The relationship between correlation of standardized and original data:

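Let X and Y be random variables.  Then the correlation between these two random variables is:

$$\rho_{XY} = \frac{Cov(X, Y)}{\sqrt{\sigma_X^2 \sigma_Y^2}} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$$

where $\sigma_X^2$ is the variance of X, $\sigma_X$ is the standard deviation of X, and $\mu_X$ and $\mu_Y$ are the means of X and Y.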

 

Now let $Z_X$ and $Z_Y$ be standardized X and Y, i.e.,

$$Z_X = \frac{X - \mu_X}{\sigma_X} \quad \text{and} \quad Z_Y = \frac{Y - \mu_Y}{\sigma_Y}.$$

 

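Then the correlation between $Z_X$ and $Z_Y$ is:

$$\rho_{Z_X Z_Y} = \frac{E[(Z_X - \mu_{Z_X})(Z_Y - \mu_{Z_Y})]}{\sigma_{Z_X} \sigma_{Z_Y}} = E[(Z_X - \mu_{Z_X})(Z_Y - \mu_{Z_Y})],$$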

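because the variance of a standardized variable is 1.  The mean of a standardized variable is 0.  Therefore,

$$\rho_{Z_X Z_Y} = E[Z_X Z_Y] = E\left[\frac{(X - \mu_X)(Y - \mu_Y)}{\sigma_X \sigma_Y}\right] = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$$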

which is identical to the correlation between X and Y.
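
The same identity can be checked on sample data. This is a small sketch with simulated values; the variable names and the use of the sample standard deviation (divisor n − 1) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed; simulated data for illustration only
x = rng.normal(loc=10.0, scale=3.0, size=500)
y = 0.7 * x + rng.normal(scale=2.0, size=500)

# Standardize both variables
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)

r_xy = np.corrcoef(x, y)[0, 1]       # correlation of the original data
r_z = np.corrcoef(z_x, z_y)[0, 1]    # correlation of the standardized data

# With mean 0 and standard deviation 1, the correlation reduces to the
# average product of the standardized values (sample version: divide by n - 1).
r_prod = (z_x * z_y).sum() / (len(x) - 1)
```

All three quantities `r_xy`, `r_z`, and `r_prod` coincide up to floating-point rounding, mirroring the derivation above.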