Multivariate Analyses
18 January 2002 through 25 January 2002
Variance, covariance, correlation, and multiple regression.
I will skip the definitions of variance, covariance, and correlation; they can be found in any basic statistics textbook.
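Given a multiple linear regression model:
$$y_j = a_1 x_{1j} + a_2 x_{2j} + \dots + a_k x_{kj} + \epsilon_j = \sum_{i=1}^{k} a_i x_{ij} + \epsilon_j,$$
where i = 1, 2, …, k and j = 1, 2, …, n.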
k is the number of parameters and n is the number of observations. The model can be written compactly in matrix notation:
$$\underline{y} = X\underline{a} + \underline{\epsilon},$$
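where underlines indicate vectors and uppercase X indicates a matrix. We assume the following about the error vector $\underline{\epsilon}$ (one error per observation):
$$E(\underline{\epsilon}) = \underline{0}$$
and
$$\mathrm{Var}(\underline{\epsilon}) = \sigma^2 I,$$
where I is the identity matrix of size n.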
In English, this says that $\underline{y}$ follows some distribution with mean $X\underline{a}$ and constant variance $\sigma^2$.
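The assumptions can be checked numerically with a small simulation. This is only a sketch: the sizes, the coefficient vector `a`, the value of `sigma`, and the choice of normally distributed errors are all made up for illustration (the notes only require mean zero and variance σ²I, not normality).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

n, k = 5, 2                      # n observations, k parameters (made-up sizes)
X = rng.normal(size=(n, k))      # hypothetical n-by-k design matrix
a = np.array([1.5, -0.5])        # hypothetical coefficient vector
sigma = 2.0                      # hypothetical error standard deviation

# Draw many independent replicates of y = X a + e with e ~ Normal(0, sigma^2 I).
# The normal distribution is just one choice that satisfies the assumptions.
reps = 200_000
errors = rng.normal(scale=sigma, size=(reps, n))
y = X @ a + errors               # each row is one simulated y vector of length n

mean_y = y.mean(axis=0)          # should be close to X a
cov_y = np.cov(y, rowvar=False)  # should be close to sigma^2 * I, an n-by-n matrix
```

Note that the estimated covariance matrix `cov_y` has one row and one column per observation, which is why the identity matrix in the assumption has size n.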
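If we standardize each column of the X matrix, i.e., subtract the column mean and divide by the column standard deviation, we obtain a new matrix $Z_X$. We also standardize the y vector to obtain $Z_y$. The original model can now be written as:
$$Z_y = Z_X\underline{\beta} + \underline{\epsilon}_Z,$$
where $\underline{\beta}$ holds the coefficients on the standardized scale and $\underline{\epsilon}_Z$ is the corresponding error vector.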
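The standardization step can be sketched as follows. The helper name `standardize` and the small data set are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the notes do not say which divisor is intended.

```python
import numpy as np

def standardize(M):
    """Column-wise z-scores: subtract each column mean, divide by the sample SD."""
    M = np.asarray(M, dtype=float)
    # ddof=1 gives the sample standard deviation (divisor n - 1); this is an assumption.
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

# Made-up data: 4 observations, 2 predictors.
X = np.array([[1.0, 10.0],
              [2.0, 14.0],
              [3.0, 11.0],
              [4.0, 19.0]])
y = np.array([2.0, 3.5, 3.0, 6.0])

ZX = standardize(X)   # every column now has mean 0 and sample SD 1
Zy = standardize(y)   # works for a 1-D vector as well
```

Because every standardized column has mean 0, the standardized model needs no intercept term.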
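Dr. Goodman showed us (without proof) that the best estimate of β, the one that minimizes the SSE ( $(Z_y - Z_X\underline{\beta})^T (Z_y - Z_X\underline{\beta})$ ), is
$$\hat{\underline{\beta}} = C_X^{-1}\, c_{Xy},$$
where $C_X$ is the correlation matrix of the X matrix and $c_{Xy}$ is the vector of correlations between each column of X and y. This result can be shown by using the normal equation for ordinary least squares (OLS), i.e.,
$$\hat{\underline{\beta}} = (Z_X^T Z_X)^{-1} Z_X^T Z_y.$$
With n observations standardized by the sample standard deviation, $Z_X^T Z_X = (n-1)\,C_X$ and $Z_X^T Z_y = (n-1)\,c_{Xy}$, so the factors of n − 1 cancel.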
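The equivalence between the correlation-based estimate and the OLS normal equation can be checked numerically. The sketch below uses simulated data with made-up coefficients; the variable names (`beta_corr`, `beta_ols`, `beta_lstsq`) are illustrative and not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = rng.normal(size=(n, k))                                # made-up predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)    # made-up response

# Standardize with the sample standard deviation (an assumption; see above).
ZX = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Zy = (y - y.mean()) / y.std(ddof=1)

# Correlation matrix of [X | y]: the top-left block is CX, the last column is cXy.
R = np.corrcoef(np.column_stack([X, y]), rowvar=False)
CX = R[:k, :k]
cXy = R[:k, k]

beta_corr = np.linalg.solve(CX, cXy)                   # CX^{-1} cXy
beta_ols = np.linalg.solve(ZX.T @ ZX, ZX.T @ Zy)       # normal equation
beta_lstsq, *_ = np.linalg.lstsq(ZX, Zy, rcond=None)   # generic least squares
```

All three vectors agree to floating-point precision. `np.linalg.solve` is used instead of forming an explicit inverse, which is the usual numerically safer choice.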
The relationship between the correlation of the standardized data and that of the original data:
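Let X and Y be random variables. Then the correlation between these two random variables is:
$$\rho_{XY} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sqrt{\sigma_X^2\,\sigma_Y^2}} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\,\sigma_Y},$$
where $\sigma_X^2$ is the variance of X, $\sigma_X$ is the standard deviation of X (and likewise $\sigma_Y^2$ and $\sigma_Y$ for Y), and $\mu_X$ and $\mu_Y$ are the means of X and Y.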
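As a small illustration, the definition can be applied term by term to a sample, replacing each expectation with an average. The data values are made up, and treating the sample as the whole population (divisor n) is an assumption; the divisor cancels in the ratio either way.

```python
import numpy as np

# Made-up paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

mu_x, mu_y = x.mean(), y.mean()                  # means of X and Y
cov_xy = np.mean((x - mu_x) * (y - mu_y))        # E[(X - mu_X)(Y - mu_Y)]
sd_x = np.sqrt(np.mean((x - mu_x) ** 2))         # standard deviation of X
sd_y = np.sqrt(np.mean((y - mu_y) ** 2))         # standard deviation of Y

rho = cov_xy / (sd_x * sd_y)                     # correlation from the definition
rho_numpy = np.corrcoef(x, y)[0, 1]              # NumPy's built-in, for comparison
```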
Now let $Z_X$ and $Z_Y$ be the standardized X and Y, i.e.,
$$Z_X = \frac{X - \mu_X}{\sigma_X}$$
and
$$Z_Y = \frac{Y - \mu_Y}{\sigma_Y}.$$
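Then the correlation between $Z_X$ and $Z_Y$ is:
$$\rho_{Z_X Z_Y} = \frac{E[(Z_X-\mu_{Z_X})(Z_Y-\mu_{Z_Y})]}{\sigma_{Z_X}\,\sigma_{Z_Y}} = E[(Z_X-\mu_{Z_X})(Z_Y-\mu_{Z_Y})]$$
because the variance of a standardized variable is 1. The mean of a standardized variable is 0. Therefore,
$$\rho_{Z_X Z_Y} = E[Z_X Z_Y] = E\!\left[\frac{X-\mu_X}{\sigma_X}\cdot\frac{Y-\mu_Y}{\sigma_Y}\right] = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\,\sigma_Y},$$
which is identical to the correlation between X and Y.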
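The invariance of correlation under standardization can also be confirmed on simulated data. This is a minimal sketch with made-up distribution parameters; it uses the population standard deviation (NumPy's default, `ddof=0`) so that the average of the product of z-scores matches the correlation exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=1000)    # made-up variable X
y = 0.7 * x + rng.normal(scale=2.0, size=1000)    # made-up variable Y, related to X

# Standardize with the population SD (ddof=0, the NumPy default).
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

rho_xy = np.corrcoef(x, y)[0, 1]      # correlation of the original data
rho_z = np.corrcoef(zx, zy)[0, 1]     # correlation of the standardized data
mean_product = np.mean(zx * zy)       # sample version of E[ZX ZY]
```

All three quantities coincide up to rounding error, mirroring the last step of the derivation.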