Multivariate Analyses
25 February 2002
We have been talking about principal component analysis (PCA). We can use a PCA as a cluster analysis by finding clusters in the PCA scores. If we turn this process around, we start with known groups and try to find a function that predicts the classification in the data. For example, we may have multiple soil types, and from each we collect several measurements, e.g., number of insects, number of weeds, etc. Can we now predict the classification variable (i.e., soil type) from the measured variables? This is done by using a discriminant analysis, specifically Fisherian Linear Discriminant Analysis.
Discriminant Analysis (without the assumption of multivariate normal distributions):
In a discriminant analysis, we try to find a discriminant function, which is just a vector of weights.
To find these weights, we decompose the total variance into within-group and between-group variances. We then choose the weights so that the between-group variance is as large as possible relative to the within-group variance.
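The goal of separating the groups can be written as a ratio criterion. The symbols here are assumptions introduced for illustration: $a$ is the m x 1 vector of weights, and $S_B$ and $S_W$ are the between-group and within-group covariance matrices (m x m, with $S_W$ assumed invertible). Setting the derivative of the ratio to zero gives an eigenvalue problem:

```latex
\lambda(a) = \frac{a^{\top} S_B \, a}{a^{\top} S_W \, a},
\qquad
\frac{\partial \lambda}{\partial a} = 0
\;\Longrightarrow\;
S_B \, a = \lambda \, S_W \, a
\;\Longleftrightarrow\;
S_W^{-1} S_B \, a = \lambda \, a .
```

Under these assumptions, the weight vector that maximizes the ratio is the eigenvector of $S_W^{-1} S_B$ with the largest eigenvalue, and that eigenvalue is the value of the ratio itself.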
Here is the algebra (with 2 categories):
Let X be the data matrix with m variables and n observations. X: n x m.
We compute column means and define the following:
$\bar{x}$ = a vector of means for all n observations (1 x m)
$\bar{x}_1$ = a vector of means for group 1 with n1 observations (1 x m)
$\bar{x}_2$ = a vector of means for group 2 with n2 observations (1 x m)
n = n1 + n2
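As a minimal sketch, the three mean vectors can be computed with NumPy. The data matrix and group labels below are hypothetical toy values made up for illustration, not data from the lecture.

```python
import numpy as np

# Hypothetical toy data: n = 5 observations (rows), m = 2 variables (columns).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [6.0, 5.0],
              [8.0, 9.0]])
# Hypothetical group labels, one per observation.
groups = np.array([1, 1, 1, 2, 2])

n, m = X.shape
n1 = int(np.sum(groups == 1))
n2 = int(np.sum(groups == 2))

xbar = X.mean(axis=0)                 # overall column means, length m
xbar1 = X[groups == 1].mean(axis=0)   # column means of group 1
xbar2 = X[groups == 2].mean(axis=0)   # column means of group 2

print(xbar, xbar1, xbar2)
```

The overall mean is the weighted average of the group means, (n1 * xbar1 + n2 * xbar2) / n, which is a handy sanity check.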
We now partition X into two components, a within-group part and a between-group part:
$X = X_W + X_B$.
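To make things tractable, we rearrange the rows of the X matrix so that the first n1 rows contain the observations from group 1 and the remaining n2 rows contain the observations from group 2. In other words, we partition the X matrix as follows:
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad X_1: n_1 \times m, \quad X_2: n_2 \times m.$$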
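We also define the XB matrix as follows:
$$X_B = \begin{bmatrix} \mathbf{1}_{n_1} \bar{x}_1 \\ \mathbf{1}_{n_2} \bar{x}_2 \end{bmatrix},$$
where $\mathbf{1}_k$ is a k x 1 column of ones.
XB contains n1 rows of the group 1 mean vector $\bar{x}_1$ and n2 rows of the group 2 mean vector $\bar{x}_2$.
And $X_W = X - X_B$, so each row of XW is the deviation of an observation from its own group mean.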
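The construction of XB and XW can be sketched in NumPy as follows. The data are hypothetical toy values, and the rows are assumed to be already sorted so that group 1 comes first.

```python
import numpy as np

# Hypothetical toy data, rows already sorted by group (group 1 first).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [6.0, 5.0],
              [8.0, 9.0]])
n1 = 3                       # number of rows in group 1 (assumed known)
n2 = X.shape[0] - n1

X1, X2 = X[:n1], X[n1:]      # the two row blocks of X

# XB: n1 copies of the group 1 mean stacked on n2 copies of the group 2 mean.
XB = np.vstack([np.tile(X1.mean(axis=0), (n1, 1)),
                np.tile(X2.mean(axis=0), (n2, 1))])

# XW: deviation of each observation from its own group mean.
XW = X - XB

print(XB)
print(XW)
```

By construction XW and XB add back up to X, and the rows of XW sum to zero within each group.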
We then compute the covariance matrices of XW and XB: $S_W = \mathrm{cov}(X_W)$ and $S_B = \mathrm{cov}(X_B)$, both m x m.
In the next step we compute the matrix product of $S_B$ and the inverse of $S_W$.
Let $M = S_W^{-1} S_B$.
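Finally, we conduct the eigen decomposition of M. The eigenvectors are the discriminant functions, and each eigenvalue is the ratio of between-group to within-group variance along its eigenvector. With 2 categories, only one eigenvalue is nonzero, so there is a single discriminant function.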
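The whole procedure can be sketched end to end in NumPy. This is an illustration under stated assumptions, not a definitive implementation: the data are hypothetical toy values, the rows are pre-sorted by group, and M is taken to be the inverse of the within-group covariance matrix times the between-group covariance matrix.

```python
import numpy as np

# Hypothetical toy data, rows sorted so that group 1 comes first.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [6.0, 5.0],
              [8.0, 9.0]])
n1 = 3
n2 = X.shape[0] - n1

# Between-group and within-group components of X.
XB = np.vstack([np.tile(X[:n1].mean(axis=0), (n1, 1)),
                np.tile(X[n1:].mean(axis=0), (n2, 1))])
XW = X - XB

# Covariance matrices (variables in columns, hence rowvar=False).
S_W = np.cov(XW, rowvar=False)
S_B = np.cov(XB, rowvar=False)

# M = S_W^{-1} S_B (assumed order), computed without an explicit inverse.
M = np.linalg.solve(S_W, S_B)

# Eigen decomposition; M is not symmetric, so use eig rather than eigh.
eigvals, eigvecs = np.linalg.eig(M)
eigvals, eigvecs = np.real(eigvals), np.real(eigvecs)

# The discriminant function is the eigenvector with the largest eigenvalue.
lead = int(np.argmax(eigvals))
a = eigvecs[:, lead]

# Discriminant scores: one number per observation.
scores = X @ a
ratio = (a @ S_B @ a) / (a @ S_W @ a)

print(eigvals[lead], a, scores)
```

The sign and length of the eigenvector are arbitrary, so only the ordering of the scores is meaningful. In this toy example the scores of the two groups do not overlap, and the ratio of between-group to within-group variance along `a` equals the leading eigenvalue.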