Principal Component Analysis
In this chapter we're going to learn about Principal Component Analysis (PCA), which is a tool used for dimensionality reduction. This is useful because it makes the job of classifiers easier in terms of speed, and it also aids data visualization.
So what are principal components then? They're the underlying structure in the data: the directions where there is the most variance, the directions where the data is most spread out.
The main limitation of this algorithm is that it only works well when the data lies on (or close to) a linear manifold.
The PCA algorithm will try to fit a plane that minimizes the projection error (the sum of all the red-line lengths in the figure).
Imagine that PCA rotates your data, looking for the angle where it sees the most variance.
As mentioned before, you can use PCA when your data lies on a linear manifold. For non-linear manifolds the projection errors will be large.
Preprocess the data: subtract the mean of each feature (column) from X.
Calculate the covariance matrix: $$\Sigma = \frac{1}{m} X_{prep}^T \cdot X_{prep}$$, where m is the number of samples (elements) and X is a [m x n] matrix, with m the number of experiments and n the number of features.
Get the eigenvectors of the covariance matrix, $$[U,S,V] = svd(\Sigma)$$. Here the U matrix will be a [n x n] matrix where every column of U is a principal component; if we want to reduce our data from n dimensions to k, we choose the first k columns of U.
The preprocessing part sometimes also includes a division by the standard deviation of each column, but there are cases where this is not needed. (The mean subtraction is the more important part.)
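As a summary, assuming mu_j and sigma_j denote the mean and standard deviation of column j (these symbols are just illustrative, not from the original text), the full preprocessing of one element can be written as:

$$x^{(i)}_{prep,j} = \frac{x^{(i)}_j - \mu_j}{\sigma_j}$$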
Now that we have calculated our principal components, which are stored in the matrix U, we will reduce our input data from n dimensions to k dimensions. Here k is the number of columns we keep from U (the columns of U_reduce). Depending on how you organized the data (samples as rows or as columns) we can have 2 different formats for Z:
$$U_{reduce} = U(:,1:k) \\ Z = U_{reduce}^T \cdot X_{prep} \\ Z = X_{prep} \cdot U_{reduce}$$
To illustrate the whole process we're going to calculate the PCA of some data step by step, and then restore it with fewer dimensions.
Here our data is a matrix with 15 samples of 3 measurements [15x3]
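Since the original matrix is not reproduced here, a minimal matlab sketch, using random data as a stand-in, could be:

```matlab
% Stand-in data: 15 samples/experiments (rows) of 3 measurements (columns)
X = randn(15, 3);
```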
Now we're going to subtract the mean of each column from every element in that column, and then also divide each element by the standard deviation of its column.
mean and std will work column-wise on X (one value per column).
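A minimal sketch of this preprocessing (assuming the stand-in matrix X from above; the variable names mu, sigma and X_prep are just illustrative):

```matlab
mu = mean(X);      % 1x3 row vector: mean of each column
sigma = std(X);    % 1x3 row vector: standard deviation of each column
X_prep = bsxfun(@minus, X, mu);            % subtract the column means
X_prep = bsxfun(@rdivide, X_prep, sigma);  % divide by the column stds
```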
Now we use "svd" to get the principal components, which are the eigenvectors and eigenvalues of the covariance matrix.
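A sketch of this step, assuming the preprocessed matrix X_prep from the previous snippet:

```matlab
m = size(X_prep, 1);                 % number of samples (15)
Sigma = (1/m) * (X_prep' * X_prep);  % [3x3] covariance matrix
[U, S, V] = svd(Sigma);              % columns of U are the principal components
```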
There are different ways to calculate the PCA; for instance, matlab already provides the functions pca and princomp, which could give different signs for the eigenvectors (U), but they all represent the same components.
One thing you should pay attention to is the layout of the input matrix, because some methods for finding the PCA expect your samples and measurements to be organized in a specific way (samples as rows or as columns).
Now, to recover the original data, we use all the components and also reverse the preprocessing.
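A sketch of this lossless reconstruction, assuming the variables X_prep, U, mu and sigma from the snippets above (since U is orthonormal, projecting and projecting back recovers the preprocessed data exactly):

```matlab
Z_full = X_prep * U;                     % [15x3] projection on all 3 components
X_rec  = Z_full * U';                    % back to the preprocessed space (U*U' = I)
X_rec  = bsxfun(@times, X_rec, sigma);   % undo the division by the stds
X_rec  = bsxfun(@plus, X_rec, mu);       % undo the mean subtraction -> X_rec ~ X
```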
Now that we have our principal components, let's apply the dimensionality reduction with, for instance, k=2.
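A sketch of the reduction, assuming the X_prep and U variables from above (U_reduce and Z are illustrative names matching the formulas earlier in the chapter):

```matlab
k = 2;
U_reduce = U(:, 1:k);        % [3x2] first two principal components
Z = X_prep * U_reduce;       % [15x2] reduced representation of the data
```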
We can use the principal components Z to recreate the data X, but with some loss. The idea is that the data in Z is smaller than X, but with similar variance. In this case we can reproduce an approximation of the data with $$X_{loss} = Z \cdot U_{reduce}^T$$, so using one dimension less.
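A sketch of the lossy reconstruction, again assuming the Z, U_reduce, mu and sigma variables from the snippets above:

```matlab
X_loss = Z * U_reduce';                    % [15x3] approximation in the preprocessed space
X_loss = bsxfun(@times, X_loss, sigma);    % undo the division by the stds
X_loss = bsxfun(@plus, X_loss, mu);        % undo the mean subtraction -> X_loss ~ X
```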
Before finishing the chapter, we're going to use PCA on images.