Principal Component Analysis


In this chapter we're going to learn about Principal Component Analysis (PCA), a tool used for dimensionality reduction. This is useful because it makes the job of classifiers easier in terms of speed, and because it aids data visualization.

So what are principal components then? They're the underlying structure in the data: the directions along which the data has the most variance, that is, the directions where the data is most spread out.

The main limitation of this algorithm is that it works well only when the data lies on (or close to) a linear manifold.

The PCA algorithm tries to fit a plane that minimizes the projection error (the sum of the squared lengths of all the red lines, which join each point to its projection on the plane).

Imagine PCA rotating your data, looking for the angle from which it sees the most variance.

As mentioned before, you can use PCA when your data lies on a linear manifold.

For nonlinear manifolds, however, the projection error will be large.
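A small sketch can make the difference concrete. It assumes two hypothetical 2-D data sets that are not from this chapter (points on a straight line and points on a circle), projects each onto its first principal component, and reports the mean squared projection error. The variable names are illustrative only.

```matlab
% Hypothetical data sets: a linear manifold (line) and a nonlinear one (circle).
t = linspace(0, 2*pi, 200)';
datasets = {[t, 2*t], [cos(t), sin(t)]};
names = {'line', 'circle'};
err = zeros(1, numel(datasets));

for i = 1:numel(datasets)
    X = datasets{i};
    Xc = X - mean(X);                  % center each column (implicit expansion)
    sigma = (Xc' * Xc) / size(Xc, 1);  % 2x2 covariance matrix
    [U, S, V] = svd(sigma);
    u1 = U(:, 1);                      % first principal component
    X_proj = (Xc * u1) * u1';          % projection onto that single direction
    err(i) = mean(sum((Xc - X_proj).^2, 2));
    fprintf('%s: mean squared projection error = %g\n', names{i}, err(i));
end
```

The line is captured almost perfectly by one component, while the circle loses roughly half of its variance no matter which direction is chosen.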

Calculating PCA

  1. Preprocess the data: $$X_{prep} = \frac{X - mean(X)}{std(X)}$$

  2. Calculate the covariance matrix: $$\sigma=\frac{1}{m}.(X^T.X)$$, where m is the number of samples and X is an $$m \times n$$ matrix with one experiment (sample) per row and one feature per column

  3. Get the eigenvectors of the covariance matrix: $$[U,S,V]=svd(\sigma)$$. Here U is an $$n \times n$$ matrix whose columns are the principal components. To reduce the data from n dimensions to k, we keep the first k columns of U.

The preprocessing step sometimes includes a division by the standard deviation of each column, but there are cases where this is not needed. (The mean subtraction is the more important part.)

Reducing input data

Now that we have calculated our principal components, which are stored in the matrix U, we will reduce our input data $$X \in R^n$$ from n dimensions to k dimensions, $$Z \in R^k$$. Here k is the number of columns of U that we keep. Depending on how you organized the data, Z can take two different forms:

$$U_{reduce}=U(:,1:k)\\ Z = U_{reduce}^T . X_{prep}\\ Z = X_{prep} . U_{reduce}$$
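The two forms of Z can be checked side by side. The sketch below assumes a hypothetical 15x3 stand-in matrix (not the chapter's data set): the left-multiplication form applies when samples are stored as columns, and the right-multiplication form when they are stored as rows.

```matlab
% Hypothetical stand-in data: 15 samples (rows) of 3 measurements (columns).
t = (1:15)';
X = [t, 2*t + sin(t), 0.5*t - cos(3*t)];

% Zero mean and unit standard deviation per column
% (implicit expansion: Matlab R2016b+ or Octave).
X_prep = (X - mean(X)) ./ std(X);

% Covariance matrix (3x3) and its principal components.
m = size(X_prep, 1);
sigma = (1/m) * (X_prep' * X_prep);
[U, S, V] = svd(sigma);

k = 2;
U_reduce = U(:, 1:k);            % 3x2

% Samples stored as rows (15x3): multiply on the right.
Z_rows = X_prep * U_reduce;      % 15x2

% Samples stored as columns (3x15): multiply on the left.
Z_cols = U_reduce' * X_prep';    % 2x15

% Both layouts hold the same numbers; one is the transpose of the other.
disp(max(max(abs(Z_rows - Z_cols'))));
```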

Example in Matlab

To illustrate the whole process we're going to calculate the PCA from an image, and then restore it with fewer dimensions.

Get some example data

Here our data is a matrix with 15 samples of 3 measurements [15x3]

Data pre-processing

Now we're going to subtract the mean of each column from every element in that column, then divide each element by the standard deviation of its column.

When given a matrix, mean and std operate on each column of X.
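A minimal sketch of this preprocessing step follows. The matrix X is a hypothetical 15x3 stand-in built from simple formulas, since the original sample values are not reproduced here; the names mu and sd are also just illustrative.

```matlab
% Hypothetical stand-in data: 15 samples (rows) of 3 measurements (columns).
t = (1:15)';
X = [t, 2*t + sin(t), 0.5*t - cos(3*t)];

mu = mean(X);   % 1x3 row vector: mean of each column
sd = std(X);    % 1x3 row vector: standard deviation of each column

% Subtract the column means and divide by the column standard deviations
% (implicit expansion: Matlab R2016b+ or Octave).
X_prep = (X - mu) ./ sd;

disp(mean(X_prep));   % each entry should be close to 0
disp(std(X_prep));    % each entry should be close to 1
```

Keeping mu and sd around is useful later, because they are needed to undo the preprocessing.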

Calculate the covariance matrix
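As a sketch, assuming the same kind of hypothetical 15x3 stand-in data with samples in rows, the covariance matrix from step 2 is a single matrix product. The comparison with the built-in cov is only a sanity check: cov divides by m-1 instead of m, so the two differ by a constant factor.

```matlab
% Hypothetical stand-in data: 15 samples (rows) of 3 measurements (columns).
t = (1:15)';
X = [t, 2*t + sin(t), 0.5*t - cos(3*t)];
X_prep = (X - mean(X)) ./ std(X);   % zero mean, unit std per column

m = size(X_prep, 1);                  % number of samples
sigma = (1/m) * (X_prep' * X_prep);   % 3x3 covariance matrix

% cov normalizes by (m-1), so rescale sigma before comparing.
difference = max(max(abs(sigma * m/(m-1) - cov(X_prep))));
disp(sigma);
disp(difference);                     % should be close to 0
```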

Get the principal components

Now we use svd to get the principal components, which are the eigenvectors of the covariance matrix (the columns of U), together with the corresponding eigenvalues (the diagonal of S).

There are different ways to calculate the PCA. For instance, Matlab already provides the functions pca and princomp, which may return eigenvectors (U) with different signs, but they all represent the same components.

The one thing you should pay attention to is the layout of the input matrix, because some PCA routines expect samples and measurements in a predefined arrangement (for example, samples in rows and measurements in columns).
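The sign ambiguity is easy to see by computing the components in two ways. The sketch below uses svd and, as a second route, eig rather than pca or princomp (which need an extra toolbox), on a hypothetical 15x3 stand-in data set. It assumes the eigenvalues are distinct, which holds for this data.

```matlab
% Hypothetical stand-in data: 15 samples (rows) of 3 measurements (columns).
t = (1:15)';
X = [t, 2*t + sin(t), 0.5*t - cos(3*t)];
X_prep = (X - mean(X)) ./ std(X);
m = size(X_prep, 1);
sigma = (1/m) * (X_prep' * X_prep);
sigma = (sigma + sigma') / 2;       % enforce exact symmetry against rounding

% Route 1: svd. For a symmetric covariance matrix, U holds the eigenvectors
% and diag(S) the eigenvalues, sorted from largest to smallest.
[U, S, V] = svd(sigma);

% Route 2: eig, with the eigenvalues sorted explicitly to match svd.
[E, D] = eig(sigma);
[vals, idx] = sort(diag(D), 'descend');
E = E(:, idx);

% Matching columns of U and E are equal up to a sign flip, so the absolute
% value of U' * E should be close to the identity matrix.
sign_check = abs(U' * E);
disp(sign_check);

% Fraction of the total variance carried by each component.
explained = diag(S) / sum(diag(S));
disp(explained');
```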

Recover original data

Now to recover the original data we use all the components, and also reverse the preprocessing.
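A sketch of the full round trip, again on hypothetical 15x3 stand-in data: project onto all three components, project back, then undo the scaling and the mean subtraction. Because U is orthogonal, nothing is lost when every component is kept.

```matlab
% Hypothetical stand-in data: 15 samples (rows) of 3 measurements (columns).
t = (1:15)';
X = [t, 2*t + sin(t), 0.5*t - cos(3*t)];
mu = mean(X);
sd = std(X);
X_prep = (X - mu) ./ sd;

m = size(X_prep, 1);
sigma = (1/m) * (X_prep' * X_prep);
[U, S, V] = svd(sigma);

Z = X_prep * U;                 % 15x3: data expressed in all 3 components
X_prep_rec = Z * U';            % back to the preprocessed coordinates
X_rec = X_prep_rec .* sd + mu;  % reverse the preprocessing

disp(max(max(abs(X_rec - X)))); % should be close to 0
```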

Reducing our data

Now that we have our principal components, let's reduce the data using, for instance, k=2.

We can use the reduced data Z to recreate the data X, but with some loss. The idea is that Z is smaller than X but retains most of its variance. In this case we have $$X \in R^3$$, and we can reproduce an approximation X_loss from $$Z \in R^{k=2}$$, so with one dimension less.
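The reduction with k=2 might look like the sketch below, which once more assumes a hypothetical 15x3 stand-in data set. X_loss is the lossy reconstruction, and retained is the fraction of variance kept by the first k components.

```matlab
% Hypothetical stand-in data: 15 samples (rows) of 3 measurements (columns).
t = (1:15)';
X = [t, 2*t + sin(t), 0.5*t - cos(3*t)];
mu = mean(X);
sd = std(X);
X_prep = (X - mu) ./ sd;

m = size(X_prep, 1);
sigma = (1/m) * (X_prep' * X_prep);
[U, S, V] = svd(sigma);

k = 2;
U_reduce = U(:, 1:k);             % keep the first k components (3x2)
Z = X_prep * U_reduce;            % reduced data (15x2)

X_prep_loss = Z * U_reduce';      % approximate preprocessed data (15x3)
X_loss = X_prep_loss .* sd + mu;  % reverse the preprocessing

s = diag(S);
retained = sum(s(1:k)) / sum(s);  % fraction of variance kept
residual = sum(sum((X_prep - X_prep_loss).^2));  % squared error that was lost

fprintf('variance retained with k=%d: %g\n', k, retained);
fprintf('squared reconstruction error: %g\n', residual);
```

The squared error in the preprocessed coordinates equals m times the discarded eigenvalue, which is why dropping a component with a small eigenvalue costs little.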

Using PCA on images

Before finishing the chapter, we're going to use PCA on images.