Feature Scaling
Introduction
To make life easier for gradient descent algorithms, there are some preprocessing techniques that can be applied to your data during the training and test phases. If the features of your input vector X are on very different scales, your loss surface will be stretched along some directions. This makes gradient descent convergence harder, or at least slower.
As an example, suppose your input X has 2 features (house size and number of bedrooms). The problem is that the house size feature ranges from 0 to 2000, while the number of bedrooms ranges from 0 to 5.
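To make the effect concrete, here is a minimal sketch in Python with NumPy. The library choice, the four made-up houses, the prices, and the mean-squared-error loss are all assumptions for illustration, not data or code from this chapter. It computes the gradient of the loss for a linear model at zero weights and compares the two components.

```python
import numpy as np

# Hypothetical toy data: columns are [house size, number of bedrooms].
X = np.array([[ 800.0, 2.0],
              [1200.0, 3.0],
              [1500.0, 3.0],
              [2000.0, 5.0]])
# Hypothetical prices (in thousands), made up for illustration.
y = np.array([150.0, 220.0, 260.0, 380.0])

w = np.zeros(X.shape[1])  # start from zero weights

# Gradient of the loss mean((X @ w - y) ** 2) with respect to w.
residual = X @ w - y
grad = 2.0 * (X.T @ residual) / X.shape[0]

print(grad)                          # one entry per feature
print(abs(grad[0]) / abs(grad[1]))   # how lopsided the gradient is
```

The house size component is hundreds of times larger than the bedrooms component, so a single learning rate cannot suit both weights at once.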
Centralize data and normalize
Below we will pre-process our input data to fix the following problems:
Data not centered around zero
Features out of scale
Consider your input data X, a matrix of shape N × D, where N is the number of samples in your input data (batch size) and D is the number of dimensions (in the previous example D is 2: house size and number of bedrooms).
The first thing to do is to subtract the mean of each feature from the input data; this centers the data around zero.
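Mean subtraction can be sketched as follows, assuming Python with NumPy and a small made-up X of shape (N, D) with N = 4 and D = 2. The variable names are illustrative, not taken from this chapter.

```python
import numpy as np

# Hypothetical input: N = 4 samples, D = 2 features
# (house size, number of bedrooms).
X = np.array([[ 800.0, 2.0],
              [1200.0, 3.0],
              [1500.0, 3.0],
              [2000.0, 5.0]])

# Mean of every feature (column); the result has shape (D,).
mean = X.mean(axis=0)

# Broadcasting subtracts the per-feature mean from each row.
X_centered = X - mean

print(mean)
print(X_centered.mean(axis=0))  # approximately zero for every feature
```

Taking the mean along axis 0 gives one value per feature, so each column is centered independently.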
At prediction time, it is common to store this mean value so it can be subtracted from each test example. In the case of image classification, it is also common to store a mean image computed from a batch of training-set images, or the mean value of every channel.
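Both image variants can be sketched like this, assuming Python with NumPy, random pixel values standing in for real images, and an (N, H, W, C) array layout. All of these are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training batch: 8 RGB images of 32x32 pixels,
# stored as (N, H, W, C) with values in 0..255.
train_images = rng.integers(0, 256, size=(8, 32, 32, 3)).astype(np.float64)

# Option 1: a mean image, averaged over the batch dimension.
mean_image = train_images.mean(axis=0)            # shape (32, 32, 3)

# Option 2: one mean value per channel, averaged over batch and pixels.
channel_mean = train_images.mean(axis=(0, 1, 2))  # shape (3,)

# At prediction time, subtract the stored statistic from a new image.
test_image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
centered_with_image = test_image - mean_image
centered_with_channels = test_image - channel_mean
```

The per-channel mean is cheaper to store (three numbers), while the mean image keeps a separate offset for every pixel position.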
After your data is centered around zero, you can bring all features to a similar range by dividing each feature of X by its standard deviation.
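A minimal sketch of both steps together, assuming Python with NumPy and the same made-up house data used for illustration. Note that `np.std` uses the population standard deviation (`ddof=0`) by default, and that a constant feature would have a standard deviation of zero and need special handling, which this sketch omits.

```python
import numpy as np

# Hypothetical input: N = 4 samples, D = 2 features.
X = np.array([[ 800.0, 2.0],
              [1200.0, 3.0],
              [1500.0, 3.0],
              [2000.0, 5.0]])

mean = X.mean(axis=0)  # per-feature mean, shape (D,)
std = X.std(axis=0)    # per-feature standard deviation, shape (D,)

# Center first, then scale every feature by its own standard deviation.
X_norm = (X - mean) / std

print(X_norm.mean(axis=0))  # approximately zero for every feature
print(X_norm.std(axis=0))   # approximately one for every feature
```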
This operation fixes our second problem (features out of scale), because all features will now range similarly. However, it should only be used if you know that your features have the same "weight". In the case of image classification, for example, all pixels already have the same range (0..255) and no single pixel has a bigger meaning (weight) than another, so mean subtraction alone should suffice for image classification.
Common mistake:
An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. Computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
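The correct order of operations can be sketched as follows, assuming Python with NumPy, a synthetic dataset, and a 60/20/20 split. The `preprocess` helper is a hypothetical name introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 samples, 2 features on very different scales.
X = rng.normal(loc=[1000.0, 3.0], scale=[500.0, 1.0], size=(100, 2))

# Split first...
X_train, X_val, X_test = X[:60], X[60:80], X[80:]

# ...then compute the statistics on the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

def preprocess(data):
    """Apply the training-set statistics to any split."""
    return (data - mean) / std

X_train_p = preprocess(X_train)
X_val_p = preprocess(X_val)
X_test_p = preprocess(X_test)
```

Only the training split ends up with exactly zero mean and unit standard deviation. The validation and test splits will be close but not exact, which is expected, since their statistics were never looked at.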
Next Chapter
In the next chapter we will learn about Neural Networks.