Feature Scaling

Introduction

To make life easier for gradient descent algorithms, there are some techniques that can be applied to your data during the training/test phase. If the features $x_1, x_2, x_3, \dots, x_n$ of your input vector $X=[x_1, x_2, x_3, \dots, x_n]$ are out of scale, your loss surface $J(\theta)$, with parameters $\theta=[\theta_1,\theta_2]$, will be stretched. This makes gradient descent convergence harder, or at least slower.

In the example below, your input X has 2 features (house size and number of bedrooms). The problem is that the house size feature ranges from 0 to 2000, while the number of bedrooms ranges from 0 to 5.
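As a minimal sketch (the values below are made up purely to illustrate the scale gap), you can inspect the per-feature range and spread with NumPy:

```python
import numpy as np

# Hypothetical training data: column 0 = house size, column 1 = number of bedrooms
X = np.array([[1800.0, 4.0],
              [ 850.0, 2.0],
              [1200.0, 3.0],
              [ 400.0, 1.0]])

print(X.min(axis=0), X.max(axis=0))  # [400. 1.] [1800. 4.]
print(X.std(axis=0))                 # house size spreads roughly 500x more than bedrooms
```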

Centralize data and normalize

Below we will pre-process our input data to fix the following problems:

  • Data not centered around zero

  • Features out of scale

Consider your input data $X$ of shape $[N \times D]$, where N is the number of samples in your input data (batch size) and D the number of dimensions (in the previous example D is 2: house size and number of bedrooms).

The first thing to do is to subtract the mean of the input data; this centers the data distribution around zero:

X = X - np.mean(X, axis=0)

At prediction time it is common to store this mean value so it can be subtracted from each test example. In the case of image classification, it is also common to store a mean image computed from a batch of images in the training set, or the mean value of every channel.
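A minimal sketch of this idea, using randomly generated images as stand-ins for a real training set (the shapes and names here are illustrative assumptions):

```python
import numpy as np

# Hypothetical training batch of images shaped [N, H, W, C] with pixels in 0..255
images_train = np.random.randint(0, 256, size=(32, 64, 64, 3)).astype(np.float32)

mean_image   = np.mean(images_train, axis=0)           # mean image, shape [H, W, C]
mean_channel = np.mean(images_train, axis=(0, 1, 2))   # per-channel mean, shape [C]

# At prediction time the stored statistic is subtracted from the test image
test_image = np.random.randint(0, 256, size=(64, 64, 3)).astype(np.float32)
test_image_centered = test_image - mean_channel
```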

After your data is centered around zero, you can make all features have the same range by dividing X by its standard deviation:

X = X / np.std(X, axis=0)

This operation fixes our second problem, because all features will now have a similar range. However, it should only be used if you know that your features carry the same "weight". In image classification, for example, all pixels share the same range (0..255) and no single pixel carries more meaning (weight) than another, so mean subtraction alone usually suffices.
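Putting both steps together, a sketch of a small helper might look like the following (the function name, variable names, and the eps safeguard are my own additions, not part of the original formulas):

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Zero-center and scale features using statistics from the training set only.

    eps avoids division by zero for constant features.
    """
    mean = np.mean(X_train, axis=0)
    std  = np.std(X_train, axis=0)
    X_train_scaled = (X_train - mean) / (std + eps)
    X_test_scaled  = (X_test  - mean) / (std + eps)
    return X_train_scaled, X_test_scaled, mean, std
```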

Common mistake:

An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. Computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
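A minimal sketch of the correct order (the data, shapes, and split ratio below are arbitrary assumptions for illustration):

```python
import numpy as np

# Hypothetical [N x D] dataset: split first, then compute statistics on the
# training split only and reuse them on the other splits.
X_all = np.random.randn(1000, 2) * np.array([500.0, 1.5]) + np.array([1000.0, 3.0])

idx = np.random.permutation(len(X_all))
train_idx, val_idx = idx[:800], idx[800:]
X_train, X_val = X_all[train_idx], X_all[val_idx]

mean_train = np.mean(X_train, axis=0)   # computed on the training split only
X_train = X_train - mean_train
X_val   = X_val   - mean_train          # same training mean applied to validation

# Wrong: np.mean(X_all, axis=0) computed before splitting would leak
# validation/test information into the preprocessing.
```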

Next Chapter

In the next chapter we will learn about Model Initialization.