Distributed Learning

Learn how the training of deep models can be distributed across multiple machines.

Map Reduce

Map Reduce training can be described in the following steps (a minimal sketch of this loop follows the list):

  1. Split your training set into batches (e.g. divide it by the number of workers on your farm: 4).

  2. Give each machine on your farm 1/4 of the data.

  3. Perform forward/backward propagation on each compute node (all nodes share the same model).

  4. Combine the gradients from each machine and perform the gradient descent update.

  5. Update the model version on all nodes.
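As a rough illustration (not code from the original text), the Python sketch below shows one synchronous step of this procedure; `workers`, `forward_backward`, `num_examples` and `set_model` are hypothetical placeholders for the machines on the farm and the framework calls they would expose.

```python
import numpy as np

def map_reduce_step(workers, theta, alpha):
    # Map: every worker computes gradients on its own data shard,
    # all of them starting from the same copy of the model.
    partial_grads = [w.forward_backward(theta) for w in workers]

    # Reduce: a central node sums the partial gradients...
    total_grad = np.sum(partial_grads, axis=0)

    # ...performs the gradient descent update over the whole batch...
    total_examples = sum(w.num_examples for w in workers)
    theta = theta - alpha * total_grad / total_examples

    # ...and broadcasts the new model version back to all nodes.
    for w in workers:
        w.set_model(theta)
    return theta
```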

Example Linear Regression model

Consider the batch gradient descent formula, which is gradient descent applied over the whole training set (here 400 examples):

$$\theta_{j} := \theta_{j} - \alpha\frac{1}{400}\sum_{i=1}^{400}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

After splitting the dataset, each machine deals with 100 elements and calculates its partial sum $temp_j^{1..4}$:

$$temp_j^1=\sum_{i=1}^{100}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$temp_j^2=\sum_{i=101}^{200}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$temp_j^3=\sum_{i=201}^{300}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$temp_j^4=\sum_{i=301}^{400}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

Each machine calculates the back-propagation and error for its own split of the data; remember that all machines have the same copy of the model. After each machine has calculated its respective $temp_j^{\text{machine}}$, another machine combines those gradients, calculates the new weights and updates the model on all machines:

$$\theta_{j} := \theta_{j} - \alpha\frac{1}{400}\left(temp_j^1+temp_j^2+temp_j^3+temp_j^4\right)$$

The whole point of this procedure is to check that we can combine the calculations of all the nodes and still get the same final result as if a single machine had processed the whole batch.
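To make that point concrete, the short NumPy sketch below (an illustration, not from the original text) checks numerically that summing the four partial gradients gives the same update as the full-batch computation; the linear hypothesis $h_\theta(x) = x^\top\theta$ and the random toy data are assumptions made only for this example.

```python
import numpy as np

# Toy data: 400 examples split across 4 "machines" of 100 examples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = rng.normal(size=400)
theta = np.zeros(3)

def gradient(X_split, y_split, theta):
    # sum_i (h_theta(x_i) - y_i) * x_i for one shard of the data
    return X_split.T @ (X_split @ theta - y_split)

# "Map": each machine computes temp_j on its own 100 examples.
temps = [gradient(X[k*100:(k+1)*100], y[k*100:(k+1)*100], theta)
         for k in range(4)]

# "Reduce": combine the partial sums and apply the same update a single
# machine would apply after seeing all 400 examples at once.
alpha = 0.1
theta_distributed = theta - alpha / 400 * sum(temps)
theta_single      = theta - alpha / 400 * gradient(X, y, theta)

print(np.allclose(theta_distributed, theta_single))  # True
```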

Who uses this approach

  • Caffe

  • Torch (Parallel layer)

Problems

This approach has some problems:

  • The complete model must fit on every machine

  • If the model is too big, it will take time to update all machines with the same model

Split weights

Another approach was used in Google's DistBelief project, where a normal neural network model has its weights split across multiple machines.

In this approach only the weights (thick edges) that cross machines need to be synchronized between the workers. This technique can only be used on fully connected layers. If you mix both techniques (as referenced in the AlexNet paper), you share the fully connected processing (just a matrix multiplication) across machines, and when you reach the convolutional part, each machine gets one part of the batch. A minimal sketch of the fully connected split follows.
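As an illustration (not from the original text), the NumPy sketch below splits the weight matrix of one fully connected layer column-wise across two hypothetical workers; each worker multiplies the same input by its own slice of the weights, and only the partial activations would need to cross the network.

```python
import numpy as np

# Model parallelism for a single fully connected layer, simulated in one
# process: W (in_dim x out_dim) is split column-wise, so worker 1 owns
# W[:, :half] and worker 2 owns W[:, half:].
rng = np.random.default_rng(0)
in_dim, out_dim, batch = 256, 128, 32
W = rng.normal(size=(in_dim, out_dim))
x = rng.normal(size=(batch, in_dim))

half = out_dim // 2
W_worker1, W_worker2 = W[:, :half], W[:, half:]

# Each "worker" computes its part of the output from the same input batch.
out1 = x @ W_worker1
out2 = x @ W_worker2
out = np.concatenate([out1, out2], axis=1)

print(np.allclose(out, x @ W))  # True: the split layer matches the full one
```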

Google approach (old)

Here each model replica is trained independently on its own piece of the data, and a parameter server synchronizes the parameters between the workers.
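The toy Python class below (purely illustrative, not DistBelief code) captures the parameter-server pattern: workers pull the current parameters, compute gradients on their own data shard, and push the gradients back, while the server applies them as they arrive.

```python
import numpy as np

class ParameterServer:
    """Toy parameter server: workers pull weights and push gradients."""

    def __init__(self, dim, alpha=0.01):
        self.theta = np.zeros(dim)
        self.alpha = alpha

    def pull(self):
        # A worker asks for the current model parameters.
        return self.theta.copy()

    def push(self, grad):
        # A worker sends its gradient; the server updates immediately,
        # without waiting for the other replicas.
        self.theta -= self.alpha * grad

# Each model replica would run a loop like this with its own data shard:
#   theta = server.pull()
#   grad = compute_gradient(theta, my_shard)   # hypothetical helper
#   server.push(grad)
```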

Google new approach

Google now offers in TensorFlow some automation for choosing which distribution strategy to follow, depending on your workload.
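As an example of what this looks like in practice, the sketch below assumes TensorFlow 2.x and its tf.distribute API (whether this is exactly the mechanism the author had in mind is an assumption); a distribution strategy wraps model construction so that replication and gradient aggregation are handled automatically.

```python
import tensorflow as tf

# The chosen strategy decides how the model is replicated and how
# gradients are aggregated across devices or machines.
strategy = tf.distribute.MirroredStrategy()  # synchronous, multi-GPU, one machine
# strategy = tf.distribute.MultiWorkerMirroredStrategy()  # multiple machines

with strategy.scope():
    # Variables created in this scope are mirrored on every replica.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='sgd', loss='mse')

# model.fit(dataset) would then run data-parallel training, one replica
# per device, with gradients reduced automatically after each step.
```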

Asynchronous Stochastic Gradient Descent