Distributed Learning

Learn how the training of deep models can be distributed across multiple machines.

Map Reduce

Map Reduce can be described in the following steps:
  1. Split your training set into batches (e.g. divide it by the number of workers on your farm: 4).
  2. Give each machine of your farm 1/4th of the data.
  3. Perform forward/backward propagation on each compute node (all nodes share the same model).
  4. Combine the results of each machine and perform gradient descent.
  5. Update the model version on all nodes.

Example Linear Regression model

Consider the batch gradient descent formula, which is gradient descent applied to the whole training set (here, 400 examples):
$$\theta_{j} := \theta_{j} - \alpha\frac{1}{400}\sum_{i=1}^{400}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
After splitting the dataset, each machine deals with 100 examples and calculates its own partial sum $temp_j^k$:
$$\begin{aligned}
temp_j^1 &= \sum_{i=1}^{100}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\\
temp_j^2 &= \sum_{i=101}^{200}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\\
temp_j^3 &= \sum_{i=201}^{300}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\\
temp_j^4 &= \sum_{i=301}^{400}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
\end{aligned}$$
Each machine calculates the back-propagation and error for its own split of the data. Remember that all machines have the same copy of the model. After each machine has calculated its respective partial sum ($temp_j^k$ for machine $k$), another machine combines those gradients, calculates the new weights, and updates the model on all machines.
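The whole point of this procedure is to check that the calculations of all nodes can be combined and still give the same final result. Here they can, because the sum over all 400 examples is just the sum of the four partial sums:
$$\theta_{j} := \theta_{j} - \alpha\frac{1}{400}\left(temp_j^1 + temp_j^2 + temp_j^3 + temp_j^4\right)$$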
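The map and reduce steps for the linear regression example can be sketched in plain Python. This is a minimal single-process simulation, not how Caffe or Torch implement it: the "workers" are just slices of a list, and the names `partial_gradient` and `distributed_step`, the synthetic data, and the learning rate are all assumptions made for illustration.

```python
import math
import random


def partial_gradient(theta, xs, ys):
    """Return sum_i (h_theta(x_i) - y_i) * x_ij over one shard, for every j."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        error = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad


def distributed_step(theta, xs, ys, num_workers=4, alpha=0.01):
    """One batch gradient descent step with the data split across workers."""
    m = len(xs)
    shard = m // num_workers  # assumes m is divisible by num_workers
    # Map: every worker computes the partial sum for its own shard.
    temps = [
        partial_gradient(theta,
                         xs[k * shard:(k + 1) * shard],
                         ys[k * shard:(k + 1) * shard])
        for k in range(num_workers)
    ]
    # Reduce: one node adds the partial sums and applies the update.
    total = [sum(temp[j] for temp in temps) for j in range(len(theta))]
    return [t - alpha * g / m for t, g in zip(theta, total)]


# Hypothetical data: 400 examples of y = 3 + 2 * x, with a bias feature of 1.
random.seed(0)
xs = [[1.0, random.uniform(-1.0, 1.0)] for _ in range(400)]
ys = [3.0 + 2.0 * x[1] for x in xs]
theta = [0.0, 0.0]

single = distributed_step(theta, xs, ys, num_workers=1)
sharded = distributed_step(theta, xs, ys, num_workers=4)
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for a, b in zip(single, sharded))
print(same)
```

With one worker the function is ordinary batch gradient descent, so comparing `single` and `sharded` checks that splitting the sum across four workers gives the same update, up to floating-point rounding.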

Who uses this approach

  • Caffe
  • Torch (Parallel layer)


This approach has some problems:
  • The complete model must fit on every machine
  • If the model is too big, it takes time to update all machines with the same model

Split weights

Another approach was used in Google's DistBelief project, where a normal neural network model has its weights separated across multiple machines.
In this approach only the weights that cross machines (the thick edges in the diagram) need to be synchronized between the workers. This technique could only be used on fully connected layers. If you mix both techniques (as referenced in the AlexNet paper), you share the fully connected processing (just a matrix multiplication), and when you need to do the convolution part, each convolution layer gets one part of the batch.
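As a rough illustration of splitting the weights of a fully connected layer, the sketch below assigns blocks of output neurons (rows of the weight matrix) to different "machines" and concatenates their partial outputs. It is a simplified row-wise split chosen for clarity, not DistBelief's actual partitioning scheme; the helper names `matvec` and `split_rows` and the small matrix are hypothetical.

```python
def matvec(weights, x):
    """Fully connected layer without bias: one dot product per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def split_rows(weights, num_machines):
    """Give each machine a contiguous block of output neurons (rows)."""
    per_machine = len(weights) // num_machines  # assumes an even split
    return [weights[k * per_machine:(k + 1) * per_machine]
            for k in range(num_machines)]


# Hypothetical 4x3 weight matrix and a single input vector.
weights = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, -1.0],
    [3.0, 0.0, 1.0],
    [-1.0, 1.0, 2.0],
]
x = [1.0, 2.0, 3.0]

# Each machine only stores and multiplies its own slice of the weights.
shards = split_rows(weights, 2)
partial_outputs = [matvec(shard, x) for shard in shards]

# Concatenating the slices reproduces the output of the unsplit layer.
combined = [value for part in partial_outputs for value in part]
print(combined == matvec(weights, x))
```

In this simplified layout every machine needs the full input vector, so the input activations are what would have to be communicated between machines.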

Google's approach (old)

Here each model replica is trained independently on its own piece of the data, and a parameter server synchronizes the parameters between the workers.
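The replica and parameter server idea can be sketched as a small single-process simulation. This is only an assumption-laden toy: the `ParameterServer` class, its `pull`/`push` methods, the round-robin worker loop, and the synthetic data are hypothetical and do not reflect Google's actual implementation, where replicas run concurrently on separate machines.

```python
import random


class ParameterServer:
    """Holds the shared parameters and applies gradients pushed by replicas."""

    def __init__(self, theta, alpha):
        self.theta = list(theta)
        self.alpha = alpha

    def pull(self):
        # A replica fetches a copy of the current parameters.
        return list(self.theta)

    def push(self, grad):
        # A replica sends its gradient; the server updates the parameters.
        self.theta = [t - self.alpha * g for t, g in zip(self.theta, grad)]


def shard_gradient(theta, xs, ys):
    """Mean gradient of the squared error for linear regression on one shard."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        error = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj / len(xs)
    return grad


# Hypothetical data: 400 examples of y = 3 + 2 * x, split over 4 replicas.
random.seed(0)
xs = [[1.0, random.uniform(-1.0, 1.0)] for _ in range(400)]
ys = [3.0 + 2.0 * x[1] for x in xs]
shards = [(xs[k * 100:(k + 1) * 100], ys[k * 100:(k + 1) * 100])
          for k in range(4)]

server = ParameterServer([0.0, 0.0], alpha=0.1)
for _ in range(200):
    # Replicas take turns here; real replicas would run in parallel.
    for shard_xs, shard_ys in shards:
        theta = server.pull()
        server.push(shard_gradient(theta, shard_xs, shard_ys))

print(server.theta)
```

Each replica only ever sees its own 100 examples, yet the shared parameters on the server move toward the values that generated the data.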

Google's new approach

Google now offers some automation in TensorFlow for choosing which strategy to follow, depending on your workload.

Asynchronous Stochastic Gradient Descent