Making faster

Introduction

On this chapter we show a way to convert your convolution operation into a matrix multiplication. This has the advantage to compute faster, at the expense of more memory usage. We employ the im2col operation that will transform the input image or batch into a matrix, then we multiply this matrix with a reshaped version of our kernel. Then at the end we reshape this multiplied matrix back to an image with the col2im operation.

Im2col

As shown on previous source code, we use a lot for for-loops to implement the convolutions, while this is useful for learning purpose, it's not fast enough. On this section we will learn how to implement convolutions on a vectorized fashion. First, if we inspect closer the code for convolution is basically a dot-product between the kernel filter and the local regions selected by the moving window, that sample a patch with the same size as our kernel. What would happens if we expand all possible windows on memory and perform the dot product as a matrix multiplication. Answer 200x or more speedups, at the expense of more memory consumption.

For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4 and padding 0, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 1111311*11*3 = 363.

Calculating with input 227 with stride 4 and padding 0, gives ((227-11)/4)+1 = 55 locations along both width and height, leading to an output matrix X_col of size [363 x 3025]. Here every column is a stretched out receptive field (patch with depth) and there are 55*55 = 3025 of them in total.

To summarize how we calculate the im2col output sizes:

[img_height, img_width, img_channels] = size(img);
newImgHeight = floor(((img_height + 2*P - ksize) / S)+1);
newImgWidth = floor(((img_width + 2*P - ksize) / S)+1);        
cols = single(zeros((img_channels*ksize*ksize),(newImgHeight * newImgWidth)));

The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix W_row of size [96 x 363], where 11x11x3=363

After the image and the kernel are converted, the convolution can be implemented as a simple matrix multiplication, in our case it will be W_col[96 x 363] multiplied by X_col[363 x 3025] resulting as a matrix [96 x 3025], that need to be reshaped back to [55x55x96].

This final reshape can also be implemented as a function called col2im.

Notice that some implementations of im2col will have this result transposed, if this is the case then the order of the matrix multiplication must be changed.

Forward graph

In order to help the usage of im2col with convolution and also to derive the back-propagation, let's show the convolution with im2col as a graph. Here the input tensor is single a 3 channel 4x4 image. That will pass to a convolution layer with S:1 P:0 K:2 and F:1 (Output volume).

Backward graph

Using the im2col technique the computation graph resembles the FC layer with the same format f(x,θ,β)=(x.θT)+βf(x, \theta, \beta) = (x.\theta^T) + \beta, the difference that now we have a bunch of reshapes, transposes and the im2col block.

About the reshapes and transposes during back propagation you just need to invert their operations using again another reshape or transpose, the only important thing to remember is that if you use a reshape row major during forward propagation you need to use a reshape row major on the backpropagation.

The only point to pay attention is the im2col backpropagation operation. The issue is that it cannot be implemented as a simple reshape. This is because the patches could actually overlap (depending on the stride), so you need to sum the gradients where the patches intersect.

Matlab forward propagation

function [activations] = ForwardPropagation(obj, input, weights, bias)
    % Tensor format (rows,cols,channels, batch) on matlab
    [H,W,~,N] = size(input);
    [HH,WW,C,F] = size(weights);

    % Calculate output sizes
    H_prime = (H+2*obj.m_padding-HH)/obj.m_stride +1;
    W_prime = (W+2*obj.m_padding-WW)/obj.m_stride +1;

    % Alocate memory for output
    activations = zeros([H_prime,W_prime,F,N]);

    % Preparing filter weights
    filter_col = reshape(weights,[HH*WW*C F]);
    filter_col_T = filter_col';

    % Preparing bias
    if ~isempty(bias)
        bias_m = repmat(bias,[1 H_prime*W_prime]);
    else
        b = zeros(size(filter_col,1),1);
        bias_m = repmat(b,[1 H_prime*W_prime]);
    end

    % Here we convolve each image on the batch in a for-loop, but the im2col
    % could also handle a image batch at the input, so all computations would
    % be just one big matrix multiplication. We opted now for this to test the
    % par-for implementation with OpenMP on CPU
    for idxBatch = 1:N
        im = input(:,:,:,idxBatch);    
        im_col = im2col_ref(im,HH,WW,obj.m_stride,obj.m_padding,1);
        mul = (filter_col_T * im_col) + bias_m;
        activations(:,:,:,idxBatch) =  reshape_row_major(mul,[H_prime W_prime size(mul,1)]);                                                
    end

    % Cache results for backpropagation
    obj.activations = activations;
    obj.weights = weights;
    obj.biases = bias;
    obj.previousInput = input;            
end

Matlab backward propagation

function [gradient] = BackwardPropagation(obj, dout)
    dout = dout.input;  
    [H,W,~,N] = size(obj.previousInput);
    [HH,WW,C,F] = size(obj.weights);  

    % Calculate output sizes
    H_prime = (H+2*obj.m_padding-HH)/obj.m_stride +1;
    W_prime = (W+2*obj.m_padding-WW)/obj.m_stride +1;

    % Preparing filter weights            
    filter_col_T = reshape_row_major(obj.weights,[F HH*WW*C]);                                                

    % Initialize gradients
    dw = zeros(size(obj.weights));
    dx = zeros(size(obj.previousInput));            
    % Get the bias gradient which will be the sum of dout over the
    % dimensions (batches(4), rows(1), cols(2))
    db = sum(sum(sum(dout, 1), 2), 4);

    for idxBatch = 1:N
        im = obj.previousInput(:,:,:,idxBatch);    
        im_col = im2col_ref(im,HH,WW,obj.m_stride,obj.m_padding,1);
        dout_i = dout(:,:,:,idxBatch);

        dout_i_reshaped = reshape_row_major(dout_i,[F, H*W]);                

        dw_before_reshape = dout_i_reshaped * im_col';                
        dw_i = reshape(dw_before_reshape',[HH, WW, C, F]);
        dw = dw + dw_i;

        % We now have the gradient just before the im2col
        grad_before_im2col = (dout_i_reshaped' * filter_col_T);                
        % Now we need to backpropagate im2col (im2col_back),
        % results will padded by one always
        dx_padded = im2col_back_ref(grad_before_im2col,H_prime, W_prime, obj.m_stride, HH, WW, C);                
        % Now we need to take out the pading
        dx(:,:,:,idxBatch) = dx_padded(2:end-1, 2:end-1,:);
    end

    %% Output gradients    
    gradient.bias = db;
    gradient.input = dx;  
    gradient.weight = dw;

end

Matlab im2col

function [ img_matrix ] = im2col_ref( inputImg, k_height, k_width, S , P, isConv )
%IM2COL Convert image to a matrix, this step is used to accelerate
%convolutions, implementing the convolution as a matrix multiplication
% This version currently does not support batch of images, we choose this
% because we're first going to use the CPU mode, and we want to relly on
% parfor (OpenMP)
coder.extrinsic('warning')
% Get Image dimensions
[imgHeight, imgWidth, imgChannels] = size(inputImg);

% Calculate convolved result size.
newImgHeight = ((imgHeight + 2*P - k_height) / S)+1;
newImgWidth = ((imgWidth + 2*P - k_width) / S)+1;
offset_K_Height = 0;
offset_K_Width = 0;

% Check if it is a real number
if rem(newImgHeight,1) ~= 0 || rem(newImgWidth,1) ~= 0
    warning('warning: Invalid stride or pad for input\n');
    if isConv
        % Convolution do a floor
        newImgHeight = floor(((imgHeight + 2*P - k_height) / S)+1);
        newImgWidth = floor(((imgWidth + 2*P - k_width) / S)+1);        
    else
        % Pooling do a ceil and adapt the sampling window
        newImgHeight = ceil(((imgHeight + 2*P - k_height) / S)+1);
        newImgWidth = ceil(((imgWidth + 2*P - k_width) / S)+1);
        offset_K_Height = k_height - 1;
        offset_K_Width = k_width - 1;
    end
end

% Allocate output sizes
img_matrix = zeros(...
    (imgChannels*k_height*k_width),(newImgHeight * newImgWidth) ...
    );

% Only pad if needed
if P ~= 0
    inputImg_Padded = padarray(inputImg,[P P]);
    % Get dimensions again before iterate on padded image, otherwise we will
    % keep sampling with the old (unpadded size)
    [imgHeight, imgWidth, ~] = size(inputImg_Padded);
end

% Iterate on the input image like a convolution
cont = 1;
for r=1:S:(imgHeight-offset_K_Height)
    for c=1:S:(imgWidth-offset_K_Width)
        % Avoid slide out of the image (Security buffer overflow)
        if (((c+k_width)-1) <= imgWidth) && (((r+k_height)-1) <= imgHeight)
            % Select window on input volume
            if P == 0
                patch = inputImg(r:(r+k_height)-1,c:(c+k_width)-1,:);
            else
                patch = inputImg_Padded(r:(r+k_height)-1,c:(c+k_width)-1,:);
            end

            % Convert patch to a col vector, the matlab reshape order is
            % row major while other languages (C/C++, python) are col
            % major, on this particular case (im2col, then matrix mult with
            % the kernel) this order will not mather, but it's not allways
            % true...
            patchRow = reshape(patch,[],1);

            % Append the transformed patch into the output matrix
            img_matrix(:,cont) = patchRow;
            cont = cont+1;
        end
    end
end

end

Matlab im2col backward propagation

function [ img_grad ] = im2col_back_ref( dout, dout_H, dout_W, S, HH, WW, C )
    %IM2COL_BACK_REF Backpropagation of im2col
    % dout: (
    % Return
    % Image gradient (H,W,C)

    % Calculate the spatial dimensions of im_grad
    % Observe that the result will be "padded"
    H = (dout_H - 1) * S + HH;
    W = (dout_W - 1) * S + WW;

    img_grad = zeros(H,W,C);

    for ii=1:(dout_H*dout_W)
        row = dout(ii,:);

        % Create a patch from the row
        patch = reshape_row_major(row,[HH WW C]);
        %patch = reshape(row,[HH WW C]);

        % Calculate indexes on dx
        h_start = floor(((ii-1) / dout_W) * S);    
        w_start = mod((ii-1),dout_W) * S;
        h_start = h_start + 1;
        w_start = w_start + 1;

        img_grad(h_start:h_start+HH-1, w_start:w_start+WW-1, :) = img_grad(h_start:h_start+HH-1, w_start:w_start+WW-1, :) + patch;    
    end    
end

Python example for forward propagation

def conv_forward_naive(x, w, b, conv_param):
  """
  A naive implementation of the forward pass for a convolutional layer.

  The input consists of N data points, each with C channels, height H and width
  W. We convolve each input with F different filters, where each filter spans
  all C channels and has height HH and width HH.

  Input:
  - x: Input data of shape (N, C, H, W)
  - w: Filter weights of shape (F, C, HH, WW)
  - b: Biases, of shape (F,)
  - conv_param: A dictionary with the following keys:
    - 'stride': The number of pixels between adjacent receptive fields in the
      horizontal and vertical directions.
    - 'pad': The number of pixels that will be used to zero-pad the input.

  Returns a tuple of:
  - out: Output data, of shape (N, F, H', W') where H' and W' are given by
    H' = 1 + (H + 2 * pad - HH) / stride
    W' = 1 + (W + 2 * pad - WW) / stride
  - cache: (x, w, b, conv_param)
  """
  out = None
  pad_num = conv_param['pad']
  stride = conv_param['stride']
  N,C,H,W = x.shape
  F,C,HH,WW = w.shape
  H_prime = (H+2*pad_num-HH) // stride + 1
  W_prime = (W+2*pad_num-WW) // stride + 1
  out = np.zeros([N,F,H_prime,W_prime])
  #im2col
  for im_num in range(N):
      im = x[im_num,:,:,:]
      im_pad = np.pad(im,((0,0),(pad_num,pad_num),(pad_num,pad_num)),'constant')
      im_col = im2col(im_pad,HH,WW,stride)
      filter_col = np.reshape(w,(F,-1))
      mul = im_col.dot(filter_col.T) + b
      out[im_num,:,:,:] = col2im(mul,H_prime,W_prime,1)
  cache = (x, w, b, conv_param)
  return out, cache

Python example for backward propagation

def conv_backward_naive(dout, cache):
  """
  A naive implementation of the backward pass for a convolutional layer.

  Inputs:
  - dout: Upstream derivatives.
  - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive

  Returns a tuple of:
  - dx: Gradient with respect to x
  - dw: Gradient with respect to w
  - db: Gradient with respect to b
  """
  dx, dw, db = None, None, None

  x, w, b, conv_param = cache
  pad_num = conv_param['pad']
  stride = conv_param['stride']
  N,C,H,W = x.shape
  F,C,HH,WW = w.shape
  H_prime = (H+2*pad_num-HH) // stride + 1
  W_prime = (W+2*pad_num-WW) // stride + 1

  dw = np.zeros(w.shape)
  dx = np.zeros(x.shape)
  db = np.zeros(b.shape)

  # We could calculate the bias by just summing over the right dimensions
  # Bias gradient (Sum on dout dimensions (batch, rows, cols)
  #db = np.sum(dout, axis=(0, 2, 3))

  for i in range(N):
      im = x[i,:,:,:]
      im_pad = np.pad(im,((0,0),(pad_num,pad_num),(pad_num,pad_num)),'constant')
      im_col = im2col(im_pad,HH,WW,stride)
      filter_col = np.reshape(w,(F,-1)).T

      dout_i = dout[i,:,:,:]
      dbias_sum = np.reshape(dout_i,(F,-1))
      dbias_sum = dbias_sum.T

      #bias_sum = mul + b
      db += np.sum(dbias_sum,axis=0)
      dmul = dbias_sum

      #mul = im_col * filter_col
      dfilter_col = (im_col.T).dot(dmul)
      dim_col = dmul.dot(filter_col.T)

      dx_padded = col2im_back(dim_col,H_prime,W_prime,stride,HH,WW,C)
      dx[i,:,:,:] = dx_padded[:,pad_num:H+pad_num,pad_num:W+pad_num]
      dw += np.reshape(dfilter_col.T,(F,C,HH,WW))
  return dx, dw, db

Im2col and Col2im sources in python

This implementation will receive a image on the format of a 3 dimension tensor [channels, rows, cols] and will create a 2d matrix on the format [rows=(new_h*new_w), cols=(kw*kw*C)] notice that this algorithm will output the transposed version of the diagram above.

def im2col(x,hh,ww,stride):

    """
    Args:
      x: image matrix to be translated into columns, (C,H,W)
      hh: filter height
      ww: filter width
      stride: stride
    Returns:
      col: (new_h*new_w,hh*ww*C) matrix, each column is a cube that will convolve with a filter
            new_h = (H-hh) // stride + 1, new_w = (W-ww) // stride + 1
    """

    c,h,w = x.shape
    new_h = (h-hh) // stride + 1
    new_w = (w-ww) // stride + 1
    col = np.zeros([new_h*new_w,c*hh*ww])

    for i in range(new_h):
       for j in range(new_w):
           patch = x[...,i*stride:i*stride+hh,j*stride:j*stride+ww]
           col[i*new_w+j,:] = np.reshape(patch,-1)
    return col
def col2im(mul,h_prime,w_prime,C):
    """
      Args:
      mul: (h_prime*w_prime*w,F) matrix, each col should be reshaped to C*h_prime*w_prime when C>0, or h_prime*w_prime when C = 0
      h_prime: reshaped filter height
      w_prime: reshaped filter width
      C: reshaped filter channel, if 0, reshape the filter to 2D, Otherwise reshape it to 3D
    Returns:
      if C == 0: (F,h_prime,w_prime) matrix
      Otherwise: (F,C,h_prime,w_prime) matrix
    """
    F = mul.shape[1]
    if(C == 1):
        out = np.zeros([F,h_prime,w_prime])
        for i in range(F):
            col = mul[:,i]
            out[i,:,:] = np.reshape(col,(h_prime,w_prime))
    else:
        out = np.zeros([F,C,h_prime,w_prime])
        for i in range(F):
            col = mul[:,i]
            out[i,:,:] = np.reshape(col,(C,h_prime,w_prime))

    return out
def col2im_back(dim_col,h_prime,w_prime,stride,hh,ww,c):
    """
    Args:
      dim_col: gradients for im_col,(h_prime*w_prime,hh*ww*c)
      h_prime,w_prime: height and width for the feature map
      strid: stride
      hh,ww,c: size of the filters
    Returns:
      dx: Gradients for x, (C,H,W)
    """
    H = (h_prime - 1) * stride + hh
    W = (w_prime - 1) * stride + ww
    dx = np.zeros([c,H,W])
    for i in range(h_prime*w_prime):
        row = dim_col[i,:]
        h_start = (i / w_prime) * stride
        w_start = (i % w_prime) * stride
        dx[:,h_start:h_start+hh,w_start:w_start+ww] += np.reshape(row,(c,hh,ww))
    return dx

Smaller example

To make things simpler on our heads, follow the simple example of convolving X[3x3] with W[2x2]

Last updated