A CNN is composed of layers that filter (convolve) their inputs to extract useful information. These convolutional layers have parameters (kernels) that are learned, so the filters are adjusted automatically to extract the most useful information for the task at hand, without manual feature selection. CNNs work better with images; ordinary neural networks do not fit image classification problems well.

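With an ordinary neural network, we need to flatten the image into a single 1D vector $[1,(width \cdot height \cdot channels)]$ and then send this data to a fully connected hidden layer. In this scenario each neuron needs one weight per input value, that is $width \cdot height \cdot channels$ parameters per neuron (3072 for a 32x32x3 image), so the parameter count grows very quickly with image size.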

The pattern [CONV->ReLU->Pool->CONV->ReLU->Pool->FC->Softmax_loss (during training)] is quite common.

The most important operation in a convolutional neural network is the convolution layer. Imagine a 32x32x3 image: if we convolve it with a 5x5x3 filter (the filter depth must match the input depth), the result is a 28x28x1 activation map.

The filter looks for one particular pattern over the whole image, so a single filter is enough to search every location for that pattern.

Now consider that we want our convolution layer to look for 6 different things. In this case the layer will have 6 5x5x3 filters, each one looking for a particular pattern in the image, and the output will be a 28x28x6 volume.

By the way, convolution by itself is a linear operation. If we don't want to suffer from the same problem as linear classifiers, we need to add a non-linear layer (normally a ReLU) after the convolution layer.
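The ideas above (one activation map per filter, followed by a non-linearity) can be sketched in plain Python. This is only an illustrative, unoptimized sketch: the function names `conv2d` and `relu`, the random image and the random filter values are assumptions made for the example, and, as in most deep learning libraries, the kernel is not flipped.

```python
import random

def conv2d(image, filters, biases):
    """Naive convolution with stride 1 and no padding.

    image:   H x W x C nested lists
    filters: list of K x K x C nested lists (same depth C as the image)
    returns: H_out x W_out x num_filters nested lists
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0])
    h_out, w_out = h - k + 1, w - k + 1
    out = [[[0.0] * len(filters) for _ in range(w_out)] for _ in range(h_out)]
    for f, (kernel, bias) in enumerate(zip(filters, biases)):
        for y in range(h_out):
            for x in range(w_out):
                acc = bias
                # Dot product between the filter and the window under it
                for dy in range(k):
                    for dx in range(k):
                        for ch in range(c):
                            acc += image[y + dy][x + dx][ch] * kernel[dy][dx][ch]
                out[y][x][f] = acc
    return out

def relu(volume):
    """Element-wise non-linearity applied after the (linear) convolution."""
    return [[[max(0.0, v) for v in px] for px in row] for row in volume]

random.seed(0)
# Hypothetical 32x32x3 image and 6 random 5x5x3 filters
image = [[[random.random() for _ in range(3)] for _ in range(32)] for _ in range(32)]
filters = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
            for _ in range(5)] for _ in range(6)]
biases = [0.0] * 6

activation = relu(conv2d(image, filters, biases))
print(len(activation), len(activation[0]), len(activation[0][0]))  # 28 28 6
```

Each of the 6 filters produces its own 28x28 activation map, and stacking them gives the 28x28x6 output volume.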

Another important point of using convolution for pattern matching is that the position of the thing we are searching for in the image is irrelevant. A fully connected neural network, in contrast, learns an object at the exact location where it appeared during training.

These are the parameters used to configure a convolution layer:

Kernel size (K): Small is better (but on the first layer a small kernel takes a lot of memory)

Stride (S): How many pixels the kernel window slides at each step (normally 1 in conv layers and 2 in pooling layers)

Zero padding (pad): Put zeros on the image border so that the conv output size can be the same as the input size (with stride 1: K=1, pad=0; K=3, pad=1; K=5, pad=2; K=7, pad=3)

Number of filters (F): Number of patterns that the conv layer will look for.

Without padding, the convolution output is smaller than the input whenever the kernel is larger than 1x1. To avoid this behaviour we need to use padding. To calculate the convolution output (activation map) size we use these formulas:

$\Large H_{out}=1+\frac{H_{in}+(2 \cdot pad)-K_{height}}{S}$

$\Large W_{out}=1+\frac{W_{in}+(2 \cdot pad)-K_{width}}{S}$
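The output size formula can be written as a small helper. This is a minimal sketch: the function name `conv_output_size` is hypothetical, and rounding down when the stride does not divide evenly is an assumption (the formula above does not say how to round).

```python
def conv_output_size(size_in, kernel, stride=1, pad=0):
    """Output size along one spatial dimension (height or width).

    Implements 1 + (size_in + 2*pad - kernel) / stride.
    Floor division is an assumption for strides that do not divide evenly.
    """
    return 1 + (size_in + 2 * pad - kernel) // stride

print(conv_output_size(32, kernel=5))            # 28
print(conv_output_size(32, kernel=5, pad=2))     # 32
print(conv_output_size(32, kernel=2, stride=2))  # 16
```

The second call shows that pad=2 keeps a 5x5 convolution at the input size, and the third corresponds to a typical 2x2 pooling with stride 2.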

Here we will see some examples of the convolution window sliding over the input image while we change some of its hyperparameters.

Here we have a 4x4 input convolved with a 3x3 filter (K=3) with stride S=1 and padding pad=0, which gives a 2x2 output.

Now we have a 5x5 input convolved with a 3x3 filter (K=3) with stride S=1 and padding pad=1, so the output is also 5x5. Some libraries have a feature that always calculates the right amount of padding to keep the output spatial dimensions the "same" as the input dimensions.
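The two sliding-window cases can be reproduced with a short single-channel sketch. The helper names `zero_pad` and `slide_window`, the all-ones kernel and the input values are assumptions chosen only to make the window positions easy to follow.

```python
def zero_pad(image, pad):
    """Surround a 2D list with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded

def slide_window(image, kernel, stride=1, pad=0):
    """Slide a K x K kernel over a single-channel image."""
    image = zero_pad(image, pad)
    k = len(kernel)
    h_out = 1 + (len(image) - k) // stride
    w_out = 1 + (len(image[0]) - k) // stride
    output = []
    for y in range(h_out):
        row = []
        for x in range(w_out):
            top, left = y * stride, x * stride  # window position on the input
            row.append(sum(image[top + dy][left + dx] * kernel[dy][dx]
                           for dy in range(k) for dx in range(k)))
        output.append(row)
    return output

ones_3x3 = [[1] * 3 for _ in range(3)]

# 4x4 input, K=3, S=1, pad=0 -> 2x2 output
input_4x4 = [[4 * r + c for c in range(4)] for r in range(4)]
out_a = slide_window(input_4x4, ones_3x3)
print(out_a)  # [[45, 54], [81, 90]]

# 5x5 input, K=3, S=1, pad=1 -> 5x5 output ("same" size as the input)
input_5x5 = [[1] * 5 for _ in range(5)]
out_b = slide_window(input_5x5, ones_3x3, pad=1)
print(len(out_b), len(out_b[0]))  # 5 5
```

In the padded case the corner outputs are smaller than the centre ones because part of the window covers zeros.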

Here we show how to calculate the number of parameters used by one convolution layer. We will illustrate with a simple example:

Input: 32x32x3 (a 32x32 RGB image)

CONV: Kernel (K): 5x5, Stride: 1, Pad: 2, numFilters: 10

$\Large num_{parameters}=((K \cdot K \cdot depth_{input})+1) \cdot num_{filters}$

$\Large \therefore num_{parameters}=((5 \cdot 5 \cdot 3)+1) \cdot 10=760$

The "+1" is the bias of each filter; you can omit it to simplify calculations.
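The same calculation as a small function, assuming a square kernel; the name `conv_num_parameters` and the `use_bias` flag are hypothetical and only serve the example.

```python
def conv_num_parameters(kernel, depth_in, num_filters, use_bias=True):
    """Number of learnable parameters of one convolution layer."""
    weights_per_filter = kernel * kernel * depth_in
    bias_per_filter = 1 if use_bias else 0
    return (weights_per_filter + bias_per_filter) * num_filters

print(conv_num_parameters(kernel=5, depth_in=3, num_filters=10))                  # 760
print(conv_num_parameters(kernel=5, depth_in=3, num_filters=10, use_bias=False))  # 750
```

Note that the count does not depend on the image width or height, only on the kernel size, the input depth and the number of filters.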

Here we show how to calculate the amount of memory needed by the convolution layer.

Input: 32x32x3 (a 32x32 RGB image)

CONV: Kernel (K): 5x5, Stride: 1, Pad: 2, numFilters: 10

As we use padding, our output volume will be 32x32x10, so the output holds 32x32x10 = 10240 values, which is 10240 bytes at one byte per value (or 40960 bytes if each value is a 32-bit float).

So the amount of memory is basically just the product of the dimensions of the output volume, which is a 4D tensor.

$mem = N_{batch} \cdot C_{depth} \cdot H_{out} \cdot W_{out}$

Where:

$N_{batch}$: Output batch size

$C_{depth}$: The output volume depth or, in the case of convolution, the number of filters

$H_{out}$: The height of the output activation map

$W_{out}$: The width of the output activation map
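As a rough sketch, the memory formula can be evaluated as follows. The function name `activation_memory` and the `bytes_per_value` parameter are assumptions added for illustration; the formula itself only counts values.

```python
def activation_memory(n_batch, c_depth, h_out, w_out, bytes_per_value=1):
    """Return (number of values, size in bytes) of an output volume."""
    num_values = n_batch * c_depth * h_out * w_out
    return num_values, num_values * bytes_per_value

# One 32x32 image through a conv layer with 10 filters and "same" padding
print(activation_memory(1, 10, 32, 32))                     # (10240, 10240)
# Same output stored as 32-bit floats (4 bytes per value)
print(activation_memory(1, 10, 32, 32, bytes_per_value=4))  # (10240, 40960)
```

The result scales linearly with the batch size, so a batch of 64 images needs 64 times as much memory for this layer's output.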

This type of convolution (1x1) is normally used to adapt depths, by merging them, without changing the spatial information.
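A minimal sketch of this depth-adapting idea, assuming a 1x1 kernel without bias: every pixel's channel vector is mixed by the same weights, so the height and width stay untouched. The function name `conv1x1` and the numbers are made up for the example.

```python
def conv1x1(volume, weights):
    """Apply a 1x1 convolution (no bias).

    volume:  H x W x C_in nested lists
    weights: C_out x C_in nested lists, one row per output channel
    returns: H x W x C_out nested lists
    """
    return [[[sum(w * v for w, v in zip(w_row, pixel)) for w_row in weights]
             for pixel in row]
            for row in volume]

# 2x2 volume with depth 3, reduced to depth 2
volume = [[[1, 2, 3], [4, 5, 6]],
          [[7, 8, 9], [10, 11, 12]]]
weights = [[1, 0, 0],   # output channel 0 copies input channel 0
           [1, 1, 1]]   # output channel 1 sums all input channels

merged = conv1x1(volume, weights)
print(merged)  # [[[1, 6], [4, 15]], [[7, 24], [10, 33]]]
```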

Here we explain the effect of cascading several small convolutions. In the diagram below we have 2 3x3 convolution layers. Starting from the second layer on the right, one neuron in the second layer has a 3x3 receptive field on the first layer, and each neuron in the first layer has a 3x3 receptive field on the input, so together they cover a 5x5 receptive field on the input. In simpler words, cascaded small convolutions can be used to represent bigger ones.

The trend in recent successful models is to use smaller convolutions; for example, a 7x7 convolution can be substituted with 3 3x3 convolutions with the same depth. This substitution cannot be done on the first conv layer due to the depth mismatch between the first conv layer and the input depth (unless your first layer has only 3 filters).

In the diagram above we substitute one 7x7 convolution with 3 3x3 convolutions. Observe that between them we have ReLU layers, so we have more non-linearities. We also have fewer weights and multiply-add operations, so it will be faster to compute.

Imagine a 7x7 convolution with C filters applied to an input volume WxHxC. We can calculate the number of weights as: $\large numWeights_{7x7}=C \cdot (7 \cdot 7 \cdot C)=49 \cdot C^2$

Now if we use 3 3x3 convolutions with C filters each, every layer has $\large numWeights_{3x3}=C \cdot (3 \cdot 3 \cdot C)=9 \cdot C^2$ weights, so the 3 layers together have $\large 3 \cdot (9 \cdot C^2)=27 \cdot C^2$

We still have fewer parameters. As we need to use ReLU between the layers to break the linearity (otherwise the cascaded conv layers would collapse into a single linear layer), we get more non-linearity, fewer parameters, and better performance.
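To make the comparison concrete, here is a small sketch that evaluates both weight counts (biases ignored, as in the formulas). The function names and the choice of C=64 are assumptions for illustration only.

```python
def weights_single_conv(kernel, channels):
    """Weights of one K x K conv with `channels` inputs and `channels` filters."""
    return channels * (kernel * kernel * channels)

def weights_stacked_conv(kernel, channels, num_layers):
    """Weights of `num_layers` identical K x K conv layers in cascade."""
    return num_layers * weights_single_conv(kernel, channels)

C = 64  # example depth
one_7x7 = weights_single_conv(7, C)
three_3x3 = weights_stacked_conv(3, C, num_layers=3)
print(one_7x7)    # 200704  (49 * C^2)
print(three_3x3)  # 110592  (27 * C^2)
```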

As mentioned before, we cannot substitute large convolutions on the first layer. In fact, small convolutions on the first layer cause an explosion in memory consumption. To illustrate the problem, let's compare a first layer that is 3x3 with 64 filters and stride 1 against a first layer with the same depth that is 7x7 with stride 2, for a 256x256x3 image. $mem_{3x3}=256 \cdot 256 \cdot 64=4194304$ values, about 4 MB at one byte per value.
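A rough sketch of this comparison is shown here. The text does not state the padding of either layer, so pad=1 for the 3x3 layer and pad=3 for the 7x7 layer are assumptions, as are the helper names and the floor rounding.

```python
def conv_output_size(size_in, kernel, stride, pad):
    # Floor division is an assumption when the stride does not divide evenly
    return 1 + (size_in + 2 * pad - kernel) // stride

def first_layer_activations(image_size, kernel, stride, pad, num_filters):
    """Number of output values of a conv layer applied to a square image."""
    out = conv_output_size(image_size, kernel, stride, pad)
    return out * out * num_filters

# 3x3, stride 1, assumed pad=1 -> 256x256x64 output
mem_3x3 = first_layer_activations(256, kernel=3, stride=1, pad=1, num_filters=64)
# 7x7, stride 2, assumed pad=3 -> 128x128x64 output
mem_7x7 = first_layer_activations(256, kernel=7, stride=2, pad=3, num_filters=64)

print(mem_3x3)             # 4194304
print(mem_7x7)             # 1048576
print(mem_3x3 // mem_7x7)  # 4
```

Under these assumptions the stride of 2 halves both spatial dimensions, so the larger kernel ends up with a quarter of the activations.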

TODO: How the stride and convolution size affect the memory consumption

It is also possible to simplify the 3x3 convolution with a mechanism called a bottleneck. This again has the same representation as a normal 3x3 convolution, but with fewer parameters and more non-linearities.

Observe that the substitution is made on the 3x3 convolution that has the same depth as the previous layer (in this case 50x50x64).

Here we calculate how many parameters the bottleneck uses. Remember that a 3x3 convolution uses $9 \cdot C^2$, while the bottleneck uses $3.25 \cdot C^2$, which is less.
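The $3.25 \cdot C^2$ figure can be reproduced with the sketch below. The exact bottleneck layout is not spelled out in the text, so the structure used here (a 1x1 conv reducing the depth to C/2, a 3x3 conv at depth C/2, and a 1x1 conv restoring depth C) is an assumption chosen because it yields that count; biases are ignored and the function names are hypothetical.

```python
def plain_3x3_weights(channels):
    """Weights of a single 3x3 conv that keeps the depth at `channels`."""
    return 3 * 3 * channels * channels

def bottleneck_weights(channels):
    """Weights of an assumed 1x1 -> 3x3 -> 1x1 bottleneck (depth halved inside)."""
    reduced = channels // 2
    squeeze = channels * reduced        # 1x1 conv: C   -> C/2
    middle = 3 * 3 * reduced * reduced  # 3x3 conv: C/2 -> C/2
    expand = reduced * channels         # 1x1 conv: C/2 -> C
    return squeeze + middle + expand

C = 64  # example depth
print(plain_3x3_weights(C))           # 36864  (9 * C^2)
print(bottleneck_weights(C))          # 13312  (3.25 * C^2)
print(bottleneck_weights(C) / C**2)   # 3.25
```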

The bottleneck is also used in Microsoft's residual network (ResNet).

Another option to break up 3x3xC convolutions is to use a 1x3xC convolution followed by a 3x1xC convolution; this has been used in the residual GoogLeNet Inception layer.

It's possible to convert fully connected layers to convolution layers and vice versa, but we are more interested in the FC->Conv conversion. This is done to improve performance. For example, imagine an FC layer with output K=4096 and input 7x7x512; the conversion would be: CONV: Kernel: 7x7, Pad: 0, Stride: 1, numFilters: 4096. Using the 2D convolution size formula (here F is the kernel size and P the padding): $outputSize_W=(W - F + 2P)/S + 1 = (7-7+2 \cdot 0)/1 + 1 = 1$, so the output will be 1x1x4096.

In summary, what you gain by converting the FC layers to convolution:

Performance: It's faster to compute due to the weight sharing

You can use images larger than the ones you trained on, without changing anything

You will be able to detect 2 objects in the same image: if you use a bigger image, your final output will be bigger than a single row vector.
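The FC->Conv conversion can be checked on a toy example. This is only a sketch with tiny made-up sizes (a 2x2x2 input and 3 outputs instead of 7x7x512 and 4096); the function names are hypothetical. Each row of the FC weight matrix is treated as one kernel-sized filter.

```python
def flatten(volume):
    """Flatten an H x W x C volume in row, column, channel order."""
    return [v for row in volume for pixel in row for v in pixel]

def fc_forward(x_flat, weights, biases):
    """Fully connected layer: one weight row and one bias per output."""
    return [sum(w * v for w, v in zip(w_row, x_flat)) + b
            for w_row, b in zip(weights, biases)]

def fc_as_conv(volume, weights, biases, kernel):
    """Apply the FC weights as `kernel` x `kernel` filters (stride 1, pad 0)."""
    h, w, c = len(volume), len(volume[0]), len(volume[0][0])
    output = []
    for y in range(h - kernel + 1):
        out_row = []
        for x in range(w - kernel + 1):
            # Window values in the same order used by flatten()
            patch = [volume[y + dy][x + dx][ch]
                     for dy in range(kernel) for dx in range(kernel)
                     for ch in range(c)]
            out_row.append(fc_forward(patch, weights, biases))
        output.append(out_row)
    return output

weights = [[1, 1, 1, 1, 1, 1, 1, 1],
           [1, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 1, 0, 1, 0, 1]]
biases = [0, 1, -1]

# Input with the training size (2x2x2): the conv output is 1x1x3
small = [[[1, 2], [3, 4]],
         [[5, 6], [7, 8]]]
fc_out = fc_forward(flatten(small), weights, biases)
conv_out = fc_as_conv(small, weights, biases, kernel=2)
print(fc_out)    # [36, 2, 19]
print(conv_out)  # [[[36, 2, 19]]]

# Larger input (3x3x2): the same weights now give a 2x2x3 output map
large = [[[y + x, 1] for x in range(3)] for y in range(3)]
large_out = fc_as_conv(large, weights, biases, kernel=2)
print(len(large_out), len(large_out[0]), len(large_out[0][0]))  # 2 2 3
```

With the larger input, every position of the 2x2 output map is the FC layer evaluated on a different window of the image, which is why the converted network can respond to objects at several locations.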

The receptive field is basically how much a particular convolution window "sees" of its input tensor.

Sometimes it might be useful to know exactly how much each cell from a particular layer "sees" of the input image. This is particularly important in object detection systems, because we need to somehow match activation map dimensions back to the original image size (label image).

$\Large R_k = R_{k-1} + ((\text{Kernel}_k - 1) * \prod_{i=1}^{k-1}s_i)$

Where:

$R_k$: Receptive field of the current layer k

$\text{Kernel}_k$: Kernel size of the current layer k

$s_i$: Stride of layer i

$\prod_{i=1}^{k-1}s_i$ Product of all strides up to the layer k-1 (All previous layers, and not the current)

One point to pay attention to:

For the first layer (only) the receptive field is the kernel size.

These calculations are independent of the type of layer (CONV, POOL); for example, a CONV with stride 2 will have the same receptive field as a POOL with the same kernel size and stride 2.

Example:

Given a 14x14x3 image passed through the following layers:

CONV: S:1, P:0, K:3

CONV: S:1, P:0, K:3

MaxPool: S:2, P:0, K:2

CONV: S:1, P:0, K:3
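The receptive field of each layer in this example can be computed by applying the formula layer by layer. This is a minimal sketch; the function name `receptive_fields` and the (kernel, stride) tuple layout are assumptions. Starting from a receptive field of 1 (a single input pixel) makes the first layer's result equal to its kernel size, as required.

```python
def receptive_fields(layers):
    """Receptive field on the input image after each layer.

    layers: list of (kernel, stride) tuples ordered from input to output.
    Padding does not affect the size of the receptive field.
    """
    fields = []
    field = 1           # a single input pixel "sees" only itself
    stride_product = 1  # product of the strides of all previous layers
    for kernel, stride in layers:
        field = field + (kernel - 1) * stride_product
        fields.append(field)
        stride_product *= stride
    return fields

layers = [
    (3, 1),  # CONV:    K:3, S:1
    (3, 1),  # CONV:    K:3, S:1
    (2, 2),  # MaxPool: K:2, S:2
    (3, 1),  # CONV:    K:3, S:1
]
print(receptive_fields(layers))  # [3, 5, 6, 10]
```

So each cell of the last CONV layer sees a 10x10 region of the 14x14 input image.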