Now we're going to learn how to classify each pixel of an image. The idea is to create a map of all the detected object areas in the image. Basically, what we want is the image below, where every pixel has a label associated with it.
In this chapter we're going to learn how convolutional neural networks (CNNs) can do that job for us.
A Fully Convolutional Network (FCN) is a normal CNN where the last fully connected layer is replaced by another convolution layer with a large "receptive field". The idea is to capture the global context of the scene (telling us what we have in the image and also giving a very rough idea of where things are).
It's important to remember that when we convert our last fully connected (FC) layer to a convolutional layer, we gain some form of localization: we can look at where the activations are strongest.
The idea is that if we choose our new last conv layer to be big enough, this localization effect is scaled up to our input image size.
Here is how we convert a normal CNN used for classification, e.g. AlexNet, into an FCN used for segmentation.
As a reminder, this is what AlexNet looks like:
Below are the parameters for each of the layers in AlexNet:
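In AlexNet the input size is fixed at 224x224 (many implementations use 227x227, which is what makes the 55x55 figure come out exactly), so the strided convolution and pooling layers scale the feature maps down from 224x224 to 55x55, 27x27, 13x13, and finally to a single row vector at the FC layers.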
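The size arithmetic above can be checked with a small sketch. It assumes the usual output-size formula `floor((n + 2*pad - kernel) / stride) + 1` and the layer hyperparameters of the common Caffe-style AlexNet. Those hyperparameters are not listed in the text, so treat them as assumptions. The sketch uses a 227x227 input, because that is the size for which the first layer produces exactly 55x55.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial size after a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad): assumed AlexNet-style hyperparameters.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace_sizes(n, layers=ALEXNET_LAYERS):
    """Return the feature-map side length after every layer."""
    sizes = []
    for name, kernel, stride, pad in layers:
        n = out_size(n, kernel, stride, pad)
        sizes.append((name, n))
    return sizes


for name, n in trace_sizes(227):
    print(f"{name}: {n}x{n}")
```

Under these assumptions the trace goes 55, 27, 27, 13, 13, 13, 13, 6, so the first FC layer sees a 6x6 feature map.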
Now let's look at the steps needed to do the conversion.
1) We start with a normal CNN for classification.
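2) The second step is to convert all the FC layers to convolution layers: the first FC layer becomes a convolution whose kernel covers its whole input feature map, and the remaining ones become 1x1 convolutions. We don't even need to change the weights at this point, only reshape them. (This is already a fully convolutional neural network.) The nice property of FCNs is that we can now use any image size, as long as it is at least as large as the original input.
Observe that with an FCN we can use a different input size H x N. The diagram below shows how a different size would appear.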
3) The last step is to use a "deconv" or "transposed convolution" layer to map the activation positions back to something meaningful relative to the image size. Imagine that we're just scaling the activation map up to the size of the image. This last "upsampling" layer also has some learnable parameters.
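Step 2 above can be sketched in plain Python. The sketch assumes that an FC layer reading a flattened C x kh x kw feature map is rewritten as a convolution with a kh x kw kernel (which reduces to 1x1 when the incoming map is already 1x1). The function names and the toy numbers are made up for illustration.

```python
def fc_forward(feature_map, weights, biases):
    """Classic FC layer: flatten the (C, H, W) feature map, then dot products."""
    flat = [v for channel in feature_map for row in channel for v in row]
    return [sum(w * v for w, v in zip(w_row, flat)) + b
            for w_row, b in zip(weights, biases)]


def fc_as_conv(feature_map, weights, biases, kh, kw):
    """Same weights, read as (C, kh, kw) kernels and slid over the input
    with stride 1 and no padding. Returns one score map per output unit."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for w_row, b in zip(weights, biases):
        score_map = []
        for i in range(height - kh + 1):
            score_row = []
            for j in range(width - kw + 1):
                acc = b
                for c in range(channels):
                    for di in range(kh):
                        for dj in range(kw):
                            w = w_row[(c * kh + di) * kw + dj]
                            acc += w * feature_map[c][i + di][j + dj]
                score_row.append(acc)
            score_map.append(score_row)
        out.append(score_map)
    return out


# Toy FC layer: 3 outputs reading a 2-channel 2x2 feature map (8 inputs).
weights = [[1, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 1, 0],
           [1, 1, 1, 1, 1, 1, 1, 1]]
biases = [0, 1, -1]

small = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(fc_forward(small, weights, biases))        # [1, 10, 35]
print(fc_as_conv(small, weights, biases, 2, 2))  # [[[1]], [[10]], [[35]]]

# A larger 2-channel 3x4 input is now accepted and yields 2x3 score maps.
large = [[[1, 2, 0, 1], [3, 4, 1, 0], [0, 1, 2, 2]],
         [[5, 6, 1, 1], [7, 8, 0, 2], [1, 0, 3, 1]]]
maps = fc_as_conv(large, weights, biases, 2, 2)
print(len(maps), len(maps[0]), len(maps[0][0]))
```

Each entry of the larger score maps is exactly what the original FC layer would output on the corresponding 2x2 crop, which is where the coarse localization comes from.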
Now, with this structure, we just need some "ground truth" to do end-to-end learning, starting from a pre-trained network, e.g. one trained on ImageNet.
The problem with this approach is that we lose some resolution, because the activations were downscaled over many steps.
To solve this problem, we also take activations from earlier layers and sum/interpolate them together. The creators of this algorithm call this process "skip".
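A minimal sketch of one such fusion step, for a single score channel: a coarse score map is upsampled and summed element-wise with a finer map from an earlier layer. Nearest-neighbour upsampling is used here only as a simple stand-in (an assumption of this sketch, not the authors' upsampling), and the function names are hypothetical.

```python
def upsample_nearest(score_map, factor=2):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_skip(coarse, fine, factor=2):
    """Upsample the coarse map and add it element-wise to the finer map."""
    up = upsample_nearest(coarse, factor)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("upsampled coarse map and fine map must match in size")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1, 2],
          [3, 4]]            # low-resolution scores from a deep layer
fine = [[0, 1, 0, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 0, 1, 0]]        # higher-resolution scores from an earlier layer
for row in fuse_skip(coarse, fine):
    print(row)
```

The fused map keeps the coarse, semantically strong scores while the earlier layer contributes detail at a finer spatial scale.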
Even today (2016), the winners of the ImageNet segmentation category used an ensemble of FCNs to win the competition. The upsampling operations used in the skips are also learnable.
Below we show the effects of this "skip" process. Notice how the resolution of the segmentation improves after some "skips".
Another important point is that the loss function we use for image segmentation is still the usual classification loss, multi-class cross entropy, and not something like the L2 loss we would normally use when the output is an image.
This is because, despite what you might think, we are just assigning a class to each of our output pixels, so this is a classification problem.
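As a rough illustration, the loss can be written as an ordinary softmax cross entropy evaluated at every pixel and then averaged. The `[H][W][num_classes]` layout and the function name are assumptions made for this sketch.

```python
import math


def pixelwise_cross_entropy(logits, labels):
    """Mean multi-class cross entropy over all pixels.

    logits: [H][W][num_classes] raw scores
    labels: [H][W] integer class ids
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for scores, label in zip(logit_row, label_row):
            # log(sum(exp(scores))), with the max subtracted for stability
            m = max(scores)
            log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
            # negative log-probability of the correct class at this pixel
            total += log_sum - scores[label]
            count += 1
    return total / count


# 2x2 image, 3 classes, completely uninformative scores: loss is log(3).
uniform = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
labels = [[0, 1], [2, 0]]
print(pixelwise_cross_entropy(uniform, labels))
```

Each pixel contributes exactly the term a single-image classifier would contribute, which is why the classification loss carries over unchanged.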
Basically, the idea of the transposed convolution layer is to scale back up what all the previous layers scaled down.
The name comes from the fact that the upsampling forward propagation is the convolution backpropagation, and the upsampling backpropagation is the convolution forward propagation.
In the Caffe source code this layer is, misleadingly, called "deconvolution".
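The forward/backward relationship can be made concrete in one dimension. In the sketch below (single channel, no padding, hypothetical function names) a strided convolution is written as a matrix `C`. The upsampling layer then computes `C` transposed times its input, which is also what backpropagation through the convolution computes for the gradient with respect to its input.

```python
def conv_matrix(n_in, kernel, stride):
    """Matrix C such that C times x is a 1-D strided convolution of x
    (no padding, cross-correlation as in most deep learning libraries)."""
    k = len(kernel)
    n_out = (n_in - k) // stride + 1
    rows = []
    for o in range(n_out):
        row = [0] * n_in
        for t, w in enumerate(kernel):
            row[o * stride + t] = w
        rows.append(row)
    return rows


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def transposed_conv1d(y, kernel, stride):
    """Upsample y: every input value scatters a scaled copy of the kernel."""
    k = len(kernel)
    out = [0] * ((len(y) - 1) * stride + k)
    for o, value in enumerate(y):
        for t, w in enumerate(kernel):
            out[o * stride + t] += value * w
    return out


kernel, stride = [1, 2, 1], 2
x = [1, 0, 2, 1, 0, 3, 1]                    # length 7
C = conv_matrix(len(x), kernel, stride)

y = matvec(C, x)                             # downscaled: length 3
up = transposed_conv1d(y, kernel, stride)    # upscaled back: length 7

print(y)
print(up)
print(up == matvec(transpose(C), y))         # same result as C transposed times y
```

Note that the upsampled signal has the original length but not the original values: the layer undoes the change of size, not the convolution itself, which is why "deconvolution" is a misleading name.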
There is another thing that we can do to avoid those "skipping" steps and also get a better segmentation. Deconvnet also responds better to objects of different sizes.
This architecture is called "Deconvnet", which is basically another network, but now with all the convolution and pooling layers reversed. As you may suspect, this is heavy: it takes 6 days to train on a TitanX. But the results are really good. Another problem is that the training is done in 2 stages.
Deconvnets also suffer less than FCNs when there are small objects in the scene.
The deconvolution network outputs a probability map with the same size as the input.
Besides the deconvolution layer, we now also need the unpooling layer. The max-pooling operation is non-invertible, but we can approximate its inverse by recording the positions (max location switches) where we found the biggest values during the normal max-pool, and then using these positions to reconstruct the data from the layer above (in this case a deconvolution).
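The switch mechanism can be sketched as follows for a single channel with non-overlapping windows. The function names and the choice to leave all non-maximum positions at zero are assumptions of this sketch.

```python
def max_pool_with_switches(x, size=2):
    """Non-overlapping max-pooling that also records where each maximum was."""
    pooled, switches = [], []
    for i in range(0, len(x) - size + 1, size):
        pooled_row, switch_row = [], []
        for j in range(0, len(x[0]) - size + 1, size):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(size) for dj in range(size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Put every value back at its recorded position; the rest stays zero."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 9, 2],
     [6, 2, 3, 1]]
pooled, switches = max_pool_with_switches(x)
print(pooled)                       # [[4, 5], [6, 9]]
for row in unpool(pooled, switches, 4, 4):
    print(row)
```

The unpooled map is sparse: only the recorded maxima return to their original locations, and the following deconvolution layers are what densify it again.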
For me, the main issue with Deconvnets is that they need to be trained in 2 stages:
First stage with easy examples (single, centered objects).
Second stage: fine-tune with difficult examples.
In the next chapter we will discuss some libraries that support deep learning.