Object Localization and Detection


In this chapter we're going to learn how to use convolutional neural networks to localize and detect objects in images.
  • RCNN
  • Fast RCNN
  • Faster RCNN
  • Yolo
  • SSD

Localize objects with regression

Regression is about returning a number instead of a class. In our case we're going to return 4 numbers (x0, y0, width, height) that describe a bounding box. You train this system with an image and a ground-truth bounding box, and use the L2 distance to calculate the loss between the predicted bounding box and the ground truth.
Normally what you do is attach another fully connected layer to the last convolution layer.
This works for only one object at a time. Some people attach the regression part after the last convolution layer (Overfeat), while others attach it after the fully connected layer (RCNN). Both work.
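As a minimal sketch of the loss described above, the snippet below computes the squared L2 distance between a predicted box and a ground-truth box, both given as (x0, y0, width, height). The function name `l2_box_loss` and the choice of the squared (rather than square-rooted) distance are assumptions made for illustration.

```python
import numpy as np

def l2_box_loss(pred, target):
    """Squared L2 distance between two boxes given as (x0, y0, width, height)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.sum((pred - target) ** 2))

# Hypothetical prediction and ground truth, in pixels.
predicted = [48, 30, 100, 80]
ground_truth = [50, 32, 96, 84]
loss = l2_box_loss(predicted, ground_truth)
```

The loss is zero only when all four numbers match, so minimizing it pulls both the position and the size of the predicted box toward the ground truth.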

Comparing bounding box prediction accuracy

Basically we need to check whether the Intersection over Union (IoU) between the prediction and the ground truth is bigger than some threshold (e.g. > 0.5).
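The check above can be sketched in a few lines of plain Python. The boxes use the same (x0, y0, width, height) layout as the regression section; the function names `iou` and `is_correct` are hypothetical and chosen only for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, width, height)."""
    ax0, ay0, aw, ah = box_a
    bx0, by0, bw, bh = box_b
    # Corners of the overlapping rectangle (if any).
    ix0 = max(ax0, bx0)
    iy0 = max(ay0, by0)
    ix1 = min(ax0 + aw, bx0 + bw)
    iy1 = min(ay0 + ah, by0 + bh)
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred, truth, threshold=0.5):
    """A prediction counts as correct when its IoU exceeds the threshold."""
    return iou(pred, truth) > threshold
```

For example, two 10x10 boxes shifted by 5 pixels horizontally overlap on half their area, which gives an IoU of 50 / 150, so that prediction would be rejected at a 0.5 threshold.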


RCNN (Regions + CNN) is a method that relies on an external region proposal system.
The problem with RCNN is that it was never designed to be fast. For instance, the steps to train the network are these:
  1. Take a pre-trained ImageNet CNN (e.g. AlexNet)
  2. Re-train the last fully connected layer with the objects that need to be detected + a "no-object" class
  3. Get all proposals (~2000 per image), resize them to match the CNN input, then save them to disk
  4. Train SVMs to classify between object and background (one binary SVM for each class)
  5. BB regression: train a linear regression model that outputs a correction factor for the bounding box

Step 3 (Save and pre-process proposals)

Step 5 (Adjust bounding box)
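To make the "correction factor" of step 5 concrete, here is a small sketch of one common way to express it. It assumes the parameterization used in the original R-CNN paper (offsets relative to the proposal size, log-scaled width and height), which this text does not spell out, and it assumes boxes given as (center_x, center_y, width, height) rather than the corner-based layout used earlier. The function names are hypothetical.

```python
import math

def regression_targets(proposal, truth):
    """Correction that maps a proposal onto the ground truth.

    Both boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = truth
    tx = (gx - px) / pw      # horizontal shift, in units of proposal width
    ty = (gy - py) / ph      # vertical shift, in units of proposal height
    tw = math.log(gw / pw)   # log of the width ratio
    th = math.log(gh / ph)   # log of the height ratio
    return (tx, ty, tw, th)

def apply_correction(proposal, t):
    """Apply a (predicted) correction to a proposal to get the adjusted box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

During training the regressor learns to predict the output of `regression_targets` from the CNN features of the proposal; at test time `apply_correction` turns that prediction back into a box.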


The Fast RCNN method receives region proposals from some external system (Selective search). These proposals are sent to a layer (Roi Pooling) that resizes all regions, with their data, to a fixed size. This step is needed because the fully connected layer expects all its input vectors to have the same size.
Proposals example, boxes=[r, x1, y1, x2, y2]
Fast RCNN still depends on some external system to give the region proposals (Selective search).

Roi Pooling layer

It's a type of max-pooling whose pool size depends on the input, so that the output always has the same size. This is done because the fully connected layer always expects the same input size.
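The idea can be sketched with NumPy for a single-channel feature map. This is a simplified illustration, not the layer used in any particular framework: it assumes the region is already expressed in feature-map cells as (x1, y1, x2, y2) with inclusive corners, and the function name `roi_max_pool` and the 2x2 default output size are made up for the example.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2D feature map to a fixed (out_h, out_w) grid.

    feature_map: array of shape (H, W)
    roi: (x1, y1, x2, y2), inclusive, in feature-map cells
    """
    x1, y1, x2, y2 = (int(v) for v in roi)
    region = feature_map[y1:y2 + 1, x1:x2 + 1]
    h, w = region.shape
    # Integer bin edges: the bin size depends on the region size.
    ys = [(i * h) // out_h for i in range(out_h + 1)]
    xs = [(j * w) // out_w for j in range(out_w + 1)]
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Every bin covers at least one cell, even for tiny regions.
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            out[i, j] = region[ys[i]:y_end, xs[j]:x_end].max()
    return out

fmap = np.arange(16).reshape(4, 4)
pooled_full = roi_max_pool(fmap, (0, 0, 3, 3))   # 4x4 region -> 2x2
pooled_small = roi_max_pool(fmap, (1, 0, 3, 2))  # 3x3 region -> 2x2
```

Both regions have different sizes but produce a 2x2 output, which is what lets the fully connected layer that follows accept any proposal.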
The inputs of the Roi layer are the proposals and the activations of the last convolution layer. For example, consider the following input image and its proposals.
Input image
Two proposed regions
Now the activations on the last convolution layer (ex: conv5)
For each convolution activation (each cell from the image above), the Roi Pooling layer resizes the region proposals (in red) to the resolution expected by the fully connected layer. For example, consider the selected cell in green.
Here the output will be:

Faster RCNN

The main idea is to use the last (or deep) conv layers to infer region proposals. Faster RCNN consists of two modules.
  • RPN (Region Proposal Network): gives a set of rectangles based on a deep convolution layer
  • Fast RCNN Roi Pooling layer: classifies each proposal and refines the proposal location

Region proposal Network

Here we break down how Faster RCNN works as a block diagram.
  1. Get a trained (e.g. on ImageNet) convolutional neural network
  2. Get feature maps from the last (or a deep) convolution layer
  3. Train a region proposal network that decides whether there is an object or not in the image, and also proposes a box location
  4. Give the results to a custom (Python) layer
  5. Give the proposals to a ROI pooling layer (like Fast RCNN)
  6. After all proposals are reshaped to a fixed size, send them to a fully connected layer to continue the classification

How it works

Basically the RPN slides a small window (3x3) over the feature map, classifies what is under the window as object or not object, and also gives a bounding box location. For every sliding-window center it creates a fixed number k of anchor boxes, and classifies those boxes as being an object or not.
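The anchor boxes can be sketched as follows. The text above only says that k anchors are created per window center; the 3 scales and 3 aspect ratios used here (giving k = 9) are the defaults reported in the Faster R-CNN paper, and the stride of 16 pixels per feature-map cell is an assumption that matches a VGG16-style backbone. The function names are hypothetical.

```python
import numpy as np

def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (x1, y1, x2, y2) rows."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width
            # Keep the area close to scale**2 while changing the shape.
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)

def all_anchors(feat_h, feat_w, stride=16):
    """Anchors for every sliding-window position on the feature map."""
    centers = [(x * stride + stride / 2, y * stride + stride / 2)
               for y in range(feat_h) for x in range(feat_w)]
    return np.concatenate([make_anchors(cx, cy) for cx, cy in centers])
```

The RPN then outputs, for each of these anchors, an object / not-object score and a box correction relative to the anchor.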

Faster RCNN training

In the paper, each network was trained separately, but we can also train them jointly. Just consider the model as having 4 losses.
  • RPN Classification (Object or not object)
  • RPN Bounding box proposal
  • Fast RCNN Classification (Normal object classification)
  • Fast RCNN Bounding-box regression (Improve previous BB proposal)
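A rough sketch of how the four losses above could be combined for joint training is shown below. The text does not say which loss functions or weights are used, so this example assumes cross-entropy for the two classification terms, a smooth L1 loss for the two box terms (as in the Fast R-CNN paper), and an unweighted sum. All names and numbers are illustrative.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return float(-np.log(probs[label]))

def smooth_l1(pred, target):
    """Quadratic for small errors, linear for large ones, summed over coordinates."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return float(np.sum(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)))

def joint_loss(rpn_cls, rpn_box, rcnn_cls, rcnn_box):
    """Unweighted sum of the four losses (equal weighting is an assumption)."""
    return rpn_cls + rpn_box + rcnn_cls + rcnn_box

# Hypothetical network outputs for a single anchor and a single proposal.
total = joint_loss(
    cross_entropy(np.array([0.2, 0.8]), 1),             # RPN: object or not
    smooth_l1([0.1, 0.2, 0.0, 0.3], [0, 0, 0, 0]),      # RPN: box proposal
    cross_entropy(np.array([0.1, 0.7, 0.2]), 1),        # Fast RCNN: object class
    smooth_l1([0.5, 3.0, 0.0, 0.0], [0, 0, 0, 0]),      # Fast RCNN: box refinement
)
```

Because the total is a single scalar, one backward pass updates the shared convolution layers with gradients from both the RPN and the Fast RCNN heads.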

Faster RCNN results

At the time of writing, the best results come from Faster RCNN using a ResNet-101 network.

Complete Faster RCNN diagram

This diagram represents the complete structure of Faster RCNN using VGG16, which I found in a GitHub project here. It uses a framework called Chainer, which is a complete framework written only in Python (sometimes Cython).

Next Chapter

In the next chapter we will discuss a different type of object detector: the single shot detector.