# Yolo

This detector is slightly less accurate than some other detectors (this improved in v2), but it is very fast. This chapter explains how it works and gives reference working code in TensorFlow.

## Main idea

The idea of this detector is that you run the image through a CNN model and get the detections in a single pass. First the image is resized to 448x448, then it is fed to the network, and finally the output is filtered by a non-max suppression algorithm.

## Yolo model

The tiny version is composed of 9 convolution layers with leaky ReLU activations. Observe that after maxpool6 the 448x448 input image has been reduced to a 7x7 feature map.

For each of the 7x7 cells the network outputs:

2 box definitions (each consisting of x, y, width, height and an "is object" confidence)

20 class probabilities (only considered if the "is object" confidence is high)

$Tensor = S \times S \times (B \times 5 + C)$

Where:

S: Tensor spatial dimension (7 in this case)

B: Number of bounding boxes per cell, each described by (x, y, w, h, confidence)

C: Number of classes
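
As a quick check of the formula, the sketch below plugs in the values used in this chapter (S=7, B=2, C=20) and reshapes a flat output vector into the grid. The variable names and the all-zeros `flat` vector are placeholders chosen for illustration, not part of any reference implementation.

```python
import numpy as np

S, B, C = 7, 2, 20          # grid size, boxes per cell, classes (values from the text)
depth = B * 5 + C           # 5 numbers per box (x, y, w, h, confidence) plus class scores
n_outputs = S * S * depth   # total number of values predicted for one image

# Hypothetical flat network output, reshaped into the S x S grid.
flat = np.zeros(n_outputs, dtype=np.float32)
grid = flat.reshape(S, S, depth)

print(depth, n_outputs, grid.shape)  # 30 1470 (7, 7, 30)
```

So each of the 49 cells carries 30 numbers, 1470 values in total.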

$confidence = P_{object} \times IoU(pred, gt)$

Here "is object", or $P_{object}$, is the probability that a box contains any object (as opposed to background). If during training a particular cell is not over any object, we set "is object" to zero.

## What this 7x7 tensor represents

This 7x7 tensor can be considered as a 7x7 grid representing the input image, where each cell of this tensor will hold the 2 box definitions and 20 class probabilities.

Note that the class probabilities are predicted per cell, not per box: each cell has a single set of 20 class probabilities, shared by its 2 bounding boxes.

Combining these class probabilities with the "is object" confidence of each bounding box is what lets us decide the class of a detected object.

The logic is that if there is an object in a cell, we decide which object it is by taking the largest class probability of that cell.
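
The logic above can be sketched for a single cell as follows. The memory layout of the 30 values (2 boxes of 5 numbers first, then the 20 class probabilities) and the name `decode_cell` are assumptions made for this illustration; a real implementation may order the values differently.

```python
import numpy as np

B, C = 2, 20

def decode_cell(cell):
    """Return (box, class_id, score) for one 30-value cell.

    Assumed (hypothetical) layout: B boxes of (x, y, w, h, confidence),
    followed by C class probabilities.
    """
    boxes = cell[:B * 5].reshape(B, 5)
    class_probs = cell[B * 5:]
    best = int(np.argmax(boxes[:, 4]))       # box with the highest "is object" confidence
    class_id = int(np.argmax(class_probs))   # biggest class probability of the cell
    score = float(boxes[best, 4] * class_probs[class_id])
    return boxes[best, :4], class_id, score

cell = np.zeros(B * 5 + C)
cell[0:5] = [0.5, 0.5, 0.2, 0.3, 0.1]    # box 0, low confidence
cell[5:10] = [0.4, 0.6, 0.5, 0.5, 0.9]   # box 1, high confidence
cell[B * 5 + 7] = 0.8                    # class 7 is the most likely
box, class_id, score = decode_cell(cell)
```

Here the second box wins because of its higher confidence, and the cell is labelled with class 7.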

## Filtering results

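At prediction time the model outputs 2 candidate boxes for every one of the 7x7 cells, most of them with a low confidence. Boxes whose confidence is too low are discarded, and the remaining overlapping boxes are reduced with non-max suppression.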
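
A minimal sketch of the thresholding part of this filtering, applied to the whole grid at once. It assumes the same hypothetical layout per cell (2 boxes of 5 numbers, then 20 class probabilities); the function name `filter_predictions` and the threshold value of 0.2 are illustrative choices.

```python
import numpy as np

S, B, C = 7, 2, 20

def filter_predictions(grid, threshold=0.2):
    """Keep boxes whose best class score exceeds `threshold`.

    grid: array of shape (S, S, B*5 + C); assumed layout per cell is
    B boxes of (x, y, w, h, confidence) followed by C class probabilities.
    Returns (boxes, scores, class_ids) for the surviving boxes.
    """
    boxes = grid[..., :B * 5].reshape(S, S, B, 5)
    class_probs = grid[..., B * 5:]                          # (S, S, C)
    # score of every (cell, box, class) combination
    scores = boxes[..., 4:5] * class_probs[:, :, None, :]    # (S, S, B, C)
    class_ids = scores.argmax(axis=-1)                       # (S, S, B)
    best_scores = scores.max(axis=-1)                        # (S, S, B)
    keep = best_scores > threshold
    return boxes[..., :4][keep], best_scores[keep], class_ids[keep]

grid = np.zeros((S, S, B * 5 + C))
grid[3, 4, 0:5] = [0.5, 0.5, 0.2, 0.2, 0.9]   # one confident box in cell (3, 4)
grid[3, 4, B * 5 + 11] = 0.7                  # that cell most likely holds class 11
kept_boxes, kept_scores, kept_classes = filter_predictions(grid)
```

The boxes that survive this step would then be passed to non-max suppression.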

## Training phase

Steps:

Find the cell that contains the center of the ground-truth bounding box. (Matching phase)

For that cell, check which of its bounding boxes overlaps most with the ground truth (IoU), then decrease the confidence of the bounding box that overlaps less. (Each bounding box has its own confidence.)

Decrease the confidence of all bounding boxes of every cell that contains no object. Do not adjust the box coordinates or class probabilities of those cells.
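
The matching phase can be sketched as below, under the assumption that the ground-truth box center is normalized to the range [0, 1] relative to the image size. The function name `responsible_cell` is hypothetical.

```python
S = 7  # grid size

def responsible_cell(cx, cy, grid_size=S):
    """Map a ground-truth box center to its grid cell.

    cx, cy: box center normalized to [0, 1] relative to image width/height
    (an assumption of this sketch). Returns (row, col, x_offset, y_offset),
    where the offsets give the position of the center inside the cell.
    """
    col = min(int(cx * grid_size), grid_size - 1)  # clamp cx == 1.0 into the last cell
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col, cx * grid_size - col, cy * grid_size - row

row, col, dx, dy = responsible_cell(0.5, 0.25)
print(row, col, dx, dy)  # 1 3 0.5 0.75
```

Only the cell at `(row, col)` is trained to predict this object; every other cell treats it as background.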

## Pre-train

The paper mentions that, before training for object detection, the authors modified the network (adding average pooling, FC and softmax layers) and trained it for classification on the ImageNet dataset for about one week, until it reached a good top-5 error. They then added more conv layers and the FC layer responsible for detection.

## Other details

Pre-trained on Imagenet

Use lots of augmentation

Use SGD to train

Evaluated on Pascal VOC

135 Epochs, batch size: 64

Momentum 0.9

Random scaling and translations of up to 20% of the original image size

Color exposure/saturation augmentation

## Loss Function

The network is trained with a multi-part loss function that takes into account the following objectives:

Classification (20 classes)

Object/No object classification

Bounding box coordinates (x,y,height,width) regression (4 scalars)

Each of these sub-objectives uses a sum-squared error. The factors $\lambda_{coord}=5.0$ and $\lambda_{noobj}=0.5$ re-weight the terms: the first increases the weight of the box coordinate loss, and the second decreases the weight of the confidence loss for boxes that contain no object.

Some other points to observe:

The classification loss is not backpropagated if the cell has no object

Only the bounding box with the highest IoU (intersection over union) with the ground truth has its box loss backpropagated
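
For reference, the loss can be written out as in the Yolo paper. In this form $\mathbf{1}$ is an indicator (1 when its condition holds, 0 otherwise, with $noobj$ the complement of $obj$), $C_{i}$ is the box confidence, $p_{i}(c)$ is the probability of class $c$ in cell $i$, and each hatted symbol is the counterpart its un-hatted symbol is compared against (prediction versus ground truth).

$$
\begin{aligned}
Loss = {} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right] \\
& + \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& + \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& + \lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbf{1}_{ij}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& + \sum_{i=0}^{S^{2}}\mathbf{1}_{i}^{obj}\sum_{c \in classes}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}
\end{aligned}
$$

The first two lines are the box coordinate regression, the next two are the object/no object confidence terms, and the last line is the classification term. The square roots on $w$ and $h$ make an error of a given size count more for a small box than for a large one.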

B: Number of bounding boxes (2)

$x_{i}, y_{i}, w_{i}, h_{i}$: Box definition

$C_{i}$: Confidence score of the box in cell $i$ (the class probabilities are written $p_{i}(c)$)

S: Grid size (7)

$\mathbf{1}_{i}^{obj}$: 1 if an object appears in cell $i$, 0 otherwise

$\mathbf{1}_{ij}^{obj}$: 1 if bounding box $j$ of cell $i$ is responsible for the prediction, 0 otherwise

## Intersection over Union (IoU)

IoU is a measure of how well an object detection output matches some ground truth. It is normally used during training and testing to compare how much the bounding box given during prediction overlaps with the ground-truth (training/test data) bounding box.

Calculating the IoU is simple: we divide the area of overlap between the two boxes by the area of their union.

Another way to calculate the IoU with numpy
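
A minimal numpy sketch of this calculation, assuming each box is given by its corner coordinates `[x1, y1, x2, y2]` (the function name `iou` and this box format are choices made for the illustration):

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2] corner coordinates."""
    box_a = np.asarray(box_a, dtype=float)
    box_b = np.asarray(box_b, dtype=float)
    # corners of the intersection rectangle
    top_left = np.maximum(box_a[:2], box_b[:2])
    bottom_right = np.minimum(box_a[2:], box_b[2:])
    # width/height are clipped at zero when the boxes do not overlap
    inter = np.prod(np.clip(bottom_right - top_left, 0, None))
    area_a = np.prod(box_a[2:] - box_a[:2])
    area_b = np.prod(box_b[2:] - box_b[:2])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou([0, 0, 2, 2], [1, 1, 3, 3])   # intersection 1, union 7
```

Two identical boxes give an IoU of 1, and two boxes that do not touch give 0.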

## Non-Maxima Suppression (nms)

At prediction time (after training) you may have lots of box predictions around a single object. The NMS algorithm filters out boxes that overlap each other by more than some IoU threshold, keeping only the highest-scoring box of each group.

Here is an example with numpy and Python:
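
The sketch below is one common greedy formulation of NMS, not necessarily the exact code used by any Yolo implementation. It assumes boxes in `[x1, y1, x2, y2]` format with non-zero area; the function name `nms` and the default threshold of 0.5 are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns the indices of the boxes that are kept, best score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # drop every remaining box that overlaps the best one too much
        order = rest[iou <= iou_threshold]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and survives.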

## Yolo v2

The Yolo detector has since been improved. The main improvements are:

Faster

More accurate (73.4 mAP, the mean average precision over all classes, on the Pascal VOC dataset)

Can detect up to 9000 classes (previously 20)

What they did to improve:

Added Batchnorm

Pre-train on ImageNet at two resolutions, first 224x224 and then 448x448, and only afterwards train for detection.

They now use anchor boxes like Faster R-CNN; the classification is done per anchor box instead of per grid cell.

Instead of choosing the box shapes manually, they use k-means to derive the box shapes from the data.

Train the network at multiple scales; as the network is now fully convolutional (no FC layer) this is easy to do.

They train on both ImageNet and MS-COCO.

They created a mechanism to train on datasets that have no detection data, by selecting which parts of the multi-part loss function to backpropagate.

They use WordTree to combine data from various sources, and a joint optimization technique to train simultaneously on ImageNet and COCO.
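
The k-means step used to pick anchor shapes can be sketched as follows. The Yolo v2 paper clusters box widths and heights with the distance $d = 1 - IoU(box, centroid)$ instead of the Euclidean distance; everything else here (function names, random initialization, stopping rule, toy data) is an assumption of this illustration.

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between boxes and centroids that share the same center.

    wh: (N, 2) widths/heights; centroids: (K, 2). Returns an (N, K) array.
    """
    inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
             np.minimum(wh[:, None, 1], centroids[None, :, 1]))
    union = ((wh[:, 0] * wh[:, 1])[:, None] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster box shapes with k-means using d = 1 - IoU as the distance."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # the nearest centroid is the one with the highest IoU (smallest 1 - IoU)
        assign = np.argmax(wh_iou(wh, centroids), axis=1)
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Toy data: three small boxes and three large boxes (width, height).
wh = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]
anchors = kmeans_anchors(wh, k=2)
anchors = anchors[np.argsort(anchors[:, 0])]   # sort by width for readability
```

On this toy data the two anchors settle on roughly 1x1 and 5x5, one per group of boxes.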
