Single Shot Detectors

Introduction

The previous methods of object detection all share one thing in common: one part of the network is dedicated to providing region proposals, followed by a high-quality classifier that classifies these proposals. These methods are very accurate, but come at a big computational cost (low frame rate); in other words, they are not fit for embedded devices.

Another way of doing object detection is to combine these two tasks into one network. Instead of having the network produce proposals, we use a set of pre-defined boxes to look for objects.

Using convolutional feature maps from the later layers of the network, we run small CONV filters over these feature maps to predict class scores and bounding box offsets.
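
As a minimal sketch of this idea (in PyTorch; the 512 channels, the k = 3 pre-defined boxes per cell and the 2 classes are made-up numbers, not values from any specific detector), two small 3x3 CONV heads slide over a feature map and emit class scores and box offsets for every cell:

```python
import torch
import torch.nn as nn

# Made-up numbers: a 512-channel feature map, k = 3 pre-defined boxes
# per cell, and 2 object classes.
k, num_classes = 3, 2

# Small 3x3 CONV filters slid over the feature map: one head predicts
# class scores, the other predicts 4 box offsets, per pre-defined box.
cls_head = nn.Conv2d(512, k * num_classes, kernel_size=3, padding=1)
box_head = nn.Conv2d(512, k * 4, kernel_size=3, padding=1)

feature_map = torch.randn(1, 512, 13, 18)  # (batch, channels, H, W)
print(cls_head(feature_map).shape)         # torch.Size([1, 6, 13, 18])
print(box_head(feature_map).shape)         # torch.Size([1, 12, 13, 18])
```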

Localizing with Convolutional Neural Networks

One way to reuse the computation already done during classification to localize objects is to grab activations from the final conv layers. At this point we still have spatial information, just represented at a lower resolution. For example, an input image of size 640x480x3 passed through an Inception model will have its spatial information compressed into a 13x18x2048 tensor in its final layers.

What happens is that in the final layers each "pixel" represents a larger area of the input image, so we can use those cells to infer the object's position. One thing to pay attention to is that even though we squeeze the image to a lower spatial dimension, the tensor is quite deep, so not much information is lost. (This is not entirely true when using pooling layers.)
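
As a back-of-the-envelope sketch (plain Python, reusing the 640x480 image and 13x18 grid from the example above), each feature-map cell can be mapped back to the approximate image region it represents:

```python
# Input image: 640x480; final feature map: 18 cells wide, 13 cells tall.
img_w, img_h = 640, 480
grid_w, grid_h = 18, 13

cell_w, cell_h = img_w / grid_w, img_h / grid_h  # ~35.6 x ~36.9 pixels per cell

def cell_center(col, row):
    """Approximate image-space center of feature-map cell (col, row)."""
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)

print(cell_center(0, 0))    # ~(17.8, 18.5)   -> top-left region of the image
print(cell_center(17, 12))  # ~(622.2, 461.5) -> bottom-right region
```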

At this point, imagine that you could use a 1x1 CONV layer to classify each cell as a class (e.g. Pedestrian/Background), and that from the same layer you could attach another CONV or FC layer to predict 4 numbers (the bounding box). In this way you get both class scores and location from a single forward pass.

One common mistake is to think that we are actually dividing the input image into a grid; this does not happen! What actually happens is that each layer represents the input image with less spatial data but with bigger depth. At training time we do some sort of matching between our ground truth and those virtual cells. Also, those cells actually overlap; they are not perfectly tiled.

Also, regarding the number of detections: each one of those cells could detect an object, so the output of this model could be 13x18 detections.
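
Here is a minimal sketch of this two-headed setup (PyTorch; the 2048-channel 13x18 feature map follows the Inception example above, and the Pedestrian/Background pair gives the 2 classes):

```python
import torch
import torch.nn as nn

features = torch.randn(1, 2048, 13, 18)  # final conv activations from above

# 1x1 CONV that classifies each cell (e.g. Pedestrian vs. Background)...
cls_head = nn.Conv2d(2048, 2, kernel_size=1)
# ...and a parallel 1x1 CONV that regresses 4 numbers (a box) per cell.
box_head = nn.Conv2d(2048, 4, kernel_size=1)

print(cls_head(features).shape)  # torch.Size([1, 2, 13, 18])
print(box_head(features).shape)  # torch.Size([1, 4, 13, 18]) -> 13x18 boxes
```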

How to get the bounding box

One of the things that may be difficult to understand at first is how the detection system converts the cells into an actual bounding box that fits the object.
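
A common way to do it (the usual R-CNN/SSD box parameterization, shown here as a sketch rather than the exact code of any particular detector) is to predict four offsets relative to a reference box (the "anchors"/"default boxes" discussed further below) and decode them like this:

```python
import math

def decode_box(anchor, offsets):
    """Decode predicted offsets (tx, ty, tw, th) relative to an anchor
    box (cx, cy, w, h) into an actual box."""
    a_cx, a_cy, a_w, a_h = anchor
    tx, ty, tw, th = offsets
    cx = a_cx + tx * a_w    # shift the center, scaled by the anchor size
    cy = a_cy + ty * a_h
    w = a_w * math.exp(tw)  # exp keeps width/height positive
    h = a_h * math.exp(th)
    return (cx, cy, w, h)

# Zero offsets return the anchor itself; small offsets nudge and resize it.
print(decode_box((100, 100, 50, 80), (0.0, 0.0, 0.0, 0.0)))   # (100, 100, 50, 80)
print(decode_box((100, 100, 50, 80), (0.1, -0.2, 0.3, 0.0)))  # shifted, wider
```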

Here is the family of object detectors that follows this strategy:

  • SSD: Uses different activation maps (multiple scales) for the prediction of classes and bounding boxes (see the sketch below)

  • YOLO: Uses a single activation map for prediction of classes and bounding boxes

  • R-FCN (Region-based Fully Convolutional Networks): Like Faster R-CNN (400 ms), but faster (170 ms) due to less computation per box; it is also fully convolutional (no FC layer)

Using multiple scales helps achieve a higher mAP (mean average precision) by detecting objects of different sizes in the image better.
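
A sketch of this multi-scale idea (PyTorch; the map resolutions, channel counts, 21 classes and 4 boxes per cell are illustrative assumptions loosely inspired by the SSD paper, not its exact configuration): each feature map gets its own small CONV predictor, and all predictions are flattened and concatenated:

```python
import torch
import torch.nn as nn

# Hypothetical feature maps of decreasing resolution taken from a backbone.
feature_maps = [torch.randn(1, 512, 38, 38),
                torch.randn(1, 1024, 19, 19),
                torch.randn(1, 256, 10, 10)]

num_classes, boxes_per_cell = 21, 4
heads = nn.ModuleList(
    nn.Conv2d(fm.shape[1], boxes_per_cell * (num_classes + 4),
              kernel_size=3, padding=1)
    for fm in feature_maps)

# Each map predicts boxes at its own scale; flatten and concatenate all.
preds = [h(fm).permute(0, 2, 3, 1).reshape(1, -1, num_classes + 4)
         for h, fm in zip(heads, feature_maps)]
all_preds = torch.cat(preds, dim=1)
print(all_preds.shape)  # (1, 4 * (38*38 + 19*19 + 10*10), 25) = (1, 7620, 25)
```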

Summarizing the strategy of these methods

  1. Train a CNN with regression (bounding box) and classification objectives (loss function).

  2. Normally their loss functions are more complex, because they have to manage multiple objectives (classification, regression, and checking whether there is an object or not).

  3. Gather activations from a particular layer (or layers) to infer classification and location with an FC layer or another CONV layer that works like an FC layer.

  4. During prediction, use algorithms like non-maximum suppression to filter out multiple boxes around the same object.

  5. During training, use algorithms like IoU (Intersection over Union) to relate the predictions to the ground truth (both are sketched after this list).
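
Both IoU and non-maximum suppression are short enough to sketch directly (plain Python; boxes are in (x1, y1, x2, y2) corner format and the 0.5 threshold is a common but arbitrary choice):

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop the remaining boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the two near-duplicate boxes collapse to one
```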

In this kind of detector it is typical to have a collection of boxes overlaid on the image at different spatial locations, scales and aspect ratios that act as "anchors" (sometimes called "priors" or "default boxes").
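
A sketch of how such a collection of default boxes could be generated for a single feature map (plain Python; the 0.2 scale and the aspect ratios are illustrative values in the spirit of SSD's default boxes):

```python
import math

def default_boxes(grid_w, grid_h, scale=0.2, aspect_ratios=(1.0, 2.0, 0.5)):
    """One set of default ("anchor"/"prior") boxes per feature-map cell,
    in normalized (cx, cy, w, h) image coordinates."""
    boxes = []
    for row in range(grid_h):
        for col in range(grid_w):
            cx, cy = (col + 0.5) / grid_w, (row + 0.5) / grid_h
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)  # wider boxes for ar > 1
                h = scale / math.sqrt(ar)  # taller boxes for ar < 1
                boxes.append((cx, cy, w, h))
    return boxes

priors = default_boxes(18, 13)
print(len(priors))  # 18 * 13 * 3 = 702 default boxes for this one map
```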

References:

  • http://silverpond.com.au/2016/10/24/pedestrian-detection-using-tensorflow-and-inception.html
  • https://arxiv.org/pdf/1512.02325.pdf
  • https://github.com/amdegroot/ssd.pytorch
  • https://arxiv.org/pdf/1312.2249.pdf
  • https://arxiv.org/pdf/1605.06409.pdf
  • https://www.robots.ox.ac.uk/~vgg/rg/slides/vgg_rg_16_feb_2017_rfcn.pdf
  • https://github.com/xdever/RFCN-tensorflow
  • https://github.com/PureDiors/pytorch_RFCN
  • https://github.com/aleju/papers
  • https://arxiv.org/pdf/1506.02640.pdf
  • https://github.com/tommy-qichang/yolo.torch
  • https://www.youtube.com/watch?v=NM6lrxy0bxs
  • https://arxiv.org/pdf/1612.08242.pdf
  • https://arxiv.org/pdf/1701.06659.pdf
  • http://www.cs.unc.edu/~wliu/papers/ssd_eccv2016_slide.pdf
  • https://cloud.google.com/blog/big-data/2016/07/understanding-neural-networks-with-tensorflow-playground
  • https://arxiv.org/pdf/1611.10012.pdf
  • http://www.rsipvision.com/ComputerVisionNews-2017June/files/assets/common/downloads/Computer%20Vision%20News.pdf