Imbalanced/Missing Datasets

Last updated 5 years ago

This chapter will tackle some common problems with datasets that you may encounter.

Missing Data

Imagine that you have a dataset where some of its columns have missing data. What should you do in this case? In my experience, discarding the affected rows or columns entirely is not a good option. One thing you can do is compute a statistic over the entire dataset (for instance the mean value of the column) and fill the missing entries with it. Also, some algorithms can handle missing values on their own (e.g. Naive Bayes, which can skip a missing feature when computing the likelihood).
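A minimal sketch of the mean-filling idea, using only the Python standard library. The helper name `impute_column_means`, the use of NaN to mark a missing entry, and the toy data are assumptions made for illustration; they do not come from any particular library.

```python
import math
from statistics import mean

def impute_column_means(rows):
    """Return a copy of rows with NaN entries replaced by their column mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        # Use only the observed (non-NaN) values of column j.
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # Assumption: every column has at least one observed value.
        col_means.append(mean(observed))
    return [
        [col_means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]

nan = float("nan")
# Hypothetical dataset: 3 rows, 2 columns, one missing value per column.
data = [
    [1.0, 10.0],
    [nan, 20.0],
    [3.0, nan],
]
filled = impute_column_means(data)
print(filled)
```

The means are computed from the observed values only, and the original rows are left untouched. If you split your data into training and test sets, it is usually safer to compute the means on the training set and reuse them for the test set.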

Imbalanced Data

Imagine the problem of a self-driving car: most of the time the driver holds the steering wheel at angle 0, so most of your data will have the value 0. If you use a neural-network-based algorithm, this means that a predictor that always outputs zero already gets a small loss, so backpropagation will not update your weights properly (or at least it will take too much time to learn the cases where we are actually steering).
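The effect can be sketched with a few lines of Python. The steering log below is made up for illustration, and downsampling the zero-angle samples is only one possible remedy, not something prescribed by this chapter; the helper `downsample_zeros` and its `keep_fraction` parameter are hypothetical names.

```python
import random

def mse(predictions, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical steering log: 95 straight-driving samples and 5 turns.
angles = [0.0] * 95 + [0.5, -0.5, 0.4, -0.4, 0.3]

# A predictor that always outputs 0 already gets a low loss on this data.
zero_loss = mse([0.0] * len(angles), angles)

def downsample_zeros(samples, keep_fraction, rng):
    """Keep every non-zero sample and only a random fraction of the zero ones."""
    zeros = [a for a in samples if a == 0.0]
    turns = [a for a in samples if a != 0.0]
    n_keep = int(len(zeros) * keep_fraction)
    return turns + rng.sample(zeros, n_keep)

balanced = downsample_zeros(angles, keep_fraction=0.1, rng=random.Random(0))

# On the rebalanced data the "always zero" predictor is penalized much more.
balanced_zero_loss = mse([0.0] * len(balanced), balanced)

print(zero_loss, balanced_zero_loss)
```

After downsampling, the steering samples make up a much larger share of the data, so the trivial "always zero" solution is no longer cheap. Other options in the same spirit include oversampling the rare samples or weighting them more heavily in the loss.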

It is also worth noting that some algorithms are less sensitive to imbalanced data (e.g. Random Forest), although a severe imbalance can still hurt them.
