Imbalanced/Missing Datasets

This chapter will tackle some common problems with datasets that you may encounter.

Missing Data

Imagine that you have a dataset in which some of its columns have missing values. What should you do in this case? In my experience, discarding the incomplete data entirely is not a good option. One thing you can do is compute a statistic over the entire dataset (for instance, the mean value of the column) and fill the missing entries with it. Also, some algorithms can handle missing values on their own, depending on the implementation (e.g. Naive Bayes).
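The mean-filling idea above can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper name impute_column_means and the toy data are made up for this example, and missing values are assumed to be encoded as None or NaN.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_column_means(rows):
    """Return a copy of rows where each missing entry is replaced by
    the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        # A column with no observed values cannot be averaged; leave it missing.
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy dataset: two columns, one missing value in each.
data = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
]
filled = impute_column_means(data)
```

In practice you would probably use a library routine for this, and you should compute the means on the training set only, then reuse them for validation and test data.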

Imbalanced Data

Imagine the problem of a self-driving car: most of the time the driver holds the steering wheel at angle 0, so most of your data will have the value 0. If you use a neural-network-based algorithm, this means that a predictor that always outputs zero will already get a small value from your loss function, so backpropagation will not update your weights properly (or at least it will take too much time to learn from the cases where we are actually steering).
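To make this concrete, here is a small sketch that measures the loss of a predictor that always outputs zero. The numbers are invented for illustration (95 straight samples and 5 turning samples), and mean squared error is assumed as the loss function.

```python
def mse(predictions, targets):
    """Mean squared error between two equally sized sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical steering angles: the car goes straight 95% of the time.
angles = [0.0] * 95 + [0.5] * 5

# A "lazy" predictor that always answers 0, whatever the input is.
always_zero = [0.0] * len(angles)

# Loss over the whole (imbalanced) dataset.
overall_loss = mse(always_zero, angles)

# Loss over the turning samples only, where the predictor is always wrong.
turning_loss = mse([0.0] * 5, [0.5] * 5)

print(overall_loss, turning_loss)
```

The overall loss is about twenty times smaller than the loss on the turning samples alone, so the lazy predictor looks good on average even though it never steers.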

It is also worth noting that some algorithms are less sensitive to imbalanced data (e.g. Random Forest).
