Imbalanced data is when one class value occurs far more often than another. For example, say you are predicting the presence of a disease from a number of input fields, and the presence of the disease, represented by a 1, accounts for only 10% of your data; then you have an imbalanced data set. Machine learning algorithms tend to be geared towards minimising overall prediction error, and if predicting zero every time gives a 90% success rate, as in this case, the algorithm can end up simply predicting zero for everything.
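To make the 90% figure concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (10 positives in 100 rows), and the "model" is just a list of zeros, not a real classifier. It shows that accuracy looks healthy while recall, the share of actual positives that were found, is zero.

```python
# Hypothetical labels for illustration: 1 = positive (10%), 0 = negative (90%).
labels = [1] * 10 + [0] * 90

# A stand-in "model" that predicts 0 for every row.
predictions = [0] * len(labels)

# Accuracy: fraction of rows where the prediction matches the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: fraction of the actual positives that were predicted as positive.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(labels)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

This is why accuracy on its own is a poor guide with imbalanced data: a model can score 90% without ever identifying a single positive case.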

Hopefully this all sounds familiar, because in horse racing we tend to have imbalanced data. The number of winners may well be around the 10% mark. So how do we get our machine learning algorithms to behave more sensibly with this data? The following link provides a pretty good explanation of some of the techniques that can be employed, and also shows why Random Forests are pretty good at avoiding this pitfall.

https://elitedatascience.com/imbalanced-classes
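As a rough illustration of one commonly used rebalancing technique, the sketch below randomly oversamples the minority class (winners) until it matches the majority class in size. This is an assumption about what such a technique might look like, not a summary of the linked article. The function name oversample_minority, the toy rows and the seed are all hypothetical.

```python
import random

def oversample_minority(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Hypothetical helper for illustration only; assumes two classes.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = len(rows) - len(minority)

    # Nothing to do if there is no minority row to copy or it is already balanced.
    if not minority or len(minority) >= majority_count:
        return list(rows), list(labels)

    # Draw extra minority rows with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(majority_count - len(minority))]
    new_rows = list(rows) + extra
    new_labels = list(labels) + [minority_label] * len(extra)
    return new_rows, new_labels

# Toy data: 100 runners, the first 10 are winners (label 1).
rows = [{"runner_id": i} for i in range(100)]
labels = [1] * 10 + [0] * 90

balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(balanced_labels.count(1), balanced_labels.count(0))
```

If you try something like this, resample only the training portion of your data. Duplicated winners that leak into the test set will make results look better than they really are.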
