Steve Tilley sent me this interesting article today, which delves into the benefits of using ordinal encoding over one hot encoding in some situations.
A synopsis of the piece would be that for some tree-based algorithms, like GBM and Random Forests, ordinal encoding can be a better option than the usually recommended one hot encoding.
OK, I am getting ahead of myself here. What do the above terms actually mean? Well, imagine we have a racing data feature like race going (F, GF, Gd etc.) and let's say we want to model on pace figure and going, because maybe together they have some predictive power. We cannot use the going data as is because most ML algorithms require numeric values. The conventional wisdom is that if the going does not have some intrinsic ordering to it then you one hot encode it, which simply means creating a binary feature for every possible occurrence, so each of F, GF, Gd and so on gets its own 0/1 column.
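To make the idea concrete, here is a minimal sketch of one hot encoding in plain Python. The list of goings is made up for illustration and is not the data behind the results in this post; the helper name `one_hot` is likewise just mine.

```python
# Hypothetical going codes for a handful of races (illustrative values only).
goings = ["GF", "Gd", "F", "Gd", "Sft"]

# One column per distinct going value, in a fixed (sorted) order.
categories = sorted(set(goings))


def one_hot(value, categories):
    """Return a list of 0/1 flags with a single 1 at the position of value."""
    return [1 if value == c else 0 for c in categories]


encoded = [one_hot(g, categories) for g in goings]

print(categories)
for going, row in zip(goings, encoded):
    print(going, row)
```

Each race ends up with as many going columns as there are distinct going values, and exactly one of them is set to 1. In practice a library call such as pandas' `get_dummies` would do the same job.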
As the article points out, this can lead to an explosion of features and possibly the curse of dimensionality.
Below is the performance of a model on pace figure and one hot encoded going for turf flat handicaps. The top rated made a ROI of 1.95% but a variable ROI of -0.7%.
Now if we use a numeric value for going, namely 0 = Hvy, 1 = Sft, 2 = GS etc., so that there are only two input features, pace figure and going, we get a slightly better set of results.
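A rough sketch of that numeric mapping is below. Only the first three codes (Hvy, Sft, GS) come from the mapping above; the codes for Gd, GF and F are my assumed continuation of the scale, and the pace figures are invented for illustration.

```python
# Ordinal codes for going, softest to firmest.
# 0-2 are as described in the post; 3-5 are an assumed continuation.
GOING_ORDER = {"Hvy": 0, "Sft": 1, "GS": 2, "Gd": 3, "GF": 4, "F": 5}


def encode_going(going):
    """Map a going label to its position on the ordinal scale."""
    return GOING_ORDER[going]


# Hypothetical rows of (pace figure, going).
rows = [(72.5, "GF"), (65.0, "Hvy"), (80.1, "Gd")]

# Two input features per row: pace figure and the ordinal going code.
features = [[pace, encode_going(going)] for pace, going in rows]

print(features)
```

However many going descriptions exist, the model still sees just one going column, which is what keeps the feature count at two.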
These results suggest, as the article does, that we should not jump to conclusions about one hot encoding; ordinal encoding with tree-based algos may be just as good, if not better.