Imagine we had 3 top gamblers available to conduct an experiment, let's say Pittsburgh Phil, Alan Potts and Dave Nevison. OK, perhaps not Dave Nevison, but 3 top punters of your choice. Now let's imagine we want to drill down and figure out how important the key ingredients are that they use to pick bets. Let us imagine that, amongst other things, Phil looks at the suitability of the going for the horse, the horse's previous experience at the race distance and the ability of the jockey. Trouble is, we do not know how these factors rank. Is one more important than the others? One way of figuring this out is to sit Phil down in a locked room for 5 years and get him to punt for 2 of them with all the information/data he needs, along with food and water of course, and perhaps an exercise yard and a small bird in a cage. Seeing as we are dealing with dead punters here, I will sling Telly Savalas in the next room. Having logged his excellent performance over the first 2 years, we now randomly alter some of the data he is receiving on the horses' going suitability. We do this for a year and see how much his betting performance has suffered. We then do the same for the data on distance suitability and finally for jockey worthiness. After these 3 data alterations we will see that his punting has suffered by varying degrees depending on which set of data was altered. By comparing these 3 values we are now in a position to order the importance of the three inputs.
This, in essence, is the process behind a Machine Learning feature importance approach known as permutation feature importance. It does not work quite as outlined above: rather than randomly altering a feature's content over subsequent years, it will train a model on, say, 4 years of data and test it on the 5th year, then gauge feature importance by repeating the predictions on the 5th year with each feature randomly shuffled in turn to see which alteration has the greatest negative impact.
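To make that concrete, here is a minimal sketch of the shuffling idea in Python. The data frame df, the feature names, the year cut-off and the 'winner' target are all hypothetical stand-ins rather than anything from MySportsAI: a model is fitted on four years, scored on the fifth, then rescored with each feature shuffled in turn.

# A minimal sketch of the idea, assuming a pandas DataFrame `df` with a
# 'year' column, some feature columns and a binary 'winner' target --
# all names here are hypothetical stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

features = ['going_suit', 'distance_suit', 'jockey_strike_rate']
train = df[df['year'] < 2023]     # the "first 4 years" (cut-off is illustrative)
test = df[df['year'] == 2023]     # the "5th year"

model = RandomForestClassifier(random_state=1)
model.fit(train[features], train['winner'])

# baseline score on the untouched test year
base_score = roc_auc_score(test['winner'],
                           model.predict_proba(test[features])[:, 1])

rng = np.random.default_rng(1)
for feat in features:
    shuffled = test[features].copy()
    # randomly permute just this one feature, leave the rest untouched
    shuffled[feat] = rng.permutation(shuffled[feat].values)
    score = roc_auc_score(test['winner'],
                          model.predict_proba(shuffled)[:, 1])
    # the bigger the drop from the baseline, the more important the feature
    print(f"{feat}: importance = {base_score - score:.4f}")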
The other useful thing you can do with feature importance calculations is check the importance on the training data and then on the test data. If the ordering is wildly out of line across the two, it may well be a sign that the model has overfitted on the training data, that is to say it is tending to memorize the data and will therefore not predict very well on new data.
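scikit-learn has this baked in as permutation_importance, so a rough sketch of that train-versus-test comparison, reusing the hypothetical model, train and test objects from the snippet above, might look like this.

# A rough sketch, reusing the hypothetical `model`, `features`, `train`
# and `test` from the previous snippet.
from sklearn.inspection import permutation_importance

for name, subset in [('train', train), ('test', test)]:
    result = permutation_importance(model, subset[features], subset['winner'],
                                    scoring='roc_auc', n_repeats=20,
                                    random_state=1)
    # rank features from most to least important on this data set
    order = result.importances_mean.argsort()[::-1]
    print(name, [features[i] for i in order])
# If the two orderings are wildly different, suspect overfitting.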
I have been running some checks on this method to see how it performs compared to the bog-standard feature importance that comes with the Python sklearn package and, as you shall see, the two can disagree.

The plot above uses the standard feature importance algorithm and we can see that jockey strike rate comes out top. Now let us look at feature importance using the permutation method.

On the left side we can see that on the test data jockey strike rate is also the most important feature, but after that there is disagreement with the first plot. The right-hand box plot shows the importance when applied to the training data, and we can see that jockey strike rate and class move have maintained their relative positions of 1st and 2nd, which is a good sign that the model is generalizing well from the training data to the test data. The lines extending from the box plots show the variation in the measurements derived from doing 20 different sets of predictions, each with a differently shuffled set of values within the feature, while the box itself summarizes the bulk of those 20 readings.
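For anyone wanting to reproduce that kind of chart, a sketch along these lines (again reusing the hypothetical objects from the earlier snippets) gives side-by-side box plots from 20 shuffles per feature, test data on the left and training data on the right.

# Sketch of the side-by-side box plots, reusing the hypothetical `model`,
# `features`, `train` and `test` defined earlier.
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (name, subset) in zip(axes, [('test', test), ('train', train)]):
    result = permutation_importance(model, subset[features], subset['winner'],
                                    scoring='roc_auc', n_repeats=20,
                                    random_state=1)
    # result.importances has one row per feature and one column per repeat
    ax.boxplot(result.importances.T, labels=features)
    ax.set_title(f'Permutation importance ({name})')
plt.tight_layout()
plt.show()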
The benefit of this approach is twofold: firstly, you get a more accurate evaluation of feature importance, and secondly, comparing train and test gives an insight into possible model overfitting. The permutation variety will be included in the Autumn update of MySportsAI, software that allows you to create models at the click of a button with no prior ML knowledge.