One limitation of the scikit-learn (sklearn) library of machine learning algorithms is the inability to write your own cost function and get the algorithms to use it. What do I mean by this? Well, when you train a model on some horse racing data, the algorithms search for the most effective model at predicting winners, assuming you are using win/lose as your target feature, ie the thing you are trying to predict. The ‘cost function’ is just a fancy term for the method the algorithm uses to determine how good a model is, in our case at predicting winners. But what if we wanted it to do its search and model fitting based on profit or loss? What if we wanted the model fitting criterion to be variable profit/loss, ie how much a model makes betting to always win a mythical £1? The fitting process might try one set of parameters and find that it makes a variable stake loss of -£25, so it tweaks the parameters and tries again, finding that the loss is now -£20. Great, the second model shows an improvement; let's tweak a little more.
This tweaking process will depend on the algorithm being used, but as I mentioned, sklearn does not permit you to define a loss function that calculates variable P/L and then ask it to use that as the measure for its tweaking.
Here is where PyGAD comes into the picture. PyGAD is a genetic algorithm (GA) library for Python that does allow you to create your own loss function and plug it in. An explanation of GAs is beyond the scope of this blog entry, but if you want, a YouTube search will turn up some nice intros. There was also a great article in an old copy of SmartSig from a guy utilizing GAs; if memory serves, it actually won the article of the year award.
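To make the idea concrete, here is a minimal sketch of what a variable P/L fitness function might look like. Everything here is hypothetical (the function names, the linear scoring of runners, and the staking rule of 1/(BFSP - 1) to win £1 are my assumptions, not MySportsAI's actual setup); it just shows the shape of a custom fitness you could hand to PyGAD.

```python
import numpy as np

def variable_pl(solution, X, race_ids, bfsp, won):
    """Total variable-stake P/L when backing the top-rated runner in each race
    to win a mythical £1 (assumed stake = 1 / (BFSP - 1))."""
    scores = X @ solution                        # simple linear rating per runner
    total = 0.0
    for race in np.unique(race_ids):
        idx = np.where(race_ids == race)[0]
        top = idx[np.argmax(scores[idx])]        # top-rated runner in this race
        if won[top]:
            total += 1.0                         # winning bet returns £1
        else:
            total -= 1.0 / (bfsp[top] - 1.0)     # losing bet costs the stake
    return total

# With PyGAD (v3.x signature) this becomes the fitness function, roughly:
# def fitness_func(ga_instance, solution, solution_idx):
#     return variable_pl(solution, X, race_ids, bfsp, won)
# ga = pygad.GA(num_generations=200, num_parents_mating=10,
#               fitness_func=fitness_func, sol_per_pop=50, num_genes=X.shape[1])
```

The GA then evolves the weights to maximise this fitness directly, which is exactly the "tweak and try again" loop described above, only driven by P/L rather than win/lose accuracy.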
OK, on to the face-off: how does a match-up between sklearn's GBM algorithm and PyGAD shape up? With GBM I will be using a model based on the win/lose approach, whilst with PyGAD I will be adopting a model based on VPL (variable profit or loss).
For both I used 3 input features (forgive me, but I won't disclose these as they are part of MySportsAI), along with Betfair SP and finishing position as data items simply to calculate P/L.
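For anyone wanting to replicate the P/L calculation from BFSP and finishing position, a sketch along these lines would do it. The function name and the staking interpretation (stake 1/(BFSP - 1) so a winner returns £1) are my assumptions:

```python
import numpy as np

def bets_vpl(bfsp, finish_pos):
    """Variable-stake P/L across a set of bets: stake 1/(BFSP - 1) on each
    selection so a winner returns exactly £1; a loser costs the stake."""
    won = finish_pos == 1
    stakes = 1.0 / (bfsp - 1.0)
    return float(np.where(won, 1.0, -stakes).sum())
```

So a winner at BFSP 2.0 contributes +£1, while a loser at BFSP 5.0 costs only £0.25, which is what keeps the staking "variable".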
First of all I used the data from 2011 to 2017 and did not scale the input features (see later). Here are the results for the top rated horses when the model was applied to 2018 and 2019
Bets 12921 VPL -24.3 VROI% -0.79%
Next I min-max scaled the 3 input features. This simply means that each feature is rescaled to between 0 and 1 so that all three are on an identical scale, which can often help with modelling for some algorithms.
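Min-max scaling is a one-liner with sklearn's `MinMaxScaler`. The feature values below are toy stand-ins for the three undisclosed inputs; the important detail is that the scaler is fitted on the training years only and then applied to the test years, so no test information leaks into the scaling:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the 3 input features (rows = runners).
X_train = np.array([[10., 200., 0.5],
                    [20., 100., 1.5],
                    [30., 300., 1.0]])
X_test = np.array([[15., 250., 0.8]])

scaler = MinMaxScaler()                        # rescales each column to [0, 1]
X_train_scaled = scaler.fit_transform(X_train) # learn min/max from training data
X_test_scaled = scaler.transform(X_test)       # apply the same scaling to test data
```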
Bets 12921 VPL +3.19 VROI% +0.18%
My last step involved incorporating Betfair SP as an input feature. My curiosity here was mainly due to the fact that when you include BFSP in a model where the algorithm is modelling for winners, BFSP simply swamps the modelling process; it is so effective at predicting winners that the algorithm zones in on it at the expense of the importance placed on your other features. With PyGAD and a cost function working on VPL, this should not be the case. Having said that, it did not appear to help, as the VPL came out at -25.28 on the test data. OK, worth a try.
Now on to the Gradient Boosting algorithm. GBM has appeared to be quite effective on racing data in the past, perhaps mainly because it can handle non-linear data quite well. First, non-scaled input features:
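For reference, the sklearn side of the face-off looks roughly like this. The data here is randomly generated filler standing in for the real features and results, and the hyperparameters shown are just sklearn defaults, not necessarily what was used:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))             # stand-in for the 3 input features
y_train = (rng.random(500) < 0.1).astype(int)   # win/lose target, roughly 10% winners

model = GradientBoostingClassifier(n_estimators=100,
                                   learning_rate=0.1,
                                   max_depth=3,
                                   random_state=0)
model.fit(X_train, y_train)                     # fits on win/lose, not on P/L
win_prob = model.predict_proba(X_train)[:, 1]   # rank runners within a race by this
```

The key contrast with the PyGAD approach is the `fit` line: GBM is minimising a classification loss on win/lose, and there is no hook to substitute a variable P/L objective.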
Bets 14250 VPL -12.81 VROI% -0.35%
There were obviously more joint top-rated horses with this approach.
GBM with scaled data did very poorly, making a bad job of differentiating between rankings and delivering lots of joint or co-top-rated horses:
Bets 70804 VPL -218.7 VROI% -2.05%
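The jump in bet count is explained by ties in the ratings: every runner sharing the top rating becomes a bet. A quick hypothetical helper for counting how many runners share the top rating in each race:

```python
import numpy as np

def joint_top_counts(race_ids, scores):
    """Number of runners sharing the top rating in each race."""
    return [int((scores[race_ids == r] == scores[race_ids == r].max()).sum())
            for r in np.unique(race_ids)]
```

Applied to model output, a run of counts well above 1 is the symptom seen here: the model is failing to separate the leading runners.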
Conclusion – there is room for more analysis here, but also encouragement that modelling on VPL could be a profitable way to go.
As always, thoughts and comments are welcome.