I have been playing around with an ML model today and the purpose of this post is to hopefully promote some discussion about potential target fields.
When you feed data to an ML algorithm you need to define input features (e.g. is the horse a course winner) along with a target feature that the inputs have to predict. It was the target that I ran a simple experiment on. I ran a model on four different target features to get a feel for whether one stood out from the others. The four varieties were as follows.
1. Good old-fashioned 1 if the horse won and 0 if the horse lost
2. 1 if the horse won or came second, 0 otherwise
3. 1 if the horse won or came second or finished 3rd in a race with more than 8 runners, 0 otherwise
4. 1 if the horse finished in the first 4 and outran its odds, 0 otherwise
In the last case, "outran its odds" simply means that the horse's position in the betting was longer than its finishing position. For example, a horse that finished 2nd but went off favourite would be a 0, whereas a horse that finished 4th and was 5th in the betting gets a 1.
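The four targets above can be sketched in a few lines of Python. This is only an illustration, not the code I actually ran: the function name and the arguments `finish_pos`, `betting_rank` and `runners` are made up for the example. It assumes that in option 3 the "more than 8 runners" condition applies only to the 3rd place finish, and that a horse finishing exactly where the betting put it has not outrun its odds.

```python
def target_labels(finish_pos, betting_rank, runners):
    """Return the four candidate 0/1 targets for one runner.

    finish_pos   -- finishing position (1 = winner); hypothetical field
    betting_rank -- position in the betting (1 = favourite); hypothetical field
    runners      -- number of runners in the race
    """
    # Option 1: won or lost
    win = int(finish_pos == 1)
    # Option 2: won or came second
    top_two = int(finish_pos <= 2)
    # Option 3: as option 2, plus 3rd place in races with more than 8 runners
    # (assumption: the runner condition applies to the 3rd place only)
    placed = int(finish_pos <= 2 or (finish_pos == 3 and runners > 8))
    # Option 4: first 4 home and longer in the betting than the finishing position
    outran_odds = int(finish_pos <= 4 and betting_rank > finish_pos)
    return {
        "win": win,
        "top_two": top_two,
        "placed": placed,
        "outran_odds": outran_odds,
    }


# The two examples from the text
second_but_favourite = target_labels(finish_pos=2, betting_rank=1, runners=10)
fourth_but_fifth_in_betting = target_labels(finish_pos=4, betting_rank=5, runners=10)
```

Here `second_but_favourite["outran_odds"]` is 0 and `fourth_but_fifth_in_betting["outran_odds"]` is 1, matching the two examples given.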
I tested both how the top rated horse in each race performed and how simply backing every horse rated above a threshold performed. This is a quick and dirty measure, but the objective is hopefully to foster some discussion on other measures for target variables.
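For anyone unsure what the two selection methods look like in practice, here is a rough sketch. The tuple layout `(race_id, horse, rating)`, the function names and the sample ratings are all invented for the illustration. It also assumes ties for top rated go to the first horse listed, which may not be how you would want to handle them.

```python
def top_rated(runners):
    """Pick the highest rated horse in each race.

    runners -- iterable of (race_id, horse, rating) tuples (hypothetical layout)
    Ties are resolved in favour of the horse seen first.
    """
    best = {}
    for race_id, horse, rating in runners:
        if race_id not in best or rating > best[race_id][1]:
            best[race_id] = (horse, rating)
    return {race_id: horse for race_id, (horse, _) in best.items()}


def above_threshold(runners, threshold):
    """Pick every horse whose rating is above the threshold, whatever the race."""
    return [(race_id, horse) for race_id, horse, rating in runners if rating > threshold]


# Made-up ratings purely to show the difference between the two methods
sample = [
    ("race1", "A", 0.30),
    ("race1", "B", 0.25),
    ("race1", "C", 0.10),
    ("race2", "D", 0.15),
    ("race2", "E", 0.12),
]

top_picks = top_rated(sample)
threshold_picks = above_threshold(sample, 0.20)
```

With this sample the top rated method gives one bet per race (A and D), whereas a 0.20 threshold gives two bets in the first race (A and B) and none in the second.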
Option 1 produced
Top rated: 7998 bets, 1323 wins, PL after comm' +514 pts, ROI +6.42%, Varpl +40.6
Option 2 produced
8012 bets, 1328 wins, PL +265.3, ROI +3.3%, Varpl +43.9
Option 3 produced
8028 bets, 1365 wins, PL +413, ROI +5.15%, Varpl +91.5
Option 4 produced
8056 bets, 1201 wins, PL +235.9, ROI +2.92%, Varpl +67.79
When it came to simply backing any horse above a certain threshold on the ratings, option 3 performed best, followed by option 2, then option 1 and finally option 4.
The reason for trying the various options is that unbalanced data can affect the performance of ML algorithms, although the Gradient Boosting Tree based algorithm I am using suffers least. An unbalanced data set here simply means fewer 1's than 0's. The closer you get to 50-50 on the target 1's and 0's, the more balanced the data is. Clearly adding placed runs increases the balance.
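A quick way to see the effect on balance is to work out the share of 1's under each target. The sketch below uses a single made-up 10 runner race rather than my real data, and the helper name `positive_rate` is just for the example.

```python
def positive_rate(labels):
    """Share of 1's in a list of 0/1 target labels."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels supplied")
    return sum(labels) / len(labels)


# Finishing positions in one hypothetical 10 runner race
positions = list(range(1, 11))
runners = len(positions)

win_labels = [int(p == 1) for p in positions]
top_two_labels = [int(p <= 2) for p in positions]
placed_labels = [int(p <= 2 or (p == 3 and runners > 8)) for p in positions]

win_rate = positive_rate(win_labels)          # 1 in 10
top_two_rate = positive_rate(top_two_labels)  # 2 in 10
placed_rate = positive_rate(placed_labels)    # 3 in 10
```

In this toy race the share of 1's goes from 10% for the win target to 30% for the placed target, so it moves towards 50-50 but is still a long way from it.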
The question, however, is whether there are other options worth throwing at the algorithm. I would be happy to receive any suggestions on other possible target fields in the comments section.