So what is the difference between Trevor picking horses based on:

1. Ad hoc reading of form

2. A system

3. A machine learning model

First of all, Trevor’s method is number 1 and probably highly subjective. Even if you told him to focus on the same criteria as a machine learning model, e.g. trainer form, there would be no guarantee that he would interpret the data correctly, or that he would interpret it the same way every day he read the form. In an attempt to introduce some rigour and consistency, he may try to devise a system.

Flatstats tweeted the following data the other day.

Best Value Jockey at Wolverhampton: Rossa Ryan

21/77, 27% strike rate, 1.59 A/E, 57% profit

6:15 Brockley Rise 6:45 Ventura Island 7:15 Fox Power

Based on this, Trevor may decide to back all Rossa Ryan mounts at Wolves from now on. He may even decide to back the best value jockey at every track. This would be an example of a system. It would be a tricky system to follow, as Trevor would have to have access to, or keep updated, figures on who is the top value jockey at each track.
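To give a feel for the bookkeeping this involves, here is a minimal sketch in Python. It assumes past results sit in a hypothetical results.csv with made-up column names, and it uses plain strike rate rather than Flatstats' "best value" figure (which also brings in the odds via A/E).

```python
import pandas as pd

# Hypothetical past-results file: one row per runner, with columns
# 'track', 'jockey' and 'won' (1 if the horse won, 0 otherwise).
results = pd.read_csv("results.csv")

# Rides and wins for every jockey at every track.
strike = (results.groupby(["track", "jockey"])["won"]
                 .agg(rides="count", wins="sum")
                 .reset_index())
strike["strike_rate"] = strike["wins"] / strike["rides"]

# Drop jockeys with only a handful of rides, then keep the best
# jockey at each track -- the figures Trevor would otherwise have
# to keep refreshing by hand.
top_jockeys = (strike[strike["rides"] >= 20]
               .sort_values("strike_rate", ascending=False)
               .groupby("track")
               .head(1))
print(top_jockeys)
```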

One morning, however, especially when things have not been going too well with the system, Trevor gets up and wonders if he can integrate trainers into it. He likes following trainers and thinks they could add value to the system. The problem is that keeping updated figures for both trainers and jockeys is going to complicate things further. Also, how do you combine the two inputs? Should it be a system in which the top value jockey is riding for the top value trainer? That might be too restrictive. What about the top value jockey riding for one of the top three trainers at the track? Or perhaps trainers over a minimum strike rate at the track? You can see how just introducing one extra variable has made the possibilities far more complex.

This is where machine learning steps in. With ML we can feed values for our two variables into an algorithm. Initially these might be jockey strike rate at the track and trainer strike rate at the track. Let us say we do this for 2010 through to 2016, and each line of data looks like

10,12,0,3.5

8,9,0,10.5

12,11,1,2.6

From these first three lines of data, which we shall say are for Wolves only, we can see that the first runner had a 10% jockey on board riding for a 12% trainer (all figures for Wolves) and that the horse finished out of first place at odds of 3.5. The third line is a 12% jockey riding for an 11% trainer, and this horse won its race. There would be many lines of such data for the period, which we call the training period because we use it to train a model to (hopefully) find meaningful relationships between the two inputs and the output (win/lose). The odds would not be used in the model other than to check how much profit or loss it made. We can then get the model to predict the percentage chance that a horse with a given jockey percentage and trainer percentage has of winning a future race, and hence see how we would have done backing horses above a certain threshold.
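As a rough illustration, not a prescription for any particular package, here is a minimal sketch using pandas and scikit-learn. It assumes the comma-separated lines above sit in a hypothetical file wolves_2010_2016.csv; the column names are made up.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training file in the format shown above:
# jockey %, trainer %, won (1/0), odds.
cols = ["jockey_pct", "trainer_pct", "won", "odds"]
train = pd.read_csv("wolves_2010_2016.csv", names=cols)

X = train[["jockey_pct", "trainer_pct"]]   # the two inputs
y = train["won"]                           # the output (win/lose)

# The odds column is held back; it is only used later to check
# how much profit or loss the model's selections would have made.

# Fit a simple model relating the inputs to the chance of winning.
model = LogisticRegression()
model.fit(X, y)

# Predicted win chance for a 12% jockey riding for an 11% trainer.
runner = pd.DataFrame([[12, 11]], columns=["jockey_pct", "trainer_pct"])
print(model.predict_proba(runner)[0][1])
```

Logistic regression is just one possible choice of model here; the point is that the algorithm, not Trevor, works out how to weigh the two inputs against each other.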

Once we have trained the model we can see how it performs on new, unseen data, which we call the test data. For us this might be data for 2016 to 2018. This gives us a far more realistic idea of how well the model performs.
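Continuing the sketch above (again with a hypothetical file name and the same made-up columns), testing might look something like this, including a check of level-stake profit when backing runners the model rates above some threshold.

```python
# Later, unseen seasons in the same format as the training file.
test = pd.read_csv("wolves_2016_2018.csv", names=cols)

# Win chance predicted by the model trained on 2010-2016.
X_test = test[["jockey_pct", "trainer_pct"]]
test["win_chance"] = model.predict_proba(X_test)[:, 1]

# Back every runner rated above a chosen threshold and tot up
# level-stake profit (treating the odds column as decimal odds,
# so a winner at 3.5 returns 2.5 points).
bets = test[test["win_chance"] > 0.20]
profit = (bets["won"] * (bets["odds"] - 1) - (1 - bets["won"])).sum()
print(len(bets), "bets, level-stake profit:", round(profit, 2))
```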

Sometimes data we want to include is not in the correct form. Machine learning models in environments like Python generally need all data to be in numeric form. Suppose we included a third field, say headgear, where b means blinkered, b1 means blinkered for the first time, and so on. Here we have non-numeric data, but it is fairly straightforward to convert it into a numeric representation (a small sketch of this is given below). There is also the possibility that our two inputs are correlated, in that better jockeys tend to ride for better trainers. This highlights that careful thought is needed about which data we pick to include. These are perhaps topics for another post. This post was an attempt to clarify the difference between the three modes of bet analysis.
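As a quick postscript on the headgear point, here is a minimal sketch, with made-up codes, of one common way to turn such a field into numbers (one-hot encoding with pandas).

```python
import pandas as pd

# Made-up headgear codes: 'b' for blinkers, 'b1' for first-time
# blinkers; None for a runner wearing no headgear.
runners = pd.DataFrame({"headgear": ["b", "b1", None, "b"]})

# One-hot encoding turns each code into its own 0/1 column, which
# is the numeric form the model needs; runners with no headgear
# simply get a zero in every column.
encoded = pd.get_dummies(runners["headgear"], prefix="headgear", dtype=int)
print(encoded)
```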