A machine learning algorithm such as a neural network takes a set of data along with a target feature and attempts to find relationships between the inputs and the target. For example, let’s look at a real-life case: the Titanic data set. Here we may have the following inputs in our data:

Class of ticket, sex, port of embarkation, age of the passenger and so on.

The output feature is whether the passenger survived or died.

Feeding this data to a machine learning algorithm, we hope that, with a little help from us, the algorithm can model the data and then make accurate predictions of whether a passenger survived on fresh Titanic data that we held back. With horse racing we are trying to predict whether a horse will win or lose, and of course our input features will be very different.

The question often pondered is whether we should include the starting price (SP) or Betfair starting price (BFSP) of the horse among the model’s input features. After all, we are told the market carries a wealth of information, some public and some not. The problem with including the price of a horse as an input feature is that the SP is such a good predictor of the chance of winning that the ML algorithm will largely ignore all your other inputs and follow the SP as its main predictor. If life were that easy we would just go ahead and back all odds-on shots. We can see from the MySportsAI output that, with three input features (trainer strike rate, jockey strike rate and BFSP), the feature importance plot at the bottom shows BFSP dwarfing jockey strike rate, while trainer strike rate is barely visible.
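To see why the price is such a direct stand-in for the chance of winning, here is a minimal sketch of turning a BFSP quoted as decimal odds into an implied chance. The function name implied_probability is my own for illustration, and the calculation ignores commission and any overround.

```python
def implied_probability(decimal_odds: float) -> float:
    """Implied chance of winning from decimal odds.

    Simplifying assumption: commission and overround are ignored.
    """
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


# Example prices chosen purely for illustration.
for bfsp in (1.5, 2.0, 5.0, 21.0):
    chance = implied_probability(bfsp)
    # A price below 2.0 is odds on, i.e. an implied chance above 50%.
    label = "odds on" if bfsp < 2.0 else "evens or odds against"
    print(f"BFSP {bfsp:5.1f} -> implied chance {chance:.3f} ({label})")
```

An algorithm handed this number is effectively handed the market’s own probability estimate, which is why it leans on it so heavily.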

So how can we utilize BFSP without it dominating the attention of our algorithm? One approach is to use a two-step process, with the data split into quarters. In the first step we train a model on the fundamental features (in the above example, trainer strike rate and jockey strike rate) using the first quarter of our data. We then use this model, which I will call model 1, to predict winning probabilities on the second quarter of data and combine these predictions with the BFSP from the second quarter. We now train a new model, let’s call it model 2, on this second-quarter data, which contains the predictions derived from model 1 and the BFSP. The BFSP in this step may have been massaged into the natural log of the implied chance of the BFSP, but let’s not worry about that for now. We can now test model 2 on the third quarter, having first created its inputs by running model 1 on the third quarter and combining the predictions with BFSP. After perhaps tuning the hyperparameters of model 2, we can do a final test on the fourth quarter.
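The two-step process can be sketched roughly as follows. This is only an illustration under several assumptions of mine: the data is synthetic and entirely made up, both models are scikit-learn logistic regressions (the approach itself does not require that), the BFSP enters model 2 as the natural log of its implied chance, and log loss is used as the score. None of this is how MySportsAI implements it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 8000

# Synthetic stand-in data, invented purely for illustration.
trainer_sr = rng.uniform(0.02, 0.30, n)   # trainer strike rate
jockey_sr = rng.uniform(0.02, 0.30, n)    # jockey strike rate
hidden = rng.normal(0.0, 1.0, n)          # information only the market sees
logit = -3.0 + 4.0 * trainer_sr + 4.0 * jockey_sr + hidden
true_p = 1.0 / (1.0 + np.exp(-logit))
won = (rng.random(n) < true_p).astype(int)

# Pretend the BFSP is a slightly noisy version of the fair decimal odds.
market_p = np.clip(true_p * np.exp(rng.normal(0.0, 0.1, n)), 0.01, 0.99)
bfsp = 1.0 / market_p

fundamentals = np.column_stack([trainer_sr, jockey_sr])
log_implied = np.log(1.0 / bfsp)          # natural log of the implied chance

# Split the rows into four quarters, in order.
q1, q2, q3, q4 = np.array_split(np.arange(n), 4)

# Step 1: model 1 sees only the fundamental features, on the first quarter.
model1 = LogisticRegression().fit(fundamentals[q1], won[q1])


def model2_inputs(rows):
    """Model 1's predicted win probability alongside the log implied chance."""
    p1 = model1.predict_proba(fundamentals[rows])[:, 1]
    return np.column_stack([p1, log_implied[rows]])


# Step 2: model 2 is trained on the second quarter.
model2 = LogisticRegression().fit(model2_inputs(q2), won[q2])

# Third quarter for testing and tuning, fourth quarter for the final test.
val_loss = log_loss(won[q3], model2.predict_proba(model2_inputs(q3))[:, 1])
test_loss = log_loss(won[q4], model2.predict_proba(model2_inputs(q4))[:, 1])
print(f"third-quarter log loss:  {val_loss:.4f}")
print(f"fourth-quarter log loss: {test_loss:.4f}")
print("model 2 coefficients (model 1 prediction, log implied chance):", model2.coef_[0])
```

The point of the layout is that model 1 never sees the price, and model 2 never sees the rows model 1 was trained on, so the fundamentals are summarised into one prediction before the market gets a say.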

The idea behind this process is that the fundamental features, e.g. trainer and jockey strike rate, get a chance to be heard in the first model build before being combined with BFSP in model 2. You will often find with this process that input features that were significant in model 1 are no longer significant in model 2, simply because the betting public has already accounted for them in the BFSP. In Sung’s paper on this subject the jockey lost significance but the draw remained significant.

I plan to implement an automated two-step facility in MySportsAI in the near future.