I have covered machine learning previously and illustrated some concepts via the K Nearest Neighbors (KNN) algorithm. KNN is often used as a starter example because its underlying principles are easy to understand, even though Python's scikit-learn machine learning package is going to do all the heavy lifting for you. An inevitable question, however, is whether a simple algorithm like KNN can yield any favourable results. I found that it can, in one particular area I have been playing around with.
I decided to look into whether backing my top rated flat handicap ratings could be improved upon if I applied some sort of simple odds line to them. To do this I first organised my ratings so that a complete race was one line of data, with each predictor field being the difference between the top rated horse and another horse in the race. As an example, here are a couple of sample lines:

diff1 diff2 diff3 diff4 diff5 diff6 diff7 diff8 diff9 diff10 diff11 diff12 diff13 diff14 diff15 finpos bfsp

0.147760456 0.187955293 0.18819821 0.197145752 0.197400238 0.350482684 0.412356588 99 99 99 99 99 99 99 99 1 4.69

0.163702041 0.179318767 0.257880117 0.428371173 0.464780245 99 99 99 99 99 99 99 99 99 99 0 5.87

As you can probably see, I decided to deal initially only with handicaps of up to 16 runners. On the first line of data the second top rated horse is 0.147760456 behind the top rated, the third top rated is 0.187955293 behind the top rated, and so on. If there are fewer than 16 runners the remaining feature places are padded out with the number 99. The first line in the data was a winner at a BFSP (Betfair Starting Price) of 4.69, whereas the second line did not win at a BFSP of 5.87. The model was trained using the rating differences as inputs (not BFSP), with win/lose (the finpos column, 1 for a winner and 0 otherwise) as the output.
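To make the layout concrete, here is a minimal sketch of how one such line of features might be built from a race's ratings. The function name build_feature_row and the sample ratings are made up for illustration, and the sketch assumes that a higher rating is better; it is not the code that produced the data above.

```python
PAD_VALUE = 99      # filler used for feature places with no runner
MAX_RUNNERS = 16    # only handicaps of up to 16 runners are considered


def build_feature_row(ratings):
    """Turn one race's ratings into 15 'distance behind the top rated' features."""
    if not 2 <= len(ratings) <= MAX_RUNNERS:
        raise ValueError("expected between 2 and 16 runners")
    ranked = sorted(ratings, reverse=True)   # assumption: higher rating = better
    top = ranked[0]
    # Difference between the top rated and each other horse, closest first
    diffs = [round(top - r, 9) for r in ranked[1:]]
    # Pad the unused places so every race has the same number of features
    padding = [PAD_VALUE] * (MAX_RUNNERS - 1 - len(diffs))
    return diffs + padding


# Hypothetical three-runner race
row = build_feature_row([1.00, 1.50, 1.25])
print(row)
```

Every race ends up with 15 features however many runners it had, which is what lets a single model be trained across races of different sizes.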

I repeatedly split the data into 80%/20% partitions, training the model on the 80% and then predicting with the trained model on the 20%. I did this 20 times, each time using a different 80/20 split, with the rows for each partition chosen at random from within the file. scikit-learn, a Python ML library, does all this for you.
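One way to get this kind of repeated random split in scikit-learn is ShuffleSplit. The sketch below runs on synthetic stand-in data, not my ratings, and the random seeds are arbitrary; treat it as an illustration of the procedure rather than the exact script used.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data: 500 races, 15 difference features, win/lose flag
X = rng.random((500, 15))
y = (rng.random(500) < 0.25).astype(int)

# 20 independent random 80/20 splits of the rows
splitter = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

fold_sizes = []
for train_idx, test_idx in splitter.split(X):
    model = KNeighborsClassifier(n_neighbors=20)
    model.fit(X[train_idx], y[train_idx])
    # Column 1 holds the predicted probability of class 1 (a winner)
    win_prob = model.predict_proba(X[test_idx])[:, 1]
    fold_sizes.append((len(train_idx), len(test_idx), len(win_prob)))

print(fold_sizes[0])
```

Because each split is drawn independently, the same race can land in the test partition more than once across the 20 runs.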

The model was a K nearest neighbors algorithm with the number of neighbors set to 20, and predict_proba was used for predicting, which means the algorithm predicts a probability of a given line being a winner. If you remember the details of KNN from the previous blog entry, you will remember that it does this by finding the 20 nearest matches to the line in question and then basing a prediction on these.
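The idea of basing a probability on the nearest matches can be sketched by hand in a few lines. This is a simplified illustration that assumes straight-line (Euclidean) distance and gives every neighbor an equal vote, which are scikit-learn's defaults for KNeighborsClassifier; the function name and the tiny data set are made up.

```python
import math


def knn_win_probability(train_rows, train_wins, query_row, k=20):
    """Fraction of winners among the k training races most similar to query_row."""
    distances = [
        (math.dist(row, query_row), won)
        for row, won in zip(train_rows, train_wins)
    ]
    distances.sort(key=lambda pair: pair[0])   # most similar first
    nearest = distances[:k]
    return sum(won for _, won in nearest) / len(nearest)


# Tiny made-up example with two features and k=3
train_rows = [(0.1, 0.2), (0.15, 0.25), (0.9, 1.0), (0.12, 0.22), (1.1, 1.3)]
train_wins = [0, 0, 1, 1, 1]
prob = knn_win_probability(train_rows, train_wins, (0.11, 0.21), k=3)
print(prob)
```

One consequence of equal votes is that with 20 neighbors the probability can only move in steps of 0.05, and it is exactly zero whenever none of the 20 nearest races was a winner.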

After using KNN I ran a Random Forest with n_estimators set to 100 to compare with the KNN. If KNN is the simpleton of the ML family then I would hope to see some improvement with the Random Forest algorithm.
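For reference, a Random Forest with 100 trees can be fitted through the same scikit-learn interface as the KNN. The sketch below uses synthetic stand-in data and arbitrary seeds, and shows a single 80/20 split only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-in data: 400 races, 15 difference features, win/lose flag
X = rng.random((400, 15))
y = (rng.random(400) < 0.25).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
# Probability of class 1 (a winner) for each race in the 20% partition
rf_win_prob = forest.predict_proba(X_test)[:, 1]
print(rf_win_prob[:5])
```

Since both models expose fit and predict_proba, swapping one for the other leaves the rest of the analysis unchanged.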

RESULTS for top rated using Probability odds

KNN Oddsline 38886 bets PL +4339 ROI +11.15% all after comm
All Top Rated 66765 bets PL +5435 ROI +8.14%

RF Oddsline 42270 bets PL +4356 ROI +10.3%
All Top Rated 69960 bets PL +5133 ROI +7.33%

The number of bets for all top rated varies between the two groups. I presume this is because the KNN produces more probabilities of zero, and these are excluded from the analysis: converting a zero probability to odds means dividing by zero, which would crash the program.

Interestingly, the humble KNN did as well as, if not better than, the Random Forest, and both showed an improved ROI% over backing the top rated blind.
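Turning probabilities into an odds line and a profit figure might look something like the sketch below. Several things in it are assumptions rather than a description of my program: the rule of betting only when BFSP is bigger than the model's odds, the 1 point level stakes, the 5% commission on winnings, and the function name and sample numbers.

```python
def oddsline_profit(win_probs, bfsp, won, commission=0.05):
    """Back the top rated when BFSP beats the model's odds; return (bets, profit)."""
    bets = 0
    profit = 0.0
    for p, price, is_winner in zip(win_probs, bfsp, won):
        if p <= 0:
            continue              # no odds can be formed from a zero probability
        model_odds = 1.0 / p      # decimal odds implied by the model
        if price <= model_odds:
            continue              # assumed rule: only bet when the price is bigger
        bets += 1
        if is_winner:
            # Winnings on a 1 point stake, less the assumed commission
            profit += (price - 1.0) * (1.0 - commission)
        else:
            profit -= 1.0
    return bets, profit


# Made-up probabilities, prices and results for four top rated horses
bets, profit = oddsline_profit(
    win_probs=[0.25, 0.0, 0.10, 0.30],
    bfsp=[4.69, 5.87, 6.0, 5.0],
    won=[1, 0, 0, 0],
)
roi = 100.0 * profit / bets
print(bets, profit, roi)
```

The zero probability line is dropped before any division happens, which is why such lines never show up in the bet counts.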

The next step would perhaps be to check for the optimum setting for K (20 so far), in other words how many nearest neighbors it should look at in order to derive a probability. Remember that nearest neighbor means most similar, not nearest in physical location. A quick check with 40:

KNN Oddsline 38884 bets PL +4496.86 ROI +11.56%
All Top Rated 66698 bets PL +5646.92 ROI +8.47%

And with 60:

KNN Oddsline 41464 bets PL +5189.60 ROI +12.52%
All Top Rated 69962 bets PL +5479.05 ROI +7.83%

And with 80:

KNN Oddsline 42170 bets PL +4724.56 ROI +11.20%
All Top Rated 69987 bets PL +5841.29 ROI +8.35%
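Checking several values of K is just a loop around the same fit and predict steps. The sketch below uses synthetic stand-in data and only 5 splits to keep it quick, and as a simple measure it records the share of zero probabilities at each K, since those are the lines that drop out of the bet counts. The measure and all names are my own choices for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
# Synthetic stand-in data: 500 races, 15 difference features, win/lose flag
X = rng.random((500, 15))
y = (rng.random(500) < 0.25).astype(int)

# A fixed seed means every K is tested on the same set of 80/20 splits
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=2)

zero_share = {}
for k in (20, 40, 60, 80):
    zeros = 0
    total = 0
    for train_idx, test_idx in splitter.split(X):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X[train_idx], y[train_idx])
        win_prob = model.predict_proba(X[test_idx])[:, 1]
        zeros += int((win_prob == 0).sum())   # lines with no usable odds
        total += len(win_prob)
    zero_share[k] = zeros / total

print(zero_share)
```

On real data the same loop could record bets, PL and ROI for each K instead, which is the comparison shown in the figures above.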