I hope you got something out of the intro session on machine learning and Python. I think KNN is often used as a first example of ML because it is fairly easy to get your head around what is happening under the bonnet. This raises the question: how important is it to know what is going on under the bonnet? Can we treat these ML algorithms as black boxes, or do we need some understanding of the underlying mechanism?

Part of an answer to this question came when I progressed further with my ML investigation. I first thought that a KNN model might be well suited to trainer patterns. Let’s face it, we are always wondering whether a certain trainer does well with certain characteristics of a horse’s profile, and are we not always told that trainers are creatures of habit, repeating the same winning strategies time and again?

I did a rundown of the trainers with the most prolific number of runners in handicaps. The first thing that struck me about this data was the U-shaped curve of the losses from blindly backing their runners. In other words, if you backed all runners, trainers with poor strike rates lost you more than medium-strike-rate trainers, but returns deteriorated again for high-strike-rate trainers. The top trainers are, I presume, so well known that the public simply overbet them. I therefore decided to select the 10 trainers from the sweet zone in the middle who had the most runners in handicaps.

Using the same data from the previous exercise, minus the trainer strike rate, I discovered, alas, no significant profitable trends from the trainers; the model simply performed poorly. The exercise was not a complete waste of time, however. If you are interested in focusing on the habits and run styles of a select few trainers, I would suggest looking into that middle sweet spot. It may be that trainers outside the top runner counts are easier to turn a profit from.

One more green light went on in my head as a result of the KNN exercises, and this proved far more promising; in fact, that may turn out to be a gross understatement.

I decided to take a look at the old chestnut of predicting when a price during the live 10-minute betting show is a price that will beat Betfair SP. I was looking at prices from UK flat handicaps, taken every 7 seconds. The data used in the model was purely technical. Investopedia defines technical analysis thus:

“A method of evaluating securities by analyzing statistics generated by market activity, such as past prices and volume. Technical analysts do not attempt to measure a security’s intrinsic value, but instead use charts and other tools to identify patterns that can suggest future activity.”

This is in contrast to fundamental analysis, where factors external to the market are considered. An example in horse racing would be the jockey or trainer of a horse.
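To make the distinction concrete, here is a sketch of what a purely technical feature row might look like, built only from the price ticks themselves. The specific features (recent change, total drift, range) are my own illustrative assumptions; the article does not spell out the exact feature set used.

```python
# Sketch: purely technical features derived from live-show prices sampled
# every 7 seconds. No fundamental information (jockey, trainer, form) is used.
# The particular features below are illustrative assumptions.

def technical_features(prices):
    """Build a feature dict from the price series alone."""
    last = prices[-1]
    return {
        "last_price": last,
        "change_30s": last - prices[-5],         # move over the last ~4 ticks
        "change_total": last - prices[0],        # drift since the first show
        "volatility": max(prices) - min(prices), # simple range measure
    }

ticks = [8.0, 7.8, 7.6, 7.8, 7.4, 7.2, 7.0]  # hypothetical 10-minute show excerpt
features = technical_features(ticks)
print(features["change_total"])  # -1.0: the horse has shortened a full point
```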

The good thing about this data, in contrast to the previous sessions, is that it was far more balanced in terms of outcomes. Around 47% of prices are inferior to Betfair SP in the range of under 10.0 (which is where I focused).

With K set at 11, the model produced 131,474 selections, of which 64.5% were correct in that the price taken beat or equalled Betfair SP. This number of selections would come from approximately 3 weeks of racing. Yes, it is a lot, but the model will make multiple suggestions on the same horse in the same race if, at each sampled point, the price is deemed a good bet to beat SP.
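The KNN setup described above can be sketched with scikit-learn's `KNeighborsClassifier` and `n_neighbors=11`. The real price data is obviously not reproduced here, so this runs on synthetic stand-in features; it only illustrates the shape of the pipeline, not the reported results.

```python
# Sketch of a k=11 KNN classifier predicting whether a taken price beats SP.
# The features and labels below are synthetic stand-ins, not the real data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))                               # stand-in technical features
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)  # 1 = beat or equalled SP

knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X[:800], y[:800])              # train on the first 800 rows
predicted = knn.predict(X[800:])       # predict on the 200 hold-out rows
accuracy = (predicted == y[800:]).mean()
print(f"hold-out accuracy: {accuracy:.1%}")
```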

This, to my naive non-trading eyes, looked quite promising, but things got better on two fronts. First of all, I substituted the last live show for Betfair SP; after all, it is unlikely, from a trading perspective, that one would trade to BFSP. You would have to guess the amount and be in danger of cannibalising your own SP. Also, given previous revelations about BFSP and the fact that around 40% of books to BFSP are under 100%, it seemed logical that the final show would be a better proposition than BFSP. That proved to be the case, but not by huge amounts; it added about 0.5% to the correct prediction score.
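For readers unfamiliar with the "book under 100%" idea: a book percentage is the sum of implied probabilities (1 divided by the decimal odds) across all runners, and a book under 100% means the prices are collectively generous to backers. A minimal sketch, using hypothetical odds for a six-runner race:

```python
# Sketch: the "book percentage" for a race is the sum of implied
# probabilities (1 / decimal odds) over all runners. Under 100% means
# the prices collectively favour backers.
def book_percentage(decimal_odds):
    return sum(1.0 / o for o in decimal_odds) * 100

# Hypothetical BFSP odds for a six-runner race
odds = [3.5, 5.0, 6.0, 8.0, 12.0, 20.0]
print(f"{book_percentage(odds):.1f}%")  # prints 91.1% -- a backer-friendly book
```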

The second adjustment provided a more significant improvement: using a Random Forest algorithm instead of KNN upped the correct prediction rate to 69%.

Random Forests are an ensemble modelling method. They still use decision trees to make predictions, but they build a number of trees and then aggregate the results from the individual trees into a final answer. This makes them far less susceptible to overfitting the data. I also used the RF as a simple black box. Python is useful in this way, in that you can plug the alternative model into your existing code, as shown below.

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)

rf.fit(data_train, target_train)  # train on the same arrays the KNN model used
predicted = rf.predict(data_test)

The rest of the code is essentially the same.

I would welcome the thoughts of any traders out there on this topic.