I have prepared some introductory sessions on machine learning for horse racing using Python and Scikit Learn. You do not need previous experience of either tool, but it would help if you are at least familiar with some basic programming concepts. For example, it would help to know what a FOR loop and an assignment statement are, even if you met them in a language other than Python.
The main data file will be freely available until Tuesday 2nd February for those who showed an initial interest. After this it will be in the utilities section of the http://www.smartersig.com web site. A modest members fee will enable you to access it.
The instructions will be freely available to all at all times.
OK, to get started you will need to have downloaded and installed Anaconda Python v3.4; see the previous blog post, Profitable Punting with Python Intro, for details.
Once this has installed create a folder in your anaconda folder called horseracing.
All comments, questions and feedback should be posted to this blog post, that way they can act as a FAQ source.
First of all download the following zip file, double click on it to reveal all the contained files and copy them into your horseracing folder.
http://www.smartersig.com/pythonpunting.zip
The next step is to download the following file into your horseracing folder. When you click the link it will probably display the contents in your web browser. Just right click the displayed data and you will have the option to save it to a file.
This file is now housed in the utilities section of the smartersig.com web site and is called aiplus12to14.csv
You now have the required files. To get started first open a msdos command window (the black box type)
Now navigate to your anaconda folder using cd command eg cd anaconda
Kick start Ipython Notebook by typing in ipython notebook and pressing return. (note on latest version this may now be jupyter notebook)
Once notebook is loaded up you will be presented with a directory screen of folders. Double click on the horseracing folder (that you created) to go into that folder.
Now double click on the file ProfitablePuntingWithPython1.ipynb
Follow the instructions within the displayed notebook.
Thanks for this, it’s very interesting.
What would the general approach be to assigning a probability of winning to a given horse based on its data? For example could you look at x number of nearest neighbors and average their win rates? IE: find the 20 nearest neighbors and if 5 of them won return a result of 0.25 .
Or is a nearest neighbor model fundamentally a binary thing?
If that’s not the right type of approach then what would you suggest?
Thanks,
Mike
Logistic regression would give a universal zero prediction to all the fresh data, simply because it predicts chances and none of its predicted chances would be above 0.5 with such a low occurrence of 1’s in the data. You can lower this threshold so that it will predict more 1’s or, as you allude to, use the chance prediction to gauge who is most likely to win. Maybe we can go on to look at this method later.
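To make the threshold point concrete, here is a minimal sketch on made-up data. The single rating column, the roughly 10% win rate and the 0.15 cut-off are all assumptions for illustration, not values taken from the course data file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one hypothetical rating per runner and a rare win flag.
rng = np.random.default_rng(1)
n = 2000
rating = rng.normal(size=(n, 1))
p_true = 1 / (1 + np.exp(-(-2.3 + 0.4 * rating[:, 0])))  # roughly 10% winners
won = (rng.random(n) < p_true).astype(int)

model = LogisticRegression()
model.fit(rating, won)

# Column 1 of predict_proba is the estimated chance of class 1 (a win).
win_probs = model.predict_proba(rating)[:, 1]

default_picks = win_probs >= 0.5    # the usual cut-off
lowered_picks = win_probs >= 0.15   # an arbitrary lower cut-off
print("flagged at 0.5:", default_picks.sum())
print("flagged at 0.15:", lowered_picks.sum())
```

Lowering the cut-off can only add qualifiers, never remove them. Alternatively, skip the cut-off altogether and rank the runners in each race by win_probs.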
One tip: don’t cut and paste Mark’s code but type it in yourself. That way you’ll get used to Python being unforgiving about brackets and upper and lower case; TRdata is not the same as TrData. There are several ways of getting a probabilistic output. Most classification methods can output the chances of a given case being in a particular class or classes.
Steve is correct, use knn.predict_proba instead of knn.predict
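As a rough sketch of the difference between the two calls, the snippet below fits a 20 neighbour KNN on made-up data and asks for both hard 0/1 predictions and win probabilities. The two feature columns, the synthetic data and the choice of 20 neighbours are assumptions for illustration only, not the fields in aiplus12to14.csv or the settings in the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the racing data: two made-up rating columns
# and a win flag (1 = won, 0 = lost).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 2.0).astype(int)

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)

new_runners = np.array([[1.5, 0.0], [-1.0, 0.5]])

# predict gives a 0/1 label by majority vote among the 20 neighbours.
hard_calls = knn.predict(new_runners)

# predict_proba gives one column per class; column 1 is the share of
# the 20 neighbours that won, e.g. 5 winners out of 20 gives 0.25.
win_probs = knn.predict_proba(new_runners)[:, 1]
print(hard_calls, win_probs)
```

With the default uniform weighting the probabilities can only move in steps of 1/K, which is why a larger K gives a finer grain.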
The probabilities do not seem very promising within this data set and method. I increased K to give a finer grain of probabilities but it produced nothing promising. Maybe if you have a play around with other values of K you might find some value.
Another problem is that there are far more losers than winners. This means a model will look successful if it just says everything loses. One way round this is to make a data set with all the winners (8102) and, say, twice as many losers, with the losers being selected at random from the main data set. Running the model on this data will encourage the finding of winning patterns, as 1/3 of the data set are winners.
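The resampling idea above might look something like this in pandas. The frame, the column names rating and won, and the 10% win rate are invented for the sketch; with the real file you would substitute its own win column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the main data set (hypothetical column names).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "rating": rng.normal(size=1000),
    "won": (rng.random(1000) < 0.1).astype(int),
})

winners = df[df["won"] == 1]
losers = df[df["won"] == 0]

# Keep every winner and draw twice as many losers at random.
n_losers = min(2 * len(winners), len(losers))
sampled_losers = losers.sample(n=n_losers, random_state=0)

# Combine and shuffle so winners and losers are interleaved.
balanced = pd.concat([winners, sampled_losers]).sample(frac=1, random_state=0)
print(balanced["won"].mean())
```

Bear in mind that a model trained on the balanced set sees winners far more often than it will in real racing, so any probabilities it outputs will be inflated relative to the true win rate.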
I see the value of this exercise from the point of view of learning the programming, but is there any point in trying to make a binary distinction between winners and losers from a betting point of view? Surely what you really want to know is what a horse’s probability of winning is relative to its odds, otherwise all you’re finding out is whether 50% of similarly categorised horses won – which is not at all useful without reference to their prices.
My other observation is that if I understand this approach correctly it is making predictions based on tiny samples – setting k to 7 means you’re making a prediction based on just 7 previous runners with similar data profiles. I may have understood the process wrong but if not then this seems woefully inadequate to me.
My method is simply to apply straight forward regression to my data. This works well, but I am curious to know whether a machine learning approach could yield superior predictions – hence my interest in the subject.
My hunch is that in practice if a predictor (eg trainer form) has sufficient predictive power to be useful for betting purposes this should be evident from normal regression techniques, but I can see ML as being valuable in optimising predictions made on the basis of multiple predictors, especially as it can be difficult to unwrap the dependencies/independencies in the data.
I agree Mike, it is easy to get too wrapped up in profitability at this stage, whereas the purpose is simply to introduce Sklearn and Python via a simple ML algorithm. The calls to other algorithms use essentially the same syntax, so plugging in different ML algorithms is fairly straightforward. My gut feeling is that R is a better language to use, as you do not have the bones of Pandas and Numpy sticking out as we tend to do with Python, and it handles NaN’s whereas Sklearn gets upset. But I am interested because I want to keep data collection, analysis and bet execution under one roof, and although you can do all three in R I suspect Python offers a simpler transition, certainly for me as I already bet in Python.
I have also made no effort to select profitable variables eg TR strike rate, I really just picked some based on the good old AI ratings with an extra variable to make them different. Again this was because it was the method that I was interested in looking at rather than trying to develop a winning model.
I guess also Mike that you could reduce the number of losers in the training set as Steve suggests or when it comes to voting interrogate the percentage chance of winning as highlighted in the other comment. If you take a lower percentage than 50% as indicating a winner you will have more qualifiers and hence you can increase the value of K.
This still does not address your question of wanting to model chances in relation to odds. As I hinted earlier, this simple model does not look like it models chances all that well, but that does not mean that some other set of inputs would not. At least now you can try them, or perhaps select a richer set of inputs and use logistic regression.
I’ve just been looking this up and found that there is such a thing as k-nn regression, which takes an average of the nearest neighbours’ outcomes. That would be one approach to producing a probabilistic output, albeit maybe not appropriate for this particular data set because of the sample size issue.
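For anyone who wants to try k-nn regression, a minimal sketch using sklearn’s KNeighborsRegressor follows. The data is synthetic and the value of k is arbitrary; both are assumptions for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: two made-up features and a 0/1 win flag stored as floats.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
won = (X[:, 0] + rng.normal(scale=2.0, size=500) > 2.0).astype(float)

k = 20
knn_reg = KNeighborsRegressor(n_neighbors=k)
knn_reg.fit(X, won)

# The prediction is the mean of the k nearest neighbours' 0/1 outcomes,
# so it can be read as an estimated win rate.
new_runners = np.array([[1.5, 0.0], [-1.0, 0.5]])
est_win_rate = knn_reg.predict(new_runners)
print(est_win_rate)
```

Because the target is 0/1, the average of the neighbours’ outcomes is simply the fraction of neighbours that won.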
One concern I have about the validity of K-nn for this task is that I can’t see any weighting between the classifiers. Clearly the single classifier that will best predict the outcome is the BFSP. If I’m understanding k-nn right, then its selection of the “neighbours” doesn’t take into account something that we know to be causal, so it may select very different horses as being neighbours based on the similarity of classifiers that have relatively weak predictive value. Please correct me if I’m understanding k-nn wrongly.
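Plain KNN does treat every input column equally when measuring distance, so whichever column has the biggest numeric spread tends to dominate. One possible workaround, sketched below, is to standardise the columns and then stretch the ones you trust more. The column names, the synthetic data and the weights of 3 and 1 are hypothetical choices for illustration, not something taken from the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data: log of BFSP plus a made-up trainer strike rate column.
rng = np.random.default_rng(4)
n = 400
log_bfsp = rng.normal(loc=2.0, scale=0.8, size=n)
trainer_sr = rng.normal(loc=10.0, scale=3.0, size=n)
X = np.column_stack([log_bfsp, trainer_sr])
# Win chance set to roughly 1 / BFSP for the purposes of the demo.
won = (rng.random(n) < np.clip(np.exp(-log_bfsp), 0, 1)).astype(int)

# Step 1: put both columns on the same scale (mean 0, std 1).
scaler = StandardScaler().fit(X)

# Step 2: stretch the column we believe matters more (hypothetical weights).
feature_weights = np.array([3.0, 1.0])
X_weighted = scaler.transform(X) * feature_weights

knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(X_weighted, won)

# New runners must go through the same scaling and weighting.
new_runner = scaler.transform(np.array([[1.1, 12.0]])) * feature_weights
print(knn.predict_proba(new_runner)[:, 1])
```

A difference of one standard deviation in log BFSP now counts three times as much towards the distance as the same difference in trainer strike rate.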
A quick look at Wikipedia and YouTube has me wondering about a random forest type approach, which as I understand it is an averaging of a lot of decision trees. Decision trees seem like a good way to go because they evaluate the predictive strength of the available classifiers before splitting. I’m gradually working my way through the various types of ML regression, so in an hour’s time I may have changed my mind though...
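For reference, a random forest in sklearn uses the same fit and predict_proba calls as KNN, and it also reports how much each input contributed to the splits. This is only a sketch on synthetic data; the column names and the settings (50 trees, at least 20 runners per leaf) are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: log BFSP carries the signal, the second column is pure noise.
rng = np.random.default_rng(5)
n = 1000
log_bfsp = rng.normal(loc=2.0, scale=0.8, size=n)
noise_feature = rng.normal(size=n)
X = np.column_stack([log_bfsp, noise_feature])
won = (rng.random(n) < np.clip(np.exp(-log_bfsp), 0, 1)).astype(int)

forest = RandomForestClassifier(
    n_estimators=50,       # number of trees to average
    min_samples_leaf=20,   # stop each leaf resting on a handful of runners
    random_state=0,
)
forest.fit(X, won)

# Impurity-based importances, one per input column, summing to 1.
importances = dict(zip(["log_bfsp", "noise_feature"], forest.feature_importances_))
print(importances)

# Averaged win probability across the trees for the first five runners.
win_probs = forest.predict_proba(X[:5])[:, 1]
print(win_probs)
```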
By the way – thanks for your NG API intro posts. Shortly after the API NG transition they helped me make the switch from botting with VBA via Gruss to calling the API directly using Python which has allowed me to collect much richer data. This has been great for me and I really appreciate your help with it.
“One concern I have about the validity of K-nn for this task is that I can’t see any weighting between the classifiers. Clearly the single classifier that will best predict the outcome is the BFSP. If I’m understanding k-nn right, then its selection of the “neighbours” doesn’t take into account something that we know to be causal, so it may select very different horses as being neighbours based on the similarity of classifiers that have relatively weak predictive value. Please correct me if I’m understanding k-nn wrongly.”
There are shed loads of issues about this; this is a programming exercise, so that is to be expected.
More suggestions from a statistical point of view. Beware of using BFSP as is, because it has a very skewed distribution and can throw off algorithms for finding a best fit, especially regression. You could replace it with the implied probability, i.e. 4 becomes 1/4 = 0.25 and 10 becomes 1/10 = 0.1, or just take the log of it. On the implied probability scale a 66 BSP and a 100 BSP are more similar than a 2 and a 2.5 BSP, which intuitively feels right, and the log also shrinks the gap between big prices considerably. A useful notebook would be one that graphed all the variables and showed their mutual correlations. This is a first step in model selection.
I am not a great fan of trying a sequence of models to see what works best. Seeing the shape and relationships of the data is a good starting point.
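A small sketch of the two transformations and a correlation table, using pandas. The four prices are the ones mentioned above; the second frame and its column names are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# The four prices discussed above.
prices = pd.DataFrame({"bfsp": [2.0, 2.5, 66.0, 100.0]})
prices["implied_prob"] = 1.0 / prices["bfsp"]
prices["log_bfsp"] = np.log(prices["bfsp"])
print(prices)

# Gap between 2 and 2.5, and between 66 and 100, on each scale.
gaps = pd.DataFrame({
    "2 vs 2.5": (prices.iloc[1] - prices.iloc[0]).abs(),
    "66 vs 100": (prices.iloc[3] - prices.iloc[2]).abs(),
})
print(gaps)

# Mutual correlations on a synthetic frame (hypothetical columns).
rng = np.random.default_rng(6)
runners = pd.DataFrame({"bfsp": np.exp(rng.normal(2.0, 0.8, size=500))})
runners["implied_prob"] = 1.0 / runners["bfsp"]
runners["rating"] = rng.normal(size=500)
corr = runners.corr()
print(corr)
```

On the raw scale the 66 versus 100 gap dwarfs the 2 versus 2.5 gap; on the implied probability scale the order is reversed. For the graphs, pandas.plotting.scatter_matrix(runners) is one option if matplotlib is installed.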
Oh and also by the way – I hope you don’t think I’m being negative because I absolutely understand that the purpose of this is as a learning exercise and I think that’s great.
I didn’t imagine for a moment that the computer was going to magically spit out a winning strategy based on the data provided! I assume that you have data that you know provides predictive value, but that you’ve got more sense than to post it up for the world to see, so you’re very helpfully providing something bland just as a file to work with. I get that, and I think it’s commendable. Many thanks.
Not at all Mike, I welcome constructive ideas. My strengths, if I have any, are more in programming than ML, so I am learning as well as we go along. It’s late so I will respond to your first message later. I think the KNN algorithm has limitations for sure, but I wonder if its simplicity can be predictive in pattern matching areas rather than probability prediction. For example, one area I am currently looking at is trainer patterns and whether a KNN approach can spot patterns in what we are told are creatures of habit.
Note: as sklearn evolves, one or two module paths get changed. For example
from sklearn.cross_validation import train_test_split
is now
from sklearn.model_selection import train_test_split
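If you want a notebook that runs on both old and new installs, one option is to try the new location first and fall back to the old one. This is just a sketch; the toy arrays and the 25% test size are arbitrary.

```python
import numpy as np

try:
    # Newer sklearn releases (0.18 onwards).
    from sklearn.model_selection import train_test_split
except ImportError:
    # Older releases that still ship the cross_validation module.
    from sklearn.cross_validation import train_test_split

# Toy data: 8 rows, 2 columns, alternating 0/1 labels.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```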