Category Archives: Profitable Punting with Python

Brier Skill Score and Horseracing

Saturday 11 Jan 2020

Posted by smartersig in Profitable Punting with Python, Uncategorized

3 Comments

Tags

Machine Learning and horse racing, tweet

When developing machine learning models for horse racing we quite rightly need some way to evaluate how successful they are. Horse race betting is a bit different to applications like face recognition or breast cancer diagnosis, which are all about accuracy of predictions. With betting, accuracy clearly has a part to play, but it's not the complete picture: a less accurate model (in terms of predicting winners) can be more profitable, and for that reason we tend to focus on profit. The two most common profit measurements are flat stake profit and variable stake profit. The former simply means putting £1 on every selection, for example the top rated in our ratings. Variable staking means we place a stake sized to win £1 relative to the odds, so for example a 2/1 shot would have a bet of 50p placed on it. Of course in both cases the stake can be whatever you want it to be.

The advantage of variable stake monitoring is that it is not prone to inflation from one or two big-priced winners, which may give you a never-to-be-repeated profit that sends you skipping off to remortgage your house. Because the stake shrinks as the odds lengthen, variable staking does not suffer from this and gives a more realistic impression of possible future performance.
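To make the two staking plans concrete, here is a minimal sketch of my own (not from any of the data in this post), assuming decimal odds, so that a 2/1 shot is 3.0:

def flat_profit(decimal_odds, won, stake=1.0):
    ## level stakes: win stake*(odds-1) if the horse wins, otherwise lose the stake ##
    return stake * (decimal_odds - 1) if won else -stake

def variable_profit(decimal_odds, won, target=1.0):
    ## stake sized to win `target`: stake = target / (odds - 1), so 50p on a 2/1 shot ##
    stake = target / (decimal_odds - 1)
    return target if won else -stake

print(flat_profit(3.0, True))       ## 2.0 ##
print(variable_profit(3.0, False))  ## -0.5, a losing 2/1 shot costs the 50p stake ##

Summing variable_profit over a set of results gives the variable stake profit discussed above.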

So what about the more traditional machine learning performance metrics? Should we bin them when developing ML models and simply focus on profit/loss? Probably not: a mixture of metrics can give us more confidence, if all of them are showing signs of improvement over a rival model.

Horse racing models often have a degree of imbalanced data. That is to say, the thing we are trying to predict (win or lose) usually contains far more zeros than ones; after all, our lines of horse data will clearly contain more losers than winners unless we have engineered the data in some way.

One metric that is useful for imbalanced data sets is the Brier score, and what I am about to describe is its close cousin, the Brier Skill Score.

First of all, what is a Brier score? Imagine we have a three-horse race with the following data:

horse, model probability, W/L (1 means won, 0 means lost)

Al Boum Photo, 0.5, 1
Lost In Translation, 0.3, 0
Native River, 0.2, 0

So our model gave Al Boum Photo a 0.5 chance and he won the race.

The Brier score for these 3 lines of data would be

((0.5 – 1)^2 + (0.3 – 0)^2 + (0.2 – 0)^2) / 3 = (0.25 + 0.09 + 0.04) / 3 = 0.1267

Where ^2 simply means ‘squared’.

Looking at the above you can hopefully see that if the lower-rated horses tend to lose and the higher-rated horses tend to win, we will get a lower Brier score than if races were predicted the other way round. This is why a lower Brier score means a ‘better’ score.

Next up is the Brier Skill Score (BSS). This measures the Brier score against some baseline measure; after all, stating that the score above is 0.1267 does not give you an instinctive feeling for how good or bad it is. We just know it's better than 0.2, for example.

The BSS is calculated by first working out some sort of baseline measure we can compare against. In this case we will opt for a baseline of simply predicting every horse with a value of 0.33. Why 0.33? Because that is the percentage of 1s in the sample set (one winner among three runners). Obviously across many races this will come out more like 0.1 or thereabouts. With 0.33 for every horse we can now calculate a baseline Brier score; what we are doing is using the average likelihood as the prediction probability for each horse. Substituting this in we get

((0.33 – 1)^2 + (0.33 – 0)^2 + (0.33 – 0)^2) / 3 = (0.4489 + 0.1089 + 0.1089) / 3 = 0.2222

Now to calculate the BSS we divide the model's Brier score by the naive prediction's Brier score and then subtract the result from 1

1 – (0.1267 / 0.2222) ≈ 0.43

Negative values mean the model has less predictive value than the naive baseline probabilities; positive values (max = 1) mean the model is beating the naive baseline predictions. Our single sample three-horse race is clearly kicking butt; over many races that score would certainly come down, but if your model is any good it should hopefully stay above zero. More importantly, if you modify a model and your BSS goes up, you can be hopeful that the changes are worth sticking with.
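If you want to check the arithmetic yourself, here is a minimal sketch using scikit-learn's built-in brier_score_loss on the three-horse example above (the baseline is just the mean of the W/L column, i.e. 0.33 for every horse):

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 0])        ## the W/L column ##
y_prob = np.array([0.5, 0.3, 0.2])  ## the model probabilities ##

bs_model = brier_score_loss(y_true, y_prob)      ## 0.1267 ##
baseline = np.full(len(y_true), y_true.mean())   ## 0.33 for every horse ##
bs_naive = brier_score_loss(y_true, baseline)    ## 0.2222 ##

print(1 - bs_model / bs_naive)                   ## approx 0.43 ##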

Using GroupKFold to access races in Python

Friday 15 Nov 2019

Posted by smartersig in Profitable Punting with Python, Uncategorized

7 Comments

Tags

GroupKFold Python Horse racing

Apologies to those of you not interested in machine learning and Python, but I wanted to get this out there so that if my code is incorrect (it appears to work), or there is a more efficient approach, someone will kindly put me right.

When you build an ML model you can use a technique called K fold cross validation. It simply means splitting your data into, say, 5 partitions, then training your model on partitions 1 to 4 and testing it on partition 5. This happens five times, each time with a different test partition, e.g. train on partitions 1, 2, 3 and 5 but test on 4.
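For anyone who has not met it before, here is a minimal sketch of plain (ungrouped) K fold in scikit-learn, on dummy data of my own, just to show how the row indices are split:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  ## 10 dummy rows ##
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)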

Now the problem is that with race data you really want to split on whole race boundaries. You do not want, for example, half the field from a given race ending up in the train data and half in the test data.

To handle this we can use GroupKFold.

The idea behind GroupKFold is that the data is first grouped on an identifier, in our case the RaceId, and then the folds (or partitions, as I called them) are created on the groups. Below is the small sample data, followed by the code.

RaceId,Track,Horse,Dlto,Penulto,Age,Rating,Bfsp,FinPos
1,Newmarket,Mill Reef,13,56,4,85.5,3.5,1
1,Newmarket,Kingston,34,23,4,76.2,7.5,0
1,Newmarket,Nijinsky,27,,4,95,10.2,0
2,Sandown,Red Rum,98,23,5,90,5.4,0
2,Sandown,Henbit,101,54,4,85,20.4,1
2,Sandown,Troy,22,32,4,98,1.9,0
2,Sandown,Wollow,36,23,4,87,2.2,0
2,Sandown,The Minstrel,44,67,4,88,5.8,0
2,Sandown,Try My Best,34,53,4,82,3.2,0
2,Sandown,Tromos,62,73,4,65,6.2,0
3,Bath,Sea Pigeon,47,35,4,81,20.0,1
3,Bath,Monksfield,59,5,4,78,11.4,0
3,Bath,Night Nurse,12,15,6,62,4.2,0
3,Bath,Birds Nest,14,78,5,53,3.2,0
4,York,Frankel,25,17,4,100,1.9,1
4,York,Brigadier Gerard,23,67,3,90,2.9,0
4,York,Dubai Millennium,89,23,4,85,4.8,0
5,York,Posse,23,56,4,82,5.6,0
5,York,El Gran Senor,32,21,4,100,2.3,1
5,York,Radetsky,67,21,7,70,12.4,0

The code

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

m = RandomForestClassifier()

## read in the data ##
df = pd.read_csv('testcsv.csv')

## take a look at the data ##
print("The input data")
print(df)

## keep it simple, remove rows with missing data ##
df = df.dropna(axis=0)

print("df")
print(df)

## create the group identifiers ##
groups = df['RaceId']

## take a look at what this produces ##
print("The group indices")
print(groups)

## pull out the fields required for the model ##
X = df[['Dlto', 'Penulto']]
y = df[['FinPos']]

print("y")
print(y)

## create an instance of GroupKFold, set n_splits as required ##
gkf = GroupKFold(n_splits=2)

## loop through n_splits times, creating indices into the arrays for the allocated rows ##
for train_index, test_index in gkf.split(X, y, groups=groups):

    print("TRAIN:", train_index, "TEST:", test_index)

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("--- X train test ----")
    print(X_train, X_test)
    print("--- Y train test ---")
    print(y_train, y_test)

    ## train a model on X_train using y_train as the target ##
    y_train = np.ravel(y_train)
    model = m.fit(X_train, y_train)

    ## test the model on X_test ##
    preds = model.predict_proba(X_test)
    print('Predictions')
    print(preds)

    print("-------------- Split done -----------")

NOTE – When viewing the output you are probably expecting the dropna command to physically remove row number 2, as it contains a NaN; it certainly does not appear when the df is printed out. But if you examine the indices and the group links into the df you will see that index 3 still links to Red Rum. In other words pandas does not renumber the remaining rows when executing a dropna; the original index labels are kept (df.reset_index(drop=True) would renumber them). I was fooled by this for a while. Thanks to @amuellerml for nudging me in this direction.
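A minimal sketch of that behaviour, on a made-up one-column frame that has nothing to do with the race data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
df = df.dropna(axis=0)
print(df.index.tolist())  ## [0, 2] - label 1 is gone but label 2 is not renumbered ##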

Profitable Punting With Python 1

Saturday 30 Jan 2016

Posted by smartersig in Profitable Punting with Python

14 Comments

Tags

Machine Learning and horse racing

I have prepared some introductory sessions on machine learning for horse racing using Python and scikit-learn. You do not need previous experience of either of these two tools, but it would help if you are at least familiar with some basic programming concepts, for example what a FOR loop is and what an assignment statement is, even if not in Python.

The main data file will be freely available until Tuesday 2nd February for those who showed an initial interest. After that it will be in the utilities section of the http://www.smartersig.com web site, where a modest members' fee will enable you to access it.

The instructions will be freely available to all at all times.

OK, to get started you will need to have downloaded and installed Anaconda Python v3.4; see the previous blog post, Profitable Punting with Python Intro, for details.

Once this has installed, create a folder in your anaconda folder called horseracing.

All comments, questions and feedback should be posted to this blog post; that way they can act as a FAQ source.

First of all download the following zip file, double click on it to reveal all the contained files and copy them into your horseracing folder.

http://www.smartersig.com/pythonpunting.zip

The next step is to download the following file into your horseracing folder. When you click the link it will probably display the contents in your web browser; just right click the display and you will have the option to save the screen data to a file.

This file is now housed in the utilities section of the smartersig.com web site and is called aiplus12to14.csv

You now have the required files. To get started, first open an MS-DOS command window (the black box type).

Now navigate to your anaconda folder using the cd command, e.g. cd anaconda

Kick start IPython Notebook by typing in ipython notebook and pressing return (note: on the latest version this may now be jupyter notebook).

Once notebook is loaded up you will be presented with a directory screen of folders. Double click on the horseracing folder (that you created) to go into that folder.

Now double click on the file ProfitablePuntingWithPython1.ipynb

Follow the instructions within the displayed notebook.

Profitable Punting with Python Intro

Wednesday 13 Jan 2016

Posted by smartersig in Profitable Punting with Python

10 Comments

Tags

tweet

I am hoping to run a series of sessions introducing machine learning for horse racing betting via Python's scikit-learn. I will be assuming that you have some rudimentary programming knowledge, although it need not be in Python. In other words, you have some understanding of what a loop is, what a variable is and what an array is, even if not in Python.

What do we mean by machine learning? Essentially, getting our computer to build a model from past racing data so that we can use this model to predict the outcome of future races.

If you are interested in participating then you will first of all need a version of Python installed, along with the libraries we are going to use. Even if you already have a version of Python, you can still install the version I am about to recommend into a folder of its own, as I have done, and run it from there even if it is not your default Python. I am going to suggest Anaconda because it comes with all the extras we will need, such as IPython Notebook, so there is no need to fiddle around with separate installs.

Downloading the free Anaconda version of Python is quite painless, and the download includes everything we are going to need.

I have the 3.5 version for 64-bit Windows on my PC; you can choose the appropriate version (32-bit or 64-bit) to download at

https://www.continuum.io/downloads

If you are not sure whether your PC is 64-bit or 32-bit, check the following:

http://windows.microsoft.com/en-gb/windows7/find-out-32-or-64-bit

Check that Python has installed OK by opening an MS-DOS command window and navigating to your Anaconda folder using the cd command.

Once in the folder, type python -V

It should display your version number.

If you are interested in this series and have installed Anaconda OK let me know with a brief comment below so I can gauge interest.

UPDATE – I hope to start things on February 1st, when hopefully all will be ready. Note that I have modified the above and I am now running Python 3.5. If you have 2.7 it might be best to uninstall using the uninstall.exe file and then install 3.5 from the Anaconda website.

When you have done this create, inside your anaconda folder, a new folder called horseracing. You can call it something else if you wish as long as you know that when I refer to the horseracing folder I mean your equivalent.

I will start the ball rolling with a new blog post called Profitable Punting with Python 1

Each blog entry will briefly introduce the next session and where to pick things up from, and the comments section at the foot of each post will act as a discussion board and troubleshooting area.
