
Make Your Betting Pay

~ Improve Your Horse Betting


Using GroupKFold to access races in Python

15 Friday Nov 2019

Posted by smartersig in Profitable Punting with Python, Uncategorized


Tags

GroupKFold, Python, Horse racing

Apologies to those of you not interested in Machine Learning and Python, but I wanted to get this out there so that if my code is incorrect (it appears to work), or there is a more efficient approach, someone will kindly put me right.

When you build an ML model you can use a technique called K-fold cross validation. It simply means splitting your data into, as an example, 5 partitions, then training your model on partitions 1 to 4 and testing it on partition 5. This happens five times, with each pass using a different test partition, e.g. train on partitions 1, 2, 3 and 5 but test on 4.
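As a minimal sketch of plain K-fold (using made-up dummy rows, not real race data), scikit-learn's KFold looks like this:

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 dummy rows of 2 features - stand-ins for real race data
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    # each pass holds out a different 2-row test partition
    print("TRAIN:", train_index, "TEST:", test_index)
```

Note that plain KFold knows nothing about races; it just carves the rows into consecutive chunks, which is exactly the problem described next.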

Now the problem is that with race data you really want to split on whole race boundaries. You do not want, for example, half the field from a given race ending up in the train data and half in the test data.

To handle this we can use GroupKFold.

The idea behind GroupKFold is that the data is first grouped on an identifier, in our case the RaceId, and then the folds (or partitions, as I called them) are created on the groups. Below is the small sample data, followed by the code.

RaceId,Track,Horse,Dlto,Penulto,Age,Rating,Bfsp,FinPos
1,Newmarket,Mill Reef,13,56,4,85.5,3.5,1
1,Newmarket,Kingston,34,23,4,76.2,7.5,0
1,Newmarket,Nijinsky,27,,4,95,10.2,0
2,Sandown,Red Rum,98,23,5,90,5.4,0
2,Sandown,Henbit,101,54,4,85,20.4,1
2,Sandown,Troy,22,32,4,98,1.9,0
2,Sandown,Wollow,36,23,4,87,2.2,0
2,Sandown,The Minstrel,44,67,4,88,5.8,0
2,Sandown,Try My Best,34,53,4,82,3.2,0
2,Sandown,Tromos,62,73,4,65,6.2,0
3,Bath,Sea Pigeon,47,35,4,81,20.0,1
3,Bath,Monksfield,59,5,4,78,11.4,0
3,Bath,Night Nurse,12,15,6,62,4.2,0
3,Bath,Birds Nest,14,78,5,53,3.2,0
4,York,Frankel,25,17,4,100,1.9,1
4,York,Brigadier Gerard,23,67,3,90,2.9,0
4,York,Dubai Millennium,89,23,4,85,4.8,0
5,York,Posse,23,56,4,82,5.6,0
5,York,El Gran Senor,32,21,4,100,2.3,1
5,York,Radetsky,67,21,7,70,12.4,0

The code

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

m = RandomForestClassifier()

## read in the data ##
df = pd.read_csv('testcsv.csv')

## take a look at the data ##
print("The input data")
print(df)

## keep it simple - remove rows with missing data ##
df = df.dropna(axis=0)

print("df")
print(df)

## create group indices ##
groups = df['RaceId']

## take a look at what this produces ##
print("The group indices")
print(groups)

## pull out the fields required for the model ##
X = df[['Dlto', 'Penulto']]
y = df[['FinPos']]

print("y")
print(y)

## create an instance of GroupKFold - set n_splits as required ##
gkf = GroupKFold(n_splits=2)

## loop through n_splits times, creating indices into the arrays for the allocated rows ##
for train_index, test_index in gkf.split(X, y, groups=groups):

    print("TRAIN:", train_index, "TEST:", test_index)

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("--- X train test ----")
    print(X_train, X_test)
    print("--- y train test ----")
    print(y_train, y_test)

    ## train a model on X_train using y_train as the target ##
    y_train = np.ravel(y_train)
    model = m.fit(X_train, y_train)

    ## test the model on X_test ##
    preds = model.predict_proba(X_test)
    print('Predictions')
    print(preds)

    print("-------------- Split done -----------")
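If you want to satisfy yourself that GroupKFold really does keep each race together, a quick check (on made-up group ids here, not the sample file) is that the set of RaceIds in train and the set in test are always disjoint:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# made-up race ids: race 1 has 2 runners, race 2 has 3, race 3 has 2
groups = np.array([1, 1, 2, 2, 2, 3, 3])
X = np.zeros((7, 1))  # dummy features, one row per runner

gkf = GroupKFold(n_splits=2)
for train_index, test_index in gkf.split(X, groups=groups):
    # no RaceId should ever appear on both sides of a split
    assert set(groups[train_index]).isdisjoint(set(groups[test_index]))
    print("TRAIN races:", set(groups[train_index]),
          "TEST races:", set(groups[test_index]))
```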

NOTE - When viewing the output you are probably expecting the dropna command to physically remove the row containing a NaN (Nijinsky, at index 2); it certainly does not appear when the df is printed out. But if you examine the indices and the group links into the df you will see that index 3 still points to Red Rum. In other words, dropna does not shuffle the remaining df entries down to close the gap - the original index labels are kept. I was fooled by this for a while. Thanks to @amuellerml for nudging me in this direction.
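A tiny demonstration of this, using a cut-down version of the sample data, shows that after dropna the positional index (iloc) and the label index (loc) no longer agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Horse': ['Mill Reef', 'Kingston', 'Nijinsky', 'Red Rum'],
                   'Penulto': [56, 23, np.nan, 23]})
df = df.dropna(axis=0)

# label 2 (Nijinsky) is gone, but the remaining labels are NOT renumbered
print(df.index.tolist())    # [0, 1, 3]

# positional access and label access now disagree about "row 2"
print(df.iloc[2]['Horse'])  # Red Rum - position 2 is label 3
print(df.loc[3, 'Horse'])   # Red Rum
```

This is why the code above uses iloc with the indices returned by gkf.split, after resetting expectations about what the row numbers mean.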
