Apologies to those of you not interested in Machine Learning and Python, but I wanted to get this out there so that if my code is incorrect (it appears to work), or there is a more efficient approach, someone will kindly put me right.

When you build an ML model you can use a technique called K-fold cross validation. It simply means splitting your data into, as an example, 5 partitions, then training your model on partitions 1 to 4 and testing it on partition 5. This happens five times, with each pass using a different test partition, e.g. train on partitions 1, 2, 3 and 5 but test on 4.
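If you have never seen it in action, here is a minimal sketch of plain K-fold using sklearn's KFold on ten dummy rows (nothing to do with the race data yet), just to show that each row lands in exactly one test fold:

import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  ## ten dummy rows ##

kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(data)):
    ## every row appears in exactly one test fold across the five passes ##
    print(f"fold {fold}: train rows {train_idx}, test rows {test_idx}")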

Now the problem is that with race data you really want to split on whole race boundaries. You do not want, for example, half the field from a given race ending up in the training data and half in the test data, otherwise the model gets to see part of a race it is later tested on.
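Just to prove the point, here is a quick check using the RaceId column from the sample data further down (all 20 rows, before any NaN removal): plain KFold happily puts part of race 2 in a test fold while the rest of race 2 sits in the corresponding training fold:

import numpy as np
from sklearn.model_selection import KFold

## RaceId column from the sample data below ##
race_ids = np.array([1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
                     3, 3, 3, 3, 4, 4, 4, 5, 5, 5])

kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(race_ids)):
    ## any race appearing on both sides of the split has leaked ##
    leaked = set(race_ids[train_idx]) & set(race_ids[test_idx])
    print(f"fold {fold}: races split across train and test: {leaked}")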

To handle this we can use GroupKFold.

The idea behind GroupKFold is that the data is first grouped on an identifier, in our case the RaceId, and the folds (or partitions, as I called them) are then created on those groups. Below is the small sample data, followed by the code.

RaceId,Track,Horse,Dlto,Penulto,Age,Rating,Bfsp,FinPos
1,Newmarket,Mill Reef,13,56,4,85.5,3.5,1
1,Newmarket,Kingston,34,23,4,76.2,7.5,0
1,Newmarket,Nijinsky,27,,4,95,10.2,0
2,Sandown,Red Rum,98,23,5,90,5.4,0
2,Sandown,Henbit,101,54,4,85,20.4,1
2,Sandown,Troy,22,32,4,98,1.9,0
2,Sandown,Wollow,36,23,4,87,2.2,0
2,Sandown,The Minstrel,44,67,4,88,5.8,0
2,Sandown,Try My Best,34,53,4,82,3.2,0
2,Sandown,Tromos,62,73,4,65,6.2,0
3,Bath,Sea Pigeon,47,35,4,81,20.0,1
3,Bath,Monksfield,59,5,4,78,11.4,0
3,Bath,Night Nurse,12,15,6,62,4.2,0
3,Bath,Birds Nest,14,78,5,53,3.2,0
4,York,Frankel,25,17,4,100,1.9,1
4,York,Brigadier Gerard,23,67,3,90,2.9,0
4,York,Dubai Millennium,89,23,4,85,4.8,0
5,York,Posse,23,56,4,82,5.6,0
5,York,El Gran Senor,32,21,4,100,2.3,1
5,York,Radetsky,67,21,7,70,12.4,0

The code

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

m = RandomForestClassifier()

## read in the data ##
df = pd.read_csv('testcsv.csv')

## take a look at the data ##
print("The input data")
print(df)

## keep it simple, remove rows with missing data ##
df = df.dropna(axis=0)

print("df")
print(df)

## create group indices ##
groups = df['RaceId']

## take a look at what this produces ##
print("The group indices")
print(groups)

## pull out the fields required for the model ##
X = df[['Dlto', 'Penulto']]
y = df[['FinPos']]

print("y")
print(y)

## create instance of GroupKFold, set n_splits as required ##
gkf = GroupKFold(n_splits=2)

## loop through n_splits times, creating indices into the arrays for the allocated rows ##
for train_index, test_index in gkf.split(X, y, groups=groups):

    print("TRAIN:", train_index, "TEST:", test_index)

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("---- X train test ----")
    print(X_train, X_test)
    print("---- y train test ----")
    print(y_train, y_test)

    ## train a model on X_train using y_train as the target ##
    y_train = np.ravel(y_train)
    model = m.fit(X_train, y_train)

    ## test model on X_test ##
    preds = model.predict_proba(X_test)
    print("Predictions")
    print(preds)

    print("----------Split done----------")
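As an aside, if you just want a score per fold rather than the raw probabilities, the same grouped split can be driven through sklearn's cross_val_score by passing the groups in. The neg_log_loss scorer here is just my assumption, swap in whatever metric suits:

from sklearn.model_selection import cross_val_score

## same m, X, y and groups as above ##
scores = cross_val_score(m, X, np.ravel(y), groups=groups,
                         cv=GroupKFold(n_splits=2), scoring='neg_log_loss')
print(scores)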

NOTE - When viewing the output you are probably expecting the dropna command to physically remove row number 2, as it contains a NaN, and it certainly does not appear when the df is printed out. But if you examine the indices you will see that the Red Rum row still carries the label 3. In other words, dropna removes the row but does not shuffle the remaining entries down; the original index labels are kept. I was fooled by this for a while. Thanks to @amuellerml for nudging me in this direction.
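To make that concrete, here is a tiny standalone demo on a made-up frame (not the race data file): dropna keeps the original index labels, and reset_index(drop=True) renumbers them if you prefer. The code above is unaffected either way, because .iloc works on positions rather than labels:

import pandas as pd

demo = pd.DataFrame({'Horse': ['Mill Reef', 'Kingston', 'Nijinsky', 'Red Rum'],
                     'Penulto': [56, 23, None, 23]})

print(demo.dropna(axis=0).index)                         ## labels 0, 1, 3 - no shuffle down ##
print(demo.dropna(axis=0).reset_index(drop=True).index)  ## labels 0, 1, 2 after renumbering ##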