Apologies to those of you not interested in Machine Learning and Python, but I wanted to get this out there so that if my code is incorrect (it appears to work), or if there is a more efficient approach, someone will kindly put me right.
When you build an ML model you can use a technique called K-fold cross validation. It simply means splitting your data into, as an example, 5 partitions, then training your model on partitions 1 to 4 and testing it on partition 5. This happens five times, each time with a different test partition, e.g. train on partitions 1, 2, 3 and 5 but test on 4.
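As a rough illustration of the idea, here is a minimal sketch using scikit-learn's KFold on ten dummy rows. The data is made up purely for the example, and the order in which the partitions take their turn as the test set is whatever KFold chooses, not necessarily the order described above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows standing in for a dataset (illustrative only).
X = np.arange(10).reshape(-1, 1)

# Five partitions; each one is used as the test set exactly once.
kf = KFold(n_splits=5)

folds = []
for fold_number, (train_index, test_index) in enumerate(kf.split(X), start=1):
    folds.append((train_index, test_index))
    print(f"Fold {fold_number}: train={train_index} test={test_index}")
```

Every row ends up in a test partition once and in a training set the other four times.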
Now the problem is that with race data you really want to split on whole race boundaries. You do not want, for example, half the field from a given race ending up in the train data and half in the test data.
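To see the problem in action, here is a small sketch, assuming three made-up races of four runners each (the race ids are hypothetical and not taken from the sample data). It runs a plain KFold and reports which races have runners on both sides of each split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical race ids: three races of four runners each.
race_id = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
X = np.zeros((len(race_id), 1))  # placeholder features, not used for anything

split_races = []
for train_index, test_index in KFold(n_splits=4).split(X):
    # Races that have runners in both the train and the test rows.
    shared = set(race_id[train_index].tolist()) & set(race_id[test_index].tolist())
    split_races.append(shared)
    print("races split across train and test:", sorted(shared))
```

With four folds of three rows and races of four runners, every fold cuts through at least one race.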
To handle this we can use GroupKFold.
The idea behind GroupKFold is that the data is first grouped on an identifier, in our case the RaceId, and then the folds (or partitions, as I called them) are created from whole groups. Below is the small sample data, followed by the code.
RaceId,Track,Horse,Dlto,Penulto,Age,Rating,Bfsp,FinPos
1,Newmarket,Mill Reef,13,56,4,85.5,3.5,1
1,Newmarket,Kingston,34,23,4,76.2,7.5,0
1,Newmarket,Nijinsky,27,,4,95,10.2,0
2,Sandown,Red Rum,98,23,5,90,5.4,0
2,Sandown,Henbit,101,54,4,85,20.4,1
2,Sandown,Troy,22,32,4,98,1.9,0
2,Sandown,Wollow,36,23,4,87,2.2,0
2,Sandown,The Minstrel,44,67,4,88,5.8,0
2,Sandown,Try My Best,34,53,4,82,3.2,0
2,Sandown,Tromos,62,73,4,65,6.2,0
3,Bath,Sea Pigeon,47,35,4,81,20.0,1
3,Bath,Monksfield,59,5,4,78,11.4,0
3,Bath,Night Nurse,12,15,6,62,4.2,0
3,Bath,Birds Nest,14,78,5,53,3.2,0
4,York,Frankel,25,17,4,100,1.9,1
4,York,Brigadier Gerard,23,67,3,90,2.9,0
4,York,Dubai Millennium,89,23,4,85,4.8,0
5,York,Posse,23,56,4,82,5.6,0
5,York,El Gran Senor,32,21,4,100,2.3,1
5,York,Radetsky,67,21,7,70,12.4,0
The code
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

m = RandomForestClassifier()

## read in the data ##
df = pd.read_csv('testcsv.csv')

## take a look at the data ##
print("The input data")
print(df)

## keep it simple, remove rows with missing data ##
df = df.dropna(axis=0)
print("df")
print(df)

## create group indices ##
groups = df['RaceId']

## take a look at what this produces ##
print("The groups indices")
print(groups)

## pull out the fields required for the model ##
X = df[['Dlto', 'Penulto']]
y = df[['FinPos']]
print("y")
print(y)

## create instance of GroupKFold, set n_splits as required ##
gkf = GroupKFold(n_splits=2)

## loop n_splits times, creating positional indexes into the arrays for the allocated rows ##
for train_index, test_index in gkf.split(X, y, groups=groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("--- X train test ---")
    print(X_train, X_test)
    print("--- Y train test ---")
    print(y_train, y_test)

    ## train a model on X_train using y_train as the target ##
    y_train = np.ravel(y_train)
    model = m.fit(X_train, y_train)

    ## test model on X_test ##
    preds = model.predict_proba(X_test)
    print('Predictions')
    print(preds)
    print("--------------Split done-----------")
NOTE: When viewing the output you might expect the dropna command to remove row number 2, as it contains a NaN, and to renumber the rows that follow. The row is removed, and it certainly does not appear when the df is printed out, but the remaining rows keep their original index labels: Red Rum is still labelled 3, so the index runs 0, 1, 3, 4 and so on. In other words, dropna does not shuffle the df entries down. The train_index and test_index values produced by GroupKFold, on the other hand, are row positions (which is why the code uses iloc), so they no longer match the index labels once a row has been dropped. I was fooled by this for a while. Thanks to @amuellerml for nudging me in this direction.
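The behaviour of dropna described here can be seen on a tiny frame. This is a minimal sketch with made-up values; the reset_index call at the end is an optional extra that is not part of the code above, shown only as one way to make labels and positions agree again.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the row labelled 2 has a missing Penulto value.
df = pd.DataFrame({
    "RaceId": [1, 1, 1, 2, 2],
    "Penulto": [56.0, 23.0, np.nan, 23.0, 54.0],
})

dropped = df.dropna(axis=0)
print(dropped.index.tolist())  # the labels keep a gap where the row was removed

# Position 2 (what iloc uses) is now the row whose label is 3.
label_at_position_2 = dropped.index[2]
print(label_at_position_2)

# Optional: renumber the rows so labels and positions match again.
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())
```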
Interesting work, I can see how this can be a great tool.
The problem I have is that I cannot get the code running for me.
KeyError: 'Raceld'
I have copied the header from the CSV into the code, and swapped double for single quotes and vice versa throughout the code.
## create group indices ##
groups = df['Raceld']
Any ideas what I am doing wrong?
Is that a capital I in RaceId that you have there? It should be.
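One quick way to check this kind of thing is to print the column names exactly as pandas read them and test the name you are using against them. A minimal sketch, using a cut-down inline copy of the CSV header rather than the real file:

```python
import io
import pandas as pd

# Cut-down inline copy of the sample CSV (illustrative only).
csv_text = "RaceId,Track,Horse\n1,Newmarket,Mill Reef\n"
df = pd.read_csv(io.StringIO(csv_text))

# Show the column names exactly as pandas sees them.
print(df.columns.tolist())

wanted = "Raceld"  # lowercase L, easily mistaken for a capital I
print(wanted in df.columns)    # False here, hence the KeyError
print("RaceId" in df.columns)  # the name with a capital I
```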
It is not clear how the model can distinguish between competitors within or between grouped races.
For example, a horse coming 4th (and given a finish value of 0) can have much stronger data than a horse that wins a weaker race (and is given a finish value of 1).
The article is really about coding k-fold cross validation that creates splits using groups, i.e. grouped by race. If you want to model finishing position but, as you say, you want the model to pick up on the strength of a race as well, then you would need to feature engineer this. For example, you may feel that 4th in a 20 runner handicap is stronger than, or at least as strong as, finishing 1st in a five runner race, so you could represent finishing position as the number of horses behind. This has some problems: go too far down the field and you may hit non-triers, so some judgment about a cut-off point would be needed. There are no right or wrong answers in data science; you have to experiment with logical options and see what works best.
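A rough sketch of the "number of horses behind" idea follows. Note the assumptions: the Position column (the actual finishing place) is hypothetical and is not in the sample CSV above, which only has a 1/0 FinPos, and the HorsesBehind name is made up for the example. No cut-off for non-triers is applied.

```python
import pandas as pd

# Hypothetical data: 'Position' is NOT a column in the sample CSV.
df = pd.DataFrame({
    "RaceId":   [1, 1, 1, 2, 2, 2, 2, 2],
    "Position": [1, 2, 3, 1, 2, 3, 4, 5],
})

# Number of runners in each race, repeated on every row of that race.
runners = df.groupby("RaceId")["Position"].transform("count")

# Horses finishing behind this one: field size minus finishing place.
df["HorsesBehind"] = runners - df["Position"]
print(df)
```

A winner of a five runner race gets 4, while 4th of 20 would get 16, which is one way of letting field size into the target.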
Hi Mark
Probably this is more to do with my understanding than anything else…
But I would really appreciate it if you could clarify this for me. When I run the above Python code I get
The groups indices
0 1
1 1
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 3
11 3
12 3
13 3
14 4
15 4
16 4
17 5
18 5
19 5
TRAIN: [ 0 1 9 10 11 12 16 17 18] TEST: [ 2 3 4 5 6 7 8 13 14 15]
TRAIN: [ 2 3 4 5 6 7 8 13 14 15] TEST: [ 0 1 9 10 11 12 16 17 18]
In the training set, items 9 and 10 belong to two different groups (2 and 3), and so do (9 and 10), (16 and 17). Doesn’t this imply that half the field of a given race ends up in the train data and half in the test data?
Many thanks
Jakes
If you look at the output produced by the following line
print(X_train, X_test) and as you can see row 9 gives
You should see that all the runners from a race appear in either the train or the test group but not both. What is confusing, I agree, is that the index values make it look as though rows from the same race can land in both groups. This is not actually the case, because the index values displayed are row positions, not the index labels shown down the left of the df. So where it says TRAIN [0 1 9 etc. it really means position 0, position 1 and position 9, and the row at position 9 is the first row of race 3.
It's a bit of a head scratcher, but check the race output from print(X_train, X_test) and verify that the races are split correctly, keeping whole races together.
That should read row 9 gives 10 47 35 during the first split
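For anyone who wants to verify this without reading the printed frames by eye, here is a minimal sketch. It rebuilds the groups series from the output quoted above (row labels 0, 1, 3 to 19, with label 2 missing after dropna), uses placeholder features since only the groups matter for the split, and maps each set of positions back to race ids.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# RaceId per row after dropna, keeping the original row labels (2 is missing).
labels = [0, 1] + list(range(3, 20))
race_ids = [1, 1] + [2] * 7 + [3] * 4 + [4] * 3 + [5] * 3
groups = pd.Series(race_ids, index=labels, name="RaceId")
X = np.zeros((len(groups), 1))  # placeholder features

checks = []
for train_index, test_index in GroupKFold(n_splits=2).split(X, groups=groups):
    # iloc looks rows up by position, matching what split() returns.
    train_races = set(groups.iloc[train_index].tolist())
    test_races = set(groups.iloc[test_index].tolist())
    checks.append(train_races.isdisjoint(test_races))
    print("train races:", sorted(train_races), "test races:", sorted(test_races))

# Position 9 is the row labelled 10, the first runner of race 3.
print(groups.index[9], groups.iloc[9])
```

If every entry in checks is True, no race has runners in both train and test.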