Simple XGoals Model 2

Following on from the previous post, where I looked at a simple model using expected goals as the underpinning value for a simple rating system, I will continue the exploration by:

  1. Extending the premiership analysis to cover 2018/19 through 2021/22 (4 seasons of data)
  2. Looking at an alternative baseline model that simply uses goals scored difference

First, the results for the XG model using a walk-forward train and test. Let me explain: the software will train the model on 18/19 and test it on 19/20. It will then train on 18/19 and 19/20 and test on 20/21. Finally, it will train on 18/19, 19/20 and 20/21 and test on 21/22. While it is doing this it accumulates the results from each test period before reporting them.
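The walk-forward scheme described above can be sketched in a few lines of Python; the function below is my own illustration, not MySportsAI code.

```python
# Walk-forward validation: train on all seasons seen so far, test on
# the next one, accumulating a (train, test) pair per step.
def walk_forward_splits(seasons):
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

splits = list(walk_forward_splits(["18/19", "19/20", "20/21", "21/22"]))
# First split trains on 18/19 and tests on 19/20; the last trains on
# the first three seasons and tests on 21/22.
```

The results from each test season are then pooled before any profit or calibration figures are reported.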

Using a plain train/test split rather than walk-forward, so the model trains on the first 80% and tests on the last 20%, we have

Both sets of results show a good return on value bets, and the calibration plot looks reasonable in the bulk of the range where the ratings reside.

Next I trained a model using the goals scored difference over a team's last 3 matches as a measure of their worth. So team A, with results of 1-0, 2-0 and 1-5 (their score coming first), would have a gross difference of -1 and an average of -0.33.
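The worked example above can be checked in a couple of lines:

```python
# Team A's last three results, own score first: 1-0, 2-0, 1-5
results = [(1, 0), (2, 0), (1, 5)]
diffs = [scored - conceded for scored, conceded in results]
gross = sum(diffs)            # 1 + 2 - 4 = -1
average = gross / len(diffs)  # -0.333...
```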

In the above, ignore the fact that the input feature is still named xgdiff; I did not change the feature name, but I did populate it with actual goals scored difference. The results using a train/test split were as follows.

The value bet profit has evaporated here, adding extra weight to the idea that XGoals is a superior input to goal difference when it comes to profit generation. There is enough evidence here to prompt further investigation using more data and other leagues.

Many thanks to and for data supply


Simple XGoals Model


Prompted by an excellent Twitter post by @AntoineJWMartin on building a simple XG model in R, I decided to try something similar in Python, with the difference being that I would construct the data so that it can be loaded into MySportsAI and modelled. In most other respects it is similar to Antoine's work: he takes the average of the last 3 XG differences for each team and calculates the difference between the two teams in the next game up. That sounds complicated, so let me explain.

I will assume you know what expected goals are; if not, a quick Google will enlighten you. Let us imagine Arsenal in their last match had an XG of 1.2 and their opposition in that match had an XG of 0.8. The difference for that match is 0.4. We calculate this for the last 3 matches and average it. This is then Arsenal's average xgDiff (my naming) going into the next match. If we also calculate their opposition's xgDiff, we can subtract one from the other to get a rating, and perhaps this simple rating can be modelled.

The main thrust of Antoine's tweet was to show how to get the data and prepare it, and if there is interest I will do the same, or at least make some general code available. The two web sites needed are for expected goals data and for the results and betting odds. The work involves coding the two together into one set of data to be fed into MySportsAI.

Once I had done this the loaded data looked like this. The initial rows have an xgDiff of NaN because those teams have not yet had 3 matches and therefore cannot have a rolling average. These are removed at the modelling stage.

Some explanation is needed first. The data is for the English premiership, 2019 to 2022. I have excluded data on the draw, so the above is like a series of two-horse races, home and away. finPos is 1 or 0 depending on whether the home or away team won, and although I have stuck to the naming convention of BFSP for starting price, I actually took Bet365 prices from the data, although others can be used.

At first the results looked too good to be true, and when they do you must always assume the worst. On careful inspection I realised that in taking the last 3 game average I was in fact including the current game in the average; clearly putting the current game's XGdiff into the average is going to inflate the predictability of the model.
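The leak, and the fix, can be seen in a couple of lines of pandas; the series below is my own toy example of one team's per-match xG difference in date order, not MySportsAI's schema.

```python
import pandas as pd

# One team's xG difference per match, oldest first
xg_diff = pd.Series([0.4, -0.2, 1.0, 0.6])

# Leaky: the rolling window includes the current match's own figure
leaky = xg_diff.rolling(3).mean()

# Correct: shift by one first, so only already-played matches
# contribute to the average carried into the next game
clean = xg_diff.shift(1).rolling(3).mean()
```

With the shift in place, the rating going into match four is the average of matches one to three, which is what the model should legitimately know.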

Running the model with a train/test split produced the following results using logistic regression

I need to desk check this to make sure all is OK, but the results look promising. That is not the main reason for this exercise, though; at this stage we are just looking at ways of configuring data for football modelling, and considering I am not a football modeller I would appreciate any feedback.

Footnote: another possibility is weighting the last 3 matches. Using a weight of 0.5 for the last match, 0.3 for the 2nd last match and 0.2 for the 3rd last match (note: I just made these up) I got the following improved results.
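The weighted version is a one-liner; the xgDiff values below are invented for illustration, and the weights are the made-up ones from the footnote.

```python
# Most recent match first, matching the weights' order
weights = [0.5, 0.3, 0.2]        # last, 2nd last, 3rd last match
recent_diffs = [0.6, 1.0, -0.2]  # invented xgDiff values

weighted = sum(w * d for w, d in zip(weights, recent_diffs))
# 0.5*0.6 + 0.3*1.0 + 0.2*(-0.2) = 0.56
```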

Know Your Trainers Neighbors


The classical way of looking at trainer form is to check how well a trainer has done over the last X runs. Often this amounts to simply looking at win-to-run ratio, but sometimes it is refined further to perhaps 'good' runs or placed runs. But is there a more refined way of looking at trainer form? What I am getting at is: how well does M Botti do when running in a class 5 handicap with a 3yo who has been off 71 days? What if we add further criteria, perhaps stipulating that the animal is a male rather than a female horse? There are all kinds of criteria we could come up with, but it gets messy, and even then the question remains: how do we evaluate his runs with such animals?

Machine Learning can help in this situation. K Nearest Neighbors is one of the simplest algorithms to understand. Imagine we simply focus on Botti's runners that are 3yos, or as near as possible to 3yo, and off 71 days, or as near as possible to 71 days. It would be great if Botti had a multitude of such previous runners, but of course he won't; KNN, however, will search for the nearest sample of data to these values. The sample size is set by us when we run a KNN program.

I performed this task on the last race at Wolves on Saturday 26th November 2022. I trained the KNN algorithm on Botti's data from 2011 to 2019 for class 5 and 6 races and then ran a prediction on Botti's runner in the last at Wolves. Normally the algorithm would predict the chances of Botti having a winner with a 3yo off 71 days, but I wanted to refine the prediction somewhat. I actually accessed the 21 nearest neighbors from 2011 to 2019 (I specified it should look for the 21 nearest instances) and then, instead of lengths beaten for each animal, I looked at pounds beaten and compiled an average. I did this for all trainers in the last race at Wolves and then ranked them, with the smallest average of course being the best ranked in the race. I also graphed the individual trainers' nearest neighbors; here are a couple.
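The nearest-neighbor lookup can be sketched with sklearn; the data below is randomly generated for illustration, so only the shape of the approach, not the numbers, reflects the real exercise.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Invented history for one trainer: age, days since last run, and
# pounds beaten for each past runner
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(2, 10, 500),     # age
                     rng.integers(7, 300, 500)])   # days since last run
pounds_beaten = rng.uniform(0, 30, 500)

# Find the 21 past runners most similar to today's 3yo off 71 days
nn = NearestNeighbors(n_neighbors=21).fit(X)
dist, idx = nn.kneighbors(np.array([[3, 71]]))

# Average pounds beaten over those 21 neighbors; rank trainers by this
avg_pounds_beaten = pounds_beaten[idx[0]].mean()
```

Repeating this per trainer and ranking by `avg_pounds_beaten`, smallest first, gives the race ranking described above.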

At first glance Appleby's graph looks better, but of course the vertical scale is different, although he does have one large outlier.

There is lots more work that can be done on this idea. Certainly the two inputs above should be normalised to lie between 0 and 1, otherwise the algorithm will give more weight to a difference of, say, 10 days since last run than to a difference of one or two years of age. This would lead to days since last run dominating the selection of the 21 nearest neighbors.
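That normalisation is exactly what a min-max scaler does; here is a minimal sketch on invented age / days-since-last-run rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: age, days since last run (invented values)
X = np.array([[3.0, 71.0],
              [5.0, 81.0],
              [13.0, 300.0]])

# Each column is rescaled to [0, 1], so a 10-day gap no longer dwarfs
# a 2-year age gap in the distance calculation
X_scaled = MinMaxScaler().fit_transform(X)
```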

Does this approach have any legs? Well, I trained on 2011 to 2015 and tested on 2016/17 for all handicap races, using just a couple of input fields for trainers, of which days since last run was one. During 2016/17 the top ranked made a variable stake profit of +19.1 points whilst the second ranked made +15.6.

In this race Botti is top ranked and Appleby is second; good luck with whatever you are on.

Comments are welcome, and don't forget to rate the article.

Has My Data Changed


A big problem model builders have is covariate shift. It sounds complicated but it's really simple, and I am going to explain it via a horse racing example that will be familiar to all of you. The days since a horse last ran is a common feature in a horse racing data set, whether you are a system builder or a model builder. Either way, you are hoping that the way horses are generally campaigned this year mirrors the way they have been campaigned in previous years. I mean, if I told you that next year no horse could run without having at least 3 weeks off, you would be pretty worried about how this would affect your way of betting, regardless of your approach.

To give this a fancy term, we would say that the distribution of the data item days since last run has changed; in other words, we have covariate shift. This is worrying if you built a model on the assumption of certain data distributions only to find that they have changed drastically. Such a situation occurred in 2020, when a whole industry had to shut up shop and no horse ran for weeks.

In MySportsAI you can now check for drift between the data you're training a model on and the data you're testing it on. Let's demonstrate with a simple model using only days since a horse last ran.

The above was trained and tested on handicap data for 2011 to 2015, the first 80% being the training data and the latter 20% the test data. The TTDrift value shows a measure of data drift; close to 0.5 is quite good. Now let's see how it drifts when I train on 2011 to 2015 and test on 2016/17.
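I won't reproduce MySportsAI's exact TTDrift calculation here, but a common way to get a drift score where 0.5 means "no drift" is a classifier two-sample test: label training rows 0 and test rows 1, train a classifier to tell them apart, and read off its AUC. The sketch below uses invented days-since-last-run data with a deliberate shift.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
train_days = rng.exponential(30, 2000)      # days since last run
test_days = rng.exponential(30, 500) + 60   # e.g. post-shutdown gaps

# Label the origin of each row and see how separable the two sets are
X = np.concatenate([train_days, test_days]).reshape(-1, 1)
y = np.concatenate([np.zeros(2000), np.ones(500)])
probs = cross_val_predict(LogisticRegression(), X, y,
                          cv=5, method="predict_proba")[:, 1]

# AUC near 0.5: distributions look alike; near 1.0: heavy drift
drift_auc = roc_auc_score(y, probs)
```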

The drift is a little more pronounced and the variable stake profit on the top 3 rated has dropped

Now, finally, let's check testing on 2020.

The TTDrift has, not surprisingly, taken a jump, and the top 3 VROI is poor: worse than you would get backing all horses.

Although the change in conditions here was forced upon us, in other cases drift can occur for reasons that are not as easy to determine, but at least you can now keep an eye on it. By the way, although the drift on the above came down in 2021 and 2022, it has still not returned to 2011/15 levels. If you are interested in the mechanism behind the calculations, here is a link.

Predicting Profit Not Winners


Machine learning libraries like sklearn come with lots of ML algorithms: Neural Networks, Logistic Regression, Gradient Boosting Machines and so on. Off the shelf they all have one thing in common. If you give them a spreadsheet-like set of data they will try to predict one of the columns, depending on which one you specify. So if one of my columns contains zero where a horse (a row of data) lost and one if it won, then we can get the ML algorithm to create a model that is hopefully good at predicting those 0s and 1s. It will even give you a probability between 0 and 1 so that you can rank the horses in a race and perhaps just back the top ranked horse. This is all jolly good if we find this approach produces profit, but can we get an algorithm to predict profit? Would a model built to find profit work better than a model built to find winners?

To find profit with an ML algorithm we have to change something called the loss function. So what is the loss function when it's at home? Let us think about a commonly used one: Mean Squared Error (MSE). If, say, a logistic regression model predicts Mishriff will win the Breeders Cup Turf with a probability of 0.21 and he does win, then the error is 1 – 0.21 = 0.79

If on the other hand he loses then the error is 0 – 0.21 = -0.21

Now if we square the two potential errors we always get positive numbers, namely 0.62 and 0.04

These are the squared errors, and if we take the average of these across all the predictions made in a year we have the MSE.
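The Mishriff arithmetic above, written out in code:

```python
# Predicted win probability for Mishriff
p = 0.21

err_win = 1 - p    # error if he wins:  0.79
err_lose = 0 - p   # error if he loses: -0.21

# Squaring makes both errors positive: 0.62 and 0.04 to 2 d.p.
se_win, se_lose = err_win ** 2, err_lose ** 2

# Averaging squared errors over many predictions gives the MSE
```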

Hopefully you can see that if losers have lower predicted probabilities and winners have higher ones, then we are heading in the right direction. If it's the other way round then we have a pretty crap model. The algorithm will attempt to minimise this MSE in its search for a good model.

But we want to model for profit, not accuracy, so we need a different loss function to MSE. We need to create our own, what is commonly known in ML circles as a custom loss function, plug it into our algorithm and say: hey, use this, not the loss function you use by default.

You can do this with LightGBM and XGBoost, but it is easier to do with Deep Learning and Keras. I am not going to go into the code detail here, but I am going to share my findings after dipping my toe into this pool.

I created a loss function that would maximise variable stake profit in proportion to the rating produced for each horse in a race. In other words, it is betting to win £1 on each horse in a race, but whatever profit or loss is made on each horse is multiplied by the rating value. So if the top rated horse won with a rating of 0.25 the winnings would be £1 x 0.25, and of course the loss on the lower rated horses would be smaller because they have lower rating values. The loss/profit on a race is therefore reduced/increased when higher rated horses win.
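Here is a NumPy sketch of that idea, not my exact Keras code (in Keras you would write the same thing with tensorflow ops). One assumption worth flagging: a Keras loss only receives y_true and y_pred, so I assume the decimal odds are carried in a second column of y_true alongside the win flag.

```python
import numpy as np

def profit_loss(y_true, y_pred):
    won, odds = y_true[:, 0], y_true[:, 1]
    # Level-stakes return per horse: odds-1 if it won, -1 if it lost
    returns = won * (odds - 1.0) - (1.0 - won)
    # Scale each return by the model's rating, then negate:
    # minimising this loss maximises rating-weighted profit
    return -np.mean(y_pred * returns)

# Three-horse race: the 0.25-rated horse wins at decimal odds of 5.0
y_true = np.array([[1, 5.0], [0, 3.0], [0, 8.0]])
y_pred = np.array([0.25, 0.45, 0.30])
loss = profit_loss(y_true, y_pred)
```

A more negative loss means the ratings were concentrated on the horses that paid out.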

Plugging this into a Deep Learning Neural Network using Keras produced the following results for top rated horses in each race (UK Handicaps, flat). I go on to compare this with a GBM model produced in MySportsAI using the same data but obviously designed to find winners.

First, data for 2011 to 2015 was split chronologically into 80% for training and 20% for testing. If you have used a Neural Network before, you will know that because of the stochastic nature of NNs you can train a model and get results from it, but if you retrain it you will get different results (MySportsAI users: try this with the NN option). This is not the case with GBM. This does not mean NNs are unreliable; you just have to train and test a few times to get a reasonable picture or an average. Here are the results for top rated horses for 5 runs with the custom loss function in place.

Each run produced 3959 top rated bets

Run 1 ROI% 1.79 VROI% 2.45

Run 2 ROI% 5.05 VROI% 1.82

Run 3 ROI% -3.76 VROI% 1.45

Run 4 ROI% -0.08 VROI% 0.69

Run 5 ROI% 2.18 VROI% 3.21

The first thing I should mention about the above models is that, in line with common wisdom, I scaled the 3 input features so that they were all in a range of 0 to 1. This is commonly advised for NNs, but I was about to find that the opposite held for my data, which surprised me.

Here are the results without scaling.

Run 1 ROI% 10.53 VROI% 4.8

Run 2 ROI% 6.47 VROI% 2.06

Run 3 ROI% 2.79 VROI% 3.08

Run 4 ROI% 9.77 VROI% 7.79

Run 5 ROI% 9.49 VROI% 12.11

So how does the GBM algorithm perform with the same data but obviously no custom loss function?

ROI% 5.71 VROI% 5.66

Taking averages, GBM is slightly worse than the average performance of the unscaled NN using the custom loss function.

My next step was to look at how these two performed on validation sets, in other words other hold-out periods: 2016-17 data and 2018-19 data. First, 2016/17. The question to ask is which of the 5 runs I performed with the NN I should use. I tried the highest performer first and this gave some weird results: the top rated horse was getting a rating of 0.99 or so, which suggests something went wrong. Probably the NN found what's called a local optimum and simply over fitted, or, in layman's terms, got lucky in this case. Needless to say, the results on 2016/17 were poor. Next I tried a mid-range model and this looked promising.


2016/17: GBM ROI% -1.66 VROI% 0.55 | NN with loss function ROI% 8.08 VROI% 3.68


2018/19: GBM ROI% 6.23 VROI% 3.11 | NN with loss function ROI% 4.12 VROI% 3.78

Another area of interest may be to use the ranking of the horse instead of the probability when multiplying the loss in the loss function. If you have any ideas of your own, please comment and vote on the usefulness of the article.

One Hot V Ordinal Encoding

Steve Tilley sent me this interesting article today which delves into the benefits of using ordinal encoding over one hot encoding in some situations.

A synopsis of the piece would be that for some tree based algorithms like GBM and Random Forests ordinal encoding can be a better option than the usually recommended one hot encoding.

OK, I am getting ahead of myself here: what do the above terms actually mean? Well, imagine we have a racing data feature like race going (F, GF, Gd etc) and let's say we want to model on pace figure and going, because maybe together they have some predictive power. We cannot use the going data as is, because ML algorithms require numeric values. The conventional wisdom is that if the going does not have some intrinsic ordering to it, then one hot encode it, which simply means creating a binary feature for every possible occurrence, like this.
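The two encodings look like this in pandas; the going ordering matches the numeric scheme used later in the post.

```python
import pandas as pd

going = pd.DataFrame({"going": ["Hvy", "Sft", "GS", "Gd", "GF", "F"]})

# One hot: one binary column per going description
one_hot = pd.get_dummies(going["going"], prefix="going")

# Ordinal: a single numeric column, ordered soft to firm
order = {"Hvy": 0, "Sft": 1, "GS": 2, "Gd": 3, "GF": 4, "F": 5}
going["going_ord"] = going["going"].map(order)
```

One hot turns one column into six here; ordinal keeps it to one, which is the dimensionality saving the article is getting at.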

As the article points out this can lead to an explosion of features and possibly the curse of dimensionality.

Below is the performance of a model on pace figure and one hot encoded going for turf flat handicaps. The top rated made an ROI of 1.95% but a variable ROI of -0.7%.

Now if we use a numeric value for going, namely 0 = Hvy, 1 = Sft, 2 = GS etc, and so only two input features, pace figure and going, we get the slightly better set of results.

These results suggest, as the article does, that we should not jump to conclusions about one hot encoding; ordinal encoding with tree based algorithms may be just as good if not better.

Modeling Heritage Handicaps


Back at the beginning of the 2022 flat season a tipping competition popped up on Twitter. Entries had to make two selections in all of the flat season's heritage handicaps. I felt this was a nice opportunity to test a machine learning model designed to run specifically on heritage handicaps, so I set about creating one. Drilling down into the data for just heritage handicaps might produce too little data to work with, so I decided to train the model on all races of class 4 and below. I also ended up splitting the task into two models, one for races up to a mile and another for races beyond a mile. Selections would be made simply by posting the top two rated on Twitter.

Things got off to a pretty surprising start when the model found Johan, a 28/1 SP winner of the Lincoln, and it generally got better as the year progressed. Here are the results, with each-way bets settled at whatever place terms were on offer from at least two big bookmakers.

Firstly, let me say that the above returns are not likely sustainable, but the profit generated does add weight to the historical results and suggests that the model can be profitable going forward, especially at these place terms. I will consider posting these up to the MySportsAI email forum next year.

It’s Not The Years It’s The Mileage


As a horse ages we can expect some deterioration in performance, but is age more taxing, or, as Dr Jones once said, is it the mileage? We can take a look at this using MySportsAI, software that allows people with no Machine Learning grounding to create ML models to predict horse racing.

The first thing I am going to do with the software is slice the data down to only handicap races by clicking alongside raceType in the slice column and then clicking the slice data button at the top of the page.

Via the following screen we can now slice the data down to just handicaps for 2011 to 2015

I am now going to select age as a sole input feature to my model and click Run Model

This presents the following screen, where I have selected 5-fold cross validation. This simply means that when I test my one-feature model it will train on 4/5ths of the data and then see how it performs on the remaining 5th. It will do this 5 times with different 4/5-to-1/5 partitions.
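In raw sklearn the same 5-fold procedure looks like the sketch below; the data is invented (age as the lone feature, won-or-not as the target), and the Brier-style scoring is my choice for the illustration, not necessarily MySportsAI's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Invented single-feature data: horse age vs whether it won
rng = np.random.default_rng(2)
age = rng.integers(3, 14, size=1000).reshape(-1, 1)
won = (rng.random(1000) < 0.1).astype(int)

# Five fits, each on 4/5 of the rows, each scored on the held-out 5th
scores = cross_val_score(LogisticRegression(), age, won,
                         cv=5, scoring="neg_brier_score")
```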

Running the model gives me the following results; we will come back to these when I have carried out a similar run for number of previous races instead of age.

I now return to the previous screen and select prevRaces instead of age

The results we get using prevRaces are as follows

We can see that prevRaces has thinned the horses out into rankings within each race better than age. This is to be expected, as there are more individual values of prevRaces than of age: age only has values between 3 and 13, whereas prevRaces has values between 0 and 229. There are therefore more joint top rated horses with age as an input than there are with prevRaces. We have to compare roughly equal numbers of qualifiers; we cannot just ask whether the top ranked horse using age works better than with prevRaces, because there are far more top rated horses with age as the input.

Taking the top 3 rated for prevRaces would be roughly equivalent to the top rated for age. We can see that betting to win £1 on each horse would have yielded a loss of 2.41% for age, whereas for the top 3 rated using prevRaces we have a loss of only 1.4%, better than backing all horses and losing 1.84%. The Brier skill score is also better for prevRaces.

My next step would be to take a look when we train on 2011 to 2015 and then test for results on 2016/17. For the time being, though, Dr Jones would seem to have a valid point.

Informed and Uninformed Money


Informed, uninformed, smart, mug: call it what you will, many researchers have referred to these two different kinds of money forcing a price down. What would be uninformed money? Perhaps the last winner of Frankie's Ascot seven, or maybe simply a horse that won last time out and has a top jockey up today against a bunch who are not quite sparkling form-wise. It does not have to be rank stupid; it only has to be a case of what this money knows that most of the public does not, and in many cases that's not a lot. You should always ask yourself this question with a bet: what do I know that the public does not? Do not take the last part as meaning it's a complete secret. They only have to be taking little notice of something that is available to them, because they do not have the time or inclination to find it out, or perhaps the confidence to act on it.

With all this in mind, I was pondering whether it would be possible to use Machine Learning to analyse the performance of smart money versus not so smart. I set about the task using MySportsAI, software, by the way, that allows you to carry out this kind of analysis provided you have the mental capacity and dexterity to click a mouse.

I first looked through all the features that were strongly correlated with the price dropping on a horse. OK, how did I do this, and what the hell are features? There are around 90 features in MySportsAI and growing, and it also allows you to engineer your own. A feature is simply a characteristic of a horse running in a race, for example the trainer strike rate of that horse going into each race. Now we have cleared that up, what do I mean by correlated and by price drop? For price drop I used the feature PriceDrift, so yes, I am going to predict horses that drift in price, which is more or less the flip side of dropping in price. PriceDrift is the pre-race average Betfair price divided by the Betfair SP. In MySportsAI you can see which features influence (that means correlate with) the PriceDrift target feature. To do this we first make PriceDrift the target feature and then use the correlation matrix in MySportsAI; you have seen this before in previous blog posts. I have purposely left out the names of the input features I used; hey, it's all there, try it yourself.
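The feature-screening step amounts to a correlation matrix against the target; here is a sketch with placeholder feature names and invented data, not the real MySportsAI features.

```python
import numpy as np
import pandas as pd

# Invented features and a PriceDrift column loosely driven by featureA
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "featureA": rng.normal(size=500),
    "featureB": rng.normal(size=500),
})
df["PriceDrift"] = 0.3 * df["featureA"] + rng.normal(size=500)

# Correlation of every feature with the PriceDrift target
correlations = df.corr()["PriceDrift"].drop("PriceDrift")

# Keep the features with the strongest (even if still weak) correlations
chosen = correlations[correlations.abs() > 0.1].index.tolist()
```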

Once we have chosen some features that have some sort of punch in predicting PriceDrift (in the above I have selected 5 that have a correlation between -0.08 and +0.07), we can use them to create a model to predict PriceDrift. I first created a model using data from 2011 to 2015 for handicap races, training on the first 80% of the data and testing on the last 20%. Once I had done this I saved the results of the test data to a CSV file so that I could load it into Excel. I repeated the exercise, this time training on all 2011 to 2015 data but testing on 2016/17.

Once I had loaded the two CSV result files into Excel I could take a look at how horses fared if they were top rated within a race to be a drifter BUT actually dropped in price. You can hopefully see where I am going with this: conventional form features are indicating a drift but the market is saying no. I confined my selections to those under 10.0 Betfair SP, simply because I felt that drops and drifts at longer prices can be less meaningful unless they are huge.

The test data from 2011 to 2015 (about a season's worth) produced 389 bets and an after-commission profit of +34.22 points, ROI +8.79%.

The test data for 2016/17 produced 411 bets and a profit of +66.45 points, ROI +16.16%.

From the above it is quite possible that not all droppers are equal; perhaps some are more informed and the market does not quite drop them enough. There are of course other possibilities with this avenue of research, trading being the most obvious.

If you liked this article, click the ratings and drop me a comment (no pun intended).

Going Distance Jockey What’s Important


Imagine we had 3 top gamblers available to conduct an experiment; let's say Pittsburgh Phil, Alan Potts and Dave Nevison. OK, perhaps not Dave Nevison, but 3 top punters of your choice. Now let's imagine we want to drill down and figure out how important the key ingredients are that they use to pick bets. Let us imagine that, amongst other things, Phil looks at the suitability of the going for the horse, the horse's previous experience at the race distance, and the ability of the jockey. Trouble is, we do not know how these factors rank: is one more important than another? One way of figuring this out is to sit Phil down in a locked room for 5 years and get him to punt for 2 of them with all the information/data he needs, along with food and water of course, and perhaps an exercise yard and a small bird in a cage. Seeing as we are dealing with dead punters here, I will sling Telly Savalas in the next room. Now, having logged his excellent performance over the first 2 years, we randomly alter some of the data he is receiving on the horses' going suitability. We do this for a year and see how much his betting performance suffers. We then do the same for the data on distance suitability, and finally for jockey worthiness. After these 3 data alterations we will see that his punting has suffered by varying degrees depending on the three sets of altered data. By comparing these 3 values we are now in a position to order the importance of the three inputs.

This, in essence, is the process behind a Machine Learning feature importance approach known as permutation feature importance. It does not work in quite the way outlined above: it does not randomly alter a feature's content over subsequent years. Instead it will train a model on, say, 4 years of data and test it on the 5th year, and then, to gauge feature importance, it will repeat the predictions on the 5th year with each feature randomly shuffled in turn to see which shuffle has the greatest negative impact.
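sklearn implements exactly this; the sketch below uses invented data and illustrative feature positions, with 20 repeats per feature as in the box plots discussed later in the post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Invented data: feature 0 drives the outcome, feature 1 helps a
# little, feature 2 is pure noise
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Chronological-style split: no shuffling, later rows form the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)
model = GradientBoostingClassifier().fit(X_tr, y_tr)

# Shuffle each feature 20 times on the test set and measure the
# average drop in score; bigger drop = more important feature
result = permutation_importance(model, X_te, y_te,
                                n_repeats=20, random_state=0)
```

Running the same call against `X_tr, y_tr` gives the training-side importances, and comparing the two orderings is the overfitting check described next.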

The other useful thing you can do with feature importance calculations is check the importance on the training data and then on the test data. If the ordering is wildly out of line across the two, it may well be a sign that the model has over fitted to the training data; that is to say, it is tending to memorise the data and will therefore not predict very well on new data.

I have been running some checks on this method to see how it compares to the bog standard feature importance that comes with the Python sklearn package, and, as you shall see, the two can disagree.

The above is using the standard feature importance algorithm and we can see that jockey strike rate is top in terms of feature importance. Now let us look at feature importance using the permutation method.

On the left we can see that on the test data jockey strike rate is also the most important feature, but after that there is disagreement with the first plot. The right hand box plot shows the importance when applied to the training data, and we can see that jockey strike rate and class move have maintained their relative positions of 1st and 2nd, which is a good sign that the model is generalising from the training data to the test data. The whiskers coming out of the box plots show the variation across 20 different sets of predictions, each with a different random shuffle of the feature's values. The box represents the average of all 20 readings.

The benefit of this approach is twofold: firstly, you get a more accurate evaluation of feature importance, and secondly, comparing train and test gives an insight into possible model over fitting. The permutation variety will be shown in the Autumn update of MySportsAI, software that allows you to create models at the click of a button with no prior ML knowledge.