Football Betting The Long and Short


An interesting discussion cropped up the other day on the MySportsAI email forum. It started when a member posted about how he was attempting to lay out data for a Machine Learning approach to predicting probabilities for home, away and draw in soccer matches. There are two general solutions to this problem, one long and one short so to speak, but I want to approach this top down so that hopefully anyone can understand not only the problem but also the solutions.

Let us approach this with a model in mind, a very simple model that is not being put forward by me as a winning model but merely as a vehicle for exploring the problem in hand. Imagine we want to create a model which has just one input feature: the average possession percentage in the last 3 games for each of the two teams about to meet each other. How do we lay this data out so that we can model it?

One approach is to lay the data out in wide format so for example we might have the following



The result was a draw, i.e. 2 = draw, 1 = home win, 3 = away win

The inputs to the model would be the two possession fields HomePoss% and AwayPoss% and of course the target to predict would be the result field. The odds fields will be used by us to calculate how profitable the model is and are unlikely to be inputs to the model at this stage.

Now because we are not predicting a binary outcome, i.e. 1 or 0 to signify win/lose, we cannot use algorithms that are solely intended for binary outcomes. What we need are MULTINOMIAL methods like multinomial logistic regression. We can also use a deep learning neural network in which we specify that there will be 3 output nodes, one for each result. I am not going to delve into this further here although I have created a neural network on these lines using Keras and TensorFlow.

We also cannot feed the above configuration into MySportsAI as the algorithms within MySportsAI are set up for binary outcomes, e.g. has the horse described by a line of data won the race in question or not.
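As a rough illustration of what the 3 output nodes do, here is a minimal sketch in plain Python. The raw scores are made up for the example; in a real network or multinomial logistic regression they would come from the trained weights applied to the two possession features. The softmax step simply turns the three scores into probabilities that add up to 1.

```python
import math

def softmax(scores):
    """Turn raw output-node scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one match, in the order home, draw, away
raw_scores = [0.9, 0.1, -0.4]
home_p, draw_p, away_p = softmax(raw_scores)
print(round(home_p, 3), round(draw_p, 3), round(away_p, 3))
```

The three probabilities can then be compared with the bookmaker odds to look for value.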

This led to a discussion on how we could configure things so that MySportsAI could handle this home away draw prediction scenario. One way forward is to create 3 lines of data per match so a match is like a 3 horse race. The first line could be



The second line could be


Now we run into a problem. The third line needs to represent the draw but there is no possession percent for a draw. The first two possession percents are absolute values for each team derived from their last 3 games. You could put in the difference between the two for this third line but then you are mixing apples with oranges, as the first two values have a different scale and function to the value for the draw. If however the input features are relative values in some way then this approach makes more sense. For example we could use the ratio of home% to away% as the input value, as shown below




Now we have something approaching apples and apples. However the algorithm would not know who is the home team and who is the away team. Clearly the first two 1.26’s are not the same as one belongs to a home team and the other to an away team. To pass on this information we could add 3 extra fields (yes, we can do it with 2 fields but I want to make things clear). Here the 3 fields would signify

1,0,0 home team

0,1,0 away team

0,0,1 draw

We now have the following with the 3 extra fields after the team name




We now have a 3 horse race so to speak and MySportsAI will give rankings and percentage chance for a match once a model has been trained on historical data.
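To make the layout concrete, here is a minimal sketch of turning one match into the 3 lines described above. The team names, possession figures, field names and the `match_to_rows` helper are all hypothetical, and I am assuming the same home% to away% ratio is placed on all three lines.

```python
def match_to_rows(home, away, home_poss, away_poss, result):
    """Expand one match into three rows: home, away and draw.

    result uses the coding from the wide layout:
    1 = home win, 2 = draw, 3 = away win.
    """
    # Assumption: the same relative value goes on every row
    ratio = round(home_poss / away_poss, 2)
    return [
        {"name": home, "isHome": 1, "isAway": 0, "isDraw": 0,
         "possRatio": ratio, "won": int(result == 1)},
        {"name": away, "isHome": 0, "isAway": 1, "isDraw": 0,
         "possRatio": ratio, "won": int(result == 3)},
        {"name": "Draw", "isHome": 0, "isAway": 0, "isDraw": 1,
         "possRatio": ratio, "won": int(result == 2)},
    ]

# Hypothetical match that ended in a draw
rows = match_to_rows("Arsenal", "Chelsea", 58.0, 46.0, 2)
for row in rows:
    print(row)
```

Exactly one of the three rows carries a 1 in the won column, which is what makes the match look like a 3 horse race to a binary algorithm.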

I have personally found that this approach with a Gradient Boosting Machine algorithm works as well as a long approach and a deep learning Keras neural network.

All of the modelling by the way can be done in MySportsAI with just a click of a few buttons.

Love to hear other people’s thoughts on this and what approaches they are using

How Many Bets Before Bookie Bans You


I ran a Twitter poll today, curious to see what people thought was a realistic number of bets a bookie might allow, given that he is monitoring whether you beat SP with their early prices and using this to judge if or when to ban you. Here are the results

What I was curious about was how many of these SP-beating bets a new punter could have consecutively and yet still be a losing punter. I started off looking at a punter betting randomly into Betfred’s 10am prices with the restriction that he only takes prices under 20/1. To keep things simple I also considered only races with no non runners at off time, across all racing codes in 2019.

We can easily calculate how much a punter will typically lose doing this, but what I wanted to know first of all was how many bets he would need to have before we could be sure he is a losing punter by only examining results. I ran a 1,000 run simulation with each run containing 1,000 bets. This produced 20 profitable runs, so clearly 1,000 bets is not enough to be sure. After a few trials it turned out that 2,500 bets produced no profitable runs and so 2,500 bets was taken as the point at which a punter is truly a losing one given the above. However what if a bookie was watching how well they beat SP during these opening bets? Many punters report being banned after 3 or 4 bets without having struck a winning bet, and it has been suggested that this is because bookies are monitoring your ability to beat SP.

I ran the same simulation but this time checking how often a run of X initial bets were ‘value’ bets according to price taken and SP. A run of 3 initial value bets occurred in 58 of the 1,000 sample runs. This means 58 losing punters would have been discarded by a bookmaker if 3 was their tolerance level.

Testing for 5 opening ‘value’ bets we have 7 occurrences, so 7 losing punters wrongly banished.

Not until we got to 8 initial value bets did we find zero occurrences and hence no losing punters discarded. In the pre poll run it was 7 but obviously due to the random generation of selections this number can vary slightly.
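As a sanity check on these counts, here is a small back-of-envelope sketch. It assumes, purely for illustration, that every random bet beats SP independently with the same probability, which the real data will not exactly obey. Working back from the 58 runs that opened with 3 value bets gives an implied probability of roughly 0.39 per bet, which in turn suggests about 9 runs opening with 5 value bets and fewer than 1 opening with 8, reasonably close to the 7 and 0 seen in the simulation.

```python
runs = 1000
observed_three = 58  # runs that opened with 3 value bets, from the simulation

# Assumption: each bet independently beats SP with the same probability p
p = (observed_three / runs) ** (1 / 3)

# Expected number of runs (out of 1,000) opening with k value bets in a row
expected = {k: runs * p ** k for k in (3, 5, 8)}

print("implied p:", round(p, 3))
for k, count in expected.items():
    print(k, "opening value bets:", round(count, 1), "runs expected")
```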

The Twitter feedback stated that some bookmakers may be banning one or two bet people because the cross bookmaker intelligence service has already flagged them as people who take too much interest in the sport. This may be true but I am not convinced this accounts for all the 3 bets and out people.

Everyone knows my opinion on this topic. If you are a losing punter you will lose less or more slowly on Betfair and if you are a winning punter then you will end up on Betfair eventually, so go figure, Betfair

Simple XGoals Model 3

Following on from the last two posts where I looked at a simple expected goals model built using Machine Learning, in this post I am going to describe how you can get involved and play around with the very same data.

The first thing you will need to do is gather the required data. This is very quick and simple. At go to the English Premier League historical results and download the data for 2021/22. Save this in a new folder and name it what you want, although I called my data file e02122.csv. This is the file that will contain the results and the betting odds

Second thing to do is create a file using a text editor eg notepad called resultfiles.csv and in it place the following line


If you have called it something else then obviously enter that file name, and don’t forget to press return to start a new line. You can obviously gather more seasons than this, the above is just an example, but enter their file names into resultfiles.csv

Next step is to gather some expected goals data. Go to the web site and click on the competitions tab and select premier league

Click previous season to move back to 2021/22

Now click the score&fixtures tab to display the match scores for 2021/22. You will notice that this also contains xG (expected goals)

Now click share & export followed by Get table as csv

The results will now be displayed in csv format. I have not found a link to download the data as with so you will have to copy and paste the results from the page to a notepad file and call it Premexpgoals2122.csv

Almost finished, now create a new file in notepad called featurefiles.csv and enter into it the following line


Again if you grab other seasons then enter their file names into featurefiles.csv as well

There are a couple of other files that I will supply with the software, the first is called repteamnames.csv

Because team names in the results and the xgoals files are not always the same, e.g. Man Utd and Manchester Utd, the software will read the teams that need editing from this file and make the needed changes. If you run into any new discrepancies from earlier years just add the from and to names to this file. When you look at the repteamnames file alongside Premexpgoals2122 and e02122 you will see why they are in repteamnames.

The other file I will supply is histmatchweights.csv

This file will contain the following


The 3 means that the software will gather the xgoals from a team’s last 3 games

The 0.2,0.3,0.5 makes the software weight the last game with 0.5, the second last game 0.3 and the third last 0.2
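To show what this weighting does to the numbers, here is a minimal sketch. It is not the code inside createMLfootie.exe; the `weighted_form` function, the sample xG values and the choice to divide by the sum of the weights are my assumptions for illustration.

```python
def weighted_form(xg_values, weights=(0.2, 0.3, 0.5)):
    """Weighted average of a team's most recent games.

    xg_values is ordered oldest first, so the last item is the most
    recent game. weights follows the same order: third last game,
    second last game, last game.
    """
    n = len(weights)
    if len(xg_values) < n:
        return None  # not enough history yet
    recent = xg_values[-n:]
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

# Hypothetical xG figures for one team, oldest game first
print(weighted_form([1.1, 0.6, 1.8, 0.9]))
```

Changing the weights, or the number of games they cover, changes how quickly the rating reacts to the most recent performance.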

You can play around with these values in order to make the software create different data

You can download the software createMLfootie.exe from the utilities section of along with the two files I supply mentioned above.

Once you have run the program it will create a file called MLfootie.csv, it is this file you can load up into MySportsAI and create models on.

The idea behind this is to dip a toe into data modelling for football and perhaps with discussion and further ideas we can develop and enrich the data inputs, I already have a few ideas.

Best of luck and let me know how you get on

Simple XGoals Model 2

Following on from the previous post where I looked at a simple model using expected goals as an underpinning value for a simple rating system, I will continue the exploration by

  1. Extending the premiership analysis to 2018/19 through to 2021/22 data (4 seasons of data)
  2. Looking at an alternative baseline model of simply using goals scored difference

First the results for the XG model using a walkforward train and test. Let me explain. The software will train the model on 18/19 and test the model on 19/20. It will then train the model on 18/19 and 19/20 and test the model on 20/21. Finally it will train the model on 18/19, 19/20 and 20/21 and then test on 21/22. While it is doing this it accumulates the results from each test period before reporting the results.
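The walkforward schedule just described can be sketched in a few lines. This is only an illustration of the splitting logic with a hypothetical `walk_forward_splits` helper, not the MySportsAI implementation.

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season) pairs with an expanding training window."""
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

seasons = ["18/19", "19/20", "20/21", "21/22"]
splits = list(walk_forward_splits(seasons))
for train, test in splits:
    print("train on", train, "then test on", test)
```

Each test season is only ever predicted by a model that has seen earlier seasons, and the results from the three test seasons are pooled for reporting.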

Using a train test split only rather than walkforward so it will train on the first 80% and test on the last 20% we have

Both sets of results show a good return on value bets and the calibration plot looks reasonable down in the bulk of where the ratings will reside

Next I trained a model but this time using the goals scored difference for a team’s last 3 matches as a measure of their worth. So team A with results of 1-0, 2-0 and 1-5, if their score comes first, would have a gross difference of -1 (+1, +2 and -4) with an average of -0.33

In the above ignore the fact that the input feature is named as xgdiff, I did not change the feature name but did populate it with actual goals scored difference. The results using a train test split were as follows

The value bet profit has evaporated here adding extra weight to the idea that XGoals is a superior input to goal difference when it comes to profit generation. There is enough evidence here to prompt further investigation using more data and exploring other leagues.

Many thanks to and for data supply

Simple XGoals Model


Prompted by an excellent Twitter post by @AntoineJWMartin on building a simple XG goals model in R I decided to try and do something similar in Python, with the difference being that I would try and construct the data so that it can be loaded into MySportsAI and modelled. In most other respects it will be similar to Antoine’s work in that he takes the average of the last 3 XG differences for each team and calculates the difference between the two teams in the next game up. That sounds complicated so let me explain.

I will assume you know what expected goals are, if not a quick google will enlighten you. Let us imagine Arsenal in their last match had XG of 1.2 and their opposition in that match had XG of 0.8. The difference for that match is 0.4. We calculate this for the last 3 matches and average. This is then Arsenal’s average xgDiff (my naming) going into the next match. Now if we calculate their opposition’s xgDiff we can subtract one from the other to get a rating and perhaps this simple rating can be modelled.

The main thrust of Antoine’s tweet was to show how to get the data and prepare it and if there is interest I will do the same or at least make some general code available. The two web sites needed for this data are for expected goals data and for the results and betting odds. The work involves coding the two together into one set of data to be fed into MySportsAI.

Once I had done this the loaded data looked like this. Obviously the initial rows have xgDiff of NaN because the teams have not yet had 3 matches and therefore cannot have a rolling average. These are removed at the modelling stage.

Some explanation is needed first. The data is for the English premiership 2019 to 2022. I have excluded data on the draw so the above is like a series of two horse races, home and away. finPos is obviously 1 or 0 depending on whether the home or away team won and although I have stuck to the naming convention of BFSP for starting price I actually took Bet365 from data although others can be used.

At first the results looked too good to be true and when they do you must always assume the worst. On careful inspection I realised that in taking the last 3 game average I was in fact including the current game in the average. Clearly putting the current game’s XGdiff into the average is going to raise the predictability of the model.
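The fix for this kind of leak is to make sure the average for a match only uses the matches before it. Here is a minimal sketch in plain Python, not my original code; the `rolling_prior_mean` helper and the xG differences are hypothetical, and missing values are shown as None rather than NaN.

```python
def rolling_prior_mean(values, window=3):
    """For each game, average the previous `window` values, excluding the game itself."""
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]  # stops before game i
        out.append(sum(prior) / window if len(prior) == window else None)
    return out

# Hypothetical per-match xG differences for one team, oldest first
xg_diffs = [0.4, -0.2, 1.0, 0.3, -0.5]
form = rolling_prior_mean(xg_diffs)
print(form)
```

In pandas the same effect comes from shifting the column down one row before taking the rolling mean, so that each row only sees earlier matches.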

Running the model with a train/test split produced the following results using logistic regression

I need to desk check this to make sure all is OK. The results look promising but that is not the main reason for this exercise. At this stage we are just looking at ways of configuring data for football modelling and, considering I am not a football modeller, I would appreciate any feedback.

Footnote: Another possibility is weighting the last 3 matches. Using a weight of 0.5 for the last match, 0.3 for the 2nd last match and 0.2 for the 3rd last match (note I just made these up) I got the following improved results.

Know Your Trainers Neighbors


The classical way of looking at trainer form is to check how well a trainer has done over the last X runs. Often this amounts to simply looking at win to run ratio but sometimes this is refined further to perhaps ‘good’ runs or placed runs. But is there a more refined way of looking at trainer form? What I am getting at is how well does M Botti do when running a 3yo who has been off 71 days in a class 5 handicap. What if we add further criteria, perhaps stipulating that the animal is a male rather than a female horse? There are all kinds of criteria we could come up with but it gets messy, and even if you do not think so the question remains, how do we evaluate his runs with such animals?

Machine Learning can help in this situation. K Nearest Neighbors (KNN) is one of the simplest algorithms to understand. Imagine we simply focus on Botti’s runners that are 3yos, or as near as possible to 3yo, and off 71 days, or as near as possible to 71 days. It would be great if Botti had a multitude of such previous runners but of course he won’t, so KNN will search for the sample of data nearest to these values. The size of the sample is set by us when we run a KNN program. I performed this task on the last race at Wolves on Saturday 26th November 2022. I trained the KNN algorithm on Botti’s data from 2011 to 2019 for class 5 and 6 races and then ran a prediction on Botti’s runner in the last at Wolves. Now normally the algorithm would predict the chances of Botti having a winner with a 3yo off 71 days. However I wanted to refine the prediction somewhat. I actually accessed the 21 nearest neighbors from 2011 to 2019 (I specified it should look for the 21 nearest instances) and then instead of lengths beaten for each animal I looked at pounds beaten and compiled an average. I did this for all trainers in the last race at Wolves and then ranked them, with of course the smallest average being the best ranked in the race. I also graphed the individual trainers’ nearest neighbors, here are a couple

At first glance Appleby’s graph looks better but of course the vertical scale is different although he does have a large outlier.

There is lots more work that can be done on this idea. Certainly the two inputs above should be normalised to lie between 0 and 1 otherwise the algorithm will give more weight to a difference of say 10 days in days since last run than perhaps a difference in one or two age years. This would lead to days since last run dominating the selection of the 21 nearest neighbors.
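Here is a minimal sketch of the idea with the normalisation included. The runner history, the pounds beaten figures and the helper names are all made up for illustration, and I use 3 neighbors rather than 21 to keep the example small.

```python
import math

def make_scaler(rows):
    """Build a function that min-max scales a row using the column ranges of `rows`."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]

    def scale(row):
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]
    return scale

def knn_average(history, targets, query, k):
    """Average the target over the k historical rows nearest to the query."""
    scale = make_scaler(history)
    q = scale(query)
    ranked = sorted((math.dist(q, scale(row)), t)
                    for row, t in zip(history, targets))
    nearest = ranked[:k]
    return sum(t for _, t in nearest) / len(nearest)

# Hypothetical past runners for one trainer: (age, days since last run)
history = [(3, 70), (3, 14), (4, 75), (5, 200), (2, 60), (6, 30)]
pounds_beaten = [4.0, 12.0, 6.0, 20.0, 2.0, 15.0]

avg = knn_average(history, pounds_beaten, query=(3, 71), k=3)
print(avg)
```

Running this for every trainer in a race and sorting by the average gives the ranking, smallest average first.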

Does this approach have any legs? Well I trained on 2011 to 2015 and tested on 2016/17 for all handicap races using just a couple of input fields for trainers, of which days since last run was one. During 2016/17 the top ranked made a variable stake profit of +19.1 points whilst the second ranked made +15.6

In this race Botti is top ranked and Appleby is second top, good luck with whatever you are on

Comments are welcome and don’t forget to rate the article

Has My Data Changed


A big problem model builders have is covariate shift. It sounds complicated but it’s really simple and I am going to explain it via a horse racing example that will be familiar to all of you. The days since a horse last ran is a common data feature in a horse racing data set whether you are a system builder or a model builder. Either way you are hoping that the way in which horses are generally campaigned this year is mirrored in the way they have been campaigned in previous years. I mean if I told you that next year no horse could run without having at least 3 weeks off you would be pretty worried about how this would affect your way of betting regardless of what approach you take.

To give this a fancy term we would say that the distribution of the data item days since last run has changed, in other words we have covariate shift. Now this is worrying if you built a model on the assumption of certain data distributions only to find that they have changed drastically. Such a situation occurred in 2020 when a whole industry had to shut up shop and no horse was running for weeks.

In MySportsAI you can now check for drift between the data you are training a model on and the data you are testing it on. Let’s demonstrate with a simple model using only days since a horse last ran.

The above was trained and tested on handicap data for 2011 to 2015, the first 80% being the training data and the latter 20% being the test data. The TTDrift value shows a measure of data drift, close to 0.5 is quite good. Now let’s see how it drifts when I train on 2011 to 2015 and then test on 2016/17

The drift is a little more pronounced and the variable stake profit on the top 3 rated has dropped

Now finally let’s check testing on 2020

The TTDrift has not surprisingly taken a jump and the top 3 VROI is poor, worse than you would get backing all horses.

Although the change in conditions here was forced upon us, in other cases drift can occur for reasons that are not as easy to determine, but at least you can now keep an eye on it. By the way although the drift on the above came down in 2021 and 2022 it has still not returned to 2011/15 levels. If you are interested in the mechanism behind the calculations here is a link
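For a feel of how a drift score that sits near 0.5 when nothing has changed can be built, here is a rough sketch of one common approach. This is not the TTDrift calculation itself, just an illustration under my own assumptions: it measures how well a single feature separates test rows from training rows, using a hypothetical `drift_auc` function and made up days since last run values.

```python
def drift_auc(train_values, test_values):
    """Chance that a random test value is larger than a random train value.

    Ties count as half. A score near 0.5 means the two samples are hard
    to tell apart; a score near 0 or 1 means the distribution has moved.
    """
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))

# Hypothetical days since last run
train_days = [7, 14, 21, 28, 35]
same_days = [7, 14, 21, 28, 35]        # no change in campaigning
shifted_days = [60, 70, 80, 90, 100]   # everyone has had a long lay off

print(drift_auc(train_days, same_days))
print(drift_auc(train_days, shifted_days))
```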

Predicting Profit Not Winners


Machine learning libraries like sklearn come with lots of ML algorithms: Neural Networks, Logistic Regression, Gradient Boosting Machine and so on. Off the shelf they all have one thing in common. If you give them a spreadsheet like set of data they will try to predict one of the columns, depending on which one you specify. So if one of my columns contains zero where a horse (a row of data) lost and one if it won, then we can get the ML algorithm to create a model that is hopefully good at predicting those 0s and 1s. It will even give you a probability between 0 and 1 so that you can then rank the horses in a race and perhaps just back the top ranked horse. This is all jolly good if we find this approach produces profit, but can we get an algorithm to predict profit? Would a model built to find profit work better than a model built to find winners?

To find profit with an ML algorithm we have to change something called the loss function. So what is the loss function when it’s at home? Let us think about a commonly used one, Mean Squared Error (MSE). If say a logistic regression model predicts Mishriff will win the Breeders Cup Turf with a probability of 0.21 and he does win then the error is 1 – 0.21 = 0.79

If on the other hand he loses then the error is 0 – 0.21 = -0.21

Now if we square the two potential errors we always get a positive number namely 0.62 and 0.04

This is the SE and we can see that if we take the average of these across all the predictions made in a year we have the MSE
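The same arithmetic in a few lines of Python, as a small sketch. The Mishriff figure of 0.21 comes from the example above; the other three predictions in the race are made up so that there is something to average.

```python
def squared_error(outcome, predicted):
    """Squared difference between what happened (1 or 0) and the predicted probability."""
    return (outcome - predicted) ** 2

def mean_squared_error(outcomes, predictions):
    errors = [squared_error(o, p) for o, p in zip(outcomes, predictions)]
    return sum(errors) / len(errors)

se_if_wins = squared_error(1, 0.21)    # about 0.6241
se_if_loses = squared_error(0, 0.21)   # about 0.0441

# Hypothetical four horse race where the 0.21 horse wins
outcomes = [1, 0, 0, 0]
predictions = [0.21, 0.30, 0.25, 0.24]
mse = mean_squared_error(outcomes, predictions)
print(round(se_if_wins, 4), round(se_if_loses, 4), round(mse, 4))
```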

Hopefully you can see that if losers have lower predicted probabilities and winners have higher probabilities as predicted by our model then we are heading in the right direction. If it’s the other way round then we have a pretty crap model. The algorithm will attempt to minimize this MSE in its search for a good model.

But we want to model for profit not accuracy, so we need a different loss function to MSE. We need to create our own, what is commonly known in ML circles as a custom loss function, plug this into our algorithm and say hey, use this, not the loss function you use by default.

You can do this with LightGBM and XGBoost but it is easier to do with Deep Learning and Keras. I am not going to go into the code detail here but I am going to share my findings after dipping my toe into this pool.

I created a loss function that would maximize variable stake profit proportional to the rating it produced for each horse in a race. In other words it is betting to win £1 on each horse in a race, but with whatever profit or loss is made on each horse multiplied by the rating value. So if the top rated horse won with a rating of 0.25 the winnings would be £1 x 0.25, and of course the losses on the lower rated horses would be smaller because they have lower rating values. The profit on a race is therefore increased, and the loss reduced, if higher rated horses win.
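To make the idea concrete without going into the Keras code, here is a plain Python sketch of the quantity being optimised for a single race. It is not the loss function I actually plugged into Keras; the `race_profit_loss` function, the use of decimal odds and the example ratings are assumptions for illustration, and a real Keras version would have to be written with tensor operations so that it can be differentiated.

```python
def race_profit_loss(ratings, decimal_odds, winner_index):
    """Negative rating-weighted profit for one race.

    Each horse is backed to win 1 unit, the profit or loss on each horse
    is multiplied by its rating, and the total is negated so that
    minimising the loss maximises the weighted profit.
    """
    total = 0.0
    for i, (rating, odds) in enumerate(zip(ratings, decimal_odds)):
        stake = 1.0 / (odds - 1.0)               # stake needed to win 1 unit
        pnl = 1.0 if i == winner_index else -stake
        total += rating * pnl
    return -total

# Hypothetical three horse race
ratings = [0.5, 0.3, 0.2]
odds = [3.0, 4.0, 6.0]

loss_top_rated_wins = race_profit_loss(ratings, odds, winner_index=0)
loss_outsider_wins = race_profit_loss(ratings, odds, winner_index=2)
print(loss_top_rated_wins, loss_outsider_wins)
```

The loss is lower when the higher rated horse wins, so training pushes the ratings towards horses whose wins pay.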

Plugging this in to a Deep learning Neural Network using Keras produced the following results for top rated horses in each race (UK Handicaps flat). I go on to compare this with a GBM model produced in MySportsAI using the same data but obviously designed to find winners.

First data for 2011 to 2015 was split chronologically into 80% for training and 20% for testing. If you have used a Neural Network before you will know that because of the stochastic nature of NNs you can train a model and get results from it, but if you retrain it you will get different results (MySportsAI users try this with the NN option). This is not the case with GBM. This does not mean NNs are unreliable, you just have to train and test a few times to get a reasonable picture or an average. Here are the results for top rated horses for 5 runs with a custom loss function in place.

Each run produced 3959 top rated bets

Run 1 ROI% 1.79 VROI% 2.45

Run 2 ROI% 5.05 VROI% 1.82

Run 3 ROI% -3.76 VROI% 1.45

Run 4 ROI% -0.08 VROI% 0.69

Run 5 ROI% 2.18 VROI% 3.21

The first thing I should mention about the above models is that in line with common wisdom I scaled the 3 input features so that they were all in a range of 0 to 1. This is something that is commonly advised for NN’s but I was about to find that the opposite was the case for my data which surprised me.

Here are the results without scaling.

Run 1 ROI% 10.53 VROI% 4.8

Run 2 ROI% 6.47 VROI% 2.06

Run 3 ROI% 2.79 VROI% 3.08

Run 4 ROI% 9.77 VROI% 7.79

Run 5 ROI% 9.49 VROI% 12.11

So how does the GBM algorithm perform with the same data but obviously no custom loss function?

ROI% 5.71 VROI% 5.66

When taking averages GBM is slightly worse than the average performance of the NN using a custom loss function.
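The averages behind that comparison can be checked quickly from the run figures listed above. Note that the statement holds for the unscaled NN runs; the scaled runs average out below GBM.

```python
from statistics import mean

# ROI% and VROI% for the five NN runs reported above
nn_scaled = {"ROI": [1.79, 5.05, -3.76, -0.08, 2.18],
             "VROI": [2.45, 1.82, 1.45, 0.69, 3.21]}
nn_unscaled = {"ROI": [10.53, 6.47, 2.79, 9.77, 9.49],
               "VROI": [4.8, 2.06, 3.08, 7.79, 12.11]}
gbm = {"ROI": 5.71, "VROI": 5.66}

for label, runs in (("NN scaled", nn_scaled), ("NN unscaled", nn_unscaled)):
    print(label, round(mean(runs["ROI"]), 2), round(mean(runs["VROI"]), 2))
print("GBM", gbm["ROI"], gbm["VROI"])
```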

My next step was to look at how these two performed on validation sets, in other words other hold out periods, i.e. 2016-17 data and 2018-19 data. First 2016/17. The first question to ask is which of the 5 runs I performed with the NN should I use. I tried the highest performing first and this gave some weird results. The top rated horse was getting a rating of 0.99etc which suggests something went wrong. Probably the NN found what’s called a local optimum and simply overfitted or, in layman’s terms, got lucky in this case. Needless to say the results on 2016/17 were poor. Next I tried a mid range model and this looked promising


2016/17: GBM ROI% -1.66 VROI% 0.55, NN with loss function ROI% 8.08 VROI% 3.68


2018/19: GBM ROI% 6.23 VROI% 3.11, NN with loss function ROI% 4.12 VROI% 3.78

Another area of interest may be to use the ranking of the horse instead of the probability when multiplying the loss in the loss function. If you have any ideas of your own please comment and vote on the usefulness of the article.

One Hot V Ordinal Encoding

Steve Tilley sent me this interesting article today which delves into the benefits of using ordinal encoding over one hot encoding in some situations.

A synopsis of the piece would be that for some tree based algorithms like GBM and Random Forests ordinal encoding can be a better option than the usually recommended one hot encoding.

OK I am getting ahead of myself here, what do the above terms actually mean? Well imagine we have a racing data feature like race going (F, GF, Gd etc) and let’s say we want to model on pace figure and going because maybe together they have some predictive power. We cannot use the going data as is because ML algorithms require numeric values. The conventional wisdom would be that if the going does not have some intrinsic ordering to it then one hot encode it, which simply means create a binary feature for every possible value, like this

As the article points out this can lead to an explosion of features and possibly the curse of dimensionality.

Below is the performance of a model on pace figure and one hot encoded going for turf flat handicaps. The top rated made a ROI of 1.95% but a variable ROI of -0.7%

Now if we use a numeric value for going, namely 0 = Hvy, 1 = Sft, 2 = GS etc, and so have only two input features, pace figure and going, we get a slightly better set of results

These results suggest, as the article does, that we should not jump to conclusions about one hot encoding. Ordinal encoding with tree based algos may be just as good if not better
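For anyone who wants to see the two encodings side by side, here is a minimal sketch. The first three codes follow the numbering given above (0 = Hvy, 1 = Sft, 2 = GS); the order of the remaining codes is my assumption for the example.

```python
# Softest to fastest; the positions after GS are assumed for illustration
GOING_ORDER = ["Hvy", "Sft", "GS", "Gd", "GF", "F"]

def one_hot(going):
    """One binary feature per possible going value."""
    return [1 if going == g else 0 for g in GOING_ORDER]

def ordinal(going):
    """A single numeric feature that keeps the ordering of the going."""
    return GOING_ORDER.index(going)

for going in ("Hvy", "GS", "F"):
    print(going, one_hot(going), ordinal(going))
```

One hot encoding turns the single going column into six input features here, whereas ordinal encoding keeps it as one.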

Modeling Heritage Handicaps


Back at the beginning of the 2022 flat season a tipping competition popped up on Twitter. Entries had to make two selections in all the flat season’s heritage handicaps. I felt this was a nice opportunity to test a machine learning model designed to run specifically on heritage handicaps so I set about creating such a model. Drilling down into the data for just heritage handicaps might produce too little data to work with so I decided to go for training the model on all races of class 4 and below. I also ended up splitting the task into two models, one for races up to a mile and another for races beyond a mile. Selections would be made simply by posting up on Twitter the top two rated.

Things got off to a pretty surprising start when the model found Johan, a 28/1 SP winner of the Lincoln, and generally got better as the year progressed. Here are the results with EW bets settled at whatever place terms were on offer by at least two big bookmakers.

Firstly let me say that the above returns are not likely sustainable but the profit generated does add weight to the historical results and suggests that the model can be profitable going forward especially at these place terms. I will consider posting these up to the MySportsAI email forum next year