Prompted by an excellent Twitter post by @AntoineJWMartin on building a simple expected goals (XG) model in R, I decided to try something similar in Python, with the difference that I would construct the data so that it can be loaded into MySportsAI and modelled. In most other respects it is similar to Antoine’s work: he takes the average of the last 3 XG differences for each team and then calculates the difference between the two teams going into their next game. That sounds complicated, so let me explain.
I will assume you know what expected goals are; if not, a quick google will enlighten you. Let us imagine Arsenal in their last match had XG of 1.2 and their opposition in that match had XG of 0.8. The difference for that match is 0.4. We calculate this for each of the last 3 matches and take the average. This is then Arsenal’s average xgDiff (my naming) going into the next match. Now if we calculate their opposition’s xgDiff in the same way, we can subtract one from the other to get a rating, and perhaps this simple rating can be modelled.
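As a rough sketch of that arithmetic in Python: only the 1.2 v 0.8 Arsenal match comes from the example above, while the other figures and the function name are made up purely for illustration.

```python
def average_xg_diff(matches):
    """Average of (XG for - XG against) over a list of (xg_for, xg_against) pairs."""
    diffs = [xg_for - xg_against for xg_for, xg_against in matches]
    return sum(diffs) / len(diffs)

# Last 3 matches for each side as (xg_for, xg_against).
# Only the first Arsenal pair is from the text; the rest are hypothetical.
arsenal_last3 = [(1.2, 0.8), (2.0, 1.1), (0.9, 1.3)]
opponent_last3 = [(0.7, 1.5), (1.4, 1.0), (1.1, 1.6)]

arsenal_xg_diff = average_xg_diff(arsenal_last3)
opponent_xg_diff = average_xg_diff(opponent_last3)

# The rating for the upcoming match is one team's xgDiff minus the other's.
rating = arsenal_xg_diff - opponent_xg_diff
print(round(arsenal_xg_diff, 2), round(opponent_xg_diff, 2), round(rating, 2))
```

A positive rating here simply means Arsenal have been out-creating their recent opponents by more than the other side has.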
The main thrust of Antoine’s tweet was to show how to get the data and prepare it, and if there is interest I will do the same or at least make some general code available. The two web sites needed for this data are http://www.fbref.com for expected goals data and http://www.football-data.co.uk for the results and betting odds. The work involves merging the two into one set of data to be fed into MySportsAI.
Once I had done this, the loaded data looked like this. The initial rows have an xgDiff of NaN because the teams have not yet played 3 matches and therefore cannot have a rolling average. These rows are removed at the modelling stage.

Some explanation is needed first. The data is for the English Premiership, 2019 to 2022. I have excluded data on the draw, so the above is like a series of two-horse races, home and away. finPos is 1 or 0 depending on whether the home or away team won. Although I have stuck to the naming convention of BFSP for starting price, I actually took the Bet365 odds from the football-data.co.uk data, although other bookmakers can be used.
At first the results looked too good to be true, and when they do you must always assume the worst. On careful inspection I realised that in taking the last 3 game average I was in fact including the current game in the average. Clearly, putting the current game’s xgDiff into the average is going to inflate the apparent predictive power of the model, because that figure is not known before kick-off.
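To make the leak concrete, here is a minimal pandas sketch. The column names and figures are hypothetical and the rows are assumed to be in date order for each team; it is not the code I actually ran. The only difference between the two versions is the shift(1), which pushes each match's own figure out of its window.

```python
import pandas as pd

# Hypothetical per-team match log, already sorted by date.
matches = pd.DataFrame({
    "team": ["Arsenal"] * 5,
    "xg_for": [1.2, 2.0, 0.9, 1.5, 1.0],
    "xg_against": [0.8, 1.1, 1.3, 0.5, 1.0],
})
matches["match_xg_diff"] = matches["xg_for"] - matches["xg_against"]

per_team = matches.groupby("team")["match_xg_diff"]

# Leaky: the window ends on the current match, so its own XG is included.
matches["xgDiff_leaky"] = per_team.transform(lambda s: s.rolling(3).mean())

# Safe: shift by one match first, so the window covers only the 3 previous games.
matches["xgDiff"] = per_team.transform(lambda s: s.shift(1).rolling(3).mean())

print(matches[["match_xg_diff", "xgDiff_leaky", "xgDiff"]])
```

With the shift in place the first three rows for each team are NaN rather than the first two, since a team needs 3 completed matches before it has a rating.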
Running the model with a train/test split, using logistic regression, produced the following results.
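For anyone wanting to try the same kind of fit outside MySportsAI, a minimal scikit-learn sketch might look like the one below. The data is synthetic (generated with a made-up relationship between rating and result), so the numbers it prints say nothing about the real results. Keeping the rows in order with shuffle=False is my assumption about a sensible split, not a description of what MySportsAI does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: one row per team per match.
rng = np.random.default_rng(0)
n_rows = 400
rating = rng.normal(0.0, 0.6, n_rows)               # xgDiff rating difference
true_prob = 1.0 / (1.0 + np.exp(-1.5 * rating))     # made-up relationship
fin_pos = (rng.random(n_rows) < true_prob).astype(int)  # 1 = won, 0 = did not

X = rating.reshape(-1, 1)

# shuffle=False keeps the earlier rows for training and the later rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, fin_pos, test_size=0.25, shuffle=False
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Probability of finPos == 1 for each test row.
win_prob = model.predict_proba(X_test)[:, 1]
print("test accuracy:", model.score(X_test, y_test))
```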

I need to desk check this to make sure all is OK. The results look promising, but that is not the main reason for this exercise. At this stage we are just looking at ways of configuring data for football modelling, and considering I am not a football modeller I would appreciate any feedback.
Footnote: Another possibility is weighting the last 3 matches. Using a weight of 0.5 for the last match, 0.3 for the 2nd last match and 0.2 for the 3rd last match (note I just made these up), I got the following improved results.
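A sketch of how that weighting could be done with a pandas rolling window, again with hypothetical match figures. The one thing to watch is the order: the window runs from oldest to newest, so the weights have to be listed as 0.2, 0.3, 0.5.

```python
import numpy as np
import pandas as pd

# Weights in window order: 3rd last match, 2nd last match, last match.
WEIGHTS = np.array([0.2, 0.3, 0.5])

# Hypothetical per-match xgDiff values for one team, in date order.
match_xg_diff = pd.Series([0.4, 0.9, -0.4, 1.0, 0.0])

# shift(1) keeps the current match out of its own window;
# windows that still contain a NaN come back as NaN.
weighted_xg_diff = (
    match_xg_diff.shift(1)
    .rolling(3)
    .apply(lambda window: np.dot(window, WEIGHTS), raw=True)
)
print(weighted_xg_diff)
```

Because the weights sum to 1 this is still an average, just one that leans towards the most recent game.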

Just a couple of generic comments (non-judgmental just comments)
Why use the average of the last 3 matches? Wouldn’t that most likely be made up of 2 home and 1 away matches or vice versa, with a possibility of the last three matches all being home ones or all away ones? Wouldn’t the average of the last 4, 5 or 6 matches, or a cumulative average to date during the current season, be a better option?
I merely mention this as many teams have different approaches to Home matches as opposed to Away matches (e.g. Home supporters encourage their team to attack more at home than away). Just averaging the last 3 may give undue weight to Home and/or Away advantage/disadvantage.
Likewise key players may or may not have been injured or rested in recent matches.
Why not have three finishing position fields rather than just the two for Home and Away where you have a 1 in the finishing position column representing Home or Away or Draw outcome?
Wouldn’t having the three outcomes also permit summing output ratings to 1.0?
All good points and I agree. In fact the home/away aspect had occurred to me, and perhaps a weight should be attached to the xgDiffs for those 3 matches depending on whether they were home or away. Some of the other points come down to hyperparameter tuning. With regard to the draw, yes, that is my next step. I was thinking of using the absolute xgDiff difference of the two teams, let’s see.
Hi Mark,
Don’t know if you’re familiar with a paper examining Elo ratings? Ratings aimed at forecasting goal difference proved to be superior to ones trying to model simple WDL outcomes.
However, surprisingly (to me at least!), modelling bookmaker odds rather than any ‘actual’ results was even more effective.
Take a look at
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0198668
It might prove a more effective target for your XG trials.
Good to hear from you Stef, thanks for the nudge. I will take a look at that reference.
Further thoughts on this: it occurred to me that having ‘sum race probs to 1’ selected is not a good idea if the draw has been left out. You would want any left over probability to perhaps account for the draw. Deselecting this and re-running did improve the results of the value bets, and this improvement was repeated when I did the same on Championship data.
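A toy illustration of the point, with made-up probabilities and assuming the option simply divides each runner's probability by the race total:

```python
# Hypothetical raw model outputs for one match (draw not modelled).
raw_probs = {"home": 0.45, "away": 0.30}

total = sum(raw_probs.values())

# With 'sum race probs to 1' selected: home and away are scaled up to fill
# the whole book, as if the draw could not happen.
normalised = {side: p / total for side, p in raw_probs.items()}

# With it deselected: whatever is left over can be read as the draw.
leftover_for_draw = 1.0 - total

print(normalised, round(leftover_for_draw, 2))
```

Scaling the two probabilities up makes both sides look more likely than the model actually thinks, which could throw up value bets that are not really there.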