Prompted by an excellent Twitter post by @AntoineJWMartin on building a simple XG goals model in R I decided to try and do something similar in Python but with the difference being that I would try and construct the data so that it can be loaded into MySportsAI and modelled. In most other respects it will be similar to Antoine’s work in that he is taking the average of the last 3 XG differences for each team and calculates the difference between the two teams in the next game up. Thats sounds complicated let me explain.
I will assume you know what expected goals are if not a quick google will enlighten you. Let us imagine Arsenal in their last match had XG of 1.2 and their opposition in that match had XG of 0.8. The difference for that match is 0.4. We calculate this for the last 3 matches and average. This is then Arsenals average xgDiff (my naming) going into the next match. Now if we calculate their oppositions xgDiff we can then subtract them from each other to get a rating and perhaps this simple rating can be modelled.
The main thrust of Antoine’s tweet was to show how to get the data and prepare it and if their is interest I will do the same or at least make some general code available. The two web sites needed for this data are http://www.fbref.com for expected goals data and http://www.football-data.co.uk for the results and betting odds. The work involves coding the two together into one set of data to be fed into MySportsAI.
Once I had done this the loaded data looked like this. Obviously the initial rows have xgDiff of NaN because the teams have not yet had 3 matches and therefore cannot have a rolling average. These are removed at the modelling stage.
Some explanation is needed first. The data is for the English premiership 2019 to 2022. I have excluded data on the draw so the above is like a series of two horse races, home and away. finPos is obviously 1 or 0 depending on whether the home or away team won and although I have stuck to the naming convention of BFSP for starting price I actually took Bet365 from football-data.co.uk data although others can be used.
At first the results looked too good to be true and when they do you must always assume the worse. On careful inspection I realised that in taking the last 3 game average I was in fact including the current game in the average, clearly putting the current games XGdiff into the average is going to raise the predictability of the model.
Running the model with a train/test split produced the following results using logistic regression
I need to desk check this to make sure all is OK but the results look promising but that is not the main reason for this exercise. At this stage we are just looking at ways of configuring data football modelling and considering I am not a football modeller I would appreciate any feedback.
FootNote- Another possibility is weighting the last 3 matches. Using a weight of 0.5 for the last match, 0.3 for 2nd last match and 0.2 for 3rd last match (note I just made these up) I got the following improved results.