So far we have placed the rolling shots on target difference into a dictionary and we can access those values by a key which consists of team name and match date eg Norwich22/05/2022

What we now need to do is get these values into the main dataframe df at the correct rows so that we can analyse how they effect total match goals in relation to 2.5 goals scored. We essentially want two new columns which I am going to call homeSTdiff and awaySTdiff which will contain the total preceeding 3 match shots on target difference. for the home team and away team.

Here is the tail of the df dataframe as we would like it to be

Norwich in the 3 games preceeding the match with Tottenham have a total shots on target minus shots on target conceded of -8

To place these values into the df dataframe I am going to first create two empy lists

homeDiff = []
awayDiff = []

Next I am going to loop through the df dataframe row by row and at each row create the neccassery key from the home team and date and the away team and date and pick up from the dictionary the relevant rolling values and append them to the two lists created above. We should then have two lists with the values from the dictionary in exactly the same order as the rows of the df dataframe. We can then simply slot the lists into the df dataframe as new columns.

Now we simply insert the two lists into the df dataframe as new columns

df[‘homeSTdiff’] = homeDiff
df[‘awaySTdiff’] = awayDiff

Almost finished, we now need to create the final column which contains a 1 if the total goals in the match on that row was > 2.5 and 0 otherwise. This is the target column we will be trying to predict when we get around to doing some Machine Learning.

df[‘goals2.5’] = df.apply(lambda x: defineTarget(x[‘FTHG’], x[‘FTAG’]),axis=1)

The above statement is saying create a column called goals2.5 by applying a function called defineTarget to each and every row in the dataframe. The defineTarget function (which we have not written yet) has two pieces of information passed to it, a rows FTHG and a rows FTAG (full time home goals full time away goals).

Here is the defineTarget function we need in our program

Hopefully it is fairly self explanatory, it takes the two values it is handed, adds them together nd checks if its greater than 2.5, returning 1 if it is and 0 otherwise. This function gets called for every row in the dataframe df

Finally we output our dataframe df to a .csv file for later use

df.to_csv(‘goals2.5.csv’)

You can pick up the whole code file shown in these blog posts from the following url

http://www.smartersig.com/dataprep.py

In a few days we will start looking at the code to perform some Machine Learning on the file and see if we can predict >2.5 goals with any accuracy or profit