An interesting discussion cropped up the other day on the MySportsAI email forum. It started when a member posted about how he was attempting to lay out data for a machine learning approach to predicting home, away and draw probabilities for soccer matches. There are two general solutions to this problem, one long and one short so to speak, but I want to approach this top down so that hopefully anyone can understand not only the problem but also the solutions.

Let us approach this with a model in mind, a very simple model that I am not putting forward as a winning model but merely as a vehicle for exploring the problem in hand. Imagine we want to create a model which has just one input feature: the average possession percentage over the last 3 games for each of the two teams about to meet. How do we lay this data out so that we can model it?

One approach is to lay the data out in wide format, so for example we might have the following:



The result was a draw, where the result field is coded 1 = home win, 2 = draw, 3 = away win.

The inputs to the model would be the two possession fields HomePoss% and AwayPoss% and of course the target to predict would be the result field. The odds fields will be used by us to calculate how profitable the model is and are unlikely to be inputs to the model at this stage.
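As a rough sketch of the wide layout and of how the odds fields might be used afterwards, here is one match per line in plain Python. All team names and numbers are made up for illustration, the odds column names (HomeOdds, DrawOdds, AwayOdds) are my assumption, and the odds are assumed to be decimal odds.

```python
# Hypothetical wide-format data: one line per match. All values are made up.
# Result coding follows the article: 1 = home win, 2 = draw, 3 = away win.
matches = [
    {"Home": "Team A", "Away": "Team B", "HomePoss%": 55.0, "AwayPoss%": 45.0,
     "HomeOdds": 2.10, "DrawOdds": 3.40, "AwayOdds": 3.60, "Result": 2},
    {"Home": "Team C", "Away": "Team D", "HomePoss%": 48.0, "AwayPoss%": 52.0,
     "HomeOdds": 2.80, "DrawOdds": 3.20, "AwayOdds": 2.60, "Result": 1},
]

# Which (assumed) odds column pays out for each result code.
ODDS_FIELD = {1: "HomeOdds", 2: "DrawOdds", 3: "AwayOdds"}


def split_inputs_and_target(match):
    """Return the two model inputs and the target for one match."""
    features = [match["HomePoss%"], match["AwayPoss%"]]
    return features, match["Result"]


def profit_for_bet(match, predicted_result, stake=1.0):
    """Profit from backing one outcome at decimal odds (odds are not model inputs)."""
    if match["Result"] == predicted_result:
        return stake * (match[ODDS_FIELD[predicted_result]] - 1.0)
    return -stake
```

The point of keeping the odds out of `split_inputs_and_target` is that they are only used after the event to score the model's selections.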

Now because we are not predicting a binary outcome, i.e. 1 or 0 to signify win/lose, we cannot use algorithms that are solely intended for binary outcomes. What we need are multinomial methods such as multinomial logistic regression. We can also use a deep learning neural network in which we specify 3 output nodes, one for each result. I am not going to delve into this further here, although I have created a neural network along these lines using Keras and TensorFlow.
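To make the multinomial idea concrete, here is a minimal sketch of multinomial (softmax) logistic regression written in plain Python and trained by simple gradient descent. It is not the Keras network mentioned above. The data is hypothetical, possession is expressed as a fraction rather than a percentage, and the classes are indexed 0 = home win, 1 = draw, 2 = away win (the result code minus 1).

```python
import math


def softmax(scores):
    """Turn 3 raw scores into 3 probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, features):
    """One linear score per outcome, then softmax across the outcomes."""
    x = [1.0] + features  # leading 1.0 is the bias input
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(scores)


def mean_log_loss(weights, X, y):
    """Average negative log probability assigned to the actual result."""
    losses = [-math.log(predict_proba(weights, f)[c]) for f, c in zip(X, y)]
    return sum(losses) / len(X)


def fit(X, y, n_classes=3, lr=0.1, epochs=500):
    """Full-batch gradient descent on the multinomial log loss."""
    n_inputs = len(X[0]) + 1
    weights = [[0.0] * n_inputs for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * n_inputs for _ in range(n_classes)]
        for features, cls in zip(X, y):
            probs = predict_proba(weights, features)
            x = [1.0] + features
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == cls else 0.0)
                for j in range(n_inputs):
                    grads[k][j] += err * x[j]
        for k in range(n_classes):
            for j in range(n_inputs):
                weights[k][j] -= lr * grads[k][j] / len(X)
    return weights


# Hypothetical training data: [home possession, away possession] as fractions.
X = [[0.60, 0.40], [0.55, 0.45], [0.50, 0.50],
     [0.45, 0.55], [0.40, 0.60], [0.52, 0.48]]
y = [0, 0, 1, 2, 2, 1]  # 0 = home win, 1 = draw, 2 = away win

weights = fit(X, y)
probs = predict_proba(weights, [0.58, 0.42])  # home, draw, away probabilities
```

The three values in `probs` play the same role as the 3 output nodes of a neural network: one probability per result, summing to 1.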

We also cannot feed the above configuration into MySportsAI, as the algorithms within MySportsAI are set up for binary outcomes, e.g. has the horse described by a line of data won the race in question or not.

This led to a discussion on how we could configure things so that MySportsAI could handle this home, away, draw prediction scenario. One way forward is to create 3 lines of data per match, so that a match is like a 3 horse race. The first line could be



The second line could be


Now we run into a problem: the third line needs to represent the draw, but there is no possession percentage for a draw. The first two possession percentages are absolute values for each team, derived from their last 3 games. You could put in the difference between the two for this third line, but then you are mixing apples with oranges, because the first two values have a different scale and function from the value for the draw. If, however, the input features are relative values in some way, then this approach makes more sense. For example, we could use the ratio of home% to away% as the input value, as shown below




Now we have something approaching apples and apples. However, the algorithm would not know which is the home team and which is the away team; clearly the first two 1.26s are not the same, as one belongs to a home team and the other to an away team. To pass on this information we could add 3 extra fields (yes, we can do it with 2 fields, but I want to make things clear). Here the 3 fields would signify

1,0,0 home team

0,1,0 away team

0,0,1 draw

We now have the following with the 3 extra fields after the team name
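For anyone wanting to build these lines programmatically, here is a minimal sketch that expands one wide-format match into 3 long-format lines with the 3 indicator fields and a binary target. The field names (IsHome, IsAway, IsDraw, PossRatio, Won) and the possession values are hypothetical, and putting the same ratio on the draw line is my assumption for this sketch.

```python
def match_to_long_rows(match):
    """Expand one match into 3 lines: home team, away team, draw."""
    ratio = round(match["HomePoss%"] / match["AwayPoss%"], 2)
    # (name, (home, away, draw) indicator fields, result code for this line)
    outcomes = [
        (match["Home"], (1, 0, 0), 1),
        (match["Away"], (0, 1, 0), 3),
        ("Draw", (0, 0, 1), 2),
    ]
    rows = []
    for name, flags, result_code in outcomes:
        rows.append({
            "Name": name,
            "IsHome": flags[0],
            "IsAway": flags[1],
            "IsDraw": flags[2],
            "PossRatio": ratio,  # assumption: same ratio on all 3 lines
            # Binary target: did this line "win the race"?
            "Won": 1 if match["Result"] == result_code else 0,
        })
    return rows


# Hypothetical match that ended in a draw (result code 2).
match = {"Home": "Team A", "Away": "Team B",
         "HomePoss%": 55.8, "AwayPoss%": 44.2, "Result": 2}
rows = match_to_long_rows(match)
```

Exactly one of the 3 lines has Won = 1, which is what lets a binary algorithm treat the match like a race with one winner.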




We now have a 3 horse race, so to speak, and MySportsAI will give rankings and percentage chances for a match once a model has been trained on historical data.
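I cannot speak for exactly how MySportsAI does this internally, but one common way to turn the binary model's output into rankings and percentage chances is to rescale the 3 raw scores within a match so they sum to 1. A minimal sketch, with made-up scores:

```python
def rank_match(raw_scores):
    """Rescale raw win scores within one match and rank the 3 lines."""
    total = sum(raw_scores.values())
    chances = {name: score / total for name, score in raw_scores.items()}
    ranked = sorted(chances, key=chances.get, reverse=True)
    return [(rank, name, chances[name])
            for rank, name in enumerate(ranked, start=1)]


# Hypothetical raw outputs of a binary model for the 3 lines of one match.
raw_scores = {"Team A": 0.50, "Team B": 0.30, "Draw": 0.25}
ranking = rank_match(raw_scores)
```

Because each line was scored on its own, the raw scores need not sum to 1; rescaling within the match fixes that.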

I have personally found that this approach with a Gradient Boosting Machine algorithm works as well as a long approach and a deep learning Keras neural network.

All of the modelling, by the way, can be done in MySportsAI with just a few button clicks.

I would love to hear other people's thoughts on this and what approaches they are using.