
Make Your Betting Pay

~ Improve Your Horse Betting

Category Archives: Machine Learning

Football Betting The Long and Short

22 Wednesday Feb 2023

Posted by smartersig in Machine Learning


An interesting discussion cropped up the other day on the MySportsAI email forum. It started when a member posted about how he was attempting to lay out data for a Machine Learning approach to predicting probabilities for home, away and draw in soccer matches. There are two general solutions to this problem, one long and one short so to speak, but I want to approach this top down so that hopefully anyone can understand not only the problem but also the solutions.

Let us approach this with a model in mind, a very simple model that is not being put forward as a winning model but merely as a vehicle for exploring the problem in hand. Imagine we want to create a model which has just one input feature: the average possession percentage over the last 3 games for each of the two teams about to meet. How do we lay this data out so that we can model it?

One approach is to lay the data out in wide format, so for example we might have the following:

HomeT,AwayT,HomePoss%,AwayPoss%,HomeOdds,AwayOdds,DrawOdds,Result

Coventry,Sunderland,57,45,2.3,3.3,3.35,2

The result field is coded 1 = home win, 2 = draw, 3 = away win, so this match ended in a draw.

The inputs to the model would be the two possession fields, HomePoss% and AwayPoss%, and of course the target to predict would be the result field. The odds fields will be used by us to calculate how profitable the model is, and are unlikely to be inputs to the model at this stage.

Now because we are not predicting a binary outcome, ie 1 or 0 to signify win/lose, we cannot use algorithms that are solely intended for binary outcomes. What we need are MULTINOMIAL methods like multinomial logistic regression. We can also use a deep learning neural network in which we specify 3 output nodes, one for each result. I am not going to delve into this further here, although I have created a neural network on these lines using Keras and Tensorflow.
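For anyone who wants to try the wide format for themselves, here is a minimal sketch using scikit-learn's multinomial logistic regression. The file name and columns are just the ones from the example row above; treat it as illustration, not production code.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# one row per match, laid out as in the example above
df = pd.read_csv("matches.csv")  # hypothetical file name

X = df[["HomePoss%", "AwayPoss%"]]
y = df["Result"]  # 1 = home win, 2 = draw, 3 = away win

# fit one softmax model over the three outcomes rather than a binary 0/1 model
model = LogisticRegression(multi_class="multinomial", max_iter=1000)
model.fit(X, y)

# one probability per outcome per match, in class order 1, 2, 3
probs = model.predict_proba(X)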

We also cannot feed the above configuration into MySportsAI, as the algorithms within MySportsAI are set up for binary outcomes, eg has the horse described by this line of data won the race in question or not.

This led to a discussion on how we could configure things so that MySportsAI could handle this home/away/draw prediction scenario. One way forward is to create 3 lines of data per match, so that a match is like a 3 horse race. The first line could be

matchId,HomeT,HomePoss%,HomeOdds,result

1,Coventry,57,2.3,0

The second line could be

1,Sunderland,45,3.3,0

Now we run into a problem: the third line needs to represent the draw, but there is no possession percent for a draw. The first two possession percents are absolute values for each team derived from their last 3 games. You could put in the difference between the two for this third line, but then you are mixing apples with oranges: the first two values have a different scale and function to the value on the draw line. If however the input features are relative values in some way, then this approach makes more sense. For example, we could use the ratio of home possession to away possession as the input value, as shown below.

1,Coventry,1.26,2.3,0

1,Sunderland,1.26,3.3,0

1,Draw,1.26,3.35,1

Now we have something approaching apples and apples. However, the algorithm would not know who is the home team and who is the away team; clearly the first two 1.26's are not the same, as one belongs to the home team and the other to the away team. To pass on this information we could add 3 extra fields (yes, we could do it with 2 fields, but I want to make things clear). Here the 3 fields would signify

1,0,0 home team

0,1,0 away team

0,0,1 draw

We now have the following with the 3 extra fields after the team name

1,Coventry,1,0,0,1.26,2.3,0

1,Sunderland,0,1,0,1.26,3.3,0

1,Draw,0,0,1,1.26,3.35,1

We now have a 3 horse race so to speak and MySportsAI will give rankings and percentage chance for a match once a model has been trained on historical data.
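If you want to generate this long format programmatically, a sketch along the following lines would do it. The wide-format column names are assumptions based on the earlier example.

import pandas as pd

def to_long_format(matches):
    # one wide row per match becomes three rows: home, away, draw
    rows = []
    for match_id, (_, m) in enumerate(matches.iterrows()):
        ratio = round(m["HomePoss%"] / m["AwayPoss%"], 2)  # shared relative feature
        rows.append([match_id, m["HomeT"], 1, 0, 0, ratio, m["HomeOdds"], int(m["Result"] == 1)])
        rows.append([match_id, m["AwayT"], 0, 1, 0, ratio, m["AwayOdds"], int(m["Result"] == 3)])
        rows.append([match_id, "Draw", 0, 0, 1, ratio, m["DrawOdds"], int(m["Result"] == 2)])
    cols = ["matchId", "team", "isHome", "isAway", "isDraw", "possRatio", "odds", "win"]
    return pd.DataFrame(rows, columns=cols)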

I have personally found that this approach with a Gradient Boosting Machine algorithm works as well as the long approach and a deep learning Keras neural network.

All of the modelling, by the way, can be done in MySportsAI with just a click of a few buttons.

Love to hear other people's thoughts on this and what approaches they are using.

Simple XGoals Model 3

06 Tuesday Dec 2022

Posted by smartersig in Machine Learning


Following on from the last two posts, where I looked at a simple expected goals model built using Machine Learning, in this post I am going to describe how you can get involved and play around with the very same data.

The first thing you will need to do is gather the required data. This is very quick and simple. At football-data.co.uk go to the English Premier League historical results and download the data for 2021/22. Save this in a new folder and name it what you want, although I called my data file e02122.csv. This is the file that will contain the results and the betting odds.

Second thing to do is create a file using a text editor, eg Notepad, called resultfiles.csv, and in it place the following line:

e02122.csv

If you have called it something else then obviously enter that file name, and don't forget to press return to start a new line. You can gather more seasons than this, the above is just an example, but enter their file names into resultfiles.csv too.

Next step is to gather some expected goals data. Go to the http://www.fbref.com web site, click on the competitions tab and select Premier League.

Click previous season to move back to 2021/22

Now click the Scores & Fixtures tab to display the match scores for 2021/22; you will notice that this also contains xG (expected goals).

Now click Share & Export followed by Get table as CSV.

The results will now be displayed in CSV format. I have not found a link to download the data as with football-data.co.uk, so you will have to copy and paste the results from the page into a Notepad file and call it Premexpgoals2122.csv.

Almost finished, now create a new file in notepad called featurefiles.csv and enter into it the following line

Premexpgoals2122.csv

Again, if you grab other seasons then enter their file names into featurefiles.csv as well.

There are a couple of other files that I will supply with the software; the first is called repteamnames.csv.

Because team names in the results and the xgoals files are not always the same (eg Man Utd v Manchester Utd), the software will read the teams that need editing from this file and make the needed changes. If you run into any new discrepancies from earlier years, just add the from and to names to this file. When you look at the repteamnames file alongside Premexpgoals2122 and e02122 you will see why those teams are in repteamnames.

The other file I will supply is histmatchweights.csv

This file will contain the following

3
0.2,0.3,0.5

The 3 means that the software will gather the xgoals from a team's last 3 games.

The 0.2,0.3,0.5 makes the software weight the last game by 0.5, the second last game by 0.3 and the third last by 0.2.

You can play around with these values in order to make the software create different data
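If you want to see exactly what the weighting does, the calculation the software performs should be equivalent to something like this (my reconstruction, not the actual createMLfootie source):

import numpy as np

weights = np.array([0.2, 0.3, 0.5])  # third last, second last, last game

def weighted_xg(last_three_xg):
    # last_three_xg is a team's xG in its last 3 games, oldest first
    return float(np.dot(weights, last_three_xg))

# e.g. xG of 0.9, 1.4 and 2.1, oldest first:
# 0.2*0.9 + 0.3*1.4 + 0.5*2.1 = 1.65
print(weighted_xg([0.9, 1.4, 2.1]))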

You can download the software createMLfootie.exe from the utilities section of http://www.smartersig.com along with the two files I supply mentioned above.

Once you have run the program it will create a file called MLfootie.csv; it is this file you can load into MySportsAI and create models on.

The idea behind this is to dip a toe into data modelling for football, and perhaps with discussion and further ideas we can develop and enrich the data inputs; I already have a few ideas.

Best of luck and let me know how you get on

Simple XGoals Model

02 Friday Dec 2022

Posted by smartersig in Machine Learning


Prompted by an excellent Twitter post by @AntoineJWMartin on building a simple xG model in R, I decided to try and do something similar in Python, with the difference being that I would construct the data so that it can be loaded into MySportsAI and modelled. In most other respects it will be similar to Antoine's work, in that he takes the average of the last 3 xG differences for each team and calculates the difference between the two teams in the next game up. That sounds complicated, so let me explain.

I will assume you know what expected goals are; if not, a quick Google will enlighten you. Let us imagine Arsenal in their last match had xG of 1.2 and their opposition in that match had xG of 0.8. The difference for that match is 0.4. We calculate this for each of the last 3 matches and average. This is then Arsenal's average xgDiff (my naming) going into the next match. If we calculate their opposition's xgDiff, we can then subtract one from the other to get a rating, and perhaps this simple rating can be modelled.
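In pandas the rolling calculation looks something like the sketch below. The file and column names are assumptions; the shift(1) keeps the current match out of its own average, which matters for reasons covered further down.

import pandas as pd

# hypothetical file: one row per team per match in date order, with
# columns team, date, xg_for and xg_against
df = pd.read_csv("teamxg.csv", parse_dates=["date"]).sort_values("date")
df["matchDiff"] = df["xg_for"] - df["xg_against"]

# average of the last 3 completed matches; shift(1) excludes the current
# match so its own xG cannot leak into the feature
df["xgDiff"] = (df.groupby("team")["matchDiff"]
                  .transform(lambda s: s.shift(1).rolling(3).mean()))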

The main thrust of Antoine's tweet was to show how to get the data and prepare it, and if there is interest I will do the same, or at least make some general code available. The two web sites needed for this data are http://www.fbref.com for expected goals data and http://www.football-data.co.uk for the results and betting odds. The work involves coding the two together into one set of data to be fed into MySportsAI.

Once I had done this the loaded data looked like this. Obviously the initial rows have an xgDiff of NaN because those teams have not yet had 3 matches and therefore cannot have a rolling average. These are removed at the modelling stage.

Some explanation is needed first. The data is for the English Premiership 2019 to 2022. I have excluded data on the draw, so the above is like a series of two horse races, home and away. finPos is obviously 1 or 0 depending on whether the home or away team won, and although I have stuck to the naming convention of BFSP for starting price, I actually took Bet365 odds from the football-data.co.uk data, although others can be used.

At first the results looked too good to be true, and when they do you must always assume the worst. On careful inspection I realised that in taking the last 3 game average I was in fact including the current game in the average; clearly putting the current game's xgDiff into the average is going to raise the predictability of the model.

Running the model with a train/test split produced the following results using logistic regression

I need to desk check this to make sure all is OK, and the results look promising, but that is not the main reason for this exercise. At this stage we are just looking at ways of configuring data for football modelling, and considering I am not a football modeller I would appreciate any feedback.

FootNote - Another possibility is weighting the last 3 matches. Using a weight of 0.5 for the last match, 0.3 for the 2nd last match and 0.2 for the 3rd last match (note I just made these up), I got the following improved results.

Know Your Trainers Neighbors

25 Friday Nov 2022

Posted by smartersig in Machine Learning


The classical way of looking at trainer form is to check how well a trainer has done over the last X runs. Often this amounts to simply looking at win to run ratio, but sometimes it is refined further to perhaps 'good' runs or placed runs. But is there a more refined way of looking at trainer form? What I am getting at is: how well does M Botti do when running in a class 5 handicap with a 3yo who has been off 71 days? What if we add further criteria, perhaps stipulating that the animal is a male rather than a female horse? There are all kinds of criteria we could come up with, but it gets messy, and even if you do not think so, the question remains: how do we evaluate his runs with such animals?

Machine Learning can help in this situation. K Nearest Neighbors is one of the simplest algorithms to understand. Imagine we simply focus on Botti's runners that are 3yos, or as near as possible to 3yo, and off 71 days, or as near as possible to 71 days. It would be great if Botti had a multitude of such previous runners, but of course he won't; KNN will instead search for the nearest sample of data to these values. The sample size is set by us when we run a KNN program.

I performed this task on the last race at Wolves on Saturday 26th November 2022. I trained the KNN algorithm on Botti's data from 2011 to 2019 for class 5 and 6 races and then ran a prediction on Botti's runner in the last at Wolves. Normally the algorithm would predict the chances of Botti having a winner with a 3yo off 71 days, but I wanted to refine the prediction somewhat. I actually accessed the 21 nearest neighbours from 2011 to 2019 (I specified it should look for the 21 nearest instances) and then, instead of lengths beaten for each animal, I looked at pounds beaten and compiled an average. I did this for all trainers in the last race at Wolves and then ranked them, with of course the smallest average being the best ranked in the race. I also graphed the individual trainers' nearest neighbours; here are a couple.

At first glance Appleby's graph looks better, but of course the vertical scale is different, although he does have a large outlier.

There is lots more work that can be done on this idea. Certainly the two inputs above should be normalised to lie between 0 and 1, otherwise the algorithm will give more weight to a difference of say 10 days since last run than to a difference of one or two years of age. This would lead to days since last run dominating the selection of the 21 nearest neighbours.
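As a rough sketch of the whole procedure with scikit-learn, including the 0 to 1 scaling (the variable names and the two-feature setup are my assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# hypothetical history for one trainer: [age, days since last run]
X = np.array(trainer_history)           # shape (n_runs, 2)
pounds_beaten = np.array(run_outcomes)  # pounds beaten for each historical run

scaler = MinMaxScaler()                 # squash both features into 0-1 so
X_scaled = scaler.fit_transform(X)      # days off cannot dominate the distance

knn = NearestNeighbors(n_neighbors=21)
knn.fit(X_scaled)

query = scaler.transform([[3, 71]])     # a 3yo off 71 days
_, idx = knn.kneighbors(query)
print(pounds_beaten[idx[0]].mean())     # average pounds beaten of the 21 neighbours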

Does this approach have any legs? Well, I trained on 2011 to 2015 and tested on 2016/17 for all handicap races, using just a couple of input fields for trainers, of which days since last run was one. During 2016/17 the top ranked made a variable stake profit of +19.1 points whilst the second ranked made +15.6.

In this race Botti is top ranked and Appleby is second top; good luck with whatever you are on.

Comments are welcome, and don't forget to rate the article.

Has My Data Changed

08 Tuesday Nov 2022

Posted by smartersig in Machine Learning


A big problem model builders have is covariate shift. It sounds complicated, but it's really simple, and I am going to explain it via a horse racing example that will be familiar to all of you. The days since a horse last ran is a common feature in a horse racing data set, whether you are a system builder or a model builder. Either way you are hoping that the way in which horses are generally campaigned this year mirrors the way they have been campaigned in previous years. I mean, if I told you that next year no horse could run without having at least 3 weeks off, you would be pretty worried about how this would affect your way of betting, regardless of what approach you take.

To give this a fancy term, we would say that the distribution of the data item days since last run has changed; in other words we have covariate shift. This is worrying if you built a model on the assumption of certain data distributions only to find that they have changed drastically. Such a situation occurred in 2020, when a whole industry had to shut up shop and no horse ran for weeks.

In MySportsAI you can now check for drift between the data you're training a model on and the test data you are testing it on. Let's demonstrate with a simple model using only days since a horse last ran.

The above was trained and tested on handicap data for 2011 to 2015, the first 80% being the training data and the latter 20% being the test data. The TTDrift value shows a measure of data drift; close to 0.5 is quite good. Now let's see how it drifts when I train on 2011 to 2015 and then test on 2016/17.

The drift is a little more pronounced and the variable stake profit on the top 3 rated has dropped

Now finally let's check testing on 2020.

The TTDrift has, not surprisingly, taken a jump, and the top 3 VROI is poor, worse than you would get backing all horses.

Although the change in conditions here was forced upon us, in other cases drift can occur for reasons that are not as easy to determine, but at least you can now keep an eye on it. By the way, although the drift came down in 2021 and 2022, it has still not returned to 2011/15 levels. If you are interested in the mechanism behind the calculations, here is a link:

https://www.kdnuggets.com/2018/06/how-dissimilar-train-test-data.html
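The idea in the linked article, in a nutshell, is to train a classifier to tell training rows from test rows: if it cannot (a score around 0.5), the two sets are similar. A minimal sketch along those lines, assuming the features already sit in two DataFrames:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drift_score(train_X, test_X):
    # label train rows 0 and test rows 1, then see how well a classifier
    # can separate them; ~0.5 AUC means little drift, values nearer 1.0
    # mean the feature distributions have shifted
    X = pd.concat([train_X, test_X], ignore_index=True)
    y = np.r_[np.zeros(len(train_X)), np.ones(len(test_X))]
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()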

Predicting Profit Not Winners

05 Saturday Nov 2022

Posted by smartersig in Deep Learning, Machine Learning


Machine learning libraries like sklearn come with lots of ML algorithms: Neural Networks, Logistic Regression, Gradient Boosting Machines and so on. Off the shelf they all have one thing in common: if you give them a spreadsheet-like set of data, they will try to predict one of the columns, depending on which one you specify. So if one of my columns contains zero where a horse (a row of data) lost and one if it won, then we can get the ML algorithm to create a model that is hopefully good at predicting those 0s and 1s. It will even give you a probability between 0 and 1 so that you can rank the horses in a race and perhaps just back the top ranked horse. This is all jolly good if we find this approach produces profit, but can we get an algorithm to predict profit? Would a model built to find profit work better than a model built to find winners?

To find profit with an ML algorithm we have to change something called the loss function. So what is the loss function when it's at home? Let us think about a commonly used one: mean squared error (MSE). If, say, a logistic regression model predicts Mishriff will win the Breeders' Cup Turf with a probability of 0.21 and he does win, then the error is 1 - 0.21 = 0.79.

If on the other hand he loses then the error is 0 – 0.21 = -0.21

Now if we square the two potential errors we always get a positive number namely 0.62 and 0.04

This is the squared error (SE), and if we take the average of these across all the predictions made in a year we have the MSE.

Hopefully you can see that if losers have lower predicted probabilities and winners have higher probabilities as predicted by our model, then we are heading in the right direction. If it's the other way round then we have a pretty crap model. The algorithm will attempt to minimise this MSE in its search for a good model.

But we want to model for profit, not accuracy, so we need a different loss function to MSE. We need to create our own, what is commonly known in ML circles as a custom loss function, plug it into our algorithm and say: hey, use this, not the loss function you use by default.

You can do this with LightGBM and XGBoost, but it is easier to do with deep learning and Keras. I am not going to go into the code detail here, but I am going to share my findings after dipping my toe into this pool.

I created a loss function that maximises variable stake profit in proportion to the rating the model produces for each horse in a race. In other words, it is betting to win £1 on each horse in a race, but whatever profit or loss is made on each horse is multiplied by the rating value. So if the top rated horse won with a rating of 0.25, the winnings would be £1 x 0.25, and of course the loss on the lower rated horses would be smaller because they have lower rating values. The loss/profit on a race is therefore reduced/increased when higher rated horses win.
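I am not publishing my exact code, but the general shape of such a loss in Keras is roughly as below. Here the win flag and the decimal odds are packed into y_true so the loss can see the odds for each horse; that packing trick and all the names are my sketch under those assumptions, not the definitive implementation.

import tensorflow as tf

def profit_loss(y_true, y_pred):
    # y_true carries two columns per horse: [won (1/0), decimal odds]
    won = y_true[:, 0]
    odds = y_true[:, 1]
    rating = tf.squeeze(y_pred, axis=-1)
    # betting to win 1 unit: a winner returns +1, a loser costs the
    # stake needed to win 1 unit, ie 1/(odds - 1)
    pnl = won - (1.0 - won) / (odds - 1.0)
    # Keras minimises, so return the negated rating-weighted profit
    return -tf.reduce_mean(rating * pnl)

# usage sketch:
# model.compile(optimizer="adam", loss=profit_loss)
# y = np.column_stack([won_flags, decimal_odds])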

Plugging this into a deep learning neural network using Keras produced the following results for top rated horses in each race (UK flat handicaps). I go on to compare this with a GBM model produced in MySportsAI using the same data but obviously designed to find winners.

First, data for 2011 to 2015 was split chronologically into 80% for training and 20% for testing. If you have used a neural network before, you will know that because of the stochastic nature of NNs you can train a model and get results from it, but if you retrain it you will get different results (MySportsAI users, try this with the NN option). This is not the case with GBM. It does not mean NNs are unreliable; you just have to train and test a few times to get a reasonable picture or an average. Here are the results for top rated horses for 5 runs with the custom loss function in place.

Each run produced 3959 top rated bets

Run 1 ROI% 1.79 VROI% 2.45

Run 2 ROI% 5.05 VROI% 1.82

Run 3 ROI% -3.76 VROI% 1.45

Run 4 ROI% -0.08 VROI% 0.69

Run 5 ROI% 2.18 VROI% 3.21

The first thing I should mention about the above models is that, in line with common wisdom, I scaled the 3 input features so that they were all in a range of 0 to 1. This is something that is commonly advised for NNs, but I was about to find that the opposite was the case for my data, which surprised me.

Here are the results without scaling.

Run 1 ROI% 10.53 VROI% 4.8

Run 2 ROI% 6.47 VROI% 2.06

Run 3 ROI% 2.79 VROI% 3.08

Run 4 ROI% 9.77 VROI% 7.79

Run 5 ROI% 9.49 VROI% 12.11

So how does the GBM algorithm perform with the same data, but obviously no custom loss function?

ROI% 5.71 VROI% 5.66

When taking averages GBM is slightly worse than the average performance of the NN using a custom loss function.

My next step was to look at how these two performed on validation sets, in other words other hold-out periods, ie 2016-17 data and 2018-19 data. First 2016/17. The first question to ask is which of the 5 runs I performed with the NN should I use. I tried the highest performer first and this gave some weird results: the top rated horse was getting a rating of 0.99 etc, which suggests something went wrong. Probably the NN found what is called a local optimum and simply over fitted, or in layman's terms got lucky in this case. Needless to say the results on 2016/17 were poor. Next I tried a mid range model and this looked promising.

2016/17

GBM ROI% -1.66 VROI% 0.55 NN with loss function ROI% 8.08 VROI% 3.68

2018/19

GBM ROI% 6.23 VROI% 3.11 NN with loss function ROI% 4.12 VROI% 3.78

Another area of interest may be to use the ranking of the horse instead of the probability when multiplying the loss in the loss function. If you have any ideas of your own, please comment and vote on the usefulness of the article.

One Hot V Ordinal Encoding

30 Sunday Oct 2022

Posted by smartersig in Machine Learning


Steve Tilley sent me this interesting article today which delves into the benefits of using ordinal encoding over one hot encoding in some situations.

https://medium.com/@anna_arakelyan/hidden-data-science-gem-rainbow-method-for-label-encoding-dfd69f4711e1

A synopsis of the piece would be that for some tree based algorithms like GBM and Random Forests ordinal encoding can be a better option than the usually recommended one hot encoding.

OK, I am getting ahead of myself here; what do the above terms actually mean? Well, imagine we have a racing data feature like race going (F, GF, Gd etc) and let's say we want to model on pace figure and going, because maybe together they have some predictive power. We cannot use the going data as is, because ML algorithms require numeric values. The conventional wisdom approach would be that if the going does not have some intrinsic ordering to it, then one hot encode it, which simply means create a binary feature for every possible occurrence.
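As a quick pandas sketch of what the two encodings look like (the numeric mapping matches the ordinal scheme used further down; the going order is the obvious soft-to-firm one):

import pandas as pd

going = pd.Series(["Hvy", "Sft", "GS", "Gd", "GF", "F"], name="going")

# one hot: a separate binary column per going description
one_hot = pd.get_dummies(going, prefix="going")

# ordinal: a single numeric column preserving the soft-to-firm order
order = {"Hvy": 0, "Sft": 1, "GS": 2, "Gd": 3, "GF": 4, "F": 5}
ordinal = going.map(order)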

As the article points out this can lead to an explosion of features and possibly the curse of dimensionality.

Below is the performance of a model on pace figure and one hot encoded going for turf flat handicaps. The top rated made an ROI of 1.95% but a variable ROI of -0.7%.

Now if we use a numeric value for going, namely 0 = Hvy, 1 = Sft, 2 = GS etc, and so only two input features, pace figure and going, we now get the slightly better set of results.

These results suggest, as the article does, that we should not jump to conclusions about one hot encoding; ordinal encoding with tree based algos may be just as good if not better.

Modeling Heritage Handicaps

27 Thursday Oct 2022

Posted by smartersig in Machine Learning, Uncategorized


Back at the beginning of the 2022 flat season a tipping competition popped up on Twitter. Entries had to make two selections in all of the flat season's heritage handicaps. I felt this was a nice opportunity to test a machine learning model designed to run specifically on heritage handicaps, so I set about creating such a model. Drilling down into the data for just heritage handicaps might produce too little data to work with, so I decided to train the model on all races of class 4 and below. I also ended up splitting the task into two models, one for races up to a mile and another for races beyond a mile. Selections would be made simply by posting the top two rated on Twitter.

Things got off to a pretty surprising start when the model found Johan, a 28/1 SP winner of the Lincoln, and it generally got better as the year progressed. Here are the results, with each-way bets settled at whatever place terms were on offer from at least two big bookmakers.

Firstly let me say that the above returns are not likely sustainable, but the profit generated does add weight to the historical results and suggests that the model can be profitable going forward, especially at these place terms. I will consider posting these up to the MySportsAI email forum next year.

Betting With Light Touch

17 Sunday Jul 2022

Posted by smartersig in Machine Learning


Over this weekend I have been playing around with a Machine Learning algorithm called LightGBM, produced by the team at Microsoft. This is an algorithm from the Gradient Boosting family. I have included GBM in the MySportsAI package but not LightGBM, mainly because a cursory read about it suggested its main advantage was speed of execution. Well, I was not too fussed about this aspect, so I jumped to the conclusion that Light perhaps meant light in predictive performance. I may have been wrong. The main reason for me picking it up this weekend is that LightGBM allows you to specify your own custom loss function. Allow me to explain.

When an algorithm, be it GBM or Logistic Regression, is trying to produce the best model it can for future predictions, it does so by examining the data you have handed it to train on. It has to try different scenarios, just as you would if you were playing the party game 'guess who I am': am I male Y/N, am I European Y/N, and so on. Each time it constructs a model (think of that as a completed game) it needs some mechanism for evaluating the worthiness of the model; in fact it also needs some measure of worth to evaluate each stage of construction. I am not going to dissect the detail here, but usually it is some measure of accuracy. In horse racing terms this means: is it finding winners better than the last model it trained? Is this split that it creates in the data better than other possible splits in terms of dividing winners from losers?

Well, we all know as bettors that this is not the complete picture. Profit is what we are seeking, and often that comes hand in hand with fewer winners; I mean, go ahead and back all odds on shots and you will have lots of winners but no profit. Creating your own custom loss function allows you to stop minimising losers and start minimising losses. The algo will use your loss function rather than one of the built in loss functions that focus on winners.
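For the curious, a custom objective in LightGBM has to supply the gradient and hessian of the loss with respect to the raw score. The sketch below shows the general shape of a profit-style objective; it is my illustration (using the older fobj hook and a crude always-positive hessian), not the loss I actually used.

import numpy as np
import lightgbm as lgb

def profit_objective(odds):
    # odds: decimal odds for each training row, aligned with the data
    def objective(preds, dataset):
        y = dataset.get_label()
        p = 1.0 / (1.0 + np.exp(-preds))               # sigmoid of raw score
        pnl = np.where(y == 1, 1.0, -1.0 / (odds - 1.0))
        # loss = -p * pnl, so grad = d(loss)/d(raw score) = -pnl * p * (1 - p)
        grad = -pnl * p * (1.0 - p)
        # rough positive stand-in for the hessian to keep the
        # second order updates stable
        hess = np.abs(pnl) * p * (1.0 - p) + 1e-6
        return grad, hess
    return objective

# usage sketch (pre-4.0 LightGBM API):
# train_set = lgb.Dataset(X, label=won)
# booster = lgb.train({"verbosity": -1}, train_set, fobj=profit_objective(odds))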

So how did LightGBM perform? Well, I used three sets of data, each consisting of just 3 input features. I then tested them all using standard GBM, LightGBM and what I will call here LightGBM+. The plus means I plugged in my custom loss function. All three used just default hyperparameters, and the data was flat handicaps 2011 to 2017. I used a train/test split and then checked the top rated horse from the different scenarios to BFSP minus 2% commission. Here are the results.

As you can see, on the first set of data LightGBM came in at an ROI of just under 2%, whereas LightGBM+ achieved over 6% and GBM trailed in at nearly -2%.

This looks like a promising area of further research and I will look to place LightGBM into MySportsAI along with a custom loss alternative.

FootNote 1 – The variable stake return on that over 6% run was + 4.78%

FootNote 2 - Training on the whole of 2011 to 2017 and testing on 2018/19 showed little difference between LightGBM and LightGBM+, but both were around 1% better ROI to variable stakes than GBM.

FootNote 3 - The ROI was not over 6% for LightGBM+ on the first data set but actually +5.01%; I had omitted to count joint top rateds. Still well above the other models, however. Also, the variable stake return on this first data set for LightGBM+ was +3.29%.

Simple In Running Tennis Model

10 Sunday Jul 2022

Posted by smartersig in Machine Learning


It's the Wimbledon men's final today, and yesterday I decided to set up a simple in running model to allow me to gauge how the tennis market works in play. I have no experience of these markets, so feel free to comment. I created the model using the MySportsAI software.

First thing I needed was data that would allow me to model my chosen split for in-running pricing. Clearly one could model down to individual game scores or even point scores, by which I mean model what the price should be if, say, Djokovic is 1-0 up in the first set. However, I decided to model on set completions, and this write up will be about modelling the fair price of the two players at the end of the first set.

The data was obtained from the excellent free site tennis-data.co.uk. I gathered all tennis results from 2011 to 2021. The CSV files from this source contain a single line for each match, holding data such as player names, how many sets, match surface and of course scores, amongst other items. This presented the first problem: MySportsAI treats a line of data as data for one competitor in, say, a horse race. In a way a tennis match is no different to a two runner horse race, but I need to get the data for each player onto its own line, with a unique match id for each match. I also only selected or created a few data features; after all, this is my first tennis model. Here is a snapshot of the input file.

As you can see, I have created a file of only grass results over this period. Whether focusing on grass only is a good idea I do not know, but for now it's a starting point. A quick explanation: the first two features are self explanatory. Rank refers to the player's world ranking, Pts refers to points won in the game, Pinn refers to the pre-match odds as per Pinnacle, and firstSetDiff signifies not only who won the first set but by how many games.
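The reshaping from one line per match to two lines per player is mechanical; something like the sketch below, where the tennis-data.co.uk column names used (Winner, Loser, WRank, LRank, PSW, PSL, W1, L1) are my assumptions about the fields involved:

import pandas as pd

def two_rows_per_match(matches):
    # one tennis-data.co.uk row per match -> one row per player
    rows = []
    for match_id, m in enumerate(matches.itertuples()):
        set_diff = m.W1 - m.L1   # winner's first-set game margin
        rows.append([match_id, m.Winner, m.WRank, m.PSW, set_diff, 1])
        rows.append([match_id, m.Loser, m.LRank, m.PSL, -set_diff, 0])
    cols = ["matchId", "player", "rank", "pinn", "firstSetDiff", "won"]
    return pd.DataFrame(rows, columns=cols)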

Feeding the whole file into MySportsAI gives me this initial page.

Now, as you can see from the selected fields on the left, I am going to build a model based on just two fields: the initial Pinnacle odds and the first set difference in games. The model will then predict each player's odds of winning the match. So if Djokovic should win the first set 6-4, I should be able to get a prediction of fair odds for both players. I will need an input file to predict on, but more of that later.

As you can see from the above, I am going to build the model using a GBM, that is a gradient boosting algorithm. When I have done this I will try one or two other algorithms to see if any work better. First of all I click run model and get the following new screen.

From the above you can see that I have selected to perform 5 fold cross validation. This means it is going to repeatedly split the data into 4/5 for training and 1/5 for testing, each time training the model on the 4/5ths and then testing its effectiveness on the remaining 1/5. It will do this five times and then pool the results.

Now the results of the above are of no real interest to me, as the profit is being calculated to the original Pinnacle prices. Sure, if you can get Pinnacle's original prices after the first set is over, then start looking at Caribbean islands for sale. What I am interested in is how good the probabilities it predicted for winning the match are. I mean, if my model is predicting Djokovic at 1.10 after taking the first set 6-4, but I can get 1.15, then provided my model is accurate I will make money in the long run. There are a lot of ifs here, but this is an initial demonstration and a simple starting point.

To check the probability accuracy of the model I click on calibPlot

This shows how accurate the model probabilities have been. The closer the orange line is to the blue dotted line, the better. As you can see, some points are below and some above. Whether they are above or below is not important; they are both inaccurate, and we would like an average of how far off they are. We cannot simply average the raw errors, because the negative and positive ones would cancel each other out. What we do instead is square the errors, so all of them are positive numbers, and then compute the mean. As you can see this comes out at 0.135. The smaller this number, the better the fit of probabilities to actual events. Let us see if we can beat this number with a different algorithm. Here is Logistic Regression.

And one more Logistic GAM
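If you want to reproduce this kind of calibration check outside MySportsAI, scikit-learn will do it in a few lines (the variable names here are assumed):

import numpy as np
from sklearn.calibration import calibration_curve

# y_true: 1/0 match outcomes, y_prob: the model's predicted probabilities
frac_won, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

# squared gaps between the curve and perfect calibration, then the mean;
# smaller is better, in the spirit of the 0.135 figure quoted above
print(np.mean((frac_won - mean_pred) ** 2))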

OK, using the GBM model we are now ready to predict the fair prices for our live match at the end of the first set of today's final. Kyrgios won the first set 6-4, so I created a test file with the following data.

As you can see, the first line is Djokovic with an opening price of 1.28 but a first set game deficit of -2. The last number is at this point just an imaginary result to make the software accept the data; it plays no part in the prediction. What I do now is click on supply test file in the second screen and then run the model on the above file. Once it has produced its predictions I can then click save results, which will save the predictions to a file where I can check them.

From the above we can see that Djokovic has a probability of 0.603 and Kyrgios has a probability of 0.396

Converting these to odds by taking the reciprocal, eg 1/0.603, we get odds of 1.65 for Djokovic and 2.52 for Kyrgios. These were very close to the odds on offer on Betfair. In fact I put offers in at 1.69 and 2.56 but did not get matched, despite the odds flirting close at times.

If there is interest, and Joseph Buchdahl does not mind, I could look at packaging up the conversion code that renders the data into two lines per match. Other data could of course be included so that you can try various scenarios, although with Kyrgios, swear words per set might be a good feature to include.
