Nursery Correlation Matrix


A correlation matrix is a row-by-column display of the correlations between input features and a target feature.

In the following matrix, created using MySportsAI for Nursery flat handicaps, we can see that in this small sample the feature most correlated with finPos (i.e. did the horse win or not) is goodRunOnGoing, at 0.06.

A useful starting point when deciding what features to select for a Machine Learning model is to examine all available features and see which ones have the best correlations with winning. It’s also useful for checking which features are correlated with each other. If two features are strongly correlated with each other then it’s likely that including both is not a good idea, because the second adds little new information. So we are looking for features that have some strength of correlation with winning but not a strong correlation with each other.
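The screening idea above can be sketched in a few lines of pandas. This is only an illustration on invented toy data: the column names (win, featA, featB, featC), the helper name screen_features and the 0.8 redundancy threshold are all my assumptions and have nothing to do with how MySportsAI does it internally.

```python
import pandas as pd

# Toy data, invented purely for illustration. featB is just featA rescaled,
# so the pair is (deliberately) almost perfectly correlated.
df = pd.DataFrame({
    "win":   [1, 0, 0, 1, 0, 0, 1, 0],
    "featA": [0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.7, 0.2],
    "featB": [2.8, 1.4, 1.6, 2.6, 1.2, 1.8, 2.4, 1.4],
    "featC": [3, 1, 2, 1, 3, 2, 1, 3],
})

def screen_features(data, target, max_pair_corr=0.8):
    """Rank features by |correlation with target| and flag redundant pairs."""
    corr = data.corr()  # Pearson correlation matrix
    with_target = corr[target].drop(target)
    # Sort on absolute value: a strong negative is as useful as a strong positive.
    order = with_target.abs().sort_values(ascending=False).index
    ranked = with_target.loc[order]
    feats = list(ranked.index)
    redundant = [
        (a, b)
        for i, a in enumerate(feats)
        for b in feats[i + 1:]
        if abs(corr.loc[a, b]) > max_pair_corr
    ]
    return ranked, redundant

ranked, redundant = screen_features(df, "win")
print(ranked)
print("Pairs too strongly correlated with each other:", redundant)
```

Here featA and featB both correlate well with winning, but they would be flagged as a redundant pair, so you would probably keep only one of them.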

The problem with putting all the features in MySportsAI into the correlation matrix is that the plot simply becomes unreadable, even if we lower the resolution. MySportsAI will however automatically dump the matrix contents into a .csv file, which you can load into Excel to examine the correlation values. Below is a snapshot of just part of the Excel sheet I produced for Nursery handicaps.

So which features came out top in terms of Nursery handicap correlation to winning? First remember that a positive and a negative correlation of the same size are of equal value. A positive correlation is not, as one would intuitively expect, better than a negative one. Both are capable of weeding out meaningful predictions within a race.

The best positive correlation was trainer strike rate within race, i.e. trainer strike rate for a horse / top trainer strike rate in the race.

The best negative correlation was preFinPos3, i.e. last 3 finishing positions / count of last 3 race opponents.

Good luck with your Nursery betting this autumn

Skill Set Successful Betting


What would be an essential skill set for a successful sports bettor? I posted a diagram on Twitter yesterday in an attempt to foster some debate. It did create some debate, but not in the direction I had anticipated. Here is the diagram as of today (it’s an ongoing set of thoughts).

There were some objections from fellow Twitter users to the inclusion of skills like Machine Learning, and a suggestion that simply stating statistics skills needed broadening out to include maths skills. One person even said that it reflected my own personal path or experience too much, which I have to agree with, but that was the point of tweeting it: to encourage others to add a bubble I had not considered. I was looking for bubbles that can hopefully be worked on by an individual. So for example simply stating that you must find an edge is not a skill set, it is a goal. Hopefully the creation of the skill set will help lead you to this.

I also think it’s important to state that an individual does not need all of these skills. Some people get by quite profitably without any coding skills or indeed any interest in Machine Learning. The diagram is a global skill set from which you may focus on some and not others. Having said that, one could argue that certain skills are pivotal, and at this point may I add that the size of the bubbles is not meant to signify this; it’s merely a product of my quick sketching. For example many components of the temperament bubble will be essential, e.g. patience, an eye for detail, keeping one’s nerve when hitting losing runs. The risk averse bubble is also an interesting one for me. Too much aversion and you can have all the other skill bubbles and still be unable to pull the trigger. Too little aversion and you can easily become compulsive or blow your money before you find your way.

One of the more interesting points was made by one Twitter account that I often disagree with. He stated that luck was a vital ingredient. At first I thought he was just being argumentative but he clarified his position with the following article.

Now I kind of agree and disagree. I disagree in that simply placing that on the diagram will imply that luck plays a part in the long term outcomes of your betting, which is clearly false, because if it did then by definition you could not be a long term winning punter, or at least not in the sense that I am trying to define. But I do agree that luck can play a part in guiding you to success. For example, would William Benter have been successful had he not met Alan Woods? Was it luck that Bill Gates happened to be born at a time of great shift in technological progress and in a location that opened up opportunity? More modestly, if I had not read an article about a magazine called SmartSig, would I be in a totally different space now?

I think that temperament bubble just got a bit bigger and includes

a) Willingness to fail and try again

b) Networking skills, ability to meet other people, sources and learn

c) Inquisitive by nature

d) View money as a tool and not a be all desirable end product

e) Be able to draw neat bubbles

If you have a skill in mind that you think is missing here please add in the comment section

Machine Learning Nurseries


We are about to embark on the Nursery season within flat racing here in the UK. If you are not familiar with them, these are handicap races for 2yo horses only. Simon Rowlands wrote an excellent article, linked below, where he highlights the bias favouring higher weighted horses in Nurseries, amongst other things.

I decided to take a look at this via the MySportsAI software. Firstly I loaded up all flat data for 2011 to 2017.

I then sliced the data via the slice button, selecting only Nurseries

Having sliced the data I then proceeded to select weight, prevRaces and age from the array of input features

Clicking the Create Model button then presents me with the model creation screen

Now clicking the CorrPlot button will give me the following Correlation Matrix window

If we read across from row weight to column finPos (whether the horse won or lost) we can see that weight has a positive correlation of 0.06, which means that as weight goes up the chance of winning increases. Contrast this with age, which I threw in just to highlight that, as all horses are two year olds, the horse’s age should have no correlation whatsoever with the outcome. We can also see that prevRaces (the number of races a horse has had) is slightly negatively correlated with winning. That is to say, the more races a horse has had the less chance it has of winning.
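If you want to reproduce this kind of read-across outside MySportsAI, a rough sketch in pandas follows. The six rows are invented toy data and the column name won is my stand-in for the win/lose target. One wrinkle worth knowing: a column that never varies, like age in a Nursery, has no defined Pearson correlation, so pandas reports it as NaN rather than 0.

```python
import pandas as pd

# Toy Nursery-style rows, invented for illustration only.
nursery = pd.DataFrame({
    "weight":    [9.7, 9.2, 8.9, 9.5, 8.7, 9.0],  # weight carried
    "prevRaces": [2, 4, 5, 3, 6, 4],              # number of previous runs
    "age":       [2, 2, 2, 2, 2, 2],              # every runner is a 2yo
    "won":       [1, 0, 0, 1, 0, 0],              # 1 = won, 0 = lost
})

# Correlation of every column with the win/lose target.
corr_with_win = nursery.corr()["won"].drop("won")
print(corr_with_win)
```

In this toy sample weight comes out positive and prevRaces negative, while age shows as NaN because it is constant. Either way a constant feature has nothing to offer the model.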

Weight therefore may be a good starting point from which to create a model for Nursery handicaps

You may have wondered whether Nursery weight correlation is any different from all age handicaps. It appears it is, with regular handicaps coming in at 0.03. Still positively correlated, but less so.

You can try a cut down version of MySportsAI for free, see the following two videos on how to install and use.



One limitation of the sklearn library of machine learning algorithms is the inability to write your own cost function and get an algorithm from the library to use it. What do I mean by this? Well, when you train a model on some horse racing data the algorithms are looking for the most effective model at predicting winners, assuming you are using win/lose as your target feature, i.e. the thing you are trying to predict. The ‘cost function’ is just a fancy term for the measure the algorithm uses to determine how good a model is, in our case at predicting winners. But what if we wanted it to do its search and model fitting based on profit or loss? What if we wanted the model fitting criterion to be variable profit/loss, i.e. how much a model makes when always betting to win a mythical £1? The model fitting process may look at a set of parameters and find that it makes a variable stake loss of -£25, so it tweaks the model parameters, tries again and finds that the loss is now -£20. Great, the second model shows an improvement, let’s tweak a little more.

This tweaking process will depend on the algorithm being used but, as I mentioned, sklearn does not permit you to define a loss function that calculates variable PL and then ask it to use that as the measure for its tweaking.

Here is where PyGAD comes into the picture. PyGAD is a Genetic Algorithm library for Python that does allow you to create your own loss function and plug it in. An explanation of GAs is beyond the scope of this blog entry, but if you do a YouTube search you will find some nice intros. There was also a great article in an old copy of SmartSig from a guy utilizing GAs; if memory serves me it actually won the article of the year award.

OK, onto the face off. How does a match up between sklearn’s GBM algorithm and PyGAD shape up? With GBM I will be using the model based on the win/lose approach, whilst with PyGAD I will be adopting a model based on VPL (variable profit or loss).

For both I used 3 input features (forgive me but I won’t disclose these, though they are part of MySportsAI) along with Betfair SP and finishing position as data items simply to calculate PL.
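To make the idea of a VPL cost function concrete, here is a minimal sketch of the kind of fitness function you could hand to PyGAD. Everything in it is an assumption for illustration: the toy races, the three anonymous feature columns, and the choice of a simple linear score whose weights are the genes. It is not the model used for the results in this post. Recent PyGAD versions (2.20 onwards) call the fitness function with three arguments (ga_instance, solution, solution_idx) and try to maximise what it returns, so returning VPL directly means the GA searches for profit rather than winners.

```python
import numpy as np

# Toy data, invented for illustration: 2 races with 3 runners each.
race_id = np.array([1, 1, 1, 2, 2, 2])
X = np.array([            # three anonymous input features per runner
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.4, 0.4, 0.9],
    [0.7, 0.2, 0.1],
    [0.1, 0.9, 0.6],
    [0.5, 0.5, 0.5],
])
bfsp = np.array([3.0, 4.0, 6.0, 2.5, 5.0, 8.0])   # Betfair SP, decimal odds
won = np.array([1, 0, 0, 0, 1, 0])                # 1 = won the race

def vpl_of_top_rated(weights):
    """Variable-stake P/L from backing the top-rated runner in each race."""
    scores = X @ weights                     # linear rating (an assumption)
    total = 0.0
    for r in np.unique(race_id):
        idx = np.flatnonzero(race_id == r)
        top = idx[np.argmax(scores[idx])]    # top-rated runner in this race
        stake = 1.0 / (bfsp[top] - 1.0)      # stake needed to win 1 point
        total += 1.0 if won[top] else -stake
    return total

def fitness_func(ga_instance, solution, solution_idx):
    """Signature PyGAD expects; higher is better, so VPL is returned as is."""
    return vpl_of_top_rated(np.asarray(solution, dtype=float))

# Two hand-picked weight vectors, just to show the function at work.
print(fitness_func(None, [1.0, 0.0, 0.0], 0))
print(fitness_func(None, [0.0, 1.0, 0.0], 1))
```

With real data you would pass fitness_func to pygad.GA with num_genes set to the number of features and let it evolve the weights on the training years, then check the resulting ratings on held-out years as done below.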


First of all I used the data from 2011 to 2017 and did not scale the input features (see later). Here are the results for the top rated horses when the model was applied to 2018 and 2019

Bets 12921 VPL -24.3 VROI% -0.79%

Next I min-max scaled the 3 input features. This simply means that all 3 features are rescaled to lie between 0 and 1 so that they are on an identical scale. This can often help with modelling for some algorithms.
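For anyone unfamiliar with min-max scaling, here is a small sketch in NumPy. The numbers and the function names fit_min_max and apply_min_max are made up for illustration. The one design choice worth noting is that the minimum and range are learned from the training years only and then reused on the test years, so test values can fall slightly outside 0 to 1. sklearn’s MinMaxScaler does the same job if you prefer a library call.

```python
import numpy as np

def fit_min_max(train):
    """Learn each column's minimum and range from the training data."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0          # avoid dividing by zero for constant columns
    return lo, span

def apply_min_max(data, lo, span):
    """Rescale columns using the minimum and range learned from training."""
    return (data - lo) / span

# Toy feature rows, invented for illustration (3 features per runner).
train = np.array([[8.7, 2.0, 10.0],
                  [9.7, 6.0, 30.0],
                  [9.2, 4.0, 20.0]])
test = np.array([[9.9, 3.0, 15.0]])

lo, span = fit_min_max(train)
train_scaled = apply_min_max(train, lo, span)
test_scaled = apply_min_max(test, lo, span)
print(train_scaled)
print(test_scaled)
```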

Bets 12921 VPL +3.19 VROI% +0.18%

My last step involved incorporating Betfair SP as an input feature. My curiosity about doing this was mainly due to the fact that when you include BFSP in a model where the algorithm is modelling for winners, the BFSP simply swamps the modelling process. It’s just so effective at predicting winners that the algorithm will zone in on it at the expense of the importance placed on your other features. With PyGAD and a cost function that is working on VPL this should not be the case. Having said that, it did not appear to help, as the VPL came out at -25.28 on the test data. OK, worth a try.


Now onto the Gradient Boosting algorithm. GBM has appeared to be quite effective in the past on Racing data, perhaps mainly due to the fact that it can handle non linear data quite well. First non scaled input features.

Bets 14250 VPL -12.81 VROI% -0.35%

There were obviously more joint top rated horses with this approach.

GBM with scaled data did very poorly, making a bad job of differentiating between rankings and delivering lots of joint or co top rated horses.

Bets 70804 VPL -218.7 VROI% -2.05%

Conclusion – There is room for more analysis here and encouragement that modelling on VPL could be a profitable way to go.

As always thoughts and comments welcome

Jockeys in Tight Finishes


I have just watched Joe Fanning get touched off again, or at least it seems like again. Maybe I am biased, and perhaps Fanning’s quieter finishing style keeps a horse balanced, or maybe he is getting a bit old and tired. It’s hard to be objective when you think a jockey has just cost you a few quid.

I decided to take a look at close finishes over the last 3 years, close meaning being beaten or winning by less than half a length. I do not suggest for a minute that this is a hugely scientific study, but I wanted to try and find some morsel of proof to back up my gut feeling.

Here are the number of tight finishes, wins and percentage wins for the top 27 jockeys in terms of getting involved in tight finishes. The number is 27 simply because that is where the cut-off of 100 tight finishes fell.

Jockey           Tight Fins   Wins   %wins
Oisin Murphy     232          120    51.72414
L Morris         212          90     42.45283
Tom Marquand     189          94     49.73545
David Probert    173          96     55.49133
J Fanning        168          83     49.40476
D Tudhope        167          88     52.69461
A Kirby          164          74     45.12195
James Doyle      158          75     47.46835
P J McDonald     152          70     46.05263
S De Sousa       149          73     48.99329
Hollie Doyle     145          71     48.96552
R Kingscote      144          72     50
F Norton         130          72     55.38462
Jason Watson     119          60     50.42017
Jim Crowley      118          69     58.47458
Andrea Atzeni    116          56     48.27586
Jason Hart       115          60     52.17391
R Havlin         115          55     47.82609
B A Curtis       114          52     45.61404
P Hanagan        114          60     52.63158
Rossa Ryan       110          49     44.54545
D Allan          108          53     49.07407
R L Moore        107          46     42.99065
K T O’Neill      106          51     48.11321
Rob Hornby       105          60     57.14286
G Lee            103          55     53.39806
Jack Mitchell    100          44     44

The average win percent was 49.56 so Joe is pretty average
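As a rough check on how much to read into these percentages, the sketch below recomputes the win percentage for a few of the jockeys in the table and attaches a simple binomial standard error. The standard error is my addition, not part of the original study, and it assumes each tight finish is an independent coin toss, which is a simplification.

```python
import math

# (tight finishes, wins) copied from the table for a handful of jockeys.
records = {
    "J Fanning": (168, 83),
    "Jim Crowley": (118, 69),
    "Rob Hornby": (105, 60),
    "R L Moore": (107, 46),
}

def win_pct_and_se(finishes, wins):
    """Win percentage and its binomial standard error, both in percent."""
    p = wins / finishes
    se = math.sqrt(p * (1 - p) / finishes)
    return 100 * p, 100 * se

summary = {name: win_pct_and_se(n, w) for name, (n, w) in records.items()}
for name, (pct, se) in summary.items():
    print(f"{name:12s} {pct:6.2f}%  +/- {se:.2f} (one standard error)")
```

With samples of 100 to 230 finishes the standard error is around 4 to 5 percentage points, so most of the jockeys in the table sit within a couple of standard errors of the 49.56 average.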

Jim Crowley comes out top and by coincidence perhaps is a jockey I rarely complain about. A mention must be made also for Rob Hornby at 57.1%.

Perhaps rendering these numbers totally irrelevant is the poor showing of Ryan Moore unless of course you think Ryan gets beat more than he should.

Anyway apologies to Joe, you are after all average 🙂

The Derby 2021

In 1998 I sat my 3yo son down in front of the Derby runners and asked him to pick one. He chose High Rise. I then asked his baby sitter to choose one; she was unaware of his choice and coincidentally also picked High Rise. The fact that I did not put a penny on High Rise should have resulted in my reporting to the Social Services child abuse department. The whole spooky event was repeated the following year when the baby sitter picked Oath. Now of course this was nothing other than luck, but I find myself drawn to the horse in the same colours as High Rise for this year’s Derby, Third Realm. The reasons have a little more foundation: his price of around 18/1 on the exchanges and his sectionals.

These are his sectionals for the Lingfield Derby trial, with his final furlong sectional being 13 sec and that of Sherbet Lemon, the Oaks trial winner, 12.53.

Now to get a handle on how these compare I divided them up into first sectional, mid sectional and final sectional which gave the following

Third Realm covered the first 4 furlongs in 49.4 compared to Sherbet Lemon’s 54.12. That’s a huge 5.96 seconds difference, which equates to around 0.84 of a furlong. Despite this he ran the final third only 0.15 secs slower.

Of course that may not mean a great deal if Lemon turns out to be a lemon, but let’s assume for the time being that anything winning an Oaks trial has some merit. In order to get a handle on these figures we need to dig a little deeper, and so I turned to looking at other Derby/Oaks trials.

Anthony Van Dyck and Anapurna won their trials on similar ground a couple of years earlier. In actual fact I think Third Realm’s ground was a little softer, although both were classed as soft. Anapurna went on to win the Oaks and of course Van Dyck won the Derby. Here are their figures for their trials; Van Dyck is the top row.

Anapurna ran to 4f out 0.04 seconds faster than Van Dyck and yet lost 1.02 seconds on the final third. She was eased in the final furlong but nevertheless it’s hard to imagine that the deficit for the final section would be reduced to around 0.15, and even if it were, she would be producing it off an only slightly faster first two thirds of the race. My reading of this is that Third Realm should have dipped far lower than 0.15 seconds below Sherbet Lemon, an argument backed up by how the first two came away.

Finally I took a look at another pair of trials, which I admit were on faster ground: Hertford Dancer (Oaks trial) and Best Solution (Derby trial). Note here that Best Solution turned out to be a better horse than Hertford Dancer. Once again the Derby trialist is the top row.

Best Solution ran 0.5 seconds faster for the first 2/3rds of the race and paid by running 0.1 secs slower in the final third compared to an inferior animal in Hertford Dancer.

This is not an exact science I have to admit but my underlying feeling is that Third Realm did well to drop by only 0.15 seconds in the final third. The current favourite for the Derby is Bolshoi Ballet and I like him a lot, he is a worthy favourite but if the rain continues 18/1 for Third Realm looks big.

As a closer, I examined the sectionals for the Dante and I was very pessimistic about the chances of anything coming from that race to the Derby.

Child’s play I am sure you will agree. I am off to find a 3yo, but not the equine variety.

Going Going Gone


It’s Saturday, Lingfield Derby trial day, and after a dry weather opening to the season it is pissing it down with rain and soft ground abounds. Now there is nothing that will create more column inches, amongst the failed punters we otherwise know as newspaper tipsters and columnists, than a change in the going. According to them only the four horses of the apocalypse should be feared more than a horse that has not won or run well on the ground. The most important factor from a betting angle, we are told, is the ground. Well I am telling you that is a load of bollocks passed down from journo to journo over the years. It’s perpetuated because it’s easy to talk about and we can all remember a horse we waited for soft ground with, and when it came it won and we collected.

I am not immune to this disease, having only recently paid the cost. I wanted to back Waldgeist EW for the Arc de Triomphe but was put off by the very soft ground. I momentarily forgot that the ground simply meant I got a better price.

But Waldgeist and a few other anecdotes are not really evidence. Is there more robust data that refutes the going is king approach to betting? I took a look at Flat races 2017 to 2019, 3 years of data. I took all horses that had won on Turf and were racing again within 30 days, so fit and in form. I then checked how they fared next time out when running on exactly the same going and when running on different going. Now if the going is paramount I would expect those running on the same going to win more often and return a smaller overall loss.

As usual I calculate returns to Betfair SP and use variable staking to win 1 point, to reduce the effect of big priced winners. Here are the results.
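For anyone who wants to reproduce this style of calculation, here is a minimal sketch of variable staking to win 1 point at decimal odds. The three bets are invented, the function name is hypothetical, and I am assuming that VROI% means VPL divided by the total amount staked, with any commission taken off winning bets only.

```python
def variable_stake_returns(bfsp, won, commission=0.0):
    """Return (VPL, VROI%) when every bet is staked to win exactly 1 point."""
    total_pl = 0.0
    total_staked = 0.0
    for odds, is_winner in zip(bfsp, won):
        stake = 1.0 / (odds - 1.0)      # e.g. 0.25 pts at decimal odds of 5.0
        total_staked += stake
        if is_winner:
            total_pl += 1.0 - commission  # win 1 point, less any commission
        else:
            total_pl -= stake             # lose the stake
    return total_pl, 100.0 * total_pl / total_staked

# Three invented bets at Betfair SP: one winner at 3.0, losers at 5.0 and 2.0.
vpl, vroi = variable_stake_returns([3.0, 5.0, 2.0], [True, False, False])
print(f"VPL {vpl:+.2f} pts, VROI {vroi:+.2f}%")

# Same bets with 2% commission on winnings.
vpl_comm, vroi_comm = variable_stake_returns(
    [3.0, 5.0, 2.0], [True, False, False], commission=0.02
)
print(f"VPL {vpl_comm:+.2f} pts after commission")
```

Because the stake shrinks as the price lengthens, one big priced winner adds the same 1 point as a short priced one, which is why this measure is less swayed by a lucky outsider than level stakes.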

Same Going

Strike rate 19.01% from 2530 bets

VPL -31.2 pts

VROI% -3.76%

Different Going

Strike rate 18.8% from 5720 bets

VPL +11.91 pts

VROI% 0.69%

Notice that a greater percentage of horses won from the going-confirmed runners, but the profit lies with the lower strike rate. This is a classic example of winners being for bragging but profit being for the bank account.

This is not of any statistical significance, but here are the qualifiers for today’s racing, i.e. won LTO but now on different going.

1.55 Asc Lights On
3.40 Asc Keyser Soze
4.15 Asc Group One Power
4.50 Asc Pettochside
2.15 Ling Nasha Nasha
2.15 Save a Forest
2.50 Kyprios
2.50 Third Realm
3.25 Double or Bubble
2.55 Nott Mountain Brave
4.5 Bint Al Anood
5.30 Thirsk Moe Celita
6.05 Highjacked
7.35 Raadobarg

Footnote – End of day: predictions from today’s blog made +2.54 pts to variable stakes to win 1 pt at BFSP minus 2% comm, or for the brave +11.89 pts to level stakes.

Further footnote – Although the same disparity existed in 2014 to 2016 there was no automatic variable stake profit to be had, so do not think this is a system to follow. The above winning day was nice from the point of view of making a point, but nothing more.

Federated Machine Learning for Horse Racing


The traditional approach to developing Machine Learning models for horse racing is that individuals hold their data on their local machine, or maybe even phone, and deploy an ML algorithm to produce a model and predictions based on that model. If we want to combine the predictions of multiple models owned by many different people we would need to gain access either to their data or to their predictions. The former poses data sensitivity and security problems before we even consider the protective nature of individuals towards their betting data. But what if you would like to contribute to a collaboration without the need to compromise your privacy?

For example, what if you are a pretty good paddock judge and you would like to build a model that predicts based on horse watching (see Watching Racehorses by Hutson)? You may then want to submit the outputs of this model to a pool of models from other people, which may be totally unrelated to paddock watching, in the hope that your input could be a vital cog in a profitable wheel accessed only by those making ‘valuable’ model contributions. The central algorithm that takes these varied inputs would decide which are proving valuable and not only weight their input but also manage access to the predictions. It could be that your super contribution gives you access to the final predictions 1 minute earlier than my less important predictor gives me.

What I am describing here is related to an area of research known as Federated Learning, the subject of a 2016 paper from Google. The central idea is that your data does not have to be uploaded and your algorithm stays protected within your own processing device. In Google’s version it is model updates rather than raw data that are sent to the central server; in the version sketched here only the predictions are contributed to the central final processing step.

This sounds exciting and enables a whole that may be more predictive than the specialised parts.
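To give a flavour of what the central step might look like, here is a toy sketch in which each contributor submits only win probabilities and the centre weights contributors by how well their past predictions scored. All of it is assumption on my part: the contributor names, the numbers, and the choice of inverse log loss as the measure of value. It is a simple weighted blend of predictions and not the federated averaging algorithm from the Google paper.

```python
import math

def log_loss(probs, outcomes):
    """Mean log loss of win probabilities against 1/0 outcomes (lower is better)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)   # keep the logarithm finite
        total += -math.log(p) if y else -math.log(1.0 - p)
    return total / len(probs)

def contributor_weights(history, outcomes):
    """Weight each contributor by inverse log loss on past races, normalised."""
    scores = {
        name: 1.0 / (log_loss(probs, outcomes) + 1e-9)
        for name, probs in history.items()
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

def combine(predictions, weights):
    """Weighted average of the contributors' probabilities for each runner."""
    n_runners = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions)
        for i in range(n_runners)
    ]

# Invented past predictions for four runners, and what actually happened.
past_outcomes = [1, 0, 0, 1]
history = {
    "paddock_judge": [0.7, 0.2, 0.3, 0.6],
    "speed_model":   [0.5, 0.5, 0.5, 0.5],
}
weights = contributor_weights(history, past_outcomes)

# New predictions for a two-runner example; only these numbers are shared.
todays = {"paddock_judge": [0.6, 0.4], "speed_model": [0.3, 0.7]}
blended = combine(todays, weights)
print(weights)
print(blended)
```

The same weights could also drive the access rules described above, with the better scoring contributors seeing the blended predictions first.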

Guineas Breeding 3

In the previous two posts I looked at the pedigree implications for Guineas runners over the last 10 years. In this post I have extended the data to take in runners from 2000 to 2020. Using MySportsAI I have trained four Machine Learning models on the historical data (see previous posts for an explanation of the data). The ML models are Logistic Regression, Gradient Boosting Machines, XGBoost and LogGAM. I then created the input data for this year’s 2,000 Guineas runners and made predictions using the models trained on the historical data.

The input data for this years main runners is

Newm_15:00_2021-05-01   Thunder Moon         8.65   9.89   74.23   73.28
Newm_15:00_2021-05-01   One Ruler            9.19   7.76   82.82   76.02
Newm_15:00_2021-05-01   Master Of The Seas   9.19   8.31   82.82   72.66
Newm_15:00_2021-05-01   Poetic Flare         8.78   8.44   71.55   71.27
Newm_15:00_2021-05-01   Lucky Vega           8.74   8.27   76.16   73.65
Newm_15:00_2021-05-01   Mac Swiney           9.78   8.33   80.75   69.47

The following is the results after deploying the four ML algorithms with sum total of rank positions given by each algorithm. Overall the lowest total ranked position would be the top rated across the four models.

Newm_15:00_2021-05-01   Thunder Moon         3   4   6   3    16
Newm_15:00_2021-05-01   Mac Swiney           6   7   2   5    20
Newm_15:00_2021-05-01   Lucky Vega           8   2   8   6    24
Newm_15:00_2021-05-01   Master Of The Seas   4   8   3   10   25
Newm_15:00_2021-05-01   Poetic Flare         9   5   9   4    27
Newm_15:00_2021-05-01   One Ruler            5   9   7   9    30
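The rank-sum step is simple enough to sketch. The four rank positions per horse below are copied from the six rows listed here (the rest of the field is not shown in this post, so it is left out); the variable names are mine.

```python
# Rank given to each horse by each of the four models (rows listed above).
ranks = {
    "Thunder Moon":       [3, 4, 6, 3],
    "Mac Swiney":         [6, 7, 2, 5],
    "Lucky Vega":         [8, 2, 8, 6],
    "Master Of The Seas": [4, 8, 3, 10],
    "Poetic Flare":       [9, 5, 9, 4],
    "One Ruler":          [5, 9, 7, 9],
}

# Sum the four rank positions; the lowest total is the best overall rating.
totals = {horse: sum(model_ranks) for horse, model_ranks in ranks.items()}
ordered = sorted(totals.items(), key=lambda item: item[1])

for horse, total in ordered:
    print(f"{horse:20s} {total}")
```

Summing ranks rather than raw model outputs means no single model can dominate just because its probabilities happen to be more spread out.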

Top rated therefore is BattleGround followed by equal 2nd Thunder Moon and Mutasaabeq

I will run off the 1,000 Guineas predictions when the decs are known, watch this space.

You can find out more about MySportsAI at

Footnote – Insufficient data to assess Van Gogh

Guineas Breeding 2

In the previous post I looked at how the sires and damsires may predict the ability of Guineas runners to handle the 8f of the Newmarket track. The message seemed a bit muddy but some possible trends may have been evident. In this post I will turn the attention to sire and damsires predicting class as measured by official ratings. The same approach will be used as the previous blog post but here the OR where recorded will be used as the measure and of course averaged when looking at Guineas runners.

OK, let’s get onto the data and results. First off, the sires: the average place OR of sire offspring for all Guineas runners was 75.86. Using this as the initial split, i.e. those runners where sireOR > 75.86 and those <= 75.86, we have the following.

<= 75.84   100   4   -0.40339

Quite a clear split between the two and generally as you increase the OR threshold the win percentage gets higher

OR threshold and percentage wins

Onto the dam sire data

> 72.28    110   7.272727   0.677137
<= 72.28   97    8.247423   1.660867

The dam sire influence is less pronounced or should I say not significant at all.

Closer to Guineas day I will run off the various breeding stats for this year’s runners and finish with a final Guineas blog. After that, onto the Derby to see if one can profit from pedigree.