Imbalanced Data in Machine Learning and Horse Racing

Imbalanced data is when one classification value occurs far more often than another. Say you are predicting the existence of a disease from a number of input fields, and the presence of the disease, represented by a 1, amounts to only 10% of your data; then you have an imbalanced data set. Machine learning algorithms tend to be geared towards minimising overall prediction error, and if predicting zero every time gives a 90% success rate, as in this case, the algorithm can end up simply predicting zero.

Hopefully this all sounds familiar, because in horse racing we tend to have imbalanced data. The number of winners may well be around the 10% mark. How do we get our Machine Learning algorithms to behave more sensibly with the data? The following link provides a pretty good explanation of some of the techniques that can be employed and also shows why Random Forests are pretty good at avoiding this pitfall.

https://elitedatascience.com/imbalanced-classes
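To make the 90% trap concrete, here is a minimal sketch in plain Python. The 1000-runner sample with a 10% win rate is made up for illustration, and the weighting formula is the "balanced" heuristic that scikit-learn documents for its class_weight option, n_samples / (n_classes * class_count). It is shown as one common remedy, not necessarily the one the linked article prefers.

```python
from collections import Counter

# Hypothetical sample: 1000 runners, 10% winners (1), 90% losers (0).
labels = [1] * 100 + [0] * 900

# A "model" that predicts 0 every time still looks 90% accurate...
always_zero = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(always_zero, labels)) / len(labels)

# ...yet it finds none of the winners (recall on class 1 is zero).
winners_found = sum(p == 1 and y == 1 for p, y in zip(always_zero, labels))
recall = winners_found / labels.count(1)


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


weights = balanced_class_weights(labels)
print(accuracy, recall, weights)
```

With weights like these a mistake on a winner costs roughly nine times as much as a mistake on a loser, so always predicting zero is no longer the cheap option.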

Betting Advice From Warren Buffett

Most people still pick up a paper in the morning and check the day's races to see if there is a bet they fancy. They may have some hardwired preferences, such as liking or disliking handicaps, leaning towards NH racing over flat, or any number of other filters, but on the whole they tend to go looking for a bet. This strategy is the polar opposite of allowing bets to come looking for you, which is the fundamental message behind Warren Buffett's point number two in the following video, although I would be the first to guess that Mr Buffett has never backed a horse in his life.

I strongly believe that if you set out at the start of the next NH, AW or Flat season with a policy of perhaps absorbing yourself solely in staying handicap chasers or Novice hurdlers over two miles or sprint handicaps on the AW, you would be surprised at how expert and knowledgeable you become about not only that particular pool of horses but also other factors such as the nuances of the draw, the demands of the track, the pace bias of the track, even the behaviour of particular trainers who specialise in these subsets of animals.

With this approach bets will start to come to you rather than the other way round, and your bottom line will improve. Which area to choose is for you to decide. I would suggest starting with the area that interests you most, although this choice may be affected by an opinion you hold regarding an angle of analysis. For example, you may feel that the draw is undervalued in races beyond 7f, so you might think this gives you a good starting point to specialise in one mile to 10f handicap races on the flat. There are numerous areas you could specialise in, and I am sure your judgement on when not to bet will automatically improve.

Please feel free to leave a comment. It is nice to know someone is out there.

Improve Your Data Analysis Skills

Many things inspired me to turn to lecturing in Computer Science, but a couple of guys who taught me on my degree course played a part. They were both excellent lecturers, holding that rare gift of being both smart and understanding how to get complex ideas across. They had very different styles. Ian Morrey had a highly organised approach with a steady, clear delivery that simply drew you in. Doug Bell was entertaining and slightly eccentric, but not at the expense of being equally organised. They were both great lecturers in slightly different ways.

If you are looking to extend your understanding of data analysis, and in the modern world of betting I would strongly suggest you invest some time in this, then I can suggest two excellent YouTube tutorial providers who have similar qualities to the above.

The first, whom I may have mentioned before, is Kevin Markham and his DataSchool series. Below is a link to his introductory series on Machine Learning, amongst other topics he covers.

The second is the Jalayer Academy, where you can get anything from an introduction to basic topics such as handling Excel through to more advanced stuff in R. Link below to his Excel introduction.

Firm at Ascot

Ribchester has just smashed the course record at Ascot, and with Firm creeping into the going we have the potential for a strong bias. By way of a bit of fun I will test this theory by posting up possible bets in all handicaps. There will be multiple selections per race but, betting to BFSP, a profit over the week is a realistic possibility should the going remain fast.

5.00 Today

Inicia
Oceane
Rainbow Dreamer
Sueugioo
Shrewd
High Secret

Lay Beyond Conceit

Day 2

5.00
George Williamson
Elleval
Banksea
My Target
Bossy Guest
Remarkable
Tashweeq
Bravery
Hors De Combat
Another Touch
Boomshackerlacker
GM Hopkins
Withernsea

Lay G K Chesterton

5.35
Baileys Showgirl
Gymnaste
Dancing Breeze
Con Te Partito
Asking
Rain Goddess
Classical Times
Cheval Blanche

Lay Sibilance

Day 3

5.00
Ronald R
Bless Him
Tricorn
Lightning Fast
Afaah
Maths Prize
City of Joy
Moritzburg
Medahim
Keyser Soze
Colibri

No Lays

5.35
Twin Star
Weekender
Bin Battuta
Tartini
Never Surrender
Utah
Master Singer
Good Omen
Janszoon
Atty Persse

Lay Sofias Rock

Day 4

5.35

Lustrious Light
Eddystone Rock
Star Storm
Manjaam
Mainstream
Wadigor
Petite Jack
Sixties Grove
Master Carpenter
Shabeeb

No Lays

Day 5
I remain unconvinced, looking at the times, that the ground has any firm in it. I think the clerks have done a good job of taking the firm out of the course, and with temperatures modest today I think it would be prudent to draw stumps even though I have the lists prepared.

Results

61 plays, PL +36 pts to BFSP before commission, ROI 59%

FOOTNOTE: Perhaps I was premature in drawing a line on Saturday. After all, the research is based on Firm being present in the going description regardless, and there is also the painful fact that the two lists for the two Saturday handicaps contained Snoana, who won at 40.0, and Outdo, who won at 42.0!

If there is any interest I will be happy to demonstrate how the horses running against the bias can be bet profitably next time out, in particular at Newmarket, where they have done very well historically.

Improving 3yos

The new flat season began 2 days ago and the hopes and expectations of many horses to follow lists will be high. I have never been one for compiling such a list and on the whole have viewed them as a bit of fun, although I am aware people buy the publications yearly without fail. Can any stats help us with compiling such a list, and are there certain trainers we can focus on for our list selection or more generally for betting purposes?

I have spent a little time looking at which trainers have the best records of improving 3yos from their first handicap mark through to their highest handicap mark of the 3yo season. To do this I compiled average official rating (OR) gains for the 3yo horses in each trainer's care, taking the base OR from their first, second or third race depending on which was responsible for the initial rating. Only trainers with at least 10 3yo runners were considered.
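The calculation described above can be sketched roughly as follows. This is a minimal illustration and not the actual code used for the figures; the record layout (trainer, base OR, peak OR) and the sample ratings are assumptions made up for the example.

```python
from collections import defaultdict


def average_or_gain(records, min_runners=10):
    """records: iterable of (trainer, base_or, peak_or), one per 3yo.

    Returns [(trainer, mean_gain, runners)] sorted best first, keeping only
    trainers with at least `min_runners` horses.
    """
    gains = defaultdict(list)
    for trainer, base_or, peak_or in records:
        gains[trainer].append(peak_or - base_or)
    table = [
        (trainer, sum(g) / len(g), len(g))
        for trainer, g in gains.items()
        if len(g) >= min_runners
    ]
    return sorted(table, key=lambda row: row[1], reverse=True)


# Made-up ratings purely for illustration.
sample = [("Trainer A", 70, 78), ("Trainer A", 65, 71), ("Trainer B", 80, 82)]
print(average_or_gain(sample, min_runners=2))
```

The minimum runner filter matters: without it a trainer with one big improver would jump to the top of the table.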

For the years 2016, 2015 and 2014 the undoubted and unsurprising star of the improving 3yo is Sir Mark Prescott. He was top in 2016 and 2015 and 8th in 2014. His average OR improvement was 7.09 pounds from 21 horses in 2016.

The next best trainer across the three years is Luca Cumani. He came in 10th in 2016, 1st in 2015 and 5th in 2014. His average OR improvement in 2016 was 4.44 pounds from 18 horses.

R M Beckett would also appear to be an improving trainer worth keeping an eye on. His positions have been 6th, 7th and 19th.

Finishing 2nd last year and a new kid on the block is Hugo Palmer with an average improvement of 6.55 pounds from 20 runners.

If you are scratching your head trying to compile a horses to follow list then you could do worse than to select 3yos with 3 or fewer runs on the clock from trainers Prescott, Cumani, Beckett and Palmer. Have a good and profitable flat season.

Oh, by the way, here is a list to follow from the four trainers mentioned. It might need a bit of pruning. As I write this there have been 30 bets from this list with a registered OR, yielding a pre commission profit of +3.51 pts on Betfair SP. Perhaps betting them until they win may improve things; at the moment that stands at +5.51 pts from 28 bets.

Abouttimeyoutoldme R M Beckett
Accento H Palmer
Al Mayda (USA) H Palmer
Alapinta R M Beckett
Aljezeera L M Cumani
Alouja (IRE) H Palmer
Alwaysandforever (IRE) L M Cumani
Anythingtoday (IRE) H Palmer
Aryeh (IRE) H Palmer
Aureana R M Beckett
Beach Break R M Beckett
Bedouin (IRE) L M Cumani
Belle Diva (IRE) R M Beckett
Bessemer Lady R M Beckett
Best Of Days H Palmer
Beyond Recall L M Cumani
Bird To Love R M Beckett
Boost Sir Mark Prescott
Brimham Rocks R M Beckett
Buena Luna Sir Mark Prescott
Buxted Dream (USA) L M Cumani
Camerone (IRE) R M Beckett
Cape Cruiser (USA) R M Beckett
Carigrad (IRE) H Palmer
Castleacre H Palmer
Choumicha H Palmer
Cirencester R M Beckett
City Limits L M Cumani
Cloud Dragon (IRE) H Palmer
Colibri (IRE) H Palmer
Considered Opinion R M Beckett
Cool Team (IRE) H Palmer
Crimson Rock (USA) R M Beckett
Dance Teacher (IRE) R M Beckett
Denver Spirit (IRE) L M Cumani
Dervish L M Cumani
Diamond Bear (USA) Sir Mark Prescott
Diptych (USA) Sir Mark Prescott
Dr Julius No R M Beckett
Dubaitwentytwenty H Palmer
Earthly (USA) R M Beckett
Elysees Palace Sir Mark Prescott
Escobar (IRE) H Palmer
Fibonacci H Palmer
Fleabiscuit (IRE) H Palmer
Follow Me (IRE) H Palmer
For The Roses R M Beckett
Fox King R M Beckett
Gemina (IRE) R M Beckett
God Given L M Cumani
Gorgeous Noora (IRE) L M Cumani
Goya Girl (IRE) R M Beckett
Great Court (IRE) L M Cumani
Gulliver H Palmer
Harebell (IRE) R M Beckett
Humbert (IRE) H Palmer
Hyper Dream (IRE) H Palmer
Inconceivable (IRE) R M Beckett
Influent (IRE) H Palmer
Inspector (IRE) H Palmer
Isabel De Urbina (IRE) R M Beckett
Khattar H Palmer
Kind of Beauty (IRE) H Palmer
Kitty Boo L M Cumani
Kohinoor Diamond (IRE) Sir Mark Prescott
Koropick (IRE) H Palmer
La Guapita H Palmer
Lagertha (IRE) H Palmer
Manangatang (IRE) L M Cumani
Manchego H Palmer
Medicean Dream (IRE) L M Cumani
Melinoe Sir Mark Prescott
Melodic Motion (IRE) R M Beckett
Mistress Quickly (IRE) R M Beckett
Munawer H Palmer
Munro R M Beckett
Newt Sir Mark Prescott
Nurse Nightingale H Palmer
Omeros H Palmer
Parisian Chic (IRE) L M Cumani
Piaffe (USA) R M Beckett
Piedita (IRE) Sir Mark Prescott
Pincheck (IRE) L M Cumani
Poetic Voice R M Beckett
Polly Glide (IRE) L M Cumani
Really Super R M Beckett
Red Label (IRE) L M Cumani
Rickrack (IRE) L M Cumani
Roseland (USA) H Palmer
Sea Tide H Palmer
Secret Soul R M Beckett
See You After (IRE) Sir Mark Prescott
Send Up (IRE) Sir Mark Prescott
Shozita R M Beckett
Sibilance R M Beckett
Single Estate Sir Mark Prescott
So Sleek L M Cumani
Sound Bar R M Beckett
Spinnaka (IRE) L M Cumani
Spun Gold L M Cumani
Star Of Doha R M Beckett
Star Story R M Beckett
Starshell (IRE) Sir Mark Prescott
Steaming (IRE) R M Beckett
Subatomic R M Beckett
Tamayef (IRE) H Palmer
Turning Gold Sir Mark Prescott
Via Serendipity H Palmer
Vintage Folly H Palmer
Western Duke (IRE) R M Beckett
What A Boy R M Beckett

Is Wind The New Going?

I wrote a few blog entries ago about the performance of pace at the new Newcastle AW track. I thought I would update matters with performance figures up to the end of 2016. For Newcastle therefore that is about 8 months of racing on the AW. A reminder that all my figures are based on pre race pace prediction as per Smartersig pace figures.

In addition to an update I thought it would be interesting to include some reflection on the effect of the wind during this period. To do this I compiled what I will call a ‘degree of separation’. This simply means the number of compass points by which the wind direction differs from a tail wind at Newcastle. Let me explain. A tail wind at Newcastle is South East, or SE for short. This means that the wind is blowing from the south east. The divisions between South and East are as follows:

South
South South East
South East
East South East
East

Similarly, going west from south we have:

South
South South West
South West
West South West
West

So from the above we can say that East is 4 degrees of separation from South as is West. South South East is 1 degree of separation from SE and so on.
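As a rough sketch, the degree of separation can be computed by counting steps around a 16-point compass. The function name and the abbreviations (N, NNE and so on) are my own choices for illustration.

```python
# The 16 compass points in clockwise order, starting from North.
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]


def degrees_of_separation(wind_from, reference="SE"):
    """Steps around the 16-point compass between two directions (0 to 8)."""
    steps = abs(COMPASS.index(wind_from) - COMPASS.index(reference))
    # Go whichever way round the compass is shorter.
    return min(steps, len(COMPASS) - steps)


print(degrees_of_separation("E", "S"))   # East vs South
print(degrees_of_separation("SSE"))      # SSE vs the SE tail wind
```

The reference defaults to SE because that is the tail wind at Newcastle; another track would simply pass its own tail wind direction.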

Data on wind direction on race days along with wind speed and temperature were collected.

First of all the results from blindly backing to BFSP the various broad categories of pace, regardless of conditions or price. All figures are pre commission.

Led last time: 247 bets, PL -83.6, ROI -33.8%
Tracked: 652 bets, PL -49.1, ROI -7.5%
Held up: 890 bets, PL -52.8, ROI -5.9%

From the above we can see that leaders have an appalling record.
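For anyone wanting to check the percentages, ROI here is simply profit divided by total stakes. The sketch below assumes level stakes of 1 point per bet, which is my reading of the figures rather than something stated explicitly.

```python
def roi_percent(profit_pts, bets, stake=1.0):
    """Return on investment as a percentage of total stakes."""
    return 100.0 * profit_pts / (bets * stake)


# Figures quoted above for Newcastle AW (pre commission, BFSP).
pace_results = {
    "Led": (247, -83.6),
    "Tracked": (652, -49.1),
    "Held up": (890, -52.8),
}

for pace, (bets, pl) in pace_results.items():
    print(f"{pace}: {roi_percent(pl, bets):.1f}%")
```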

Now let's see what effect the wind has had. As a broad brush stroke I have classed a degree of separation (DOS) of less than 4 as a tail wind, although of course a 3, for example, would be something of a cross wind. On the other hand a DOS above 4 is classed as a head wind.
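In code the broad brush classification might look like the sketch below. The "cross" label for a separation of exactly 4 is my own assumption; the text only defines the tail and head wind groups.

```python
def classify_wind(dos):
    """Map a degree of separation (0 to 8) to a broad wind category."""
    if not 0 <= dos <= 8:
        raise ValueError("degree of separation must be between 0 and 8")
    if dos < 4:
        return "tail"
    if dos > 4:
        return "head"
    return "cross"  # assumption: the post does not say how exactly 4 is treated


print([classify_wind(d) for d in range(9)])
```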

First of all a tail wind

Led: 63 bets, PL -33.3, ROI -53%
Tracked: 157 bets, PL -38.06, ROI -24.4%
Held up: 233 bets, PL -17.4, ROI -7.4%

Now let's take a look at the head wind situation

Led: 171 bets, PL -42.9, ROI -25.1%
Tracked: 454 bets, PL +10.2, ROI +2.2%
Held up: 601 bets, PL -22, ROI -3.6%

The first thing to notice is the far greater number of selections when there is a head wind, suggesting that Newcastle gets more than its fair share of head winds down the straight. This, coupled with the long straight, may be the reason for the poor performance of leaders. Strangely enough, the wind speed did not seem to add much value at this stage.

This area opens all kinds of possible avenues of research and I have already compiled data for all UK flat tracks. More obscure areas of enquiry might be aspects such as whether greys do better in hot temperatures than dark horses, or is all this just blowing wind up our …

All comments welcome below

One Step Two Step or Half Step

I was reading the article linked below last night. It revisits the idea that when creating a set of ratings for horse racing one can gather a set of horse features for a given race: for example the jockey strike rate of each mount, along with the draw position, along with … you get the picture. Now the difference between a one step and a two step model is that with a one step model you include each horse's starting price, be that bookmaker or Betfair, as a feature. The problem with this approach is that the SP can swamp the attention of your chosen model building algorithm. This is not surprising really, given the well documented relationship between SP and winning. Short priced horses win more often, and even shorter priced horses win more often still, and so on.

The two step approach gets round this by building the model using only what are called the fundamental features; in other words we take out the SP and focus on the characteristics of the horse. Once we have built this model and produced a set of ratings for a given race we proceed to step two. In step two the SP is combined with the results of step one in order to build a final model. This means the SP has not had a chance to bully the fundamental features, as they were examined in step one.
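One way the second step could be sketched is a log-linear blend of the step one probabilities with the market's implied probabilities. This is only an illustration of the idea: the blend form is one common choice and not necessarily what the linked paper does, and the weights alpha and beta are placeholders that would normally be fitted on past results.

```python
import math


def normalise(values):
    """Scale a list of positive numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def two_step_probabilities(fundamental, sp_decimal, alpha=0.5, beta=0.5):
    """Blend step one probabilities with market prices for one race.

    fundamental: win probabilities from the SP-free model (step one).
    sp_decimal:  decimal starting prices for the same runners.
    alpha, beta: blend weights; placeholders here, normally fitted in step two.
    """
    market = normalise([1.0 / odds for odds in sp_decimal])  # strip the overround
    scores = [
        math.exp(alpha * math.log(f) + beta * math.log(m))
        for f, m in zip(normalise(fundamental), market)
    ]
    return normalise(scores)


# Made-up three-runner race.
probs = two_step_probabilities([0.5, 0.3, 0.2], [2.5, 3.0, 6.0])
print([round(p, 3) for p in probs])
```

Setting alpha to 0 hands everything to the market and setting beta to 0 ignores it completely, so the fitted weights show how much the fundamental model adds beyond the price.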

All this can lead to an evaluation of the chance of each horse winning, and hence a betting strategy based on backing those whose odds are longer than their predicted chance according to the model. For many people using their own ratings or someone else's, producing the odds line can be a daunting problem, but do we really need to worry about that stage? Can we just forget the oddsline component?
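The oddsline test itself is simple once you have a probability. A minimal sketch, ignoring commission and assuming the model probability is trustworthy (the function names are my own):

```python
def is_value_bet(model_prob, decimal_odds):
    """True when the price on offer is longer than the model's fair price."""
    fair_odds = 1.0 / model_prob
    return decimal_odds > fair_odds


def expected_profit(model_prob, decimal_odds, stake=1.0):
    """Expected profit per bet if the model probability is correct."""
    return stake * (model_prob * decimal_odds - 1.0)


# A horse the model gives a 25% chance (fair price 4.0) on offer at 5.0.
print(is_value_bet(0.25, 5.0), expected_profit(0.25, 5.0))
```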

If a set of ratings is profitable to top or top two rated, do we need to oddsline it? Perhaps not. Creating an oddsline may well produce fewer bets and perhaps a more impressive ROI%, but what if backing the top two had created pretty much the same profit from twice or three times as many bets? I am suggesting that the non oddsline approach can have its merits in our UK set up. In the US, where a 17% takeout has to be overcome along with no facility to take a price, an oddsline is an essential tool as I see it, but here in the UK we bet to a 1% takeout (OK, a little more if you are paying 5% commission on Betfair). Furthermore the dreaded premium charge looms over us if we get successful on Betfair, and here is where the non oddsline approach has some merit. The larger number of bets generated and the fluctuations in the profit rate will offer some safeguard against premium charges. A higher ROI% from fewer bets will in time be more likely to change a non premium charge account into a PC one. A slower burning, larger turnover account will have a much greater chance of avoiding PC. In fact I would encourage all break even type bets to be left in your betting portfolio to add extra protection.

Here is the link to the paper, comments welcome as usual.

http://www.bjll.org/index.php/jpm/article/viewFile/419/450

AI Ratings Update

Stef, the original creator of Smartsig, produced a set of ratings using a neural network. The ratings were based on the finishing positions in the last 3 runs of each horse along with the days since last run. This data was fed into an individual neural network for each of NH hurdles, NH chases, AW flat races and turf flat races. A typical line of data would look something (I guess) like

5 1 3 76 0

This shows a horse that finished 5th or worse in its third last race (all placings outside the first four are represented as 5), 1st in its second last race and 3rd in its last race, that last ran 76 days ago, and that was not a winner in this coming race.
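A small sketch of how such a line might be built from raw form follows. As with the example line itself this is a guess at the layout rather than Stef's actual code, and the function name is made up.

```python
def encode_run_line(last_three_positions, days_since_run, won):
    """Build one training row: three capped placings, days off, win flag.

    last_three_positions is ordered oldest first
    (third last run, second last run, last run).
    """
    capped = [min(pos, 5) for pos in last_three_positions]
    return capped + [days_since_run, 1 if won else 0]


# A horse that finished 7th, then 1st, then 3rd, off for 76 days, and lost.
print(encode_run_line([7, 1, 3], 76, False))
```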

These ratings, to my mind, were not intended as a point and fire set of ratings but more as an illustration of how AI can be used, and perhaps even as a starting point for further study using either traditional form study or AI methods. They have been published daily on the web site with Stef's permission.

I thought it was perhaps time to play around with them a little further and attach some performance figures to them. I was wondering if the above representation was indeed the best configuration. I chose to use a Random Forest as the Machine Learning vehicle simply because scikit-learn and Python do not have a readily available NN module.

The first thing I did was create a file for AW handicaps based pretty much on Stef’s layout of placings being 1 to 5 where 5 means anything outside the first 4. Days since last run were left as is. It is inevitable that some horses will not have 3 runs and in these cases I opted within Python to replace the values with the mean of the whole column. So a missing third run would be replaced with the mean for all third runs of all horses in the set. This is needed as Python and Scikit Learn do not allow missing values unlike the package R.
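The mean replacement step can be sketched in plain Python as below. Missing runs are marked with None, the sample rows are made up, and the sketch assumes every column has at least one known value. scikit-learn's SimpleImputer with strategy="mean" does the equivalent job on arrays.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the known values in that column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        known = [row[c] for row in rows if row[c] is not None]
        means.append(sum(known) / len(known))
    return [
        [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in rows
    ]


# Columns: third last placing, second last placing, last placing, days since run.
rows = [
    [5, 1, 3, 76],
    [None, 2, 4, 14],   # only two previous runs, so the oldest is missing
    [1, 5, 5, 30],
]
print(impute_column_means(rows))
```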

The next step was to train the forest on 2011 to 2013 data. Once this was done I tested the model on 2014 to mid 2016. I was hoping that, to BFSP, top rated horses might get close to break even, as I recall that the original AI ratings' top rated lose about 8 or 9% to bookie SP. I was pleasantly surprised to find the following:

Top rated: 4103 bets, PL after 5% commission +215.3, ROI +5.24%

The bottom rated horses produced

Bottom rated: 7384 bets, PL -927, ROI -12.55%
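For reference, profit figures of this kind can be tallied as in the sketch below. It assumes 1 point level stakes and applies the 5% commission to each winning bet, which is a simplification since Betfair charges commission on net winnings per market. The sample bets are made up.

```python
def back_bet_profit(bfsp, won, commission=0.05, stake=1.0):
    """Profit in points from one level-stake back bet at Betfair SP."""
    if won:
        return stake * (bfsp - 1.0) * (1.0 - commission)
    return -stake


def summarise(bets, commission=0.05):
    """bets: list of (bfsp, won). Returns (number of bets, profit, ROI %)."""
    profit = sum(back_bet_profit(sp, won, commission) for sp, won in bets)
    return len(bets), profit, 100.0 * profit / len(bets)


# Made-up results for four top rated runners.
sample = [(3.0, True), (5.0, False), (2.5, False), (8.0, False)]
print(summarise(sample))
```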

Encouraged by this I went on to try a modification to the placings data, using the position of a horse in a race as a proportion of the runners in the race. So first of 2 would be 0.5 whilst first of 10 would be 0.1. Placings were not cut off after 4, so for example 5th of 10 would be 0.5. I was hoping that this extra information would produce better results but, as is often the case in this game, more can mean less.

Top rated: 4254 bets, PL +5.7, ROI +0.13%
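The proportional placing is just position divided by field size. A one-function sketch, with a name of my own choosing:

```python
def fractional_placing(position, runners):
    """Finishing position as a fraction of the field size (lower is better)."""
    if not 1 <= position <= runners:
        raise ValueError("position must be between 1 and the number of runners")
    return position / runners


print(fractional_placing(1, 2), fractional_placing(1, 10), fractional_placing(5, 10))
```

Note that a winner of a 2 runner race and the 5th of 10 both score 0.5, which may be part of why the extra detail did not help.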

Finally I tried a hybrid of the above two methods. Placings of 1st, 2nd, 3rd and 4th would be expressed as a proportion of the total runners in the race, whilst 5th or worse would be represented as 1. This produced the following results:

Top rated: 4118 bets, PL +68.5, ROI +1.66%
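The hybrid encoding, as I read the description, could be sketched like this (the function name is illustrative):

```python
def hybrid_placing(position, runners):
    """Fraction of the field for the first four home, otherwise 1."""
    if position <= 4:
        return position / runners
    return 1.0


print(hybrid_placing(2, 8), hybrid_placing(5, 10), hybrid_placing(9, 12))
```

One quirk of this scheme is that 4th of 4 also comes out as 1, the same value as any unplaced run.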

If there is interest in these ratings via the comments below I would be happy to produce them alongside the AI ratings and maybe extend them into other codes of racing. Any feedback below is most welcome.

Jockey Ratings

I have been pondering recently over the relative merits of different jockeys. Perhaps it is the sad, untimely death of Walter Swinburn and the praise heaped upon him now he has gone that has prompted me, or perhaps I have always wondered how Racing Research go about formulating jockey ratings, as I seem to recall they had Ray Cochrane as their best jockey in one particular year. Whatever method we choose for rating jockeys, it has to be first and foremost objective and logical. Simply checking strike rates does not account for the fact that jockeys ride for different stables, which drive those strike rates. Is a top jockey with a top yard better than a mid strike rate jockey finding mounts where he can? The latter could be the better jockey but the strike rate would not show this.

One option is to use the market as a measure of jockey ability. The problem with this is that the market tends to overbet and underbet certain jockeys. One way to try and iron this out is to take a look at the AE values of jockeys and then compare this with the AE values of all jockeys with the same overall strike rate. Perhaps a jockey with a 12% strike rate should get an AE value equal to that of all jockeys with a 12% strike rate. If it was lower then this would indicate that he is not booting home as many winners as his fellow 12% jockeys on average. If not an absolute measure of pure ability it might indicate who we should avoid and who we should look twice at.
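For readers unfamiliar with AE values, the sketch below uses the usual definition of actual winners divided by expected winners, with the expected figure taken as the sum of the probabilities implied by the odds. I am assuming that definition here, and the sample rides are made up.

```python
def ae_value(rides):
    """rides: list of (decimal_odds, won). Returns actual / expected winners."""
    actual = sum(1 for _, won in rides if won)
    expected = sum(1.0 / odds for odds, _ in rides)
    return actual / expected


# Made-up record: four rides at 4.0, one of which won.
rides = [(4.0, True), (4.0, False), (4.0, False), (4.0, False)]
print(ae_value(rides))
```

An AE of 1 means the jockey wins exactly as often as the market expects, above 1 more often and below 1 less often.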

I carried this out on the jockeys with a minimum of 1000 rides from 2012 onwards. I used linear regression to smooth out the strike rate AE values and then compared each jockey's AE value with the AE for his strike rate. The league table of jockeys, ranging from top to bottom, came out as follows, with William Twiston-Davies as the number 1 jockey.

William Twiston-Davies
J Quinn
J P Sullivan
F Norton
P Cosgrave
D Allan
R Winston
P Hanagan
Martin Lane
David Probert
P Makin
L P Keniry
S De Sousa
Oisin Murphy
G Baker
P McDonald
Jim Crowley
D Tudhope
A Mullen
William Carson
P Mulrennan
D Sweeney
J P Spencer
S Donohoe
R Kingscote
J Fanning
A Kirby
Andrea Atzeni
T Eaves
Hayley Turner
L Morris
S W Kelly
T Hamilton
T P Queally
R Hughes
R L Moore
G Lee
R Havlin
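A table like the one above could be produced roughly as follows: fit a straight line of AE against strike rate, then rank jockeys by how far their own AE sits above or below the line. This is one reading of the method, not the exact calculation used, and the jockeys and figures in the sample are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rank_jockeys(jockeys):
    """jockeys: dict of name -> (strike_rate, ae). Best residual first."""
    names = list(jockeys)
    rates = [jockeys[n][0] for n in names]
    aes = [jockeys[n][1] for n in names]
    intercept, slope = fit_line(rates, aes)
    # Residual: the jockey's AE minus the AE the line predicts for his strike rate.
    residuals = {
        n: ae - (intercept + slope * sr) for n, (sr, ae) in jockeys.items()
    }
    return sorted(residuals.items(), key=lambda item: item[1], reverse=True)


# Made-up strike rates and AE values for illustration only.
sample = {
    "Jockey A": (0.08, 0.85),
    "Jockey B": (0.12, 1.00),
    "Jockey C": (0.12, 0.90),
    "Jockey D": (0.16, 0.95),
}
for name, residual in rank_jockeys(sample):
    print(f"{name}: {residual:+.3f}")
```

A positive residual marks a jockey booting home more winners than his strike rate peers; a negative one marks a jockey to treat with caution.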

The Nature of Expert Gamblers

The topic covered here is probably the most fundamental problem that most gamblers fail to overcome. First of all let me say that you must be playing a game in which the odds on offer can, at certain points in the game, be larger than the true probability warrants. This is the first fundamental flaw amongst gamblers: they prefer simplicity, citing complexity as the enemy, when in actual fact the opposite is true. The greater the complexity the greater the opportunity, usually. Just ask a punter why he plays the slots in preference to horse racing and he will say that he has a better chance of winning! What he really means is that he has fewer variables to consider. The other problem facing gamblers, or at least those playing a game which offers opportunity, is that their approach is based around finding winners and not finding value. In horse racing they focus on what's going to win rather than whether there is a bet in the race. This is not surprising, as their first exposure to racing is via ‘experts’ in the media who operate in exactly the same way. Finally, one thing the speaker did not touch upon, which I think might be interesting, is what attitude successful gamblers have towards money. My guess is that, ironically, they are less driven by it than people think.