Sentiment Analysis and Hugh Taylor 3

At this point it might be worth mentioning a bit about what I am tracking here and how it is being tracked. I am obviously looking at whether it's possible to detect more or less positive selections from Hugh Taylor, but what measures are being used?

At the moment I am tracking the following criteria, which I will then try to explain.

Profit on Positive Probability
Profit on Polarity
Profit on Subjectivity
Profit on tip text length
Profit on good old back them all to BFSP

All the above are to BFSP before commission.

OK, so what do they mean? The last two are self-explanatory, the next to last being simply: is he more confident when he has more to say?

To explain the first three I now need to mention the Python library being used to carry out the analysis. The library is called TextBlob. There are a plethora of introductions out there should you be interested in the detail, and it is by no means the only option for this kind of work.

To carry out sentiment analysis on any text, the library has to refer to a lexicon of words, phrases and sentences if it is to determine what is positive and what is negative. So the word ‘great’ in some text may push the overall positive probability up, whilst the word ‘poor’ or even ‘not great’ will pull it down.

TextBlob comes with two ready-to-use corpora to refer to. One is a library of movie reviews, and it is this option that gives us the positive probability scores (let's hope Hugh does not tip anything called The Shawshank Redemption). The other is based on a lexicon of words and similar positivity scores, and gives us the Polarity score, a measure of positivity, and the Subjectivity score, a measure of how subjective or objective the text is. At this stage I am not sure that the Subjectivity score will be very useful, but let's track it and see.
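If you want to reproduce the three scores yourself, a minimal sketch of both analyzers follows. The example sentence is my own, not one of Hugh's, and the corpora need downloading once first with python -m textblob.download_corpora.

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

text = "He has shaped as if in good form and is capable of going close."

# Default PatternAnalyzer: polarity in [-1, 1], subjectivity in [0, 1]
blob = TextBlob(text)
print(blob.sentiment.polarity, blob.sentiment.subjectivity)

# NaiveBayesAnalyzer, trained on the NLTK movie reviews corpus,
# gives the positive probability score tracked above
nb = TextBlob(text, analyzer=NaiveBayesAnalyzer())
print(nb.sentiment.p_pos, nb.sentiment.p_neg)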

Another option, once a back catalogue of Hugh's text builds up, is to train an algorithm on Hugh's language itself, so that future selections are assessed based on his past language and its success rate. This is why I am logging not just winners but placed horses too, as his win rate is low and this may prove a problem when training such an algorithm. That is on the back burner for now; let's continue to see if how people feel about their movies is how Hugh feels about his horses (Godfather excluded).
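For what it's worth, TextBlob's built-in NaiveBayesClassifier would be the obvious starting point for that. A minimal sketch, with invented placeholder training pairs rather than real logged data:

from textblob.classifiers import NaiveBayesClassifier

# Invented placeholder pairs; the real training set would be logged
# write-ups labelled with their win/placed/lost outcomes
train = [
    ("shaped as if in good form and has a positive jockey booking", "pos"),
    ("much will depend on whether he breaks well enough", "neg"),
    # ...many more examples needed before this means anything
]

clf = NaiveBayesClassifier(train)
print(clf.classify("in excellent recent form and well drawn"))
print(clf.prob_classify("in excellent recent form and well drawn").prob("pos"))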

I will update at the end of week 1 with the figures on this blog entry.

End of week 1

If you had backed all Hugh's tips to BFSP you would be -4.02 pts down
If you had backed or laid them as indicated by positivity probability you would be -0.02 pts down
If you had backed or laid depending on text length you would be -7.98 pts down
If you had backed or laid as indicated by sentiment polarity you would be -1.98 pts down
If you had backed where Hugh appears more subjective and laid where he appears less subjective you would be +5.98 pts up
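For clarity, the settlement behind these numbers looks roughly like the sketch below, assuming the back/lay trigger is simply whether a tip's score sits above or below the sample average (the field names are my own):

def back_lay_pl(tips, key):
    # 1pt back when a tip's score is above the sample average,
    # 1pt lay when below; settled to BFSP before commission.
    # tips is a list of dicts holding the score key, 'bfsp' and 'won'.
    avg = sum(t[key] for t in tips) / len(tips)
    pl = 0.0
    for t in tips:
        if t[key] >= avg:
            pl += (t["bfsp"] - 1.0) if t["won"] else -1.0        # back
        else:
            pl += -(t["bfsp"] - 1.0) if t["won"] else 1.0        # lay
    return pl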

[Charts: cumulative plays and P&L for the Back All, Avg Probability, Text Length, Avg Polarity and Avg Subjectivity strategies]

End of week 2 18/3/18

[Charts: cumulative plays and P&L for the Back All, Avg Probability, Text Length, Avg Polarity and Avg Subjectivity strategies]



Sentiment Analysis and Hugh Taylor 2

Apologies to anyone who read the initial post under this title. Half an hour after posting it I realised that, due to a schoolboy error in my code, the outputs so far are not quite as uniform as I first thought. So far this week, up to today (Thursday), Hugh's positivity prediction scores according to sentiment analysis are as follows.

Pos Prob FinPos
0.9999987 0
0.99999948 0
0.99999474 1
0.99929637 2
0.99951526 ?

The average is running at 0.99976091, which means only three tips have been above average, including the winner. Early days, I agree, but I mention it now simply as one way of looking at the numbers.

It may turn out that a generally positive Hugh Taylor, or indeed any tipster for that matter, may require a specific machine learning algorithm trained on that tipster's vocabulary and tipping style in order to tease out any nuances. For example, ‘tends to break slowly’ might score far more negatively within a specifically trained algorithm than within the general corpus used here. That corpus may be fine for sifting positive and negative movie reviews but may be average at figuring out which side of the bed Hugh Taylor got out of this morning. Before I can do that, however, the sample size will have to build up, unless someone knows of a back catalogue of Hugh Taylor write-ups.

One reader also commented on using this approach to analyse race reader comments. I have done some work on this before, both statistically and using ML (see the SmarterSig mags), but I will try to return to it and blog soon.

I will keep you updated on the current approach via this blog as data builds up.

Sentiment Analysis and Hugh Taylor

Machine Learning has moved firmly into the area of sentiment analysis: the task of detecting whether written text carries greater or lesser traits of various underlying messages. Is the book review overall negative or positive? Is the person happy or sad based on his/her writing? I could not help wondering whether sentiment analysis could be applied to the question ‘Does Hugh Taylor really fancy that’.

There are some standard Python libraries that can help with performing sentiment analysis. But before looking at those, let's take a look at what gets churned out for two contrasting pieces of analysis. The first is Hugh Taylor's tip today, 5/3/2018, and the other is a more negative analysis of The New One's chances in the World Hurdle at Cheltenham.

First of all, here is Hugh's write-up for today

“Veteran STAMP DUTY doesn’t win very often and might struggle if given his normal hold-up ride in the first division of the extended 1m1f handicap at Wolverhampton (6.45), but he has shaped as if in good form in limited starts this winter and has a positive jockey booking, and he’s capable of going close if not inconvenienced by the run of the race.

He ran an excellent race here two outings ago from a wide draw, and again shaped well here last time when unsuited by the steady gallop. That form has been franked by the next-time-out wins of the second and third.

He ran very well for Luke Morris when runner-up behind an in-form favourite in October despite lacking a recent run, and although much will depend on whether he breaks well enough to take up a reasonable position in a race where there doesn’t look to be much pace, he might run well if getting the run of the race.”

Now, using Python and the TextBlob library, I ran a sentiment analysis on this piece, and the output using the NaiveBayes analyzer was

Sentiment(classification='pos', p_pos=0.99997, p_neg=2.5039e-05)

This means the text was classified as positive, the p_pos value being close to 1. This immediately highlights room for improvement: TextBlob uses a standard corpus that carries out sentiment analysis in a general form, whereas what would be useful is a corpus geared towards horse racing, or indeed Hugh Taylor. Can we feed a machine learning algorithm multiple examples such as the above, along with results, and train a sentiment analyser that is far better than the above at highlighting positive and less positive selections?
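For reference, the call that produces output in that shape is essentially the following, with tip_text standing in for the full write-up above:

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

tip_text = "Veteran STAMP DUTY doesn't win very often..."  # full write-up above

blob = TextBlob(tip_text, analyzer=NaiveBayesAnalyzer())
print(blob.sentiment)
# e.g. Sentiment(classification='pos', p_pos=0.99997, p_neg=2.5039e-05)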

Let us take a look at the output for the more negative analysis of The New One, done by a different tipster.

“As a general rule I tend away from horses when they are trying something different at the back end of their careers. There is not really anything in his profile that can help us judge the chances of him seeing out this three-mile trip – rather like it was with Nicholls Canyon in 2017! For his connections sake, I hope it is the same outcome as it was for the ‘unproven’ Mullins runner last year. From a betting point of view I could not have him on my mind. This is not because I do not think he has a chance of winning as he certainly holds some sort of claim. The point with The New One is that he is such a popular horse that his current single figure price does not take account of the realistic possibility that he might not stay.

The New One does seem to be as good as ever judging by his five runs this season. Last time out Sam Twiston-Davies ensured that he made it a real test of stamina on heavy ground over two miles at Haydock. Being viewed as a stayer over two miles is a world away from lasting home over three miles. I hope he does have the requisite stamina for the trip as a victory for The New One in the Stayers’ Hurdle would be as popular as a win for Cue Card in the Ryanair Chase.”

Here the output reflects a more negative impression, although still giving an overall positive.

p_pos=0.9993 p_neg=0.000693

The p_neg has increased, showing that the algorithm was capable of saying that this text is more negative than Hugh's analysis.

Given that tipsters are never likely to issue a tip such as

“This horse is a dog, I would love to take a gun and shoot it rather than back it”

we can therefore expect high p_pos values and an overall positive categorisation, but with training, better and more accurate predictions may be forthcoming.

By the way, the output for the above was

Sentiment(classification='neg', p_pos=0.37991, p_neg=0.62008)

If you would like to see the code behind this and install instructions please leave a comment.

UPDATE Tuesday 6/3/18

Can this approach have any positive effect? Can it improve the bottom line to Hugh Taylor? Can it highlight which Taylor bets to lay once the mugs have almost done backing them at -20% of the advised price? How much more or less confident is Hugh with today's selections? The sentiment analysis on today's bets suggests that Hugh is indeed more bullish than he was about yesterday's loser. The analysis comes in at

p_pos =0.99999948 for Beaming

compared to yesterday

p_pos 0.9999749

and for Mister Music he comes in at

p_pos = 0.99999474

So he would appear more positive about Beaming than Mister Music, but more confident on both than yesterday's loser.

Perhaps with accumulated data averages can be derived which would enable a more accurate assessment. Of course an algorithm trained specifically on Hugh Taylor may be the best overall approach.


Horse Stride & Course Confirmation 5

Today's Lingfield Winter Derby poses more unknowns than knowns in terms of stride length and frequency. Many horses have not yet registered any data, which means we are guessing in terms of size, class etc. Is there anything we can glean from the data so far? Well, let's take a look.

One possible measure of class is stride length × strides per second. This number does not tell us how long the horse maintained it, but it may still be an indicator of class. If it is, then over time we should see some nice averages emerge across the various official class levels.
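As a trivial illustration of the calculation (the figures here are made up for the example):

def power_stride(peak_stride_length_ft, peak_strides_per_sec):
    # Peak stride length (feet) multiplied by peak strides per second
    return peak_stride_length_ft * peak_strides_per_sec

# Made-up figures: a 24.3ft peak stride at 2.43 strides per second
print(power_stride(24.3, 2.43))  # 59.05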

At the moment the average ‘power stride’, for want of a better term, for the conditions of today's race at Lingfield is 59.18. Class 2 has an average of 59.8, but this unexpectedly larger number may simply be due to the sample sizes so far.

How do today's horses measure up? The O'Brien favourite has no data and is taking a hike in class on official figures; if Ryan Moore were not on board, I am guessing he would not be favourite. The interesting horse is Utmost, a market mover into second place at around 9/2. He has a power stride of 58.4 when winning at Lingfield, which does not look good enough, but his previous defeat comes out at 60.1. If it were the other way round I might be more bullish, but for me he looks vulnerable, as I suspect the winner of this race will be better than the average of 59.18, or even 59.8. Utmost also did his figures in two races that were overall Even to Slow, and my feeling at this stage is that this would elevate the so-called power stride rating.

The other interesting reading is for Mr Owen, who comes in at 57.7, which is above the class 2 average of 57.4 (no class 1s yet at Wolves), but this was also done off a slow pace, which may well have boosted his power stride number.

Master the World does not look likely to figure, with a power stride of 58.19, well below class 1 or 2 for Lingfield. Again, his figure was from an Even to Slow race.

Suggestions? Well, not with confidence, but maybe a chance on Khalidi EW.


Horse Stride & Course Confirmation 4

Are female horses really smaller than male horses, and do larger horses (fewer strides per second) really carry weight better than smaller horses? If they do, then identifying them early will pinpoint horses with the ability to handle weight rises more efficiently. It would also seem that a modification of average stride lengths per distance will be needed. At the moment there is only AW racing, and although Southwell and its more demanding surface presents a spanner in the works, grass racing and its different goings will prove problematic unless I allow for them. You cannot compare a horse's strides per second on heavy ground with a strides per second average for all horses across all grounds.
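One simple allowance would be to score each run against the average for the same going and distance rather than a global average. A sketch with pandas, assuming hypothetical column names 'going', 'distance_f' and 'sps':

import pandas as pd

def going_adjusted_sps(df):
    # Score each run relative to runs on the same going and distance,
    # so heavy-ground figures are not judged against an all-grounds average
    grp = df.groupby(["going", "distance_f"])["sps"]
    df["sps_z"] = (df["sps"] - grp.transform("mean")) / grp.transform("std")
    return df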

This thread on horse size and stride data will continue on the SmarterSig email forum at


Horse Stride & Course Confirmation 3

A good big 'un will always beat a good little 'un. A commonly used phrase, but is there any truth in it when it comes to horse racing? Of course, as punters we are not interested in whether a big 'un will beat a little 'un; we are interested in whether the public's perception of horse size presents any betting possibilities. If big horses do beat small horses more often than they should, but the public overbet them to such an extent that smaller horses go off at too big a price, then I am with the small horses, even if they win less frequently. The one major overall edge we have is that the crowd is looking for winners whilst we are looking for profit.

The ATR stride data offers an opportunity to explore this question in a way that was not possible previously. Stride patterns will probably be fairly indicative of horse size and we can use this to check the performance of big v small.

The data has only been made public since the last month or so of 2017, so checking going forward has a long way to go before meaningful samples can be compiled, but we can look back at the overall ROI% of horses according to size, which may offer some clues. First of all, I have so far only been logging horses who have been placed in a race, in order to get an accurate handle on stride length and frequency. By definition these horses are successful animals and will unsurprisingly come out retrospectively as profitable animals overall.

Checking the PL of ‘large’ horses, or at least those more than one standard deviation below average on strides per second, revealed that historically they have had collectively 1848 runs and to BFSP logged a profit of +359 points. In other words, if we could go back in a time machine and bet them, we would have made an ROI of 19.4% before comm'.

By contrast, small animals (quicker stride patterns) have run 2792 times and logged a pre-comm' profit of +615 points, returning +22%.
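For the record, these ROI figures are simply profit over turnover at 1pt level stakes; a minimal helper (the data layout is my own):

def bfsp_roi(bets):
    # bets: list of (won, bfsp) tuples, 1pt level stakes to BFSP,
    # before commission
    pl = sum((bfsp - 1.0) if won else -1.0 for won, bfsp in bets)
    return pl, 100.0 * pl / len(bets)

# e.g. the large-horse sample: +359 pts over 1848 runs -> ROI of 19.4%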

It would seem, therefore, that from a betting perspective small horses have historically done slightly better than larger horses when examined from this pool of proven horses. The real acid test will be going forward, but that will take some time, and how and where these two horse types are best bet will be the intriguing question.

FOOTNOTE - The lack of feedback, I have to admit, is making me wonder whether I am wasting my time blogging. If you have any comments, pro or con, then please feel free.


Horse Stride & Course Confirmation 2

Some further thoughts on this subject after a few hours this morning examining the data. Looking at 850 horses that have run in 2018, registered a stride length/strides per second reading and finished placed, I first decided to examine stride length. Looking at the peak stride length per distance revealed the following averages (in feet):

5f = 24.559375
6f = 24.54989691
7f = 24.28295238
8f = 24.31840336
9f = 24.04942857
10f = 24.52785714
11f = 23.91888889
12f = 24.42592593

Not a huge difference across distances, so I decided to use the standard deviation of all the data, rather than by distance, so as to isolate longer-striding horses. Those with a stride greater than one STD above the mean could be categorised as long striders. I immediately ran into a problem here. The first on my list of long striders was a horse called Kelly's Dino.

He had a peak stride length of 27.4, but his other two registered runs were much lower. Given that I had only gathered data on the latest run, assuming that with placed horses the peak stride length would not vary too much, it was clear that if you changed the order of these runs, gathering the latest run would not have logged Kelly's Dino as long striding. He clearly is, as closer examination of each race shows that his PSL is greater than that of the other placed horses in those races. Other factors are clearly at play here: perhaps going, wind, pace etc. This immediately prompts a further thought: can this data be used to more accurately state the going? But let's park that for now.

The above problem might be solved by taking averages or examining all the placed horses in a race, but what if they are all long striding? It is likely then that taking relative readings would mean none of the long-striding placed horses would be picked up.

I decided to turn to strides per second. This seemed not to suffer from the same problem; Kelly's Dino's strides per second seemed fairly consistent even if his peak stride lengths varied. The first trap you can easily fall into as you get buried in one train of thought is that, now we are looking at SPS, we are looking for horses more than one STDEV below the average, not above it: long-striding horses make fewer strides per second. I forgot this initially and was a little disappointed when the first couple of ‘big’ horses I looked at had won at Epsom and Newmarket, both undulating tracks.

The other factor with SPS is that it is probably wise to calculate STDEVs per distance, as the averages vary much more than they do for stride length. With these figures now done for 5f horses, I checked a few to see if there were any obvious patterns before doing more automated and fuller analysis; manual checks can often highlight logical errors, as seen above.
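The flagging logic, sketched with pandas and assuming hypothetical columns 'distance_f' and 'sps':

import pandas as pd

def flag_long_striders(df):
    # Long striders take FEWER strides per second, so flag runs more
    # than one standard deviation BELOW the per-distance average
    g = df.groupby("distance_f")["sps"]
    cutoff = g.transform("mean") - g.transform("std")
    df["long_strider"] = df["sps"] < cutoff
    return df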

The first highlighted horse was Born for Prosecco

Having not won yet, it may be difficult to draw conclusions, but if this kind of analysis were to prove fruitful he could be the ideal type to keep an eye on, as he is fairly unexposed.

The second horse is Razin' Hell.

All of his 8 wins bar 1 have come at Southwell. It may be that long-striding horses do well at Southwell, but this is a very tentative suggestion at the moment.

You may want to take a look at a few yourself, so I will list some below. I also welcome all constructive comments, and if you would like to see more on this topic then leave a comment or a like.

Atletico (IRE)
Tommy G
Jack The Truth (IRE)
Temple Road (IRE)


Horse Stride Length & Course Confirmation 1


ATR have recently started publishing stride length data in their results for certain courses. This is in addition to the pace data they are publishing, and the potential for analysing the various factors could prove very illuminating and perhaps profitable. The demo result cited by ATR, shown by clicking below and then clicking the stride data tab, shows the various stride data for each runner.

My immediate thought was whether stride data, and its ability to show longer- or shorter-striding animals, might be a clue to suitability to various courses, and there is a hint in this one race that it might be the case.

The runner-up Surry Hope has a peak stride length of 28.3, which is higher than the 1st- or 3rd-placed horse, suggesting that he may be a long-striding animal. If this is the case he may be better suited by a flat or uphill track rather than Lingfield's downhill, undulating track. When I checked his back form this seemed to hold up.

He has been beaten twice at Lingfield, and also at undulating Newmarket, when fancied in all of those races, but has won at uphill Sandown and flat Kempton.

A sample of one for sure but certainly worthy of further investigation and data gathering.


Odds Based KNN

I have covered machine learning previously and illustrated some concepts via the Kth Nearest Neighbor algorithm. KNN is often used as a starter example algorithm, as it is easy to understand its underlying principles, even though Python's SKLearn machine learning package is going to do all the lifting for you. An inevitable question, however, is whether a simple algorithm like KNN can yield any favourable results. I found that it can in one particular area I have been playing around with.
I decided to look into whether backing my top-rated flat handicap ratings could be improved upon if I applied some sort of simple odds line to them. To do this I first organised my ratings so that a complete race was one line of data, with each predictor field being the difference between the top-rated horse and another horse in the race. As an example, here are a couple of sample lines.

diff1 diff2 diff3 diff4 diff5 diff6 diff7 diff8 diff9 diff10 diff11 diff12 diff13 diff14 diff15 finpos bfsp

0.147760456 0.187955293 0.18819821 0.197145752 0.197400238 0.350482684 0.412356588 99 99 99 99 99 99 99 99 1 4.69

0.163702041 0.179318767 0.257880117 0.428371173 0.464780245 99 99 99 99 99 99 99 99 99 99 0 5.87

As you can probably see, I decided initially to deal just with handicaps of up to 16 runners. On the first line of data, the second top-rated horse is 0.147760456 behind the top rated, the third top rated is 0.187955293 behind the top rated, and so on. If there are fewer than 16 runners, the remaining feature places are padded out with the number 99. The first line in the data was a winner at a BFSP of 4.69, whereas the second line of data did not win, at a BFSP of 5.87. The model was trained using the rating differences as inputs (not BFSP), with win/lose as the output.

I repeatedly split the data into 80%/20% partitions, training the model on the 80% and then predicting with the trained model on the 20%. I did this 20 times, each time making a different 80/20 split, which incidentally is chosen randomly from within the file. SKLearn, the Python machine learning library, does all this for you.

The model was a Kth nearest neighbor algorithm with the number of neighbors set to 20, and predict_proba was used for predicting, which means the algorithm predicts the probability of a given line being a winner. If you remember the details of KNN from the previous blog entry, you will recall that it does this by finding the 20 nearest matches to the line in question and then basing a prediction on these.
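A minimal sketch of that setup with SKLearn follows; X holds the 15 diff columns and y the win/lose flag, and the exact data handling is simplified:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_win_probs(X, y, n_repeats=20, k=20):
    # Repeated 80/20 splits; collects the predicted win probability
    # for each held-out race on each repeat
    results = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        p_win = model.predict_proba(X_te)[:, 1]  # column 1 = P(win)
        results.append((p_win, y_te))
    return results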

After using KNN I utilised a Random Forest, with n_estimators set to 100, to compare with the KNN. If KNN is the simpleton of the ML family, then I would hope to see some improvement with the Random Forest algorithm.
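Swapping in the Random Forest is a one-line change in the sketch above:

from sklearn.ensemble import RandomForestClassifier

# Replaces the KNeighborsClassifier line in the sketch above
model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)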

RESULTS for top rated using Probability odds

KNN Oddsline 38886 bets PL +4339 ROI +11.15% all after comm
All Top Rats 66765 bets PL +5435 ROI +8.14%

RF Oddsline 42270 bets PL +4356 ROI +10.3%
All Top Rats 69960 bets PL +5133 ROI +7.33%

The number of bets for all top rated varies between the two groups, I presume because the KNN creates more probabilities of zero, which are not considered in the analysis, since they would crash the program when dividing by zero to create odds for the horse.
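The zero-probability guard, and the general shape of such an odds-line filter (back only when the model's odds are shorter than BFSP, which is my assumption of the rule), look like this:

MIN_PROB = 1e-6  # treat zero or near-zero probabilities as no-bets

def model_odds(p_win):
    # Convert a predicted win probability to decimal odds; returns None
    # for the zero-probability lines that would otherwise crash
    return None if p_win < MIN_PROB else 1.0 / p_win

def is_oddsline_bet(p_win, bfsp):
    # Back only when the model rates the horse shorter than its BFSP
    odds = model_odds(p_win)
    return odds is not None and odds < bfsp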

Interestingly, the humble KNN did as well as, if not better than, the Random Forest, and both showed an improved ROI% over blindly backing the top rated.

The next step would perhaps be to check for the optimum setting of K (20 so far); in other words, how many nearest neighbors should it look at in order to derive a probability? Remember, nearest neighbor means nearest in similarity, not in physical location. A quick check with 40 gives:

KNN Oddsline 38884 bets PL +4496.86 ROI +11.56%
All top rats 66698 bets PL +5646.92 ROI +8.47%

And with 60:
Oddsline 41464 bets PL +5189.60 ROI +12.52%
All top rats 69962 bets PL +5479.05 ROI +7.83%

And with 80:
Oddsline 42170 bets PL +4724.56 ROI +11.20%
All top rats 69987 bets PL +5841.29 ROI +8.35%


Imbalanced Data in Machine Learning and Horse Racing

Imbalanced data is when you have a far greater number of one classification value than another. In other words, say you are predicting the existence of a disease from a number of input fields, and the existence of the disease, represented by a 1, amounts to only 10% of your data: then you have an imbalanced data set. Machine learning algorithms tend to be geared to finding the least error-prone way of predicting, and if predicting zero every time gives a 90% success rate, as in this case, the algorithm can end up simply predicting zero every time.

Hopefully this all sounds familiar, because in horse racing we tend to have imbalanced data: the number of winners may well be around the 10% mark. How do we get our machine learning algorithms to behave more sensibly with the data? The following link provides a pretty good explanation of some of the techniques that can be employed, and also shows why Random Forests are pretty good at avoiding this pitfall.
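As a sketch of two of the common remedies, in SKLearn terms class weighting and upsampling the minority class look like this (the column name 'win' is my own):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

# Remedy 1: weight the classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced")

# Remedy 2: upsample the ~10% of winners to match the losers
def upsample_winners(df, label="win"):
    winners, losers = df[df[label] == 1], df[df[label] == 0]
    winners_up = resample(winners, replace=True,
                          n_samples=len(losers), random_state=0)
    return pd.concat([losers, winners_up])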