
I was rummaging through some of my old SmartSig mags the other day, along with email exchanges from the old SmartSig email forum. It was nice to conjure up some of those old names from the early noughties. One email caught my eye and seemed equally relevant today, so I thought I would reproduce it here verbatim from 2005.

A couple of thoughts (as a practicing statistician) on your approach.

The process of selecting the best variable, then the next best variable to pair up with the first, and so on, can lead you to a local best fit rather than a global best fit. In other words, you end up finding the best fit given that you chose, say, age as your first variable, when in fact a better fit exists if you chose jockey strike rate first (even though age was a better predictor than jockey strike rate on its own).

For this reason it would be better to search all possible combinations rather than sequentially adding the next best variable. Of course this will add significantly to the computing time required!
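To make the local versus global point concrete, here is a minimal Python sketch comparing sequential (greedy) selection with a search over all combinations. The variable names and the fit scores in TOY_SCORES are invented purely for illustration; in practice the score function would be whatever measure of fit you use on your own database.

```python
from itertools import combinations

def forward_selection(variables, score, k):
    """Greedy: k times over, add the single variable that gives the best score."""
    chosen = []
    for _ in range(k):
        remaining = [v for v in variables if v not in chosen]
        best = max(remaining, key=lambda v: score(frozenset(chosen + [v])))
        chosen.append(best)
    return frozenset(chosen)

def exhaustive_search(variables, score, k):
    """Score every combination of k variables and keep the best one."""
    return max((frozenset(c) for c in combinations(variables, k)), key=score)

# Hypothetical fit scores (higher is better), made up for this example only.
TOY_SCORES = {
    frozenset(["age"]): 0.30,
    frozenset(["jockey_sr"]): 0.25,
    frozenset(["days_off"]): 0.10,
    frozenset(["age", "jockey_sr"]): 0.40,
    frozenset(["age", "days_off"]): 0.35,
    frozenset(["jockey_sr", "days_off"]): 0.55,
}

def toy_score(subset):
    return TOY_SCORES[subset]

VARIABLES = ["age", "jockey_sr", "days_off"]
greedy = forward_selection(VARIABLES, toy_score, 2)
best = exhaustive_search(VARIABLES, toy_score, 2)
print(sorted(greedy), toy_score(greedy))
print(sorted(best), toy_score(best))
```

With these made-up scores the greedy route starts with age (the best single variable) and is then stuck with a pair containing age, while the exhaustive search finds a better pair that leaves age out altogether.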

Also I feel that equal weighting of variables will reduce your fit (and consequently forecasting ability) quite significantly. The impact will depend on your variables and how they are calculated but will be exacerbated if you have variables with widely different means and standard deviations.

For instance, suppose you have position last time out as a factor (taking values 1, 2, 3 up to say 20), official ratings typically from 80 to 140, and forecast SP from 0.2 to 200. Giving equal weight to each (ie adding them together to produce a rating) would have a significant impact on your fit and on whether a variable was included, as each variable has its own very different impact on the resulting rating. Variable weights are vital to effectively scale the input variables to match what you are predicting.

To get round this I convert each of my factors to a strike rate so that everything has a range of 0 to 100 (ie historical strike rate of horses finishing 1st last time out, 2nd last time out etc). For variables like official rating the historical samples become too small to accurately calculate each point, so I calculate in groups (bins) (eg 70 to 79, 80 to 89 etc). I then fit a line to the groups to give me a decent interpolation in between groups.
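The binning and line fitting described above might look something like the following Python sketch. The function names, the bin width of 10, the use of an ordinary least squares line and the sample data are all assumptions made for illustration; the post does not say exactly how the line was fitted.

```python
from collections import defaultdict

def binned_strike_rates(runs, bin_width=10):
    """runs: (official_rating, won) pairs. Returns {bin_start: strike rate in %}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for rating, won in runs:
        start = (rating // bin_width) * bin_width
        totals[start] += 1
        wins[start] += 1 if won else 0
    return {b: 100.0 * wins[b] / totals[b] for b in sorted(totals)}

def fit_line(points):
    """Ordinary least squares line through (x, y) points: (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up sample: 10 runners in each of the 70s, 80s and 90s bins,
# with 1, 2 and 3 winners respectively.
runs = (
    [(70 + i, i < 1) for i in range(10)]
    + [(80 + i, i < 2) for i in range(10)]
    + [(90 + i, i < 3) for i in range(10)]
)
rates = binned_strike_rates(runs)

# Fit the line through the centre of each bin, then interpolate.
slope, intercept = fit_line([(start + 5, rate) for start, rate in rates.items()])
rate_at_88 = slope * 88 + intercept
print(rates, round(rate_at_88, 1))
```

The fitted line then gives a strike rate for any individual rating, rather than one step value per bin.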

Even then the adjusted variables need variable weights to produce the rating because they have different ranges. For instance, jockey strike rate over the last 14 days would be quite variable (0 to 100) whereas trainer strike rate over the last 12 months would be more stable (5 to 25). Jockey strike rate might not get into your model because those cases where it was 100 (a big increase in rating) did not result in a big increase in strike rate (or whatever you are trying to predict). However, weight jockey strike rate by say 0.1 (the range is now 0 to 10) and it may start to reflect more closely the variable you are trying to predict, and hence get into your model.

(Of course the input variables may not be linear in their impact – ie jockey strike rate may be important up to say 25% but anything in the range 26% to 100% adds nothing extra – but that is a whole new set of complications)
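One simple way to handle a factor that stops mattering above some level is to cap it before it goes into the rating. This is only a sketch of that one idea; the function name is hypothetical and the 25% default is just the illustrative figure from the paragraph above, not a tested threshold.

```python
def capped(value, cap=25.0):
    """Treat anything above the cap as if it were exactly at the cap."""
    return min(value, cap)

jockey_strike_rates = [0.0, 10.0, 25.0, 60.0, 100.0]
capped_rates = [capped(sr) for sr in jockey_strike_rates]
print(capped_rates)
```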

Most statistical modelling techniques (typically regression of some sort) derive from the need to find a set of weights that give you the global best fit, so overcoming both the above issues. The techniques are elegant in that sense and were developed at a time when computers did not exist and it was impossible to crunch through 1000s of alternatives.

Putting your database through a regression of some sort would, I feel, save you a lot of time and guarantee the best possible fit. It is effectively a short cut through the number crunching you are doing, with a guarantee of finding the best possible answers.
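As a rough illustration of a regression finding the weights directly, here is a small ordinary least squares sketch in plain Python. Ordinary least squares is only one example of "a regression of some sort" (the post does not name a technique), and the data, the function names and the "true" weights of 0.1 and 1.0 are all made up for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares_weights(rows, targets):
    """Return [intercept, w1, w2, ...] minimising the sum of squared errors."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)

# Made-up rows of (jockey strike rate 0 to 100, trainer strike rate 5 to 25).
rows = [(0.0, 5.0), (100.0, 10.0), (50.0, 25.0), (20.0, 15.0), (80.0, 20.0)]
# Made-up target built with weights 0.1 (jockey) and 1.0 (trainer) plus 2.0.
targets = [2.0 + 0.1 * j + 1.0 * t for j, t in rows]

weights = least_squares_weights(rows, targets)
print([round(w, 3) for w in weights])
```

On this artificial data the regression recovers the weights in one pass, including the smaller weight on the wider-ranging jockey strike rate, with no trial and error over candidate weights.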

A couple of other things I have learnt in generating my own ratings.

My aim is profit, but in developing ratings I felt I had to focus on strike rate for reasons other people have discussed (ie focus on LSP, level stakes profit, and the model will be seriously affected by one or two 100/1 winners). By focusing on strike rate your prediction of horses coming 2nd, 3rd etc should improve.

I specifically leave out forecast SP, even though this is the best single factor for predicting strike rate, simply because it reflects the crowd, and in trying to find an edge it is better not to follow the crowd.

I find it is also important to include a way of filtering out extremes and dealing with missing values, eg a jockey who has won 1 from 1, and so has a 100% strike rate, may need pulling back to 25%.

In this respect there is nothing better than looking at graphs, charts and tables of all your variables to examine what is extreme and whether there are any non-linear elements to them.
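One common way of pulling back extreme strike rates from tiny samples, and of giving a sensible value when there is no history at all, is to blend the observed record with a typical rate. The sketch below is an assumption about how that might be done, not the method used in the post; the function name, the typical rate of 10% and the prior weight of 5 rides are hypothetical values chosen so that 1 from 1 comes out at the 25% mentioned above.

```python
def shrunk_strike_rate(wins, runs, prior_rate=10.0, prior_runs=5):
    """Strike rate in %, pulled towards prior_rate when runs is small.

    Acts as if the jockey had prior_runs extra rides at the typical rate,
    so a jockey with no rides at all simply gets prior_rate.
    """
    return (100.0 * wins + prior_rate * prior_runs) / (runs + prior_runs)

one_from_one = shrunk_strike_rate(1, 1)        # raw strike rate is 100%
no_history = shrunk_strike_rate(0, 0)          # missing value
big_sample = shrunk_strike_rate(20, 100)       # raw strike rate is 20%
print(one_from_one, no_history, round(big_sample, 1))
```

With plenty of rides the adjustment barely moves the figure, so only the small samples are pulled back.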

Many thanks to Paul Dyson for this post