Tags

A big problem model builders have is covariate shift, it sounds complicated but its really simple and I am going to explain it via a horse racing example that will be familiar to all of you. The days since a horse last run is a common data feature in a horse racing data set whether you are a system builder or a model builder. Either way you are hoping that the way in which horses are generally campaigned this year is mirrored in the way they have been campaigned in previous years. I mean if I told you that next year no horse could run without having at least 3 weeks off you would be pretty worried about how this would effect your way of betting regardless of what approach you take.

To give this a fancy term we would say that the distribution of the data item days since last run has changed, in other word we have covariate shift. Now this is worrying if you built a model on the assumption of certain data distributions only to find that they have changed drastically. Such a situation occurred in 2020 when a whole industry had to shut up shop and no horse was running for weeks.

In MySportsAI you can now check for drift between the data your training a model on and the test data you are testing it on. Lets demonstrate with a simple model using only days since a horse last ran.

The above was trained and tested on handicap data for 2011 to 2015, the first 80% being the training data and the latter 20% being the test data. The TTDrift value shows a measure of data drift, close to 0.5 is quite good. Now lets see how it drifts when I train on 2011 to 2015 and then test on 2016/17

The drift is a little more pronounced and the variable stake profit on the top 3 rated has dropped

Now finally lets check testing on 2020

The TTDrift has not surprisingly taken a jump and top 3 VROI is poor worse than you would get backing all horses.

Although the change in conditions here were forced upon us in other cases drift can occur for reasons that are not as easy to determine but at least you can now keep an eye on it. By the way although the drift on the above came down in 2021 and 2022 it is still not returned to 2011/15 levels. If you are interested in the mechanism behind the calculations here is a link

https://www.kdnuggets.com/2018/06/how-dissimilar-train-test-data.html

Advertisement