Knowing when to bet can be an invaluable tool in the bettor's armoury. The extra percentage points that good timing can add to one's bottom line can be significant. Of course we are never going to get it right every time, but more often than not will certainly do.

Knowing whether a price is going to move in or out has certainly had plenty of coverage inside and outside of betting. Momentum, weight of money, ARIMA: the list goes on. Here I am going to look at the simple relationship between Betfair 10am prices and bookmaker prices, to see if there is any relationship that predicts whether the Betfair 11.30am price will be shorter or longer than the 10am price.

To keep things simple I will only use races with the same number of runners at 11.30am as at 10am, in order to avoid the price-deduction problem. I am also going to see whether various machine learning algorithms perform differently, with a view to finding out if one stands out as top dog.
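The race filter described above can be sketched in pandas. The column names here (`RaceID`, `Runners10am`, `Runners1130`) are assumptions for illustration, not the article's actual schema:

```python
import pandas as pd

# Toy data: race 2 loses a runner between 10am and 11.30am (assumed columns)
races = pd.DataFrame({
    "RaceID":      [1, 1, 2, 2, 2],
    "Runners10am": [2, 2, 3, 3, 3],
    "Runners1130": [2, 2, 2, 2, 2],
})

# Keep only rows from races whose field size is unchanged,
# avoiding the price-deduction problem
same_size = races["Runners10am"] == races["Runners1130"]
races = races[same_size]
print(races["RaceID"].unique())  # only race 1 survives
```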

I am going to be using Python and the scikit-learn (sklearn) library. Here is the program.

# import numpy and pandas
import numpy as np
import pandas as pd

# Read in the data file into a pandas dataframe called dataset
dataset = pd.read_csv("ammlprices.csv")

# Take a look at the first few rows of the dataset
print(dataset.head())

# The above produces the following output:
Horse 888sport Bet 365 Bet Victor Betfred Black Type \
0 Appenzeller 0.971429 0.971429 0.971429 0.971429 0.971429
1 Perfect Refuge 1.000000 1.000000 1.000000 1.083333 1.083333
2 Foresee 1.000000 1.000000 1.133333 1.133333 1.307692
3 Il Sicario 1.100000 1.000000 1.000000 1.000000 1.000000
4 Tesorina 1.240000 1.240000 1.240000 1.240000 1.240000

Coral Ladbrokes Marathon Bet Paddy Power William Hill DriftDrop
0 1.046154 0.971429 0.971429 0.971429 1.046154 0.0
1 1.083333 1.083333 1.000000 1.000000 1.083333 0.0
2 1.307692 1.307692 1.000000 1.000000 1.133333 0.0
3 1.000000 1.222222 1.000000 1.000000 1.000000 1.0
4 1.240000 1.240000 1.127273 1.240000 1.240000 0.0

The above shows the book ratios of different bookmakers' odds to Betfair odds at 10am, and the final column shows whether the Betfair price dropped or drifted by 11.30am. There are 28,264 rows of data before NAs are dropped (see below).
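How such a row might be constructed can be sketched as below. The raw column names, the example odds, and the assumption that `DriftDrop = 1` marks a price that shortened (dropped) by 11.30am are all illustrative guesses, not taken from the article:

```python
import pandas as pd

# Toy raw data: one horse's 10am bookmaker price plus Betfair prices
# at 10am and 11.30am (all values and column names assumed)
raw = pd.DataFrame({
    "Horse":          ["Appenzeller"],
    "Bet365_10am":    [3.40],
    "Betfair_10am":   [3.50],
    "Betfair_1130am": [3.25],
})

# Book ratio: bookmaker odds relative to the Betfair 10am odds
raw["Bet 365"] = raw["Bet365_10am"] / raw["Betfair_10am"]

# Label: 1.0 if the Betfair price shortened by 11.30am, else 0.0
# (assumed encoding of DriftDrop)
raw["DriftDrop"] = (raw["Betfair_1130am"] < raw["Betfair_10am"]).astype(float)

print(raw[["Horse", "Bet 365", "DriftDrop"]])
```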

# drop any rows that have null or missing values in them
dataset = dataset.dropna()

# Separate the data features into X and the prediction target (DriftDrop) into y
X = dataset.iloc[:,1:-1]
y = dataset.iloc[:, -1].values

# Find the baseline performance, i.e. if we simply predicted all zeros or all ones
print (y.mean())
0.4787

print(1 - y.mean())
0.5212

So the baseline is 0.5212, or 52.12%; we need to beat this.
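The same baseline can be obtained with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The class counts below simply recreate the article's proportions; this is a sketch, not part of the original program:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Recreate the article's class balance: ~52.12% zeros, ~47.88% ones
y = np.array([0] * 5212 + [1] * 4788)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# Always predict the majority class (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.5212
```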

# Run the data through a selection of algorithms to see which performs best

import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('GDB', GradientBoostingClassifier()))

seed = 7
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # shuffle=True is required when passing a random_state to KFold
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold,
                                                 scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print(f"{name}: {cv_results.mean():.4f} ({cv_results.std():.4f})")

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The above produces the following box plot

[Box plot: Algorithm Comparison, cross-validation accuracy per algorithm]

We can see from the above that the GradientBoostingClassifier performed best with a score of 0.572, only just ahead of LDA.

A gain of roughly 5 percentage points over the baseline is not too bad without much thought given to data representation.

Feature importance tells us which features had the greatest say in the predictions. For the GradientBoosting algorithm they were

[0.06603342 0.05788538 0.36951438 0.08592476 0.03415262 0.03363296
0.03779182 0.09169309 0.04229201 0.18107956]
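Importances like those above can be read from a fitted booster's `feature_importances_` attribute. The sketch below uses random stand-in data in place of the article's `X` and `y`; the column names echo the bookmaker columns shown earlier:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data: 200 rows of random "book ratios" for the ten bookmakers
rng = np.random.default_rng(0)
cols = ["888sport", "Bet 365", "Bet Victor", "Betfred", "Black Type",
        "Coral", "Ladbrokes", "Marathon Bet", "Paddy Power", "William Hill"]
X = pd.DataFrame(rng.random((200, 10)), columns=cols)
y = rng.integers(0, 2, 200)

# Fit the booster and print one importance per feature
gdb = GradientBoostingClassifier().fit(X, y)
for col, imp in zip(X.columns, gdb.feature_importances_):
    print(f"{col}: {imp:.4f}")
```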

The third and last features were the most significant, i.e. Bet Victor and William Hill.

Book ratios may or may not be the best representation, and if you have thoughts on this or any other aspect, please feel free to comment.
