Web Scraping With Selenium 4

In the last session we learned how to log in, using Selenium to populate a form and submit it. The next step in my task is to fill in this upload page, so that I have a program that does all of this without my having to visit the website manually.

We have two items to populate here: the name of the csv file to upload from my drive and the upload password. The values of these items were read in from my txt file in the first part of the code. I also want to make sure that if more than one file is specified for uploading, the program will handle multiple files.

Here is the final piece of code I need to add to my program

The first line loops through each entry of the list we populated from smartpass.txt, but it starts from index 3 (i.e. the 4th line, as the first line is index 0). If there is only one file to upload then this loop will only iterate once. The body of the loop chops the trailing newline off the file name and then loads the page shown above. The rest of the code is much the same as before, once we have manually searched the HTML source of the page and identified the input box names.
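The loop described above is not reproduced in the text, so here is a minimal sketch of how it might look, written against the Selenium 4 locator API. The helper names, the upload_url parameter and the input box names uploadfile and uploadpass are placeholders of mine, not the real names from the SmarterSig page; substitute whatever you find in the page source. It also assumes the file box is a standard file input, which accepts the full path via send_keys.

```python
def files_to_upload(lines):
    """Return the file paths listed from the 4th line (index 3) onwards."""
    paths = [line.rstrip("\r\n") for line in lines[3:]]
    return [path for path in paths if path]  # ignore blank lines


def upload_files(driver, upload_url, paths, upload_password):
    """Fill in and submit the upload form once per file."""
    # Imported here so files_to_upload() can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    for path in paths:
        driver.get(upload_url)
        # Placeholder box names: replace with the names found in the page source.
        file_box = driver.find_element(By.NAME, "uploadfile")
        pass_box = driver.find_element(By.NAME, "uploadpass")
        file_box.send_keys(path)             # assumes an <input type="file"> box
        pass_box.send_keys(upload_password)
        pass_box.submit()                    # submits the form the box belongs to
```

With the list read from smartpass.txt in uploadData and the upload password in womPass, the call would be upload_files(driver, upload_url, files_to_upload(uploadData), womPass).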

We have touched on some ideas behind accessing forms and logging into pages using Selenium. Hopefully you found this useful; let me know via the ratings below.

Web Scraping With Selenium 3

In this session we are going to write the code to log in to the SmarterSig web site. To do this we will need to know the names of the user id and password boxes. If we right click on the home page and select ‘view page source’ we should be able to search for ‘login’ and find the following section of HTML code

The names of the two input boxes are, unsurprisingly, userid and password

With this information we can now get Selenium to load up this web page in a browser on our screen, populate the two boxes with our userid and password, and then submit them. Here is the total program to date

import selenium
from selenium import webdriver
import time

uploadData = []

# input the access data

f = open('smartpass.txt', 'r')
uploadData = f.readlines()
f.close()
user = uploadData[0]
passw = uploadData[1]

womPass = uploadData[2]

# trim the trailing newline from each value
user = user[:-1]
passw = passw[:-1]

womPass = womPass[:-1]

# raw string so the backslashes in the Windows path are taken literally
PATH = r"C:\Program Files (x86)\chromedriver.exe"

# Selenium 3 style calls: later Selenium 4 releases need Service(PATH)
# and find_element(By.NAME, ...) instead
driver = webdriver.Chrome(PATH)

driver.get('http://www.smartersig.com')

# locate and fill userid and password input boxes

idBox = driver.find_element_by_name('userid')

passBox = driver.find_element_by_name('password')

idBox.send_keys(user)
passBox.send_keys(passw)

# submit the form

driver.find_element_by_name('password').submit()

# pause and then quit the browser

time.sleep(5)
driver.quit()

The final couple of lines are simply there to let me see the logged-in screen before quitting the browser. There are a number of ways to submit the userid and password form. Submitting one of the fields, in my case the password field, is just one of them.
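The listing above uses the Selenium 3 style calls that were current when it was written. Later Selenium 4 releases removed find_element_by_name and the positional driver path, so on a recent install the same steps might be sketched as follows. The function names make_driver, login and run are my own; the box names userid and password come from the page source described earlier.

```python
import time


def make_driver(driver_path=None):
    """Start Chrome, optionally pointing Selenium at a specific chromedriver."""
    # Imported inside the functions so this file can be loaded (and login()
    # tried with a stand-in driver) even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path:
        return webdriver.Chrome(service=Service(driver_path))
    # Selenium 4.6 and later can locate a matching driver by itself.
    return webdriver.Chrome()


def login(driver, url, user, passw):
    """Load the page, fill the two named boxes and submit the form."""
    from selenium.webdriver.common.by import By

    driver.get(url)
    driver.find_element(By.NAME, "userid").send_keys(user)
    pass_box = driver.find_element(By.NAME, "password")
    pass_box.send_keys(passw)
    pass_box.submit()  # submitting a field submits the form it belongs to


def run(user, passw, driver_path=None, pause=5):
    """Log in, pause so the logged-in page can be seen, then quit the browser."""
    driver = make_driver(driver_path)
    try:
        login(driver, "http://www.smartersig.com", user, passw)
        time.sleep(pause)
    finally:
        driver.quit()
```

Calling run(user, passw, r"C:\Program Files (x86)\chromedriver.exe") would mirror the original program; leaving the path out relies on Selenium finding a driver on its own.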

In the next section we will look at the next page we need to access now we are logged in and how to upload a file.

Web Scraping With Selenium 2

We are going to access some web pages that require form input on the SmarterSig web site. The first page will be the home page, where we need to log in

The second page we will be interacting with once we have successfully logged in will be the following

Clearly we need the code to have access to the following pieces of data in order to carry this out

Your SmarterSig userid (login)

Your SmarterSig password

The password for uploading

The full path name to the location on your machine of the file you wish to upload

We create the above in a file I call smartpass.txt with each line containing one of the above.

Now we are ready to code the program

First few lines of our program autoupload.py

import selenium
from selenium import webdriver

uploadData = []

f = open('smartpass.txt', 'r')
uploadData = f.readlines()
user = uploadData[0]
passw = uploadData[1]

womPass = uploadData[2]
user = user[:-1]
passw = passw[:-1]

womPass = womPass[:-1]

The first two lines import the libraries we will need

The third line declares a list to hold the lines of information in the smartpass.txt file once we have read them in.

The 4th and 5th lines open the file for reading and read all the lines into the list so we can access them in the program.

The remaining lines take care of two things. First, they place the data held in the list into individual variables. I prefer to do this so that each element has a more meaningful name. Second, each data item needs its last character removed, because readlines() keeps the newline character at the end of every line.
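One caveat with slicing off the last character: if the final line of the file has no trailing newline, [:-1] removes a real character instead. A small sketch of a more forgiving alternative, using a hypothetical helper name clean_lines and made-up sample values, strips only line-ending characters:

```python
def clean_lines(lines):
    """Strip trailing newline / carriage-return characters and nothing else."""
    return [line.rstrip("\r\n") for line in lines]


# Made-up sample standing in for f.readlines(); the last line has no newline.
sample = ["myuserid\n", "mypassword\n", "myuploadpass"]

user, passw, womPass = clean_lines(sample)[:3]

# For comparison, slicing the last line loses its final character.
sliced = sample[2][:-1]
```

Here womPass keeps its full value, while sliced has lost its final letter.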

We will come to the business of handling the file name we want to upload later, but bear in mind that we want the code to handle multiple files (i.e. multiple lines in the smartpass.txt file) should you want to upload more than one file.

OK we will dig a bit deeper in the next session

Web Scraping With Selenium

I have posted previously on web scraping data for betting purposes, but I have not gone into how we can access pages that require form filling. A typical example might be logging into a site before you can access data, or submitting information via a form. This is the first in a short series on how to use Python and the Selenium library to carry out these functions.

First some prep work for you. Selenium works by literally popping a web browser up on your screen, under control of the program you write, and then filling in form boxes and submitting the form so that the browser moves on to another web page. When your program is running it will feel like someone else is controlling your computer, but do not worry, it’s you, or should I say the program you are about to write.

First things first though. I will be demonstrating code that will be manipulating a Chrome browser, more specifically Version 87.0.4280.141 (Official Build) (64-bit).

You might want to download Chrome if you have not already got it installed. You can change your code to handle other browsers but in the blog I will be handling Chrome.

The next thing you need to do is download the Chromedriver.exe utility. You can download it from

https://chromedriver.chromium.org/downloads

Make sure you get the version for your Chrome web browser. If you are not sure what version of the Chrome browser you are using, click the three dots in the top right corner of your Chrome browser and then select Help followed by About Google Chrome.

Once you have downloaded chromedriver.exe, save it in the following folder (assuming you are using Windows)

C:\Program Files (x86)

So you now have

C:\Program Files (x86)\chromedriver.exe

The final step of this set up procedure is to install Selenium. I am assuming you already have Python installed. To install Selenium, type the following at the Windows command prompt

pip install selenium
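A quick way to confirm the install worked is to ask Python which version it can see. This is a small sketch using only the standard library; installed_version is a hypothetical helper name of mine.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("selenium:", installed_version("selenium") or "not installed")
```

If this prints "not installed", pip has probably installed into a different Python from the one you are running.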

OK, that’s the end of the first session. If you have any problems or comments, please comment below. In the next session we will start on some web browsing under program control.

Genetic Algorithm V Gradient Boosting

I have been playing around with a Python library called PyGAD. It is a Genetic Algorithm (GA) library that lets you build a Genetic Algorithm approach to Machine Learning from within Python. I have mentioned before how useful Gradient Boosting (GB) is for racing data, due to its ability to handle imbalanced data (i.e. far more losers in the data set than winners), and it also copes quite well with data that has not been normalized in any way. However, one disadvantage is that when you train a model using GB or any other Machine Learning algorithm you are essentially training the model to find winners rather than profit. To illustrate what I mean by this, imagine a ridiculous example where we doubled the price of every horse above 20/1, and also imagine we included the Betfair SP price in the data set we are training on. You would like to think that the model would spot that longer priced horses are the route to profit, but it won’t, simply because they still win less often than 10/1 shots, which win less often than 8/1 shots, which win less often than 6/1 shots, and so on. The algorithm will latch onto the fact that nothing predicts winners better than BFSP and it will focus on BFSP. This is why you should never include BFSP as an input feature.

One way around this is to create a custom loss function which forces the algorithm to train for profit rather than winners. This is easy to do in Python and PyGAD, so I set about investigating whether a GA trained to find profit, where profit was calculated to variable stakes, would outperform a GB which is trained to find winners.
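To make the idea of training for profit concrete, here is a minimal sketch of a fitness function that scores a set of weights by the variable stakes return on the top rated horse in each race. It is plain Python and does not reproduce the code used for the results below; the data layout, the linear scoring and the name profit_fitness are all assumptions of mine. A GA library such as PyGAD maximises whatever number its fitness function returns, so something like this would sit inside a thin wrapper matching the fitness signature of the installed version.

```python
def profit_fitness(weights, races, commission=0.02):
    """ROI (profit / total staked) from backing the top rated runner per race.

    races: list of races, each a list of (features, decimal_odds, won) tuples.
    Stakes are variable: each bet is sized to win 1 point.
    """
    total_profit = 0.0
    total_staked = 0.0
    for runners in races:
        # Rate every runner as a weighted sum of its features; back the best.
        top = max(runners, key=lambda r: sum(w * x for w, x in zip(weights, r[0])))
        _, odds, won = top
        stake = 1.0 / (odds - 1.0)           # stake needed to win 1 point
        total_staked += stake
        if won:
            total_profit += 1.0 - commission  # 1 point won, less commission
        else:
            total_profit -= stake
    return total_profit / total_staked if total_staked else 0.0


# One illustrative race with two runners and two made-up features each.
example_races = [[([1.0, 0.0], 3.0, True), ([0.0, 1.0], 2.0, False)]]
roi_a = profit_fitness([1.0, 0.0], example_races)  # backs the 3.0 winner
roi_b = profit_fitness([0.0, 1.0], example_races)  # backs the 2.0 loser
```

The GA would then search for the weights that make this number as large as possible on the training years.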

I trained both approaches on data from 2011 to 2017. The data was the data I use for my model submission to the Wisdom of Models. The data for the GA model was normalized; instinct told me this would be the better option, although I should test it without as well. The test data was 2018 to 2019.

Results

The GB model made a ROI% to variable stakes on top rated horses of 2.68% after commission

The GA model made a ROI% of 0.30%

To my surprise the GB model solidly outperformed the GA model even though the GA model was trained to produce profit.

Overview of Genetic Algorithms

Variable Stakes V Flat Stakes

There is plenty of ‘expert’ advice out there telling you not to be afraid to have more on when it’s a big price. What they fail to realise is that 98% or maybe 99% of punters lose money overall, and putting more on longer priced selections is more likely to produce further losses on an already negative bottom line.

To look at whether the above statement is true I checked out all the flat handicap selections for the following tipsters during 2020.

Robin Goodfellow Daily Mail
Newsboy Daily Mirror
Rob Wright Times
Templegate The Sun
The Scout Daily Express
Marlborough Telegraph

I calculated their profit or loss to Betfair SP minus 2% commission (available to anyone if you ask). Initially I calculated flat £1 stakes and then variable stakes. If you are not familiar with variable stakes it simply means bet to win £1 so at 2.0 (even money) you bet £1, at 3.0 (2/1) you bet 50p and so on.
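The two staking plans can be sketched in a few lines. The function names and the three example bets are mine and purely illustrative; the 2% commission is taken off winnings only.

```python
def variable_stake(decimal_odds, target_win=1.0):
    """Stake needed to win target_win at the given decimal odds."""
    return target_win / (decimal_odds - 1.0)


def bet_profit(decimal_odds, won, stake, commission=0.02):
    """Profit or loss on one bet, with commission charged on winnings."""
    if won:
        return stake * (decimal_odds - 1.0) * (1.0 - commission)
    return -stake


def profit_and_roi(bets, variable=False):
    """bets: list of (decimal_odds, won). Returns (profit, ROI in percent)."""
    profit = staked = 0.0
    for odds, won in bets:
        stake = variable_stake(odds) if variable else 1.0
        staked += stake
        profit += bet_profit(odds, won, stake)
    return profit, 100.0 * profit / staked


# Illustrative bets only: an even money winner, a 2/1 loser and a 20/1 loser.
example_bets = [(2.0, True), (3.0, False), (21.0, False)]
flat_pl, flat_roi = profit_and_roi(example_bets)
var_pl, var_roi = profit_and_roi(example_bets, variable=True)
```

Note that ROI under variable stakes is divided by the total actually staked, not by the number of bets, because the stake changes from bet to bet.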

The results were interesting and are as follows

Tipster            Bets   PL        ROI      VPL      VROI
Robin Goodfellow   3065   +16.49    +0.53%   -0.48    -0.048%
Newsboy            2991   +90.38    +3.02%   +20.27   +2.2%
Rob Wright         2975   +141.55   +4.75%   +56.8    +6.8%
Templegate         2998   -255.5    -8.52%   -6.98    -0.7%
The Scout          2873   -45.05    -1.56%   -5.5     -0.74%
Marlborough        2860   +50.1     +1.75%   -5.7     -0.63%

Warning – I have no evidence that this pecking order will be maintained next year.

Overall we have

Overall Bets 17,752 PL + 1.9 ROI 0%
Overall VPL +62.35

You would have been better off with variable stakes overall and with individual tipsters.

Why don’t punters embrace variable stakes? Probably because of the lack of information, but also because having 2 points to win 1 on a 1/2 shot that then loses wreaks more emotional turmoil than a loser at 20/1.

What the above also shows is just how much better off you are at Betfair or an exchange of your choice. None of these tipsters would be profitable to industry SP.

Stacking Ensembles Further Thoughts

How many books have you read on Pace handicapping? How about pedigree handicapping? Then of course there is conditioning handicapping, speed handicapping, the list goes on. By the way, excuse the USA terminology; over there handicapping simply means studying racing and making bets.

When we are model building using Machine Learning there is a tendency to shove factors from various handicapping camps into one data set and ask an ML algorithm to sort it out. Perhaps there is a danger here; trees and wood spring to mind. What I decided to do was try an ensemble approach to the problem. I created four models based on four individual areas of handicapping. All were very simple models in terms of number of input features. However, before I did this I threw them all into the mix and asked a GBM algorithm to produce a model after training on 2011 to 2015 data in MySportsAI and then predict on data from 2016 to 2017. This would be my baseline, and it produced the following results.

Top ranked bets Variable stakes ROI% +3.78%

Second top Variable stakes ROI% -0.56%

Third top ROI% -2.22%

The next step was to build four individual simple models using the four areas and predict on 2016 to 2017 for each model. I then merged all the rankings of horses from the four models into one Excel sheet and, for each horse in each race, summed the rankings produced by each model. So for example

Mill Reef Model 1 ranked 1, model 2 ranked 3, model 3 ranked 4, model 4 ranked 1 = 9

The horse with the lowest summed ranking is the top ranked horse in a race based on the rankings of all the models.
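The same summing can be done in a few lines of Python instead of a spreadsheet. In this sketch the function name is mine, and the ranks for Horse B and Horse C are made up to sit alongside the Mill Reef example.

```python
def ensemble_ranking(model_ranks):
    """Sum each horse's ranks across models and sort, lowest total first.

    model_ranks: dict mapping horse name -> list of ranks, one per model.
    Horses with equal totals keep the order they were given in.
    """
    totals = {horse: sum(ranks) for horse, ranks in model_ranks.items()}
    return sorted(totals.items(), key=lambda item: item[1])


race = {
    "Mill Reef": [1, 3, 4, 1],   # from the example: 1 + 3 + 4 + 1 = 9
    "Horse B": [2, 1, 1, 3],     # hypothetical runner
    "Horse C": [3, 2, 2, 4],     # hypothetical runner
}
ranking = ensemble_ranking(race)
top_ranked = ranking[0][0]
```

How ties are broken is a design choice; this version simply leaves tied horses in their original order.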

So how did the ensemble ranking model perform?

When I calculated the top ranked I was a tad disappointed; there was virtually no change

ROI% +3.7%

But the 2nd and 3rd ranked did show considerable improvement

2nd rank +1.97%

3rd rank +2.98%

Members of the MySportsAI email group have been submitting their ratings since September 1st, and I can report that the ensemble results have been better than this.

The Wisdom of Models

The wisdom of crowds has been applied to many avenues of prediction: stock markets, Oscar night, elections and of course sports betting. The general idea is that if you combine a collection of people’s predictions, which may individually be average in quality, you can find that aggregating them in some way produces a set of final predictions that is better than the individual predictions. This was first observed at a fairground where the public were asked to guess the weight of an ox. One participant found that the average of everyone’s guesses won him the prize. Not sure if he took home an ox, but you get the idea.
A similar approach appears in the literature of Machine Learning, where it goes by the title of Ensemble modelling. The idea is the same in that the predictions of several models are somehow combined to produce a single set of predictions.
As with human predictions, diversity is the key. It works best when the people or models are coming at the problem from very different viewpoints. For example in a horse racing context perhaps one model has race times as the core of its modelling whilst another is more class based. There is certainly evidence within the Machine Learning world that ensemble modelling can outperform single models and of course the ensemble can extend to different algorithms rather than just different model inputs. You could create an ensemble based on a Tree based algorithm like Gradient Boosting along with a regression algorithm and a Neural network. They may all have the same inputs but different approaches to creating the model.
We are running an experiment at MySportsAI at the moment. Some members are putting forward their model ratings each day and using a simple ranking aggregation and tracking the top 3 ranked we are checking how the wisdom of models performs.
So far in handicaps for 3yo and 3yo+ the collaboration has produced to BFSP after commission

206 Bets PL + 19.1 pts ROI +9.27

Early days but an interesting start.

There are two other interesting facets to this approach. First of all, no one’s proprietary models are compromised; the inner workings of one’s model are kept undisclosed. The second benefit of this collaboration is mutual support. We all know that betting can be a lonely business. Crowd betting offers buddy support, something we all need when results are going poorly.

You can join MySportsAI at http://www.smartersig.com/mysportsai.php

Derby 2020

I never like to read journalists slagging off the public for taking an interest in the sport that provides them with a comfortable living. For sure those on Twitter can overreact, just as most journalists underreact when faced with a topic that may threaten their job prospects or their ability to get a chummy weighing room interview. Nicholas Godfrey penned such a piece on this year’s Derby, where he opens up with a few snide remarks about social media, something he clearly feels is beneath him. Luckily Nicholas looks like he is approaching retirement, which is just as well because that tacky social media he refers to is probably going to replace him, and it’s certainly the place where, with a bit of pruning, the more intelligent analysis is taking place, or at least the synopsis is.
I would not have objected if Nicholas had offered any original insight into the race but sectionals are clearly a bit too technical for him along with social media so I will try and add a bit of extra analysis here, above and beyond what has already been said.
I too had looked at the Oaks and the Derby in terms of sectionals and posted on Twitter that Love would have mowed down Serpentine race for race, and I posted this 20 mins after the Derby. One of those knee jerk reactions that Nicholas is so dismissive of. Later, however, I decided to look for a Derby that was run in a similar overall time and, most importantly, on the same official going. The race I homed in on was Authorized’s Derby. Taking timings to the path entering the straight, Authorized hit the road in 1:53.58, whilst the most prominent horse with any chance of winning according to the betting in 2020 hit the road in 1:56.22. That is a huge difference and one that is quite probably impossible to overcome. It is fair to say that Serpentine’s jockey got the fractions right, but 15 other top class jockeys seemingly had no idea whether the pace they were setting was correct or not. A road that most journalists will never hit.
I have not done the middle section timings but my visual guess is that the race was lost in the middle section where it appeared to slow. Any middle or long distance runner will tell you not to make ground in the teeth of a race, unless they have gone ridiculously fast, rather to gain ground in the cheap or slow section. Nobody bothered to do this and hence the race was lost. I may be wrong on this last part, only the times will tell.
The big question now is, will you back or lay Serpentine when he next runs in a G1, or will you join Nicholas on the fence? Let me know in the comments below.

Stacking Ensembles for Horse racing

Imagine you had x mates, all experts in a certain field of betting on horse racing. One was an expert on breeding, another was very knowledgeable about draw bias, a third shit hot on trainer jockey combos. I could go on, and the topic of expertise does not really matter. The main point is, how would you want to synergise their opinions into a race selection? You could put them in a room together and let them debate a selection in the 2.30 at Sandown. The trouble with this is that the value of each may get drowned in the noise of the collective. The optimum way of combining these varied inputs may get lost in the futile attempt to combine them in one fell swoop, so to speak.

In the Machine Learning world there is a technique called Ensemble stacking. This is slightly different to the above scenario. With ensemble stacking, different ML algorithms are trained on some data and then make predictions, which are fed into a second stage whose job is to find out how to combine the predictions to give a super prediction. Going forward this can often result in better predictions, especially if the algorithms used are different in nature and are therefore discovering slightly different things about the data.
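As a rough illustration of the two stages, here is a sketch using scikit-learn's StackingClassifier on synthetic data. The choice of library, the three base algorithms and the made-up data set (about 10% winners) are all assumptions of mine for the sake of the example, not a description of any model discussed in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for racing data: 1 = winner, 0 = loser.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

stack = StackingClassifier(
    # First stage: algorithms that are different in nature.
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ("logit", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=15)),
    ],
    # Second stage: learns how to combine the first stage predictions.
    final_estimator=LogisticRegression(),
    cv=5,  # second stage is trained on out-of-fold first stage predictions
)
stack.fit(X_train, y_train)

# Probability of class 1 (winner) for each test runner.
win_prob = stack.predict_proba(X_test)[:, 1]
```

The cv setting matters: the second stage only sees predictions the first stage made on rows it was not trained on, which stops it simply trusting whichever base model overfits most.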
Sound familiar? Well, this approach can be used for ML models for horse racing. Instead of throwing the kitchen sink at a model build, perhaps results would improve if various models were constructed on tightly related subsets of the data. Their predictions could then be fed into a second layer predictor that combines them into one final prediction. This also sounds like a close cousin of the two step process I covered in an earlier post. Unless you are like me (could pick an argument in an empty room), this may certainly be an approach worth exploring.

If you are interested in exploring Machine Learning for producing your own ratings but do not have any programming skills, don’t worry. I have produced some click and go software for developing ML models for sport. Check out the following

http://www.smartersig.com/mysportsai.php