
Make Your Betting Pay

~ Improve Your Horse Betting

Category Archives: Web Scraping Race Data

Web Scraping Race Data Part 4

Monday 21 Jan 2019

Posted by smartersig in Web Scraping Race Data


Tags: scraping horse racing data

In this final session we will wrap up by saving all the data to a .csv file.
I am going to save the pace figures for each horse but not the race averages; we can calculate those later if we need to.

In order to omit the race average pace figure line I need to make one small amendment to the following line.

for atr in trs:

becomes

for atr in trs[:-1]:

This simply says: loop through the trs one by one from the first, but skip the last one.
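
If slicing is new to you, here is a quick standalone sketch (the row values are made up purely for illustration) of what [:-1] does to an ordinary list:

rows = ['row one', 'row two', 'row three', 'race average']
for r in rows[:-1]:
  print (r)   # prints the first three rows; the final 'race average' entry is skipped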

Each line we output to our file will look like the following:

Ascot 12:40 1m 7f 152y,Beat The Judge (IRE), ,0,0
etc etc
etc etc
Wolverhampton 20:15 5f 21y,Teepee Time,11,2.6,3

I am going to call the output file pacefile.csv. I will therefore need to open it for writing, with the following statement early on in the program code:

pacefile = open('pacefile.csv', 'w')
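
As an aside, Python's with statement will close the file for you automatically when you are done with it; a minimal sketch of the same idea:

with open('pacefile.csv', 'w') as pacefile:
  pacefile.write('a,line,of,data\n')   # made-up line purely for illustration
# the file is closed automatically when the with block ends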

Now here is the final, modified section of code.

for atr in trs[:-1]:
  outLine = raceId
  tds = atr.findAll('td')
  for atd in tds:
    data = atd.text
    if (re.search(r"\([0-9]", data)):
      data = re.sub(r"\(", "", data)
      data = re.sub(r"\)", "", data)
      data = re.sub(" ", ",", data)
    data = re.sub('\n', "", data)
    outLine = outLine + ',' + data

  print (outLine)
  pacefile.write(outLine + '\n')   # append a newline so each race line is a separate row

pacefile.close()

Notice how, as we loop around each table row, we assign the raceId to a fresh outLine on line 2 of the above code.
In the inner loop, where we access each data column, we add the data to outLine separated by a comma.
pacefile.write then outputs the line to the .csv file; note the appended newline, which ensures each race line sits on its own row.
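
Incidentally, Python's built-in csv module can handle the commas and any quoting for you; a minimal sketch of the same idea, with made-up column values:

import csv

with open('pacefile.csv', 'w', newline='') as f:
  writer = csv.writer(f)
  # each list passed to writerow becomes one comma separated row in the file
  writer.writerow(['Wolverhampton 20:15 5f 21y', 'Teepee Time', '11', '2.6', '3'])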

Often HTML tags such as tables have names, which can make life easier when trying to access them.
Another important technique is extracting the URL from a link. Here we do not want the link's text, but we do want its URL, perhaps to navigate to another page. For example, if the links were spread across a number of table columns
we could have code such as:

for td in tds:
  link = td.find('a', href=True)
  if link is not None:
    strlink = str(link.get('href'))
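
If the href turns out to be relative, you would normally join it onto the site's base URL before requesting it; a minimal sketch, where the base URL and link are assumed examples:

import requests
from urllib.parse import urljoin

base = 'http://www.smartersig.com/'   # assumed base URL, purely for illustration
strlink = 'results.htm'               # stand-in for the href extracted above
full_url = urljoin(base, strlink)     # gives 'http://www.smartersig.com/results.htm'
page = requests.get(full_url)         # fetch the linked page as before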

The only way to get used to web scraping is to practise, but it's not for the impatient. It can be like pulling teeth at times, but it is well worth it when the data lands.

The full program code can be copied/saved at

http://www.smartersig.com/scrapeexample.py

Let me know in the comments section whether or not this series has been helpful.


Web Scraping Race Data Part 3

Monday 21 Jan 2019

Posted by smartersig in Web Scraping Race Data


Tags: scraping horse racing data

In this session we are going to drill down into the web page and extract the data so that we can store it in a .csv file.

If you recheck the pretty file we produced in the previous session, you can just about see that the data is stored in tables denoted by table tags,
and that the nested inner tables contain the data in rows controlled by tr tags and columns controlled by td tags.

If you add the following two lines of code to your program and run it, you will see that findAll has located all the table components and stored them in a list called tables.

tables = soup.findAll('table')

print (tables[0])

Printing out tables[0] accesses only the first table in the html file. Try changing this to tables[1] and you will see that it prints out the next table level and all its inner tables. Try changing the value to 2, 3 and 4.
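
Incidentally, you can ask Python at run time how many tables findAll located:

print (len(tables))   # the number of table tags found in the page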

From 4 onwards you are accessing the individual data tables, but we do not want to hard-code how many there are, since that could change in future. No worries; we can get round this by using the tree-like hierarchy that soup contains.

Remove the print statements and replace them with:

tables = soup.findAll('table')
outerTable = tables[2].find('table')
dataTables = outerTable.findAll('table')
for aTable in dataTables:
  print (aTable)
  print ("--------------------------")

The above code grabs the required nested level of table and then finds all the tables within it, which are the individual data tables we need. The printout shows each table separated by the dashed line. However, we still need to extract the data from the HTML in each data table; you can see from the printout that we still have lots of unwanted HTML code mixed in.

Let's see how we can access our first piece of pure data, the race heading. Change the last two lines of code so that it now reads:

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print (raceId)

Run the code and you should see the race headings printed off.

OK, we are now ready to access the individual rows (trs) within each race table and each column (tds) within each row. This will give us access to the horse name, the draw and the two pace figures. Here is the code we need to add to achieve this.

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print (raceId)

  ## this is the new code ##

  trs = aTable.findAll('tr')
  for atr in trs:
    tds = atr.findAll('td')
    for atd in tds:
      data = atd.text
      print (data)

Running this we can see it works, except for one problem. The final pace figure on each line has brackets round it, which we would like to remove.
There also appears to be a newline, as the second bracket ends up on a new line. We will need a bit of code to clean this up, but the catch is that
we only want to clean up the last column; we do not, for example, want to remove the brackets from a horse name. Here is the code for this.

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print (raceId)

  trs = aTable.findAll('tr')
  for atr in trs:
    tds = atr.findAll('td')
    for atd in tds:
      data = atd.text

      ## this is the new code ##

      if (re.search(r"\([0-9]", data)):
        data = re.sub(r"\(", "", data)
        data = re.sub(r"\)", "", data)
      data = re.sub('\n', "", data)
      print (data)

The above needs a little explaining. The IF statement searches the data for an opening bracket followed by a digit. This means we will only
catch the pace figures and not the brackets associated with horse nationality. If it detects such data, it executes the two statements in the
body of the IF. These two statements substitute an open bracket with null, and similarly for a closed bracket. The backslash before each bracket
tells the regular expression to treat the bracket as a literal character to be substituted, rather than as the grouping symbol that brackets normally represent in a pattern. Prefixing the pattern with r (a raw string) also stops Python itself from trying to interpret the backslash.

The final substitution replaces any \n characters, that is newlines, with null as well.
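
To see the effect of these substitutions in isolation, here is a quick standalone sketch using a made-up pace figure string:

import re

data = '(2.6\n)'                  # made-up example of a raw pace figure
if (re.search(r"\([0-9]", data)):
  data = re.sub(r"\(", "", data)
  data = re.sub(r"\)", "", data)
data = re.sub('\n', "", data)
print (data)                      # prints: 2.6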

OK, we seem to be getting at the required data. In the final session we will see how to tie it all up so that we output this data to a file and not simply
to the screen.

Web Scraping Race Data Part 2

Sunday 20 Jan 2019

Posted by smartersig in Web Scraping Race Data


Tags: web scraping horse racing data

In this session we are going to get straight in and do some web scraping. I have set up a simple web page to scrape; you can take a look at

http://www.smartersig.com/pacedataEG.htm

When you loaded the above page into your browser, it was the browser that tidied up the presentation. The data itself is inconveniently
nestled within all the HTML tags and code that tell the browser how to present it.

Take a look at the underlying code by right clicking on the web page and selecting ‘view source’.

What we see is mainly HTML tags and code, and nestled among the gobbledygook you can just about make out some of the data that actually appears in the browser when we view it.

What we need to do is tease out all the data from within this spaghetti of code and the Python library BeautifulSoup is going to help us.

Fire up a new file in Notepad and enter the following code. These are the libraries we are going to need in our new program:

# import libraries

from lxml import html
import re
import requests
from bs4 import BeautifulSoup

Now we are ready to pull in the web page you have just looked at, so that we can then write some code to tease out the data.
Enter the following lines in your new file under the import statements, save the file with a name of your choice, e.g. program2.py, and then run it with the
python program2.py command at the DOS prompt:

url = "http://www.smartersig.com/pacedataEG.htm"
page = requests.get(url)
html_page = page.content

The variable called html_page now contains the whole of the web page content we requested, including the HTML stuff we don't want.
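
It is worth checking that the request actually succeeded before going any further; a minimal sketch:

if (page.status_code != 200):
  print ('Request failed with status code', page.status_code)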

If we add the following line and run the program, it spews out the entire HTML content to the screen:

print (html_page)

We can now call on BeautifulSoup to render the page into a structure that will allow us to traverse it more easily.
The following statement will do this:

soup = BeautifulSoup(html_page, "lxml")

We can now take a look at how BeautifulSoup sees and represents the code by adding the following line and running the program:

print (soup.prettify())

It would help to have this to refer to, so we will save it to a text file that we can check at our leisure and see how the underlying
code is organised. We need some understanding of how the code is structured before we can tease out the data.

To save the prettified code, simply add the following statements and rerun the program; the output is saved to a file called prettysoup.txt:

soupfile = open('prettysoup.txt', 'w')
soupfile.write(soup.prettify())
soupfile.close()

Now you can open the file prettysoup.txt and take a look at its structure.

In the next session we will look at how to pull out the data so that we have the horse names and pace figures stored in a .csv file.

Total program so far

# import libraries

from lxml import html
import re
import requests
from bs4 import BeautifulSoup

## get the page ##

url = "http://www.smartersig.com/pacedataEG.htm"
page = requests.get(url)
html_page = page.content

soup = BeautifulSoup(html_page, "lxml")

soupfile = open('prettysoup.txt', 'w')
soupfile.write(soup.prettify())
soupfile.close()

Web Scraping Race Data 1

Sunday 20 Jan 2019

Posted by smartersig in Web Scraping Race Data


Tags: tweet, web scraping horse racing data

This is the first in a series on web scraping using Python. I am going to assume that you have some basic programming skills, for example you know what a FOR loop is or an IF statement. If these terms do not mean anything to you then you probably need an introduction to basic programming in Python. I am also assuming that you are on a Windows based machine.

Web scraping is the process of gathering data from web pages and placing it into a convenient data form such as a .csv file.
As an example we might want a .csv file (comma delimited flat file) of the day's runners:

2019-01-18,Chepstow 13:05,NH,Handicap Chase,5YO+,4,4.6,,433,Barbrook Star (IRE),etc,etc
etc
etc

Before we get stuck into the nitty gritty of scraping, you are first going to need to download and install Python, assuming you do not already have it. Don't worry, this is quite painless. I suggest installing Anaconda as your Python programming language source. If you do not already have Anaconda Python,
then visit the following link and click on the 64 or 32 bit installer link depending on your machine. The installer will be downloaded,
and you should then double click it to begin the installation process. The default settings offered to you should all be fine for
our purposes, and the process should only take about 5 minutes. Make sure, however, that you tick the set environment variable check box when prompted during the setup.

https://www.anaconda.com/download/

Once everything has been installed, check all is OK, and which version of Python you have, by firing up an MSDOS command window
(type cmd into the Windows search box) and then typing in this window:

python --version

That's a double dash, by the way, in the above.

First Program

Let us finish this short first session off with our first program, the usual Hello World.
For now we will stick with rudimentary tools just to keep things simple. We will use Notepad, or you can use Wordpad if you prefer.
Fire up Notepad (enter Notepad in the Windows search) and enter the following few lines of code (the indentation of print (message) is important):

message = "Hello World"
if ("World" in message):
  print (message)

Save the file with the name HelloWorld.py

Now, at the DOS window prompt where you checked the Python version, type in:

python helloworld.py

You should get the Hello World message appearing on the screen.

Try changing the line

if ("World" in message):

to

if ("world" in message):

then save and run again.

Notice that when you run the program now it prints nothing out, because the lower case 'world' is not contained in the message.
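
If you wanted the check to ignore case, you could lower the message first; for example:

if ("world" in message.lower()):
  print (message)   # matches now, because message.lower() gives 'hello world'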

Notice also the indentation; unlike other programming languages, Python will not allow the following, because the body of the IF is not indented:

if ("World" in message):
print (message)

I am not going to deliver a full blown intro to Python during this series but I may mention things that are peculiar to Python just in case you have programmed in other languages but not Python.

OK, that will do for now. In the next session we will start doing some actual web scraping.
