
Make Your Betting Pay

~ Improve Your Horse Betting

Tag Archives: scraping horse racing data

Web Scraping Race Data Part 4

21 Monday Jan 2019

Posted by smartersig in Web Scraping Race Data

Tags

scraping horse racing data

In this final session we will wrap up by saving all the data to a .csv file.
I am going to save the pace figures for each horse but not the race averages; we can calculate those if we need to.

In order to omit the race average pace figure line I need to make one small amendment to the following line.

for atr in trs:

becomes

for atr in trs[:-1]:

This simply says: access the trs one by one starting from the first, but skip the last one.
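
To see the slice on its own, here is a minimal sketch that uses a plain Python list in place of the table rows. The row values are made up for illustration; in the real program trs holds the tr tags found by soup.

```python
# Stand-in for the list of table rows; the last entry plays the
# part of the race-average line we want to leave out.
rows = ["horse 1", "horse 2", "horse 3", "race average"]

# rows[:-1] is every element from the first up to, but not
# including, the last one. The original list is left unchanged.
kept = rows[:-1]

for row in kept:
    print(row)
```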

Each line we output to our file will look like the following lines

Ascot 12:40 1m 7f 152y,Beat The Judge (IRE), ,0,0
etc etc
etc etc
Wolverhampton 20:15 5f 21y,Teepee Time,11,2.6,3

I am going to call the output file pacefile.csv. I will therefore need to open this file for writing with the following statement early on in the program code

pacefile = open('pacefile.csv',"w")

Now here is the modified final section of code.

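for atr in trs[:-1]:
  outLine = raceId
  tds = atr.findAll('td')
  for atd in tds:
    data = atd.text
    # pace figure cells contain an opening bracket followed by a digit
    if re.search(r"\([0-9]", data):
      data = re.sub(r"\(", "", data)
      data = re.sub(r"\)", "", data)
      data = re.sub(" ", ",", data)   # put the two figures in separate columns
    data = re.sub(r"\n", "", data)
    outLine = outLine + ',' + data

  print (outLine)
  pacefile.write(outLine + "\n")   # write() does not add a line ending itself

pacefile.close()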

Notice how each time we loop around a table row we assign raceId to a new outLine on line 2 of the above code.
In the inner loop, where we access each data column, we add the data to outLine separated by a comma.
pacefile.write outputs the line to the .csv file.
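
One thing to watch when building the line by hand is that a field which itself contains a comma would end up split across two columns. As a possible alternative, the sketch below uses Python's built-in csv module, which quotes such fields and adds the line endings for us. The rows are hard-coded samples copied from the example output above, and the file name pacefile_demo.csv is only for illustration.

```python
import csv

# Sample rows copied from the example output; in the real program
# each row would be built up inside the loop over the trs.
rows = [
    ["Ascot 12:40 1m 7f 152y", "Beat The Judge (IRE)", " ", "0", "0"],
    ["Wolverhampton 20:15 5f 21y", "Teepee Time", "11", "2.6", "3"],
]

# newline="" lets the csv module control the line endings itself;
# the with block closes the file for us when it finishes.
with open("pacefile_demo.csv", "w", newline="") as pacefile:
    writer = csv.writer(pacefile)
    for row in rows:
        writer.writerow(row)

# Read the file back to check what was written.
with open("pacefile_demo.csv", newline="") as pacefile:
    for row in csv.reader(pacefile):
        print(row)
```

Using with also means there is no separate close() call to forget.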

Often HTML tags such as tables have names, such as an id or class attribute, which can make life easier when trying to access them.
Another important technique is accessing the URL behind a link. Here we do not want the link text but we do want the URL, perhaps to navigate to another page. For example, if the links were in a number of table columns
we could have code such as

for td in tds:
  link = td.find('a', href=True)
  if link is not None:
    strlink = str(link.get('href'))
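
Here is a small self-contained sketch of the same idea that you can run without visiting a real site. The HTML snippet and the base address are made up, and the use of urljoin to turn a relative link into a full address is my own addition rather than part of the program above. It assumes BeautifulSoup 4 is installed, where find_all is the newer spelling of findAll.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up HTML and base address, purely for illustration.
html = """
<table><tr>
  <td><a href="/race/1">Race 1</a></td>
  <td>no link here</td>
  <td><a href="http://example.com/race/2">Race 2</a></td>
</tr></table>
"""
base_url = "http://example.com/cards/today"

soup = BeautifulSoup(html, "html.parser")
links = []
for td in soup.find_all("td"):
    # href=True only matches <a> tags that actually carry an href
    link = td.find("a", href=True)
    if link is not None:
        # urljoin makes a relative href into a full address and
        # leaves an already complete one unchanged.
        links.append(urljoin(base_url, link["href"]))

print(links)
```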

The only way to get used to web scraping is to practice, but it’s not for the impatient. It can be like drawing teeth at times but well worth it when the data lands.

The full program code can be copied/saved at

http://www.smartersig.com/scrapeexample.py

Let me know in the comments section whether or not this series has been helpful.

Web Scraping Race Data Part 3

21 Monday Jan 2019

Posted by smartersig in Web Scraping Race Data

Tags

scraping horse racing data

In this session we are going to drill down into the web page and extract the data so that we can store it in a .csv file.

If you recheck the pretty file we produced in the previous session you can just about see that the data is stored in tables, marked by table tags,
and that the nested inner tables contain the data in rows marked by tr tags and columns marked by td tags.

If you add the following two lines of code to your program and run it you will see that findAll has located all the table components and stored them in a list called tables.

tables = soup.findAll('table')

print (tables[0])

Printing out tables[0] only accesses the first table in the HTML file. Try changing this to tables[1] and you will see that it prints out the next table level and all its inner tables. Try changing the value to 2, 3 and 4.

From 4 onwards you are accessing individual data tables, but we do not know how many there are or could be in future. No worries, we can get round this by using the tree-like hierarchy that soup contains.

Remove the print statement and replace with

tables = soup.findAll('table')
outerTable = tables[2].find('table')
dataTables = outerTable.findAll('table')
for aTable in dataTables:
  print (aTable)
  print ("--------------------------")

The above code grabs the required nested level of table and then finds all the tables within it, which are the individual data tables we need. The printout shows each table separated by a dashed line. However, we still need to extract the data from the HTML in each data table. You can see from the printout that we still have lots of unwanted HTML code mixed in.
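
If the nesting is hard to picture, the following cut-down sketch may help. The HTML is made up and far simpler than the real page, so the index used here is not the one the real page needs. It assumes BeautifulSoup 4, where find_all is the newer spelling of findAll.

```python
from bs4 import BeautifulSoup

# A made-up page: one outer table holding two inner data tables.
html = """
<table id="outer"><tr><td>
  <table><tr><td>Horse 1</td></tr></table>
  <table><tr><td>Horse 2</td></tr></table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching from the top finds every table, outer and inner alike.
all_tables = soup.find_all("table")
print(len(all_tables))

# Searching from the outer table finds only the tables inside it,
# however many there happen to be.
outer = all_tables[0]
data_tables = outer.find_all("table")
for a_table in data_tables:
    print(a_table.find("td").text)
```

Because the second search starts from the outer table, the loop keeps working if more inner tables are added later.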

Let’s see how we can access our first piece of pure data, the race heading. Change the last two lines of code so the code now reads

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print (raceId)

Run the code and you should see the race headings printed off.

OK, we are now ready to access the individual rows (tr’s) within each race table and each column (td’s) within the row. This will give us access to the horse name, the draw and the two pace figures. Here is the code we need to add to achieve this.

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print (raceId)

  ## this is the new code ##

  trs = aTable.findAll('tr')
  for atr in trs:
    tds = atr.findAll('td')
    for atd in tds:
      data = atd.text
      print (data)

Running this we can see it works except for one problem. The final pace figure on each line has brackets round it, which we would like to remove.
Also there appears to be a newline, as the second bracket is on a new line. We will need a bit of code to clean this up, but the problem is that
we only want to clean up the last column. We do not, for example, want to remove brackets from the horse name. Here is the code for this.

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print (raceId)

  trs = aTable.findAll('tr')
  for atr in trs:
    tds = atr.findAll('td')
    for atd in tds:
      data = atd.text

      ## this is the new code (needs "import re" at the top of the program) ##

      if re.search(r"\([0-9]", data):
        data = re.sub(r"\(", "", data)
        data = re.sub(r"\)", "", data)
      data = re.sub(r"\n", "", data)
      print (data)

The above needs a little explaining. The if statement searches the data for an opening bracket followed by a digit. This means we will only
catch the pace figures and not the brackets associated with horse nationality, such as (IRE). If it finds a match it executes the two statements in the
body of the if. These two statements replace the opening bracket with an empty string and do the same for the closing bracket. The backslash before each bracket
tells the regular expression to treat the bracket as a literal character to be matched, and not as a grouping bracket of the kind used in an expression such as (a+b)*(c+d).

The final substitution replaces every \n, that is every newline character, with an empty string as well.
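
To try the clean-up on its own, away from the web page, here is a minimal sketch. The function name clean and the two sample strings are my own, chosen to resemble a pace figure cell and a horse name, so the exact text on the real page may differ. The r in front of each pattern makes it a raw string, so Python passes the backslash through to the regular expression untouched.

```python
import re

def clean(data):
    # Only strip brackets when an opening bracket is followed by a digit.
    if re.search(r"\([0-9]", data):
        data = re.sub(r"\(", "", data)
        data = re.sub(r"\)", "", data)
    # Newlines are removed from every cell.
    return re.sub(r"\n", "", data)

# A pace figure cell loses its brackets and its newline.
print(clean("2.6 (3\n)"))

# A horse name keeps its brackets, as "(" is followed by a letter.
print(clean("Beat The Judge (IRE)"))
```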

Ok we seem to be getting at the required data. In the final session we will see how to tie it up so that we output this data to a file and not simply
to the screen.
