Tags

In this session we are going to drill down into the web page and extract the data so that we can store it in a .csv file.

If you recheck the pretty file we produced in the previous session, you can just about see that the data is stored in tables, marked by table tags,
and that the nested inner tables hold the data in rows controlled by tr tags and columns controlled by td tags.

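If you add the following two lines of code to your program and run it, you will see that findAll has located all the table elements and stored them in a list called tables. (Current versions of Beautiful Soup name this method find_all; findAll is the older spelling of the same method.)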

tables = soup.findAll('table')

print(tables[0])

Printing tables[0] shows only the first table in the HTML file. Try changing this to tables[1] and you will see that it prints the next table level down and all its inner tables. Try changing the index to 2, 3 and 4.
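To see why a higher index takes you deeper into the nesting, here is a minimal sketch on made-up HTML (the ids and layout are invented for illustration and are not the real page). It uses find_all, the current spelling of findAll, and assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only: one outer table holding two inner tables.
html = """
<table id="outer">
  <tr><td>
    <table id="inner1"><tr><td>a</td></tr></table>
    <table id="inner2"><tr><td>b</td></tr></table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

# find_all walks the whole tree, so nested tables land in the same flat list,
# in the order their opening tags occur in the document.
ids = [t["id"] for t in tables]
print(ids)  # ['outer', 'inner1', 'inner2']
```

So tables[0] is the outer table (inner tables included), and the later entries are the inner tables on their own.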

From 4 onwards you are accessing individual data tables, but we do not know how many there are now or could be in future. No worries, we can get round this by using the tree-like hierarchy that soup contains.

Remove the print statement and replace it with:

tables = soup.findAll('table')
outerTable = tables[2].find('table')
dataTables = outerTable.findAll('table')
for aTable in dataTables:
  print(aTable)
  print("--------------------------")

The above code grabs the required nested level of table and then finds all the tables within it, which are the individual data tables we need. The printout shows each table separated by a dashed line. We still need to separate the data in each table from the HTML, however: you can see from the printout that there is still lots of unwanted HTML mixed in.
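The narrowing step can be tried out on its own. The sketch below uses a hypothetical page layout (a wrapper table, then a container table holding one table per race), so the index 0 and the ids are assumptions and not the real site's structure. It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical layout, not the real site: a wrapper table, then a container
# table whose inner tables each hold one race.
html = """
<table id="layout"><tr><td>
  <table id="container"><tr><td>
    <table id="race1"><caption>Race 1</caption><tr><td>x</td></tr></table>
    <table id="race2"><caption>Race 2</caption><tr><td>y</td></tr></table>
  </td></tr></table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

# find() returns the first table nested inside tables[0]. Searching from that
# tag only looks at its descendants, so the layout tables are left out.
outer_table = tables[0].find("table")
data_tables = outer_table.find_all("table")

race_ids = [t["id"] for t in data_tables]
print(race_ids)  # ['race1', 'race2']
```

Because the search starts from the container, it picks up however many race tables are inside it, with no need to know the count in advance.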

Let's see how we can access our first piece of pure data, the race heading. Change the last two lines of code so that the loop now reads:

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print(raceId)

Run the code and you should see the race headings printed off.
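One thing to watch: find returns None when a table has no caption, and asking None for .text raises an error. The sketch below shows one possible guard, using made-up HTML and an invented race heading. It assumes the bs4 package is installed, and whether the real page ever has a table without a caption is not something we know.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second table deliberately has no caption.
html = """
<table><caption> 2.30 Ascot </caption><tr><td>x</td></tr></table>
<table><tr><td>y</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

race_ids = []
for a_table in soup.find_all("table"):
    caption = a_table.find("caption")
    if caption is None:
        # find() found nothing, so skip this table instead of crashing.
        continue
    # get_text(strip=True) also trims the spaces around the heading.
    race_ids.append(caption.get_text(strip=True))

print(race_ids)  # ['2.30 Ascot']
```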

OK, we are now ready to access the individual rows (tr's) within each race table and the columns (td's) within each row. This will give us access to the horse name, the draw and the two pace figures. Here is the code we need to add to achieve this.

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print(raceId)

  ## this is the new code ##

  trs = aTable.findAll('tr')
  for atr in trs:
    tds = atr.findAll('td')
    for atd in tds:
      data = atd.text
      print(data)

Running this we can see it works except for one problem. The final pace figure on each line has brackets round it, which we would like to remove.
There also appears to be a line break inside it, as the closing bracket is printed on a new line. We will need a bit of code to clean this up, but the catch is that
we only want to clean up the last column. We do not, for example, want to remove brackets from the horse name. Here is the code for this.

for aTable in dataTables:
  caption = aTable.find('caption')
  raceId = caption.text
  print(raceId)

  trs = aTable.findAll('tr')
  for atr in trs:
    tds = atr.findAll('td')
    for atd in tds:
      data = atd.text

      ## this is the new code ##
      ## (it needs "import re" at the top of the program) ##

      if re.search(r"\([0-9]", data):
        data = re.sub(r"\(", "", data)
        data = re.sub(r"\)", "", data)
        data = re.sub(r"\n", "", data)
      print(data)

The above needs a little explaining. The if statement searches the data for an opening bracket followed by a digit. This means we only
catch the pace figures and not the brackets associated with horse nationality. If it finds a match it executes the three statements in the
body of the if. The first two replace the opening bracket and the closing bracket with an empty string, which deletes them. The backslash before each bracket
tells the regular expression engine to treat the bracket as a literal character to be matched, and not as a grouping bracket of the kind used in an expression, eg (a+b)*(c+d).

The final substitution replaces every \n, that is every newline character, with an empty string as well.
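If you want to try the clean-up away from the web page, the same logic can be wrapped in a small function. The name clean_pace_figure and the sample values are invented for this sketch and are not taken from the real data.

```python
import re

def clean_pace_figure(data):
    """Strip brackets and newlines, but only from values like '(4.5'."""
    # r"\([0-9]" matches a literal "(" immediately followed by a digit.
    if re.search(r"\([0-9]", data):
        data = re.sub(r"\(", "", data)
        data = re.sub(r"\)", "", data)
        data = re.sub(r"\n", "", data)
    return data

# Invented sample values for illustration.
print(clean_pace_figure("(4.5\n)"))           # 4.5
print(clean_pace_figure("Some Horse (IRE)"))  # Some Horse (IRE)
```

The second value is left alone because its opening bracket is followed by a letter and not a digit.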

OK, we seem to be getting at the required data. In the final session we will see how to tie it all up so that we output this data to a file and not simply
to the screen.