In this final session we will wrap up by saving all the data to a .csv file.
I am going to save the pace figures for each horse but not the race averages; we can always calculate those later if we need them.
In order to omit the race average line, I need to make one small amendment, changing this line:
for atr in trs:
for atr in trs[:-1]:
This simply says: access the trs one by one from the first, but skip the last one (the race average row).
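If the `[:-1]` slice is unfamiliar, here is a quick standalone illustration on an invented list:

```python
# [:-1] takes every element except the last one
rows = ['Horse A', 'Horse B', 'Horse C', 'Race Average']
for row in rows[:-1]:
    print(row)   # prints Horse A, Horse B, Horse C; the average row is skipped
```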
Each line we output to the file will look like these examples:
Ascot 12:40 1m 7f 152y,Beat The Judge (IRE), ,0,0
Wolverhampton 20:15 5f 21y,Teepee Time,11,2.6,3
I am going to call the output file pacefile.csv. I will therefore need to open this file for writing with the following statement early on in the program code:
pacefile = open('pacefile.csv', 'w')
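As an aside, Python's built-in csv module can do the comma-joining and quoting for you, which protects the file if a field ever contains a comma itself. A minimal sketch using one of the sample rows above:

```python
import csv

# Sketch only: csv.writer joins the fields and handles quoting automatically
with open('pacefile.csv', 'w', newline='') as pacefile:
    writer = csv.writer(pacefile)
    writer.writerow(['Wolverhampton 20:15 5f 21y', 'Teepee Time', '11', '2.6', '3'])

# Read the row back to confirm it round-trips cleanly
with open('pacefile.csv', newline='') as pacefile:
    rows = list(csv.reader(pacefile))
print(rows[0])   # -> ['Wolverhampton 20:15 5f 21y', 'Teepee Time', '11', '2.6', '3']
```

The `with` statement also closes the file for you when the block ends.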
Now here is the modified final section of code.
for atr in trs[:-1]:
    outLine = raceId
    tds = atr.findAll('td')
    for atd in tds:
        data = atd.text
        if (re.search(r"\([0-9]", data)):
            data = re.sub(r"\(", "", data)
            data = re.sub(r"\)", "", data)
            data = re.sub(r" ", ",", data)
            data = re.sub(r"\n", "", data)
        outLine = outLine + ',' + data
    pacefile.write(outLine + '\n')
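To see exactly what those regex substitutions do, here is a standalone sketch running the same cleanup on a sample cell value (the sample strings are invented for illustration):

```python
import re

def clean_cell(data):
    # Same cleanup as the scraper loop: when the cell holds bracketed pace
    # numbers, strip the parentheses and turn the spaces into commas
    if re.search(r"\([0-9]", data):
        data = re.sub(r"\(", "", data)
        data = re.sub(r"\)", "", data)
        data = re.sub(r" ", ",", data)
        data = re.sub(r"\n", "", data)
    return data

print(clean_cell("(11 2.6 3)"))    # -> 11,2.6,3
print(clean_cell("Teepee Time"))   # -> Teepee Time (no digits after a bracket, so untouched)
```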
Notice how, each time we loop around to a new table row, we assign raceId to a fresh outLine on line 2 of the above code.
In the inner loop, where we access each data column, we append the data to outLine, separated by commas.
pacefile.write outputs the completed line to the .csv file. Remember to call pacefile.close() at the end of the program so everything is flushed to disk.
Often HTML tags such as tables carry names (id or class attributes), which can make life easier when trying to access them.
Another important technique is accessing the URL contents of links. Here we do not want the link's text; we want the URL itself, perhaps to navigate to another page. For example, if the links were spread across a number of table columns, we could have code such as:
for td in tds:
    link = td.find('a', href=True)
    if link is not None:
        strlink = str(link.get('href'))
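Here is that idea as a runnable sketch, using an invented table snippet so it stands alone (the racecard URLs are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration only
html = """
<table><tr>
  <td><a href="/racecards/ascot">Ascot</a></td>
  <td>no link in this cell</td>
  <td><a href="/racecards/wolverhampton">Wolverhampton</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')
links = []
for td in soup.findAll('td'):
    link = td.find('a', href=True)   # returns None when the cell has no anchor
    if link is not None:
        links.append(str(link.get('href')))

print(links)   # -> ['/racecards/ascot', '/racecards/wolverhampton']
```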
The only way to get used to web scraping is to practice, but it is not for the impatient. It can feel like pulling teeth at times, but it is well worth it when the data lands.
The full program code can be copied/saved at
Let me know in the comments section whether this series has been helpful.