In this session we are going to get straight in and do some web scraping. I have set up a simple web page to scrape, which you can take a look at here:

http://www.smartersig.com/pacedataEG.htm

When you loaded the above page just now, it was your browser that tidied up the presentation. The data itself is inconveniently
nestled within all the HTML tags and code that tell the browser how to present it.

Take a look at the underlying code by right-clicking on the web page and selecting ‘View source’ (the exact wording varies between browsers).

What we see is mainly HTML tags and code and, nestled among the gobbledygook, you can just about see some of the data that actually appears in the browser when we view the page.

What we need to do is tease the data out of this spaghetti of code, and the Python library BeautifulSoup is going to help us.
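To give a feel for what teasing out means, here is a minimal sketch run on a tiny HTML fragment. The fragment is invented for illustration and is not the markup of the real page, and the sketch uses Python's built-in "html.parser" so that it runs even if lxml is not installed yet.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for this example. It is NOT the real page's markup.
sample_html = """
<html><body>
<table>
<tr><td>Example Horse</td><td>1.5</td></tr>
</table>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser library is needed here.
soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every table cell, dropping the tags around it.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)
```

The tags are gone and only the cell text is left, which is the kind of result we are working towards.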

Fire up a new file in Notepad and enter the following code. These are the libraries we are going to need in our new program.

# import libraries

from lxml import html
import re
import requests
from bs4 import BeautifulSoup

Now we are ready to pull in the web page you have just looked at so that we can then write some code to tease out the data.
Enter the following lines in your new file under the import statements, save the file with a name of your choice, e.g. program2.py, and then run it with the
python program2.py command at the DOS prompt.

url = "http://www.smartersig.com/pacedataEG.htm"
page = requests.get(url)
html_page = page.content

The variable called html_page now contains the whole of the web page content we requested, including the HTML stuff we don't want.
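One thing worth knowing is that requests.get does not complain by itself if the server answers with an error such as 404; it simply hands back the error page. A hedged sketch of a more defensive fetch is shown here. The function name fetch_html and the 10 second timeout are my own choices for illustration, not part of the original program.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at url, or raise if the request fails.

    fetch_html and the default timeout are illustrative choices.
    """
    page = requests.get(url, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx and 5xx responses.
    page.raise_for_status()
    return page.content
```

In our program this could stand in for the two lines that create page and html_page, for example html_page = fetch_html(url). Any failure then surfaces as a requests.exceptions.RequestException, which you can catch if you want to print a friendlier message.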

If we add the following line and run the program, it spews out the entire HTML content to the screen.

print(html_page)
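In Python 3, page.content is a bytes object, so the printout starts with b' and shows line breaks as \n rather than as real new lines. If you would like something more readable, you can decode the bytes to a string first. The sketch below uses a small made-up bytes value and assumes UTF-8 encoding; the real page may use a different encoding.

```python
# A small stand-in for page.content, invented for illustration.
raw = b"<html>\n<body>Hello</body>\n</html>"

# Printing bytes shows the b'...' form, all on one line with \n escapes.
print(raw)

# Decoding gives an ordinary string, which prints with real line breaks.
# UTF-8 is an assumption here; the real page may be encoded differently.
text = raw.decode("utf-8")
print(text)
```

requests also offers page.text, which does the decoding for you using the encoding it detects.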

We can now call on BeautifulSoup to parse the page into a structure that will allow us to traverse it more easily.
The following statement will do this.

soup = BeautifulSoup(html_page, "lxml")

We can now take a look at how BeautifulSoup sees and represents the code by adding the following line and running the program.

print(soup.prettify())
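To see what prettify does without wading through a whole page, here is a small sketch that compares the compact and prettified forms of one table row. The fragment is made up for illustration, and "html.parser" is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table, invented for this example.
fragment = "<table><tr><td>Example Horse</td><td>1.5</td></tr></table>"
soup = BeautifulSoup(fragment, "html.parser")

# str(soup) gives the markup on a single line.
compact = str(soup)

# prettify() puts each tag and piece of text on its own line,
# indented to show which tags sit inside which.
pretty = soup.prettify()

print(compact)
print(pretty)
```

The indentation in the second printout is what makes the nesting of the tags easy to follow.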

It would help to have this output to refer to, so let's save it to a text file where we can study it at leisure and see how the underlying
code is organised. We need some understanding of how the code is structured before we can tease out the data.

To save the prettified output, simply add the following lines and rerun the program. This writes it to a file called prettysoup.txt.

# the with block closes the file for us once the write has finished
with open("prettysoup.txt", "w", encoding="utf-8") as soupfile:
    soupfile.write(soup.prettify())

Now you can open the file prettysoup.txt and take a look at its structure.
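If you would rather take a quick peek from Python instead of opening the file in an editor, something like the following sketch would do. The helper name preview is my own invention, and it assumes the file was saved as UTF-8; change the encoding argument if yours was saved differently.

```python
from itertools import islice


def preview(path, n=10):
    """Return the first n lines of a text file, without trailing newlines.

    preview is an illustrative helper; it assumes the file is UTF-8 encoded.
    """
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

Calling preview("prettysoup.txt") and printing each line it returns shows the top of the file, which is usually enough to check that the save worked.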

In the next session we will look at how to pull out the data so that we have the horse names and pace figures stored in a .csv file.

Total program so far

# import libraries

from lxml import html
import re
import requests
from bs4 import BeautifulSoup

## get the page ##

url = "http://www.smartersig.com/pacedataEG.htm"
page = requests.get(url)
html_page = page.content

soup = BeautifulSoup(html_page, "lxml")

## save the prettified page ##

with open("prettysoup.txt", "w", encoding="utf-8") as soupfile:
    soupfile.write(soup.prettify())