Web scraping with Python (Part 2)

by Javier Collado | August 2009 | Content Management Open Source

This article by Javier Collado expands the set of web scraping techniques shown in his previous article by looking closely into a more complex problem that cannot be solved with the tools explained there. For those who missed that article, here is the link: Web Scraping with Python.

This article will show how to extract the desired information using the same three steps when the web page is not written directly in HTML, but is auto-generated by JavaScript code that updates the DOM tree.

As you may remember from that article, web scraping is the ability to automatically extract information from a set of web pages that were designed only to display information nicely to humans, a format that might not be suitable when a machine needs to retrieve that information. The three basic steps recommended when performing a scraping task were the following:

  • Explore the website to find out where the desired information is located in the HTML DOM tree
  • Download as many web pages as needed
  • Parse downloaded web pages and extract the information from the places found in the exploration step

What should be taken into account when the content is not directly coded in the HTML DOM tree? The main difference, as you have probably already noted, is that the downloading methods suggested in the previous article (urllib2 or mechanize) just don't work. This is because they generate an HTTP request to get the web page and deliver the received HTML directly to the scraping script. However, the pieces of information that are auto-generated by JavaScript code are not yet in that HTML file, because the code has not been executed by any virtual machine, as happens when the page is displayed in a web browser.
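A minimal sketch of the problem (the marker string searched for below is just an illustration; the point is that no script has run on the markup we receive):

    import urllib2

    # Plain HTTP request: we get the HTML exactly as the server sent it
    html = urllib2.urlopen(
        'http://www.nasa.gov/multimedia/imagegallery/iotd.html').read()

    # Markup that JavaScript inserts after the page loads is absent from
    # this raw HTML, so searching for it here will not find it
    print('gallery markup present: %s' % ('Full_Size' in html))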

Hence, instead of relying on a library that generates HTTP requests, we need a library that behaves as a real web browser, or even better, one that interacts with a real web browser, so that we can be sure we obtain the same data we see when manually opening the page. Please remember that the aim of web scraping is to parse the data that a human user sees, so interacting with a real web browser is a really nice feature.

Is there any tool out there to do that? Fortunately, the answer is yes. In particular, there are a couple of tools used for web testing automation that can solve the JavaScript execution problem: Selenium and Windmill. Windmill is used for the code samples in the sections below, but either choice would be fine, as both are well-documented, stable tools ready for production use.

Let's now follow the same three steps suggested in the previous article to scrape the contents of a web page that is partly generated by JavaScript code.

Explore

Imagine that you are a fan of NASA's Image of the Day gallery. You want to get a list of the names of all the images in the gallery, together with links to the full-resolution pictures, just in case you decide to download them later to use as desktop wallpaper.

The first thing to do is to locate the data to be extracted on the desired web page. In the case of the Image of the Day gallery (see the screenshot below), there are three elements that are important to note:

  • Title of the image that is currently being displayed
  • Link to the full-resolution image file
  • Next link, to make it possible to navigate through all the images
[Screenshot: NASA Image of the Day gallery page]

To find out the location of each piece of interesting information, as already suggested in the previous article, it's best to use a tool such as Firebug, whose inspect functionality can be really useful. The following picture, for example, shows the location of the image title inside an h3 tag:

[Screenshot: Firebug inspector showing the image title inside an h3 tag]

The other two fields can be located just as easily as the title, so no further explanation is given here. Please refer to the previous article for more details.
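As a side note, one hypothetical way to sanity-check an XPath expression found with Firebug is to run it against a copy of the page saved from the browser (after the JavaScript has run), for example with lxml:

    from lxml import html

    # Parse a local copy saved from the browser ('iotd_saved.html' is a
    # hypothetical file name) and try out an expression found with Firebug
    tree = html.parse('iotd_saved.html')
    print(tree.xpath("//div[@id='gallery_image_area']/img"))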

Download

As explained in the introduction, to download the content of the web page we will use Windmill, as it lets the JavaScript code execute in the web browser before the page content is retrieved.

Because Windmill is primarily a testing library, instead of writing a script that calls the Windmill API, I will write a Windmill test case that navigates through all the image web pages. The code for the test is as follows:

     1  def test_scrape_iotd_gallery():
     2      """
     3      Scrape NASA Image of the Day Gallery
     4      """
     5      # Extra data massage for BeautifulSoup
     6      my_massage = get_massage()
     7
     8      # Open main gallery page
     9      client = WindmillTestClient(__name__)
    10      client.open(url='http://www.nasa.gov/multimedia/imagegallery/iotd.html')
    11
    12      # Page isn't completely loaded until image gallery data
    13      # has been updated by javascript code
    14      client.waits.forElement(xpath=u"//div[@id='gallery_image_area']/img",
    15                              timeout=30000)
    16
    17      # Scrape all images information
    18      images_info = {}
    19      while True:
    20          image_info = get_image_info(client, my_massage)
    21
    22          # Break if image has already been scraped
    23          # (that means that all images have been parsed
    24          # since they are ordered in a circular ring)
    25          if image_info['link'] in images_info:
    26              break
    27
    28          images_info[image_info['link']] = image_info
    29
    30          # Click to get the information for the next image
    31          client.click(xpath=u"//div[@class='btn_image_next']")
    32
    33      # Print results to stdout ordered by image name
    34      for image_info in sorted(images_info.values(),
    35                               key=lambda image_info: image_info['name']):
    36          print("Name: %(name)s\n"
    37                "Link: %(link)s\n" % image_info)

As can be seen, the usage of Windmill is similar to that of other libraries such as mechanize. First of all, a client object has to be created to interact with the browser (line 9), and then the main web page, which will be used to navigate through all the information, has to be opened (line 10). Nevertheless, Windmill also includes facilities that take JavaScript code into account, as shown at line 14. In this line, the waits.forElement method is used to look for a DOM element that is filled in by JavaScript code, so that when that element, in this case the big image in the gallery, is displayed, the rest of the script can proceed. It is important to note here that the web page processing doesn't start when the page is downloaded (this happens after line 10), but when there is some evidence that the JavaScript code has finished manipulating the DOM tree.
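For illustration only, a similar synchronization could be approximated by polling with the getPageText command that is used later in the Parse section. The wait_for_markup helper below is a hypothetical sketch, not part of the Windmill API:

    import time

    def wait_for_markup(client, fragment, timeout=30):
        # Poll the page text until `fragment` shows up, that is, until
        # the JavaScript code has inserted the markup we are waiting for
        deadline = time.time() + timeout
        while time.time() < deadline:
            response = client.commands.getPageText()
            if response['status'] and fragment in response['result']:
                return
            time.sleep(0.5)
        raise AssertionError('timed out waiting for %r' % fragment)

The built-in wait is preferable in practice; this sketch just makes explicit what "waiting for the DOM" means.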

Navigating through all the pages that contain the needed information is just a matter of clicking on the next arrow (line 31). As the images are ordered in a circular buffer, the script decides to stop when the same image link has been parsed twice (line 25).
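The stop condition generalizes to any circular sequence. Here is a minimal sketch of the same idea as a standalone function (iterate_ring is a hypothetical helper, not part of the article's code):

    def iterate_ring(get_current, advance, key):
        # Visit every item of a circular sequence exactly once: remember
        # the key of each item already seen and stop at the first repeat
        seen = {}
        while True:
            item = get_current()
            if key(item) in seen:
                break
            seen[key(item)] = item
            advance()
        return seen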

To execute the script, instead of launching it as we would normally do for a Python script, we should call it through the windmill script so that the environment is properly initialized:

$ windmill firefox test=nasa_iotd.py

As can be seen in the following screenshot, Windmill takes care of opening a browser window (Firefox in this case) and a controller window in which it's possible to see the commands that the script is executing (several clicks on next in this example):

[Screenshot: Firefox window driven by Windmill alongside the Windmill controller window]

The controller window is really interesting because it not only displays the progress of the test cases, but also allows you to enter and record actions interactively, which is a nice feature when trying things out. In particular, recording may sometimes be used to replace Firebug in the exploration step, since the captured actions can be stored in a script without spending much time on XPath expressions.

For more information about how to use Windmill and the complete API, please refer to the Windmill documentation.


Parse

The parsing of the web page can be performed with BeautifulSoup, as explained in the previous article. The only thing to take into account is that the page content has to be retrieved, every time the JavaScript code changes the DOM tree, by using the commands.getPageText() method of the Windmill client object.

Please see below the code that extracts the image information for the Image of the Day gallery example:

     1  def get_image_info(client, my_massage):
     2      """
     3      Parse HTML page and extract featured image name and link
     4      """
     5      # Get Javascript updated HTML page
     6      response = client.commands.getPageText()
     7      assert response['status']
     8      assert response['result']
     9
    10      # Create soup from HTML page and get desired information
    11      soup = BeautifulSoup(response['result'], markupMassage=my_massage)
    12      image_info = {'name': soup.find(id='caption_region').h3.string,
    13                    'link': urlparse.urljoin('http://www.nasa.gov',
    14                                             soup.find(attrs='Full_Size')['href'])}
    15      return image_info


Code and results

The complete code that performs the scraping and prints a simple report to the standard output is:

     1  # Generated by the windmill services transformer
     2  from windmill.authoring import WindmillTestClient
     3  from BeautifulSoup import BeautifulSoup
     4
     5  import re, urlparse
     6  from copy import copy
     7
     8  def get_image_info(client, my_massage):
     9      """
    10      Parse HTML page and extract featured image name and link
    11      """
    12      # Get Javascript updated HTML page
    13      response = client.commands.getPageText()
    14      assert response['status']
    15      assert response['result']
    16
    17      # Create soup from HTML page and get desired information
    18      soup = BeautifulSoup(response['result'], markupMassage=my_massage)
    19      image_info = {'name': soup.find(id='caption_region').h3.string,
    20                    'link': urlparse.urljoin('http://www.nasa.gov',
    21                                             soup.find(attrs='Full_Size')['href'])}
    22      return image_info
    23
    24
    25  def get_massage():
    26      """
    27      Provide extra data massage to solve HTML problems in BeautifulSoup
    28      """
    29      # Javascript code in this page generates HTML markup
    30      # that isn't parsed correctly by BeautifulSoup.
    31      # To avoid this problem, all document.write fragments are removed
    32      my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
    33      my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
    34      my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))
    35      return my_massage
    36
    37
    38  def test_scrape_iotd_gallery():
    39      """
    40      Scrape NASA Image of the Day Gallery
    41      """
    42      # Extra data massage for BeautifulSoup
    43      my_massage = get_massage()
    44
    45      # Open main gallery page
    46      client = WindmillTestClient(__name__)
    47      client.open(url='http://www.nasa.gov/multimedia/imagegallery/iotd.html')
    48
    49      # Page isn't completely loaded until image gallery data
    50      # has been updated by javascript code
    51      client.waits.forElement(xpath=u"//div[@id='gallery_image_area']/img",
    52                              timeout=30000)
    53
    54      # Scrape all images information
    55      images_info = {}
    56      while True:
    57          image_info = get_image_info(client, my_massage)
    58
    59          # Break if image has already been scraped
    60          # (that means that all images have been parsed
    61          # since they are ordered in a circular ring)
    62          if image_info['link'] in images_info:
    63              break
    64
    65          images_info[image_info['link']] = image_info
    66
    67          # Click to get the information for the next image
    68          client.click(xpath=u"//div[@class='btn_image_next']")
    69
    70      # Print results to stdout ordered by image name
    71      for image_info in sorted(images_info.values(),
    72                               key=lambda image_info: image_info['name']):
    73          print("Name: %(name)s\n"
    74                "Link: %(link)s\n" % image_info)

Some interesting things to note that were not covered in the previous sections:

  • The get_massage function (lines 25-35) is needed to prevent BeautifulSoup parsing errors from stopping the script. This is because some pages use markup in a non-standard way that breaks the parser.
  • The urlparse library is used to transform relative URLs into absolute ones, as illustrated in the snippet below.
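A quick illustration of what urlparse.urljoin does with one of the gallery's links (the exact relative form of the href in the page is assumed here):

    import urlparse

    # Resolve a relative href against the site root
    print(urlparse.urljoin('http://www.nasa.gov',
                           '/images/content/363658main_2009-3856_full.jpg'))
    # http://www.nasa.gov/images/content/363658main_2009-3856_full.jpg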

A fragment of the output that, at the time of writing, is obtained by executing the code above:

Name: 3-2-1 and Liftoff of GOES-O
Link: http://www.nasa.gov/images/content/363658main_2009-3856_full.jpg

Name: A Ghost Remains
Link: http://www.nasa.gov/images/content/352981main_ghost_full.jpg

Name: A Parting Look
Link: http://www.nasa.gov/images/content/349643main_s125e010160_hires_full.jpg


Name: A Super-Efficient Particle Accelerator
Link: http://www.nasa.gov/images/content/364958main_rcw86_1920_full.jpg

...

Conclusions

This article showed how to scrape information from a web page whose content is partially generated by JavaScript code. Besides the three steps (explore, download, and parse) explained in the previous article, the use of a tool capable of executing that code, or of interacting with a real web browser, is fundamental to obtain, at any time, the real DOM tree whose information is being displayed to the user.

In the example, Windmill is used successfully to:

  • Open the main page.
  • Perform a check that makes sure the JavaScript code has been executed before scraping any data.
  • Click on the next control, that is, navigate through all the content just as a human user would do.
  • Get the updated DOM tree.

This is simple but powerful functionality that can be used to scrape a large number of web pages. As in any scraping task, the only maintenance that a script using this library needs is to keep track of the changes that the page creator may introduce in the future to improve the web page's look and feel.
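One way to make that maintenance less painful is to fail with an explicit message when a locator stops matching. The find_or_fail helper below is a hypothetical sketch, not part of the article's code:

    def find_or_fail(soup, **kwargs):
        # Wrap soup.find so that a page layout change produces a clear
        # error instead of an AttributeError further down the script
        element = soup.find(**kwargs)
        if element is None:
            raise RuntimeError('locator %r no longer matches; '
                               'the page layout may have changed' % (kwargs,))
        return element

For example, get_image_info could call find_or_fail(soup, id='caption_region') instead of calling soup.find directly.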


About the Author:

Javier Collado is a software developer and a test design engineer with extensive experience in high-availability telecommunications products. He also holds a position as an associate professor, which he enjoys a lot because it allows him to share and learn simultaneously.

Once a year, he takes a break and travels as far as possible to get to know different cultures.
