Web Scraping with Python

by Javier Collado | November 2008 | Open Source

Web scraping is the set of techniques used to automatically get information that is structured only for presentation purposes out of a website, instead of copying it manually. This article by Javier Collado will show how this can be done using Python in the steps that require some development.

To perform this task, three basic steps are usually followed:

  • Explore the website to find out where the desired information is located in the HTML DOM tree
  • Download as many web pages as needed
  • Parse downloaded web pages and extract the information from the places found in the exploration step

The exploration step is performed manually with the aid of some tools that make it easier to locate the information and reduce the development time of the next steps. The download and parsing steps are usually performed in an iterative cycle, since they are interrelated. This is because the next page to download may depend on a link or something similar in the current page, so not every web page can be downloaded without previously looking into the one before it.

This article will show an example covering the three steps mentioned and how this can be done using Python with some development. The code that will be displayed is guaranteed to work at the time of writing; however, it should be taken into account that it may stop working in the future if the presentation format changes. The reason is that web scraping depends on the DOM tree being stable enough; that is to say, as happens with regular expressions, it will keep working for slight changes in the information being parsed. However, when the presentation format is completely changed, the web scraping scripts have to be modified to match the new DOM tree.

Explore

Let's say you are a fan of the Packt Publishing article network and that you want to keep a list of the titles of all the articles that have been published so far, together with the links to them. First of all, you will need to connect to the main article network page (http://www.packtpub.com/article-network) and start exploring the web page to get an idea of where the information you want to extract is located.

Many ways are available to perform this task, such as viewing the source code directly in your browser, or downloading it and inspecting it with your favorite editor. However, HTML pages often contain auto-generated code and are not as readable as they should be, so using a specialized tool might be quite helpful. In my opinion, the best one for this task is the Firebug add-on for the Firefox browser.

With this add-on, instead of carefully searching the code for some string, all you have to do is press the Inspect button, move the pointer over the area you are interested in, and click. After that, the HTML code for the marked area and the location of its tag in the DOM tree are clearly displayed. For example, the links to the different pages containing all the articles are located inside a right tag and, in every page, the links to the articles are contained as list items in an unordered list. In addition to this, the article URLs, as you have probably noticed while reading other articles, start with http://www.packtpub.com/article/

So, our scraping strategy will be:

  • Get the list of links to all the pages containing articles
  • Follow every link so as to extract the article information from all the pages

One small optimization here is that the main article network page is the same as the one pointed to by the first page link, so we will take this into account to avoid loading the same page twice when we develop the code.

Download

Before parsing any web page, the contents of that page must be downloaded. As usual, there are many ways to do this:

  • Creating your own HTTP requests using the urllib2 standard Python library (a minimal sketch of this approach follows the list)
  • Using a more advanced library, such as mechanize, that provides the capability to navigate through a website simulating a browser
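
As a rough illustration of the first option, a minimal sketch that uses only urllib2 (leaving out error handling, headers, and retries) could look like this:

import urllib2

BASE_URL = "http://www.packtpub.com/article-network"
response = urllib2.urlopen(BASE_URL)   # plain HTTP GET request
data = response.read()                 # page contents as a string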

In this article, mechanize will be covered, as it is the easiest choice. mechanize is a library that provides a Browser class that lets the developer interact with a website in a similar way to a real browser. In particular, it provides methods to open pages, follow links, change form data, and submit forms.
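
For instance, filling in and submitting a form with mechanize would look roughly like the following sketch, where the URL, form name, and field name are only hypothetical placeholders:

>>> br = mechanize.Browser()
>>> br.open("http://www.example.com/search")   # hypothetical page containing a form
>>> br.select_form(name="search_form")         # hypothetical form name
>>> br["query"] = "python"                     # hypothetical text field name
>>> response = br.submit()

Our scraping task does not need any form handling, though, so only page opening and link following will be used in what follows.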

Recalling the scraping strategy described above, the first thing we would like to do is download the main article network web page. To do that, we will create a Browser class instance and then open the main article network page:

>>> import mechanize
>>> BASE_URL = "http://www.packtpub.com/article-network"
>>> br = mechanize.Browser()
>>> data = br.open(BASE_URL).get_data()
>>> links = scrape_links(BASE_URL, data)

The result of the open method is an HTTP response object, and the get_data method returns the contents of the web page. The scrape_links function will be explained later. For now, as pointed out in the introduction, bear in mind that the downloading and parsing steps are usually performed iteratively, since the contents to be downloaded next often depend on the parsing of some initial contents, as happens in this case.
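
The response object mirrors the interface of urllib2 responses, so other details of the HTTP exchange can also be inspected if needed; a small sketch, assuming the same br instance and BASE_URL as above:

>>> response = br.open(BASE_URL)
>>> response.geturl()           # final URL, after any redirects
>>> response.info()             # HTTP headers of the response
>>> data = response.get_data()  # page contents as a string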

Now, let's assume that we have all the links and that we just want to get every page pointed to by each link and parse it. With mechanize, it would be done this way:

>>> for link in links:
...     data = br.follow_link(link).get_data()
...     scrape_articles(data)
...     br.back()

As with open in the previous piece of code, the follow_link method returns an HTTP response object, the get_data method returns the contents of the web page, and the scrape_articles function will be explained later. Also note that, once a link has been explored, the back method is used to go back in the browser history so as to follow the next link from the main article page, just as would be done with a real browser.

Up to this point, the content downloading problem is solved, so let's implement the scraping functions to complete the job.

Parse

Once the HTML content is available as a string, it has to be parsed so that we can navigate the DOM tree and extract the information that was located in the exploration step. Again, there are many libraries that are very useful for this task; however, BeautifulSoup is the best-known one and the one that we'll be using in this article.

BeautifulSoup is a library that takes care of HTML parsing and returns a soup object that can be used to navigate the DOM tree. Its main features, illustrated in a short sketch after this list, are:

  • Accessing tags as if they were members of the soup object
  • Finding tags whose name, contents, or attributes match some selection criteria
  • Accessing tag attributes using dictionary-like syntax
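
The sketch below shows those three features on a tiny hand-written HTML fragment; the tag names here are only an example, not the actual structure of the article network page:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<ul><li><a href='/article/example'>Example</a></li></ul>")
>>> first_anchor = soup.ul.li.a        # tags accessed as members of the soup object
>>> all_anchors = soup.findAll('a')    # tags found by name (attributes can be matched too)
>>> url = first_anchor['href']         # attributes read with dictionary-like syntax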

Continuing with our scraping task, the code to create a soup object based on some HTML data is the following:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(data)

Assuming that we already have the soup for the main article network page, the expression that finds all the links inside the first right tag, as located in the exploration step, is (scrape_links):

>>> soup.right.findAll("a")

Similarly, the code that gets all the article links contained in list item tags in a given page is the following (scrape_articles):

>>> ARTICLE_URL_PREFIX = 'http://www.packtpub.com/article/'
>>> [anchor
...  for anchor in [li.a for li in soup.findAll('li')]
...  if anchor['href'].startswith(ARTICLE_URL_PREFIX)]

Here it's important to note that some filtering has been applied based on the observation, made in the exploration step, that the URLs of the articles start with a common prefix.

Once all the article link tags have been collected, extracting the title and URL is just a matter of getting the contents from the tag object, anchor.string, and the href attribute, anchor['href'], respectively.

Code and Results

The complete code that performs the scraping and prints a simple report to the standard output is:

#!/usr/bin/python
"""
Scrape the packtpub.com article network
"""
import mechanize
from BeautifulSoup import BeautifulSoup

def scrape_links(base_url, data):
    """
    Scrape links pointing to article pages
    """
    soup = BeautifulSoup(data)

    # Create mechanize links to be used
    # later by the mechanize.Browser instance
    links = [mechanize.Link(base_url=base_url,
                            url=str(anchor['href']),
                            text=str(anchor.string),
                            tag=str(anchor.name),
                            attrs=[(str(name), str(value))
                                   for name, value in anchor.attrs])
             for anchor in soup.right.findAll("a")]

    return links

def scrape_articles(data):
    """
    Scrape the title and URL of all the articles in this page
    """
    # The URL prefix is used to filter out other links
    # such as the ones pointing to books
    ARTICLE_URL_PREFIX = 'http://www.packtpub.com/article/'

    soup = BeautifulSoup(data)
    articles = [{'title': str(anchor.string),
                 'url': str(anchor['href'])}
                for anchor in [li.a for li in soup.findAll('li')]
                if anchor['href'].startswith(ARTICLE_URL_PREFIX)]

    return articles

def main():
    """
    Get the article network main page and follow the links
    to get the whole list of articles available
    """
    articles = []

    # Get the main page and the links to all article pages
    BASE_URL = "http://www.packtpub.com/article-network"
    br = mechanize.Browser()
    data = br.open(BASE_URL).get_data()
    links = scrape_links(BASE_URL, data)

    # Scrape articles in the main page
    articles.extend(scrape_articles(data))

    # Scrape articles in the linked pages
    for link in links[1:]:
        data = br.follow_link(link).get_data()
        articles.extend(scrape_articles(data))
        br.back()

    # Output is the list of titles and URLs for each article found
    print ("Article Network\n"
           "---------------")
    print "\n\n".join(['Title: "%(title)s"\nURL: "%(url)s"' % article
                       for article in articles])

if __name__ == "__main__":
    main()

Here, some changes have been made to the code from the previous sections to create link objects as expected by the browser instance, and to prevent it from downloading the main article network page twice.

A fragment of the output that, at the time of writing, can be obtained by executing the code above is the following:

Article Network
---------------
Title: "Python Data Persistence using MySQL"
URL: "http://www.packtpub.com/article/python-data-persistence-using-mysql"

Title: "Business Blogging On The Up - Technorati State of the Blogosphere 2008"
URL: "http://www.packtpub.com/article/business-blogging-technorati-state-of-the-blogosphere-2008"

Title: "Chatroom Application using DWR Java Framework"
URL: "http://www.packtpub.com/article/chatroom-application-using-dwr-java-framework"

Conclusions

In this article, it has been shown how to scrape a website using well-known libraries by following three simple steps. Let's review them:

Explore

This is an iterative process in which the information to be extracted is located. At first, only a general idea about how it is distributed across the web pages of the site is needed; this is a simple task that can be performed with just a browser. Later, the precise location in the HTML tree will be needed, and for this a specialized tool, such as Firebug, is recommended.

Download

Once the structure of the web pages is known, i.e. where the information is stored or, more precisely, how it is linked together, it is time to download the pages. Usually, not all of them can be downloaded in a single step; some preliminary parsing is needed to get the links that lead to the next pages and to follow them appropriately. One tool that does a good job in this area is a browser simulator, such as mechanize, because it allows the programmer to write code intuitively, following the same steps that were followed with the real browser in the explore phase.

Parse

Once the content is available, extracting the information is just a matter of using a good parsing library and getting the same tags that were identified in the explore step. BeautifulSoup is a mature library that can help to perform this task quickly.

About the Author

Javier Collado is a software developer and a test design engineer with extensive experience in high availability telecommunications products. He also holds a position as an associate professor, which he enjoys a lot because it allows him to share and learn simultaneously.

Once a year, he takes a break and travels as far as possible to get to know different cultures.
