Web Scraping with Python
Web scraping is the set of techniques used to automatically get information from a website, where it is structured only for presentation purposes, instead of copying it manually. This article by Javier Collado shows how this can be done with Python in the steps that require some development. To perform this task, three basic steps are usually followed: explore the website to locate the information, download the pages that contain it, and parse them to extract it.
The exploration step is performed manually with the aid of tools that make it easier to locate the information and reduce the development time of the next steps. The download and parsing steps are usually performed in an iterative cycle since they are interrelated: the next page to download may depend on a link in the current page, so not every web page can be downloaded without first looking into an earlier one.

This article shows an example covering the three steps and how they can be carried out in Python. The code shown is guaranteed to work at the time of writing; however, it may stop working in the future if the presentation format changes. The reason is that web scraping depends on the DOM tree being stable enough. As happens with regular expressions, it will keep working after slight changes in the information being parsed, but when the presentation format changes completely, the web scraping scripts have to be modified to match the new DOM tree.

Explore

Let's say you are a fan of the Packt Publishing article network and that you want to keep a list of the titles of all the articles published so far, together with the link to each of them. First of all, you will need to connect to the main article network page (http://www.packtpub.com/article-network) and explore it to get an idea of where the information that you want to extract is located.

Many ways are available to perform this task, such as viewing the source code directly in your browser or downloading it and inspecting it with your favorite editor. However, HTML pages often contain auto-generated code and are not as readable as they should be, so a specialized tool can be quite helpful. In my opinion, the best one for this task is the Firebug add-on for the Firefox browser.
With this add-on, instead of scanning the code for some string, all you have to do is press the Inspect button, move the pointer to the area in which you are interested and click. After that, the HTML code for the marked area and the location of the tag in the DOM tree will be clearly displayed. For example, the links to the different pages containing all the articles are located inside a right tag,
and, in every page, the links to the articles are contained as list items in an unnumbered list. In addition to this, the link URLs, as you have probably noticed while reading other articles, start with http://www.packtpub.com/article/
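The observation about the common URL prefix can be expressed as a small filter. The following is only a minimal sketch in Python 3: the helper name is_article_url and the sample URLs are hypothetical and used purely for illustration, while the prefix itself is the one noted above.

```python
# Prefix observed in the exploration step (taken from the article text).
ARTICLE_URL_PREFIX = 'http://www.packtpub.com/article/'


def is_article_url(url):
    """Return True when the URL looks like a link to an article."""
    return url.startswith(ARTICLE_URL_PREFIX)


# Hypothetical sample URLs, used only to exercise the filter.
sample_urls = [
    'http://www.packtpub.com/article/example-article',
    'http://www.packtpub.com/article-network',
    'http://www.packtpub.com/books',
]

# Keep only the URLs that point to articles.
article_urls = [url for url in sample_urls if is_article_url(url)]
print(article_urls)
```

Note that the main article network page is not matched, because its URL continues with "-network" rather than with a slash.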
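So, our scraping strategy will be to get, from the main article network page, the links to the pages that contain the articles, and then to follow each of those links and collect, from the list items in every page, the links whose URL starts with the article prefix.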
One small optimization here is that the main article network page is the same as the one pointed to by the first page link, so we will take this into account to avoid loading the same page twice when we develop the code.

Download

Before parsing any web page, the contents of that page must be downloaded. As usual, there are many ways to do this.
In this article, mechanize will be covered as it is the easiest choice. mechanize is a library that provides a Browser class that lets the developer interact with a website in a similar way to a real browser. In particular, it provides methods to open pages, follow links, change form data and submit forms.

Recalling the scraping strategy from the previous section, the first thing we would like to do is download the main article network web page. To do that, we will create a Browser class instance and then open the main article network page:

>>> import mechanize

The result of the open method is an HTTP response object, and its get_data method returns the contents of the web page. The scrape_links function will be explained later. For now, as pointed out in the introduction, bear in mind that the downloading and parsing steps are usually performed iteratively, since some of the contents to be downloaded depend on the parsing of some initial contents, as in this case.
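For readers who cannot install mechanize, the download step can also be approximated with the Python 3 standard library. The following is a minimal sketch using urllib.request, not the mechanize API described above; the function name download and the sample data: URL are assumptions made only so that the example runs without a network connection.

```python
import urllib.request


def download(url):
    """Return the contents of the page at url as text."""
    with urllib.request.urlopen(url) as response:
        # Use the charset declared in the response, or fall back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


# A data: URL stands in for a real page so the sketch runs offline.
# For the site in this article the argument would be
# 'http://www.packtpub.com/article-network'.
SAMPLE_URL = 'data:text/html,%3Cp%3Ehello%3C%2Fp%3E'
print(download(SAMPLE_URL))
```

Unlike a browser simulator, this approach keeps no history and has no notion of following links, so the caller has to keep track of which URLs to visit next.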
Now, let's assume that we have all the links and that we just want to get every page pointed to by every link and parse it. With mechanize it would be done this way:

>>> for link in links:

As with open in the previous piece of code, the follow_link method returns an HTTP response object, the get_data method returns the contents of the web page, and the scrape_articles function will be explained later. Also note that when one link has been explored, the back method is used to go back in the browser history so as to follow the next link from the main article page, as would be done with a real browser.

Up to this point, the content downloading problem is solved, so let's implement the scraping functions and the scraping will be complete.

Parse

Once the HTML content is available as a string, it has to be parsed in order to navigate through the DOM tree and extract the information that was located in the exploration step. Again, there are many libraries that are very useful for this task. However, BeautifulSoup is the best-known library and the one that we'll be using in this article. BeautifulSoup takes care of HTML parsing and returns a soup object that can be used to navigate the DOM tree. The main functions are:
Continuing with our scraping task, the code to create a soup object based on some HTML data is the following:

>>> from BeautifulSoup import BeautifulSoup

Assuming that we already have the soup for the main article network page, the expression that finds all the links inside the first right tag, as found in the exploration step, is (scrape_links):

>>> soup.right.findAll("a")

Similarly, the code that gets all the links in a web page that holds some articles inside list item tags is the following (scrape_articles):

>>> ARTICLE_URL_PREFIX = 'http://www.packtpub.com/article/'

Here it's important to note that some filtering has been applied based on the observation, made in the exploration step, that the URLs of the articles start with a common prefix. Once all the article link tags have been collected, extracting the title and URL is just a matter of getting the contents of the tag object, anchor.string, and the href attribute, anchor['href'], respectively.

Code and Results

The complete code that performs the scraping and prints a simple report to the standard output is:

#!/usr/bin/python

Here some changes have been made to the code from previous sections to create link objects as expected by the browser instance and to prevent it from downloading the main article network page twice. A fragment of the output that, at the time of writing, can be obtained by executing the code above is the following:

Article Network

Conclusions

In this article it has been shown how to scrape a web site using well-known libraries following three simple steps. Let's review them:

Explore: This is an iterative process in which the information that is to be extracted is located. At first, only a general idea about how it is distributed across the web pages of the site is needed. This is a simple task that can be performed with just a browser. Later, the precise location in the HTML tree will be needed. To find it, a specialized tool, such as Firebug, is recommended.
Download: Once the structure of the web pages is known, i.e. where the information is stored or, more precisely, how it is linked together, it is time to download the pages. Usually, not all of them can be downloaded in a single step; some preliminary parsing is needed to get the links that lead to the next pages and follow them appropriately. One tool that does a good job in this area is a browser simulator, such as mechanize, because it allows the programmer to write code intuitively, following the same steps that were followed while using the real browser in the explore phase.

Parse: Once the content is available, extracting the information is just a matter of using a good parsing library and getting the same tags that were identified in the explore step. BeautifulSoup is a mature library that can help to perform this task quickly.
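As a compact illustration of the parse step, the extraction of article titles and URLs can be sketched with the html.parser module from the Python 3 standard library. This is a swapped-in technique, not the BeautifulSoup code used in the article; the class name ArticleLinkParser and the sample HTML fragment are hypothetical and only mimic the structure found in the explore step (article links inside list items, with URLs sharing a common prefix).

```python
from html.parser import HTMLParser

# Prefix observed in the exploration step (taken from the article text).
ARTICLE_URL_PREFIX = 'http://www.packtpub.com/article/'


class ArticleLinkParser(HTMLParser):
    """Collect (title, url) pairs for article links found inside list items."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._li_depth = 0   # number of <li> tags currently open
        self._href = None    # href of the article link being read, if any
        self._text = []      # text pieces collected for that link

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self._li_depth += 1
        elif tag == 'a' and self._li_depth > 0:
            href = dict(attrs).get('href') or ''
            # Same filtering idea as in the article: keep article URLs only.
            if href.startswith(ARTICLE_URL_PREFIX):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            title = ''.join(self._text).strip()
            self.articles.append((title, self._href))
            self._href = None
        elif tag == 'li' and self._li_depth > 0:
            self._li_depth -= 1


# Hypothetical HTML fragment imitating a page with a list of articles.
SAMPLE_HTML = """
<ul>
  <li><a href="http://www.packtpub.com/article/first-example">First Example</a></li>
  <li><a href="http://www.packtpub.com/books">Books</a></li>
</ul>
<a href="http://www.packtpub.com/article/outside-list">Outside the list</a>
"""

parser = ArticleLinkParser()
parser.feed(SAMPLE_HTML)
for title, url in parser.articles:
    print(title, '-', url)
```

The event-based parser needs explicit state to know whether a link sits inside a list item, which is exactly the bookkeeping that a tree-navigation library such as BeautifulSoup hides from the programmer.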
About the Author

Javier Collado is a software developer and a test design engineer with extensive experience in high availability telecommunications products. He also holds a position as an associate professor, which he enjoys a lot because it allows him to share and learn simultaneously. Once a year, he takes a break and travels as far as possible to get to know different cultures.