You're reading from Python Web Scraping - Second Edition (2nd Edition, published May 2017, ISBN-13 9781786462589).

Author: Katharine Jarmul

Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups that use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam).

Dynamic Content

According to a 2006 study by the United Nations, 73 percent of leading websites rely on JavaScript for important functionalities (refer to http://www.un.org/esa/socdev/enable/documents/execsumnomensa.doc). The growth and popularity of JavaScript frameworks such as React, AngularJS, and Ember (many of them model-view-controller, or MVC, designs), along with server-side JavaScript via Node.js, have only increased the importance of JavaScript as the primary engine for web page content.

The use of JavaScript can vary from simple form events to single-page apps that download the entire page content after loading. One consequence of this architecture is that the content may not be available in the original HTML, so the scraping techniques we've covered so far will not extract the important information from the site.

This chapter will cover two approaches to scraping data from dynamic JavaScript websites. These are as follows:

  • Reverse engineering JavaScript
  • Rendering JavaScript with a browser renderer (WebKit and Selenium)

An example dynamic web page

Let's look at an example dynamic web page. The example website has a search form, available at http://example.webscraping.com/search, which is used to locate countries. Let's say we want to find all the countries that begin with the letter A:

If we right-click on these results to inspect them with our browser tools (as covered in Chapter 2, Scraping the Data), we would find the results are stored within a div element with ID "results":

Let's try to extract these results using the lxml module, which was also covered in Chapter 2, Scraping the Data, and the Downloader class from Chapter 3, Caching Downloads:

>>> from lxml.html import fromstring
>>> from downloader import Downloader
>>> D = Downloader()
>>> html = D('http://example.webscraping.com/search')
>>> tree = fromstring(html)
>>> tree.cssselect('div#results a')
[]

The selector returns an empty list: the search results are loaded dynamically with JavaScript, so they are not present in the downloaded HTML.

Reverse engineering a dynamic web page

So far, we tried to scrape data from a web page the same way as introduced in Chapter 2, Scraping the Data. This method did not work because the data is loaded dynamically using JavaScript. To scrape this data, we need to understand how the web page loads the data, a process which can be described as reverse engineering.

Continuing the example from the preceding section, in our browser tools, if we click on the Network tab and then perform a search, we will see all of the requests made for a given page. There are a lot! If we scroll up through the requests, we see mainly photos (from loading country flags), and then we notice one with an interesting name: search.json with a path of /ajax:

If we click on that URL using Chrome, we can see more details (there is similar functionality for this in all major browsers, so your view may vary; however the main features should function...
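Once an AJAX endpoint like this is found, we can often call it directly and skip the HTML page entirely. The sketch below builds the search.json URL and extracts country names from its JSON response. Note the assumptions: the query parameter names (search_term, page, page_size) and the records/country fields in the response are inferred from what browser tools typically show for this endpoint, so verify them against your own Network tab before relying on them.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint path, based on the search.json request with a
# path of /ajax seen in the Network tab.
AJAX_URL = 'http://example.webscraping.com/ajax/search.json'


def build_search_url(term, page=0, page_size=10):
    """Build the URL the search form requests via AJAX."""
    params = {'search_term': term, 'page': page, 'page_size': page_size}
    return AJAX_URL + '?' + urlencode(params)


def parse_countries(body):
    """Extract country names from the JSON response body."""
    data = json.loads(body)
    return [record['country'] for record in data.get('records', [])]


if __name__ == '__main__':
    url = build_search_url('A')
    # A downloader such as the Downloader class from Chapter 3 can then
    # fetch this URL directly:
    #   response = D(url)
    #   countries = parse_countries(response)
    print(url)
```

Because the endpoint returns structured JSON, no HTML parsing is needed at all once the request format is understood.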

Rendering a dynamic web page

For the example search web page, we were able to quickly reverse engineer how the API worked and how to use it to retrieve the results in one request. However, websites can be very complex and difficult to understand, even with advanced browser tools. For example, if the website has been built with Google Web Toolkit (GWT), the resulting JavaScript code will be machine-generated and minified. This generated JavaScript code can be cleaned with a tool such as JS beautifier, but the result will be verbose and the original variable names will be lost, so it is difficult to understand and reverse engineer.

Additionally, higher-level frameworks such as React.js and other Node.js-based tools can further abstract already complex JavaScript logic, obfuscate data and variable names, and add more layers of API request security (by requiring cookies, browser sessions, and timestamps, or using other...

The Render class

To help make this functionality easier to use in the future, here are the methods we used, packaged into a class. The source code is also available at https://github.com/kjam/wswp/blob/master/code/chp5/browser_render.py:

import sys
import time

# These imports assume PyQt4; PySide exposes the same class names.
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl, QEventLoop, QTimer
from PyQt4.QtWebKit import QWebView


class BrowserRender(QWebView):
    def __init__(self, show=True):
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        if show:
            self.show()  # show the browser

    def download(self, url, timeout=60):
        """Wait for download to complete and return result"""
        loop = QEventLoop()
        timer = QTimer()
        timer.setSingleShot(True)
        timer.timeout.connect(loop.quit)
        self.loadFinished.connect(loop.quit)
        self.load(QUrl(url))
        timer.start(timeout * 1000)
        loop.exec_()  # delay here until download finished
        if timer.isActive():
            # downloaded...

Summary

This chapter covered two approaches to scraping data from dynamic web pages. It started with reverse engineering a dynamic web page using browser tools, and then moved on to using a browser renderer to trigger JavaScript events for us. We first used WebKit to build our own custom browser, and then reimplemented this scraper with the high-level Selenium framework.

A browser renderer can save the time needed to understand how the backend of a website works; however, there are some disadvantages. Rendering a web page adds overhead and is much slower than just downloading the HTML or using API calls. Additionally, solutions using a browser renderer often require polling the web page to check whether the resulting HTML has loaded, which is brittle and can fail when the network is slow.
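That polling can be isolated in a small, generic helper. The sketch below is a minimal version of the pattern (the function name and defaults are my own, not from the book), and it also makes the brittleness concrete: if the content takes longer than the timeout to appear, the caller simply gets None back.

```python
import time


def wait_until(predicate, timeout=10, poll_interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns the truthy value, or None on timeout. In a renderer-based
    scraper, predicate would inspect the rendered HTML for the expected
    element, for example:
        wait_until(lambda: tree.cssselect('div#results a'), timeout=30)
    where tree is re-parsed from the browser's current HTML on each call.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(poll_interval)  # avoid spinning at full speed
    return None  # timed out: content never appeared
```

On a slow network the deadline can pass before the JavaScript finishes, which is exactly the failure mode described above; higher-level tools such as Selenium ship equivalent explicit-wait utilities for this reason.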

I typically use a browser renderer for short-term solutions where the long-term performance and reliability is less important...

