You're reading from Python Web Scraping - Second Edition (2nd Edition, published May 2017, ISBN-13 9781786462589).

Author: Katharine Jarmul

Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups that use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam).

Dynamic Content

According to a 2006 study by the United Nations, 73 percent of leading websites rely on JavaScript for important functionalities (refer to http://www.un.org/esa/socdev/enable/documents/execsumnomensa.doc). The growth and popularity of JavaScript frameworks such as React, AngularJS, and Ember (many of them model-view-controller, or MVC, designs), along with server-side JavaScript via Node.js, have only increased the importance of JavaScript as the primary engine for web page content.

The use of JavaScript can vary from simple form events to single-page apps that download the entire page content after loading. One consequence of this architecture is that the content may not be available in the original HTML, so the scraping techniques we've covered so far will not extract the important information from the site.

This chapter will cover two approaches to scraping data from dynamic JavaScript websites. These are as follows:

  • Reverse engineering JavaScript
  • Rendering JavaScript with a browser renderer (WebKit and Selenium)

An example dynamic web page

Let's look at an example dynamic web page. The example website has a search form, available at http://example.webscraping.com/search, which is used to locate countries. Let's say we want to find all the countries that begin with the letter A:

If we right-click on these results to inspect them with our browser tools (as covered in Chapter 2, Scraping the Data), we would find the results are stored within a div element with ID "results":

Let's try to extract these results using the lxml module, which was also covered in Chapter 2, Scraping the Data, and the Downloader class from Chapter 3, Caching Downloads:

>>> from lxml.html import fromstring
>>> from downloader import Downloader
>>> D = Downloader()
>>> html = D('http://example.webscraping.com/search')
>>> tree = fromstring(html)
>>> tree.cssselect('div#results a')
[]

The selector returns an empty list: the search results are loaded dynamically with JavaScript, so they are not present in the downloaded HTML.

Reverse engineering a dynamic web page

So far, we tried to scrape data from a web page the same way as introduced in Chapter 2, Scraping the Data. This method did not work because the data is loaded dynamically using JavaScript. To scrape this data, we need to understand how the web page loads the data, a process which can be described as reverse engineering.

Continuing the example from the preceding section, in our browser tools, if we click on the Network tab and then perform a search, we will see all of the requests made for a given page. There are a lot! If we scroll up through the requests, we see mainly photos (from loading country flags), and then we notice one with an interesting name: search.json with a path of /ajax:

If we click on that URL using Chrome, we can see more details (there is similar functionality for this in all major browsers, so your view may vary; however the main features should function...
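Once an AJAX endpoint like this is found, we can often call it directly and skip the HTML page entirely. The sketch below builds the search.json URL and extracts country names from its JSON response. Note the assumptions: the query parameter names (search_term, page, page_size) and the records/country fields in the response are inferred from what browser tools typically show for this endpoint, so verify them against your own Network tab before relying on them.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint path, based on the search.json request with a
# path of /ajax seen in the Network tab.
AJAX_URL = 'http://example.webscraping.com/ajax/search.json'


def build_search_url(term, page=0, page_size=10):
    """Build the URL the search form requests via AJAX."""
    params = {'search_term': term, 'page': page, 'page_size': page_size}
    return AJAX_URL + '?' + urlencode(params)


def parse_countries(body):
    """Extract country names from the JSON response body."""
    data = json.loads(body)
    return [record['country'] for record in data.get('records', [])]


if __name__ == '__main__':
    url = build_search_url('A')
    # A downloader such as the Downloader class from Chapter 3 can then
    # fetch this URL directly:
    #   response = D(url)
    #   countries = parse_countries(response)
    print(url)
```

Because the endpoint returns structured JSON, no HTML parsing is needed at all once the request format is understood.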

Rendering a dynamic web page

For the example search web page, we were able to quickly reverse engineer how the API worked and how to use it to retrieve the results in one request. However, websites can be very complex and difficult to understand, even with advanced browser tools. For example, if the website has been built with Google Web Toolkit (GWT), the resulting JavaScript code will be machine-generated and minified. This generated JavaScript code can be cleaned with a tool such as JS beautifier, but the result will be verbose and the original variable names will be lost, so it is difficult to understand and reverse engineer.

Additionally, higher-level frameworks such as React.js and other Node.js-based tools can further abstract already complex JavaScript logic, obfuscate data and variable names, and add more layers of API request security (by requiring cookies, browser sessions, and timestamps, or using other...

The Render class

To help make this functionality easier to use in the future, here are the methods we used, packaged into a class. The source code is also available at https://github.com/kjam/wswp/blob/master/code/chp5/browser_render.py:

import sys
import time

# These imports assume PyQt4; PySide exposes the same class names.
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl, QEventLoop, QTimer
from PyQt4.QtWebKit import QWebView


class BrowserRender(QWebView):
    def __init__(self, show=True):
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        if show:
            self.show()  # show the browser

    def download(self, url, timeout=60):
        """Wait for download to complete and return result"""
        loop = QEventLoop()
        timer = QTimer()
        timer.setSingleShot(True)
        timer.timeout.connect(loop.quit)
        self.loadFinished.connect(loop.quit)
        self.load(QUrl(url))
        timer.start(timeout * 1000)
        loop.exec_()  # delay here until download finished
        if timer.isActive():
            # downloaded...

Summary

This chapter covered two approaches to scraping data from dynamic web pages. It started with reverse engineering a dynamic web page using browser tools, and then moved on to using a browser renderer to trigger JavaScript events for us. We first used WebKit to build our own custom browser, and then reimplemented this scraper with the high-level Selenium framework.

A browser renderer can save the time needed to understand how the backend of a website works; however, there are some disadvantages. Rendering a web page adds overhead and is much slower than just downloading the HTML or using API calls. Additionally, solutions using a browser renderer often require polling the web page to check whether the resulting HTML has loaded, which is brittle and can fail when the network is slow.
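That polling can be isolated in a small, generic helper. The sketch below is a minimal version of the pattern (the function name and defaults are my own, not from the book), and it also makes the brittleness concrete: if the content takes longer than the timeout to appear, the caller simply gets None back.

```python
import time


def wait_until(predicate, timeout=10, poll_interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns the truthy value, or None on timeout. In a renderer-based
    scraper, predicate would inspect the rendered HTML for the expected
    element, for example:
        wait_until(lambda: tree.cssselect('div#results a'), timeout=30)
    where tree is re-parsed from the browser's current HTML on each call.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(poll_interval)  # avoid spinning at full speed
    return None  # timed out: content never appeared
```

On a slow network the deadline can pass before the JavaScript finishes, which is exactly the failure mode described above; higher-level tools such as Selenium ship equivalent explicit-wait utilities for this reason.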

I typically use a browser renderer for short-term solutions where the long-term performance and reliability is less important...

