You're reading from Web Scraping with Python

Product type Book

Published in Oct 2015

Publisher Packt

ISBN-13 9781782164364

Pages 174 pages

Edition 1st Edition

Languages

Python

Concepts

Data Mining

Author (1):

Richard Penman

Table of Contents (16) Chapters

Web Scraping with Python

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

1. Introduction to Web Scraping

2. Scraping the Data

3. Caching Downloads

4. Concurrent Downloading

5. Dynamic Content

6. Interacting with Forms

7. Solving CAPTCHA

8. Scrapy

9. Overview

Index

Chapter 9. Overview

This book has so far introduced scraping techniques using a custom website, which helped us focus on learning particular skills. Now, in this chapter, we will analyze a variety of real-world websites to show how these techniques can be applied. Firstly, we will use Google to show a real-world search form, then Facebook for a JavaScript-dependent website, Gap for a typical online store, and finally, BMW for a map interface. Since these are live websites, there is a risk that they will have changed by the time you read this. However, this is fine because the purpose of these examples is to show you how the techniques learned so far can be applied, rather than to show you how to scrape a particular website. If you choose to run an example, first check whether the website structure has changed since these examples were made and whether their current terms and conditions prohibit scraping.

Google search engine

According to the Alexa data used in Chapter 4, Concurrent Downloading, google.com is the world's most popular website, and conveniently, its structure is simple and straightforward to scrape.

Note

International Google

Google may redirect to a country-specific version, depending on your location. To use a consistent Google search wherever you are in the world, the international English version of Google can be loaded at http://www.google.com/ncr. Here, ncr stands for no country redirect.

Here is the Google search homepage loaded with Firebug to inspect the form:

We can see here that the search query is stored in an input named q, and that the form is submitted to the path /search set by the action attribute. We can test this by submitting a test search, which redirects to a URL like https://www.google.com/search?q=test&oq=test&es_sm=93&ie=UTF-8. The exact URL will depend on your browser and location. Also note that if you have...
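The form analysis above can be sketched in code. This is a minimal illustration, assuming Python 3 and only the two details named in the text (the input named q and the /search action path); the helper name build_search_url is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode, urljoin


def build_search_url(query, base='https://www.google.com/'):
    """Build the URL that submitting the search form would request."""
    # The form's action attribute is /search and its text input is named q,
    # so the query is URL-encoded and sent as the q parameter.
    return urljoin(base, '/search') + '?' + urlencode({'q': query})


print(build_search_url('test'))
# https://www.google.com/search?q=test
```

The extra parameters seen in the browser, such as oq and es_sm, are left out here because the text only identifies q as the field that holds the search query.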

Facebook

Currently, Facebook is the world's largest social network in terms of monthly active users, and therefore, its user data is extremely valuable.

The website

Here is an example Facebook page for Packt Publishing at https://www.facebook.com/PacktPub:

Viewing the source of this page, you would find that the first few posts are available, and that later posts are loaded with AJAX when the browser scrolls. Facebook also has a mobile interface, which, as mentioned in Chapter 1, Introduction to Web Scraping, is often easier to scrape. The same page using the mobile interface is available at https://m.facebook.com/PacktPub:

If we interacted with the mobile website and then checked Firebug, we would find that this interface uses a similar structure for its AJAX events, so it is not actually easier to scrape. These AJAX events can be reverse engineered; however, different types of Facebook pages use different AJAX calls, and from my past experience, Facebook often changes the structure of these...

Gap

Gap has a well-structured website with a Sitemap to help web crawlers locate its updated content. If we use the techniques from Chapter 1, Introduction to Web Scraping, to investigate the website, we would find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:

Sitemap: http://www.gap.com/products/sitemap_index.xml

Here are the contents of the linked Sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>

As shown here, this Sitemap link is just an index and contains links to other Sitemap files. These other Sitemap files then...
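Extracting the links from a Sitemap index like the one above could look as follows. This is a sketch rather than the book's own code: it assumes Python 3 and the standard library XML parser, the function name sitemap_links is hypothetical, and the sample is the index file shown above embedded as bytes so that nothing is downloaded.

```python
import xml.etree.ElementTree as ET

# Sitemap files declare this default namespace, so tag lookups must include it
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>"""


def sitemap_links(index_xml):
    """Return the URLs of the Sitemap files listed in a Sitemap index."""
    root = ET.fromstring(index_xml)
    # Each <sitemap> entry holds one <loc> element with the linked file's URL
    return [loc.text.strip()
            for loc in root.findall('sm:sitemap/sm:loc', SITEMAP_NS)]


for link in sitemap_links(SAMPLE_INDEX):
    print(link)
```

Each returned URL would then be downloaded and parsed in turn to reach the pages it lists.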

BMW

The BMW website has a search tool to find local dealerships, available at https://www.bmw.de/de/home.html?entryType=dlo:

This tool takes a location, and then displays the points near it on a map, such as this search for Berlin:

Using Firebug, we find that the search triggers this AJAX request:

https://c2b-services.bmw.com/c2b-localsearch/services/api/v3/
    clients/BMWDIGITAL_DLO/DE/
        pois?country=DE&category=BM&maxResults=99&language=en&
            lat=52.507537768880056&lng=13.425269635701511

Here, the maxResults parameter is set to 99. However, we can increase this to download all locations in a single query, a technique covered in Chapter 1, Introduction to Web Scraping. Here is the result when maxResults is increased to 1000:

>>> url = 'https://c2b-services.bmw.com/c2b-localsearch/services/api/v3/clients/BMWDIGITAL_DLO/DE/pois?country=DE&category=BM&maxResults=%d&language=en&lat=52.507537768880056&lng=13.425269635701511'
>...
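As an alternative to filling in a long URL template by hand, the query string can be assembled from its parts. This sketch assumes Python 3; the helper name build_pois_url is hypothetical, the parameter values are copied from the AJAX request shown above, and the endpoint itself may have changed since that request was captured. No request is sent here.

```python
from urllib.parse import urlencode

# Endpoint taken from the AJAX request shown above; it may no longer exist
BASE_URL = ('https://c2b-services.bmw.com/c2b-localsearch/services/api/v3/'
            'clients/BMWDIGITAL_DLO/DE/pois')


def build_pois_url(max_results,
                   lat='52.507537768880056', lng='13.425269635701511'):
    """Build the dealer search URL for a given maxResults value."""
    # Coordinates are kept as strings so their digits are preserved exactly
    params = {
        'country': 'DE',
        'category': 'BM',
        'maxResults': max_results,
        'language': 'en',
        'lat': lat,
        'lng': lng,
    }
    return BASE_URL + '?' + urlencode(params)


print(build_pois_url(1000))
```

Keeping the parameters in a dictionary makes it easy to change maxResults or the search coordinates without editing the URL string.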

Summary

This chapter analyzed a variety of prominent websites and demonstrated how the techniques covered in this book can be applied to them. We applied CSS selectors to scrape Google results, tested a browser renderer and an API to scrape Facebook pages, used a Sitemap to crawl Gap, and took advantage of an AJAX call to scrape all BMW dealers from a map.

You can now apply the techniques covered in this book to scrape websites that contain data of interest to you. I hope you enjoy this power as much as I have!