You're reading from Python Web Scraping Cookbook

Product type Book

Published in Feb 2018

Publisher Packt

ISBN-13 9781787285217

Pages 364 pages

Edition 1st Edition

Languages

Python

Concepts

Data Mining

Author (1):

Michael Heydt

Preface

The internet contains a wealth of data. This data is both provided through structured APIs as well as by content delivered directly through websites. While the data in APIs is highly structured, information found in web pages is often unstructured and requires collection, extraction, and processing to be of value. And collecting data is just the start of the journey, as that data must also be stored, mined, and then exposed to others in a value-added form.

With this book, you will learn many of the core tasks needed in collecting various forms of information from websites. We will cover how to collect it, how to perform several common data operations (including storage in local and remote databases), how to perform common media-based tasks such as converting images an videos to thumbnails, how to clean unstructured data with NTLK, how to examine several data mining and visualization tools, and finally core skills in building a microservices-based scraper and API that can, and will, be run on the cloud.

Through a recipe-based approach, we will learn independent techniques to solve specific tasks involved in not only scraping but also data manipulation and management, data mining, visualization, microservices, containers, and cloud operations. These recipes will build skills in a progressive and holistic manner, not only teaching how to perform the fundamentals of scraping but also taking you from the results of scraping to a service offered to others through the cloud. We will be building an actual web-scraper-as-a-service using common tools in the Python, container, and cloud ecosystems.

Who this book is for

This book is for those who want to learn to extract data from websites using the process of scraping and also how to work with various data management tools and cloud services. The coding will require basic skills in the Python programming language.

The book is also for those who wish to learn about a larger ecosystem of tools for retrieving, storing, and searching data, as well as using modern tools and Pythonic libraries to create data APIs and cloud services. You may also be using Docker and Amazon Web Services to package and deploy a scraper on the cloud.

What this book covers

Chapter 1, Getting Started with Scraping, introduces several concepts and tools for web scraping. We will examine how to install and do basic tasks with tools such as requests, urllib, BeautifulSoup, Scrapy, PhantomJS and Selenium.

Chapter 2, Data Acquisition and Extraction, is based on an understanding of the structure of HTML and how to find and extract embedded data. We will cover many of the concepts in the DOM and how to find and extract data using BeautifulSoup, XPath, LXML, and CSS selectors. We also briefly examine working with Unicode / UTF8.

Chapter 3, Processing Data, teaches you to load and manipulate data in many formats, and then how to store that data in various data stores (S3, MySQL, PostgreSQL, and ElasticSearch). Data in web pages is represented in various formats, the most common being HTML, JSON, CSV, and XML We will also examine the use of message queue systems, primarily AWS SQS, to help build robust data processing pipelines.

Chapter 4, Working with Images, Audio and other Assets, examines the means of retrieving multimedia items, storing them locally, and also performing several tasks such as OCR, generating thumbnails, making web page screenshots, audio extraction from videos, and finding all video URLs in a YouTube playlist.

Chapter 5, Scraping – Code of Conduct, covers several concepts involved in the legality of scraping, and practices for performing polite scraping. We will examine tools for processing robots.txt and sitemaps to respect the web host's desire for acceptable behavior. We will also examine the control of several facets of crawling, such as using delays, containing the depth and length of crawls, using user agents, and implementing caching to prevent repeated requests.

Chapter 6, Scraping Challenges and Solutions, covers many of the challenges that writing a robust scraper is rife with, and how to handle many scenarios. These scenarios are pagination, redirects, login forms, keeping the crawler within the same domain, retrying requests upon failure, and handling captchas.

Chapter 7, Text Wrangling and Analysis, examines various tools such as using NLTK for natural language processing and how to remove common noise words and punctuation. We often need to process the textual content of a web page to find information on the page that is part of the text and neither structured/embedded data nor multimedia. This requires knowledge of using various concepts and tools to clean and understand text.

Chapter 8, Searching, Mining, and Visualizing Data, covers several means of searching for data on the Web, storing and organizing data, and deriving results from the identified relationships. We will see how to understand the geographic locations of contributors to Wikipedia, finding relationships between actors on IMDB, and finding jobs on Stack Overflow that match specific technologies.

Chapter 9, Creating a Simple Data API, teaches us how to create a scraper as a service. We will create a REST API for a scraper using Flask. We will run the scraper as a service behind this API and be able to submit requests to scrape specific pages, in order to dynamically query data from a scrape as well as a local ElasticSearch instance.

Chapter 10, Creating Scraper Microservices with Docker, continues the growth of our scraper as a service by packaging the service and API in a Docker swarm and distributing requests across scrapers via a message queuing system (AWS SQS). We will also cover scaling of scraper instances up and down using Docker swarm tools.

Chapter 11, Making the Scraper as a Service Real, concludes by fleshing out the services crated in the previous chapter to add a scraper that pulls together various concepts covered earlier. This scraper can assist in analyzing job posts on StackOverflow to find and compare employers using specified technologies. The service will collect posts and allow a query to find and compare those companies.

To get the most out of this book

The primary tool required for the recipes in this book is a Python 3 interpreter. The recipes have been written using the free version of the Anaconda Python distribution, specifically version 3.6.1. Other Python version 3 distributions should work well but have not been tested.

The code in the recipes will often require the use of various Python libraries. These are all available for installation using pip and accessible using pip install. Wherever required, these installations will be elaborated in the recipes.

Several recipes require an Amazon AWS account. AWS accounts are available for the first year for free-tier access. The recipes will not require anything more than free-tier services. A new account can be created at https://portal.aws.amazon.com/billing/signup.

Several recipes will utilize Elasticsearch. There is a free, open source version available on GitHub at https://github.com/elastic/elasticsearch, with installation instructions on that page. Elastic.co also offers a fully capable version (also with Kibana and Logstash) hosted on the cloud with a 14-day free trial available at http://info.elastic.co (which we will utilize). There is a version for docker-compose with all x-pack features available at https://github.com/elastic/stack-docker, all of which can be started with a simple docker-compose up command.

Finally, several of the recipes use MySQL and PostgreSQL as database examples and several common clients for those databases. For those recipes, these will need to be installed locally. MySQL Community Server is available at https://dev.mysql.com/downloads/mysql/, and PostgreSQL can be found at https://www.postgresql.org/.

We will also look at creating and using docker containers for several of the recipes. Docker CE is free and is available at https://www.docker.com/community-edition.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packtpub.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Web-Scraping-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "This will loop through up to 20 characters and drop them into the sw index with a document type of people"

A block of code is set as follows:

from elasticsearch import Elasticsearch
import requests
import json

if __name__ == '__main__':
    es = Elasticsearch(
        [

Any command-line input or output is written as follows:

$ curl https://elastic:tduhdExunhEWPjSuH73O6yLS@7dc72d3327076cc4daf5528103c46a27.us-west-2.aws.found.io:9243

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email feedback@packtpub.com and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at questions@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packtpub.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.