Python Web Scraping Cookbook


Product type Book
Published in Feb 2018
Publisher Packt
ISBN-13 9781787285217
Pages 364 pages
Edition 1st Edition
Author: Michael Heydt

Table of Contents (13 chapters)

  • Preface
  • Getting Started with Scraping
  • Data Acquisition and Extraction
  • Processing Data
  • Working with Images, Audio, and other Assets
  • Scraping - Code of Conduct
  • Scraping Challenges and Solutions
  • Text Wrangling and Analysis
  • Searching, Mining and Visualizing Data
  • Creating a Simple Data API
  • Creating Scraper Microservices with Docker
  • Making the Scraper as a Service Real
  • Other Books You May Enjoy

Working with Images, Audio, and other Assets

In this chapter, we will cover:

  • Downloading media content from the web
  • Parsing a URL with urllib to get the filename
  • Determining the type of content for a URL
  • Determining the file extension from a content type
  • Downloading and saving images to the local file system
  • Downloading and saving images to S3
  • Generating thumbnails for images
  • Taking website screenshots with Selenium
  • Taking a website screenshot with an external service
  • Performing OCR on images with pytesseract
  • Creating a Video Thumbnail
  • Ripping an MP4 video to an MP3

Introduction

A common practice in scraping is the download, storage, and further processing of media content (that is, content other than web pages or data files). This media can include images, audio, and video. To store the content correctly, whether locally or in a service like S3, we need to know the type of the media, and it's not enough to trust the file extension in the URL. We will learn how to download the media and correctly represent its type based on information from the web server.

Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques for generating thumbnails and taking screenshots of website pages. These are often used on a new website as thumbnail links to the scraped media that is now stored locally.

Finally, there is often a need to transcode media, such as converting non-MP4 videos...

Downloading media content from the web

Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content.
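As a rough illustration of that idea, the following sketch downloads a URL with Requests and returns the raw bytes. It is not the book's URLUtility class; the `download` function and its `session` parameter are hypothetical names introduced here so the function can also be exercised with a stand-in object.

```python
def download(url, session=None, timeout=10):
    """Return the raw bytes served at `url`.

    `session` is a hypothetical hook: anything with a Requests-style
    get(url, timeout=...) method. When omitted, the requests module is used.
    """
    if session is None:
        import requests  # imported lazily so a stand-in can be passed instead
        session = requests
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
    return response.content      # bytes, suitable for images, audio, or video
```

Calling `download("https://apod.nasa.gov/apod/image/1709/BT5643s.jpg")` would return the image bytes, assuming the URL is still reachable.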

Getting ready

There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of this chapter's scenarios for downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file.

How to do it

...

Parsing a URL with urllib to get the filename

When downloading content from a URL, we often want to save it in a file. Often it is good enough to save the content under the name found in the URL. But a URL consists of a number of components, so how can we find the actual filename in the URL, especially when there are often many parameters after the filename?
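One way to answer that question with the standard library alone is sketched below. The helper name `filename_from_url` is an assumption made for illustration and is not part of URLUtility.

```python
from posixpath import basename
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlparse(url).path       # drops scheme, host, ?query, and #fragment
    return unquote(basename(path))  # final segment, with %XX escapes decoded
```

A URL whose path ends in a slash yields an empty string, so callers would need a fallback name in that case.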

Getting ready

We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py.

How to do it

Execute the recipe's file with your Python interpreter. It...

Determining the type of content for a URL

When performing a GET request for content from a web server, the web server returns a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe, we learn to use that header to determine what the web server considers the type of the content to be.
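The header in question is Content-Type. The sketch below shows one way to pull the bare media type out of a mapping of response headers (for example, `response.headers` from Requests), stripping parameters such as `charset`. The function name `media_type` is hypothetical.

```python
from email.message import Message


def media_type(headers):
    """Return the media type (e.g. 'image/jpeg') from response headers, or None."""
    # Header names are case-insensitive, so match the key without regard to case.
    raw = next((v for k, v in headers.items() if k.lower() == "content-type"), None)
    if raw is None:
        return None
    message = Message()
    message["Content-Type"] = raw
    return message.get_content_type()  # lowercased 'maintype/subtype', no parameters
```

Note that this reports what the server claims; a misconfigured server can still send the wrong type.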

Getting ready

We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.

How to do it

We proceed as follows:

  1. Execute the script for the recipe. It contains the following code...

Determining the file extension from a content type

It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file.
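Python's standard mimetypes module can perform this mapping. The following is a minimal sketch, not the implementation behind URLUtility; `extension_for` and the `.bin` fallback are assumptions made for illustration.

```python
import mimetypes


def extension_for(content_type, default=".bin"):
    """Map a Content-Type value to a file extension, falling back to `default`."""
    # Drop parameters such as '; charset=utf-8' before the lookup.
    bare_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(bare_type) or default
```

The exact extension returned for some types (for example, image/jpeg) can differ between Python versions and platforms, so it is worth checking the result on your own system.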

Getting ready

We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py.

How to do it

Proceed by running the recipe's script.

An extension for the media type can be found using the .extension property:

util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: ...

Downloading and saving images to the local file system

Sometimes when scraping we just download and parse data, such as HTML, to extract some data, and then throw out what we read. Other times, we want to keep the downloaded content by storing it as a file.

How to do it

The code example for this recipe is in the 04/05_save_image_as_file.py file. The important portion of the file is:

# download the image
item = URLUtility(const.ApodEclipseImage())

# create a file writer to write the data
FileBlobWriter(expanduser("~")).write(item.filename, item.data)

Run the script with your Python interpreter and you will get the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Attempting...
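The source of FileBlobWriter is not shown in this excerpt. As a rough idea of what a writer with the same `write(filename, data)` shape might look like, here is a hypothetical `SimpleFileBlobWriter`; it is a sketch, not the book's class.

```python
from pathlib import Path


class SimpleFileBlobWriter:
    """Hypothetical stand-in: writes byte blobs into a single directory."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def write(self, filename, data):
        self.directory.mkdir(parents=True, exist_ok=True)
        # Keep only the final name component so a crafted name such as
        # '../x.jpg' cannot escape the target directory.
        target = self.directory / Path(filename).name
        target.write_bytes(data)
        return target
```

With this sketch, `SimpleFileBlobWriter(expanduser("~")).write(item.filename, item.data)` would mirror the call in the recipe.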

Downloading and saving images to S3

We saw how to write content into S3 in Chapter 3, Processing Data. Here we extend that process by implementing the IBlobWriter interface to write to S3.

Getting ready

The code example for this recipe is in the 04/06_save_image_in_s3.py file. Also ensure that you have set your AWS keys as environment variables so that Boto can authenticate the script.

How to do it

We proceed as follows:

  1. Run the recipe's script. It will execute the following:
# download the image
item = URLUtility(const.ApodEclipseImage())

# store...
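The rest of the script is not shown here. The sketch below illustrates the general pattern of a blob-writer interface with an S3 implementation. It assumes boto3 and its `put_object` call; `BlobWriter`, `S3BlobWriter`, and the optional `client` parameter are hypothetical names, and the book's IBlobWriter may differ.

```python
from abc import ABC, abstractmethod


class BlobWriter(ABC):
    """Hypothetical interface: anything that can store named bytes."""

    @abstractmethod
    def write(self, filename, data):
        ...


class S3BlobWriter(BlobWriter):
    def __init__(self, bucket, client=None):
        if client is None:
            import boto3  # assumption: boto3 is installed
            client = boto3.client("s3")  # credentials are read from the environment
        self.bucket = bucket
        self.client = client

    def write(self, filename, data, content_type=None):
        # Passing ContentType lets S3 serve the object with the right media type.
        extra = {"ContentType": content_type} if content_type else {}
        self.client.put_object(Bucket=self.bucket, Key=filename, Body=data, **extra)
```

Accepting the client as a parameter keeps the class usable with a stub, which is convenient when AWS credentials are not available.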

Generating thumbnails for images

Many times when downloading an image, you do not want to save the full image, but only a thumbnail. Or you may want to save both the full-size image and a thumbnail. Thumbnails can be easily created in Python using the Pillow library. Pillow is a fork of the Python Imaging Library, and contains many useful functions for manipulating images. You can find more information on Pillow at https://python-pillow.org. In this recipe, we use Pillow to create an image thumbnail.

Getting ready

The script for this recipe is 04/07_create_image_thumbnail.py. It uses the Pillow library, so make sure you have installed Pillow into your environment with pip or other package management tools:

pip install pillow...
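With Pillow installed, a thumbnail can be produced from downloaded bytes roughly as follows. This is a sketch rather than the recipe's script; the function name `make_thumbnail`, the default size, and the choice of PNG output are assumptions.

```python
from io import BytesIO

from PIL import Image


def make_thumbnail(image_bytes, max_size=(200, 200)):
    """Return PNG bytes of the image scaled to fit within `max_size`."""
    with Image.open(BytesIO(image_bytes)) as image:
        # thumbnail() resizes in place and preserves the aspect ratio.
        image.thumbnail(max_size)
        output = BytesIO()
        image.save(output, format="PNG")
    return output.getvalue()
```

Because `thumbnail()` never enlarges an image, inputs already smaller than `max_size` keep their original dimensions.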

Taking a screenshot of a website

A common scraping task is to create a screenshot of a website. In Python, we can create a screenshot using Selenium and WebDriver.

Getting ready

The script for this recipe is 04/08_create_website_screenshot.py. Also, make sure you have selenium in your path and have installed the Python library.

How to do it

Run the script for the recipe. The code in the script is the following:

from core.website_screenshot_generator import WebsiteScreenshotGenerator
from core.file_blob_writer import FileBlobWriter
from os.path import expanduser

# get the...
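The remainder of the script and the WebsiteScreenshotGenerator class are not shown in this excerpt. As a general illustration of taking a screenshot with Selenium, the sketch below assumes headless Chrome rather than PhantomJS; `capture_screenshot` and `make_headless_chrome` are hypothetical helpers.

```python
def capture_screenshot(driver, url, width=1280, height=800):
    """Load `url` in an existing WebDriver and return the page as PNG bytes."""
    driver.set_window_size(width, height)  # controls the size of the captured viewport
    driver.get(url)
    return driver.get_screenshot_as_png()


def make_headless_chrome():
    """Create a headless Chrome driver (assumes Chrome and its driver are installed)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)
```

A typical call would create the driver with `make_headless_chrome()`, pass it to `capture_screenshot`, and call `driver.quit()` in a `finally` block so the browser process is always released.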

Taking a screenshot of a website with an external service

The previous recipe used selenium, webdriver, and PhantomJS to create the screenshot. This obviously requires having those packages installed. If you don't want to install those and still want to make website screenshots, then you can use one of a number of web services that can take screenshots. In this recipe, we will use the service at www.screenshotapi.io to create a screenshot.

Getting ready

First, head over to www.screenshotapi.io and sign up for a free account:

Screenshot of the free account sign up

Once your account is created, proceed to get an API key. This will be needed to authenticate against their service:

The API Key
...

Performing OCR on an image with pytesseract

It is possible to extract text from within images using the pytesseract library. In this recipe, we will use pytesseract to extract text from an image. Tesseract is an open source OCR library sponsored by Google. The source is available here: https://github.com/tesseract-ocr/tesseract, and you can also find more information on the library there. pytesseract is a thin Python wrapper that provides a Pythonic API to the executable.

Getting ready

Make sure you have pytesseract installed:

pip install pytesseract

You will also need to install tesseract-ocr. On Windows, there is an executable installer, which you can get here: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with...

Creating a Video Thumbnail

You might want to create a thumbnail for a video that you downloaded from a website. These could be used on a page that shows a number of video thumbnails and lets you click on them to watch the specific video.

Getting ready

This sample will use a tool known as ffmpeg. ffmpeg is available at www.ffmpeg.org. Download and install as per the instructions for your operating system.

How to do it

The example script is in 04/11_create_video_thumbnail.py. It consists of the following code:

import subprocess
video_file = 'BigBuckBunny.mp4'...
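The rest of the script is cut off here. One plausible way to drive ffmpeg from Python for this purpose is sketched below; the function names, the 10-second offset, and the 320-pixel width are assumptions for illustration, not the recipe's values.

```python
import subprocess


def thumbnail_command(video_file, thumbnail_file, at="00:00:10", width=320):
    """Build an ffmpeg command that grabs one frame at `at`, scaled to `width`."""
    return [
        "ffmpeg",
        "-y",                        # overwrite the output file if it exists
        "-ss", at,                   # seek to the timestamp before reading input
        "-i", video_file,
        "-frames:v", "1",            # emit a single video frame
        "-vf", f"scale={width}:-1",  # set the width, keep the aspect ratio
        thumbnail_file,
    ]


def create_video_thumbnail(video_file, thumbnail_file, **options):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(thumbnail_command(video_file, thumbnail_file, **options), check=True)
```

Passing the command as a list avoids shell quoting problems with filenames that contain spaces.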

Ripping an MP4 video to an MP3

Now let's examine how to rip the audio from an MP4 video into an MP3 file. You may want to do this to take the audio of the video with you (perhaps it's a music video), or because you are building a scraper or media collection system that also requires the audio separate from the video.

This task can be accomplished using the moviepy library. moviepy is a neat library that lets you do all kinds of fun processing on your videos. One of those capabilities is to extract the audio as an MP3.

Getting ready

Make sure that you have moviepy installed in your environment:

pip install moviepy

We also need to have ffmpeg installed, which we used in the previous recipe, so...
