Creating Scraper Microservices with Docker

In this chapter, we will cover:

  • Installing Docker
  • Installing a RabbitMQ container from Docker Hub
  • Running a Docker container (RabbitMQ)
  • Stopping and removing a container and image
  • Creating an API container
  • Creating a generic microservice with Nameko
  • Creating a scraping microservice
  • Creating a scraper container
  • Creating a backend (Elasticsearch) container
  • Composing and running the scraper containers with Docker Compose

Introduction

In this chapter, we will learn to containerize our scraper, getting it ready for modern, cloud-enabled operations. This will involve packaging the different elements of the scraper (API, scraper, backend storage) as Docker containers that can be run locally or in the cloud. We will also examine implementing the scraper as a microservice that can be independently scaled.

Much of the focus will be upon using Docker to create our containerized scraper. Docker provides us a convenient and easy means of packaging the various components of the scraper as a service (the API, the scraper itself, and other backends such as Elasticsearch and RabbitMQ). By containerizing these components using Docker, we can easily run the containers locally, orchestrate the different containers making up the services, and also conveniently...

Installing Docker

In this recipe, we look at how to install Docker and verify that it is running.

Getting ready

Docker is supported on Linux, macOS, and Windows, so it has the major platforms covered. The installation process for Docker is different depending on the operating system that you are using, and even differs among the different Linux distributions.

The Docker website has good documentation on the installation processes, so this recipe will quickly walk through the important points of the installation on macOS. Once the installation is complete, the user experience for Docker, at least from the CLI, is identical across platforms.

For reference, the main page for installation instructions for Docker is found at: https://docs.docker.com...
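Once installed, a quick way to verify that both the Docker client and the daemon are working is to check the version and run the tiny hello-world image; this is a generic sanity check, not a step from the book's text:

$ docker version
$ docker run hello-world

If both commands succeed, the CLI can reach the daemon, and images can be pulled and run.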

Installing a RabbitMQ container from Docker Hub

Pre-built containers can be obtained from a number of container repositories. Docker is preconfigured with connectivity to Docker Hub, where many software vendors, and also enthusiasts, publish containers with one or more configurations.

In this recipe, we will install RabbitMQ, which will be used by Nameko, a tool covered in a later recipe, as the messaging bus for our scraping microservice.

Getting ready

Normally, the installation of RabbitMQ is a fairly simple process, but it does require several installers: one for Erlang, and then one for RabbitMQ itself. If management tools, such as the web-based administrative GUI, are desired, that is yet one more step...
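With Docker, all of that collapses into pulling a single pre-built image. As a sketch, using the rabbitmq:3-management tag that appears later in this chapter (the -management variant bundles the web-based administrative GUI mentioned above):

$ docker pull rabbitmq:3-management

This downloads the image layers from Docker Hub but does not run anything yet; starting a container from the image is the subject of the next recipe.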

Running a Docker container (RabbitMQ)

In this recipe, we learn how to run a Docker image, thereby creating a container.

Getting ready

We will start the RabbitMQ container image that we downloaded in the previous recipe. This process is representative of how many containers are run, so it makes a good example.

How to do it

We proceed with the recipe as follows:

  1. What we have downloaded so far is an image that can be run to create an actual container. A container is an actual instantiation of an image with specific parameters needed to configure the software in the container...
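As an illustrative sketch of that instantiation (the container name is a choice, and the published ports are RabbitMQ's standard AMQP port, 5672, and management UI port, 15672):

$ docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management

Here, -d detaches the container to run in the background, --name labels it, and each -p maps a host port to a container port.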

Creating and running an Elasticsearch container

While we are looking at pulling container images and starting containers, let's go and run an Elasticsearch container.

How to do it

As with most things Docker, there are many different versions of Elasticsearch containers available. We will use the official Elasticsearch image available in Elastic's own Docker repository:

  1. To install the image, enter the following:
$ docker pull docker.elastic.co/elasticsearch/elasticsearch:6.1.1
Note that we are using another way of specifying the image to pull. Since this image is in Elastic's Docker repository, we use the fully qualified name, which includes the URL of the container registry, instead of just the image name. The :6.1.1 is...
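A container can then be started from the pulled image. A minimal single-node sketch, publishing the same 9200/9300 ports that show up in the docker ps output in the next recipe (discovery.type=single-node tells Elasticsearch not to look for cluster peers):

$ docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.1.1
$ curl http://localhost:9200

The curl call should return a small JSON document describing the node; depending on the image's X-Pack security settings, you may need to supply credentials, as a later recipe does with ELASTIC_PASSWORD.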

Stopping/restarting a container and removing the image

Let's look at how to stop and remove a container, and then also its image.

How to do it

We proceed with the recipe as follows:

  1. First, query Docker for running containers:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
308a02f0e1a5 docker.elastic.co/elasticsearch/elasticsearch:6.1.1 "/usr/local/bin/do..." 7 seconds ago Up 6 seconds 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp romantic_kowalevski
094a13838376 rabbitmq:3-management "docker-entrypoint..." 47 hours ago Up 47 hours 4369/tcp, 5671/tcp, 0.0.0.0:5672->5672/tcp, 15671/tcp, 25672/tcp, 0.0.0.0:15672->15672/tcp dreamy_easley
  2. Let's stop the Elasticsearch...
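The remaining steps follow the standard Docker lifecycle commands. As a sketch, using the Elasticsearch container name and image from the listing above:

$ docker stop romantic_kowalevski
$ docker rm romantic_kowalevski
$ docker rmi docker.elastic.co/elasticsearch/elasticsearch:6.1.1

docker stop signals the container to shut down, docker rm deletes the stopped container, and docker rmi removes the image itself (which only succeeds once no container references it).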

Creating a generic microservice with Nameko

In the next few recipes, we are going to create a scraper that can be run as a microservice within a Docker container. But before jumping right into the fire, let's first look at creating a basic microservice using a Python framework known as Nameko.

Getting ready

We will use a Python framework known as Nameko (pronounced [nah-meh-koh]) to implement microservices. As with Flask-RESTful, a microservice implemented with Nameko is simply a class. We will instruct Nameko how to run the class as a service, and Nameko will wire up a messaging bus implementation to allow clients to communicate with the actual microservice.

Nameko, by default, uses RabbitMQ as a messaging bus. RabbitMQ...
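To make the pattern concrete before we get to the recipe's own code, here is a minimal sketch of a Nameko service; the class, service name, and method here are illustrative, not the book's:

from nameko.rpc import rpc

class HelloWorldService:
    # the name the service is registered under on the messaging bus
    name = "hello_world_service"

    @rpc
    def hello(self, name):
        # any @rpc-decorated method becomes remotely callable
        return "Hello, {}!".format(name)

Hosting the class is then a matter of running nameko run <module>, which connects it to the RabbitMQ bus.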

Creating a scraping microservice

Now let's take our scraper and make it into a Nameko microservice. This scraper microservice will be able to run independently of the API's implementation, allowing the scraper to be operated, maintained, and scaled on its own.

How to do it

We proceed with the recipe as follows:

  1. The code for the microservice is straightforward and can be found in 10/02/call_scraper_microservice.py, shown here:
from nameko.rpc import rpc
import sojobs.scraping

class ScrapeStackOverflowJobListingsMicroService:
    name = "stack_overflow_job_listings_scraping_microservice"

    @rpc
    def get_job_listing_info(self, job_listing_id):
        ...
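Once the service is running, clients can invoke it over the messaging bus. A sketch using Nameko's standalone ClusterRpcProxy, which a later recipe also mentions (the AMQP URI assumes RabbitMQ on localhost with the default guest account, and the job listing ID is a made-up example):

from nameko.standalone.rpc import ClusterRpcProxy

CONFIG = {'AMQP_URI': 'amqp://guest:guest@localhost'}

with ClusterRpcProxy(CONFIG) as rpc:
    # address the method via the service's registered name
    result = rpc.stack_overflow_job_listings_scraping_microservice.get_job_listing_info('122517')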

Creating a scraper container

Now we create a container for our scraper microservice. We will learn about Dockerfiles and how to instruct Docker to build a container. We will also examine giving our Docker containers hostnames so that they can find each other through Docker's integrated DNS system. Last but not least, we will learn how to configure our Nameko microservice to talk to RabbitMQ in another container instead of just on localhost.

Getting ready

The first thing we want to do is make sure that RabbitMQ is running in a container and attached to a custom Docker network, through which the various containers connected to it can talk to each other. Among many other features, such a network also provides software-defined network...
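As a sketch, creating that network and attaching RabbitMQ to it might look like the following (scraper-net is the network name used in the next recipe; the container name rabbitmq becomes its DNS hostname on that network):

$ docker network create scraper-net
$ docker run -d --name rabbitmq --network scraper-net rabbitmq:3-management

Any container started with --network scraper-net can then reach the broker as rabbitmq rather than localhost, via Docker's integrated DNS.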

Creating an API container

At this point, we can only talk to our microservice using AMQP, or by using the Nameko shell or a Nameko ClusterRPCProxy class. So let's put our Flask-RESTful API into another container, run that alongside the other containers, and make REST calls. This will also require that we run an Elasticsearch container, as that API code also communicates with Elasticsearch.

Getting ready

First, let's start up Elasticsearch in a container attached to the scraper-net network. We can kick that off with the following command:

$ docker run -e ELASTIC_PASSWORD=MagicWord --name=elastic --network scraper-net  -p 9200:9200 -p 9300:9300 docker.elastic.co/elasticsearch/elasticsearch:6.1.1

Elasticsearch...
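Since the command publishes port 9200 and sets ELASTIC_PASSWORD, a quick way to confirm the container is up might be the following (elastic is the image's built-in superuser; the password is the one passed above):

$ curl -u elastic:MagicWord http://localhost:9200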

Composing and running the scraper locally with docker-compose

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application's services. Then, with a single command, you create and start all of the services from your configuration.

Getting ready

The first thing that needs to be done to use Compose is to make sure it is installed. Compose is automatically installed with Docker for macOS. On other platforms, it may or may not be installed. You can find the instructions at the following URL: https://docs.docker.com/compose/install/#prerequisites.

Also, make sure all of the existing containers that we created earlier...
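To make the shape of a Compose configuration concrete, a docker-compose.yml for this chapter's services might look roughly like this; the service names, the custom image names, and the API port are assumptions based on the earlier recipes, not the book's exact file:

version: '3'
services:
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "15672:15672"
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.1.1
    ports:
      - "9200:9200"
  scraper:
    image: scraping-microservice   # assumed name for the scraper image
    depends_on:
      - rabbitmq
  api:
    image: scraper-rest-api        # assumed name for the API image
    ports:
      - "8080:8080"                # assumed API port
    depends_on:
      - elastic
      - rabbitmq

With a file like this in place, docker-compose up creates and starts all four services on a shared network, and docker-compose down stops and removes them.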
