Creating Scraper Microservices with Docker

In this chapter, we will cover:

  • Installing Docker
  • Installing a RabbitMQ container from Docker Hub
  • Running a Docker container (RabbitMQ)
  • Stopping and removing a container and image
  • Creating an API container
  • Creating a generic microservice with Nameko
  • Creating a scraping microservice
  • Creating a scraper container
  • Creating a backend (Elasticsearch) container
  • Composing and running the scraper containers with Docker Compose

Introduction

In this chapter, we will learn to containerize our scraper, getting it ready for modern, cloud-enabled operations. This will involve packaging the different elements of the scraper (API, scraper, backend storage) as Docker containers that can be run locally or in the cloud. We will also examine implementing the scraper as a microservice that can be independently scaled.

Much of the focus will be upon using Docker to create our containerized scraper. Docker provides us a convenient and easy means of packaging the various components of the scraper as a service (the API, the scraper itself, and other backends such as Elasticsearch and RabbitMQ). By containerizing these components using Docker, we can easily run the containers locally, orchestrate the different containers making up the services, and also conveniently...

Installing Docker

In this recipe, we look at how to install Docker and verify that it is running.

Getting ready

Docker is supported on Linux, macOS, and Windows, so it has the major platforms covered. The installation process for Docker is different depending on the operating system that you are using, and even differs among the different Linux distributions.

The Docker website has good documentation on the installation processes, so this recipe will quickly walk through the important points of the installation on macOS. Once the installation is complete, the user experience for Docker, at least from the CLI, is identical across platforms.

For reference, the main page for installation instructions for Docker is found at: https://docs.docker.com...
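Once installed, a quick way to verify that both the Docker client and the daemon are working is to check the version and run the tiny hello-world image; this is a generic sanity check, not a step from the book's text:

$ docker version
$ docker run hello-world

If both commands succeed, the CLI can reach the daemon, and images can be pulled and run.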

Installing a RabbitMQ container from Docker Hub

Pre-built containers can be obtained from a number of container repositories. Docker is preconfigured with connectivity to Docker Hub, where many software vendors, and also enthusiasts, publish containers with one or more configurations.

In this recipe, we will install RabbitMQ, which will be used by Nameko, a tool covered in a later recipe, as the messaging bus for our scraping microservice.

Getting ready

Normally, the installation of RabbitMQ is a fairly simple process, but it does require several installers: one for Erlang, and then one for RabbitMQ itself. If management tools, such as the web-based administrative GUI, are desired, that is yet one more step...
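With Docker, all of that collapses into pulling a single pre-built image. As a sketch, using the rabbitmq:3-management tag that appears later in this chapter (the -management variant bundles the web-based administrative GUI mentioned above):

$ docker pull rabbitmq:3-management

This downloads the image layers from Docker Hub but does not run anything yet; starting a container from the image is the subject of the next recipe.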

Running a Docker container (RabbitMQ)

In this recipe, we learn how to run a Docker image, thereby creating a container.

Getting ready

We will start the RabbitMQ container image that we downloaded in the previous recipe. This process is representative of how many containers are run, so it makes a good example.

How to do it

We proceed with the recipe as follows:

  1. What we have downloaded so far is an image that can be run to create an actual container. A container is an actual instantiation of an image with specific parameters needed to configure the software in the container...
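As an illustrative sketch of that instantiation (the container name is a choice, and the published ports are RabbitMQ's standard AMQP port, 5672, and management UI port, 15672):

$ docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management

Here, -d detaches the container to run in the background, --name labels it, and each -p maps a host port to a container port.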

Creating and running an Elasticsearch container

While we are looking at pulling container images and starting containers, let's go and run an Elasticsearch container.

How to do it

As with most things Docker, there are many different versions of Elasticsearch containers available. We will use the official Elasticsearch image available in Elastic's own Docker repository:

  1. To install the image, enter the following:
$ docker pull docker.elastic.co/elasticsearch/elasticsearch:6.1.1
Note that we are using another way of specifying the image to pull. Since this image is in Elastic's Docker repository, we use the fully qualified name, which includes the URL of the container registry, instead of just the image name. The :6.1.1 is...
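A container can then be started from the pulled image. A minimal single-node sketch, publishing the same 9200/9300 ports that show up in the docker ps output in the next recipe (discovery.type=single-node tells Elasticsearch not to look for cluster peers):

$ docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.1.1
$ curl http://localhost:9200

The curl call should return a small JSON document describing the node; depending on the image's X-Pack security settings, you may need to supply credentials, as a later recipe does with ELASTIC_PASSWORD.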

Stopping/restarting a container and removing the image

Let's look at how to stop and remove a container, and then also its image.

How to do it

We proceed with the recipe as follows:

  1. First, query Docker for running containers:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
308a02f0e1a5 docker.elastic.co/elasticsearch/elasticsearch:6.1.1 "/usr/local/bin/do..." 7 seconds ago Up 6 seconds 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp romantic_kowalevski
094a13838376 rabbitmq:3-management "docker-entrypoint..." 47 hours ago Up 47 hours 4369/tcp, 5671/tcp, 0.0.0.0:5672->5672/tcp, 15671/tcp, 25672/tcp, 0.0.0.0:15672->15672/tcp dreamy_easley
  2. Let's stop the Elasticsearch...
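The remaining steps follow the standard Docker lifecycle commands. As a sketch, using the Elasticsearch container name and image from the listing above:

$ docker stop romantic_kowalevski
$ docker rm romantic_kowalevski
$ docker rmi docker.elastic.co/elasticsearch/elasticsearch:6.1.1

docker stop signals the container to shut down, docker rm deletes the stopped container, and docker rmi removes the image itself (which only succeeds once no container references it).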

Creating a generic microservice with Nameko

In the next few recipes, we are going to create a scraper that can be run as a microservice within a Docker container. But before jumping right into the fire, let's first look at creating a basic microservice using a Python framework known as Nameko.

Getting ready

We will use a Python framework known as Nameko (pronounced [nah-meh-koh]) to implement microservices. As with Flask-RESTful, a microservice implemented with Nameko is simply a class. We will instruct Nameko how to run the class as a service, and Nameko will wire up a messaging bus implementation to allow clients to communicate with the actual microservice.

Nameko, by default, uses RabbitMQ as a messaging bus. RabbitMQ...
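To make the pattern concrete before we get to the recipe's own code, here is a minimal sketch of a Nameko service; the class, service name, and method here are illustrative, not the book's:

from nameko.rpc import rpc

class HelloWorldService:
    # the name the service is registered under on the messaging bus
    name = "hello_world_service"

    @rpc
    def hello(self, name):
        # any @rpc-decorated method becomes remotely callable
        return "Hello, {}!".format(name)

Hosting the class is then a matter of running nameko run <module>, which connects it to the RabbitMQ bus.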

Creating a scraping microservice

Now let's take our scraper and make it into a Nameko microservice. This scraper microservice will be able to run independently of the API's implementation, allowing the scraper to be operated, maintained, and scaled on its own.

How to do it

We proceed with the recipe as follows:

  1. The code for the microservice is straightforward and can be found in 10/02/call_scraper_microservice.py, shown here:
from nameko.rpc import rpc
import sojobs.scraping

class ScrapeStackOverflowJobListingsMicroService:
    name = "stack_overflow_job_listings_scraping_microservice"

    @rpc
    def get_job_listing_info(self, job_listing_id):
        ...
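Once the service is running, clients can invoke it over the messaging bus. A sketch using Nameko's standalone ClusterRpcProxy, which a later recipe also mentions (the AMQP URI assumes RabbitMQ on localhost with the default guest account, and the job listing ID is a made-up example):

from nameko.standalone.rpc import ClusterRpcProxy

CONFIG = {'AMQP_URI': 'amqp://guest:guest@localhost'}

with ClusterRpcProxy(CONFIG) as rpc:
    # address the method via the service's registered name
    result = rpc.stack_overflow_job_listings_scraping_microservice.get_job_listing_info('122517')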

Creating a scraper container

Now we create a container for our scraper microservice. We will learn about Dockerfiles and how to instruct Docker to build a container. We will also examine giving our Docker containers hostnames so that they can find each other through Docker's integrated DNS system. Last but not least, we will learn how to configure our Nameko microservice to talk to RabbitMQ in another container instead of just on localhost.

Getting ready

The first thing we want to do is make sure that RabbitMQ is running in a container and attached to a custom Docker network, through which the various containers connected to it can talk to each other. Among many other features, such a network also provides software-defined network...
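As a sketch, creating that network and attaching RabbitMQ to it might look like the following (scraper-net is the network name used in the next recipe; the container name rabbitmq becomes its DNS hostname on that network):

$ docker network create scraper-net
$ docker run -d --name rabbitmq --network scraper-net rabbitmq:3-management

Any container started with --network scraper-net can then reach the broker as rabbitmq rather than localhost, via Docker's integrated DNS.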

Creating an API container

At this point, we can only talk to our microservice using AMQP, or by using the Nameko shell or a Nameko ClusterRPCProxy class. So let's put our Flask-RESTful API into another container, run that alongside the other containers, and make REST calls. This will also require that we run an Elasticsearch container, as that API code also communicates with Elasticsearch.

Getting ready

First, let's start up Elasticsearch in a container attached to the scraper-net network. We can kick that off with the following command:

$ docker run -e ELASTIC_PASSWORD=MagicWord --name=elastic --network scraper-net  -p 9200:9200 -p 9300:9300 docker.elastic.co/elasticsearch/elasticsearch:6.1.1

Elasticsearch...
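Since the command publishes port 9200 and sets ELASTIC_PASSWORD, a quick way to confirm the container is up might be the following (elastic is the image's built-in superuser; the password is the one passed above):

$ curl -u elastic:MagicWord http://localhost:9200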

Composing and running the scraper locally with docker-compose

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application's services. Then, with a single command, you create and start all of the services from your configuration.

Getting ready

The first thing that needs to be done to use Compose is to make sure it is installed. Compose is automatically installed with Docker for macOS. On other platforms, it may or may not be installed. You can find the instructions at the following URL: https://docs.docker.com/compose/install/#prerequisites.

Also, make sure all of the existing containers that we created earlier...
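To make the shape of a Compose configuration concrete, a docker-compose.yml for this chapter's services might look roughly like this; the service names, the custom image names, and the API port are assumptions based on the earlier recipes, not the book's exact file:

version: '3'
services:
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "15672:15672"
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.1.1
    ports:
      - "9200:9200"
  scraper:
    image: scraping-microservice   # assumed name for the scraper image
    depends_on:
      - rabbitmq
  api:
    image: scraper-rest-api        # assumed name for the API image
    ports:
      - "8080:8080"                # assumed API port
    depends_on:
      - elastic
      - rabbitmq

With a file like this in place, docker-compose up creates and starts all four services on a shared network, and docker-compose down stops and removes them.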
