You're reading from Python Web Scraping Cookbook

Product type: Book
Published in: Feb 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781787285217
Edition: 1st

Author: Michael Heydt

Michael Heydt is an independent consultant, programmer, educator, and trainer. He has a passion for learning and sharing his knowledge of new technologies. Michael has worked in multiple industry verticals, including media, finance, energy, and healthcare. Over the last decade, he has worked extensively with web, cloud, and mobile technologies and managed user experiences, interface design, and data visualization for major consulting firms and their clients. Michael's current company, Seamless Thingies, focuses on IoT development and connecting everything with everything. Michael is the author of numerous articles, papers, and books, such as D3.js By Example, Instant Lucene.NET, Learning Pandas, and Mastering Pandas for Finance, all by Packt. Michael is also a frequent speaker at .NET user groups and various mobile, cloud, and IoT conferences, and delivers webinars on advanced technologies.

Creating a Simple Data API

In this chapter, we will cover:

  • Creating a REST API with Flask-RESTful
  • Integrating the REST API with scraping code
  • Adding an API to find the skills for a job listing
  • Storing data in Elasticsearch as the result of a scraping request
  • Checking Elasticsearch for a listing before scraping

Introduction

We have now reached an exciting inflection point in our learning about scraping. From this point on, we will learn about offering scrapers as a service using several API, microservice, and container tools, all of which will allow us to run the scraper either locally or in the cloud, and to give access to it through standardized REST APIs.

We will start this new journey in this chapter with the creation of a simple REST API using Flask-RESTful which we will eventually use to make requests to the service to scrape pages on demand. We will connect this API to a scraper function implemented in a Python module that reuses the concepts for scraping StackOverflow jobs, as discussed in Chapter 7, Text Wrangling and Analysis.

The final few recipes will focus on using Elasticsearch as a cache for these results, storing documents we retrieve from the scraper...

Creating a REST API with Flask-RESTful

We start by creating a simple REST API using Flask-RESTful. This initial API will consist of a single method that lets the caller pass an integer value and returns a JSON blob. The parameters, their values, and the return value are not important at this point, as we first simply want to get an API up and running with Flask-RESTful.

Getting ready

Flask is a web microframework that makes creating simple web application functionality incredibly easy. Flask-RESTful is an extension to Flask that makes building REST APIs just as simple. You can get Flask and read more about it at flask.pocoo.org. Flask-RESTful can be read about at https...

Integrating the REST API with scraping code

In this recipe, we will integrate code that we wrote for scraping and getting a clean job listing from StackOverflow with our API. This will result in a reusable API that can be used to perform on-demand scrapes without the client needing any knowledge of the scraping process. Essentially, we will have created a scraper as a service, a concept we will spend much time with in the remaining recipes of the book.

Getting ready

The first part of this process is to create a module out of our preexisting code that was written in Chapter 7, Text Wrangling and Analysis so that we can reuse it. We will reuse this code in several recipes throughout the remainder of the book. Let's...
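One plausible shape for such a module is sketched below (the book names the module sojobs; the helper function, URL pattern, and placeholder return format here are assumptions for illustration, with the real parsing logic living in the Chapter 7 code):

```python
# sojobs.py - a sketch of the reusable scraping module
import json
from urllib.request import urlopen

def job_listing_url(job_listing_id):
    """Build the URL of a StackOverflow job listing from its id."""
    return "https://stackoverflow.com/jobs/" + str(job_listing_id)

def get_job_listing_info(job_listing_id):
    """Fetch a listing page and return its cleaned fields as a JSON string."""
    html = urlopen(job_listing_url(job_listing_id)).read().decode("utf-8")
    # ... parse and clean the page as implemented in Chapter 7 ...
    fields = {"ID": str(job_listing_id), "JSON": {}}  # placeholder result
    return json.dumps(fields)
```

With the scraping logic isolated in a module like this, the API handler only needs to call get_job_listing_info and return the result.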

Adding an API to find the skills for a job listing

In this recipe, we add an additional operation to our API which will allow us to request the skills associated with a job listing. This demonstrates a means of being able to retrieve only a subset of the data instead of the entire content of the listing. While we will only do this for the skills, the concept can be easily extended to any other subsets of the data, such as the location of the job, title, or almost any other content that makes sense for the user of your API.

Getting ready

The first thing that we will do is add a scraping function to the sojobs module. This function will be named get_job_listing_skills. The following is the code for this function:

def get_job_listing_skills...

Storing data in Elasticsearch as the result of a scraping request

In this recipe, we extend our API to save the data we receive from the scraper into Elasticsearch. We will use this later (in the next recipe) to optimize requests by using the content in Elasticsearch as a cache, so that we do not repeat the scraping process for job listings already scraped. Therefore, we can play nice with StackOverflow's servers.

Getting ready
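The body of this recipe is not included in the preview. As a sketch of the idea, storing a scraped result could look like the function below; the index name and the index() call shape follow elasticsearch-py's pre-8.x client API, and both are assumptions rather than the book's exact code:

```python
def store_job_listing(es, listing):
    """Index a scraped listing in Elasticsearch so that later requests
    for the same id can be served from the cache.

    `es` is an Elasticsearch client (e.g. elasticsearch.Elasticsearch()),
    and `listing` is the dict produced by the scraper, keyed by "ID".
    """
    return es.index(index="joblistings", id=listing["ID"], body=listing)
```

The API handler would call store_job_listing right after a successful scrape, before returning the result to the client.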

Checking Elasticsearch for a listing before scraping

Now let's leverage Elasticsearch as a cache by checking whether we have already stored a job listing, in which case we do not need to hit StackOverflow again. We extend the API for performing a scrape of a job listing to first search Elasticsearch, and if the result is found there, we return that data. Hence, we optimize the process by making Elasticsearch a job-listings cache.

How to do it

We proceed with the recipe as follows:

The code for this recipe is within 09/05/api.py. The JobListing class now has the following implementation:

class JobListing(Resource):
    def get(self, job_listing_id):
        print("Request for job listing with id: " + job_listing_id)

        ...
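The rest of the implementation is elided in the preview, but the cache-first logic it describes can be sketched as a standalone function; the client object, index name, exception handling, and scrape callable below are stand-ins for the book's code:

```python
def get_job_listing(es, job_listing_id, scrape):
    """Serve a listing from the Elasticsearch cache when present;
    otherwise scrape it, store it in the cache, and return it."""
    try:
        doc = es.get(index="joblistings", id=job_listing_id)
        return doc["_source"]        # cache hit: no scrape needed
    except Exception:                # elasticsearch-py raises NotFoundError
        listing = scrape(job_listing_id)
        es.index(index="joblistings", id=job_listing_id, body=listing)
        return listing
```

The first request for an id pays the scraping cost; every subsequent request for the same id is answered from Elasticsearch without touching StackOverflow.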