You're reading from Python Web Scraping Cookbook

Product type: Book
Published in: Feb 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781787285217
Edition: 1st

Author: Michael Heydt

Michael Heydt is an independent consultant, programmer, educator, and trainer. He has a passion for learning and sharing his knowledge of new technologies. Michael has worked in multiple industry verticals, including media, finance, energy, and healthcare. Over the last decade, he has worked extensively with web, cloud, and mobile technologies and managed user experiences, interface design, and data visualization for major consulting firms and their clients. Michael's current company, Seamless Thingies, focuses on IoT development and connecting everything with everything. Michael is the author of numerous articles, papers, and books, such as D3.js By Example, Instant Lucene.NET, Learning Pandas, and Mastering Pandas for Finance, all by Packt. Michael is also a frequent speaker at .NET user groups and various mobile, cloud, and IoT conferences, and delivers webinars on advanced technologies.

Creating a Simple Data API

In this chapter, we will cover:

  • Creating a REST API with Flask-RESTful
  • Integrating the REST API with scraping code
  • Adding an API to find the skills for a job listing
  • Storing data in Elasticsearch as the result of a scraping request
  • Checking Elasticsearch for a listing before scraping

Introduction

We have now reached an exciting inflection point in our learning about scraping. From this point on, we will learn about offering scrapers as a service using several API, microservice, and container tools, all of which will allow us to run the scraper either locally or in the cloud, and to give access to it through standardized REST APIs.

We will start this new journey in this chapter with the creation of a simple REST API using Flask-RESTful which we will eventually use to make requests to the service to scrape pages on demand. We will connect this API to a scraper function implemented in a Python module that reuses the concepts for scraping StackOverflow jobs, as discussed in Chapter 7, Text Wrangling and Analysis.

The final few recipes will focus on using Elasticsearch as a cache for these results, storing documents we retrieve from the scraper...

Creating a REST API with Flask-RESTful

We start by creating a simple REST API using Flask-RESTful. This initial API will consist of a single method that lets the caller pass an integer value and returns a JSON blob. The parameters, their values, and the return value are not important at this point, as we first simply want to get an API up and running with Flask-RESTful.

Getting ready

Flask is a web microframework that makes creating simple web application functionality incredibly easy. Flask-RESTful is an extension to Flask that makes building REST APIs just as simple. You can get Flask and read more about it at flask.pocoo.org. Flask-RESTful can be read about at https...

Integrating the REST API with scraping code

In this recipe, we will integrate code that we wrote for scraping and getting a clean job listing from StackOverflow with our API. This will result in a reusable API that can be used to perform on-demand scrapes without the client needing any knowledge of the scraping process. Essentially, we will have created a scraper as a service, a concept we will spend much time with in the remaining recipes of the book.

Getting ready

The first part of this process is to create a module out of our preexisting code that was written in Chapter 7, Text Wrangling and Analysis so that we can reuse it. We will reuse this code in several recipes throughout the remainder of the book. Let's...
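One plausible shape for such a module is sketched below (the book names the module sojobs; the helper function, URL pattern, and placeholder return format here are assumptions for illustration, with the real parsing logic living in the Chapter 7 code):

```python
# sojobs.py - a sketch of the reusable scraping module
import json
from urllib.request import urlopen

def job_listing_url(job_listing_id):
    """Build the URL of a StackOverflow job listing from its id."""
    return "https://stackoverflow.com/jobs/" + str(job_listing_id)

def get_job_listing_info(job_listing_id):
    """Fetch a listing page and return its cleaned fields as a JSON string."""
    html = urlopen(job_listing_url(job_listing_id)).read().decode("utf-8")
    # ... parse and clean the page as implemented in Chapter 7 ...
    fields = {"ID": str(job_listing_id), "JSON": {}}  # placeholder result
    return json.dumps(fields)
```

With the scraping logic isolated in a module like this, the API handler only needs to call get_job_listing_info and return the result.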

Adding an API to find the skills for a job listing

In this recipe, we add an additional operation to our API which will allow us to request the skills associated with a job listing. This demonstrates a means of being able to retrieve only a subset of the data instead of the entire content of the listing. While we will only do this for the skills, the concept can be easily extended to any other subsets of the data, such as the location of the job, title, or almost any other content that makes sense for the user of your API.

Getting ready

The first thing that we will do is add a scraping function to the sojobs module. This function will be named get_job_listing_skills. The following is the code for this function:

def get_job_listing_skills...

Storing data in Elasticsearch as the result of a scraping request

In this recipe, we extend our API to save the data we receive from the scraper into Elasticsearch. We will use this later (in the next recipe) to optimize requests by using the content in Elasticsearch as a cache, so that we do not repeat the scraping process for job listings already scraped. Therefore, we can play nice with StackOverflow's servers.

Getting ready
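The body of this recipe is not included in the preview. As a sketch of the idea, storing a scraped result could look like the function below; the index name and the index() call shape follow elasticsearch-py's pre-8.x client API, and both are assumptions rather than the book's exact code:

```python
def store_job_listing(es, listing):
    """Index a scraped listing in Elasticsearch so that later requests
    for the same id can be served from the cache.

    `es` is an Elasticsearch client (e.g. elasticsearch.Elasticsearch()),
    and `listing` is the dict produced by the scraper, keyed by "ID".
    """
    return es.index(index="joblistings", id=listing["ID"], body=listing)
```

The API handler would call store_job_listing right after a successful scrape, before returning the result to the client.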

Checking Elasticsearch for a listing before scraping

Now let's leverage Elasticsearch as a cache by checking whether we have already stored a job listing, in which case we do not need to hit StackOverflow again. We extend the API for performing a scrape of a job listing to first search Elasticsearch, and if the result is found there, we return that data. Hence, we optimize the process by making Elasticsearch a job-listings cache.

How to do it

We proceed with the recipe as follows:

The code for this recipe is within 09/05/api.py. The JobListing class now has the following implementation:

class JobListing(Resource):
    def get(self, job_listing_id):
        print("Request for job listing with id: " + job_listing_id)

        ...
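The rest of the implementation is elided in the preview, but the cache-first logic it describes can be sketched as a standalone function; the client object, index name, exception handling, and scrape callable below are stand-ins for the book's code:

```python
def get_job_listing(es, job_listing_id, scrape):
    """Serve a listing from the Elasticsearch cache when present;
    otherwise scrape it, store it in the cache, and return it."""
    try:
        doc = es.get(index="joblistings", id=job_listing_id)
        return doc["_source"]        # cache hit: no scrape needed
    except Exception:                # elasticsearch-py raises NotFoundError
        listing = scrape(job_listing_id)
        es.index(index="joblistings", id=job_listing_id, body=listing)
        return listing
```

The first request for an id pays the scraping cost; every subsequent request for the same id is answered from Elasticsearch without touching StackOverflow.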