You're reading from  The Applied Data Science Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800202504
Edition: 2nd
Author
Alex Galea

Alex Galea has been professionally practicing data analytics since graduating with a master's degree in physics from the University of Guelph, Canada. He developed a keen interest in Python while researching quantum gases as part of his graduate studies. Alex currently works in web data analytics, where Python continues to play a key role. He blogs frequently about data-centric projects involving Python and Jupyter Notebooks.

6. Web Scraping with Jupyter Notebooks

Overview

In this chapter, you will learn to make HTTP requests and parse data from HTML. Like in previous chapters, you will continue to get hands-on experience working with datasets in Python, including merging tables and preparing them for analysis. By the end of this chapter, you will be able to use Python to make HTTP requests, such as API calls, and create pipelines to extract data from web pages.

Introduction

So far in this book, we have focused on using Jupyter to build reproducible data analysis and modeling workflows. We'll continue with a similar approach in this chapter, but with the main focus being on data acquisition. In particular, we will show you how data can be acquired from the web using HTTP requests. This will involve making API requests and scraping web pages by parsing HTML. In addition to these new topics, we'll continue to use pandas for building and transforming our datasets.

Before we cover HTTP requests and how to use them in Python, we'll discuss the importance of gathering data from the web in general. The amount of data that's available online is huge, and it's continuously growing at a staggering pace. Additionally, it's becoming increasingly important for driving business growth. Consider, for example, the ongoing global shift from traditional media such as newspapers, magazines, and TV to online content. With customized...

Internet Data Sources

As data scientists, the internet connects us with almost any kind of dataset we could imagine. For instance, governments around the world publish public datasets that are rich with information. Along the same lines, some companies make certain datasets public, which can be of huge value within a given industry. One example of this is the ride-sharing business Lyft, which has released open datasets that could be beneficial for training autonomous driving models.

In addition to online datasets, Application Programming Interface (API) services also exist, which provide relevant and fresh data programmatically. For example, a business that depends on the weather may want an API that provides the current conditions in a given region, along with updated forecasts. Processes could be set up to query that API daily and update an internal database that's connected to a dashboard in order to provide that and other relevant data to business stakeholders.
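The query-and-extract pattern described above can be sketched as follows. The endpoint URL, parameter names, and response fields here are all illustrative assumptions, not a real weather service; in a scheduled process, `fetch_conditions` would run daily and its result would be written to the internal database.

```python
import requests

# Hypothetical endpoint -- illustrative only, not a real service
API_URL = "https://api.example-weather.com/v1/current"

def extract_conditions(payload):
    """Pull the fields a dashboard might need from an API response."""
    return {
        "region": payload["region"],
        "temperature_c": payload["current"]["temperature_c"],
        "forecast_high_c": payload["forecast"][0]["high_c"],
    }

def fetch_conditions(region, api_key):
    """Query the (hypothetical) API for current conditions in a region."""
    response = requests.get(
        API_URL,
        params={"region": region, "key": api_key},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_conditions(response.json())

# A sample payload in the assumed response format:
sample = {
    "region": "Vancouver",
    "current": {"temperature_c": 11.5},
    "forecast": [{"high_c": 14.0}],
}
print(extract_conditions(sample))
```

Separating the parsing step from the network call, as above, makes the extraction logic easy to test without contacting the API.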

Web scraping...

Introduction to HTTP Requests

The Hypertext Transfer Protocol, or HTTP for short, is the foundation of data communication for the internet. It defines how a page should be requested and how the response should look. For example, a client can request an Amazon page of laptops for sale, a Google search of local restaurants, or their Facebook feed. Along with the URL, the request header contains details such as the user agent and any available browser cookies.

The user agent tells the server which browser and device the client is using, and is typically used to serve the most suitable version of the web page in response. Cookies carry session information; for example, if the user has recently logged in to the site, that state would be stored in a cookie, which might then be used to log them in automatically.

These details of HTTP requests and responses are taken care of under the hood thanks to web browsers. Luckily for us, today, the same is true when making requests with high-level languages...
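To make the request header concrete, the sketch below builds a GET request with the requests library, sets a custom user agent and a cookie (both values are illustrative), and then inspects the prepared request without actually sending it over the network.

```python
import requests

# Build a GET request with an explicit user agent and a cookie.
# Preparing the request lets us inspect the headers that would be
# sent, without contacting the server.
request = requests.Request(
    "GET",
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    headers={"User-Agent": "my-notebook/0.1"},  # identifies our client
    cookies={"session_id": "abc123"},           # illustrative cookie value
)
prepared = request.prepare()

print(prepared.headers["User-Agent"])  # my-notebook/0.1
print(prepared.headers["Cookie"])      # session_id=abc123
```

In everyday use, `requests.get(url, headers=...)` builds and sends the request in one step; the prepared-request view is just a way to see what goes over the wire.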

Data Workflow with pandas

As we've seen time and time again in this book, pandas is an integral part of performing data science with Python and Jupyter Notebooks. DataFrames offer us a way to organize and store labeled data, but more importantly, pandas provides time-saving methods for transforming data. Examples we have seen in this book include dropping duplicates, mapping dictionaries to columns, applying functions over columns, and filling in missing values.
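As a quick refresher on those methods, the toy DataFrame below (with invented values) applies each one in turn: dropping duplicates, mapping a dictionary to a column, filling in missing values, and applying a function over a column.

```python
import pandas as pd

# Small invented table with a duplicate row and a missing value
df = pd.DataFrame({
    "country": ["Canada", "Canada", "Brazil", "Japan"],
    "code": ["CA", "CA", "BR", "JP"],
    "rate": [5.0, 5.0, None, 0.1],
})

df = df.drop_duplicates()                          # remove the repeated Canada row
df["continent"] = df["code"].map(
    {"CA": "North America", "BR": "South America", "JP": "Asia"}
)                                                  # dictionary mapped to a column
df["rate"] = df["rate"].fillna(df["rate"].mean())  # fill missing with the mean
df["rate_pct"] = df["rate"].apply(lambda r: f"{r:.1f}%")  # function over a column

print(df)
```

Each step returns a transformed result in a single method call, which is what makes pandas such a time-saver for this kind of cleanup.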

In the next exercise, we'll reload the raw tables that we pulled from Wikipedia, clean them up, and merge them together. This will result in a dataset that is suitable for analysis, which we'll use for a final exercise, where you'll have an opportunity to perform exploratory analysis and apply the modeling concepts that you learned about in earlier chapters.

Exercise 6.04: Processing Data for Analysis with pandas

In this exercise, we continue working on the country data that was pulled from Wikipedia...

Summary

In this chapter, we worked through the process of pulling tables from Wikipedia using web scraping techniques, cleaning up the resulting data with pandas, and producing a final analysis.

We started by looking at how HTTP requests work, focusing on GET requests and their response status codes. Then, we went into the Jupyter Notebook and made HTTP requests with Python using the requests library. We saw how Jupyter can be used to render HTML in the notebook, along with actual web pages that can be interacted with. In order to learn about web scraping, we saw how BeautifulSoup can be used to parse text from the HTML, and used this library to scrape tabular data from Wikipedia.
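The scraping step can be sketched offline: the snippet below parses a miniature hand-written HTML table (standing in for a downloaded Wikipedia page) with BeautifulSoup and extracts its rows into a list of lists.

```python
from bs4 import BeautifulSoup

# A miniature stand-in for the HTML of a Wikipedia table
html = """
<table class="wikitable">
  <tr><th>Country</th><th>Rate</th></tr>
  <tr><td>Canada</td><td>5.00</td></tr>
  <tr><td>Japan</td><td>0.10</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

rows = []
for tr in table.find_all("tr"):
    # Collect the text of every header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Against a real page, the HTML string would come from `requests.get(url).text`; the parsing logic stays the same.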

After pulling two tables of data, we processed them for analysis with pandas. The first table contained the central bank interest rates for each country, while the second table contained the populations. We combined these into a single table that was then used for the final analysis, which involved...
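A minimal sketch of that combination step, using invented figures rather than the actual scraped values: two small DataFrames, one of interest rates and one of populations, merged on the shared country column.

```python
import pandas as pd

# Illustrative values, not the actual scraped figures
rates = pd.DataFrame({
    "country": ["Canada", "Japan", "Brazil"],
    "interest_rate": [5.00, 0.10, 13.75],
})
populations = pd.DataFrame({
    "country": ["Canada", "Japan", "Brazil"],
    "population": [38_000_000, 125_000_000, 214_000_000],
})

# An inner join on the country column keeps only countries
# that appear in both tables
merged = rates.merge(populations, on="country", how="inner")
print(merged)
```

An inner join is the safe default here, since countries missing from either source table would otherwise introduce rows with missing values into the analysis.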

