You're reading from  The Applied Data Science Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800202504
Edition: 2nd
Author
Alex Galea

Alex Galea has been professionally practicing data analytics since graduating with a master's degree in physics from the University of Guelph, Canada. He developed a keen interest in Python while researching quantum gases as part of his graduate studies. Alex currently works in web data analytics, where Python continues to play a key role. He blogs frequently about data-centric projects involving Python and Jupyter Notebooks.

6. Web Scraping with Jupyter Notebooks

Overview

In this chapter, you will learn to make HTTP requests and parse data from HTML. Like in previous chapters, you will continue to get hands-on experience working with datasets in Python, including merging tables and preparing them for analysis. By the end of this chapter, you will be able to use Python to make HTTP requests, such as API calls, and create pipelines to extract data from web pages.

Introduction

So far in this book, we have focused on using Jupyter to build reproducible data analysis and modeling workflows. We'll continue with a similar approach in this chapter, but with the main focus being on data acquisition. In particular, we will show you how data can be acquired from the web using HTTP requests. This will involve making API requests and scraping web pages by parsing HTML. In addition to these new topics, we'll continue to use pandas for building and transforming our datasets.

Before we cover HTTP requests and how to use them in Python, we'll discuss the importance of gathering data from the web in general. The amount of data that's available online is huge, and it's continuously growing at a staggering pace. Additionally, it's becoming increasingly important for driving business growth. Consider, for example, the ongoing global shift from traditional media such as newspapers, magazines, and TV to online content. With customized...

Internet Data Sources

As data scientists, the internet connects us with almost any kind of dataset we could imagine. For instance, governments around the world publish public datasets that are rich with information. Along the same lines, some companies make certain datasets public, which can be of huge value within a given industry. One example of this is the ride-sharing business Lyft, which has released open datasets that could be beneficial for training autonomous driving models.

In addition to online datasets, Application Programming Interface (API) services also exist, which provide relevant and fresh data programmatically. For example, a business that depends on the weather may want an API that provides the current conditions in a given region, along with updated forecasts. Processes could be set up to query that API daily and update an internal database that's connected to a dashboard in order to provide that and other relevant data to business stakeholders.
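The query-and-extract pattern described above can be sketched as follows. The endpoint URL, parameter names, and response fields here are all illustrative assumptions, not a real weather service; in a scheduled process, `fetch_conditions` would run daily and its result would be written to the internal database.

```python
import requests

# Hypothetical endpoint -- illustrative only, not a real service
API_URL = "https://api.example-weather.com/v1/current"

def extract_conditions(payload):
    """Pull the fields a dashboard might need from an API response."""
    return {
        "region": payload["region"],
        "temperature_c": payload["current"]["temperature_c"],
        "forecast_high_c": payload["forecast"][0]["high_c"],
    }

def fetch_conditions(region, api_key):
    """Query the (hypothetical) API for current conditions in a region."""
    response = requests.get(
        API_URL,
        params={"region": region, "key": api_key},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_conditions(response.json())

# A sample payload in the assumed response format:
sample = {
    "region": "Vancouver",
    "current": {"temperature_c": 11.5},
    "forecast": [{"high_c": 14.0}],
}
print(extract_conditions(sample))
```

Separating the parsing step from the network call, as above, makes the extraction logic easy to test without contacting the API.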

Web scraping...

Introduction to HTTP Requests

The Hypertext Transfer Protocol, or HTTP for short, is the foundation of data communication for the internet. It defines how a page should be requested and how the response should look. For example, a client can request an Amazon page of laptops for sale, a Google search of local restaurants, or their Facebook feed. Along with the URL, the request header contains details such as the user agent and any available browser cookies.

The user agent tells the server which browser and device the client is using, and is typically used to serve the most suitable version of the web page in response. Cookies carry session information; for example, if the user has recently logged in to the site, that state would be stored in a cookie, which might then be used to log them in automatically.

These details of HTTP requests and responses are taken care of under the hood thanks to web browsers. Luckily for us, today, the same is true when making requests with high-level languages...
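To make the request header concrete, the sketch below builds a GET request with the requests library, sets a custom user agent and a cookie (both values are illustrative), and then inspects the prepared request without actually sending it over the network.

```python
import requests

# Build a GET request with an explicit user agent and a cookie.
# Preparing the request lets us inspect the headers that would be
# sent, without contacting the server.
request = requests.Request(
    "GET",
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    headers={"User-Agent": "my-notebook/0.1"},  # identifies our client
    cookies={"session_id": "abc123"},           # illustrative cookie value
)
prepared = request.prepare()

print(prepared.headers["User-Agent"])  # my-notebook/0.1
print(prepared.headers["Cookie"])      # session_id=abc123
```

In everyday use, `requests.get(url, headers=...)` builds and sends the request in one step; the prepared-request view is just a way to see what goes over the wire.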

Data Workflow with pandas

As we've seen time and time again in this book, pandas is an integral part of performing data science with Python and Jupyter Notebooks. DataFrames offer us a way to organize and store labeled data, but more importantly, pandas provides time-saving methods for transforming data. Examples we have seen in this book include dropping duplicates, mapping dictionaries to columns, applying functions over columns, and filling in missing values.
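As a quick refresher on those methods, the toy DataFrame below (with invented values) applies each one in turn: dropping duplicates, mapping a dictionary to a column, filling in missing values, and applying a function over a column.

```python
import pandas as pd

# Small invented table with a duplicate row and a missing value
df = pd.DataFrame({
    "country": ["Canada", "Canada", "Brazil", "Japan"],
    "code": ["CA", "CA", "BR", "JP"],
    "rate": [5.0, 5.0, None, 0.1],
})

df = df.drop_duplicates()                          # remove the repeated Canada row
df["continent"] = df["code"].map(
    {"CA": "North America", "BR": "South America", "JP": "Asia"}
)                                                  # dictionary mapped to a column
df["rate"] = df["rate"].fillna(df["rate"].mean())  # fill missing with the mean
df["rate_pct"] = df["rate"].apply(lambda r: f"{r:.1f}%")  # function over a column

print(df)
```

Each step returns a transformed result in a single method call, which is what makes pandas such a time-saver for this kind of cleanup.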

In the next exercise, we'll reload the raw tables that we pulled from Wikipedia, clean them up, and merge them together. This will result in a dataset that is suitable for analysis, which we'll use for a final exercise, where you'll have an opportunity to perform exploratory analysis and apply the modeling concepts that you learned about in earlier chapters.

Exercise 6.04: Processing Data for Analysis with pandas

In this exercise, we continue working on the country data that was pulled from Wikipedia...

Summary

In this chapter, we worked through the process of pulling tables from Wikipedia using web scraping techniques, cleaning up the resulting data with pandas, and producing a final analysis.

We started by looking at how HTTP requests work, focusing on GET requests and their response status codes. Then, we went into the Jupyter Notebook and made HTTP requests with Python using the requests library. We saw how Jupyter can be used to render HTML in the notebook, along with actual web pages that can be interacted with. In order to learn about web scraping, we saw how BeautifulSoup can be used to parse text from the HTML, and used this library to scrape tabular data from Wikipedia.
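The scraping step can be sketched offline: the snippet below parses a miniature hand-written HTML table (standing in for a downloaded Wikipedia page) with BeautifulSoup and extracts its rows into a list of lists.

```python
from bs4 import BeautifulSoup

# A miniature stand-in for the HTML of a Wikipedia table
html = """
<table class="wikitable">
  <tr><th>Country</th><th>Rate</th></tr>
  <tr><td>Canada</td><td>5.00</td></tr>
  <tr><td>Japan</td><td>0.10</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

rows = []
for tr in table.find_all("tr"):
    # Collect the text of every header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Against a real page, the HTML string would come from `requests.get(url).text`; the parsing logic stays the same.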

After pulling two tables of data, we processed them for analysis with pandas. The first table contained the central bank interest rates for each country, while the second table contained the populations. We combined these into a single table that was then used for the final analysis, which involved...
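A minimal sketch of that combination step, using invented figures rather than the actual scraped values: two small DataFrames, one of interest rates and one of populations, merged on the shared country column.

```python
import pandas as pd

# Illustrative values, not the actual scraped figures
rates = pd.DataFrame({
    "country": ["Canada", "Japan", "Brazil"],
    "interest_rate": [5.00, 0.10, 13.75],
})
populations = pd.DataFrame({
    "country": ["Canada", "Japan", "Brazil"],
    "population": [38_000_000, 125_000_000, 214_000_000],
})

# An inner join on the country column keeps only countries
# that appear in both tables
merged = rates.merge(populations, on="country", how="inner")
print(merged)
```

An inner join is the safe default here, since countries missing from either source table would otherwise introduce rows with missing values into the analysis.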

