You're reading from The Data Wrangling Workshop - Second Edition

Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839215001
Edition: 2nd
Authors (3):

Brian Lipp

Brian Lipp is a technology polyglot, engineer, and solution architect with a wide skillset across many technology domains. His programming background ranges from R, Python, and Scala to Go and Rust. He has worked on big data systems, data lakes, data warehouses, and backend software engineering. Brian earned a Master of Science in CSIS from Pace University in 2009. He is currently a senior data engineer working with large tech firms to build data ecosystems.
Read more about Brian Lipp

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.
Read more about Shubhadeep Roychowdhury

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.
Read more about Dr. Tirthajyoti Sarkar


7. Advanced Web Scraping and Data Gathering

Overview

This chapter will introduce you to the concepts of advanced web scraping and data gathering. It will enable you to use requests and BeautifulSoup to read various web pages and gather data from them. You will perform read operations on XML files and on the web using an Application Programming Interface (API), and you will use regex techniques to scrape useful information from a large and messy text corpus. By the end of this chapter, you will have learned how to gather data from web pages, XML files, and APIs.

Introduction

The previous chapter covered how to create a successful data wrangling pipeline. In this chapter, we will use all of the techniques we have learned so far to build a web scraper that a data wrangling professional could use in their daily tasks. This chapter builds on the foundation of BeautifulSoup and introduces various methods for scraping a web page and using an API to gather data.

In today's connected world, one of the most valued and widely used skills for a data wrangling professional is the ability to extract and read data from web pages and databases hosted on the web. Most organizations host data on the cloud (public or private), and the majority of web microservices these days provide some kind of API for external users to access data. Let's take a look at the following diagram:

Figure 7.1: Data wrangling HTTP request and an XML/JSON reply

As we can see in the diagram, to fetch data from a web server or a database...

The Requests and BeautifulSoup Libraries

We will take advantage of two Python libraries in this chapter: requests and BeautifulSoup. To avoid dealing with HTTP methods at a lower level, we will use the requests library. It is an API built on top of pure Python web utility libraries, which makes placing HTTP requests easy and intuitive.
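As a small illustration of how requests abstracts the low-level HTTP details, the sketch below builds a GET request without sending it (the query parameter is a hypothetical example, not from the book). Actually fetching a page is a single `requests.get` call, which needs network access:

```python
import requests

# Build a GET request; requests percent-encodes the parameters and
# assembles the full URL for you. (The search term is made up.)
req = requests.Request(
    "GET",
    "https://en.wikipedia.org/w/index.php",
    params={"search": "data wrangling"},
)
prepared = req.prepare()
print(prepared.method)  # GET
print(prepared.url)     # full URL with the encoded query string

# Sending a request is one call (requires network access):
# response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
# print(response.status_code)  # 200 on success
```

The `Response` object returned by `requests.get` exposes the status code, headers, and the decoded body (`response.text`), which is what we will hand to the parser next.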

BeautifulSoup is one of the most popular HTML parser packages. It parses the HTML content you pass in and builds a detailed tree of all the tags and markup within the page for easy and intuitive traversal. A programmer can use this tree to look for certain markup elements (for example, a table, a hyperlink, or a blob of text within a particular div ID) and scrape useful data from them.
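To see the tree traversal in action, here is a minimal sketch on a toy HTML snippet (the markup and URLs are made up for illustration):

```python
from bs4 import BeautifulSoup

# A toy page: two hyperlinks inside a div, plus a paragraph.
html = """
<html><body>
  <div id="links">
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </div>
  <p class="note">Some text</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Find every hyperlink and pull out its href attribute.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)

# Look up an element by tag and class, then read its text content.
note = soup.find("p", class_="note")
print(note.get_text())
```

The same `find`/`find_all` calls work identically on a real page fetched with requests; only the input string changes.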

We are going to do a couple of exercises in order to demonstrate how to use the requests library and decode the contents of the response received when data is fetched from the server.

Exercise 7.01: Using the Requests Library to Get a Response from the Wikipedia Home...

Reading Data from XML

XML, or Extensible Markup Language, is a web markup language that is similar to HTML but with significant flexibility built in on the part of the user, such as the ability to define your own tags. It was one of the most hyped technologies of the 1990s and early 2000s. It is a meta-language: a language that allows us to define other languages using its mechanics, such as RSS and MathML (a mathematical markup language widely used for web publication and the display of math-heavy technical information). XML is also heavily used in regular data exchange over the web, and as a data wrangling professional, you should be familiar enough with its basic features to tap into the data flow pipeline whenever you need to extract data for your project.
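Python's standard library can parse XML without any extra installs. The following sketch, using made-up data, shows the basic pattern of parsing an XML string into a tree and walking its elements with `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

# A small XML document with invented records.
xml_data = """
<records>
  <person id="1"><name>Alice</name><city>Paris</city></person>
  <person id="2"><name>Bob</name><city>Pune</city></person>
</records>
"""
root = ET.fromstring(xml_data)
print(root.tag)  # records

# Walk the child elements, reading tags, attributes, and text.
names = [p.find("name").text for p in root.findall("person")]
print(names)  # ['Alice', 'Bob']
```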

Exercise 7.07: Creating an XML File and Reading XML Element Objects

In this exercise, we'll create some random data and store it in XML format. We'll then read from the XML file and examine...
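The full exercise steps are not shown here, but the round trip it describes — writing data out as XML and reading element objects back — can be sketched as follows. The element names and values below are hypothetical, not the data used in the exercise:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small tree with made-up data.
root = ET.Element("students")
for name, score in [("Rina", 88), ("Tom", 73)]:
    student = ET.SubElement(root, "student", name=name)
    student.text = str(score)

# Write the tree to an XML file.
path = os.path.join(tempfile.gettempdir(), "students.xml")
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Read the file back and recover the data from the element objects.
loaded = ET.parse(path).getroot()
scores = {s.get("name"): int(s.text) for s in loaded}
print(scores)  # {'Rina': 88, 'Tom': 73}
```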

Reading Data from an API

Fundamentally, an API, or Application Programming Interface, is an interface to a computing resource (for example, an operating system or a database table) that exposes a set of methods (function calls) allowing a programmer to access particular data or internal features of that resource.

A web API is, as the name suggests, an API over the web. Note that it is not a specific technology or programming framework but an architectural concept. Think of an API as a fast-food restaurant's service counter: internally, there are many food items, raw materials, cooking resources, and recipe management systems, but all you see are the fixed menu items on the board, and you can only interact through those items. A web API is like a port that can be accessed over the HTTP protocol and that delivers data and services when used properly.
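In practice, consuming a web API usually means sending an HTTP request and decoding a JSON reply into Python objects. A minimal sketch of the decoding step, using a made-up reply (with requests, `response.json()` performs this same decode on a live response):

```python
import json

# A hypothetical JSON reply, as a web API might return it.
raw_reply = '{"name": "France", "capital": "Paris", "population": 67000000}'

# json.loads turns the JSON text into ordinary Python dicts/lists.
data = json.loads(raw_reply)
print(data["capital"])      # Paris
print(data["population"])   # 67000000
```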

Web APIs are extremely popular these days for all kinds of data services. In the very first chapter, we talked...

Fundamentals of Regular Expressions (RegEx)

Regular expressions, or regex, are used to identify whether a pattern exists in a given sequence of characters (a string). They help with manipulating textual data, which is often a prerequisite for data science projects that involve text mining.
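Python ships regex support in the standard `re` module. A minimal sketch of the core idea — checking whether a pattern occurs in a string — on a made-up example:

```python
import re

# Does a date in YYYY-MM-DD form appear anywhere in this string?
text = "Order #4521 shipped on 2020-07-15"
match = re.search(r"\d{4}-\d{2}-\d{2}", text)

if match:
    print(match.group())  # 2020-07-15
```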

RegEx in the Context of Web Scraping

Web pages are often full of text, and while BeautifulSoup and XML parsers offer methods for extracting raw text, they provide nothing for the intelligent analysis of that text. If, as a data wrangler, you are looking for a particular piece of data (for example, email IDs or phone numbers in a specific format), you would otherwise have to do a lot of string manipulation on a large corpus to extract it. RegEx is very powerful and can save a data wrangling professional a lot of time and effort with string manipulation, because it can search for complex textual patterns with wildcards of arbitrary length.
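As a taste of this, the sketch below pulls email-like strings out of a made-up messy corpus. Note that this simple pattern covers common addresses only; it is not the full email grammar defined by RFC 5322:

```python
import re

# A messy text blob of the kind scraped from a web page (made up).
corpus = "Contact alice@example.com or, for billing, bob.smith@mail.co.uk."

# Find all substrings that look like everyday email addresses.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", corpus)
print(emails)  # ['alice@example.com', 'bob.smith@mail.co.uk']
```

Doing the same with plain string methods would take many lines of splitting and checking; the pattern expresses the whole search in one place.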

RegEx is like a mini-programming...

Summary

In this chapter, we went through several important concepts and learning modules related to advanced data gathering and web scraping. We started by reading data from web pages using two of the most popular Python libraries – requests and BeautifulSoup. In this task, we utilized the knowledge we gained in the previous chapter about the general structure of HTML pages and their interaction with Python code. We extracted meaningful data from the Wikipedia home page during this process.

Then, we learned how to read data from XML and JSON files – two of the most widely used data streaming/exchange formats on the web. For XML, we showed you how to traverse the tree-structure data string efficiently to extract key information. For JSON, we mixed it with reading data from the web using an API. The API we consumed was RESTful, which is one of the major standards in web APIs.

At the end of this chapter, we went through a detailed exercise using regex techniques in...

