You're reading from The Data Wrangling Workshop - Second Edition

Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839215001
Edition: 2nd
Authors (3):

Brian Lipp

Brian Lipp is a technology polyglot, engineer, and solution architect with a wide skillset across many technology domains. His programming background ranges from R, Python, and Scala to Go and Rust. He has worked on big data systems, data lakes, data warehouses, and backend software engineering. Brian earned a Master of Science in CSIS from Pace University in 2009. He is currently a senior data engineer working with large tech firms to build data ecosystems.
Read more about Brian Lipp

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.
Read more about Shubhadeep Roychowdhury

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.
Read more about Dr. Tirthajyoti Sarkar


7. Advanced Web Scraping and Data Gathering

Overview

This chapter will introduce you to the concepts of advanced web scraping and data gathering. It will enable you to use requests and BeautifulSoup to read various web pages and gather data from them. You will perform read operations on XML files and on the web using an Application Programming Interface (API), and you will use regex techniques to scrape useful information from a large and messy text corpus. By the end of this chapter, you will have learned how to gather data from web pages, XML files, and APIs.

Introduction

The previous chapter covered how to create a successful data wrangling pipeline. In this chapter, we will use all of the techniques we have learned so far to build a web scraper that a data wrangling professional could use in their daily tasks. This chapter builds on the foundation of BeautifulSoup and introduces various methods for scraping a web page and using an API to gather data.

In today's connected world, one of the most valued and widely used skills for a data wrangling professional is the ability to extract and read data from web pages and databases hosted on the web. Most organizations host data on the cloud (public or private), and the majority of web microservices these days provide some kind of API for external users to access data. Let's take a look at the following diagram:

Figure 7.1: Data wrangling HTTP request and an XML/JSON reply

As we can see in the diagram, to fetch data from a web server or a database...

The Requests and BeautifulSoup Libraries

We will take advantage of two Python libraries in this chapter: requests and BeautifulSoup. To avoid dealing with HTTP methods at a lower level, we will use the requests library. It is an API built on top of pure Python web utility libraries, which makes placing HTTP requests easy and intuitive.
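As a small illustration of how requests abstracts the low-level HTTP details, the sketch below builds a GET request without sending it (the query parameter is a hypothetical example, not from the book). Actually fetching a page is a single `requests.get` call, which needs network access:

```python
import requests

# Build a GET request; requests percent-encodes the parameters and
# assembles the full URL for you. (The search term is made up.)
req = requests.Request(
    "GET",
    "https://en.wikipedia.org/w/index.php",
    params={"search": "data wrangling"},
)
prepared = req.prepare()
print(prepared.method)  # GET
print(prepared.url)     # full URL with the encoded query string

# Sending a request is one call (requires network access):
# response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
# print(response.status_code)  # 200 on success
```

The `Response` object returned by `requests.get` exposes the status code, headers, and the decoded body (`response.text`), which is what we will hand to the parser next.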

BeautifulSoup is one of the most popular HTML parser packages. It parses the HTML content you pass in and builds a detailed tree of all the tags and markup within the page for easy and intuitive traversal. A programmer can use this tree to look for certain markup elements (for example, a table, a hyperlink, or a blob of text within a particular div ID) and scrape useful data from them.
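To see the tree traversal in action, here is a minimal sketch on a toy HTML snippet (the markup and URLs are made up for illustration):

```python
from bs4 import BeautifulSoup

# A toy page: two hyperlinks inside a div, plus a paragraph.
html = """
<html><body>
  <div id="links">
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </div>
  <p class="note">Some text</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Find every hyperlink and pull out its href attribute.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)

# Look up an element by tag and class, then read its text content.
note = soup.find("p", class_="note")
print(note.get_text())
```

The same `find`/`find_all` calls work identically on a real page fetched with requests; only the input string changes.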

We are going to do a couple of exercises in order to demonstrate how to use the requests library and decode the contents of the response received when data is fetched from the server.

Exercise 7.01: Using the Requests Library to Get a Response from the Wikipedia Home...

Reading Data from XML

XML, or Extensible Markup Language, is a web markup language that is similar to HTML but with significant flexibility built in on the part of the user, such as the ability to define your own tags. It was one of the most hyped technologies of the 1990s and early 2000s. It is a meta-language: a language that allows us to define other languages using its mechanics, such as RSS and MathML (a mathematical markup language widely used for web publication and the display of math-heavy technical information). XML is also heavily used in regular data exchange over the web, and as a data wrangling professional, you should be familiar enough with its basic features to tap into the data flow pipeline whenever you need to extract data for your project.
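Python's standard library can parse XML without any extra installs. The following sketch, using made-up data, shows the basic pattern of parsing an XML string into a tree and walking its elements with `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

# A small XML document with invented records.
xml_data = """
<records>
  <person id="1"><name>Alice</name><city>Paris</city></person>
  <person id="2"><name>Bob</name><city>Pune</city></person>
</records>
"""
root = ET.fromstring(xml_data)
print(root.tag)  # records

# Walk the child elements, reading tags, attributes, and text.
names = [p.find("name").text for p in root.findall("person")]
print(names)  # ['Alice', 'Bob']
```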

Exercise 7.07: Creating an XML File and Reading XML Element Objects

In this exercise, we'll create some random data and store it in XML format. We'll then read from the XML file and examine...
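The full exercise steps are not shown here, but the round trip it describes — writing data out as XML and reading element objects back — can be sketched as follows. The element names and values below are hypothetical, not the data used in the exercise:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small tree with made-up data.
root = ET.Element("students")
for name, score in [("Rina", 88), ("Tom", 73)]:
    student = ET.SubElement(root, "student", name=name)
    student.text = str(score)

# Write the tree to an XML file.
path = os.path.join(tempfile.gettempdir(), "students.xml")
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Read the file back and recover the data from the element objects.
loaded = ET.parse(path).getroot()
scores = {s.get("name"): int(s.text) for s in loaded}
print(scores)  # {'Rina': 88, 'Tom': 73}
```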

Reading Data from an API

Fundamentally, an API, or Application Programming Interface, is an interface to a computing resource (for example, an operating system or a database table) that exposes a set of methods (function calls) allowing a programmer to access particular data or internal features of that resource.

A web API is, as the name suggests, an API over the web. Note that it is not a specific technology or programming framework but an architectural concept. Think of an API as a fast-food restaurant's service counter: internally, there are many food items, raw materials, cooking resources, and recipe management systems, but all you see are the fixed menu items on the board, and you can only interact through those items. A web API is like a port that can be accessed over the HTTP protocol and that delivers data and services when used properly.
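In practice, consuming a web API usually means sending an HTTP request and decoding a JSON reply into Python objects. A minimal sketch of the decoding step, using a made-up reply (with requests, `response.json()` performs this same decode on a live response):

```python
import json

# A hypothetical JSON reply, as a web API might return it.
raw_reply = '{"name": "France", "capital": "Paris", "population": 67000000}'

# json.loads turns the JSON text into ordinary Python dicts/lists.
data = json.loads(raw_reply)
print(data["capital"])      # Paris
print(data["population"])   # 67000000
```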

Web APIs are extremely popular these days for all kinds of data services. In the very first chapter, we talked...

Fundamentals of Regular Expressions (RegEx)

Regular expressions, or regex, are used to identify whether a pattern exists in a given sequence of characters (a string). They help with manipulating textual data, which is often a prerequisite for data science projects that involve text mining.
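Python ships regex support in the standard `re` module. A minimal sketch of the core idea — checking whether a pattern occurs in a string — on a made-up example:

```python
import re

# Does a date in YYYY-MM-DD form appear anywhere in this string?
text = "Order #4521 shipped on 2020-07-15"
match = re.search(r"\d{4}-\d{2}-\d{2}", text)

if match:
    print(match.group())  # 2020-07-15
```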

RegEx in the Context of Web Scraping

Web pages are often full of text, and while BeautifulSoup and XML parsers offer methods for extracting raw text, they provide nothing for the intelligent analysis of that text. If, as a data wrangler, you are looking for a particular piece of data (for example, email IDs or phone numbers in a specific format), you would otherwise have to do a lot of string manipulation on a large corpus to extract it. RegEx is very powerful and can save a data wrangling professional a lot of time and effort with string manipulation, because it can search for complex textual patterns with wildcards of arbitrary length.
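As a taste of this, the sketch below pulls email-like strings out of a made-up messy corpus. Note that this simple pattern covers common addresses only; it is not the full email grammar defined by RFC 5322:

```python
import re

# A messy text blob of the kind scraped from a web page (made up).
corpus = "Contact alice@example.com or, for billing, bob.smith@mail.co.uk."

# Find all substrings that look like everyday email addresses.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", corpus)
print(emails)  # ['alice@example.com', 'bob.smith@mail.co.uk']
```

Doing the same with plain string methods would take many lines of splitting and checking; the pattern expresses the whole search in one place.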

RegEx is like a mini-programming...

Summary

In this chapter, we went through several important concepts and learning modules related to advanced data gathering and web scraping. We started by reading data from web pages using two of the most popular Python libraries – requests and BeautifulSoup. In this task, we utilized the knowledge we gained in the previous chapter about the general structure of HTML pages and their interaction with Python code. We extracted meaningful data from the Wikipedia home page during this process.

Then, we learned how to read data from XML and JSON files – two of the most widely used data streaming/exchange formats on the web. For XML, we showed you how to traverse the tree-structure data string efficiently to extract key information. For JSON, we mixed it with reading data from the web using an API. The API we consumed was RESTful, which is one of the major standards in web APIs.

At the end of this chapter, we went through a detailed exercise using regex techniques in...

